Why you should care about I/O

“First, you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!”

Source: R for Data Science

Also I/O

Source: https://xkcd.com/2347/

rio

Data import and export

## data i/o, circa pre 2013
write.csv(iris, "iris.csv")
saveRDS(iris, "iris.rds")
save(iris, file = "iris.Rdata")
x <- read.csv("iris.csv")
x <- readRDS("iris.rds")
load("iris.Rdata")

## data i/o after rio
library(rio)
export(iris, "iris.csv")
export(iris, "iris.rds")
export(iris, "iris.Rdata")
x <- import("iris.csv")
x <- import("iris.rds")
x <- import("iris.Rdata")
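The idea behind one `export()` / `import()` pair is that the file extension picks the reader or writer. The following is a minimal sketch of that idea in base R only. `export_sketch()` and `import_sketch()` are hypothetical names made up for illustration; this is not how rio itself is implemented, and it handles only two formats.

```r
# Hypothetical helpers illustrating extension-based dispatch.
# NOT rio's implementation; only base R functions are used.
export_sketch <- function(x, file) {
  ext <- tolower(tools::file_ext(file))
  switch(ext,
    csv = utils::write.csv(x, file, row.names = FALSE),
    rds = saveRDS(x, file),
    stop("Unsupported format: ", ext)
  )
  invisible(file)
}

import_sketch <- function(file) {
  ext <- tolower(tools::file_ext(file))
  switch(ext,
    csv = utils::read.csv(file),
    rds = readRDS(file),
    stop("Unsupported format: ", ext)
  )
}

# Round trip through temporary files
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
export_sketch(iris, csv_path)
export_sketch(iris, rds_path)
x_csv <- import_sketch(csv_path)
x_rds <- import_sketch(rds_path)
```

The RDS round trip preserves the factor column, while the CSV round trip depends on how the reader guesses column types.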

rio development, the human part

  • 2013 Aug: First version by me
  • 2015 Feb: Transfer maintainership to Thomas Leeper
  • 2015 Mar: Add support for ODS (OpenDocument Spreadsheet)
  • Circa 2019 “Unmaintained state”
  • 2023 Aug: Maintained collectively by the GESIS TSA Team
  • 2023 Sep: rio 1.0.0

OpenDocument Spreadsheet

  • Truly open format (ISO/IEC 26300) with wide adoption: NATO, EU, many governments
  • Technology: Zipped XML file (copied by xlsx)
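Because an ODS file is a zip archive of XML files, its parts can be listed with base R alone. A minimal sketch follows; `list_ods_parts()` is a hypothetical helper name and "example.ods" is a placeholder file name, not a file from this talk.

```r
# Hypothetical helper: list the files packed inside an ODS (zip) archive.
list_ods_parts <- function(path) {
  utils::unzip(path, list = TRUE)$Name
}

# Usage, assuming a file called "example.ods" exists (placeholder name):
# list_ods_parts("example.ods")
# The cell data of a spreadsheet normally lives in the "content.xml" entry.
```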

Example: UK Gov

readODS

  • Created by Gerrit-Jan Schutten in 2014, maintained by me since 2016
  • Under rOpenSci since 2022-06-24
  • Before 2023: written in pure R and in a bad state because of poor performance

readODS issue 49 / 71

readODS issue 71

  • 70 MB ODS -> 1.3GB XML
  • Also not working: Google Sheets, Python odfpy (and its Julia wrapper, OdsIO.jl), JS SheetJS

Projekt 71

“I’ll devote my 2023 to the project I tentatively called ‘Projekt 71’. The idea is simple: I want to have a way that can read the aforementioned ‘jts0501.ods’ directly as an R data frame without memory issues; but yet pass at least 80% of the current unit tests of readODS. So, I am embarking on solving just one Github issue of readODS. I will put other of my R packages into maintenance mode and focus only on this.”

My Projekt 71 manifesto

Thanks to our unsung heroes

  • In July 2023, Peter Brohan (UK Housing Department) rewrote readODS::read_ods() in C++ (RapidXML)
  • Detlef Steuer (HHU) proposed a speedy solution to write XML using R. I rewrote his solution in C++
  • Both issues fixed. “jts0501.ods” is readable
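For reference, a small round trip with the current readODS API might look like the sketch below. It assumes the readODS package is installed and uses the built-in `iris` data and a temporary file rather than “jts0501.ods”, which is not bundled here.

```r
# Assumes the readODS package is installed; skipped otherwise.
if (requireNamespace("readODS", quietly = TRUE)) {
  ods_path <- tempfile(fileext = ".ods")
  readODS::write_ods(iris, ods_path)   # write a data frame to an ODS file
  x <- readODS::read_ods(ods_path)     # read the first sheet back
  print(dim(x))
}
```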

Benchmark

Reading speed: 5539 x 11

Writing speed: 3000 x 8

So, are we (random persons from Nebraska) done yet?

“Software, unlike papers or grants, is never done.”

Free as in mummies

“A good start needs enthusiasm, a good finish needs discipline.”

The DFB's motto for the 2014 World Cup

These are not random persons

  • Thomas Leeper (for maintaining rio from 2015 to 2023)
  • Gerrit-Jan Schutten (for creating readODS)
  • Peter Brohan (for the C++ implementation of read_ods)
  • Detlef Steuer (for the quick XML generation algorithm)
  • Jason Becker (for the sustained contribution to rio)
  • David Schoch (for the contribution to rio and leading the GESIS TSA Team)
  • Hadley Wickham, Jennifer Bryan (for allowing me to fork some code from readr)
  • rOpenSci (for managing the infrastructure to support the development of readODS)
  • All rio and readODS contributors and users
  • All developers / maintainers of the I/O infrastructure
  • R Core

A more complete version of this presentation

rio in real world

rio in real world 2

rio in real world 3

  • Used by multiple organizations: WHO, Médecins Sans Frontières etc.
  • In data science education

Issues

  • Type guessing: not 100% the same as readxl's; working on minty
  • Speed vs readxl (and other formats)
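To make the type-guessing issue concrete: a reader receives text and has to decide which R type each column should get. The sketch below uses base R's `utils::type.convert()` as a stand-in guesser. It is not the guesser used by rio, readxl, or minty, and the `raw` data frame is made up for illustration.

```r
# Every column starts out as character, as it would coming from a text file.
raw <- data.frame(
  a = c("1", "2", "3"),
  b = c("1.5", "2", "NA"),
  c = c("TRUE", "FALSE", "TRUE"),
  d = c("x", "2", "3"),
  stringsAsFactors = FALSE
)

# Base R's guesser; as.is = TRUE keeps strings as character, not factor.
guessed <- utils::type.convert(raw, as.is = TRUE)
sapply(guessed, class)
```

Two guessers can disagree on a column such as `d`, where a single non-numeric cell decides the type of the whole column, which is why results may differ between packages.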