10 years of rio and readODS

Maintaining an I/O infrastructure of R

A random person in Mannheim

2025-02-27

Why you should care about I/O

“First, you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!”

Source: R for Data Science

Also I/O

Source: https://xkcd.com/2347/

`rio`

Data import and export

## data i/o, circa pre 2013
write.csv(iris, "iris.csv")
saveRDS(iris, "iris.rds")
save(iris, file = "iris.Rdata")
x <- read.csv("iris.csv")
x <- readRDS("iris.rds")
load("iris.Rdata")

## data i/o after rio
library(rio)
export(iris, "iris.csv")
export(iris, "iris.rds")
export(iris, "iris.Rdata")
x <- import("iris.csv")
x <- import("iris.rds")
x <- import("iris.Rdata")

`rio` development, the human part

2013 Aug: First version by me
2015 Feb: Maintainership transferred to Thomas Leeper
2015 Mar: Support for ODS (OpenDocument Spreadsheet) added
Circa 2019: “Unmaintained state”
2023 Aug: Maintained collectively by the GESIS TSA Team
2023 Sep: rio 1.0.0

OpenDocument Spreadsheet

Truly open format (ISO/IEC 26300) with wide adoption: NATO, EU, many governments
Technology: Zipped XML file (copied by xlsx)
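
To make the "zipped XML" point concrete, here is a minimal sketch that writes a small ODS file and lists what is inside the archive. It assumes the readODS package is installed; the temporary file and the use of `iris` are only for illustration.

```r
library(readODS)

# Write a small data frame to a temporary .ods file (illustrative data only)
ods_path <- tempfile(fileext = ".ods")
write_ods(iris, ods_path)

# An .ods file is a ZIP archive; list its members without extracting them
members <- utils::unzip(ods_path, list = TRUE)
print(members[, c("Name", "Length")])

# The cell data lives in content.xml; "Length" is its uncompressed size
content_size <- members$Length[members$Name == "content.xml"]
cat("ODS on disk:             ", file.size(ods_path), "bytes\n")
cat("content.xml uncompressed:", content_size, "bytes\n")
```

The gap between the two sizes is what makes large ODS files hard to read: a parser that loads the whole XML document has to deal with the uncompressed size, not the size on disk.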

Example: UK Gov

`readODS`

Created by Gerrit-Jan Schutten in 2014, maintained by me since 2016
Under rOpenSci since 2022-06-24
Before 2023: written in pure R and in a bad state due to poor performance
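
For orientation, a minimal sketch of a write/read round trip with a recent version of readODS. The sheet name `"cars"` and the use of `mtcars` are assumptions made for the example, not part of the talk.

```r
library(readODS)

# Write six rows of a built-in data set to a temporary .ods file
ods_path <- tempfile(fileext = ".ods")
write_ods(head(mtcars), ods_path, sheet = "cars")

# Sheet names can be listed without reading any cells
sheets <- list_ods_sheets(ods_path)
print(sheets)

# Read the sheet back; the first row is used as column names by default
cars <- read_ods(ods_path, sheet = "cars")
print(dim(cars))
```

Row names are not written by default, so the model names of `mtcars` do not survive this round trip.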

`readODS` issue 49 / 71

`readODS` issue 71

70 MB ODS -> 1.3 GB XML
Also not working: Google Sheets, Python odfpy (and its Julia wrapper, OdsIO.jl), JS SheetJS

Projekt 71

“I’ll devote my 2023 to the project I tentatively called “Projekt 71”. The idea is simple: I want to have a way that can read the aforementioned “jts0501.ods” directly as an R data frame without memory issues; but yet pass at least 80% of the current unit tests of readODS. So, I am embarking on solving just one Github issue of readODS. I will put other of my R packages into maintenance mode and focus only on this.”

My Projekt 71 manifesto

Thanks to our unsung heroes

In July 2023, Peter Brohan (UK Housing Department) rewrote readODS::read_ods() in C++ (RapidXML)
Detlef Steuer (HHU) proposed a speedy solution to write XML using R. I rewrote his solution in C++
Both issues fixed. “jts0501.ods” is readable

Benchmark

Reading speed: 5539 x 11

Writing speed: 3000 x 8

So, are we (random persons from Nebraska) done yet?

“Software, unlike papers or grants, is never done.”

Free as in mummies

“A good start needs enthusiasm, a good finish needs discipline.”

The DFB's motto for the 2014 World Cup

These are not random persons

Thomas Leeper (for maintaining rio from 2015 to 2023)
Gerrit-Jan Schutten (for creating readODS)
Peter Brohan (for the C++ implementation of read_ods)
Detlef Steuer (for the quick XML generation algorithm)
Jason Becker (for the sustained contribution to rio)
David Schoch (for the contribution to rio and leading the GESIS TSA Team)
Hadley Wickham, Jennifer Bryan (for allowing me to fork some code from readr)
rOpenSci (for managing the infrastructure to support the development of readODS)
All rio and readODS contributors and users
All developers / maintainers of the I/O infrastructure
R Core

A more complete version of this presentation

`rio` in the real world

`rio` in the real world 2

`rio` in the real world 3

Used by multiple organizations: WHO, Médecins Sans Frontières etc.
In data science education

Issues

Type guessing: not 100% the same as readxl's; working on minty
Speed vs readxl (and other formats)
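
To illustrate what "type guessing" means here, a minimal sketch using base R's `utils::type.convert()` as a stand-in. This is not the algorithm used by rio, readxl, or minty; it only shows the general problem of turning text cells into typed columns. The data frame `raw` is made up for the example.

```r
# Cells as they might arrive from a spreadsheet reader: everything is text
raw <- data.frame(
  id    = c("1", "2", "3"),
  score = c("1.5", "2", "NA"),
  label = c("a", "b", "c"),
  mixed = c("1", "two", "3"),
  stringsAsFactors = FALSE
)

# Guess one type per column; as.is = TRUE keeps non-numeric text as character
guessed <- utils::type.convert(raw, as.is = TRUE)

# Report the guessed class of each column
col_types <- vapply(guessed, function(col) class(col)[1], character(1))
print(col_types)
```

Two guessers can reasonably disagree on columns like `mixed` or on how to treat the string "NA", which is why results from different readers are not always identical.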

Benchmark of R file formats using rio and friends