10 years of rio and readODS

Maintaining the I/O infrastructure of R

Chung-hong Chan

GESIS

2024-07-10

Why you should care about I/O

“First, you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!”

Also I/O

Source: https://xkcd.com/2347/

Pop quiz: Meet the Nebraskans

Who maintains these packages?

haven
readr, readxl
writexl, jsonlite
data.table
yaml
openxlsx
foreign

Pop quiz: Meet the Nebraskans

Who maintains these packages?

haven: Hadley Wickham
readr, readxl: Jennifer Bryan
writexl, jsonlite: Jeroen Ooms
data.table: Tyson Barrett
yaml: Shawn Garbett
openxlsx: 🇦🇹 Philipp Schauberger
foreign: R Core Team

Hello from Nebraska Mannheim!

Before 2013, data import and export

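library(foreign)  # provides read.spss()

write.csv(iris, "iris.csv")
saveRDS(iris, "iris.rds")
save(iris, file = "iris.Rdata")  # `file` must be named, otherwise the string is taken as an object name
# 2013: No way to write to SPSS

x <- read.csv("iris.csv")
x <- readRDS("iris.rds")
x <- read.spss("iris.sav")
load("iris.Rdata")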

Wickham (2010)

## From this
text <- "she dances on the sand"
grepl("sand$", text)
strsplit(text, "[dD]ances?")
## To this
library(stringr)
str_detect(text, "sand$")
str_split(text, "[dD]ances?")

`rio`, since 2013

library(rio)
export(iris, "iris.csv")
export(iris, "iris.rds")
export(iris, "iris.sav")

x <- import("iris.csv")
x <- import("iris.rds")
x <- import("iris.sav")
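A minimal sketch of the same idea as a round trip, assuming rio is installed. The temporary file paths are illustrative; `convert()` is rio's shortcut for importing one file and exporting it to another format.

```r
library(rio)

# rio picks the writer from the file extension.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
export(iris, csv_path)
export(iris, rds_path)

# Reading works the same way: one function, format inferred from the name.
from_csv <- import(csv_path)
from_rds <- import(rds_path)

# convert() chains import() and export() to translate between formats.
tsv_path <- tempfile(fileext = ".tsv")
convert(csv_path, tsv_path)
from_tsv <- import(tsv_path)

dim(from_tsv)
```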

`rio` version 0.1.1 2013-08-26 14:02 CEST

import <- function(file="", format=NULL, header=TRUE, ... ) {
  format <- .guess(file, format)
  x <- switch(format,
              txt=read.table(file=file, sep="\t", header=header, ...), ##tab-seperate txt file
              rds=readRDS(file=file, ...),
              csv=read.csv(file=file, ...),
              dta=read.dta(file=file, ...),
              sav=read.spss(file=file,to.data.frame=TRUE, ...),
              mtp=read.mtp(file=file, ...),
              rec=read.epiinfo(file=file, ...),
              stop("Unknown file format")
              )
  return(x)
}
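The `.guess()` helper is not shown on the slide. A minimal sketch of what such a helper could look like, using only base R, is given here; `guess_format()` is a hypothetical stand-in and not the original rio 0.1.1 source.

```r
# Hypothetical stand-in for rio 0.1.1's internal .guess(); not the original code.
guess_format <- function(file, format = NULL) {
  # An explicitly supplied format always wins over the file name.
  if (!is.null(format)) {
    return(tolower(format))
  }
  # Otherwise fall back to the (lower-cased) file extension.
  ext <- tolower(tools::file_ext(file))
  if (!nzchar(ext)) {
    stop("Cannot guess the format of a file without an extension: ", file)
  }
  ext
}

guess_format("iris.csv")
guess_format("data/IRIS.SAV")
guess_format("iris.dat", format = "txt")
```

The returned string is what a `switch()` like the one above would branch on.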

`rio` development

2015 Feb: Transfer maintainership to Thomas Leeper
2015 Mar: Add support for ODS (OpenDocument Spreadsheet)
2016 Jan: Use S3 class (no longer switch()) by Jason Becker
2016 - 2023: Add many supported formats
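The move from `switch()` to S3 classes can be sketched in base R as follows. All function and class names here (`import_sketch()`, `import_delegate()`, `fmt_*`) are hypothetical and only mirror the idea; they are not rio's actual internals.

```r
# Hypothetical names throughout; this mirrors the idea, not rio's real code.
import_sketch <- function(file, ...) {
  ext <- tolower(tools::file_ext(file))
  # Tag the path with a class derived from the extension, then dispatch on it.
  tagged <- structure(file, class = c(paste0("fmt_", ext), "character"))
  import_delegate(tagged, ...)
}

import_delegate <- function(file, ...) UseMethod("import_delegate")

# One method per format; supporting a new format means adding a new method.
import_delegate.fmt_csv <- function(file, ...) read.csv(unclass(file), ...)
import_delegate.fmt_rds <- function(file, ...) readRDS(unclass(file), ...)
import_delegate.default <- function(file, ...) {
  stop("Unknown file format: ", tools::file_ext(unclass(file)))
}

path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)
head(import_sketch(path))
```

Compared with a central `switch()`, each format lives in its own method, so contributors can add one without touching the dispatcher.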

Supported formats (partial list)

Full list

`rio` development

2023 Aug: 10 years
2023 Aug: Maintained collectively by the GESIS Transparent Social Analytics Team
2023 Sep: rio 1.0.0

`rio` in real world 1

OpenDocument Spreadsheet

Truly open format (ISO/IEC 26300)
LibreOffice, Google Sheets, Microsoft Office etc.
Technology: Zipped XML file (like xlsx)
Adoption: NATO, EU, and many governments
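To see the "zipped XML" point concretely, here is a small sketch that writes an ODS file with readODS (assumed to be installed) and lists the archive members with base R's `unzip()`. The temporary path is illustrative.

```r
library(readODS)

ods_path <- tempfile(fileext = ".ods")
write_ods(iris, ods_path)

# An .ods file is a zip archive; list its members without extracting them.
members <- unzip(ods_path, list = TRUE)
members$Name
```

The cell data is stored in the `content.xml` member, which is the XML that a reader has to parse.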

Example: UK Gov

`readODS`

Created by Gerrit-Jan Schutten in 2014; maintained by me since 2016
Peer-reviewed and accepted by rOpenSci on 2022-06-24
Emulates readxl::read_excel() and writexl::write_xlsx()
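A short usage sketch, assuming readODS 2.x is installed; the sheet name and temporary path are made up for illustration.

```r
library(readODS)

ods_path <- tempfile(fileext = ".ods")

# Counterpart of writexl::write_xlsx(x, path)
write_ods(mtcars, ods_path, sheet = "cars")

# Counterpart of readxl::read_excel(path, sheet = ...)
cars <- read_ods(ods_path, sheet = "cars")

# Sheet names in the file
list_ods_sheets(ods_path)
```

Row names are not written by default, so `cars` comes back with the 11 data columns of `mtcars`.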

`readODS` prior to 2.0.0

Based on XML, and later xml2, in pure R
Was in a bad state because of poor performance

`readODS` issue 49

`readODS` issue 71

70 MB ODS -> 1.3GB XML

Also not working: Google Sheets, Python odfpy (and Julia’s wrapper, OdsIO.jl), JS SheetJS

Working but

LibreOffice ~ 15s
Export as … first by LibreOffice and then
- CSV via data.table::fread() - 1s
- XLSX via readxl::read_excel() - 2s

Projekt 71

“I’ll devote my 2023 to the project I tentatively called “Projekt 71”. The idea is simple: I want to have a way that can read the aforementioned “jts0501.ods” directly as an R data frame without memory issues; but yet pass at least 80% of the current unit tests of readODS. So, I am embarking on solving just one Github issue of readODS. I will put other of my R packages into maintenance mode and focus only on this.”

My Projekt 71 manifesto

Peter Brohan is our hero

In July 2023, Peter rewrote readODS::read_ods() in C++ (RapidXML) - super speed improvement
I switched to tackle issue 49 (Writing speed)

Detlef Steuer is also our hero

Detlef proposed a speedy solution to write XML using base R without xml2
I rewrote his solution in C++

Benchmark I

Reading speed: 5539 x 11

Benchmark II

Writing speed: 3000 x 8

In summary

Issue 71: “jts0501.ods” is readable on a modest computer in 120s (was impossible)
Issue 49: “80000 x 9” can be written in < 2s (was > 30 mins)

So, are we done yet?

“Software, unlike papers or grants, is never done.”

Free as in mummies

Issues

Type guessing, not 100% the same as readxl - working on minty
Speed vs readxl (and other formats)
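Type guessing means deciding, from text cells alone, which R type each column should have. The sketch below uses base R's `utils::type.convert()` purely as a stand-in to show the problem; it does not reproduce the guessing rules of readODS, readxl or minty, and the example data are invented.

```r
# Spreadsheet cells arrive as text; a reader has to decide what each column is.
raw_cells <- data.frame(
  id     = c("1", "2", "3"),
  score  = c("0.5", "1.25", "NA"),
  passed = c("TRUE", "FALSE", "TRUE"),
  label  = c("a", "b", "10"),
  stringsAsFactors = FALSE
)

# Base R's guesser: tries logical, integer, double, then falls back to character.
guessed <- utils::type.convert(raw_cells, as.is = TRUE)
vapply(guessed, class, character(1))
```

Two guessers can disagree on edge cases such as the mixed `label` column, which is why matching another package's behaviour exactly is hard.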

`rio`-based comparison

Lahman::Batting (112,164 x 22)

Rank	Format	Export	Import	Size	Accuracy
1	feather	0.9	0.3	0.5	0
2	parquet	3.6	0.4	0.3	0
3	qs	1.7	0.7	0.2	2
4	csv	1	1	1	2
…
16	xlsx	141.4	36.3	1.3	21
…
23	fods	77.3	119.7	42.2	21
…
25	ods	258.2	253.7	0.8	21

Benchmark of R file formats using rio and friends
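In the table, csv scores 1 for export, import and size, which suggests those columns are expressed relative to csv. Under that assumption, a comparison of this kind can be sketched with base R alone. The helpers `elapsed()` and `bench_format()` are hypothetical, only csv and rds are compared, and this is not the benchmark code behind the table.

```r
# Time one expression and return the elapsed seconds.
elapsed <- function(expr) unname(system.time(expr)["elapsed"])

# Hypothetical helper: write x, read it back, and record timings and file size.
bench_format <- function(x, write_fun, read_fun, ext) {
  path <- tempfile(fileext = ext)
  on.exit(unlink(path))
  c(
    export = elapsed(write_fun(x, path)),
    import = elapsed(read_fun(path)),
    size   = file.size(path)
  )
}

# A larger data frame so the timings are not dominated by noise.
x <- iris[rep(seq_len(nrow(iris)), 2000), ]

results <- rbind(
  csv = bench_format(x, function(d, p) write.csv(d, p, row.names = FALSE),
                     read.csv, ".csv"),
  rds = bench_format(x, saveRDS, readRDS, ".rds")
)

# Divide every column by the csv row, so csv is 1 by construction.
relative <- sweep(results, 2, results["csv", ], "/")
round(relative, 2)
```

Timings vary between machines and runs, so a real benchmark would repeat each measurement several times.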

10 years of `rio` and `readODS`

“A good beginning requires enthusiasm, a good ending requires discipline.”

The motto of the German Football Association for the World Cup 2014

10 years of `rio` and `readODS`

“Ein guter Anfang braucht Begeisterung, ein gutes Ende Disziplin.”

das Motto des DFB für die WM 2014

Thank you:

Thomas Leeper (for maintaining rio from 2015 to 2023)
Gerrit-Jan Schutten (for creating readODS)
Peter Brohan (for the C++ implementation of read_ods)
Detlef Steuer (for the quick XML generation algorithm)
Jason Becker (for the sustained contribution to rio)
David Schoch (for the contribution to rio and leading the GESIS TSA Team)
Hadley Wickham, Jennifer Bryan (for allowing me to fork some code from readr)
rOpenSci (for managing the infrastructure to support the development of readODS)
All developers / maintainers of the I/O infrastructure
All rio and readODS contributors and users.

More about me: https://www.chainsawriot.com/

`rio` in real world 2

`rio` in real world 3

Used by multiple organizations: WHO, Médecins Sans Frontières etc.
In data science education
