10 years of rio and readODS

Maintaining the I/O infrastructure of R

Chung-hong Chan

GESIS

2024-07-10

Why you should care about I/O

Source: R for Data Science

“First, you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!”

Also I/O

Source: https://xkcd.com/2347/

Pop quiz: Meet the Nebraskans

Who maintains these packages?

  • haven
  • readr, readxl
  • writexl, jsonlite
  • data.table
  • yaml
  • openxlsx
  • foreign

Pop quiz: Meet the Nebraskans

Who maintains these packages?

  • haven: Hadley Wickham
  • readr, readxl: Jennifer Bryan
  • writexl, jsonlite: Jeroen Ooms
  • data.table: Tyson Barrett
  • yaml: Shawn Garbett
  • openxlsx: 🇦🇹 Philipp Schauberger
  • foreign: R Core Team

Hello from Nebraska Mannheim!

Before 2013, data import and export

write.csv(iris, "iris.csv")
saveRDS(iris, "iris.rds")
save(iris, file = "iris.Rdata")
# 2013: No way to write to spss

x <- read.csv("iris.csv")
x <- readRDS("iris.rds")
x <- foreign::read.spss("iris.sav", to.data.frame = TRUE)
load("iris.Rdata")

Wickham (2010)

## From this
text <- "she dances on the sand"
grepl("sand$", text)
strsplit(text, "[dD]ances?")

## To this
library(stringr)
str_detect(text, "sand$")
str_split(text, "[dD]ances?")

rio, since 2013

library(rio)
export(iris, "iris.csv")
export(iris, "iris.rds")
export(iris, "iris.sav")

x <- import("iris.csv")
x <- import("iris.rds")
x <- import("iris.sav")

rio version 0.1.1 2013-08-26 14:02 CEST

import <- function(file="", format=NULL, header=TRUE, ... ) {
  format <- .guess(file, format)
  x <- switch(format,
              txt=read.table(file=file, sep="\t", header=header, ...), ##tab-seperate txt file
              rds=readRDS(file=file, ...),
              csv=read.csv(file=file, ...),
              dta=read.dta(file=file, ...),
              sav=read.spss(file=file,to.data.frame=TRUE, ...),
              mtp=read.mtp(file=file, ...),
              rec=read.epiinfo(file=file, ...),
              stop("Unknown file format")
              )
  return(x)
}
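The slide does not show the `.guess()` helper that `import()` calls. As a rough sketch only, assuming that it mainly relied on the file extension, a stand-in could look like the following. The name `guess_format()` and its behavior are hypothetical and are not taken from the rio source.

```r
# Hypothetical stand-in for the internal .guess() helper (an assumption,
# not the original rio 0.1.1 code).
guess_format <- function(file, format = NULL) {
  # An explicitly supplied format always wins.
  if (!is.null(format)) {
    return(tolower(format))
  }
  # Otherwise fall back to the lower-cased file extension.
  ext <- tolower(tools::file_ext(file))
  if (!nzchar(ext)) {
    stop("Cannot guess the format: the file name has no extension")
  }
  ext
}

guess_format("iris.csv")
guess_format("export.txt", format = "CSV")
```

The returned string is what a `switch()` like the one above would then branch on.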

rio development

  • 2015 Feb: Transfer maintainership to Thomas Leeper
  • 2015 Mar: Add support for ODS (OpenDocument Spreadsheet)
  • 2016 Jan: Use S3 class (no longer switch()) by Jason Becker
  • 2016 - 2023: Add many supported formats
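The move from `switch()` to S3 classes can be illustrated with a toy example. This is a minimal sketch of the general technique and not rio's actual code: the generic `import_file()`, the wrapper `import_sketch()` and the `fmt_*` class names are all made up for illustration.

```r
# Toy S3 generic: one method per file format instead of one big switch().
import_file <- function(file, ...) UseMethod("import_file")

# Each format gets its own method; unclass() recovers the plain path.
import_file.fmt_csv <- function(file, ...) utils::read.csv(unclass(file), ...)
import_file.fmt_rds <- function(file, ...) readRDS(unclass(file), ...)

# Anything without a dedicated method ends up here.
import_file.default <- function(file, ...) stop("Unknown file format")

# Hypothetical front end: tag the path with a class derived from its
# extension (e.g. "fmt_csv") and let S3 dispatch pick the reader.
import_sketch <- function(file, ...) {
  ext <- tolower(tools::file_ext(file))
  import_file(structure(file, class = paste0("fmt_", ext)), ...)
}

tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
x <- import_sketch(tmp)
```

With this design, supporting a new format means adding one more method rather than editing a central function.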

Supported formats (partial list)

Full list

rio development

  • 2023 Aug: 10 years
  • 2023 Aug: Maintained collectively by the GESIS Transparent Social Analytics Team
  • 2023 Sep: rio 1.0.0

rio in real world 1

OpenDocument Spreadsheet

  • Truly open format (ISO/IEC 26300)
  • LibreOffice, Google Sheets, Microsoft Office etc.
  • Technology: Zipped XML file (like xlsx)
  • Adoption: NATO, EU, and many governments
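The "zipped XML" point can be checked from R. The sketch below assumes the readODS package is installed; it writes a small ODS file and lists the members of the resulting ZIP archive without extracting them.

```r
library(readODS)

# Write a small data frame to a temporary ODS file.
ods_file <- tempfile(fileext = ".ods")
write_ods(iris, ods_file)

# An ODS file is a ZIP archive; list its members without extracting them.
members <- utils::unzip(ods_file, list = TRUE)
members$Name
```

The sheet data lives in the `content.xml` member of the archive.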

Example: UK Gov

readODS

  • Created by Gerrit-Jan Schutten in 2014, maintained by me since 2016
  • Peer-reviewed and accepted by rOpenSci on 2022-06-24
  • Emulates readxl::read_excel() and writexl::write_xlsx()
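As a short usage sketch, assuming a recent readODS (2.x) is installed, a round trip looks much like its readxl and writexl counterparts. The sheet name "cars" is an arbitrary choice for this example.

```r
library(readODS)

path <- tempfile(fileext = ".ods")

# Write a data frame to a named sheet (row names are not written by default).
write_ods(mtcars, path, sheet = "cars")

# Read it back, selecting the sheet by name.
x <- read_ods(path, sheet = "cars")
dim(x)
```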

readODS prior to 2.0.0

  • Based on the XML package, and later xml2, in pure R
  • Was in a bad state because of poor performance

readODS issue 49

readODS issue 71

  • 70 MB ODS -> 1.3 GB XML
  • Also not working: Google Sheets, Python odfpy (and its Julia wrapper, OdsIO.jl), JS SheetJS

Working but

  • LibreOffice ~ 15s
  • Export as … first by LibreOffice and then
    • CSV via data.table::fread() - 1s
    • XLSX via readxl::read_excel() - 2s

Projekt 71

“I’ll devote my 2023 to the project I tentatively called ‘Projekt 71’. The idea is simple: I want to have a way that can read the aforementioned ‘jts0501.ods’ directly as an R data frame without memory issues; but yet pass at least 80% of the current unit tests of readODS. So, I am embarking on solving just one Github issue of readODS. I will put other of my R packages into maintenance mode and focus only on this.”

My Projekt 71 manifesto

Peter Brohan is our hero

  • In July 2023, Peter rewrote readODS::read_ods() in C++ (RapidXML) - super speed improvement
  • I switched to tackle issue 49 (Writing speed)

Detlef Steuer is also our hero

  • Detlef proposed a speedy solution to write XML using base R without xml2
  • I rewrote his solution in C++

Benchmark I

Reading speed: 5539 x 11

Benchmark II

Writing speed: 3000 x 8

In summary

  • Issue 71: “jts0501.ods” is readable on a modest computer in 120 s (was impossible)
  • Issue 49: “80000 x 9” can be written in < 2s (was > 30 mins)
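For readers who want to try this on their own machine, here is a rough timing sketch, assuming readODS 2.x is installed. The helper `make_df()` is made up for this example, the 3000 x 8 size mirrors the writing benchmark above, and the timings will differ from the slides depending on hardware and package version.

```r
library(readODS)

# Hypothetical helper: a numeric data frame of the requested size.
make_df <- function(n_rows, n_cols) {
  as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows, ncol = n_cols))
}

df <- make_df(3000, 8)
path <- tempfile(fileext = ".ods")

# Elapsed wall-clock time in seconds for writing and reading.
write_time <- system.time(write_ods(df, path))[["elapsed"]]
read_time <- system.time(x <- read_ods(path))[["elapsed"]]

c(write = write_time, read = read_time)
```

Changing the call to `make_df(80000, 9)` gives a data frame of the size mentioned for issue 49.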

So, are we done yet?

“Software, unlike papers or grants, is never done.”

Free as in mummies

Issues

  • Type guessing, not 100% the same as readxl - working on minty
  • Speed vs readxl (and other formats)

rio-based comparison

Lahman::Batting (112,164 x 22)

Rank  Format  Export  Import  Size  Accuracy
4     csv     1       1       1     2

Benchmark of R file formats using rio and friends

rio-based comparison

Rank  Format   Export  Import  Size  Accuracy
1     feather  0.9     0.3     0.5   0
2     parquet  3.6     0.4     0.3   0
3     qs       1.7     0.7     0.2   2
4     csv      1       1       1     2
16    xlsx     141.4   36.3    1.3   21
23    fods     77.3    119.7   42.2  21
25    ods      258.2   253.7   0.8   21

Benchmark of R file formats using rio and friends
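The csv row is all ones, which suggests that export time, import time and size are reported relative to csv. Under that assumption, a comparison of this kind can be sketched with rio itself. This is not the benchmark code behind the table: `measure()` is a made-up helper, only two formats that need no extra packages are compared, and the accuracy column is not reproduced.

```r
library(rio)

# Hypothetical helper: time one export/import round trip and record file size.
measure <- function(x, ext) {
  path <- tempfile(fileext = paste0(".", ext))
  export_time <- system.time(export(x, path))[["elapsed"]]
  import_time <- system.time(import(path))[["elapsed"]]
  data.frame(format = ext, export = export_time,
             import = import_time, size = file.size(path))
}

# A moderately sized test data set (iris repeated 200 times).
big <- iris[rep(seq_len(nrow(iris)), 200), ]
raw <- do.call(rbind, lapply(c("csv", "rds"), function(ext) measure(big, ext)))

# Express every column relative to the csv row (assumed baseline).
# Very short timings are noisy, so time ratios on small data mean little.
baseline <- raw[raw$format == "csv", ]
rel <- raw
rel$export <- raw$export / baseline$export
rel$import <- raw$import / baseline$import
rel$size <- raw$size / baseline$size
rel
```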

10 years of rio and readODS

“A good beginning requires enthusiasm, a good ending requires discipline.”

The motto of the German Football Association for the 2014 World Cup

10 years of rio and readODS

“Ein guter Anfang braucht Begeisterung, ein gutes Ende Disziplin.”

The motto of the DFB (German Football Association) for the 2014 World Cup, in the original German

Thank you:

  • Thomas Leeper (for maintaining rio from 2015 to 2023)
  • Gerrit-Jan Schutten (for creating readODS)
  • Peter Brohan (for the C++ implementation of read_ods)
  • Detlef Steuer (for the quick XML generation algorithm)
  • Jason Becker (for the sustained contribution to rio)
  • David Schoch (for the contribution to rio and leading the GESIS TSA Team)
  • Hadley Wickham, Jennifer Bryan (for allowing me to fork some code from readr)
  • rOpenSci (for managing the infrastructure to support the development of readODS)
  • All developers / maintainers of the I/O infrastructure
  • All rio and readODS contributors and users.

More about me: https://www.chainsawriot.com/

rio in real world 2

rio in real world 3

  • Used by multiple organizations: WHO, Médecins Sans Frontières etc.
  • In data science education