README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# tidyods <img src="man/figures/tidyods_hex.png" align="right" alt="tidyods package logo" width="120" />

<!-- badges: start -->
[![Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.](https://www.repostatus.org/badges/latest/concept.svg)](https://www.repostatus.org/#concept)
<!-- badges: end -->

`{tidyods}` imports cells from an OpenDocument Spreadsheet (ODS) file, and 
provides not just the cell's value but also information about the cell's
position, value types and formulas in a tidy format to allow further
programmatic analysis, investigation and manipulation. It also provides methods
to "rectify" cells back to a 2-dimensional data.frame.

## Installation

You can install the development version of `{tidyods}` like so:

``` r
remotes::install_github("mattkerlogue/tidyods")
```

## Usage

You can read in cells via the `read_ods_cells()` function. There is a helper
function `ods_sheets()` that can list the sheets in an ODS file.

The package includes an example ODS with two sheets: `penguins` and `types`.

```{r}
library(tidyods)

example_file <- system.file("extdata", "basic_example.ods", package = "tidyods")

ods_sheets(example_file)

read_ods_cells(example_file, sheet = "penguins", quiet = TRUE)
```

The `penguins` sheet is a simple 6 rows by 4 columns sheet that stores the 
output of the following code:

```{r}
palmerpenguins::penguins |>
  tidyr::drop_na() |>
  dplyr::group_by(species, female = sex == "female") |>
  dplyr::summarise(
    dplyr::across(c(bill_length_mm, body_mass_g), ~mean(.x, na.rm = TRUE)),
    .groups = "drop"
  )
```


There are functions to "rectify" a `{tidyods}` data.frame back to a traditional
2-dimensional array, the function `read_ods_sheet()` combines `read_ods_cells()`
and the rectify functions to easily import a dataset.

```{r}
#| message: false
penguin_sheet <- read_ods_sheet(example_file, "penguins", quick = TRUE, quiet = TRUE)

penguin_sheet
```


The `types` sheet shows examples of the different ODS data types:
```{r}
#| message: false
types_cells <- read_ods_cells(example_file, "types", quiet = TRUE)

types_cells |> 
  dplyr::filter(row > 1) |>
  dplyr::group_by(col) |>
  dplyr::glimpse()

types_cells |>
  dplyr::filter(row > 1) |>
  dplyr::group_by(col) |>
  dplyr::slice_head(n = 2) |>
  dplyr::select(-cell_type)
```

## Performance

An ODS file is a zipped collection of XML files and associated files. 
`{tidyods}`, like the `{readODS}` package, uses the
[`{xml2}`](https://xml2.r-lib.org) package to process this file.

The main aim of `{tidyods}` is to extract a large set of information about
cells, not just just their location and value, and while slower than
`{readODS}` for small and medium sized files users are unlikely to see
noticeable differences when using `read_ods_cells()`.

If your primary interest is just in cell location and cell values then setting 
the argument `quick = TRUE` can result in a faster extraction process as it 
provides only six columns: `sheet`, `address`, `row`, `col`, `value_type` and 
`base_value`. This `quick` extraction process varies in how the `base_value`
column is constructed. When `quick = TRUE` only float and percentage values are
taken in their raw numeric form, for all other types the `cell_content` is
used. That means that for currency, date and time value types the value as
formatted for spreadsheet application users is returned rather than the "raw"
value stored in the underlying XML file.

```{r}
#| message: false
types_cells_quick <- read_ods_cells(example_file, "types",
                                    quick = TRUE, quiet = TRUE)

types_cells_quick |>
  dplyr::filter(row > 1) |>
  dplyr::group_by(col) |>
  dplyr::glimpse()
```

Using the package's example file we can see how setting `quick = TRUE` results
in performance that is nearly comaprable with that provided by `{readODS}`.

```r
bench::mark(
  "cells_quick" =
    read_ods_cells(example_file, 2, quick = TRUE, quiet = TRUE),
  "cells_slow" =
    read_ods_cells(example_file, 2, quiet = TRUE),
  "sheet_quick" =
    read_ods_sheet(example_file, 2, quick = TRUE, quiet = TRUE),
  "sheet_slow" =
    read_ods_sheet(example_file, 2, col_headers = FALSE, quiet = TRUE),
  "readODS" =
    readODS::read_ods(example_file, 2),
  check = FALSE, filter_gc = FALSE, iterations = 20
) |> 
  dplyr::transmute(expression, min, median, mean = total_time/n_itr, n_itr)
  
#> # A tibble: 5 × 7
#>   expression       min   median     mean n_itr
#>   <bch:expr>  <bch:tm> <bch:tm> <bch:tm> <int>
#> 1 cells_quick   55.4ms   61.2ms   61.9ms    20
#> 2 cells_slow    83.3ms   90.5ms   91.5ms    20
#> 3 sheet_quick   62.8ms   69.7ms   72.4ms    20
#> 4 sheet_slow    93.3ms  101.1ms  102.9ms    20
#> 5 readODS       50.4ms   53.7ms   55.3ms    20
```

To test real-world performance we can use an ODS file published by the UK
Government on the
[number of civil servants by postcode](https://www.gov.uk/government/statistics/number-of-civil-servants-by-postcode-department-responsibility-level-and-leaving-cause-2021),
which contains a large sheet of 5,544 rows by 11 columns.

```r
postcodes_file <- system.file("extdata", "civil-service-postcodes-2021.ods",
  package = "tidyods")

bench::mark(
  "cells_quick" =
    read_ods_cells(postcodes_file, 2, quick = TRUE, quiet = TRUE),
  "cells_slow" =
    read_ods_cells(postcodes_file, 2, quiet = TRUE),
  "sheet_quick" =
    read_ods_sheet(postcodes_file, 2, quick = TRUE, quiet = TRUE),
  "sheet_slow" =
    read_ods_sheet(postcodes_file, 2, col_headers = FALSE, quiet = TRUE),
  "readODS" =
    readODS::read_ods(postcodes_file, 2),
  check = FALSE, filter_gc = FALSE, iterations = 5
) |>
  dplyr::transmute(expression, min, median, mean = total_time/n_itr, n_itr)

#> # A tibble: 5 × 7
#>   expression       min   median     mean n_itr
#>   <bch:expr>  <bch:tm> <bch:tm> <bch:tm> <int>
#> 1 cells_quick    16.3s    16.7s    16.7s     5
#> 2 cells_slow     21.7s      22s    22.8s     5
#> 3 sheet_quick    18.3s    19.9s    19.7s     5
#> 4 sheet_slow     20.7s    23.3s      23s     5
#> 5 readODS        13.7s    14.8s    14.6s     5
```

For this large sheet setting `quick = TRUE` delivers only a couple of seconds
slower extraction time compared to that provided by `readODS::read_ods()`.

The dependency on `{xml2}` is likely to cause the function to fail/crash when
working with exceptionally large files. This is a limitation inherited from
the libxml2 C library that is used to power `{xml2}`, which is
[documented](https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Memory-management#general-memory-requirements) 
as requiring available memory equal to around 4 times the size of the file
being processed. As the XML in an ODS is contained in a zip file the system 
file size of an ODS file can easily hide the true requirements of processing 
the file. Files are now checked for size and operations cancelled if its
estimated to exceed available memory.

## Related projects

The `{tidyODS}` package is heavily inspired by three existing packages:

- The [`{readODS}`](https://github.com/chainsawriot/readODS) package provides
  functionality to read and write ODS files with R, with the resulting 
  data.frame reflecting the way the data is viewed in a spreadsheet editor 
  application (i.e a two-dimension table structure).
- The [`{tidyxl}`](https://nacnudus.github.io/tidyxl/) package reads cells from 
  Excel files and provides a data.frame where each cell is its own row and 
  columns provide information about the cell's location, value and value type, 
  formula and formatting.
- The [`{unpivotr}`](https://nacnudus.github.io/unpivotr/) package works with 
  datasets of cells (such as those created by tidyxl).

## Philosophy

The `{readODS}` package is the only package on CRAN that handles the reading of 
ODS files, whereas there are more than 20 packages (as at 12 June 2022) on CRAN 
that work with Excel files (of which `{tidyxl}` is one). In some respects this 
is due to the wider usage of Excel files by businesses, governments, academia 
and other organisations to publish and share data. Various governments and
international organisations are starting to mandate the use of OpenDocument 
Format files for publishing of their information.

The initial purpose of `{tidyods}` was to provide a second package to the R 
ecosystem for reading ODS files, in part prompted when encountering an error 
with `{readODS}` and discovering no alternative package was available. This is
not the same when working with Excel spreadsheets, for example if you run into
an error when using the `{readxl}` package you could  use easily try the 
`{openxlsx}` or `{xlsx}` packages instead.

The initial conceptual development of code lead to the creation of an output
dataset  similar to that produced by the `{tidyxl}` package. Thus, `{tidyods}`
extracts cells from an ODS document in the same manner as `{tidyxl}`, but
includes functions to "rectify" the cells so as to also provide
an output similar to that of `{readODS}` (and many of those that read Excel 
files). This is similar to the 
[rectify](https://nacnudus.github.io/unpivotr/reference/rectify.html) function 
in the `{unpivotr}` package.

## Code of Conduct

Please note that the tidyods project is released with a
[Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md). By contributing to
this project, you agree to abide by its terms. Please read the
[contributing guidelines](.github/CONTRIBUTING.md).