-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
250 lines (201 loc) · 9.57 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# tidyods <img src="man/figures/tidyods_hex.png" align="right" alt="tidyods package logo" width="120" />
<!-- badges: start -->
[![Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.](https://www.repostatus.org/badges/latest/concept.svg)](https://www.repostatus.org/#concept)
<!-- badges: end -->
`{tidyods}` imports cells from an OpenDocument Spreadsheet (ODS) file, and
provides not just the cell's value but also information about the cell's
position, value types and formulas in a tidy format to allow further
programmatic analysis, investigation and manipulation. It also provides methods
to "rectify" cells back to a 2-dimensional data.frame.
## Installation
You can install the development version of `{tidyods}` like so:
``` r
remotes::install_github("mattkerlogue/tidyods")
```
## Usage
You can read in cells via the `read_ods_cells()` function. There is a helper
function `ods_sheets()` that can list the sheets in an ODS file.
The package includes an example ODS with two sheets: `penguins` and `types`.
```{r}
library(tidyods)
example_file <- system.file("extdata", "basic_example.ods", package = "tidyods")
ods_sheets(example_file)
read_ods_cells(example_file, sheet = "penguins", quiet = TRUE)
```
The `penguins` sheet is a simple 6 rows by 4 columns sheet that stores the
output of the following code:
```{r}
palmerpenguins::penguins |>
tidyr::drop_na() |>
dplyr::group_by(species, female = sex == "female") |>
dplyr::summarise(
dplyr::across(c(bill_length_mm, body_mass_g), ~mean(.x, na.rm = TRUE)),
.groups = "drop"
)
```
There are functions to "rectify" a `{tidyods}` data.frame back to a traditional
2-dimensional array, the function `read_ods_sheet()` combines `read_ods_cells()`
and the rectify functions to easily import a dataset.
```{r}
#| message: false
penguin_sheet <- read_ods_sheet(example_file, "penguins", quick = TRUE, quiet = TRUE)
penguin_sheet
```
The `types` sheet shows examples of the different ODS data types:
```{r}
#| message: false
types_cells <- read_ods_cells(example_file, "types", quiet = TRUE)
types_cells |>
dplyr::filter(row > 1) |>
dplyr::group_by(col) |>
dplyr::glimpse()
types_cells |>
dplyr::filter(row > 1) |>
dplyr::group_by(col) |>
dplyr::slice_head(n = 2) |>
dplyr::select(-cell_type)
```
## Performance
An ODS file is a zipped collection of XML files and associated files.
`{tidyods}`, like the `{readODS}` package, uses the
[`{xml2}`](https://xml2.r-lib.org) package to process this file.
The main aim of `{tidyods}` is to extract a large set of information about
cells, not just just their location and value, and while slower than
`{readODS}` for small and medium sized files users are unlikely to see
noticeable differences when using `read_ods_cells()`.
If your primary interest is just in cell location and cell values then setting
the argument `quick = TRUE` can result in a faster extraction process as it
provides only six columns: `sheet`, `address`, `row`, `col`, `value_type` and
`base_value`. This `quick` extraction process varies in how the `base_value`
column is constructed. When `quick = TRUE` only float and percentage values are
taken in their raw numeric form, for all other types the `cell_content` is
used. That means that for currency, date and time value types the value as
formatted for spreadsheet application users is returned rather than the "raw"
value stored in the underlying XML file.
```{r}
#| message: false
types_cells_quick <- read_ods_cells(example_file, "types",
quick = TRUE, quiet = TRUE)
types_cells_quick |>
dplyr::filter(row > 1) |>
dplyr::group_by(col) |>
dplyr::glimpse()
```
Using the package's example file we can see how setting `quick = TRUE` results
in performance that is nearly comaprable with that provided by `{readODS}`.
```r
bench::mark(
"cells_quick" =
read_ods_cells(example_file, 2, quick = TRUE, quiet = TRUE),
"cells_slow" =
read_ods_cells(example_file, 2, quiet = TRUE),
"sheet_quick" =
read_ods_sheet(example_file, 2, quick = TRUE, quiet = TRUE),
"sheet_slow" =
read_ods_sheet(example_file, 2, col_headers = FALSE, quiet = TRUE),
"readODS" =
readODS::read_ods(example_file, 2),
check = FALSE, filter_gc = FALSE, iterations = 20
) |>
dplyr::transmute(expression, min, median, mean = total_time/n_itr, n_itr)
#> # A tibble: 5 × 7
#> expression min median mean n_itr
#> <bch:expr> <bch:tm> <bch:tm> <bch:tm> <int>
#> 1 cells_quick 55.4ms 61.2ms 61.9ms 20
#> 2 cells_slow 83.3ms 90.5ms 91.5ms 20
#> 3 sheet_quick 62.8ms 69.7ms 72.4ms 20
#> 4 sheet_slow 93.3ms 101.1ms 102.9ms 20
#> 5 readODS 50.4ms 53.7ms 55.3ms 20
```
To test real-world performance we can use an ODS file published by the UK
Government on the
[number of civil servants by postcode](https://www.gov.uk/government/statistics/number-of-civil-servants-by-postcode-department-responsibility-level-and-leaving-cause-2021),
which contains a large sheet of 5,544 rows by 11 columns.
```r
postcodes_file <- system.file("extdata", "civil-service-postcodes-2021.ods",
package = "tidyods")
bench::mark(
"cells_quick" =
read_ods_cells(postcodes_file, 2, quick = TRUE, quiet = TRUE),
"cells_slow" =
read_ods_cells(postcodes_file, 2, quiet = TRUE),
"sheet_quick" =
read_ods_sheet(postcodes_file, 2, quick = TRUE, quiet = TRUE),
"sheet_slow" =
read_ods_sheet(postcodes_file, 2, col_headers = FALSE, quiet = TRUE),
"readODS" =
readODS::read_ods(postcodes_file, 2),
check = FALSE, filter_gc = FALSE, iterations = 5
) |>
dplyr::transmute(expression, min, median, mean = total_time/n_itr, n_itr)
#> # A tibble: 5 × 7
#> expression min median mean n_itr
#> <bch:expr> <bch:tm> <bch:tm> <bch:tm> <int>
#> 1 cells_quick 16.3s 16.7s 16.7s 5
#> 2 cells_slow 21.7s 22s 22.8s 5
#> 3 sheet_quick 18.3s 19.9s 19.7s 5
#> 4 sheet_slow 20.7s 23.3s 23s 5
#> 5 readODS 13.7s 14.8s 14.6s 5
```
For this large sheet setting `quick = TRUE` delivers only a couple of seconds
slower extraction time compared to that provided by `readODS::read_ods()`.
The dependency on `{xml2}` is likely to cause the function to fail/crash when
working with exceptionally large files. This is a limitation inherited from
the libxml2 C library that is used to power `{xml2}`, which is
[documented](https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Memory-management#general-memory-requirements)
as requiring available memory equal to around 4 times the size of the file
being processed. As the XML in an ODS is contained in a zip file the system
file size of an ODS file can easily hide the true requirements of processing
the file. Files are now checked for size and operations cancelled if its
estimated to exceed available memory.
## Related projects
The `{tidyODS}` package is heavily inspired by three existing packages:
- The [`{readODS}`](https://github.com/chainsawriot/readODS) package provides
functionality to read and write ODS files with R, with the resulting
data.frame reflecting the way the data is viewed in a spreadsheet editor
application (i.e a two-dimension table structure).
- The [`{tidyxl}`](https://nacnudus.github.io/tidyxl/) package reads cells from
Excel files and provides a data.frame where each cell is its own row and
columns provide information about the cell's location, value and value type,
formula and formatting.
- The [`{unpivotr}`](https://nacnudus.github.io/unpivotr/) package works with
datasets of cells (such as those created by tidyxl).
## Philosophy
The `{readODS}` package is the only package on CRAN that handles the reading of
ODS files, whereas there are more than 20 packages (as at 12 June 2022) on CRAN
that work with Excel files (of which `{tidyxl}` is one). In some respects this
is due to the wider usage of Excel files by businesses, governments, academia
and other organisations to publish and share data. Various governments and
international organisations are starting to mandate the use of OpenDocument
Format files for publishing of their information.
The initial purpose of `{tidyods}` was to provide a second package to the R
ecosystem for reading ODS files, in part prompted when encountering an error
with `{readODS}` and discovering no alternative package was available. This is
not the same when working with Excel spreadsheets, for example if you run into
an error when using the `{readxl}` package you could use easily try the
`{openxlsx}` or `{xlsx}` packages instead.
The initial conceptual development of code lead to the creation of an output
dataset similar to that produced by the `{tidyxl}` package. Thus, `{tidyods}`
extracts cells from an ODS document in the same manner as `{tidyxl}`, but
includes functions to "rectify" the cells so as to also provide
an output similar to that of `{readODS}` (and many of those that read Excel
files). This is similar to the
[rectify](https://nacnudus.github.io/unpivotr/reference/rectify.html) function
in the `{unpivotr}` package.
## Code of Conduct
Please note that the tidyods project is released with a
[Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md). By contributing to
this project, you agree to abide by its terms. Please read the
[contributing guidelines](.github/CONTRIBUTING.md).