The OxCOVID19 Project aims to increase our understanding of the COVID-19 pandemic and elaborate possible strategies to reduce the impact on the society through the combined power of statistical, mathematical modelling, and machine learning techniques. The OxCOVID19 Database is a large, single-centre, multimodal relational database consisting of information (using acknowledged sources) related to COVID-19 pandemic. This package provides an R-specific interface to the OxCOVID19 Database based on widely-used data handling and manipulation approaches in R.
The OxCOVID19 Project team presented to the CoMo Consortium during its weekly meeting on the 1st of July 2020. During this meeting, the CoMo Consortium considered the use of the OxCOVID19 Database for use in filling data and information for country-specific parameters required in the CoMo Consortium model. Given that the CoMo Consortium model is R-centric, it makes logical sense to build an R-specific API to connect with the OxCOVID19 PostgreSQL database. This package aims to facilitate the possible use of the OxCOVID19 database for this purpose through purposefully-written functions that connects an R user to the database directly from R (as opposed to doing a manual download of the data or using separate tools to access PostgreSQL) and processes and structures the available datasets into country-specific tables structured for the requirements of the CoMo Consortium model. A direct link to the PostgreSQL via R is also advantageous as this is updated more frequently than the CSV datasets made available via GitHub.
You can install oxcovid19
from CRAN
with:
install.packages("oxcovid19")
You can install the development version of oxcovid19
from
GitHub with:
if(!require(remotes)) install.packages("remotes")
remotes::install_github("como-ph/oxcovid19")
The primary use case for the oxcovid19
package is for facilitating a
simplified, R-based workflow for 1)
connecting to the OxCOVID19
PostgreSQL server; 2) accessing table/s
available from the PostgreSQL server; and,
3) querying the PostgreSQL server with
specific parameters to customise table/s output for intended use.
The following code demonstrates this workflow:
library(oxcovid19)
## Step 1: Create a connection to OxCOVID19 PostgreSQL server
con <- connect_oxcovid19()
## Step 2: Access epidemiology table from OxCOVID19 PostgreSQL server
epi_tab <- get_table(con = con, tbl_name = "epidemiology")
## Step 3: Query the epidemiology table to show data for Great Britain
gbr_epi_tab <- dplyr::filter(.data = epi_tab, countrycode == "GBR")
Step 1 and Step 2 above are facilitated by the
connect_oxcovid19
and the get_table
functions provided by the
oxcovid19
package. These functions are basically low-level wrappers to
functions in the DBI
and
RPostgres
packages applied specifically
to work with the OxCOVID19
PostgreSQL. These functions facilitate
convenient access to the server for general
R users without having to learn to use the
DBI
and RPostgres
packages.
Step 3, on the other hand, is facilitated by the dplyr
package
functions which were designed to work with different types of tables
including those from various database server connections such as
PostgreSQL.
The output of the workflow shown above is:
#> # Source: lazy query [?? x 15]
#> # Database: postgres [[email protected]:5432/covid19]
#> source date country countrycode adm_area_1 adm_area_2 adm_area_3 tested
#> <chr> <date> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 GBR_P… 2020-04-30 United… GBR <NA> <NA> <NA> 933455
#> 2 GBR_P… 2020-04-29 United… GBR <NA> <NA> <NA> 849296
#> 3 GBR_P… 2020-03-25 United… GBR <NA> <NA> <NA> NA
#> 4 GBR_P… 2020-07-28 United… GBR England Cambridge… <NA> NA
#> 5 GBR_P… 2020-07-28 United… GBR England Cumbria <NA> NA
#> 6 GBR_P… 2020-07-28 United… GBR England Derbyshire <NA> NA
#> 7 GBR_P… 2020-07-28 United… GBR England Devon <NA> NA
#> 8 GBR_P… 2020-07-28 United… GBR England East Suss… <NA> NA
#> 9 GBR_P… 2020-07-28 United… GBR England Essex <NA> NA
#> 10 GBR_P… 2020-07-28 United… GBR England Glouceste… <NA> NA
#> # … with more rows, and 7 more variables: confirmed <int>, recovered <int>,
#> # dead <int>, hospitalised <int>, hospitalised_icu <int>, quarantined <int>,
#> # gid <chr>
Note that the output of this workflow looks like a tibble
(which it
partly is) but it is also more than that. Annotation shows some of the
metadata about this table. Specifically it shows the source and the
database from which this table is from. This output table has the
following classes:
class(gbr_epi_tab)
#> [1] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy"
#> [5] "tbl"
Other than being a tibble
, the output is also a what is called a
“lazy” table (tbl_lazy
) or a dbi (database) table (tbl_dib
).
What this means is that the output table shown is not actually retrieved
and the basic query (which in this case is selecting a specific table
named epidemiology
) is not actually run. This has its advantageous in
that the data is not yet in memory in R
which can save computing time particularly if only a subset of the table
in the remote server is required, and further queries to the remote
server can be declared to customise the data output. These further
queries can be facilitated using dplyr
verbs such as filter
,
mutate
, arrange
which also operates lazily when applied to
tbl_lazy
or tbl_dbi
objects.
One can review the query parameter/s or the query plan applied to a
tbl_lazy
object by using the get_metadata
function provided in
oxcovid19
:
get_metadata(gbr_epi_tab)
#> $Name
#> NULL
#>
#> $Source
#> src: postgres [[email protected]:5432/covid19]
#> tbls: administrative_division, baseline_mortality, diagnostics, epidemiology,
#> epidemiology_england_msoa, geography_columns, geometry_columns,
#> government_response, mobility, spatial_ref_sys, surveys, weather, world_bank,
#> world_bank_time_series
#>
#> $`DBI connection`
#> <PqConnection> [email protected]:5432
#>
#> $`Query text`
#> <SQL> SELECT *
#> FROM "epidemiology"
#> WHERE ("countrycode" = 'GBR')
#>
#> $`Query plan`
#> [1] " QUERY PLAN\n1 Seq Scan on epidemiology (cost=0.00..60296.06 rows=197287 width=120)\n2 Filter: ((countrycode)::text = 'GBR'::text)"
The result is a list showing information on the remote table’s name
,
source
, DBI connection
, query text
and query plan
.
It should also be noted that other R
operations and functions will not work on tbl_lazy
objects. For
example, getting the number of rows of a tbl_lazy
using nrow
will
result in NA
:
nrow(gbr_epi_tab)
#> [1] NA
These operations require for the table to be retrieved into R (discussed below).
The oxcovid19
package functions are also designed to allow pipe
operations using the magrittr
package. The workflow above can be done
using piped operations as follows:
## Load magrittr to use pipe operator %>%
library(magrittr)
connect_oxcovid19() %>%
get_table(tbl_name = "epidemiology") %>%
dplyr::filter(countrycode == "GBR")
#> # Source: lazy query [?? x 15]
#> # Database: postgres [[email protected]:5432/covid19]
#> source date country countrycode adm_area_1 adm_area_2 adm_area_3 tested
#> <chr> <date> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 GBR_P… 2020-04-30 United… GBR <NA> <NA> <NA> 933455
#> 2 GBR_P… 2020-04-29 United… GBR <NA> <NA> <NA> 849296
#> 3 GBR_P… 2020-03-25 United… GBR <NA> <NA> <NA> NA
#> 4 GBR_P… 2020-07-28 United… GBR England Cambridge… <NA> NA
#> 5 GBR_P… 2020-07-28 United… GBR England Cumbria <NA> NA
#> 6 GBR_P… 2020-07-28 United… GBR England Derbyshire <NA> NA
#> 7 GBR_P… 2020-07-28 United… GBR England Devon <NA> NA
#> 8 GBR_P… 2020-07-28 United… GBR England East Suss… <NA> NA
#> 9 GBR_P… 2020-07-28 United… GBR England Essex <NA> NA
#> 10 GBR_P… 2020-07-28 United… GBR England Glouceste… <NA> NA
#> # … with more rows, and 7 more variables: confirmed <int>, recovered <int>,
#> # dead <int>, hospitalised <int>, hospitalised_icu <int>, quarantined <int>,
#> # gid <chr>
The workflow using the piped workflow outputs the same result as the earlier workflow but with a much streamlined use of code.
Once all query parameters have been set to what is required, the actual
data from the remote table can be retrieved into
R by using the collect
function:
dplyr::collect(gbr_epi_tab)
#> # A tibble: 197,508 x 15
#> source date country countrycode adm_area_1 adm_area_2 adm_area_3 tested
#> <chr> <date> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 GBR_P… 2020-04-30 United… GBR <NA> <NA> <NA> 933455
#> 2 GBR_P… 2020-04-29 United… GBR <NA> <NA> <NA> 849296
#> 3 GBR_P… 2020-03-25 United… GBR <NA> <NA> <NA> NA
#> 4 GBR_P… 2020-07-28 United… GBR England Cambridge… <NA> NA
#> 5 GBR_P… 2020-07-28 United… GBR England Cumbria <NA> NA
#> 6 GBR_P… 2020-07-28 United… GBR England Derbyshire <NA> NA
#> 7 GBR_P… 2020-07-28 United… GBR England Devon <NA> NA
#> 8 GBR_P… 2020-07-28 United… GBR England East Suss… <NA> NA
#> 9 GBR_P… 2020-07-28 United… GBR England Essex <NA> NA
#> 10 GBR_P… 2020-07-28 United… GBR England Glouceste… <NA> NA
#> # … with 197,498 more rows, and 7 more variables: confirmed <int>,
#> # recovered <int>, dead <int>, hospitalised <int>, hospitalised_icu <int>,
#> # quarantined <int>, gid <pq__text>
#> [1] "tbl_df" "tbl" "data.frame"
The resulting output is a tbl
but is now retrieved into
R. Now, further
R operations can be used:
nrow(dplyr::collect(gbr_epi_tab))
#> [1] 197508
It should be noted that the use of collect
should be well-planned and
ideally should be implemented once all required queries that can be
issued lazily have been specified. Depending on the size of the remote
database being accessed, using collect
too early to retrieve data can
cause significant computing overheads.
The OxCOVID19 Database includes a table
named administrative_division
which contains some spatial information.
This table can be accessed via oxcovid19
using the same approach
described above:
connect_oxcovid19() %>% get_table(tbl_name = "administrative_division")
#> # Source: table<administrative_division> [?? x 15]
#> # Database: postgres [[email protected]:5432/covid19]
#> country countrycode countrycode_alp… adm_level adm_area_1 adm_area_1_code
#> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 Brazil BRA BR 3 Bahia BRA.5_1
#> 2 Brazil BRA BR 3 Bahia BRA.5_1
#> 3 Brazil BRA BR 3 Bahia BRA.5_1
#> 4 Brazil BRA BR 3 Bahia BRA.5_1
#> 5 Brazil BRA BR 3 Bahia BRA.5_1
#> 6 Brazil BRA BR 3 Bahia BRA.5_1
#> 7 Brazil BRA BR 3 Bahia BRA.5_1
#> 8 Brazil BRA BR 3 Bahia BRA.5_1
#> 9 Brazil BRA BR 3 Bahia BRA.5_1
#> 10 Brazil BRA BR 3 Bahia BRA.5_1
#> # … with more rows, and 9 more variables: adm_area_2 <chr>,
#> # adm_area_2_code <chr>, adm_area_3 <chr>, adm_area_3_code <chr>, gid <chr>,
#> # latitude <dbl>, longitude <dbl>, properties <chr>, geometry <chr>
The administrative_division
table contains rows of information on
longitude and latitude for several countries and for
administrative divisions (up to level 3) within some of these countries.
In addition, each row of geographical area available in this table has
area borders information in simple
features format.
The longitude and latitude information can be easily accessed
and utilised for mapping point locations using the get_table
function.
However, the area borders information in simple
features format require
specific parsing to be usable in R. This is
facilitated using the sf
package. The oxcovid19
package provides the
get_layer
function which is a wrapper function around the st_read
function in the sf
package to read simple
features format data
into R.
To query the administrative_division
table for the area borders data,
the function get_layer
is used as follows:
connect_oxcovid19() %>% get_layer(ccode = "GBR", adm = 0)
#> Simple feature collection with 1 feature and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -8.151357 ymin: 49.95847 xmax: 1.764168 ymax: 60.84583
#> geographic CRS: WGS 84
#> country countrycode countrycode_alpha2 adm_level adm_area_1
#> 1 United Kingdom GBR GB 0 <NA>
#> adm_area_1_code adm_area_2 adm_area_2_code adm_area_3 adm_area_3_code gid
#> 1 <NA> <NA> <NA> <NA> <NA> GBR
#> latitude longitude properties
#> 1 54.16473 -2.895751 {"GID_0": "GBR", "NAME_0": "United Kingdom"}
#> geometry
#> 1 MULTIPOLYGON (((-5.867211 5...
The get_layer
function requires two main parameters. First, the
country from which area borders are required need to be specified using
its three-letter ISO code. In the example above, the ISO code GBR
is
used to specify the country United Kingdom. Second, the administrative
division level should be indicated. This should be specified as a
numeric value from 0 to 3 corresponding to country level borders (0),
administrative level 1 borders such as regions or provinces (1),
administrative level 2 borders such as districts or communes (2), and
administrative level 3 borders such as sub-districts or sub-communes
(3). In the example above, 0 is specified for country borders.
Once the area borders have been read into R
using the get_layer
function, further spatial data manipulation and
processing can be done using other sf
package functions. Further
information on the sf
package can be found
here.
oxcovid19
includes four specialised wrapper functions that facilitate
easy access and query of specific tables available in the OxCOVID19
Database and reads the data into
R. These functions start with the
get_data_
prefix followed by the respective table descriptor.
Function | Description |
---|---|
get_data_epidemiology |
Get data from epidemiology table |
get_data_weather |
Get data from weather table |
get_data_mobility |
Get data from mobility table |
get_data_response |
Get data from government response table |
Each of these get_data_
functions can be supplied with specific query
parameters to further refine the data to retrieve from the remote table.
Query parameters | Description |
---|---|
.source |
Query the table via the source field |
ccode |
Query the table via the countrycode field |
start , end |
Query the table via the date field |
adm |
Query the table via the adm_area_1 , adm_area_2 and adm_area_3 fields |
The oxcovid19
package is in active development which will be dictated
by the evolution of the OxCOVID19
Database over time. Whilst every attempt
will be employed to maintain syntax of current functions, it is possible
that current functions may change syntax or operability in order to
ensure relevance or maybe deprecated in lieu of a more appropriate
and/or more performant function. In either of these cases, any change
will be well-documented and explained to users and deprecation will be
staged in such a way that users will be informed in good time to allow
for transition to using the new functions.
If you find OxCOVID19 Database useful please cite:
Adam Mahdi, Piotr Błaszczyk, Paweł Dłotko, Dario Salvi, Tak-Shing Chan, John Harvey, Davide Gurnari, Yue Wu, Ahmad Farhat, Niklas Hellmer, Alexander Zarebski, Bernie Hogan, Lionel Tarassenko, Oxford COVID-19 Database: a multimodal data repository for better understanding the global impact of COVID-19. University of Oxford, 2020. medRxiv (doi: https://doi.org/10.1101/2020.08.18.20177147).
The OxCOVID19 Database is the result of
many hours of volunteer efforts and generous contributions of many
organisations. If you use a specific table please also cite the
underlying source as described
here. Sources of
specific tables can also be accessed via the oxcovid19
package through
the data_sources
dataset. For example, if you have accessed the
Epidemiology table from the database, you can access the sources for
this table with:
data_sources[["epidemiology"]]
which gives the following result:
#> # A tibble: 63 x 5
#> Country `Source code` Source Features `Terms of Use`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Australia AUS_C1A covid-19-au.… "tested, confirme… "Strictly for educa…
#> 2 Belgium BEL_LE Laurent Esch… "confirmed (count… "CC0 1.0 Universal …
#> 3 Belgium BEL_SCI Epistat "" ""
#> 4 Brazil BRA_MSHM Ministerio d… "confirmed (count… "CC0 1.0 Universal"
#> 5 Canada CAN_GOV Government o… "tested, confirme… "Attribution requir…
#> 6 Switzerl… CHE_OPGOV Swiss Canton… "" "CC 4.0"
#> 7 Mainland… CHN_ICL MRC Centre f… "confirmed (both … "CC BY NC ND 4.0"
#> 8 Germany DEU_JPGG Jan-Philip G… "confirmed, dead" "MIT"
#> 9 Spain ESP_MS Ministerio d… "confirmed, dead,… "\"Desnaturalizacio…
#> 10 Spain ESP_MSVP Ministerio d… "confirmed, recov… "Apache License 2.0"
#> # … with 53 more rows
A utility function called cite_sources
is also available. This
function is applied to a dataset extracted from the OxCOVID19
Database using the get_data
functions
or a dataset that has been collected after using get_table
function.
For example:
## Get epidemiology data for UK
gbr_epi_tab <- get_data_epidemiology(ccode = "UK")
## Cite sources for UK epidemiology data
cite_sources(gbr_epi_tab)
This results in:
#> # A tibble: 8 x 3
#> Table Source `Terms of Use`
#> <chr> <chr> <chr>
#> 1 Epidemiol… Department of Health (Northern Ireland) ""
#> 2 Epidemiol… Public Health England "Open Government Licen…
#> 3 Epidemiol… Scottish Government "GPL 3.0"
#> 4 Epidemiol… Tom White "The Unlicense"
#> 5 Epidemiol… Public Health Wales "Open Government Licen…
#> 6 Epidemiol… European Centre for Disease Prevention and… "Attribution required"
#> 7 Epidemiol… World Health Organization ""
#> 8 Epidemiol… Center for Systems Science and Engineering… "CC BY 4.0"
Feedback, bug reports and feature requests are welcome; file issues or seek support here. If you would like to contribute to the package, please see our contributing guidelines.
This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.