From 461c6395d2a93351190340856f811c99b6557d0d Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Mon, 28 Oct 2024 17:36:48 +0000 Subject: [PATCH 01/13] add notes on versioning from klump et al., imfe from siddorn et al. etc --- vignettes/articles/butterfly_paper.Rmd | 59 +++++++++++++++++++------- 1 file changed, 44 insertions(+), 15 deletions(-) diff --git a/vignettes/articles/butterfly_paper.Rmd b/vignettes/articles/butterfly_paper.Rmd index a561c81..9c87b8e 100644 --- a/vignettes/articles/butterfly_paper.Rmd +++ b/vignettes/articles/butterfly_paper.Rmd @@ -1,19 +1,20 @@ --- -title: 'butterfly: An R package for the quality assurance of continually updating timeseries' +title: 'butterfly: An R package for the quality assurance of continually updating + timeseries' tags: - - R - - quality assurance - - timeseries - - ERA5 -authors: - - name: Thomas Zwagerman - orcid: 0000-0000-0000-0000 - equal-contrib: true - affiliation: 1 +- R +- quality assurance +- timeseries +- ERA5 +date: "23 October 2024" affiliations: - - name: British Antarctic Survey, UK - index: 1 -date: 23 October 2024 +- name: British Antarctic Survey, UK + index: 1 +authors: +- name: Thomas Zwagerman + orcid: "0000-0000-0000-0000" + equal-contrib: true + affiliation: 1 --- ```{r, include = FALSE} @@ -27,6 +28,26 @@ knitr::opts_chunk$set( # Summary +Importance of citing exact extract of data [(Klump et al. 2021)](https://datascience.codata.org/articles/10.5334/dsj-2021-012) + +Semantic versioning is widely adopted in research software [(Preston-Werner 2013)](https://semver.org/spec/v2.0.0.html) + +Generating a derived data product + +But what if you are not aware of upstream changes to your input data? + +A key recommendation in Siddorn et al.'s (2022) report "An Information Management Framework for Environmental Digital Twins (IMFe)... + +data provenance must be maintained + +data quality frameworks + +clearly documented for users and available in machine-readable format + +tools and methods + +... 
for a FAIR implementation (Wilkinson et al. 2016). + # Statement of Need At the British Antarctic Survey (BAS), we developed this package to deal with a very specific issue. @@ -43,9 +64,9 @@ This package was originally developed to deal with [ERA5](https://cds.climate.co Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can (for example for [09/21 - 12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth), and [07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685)) force a recalculation, meaning previously published data differs from the final product. -In most cases, this is not an issue. For static data publications which are a snapshot in time, such as data associated with a specific paper, as in "Forecasts, neural networks, and results from the paper: 'Seasonal Arctic sea ice forecasting with probabilistic deep learning'" [Andersson & Hosking (2021)](https://data.bas.ac.uk/full-record.php?id=GB/NERC/BAS/PDC/01526) or time period as in "Downscaled ERA5 monthly precipitation data using Multi-Fidelity Gaussian Processes between 1980 and 2012 for the Upper Beas and Sutlej Basins, Himalaya" [Tazi (2023)](https://data.bas.ac.uk/full-record.php?id=GB/NERC/BAS/PDC/01769), this is not an issue. These datasets clearly describe a version and time period of ERA5 from which the data were derived, and will not be amended or updated in the future, even if ERA5 is recalculated. +In most cases, this is not an issue. 
For static data publications which are a snapshot in time, such as data associated with a specific paper, as in "Forecasts, neural networks, and results from the paper: 'Seasonal Arctic sea ice forecasting with probabilistic deep learning'" [Andersson & Hosking (2021)](https://data.bas.ac.uk/full-record.php?id=GB/NERC/BAS/PDC/01526)[@Andersson_2021] or time period as in "Downscaled ERA5 monthly precipitation data using Multi-Fidelity Gaussian Processes between 1980 and 2012 for the Upper Beas and Sutlej Basins, Himalaya" [Tazi (2023)](https://data.bas.ac.uk/full-record.php?id=GB/NERC/BAS/PDC/01769), this is not an issue. These datasets clearly describe a version and time period of ERA5 from which the data were derived, and will not be amended or updated in the future, even if ERA5 is recalculated.

-In our case however we want to continually append ERA5-derived datasets **and** continually publish them. This would be useful when functioning as a data source for an environmental digital twin (Blair & Hendrys, 2023), or simply as input data into an environmental forecasting model which itself is frequently running.
+In our case, however, we want to continually append ERA5-derived datasets **and** continually publish them. This would be useful when functioning as a data source for an environmental digital twin (Blair & Henrys 2023), or simply as input data into an environmental forecasting model which itself is frequently running.

Continually appending **and** publishing will require strict quality assurance. If a published dataset is only appended, a DOI can be minted for it. However, if the previously published data change, this will then invalidate the DOI. For example, if you developed your code to find a better measure (more accurate, more precise) of the low pressure region, and wanted to reanalyse the previous data and republish.
@@ -71,4 +92,12 @@ Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A., Muñoz Sabate

Hosking, J. S., A. Orr, T. J. 
Bracegirdle, and J. Turner (2016), Future circulation changes off West Antarctica: Sensitivity of the Amundsen Sea Low to projected anthropogenic forcing, Geophys. Res. Lett., 43, 367–376, . +Klump, J., Wyborn, L., Wu, M., Martin, J., Downs, R.R. and Asmi, A. (2021) ‘Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles’, Data Science Journal, 20(1), p. 12. Available at: https://doi.org/10.5334/dsj-2021-012. + +Preston-Werner, T. 2013. Semantic Versioning 2.0.0. Semantic Versioning. Available at https://semver.org/spec/v2.0.0.html [Last accessed 28 October 2024]. + +Siddorn, John, Gordon Shaw Blair, David Boot, Justin James Henry Buck, Andrew Kingdon, et al. 2022. “An Information Management Framework for Environmental Digital Twins (IMFe).” Zenodo. https://doi.org/10.5281/ZENODO.7004351. + Tazi, K. (2023). Downscaled ERA5 monthly precipitation data using Multi-Fidelity Gaussian Processes between 1980 and 2012 for the Upper Beas and Sutlej Basins, Himalayas (Version 1.0) [Data set]. NERC EDS UK Polar Data Centre. + +Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1). https://doi.org/10.1038/sdata.2016.18. 
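The upstream-recalculation problem this first draft describes can be checked for mechanically. Below is a minimal sketch using waldo (already among the package's imports); the data frames, column names and values are invented purely for illustration, not taken from ERA5:

```r
# Sketch: detecting an upstream ERA5T recalculation by comparing the
# previously downloaded slice with a fresh download of the same period.
# `era5_previous` and `era5_fresh` are hypothetical data frames; any
# difference in overlapping rows signals previously published data changed.
library(waldo)

era5_previous <- data.frame(
  time = as.Date(c("2024-07-01", "2024-07-02")),
  t2m  = c(285.1, 286.4)
)
era5_fresh <- data.frame(
  time = as.Date(c("2024-07-01", "2024-07-02")),
  t2m  = c(285.1, 286.9) # value revised upstream
)

# waldo::compare() prints a human-readable diff of the two objects
compare(era5_previous, era5_fresh)
```

If the comparison is non-empty, the derived product should not simply be appended to and republished under the same version.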
From 4517b7acd352f325e5cbb86cf777d5842e93eb98 Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Mon, 28 Oct 2024 17:37:42 +0000 Subject: [PATCH 02/13] add orcid --- vignettes/articles/butterfly_paper.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/articles/butterfly_paper.Rmd b/vignettes/articles/butterfly_paper.Rmd index 9c87b8e..b590ee1 100644 --- a/vignettes/articles/butterfly_paper.Rmd +++ b/vignettes/articles/butterfly_paper.Rmd @@ -12,7 +12,7 @@ affiliations: index: 1 authors: - name: Thomas Zwagerman - orcid: "0000-0000-0000-0000" + orcid: "0009-0003-3742-3234" equal-contrib: true affiliation: 1 --- From 1eaf7f3359245853c61f58e3f934cab7429ba4ab Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Tue, 29 Oct 2024 15:46:38 +0000 Subject: [PATCH 03/13] minor tweaks to paper --- vignettes/articles/butterfly_paper.Rmd | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/vignettes/articles/butterfly_paper.Rmd b/vignettes/articles/butterfly_paper.Rmd index b590ee1..96af8bc 100644 --- a/vignettes/articles/butterfly_paper.Rmd +++ b/vignettes/articles/butterfly_paper.Rmd @@ -1,6 +1,5 @@ --- -title: 'butterfly: An R package for the quality assurance of continually updating - timeseries' +title: 'butterfly: An R package for the verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged.' tags: - R - quality assurance @@ -28,14 +27,25 @@ knitr::opts_chunk$set( # Summary +When ... specific version of data used to + +maintain data provenance. + +When using other data to generate your own, it is crucial to reference the exact version of the data used, in order to maintain data provenance. + +In other instances it may be required to revise data publications, due to the discovery of an inconsistency or error. + +But what if you are not aware of upstream changes to your input data? 
+
+Here we present butterfly, an R package for the verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged.
+
+
 Importance of citing exact extract of data [(Klump et al. 2021)](https://datascience.codata.org/articles/10.5334/dsj-2021-012)
 
 Semantic versioning is widely adopted in research software [(Preston-Werner 2013)](https://semver.org/spec/v2.0.0.html)
 
 Generating a derived data product
 
-But what if you are not aware of upstream changes to your input data?
-
 A key recommendation in Siddorn et al.'s (2022) report "An Information Management Framework for Environmental Digital Twins (IMFe)...
 
 data provenance must be maintained
From 439aeb057b98c9c3a2c4b8f72adc32d7e4494b5f Mon Sep 17 00:00:00 2001
From: thomaszwagerman
Date: Tue, 29 Oct 2024 16:48:20 +0000
Subject: [PATCH 04/13] more comprehensive summary

---
 vignettes/articles/butterfly_paper.Rmd | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/vignettes/articles/butterfly_paper.Rmd b/vignettes/articles/butterfly_paper.Rmd
index 96af8bc..4e18a9c 100644
--- a/vignettes/articles/butterfly_paper.Rmd
+++ b/vignettes/articles/butterfly_paper.Rmd
@@ -27,18 +27,14 @@ knitr::opts_chunk$set(
 )
 ```
 
 # Summary
 
-When ... specific version of data used to
+Previously recorded data could be revised after initial publication for a number of reasons, such as the discovery of an inconsistency or error, a change in methodology, or instrument re-calibration. When using other data to generate your own, it is crucial to reference the exact version of the data used, in order to maintain data provenance. Unnoticed changes in previous data could have unintended consequences, such as invalidating a published dataset’s Digital Object Identifier (DOI), or altering future predictions if used as input in forecasting models.
 
-maintain data provenance.
+But what if you are not aware of upstream changes to your input data? 
Monitoring data sources for these changes is not always possible. Here we present butterfly, an R package for the verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged.
 
+The intention of butterfly is to check for changes in previously published data, and warn the user with a report that contains as much detail as possible. This will allow them to stop unintended data transfer, revise their published data, release a new version and communicate the significance of the change to their users.
 
+# Statement of Need
 
 Importance of citing exact extract of data [(Klump et al. 2021)](https://datascience.codata.org/articles/10.5334/dsj-2021-012)
 
 Semantic versioning is widely adopted in research software [(Preston-Werner 2013)](https://semver.org/spec/v2.0.0.html)
 
 Generating a derived data product
 
 A key recommendation in Siddorn et al.'s (2022) report "An Information Management Framework for Environmental Digital Twins (IMFe)...
 
 data provenance must be maintained
 
 data quality frameworks
 
 clearly documented for users and available in machine-readable format
 
 tools and methods
 
 ... for a FAIR implementation (Wilkinson et al. 2016).
 
-# Statement of Need
 
 At the British Antarctic Survey (BAS), we developed this package to deal with a very specific issue.
 Quality assurance in continually updating and continually published ERA5-derived data. 
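The check-then-warn workflow this summary describes can be sketched against the package's exported interface. The call below uses `loupe()`, which the package's NAMESPACE exports; the exact argument names are assumed from the surrounding drafts and may differ in detail from the released version:

```r
# Hypothetical use of the workflow described above: compare the current
# download against the previously published version before appending.
library(butterfly)

# butterflycount is the package's dummy dataset: a list of monthly data
# frames, where each month appends new rows to the previous month's data
loupe(
  df_current = butterflycount$february,
  df_previous = butterflycount$january,
  datetime_variable = "time"
)
```

If previously published rows are unchanged, the new rows can be appended and published; otherwise the printed report shows exactly which historical values changed, so the maintainer can revise, re-version and communicate the change instead.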
From d8c866d05d48121adeb6161328950b9f61e129eb Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Wed, 6 Nov 2024 11:26:16 +0000 Subject: [PATCH 05/13] citation formatting --- vignettes/articles/butterfly_paper.Rmd | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/vignettes/articles/butterfly_paper.Rmd b/vignettes/articles/butterfly_paper.Rmd index 4e18a9c..f7e91ae 100644 --- a/vignettes/articles/butterfly_paper.Rmd +++ b/vignettes/articles/butterfly_paper.Rmd @@ -23,7 +23,7 @@ knitr::opts_chunk$set( ) ``` -#> left out bibliography: paper.bib from yaml +#\> left out bibliography: paper.bib from yaml # Summary @@ -33,7 +33,6 @@ But what if you are not aware of upstream changes to your input data? Monitoring The intention of butterfly is to check for changes in previously published data, and warn the user with a report that contains as much details as possible. This will allow them to stop unintended data transfer, revise their published data, release a new version and communicate the significance of the change to their users. - # Statement of Need Importance of citing exact extract of data [(Klump et al. 2021)](https://datascience.codata.org/articles/10.5334/dsj-2021-012) @@ -50,7 +49,7 @@ data quality frameworks clearly documented for users and available in machine-readable format -tools and methods +tools and methods ... for a FAIR implementation (Wilkinson et al. 2016). @@ -65,13 +64,14 @@ IceNet a sea ice prediction system based on deep learning (Andersson et al. 2021 ERA5-derived data. ## The issue with ERA5 and ERA5-Interim -This package was originally developed to deal with [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)'s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question. 
+ +This package was originally developed to deal with [ERA5](https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=documentation)'s initial release data, ERA5T. ERA5T data for a month is overwritten with the final ERA5 data two months after the month in question. Usually ERA5 and ERA5T are identical, but occasionally an issue with input data can (for example for [09/21 - 12/21](https://confluence.ecmwf.int/display/CKB/ERA5T+issue+in+snow+depth), and [07/24](https://forum.ecmwf.int/t/final-validated-era5-product-to-differ-from-era5t-in-july-2024/6685)) force a recalculation, meaning previously published data differs from the final product. In most cases, this is not an issue. For static data publications which are a snapshot in time, such as data associated with a specific paper, as in "Forecasts, neural networks, and results from the paper: 'Seasonal Arctic sea ice forecasting with probabilistic deep learning'" [Andersson & Hosking (2021)](https://data.bas.ac.uk/full-record.php?id=GB/NERC/BAS/PDC/01526)[@Andersson_2021] or time period as in "Downscaled ERA5 monthly precipitation data using Multi-Fidelity Gaussian Processes between 1980 and 2012 for the Upper Beas and Sutlej Basins, Himalaya" [Tazi (2023)](https://data.bas.ac.uk/full-record.php?id=GB/NERC/BAS/PDC/01769), this is not an issue. These datasets clearly describe a version and time period of ERA5 from which the data were derived, and will not be amended or updated in the future, even if ERA5 is recalculated. -In our case however we want to continually append ERA5-derived datasets **and** continually publish them. This would be useful when functioning as a data source for an environmental digital twin (Blair & Hnerys et al. 2023), or simply as input data into an environmental forecasting model which itself is frequently running. +In our case however we want to continually append ERA5-derived datasets **and** continually publish them. 
This would be useful when functioning as a data source for an environmental digital twin (Blair & Henrys 2023), or simply as input data into an environmental forecasting model which itself is frequently running.
 
 Continually appending **and** publishing will require strict quality assurance. If a published dataset is only appended, a DOI can be minted for it. However, if the previously published data change, this will then invalidate the DOI. For example, if you developed your code to find a better measure (more accurate, more precise) of the low pressure region, and wanted to reanalyse the previous data and republish.
@@ -91,18 +91,18 @@ Andersson, T., & Hosking, J. (2021). Forecasts, neural networks, and results fro
 
 Andersson, T.R., Hosking, J.S., Pérez-Ortiz, M. *et al.* Seasonal Arctic sea ice forecasting with probabilistic deep learning. *Nat Commun* **12**, 5124 (2021).
 
-Blair, Gordon S., and Peter A. Henrys. 2023. “The Role of Data Science in Environmental Digital Twins: In Praise of the Arrows.” Environmetrics 34 (January): Not available. https://doi.org/10.1002/env.2789.
+Blair, Gordon S., and Peter A. Henrys. 2023. “The Role of Data Science in Environmental Digital Twins: In Praise of the Arrows.” Environmetrics 34 (January): Not available. <https://doi.org/10.1002/env.2789>.
 
 Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A., Muñoz Sabater, J., Nicolas, J., Peubey, C., Radu, R., Rozum, I., Schepers, D., Simmons, A., Soci, C., Dee, D., Thépaut, J-N. (2023): ERA5 hourly data on single levels from 1940 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS), DOI: 10.24381/cds.adbb2d47
 
 Hosking, J. S., A. Orr, T. J. Bracegirdle, and J. Turner (2016), Future circulation changes off West Antarctica: Sensitivity of the Amundsen Sea Low to projected anthropogenic forcing, Geophys. Res. Lett., 43, 367–376, .
 
-Klump, J., Wyborn, L., Wu, M., Martin, J., Downs, R.R. and Asmi, A. (2021) ‘Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles’, Data Science Journal, 20(1), p. 12. Available at: https://doi.org/10.5334/dsj-2021-012.
+Klump, J., Wyborn, L., Wu, M., Martin, J., Downs, R.R. and Asmi, A. 
(2021) ‘Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles’, Data Science Journal, 20(1), p. 12. Available at: <https://doi.org/10.5334/dsj-2021-012>.

-Preston-Werner, T. 2013. Semantic Versioning 2.0.0. Semantic Versioning. Available at https://semver.org/spec/v2.0.0.html [Last accessed 28 October 2024].
+Preston-Werner, T. 2013. Semantic Versioning 2.0.0. Semantic Versioning. Available at <https://semver.org/spec/v2.0.0.html> [Last accessed 28 October 2024].

-Siddorn, John, Gordon Shaw Blair, David Boot, Justin James Henry Buck, Andrew Kingdon, et al. 2022. “An Information Management Framework for Environmental Digital Twins (IMFe).” Zenodo. https://doi.org/10.5281/ZENODO.7004351.
+Siddorn, John, Gordon Shaw Blair, David Boot, Justin James Henry Buck, Andrew Kingdon, et al. 2022. “An Information Management Framework for Environmental Digital Twins (IMFe).” Zenodo. <https://doi.org/10.5281/ZENODO.7004351>.

 Tazi, K. (2023). Downscaled ERA5 monthly precipitation data using Multi-Fidelity Gaussian Processes between 1980 and 2012 for the Upper Beas and Sutlej Basins, Himalayas (Version 1.0) [Data set]. NERC EDS UK Polar Data Centre.

-Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1). https://doi.org/10.1038/sdata.2016.18.
+Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1). <https://doi.org/10.1038/sdata.2016.18>. 
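The IMFe recommendation quoted earlier — provenance "clearly documented for users and available in machine-readable format" — can be sketched as a small sidecar record published alongside each release of the derived dataset. The field names and dates below are illustrative, not a standard; only the source DOI is taken from the reference list above:

```r
# Sketch: a minimal machine-readable provenance record for a derived
# dataset. Field names and the period/retrieval dates are invented.
provenance <- data.frame(
  source       = "ERA5 hourly data on single levels",
  source_doi   = "10.24381/cds.adbb2d47",   # Hersbach et al. (2023)
  period       = "1979-01-01/2024-06-30",
  retrieved_on = "2024-10-23",
  version      = "1.2.0"  # semantic versioning: MAJOR.MINOR.PATCH
)

# write.dcf() stores the record as plain-text key-value pairs,
# readable both by humans and by read.dcf() in downstream pipelines
write.dcf(provenance, "PROVENANCE.dcf")
```

Bumping the MAJOR component whenever previously published values change gives users an unambiguous signal that the historical record was revised.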
From 5fff204cfbb387f4a6288062c4725aa7c97d123b Mon Sep 17 00:00:00 2001
From: thomaszwagerman
Date: Wed, 6 Nov 2024 11:26:41 +0000
Subject: [PATCH 06/13] adding timeline continuity scripts

---
 R/continuous_timeline.R | 93 +++++++++++++++++++++++++++++++++++++++++
 R/timeline.R            | 63 ++++++++++++++++++++++++++++
 2 files changed, 156 insertions(+)
 create mode 100644 R/continuous_timeline.R
 create mode 100644 R/timeline.R

diff --git a/R/continuous_timeline.R b/R/continuous_timeline.R
new file mode 100644
index 0000000..93a015a
--- /dev/null
+++ b/R/continuous_timeline.R
@@ -0,0 +1,93 @@
+#' get_continuous_timelines: check if a timeseries is continuous
+#'
+#' A loupe is a simple, small magnification device used to examine small details
+#' more closely.
+#'
+#' @param df_current data.frame, the newest/current version of dataset x.
+#' @param datetime_variable string, the "datetime" variable that should be
+#' checked for continuity.
+#' @param expected_lag numeric, the acceptable difference between timesteps for
+#' a timeseries to be classed as continuous. Any difference greater than
+#' `expected_lag` will indicate a timeseries is not continuous. Default is 1.
+#' The smallest unit of measurement present in the column will be used. For
+#' example in a column formatted YYYY-MM, month will be used. In a column
+#' formatted YYYY-MM-DD, day will be used.
+#' @param direction character, is this timeseries ordered by ascending or by
+#' descending?
+#'
+#' @returns A data.frame, identical to `df_current`, but with extra columns:
+#' `timeline_group`, which assigns a number to each continuous set of
+#' data, and `timelag`, which specifies the time lags between rows.
+#'
+#' @examples
+#' # This example contains no differences with previous data
+#' # Our datetime column is formatted YYYY-MM-DD, and we expect an observation
+#' # every month, therefore our expected lag is 31 (days). 
+#' butterfly::group_timelines(
+#'   butterflycount$april,
+#'   datetime_variable = "time",
+#'   expected_lag = 31,
+#'   direction = "descending"
+#' )
+#'
+#' @export
+group_timelines <- function(
+    df_current,
+    datetime_variable,
+    expected_lag = 1,
+    direction = c("ascending", "descending")
+) {
+  stopifnot("`df_current` must be a data.frame" = is.data.frame(df_current))
+
+  # Check if `datetime_variable` is in `df_current`
+  if (!datetime_variable %in% names(df_current)) {
+    cli::cli_abort(
+      "`datetime_variable` must be present in `df_current`"
+    )
+  }
+  # A direction multiplier will allow checking of expected lag difference
+  # in both ascending and descending datasets, without reordering or changing
+  # the dataset itself
+  if (direction == "ascending") {
+    direction_multiplier <- 1
+  } else if (direction == "descending") {
+    direction_multiplier <- -1
+  }
+
+  # Check if datetime_variable can be used by lag
+  if (
+    inherits(
+      df_current[[datetime_variable]],
+      c("POSIXct", "POSIXlt", "POSIXt", "Date")
+    ) == FALSE
+  ) {
+    df_current[[datetime_variable]] <- as.POSIXlt(
+      df_current[[datetime_variable]]
+    )
+  }
+
+  # Obtain distinct sequences of continuous measurement, using the
+  # user-supplied datetime column rather than a hardcoded name
+  df_timeline <- df_current |>
+    dplyr::mutate(
+      timelag = (df_current[[datetime_variable]] -
+        dplyr::lag(df_current[[datetime_variable]], 1)) * direction_multiplier
+    ) |>
+    dplyr::mutate(
+      timeline_group1 = dplyr::case_when(
+        # Include negative timelag, for example if a sensor cpu shuts down
+        # It can return to its original date (e.g. 
1970-01-01 or when it was
+      # deployed)
+        is.na(timelag) |
+          timelag > expected_lag |
+          timelag < -expected_lag ~ 1,
+        TRUE ~ 2
+      )
+    ) |>
+    dplyr::mutate(
+      timeline_group = cumsum(timeline_group1 == 1)
+    ) |>
+    dplyr::select(
+      -timeline_group
+    )
+
+  return(df_timeline)
+}
diff --git a/R/timeline.R b/R/timeline.R
new file mode 100644
index 0000000..b4966b7
--- /dev/null
+++ b/R/timeline.R
@@ -0,0 +1,63 @@
+#' timeline: check if a timeseries is continuous
+#'
+#' A loupe is a simple, small magnification device used to examine small details
+#' more closely.
+#'
+#' @param df_current data.frame, the newest/current version of dataset x.
+#' @param datetime_variable string, the "datetime" variable that should be
+#' checked for continuity.
+#' @param expected_lag numeric, the acceptable difference between timestep for
+#' a timeseries to be classed as continuous. Any difference greater than
+#' `expected_lag` will indicate a timeseries is not continuous. Default is 1.
+#' The smallest units of measurement present in the column will be used. For
+#' example in a column formatted YYYY-MM, month will be used. In a column
+#' formatted YYYY-MM-DD day will be used.
+#' @param direction character, is this timeseries orderd by ascending or by
+#' descending?
+#'
+#' @seealso [timeline_group()]
+#'
+#' @returns A boolean, TRUE if the timeseries is continuous, and FALSE if there
+#' are more than one continuous timeseries within the dataset.
+#'
+#' @examples
+#' # This example contains no differences with previous data
+#' # Our datetime column is formatted YYYY-MM-DD, and we expect an observation
+#' # every month, therefore our expected lag is 31 (days). 
+#' butterfly::timeline(
+#'   butterflycount$april,
+#'   datetime_variable = "time",
+#'   expected_lag = 31,
+#'   direction = "descending"
+#' )
+#'
+#' @export
+timeline <- function(
+    df_current,
+    datetime_variable,
+    expected_lag = 1,
+    direction = c("ascending", "descending")
+) {
+
+  df_timelines <- group_timelines(
+    df_current,
+    datetime_variable,
+    expected_lag,
+    direction
+  )
+
+  if (length(unique(df_timelines$continuous_timeline)) < 1) {
+    is_continuous <- TRUE
+  } else if (length(unique(df_timelines$continuous_timeline)) > 1 ) {
+    is_continuous <- FALSE
+
+    cli::cat_bullet(
+      "There are time lags which are greater than the expected lag: ",
+      deparse(substitute(expected_lag)),
+      ". This indicates the timeseries is not continuous.",
+      bullet = "info",
+      col = "orange",
+      bullet_col = "orange"
+    )
+  }
+}
From bfba5db9f19cf435453beabbe5001f3839e311e5 Mon Sep 17 00:00:00 2001
From: thomaszwagerman
Date: Wed, 6 Nov 2024 11:57:49 +0000
Subject: [PATCH 07/13] rename grouping function, add descriptions

---
 ...ontinuous_timeline.R => group_timelines.R} | 18 ++++++--
 R/timeline.R                                  | 46 +++++++++++++++----
 2 files changed, 52 insertions(+), 12 deletions(-)
 rename R/{continuous_timeline.R => group_timelines.R} (82%)

diff --git a/R/continuous_timeline.R b/R/group_timelines.R
similarity index 82%
rename from R/continuous_timeline.R
rename to R/group_timelines.R
index 93a015a..4a2d8a7 100644
--- a/R/continuous_timeline.R
+++ b/R/group_timelines.R
@@ -1,7 +1,9 @@
-#' get_continuous_timelines: check if a timeseries is continuous
+#' group_timelines: check if a timeseries is continuous
 #'
-#' A loupe is a simple, small magnification device used to examine small details
-#' more closely. 
+#' If after using `timeline()` you have established a timeseries is not +#' continuous, or if you are working with data where you expect distinct +#' sequences or events, you can use `group_timelines()` to extract and +#' classify different distinct continuous chunks of your data. #' #' @param df_current data.frame, the newest/current version of dataset x. #' @param datetime_variable string, the "datetime" variable that should be @@ -38,6 +40,7 @@ group_timelines <- function( direction = c("ascending", "descending") ) { stopifnot("`df_current` must be a data.frame" = is.data.frame(df_current)) + stopifnot("`expected_lag` must be numeric" = is.numeric(expected_lag)) # Check if `datetime_variable` is in `df_current` if (!datetime_variable %in% names(df_current)) { @@ -45,6 +48,13 @@ group_timelines <- function( "`datetime_variable` must be present in `df_current`" ) } + + # Check if `direction` is in "ascending or descending" + if (!direction %in% c("ascending", "descending")) { + cli::cli_abort( + "`direction` must be one of 'ascending' or 'descending'" + ) + } # A direction multiplier will allow checking of expected lag difference # in both ascending and descending datasets, without reordering or changing # the dataset itself @@ -86,7 +96,7 @@ group_timelines <- function( timeline_group = cumsum(timeline_group1 == 1) ) |> dplyr::select( - -timeline_group + -timeline_group1 ) return(df_timeline) diff --git a/R/timeline.R b/R/timeline.R index b4966b7..63a5893 100644 --- a/R/timeline.R +++ b/R/timeline.R @@ -1,7 +1,17 @@ #' timeline: check if a timeseries is continuous #' -#' A loupe is a simple, small magnification device used to examine small details -#' more closely. +#' Check if a timeseries is continuous. Even if a timeseries does not contain +#' obvious gaps, this does not automatically mean it is also continuous. +#' +#' Measuring instruments can have different behaviours when they fail. 
For
+#' example, during power failure an internal clock could reset to "1970-01-01",
+#' or the manufacturing date (say, "2021-01-01"). This leads to unpredictable
+#' ways of checking if a dataset is continuous.
+#'
+#' The `group_timelines()` and `timeline()` functions attempt to give the user
+#' control over how to check for continuity by providing an `expected_lag`. The
+#' difference between timesteps in a dataset should not exceed the
+#' `expected_lag`.
 #'
 #' @param df_current data.frame, the newest/current version of dataset x.
 #' @param datetime_variable string, the "datetime" variable that should be
@@ -12,10 +22,10 @@
 #' The smallest units of measurement present in the column will be used. For
 #' example in a column formatted YYYY-MM, month will be used. In a column
 #' formatted YYYY-MM-DD day will be used.
-#' @param direction character, is this timeseries orderd by ascending or by
+#' @param direction character, is this timeseries ordered by ascending or by
 #' descending?
 #'
-#' @seealso [timeline_group()]
+#' @seealso [group_timelines()]
 #'
 #' @returns A boolean, TRUE if the timeseries is continuous, and FALSE if there
 #' are more than one continuous timeseries within the dataset.
@@ -46,18 +56,38 @@ timeline <- function(
     direction
   )
 
-  if (length(unique(df_timelines$continuous_timeline)) < 1) {
+  if (length(unique(df_timelines$timeline_group)) == 1) {
     is_continuous <- TRUE
+
+    cli::cat_bullet(
+      "There are no time lags which are greater than the expected lag: ",
+      deparse(substitute(expected_lag)),
+      " ",
+      units(df_timelines$timelag),
+      ". By this measure, the timeseries is continuous.",
+      bullet = "tick",
+      col = "green",
+      bullet_col = "green"
+    )
+
-  } else if (length(unique(df_timelines$continuous_timeline)) > 1 ) {
+  } else if (length(unique(df_timelines$timeline_group)) > 1) {
     is_continuous <- FALSE
 
     cli::cat_bullet(
       "There are time lags which are greater than the expected lag: ",
       deparse(substitute(expected_lag)),
-      ". 
This indicates the timeseries is not continuous.", + " ", + units(df_timelines$timelag), + ". This indicates the timeseries is not continuous. There are ", + length(unique(df_timelines$timeline_group)), + " distinct continuous sequences. Use `group_timelines()` to extract.", bullet = "info", col = "orange", bullet_col = "orange" - ) + ) } + + return(is_continuous) } + + From fd0236ff58fcb06689e519b4a83b06c3abf4e9c3 Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Wed, 6 Nov 2024 12:33:33 +0000 Subject: [PATCH 08/13] convert time column to date --- data/butterflycount.rda | Bin 367 -> 377 bytes 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/data/butterflycount.rda b/data/butterflycount.rda index a377d0e2c75df13def4d01e9afd078fe97ac81a5..9e6c81335ed5a275ecafb8013bb7162b96175bd3 100644 GIT binary patch literal 377 zcmV-<0fzoUT4*^jL0KkKS&Udm6aWFzf7bu=NB{vrIf!%^FknA--=M$%5CA{`2mnAI zumP5a3?&kJfXZbw&}13_&;g*(Hl9idl!Ta?Gy_Md000dD0D6K_r|6-fpwJlr0009) zplS2)9_%7o5G}}SS`}p2lM0ywIe{n!yHj0=C4g&R6=}&vkASKu4N*X3fg?_lgB6P) zRwqzpfqPIZRrHACSVtANn##f!ORl=?SM%?e(ZIn$kwn}IIJ|1c{!*AzgF>q;90K5o zg^ICIB9VYWT0{m?g%}hfNP-=<(G1xnk_*E?z0Bng$^}>hgiv{GDwwUt4{{6;3Dqc4 zb*HpS!`gsyJQUvN^qYJ(2_*~7cV8omSUfoNSn9K?TpuD_5CRv;n2xA@1V>O53BylF zlr5IH7!=4QjffhMeu=6Quy)YzBh_2Yvt!Rr|s!=Q?z0rSXnHCS?{hw6R4A6f!_ag(9OMFe3zt zl!`4RgdmD$VFmU;rnKOJ%6LEw47H9(U^fTHAWw)HN^B5J3=aY2PzWx;0|F!gm_$X0 zvdZ3Jas7?R@)q0yZC=)g=&xYnaHoyto&ndh1Eti;u-qxYpN;sSUA~?HaOna;XqQ-| zY3p!67{)2#P(o0nW9AkBc04~9K!mJT{NTZ6psuLryeP5JA*S$Pr~7bi6-!!>AOr*a NUC9*TLO~i{?`)#3nacnG From f6268b5b01a061d10119a3c63f07d876cde72728 Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Wed, 6 Nov 2024 12:38:45 +0000 Subject: [PATCH 09/13] add May to data, example of gap in time --- data/butterflycount.rda | Bin 377 -> 434 bytes 1 file changed, 0 insertions(+), 0 deletions(-) diff --git a/data/butterflycount.rda b/data/butterflycount.rda index 9e6c81335ed5a275ecafb8013bb7162b96175bd3..58e07cb4bf6088a413e8440005d73133a8ee6d56 100644 GIT 
binary patch literal 434 zcmV;j0ZslwT4*^jL0KkKS%hOAM*sp|fA;_LNB{vrIf!%^FknA--=M$%fdCKyKmZ5; zKp(&Xt^$-KlA|V%Nr5poK^V{pkjawHo!rrt_4C?%~@rBI-VFNTFA?qFsg<7Sj zO*GKnVVQw~xP%G;5GdQ3DCBfgH;!bun~pjbvf_b^3OkfYK}bdz5&;Q;0Ev=aQj{TL z1qb8HDpn#Q2q6O!QvBiNPvoW=1c3pq*_IS>lStPKdj>&x4k&LGu_rhIf8h|l#-OMEl7wF<8K$EiPss;KEURf36cB$m8rnrmr8JE>2P^>7(kbUR?i(0 z$p>g1cGwlbGV~O{9Z5Z&KX(e1;)gPjg9xd06g_4_JRqh!&S^3^Q_jyivUp8IWlo`X cJs>)7PhdT;#SUbsl~4G)k}1N3gd-U8IJDxbAOHXW literal 377 zcmV-<0fzoUT4*^jL0KkKS&Udm6aWFzf7bu=NB{vrIf!%^FknA--=M$%5CA{`2mnAI zumP5a3?&kJfXZbw&}13_&;g*(Hl9idl!Ta?Gy_Md000dD0D6K_r|6-fpwJlr0009) zplS2)9_%7o5G}}SS`}p2lM0ywIe{n!yHj0=C4g&R6=}&vkASKu4N*X3fg?_lgB6P) zRwqzpfqPIZRrHACSVtANn##f!ORl=?SM%?e(ZIn$kwn}IIJ|1c{!*AzgF>q;90K5o zg^ICIB9VYWT0{m?g%}hfNP-=<(G1xnk_*E?z0Bng$^}>hgiv{GDwwUt4{{6;3Dqc4 zb*HpS!`gsyJQUvN^qYJ(2_*~7cV8omSUfoNSn9K?TpuD_5CRv;n2xA@1V>O53BylF zlr5IH7!=4QjffhMeu=6Quy) Date: Wed, 6 Nov 2024 16:28:01 +0000 Subject: [PATCH 10/13] working tests for grouping method, simplified --- DESCRIPTION | 1 + NAMESPACE | 3 ++ R/data.R | 15 ++++++ R/group_timelines.R | 70 ++++++++++++-------------- R/timeline.R | 13 ++--- data/forestprecipitation.rda | Bin 0 -> 388 bytes man/forestprecipitation.Rd | 26 ++++++++++ man/group_timelines.Rd | 54 ++++++++++++++++++++ man/timeline.Rd | 54 ++++++++++++++++++++ tests/testthat/test-group_timelines.R | 54 ++++++++++++++++++++ tests/testthat/test-timeline.R | 3 ++ 11 files changed, 247 insertions(+), 46 deletions(-) create mode 100644 data/forestprecipitation.rda create mode 100644 man/forestprecipitation.Rd create mode 100644 man/group_timelines.Rd create mode 100644 man/timeline.Rd create mode 100644 tests/testthat/test-group_timelines.R create mode 100644 tests/testthat/test-timeline.R diff --git a/DESCRIPTION b/DESCRIPTION index af496fa..0b58373 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -13,6 +13,7 @@ Imports: cli, dplyr, lifecycle, + rlang, waldo Suggests: knitr, diff --git a/NAMESPACE b/NAMESPACE index 366c797..b0f2691 100644 
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -2,6 +2,9 @@
 
 export(catch)
 export(create_object_list)
+export(group_timelines)
 export(loupe)
 export(release)
+export(timeline)
 importFrom(lifecycle,deprecated)
+importFrom(rlang,.data)
diff --git a/R/data.R b/R/data.R
index 6000968..b079e8d 100644
--- a/R/data.R
+++ b/R/data.R
@@ -11,3 +11,18 @@
 #' ...
 #' }
 "butterflycount"
+
+#' Forest precipitation dummy data
+#'
+#' A completely fictional dataset of daily precipitation
+#'
+#' @format ## `forestprecipitation`
+#' A list with 2 dataframes (january, february) containing 2 columns,
+#' and 6 rows. February intentionally resets to 1970-01-01.
+#' \describe{
+#' \item{time}{The date on which the imaginary rainfall measurement took
+#' place, in yyyy-mm-dd format}
+#' \item{rainfall_mm}{Rainfall in mm}
+#' ...
+#' }
+"forestprecipitation"
diff --git a/R/group_timelines.R b/R/group_timelines.R
index 4a2d8a7..6c9d47e 100644
--- a/R/group_timelines.R
+++ b/R/group_timelines.R
@@ -5,39 +5,47 @@
 #' sequences or events, you can use `group_timelines()` to extract and
 #' classify different distinct continuous chunks of your data.
 #'
+#' We attempt to do this without sorting or changing the data, for a couple
+#' of reasons:
+#'
+#' 1. There are no differences in dates:
+#' Some instruments might record dates that appear identical,
+#' but are still in chronological order. For example, high-frequency data
+#' in fractional seconds. This is a rare use case though.
+#'
+#' 2. Dates are generally ascending/descending, but the instrument has
+#' returned to origin. Probably more common, and will result in a
+#' non-continuous dataset; however, the records are still in chronological
+#' order. This is something we would like to discover. This is accounted for
+#' in the logic in case_when().
+#'
 #' @param df_current data.frame, the newest/current version of dataset x.
 #' @param datetime_variable string, the "datetime" variable that should be
 #' checked for continuity.
#' @param expected_lag numeric, the acceptable difference between timesteps for
#' a timeseries to be classed as continuous. Any difference greater than
#' `expected_lag` will indicate a timeseries is not continuous. Default is 1.
-#' The smallest units of measurement present in the column will be used. For
-#' example in a column formatted YYYY-MM, month will be used. In a column
-#' formatted YYYY-MM-DD day will be used.
-#' @param direction character, is this timeseries orderd by ascending or by
-#' descending?
+#' The smallest units of measurement present in the column will be used. In a
+#' column formatted YYYY-MM-DD day will be used.
 #'
 #' @returns A data.frame, identical to `df_current`, but with extra columns
 #' `timeline_group`, which assigns a number to each continuous set of
 #' data and `timelag` which specifies the time lags between rows.
 #'
 #' @examples
-#' # This example contains no differences with previous data
-#' # Our datetime column is formatted YYYY-MM-DD, and we expect an observation
-#' # every month, therefore our expected lag is 31 (days). 
-#' butterfly::get_continuous_timelines( -#' butterflycount$april, +#' butterfly::group_timelines( +#' forestprecipitation$january, #' datetime_variable = "time", -#' expected_lag = 31 -#' direction = "descending" +#' expected_lag = 1 #' ) #' +#' @importFrom rlang .data +#' #' @export group_timelines <- function( df_current, datetime_variable, - expected_lag = 1, - direction = c("ascending", "descending") + expected_lag = 1 ) { stopifnot("`df_current` must be a data.frame" = is.data.frame(df_current)) stopifnot("`expected_lag` must be numeric" = is.numeric(expected_lag)) @@ -49,21 +57,6 @@ group_timelines <- function( ) } - # Check if `direction` is in "ascending or descending" - if (!direction %in% c("ascending", "descending")) { - cli::cli_abort( - "`direction` must be one of 'ascending' or 'descending'" - ) - } - # A direction multiplier will allow checking of expected lag difference - # in both ascending and descending datasets, without reordering or changing - # the dataset itself - if (direction == "ascending") { - direction_multiplier <- 1 - } else if (direction == "descending") { - direction_multiplier <- -1 - } - # Check if datetime_variable can be used by lag if ( inherits( @@ -71,32 +64,35 @@ group_timelines <- function( c("POSIXct", "POSIXlt", "POSIXt", "Date") ) == FALSE ) { - df_current[[datetime_variable]] <- as.POSIXlt( - df_current[[datetime_variable]] + cli::cli_abort( + "`datetime_variable` must be class of POSIXct, POSIXlt, POSIXt, Date" ) } # Obtain distinct sequences of continuous measurement df_timeline <- df_current |> dplyr::mutate( - timelag = (time - dplyr::lag(time, 1)) * direction_multiplier + timelag = ( + .data[[datetime_variable]] - dplyr::lag( + .data[[datetime_variable]], + 1 + ) + ) ) |> dplyr::mutate( timeline_group1 = dplyr::case_when( # Include negative timelag, for example if a sensor cpu shuts down # It can return to its original date (e.g. 
1970-01-01 or when it was # deployed) - is.na(timelag) | - timelag > expected_lag ~ 1 | - timelag < -expected_lag, + is.na(timelag) | timelag > expected_lag | timelag < -expected_lag ~ 1, TRUE ~ 2 ) ) |> dplyr::mutate( - timeline_group = cumsum(timeline_group1 == 1) + timeline_group = cumsum(.data$timeline_group1 == 1) ) |> dplyr::select( - -timeline_group1 + -"timeline_group1" ) return(df_timeline) diff --git a/R/timeline.R b/R/timeline.R index 63a5893..62951d8 100644 --- a/R/timeline.R +++ b/R/timeline.R @@ -22,8 +22,6 @@ #' The smallest units of measurement present in the column will be used. For #' example in a column formatted YYYY-MM, month will be used. In a column #' formatted YYYY-MM-DD day will be used. -#' @param direction character, is this timeseries ordered by ascending or by -#' descending? #' #' @seealso [group_timelines()] #' @@ -34,29 +32,26 @@ #' # This example contains no differences with previous data #' # Our datetime column is formatted YYYY-MM-DD, and we expect an observation #' # every month, therefore our expected lag is 31 (days). 
-#' butterfly::is_continuous_timelines( +#' butterfly::timeline( #' butterflycount$april, #' datetime_variable = "time", #' expected_lag = 31 -#' direction = "descending" #' ) #' #' @export timeline <- function( df_current, datetime_variable, - expected_lag = 1, - direction = c("ascending", "descending") + expected_lag = 1 ) { df_timelines <- group_timelines( df_current, datetime_variable, - expected_lag, - direction + expected_lag ) - if (length(unique(df_timelines$timeline_group)) < 1) { + if (length(unique(df_timelines$timeline_group)) == 1) { is_continuous <- TRUE cli::cat_bullet( diff --git a/data/forestprecipitation.rda b/data/forestprecipitation.rda new file mode 100644 index 0000000000000000000000000000000000000000..4e69395ad9abcdd7976599effec900dde4140cce GIT binary patch literal 388 zcmV-~0ek*JT4*^jL0KkKS(6@Mvj71MfB*mgNB}`W7=Yv;*g*d`-m*Xd05CuV1b{#W zBp^s2B@wUzTTB>AnIa=;gqbwcA&?C+Xfy`Y8aye8&r}+klisQWKn9us00000002m$ zC^n{zsK{by(?9@ef&er!7)kZ^wGSz<(|lpo)OfA$LA=stPEG3{Ac$ z@aRe0^gdW*1q}!zDG^IxFxJg0{2Uh~ff1&_0koBZ15%JvMNBzFNWmMa142ajLll5W zpEC#`hlCNL(i38=!d8tY3=a%yH4lVhp-@Hw6;_LQ;FHvv7{-3~Y=-tLUym!%T&cLh z5t*e4Ln!%zP|2E5n7_by!3d7$u|A8z;B-_w@V{R_=TxrOSROJ4SKQ{^8*~}J6Jksqjp^E ix3yv|gAHzTCQU@;aJ;Pt=!614i@744C`d_ Date: Wed, 6 Nov 2024 16:54:21 +0000 Subject: [PATCH 11/13] testing suite for new functions --- tests/testthat/test-group_timelines.R | 29 +++++++++++++++++++++ tests/testthat/test-timeline.R | 36 +++++++++++++++++++++++++-- 2 files changed, 63 insertions(+), 2 deletions(-) diff --git a/tests/testthat/test-group_timelines.R b/tests/testthat/test-group_timelines.R index 767aaa2..3ad220e 100644 --- a/tests/testthat/test-group_timelines.R +++ b/tests/testthat/test-group_timelines.R @@ -52,3 +52,32 @@ test_that("returns expected number of sequences", { 2 ) }) + +test_that("expected errors work", { + expect_error( + df_timelines <- butterfly::group_timelines( + forestprecipitation$january, + datetime_variable = "foo", + expected_lag = 1 + ), + 
"`datetime_variable` must be present in `df_current`" + ) + + df_timelines <- butterfly::group_timelines( + forestprecipitation$january, + datetime_variable = "time", + expected_lag = 1 + ) + + df_timelines$time <- as.character(df_timelines$time) + + expect_error( + df_timelines <- butterfly::group_timelines( + df_timelines, + datetime_variable = "time", + expected_lag = 1 + ), + "`datetime_variable` must be class of POSIXct, POSIXlt, POSIXt, Date" + ) + +}) diff --git a/tests/testthat/test-timeline.R b/tests/testthat/test-timeline.R index 8849056..e1a87fa 100644 --- a/tests/testthat/test-timeline.R +++ b/tests/testthat/test-timeline.R @@ -1,3 +1,35 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) +test_that("correct message is fed back", { + expect_output( + timeline( + forestprecipitation$january, + datetime_variable = "time", + expected_lag = 1 + ), + "There are no time lags which are greater than the expected lag" + ) + expect_output( + timeline( + forestprecipitation$february, + datetime_variable = "time", + expected_lag = 1 + ), + "There are time lags which are greater than the expected lag" + ) +}) + +test_that("correct message is fed back", { + expect_true( + timeline( + forestprecipitation$january, + datetime_variable = "time", + expected_lag = 1 + ) + ) + expect_false( + timeline( + forestprecipitation$february, + datetime_variable = "time", + expected_lag = 1 + ) + ) }) From 6d7eefaca86fafeda6ed1f404162bd59220bbc97 Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Wed, 6 Nov 2024 17:08:50 +0000 Subject: [PATCH 12/13] cleanup for merging --- NAMESPACE | 2 +- R/catch.R | 13 ++++--- R/create_object_list.R | 21 +++++++----- R/data.R | 6 ++-- R/loupe.R | 15 ++++---- R/timeline.R | 8 ++--- R/{group_timelines.R => timeline_group.R} | 8 ++--- codemeta.json | 34 ++++++++++++++++--- man/butterflycount.Rd | 6 ++-- man/catch.Rd | 11 +++--- man/create_object_list.Rd | 9 ++--- man/loupe.Rd | 15 ++++---- man/timeline.Rd | 4 +-- 
man/{group_timelines.Rd => timeline_group.Rd} | 14 ++++---- tests/testthat/test-create_object_list.R | 11 ++++++ ...roup_timelines.R => test-timeline_group.R} | 12 +++---- 16 files changed, 124 insertions(+), 65 deletions(-) rename R/{group_timelines.R => timeline_group.R} (94%) rename man/{group_timelines.Rd => timeline_group.Rd} (84%) rename tests/testthat/{test-group_timelines.R => test-timeline_group.R} (82%) diff --git a/NAMESPACE b/NAMESPACE index b0f2691..da68701 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -2,9 +2,9 @@ export(catch) export(create_object_list) -export(group_timelines) export(loupe) export(release) export(timeline) +export(timeline_group) importFrom(lifecycle,deprecated) importFrom(rlang,.data) diff --git a/R/catch.R b/R/catch.R index 24206f8..d3d8361 100644 --- a/R/catch.R +++ b/R/catch.R @@ -8,11 +8,14 @@ #' The underlying functionality is handled by `create_object_list()`. #' #' @param df_current data.frame, the newest/current version of dataset x. -#' @param df_previous data.frame, the old version of dataset, for example x - t1. -#' @param datetime_variable character, which variable to use as unique ID to join `df_current` and `df_previous`. Usually a "datetime" variable. -#' -#' @returns A dataframe which contains only rows of `df_current` that have changes from `df_previous`, but without new rows. -#' also returns a waldo object as in `loupe()`. +#' @param df_previous data.frame, the old version of dataset, +#' for example x - t1. +#' @param datetime_variable character, which variable to use as unique ID to +#' join `df_current` and `df_previous`. Usually a "datetime" variable. +#' +#' @returns A dataframe which contains only rows of `df_current` that have +#' changes from `df_previous`, but without new rows. Also returns a waldo +#' object as in `loupe()`. 
#'
#' @seealso [loupe()]
#' @seealso [create_object_list()]
diff --git a/R/create_object_list.R b/R/create_object_list.R
index 048b939..01ff598 100644
--- a/R/create_object_list.R
+++ b/R/create_object_list.R
@@ -10,12 +10,13 @@
 #' returns a `waldo::compare()` call to give a detailed breakdown of changes.
 #'
 #' The main assumption is that `df_current` and `df_previous` are newer and
-#' older versions of the same data, and that the `datetime_variable` variable name always
-#' remains the same. Elsewhere new columns can of appear, and these will be
-#' returned in the report.
+#' older versions of the same data, and that the `datetime_variable` variable
+#' name always remains the same. Elsewhere new columns can appear, and these
+#' will be returned in the report.
 #'
 #' @param df_current data.frame, the newest/current version of dataset x.
-#' @param df_previous data.frame, the old version of dataset, for example x - t1.
+#' @param df_previous data.frame, the old version of dataset,
+#' for example x - t1.
 #' @param datetime_variable string, which variable to use as unique ID to join
 #' `df_current` and `df_previous`. Usually a "datetime" variable. 
#' @@ -39,13 +40,17 @@ create_object_list <- function(df_current, df_previous, datetime_variable) { stopifnot("`df_previous` must be a data.frame" = is.data.frame(df_previous)) # Check if `datetime_variable` is in both `df_current` and `df_previous` - if (!datetime_variable %in% names(df_current) || !datetime_variable %in% names(df_previous)) { + if ( + !datetime_variable %in% names(df_current) + || + !datetime_variable %in% names(df_previous) + ) { cli::cli_abort( - "`datetime_variable` must be present in both `df_current` and `df_previous`" + "`datetime_variable` must be present in `df_current` and `df_previous`" ) } - # Initialise list to store objects used by `loupe()`, `catch()` and `release()` + # Initialise list used by `loupe()`, `catch()` and `release()` list_butterfly <- list( "waldo_object" = character(), "df_current_without_new_row" = data.frame(), @@ -81,7 +86,7 @@ create_object_list <- function(df_current, df_previous, datetime_variable) { deparse(substitute(df_current)), "' is your most recent data, and '", deparse(substitute(df_previous)), - "' is your previous data. If comparing like for like, try waldo::compare()." + "' is your previous data. If comparing directly, try waldo::compare()." 
)
  }
} else {
    # Tell the user which rows are new, regardless of previous data changing
diff --git a/R/data.R b/R/data.R
index b079e8d..7bb022d 100644
--- a/R/data.R
+++ b/R/data.R
@@ -3,9 +3,11 @@
 #' A completely fictional dataset of monthly butterfly counts
 #'
 #' @format ## `butterflycount`
-#' A list with 4 dataframes (january, february, march, april) containing 3 columns, and 3 + n_month rows:
+#' A list with 4 dataframes (january, february, march, april) containing
+#' 3 columns, and 3 + n_month rows:
 #' \describe{
-#' \item{time}{The date on which the imaginary count took place, in yyyy-mm-dd format}
+#' \item{time}{The date on which the imaginary count took place,
+#' in yyyy-mm-dd format}
 #' \item{count}{Number of fictional butterflies counted}
 #' \item{species}{Butterfly species name, only appears in april}
 #' ...
diff --git a/R/loupe.R b/R/loupe.R
index 5212893..9403762 100644
--- a/R/loupe.R
+++ b/R/loupe.R
@@ -14,17 +14,20 @@
 #' returns a `waldo::compare()` call to give a detailed breakdown of changes.
 #'
 #' The main assumption is that `df_current` and `df_previous` are newer and
-#' older versions of the same data, and that the `datetime_variable` variable name always
-#' remains the same. Elsewhere new columns can of appear, and these will be
-#' returned in the report.
+#' older versions of the same data, and that the `datetime_variable` variable
+#' name always remains the same. Elsewhere new columns can appear, and these
+#' will be returned in the report.
 #'
 #' The underlying functionality is handled by `create_object_list()`.
 #'
 #' @param df_current data.frame, the newest/current version of dataset x.
-#' @param df_previous data.frame, the old version of dataset, for example x - t1.
-#' @param datetime_variable string, which variable to use as unique ID to join `df_current` and `df_previous`. Usually a "datetime" variable.
+#' @param df_previous data.frame, the old version of dataset,
+#' for example x - t1. 
+#' @param datetime_variable string, which variable to use as unique ID to +#' join `df_current` and `df_previous`. Usually a "datetime" variable. #' -#' @returns A boolean where TRUE indicates no changes to previous data and FALSE indicates unexpected changes. +#' @returns A boolean where TRUE indicates no changes to previous data and +#' FALSE indicates unexpected changes. #' #' @seealso [create_object_list()] #' diff --git a/R/timeline.R b/R/timeline.R index 62951d8..b358e02 100644 --- a/R/timeline.R +++ b/R/timeline.R @@ -8,7 +8,7 @@ #' or the manufacturing date (say, "2021-01-01"). This leads to unpredictable #' ways of checking if a dataset is continuous. #' -#' The `group_timelines()` and `timeline()` functions attempt to give the user +#' The `timeline_group()` and `timeline()` functions attempt to give the user #' control over how to check for continuity by providing an `expected_lag`. The #' difference between timesteps in a dataset should not exceed the #' `expected_lag`. @@ -23,7 +23,7 @@ #' example in a column formatted YYYY-MM, month will be used. In a column #' formatted YYYY-MM-DD day will be used. #' -#' @seealso [group_timelines()] +#' @seealso [timeline_group()] #' #' @returns A boolean, TRUE if the timeseries is continuous, and FALSE if there #' are more than one continuous timeseries within the dataset. @@ -45,7 +45,7 @@ timeline <- function( expected_lag = 1 ) { - df_timelines <- group_timelines( + df_timelines <- timeline_group( df_current, datetime_variable, expected_lag @@ -75,7 +75,7 @@ timeline <- function( units(df_timelines$timelag), ". This indicates the timeseries is not continuous. There are ", length(unique(df_timelines$timeline_group)), - " distinct continuous sequences. Use `group_timelines()` to extract.", + " distinct continuous sequences. 
Use `timeline_group()` to extract.", bullet = "info", col = "orange", bullet_col = "orange" diff --git a/R/group_timelines.R b/R/timeline_group.R similarity index 94% rename from R/group_timelines.R rename to R/timeline_group.R index 6c9d47e..9102da4 100644 --- a/R/group_timelines.R +++ b/R/timeline_group.R @@ -1,8 +1,8 @@ -#' group_timelines: check if a timeseries is continuous +#' timeline_group: check if a timeseries is continuous #' #' If after using `timeline()` you have established a timeseries is not #' continuous, or if you are working with data where you expect distinct -#' sequences or events, you can use `group_timelines()` to extract and +#' sequences or events, you can use `timeline_group()` to extract and #' classify different distinct continuous chunks of your data. #' #' We attempt to do this without sorting, or changing the data for a couple @@ -33,7 +33,7 @@ #' data and `timelag` which specifies the time lags between rows. #' #' @examples -#' butterfly::group_timelines( +#' butterfly::timeline_group( #' forestprecipitation$january, #' datetime_variable = "time", #' expected_lag = 1 @@ -42,7 +42,7 @@ #' @importFrom rlang .data #' #' @export -group_timelines <- function( +timeline_group <- function( df_current, datetime_variable, expected_lag = 1 diff --git a/codemeta.json b/codemeta.json index 31830e1..7d1b0fb 100644 --- a/codemeta.json +++ b/codemeta.json @@ -13,7 +13,7 @@ "name": "R", "url": "https://r-project.org" }, - "runtimePlatform": "R version 4.4.1 (2024-06-14)", + "runtimePlatform": "R version 4.4.2 (2024-10-31)", "author": [ { "@type": "Person", @@ -109,6 +109,18 @@ "sameAs": "https://CRAN.R-project.org/package=lifecycle" }, "4": { + "@type": "SoftwareApplication", + "identifier": "rlang", + "name": "rlang", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=rlang" + 
},
+  "5": {
     "@type": "SoftwareApplication",
     "identifier": "waldo",
     "name": "waldo",
     "provider": {
       "@id": "https://cran.r-project.org",
       "@type": "Organization",
       "name": "Comprehensive R Archive Network (CRAN)",
       "url": "https://cran.r-project.org"
     },
     "sameAs": "https://CRAN.R-project.org/package=waldo"
   },
-  "5": {
+  "6": {
     "@type": "SoftwareApplication",
     "identifier": "R",
     "name": "R",
   },
   "SystemRequirements": null
 },
-  "fileSize": "337.771KB",
+  "fileSize": "416.405KB",
+  "citation": [
+    {
+      "@type": "CreativeWork",
+      "datePublished": "2024",
+      "author": [
+        {
+          "@type": "Person",
+          "givenName": "Thomas",
+          "familyName": "Zwagerman"
+        }
+      ],
+      "name": "{butterfly}: quality assurance of continually updating and overwritten time-series data"
+    }
+  ],
 "readme": "https://github.com/thomaszwagerman/butterfly/blob/main/README.md",
 "contIntegration": ["https://github.com/thomaszwagerman/butterfly/actions/workflows/R-CMD-check.yaml", "https://app.codecov.io/gh/thomaszwagerman/butterfly?branch=main"],
 "developmentStatus": "https://lifecycle.r-lib.org/articles/stages.html#experimental",
-  "keywords": ["qaqc", "timeseries"]
+  "keywords": ["qaqc", "timeseries", "r", "r-package", "rstats", "data-versioning", "verification"]
 }
diff --git a/man/butterflycount.Rd b/man/butterflycount.Rd
index 98d15c3..af88821 100644
--- a/man/butterflycount.Rd
+++ b/man/butterflycount.Rd
@@ -7,9 +7,11 @@
 \format{
 \subsection{\code{butterflycount}}{
 
-A list with 4 dataframes (january, february, march, april) containing 3 columns, and 3 + n_month rows:
+A list with 4 dataframes (january, february, march, april) containing
+3 columns, and 3 + n_month rows:
 \describe{
-\item{time}{The date on which the imaginary count took place, in yyyy-mm-dd format}
+\item{time}{The date on which the imaginary count took place,
+in yyyy-mm-dd format}
 \item{count}{Number of fictional butterflies counted}
 \item{species}{Butterfly species name, only appears in april}
 ...
diff --git a/man/catch.Rd b/man/catch.Rd index ee0ea5a..ac57845 100644 --- a/man/catch.Rd +++ b/man/catch.Rd @@ -9,13 +9,16 @@ catch(df_current, df_previous, datetime_variable) \arguments{ \item{df_current}{data.frame, the newest/current version of dataset x.} -\item{df_previous}{data.frame, the old version of dataset, for example x - t1.} +\item{df_previous}{data.frame, the old version of dataset, +for example x - t1.} -\item{datetime_variable}{character, which variable to use as unique ID to join \code{df_current} and \code{df_previous}. Usually a "datetime" variable.} +\item{datetime_variable}{character, which variable to use as unique ID to +join \code{df_current} and \code{df_previous}. Usually a "datetime" variable.} } \value{ -A dataframe which contains only rows of \code{df_current} that have changes from \code{df_previous}, but without new rows. -also returns a waldo object as in \code{loupe()}. +A dataframe which contains only rows of \code{df_current} that have +changes from \code{df_previous}, but without new rows. Also returns a waldo +object as in \code{loupe()}. } \description{ This function matches two dataframe objects by their unique identifier diff --git a/man/create_object_list.Rd b/man/create_object_list.Rd index d1a6fa0..1a21539 100644 --- a/man/create_object_list.Rd +++ b/man/create_object_list.Rd @@ -9,7 +9,8 @@ create_object_list(df_current, df_previous, datetime_variable) \arguments{ \item{df_current}{data.frame, the newest/current version of dataset x.} -\item{df_previous}{data.frame, the old version of dataset, for example x - t1.} +\item{df_previous}{data.frame, the old version of dataset, +for example x - t1.} \item{datetime_variable}{string, which variable to use as unique ID to join \code{df_current} and \code{df_previous}. Usually a "datetime" variable.} @@ -31,9 +32,9 @@ It informs the user of new (unmatched) rows which have appeared, and then returns a \code{waldo::compare()} call to give a detailed breakdown of changes. 
The main assumption is that \code{df_current} and \code{df_previous} are newer and
-older versions of the same data, and that the \code{datetime_variable} variable name always
-remains the same. Elsewhere new columns can of appear, and these will be
-returned in the report.
+older versions of the same data, and that the \code{datetime_variable} variable
+name always remains the same. Elsewhere new columns can appear, and these
+will be returned in the report.
}
\examples{
butterfly_object_list <- butterfly::create_object_list(
diff --git a/man/loupe.Rd b/man/loupe.Rd
index cb59b2e..7273099 100644
--- a/man/loupe.Rd
+++ b/man/loupe.Rd
@@ -9,12 +9,15 @@ loupe(df_current, df_previous, datetime_variable)
\arguments{
\item{df_current}{data.frame, the newest/current version of dataset x.}

-\item{df_previous}{data.frame, the old version of dataset, for example x - t1.}
+\item{df_previous}{data.frame, the old version of dataset,
+for example x - t1.}

-\item{datetime_variable}{string, which variable to use as unique ID to join \code{df_current} and \code{df_previous}. Usually a "datetime" variable.}
+\item{datetime_variable}{string, which variable to use as unique ID to
+join \code{df_current} and \code{df_previous}. Usually a "datetime" variable.}
}
\value{
-A boolean where TRUE indicates no changes to previous data and FALSE indicates unexpected changes.
+A boolean where TRUE indicates no changes to previous data and
+FALSE indicates unexpected changes.
}
\description{
A loupe is a simple, small magnification device used to examine small details
@@ -32,9 +35,9 @@ It informs the user of new (unmatched) rows which have appeared, and then
returns a \code{waldo::compare()} call to give a detailed breakdown of changes.

The main assumption is that \code{df_current} and \code{df_previous} are newer and
-older versions of the same data, and that the \code{datetime_variable} variable name always
-remains the same. 
Elsewhere new columns can of appear, and these will be
-returned in the report.
+older versions of the same data, and that the \code{datetime_variable} variable
+name always remains the same. Elsewhere new columns can appear, and these
+will be returned in the report.

The underlying functionality is handled by \code{create_object_list()}.
}
diff --git a/man/timeline.Rd b/man/timeline.Rd
index 7b0cd7c..00a9b33 100644
--- a/man/timeline.Rd
+++ b/man/timeline.Rd
@@ -33,7 +33,7 @@
example, during power failure an internal clock could reset to "1970-01-01",
or the manufacturing date (say, "2021-01-01"). This leads to unpredictable
ways of checking if a dataset is continuous.

-The \code{group_timelines()} and \code{timeline()} functions attempt to give the user
+The \code{timeline_group()} and \code{timeline()} functions attempt to give the user
control over how to check for continuity by providing an \code{expected_lag}. The
difference between timesteps in a dataset should not exceed the
\code{expected_lag}. 
@@ -50,5 +50,5 @@ butterfly::timeline( } \seealso{ -\code{\link[=group_timelines]{group_timelines()}} +\code{\link[=timeline_group]{timeline_group()}} } diff --git a/man/group_timelines.Rd b/man/timeline_group.Rd similarity index 84% rename from man/group_timelines.Rd rename to man/timeline_group.Rd index 5d3733c..268383a 100644 --- a/man/group_timelines.Rd +++ b/man/timeline_group.Rd @@ -1,10 +1,10 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/group_timelines.R -\name{group_timelines} -\alias{group_timelines} -\title{group_timelines: check if a timeseries is continuous} +% Please edit documentation in R/timeline_group.R +\name{timeline_group} +\alias{timeline_group} +\title{timeline_group: check if a timeseries is continuous} \usage{ -group_timelines(df_current, datetime_variable, expected_lag = 1) +timeline_group(df_current, datetime_variable, expected_lag = 1) } \arguments{ \item{df_current}{data.frame, the newest/current version of dataset x.} @@ -26,7 +26,7 @@ data and \code{timelag} which specifies the time lags between rows. \description{ If after using \code{timeline()} you have established a timeseries is not continuous, or if you are working with data where you expect distinct -sequences or events, you can use \code{group_timelines()} to extract and +sequences or events, you can use \code{timeline_group()} to extract and classify different distinct continuous chunks of your data. } \details{ @@ -45,7 +45,7 @@ logic in case_when(). 
} } \examples{ -butterfly::group_timelines( +butterfly::timeline_group( forestprecipitation$january, datetime_variable = "time", expected_lag = 1 diff --git a/tests/testthat/test-create_object_list.R b/tests/testthat/test-create_object_list.R index 3523b13..b063875 100644 --- a/tests/testthat/test-create_object_list.R +++ b/tests/testthat/test-create_object_list.R @@ -17,6 +17,17 @@ test_that("error when no new rows", { ) }) +test_that("error when no datetime_variable not present in both dfs", { + expect_error( + create_object_list( + butterflycount$january, + butterflycount$february, + datetime_variable = "foo" + ), + "`datetime_variable` must be present in `df_current` and `df_previous`" + ) +}) + test_that("correct message is fed back", { expect_output( create_object_list( diff --git a/tests/testthat/test-group_timelines.R b/tests/testthat/test-timeline_group.R similarity index 82% rename from tests/testthat/test-group_timelines.R rename to tests/testthat/test-timeline_group.R index 3ad220e..75ff648 100644 --- a/tests/testthat/test-group_timelines.R +++ b/tests/testthat/test-timeline_group.R @@ -1,5 +1,5 @@ test_that("returns dataframe", { - df_timelines <- butterfly::group_timelines( + df_timelines <- butterfly::timeline_group( forestprecipitation$january, datetime_variable = "time", expected_lag = 1 @@ -22,7 +22,7 @@ test_that("returns dataframe", { }) test_that("returns expected number of sequences", { - df_timelines <- butterfly::group_timelines( + df_timelines <- butterfly::timeline_group( forestprecipitation$january, datetime_variable = "time", expected_lag = 1 @@ -37,7 +37,7 @@ test_that("returns expected number of sequences", { 1 ) - df_reset <- butterfly::group_timelines( + df_reset <- butterfly::timeline_group( forestprecipitation$february, datetime_variable = "time", expected_lag = 1 @@ -55,7 +55,7 @@ test_that("returns expected number of sequences", { test_that("expected errors work", { expect_error( - df_timelines <- butterfly::group_timelines( + 
df_timelines <- butterfly::timeline_group( forestprecipitation$january, datetime_variable = "foo", expected_lag = 1 @@ -63,7 +63,7 @@ test_that("expected errors work", { "`datetime_variable` must be present in `df_current`" ) - df_timelines <- butterfly::group_timelines( + df_timelines <- butterfly::timeline_group( forestprecipitation$january, datetime_variable = "time", expected_lag = 1 @@ -72,7 +72,7 @@ test_that("expected errors work", { df_timelines$time <- as.character(df_timelines$time) expect_error( - df_timelines <- butterfly::group_timelines( + df_timelines <- butterfly::timeline_group( df_timelines, datetime_variable = "time", expected_lag = 1 From 6129790647625a65a32fb8b6c811a05e42de8993 Mon Sep 17 00:00:00 2001 From: thomaszwagerman Date: Wed, 6 Nov 2024 17:38:10 +0000 Subject: [PATCH 13/13] readme and vignette updates for timeline functions --- R/timeline.R | 26 +++++++++--------- R/timeline_group.R | 7 ++++- README.Rmd | 3 +++ README.md | 6 +++++ man/timeline.Rd | 21 +++++++++------ man/timeline_group.Rd | 7 ++++- vignettes/butterfly.Rmd | 59 ++++++++++++++++++++++++++++++++++++++--- 7 files changed, 102 insertions(+), 27 deletions(-) diff --git a/R/timeline.R b/R/timeline.R index b358e02..f702cfe 100644 --- a/R/timeline.R +++ b/R/timeline.R @@ -13,15 +13,7 @@ #' difference between timesteps in a dataset should not exceed the #' `expected_lag`. #' -#' @param df_current data.frame, the newest/current version of dataset x. -#' @param datetime_variable string, the "datetime" variable that should be -#' checked for continuity. -#' @param expected_lag numeric, the acceptable difference between timestep for -#' a timeseries to be classed as continuous. Any difference greater than -#' `expected_lag` will indicate a timeseries is not continuous. Default is 1. -#' The smallest units of measurement present in the column will be used. For -#' example in a column formatted YYYY-MM, month will be used. In a column -#' formatted YYYY-MM-DD day will be used. 
+#' @inheritParams timeline_group
 #'
 #' @seealso [timeline_group()]
 #'
@@ -29,13 +21,19 @@
 #' are more than one continuous timeseries within the dataset.
 #'
 #' @examples
-#' # This example contains no differences with previous data
-#' # Our datetime column is formatted YYYY-MM-DD, and we expect an observation
-#' # every month, therefore our expected lag is 31 (days).
+#' # A nice continuous dataset should return TRUE
 #' butterfly::timeline(
-#'   butterflycount$april,
+#'   forestprecipitation$january,
 #'   datetime_variable = "time",
-#'   expected_lag = 31
+#'   expected_lag = 1
+#' )
+#'
+#' # In February, our imaginary rain gauge's onboard computer had a failure.
+#' # The timestamp was reset to 1970-01-01
+#' butterfly::timeline(
+#'   forestprecipitation$february,
+#'   datetime_variable = "time",
+#'   expected_lag = 1
 #' )
 #'
 #' @export
diff --git a/R/timeline_group.R b/R/timeline_group.R
index 9102da4..6b650da 100644
--- a/R/timeline_group.R
+++ b/R/timeline_group.R
@@ -33,8 +33,13 @@
 #' data and `timelag` which specifies the time lags between rows.
 #'
 #' @examples
+#' # In January, our imaginary rain gauge produced a continuous timeseries.
+#' # In February, its onboard computer had a failure and the
+#' # timestamp was reset to 1970-01-01
+#'
+#' # We want to group these distinct continuous sequences:
 #' butterfly::timeline_group(
-#'   forestprecipitation$january,
+#'   forestprecipitation$february,
 #'   datetime_variable = "time",
 #'   expected_lag = 1
 #' )
diff --git a/README.Rmd b/README.Rmd
index 1d26a64..f5fc07c 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -52,7 +52,10 @@ The butterfly package contains the following:
 * `butterfly::catch()` - returns rows which contain previously changed values in a dataframe.
 * `butterfly::release()` - drops rows which contain previously changed values, and returns a dataframe containing new and unchanged rows.
 * `butterfly::create_object_list()` - returns a list of objects required by all of `loupe()`, `catch()` and `release()`. 
Contains underlying functionality.
+* `butterfly::timeline()` - check if a timeseries is continuous between timesteps.
+* `butterfly::timeline_group()` - group distinct, but continuous sequences of a timeseries.
 * `butterflycount` - a list of monthly dataframes, which contain fictional butterfly counts for a given date.
+* `forestprecipitation` - a list of monthly dataframes, which contain fictional daily precipitation measurements for a given date.
 
 ## Examples
 
diff --git a/README.md b/README.md
index 22c7936..ee53193 100644
--- a/README.md
+++ b/README.md
@@ -67,8 +67,14 @@ The butterfly package contains the following:
 - `butterfly::create_object_list()` - returns a list of objects required
   by all of `loupe()`, `catch()` and `release()`. Contains underlying
   functionality.
+- `butterfly::timeline()` - check if a timeseries is continuous between
+  timesteps.
+- `butterfly::timeline_group()` - group distinct, but continuous
+  sequences of a timeseries.
 - `butterflycount` - a list of monthly dataframes, which contain
   fictional butterfly counts for a given date.
+- `forestprecipitation` - a list of monthly dataframes, which contain
+  fictional daily precipitation measurements for a given date.
 
 ## Examples
 
diff --git a/man/timeline.Rd b/man/timeline.Rd
index 00a9b33..79ed1e5 100644
--- a/man/timeline.Rd
+++ b/man/timeline.Rd
@@ -15,9 +15,8 @@ checked for continuity.}
 \item{expected_lag}{numeric, the acceptable difference between timestep for
 a timeseries to be classed as continuous. Any difference greater than
 \code{expected_lag} will indicate a timeseries is not continuous. Default is 1.
-The smallest units of measurement present in the column will be used. For
-example in a column formatted YYYY-MM, month will be used. In a column
-formatted YYYY-MM-DD day will be used.}
+The smallest units of measurement present in the column will be used. 
In a
+column formatted YYYY-MM-DD day will be used.}
 }
 \value{
 A boolean, TRUE if the timeseries is continuous, and FALSE if there
@@ -39,13 +38,19 @@ difference between timesteps in a dataset should not exceed the
 \code{expected_lag}.
 }
 \examples{
-# This example contains no differences with previous data
-# Our datetime column is formatted YYYY-MM-DD, and we expect an observation
-# every month, therefore our expected lag is 31 (days).
+# A nice continuous dataset should return TRUE
 butterfly::timeline(
-  butterflycount$april,
+  forestprecipitation$january,
   datetime_variable = "time",
-  expected_lag = 31
+  expected_lag = 1
+)
+
+# In February, our imaginary rain gauge's onboard computer had a failure.
+# The timestamp was reset to 1970-01-01
+butterfly::timeline(
+  forestprecipitation$february,
+  datetime_variable = "time",
+  expected_lag = 1
 )
 }
diff --git a/man/timeline_group.Rd b/man/timeline_group.Rd
index 268383a..e41715d 100644
--- a/man/timeline_group.Rd
+++ b/man/timeline_group.Rd
@@ -45,8 +45,13 @@ logic in case_when().
 }
 }
 \examples{
+# In January, our imaginary rain gauge produced a continuous timeseries.
+# In February, its onboard computer had a failure and the
+# timestamp was reset to 1970-01-01
+
+# We want to group these distinct continuous sequences:
 butterfly::timeline_group(
-  forestprecipitation$january,
+  forestprecipitation$february,
   datetime_variable = "time",
   expected_lag = 1
 )
diff --git a/vignettes/butterfly.Rmd b/vignettes/butterfly.Rmd
index eb434d1..5a4ce31 100644
--- a/vignettes/butterfly.Rmd
+++ b/vignettes/butterfly.Rmd
@@ -31,7 +31,7 @@ butterflycount
 This dataset is entirely fictional, and merely included to aid demonstrating
 butterfly's functionality.
 
-## Examining datasets: loupe()
+## Examining datasets: `loupe()`
 
 We can use `butterfly::loupe()` to examine in detail whether previous values
 have changed.
@@ -70,7 +70,7 @@ butterfly::loupe(
 
 Call `?waldo::compare()` to see the full list of arguments. 
-## Extracting unexpected changes: catch()
+## Extracting unexpected changes: `catch()`
 
 You might want to return changed rows as a dataframe. For this
 `butterfly::catch()` is provided.
@@ -86,7 +86,7 @@ df_caught <- butterfly::catch(
 
 df_caught
 ```
 
-## Dropping unexpecrted changes: release()
+## Dropping unexpected changes: `release()`
 
 Conversely, `butterfly::release()` drops all rows which had changed from the
 previous version. Note it retains new rows, as these were expected.
@@ -114,6 +114,59 @@ df_release_without_new
 
 ```
 
+## Checking for continuity: `timeline()`
+To check if a timeseries is continuous, `timeline()` and `timeline_group()` are
+provided. Even if a timeseries does not contain obvious gaps, this does not
+automatically mean it is also continuous.
+
+Measuring instruments can have different behaviours when they fail. For
+example, during power failure an internal clock could reset to "1970-01-01",
+or the manufacturing date (say, "2021-01-01"). This means there is no single
+predictable way to check whether a dataset is continuous.
+
+To check if a timeseries is continuous:
+
+```{r check_continuity}
+butterfly::timeline(
+  forestprecipitation$january,
+  datetime_variable = "time",
+  expected_lag = 1
+  )
+```
+
+The above is a nice continuous dataset, where there is no more than a difference
+of 1 day between timesteps.
+
+However, in February our imaginary rain gauge's onboard computer had a failure.
+ +The timestamp was reset to 1970-01-01: + +```{r not_continuous} +forestprecipitation$february + +butterfly::timeline( + forestprecipitation$february, + datetime_variable = "time", + expected_lag = 1 + ) +``` + +## Grouping distinct continuous sequences: `timeline_group()` + +If we wanted to group chunks of our timeseries that are distinct, or broken up +in some way, but still continuous, we can use `timeline_group()`: + +```{r timeline_group} +butterfly::timeline_group( + forestprecipitation$february, + datetime_variable = "time", + expected_lag = 1 + ) +``` + +We now have groups 1 & 2, which are both continuous sets of data, but there is +no continuity between them. + ## Using `butterfly` in a data processing pipeline If you would like to know more about using `butterfly` in an operational data processing pipeline, please refer to the article on [using `butterfly` in an operational pipeline](https://thomaszwagerman.github.io/butterfly/articles/butterfly_in_pipeline.html).
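
Reviewer's note, not part of the patch series above: the behaviour these patches document — `timeline()`'s continuity check and `timeline_group()`'s grouping of distinct continuous sequences — comes down to comparing consecutive time lags against `expected_lag` and cumulatively counting the breaks. Below is a minimal base-R sketch of that idea, using a made-up five-row data frame mimicking the vignette's February clock-reset scenario. The column names, toy values, and break condition are illustrative assumptions, not the package's actual `case_when()` logic.

```r
# Toy data: three continuous January days, then the gauge's clock
# resets to the Unix epoch mid-series (as in the vignette's February).
df <- data.frame(
  time = as.Date(c(
    "2024-01-01", "2024-01-02", "2024-01-03", # continuous sequence 1
    "1970-01-01", "1970-01-02"                # after clock reset: sequence 2
  )),
  rain_mm = c(1.2, 0.0, 3.4, 0.5, 0.7)        # hypothetical measurements
)
expected_lag <- 1 # days, since the column's smallest unit is a day

# timeline()-like check: continuous if every lag equals the expected lag
timelag <- diff(as.numeric(df$time))
is_continuous <- all(timelag == expected_lag)

# timeline_group()-like grouping: start a new group at every break,
# i.e. wherever the lag exceeds expected_lag or goes backwards (a reset)
breaks <- c(FALSE, timelag > expected_lag | timelag < 0)
df$timeline_group <- cumsum(breaks) + 1

is_continuous     # FALSE: the reset breaks continuity
df$timeline_group # 1 1 1 2 2: two distinct continuous sequences
```

The real functions validate `datetime_variable` and return richer output (per the docs above, `timeline_group()` returns the data.frame with `timeline_group` and `timelag` columns); this sketch keeps only the core arithmetic.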