title | tags | authors | affiliations | date | bibliography
---|---|---|---|---|---
pyDisagg: A Python Package for Data Disaggregation | | | | 02.23.2024 | paper.bib
Data sources report aggregated data for many reasons. Aggregated data may assuage privacy concerns, be cheaper to tabulate, and be easier to release. In global health applications, many data sources report data that is aggregated across age (for example, all-age prevalence of a disease), across sex (both-sex incidence of a disease), and across location (national estimates of mortality from a particular disease).
When building processing workflows for analyzing and modeling the data, the user must either incorporate the aggregation mechanism into the model or somehow disaggregate the data. While the former option is feasible in some contexts, the latter option significantly simplifies processing and modeling in large-scale analyses. Many data sources used by the Institute for Health Metrics and Evaluation report aggregated data, which is then split using additional information (for example, into age-specific and sex-specific bins) for further processing.
The pyDisagg package implements simple methods for splitting data into any set of categories. Given
- an aggregated observation and its uncertainty (for example, all-age prevalence with associated uncertainty),
- the set of categories that were "aggregated" by the datapoint (e.g. which age categories were aggregated to get the prevalence),
- the frequency pattern most relevant to the datapoint (e.g. the age distribution of the study or location relevant to the datapoint), and
- the global pattern for the observable of interest (e.g. global prevalence by age),

the pyDisagg package produces split estimates in the specified bins for further processing.
Nearly all groups within the Institute for Health Metrics and Evaluation (IHME) currently use some form of splitting, with age- and sex-splitting being the most common. Typical assumptions are that
- split datapoints should follow the global pattern up to a multiplicative constant
- uncertainty is propagated by draws, i.e. performing the computation for different realizations of the datapoint and the global pattern (if uncertain).
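
The two assumptions above can be illustrated with a minimal sketch in plain Python. This is not the pyDisagg API: the function names `split_proportional` and `split_by_draws`, the normal sampling of the observation, and all numbers are assumptions made for illustration only. The split rates are taken to be the global pattern times a single constant, chosen so that the split counts add up to the aggregated observation, and uncertainty is propagated by repeating the computation over draws of the observation.

```python
import random


def split_proportional(observed_total, populations, global_rates):
    """Split an aggregated count so that the split rates equal the global
    pattern times one multiplicative constant (hypothetical helper)."""
    if len(populations) != len(global_rates):
        raise ValueError("populations and global_rates must have equal length")
    # Counts expected in each bin if the global pattern applied unchanged.
    expected = [p * r for p, r in zip(populations, global_rates)]
    expected_total = sum(expected)
    if expected_total <= 0:
        raise ValueError("the global pattern implies no events")
    # Single constant that makes the split counts sum to the observation.
    scale = observed_total / expected_total
    split_rates = [scale * r for r in global_rates]
    split_counts = [scale * e for e in expected]
    return scale, split_rates, split_counts


def split_by_draws(mean, se, populations, global_rates, n_draws=1000, seed=0):
    """Propagate uncertainty by draws: sample the observation (assumed
    normal here) and split each realization separately."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        total = rng.gauss(mean, se)
        _, _, counts = split_proportional(total, populations, global_rates)
        draws.append(counts)
    return draws


# Made-up inputs: three age bins, a global rate pattern, one all-age count.
populations = [1000.0, 2000.0, 1000.0]
global_rates = [0.01, 0.02, 0.04]
observed_total = 45.0

scale, split_rates, split_counts = split_proportional(
    observed_total, populations, global_rates
)
draws = split_by_draws(observed_total, 5.0, populations, global_rates, n_draws=200)
print(scale, split_counts)
```

In this sketch only the observation is sampled. If the global pattern is also uncertain, each draw would additionally use a sampled realization of `global_rates`.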
The pyDisagg package provides this functionality, along with additional technical features:
- splitting of fundamentally bounded quantities, such as prevalence, which has to lie between 0 and 1
- draw-free uncertainty propagation using the multivariate delta method
- a guarantee that uncertainty estimates are consistent with the bounds of bounded quantities
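
To see why bounded quantities need special care, consider a hedged sketch that applies the shift in log-odds space rather than multiplying rates directly. It is an illustration under stated assumptions, not the package's implementation: the helper names (`logit`, `expit`, `split_bounded`), the bisection solver, and the example numbers are all hypothetical. A single constant is added to the log-odds of every global rate and is chosen so that the implied count matches the aggregated observation, which keeps every split prevalence strictly between 0 and 1.

```python
import math


def logit(x):
    return math.log(x / (1.0 - x))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def split_bounded(observed_total, populations, global_rates, tol=1e-12):
    """Find beta such that sum_i p_i * expit(logit(f_i) + beta) equals the
    observed total, using bisection (hypothetical helper)."""
    if not all(0.0 < f < 1.0 for f in global_rates):
        raise ValueError("global rates must lie strictly between 0 and 1")
    if not 0.0 < observed_total < sum(populations):
        raise ValueError("observed total must be compatible with the bounds")

    def implied_total(beta):
        return sum(
            p * expit(logit(f) + beta) for p, f in zip(populations, global_rates)
        )

    # implied_total is increasing in beta, so bisection on a wide bracket works.
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if implied_total(mid) < observed_total:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    rates = [expit(logit(f) + beta) for f in global_rates]
    return beta, rates


# Made-up example: overall prevalence 0.85 in two equally sized groups.
populations = [1000.0, 1000.0]
global_rates = [0.4, 0.8]
observed_total = 1700.0

beta, bounded_rates = split_bounded(observed_total, populations, global_rates)

# For comparison, the naive rate-multiplicative split of the same datapoint.
naive_scale = observed_total / sum(p * f for p, f in zip(populations, global_rates))
naive_rates = [naive_scale * f for f in global_rates]
print(bounded_rates, naive_rates)
```

With these inputs the naive multiplicative split pushes the second group's prevalence above 1, while the log-odds version stays inside the unit interval and still reproduces the aggregated count.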
Let
Mathematically, in the simpler rate multiplicative model, we find
When
When
Currently, the multiplicative-in-rate model RateMultiplicativeModel with
The pyDisagg package uses the multivariate delta method to propagate uncertainty. Given a variance for the observation and for the global rate, the package produces asymptotically valid uncertainty intervals for the multiplicative factors within the transforms
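
As a rough illustration of draw-free propagation, the sketch below applies the delta method to the log of the multiplicative factor in the rate multiplicative setting. It is a simplified example, not the package's code: the function names, the independence assumption between the observation and the entries of the global pattern, and the 1.96 normal quantile are all assumptions. Writing the factor as the observed total divided by the total implied by the global pattern, the gradient of its logarithm is 1/D with respect to the observation D and -p_i/S with respect to each global rate f_i, where S is the sum of p_i f_i.

```python
import math


def log_factor_with_se(observed_total, observed_se, populations,
                       global_rates, global_rate_ses):
    """Delta-method standard error for beta = log(D / sum_i p_i f_i),
    assuming D and the f_i are mutually independent (an assumption)."""
    expected_total = sum(p * f for p, f in zip(populations, global_rates))
    beta = math.log(observed_total / expected_total)
    # Squared gradient entries times the corresponding variances.
    var_from_observation = (observed_se / observed_total) ** 2
    var_from_pattern = sum(
        (p * s / expected_total) ** 2
        for p, s in zip(populations, global_rate_ses)
    )
    return beta, math.sqrt(var_from_observation + var_from_pattern)


def factor_interval(beta, se_beta, z=1.96):
    """Interval built on the log scale and mapped back, so it stays positive."""
    return math.exp(beta - z * se_beta), math.exp(beta + z * se_beta)


# Made-up inputs.
populations = [1000.0, 2000.0, 1000.0]
global_rates = [0.01, 0.02, 0.04]
global_rate_ses = [0.001, 0.002, 0.004]

beta, se_beta = log_factor_with_se(45.0, 4.5, populations, global_rates, global_rate_ses)
lower, upper = factor_interval(beta, se_beta)

# Same datapoint with the global pattern treated as known exactly.
_, se_beta_fixed = log_factor_with_se(45.0, 4.5, populations, global_rates, [0.0] * 3)
print(math.exp(beta), (lower, upper))
```

Building the interval on the transformed scale and mapping it back is what keeps the bounds of the interval consistent with the bounds of the quantity itself.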
Age- and sex-splitting is currently widely used to support ongoing work for the Global Burden of Disease (GBD) study. For example, the GBD capstones on Diseases and Injuries [@vos2020global], Risk Factors [@murray2020global], and Antimicrobial Resistance [@murray2022global] all heavily use age- and sex-splitting.