Merge pull request #245 from National-COVID-Cohort-Collaborative/rwdb…
…lurbs

RWD callouts
wibeasley authored Sep 13, 2024
2 parents 293ae4e + f3d03c8 commit e4a819b
Showing 12 changed files with 206 additions and 6 deletions.
16 changes: 16 additions & 0 deletions chapters/access.md
@@ -55,6 +55,15 @@ such as obituary-based mortality records and viral variant sequencing informatio
These are available alongside only Level 3 data as an optional add-on;
we'll discuss PPRL in more detail below.

::: {.callout-tip}

## Real-World-Data Tip

To preserve the ethical use of unconsented data for public health benefit, every attempt should be made to obscure the identities of institutions, communities, or individuals contributing data, as shown here.
Access to EHR-derived data should minimize privacy risk and never provide more access to identifiers than the research question requires.
:::


### Level 3, Limited Data Set (LDS) {#sec-access-background-l3}

Level 3, or LDS data is the most complete and protected
@@ -203,6 +212,13 @@ but cannot share data or files across them,
thus it is impossible for a researcher with access to Level 3 data to share it with their colleagues
in another project with Level 2 data.

::: {.callout-tip}

## Real-World-Data Tip

Shared workspaces facilitating secure, collaborative data access are not unique to N3C amongst cloud-hosted, centralized RWD environments. These technologies, coupled with appropriate [governance](governance.md) and [team-science](onboarding.md) principles, can facilitate successful research projects in the face of data complexity.
:::

## The DUR - Data Use Request {#sec-access-dur}

### Project Roles and DUR Types {#sec-access-dur-roles}
15 changes: 15 additions & 0 deletions chapters/cycle.md
@@ -211,6 +211,13 @@ OMOP domain mapping typically involves the creation of mapping tables that trans
The OMOP vocabulary dictates which source code should be placed in which target domain after it is translated into OMOP concepts.
This vocabulary transformation using the OMOP concept relationships arranges the data into a well-organized and consistent format that can be easily analyzed and queried.

::: {.callout-tip}

## Real-World-Data Tip

Given the diversity of systems and medical vocabularies used in healthcare, vocabulary mapping is a crucial but challenging and time-consuming part of preparing RWD for analysis. While technologies like Common Data Models (CDMs) and versioned medical vocabularies help, they are not universal solutions. The NIH and NCATS are currently engaged in a [Code Map Services](https://aspe.hhs.gov/code-map-services-interoperability-common-data-models-0) project to address this need broadly.
:::
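
The mapping step described above can be sketched in a few lines. This is a minimal illustration, not the N3C pipeline: the table `SOURCE_TO_STANDARD`, the function `map_records`, and the concept ids are all hypothetical placeholders. It shows how a mapping table routes each source code to a target domain and why unmapped codes need to be tracked rather than silently dropped.

```python
# Hypothetical mapping table: (source vocabulary, source code) -> (concept id, target domain).
# The concept ids are placeholders for illustration, not real OMOP identifiers.
SOURCE_TO_STANDARD = {
    ("ICD10CM", "I10"): (1001, "Condition"),
    ("LOINC", "8480-6"): (2001, "Measurement"),
}


def map_records(records):
    """Group records by target domain; collect records with no known mapping."""
    mapped = {}
    unmapped = []
    for rec in records:
        target = SOURCE_TO_STANDARD.get((rec["vocabulary"], rec["code"]))
        if target is None:
            # Keep unmapped rows visible so the data loss can be measured.
            unmapped.append(rec)
            continue
        concept_id, domain = target
        mapped.setdefault(domain, []).append({**rec, "concept_id": concept_id})
    return mapped, unmapped


records = [
    {"person_id": 1, "vocabulary": "ICD10CM", "code": "I10"},
    {"person_id": 1, "vocabulary": "LOINC", "code": "8480-6"},
    {"person_id": 2, "vocabulary": "LOCAL", "code": "BP_SYS"},  # site-specific code
]
mapped, unmapped = map_records(records)
```

Returning the unmapped rows alongside the mapped ones is a deliberate choice in this sketch: the share of unmapped codes per site is itself a useful data quality signal.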

### N3C Global ID generation for all primary key fields {#sec-cycle-overview-globalid}

The incoming data sets submitted to N3C may or may not include their own primary keys.
@@ -358,6 +365,14 @@ The scorecards allow sites to be directly involved in the data quality improveme
Additionally, the scorecards allow the DI&H team to monitor and maintain data quality across subsequent N3C data submissions and prevent any regression on those metrics.
If the scorecards reveal that a released site is no longer passing key data quality metrics, then the site is unreleased until they are able to remediate their data quality issues.

::: {.callout-tip}

## Real-World-Data Tip

Unit harmonization, data quality checks, and site scorecards are all crucial ways to ensure data quality when working with RWD from multiple institutions.
Having these elements in place provides quality insights, cross-site imputation of missing units of measurement, and the opportunity for local data administrators to observe, investigate, and communicate any anomalies that might go undetected without cross-site comparisons.
:::
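
As a rough illustration of unit harmonization and cross-site unit inference, the sketch below converts body weights to kilograms and guesses a missing unit by comparing each candidate conversion against the median of rows that did report a unit. The function `harmonize_weights` and the closest-to-median rule are assumptions made for this example; they are a simplification and not the method N3C uses.

```python
from statistics import median

# Conversion factors to a canonical unit (kilograms) for body weight.
TO_KG = {"kg": 1.0, "lb": 0.45359237}


def harmonize_weights(rows):
    """Convert weights to kilograms and infer a unit where none was reported."""
    # Reference value built only from rows whose unit is known.
    labeled = [r["value"] * TO_KG[r["unit"]] for r in rows if r["unit"] is not None]
    reference = median(labeled)
    harmonized = []
    for r in rows:
        unit, inferred = r["unit"], False
        if unit is None:
            # Pick the unit whose converted value lies closest to the reference.
            unit = min(TO_KG, key=lambda u: abs(r["value"] * TO_KG[u] - reference))
            inferred = True
        harmonized.append(
            {**r, "unit": unit, "value_kg": r["value"] * TO_KG[unit], "unit_inferred": inferred}
        )
    return harmonized


rows = [
    {"site": "A", "value": 70.0, "unit": "kg"},
    {"site": "A", "value": 82.0, "unit": "kg"},
    {"site": "B", "value": 154.0, "unit": "lb"},
    {"site": "B", "value": 180.0, "unit": "lb"},
    {"site": "C", "value": 176.0, "unit": None},  # unit missing at this site
    {"site": "C", "value": 68.0, "unit": None},
]
harmonized = harmonize_weights(rows)
```

The `unit_inferred` flag keeps imputed units distinguishable from reported ones, so an analyst can run a sensitivity analysis that excludes them.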

## N3C Data Releases {#sec-cycle-releases}

### De-Identification and release to LDS (L3) {#sec-cycle-releases-lds}
29 changes: 29 additions & 0 deletions chapters/governance.md
@@ -70,6 +70,13 @@ Also, data cannot be extracted or downloaded from the data N3C Data Enclave, wit
The workstream split into a smaller subgroup to draft the supporting governance documents while continuing to meet with the whole Workstream for ideation, context, and feedback weekly.
Onward, the workstream adjusted the frequency of the meetings to the work cadence.

::: {.callout-tip}

## Real-World-Data Tip

A governing body (with stakeholder community representation) helps to ensure ethical contribution to, and use of, sensitive patient data. Clear communication between the governing body and researcher community is instrumental to the success of any central database of RWD.
:::

## N3C Governance Bridges Individual Oversight Responsibilities {#sec-governance-bridges}

The N3C Governance is a set of behavioral norms, policies, and procedures supported by technology/security measures and oversight mechanisms.
@@ -113,6 +120,14 @@ Seven principles summarize the Community Guiding Principles:

![N3C shared Governance initiatives with sign-off responsibility represented.](images/governance/fig-governance-010-shared.png){#fig-governance-010-shared fig-alt="N3C shared Governance initiatives with sign-off responsibility represented"}

::: {.callout-tip}

## Real-World-Data Tip

Abiding by the Community Guiding Principles of partnership, inclusivity, transparency, reciprocity, accountability, security, and mutual respect is important for both the governance team and researchers accessing the data.
This facilitates clear communication and understanding of roles and responsibilities when handling data from across multiple sources.
:::

## Ethical oversight {#sec-governance-ethics}

The ethical oversight of N3C is two-fold.
@@ -142,6 +157,13 @@ Again, separating the DTA and LHBA provides the flexibility to contribute data w
A separate [Data Use Agreement](https://ncats.nih.gov/files/NCATS_N3C_Data_Use_Agreement.pdf) (DUA) must be executed between NCATS and signing officials from an institution whose investigators wish to access N3C data.
To improve efficiency, instead of executing a traditional pair-wise agreement each time a researcher needs access to N3C, a single DUA is executed between NCATS and an organization to render individual researchers eligible to request access to N3C content.

::: {.callout-tip}

## Real-World-Data Tip

Creating separate institution-level agreements for Data Transfer and Data Use helps accelerate agreement execution and ensures that institutions can participate at the level they feel comfortable with.
:::

## Data Access by Researchers {#sec-governance-access}

First-time users wishing to access N3C must verify that their institution has executed a DUA with NCATS.
@@ -169,6 +191,13 @@ during account registration.
Eligible users can then submit their Data Use Request for evaluation by the Data Access Committee.
A new request must be submitted for each specific project.](images//governance/fig-governance-030-process.png){#fig-governance-030-process fig-alt="Data Access Governance Process."}

::: {.callout-tip}

## Real-World-Data Tip

Well-defined protocols for how users gain access or report data access incidents support the continued cooperation of data providers, which in turn sustains the program and enables researchers to work with such a large collection of harmonized data.
:::

## Incident Notification and Escalation procedure {#sec-governance-incident}

It is essential to create an Incident Notification Policy to ensure that the right people are notified about incidents at the right time and that problems can be addressed rapidly.
12 changes: 6 additions & 6 deletions chapters/intro.md
@@ -74,10 +74,8 @@ Two overlapping teams of EHR data experts participate in this process: one works

## Real-World-Data Tip

When EHR is mapped to a research-appropriate Common Data Model (CDM),
analysts have the opportunity to write more concise code that can be rerun on other EHR data that is represented in this same CDM.
While translating data between CDMs is possible and facilitates interoperability and reproducibility,
RWD analysts must take into account the impact of each transformation step in terms of potential data loss or data restructuring.
When EHR is mapped to a research-appropriate Common Data Model (CDM), analysts have the opportunity to write more concise code that can be rerun on other EHR data that is represented in this same CDM.
While translating data between CDMs is possible and facilitates interoperability and reproducibility, RWD analysts must take into account the impact of each transformation step in terms of potential data loss or data restructuring.
:::

## The N3C Data Enclave and Data Access {#sec-intro-enclave}
@@ -94,7 +92,8 @@ Accessing data with full patient zip codes, for example, requires obtaining appr

## Real-World-Data Tip

Regardless of data source or platform, when working with RWD that is derived from patient records, the researcher must have been legally granted access via binding contracts and user agreements, permission from Institutional Review Boards who oversee human subject rights, a workspace that meets the appropriate security requirements, and permission from data stewards who manage the specific dataset. Note that HIPAA laws stipulate that the minimum amount of identifiable data be shared to enable the particular research project.
Regardless of data source or platform, when working with RWD that is derived from patient records, the researcher must have been legally granted access via binding contracts and user agreements, permission from Institutional Review Boards who oversee human subject rights, a workspace that meets the appropriate security requirements, and permission from data stewards who manage the specific dataset.
Note that HIPAA laws stipulate that the minimum amount of identifiable data be shared to enable the particular research project.
:::

Because effective analysis of EHR data requires a diverse set of skills-especially clinical and data science/statistical expertise-N3C provides organizational structures and resources to rapidly create and support multidisciplinary research teams, many of which are geographically diverse as well.
@@ -136,7 +135,8 @@ Other data are available as well, including publicly-available datasets (e.g., f

## Real-World-Data Tip

Collecting multi-center data centrally allows RWD researchers to identify novel associations by collaboratively building, testing, and refining algorithmic classifiers once the various sources of patient data have been harmonized and connected in a way that can create a comprehensive dataset for each individual’s life course. Having access to row-level data from a variety of sites supports detailed investigation of variances across sites.
Collecting multi-center data centrally allows RWD researchers to identify novel associations by collaboratively building, testing, and refining algorithmic classifiers once the various sources of patient data have been harmonized and connected in a way that can create a comprehensive dataset for each individual’s life course.
Having access to row-level data from a variety of sites supports detailed investigation of variances across sites.
:::

Big data is of little value without powerful analysis tools.
8 changes: 8 additions & 0 deletions chapters/ml.md
@@ -105,6 +105,14 @@ Tools are available to help assess feature importance (e.g., SHAP values [-@lund
- **Adequate Documentation** -- Clearly annotated code, with explicit characterization of methodology and techniques that are employed, as well as descriptions of all key steps in a pipeline, including hyperparameter choice/search, appropriate train/test splits, etc., are critical to reproducibility of research.
  In addition, the N3C Enclave provides a feature called Protocol Pad, which can explain exactly how a study was conducted in N3C.

::: {.callout-tip}

## Real-World-Data Tip

Machine learning may be particularly beneficial when considering the gaps and missingness in RWD.
Some machine learning approaches, such as positive-unlabeled learning, can help differentiate high-confidence negative cases from unlabeled positive cases.
:::

## ML in N3C {#sec-ml-in-n3c}

Developing ML pipelines in the N3C platform is similar to developing code to solve other tasks, e.g., statistical analyses.
18 changes: 18 additions & 0 deletions chapters/onboarding.md
@@ -211,6 +211,14 @@ The original Data Use Agreement is set to expire in early 2025. Researchers wish
information and instructions can be found [here](https://covid.cd2h.org/covid-extension/).
:::

::: {.callout-tip}

## Real-World-Data Tip

Data Use Agreements are another common component of RWD research, ensuring that those accessing the data will take appropriate precautions to protect patient privacy.
Anyone with data access must understand and follow the requirements of the DUA or risk outcomes ranging from loss of access to legal or civil penalties.
:::

## Research Project Teams {#sec-onboarding-team}

### Project Lead vs Collaborators {#sec-onboarding-team-lead}
@@ -294,6 +302,16 @@ These include:
* _Infrastructure that underpins team science_.
In today's scientific landscape, it is not uncommon for researchers representing institutions from across the globe to collaborate on research studies.
To do so effectively, these collaborative teams need reliable internet connectivity with equally reliable cybersecurity policies and support and space for local as well as distant meetings (e.g., Zoom, Teams) during all stages of the life of the study.

::: {.callout-tip}

## Real-World-Data Tip

Diverse research teams are crucial when working with RWD.
Informaticians can help understand various limitations that may depend on the context of data collection and processing.
The inclusion of clinicians on the team ensures that assumptions match clinical care and that clinical concepts are captured with the appropriate level of sensitivity and specificity, and it facilitates the translational components of research that enhance patient care or policy.
The inclusion of biostatisticians and data scientists on a team ensures that EHR data, inevitably incomplete and a non-representative sample of the population at large, is modeled appropriately.
:::

## Domain Teams {#sec-onboarding-dts}

36 changes: 36 additions & 0 deletions chapters/practices.md
@@ -87,6 +87,24 @@ Early on, the resulting Applicable Data Methods and Standards
group established a [number of principles](https://docs.google.com/document/d/1FZkHOKCC89qr4TM2voLuXQZpT-riCxUeU0-la48r4HU/edit#heading=h.9ymy4s8eihpu),
that have since been refined by cross-collaboration with other groups.

::: {.callout-tip}

## Real-World-Data Tip

EHR data is generated through the course of clinical care at a single institution and thus always comes with caveats.
When using this type of data, one should remember that it is not a representative sample.
Key elements to consider include:

- under-representation of healthy individuals
- catchment area and demographics of the patients of a particular health system
- level of access to care
- hospital specialization area
- coding and documentation practices at various stages in the healthcare system
- missing data, including events captured by other institutions not contributing data
- missing clinical history and events not reported or not recorded
- variability of quality related to local data encoding and data mapping
:::

### Goals {#sec-practices-overview-goals}

Our goals were to:
@@ -401,6 +419,14 @@ We list that caveat, along with others, below:
* Questions regarding COVID negative vs. COVID positives and co-morbidities (or other covariates)
that are associated with the factors used to bring data into the Enclave, (i.e., age, sex, race, and ethnicity).

::: {.callout-tip}

## Real-World-Data Tip

Though ICU status can be tough to identify outright in many EHR datasets, formulas that combine adjacent inpatient codes (such as [macrovisits](understanding.md#sec-understanding-ehr-visit) in N3C) with ICU-specific medications, labs, procedures, and other interventions can serve as a proxy for pinpointing these visits.
RWD researchers often have to use a combination of features such as those mentioned here to infer the presence of an event.
:::
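
A proxy of this kind can be sketched as a simple rule that counts how many categories of ICU-related evidence appear on an inpatient visit. Everything named here is hypothetical: the marker sets in `ICU_MARKERS`, the function `likely_icu`, and the `min_signals` threshold are illustrative choices, not a validated clinical definition or an N3C algorithm.

```python
# Hypothetical marker sets, for illustration only; not a validated clinical definition.
ICU_MARKERS = {
    "medications": {"norepinephrine", "vasopressin"},
    "procedures": {"mechanical_ventilation", "ecmo"},
    "labs": {"arterial_blood_gas"},
}


def likely_icu(visit, min_signals=2):
    """Flag an inpatient visit as a probable ICU stay when enough marker categories are present."""
    if not visit.get("inpatient", False):
        return False
    # Count categories (not individual codes) that contain at least one marker.
    signals = sum(
        bool(markers & set(visit.get(category, ())))
        for category, markers in ICU_MARKERS.items()
    )
    return signals >= min_signals


visits = [
    {"visit_id": 1, "inpatient": True,
     "medications": ["norepinephrine"], "procedures": ["mechanical_ventilation"]},
    {"visit_id": 2, "inpatient": True,
     "medications": ["acetaminophen"], "labs": ["arterial_blood_gas"]},
    {"visit_id": 3, "inpatient": False,
     "medications": ["norepinephrine"], "procedures": ["ecmo"]},
]
flags = {v["visit_id"]: likely_icu(v) for v in visits}
```

Requiring evidence from more than one category trades sensitivity for specificity; a study team would normally tune that threshold with clinical input and check it against any sites that do record ICU status directly.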

**Special considerations**.
There are other questions that may _potentially_ be answerable in N3C, depending on whether the required considerations are compatible with the research question of interest.

@@ -1037,6 +1063,16 @@ Enabling such repeated analyses means automating a long chain of computing steps

: STaRT-RWE sensitivity analysis elements, based on Wang and colleagues [-@wang_2021]. {#tbl-practices-start-sensitivity tbl-colwidths="[60, 40]"}

::: {.callout-tip}

## Real-World-Data Tip

While a positive value may be found, mapped, documented, and used in analysis, there is a much more ambiguous range of possibilities when it comes to a negative value.
For example, a diagnosis of hypertension might be coded and present in the data.
However, the absence of that code could be the result of information that is recorded at another institution, a question that was never asked, a condition that was noted but not documented, or a code that was lost in the data mapping pipelines at local institutions.
Thus, most features should be interpreted as “positive” or “unlabeled” findings, and RWD researchers might consider applying computable phenotypes, data density thresholds, or other approaches to identify high-confidence “negative” patients within the set of “unlabeled” patients.
:::
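
One way to picture the positive/unlabeled distinction is a labeling rule that treats the absence of a code as a negative only when the patient's record is dense enough to trust. The sketch below assumes hypothetical field names (`condition_codes`, `visit_count`, `years_observed`) and arbitrary thresholds; real data density criteria would need to be justified for the study at hand.

```python
def label_patient(patient, condition_code, min_visits=5, min_years_observed=2.0):
    """Return 'positive', 'high_confidence_negative', or 'unlabeled' for one condition."""
    if condition_code in patient["condition_codes"]:
        return "positive"
    # Absence of the code only counts as a negative when the record is dense.
    dense = (
        patient["visit_count"] >= min_visits
        and patient["years_observed"] >= min_years_observed
    )
    return "high_confidence_negative" if dense else "unlabeled"


patients = [
    {"person_id": 1, "condition_codes": {"I10"}, "visit_count": 2, "years_observed": 0.5},
    {"person_id": 2, "condition_codes": set(), "visit_count": 12, "years_observed": 4.0},
    {"person_id": 3, "condition_codes": set(), "visit_count": 1, "years_observed": 0.1},
]
# "I10" is the ICD-10-CM code for essential hypertension.
labels = {p["person_id"]: label_patient(p, "I10") for p in patients}
```

Patients left as "unlabeled" are not discarded by this rule; they can be excluded from a comparison group or handled with methods designed for positive and unlabeled data.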

## Protocol Completion ❸ {#sec-practices-completion}

### Gather Results; Publish {#sec-practices-completion-publish}
8 changes: 8 additions & 0 deletions chapters/publishing.md
@@ -190,6 +190,14 @@ In fact, because downstream results are only updated when they are explicitly ru
Why is pinning to a release helpful? Because the default `master` branch is continuously being updated, analysis results based on it will change over time along with the underlying data (if they are re-run).
This becomes cumbersome when writing about results!

::: {.callout-tip}

## Real-World-Data Tip

Since harmonization efforts and vocabulary updates come with downstream consequences and re-compute needs, pinning an analysis to a specific release of data for all inputs is often helpful to ensure you’re not mixing new and old data.
This applies to all RWD analyses that have to consider the time window of data collection for each source of data and apply methods to filter joined datasets to appropriate time windows.
:::
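
The idea of pinning every input to one release and restricting joined data to a shared time window can be sketched as follows. The release names, the `RELEASES` metadata table, and the functions `common_window` and `filter_events` are hypothetical and exist only for this example; they do not describe how the Enclave stores release information.

```python
from datetime import date

# Hypothetical release metadata: data collection window for each source in a release.
RELEASES = {
    "release_a": {
        "ehr": (date(2020, 1, 1), date(2023, 3, 31)),
        "mortality": (date(2020, 1, 1), date(2022, 12, 31)),
    },
}


def common_window(release_id):
    """Return the time window covered by every source in the pinned release."""
    windows = list(RELEASES[release_id].values())
    start = max(w[0] for w in windows)
    end = min(w[1] for w in windows)
    if start > end:
        raise ValueError("sources in this release do not overlap in time")
    return start, end


def filter_events(events, release_id):
    """Keep events from the pinned release that fall inside the shared window."""
    start, end = common_window(release_id)
    return [
        e for e in events
        if e["release"] == release_id and start <= e["date"] <= end
    ]


events = [
    {"person_id": 1, "release": "release_a", "date": date(2022, 6, 1)},
    {"person_id": 2, "release": "release_a", "date": date(2023, 2, 1)},  # after mortality data ends
    {"person_id": 3, "release": "release_b", "date": date(2022, 6, 1)},  # different release
]
kept = filter_events(events, "release_a")
```

Using the intersection of the source windows is a conservative choice: it avoids counting a patient as alive simply because the mortality source stopped collecting earlier than the EHR source.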

### Download Request Process {#sec-publishing-tech-process}

All research results derived from N3C data-including summary tables, figures, and logs-must be reviewed to ensure they don't inadvertently leak any patient-level data.
16 changes: 16 additions & 0 deletions chapters/story.md
@@ -203,6 +203,14 @@ Some trends we have noticed are:
If your team needs someone, consider asking a relevant [domain team](onboarding.md#sec-onboarding-dt) for help identifying and approaching potential collaborators.
Note that community-wide data and logic liaisons are available for consultation during regular office hours.^[See @sec-support-liaisons.]

::: {.callout-tip}

## Real-World-Data Tip

To properly address a research question with RWD, it is necessary to have an interdisciplinary team that can understand the various forms and sources of data, the context in which that data was collected, and the clinical relevance of any possible associations being noted.
:::


## Initial Meeting {#sec-story-meeting}

:::{.callout-note icon=false}
@@ -490,6 +498,14 @@ These tools inform decisions such as dropping specific sites or variables from t
: Scurvy Analytic Dataset {#tbl-story-analytic}
</div>

::: {.callout-tip}

## Real-World-Data Tip

When preparing RWD for analysis, it is important to remember that multiple levels of data processing and filtration are an implicit part of any RWD study cohort, with each level being impacted by data collection and entry practices, data models, and medical vocabularies in use.
:::


## Analyses {#sec-story-analyses}

:::{.callout-note icon=false}