From f3d03c85634a0c51478415c7172f7978bd9d4c86 Mon Sep 17 00:00:00 2001 From: Shawn T O'Neil Date: Wed, 11 Sep 2024 08:30:44 -0700 Subject: [PATCH] Added blurb text to chapter markdowns --- chapters/access.md | 16 ++++++++++++++++ chapters/cycle.md | 15 +++++++++++++++ chapters/governance.md | 29 +++++++++++++++++++++++++++++ chapters/intro.md | 12 ++++++------ chapters/ml.md | 8 ++++++++ chapters/onboarding.md | 18 ++++++++++++++++++ chapters/practices.md | 36 ++++++++++++++++++++++++++++++++++++ chapters/publishing.md | 8 ++++++++ chapters/story.md | 16 ++++++++++++++++ chapters/support.md | 8 ++++++++ chapters/tools.md | 18 ++++++++++++++++++ chapters/understanding.md | 28 ++++++++++++++++++++++++++++ 12 files changed, 206 insertions(+), 6 deletions(-) diff --git a/chapters/access.md b/chapters/access.md index 61c146e1..14dc37ca 100644 --- a/chapters/access.md +++ b/chapters/access.md @@ -55,6 +55,15 @@ such as obituary-based mortality records and viral variant sequencing informatio These are available alongside only Level 3 data as an optional add-on; we'll discuss PPRL in more detail below. +::: {.callout-tip} + +## Real-World-Data Tip + +In order to preserve the ethical use of unconsented data for public health benefit, every effort should be made to obscure the identities of institutions, communities, or individuals contributing data, as shown here. +Access to EHR-derived data should minimize privacy risk and never provide more access to identifiers than is required for the research question. +::: + + ### Level 3, Limited Data Set (LDS) {#sec-access-background-l3} Level 3, or LDS data is the most complete and protected @@ -203,6 +212,13 @@ but cannot share data or files across them, thus it is impossible for a researcher with access to Level 3 data to share it with their colleagues in another project with Level 2 data. 
+::: {.callout-tip} + +## Real-World-Data Tip + +Shared workspaces facilitating secure, collaborative data access are not unique to N3C amongst cloud-hosted, centralized RWD environments. These technologies, coupled with appropriate [governance](governance.md) and [team-science](onboarding.md) principles, can facilitate successful research projects in the face of data complexity. +::: + ## The DUR - Data Use Request {#sec-access-dur} ### Project Roles and DUR Types {#sec-access-dur-roles} diff --git a/chapters/cycle.md b/chapters/cycle.md index b95bf905..fa812131 100644 --- a/chapters/cycle.md +++ b/chapters/cycle.md @@ -211,6 +211,13 @@ OMOP domain mapping typically involves the creation of mapping tables that trans The OMOP vocabulary dictates which source code should be placed in which target domain after it is translated into OMOP concepts. This vocabulary transformation using the OMOP concept relationships arranges the data into a well-organized and consistent format that can be easily analyzed and queried. +::: {.callout-tip} + +## Real-World-Data Tip + +Given the diversity of systems and medical vocabularies used in healthcare, vocabulary mapping is a crucial, but challenging and time-consuming part of RWD preparation for analysis. While technologies like Common Data Models (CDMs) and versioned medical vocabularies help, they are not universal solutions. The NIH and NCATS are currently engaged in a [Code Map Services](https://aspe.hhs.gov/code-map-services-interoperability-common-data-models-0) project to address this need broadly. +::: + ### N3C Global ID generation for all primary key fields {#sec-cycle-overview-globalid} The incoming data sets submitted to N3C may or may not include their own primary keys. 
@@ -358,6 +365,14 @@ The scorecards allow sites to be directly involved in the data quality improveme Additionally, the scorecards allow the DI&H team to monitor and maintain data quality across subsequent N3C data submissions and prevent any regression on those metrics. If the scorecards reveal that a released site is no longer passing key data quality metrics, then the site is unreleased until they are able to remediate their data quality issues. +::: {.callout-tip} + +## Real-World-Data Tip + +Unit harmonization, data quality checks, and site scorecards are all crucial ways to ensure data quality when working with RWD from multiple institutions. +Having these elements in place provides quality insights, cross-site imputation of missing units of measurement, and the opportunity for local data administrators to observe, investigate, and communicate any anomalies that may go undetected without cross-site comparisons. +::: + ## N3C Data Releases {#sec-cycle-releases} ### De-Identification and release to LDS (L3) {#sec-cycle-releases-lds} diff --git a/chapters/governance.md b/chapters/governance.md index 2701f99f..4ea3a478 100644 --- a/chapters/governance.md +++ b/chapters/governance.md @@ -70,6 +70,13 @@ Also, data cannot be extracted or downloaded from the data N3C Data Enclave, wit The workstream split into a smaller subgroup to draft the supporting governance documents while continuing to meet with the whole Workstream for ideation, context, and feedback weekly. Onward, the workstream adjusted the frequency of the meetings to the work cadence. +::: {.callout-tip} + +## Real-World-Data Tip + +A governing body (with stakeholder community representation) helps to ensure ethical contribution to, and use of, sensitive patient data. Clear communication between the governing body and researcher community is instrumental to the success of any central database of RWD. 
+::: + ## N3C Governance Bridges Individual Oversight Responsibilities {#sec-governance-bridges} The N3C Governance is a set of behavioral norms, policies, and procedures supported by technology/security measures and oversight mechanisms. @@ -113,6 +120,14 @@ Seven principles summarize the Community Guiding Principles: ![N3C shared Governance initiatives with sign-off responsibility represented.](images/governance/fig-governance-010-shared.png){#fig-governance-010-shared fig-alt="N3C shared Governance initiatives with sign-off responsibility represented"} +::: {.callout-tip} + +## Real-World-Data Tip + +Abiding by the Community Guiding Principles of partnership, inclusivity, transparency, reciprocity, accountability, security, and mutual respect is important for both the governance team and researchers accessing the data. +This facilitates clear communication and understanding of roles and responsibilities when handling data from multiple sources. +::: + ## Ethical oversight {#sec-governance-ethics} The ethical oversight of N3C is two-fold. @@ -142,6 +157,13 @@ Again, separating the DTA and LHBA provides the flexibility to contribute data w A separate [Data Use Agreement](https://ncats.nih.gov/files/NCATS_N3C_Data_Use_Agreement.pdf) (DUA) must be executed between NCATS and signing officials from an institution whose investigators wish to access N3C data. To improve efficiency, instead of executing a traditional pair-wise agreement each time a researcher needs access to N3C, a single DUA is executed between NCATS and an organization to render individual researchers eligible to request access to N3C content. +::: {.callout-tip} + +## Real-World-Data Tip + +Creating separate institution-level agreements for Data Transfer and Data Use helps accelerate agreement execution and ensures that institutions can participate at the level they feel comfortable with. 
+::: + ## Data Access by Researchers {#sec-governance-access} First-time users wishing to access N3C must verify that their institution has executed a DUA with NCATS. @@ -169,6 +191,13 @@ during account registration. Eligible users can then submit their Data Use Request for evaluation by the Data Access Committee. A new request must be submitted for each specific project.](images//governance/fig-governance-030-process.png){#fig-governance-030-process fig-alt="Data Access Governance Process."} +::: {.callout-tip} + +## Real-World-Data Tip + +Well-defined protocols for how users gain access and report data access incidents support the continued cooperation of data providers and the sustainability of the program, enabling researchers to work with such a large collection of harmonized data. +::: + ## Incident Notification and Escalation procedure {#sec-governance-incident} It is essential to create an Incident Notification Policy to ensure that the right people are notified about incidents at the right time and that problems can be addressed rapidly. diff --git a/chapters/intro.md b/chapters/intro.md index 0d65b15d..e2230171 100644 --- a/chapters/intro.md +++ b/chapters/intro.md @@ -74,10 +74,8 @@ Two overlapping teams of EHR data experts participate in this process: one works ## Real-World-Data Tip -When EHR is mapped to a research-appropriate Common Data Model (CDM), -analysts have the opportunity to write more concise code that can be rerun on other EHR data that is represented in this same CDM. -While translating data between CDMs is possible and facilitates interoperability and reproducibility, -RWD analysts must take into account the impact of each transformation step in terms of potential data loss or data restructuring. +When EHR is mapped to a research-appropriate Common Data Model (CDM), analysts have the opportunity to write more concise code that can be rerun on other EHR data that is represented in this same CDM. 
+While translating data between CDMs is possible and facilitates interoperability and reproducibility, RWD analysts must take into account the impact of each transformation step in terms of potential data loss or data restructuring. ::: ## The N3C Data Enclave and Data Access {#sec-intro-enclave} @@ -94,7 +92,8 @@ Accessing data will full patient zip codes, for example, requires obtaining appr ## Real-World-Data Tip -Regardless of data source or platform, when working with RWD that is derived from patient records, the researcher must have been legally granted access via binding contracts and user agreements, permission from Institutional Review Boards who oversee human subject rights, a workspace that meets the appropriate security requirements, and permission from data stewards who manage the specific dataset. Note that HIPAA laws stipulate that the minimum amount of identifiable data be shared to enable the particular research project. +Regardless of data source or platform, when working with RWD that is derived from patient records, the researcher must have been legally granted access via binding contracts and user agreements, permission from Institutional Review Boards who oversee human subject rights, a workspace that meets the appropriate security requirements, and permission from data stewards who manage the specific dataset. +Note that HIPAA laws stipulate that the minimum amount of identifiable data be shared to enable the particular research project. ::: Because effective analysis of EHR data requires a diverse set of skills-especially clinical and data science/statistical expertise-N3C provides organizational structures and resources to rapidly create and support multidisciplinary research teams, many of which are geographically diverse as well. 
@@ -136,7 +135,8 @@ Other data are available as well, including publicly-available datasets (e.g., f ## Real-World-Data Tip -Collecting multi-center data centrally allow RWD researchers to identify novel associations by collaboratively building, testing, and refining algorithmic classifiers once the various sources of patient data have been harmonized and connected in a way that can create a comprehensive dataset for each individual’s life course. Having access to row-level data from a variety of sites supports detailed investigation of variances across sites. +Collecting multi-center data centrally allows RWD researchers to identify novel associations by collaboratively building, testing, and refining algorithmic classifiers once the various sources of patient data have been harmonized and connected in a way that can create a comprehensive dataset for each individual’s life course. +Having access to row-level data from a variety of sites supports detailed investigation of variances across sites. ::: Big data is of little value without powerful analysis tools. diff --git a/chapters/ml.md b/chapters/ml.md index a4f12b97..d0974203 100644 --- a/chapters/ml.md +++ b/chapters/ml.md @@ -105,6 +105,14 @@ Tools are available to help assess feature importance (e.g., SHAP values [-@lund - **Adequate Documentation** -- Clearly annotated code, with explicit characterization of methodology and techniques that are employed, as well as descriptions of all key steps in a pipeline, including hyperparameter choice/search, appropriate train/test splits, etc., are critical to reproducibility of research. In addition, The N3C Enclave provides a feature called Protocol Pad which can explain exactly how a study was conducted in N3C. +::: {.callout-tip} + +## Real-World-Data Tip + +Machine learning may be particularly beneficial when considering the gaps and missingness in RWD. +Machine learning models can help differentiate high-confidence negative cases from unlabeled positive cases. 
+::: + ## ML in N3C {#sec-ml-in-n3c} Developing ML pipelines in the N3C platform is similar to developing code to solve other tasks, e.g., statistical analyses. diff --git a/chapters/onboarding.md b/chapters/onboarding.md index c9566592..01389828 100644 --- a/chapters/onboarding.md +++ b/chapters/onboarding.md @@ -211,6 +211,14 @@ The original Data Use Agreement is set to expire in early 2025. Researchers wish information and instructions can be found [here](https://covid.cd2h.org/covid-extension/). ::: +::: {.callout-tip} + +## Real-World-Data Tip + +Data Use Agreements are another common component of RWD research, ensuring that those accessing the data will take appropriate precautions to protect patient privacy. +Anyone with data access must understand and follow the requirements of the DUA or risk outcomes ranging from loss of access to legal or civil penalties. +::: + ## Research Project Teams {#sec-onboarding-team} ### Project Lead vs Collaborators {#sec-onboarding-team-lead} @@ -294,6 +302,16 @@ These include: * _Infrastructure that underpins team science_. In today's scientific landscape, it is not uncommon for researchers representing institutions from across the globe to collaborate on research studies. To do so effectively, these collaborative teams need reliable internet connectivity with equally reliable cybersecurity policies and support and space for local as well as distant meetings (e.g., Zoom, Teams) during all stages of the life of the study. + +::: {.callout-tip} + +## Real-World-Data Tip + +Diverse research teams are crucial when working with RWD. +Informaticians can help the team understand various limitations that may depend on the context of data collection and processing. 
+The inclusion of clinicians on the team ensures that assumptions match clinical care and that clinical concepts are captured with the appropriate level of sensitivity and specificity, and it facilitates the translational components of research that enhance patient care or policy. +The inclusion of biostatisticians and data scientists on a team ensures that EHR data, inevitably incomplete and a non-representative sample of the population at large, is modeled appropriately. +::: ## Domain Teams {#sec-onboarding-dts} diff --git a/chapters/practices.md b/chapters/practices.md index b5be542e..406461c7 100644 --- a/chapters/practices.md +++ b/chapters/practices.md @@ -87,6 +87,24 @@ Early on, the resulting Applicable Data Methods and Standards group established a [number of principles](https://docs.google.com/document/d/1FZkHOKCC89qr4TM2voLuXQZpT-riCxUeU0-la48r4HU/edit#heading=h.9ymy4s8eihpu), that have since been refined by cross-collaboration with other groups. +::: {.callout-tip} + +## Real-World-Data Tip + +EHR data is generated through the course of clinical care at a single institution and thus always comes with caveats. +When using this type of data, one should remember that it is not a representative sample of the population at large. +Key elements to consider include: + +- under-representation of healthy individuals +- catchment area and demographics of the patients of a particular health system +- level of access to care +- hospital specialization area +- coding and documentation practices at various stages in the healthcare system +- missing data, including events captured by other institutions not contributing data +- missing clinical history and events not reported or not recorded +- variability of quality related to local data encoding and data mapping +::: + ### Goals {#sec-practices-overview-goals} Our goals were to: @@ -401,6 +419,14 @@ We list that caveat, along with others, below: * Questions regarding COVID negative vs. 
COVID positives and co-morbidities (or other covariates) that are associated with the factors used to bring data into the Enclave, (i.e., age, sex, race, and ethnicity). +::: {.callout-tip} + +## Real-World-Data Tip + +Though ICU status can be tough to identify outright in many EHR datasets, the use of formulas that combine adjacent inpatient codes (such as [macrovisits](understanding.md#sec-understanding-ehr-visit) in N3C) and ICU-specific medications, labs, procedures, and other interventions may serve as a proxy for pinpointing these visits. +RWD researchers often have to use a combination of features such as those mentioned here to impute the presence of an event. +::: + **Special considerations**. There are other questions that may _potentially_ be answerable in N3C, depending on whether the required considerations are compatible with the research question of interest. @@ -1037,6 +1063,16 @@ Enabling such repeated analyses means automating a long chain of computing steps : STaRT-RWE sensitivity analysis elements, based on Wang and colleagues [-@wang_2021]. {#tbl-practices-start-sensitivity tbl-colwidths="[60, 40]"} +::: {.callout-tip} + +## Real-World-Data Tip + +While a positive value may be found, mapped, documented, and used in analysis, there is a much more ambiguous range of possibilities when it comes to a negative value. +For example, a diagnosis of hypertension might be coded and present in the data. +However, the absence of that code could result from information that was recorded at another institution, a question that was never asked, a condition that was noted but not documented, or a code that was lost in the data mapping pipelines at local institutions. +Thus, most features should be interpreted as “positive” or “unlabeled” findings, and RWD researchers might consider applying computable phenotypes, data density thresholds, or other approaches to identify high-confidence “negative” patients within the set of “unlabeled” patients. 
+::: + ## Protocol Completion ❸ {#sec-practices-completion} ### Gather Results; Publish {#sec-practices-completion-publish} diff --git a/chapters/publishing.md b/chapters/publishing.md index d67582a8..715b8428 100644 --- a/chapters/publishing.md +++ b/chapters/publishing.md @@ -190,6 +190,14 @@ In fact, because downstream results are only updated when they are explicitly ru Why is pinning to a release helpful? Because the default `master` branch is continuously being updated, analysis results based on it will change over time along with the underlying data (if they are re-run). This becomes cumbersome when writing about results! +::: {.callout-tip} + +## Real-World-Data Tip + +Since harmonization efforts and vocabulary updates come with downstream consequences and re-compute needs, pinning an analysis to a specific release of data for all inputs is often helpful to ensure you’re not mixing new and old data. +This applies to all RWD analyses that have to consider the time window of data collection for each source of data and apply methods to filter joined datasets to appropriate time windows. +::: + ### Download Request Process {#sec-publishing-tech-process} All research results derived from N3C data-including summary tables, figures, and logs-must be reviewed to ensure they don't inadvertently leak any patient-level data. diff --git a/chapters/story.md b/chapters/story.md index e7f24f1d..a5b4bfdc 100644 --- a/chapters/story.md +++ b/chapters/story.md @@ -203,6 +203,14 @@ Some trends we have noticed are: If your team needs someone, consider asking a relevant [domain team](onboarding.md#sec-onboarding-dt) for help identifying and approaching potential collaborators. Note that community-wide data and logic liaisons are available for consultation during regular office hours.^[See @sec-support-liaisons.] 
+::: {.callout-tip} + +## Real-World-Data Tip + +To properly address a research question with RWD, it is necessary to have an interdisciplinary team that can understand the various forms and sources of data, the context in which that data was collected, and the clinical relevance of any possible associations being noted. +::: + + ## Initial Meeting {#sec-story-meeting} :::{.callout-note icon=false} @@ -490,6 +498,14 @@ These tools inform decisions such as dropping specific sites or variables from t : Scurvy Analytic Dataset {#tbl-story-analytic} +::: {.callout-tip} + +## Real-World-Data Tip + +When preparing RWD for analysis, it is important to remember that multiple levels of data processing and filtration are an implicit part of any RWD study cohort, with each level being impacted by data collection and entry practices, data models, and medical vocabularies in use. +::: + + ## Analyses {#sec-story-analyses} :::{.callout-note icon=false} diff --git a/chapters/support.md b/chapters/support.md index 5cb21567..e35eb490 100644 --- a/chapters/support.md +++ b/chapters/support.md @@ -342,6 +342,14 @@ The Logic Liaison team will send a representative to your domain team meetings o Each clue in the hunt asks that you find specific resources and provides a brief description of each once found. This resource--the Guide to N3C--is designed as a comprehensive reference for N3C, and provides information and links to many other resources in its chapters. +::: {.callout-tip} + +## Real-World-Data Tip + +Knowledge of biomedical, translational, and clinical data standards is essential for appropriate use of the coded RWD when curating concept sets to apply to the data. +Ultimately, that in combination with data availability and data limitations will determine project feasibility. 
+::: + ::: {.callout-note appearance="simple"} ## Additional Chapter Details diff --git a/chapters/tools.md b/chapters/tools.md index af26c7ce..52e57fd4 100644 --- a/chapters/tools.md +++ b/chapters/tools.md @@ -125,6 +125,15 @@ However, the codeset id being referenced in the code may need to be updated if t In constructing phenotypes from concept sets, concept sets may also need to be joined together; these actions are best done in SQL/R/Python [Code Workbook](tools.md#sec-tools-apps-workbook) transforms with the use of the [Logic Liaison's Combined Variable template](https://unite.nih.gov/workspace/module/view/latest/ri.workshop.main.module.3ab34203-d7f3-482e-adbd-f4113bfd1a2b?id=KO-DE908D4&view=focus) {{< fa lock title="Link requires an Enclave account" >}} or in [Code Repositories](tools.md#sec-tools-apps-repo). +::: {.callout-tip} + +## Real-World-Data Tip + +Even with thorough [harmonization and quality checks](cycle.md) of RWD, unaddressed vocabulary updates or unmappable data elements may result in systematically missing data, which may vary by data provider or timeframe. +For example, medical vocabularies updated prior to 2021 will not have any specific code related to Long COVID, meaning that source data representing this condition would be lost when mapping to a pre-2021 medical vocabulary. +RWD researchers should thoroughly examine the potential for such heterogeneous systematic missingness, and account for it via sensitivity analyses, dropping patients or data sources with incomplete data, or other techniques. +::: + ## N3C Knowledge Store {#sec-tools-store} The N3C Knowledge Store is an application where you, as an N3C Data Enclave user, can discover shared code templates, external datasets, reports, cohorts, and Python libraries (collectively also known as Knowledge Objects or KOs) and share similarly re-usable Knowledge Objects of your own with other Enclave users, regardless of the specific project from which the resource originated. 
@@ -219,6 +228,15 @@ You can then build upon this fact table using the ancillary templates that allow The Logic Liaison ancillary data quality templates provide the same structure for analyzing data missingness, density, and contribution quality by site. Further explanation as to why these Knowledge Store objects are highly applicable can be found in Best Practices for the Research Life Cycle (@sec-practices). +::: {.callout-tip} + +## Real-World-Data Tip + +Structured methods for identifying or deriving cohorts and variables are very important when working with observational health data and RWD. +Holes in the real data are only amplified when mapped to other formats. +Even with a basic threshold of participation, the level of variability prompts the need for community agreed-upon methods for calculating particular variables to promote reproducibility of this kind of research. +::: + ## Enclave Applications {#sec-tools-apps} This section will cover the usage of various applications made available in the N3C Data Enclave, including Protocol Pad, Contour, Code Workbooks, and more ([a complete list of Foundry applications can be found here](https://www.palantir.com/docs/foundry/getting-started/application-reference/)). diff --git a/chapters/understanding.md b/chapters/understanding.md index 15897bb3..f83c2029 100644 --- a/chapters/understanding.md +++ b/chapters/understanding.md @@ -273,6 +273,16 @@ It is true that a later analyst will look through the list of codes, but having Limitations communicate edge cases and caveats to the analyst. "Issues" communicates performance with the Enclave data. This performance could include the number of codes contributing the majority of the data (e.g., from Term Usage) or the distribution of values across sites, in the case of lab tests. +::: {.callout-tip} + +## Real-World-Data Tip + +Concept sets are helpful to identify clinical ideas in coded data. +However, the key limitation is that they cannot be used with clinical notes. 
+The user must apply natural language processing or other machine learning methods to understand the wealth of information stored in unstructured data. +::: + + #### Provenance {#sec-understanding-sets-metadata-provenance} Provenance communicates how the concept set was assembled. @@ -376,6 +386,16 @@ Please ensure that your concept sets are not missing these key components. Note, as well, that, if you publish a paper, using a concept set, that concept set should be published as well. The Properties and Reviews will be published, whether missing or not. +::: {.callout-tip} + +## Real-World-Data Tip + +Concept set definitions (or, in non-OMOP contexts, value set definitions) are a crucial component of EHR data analysis, effectively delineating the list of codes specific to a clinical idea of interest. +Defining them requires careful design and review involving both informaticists and clinicians. +While concept sets are reusable, they must be evaluated in the context of any given research question to ensure these building blocks appropriately define study cohorts, variables, and outcomes in RWD research. +::: + + ## EHR-Based Data Beyond OMOP {#sec-understanding-ehr} Data tables of use or interest to analysts found in either the LDS or Safe Harbor Release are the following: @@ -583,6 +603,14 @@ Look for dictionaries for the [CMS Standard Analytic Files (SAF) or Limited Data A detailed CMS training webinar is available on [YouTube](https://www.youtube.com/watch?v=fs0tM7RnL54). +::: {.callout-tip} + +## Real-World-Data Tip + +Privacy-preserving record linkage (PPRL) allows researchers to bring disparate patient data from different locations and time periods together in one source, which helps piece together an individual’s life course for improved centralized RWD research. 
+::: + + ## Public/External Datasets {#sec-understanding-public} Because our data are not representative of the geographic locations whence they come, it is important for many analyses to attempt to "correct" the results due to this selection bias.