diff --git a/vignettes/genie-bpc-vignette.Rmd b/vignettes/genie-bpc-vignette.Rmd
deleted file mode 100644
index 548d72c2..00000000
--- a/vignettes/genie-bpc-vignette.Rmd
+++ /dev/null
@@ -1,298 +0,0 @@
----
-title: "Analyzing GENIE BPC Data Using {gnomeR}"
-output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Analyzing GENIE BPC Data Using {gnomeR}}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
----
-
-```{r, include = FALSE}
-knitr::opts_chunk$set(
-  collapse = TRUE,
-  comment = "#>"
-)
-
-set.seed(20230302)
-
-# exit if user doesn't have synapser, a log in, or access to data.
-
-genieBPC::set_synapse_credentials()
-
-if (genieBPC:::.is_connected_to_genie() == FALSE){
-  knitr::knit_exit()
-}
-```
-
-```{r setup, include = FALSE, message = FALSE, results = FALSE, eval=genieBPC:::.is_connected_to_genie()}
-library(gnomeR)
-library(dplyr)
-library(genieBPC)
-library(cbioportalR)
-```
-
-## Introduction
-
-This vignette will walk through how to apply {gnomeR} functions to data from AACR Project Genomics Evidence Neoplasia Information Exchange BioPharma Collaborative (GENIE BPC). A broad overview of AACR Project GENIE BPC can be found [here](https://www.aacr.org/about-the-aacr/newsroom/news-releases/aacr-project-genie-begins-five-year-collaborative-research-project-with-36-million-in-new-funding/), with details on the clinical data structure available on the [{genieBPC} package website](https://genie-bpc.github.io/genieBPC/articles/clinical_data_structure_vignette.html). 
-
-For the purposes of this vignette, we will use the first publicly available GENIE BPC data release of non-small cell lung cancer patients, NSCLC v2.0-public.
-
-Note that the GENIE BPC genomic data are unique in a few particular ways:
-
-* These data are heterogeneous such that they come from multiple genomic panels from multiple institutions. Gene aliases across panels may vary. 
-* The GENIE BPC fusions and copy number alterations data may differ from common data formats. Differences are described in the Data Formats subsection below.
-
-## Data Access
-
-To gain access to the GENIE BPC data, please follow the instructions on the [{genieBPC} `pull_data_synapse()` vignette](https://genie-bpc.github.io/genieBPC/articles/pull_data_synapse_vignette.html) to register for a Synapse account. Once your Synapse account is created and you authenticate yourself using `genieBPC::set_synapse_credentials()`, you'll be ready to pull the GENIE BPC clinical and genomic data from Synapse into your local environment:
-
-```{r, eval=FALSE}
-library(genieBPC)
-
-# if credentials are not stored in your R environment
-set_synapse_credentials(username = "username", password = "password")
-```
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-
-# if credentials are stored in your R environment
-set_synapse_credentials()
-```
-
-## Obtain GENIE BPC Data
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-# pull NSCLC v2.0-public data from Synapse into the R environment
-nsclc_2_0 = pull_data_synapse(cohort = "NSCLC",
-                              version = "v2.0-public")
-```
-
-The resulting `nsclc_2_0` object is a nested list of datasets, including the mutations, fusions, and copy number data.
-
-- `nsclc_2_0$NSCLC_v2.0$mutations_extended`
-- `nsclc_2_0$NSCLC_v2.0$fusions`
-- `nsclc_2_0$NSCLC_v2.0$cna`
-
-Note that while the GENIE BPC clinical data are only available via Synapse, the genomic data can be accessed via both Synapse and cBioPortal. Using the {cbioportalR} package, users can pull the GENIE BPC genomic data directly from cBioPortal:
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-library(cbioportalR)
-
-# connect to the GENIE instance of cBioPortal
-cbioportalR::set_cbioportal_db("https://genie.cbioportal.org/api")
-
-# view list of available studies from this instance of the portal
-# NSCLC v2.0-public is: nsclc_public_genie_bpc
-available_studies()
-```
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-# obtain genomic data for GENIE BPC NSCLC v2.0-public
-mutations_extended_2_0 <- get_mutations_by_study("nsclc_public_genie_bpc")
-cna_2_0 <- get_cna_by_study("nsclc_public_genie_bpc")
-fusions_2_0 <- get_fusions_by_study("nsclc_public_genie_bpc")
-```
-
-## Data Formats
-
-The genomic data for GENIE BPC are stored both on Synapse and in cBioPortal. The data structure differs depending on where the genomic data are downloaded from. Therefore, the remainder of this vignette will proceed by outlining the process of annotating genomic data separately for genomic data downloaded from Synapse and genomic data downloaded from cBioPortal. 
-
-### Differences Between Synapse and cBioPortal Genomic Data
-
-Please note that pulling genomic GENIE data from Synapse using `pull_data_synapse()` and pulling GENIE data from cBioPortal may result in small differences in the data due to systematic differences in the processing pipelines employed by Synapse and cBioPortal. These differences may include: 
-
-* Data formatting - Some data sets (e.g. CNA files) may appear in wide format in Synapse data versus long format in cBioPortal data, or column attributes and names may appear sightly different (e.g. fusions files).
-
-* Default filtering - By default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank and 5'Flank, except for the promoter mutations of the TERT gene. See [cBioPortal documentation](https://docs.cbioportal.org/file-formats/#mutation-data) for more details. These are retained in Synapse processing pipelines.
-
-* Hugo Symbols - Some genes have more than one accepted Hugo Symbol and may be referred to differently between data sources (e.g. `NSD3` is an alias for `WHSC1L1`). Some tools exist to help you resolve gene aliases across genomic data sources. See `gnomeR::recode_alias()`, `cbioportal::get_alias()` and vignettes from the [{gnomeR}](https://mskcc-epi-bio.github.io/gnomeR/) and [{cbioportalR}](https://www.karissawhiting.com/cbioportalR/) for more information on how to use these functions and work with gene aliases.
-
-## Selecting a Cohort for Analysis
-
-The following code chunk uses the `genieBPC::create_analytic_cohort()` to create an analytic cohort of patients diagnosed with stage IV NSCLC of adenocarcinoma histology. Then, for patients with multiple genomic samples, the `genieBPC::select_unique_ngs()` function chooses the genomic sample with OncoTree code LUAD (if available). For patients with multiple samples with OncoTree code LUAD, we will select the metastatic genomic sample. If any patients have multiple metastatic samples with OncoTree code LUAD, take the latest of the samples.
-
-Note: for patients with exactly one genomic sample, that unique genomic sample will be returned *regardless of whether it meets the argument criteria specified below*.
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-# create analytic cohort of patients diagnosed with Stage IV adenocarcinoma
-nsclc_2_0_example <- create_analytic_cohort(
-  data_synapse = nsclc_2_0$NSCLC_v2.0,
-  stage_dx = c("Stage IV"),
-  histology = "Adenocarcinoma"
-)
-
-# select unique NGS samples for this analytic cohort
-nsclc_2_0_samples <- select_unique_ngs(
-  data_cohort = nsclc_2_0_example$cohort_ngs,
-  oncotree_code = "LUAD",
-  sample_type = "Metastasis",
-  min_max_time = "max"
-)
-```
-
-Create a dataframe of the corresponding panel and sample IDs:
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-# specify sample panels and IDs
-nsclc_2_0_sample_panels <- nsclc_2_0_samples %>% 
-  select(cpt_seq_assay_id, cpt_genie_sample_id) %>%
-  rename(panel_id = cpt_seq_assay_id,
-         sample_id = cpt_genie_sample_id) %>%
-  filter(!is.na(panel_id))
-```
-
-## Process Data with `create_gene_binary()`
-
-The `create_gene_binary()` function takes inputs of mutations, fusions, and CNA data and returns a binary matrix with the alteration status for each gene, annotating missingness when genes were not included on a next generation sequencing panel. 
-
-It is critical to utilize the `specify_panel` argument of `create_gene_binary()`. Samples included in GENIE BPC were sequenced across multiple sequencing platforms, with the genes included varying across panels. Without the `specify_panel` argument, missingness will not be correctly annotated, and genes that were not tested will be incorrectly documented as not being altered.
-
-Note: you can optionally check and recode any older gene names to their newer Hugo Symbol in your data set by passing the `genie` option to `create_gene_binary(recode_aliases=)`.
-
-**Using the genomic data from Synapse:**
-
-The fusions and CNA data as downloaded from Synapse require some modifications prior to being supplied to the `gnomeR::create_gene_binary()` function. 
-
-First, the CNA file can be transposed to match the expected input for `create_gene_binary()` using `pivot_cna_longer()`:
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-# transpose CNA data from Synapse
-cna_synapse_long <- pivot_cna_longer(nsclc_2_0$NSCLC_v2.0$cna)
-```
-
-Next, the fusions file can be transposed to match the expected input for `create_gene_binary()`
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-# transpose fusions data from Synapse
-fusions_synapse_long <- reformat_fusion(nsclc_2_0$NSCLC_v2.0$fusions)
-```
-
-Finally, the reformatted genomic data can be supplied to `create_gene_binary()` to annotate genomic alterations for patients in the analytic cohort of interest.
-
-The CNA data as downloaded from cBioPortal only includes high level CNA (-2, 2), so we will specify `high_level_cna_only = TRUE` to be consistent with the results based on the genomic data as downloaded from cBioPortal. 
-
-Additionally, we will use the built in 'genie` option to check gene aliases (see `?create_gene_binary` for more info).
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-nsclc_2_0_gen_dat_synapse <-
-  create_gene_binary(
-    mutation = nsclc_2_0$NSCLC_v2.0$mutations_extended,
-    cna = cna_synapse_long,
-    high_level_cna_only = TRUE,
-    fusion = fusions_synapse_long,
-    samples = nsclc_2_0_sample_panels$sample_id,
-    specify_panel = nsclc_2_0_sample_panels, 
-    recode_aliases = "genie"
-  )
-```
-
-**Using the genomic data from cBioPortal:**
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-nsclc_2_0_gen_dat_cbio <-
-  create_gene_binary(
-    mutation = mutations_extended_2_0,
-    cna = cna_2_0,
-    fusion = fusions_2_0,
-    samples = nsclc_2_0_sample_panels$sample_id,
-    specify_panel = nsclc_2_0_sample_panels, 
-    recode_aliases = "genie"
-  )
-```
-
-Binary genomic matrices created using the genomic data downloaded from Synapse and cBioPortal should be equal. We will proceed using the `nsclc_2_0_gen_dat_cbio` object.
-
-## Collapse Data with `summarize_by_gene()`
-
-We can summarize the presence of any alteration event (mutation, amplification, deletion, structural variant) with the `summarize_by_gene()` function, such that each gene is a column that captures the presence of any event regardless of alteration type.
-
-Summarizing the first 10 samples for KRAS alterations:
-
-**Using the genomic data from Synapse:**
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-nsclc_2_0_gen_dat_synapse[1:10, ] %>% 
-  select(sample_id, KRAS, KRAS.Amp) %>%
-  summarize_by_gene()
-```
-
-## Analyzing Data 
-
-After the data have been transformed into a binary format, we can create summaries and visualizations to better understand the data.
-
-### Summarize Data with `tbl_genomic()`
-
-The `tbl_genomic()` function summarizes the frequency of alteration events from the binary data returned from `create_gene_binary()` or `summarize_by_gene()`.
-
-**Using the genomic data from Synapse:**
-
-Summarizing the frequencies of KEAP1, STK11, and SMARCA4 alteration events:
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-
-nsclc_2_0_gen_dat_synapse %>% 
-  select(sample_id, KEAP1, STK11, SMARCA4) %>%
-  tbl_genomic()
-```
-
-Users can subset their data set to only include genes above a certain prevalence frequency threshold before passing to the function using the `subset_by_frequency()` function.
-
-Below, we summarize alteration events with >=10% frequency:
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-
-nsclc_2_0_gen_dat_synapse %>%
-  subset_by_frequency(t = 0.1) %>%
-  tbl_genomic()
-```
-
-**Using the genomic data from cBioPortal:**
-
-Summarizing the frequencies of KEAP1, STK11, and SMARCA4 alteration events:
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-
-nsclc_2_0_gen_dat_cbio %>%
-  select(sample_id, KEAP1, STK11, SMARCA4) %>%
-  tbl_genomic()
-```
-
-Summarizing alteration events with >=10% frequency:
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-
-nsclc_2_0_gen_dat_cbio %>%
-  subset_by_frequency(t = 0.1) %>%
-  tbl_genomic()
-```
-
-### Data Visualizations
-
-We can use the `mutation_viz()` function to visualize several aspects of the mutation data, including variant classification, variant type, SNV class and top variant genes.
-
-For the purposes of this vignette we will visualize the genomic data from cBioPortal.
-
-**Using the genomic data from cBioPortal:**
-
-```{r, eval=genieBPC:::.is_connected_to_genie()}
-mutation_viz_gen_dat_cbio <- mutation_viz(mutations_extended_2_0)
-
-mutation_viz_gen_dat_cbio
-```
-
-# References
-
-Additional details regarding the GENIE BPC data and the {genieBPC} R package are published in the following papers:
-
-* Lavery, J. A., Brown, S., Curry, M. A., Martin, A., Sjoberg, D. D., & Whiting, K. (2023). [A data processing pipeline for the AACR project GENIE biopharma collaborative data with the {genieBPC} R package](https://pubmed.ncbi.nlm.nih.gov/36519837/). Bioinformatics (Oxford, England), 39(1), btac796. https://doi.org/10.1093/bioinformatics/btac796
-
-* Lavery, J. A., Lepisto, E. M., Brown, S., Rizvi, H., McCarthy, C., LeNoue-Newton, M., Yu, C., Lee, J., Guo, X., Yu, T., Rudolph, J., Sweeney, S., AACR Project GENIE Consortium, Park, B. H., Warner, J. L., Bedard, P. L., Riely, G., Schrag, D., & Panageas, K. S. (2022). [A Scalable Quality Assurance Process for Curating Oncology Electronic Health Records: The Project GENIE Biopharma Collaborative Approach](https://pubmed.ncbi.nlm.nih.gov/35192403/). JCO clinical cancer informatics, 6, e2100105. https://doi.org/10.1200/CCI.21.00105
-
-Technical details regarding proper analysis of this data can be found in the following publication:
-
-* Brown, S., Lavery, J. A., Shen, R., Martin, A. S., Kehl, K. L., Sweeney, S. M., Lepisto, E. M., Rizvi, H., McCarthy, C. G., Schultz, N., Warner, J. L., Park, B. H., Bedard, P. L., Riely, G. J., Schrag, D., Panageas, K. S., & AACR Project GENIE Consortium (2022). [Implications of Selection Bias Due to Delayed Study Entry in Clinical Genomic Studies](https://pubmed.ncbi.nlm.nih.gov/34734967/). JAMA oncology, 8(2), 287–291. https://doi.org/10.1001/jamaoncol.2021.5153
-
-* Kehl, K. L., Uno, H., Gusev, A., Groha, S., Brown, S., Lavery, J. A., Schrag, D., & Panageas, K. S. (2023). [Elucidating Analytic Bias Due to Informative Cohort Entry in Cancer Clinico-genomic Datasets](https://pubmed.ncbi.nlm.nih.gov/36626408/). Cancer epidemiology, biomarkers & prevention: a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology, 32(3), 344–352. https://doi.org/10.1158/1055-9965.EPI-22-0875
-
-* Kehl, K. L., Riely, G. J., Lepisto, E. M., Lavery, J. A., Warner, J. L., LeNoue-Newton, M. L., Sweeney, S. M., Rudolph, J. E., Brown, S., Yu, C., Bedard, P. L., Schrag, D., Panageas, K. S., & American Association of Cancer Research (AACR) Project Genomics Evidence Neoplasia Information Exchange (GENIE) Consortium (2021). [Correlation Between Surrogate End Points and Overall Survival in a Multi-institutional Clinicogenomic Cohort of Patients With Non-Small Cell Lung or Colorectal Cancer](https://pubmed.ncbi.nlm.nih.gov/34309669/). JAMA network open, 4(7), e2117547. https://doi.org/10.1001/jamanetworkopen.2021.17547
-
diff --git a/vignettes/qa-impact-data.Rmd b/vignettes/qa-impact-data.Rmd
deleted file mode 100644
index 0904737a..00000000
--- a/vignettes/qa-impact-data.Rmd
+++ /dev/null
@@ -1,206 +0,0 @@
----
-title: "How to QA Your IMPACT Data"
-output: rmarkdown::html_vignette
-author: Esther Drill
-vignette: >
-  %\VignetteIndexEntry{How to QA Your IMPACT Data}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
----
-
-```{r, include = FALSE}
-knitr::opts_chunk$set(
-  collapse = TRUE,
-  comment = "#>"
-)
-
-```
-
-```{r setup, include=FALSE, echo=FALSE,warning=FALSE}
-
-library(dplyr)
-library(cbioportalR)
-library(tidyr)
-library(gnomeR)
-
-
-data("clin_collab_df")
-```
-
-## Introduction
-
-The purpose of this vignette is to outline best practices for downloading, QA-ing and analyzing data generated from MSK IMPACT, a targeted tumor-sequencing test that can detect more than 468 gene mutations and other critical genetic changes in common and rare cancers. Using a hepatocellular cancer case study, we demonstrate a data analysis pipeline using {cbioportalR} functions that can help users generate reproducible analyses using this data.
-
-## Setting up
-
-For this vignette, we will be using {cbioportalR}, a package to download data from the cBioPortal website. We will also be using {dplyr} and {tidyr} to clean and manipulate the data:
-
-```{r message = FALSE, warning=FALSE}
-library(gnomeR)
-library(cbioportalR)
-library(dplyr)
-library(tidyr)
-```
-
-To access cBioPortal data using the [{cbioportalR}](https://www.karissawhiting.com/cbioportalR/) package, you must set the cBioPortal database using the `set_cbioportal_db()` function. To access public data, set this to `db = public`. If you are using a private version of cBioPortal, you would set the `db` argument to your institution's cBioPortal URL.
-
-```{r set_df}
-
-set_cbioportal_db(db = "public")
-
-```
-
-## Case Study
-
-**Scenario:** You are a data analyst whose collaborator has sent you a clinical file of a cohort of patients with hepatocellular cancer that she is interested in for retrospective data analysis. In particular, she wants to look at IMPACT sequencing data for the cohort and investigate associations between genomic alterations and pathological and clinical characteristics. She asks if you can get the IMPACT data and do the analysis.
-
-She gives you a clinical file with 80 sample IDs: `clin_collab_df`.
-
-```{r head_clin}
-head(clin_collab_df)
-```
-
-The sample IDs are in the `cbioportal_id` column.
-
-Before using cBioPortal to access the genomic data, you first want to do some QA on the clinical data and make sure it matches up with the clinical data in cBioPortal.
-
-### Check For Multiple Samples Per Patient
-
-One of the first things to check in your data is whether you have multiple sample IDs from the same patient. Sometimes a clinical file will have a patient_ID column as well; this one doesn't, so you can make your own. The patient ID is just the first 9 digits of the `cbioportal_id`:
-
-```{r patient_id}
-
-clin_collab_df <- clin_collab_df %>%
-  mutate(
-    patient_id = substr(cbioportal_id, 1, 9)
-  )
-```
-
-If there is only one sample per patient, there should be the same number of samples as patients.
-
-```{r pt_samp_sum}
-clin_collab_df %>%
-    summarize( patients = length(unique(patient_id)),
-            samples= length(unique(cbioportal_id))) 
-```
-
-So it's clear that we have multiple samples per patient. To find out which patient/s, you can count the `patient_id` values and filter for \>1.
-
-```{r }
-multiple_samps <- clin_collab_df %>%
-    count(patient_id) %>%
-  filter(n > 1)
-multiple_samps
-```
-
-There are 2 patients who each have 2 samples in the collaborator's dataset. Filter the dataset to see the `cbioportal_id`'s in question:
-
-```{r count_patients }
-
-clin_collab_df %>% 
-  filter(
-  patient_id %in% 
-    (multiple_samps$patient_id))
-
-```
-
-These are patients and samples to ask your collaborator about: Does using both samples make sense? Often times the answer is no. And if not, which sample is the most appropriate one to include? (To get more info for yourself, you can enter the patient ids into the [cBioPortal website](https://www.cbioportal.org/).)
-
-### Check That All `cbioportal_ids` Are In cBioPortal Database
-
-To do this, you need to retrieve the clinical data from cBioPortal using [{cbioportalR}](https://www.karissawhiting.com/cbioportalR/). You can use the `get_clinical_by_sample()` function from {cbioportalR} to do this. Set the `sample_id` parameter to the `cbioportal_ids` from the clinical collaborator's file.\
-Store the sample data in a file called `clin_cbio`.
-
-```{r get_cbio_clinical}
-
-clin_cbio = get_clinical_by_sample(sample_id = clin_collab_df$cbioportal_id) 
-
-```
-
-(You can disregard the warning message for now, though you may be interested in specific clinical attributes later.)
-
-*Note: If you are using the public version of cBioPortal, this function will only query the `msk_impact_2017` study.*
-
-Notice that you now have 2 clinical files: one given to you by the collaborator (`clin_collab_df`) and one you have retrieved yourself from cBioPortal (`clin_cbio`).
-
-Here's the header of `clin_cbio`:
-
-```{r head_clin_cbio}
-head(clin_cbio) %>% as.data.frame()
-```
-
-The sample IDs here are in the `sampleId` column. You may notice that this file is in "long" format and each sample has multiple rows. Later we will convert this file to "wide" format to do QA checking on attributes.
-
-But the first thing you want to know is whether you are able to find all of the `cbioportal_ids` from your `clin_collab_df` file in the `clin_cbio` file.\
-To do this, use the `setdiff()` function:
-
-```{r check_missing}
-
-setdiff(clin_collab_df$cbioportal_id, clin_cbio$sampleId)
-
-```
-
-So there are two sample ID's from your clinical file (`clin_collab_df`) that are currently not found in cBioPortal (in your `clin_cbio` file). Include these in the list of cBioPortal questions to ask your collaborator.
-
-(Again, if you want to investigate a bit further, you could enter the patient cBioPortal IDs as queries into the [cBioPortal website](https://www.cbioportal.org/).)
-
-### Check Clinical Data Matches cBioPortal Database
-
-Now we need to check whether clinical information in collaborator's file (`clin_collab_df`) matches clinical information in cBioPortal (in your `clin_cbio` file).
-
-Look at the `clin_collab_df` again:
-
-```{r head_clin_v2}
-head(clin_collab_df)
-```
-
-Aside from `cbioportal_id`, you have cancer type (`ctype`) and sample type (`primary_mets`) variables. Because it's a hepatocellular cancer study, all of the `ctype` values will be the same. To double check that, count `ctype`:
-
-```{r count_ctype}
-clin_collab_df %>% count(ctype)
-```
-
-So the only variable you can check in this example is the `primary_mets`. To see if the `clin_cbio` file has an analogous variable to check, first see the attributes that are available in it.
-
-```{r count_attribute}
-clin_cbio %>% count(clinicalAttributeId)
-```
-
-To quickly see values associated with a particular attribute, filter by the attribute and count the values. For example:
-
-```{r filter_and_count}
-clin_cbio %>% filter(clinicalAttributeId=="SAMPLE_TYPE") %>% count(value)
-```
-
-The attribute `SAMPLE_TYPE` looks like the appropriate variable to check `primary_mets` against. To do this, we will convert `clin_cbio` to "wide" form (only for the `SAMPLE_TYPE` variable for now), merge it with `clin_collab_df` and then cross-tabulate the 2 variables.
-
-To convert `clin_cbio` to "wide" form:
-
-```{r convert_samps_to_df}
-
-clin_cbio_wide = clin_cbio %>% 
-  select( sampleId, clinicalAttributeId, value) %>%
-  filter( clinicalAttributeId == "SAMPLE_TYPE") %>% 
-  pivot_wider(names_from = clinicalAttributeId, values_from = value)
-```
-
-Take a look at the "wide" file:
-
-```{r head_wide}
-head(clin_cbio_wide) %>% as.data.frame()
-```
-
-Now to check the `primary_mets` variable from `clin_collab_df` against the `SAMPLE_TYPE` variable from `clin_cbio_wide`, merge the files and tabulate the variables.
-
-```{r compare}
-clin_merged <- clin_cbio_wide %>% left_join(clin_collab_df, by = c("sampleId" = "cbioportal_id")) 
-clin_merged %>% select(primary_mets, SAMPLE_TYPE) %>% table()
-```
-
-There is 1 sample that has a value of "Metastasis" for the `primary_mets` variable but "Primary" for the `SAMPLE_TYPE` variable. To find the sample ID, filter:
-
-```{r find_discordant}
-clin_merged %>% filter(primary_mets == "Metastasis" & SAMPLE_TYPE == "Primary")
-```
-
-Include this sample in the list of questions for your collaborator. Either she will need to update her clinical file with the correct value or you/she will have to notify cBioPortal to update their database.