Target barcode #144

Open
wants to merge 15 commits into
base: develop
from
113 changes: 113 additions & 0 deletions docs/Encyclopedia/Under_Development/TARGET_Barcode
@@ -0,0 +1,113 @@
##title##
Each sample ID contains an identifier that depends on the specific project. However, the
codes used within the IDs are the same across all OCG projects.


#Nomenclature#
![TARGET_image_1](images/target_barcode_img1.png)
![TARGET_image_2](images/TARGET_IMG_2.png)

#HCMI Nomenclature Structure#
![TARGET_image_3](images/TARGET_IMG_3.png)

#TARGET Nomenclature Structure#
![TARGET_image_4](images/TARGET_IMG_4.png)

#Tumor Code#
![TARGET_image_5](images/TARGET_IMG_5.png)

#Tissue Code#
![TARGET_image_6](images/TARGET_IMG_6.png)

The tissue codes in the table above denote the source of the tissue collected for study. A patient may undergo multiple
tissue collections, and resected tissue can be separated into smaller portions of material for research; those
smaller portions may even be preserved using different methods (e.g., some flash frozen and some FFPE). Cell lines and
xenografts may also be grown at different times. Therefore, a letter identifier is added to the tissue code number to
track separate aliquots/tissue sections from the same patient. For example:

1. A – first aliquot, growth or section of tissue reviewed to meet clinical quality criteria
2. B – second aliquot, growth or section of tissue reviewed to meet clinical quality criteria

Note: When characterizing multiple tissues from the same case, the sample codes must distinguish between the
tissues by using a separate portion designation (e.g., the tissue codes could be “01A” and “01B”).
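
As an illustration of the tissue code plus portion letter described above, the following minimal sketch splits a value such as “01A” into its two parts. The function name and the assumption that the tissue code is always two digits followed by a single uppercase letter are hypothetical, inferred only from the examples in this section.

```python
import re

# Assumption: two digits (tissue code) followed by one uppercase portion letter,
# as in the "01A" / "01B" examples. This pattern is illustrative, not normative.
TISSUE_PORTION = re.compile(r"(?P<tissue>\d{2})(?P<portion>[A-Z])")


def split_tissue_portion(code: str) -> tuple[str, str]:
    """Return (tissue_code, portion_letter) for a value such as '01A'."""
    match = TISSUE_PORTION.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized tissue/portion code: {code!r}")
    return match["tissue"], match["portion"]


if __name__ == "__main__":
    print(split_tissue_portion("01A"))
    print(split_tissue_portion("01B"))
```

Two aliquots from the same tissue then share the tissue code and differ only in the portion letter.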

#Nucleic Acid Codes#
• 01D = DNA, unamplified, from the first isolation of a tissue (fresh/frozen)
• 01E = DNA, unamplified, from the first isolation of a tissue embedded in FFPE
• 01W = DNA, whole genome amplified by Qiagen (one independent reaction)
• 01X = DNA, whole genome amplified by Qiagen (a second, separate independent reaction)
• 01Y = DNA, whole genome amplified by Qiagen (pool of “W” and “X” aliquots)
• 01R = RNA, from the first isolation of a tissue (fresh/frozen)
• 01S = RNA, from the first isolation of a tissue embedded in FFPE

Note: If additional isolations are needed from the same tissue aliquot, the number changes accordingly (e.g., 02D for a second isolation).
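
The nucleic acid codes listed above can be read as an isolation number followed by an analyte letter. The sketch below decodes such a value under that reading; the dictionary contents are taken from the list above, while the names `ANALYTE_TYPES` and `decode_nucleic_acid_code` are hypothetical.

```python
import re

# Analyte letters and their meanings, copied from the list in this section.
ANALYTE_TYPES = {
    "D": "DNA, unamplified, from fresh/frozen tissue",
    "E": "DNA, unamplified, from tissue embedded in FFPE",
    "W": "DNA, whole genome amplified by Qiagen (one independent reaction)",
    "X": "DNA, whole genome amplified by Qiagen (a second, separate independent reaction)",
    "Y": "DNA, whole genome amplified by Qiagen (pool of W and X aliquots)",
    "R": "RNA, from fresh/frozen tissue",
    "S": "RNA, from tissue embedded in FFPE",
}


def decode_nucleic_acid_code(code: str) -> tuple[int, str]:
    """Return (isolation_number, analyte_description) for a value such as '01D'."""
    # Assumption: exactly two digits followed by one uppercase letter.
    match = re.fullmatch(r"(\d{2})([A-Z])", code)
    if match is None:
        raise ValueError(f"unrecognized nucleic acid code: {code!r}")
    isolation, letter = int(match.group(1)), match.group(2)
    if letter not in ANALYTE_TYPES:
        raise ValueError(f"unknown analyte letter: {letter!r}")
    return isolation, ANALYTE_TYPES[letter]


if __name__ == "__main__":
    print(decode_nucleic_acid_code("01D"))
    print(decode_nucleic_acid_code("02R"))
```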

#BLGSP: Additional tissue code sample identifiers (when a single tissue yields multiple sample subtypes)#
#Pre-Extraction Manipulation of Tissue Samples (including Cell Sorting):#
Some analyses of patient tissues require tissue manipulation prior to nucleic acid extraction. For
example, some OCG tissue samples have undergone Fluorescence-activated Cell Sorting (FACS), a specialized form of
flow cytometry that separates a heterogeneous mixture of biological cells into two or more
subpopulations, one cell at a time, based on the specific light scattering and fluorescent characteristics of
each cell type. Therefore, multiple cell types may be available for certain cases. Sorted samples can originate from
and/or result in tumor or normal tissues, and their IDs contain an extension of the tissue code following the “tissue portion” letter
identifier (e.g., BLGSP-XX-(USI)-03A.1-01(D, R, etc.)). It is not possible to determine from the extension the specific
modifications or cell markers used to sort a subpopulation; users must consult the metadata files for specific details
regarding the pre-extraction, post-pathology-review handling of tissue. Tissue extension codes use sequential numbers
only to denote that a given sample is unique; the numbers themselves provide no additional information about
the sample.
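
To make the position of the sorted-sample extension concrete, here is a minimal sketch that parses a tissue field such as “03A.1”, where the extension follows the portion letter. The pattern and the function name are assumptions based only on the example ID above; the extension is returned as an opaque sequence number because it carries no meaning beyond uniqueness.

```python
import re
from typing import Optional

# Assumption: two-digit tissue code, one portion letter, then an optional
# ".<number>" extension placed AFTER the letter (the sorted-sample layout).
SORTED_TISSUE = re.compile(r"(\d{2})([A-Z])(?:\.(\d+))?")


def parse_sorted_tissue_field(field: str) -> tuple[str, str, Optional[int]]:
    """Return (tissue_code, portion_letter, extension or None)."""
    match = SORTED_TISSUE.fullmatch(field)
    if match is None:
        raise ValueError(f"unrecognized tissue field: {field!r}")
    tissue, portion, extension = match.groups()
    return tissue, portion, int(extension) if extension is not None else None


if __name__ == "__main__":
    print(parse_sorted_tissue_field("03A.1"))
    print(parse_sorted_tissue_field("03A"))
```

Details such as the markers used for sorting still have to come from the project metadata; the parser only separates the parts of the field.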

Here is an example of two subpopulations from a FACS sort of the same tissue sample:
![TARGET_image_6](images/TARGET_IMG_7.png)

Note: The specific antibodies used for sorting and the resulting sorted sample populations can be found in the associated OCG project metadata.
Additionally, a lowercase “c” before the antigen marker indicates that the antigen is intracellular rather than on the cell surface.

#HCMI ICD-10 Codes#
An abbreviated ICD-10 code is used to denote the anatomic site of the diagnostic tumor origin.
![TARGET_image_7](images/TARGET_IMG_7.png)
![TARGET_image_8](images/TARGET_IMG_8.png)

#HCMI: Multiple Model Codes#
Some cases may have multiple models derived from independent tumors (primary, recurrent, metastatic, etc.).
To distinguish between the unique models, each is identified by a letter following the ID3’s ICD-10
code. For example:

1. A – first cancer model
2. B – second cancer model

Note: For cases in which multiple models per subject are known at the time of ID3 assignment, the first model will have
suffix “A”, the second model suffix “B”, etc. While it would be useful if suffix “A” were always associated with the primary tumor, and the
other suffix letters with pre-malignant, recurrent, or metastatic tumors, this may not always be true. For example, if a model
has been successfully generated and has already gone through the CMDC pipeline (CDC-approved, shipped to BPC and ATCC), and
a model is later generated from another tumor, the second model will receive the “B” suffix. The ID3 of the first model will
not be changed to include the “A” suffix.

#HCMI Additional Model Codes#
If models from independent tumors from the same patient are generated, the samples will be identified by using the following
letter identifiers:

1. M – metastatic tumor model
2. N – second metastatic tumor model, from an alternative location
3. R – recurrent tumor model
4. S – second recurrent tumor model, from a later date than R
5. P – premalignant model
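
The letter identifiers above amount to a small lookup table. A minimal sketch follows, with the mapping copied from the list; the names `HCMI_MODEL_LETTERS` and `describe_model_letter` are hypothetical, and the sketch does not assume where the letter sits within the full ID.

```python
# Letter identifiers for additional HCMI models, copied from the list above.
HCMI_MODEL_LETTERS = {
    "M": "metastatic tumor model",
    "N": "second metastatic tumor model, from an alternative location",
    "R": "recurrent tumor model",
    "S": "second recurrent tumor model, from a later date than R",
    "P": "premalignant model",
}


def describe_model_letter(letter: str) -> str:
    """Return the description for an additional-model letter identifier."""
    try:
        return HCMI_MODEL_LETTERS[letter]
    except KeyError:
        raise ValueError(f"unknown model letter: {letter!r}") from None


if __name__ == "__main__":
    for letter in "MNRSP":
        print(letter, "=", describe_model_letter(letter))
```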

#TARGET: Additional tissue portion code sample identifier#
#Cancer Models: Cell Lines/Xenografts:#

Some tissues are propagated as cell lines or xenografts. Multiple cell lines or xenografts may be available for certain cases,
derived from the tumor at the time of surgery, at relapse, or during monitoring of therapeutic response. Various NCI projects have
decided to keep the codes for cell lines and xenografts “simple”, and OCG attempts to comply so that users can translate codes easily.
To address the issue of multiple in vitro cancer models per case, OCG projects use the extension “.1, .2, .3, etc.” following the
tumor tissue code within the sample name to differentiate the cell lines and xenografts; this extension precedes the letter identifier (unlike for sorted cells).
As with pre-extraction tissue manipulations, it is not possible to determine from the extension the time point at which the original tumor was obtained.
The extension simply denotes a difference in samples, and users must refer to the appropriate metadata for details. If xenografts or cell lines were established
from two separate aliquots/tissue sections (whether in the same lab or another), the letter in the tissue code will reflect it.


Here are some examples:
![TARGET_image_9](images/TARGET_IMG_9.png)
Note that the .1, .2, etc. conveys no information other than that there are multiple cell lines from this patient. In the above example, “.1” does not indicate that
this cell line was established from a tumor obtained pre-therapy, nor “.2” post-therapy. The number only indicates that they are separate isolates from a single case. In
addition, any case with only a single cell line or xenograft will not include the extension. The extension is used only in the few cases where multiple samples are available.

In the example above, “OCG-30-(USI)-50.2B-01(D, R, etc.)” was generated either in another laboratory
or from a different tissue aliquot than “OCG-30-(USI)-50.2A-01(D, R, etc.)”.
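
For cell lines and xenografts the extension comes before the portion letter, as in “50.2B”. The sketch below parses that layout; the pattern and function name are assumptions drawn from the example IDs in this section, and the extension is optional because cases with a single cell line or xenograft omit it.

```python
import re
from typing import Optional

# Assumption: two-digit tissue code, an optional ".<number>" extension, then
# one portion letter (the cell line/xenograft layout, e.g. "50.2B").
MODEL_TISSUE = re.compile(r"(\d{2})(?:\.(\d+))?([A-Z])")


def parse_model_tissue_field(field: str) -> tuple[str, Optional[int], str]:
    """Return (tissue_code, extension or None, portion_letter)."""
    match = MODEL_TISSUE.fullmatch(field)
    if match is None:
        raise ValueError(f"unrecognized tissue field: {field!r}")
    tissue, extension, portion = match.groups()
    return tissue, int(extension) if extension is not None else None, portion


if __name__ == "__main__":
    print(parse_model_tissue_field("50.2A"))
    print(parse_model_tissue_field("50.2B"))
    print(parse_model_tissue_field("50A"))
```

In this reading, “50.2A” and “50.2B” share the tissue code and extension and differ only in the portion letter, which matches the note that the “B” sample came from a different aliquot or laboratory.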
113 changes: 113 additions & 0 deletions docs/Encyclopedia/Under_Development/TARGET_Barcode.md
18 changes: 12 additions & 6 deletions docs/Encyclopedia/pages/MAGE-TAB.md
@@ -5,25 +5,31 @@ MicroArray Gene Expression Tabular (MAGE-TAB) archives are groups of tab-delimit

## Overview ##

MAGE-TAB archives provide information about the data collection and data processing that was not performed by the GDC. The files in these archives can include molecular protocols, computational pipelines, or information about file organization. MAGE-TAB archives are available for download at the GDC Legacy Archive<sup>1</sup> with data files that were processed with some TCGA pipelines and can be downloaded from file pages or retrieved from the API (`.../files?expand=metadata_files`). The files in a MAGE-TAB archive will include information about all files in a complete study, not just for the associated data file.
MAGE-TAB archives provide information about the data collection and data processing that was not performed by the GDC. The files in these archives are plain-text, tab-delimited files and can include molecular protocols, computational pipelines, or information about file organization. The files may also contain additional, non-standard headings that were significant to the experiment they describe. MAGE-TAB archives are available for download at the GDC Legacy Archive<sup>1</sup> with data files that were processed with some TCGA pipelines and can be downloaded from file pages or retrieved from the API (`.../files?expand=metadata_files`). The files in a MAGE-TAB archive will include information about all files in a complete study, not just for the associated data file. MAGE-TAB files adhere to the Minimum Information About a Microarray Experiment (MIAME) standard, which is described in the [MIAME Standard](https://pubmed.ncbi.nlm.nih.gov/11726920/) publication.



### Structure ###

Contents of a MAGE-TAB file:
Contents of a MAGE-TAB archive file:

* __Investigation Description Format (IDF):__ Provides general information about the study. This includes a brief description, the investigator's contact details, bibliographic references, and a text description of the protocols used in the study<sup>2</sup>.
* __Sample and Data Relationship Format (SDRF):__ Describes the relationships between samples, arrays, data, and other objects used or produced in the study<sup>2</sup>.
* __Array Design Format (ADF):__ Defines each array type used. An ADF file describes the design of an array, e.g., which sequence is located at each position on an array and associated annotations<sup>2</sup>.
* __Array Design Format (ADF):__ Defines each array type used. An ADF file describes the design of an array, e.g., which sequence is located at each position on an array and associated annotations<sup>2</sup>. ADF files are not mandatory in MAGE-TAB archives, whereas IDF and SDRF files are always required.
* __Description File:__ Provides details about how the data files and molecular material were processed.
* __Changes File:__ Provides details about any changes that were made to each of the files in the MAGE-TAB archive
* __Manifest File:__ Lists file names and md5sums of the files that should be included in the MAGE-TAB archive
* __Readme File:__ Provides basic details about the MAGE-TAB archive and the associated study
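
Because the files in a MAGE-TAB archive are plain-text and tab-delimited, they can be read with a standard CSV reader. The following is a minimal sketch using Python's `csv` module; the function name and the file name in the usage comment are hypothetical. It keeps the header as a list instead of building per-row dictionaries, on the assumption that some files (SDRF in particular) may repeat column names.

```python
import csv
from typing import Iterable


def read_tab_delimited(lines: Iterable[str]) -> tuple[list[str], list[list[str]]]:
    """Split tab-delimited text into (header, rows).

    `lines` can be an open text file or any iterable of strings.
    The header is kept as a plain list so repeated column names survive.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader, [])
    rows = [row for row in reader if row]  # skip completely empty lines
    return header, rows


if __name__ == "__main__":
    # Hypothetical usage with a file extracted from an archive:
    #     with open("example.sdrf.txt", newline="") as handle:
    #         header, rows = read_tab_delimited(handle)
    sample = ["Source Name\tProtocol REF\tProtocol REF\n", "s1\tp1\tp2\n"]
    header, rows = read_tab_delimited(sample)
    print(header)
    print(rows)
```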

![MAGETAB](images/MAGETAB_img.svg)
The following figure depicts the association among different files in a MAGE-TAB archive. The "raw data files" exist in data archives such as the GDC Legacy Archive.



## References ##
1. [GDC Legacy Archive](https://portal.gdc.cancer.gov/legacy-archive/)
2. [TCGA Encyclopedia - MAGE-TAB](https://wiki.nci.nih.gov/display/TCGA/MAGE-TAB)
2. [BMC Bioinformatics - MAGE-TAB](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-489)


## External Links ##
* N/A

Categories: Data Format