deploy: 4d9af25
ypriverol committed Aug 25, 2024
0 parents commit 9400ed3
Showing 16 changed files with 2,878 additions and 0 deletions.
Empty file added .nojekyll
Empty file.
1 change: 1 addition & 0 deletions CNAME
@@ -0,0 +1 @@
proteomics-sample-metadata.bigbio.io
554 changes: 554 additions & 0 deletions README.adoc

Large diffs are not rendered by default.

86 changes: 86 additions & 0 deletions additional.rst
@@ -0,0 +1,86 @@
Additional conventions
########################

Specific use cases and conventions
*************************************

Conventions define how to encode some particular information in the file format by supporting specific use cases. Conventions define a set of new columns that are needed to represent a particular use case or experiment type (e.g., phosphorylation-enriched dataset). In addition, conventions define how some specific free-text columns (values that are not defined as ontology terms) should be written.

Conventions are proposed and compiled through issues at https://github.com/bigbio/proteomics-sample-metadata/issues or by submitting a pull request. New conventions will be added to future versions of this specification document. Unlike other PSI formats, this one is expected to need more regular updates to explain how new use cases can be accommodated.

How to encode age and other elapsed times
==========================================

One of the characteristics of a sample can be the age of an individual. It is RECOMMENDED to provide the age in the following format: {X}Y{X}M{X}D. Some valid examples are:

- 40Y (forty years)
- 40Y5M (forty years and 5 months)
- 40Y5M2D (forty years, 5 months, and 2 days)

When needed, weeks can also be used, for example 8W (eight weeks).

Age interval:

Sometimes the sample does not have an exact age but contains a range of ages. To annotate an age range the following convention is RECOMMENDED:

40Y-85Y

This means that the subject (sample) is between 40 and 85 years old.
Other temporal information can be encoded similarly.
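
The age convention above can be checked mechanically. The following is a minimal sketch, not part of the specification: the function names ``parse_age`` and ``parse_age_range`` are hypothetical, and the position of weeks (between months and days) is an assumption, since the text only shows ``8W`` on its own.

```python
import re

# Assumed component order: years, months, weeks, days (the position of W is a guess).
_AGE = re.compile(r"^(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?$")


def parse_age(value):
    """Return the components present in an age string such as '40Y5M2D'."""
    match = _AGE.match(value)
    # An empty string matches the pattern, so also require at least one component.
    if not match or not any(match.groups()):
        raise ValueError(f"not a valid age: {value!r}")
    names = ("years", "months", "weeks", "days")
    return {n: int(g) for n, g in zip(names, match.groups()) if g is not None}


def parse_age_range(value):
    """Return (low, high) for '40Y-85Y', or (age, age) for a single age."""
    low, sep, high = value.partition("-")
    if not sep:
        age = parse_age(value)
        return age, age
    return parse_age(low), parse_age(high)
```

For example, ``parse_age("40Y5M")`` yields the years and months components only, and ``parse_age_range("40Y-85Y")`` yields one parsed age per end of the interval.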

Phosphoproteomics and other post-translational modifications enriched studies
=============================================================================

In PTM-enriched experiments, the characteristics[enrichment process] SHOULD be provided. The different values already included in EFO are:

- enrichment of phosphorylated proteins
- enrichment of glycosylated proteins

This characteristic can also be used as factor value[enrichment process] to distinguish the expression of proteins in the phospho-enriched sample from that in the control.

Synthetic peptide libraries
===========================

It is common to use synthetic peptide libraries for multiple use cases including:

- Benchmark of analytical and bioinformatics methods and algorithms.
- Improvement of peptide identification/quantification using spectral libraries.

When describing synthetic peptide libraries, most of the sample metadata can be declared as “not applicable”. However, some authors also annotate the organism, for example because they know that the library has been designed from specific peptide species; see the following experiment containing synthetic peptides (`Example PXD000759 <https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD000759>`_).

In these cases, it is important to annotate that the sample is composed of a synthetic peptide library. This can be done by adding the **characteristics[synthetic peptide]**. The possible values are “synthetic”, “not synthetic” or “mixed”.

Normal and healthy samples
==========================

Samples from healthy patients or individuals normally appear in manuscripts and are often annotated as healthy or normal. We RECOMMEND using the word “normal” mapped to the CV term PATO_0000461, which is also included in EFO: `normal PATO term <https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000461>`_.

Example:

.. list-table:: Minimum data metadata for any proteomics dataset
   :widths: 14 14 14 14 14 14
   :header-rows: 1

   * - source name
     - characteristics[organism]
     - characteristics[organism part]
     - characteristics[phenotype]
     - characteristics[compound]
     - factor value[phenotype]
   * - sample_treat
     - homo sapiens
     - liver
     - necrotic tissue
     - drug A
     - necrotic tissue
   * - sample_control
     - homo sapiens
     - liver
     - normal
     - none
     - normal
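
Because SDRF files are tab-delimited, the example table above can be written out with the Python standard library. This is only an illustrative sketch: the names ``COLUMNS``, ``ROWS`` and ``to_sdrf_tsv`` are hypothetical, and a real SDRF-Proteomics file needs more columns than the six shown here.

```python
import csv
import io

# Column headers and rows copied from the example table above.
COLUMNS = [
    "source name",
    "characteristics[organism]",
    "characteristics[organism part]",
    "characteristics[phenotype]",
    "characteristics[compound]",
    "factor value[phenotype]",
]
ROWS = [
    ["sample_treat", "homo sapiens", "liver", "necrotic tissue", "drug A", "necrotic tissue"],
    ["sample_control", "homo sapiens", "liver", "normal", "none", "normal"],
]


def to_sdrf_tsv(columns, rows):
    """Serialize a header and its rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling ``to_sdrf_tsv(COLUMNS, ROWS)`` returns three lines of text: the header followed by one line per sample, with the control sample carrying the value ``normal`` in both phenotype columns.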

Multiple projects into one annotation file
==========================================

It may be necessary to annotate multiple ProteomeXchange datasets in one large SDRF-Proteomics file, e.g., for reanalysis purposes. In that case, it is RECOMMENDED to use the column comment[proteomexchange accession number] to differentiate between datasets.
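
As a rough illustration of how a downstream tool might use this column, the sketch below groups sample names by accession. The function name ``group_by_dataset`` and the contents of ``EXAMPLE`` (including the accession values used as placeholders) are assumptions made for this example only.

```python
import csv
import io
from collections import defaultdict

ACCESSION_COLUMN = "comment[proteomexchange accession number]"

# Hypothetical combined file: two datasets, three samples (placeholder values).
EXAMPLE = (
    "source name\tcharacteristics[organism]\t" + ACCESSION_COLUMN + "\n"
    "sample_1\thomo sapiens\tPXD000759\n"
    "sample_2\thomo sapiens\tPXD000759\n"
    "sample_3\thomo sapiens\tPXD000001\n"
)


def group_by_dataset(sdrf_text):
    """Map each accession to the list of source names annotated with it."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row[ACCESSION_COLUMN]].append(row["source name"])
    return dict(groups)
```

With the placeholder text above, ``group_by_dataset(EXAMPLE)`` separates the three samples into one group per accession, which is the distinction the column is meant to preserve in a merged file.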
48 changes: 48 additions & 0 deletions conf.py
@@ -0,0 +1,48 @@
# Configuration file for the Sphinx documentation builder.
#
# For a full list of options see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

import os
import sys

# -- Project information -----------------------------------------------------

project = 'proteomics sample metadata'
author = 'Yasset Perez-Riverol'
release = '1.1'

# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings.
# These can be extensions coming with Sphinx or custom ones.
extensions = [
    'sphinx_asciidoc',  # AsciiDoc support
]

# The master toctree document.
master_doc = 'index'

# The suffix(es) of source filenames.
source_suffix = {
    '.rst': 'restructuredtext',
    '.adoc': 'asciidoc',  # Include .adoc files
}

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []

# The theme to use for HTML and HTML Help pages.
html_theme = 'sphinx_rtd_theme'

# -- Options for sphinx-asciidoc --------------------------------------------

# Additional arguments to pass to asciidoctor
asciidoc_args = ['-a', 'toc=left', '-a', 'sectnums']

# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
# sys.path.insert(0, os.path.abspath('.'))
54 changes: 54 additions & 0 deletions documentation.rst
@@ -0,0 +1,54 @@
Additional information
=========================

Ontologies/Controlled Vocabularies Supported
---------------------------------------------

The following ontologies/controlled vocabularies (CVs) are supported:

- PSI Mass Spectrometry CV (`PSI-MS <https://www.ebi.ac.uk/ols/ontologies/ms>`_)
- Experimental Factor Ontology (`EFO <https://www.ebi.ac.uk/ols/ontologies/efo>`_)
- Unimod protein modification database for mass spectrometry (`UNIMOD <https://www.ebi.ac.uk/ols/ontologies/unimod>`_)
- PSI-MOD CV (`PSI-MOD <https://www.ebi.ac.uk/ols/ontologies/mod>`_)
- Cell line ontology (`CLO <https://www.ebi.ac.uk/ols/ontologies/clo>`_)
- Drosophila anatomy ontology (`FBBT <https://www.ebi.ac.uk/ols/ontologies/fbbt>`_)
- Cell ontology (`CL <https://www.ebi.ac.uk/ols/ontologies/cl>`_)
- Plant ontology (`PO <https://www.ebi.ac.uk/ols/ontologies/po>`_)
- Uber-anatomy ontology (`UBERON <https://www.ebi.ac.uk/ols/ontologies/uberon>`_)
- Zebrafish anatomy and development ontology (`ZFA <https://www.ebi.ac.uk/ols/ontologies/zfa>`_)
- Zebrafish developmental stages ontology (`ZFS <https://www.ebi.ac.uk/ols/ontologies/zfs>`_)
- Plant Environment Ontology (`PEO <https://www.ebi.ac.uk/ols/ontologies/peo>`_)
- FlyBase Developmental Ontology (`FBdv <https://www.ebi.ac.uk/ols/ontologies/fbdv>`_)
- Rat Strain Ontology (`RSO <https://www.ebi.ac.uk/ols/ontologies/rso>`_)
- Chemical Entities of Biological Interest Ontology (`CHEBI <https://www.ebi.ac.uk/ols/ontologies/chebi>`_)
- NCBI organismal classification (`NCBITaxon <https://www.ebi.ac.uk/ols/ontologies/ncbitaxon>`_)
- PATO - the Phenotype and Trait Ontology (`PATO <https://www.ebi.ac.uk/ols/ontologies/pato>`_)
- PRIDE Controlled Vocabulary (`PRIDE <https://www.ebi.ac.uk/ols/ontologies/pride>`_)

Relations with other formats
-----------------------------------------------

SDRF-Proteomics is fully compatible with the SDRF file format that is part of `MAGE-TAB <https://www.ebi.ac.uk/arrayexpress/help/magetab_spec.html>`_, the file format used to store metadata and sample information for transcriptomics experiments.
MAGE-TAB (MicroArray Gene Expression Tabular) is a standard format for storing and exchanging microarray and other high-throughput genomics data. It consists of two spreadsheets for each experiment: the Investigation Description Format (IDF) file and the Sample and Data Relationship Format (SDRF) file.

The IDF file contains general information about the experiment, such as the project title, description, and funding sources, as well as details about the experimental design, such as the type of technology used, the organism studied, and the experimental conditions.
The SDRF file contains detailed information about the samples and the data generated from them, including sample annotations, data file locations, and data processing parameters. It also defines the relationships between samples, such as replicates or time-course experiments. Together, the IDF and SDRF files provide a complete description of the experiment and the data generated from it, allowing researchers to share and compare their data with others in a standardized and interoperable format.

SDRF-Proteomics sample information can be embedded into mzTab metadata files. The mzTab (Mass Spectrometry Tabular) format is a standard format for reporting the results of proteomics and metabolomics experiments. It can be used to store information such as protein identification, peptide sequences, and quantitation results.
The mzTab format allows for the embedding of sample metadata into the file, which includes information about the samples and the experimental conditions. This metadata can be derived from the Sample and Data Relationship Format (SDRF) file in a proteomics experiment.
In the mzTab format, sample metadata is stored in a separate section called the "metadata section," which contains a list of key-value pairs that describe the samples. The keys in the metadata section correspond to the column names in the SDRF file, and the values correspond to the values in the Sample cells.
By embedding sample metadata into the mzTab file, researchers can ensure that all relevant information about the experiment is stored in a single file, making it easier to share and compare data with others.


Documentation
-----------------------------

The official website for the SDRF-Proteomics project is https://github.com/bigbio/proteomics-sample-metadata. New use cases, changes to the specification, and examples can be proposed through pull requests or issues on GitHub (see the introduction to `GitHub <https://lab.github.com/githubtraining/introduction-to-github>`_).

A set of examples and annotated projects from ProteomeXchange can be `found here <https://github.com/bigbio/proteomics-sample-metadata/tree/master/annotated-projects>`_.

Multiple tools have been implemented to validate SDRF-Proteomics files:

- `sdrf-pipelines <https://github.com/bigbio/sdrf-pipelines>`_ (Python): This tool allows a user to validate an SDRF-Proteomics file. In addition, it can convert SDRF files into configuration files for other popular pipelines and software, such as MaxQuant or OpenMS.

- `jsdrf <https://github.com/bigbio/jsdrf>`_ (Java): This Java library and tool allows a user to validate SDRF-Proteomics files. It also includes a generic data model that can be used by Java applications.
Binary file added images/contact.png
Binary file added images/sample-metadata.png
Binary file added images/sdrf-nutshell.png