
The quantms.io format

1. Introduction

Most HUPO-PSI formats, including mzML and mzIdentML, are XML-based, which makes them difficult to use for large-scale analyses and AI model development. mzTab, the earlier attempt to move away from XML, falls short of providing a tab-delimited format that scales with the size of the data. Here, we aim to formalize and develop a more standardized format that better represents identification and quantification results and also enables new use cases for proteomics data analysis. The main use cases for the format are:

  • Fast and easy visualization of the identification and quantification results.

  • Easy integration with other omics data.

  • Easy integration with sample metadata.

  • AI/ML model development based on identification and quantification results.

  • Easy data retrieval for big datasets and large-scale collections of proteomics data.

ℹ️

We are not trying to do the following:

  • Replace the mzTab format. The aim is to provide a new format that enables AI-related use cases.

  • Replace all the file formats and intermediate files of software tools. The aim is to provide a new format that enables easy integration of the main output results with other tools.

2. General data model and structure

The quantms.io format (.qms) can be seen as a multiple-view representation of the results of a proteomics data analysis, similar to tools that produce multiple output files for their analysis, such as MaxQuant, DIA-NN, FragPipe, or Spectronaut. Each view of the format can be serialized in different formats depending on the use case. The data model defines two main things: the view and how the view is serialized. Both views and serializations can be extended, and new views can be added in each new version of the specification (Chapter 6).

formats relation
  • The data model view defines the structure, the fields, and the properties that will be included in a view for each peptide, psm, feature, or protein.

  • The data serialization defines the format in which the view will be serialized and what features of serialization will be supported, for example, compression, indexing, or slicing.

view         | file class        | serialization format | definition
mz           | mz_file           | parquet              | Chapter 15
psm          | psm_file          | parquet              | Section 12.1
feature      | feature_file      | parquet              | Section 12.2
pg           | pg_file           | parquet              | Section 13.1
peptide      | peptide_file      | parquet              | Section 12.3
protein      | protein_file      | parquet              | Chapter 13
absolute     | absolute_file     | tsv                  | Chapter 10
differential | differential_file | tsv                  | Chapter 11
sdrf         | sdrf_file         | tsv                  | Chapter 9
project      | -                 | json                 | Chapter 8

ℹ️
Some of these data models fit some analytical methods better than others. For example, the psm view (Section 12.1) is more suitable for data-dependent acquisition (DDA) methods and may not be present for data-independent acquisition (DIA) methods, while the feature view (Section 12.2) can be generated for both DDA and DIA methods. The differential expression view (Chapter 11) is only present for experiments in which conditions are compared, while the absolute expression view (Chapter 10, based on iBAQ values) is only available for datasets where no comparisons between conditions are performed.

The .qms contains all the files of a quantms.io experiment: the metadata files and the different views of the experiment (Chapter 2).

3. Common data structures and formats

Some concepts are shared by several outputs. They are defined and explained here:

3.1. Peptidoform

A peptidoform is a peptide sequence with modifications. For example, the peptide sequence PEPTIDM with an Oxidation modification would be PEPTIDM[Oxidation]. The peptidoform should be written using the ProForma specification. This concept is used in the following outputs:

3.2. Modifications

A modification is a chemical change in the peptide sequence. Modifications can be annotated in multiple ways in quantms.io format:

  • As part of the ProForma notation inside the peptide, enclosed in square brackets (for example, [Oxidation]), using the modification name or accession: for example, Oxidation or UNIMOD:35. It is RECOMMENDED to report modifications using UNIMOD. If a modification is not defined in UNIMOD, a CHEMMOD definition must be used, like CHEMMOD:-18.0913, where the number is the mass shift in Daltons.

  • As a list of modification names for each peptidoform for easy integration and filtering of the given peptide evidence. For example, Oxidation;Phosphorylation.

  • Full modification annotation with the given position, modification name, and quality score. In this case, modifications will be encoded as:

    • Accession or name: The modification accession or name. For example, CHEMMOD:-18.0913, UNIMOD:35 or Oxidation.

    • Position: The position of the modification in the peptide sequence. Terminal modifications in proteins and peptides MUST be reported with the position set to 0 (N-terminal) or the amino acid length +1 (C-terminal) respectively. For example, 1 or 1,2,3.

    • Localization Probability: The probability of the modification being in the reported position.

These three properties can be combined into a single string as:

{position}({Probabilistic Score:0.9})|{position2}|..-{modification accession or name}

1(Probabilistic Score:0.8)|2(Probabilistic Score:0.9)|3-UNIMOD:35

When represented in parquet files (Section 12.1, Section 12.2), the modification details are a list of structs:

  [
    {
      "name": "UNIMOD:35",
      "fields": [
        {
          "position": 2,
          "localization_probability": 0.94
        },
        {
          "position": 12,
          "localization_probability": 0.06
        }
      ]
    },
    {
      "name": "UNIMOD:0",
      "fields": [
        {
          "position": 0,
          "localization_probability": 0.92
        },
        {
          "position": 16,
          "localization_probability": 0.08
        }
      ]
    }
  ]
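The compact string form and the list-of-struct form carry the same information, so one can be derived from the other. The following is a minimal sketch of such a conversion, not part of the specification: the function name parse_modification is hypothetical, and it assumes that positions are non-negative integers and that a position without a score has no localization probability.

```python
import re

# One position token, optionally followed by "(score name:score value)".
_POSITION_RE = re.compile(r"^(\d+)(?:\(([^:()]+):([0-9.]+)\))?$")


def parse_modification(text):
    """Convert '1(Probabilistic Score:0.8)|3-UNIMOD:35' into a struct-like dict."""
    # Split on the first '-' only: positions never contain '-',
    # but names such as 'CHEMMOD:-18.0913' do.
    positions, sep, name = text.partition("-")
    if not sep or not name:
        raise ValueError(f"missing modification name in {text!r}")
    fields = []
    for token in positions.split("|"):
        match = _POSITION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"cannot parse position token {token!r}")
        position, _score_name, score = match.groups()
        fields.append({
            "position": int(position),
            # Assumption: no score means the probability is unknown (None).
            "localization_probability": float(score) if score is not None else None,
        })
    return {"name": name, "fields": fields}


parsed = parse_modification(
    "1(Probabilistic Score:0.8)|2(Probabilistic Score:0.9)|3-UNIMOD:35"
)
print(parsed["name"], [f["position"] for f in parsed["fields"]])
```

Splitting on the first hyphen keeps CHEMMOD accessions with negative mass shifts intact.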

3.3. Scan (scan number)

The scan number (scan) points to an MS/MS spectrum in a raw, mzML, or peak list file (e.g., MGF). mzIdentML, mzTab, USI, and other HUPO-PSI standards use and define the scan number in different ways. Here we use the latest definition from USI. A single scan points to one MS/MS spectrum in the spectra file. The scan is a unique identifier, and it can be a number or a string depending on the instrument.

  • AB Sciex nativeId: sample=1 period=1 cycle=2740 experiment=10 → 1,1,2740,10. In this scenario, where a reference to the original scan event is desired but a single scan number is not sufficient, we use the nativeId mechanism.

  • Waters nativeId: function=10 process=1 scan=345 → 10,1,345

  • Bruker nativeId: frame=120 scan=475 → 120,475

  • Thermo scan: controllerType=0 controllerNumber=1 scan=43920 → 43920

Note: the controllerType and controllerNumber are always 0 and 1 for mass spectra, so only the scan number is reported. In rare cases, if either controllerType is not 0 or controllerNumber is not 1 (e.g., a PDA spectrum is being referenced), then the nativeId form MUST be used: controllerType=5 controllerNumber=1 scan=7 → 5,1,7

The scan is used in the following sections: Section 12.1, Section 12.2, Chapter 15.

ℹ️

Normally only the scan value is captured in the column, while the format of the scan (nativeId, scan, or index) should be captured in the metadata of the file. However, some types of analyses may produce more than one type of scan in the same file (e.g., when merging multiple experiments). In this case, each scan MUST be prefixed by the type of scan. For example, nativeId:1,1,2740,10 or scan:43920.
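As an illustration of the compact values used above (such as 1,1,2740,10 and 43920), the sketch below reduces a vendor nativeId string to the value stored in the scan column and adds the type prefix for mixed files. The function names are hypothetical, and the parsing assumes that a nativeId is a space-separated list of key=value pairs.

```python
import re


def compact_native_id(native_id):
    """Reduce a nativeId string to the comma-separated values of its key=value pairs."""
    pairs = re.findall(r"(\w+)=(\S+)", native_id)
    if not pairs:
        raise ValueError(f"not a nativeId: {native_id!r}")
    keys = [key for key, _ in pairs]
    values = [value for _, value in pairs]
    # Thermo mass spectra always use controllerType=0 controllerNumber=1,
    # so the scan number alone is enough.
    if keys == ["controllerType", "controllerNumber", "scan"] and values[:2] == ["0", "1"]:
        return values[2]
    return ",".join(values)


def prefixed_scan(native_id):
    """Prefix the compact value with its type, for files that mix scan types."""
    compact = compact_native_id(native_id)
    kind = "nativeId" if "," in compact else "scan"
    return f"{kind}:{compact}"


print(prefixed_scan("sample=1 period=1 cycle=2740 experiment=10"))
print(prefixed_scan("controllerType=0 controllerNumber=1 scan=43920"))
```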

3.4. Identification scores

Every workflow within quantms uses different identification/quantification scores to determine the quality of the identification or the quantification. additional_scores in quantms captures multiple scores from different workflows, such as Comet:xcorr or DIA-NN:Q.Value. Additional scores are stored as key/value pairs where the key is the name of the score (it is RECOMMENDED to use the HUPO-PSI MS ontology) and the value is the score value. For example:

  • [Comet:xcorr:67.8, DIA-NN:Q.Value:0.01]

This concept is used in the following outputs:

3.5. Controlled vocabulary terms

The following views use controlled vocabularies to describe the data: Section 12.1, Section 12.2, Chapter 15. Controlled vocabulary terms standardize the data and make it easier to integrate with other datasets. They are stored as key/value pairs where the key is the name of the controlled vocabulary term and the value is the term value. For example:

  • ["ms level": "2", "deconvoluted data": null]

The name/key of the controlled vocabulary MUST be provided; the value is optional.
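A minimal sketch of how such terms could be turned into the list of cv_name/cv_value structs used by the cv_params field of the psm view (Section 12.1.2). The helper name to_cv_params is hypothetical, and storing the value as a string follows the type given for that field.

```python
def to_cv_params(terms):
    """Turn a {name: value} mapping into a list of {cv_name, cv_value} structs."""
    params = []
    for name, value in terms.items():
        if not name:
            raise ValueError("the name of a controlled vocabulary term MUST be provided")
        params.append({
            "cv_name": str(name),
            # The value is optional; when present it is stored as a string.
            "cv_value": None if value is None else str(value),
        })
    return params


cv_params = to_cv_params({"ms level": 2, "deconvoluted data": None})
print(cv_params)
```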

4. Serialization formats

The quantms.io format has different serialization formats for each view. The serialization format defines how the view will be serialized and what features of serialization will be supported, for example, compression, indexing, or slicing. The following serialization formats are supported:

  • tsv: Tab-separated values format.

  • parquet: Apache Parquet format.

  • json: JavaScript Object Notation format.

4.1. Parquet format

Parquet is a columnar storage format that supports nested data. Apache Parquet is an open-source format designed for efficient data storage and retrieval. It offers high-performance compression and encoding schemes, making it well-suited for handling large volumes of complex data. Parquet is widely supported across various programming languages and analytics tools.

Apache Parquet includes three types of metadata: file metadata, column (chunk) metadata, and page header metadata. File metadata contains pointers to the starting locations of all the column metadata, while column metadata holds location information for the individual column chunks. Readers first access the file metadata to find the column chunks they need, then read those column chunks, using the page headers to skip over irrelevant pages.

A Parquet table can be distributed across multiple compute nodes, and its key advantage is that applications can quickly jump to the relevant fields in a record using metadata. For large-scale analyses, Parquet has helped users reduce storage requirements by at least one-third on large datasets. Additionally, it significantly improves scan and deserialization times (important for web-based use cases), thus reducing overall costs.

Project   | Type     | Original file size (GB) | Converted parquet size (MB) | Writing psm time (s) | Writing feature time (s)
PXD046440 | maxquant | 48                      | 337/343                     | 985.2671835          | 678.474133
PXD016999 | mzTab    | 160                     | 155/228                     | 539.0019641          | 3554.52738
PXD019909 | diaNN    | 1.9                     | 195                         | 229.482332

4.1.1. Parquet features

  • Columnar Storage: Parquet’s columnar design improves compression and query performance by storing data by columns rather than rows, which reduces I/O for analytical queries that typically access only a few columns.

  • Efficient Compression: The format achieves better compression ratios with algorithms like Snappy, Gzip, and LZO, and uses techniques like run-length encoding (RLE) and dictionary encoding for further optimization.

  • Schema Evolution: Parquet supports adding, deleting, or modifying columns without affecting existing data, making it adaptable to schema changes.

  • Complex Data Types: Supports nested structures and data types like arrays, maps, and structs, allowing efficient storage of complex data.

4.1.2. Parquet slicing

quantms.io supports slicing parquet files by any field when generating them. Upon storage, the files are organized into distinct folders according to the chosen slicing fields.

PXD004683/
│
├── sample_accession_1/
│   ├── file1.parquet
│   └── file2.parquet
│
├── sample_accession_2/
│   ├── file3.parquet
│   └── file4.parquet
│
└── sample_accession_3/
    ├── file5.parquet
    └── file6.parquet
...
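The sketch below illustrates the folder layout only. It groups rows by the folder they would be written to, assuming one folder level per slicing field, named after the field value. The function name slice_rows and the example rows are hypothetical, and writing the parquet files themselves is left out.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def slice_rows(rows, root, partition_fields):
    """Group rows by target folder: one folder level per slicing field value."""
    folders = defaultdict(list)
    for row in rows:
        levels = [str(row[field]) for field in partition_fields]
        folders[str(PurePosixPath(root, *levels))].append(row)
    return dict(folders)


# Hypothetical feature rows; only the slicing field matters here.
rows = [
    {"sample_accession": "sample_accession_1", "intensity": 1.5e6},
    {"sample_accession": "sample_accession_2", "intensity": 2.0e6},
    {"sample_accession": "sample_accession_1", "intensity": 9.1e5},
]
sliced = slice_rows(rows, "PXD004683", ["sample_accession"])
for folder, members in sorted(sliced.items()):
    print(folder, len(members))
```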

When sliced parquet files are registered in project.json (Chapter 8), the entry has the following format:

  "quantms_files": [
    {
      "feature_file": [
        {
          "path_name": "PXD004683",
          "is_folder": true,
          "partition_fields": ["sample_accession"]
        }
      ]
    }
  ]

5. File extensions

File extensions are used to identify the file type. In quantms.io the extensions are constructed as follows: *.{view}.{format} where the view is one of the well-defined views in the specification and the format is one of the serialization formats. For example:

  • An absolute expression file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.absolute.tsv

  • A differential expression file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.differential.tsv

  • A feature file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.feature.parquet

  • A psm file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.psm.parquet

ℹ️
In quantms.io we use a UUID to identify the project and its files: {PREFIX}-{UUID}.{view}.{format}. The UUID is optional, but most of the code examples use it. A Universally Unique Identifier (UUID), as defined in RFC 4122 (A Universally Unique Identifier (UUID) URN Namespace), provides a standardized method for generating globally unique identifiers across systems and applications. Each generated UUID is highly unlikely to collide with any other UUID, even when produced by different entities and systems.
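A minimal sketch of building and parsing file names that follow the {PREFIX}-{UUID}.{view}.{format} pattern. The function names are hypothetical. The sets of views and formats are taken from the tables in this document, and the sketch does not check that a view is paired with its usual serialization format.

```python
import re
import uuid

VIEWS = {"mz", "psm", "feature", "pg", "peptide", "protein",
         "absolute", "differential", "sdrf"}
FORMATS = {"parquet", "tsv", "json"}

_NAME_RE = re.compile(r"^(?P<stem>.+)\.(?P<view>[a-z]+)\.(?P<format>[a-z]+)$")


def build_file_name(prefix, view, fmt, file_uuid=None):
    """Return '{prefix}-{uuid}.{view}.{format}', generating a random UUID if none is given."""
    if view not in VIEWS or fmt not in FORMATS:
        raise ValueError(f"unsupported view/format: {view}.{fmt}")
    file_uuid = file_uuid or uuid.uuid4()
    return f"{prefix}-{file_uuid}.{view}.{fmt}"


def parse_file_name(name):
    """Return (view, format) taken from the two trailing extensions of a file name."""
    match = _NAME_RE.match(name)
    if match is None or match["view"] not in VIEWS or match["format"] not in FORMATS:
        raise ValueError(f"not a quantms.io file name: {name!r}")
    return match["view"], match["format"]


print(build_file_name("PXD000000", "psm", "parquet"))
print(parse_file_name("PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.feature.parquet"))
```

Because the UUID is optional, the parser only relies on the last two extensions and accepts names without a UUID as well.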

6. Versioning

The version has the structure {major release}.{minor update}. The current quantms.io specification version is: 1.0

  • All views (Section 12.1, Section 12.2, Section 13.1) and serialization formats carry a version number in the form quantmsio_version: {}. This helps to identify the version of the specification used to generate the file.

  • Major release changes will be backward incompatible, while minor updates will be backward compatible.
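A reader could use these rules to decide whether it can handle a file. The sketch below is one possible interpretation, not part of the specification: it assumes that a reader supports files with the same major release and with a minor update no newer than its own. The function names are hypothetical.

```python
def parse_version(text):
    """Split '{major release}.{minor update}' into a pair of integers."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def can_read(file_version, reader_version):
    """Same major release, and the file's minor update is not newer than the reader's."""
    file_major, file_minor = parse_version(file_version)
    reader_major, reader_minor = parse_version(reader_version)
    return file_major == reader_major and file_minor <= reader_minor


print(can_read("1.0", "1.2"))
print(can_read("2.0", "1.2"))
```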

7. Software provider

The data within quantms.io is mainly generated by the quantms workflow. However, the format is open and can be used by any software provider that wants to generate data in this format. The software provider and the version of the software used to generate the data are stored in the project view (Chapter 8) as:

"software_provider": {
    "name": "quantms",
    "version": "1.3.0"
  }

8. Project quantms.io

The project view is the file that stores the metadata of the entire quantms.io project. The project view is a JSON file that contains the following fields:

8.1. Project fields

Field | Description | Type
project_accession | Project accession identifier | string
project_title | Title of the project | string
project_description | Description of the project | string
project_sample_description | Description of the project sample | string
project_data_description | Description of the project data | string
project_pubmed_id | PubMed ID associated with the project | int32
organisms | List of organisms involved in the project | list[string], null
organism_parts | Parts of organisms studied | list[string], null
diseases | Diseases associated with the study | list[string], null
cell_lines | Cell lines used in the study | list[string], null
instruments | Instruments used for data acquisition | list[string]
enzymes | Enzymes used in the study | list[string]
experiment_type | Types of experiments conducted | list[string]
acquisition_properties | Properties of the data acquisition methods | list[key/value]
quantms_files | Files related to the quantms analysis | list[key/value]
quantmsio_version | Version of the quantms.io specification | string
software_provider | The software provider (Chapter 7) used to generate the data | key/value
comments | Additional comments or notes | list[string]

  • key/value pair object: The key/value pairs are used to store the acquisition properties, and the quantms files.

Example of acquisition_properties:

   "acquisition_properties": [
        {"precursor tolerance": "0.05 Da"},
        {"dissociation method": "HCD"}
   ]

8.2. Project files

In the current version (Chapter 6), the files within a project are optional. Files within a project should be listed in quantms_files; for every file the following information is necessary:

  • path_name: The name of the file or folder.

  • is_folder: A boolean value that indicates if the file is a folder or not.

  • partition_fields: The fields that are used to partition the data in the file. This is used to optimize the data retrieval and filtering of the data. This field is optional.

ℹ️
Parquet files can be stored as folders when the data is partitioned by some fields. For example, a parquet file that is partitioned by the sample_accession field will be stored as a folder with the name of the field and the value of the field.

Example of quantms_files:

   {
  "quantms_files": [
    {
      "psm_file": [
        {
          "path_name": "PXD004683-550e8400-e29b-41d4.1.psm.parquet",
          "is_folder": false
        },
        {
          "path_name": "PXD004683-550e8400-e29b-41d4.2.psm.parquet",
          "is_folder": false
        }
      ]
    },
    {
      "feature_file": [
        {
          "path_name": "PXD004683",
          "is_folder": true,
          "partition_fields": ["sample_accession"]
        }
      ]
    },
    {
      "differential_file": [
        {
          "path_name": "PXD004683-a716.differential.tsv",
          "is_folder": false
        }
      ]
    },
    {
      "absolute_file": [
        {
          "path_name": "PXD004683-e29b-41f4-a716.absolute.tsv",
          "is_folder": false
        }
      ]
    },
    {
      "sdrf_file": [
        {
          "path_name": "PXD004683-e29b-41f4-a716.sdrf.tsv",
          "is_folder": false
        }
      ]
    }
  ]
}
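The rules above can be checked mechanically before a project file is published. The following is a minimal sketch, not an official validator: the function name check_quantms_files is hypothetical, and the set of accepted file classes is taken from the table in Chapter 2.

```python
import json

FILE_CLASSES = {
    "mz_file", "psm_file", "feature_file", "pg_file", "peptide_file",
    "protein_file", "absolute_file", "differential_file", "sdrf_file",
}


def check_quantms_files(project):
    """Return a list of problems found in the quantms_files section (empty when valid)."""
    problems = []
    for group in project.get("quantms_files", []):
        for file_class, entries in group.items():
            if file_class not in FILE_CLASSES:
                problems.append(f"unknown file class {file_class!r}")
            for entry in entries:
                if not isinstance(entry.get("path_name"), str) or not entry["path_name"]:
                    problems.append(f"{file_class}: path_name is missing")
                if not isinstance(entry.get("is_folder"), bool):
                    problems.append(f"{file_class}: is_folder must be true or false")
                # partition_fields is optional, but must be a list when given.
                if not isinstance(entry.get("partition_fields", []), list):
                    problems.append(f"{file_class}: partition_fields must be a list")
    return problems


project = json.loads("""
{
  "quantms_files": [
    {"feature_file": [{"path_name": "PXD004683", "is_folder": true,
                       "partition_fields": ["sample_accession"]}]},
    {"sdrf_file": [{"path_name": "PXD004683-e29b-41f4-a716.sdrf.tsv"}]}
  ]
}
""")
print(check_quantms_files(project))
```

In this example the sdrf_file entry is reported because is_folder is missing.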

Example:

   {
    "project_accession": "PXD014414",
    "project_title": "",
    "project_sample_description": "",
    "project_data_description": "",
    "project_pubmed_id": 32265444,
    "organisms": [
        "Homo sapiens"
    ],
    "organism_parts": [
        "mammary gland",
        "adjacent normal tissue"
    ],
    "diseases": [
        "metaplastic breast carcinomas",
        "Triple-negative breast cancer",
        "Normal",
        "not applicable"
    ],
    "cell_lines": [
        "not applicable"
    ],
    "instruments": [
        "Orbitrap Fusion"
    ],
    "enzymes": [
        "Trypsin"
    ],
    "experiment_type": [
        "Triple-negative breast cancer",
        "Wisp3",
        "Tandem mass tag (tmt) labeling",
        "Ccn6",
        "Metaplastic breast carcinoma",
        "Precision therapy",
        "Lc-ms/ms shotgun proteomics"
    ],
    "acquisition_properties": [
        {"proteomics data acquisition method": "TMT"},
        {"proteomics data acquisition method": "Data-dependent acquisition"},
        {"dissociation method": "HCD"},
        {"precursor mass tolerance": "20 ppm"},
        {"fragment mass tolerance": "0.6 Da"}
    ],
    "quantms_files": [
      {
        "feature_file": [
          {
            "path_name": "PXD014414.feature.parquet",
            "is_folder": false
          }
        ]
      },
      {
        "sdrf_file": [
          {
            "path_name": "PXD014414.sdrf.tsv",
            "is_folder": false
          }
        ]
      },
      {
        "psm_file": [
          {
            "path_name": "PXD014414-f4fb88f6.psm.parquet",
            "is_folder": false
          }
        ]
      },
      {
        "differential_file": [
          {
            "path_name": "PXD014414-3026e5d5.differential.tsv",
            "is_folder": false
          }
        ]
      }
    ],
    "software_provider": {
       "name": "quantms",
       "version": "1.3.0"
    },
    "quantmsio_version": "1.0",
    "comments": []
   }

9. SDRF view

The Proteomics Sample and Data Relationship Format (SDRF) is a tab-delimited file format that describes the relationship between samples, data files, and the experimental factors. The SDRF is a key file in the proteomics data analysis workflow as it describes the relationship between the samples and the data files. The specification of the SDRF can be found in the SDRF GitHub repository.

10. Absolute quantification view

Absolute quantification is the process of determining the absolute/baseline amount of a target protein in a sample. In proteomics, the main computational method to determine the absolute quantification is the intensity-based absolute quantification (iBAQ) method.

10.1. Absolute quantification use cases

  • Fast and easy visualization of absolute expression (AE) results using iBAQ values.

  • Store the AE results of each protein on each sample.

  • It could be used as a proxy to understand the expression profile of a protein in different conditions, tissues and organisms.

10.2. Format

The absolute expression format is a tab-delimited file format that contains the following fields:

  • protein → Protein accession or semicolon-separated list of accessions for indistinguishable groups

  • sample_accession → Sample accession in the SDRF.

  • condition → Condition name

  • ibaq → iBAQ value

  • ibaq_normalized → Relative iBAQ value: the iBAQ value normalized by the sum of the iBAQ values in the sample.

Example:

protein     | sample_accession | condition | ibaq   | ibaq_normalized
LV861_HUMAN | Sample-1         | heart     | 1234.1 | 12.34
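The ibaq_normalized field is described as the iBAQ value divided by the sum of the iBAQ values in the sample. A minimal sketch of that computation follows, assuming plain division with no further scaling (the specification does not state a scaling factor). The function name and the example rows are hypothetical.

```python
from collections import defaultdict


def add_ibaq_normalized(rows):
    """Add ibaq_normalized = ibaq / sum of ibaq over the same sample_accession."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sample_accession"]] += row["ibaq"]
    normalized = []
    for row in rows:
        total = totals[row["sample_accession"]]
        # Guard against samples whose iBAQ values sum to zero.
        value = row["ibaq"] / total if total > 0 else None
        normalized.append({**row, "ibaq_normalized": value})
    return normalized


# Hypothetical rows, not taken from a real dataset.
rows = [
    {"protein": "P12345", "sample_accession": "Sample-1", "condition": "heart", "ibaq": 30.0},
    {"protein": "P12346", "sample_accession": "Sample-1", "condition": "heart", "ibaq": 10.0},
    {"protein": "P12345", "sample_accession": "Sample-2", "condition": "liver", "ibaq": 5.0},
]
result = add_ibaq_normalized(rows)
for row in result:
    print(row["protein"], row["sample_accession"], row["ibaq_normalized"])
```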

10.2.1. AE header

We based the AE format (Chapter 10) and the DE format (Chapter 11) on MSstats and on genomics formats such as VCF. By default, the MSstats format does not have a metadata header. We suggest adding a header to the output to make the file easier to understand. MSstats allows comments in the file if the line starts with #. The quantms output starts with key/value pairs that describe the project, the workflow, and the columns in the file. For example:

#project_accession=PXD000000

In addition, for each default column of the matrix the following information should be added:

#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">
#INFO=<ID=sample_accession, Number=1, Type=String, Description="Sample Accession in the SDRF">
#INFO=<ID=condition, Number=1, Type=String, Description="Value of the factor value">
#INFO=<ID=ibaq, Number=1, Type=Float, Description="Intensity based absolute quantification">
#INFO=<ID=ibaq_normalized, Number=1, Type=Float, Description="normalized iBAQ">
  • The ID is the column name in the matrix, the Number is the number of values in the column (separated by ;), the Type is the type of the values in the column and the Description is a description of the column. The number of values in the column can go from 1 to inf (infinity).

  • Protein groups are written as a list of protein accessions separated by ; (e.g. P12345;P12346)

We RECOMMEND including the following properties in the header:

  • project_accession: The project accession in PRIDE Archive

  • project_title: The project title in PRIDE Archive

  • project_description: The project description in PRIDE Archive

  • quantmsio_version: The version of the quantmsio used to generate the file

  • factor_value: The factor values used in the analysis (e.g. tissue)

Please check also the differential expression example for more information: Chapter 11
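A minimal sketch of reading such a header: it collects the #key=value properties and the #INFO column descriptions from the leading comment lines. The function name parse_header is hypothetical, and the sketch assumes that the header ends at the first line that does not start with #.

```python
import re

# key=value pairs inside #INFO=<...>; quoted values may contain commas.
_INFO_RE = re.compile(r'(\w+)=("[^"]*"|[^,>]+)')


def parse_header(lines):
    """Return (properties, columns) parsed from the leading '#' lines."""
    properties, columns = {}, {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first data line
        body = line[1:].strip()
        if body.startswith("INFO=<") and body.endswith(">"):
            pairs = _INFO_RE.findall(body[len("INFO=<"):-1])
            info = {key: value.strip().strip('"') for key, value in pairs}
            columns[info["ID"]] = info
        elif "=" in body:
            key, _, value = body.partition("=")
            properties[key.strip()] = value.strip()
    return properties, columns


header = [
    "#project_accession=PXD000000",
    '#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">',
    '#INFO=<ID=ibaq, Number=1, Type=Float, Description="Intensity based absolute quantification">',
    "protein\tsample_accession\tcondition\tibaq\tibaq_normalized",
]
properties, columns = parse_header(header)
print(properties)
print(columns["ibaq"]["Type"])
```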

11. Differential expression view

The differential expression view is a tab-delimited file format that contains the differential expression results between two contrasts, with the corresponding fold changes and p-values. The differential expression view is a key file in the proteomics data analysis workflow as it describes the differential expression between two conditions.

11.1. Differential expression use cases

  • Store the differentially expressed proteins between two contrasts, with the corresponding fold changes and p-values.

  • Enable easy visualization using tools like volcano plots (https://en.wikipedia.org/wiki/Volcano_plot_(statistics)).

  • Enable easy integration with other omics data resources.

  • Store metadata information about the project, the workflow and the columns in the file.

11.2. Format

The differential expression format by quantms.io is based on the MSstats output:

  • protein → Protein Accession

  • label → Label of the contrast on which the fold changes and p-values are based

  • log2fc → Log2 Fold Change

  • se → Standard error of the log2 fold change

  • df → Degrees of freedom of the Student's t-test

  • pvalue → Raw p-values

  • adj_pvalue → P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg

  • issue → Issue column shows if there is any issue for inference in corresponding protein and comparison, for example, OneConditionMissing or CompleteMissing.

Example:

protein    | label                            | log2fc | se   | df | pvalue | adj_pvalue | issue
ADA2_HUMAN | normal - squamous cell carcinoma | 0.3057 | 0.26 | 37 | 0.02   | 0.43       |

11.2.1. DE header

By default, the MSstats format does not have any header of metadata. We suggest adding a header to the output for better understanding of the file. By default, MSstats allows comments in the file if the line starts with #. The quantms output will start with some key value pairs that describe the project, the workflow and also the columns in the file. For example:

#project_accession=PXD000000

In addition, for each default column of the matrix the following information should be added:

#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">
#INFO=<ID=label, Number=1, Type=String, Description="Label for the Conditions combination">
#INFO=<ID=log2fc, Number=1, Type=Double, Description="Log2 Fold Change">
#INFO=<ID=se, Number=1, Type=Double, Description="Standard error of the log2 fold change">
#INFO=<ID=df, Number=1, Type=Integer, Description="Degree of freedom of the Student test">
#INFO=<ID=pvalue, Number=1, Type=Double, Description="Raw p-values">
#INFO=<ID=adj_pvalue, Number=1, Type=Double, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg">
#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison">
  • The ID is the column name in the matrix, the Number is the number of values in the column (separated by ;), the Type is the type of the values in the column and the Description is a description of the column. The number of values in the column can go from 1 to inf (infinity).

  • Protein groups are written as a list of protein accessions separated by ; (e.g. P12345;P12346)

We suggest including the following properties in the header:

  • project_accession: The project accession in PRIDE Archive

  • project_title: The project title in PRIDE Archive

  • project_description: The project description in PRIDE Archive

  • quantmsio_version: The version of the quantmsio used to generate the file.

  • factor_value: The factor values used in the analysis (e.g. phenotype)

  • adj_pvalue: The FDR threshold used to filter the protein lists (e.g. adj_pvalue < 0.05)
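Because header lines start with #, a differential expression file can be read with a standard tab-delimited reader once those lines are skipped. The sketch below is an illustration under stated assumptions: the column names appear in the first non-comment line, missing values may be written as NA, and the function name read_significant is hypothetical. The second data row is made up for the example.

```python
import csv
import io


def read_significant(handle, max_adj_pvalue=0.05):
    """Yield rows whose adj_pvalue is below the threshold, skipping '#' header lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(data_lines, delimiter="\t"):
        try:
            adj_pvalue = float(row["adj_pvalue"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric value, e.g. "NA"
        if adj_pvalue < max_adj_pvalue:
            yield {"protein": row["protein"], "label": row["label"],
                   "log2fc": float(row["log2fc"]), "adj_pvalue": adj_pvalue}


# The second data row is hypothetical and only there to pass the filter.
example = io.StringIO(
    "#project_accession=PXD000000\n"
    "protein\tlabel\tlog2fc\tse\tdf\tpvalue\tadj_pvalue\tissue\n"
    "ADA2_HUMAN\tnormal - squamous cell carcinoma\t0.3057\t0.26\t37\t0.02\t0.43\t\n"
    "P12345\tnormal - squamous cell carcinoma\t-1.8\t0.31\t37\t0.0001\t0.004\t\n"
)
significant = list(read_significant(example))
print(significant)
```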

12. Peptide-based Views: psm, feature and peptide

Multiple peptide-level views are available for the quantms.io format. The views are the following:

  • Section 12.1: Peptide Spectrum Match (psm) View: the psm view covers the details at the peptide spectrum match (psm) level for AI/ML training and other use cases, mainly for DDA analytical methods.

  • Section 12.2: Peptide Feature View: the peptide feature view covers the details at the level of quantified peptides, including the peptide intensity in relation to the sample metadata.

  • Section 12.3: Peptide View: the peptide view is a summary of the quantified peptides by sample. The aim of this representation is to provide a simple summary of the number of peptides and their quantity for each protein in each sample. This view is useful for quick visualization and data retrieval.

12.1. Peptide spectrum match (psm) view

Peptide spectrum matches (psms) are the results of the identification of peptides in mass spectrometry data. PSMs are mainly the results of peptide identification by database search engines on data-dependent acquisition (DDA) experiments.

12.1.1. Psm use cases

  • The psm table aims to cover detail on psm level for AI/ML use-cases.

  • Most of the content is similar to mzTab: a psm would be a peptide identification in an msrun file.

  • We included in the psm view the spectrum information as optional for those use cases that want to have fast access to peptide information + spectrum data, for example, clustering or intensity prediction

  • Fast and easy visualization of PSM information.

12.1.2. Psm fields

The following table presents all the fields and attributes for each PSM entry in the psm_file. Some fields are shared between the Section 12.1, Section 12.2 and Section 12.3 views.

We added to the following table the corresponding fields in different tools and mzTab for each field. For each tool, we use the following output tables:

  • MaxQuant (MQ) - msms.txt

  • FragPipe - psm.tsv

  • mzTab - PSM section

Field | Description | Type | DIA-NN | FragPipe | MaxQuant | mzTab

These fields are shared with features (Section 12.2) and peptides (Section 12.3)

sequence | The peptide’s sequence (with no modifications) | string | Stripped.Sequence | Peptide | Sequence | sequence
peptidoform | Peptide sequence with modifications, see more Section 3.1 | string | Modified.Sequence | Modified Peptide | Modified sequence | opt_global_cv_MS:1000889_peptidoform_sequence
modifications | Modifications details: modification name, positions and localization probabilities: read Section 3.2 | array[struct], null | - | - | - | -
precursor_charge | Precursor charge | int32 | Precursor.Charge | - | Charge | charge
posterior_error_probability | Posterior error probability (PEP) for the given peptide or psm match. | float32, null | PEP | - | PEP | opt_global_Posterior_Error_Probability_score
is_decoy | Decoy indicator, 1 if the peptide is a decoy, 0 target | int32 | - | - | Reverse | opt_global_cv_MS:1002217_decoy_peptide
calculated_mz | Theoretical peptide mass-to-charge ratio based on an identified sequence and modifications | float32 | - | Calculated M/Z | - | calc_mass_to_charge
observed_mz | Experimental peptide mass-to-charge ratio of identified peptide (in Da) | float32 | - | Observed M/Z | m/z | exp_mass_to_charge
rt | MS2 scan’s precursor retention time (in seconds) | float32, null | RT | - | Retention time | retention_time
predicted_rt | Predicted retention time of the peptide (in seconds) | float32, null | Predicted.RT | - | - | -
reference_file_name | Spectrum file name with no path information and not including the file extension | string | Run | Spectrum File | Raw file | spectra_ref
scan | Scan index (number or nativeId) of the spectrum identified: read Section 3.3 | string | [scan-diann] | Spectrum | MS/MS scan number | spectra_ref
additional_scores | List of structures, each structure contains two fields: name and value. | array[struct{name: string, value: float32}] | DIA-NN Scores | FragPipe Scores | MaxQuant Scores | search_engine_score
cv_params | Optional list of CV parameters for additional metadata Section 12.1.4 | array[struct{cv_name:string, cv_value:string}], null | - | - | - | -

Protein fields shared by Section 12.2 Section 12.1

mp_accessions | Protein accessions of all the proteins that the peptide maps to | array[string], null | Protein.Ids | - | Proteins | accession

These fields are optional and part of the MS/MS information Chapter 15

ion_mobility | Ion mobility value for the precursor ion | float, null | - | - | - | -
number_peaks | Number of peaks in the spectrum used for the peptide spectrum match | int32, null | - | - | - | -
mz_array | Array of m/z values for the spectrum used for the peptide spectrum match | array[float], null | - | - | - | -
intensity_array | Array of intensity values for the spectrum used for the peptide spectrum match | array[float], null | - | - | - | -

12.1.3. Additional scores

Additional scores are stored as a list of key-value pairs, where the key is the name of the score (it is RECOMMENDED to use the HUPO-PSI MS ontology) and the value is the score value. Additional scores are mainly the search engine and protein scores that are reported at the PSM level. Some RECOMMENDED scores are:

  • pg_global_qvalue: Protein group global q-value used to filter the psm at the level of the protein group and experiment.

  • rank: Rank of the peptide in the search engine results (for example, 1.0).

  • global_qvalue: Global q-value of the PSM at the level of the experiment.

  • The psm view is NOT RECOMMENDED for DIA methods because it would duplicate information already in the feature view. The psm view is more suitable for DDA methods, where the psm is the main output of the identification process.

  • Protein inference SHOULD NOT be included in the psm view, as it is not the main purpose of the psm view. However, for some use cases such as peptide filtering or search, it may be useful to have access to all the psms for a given protein accession; these accessions can be stored in mp_accessions (mapped protein accessions). Two other protein-related fields can help users understand the resulting psm table: unique (whether the peptide maps to only one protein) and pg_global_qvalue (the global q-value at the protein group level used to filter the psm). For protein inference, see the feature view (Section 12.2) and the protein group view (Section 13.1).

  • The mz_array and intensity_array are arrays of the same length, where mz_array contains the m/z values and intensity_array contains the intensity values; the length of both arrays equals number_peaks, the number of peaks in the spectrum. These three columns support use cases such as AI/ML that need the spectrum information for a given psm. For spectra data we RECOMMEND using the mz view (Chapter 15), where the spectra are stored more efficiently.
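As an illustration of the additional_scores structure described in Section 12.1.3, the following minimal sketch looks up a score by name in a single psm row and applies a q-value filter. The row values, the helper names get_score and passes_global_fdr, and the 0.01 threshold are assumptions made for this example; they are not part of the specification.

```python
from typing import Optional

# One psm row; the field names follow the psm view, the values are invented.
psm = {
    "sequence": "PEPTIDEK",
    "additional_scores": [
        {"name": "pg_global_qvalue", "value": 0.004},
        {"name": "rank", "value": 1.0},
        {"name": "global_qvalue", "value": 0.008},
    ],
}


def get_score(row: dict, name: str) -> Optional[float]:
    """Return the value of the named score, or None if the row does not carry it."""
    for score in row.get("additional_scores") or []:
        if score["name"] == name:
            return score["value"]
    return None


def passes_global_fdr(row: dict, threshold: float = 0.01) -> bool:
    """Keep a psm only if it reports a global_qvalue at or below the threshold."""
    qvalue = get_score(row, "global_qvalue")
    return qvalue is not None and qvalue <= threshold


print(get_score(psm, "rank"))   # 1.0
print(passes_global_fdr(psm))   # True
```

Because scores are addressed by name rather than by column, a reader has to handle the case where a given score is absent from a row, which is why get_score returns None instead of raising.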

12.1.4. Psm CV parameters

CV params are a list of key-value pairs that stores additional information for a given psm. For example, they can be used to store the following mzIdentML information:

  • 'prot:FDR threshold': 0.01

  • number of unmatched peaks: 3

In quantms we use consensus_support where the value is the number of search engines that support the identification. This field could be added as an additional_score as: consensus_result: 3

The cv_params are stored as a list of key-value pairs, where the key is the name of the parameter and the value is the value of the parameter. This is similar to the CVParams in the mzIdentML format. Please be aware that search engine scores for psms should be stored in the additional_scores column.
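A minimal sketch of how a plain mapping could be turned into the cv_params layout. It assumes the struct fields cv_name and cv_value from the psm fields table, both typed as strings, so numeric values are converted to text; the helper name to_cv_params is hypothetical.

```python
def to_cv_params(params: dict) -> list:
    """Convert a mapping into a list of {cv_name, cv_value} structs with string values."""
    return [
        {"cv_name": str(name), "cv_value": str(value)}
        for name, value in params.items()
    ]


cv_params = to_cv_params({"prot:FDR threshold": 0.01, "number of unmatched peaks": 3})
for param in cv_params:
    print(param["cv_name"], "=", param["cv_value"])
```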

12.1.5. Psm file metadata

For parquet psm files, the file metadata, including the quantms.io version, should be stored in the file itself as key/value pairs. The metadata should include the following fields:

  • quantmsio_version: The version of the quantms.io format used to generate the file.

  • software_provider: The software provider and the version of the software used to generate the data.

  • project_accession: The project accession in PRIDE Archive if available.

  • project_title: The project title in PRIDE Archive if available.

  • project_description: The project description in PRIDE Archive if available.

  • scan_format: The format of the scan, with possible values: scan, index, nativeId, multiple. Multiple is used when multiple experiments are merged into one file.

  • creator: Name of the tool or person who created the file.

  • file_type: Type of the file (psm_file).

  • creation_date: Date when the file was created

  • uuid: Unique identifier for the file

  • compression_format: Compression format of the file, one of gzip, snappy, lzo, none.

Example parquet in Python:

import pyarrow as pa
import pyarrow.parquet as pq

# Define a sample schema for the Parquet file
schema = pa.schema([
    ....
])

# Create sample data to write to the Parquet file
data = {
    ....
}

# Convert the data to a PyArrow Table
table = pa.table(data, schema=schema)

# Define the custom metadata as key-value pairs
file_metadata = {
    'quantmsio_version': '1.0',
    'software_provider': 'QuantMS 1.3.0',
    'project_accession': 'PXD012345',
    'project_title': 'Proteomics of Disease X',
    'project_description': 'Project description',
    'scan_format': 'scan',
    'creator': 'John Doe',
    'file_type': 'psm_file',
    'creation_date': '2021-01-01',
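    'uuid': '943a8f02-0527-4528-b1a3-b96de99ebe75',
    'compression_format': 'snappy'
}

# Attach the metadata to the table schema (pq.write_table has no `metadata` argument),
# then write the Parquet file; the metadata is stored in the file footer
table = table.replace_schema_metadata(file_metadata)
pq.write_table(table, 'psm_data.parquet', compression='snappy')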

Parquet files don’t have a specific limit for metadata size, but practical constraints exist based on your system’s memory, processing capabilities, and file management practices. The Parquet metadata, which is stored in the file’s footer, includes information like schema, column statistics, and data offsets. The metadata is loaded into memory when the file is read, so large metadata can impact performance. For large metadata, consider storing the metadata in a separate file or database and linking to it from the Parquet file.

12.1.6. Psm global q-value

The global q-value represents the q-value at the level of the experiment. In OpenMS this is the PSM q-value that is by default global at the level of the experiment and the run. In DIA-NN, it corresponds to Global.Q.Value; the run-level Q.Value is stored in additional_scores.

12.1.7. Format

The psm view can be found in psm.avsc.

12.2. Peptide feature view

The peptide feature view (peptide features) covers the details of the peptides quantified at the msrun level, including peptide intensity in relation to the msrun and sample metadata. The feature file is a parquet file that contains the details of the peptides quantified in the experiment and sample.

The feature file is similar to the mzTab peptide table, the peptide evidence table in MaxQuant, and the DIA-NN matrix table.

12.2.1. Feature use cases

  • Store peptide intensities in relation to the sample metadata to perform down-stream analysis and integration.

  • Enable peptide level statistics and algorithms to move from peptide level to protein level.

  • Unlike the psm view (Section 12.1), the feature view contains the protein inference information when protein inference was applied.

ℹ️
quantms also releases the peptide table for MSstats. The goal of the feature table is to provide a more general peptide table and improve the annotations of the peptides with more columns.

12.2.2. Feature fields

The following table presents the fields needed to describe each feature in quantms.io. Some of the fields are shared with the psm view (Section 12.1).

Field Description Type DIA-NN FragPipe MaxQuant mzTab

These fields are shared with psms (Section 12.1) and peptides (Section 12.3)

sequence

The peptide’s sequence (with no modifications)

string

Stripped.Sequence

Peptide

Sequence

sequence

peptidoform

Peptide sequence with modifications, see more Section 3.1

string

Modified.Sequence

Modified Peptide

Modified sequence

opt_global_cv_MS:1000889_peptidoform_sequence

modifications

Modifications details: modification name, positions and localization probabilities: read Section 3.2

array[struct], null

-

-

-

-

precursor_charge

Precursor charge

int32

Precursor.Charge

-

Charge

charge

posterior_error_probability

Posterior error probability (PEP) for the given peptide or psm match.

float32, null

PEP

x

PEP

opt_global_Posterior_Error_Probability_score

is_decoy

Decoy indicator: 1 if the peptide is a decoy, 0 if it is a target

int32

-

-

Reverse

opt_global_cv_MS:1002217_decoy_peptide

calculated_mz

Theoretical peptide mass-to-charge ratio based on an identified sequence and modifications

float32

-

Calculated M/Z

-

calc_mass_to_charge

observed_mz

Experimental peptide mass-to-charge ratio of identified peptide (in Da)

float32

-

-

m/z

exp_mass_to_charge

rt

Precursor retention time (in seconds)

float32, null

RT

-

Retention time

retention_time

rt_start

Start of the retention time window for feature

float, null

RT.Start

-

-

-

rt_stop

End of the retention time window for feature

float, null

RT.Stop

-

-

-

predicted_rt

Predicted retention time of the peptide (in seconds)

float, null

Predicted.RT

-

-

-

ion_mobility

Ion mobility value for the precursor ion

float, null

-

-

-

-

start_ion_mobility

Start ion mobility value for the precursor ion

float, null

-

-

-

-

stop_ion_mobility

Stop ion mobility value for the precursor ion

float, null

-

-

-

-

additional_scores

List of structures, each structure contains two fields: name and value.

array[struct{name: string, value: float32}]

DIA-NN Scores

FragPipe Scores

MaxQuant Scores

search_engine_score

cv_params

Optional list of CV parameters for additional metadata Section 12.1.4

array[struct{cv_name:string, cv_value:string}], null

-

-

-

-

Feature quantification and relation to the given reference file

intensities

The intensity-based abundance of the feature in the reference file for different channels

Section 12.2.3

Precursor.Quantity

Intensity

Intensity

Intensity

reference_file_name

The reference file name that contains the feature

string

Run

-

Raw file

-

additional_intensities

Apart from the raw intensity, multiple intensity values can be provided as key-value pairs, for example, normalized intensity.

Section 12.2.3

Protein and protein groups information related to Section 13.1, Section 12.3

pg_accessions

Protein group accession. Could be one single protein or multiple protein accessions, depending on the tool.

array[string], null

Protein.Group

x

Proteins

accession

anchor_protein

One protein accession that represents the protein group

string, null

-

-

-

-

unique

Unique peptide indicator, if the peptide maps to a single protein, the value is 1, otherwise 0

int32, null

-

Is Unique

Unique

unique

pg_global_qvalue

Global q-value of the protein group at the experiment level

float, null

Global.PG.Q.Value

-

-

best_search_engine_score

gg_accessions

Gene group accessions.

array[string], null

-

-

-

-

gg_names

Gene names, as a string array

array[string], null

Genes

-

-

-

mp_accessions

Protein accessions of all the proteins that the peptide maps to

array[string], null

Protein.Ids

-

Proteins

accession

Spectra information

scan_reference_file_name

The reference file containing the best psm that identified the feature. Note: This file can be different from the file that contains the feature (reference_file_name).

string, null

-

-

-

-

scan

The scan number or index of the spectrum in the file.

string, null

Section 12.2.4

-

-

-

ℹ️
  • The spectra information provides, for a given feature, the scan used to identify it. In DDA protocols (LFQ-DDA and DDAplex), we recommend using the best psm for a given feature.

  • The protein group field pg_accessions should contain all the proteins that describe the protein group. For example, in MaxQuant and FragPipe the anchor protein is the one selected to represent the group, while DIA-NN puts all the proteins within a group. Similar to the psm view (Section 12.1), the entire list of proteins for a given group can be written in the mp_accessions field.

  • conditions: The conditions for every feature are the factor values.

12.2.3. Intensities

We capture an intensity value for each feature in a given reference_file_name. In label-free experiments it is a single value, but in multiplexed experiments there can be multiple values depending on the number of channels, and each channel is associated with one sample accession (normally the source name in the SDRF). We therefore suggest storing the intensities as a list of structs in parquet, like:

  • intensity: 1234.1

  • sample_accession: Sample-1

  • channel: TMT126

Additional intensities can be added in a similar way in the additional_intensities field/column, with an additional field named after the intensity, for example, normalized_intensity: 0.1234.

12.2.4. DIA-NN scan

The DIA-NN scan is a string that contains the scan number of the MS2 used to identify the peptide. We use the rt field and the mzML information to get that number.
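The lookup described above can be sketched as a nearest-retention-time search. This is a simplified illustration, not the quantms implementation: the function name nearest_ms2_scan and the (rt, scan) input list are hypothetical, the MS2 retention times are assumed to have been extracted from the mzML beforehand, sorted, and expressed in the same unit as rt. A real DIA lookup would presumably also need to check that the precursor m/z falls inside the isolation window of the scan.

```python
from bisect import bisect_left


def nearest_ms2_scan(rt: float, ms2_scans: list) -> str:
    """Return the scan whose retention time is closest to rt.

    ms2_scans is a list of (retention_time, scan) tuples sorted by retention time.
    """
    if not ms2_scans:
        raise ValueError("no MS2 scans available")
    times = [scan_rt for scan_rt, _ in ms2_scans]
    pos = bisect_left(times, rt)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = ms2_scans[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda item: abs(item[0] - rt))[1]


# Hypothetical MS2 scans taken from an mzML file: (retention time, scan).
ms2_scans = [(10.0, "101"), (12.5, "105"), (15.0, "109")]
print(nearest_ms2_scan(12.4, ms2_scans))  # 105
```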

12.2.5. Format

The feature view can be found in feature.avsc.

12.3. Peptide summary view

The peptide summary view covers the details of the peptides quantified in the experiment and sample. A peptide could be a modified peptide (sequence with modifications) or non-modified peptide (sequence with no modifications) depending on the use case and the granularity of the data. The peptide view is a tab-delimited file format that aims to represent the peptides quantified in the experiment.

12.3.1. Peptide use cases

  • It serves as a report file with all peptides quantified in the experiment for each protein.

  • It can be used to generate peptide reports for integration with tools and services.

12.3.2. Peptide fields

Some of the fields are shared between the Section 12.1 and Section 12.2 views.

Field Description Type

These fields are shared with features (Section 12.2) and psms (Section 12.1)

sequence

The peptide’s sequence (with no modifications)

string

peptidoform

Peptide sequence with modifications, see more Section 3.1

string

modifications

Modifications details: modification name, positions and localization probabilities: read Section 3.2

array[struct], null

gg_accessions

Gene group accessions.

array[string], null

gg_names

Gene names, as a string array

array[string], null

best_id_score

The best search engine score from all the features/psms identified

array[struct{name: string, value: float32}], null

sample_accession

The sample accession in the SDRF (the source name column)

string, null

abundance

The peptide abundance in the given sample accession

float32, null

12.3.3. Format

The peptide view can be found in peptide.avsc.

13. Protein views: Protein groups and Protein summary

We have two main reports for protein information.

  • The protein group view (Section 13.1) is the output of the quantitative tool, including quantms, MaxQuant or DIA-NN.

  • The protein view (Chapter 14) is a summary of the proteins quantified per sample.

13.1. Protein group view

The protein group view is a tabular file that contains the details of the protein groups identified and quantified. The protein group is similar to the outputs of multiple tools such as MaxQuant, DIA-NN, and others.

The file defines the relation between a protein group and the raw file that contains it. The protein group view is a key file in the proteomics data analysis workflow as it describes the protein groups identified and quantified in the experiment.

13.1.1. Protein group use cases

  • Retrieve all the protein groups identified or quantified in the file.

  • Compute the protein group abundance by file and condition.

  • Store information about FDR and q-values for the protein groups identified/quantified.

13.1.2. Protein group fields

Field Description Type DIA-NN FragPipe MaxQuant

pg_accessions

Protein group accessions of all the proteins within this group

array[string]

Protein.Group

Group + Indistinguishable Proteins

Protein IDs

pg_names

Protein group names

array[string]

Protein.Names

-

Protein names

gg_accessions

Gene group accessions, as a string array

array[string]

Genes

-

Gene names

reference_file_name

The raw file containing the identified/quantified protein

string

Run

-

-

global_qvalue

Global q-value of the protein group at the experiment level

float

Global.PG.Q.Value

-

Q-value

intensities

Similar to the feature view, the intensity-based abundance of the protein group in the reference file for different channels

Section 12.2.3

Intensity, Normalized Intensity

-

iBAQ, Intensity, LFQ intensity

additional_intensities

Apart from the raw intensity, multiple intensity values can be provided as key-value pairs, for example, normalized intensity.

Section 12.2.3

-

-

-

is_decoy

Definition of the protein group as decoy or target

null, integer

-

-

Reverse

contaminant

If the protein is a contaminant

null, integer

-

-

Potential contaminant

peptides

Number of peptides per protein in the protein group

null, struct{sequence: string, count: int}

-

-

-

anchor_protein

The anchor protein of the protein group, leading protein or representative

null, string

-

Protein ID

Protein IDs

additional_scores

List of structures, each structure contains two fields: name and value.

Section 13.1.3

-

-

-

13.1.3. Protein additional scores

At the protein level, additional scores should be stored for each protein group. The additional scores are stored as a list of key-value pairs, where the key is the name of the score (it is RECOMMENDED to use terms from the HUPO-PSI MS ontology) and the value is an array of float32 values whose indices match the indices of the pg_accessions field. Additional scores are mainly the search engine and protein scores that the producer wants to report at the protein group level.
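The index alignment between score values and pg_accessions can be sketched as follows. The accessions, the score name protein_score, and the helper scores_by_accession are hypothetical and serve only to illustrate the layout described above.

```python
# One protein group row; accessions and score values are invented.
protein_group = {
    "pg_accessions": ["P12345", "Q67890"],
    "additional_scores": [
        {"name": "protein_score", "value": [0.99, 0.85]},
    ],
}


def scores_by_accession(row: dict, name: str) -> dict:
    """Map each accession in the group to its value of the named score."""
    for score in row["additional_scores"]:
        if score["name"] == name:
            if len(score["value"]) != len(row["pg_accessions"]):
                raise ValueError("score values must align with pg_accessions")
            return dict(zip(row["pg_accessions"], score["value"]))
    return {}


print(scores_by_accession(protein_group, "protein_score"))
# {'P12345': 0.99, 'Q67890': 0.85}
```

Checking the lengths before zipping guards against silently dropping scores when a producer writes fewer values than there are accessions.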

14. Protein view

The protein view is a report of the proteins identified/quantified in the experiment. It doesn’t contain major information about the inference of the protein group, but it contains the protein abundance and the protein identification scores.

14.1. Use cases

  • Fast reports of the proteins quantified/identified in an experiment for Web interfaces and search engines.

  • Connection to AE/DE formats, making it possible to report the coverage of the protein identification.

Field Description Type

abundance

Abundance of the given protein in the sample/experiment

null, float

sample_accession

Sample accession in the SDRF (the source name column)

string

best_id_score

The best search engine score for the identification

[{"type": "record", "name": "score", "fields": [{ "name": "name", "type": "string" },{ "name": "value", "type": "float32" }]}, null]

gene_accessions

The gene accessions corresponding to every protein

null, array[string]

gene_names

The gene names corresponding to every protein

null, array[string]

number_peptides

The total number of peptides for a given protein

null, integer

number_psms

The total number of peptide spectrum matches

null, integer

number_unique_peptides

The total number of unique peptides

null, integer

14.1.1. Format

The protein view can be found in protein.avsc.

15. Mass spectra view

The mass spectra view is a tabular file that contains the details of the mass spectra identified and quantified. This view is based on the mz_parquet format developed by Michael Lazear. The mz_parquet format is a parquet-based format that stores the mass spectra information in a columnar format.

15.1. Mass spectra use cases

  • Retrieve all the precursor masses, retention times, and intensities in the file.

  • Enable easy visualization and scanning on mass spectra level.

  • AI/ML training and prediction on mass spectra level.

15.2. Mass spectra fields

Field Type Description

id

string

Unique identifier for the scan or spectrum.

ms_level

int

The MS level (e.g., 1 for MS1, 2 for MS2).

centroid

boolean

Indicates whether the data is centroided (true) or profile mode (false).

scan_start_time

float32

The start time of the scan in minutes.

inverse_ion_mobility

float32, null

Inverse ion mobility, if available, used for TIMS data.

ion_injection_time

float32

The ion injection time in milliseconds.

total_ion_current

float

Total ion current (TIC) for the scan.

precursors

[null, {"type": "array", "items": {"type": "record", "name": "precursor"}}]

List of precursors for this scan, if applicable.

selected_ion_mz

float32

The m/z value of the selected precursor ion.

selected_ion_charge

int32, null

Charge state of the selected precursor ion, if available.

selected_ion_intensity

float32, null

Intensity of the selected precursor ion.

isolation_window_target

float32, null

The target m/z for the isolation window.

isolation_window_lower

float32, null

The lower bound of the isolation window.

isolation_window_upper

float32, null

The upper bound of the isolation window.

spectrum_ref

float32, null

Reference to another spectrum (e.g., for linking to external datasets).

mz

{"type": "array", "items": "float32"}

List of m/z values for the scan.

intensity

{"type": "array", "items": "float32"}

List of intensity values corresponding to the m/z values.

cv_params

[null, {"type": "array", "items": {"type": "record", "name": "cv_param"}}]

Optional list of CV parameters for additional metadata.

name

string

Name of the CV term (e.g., from PSI-MS or other ontologies).

value

string

Value associated with the CV term.
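To illustrate how the mz and intensity arrays of a spectrum record relate to each other, the sketch below checks that both arrays have the same length and extracts the most intense peak. Only a subset of the fields in the table is used; the record values and the helper names check_lengths and base_peak are assumptions made for this example.

```python
# A reduced spectrum record using a subset of the fields from the table above.
spectrum = {
    "id": "scan=1001",
    "ms_level": 2,
    "centroid": True,
    "mz": [175.119, 276.155, 389.251],
    "intensity": [1200.0, 860.0, 3100.0],
}


def check_lengths(record: dict) -> bool:
    """Each m/z value must have exactly one matching intensity value."""
    return len(record["mz"]) == len(record["intensity"])


def base_peak(record: dict) -> tuple:
    """Return the (m/z, intensity) pair with the highest intensity."""
    return max(zip(record["mz"], record["intensity"]), key=lambda peak: peak[1])


print(check_lengths(spectrum))  # True
print(base_peak(spectrum))      # (389.251, 3100.0)
```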

15.2.1. Format

The mass spectra view can be found in mz.avsc.

16. Get in touch

Use the following links to get support and help from the quantms maintainers:

Report Issue Get help on GitHub Forum