The quantms.io format

Table of Contents

1. Introduction
2. General data model and structure
3. Common data structures and formats
4. Serialization formats
- 4.1. Parquet format
5. File extensions
6. Versioning
7. Software provider
8. Project quantms.io
- 8.1. Project fields
- 8.2. Project files
9. SDRF view
10. Absolute quantification view
- 10.1. Absolute quantification use cases
- 10.2. Format
11. Differential expression view
- 11.1. Differential expression use cases
- 11.2. Format
12. Peptide-based Views: psm, feature and peptide
13. Protein views: Protein groups and Protein summary
- 13.1. Protein group view
14. Protein view
- 14.1. Use cases
15. Mass spectra view
- 15.1. Mass spectra use cases
- 15.2. Mass spectra fields
16. Get in touch

1. Introduction

The majority of HUPO-PSI formats, including mzML and mzIdentML, are based on XML, which makes them difficult to use for large-scale analyses and AI model development. The previous attempt to move away from XML, mzTab, falls short of providing a tab-delimited format that can scale with the size of the data. Here, we aim to formalize and develop a more standardized format that enables a better representation of identification and quantification results and also enables novel use cases for proteomics data analysis. The main use cases for the format are:

Fast and easy visualization of the identification and quantification results.
Easy integration with other omics data.
Easy integration with sample metadata.
AI/ML model development based on identification and quantification results.
Easy data retrieval for big datasets and large-scale collections of proteomics data.

ℹ️

We are not trying to do the following:

Replace the mzTab format; instead, we provide a new format that enables AI-related use cases.
Replace all software tools' file formats and intermediate files; instead, we provide a new format that enables easy integration of the main output results with other tools.

2. General data model and structure

The quantms.io format (.qms) can be seen as a multiple-view representation of the results of a proteomics data analysis, similar to tools that produce multiple output files for their analysis, such as MaxQuant, DIA-NN, FragPipe or Spectronaut. Each view of the format can be serialized in different formats depending on the use case. The data model defines two main things: the view and how the view is serialized. Both views and serializations can be extended, and new views can be added with each version (Chapter 6) of the specification.

The data model view defines the structure, the fields and the properties that will be included in a view for each peptide, psm, feature or protein.
The data serialization defines the format in which the view will be serialized and what features of serialization will be supported, for example, compression, indexing, or slicing.

view	file class	serialization format	definition
mz	mz_file	parquet	Chapter 15
psm	psm_file	parquet	Section 12.1
feature	feature_file	parquet	Section 12.2
pg	pg_file	parquet	Section 13.1
peptide	peptide_file	parquet	Section 12.3
protein	protein_file	parquet	Chapter 13
absolute	absolute_file	tsv	Chapter 10
differential	differential_file	tsv	Chapter 11
sdrf	sdrf_file	tsv	Chapter 9
project	-	json	Chapter 8

ℹ️

Some of these data models fit some analytical methods better than others. For example, the psm view (Section 12.1) is more suitable for data-dependent acquisition (DDA) methods and may not be present for data-independent acquisition (DIA) methods, while the feature view (Section 12.2) can be generated for both DDA and DIA methods. The differential expression view (Chapter 11) is only present in experiments where conditions are compared, while the absolute expression view (based on iBAQ values) is only available for datasets where comparisons between conditions are not performed.

The .qms contains all the files of a quantms.io experiment, including the metadata files and the different views of the experiment (Chapter 2).

3. Common data structures and formats

Some concepts are common to several outputs, so we define and explain them here:

3.1. Peptidoform

A peptidoform is a peptide sequence with modifications. For example, the peptide sequence PEPTIDM with an Oxidation modification would be PEPTIDM[Oxidation]. The peptidoform should be written using the ProForma specification. This concept is used in the following outputs:

Section 12.1
Section 12.2
Section 12.3

3.2. Modifications

A modification is a chemical change in the peptide sequence. Modifications can be annotated in multiple ways in quantms.io format:

As part of the ProForma notation inside the peptidoform, written in square brackets with the modification name or accession, for example [Oxidation] or [UNIMOD:35]. It is RECOMMENDED to report modifications using UNIMOD. If a modification is not defined in UNIMOD, a CHEMMOD definition must be used, like CHEMMOD:-18.0913, where the number is the mass shift in Daltons.
As a list of modification names for each peptidoform for easy integration and filtering of the given peptide evidence. For example, Oxidation;Phosphorylation.
Full modification annotation with the given position, modification name, and quality score. In this case, modifications will be encoded as:
- Accession or name: The modification accession or name. For example, CHEMMOD:-18.0913, UNIMOD:35 or Oxidation.
- Position: The position of the modification in the peptide sequence. Terminal modifications in proteins and peptides MUST be reported with the position set to 0 (N-terminal) or the amino acid length +1 (C-terminal) respectively. For example, 1 or 1,2,3.
- Localization Probability: The probability of the modification being in the reported position.

These three properties can be combined in a single string as:

{position}({Probabilistic Score:0.9})|{position2}|..-{modification accession or name}

1(Probabilistic Score:0.8)|2(Probabilistic Score:0.9)|3-UNIMOD:35

When represented in parquet files (Section 12.1, Section 12.2), modification details are stored as a list of structs:

  [
    {
      "name": "UNIMOD:35",
      "fields": [
        {
          "position": 2,
          "localization_probability": 0.94
        },
        {
          "position": 12,
          "localization_probability": 0.06
        }
      ]
    },
    {
      "name": "UNIMOD:0",
      "fields": [
        {
          "position": 0,
          "localization_probability": 0.92
        },
        {
          "position": 16,
          "localization_probability": 0.08
        }
      ]
    }
  ]
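The string encoding and the list-of-struct form carry the same information. The following is a minimal sketch, not part of the specification, of how the string form could be converted into one struct. The function name `parse_modification` is hypothetical, and the sketch assumes that the positions part never contains a hyphen, so the first hyphen separates positions from the modification name.

```python
import re

# One position token, optionally followed by "(Probabilistic Score:<value>)".
_POSITION_TOKEN = re.compile(
    r"^(\d+)(?:\(Probabilistic Score:([0-9]*\.?[0-9]+)\))?$"
)


def parse_modification(encoded):
    """Convert '{positions}-{name}' into a dict shaped like one struct above.

    Hypothetical helper: the name and error handling are illustrative only.
    """
    # Split on the FIRST hyphen only, so names such as CHEMMOD:-18.0913 survive.
    positions_part, separator, name = encoded.partition("-")
    if not separator or not name:
        raise ValueError(f"missing modification name in {encoded!r}")
    fields = []
    for token in positions_part.split("|"):
        match = _POSITION_TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"cannot parse position token {token!r}")
        probability = match.group(2)
        fields.append({
            "position": int(match.group(1)),
            # Positions reported without a score keep a probability of None.
            "localization_probability": float(probability) if probability else None,
        })
    return {"name": name, "fields": fields}


if __name__ == "__main__":
    print(parse_modification(
        "1(Probabilistic Score:0.8)|2(Probabilistic Score:0.9)|3-UNIMOD:35"
    ))
```

Splitting on the first hyphen rather than the last is a deliberate choice here, because CHEMMOD accessions with a negative mass shift contain a hyphen themselves.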

3.3. Scan (scan number)

The scan number (scan) points to the MS/MS spectrum in a RAW, mzML, or peak list file (e.g., MGF). mzIdentML, mzTab, USI, and other HUPO-PSI standards have different ways to use and define the scan number. Here we use the latest definition from USI. A single scan points to one MS/MS spectrum in the spectra file. The scan is a unique identifier, and it can be a number or a string depending on the instrument.

AB Sciex: sample=1 period=1 cycle=2740 experiment=10 → 1,1,2740,10. In this scenario, where a reference to the original scan event is desired but a single scan number is not sufficient, we use the nativeId mechanism.
Waters nativeId: function=10 process=1 scan=345 → 10,1,345
Bruker nativeId: frame=120 scan=475 → 120,475
Thermo scan: controllerType=0 controllerNumber=1 scan=43920 → 43920

Note: for Thermo files, the controllerType and controllerNumber are always 0 and 1 for mass spectra, so only the scan number is reported. In rare cases, if either controllerType is not 0 or controllerNumber is not 1 (e.g., a PDA spectrum is being referenced), then the nativeId form MUST be used: controllerType=5 controllerNumber=1 scan=7 → 5,1,7
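The vendor examples above follow one pattern: the values of the key=value pairs are joined with commas, except for the common Thermo case where only the scan number is kept. A rough sketch of that rule, under the assumption that nativeId strings are whitespace-separated key=value pairs (the function name `native_id_to_scan` is hypothetical):

```python
def native_id_to_scan(native_id):
    """Collapse a nativeId string into the comma-separated scan value."""
    pairs = []
    for item in native_id.split():
        key, separator, value = item.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {item!r}")
        pairs.append((key, value))

    keys = [key for key, _ in pairs]
    values = dict(pairs)
    # Thermo mass spectra: controllerType=0 and controllerNumber=1 are implied,
    # so only the scan number is reported.
    is_thermo = keys == ["controllerType", "controllerNumber", "scan"]
    if is_thermo and values["controllerType"] == "0" and values["controllerNumber"] == "1":
        return values["scan"]
    # Every other case keeps all values, in their original order.
    return ",".join(value for _, value in pairs)


if __name__ == "__main__":
    for example in (
        "sample=1 period=1 cycle=2740 experiment=10",
        "function=10 process=1 scan=345",
        "frame=120 scan=475",
        "controllerType=0 controllerNumber=1 scan=43920",
        "controllerType=5 controllerNumber=1 scan=7",
    ):
        print(example, "->", native_id_to_scan(example))
```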

The scan is used in the following sections: Section 12.1, Section 12.2, Chapter 15.

ℹ️

Normally, only the scan value is captured in the column, while the format of the scan (nativeId, scan or index) should be captured in the metadata of the file. However, in some types of analyses there may be more than one type of scan in the same file (e.g., when merging multiple experiments). In this case, each scan MUST be prefixed by the type of scan, for example, nativeId:1,1,2740,10 or scan:43920.
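A reader therefore has to handle both prefixed and unprefixed scan values. The sketch below shows one possible way to do this; the function name `split_scan` and the `default_type` argument (standing in for the scan type recorded in the file metadata) are assumptions for illustration.

```python
SCAN_TYPES = ("nativeId", "scan", "index")


def split_scan(value, default_type=None):
    """Return (scan_type, scan_value) for a prefixed or unprefixed scan."""
    prefix, separator, rest = value.partition(":")
    if separator and prefix in SCAN_TYPES:
        return prefix, rest
    if default_type is None:
        # No prefix and no file-level metadata: the type cannot be known.
        raise ValueError(f"scan {value!r} has no type prefix")
    return default_type, value


if __name__ == "__main__":
    print(split_scan("nativeId:1,1,2740,10"))
    print(split_scan("43920", default_type="scan"))
```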

3.4. Identification scores

Every workflow within quantms uses different identification/quantification scores to determine the quality of the identification or the quantification. The additional_scores field in quantms captures multiple scores from different workflows, such as Comet:xcorr or DIA-NN:Q.Value. Additional scores are stored as key/value pairs where the key is the name of the score (it is RECOMMENDED to use the HUPO-PSI MS ontology) and the value is the score value. For example:

[Comet:xcorr:67.8, DIA-NN:Q.Value:0.01]

This concept is used in the following outputs:

Section 12.1
Section 12.2
Section 12.3
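In the parquet views, the psm field table in Section 12.1.2 types additional_scores as a list of structs with the fields name and value. A small sketch of building and querying that structure in plain Python (the helper names `to_additional_scores` and `score_value` are hypothetical):

```python
def to_additional_scores(scores):
    """Turn a {score name: score value} mapping into a list of name/value structs."""
    return [{"name": name, "value": float(value)} for name, value in scores.items()]


def score_value(additional_scores, name):
    """Look up one score by name; returns None when the score is absent."""
    for score in additional_scores:
        if score["name"] == name:
            return score["value"]
    return None


if __name__ == "__main__":
    scores = to_additional_scores({"Comet:xcorr": 67.8, "DIA-NN:Q.Value": 0.01})
    print(scores)
    print(score_value(scores, "DIA-NN:Q.Value"))
```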

3.5. Controlled vocabulary terms

The following views (Section 12.1, Section 12.2, Chapter 15) use controlled vocabularies to describe the data. Controlled vocabulary terms standardize the data and make it easier to integrate with other datasets. They are stored as key/value pairs where the key is the name of the controlled vocabulary term and the value is the term value. For example:

["ms level": "2", "deconvoluted data": null]

The name/key of the controlled vocabulary MUST be provided; the value is optional.
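The rule that the name is mandatory and the value optional can be sketched as follows. The struct field names cv_name and cv_value come from the psm field table in Section 12.1.2; the helper name `make_cv_params` is hypothetical.

```python
def make_cv_params(terms):
    """Build a list of cv_name/cv_value structs from a {name: value} mapping."""
    params = []
    for name, value in terms.items():
        # The name/key MUST be provided.
        if not isinstance(name, str) or not name.strip():
            raise ValueError("controlled vocabulary term name MUST be provided")
        # The value is optional; present values are stored as strings.
        params.append({"cv_name": name, "cv_value": None if value is None else str(value)})
    return params


if __name__ == "__main__":
    print(make_cv_params({"ms level": 2, "deconvoluted data": None}))
```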

4. Serialization formats

The quantms.io format has different serialization formats for each view. The serialization format defines how the view will be serialized and what features of serialization will be supported, for example, compression, indexing, or slicing. The following serialization formats are supported:

tsv: Tab-separated values format.
parquet: Apache Parquet format.
json: JavaScript Object Notation format.

4.1. Parquet format

Parquet is a columnar storage format that supports nested data. Apache Parquet is an open-source format designed for efficient data storage and retrieval. It offers high-performance compression and encoding schemes, making it well-suited for handling large volumes of complex data. Parquet is widely supported across various programming languages and analytics tools.

Apache Parquet includes two types of metadata: file metadata and column metadata. File metadata contains pointers to the starting locations of all the column metadata, while column metadata holds location information for the individual column chunks. Readers first access the file metadata to find the column chunks they need, then use the column metadata to efficiently skip over irrelevant pages.

A Parquet table can be distributed across multiple compute nodes, and its key advantage is that applications can quickly jump to the relevant fields in a record using metadata. For large-scale analyses, Parquet has helped users reduce storage requirements by at least one-third on large datasets. Additionally, it significantly improves scan and deserialization times (important for web-based use cases), thus reducing overall costs.

Project	Type	Original file size (GB)	Converted parquet size (MB)	Writing psm time (s)	Writing feature time (s)
PXD046440	maxquant	48	337/343	985.2671835	678.474133
PXD016999	mzTab	160	155/228	539.0019641	3554.52738
PXD019909	diaNN	1.9	195		229.482332

4.1.1. Parquet features

Columnar Storage: Parquet’s columnar design improves compression and query performance by storing data by columns rather than rows, which reduces I/O for analytical queries that typically access only a few columns.
Efficient Compression: The format achieves better compression ratios with algorithms like Snappy, Gzip, and LZO, and uses techniques like run-length encoding (RLE) and dictionary encoding for further optimization.
Schema Evolution: Parquet supports adding, deleting, or modifying columns without affecting existing data, making it adaptable to schema changes.
Complex Data Types: Supports nested structures and data types like arrays, maps, and structs, allowing efficient storage of complex data.

4.1.2. Parquet slicing

quantms.io supports slicing parquet files by any field when generating them. Upon storage, the files are organized into distinct folders according to the chosen slicing fields.

PXD004683/
│
├── sample_accession_1/
│   ├── file1.parquet
│   └── file2.parquet
│
├── sample_accession_2/
│   ├── file3.parquet
│   └── file4.parquet
│
└── sample_accession_3/
    ├── file5.parquet
    └── file6.parquet
...

When parquet files are registered in project.json (Chapter 8), they are described in the following format:

  "quantms_files": [
    {
      "feature_file": [
        {
          "path_name": "PXD004683",
          "is_folder": true,
          "partition_fields": ["sample_accession"]
        }
      ]
    }
  ]
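To illustrate the slicing idea without depending on a parquet library, the sketch below groups records by the chosen slicing fields and builds the matching registration entry. It assumes, as in the folder tree above, that each folder is named after the value of the slicing field; the function names are hypothetical and no files are written.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_partition(root, records, partition_fields):
    """Map each target folder to the records that would be written into it."""
    groups = defaultdict(list)
    for record in records:
        # One sub-folder level per slicing field, named after the field value.
        folder = PurePosixPath(root, *(str(record[field]) for field in partition_fields))
        groups[str(folder)].append(record)
    return dict(groups)


def registration_entry(root, partition_fields):
    """Entry to place under the matching file class in quantms_files."""
    return {"path_name": root, "is_folder": True, "partition_fields": list(partition_fields)}


if __name__ == "__main__":
    rows = [
        {"sample_accession": "sample_accession_1", "sequence": "PEPTIDE"},
        {"sample_accession": "sample_accession_2", "sequence": "PEPTIDEK"},
    ]
    print(group_by_partition("PXD004683", rows, ["sample_accession"]))
    print(registration_entry("PXD004683", ["sample_accession"]))
```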

5. File extensions

File extensions are used to identify the file type. In quantms.io the extensions are constructed as follows: *.{view}.{format} where the view is one of the well-defined views in the specification and the format is one of the serialization formats. For example:

An absolute expression file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.absolute.tsv
A differential expression file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.differential.tsv
A feature file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.feature.parquet
A psm file: PXD000000-943a8f02-0527-4528-b1a3-b96de99ebe75.psm.parquet

ℹ️

In quantms.io we use a UUID to identify the project and the files, following the pattern {PREFIX}-{UUID}.{view}.{format}. The UUID is optional, but we use it in most of the code examples. A Universally Unique Identifier (UUID), as defined in RFC 4122, provides a standardized method for generating globally unique identifiers across various systems and applications. Each generated UUID is highly unlikely to collide with any other UUID, even when produced by different entities and systems.
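As an illustration of the naming pattern, the following sketch builds a file name from a prefix and a view, using Python's standard `uuid` module. The view-to-format mapping is copied from the table in Chapter 2; the function name `build_file_name` is hypothetical.

```python
import uuid

# Serialization format of each view, as listed in the table in Chapter 2.
VIEW_FORMATS = {
    "mz": "parquet", "psm": "parquet", "feature": "parquet", "pg": "parquet",
    "peptide": "parquet", "protein": "parquet",
    "absolute": "tsv", "differential": "tsv", "sdrf": "tsv",
}


def build_file_name(prefix, view, file_uuid=None):
    """Return '{PREFIX}-{UUID}.{view}.{format}' for a known view."""
    if view not in VIEW_FORMATS:
        raise ValueError(f"unknown view {view!r}")
    # A random (version 4) UUID is generated when none is supplied.
    file_uuid = file_uuid or uuid.uuid4()
    return f"{prefix}-{file_uuid}.{view}.{VIEW_FORMATS[view]}"


if __name__ == "__main__":
    print(build_file_name("PXD000000", "feature"))
```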

6. Versioning

The structure of the version is as follows: {major release}.{minor update}. The current quantms.io specification version is: 1.0

All views (Section 12.1, Section 12.2, Section 13.1) and serialization formats will carry a version number in the form quantmsio_version: {}. This helps to identify the version of the specification used to generate the file.
Major release changes will be backward incompatible, while minor updates will be backward compatible.
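One way a reader could apply this rule is sketched below. It is an interpretation, not part of the specification: it assumes that a reader written for version X.Y can read any file with the same major release and a minor update less than or equal to Y. The function names are hypothetical.

```python
def parse_version(version):
    """Split '{major release}.{minor update}' into two integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def can_read(file_version, reader_version="1.0"):
    """True when a reader for reader_version is expected to read the file."""
    file_major, file_minor = parse_version(file_version)
    reader_major, reader_minor = parse_version(reader_version)
    # Major releases are backward incompatible; minor updates are compatible.
    return file_major == reader_major and file_minor <= reader_minor


if __name__ == "__main__":
    print(can_read("1.0", reader_version="1.2"))
    print(can_read("2.0", reader_version="1.2"))
```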

7. Software provider

The data within quantms.io is mainly generated by the quantms workflow. However, the format is open and can be used by any software provider that wants to generate data in this format. The software provider and the version of the software used to generate the data will be stored in the project view (Chapter 8) as:

"software_provider": {
    "name": "quantms",
    "version": "1.3.0"
  }

8. Project quantms.io

The project view is the file that stores the metadata of the entire quantms.io project. The project view is a JSON file that contains the following fields:

8.1. Project fields

Field	Description	Type
`project_accession`	Project accession identifier	string
`project_title`	Title of the project	string
`project_description`	Description of the project	string
`project_sample_description`	Description of the project sample	string
`project_data_description`	Description of the project data	string
`project_pubmed_id`	PubMed ID associated with the project	int32
`organisms`	List of Organisms involved in the project	list[string], null
`organism_parts`	Parts of Organisms studied	list[string], null
`diseases`	Diseases associated with the study	list[string], null
`cell_lines`	Cell lines used in the study	list[string], null
`instruments`	Instruments used for data acquisition	list[string]
`enzymes`	Enzymes used in the study	list[string]
`experiment_type`	Types of experiments conducted	list[string]
`acquisition_properties`	Properties of the data acquisition methods	list[key/value]
`quantms_files`	Files related to quantMS analysis	list[key/value]
`quantmsio_version`	Version of the `quantms.io`	string
`software_provider`	The software provider (Chapter 7) used to generate the data	key/value
`comments`	Additional comments or notes	list[string]

key/value pair object: The key/value pairs are used to store the acquisition properties, and the quantms files.

Example of acquisition_properties:

   "acquisition_properties": [
        {"precursor tolerance": "0.05 Da"},
        {"dissociation method": "HCD"}
   ]

8.2. Project files

In the current version (Chapter 6), the files within a project are optional. Files within a project should be listed in quantms_files; for every file the following information is necessary:

path_name: The name of the file or folder.
is_folder: A boolean value that indicates if the file is a folder or not.
partition_fields: The fields that are used to partition the data in the file. This is used to optimize the data retrieval and filtering of the data. This field is optional.

ℹ️	Parquet files can be stored as folders when the data is partitioned by some fields. For example, a parquet file that is partitioned by the `sample_accession` field will be stored as a folder with the name of the field and the value of the field.

Example of quantms_files:

   {
  "quantms_files": [
    {
      "psm_file": [
        {
          "path_name": "PXD004683-550e8400-e29b-41d4.1.psm.parquet",
          "is_folder": false
        },
        {
          "path_name": "PXD004683-550e8400-e29b-41d4.2.psm.parquet",
          "is_folder": false
        }
      ]
    },
    {
      "feature_file": [
        {
          "path_name": "PXD004683",
          "is_folder": true,
          "partition_fields": ["sample_accession"]
        }
      ]
    },
    {
      "differential_file": [
        {
          "path_name": "PXD004683-a716.differential.tsv",
          "is_folder": false
        }
      ]
    },
    {
      "absolute_file": [
        {
          "path_name": "PXD004683-e29b-41f4-a716.absolute.tsv",
          "is_folder": false
        }
      ]
    },
    {
      "sdrf_file": [
        {
          "path_name": "PXD004683-e29b-41f4-a716.sdrf.tsv",
          "is_folder": false
        }
      ]
    }
  ]
}
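Because quantms_files is a list of single-key objects, a consumer has to walk two levels to find the files of a given class. A minimal sketch using only the standard library (the function name `list_files` is hypothetical, and the embedded JSON is a shortened copy of the example above):

```python
import json


def list_files(project, file_class=None):
    """Return (file class, path_name, is_folder, partition_fields) tuples."""
    found = []
    for group in project.get("quantms_files", []):
        for name, entries in group.items():
            if file_class is not None and name != file_class:
                continue
            for entry in entries:
                found.append((
                    name,
                    entry["path_name"],
                    entry["is_folder"],
                    # partition_fields is optional, so default to an empty list.
                    entry.get("partition_fields", []),
                ))
    return found


if __name__ == "__main__":
    project = json.loads("""
    {"quantms_files": [
      {"feature_file": [{"path_name": "PXD004683", "is_folder": true,
                         "partition_fields": ["sample_accession"]}]},
      {"sdrf_file": [{"path_name": "PXD004683-e29b-41f4-a716.sdrf.tsv",
                      "is_folder": false}]}
    ]}
    """)
    for item in list_files(project):
        print(item)
```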

Example of a complete project file:

   {
    "project_accession": "PXD014414",
    "project_title": "",
    "project_sample_description": "",
    "project_data_description": "",
    "project_pubmed_id": 32265444,
    "organisms": [
        "Homo sapiens"
    ],
    "organism_parts": [
        "mammary gland",
        "adjacent normal tissue"
    ],
    "diseases": [
        "metaplastic breast carcinomas",
        "Triple-negative breast cancer",
        "Normal",
        "not applicable"
    ],
    "cell_lines": [
        "not applicable"
    ],
    "instruments": [
        "Orbitrap Fusion"
    ],
    "enzymes": [
        "Trypsin"
    ],
    "experiment_type": [
        "Triple-negative breast cancer",
        "Wisp3",
        "Tandem mass tag (tmt) labeling",
        "Ccn6",
        "Metaplastic breast carcinoma",
        "Precision therapy",
        "Lc-ms/ms shotgun proteomics"
    ],
    "acquisition_properties": [
        {"proteomics data acquisition method": "TMT"},
        {"proteomics data acquisition method": "Data-dependent acquisition"},
        {"dissociation method": "HCD"},
        {"precursor mass tolerance": "20 ppm"},
        {"fragment mass tolerance": "0.6 Da"}
    ],
    "quantms_files": [
      {
        "feature_file": [
          {
            "path_name": "PXD014414.feature.parquet",
            "is_folder": false
          }
        ]
      },
      {
        "sdrf_file": [
          {
            "path_name": "PXD014414.sdrf.tsv",
            "is_folder": false
          }
        ]
      },
      {
        "psm_file": [
          {
            "path_name": "PXD014414-f4fb88f6.psm.parquet",
            "is_folder": false
          }
        ]
      },
      {
        "differential_file": [
          {
            "path_name": "PXD014414-3026e5d5.differential.tsv",
            "is_folder": false
          }
        ]
      }
    ],
    "software_provider": {
       "name": "quantms",
       "version": "1.3.0"
    },
    "quantmsio_version": "1.0",
    "comments": []
   }

9. SDRF view

The Proteomics Sample and Data Relationship Format (SDRF) is a tab-delimited file format that describes the relationship between samples, data files, and the experimental factors. The SDRF is a key file in the proteomics data analysis workflow as it describes the relationship between the samples and the data files. The specification of the SDRF can be found in the SDRF GitHub repository.

10. Absolute quantification view

Absolute quantification is the process of determining the absolute/baseline amount of a target protein in a sample. In proteomics, the main computational method to determine the absolute quantification is the intensity-based absolute quantification (iBAQ) method.

10.1. Absolute quantification use cases

Fast and easy visualization of absolute expression (AE) results using iBAQ values.
Store the AE results of each protein on each sample.
It could be used as a proxy to understand the expression profile of a protein in different conditions, tissues and organisms.

10.2. Format

The absolute expression format is a tab-delimited file format that contains the following fields:

protein → Protein accession or semicolon-separated list of accessions for indistinguishable groups
sample_accession → Sample accession in the SDRF.
condition → Condition name
ibaq → iBAQ value
ibaq_normalized → Relative iBAQ value: the iBAQ value normalized by the sum of the iBAQ values in the sample.

Example:

protein	sample_accession	condition	ibaq	ibaq_normalized
LV861_HUMAN	Sample-1	heart	1234.1	12.34

10.2.1. AE header

We based the AE (Chapter 10) and DE (Chapter 11) formats on MSstats and on genomics formats such as VCF. By default, the MSstats format does not have any metadata header. We suggest adding a header to the output for a better understanding of the file. MSstats allows comments in the file if the line starts with #. The quantms output will start with some key-value pairs that describe the project, the workflow and the columns in the file. For example:

#project_accession=PXD000000

In addition, for each default column of the matrix the following information should be added:

#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">
#INFO=<ID=sample_accession, Number=1, Type=String, Description="Sample Accession in the SDRF">
#INFO=<ID=condition, Number=1, Type=String, Description="Value of the factor value">
#INFO=<ID=ibaq, Number=1, Type=Float, Description="Intensity based absolute quantification">
#INFO=<ID=ibaq_normalized, Number=1, Type=Float, Description="normalized iBAQ">

The ID is the column name in the matrix, the Number is the number of values in the column (separated by ;), the Type is the type of the values in the column and the Description is a description of the column. The number of values in the column can go from 1 to inf (infinity).
Protein groups are written as a list of protein accessions separated by ; (e.g. P12345;P12346).

We RECOMMEND including the following properties in the header:

project_accession: The project accession in PRIDE Archive
project_title: The project title in PRIDE Archive
project_description: The project description in PRIDE Archive
quantmsio_version: The version of the quantmsio used to generate the file
factor_value: The factor values used in the analysis (e.g. tissue)

Please check also the differential expression example for more information: Chapter 11
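The header described in this section can be read with a few lines of standard-library code. The sketch below separates the key-value properties from the #INFO column descriptions; the function name `parse_header` and the returned dictionary layout are assumptions for illustration.

```python
import re

# Matches: #INFO=<ID=..., Number=..., Type=..., Description="...">
_INFO = re.compile(
    r'^#INFO=<ID=(?P<id>[^,]+),\s*Number=(?P<number>[^,]+),'
    r'\s*Type=(?P<type>[^,]+),\s*Description="(?P<description>.*)">$'
)


def parse_header(lines):
    """Return (properties, columns) parsed from the leading '#' lines."""
    properties, columns = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#"):
            break  # The header ends at the first non-comment line.
        info = _INFO.match(line)
        if info:
            columns[info["id"]] = {
                "number": info["number"],
                "type": info["type"],
                "description": info["description"],
            }
            continue
        key, separator, value = line[1:].partition("=")
        if separator:
            properties[key.strip()] = value.strip()
    return properties, columns


if __name__ == "__main__":
    header = [
        "#project_accession=PXD000000",
        '#INFO=<ID=ibaq, Number=1, Type=Float, Description="Intensity based absolute quantification">',
    ]
    print(parse_header(header))
```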

11. Differential expression view

The differential expression view is a tab-delimited file format that contains the differential expression results between two contrasts, with the corresponding fold changes and p-values. The differential expression view is a key file in the proteomics data analysis workflow as it describes the differential expression between two conditions.

11.1. Differential expression use cases

Store the differentially expressed proteins between two contrasts, with the corresponding fold changes and p-values.
Enable easy visualization using tools like volcano plots (https://en.wikipedia.org/wiki/Volcano_plot_(statistics)).
Enable easy integration with other omics data resources.
Store metadata information about the project, the workflow and the columns in the file.

11.2. Format

The differential expression format by quantms.io is based on the MSstats output:

protein → Protein Accession
label → Label for the contrast on which the fold changes and p-values are based
log2fc → Log2 Fold Change
se → Standard error of the log2 fold change
df → Degrees of freedom of the Student's t-test
pvalue → Raw p-values
adj_pvalue → P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg
issue → Issue column shows if there is any issue for inference in corresponding protein and comparison, for example, OneConditionMissing or CompleteMissing.

Example:

protein	label	log2fc	se	df	pvalue	adj_pvalue	issue
ADA2_HUMAN	normal - squamous cell carcinoma	0.3057	0.26	37	0.02	0.43

11.2.1. DE header

By default, the MSstats format does not have any header of metadata. We suggest adding a header to the output for better understanding of the file. By default, MSstats allows comments in the file if the line starts with #. The quantms output will start with some key value pairs that describe the project, the workflow and also the columns in the file. For example:

#project_accession=PXD000000

In addition, for each default column of the matrix the following information should be added:

#INFO=<ID=protein, Number=inf, Type=String, Description="Protein Accession">
#INFO=<ID=label, Number=1, Type=String, Description="Label for the Conditions combination">
#INFO=<ID=log2fc, Number=1, Type=Double, Description="Log2 Fold Change">
#INFO=<ID=se, Number=1, Type=Double, Description="Standard error of the log2 fold change">
#INFO=<ID=df, Number=1, Type=Integer, Description="Degree of freedom of the Student test">
#INFO=<ID=pvalue, Number=1, Type=Double, Description="Raw p-values">
#INFO=<ID=adj_pvalue, Number=1, Type=Double, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg">
#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison">

The ID is the column name in the matrix, the Number is the number of values in the column (separated by ;), the Type is the type of the values in the column and the Description is a description of the column. The number of values in the column can go from 1 to inf (infinity).
Protein groups are written as a list of protein accessions separated by ; (e.g. P12345;P12346).

We suggest including the following properties in the header:

project_accession: The project accession in PRIDE Archive
project_title: The project title in PRIDE Archive
project_description: The project description in PRIDE Archive
quantmsio_version: The version of the quantmsio used to generate the file.
factor_value: The factor values used in the analysis (e.g. phenotype)
adj_pvalue: The FDR threshold used to filter the protein lists (e.g. adj_pvalue < 0.05)
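Since the header lines start with #, a differential expression file can be read with Python's standard `csv` module after skipping the comments. The sketch below also applies an adj_pvalue threshold like the one mentioned above; the function names and the default threshold of 0.05 are illustrative assumptions.

```python
import csv
import io

NUMERIC_COLUMNS = ("log2fc", "se", "pvalue", "adj_pvalue")


def read_differential(handle):
    """Read the tab-delimited table, ignoring '#' header lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        for column in NUMERIC_COLUMNS:
            row[column] = float(row[column])
        rows.append(row)
    return rows


def significant(rows, threshold=0.05):
    """Keep the proteins whose adjusted p-value is below the threshold."""
    return [row for row in rows if row["adj_pvalue"] < threshold]


if __name__ == "__main__":
    text = (
        "#project_accession=PXD000000\n"
        "protein\tlabel\tlog2fc\tse\tdf\tpvalue\tadj_pvalue\tissue\n"
        "ADA2_HUMAN\tnormal - squamous cell carcinoma\t0.3057\t0.26\t37\t0.02\t0.43\t\n"
    )
    print(read_differential(io.StringIO(text)))
```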

12. Peptide-based Views: psm, feature and peptide

Multiple peptide-level views are available for the quantms.io format. The views are the following:

Section 12.1: Peptide Spectrum Match (psm) view. The psm view aims to cover details at the peptide spectrum match (psm) level for AI/ML training and other use cases, mainly for DDA analytical methods.
Section 12.2: Peptide feature view. The peptide feature view (peptide features) aims to cover details at the level of quantified peptide information, including peptide intensity in relation to the sample metadata.
Section 12.3: Peptide view. The peptide view is a summary of quantified peptides by sample. The aim of this representation is to provide a simple summary of the number of peptides and their given quantity for each protein in each sample. This view is useful for quick visualization and data retrieval.

12.1. Peptide spectrum match (psm) view

Peptide spectrum matches (psms) are the results of the identification of peptides in mass spectrometry data. PSMs are mainly the results of peptide identification by database search engines on data-dependent acquisition (DDA) experiments.

12.1.1. Psm use cases

The psm table aims to cover detail on psm level for AI/ML use-cases.
Most of the content is similar to mzTab; a psm would be a peptide identification in an msrun file.
We included the spectrum information in the psm view as optional, for those use cases that need fast access to peptide information plus spectrum data, for example, clustering or intensity prediction.
Fast and easy visualization of PSM information.

12.1.2. Psm fields

The following table presents all the fields and attributes for each PSM entry in the psm_file. Some fields are shared between the Section 12.1, Section 12.2 and Section 12.3 views.

We added to the following table the corresponding fields in different tools and mzTab for each field. For each tool, we use the following output tables:

MaxQuant (MQ) - msms.txt
FragPipe - psm.tsv
mzTab - PSM section

Field	Description	Type	DIA-NN	FragPipe	MaxQuant	mzTab
These fields are shared with features (Section 12.2) and peptides (Section 12.3)
`sequence`	The peptide’s sequence (with no modifications)	string	Stripped.Sequence	Peptide	Sequence	sequence
`peptidoform`	Peptide sequence with modifications, see more Section 3.1	string	Modified.Sequence	Modified Peptide	Modified sequence	opt_global_cv_MS:1000889_peptidoform_sequence
`modifications`	Modifications details: modification name, positions and localization probabilities: read Section 3.2	array[struct], null	-	-	-	-
`precursor_charge`	Precursor charge	int32	Precursor.Charge	-	Charge	charge
`posterior_error_probability`	Posterior error probability (PEP) for the given peptide or psm match.	float32, null	PEP	-	PEP	opt_global_Posterior_Error_Probability_score
`is_decoy`	Decoy indicator, 1 if the peptide is a decoy, 0 target	int32	-	-	Reverse	opt_global_cv_MS:1002217_decoy_peptide
`calculated_mz`	Theoretical peptide mass-to-charge ratio based on an identified sequence and modifications	float32	-	Calculated M/Z	-	calc_mass_to_charge
`observed_mz`	Experimental peptide mass-to-charge ratio of identified peptide (in Da)	float32	-	Observed M/Z	m/z	exp_mass_to_charge
`rt`	MS2 scan’s precursor retention time (in seconds)	float32, null	RT	-	Retention time	retention_time
`predicted_rt`	Predicted retention time of the peptide (in seconds)	float32, null	Predicted.RT	-	-	-
`reference_file_name`	Spectrum file name with no path information and not including the file extension	string	Run	Spectrum File	Raw file	spectra_ref
`scan`	Scan index (number or nativeId) of the spectrum identified: read Section 3.3	string	[scan-diann]	Spectrum	MS/MS scan number	spectra_ref
`additional_scores`	List of structures, each structure contains two fields: name and value.	array[struct{name: string, value: float32}]	DIA-NN Scores	FragPipe Scores	MaxQuant Scores	search_engine_score
`cv_params`	Optional list of CV parameters for additional metadata Section 12.1.4	array[struct{cv_name:string, cv_value:string}], null	-	-	-	-
Protein fields shared by Section 12.1 and Section 12.2
`mp_accessions`	Protein accessions of all the proteins that the peptide maps to	array[string], null	Protein.Ids	-	Proteins	accession
These fields are optional and part of the MS/MS information Chapter 15
`ion_mobility`	Ion mobility value for the precursor ion	float, null	-	-	-	-
`number_peaks`	Number of peaks in the spectrum used for the peptide spectrum match	int32, null	-	-	-	-
`mz_array`	Array of m/z values for the spectrum used for the peptide spectrum match	array[float], null	-	-	-	-
`intensity_array`	Array of intensity values for the spectrum used for the peptide spectrum match	array[float], null	-	-	-	-

12.1.3. Additional scores

Additional scores are stored as a list of key-value pairs, where the key is the name of the score (it is RECOMMENDED to use the HUPO-PSI MS ontology) and the value is the score value. Additional scores are mainly the search engine and protein scores that are added at the PSM level. Some RECOMMENDED scores are:

pg_global_qvalue: Protein group global q-value used to filter the psm at the level of the protein group and experiment.
rank: Rank of the peptide in the search engine results. (1.0)
global_qvalue: Global q-value of the PSM at the level of the experiment.
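
As a minimal sketch of how a consumer might read these scores, the snippet below represents additional_scores as a list of name/value records (the shape an array of structs takes once converted to Python objects) and looks a score up by name. The helper get_score and the sample values are hypothetical and not part of quantms.io.

```python
from typing import Optional

# One psm's additional_scores: a list of {name, value} structures.
# The score values are made up for illustration.
additional_scores = [
    {"name": "pg_global_qvalue", "value": 0.004},
    {"name": "rank", "value": 1.0},
    {"name": "global_qvalue", "value": 0.008},
]


def get_score(scores: Optional[list], name: str) -> Optional[float]:
    """Return the value of the first score called `name`, or None if absent."""
    for score in scores or []:
        if score["name"] == name:
            return score["value"]
    return None


print(get_score(additional_scores, "rank"))        # 1.0
print(get_score(additional_scores, "hyperscore"))  # None
```

Returning None for a missing score, rather than raising, keeps the lookup usable across files written by different tools, which may not all report the same scores.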

The psm view is NOT RECOMMENDED for DIA methods because it would duplicate the information in the feature view. The psm view is more suitable for DDA methods, where the psm is the main output of the identification process.
Protein inference SHOULD NOT be included in the psm view, as it is not the main purpose of the psm view. However, for some use cases such as peptide filtering or search, it may be useful to have access to all the psms for a given protein accession; these can be included in mp_accessions (mapped protein accessions). Two other protein-related fields can help users understand the resulting psm table: unique (whether the peptide maps to only one protein) and pg_global_qvalue (the global q-value at the protein group level used to filter the psm). For protein inference, please see the feature view (Section 12.2) and the protein group view (Section 13.1).
The mz_array and intensity_array are arrays of the same length, where mz_array contains the m/z values and intensity_array contains the intensity values; the size of both arrays equals the number of peaks in the spectrum (number_peaks). These three columns can support use cases such as AI/ML that need the spectrum information for a given psm. For spectra data, we RECOMMEND using the mz view (Chapter 15), where the spectra are stored more efficiently.
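
The length constraint on the spectrum columns can be checked per row. The sketch below is one possible validation, assuming each psm row is available as a Python dict; the function name spectrum_is_consistent is hypothetical, and the rule that the three columns must be either all present or all null is an assumption of this example rather than a stated requirement.

```python
def spectrum_is_consistent(row: dict) -> bool:
    """Check that the optional spectrum columns of one psm row agree."""
    mz = row.get("mz_array")
    intensity = row.get("intensity_array")
    n_peaks = row.get("number_peaks")

    # All three columns are nullable; a psm without a spectrum is acceptable.
    if mz is None and intensity is None and n_peaks is None:
        return True

    # Assumption of this sketch: if one column is set, all three must be set.
    if mz is None or intensity is None or n_peaks is None:
        return False

    # Both arrays must have exactly number_peaks entries.
    return len(mz) == len(intensity) == n_peaks


good = {"mz_array": [100.1, 200.2], "intensity_array": [10.0, 20.0], "number_peaks": 2}
bad = {"mz_array": [100.1, 200.2], "intensity_array": [10.0], "number_peaks": 2}
print(spectrum_is_consistent(good), spectrum_is_consistent(bad))
```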

12.1.4. Psm CV parameters

CV params are a list of key-value pairs that store additional information for a given psm. For example, they could be used to store the following mzIdentML information:

'prot:FDR threshold': 0.01
number of unmatched peaks: 3

In quantms we use consensus_support, where the value is the number of search engines that support the identification. This field could be added as an additional_score as: consensus_support: 3

The cv_params are stored as a list of key-value pairs, where the key is the name of the parameter and the value is the value of the parameter. This is similar to the CVParams in the mzIdentML format. Please be aware that search engine scores for psms should be stored in the additional_scores column.

12.1.5. Psm file metadata

For parquet psm files, the file metadata, including the quantms.io version, should be stored in the file itself as key/value pairs. The metadata should include the following fields:

quantmsio_version: The version of the quantms.io format used to generate the file.
software_provider: The software provider and the version of the software used to generate the data.
project_accession: The project accession in PRIDE Archive if available.
project_title: The project title in PRIDE Archive if available.
project_description: The project description in PRIDE Archive if available.
scan_format: The format of the scan, with possible values: scan, index, nativeId, multiple. Multiple is used when multiple experiments are merged into one file.
creator: Name of the tool or person who created the file.
file_type: Type of the file (psm_file).
creation_date: Date when the file was created
uuid: Unique identifier for the file
compression_format: Compression format of the file, one of gzip, snappy, lzo, none.

Example parquet in Python:

import pyarrow as pa
import pyarrow.parquet as pq

# Define a sample schema for the Parquet file
schema = pa.schema([
    ....
])

# Create sample data to write to the Parquet file
data = {
    ....
}

# Convert the data to a PyArrow Table
table = pa.table(data, schema=schema)

# Define the custom metadata as key-value pairs
file_metadata = {
    'quantmsio_version': '1.0',
    'software_provider': 'QuantMS 1.3.0',
    'project_accession': 'PXD012345',
    'project_title': 'Proteomics of Disease X',
    'project_description': 'Project description',
    'scan_format': 'scan',
    'creator': 'John Doe',
    'file_type': 'psm_file',
    'creation_date': '2021-01-01',
    'uuid': '943a8f02-0527-4528-b1a3-b96de99ebe75'
}

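# Attach the metadata to the table schema (pq.write_table has no `metadata` argument)
table = table.replace_schema_metadata(file_metadata)

# Write the Parquet file; the schema metadata is stored in the file footer
pq.write_table(table, 'psm_data.parquet')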

Parquet files don’t have a specific limit for metadata size, but practical constraints exist based on your system’s memory, processing capabilities, and file management practices. The Parquet metadata, which is stored in the file’s footer, includes information like schema, column statistics, and data offsets. The metadata is loaded into memory when the file is read, so large metadata can impact performance. For large metadata, consider storing the metadata in a separate file or database and linking to it from the Parquet file.

12.1.6. Psm global q-value

The global q-value represents the q-value at the level of the experiment. In OpenMS this is the PSM q-value, which by default is global at the level of the experiment and the run. In DIA-NN, it corresponds to Global.Q.Value; the run-level Q.Value is stored in additional_scores.

12.1.7. Format

The psm view can be found in psm.avsc.

12.2. Peptide feature view

The peptide feature view (peptide features) describes quantified peptides at the msrun level, including peptide intensity in relation to the msrun and sample metadata. The feature file is a parquet file that contains the details of the peptides quantified in the experiment and sample.

The feature file is similar to the mzTab peptide table, the peptide evidence table in MaxQuant, and the DIA-NN matrix table.

12.2.1. Feature use cases

Store peptide intensities in relation to the sample metadata to perform down-stream analysis and integration.
Enable peptide level statistics and algorithms to move from peptide level to protein level.
Unlike the psm view (Section 12.1), the feature view contains the protein inference information when protein inference was applied.

ℹ️	quantms also releases the peptide table for MSstats. The goal of the feature table is to provide a more general peptide table and improve the annotations of the peptides with more columns.

12.2.2. Feature fields

The following table presents the fields needed to describe each feature in quantms.io. Some of the fields are shared with the psm view (Section 12.1).

Field	Description	Type	DIA-NN	FragPipe	MaxQuant	mzTab
These fields are shared with psms (Section 12.1) and peptides (Section 12.3)
`sequence`	The peptide’s sequence (with no modifications)	string	Stripped.Sequence	Peptide	Sequence	sequence
`peptidoform`	Peptide sequence with modifications, see more Section 3.1	string	Modified.Sequence	Modified Peptide	Modified sequence	opt_global_cv_MS:1000889_peptidoform_sequence
`modifications`	Modifications details: modification name, positions and localization probabilities: read Section 3.2	array[struct], null	-	-	-	-
`precursor_charge`	Precursor charge	int32	Precursor.Charge	-	Charge	charge
`posterior_error_probability`	Posterior error probability (PEP) for the given peptide or psm match.	float32, null	PEP	x	PEP	opt_global_Posterior_Error_Probability_score
`is_decoy`	Decoy indicator, 1 if the peptide is a decoy, 0 target	int32	-	-	Reverse	opt_global_cv_MS:1002217_decoy_peptide
`calculated_mz`	Theoretical peptide mass-to-charge ratio based on an identified sequence and modifications	float32	-	Calculated M/Z	-	calc_mass_to_charge
`observed_mz`	Experimental peptide mass-to-charge ratio of identified peptide (in Da)	float32	-	-	m/z	exp_mass_to_charge
`rt`	Precursor retention time (in seconds)	float32, null	RT	-	Retention time	retention_time
`rt_start`	Start of the retention time window for feature	float, null	RT.Start	-	-	-
`rt_stop`	End of the retention time window for feature	float, null	RT.Stop	-	-	-
`predicted_rt`	Predicted retention time of the peptide (in seconds)	float, null	Predicted.RT	-	-	-
`ion_mobility`	Ion mobility value for the precursor ion	float, null	-	-	-	-
`start_ion_mobility`	start ion mobility value for the precursor ion	float, null	-	-	-	-
`stop_ion_mobility`	stop ion mobility value for the precursor ion	float, null	-	-	-	-
`additional_scores`	List of structures, each structure contains two fields: name and value.	array[struct{name: string, value: float32}]	DIA-NN Scores	FragPipe Scores	MaxQuant Scores	search_engine_score
`cv_params`	Optional list of CV parameters for additional metadata Section 12.1.4	array[struct{cv_name:string, cv_value:string}], null	-	-	-	-
Feature quantification and relation to the given reference file
`intensities`	The intensity-based abundance of the feature in the reference file for different channels	Section 12.2.3	Precursor.Quantity	Intensity	Intensity	Intensity
`reference_file_name`	The reference file name that contains the feature	string	Run	-	Raw file	-
`additional_intensities`	Apart from the raw intensity, multiple intensity values can be provided as key-values pairs, for example, normalized intensity.	Section 12.2.3
Protein and protein groups information related to Section 13.1, Section 12.3
`pg_accessions`	Protein group accession. Could be one single protein or multiple protein accessions, depending on the tool.	array[string], null	Protein.Group	x	Proteins	accession
`anchor_protein`	One protein accession that represents the protein group	string, null	-	-	-	-
`unique`	Unique peptide indicator, if the peptide maps to a single protein, the value is 1, otherwise 0	int32, null	-	Is Unique	Unique	unique
`pg_global_qvalue`	Global q-value of the protein group at the experiment level	float, null	Global.PG.Q.Value	-	-	best_search_engine_score
`gg_accessions`	Gene group accessions.	array[string], null	-	-	-	-
`gg_names`	Gene names, as a string array	array[string], null	Genes	-	-	-
`mp_accessions`	Protein accessions of all the proteins that the peptide maps to	array[string], null	Protein.Ids	-	Proteins	accession
Spectra information
`scan_reference_file_name`	The reference file containing the best psm that identified the feature. Note: This file can be different from the file that contains the feature (`reference_file_name`).	string, null	-	-	-	-
`scan`	The scan number of the spectrum. The scan number or index of the spectrum in the file.	string, null	Section 12.2.4	-	-	-

ℹ️

The spectra information provides, for a given feature, the scan used to identify it. In the DDA protocols LFQ-DDA and DDAplex, we recommend using the best psm for a given feature.
Protein groups (pg_accessions) should contain all the proteins that describe the protein group. For example, in MaxQuant and FragPipe the anchor protein is the one selected to represent the group, while DIA-NN puts all the proteins within a group. As in the psm view (Section 12.1), the entire list of proteins for a given group can be written in the mp_accessions field.
conditions: The conditions for every feature are the factor values.

12.2.3. Intensities

We capture an intensity value for each feature in a given reference_file_name. In label-free experiments this is a single value, but in multiplexed experiments there can be multiple values, depending on the number of channels; each channel is associated with one sample accession (normally the source name in the SDRF). We therefore suggest storing the intensities as a list of structs in parquet, like:

intensity: 1234.1
sample_accession: Sample-1
channel: TMT126

Additional intensities can be added in a similar way in the additional_intensities field/column, with an additional field named after the intensity type, for example, normalized_intensity: 0.1234.

12.2.4. DIANN scan

The DIA-NN scan is a string that contains the scan number of the MS2 used to identify the peptide. We use the rt field and the mzML information to get that number.
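
The exact lookup is not specified here. One plausible approach, sketched below, is to pick the MS2 scan whose retention time is closest to the feature's rt. The function nearest_scan and the input layout (a list of (retention time in seconds, scan) pairs sorted by retention time, as could be extracted from an mzML file) are hypothetical, and both retention times are assumed to be in the same unit.

```python
from bisect import bisect_left


def nearest_scan(ms2_scans: list, rt: float) -> str:
    """Return the scan whose retention time is closest to `rt`.

    `ms2_scans` is a list of (rt_seconds, scan) tuples sorted by rt_seconds.
    """
    if not ms2_scans:
        raise ValueError("no MS2 scans available")

    times = [scan_rt for scan_rt, _ in ms2_scans]
    i = bisect_left(times, rt)

    # rt lies before the first scan or after the last one.
    if i == 0:
        return ms2_scans[0][1]
    if i == len(times):
        return ms2_scans[-1][1]

    # Otherwise choose the closer of the two neighbouring scans.
    before, after = ms2_scans[i - 1], ms2_scans[i]
    return before[1] if rt - before[0] <= after[0] - rt else after[1]


scans = [(10.0, "101"), (12.5, "105"), (15.0, "109")]
print(nearest_scan(scans, 12.4))  # 105
```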

12.2.5. Format

The feature view can be found in feature.avsc.

12.3. Peptide summary view

The peptide summary view describes the peptides quantified in the experiment and sample. A peptide can be a modified peptide (sequence with modifications) or a non-modified peptide (sequence with no modifications), depending on the use case and the granularity of the data. The peptide view is a tab-delimited file format that represents the peptides quantified in the experiment.

12.3.1. Peptide use cases

It serves as a report file with all peptides quantified in the experiment for each protein.
It can be used to generate peptide reports for integration with tools and services.

12.3.2. Peptide fields

Some of the fields are shared between the Section 12.1 and Section 12.2 views.

Field	Description	Type
These fields are shared with features (Section 12.2) and psms (Section 12.1)
`sequence`	The peptide’s sequence (with no modifications)	string
`peptidoform`	Peptide sequence with modifications, see more Section 3.1	string
`modifications`	Modifications details: modification name, positions and localization probabilities: read Section 3.2	array[struct], null
`gg_accessions`	Gene group accessions.	array[string], null
`gg_names`	Gene names, as a string array	array[string], null
`best_id_score`	The best search engine score from all the features/psms identified	array[struct[name: string, value:float32]], null
`sample_accession`	The sample accession in the SDRF, which column is called `source name`	string, null
`abundance`	The peptide abundance in the given sample accession	float32, null

12.3.3. Format

The peptide view can be found in peptide.avsc.

13. Protein views: Protein groups and Protein summary

We have two main reports for protein information.

The protein group view (Section 13.1) is the output of the quantification tool, such as quantms, MaxQuant or DIA-NN.
The protein view (Chapter 14) is a summary of the proteins quantified in each sample.

13.1. Protein group view

The protein group view is a tabular file that contains the details of the protein groups identified and quantified. The protein group is similar to the outputs of multiple tools such as MaxQuant, DIA-NN, and others.

The file defines the relation between a protein group and the raw file that contains it. The protein group view is a key file in the proteomics data analysis workflow, as it describes the protein groups identified and quantified in the experiment.

13.1.1. Protein group use cases

Retrieve all the protein groups identified or quantified in the file.
Compute the protein group abundance by file and condition.
Store information about FDR and q-values for the protein groups identified/quantified.

13.1.2. Protein group fields

Field	Description	Type	DIA-NN	FragPipe	MaxQuant
`pg_accessions`	Protein group accessions of all the proteins within this group	array[string]	Protein.Group	Group + Indistinguishable Proteins	Protein IDs
`pg_names`	Protein group names	array[string]	Protein.Names	-	Protein names
`gg_accessions`	Gene group accessions, as a string array	array[string]	Genes	-	Gene names
`reference_file_name`	The raw file containing the identified/quantified protein	string	Run	-	-
`global_qvalue`	Global q-value of the protein group at the experiment level	float	Global.PG.Q.Value	-	Q-value
`intensities`	Similar to the feature view, the intensity-based abundance of the protein group in the reference file for different channels	Section 12.2.3	Intensity, Normalized Intensity	-	iBAQ, Intensity, LFQ intensity
`additional_intensities`	Apart from the raw intensity, multiple intensity values can be provided as key-values pairs, for example, normalized intensity.	Section 12.2.3	-	-	-
`is_decoy`	Definition of the protein group as decoy or target	null, integer	-	-	Reverse
`contaminant`	If the protein is a contaminant	null, integer	-	-	Potential contaminant
`peptides`	Number of peptides per protein in the protein group	null, struct{sequence: string, count: int}	-	-	-
`anchor_protein`	The anchor protein of the protein group, leading protein or representative	null, string	-	Protein ID	Protein IDs
`additional_scores`	List of structures, each structure contains two fields: name and value.	Section 13.1.3	-	-	-

13.1.3. protein additional scores

At the protein level, additional scores should be stored for each protein group. The additional scores are stored as a list of key-value pairs, where the key is the name of the score (it is RECOMMENDED to use the HUPO-PSI MS ontology) and the value is an array of float32 values whose indices match the indices of the pg_accessions field. Additional scores are mainly the search engine and protein scores that are added at the protein group level.
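
The index alignment between score values and pg_accessions can be illustrated with plain Python objects. In the sketch below, the accessions, score names and values are made up, and the helper scores_by_accession is hypothetical; it only shows how a reader could pair each value with its protein.

```python
# A protein group with two members and two scores, one value per member.
pg_accessions = ["P12345", "Q67890"]
additional_scores = [
    {"name": "protein_score", "value": [12.5, 9.75]},
    {"name": "protein_qvalue", "value": [0.001, 0.004]},
]


def scores_by_accession(accessions: list, scores: list) -> dict:
    """Map each accession to its {score name: value} dict by array index."""
    result = {accession: {} for accession in accessions}
    for score in scores:
        values = score["value"]
        # The arrays must line up with pg_accessions position by position.
        if len(values) != len(accessions):
            raise ValueError(f"score {score['name']!r} does not match pg_accessions")
        for accession, value in zip(accessions, values):
            result[accession][score["name"]] = value
    return result


per_protein = scores_by_accession(pg_accessions, additional_scores)
print(per_protein["Q67890"]["protein_score"])  # 9.75
```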

14. Protein view

The protein view is a report of the proteins identified/quantified in the experiment. It doesn’t contain major information about the inference of the protein group, but it contains the protein abundance and the protein identification scores.

14.1. Use cases

Fast reports of the proteins quantified/identified in an experiment for web interfaces and search engines.
Connection to the AE/DE formats, which makes it possible to report the coverage of the protein identification.

Field	Description	Type
`abundance`	Abundance of the given protein in the sample/experiment	null, float
`sample_accession`	Sample accession in the SDRF, which column is called `source name`	string
`best_id_score`	The best search engine score for the identification	`[{"type": "record", "name": "score", "fields": [{ "name": "name", "type": "string" },{ "name": "value", "type": "float32" }]}, null]`
`gene_accessions`	The gene accessions corresponding to every protein	null, array[string]
`gene_names`	The gene names corresponding to every protein	null, array[string]
`number_peptides`	The total number of peptides for a given protein	null, integer
`number_psms`	The total number of peptide spectrum matches	null, integer
`number_unique_peptides`	The total number of unique peptides	null, integer

14.1.1. Format

The protein view can be found in protein.avsc.

15. Mass spectra view

The mass spectra view is a tabular file that contains the details of the mass spectra identified and quantified. This view is based on mz_parquet format developed by Michael Lazear. The mz_parquet format is a parquet-based format that stores the mass spectra information in a columnar format.

15.1. Mass spectra use cases

Retrieve all the precursor mass, retention time, and intensity in the file.
Enable easy visualization and scanning on mass spectra level.
AI/ML training and prediction on mass spectra level.

15.2. Mass spectra fields

Field	Type	Description
`id`	string	Unique identifier for the scan or spectrum.
`ms_level`	int	The MS level (e.g., 1 for MS1, 2 for MS2).
`centroid`	boolean	Indicates whether the data is centroided (true) or profile mode (false).
`scan_start_time`	float32	The start time of the scan in minutes.
`inverse_ion_mobility`	float32, null	Inverse ion mobility, if available, used for TIMS data.
`ion_injection_time`	float32	The ion injection time in milliseconds.
`total_ion_current`	float	Total ion current (TIC) for the scan.
`precursors`	[null, {"type": "array", "items": {"type": "record", "name": "precursor"}}]	List of precursors for this scan, if applicable.
`selected_ion_mz`	float32	The m/z value of the selected precursor ion.
`selected_ion_charge`	int32, null	Charge state of the selected precursor ion, if available.
`selected_ion_intensity`	float32, null	Intensity of the selected precursor ion.
`isolation_window_target`	float32, null	The target m/z for the isolation window.
`isolation_window_lower`	float32, null	The lower bound of the isolation window.
`isolation_window_upper`	float32, null	The upper bound of the isolation window.
`spectrum_ref`	float32, null	Reference to another spectrum (e.g., for linking to external datasets).
`mz`	{"type": "array", "items": "float32"}	List of m/z values for the scan.
`intensity`	{"type": "array", "items": "float32"}	List of intensity values corresponding to the m/z values.
`cv_params`	[null, {"type": "array", "items": {"type": "record", "name": "cv_param"}}]	Optional list of CV parameters for additional metadata.
`name`	string	Name of the CV term (e.g., from PSI-MS or other ontologies).
`value`	string	Value associated with the CV term.

15.2.1. Format

The mass spectra view can be found in mz.avsc.

16. Get in touch

Use the following links to get support and help from the quantms maintainers:
