DIA quantification - precursor ions - AIF data#

This module compares the sensitivity and quantification accuracy for data-independent acquisition (DIA) data, namely All-Ion Fragmentation (AIF), on a Q Exactive HF-X Orbitrap (Thermo Fisher). Users can load their data and inspect the results privately. They can also make their outputs public by providing the associated parameter file and submitting the benchmark run to ProteoBench. By doing so, their workflow output will be stored alongside all other benchmark runs in ProteoBench and will be accessible to the entire community.

This module is not designed to compare later-stages post-processing of quantitative data such as missing value replacement, and we advise users to publically upload data without replacement of missing values and without manual filtering.

We think that this module is more suited to evaluate the impact of (non exhaustive list):

  • search engine identification

  • peak picking

  • low-level ion signal normalisation

Other modules will be more suited to explore further post-pocessing steps.

Data set#

A subset of the Q Exactive HF-X Orbitrap (Thermo Fisher) data independent acquisition (DIA) data described by Van Puyvelde et al., 2022 was used as a benchmark dataset (in the manuscript referred to as All-Ion Fragmentation (AIF)). Here, only the first biological replicate series (named “alpha”) was used, encompassing three technical replicates of two different conditions (referred to as “A” and “B”). The samples are a mixture of commercial peptide digest standards of the following species: Escherichia coli (P/N:186003196, Waters Corporation), Yeast (P/N: V7461, Promega) and Human (P/N: V6951, Promega), with logarithmic fold changes (log2FCs) of 0, −1 and 2 for respectively Human, Yeast and E.coli. Please refer to the original publication for the full description of sample preparation and data acquisition parameters (Van Puyvelde et al., 2022).

The files can be downloaded from the proteomeXchange repository PXD028735, make sure that you download the following raw files:

Alternatively, you can download them from the ProteoBench server here: proteobench.cubimed.rub.de/datasets/raw_files/DIA/

It is imperative not to rename the files once downloaded!

Download the zipped FASTA file here: ProteoBenchFASTA_DDAQuantification.zip. The fasta file provided for this module contains the three species present in the samples and contaminant proteins. (Frankenfield et al., JPR) Note that this is the same FASTA as used in Module 2 - DDA Quantification.

Metric calculation#

For each precursor ion (modified sequence + charge), we calculate the sum of signal per raw file. Contaminant sequences flagged with the prefix “Cont_” in the fasta file are removed, as well as the peptide ions that match proteins from several species and the peptide ions that are not quantified in any raw file. When applicable, “0” are replaced by NAs and missing values are ignored. Then we log2-transform the values, and calculate the mean signal per condition, with the standard deviation and coefficient of variation (CV). For each precursor ion, we calculate the difference between the mean(log2) in A and B, and compare it to its expected value. The difference between measured and expected mean(log2) is called “epsilon”. The total number of unique precursor ions is reported on the vertical axis, and the mean or median absolute epsilon is reported on the horizontal axis. Precursors matched to contaminant sequences and/or to multiple species are excluded for error calculation. More detailed description of how the data are handled before metrics calculation may be found in the tool-specific paragraphs below.

How to use#

Input data for private visualisation of your benchmark run(s)#

The module is flexible in terms of what workflow the participants can run. However, to ensure a fair comparison of the different processing tools, we suggest using the parameters listed in Table 1.

Parameter

Value

Maximum number of missed cleavages

1

PSM FDR

0.01

Spectral Library

Predicted spectral library from FASTA

Precursor charge state

1-4

Precursor m/z range

400-1000

Fragment ion m/z range

50-2000

Endopeptidase

Trypsin/P

Fixed modifications

Carbamidomethylation (C)

Variable modifications

Oxidation (M), Acetyl (Protein N-term)

Maximum number of variable modifications

1

Minimum peptide length

6 residues

Submit your run for public usage#

When you have successfully uploaded and visualized a benchmark run, we strongly encourage you to add the result to the online repository. This way, your run will be available to the entire community and can be compared to all other uploaded benchmark runs. By doing so, your workflow outputs, parameters and calculated metrics will be stored and publicly available.

To submit your run for public usage, you need to upload the parameter file associated to your run in the field Meta data for searches. Currently, we accept outputs from DIA-NN, AlphaDIA, FragPipe, MaxDIA and Spectronaut (see bellow for more tool-specific details). Please fill the Comments for submission if needed, and confirm that the metadata is correct (corresponds to the benchmark run) before checking the button I confirm that the metadata is correct. Then the button I really want to upload it will appear to trigger the submission.

Table 2 provides an overview of the required input files for public submission. More detailed instructions are provided for each individual tool in the following section.

Table 2. Overview of input files required for metric caluclation and public submission

Tool

Input file

Parameter File

AlphaDIA

precursors.tsv

log.txt

DIA-NN

*_report.tsv

*report.log.txt

FragPipe

*_report.tsv

fragpipe.workflow

MaxDIA

evidence.txt

mqpar.xml

Spectronaut

*.tsv

*.txt

PEAKS

lfq.dia.peptides.csv

parameters.txt

After upload, you will get a link to a Github pull request associated with your data. Please copy it and save it. With this link, you can get the unique identifier of your run (for example Proline__20240106_141919), and follow the advancement of your submission and add comments to communicate with the ProteoBench maintainers. If everything looks good, your submission will be reviewed and accepted (it will take a few working days). Then, your benchmark run will be added to the public runs of this module and plotted alongside all other benchmark runs in the figure.

Important Tool-specific settings#

DIA-NN#

  1. Import Raw files

  2. Add FASTA but do not select “Contaminants” since these are already included in the FASTA file

  3. Turn on FASTA digest for library-free search / library generation (automatically activates deep-learning based spectra, RTs, and IMs prediction).

  4. Do not set verbosity/Log Level higher than 1, otherwise parameter parsing will not work correctly.

  5. The input files for Proteobench are “*_report.tsv” (main report for the precursor quantities) and “*report.log.txt” (parameter files).

AlphaDIA#

  1. Select FASTA and import .raw files in “Input files”

  2. In “Method settings” you need to define your search parameters

  3. Turn on “Predict Library”

  4. The input files for ProteoBench are “precursors.tsv” (peptide identification) and “log.txt” (parameter files)

FragPipe - DIA-NN#

  1. Load the DIA_SpecLib_Quant workflow

  2. Following import of raw files, assign experiments “by File Name” right above the list of raw files.

  3. Make sure contaminants are not added when you add decoys to the database.

  4. Upload “*report.tsv” in order for Proteobench to calculate the ion ratios. For public submission, please provide the parameter file “fragpipe.workflow” that correspond to your search.

In FragPipe output files, the protein identifiers matching a given ion are in two separate columns: “Proteins” and “Mapped Proteins”. So we concatenate these two fields to have the protein groups.

Spectronaut (work in progress)#

MaxDIA (work in progress)#

By default, MaxDIA uses a contaminants-only fasta file that is located in the software folder (“contaminant.txt”). However, the fasta file provided for this module already contains a set of curated contaminant sequences. Therefore, in the MaxQuant settings (Global parameters > Sequences), UNTICK the “Include contaminants” box. Furthermore, please make sure the FASTA parsing is set as Identifier rule = >([^\t]*); Description rule = >(.*)). When uploading the raw files, press the “No Fractions” button to set up the experiment names as follows: “A_Sample_Alpha_01”, “A_Sample_Alpha_02”, “A_Sample_Alpha_03”, “B_Sample_Alpha_01”, “B_Sample_Alpha_02”, “B_Sample_Alpha_03”.

For this module, use the “evidence.txt” output in the “txt” folder of MaxQuant search outputs. For public submission, please upload the “mqpar.xml” file associated with your search.

PEAKS/) (work in progress)#

When starting a new project and selecting the .RAW files, there is no need to modify the sample names given by PEAKS. Just make sure that Sample 1 -> 3 are Condition “A” and Sample 4 -> 6 are condition “B”. Make sure to set Enzyme as trypsin, Instrument as Orbitrap (Orbi-Orbi), Fragment as HCD and Acquisition as DIA. In workflow section use the Quantification option. While we do not propose to use a custom spectral library, one could define one in the “Spectral library” tab. Define the different search parameters in the tab “DB search”. In the tab “Quantification” use the “Label Free” option, followed by either adding all samples individually or grouping samples according to their respective condition. In the “Report” tab, make sure both Peptide FDR and Protein Group FDR are set to 1%. Once the workflow has run succesfully, make sure to check the “All Search Parameters” and the “Peptide CSV” from the Label Free Quantification Exports in the “Export” tab.

Troubleshooting:#

Since the Thermo DIA data .raw files were acquired using a staggered window approach it is highly recommended to convert and demultiplex the .RAW files first into .mzML using MSConvert. Detailed instructions for this process can be found here.

Custom format#

If you do not use a tool that is compatible with ProteoBench, you can upload a tab-delimited table format containing the following columns:

  • Sequence: peptide sequence without the modification(s)

  • Proteins: column containing the protein identifiers. These should be separated by “;”, and contain the species flag (for example “_YEAST”).

  • Charge: Charge state of measured peptide ions

  • Modified sequence: column containing the sequences and the localised modifications in the ProForma standard format.

  • LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01: Quantitative column sample 1

  • LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_02: Quantitative column sample 2

  • LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_03: Quantitative column sample 3

  • LFQ_Orbitrap_AIF_Condition_B_Sample_Alpha_01: Quantitative column sample 4

  • LFQ_Orbitrap_AIF_Condition_B_Sample_Alpha_02: Quantitative column sample 5

  • LFQ_Orbitrap_AIF_Condition_B_Sample_Alpha_03: Quantitative column sample 6

the table must not contain non-validated ions. If you have any issue, contact us here.

toml file description (work in progress)#

Each software tool produces specific output files formats. We made .toml files that describe where to find the information needed in each type of input. These can be found in proteobench/modules/dia_quant/io_parse_settings:

  • [mapper] mapping between the headers in the input file (left-hand side) and the header of the intermediate file generated by ProteoBench. If more parsing is required before metrics calculation, this part can contain mapping between intermediatec column names and the name in the intermediate file. This is the case for Proline where protein accessions are reported in two independent columns that need to be combined. This should be commented in the toml.

    • “Raw file” = field that contains the raw file identifiers. If the field “Raw file” is present, the table is parsed is a long format, otherwise it is parsed as wide format.

    • “Reverse” = field that indicates if the precursor is identified as decoy/reverse. If the field “Reverse” is present, we will filter out the values of this column equal to the decoy flag (see [general]).

    • “Sequence” = peptide sequence without modification(s)

    • “Modified sequence” = peptide sequence with localised modifications, ideally in the ProForma format.

    • “Charge” = precursor charge.

    • “Proteins” = field containing the protein identifiers. These should be separated by “;”, and contain the species flag (for example “_YEAST”). Curently, there is an exception for FragPipe’s .toml where we combine two columns, and protein IDs are seperated by “,” (see the FragPipe section).

    • “Intensity” = field containing the intensities utilised to calculate the module metrics. Used for long-format input.

  • [condition_mapper] mapping between the headers of the quantification values in the input file (left-hand side) and the condition (Condition A and Condition B).

  • [run_mapper] mapping between the headers of the quantification values in the input file (left-hand side) and the samples (condition + replicate) in the intermediate file.

  • [species_mapper] suffix corresponding to the species in the input table (left-hand side), and corresponding species (right-hand side) for ratio calculation.

  • [general] contaminant and decoy flags used for filtering out precursor ions matched to decoy or contaminant sequences. The contaminant flag in this module should be “Cont_” to correspond to the contaminants as labelled in the provided fasta file. The decoy flag is only used to filter out rows that do not pass the validation step but are reported in the table.

  • [modifications_parser] information necessary for parsing the modification and their localisation when the input table contains a columns with modified sequences. When the input contains a column with stripped sequences and a column with the localised modification, this part is not needed.

    • “parse_column” = “Modified Sequence” / Indicates the name of the column that should be parsed (i.e. that contains the sequence and localised modifications).

    • “before_aa” = false / Indicates if the modification flag is before or after the modification. For example, this has to be set to false when the cysteine is carbamidomethylated on the third position here: NEC[+57.0214]VVVIR. However, when the modification tag is before the amino acid it needs to be set to true, for example for the same peptidoforms: NE[+57.0214]CVVVIR.

    • “isalpha” = true / In the code the sequence is stripped to insert modifications later. This flag indicates that the modification can be separated by taking only characters that are alpha. For example, “NE[+57.0214]CVVVIR”, “[+57.0214]” is removed.

    • “isupper” = true / In the code the sequence is stripped to insert modifications later. This flag indicates that the modification can be separated by taking only characters that are capitalized. For example, “NEYpCVVVIR”, “p” is removed.

    • “pattern” = “\[([^]]+)\]” \ This regex pattern indicates the values to be matched for modifications. Make sure to include the full tag (only the peptide sequence should remain): “NEC[+57.0214]VVM[+15.9949]VIR”. You can test your python regexes here: https://regex101.com/

    • “modification_dict” = {“+57.0215” = “Carbamidomethyl”, “+15.9949” = “Oxidation”, “-17.026548” = “Gln->pyro-Glu”, “-18.010565” = “Glu->pyro-Glu”, “+42” = “Acetyl”} / Pattern that is matched to be translated into the ProForma standard: HUPO-PSI/ProForma: HUPO-PSI Standardized peptidoform notation (link to github). Make sure there are no additional parentheses, only the modification name should be translated to.

Result Description#

After uploading an output file, a table is generated that contains the following columns:

  • precursor ion = concatenation of the modified sequence and charge

  • mean log2-transformed intensities for condition A and B

  • standard deviations calculated for the log2-transformed values in condition A and B

  • mean intensity for condition A and B

  • standard deviations calculated for the intensity values in condition A and B

  • coefficient of variation (CV) for condition A and B

  • differences of the mean log2-transformed values between condition A and B

  • MS signal from the input table (“abundance_DDA_Condition_A_Sample_Alpha_01” to “abundance_DDA_Condition_B_Sample_Alpha_03”)

  • Count = number of runs with non-missing values

  • species the sequence matches to

  • unique = TRUE if the sequence is species-specific

  • species

  • expected ratio for the given species

  • epsilon = difference of the observed and expected log2-transformed fold change

Choose with the slider below the minimum number of quantification value per raw file. Example: when 3 is selected, only the precursor ions quantified in 3 or more raw files will be considered for the plot.

Define Parameters#

To make the results available to the entire community, you need to provide the parameter file that corresponds to your analysis. You can upload it in the drag and drop area in the “Add results to online repository” section (under Download calculated ratio’s). See here for all compatible parameter files. In this module, we keep track of the following parameters, if you feel that some important information is missing, please add it in the Comments for submission field.

  • software tool name and version

  • search engine name and version (if different from software tool)

  • FDR threshold for PSM, peptide and protein level

  • match between run (or not)

  • Precursor and fragment m/z range

  • precursor and fragment mass tolerance

  • enzyme (although for these data it should be Trypsin)

  • maximum number of missed-cleavages

  • minimum and maximum peptide length

  • fixed and variable modifications

  • maximum number of modifications

  • minimum and maximum precursor charge

Once you confirm that the metadata is correct (and corresponds to the table you uploaded before generating the plot), a button will appear. Press it to submit.

If some parameters are not in your parameter file, it is important that you provide them in the “comments” section.

Once submitted, you will see a weblink that will prompt you to a pull request on the github repository of the module. Please write down its number to keep track of your submission. If it looks good, one of the reviewers will accept it and make your data public.

Please contact us if you have any issue. To do so, you can create an issue on our github, or send us an email.