DIA quantification - precursor ions - Astral 2Th#

This module compares the sensitivity and quantification accuracy for data-independent acquisition (DIA) data, namely Narrow-window 2 Th, on an Orbitrap Astral (Thermo Fisher Scientific). Users can load their data and inspect the results privately. They can also make their outputs public by providing the associated parameter file and submitting the benchmark run to ProteoBench. By doing so, their workflow output will be stored alongside all other benchmark runs in ProteoBench and will be accessible to the entire community.

This module is not designed to compare later-stages post-processing of quantitative data such as missing value replacement, and we advise users to publically upload data without replacement of missing values and without manual filtering.

We think that this module is more suited to evaluate the impact of (non exhaustive list):

search engine identification
peak picking
low-level ion signal normalisation

Other modules will be more suited to explore further post-pocessing steps.

Data set#

A not yet released Astral (Thermo Fisher) data independent acquisition (DIA) dataset using the same sample composition (for “A” and “B”) as described by Van Puyvelde et al., 2022 was used as a benchmark dataset. The samples are a mixture of commercial peptide digest standards of the following species: Escherichia coli (P/N:186003196, Waters Corporation), Yeast (P/N: V7461, Promega) and Human (P/N: V6951, Promega), with logarithmic fold changes (log2FCs) of 0, −1 and 2 for respectively Human, Yeast and E.coli. Please refer to the original publication for the full description of sample preparation (Van Puyvelde et al., 2022).

Data acquisition parameters were as following: Peptides were loaded directly onto the analytical column and were separated by reversed-phase chromatography using a 50 cm μPAC™ column (Thermo Scientific, cat # COL-NANO050NEOB), featuring a structured pillar array bed with a 180 µm bed width. The chromatographic gradient was initiated with 96% buffer A and 4% buffer B at a flow rate of 750 nL/min durig 1 minute. The flow rate was then reduced to 250 nL/min, and the percentage of buffer B was further increased to 40% over 15 minutes. Buffer A: 0.1% formic acid in water. Buffer B: 0.1% formic acid in 80% acetonitrile.

The mass spectrometer was operated in positive ionization mode with data-independent acquisition, with a full MS scans over a mass range of m/z 380-980 with detection in the Orbitrap at a resolution of 240,000. In each cycle of data-independent acquisition, 300 windows of 2 Th were used to isolate and fragment all precursor ions from 380 to 980 m/z. A normalized collision energy of 25% was used for higher-energy collisional dissociation (HCD) fragmentation. MS2 scan range was set from 150 to 2000 m/z with detection in the Astral with a maximum injection time of 3 ms.

The files are currently not yet uploaded to the ProteomeXchange repository, but we are working on this to make them accessible in the near future.

For now, you can download the raw files from the ProteoBench server here:

All files can be found here proteobench.cubimed.rub.de/raws/DIA-astral/

It is imperative not to rename the files once downloaded!

Download the zipped FASTA file here: ProteoBenchFASTA_MixedSpecies_HYE.zip. The fasta file provided for this module contains the three species present in the samples and contaminant proteins. (Frankenfield et al., JPR)

Metric calculation#

For each precursor ion (modified sequence + charge), we calculate the sum of signal per raw file. Contaminant sequences flagged with the prefix “Cont_” in the fasta file are removed, as well as the peptide ions that match proteins from several species and the peptide ions that are not quantified in any raw file. When applicable, “0” are replaced by NAs and missing values are ignored. Then we log2-transform the values, and calculate the mean signal per condition, with the standard deviation and coefficient of variation (CV). For each precursor ion, we calculate the difference between the mean(log2) in A and B, and compare it to its expected value. The difference between measured and expected mean(log2) is called “epsilon”. The total number of unique precursor ions is reported on the vertical axis, and the mean or median absolute epsilon is reported on the horizontal axis. More detailed description of how the data are handled before metrics calculation may be found in the tool-specific paragraphs below.

How to use#

Input data for private visualisation of your benchmark run(s)#

The module is flexible in terms of what workflow the participants can run. However, to ensure a fair comparison of the different processing tools, we suggest using the parameters listed in Table 1.

Parameter	Value
Maximum number of missed cleavages	1
PSM/Precursor FDR	0.01
Spectral Library	Predicted spectral library from FASTA
Precursor charge state	1-5
Precursor m/z range	380-980
Fragment ion m/z range	150-2000
Endopeptidase	Trypsin/P
Fixed modifications	Carbamidomethylation (C)
Variable modifications	Oxidation (M), Acetyl (Protein N-term)
Maximum number of variable modifications	1
Minimum peptide length	6 residues

Submit your run for public usage#

When you have successfully uploaded and visualized a benchmark run, we strongly encourage you to add the result to the online repository. This way, your run will be available to the entire community and can be compared to all other uploaded benchmark runs. By doing so, your workflow outputs, parameters and calculated metrics will be stored and publicly available.

To submit your run for public usage, you need to upload the parameter file associated to your run in the field Meta data for searches. Currently, we accept outputs from AlphaDIA, DIA-NN, FragPipe, MaxDIA, PEAKS and Spectronaut (see bellow for more tool-specific details). Please fill the Comments for submission if needed, and confirm that the metadata is correct (corresponds to the benchmark run) before checking the button I confirm that the metadata is correct. Then the button I really want to upload it will appear to trigger the submission.

Table 2 provides an overview of the required input files for public submission. More detailed instructions are provided for each individual tool in the following section.

Table 2. Overview of input files required for metric caluclation and public submission

Tool	Input file	Parameter File
AlphaDIA	precursors.tsv	log.txt
DIA-NN	_report.tsv or _report.parquet	*report.log.txt
FragPipe	*_report.tsv	fragpipe.workflow
MaxDIA	evidence.txt	mqpar.xml
Spectronaut	*.tsv	*.txt
PEAKS	lfq.dia.features.csv	parameters.txt

After upload, you will get a link to a Github pull request associated with your data. Please copy it and save it. With this link, you can get the unique identifier of your run (for example DIANN_20250505_083341), and follow the advancement of your submission and add comments to communicate with the ProteoBench maintainers. If everything looks good, your submission will be reviewed and accepted (it will take a few working days). Then, your benchmark run will be added to the public runs of this module and plotted alongside all other benchmark runs in the figure.

Important Tool-specific settings#

DIA-NN #

Import Raw files
Add FASTA but do not select “Contaminants” since these are already included in the FASTA file
Turn on FASTA digest for library-free search / library generation (automatically activates deep-learning based spectra, RTs, and IMs prediction).
Do not set verbosity/Log Level higher than 1, otherwise parameter parsing will not work correctly.
The input files for Proteobench are “_report.tsv” or “_report.parquet” (main report for the precursor quantities) and “report.log.txt” (parameter files).

AlphaDIA #

Select FASTA and import .raw files in “Input files”
In “Method settings” you need to define your search parameters
Turn on “Predict Library”
Turn on “Precursor Level LFQ”
Because ProteoBench requires information from both “precursors.tsv” and “precursor.matrix.tsv”, it needs to be preprocessed. For this, please refer to the Jupyter Notebook “ProteoBench_input_conversion.ipynb” HERE. Using this notebook will provide you with the correct input file that can be used in ProteoBench. The parameter file is “log.txt”.

Note: V1.10.4 is required to obtain the most optimal performance (improved check for MS1 cycle)

FragPipe - DIA-NN #

Load the DIA_SpecLib_Quant workflow
Following import of raw files, assign experiments “by File Name” right above the list of raw files.
Make sure contaminants are not added when you add decoys to the database.
Upload “*report.tsv” in order for Proteobench to calculate the ion ratios. For public submission, please provide the parameter file “fragpipe.workflow” that correspond to your search.

In FragPipe output files, the protein identifiers matching a given ion are in two separate columns: “Proteins” and “Mapped Proteins”. So we concatenate these two fields to have the protein groups.

Spectronaut #

Configure the proteobench fasta by importing the fasta provided in this module in the “Databases” tab using uniprot parsing rule
In the “Analysis” tab, select “Set up a DirectDIA Analysis from folder”
Select the folder containting the raw files in order to load the raw files
Once loaded, you optionally can change the name of the project
In the next tab select the proteobench fasta as the database
Choose your settings in the next tab
In the next tab fill in the conditions: “LFQ_Astral_DIA_15min_50ng_Condition_A_REP1”,”LFQ_Astral_DIA_15min_50ng_Condition_A_REP2”, “LFQ_Astral_DIA_15min_50ng_Condition_A_REP3”,”LFQ_Astral_DIA_15min_50ng_Condition_B_REP1”,”LFQ_Astral_DIA_15min_50ng_Condition_B_REP2”,”LFQ_Astral_DIA_15min_50ng_Condition_B_REP3”
Do not tick any GO terms or Library exensions in the next tabs
Finish the settings on the next tab in order to start the search
After the search is finished go to the “Report” tab, select “BGS factory Report” and go for “export Report”, name the file”…_Report” and select .tsv format
Upload the “…_Report.tsv” for private submission and “…Report.setup.txt” (which is in the same folder as the report.tsv file) for public submission to Proteobench

We accept Spectronaut BGS Factory Reports (normal format): the “.._Report.tsv” file is used for calculating the metrics, and the “…_Report.setup.txt” file for parameter parsing when doing public upload.

MaxDIA (work in progress)#

By default, MaxDIA uses a contaminants-only fasta file that is located in the software folder (“contaminant.txt”). However, the fasta file provided for this module already contains a set of curated contaminant sequences. Therefore, in the MaxQuant settings (Global parameters > Sequences), UNTICK the “Include contaminants” box. Furthermore, please make sure the FASTA parsing is set as Identifier rule = >([^\t]*); Description rule = >(.*)). When uploading the raw files, press the “No Fractions” button to set up the experiment names as follows: “A_REP1”, “A_REP2”, “A_REP3”, “B_REP1”, “B_REP2”, “B_REP3”.

For this module, use the “evidence.txt” output in the “txt” folder of MaxQuant search outputs. For public submission, please upload the “mqpar.xml” file associated with your search.

PEAKS #

When starting a new project and selecting the .RAW files for analysis, you will need to modify the sample names given by PEAKS (Sample 1->6), so they match exactly with the .RAW file names: -LFQ_Astral_DIA_15min_50ng_Condition_A_REP1 -LFQ_Astral_DIA_15min_50ng_Condition_A_REP2 -LFQ_Astral_DIA_15min_50ng_Condition_A_REP3 -LFQ_Astral_DIA_15min_50ng_Condition_B_REP1 -LFQ_Astral_DIA_15min_50ng_Condition_B_REP2 -LFQ_Astral_DIA_15min_50ng_Condition_B_REP3

Make sure to set Enzyme as trypsin, Instrument as Orbitrap (Astral), Fragment as HCD and Acquisition as DIA. In workflow section use the Quantification option. While we do not propose to use a custom spectral library, one could define one in the “Spectral library” tab. Define the different search parameters in the tab “DB search”. In the tab “Quantification” use the “Label Free” option, followed by either adding all samples individually or grouping samples according to their respective condition. In the “Report” tab, make sure both Precursor or Peptide FDR and Protein Group FDR are set to 1%. Once the workflow has run succesfully, make sure to check the “All Search Parameters” and the “Feature Vector CSV” from the Label Free Quantification Exports in the “Export” tab.

Custom format#

If you do not use a tool that is compatible with ProteoBench, you can upload a tab-delimited table format containing the following columns:

Sequence: peptide sequence without the modification(s)
Proteins: column containing the protein identifiers. These should be separated by “;”, and contain the species flag (for example “_YEAST”).
Charge: Charge state of measured peptide ions
Modified sequence: column containing the sequences and the localised modifications in the ProForma standard format.
LFQ_Astral_DIA_15min_50ng_Condition_A_REP1: Quantitative column sample 1
LFQ_Astral_DIA_15min_50ng_Condition_A_REP2: Quantitative column sample 2
LFQ_Astral_DIA_15min_50ng_Condition_A_REP3: Quantitative column sample 3
LFQ_Astral_DIA_15min_50ng_Condition_B_REP1: Quantitative column sample 4
LFQ_Astral_DIA_15min_50ng_Condition_B_REP2: Quantitative column sample 5
LFQ_Astral_DIA_15min_50ng_Condition_B_REP3: Quantitative column sample 6

the table must not contain non-validated ions. If you have any issue, contact us here.

toml file description (work in progress)#

Each software tool produces specific output files formats. We made .toml files that describe where to find the information needed in each type of input. These can be found in proteobench/modules/dia_quant/io_parse_settings:

[mapper] mapping between the headers in the input file (left-hand side) and the header of the intermediate file generated by ProteoBench. If more parsing is required before metrics calculation, this part can contain mapping between intermediatec column names and the name in the intermediate file. This is the case for Proline where protein accessions are reported in two independent columns that need to be combined. This should be commented in the toml.
- “Raw file” = field that contains the raw file identifiers. If the field “Raw file” is present, the table is parsed is a long format, otherwise it is parsed as wide format.
- “Reverse” = field that indicates if the precursor is identified as decoy/reverse. If the field “Reverse” is present, we will filter out the values of this column equal to the decoy flag (see [general]).
- “Sequence” = peptide sequence without modification(s)
- “Modified sequence” = peptide sequence with localised modifications, ideally in the ProForma format.
- “Charge” = precursor charge.
- “Proteins” = field containing the protein identifiers. These should be separated by “;”, and contain the species flag (for example “_YEAST”). Curently, there is an exception for FragPipe’s .toml where we combine two columns, and protein IDs are seperated by “,” (see the FragPipe section).
- “Intensity” = field containing the intensities utilised to calculate the module metrics. Used for long-format input.
[condition_mapper] mapping between the headers of the quantification values in the input file (left-hand side) and the condition (Condition A and Condition B).
[run_mapper] mapping between the headers of the quantification values in the input file (left-hand side) and the samples (condition + replicate) in the intermediate file.
[species_mapper] suffix corresponding to the species in the input table (left-hand side), and corresponding species (right-hand side) for ratio calculation.
[general] contaminant and decoy flags used for filtering out precursor ions matched to decoy or contaminant sequences. The contaminant flag in this module should be “Cont_” to correspond to the contaminants as labelled in the provided fasta file. The decoy flag is only used to filter out rows that do not pass the validation step but are reported in the table.
[modifications_parser] information necessary for parsing the modification and their localisation when the input table contains a columns with modified sequences. When the input contains a column with stripped sequences and a column with the localised modification, this part is not needed.
- “parse_column” = “Modified Sequence” / Indicates the name of the column that should be parsed (i.e. that contains the sequence and localised modifications).
- “before_aa” = false / Indicates if the modification flag is before or after the modification. For example, this has to be set to false when the cysteine is carbamidomethylated on the third position here: NEC[+57.0214]VVVIR. However, when the modification tag is before the amino acid it needs to be set to true, for example for the same peptidoforms: NE[+57.0214]CVVVIR.
- “isalpha” = true / In the code the sequence is stripped to insert modifications later. This flag indicates that the modification can be separated by taking only characters that are alpha. For example, “NE[+57.0214]CVVVIR”, “[+57.0214]” is removed.
- “isupper” = true / In the code the sequence is stripped to insert modifications later. This flag indicates that the modification can be separated by taking only characters that are capitalized. For example, “NEYpCVVVIR”, “p” is removed.
- “pattern” = “\[([^]]+)\]” \ This regex pattern indicates the values to be matched for modifications. Make sure to include the full tag (only the peptide sequence should remain): “NEC[+57.0214]VVM[+15.9949]VIR”. You can test your python regexes here: https://regex101.com/
- “modification_dict” = {“+57.0215” = “Carbamidomethyl”, “+15.9949” = “Oxidation”, “-17.026548” = “Gln->pyro-Glu”, “-18.010565” = “Glu->pyro-Glu”, “+42” = “Acetyl”} / Pattern that is matched to be translated into the ProForma standard: HUPO-PSI/ProForma: HUPO-PSI Standardized peptidoform notation (link to github). Make sure there are no additional parentheses, only the modification name should be translated to.

Result Description#

After uploading an output file, a table is generated that contains the following columns:

precursor ion = concatenation of the modified sequence and charge
mean log2-transformed intensities for condition A and B
standard deviations calculated for the log2-transformed values in condition A and B
mean intensity for condition A and B
standard deviations calculated for the intensity values in condition A and B
coefficient of variation (CV) for condition A and B
differences of the mean log2-transformed values between condition A and B
MS signal from the input table (“abundance_LFQ_Astral_DIA_15min_50ng_Condition_A_REP1” to “abundance_LFQ_Astral_DIA_15min_50ng_Condition_B_REP3”)
Count = number of runs with non-missing values
species the sequence matches to
unique = TRUE if the sequence is species-specific
species
expected ratio for the given species
epsilon = difference of the observed and expected log2-transformed fold change

Choose with the slider below the minimum number of quantification value per raw file. Example: when 3 is selected, only the precursor ions quantified in 3 or more raw files will be considered for the plot.

Define Parameters#

To make the results available to the entire community, you need to provide the parameter file that corresponds to your analysis. You can upload it in the drag and drop area in the “Add results to online repository” section (under Download calculated ratio’s). See here for all compatible parameter files. In this module, we keep track of the following parameters, if you feel that some important information is missing, please add it in the Comments for submission field.

software tool name and version
search engine name and version (if different from software tool)
FDR threshold for PSM, precursor, peptide and protein level
match between run (or not)
Precursor and fragment m/z range
precursor and fragment mass tolerance
enzyme (although for these data it should be Trypsin)
maximum number of missed-cleavages
minimum and maximum peptide length
fixed and variable modifications
maximum number of modifications
minimum and maximum precursor charge

Once you confirm that the metadata is correct (and corresponds to the table you uploaded before generating the plot), a button will appear. Press it to submit.

DISCLAIMER: When submitting parameter files, please be aware that your dataset may contain identifiable information through embedded file paths. These paths can reveal personal usernames, system architecture, project names, and directory structures associated with e.g.

The FASTA database location
The RAW data location
Installation paths for the tools being used

Such metadata can inadvertently disclose sensitive or institution-specific information. We recommend reviewing and sanitizing any file paths prior to submission to ensure compliance with your organization’s data privacy policies and to protect personal or institutional identifiers.

If some parameters are not in your parameter file, it is important that you provide them in the “comments” section.

Once submitted, you will see a weblink that will prompt you to a pull request on the github repository of the module. Please write down its number to keep track of your submission. If it looks good, one of the reviewers will accept it and make your data public.

Please contact us if you have any issue. To do so, you can create an issue on our github, or send us an email.