Intermediate format specification

This document specifies the internal tabular formats that ProteoBench produces while processing a benchmark submission. These formats are for internal use: scoring, plotting, datapoint generation, the submission-validation layer, and the intermediate_hash that identifies a dataset all depend on the column names, types, and semantics defined here. Note that an intermediate format is specific to a module, although modules can share an intermediate format where applicable. Any module-specific explanation on this page is purely illustrative.

Note

Status: descriptive specification of the current behaviour (format version 1, implicit). It documents what the code produces today. Proposed changes (an explicit version field, a canonical serialisation for hashing) are described in the “Reproducibility and the intermediate hash” and “Versioning” sections below.

Scope

There are two distinct tables:

Artifact	Produced by	Persisted	Consumed by
Standard format (pre-scoring)	`ParseSettingsQuant.convert_to_standard_format()`	No (in-memory)	Scoring; the submission-validation layer
Intermediate format (post-scoring)	`QuantScoresHYE.generate_intermediate()`	Yes, as `result_performance.csv` in the dataset archive; also hashed into `intermediate_hash`	Plotting, datapoint metrics, the “View Single Result” table

The intermediate format is the primary subject of this specification. The standard format is documented as the declared input format that any module or tool parser must produce.

Standard format

Produced by convert_to_standard_format() in proteobench/io/parsing/parse_settings.py. It is a long table with one row per quantified feature per run (for the precursor quantification module, one row per (precursor, run) pair): a precursor that was quantified in six runs yields six rows. The table is not projected to a fixed column set, so unmapped input columns may also be present; only the columns below are relied upon downstream. This column set is specific to quantification modules and may differ for other modules.

Column	Type	Required	Meaning
`Proteins`	str	yes	Protein or protein-group identifier(s). Multiple proteins may be joined by `;` or `,`; UniProt-style identifiers are split on the pipe character into database, accession, and entry name.
`Sequence`	str	yes	Plain (unmodified) peptide sequence.
`Charge`	int	yes	Precursor charge.
`proforma`	str	conditional	ProForma-encoded modified sequence. Present only when the tool TOML defines a `[modifications_parser]`.
`precursor ion`	str	conditional	`proforma + "/" + Charge`. Present when `level = "ion"`.
`peptidoform`	str	conditional	Equal to `proforma`. Present when `level = "peptidoform"`.
`Raw file`	str	yes	Cleaned run name (extensions and known suffixes stripped by `_clean_run_name()`).
`Intensity`	float	yes	Quantitative intensity for the (precursor, run). Rows with `Intensity <= 0` are removed.
`replicate`	str	yes	Condition label, `"A"` or `"B"`, mapped from `Raw file` via the tool’s `condition_mapper`.
`contaminant`	bool	yes	Result of matching `Proteins` against the contaminant flag. Contaminant rows are removed, so survivors are all `False`.
species flags	bool	yes	One boolean column per species in the module `species_mapper` (for HYE: `HUMAN`, `YEAST`, `ECOLI`), set by substring match on `Proteins`.
`MULTI_SPEC`	bool	yes	`True` when the precursor matches more than `min_count_multispec` species. Multi-species rows are removed.
run dummy columns	bool	yes	One boolean column per distinct `Raw file` (from `pd.get_dummies`).
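
The row filters and derived columns in the table above can be illustrated on a toy long table. This is a minimal sketch, not the ProteoBench implementation: the data, the `species_mapper` and `condition_mapper` dictionaries, and the `contaminant_flag` string are hypothetical stand-ins for the module and tool settings, and the `proforma` and `MULTI_SPEC` steps are left out.

```python
import pandas as pd

# Toy long table: one row per (precursor, run). All values are made up.
long_df = pd.DataFrame(
    {
        "precursor ion": ["PEPTIDEK/2", "PEPTIDEK/2", "ELVISK/3", "ELVISK/3", "CONTAMK/2"],
        "Proteins": [
            "sp|P00001|PROT1_HUMAN",
            "sp|P00001|PROT1_HUMAN",
            "sp|P00002|PROT2_YEAST",
            "sp|P00002|PROT2_YEAST",
            "Cont_P00003",
        ],
        "Raw file": ["run_A1", "run_B1", "run_A1", "run_B1", "run_A1"],
        "Intensity": [1000.0, 1200.0, 500.0, 0.0, 800.0],
    }
)

# Hypothetical settings; the real values come from the module and tool TOML files.
species_mapper = {"_HUMAN": "HUMAN", "_YEAST": "YEAST", "_ECOLI": "ECOLI"}
condition_mapper = {"run_A1": "A", "run_B1": "B"}
contaminant_flag = "Cont_"

std = long_df.copy()
std["replicate"] = std["Raw file"].map(condition_mapper)
std["contaminant"] = std["Proteins"].str.contains(contaminant_flag, regex=False)
for suffix, species in species_mapper.items():
    # Species flags are set by substring match on Proteins.
    std[species] = std["Proteins"].str.contains(suffix, regex=False)

# Remove rows with non-positive intensity and contaminant rows.
std = std[(std["Intensity"] > 0) & ~std["contaminant"]].reset_index(drop=True)

# One boolean column per distinct run name.
run_dummies = pd.get_dummies(std["Raw file"], dtype=bool)
std = pd.concat([std, run_dummies], axis=1)

print(std[["precursor ion", "Raw file", "replicate", "HUMAN", "YEAST", "run_A1", "run_B1"]])
```

Three of the five toy rows survive: the zero-intensity row and the contaminant row are dropped before the run dummy columns are built.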

Intermediate format

The intermediate format is the table defined for calculating all metrics in a given module. It should contain every value needed for metric calculation in a standardized fashion. It is not limited to these columns: it can contain more, including tool-specific ones. At a minimum, however, the columns used for metric calculation must be defined (i.e., each column has a description) and “normalized” (i.e., values are comparable between submissions).

An example for quantification modules is the table produced by QuantScoresHYE.generate_intermediate() in proteobench/score/quantscoresHYE.py. It has one row per precursor (or peptidoform). Only single-species precursors are retained: compute_epsilon() keeps rows where exactly one species flag is set (unique == 1), so multi-species precursors do not appear.
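
The single-species rule can be sketched as follows. This only illustrates the `unique == 1` condition described above, not the body of compute_epsilon(); the toy data and the way the `species` label is derived here are assumptions.

```python
import pandas as pd

# Toy species flags per precursor (values made up).
flags = pd.DataFrame(
    {
        "precursor ion": ["AAAK/2", "BBBK/2", "CCCK/3"],
        "HUMAN": [True, True, False],
        "YEAST": [False, True, False],
        "ECOLI": [False, False, True],
    }
)
species_cols = ["HUMAN", "YEAST", "ECOLI"]

# unique is the number of species flags set on a row.
flags["unique"] = flags[species_cols].sum(axis=1)

# Keep rows that match exactly one species.
kept = flags[flags["unique"] == 1].copy()

# With exactly one flag set, the column holding the 1 names the species.
kept["species"] = kept[species_cols].astype(int).idxmax(axis=1)

print(kept[["precursor ion", "unique", "species"]])
```

The precursor `BBBK/2` matches both HUMAN and YEAST, so it is dropped.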

Column catalog

Column	Type	Meaning
`precursor ion` or `peptidoform`	str	Precursor key. Name depends on the module `level`. For ion modules this is `proforma + "/" + charge`.
`Intensity_mean_A`, `Intensity_mean_B`	float	Mean linear intensity across the runs of condition A / B.
`Intensity_std_A`, `Intensity_std_B`	float	Standard deviation of linear intensity per condition.
`log_Intensity_mean_A`, `log_Intensity_mean_B`	float	Mean of the log2 intensities per condition.
`log_Intensity_std_A`, `log_Intensity_std_B`	float	Standard deviation of log2 intensities per condition.
`CV_A`, `CV_B`	float	Coefficient of variation per condition, `Intensity_std / Intensity_mean`.
`log2_A_vs_B`	float	`log_Intensity_mean_A - log_Intensity_mean_B`. The observed log2 fold change.
per-run intensity columns	float	One column per distinct run name (cleaned `Raw file`), holding that run’s intensity for the quantified feature. The set of columns therefore depends on the experimental design.
`nr_observed`	int	Number of runs in which the feature (i.e. precursor, peptidoform) was quantified.
species flags	bool	One boolean per species (HYE: `YEAST`, `ECOLI`, `HUMAN`). For a retained row exactly one is `True`.
`unique`	int	Sum of the species flags. Always `1` in the persisted table.
`species`	str	The single species name for the precursor.
`log2_expectedRatio`	float	`log2` of the module’s expected A-vs-B ratio for `species`.
`epsilon`	float	Accuracy: `log2_A_vs_B - log2_expectedRatio`. Deviation from the known ratio.
`log2_empirical_median`, `log2_empirical_mean`	float	Per-species empirical centre of `log2_A_vs_B` (median / mean over the species).
`epsilon_precision_median`, `epsilon_precision_mean`	float	Precision: `log2_A_vs_B - log2_empirical_{median,mean}`. Deviation from the empirical centre.
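
As an illustration of how the per-condition columns relate to the per-run intensity columns, the sketch below derives them from a toy wide table. It is not the code in generate_intermediate(): the run names are hypothetical, a missing run is represented as NaN, and the standard deviation uses the pandas default (sample standard deviation, `ddof=1`), all of which are assumptions.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per precursor, one column per run (values made up).
wide = pd.DataFrame(
    {
        "precursor ion": ["AAAK/2", "CCCK/3"],
        "run_A1": [900.0, 100.0],
        "run_A2": [1100.0, 100.0],
        "run_B1": [1000.0, 200.0],
        "run_B2": [1000.0, np.nan],  # not quantified in this run
    }
)
# Hypothetical run-to-condition assignment.
runs = {"A": ["run_A1", "run_A2"], "B": ["run_B1", "run_B2"]}

out = wide.copy()
for cond, cols in runs.items():
    linear = wide[cols]
    logged = np.log2(linear)
    out[f"Intensity_mean_{cond}"] = linear.mean(axis=1)
    out[f"Intensity_std_{cond}"] = linear.std(axis=1)
    out[f"log_Intensity_mean_{cond}"] = logged.mean(axis=1)
    out[f"log_Intensity_std_{cond}"] = logged.std(axis=1)
    out[f"CV_{cond}"] = out[f"Intensity_std_{cond}"] / out[f"Intensity_mean_{cond}"]

# Number of runs with a value, and the observed log2 fold change.
out["nr_observed"] = wide[runs["A"] + runs["B"]].notna().sum(axis=1)
out["log2_A_vs_B"] = out["log_Intensity_mean_A"] - out["log_Intensity_mean_B"]

print(out[["precursor ion", "Intensity_mean_A", "CV_A", "nr_observed", "log2_A_vs_B"]])
```

Note that `log2_A_vs_B` is a difference of means of log2 intensities, which is generally not equal to the log2 of the ratio of the linear means.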

Notes on semantics

All logarithms are base 2.
epsilon requires ground truth (the expected ratio) and measures accuracy; epsilon_precision_* requires no ground truth and measures reproducibility.
The per-run intensity columns are not a fixed part of the schema: their names and number follow the run names of the submitted experiment.
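
The difference between accuracy and precision can be made concrete with a small sketch that follows the column definitions in the catalog. The numbers are made up and the grouping via `groupby(...).transform(...)` is an assumption about how the per-species centre could be computed, not a quote of the scoring code.

```python
import pandas as pd

# Toy table: YEAST is systematically off the expected ratio but tightly clustered.
df = pd.DataFrame(
    {
        "species": ["HUMAN", "HUMAN", "HUMAN", "YEAST", "YEAST"],
        "log2_A_vs_B": [0.1, -0.1, 0.3, 1.6, 1.4],
        "log2_expectedRatio": [0.0, 0.0, 0.0, 1.0, 1.0],
    }
)

# Accuracy: deviation from the known (expected) ratio.
df["epsilon"] = df["log2_A_vs_B"] - df["log2_expectedRatio"]

# Per-species empirical centre of the observed fold changes.
grouped = df.groupby("species")["log2_A_vs_B"]
df["log2_empirical_median"] = grouped.transform("median")
df["log2_empirical_mean"] = grouped.transform("mean")

# Precision: deviation from the empirical centre, no ground truth involved.
df["epsilon_precision_median"] = df["log2_A_vs_B"] - df["log2_empirical_median"]
df["epsilon_precision_mean"] = df["log2_A_vs_B"] - df["log2_empirical_mean"]

print(df[["species", "epsilon", "epsilon_precision_median"]])
```

For the two YEAST rows, `epsilon` is about 0.6 and 0.4 (a systematic offset from the expected ratio), while `epsilon_precision_median` is about 0.1 and -0.1 (a small spread around the empirical centre).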

Reproducibility and the intermediate hash

The dataset identity is intermediate_hash, computed in proteobench/datapoint/quant_datapoint.py as the SHA1 of the intermediate DataFrame rendered with pandas.DataFrame.to_string().

Because the hash is taken over the rendered text, it depends on:

the set of columns and their order;
the row order;
the floating-point formatting used by to_string();
the run-specific intensity column names.

Any change to the scoring code, the column order, or the pandas rendering can change the hash for identical inputs. Implementers must therefore treat the column order and row order as part of the format.
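
The sensitivity to column and row order can be demonstrated with a short sketch. The function name `intermediate_hash_sketch` is hypothetical and the UTF-8 encoding of the rendered text is an assumption; only the combination of SHA1 and `to_string()` is taken from the description above.

```python
import hashlib

import pandas as pd


def intermediate_hash_sketch(df: pd.DataFrame) -> str:
    """SHA1 hex digest of the frame's text rendering (UTF-8 is an assumption)."""
    return hashlib.sha1(df.to_string().encode("utf-8")).hexdigest()


# Toy table with made-up values.
df = pd.DataFrame({"precursor ion": ["AAAK/2", "CCCK/3"], "epsilon": [0.10, -0.25]})

h_original = intermediate_hash_sketch(df)
h_copy = intermediate_hash_sketch(df.copy())                          # same content
h_col_order = intermediate_hash_sketch(df[["epsilon", "precursor ion"]])  # columns swapped
h_row_order = intermediate_hash_sketch(df.iloc[::-1])                 # rows reversed

print(h_original == h_copy)       # identical rendering, identical hash
print(h_original == h_col_order)  # different rendering
print(h_original == h_row_order)  # different rendering
```

The same values in a different column or row order render to different text, so they hash differently even though the data are equivalent.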

Module variants

HYE (Human-Yeast-Ecoli): the format described above. Species set HUMAN, YEAST, ECOLI.
PYE (Plasma, quant_lfq_DIA_ion_plasma): QuantScoresPYE.generate_intermediate() is currently a pass-through over the HYE implementation, so the intermediate columns are the same. The plasma-specific quantities (spike-in fold-change error, dynamic range, human-plasma epsilon) are computed later, in QuantDatapointPYE, not in the intermediate table. The plasma datapoint also uses max_nr_observed = 12 instead of 6.
De novo (denovo_DDA_HCD): a different schema entirely. It carries match-type and amino-acid-level columns (for example match_type, aa_matches_dn/aa_matches_gt, pep_match) rather than the quantitative columns above. This module is still in development; a detailed catalogue of its intermediate format is out of scope for this revision and should be added as the de novo module matures.

Persistence and external resources

The intermediate format table is written to result_performance.csv inside the dataset archive on the public datasets server (https://proteobench.cubimed.rub.de/datasets/), keyed by intermediate_hash. It is reloaded for the “View Single Result” tab and for regenerating plots.

Reference example

test/data/intermediate_files/result_performance_MaxQuant_20241216_120952.csv is a committed sample of the (version-1, pre-precision) intermediate for a MaxQuant DDA submission.
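
A reloaded intermediate table can be compared against the column catalog with a few lines of pandas. The sketch below is hypothetical: `check_intermediate` and `FIXED_COLUMNS` are not part of ProteoBench, the HYE species set is assumed, and an inline CSV string stands in for a real result_performance.csv.

```python
import io

import pandas as pd

# Fixed column names from the column catalog (per-run columns are not fixed).
FIXED_COLUMNS = [
    "Intensity_mean_A", "Intensity_mean_B",
    "Intensity_std_A", "Intensity_std_B",
    "log_Intensity_mean_A", "log_Intensity_mean_B",
    "log_Intensity_std_A", "log_Intensity_std_B",
    "CV_A", "CV_B", "log2_A_vs_B", "nr_observed",
    "unique", "species", "log2_expectedRatio", "epsilon",
    "log2_empirical_median", "log2_empirical_mean",
    "epsilon_precision_median", "epsilon_precision_mean",
]


def check_intermediate(df, key="precursor ion", species=("HUMAN", "YEAST", "ECOLI")):
    """Return (missing catalog columns, columns outside the catalog)."""
    expected = [key, *species, *FIXED_COLUMNS]
    missing = [c for c in expected if c not in df.columns]
    # Columns outside the catalog: per-run intensities and tool-specific extras.
    other = [c for c in df.columns if c not in expected]
    return missing, other


# Stand-in for a file on disk; a real call would pass a path to pd.read_csv.
csv_text = (
    "precursor ion,run_A1,run_B1,log2_A_vs_B,epsilon\n"
    "AAAK/2,1000.0,500.0,1.0,0.0\n"
)
table = pd.read_csv(io.StringIO(csv_text))

missing, other = check_intermediate(table)
print("missing:", missing)
print("other:", other)
```

Run against a table written before the precision columns existed, a check like this would list the `epsilon_precision_*` and `log2_empirical_*` columns as missing.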

Non-goals

This specification does not define the per-tool native output formats; those are handled by the format-specific loaders in io/parsing/parse_ion.py.
It does not define the datapoint JSON stored in the results repositories; that is a separate structure built by the QuantDatapoint* classes.