Intermediate format specification#
This document specifies the internal tabular formats that ProteoBench produces
while processing a benchmark submission. These formats are for internal use:
scoring, plotting, datapoint generation, the submission-validation layer, and the
intermediate_hash that identifies a dataset all depend on the column names,
types, and semantics defined here. Note that the intermediate format
is specific to a module, although modules can share an intermediate format where
applicable. Any module-specific explanation on this page is
purely illustrative.
Note
Status: descriptive specification of the current behaviour (format version 1, implicit). It documents what the code produces today. Proposed changes (an explicit version field, a canonical serialisation for hashing) are described in the “Reproducibility and the intermediate hash” and “Versioning” sections below.
Scope#
There are two distinct tables:
| Artifact | Produced by | Persisted | Consumed by |
|---|---|---|---|
| Standard format (pre-scoring) | | No (in-memory) | Scoring; the submission-validation layer |
| Intermediate format (post-scoring) | | Yes, as | Plotting, datapoint metrics, the “View Single Result” table |
The intermediate format is the primary subject of this specification. The standard format is documented as the declared input format that any module or tool parser must produce.
Standard format#
Produced by convert_to_standard_format() in
proteobench/io/parsing/parse_settings.py. It is a long table with one row per
quantified feature per run (for the precursor quantification module, one row per (precursor, run) pair): a precursor
that was quantified in six runs yields six rows. The table is not projected to a fixed column set, so unmapped input columns may
also be present; only the columns below are relied upon downstream. This column set is specific
to the quantification modules and may differ for other modules.
| Column | Type | Required | Meaning |
|---|---|---|---|
| | str | yes | Protein or protein-group identifier(s). Multiple proteins may be joined by |
| | str | yes | Plain (unmodified) peptide sequence. |
| | int | yes | Precursor charge. |
| | str | conditional | ProForma-encoded modified sequence. Present only when the tool TOML defines a |
| | str | conditional | |
| | str | conditional | Equal to |
| | str | yes | Cleaned run name (extensions and known suffixes stripped by |
| | float | yes | Quantitative intensity for the (precursor, run). Rows with |
| | str | yes | Condition label, |
| | bool | yes | Result of matching |
| species flags | bool | yes | One boolean column per species in the module |
| | bool | yes | |
| run dummy columns | bool | yes | One boolean column per distinct |
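The long layout and the run dummy columns can be illustrated with a minimal pandas sketch. The column names `precursor`, `run`, and `intensity`, and all values, are hypothetical placeholders chosen for this example; the real names come from the module's parse settings and are not reproduced here.

```python
import pandas as pd

# Hypothetical long table: one row per (precursor, run) pair.
# Column names and values are illustrative only, not ProteoBench's real ones.
long_table = pd.DataFrame(
    {
        "precursor": ["PEPTIDEK|Z=2"] * 3 + ["ELVISK|Z=3"] * 2,
        "run": ["A_1", "A_2", "B_1", "A_1", "B_1"],
        "intensity": [1000.0, 1100.0, 500.0, 200.0, 410.0],
    }
)

# One boolean dummy column per distinct run name.
run_dummies = pd.get_dummies(long_table["run"], dtype=bool)
standard_like = pd.concat([long_table, run_dummies], axis=1)

# A precursor quantified in three runs contributes three rows.
rows_per_precursor = standard_like.groupby("precursor").size()
print(rows_per_precursor)
```

The same pattern scales to the six-run case described above: a precursor quantified in six runs would simply contribute six rows.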
Intermediate format#
The intermediate format is the table from which all metrics of a given module are calculated. It should contain every value needed for metric calculation in a standardized form. It is not limited to the columns listed here: it can contain additional columns, including tool-specific ones. At a minimum, however, the columns used for metric calculation must be defined (i.e., documented) and “normalized” (i.e., their values must be comparable between submissions).
As an example, the intermediate format for the quantification modules is produced by QuantScoresHYE.generate_intermediate() in
proteobench/score/quantscoresHYE.py. It has one row per precursor (or peptidoform).
Only single-species precursors are retained: compute_epsilon() keeps rows
where exactly one species flag is set (unique == 1), so multi-species
precursors do not appear.
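The single-species filter described above can be sketched as follows. This is an illustration rather than the code in compute_epsilon(): the `precursor` and `species` column names are hypothetical, while the species flags follow the HYE species set and `unique` is taken to be the sum of those flags.

```python
import pandas as pd

SPECIES = ["HUMAN", "YEAST", "ECOLI"]  # HYE species set

# Hypothetical input: "precursor" is a placeholder key column.
df = pd.DataFrame(
    {
        "precursor": ["p1", "p2", "p3"],
        "HUMAN": [True, True, False],
        "YEAST": [False, True, False],
        "ECOLI": [False, False, True],
    }
)

# "unique" counts how many species flags are set for each row.
df["unique"] = df[SPECIES].sum(axis=1)

# Keep only rows that belong to exactly one species.
single = df[df["unique"] == 1].copy()

# Derive the single species name from the flag that is set
# ("species" is an assumed column name for this sketch).
single["species"] = single[SPECIES].astype(int).idxmax(axis=1)
print(single[["precursor", "unique", "species"]])
```

Here `p2` is flagged for both HUMAN and YEAST, so it is dropped, which mirrors the statement that multi-species precursors do not appear in the intermediate table.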
Column catalog#
| Column | Type | Meaning |
|---|---|---|
| | str | Precursor key. Name depends on the module |
| | float | Mean linear intensity across the runs of condition A / B. |
| | float | Standard deviation of linear intensity per condition. |
| | float | Mean of the log2 intensities per condition. |
| | float | Standard deviation of log2 intensities per condition. |
| | float | Coefficient of variation per condition, |
| | float | |
| per-run intensity columns | float | One column per distinct run name (cleaned |
| | int | Number of runs in which the feature (i.e. precursor, peptidoform) was quantified. |
| species flags | bool | One boolean per species (HYE: |
| | int | Sum of the species flags. Always |
| | str | The single species name for the precursor. |
| | float | |
| | float | Accuracy: |
| | float | Per-species empirical centre of |
| | float | Precision: |
Notes on semantics#
All logarithms are base 2.
epsilon requires ground truth (the expected ratio) and measures accuracy; epsilon_precision_* requires no ground truth and measures reproducibility.
The per-run intensity columns are not a fixed part of the schema: their names and number follow the run names of the submitted experiment.
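The difference between the accuracy and precision quantities can be made concrete with a small sketch. Everything here is an assumption for illustration: the run and column names are hypothetical, the log2 ratio is taken as the difference of the per-condition log2 means, accuracy as that ratio minus an assumed expected log2 ratio, precision as the ratio minus the per-species median, and the coefficient of variation as standard deviation over mean. The exact definitions used by ProteoBench should be checked against the scoring code.

```python
import numpy as np
import pandas as pd

# Hypothetical run names for conditions A and B.
runs_a = ["A_1", "A_2", "A_3"]
runs_b = ["B_1", "B_2", "B_3"]

wide = pd.DataFrame(
    {
        "A_1": [100.0, 400.0], "A_2": [110.0, 380.0], "A_3": [90.0, 420.0],
        "B_1": [50.0, 400.0], "B_2": [55.0, 410.0], "B_3": [45.0, 390.0],
        "species": ["YEAST", "HUMAN"],
    }
)

# Assumed ground truth: expected log2(A/B) per species.
expected_log2 = {"YEAST": 1.0, "HUMAN": 0.0}


def summarise(df: pd.DataFrame, runs: list) -> pd.DataFrame:
    """Per-condition summary statistics (pandas default sample std, ddof=1)."""
    linear = df[runs]
    logged = np.log2(linear)  # all logarithms are base 2
    return pd.DataFrame(
        {
            "mean": linear.mean(axis=1),
            "std": linear.std(axis=1),
            "log2_mean": logged.mean(axis=1),
            "log2_std": logged.std(axis=1),
            "cv": linear.std(axis=1) / linear.mean(axis=1),
        }
    )


a = summarise(wide, runs_a)
b = summarise(wide, runs_b)

log2_ratio = a["log2_mean"] - b["log2_mean"]

# Accuracy-style quantity: needs the expected ratio (ground truth).
epsilon = log2_ratio - wide["species"].map(expected_log2)

# Precision-style quantity: needs only the data (per-species empirical centre).
centre = log2_ratio.groupby(wide["species"]).transform("median")
epsilon_precision = log2_ratio - centre
```

With only one precursor per species in this toy table, every precursor coincides with its species centre, so the precision-style values are all zero; a realistic table would show a spread around each centre.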
Reproducibility and the intermediate hash#
The dataset identity is intermediate_hash, computed in
proteobench/datapoint/quant_datapoint.py as the SHA1 of the intermediate
DataFrame rendered with pandas.DataFrame.to_string().
Because the hash is taken over the rendered text, it depends on:
the set of columns and their order;
the row order;
the floating-point formatting used by to_string();
the run-specific intensity column names.
Any change to the scoring code, the column order, or the pandas rendering can change the hash for identical inputs. Implementers must therefore treat the column order and row order as part of the format.
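The sensitivity of a text-based hash to ordering can be demonstrated with a short sketch. It hashes the to_string() rendering with SHA-1 as described above; the UTF-8 encoding step, the helper name `intermediate_hash_sketch`, and the toy columns are assumptions of this example and may differ from the code in quant_datapoint.py.

```python
import hashlib

import pandas as pd


def intermediate_hash_sketch(df: pd.DataFrame) -> str:
    """Illustrative helper: SHA-1 over the to_string() rendering.

    The UTF-8 encoding is an assumption of this sketch.
    """
    rendered = df.to_string()
    return hashlib.sha1(rendered.encode("utf-8")).hexdigest()


# Toy table with hypothetical column names.
df = pd.DataFrame({"precursor": ["p1", "p2"], "epsilon": [0.1, -0.2]})

h_original = intermediate_hash_sketch(df)
h_copy = intermediate_hash_sketch(df.copy())                       # same content
h_columns = intermediate_hash_sketch(df[["epsilon", "precursor"]])  # columns swapped
h_rows = intermediate_hash_sketch(df.iloc[::-1])                    # rows reversed

print(h_original == h_copy)     # identical rendering, identical hash
print(h_original == h_columns)  # column order changes the rendering
print(h_original == h_rows)     # row order changes the rendering
```

The data in all four tables is the same, yet only the unmodified copy reproduces the original hash, which is why column order and row order have to be treated as part of the format.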
Module variants#
HYE (Human-Yeast-Ecoli): the format described above. Species set HUMAN, YEAST, ECOLI.
PYE (Plasma, quant_lfq_DIA_ion_plasma): QuantScoresPYE.generate_intermediate() is currently a pass-through over the HYE implementation, so the intermediate columns are the same. The plasma-specific quantities (spike-in fold-change error, dynamic range, human-plasma epsilon) are computed later, in QuantDatapointPYE, not in the intermediate table. The plasma datapoint also uses max_nr_observed = 12 instead of 6.
De novo (denovo_DDA_HCD): a different schema entirely, which carries match-type and amino-acid-level columns (for example match_type, aa_matches_dn / aa_matches_gt, pep_match) rather than the quantitative columns above. This module is still in development, and the full documentation of its intermediate format will be provided later. A detailed catalogue for the de novo intermediate is out of scope for this revision and should be added alongside the de novo module’s maturation.
Persistence and external resources#
The intermediate format table is written to result_performance.csv inside the dataset
archive on the public datasets server
(https://proteobench.cubimed.rub.de/datasets/), keyed by intermediate_hash.
It is reloaded for the “View Single Result” tab and for regenerating plots.
Reference example#
test/data/intermediate_files/result_performance_MaxQuant_20241216_120952.csv
is a committed sample of the (version-1, pre-precision) intermediate for a
MaxQuant DDA submission.
Non-goals#
This specification does not define the per-tool native output formats; those are handled by the format-specific loaders in io/parsing/parse_ion.py.
It does not define the datapoint JSON stored in the results repositories; that is a separate structure built by the QuantDatapoint* classes.