Submission validation
ProteoBench validates an uploaded submission before the public datapoint is
created. The validation layer checks that the standardized results and the
parsed parameters are internally consistent and consistent with the module
reference database, and returns a structured ValidationReport.
Validation is non-blocking. Every finding, including error-severity
ones, is shown to the submitter and embedded in the pull-request description for
the reviewers, but submission always proceeds. Severity controls only display
prominence and inclusion in the pull-request summary. It does not gate the
submission flow.
The layer is framework-agnostic and registry-driven. Each module maps to a validation profile (a named, ordered set of checks). Adding a new module of an existing category is configuration-only. Adding a genuinely new category only requires registering a new profile. The orchestrator never needs to change.
The code lives in the proteobench.validation package. The Streamlit glue
lives in webinterface/pages/base_pages/utils/validation_ui.py.
Package layout
| File | Contents |
|---|---|
| checks.py | Pure, individually testable check functions (protein IDs, charge range, peptide length, enzyme, modifications, maximum modifications, mass tolerances, PSM FDR, run consistency). |
| | Helpers to split protein groups, extract identifiers, and skip decoys and contaminants. |
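The protein-group helpers listed above can be pictured with a minimal sketch. It is not the package's code: the function names, the ";" group separator, the UniProt-style db|ACCESSION|NAME layout, and the decoy and contaminant prefixes are all assumptions chosen for illustration.

```python
from typing import List

# Assumed conventions (hypothetical, not taken from ProteoBench).
GROUP_SEPARATOR = ";"
SKIP_PREFIXES = ("rev_", "decoy_", "con_")  # compared case-insensitively


def split_protein_group(group: str) -> List[str]:
    """Split a protein-group string into its non-empty members."""
    return [m.strip() for m in group.split(GROUP_SEPARATOR) if m.strip()]


def is_decoy_or_contaminant(entry: str) -> bool:
    """Return True when the entry carries a decoy or contaminant prefix."""
    return entry.lower().startswith(SKIP_PREFIXES)


def extract_identifier(entry: str) -> str:
    """Return the accession from 'db|ACCESSION|NAME', else the entry itself."""
    parts = entry.split("|")
    return parts[1] if len(parts) >= 3 else entry


def identifiers_in_group(group: str) -> List[str]:
    """Identifiers of all target (non-decoy, non-contaminant) members."""
    return [
        extract_identifier(m)
        for m in split_protein_group(group)
        if not is_decoy_or_contaminant(m)
    ]
```

Keeping each step as its own small function makes it possible to test splitting, filtering, and identifier extraction separately.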
Data flow
    module_settings.toml + parser   --> ModuleValidationConfig.from_parse_settings(...)
    reference FASTA                 --> FastaReference.from_url(...)
                                         |
    standardized DataFrame + params -----+--> validate_submission(...)
                                              |      |
                                              |      +-- resolve profile (registry)
                                              |      +-- build ValidationContext
                                              |      +-- run each Check (fault-tolerant)
                                              v
                                         ValidationReport --> UI display + PR summary
The core validator performs no I/O. Any reference data (a FASTA, a ground-truth table) is supplied through the arguments. The front end is responsible for obtaining the standardized DataFrame and the reference, which is what the Streamlit glue does.
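The flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: Issue, Check, Context, PROFILES, and run_profile are simplified stand-ins, a plain dict replaces the standardized DataFrame, and the 1 to 6 charge range is invented. Only behavior stated in this document is modeled: resolve a profile, run each check in order, and turn an unexpected exception into a check_failed warning.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Issue:
    code: str
    severity: str  # "error", "warning", or "info"
    message: str
    check: str


@dataclass
class Context:
    standard_df: Any  # a dict of columns stands in for the DataFrame here
    parameters: Any
    config: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Check:
    name: str
    func: Callable[[Context], List[Issue]]
    description: str = ""


# Hypothetical registry: profile name -> ordered list of checks.
PROFILES: Dict[str, List[Check]] = {}


def run_profile(profile: str, ctx: Context) -> List[Issue]:
    """Run every check of a profile; never raise, never block."""
    if profile not in PROFILES:
        return [Issue("unknown_validation_profile", "warning",
                      f"No profile named {profile!r} is registered.", "orchestrator")]
    issues: List[Issue] = []
    for check in PROFILES[profile]:
        try:
            issues.extend(check.func(ctx))
        except Exception as exc:  # a crashing check must not stop the others
            issues.append(Issue("check_failed", "warning",
                                f"{check.name} raised {exc!r}.", check.name))
    return issues


def charge_check(ctx: Context) -> List[Issue]:
    """Toy check: flag charges outside an assumed 1..6 range."""
    bad = [z for z in ctx.standard_df["charge"] if not 1 <= z <= 6]
    if bad:
        return [Issue("charge_out_of_range", "error",
                      f"{len(bad)} bad charge(s).", "charge_range")]
    return []


def broken_check(ctx: Context) -> List[Issue]:
    """Toy check that always crashes, to show the fault tolerance."""
    raise KeyError("missing column")


PROFILES["demo"] = [Check("charge_range", charge_check), Check("broken", broken_check)]
issues = run_profile("demo", Context({"charge": [2, 3, 9]}, parameters=None))
codes = [i.code for i in issues]
```

Because all data arrives through the context, the sketch needs no file or network access, which mirrors the no-I/O rule for the core validator.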
Built-in profiles
- quant_lfq: Runs, in order: protein_ids (against the reference FASTA), charge_range, peptide_length, enzyme, modifications, max_modifications, mass_tolerances, fdr_psm, and run_consistency. protein_ids, charge_range, and peptide_length default to error severity; the rest default to warning.
- denovo: Runs run_consistency plus a denovo_pending informational placeholder. De novo uses a different standardized schema and a ground-truth table rather than a FASTA, so content checks are a documented to-do in profiles.py.
Checks are reusable across profiles. For example, run_consistency is shared
by both built-in profiles.
Integrating validation for a new module
Existing category (quantification)
For a quantification module no code is required. Two configuration steps are enough:
1. Add a reference database to the module’s module_settings.toml (beside [species_expected_ratio] and [general]):

        [reference_database]
        "fasta_url" = "https://proteobench.cubimed.rub.de/fasta/ProteoBenchFASTA_MixedSpecies_HYE.zip"
        # "fasta_filename" = "optional_member_name_inside_the_archive.fasta"

   If [reference_database] is absent, the protein-identifier check is skipped with an informational message and the other checks still run.

2. The profile resolves automatically to quant_lfq from the module’s parser class (ParseSettingsQuant), so no profile declaration is needed. You may pin it explicitly if you prefer:

        [validation]
        "profile" = "quant_lfq"
The orchestrator and the submission tab already run the resolved profile. The
profile is resolved by ModuleValidationConfig.from_parse_settings in this
order of precedence:
1. an explicit [validation].profile key in module_settings.toml;
2. inference from the parser class via the MODULE_TO_CLASS registry (ParseSettingsQuant -> quant_lfq, ParseSettingsDeNovo -> denovo);
3. the DEFAULT_VALIDATION_PROFILE fallback (quant_lfq).
An unregistered profile name produces a single unknown_validation_profile
warning and runs nothing. It never blocks.
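The precedence order can be illustrated with a small, self-contained sketch. The function name resolve_profile, the dictionary PROFILE_BY_PARSER_CLASS, and the use of class names as plain strings are assumptions made for the example; the real logic lives in ModuleValidationConfig.from_parse_settings and may differ in detail.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-ins for the real lookup tables.
PROFILE_BY_PARSER_CLASS: Dict[str, str] = {
    "ParseSettingsQuant": "quant_lfq",
    "ParseSettingsDeNovo": "denovo",
}
DEFAULT_PROFILE = "quant_lfq"


def resolve_profile(settings: Dict[str, Any], parser_class: Optional[str]) -> str:
    """Apply the three-step precedence: explicit key, parser class, default."""
    # 1. An explicit [validation].profile key wins.
    explicit = settings.get("validation", {}).get("profile")
    if explicit:
        return explicit
    # 2. Otherwise infer the profile from the parser class.
    if parser_class in PROFILE_BY_PARSER_CLASS:
        return PROFILE_BY_PARSER_CLASS[parser_class]
    # 3. Otherwise fall back to the default profile.
    return DEFAULT_PROFILE
```

Here settings stands for the parsed contents of module_settings.toml, so a module that pins a profile under [validation] bypasses the inference step.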
New category
A genuinely new category of module needs a profile of its own:
1. Write the checks it needs (see below), or reuse existing ones.
2. Register a ValidationProfile in profiles.py (or, from third-party code, via register_profile).
3. Point the module at it with [validation] profile = "<name>" in module_settings.toml. If you want the profile to be inferred from the parser class instead, add an entry to _PROFILE_BY_PARSER_CLASS in config.py.
The orchestrator (validate_submission) is generic and does not change.
Extending and maintaining the checks
Adding a check
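A check is a pure function that returns a list[ValidationIssue]. A profile
calls it through a callable with the signature ctx -> list[ValidationIssue],
which is either the function itself or a small adapter around it. Add the
function to proteobench/validation/checks.py, keeping it independently
unit-testable, then register it in the relevant profile in profiles.py.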
    # proteobench/validation/checks.py
    from typing import List

    from proteobench.validation.context import ValidationContext
    from proteobench.validation.report import ValidationReport, ValidationIssue


    def check_my_constraint(standard_df, parameters, config) -> List[ValidationIssue]:
        """Check some property of the standardized results against a parameter."""
        report = ValidationReport()

        # Parameter-dependent checks must self-report when the value was not
        # parsed, and never crash.
        limit = getattr(parameters, "my_limit", None)
        if limit is None:
            report.add_warning(
                "my_limit_absent",
                "The parameter 'my_limit' could not be parsed; the check was skipped.",
                "my_constraint",
            )
            return report.issues

        offending = standard_df[standard_df["some_column"] > limit]
        if not offending.empty:
            report.add_error(
                "my_constraint_violated",
                f"{len(offending)} row(s) exceed my_limit ({limit}).",
                "my_constraint",
                observed=int(offending["some_column"].max()),
                expected=f"<= {limit}",
                examples=offending["some_column"].head(10).tolist(),
            )
        return report.issues
Then add it to a profile. Trivial checks that only forward context fields are
written as a lambda adapter; checks that need orchestration logic (such as
deciding whether a reference is available) are written as named functions in
profiles.py.
    # proteobench/validation/profiles.py
    QUANT_LFQ_PROFILE = ValidationProfile(
        name="quant_lfq",
        checks=[
            # ... existing checks ...
            Check(
                "my_constraint",
                lambda ctx: check_my_constraint(ctx.standard_df, ctx.parameters, ctx.config),
                "What this check verifies.",
            ),
        ],
    )
Guidelines for checks:
- Use ValidationContext fields: ctx.standard_df, ctx.parameters, ctx.config, ctx.fasta, ctx.input_format.
- Choose severity by intent only. error and warning differ in display prominence and pull-request inclusion. Neither blocks submission.
- A parameter-dependent check should emit a warning when the constraint was not parsed, rather than failing. The orchestrator also wraps every check so an unexpected exception becomes a check_failed warning, but checks should not rely on that.
Registering a profile
    from proteobench.validation import Check, ValidationProfile, register_profile

    register_profile(
        ValidationProfile(
            name="my_module",
            description="Checks for the new module category.",
            checks=[Check("my_check", my_check_func, "what it does")],
        )
    )
The registry helpers are register_profile (with overwrite=False by
default), unregister_profile, get_profile, and available_profiles.
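To make the role of the four helpers concrete, here is a toy registry with the same four operations. It is a stand-in, not the package's code: the class name ProfileRegistry is invented, and three behaviors are assumptions rather than documented facts, namely that a duplicate name raises ValueError unless overwrite=True, that removing an unknown name is a no-op, and that looking up an unknown name returns None.

```python
from typing import Dict, List, Optional


class ProfileRegistry:
    """Toy name -> profile registry (a stand-in, not the package's code)."""

    def __init__(self) -> None:
        self._profiles: Dict[str, object] = {}

    def register(self, name: str, profile: object, overwrite: bool = False) -> None:
        # Assumption: a duplicate name is rejected unless overwrite=True.
        if name in self._profiles and not overwrite:
            raise ValueError(f"profile {name!r} is already registered")
        self._profiles[name] = profile

    def unregister(self, name: str) -> None:
        # Assumption: removing an unknown name is a silent no-op.
        self._profiles.pop(name, None)

    def get(self, name: str) -> Optional[object]:
        # Assumption: an unknown name yields None rather than an exception.
        return self._profiles.get(name)

    def available(self) -> List[str]:
        """Registered profile names in alphabetical order."""
        return sorted(self._profiles)


registry = ProfileRegistry()
registry.register("my_module", ["check_a"])
try:
    registry.register("my_module", ["check_b"])  # same name, no overwrite
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
registry.register("my_module", ["check_b"], overwrite=True)
```

Defaulting overwrite to False protects a built-in profile from being replaced by accident when third-party code registers a profile under the same name.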
The report object
validate_submission returns a ValidationReport. Useful members:
- report.errors / report.warnings / report.infos: issues by severity.
- report.has_errors and report.passed: informational only. The Streamlit flow does not gate on them.
- report.summary(): a compact Markdown summary embedded in the pull-request description.
- report.raise_if_errors(): optional path for programmatic callers that prefer an exception (SubmissionValidationError).
Each ValidationIssue carries a machine-readable code, a severity, a
human-readable message, the originating check name, and optional
field, observed, expected, and examples values.
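The shape of the report can be sketched with two small dataclasses. SketchIssue, SketchReport, and SketchValidationError are hypothetical names used only here; the fields follow the description above, while the Markdown layout produced by summary() is invented and should not be read as the real output format.

```python
from dataclasses import dataclass, field as dc_field
from typing import Any, List, Optional


@dataclass
class SketchIssue:
    code: str      # machine-readable identifier
    severity: str  # "error", "warning", or "info"
    message: str   # human-readable explanation
    check: str     # name of the originating check
    field: Optional[str] = None
    observed: Optional[Any] = None
    expected: Optional[Any] = None
    examples: Optional[List[Any]] = None


class SketchValidationError(Exception):
    """Raised by raise_if_errors() in this sketch."""


@dataclass
class SketchReport:
    issues: List[SketchIssue] = dc_field(default_factory=list)

    def _by_severity(self, severity: str) -> List[SketchIssue]:
        return [i for i in self.issues if i.severity == severity]

    @property
    def errors(self) -> List[SketchIssue]:
        return self._by_severity("error")

    @property
    def warnings(self) -> List[SketchIssue]:
        return self._by_severity("warning")

    @property
    def infos(self) -> List[SketchIssue]:
        return self._by_severity("info")

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)

    def summary(self) -> str:
        """Markdown summary; the exact layout here is invented."""
        lines = [f"**Validation:** {len(self.errors)} error(s), "
                 f"{len(self.warnings)} warning(s)"]
        lines += [f"- `{i.code}` ({i.severity}): {i.message}"
                  for i in self.errors + self.warnings]
        return "\n".join(lines)

    def raise_if_errors(self) -> None:
        """Opt-in strict mode for programmatic callers."""
        if self.has_errors:
            raise SketchValidationError(self.summary())


report = SketchReport([
    SketchIssue("charge_out_of_range", "error",
                "3 precursors exceed the charge range.", "charge_range"),
    SketchIssue("my_limit_absent", "warning",
                "my_limit could not be parsed.", "my_constraint"),
])
```

Keeping raise_if_errors() as an explicit call, rather than raising inside the validator, is what lets the web flow stay non-blocking while scripts can still opt into strict behavior.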
Web integration
Validation runs inside submit_to_repository in
webinterface/pages/base_pages/tabs/tab6_submit_results.py, after the
confirmation button and before the pull request is created. The standardized
DataFrame is re-derived by rerunning the existing parser on the input DataFrame
already in session state, so no tool-specific parsing logic is duplicated. The
glue functions are in
webinterface/pages/base_pages/utils/validation_ui.py:
- run_submission_validation(variables, ionmodule, user_input, params) runs the full flow (re-parse, load and cache the FASTA, run the checks) and returns a ValidationReport. It is fault-tolerant: any infrastructure problem (missing input, parser failure, FASTA download failure) becomes a warning.
- render_validation_report(report) displays errors and warnings, with info items inside a collapsed expander.
The findings are rendered in the UI and appended to the pull-request description
through report.summary(). None of them block the pull request. The local
Tab 2 upload path is unaffected.
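The two properties of the glue layer, caching the reference and converting infrastructure failures into warnings, can be sketched without Streamlit. The example below swaps in functools.lru_cache for whatever caching the web interface actually uses, and fake_download, load_reference, run_validation_safely, the URLs, and the issue codes are all hypothetical.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

download_calls = 0  # counts simulated network fetches


def fake_download(url: str) -> Dict[str, str]:
    """Stand-in for a network fetch; fails for one URL to simulate an outage."""
    global download_calls
    download_calls += 1
    if "unreachable" in url:
        raise ConnectionError(f"cannot reach {url}")
    return {"P12345": "MKTAYIAK"}  # toy accession -> toy sequence


@lru_cache(maxsize=4)
def load_reference(url: str) -> Tuple[str, ...]:
    """Fetch once per URL; later calls reuse the cached identifiers."""
    return tuple(fake_download(url))


def run_validation_safely(url: str) -> List[Dict[str, str]]:
    """Return findings; infrastructure failures become warnings, not exceptions."""
    try:
        reference = load_reference(url)
    except Exception as exc:
        return [{"code": "reference_unavailable", "severity": "warning",
                 "message": str(exc)}]
    return [{"code": "reference_loaded", "severity": "info",
             "message": f"{len(reference)} protein(s)"}]


first = run_validation_safely("https://example.org/reference.fasta")
second = run_validation_safely("https://example.org/reference.fasta")
failed = run_validation_safely("https://unreachable.example.org/reference.fasta")
```

The second call is served from the cache, and the failed download surfaces as a warning that the submitter and reviewers can see, so the submission itself is never interrupted.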
Testing
Validation is covered by test/test_validation.py, which tests FASTA
parsing, protein-identifier matching, the individual checks, the profile
registry (resolution, custom-profile registration, unknown-profile handling, de
novo routing), report serialization, and integration through the real MaxQuant
and Sage parsers. A small reference FASTA fixture lives at
test/data/validation/ProteoBench_validation_reference.fasta.
When you add a check, add unit tests for it directly (it is a pure function). When you add a profile, add a test that the profile resolves for the module and that its checks run.
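Because a check is a pure function, its tests need no fixtures or mocking. The sketch below shows the pattern in pytest style on a toy check; check_peptide_length, its max_length parameter, and the issue codes are hypothetical and use plain lists and dicts instead of the package's DataFrame and ValidationIssue types.

```python
from typing import Dict, List, Optional


def check_peptide_length(peptides: List[str],
                         max_length: Optional[int]) -> List[Dict[str, str]]:
    """Toy check: flag peptides longer than max_length (a hypothetical parameter)."""
    if max_length is None:
        # Self-report a skipped check instead of crashing.
        return [{"code": "max_length_absent", "severity": "warning"}]
    too_long = [p for p in peptides if len(p) > max_length]
    if too_long:
        return [{"code": "peptide_too_long", "severity": "error"}]
    return []


def test_skips_with_warning_when_parameter_missing():
    issues = check_peptide_length(["PEPTIDEK"], None)
    assert [i["code"] for i in issues] == ["max_length_absent"]
    assert issues[0]["severity"] == "warning"


def test_flags_violation():
    issues = check_peptide_length(["PEPTIDEK", "AAAAAAAAAAAAK"], 10)
    assert [i["code"] for i in issues] == ["peptide_too_long"]


def test_passes_clean_input():
    assert check_peptide_length(["PEPTIDEK"], 10) == []
```

The three cases cover the outcomes every parameter-dependent check should have: parameter missing, constraint violated, and clean input.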
Documented limitations
Some checks are intentionally limited because the required reference data is not available:
- Full enzyme-specificity checks need reference protein sequences. Only internal K/R counting is done for the trypsin family, as a heuristic.
- Cross-tool modification normalization is not implemented. Tools encode modifications differently (human-readable names, UniMod accessions, raw mass dictionaries); only matching human-readable names against the declared modifications is attempted.
- Run-level matching (raw file, sample, experiment) is not possible because ProteoBenchParameters does not expose those fields. Only software identity is compared in run_consistency.
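The trypsin-family heuristic in the first limitation amounts to counting lysine and arginine residues that are not at the C-terminus. A minimal sketch, assuming plain one-letter sequences without modification annotations and ignoring the proline rule; the function names and the comparison against an allowed number of missed cleavages are hypothetical:

```python
def count_internal_kr(peptide: str) -> int:
    """Count K/R residues before the C-terminal position."""
    return sum(1 for aa in peptide[:-1].upper() if aa in "KR")


def exceeds_missed_cleavages(peptide: str, allowed: int) -> bool:
    """Hypothetical use: compare the count with a declared maximum."""
    return count_internal_kr(peptide) > allowed
```

For example, count_internal_kr("AKRPEPTIDER") returns 2, because the terminal R is excluded. Without the protein sequence the heuristic cannot tell whether the peptide's termini actually sit at cleavage sites, which is why it remains a heuristic.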