Submission validation#

ProteoBench validates an uploaded submission before the public datapoint is created. The validation layer checks that the standardized results and the parsed parameters are internally consistent and consistent with the module reference database, and returns a structured ValidationReport.

Validation is non-blocking. Every finding, including error-severity ones, is shown to the submitter and embedded in the pull-request description for the reviewers, but submission always proceeds. Severity controls only display prominence and inclusion in the pull-request summary. It does not gate the submission flow.

The layer is framework-agnostic and registry-driven. Each module maps to a validation profile (a named, ordered set of checks). Adding a new module of an existing category is configuration-only. Adding a genuinely new category only requires registering a new profile. The orchestrator never needs to change.

The code lives in the proteobench.validation package. The Streamlit glue lives in webinterface/pages/base_pages/utils/validation_ui.py.

Package layout#

File

Contents

report.py

Severity enum (error / warning / info), ValidationIssue, and ValidationReport (issue collection plus has_errors, passed, summary(), raise_if_errors()).

context.py

ValidationContext: bundles everything a check might need (standard_df, parameters, config, fasta, input_format, a generic reference, and an extras dict) so every check has the uniform signature ctx -> list[ValidationIssue].

config.py

ModuleValidationConfig: column names, contaminant flag, decoy prefixes, reference FASTA location, and the resolved validation_profile. Built with from_parse_settings(...).

checks.py

Pure, individually testable check functions (protein IDs, charge range, peptide length, enzyme, modifications, maximum modifications, mass tolerances, PSM FDR, run consistency).

profiles.py

Check, ValidationProfile, the profile registry (register_profile / unregister_profile / get_profile / available_profiles), and the built-in quant_lfq and denovo profiles. This is the extensibility surface.

validator.py

validate_submission(...): resolves the profile, builds the context, runs the profile’s checks, and returns the report. Each check is fault-tolerant: an unexpected exception becomes a warning.

fasta.py

FastaReference: builds the expected protein set from FASTA text, a path, bytes, a zip / gzip, or a URL.

protein_ids.py

Helpers to split protein groups, extract identifiers, and skip decoys and contaminants.

exceptions.py

SubmissionValidationError, which wraps a report for programmatic callers that opt in to raising.

Data flow#

module_settings.toml + parser  --> ModuleValidationConfig.from_parse_settings(...)
reference FASTA                --> FastaReference.from_url(...)
                                      |
standardized DataFrame + params -----+--> validate_submission(...)
                                      |        |
                                      |        +-- resolve profile (registry)
                                      |        +-- build ValidationContext
                                      |        +-- run each Check (fault-tolerant)
                                      v
                                 ValidationReport  --> UI display + PR summary

The core validator performs no I/O. Any reference data (a FASTA, a ground-truth table) is supplied through the arguments. The front end is responsible for obtaining the standardized DataFrame and the reference, which is what the Streamlit glue does.

Built-in profiles#

quant_lfq

Runs, in order: protein_ids (against the reference FASTA), charge_range, peptide_length, enzyme, modifications, max_modifications, mass_tolerances, fdr_psm, and run_consistency. protein_ids, charge_range, and peptide_length default to error severity; the rest default to warning.

denovo

Runs run_consistency plus a denovo_pending informational placeholder. De novo uses a different standardized schema and a ground-truth table rather than a FASTA, so content checks are a documented to-do in profiles.py.

Checks are reusable across profiles. For example, run_consistency is shared by both built-in profiles.

Integrating validation for a new module#

Existing category (quantification)#

For a quantification module no code is required. Two configuration steps are enough:

  1. Add a reference database to the module’s module_settings.toml (beside [species_expected_ratio] and [general]):

    [reference_database]
    "fasta_url" = "https://proteobench.cubimed.rub.de/fasta/ProteoBenchFASTA_MixedSpecies_HYE.zip"
    # "fasta_filename" = "optional_member_name_inside_the_archive.fasta"
    

    If [reference_database] is absent, the protein-identifier check is skipped with an informational message and the other checks still run.

  2. The profile resolves automatically to quant_lfq from the module’s parser class (ParseSettingsQuant), so no profile declaration is needed. You may pin it explicitly if you prefer:

    [validation]
    "profile" = "quant_lfq"
    

The orchestrator and the submission tab already run the resolved profile. The profile is resolved by ModuleValidationConfig.from_parse_settings in this order of precedence:

  1. an explicit [validation].profile key in module_settings.toml;

  2. inference from the parser class via the MODULE_TO_CLASS registry (ParseSettingsQuant -> quant_lfq, ParseSettingsDeNovo -> denovo);

  3. the DEFAULT_VALIDATION_PROFILE fallback (quant_lfq).

An unregistered profile name produces a single unknown_validation_profile warning and runs nothing. It never blocks.

New category#

A genuinely new category of module needs a profile of its own:

  1. Write the checks it needs (see below), or reuse existing ones.

  2. Register a ValidationProfile in profiles.py (or, from third-party code, via register_profile).

  3. Point the module at it with [validation] profile = "<name>" in module_settings.toml. If you want the profile to be inferred from the parser class instead, add an entry to _PROFILE_BY_PARSER_CLASS in config.py.

The orchestrator (validate_submission) is generic and does not change.

Extending and maintaining the checks#

Adding a check#

A check is a pure function with the signature ctx -> list[ValidationIssue]. Add it to proteobench/validation/checks.py, keeping it independently unit-testable, then register it in the relevant profile in profiles.py.

# proteobench/validation/checks.py
from proteobench.validation.context import ValidationContext
from proteobench.validation.report import ValidationReport, ValidationIssue
from typing import List


def check_my_constraint(standard_df, parameters, config) -> List[ValidationIssue]:
    """Check some property of the standardized results against a parameter."""
    report = ValidationReport()

    # Parameter-dependent checks must self-report when the value was not
    # parsed, and never crash.
    limit = getattr(parameters, "my_limit", None)
    if limit is None:
        report.add_warning(
            "my_limit_absent",
            "The parameter 'my_limit' could not be parsed; the check was skipped.",
            "my_constraint",
        )
        return report.issues

    offending = standard_df[standard_df["some_column"] > limit]
    if not offending.empty:
        report.add_error(
            "my_constraint_violated",
            f"{len(offending)} row(s) exceed my_limit ({limit}).",
            "my_constraint",
            observed=int(offending["some_column"].max()),
            expected=f"<= {limit}",
            examples=offending["some_column"].head(10).tolist(),
        )
    return report.issues

Then add it to a profile. Trivial checks that only forward context fields are written as a lambda adapter; checks that need orchestration logic (such as deciding whether a reference is available) are written as named functions in profiles.py.

# proteobench/validation/profiles.py
QUANT_LFQ_PROFILE = ValidationProfile(
    name="quant_lfq",
    checks=[
        # ... existing checks ...
        Check(
            "my_constraint",
            lambda ctx: check_my_constraint(ctx.standard_df, ctx.parameters, ctx.config),
            "What this check verifies.",
        ),
    ],
)

Guidelines for checks:

  • Use ValidationContext fields: ctx.standard_df, ctx.parameters, ctx.config, ctx.fasta, ctx.input_format.

  • Choose severity by intent only. error and warning differ in display prominence and pull-request inclusion. Neither blocks submission.

  • A parameter-dependent check should emit a warning when the constraint was not parsed, rather than failing. The orchestrator also wraps every check so an unexpected exception becomes a check_failed warning, but checks should not rely on that.

Registering a profile#

from proteobench.validation import Check, ValidationProfile, register_profile

register_profile(
    ValidationProfile(
        name="my_module",
        description="Checks for the new module category.",
        checks=[Check("my_check", my_check_func, "what it does")],
    )
)

The registry helpers are register_profile (with overwrite=False by default), unregister_profile, get_profile, and available_profiles.

The report object#

validate_submission returns a ValidationReport. Useful members:

  • report.errors / report.warnings / report.infos: issues by severity.

  • report.has_errors and report.passed: informational only. The Streamlit flow does not gate on them.

  • report.summary(): a compact Markdown summary embedded in the pull-request description.

  • report.raise_if_errors(): optional path for programmatic callers that prefer an exception (SubmissionValidationError).

Each ValidationIssue carries a machine-readable code, a severity, a human-readable message, the originating check name, and optional field, observed, expected, and examples values.

Web integration#

Validation runs inside submit_to_repository in webinterface/pages/base_pages/tabs/tab6_submit_results.py, after the confirmation button and before the pull request is created. The standardized DataFrame is re-derived by rerunning the existing parser on the input DataFrame already in session state, so no tool-specific parsing logic is duplicated. The glue functions are in webinterface/pages/base_pages/utils/validation_ui.py:

  • run_submission_validation(variables, ionmodule, user_input, params) runs the full flow (re-parse, load and cache the FASTA, run the checks) and returns a ValidationReport. It is fault-tolerant: any infrastructure problem (missing input, parser failure, FASTA download failure) becomes a warning.

  • render_validation_report(report) displays errors and warnings, with info items inside a collapsed expander.

The findings are rendered in the UI and appended to the pull-request description through report.summary(). None of them block the pull request. The local Tab 2 upload path is unaffected.

Testing#

Validation is covered by test/test_validation.py, which tests FASTA parsing, protein-identifier matching, the individual checks, the profile registry (resolution, custom-profile registration, unknown-profile handling, de novo routing), report serialization, and integration through the real MaxQuant and Sage parsers. A small reference FASTA fixture lives at test/data/validation/ProteoBench_validation_reference.fasta.

When you add a check, add unit tests for it directly (it is a pure function). When you add a profile, add a test that the profile resolves for the module and that its checks run.

Documented limitations#

Some checks are intentionally limited because the required reference data is not available:

  • Full enzyme-specificity checks need reference protein sequences. Only internal K/R counting is done for the trypsin family, as a heuristic.

  • Cross-tool modification normalization is not implemented. Tools encode modifications differently (human-readable names, UniMod accessions, raw mass dictionaries); only matching human-readable names against the declared modifications is attempted.

  • Run-level matching (raw file, sample, experiment) is not possible because ProteoBenchParameters does not expose those fields. Only software identity is compared in run_consistency.