.. _submission-validation: ###################### Submission validation ###################### ProteoBench validates an uploaded submission before the public datapoint is created. The validation layer checks that the standardized results and the parsed parameters are internally consistent and consistent with the module reference database, and returns a structured ``ValidationReport``. Validation is **non-blocking**. Every finding, including ``error``-severity ones, is shown to the submitter and embedded in the pull-request description for the reviewers, but submission always proceeds. Severity controls only display prominence and inclusion in the pull-request summary. It does not gate the submission flow. The layer is framework-agnostic and **registry-driven**. Each module maps to a *validation profile* (a named, ordered set of checks). Adding a new module of an existing category is configuration-only. Adding a genuinely new category only requires registering a new profile. The orchestrator never needs to change. The code lives in the ``proteobench.validation`` package. The Streamlit glue lives in :file:`webinterface/pages/base_pages/utils/validation_ui.py`. Package layout ============== .. list-table:: :header-rows: 1 :widths: 30 70 * - File - Contents * - ``report.py`` - ``Severity`` enum (``error`` / ``warning`` / ``info``), ``ValidationIssue``, and ``ValidationReport`` (issue collection plus ``has_errors``, ``passed``, ``summary()``, ``raise_if_errors()``). * - ``context.py`` - ``ValidationContext``: bundles everything a check might need (``standard_df``, ``parameters``, ``config``, ``fasta``, ``input_format``, a generic ``reference``, and an ``extras`` dict) so every check has the uniform signature ``ctx -> list[ValidationIssue]``. * - ``config.py`` - ``ModuleValidationConfig``: column names, contaminant flag, decoy prefixes, reference FASTA location, and the resolved ``validation_profile``. Built with ``from_parse_settings(...)``. * - ``checks.py`` - Pure, individually testable check functions (protein IDs, charge range, peptide length, enzyme, modifications, maximum modifications, mass tolerances, PSM FDR, run consistency). * - ``profiles.py`` - ``Check``, ``ValidationProfile``, the profile **registry** (``register_profile`` / ``unregister_profile`` / ``get_profile`` / ``available_profiles``), and the built-in ``quant_lfq`` and ``denovo`` profiles. This is the extensibility surface. * - ``validator.py`` - ``validate_submission(...)``: resolves the profile, builds the context, runs the profile's checks, and returns the report. Each check is fault-tolerant: an unexpected exception becomes a warning. * - ``fasta.py`` - ``FastaReference``: builds the expected protein set from FASTA text, a path, bytes, a zip / gzip, or a URL. * - ``protein_ids.py`` - Helpers to split protein groups, extract identifiers, and skip decoys and contaminants. * - ``exceptions.py`` - ``SubmissionValidationError``, which wraps a report for programmatic callers that opt in to raising. Data flow ========= .. code-block:: text module_settings.toml + parser --> ModuleValidationConfig.from_parse_settings(...) reference FASTA --> FastaReference.from_url(...) | standardized DataFrame + params -----+--> validate_submission(...) | | | +-- resolve profile (registry) | +-- build ValidationContext | +-- run each Check (fault-tolerant) v ValidationReport --> UI display + PR summary The core validator performs no I/O. Any reference data (a FASTA, a ground-truth table) is supplied through the arguments. The front end is responsible for obtaining the standardized DataFrame and the reference, which is what the Streamlit glue does. Built-in profiles ================== ``quant_lfq`` Runs, in order: ``protein_ids`` (against the reference FASTA), ``charge_range``, ``peptide_length``, ``enzyme``, ``modifications``, ``max_modifications``, ``mass_tolerances``, ``fdr_psm``, and ``run_consistency``. ``protein_ids``, ``charge_range``, and ``peptide_length`` default to ``error`` severity; the rest default to ``warning``. ``denovo`` Runs ``run_consistency`` plus a ``denovo_pending`` informational placeholder. De novo uses a different standardized schema and a ground-truth table rather than a FASTA, so content checks are a documented to-do in ``profiles.py``. Checks are reusable across profiles. For example, ``run_consistency`` is shared by both built-in profiles. Integrating validation for a new module ======================================== Existing category (quantification) ----------------------------------- For a quantification module no code is required. Two configuration steps are enough: 1. Add a reference database to the module's ``module_settings.toml`` (beside ``[species_expected_ratio]`` and ``[general]``): .. code-block:: toml [reference_database] "fasta_url" = "https://proteobench.cubimed.rub.de/fasta/ProteoBenchFASTA_MixedSpecies_HYE.zip" # "fasta_filename" = "optional_member_name_inside_the_archive.fasta" If ``[reference_database]`` is absent, the protein-identifier check is skipped with an informational message and the other checks still run. 2. The profile resolves automatically to ``quant_lfq`` from the module's parser class (``ParseSettingsQuant``), so no profile declaration is needed. You may pin it explicitly if you prefer: .. code-block:: toml [validation] "profile" = "quant_lfq" The orchestrator and the submission tab already run the resolved profile. The profile is resolved by ``ModuleValidationConfig.from_parse_settings`` in this order of precedence: 1. an explicit ``[validation].profile`` key in ``module_settings.toml``; 2. inference from the parser class via the ``MODULE_TO_CLASS`` registry (``ParseSettingsQuant`` -> ``quant_lfq``, ``ParseSettingsDeNovo`` -> ``denovo``); 3. the ``DEFAULT_VALIDATION_PROFILE`` fallback (``quant_lfq``). An unregistered profile name produces a single ``unknown_validation_profile`` warning and runs nothing. It never blocks. New category ------------ A genuinely new category of module needs a profile of its own: 1. Write the checks it needs (see below), or reuse existing ones. 2. Register a ``ValidationProfile`` in ``profiles.py`` (or, from third-party code, via ``register_profile``). 3. Point the module at it with ``[validation] profile = ""`` in ``module_settings.toml``. If you want the profile to be inferred from the parser class instead, add an entry to ``_PROFILE_BY_PARSER_CLASS`` in ``config.py``. The orchestrator (``validate_submission``) is generic and does not change. Extending and maintaining the checks ===================================== Adding a check -------------- A check is a pure function with the signature ``ctx -> list[ValidationIssue]``. Add it to :file:`proteobench/validation/checks.py`, keeping it independently unit-testable, then register it in the relevant profile in ``profiles.py``. .. code-block:: python # proteobench/validation/checks.py from proteobench.validation.context import ValidationContext from proteobench.validation.report import ValidationReport, ValidationIssue from typing import List def check_my_constraint(standard_df, parameters, config) -> List[ValidationIssue]: """Check some property of the standardized results against a parameter.""" report = ValidationReport() # Parameter-dependent checks must self-report when the value was not # parsed, and never crash. limit = getattr(parameters, "my_limit", None) if limit is None: report.add_warning( "my_limit_absent", "The parameter 'my_limit' could not be parsed; the check was skipped.", "my_constraint", ) return report.issues offending = standard_df[standard_df["some_column"] > limit] if not offending.empty: report.add_error( "my_constraint_violated", f"{len(offending)} row(s) exceed my_limit ({limit}).", "my_constraint", observed=int(offending["some_column"].max()), expected=f"<= {limit}", examples=offending["some_column"].head(10).tolist(), ) return report.issues Then add it to a profile. Trivial checks that only forward context fields are written as a lambda adapter; checks that need orchestration logic (such as deciding whether a reference is available) are written as named functions in ``profiles.py``. .. code-block:: python # proteobench/validation/profiles.py QUANT_LFQ_PROFILE = ValidationProfile( name="quant_lfq", checks=[ # ... existing checks ... Check( "my_constraint", lambda ctx: check_my_constraint(ctx.standard_df, ctx.parameters, ctx.config), "What this check verifies.", ), ], ) Guidelines for checks: - Use ``ValidationContext`` fields: ``ctx.standard_df``, ``ctx.parameters``, ``ctx.config``, ``ctx.fasta``, ``ctx.input_format``. - Choose severity by intent only. ``error`` and ``warning`` differ in display prominence and pull-request inclusion. Neither blocks submission. - A parameter-dependent check should emit a ``warning`` when the constraint was not parsed, rather than failing. The orchestrator also wraps every check so an unexpected exception becomes a ``check_failed`` warning, but checks should not rely on that. Registering a profile ---------------------- .. code-block:: python from proteobench.validation import Check, ValidationProfile, register_profile register_profile( ValidationProfile( name="my_module", description="Checks for the new module category.", checks=[Check("my_check", my_check_func, "what it does")], ) ) The registry helpers are ``register_profile`` (with ``overwrite=False`` by default), ``unregister_profile``, ``get_profile``, and ``available_profiles``. The report object ------------------ ``validate_submission`` returns a ``ValidationReport``. Useful members: - ``report.errors`` / ``report.warnings`` / ``report.infos``: issues by severity. - ``report.has_errors`` and ``report.passed``: informational only. The Streamlit flow does not gate on them. - ``report.summary()``: a compact Markdown summary embedded in the pull-request description. - ``report.raise_if_errors()``: optional path for programmatic callers that prefer an exception (``SubmissionValidationError``). Each ``ValidationIssue`` carries a machine-readable ``code``, a ``severity``, a human-readable ``message``, the originating ``check`` name, and optional ``field``, ``observed``, ``expected``, and ``examples`` values. Web integration ================ Validation runs inside ``submit_to_repository`` in :file:`webinterface/pages/base_pages/tabs/tab6_submit_results.py`, after the confirmation button and before the pull request is created. The standardized DataFrame is re-derived by rerunning the existing parser on the input DataFrame already in session state, so no tool-specific parsing logic is duplicated. The glue functions are in :file:`webinterface/pages/base_pages/utils/validation_ui.py`: - ``run_submission_validation(variables, ionmodule, user_input, params)`` runs the full flow (re-parse, load and cache the FASTA, run the checks) and returns a ``ValidationReport``. It is fault-tolerant: any infrastructure problem (missing input, parser failure, FASTA download failure) becomes a warning. - ``render_validation_report(report)`` displays errors and warnings, with info items inside a collapsed expander. The findings are rendered in the UI and appended to the pull-request description through ``report.summary()``. None of them block the pull request. The local Tab 2 upload path is unaffected. Testing ======= Validation is covered by :file:`test/test_validation.py`, which tests FASTA parsing, protein-identifier matching, the individual checks, the profile registry (resolution, custom-profile registration, unknown-profile handling, de novo routing), report serialization, and integration through the real MaxQuant and Sage parsers. A small reference FASTA fixture lives at :file:`test/data/validation/ProteoBench_validation_reference.fasta`. When you add a check, add unit tests for it directly (it is a pure function). When you add a profile, add a test that the profile resolves for the module and that its checks run. Documented limitations ======================= Some checks are intentionally limited because the required reference data is not available: - Full enzyme-specificity checks need reference protein sequences. Only internal K/R counting is done for the trypsin family, as a heuristic. - Cross-tool modification normalization is not implemented. Tools encode modifications differently (human-readable names, UniMod accessions, raw mass dictionaries); only matching human-readable names against the declared modifications is attempted. - Run-level matching (raw file, sample, experiment) is not possible because ``ProteoBenchParameters`` does not expose those fields. Only software identity is compared in ``run_consistency``.