proteobench.validation.checks module

Individual validation checks operating on the standardized result DataFrame.

Every check is a pure function that takes the standardized DataFrame, the parsed ProteoBenchParameters (or any object with the same attributes), and a ModuleValidationConfig, and returns a list of ValidationIssue.

The checks are deliberately generic: they read the standardized columns (Proteins, Sequence, Charge, proforma) and the parameter attributes, never tool-specific result columns. Missing or unparsed parameters yield warnings rather than errors, so a submission is never blocked merely because a value could not be parsed.

Documented limitations and intentionally skipped checks:

Enzyme specificity: a missed-cleavage heuristic is implemented for common C-terminal cleaving enzymes (trypsin, trypsin/P, Lys-C, Arg-C, Glu-C, chymotrypsin) and only as a warning. It ignores protein N-/C-termini and ragged ends (resolving those would need the reference protein sequences), and N-terminal cleavers (Asp-N, Lys-N) are skipped.
Modifications: cross-tool modification representations are not normalized (human-readable names, UniMod accessions, and raw masses all occur). Only human-readable modification names observed in the proforma column are compared, as warnings; mass-only / UniMod-only tokens are skipped. The maximum-modifications count includes any fixed modifications written into the sequence, so it is an upper bound (warning only).
Mass tolerances: there is no per-result tolerance to compare against, so the precursor/fragment tolerances are only sanity-checked (present, numeric, positive), as warnings. An optional plausibility ceiling (max_plausible_ppm / max_plausible_dalton on the config) has no default; the implausible-value check is skipped unless a module configures it.
PSM FDR: checked against the [0, 1] range and the benchmark’s recommended maximum (configurable); violations are reported as warnings.
Run identity: ProteoBenchParameters does not expose raw-file, sample, or experiment identifiers, so result-vs-parameter run matching is limited to software identity. This is reported as info.

proteobench.validation.checks.MAX_PROTEIN_EXAMPLES = 20

Maximum number of example offending protein identifiers to report.

proteobench.validation.checks.MAX_ROW_EXAMPLES = 10

Maximum number of example offending rows to report for other checks.

proteobench.validation.checks.check_charge_range(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]

Validate that observed precursor charges fall within the parsed charge range.

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with min_precursor_charge / max_precursor_charge attributes).
config (ModuleValidationConfig) – Module validation configuration.

Returns:

Issues describing out-of-range charges, or warnings when the constraint or column is unavailable.

Return type:

list of ValidationIssue
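
The general shape of such a check can be illustrated with a minimal sketch. It is not the module's implementation: Issue is a hypothetical stand-in for ValidationIssue (whose real fields are not shown here), the charges are passed as a plain list rather than read from the Charge column, and reporting out-of-range charges with severity "error" is an assumption.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Iterable, List, Optional


@dataclass
class Issue:
    """Hypothetical stand-in for ValidationIssue; the real fields may differ."""
    severity: str  # "error", "warning", or "info"
    message: str


def _as_int(value: Any) -> Optional[int]:
    """Return value as an int, or None when it is missing or unparsable."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def sketch_charge_range(charges: Iterable[int], params: Any) -> List[Issue]:
    """Compare observed charges with the parsed min/max precursor charge."""
    low = _as_int(getattr(params, "min_precursor_charge", None))
    high = _as_int(getattr(params, "max_precursor_charge", None))
    if low is None or high is None:
        # Unparsed constraint: warn instead of blocking the submission.
        return [Issue("warning", "charge range not available; check skipped")]
    bad = sorted({c for c in charges if not low <= c <= high})
    if bad:
        # Severity "error" is an assumption made for this sketch.
        return [Issue("error", f"charges outside [{low}, {high}]: {bad}")]
    return []


# Any object with the right attributes works as params.
params = SimpleNamespace(min_precursor_charge=2, max_precursor_charge=4)
issues = sketch_charge_range([2, 3, 5, 1, 5], params)
skipped = sketch_charge_range([2, 3], SimpleNamespace())
```

Because the parameters are read with getattr, a SimpleNamespace or any other object exposing the same attributes can stand in for ProteoBenchParameters, and a missing attribute leads to a warning rather than an exception.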

proteobench.validation.checks.check_enzyme(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]

Best-effort enzyme/specificity check (missed cleavages, warning only).

Supports common C-terminal cleaving enzymes via _ENZYME_CLEAVAGE_RULES (trypsin, trypsin/P, Lys-C, Arg-C, Glu-C, chymotrypsin). For each unique peptide it counts internal cleavage residues and warns when more peptides than allowed exceed allowed_miscleavages. This is a heuristic: it ignores ragged termini and protein ends, so it can only be a warning. N-terminal cleavers (Asp-N, Lys-N) and unknown enzymes are reported as info (skipped).

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with enzyme, semi_enzymatic, allowed_miscleavages attributes).
config (ModuleValidationConfig) – Module validation configuration.

Returns:

Warnings for peptides exceeding the allowed missed cleavages, or info/warning describing why the check was skipped.

Return type:

list of ValidationIssue
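
The missed-cleavage counting described above could look roughly like the following sketch. The RULES table is hypothetical and covers only a subset of enzymes; the real _ENZYME_CLEAVAGE_RULES may be structured differently. As in the documented heuristic, protein termini and ragged ends are ignored.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical rule table (assumption): each pattern matches one residue
# after which the enzyme cleaves.
RULES: Dict[str, str] = {
    "trypsin": r"[KR](?!P)",  # after K/R unless the next residue is P
    "trypsin/p": r"[KR]",     # after K/R, proline is ignored
    "lys-c": r"K",
    "arg-c": r"R",
}


def count_missed_cleavages(peptide: str, enzyme: str) -> int:
    """Count cleavage residues strictly inside the peptide.

    The pattern runs on the full sequence so that the proline lookahead
    sees the real next residue; a match on the last residue is the
    expected C-terminus and is not counted.
    """
    pattern = RULES[enzyme.lower()]
    last = len(peptide) - 1
    return sum(1 for m in re.finditer(pattern, peptide) if m.start() < last)


def peptides_over_limit(peptides: Iterable[str], enzyme: str, allowed: int) -> List[str]:
    """Return the unique peptides with more missed cleavages than allowed."""
    unique = sorted(set(peptides))
    return [p for p in unique if count_missed_cleavages(p, enzyme) > allowed]


offenders = peptides_over_limit(
    ["PEPTIDEK", "AKRGK", "AKPEPTIDER", "AKRGK"], "trypsin", allowed=1
)
```

Here "AKPEPTIDER" counts zero missed cleavages under the trypsin rule because its internal K is followed by P, while the same peptide counts one under trypsin/P.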

proteobench.validation.checks.check_fdr_psm(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]

Sanity-check the PSM-level FDR (warning only).

Validates that ident_fdr_psm is present, within [0, 1], and not above the benchmark’s recommended maximum (ModuleValidationConfig.recommended_max_fdr_psm, default 0.01).

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame (unused; kept for signature consistency).
params (Any) – Parsed parameters (object with an ident_fdr_psm attribute).
config (ModuleValidationConfig) – Module validation configuration (provides recommended_max_fdr_psm).

Returns:

Warnings for a missing, out-of-range, or above-recommended PSM FDR.

Return type:

list of ValidationIssue
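
The three conditions (present, within [0, 1], not above the recommended maximum) can be sketched as below. This is an illustration only: the function name and the plain-string warnings are hypothetical, and the real check returns ValidationIssue objects.

```python
import math
from typing import Any, List


def sketch_fdr_psm(value: Any, recommended_max: float = 0.01) -> List[str]:
    """Return warning messages for a PSM FDR value; an empty list means OK."""
    try:
        fdr = float(value)
    except (TypeError, ValueError):
        return ["ident_fdr_psm is missing or not numeric"]
    if math.isnan(fdr):
        return ["ident_fdr_psm is missing or not numeric"]
    if not 0.0 <= fdr <= 1.0:
        # A value such as 1.5 or 5 may be a percentage written as a fraction.
        return [f"ident_fdr_psm={fdr} is outside [0, 1]"]
    if fdr > recommended_max:
        return [f"ident_fdr_psm={fdr} exceeds the recommended maximum {recommended_max}"]
    return []


ok = sketch_fdr_psm(0.01)
too_high = sketch_fdr_psm(0.05)
out_of_range = sketch_fdr_psm(5)
missing = sketch_fdr_psm(None)
```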

proteobench.validation.checks.check_mass_tolerances(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]

Sanity-check the precursor and fragment mass tolerances (warning only).

There is no per-result tolerance to compare against, so this validates that the parsed precursor_mass_tolerance and fragment_mass_tolerance are present, numeric, and positive. When the module configures a plausibility ceiling (config.max_plausible_ppm / config.max_plausible_dalton, which have no default), tolerances above it are also flagged; otherwise that sub-check is skipped. Mis-parsed or nonsensical values are flagged as warnings.

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame (unused; kept for signature consistency).
params (Any) – Parsed parameters (object with precursor_mass_tolerance / fragment_mass_tolerance attributes).
config (ModuleValidationConfig) – Module validation configuration.

Returns:

Warnings for missing, unparsable, or implausible tolerances.

Return type:

list of ValidationIssue
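
A rough sketch of the sanity check with an optional ceiling follows. It assumes, purely for illustration, that a tolerance arrives as a simple string such as "10 ppm" or "0.02 Da"; the format actually stored in ProteoBenchParameters may differ. The max_ppm and max_dalton arguments mirror config.max_plausible_ppm and config.max_plausible_dalton and default to None, so the ceiling sub-check is skipped unless a value is supplied.

```python
import re
from typing import Any, List, Optional

# Assumed input format: "<number> <unit>" with unit ppm or Da (hypothetical).
_TOLERANCE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*(ppm|da)\s*$", re.IGNORECASE)


def sketch_tolerance(
    name: str,
    raw: Any,
    max_ppm: Optional[float] = None,
    max_dalton: Optional[float] = None,
) -> List[str]:
    """Return warning messages for one tolerance; an empty list means OK."""
    if raw is None:
        return [f"{name}: missing"]
    match = _TOLERANCE.match(str(raw))
    if match is None:
        return [f"{name}: could not parse {raw!r}"]
    value, unit = float(match.group(1)), match.group(2).lower()
    if value <= 0:
        return [f"{name}: must be positive, got {value}"]
    ceiling = max_ppm if unit == "ppm" else max_dalton
    if ceiling is not None and value > ceiling:
        return [f"{name}: {value} {unit} exceeds the plausibility ceiling {ceiling}"]
    return []


no_ceiling = sketch_tolerance("precursor_mass_tolerance", "500 ppm")
with_ceiling = sketch_tolerance("precursor_mass_tolerance", "500 ppm", max_ppm=100)
```

The same input passes without a ceiling and is flagged once a ceiling is configured, which matches the "no default" behaviour described above.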

proteobench.validation.checks.check_max_modifications(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]

Check that no peptide carries more modifications than allowed (warning only).

Counts the bracketed modifications in each proforma string and warns when more than max_mods are present. This is a heuristic: the count includes any fixed modifications written into the sequence, so it is an upper bound on the number of variable modifications.

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with a max_mods attribute).
config (ModuleValidationConfig) – Module validation configuration.

Returns:

A warning for peptides exceeding max_mods, or a warning/info describing why the check was skipped.

Return type:

list of ValidationIssue
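
Counting bracketed modifications can be sketched with a regular expression. The helper names are hypothetical, and the pattern is a simplification that does not handle nested brackets; it only illustrates why fixed modifications written into the sequence inflate the count.

```python
import re
from typing import Iterable, List

# One bracketed token, e.g. "[Oxidation]" or "[+15.995]" (no nesting).
_BRACKETED = re.compile(r"\[[^\]]*\]")


def count_modifications(proforma: str) -> int:
    """Count every bracketed token, fixed and variable alike."""
    return len(_BRACKETED.findall(proforma))


def over_max_mods(proformas: Iterable[str], max_mods: int) -> List[str]:
    """Return the proforma strings carrying more than max_mods tokens."""
    return [p for p in proformas if count_modifications(p) > max_mods]


peptides = [
    "PEPTIDEK",
    "AC[Carbamidomethyl]M[Oxidation]K",
    "[Acetyl]-AC[Carbamidomethyl]M[Oxidation]K",
]
flagged = over_max_mods(peptides, max_mods=2)
```

The last peptide is flagged at max_mods=2 even though one of its three tokens is typically a fixed modification, which is why the result can only be treated as an upper bound.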

proteobench.validation.checks.check_modifications(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]

Best-effort modification compatibility check (warnings only).

Compares human-readable modification names observed in the proforma column against the parsed fixed/variable modification settings. Mass-only and UniMod-only modification tokens are not compared because their representation is not normalized across tools.

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with fixed_mods / variable_mods).
config (ModuleValidationConfig) – Module validation configuration.

Returns:

Warnings for observed modification names not found in the declared settings, or a warning/info describing why the check was limited.

Return type:

list of ValidationIssue
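
The name-only comparison might be sketched as follows. Several details are assumptions made for the example: the declared modifications are given as a plain list of names, matching is case-insensitive and exact, and a token counts as "UniMod-only" when it looks like UNIMOD:35 or U:35 and as "mass-only" when it is a signed number.

```python
import re
from typing import Iterable, List

_TOKEN = re.compile(r"\[([^\]]*)\]")
_UNIMOD = re.compile(r"^(UNIMOD|U):\d+$", re.IGNORECASE)
_MASS = re.compile(r"^[+-]?\d+(\.\d+)?$")


def readable_names(proforma: str) -> List[str]:
    """Return bracketed tokens that are neither accessions nor raw masses."""
    names = []
    for token in _TOKEN.findall(proforma):
        token = token.strip()
        if not token or _UNIMOD.match(token) or _MASS.match(token):
            continue  # skipped: representation not comparable across tools
        names.append(token)
    return names


def undeclared_names(proformas: Iterable[str], declared: Iterable[str]) -> List[str]:
    """Return observed readable names absent from the declared settings."""
    allowed = {name.strip().lower() for name in declared}
    found = set()
    for proforma in proformas:
        for name in readable_names(proforma):
            if name.lower() not in allowed:
                found.add(name)
    return sorted(found)


observed = ["PEPM[Oxidation]K", "AC[UNIMOD:4]K", "M[+15.995]K", "S[Phospho]K"]
undeclared = undeclared_names(observed, declared=["oxidation"])
```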

proteobench.validation.checks.check_peptide_length(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]

Validate that peptide lengths fall within the parsed peptide-length range.

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with min_peptide_length / max_peptide_length attributes).
config (ModuleValidationConfig) – Module validation configuration.

Returns:

Issues describing out-of-range peptide lengths, or warnings when the constraint or column is unavailable.

Return type:

list of ValidationIssue

proteobench.validation.checks.check_protein_ids(df: DataFrame, fasta: FastaReference, config: ModuleValidationConfig) → List[ValidationIssue]

Validate protein identifiers against the reference FASTA accession set.

Splits protein groups, skips decoy and contaminant identifiers, and reports as an error any remaining identifier that is not found in the reference.

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame.
fasta (FastaReference) – Reference protein identifiers.
config (ModuleValidationConfig) – Module validation configuration.

Returns:

Issues describing missing protein identifiers (or an info confirming all identifiers were found).

Return type:

list of ValidationIssue
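
The split, skip, and look-up steps can be sketched as below. The ";" separator and the decoy/contaminant prefixes are assumptions chosen for the example, and a plain set of accessions stands in for FastaReference. The example list is capped in the spirit of MAX_PROTEIN_EXAMPLES.

```python
from typing import Iterable, List, Set, Tuple

MAX_PROTEIN_EXAMPLES = 20  # same value as the documented constant


def missing_protein_ids(
    groups: Iterable[str],
    reference: Set[str],
    separator: str = ";",  # assumed group separator
    skip_prefixes: Tuple[str, ...] = ("DECOY_", "rev_", "Cont_"),  # assumed
) -> List[str]:
    """Return identifiers not in the reference, in first-seen order."""
    missing: List[str] = []
    seen: Set[str] = set()
    for group in groups:
        for ident in group.split(separator):
            ident = ident.strip()
            if not ident or ident in seen:
                continue
            seen.add(ident)
            if ident.startswith(skip_prefixes):
                continue  # decoys and contaminants are not expected in the FASTA
            if ident not in reference:
                missing.append(ident)
    return missing


groups = ["P12345;Q99999", "DECOY_P12345", "Cont_X1", "P12345"]
missing = missing_protein_ids(groups, reference={"P12345"})
examples = missing[:MAX_PROTEIN_EXAMPLES]  # report only a bounded sample
```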

proteobench.validation.checks.check_run_consistency(df: DataFrame, params: Any, input_format: str | None, config: ModuleValidationConfig) → List[ValidationIssue]

Check that the parameter file matches the submitted run, where feasible.

Only software identity can be compared, because ProteoBenchParameters does not expose raw-file, sample, or experiment identifiers. A mismatch in software identity is reported as an error; the unavailable run-level matching is reported as info.

Parameters:

df (pandas.DataFrame) – The standardized result DataFrame (unused for software identity but kept for signature consistency and future extension).
params (Any) – Parsed parameters (object with software_name / software_version).
input_format (str or None) – The selected software tool used to parse the results.
config (ModuleValidationConfig) – Module validation configuration.

Returns:

Issues describing software-identity mismatches and the documented limitation on run-level matching.

Return type:

list of ValidationIssue