proteobench.validation.checks module
Individual validation checks operating on the standardized result DataFrame.
Every check is a pure function that takes the standardized DataFrame, the
parsed ProteoBenchParameters (or any object with
the same attributes), and a ModuleValidationConfig,
and returns a list of ValidationIssue.
The checks are deliberately generic: they read the standardized columns
(Proteins, Sequence, Charge, proforma) and the parameter
attributes, never tool-specific result columns. Missing or unparsed parameters
yield warnings rather than errors, so a submission is never blocked merely
because a value could not be parsed.
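A minimal sketch of that contract, assuming a simplified Issue record in place of ValidationIssue and a plain list of charges in place of the DataFrame. Both are hypothetical stand-ins, and the severity used for out-of-range values is an assumption. It shows a pure check that falls back to a warning when a parameter was not parsed:

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, List


@dataclass(frozen=True)
class Issue:
    """Hypothetical stand-in for ValidationIssue (the real fields may differ)."""
    severity: str  # "error", "warning" or "info"
    message: str


def sketch_check(charges: List[int], params: Any, config: Any) -> List[Issue]:
    """Pure check: reads its inputs, mutates nothing, returns a list of issues."""
    max_charge = getattr(params, "max_precursor_charge", None)
    if max_charge is None:
        # An unparsed parameter yields a warning instead of blocking the submission.
        return [Issue("warning", "max_precursor_charge not parsed; check skipped")]
    n_bad = sum(1 for z in charges if z > max_charge)
    if n_bad:
        # Severity "error" is an assumption made for this sketch.
        return [Issue("error", f"{n_bad} precursor(s) above charge {max_charge}")]
    return []


unparsed = sketch_check([2, 3], SimpleNamespace(), config=None)
violated = sketch_check([2, 5], SimpleNamespace(max_precursor_charge=4), config=None)
```

Because params is only accessed through getattr, any object exposing the same attribute names can stand in for the parsed parameters, which matches the duck-typed contract described above.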
Documented limitations and intentionally skipped checks:
Enzyme specificity: a missed-cleavage heuristic is implemented for common C-terminal cleaving enzymes (trypsin, trypsin/P, Lys-C, Arg-C, Glu-C, chymotrypsin) and only as a warning. It ignores protein N-/C-termini and ragged ends (resolving those would need the reference protein sequences), and N-terminal cleavers (Asp-N, Lys-N) are skipped.
Modifications: cross-tool modification representations are not normalized (human-readable names, UniMod accessions, and raw masses all occur). Only human-readable modification names observed in the proforma column are compared, as warnings; mass-only / UniMod-only tokens are skipped. The maximum-modifications count includes any fixed modifications written into the sequence, so it is an upper bound (warning only).
Mass tolerances: there is no per-result tolerance to compare against, so the precursor/fragment tolerances are only sanity-checked (present, numeric, positive), as warnings. An optional plausibility ceiling (max_plausible_ppm / max_plausible_dalton on the config) has no default; the implausible-value check is skipped unless a module configures it.
PSM FDR: validated against the valid [0, 1] range and the benchmark’s recommended maximum (configurable), as warnings.
Run identity: ProteoBenchParameters does not expose raw-file, sample, or experiment identifiers, so result-vs-parameter run matching is limited to software identity. This is reported as info.
- proteobench.validation.checks.MAX_PROTEIN_EXAMPLES = 20
Maximum number of example offending protein identifiers to report.
- proteobench.validation.checks.MAX_ROW_EXAMPLES = 10
Maximum number of example offending rows to report for other checks.
- proteobench.validation.checks.check_charge_range(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]
Validate that observed precursor charges fall within the parsed charge range.
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with min_precursor_charge / max_precursor_charge attributes).
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
Issues describing out-of-range charges, or warnings when the constraint or column is unavailable.
- Return type:
List[ValidationIssue]
- proteobench.validation.checks.check_enzyme(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]
Best-effort enzyme/specificity check (missed cleavages, warning only).
Supports common C-terminal cleaving enzymes via _ENZYME_CLEAVAGE_RULES (trypsin, trypsin/P, Lys-C, Arg-C, Glu-C, chymotrypsin). For each unique peptide it counts internal cleavage residues and warns when more peptides than allowed exceed allowed_miscleavages. This is a heuristic: it ignores ragged termini and protein ends, so it can only be a warning. N-terminal cleavers (Asp-N, Lys-N) and unknown enzymes are reported as info (skipped).
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with enzyme, semi_enzymatic, allowed_miscleavages attributes).
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
Warnings for peptides exceeding the allowed missed cleavages, or info/warning describing why the check was skipped.
- Return type:
List[ValidationIssue]
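The missed-cleavage heuristic can be sketched as follows. This is an illustration only: the SKETCH_RULES table and both helper functions are hypothetical, cover just four of the listed enzymes, and use plain strings rather than the DataFrame; the real _ENZYME_CLEAVAGE_RULES may be structured differently.

```python
import re
from typing import Iterable, List, Optional

# Hypothetical rule table: each regex matches a residue the enzyme cuts after.
# The real _ENZYME_CLEAVAGE_RULES may be structured differently.
SKETCH_RULES = {
    "trypsin": re.compile(r"[KR](?!P)"),  # no cut when followed by proline
    "trypsin/p": re.compile(r"[KR]"),
    "lys-c": re.compile(r"K"),
    "arg-c": re.compile(r"R"),
}


def count_missed_cleavages(sequence: str, enzyme: str) -> Optional[int]:
    """Count internal cleavage sites; None means the enzyme is not supported."""
    rule = SKETCH_RULES.get(enzyme.lower())
    if rule is None:
        return None
    last = len(sequence) - 1
    # The final residue is the peptide's own C-terminal cut, so it is excluded.
    return sum(1 for m in rule.finditer(sequence) if m.start() < last)


def peptides_over_limit(sequences: Iterable[str], enzyme: str, allowed: int) -> List[str]:
    """Return the unique peptides with more missed cleavages than allowed."""
    over = []
    for seq in sorted(set(sequences)):
        n = count_missed_cleavages(seq, enzyme)
        if n is not None and n > allowed:
            over.append(seq)
    return over
```

Matching against the full sequence and then discarding the last position keeps the proline lookahead correct for the penultimate residue. As in the documented check, protein termini and ragged ends are ignored, so the result is only suitable for a warning.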
- proteobench.validation.checks.check_fdr_psm(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]
Sanity-check the PSM-level FDR (warning only).
Validates that ident_fdr_psm is present, within [0, 1], and not above the benchmark’s recommended maximum (ModuleValidationConfig.recommended_max_fdr_psm, default 0.01).
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame (unused; kept for signature consistency).
params (Any) – Parsed parameters (object with an ident_fdr_psm attribute).
config (ModuleValidationConfig) – Module validation configuration (provides recommended_max_fdr_psm).
- Returns:
Warnings for a missing, out-of-range, or above-recommended PSM FDR.
- Return type:
List[ValidationIssue]
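The three conditions above (present, within [0, 1], not above the recommended maximum) could be expressed roughly like this. The function name and the plain-string warnings are hypothetical simplifications; the real check returns ValidationIssue objects.

```python
from typing import List


def fdr_warnings(fdr: object, recommended_max: float = 0.01) -> List[str]:
    """Return warning messages for a PSM-level FDR value (hypothetical helper)."""
    if fdr is None:
        return ["ident_fdr_psm is missing"]
    try:
        value = float(fdr)
    except (TypeError, ValueError):
        return [f"ident_fdr_psm is not numeric: {fdr!r}"]
    # NaN fails this comparison too, so it is reported as out of range.
    if not 0.0 <= value <= 1.0:
        return [f"ident_fdr_psm={value} is outside [0, 1]"]
    if value > recommended_max:
        return [f"ident_fdr_psm={value} is above the recommended maximum {recommended_max}"]
    return []
```

The checks are ordered from most to least fundamental, so each value produces at most one message.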
- proteobench.validation.checks.check_mass_tolerances(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]
Sanity-check the precursor and fragment mass tolerances (warning only).
There is no per-result tolerance to compare against, so this validates that the parsed precursor_mass_tolerance and fragment_mass_tolerance are present, numeric, and positive. When the module configures a plausibility ceiling (config.max_plausible_ppm / config.max_plausible_dalton, which have no default), tolerances above it are also flagged; otherwise that sub-check is skipped. Mis-parsed or nonsensical values are flagged as warnings.
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame (unused; kept for signature consistency).
params (Any) – Parsed parameters (object with precursor_mass_tolerance / fragment_mass_tolerance attributes).
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
Warnings for missing, unparsable, or implausible tolerances.
- Return type:
List[ValidationIssue]
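A rough sketch of the sanity check with its optional ceiling. It assumes the tolerance has already been reduced to a single number in a known unit, which may not match how ProteoBenchParameters actually stores tolerances; the function name and messages are hypothetical.

```python
import math
from typing import List, Optional


def tolerance_warnings(name: str, value: object, ceiling: Optional[float] = None) -> List[str]:
    """Return warning messages for one mass tolerance (hypothetical helper)."""
    if value is None:
        return [f"{name} is missing"]
    try:
        number = float(value)
    except (TypeError, ValueError):
        return [f"{name} is not numeric: {value!r}"]
    if not math.isfinite(number) or number <= 0:
        return [f"{name} must be a positive, finite number (got {number})"]
    # With no ceiling configured, the plausibility sub-check is skipped entirely.
    if ceiling is not None and number > ceiling:
        return [f"{name}={number} exceeds the configured ceiling {ceiling}"]
    return []
```

Leaving ceiling at None mirrors the documented behaviour that the implausible-value check only runs when a module configures it.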
- proteobench.validation.checks.check_max_modifications(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]
Check that no peptide carries more modifications than allowed (warning only).
Counts the bracketed modifications in each proforma string and warns when more than max_mods are present. This is a heuristic: the count includes any fixed modifications written into the sequence, so it is an upper bound on the number of variable modifications.
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with a max_mods attribute).
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
A warning for peptides exceeding max_mods, or a warning/info describing why the check was skipped.
- Return type:
List[ValidationIssue]
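Counting bracketed modifications can be illustrated with a short regular-expression sketch. The helper names are hypothetical, and the pattern assumes simple, non-nested square brackets, which may be narrower than what the real check accepts.

```python
import re
from typing import Iterable, List

# Matches one bracketed token such as [Oxidation] or [+15.995]; no nesting.
_BRACKETED = re.compile(r"\[[^\]]*\]")


def count_modifications(proforma: str) -> int:
    """Count bracketed modification tokens in a ProForma-style string."""
    return len(_BRACKETED.findall(proforma))


def over_max_mods(proformas: Iterable[str], max_mods: int) -> List[str]:
    """Return the strings carrying more bracketed modifications than max_mods."""
    return [p for p in proformas if count_modifications(p) > max_mods]
```

Every bracketed token is counted, including fixed modifications such as a carbamidomethylated cysteine, which is why the result is only an upper bound on the variable modifications.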
- proteobench.validation.checks.check_modifications(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]
Best-effort modification compatibility check (warnings only).
Compares human-readable modification names observed in the proforma column against the parsed fixed/variable modification settings. Mass-only and UniMod-only modification tokens are not compared because their representation is not normalized across tools.
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with fixed_mods / variable_mods).
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
Warnings for observed modification names not found in the declared settings, or a warning/info describing why the check was limited.
- Return type:
List[ValidationIssue]
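One way to picture the name-only comparison is sketched below. It assumes the declared fixed/variable modifications are available as free text and uses a case-insensitive substring match; the helper names, the token patterns, and the matching rule are all hypothetical and may differ from the real implementation.

```python
import re
from typing import Iterable, List, Set

_BRACKET = re.compile(r"\[([^\]]+)\]")
_MASS_ONLY = re.compile(r"^[+-]?\d+(\.\d+)?$")


def readable_mod_names(proforma: str) -> Set[str]:
    """Collect human-readable names, skipping mass-only and UniMod-only tokens."""
    names = set()
    for token in _BRACKET.findall(proforma):
        token = token.strip()
        if _MASS_ONLY.match(token) or token.upper().startswith("UNIMOD:"):
            continue  # representation is not normalized across tools
        names.add(token)
    return names


def undeclared_mods(proformas: Iterable[str], declared_text: str) -> List[str]:
    """Return observed names that do not appear in the declared settings text."""
    declared = declared_text.lower()
    observed = set().union(*(readable_mod_names(p) for p in proformas))
    return sorted(name for name in observed if name.lower() not in declared)
```

Tokens such as [+15.995] or [UNIMOD:35] are dropped before the comparison, so they can never produce a false mismatch, at the cost of not being validated at all.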
- proteobench.validation.checks.check_peptide_length(df: DataFrame, params: Any, config: ModuleValidationConfig) → List[ValidationIssue]
Validate that peptide lengths fall within the parsed peptide-length range.
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame.
params (Any) – Parsed parameters (object with min_peptide_length / max_peptide_length attributes).
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
Issues describing out-of-range peptide lengths, or warnings when the constraint or column is unavailable.
- Return type:
List[ValidationIssue]
- proteobench.validation.checks.check_protein_ids(df: DataFrame, fasta: FastaReference, config: ModuleValidationConfig) → List[ValidationIssue]
Validate protein identifiers against the reference FASTA accession set.
Splits protein groups, skips decoy and contaminant identifiers, and reports as an error any remaining identifier that is not found in the reference.
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame.
fasta (FastaReference) – Reference protein identifiers.
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
Issues describing missing protein identifiers (or an info confirming all identifiers were found).
- Return type:
List[ValidationIssue]
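The split, skip, and look-up steps might look like the sketch below. The ";" separator, the decoy/contaminant prefixes, and the use of a plain set in place of FastaReference are assumptions made for illustration; only the cap of 20 examples comes from MAX_PROTEIN_EXAMPLES.

```python
from typing import Iterable, List, Set, Tuple

MAX_PROTEIN_EXAMPLES = 20  # documented cap on reported example identifiers


def missing_protein_ids(
    protein_groups: Iterable[str],
    reference: Set[str],
    sep: str = ";",  # assumed group separator
    skip_prefixes: Tuple[str, ...] = ("rev_", "decoy_", "cont_"),  # assumed markers
) -> List[str]:
    """Return sorted identifiers that are absent from the reference set."""
    missing = set()
    for group in protein_groups:
        for ident in group.split(sep):
            ident = ident.strip()
            # Decoys and contaminants are never expected in the reference.
            if not ident or ident.lower().startswith(skip_prefixes):
                continue
            if ident not in reference:
                missing.add(ident)
    return sorted(missing)


examples = missing_protein_ids(["P1;P2", "rev_P9", "Cont_X;P3"], {"P1", "P2"})[:MAX_PROTEIN_EXAMPLES]
```

Collecting into a set before sorting reports each offending identifier once, however many rows it appears in.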
- proteobench.validation.checks.check_run_consistency(df: DataFrame, params: Any, input_format: str | None, config: ModuleValidationConfig) → List[ValidationIssue]
Check that the parameter file matches the submitted run, where feasible.
Only software identity can be compared, because ProteoBenchParameters does not expose raw-file, sample, or experiment identifiers. A mismatch in software identity is reported as an error; the unavailable run-level matching is reported as info.
- Parameters:
df (pandas.DataFrame) – The standardized result DataFrame (unused for software identity but kept for signature consistency and future extension).
params (Any) – Parsed parameters (object with software_name / software_version).
input_format (str or None) – The selected software tool used to parse the results.
config (ModuleValidationConfig) – Module validation configuration.
- Returns:
Issues describing software-identity mismatches and the documented limitation on run-level matching.
- Return type:
List[ValidationIssue]