proteobench.validation.config module
Module-level validation configuration.
ModuleValidationConfig collects the small amount of per-module
information the validator needs that is not part of the standardized result
DataFrame or the parsed parameters: the standardized column names, the
protein-group separators, the contaminant flag and decoy prefixes used to skip
non-target identifiers, and the reference FASTA location.
The validation_profile field selects which set of checks the orchestrator
runs. It is the name of a profile registered in
proteobench.validation.profiles. It is resolved in the following order of precedence:
1. an explicit [validation].profile key in the module’s module_settings.toml (the declarative path: adding a new module of an existing category is config-only);
2. inferred from the module’s parser class via the existing MODULE_TO_CLASS registry (ParseSettingsQuant -> "quant_lfq", ParseSettingsDeNovo -> "denovo");
3. the DEFAULT_VALIDATION_PROFILE fallback.
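The precedence described above can be sketched as a small function. This is only an illustration: `resolve_profile` and `PARSER_CLASS_TO_PROFILE` are hypothetical names, and the parser class is passed in by name rather than looked up through the real MODULE_TO_CLASS registry, whose structure is not shown here.

```python
from typing import Mapping, Optional

DEFAULT_VALIDATION_PROFILE = "quant_lfq"

# Hypothetical lookup built from the two mappings named in the text.
PARSER_CLASS_TO_PROFILE = {
    "ParseSettingsQuant": "quant_lfq",
    "ParseSettingsDeNovo": "denovo",
}


def resolve_profile(
    module_settings: Mapping[str, dict],
    parser_class_name: Optional[str],
) -> str:
    """Return the profile name using the documented order of precedence."""
    # 1. An explicit [validation].profile key wins.
    explicit = module_settings.get("validation", {}).get("profile")
    if explicit:
        return explicit
    # 2. Otherwise infer the profile from the parser class name.
    inferred = PARSER_CLASS_TO_PROFILE.get(parser_class_name or "")
    if inferred:
        return inferred
    # 3. Otherwise fall back to the default.
    return DEFAULT_VALIDATION_PROFILE
```

With this sketch, an explicit key overrides the inferred value, so a module parsed by `ParseSettingsQuant` can still opt into another profile through its TOML file.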
A genuinely new category of module is supported by registering a new profile
in profiles.py (or from third-party code) and pointing the module at it via
the TOML key; the orchestrator itself never changes.
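The design idea behind this, a name-to-checks registry that the orchestrator only reads, might look roughly like the sketch below. Every name in it (`register_profile`, `run_profile`, `require_rows`, the `"my_new_category"` profile) is hypothetical; the actual registration API in proteobench.validation.profiles is not described here and may differ.

```python
from typing import Callable, Dict, List

# A check takes a submission (here a plain dict) and returns warning messages.
Check = Callable[[dict], List[str]]

_PROFILES: Dict[str, List[Check]] = {}


def register_profile(name: str, checks: List[Check]) -> None:
    """Register a named list of checks; refuse silent overwrites."""
    if name in _PROFILES:
        raise ValueError(f"profile {name!r} is already registered")
    _PROFILES[name] = list(checks)


def run_profile(name: str, submission: dict) -> List[str]:
    """Orchestrator stand-in: run every check of the selected profile."""
    messages: List[str] = []
    for check in _PROFILES[name]:
        messages.extend(check(submission))
    return messages


def require_rows(submission: dict) -> List[str]:
    """Example check: complain when the submission contains no rows."""
    return [] if submission.get("rows") else ["submission has no rows"]


# A new category only adds a registration; run_profile stays unchanged.
register_profile("my_new_category", [require_rows])
```

Because `run_profile` looks the checks up by name, supporting a new category means adding a registration and pointing the module at it, which is the property the paragraph above describes.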
The reference FASTA is read from an optional [reference_database] section in
the module’s module_settings.toml (beside [species_expected_ratio] and
[general]). Module types whose reference is not a FASTA (e.g. de novo, which
compares against a ground-truth table) simply omit fasta_url.
Example module_settings.toml sections:
[reference_database]
"fasta_url" = "https://proteobench.cubimed.rub.de/fasta/ProteoBenchFASTA_MixedSpecies_HYE.zip"
[validation]
"profile" = "quant_lfq"
# optional mass-tolerance plausibility ceilings (no default; skipped if unset):
# "max_plausible_ppm" = 1000.0
# "max_plausible_dalton" = 10.0
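As a rough illustration of how these sections could be read, the sketch below parses the same TOML text with the standard-library `tomllib` parser (Python 3.11+, with the third-party `tomli` backport as a fallback). The variable names are illustrative, and this is not necessarily how `from_parse_settings()` reads the file.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters
    import tomli as tomllib  # third-party backport with the same API

EXAMPLE_SETTINGS = """
[reference_database]
"fasta_url" = "https://proteobench.cubimed.rub.de/fasta/ProteoBenchFASTA_MixedSpecies_HYE.zip"

[validation]
"profile" = "quant_lfq"
# "max_plausible_ppm" = 1000.0
# "max_plausible_dalton" = 10.0
"""

settings = tomllib.loads(EXAMPLE_SETTINGS)

# Both sections are optional, so default to an empty table when absent.
fasta_url = settings.get("reference_database", {}).get("fasta_url")
validation = settings.get("validation", {})
profile = validation.get("profile")

# Commented-out keys are simply missing, which maps to "check skipped".
max_plausible_ppm = validation.get("max_plausible_ppm")
max_plausible_dalton = validation.get("max_plausible_dalton")
```

Quoted keys such as `"fasta_url"` are valid TOML and parse to the same key as the bare form.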
- proteobench.validation.config.DEFAULT_DECOY_PREFIXES = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##')
Common decoy-identifier prefixes. The ParseSettings configuration marks decoys via a boolean Reverse column rather than an accession prefix, so these defaults provide a tool-agnostic fallback for skipping decoy proteins.
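A prefix-based decoy test of this kind can be sketched as follows. `is_decoy` is a hypothetical helper, and the match is assumed to be case-sensitive; whether the validator normalizes case is not stated here.

```python
from typing import Tuple

DEFAULT_DECOY_PREFIXES: Tuple[str, ...] = (
    "rev_", "rev__", "decoy_", "decoy", "reverse_", "##",
)


def is_decoy(
    accession: str,
    prefixes: Tuple[str, ...] = DEFAULT_DECOY_PREFIXES,
) -> bool:
    """Return True when the accession starts with any decoy prefix."""
    # str.startswith accepts a tuple and tests each prefix in turn.
    return accession.startswith(prefixes)
```

Under these assumptions, `is_decoy("rev_P12345")` is true and `is_decoy("P12345")` is false.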
- proteobench.validation.config.DEFAULT_VALIDATION_PROFILE = 'quant_lfq'
Profile used when none can be resolved from config or the parser class.
- class proteobench.validation.config.ModuleValidationConfig(protein_column: str = 'Proteins', sequence_column: str = 'Sequence', charge_column: str = 'Charge', proforma_column: str = 'proforma', contaminant_column: str = 'contaminant', contaminant_flag: str | None = None, decoy_prefixes: Tuple[str, ...] = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##'), protein_group_separators: Tuple[str, ...] = (';', ', '), fasta_url: str | None = None, fasta_filename: str | None = None, species_flags: Tuple[str, ...] = <factory>, recommended_max_fdr_psm: float | None = 0.01, max_plausible_ppm: float | None = None, max_plausible_dalton: float | None = None, validation_profile: str = 'quant_lfq')
Bases: object
Per-module configuration for submission validation.
- protein_column
Column holding protein identifiers in the standardized DataFrame. Default "Proteins".
- Type: str, optional
- sequence_column
Column holding the (plain) peptide sequence. Default "Sequence".
- Type: str, optional
- proforma_column
Column holding the ProForma modified sequence. Default "proforma".
- Type: str, optional
- contaminant_column
Boolean column flagging contaminant rows. Default "contaminant".
- Type: str, optional
- contaminant_flag
Substring marking contaminant proteins (from the tool parse settings, e.g. "Cont_").
- Type: str, optional
- decoy_prefixes
Prefixes marking decoy proteins. Defaults to DEFAULT_DECOY_PREFIXES.
- protein_group_separators
Separators used to split protein groups. Defaults to DEFAULT_GROUP_SEPARATORS.
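Splitting a protein-group string on several separators can be sketched as below. `split_protein_group` is a hypothetical helper; its default separators are the ones shown in the class signature, `(';', ', ')`, and stripping whitespace and dropping empty pieces are assumptions of this sketch.

```python
from typing import List, Tuple


def split_protein_group(
    group: str,
    separators: Tuple[str, ...] = (";", ", "),
) -> List[str]:
    """Split a protein-group string on every configured separator."""
    parts = [group]
    for sep in separators:
        # Re-split each piece produced so far on the next separator.
        parts = [piece for part in parts for piece in part.split(sep)]
    # Drop surrounding whitespace and any empty pieces.
    return [p.strip() for p in parts if p.strip()]
```

For example, `split_protein_group("P1;P2, P3")` yields three accessions, since both separators are applied in sequence.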
- species_flags
Species names configured for the module (e.g. ("YEAST", "ECOLI", "HUMAN")), derived from the tool’s species mapper. Currently informational.
- recommended_max_fdr_psm
Recommended maximum PSM-level FDR for the benchmark. A parsed FDR above this value produces a warning. Default 0.01 (1%). Set to None to disable the recommendation check.
- Type: float, optional
- max_plausible_ppm
Plausibility ceiling for ppm mass tolerances. A parsed tolerance above this value produces a warning. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.
- Type: float, optional
- max_plausible_dalton
Plausibility ceiling for absolute (Da / Th / amu) mass tolerances, scaled by 1000 for mmu. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.
- Type: float, optional
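The three thresholds above could be applied along the lines of the following sketch. The function names, the unit strings, and the message wording are all assumptions made for illustration; only the comparison rules (warn above the ceiling, skip when the ceiling is None, multiply the dalton ceiling by 1000 for mmu) come from the descriptions above.

```python
from typing import List, Optional


def fdr_warnings(
    fdr: float, recommended_max_fdr_psm: Optional[float] = 0.01
) -> List[str]:
    """Warn when a parsed PSM-level FDR exceeds the recommendation."""
    if recommended_max_fdr_psm is None:  # check disabled
        return []
    if fdr > recommended_max_fdr_psm:
        return [f"FDR {fdr} exceeds recommended {recommended_max_fdr_psm}"]
    return []


def tolerance_warnings(
    value: float,
    unit: str,
    max_plausible_ppm: Optional[float] = None,
    max_plausible_dalton: Optional[float] = None,
) -> List[str]:
    """Warn when a parsed mass tolerance exceeds its plausibility ceiling."""
    unit = unit.strip().lower()
    if unit == "ppm":
        ceiling = max_plausible_ppm
    elif unit in ("da", "th", "amu"):
        ceiling = max_plausible_dalton
    elif unit == "mmu" and max_plausible_dalton is not None:
        # 1 Da = 1000 mmu, so the dalton ceiling is scaled by 1000.
        ceiling = max_plausible_dalton * 1000
    else:
        ceiling = None
    if ceiling is None:  # no ceiling configured: the check is skipped
        return []
    if value > ceiling:
        return [f"tolerance {value} {unit} exceeds plausible ceiling {ceiling}"]
    return []
```

With `max_plausible_dalton=10.0`, a tolerance of 5000 mmu passes (the scaled ceiling is 10000 mmu), while 20 Da produces a warning.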
- validation_profile
Name of the registered profile whose checks the orchestrator runs. Set automatically by from_parse_settings(); defaults to DEFAULT_VALIDATION_PROFILE for direct construction so that the existing quant behaviour is preserved.
- Type: str, optional
- classmethod from_parse_settings(parse_settings_dir: str, module_id: str, input_format: str) -> ModuleValidationConfig
Build a config from the existing parse settings of a module/tool.
This reuses ParseSettingsBuilder to read the contaminant flag and species flags for the selected tool, reads the optional [reference_database] and [validation] sections from the module’s module_settings.toml, and resolves the validation profile.
- Parameters: parse_settings_dir (str), module_id (str), input_format (str)
- Returns: Configuration populated from the parse settings. Falls back to the defaults for any value that cannot be read.
- Return type: ModuleValidationConfig