proteobench.validation.config module

Module-level validation configuration.

ModuleValidationConfig collects the small amount of per-module information the validator needs that is not part of the standardized result DataFrame or the parsed parameters: the standardized column names, the protein-group separators, the contaminant flag and decoy prefixes used to skip non-target identifiers, and the reference FASTA location.

The validation_profile field selects which set of checks the orchestrator runs. Its value is the name of a profile registered in proteobench.validation.profiles, resolved from the first of the following sources that applies (in order of precedence):

  1. an explicit [validation].profile key in the module’s module_settings.toml (the declarative path: adding a new module of an existing category is config-only);

  2. inferred from the module’s parser class via the existing MODULE_TO_CLASS registry (ParseSettingsQuant -> "quant_lfq", ParseSettingsDeNovo -> "denovo");

  3. the DEFAULT_VALIDATION_PROFILE fallback.

A genuinely new category of module is supported by registering a new profile in profiles.py (or from third-party code) and pointing the module at it via the TOML key; the orchestrator itself never changes.
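The precedence above can be sketched as a small lookup chain. This is a minimal illustration, not the library's code: the function resolve_profile and the dict PARSER_CLASS_TO_PROFILE are hypothetical names, and the real inference goes through the MODULE_TO_CLASS registry rather than a class-name string.

```python
from typing import Mapping, Optional

DEFAULT_VALIDATION_PROFILE = "quant_lfq"

# Hypothetical stand-in for the parser-class inference step. The two
# mappings are taken from the text above; the dict itself is illustrative.
PARSER_CLASS_TO_PROFILE = {
    "ParseSettingsQuant": "quant_lfq",
    "ParseSettingsDeNovo": "denovo",
}


def resolve_profile(settings: Mapping[str, dict], parser_class_name: Optional[str]) -> str:
    """Return the profile name using the three-step precedence."""
    # 1. An explicit [validation].profile key wins.
    explicit = settings.get("validation", {}).get("profile")
    if explicit:
        return explicit
    # 2. Otherwise infer the profile from the parser class, if it is known.
    inferred = PARSER_CLASS_TO_PROFILE.get(parser_class_name or "")
    if inferred:
        return inferred
    # 3. Fall back to the default profile.
    return DEFAULT_VALIDATION_PROFILE
```

Here settings stands for the parsed contents of module_settings.toml; an explicit key overrides the inferred profile even when the parser class is known.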

The reference FASTA is read from an optional [reference_database] section in the module’s module_settings.toml (beside [species_expected_ratio] and [general]). Module types whose reference is not a FASTA (e.g. de novo, which compares against a ground-truth table) simply omit fasta_url.

Example module_settings.toml sections:

[reference_database]
"fasta_url" = "https://proteobench.cubimed.rub.de/fasta/ProteoBenchFASTA_MixedSpecies_HYE.zip"

[validation]
"profile" = "quant_lfq"
# optional mass-tolerance plausibility ceilings (no default; skipped if unset):
# "max_plausible_ppm" = 1000.0
# "max_plausible_dalton" = 10.0

proteobench.validation.config.DEFAULT_DECOY_PREFIXES = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##')

Common decoy-identifier prefixes. The ParseSettings configuration marks decoys via a boolean Reverse column rather than an accession prefix, so these defaults provide a tool-agnostic fallback for skipping decoy proteins.
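A prefix-based decoy test of this kind could look like the sketch below. The helper is_decoy is a hypothetical name, and case-sensitive matching is an assumption; the source does not say how the validator compares prefixes.

```python
from typing import Tuple

DEFAULT_DECOY_PREFIXES = ("rev_", "rev__", "decoy_", "decoy", "reverse_", "##")


def is_decoy(accession: str, prefixes: Tuple[str, ...] = DEFAULT_DECOY_PREFIXES) -> bool:
    """Return True if the accession starts with any decoy prefix."""
    # str.startswith accepts a tuple and checks each prefix in turn.
    # Assumption: matching is case-sensitive.
    return accession.startswith(prefixes)
```

For example, is_decoy("rev_P12345") is true, while is_decoy("P12345") is false.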

proteobench.validation.config.DEFAULT_VALIDATION_PROFILE = 'quant_lfq'

Profile used when none can be resolved from config or the parser class.

class proteobench.validation.config.ModuleValidationConfig(protein_column: str = 'Proteins', sequence_column: str = 'Sequence', charge_column: str = 'Charge', proforma_column: str = 'proforma', contaminant_column: str = 'contaminant', contaminant_flag: str | None = None, decoy_prefixes: Tuple[str, ...] = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##'), protein_group_separators: Tuple[str, ...] = (';', ', '), fasta_url: str | None = None, fasta_filename: str | None = None, species_flags: Tuple[str, ...] = <factory>, recommended_max_fdr_psm: float | None = 0.01, max_plausible_ppm: float | None = None, max_plausible_dalton: float | None = None, validation_profile: str = 'quant_lfq')

Bases: object

Per-module configuration for submission validation.

protein_column

Column holding protein identifiers in the standardized DataFrame. Default "Proteins".

Type:

str, optional

sequence_column

Column holding the (plain) peptide sequence. Default "Sequence".

Type:

str, optional

charge_column

Column holding the precursor charge. Default "Charge".

Type:

str, optional

proforma_column

Column holding the ProForma modified sequence. Default "proforma".

Type:

str, optional

contaminant_column

Boolean column flagging contaminant rows. Default "contaminant".

Type:

str, optional

contaminant_flag

Substring marking contaminant proteins (from the tool parse settings, e.g. "Cont_").

Type:

str, optional

decoy_prefixes

Prefixes marking decoy proteins. Defaults to DEFAULT_DECOY_PREFIXES.

Type:

tuple of str, optional

protein_group_separators

Separators used to split protein groups. Defaults to DEFAULT_GROUP_SEPARATORS.

Type:

tuple of str, optional
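Splitting a protein-group string on several separators can be sketched as follows. The helper split_protein_group is a hypothetical name, and the separators are passed explicitly because the contents of DEFAULT_GROUP_SEPARATORS are not shown here. Stripping whitespace from each part is an assumption made so that "," and ", " behave the same.

```python
import re
from typing import List, Tuple


def split_protein_group(group: str, separators: Tuple[str, ...] = (";", ",")) -> List[str]:
    """Split a protein-group string into individual identifiers."""
    if not separators:
        # Nothing to split on: treat the whole string as one identifier.
        return [group.strip()] if group.strip() else []
    # Escape each separator so characters such as "|" are matched literally.
    pattern = "|".join(re.escape(sep) for sep in separators)
    # Drop empty fragments and surrounding whitespace (assumption).
    return [part.strip() for part in re.split(pattern, group) if part.strip()]
```

With the separators above, "P1;P2, P3" yields the three identifiers P1, P2 and P3.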

fasta_url

URL of the reference FASTA / zip / gzip for the module.

Type:

str, optional

fasta_filename

Preferred FASTA member name when the resource is an archive.

Type:

str, optional

species_flags

Species names configured for the module (e.g. ("YEAST", "ECOLI", "HUMAN")), derived from the tool’s species mapper. Currently informational.

Type:

tuple of str, optional

recommended_max_fdr_psm

Recommended maximum PSM-level FDR for the benchmark. A parsed FDR above this value produces a warning. Default 0.01 (1%). Set to None to disable the recommendation check.

Type:

float, optional
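The recommendation check described for recommended_max_fdr_psm amounts to a guarded comparison. The sketch below uses a hypothetical helper, fdr_warning, with an illustrative message text; treating a missing parsed FDR as "nothing to report" is an assumption.

```python
from typing import Optional


def fdr_warning(parsed_fdr: Optional[float], recommended_max: Optional[float] = 0.01) -> Optional[str]:
    """Return a warning message if the parsed FDR exceeds the recommendation."""
    # A recommended maximum of None disables the check.
    # Assumption: an unparsed (None) FDR is skipped rather than reported.
    if recommended_max is None or parsed_fdr is None:
        return None
    if parsed_fdr > recommended_max:
        return f"PSM-level FDR {parsed_fdr:g} exceeds the recommended maximum {recommended_max:g}"
    return None
```

A value equal to the recommended maximum passes; only values strictly above it produce a warning.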

max_plausible_ppm

Plausibility ceiling for ppm mass tolerances. A parsed tolerance above this value produces a warning. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.

Type:

float, optional

max_plausible_dalton

Plausibility ceiling for absolute (Da / Th / amu) mass tolerances; the ceiling is multiplied by 1000 when the tolerance is given in mmu. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.

Type:

float, optional
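The two plausibility ceilings and the mmu scaling can be sketched in one function. The helper tolerance_is_implausible is a hypothetical name, and three choices are assumptions: unit strings are compared case-insensitively, unknown units are skipped, and the absolute value of the tolerance is compared against the ceiling.

```python
from typing import Optional


def tolerance_is_implausible(
    value: float,
    unit: str,
    max_plausible_ppm: Optional[float] = None,
    max_plausible_dalton: Optional[float] = None,
) -> bool:
    """Return True if a parsed mass tolerance exceeds its plausibility ceiling."""
    unit = unit.strip().lower()
    if unit == "ppm":
        ceiling = max_plausible_ppm
    elif unit in ("da", "th", "amu"):
        ceiling = max_plausible_dalton
    elif unit == "mmu":
        # 1 Da = 1000 mmu, so the Dalton ceiling is scaled by 1000.
        ceiling = None if max_plausible_dalton is None else max_plausible_dalton * 1000
    else:
        ceiling = None  # assumption: unknown units are not checked
    # An unset ceiling means the check is skipped.
    if ceiling is None:
        return False
    return abs(value) > ceiling
```

With max_plausible_dalton set to 10.0, a tolerance of 5000 mmu is still plausible, because the effective ceiling for mmu is 10000.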

validation_profile

Name of the registered profile whose checks the orchestrator runs. Set automatically by from_parse_settings(); defaults to DEFAULT_VALIDATION_PROFILE for direct construction so that the existing quant behaviour is preserved.

Type:

str, optional

charge_column: str = 'Charge'
contaminant_column: str = 'contaminant'
contaminant_flag: str | None = None
decoy_prefixes: Tuple[str, ...] = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##')
fasta_filename: str | None = None
fasta_url: str | None = None
classmethod from_parse_settings(parse_settings_dir: str, module_id: str, input_format: str) -> ModuleValidationConfig

Build a config from the existing parse settings of a module/tool.

This reuses ParseSettingsBuilder to read the contaminant flag and species flags for the selected tool, reads the optional [reference_database] and [validation] sections from the module’s module_settings.toml, and resolves the validation profile.

Parameters:
  • parse_settings_dir (str) – Directory containing the module’s parse settings (the module’s parse_settings_dir attribute).

  • module_id (str) – The module identifier (e.g. "quant_lfq_DDA_ion_QExactive").

  • input_format (str) – The selected software tool (e.g. "MaxQuant").

Returns:

Configuration populated from the parse settings. Falls back to the defaults for any value that cannot be read.

Return type:

ModuleValidationConfig

max_plausible_dalton: float | None = None
max_plausible_ppm: float | None = None
proforma_column: str = 'proforma'
protein_column: str = 'Proteins'
protein_group_separators: Tuple[str, ...] = (';', ',')
static read_reference_database(parse_settings_dir: str) -> dict

Read the [reference_database] section of a module’s settings.

Parameters:

parse_settings_dir (str) – Directory containing the module’s module_settings.toml.

Returns:

The [reference_database] table, or an empty dict if absent.

Return type:

dict

recommended_max_fdr_psm: float | None = 0.01
sequence_column: str = 'Sequence'
species_flags: Tuple[str, ...]
validation_profile: str = 'quant_lfq'