proteobench.validation package
Submission-validation layer for ProteoBench.
This package validates uploaded benchmark submissions before a public datapoint
is created. It checks that the standardized results and parsed parameters are
internally consistent and consistent with the module reference database, and
returns a structured ValidationReport (overall status plus per-issue
severity, machine-readable code, message, field, observed/expected values, and
example offending rows).
The layer is generic and registry-driven. Each module maps to a validation
profile (a named, ordered set of checks). Adding a new module of an existing
category requires only configuration; adding a new category requires only
registering a new profile via register_profile().
Typical use:

    from proteobench.validation import validate_submission, FastaReference, ModuleValidationConfig

    config = ModuleValidationConfig.from_parse_settings(parse_settings_dir, module_id, input_format)
    fasta = FastaReference.from_url(config.fasta_url, member_filename=config.fasta_filename)
    report = validate_submission(standard_df, parameters=params, fasta=fasta, config=config,
                                 input_format=input_format)
    if report.has_errors:
        ...  # block public submission

Registering a custom profile:

    from proteobench.validation import Check, ValidationProfile, register_profile

    register_profile(ValidationProfile(
        name="my_module",
        checks=[Check("my_check", my_check_func, "what it does")],
    ))
- class proteobench.validation.Check(name: str, func: Callable[[ValidationContext], List[ValidationIssue]], description: str = '')
Bases:
object
A single, named validation check.
- func
A function ctx -> list[ValidationIssue].
- Type:
callable
- func: Callable[[ValidationContext], List[ValidationIssue]]
- run(ctx: ValidationContext) -> List[ValidationIssue]
Execute the check against a validation context.
- Parameters:
ctx (ValidationContext) – The inputs available to the check.
- Returns:
Issues produced by the check (possibly empty).
- Return type:
list of ValidationIssue
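As a rough illustration of the ctx -> list[ValidationIssue] contract, the sketch below uses stand-in classes rather than the real proteobench ones. StubIssue, StubContext, StubCheck, the charge_positive function, and the issue code CHARGE_NOT_POSITIVE are all hypothetical, and plain dict rows stand in for the standardized DataFrame.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class StubIssue:
    """Stand-in for ValidationIssue (only a few of its fields)."""
    code: str
    severity: str
    message: str
    check: str
    examples: List[Any] = field(default_factory=list)


@dataclass
class StubContext:
    """Stand-in for ValidationContext; rows replace the DataFrame."""
    rows: List[Dict[str, Any]]
    charge_column: str = "Charge"


@dataclass
class StubCheck:
    """Stand-in for Check: a name, a function, and a description."""
    name: str
    func: Callable[[StubContext], List[StubIssue]]
    description: str = ""

    def run(self, ctx: StubContext) -> List[StubIssue]:
        # Delegate to the wrapped function; an empty list means no findings.
        return self.func(ctx)


def charge_positive(ctx: StubContext) -> List[StubIssue]:
    """Hypothetical check: every row must carry a positive charge."""
    bad = [row for row in ctx.rows if row.get(ctx.charge_column, 0) <= 0]
    if not bad:
        return []
    return [StubIssue(
        code="CHARGE_NOT_POSITIVE",  # hypothetical issue code
        severity="error",
        message=f"{len(bad)} row(s) have a non-positive charge",
        check="charge_positive",
        examples=bad[:3],  # keep only a few offending rows
    )]


check = StubCheck("charge_positive", charge_positive, "charges must be > 0")
clean = check.run(StubContext(rows=[{"Charge": 2}, {"Charge": 3}]))
flagged = check.run(StubContext(rows=[{"Charge": 2}, {"Charge": 0}]))
```

Returning an empty list for the clean case, rather than None, keeps the caller free to extend a report with the result unconditionally.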
- class proteobench.validation.FastaReference(identifiers: Iterable[str] | None = None)
Bases:
object
Set of protein identifiers derived from a FASTA / reference database.
- Parameters:
identifiers (iterable of str, optional) – Pre-computed identifiers to seed the reference with.
- contains_any(identifiers: Iterable[str]) -> bool
Test whether any of several identifiers is present.
- classmethod from_bytes(data: bytes, source_name: str | None = None, member_filename: str | None = None, encoding: str = 'utf-8') -> FastaReference
Build a reference from in-memory bytes (plain, gzip, or zip).
- Parameters:
data (bytes) – Raw file content.
source_name (str, optional) – Original file name or URL, used to detect the compression type.
member_filename (str, optional) – Preferred FASTA member name when data is a ZIP archive.
encoding (str, optional) – Text encoding used to decode the FASTA content. Default "utf-8".
- Returns:
Reference indexing every header’s identifiers.
- Return type:
FastaReference
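The plain / gzip / zip handling described for from_bytes can be pictured roughly as follows. This is a minimal sketch using only the standard library, not the package's actual code: decode_fasta_bytes is a hypothetical helper, and detecting the compression from the file-name suffix and falling back to the first archive member are assumptions.

```python
import gzip
import io
import zipfile
from typing import Optional


def decode_fasta_bytes(
    data: bytes,
    source_name: Optional[str] = None,
    member_filename: Optional[str] = None,
    encoding: str = "utf-8",
) -> str:
    """Return FASTA text from plain, gzip, or zip bytes (hypothetical helper)."""
    name = (source_name or "").lower()
    if name.endswith(".gz"):
        raw = gzip.decompress(data)
    elif name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            members = archive.namelist()
            # Assumption: prefer the requested member, else take the first one.
            chosen = member_filename if member_filename in members else members[0]
            raw = archive.read(chosen)
    else:
        # No recognised suffix: treat the bytes as uncompressed text.
        raw = data
    return raw.decode(encoding)


fasta_text = ">sp|P00330|ADH1_YEAST\nMSIPETQK\n"
gz_bytes = gzip.compress(fasta_text.encode("utf-8"))

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("README.txt", "not a FASTA file")
    archive.writestr("db.fasta", fasta_text)
zip_bytes = zip_buffer.getvalue()
```

A suffix test like this would miss URLs that carry a query string after the file name, so a real implementation may need a sturdier detection step.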
- classmethod from_identifiers(identifiers: Iterable[str]) -> FastaReference
Build a reference directly from an iterable of identifiers.
- Parameters:
identifiers (iterable of str) – Identifiers to index (e.g. accessions extracted elsewhere).
- Returns:
Reference indexing the supplied identifiers.
- Return type:
FastaReference
- classmethod from_path(path: str, member_filename: str | None = None) -> FastaReference
Build a reference from a local file path (plain, .gz, or .zip).
- Parameters:
- Returns:
Reference indexing every header’s identifiers.
- Return type:
FastaReference
- classmethod from_text(text: str) -> FastaReference
Build a reference from raw FASTA text.
- Parameters:
text (str) – FASTA content (one or more records).
- Returns:
Reference indexing every header’s identifiers.
- Return type:
FastaReference
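To make "indexing every header's identifiers" concrete, here is a small sketch of one way such an index could be built from FASTA text. header_identifiers and the free function contains_any are hypothetical stand-ins, and the rule of also indexing the accession and entry name of a UniProt-style "db|ACCESSION|ENTRY_NAME" token is an assumption, not documented behaviour.

```python
from typing import Iterable, Set


def header_identifiers(text: str) -> Set[str]:
    """Collect identifiers from every '>' header line of FASTA text."""
    identifiers: Set[str] = set()
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines carry no identifiers
        parts = line[1:].split()
        if not parts:
            continue  # a bare '>' has nothing to index
        token = parts[0]
        identifiers.add(token)
        # Assumed UniProt layout "db|ACCESSION|ENTRY_NAME": index both parts too.
        pieces = token.split("|")
        if len(pieces) == 3:
            identifiers.update(pieces[1:])
    return identifiers


def contains_any(reference: Set[str], identifiers: Iterable[str]) -> bool:
    """True if at least one identifier is present in the reference set."""
    return any(identifier in reference for identifier in identifiers)


fasta_text = (
    ">sp|P00330|ADH1_YEAST Alcohol dehydrogenase 1\n"
    "MSIPETQK\n"
    ">custom_protein\n"
    "MKV\n"
)
reference = header_identifiers(fasta_text)
```

Storing several spellings of the same protein lets a lookup succeed whether a tool reports the full header token, the accession, or the entry name.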
- classmethod from_url(url: str, member_filename: str | None = None, timeout: int = 60) -> FastaReference
Build a reference by downloading a FASTA / zip / gzip from a URL.
requests is imported lazily so that importing this module does not require network access.
- Parameters:
- Returns:
Reference indexing every header’s identifiers.
- Return type:
FastaReference
- class proteobench.validation.ModuleValidationConfig(protein_column: str = 'Proteins', sequence_column: str = 'Sequence', charge_column: str = 'Charge', proforma_column: str = 'proforma', contaminant_column: str = 'contaminant', contaminant_flag: str | None = None, decoy_prefixes: Tuple[str, ...] = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##'), protein_group_separators: Tuple[str, ...] = (';', ', '), fasta_url: str | None = None, fasta_filename: str | None = None, species_flags: Tuple[str, ...] = <factory>, recommended_max_fdr_psm: float | None = 0.01, max_plausible_ppm: float | None = None, max_plausible_dalton: float | None = None, validation_profile: str = 'quant_lfq')
Bases:
object
Per-module configuration for submission validation.
- protein_column
Column holding protein identifiers in the standardized DataFrame. Default "Proteins".
- Type:
str, optional
- sequence_column
Column holding the (plain) peptide sequence. Default "Sequence".
- Type:
str, optional
- proforma_column
Column holding the ProForma modified sequence. Default "proforma".
- Type:
str, optional
- contaminant_column
Boolean column flagging contaminant rows. Default "contaminant".
- Type:
str, optional
- contaminant_flag
Substring marking contaminant proteins (from the tool parse settings, e.g. "Cont_").
- Type:
str, optional
- decoy_prefixes
Prefixes marking decoy proteins. Defaults to DEFAULT_DECOY_PREFIXES.
- protein_group_separators
Separators used to split protein groups. Defaults to DEFAULT_GROUP_SEPARATORS.
- species_flags
Species names configured for the module (e.g. ("YEAST", "ECOLI", "HUMAN")), derived from the tool’s species mapper. Currently informational.
- recommended_max_fdr_psm
Recommended maximum PSM-level FDR for the benchmark. A parsed FDR above this value produces a warning. Default 0.01 (1%). Set to None to disable the recommendation check.
- Type:
float, optional
- max_plausible_ppm
Plausibility ceiling for ppm mass tolerances. A parsed tolerance above this value produces a warning. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.
- Type:
float, optional
- max_plausible_dalton
Plausibility ceiling for absolute (Da / Th / amu) mass tolerances, scaled by 1000 for mmu. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.
- Type:
float, optional
- validation_profile
Name of the registered profile whose checks the orchestrator runs. Set automatically by from_parse_settings(); defaults to DEFAULT_VALIDATION_PROFILE for direct construction so that the existing quant behaviour is preserved.
- Type:
str, optional
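The decoy_prefixes and protein_group_separators attributes above suggest a split-then-match step. The sketch below shows one plausible reading, using the default tuples from the class signature. split_protein_group and is_decoy are hypothetical helpers, and case-insensitive prefix matching is an assumption.

```python
from typing import List, Tuple

# Default values copied from the ModuleValidationConfig signature.
DECOY_PREFIXES: Tuple[str, ...] = ("rev_", "rev__", "decoy_", "decoy", "reverse_", "##")
GROUP_SEPARATORS: Tuple[str, ...] = (";", ", ")


def split_protein_group(
    group: str, separators: Tuple[str, ...] = GROUP_SEPARATORS
) -> List[str]:
    """Split a protein-group string on every configured separator."""
    parts = [group]
    for separator in separators:
        parts = [piece for part in parts for piece in part.split(separator)]
    # Drop empty fragments and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


def is_decoy(protein: str, prefixes: Tuple[str, ...] = DECOY_PREFIXES) -> bool:
    """Assumption: a protein is a decoy if it starts with any prefix, ignoring case."""
    lowered = tuple(prefix.lower() for prefix in prefixes)
    return protein.lower().startswith(lowered)


members = split_protein_group("P00330;rev_P12345, Cont_P02769")
decoys = [member for member in members if is_decoy(member)]
```

Applying the separators one after another means a group that mixes ";" and ", " is still split into individual members.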
- classmethod from_parse_settings(parse_settings_dir: str, module_id: str, input_format: str) -> ModuleValidationConfig
Build a config from the existing parse settings of a module/tool.
This reuses ParseSettingsBuilder to read the contaminant flag and species flags for the selected tool, reads the optional [reference_database] and [validation] sections from the module’s module_settings.toml, and resolves the validation profile.
- Parameters:
- Returns:
Configuration populated from the parse settings. Falls back to the defaults for any value that cannot be read.
- Return type:
ModuleValidationConfig
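The max_plausible_ppm and max_plausible_dalton ceilings that may come from the [validation] section can be read as a simple unit-aware comparison. The following is a sketch under assumptions: tolerance_is_implausible is a hypothetical helper, the unit spellings it accepts are guesses, and the ceiling values used in the example calls are made up for illustration.

```python
from typing import Optional


def tolerance_is_implausible(
    value: float,
    unit: str,
    max_plausible_ppm: Optional[float] = None,
    max_plausible_dalton: Optional[float] = None,
) -> Optional[bool]:
    """Return True/False for a plausibility verdict, or None when the check is skipped."""
    unit = unit.strip().lower()
    if unit == "ppm":
        ceiling = max_plausible_ppm
    elif unit in {"da", "th", "amu"}:
        ceiling = max_plausible_dalton
    elif unit == "mmu":
        # The dalton ceiling is scaled by 1000 for milli-mass units.
        ceiling = None if max_plausible_dalton is None else max_plausible_dalton * 1000
    else:
        return None  # unknown unit: nothing to compare against
    if ceiling is None:
        return None  # ceiling unset: the check is skipped
    return value > ceiling


# Hypothetical ceilings, chosen only for this example.
ppm_ok = tolerance_is_implausible(20, "ppm", max_plausible_ppm=100)
ppm_high = tolerance_is_implausible(500, "ppm", max_plausible_ppm=100)
mmu_ok = tolerance_is_implausible(50, "mmu", max_plausible_dalton=0.5)
skipped = tolerance_is_implausible(20, "ppm")
```

Returning None for the skipped case keeps "not checked" distinct from "checked and plausible", which matches the documented behaviour of skipping the check when no ceiling is set.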
- class proteobench.validation.Severity(value)
Severity level of a validation issue.
Severity controls only display prominence and inclusion in the pull-request summary; it does not gate the Streamlit submission flow (no severity blocks submission). It also drives the optional programmatic ValidationReport.raise_if_errors() path.
- ERROR = 'error'
- INFO = 'info'
- WARNING = 'warning'
- exception proteobench.validation.SubmissionValidationError(report: ValidationReport)
Bases:
Exception
Raised when a submission fails validation.
The originating ValidationReport is attached as the report attribute so callers can inspect every issue.
- Parameters:
report (ValidationReport) – The validation report that triggered the error.
- class proteobench.validation.ValidationContext(standard_df: DataFrame, parameters: Any = None, config: ModuleValidationConfig = <factory>, fasta: FastaReference | None = None, input_format: str | None = None, reference: Any = None, extras: Dict[str, Any] = <factory>)
Bases:
object
Inputs available to a validation check.
- standard_df
The standardized result DataFrame produced by the module parser.
- Type:
pandas.DataFrame
- parameters
Parsed parameters (a ProteoBenchParameters or any object with the same attributes). None when no parameter file was provided.
- Type:
Any, optional
- config
Module validation configuration (column names, flags, FASTA location, resolved profile).
- Type:
ModuleValidationConfig
- fasta
Reference protein identifiers, for profiles that validate against a sequence database. None when unavailable or not applicable.
- Type:
FastaReference, optional
- reference
Generic reference object for profiles whose reference is not a FASTA (for example a de novo ground-truth table). None when unused.
- Type:
Any, optional
- config: ModuleValidationConfig
- fasta: FastaReference | None = None
- class proteobench.validation.ValidationIssue(code: str, severity: Severity, message: str, check: str, field: str | None = None, observed: Any = None, expected: Any = None, examples: List[Any] = <factory>)
Bases:
object
A single validation finding.
- observed
Observed value (or a short summary of it).
- Type:
Any, optional
- expected
Expected value or allowed range, where applicable.
- Type:
Any, optional
- class proteobench.validation.ValidationProfile(name: str, checks: List[Check] = <factory>, description: str = '')
Bases:
object
An ordered set of checks that applies to one category of module.
- class proteobench.validation.ValidationReport(issues: List[ValidationIssue] = <factory>)
Bases:
object
Collection of validation issues with overall status helpers.
- issues
Issues collected during validation.
- Type:
list of ValidationIssue
- add(code: str, severity: Severity, message: str, check: str, field: str | None = None, observed: Any = None, expected: Any = None, examples: List[Any] | None = None) -> ValidationReport
Append a new issue to the report.
- Parameters:
code (str) – Machine-readable issue code.
severity (Severity) – Severity of the issue.
message (str) – Human-readable description.
check (str) – Name of the originating check.
field (str, optional) – Relevant field, file, or column name.
observed (Any, optional) – Observed value.
expected (Any, optional) – Expected value or allowed range.
examples (list, optional) – Example offending rows or identifiers.
- Returns:
The report itself, to allow chaining.
- Return type:
ValidationReport
- add_error(code: str, message: str, check: str, **kwargs: Any) -> ValidationReport
Append an ERROR issue.
- Parameters:
- Returns:
The report itself, to allow chaining.
- Return type:
ValidationReport
- add_info(code: str, message: str, check: str, **kwargs: Any) -> ValidationReport
Append an INFO issue.
- Parameters:
- Returns:
The report itself, to allow chaining.
- Return type:
ValidationReport
- add_warning(code: str, message: str, check: str, **kwargs: Any) -> ValidationReport
Append a WARNING issue.
- Parameters:
- Returns:
The report itself, to allow chaining.
- Return type:
ValidationReport
- property errors: List[ValidationIssue]
Return all ERROR issues.
- Returns:
The error-level issues.
- Return type:
list of ValidationIssue
- extend(issues: List[ValidationIssue]) -> ValidationReport
Append several issues at once.
- Parameters:
issues (list of ValidationIssue) – Issues to add.
- Returns:
The report itself, to allow chaining.
- Return type:
ValidationReport
- property has_errors: bool
Whether the report contains any ERROR issue.
- Returns:
True if at least one error is present.
- Return type:
bool
- property infos: List[ValidationIssue]
Return all INFO issues.
- Returns:
The info-level issues.
- Return type:
list of ValidationIssue
- issues: List[ValidationIssue]
- property passed: bool
Overall pass status (no ERROR issues).
This is informational only: the Streamlit submission flow does not gate on it (submission is never blocked). It is used for display and by the optional raise_if_errors() path.
- Returns:
True when there are no ERROR issues (warnings allowed).
- Return type:
bool
- raise_if_errors() -> None
Raise SubmissionValidationError if any error issue is present.
- Raises:
SubmissionValidationError – If the report contains at least one ERROR issue.
- summary(include_info: bool = False) -> str
Build a compact Markdown summary of the report.
Useful for embedding the findings into pull-request text or logs. The wording is neutral: submission validation does not block submission, it only surfaces points for the submitter and reviewers to consider.
- to_dict() -> Dict[str, Any]
Convert the report to a JSON-serialisable dictionary.
- Returns:
Dictionary with overall status and the list of issues.
- Return type:
dict
- property warnings: List[ValidationIssue]
Return all WARNING issues.
- Returns:
The warning-level issues.
- Return type:
list of ValidationIssue
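The chaining, status, and raise-on-error behaviour described for ValidationReport can be modelled in a few lines. The classes below are stand-ins written for illustration only (StubIssue, StubReport, StubValidationError, and the issue codes are hypothetical); they mirror the documented semantics but are not the package's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubIssue:
    code: str
    severity: str  # "error", "warning", or "info"
    message: str
    check: str


class StubValidationError(Exception):
    """Stand-in for SubmissionValidationError; keeps the report attached."""

    def __init__(self, report: "StubReport") -> None:
        super().__init__(f"{len(report.errors)} validation error(s)")
        self.report = report


@dataclass
class StubReport:
    issues: List[StubIssue] = field(default_factory=list)

    def add(self, code: str, severity: str, message: str, check: str) -> "StubReport":
        self.issues.append(StubIssue(code, severity, message, check))
        return self  # returning self is what allows chaining

    @property
    def errors(self) -> List[StubIssue]:
        return [issue for issue in self.issues if issue.severity == "error"]

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)

    @property
    def passed(self) -> bool:
        # Warnings and infos do not affect the pass status.
        return not self.has_errors

    def raise_if_errors(self) -> None:
        if self.has_errors:
            raise StubValidationError(self)


report = StubReport().add("FDR_ABOVE_RECOMMENDED", "warning", "PSM FDR above 1%", "fdr")
passed_with_warning = report.passed

report.add("UNKNOWN_PROTEIN", "error", "protein not in reference", "protein_ids")
raised: Optional[StubValidationError] = None
try:
    report.raise_if_errors()
except StubValidationError as exc:
    raised = exc
```

Attaching the report to the exception lets a programmatic caller list every issue after catching the error, instead of parsing the message.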
- proteobench.validation.available_profiles() -> List[str]
List the names of all registered profiles.
- proteobench.validation.get_profile(name: str) -> ValidationProfile | None
Look up a registered profile by name.
- Parameters:
name (str) – Profile name.
- Returns:
The registered profile, or None if no profile has that name (or if name is not a string).
- Return type:
ValidationProfile or None
- proteobench.validation.register_profile(profile: ValidationProfile, overwrite: bool = False) -> None
Register a validation profile under its name.
- Parameters:
profile (ValidationProfile) – The profile to register.
overwrite (bool, optional) – If False (default), registering a name that already exists raises ValueError. Set to True to replace an existing profile.
- Raises:
ValueError – If a profile with the same name is already registered and overwrite is False.
- proteobench.validation.unregister_profile(name: str) -> None
Remove a profile from the registry if present.
- Parameters:
name (str) – Name of the profile to remove.
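Taken together, the four registry functions behave like a guarded dictionary. The sketch below models that behaviour with a module-level dict. The names register, lookup, unregister, and available are hypothetical stand-ins, and a profile is reduced to an ordered list of check names for brevity.

```python
from typing import Dict, List, Optional

# Stand-in registry: profile name -> ordered list of check names.
_REGISTRY: Dict[str, List[str]] = {}


def register(name: str, checks: List[str], overwrite: bool = False) -> None:
    """Add a profile; refuse to replace an existing one unless overwrite=True."""
    if name in _REGISTRY and not overwrite:
        raise ValueError(f"profile {name!r} is already registered")
    _REGISTRY[name] = list(checks)


def lookup(name: object) -> Optional[List[str]]:
    """Return the profile, or None for unknown or non-string names."""
    if not isinstance(name, str):
        return None
    return _REGISTRY.get(name)


def unregister(name: str) -> None:
    """Remove a profile if present; silently do nothing otherwise."""
    _REGISTRY.pop(name, None)


def available() -> List[str]:
    return list(_REGISTRY)


register("my_module", ["schema", "protein_ids"])
try:
    register("my_module", ["other"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

register("my_module", ["other"], overwrite=True)
```

Requiring an explicit overwrite=True makes accidental replacement of a shared profile fail loudly, while tests can still swap a profile in and out.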
- proteobench.validation.validate_submission(standard_df: DataFrame, parameters: Any = None, fasta: FastaReference | None = None, config: ModuleValidationConfig | None = None, input_format: str | None = None, profile: str | None = None) -> ValidationReport
Validate a benchmark submission and return a structured report.
The set of checks run is determined by the validation profile, resolved from (in order): the explicit profile argument, config.validation_profile, or the default. Each check is fault-tolerant: a check that raises an unexpected exception is converted to a warning so that validation itself never crashes the submission flow.
- Parameters:
standard_df (pandas.DataFrame) – The standardized result DataFrame produced by the module parser.
parameters (Any, optional) – Parsed parameters (a ProteoBenchParameters or any object with the same attributes). Parameter-dependent checks degrade to warnings when values are missing.
fasta (FastaReference, optional) – Reference protein identifiers, for profiles that validate against a sequence database.
config (ModuleValidationConfig, optional) – Module validation configuration. Defaults to a generic configuration (which selects the default profile).
input_format (str, optional) – The selected software tool, used for run-consistency checks.
profile (str, optional) – Explicit profile name, overriding config.validation_profile. Mostly useful for testing.
- Returns:
The aggregated validation report.
- Return type:
ValidationReport
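The two behaviours described for validate_submission, profile resolution order and fault-tolerant checks, can be sketched independently of the package. resolve_profile and run_checks are hypothetical helpers, issues are plain dicts here, and the CHECK_CRASHED code is invented for the example. The "quant_lfq" default is taken from the ModuleValidationConfig signature.

```python
from typing import Callable, Dict, List, Optional, Tuple

Issue = Dict[str, str]
CheckFn = Callable[[dict], List[Issue]]

DEFAULT_PROFILE = "quant_lfq"  # default shown in the ModuleValidationConfig signature


def resolve_profile(explicit: Optional[str], configured: Optional[str]) -> str:
    """Explicit argument first, then the configured profile, then the default."""
    return explicit or configured or DEFAULT_PROFILE


def run_checks(checks: List[Tuple[str, CheckFn]], ctx: dict) -> List[Issue]:
    """Run checks in order; a crashing check becomes a warning instead of propagating."""
    issues: List[Issue] = []
    for name, func in checks:
        try:
            issues.extend(func(ctx))
        except Exception as exc:  # deliberately broad: validation must not crash
            issues.append({
                "code": "CHECK_CRASHED",  # hypothetical issue code
                "severity": "warning",
                "check": name,
                "message": f"{type(exc).__name__}: {exc}",
            })
    return issues


def healthy_check(ctx: dict) -> List[Issue]:
    return []


def broken_check(ctx: dict) -> List[Issue]:
    raise KeyError("Proteins")  # simulates a missing column


issues = run_checks([("healthy", healthy_check), ("broken", broken_check)], ctx={})
```

Catching the exception per check, rather than around the whole loop, means one faulty check does not prevent the remaining checks from running.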
Submodules
- proteobench.validation.checks module
- proteobench.validation.config module
DEFAULT_DECOY_PREFIXES, DEFAULT_VALIDATION_PROFILE, ModuleValidationConfig (protein_column, sequence_column, charge_column, proforma_column, contaminant_column, contaminant_flag, decoy_prefixes, protein_group_separators, fasta_url, fasta_filename, species_flags, recommended_max_fdr_psm, max_plausible_ppm, max_plausible_dalton, validation_profile, from_parse_settings(), read_reference_database())
- proteobench.validation.context module
ValidationContext (standard_df, parameters, config, fasta, input_format, reference, extras)
- proteobench.validation.exceptions module
- proteobench.validation.fasta module
- proteobench.validation.profiles module
- proteobench.validation.protein_ids module
- proteobench.validation.report module
Severity, ValidationIssue (code, severity, message, check, field, observed, expected, examples, to_dict()), ValidationReport (issues, add(), add_error(), add_info(), add_warning(), errors, extend(), has_errors, infos, passed, raise_if_errors(), summary(), to_dict(), warnings)
- proteobench.validation.validator module