proteobench.validation package#

Submission-validation layer for ProteoBench.

This package validates uploaded benchmark submissions before a public datapoint is created. It checks that the standardized results and parsed parameters are internally consistent and consistent with the module reference database, and returns a structured ValidationReport (overall status plus per-issue severity, machine-readable code, message, field, observed/expected values, and example offending rows).

The layer is generic and registry-driven. Each module maps to a validation profile (a named, ordered set of checks). Adding a new module of an existing category requires only configuration; adding a new category requires only registering a new profile via register_profile().

Typical use:

from proteobench.validation import validate_submission, FastaReference, ModuleValidationConfig

config = ModuleValidationConfig.from_parse_settings(parse_settings_dir, module_id, input_format)
fasta = FastaReference.from_url(config.fasta_url, member_filename=config.fasta_filename)
report = validate_submission(standard_df, parameters=params, fasta=fasta, config=config,
                             input_format=input_format)
if report.has_errors:
    ...  # handle the errors, e.g. show them to the submitter or call report.raise_if_errors()

Registering a custom profile:

from proteobench.validation import Check, ValidationProfile, register_profile

register_profile(ValidationProfile(
    name="my_module",
    checks=[Check("my_check", my_check_func, "what it does")],
))

class proteobench.validation.Check(name: str, func: Callable[[ValidationContext], List[ValidationIssue]], description: str = '')[source]#

Bases: object

A single, named validation check.

name#

Stable identifier used in fallback error messages and progress display.

Type:: str

func#

A function ctx -> list[ValidationIssue].

Type:: callable

description#

Human-readable description of what the check verifies.

Type:: str, optional

description: str = ''#

func: Callable[[ValidationContext], List[ValidationIssue]]#

name: str#

run(ctx: ValidationContext) → List[ValidationIssue][source]#

Execute the check against a validation context.

Parameters:: ctx (ValidationContext) – The inputs available to the check.
Returns:: Issues produced by the check (possibly empty).
Return type:: list of ValidationIssue

class proteobench.validation.FastaReference(identifiers: Iterable[str] | None = None)[source]#

Bases: object

Set of protein identifiers derived from a FASTA / reference database.

Parameters:: identifiers (iterable of str, optional) – Pre-computed identifiers to seed the reference with.

contains(identifier: str) → bool[source]#

Test whether an identifier is present (case-insensitive).

Parameters:: identifier (str) – Identifier to test.
Returns:: True if the identifier is in the reference.
Return type:: bool

contains_any(identifiers: Iterable[str]) → bool[source]#

Test whether any of several identifiers is present.

Parameters:: identifiers (iterable of str) – Candidate identifiers for a single protein.
Returns:: True if at least one candidate is in the reference.
Return type:: bool
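
The lookup semantics of contains() and contains_any() can be illustrated with a small stand-in class. This is a sketch under assumptions, not the ProteoBench implementation: the class name IdentifierIndex is hypothetical, and normalising with str.upper() is just one way to obtain case-insensitive matching.

```python
from typing import Iterable, Optional, Set


class IdentifierIndex:
    """Hypothetical stand-in for a case-insensitive identifier set."""

    def __init__(self, identifiers: Optional[Iterable[str]] = None) -> None:
        # Normalise once at insert time so every lookup is a plain set test.
        self._ids: Set[str] = {
            i.strip().upper() for i in (identifiers or []) if i.strip()
        }

    def contains(self, identifier: str) -> bool:
        return identifier.strip().upper() in self._ids

    def contains_any(self, identifiers: Iterable[str]) -> bool:
        # True as soon as one candidate identifier for the protein is found.
        return any(self.contains(i) for i in identifiers)


index = IdentifierIndex(["P02768", "ALBU_HUMAN"])
print(index.contains("p02768"))
print(index.contains_any(["Q99999", "albu_human"]))
print(index.contains_any([]))
```

An empty candidate list yields False in this sketch, because any() over an empty iterable is False.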

classmethod from_bytes(data: bytes, source_name: str | None = None, member_filename: str | None = None, encoding: str = 'utf-8') → FastaReference[source]#

Build a reference from in-memory bytes (plain, gzip, or zip).

Parameters:

data (bytes) – Raw file content.
source_name (str, optional) – Original file name or URL, used to detect the compression type.
member_filename (str, optional) – Preferred FASTA member name when data is a ZIP archive.
encoding (str, optional) – Text encoding used to decode the FASTA content. Default "utf-8".

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference
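
How plain, gzip, and zip payloads might be told apart can be sketched with the standard library alone. The helper decode_fasta_bytes below is hypothetical. It detects the container from the content (gzip magic number, zipfile.is_zipfile) rather than from source_name, which is a deliberate simplification and may differ from what from_bytes() does.

```python
import gzip
import io
import zipfile
from typing import Optional


def decode_fasta_bytes(data: bytes, member_filename: Optional[str] = None,
                       encoding: str = "utf-8") -> str:
    """Return FASTA text from plain, gzip, or zip bytes (illustrative helper)."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(data).decode(encoding)
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            if member_filename in names:
                chosen = member_filename  # preferred member is present
            else:
                # Fall back to the first member that looks like a FASTA file.
                fasta_like = [n for n in names
                              if n.lower().endswith((".fasta", ".fa"))]
                if not fasta_like:
                    raise ValueError("no FASTA member found in archive")
                chosen = fasta_like[0]
            return archive.read(chosen).decode(encoding)
    return data.decode(encoding)  # plain, uncompressed text


fasta_text = ">sp|P02768|ALBU_HUMAN Albumin\nMKWVTF\n"

# Build a small in-memory ZIP archive to exercise the zip branch.
zipped = io.BytesIO()
with zipfile.ZipFile(zipped, "w") as archive:
    archive.writestr("db.fasta", fasta_text)

print(decode_fasta_bytes(fasta_text.encode()) == fasta_text)
print(decode_fasta_bytes(gzip.compress(fasta_text.encode())) == fasta_text)
print(decode_fasta_bytes(zipped.getvalue(), member_filename="db.fasta") == fasta_text)
```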

classmethod from_identifiers(identifiers: Iterable[str]) → FastaReference[source]#

Build a reference directly from an iterable of identifiers.

Parameters:: identifiers (iterable of str) – Identifiers to index (e.g. accessions extracted elsewhere).
Returns:: Reference indexing the supplied identifiers.
Return type:: FastaReference

classmethod from_path(path: str, member_filename: str | None = None) → FastaReference[source]#

Build a reference from a local file path (plain, .gz, or .zip).

Parameters:

path (str) – Path to the FASTA, gzip, or zip file.
member_filename (str, optional) – Preferred FASTA member name when path is a ZIP archive.

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference

classmethod from_text(text: str) → FastaReference[source]#

Build a reference from raw FASTA text.

Parameters:: text (str) – FASTA content (one or more records).
Returns:: Reference indexing every header’s identifiers.
Return type:: FastaReference
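
To make "indexing every header's identifiers" concrete, the following sketch extracts identifiers from FASTA header lines. It assumes UniProt-style headers of the form >db|ACCESSION|ENTRY_NAME and falls back to the first whitespace-delimited token otherwise. The function header_identifiers is hypothetical, and the real parser may recognise other header layouts.

```python
from typing import Set


def header_identifiers(fasta_text: str) -> Set[str]:
    """Collect accessions and entry names from FASTA headers (illustrative)."""
    ids: Set[str] = set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split()
        if not parts:
            continue  # bare ">" with no identifier
        token = parts[0]
        fields = token.split("|")
        if len(fields) >= 3:
            # UniProt-style: keep the accession and the entry name.
            ids.update(f for f in fields[1:3] if f)
        else:
            ids.add(token)
    return ids


text = (
    ">sp|P02768|ALBU_HUMAN Albumin OS=Homo sapiens\n"
    "MKWVTFISLL\n"
    ">custom_protein some description\n"
    "AAAA\n"
)
print(sorted(header_identifiers(text)))
```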

classmethod from_url(url: str, member_filename: str | None = None, timeout: int = 60) → FastaReference[source]#

Build a reference by downloading a FASTA / zip / gzip from a URL.

requests is imported lazily so that importing this module does not require network access.

Parameters:

url (str) – URL of the FASTA, gzip, or zip resource.
member_filename (str, optional) – Preferred FASTA member name when the resource is a ZIP archive.
timeout (int, optional) – Request timeout in seconds. Default 60.

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference

property identifiers: Set[str]#

Return all indexed identifiers.

Returns:: The identifier set (accessions and entry names).
Return type:: set of str

class proteobench.validation.ModuleValidationConfig(protein_column: str = 'Proteins', sequence_column: str = 'Sequence', charge_column: str = 'Charge', proforma_column: str = 'proforma', contaminant_column: str = 'contaminant', contaminant_flag: str | None = None, decoy_prefixes: Tuple[str, ...]=('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##'), protein_group_separators: Tuple[str, ...]=(';', ','), fasta_url: str | None = None, fasta_filename: str | None = None, species_flags: Tuple[str, ...]=<factory>, recommended_max_fdr_psm: float | None = 0.01, max_plausible_ppm: float | None = None, max_plausible_dalton: float | None = None, validation_profile: str = 'quant_lfq')[source]#

Bases: object

Per-module configuration for submission validation.

protein_column#

Column holding protein identifiers in the standardized DataFrame. Default "Proteins".

Type:: str, optional

sequence_column#

Column holding the (plain) peptide sequence. Default "Sequence".

Type:: str, optional

charge_column#

Column holding the precursor charge. Default "Charge".

Type:: str, optional

proforma_column#

Column holding the ProForma modified sequence. Default "proforma".

Type:: str, optional

contaminant_column#

Boolean column flagging contaminant rows. Default "contaminant".

Type:: str, optional

contaminant_flag#

Substring marking contaminant proteins (from the tool parse settings, e.g. "Cont_").

Type:: str, optional

decoy_prefixes#

Prefixes marking decoy proteins. Defaults to DEFAULT_DECOY_PREFIXES.

Type:: tuple of str, optional

protein_group_separators#

Separators used to split protein groups. Defaults to DEFAULT_GROUP_SEPARATORS.

Type:: tuple of str, optional
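
As an illustration of how decoy_prefixes and protein_group_separators could be applied, the sketch below splits a protein group and flags decoy members. The prefix tuple is the documented default; the separators follow the (';', ',') attribute default listed in this reference. The helper names, the whitespace stripping, and the case-insensitive prefix match are assumptions of the sketch, not confirmed library behaviour.

```python
import re
from typing import List, Sequence

DECOY_PREFIXES = ("rev_", "rev__", "decoy_", "decoy", "reverse_", "##")
GROUP_SEPARATORS = (";", ",")


def split_protein_group(group: str,
                        separators: Sequence[str] = GROUP_SEPARATORS) -> List[str]:
    """Split a protein-group string on any of the separators."""
    pattern = "|".join(re.escape(sep) for sep in separators)
    return [part.strip() for part in re.split(pattern, group) if part.strip()]


def is_decoy(identifier: str,
             prefixes: Sequence[str] = DECOY_PREFIXES) -> bool:
    """Prefix test; matching case-insensitively is an assumption here."""
    return identifier.lower().startswith(tuple(p.lower() for p in prefixes))


members = split_protein_group("P02768; rev_P02768,Q9Y6K9")
print(members)
print([is_decoy(m) for m in members])
```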

fasta_url#

URL of the reference FASTA / zip / gzip for the module.

Type:: str, optional

fasta_filename#

Preferred FASTA member name when the resource is an archive.

Type:: str, optional

species_flags#

Species names configured for the module (e.g. ("YEAST", "ECOLI", "HUMAN")), derived from the tool’s species mapper. Currently informational.

Type:: tuple of str, optional

recommended_max_fdr_psm#

Recommended maximum PSM-level FDR for the benchmark. A parsed FDR above this value produces a warning. Default 0.01 (1%). Set to None to disable the recommendation check.

Type:: float, optional

max_plausible_ppm#

Plausibility ceiling for ppm mass tolerances. A parsed tolerance above this value produces a warning. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.

Type:: float, optional

max_plausible_dalton#

Plausibility ceiling for absolute (Da / Th / amu) mass tolerances, scaled by 1000 for mmu. No default (None); when unset, the implausible-value check is skipped. Set via [validation] in module_settings.toml.

Type:: float, optional
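
The two plausibility ceilings can be read as a unit-dependent threshold. The sketch below is one possible reading, not the library's check: it interprets "scaled by 1000 for mmu" as multiplying the Dalton ceiling by 1000 before comparing an mmu value. The function name and the ceilings used in the example (100 ppm, 1.0 Da) are hypothetical.

```python
from typing import Optional


def tolerance_is_implausible(value: float, unit: str,
                             max_plausible_ppm: Optional[float] = None,
                             max_plausible_dalton: Optional[float] = None
                             ) -> Optional[bool]:
    """Return True/False, or None when the check is skipped (illustrative)."""
    unit = unit.strip().lower()
    if unit == "ppm":
        ceiling = max_plausible_ppm
    elif unit in ("da", "th", "amu"):
        ceiling = max_plausible_dalton
    elif unit == "mmu":
        # Assumed reading: 1 Da = 1000 mmu, so the ceiling is scaled up.
        ceiling = None if max_plausible_dalton is None else max_plausible_dalton * 1000
    else:
        ceiling = None  # unknown unit: nothing to compare against
    if ceiling is None:
        return None  # no ceiling configured, so the check is skipped
    return value > ceiling


print(tolerance_is_implausible(20, "ppm", max_plausible_ppm=100))
print(tolerance_is_implausible(1500, "mmu", max_plausible_dalton=1.0))
print(tolerance_is_implausible(0.5, "Da"))
```

Returning None rather than False for the skipped case keeps "not checked" distinguishable from "checked and plausible".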

validation_profile#

Name of the registered profile whose checks the orchestrator runs. Set automatically by from_parse_settings(); defaults to DEFAULT_VALIDATION_PROFILE for direct construction so that the existing quant behaviour is preserved.

Type:: str, optional

charge_column: str = 'Charge'#

contaminant_column: str = 'contaminant'#

contaminant_flag: str | None = None#

decoy_prefixes: Tuple[str, ...] = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##')#

fasta_filename: str | None = None#

fasta_url: str | None = None#

classmethod from_parse_settings(parse_settings_dir: str, module_id: str, input_format: str) → ModuleValidationConfig[source]#

Build a config from the existing parse settings of a module/tool.

This reuses ParseSettingsBuilder to read the contaminant flag and species flags for the selected tool, reads the optional [reference_database] and [validation] sections from the module’s module_settings.toml, and resolves the validation profile.

Parameters:

parse_settings_dir (str) – Directory containing the module’s parse settings (the module’s parse_settings_dir attribute).
module_id (str) – The module identifier (e.g. "quant_lfq_DDA_ion_QExactive").
input_format (str) – The selected software tool (e.g. "MaxQuant").

Returns:

Configuration populated from the parse settings. Falls back to the defaults for any value that cannot be read.

Return type:

ModuleValidationConfig

max_plausible_dalton: float | None = None#

max_plausible_ppm: float | None = None#

proforma_column: str = 'proforma'#

protein_column: str = 'Proteins'#

protein_group_separators: Tuple[str, ...] = (';', ',')#

static read_reference_database(parse_settings_dir: str) → dict[source]#

Read the [reference_database] section of a module’s settings.

Parameters:: parse_settings_dir (str) – Directory containing the module’s module_settings.toml.
Returns:: The [reference_database] table, or an empty dict if absent.
Return type:: dict

recommended_max_fdr_psm: float | None = 0.01#

sequence_column: str = 'Sequence'#

species_flags: Tuple[str, ...]#

validation_profile: str = 'quant_lfq'#

class proteobench.validation.Severity(value)[source]#

Bases: str, Enum

Severity level of a validation issue.

Severity controls display prominence and inclusion in the pull-request summary, and it drives the optional programmatic ValidationReport.raise_if_errors() path. It does not gate the Streamlit submission flow: no severity level blocks submission.

ERROR = 'error'#

INFO = 'info'#

WARNING = 'warning'#
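
Because Severity derives from both str and Enum, its members behave like plain strings in comparisons and JSON output. The stand-in enum below (named Level to make clear it is not the library class) shows the general Python behaviour this base-class choice gives.

```python
import json
from enum import Enum


class Level(str, Enum):
    """Stand-in with the same bases and values as the documented Severity."""
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


# Members compare equal to their string value.
print(Level.ERROR == "error")

# A member can be looked up from its value, e.g. when reading stored reports.
print(Level("warning") is Level.WARNING)

# The str mixin lets json.dumps serialise members without a custom encoder.
print(json.dumps({"severity": Level.INFO}))
```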

exception proteobench.validation.SubmissionValidationError(report: ValidationReport)[source]#

Bases: Exception

Raised when a submission fails validation.

The originating ValidationReport is attached as the report attribute so callers can inspect every issue.

Parameters:: report (ValidationReport) – The validation report that triggered the error.

class proteobench.validation.ValidationContext(standard_df: DataFrame, parameters: Any = None, config: ModuleValidationConfig = <factory>, fasta: FastaReference | None = None, input_format: str | None = None, reference: Any = None, extras: Dict[str, Any] = <factory>)[source]#

Bases: object

Inputs available to a validation check.

standard_df#

The standardized result DataFrame produced by the module parser.

Type:: pandas.DataFrame

parameters#

Parsed parameters (a ProteoBenchParameters or any object with the same attributes). None when no parameter file was provided.

Type:: Any, optional

config#

Module validation configuration (column names, flags, FASTA location, resolved profile).

Type:: ModuleValidationConfig

fasta#

Reference protein identifiers, for profiles that validate against a sequence database. None when unavailable or not applicable.

Type:: FastaReference, optional

input_format#

The selected software tool used to produce the results.

Type:: str, optional

reference#

Generic reference object for profiles whose reference is not a FASTA (for example a de novo ground-truth table). None when unused.

Type:: Any, optional

extras#

Escape hatch for additional, profile-specific inputs.

Type:: dict, optional

config: ModuleValidationConfig#

extras: Dict[str, Any]#

fasta: FastaReference | None = None#

input_format: str | None = None#

parameters: Any = None#

reference: Any = None#

standard_df: DataFrame#

class proteobench.validation.ValidationIssue(code: str, severity: Severity, message: str, check: str, field: str | None = None, observed: Any = None, expected: Any = None, examples: List[Any] = <factory>)[source]#

Bases: object

A single validation finding.

code#

Machine-readable issue code (stable identifier, e.g. "protein_not_in_fasta").

Type:: str

severity#

Severity of the issue.

Type:: Severity

message#

Human-readable description of the issue.

Type:: str

check#

Name of the check that produced the issue (e.g. "protein_ids").

Type:: str

field#

Relevant field, file, or column name the issue refers to.

Type:: str, optional

observed#

Observed value (or a short summary of it).

Type:: Any, optional

expected#

Expected value or allowed range, where applicable.

Type:: Any, optional

examples#

A small number of example offending rows or identifiers.

Type:: list, optional

check: str#

code: str#

examples: List[Any]#

expected: Any = None#

field: str | None = None#

message: str#

observed: Any = None#

severity: Severity#

to_dict() → Dict[str, Any][source]#

Convert the issue to a JSON-serialisable dictionary.

Returns:: Dictionary representation of the issue.
Return type:: dict

class proteobench.validation.ValidationProfile(name: str, checks: List[Check] = <factory>, description: str = '')[source]#

Bases: object

An ordered set of checks that applies to one category of module.

name#

Unique profile name (the routing key declared by modules).

Type:: str

checks#

Checks to run, in order.

Type:: list of Check

description#

Human-readable description of the profile.

Type:: str, optional

property check_names: List[str]#

Return the names of the checks in this profile.

Returns:: The ordered check names.
Return type:: list of str

checks: List[Check]#

description: str = ''#

name: str#

class proteobench.validation.ValidationReport(issues: List[ValidationIssue] = <factory>)[source]#

Bases: object

Collection of validation issues with overall status helpers.

issues#

Issues collected during validation.

Type:: list of ValidationIssue

add(code: str, severity: Severity, message: str, check: str, field: str | None = None, observed: Any = None, expected: Any = None, examples: List[Any] | None = None) → ValidationReport[source]#

Append a new issue to the report.

Parameters:

code (str) – Machine-readable issue code.
severity (Severity) – Severity of the issue.
message (str) – Human-readable description.
check (str) – Name of the originating check.
field (str, optional) – Relevant field, file, or column name.
observed (Any, optional) – Observed value.
expected (Any, optional) – Expected value or allowed range.
examples (list, optional) – Example offending rows or identifiers.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

add_error(code: str, message: str, check: str, **kwargs: Any) → ValidationReport[source]#

Append an ERROR issue.

Parameters:

code (str) – Machine-readable issue code.
message (str) – Human-readable description.
check (str) – Name of the originating check.
**kwargs (dict) – Optional field, observed, expected, and examples values.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

add_info(code: str, message: str, check: str, **kwargs: Any) → ValidationReport[source]#

Append an INFO issue.

Parameters:

code (str) – Machine-readable issue code.
message (str) – Human-readable description.
check (str) – Name of the originating check.
**kwargs (dict) – Optional field, observed, expected, and examples values.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

add_warning(code: str, message: str, check: str, **kwargs: Any) → ValidationReport[source]#

Append a WARNING issue.

Parameters:

code (str) – Machine-readable issue code.
message (str) – Human-readable description.
check (str) – Name of the originating check.
**kwargs (dict) – Optional field, observed, expected, and examples values.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

property errors: List[ValidationIssue]#

Return all ERROR issues.

Returns:: The error-level issues.
Return type:: list of ValidationIssue

extend(issues: List[ValidationIssue]) → ValidationReport[source]#

Append several issues at once.

Parameters:: issues (list of ValidationIssue) – Issues to add.
Returns:: The report itself, to allow chaining.
Return type:: ValidationReport

property has_errors: bool#

Whether the report contains any ERROR issue.

Returns:: True if at least one error is present.
Return type:: bool

property infos: List[ValidationIssue]#

Return all INFO issues.

Returns:: The info-level issues.
Return type:: list of ValidationIssue

issues: List[ValidationIssue]#

property passed: bool#

Overall pass status (no ERROR issues).

This is informational only: the Streamlit submission flow does not gate on it (submission is never blocked). It is used for display and by the optional raise_if_errors() path.

Returns:: True when there are no ERROR issues (warnings allowed).
Return type:: bool

raise_if_errors() → None[source]#

Raise SubmissionValidationError if any error issue is present.

Raises:: SubmissionValidationError – If the report contains at least one ERROR issue.
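
The interplay between chained add() calls, has_errors, and raise_if_errors() can be sketched with simplified stand-ins. The classes Issue, Report, and ReportError below are hypothetical and carry only a subset of the documented fields. The issue code "fdr_above_recommended" is invented for the example, while "protein_not_in_fasta" is the code quoted in this reference.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Issue:
    code: str
    severity: str  # "error", "warning", or "info"
    message: str


@dataclass
class Report:
    issues: List[Issue] = field(default_factory=list)

    def add(self, code: str, severity: str, message: str) -> "Report":
        self.issues.append(Issue(code, severity, message))
        return self  # returning self is what allows chaining

    @property
    def has_errors(self) -> bool:
        return any(issue.severity == "error" for issue in self.issues)

    def raise_if_errors(self) -> None:
        if self.has_errors:
            raise ReportError(self)


class ReportError(Exception):
    """Carries the originating report so callers can inspect every issue."""

    def __init__(self, report: Report) -> None:
        n_errors = sum(i.severity == "error" for i in report.issues)
        super().__init__(f"{n_errors} validation error(s)")
        self.report = report


report = (
    Report()
    .add("fdr_above_recommended", "warning", "parsed FDR exceeds 0.01")
    .add("protein_not_in_fasta", "error", "3 proteins missing from the reference")
)

try:
    report.raise_if_errors()
except ReportError as exc:
    codes = [i.code for i in exc.report.issues if i.severity == "error"]
    print(exc, codes)
```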

summary(include_info: bool = False) → str[source]#

Build a compact Markdown summary of the report.

Useful for embedding the findings into pull-request text or logs. The wording is neutral: submission validation does not block submission, it only surfaces points for the submitter and reviewers to consider.

Parameters:: include_info (bool, optional) – Whether to include INFO issues in the summary. Default False.
Returns:: Markdown-formatted summary.
Return type:: str

to_dict() → Dict[str, Any][source]#

Convert the report to a JSON-serialisable dictionary.

Returns:: Dictionary with overall status and the list of issues.
Return type:: dict

property warnings: List[ValidationIssue]#

Return all WARNING issues.

Returns:: The warning-level issues.
Return type:: list of ValidationIssue

proteobench.validation.available_profiles() → List[str][source]#

List the names of all registered profiles.

Returns:: Sorted profile names.
Return type:: list of str

proteobench.validation.get_profile(name: str) → ValidationProfile | None[source]#

Look up a registered profile by name.

Parameters:: name (str) – Profile name.
Returns:: The registered profile, or None if no profile has that name (or if name is not a string).
Return type:: ValidationProfile or None

proteobench.validation.register_profile(profile: ValidationProfile, overwrite: bool = False) → None[source]#

Register a validation profile under its name.

Parameters:

profile (ValidationProfile) – The profile to register.
overwrite (bool, optional) – If False (default), registering a name that already exists raises. Set True to replace an existing profile.

Raises:

ValueError – If a profile with the same name is already registered and overwrite is False.

proteobench.validation.unregister_profile(name: str) → None[source]#

Remove a profile from the registry if present.

Parameters:: name (str) – Name of the profile to remove.
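
The documented registry behaviour (duplicate names raise unless overwrite is set, lookups return None for unknown or non-string names, removal is a no-op when the name is absent) can be mimicked with a plain dictionary. This is a minimal sketch with hypothetical function names; profiles are reduced to lists of check names, and the profile and check names are made up.

```python
from typing import Dict, List, Optional

_REGISTRY: Dict[str, List[str]] = {}


def register(name: str, checks: List[str], overwrite: bool = False) -> None:
    if name in _REGISTRY and not overwrite:
        raise ValueError(f"profile {name!r} is already registered")
    _REGISTRY[name] = list(checks)


def lookup(name: object) -> Optional[List[str]]:
    if not isinstance(name, str):
        return None  # non-string names never match
    return _REGISTRY.get(name)


def available() -> List[str]:
    return sorted(_REGISTRY)


def unregister(name: str) -> None:
    _REGISTRY.pop(name, None)  # silently ignore unknown names


register("my_module", ["my_check"])
register("another_module", ["other_check"])

try:
    register("my_module", ["replacement_check"])
except ValueError as exc:
    print(exc)

register("my_module", ["replacement_check"], overwrite=True)
print(available(), lookup("my_module"), lookup(42))
unregister("does_not_exist")
```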

proteobench.validation.validate_submission(standard_df: DataFrame, parameters: Any = None, fasta: FastaReference | None = None, config: ModuleValidationConfig | None = None, input_format: str | None = None, profile: str | None = None) → ValidationReport[source]#

Validate a benchmark submission and return a structured report.

The set of checks run is determined by the validation profile, resolved from (in order): the explicit profile argument, config.validation_profile, or the default. Each check is fault-tolerant: a check that raises an unexpected exception is converted to a warning so that validation itself never crashes the submission flow.

Parameters:

standard_df (pandas.DataFrame) – The standardized result DataFrame produced by the module parser.
parameters (Any, optional) – Parsed parameters (a ProteoBenchParameters or any object with the same attributes). Parameter-dependent checks degrade to warnings when values are missing.
fasta (FastaReference, optional) – Reference protein identifiers, for profiles that validate against a sequence database.
config (ModuleValidationConfig, optional) – Module validation configuration. Defaults to a generic configuration (which selects the default profile).
input_format (str, optional) – The selected software tool, used for run-consistency checks.
profile (str, optional) – Explicit profile name, overriding config.validation_profile. Mostly useful for testing.

Returns:

The aggregated validation report.

Return type:

ValidationReport
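
The profile resolution order and the fault tolerance described above can be sketched as follows. This is an illustration under assumptions, not the orchestrator's code: issues are plain dictionaries, the context is a plain dict, and the issue codes "check_crashed" and "nonpositive_charge" are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Issue = Dict[str, str]
CheckFunc = Callable[[dict], List[Issue]]


def resolve_profile(explicit: Optional[str], configured: Optional[str],
                    default: str = "quant_lfq") -> str:
    """First non-empty value wins: explicit argument, then config, then default."""
    return explicit or configured or default


def run_checks(checks: List[Tuple[str, CheckFunc]], ctx: dict) -> List[Issue]:
    issues: List[Issue] = []
    for name, func in checks:
        try:
            issues.extend(func(ctx))
        except Exception as exc:  # broad on purpose: one bad check must not abort the run
            issues.append({
                "code": "check_crashed",  # hypothetical code
                "severity": "warning",
                "check": name,
                "message": f"{type(exc).__name__}: {exc}",
            })
    return issues


def charge_check(ctx: dict) -> List[Issue]:
    bad = [c for c in ctx["charges"] if c <= 0]
    if not bad:
        return []
    return [{"code": "nonpositive_charge", "severity": "error",
             "check": "charge", "message": f"{len(bad)} row(s) affected"}]


def broken_check(ctx: dict) -> List[Issue]:
    return ctx["missing_key"]  # raises KeyError on purpose


issues = run_checks([("charge", charge_check), ("broken", broken_check)],
                    {"charges": [2, 3, 0]})
for issue in issues:
    print(issue["severity"], issue["check"], issue["message"])
print(resolve_profile(None, None), resolve_profile(None, "my_module"))
```

The crashing check ends up as a warning-level issue next to the genuine finding, so the caller still receives a complete report.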


Submodules#