proteobench.validation package#

Submission-validation layer for ProteoBench.

This package validates uploaded benchmark submissions before a public datapoint is created. It checks that the standardized results and parsed parameters are internally consistent and agree with the module reference database. The result is a structured ValidationReport: an overall status plus, for each issue, a severity, machine-readable code, message, field, observed and expected values, and example offending rows.

The layer is generic and registry-driven. Each module maps to a validation profile (a named, ordered set of checks). Adding a new module of an existing category requires only configuration; adding a new category requires only registering a new profile via register_profile().

Typical use:

from proteobench.validation import validate_submission, FastaReference, ModuleValidationConfig

config = ModuleValidationConfig.from_parse_settings(parse_settings_dir, module_id, input_format)
fasta = FastaReference.from_url(config.fasta_url, member_filename=config.fasta_filename)
report = validate_submission(standard_df, parameters=params, fasta=fasta, config=config,
                             input_format=input_format)
if report.has_errors:
    ...  # e.g. show report.summary() to the submitter, or call report.raise_if_errors()

Registering a custom profile:

from proteobench.validation import Check, ValidationProfile, register_profile

register_profile(ValidationProfile(
    name="my_module",
    checks=[Check("my_check", my_check_func, "what it does")],
))
class proteobench.validation.Check(name: str, func: Callable[[ValidationContext], List[ValidationIssue]], description: str = '')[source]#

Bases: object

A single, named validation check.

name#

Stable identifier used in fallback error messages and progress display.

Type:

str

func#

A function ctx -> list[ValidationIssue].

Type:

callable

description#

Human-readable description of what the check verifies.

Type:

str, optional

description: str = ''#
func: Callable[[ValidationContext], List[ValidationIssue]]#
name: str#
run(ctx: ValidationContext) → List[ValidationIssue][source]#

Execute the check against a validation context.

Parameters:

ctx (ValidationContext) – The inputs available to the check.

Returns:

Issues produced by the check (possibly empty).

Return type:

list of ValidationIssue

class proteobench.validation.FastaReference(identifiers: Iterable[str] | None = None)[source]#

Bases: object

Set of protein identifiers derived from a FASTA / reference database.

Parameters:

identifiers (iterable of str, optional) – Pre-computed identifiers to seed the reference with.

contains(identifier: str) → bool[source]#

Test whether an identifier is present (case-insensitive).

Parameters:

identifier (str) – Identifier to test.

Returns:

True if the identifier is in the reference.

Return type:

bool

contains_any(identifiers: Iterable[str]) → bool[source]#

Test whether any of several identifiers is present.

Parameters:

identifiers (iterable of str) – Candidate identifiers for a single protein.

Returns:

True if at least one candidate is in the reference.

Return type:

bool

classmethod from_bytes(data: bytes, source_name: str | None = None, member_filename: str | None = None, encoding: str = 'utf-8') → FastaReference[source]#

Build a reference from in-memory bytes (plain, gzip, or zip).

Parameters:
  • data (bytes) – Raw file content.

  • source_name (str, optional) – Original file name or URL, used to detect the compression type.

  • member_filename (str, optional) – Preferred FASTA member name when data is a ZIP archive.

  • encoding (str, optional) – Text encoding used to decode the FASTA content. Default "utf-8".

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference
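The plain / gzip / zip handling described for from_bytes can be sketched with the standard library alone. This is an illustrative stand-in, not the ProteoBench implementation: the helper name decode_fasta_bytes, the magic-byte fallback, the suffix list, and the rule for picking a ZIP member when member_filename is absent are all assumptions.

```python
import gzip
import io
import zipfile
from typing import Optional

FASTA_SUFFIXES = (".fasta", ".fa", ".faa")  # assumed list of FASTA suffixes


def decode_fasta_bytes(
    data: bytes,
    source_name: Optional[str] = None,
    member_filename: Optional[str] = None,
    encoding: str = "utf-8",
) -> str:
    """Return FASTA text from plain, gzip, or zip bytes (hypothetical helper)."""
    name = (source_name or "").lower()
    # Prefer the file name; fall back to the well-known magic bytes.
    is_zip = name.endswith(".zip") or data[:4] == b"PK\x03\x04"
    is_gzip = name.endswith(".gz") or data[:2] == b"\x1f\x8b"

    if is_zip:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
            if member_filename in names:
                chosen = member_filename
            else:
                # Assumed fallback: first member that looks like a FASTA file.
                candidates = [n for n in names if n.lower().endswith(FASTA_SUFFIXES)]
                if not candidates:
                    raise ValueError("no FASTA member found in ZIP archive")
                chosen = candidates[0]
            raw = archive.read(chosen)
    elif is_gzip:
        raw = gzip.decompress(data)
    else:
        raw = data
    return raw.decode(encoding)


# Build the same record in all three container formats, in memory.
fasta_text = ">sp|P12345|TEST_HUMAN demo\nMKT\n"
gzipped = gzip.compress(fasta_text.encode("utf-8"))
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "not a fasta")
    archive.writestr("db.fasta", fasta_text)
zipped = buffer.getvalue()
```

All three inputs decode to the same text, which is the property a caller of from_bytes would rely on regardless of how the container type is detected internally.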

classmethod from_identifiers(identifiers: Iterable[str]) → FastaReference[source]#

Build a reference directly from an iterable of identifiers.

Parameters:

identifiers (iterable of str) – Identifiers to index (e.g. accessions extracted elsewhere).

Returns:

Reference indexing the supplied identifiers.

Return type:

FastaReference

classmethod from_path(path: str, member_filename: str | None = None) → FastaReference[source]#

Build a reference from a local file path (plain, .gz, or .zip).

Parameters:
  • path (str) – Path to the FASTA, gzip, or zip file.

  • member_filename (str, optional) – Preferred FASTA member name when path is a ZIP archive.

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference

classmethod from_text(text: str) → FastaReference[source]#

Build a reference from raw FASTA text.

Parameters:

text (str) – FASTA content (one or more records).

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference

classmethod from_url(url: str, member_filename: str | None = None, timeout: int = 60) → FastaReference[source]#

Build a reference by downloading a FASTA / zip / gzip from a URL.

requests is imported lazily so that importing this module does not require network access.

Parameters:
  • url (str) – URL of the FASTA, gzip, or zip resource.

  • member_filename (str, optional) – Preferred FASTA member name when the resource is a ZIP archive.

  • timeout (int, optional) – Request timeout in seconds. Default 60.

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference

property identifiers: Set[str]#

Return all indexed identifiers.

Returns:

The identifier set (accessions and entry names).

Return type:

set of str
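To make "accessions and entry names" concrete, the following sketch indexes UniProt-style headers and performs case-insensitive lookups. It is a stand-in written for illustration: MiniFastaReference and header_identifiers are hypothetical names, and the header-splitting rule (pipe-separated db|accession|entry_name) is an assumption about the input, not a description of how FastaReference parses headers.

```python
from typing import Iterable, Set


def header_identifiers(header: str) -> Set[str]:
    """Extract candidate identifiers from one FASTA header line (assumed rules)."""
    fields = header.lstrip(">").split()
    if not fields:
        return set()
    token = fields[0]  # e.g. "sp|P12345|TEST_HUMAN"
    parts = [p for p in token.split("|") if p]
    ids = {token}
    if len(parts) >= 3:
        ids.update(parts[1:3])  # accession and entry name
    return ids


class MiniFastaReference:
    """Hypothetical stand-in for FastaReference, for illustration only."""

    def __init__(self, identifiers: Iterable[str] = ()):
        # Store lower-cased keys so that lookups are case-insensitive.
        self._ids = {i.lower() for i in identifiers}

    @classmethod
    def from_text(cls, text: str) -> "MiniFastaReference":
        ids: Set[str] = set()
        for line in text.splitlines():
            if line.startswith(">"):
                ids |= header_identifiers(line)
        return cls(ids)

    def contains(self, identifier: str) -> bool:
        return identifier.lower() in self._ids

    def contains_any(self, identifiers: Iterable[str]) -> bool:
        return any(self.contains(i) for i in identifiers)


reference = MiniFastaReference.from_text(
    ">sp|P12345|TEST_HUMAN Example protein\nMKTAYIAK\n>custom_entry\nMSS\n"
)
```

With this index, a protein group counts as known if any of its candidate identifiers matches, which is the role contains_any plays for a single protein.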

class proteobench.validation.ModuleValidationConfig(protein_column: str = 'Proteins', sequence_column: str = 'Sequence', charge_column: str = 'Charge', proforma_column: str = 'proforma', contaminant_column: str = 'contaminant', contaminant_flag: str | None = None, decoy_prefixes: Tuple[str, ...] = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##'), protein_group_separators: Tuple[str, ...] = (';', ','), fasta_url: str | None = None, fasta_filename: str | None = None, species_flags: Tuple[str, ...] = <factory>, recommended_max_fdr_psm: float | None = 0.01, max_plausible_ppm: float | None = None, max_plausible_dalton: float | None = None, validation_profile: str = 'quant_lfq')[source]#

Bases: object

Per-module configuration for submission validation.

protein_column#

Column holding protein identifiers in the standardized DataFrame. Default "Proteins".

Type:

str, optional

sequence_column#

Column holding the (plain) peptide sequence. Default "Sequence".

Type:

str, optional

charge_column#

Column holding the precursor charge. Default "Charge".

Type:

str, optional

proforma_column#

Column holding the ProForma modified sequence. Default "proforma".

Type:

str, optional

contaminant_column#

Boolean column flagging contaminant rows. Default "contaminant".

Type:

str, optional

contaminant_flag#

Substring marking contaminant proteins (from the tool parse settings, e.g. "Cont_").

Type:

str, optional

decoy_prefixes#

Prefixes marking decoy proteins. Defaults to DEFAULT_DECOY_PREFIXES.

Type:

tuple of str, optional

protein_group_separators#

Separators used to split protein groups. Defaults to DEFAULT_GROUP_SEPARATORS.

Type:

tuple of str, optional
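As a rough illustration of how decoy_prefixes and protein_group_separators might be used together, the sketch below splits a protein-group string and flags decoy members. The two tuples copy the documented defaults; the function names and the case-insensitive prefix match are assumptions of this sketch rather than documented behaviour.

```python
import re
from typing import List, Sequence

# Values copied from the documented defaults of ModuleValidationConfig.
DECOY_PREFIXES = ("rev_", "rev__", "decoy_", "decoy", "reverse_", "##")
GROUP_SEPARATORS = (";", ",")


def split_protein_group(group: str, separators: Sequence[str] = GROUP_SEPARATORS) -> List[str]:
    """Split a protein-group string on any separator and drop empty members."""
    pattern = "|".join(re.escape(sep) for sep in separators)
    return [part.strip() for part in re.split(pattern, group) if part.strip()]


def is_decoy(protein: str, prefixes: Sequence[str] = DECOY_PREFIXES) -> bool:
    """Return True if the identifier starts with a decoy prefix.

    Case-insensitive matching is an assumption of this sketch.
    """
    return protein.lower().startswith(tuple(p.lower() for p in prefixes))


members = split_protein_group("sp|P12345|TEST_HUMAN;rev_sp|P12345|TEST_HUMAN, ##Q1")
decoys = [m for m in members if is_decoy(m)]
```

Stripping whitespace after the split means a group written with ", " behaves the same as one written with ",".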

fasta_url#

URL of the reference FASTA / zip / gzip for the module.

Type:

str, optional

fasta_filename#

Preferred FASTA member name when the resource is an archive.

Type:

str, optional

species_flags#

Species names configured for the module (e.g. ("YEAST", "ECOLI", "HUMAN")), derived from the tool’s species mapper. Currently informational.

Type:

tuple of str, optional

recommended_max_fdr_psm#

Recommended maximum PSM-level FDR for the benchmark. A parsed FDR above this value produces a warning. Default 0.01 (1%). Set to None to disable the recommendation check.

Type:

float, optional

max_plausible_ppm#

Plausibility ceiling for ppm mass tolerances. A parsed tolerance above this value produces a warning. Defaults to None, in which case the implausible-value check is skipped. Set via [validation] in module_settings.toml.

Type:

float, optional

max_plausible_dalton#

Plausibility ceiling for absolute (Da / Th / amu) mass tolerances; for tolerances given in mmu the ceiling is scaled by 1000. Defaults to None, in which case the implausible-value check is skipped. Set via [validation] in module_settings.toml.

Type:

float, optional
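The two plausibility ceilings and the mmu scaling can be read as the comparison sketched below. This is only an interpretation of the attribute descriptions: the helper name, the accepted unit spellings, and the example ceilings used in the calls are assumptions, and the actual check may handle units differently.

```python
from typing import Optional


def exceeds_ceiling(
    value: float,
    unit: str,
    max_ppm: Optional[float],
    max_dalton: Optional[float],
) -> Optional[bool]:
    """Return True/False, or None when the check is skipped (hypothetical helper)."""
    unit = unit.strip().lower()
    if unit == "ppm":
        ceiling = max_ppm
    elif unit in {"da", "th", "amu"}:
        ceiling = max_dalton
    elif unit == "mmu":
        # 1 Da = 1000 mmu, so the Dalton ceiling is scaled by 1000.
        ceiling = None if max_dalton is None else max_dalton * 1000
    else:
        ceiling = None  # unknown unit: nothing to compare against
    if ceiling is None:
        return None  # ceiling unset, so the check is skipped
    return value > ceiling


# Example ceilings chosen for illustration only.
wide_ppm = exceeds_ceiling(500, "ppm", max_ppm=100.0, max_dalton=2.0)
ok_mmu = exceeds_ceiling(1500, "mmu", max_ppm=100.0, max_dalton=2.0)
skipped = exceeds_ceiling(0.5, "Da", max_ppm=None, max_dalton=None)
```

Returning None for the skipped case keeps "not checked" distinct from "checked and plausible".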

validation_profile#

Name of the registered profile whose checks the orchestrator runs. Set automatically by from_parse_settings(); for direct construction it defaults to DEFAULT_VALIDATION_PROFILE ('quant_lfq') so that the existing quant behaviour is preserved.

Type:

str, optional

charge_column: str = 'Charge'#
contaminant_column: str = 'contaminant'#
contaminant_flag: str | None = None#
decoy_prefixes: Tuple[str, ...] = ('rev_', 'rev__', 'decoy_', 'decoy', 'reverse_', '##')#
fasta_filename: str | None = None#
fasta_url: str | None = None#
classmethod from_parse_settings(parse_settings_dir: str, module_id: str, input_format: str) → ModuleValidationConfig[source]#

Build a config from the existing parse settings of a module/tool.

This reuses ParseSettingsBuilder to read the contaminant flag and species flags for the selected tool, reads the optional [reference_database] and [validation] sections from the module’s module_settings.toml, and resolves the validation profile.

Parameters:
  • parse_settings_dir (str) – Directory containing the module’s parse settings (the module’s parse_settings_dir attribute).

  • module_id (str) – The module identifier (e.g. "quant_lfq_DDA_ion_QExactive").

  • input_format (str) – The selected software tool (e.g. "MaxQuant").

Returns:

Configuration populated from the parse settings. Falls back to the defaults for any value that cannot be read.

Return type:

ModuleValidationConfig

max_plausible_dalton: float | None = None#
max_plausible_ppm: float | None = None#
proforma_column: str = 'proforma'#
protein_column: str = 'Proteins'#
protein_group_separators: Tuple[str, ...] = (';', ',')#
static read_reference_database(parse_settings_dir: str) → dict[source]#

Read the [reference_database] section of a module’s settings.

Parameters:

parse_settings_dir (str) – Directory containing the module’s module_settings.toml.

Returns:

The [reference_database] table, or an empty dict if absent.

Return type:

dict

recommended_max_fdr_psm: float | None = 0.01#
sequence_column: str = 'Sequence'#
species_flags: Tuple[str, ...]#
validation_profile: str = 'quant_lfq'#
class proteobench.validation.Severity(value)[source]#

Bases: str, Enum

Severity level of a validation issue.

Severity controls display prominence and inclusion in the pull-request summary, and it determines whether the optional programmatic ValidationReport.raise_if_errors() path raises. It does not gate the Streamlit submission flow: no severity blocks submission.

ERROR = 'error'#
INFO = 'info'#
WARNING = 'warning'#
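Because Severity derives from both str and Enum, its members compare equal to their string values and serialise to JSON as plain strings. The snippet re-declares the enum locally from the member values listed above so that it runs without ProteoBench installed; it is a mirror for demonstration, not the package's own class.

```python
import json
from enum import Enum


class Severity(str, Enum):
    """Local re-declaration mirroring the documented members."""

    ERROR = "error"
    INFO = "info"
    WARNING = "warning"


# Look up a member from its string value, e.g. when reading a stored report.
parsed = Severity("warning")

# str-based enum members are encoded by json as their underlying string.
payload = json.dumps({"severity": Severity.ERROR})
```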
exception proteobench.validation.SubmissionValidationError(report: ValidationReport)[source]#

Bases: Exception

Raised when a submission fails validation.

The originating ValidationReport is attached as the report attribute so callers can inspect every issue.

Parameters:

report (ValidationReport) – The validation report that triggered the error.

class proteobench.validation.ValidationContext(standard_df: DataFrame, parameters: Any = None, config: ModuleValidationConfig = <factory>, fasta: FastaReference | None = None, input_format: str | None = None, reference: Any = None, extras: Dict[str, Any] = <factory>)[source]#

Bases: object

Inputs available to a validation check.

standard_df#

The standardized result DataFrame produced by the module parser.

Type:

pandas.DataFrame

parameters#

Parsed parameters (a ProteoBenchParameters or any object with the same attributes). None when no parameter file was provided.

Type:

Any, optional

config#

Module validation configuration (column names, flags, FASTA location, resolved profile).

Type:

ModuleValidationConfig

fasta#

Reference protein identifiers, for profiles that validate against a sequence database. None when unavailable or not applicable.

Type:

FastaReference, optional

input_format#

The selected software tool used to produce the results.

Type:

str, optional

reference#

Generic reference object for profiles whose reference is not a FASTA (for example a de novo ground-truth table). None when unused.

Type:

Any, optional

extras#

Escape hatch for additional, profile-specific inputs.

Type:

dict, optional

config: ModuleValidationConfig#
extras: Dict[str, Any]#
fasta: FastaReference | None = None#
input_format: str | None = None#
parameters: Any = None#
reference: Any = None#
standard_df: DataFrame#
class proteobench.validation.ValidationIssue(code: str, severity: Severity, message: str, check: str, field: str | None = None, observed: Any = None, expected: Any = None, examples: List[Any] = <factory>)[source]#

Bases: object

A single validation finding.

code#

Machine-readable issue code (stable identifier, e.g. "protein_not_in_fasta").

Type:

str

severity#

Severity of the issue.

Type:

Severity

message#

Human-readable description of the issue.

Type:

str

check#

Name of the check that produced the issue (e.g. "protein_ids").

Type:

str

field#

Relevant field, file, or column name the issue refers to.

Type:

str, optional

observed#

Observed value (or a short summary of it).

Type:

Any, optional

expected#

Expected value or allowed range, where applicable.

Type:

Any, optional

examples#

A small number of example offending rows or identifiers.

Type:

list, optional

check: str#
code: str#
examples: List[Any]#
expected: Any = None#
field: str | None = None#
message: str#
observed: Any = None#
severity: Severity#
to_dict() → Dict[str, Any][source]#

Convert the issue to a JSON-serialisable dictionary.

Returns:

Dictionary representation of the issue.

Return type:

dict

class proteobench.validation.ValidationProfile(name: str, checks: List[Check] = <factory>, description: str = '')[source]#

Bases: object

An ordered set of checks that applies to one category of module.

name#

Unique profile name (the routing key declared by modules).

Type:

str

checks#

Checks to run, in order.

Type:

list of Check

description#

Human-readable description of the profile.

Type:

str, optional

property check_names: List[str]#

Return the names of the checks in this profile.

Returns:

The ordered check names.

Return type:

list of str

checks: List[Check]#
description: str = ''#
name: str#
class proteobench.validation.ValidationReport(issues: List[ValidationIssue] = <factory>)[source]#

Bases: object

Collection of validation issues with overall status helpers.

issues#

Issues collected during validation.

Type:

list of ValidationIssue

add(code: str, severity: Severity, message: str, check: str, field: str | None = None, observed: Any = None, expected: Any = None, examples: List[Any] | None = None) → ValidationReport[source]#

Append a new issue to the report.

Parameters:
  • code (str) – Machine-readable issue code.

  • severity (Severity) – Severity of the issue.

  • message (str) – Human-readable description.

  • check (str) – Name of the originating check.

  • field (str, optional) – Relevant field, file, or column name.

  • observed (Any, optional) – Observed value.

  • expected (Any, optional) – Expected value or allowed range.

  • examples (list, optional) – Example offending rows or identifiers.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

add_error(code: str, message: str, check: str, **kwargs: Any) → ValidationReport[source]#

Append an ERROR issue.

Parameters:
  • code (str) – Machine-readable issue code.

  • message (str) – Human-readable description.

  • check (str) – Name of the originating check.

  • **kwargs (dict) – Optional field, observed, expected, and examples values.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

add_info(code: str, message: str, check: str, **kwargs: Any) → ValidationReport[source]#

Append an INFO issue.

Parameters:
  • code (str) – Machine-readable issue code.

  • message (str) – Human-readable description.

  • check (str) – Name of the originating check.

  • **kwargs (dict) – Optional field, observed, expected, and examples values.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

add_warning(code: str, message: str, check: str, **kwargs: Any) → ValidationReport[source]#

Append a WARNING issue.

Parameters:
  • code (str) – Machine-readable issue code.

  • message (str) – Human-readable description.

  • check (str) – Name of the originating check.

  • **kwargs (dict) – Optional field, observed, expected, and examples values.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

property errors: List[ValidationIssue]#

Return all ERROR issues.

Returns:

The error-level issues.

Return type:

list of ValidationIssue

extend(issues: List[ValidationIssue]) → ValidationReport[source]#

Append several issues at once.

Parameters:

issues (list of ValidationIssue) – Issues to add.

Returns:

The report itself, to allow chaining.

Return type:

ValidationReport

property has_errors: bool#

Whether the report contains any ERROR issue.

Returns:

True if at least one error is present.

Return type:

bool

property infos: List[ValidationIssue]#

Return all INFO issues.

Returns:

The info-level issues.

Return type:

list of ValidationIssue

issues: List[ValidationIssue]#
property passed: bool#

Overall pass status (no ERROR issues).

This is informational only: the Streamlit submission flow does not gate on it (submission is never blocked). It is used for display and by the optional raise_if_errors() path.

Returns:

True when there are no ERROR issues (warnings allowed).

Return type:

bool

raise_if_errors() → None[source]#

Raise SubmissionValidationError if any error issue is present.

Raises:

SubmissionValidationError – If the report contains at least one ERROR issue.

summary(include_info: bool = False) → str[source]#

Build a compact Markdown summary of the report.

Useful for embedding the findings into pull-request text or logs. The wording is neutral: submission validation does not block submission; it only surfaces points for the submitter and reviewers to consider.

Parameters:

include_info (bool, optional) – Whether to include INFO issues in the summary. Default False.

Returns:

Markdown-formatted summary.

Return type:

str

to_dict() → Dict[str, Any][source]#

Convert the report to a JSON-serialisable dictionary.

Returns:

Dictionary with overall status and the list of issues.

Return type:

dict

property warnings: List[ValidationIssue]#

Return all WARNING issues.

Returns:

The warning-level issues.

Return type:

list of ValidationIssue
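The chaining and status helpers of ValidationReport can be pictured with the reduced stand-in below. MiniReport, MiniIssue, and MiniValidationError are hypothetical classes that keep only the fields needed for the demonstration; the issue code "fdr_above_recommended" and the check name "parameters" are invented for the example, while "protein_not_in_fasta" and "protein_ids" are the examples given in the ValidationIssue description.

```python
import dataclasses
from typing import List


@dataclasses.dataclass
class MiniIssue:
    code: str
    severity: str  # "error", "warning" or "info"
    message: str
    check: str


class MiniValidationError(Exception):
    """Carries the originating report, like SubmissionValidationError."""

    def __init__(self, report: "MiniReport"):
        super().__init__(f"{len(report.errors)} validation error(s)")
        self.report = report


@dataclasses.dataclass
class MiniReport:
    issues: List[MiniIssue] = dataclasses.field(default_factory=list)

    def add(self, code: str, severity: str, message: str, check: str) -> "MiniReport":
        self.issues.append(MiniIssue(code, severity, message, check))
        return self  # returning self is what allows chaining

    def add_error(self, code: str, message: str, check: str) -> "MiniReport":
        return self.add(code, "error", message, check)

    def add_warning(self, code: str, message: str, check: str) -> "MiniReport":
        return self.add(code, "warning", message, check)

    @property
    def errors(self) -> List[MiniIssue]:
        return [i for i in self.issues if i.severity == "error"]

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)

    @property
    def passed(self) -> bool:
        return not self.has_errors  # warnings do not affect the pass status

    def raise_if_errors(self) -> None:
        if self.has_errors:
            raise MiniValidationError(self)


report = MiniReport().add_warning(
    "fdr_above_recommended", "Parsed FDR is above 0.01.", "parameters"
)
report.raise_if_errors()  # no error-level issue yet, so nothing is raised
```

A report holding only warnings still passes; adding a single error flips has_errors and makes raise_if_errors() raise with the report attached.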

proteobench.validation.available_profiles() → List[str][source]#

List the names of all registered profiles.

Returns:

Sorted profile names.

Return type:

list of str

proteobench.validation.get_profile(name: str) → ValidationProfile | None[source]#

Look up a registered profile by name.

Parameters:

name (str) – Profile name.

Returns:

The registered profile, or None if no profile has that name (or if name is not a string).

Return type:

ValidationProfile or None

proteobench.validation.register_profile(profile: ValidationProfile, overwrite: bool = False) → None[source]#

Register a validation profile under its name.

Parameters:
  • profile (ValidationProfile) – The profile to register.

  • overwrite (bool, optional) – If False (default), registering a name that already exists raises ValueError. Set to True to replace an existing profile.

Raises:

ValueError – If a profile with the same name is already registered and overwrite is False.

proteobench.validation.unregister_profile(name: str) → None[source]#

Remove a profile from the registry if present.

Parameters:

name (str) – Name of the profile to remove.
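Taken together, the four registry functions behave like operations on a name-keyed mapping. The sketch below reproduces the documented semantics (sorted names, None for unknown or non-string names, ValueError on duplicates unless overwrite is set, silent removal) with a plain dict. The dict-backed storage and the Mini-prefixed names are assumptions, not the package's internals.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MiniProfile:
    """Hypothetical stand-in for ValidationProfile."""

    name: str
    checks: List[Callable] = field(default_factory=list)


_PROFILES: Dict[str, MiniProfile] = {}  # assumed storage: a module-level dict


def register(profile: MiniProfile, overwrite: bool = False) -> None:
    if profile.name in _PROFILES and not overwrite:
        raise ValueError(f"profile {profile.name!r} is already registered")
    _PROFILES[profile.name] = profile


def get(name: object) -> Optional[MiniProfile]:
    if not isinstance(name, str):
        return None  # non-string names resolve to None instead of raising
    return _PROFILES.get(name)


def unregister(name: str) -> None:
    _PROFILES.pop(name, None)  # removing an unknown name is a no-op


def available() -> List[str]:
    return sorted(_PROFILES)


register(MiniProfile("my_module"))
register(MiniProfile("another_module"))
names = available()
```

Tests that register temporary profiles would typically pair each register call with an unregister call so the registry returns to its original state.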

proteobench.validation.validate_submission(standard_df: DataFrame, parameters: Any = None, fasta: FastaReference | None = None, config: ModuleValidationConfig | None = None, input_format: str | None = None, profile: str | None = None) → ValidationReport[source]#

Validate a benchmark submission and return a structured report.

The set of checks run is determined by the validation profile, resolved from (in order): the explicit profile argument, config.validation_profile, or the default. Each check is fault-tolerant: a check that raises an unexpected exception is converted to a warning so that validation itself never crashes the submission flow.

Parameters:
  • standard_df (pandas.DataFrame) – The standardized result DataFrame produced by the module parser.

  • parameters (Any, optional) – Parsed parameters (a ProteoBenchParameters or any object with the same attributes). Parameter-dependent checks degrade to warnings when values are missing.

  • fasta (FastaReference, optional) – Reference protein identifiers, for profiles that validate against a sequence database.

  • config (ModuleValidationConfig, optional) – Module validation configuration. Defaults to a generic configuration (which selects the default profile).

  • input_format (str, optional) – The selected software tool, used for run-consistency checks.

  • profile (str, optional) – Explicit profile name, overriding config.validation_profile. Mostly useful for testing.

Returns:

The aggregated validation report.

Return type:

ValidationReport
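The two behaviours described for validate_submission, profile resolution order and fault-tolerant execution of checks, can be sketched as follows. This is a simplified model under stated assumptions: issues are plain tuples, the context is a dict, and resolve_profile, run_checks, and the wording of the fallback warning are hypothetical. Only the default profile name 'quant_lfq' is taken from the documented default of validation_profile.

```python
from typing import Callable, List, Optional, Tuple

Issue = Tuple[str, str, str]  # (severity, check name, message)
CheckFn = Callable[[dict], List[Issue]]

DEFAULT_PROFILE = "quant_lfq"  # documented default of validation_profile


def resolve_profile(explicit: Optional[str], config_profile: Optional[str]) -> str:
    """Explicit argument first, then the config value, then the default."""
    return explicit or config_profile or DEFAULT_PROFILE


def run_checks(checks: List[Tuple[str, CheckFn]], ctx: dict) -> List[Issue]:
    """Run checks in order; a crashing check becomes a warning, not a failure."""
    issues: List[Issue] = []
    for name, func in checks:
        try:
            issues.extend(func(ctx))
        except Exception as exc:
            issues.append(("warning", name, f"check could not be completed: {exc!r}"))
    return issues


def check_rows(ctx: dict) -> List[Issue]:
    return [] if ctx["rows"] else [("error", "rows", "no rows in submission")]


def check_broken(ctx: dict) -> List[Issue]:
    raise KeyError("missing column")  # simulates an unexpected failure


issues = run_checks([("rows", check_rows), ("broken", check_broken)], {"rows": []})
```

The failing check contributes a warning that names the check, which matches the use of Check.name in fallback messages, and the remaining checks still run.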

Submodules#