proteobench.validation.fasta module

FASTA / reference-database parsing for submission validation.

FastaReference builds the set of accepted protein identifiers from a FASTA file. It parses common UniProt-style headers (sp|P49327|FAS_HUMAN, tr|...|...) as well as bare accession-like headers, indexing both the accession and the entry name so that result protein identifiers can be matched regardless of which form a tool reports.

The class can be built from raw text, a local path (plain, .gz, or .zip), in-memory bytes, or an explicit iterable of identifiers. Downloading from a URL is supported via FastaReference.from_url(); requests is imported lazily inside that method, so importing this module never requires network access.

class proteobench.validation.fasta.FastaReference(identifiers: Iterable[str] | None = None)

Bases: object

Set of protein identifiers derived from a FASTA / reference database.

Parameters:

identifiers (iterable of str, optional) – Pre-computed identifiers to seed the reference with.

contains(identifier: str) -> bool

Test whether an identifier is present (case-insensitive).

Parameters:

identifier (str) – Identifier to test.

Returns:

True if the identifier is in the reference.

Return type:

bool

contains_any(identifiers: Iterable[str]) -> bool

Test whether any of several identifiers is present.

Parameters:

identifiers (iterable of str) – Candidate identifiers for a single protein.

Returns:

True if at least one candidate is in the reference.

Return type:

bool
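The two membership tests can be pictured with a small stand-in class. This is only a sketch: SketchReference is a hypothetical name, and normalising to upper case (and applying the same case-insensitive rule in contains_any) is an assumption about how case-insensitivity might be achieved, not a description of the real FastaReference internals.

```python
from typing import Iterable, Set


class SketchReference:
    """Hypothetical stand-in for illustration; not the real FastaReference."""

    def __init__(self, identifiers: Iterable[str] = ()) -> None:
        # Assumption: identifiers are normalised to upper case when stored,
        # so lookups can ignore case.
        self._ids: Set[str] = {i.strip().upper() for i in identifiers if i.strip()}

    def contains(self, identifier: str) -> bool:
        # Normalise the query the same way as the stored identifiers.
        return identifier.strip().upper() in self._ids

    def contains_any(self, identifiers: Iterable[str]) -> bool:
        # True as soon as one candidate for the protein is found.
        return any(self.contains(i) for i in identifiers)
```

A tool that reports several names for one protein (for example an accession and an entry name) would pass them all to contains_any, so a match on either form is enough.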

classmethod from_bytes(data: bytes, source_name: str | None = None, member_filename: str | None = None, encoding: str = 'utf-8') -> FastaReference

Build a reference from in-memory bytes (plain, gzip, or zip).

Parameters:
  • data (bytes) – Raw file content.

  • source_name (str, optional) – Original file name or URL, used to detect the compression type.

  • member_filename (str, optional) – Preferred FASTA member name when data is a ZIP archive.

  • encoding (str, optional) – Text encoding used to decode the FASTA content. Default "utf-8".

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference
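One way the parameters above could interact is sketched below. The function name sketch_decode_fasta, the extension-based detection, and the fallback to the first FASTA-like ZIP member are all assumptions made for illustration; the actual from_bytes logic may differ.

```python
import gzip
import io
import zipfile
from typing import Optional


def sketch_decode_fasta(
    data: bytes,
    source_name: Optional[str] = None,
    member_filename: Optional[str] = None,
    encoding: str = "utf-8",
) -> str:
    """Hypothetical helper: turn plain, gzip, or zip bytes into FASTA text."""
    # Assumption: the compression type is inferred from the name's extension.
    name = (source_name or "").lower()
    if name.endswith(".gz"):
        return gzip.decompress(data).decode(encoding)
    if name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            members = archive.namelist()
            if member_filename in members:
                chosen = member_filename
            else:
                # Assumption: otherwise take the first member that looks like FASTA.
                fasta_like = [
                    m for m in members if m.lower().endswith((".fasta", ".fa", ".faa"))
                ]
                if not fasta_like:
                    raise ValueError("no FASTA member found in ZIP archive")
                chosen = fasta_like[0]
            return archive.read(chosen).decode(encoding)
    # No recognised extension: treat the bytes as plain text.
    return data.decode(encoding)
```

The decoded text could then be handed to a text-based constructor such as from_text().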

classmethod from_identifiers(identifiers: Iterable[str]) -> FastaReference

Build a reference directly from an iterable of identifiers.

Parameters:

identifiers (iterable of str) – Identifiers to index (e.g. accessions extracted elsewhere).

Returns:

Reference indexing the supplied identifiers.

Return type:

FastaReference

classmethod from_path(path: str, member_filename: str | None = None) -> FastaReference

Build a reference from a local file path (plain, .gz, or .zip).

Parameters:
  • path (str) – Path to the FASTA, gzip, or zip file.

  • member_filename (str, optional) – Preferred FASTA member name when path is a ZIP archive.

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference

classmethod from_text(text: str) -> FastaReference

Build a reference from raw FASTA text.

Parameters:

text (str) – FASTA content (one or more records).

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference
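Since only header lines carry identifiers, building a reference from text amounts to scanning for lines that start with > and ignoring sequence lines. A minimal sketch of that scan, using a hypothetical helper name (iter_fasta_headers) that is not part of the module:

```python
from typing import Iterator


def iter_fasta_headers(text: str) -> Iterator[str]:
    """Hypothetical helper: yield each FASTA header without the leading '>'."""
    for line in text.splitlines():
        line = line.strip()
        # Sequence lines and blank lines are skipped; only headers are yielded.
        if line.startswith(">"):
            yield line[1:].strip()
```

Each yielded header would then be parsed into candidate identifiers and added to the index.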

classmethod from_url(url: str, member_filename: str | None = None, timeout: int = 60) -> FastaReference

Build a reference by downloading a FASTA / zip / gzip from a URL.

requests is imported lazily so that importing this module does not require network access.

Parameters:
  • url (str) – URL of the FASTA, gzip, or zip resource.

  • member_filename (str, optional) – Preferred FASTA member name when the resource is a ZIP archive.

  • timeout (int, optional) – Request timeout in seconds. Default 60.

Returns:

Reference indexing every header’s identifiers.

Return type:

FastaReference
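The lazy-import pattern described above can be sketched as follows. sketch_download is a hypothetical function, and its get parameter is an assumption added only so the example can be exercised without a network connection; the real from_url() signature has no such parameter.

```python
from typing import Callable, Optional


def sketch_download(url: str, timeout: int = 60, get: Optional[Callable] = None) -> bytes:
    """Hypothetical helper: fetch raw bytes, importing requests only on demand."""
    if get is None:
        # Imported here rather than at module top level, so that importing
        # the enclosing module never touches requests.
        import requests

        get = requests.get
    response = get(url, timeout=timeout)
    # Raise on HTTP error statuses instead of parsing an error page as FASTA.
    response.raise_for_status()
    return response.content
```

The returned bytes, together with the URL as the source name, are the kind of input that from_bytes() accepts.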

property identifiers: Set[str]

Return all indexed identifiers.

Returns:

The identifier set (accessions and entry names).

Return type:

set of str

proteobench.validation.fasta.parse_fasta_header(header: str) -> Set[str]

Parse a single FASTA header line into candidate identifiers.

Parameters:

header (str) – A FASTA header line, with or without the leading >.

Returns:

Candidate identifiers (accession, entry name, isoform base, …).

Return type:

set of str
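To make the header forms concrete, here is a minimal sketch of how a UniProt-style or bare header might be split into candidates. sketch_parse_header is a hypothetical function, and the isoform rule (stripping a trailing -N suffix to get the base accession) is an assumption about what "isoform base" means; the real parse_fasta_header may recognise more header styles.

```python
import re
from typing import Set

# Assumption: isoforms are written as ACCESSION-N, e.g. P49327-2.
_ISOFORM_SUFFIX = re.compile(r"-\d+$")


def sketch_parse_header(header: str) -> Set[str]:
    """Hypothetical helper: return candidate identifiers for one header line."""
    line = header.strip()
    if line.startswith(">"):
        line = line[1:]
    if not line.strip():
        return set()
    # The identifier token ends at the first whitespace; the rest is description.
    token = line.split()[0]
    candidates: Set[str] = set()
    parts = token.split("|")
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        # UniProt style: db|accession|entry_name
        candidates.update(p for p in parts[1:3] if p)
    else:
        # Bare accession-like header.
        candidates.add(token)
    # Also index the base accession of any isoform identifier.
    for ident in list(candidates):
        base = _ISOFORM_SUFFIX.sub("", ident)
        if base != ident:
            candidates.add(base)
    return candidates
```

With this sketch, the header sp|P49327|FAS_HUMAN yields both P49327 and FAS_HUMAN, so a result file can be matched whichever form it reports.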