proteobench.validation.protein_ids module

Protein-identifier extraction helpers for submission validation.

ProteoBench tool outputs store protein identifiers in the standardized Proteins column. The representation is not fully normalized across tools:

a single protein may be a UniProt-style triplet such as sp|P49327|FAS_HUMAN (the | separates database/accession/entry-name), a bare accession such as P49327, or an isoform such as P49327-2;
multiple proteins (protein groups) are joined with ; (e.g. MaxQuant) or , (e.g. the FragPipe loader combines Protein and Mapped Proteins).

These helpers split protein-group strings into individual proteins and extract the candidate identifiers (accession, entry name, isoform base) used to match against a FASTA-derived accession set. They are deliberately generic so the core validator does not embed tool-specific assumptions.
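The matching step described above can be illustrated with a minimal sketch. It assumes the FASTA-derived accession set is built from UniProt-style headers and that a protein counts as matched when any of its candidate identifiers is in that set. The names `fasta_accessions_sketch` and `matches_fasta_sketch` are hypothetical and are not part of ProteoBench; the validator's actual matching logic may differ.

```python
from typing import Iterable, Set


def fasta_accessions_sketch(lines: Iterable[str]) -> Set[str]:
    """Collect accessions from FASTA header lines (hypothetical helper)."""
    accessions: Set[str] = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        first_field = fields[0] if fields else ""
        parts = first_field.split("|")
        # db|accession|entryname -> keep the accession; otherwise keep the field
        accessions.add(parts[1] if len(parts) == 3 else first_field)
    accessions.discard("")
    return accessions


def matches_fasta_sketch(candidates: Set[str], accessions: Set[str]) -> bool:
    """Assumed rule: a protein matches if any candidate identifier is known."""
    return not candidates.isdisjoint(accessions)


fasta_lines = [
    ">sp|P49327|FAS_HUMAN Fatty acid synthase OS=Homo sapiens",
    "MEEVVIAGMS",
    ">P02769",
]
known = fasta_accessions_sketch(fasta_lines)
print(sorted(known))  # ['P02769', 'P49327']
print(matches_fasta_sketch({"P49327-2", "P49327", "FAS_HUMAN"}, known))  # True
```

Including the isoform base among the candidates is what lets an isoform such as P49327-2 match a FASTA file that lists only the canonical accession.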

proteobench.validation.protein_ids.DEFAULT_GROUP_SEPARATORS = (';', ',')

Default separators used to split a protein-group string into individual proteins. The | character is intentionally excluded because it is a within-protein separator in UniProt identifiers (db|accession|entryname).

proteobench.validation.protein_ids.extract_identifiers(protein_token: str) → Set[str]

Extract candidate identifiers from a single protein token.

For a UniProt triplet such as sp|P49327|FAS_HUMAN this returns the accession (P49327), the entry name (FAS_HUMAN), and, for isoforms, the isoform base accession (e.g. P49327 for P49327-2). For a bare accession it returns the accession and its isoform base. For any other token it returns a set containing only the token itself.

Parameters:

protein_token (str) – A single protein identifier (one element of a protein group).

Returns:

Candidate identifiers usable for FASTA membership testing.

Return type:

set of str
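The following is a minimal sketch of the documented behaviour, not the module's actual implementation. It assumes that a triplet is recognised by containing exactly two | characters and that accessions follow UniProt's published accession pattern; the name `extract_identifiers_sketch` is hypothetical.

```python
import re
from typing import Set

# UniProt accession pattern with an optional isoform suffix such as "-2".
# Using this pattern here is an assumption about how accessions are detected.
_ACCESSION_RE = re.compile(
    r"^(?P<base>[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(?:-\d+)?$"
)


def extract_identifiers_sketch(protein_token: str) -> Set[str]:
    """Return candidate identifiers for one protein token (illustrative only)."""
    token = protein_token.strip()
    parts = token.split("|")
    if len(parts) == 3:
        # db|accession|entryname: keep both the accession and the entry name
        _, accession, entry_name = parts
        identifiers = {accession, entry_name}
    else:
        accession = token
        identifiers = {token}
    match = _ACCESSION_RE.match(accession)
    if match:
        # Add the isoform base (identical to the accession when no suffix)
        identifiers.add(match.group("base"))
    identifiers.discard("")
    return identifiers


print(sorted(extract_identifiers_sketch("sp|P49327-2|FAS_HUMAN")))
# ['FAS_HUMAN', 'P49327', 'P49327-2']
```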

proteobench.validation.protein_ids.is_decoy_or_contaminant(protein_token: str, contaminant_flag: str = None, decoy_prefixes: Iterable[str] = ()) → bool

Determine whether a protein token is a decoy or contaminant marker.

Matching is case-insensitive. The contaminant flag is matched as a substring anywhere in the token (mirroring ParseSettings contaminant detection), whereas decoy markers are matched only as prefixes.

Parameters:

protein_token (str) – A single protein identifier.
contaminant_flag (str, optional) – Substring marking contaminant proteins (from the tool parse settings, e.g. "Cont_"). None disables contaminant detection.
decoy_prefixes (iterable of str, optional) – Prefixes marking decoy proteins (e.g. "rev_", "DECOY_").

Returns:

True if the token is a decoy or contaminant identifier.

Return type:

bool
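The check described above might look like the sketch below. It is an illustration under stated assumptions, not the module's code: the name `is_decoy_or_contaminant_sketch` is hypothetical, the token is assumed to be already stripped of surrounding whitespace, and an empty contaminant flag is treated the same as None.

```python
from typing import Iterable, Optional


def is_decoy_or_contaminant_sketch(
    protein_token: str,
    contaminant_flag: Optional[str] = None,
    decoy_prefixes: Iterable[str] = (),
) -> bool:
    """Case-insensitive decoy/contaminant check (illustrative only)."""
    token = protein_token.lower()
    # Contaminant flag: substring match anywhere in the token.
    # Assumption: an empty flag disables the check, like None.
    if contaminant_flag and contaminant_flag.lower() in token:
        return True
    # Decoy markers: prefix match only; empty prefixes are skipped.
    return any(token.startswith(p.lower()) for p in decoy_prefixes if p)


print(is_decoy_or_contaminant_sketch("REV_sp|P49327|FAS_HUMAN", decoy_prefixes=("rev_", "DECOY_")))  # True
print(is_decoy_or_contaminant_sketch("sp|P49327|FAS_HUMAN", "Cont_", ("rev_", "DECOY_")))  # False
```

Because the contaminant flag is a substring match, a token such as sp|Cont_P02769|ALBU_BOVIN is flagged even though the marker is not at the start, while a decoy marker in the middle of a token is not.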

proteobench.validation.protein_ids.split_protein_groups(value: str, separators: Iterable[str] = (';', ',')) → List[str]

Split a protein-group cell into individual protein tokens.

Parameters:

value (str) – The raw value of a Proteins cell (may contain several proteins).
separators (iterable of str, optional) – Characters that separate proteins within a group. Defaults to DEFAULT_GROUP_SEPARATORS (; and ,).

Returns:

Stripped, non-empty individual protein tokens.

Return type:

list of str
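As a rough illustration of the splitting behaviour, the sketch below splits on every separator, strips each token, and drops empty ones. The name `split_protein_groups_sketch` is hypothetical and the regular-expression approach is an assumption; the real function may be implemented differently.

```python
import re
from typing import Iterable, List


def split_protein_groups_sketch(
    value: str, separators: Iterable[str] = (";", ",")
) -> List[str]:
    """Split a protein-group cell into stripped, non-empty tokens (illustrative)."""
    seps = [s for s in separators if s]
    if not seps:
        # No separators given: treat the whole cell as a single token
        token = value.strip()
        return [token] if token else []
    # Escape each separator so that characters like "|" would be taken literally
    pattern = "|".join(re.escape(s) for s in seps)
    return [tok.strip() for tok in re.split(pattern, value) if tok.strip()]


print(split_protein_groups_sketch("sp|P49327|FAS_HUMAN; P49327-2,,Q9Y6K9 "))
# ['sp|P49327|FAS_HUMAN', 'P49327-2', 'Q9Y6K9']
```

The UniProt triplet survives intact because | is not among the default separators.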