proteobench.validation.protein_ids module
Protein-identifier extraction helpers for submission validation.
ProteoBench tool outputs store protein identifiers in the standardized
Proteins column. The representation is not fully normalized across tools:
a single protein may be a UniProt-style triplet such as
sp|P49327|FAS_HUMAN (the | separates database/accession/entry-name), a bare accession such as P49327, or an isoform such as P49327-2; multiple proteins (protein groups) are joined with
; (e.g. MaxQuant) or , (e.g. the FragPipe loader combines Protein and Mapped Proteins).
These helpers split protein-group strings into individual proteins and extract the candidate identifiers (accession, entry name, isoform base) used to match against a FASTA-derived accession set. They are deliberately generic so the core validator does not embed tool-specific assumptions.
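The module text does not say how the FASTA-derived accession set is built. The following is a minimal sketch, assuming UniProt-style FASTA headers, of how such a set could be collected and compared with the candidate identifiers of one protein. The names fasta_accessions and matches_fasta are hypothetical and are not part of ProteoBench.

```python
from typing import Iterable, Set


def fasta_accessions(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect identifiers from FASTA header lines (hypothetical helper).

    Assumption: headers look like ">sp|P49327|FAS_HUMAN description" or
    ">P12345 description"; sequence lines are ignored.
    """
    known: Set[str] = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # header with no identifier
        parts = tokens[0].split("|")
        if len(parts) >= 3:
            known.update(parts[1:3])  # accession and entry name
        else:
            known.add(tokens[0])  # bare identifier
    return known


def matches_fasta(candidates: Set[str], known: Set[str]) -> bool:
    """A protein matches if any of its candidate identifiers is known."""
    return not candidates.isdisjoint(known)


fasta = [
    ">sp|P49327|FAS_HUMAN Fatty acid synthase OS=Homo sapiens",
    "MEEVVIAGMS",
    ">P12345 custom entry",
]
known = fasta_accessions(fasta)
print(matches_fasta({"P49327-2", "P49327"}, known))
print(matches_fasta({"Q00000"}, known))
```

Matching on any shared identifier lets an isoform such as P49327-2 match through its base accession even when the FASTA only lists the canonical entry.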
- proteobench.validation.protein_ids.DEFAULT_GROUP_SEPARATORS = (';', ',')
Default separators used to split a protein-group string into individual proteins. The
| character is intentionally excluded because it is a within-protein separator in UniProt identifiers (db|accession|entryname).
- proteobench.validation.protein_ids.extract_identifiers(protein_token: str) → Set[str]
Extract candidate identifiers from a single protein token.
For a UniProt triplet such as
sp|P49327|FAS_HUMAN this returns the accession (P49327), the entry name (FAS_HUMAN), and (for isoforms) the isoform base accession. For a bare accession it returns the accession and its isoform base. For any other token it returns the token unchanged.
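The behavior described above can be sketched as follows. This is an illustrative re-implementation, not the ProteoBench source: the function name sketch_extract_identifiers is hypothetical, and it assumes that an accession follows the published UniProt accession pattern and that an isoform is an accession followed by a hyphen and digits.

```python
import re
from typing import Set

# Assumption: UniProt accession pattern with an optional "-<digits>" isoform suffix.
_ACCESSION = re.compile(
    r"^(?P<base>[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?:-\d+)?$"
)


def sketch_extract_identifiers(protein_token: str) -> Set[str]:
    """Return candidate identifiers for one protein token (illustrative only)."""
    token = protein_token.strip()
    parts = token.split("|")
    if len(parts) >= 3:
        # Triplet: db|accession|entry-name
        accession, entry_name = parts[1], parts[2]
        identifiers = {accession, entry_name}
    else:
        # Bare accession or any other token: keep it unchanged
        accession = token
        identifiers = {token}
    match = _ACCESSION.match(accession)
    if match:
        # Adds the isoform base; a no-op for canonical accessions
        identifiers.add(match.group("base"))
    return identifiers


print(sorted(sketch_extract_identifiers("sp|P49327-2|FAS_HUMAN")))
print(sorted(sketch_extract_identifiers("P49327-2")))
print(sorted(sketch_extract_identifiers("some_custom_id")))
```

Returning a set keeps the result free of duplicates when the accession and its isoform base are the same string.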
- proteobench.validation.protein_ids.is_decoy_or_contaminant(protein_token: str, contaminant_flag: str = None, decoy_prefixes: Iterable[str] = ()) → bool
Determine whether a protein token is a decoy or contaminant marker.
The check is case-insensitive and matches the contaminant flag as a substring (mirroring ParseSettings contaminant detection) and the decoy markers as case-insensitive prefixes.
- Parameters:
protein_token (str) – A single protein identifier.
contaminant_flag (str, optional) – Substring marking contaminant proteins (from the tool parse settings, e.g. "Cont_"). None disables contaminant detection.
decoy_prefixes (iterable of str, optional) – Prefixes marking decoy proteins (e.g. "rev_", "DECOY_").
- Returns:
True if the token is a decoy or contaminant identifier.
- Return type:
bool
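A minimal sketch of the matching rules described for is_decoy_or_contaminant, assuming plain case-folding with str.lower(). The function name sketch_is_decoy_or_contaminant is hypothetical, and the real implementation may differ in details such as whitespace handling.

```python
from typing import Iterable, Optional


def sketch_is_decoy_or_contaminant(
    protein_token: str,
    contaminant_flag: Optional[str] = None,
    decoy_prefixes: Iterable[str] = (),
) -> bool:
    """Illustrative check: contaminant flag as substring, decoy markers as prefixes."""
    token = protein_token.lower()
    # Contaminant flag may appear anywhere in the token; None disables the check
    if contaminant_flag and contaminant_flag.lower() in token:
        return True
    # Decoy markers only count at the start of the token
    return any(token.startswith(prefix.lower()) for prefix in decoy_prefixes if prefix)


decoys = ("rev_", "DECOY_")
print(sketch_is_decoy_or_contaminant("sp|cont_P12345|X", contaminant_flag="Cont_"))
print(sketch_is_decoy_or_contaminant("REV_sp|P49327|FAS_HUMAN", decoy_prefixes=decoys))
print(sketch_is_decoy_or_contaminant("sp|P49327|FAS_HUMAN", "Cont_", decoys))
```

The asymmetry is deliberate in this sketch: a contaminant flag embedded inside a triplet still counts, whereas a decoy marker that is not at the start of the token does not.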
- proteobench.validation.protein_ids.split_protein_groups(value: str, separators: Iterable[str] = (';', ',')) → List[str]
Split a protein-group cell into individual protein tokens.
- Parameters:
value (str) – The raw value of a Proteins cell (may contain several proteins).
separators (iterable of str, optional) – Characters that separate proteins within a group. Defaults to DEFAULT_GROUP_SEPARATORS (; and ,).
- Returns:
Stripped, non-empty individual protein tokens.
- Return type:
List[str]
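The splitting step can be sketched with a regular expression built from the separators. This is an assumption about one possible approach, not the ProteoBench implementation, and the name sketch_split_protein_groups is hypothetical.

```python
import re
from typing import Iterable, List


def sketch_split_protein_groups(
    value: str, separators: Iterable[str] = (";", ",")
) -> List[str]:
    """Split a protein-group cell on any separator; drop empty tokens (illustrative)."""
    # Escape each separator so characters such as "|" or "." are matched literally
    escaped = [re.escape(sep) for sep in separators if sep]
    if not escaped:
        stripped = value.strip()
        return [stripped] if stripped else []
    pieces = re.split("|".join(escaped), value)
    return [piece.strip() for piece in pieces if piece.strip()]


print(sketch_split_protein_groups("sp|P49327|FAS_HUMAN; P49327-2,Q9Y6K9"))
print(sketch_split_protein_groups("A,B;C", separators=(";",)))
```

Because | is not among the default separators, a UniProt triplet survives the split as a single token.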