proteobench.io.parsing.parse_peptidoform module

Module for parsing peptidoform strings and extracting modifications.

proteobench.io.parsing.parse_peptidoform.aggregate_modification_column(input_string_seq: str, input_string_modifications: str, special_locations: Dict[str, int] = {'Any C-term': -1, 'Any N-term': 0, 'C-Term': -1, 'N-Term': 0, 'Protein C-term': -1, 'Protein N-term': 0}) → str

Aggregate modifications into a string representing the modified sequence.

This version handles both:
  • Original format (e.g. "Methylation (C11)" or "Carbamidomethyl (Any N-term)")
  • New format (e.g. "1xCarbamidomethyl [C11]", "1xOxidation [M4]", "1xAcetyl [N-Term]")

Parameters:
  • input_string_seq (str) – The input sequence string.

  • input_string_modifications (str) – The modifications applied to the sequence.

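  • special_locations (dict, optional) – Maps special location names (e.g. "Any N-term", "C-Term") to sequence positions: 0 for N-terminal locations and -1 for C-terminal locations. Defaults to the mapping shown in the signature.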

Returns:

The modified sequence string with aggregated modifications.

Return type:

str
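The two input formats listed above can be read with a single regular expression. The following is a minimal sketch, not the module's implementation: the helper name `parse_modification_entry` and the `ENTRY` pattern are hypothetical, and only the `special_locations` mapping is taken from the documented signature. It assumes that a location such as "C11" is one residue letter followed by a position.

```python
import re
from typing import Dict, Tuple

# Default mapping copied from the documented signature.
SPECIAL_LOCATIONS: Dict[str, int] = {
    "Any C-term": -1, "Any N-term": 0, "C-Term": -1,
    "N-Term": 0, "Protein C-term": -1, "Protein N-term": 0,
}

# Hypothetical pattern: optional "<count>x" prefix, a modification name, then a
# location in parentheses (original format) or square brackets (new format).
ENTRY = re.compile(r"(?:\d+x)?(?P<name>[^\s(\[]+)\s*[(\[](?P<loc>[^)\]]+)[)\]]")


def parse_modification_entry(
    entry: str, special_locations: Dict[str, int] = SPECIAL_LOCATIONS
) -> Tuple[str, int]:
    """Return (modification name, position) for a single modification entry."""
    match = ENTRY.fullmatch(entry.strip())
    if match is None:
        raise ValueError(f"Unrecognised modification entry: {entry!r}")
    name, loc = match.group("name"), match.group("loc")
    if loc in special_locations:
        # Terminal locations map to 0 (N-terminal) or -1 (C-terminal).
        return name, special_locations[loc]
    # Assumption: residue letter followed by its position, e.g. "C11".
    return name, int(loc[1:])
```

Under these assumptions, `parse_modification_entry("1xAcetyl [N-Term]")` yields the name "Acetyl" with position 0, and "Methylation (C11)" yields "Methylation" with position 11. How the function then writes these into the output string is not specified here, so the sketch stops at parsing.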

proteobench.io.parsing.parse_peptidoform.count_chars(input_string: str, isalpha: bool = True, isupper: bool = True) → int

Count the number of characters in the string that match the given criteria.

Parameters:
  • input_string (str) – The input string.

  • isalpha (bool, optional) – Whether to count alphabetic characters. Defaults to True.

  • isupper (bool, optional) – Whether to count uppercase characters. Defaults to True.

Returns:

The count of characters that match the criteria.

Return type:

int
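One plausible reading of the two flags is that each enabled flag adds a condition a character must satisfy to be counted. The sketch below uses a hypothetical helper, `count_matching`, to illustrate that reading. It is an assumption about the semantics, not the library's code, and in particular the behaviour when both flags are False (count everything) is a guess.

```python
def count_matching(text: str, isalpha: bool = True, isupper: bool = True) -> int:
    """Count characters that pass every enabled check (assumed semantics)."""
    return sum(
        1
        for ch in text
        # A disabled flag imposes no condition on the character.
        if (not isalpha or ch.isalpha()) and (not isupper or ch.isupper())
    )
```

With the defaults, only uppercase letters are counted, so the lowercase "m" and the bracketed mass shift in "PEPm[+15.99]TIDE" would be ignored.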

proteobench.io.parsing.parse_peptidoform.get_proforma_bracketed(input_string: str, before_aa: bool = True, isalpha: bool = True, isupper: bool = True, pattern: str = '\\[([^]]+)\\]', modification_dict: Dict[str, str] = {'+15.9949': 'Oxidation', '+42': 'Acetyl', '+57.0215': 'Carbamidomethyl', '-17.026548': 'Gln->pyro-Glu', '-18.010565': 'Glu->pyro-Glu'}) → str

Get the proforma sequence with bracketed modifications.

Parameters:
  • input_string (str) – The input sequence string.

  • before_aa (bool, optional) – Whether to add the modification before the amino acid. Defaults to True.

  • isalpha (bool, optional) – Whether to include alphabetic characters. Defaults to True.

  • isupper (bool, optional) – Whether to include uppercase characters. Defaults to True.

  • pattern (str, optional) – The regular expression pattern for matching modifications. Defaults to r"\[([^]]+)\]".

  • modification_dict (dict, optional) – Maps bracketed mass-shift strings (e.g. "+15.9949") to modification names (e.g. "Oxidation"). Defaults to the mapping shown in the signature.

Returns:

The proforma sequence with bracketed modifications.

Return type:

str
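The interplay of `pattern` and `modification_dict` can be illustrated with `re.sub` and a replacement callback. This is a minimal sketch of the name lookup only: the helper `rename_bracketed` is hypothetical, it assumes unknown labels are left unchanged, and it does not model the `before_aa`, `isalpha`, or `isupper` handling of the real function.

```python
import re
from typing import Dict

# Defaults copied from the documented signature.
PATTERN = r"\[([^]]+)\]"
MODIFICATION_DICT: Dict[str, str] = {
    "+15.9949": "Oxidation",
    "+42": "Acetyl",
    "+57.0215": "Carbamidomethyl",
    "-17.026548": "Gln->pyro-Glu",
    "-18.010565": "Glu->pyro-Glu",
}


def rename_bracketed(
    sequence: str,
    pattern: str = PATTERN,
    modification_dict: Dict[str, str] = MODIFICATION_DICT,
) -> str:
    """Replace each bracketed mass shift with its modification name."""

    def replace(match: re.Match) -> str:
        label = match.group(1)  # text between the brackets
        # Assumption: labels missing from the dictionary are kept as-is.
        return f"[{modification_dict.get(label, label)}]"

    return re.sub(pattern, replace, sequence)
```

For example, `rename_bracketed("PEPM[+15.9949]TIDE")` gives "PEPM[Oxidation]TIDE" in this sketch.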

proteobench.io.parsing.parse_peptidoform.get_stripped_seq(input_string: str, isalpha: bool = True, isupper: bool = True) → str

Get a stripped version of the sequence containing only characters that match the given criteria.

Parameters:
  • input_string (str) – The input string.

  • isalpha (bool, optional) – Whether to include alphabetic characters. Defaults to True.

  • isupper (bool, optional) – Whether to include uppercase characters. Defaults to True.

Returns:

The stripped sequence.

Return type:

str
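Stripping can be pictured as a character filter followed by a join. The helper `strip_sequence` below is a hypothetical sketch that assumes each enabled flag adds a condition a character must meet to be kept. The actual function may treat the flags differently.

```python
def strip_sequence(text: str, isalpha: bool = True, isupper: bool = True) -> str:
    """Keep only the characters that pass every enabled check (assumed semantics)."""
    return "".join(
        ch
        for ch in text
        if (not isalpha or ch.isalpha()) and (not isupper or ch.isupper())
    )


print(strip_sequence("PEPM[+15.9949]TIDE"))  # PEPMTIDE
```

Note that a pure character filter like this one would also keep capital letters inside named modifications (the "O" in "[Oxidation]", for instance), so it is only safe on inputs whose modifications are written as mass shifts or in lowercase.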

proteobench.io.parsing.parse_peptidoform.load_input_file(input_csv: str, input_format: str) → DataFrame

Load a dataframe from a CSV file depending on its format.

Parameters:
  • input_csv (str) – The path to the CSV file.

  • input_format (str) – The format of the input file (e.g., “WOMBAT”, “Custom”).

Returns:

The loaded dataframe with the required columns added (like “proforma”).

Return type:

pd.DataFrame

proteobench.io.parsing.parse_peptidoform.match_brackets(input_string: str, pattern: str = '\\[([^]]+)\\]', isalpha: bool = True, isupper: bool = True) → tuple

Match and extract bracketed modifications from the string.

Parameters:
  • input_string (str) – The input string.

  • pattern (str, optional) – The regular expression pattern for matching modifications. Defaults to r"\[([^]]+)\]".

  • isalpha (bool, optional) – Whether to match alphabetic characters. Defaults to True.

  • isupper (bool, optional) – Whether to match uppercase characters. Defaults to True.

Returns:

A tuple containing the matched modifications and their positions.

Return type:

tuple
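A sketch of how bracketed modifications and their positions might be collected with `re.finditer` follows. The helper `find_bracketed` is hypothetical, and the position convention is an assumption: each position is the number of uppercase residue letters seen before the bracket, so an N-terminal modification gets position 0. The exact layout of the real function's returned tuple is not documented here.

```python
import re
from typing import List, Tuple


def find_bracketed(
    text: str, pattern: str = r"\[([^]]+)\]"
) -> Tuple[List[str], List[int]]:
    """Return (modifications, positions) for every bracketed group in text."""
    modifications: List[str] = []
    positions: List[int] = []
    residues_seen = 0  # uppercase letters outside brackets so far
    last_end = 0       # index just past the previous bracketed group
    for match in re.finditer(pattern, text):
        between = text[last_end:match.start()]
        residues_seen += sum(1 for ch in between if ch.isalpha() and ch.isupper())
        modifications.append(match.group(1))
        positions.append(residues_seen)
        last_end = match.end()
    return modifications, positions
```

Counting only the text between matches keeps letters inside the brackets from shifting later positions.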

proteobench.io.parsing.parse_peptidoform.to_lowercase(match: Match) → str

Convert a match to lowercase.

Parameters:

match (re.Match) – The match object from a regular expression.

Returns:

The lowercase version of the matched string.

Return type:

str
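A function with this signature is the kind of callback that `re.sub` accepts as its replacement argument. The sketch below uses a hypothetical stand-in, `lowercase_match`, and assumes the whole match (`group(0)`) is lowercased. Whether the real function uses the whole match or a capture group is not stated here.

```python
import re


def lowercase_match(match: re.Match) -> str:
    """Return the full matched text in lowercase (assumed behaviour)."""
    return match.group(0).lower()


# Lowercase every bracketed group while leaving the residues untouched.
result = re.sub(r"\[[^]]+\]", lowercase_match, "PEPM[OXIDATION]TIDE")
print(result)  # PEPM[oxidation]TIDE
```

Passing a function instead of a replacement string lets `re.sub` compute each replacement from the match it found.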