Parameter homogenization#
When adding a new parameter parser to ProteoBench, all extracted parameters must be homogenized to a canonical format before being stored. This ensures that the same setting (e.g. “Carbamidomethylation on cysteine”) is represented identically regardless of which software tool produced it. Without homogenization, filtering and comparison across tools would not work as desired.
This page documents the canonical formats and the reusable helpers available for each parameter type.
Modifications#
Target format (ProForma-like)#
All modifications must be converted to a ProForma-like notation:
Residue[ModificationName]
Multiple modifications are separated by , (comma-space).
Raw input (examples) |
Canonical output |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Terminal modifications use these residue prefixes:
Prefix |
Meaning |
|---|---|
|
Peptide N-terminus, or when distinction is not made |
|
Peptide C-terminus, or when distinction is not made |
|
Protein N-terminus |
|
Protein C-terminus |
Choosing a homogenization approach#
Different software tools use different notation styles. Pick the approach that matches your tool’s raw format:
1. MaxQuant-style: ModName (Residues)#
Tools using this notation: MaxQuant, Spectronaut, Proline, quantms, MSAID, MSAngel (Mascot mode).
Import and use the dynamic parser from maxquant.py:
from proteobench.io.params.maxquant import _homogenize_mods
params.fixed_mods = _homogenize_mods(raw_fixed_mods_string)
params.variable_mods = _homogenize_mods(raw_variable_mods_string)
This parser dynamically handles ModName (Residues) notation. Multi-letter
residues like STY are expanded to one entry per amino acid. Terminal qualifiers
(N-term, Protein N-term, etc.) are recognized automatically. A small
MODIFICATION_MAPPING fallback dict handles edge cases without parentheses
(e.g. Cys-Cys).
The default separator is ,. Pass sep="; " if your tool uses semicolons:
params.fixed_mods = _homogenize_mods(raw_string, sep="; ")
2. MetaMorpheus-style: ModName on Residue#
Tools using this notation: MetaMorpheus.
Use the _homogenize_mod function from metamorpheus.py, which parses
{modname} on {residue} dynamically. Terminal qualifiers (Pep N-Term) and
(Prot N-Term) are detected automatically.
3. X!Tandem-style: ModName of Residue#
Tools using this notation: MSAngel (X!Tandem mode), WOMBAT-P.
Import from msangel.py:
from proteobench.io.params.msangel import _homogenize_mod_xtandem
params.fixed_mods = ", ".join(_homogenize_mod_xtandem(m) for m in raw_list)
4. AlphaDIA-style: ModName@Residue#
Tools using this notation: AlphaDIA.
Use the homogenize_modification_string() function from alphadia.py, which
splits on @ and reformats to Residue[ModName].
5. Mass-shift based#
Tools using this notation: MSFragger/FragPipe, Sage.
When the tool identifies modifications by mass shift rather than name, use a
MASS_TO_MOD dictionary to look up the modification name by mass within a
tolerance (typically 0.001 Da). See fragger.py and sage.py for examples.
For unknown masses, fall back to using the mass value as the name:
K[4.025107].
6. Static mapping#
When none of the above patterns apply (e.g. PEAKS, i2MassChroQ, AlphaPept),
define a MODIFICATION_MAPPING dictionary in your parser module that maps raw
strings to canonical format:
MODIFICATION_MAPPING = {
"Carbamidomethylation (+57.02)": "C[Carbamidomethyl]",
"Oxidation (M) (+15.99)": "M[Oxidation]",
}
Apply it when assigning:
params.fixed_mods = ", ".join(MODIFICATION_MAPPING.get(m, m) for m in raw_list)
Enzyme names#
Enzyme normalization is handled centrally by the _ENZYME_MAP in
proteobench/io/params/__init__.py. It is applied automatically when
params.fill_none() or normalize_dataframe_columns() is called.
If your tool uses a non-standard enzyme name, add a lowercase mapping to
_ENZYME_MAP:
_ENZYME_MAP = {
"trypsin": "Trypsin",
"trypsin/p": "Trypsin/P",
"stricttrypsin": "Trypsin/P",
"lys-c": "Lys-C",
# ... add new mappings here (keys must be lowercase)
}
Individual parsers do not need to normalize enzyme names themselves, though doing so is harmless.
Mass tolerances#
Tolerances should be formatted as bracket-enclosed ranges with units:
[-10 ppm, 10 ppm]
[-0.02 Da, 0.02 Da]
If the tool uses automatic calibration (e.g. DIA-NN, Spectronaut), store
"Dynamic" or "0" - these should be normalized to "Automatic calibration".
Integer parameters#
The following parameters must be stored as integers (or None if unavailable):
allowed_miscleavagesmin_peptide_lengthmax_peptide_lengthmax_modsmin_precursor_chargemax_precursor_charge
Make sure to convert string values with int() in your parser. The central
normalization also coerces these fields via pd.to_numeric(), but it is best
practice to store them as integers from the start.
Boolean parameters#
enable_match_between_runs and semi_enzymatic should be stored as Python
bool values (True/False), not strings.
FDR values#
FDR values should be stored as decimals between 0 and 1 (e.g. 0.01 for 1%).
If the tool reports percentages, divide by 100. The central normalization in
__init__.py also handles this: values >= 1 are automatically divided by 100.
Missing values#
If a parameter is not available for a given tool, set it to None. The
params.fill_none() call replaces sentinel strings like "not specified",
"N/A", "None", and "" with proper None/NaN values.
Checklist for new parsers#
When adding a new parameter parser:
Create
proteobench/io/params/<toolname>.pyImplement
extract_params()returning aProteoBenchParametersobjectHomogenize modifications using one of the approaches above, or a new approach
Format tolerances as
[-value unit, value unit]Store FDR as decimal (0-1)
Add any new enzyme names to
_ENZYME_MAPin__init__.pyCall
params.fill_none()before returningAdd test data files and expected output CSVs in
test/params/Run existing tests to verify nothing is broken
Note#
If you notice anything wrong with the current parameter parsing approaches, do not hesitate to create an issue on GitHub