proteobench.io.params package

Parameter handling for ProteoBench.

ProteoBenchParameters is initialized from a JSON field-definition file (default: json/Quant/quant_lfq_DDA_ion.json) and populated by per-software extract_params functions in the sibling modules. After population, every parser calls fill_none(), which coerces values to canonical types via normalize().

normalize_dataframe_columns applies the same coercion rules to a full DataFrame of historical datapoints loaded from the results repository.

Normalization rules (applied by normalize() / normalize_dataframe_columns):

  • Missing sentinel strings ("None", "N/A", "", "unknown", etc.) → np.nan

  • ident_fdr_psm, ident_fdr_peptide, ident_fdr_protein → float in [0, 1]; values ≥ 1 are treated as percentages and divided by 100

  • allowed_miscleavages, min/max_peptide_length, min/max_precursor_charge, max_mods, min/max_precursor_mz, min/max_fragment_mz, n_beams, n_peaks, min_mz, max_mz → int

  • enable_match_between_runs → bool

  • enzyme → canonical name via _ENZYME_MAP (e.g. "trypsin" → "Trypsin", "kr|p,true" → "Trypsin")

  • precursor_mass_tolerance, fragment_mass_tolerance → mapped to "Automatic calibration" when a known auto-calibration sentinel is detected (e.g. "dynamic", "0 ppm")
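The value-level rules above can be illustrated with a minimal, self-contained sketch. It does not reproduce the package's code: the helper names, the case-insensitive matching, the pass-through of unknown enzyme names, and the three lookup tables (which contain only the examples quoted in the list) are assumptions made for illustration. math.nan stands in for np.nan; both are the same float NaN. The FDR helper follows the "≥ 1" threshold stated in the list.

```python
import math

# Illustrative subsets only; the real sentinel list and _ENZYME_MAP are assumed to be larger.
MISSING_SENTINELS = {"none", "n/a", "", "unknown"}
ENZYME_MAP = {"trypsin": "Trypsin", "kr|p,true": "Trypsin"}
AUTO_CALIBRATION_SENTINELS = {"dynamic", "0 ppm"}


def is_missing(value):
    """Return True for None, NaN, or a missing-sentinel string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_SENTINELS


def normalize_fdr(value):
    """Coerce an FDR to float; values >= 1 are read as percentages."""
    if is_missing(value):
        return math.nan
    fdr = float(value)
    return fdr / 100 if fdr >= 1 else fdr


def normalize_enzyme(value):
    """Map an enzyme spelling to its canonical name; unknown names pass through."""
    if is_missing(value):
        return math.nan
    return ENZYME_MAP.get(str(value).strip().lower(), value)


def normalize_tolerance(value):
    """Replace auto-calibration sentinels; leave other tolerance strings as-is."""
    if is_missing(value):
        return math.nan
    if str(value).strip().lower() in AUTO_CALIBRATION_SENTINELS:
        return "Automatic calibration"
    return value
```

Under these assumptions, normalize_fdr("1") and normalize_fdr(0.01) both yield 0.01, and normalize_tolerance("20 ppm") returns its input unchanged.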

NOT normalized (kept exactly as set by the parsers; each parser is responsible for homogenizing these values itself):

  • precursor_mass_tolerance, fragment_mass_tolerance (apart from the auto-calibration mapping above), remove_precursor_tol – string, format varies by tool

  • fixed_mods, variable_mods – string, tool-specific format

  • quantification_method, protein_inference, abundance_normalization_ions – string

  • software_name, software_version, search_engine, search_engine_version – string

  • min_intensity, max_intensity – float/int, kept as-is

  • tokens – string, semicolon-separated amino acids/modifications

  • isotope_error_range – string (e.g. "[0, 2]")

  • decoding_strategy, checkpoint – string, tool-specific

Classes

ProteoBenchParameters – Parameter container initialized from a JSON field-definition file.

Functions

normalize_dataframe_columns – Apply the same normalization rules to a historical-results DataFrame.

class proteobench.io.params.ProteoBenchParameters(filename='/home/docs/checkouts/readthedocs.org/user_builds/proteobench/envs/v0.16.3/lib/python3.11/site-packages/proteobench/io/params/json/Quant/quant_lfq_DDA_ion.json', **kwargs)

Bases: object

Parameter container for a single ProteoBench submission.

Attributes are determined at runtime by the JSON field-definition file; only fields present in that file are set as instance attributes.

Parameters:
  • filename (str or os.PathLike) – Path to a JSON field-definition file. Defaults to json/Quant/quant_lfq_DDA_ion.json (relative to this package).

  • **kwargs – Optional attribute overrides applied after JSON initialization. A string value of "None" is coerced to np.nan.
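To illustrate attributes that are determined at runtime, the sketch below builds a container from a small field-definition file. The class name JsonParamsSketch, the file layout (a mapping from field name to a dict with an optional "value" key), and the choice to ignore overrides for fields that are not in the file are assumptions made for this example, not the package's documented behaviour. math.nan stands in for np.nan.

```python
import json
import math
import tempfile
from pathlib import Path


class JsonParamsSketch:
    """Hypothetical stand-in for a JSON-driven parameter container."""

    def __init__(self, filename, **kwargs):
        with open(filename, encoding="utf-8") as handle:
            field_definitions = json.load(handle)

        # One attribute per field in the file (assumed layout: {"name": {"value": ...}}).
        for name, definition in field_definitions.items():
            setattr(self, name, definition.get("value"))

        # Overrides are applied afterwards; the string "None" becomes NaN.
        for name, value in kwargs.items():
            if name in field_definitions:
                setattr(self, name, math.nan if value == "None" else value)


# Write a tiny field-definition file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fields.json"
    path.write_text(json.dumps({"software_name": {}, "enzyme": {"value": "Trypsin"}}))
    params = JsonParamsSketch(path, software_name="MaxQuant", enzyme="None", not_a_field=1)

print(params.software_name)            # MaxQuant
print(hasattr(params, "not_a_field"))  # False
```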

fill_none()

Convert string "None" sentinels to np.nan and call normalize().

Every extract_params function should call this at the end of parameter extraction so that normalization is applied uniformly.

normalize()

Coerce parsed parameter values to their canonical types.

This method is called automatically at the end of fill_none() so that every parser benefits without per-parser changes.

Normalization rules

  1. Any attribute whose value is a missing sentinel string (e.g. "not specified", "N/A", "None", "") is set to np.nan.

  2. FDR fields are coerced to float in the range [0, 1]. Values > 1 are assumed to be percentages and divided by 100.

  3. Integer fields (e.g. allowed miscleavages, peptide length, precursor charge, max_mods) are coerced to int.

  4. enable_match_between_runs is coerced to bool.

  5. enzyme is mapped to a canonical name via _ENZYME_MAP.
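A minimal sketch of how fill_none() and normalize() could fit together, covering the "None" handling plus rules 3 and 4. The class NormalizingParamsSketch, the helper extract_params_sketch (which takes a plain dict rather than a tool's parameter file), the short INT_FIELDS tuple, and the set of strings read as True are all hypothetical. math.nan stands in for np.nan, and the real methods may handle more cases.

```python
import math

INT_FIELDS = ("allowed_miscleavages", "min_peptide_length", "max_peptide_length", "max_mods")
TRUE_STRINGS = {"true", "yes", "1"}  # assumed spellings of a truthy setting


def _is_present(value):
    """True unless the value is None or NaN."""
    return value is not None and not (isinstance(value, float) and math.isnan(value))


class NormalizingParamsSketch:
    """Hypothetical stand-in for the parameter container."""

    def __init__(self):
        self.allowed_miscleavages = None
        self.min_peptide_length = None
        self.max_peptide_length = None
        self.max_mods = None
        self.enable_match_between_runs = None

    def fill_none(self):
        # Turn "None" strings into NaN, then normalize everything else.
        for name, value in list(vars(self).items()):
            if value == "None":
                setattr(self, name, math.nan)
        self.normalize()

    def normalize(self):
        for name in INT_FIELDS:
            value = getattr(self, name)
            if _is_present(value):
                # Go through float so that strings such as "2.0" are accepted.
                setattr(self, name, int(float(value)))
        mbr = self.enable_match_between_runs
        if isinstance(mbr, str):
            self.enable_match_between_runs = mbr.strip().lower() in TRUE_STRINGS
        elif _is_present(mbr):
            self.enable_match_between_runs = bool(mbr)


def extract_params_sketch(raw):
    """Hypothetical parser: copy raw settings, then normalize once at the end."""
    params = NormalizingParamsSketch()
    for name, value in raw.items():
        if hasattr(params, name):
            setattr(params, name, value)
    params.fill_none()
    return params


params = extract_params_sketch(
    {"allowed_miscleavages": "2", "max_mods": "None", "enable_match_between_runs": "True"}
)
print(params.allowed_miscleavages, params.enable_match_between_runs)  # 2 True
```

Because the parser ends with a single fill_none() call, the coercion logic lives in one place and each parser only has to copy raw values.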

proteobench.io.params.normalize_dataframe_columns(df: DataFrame) → DataFrame

Apply the same coercion rules as ProteoBenchParameters.normalize() to an entire DataFrame of historical results.

Operates in place on df and returns the same object for convenience.
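The in-place, column-wise behaviour can be sketched with pandas as follows. This is an illustration under stated assumptions, not the package's implementation: the function name normalize_columns_sketch and the two-entry ENZYME_MAP are hypothetical, only two columns are handled, and the FDR threshold follows the "≥ 1" rule from the module overview.

```python
import pandas as pd

# Illustrative subset; the package's _ENZYME_MAP is assumed to be larger.
ENZYME_MAP = {"trypsin": "Trypsin", "kr|p,true": "Trypsin"}


def normalize_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize two example columns in place and return the same DataFrame."""
    if "ident_fdr_psm" in df.columns:
        # Unparseable entries such as "N/A" become NaN.
        fdr = pd.to_numeric(df["ident_fdr_psm"], errors="coerce")
        # Keep values below 1; treat the rest as percentages.
        df["ident_fdr_psm"] = fdr.where(fdr < 1, fdr / 100)
    if "enzyme" in df.columns:
        df["enzyme"] = df["enzyme"].map(
            lambda name: ENZYME_MAP.get(str(name).strip().lower(), name)
        )
    return df


history = pd.DataFrame(
    {
        "ident_fdr_psm": ["0.01", "1", "N/A"],
        "enzyme": ["trypsin", "Trypsin", "kr|p,true"],
    }
)
result = normalize_columns_sketch(history)
print(result is history)  # True
```

Returning the mutated DataFrame allows the call to be chained while callers that ignore the return value still see the normalized columns.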

Submodules