proteobench.datapoint.quant_datapoint module#

This module provides functionality for handling and processing quantitative datapoints in the ProteoBench framework.

class proteobench.datapoint.quant_datapoint.QuantDatapointHYE(id: str = None, software_name: str = None, software_version: int = 0, search_engine: str = None, search_engine_version: int = 0, ident_fdr_psm: int = 0, ident_fdr_peptide: int = 0, ident_fdr_protein: int = 0, enable_match_between_runs: bool = False, precursor_mass_tolerance: str = None, fragment_mass_tolerance: str = None, enzyme: str = None, allowed_miscleavages: int = 0, min_peptide_length: int = 0, max_peptide_length: int = 0, is_temporary: bool = True, intermediate_hash: str = '', results: dict = None, median_abs_epsilon_global: float = 0, mean_abs_epsilon_global: float = 0, median_abs_epsilon_eq_species: float = 0, mean_abs_epsilon_eq_species: float = 0, median_abs_epsilon_precision_global: float = 0, mean_abs_epsilon_precision_global: float = 0, median_abs_epsilon_precision_eq_species: float = 0, mean_abs_epsilon_precision_eq_species: float = 0, nr_prec: int = 0, comments: str = '', proteobench_version: str = '')[source]#

Bases: DatapointBase

A data structure used to store the results of a quantification benchmark run.

This class extends DatapointBase to implement quantification-specific metrics and metadata storage for label-free quantification (LFQ) benchmarking runs.

id#

Unique identifier for the benchmark run.

Type:

str

software_name#

Name of the software used in the benchmark.

Type:

str

software_version#

Version of the software.

Type:

str

search_engine#

Name of the search engine used.

Type:

str

search_engine_version#

Version of the search engine.

Type:

str

ident_fdr_psm#

False discovery rate for peptide-spectrum matches (PSMs).

Type:

float

ident_fdr_peptide#

False discovery rate for peptides.

Type:

float

ident_fdr_protein#

False discovery rate for proteins.

Type:

float

enable_match_between_runs#

Whether matching between runs is enabled.

Type:

bool

precursor_mass_tolerance#

Mass tolerance for precursor ions.

Type:

str

fragment_mass_tolerance#

Mass tolerance for fragment ions.

Type:

str

enzyme#

Enzyme used for digestion.

Type:

str

allowed_miscleavages#

Number of allowed miscleavages.

Type:

int

min_peptide_length#

Minimum peptide length.

Type:

int

max_peptide_length#

Maximum peptide length.

Type:

int

is_temporary#

Whether the data is temporary.

Type:

bool

intermediate_hash#

Hash of the intermediate result.

Type:

str

results#

A dictionary of metrics for the benchmark run.

Type:

dict

median_abs_epsilon_global#

Median absolute epsilon value for the benchmark.

Type:

float

mean_abs_epsilon_global#

Mean absolute epsilon value for the benchmark.

Type:

float

median_abs_epsilon_eq_species#

Median absolute epsilon value for equivalently weighted species.

Type:

float

mean_abs_epsilon_eq_species#

Mean absolute epsilon value for equivalently weighted species.

Type:

float

median_abs_epsilon_precision_global#

Median absolute precision epsilon (deviation from empirical center).

Type:

float

mean_abs_epsilon_precision_global#

Mean absolute precision epsilon (deviation from empirical center).

Type:

float

median_abs_epsilon_precision_eq_species#

Median absolute precision epsilon for equivalently weighted species.

Type:

float

mean_abs_epsilon_precision_eq_species#

Mean absolute precision epsilon for equivalently weighted species.

Type:

float

nr_prec#

Number of precursors identified.

Type:

int

comments#

Any additional comments.

Type:

str

proteobench_version#

Version of the ProteoBench tool used.

Type:

str

allowed_miscleavages: int = 0#
comments: str = ''#
enable_match_between_runs: bool = False#
enzyme: str = None#
fragment_mass_tolerance: str = None#
static generate_datapoint(intermediate: DataFrame, input_format: str, user_input: dict, default_cutoff_min_prec: int = 3) Series[source]#

Generate a Datapoint object containing metadata and results from the benchmark run.

Parameters:
  • intermediate (pd.DataFrame) – The intermediate DataFrame containing benchmark results.

  • input_format (str) – The format of the input data (e.g., file format).

  • user_input (dict) – User-defined input values for the benchmark.

  • default_cutoff_min_prec (int, optional) – The default minimum precursor cutoff value. Defaults to 3.

Returns:

A Pandas Series containing the Datapoint’s attributes as key-value pairs.

Return type:

pd.Series
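
As a rough illustration of "attributes as key-value pairs", the sketch below flattens a small record into a mapping. `MiniDatapoint` is a hypothetical, trimmed-down stand-in with a handful of the fields listed above, and it assumes a dataclass-style record; it is not the ProteoBench class, and it produces a plain dict rather than a pd.Series.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class MiniDatapoint:
    """Hypothetical stand-in carrying a few of the documented fields."""

    id: Optional[str] = None
    software_name: Optional[str] = None
    is_temporary: bool = True
    results: dict = field(default_factory=dict)
    nr_prec: int = 0


dp = MiniDatapoint(
    software_name="MaxQuant",
    results={3: {"nr_prec": 1200}},
    nr_prec=1200,
)

# asdict() walks the dataclass fields and returns attribute name -> value.
record = asdict(dp)
print(sorted(record))
```

With pandas available, passing such a mapping to `pd.Series` would give the Series form described above.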

generate_id() None[source]#

Generate a unique ID for the benchmark run by combining the software name and a timestamp.

This ID is used to uniquely identify each run of the benchmark.
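
A minimal sketch of how such an ID could be assembled, assuming the software name and the timestamp are joined with an underscore. The helper name `make_run_id` and the timestamp layout are illustrative assumptions, not the format ProteoBench actually uses.

```python
from datetime import datetime
from typing import Optional


def make_run_id(software_name: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: join a software name and a timestamp into an ID."""
    # Accept an explicit timestamp so the result is reproducible in tests.
    now = now or datetime.now()
    # Assumed layout: <software_name>_<YYYYMMDD>_<HHMMSS>_<microseconds>
    return f"{software_name}_{now.strftime('%Y%m%d_%H%M%S_%f')}"


run_id = make_run_id("MaxQuant", datetime(2024, 1, 31, 12, 0, 0))
print(run_id)  # MaxQuant_20240131_120000_000000
```

Including microseconds in the timestamp makes collisions between two runs of the same software unlikely, though it does not strictly guarantee uniqueness.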

get_count_metrics(min_nr_observed: int) dict[str, int][source]#

Compute precursor counts (total and per-species).

get_cv_metrics(min_nr_observed: int) dict[str, float][source]#

Compute quantiles of the coefficient of variation (CV).

get_epsilon_metrics(min_nr_observed: int, agg: str = 'median') dict[str, float][source]#

Compute epsilon-based accuracy metrics using specified aggregation.

Parameters:
  • df (pd.DataFrame) – DataFrame with epsilon column (deviation from expected ratio)

  • min_nr_observed (int) – Filter threshold for minimum observations

  • agg (str) – Aggregation method: “median” or “mean”

Returns:

Accuracy metrics: global, equal-species average, and per-species values

Return type:

dict
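
The three levels of aggregation named above (global, equal-species average, per-species) can be sketched with the standard library alone. This is an illustrative stand-in, not the ProteoBench implementation: it works on a list of dicts instead of a DataFrame, the `nr_observed` field name and the per-species key pattern are assumptions, and the equal-species value is assumed to be the plain mean of the per-species values.

```python
from statistics import mean, median

AGGREGATORS = {"median": median, "mean": mean}


def epsilon_metrics(rows, min_nr_observed, agg="median"):
    """Aggregate absolute epsilon globally, per species, and species-balanced."""
    aggregate = AGGREGATORS[agg]
    # Keep only precursors observed often enough.
    kept = [r for r in rows if r["nr_observed"] >= min_nr_observed]
    if not kept:
        return {}
    # Group absolute epsilon values by species.
    by_species = {}
    for r in kept:
        by_species.setdefault(r["species"], []).append(abs(r["epsilon"]))
    per_species = {sp: aggregate(vals) for sp, vals in by_species.items()}
    metrics = {
        # Every precursor counts once, so large species dominate.
        f"{agg}_abs_epsilon_global": aggregate([abs(r["epsilon"]) for r in kept]),
        # Every species counts once, regardless of its size.
        f"{agg}_abs_epsilon_eq_species": mean(list(per_species.values())),
    }
    for sp, value in per_species.items():
        metrics[f"{agg}_abs_epsilon_{sp}"] = value
    return metrics


rows = [
    {"species": "HUMAN", "epsilon": 0.25, "nr_observed": 3},
    {"species": "HUMAN", "epsilon": -0.75, "nr_observed": 3},
    {"species": "HUMAN", "epsilon": 0.5, "nr_observed": 3},
    {"species": "YEAST", "epsilon": 2.0, "nr_observed": 3},
    {"species": "YEAST", "epsilon": 4.0, "nr_observed": 1},  # filtered out
]
print(epsilon_metrics(rows, min_nr_observed=3))
```

In this toy data the global median (0.625) is pulled toward the more numerous HUMAN precursors, while the equal-species average (1.25) gives YEAST the same weight as HUMAN.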

get_metrics(min_nr_observed: int = 1) dict[int, dict[str, float]][source]#

Compute all benchmark metrics (backward compatible wrapper).

Merges the epsilon (accuracy), precision, CV, ROC, and count metrics.

static get_metrics_old(df: DataFrame, min_nr_observed: int = 1) Dict[int, Dict[str, float]][source]#

Compute various statistical metrics from the provided DataFrame for the benchmark.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the benchmark results.

  • min_nr_observed (int, optional) – The minimum number of observed values for a valid computation. Defaults to 1.

Returns:

A dictionary containing computed metrics such as ‘median_abs_epsilon’, ‘variance_epsilon’, etc.

Return type:

dict

get_precision_metrics(min_nr_observed: int, agg: str = 'median') dict[str, float][source]#

Compute precision metrics directly from log2FC (log2_A_vs_B) column.

Precision measures deviation from the empirical center (reproducibility), computed independently from expected ratios.

Parameters:
  • df (pd.DataFrame) – DataFrame with log2_A_vs_B and species columns

  • min_nr_observed (int) – Filter threshold for minimum observations

  • agg (str) – Aggregation method: “median” or “mean”

Returns:

Precision metrics, including:
  • {agg}_log2_empirical_{species}: Center of the log2FC distribution per species

  • {agg}_abs_epsilon_precision_global: Global aggregated precision

  • {agg}_abs_epsilon_precision_eq_species: Equal-weighted species average

  • {agg}_abs_epsilon_precision_{species}: Per-species precision values

Return type:

dict
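
To make "deviation from the empirical center" concrete, here is a small standard-library sketch. It is not the library code: it assumes each precursor's deviation is measured against the center of its own species, uses a list of dicts in place of a DataFrame, and the `nr_observed` field name is hypothetical. The output keys follow the patterns listed above.

```python
from statistics import mean, median

AGGREGATORS = {"median": median, "mean": mean}


def precision_metrics(rows, min_nr_observed, agg="median"):
    """Measure spread around each species' own empirical log2FC center."""
    aggregate = AGGREGATORS[agg]
    kept = [r for r in rows if r["nr_observed"] >= min_nr_observed]
    by_species = {}
    for r in kept:
        by_species.setdefault(r["species"], []).append(r["log2_A_vs_B"])

    metrics, all_devs, per_species = {}, [], []
    for species, values in by_species.items():
        # The center comes from the data itself, not from the expected ratio.
        center = aggregate(values)
        devs = [abs(v - center) for v in values]
        metrics[f"{agg}_log2_empirical_{species}"] = center
        metrics[f"{agg}_abs_epsilon_precision_{species}"] = aggregate(devs)
        per_species.append(aggregate(devs))
        all_devs.extend(devs)

    if all_devs:
        metrics[f"{agg}_abs_epsilon_precision_global"] = aggregate(all_devs)
        metrics[f"{agg}_abs_epsilon_precision_eq_species"] = mean(per_species)
    return metrics


rows = [
    {"species": "HUMAN", "log2_A_vs_B": -0.25, "nr_observed": 3},
    {"species": "HUMAN", "log2_A_vs_B": 0.0, "nr_observed": 3},
    {"species": "HUMAN", "log2_A_vs_B": 0.25, "nr_observed": 3},
    {"species": "YEAST", "log2_A_vs_B": 0.5, "nr_observed": 3},
    {"species": "YEAST", "log2_A_vs_B": 1.0, "nr_observed": 3},
    {"species": "YEAST", "log2_A_vs_B": 2.5, "nr_observed": 3},
]
print(precision_metrics(rows, min_nr_observed=3))
```

Because no expected ratio enters the calculation, a tool with a constant offset in its fold changes can still score well here even if its epsilon-based accuracy is poor.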

get_roc_metrics(min_nr_observed: int) dict[str, float][source]#

Compute ROC-AUC metrics.

id: str = None#
ident_fdr_peptide: int = 0#
ident_fdr_protein: int = 0#
ident_fdr_psm: int = 0#
intermediate_hash: str = ''#
is_temporary: bool = True#
max_peptide_length: int = 0#
mean_abs_epsilon_eq_species: float = 0#
mean_abs_epsilon_global: float = 0#
mean_abs_epsilon_precision_eq_species: float = 0#
mean_abs_epsilon_precision_global: float = 0#
median_abs_epsilon_eq_species: float = 0#
median_abs_epsilon_global: float = 0#
median_abs_epsilon_precision_eq_species: float = 0#
median_abs_epsilon_precision_global: float = 0#
min_peptide_length: int = 0#
nr_prec: int = 0#
precursor_mass_tolerance: str = None#
proteobench_version: str = ''#
results: dict = None#
search_engine: str = None#
search_engine_version: int = 0#
software_name: str = None#
software_version: int = 0#
proteobench.datapoint.quant_datapoint.compute_roc_auc(df: DataFrame, unchanged_species: str = None) float[source]#

Compute ROC-AUC for distinguishing unchanged from changed species.

Uses absolute log2 fold change as the score to separate species that should show no change (e.g., HUMAN with 1:1 ratio) from species that should show change (e.g., YEAST, ECOLI with different ratios).

Parameters:
  • df (pd.DataFrame) – DataFrame with ‘species’ and ‘log2_A_vs_B’ columns. Optionally ‘log2_expectedRatio’ for auto-detecting unchanged species.

  • unchanged_species (str, optional) – Species name for the unchanged/control group. If None, auto-detects from data as the species with smallest absolute expected log2 ratio.

Returns:

ROC-AUC score, or np.nan if computation is not possible (e.g., only one class present or all scores are NaN).

Return type:

float
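
The idea can be sketched without any third-party dependency by using the rank interpretation of ROC-AUC: the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as half. The function names and the list-of-dicts input are assumptions made for illustration; this is a stand-in, not the library's code.

```python
import math


def roc_auc(pos_scores, neg_scores):
    """ROC-AUC as P(positive score > negative score), ties counting half."""
    if not pos_scores or not neg_scores:
        return math.nan  # only one class present
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def abs_roc_auc(rows, unchanged_species=None):
    """Score each row by |log2FC|; changed species are the positive class."""
    # Drop rows without a usable score.
    rows = [r for r in rows if not math.isnan(r["log2_A_vs_B"])]
    if not rows:
        return math.nan
    if unchanged_species is None:
        # Auto-detect: the species whose expected log2 ratio is closest to 0.
        closest = min(rows, key=lambda r: abs(r["log2_expectedRatio"]))
        unchanged_species = closest["species"]
    changed = [abs(r["log2_A_vs_B"]) for r in rows
               if r["species"] != unchanged_species]
    unchanged = [abs(r["log2_A_vs_B"]) for r in rows
                 if r["species"] == unchanged_species]
    return roc_auc(changed, unchanged)


def make_row(species, expected, observed):
    return {"species": species, "log2_expectedRatio": expected,
            "log2_A_vs_B": observed}


rows = [
    make_row("HUMAN", 0.0, 0.1), make_row("HUMAN", 0.0, -0.2),
    make_row("HUMAN", 0.0, 0.9),
    make_row("YEAST", 1.0, 1.1), make_row("YEAST", 1.0, 0.8),
    make_row("ECOLI", -2.0, -1.9), make_row("ECOLI", -2.0, -0.15),
]
print(abs_roc_auc(rows))  # 0.75
```

Of the 12 changed/unchanged pairs in this toy data, the changed precursor has the larger absolute fold change in 9, which gives 0.75. The pairwise loop is quadratic in the number of rows, so it suits small illustrations rather than full benchmark tables.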

proteobench.datapoint.quant_datapoint.compute_roc_auc_directional(df: DataFrame) float[source]#

Compute directional ROC-AUC for distinguishing changed from unchanged species.

Unlike the abs-based ROC-AUC, this method accounts for the expected direction of fold change for each species:
  • For species with positive expected log2 ratio (e.g., YEAST): Uses raw log2_FC

  • For species with negative expected log2 ratio (e.g., ECOLI): Uses -log2_FC

This approach is more robust to systematic bias where the unchanged species may not be centered at zero.

Parameters:

df (pd.DataFrame) – DataFrame with ‘species’, ‘log2_A_vs_B’, and ‘log2_expectedRatio’ columns.

Returns:

Average ROC-AUC score across all changed species, or np.nan if computation is not possible.

Return type:

float
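
A hedged sketch of the directional variant follows. It makes several assumptions that the text above does not confirm: the unchanged species is taken to be the one with the expected log2 ratio closest to zero, the same sign flip is applied to the unchanged rows when they are compared against a given changed species, and the per-species scores are averaged with a plain mean. All names are illustrative.

```python
import math


def roc_auc(pos_scores, neg_scores):
    """ROC-AUC as P(positive score > negative score), ties counting half."""
    if not pos_scores or not neg_scores:
        return math.nan
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def directional_roc_auc(rows):
    """Average one-vs-unchanged ROC-AUC over all changed species."""
    # Expected log2 ratio per species (assumed constant within a species).
    expected = {r["species"]: r["log2_expectedRatio"] for r in rows}
    if len(expected) < 2:
        return math.nan
    unchanged = min(expected, key=lambda sp: abs(expected[sp]))
    aucs = []
    for species, exp in expected.items():
        if species == unchanged:
            continue
        # Flip the sign so "changed in the expected direction" scores high.
        sign = 1.0 if exp > 0 else -1.0
        pos = [sign * r["log2_A_vs_B"] for r in rows if r["species"] == species]
        neg = [sign * r["log2_A_vs_B"] for r in rows if r["species"] == unchanged]
        aucs.append(roc_auc(pos, neg))
    return sum(aucs) / len(aucs)


def make_row(species, expected, observed):
    return {"species": species, "log2_expectedRatio": expected,
            "log2_A_vs_B": observed}


# HUMAN is unchanged but carries a systematic positive bias.
rows = [
    make_row("HUMAN", 0.0, 0.3), make_row("HUMAN", 0.0, 0.4),
    make_row("HUMAN", 0.0, 0.5),
    make_row("YEAST", 1.0, 1.2), make_row("YEAST", 1.0, 0.45),
    make_row("ECOLI", -2.0, -1.5), make_row("ECOLI", -2.0, 0.35),
]
print(directional_roc_auc(rows))
```

In this toy data the ECOLI precursor observed at 0.35 moved in the wrong direction, yet its absolute value sits inside the biased HUMAN range; the sign flip is what lets the score penalize it relative to most unchanged precursors.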

proteobench.datapoint.quant_datapoint.filter_df_numquant_epsilon(row: Dict[str, Any], min_quant: int = 3, metric: str = 'median', mode: str = 'global') float | None[source]#

Extract the ‘median_abs_epsilon’ value from a row (assumed to be a dictionary).

Parameters:
  • row (dict) – The row from which to extract the value. Expected to be a dictionary.

  • min_quant (int or str, optional) – The key for the desired value. Defaults to 3.

  • metric (str, optional) – The aggregation to use, either “median” or “mean”. Defaults to “median”.

  • mode (str, optional) – The mode of metric calculation, defaults to “global”.

Returns:

The ‘median_abs_epsilon’ value if found, otherwise None.

Return type:

float or None
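
The lookup can be pictured as indexing a nested mapping, first by the quantification cutoff and then by a metric name. The sketch below assumes a row whose `results` entry maps each cutoff to a dict of metrics, and a key pattern of `{metric}_abs_epsilon_{mode}` mirroring the attribute names on the class; both the layout and the helper name `lookup_epsilon` are assumptions, not the documented behavior.

```python
from typing import Any, Dict, Optional


def lookup_epsilon(
    row: Dict[str, Any],
    min_quant=3,
    metric: str = "median",
    mode: str = "global",
) -> Optional[float]:
    """Hypothetical lookup of one epsilon metric at one cutoff."""
    results = row.get("results") or {}
    # A JSON round trip turns integer keys into strings, so try both forms.
    for key in (min_quant, str(min_quant)):
        if key in results:
            return results[key].get(f"{metric}_abs_epsilon_{mode}")
    return None


row = {
    "results": {
        "3": {
            "median_abs_epsilon_global": 0.25,
            "mean_abs_epsilon_eq_species": 0.5,
        }
    }
}
print(lookup_epsilon(row))                                   # 0.25
print(lookup_epsilon(row, metric="mean", mode="eq_species"))  # 0.5
print(lookup_epsilon(row, min_quant=6))                      # None
```

Returning None for a missing cutoff or metric lets a caller apply the function across many rows and drop or flag the incomplete ones afterwards.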

proteobench.datapoint.quant_datapoint.filter_df_numquant_nr_prec(row: Series, min_quant: int = 3) int | None[source]#

Extract the ‘nr_prec’ value from a row (assumed to be a dictionary or Series).

Parameters:
  • row (pd.Series) – The row from which to extract the value. Expected to be a dictionary or Series.

  • min_quant (int or str, optional) – The key for the desired value. Defaults to 3.

Returns:

The ‘nr_prec’ value if found, otherwise None.

Return type:

int or None