proteobench.datapoint.quant_datapoint module
This module provides functionality for handling and processing quantitative datapoints in the ProteoBench framework.
- class proteobench.datapoint.quant_datapoint.QuantDatapointHYE(id: str = None, software_name: str = None, software_version: int = 0, search_engine: str = None, search_engine_version: int = 0, ident_fdr_psm: int = 0, ident_fdr_peptide: int = 0, ident_fdr_protein: int = 0, enable_match_between_runs: bool = False, precursor_mass_tolerance: str = None, fragment_mass_tolerance: str = None, enzyme: str = None, allowed_miscleavages: int = 0, min_peptide_length: int = 0, max_peptide_length: int = 0, is_temporary: bool = True, intermediate_hash: str = '', results: dict = None, median_abs_epsilon_global: float = 0, mean_abs_epsilon_global: float = 0, median_abs_epsilon_eq_species: float = 0, mean_abs_epsilon_eq_species: float = 0, median_abs_epsilon_precision_global: float = 0, mean_abs_epsilon_precision_global: float = 0, median_abs_epsilon_precision_eq_species: float = 0, mean_abs_epsilon_precision_eq_species: float = 0, nr_prec: int = 0, comments: str = '', proteobench_version: str = '')
Bases: DatapointBase
A data structure used to store the results of a quantification benchmark run.
This class extends DatapointBase to implement quantification-specific metrics and metadata storage for LFQ benchmarking runs.
- median_abs_epsilon_eq_species
Median absolute epsilon value with all species weighted equally.
- Type:
float
- mean_abs_epsilon_eq_species
Mean absolute epsilon value with all species weighted equally.
- Type:
float
- median_abs_epsilon_precision_global
Median absolute precision epsilon (deviation from the empirical center).
- Type:
float
- mean_abs_epsilon_precision_global
Mean absolute precision epsilon (deviation from the empirical center).
- Type:
float
- median_abs_epsilon_precision_eq_species
Median absolute precision epsilon with all species weighted equally.
- Type:
float
- mean_abs_epsilon_precision_eq_species
Mean absolute precision epsilon with all species weighted equally.
- Type:
float
- static generate_datapoint(intermediate: DataFrame, input_format: str, user_input: dict, default_cutoff_min_prec: int = 3, max_nr_observed: int = None) → Series
Generate a Datapoint object containing metadata and results from the benchmark run.
- Parameters:
intermediate (pd.DataFrame) – The intermediate DataFrame containing benchmark results.
input_format (str) – The format of the input data (e.g., file format).
user_input (dict) – User-defined input values for the benchmark.
default_cutoff_min_prec (int, optional) – The default minimum precursor cutoff value. Defaults to 3.
max_nr_observed (int, optional) – Maximum nr_observed value to calculate metrics for. If None, defaults to 6.
- Returns:
A Pandas Series containing the Datapoint’s attributes as key-value pairs.
- Return type:
pd.Series
- generate_id() → None
Generate a unique ID for the benchmark run by combining the software name and a timestamp.
This ID is used to uniquely identify each run of the benchmark.
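As an illustration, an ID of this kind could be built as shown here. This is a minimal sketch, not the ProteoBench implementation: the helper name `make_run_id`, the underscore separator, and the timestamp format are all assumptions made for the example.

```python
from datetime import datetime
from typing import Optional


def make_run_id(software_name: str, now: Optional[datetime] = None) -> str:
    """Combine a software name with a timestamp (hypothetical format)."""
    # Allow the caller to inject a timestamp so the result is reproducible.
    now = now or datetime.now()
    return f"{software_name}_{now.strftime('%Y%m%d_%H%M%S')}"


# A fixed timestamp keeps the example deterministic.
run_id = make_run_id("MaxQuant", datetime(2024, 1, 31, 12, 0, 5))
```

Including the timestamp down to the second makes collisions unlikely when the same tool is benchmarked repeatedly.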
- static get_cv_metrics(df: DataFrame, min_nr_observed: int) → dict[str, float]
Compute quantiles of the coefficient of variation (CV).
- static get_epsilon_metrics(df: DataFrame, min_nr_observed: int, agg: str = 'median') → dict[str, float]
Compute epsilon-based accuracy metrics using specified aggregation.
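- Parameters:
df (pd.DataFrame)
min_nr_observed (int)
agg (str, optional) – Aggregation to use. Defaults to 'median'.
- Returns:
Accuracy metrics: global, equal-species average, and per-species values.
- Return type:
dict[str, float]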
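The three kinds of accuracy metric can be sketched in plain Python. This is a hedged illustration, not the library's code: it assumes epsilon is the observed log2 fold change minus the expected one (`log2_A_vs_B` minus `log2_expectedRatio`, column names borrowed from the `compute_roc_auc` documentation), it uses a list of dicts instead of a DataFrame to stay dependency-free, and the per-species key pattern `{agg}_abs_epsilon_{species}` is an assumption.

```python
from statistics import mean, median


def epsilon_metrics(rows, agg="median"):
    """Aggregate absolute epsilon globally, per species, and species-averaged."""
    agg_fn = {"median": median, "mean": mean}[agg]
    abs_eps_by_species = {}
    for row in rows:
        # Assumed definition: epsilon = observed log2FC - expected log2FC.
        eps = row["log2_A_vs_B"] - row["log2_expectedRatio"]
        abs_eps_by_species.setdefault(row["species"], []).append(abs(eps))

    per_species = {sp: agg_fn(v) for sp, v in abs_eps_by_species.items()}
    all_values = [v for vals in abs_eps_by_species.values() for v in vals]
    metrics = {
        # Every precursor counts once, so large species dominate.
        f"{agg}_abs_epsilon_global": agg_fn(all_values),
        # Every species counts once, regardless of its size.
        f"{agg}_abs_epsilon_eq_species": mean(list(per_species.values())),
    }
    for sp, value in per_species.items():
        metrics[f"{agg}_abs_epsilon_{sp}"] = value
    return metrics


rows = [
    {"species": "HUMAN", "log2_A_vs_B": 0.10, "log2_expectedRatio": 0.0},
    {"species": "HUMAN", "log2_A_vs_B": -0.30, "log2_expectedRatio": 0.0},
    {"species": "YEAST", "log2_A_vs_B": 1.20, "log2_expectedRatio": 1.0},
    {"species": "ECOLI", "log2_A_vs_B": -1.50, "log2_expectedRatio": -2.0},
]
metrics = epsilon_metrics(rows)
```

The global and equal-species values differ whenever species contribute unequal numbers of precursors, which is why both are reported.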
- static get_metrics(df: DataFrame, min_nr_observed: int = 3) → Dict[int, Dict[str, float]]
Compute statistical metrics from the provided DataFrame.
- static get_precision_metrics(df: DataFrame, min_nr_observed: int, agg: str = 'median') → dict[str, float]
Compute precision metrics directly from log2FC (log2_A_vs_B) column.
Precision measures deviation from the empirical center (reproducibility), computed independently from expected ratios.
- Parameters:
df (pd.DataFrame)
min_nr_observed (int)
agg (str, optional) – Aggregation to use. Defaults to 'median'.
- Returns:
Precision metrics, including:
  - {agg}_log2_empirical_{species}: Center of the log2FC distribution per species
  - {agg}_abs_epsilon_precision_global: Global aggregated precision
  - {agg}_abs_epsilon_precision_eq_species: Equal-weighted species average
  - {agg}_abs_epsilon_precision_{species}: Per-species precision values
- Return type:
dict[str, float]
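The difference from the accuracy metrics is that no expected ratio is involved: each species is compared against its own empirical center. A minimal sketch of that idea is given here, using a list of dicts instead of a DataFrame. It is an assumption of this example that the global value aggregates deviations measured from each precursor's own species center.

```python
from statistics import mean, median


def precision_metrics(rows, agg="median"):
    """Measure spread around each species' empirical log2FC center."""
    agg_fn = {"median": median, "mean": mean}[agg]
    fc_by_species = {}
    for row in rows:
        fc_by_species.setdefault(row["species"], []).append(row["log2_A_vs_B"])

    metrics, per_species, all_deviations = {}, {}, []
    for sp, values in fc_by_species.items():
        center = agg_fn(values)  # empirical center, not the expected ratio
        deviations = [abs(v - center) for v in values]
        metrics[f"{agg}_log2_empirical_{sp}"] = center
        per_species[sp] = agg_fn(deviations)
        all_deviations.extend(deviations)

    metrics[f"{agg}_abs_epsilon_precision_global"] = agg_fn(all_deviations)
    metrics[f"{agg}_abs_epsilon_precision_eq_species"] = mean(
        list(per_species.values())
    )
    for sp, value in per_species.items():
        metrics[f"{agg}_abs_epsilon_precision_{sp}"] = value
    return metrics


rows = [
    {"species": "HUMAN", "log2_A_vs_B": 0.1},
    {"species": "HUMAN", "log2_A_vs_B": -0.3},
    {"species": "HUMAN", "log2_A_vs_B": 0.2},
    {"species": "YEAST", "log2_A_vs_B": 1.2},
    {"species": "YEAST", "log2_A_vs_B": 0.8},
]
precision = precision_metrics(rows)
```

A tool with a systematic ratio bias can therefore score well on precision while scoring poorly on accuracy.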
- class proteobench.datapoint.quant_datapoint.QuantDatapointPYE(id: str = None, software_name: str = None, software_version: int = 0, search_engine: str = None, search_engine_version: int = 0, ident_fdr_psm: int = 0, ident_fdr_peptide: int = 0, ident_fdr_protein: int = 0, enable_match_between_runs: bool = False, precursor_mass_tolerance: str = None, fragment_mass_tolerance: str = None, enzyme: str = None, allowed_miscleavages: int = 0, min_peptide_length: int = 0, max_peptide_length: int = 0, is_temporary: bool = True, intermediate_hash: str = '', results: dict = None, median_abs_epsilon_global: float = 0, mean_abs_epsilon_global: float = 0, median_abs_epsilon_eq_species: float = 0, mean_abs_epsilon_eq_species: float = 0, median_abs_epsilon_precision_global: float = 0, mean_abs_epsilon_precision_global: float = 0, median_abs_epsilon_precision_eq_species: float = 0, mean_abs_epsilon_precision_eq_species: float = 0, nr_prec: int = 0, comments: str = '', proteobench_version: str = '')
Bases: QuantDatapointHYE
A data structure used to store the results of a quantification benchmark run for plasma (PYE) setups.
This class extends QuantDatapointHYE to implement plasma-specific metrics and metadata storage for quantification benchmarking runs on plasma samples. The PYE module benchmarks quantification performance across three species (yeast, E. coli, and human plasma) with metrics for visualization in a scatterplot format.
- Inherits all attributes from QuantDatapointHYE.
- median_abs_log2_fc_error_spike_ins
Median absolute log2 fold-change error for yeast and E. coli spike-ins.
- Type:
- nr_quantified_spike_ins
Number of quantified yeast and E. coli spike-in precursors (quantification depth).
- Type:
- dynamic_range_human_plasma
Dynamic range of human plasma precursors (log10 difference between 90th and 10th percentile).
- Type:
- median_abs_epsilon_human_plasma
Median absolute epsilon for human plasma precursors (quantification accuracy).
- Type:
- static generate_datapoint(intermediate: DataFrame, input_format: str, user_input: dict, default_cutoff_min_prec: int = 3, max_nr_observed: int = None) → Series
Generate a Datapoint object containing metadata and results from the plasma benchmark run.
This method extends the parent implementation to compute plasma-specific metrics:
  - Median fold-change error for yeast and E. coli spike-ins (for x-axis)
  - Number of quantified spike-in precursors (for y-axis)
  - Dynamic range of human plasma precursors (for dot size)
  - Quantification accuracy for human plasma (for transparency/opacity)
- Parameters:
intermediate (pd.DataFrame) – The intermediate DataFrame containing benchmark results with species annotations.
input_format (str) – The format of the input data (e.g., file format).
user_input (dict) – User-defined input values for the benchmark.
default_cutoff_min_prec (int, optional) – The default minimum precursor cutoff value. Defaults to 3.
max_nr_observed (int, optional) – Maximum nr_observed value to calculate metrics for. If None, defaults to 6.
- Returns:
A Pandas Series containing the Datapoint’s attributes as key-value pairs.
- Return type:
pd.Series
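Of the plasma-specific metrics, the dynamic range is the one with an explicit formula: the log10 difference between the 90th and 10th percentile. A dependency-free sketch follows. The function name is hypothetical, and it is an assumption that percentiles are taken on log10-transformed intensities with linear interpolation; the actual implementation may differ.

```python
import math
from statistics import quantiles


def dynamic_range_log10(intensities):
    """Return the p90 - p10 spread of log10 intensities."""
    # Non-positive intensities have no logarithm and are skipped.
    logs = sorted(math.log10(x) for x in intensities if x > 0)
    if len(logs) < 2:
        return math.nan
    # n=10 yields nine cut points: the first is p10, the last is p90.
    deciles = quantiles(logs, n=10, method="inclusive")
    return deciles[-1] - deciles[0]


# Eleven intensities spanning 1e1 .. 1e11, i.e. log10 values 1 .. 11.
intensities = [10.0 ** k for k in range(1, 12)]
dr = dynamic_range_log10(intensities)
```

Using percentiles instead of the minimum and maximum keeps a few extreme precursors from inflating the reported range.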
- proteobench.datapoint.quant_datapoint.compute_roc_auc(df: DataFrame, unchanged_species: str = None) → float
Compute ROC-AUC for distinguishing unchanged from changed species.
Uses absolute log2 fold change as the score to separate species that should show no change (e.g., HUMAN with 1:1 ratio) from species that should show change (e.g., YEAST, ECOLI with different ratios).
- Parameters:
df (pd.DataFrame) – DataFrame with ‘species’ and ‘log2_A_vs_B’ columns. Optionally ‘log2_expectedRatio’ for auto-detecting unchanged species.
unchanged_species (str, optional) – Species name for the unchanged/control group. If None, it is auto-detected from the data as the species with the smallest absolute expected log2 ratio.
- Returns:
ROC-AUC score, or np.nan if computation is not possible (e.g., only one class present or all scores are NaN).
- Return type:
float
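The scoring idea can be sketched without any dependencies. This is an illustration only: it computes the AUC through pairwise comparison (the probability that a changed precursor scores higher than an unchanged one, with ties counted as half), which is equivalent to the usual ROC-AUC but is not necessarily how the library computes it. Rows are plain dicts standing in for DataFrame rows.

```python
import math


def roc_auc_abs(rows, unchanged_species=None):
    """AUC for separating changed from unchanged species via |log2FC|."""
    if unchanged_species is None:
        # Auto-detect: the species with the smallest |expected log2 ratio|.
        expected = {r["species"]: abs(r["log2_expectedRatio"]) for r in rows}
        if not expected:
            return math.nan
        unchanged_species = min(expected, key=expected.get)

    changed, unchanged = [], []
    for r in rows:
        fc = r["log2_A_vs_B"]
        if fc is None or math.isnan(fc):
            continue  # missing scores cannot be ranked
        bucket = unchanged if r["species"] == unchanged_species else changed
        bucket.append(abs(fc))

    if not changed or not unchanged:
        return math.nan  # only one class present
    wins = sum((c > u) + 0.5 * (c == u) for c in changed for u in unchanged)
    return wins / (len(changed) * len(unchanged))


rows = [
    {"species": "HUMAN", "log2_A_vs_B": 0.1, "log2_expectedRatio": 0.0},
    {"species": "HUMAN", "log2_A_vs_B": -0.3, "log2_expectedRatio": 0.0},
    {"species": "HUMAN", "log2_A_vs_B": 1.1, "log2_expectedRatio": 0.0},
    {"species": "YEAST", "log2_A_vs_B": 1.2, "log2_expectedRatio": 1.0},
    {"species": "YEAST", "log2_A_vs_B": 0.9, "log2_expectedRatio": 1.0},
    {"species": "ECOLI", "log2_A_vs_B": -1.8, "log2_expectedRatio": -2.0},
    {"species": "ECOLI", "log2_A_vs_B": -0.2, "log2_expectedRatio": -2.0},
]
auc = roc_auc_abs(rows)
```

An AUC of 1.0 would mean every changed precursor has a larger absolute fold change than every unchanged one; 0.5 corresponds to no separation.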
- proteobench.datapoint.quant_datapoint.compute_roc_auc_directional(df: DataFrame) → float
Compute directional ROC-AUC for distinguishing changed from unchanged species.
Unlike the abs-based ROC-AUC, this method accounts for the expected direction of fold change for each species:
  - For species with positive expected log2 ratio (e.g., YEAST): Uses raw log2_FC
  - For species with negative expected log2 ratio (e.g., ECOLI): Uses -log2_FC
This approach is more robust to systematic bias where the unchanged species may not be centered at zero.
- Parameters:
df (pd.DataFrame) – DataFrame with ‘species’, ‘log2_A_vs_B’, and ‘log2_expectedRatio’ columns.
- Returns:
Average ROC-AUC score across all changed species, or np.nan if computation is not possible.
- Return type:
float
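A sketch of the directional variant is given here under stated assumptions: the unchanged species is taken to be the one with the smallest absolute expected ratio, and for each changed species the unchanged scores are flipped with the same sign so that both groups are compared on one axis. Neither detail is confirmed by the documentation above, and the helper names are hypothetical.

```python
import math


def pairwise_auc(pos, neg):
    """P(positive score > negative score), ties counted as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_auc_directional(rows):
    """Average the signed-score AUC of each changed species vs. the control."""
    expected = {r["species"]: r["log2_expectedRatio"] for r in rows}
    if len(expected) < 2:
        return math.nan
    unchanged = min(expected, key=lambda sp: abs(expected[sp]))
    base = [r["log2_A_vs_B"] for r in rows if r["species"] == unchanged]

    aucs = []
    for species, exp in expected.items():
        if species == unchanged:
            continue
        # Orient scores so that "more changed" always means "larger".
        sign = 1.0 if exp > 0 else -1.0
        pos = [sign * r["log2_A_vs_B"] for r in rows if r["species"] == species]
        neg = [sign * v for v in base]
        if pos and neg:
            aucs.append(pairwise_auc(pos, neg))
    return sum(aucs) / len(aucs) if aucs else math.nan


rows = [
    {"species": "HUMAN", "log2_A_vs_B": 0.1, "log2_expectedRatio": 0.0},
    {"species": "HUMAN", "log2_A_vs_B": -0.3, "log2_expectedRatio": 0.0},
    {"species": "HUMAN", "log2_A_vs_B": 1.1, "log2_expectedRatio": 0.0},
    {"species": "YEAST", "log2_A_vs_B": 1.2, "log2_expectedRatio": 1.0},
    {"species": "YEAST", "log2_A_vs_B": 0.9, "log2_expectedRatio": 1.0},
    {"species": "ECOLI", "log2_A_vs_B": -1.8, "log2_expectedRatio": -2.0},
    {"species": "ECOLI", "log2_A_vs_B": -0.2, "log2_expectedRatio": -2.0},
]
directional_auc = roc_auc_directional(rows)
```

Because the sign of the fold change is kept, an unchanged precursor that drifts in the wrong direction no longer counts as evidence of change, which is what makes this variant less sensitive to a shifted baseline.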
- proteobench.datapoint.quant_datapoint.filter_df_numquant_epsilon(row: Dict[str, Any], min_quant: int = 3, metric: str = 'median', mode: str = 'global') → float | None
Extract the ‘median_abs_epsilon’ value from a row (assumed to be a dictionary).
- Parameters:
row (dict) – The row from which to extract the value. Expected to be a dictionary.
min_quant (int or str, optional) – The key for the desired value. Defaults to 3.
metric (str, optional) – The aggregation metric, either “median” or “mean”. Defaults to “median”.
mode (str, optional) – The mode of metric calculation, defaults to “global”.
- Returns:
The ‘median_abs_epsilon’ value if found, otherwise None.
- Return type:
float or None
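The lookup can be pictured as a two-level dictionary access. The sketch here is an assumption about the layout rather than the actual implementation: it supposes the row maps each minimum-quantification threshold to a metrics dict (the shape returned by `get_metrics`) and that metric names follow the `{metric}_abs_epsilon_{mode}` pattern seen in the class attributes. The function name `pick_epsilon` is hypothetical.

```python
from typing import Any, Dict, Optional


def pick_epsilon(
    row: Dict[Any, Any],
    min_quant: Any = 3,
    metric: str = "median",
    mode: str = "global",
) -> Optional[float]:
    """Return one epsilon metric from a nested results dict, or None."""
    if not isinstance(row, dict) or not row:
        return None
    # Keys may be ints or, after a JSON round trip, strings: try both.
    entry = row.get(min_quant, row.get(str(min_quant)))
    if not isinstance(entry, dict):
        return None
    return entry.get(f"{metric}_abs_epsilon_{mode}")


results = {
    3: {"median_abs_epsilon_global": 0.21, "mean_abs_epsilon_eq_species": 0.30},
    4: {"median_abs_epsilon_global": 0.18},
}
default_value = pick_epsilon(results)
mean_eq = pick_epsilon(results, metric="mean", mode="eq_species")
```

Returning None instead of raising keeps the function usable as a row-wise mapper over results that do not all contain every threshold.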