proteobench.datapoint.quant_datapoint module#

This module provides functionality for handling and processing quantitative datapoints in the ProteoBench framework.

class proteobench.datapoint.quant_datapoint.QuantDatapointHYE(id: str = None, software_name: str = None, software_version: int = 0, search_engine: str = None, search_engine_version: int = 0, ident_fdr_psm: int = 0, ident_fdr_peptide: int = 0, ident_fdr_protein: int = 0, enable_match_between_runs: bool = False, precursor_mass_tolerance: str = None, fragment_mass_tolerance: str = None, enzyme: str = None, allowed_miscleavages: int = 0, min_peptide_length: int = 0, max_peptide_length: int = 0, is_temporary: bool = True, intermediate_hash: str = '', results: dict = None, median_abs_epsilon_global: float = 0, mean_abs_epsilon_global: float = 0, median_abs_epsilon_eq_species: float = 0, mean_abs_epsilon_eq_species: float = 0, median_abs_epsilon_precision_global: float = 0, mean_abs_epsilon_precision_global: float = 0, median_abs_epsilon_precision_eq_species: float = 0, mean_abs_epsilon_precision_eq_species: float = 0, nr_prec: int = 0, comments: str = '', proteobench_version: str = '')#

Bases: DatapointBase

A data structure used to store the results of a quantification benchmark run.

This class extends DatapointBase to implement quantification-specific metrics and metadata storage for LFQ benchmarking runs.

id#

Unique identifier for the benchmark run.

Type: str

software_name#

Name of the software used in the benchmark.

Type: str

software_version#

Version of the software.

Type: str

search_engine#

Name of the search engine used.

Type: str

search_engine_version#

Version of the search engine.

Type: str

ident_fdr_psm#

False discovery rate for PSMs.

Type: float

ident_fdr_peptide#

False discovery rate for peptides.

Type: float

ident_fdr_protein#

False discovery rate for proteins.

Type: float

enable_match_between_runs#

Whether matching between runs is enabled.

Type: bool

precursor_mass_tolerance#

Mass tolerance for precursor ions.

Type: str

fragment_mass_tolerance#

Mass tolerance for fragment ions.

Type: str

enzyme#

Enzyme used for digestion.

Type: str

allowed_miscleavages#

Number of allowed miscleavages.

Type: int

min_peptide_length#

Minimum peptide length.

Type: int

max_peptide_length#

Maximum peptide length.

Type: int

is_temporary#

Whether the data is temporary.

Type: bool

intermediate_hash#

Hash of the intermediate result.

Type: str

results#

A dictionary of metrics for the benchmark run.

Type: dict

median_abs_epsilon_global#

Median absolute epsilon value for the benchmark.

Type: float

mean_abs_epsilon_global#

Mean absolute epsilon value for the benchmark.

Type: float

median_abs_epsilon_eq_species#

Median absolute epsilon value for equivalently weighted species.

Type: float

mean_abs_epsilon_eq_species#

Mean absolute epsilon value for equivalently weighted species.

Type: float

median_abs_epsilon_precision_global#

Median absolute precision epsilon (deviation from empirical center).

Type: float

mean_abs_epsilon_precision_global#

Mean absolute precision epsilon (deviation from empirical center).

Type: float

median_abs_epsilon_precision_eq_species#

Median absolute precision epsilon for equivalently weighted species.

Type: float

mean_abs_epsilon_precision_eq_species#

Mean absolute precision epsilon for equivalently weighted species.

Type: float

nr_prec#

Number of precursors identified.

Type: int

comments#

Any additional comments.

Type: str

proteobench_version#

Version of the ProteoBench tool used.

Type: str

allowed_miscleavages: int = 0#

comments: str = ''#

enable_match_between_runs: bool = False#

enzyme: str = None#

fragment_mass_tolerance: str = None#

static generate_datapoint(intermediate: DataFrame, input_format: str, user_input: dict, default_cutoff_min_prec: int = 3, max_nr_observed: int = None) → Series#

Generate a Datapoint object containing metadata and results from the benchmark run.

Parameters:

intermediate (pd.DataFrame) – The intermediate DataFrame containing benchmark results.
input_format (str) – The format of the input data (e.g., file format).
user_input (dict) – User-defined input values for the benchmark.
default_cutoff_min_prec (int, optional) – The default minimum precursor cutoff value. Defaults to 3.
max_nr_observed (int, optional) – Maximum nr_observed value to calculate metrics for. If None, defaults to 6.

Returns:

A Pandas Series containing the Datapoint’s attributes as key-value pairs.

Return type:

pd.Series

generate_id() → None#

Generate a unique ID for the benchmark run by combining the software name and a timestamp.

This ID is used to uniquely identify each run of the benchmark.
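As a rough illustration of the scheme described above, the sketch below joins a software name with a timestamp. The helper name `make_run_id`, the underscore separator, and the timestamp format are assumptions made for this example; the format actually produced by `generate_id` may differ.

```python
from datetime import datetime
from typing import Optional


def make_run_id(software_name: str, now: Optional[datetime] = None) -> str:
    """Build an ID of the form '<software_name>_<YYYYmmdd_HHMMSS>' (hypothetical format)."""
    # Use the supplied timestamp if given, otherwise the current local time.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{software_name}_{stamp}"


example_id = make_run_id("MaxQuant", datetime(2024, 1, 31, 12, 0, 5))
print(example_id)  # MaxQuant_20240131_120005
```

A second-resolution timestamp only makes IDs unique if the same software is not submitted twice within one second, which is worth keeping in mind for automated submissions.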

static get_cv_metrics(df: DataFrame, min_nr_observed: int) → dict[str, float]#

Compute CV quantiles.
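To make "CV quantiles" concrete, here is a minimal standard-library sketch. It assumes the CV is the sample standard deviation divided by the mean of a precursor's replicate intensities, and that the median, 75th, and 90th percentiles are reported. The function names, the input layout (a dict of intensity lists rather than a DataFrame), the percentile choice, and the output keys are all assumptions, not the library's documented behaviour.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def cv_quantiles(intensities_by_precursor, min_nr_observed, percentiles=(50, 75, 90)):
    """Return selected percentiles of the per-precursor CV distribution.

    The percentile choice and the output key names are assumptions.
    """
    # Keep only precursors observed often enough; a CV also needs at least two values.
    cvs = [
        coefficient_of_variation(values)
        for values in intensities_by_precursor.values()
        if len(values) >= max(min_nr_observed, 2)
    ]
    if len(cvs) < 2:
        return {}
    # quantiles(n=100) returns 99 cut points; index p - 1 holds the p-th percentile.
    cuts = statistics.quantiles(cvs, n=100, method="inclusive")
    return {f"CV_q{p}": cuts[p - 1] for p in percentiles}


intensities = {
    "precursor_a": [100.0, 110.0, 90.0],
    "precursor_b": [200.0, 200.0, 200.0],
    "precursor_c": [50.0, 150.0],  # dropped when min_nr_observed is 3
}
print(cv_quantiles(intensities, min_nr_observed=3))
```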

static get_epsilon_metrics(df: DataFrame, min_nr_observed: int, agg: str = 'median') → dict[str, float]#

Compute epsilon-based accuracy metrics using specified aggregation.

Parameters:

df (pd.DataFrame) – DataFrame with epsilon column (deviation from expected ratio)
min_nr_observed (int) – Filter threshold for minimum observations
agg (str) – Aggregation method: “median” or “mean”

Returns:

Accuracy metrics: global, equal-species average, and per-species values

Return type:

dict
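The three kinds of accuracy value named above (global, equal-species average, per-species) can be sketched without pandas as follows. This is an illustration under assumptions: rows are plain dicts with `species`, `epsilon`, and `nr_observed` keys, the global value pools all rows, the equal-species value is the plain mean of the per-species values, and the per-species key names follow the `{agg}_abs_epsilon_{species}` pattern.

```python
import statistics
from collections import defaultdict


def epsilon_metrics(rows, min_nr_observed, agg="median"):
    """Aggregate absolute epsilon globally, per species, and averaged over species."""
    agg_fn = statistics.median if agg == "median" else statistics.mean
    abs_eps_by_species = defaultdict(list)
    for row in rows:
        # Apply the minimum-observations filter before aggregating.
        if row["nr_observed"] >= min_nr_observed:
            abs_eps_by_species[row["species"]].append(abs(row["epsilon"]))
    if not abs_eps_by_species:
        return {}
    per_species = {sp: agg_fn(vals) for sp, vals in abs_eps_by_species.items()}
    pooled = [v for vals in abs_eps_by_species.values() for v in vals]
    metrics = {
        # Global: every precursor counts once, so abundant species dominate.
        f"{agg}_abs_epsilon_global": agg_fn(pooled),
        # Equal-species: every species counts once, regardless of its size.
        f"{agg}_abs_epsilon_eq_species": statistics.mean(list(per_species.values())),
    }
    metrics.update({f"{agg}_abs_epsilon_{sp}": v for sp, v in per_species.items()})
    return metrics


rows = [
    {"species": "HUMAN", "epsilon": 0.1, "nr_observed": 6},
    {"species": "HUMAN", "epsilon": -0.1, "nr_observed": 5},
    {"species": "HUMAN", "epsilon": 0.3, "nr_observed": 4},
    {"species": "YEAST", "epsilon": 0.5, "nr_observed": 3},
    {"species": "YEAST", "epsilon": -0.7, "nr_observed": 6},
    {"species": "ECOLI", "epsilon": 2.0, "nr_observed": 1},  # removed by the filter
]
print(epsilon_metrics(rows, min_nr_observed=3))
```

In this toy input the global median (0.3) and the equal-species value (0.35) differ because HUMAN contributes more rows than YEAST.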

static get_metrics(df: DataFrame, min_nr_observed: int = 3) → Dict[int, Dict[str, float]]#

Compute statistical metrics from the provided DataFrame.

Parameters:

df (pd.DataFrame) – DataFrame containing the intermediate results.
min_nr_observed (int) – Minimum number of observations threshold.

Returns:

Dictionary mapping quantification cutoffs to their computed metrics.

Return type:

Dict[int, Dict[str, float]]
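The nested return shape can be pictured with the sketch below. It assumes that one call produces a single entry keyed by `min_nr_observed`, and that a caller collects entries for cutoffs 1 through 6 (the default `max_nr_observed` mentioned for `generate_datapoint`). Both points, the helper name `metrics_for_cutoff`, and the two metrics shown are assumptions for illustration only.

```python
import math
import statistics


def metrics_for_cutoff(rows, min_nr_observed=3):
    """Return {cutoff: {metric_name: value}} for rows passing the cutoff."""
    kept = [abs(r["epsilon"]) for r in rows if r["nr_observed"] >= min_nr_observed]
    return {
        min_nr_observed: {
            "median_abs_epsilon_global": statistics.median(kept) if kept else math.nan,
            "nr_prec": len(kept),
        }
    }


rows = [
    {"epsilon": 0.2, "nr_observed": 6},
    {"epsilon": -0.4, "nr_observed": 4},
    {"epsilon": 1.0, "nr_observed": 2},
]

# Collect one entry per cutoff; stricter cutoffs keep fewer precursors.
results = {}
for cutoff in range(1, 7):
    results.update(metrics_for_cutoff(rows, cutoff))

print(results[1]["nr_prec"], results[3]["nr_prec"], results[5]["nr_prec"])  # 3 2 1
```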

static get_precision_metrics(df: DataFrame, min_nr_observed: int, agg: str = 'median') → dict[str, float]#

Compute precision metrics directly from log2FC (log2_A_vs_B) column.

Precision measures deviation from the empirical center (reproducibility), computed independently from expected ratios.

Parameters:

df (pd.DataFrame) – DataFrame with log2_A_vs_B and species columns
min_nr_observed (int) – Filter threshold for minimum observations
agg (str) – Aggregation method: “median” or “mean”

Returns:

Precision metrics including:

- {agg}_log2_empirical_{species}: Center of log2FC distribution per species
- {agg}_abs_epsilon_precision_global: Global aggregated precision
- {agg}_abs_epsilon_precision_eq_species: Equal-weighted species average
- {agg}_abs_epsilon_precision_{species}: Per-species precision values

Return type:

dict
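The idea of precision as deviation from the empirical center can be sketched as follows. The key names are taken from the return description above; everything else is an assumption: rows are plain dicts rather than a DataFrame, the center of each species is the aggregate of its own log2FC values, and the global value pools deviations that were each measured against the row's own species center.

```python
import statistics
from collections import defaultdict


def precision_metrics(rows, min_nr_observed, agg="median"):
    """Measure spread around each species' empirical log2FC center."""
    agg_fn = statistics.median if agg == "median" else statistics.mean
    log2fc_by_species = defaultdict(list)
    for row in rows:
        if row["nr_observed"] >= min_nr_observed:
            log2fc_by_species[row["species"]].append(row["log2_A_vs_B"])

    metrics, pooled, per_species = {}, [], []
    for species, values in log2fc_by_species.items():
        center = agg_fn(values)  # empirical center; no expected ratio involved
        deviations = [abs(v - center) for v in values]
        species_precision = agg_fn(deviations)
        metrics[f"{agg}_log2_empirical_{species}"] = center
        metrics[f"{agg}_abs_epsilon_precision_{species}"] = species_precision
        per_species.append(species_precision)
        pooled.extend(deviations)
    if pooled:
        metrics[f"{agg}_abs_epsilon_precision_global"] = agg_fn(pooled)
        metrics[f"{agg}_abs_epsilon_precision_eq_species"] = statistics.mean(per_species)
    return metrics


rows = [
    {"species": "HUMAN", "log2_A_vs_B": v, "nr_observed": 3} for v in (0.0, 0.2, -0.2)
] + [
    {"species": "YEAST", "log2_A_vs_B": v, "nr_observed": 3} for v in (1.0, 1.2, 0.9)
]
print(precision_metrics(rows, min_nr_observed=3))
```

Because the center is estimated from the data, a tool with a constant bias in its fold changes can still score well here; that bias shows up in the epsilon-based accuracy metrics instead.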

id: str = None#

ident_fdr_peptide: int = 0#

ident_fdr_protein: int = 0#

ident_fdr_psm: int = 0#

intermediate_hash: str = ''#

is_temporary: bool = True#

max_peptide_length: int = 0#

mean_abs_epsilon_eq_species: float = 0#

mean_abs_epsilon_global: float = 0#

mean_abs_epsilon_precision_eq_species: float = 0#

mean_abs_epsilon_precision_global: float = 0#

median_abs_epsilon_eq_species: float = 0#

median_abs_epsilon_global: float = 0#

median_abs_epsilon_precision_eq_species: float = 0#

median_abs_epsilon_precision_global: float = 0#

min_peptide_length: int = 0#

nr_prec: int = 0#

precursor_mass_tolerance: str = None#

proteobench_version: str = ''#

results: dict = None#

search_engine: str = None#

search_engine_version: int = 0#

software_name: str = None#

software_version: int = 0#

class proteobench.datapoint.quant_datapoint.QuantDatapointPYE(id: str = None, software_name: str = None, software_version: int = 0, search_engine: str = None, search_engine_version: int = 0, ident_fdr_psm: int = 0, ident_fdr_peptide: int = 0, ident_fdr_protein: int = 0, enable_match_between_runs: bool = False, precursor_mass_tolerance: str = None, fragment_mass_tolerance: str = None, enzyme: str = None, allowed_miscleavages: int = 0, min_peptide_length: int = 0, max_peptide_length: int = 0, is_temporary: bool = True, intermediate_hash: str = '', results: dict = None, median_abs_epsilon_global: float = 0, mean_abs_epsilon_global: float = 0, median_abs_epsilon_eq_species: float = 0, mean_abs_epsilon_eq_species: float = 0, median_abs_epsilon_precision_global: float = 0, mean_abs_epsilon_precision_global: float = 0, median_abs_epsilon_precision_eq_species: float = 0, mean_abs_epsilon_precision_eq_species: float = 0, nr_prec: int = 0, comments: str = '', proteobench_version: str = '')#

Bases: QuantDatapointHYE

A data structure used to store the results of a quantification benchmark run for plasma (PYE) setups.

This class extends QuantDatapointHYE to implement plasma-specific metrics and metadata storage for quantification benchmarking runs on plasma samples. The PYE module benchmarks quantification performance across three species (yeast, E. coli, and human plasma) with metrics for visualization in a scatterplot format.

Inherits all attributes from QuantDatapointHYE.

median_abs_log2_fc_error_spike_ins#

Median absolute log2 fold-change error for yeast and E. coli spike-ins.

Type: float

nr_quantified_spike_ins#

Number of quantified yeast and E. coli spike-in precursors (quantification depth).

Type: int

dynamic_range_human_plasma#

Dynamic range of human plasma precursors (log10 difference between 90th and 10th percentile).

Type: float

median_abs_epsilon_human_plasma#

Median absolute epsilon for human plasma precursors (quantification accuracy).

Type: float
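The dynamic range attribute described above (log10 difference between the 90th and 10th percentile) can be illustrated with a small sketch. It assumes the percentiles are taken on log10-transformed intensities with linear interpolation; the function name, the input (a plain list of intensities), and the order of operations are assumptions and may not match the library.

```python
import math
import statistics


def dynamic_range_log10(intensities):
    """Difference between the 90th and 10th percentile of log10 intensities."""
    logs = [math.log10(x) for x in intensities if x > 0]
    if len(logs) < 2:
        return math.nan
    # quantiles(n=10) returns 9 cut points: index 0 is the 10th percentile,
    # index 8 is the 90th percentile.
    deciles = statistics.quantiles(logs, n=10, method="inclusive")
    return deciles[8] - deciles[0]


# Intensities spanning ten orders of magnitude, one value per decade.
intensities = [10.0 ** k for k in range(11)]
print(dynamic_range_log10(intensities))
```

Using the 10th and 90th percentiles rather than the minimum and maximum keeps a few extreme precursors from inflating the reported range.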

dynamic_range_human_plasma: float = 0.0#

static generate_datapoint(intermediate: DataFrame, input_format: str, user_input: dict, default_cutoff_min_prec: int = 3, max_nr_observed: int = None) → Series#

Generate a Datapoint object containing metadata and results from the plasma benchmark run.

This method extends the parent implementation to compute plasma-specific metrics:

- Median fold-change error for yeast and E. coli spike-ins (for x-axis)
- Number of quantified spike-in precursors (for y-axis)
- Dynamic range of human plasma precursors (for dot size)
- Quantification accuracy for human plasma (for transparency/opacity)

Parameters:

intermediate (pd.DataFrame) – The intermediate DataFrame containing benchmark results with species annotations.
input_format (str) – The format of the input data (e.g., file format).
user_input (dict) – User-defined input values for the benchmark.
default_cutoff_min_prec (int, optional) – The default minimum precursor cutoff value. Defaults to 3.
max_nr_observed (int, optional) – Maximum nr_observed value to calculate metrics for. If None, defaults to 6.

Returns:

A Pandas Series containing the Datapoint’s attributes as key-value pairs.

Return type:

pd.Series
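The two spike-in metrics that drive the scatterplot axes can be sketched as below. This is an assumption-laden illustration: rows are plain dicts, the species labels are `YEAST`, `ECOLI`, and `HUMAN`, the error is the absolute difference between observed and expected log2 fold change, and the helper name `spike_in_metrics` is hypothetical.

```python
import math
import statistics

SPIKE_IN_SPECIES = {"YEAST", "ECOLI"}  # assumed species labels


def spike_in_metrics(rows):
    """Median absolute log2 fold-change error and count for spike-in precursors."""
    errors = [
        abs(r["log2_A_vs_B"] - r["log2_expectedRatio"])
        for r in rows
        if r["species"] in SPIKE_IN_SPECIES
    ]
    return {
        # x-axis of the scatterplot: lower means more accurate spike-in ratios.
        "median_abs_log2_fc_error_spike_ins": statistics.median(errors) if errors else math.nan,
        # y-axis of the scatterplot: quantification depth.
        "nr_quantified_spike_ins": len(errors),
    }


rows = [
    {"species": "YEAST", "log2_A_vs_B": 1.1, "log2_expectedRatio": 1.0},
    {"species": "ECOLI", "log2_A_vs_B": -1.8, "log2_expectedRatio": -2.0},
    {"species": "YEAST", "log2_A_vs_B": 0.7, "log2_expectedRatio": 1.0},
    {"species": "HUMAN", "log2_A_vs_B": 0.05, "log2_expectedRatio": 0.0},  # not a spike-in
]
print(spike_in_metrics(rows))
```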

median_abs_epsilon_human_plasma: float = 0.0#

median_abs_log2_fc_error_spike_ins: float = 0.0#

nr_quantified_spike_ins: int = 0#

proteobench.datapoint.quant_datapoint.compute_roc_auc(df: DataFrame, unchanged_species: str = None) → float#

Compute ROC-AUC for distinguishing unchanged from changed species.

Uses absolute log2 fold change as the score to separate species that should show no change (e.g., HUMAN with 1:1 ratio) from species that should show change (e.g., YEAST, ECOLI with different ratios).

Parameters:

df (pd.DataFrame) – DataFrame with ‘species’ and ‘log2_A_vs_B’ columns. Optionally ‘log2_expectedRatio’ for auto-detecting unchanged species.
unchanged_species (str, optional) – Species name for the unchanged/control group. If None, auto-detects from data as the species with smallest absolute expected log2 ratio.

Returns:

ROC-AUC score, or np.nan if computation is not possible (e.g., only one class present or all scores are NaN).

Return type:

float
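The scoring idea can be reproduced in a few lines without external dependencies. The sketch below swaps in the pairwise (Mann-Whitney) formulation of ROC-AUC, which gives the probability that a randomly chosen changed precursor has a larger absolute log2 fold change than a randomly chosen unchanged one, with ties counted as one half. The function name and the use of plain dicts instead of a DataFrame are assumptions; the library may compute the score differently.

```python
import math


def roc_auc_abs(rows, unchanged_species=None):
    """ROC-AUC with |log2FC| as the score and changed species as the positive class."""
    rows = [r for r in rows if not math.isnan(r["log2_A_vs_B"])]
    if not rows:
        return math.nan
    if unchanged_species is None:
        # Auto-detect: the species whose expected log2 ratio is closest to zero.
        expected = {r["species"]: abs(r["log2_expectedRatio"]) for r in rows}
        unchanged_species = min(expected, key=expected.get)
    changed = [abs(r["log2_A_vs_B"]) for r in rows if r["species"] != unchanged_species]
    unchanged = [abs(r["log2_A_vs_B"]) for r in rows if r["species"] == unchanged_species]
    if not changed or not unchanged:
        return math.nan  # only one class present
    wins = sum((c > u) + 0.5 * (c == u) for c in changed for u in unchanged)
    return wins / (len(changed) * len(unchanged))


def make_rows(species, expected, values):
    return [
        {"species": species, "log2_A_vs_B": v, "log2_expectedRatio": expected}
        for v in values
    ]


rows = (
    make_rows("HUMAN", 0.0, [0.1, -0.2, 0.9])
    + make_rows("YEAST", 1.0, [1.0, 1.2])
    + make_rows("ECOLI", -2.0, [-2.0, -0.5])
)
print(roc_auc_abs(rows))  # 11 of 12 changed/unchanged pairs are ranked correctly
```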

proteobench.datapoint.quant_datapoint.compute_roc_auc_directional(df: DataFrame) → float#

Compute directional ROC-AUC for distinguishing changed from unchanged species.

Unlike the abs-based ROC-AUC, this method accounts for the expected direction of fold change for each species:

- For species with positive expected log2 ratio (e.g., YEAST): Uses raw log2_FC
- For species with negative expected log2 ratio (e.g., ECOLI): Uses -log2_FC

This approach is more robust to systematic bias where the unchanged species may not be centered at zero.

Parameters:

df (pd.DataFrame) – DataFrame with ‘species’, ‘log2_A_vs_B’, and ‘log2_expectedRatio’ columns.

Returns:

Average ROC-AUC score across all changed species, or np.nan if computation is not possible.

Return type:

float
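A dependency-free sketch of the directional variant follows. It assumes that the unchanged species is the one with the smallest absolute expected log2 ratio, that each changed species is compared against the unchanged species separately, and that the unchanged scores receive the same sign flip as the changed species they are compared with. These choices, the pairwise AUC formulation, and the function names are assumptions rather than the library's confirmed behaviour.

```python
import math
from collections import defaultdict


def pairwise_auc(positives, negatives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def roc_auc_directional(rows):
    """Average per-species ROC-AUC using signed log2FC as the score."""
    log2fc = defaultdict(list)
    expected = {}
    for r in rows:
        if not math.isnan(r["log2_A_vs_B"]):
            log2fc[r["species"]].append(r["log2_A_vs_B"])
            expected[r["species"]] = r["log2_expectedRatio"]
    if len(log2fc) < 2:
        return math.nan  # need the unchanged species plus at least one changed species
    unchanged = min(expected, key=lambda sp: abs(expected[sp]))
    aucs = []
    for species, values in log2fc.items():
        if species == unchanged:
            continue
        # Flip the sign for species expected to go down, so "larger" always means "changed".
        sign = 1.0 if expected[species] > 0 else -1.0
        changed_scores = [sign * v for v in values]
        unchanged_scores = [sign * v for v in log2fc[unchanged]]
        aucs.append(pairwise_auc(changed_scores, unchanged_scores))
    return sum(aucs) / len(aucs)


def make_rows(species, expected, values):
    return [
        {"species": species, "log2_A_vs_B": v, "log2_expectedRatio": expected}
        for v in values
    ]


rows = (
    make_rows("HUMAN", 0.0, [0.1, -0.2, 0.9])
    + make_rows("YEAST", 1.0, [1.0, 1.2])
    + make_rows("ECOLI", -2.0, [-2.0, -0.5])
)
print(roc_auc_directional(rows))  # 1.0
```

In this toy input the HUMAN value of 0.9 no longer competes with the E. coli value of -0.5, because after the sign flip they become -0.9 and 0.5. An abs-based score would rank that pair incorrectly.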

proteobench.datapoint.quant_datapoint.filter_df_numquant_epsilon(row: Dict[str, Any], min_quant: int = 3, metric: str = 'median', mode: str = 'global') → float | None#

Extract an absolute-epsilon value (by default the median, in “global” mode) from a row (assumed to be a dictionary).

Parameters:

row (dict) – The row from which to extract the value. Expected to be a dictionary.
min_quant (int or str, optional) – The key for the desired value. Defaults to 3.
metric (str, optional) – The aggregation to extract, either “median” or “mean”. Defaults to “median”.
mode (str, optional) – The mode of metric calculation, defaults to “global”.

Returns:

The requested absolute-epsilon value if found, otherwise None.

Return type:

float or None
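The lookup described above can be pictured with the following sketch. It assumes the row is a results dictionary keyed by the quantification cutoff (as an int, or as a str after a JSON round trip) and that the inner key follows the `{metric}_abs_epsilon_{mode}` pattern used by the class attributes. The function name `lookup_epsilon` and this layout are assumptions for illustration.

```python
from typing import Any, Dict, Optional


def lookup_epsilon(
    results: Dict[Any, Any],
    min_quant: int = 3,
    metric: str = "median",
    mode: str = "global",
) -> Optional[float]:
    """Return results[min_quant]['<metric>_abs_epsilon_<mode>'] or None if absent."""
    # JSON serialization turns integer keys into strings, so try both forms.
    for key in (min_quant, str(min_quant)):
        entry = results.get(key)
        if isinstance(entry, dict):
            return entry.get(f"{metric}_abs_epsilon_{mode}")
    return None


results = {
    "3": {"median_abs_epsilon_global": 0.21, "mean_abs_epsilon_eq_species": 0.30},
}
print(lookup_epsilon(results))                                # 0.21
print(lookup_epsilon(results, metric="mean", mode="eq_species"))  # 0.3
print(lookup_epsilon(results, min_quant=5))                   # None
```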

proteobench.datapoint.quant_datapoint.filter_df_numquant_nr_prec(row: Series, min_quant: int = 3) → int | None#

Extract the ‘nr_prec’ value from a row (a dictionary or pandas Series).

Parameters:

row (pd.Series) – The row from which to extract the value. Expected to be a dictionary or Series.
min_quant (int or str, optional) – The key for the desired value. Defaults to 3.

Returns:

The ‘nr_prec’ value if found, otherwise None.

Return type:

int or None