proteobench.datapoint package#

Datapoint module for ProteoBench benchmarking.

class proteobench.datapoint.DatapointBase[source]#

Bases: ABC

Abstract base class for benchmark datapoints.

This class defines the interface that all datapoint types must implement, allowing for modular and extensible datapoint handling for different benchmarking modules.

Subclasses should define their own attributes specific to their benchmarking module.

abstractmethod static generate_datapoint(intermediate: DataFrame, input_format: str, user_input: dict, **kwargs) Series[source]#

Generate a datapoint object containing metadata and results from the benchmark run.

Parameters:
  • intermediate (pd.DataFrame) – The intermediate DataFrame containing benchmark results.

  • input_format (str) – The format of the input data (e.g., software tool name).

  • user_input (dict) – User-defined input values for the benchmark.

  • **kwargs (dict) – Additional module-specific parameters.

Returns:

A Pandas Series containing the datapoint’s attributes as key-value pairs.

Return type:

pd.Series

abstractmethod generate_id() None[source]#

Generate a unique ID for the benchmark run.

This ID should uniquely identify each run of the benchmark.

abstractmethod static get_metrics(df: DataFrame, **kwargs) Dict[int, Dict[str, float]][source]#

Compute statistical metrics from the provided DataFrame.

Parameters:
  • df (pd.DataFrame) – DataFrame containing the intermediate results.

  • **kwargs (dict) – Additional module-specific parameters.

Returns:

Dictionary mapping quantification cutoffs to their computed metrics.

Return type:

Dict[int, Dict[str, float]]
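The interface above can be pictured with a small, dependency-free sketch. It is only an illustration under stated assumptions: `SketchDatapointBase` and `ToyDatapoint` are hypothetical names, plain lists and dicts stand in for `pd.DataFrame` and `pd.Series`, and the toy metric is invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchDatapointBase(ABC):
    """Hypothetical stand-in that mirrors the three documented abstract methods."""

    @staticmethod
    @abstractmethod
    def generate_datapoint(intermediate: List[dict], input_format: str, user_input: dict, **kwargs) -> dict:
        ...

    @abstractmethod
    def generate_id(self) -> None:
        ...

    @staticmethod
    @abstractmethod
    def get_metrics(df: List[dict], **kwargs) -> Dict[int, Dict[str, float]]:
        ...


class ToyDatapoint(SketchDatapointBase):
    """Hypothetical subclass; a real module would define its own attributes."""

    def __init__(self, software_name: str):
        self.software_name = software_name
        self.id = None

    @staticmethod
    def generate_datapoint(intermediate, input_format, user_input, **kwargs):
        dp = ToyDatapoint(software_name=input_format)
        dp.generate_id()
        # Key-value pairs; a pandas-based implementation would wrap these in a Series.
        return {
            "id": dp.id,
            "software_name": dp.software_name,
            "results": ToyDatapoint.get_metrics(intermediate),
            **user_input,
        }

    def generate_id(self) -> None:
        # Placeholder ID scheme, chosen only for this sketch.
        self.id = f"{self.software_name}_toy"

    @staticmethod
    def get_metrics(df, **kwargs):
        # Maps a cutoff (here: 1) to a dict of metric names and values.
        return {1: {"nr_rows": float(len(df))}}


datapoint = ToyDatapoint.generate_datapoint([{"a": 1}, {"a": 2}], "ToolX", {"enzyme": "Trypsin"})
```

Because all three methods are abstract, the base class cannot be instantiated directly; only a subclass that implements every one of them can.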

class proteobench.datapoint.QuantDatapointHYE(id: str = None, software_name: str = None, software_version: int = 0, search_engine: str = None, search_engine_version: int = 0, ident_fdr_psm: int = 0, ident_fdr_peptide: int = 0, ident_fdr_protein: int = 0, enable_match_between_runs: bool = False, precursor_mass_tolerance: str = None, fragment_mass_tolerance: str = None, enzyme: str = None, allowed_miscleavages: int = 0, min_peptide_length: int = 0, max_peptide_length: int = 0, is_temporary: bool = True, intermediate_hash: str = '', results: dict = None, median_abs_epsilon_global: float = 0, mean_abs_epsilon_global: float = 0, median_abs_epsilon_eq_species: float = 0, mean_abs_epsilon_eq_species: float = 0, median_abs_epsilon_precision_global: float = 0, mean_abs_epsilon_precision_global: float = 0, median_abs_epsilon_precision_eq_species: float = 0, mean_abs_epsilon_precision_eq_species: float = 0, nr_prec: int = 0, comments: str = '', proteobench_version: str = '')[source]#

Bases: DatapointBase

A data structure used to store the results of a quantification benchmark run.

This class extends DatapointBase to implement quantification-specific metrics and metadata storage for LFQ benchmarking runs.
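The constructor signature, with a default for every field, resembles a Python dataclass. The sketch below assumes that pattern and uses a hypothetical, heavily reduced stand-in (`ToyQuantDatapoint`) with only a handful of the documented fields; it does not import ProteoBench.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyQuantDatapoint:
    """Hypothetical reduced stand-in; not the real QuantDatapointHYE."""

    id: Optional[str] = None
    software_name: Optional[str] = None
    enable_match_between_runs: bool = False
    is_temporary: bool = True
    results: Optional[dict] = None
    nr_prec: int = 0
    comments: str = ""


# Only the fields that differ from their defaults need to be passed.
dp = ToyQuantDatapoint(software_name="ToolX", nr_prec=1200)

# Flatten the instance into key-value pairs.
record = asdict(dp)
```

A dict like `record` can be passed to `pd.Series(record)` to obtain a Series of attribute names and values.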

id#

Unique identifier for the benchmark run.

Type:

str

software_name#

Name of the software used in the benchmark.

Type:

str

software_version#

Version of the software.

Type:

str

search_engine#

Name of the search engine used.

Type:

str

search_engine_version#

Version of the search engine.

Type:

str

ident_fdr_psm#

False discovery rate for PSMs.

Type:

float

ident_fdr_peptide#

False discovery rate for peptides.

Type:

float

ident_fdr_protein#

False discovery rate for proteins.

Type:

float

enable_match_between_runs#

Whether matching between runs is enabled.

Type:

bool

precursor_mass_tolerance#

Mass tolerance for precursor ions.

Type:

str

fragment_mass_tolerance#

Mass tolerance for fragment ions.

Type:

str

enzyme#

Enzyme used for digestion.

Type:

str

allowed_miscleavages#

Number of allowed miscleavages.

Type:

int

min_peptide_length#

Minimum peptide length.

Type:

int

max_peptide_length#

Maximum peptide length.

Type:

int

is_temporary#

Whether the datapoint is temporary.

Type:

bool

intermediate_hash#

Hash of the intermediate result.

Type:

str

results#

A dictionary of metrics for the benchmark run.

Type:

dict

median_abs_epsilon_global#

Median absolute epsilon value for the benchmark.

Type:

float

mean_abs_epsilon_global#

Mean absolute epsilon value for the benchmark.

Type:

float

median_abs_epsilon_eq_species#

Median absolute epsilon value, averaged across species with equal weight.

Type:

float

mean_abs_epsilon_eq_species#

Mean absolute epsilon value, averaged across species with equal weight.

Type:

float

median_abs_epsilon_precision_global#

Median absolute precision epsilon (deviation from empirical center).

Type:

float

mean_abs_epsilon_precision_global#

Mean absolute precision epsilon (deviation from empirical center).

Type:

float

median_abs_epsilon_precision_eq_species#

Median absolute precision epsilon, averaged across species with equal weight.

Type:

float

mean_abs_epsilon_precision_eq_species#

Mean absolute precision epsilon, averaged across species with equal weight.

Type:

float

nr_prec#

Number of precursors identified.

Type:

int

comments#

Any additional comments.

Type:

str

proteobench_version#

Version of ProteoBench used for the benchmark run.

Type:

str

allowed_miscleavages: int = 0#
comments: str = ''#
enable_match_between_runs: bool = False#
enzyme: str = None#
fragment_mass_tolerance: str = None#
static generate_datapoint(intermediate: DataFrame, input_format: str, user_input: dict, default_cutoff_min_prec: int = 3) Series[source]#

Generate a Datapoint object containing metadata and results from the benchmark run.

Parameters:
  • intermediate (pd.DataFrame) – The intermediate DataFrame containing benchmark results.

  • input_format (str) – The format of the input data (e.g., file format).

  • user_input (dict) – User-defined input values for the benchmark.

  • default_cutoff_min_prec (int, optional) – The default minimum precursor cutoff value. Defaults to 3.

Returns:

A Pandas Series containing the Datapoint’s attributes as key-value pairs.

Return type:

pd.Series

generate_id() None[source]#

Generate a unique ID for the benchmark run by combining the software name and a timestamp.

This ID is used to uniquely identify each run of the benchmark.
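As a rough illustration of combining a software name with a timestamp, consider the sketch below. The helper name `make_run_id`, the separator, and the timestamp format are all assumptions made for the example; the format actually used by `generate_id` is not stated here.

```python
from datetime import datetime
from typing import Optional


def make_run_id(software_name: str, now: Optional[datetime] = None) -> str:
    """Combine a software name and a timestamp into a run ID (format is an assumption)."""
    now = now or datetime.now()
    # Microsecond resolution makes collisions between consecutive runs unlikely.
    return f"{software_name}_{now.strftime('%Y%m%d_%H%M%S_%f')}"


example_id = make_run_id("ToolX", datetime(2024, 1, 2, 3, 4, 5, 6))
```

Passing the timestamp explicitly, as in `example_id`, keeps the function deterministic for testing; omitting it falls back to the current time.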

get_count_metrics(min_nr_observed: int) dict[str, int][source]#

Compute precursor counts (total and per-species).

get_cv_metrics(min_nr_observed: int) dict[str, float][source]#

Compute quantiles of the coefficient of variation (CV).

get_epsilon_metrics(min_nr_observed: int, agg: str = 'median') dict[str, float][source]#

Compute epsilon-based accuracy metrics using specified aggregation.

Parameters:
  • df (pd.DataFrame) – DataFrame with epsilon column (deviation from expected ratio)

  • min_nr_observed (int) – Filter threshold for minimum observations

  • agg (str) – Aggregation method: “median” or “mean”

Returns:

Accuracy metrics: global, equal-species average, and per-species values

Return type:

dict
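The three kinds of accuracy values listed above (global, equal-species average, per-species) can be sketched without pandas as follows. This is a hypothetical illustration: rows are plain dicts, the `nr_observed` column name and the per-species key pattern are assumptions, and the real method may filter or aggregate differently.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List


def sketch_epsilon_metrics(rows: List[dict], min_nr_observed: int, agg: str = "median") -> Dict[str, float]:
    """Aggregate absolute epsilon globally, per species, and as an equal-weight species average."""
    if agg not in ("median", "mean"):
        raise ValueError("agg must be 'median' or 'mean'")
    aggregate = median if agg == "median" else mean

    # Keep only rows observed often enough, grouped by species.
    abs_eps_by_species = defaultdict(list)
    for row in rows:
        if row["nr_observed"] >= min_nr_observed:
            abs_eps_by_species[row["species"]].append(abs(row["epsilon"]))
    if not abs_eps_by_species:
        return {}

    per_species = {s: aggregate(v) for s, v in abs_eps_by_species.items()}
    all_values = [v for values in abs_eps_by_species.values() for v in values]

    metrics = {
        # Every retained row counts once, so abundant species dominate.
        f"{agg}_abs_epsilon_global": aggregate(all_values),
        # Every species counts once, regardless of how many rows it has.
        f"{agg}_abs_epsilon_eq_species": mean(per_species.values()),
    }
    for species, value in per_species.items():
        metrics[f"{agg}_abs_epsilon_{species}"] = value
    return metrics


rows = [
    {"species": "HUMAN", "epsilon": 0.25, "nr_observed": 3},
    {"species": "HUMAN", "epsilon": -0.75, "nr_observed": 3},
    {"species": "YEAST", "epsilon": 1.5, "nr_observed": 2},
    {"species": "ECOLI", "epsilon": -1.0, "nr_observed": 1},
]
eps_metrics = sketch_epsilon_metrics(rows, min_nr_observed=2)
```

With `min_nr_observed=2` the ECOLI row is dropped, so only HUMAN and YEAST contribute to the global and equal-species values.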

get_metrics(min_nr_observed: int = 1) dict[int, dict[str, float]][source]#

Compute all benchmark metrics (backward-compatible wrapper).

Merges the epsilon (accuracy), precision, CV, ROC, and count metrics into a single result.
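One way to picture the merge is a helper that folds several metric dicts into one entry keyed by the cutoff, matching the documented `dict[int, dict[str, float]]` return shape. The helper name `merge_metric_groups` and the duplicate-key check are assumptions made for this sketch, not documented behavior.

```python
from typing import Dict


def merge_metric_groups(cutoff: int, *groups: Dict[str, float]) -> Dict[int, Dict[str, float]]:
    """Merge several metric dicts into a single {cutoff: metrics} mapping."""
    merged: Dict[str, float] = {}
    for group in groups:
        # Refuse silent overwrites if two groups report the same metric name.
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate metric names: {sorted(overlap)}")
        merged.update(group)
    return {cutoff: merged}


combined = merge_metric_groups(
    3,
    {"median_abs_epsilon_global": 0.25},
    {"median_abs_epsilon_precision_global": 0.5},
    {"nr_prec": 100.0},
)
```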

static get_metrics_old(df: DataFrame, min_nr_observed: int = 1) Dict[int, Dict[str, float]][source]#

Compute various statistical metrics from the provided DataFrame for the benchmark.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the benchmark results.

  • min_nr_observed (int, optional) – The minimum number of observed values for a valid computation. Defaults to 1.

Returns:

A dictionary containing computed metrics such as ‘median_abs_epsilon’, ‘variance_epsilon’, etc.

Return type:

dict

get_precision_metrics(min_nr_observed: int, agg: str = 'median') dict[str, float][source]#

Compute precision metrics directly from log2FC (log2_A_vs_B) column.

Precision measures deviation from the empirical center (reproducibility), computed independently from expected ratios.

Parameters:
  • df (pd.DataFrame) – DataFrame with log2_A_vs_B and species columns

  • min_nr_observed (int) – Filter threshold for minimum observations

  • agg (str) – Aggregation method: “median” or “mean”

Returns:

Precision metrics, including:

  • {agg}_log2_empirical_{species}: Center of log2FC distribution per species

  • {agg}_abs_epsilon_precision_global: Global aggregated precision

  • {agg}_abs_epsilon_precision_eq_species: Equal-weighted species average

  • {agg}_abs_epsilon_precision_{species}: Per-species precision values

Return type:

dict
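The idea of measuring deviation from the empirical center can be sketched in plain Python. This is an illustration under assumptions: rows are dicts rather than a DataFrame, the `nr_observed` column name is hypothetical, and the real method may define the center or the global aggregation differently.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List


def sketch_precision_metrics(rows: List[dict], min_nr_observed: int, agg: str = "median") -> Dict[str, float]:
    """Measure spread of log2 fold changes around each species' own empirical center."""
    if agg not in ("median", "mean"):
        raise ValueError("agg must be 'median' or 'mean'")
    aggregate = median if agg == "median" else mean

    log2fc_by_species = defaultdict(list)
    for row in rows:
        if row["nr_observed"] >= min_nr_observed:
            log2fc_by_species[row["species"]].append(row["log2_A_vs_B"])
    if not log2fc_by_species:
        return {}

    metrics: Dict[str, float] = {}
    all_deviations: List[float] = []
    per_species: List[float] = []
    for species, values in log2fc_by_species.items():
        # The center comes from the data itself, not from the expected ratio.
        center = aggregate(values)
        deviations = [abs(v - center) for v in values]
        species_precision = aggregate(deviations)
        metrics[f"{agg}_log2_empirical_{species}"] = center
        metrics[f"{agg}_abs_epsilon_precision_{species}"] = species_precision
        per_species.append(species_precision)
        all_deviations.extend(deviations)

    metrics[f"{agg}_abs_epsilon_precision_global"] = aggregate(all_deviations)
    metrics[f"{agg}_abs_epsilon_precision_eq_species"] = mean(per_species)
    return metrics


rows = [
    {"species": "HUMAN", "log2_A_vs_B": 0.0, "nr_observed": 3},
    {"species": "HUMAN", "log2_A_vs_B": 0.5, "nr_observed": 3},
    {"species": "HUMAN", "log2_A_vs_B": 1.0, "nr_observed": 3},
    {"species": "YEAST", "log2_A_vs_B": 1.0, "nr_observed": 2},
    {"species": "YEAST", "log2_A_vs_B": 3.0, "nr_observed": 2},
    {"species": "ECOLI", "log2_A_vs_B": -2.0, "nr_observed": 0},
]
precision = sketch_precision_metrics(rows, min_nr_observed=1)
```

Because each deviation is taken relative to the species' own center, a tool with a consistent bias away from the expected ratio can still score well on precision while scoring poorly on accuracy.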

get_roc_metrics(min_nr_observed: int) dict[str, float][source]#

Compute ROC-AUC metrics.

id: str = None#
ident_fdr_peptide: int = 0#
ident_fdr_protein: int = 0#
ident_fdr_psm: int = 0#
intermediate_hash: str = ''#
is_temporary: bool = True#
max_peptide_length: int = 0#
mean_abs_epsilon_eq_species: float = 0#
mean_abs_epsilon_global: float = 0#
mean_abs_epsilon_precision_eq_species: float = 0#
mean_abs_epsilon_precision_global: float = 0#
median_abs_epsilon_eq_species: float = 0#
median_abs_epsilon_global: float = 0#
median_abs_epsilon_precision_eq_species: float = 0#
median_abs_epsilon_precision_global: float = 0#
min_peptide_length: int = 0#
nr_prec: int = 0#
precursor_mass_tolerance: str = None#
proteobench_version: str = ''#
results: dict = None#
search_engine: str = None#
search_engine_version: int = 0#
software_name: str = None#
software_version: int = 0#

Submodules#