Export to pandas DataFrame

NOTE: This feature requires pyOpenMS 3.0 or later. At the time of writing, that means installing one of the nightly builds as described in the Installation Instructions.

In pyOpenMS some data structures can be converted to a tabular format as a pandas.DataFrame. This allows convenient access to data and meta values of spectra, features and identifications.

Required imports for the examples:

from pyopenms import *
import pandas as pd
from urllib.request import urlretrieve
url = 'https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master/src/data/'

MSExperiment

pyopenms.MSExperiment.get_df( long=False )

Generates a pandas DataFrame with all peaks in the MSExperiment

Parameters:

long : default False

Set to True to get a long (expanded, melted) DataFrame with one row per peak. This is faster, but the RT value is repeated for every peak of a spectrum. If False, each row holds one spectrum in the form: rt, np.array(mz), np.array(int)

Returns:

pandas.DataFrame

peak map information stored in a DataFrame

Examples:

urlretrieve(url+'BSA1.mzML', 'BSA1.mzML')
exp = MSExperiment()
MzMLFile().load('BSA1.mzML', exp)

df = exp.get_df() # default: long = False
df.head(2)
exp.get_df()

   RT          mzarray                                          intarray
0  1501.41394  [300.0897645621494, 300.18132740129533, 300.20…  [3431.0261, 1181.809, 1516.1746, 1719.8547, 11…
1  1503.03125  [300.06577092599525, 300.08932376441896, 300.2…  [914.79034, 1842.2311, 2395.1025, 851.4738, 16…

df = exp.get_df(long=True)
df.head(2)
exp.get_df(long=True)

   RT          mz          inty
0  1501.41394  300.089752  3431.026123
1  1501.41394  300.181335  1181.808960
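
To make the relationship between the two layouts concrete, the following minimal sketch converts a wide-style table into a long-style one using only pandas and NumPy. The small DataFrame is hand-built for illustration (values rounded from the output above), and wide_to_long is a hypothetical helper, not part of pyOpenMS. In practice exp.get_df(long=True) returns the long layout directly.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the wide layout (one row per spectrum).
# Illustrative values only; a real table would come from exp.get_df().
wide = pd.DataFrame({
    "RT": [1501.41, 1503.03],
    "mzarray": [np.array([300.09, 300.18, 300.20]), np.array([300.07, 300.09])],
    "intarray": [np.array([3431.0, 1181.8, 1516.2]), np.array([914.8, 1842.2])],
})


def wide_to_long(wide_df):
    """Hypothetical helper: expand one row per spectrum into one row per peak."""
    # Number of peaks in each spectrum decides how often its RT is repeated.
    peaks_per_spectrum = wide_df["mzarray"].map(len).to_numpy()
    return pd.DataFrame({
        "RT": np.repeat(wide_df["RT"].to_numpy(), peaks_per_spectrum),
        "mz": np.concatenate(wide_df["mzarray"].to_list()),
        "inty": np.concatenate(wide_df["intarray"].to_list()),
    })


long_df = wide_to_long(wide)
print(long_df)
```

The repeated RT values in the result are the "replicated RT information" mentioned in the description of the long parameter.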

PeptideIdentifications

pyopenms.peptide_identifications_to_df( peps, decode_ontology=True, default_missing_values={bool: False, int: -9999, float: np.nan, str: ''}, export_unidentified=True )

Generates a pandas DataFrame from a list of PeptideIdentification objects

Parameters:

peps :

list of PeptideIdentification objects

decode_ontology : default True

if meta values contain a CV identifier (e.g., from PSI-MS), they are automatically decoded into the human-readable CV term name.

default_missing_values : default {bool: False, int: -9999, float: np.nan, str: ''}

default value for missing values for each data type

export_unidentified : default True

export PeptideIdentifications without PeptideHit

Returns:

pandas.DataFrame

peptide identifications in a DataFrame

Example:

urlretrieve(url+'small.idXML', 'small.idXML')
prot_ids = []
pep_ids = []
IdXMLFile().load('small.idXML', prot_ids, pep_ids)

df = peptide_identifications_to_df(pep_ids)
df.head(2)
peptide_identifications_to_df(pep_ids)

   id                                                RT          mz          q-value   charge
0  OpenNuXL_2019-12-04T16:39:43_1021782429466859437  900.425415  414.730865  0.368649  4
1  OpenNuXL_2019-12-04T16:39:43_7293634134684008928  903.565186  506.259521  0.422779  2

   protein_accession            start  end  NuXL:z2 mass  NuXL:z3 mass
0  DECOY_sp|Q86UQ0|ZN589_HUMAN  255    267  828.458069    552.641113
1  sp|P61313|RL15_HUMAN         179    187  0.0           0.0

   isotope_error  NuXL:peptide_mass_z0  NuXL:XL_U  NuXL:sequence_score
0  0              1654.901611           0          0.173912
1  0              1010.504639           0          0.290786
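
Because missing integers are filled with a sentinel (-9999 by default) and missing strings with '', it can be convenient to turn these placeholders into proper missing values before computing statistics. The sketch below shows one way to do that with plain pandas. The DataFrame is a hand-built, hypothetical stand-in for the output of peptide_identifications_to_df, and the sentinel is assumed to be the documented default.

```python
import numpy as np
import pandas as pd

INT_SENTINEL = -9999  # documented default for missing int values

# Hypothetical rows for illustration; the third row has no charge,
# the second has no protein accession.
df = pd.DataFrame({
    "charge": [4, 2, INT_SENTINEL],
    "q-value": [0.368649, 0.422779, np.nan],
    "protein_accession": ["sp|P61313|RL15_HUMAN", "", "sp|P61313|RL15_HUMAN"],
})

clean = df.copy()

# Switch to the nullable integer dtype so that missing charges can be <NA>
# without the column being converted to float.
is_missing_charge = clean["charge"] == INT_SENTINEL
clean["charge"] = clean["charge"].astype("Int64").mask(is_missing_charge)

# Empty strings become missing values as well.
clean["protein_accession"] = clean["protein_accession"].replace("", pd.NA)

missing_per_column = clean.isna().sum()
print(missing_per_column)
```

If a different sentinel was passed via default_missing_values, INT_SENTINEL has to be adjusted accordingly.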

FeatureMap

pyopenms.FeatureMap.get_df( meta_values=None, export_peptide_identifications=True )

Generates a pandas DataFrame with information contained in the FeatureMap.

Optionally the feature meta values and information for the assigned PeptideHit can be exported.

Parameters:

meta_values : default None

meta values to include (None, [custom list of meta value names] or 'all')

export_peptide_identifications (bool): default True

Export the sequence and score of the best PeptideHit assigned to each feature. Additionally, ID_filename (the file name of the corresponding ProteinIdentification) and ID_native_id (the spectrum ID of the corresponding Feature) are exported. The same values are annotated as meta values when all assigned PeptideIdentifications are collected from a FeatureMap with FeatureMap.get_assigned_peptide_identifications(). A DataFrame generated from the assigned peptides with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])

Returns:

pandas.DataFrame

feature information stored in a DataFrame

Examples:

urlretrieve(url+'BSA1_F1_idmapped.featureXML', 'BSA1_F1_idmapped.featureXML')
feature_map = FeatureMap()
FeatureXMLFile().load('BSA1_F1_idmapped.featureXML', feature_map)

df = feature_map.get_df() # default: meta_values = None
df.head(2)
feature_map.get_df()

id                    peptide_sequence  peptide_score  ID_filename  ID_native_id   charge
9650885788371886430   LVTDLTK           0.000000       unknown      spectrum=1270  2
18416216708636999474  DDSPDLPK          0.034483       unknown      spectrum=1167  2

id                    RT           mz          RTstart      RTend
9650885788371886430   1942.600083  395.239277  1932.484009  1950.834351
18416216708636999474  1749.138335  443.711224  1735.693115  1763.343506

id                    mzstart     mzend       quality   intensity
9650885788371886430   395.239199  397.245758  0.808494  157572000.0
18416216708636999474  443.711122  445.717531  0.893553  54069300.0

df = feature_map.get_df(meta_values = 'all', export_peptide_identifications = False)
df.head(2)
feature_map.get_df(meta_values = 'all', export_peptide_identifications = False)

id                    charge  RT           mz          RTstart      RTend
9650885788371886430   2       1942.600083  395.239277  1932.484009  1950.834351
18416216708636999474  2       1749.138335  443.711224  1735.693115  1763.343506

id                    mzstart     mzend       quality   intensity    FWHM
9650885788371886430   395.239199  397.245758  0.808494  157572000.0  10.061090
18416216708636999474  443.71112   445.717531  0.893553  54069300.0   14.156094

id                    spectrum_index  spectrum_native_id  label  score_correlation  score_fit
9650885788371886430   259             spectrum=1270       168    0.989969           0.660286
18416216708636999474  156             spectrum=1167       169    0.999002           0.799234

df = feature_map.get_df(meta_values = [b'FWHM', b'label'])
df.head(2)
feature_map.get_df(meta_values = [b'FWHM', b'label'])

id                    charge  RT           mz          RTstart      RTend
9650885788371886430   2       1942.600083  395.239277  1932.484009  1950.834351
18416216708636999474  2       1749.138335  443.711224  1735.693115  1763.343506

id                    mzstart     mzend       quality   intensity    FWHM       label
9650885788371886430   395.239199  397.245758  0.808494  157572000.0  10.061090  168
18416216708636999474  443.71112   445.717531  0.893553  54069300.0   14.156094  169

Extract assigned peptide identifications from a feature map

Peptide identifications can be mapped to their corresponding features in a FeatureMap. They can be extracted with pyopenms.FeatureMap.get_assigned_peptide_identifications(), which returns a list of PeptideIdentification objects.

pyopenms.FeatureMap.get_assigned_peptide_identifications()

Generates a list with peptide identifications assigned to a feature.

Adds 'ID_native_id' (feature spectrum id), 'ID_filename' (primary MS run path of corresponding ProteinIdentification) and 'feature_id' (unique ID of corresponding Feature) as meta values to the peptide hits. A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])

Returns:

[PeptideIdentification]

list of PeptideIdentification objects

A DataFrame can be created from the resulting list of PeptideIdentification objects using pyopenms.peptide_identifications_to_df(assigned_peptides). The feature map and peptide data frames share key columns on which they can be merged, so that the complete information for peptides and features ends up in a single data frame.

The columns for unambiguously merging the data frames:

  • feature_id: the unique feature identifier

  • ID_native_id: the feature spectrum native identifier

  • ID_filename: the filename (primary MS run path) of the corresponding ProteinIdentification

Example:

feature_df = feature_map.get_df()
assigned_peptides = feature_map.get_assigned_peptide_identifications()
assigned_peptide_df = peptide_identifications_to_df(assigned_peptides)

merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
merged_df.head(2)
pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])

feature_id            peptide_sequence  peptide_score  ID_filename  ID_native_id
9650885788371886430   LVTDLTK           0.000000       unknown      spectrum=1270
18416216708636999474  DDSPDLPK          0.034483       unknown      spectrum=1167

feature_id            charge_x  RT_x         mz_x        RTstart      RTend
9650885788371886430   2         1942.600083  395.239277  1932.484009  1950.834351
18416216708636999474  2         1749.138335  443.711224  1735.693115  1763.343506

feature_id            id                                             RT_y         mz_y        q-value   charge_y
9650885788371886430   OMSSA_2009-11-17T11:11:11_4731105163044641872  1933.405151  395.239349  0.000000  2
18416216708636999474  OMSSA_2009-11-17T11:11:11_4731105163044641872  1738.033447  443.711243  0.034483  2

feature_id            protein_accession  start  end  OMSSA_score  target_decoy
9650885788371886430   P02769|ALBU_BOVIN  -1     -1   0.001084     True
18416216708636999474  P02769|ALBU_BOVIN  -1     -1   0.003951     True
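
The _x and _y suffixes in the merged output are what pandas appends by default to columns that exist in both data frames but are not merge keys (here charge, RT and mz). The following sketch, built on small hand-made data frames with hypothetical feature IDs and rounded values, shows how the suffixes argument of pd.merge can give these columns more descriptive names, and how validate can guard against duplicated keys.

```python
import pandas as pd

keys = ["feature_id", "ID_native_id", "ID_filename"]

# Hypothetical stand-in for feature_map.get_df()
feature_df = pd.DataFrame({
    "feature_id": [101, 102],
    "ID_native_id": ["spectrum=1270", "spectrum=1167"],
    "ID_filename": ["unknown", "unknown"],
    "charge": [2, 2],
    "RT": [1942.60, 1749.14],
})

# Hypothetical stand-in for peptide_identifications_to_df(assigned_peptides)
assigned_peptide_df = pd.DataFrame({
    "feature_id": [101, 102],
    "ID_native_id": ["spectrum=1270", "spectrum=1167"],
    "ID_filename": ["unknown", "unknown"],
    "charge": [2, 2],
    "RT": [1933.41, 1738.03],
    "q-value": [0.0, 0.034483],
})

merged_df = pd.merge(
    feature_df,
    assigned_peptide_df,
    on=keys,
    suffixes=("_feature", "_peptide"),  # replaces the default _x / _y
    validate="one_to_one",              # raises if a key combination is duplicated
)
print(merged_df.columns.tolist())
```

Whether one_to_one is the right check depends on the data: if several peptide identifications can be assigned to one feature, "one_to_many" would be the appropriate setting.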

ConsensusMap

pyopenms.ConsensusMap.get_df()

Generates a pandas DataFrame with both consensus feature meta data and intensities from each sample.

Returns:

pandas.DataFrame

consensus map meta data and intensity stored in pandas DataFrame

pyopenms.ConsensusMap.get_intensity_df()

Generates a pandas DataFrame with feature intensities from each sample in long format (over files).

For labelled analyses channel intensities will be in one row, therefore resulting in a semi-long/block format. Resulting DataFrame can be joined with result from get_metadata_df by their index 'id'.

Returns:

pandas.DataFrame

intensity DataFrame

pyopenms.ConsensusMap.get_metadata_df()

Generates a pandas DataFrame with feature meta data (sequence, charge, mz, RT, quality).

Resulting DataFrame can be joined with result from get_intensity_df by their index 'id'.

Returns:

pandas.DataFrame

DataFrame with metadata for each feature (such as: best identified sequence, charge, centroid RT/mz, fitting quality)

Examples:

urlretrieve(url+'ProteomicsLFQ_1_out.consensusXML', 'ProteomicsLFQ_1_out.consensusXML')
consensus_map = ConsensusMap()
ConsensusXMLFile().load('ProteomicsLFQ_1_out.consensusXML', consensus_map)

df = consensus_map.get_df()
df.head(2)
df = consensus_map.get_intensity_df()
df.head(2)
consensus_map.get_intensity_df()

id                    BSA1_F1.mzML  BSA1_F2.mzML
2935923263525422257   0.0           0.0
10409195546240342212  1358151.0     0.0

df = consensus_map.get_metadata_df()
df.head(2)
consensus_map.get_metadata_df()

id                    sequence                    charge  RT           mz          quality
2935923263525422257   DGDIEAEISR                  3       1523.370634  368.843773  0.000000
10409195546240342212  SHC(Carbamidomethyl)IAEVEK  3       1552.032973  358.174576  0.491247
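
As noted above, the intensity and metadata data frames share the index 'id' and can be joined on it. The sketch below illustrates such a join with two small hand-built data frames. The short integer IDs are hypothetical placeholders for the real feature IDs, and with real data the two frames would come from consensus_map.get_metadata_df() and consensus_map.get_intensity_df().

```python
import pandas as pd

# Hypothetical stand-in for consensus_map.get_intensity_df()
intensity_df = pd.DataFrame(
    {"BSA1_F1.mzML": [0.0, 1358151.0], "BSA1_F2.mzML": [0.0, 0.0]},
    index=pd.Index([1, 2], name="id"),
)

# Hypothetical stand-in for consensus_map.get_metadata_df()
metadata_df = pd.DataFrame(
    {"sequence": ["DGDIEAEISR", "SHC(Carbamidomethyl)IAEVEK"], "charge": [3, 3]},
    index=pd.Index([1, 2], name="id"),
)

# DataFrame.join aligns rows on the shared index 'id'.
combined = metadata_df.join(intensity_df, how="inner")
print(combined)
```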