Export to pandas DataFrame

NOTE: This feature requires pyOpenMS >= 3.0. At the time of writing, this means using one of the nightly builds, as described in the Installation Instructions.

In pyOpenMS some data structures can be converted to a tabular format as a pandas.DataFrame. This allows convenient access to data and meta values of spectra, features and identifications.

Required imports for the examples:

from pyopenms import *
import pandas as pd
from urllib.request import urlretrieve
url = 'https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master/src/data/'

MSExperiment

pyopenms.MSExperiment.get_df( long=False )

Generates a pandas DataFrame with all peaks in the MSExperiment

Parameters:

long : default False

Set to True to get a long (expanded/melted) DataFrame with one row per peak. This is faster, but the RT value is repeated for every peak of a spectrum. If False, each row holds one spectrum in the style: rt, np.array(mz), np.array(int)

Returns:

pandas.DataFrame

peak map information stored in a DataFrame

Examples:

urlretrieve(url+'BSA1.mzML', 'BSA1.mzML')
exp = MSExperiment()
MzMLFile().load('BSA1.mzML', exp)

df = exp.get_df() # default: long = False
df.head(2)

exp.get_df()
	RT	mzarray	intarray
0	1501.41394	[300.0897645621494, 300.18132740129533, 300.20…	[3431.0261, 1181.809, 1516.1746, 1719.8547, 11…
1	1503.03125	[300.06577092599525, 300.08932376441896, 300.2…	[914.79034, 1842.2311, 2395.1025, 851.4738, 16…
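In the default wide layout each row holds one spectrum, so per-spectrum summaries can be computed directly from the array columns. The following is a minimal sketch on a hand-built stand-in DataFrame: the values are illustrative rather than read from BSA1.mzML, and the derived column names n_peaks and tic are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for exp.get_df(): one row per spectrum, one array per cell.
wide_df = pd.DataFrame({
    "RT": [1501.41, 1503.03],
    "mzarray": [np.array([300.09, 300.18, 300.20]), np.array([300.07, 300.09])],
    "intarray": [np.array([3431.0, 1181.0, 1516.0]), np.array([915.0, 1842.0])],
})

# Derived per-spectrum columns (the names are illustrative only).
wide_df["n_peaks"] = wide_df["mzarray"].map(len)   # number of peaks
wide_df["tic"] = wide_df["intarray"].map(np.sum)   # summed intensity

print(wide_df[["RT", "n_peaks", "tic"]])
```

With a real experiment, the same two lines should apply unchanged to the result of exp.get_df(), provided the column names match those shown above.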

df = exp.get_df(long=True)
df.head(2)

exp.get_df(long=True)
	RT	mz	inty
0	1501.41394	300.089752	3431.026123
1	1501.41394	300.181335	1181.808960
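Because the long layout has one row per peak, ordinary pandas filtering and grouping work on it directly. The sketch below uses a small hand-built stand-in with the column names shown above. It assumes that each spectrum has a unique RT value, so that RT can serve as the grouping key, and the m/z window bounds are arbitrary example values.

```python
import pandas as pd

# Toy stand-in for exp.get_df(long=True): one row per peak.
long_df = pd.DataFrame({
    "RT":   [1501.41, 1501.41, 1501.41, 1503.03, 1503.03],
    "mz":   [300.09, 300.18, 300.20, 300.07, 300.09],
    "inty": [3431.0, 1181.0, 1516.0, 915.0, 1842.0],
})

# Base peak (most intense peak) of each spectrum, grouping by RT.
base_peaks = long_df.loc[long_df.groupby("RT")["inty"].idxmax()]

# All peaks inside an m/z window (bounds are inclusive).
window = long_df[long_df["mz"].between(300.05, 300.10)]

print(base_peaks)
print(window)
```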

PeptideIdentifications

pyopenms.peptide_identifications_to_df( peps, decode_ontology=True, default_missing_values={bool: False, int: -9999, float: np.nan, str: ''}, export_unidentified=True )

Generates a pandas DataFrame with the information contained in a list of PeptideIdentification objects.

Parameters:

peps :

list of PeptideIdentification objects

decode_ontology : default True

If meta values contain a CV identifier (e.g., from PSI-MS), they are automatically decoded into the human-readable CV term name.

default_missing_values : default {bool: False, int: -9999, float: np.nan, str: ''}

default value for missing values for each data type

export_unidentified : default True

export PeptideIdentifications without PeptideHit

Returns:

pandas.DataFrame

peptide identifications in a DataFrame
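The placeholder values from default_missing_values are ordinary values as far as pandas is concerned, so -9999 or an empty string will not be recognized by isna() or dropna(). The following minimal sketch converts them to proper missing values. It assumes a toy DataFrame in which the second row had no PeptideHit and was filled with the default placeholders; which columns are actually filled in a real export is not shown here.

```python
import numpy as np
import pandas as pd

# Toy stand-in: the second row carries the default placeholders
# (int: -9999, float: np.nan, str: '').
df = pd.DataFrame({
    "RT": [900.43, 903.57],
    "charge": [4, -9999],
    "q-value": [0.37, np.nan],
    "protein_accession": ["sp|P61313|RL15_HUMAN", ""],
})

cleaned = df.copy()
# Nullable integer dtype lets an integer column hold missing values.
cleaned["charge"] = cleaned["charge"].astype("Int64").mask(df["charge"] == -9999)
# Empty strings become missing values as well.
cleaned["protein_accession"] = cleaned["protein_accession"].mask(
    df["protein_accession"] == ""
)

# Rows that still have a charge are treated as identified in this toy example.
identified = cleaned.dropna(subset=["charge"])
print(identified)
```

Alternatively, a different placeholder can be requested up front through the default_missing_values parameter described above.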

Example:

urlretrieve(url+'small.idXML', 'small.idXML')
prot_ids = []
pep_ids = []
IdXMLFile().load('small.idXML', prot_ids, pep_ids)

df = peptide_identifications_to_df(pep_ids)
df.head(2)

peptide_identifications_to_df(pep_ids)
	id	RT	mz	q-value	charge	protein_accession	start	end	NuXL:z2 mass	NuXL:z3 mass	…	isotope_error	NuXL:peptide_mass_z0	NuXL:XL_U	NuXL:sequence_score
0	OpenNuXL_2019-12-04T16:39:43_1021782429466859437	900.425415	414.730865	0.368649	4	DECOY_sp|Q86UQ0|ZN589_HUMAN	255	267	828.458069	552.641113	…	0	1654.901611	0	0.173912
1	OpenNuXL_2019-12-04T16:39:43_7293634134684008928	903.565186	506.259521	0.422779	2	sp|P61313|RL15_HUMAN	179	187	0.0	0.0	…	0	1010.504639	0	0.290786
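Once the identifications are in a DataFrame, typical post-processing such as removing decoys or applying a score cutoff becomes a boolean filter. This sketch rebuilds a few of the columns shown above by hand. The "DECOY_" accession prefix is assumed from the example output, and the 0.4 cutoff is an arbitrary value chosen only so that the filter has a visible effect.

```python
import pandas as pd

# Toy stand-in with a subset of the columns shown above.
df = pd.DataFrame({
    "RT": [900.425415, 903.565186],
    "q-value": [0.368649, 0.422779],
    "charge": [4, 2],
    "protein_accession": ["DECOY_sp|Q86UQ0|ZN589_HUMAN", "sp|P61313|RL15_HUMAN"],
})

# Flag decoy hits by their accession prefix.
is_decoy = df["protein_accession"].str.startswith("DECOY_")

# Target hits only.
targets = df[~is_decoy]

# Hits at or below the q-value cutoff, decoys included.
passing = df[df["q-value"] <= 0.4]

print(targets)
print(passing)
```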

FeatureMap

pyopenms.FeatureMap.get_df( meta_values=None, export_peptide_identifications=True )

Generates a pandas DataFrame with information contained in the FeatureMap.

Optionally the feature meta values and information for the assigned PeptideHit can be exported.

Parameters:

meta_values : default None

meta values to include (None, [custom list of meta value names] or 'all')

export_peptide_identifications (bool): default True

Export sequence and score of the best PeptideHit assigned to each feature. Additionally, the ID_filename (file name of the corresponding ProteinIdentification) and the ID_native_id (spectrum ID of the corresponding Feature) are exported. Both are also annotated as meta values when collecting all assigned PeptideIdentifications from a FeatureMap with FeatureMap.get_assigned_peptide_identifications(). A DataFrame of the assigned peptides, generated with peptide_identifications_to_df(assigned_peptides), can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])

Returns:

pandas.DataFrame

feature information stored in a DataFrame

Examples:

urlretrieve(url+'BSA1_F1_idmapped.featureXML', 'BSA1_F1_idmapped.featureXML')
feature_map = FeatureMap()
FeatureXMLFile().load('BSA1_F1_idmapped.featureXML', feature_map)

df = feature_map.get_df() # default: meta_values = None
df.head(2)

feature_map.get_df()
id	peptide_sequence	peptide_score	ID_filename	ID_native_id	charge	RT	mz	RTstart	RTend	mzstart	mzend	quality	intensity
9650885788371886430	LVTDLTK	0.000000	unknown	spectrum=1270	2	1942.600083	395.239277	1932.484009	1950.834351	395.239199	397.245758	0.808494	157572000.0
18416216708636999474	DDSPDLPK	0.034483	unknown	spectrum=1167	2	1749.138335	443.711224	1735.693115	1763.343506	443.711122	445.717531	0.893553	54069300.0

df = feature_map.get_df(meta_values = 'all', export_peptide_identifications = False)
df.head(2)

feature_map.get_df(meta_values = 'all', export_peptide_identifications = False)
id	charge	RT	mz	RTstart	RTend	mzstart	mzend	quality	intensity	FWHM	spectrum_index	spectrum_native_id	label	score_correlation	score_fit
9650885788371886430	2	1942.600083	395.239277	1932.484009	1950.834351	395.239199	397.245758	0.808494	157572000.0	10.061090	259	spectrum=1270	168	0.989969	0.660286
18416216708636999474	2	1749.138335	443.711224	1735.693115	1763.343506	443.71112	445.717531	0.893553	54069300.0	14.156094	156	spectrum=1167	169	0.999002	0.799234

df = feature_map.get_df(meta_values = [b'FWHM', b'label'])
df.head(2)

feature_map.get_df(meta_values = [b'FWHM', b'label'])
id	charge	RT	mz	RTstart	RTend	mzstart	mzend	quality	intensity	FWHM	label
9650885788371886430	2	1942.600083	395.239277	1932.484009	1950.834351	395.239199	397.245758	0.808494	157572000.0	10.061090	168
18416216708636999474	2	1749.138335	443.711224	1735.693115	1763.343506	443.71112	445.717531	0.893553	54069300.0	14.156094	169
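The exported feature columns can be combined into derived quantities with plain pandas arithmetic. The sketch below rebuilds a subset of the columns shown above by hand and omits the feature ids for brevity. The rt_width column name and the 0.85 quality cutoff are illustrative assumptions, as is the interpretation of the RT columns as seconds.

```python
import pandas as pd

# Toy stand-in for feature_map.get_df(), using a subset of the columns above.
feature_df = pd.DataFrame({
    "charge": [2, 2],
    "RTstart": [1932.484009, 1735.693115],
    "RTend": [1950.834351, 1763.343506],
    "quality": [0.808494, 0.893553],
    "intensity": [157572000.0, 54069300.0],
})

# Elution width of each feature (assumed to be in seconds).
feature_df["rt_width"] = feature_df["RTend"] - feature_df["RTstart"]

# Keep features above a quality cutoff and rank them by intensity.
ranked = (
    feature_df[feature_df["quality"] >= 0.85]
    .sort_values("intensity", ascending=False)
)

print(ranked)
```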

Extract assigned peptide identifications from a feature map

Peptide identifications can be mapped to their corresponding features in a FeatureMap. They can be extracted with pyopenms.FeatureMap.get_assigned_peptide_identifications(), which returns a list of PeptideIdentification objects.

pyopenms.FeatureMap.get_assigned_peptide_identifications()

Generates a list with peptide identifications assigned to a feature.

Adds 'ID_native_id' (feature spectrum id), 'ID_filename' (primary MS run path of the corresponding ProteinIdentification) and 'feature_id' (unique ID of the corresponding Feature) as meta values to the peptide hits. A DataFrame of the assigned peptides, generated with peptide_identifications_to_df(assigned_peptides), can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])

Returns:

[PeptideIdentification]

list of PeptideIdentification objects

A DataFrame can be created from the resulting list of PeptideIdentification objects using pyopenms.peptide_identifications_to_df(assigned_peptides). The feature map and peptide DataFrames share columns on which they can be merged, giving the complete information for peptides and features in a single DataFrame.

The columns for unambiguously merging the data frames:

feature_id: the unique feature identifier
ID_native_id: the feature spectrum native identifier
ID_filename: the filename (primary MS run path) of the corresponding ProteinIdentification

Example:

feature_df = feature_map.get_df()
assigned_peptides = feature_map.get_assigned_peptide_identifications()
assigned_peptide_df = peptide_identifications_to_df(assigned_peptides)

merged_df = pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
merged_df.head(2)

pd.merge(feature_df, assigned_peptide_df, on=['feature_id', 'ID_native_id', 'ID_filename'])
feature_id	peptide_sequence	peptide_score	ID_filename	ID_native_id	charge_x	RT_x	mz_x	RTstart	RTend	…	id	RT_y	mz_y	q-value	charge_y	protein_accession	start	end	OMSSA_score	target_decoy
9650885788371886430	LVTDLTK	0.000000	unknown	spectrum=1270	2	1942.600083	395.239277	1932.484009	1950.834351	…	OMSSA_2009-11-17T11:11:11_4731105163044641872	1933.405151	395.239349	0.000000	2	P02769|ALBU_BOVIN	-1	-1	0.001084	True
18416216708636999474	DDSPDLPK	0.034483	unknown	spectrum=1167	2	1749.138335	443.711224	1735.693115	1763.343506	…	OMSSA_2009-11-17T11:11:11_4731105163044641872	1738.033447	443.711243	0.034483	2	P02769|ALBU_BOVIN	-1	-1	0.003951	True
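The _x and _y suffixes in the merged output come from pandas: columns that exist in both DataFrames but are not merge keys (here RT, mz and charge) are renamed to keep them apart. The sketch below reproduces this on two hand-built toy frames and shows how more descriptive suffixes could be chosen. The ids f1 and f2 are placeholders, and the validate check assumes exactly one peptide row per feature, which need not hold for real data.

```python
import pandas as pd

keys = ["feature_id", "ID_native_id", "ID_filename"]

# Toy stand-ins for feature_map.get_df() and
# peptide_identifications_to_df(assigned_peptides).
feature_df = pd.DataFrame({
    "feature_id": ["f1", "f2"],
    "ID_native_id": ["spectrum=1270", "spectrum=1167"],
    "ID_filename": ["unknown", "unknown"],
    "RT": [1942.600083, 1749.138335],
    "charge": [2, 2],
})
assigned_peptide_df = pd.DataFrame({
    "feature_id": ["f1", "f2"],
    "ID_native_id": ["spectrum=1270", "spectrum=1167"],
    "ID_filename": ["unknown", "unknown"],
    "RT": [1933.405151, 1738.033447],
    "charge": [2, 2],
    "q-value": [0.0, 0.034483],
})

merged_df = pd.merge(
    feature_df,
    assigned_peptide_df,
    on=keys,
    suffixes=("_feature", "_peptide"),  # instead of the default _x / _y
    validate="one_to_one",              # raise if a key combination repeats
)

print(merged_df.columns.tolist())
```

If a feature can carry several assigned identifications, validate="one_to_many" or dropping the check altogether would be the more appropriate choice.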

ConsensusMap

pyopenms.ConsensusMap.get_df()

Generates a pandas DataFrame with both consensus feature meta data and intensities from each sample.

Returns:

pandas.DataFrame

consensus map meta data and intensity stored in pandas DataFrame

pyopenms.ConsensusMap.get_intensity_df()

Generates a pandas DataFrame with feature intensities from each sample in long format (over files).

For labelled analyses, channel intensities are placed in one row, resulting in a semi-long/block format. The resulting DataFrame can be joined with the result of get_metadata_df by their index 'id'.

Returns:

pandas.DataFrame

intensity DataFrame

pyopenms.ConsensusMap.get_metadata_df()

Generates a pandas DataFrame with feature meta data (sequence, charge, mz, RT, quality).

The resulting DataFrame can be joined with the result of get_intensity_df by their index 'id'.

Returns:

pandas.DataFrame

DataFrame with metadata for each feature (such as: best identified sequence, charge, centroid RT/mz, fitting quality)

Examples:

urlretrieve(url+'ProteomicsLFQ_1_out.consensusXML', 'ProteomicsLFQ_1_out.consensusXML')
consensus_map = ConsensusMap()
ConsensusXMLFile().load('ProteomicsLFQ_1_out.consensusXML', consensus_map)

df = consensus_map.get_df()
df.head(2)

df = consensus_map.get_intensity_df()
df.head(2)

consensus_map.get_intensity_df()
id	BSA1_F1.mzML	…	BSA1_F2.mzML
2935923263525422257	0.0	…	0.0
10409195546240342212	1358151.0	…	0.0

df = consensus_map.get_metadata_df()
df.head(2)

consensus_map.get_metadata_df()
id	sequence	charge	RT	mz	quality
2935923263525422257	DGDIEAEISR	3	1523.370634	368.843773	0.000000
10409195546240342212	SHC(Carbamidomethyl)IAEVEK	3	1552.032973	358.174576	0.491247
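Since both DataFrames are indexed by 'id', they can be combined with an index join. The following sketch demonstrates this on two hand-built toy frames that mimic a subset of the outputs above. The ids c1 and c2 are placeholders for the real consensus feature ids.

```python
import pandas as pd

ids = pd.Index(["c1", "c2"], name="id")  # placeholder ids

# Toy stand-in for consensus_map.get_metadata_df().
metadata_df = pd.DataFrame(
    {
        "sequence": ["DGDIEAEISR", "SHC(Carbamidomethyl)IAEVEK"],
        "charge": [3, 3],
        "RT": [1523.370634, 1552.032973],
    },
    index=ids,
)

# Toy stand-in for consensus_map.get_intensity_df().
intensity_df = pd.DataFrame(
    {"BSA1_F1.mzML": [0.0, 1358151.0], "BSA1_F2.mzML": [0.0, 0.0]},
    index=ids,
)

# DataFrame.join aligns the rows on the shared index 'id'.
combined = metadata_df.join(intensity_df)

print(combined)
```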