Peptides and Proteins

Sequences of amino acids form peptides and proteins.

Amino Acid Sequences

The AASequence class handles amino acid sequences in OpenMS. A string of amino acid residues can be turned into a instance of AASequence to provide some commonly used operations like chemical formula, weight, and isotope distribution calculations.

The example below shows how amino acid sequences can be created and how basic mass calculations are conducted.

 1from pyopenms import *
 2seq = AASequence.fromString("DFPIANGER") # create AASequence object from string representation
 3prefix = seq.getPrefix(4) # extract prefix of length 4
 4suffix = seq.getSuffix(5) # extract suffix of length 5
 5concat = seq + seq # concatenate two sequences
 6
 7# print string representation of sequences
 8print("Sequence:", seq)
 9print("Prefix:", prefix)
10print("Suffix:", suffix)
11print("Concatenated:", concat)
12
13# some mass calculations
14mfull = seq.getMonoWeight() # weight of M
15mprecursor = seq.getMonoWeight(Residue.ResidueType.Full, 2) # weight of M+2H
16
17# we can calculate mass-over-charge manually
18mz = seq.getMonoWeight(Residue.ResidueType.Full, 2) / 2.0 # m/z of [M+2H]2+
19# or simply by:
20mz = seq.getMZ(2) # same as above
21
22print()
23print("Monoisotopic mass of peptide [M] is", mfull)
24print("Monoisotopic mass of peptide precursor [M+2H]2+ is", mprecursor)
25print("Monoisotopic m/z of [M+2H]2+ is", mz)

Which prints the amino acid sequence as well as the result of concatenating two sequences or taking the suffix of a sequence. We then compute the mass of the full peptide ([M]), the mass of the peptide precursor ([M+2H]2+) and m/z value of the peptide precursor ([M+2H]2+). Note that, the mass of the peptide precursor is shifted by two protons that are now attached to the molecules as charge carriers. (Detail: the proton mass of 1.007276 u is slightly different from the mass of an uncharged hydrogen atom at 1.007825 u). We can easily calculate the charged weight of a (M+2H)2+ ion and compute m/z simply dividing by the charge.

Sequence: DFPIANGER
Prefix: DFPI
Suffix: ANGER
Concatenated: DFPIANGERDFPIANGER

Monoisotopic mass of peptide [M] is 1017.4879641373001
Monoisotopic mass of peptide precursor [M+2H]2+ is 1019.5025170708421
Monoisotopic m/z of [M+2H]2+ is 509.7512585354211

The AASequence object also allows iterations directly in Python:

1seq = AASequence.fromString("DFPIANGER")
2
3print("The peptide", str(seq), "consists of the following amino acids:")
4for aa in seq:
5    print(aa.getName(), ":", aa.getMonoWeight())

Which will print

The peptide DFPIANGER consists of the following amino acids:
Aspartate : 133.0375092233
Phenylalanine : 165.0789793509
Proline : 115.0633292871
Isoleucine : 131.0946294147
Alanine : 89.04767922330001
Asparagine : 132.0534932552
Glycine : 75.0320291595
Glutamate : 147.05315928710002
Arginine : 174.1116764466

The N- and C-Terminus as well as the residues themself can be modified. The example below shows how to check fo such modifications.

 1seq = AASequence.fromString("C[143]PKCK(Label:13C(6)15N(2))CR")
 2
 3# check if AASequence has a N- or C-terminal modification
 4if seq.hasNTerminalModification():
 5    print("N-Term Modification: ", seq.getNTerminalModification().getFullId())
 6if seq.hasCTerminalModification():
 7    print("C-Term Modification: ", seq.getCTerminalModification().getFullId())
 8# iterate over all residues and look for modifications
 9for aa in seq:
10    if (aa.isModified()):
11        print(aa.getName(), ":", aa.getMonoWeight(), ":", aa.getModificationName())
12    else:
13        print(aa.getName(), ":", aa.getMonoWeight())

Which will print:

N-Term Modification:  Pyro-carbamidomethyl (N-term C)
Cysteine : 121.01974995329999
Proline : 115.06332928709999
Lysine : 146.1055284466
Cysteine : 121.01974995329999
Lysine : 154.11972844660002 : Label:13C(6)15N(2)
Cysteine : 121.01974995329999
Arginine : 174.1116764466

Molecular formula

We can now combine our knowledge of AASequence with what we learned in about EmpiricalFormula to get accurate mass and isotope distributions from the amino acid sequence. But first, let’s get the formula of peptide:

1seq = AASequence.fromString("DFPIANGER")
2seq_formula = seq.getFormula()
3print("Peptide", seq, "has molecular formula", seq_formula)

Isotope patterns

We now want to print the coarse (e.g., peaks only at nominal masses) distribution.

1# print coarse isotope distribution
2coarse_isotopes = seq_formula.getIsotopeDistribution( CoarseIsotopePatternGenerator(6) )
3for iso in coarse_isotopes.getContainer():
4    print ("Isotope", iso.getMZ(), "has abundance", iso.getIntensity()*100, "%")

For most applications in computational proteomics, the coarse isotope distribution is sufficient. But if we deal with very high resolution instruments, we still might want to calculate the isotopic fine structure. We use the FineIsotopePatternGenerator in OpenMS to reveal these addtional peaks:

1# print fine structure of isotope distribution
2fine_isotopes = seq_formula.getIsotopeDistribution( FineIsotopePatternGenerator(0.01) ) # max 0.01 unexplained probability
3for iso in fine_isotopes.getContainer():
4    print ("Isotope", iso.getMZ(), "has abundance", iso.getIntensity()*100, "%")

And plot the very similar looking distributions using standard matplotlib functionality:

 1import math
 2from matplotlib import pyplot as plt
 3
 4def plotIsotopeDistribution(isotope_distribution, title="Isotope distribution"):
 5    plt.title(title)
 6    distribution = {"mass": [], "abundance": []}
 7    for iso in isotope_distribution.getContainer():
 8        distribution["mass"].append(iso.getMZ())
 9        distribution["abundance"].append(iso.getIntensity() * 100)
10
11    bars = plt.bar(distribution["mass"], distribution["abundance"], width=0.01, snap=False) # snap ensures that all bars are rendered
12
13    plt.ylim([0, 110])
14    plt.xticks(range(math.ceil(distribution["mass"][0]) - 2,
15                     math.ceil(distribution["mass"][-1]) + 2))
16    plt.xlabel("Atomic mass (u)")
17    plt.ylabel("Relative abundance (%)")
18
19plt.figure(figsize=(10,7))
20plt.subplot(1,2,1)
21plotIsotopeDistribution(coarse_isotopes, "Isotope distribution - coarse")
22plt.subplot(1,2,2)
23plotIsotopeDistribution(fine_isotopes, "Isotope distribution - fine structure")
24plt.show()
_images/DFPIANGER_isoDistribution.png

Fragment ions

We can easily calculate different ion types for amino acid sequences:

 1suffix = seq.getSuffix(3) # y3 ion "GER"
 2print("="*35)
 3print("y3 ion sequence:", suffix)
 4y3_formula = suffix.getFormula(Residue.ResidueType.YIon, 2) # y3++ ion
 5suffix.getMonoWeight(Residue.ResidueType.YIon, 2) / 2.0 # CORRECT
 6suffix.getMonoWeight(Residue.ResidueType.XIon, 2) / 2.0 # CORRECT
 7suffix.getMonoWeight(Residue.ResidueType.BIon, 2) / 2.0 # INCORRECT
 8
 9print("y3 mz:", suffix.getMonoWeight(Residue.ResidueType.YIon, 2) / 2.0 )
10print("y3 molecular formula:", y3_formula)

Which will produce

===================================
y3 ion sequence: GER
y3 mz: 181.09514385
y3 molecular formula: C13H24N6O6

Easy, isn’t it? To generate full theoretical spectra watch out for the more specialized (and faster) TheoreticalSpectrumGenerator which we will take a look at later.

Modified Sequences

The AASequence class can also handle modifications, modifications are specified using a unique string identifier present in the ModificationsDB in round brackets after the modified amino acid or by providing the mass of the residue in square brackets. For example AASequence.fromString(".DFPIAM(Oxidation)GER.") creates an instance of the peptide “DFPIAMGER” with an oxidized methionine. There are multiple ways to specify modifications, and AASequence.fromString("DFPIAM(UniMod:35)GER"), AASequence.fromString("DFPIAM[+16]GER") and AASequence.fromString("DFPIAM[147]GER") are all equivalent).

 1    seq = AASequence.fromString("PEPTIDESEKUEM(Oxidation)CER")
 2    print(seq.toUnmodifiedString())
 3    print(seq.toString())
 4    print(seq.toUniModString())
 5    print(seq.toBracketString())
 6    print(seq.toBracketString(False))
 7
 8    print(AASequence.fromString("DFPIAM(UniMod:35)GER"))
 9    print(AASequence.fromString("DFPIAM[+16]GER"))
10    print(AASequence.fromString("DFPIAM[+15.99]GER"))
11    print(AASequence.fromString("DFPIAM[147]GER"))
12    print(AASequence.fromString("DFPIAM[147.035405]GER"))

The above code outputs:

PEPTIDESEKUEMCER
PEPTIDESEKUEM(Oxidation)CER
PEPTIDESEKUEM(UniMod:35)CER
PEPTIDESEKUEM[147]CER
PEPTIDESEKUEM[147.0354000171]CER

DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER

Note there is a subtle difference between AASequence.fromString(".DFPIAM[+16]GER.") and AASequence.fromString(".DFPIAM[+15.9949]GER.") - while the former will try to find the first modification matching to a mass difference of 16 +/- 0.5, the latter will try to find the closest matching modification to the exact mass. The exact mass approach usually gives the intended results while the first approach may or may not. In all instances, it is better to use an exact description of the desired modification, such as UniMod, instead of mass differences.

N- and C-terminal modifications are represented by brackets to the right of the dots terminating the sequence. For example, ".(Dimethyl)DFPIAMGER." and ".DFPIAMGER.(Label:18O(2))" represent the labelling of the N- and C-terminus respectively, but ".DFPIAMGER(Phospho)." will be interpreted as a phosphorylation of the last arginine at its side chain:

1    s = AASequence.fromString(".(Dimethyl)DFPIAMGER.")
2    print(s, s.hasNTerminalModification())
3    s = AASequence.fromString(".DFPIAMGER.(Label:18O(2))")
4    print(s, s.hasCTerminalModification())
5    s = AASequence.fromString(".DFPIAMGER(Phospho).")
6    print(s, s.hasCTerminalModification())

Arbitrary/unknown amino acids (usually due to an unknown modification) can be specified using tags preceded by X: “X[weight]”. This indicates a new amino acid (“X”) with the specified weight, e.g. "RX[148.5]T". Note that this tag does not alter the amino acids to the left (R) or right (T). Rather, X represents an amino acid on its own. Be careful when converting such AASequence objects to an EmpiricalFormula using getFormula(), as tags will not be considered in this case (there exists no formula for them). However, they have an influence on getMonoWeight() and getAverageWeight()!

Proteins and FASTA files

Protein sequences, can be loaded from and stored in FASTA protein databases using FASTAFile. The example below shows how protein sequences can be stored in FASTA files and loaded back in pyOpenMS:

 1    bsa = FASTAEntry() # one entry in a FASTA file
 2    bsa.sequence = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"
 3    bsa.description = "BSA Bovine Albumin (partial sequence)"
 4    bsa.identifier = "BSA"
 5    alb = FASTAEntry()
 6    alb.sequence = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGE"
 7    alb.description = "ALB Human Albumin (partial sequence)"
 8    alb.identifier = "ALB"
 9
10    entries = [bsa, alb]
11
12    f = FASTAFile()
13    f.store("example.fasta", entries)

Afterwards, the example.fasta file can be read again from the disk:

1    entries = []
2    f = FASTAFile()
3    f.load("example.fasta", entries)
4    print( len(entries) )
5    for e in entries:
6      print (e.identifier, e.sequence)