De novo sequencing

Practical session

Salvador Martínez de Bartolomé, Bioinformatics Support, ProteoRed
[email protected]

Code  Abbrev.  Full name      Side chain   Mass (Da)  Isoelectric point
A     Ala      Alanine        hydrophobic   89.09      6.11
C     Cys      Cysteine       hydrophilic  121.16      5.05
D     Asp      Aspartic acid  acidic       133.10      2.85
E     Glu      Glutamic acid  acidic       147.13      3.15
F     Phe      Phenylalanine  hydrophobic  165.19      5.49
G     Gly      Glycine        hydrophilic   75.07      6.06
H     His      Histidine      basic        155.16      7.60
I     Ile      Isoleucine     hydrophobic  131.17      6.05
K     Lys      Lysine         basic        146.19      9.60
L     Leu      Leucine        hydrophobic  131.17      6.01
M     Met      Methionine     hydrophobic  149.21      5.74
N     Asn      Asparagine     hydrophilic  132.12      5.41
P     Pro      Proline        hydrophobic  115.13      6.30
Q     Gln      Glutamine      hydrophilic  146.15      5.65
R     Arg      Arginine       basic        174.20     10.76
S     Ser      Serine         hydrophilic  105.09      5.68
T     Thr      Threonine      hydrophilic  119.12      5.60
V     Val      Valine         hydrophobic  117.15      6.00
W     Trp      Tryptophan     hydrophobic  204.23      5.89
Y     Tyr      Tyrosine       hydrophilic  181.19      5.64

Why de novo sequencing is difficult

1. Leucine and isoleucine have the same mass
2. Glutamine and lysine differ in mass by 0.036 Da
3. Phenylalanine and oxidized methionine differ in mass by 0.033 Da
4. Cleavages do not occur at every peptide bond (or cannot be observed in the MS/MS spectrum)
– Poor quality spectrum (some fragment ions are below noise level)
– The C-terminal side of proline is often resistant to cleavage
– Absence of mobile protons
– Peptides with free N-termini often lack fragmentation between the first and second amino acids

Why de novo sequencing is difficult (II)

5. Certain amino acids have the same (or nearly the same) residue mass as pairs of other amino acids
– Gly + Gly (114.0429) = Asn (114.0429)
– Ala + Gly (128.0586) = Gln (128.0586)
– Ala + Gly (128.0586) ≈ Lys (128.0950)
– Gly + Val (156.0899) ≈ Arg (156.1011)
– Ala + Asp (186.0641) ≈ Trp (186.0793)
– Ser + Val (186.1004) ≈ Trp (186.0793)
6. Directionality of an ion series is not always known (are they b- or y-ions?)
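The mass coincidences in point 5 can be rechecked with a few lines of Python. This is only a sketch: the table holds standard monoisotopic residue masses rounded to five decimals, and the function name and the 0.05 Da tolerance are illustrative choices, not part of the original slides.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def ambiguous_pairs(tolerance=0.05):
    """List (pair, residue, mass difference) where two residues mimic one."""
    hits = []
    for a, b in combinations_with_replacement(sorted(RESIDUE_MASS), 2):
        pair_mass = RESIDUE_MASS[a] + RESIDUE_MASS[b]
        for single, mass in RESIDUE_MASS.items():
            if abs(pair_mass - mass) <= tolerance:
                hits.append((a + b, single, round(pair_mass - mass, 4)))
    return hits


for pair, single, delta in ambiguous_pairs():
    print(f"{pair} vs {single}: {delta:+.4f} Da")
```

With this tolerance the search also reports Glu + Gly against Trp, which has the same composition-derived mass as Ala + Asp.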


• Two bioinformatic approaches

– Global approach: all possible peptides Pi with mass Mi are computed. A theoretical fragmentation spectrum, T(Pi), is then generated for each computed amino acid sequence, and finally each T(Pi) is compared with the real spectrum S. The solution is the sequence of the peptide Pi whose theoretical spectrum T(Pi) is most similar to S.

– Local approach: here the information in the spectrum peaks is used to reduce the number of candidates to generate. Sequencing starts from the N-terminal end and checks for a peak whose mass difference from that end corresponds to the mass of an amino acid. The process continues by checking for further peaks in the spectrum in the same way.
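A minimal sketch of the local approach, assuming a list of singly charged b-ion m/z values and a greedy walk that takes the first matching peak at each step. Real tools explore several branches and score them; the function name, the tolerance and the example peaks (a synthetic ladder for the peptide SAVE) are illustrative assumptions.

```python
PROTON = 1.00728  # proton mass in Da; starting point of a singly charged b-ion ladder

# Monoisotopic residue masses in Da; "L" stands for Leu or Ile (same mass).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def walk_from_n_terminus(peaks, tolerance=0.02):
    """Greedily extend a sequence from the N-terminus, one residue per peak."""
    sequence = []
    current = PROTON
    while True:
        # First peak that lies exactly one residue mass above the current position.
        match = next(
            ((residue, peak)
             for peak in sorted(peaks)
             for residue, mass in RESIDUE_MASS.items()
             if abs(peak - current - mass) <= tolerance),
            None,
        )
        if match is None:
            return "".join(sequence)  # no further peak explains a residue
        sequence.append(match[0])
        current = match[1]


b_ions = [88.03931, 159.07642, 258.14483, 387.18742]  # synthetic b1..b4 of SAVE
print(walk_from_n_terminus(b_ions))  # prints SAVE
```

Removing the second peak from `b_ions` makes the walk stop after the first residue, which is the weakness of local methods in regions with few peaks.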


• Improved approach: local or global?

– Global algorithms have two main problems: the exponential growth of the number of candidates, and the computation time needed to compare the theoretical spectra generated from those candidates with the real spectrum. Both problems depend on the mass to be determined, although a solution, when one exists, is always found.

• Solution: the region of the spectrum to be determined by the global approach is reduced to a minimum, so that these drawbacks do not become a problem.

– Local algorithms, which are faster and more directed, do not produce good solutions in poorly represented regions of the spectrum (few peaks).

• Solution: in this system, a local search is therefore performed only in regions with a considerable abundance of peaks.
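The growth in the number of candidates for the global approach can be illustrated by counting how many residue sequences add up to a given mass. The sketch below simplifies the problem by using nominal (integer) residue masses; the function name is an illustrative assumption.

```python
# Nominal residue masses of the 20 standard amino acids
# (113 appears twice for Leu/Ile, 128 twice for Gln/Lys).
NOMINAL = [57, 71, 87, 97, 99, 101, 103, 113, 113, 114,
           115, 128, 128, 129, 131, 137, 147, 156, 163, 186]


def count_sequences(total_mass):
    """Count ordered residue sequences whose nominal masses sum to total_mass."""
    ways = [0] * (total_mass + 1)
    ways[0] = 1  # the empty sequence
    for mass in range(1, total_mass + 1):
        # A sequence of this mass ends with some residue r and continues
        # a shorter sequence of mass (mass - r).
        ways[mass] = sum(ways[mass - r] for r in NOMINAL if r <= mass)
    return ways[total_mass]


for mass in (128, 500, 1000, 1500):
    print(mass, count_sequences(mass))
```

The count rises steeply with mass, which is why restricting the global search to a small region of the spectrum keeps it tractable.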

Summary of de novo sequencing tools

Software                              Source website
PEAKS*                                www.bioinformaticssolutions.com
SeqMS (download)                      www.protein.osaka-u.ac.jp/rcsfp/profiling/SeqMS.html
Sherenga (included in SpectrumMill)*  N/A
Lutefisk (download)                   www.hairyfatguy.com/Lutefisk
DeNovoX*                              www.thermo.com
PepNovo                               peptide.ucsd.edu/pepnovo.py
SpectrumMill*                         www.home.agilent.com

*Commercial software

De novo practical session

Lutefisk

Lutefisk is software for the de novo interpretation of peptide CID spectra
• http://www.hairyfatguy.com/lutefisk/
• To run Lutefisk, you need to have four files within the same directory or folder:
– CID data file (data files can be specified with a full or partial pathname)
– Lutefisk.details
– Lutefisk.params
– Lutefisk.residues
• One additional file is optional:
– Database.sequence

‘Lutefisk.details’ file
• The Lutefisk.details file contains the so-called "ion probabilities" for each type of ion.
• Each column in the file contains the "ion probabilities" for a different fragmentation pattern.
• Currently only two fragmentation patterns have been coded: low energy CID of tryptic peptides on triple quadrupole (or Qtof) instruments, and on ion traps. Their ion probabilities are listed in the second and third columns. The first column is not used (oddly enough).

‘Lutefisk.residues’ file
• The Lutefisk.residues file contains the single letter code, monoisotopic mass, average mass, and nominal mass for each amino acid.
• To add an additional residue to the list, replace the 0's in one of the rows with the corresponding monoisotopic, average, and nominal masses.
• Up to five additional non-traditional residues can be entered here, and will be given the single letter code J, O, U, X, or Z.
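The three masses needed for a new row can be derived from the elemental composition of the residue. The sketch below assumes a simple formula string with each element listed once, and uses carbamidomethylated cysteine (C5H8N2O2S) purely as an illustration; the function name is hypothetical and the layout of the Lutefisk.residues file itself is not reproduced here.

```python
import re

# Monoisotopic and average atomic masses in Da for the elements found in peptides.
MONO = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052,
        "O": 15.9949146221, "S": 31.97207069}
AVERAGE = {"C": 12.0107, "H": 1.00794, "N": 14.0067,
           "O": 15.9994, "S": 32.065}


def residue_masses(formula):
    """Return (monoisotopic, average, nominal) mass for a formula such as 'C5H8N2O2S'."""
    # Each match is (element symbol, count); a missing count means 1.
    counts = {el: int(n or 1) for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula)}
    mono = sum(MONO[el] * n for el, n in counts.items())
    average = sum(AVERAGE[el] * n for el, n in counts.items())
    nominal = sum(round(MONO[el]) * n for el, n in counts.items())
    return round(mono, 4), round(average, 3), nominal


# Residue formula of carbamidomethylated cysteine (illustrative choice).
print(residue_masses("C5H8N2O2S"))
```

The formula must describe the residue (the amino acid minus one water), since that is what adds up along a fragment ion series.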

‘Database.sequence’ file
• The Database.sequence file is a text file containing a sequence or a list of sequences that might have been derived from a sequence database search.

• In the final steps, where it determines scores for the candidate sequences, Lutefisk tosses in these database-derived sequences along with the de novo sequence candidates to determine if the database sequences are as good as or better than the de novo sequences. If so, then this constitutes evidence that the database derived sequences might actually be correct.

‘Lutefisk.params’ file

241103plata_bernabe.369.369.2.dta


Lutefisk help

>lutefisk.exe -h

Run Lutefisk.exe

• Once all files are configured correctly, open a command prompt in the “C:\Documents and Settings\Bioworks32\Desktop\denovo\LUTEFISK” folder and type:
– Lutefisk.exe

Output from Lutefisk – lut file

• The candidate sequences are ranked according to Pr(C), the estimated probability of being correct.
• Four scores are also given:
– Pevzscr is an adaptation of the ideas presented by Dancik et al (J. of Comput. Biol (1999) Vol 6, 327): a score that penalizes the absence of expected ions and accounts for the possibility of random matches.
– Quality is the percentage of the peptide mass that can be accounted for by a contiguous ion series.
– Intscr is the percentage of the fragment ion intensity that can be accounted for as b, y, internal fragment, etc. ions.
– X-corr is the cross-correlation score, normalized by its auto-correlation score.

PepNovo

• PepNovo was developed at the University of California, San Diego.
• Its scoring method uses a probabilistic network whose structure reflects the chemical and physical rules that govern peptide fragmentation in a mass spectrometer.
• It is specific for ion trap data.
• It is available online at http://bix.ucsd.edu/MassSpec/ and also as an in-house installation.

• PepNovo runs via command line arguments:
– -file <full path to input file> to specify a single input file (mgf, dta, mzxml)
or
– -list <full path to txt file> to give a list of input files (this is the preferred method for large numbers of files, since the models are not reread for each input file)
– -model <model name> (currently only CID_IT_TRYP is available)

• Optional PepNovo arguments:
– -prm - only prints spectrum graph nodes with scores.
– -prm_norm - prints spectrum graph scores after normalization and removal of negative scores.
– -correct_pm - finds optimal precursor mass and charge values.
– -use_spectrum_charge - does not correct charge.
– -use_spectrum_mz - does not correct the precursor m/z value that appears in the file.
– -no_quality_filter - does not remove low quality spectra.
– -fragment_tolerance < 0-0.75 > - the fragment tolerance (each model has a default setting)
– -pm_tolerance < 0-5.0 > - the precursor mass tolerance (each model has a default setting)
– -PTMs <PTM string> - PTMs separated by colons (no spaces), e.g., M+16:S+80:N+1
– -digest <NON_SPECIFIC,TRYPSIN> - default TRYPSIN
– -num_solutions < number > - default 20
– -tag_length < 3-6 > - returns peptide sequences of the specified length (only lengths 3-6 are allowed).
– -model_dir < path > - directory where model files are kept (default ./Models)
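When PepNovo is called from a script, the argument list can be assembled and checked before launching the program. The sketch below uses only the flags listed above; the helper name `pepnovo_args` and its parameters are hypothetical, and the executable name `PepNovo.exe` is taken from the usage examples in these slides.

```python
def pepnovo_args(spectrum_file, model="CID_IT_TRYP", ptms=(), digest="TRYPSIN",
                 tag_length=None, num_solutions=None):
    """Build a PepNovo argument list for a single input file."""
    if digest not in ("TRYPSIN", "NON_SPECIFIC"):
        raise ValueError("digest must be TRYPSIN or NON_SPECIFIC")
    args = ["PepNovo.exe", "-file", spectrum_file, "-model", model, "-digest", digest]
    if ptms:
        # PTMs are joined with colons and no spaces, e.g. C+57:M+16
        args += ["-PTMs", ":".join(ptms)]
    if tag_length is not None:
        if not 3 <= tag_length <= 6:
            raise ValueError("tag_length must be between 3 and 6")
        args += ["-tag_length", str(tag_length)]
    if num_solutions is not None:
        args += ["-num_solutions", str(num_solutions)]
    return args


print(" ".join(pepnovo_args("my_great_spectra.mgf", ptms=("C+57", "M+16"),
                            digest="NON_SPECIFIC", tag_length=3, num_solutions=50)))
```

The resulting list can be passed to `subprocess.run(args, check=True)` from the folder that contains the executable and its Models directory.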

• Using PepNovo:

>PepNovo.exe -list paths_of_lots_of_spectra.txt -model CID_IT_TRYP -PTMs C+57:M+16 -digest TRYPSIN

This command runs PepNovo on all the spectra files listed in “paths_of_lots_of_spectra.txt”. It assumes that the peptides were digested with trypsin, that the cysteines are carbamidomethylated, and that the methionines can be oxidized. The output is the default of 20 sequences.

>PepNovo.exe -file my_great_spectra.mgf -model CID_IT_TRYP -PTMs C+57:M+16 -digest NON_SPECIFIC -tag_length 3 -num_solutions 50

This command runs PepNovo on a single mgf file and generates 50 tags of length 3 for each spectrum (it assumes that the digest was not with trypsin).

• Using PepNovo (in the “C:\Documents and Settings\Bioworks32\Desktop\denovo\pepNovo” folder):

>PepNovo.exe -file 241103plata_bernabe.369.369.2.dta -model CID_IT_TRYP -digest TRYPSIN

>PepNovo.exe -file FQSEEQQQTEDELQDK.dta -model CID_IT_TRYP -digest TRYPSIN

• PepNovo output:
• The output gives the following tab delimited fields for each MS/MS spectrum:
– Idx - the sequence/tag rank (starts at 0)
– RnkScr - the ranking score (the major score that is used)
– PnvScr - the PepNovo score of the sequence (see Anal Chem 2005, and JPR 2006 for more details on the score).
– N-Gap - the mass gap from the N-terminal to the start of the de novo sequence.
– C-Gap - the mass gap from the C-terminal to the end of the de novo sequence.
– Sequence - the predicted amino acid sequence.
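The tab delimited fields above can be read into Python for further filtering. This is a sketch under assumptions: it expects exactly the six fields listed above, in that order, and skips any other line (headers, blanks). Real output files may carry extra columns, in which case `FIELDS` would need adjusting. The sample rows and their scores are invented for illustration.

```python
import csv
import io

FIELDS = ["Idx", "RnkScr", "PnvScr", "N-Gap", "C-Gap", "Sequence"]


def parse_candidates(text):
    """Parse tab delimited candidate lines into a list of dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter="\t"):
        # Keep only data lines: six fields, the first being the integer rank.
        if len(record) != len(FIELDS) or not record[0].isdigit():
            continue
        row = dict(zip(FIELDS, record))
        rows.append({
            "Idx": int(row["Idx"]),
            "RnkScr": float(row["RnkScr"]),
            "PnvScr": float(row["PnvScr"]),
            "N-Gap": float(row["N-Gap"]),
            "C-Gap": float(row["C-Gap"]),
            "Sequence": row["Sequence"],
        })
    return rows


# Invented sample: the scores and gaps are NOT real PepNovo results.
sample = (
    "Idx\tRnkScr\tPnvScr\tN-Gap\tC-Gap\tSequence\n"
    "0\t10.5\t120.3\t0.0\t0.0\tFQSEEQQQTEDELQDK\n"
    "1\t8.2\t101.7\t275.1\t0.0\tSEEQQQTEDELQDK\n"
)
top = min(parse_candidates(sample), key=lambda r: r["Idx"])
print(top["Sequence"])
```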

Pepnovo http://bix.ucsd.edu/MassSpec/
