Bioinformatics 2007-I Prof. Mirko Zimic Monday - Pairwise sequence alignment...

Post on 12-Jan-2016

Bioinformatics 2007-I Prof. Mirko Zimic

Monday - Pairwise sequence alignment. - Local and global alignment. - Scoring matrices - Dynamic programming algorithms - Dot plot

Wednesday - Pairwise sequence alignment: using the programs Clustal and Macaw, and online servers

“Nothing in biology makes sense except in the light of evolution”

T. Dobzhansky

“To align” = “To compare”

Finches of the Galápagos Islands observed by Charles Darwin on the voyage of HMS Beagle

Sequence alignment is similar to other types of comparative analysis

Involves scoring similarities and differences among a group of related entities

Homology

“Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about.”

David B. Wake

Similarity ≠ Homology

1) Similarity of 25% or more over at least 100 amino acids suggests likely homology

2) Homology is an evolutionary statement which means “descent from a common ancestor”

– common 3D structure

– usually common function

– all or nothing: one cannot say "50% homologous"

COMPARATIVE ANALYSIS

Alignment algorithms model evolutionary processes

[Figure: tree rooted at the ancestral sequence GATTACCA, with descendant sequences GATGACCA, GATTATCA, GATCATCA, GATTGATCA and GATACCA; branches are labeled insertion, deletion and substitution.]

Derivation from a common ancestor through incremental change due to DNA replication errors, mutations, damage, or unequal crossing-over.

Only extant sequences are known, ancestral sequences are postulated.

Extant sequences shown: GATCATCA, GATTGATCA, GATTACCA, GATACCA

The term homology implies a common ancestry, which may be inferred from observations of sequence similarity

Derivation from a common ancestor through incremental change. Mutations that do not kill the host may carry over to the population. Rarely are mutations kept or rejected by natural selection.

Sequence Alignments

• Why align?

-Can delineate sequence elements that are functionally significant

-Illuminates phylogenetic relationships

• Algorithms for sequence alignment

-Dynamic programming

-Dot-matrix

-Word-based algorithms

-Bayesian methods

What is Meant by Alignment?

Identical nucleotide sequences (trivial example)

ATTCGGCATTCAGTGCTAGA
ATTCGGCATTCAGTGCTAGA

Score = 20 (20 × 1)

Imperfect match

ATTCGGCATTCAGTGCTAGA
ATTCGGCATTGCTAGA

Score = 11

A better alignment

ATTCGGCATTCAGTGCTAGA
ATTCGGCATT----GCTAGA

Score = 14 = 10 + 6 + 4(-0.5), where 4(-0.5) is the gap penalty
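The scores on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming +1 per match, 0 per mismatch and -0.5 per gap position; the mismatch value is not stated on the slide and is inferred from the score of 11 for the imperfect match. The function name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, match=1.0, mismatch=0.0, gap=-0.5):
    """Sum the column scores of two aligned strings of equal length ('-' = gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0.0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one gap position
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # assumed value, inferred from the slide
    return total


identical = score_alignment("ATTCGGCATTCAGTGCTAGA", "ATTCGGCATTCAGTGCTAGA")
gapped = score_alignment("ATTCGGCATTCAGTGCTAGA", "ATTCGGCATT----GCTAGA")
print(identical, gapped)
```

With these assumed parameters the identical pair scores 20 and the gapped alignment scores 14, as on the slide.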

Beware of aligning apples and oranges [and grapefruit]!

Paralogous versus orthologous;

genomic versus cDNA;

mature versus precursor.

Alignments can be performed on DNA sequences as well as on protein sequences…

Why Do We Want To Compare Sequences?

wheat  --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
?????  KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

[Figure labels: SwissProt; Homology?; EXTRAPOLATE; ??????]

Why Does It Make Sense To Align Sequences ?

-Evolution is our Real Tool.

-Nature is LAZY and Keeps re-using Stuff.

-Evolution is mostly DIVERGENT

Same Sequence Same Ancestor

Why Does It Make Sense To Align Sequences ?

Same Sequence

Same Function

Same 3D Fold

Same Origin

Comparing Is Reconstructing Evolution

An Alignment is a STORY

[Figure: the ancestral sequence ADKPKRPLSAYMLWLN is duplicated and, through mutations + selection, diverges into the two sequences below.]

ADKPKRPKPRLSAYMLWLN
ADKPRRPLS-YMLWLN

An Alignment is a STORY

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Events labeled on the alignment: Mutation, Insertion, Deletion
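The "story" told by the alignment ADKPRRP---LS-YMLWLN / ADKPKRPKPRLSAYMLWLN can be read off column by column. The sketch below is only an illustration: the function name `describe_columns` and the three labels are hypothetical, and a gap column is reported as an "indel" because the alignment alone does not say whether it was an insertion or a deletion.

```python
def describe_columns(seq_a, seq_b):
    """Label each alignment column as 'match', 'substitution' or 'indel'."""
    events = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            events.append("indel")         # insertion or deletion
        elif a == b:
            events.append("match")         # conserved residue
        else:
            events.append("substitution")  # point mutation
    return events


events = describe_columns("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN")
counts = {kind: events.count(kind) for kind in ("match", "substitution", "indel")}
print(counts)
```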

Evolution is NOT Always Divergent…

AFGP with (ThrAlaAla)n, similar to trypsinogen

AFGP with (ThrAlaAla)n, NOT similar to trypsinogen

[Figure labels: N, S]

SIMILAR sequences BUT DIFFERENT origin

…But in MOST cases, you may assume it is.

How Do Sequences Evolve ?

CONSTRAINED Genome Positions Evolve SLOWLY

EVERY Protein Family Has its Own Level Of Constraint

Family             Ks    Ka
Histone3           6.4   0
Insulin            4.0   0.1
Interleukin I      4.6   1.4
Globin             5.1   0.6
Apolipoprot. AI    4.5   1.6
Interferon G       8.6   2.8

Rates in substitutions/site/billion years, as measured on mouse vs. human (80 million years). Ks: synonymous mutations; Ka: non-synonymous (non-neutral) mutations.
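One way to compare the level of constraint across families is the ratio Ka/Ks. The slide does not compute this ratio, so the sketch below is an added illustration that uses only the values from the table: a ratio close to 0 means that almost no non-synonymous change is tolerated.

```python
# (Ks, Ka) pairs copied from the table above, in substitutions/site/billion years.
rates = {
    "Histone3": (6.4, 0.0),
    "Insulin": (4.0, 0.1),
    "Interleukin I": (4.6, 1.4),
    "Globin": (5.1, 0.6),
    "Apolipoprot. AI": (4.5, 1.6),
    "Interferon G": (8.6, 2.8),
}

# Ka/Ks per family: the lower the ratio, the stronger the constraint.
ratios = {family: ka / ks for family, (ks, ka) in rates.items()}

most_constrained = min(ratios, key=ratios.get)
least_constrained = max(ratios, key=ratios.get)

for family, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{family:16s} Ka/Ks = {ratio:.3f}")
```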

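How Do Sequences Evolve ? The Amino Acids Venn Diagram

To Make Things Worse, Every Residue has its Own Personality

[Figure: Venn diagram of amino acid properties. Region labels: Hydrophobic, Aliphatic, Aromatic, Polar, Small. Residue groups as printed: G C; L I V; A; F; C; S T; W Y; Q H K; R; E D; N; P G.]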

How Do Sequences Evolve ?

In a structure, each Amino Acid plays a Special Role

OmpR, C-terminal domain

In the core, SIZE MATTERS

On the surface, CHARGE MATTERS

How Do Sequences Evolve ?

Accepted Mutations Depend on the Structure

In the core: Big -> Big, Small -> Small, NO DELETION

On the surface: Charged -> Charged, Small <-> Big or Small, DELETIONS

How Can We Compare Sequences ?

To Compare Two Sequences, We need:

Their Function

Their Structure

We Do Not Have Them !!!

How Can We Compare Sequences ?

We will Need To Replace Structural Information With Sequence Information.

Same Sequence

Same Function

Same 3D Fold

Same Origin

It CANNOT Work ALL THE TIME !!!

How Can We Compare Sequences ?

To compare sequences, we need to compare residues.

We need to know how much it COSTS to SUBSTITUTE

an Alanine into an Isoleucine

a Tryptophan into a Glycine

…

The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX

How to derive that matrix?

How Can We Compare Sequences ? Making a Substitution Matrix

-Take 100 nice pairs of Protein Sequences, easy to align (80% identical).

-Align them…

-Count each mutation in the alignments

-25 Tryptophans into Phenylalanine

-30 Isoleucines into Leucine

…

-For each mutation, set the substitution score to the log odds ratio:

$$\text{score} = \log\left(\frac{\text{Observed}}{\text{Expected by chance}}\right)$$
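The log odds ratio can be sketched as follows. The base of the logarithm is not given on the slide; base 2 is an assumption here, and the function name `log_odds_score` and the example frequencies are hypothetical, not real counts.

```python
import math


def log_odds_score(observed_freq, expected_freq):
    """Log (base 2, an assumed choice) of observed over expected pair frequency."""
    if observed_freq <= 0 or expected_freq <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(observed_freq / expected_freq)


# Hypothetical numbers: a substitution seen twice as often as chance predicts
# gets a positive score, one seen half as often gets a negative score.
favoured = log_odds_score(0.25, 0.125)
disfavoured = log_odds_score(0.125, 0.25)
neutral = log_odds_score(0.125, 0.125)
print(favoured, disfavoured, neutral)
```

A positive score marks a substitution that is accepted more often than chance, which is why conservative replacements such as Isoleucine into Leucine end up with positive entries.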

How Can We Compare Sequences ? Making a Substitution Matrix

The diagonal indicates how conserved a residue tends to be.

W is VERY conserved

Some residues are easier to mutate into other, similar residues

Cysteines that make disulfide bridges and those that do not get averaged

How Can We Compare Sequences ? Using Substitution Matrix

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Events labeled on the alignment: Mutation, Insertion, Deletion

Given two sequences and a substitution matrix, we must compute the CHEAPEST alignment

Most popular Substitution Matrices

• PAM250

• Blosum62 (Most widely used)

Raw Score

TPEA
¦| |
APGA

Score = 1 + 6 + 0 + 2 = 9

• Question: Is it possible to get such a good alignment by chance only?
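The raw score is simply the sum of the substitution scores of the aligned pairs. In the sketch below the four pair scores are the ones read off the slide (T/A = 1, P/P = 6, E/G = 0, A/A = 2); the slide does not name the matrix they come from, and `PAIR_SCORES` and `raw_score` are hypothetical names.

```python
# Pair scores as shown on the slide; a real substitution matrix has 20 x 20 entries.
PAIR_SCORES = {("T", "A"): 1, ("P", "P"): 6, ("E", "G"): 0, ("A", "A"): 2}


def raw_score(top, bottom, scores):
    """Add up the substitution scores of an ungapped alignment."""
    total = 0
    for a, b in zip(top, bottom):
        # The matrix is symmetric, so look the pair up in either order.
        total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


score = raw_score("TPEA", "APGA", PAIR_SCORES)
print(score)
```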

Scoring an Alignment

Insertions and Deletions

Gap Penalties

• Opening a gap is more expensive than extending it

Seq A  GARFIELDTHE----CAT
       |||||||||||    |||
Seq B  GARFIELDTHELASTCAT

gap

Gap Opening Penalty

Gap Extension Penalty
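The rule that opening a gap costs more than extending it can be sketched as an affine function of the gap length. The penalty values (10 and 0.5) and the convention used here, in which the first gap position pays the opening penalty and each further position pays the extension penalty, are assumptions; programs differ on both.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of the given length under an affine penalty (assumed values)."""
    if length <= 0:
        return 0.0
    # The first position opens the gap, the remaining ones extend it.
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_cost(4)          # the ---- in Seq A
four_short_gaps = 4 * affine_gap_cost(1)   # four separate one-residue gaps
print(one_long_gap, four_short_gaps)
```

Under this scheme a single gap of four residues is much cheaper than four isolated gaps, which matches the idea that one insertion or deletion event is more likely than several.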

How Can We Compare Sequences ? Limits of the substitution Matrices

They ignore non-local interactions and assume that identical residues are equal

They assume evolution rate to be constant

[Figure: the tree in which ADKPKRPLSAYMLWLN diverges into ADKPKRPKPRLSAYMLWLN and ADKPRRPLS-YMLWLN through mutations + selection]

Substitution Matrices Cannot Work !!!

I know… But at least, could I get some idea of when they are likely to do all right?

How Can We Compare Sequences ? The Twilight Zone

[Figure: % sequence identity plotted against alignment length. Region labels: "Similar Sequence, Similar Structure" (Same 3D Fold); "Twilight Zone"; "Different Sequence, Structure ????". Marks on the axes: 100, 30%, 30.]

Substitution Matrices Work Reasonably Well on Sequences that have more than 30 % identity over more than 100 residues
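The rule of thumb above (more than 30% identity over more than 100 residues) can be written as a simple check. This is a sketch only: the function names are hypothetical, and identity is computed here over aligned, non-gap columns, which is one of several possible definitions.

```python
def percent_identity(top, bottom):
    """Percentage of identical residues among the columns without a gap."""
    aligned = [(a, b) for a, b in zip(top, bottom) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    same = sum(1 for a, b in aligned if a == b)
    return 100.0 * same / len(aligned)


def above_twilight_zone(identity_percent, aligned_length,
                        min_identity=30.0, min_length=100):
    """True when both thresholds from the slide are exceeded."""
    return identity_percent > min_identity and aligned_length > min_length


identity = percent_identity("GARFIELDTHE----CAT", "GARFIELDTHELASTCAT")
print(identity, above_twilight_zone(identity, 14))
```

The short GARFIELD example is 100% identical over its aligned columns but far too short for the rule to apply, so the check returns False.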

Major Differences between PAM and BLOSUM

PAM | BLOSUM
Built from global alignments | Built from local alignments
Built from small amount of data | Built from vast amount of data
Counting is based on minimum replacement or maximum parsimony | Counting based on groups of related sequences counted as one
Perform better for finding global alignments and remote homologs | Better for finding local alignments
Higher PAM series means more divergence | Lower BLOSUM series means more divergence

How Can We Compare Sequences ? Which Matrix Shall I use

PAM: Distant Proteins, High Index (PAM 350)

BLOSUM: Distant Proteins, Low Index (Blosum30)

•GONNET 250 > BLOSUM62 > PAM 250.

•But this will depend on:

•The Family.

•The Program Used and Its Tuning.

Choosing The Right Matrix may be Tricky…

•Insertions, Deletions?

HOW Can we Align Two Sequences ?

-Dot Matrices

-Global Alignments

-Local Alignment

Affine Gap Penalty

[Figure: gap cost plotted against gap length L]

Global Alignments

-Take 2 Nice Protein Sequences

-A good Substitution Matrix (blosum)

-A Gap opening Penalty (GOP)

-A Gap extension Penalty (GEP)

[Figure: an example alignment with its gaps marked GOP (gap opening) and GEP (gap extension)]

Parsimony: Evolution takes the simplest path

(So We Think…)

>Seq1
THEFATCAT
>Seq2
THEFASTCAT

-DYNAMIC PROGRAMMING

THEFA-TCAT
THEFASTCAT

Global Alignments

Brute Force Enumeration

F A S T

F A T

----FAT
FAST---

---FAT-
FAST---

--F-AT-
FAST---

Number of alignments to enumerate for two sequences of lengths L1 and L2:

$$\frac{(L_1 + L_2)!}{L_1!\, L_2!}$$
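To see why brute force enumeration is hopeless, the formula (L1+L2)! / (L1! * L2!) from the slide can be evaluated directly. The sketch below uses `math.comb`, which computes the same binomial coefficient; the function name `count_alignments` is hypothetical.

```python
import math


def count_alignments(len1, len2):
    """(L1 + L2)! / (L1! * L2!), the count given on the slide."""
    return math.comb(len1 + len2, len1)


small = count_alignments(3, 4)        # FAT versus FAST
large = count_alignments(100, 100)    # two short proteins
print(small)
print(f"{large:.2e}")
```

FAT versus FAST already gives 35 candidates, and two sequences of 100 residues give more than 10^58, which is why dynamic programming is used instead.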

DYNAMIC PROGRAMMING

Construct an optimal alignment of these two sequences:

GATACTA
GATTACCA

Using these scoring rules:

Match: +1
Mismatch: -1
Gap: -1

Dynamic Programming Example

Arrange the sequence residues along a two-dimensional lattice (GATACTA along one axis, GATTACCA along the other)

Vertices of the lattice fall between letters

The goal is to find the optimal path from here (one corner of the lattice) to here (the opposite corner)

Each path corresponds to a unique alignment

Which one is optimal?

The score for a path is the sum of its incremental edge scores:

A aligned with A: Match = +1

A aligned with T: Mismatch = -1

T aligned with NULL, or NULL aligned with T: Gap = -1

Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

[Figure sequence: over successive slides the lattice for GATACTA versus GATTACCA is filled in, starting from 0 at the start corner; each vertex is labeled with the score of the best sub-path leading to it.]

Trace-back to get optimal path and alignment

[Figure: the completed lattice with the optimal path highlighted]

Print out the alignment

GA-TACTA
GATTACCA
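The lattice walk-through above (fill in the best sub-path score at every vertex, then trace back) can be sketched in Python. This is a minimal illustration of the Needleman and Wunsch idea with the slide's linear scoring (match +1, mismatch -1, gap -1), not a production implementation; the function name is hypothetical, and when several optimal paths exist the trace-back simply prefers the diagonal move.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, top, bottom)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):            # first column: only gaps
        score[i][0] = i * gap
    for j in range(1, cols):            # first row: only gaps
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,   # align two residues
                              score[i - 1][j] + gap,        # gap in seq_b
                              score[i][j - 1] + gap)        # gap in seq_a
    # Trace back from the end corner to the start corner.
    top, bottom = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                top.append(seq_a[i - 1])
                bottom.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("GATACTA", "GATTACCA")
print(best)
print(top)
print(bottom)
```

For the two example sequences the optimal score is 4: six matches, one mismatch and one gap.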

Global Alignments: DYNAMIC PROGRAMMING

Dynamic Programming (Needleman and Wunsch)

Match=1 MisMatch=-1 Gap=-1

          F    A    S    T
     0   -1   -2   -3   -4
F   -1    1    0   -1   -2
A   -2    0    2    1    0
T   -3   -1    1    1    2

F A S T
F A - T

Local Alignments

GLOBAL Alignment

LOCAL Alignment

Smith And Waterman (SW)=LOCAL Alignment

Two different types of Alignment

Global Alignment methods:

Needleman & Wunsch (J. Mol. Biol. (1970) 48, 443-453): Problem of finding the best path. Revelation: any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to create a matrix of path “scores”, each entry being the score of the best path leading to that point. Trace the optimal path from one end to the other of the two sequences.

Local Alignment methods:

Smith & Waterman (J. Mol. Biol. (1981) 147, 195-197): Use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest scoring points in the path graph.

FASTP (Lipman & Pearson (1985), Science 227, 1435-1441)

BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410): don’t report all overlapping paths, but only attempt to find paths if there are words that are high-scoring. Speeds up the alignments considerably.

Global vs. Local Alignment

[Figure: two sequences sharing a high-scoring subsequence, separated by a gap, shown once as a global alignment and once as a local alignment]

Global alignment: best overall alignment independent of whether local high-scoring sequences are included

Local alignment: alignments involving high-scoring sequences take precedence over global features

GLOBAL & LOCAL SIMILARITY

Implementations of dynamic programming for global and local similarities

Optimal global alignment

Needleman & Wunsch (1970)

Sequences align essentially from end to end

Optimal local alignment

Smith & Waterman (1981)

Sequences align only in small, isolated regions
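The difference between the two approaches shows up in only a few lines of code. The sketch below is a minimal local alignment in the style of Smith and Waterman, assuming the same linear scoring as the earlier example (match +1, mismatch -1, gap -1): scores are never allowed to drop below 0, and the trace-back starts at the highest-scoring cell and stops when it reaches 0. It reports a single best local alignment only, and the function name is hypothetical.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best local alignment by dynamic programming; returns (score, top, bottom)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,                              # restart the path
                              score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            top.append(seq_a[i - 1])
            bottom.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq_a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local_score, local_top, local_bottom = smith_waterman("TTTTGARFIELD", "GARFIELDCCCC")
print(local_score, local_top, local_bottom)
```

Only the shared region GARFIELD is reported; the unrelated flanking residues, which a global alignment would be forced to include, are ignored.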

Filtering low complexity sequences

• Filters out short repeats and low complexity regions from the query sequences before searching the database

• Filtering helps to obtain statistically significant results and reduce the background noise resulting from matches with repeats and low complexity regions

• The output shows which regions of the query sequence were masked

Sequence Periodicities in Kinetoplast DNA

Marini et al. Proc. Natl. Acad. Sci. USA 79, 7664-7668 (1982)

Local Alignments

We now have a PairWise Comparison Algorithm,

We are ready to search Databases

Database Search

[Figure: a QUERY is compared against a Database by a Comparison Engine (SW); the hits are listed with numeric values such as 1.10e-100, 1.10e-20 and 1.10e-2.]

E-values: How many times do we expect such an alignment by chance?