
Introduction to Bioinformatics, 2002

Universidad Nacional San Cristobal de Huamanga, Ayacucho

Mirko Zimic

Topics of interest in bioinformatics

• Sequence analysis

• Phylogeny and molecular evolution

• Molecular modeling

• Protein folding

• Genomics and proteomics

• Statistical genetics

• Microarrays

• Scientific programming

Let's take an example …

Cysteine protease of Fasciola hepatica: in search of an immunogenic peptide

VPKSVDWREKGYVTPVKNQGQCGSCWAFSATGALEGQMFRKTGR ISLSEQNLVDCSRPQGNAVPDKIDWRESGYVTEVKDQGNCGSCWAFSTTGTMEGQYM KNERTSISFSEQQLVDCSRPWGN

_____RED_________

QGCNGGLMDNAFQYIKENGGLDSEESYPYEATDTSCNY KPEYSVANDTGFVDIPQREKA LMKNGCGGGLMENAYQYLKQF GLETESSYPYTAVGGQCRYNKQLG VAKVTGYYTV QSGSEVEL KN _VIOLET____ _YELLOW_______

AVATVGPISVAIDAGHSFQFYKSGIYYEPDCSSKDLDHGVLVVGYGFEG TDSNNNKYW IVKNSWLIGSEGPSAVAVDVESDFMMYRSGIYQSQTCSPLRVNHAVLAVGYGTQGGTD YW IVKNSW_____ _GREEN_____

GPEWGM-GYVKMAKDRNNH CGIATAASYPTVGLSWGERGYIRMV RNRGNMCGIASLASLPMVARFP

Alignment: mammalian cysteine proteases vs. the cysteine protease of Fasciola hepatica.

Identical amino acids; divergent amino acids

Discontinuous epitope, formed by distant portions of the sequence.

Denaturation

The epitope is lost upon denaturation.

Continuous epitope, formed by a single stretch of the sequence.

Denaturation

The epitope is preserved as such.

Three-dimensional modeling by homology. 56% sequence identity with chymopapain (1YAL)

Identical amino acids; divergent amino acids
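The identity figure quoted for the homology model can be illustrated with a small calculation. The sketch below assumes a pairwise alignment is already available as two equal-length strings with "-" for gaps, and it counts identity over gap-free columns only (other conventions for the denominator exist). The function name and the toy alignment are hypothetical and are not taken from the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical toy alignment, used only to exercise the function.
toy_a = "CGSCWAFSAT-GALEG"
toy_b = "CGSCWAFSTTMGTMEG"
print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```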

Surface analysis: atoms shown by van der Waals radius

TMEGQYMKNERTSISFS

YYTVQSGSEVELKNLIGSE

QSQTCSPLRVN

RYNKQLGVAKV

Selection of sequences that are (1) divergent, (2) solvent-accessible, and (3) continuous.
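Criteria (1) and (3) can be sketched directly on a pairwise alignment: scan the columns and keep runs where the target residue differs from the reference. This is only an illustration under stated assumptions. The function name, the `min_len` cutoff and the toy sequences are hypothetical, and criterion (2), solvent accessibility, needs the three-dimensional model and is not covered here.

```python
def divergent_segments(aligned_ref: str, aligned_target: str, min_len: int = 5):
    """Return (start, end, peptide) for runs of divergent columns in the target.

    start and end are 0-based alignment columns, end exclusive.
    A gap in the target ends a run; a gap in the reference counts as divergent.
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    segments = []
    run_start = None
    for i, (ref_res, tgt_res) in enumerate(zip(aligned_ref, aligned_target)):
        divergent = tgt_res != "-" and ref_res != tgt_res
        if divergent and run_start is None:
            run_start = i  # a new divergent run begins here
        elif not divergent and run_start is not None:
            if i - run_start >= min_len:
                segments.append((run_start, i, aligned_target[run_start:i]))
            run_start = None
    # Close a run that reaches the end of the alignment.
    if run_start is not None and len(aligned_target) - run_start >= min_len:
        segments.append((run_start, len(aligned_target), aligned_target[run_start:]))
    return segments


# Hypothetical toy alignment: one long divergent run and one short one.
reference = "AAAAKKKKKKAAAA"
target = "AAAAMNERTSAAAW"
print(divergent_segments(reference, target))
```

Candidate peptides found this way would still have to be checked against the surface of the model before being considered further.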

Evaluation of the conformational stability of the peptides by energy minimization.

H2O “backbone”

TMEGQYMKNERTSISFS YYTVQSGSEVELKNLIGSE

Let's take another example…

Sensitivity of the HIV-1 aspartyl protease to the most common inhibitors

“Cartoon” representation of the HIV-1 protease enzyme

HIV PROTEASE MONOMER

HIV-1 protease enzyme showing the secondary structure elements, flaps, and active site

HIV-1 protease enzyme indicating the consensus residues for inhibitor-enzyme binding

INDINAVIR

RITONAVIR

Binding of indinavir to HIV-1 protease

Mutant HIV-1 protease modeled in complex with ritonavir

COMPARISON BETWEEN A RITONAVIR-SENSITIVE AND A RITONAVIR-RESISTANT ENZYME

One more example…

Phylogenetic ordering and GC content in trypanosomatids

Reported %GC variation for each codon position in Trypanosomatids (Alonso et al., 1992)

[Figure: %GC at the 1st, 2nd and 3rd codon positions plotted against %GC of total DNA for Crithidia, Leishmania, T. cruzi and T. brucei.]
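The quantity in this figure, %GC at each codon position, can be computed from any in-frame coding sequence. The following is a minimal sketch, assuming the input starts at the first base of a codon. The function name and the toy sequence are hypothetical and have nothing to do with the data of Alonso et al.

```python
def gc_by_codon_position(cds: str):
    """Return (%GC at 1st, 2nd, 3rd codon position) for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    gc_counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            gc_counts[i % 3] += 1  # i % 3 is the position inside the codon
    return tuple(100.0 * count / n_codons for count in gc_counts)


# Toy sequence: codons ATG GCC GCG TAA.
print(gc_by_codon_position("ATGGCCGCGTAA"))
```

Third positions are the ones most free to vary without changing the encoded amino acid, which is why they are usually the most informative series in plots of this kind.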

Codon usage in Trypanosomatids: leucine

[Figure: bar chart of the usage of the leucine codons TTA, CTA, TTG, CTT, CTC and CTG in T. brucei, T. cruzi, Leishmania and Crithidia; value axis from 0 to 70.]

Codon usage in Trypanosomatids: serine

[Figure: bar chart of the usage of the serine codons AGT, TCA, TCT, TCC, AGC and TCG in T. brucei, T. cruzi, Leishmania and Crithidia; value axis from 0 to 40.]
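Charts like the leucine and serine ones can be produced by counting, within one coding sequence or a pooled set, how often each synonymous codon is used for a given amino acid. The sketch below assumes the standard genetic code and an in-frame input, and it reports usage as a percentage of all codons for that amino acid. The slides do not state the units of the charts, so the percentage scale, the function name and the toy sequence are assumptions.

```python
from collections import Counter

# Synonymous codon families in the standard genetic code.
LEUCINE = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")
SERINE = ("TCT", "TCC", "TCA", "TCG", "AGT", "AGC")


def synonymous_codon_usage(cds: str, codons):
    """Percent usage of each codon in `codons` among all of them found in `cds`."""
    cds = cds.upper()
    # Step through the sequence codon by codon, ignoring a trailing partial codon.
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[codon] for codon in codons)
    if total == 0:
        return {codon: 0.0 for codon in codons}
    return {codon: 100.0 * counts[codon] / total for codon in codons}


# Toy sequence: ATG CTG CTG TTA TCC AGC TAA.
toy_cds = "ATGCTGCTGTTATCCAGCTAA"
print(synonymous_codon_usage(toy_cds, LEUCINE))
print(synonymous_codon_usage(toy_cds, SERINE))
```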

Phylogeny of Trypanosomatid lineage (Maslov & Simpson)

“Hole” formation by DNA replication

GC content variation in time

Restriction: AA family conservation and AA conservation

%GC variation in Trypanosomatid lineage (nuclear coding DNA)

[Figure: GC variation in the Trypanosomatidae lineage, nuclear DNA. %GC (axis from 0.00 to 1.00) for T. brucei, T. cruzi, Leishmania and Crithidia; series P1, P2, P3, P3* and P.]

I. Human Genome Project

The genome sequence is almost complete!

– approximately 3.5 billion base pairs.

All the Genes

• Any human gene can now be found in the genome by similarity searching with over 90% certainty.

• However, the sequence still has many gaps

– one is unlikely to find a complete and uninterrupted genomic segment for any gene

– still can’t identify pseudogenes with certainty

• This will improve as more sequence data accumulates

Raw Genome Data:

The next step is obviously to locate all of the genes and describe their functions. This will probably take another 15-20 years!

– So why are there 60,000 human genes on Affymetrix GeneChips?

– Why does GenBank have 49,000 gene coding sequences and UniGene have 89,000 clusters of unique ESTs?

• Clearly we are in desperate need of a theoretical framework to go with all of this data

…A few years ago… Celera maintained that there would be only 30,000 genes

Implications for Biomedicine

• Physicians will use genetic information to diagnose and treat disease.

– Virtually all medical conditions (other than trauma) have a genetic component.

• Faster drug development research

– Individualized drugs

– Gene therapy

• All Biologists will use gene sequence information in their daily work

II. Bioinformatics Challenges

Lots of new sequences being added

- automated sequencers

- Human Genome Project

- EST sequencing

GenBank has over 10 billion bases and is doubling every year!!

(problem of exponential growth...)

How can computers keep up?
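The scale of the problem is easy to see with a little arithmetic. The sketch below takes the slide's figures at face value (10 billion bases, doubling every year) and simply extrapolates. It is an illustration of exponential growth and not a forecast; the function name and the five-year horizon are arbitrary choices.

```python
def projected_bases(current_bases: float, years: float,
                    doubling_time_years: float = 1.0) -> float:
    """Size after `years` if the database keeps doubling every `doubling_time_years`."""
    return current_bases * 2 ** (years / doubling_time_years)


# Starting point and doubling time are the values quoted on the slide.
for year in range(6):
    print(f"year +{year}: {projected_bases(10e9, year):.3g} bases")
```

A search whose cost grows in proportion to database size becomes 32 times more expensive after five doublings, which is the motivation for the heuristic search tools discussed next.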

The huge dataset

New Types of Biological Data

• Microarrays - gene expression

• Multi-level maps: genetic, physical, sequence, annotation

• Networks of Protein-protein interactions

• Cross-species relationships

– Homologous genes

– Chromosome organization

Similarity Searching the Databanks

What is similar to my sequence?

Searching gets harder as the databases get bigger - and quality degrades

Tools: BLAST and FASTA = time saving heuristics (approximate)

Statistics + informed judgement of the biologist
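The time saving in tools of this kind comes largely from not aligning the query against everything: short words are indexed first and only sequences that share words with the query are examined further. The sketch below shows that word-lookup idea only. It is a deliberately simplified illustration and not the actual BLAST or FASTA algorithm, which also score near-matching words, extend hits into alignments and attach statistics. All names and the toy database are hypothetical.

```python
from collections import defaultdict


def build_word_index(database: dict, k: int = 3):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query: str, index, k: int = 3):
    """Count exact word matches per database sequence, best candidates first."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for name, _ in index.get(query[pos:pos + k], ()):
            hits[name] += 1
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-sequence "database".
toy_database = {"seq1": "CGSCWAFSATGALEG", "seq2": "MKVLAAGIVGLLLAQ"}
toy_index = build_word_index(toy_database)
print(seed_hits("GSCWAF", toy_index))
```

Because only exact words are looked up, a distant homolog with no identical word of length k would be missed, which is the sense in which such heuristics are approximate.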

Alignment

Alignment is the basis for finding similarity

Pairwise alignment = dynamic programming

Multiple alignment: protein families and functional domains

Multiple alignment is "impossible" for lots of sequences

Another heuristic - progressive pairwise alignment
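"Pairwise alignment = dynamic programming" can be made concrete with a global alignment score computed in the Needleman-Wunsch style. This is a minimal sketch that returns only the optimal score, not the alignment itself, and it uses a simple linear gap penalty. The scoring values are arbitrary assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))
```

The table has one cell per pair of positions, so time grows with the product of the two lengths. Extending the same table to many sequences at once grows exponentially with their number, which is why multiple alignment falls back on progressive pairwise heuristics.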

Sample Multiple Alignment

Structure-Function Relationships

Can we predict the function of protein molecules from their sequence?

sequence > structure > function

Conserved functional domains = motifs

Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)
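One classic example of predicting a simple structural feature from sequence alone is the hydropathy plot for membrane-spanning segments. The slides do not name a method, so the sketch below swaps in the Kyte-Doolittle hydropathy scale with a sliding window. The window of 19 residues and the 1.6 cutoff are commonly quoted values and should be read as assumptions, as should the function names.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19):
    """Mean hydropathy of every window; raises KeyError on non-standard residues."""
    scores = [KYTE_DOOLITTLE[res] for res in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_membrane_windows(protein: str, window: int = 19,
                               threshold: float = 1.6):
    """Start positions (0-based) of windows whose mean hydropathy reaches the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


# Artificial sequence: a hydrophobic stretch flanked by charged residues.
toy_protein = "DDDDD" + "L" * 19 + "KKKKK"
print(candidate_membrane_windows(toy_protein))
```

A window above the cutoff is only a hint that a stretch is hydrophobic enough to sit in a membrane. It says nothing about the rest of the fold.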

Protein domains

DNA Sequencing

Automated sequencers > 40 KB per day

500 bp reads must be assembled into complete genes

- errors, especially insertions and deletions

- error rate is highest at the ends where we want to overlap the reads

- vector sequences must be removed from ends

Faster sequencing relies on better software

overlapping deletions vs. shotgun approaches: TIGR
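The core operation in assembling reads is finding where the end of one read overlaps the start of another. The sketch below does this with exact matching only, which is a strong simplification: as noted above, real reads contain errors, especially near their ends, so practical assemblers tolerate mismatches and indels. The function names and the `min_overlap` default are hypothetical.

```python
def suffix_prefix_overlap(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest exact overlap between the end of left and the start of right."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Join two reads at their overlap, or return None if they do not overlap enough."""
    size = suffix_prefix_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


print(merge_reads("ATGCGTAC", "GTACCTGA"))
```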

Finding Genes in Genome Sequence is Not Easy

• About 2% of human DNA encodes functional genes.

• Genes are interspersed among long stretches of non-coding DNA.

• Repeats, pseudo-genes, and introns confound matters

Pattern Finding Tools

• It is possible to use DNA sequence patterns to predict genes:

• promoters

• translational start and stop codons (ORFs)

• intron splice sites

• codon bias

• Can also use similarity to known genes/ESTs
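The simplest of these patterns, start and stop codons, can be turned into an open reading frame (ORF) scan. The sketch below looks at the three forward reading frames only and assumes the standard genetic code (ATG start; TAA, TAG, TGA stops). The function name and the `min_codons` cutoff are hypothetical. In intron-containing genomes such as the human one, an ORF scan alone is far from a gene prediction.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the A in ATG, end is exclusive and includes the
    stop codon, and min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


toy_dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(toy_dna):
    print(frame, toy_dna[start:end])
```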

Phylogenetics

Evolution = mutation of DNA (and protein) sequences

Can we define evolutionary relationships between organisms by comparing DNA sequences?

- is there one molecular clock?

- phenetic vs. cladistic approaches

- lots of methods and software, what is the "correct" analysis?

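Most distance-based comparisons of DNA sequences start from the fraction of aligned sites that differ, often corrected for multiple substitutions at the same site. The sketch below computes that raw fraction (the p-distance) and the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). Jukes-Cantor is only one of many substitution models and is used here as an assumption for illustration; the function names and toy sequences are hypothetical.

```python
import math


def p_distance(aligned_a: str, aligned_b: str) -> float:
    """Fraction of gap-free aligned sites at which the two sequences differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(aligned_a: str, aligned_b: str) -> float:
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    p = p_distance(aligned_a, aligned_b)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq_a = "ACGTACGT"
seq_b = "ACGTACGA"
print(p_distance(seq_a, seq_b), jukes_cantor_distance(seq_a, seq_b))
```

The corrected distance is always at least as large as the raw one. Whether such distances translate into divergence times depends on the molecular clock question raised above.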
II. The Role of the Biologist in the Information Age

The Internet provides abundant biological information

It can be overwhelming…

- e-mail

- Web

Need for new skills = locating the necessary information efficiently

Computing in the lab - everyday tasks (vs. computational biology)

- ordering supplies

- reference books

- lab notes

- literature searching

Training "computer" scientists

Know the right tool for the job

Get the job done with tools available

Network connection is the lifeline of the scientist

Jobs change, computers change, projects change, scientists need to be adaptable

The job of the biologist is changing

• As more biological information becomes available …

– The biologist will spend more time using computers

– The biologist will spend more time on data analysis (and less doing lab biochemistry)

– Biology will become a more quantitative science (think how the periodic table and atomic theory affected chemistry)

III. Molecular Biology Software Tools

GCG (Wisconsin Package)

The most popular and most comprehensive set of tools for the molecular biologist.

- Runs on mainframe computers: (UNIX)

- Web, X-Windows (SeqLab) interfaces

- Inexpensive for large numbers of users

- Requires local databases (on the mainframe computer)

- Allows for custom databases and programming

The Web

Many of the best tools are free over the Web

- BLAST

- ENTREZ/PUBMED

- Protein motifs databases

Bioinformatics “service providers”

- DoubleTwist™, Celera, BioNavigator™

Hodgepodge collection of other tools

- PCR primer design

- Pairwise and Multiple Alignment

Personal Computer Programs

Macintosh and Windows applications

- Commercial: Vector NTI™, MacVector™, OMIGA™, Sequencher™

- Freeware: Phylip, Fasta, Clustal, etc.

Better graphics, easier to use

Can't access very large databases or perform demanding calculations

Integration with web databases and computing services

Putting it all together

The current state of the art requires the biologist to jump around from Web to mainframe to personal computer

The trend is for integration

– Web + personal computer will replace text interface to mainframe?

– Will the Web become the ultimate interface for all computing??

IV. Genomics

Genomics Technologies

• Automated DNA sequencing

• Automated annotation of sequences

• DNA microarrays

– gene expression (measure RNA levels)

– single nucleotide polymorphisms (SNPs)

• Protein chips (SELDI, etc.)

• Protein-protein interactions

cDNA spotted microarrays

Affymetrix Gene Chips

Impact on Bioinformatics

• Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.

• It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.

Pharmacogenomics

• The use of DNA sequence information to measure and predict the reaction of individuals to drugs.

• Personalized drugs

• Faster clinical trials

– Selected trial populations

• Fewer drug side effects

– toxicogenomics