GabrielaCavalcantiCunha EconomiaSolidariaPoliticasPublicas USP 2002
Genômica comparativa - IQ USP · Genômica comparativa João Carlos Setubal IQ-USP . outubro 2012...
Transcript of Genômica comparativa - IQ USP · Genômica comparativa João Carlos Setubal IQ-USP . outubro 2012...
Genômica comparativa
João Carlos Setubal IQ-USP
outubro 2012
1 11/5/2012 J. C. Setubal
Comparative genomics
• There are currently (out/2012) 2,230 completed sequenced microbial genomes publicly available
• Many are of closely related species • Why compare? • How to do it?
5 November 2012 2 JC Setubal
Why comparative genomics?
• To understand the genomic basis of the present – Differences in lifestyle
• pathogen vs. nonpathogen • Obligate vs. free-living
– Host specificity • animals vs. plants, plant X vs. plant Y, etc
– In the case of pathogens: this understanding should help us in fighting disease
• To understand the past – How organisms evolved to be what they are
5 November 2012 3 JC Setubal
Citrus canker Xanthomonas
axonopodis pathovar citri
5 November 2012 4 JC Setubal
Black rot: Xanthomonas campestris pathovar campestris
5 November 2012 5 JC Setubal
What is comparative genomics • Assuming input is the sequence and its annotation • There are many ways that genomes can be compared
– Different resolutions • Whole genome
– Genome alignments – Synteny (gene order conservation) – Anomalous regions
• Gene-centric – Gene families and unique genes – Gene clustering by function
• Gene sequence variations – Codon usage, SNPs, inDels, pseudogenes
5 November 2012 6 JC Setubal
Resolution
• Low resolution – Scope: entire genomes – Example event: rearrangement
• High resolution – Scope: nucleotide sequences – Example event: single mutation
5 November 2012 JC Setubal 7
Genome-wide evolutionary events
• Replicon rearrangements • Gene/region duplication • Gene/region loss • Chromosome plasmid DNA exchange • Lateral transfer
5 November 2012 8 JC Setubal
Whole replicon alignments: the pairwise case
If the sequences were identical we would see
B
A 5 November 2012 9 JC Setubal
an inversion
A B C D
A
C B
D
5 November 2012 10 JC Setubal
A B C D
A
C
D
B
Such inversions seem to happen around the origin or terminus of replication 5 November 2012 11 JC Setubal
5 November 2012 13 JC Setubal
5 November 2012 14 JC Setubal
15
Xanthomonas axonopodis pv citri
E. coli K12 Promer alignment
Both are γ proteobacteria! Red: direct; green: reverse
5 November 2012 16 JC Setubal
Eisen JA, Heidelberg JF, White O, Salzberg SL. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol. 2000;1(6):RESEARCH0011
5 November 2012 18 JC Setubal
Replicon sequence comparisons
• Basic tool: MUMmer – Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast
algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002 Jun 1;30(11):2478-83.
– Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12
• http://mummer.sourceforge.net
5 November 2012 19 JC Setubal
Basics of MUMmer
• It finds Maximal Unique Matches • These are exact matches above a user-specified threshold
that are unique • Exact matches found are clustered and extended (using
dynamic programming) – Result is approximate matches
• Data structure for exact match finding: suffix tree – Difficult to build but very fast
• Nucmer and promer – Both very fast – O(n + #MUMs), n = genome lengths
5 November 2012 20 JC Setubal
Whole replicon multiple alignment
• The program MAUVE • Darling AC, Mau B, Blattner FR, Perna NT.
Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004 Jul;14(7):1394-403.
5 November 2012 21 JC Setubal
22
Main chromosome alignment MAUVE
5 November 2012 JC Setubal
23
Chromosome 2 alignment MAUVE
5 November 2012 JC Setubal
24
RSA 493
RSA 331
Dugway
Chromosome alignment MAUVE
5 November 2012 JC Setubal
25
Genome Alignments MAUVE
5 November 2012 JC Setubal
How MAUVE works
• Seed-and-extend hashing • Seeds/anchors: Maximal Multiple Unique
Matches of minimum length k • Result: Local collinear blocks (LCBs) • O(G2n + Gn log Gn), G = # genomes, n =
average genome length
5 November 2012 26 JC Setubal
Alignment algorithm
1. Find Multi-MUMs 2. Use the multi-MUMs to calculate a phylogenetic
guide tree 3. Find LCBs (subset of multi-MUMs; filter out
spurious matches; requires minimum weight) 4. Recursive anchoring to identify additional anchors
(extension of LCBs) 5. Progressive alignment (CLUSTALW) using guide tree
5 November 2012 JC Setubal 27
Gene-centric comparisons
• Homologs: genes that have the same ancestor; in general retain the same function
• Orthologs: homologs from different species (arise from speciation)
• Paralogs: homologs from the same species (arise from duplication) – Duplication before speciation (ancient duplication)
• Out-paralogs; may not have the same function
– Duplication after speciation (recent duplication) • In-paralogs; likely to have the same function
5 November 2012 28 JC Setubal
Gene Set Computations
• Given a set of genomes, represented by their ‘proteomes’ or sets of protein sequences
• Given homlogous relationships (as given for example by orthoMCL) – Which genes are shared by genomes X and Y? – Which genes are unique to genome Z? – Venn or extended Venn diagrams
5 November 2012 29 JC Setubal
3-way genome comparison
5 November 2012 JC Setubal 30
A B
C
Copyright ©2004 by the National Academy of Sciences
Boussau, Bastien et al. (2004) Proc. Natl. Acad. Sci. USA 101, 9722-9727
Fig. 4. Net gene loss or gain throughout the evolution of the {alpha}-proteobacterial species
5 November 2012 34 JC Setubal
Proteome alignment done with LCS (top: Xcc; bottom: Xac )
Blue: BBHs that are in the LCS; dark blue: BBHs not in the LCS; red: Xac specifics; yellow: Xcc specifics
5 November 2012 35 JC Setubal
5 November 2012 36 JC Setubal
5 November 2012 37 JC Setubal
5 November 2012 38 JC Setubal
What do the tables show
• conserved blocks (aka “microsyntenic regions”), and how these blocks appear in different replicons across the genomes compared
• some of these blocks are not operons (would need to show strand)
• possible block losses
5 November 2012 39 JC Setubal
Polymorphism detection
• inDels, SNPs • pseudogenes
5 November 2012 42 JC Setubal
I
II
Figure 4.
Gluconate isomerase – A Brucella gene in the process a decay
44