
Sequence Classification using Reference Taxonomies

Gabriel Valiente

Algorithms, Bioinformatics, Complexity and Formal Methods Research Group


Technical University of Catalonia

Computational Biology and Bioinformatics Research Group


Research Institute of Health Science, University of the Balearic Islands

Centre for Genomic Regulation


Barcelona Biomedical Research Park

Third French Bioinformatics Symposium


Paris, 19–20 November 2013
Abstract

Next-generation sequencing technologies have opened up an
unprecedented opportunity for microbiology by enabling the
culture-independent genetic study of complex microbial
communities, which were so far largely unknown.

The analysis of metagenomic data is challenging, since a sample
may contain a mixture of many different microbial species whose
genomes have not necessarily been sequenced beforehand.

In this talk, we address the problem of analyzing metagenomic data
for which databases of reference sequences are already known.

We discuss both composition-based and alignment-based methods for
the classification of sequence reads, and present recent results on the
assignment of ambiguous sequence reads to microbial species at
the best possible taxonomic rank.
• A. Magi, M. Benelli, A. Gozzini, F. Girolami, F. Torricelli, and M. L.
Brandi. Bioinformatics for next generation sequencing data.
Genes, 1(2):294–307, 2010

There is a clear trend toward higher throughput, at the cost of
slower turnaround and lower raw read accuracy
• C. Linnæus. Systema Naturae. L. Salvius, Stockholm, 1735
Kingdom Archaea Bacteria Eukaryota Viruses

Phylum Streptophyta

Class Streptophytina

Order Solanales

Family Solanaceae

Genus Solanum

Species Solanum lycopersicum

Solanum caule inermi herbaceo, foliis pinnatis incisis, racemis
simplicibus
Kingdom Archaea Bacteria Eukaryota Viruses

Phylum Chordata

Class Mammalia

Order Primates

Family Hominidae

Genus Gorilla Homo Pan Pongo

Species Homo sapiens


• E. Haeckel. Generelle Morphologie der Organismen. Georg
Reimer, Berlin, 1866
• H. F. Copeland. The kingdoms of organisms. Q. Rev. Biol.,
13(4):383–420, 1938
• R. H. Whittaker. New concepts of kingdoms of organisms.
Science, 163(3863):150–160, 1969
• C. R. Woese. Bacterial evolution. Microbiol. Rev., 51(2):221–271,
1987
Author (year)       Kingdoms or domains
Linnæus (1735)      Plantae, Animalia
Haeckel (1866)      Protista, Plantae, Animalia
Copeland (1938)     Monera, Protoctista, Plantae, Animalia
Whittaker (1969)    Monera, Protista, Plantae, Fungi, Animalia
Woese (1987)        Bacteria, Archaea, Eukarya
• Y. Van de Peer, P. De Rijk, J. Wuyts, T. Winkelmans, and R. De
Wachter. The European small subunit ribosomal RNA database.
Nucleic Acids Res., 28(1):175–176, 2000
• K. E. Ashelford, N. A. Chuzhanova, J. C. Fry, A. J. Jones, and
A. J. Weightman. At least 1 in 20 16S rRNA sequence records
currently held in public repositories is estimated to contain
substantial anomalies. Appl. Environ. Microbiol.,
71(12):7724–7736, 2005
• P. D. Schloss. A high-throughput DNA sequence aligner for
microbial ecology studies. PLoS ONE, 4(12):e8230, 2009
[Figure: workflow. Sequence reads are mapped against a genomic
reference to produce matching statistics. These feed non-taxonomic
assignment and classification or, together with a taxonomic
reference, taxonomic assignment and classification.]
Classification of whole genomes
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit

Classification of 16S ribosomal RNA sequences
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit
Classification of whole genomes
• BLAST alignment of sequence reads
• Top hit: E-value up to 0.001
• Best stratum: sequences with the same E-value as the top hit

Classification of 16S ribosomal RNA sequences
• BWT alignment of sequence reads
• Top hit: sequences with at most k mismatches
• Best stratum: sequences with the same number of mismatches as the
top hit
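The best-stratum rule for mismatch-based hits can be sketched in a few lines of Python. This is only an illustration: the function name `best_stratum` and the input format (a dictionary from reference identifier to mismatch count) are assumptions, not part of any aligner's output format.

```python
from typing import Dict, List


def best_stratum(mismatches: Dict[str, int], k: int) -> List[str]:
    """Return the reference identifiers in the best stratum.

    `mismatches` maps a reference identifier to the number of mismatches
    of its best alignment with the read (hypothetical input format).
    Hits with more than k mismatches are discarded; the best stratum is
    every remaining hit tied with the top hit.
    """
    hits = {ref: m for ref, m in mismatches.items() if m <= k}
    if not hits:
        return []
    top = min(hits.values())  # mismatch count of the top hit
    return sorted(ref for ref, m in hits.items() if m == top)


# Two references tie at one mismatch, so both form the best stratum.
stratum = best_stratum({"ref_a": 1, "ref_b": 1, "ref_c": 2, "ref_d": 5}, k=2)
```

Reads whose best stratum holds more than one reference are the ambiguous reads that the taxonomic assignment step has to resolve.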
Taxonomic assignment problem
Input
• A genomic reference S (set of sequences)
• A taxonomic reference T (tree) with a leaf set L, where each leaf in
L has an associated known sequence of S
• A set R of (short or long) sequence reads
• A positive integer k
Output
• For each read Ri ∈ R, a single node in T that represents in a
“good” way the subset Mi ⊆ L of hits or matches whose sequences
contain a substring with at most k mismatches to Ri
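To make the set Mi concrete, here is a brute-force Python sketch that collects the leaves whose sequence contains a substring within k mismatches of a read. It counts substitutions only (Hamming distance) and scans every position, so it is an illustration under those assumptions and not a substitute for an indexed aligner. The names `min_mismatches` and `matching_leaves` and the toy sequences are hypothetical.

```python
from typing import Dict, Set


def min_mismatches(read: str, seq: str) -> int:
    """Fewest mismatches between `read` and any equal-length substring of `seq`."""
    n = len(read)
    if n > len(seq):
        return n  # assumption: a read longer than the sequence never matches
    return min(
        sum(a != b for a, b in zip(read, seq[i:i + n]))
        for i in range(len(seq) - n + 1)
    )


def matching_leaves(read: str, leaf_seqs: Dict[str, str], k: int) -> Set[str]:
    """Leaves whose sequence has a substring with at most k mismatches to `read`."""
    return {leaf for leaf, seq in leaf_seqs.items()
            if min_mismatches(read, seq) <= k}


# Toy genomic reference: one known sequence per leaf of the taxonomy.
leaf_seqs = {"x": "ACGTACGT", "y": "TTTTTTTT", "z": "ACGAACGT"}
m_exact = matching_leaves("GTAC", leaf_seqs, k=0)
m_one = matching_leaves("GTAC", leaf_seqs, k=1)
```

Raising k enlarges Mi, which in turn tends to push the lowest common ancestor of the hits toward the root.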
Accuracy and coverage of taxonomic assignment
Precision: Proportion of correctly labeled positive elements with
respect to the total number of elements labeled positive (correctly
or incorrectly labeled positive)

\[ P = \frac{TP}{TP + FP} \]

Recall: Proportion of correctly labeled positive elements with respect
to the total number of positive elements (correctly labeled positive
or incorrectly labeled negative)

\[ R = \frac{TP}{TP + FN} \]

F-measure: Harmonic mean of precision and recall

\[ F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} \]
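The three measures translate directly into code. A minimal Python sketch, assuming at least one true positive so that no denominator is zero; the counts in the example call are arbitrary.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN)."""
    return tp / (tp + fn)


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, 2PR / (P + R)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Arbitrary counts for illustration: 3 true positives, 1 false positive,
# 2 false negatives.
print(precision(3, 1), recall(3, 2), f_measure(3, 1, 2))
```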
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let Ri be the ith read
• Let Mi be the leaves of T matching Ri with up to k mismatches
• Let Ti be the subtree of T rooted at the lowest common ancestor
of Mi
• Let Ni be the leaves of Ti not matching Ri with up to k
mismatches

For the ith read, the leaves of Ti can be partitioned in the following four
subsets:
• TPi = Mi (true positives)
• FPi = Ni (false positives)
• TNi = ∅ (true negatives)
• FNi = ∅ (false negatives)
[Figure: the leaves of Ti, split into Ni (FPi) and Mi (TPi)]
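The partition at the lowest common ancestor can be sketched as follows. The taxonomy is stored as a child-to-parent dictionary, which is an assumption made for brevity, and the helper names and the toy tree are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set

Parent = Dict[str, Optional[str]]


def path_to_root(node: str, parent: Parent) -> List[str]:
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(nodes: Iterable[str], parent: Parent) -> Optional[str]:
    """Lowest common ancestor of a non-empty collection of nodes."""
    paths = [path_to_root(n, parent)[::-1] for n in nodes]  # root first
    common = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common


def leaves_below(root: str, parent: Parent) -> Set[str]:
    """Leaves of the subtree rooted at `root`."""
    internal = {p for p in parent.values() if p is not None}
    return {n for n in parent
            if n not in internal and root in path_to_root(n, parent)}


def partition_at_lca(matches: Set[str], parent: Parent) -> Dict[str, Set[str]]:
    """TP, FP, TN, FN for assigning the read to the LCA of its matches."""
    t_i = lca(matches, parent)
    n_i = leaves_below(t_i, parent) - matches
    return {"TP": set(matches), "FP": n_i, "TN": set(), "FN": set()}


# Toy taxonomy: root -> A -> {a1, a2, a3}, root -> B -> {b1}.
parent = {"root": None, "A": "root", "B": "root",
          "a1": "A", "a2": "A", "a3": "A", "b1": "B"}
parts = partition_at_lca({"a1", "a2"}, parent)
```

At the LCA every leaf of Ti is labeled positive, which is why the true-negative and false-negative sets are empty.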
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let Tij be the subtree of T rooted at the jth node of Ti
• Let Mij be the leaves of Tij matching Ri with up to k mismatches
• Let Nij be the leaves of Tij not matching Ri with up to k
mismatches

For the ith read and the jth node of Ti, the leaves of Ti can be
partitioned in the following four subsets:
• TPij = Mij (true positives)
• FPij = Nij (false positives)
• TNij = Ni \ Nij (true negatives)
• FNij = Mi \ Mij (false negatives)
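[Figure: the leaves of Ti, split into Ni and Mi, with the subtree Tij
covering Nij (FPij) and Mij (TPij). The remaining leaves of Ni are
TNij and the remaining leaves of Mi are FNij.]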


Precision: The precision of classifying Ri as Tj is

\[ P_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FP_{ij}|} \]

Recall: The recall of classifying Ri as Tj is

\[ R_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FN_{ij}|} \]

F-measure: The combined F-measure of precision and recall is

\[ F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} \]
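The per-node partition and its three measures can be computed with plain set operations. In this Python sketch the leaf sets are passed in directly instead of being derived from a tree, and the function name `node_metrics` and the toy leaves are hypothetical. It assumes the node covers at least one matching leaf.

```python
from typing import Dict, Set


def node_metrics(leaves_ti: Set[str], leaves_tij: Set[str],
                 m_i: Set[str]) -> Dict[str, object]:
    """Partition the leaves of Ti for node j and score the assignment."""
    n_i = leaves_ti - m_i      # Ni: non-matching leaves of Ti
    tp = leaves_tij & m_i      # Mij
    fp = leaves_tij - m_i      # Nij
    tn = n_i - fp              # Ni \ Nij
    fn = m_i - tp              # Mi \ Mij
    p = len(tp) / (len(tp) + len(fp))
    r = len(tp) / (len(tp) + len(fn))
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "P": p, "R": r, "F": 2 * p * r / (p + r)}


# Toy example: Ti has five leaves, three of them match the read,
# and node j covers the leaves a1, a2 and a3.
result = node_metrics(
    leaves_ti={"a1", "a2", "a3", "b1", "b2"},
    leaves_tij={"a1", "a2", "a3"},
    m_i={"a1", "a2", "b1"},
)
```

The four sets always add up to the leaves of Ti, so moving the candidate node down the tree trades false positives for false negatives.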
Bacteria
  Aquificae
    Aquificae
      Aquificales
        Aquificaceae
          Aquifex
            Aquifex pyrophilus
          Hydrogenobaculum
            Hydrogenobaculum acidophilum
          Hydrogenobacter
            Hydrogenobacter subterraneus
            Hydrogenobacter thermophilus
            Hydrogenobacter hydrogenophilus
          Persephonella
            Persephonella hydrogeniphila
            Persephonella marina
            Persephonella guaymasensis
          Sulfurihydrogenibium
            Sulfurihydrogenibium subterraneum
            Sulfurihydrogenibium azorense
            Sulfurihydrogenibium yellowstonense
          Thermocrinis
            Thermocrinis albus
            Thermocrinis ruber
          Hydrogenivirga
            Hydrogenivirga caldilitoris

Node covering 6 matching and 8 non-matching leaves:
P = 6/(6 + 8) = 43%, R = 6/(6 + 0) = 100%, F = 60%

Node covering 3 matching and 0 non-matching leaves:
P = 3/(3 + 0) = 100%, R = 3/(3 + 3) = 50%, F = 67%
F-measure: The combined F-measure of precision and recall is

\[ F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} = \frac{2|TP_{ij}|}{|FN_{ij}| + |FP_{ij}| + 2|TP_{ij}|} \]

Penalty score: The penalty score of assigning Ri to Tj is

\[ PS_{ij} = q \, \frac{|FN_{ij}|}{|TP_{ij}|} + (1 - q) \, \frac{|FP_{ij}|}{|TP_{ij}|} \]

The parameter q takes values in the range [0, 1] and determines
how close to the LCA or to the leaves the assignment shall be
• q = 0: Each read Ri is assigned to a matching leaf
• q = 0.5: Each read Ri is assigned to the node that maximizes the
F-measure
• q = 1: Each read Ri is assigned to the LCA of the matching leaves Mi
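The effect of q can be seen on a small hand-made example. The Python sketch below scores three hypothetical candidate nodes, each described by its (|TP|, |FP|, |FN|) counts, and picks the one with the lowest penalty. The node names and counts are illustrative assumptions, and nodes with no true positives are assumed to have been filtered out beforehand.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (|TP|, |FP|, |FN|)


def penalty_score(tp: int, fp: int, fn: int, q: float) -> float:
    """q * |FN|/|TP| + (1 - q) * |FP|/|TP|, assuming |TP| > 0."""
    return q * fn / tp + (1 - q) * fp / tp


def best_node(counts: Dict[str, Counts], q: float) -> str:
    """Candidate node with the lowest penalty score (first one on ties)."""
    return min(counts, key=lambda node: penalty_score(*counts[node], q))


# Hypothetical candidates for one read with six matching leaves.
counts = {
    "lca": (6, 8, 0),    # covers every match plus 8 non-matching leaves
    "genus": (3, 0, 3),  # covers half of the matches and nothing else
    "leaf": (1, 0, 5),   # a single matching leaf
}
for q in (0.5, 1.0):
    print(q, best_node(counts, q))
```

With q = 0 both "genus" and "leaf" reach a penalty of zero, so a tie-breaking rule is needed to end at a leaf; the sketch leaves that choice open.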
Generalization of the F-measure: For q = 0.5, the node m that
minimizes the penalty score is

\[
\begin{aligned}
& \arg\min_m \left( \frac{|FN_{im}|}{|TP_{im}|} + \frac{|FP_{im}|}{|TP_{im}|} \right) \\
&= \arg\min_m \frac{|FN_{im}| + |FP_{im}|}{|TP_{im}|} \\
&= \arg\min_m \frac{|FN_{im}| + |FP_{im}|}{2|TP_{im}|} \\
&= \arg\min_m \left( \frac{|FN_{im}| + |FP_{im}|}{2|TP_{im}|} + 1 \right) \\
&= \arg\min_m \frac{|FN_{im}| + |FP_{im}| + 2|TP_{im}|}{2|TP_{im}|} \\
&= \arg\max_m \frac{2|TP_{im}|}{|FN_{im}| + |FP_{im}| + 2|TP_{im}|}
\end{aligned}
\]

The node that minimizes the penalty score is the same node that
would maximize the F-measure
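The same argument can be stated as a single identity. Keeping the factor 1/2 explicit in the penalty score at q = 0.5, a short derivation (a restatement of the steps above, not an additional result) gives:

\[
PS_{im} = \frac{|FN_{im}| + |FP_{im}|}{2|TP_{im}|}
\quad\Longrightarrow\quad
F_{im} = \frac{2|TP_{im}|}{|FN_{im}| + |FP_{im}| + 2|TP_{im}|} = \frac{1}{1 + PS_{im}}
\]

Since 1/(1 + x) is strictly decreasing for x ≥ 0, ordering the nodes by increasing penalty score is the same as ordering them by decreasing F-measure.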
Theorem
Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of
Mi, the penalty scores PSij for every node j in Ti can be obtained in
O(|Ti|) total time

Theorem
Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of
Mi, the penalty scores PSij for every node j in Ti can be obtained in
O(|Mi|) total time after O(|T|) time preprocessing

Definition
Any node j in Ti is called relevant if it is a leaf in Mi or the LCA of two
or more leaves in Mi

Lemma
For each node j in Ti there exists a relevant node j′ such that
PSij′ ≤ PSij
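The linear-time bound of the first theorem can be illustrated with one bottom-up pass that counts, for every node, the leaves and the matching leaves below it. The Python sketch below is one possible way to do this under an assumed parent-to-children dictionary; it is not the authors' implementation and does not cover the O(|Mi|) bound of the second theorem. Nodes without any matching leaf are skipped because their penalty score is undefined.

```python
from typing import Dict, List, Set


def penalty_scores(children: Dict[str, List[str]], root: str,
                   matches: Set[str], q: float) -> Dict[str, float]:
    """Penalty score of every node below `root` that covers a match."""
    order: List[str] = []          # pre-order: parents before descendants
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    size: Dict[str, int] = {}      # number of leaves below each node
    tp: Dict[str, int] = {}        # number of matching leaves below each node
    for node in reversed(order):   # descendants before parents
        kids = children.get(node, [])
        if not kids:
            size[node] = 1
            tp[node] = 1 if node in matches else 0
        else:
            size[node] = sum(size[c] for c in kids)
            tp[node] = sum(tp[c] for c in kids)

    total = tp[root]               # |Mi| when root is the LCA of the matches
    scores: Dict[str, float] = {}
    for node in order:
        if tp[node] > 0:
            fp = size[node] - tp[node]
            fn = total - tp[node]
            scores[node] = q * fn / tp[node] + (1 - q) * fp / tp[node]
    return scores


# Toy subtree Ti: A -> g1 -> {a, b, c}, A -> g2 -> {d, e}.
children = {"A": ["g1", "g2"], "g1": ["a", "b", "c"], "g2": ["d", "e"]}
scores = penalty_scores(children, "A", matches={"a", "b", "d"}, q=0.5)
best = min(scores, key=scores.get)
```

Each node is pushed, popped and summed over a constant number of times, so the work is proportional to the number of nodes of Ti.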
• TANGO http://www.lsi.upc.edu/~valiente/tango/
• TANGO http://tango.lsi.upc.edu/tango.php
[Figure: Input Samples 1 to 3, Output Samples 1 to 3, and the Output
Taxonomy Table]
[Figure: workflow. Inputs: reference taxonomies, initial mappings,
reference sequences, sequence reads. Steps and their outputs:
contraction (contracted taxonomies), mapping (sequence matches),
equalizing (taxonomy correspondences), relabeling (relabeled matches),
assignment (assigned reads).]
Reference taxonomies are contracted to the seven taxonomic ranks
usually used to classify organisms
• Kingdom
• Phylum
• Class
• Order
• Family
• Genus
• Species

Leaves of the original taxonomy at other taxonomic ranks are retained
in the contracted taxonomy, to be able to use them when assigning
sequence reads

Incomplete classifications are resolved by extracting the lineages of
the organisms and inferring the taxonomy from the lineages
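Contraction to the seven ranks, with leaves at other ranks retained, can be sketched as re-attaching every kept node to its nearest kept ancestor. The Python below assumes a child-to-parent dictionary and a rank dictionary; the function name `contract`, the rank labels, and the toy lineage are hypothetical, and the resolution of incomplete classifications from lineages is not covered.

```python
from typing import Dict, Optional

RANKS = {"kingdom", "phylum", "class", "order", "family", "genus", "species"}


def contract(parent: Dict[str, Optional[str]],
             rank: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Child-to-parent map of the taxonomy contracted to the seven ranks.

    A node is kept if its rank is one of the seven ranks or if it is a leaf.
    Kept nodes without a kept ancestor become roots (parent None).
    """
    internal = {p for p in parent.values() if p is not None}
    keep = {n for n in parent if rank[n] in RANKS or n not in internal}
    contracted: Dict[str, Optional[str]] = {}
    for node in keep:
        p = parent[node]
        while p is not None and p not in keep:
            p = parent[p]          # skip ancestors at other ranks
        contracted[node] = p
    return contracted


# Toy lineage with an intermediate rank and a leaf below species level.
parent = {"K": None, "P": "K", "subP": "P", "C": "subP", "strain": "C"}
rank = {"K": "kingdom", "P": "phylum", "subP": "subphylum",
        "C": "class", "strain": "no rank"}
contracted = contract(parent, rank)
```

In the toy lineage the subphylum node disappears, while the unranked leaf is kept and hangs from its nearest ancestor at one of the seven ranks.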
Contracted reference taxonomies are equalized in order to be able to
assign sequence reads using heterogeneous taxonomies

Correspondences between the leaves are obtained from the
annotation of the reference taxonomies

The parent of each node in the source taxonomy is taken to
correspond to the LCA of the nodes of the target taxonomy that
correspond to the children of the parent node
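The rule for internal nodes can be applied bottom-up once the leaf correspondences are known. The Python sketch below assumes the source taxonomy is given as a parent-to-children dictionary and the target taxonomy as a child-to-parent dictionary; all names and the two toy taxonomies are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

Parent = Dict[str, Optional[str]]


def path_to_root(node: str, parent: Parent) -> List[str]:
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(nodes: Iterable[str], parent: Parent) -> Optional[str]:
    """Lowest common ancestor of a non-empty collection of nodes."""
    paths = [path_to_root(n, parent)[::-1] for n in nodes]  # root first
    common = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common


def equalize(src_children: Dict[str, List[str]], src_root: str,
             tgt_parent: Parent, leaf_map: Dict[str, str]) -> Dict[str, str]:
    """Extend leaf correspondences to every node of the source taxonomy."""
    corr = dict(leaf_map)

    def visit(node: str) -> str:
        kids = src_children.get(node, [])
        if kids:  # internal node: LCA of the children's correspondents
            corr[node] = lca([visit(c) for c in kids], tgt_parent)
        return corr[node]

    visit(src_root)
    return corr


# Source: s_root -> s_g1 -> {x, y}, s_root -> s_g2 -> {z}.
src_children = {"s_root": ["s_g1", "s_g2"], "s_g1": ["x", "y"], "s_g2": ["z"]}
# Target: t_root -> t_f -> t_ga -> {x2, y2}, t_root -> t_gb -> {z2}.
tgt_parent = {"t_root": None, "t_f": "t_root", "t_ga": "t_f",
              "x2": "t_ga", "y2": "t_ga", "t_gb": "t_root", "z2": "t_gb"}
corr = equalize(src_children, "s_root", tgt_parent,
                leaf_map={"x": "x2", "y": "y2", "z": "z2"})
```

A source node with a single child inherits that child's correspondent, because the LCA of one node is the node itself.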
Sequence read matches in a given taxonomy are relabeled to another
taxonomy by using the correspondences between reference
taxonomies

Sequence matches are always relabeled (even when source and
target taxonomies are the same) because sequence reads may have
been matched to genomic sequences of organisms at some rank other
than the seven taxonomic ranks in the contracted taxonomies
• J. C. Clemente, J. Jansson, and G. Valiente. Flexible taxonomic
assignment of ambiguous sequencing reads. BMC
Bioinformatics, 12:8, 2011
• P. Ribeca and G. Valiente. Computational challenges of sequence
classification in microbiomic data. Brief. Bioinform.,
12(6):614–625, 2011
• D. Alonso, J. C. Clemente, J. Jansson, and G. Valiente.
Taxonomic assignment in metagenomics with TANGO.
EMBnet.journal, 17(2):46–50, 2011
• M. Santamaria, B. Fosso, A. Consiglio, G. De Caro, G. Grillo,
F. Licciulli, S. Liuni, M. Marzano, D. Alonso, G. Valiente, and
G. Pesole. Reference databases for taxonomic assignment in
metagenomics. Brief. Bioinform., 13(6):682–695, 2012
• D. Alonso, A. Barré, S. Beretta, P. Bonizzoni, M. Nikolski, and
G. Valiente. Further steps in TANGO: Improved taxonomic
assignment in metagenomics. Bioinformatics, 2013
