Sequence Classification Using Reference Taxonomies: Gabriel Valiente

Sequence Classification using Reference Taxonomies
Gabriel Valiente
Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia
Computational Biology and Bioinformatics Research Group, Research Institute of Health Science, University of the Balearic Islands
Centre for Genomic Regulation, Barcelona Biomedical Research Park
Third French Bioinformatics Symposium, Paris, 19–20 November 2013
Abstract
Next-generation sequencing technologies have opened up an unprecedented opportunity for microbiology by enabling the culture-independent genetic study of complex microbial communities, which were so far largely unknown.
The analysis of metagenomic data is challenging, since a sample may contain a mixture of many different microbial species whose genomes have not necessarily been sequenced beforehand.
In this talk, we address the problem of analyzing metagenomic data for which databases of reference sequences are already available.
We discuss both composition-based and alignment-based methods for the classification of sequence reads, and present recent results on the assignment of ambiguous sequence reads to microbial species at the best possible taxonomic rank.
• A. Magi, M. Benelli, A. Gozzini, F. Girolami, F. Torricelli, and M. L.
Brandi. Bioinformatics for next generation sequencing data.
Genes, 1(2):294–307, 2010
There is a clear trend toward higher throughput at the cost of slower turnaround and lower raw read accuracy
• C. Linnæus. Systema Naturae. L. Salvius, Stockholm, 1735
Kingdom Archaea Bacteria Eukaryota Viruses
Phylum Streptophyta
Class Streptophytina
Order Solanales
Family Solanaceae
Genus Solanum
Species Solanum lycopersicum
Solanum caule inermi herbaceo, foliis pinnatis incisis, racemis simplicibus
Kingdom Archaea Bacteria Eukaryota Viruses
Phylum Chordata
Class Mammalia
Order Primates
Family Hominidae
Genus Gorilla Homo Pan Pongo
Species Homo sapiens

• E. Haeckel. Generelle Morphologie der Organismen. Georg
Reimer, Berlin, 1866
• H. F. Copeland. The kingdoms of organisms. Q. Rev. Biol.,
13(4):383–420, 1938
• R. H. Whittaker. New concepts of kingdoms of organisms.
Science, 163(3863):150–160, 1969
• C. R. Woese. Bacterial evolution. Microbiol. Rev., 51(2):221–271,
1987
Author     Year   Kingdoms or domains
Linnæus    1735   Plantae, Animalia
Haeckel    1866   Protista, Plantae, Animalia
Copeland   1938   Monera, Protoctista, Plantae, Animalia
Whittaker  1969   Monera, Protista, Plantae, Fungi, Animalia
Woese      1987   Bacteria, Archaea, Eukarya
• Y. Van de Peer, P. D. Rijk, J. Wuyts, T. Winkelmans, and R. D.
Wachter. The European small subunit ribosomal RNA database.
Nucleic Acids Res., 28(1):175–176, 2000
• K. E. Ashelford, N. A. Chuzhanova, J. C. Fry, A. J. Jones, and
A. J. Weightman. At least 1 in 20 16S rRNA sequence records
currently held in public repositories is estimated to contain
substantial anomalies. Appl. Environ. Microbiol.,
71(12):7724–7736, 2005
• P. D. Schloss. A high-throughput DNA sequence aligner for
microbial ecology studies. PLoS ONE, 4(12):e8230, 2009
[Figure: classification workflow. Labels: sequence reads, genomic reference, mapping, matching statistics, taxonomic reference, non-taxonomic assignment, taxonomic assignment, non-taxonomic classification, taxonomic classification.]
Classification of whole genomes
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit
Classification of 16S ribosomal RNA sequences
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit
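The composition-based steps above can be sketched as follows. This is only an illustration: the slides do not say which distance is used, so the Jaccard distance between k-mer sets is an assumption, and the function names and reference names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; taken to be 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def best_stratum(read, references, k):
    """Return (top-hit distance, names of all references at that distance).

    references maps a hypothetical reference name to its sequence.
    The Jaccard distance is an assumed stand-in for the real measure.
    """
    read_kmers = kmers(read, k)
    distances = {
        name: jaccard_distance(read_kmers, kmers(seq, k))
        for name, seq in references.items()
    }
    top = min(distances.values())
    return top, sorted(n for n, d in distances.items() if d == top)


references = {
    "ref_a": "ACGTACGTAC",
    "ref_b": "ACGTACGTAC",  # identical to ref_a, so the two tie
    "ref_c": "TTTTGGGGCC",
}
distance, stratum = best_stratum("ACGTACG", references, k=3)
```

Here `ref_a` and `ref_b` share every 3-mer with the read, so both fall in the best stratum, which is exactly the kind of ambiguity that taxonomic assignment has to resolve.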
Classification of whole genomes
• BLAST alignment of sequence reads
• Top hit: E-value up to 0.001
• Best stratum: sequences with the same E-value as the top hit
Classification of 16S ribosomal RNA sequences
• BWT alignment of sequence reads
• Top hit: sequences with at most k mismatches
• Best stratum: sequences with the same number of mismatches as the top hit
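For the mismatch-based case, selecting the best stratum from a list of alignment hits might look like the sketch below. The `(reference_name, mismatches)` hit format is an assumption made for illustration and does not correspond to the output of any particular aligner.

```python
def best_stratum_by_mismatches(hits, k):
    """Keep hits with at most k mismatches, then only those tied with the best.

    hits is a list of (reference_name, mismatches) pairs; this format is
    hypothetical and chosen only to keep the example self-contained.
    """
    admissible = [(name, mm) for name, mm in hits if mm <= k]
    if not admissible:
        return []
    best = min(mm for _, mm in admissible)
    return sorted(name for name, mm in admissible if mm == best)


hits = [("ref_a", 2), ("ref_b", 1), ("ref_c", 1), ("ref_d", 4)]
stratum = best_stratum_by_mismatches(hits, k=3)
```

With these hits, `ref_d` is discarded by the threshold and `ref_a` is discarded because it is worse than the top hit, leaving `ref_b` and `ref_c`.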
Taxonomic assignment problem
Input:
• A genomic reference $S$ (a set of sequences)
• A taxonomic reference $T$ (a tree) with leaf set $L$, where each leaf in $L$ has an associated known sequence of $S$
• A set $R$ of sequence reads (short or long)
• A positive integer $k$
Output: For each read $R_i \in R$, a single node in $T$ that represents in a “good” way the subset $M_i \subseteq L$ of hits or matches whose sequences contain a substring with at most $k$ mismatches to $R_i$
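To make the definition of the hit set concrete, the sketch below computes $M_i$ by brute force, sliding the read along each leaf sequence and counting mismatches. A real pipeline would use an indexed aligner instead; the function names and sequences are hypothetical.

```python
def min_mismatches(read, sequence):
    """Smallest Hamming distance between read and any equal-length substring.

    Returns None when the read is empty or longer than the sequence.
    """
    n, m = len(sequence), len(read)
    if m == 0 or m > n:
        return None
    return min(
        sum(a != b for a, b in zip(read, sequence[start:start + m]))
        for start in range(n - m + 1)
    )


def matching_leaves(read, leaf_sequences, k):
    """Leaves whose sequence has a substring within k mismatches of the read."""
    hits = set()
    for leaf, seq in leaf_sequences.items():
        d = min_mismatches(read, seq)
        if d is not None and d <= k:
            hits.add(leaf)
    return hits


# Hypothetical leaves of the taxonomic reference and their sequences.
leaf_sequences = {
    "leaf_1": "GGACGTTACC",  # contains the read exactly
    "leaf_2": "GGACCATACC",  # best window differs in two positions
    "leaf_3": "TTTTTTTTTT",
}
m_k1 = matching_leaves("ACGTTA", leaf_sequences, k=1)
m_k2 = matching_leaves("ACGTTA", leaf_sequences, k=2)
```

Raising $k$ from 1 to 2 adds `leaf_2` to the hit set, which shows how the threshold controls the ambiguity of a read.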
Accuracy and coverage of taxonomic assignment
Precision: Proportion of correctly labeled positive elements with respect to the total number of elements labeled positive (correctly or incorrectly labeled positive)
\[ P = \frac{TP}{TP + FP} \]
Recall: Proportion of correctly labeled positive elements with respect to the total number of positive elements (correctly labeled positive or incorrectly labeled negative)
\[ R = \frac{TP}{TP + FN} \]
F-measure: Harmonic mean of precision and recall
\[ F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} \]
Given a reference taxonomy $T$, a set $R$ of sequence reads, and a threshold value $k$ of sequence similarity,
• Let $R_i$ be the $i$th read
• Let $M_i$ be the leaves of $T$ matching $R_i$ with up to $k$ mismatches
• Let $T_i$ be the subtree of $T$ rooted at the lowest common ancestor of $M_i$
• Let $N_i$ be the leaves of $T_i$ not matching $R_i$ with up to $k$ mismatches
For the $i$th read, the leaves of $T_i$ can be partitioned into the following four subsets:
• $TP_i = M_i$ (true positives)
• $FP_i = N_i$ (false positives)
• $TN_i = \emptyset$ (true negatives)
• $FN_i = \emptyset$ (false negatives)
[Figure: subtree $T_i$ with its leaves split into $N_i$ ($FP_i$) and $M_i$ ($TP_i$)]
Given a reference taxonomy $T$, a set $R$ of sequence reads, and a threshold value $k$ of sequence similarity,
• Let $T_{ij}$ be the subtree of $T$ rooted at the $j$th node of $T_i$
• Let $M_{ij}$ be the leaves of $T_{ij}$ matching $R_i$ with up to $k$ mismatches
• Let $N_{ij}$ be the leaves of $T_{ij}$ not matching $R_i$ with up to $k$ mismatches
For the $i$th read and the $j$th node of $T_i$, the leaves of $T_i$ can be partitioned into the following four subsets:
• $TP_{ij} = M_{ij}$ (true positives)
• $FP_{ij} = N_{ij}$ (false positives)
• $TN_{ij} = N_i \setminus N_{ij}$ (true negatives)
• $FN_{ij} = M_i \setminus M_{ij}$ (false negatives)
[Figure: subtree $T_{ij}$ inside $T_i$, with the leaves of $T_i$ split into $N_i \setminus N_{ij}$ ($TN_{ij}$), $N_{ij}$ ($FP_{ij}$), $M_{ij}$ ($TP_{ij}$), and $M_i \setminus M_{ij}$ ($FN_{ij}$)]
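The four-way partition for a candidate node can be written directly with set operations. The sketch below assumes the tree is given as a hypothetical `children` dictionary and that `root` is already the LCA of the matches; the node names are made up.

```python
def leaves_under(children, node):
    """Return the set of leaves in the subtree rooted at node."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for child in kids:
        result |= leaves_under(children, child)
    return result


def partition(children, root, node, matches):
    """Split the leaves of T_i (rooted at root) with respect to candidate node j."""
    all_leaves = leaves_under(children, root)
    sub_leaves = leaves_under(children, node)
    m_i = set(matches)
    n_i = all_leaves - m_i          # non-matching leaves of T_i
    m_ij = sub_leaves & m_i         # matching leaves below node j
    n_ij = sub_leaves - m_i         # non-matching leaves below node j
    return {"TP": m_ij, "FP": n_ij, "TN": n_i - n_ij, "FN": m_i - m_ij}


# Hypothetical subtree T_i: its root is the LCA of the matches a, c, d.
children = {
    "root": ["g1", "g2"],
    "g1": ["a", "b"],
    "g2": ["c", "d", "e"],
}
matches = {"a", "c", "d"}
parts = partition(children, "root", "g2", matches)
```

For the candidate node `g2`, two of the three matches are recovered, one non-matching leaf is included, and one match is left outside the subtree.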

Precision: The precision of classifying $R_i$ as $T_{ij}$ is
\[ P_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FP_{ij}|} \]
Recall: The recall of classifying $R_i$ as $T_{ij}$ is
\[ R_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FN_{ij}|} \]
F-measure: The combined F-measure of precision and recall is
\[ F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} \]
Lineage shown: Bacteria, Aquificae, Aquificae, Aquificales, Aquificaceae
Genera and species (leaves):
• Aquifex: Aquifex pyrophilus
• Hydrogenobaculum: Hydrogenobaculum acidophilum
• Hydrogenobacter: Hydrogenobacter subterraneus, Hydrogenobacter thermophilus, Hydrogenobacter hydrogenophilus
• Persephonella: Persephonella hydrogeniphila, Persephonella marina, Persephonella guaymasensis
• Sulfurihydrogenibium: Sulfurihydrogenibium subterraneum, Sulfurihydrogenibium azorense, Sulfurihydrogenibium yellowstonense
• Thermocrinis: Thermocrinis albus, Thermocrinis ruber
• Hydrogenivirga: Hydrogenivirga caldilitoris
Scores annotated on two candidate nodes:
• P = 6/(6 + 8) = 43%, R = 6/(6 + 0) = 100%, F = 60%
• P = 3/(3 + 0) = 100%, R = 3/(3 + 3) = 50%, F = 67%
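The two sets of scores in this example can be reproduced from the counts alone. The helper below is a hypothetical convenience function that applies the precision, recall, and F-measure formulas to the counts read off the slide: 6 matching and 8 non-matching leaves for the first node, and 3 matching leaves with 3 matches left outside for the second.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from true/false positive and negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f


# Counts taken from the example: (TP, FP, FN) for each candidate node.
first = precision_recall_f(tp=6, fp=8, fn=0)
second = precision_recall_f(tp=3, fp=0, fn=3)

for label, (p, r, f) in (("first", first), ("second", second)):
    print(f"{label}: P = {p:.0%}, R = {r:.0%}, F = {f:.0%}")
```

The second node trades recall for precision and ends up with the higher F-measure, which is why it would be preferred when precision and recall are weighted equally.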
F-measure: The combined F-measure of precision and recall is
\[ F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} = \frac{2|TP_{ij}|}{|FN_{ij}| + |FP_{ij}| + 2|TP_{ij}|} \]
Penalty score: The penalty score of assigning $R_i$ to $T_{ij}$ is
\[ PS_{ij} = q \, \frac{|FN_{ij}|}{|TP_{ij}|} + (1 - q) \, \frac{|FP_{ij}|}{|TP_{ij}|} \]
The parameter $q$ takes values in the range $[0, 1]$ and determines how close to the LCA or to the leaves the assignment shall be:
• $q = 0$: Each read $R_i$ is assigned to a matching leaf
• $q = 0.5$: Each read $R_i$ is assigned to the node that maximizes the F-measure
• $q = 1$: Each read $R_i$ is assigned to the LCA of the matching leaves $M_i$
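The effect of $q$ can be seen on a small hypothetical example. The `counts` table below gives made-up (TP, FP, FN) values for the nodes of a five-leaf subtree in which leaves `a`, `c`, and `d` match; nodes with no true positives are given an infinite penalty, which is an assumption made here to avoid dividing by zero.

```python
import math


def penalty_score(tp, fp, fn, q):
    """PS = q * FN/TP + (1 - q) * FP/TP; infinite when TP is 0 (assumed convention)."""
    if tp == 0:
        return math.inf
    return q * fn / tp + (1 - q) * fp / tp


def best_nodes(counts, q):
    """Names of all nodes that attain the minimum penalty score for a given q."""
    scores = {node: penalty_score(*c, q) for node, c in counts.items()}
    best = min(scores.values())
    return sorted(node for node, s in scores.items() if s == best)


# Hypothetical (TP, FP, FN) counts per node; root is the LCA of the matches.
counts = {
    "root": (3, 2, 0),
    "g1": (1, 1, 2),
    "g2": (2, 1, 1),
    "a": (1, 0, 2),
    "b": (0, 1, 3),
    "c": (1, 0, 2),
    "d": (1, 0, 2),
    "e": (0, 1, 3),
}
at_leaves = best_nodes(counts, q=0)
balanced = best_nodes(counts, q=0.5)
at_lca = best_nodes(counts, q=1)
```

With $q = 0$ the three matching leaves tie, with $q = 1$ the LCA wins, and with $q = 0.5$ the winner is the node with the highest F-measure, which in this small example also happens to be the LCA.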
Generalization of the F-measure: For $q = 0.5$, the node $m$ that minimizes the penalty score is
\begin{align*}
& \arg\min_m \left( \frac{|FN_{im}|}{|TP_{im}|} + \frac{|FP_{im}|}{|TP_{im}|} \right) \\
&= \arg\min_m \frac{|FN_{im}| + |FP_{im}|}{|TP_{im}|} \\
&= \arg\min_m \frac{|FN_{im}| + |FP_{im}|}{2|TP_{im}|} \\
&= \arg\min_m \left( \frac{|FN_{im}| + |FP_{im}|}{2|TP_{im}|} + 1 \right) \\
&= \arg\min_m \frac{|FN_{im}| + |FP_{im}| + 2|TP_{im}|}{2|TP_{im}|} \\
&= \arg\max_m \frac{2|TP_{im}|}{|FN_{im}| + |FP_{im}| + 2|TP_{im}|}
\end{align*}
The node that minimizes the penalty score is the same node that would maximize the F-measure
Theorem
Given a set $M_i \subseteq L$ of hits and the subtree $T_i$ of $T$ rooted at the LCA of $M_i$, the penalty scores $PS_{ij}$ for every node $j$ in $T_i$ can be obtained in $O(|T_i|)$ total time
Theorem
Given a set $M_i \subseteq L$ of hits and the subtree $T_i$ of $T$ rooted at the LCA of $M_i$, the penalty scores $PS_{ij}$ for every node $j$ in $T_i$ can be obtained in $O(|M_i|)$ total time after $O(|T|)$ time preprocessing
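The first bound can be illustrated with a single post-order pass that counts leaves and matches below every node. This is a sketch under stated assumptions, not the algorithm from the cited papers: the `children` dictionary and node names are hypothetical, nodes without true positives are simply skipped, and the faster $O(|M_i|)$ variant with preprocessing is not attempted.

```python
def all_penalty_scores(children, root, matches, q):
    """Penalty score of every node with TP > 0, using one post-order pass.

    matches should be a set so that membership tests take constant time.
    """
    n_leaves, n_matches = {}, {}
    # Iterative post-order: each node is pushed at most twice, so work is linear.
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not expanded and kids:
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
        elif not kids:
            n_leaves[node] = 1
            n_matches[node] = 1 if node in matches else 0
        else:
            n_leaves[node] = sum(n_leaves[kid] for kid in kids)
            n_matches[node] = sum(n_matches[kid] for kid in kids)
    total = n_matches[root]
    scores = {}
    for node, tp in n_matches.items():
        if tp == 0:
            continue  # penalty undefined without true positives
        fp = n_leaves[node] - tp
        fn = total - tp
        scores[node] = q * fn / tp + (1 - q) * fp / tp
    return scores


children = {
    "root": ["g1", "g2"],
    "g1": ["a", "b"],
    "g2": ["c", "d", "e"],
}
scores = all_penalty_scores(children, "root", {"a", "c", "d"}, q=0.5)
best = min(scores, key=scores.get)
```

Because the counts of a node are sums over its children, no leaf set is ever materialized, which is what keeps the pass linear in the size of the subtree.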
Definition
Any node $j$ in $T_i$ is called relevant if it is a leaf in $M_i$ or the LCA of two or more leaves in $M_i$
Lemma
For each node $j$ in $T_i$ there exists a relevant node $j'$ such that $PS_{ij'} \le PS_{ij}$
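The definition of relevant nodes translates into a simple test: an internal node is the LCA of two or more matching leaves exactly when at least two of its child subtrees contain a match. The sketch below uses that observation; the `children` dictionary and node names are hypothetical, and the recursion is only suitable for shallow trees.

```python
def relevant_nodes(children, root, matches):
    """Matching leaves plus every node that is the LCA of two or more of them."""
    relevant = set()

    def count(node):
        """Return the number of matching leaves below node, marking relevant nodes."""
        kids = children.get(node, [])
        if not kids:
            if node in matches:
                relevant.add(node)
                return 1
            return 0
        per_child = [count(kid) for kid in kids]
        # Matches in two or more child subtrees meet for the first time here.
        if sum(1 for c in per_child if c > 0) >= 2:
            relevant.add(node)
        return sum(per_child)

    count(root)
    return relevant


children = {
    "root": ["g1", "g2"],
    "g1": ["a", "b"],
    "g2": ["c", "d", "e"],
}
relevant = relevant_nodes(children, "root", {"a", "c", "d"})
```

In this example `g1` is not relevant because only one match lies below it, and its penalty score can never beat that of the matching leaf `a` underneath, in line with the lemma.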
• TANGO http://www.lsi.upc.edu/~valiente/tango/
• TANGO http://tango.lsi.upc.edu/tango.php
[Figure: TANGO inputs and outputs. Labels: Input Sample 1, Input Sample 2, Input Sample 3, Output Sample 1, Output Sample 2, Output Sample 3, Output Taxonomy Table.]
[Figure: TANGO pipeline. Data: Reference Taxonomies, Initial Mappings, Reference Sequences, Sequence Reads, Contracted Taxonomies, Sequence Matches, Taxonomy Correspondences, Relabeled Matches, Assigned Reads. Steps: contraction, mapping, equalizing, relabeling, assignment.]
Reference taxonomies are contracted to the seven taxonomic ranks
usually used to classify organisms
• Kingdom
• Phylum
• Class
• Order
• Family
• Genus
• Species
Leaves of the original taxonomy at other taxonomic ranks are retained in the contracted taxonomy, to be able to use them when assigning sequence reads
Incomplete classifications are resolved by extracting the lineages of the organisms and inferring the taxonomy from the lineages
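Contraction to the seven ranks can be sketched as a reattachment of every retained node to its nearest retained ancestor. The representation (a child-to-parent map plus a rank map), the node names, and the decision to always keep the root are assumptions made for this illustration.

```python
SEVEN_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus", "species"}


def contract(parent, rank):
    """Return a parent map keeping only seven-rank nodes, the root, and all leaves."""
    internal = set(parent.values()) - {None}

    def keep(node):
        return (node not in internal        # leaves are retained at any rank
                or parent[node] is None     # the root is kept (assumption)
                or rank[node] in SEVEN_RANKS)

    contracted = {}
    for node in parent:
        if not keep(node):
            continue
        ancestor = parent[node]
        # Skip over dropped ancestors until a retained one is found.
        while ancestor is not None and not keep(ancestor):
            ancestor = parent[ancestor]
        contracted[node] = ancestor
    return contracted


# Hypothetical taxonomy: SubP is an intermediate rank, strain_x an unranked leaf.
parent = {"K": None, "P": "K", "SubP": "P", "G": "SubP", "S": "G", "strain_x": "S"}
rank = {"K": "kingdom", "P": "phylum", "SubP": "subphylum",
        "G": "genus", "S": "species", "strain_x": "no rank"}
contracted = contract(parent, rank)
```

The intermediate node `SubP` disappears and `G` is attached directly to `P`, while the unranked leaf `strain_x` survives so that reads matching it can still be assigned.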
Contracted reference taxonomies are equalized in order to be able to assign sequence reads using heterogeneous taxonomies
Correspondences between the leaves are obtained from the annotation of the reference taxonomies
The parent of each node in the source taxonomy is taken to correspond to the LCA of the nodes of the target taxonomy that correspond to the children of the parent node
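The bottom-up rule for internal nodes can be sketched as follows: starting from given leaf correspondences, each source node is mapped to the LCA, in the target taxonomy, of the target nodes assigned to its children. The data layout, the naive LCA routine, and all node names are assumptions for illustration only.

```python
def lca(parent, nodes):
    """Lowest common ancestor of nodes in a tree given as a child -> parent map."""
    def path_from_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path[::-1]  # root first

    common = None
    for level in zip(*(path_from_root(n) for n in nodes)):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common


def equalize(source_children, source_root, target_parent, leaf_map):
    """Extend leaf correspondences to all source nodes, bottom-up."""
    corr = dict(leaf_map)

    def visit(node):
        kids = source_children.get(node, [])
        if kids:
            corr[node] = lca(target_parent, [visit(kid) for kid in kids])
        return corr[node]

    visit(source_root)
    return corr


# Hypothetical source taxonomy (children map) and target taxonomy (parent map).
source_children = {"s_root": ["s_g1", "s_g2"], "s_g1": ["x", "y"], "s_g2": ["z", "w"]}
target_parent = {"t_root": None, "t_f": "t_root", "t_g": "t_f",
                 "tx": "t_g", "ty": "t_g", "tz": "t_f", "tw": "t_f"}
leaf_map = {"x": "tx", "y": "ty", "z": "tz", "w": "tw"}
corr = equalize(source_children, "s_root", target_parent, leaf_map)
```

Here the source genus `s_g1` maps to `t_g`, while `s_g2` and the source root both map to `t_f`, showing that two taxonomies with different shapes need not correspond node for node.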
Sequence read matches in a given taxonomy are relabeled to another taxonomy by using the correspondences between reference taxonomies
Sequence matches are always relabeled (even when source and target taxonomies are the same) because sequence reads may have been matched to genomic sequences of organisms at some rank other than the seven taxonomic ranks in the contracted taxonomies
• J. C. Clemente, J. Jansson, and G. Valiente. Flexible taxonomic
assignment of ambiguous sequencing reads. BMC
Bioinformatics, 12:8, 2011
• P. Ribeca and G. Valiente. Computational challenges of sequence
classification in microbiomic data. Brief. Bioinform.,
12(6):614–625, 2011
• D. Alonso, J. C. Clemente, J. Jansson, and G. Valiente.
Taxonomic assignment in metagenomics with TANGO.
EMBnet.journal, 17(2):46–50, 2011
• M. Santamaria, B. Fosso, A. Consiglio, G. De Caro, G. Grillo,
F. Licciulli, S. Liuni, M. Marzano, D. Alonso, G. Valiente, and
G. Pesole. Reference databases for taxonomic assignment in
metagenomics. Brief. Bioinform., 13(6):682–695, 2012
• D. Alonso, A. Barré, S. Beretta, P. Bonizzoni, M. Nikolski, and
G. Valiente. Further steps in TANGO: improved taxonomic
assignment in metagenomics. Bioinformatics, 2013
