Data Minng and Bioinformatics

Data Mining and Bioinformatics:
Some Challenges
Qiang Yang, Computer Thanks:

Science and
Engineering HKUST RPC Project
-Ben Niu,
HKUST
-Can Yang
-Prof. Hannah Xue
-Prof. W. Yu
http://www.cse.ust.hk/~qyang/ 1
State of Art: DM for Bio
 We know how to classify biological

sequences
 SVM, Neural Nets, Decision Trees, Rules
 We know how to cluster biological
entities
 Bi-clustering, K-means, hierarchical
 We know how to select features
 PCA, LDA, SVM-RFE
Data Mining: Challenges in Bio
1. Non-traditional Feature Selection
 When the number of attributes >> number of samples?
 Highly imbalanced
1. Explainable and Accurate Data Mining Methods
 NN, SVM Rules?
1. Transfer Learning
 Can knowledge learned from one set of samples help data
mining on another sample?
1. Exploiting the network structure
 Individual i.i.d type of classification vs social networks?
Challenge 1: Non-traditional Feature
Selection: Question: which (few) genes lead to diseases?
 # of attributes >> # samples
Dataset name Train Test Attributes Source
ALL-AML leukemia 38 34 7129 ‘Molecular Classification of Cancer: Class Discovery and Class Prediction by
Gene Exressoin Monitoring’, Science, Vol. 286, 1999.
Breast cancer 78 19 24481 ‘Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer’,
Nature, Vol. 415, 2002.
Central nervous 30 30 7192 ‘Prediction of Central Nervous System Embryonal Tumour Outcome Based
system on Gene Expression’, Nature, Vol. 415, 2002.
MLL_Leukemia 57 15 12582 ‘MLL Translocations Specify A Distinct Gene Expression Profile that
Distinguishes A Unique Leukemia’, Nature Genetics, Vol. 30, 2002.
Large B-cell 24 23 4026 ‘Distinct types of diffuse large B-cell lymphoma identified by gene
Lymphoma expression profiling’. Nature, Vol. 403, 2000.
 # of positive << # negative Train Test

 KDD CUP 2002 task 2
 Yeast Gene Regulation Prediction Positive set Negative set 1489
122 2896
 Traditional feature selection methods fail: overfitting, singularity of covariance matrix
Non-traditional Feature Selection (2)
 Some potential solutions
 ‘Characterization of a family of algorithms for generalized
discriminant analysis on undersampled problems’, Journal of
Machine Learning Research. Vol. 6, 2005.
 Singularity problem is solved by splitting the subspace into the
regular and the irregular parts.
 Irregular part (null space) of the within-class scatter matrix is fully
utilized to extract the discriminant info.
 ‘Two-Dimensional PCA: A New Approach to Appearance-Based
Face Representation and Recognition’, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 26, 2004.
 High dimensional data in 2D arrays are projected directly onto the
subspaces.
 Size of covariance matrix can be reduced significantly.
 Singularity is avoided.
Non-traditional Feature Selection (3)
 Other approaches:
 Manifold learning
 Manifold learning methods, e.g., Isomap, LLE, maintain the
local patterns of distribution during transform,
 Extract features suitable for k-NN classifiers
 Can be used to reduce the dimensionality of Bio. Data.
 Semi-supervised learning
 What if we have 10% labeled data, but the rest 90% are
unlabelled?
 Build clusters around the labeled samples.
 Samples in the same cluster are labeled as from the same
class, assuming they follow the normal distributions.
Challenge 2: Explainable and
Accurate Data Mining Methods
 Current methods, such as SVMs,
discriminant analysis, neural
networks, are ‘black box’ models.
 The learned knowledge is hard to
understand by biologists.
 Some potential solutions

 Logic based method, e.g.,
decision trees and variants may
be better in giving the ‘IF-THEN’
like rules that explicitly define the
epigenetic logics in cancer and
stem cell development.
 DNA methylation rules can be
learned by using SVM based
recursive feature elimination and
fuzzy logics.
 [Gene selection for cancer
classification using support vector
machines’, Machine Learning,
2002.]
Epigenetic Analysis: A Case Study
 Epigenetic events dominate
the growth of cancer and  Traditional methods, SVMs,
embryonic stem cells ANNs are
 These two type of cells are  ‘black box’ models
of great importance  Knowledge are trained
connection weights, or
 Genes can be turned on/ off Support Vectors.
through Cytosine methylation  Hard to understand for
or Histone modifications biologists
 The logics of DNA
methylation underlie the
cells’ behaviors
 Wish to Know: Methylation status
of CpG sites
 CpG islands/ promoter regions
in DNA sequence
 Cancer prediction
Adaptive Cascade Sharing Trees
(ACS4) Niu et al. 2007 (tmr’s talk)
 Objective: learn human
understandable rules that
define the epigenetic process in
cancer and embryonic stem cells
 Idea:
 Adaptively partition the
numeric attributes into a
set of the linguistic domains,
e.g., ‘high’, ‘very high’,
‘Medium’, ‘Low’, ‘Very Low’
 Method: clustering
 Train a committee of trees
to select the most salient
features and predict by
voting
 Method: tree learning
ACS4 method (2)
ACS4 method (3)
 Dataset:
 37 hESC, 33 non-hESC, 24 cancer cell lines, 9 normal cell lines.
 1,536 attributes
 Result
 Just 2 attributes are enough to separate the 3 cell types
 No need of 40 attributes by using fisher’s score in [1].
 Wet lab cost can be reduced by testing on 2 attributes only, instead of 40.
 Accuracy is better, except when compared with SVM, but SVM cannot tell us ‘why’.
 Rules can be easily understood to biologist to conceive new biological experiments
seeking in wet lab proof.
40 attributes: [1] ‘Human embryonic stem cells have a unique epigenetic signature‘, Genome Research, Vol. 16, 2006
Challenge 3: Transfer Learning
 In real life, data are hard to obtain
 Biological experiments are expensive
 However, biological data are related
 Can we leverage the knowledge learned in one
task/domain/data set for prediction of another?
 Humans often do this: having learned one language, find it
easier to learn another
 In Web mining, having learned to classify one web site, use
the abstract knowledge to help classify another web site
 Challenge: can we leverage the knowledge learned
from one data set to classify/cluster/predict another?
Transfer Learning (Examples)
 Problem:
how to Propagate the classification
knowledge?
 Difficulty: old and new data may have different distributions
t t Time
Night time period
0
Day time period
1
Transfer Learning to Classify Web
[Dai,
et al, 2007] 20 newsgroups (20,000
documents, 20 data sets)
New
Old
comp.graphics (comp) comp.sys.ibm (?)
comp.os.mis-windows.misc (comp) comp.windows.x (?)
sci.crypt (sci) sci.med (?)
sci.electronics (sci) sci.space (?)
SRAA (A. McCallum, 70,000 articles)

Old New
sim-auto (auto) real-auto (?)
sim-aviation (aviation) real-aviation (?)
Document-word co-occurrence
[Dai, et al. 2007]
Old Di
transfer
Knowledge
New Do
Transfer Learning: Related Works
 Semi-supervised Learning
 [Zhu, Survey, Blum and Mitchell “co-training”, Nigam et
al, “EM-based”, Zeng et al “clustering”, Joachims,
“transductive”]
 Distributions of training and test data are usually
assumed to be the same
 Multi-task Learning
 [Caruana, MLJ]
 multiple Dis exist
 Domain specific knowledge jointly learned to benefit each
other.
 Focused on how multiple tasks helping each other
 Semi-supervised Clustering
 Samedistribution assumption, but can be relaxed
when must-links are few
Transfer to Classify Web
Co-clustering is applied between words and
out-of-domain documents (new tasks)
Word clustering is constrained by the labels
of in-domain (Old) documents
The word clustering part in both domains
serve as a bridge
A Biological Transfer Problem
 ‘Promoter prediction analysis on the whole human
genome’, Nature Biotechnology, Vol. 22, 2004.
 Most of the promoter prediction programs are effective on
individual chromosomes, e.g., Chr21, Chr22,
 But inadequate to generalize to the whole genome scale
 only 65% of accuracy rate on average  too low
 Can we build a unifying model for transferring the
learned knowledge to other chromosomes
 to predict across the whole genome?
 to cluster other genes and protein arrays?
 to classify related sequences?
Challenge 4: Exploiting the network
structure
 We are short of labeled data
 The matrix structures are very sparse if we only
have several hundred samples and a huge number
of attributes
 Classification accuracy cannot be improved much
 Gene expression data: tens or low hundreds of samples,
but tens of thousands of attributes (?)
 Accuracy ~ less than 80%
 Challenge: can we leverage the network

structure?
Social Network Mining
Citation (Paper 2) Conference Name
 Very large scale computational
analysis of gene and social
networks.
 Social networks: a social structure
Title Author (Paper1) made of nodes (individuals or
organizations) tied by one or more
specific types of relations.
 Relations: financial exchanges, friends,

web links, disease transmission
(epidemiology), or gene interactions.
 Small world phenomenon: chain of

social acquaintances required to
connect on arbitrary person to another
arbitrary person anywhere in the world is
generally short, five to seven separation
steps are sufficient.
 Centrality Eigenvector: measure the

importance of node in a network.
• Collective Classification
• Collective Recommendation
Social Net Mining: Engineering
meets Science
 ‘Empirical Analysis of an Evolving Social Network’, Science, Vol. 311,
2006.
 A dynamic social network comprising 43,553 students, faculty, and staff at a
large University.
 Interactions between individuals are inferred from time-stamped e-mail
headers recorded over one academic year and are matched with affiliations
and attributes.
 Findings:
 when two students are in the same class, they are on average 3 times more likely
to interact if they also share an acquaintance
 Netflix Challenge and KDDCUP 2007
 Blog Evolution (NEC Work)
 Can we deduce the actors’ roles/functions/attributes from the topology

of the network?
Using Network Structure in Biology
 ‘Adaptive Response of a Gene  The dynamics of a gene network are
Network to Environmental Changes by described by differential equations, e.g., a
simplified network involving only two gene
Fitness-Induced Attractor Selection’, nodes is formulated as:
Plos One, 2006
 The gene network is formulated as

differential equations, given some
initial state the network stabilized at
some attractors, corresponding to the
different cell types.
 where m1 and m2 are the gene
 The complex dynamics of the gene expression levels.
networks can explain the high diversity
of the species.  S(A) and D(A) are the rate coefficients of
synthesis and degradation. They depend
on A, which represents cellular activity.
 Given some perturbations, how will the
state of the gene networks change to  g1 and g2 represent the noises in gene
adjust the levels of gene expression to expression.
environment factors?
‘Adaptive Response of a Gene Network to Environmental Changes
by Fitness-Induced Attractor Selection’, PlosOne, 2006.
Given the initial condition,

the gene expression levels
stabilize at the attractors
determined by the
coefficients of equations.
Real world gene networks
can be much more complex,
involving thousands of
genes, leading to the
complex patterns of
attractors and cell activities.
Conclusions
 I have listed four challenges
1. Non-traditional Feature Selection
2. Explainable and Accurate Data Mining Methods
3. Transfer Learning
4. Exploiting the network structure
 There are more…
 Solving these challenges requires biologists
and computer scientists to work hand in
hand

Data Minng and Bioinformatics

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Data Minng and Bioinformatics

Diunggah oleh

Hak Cipta:

Format Tersedia

Data Mining and Bioinformatics:

Qiang Yang, Computer Thanks:

 We know how to classify biological

Dataset name Train Test Attributes Source

 # of positive << # negative Train Test

 Traditional feature selection methods fail: overfitting, singularity of covariance matrix

 Some potential solutions

SRAA (A. McCallum, 70,000 articles)

 Challenge: can we leverage the network

 Relations: financial exchanges, friends,

 Small world phenomenon: chain of

 Centrality Eigenvector: measure the

 Can we deduce the actors’ roles/functions/attributes from the topology

 The gene network is formulated as

Given the initial condition,

Anda mungkin juga menyukai