Some Challenges
http://www.cse.ust.hk/~qyang/ 1
State of the Art: DM for Bio
Data Mining: Challenges in Bio
1. Non-traditional Feature Selection
What to do when the number of attributes >> the number of samples?
Highly imbalanced data
2. Explainable and Accurate Data Mining Methods
NN and SVM, or rules?
3. Transfer Learning
Can knowledge learned from one set of samples help data
mining on another set of samples?
4. Exploiting the network structure
Individual i.i.d.-type classification vs. social networks?
Challenge 1: Non-traditional Feature Selection
Question: which (few) genes lead to diseases?
# of attributes >> # of samples
ALL-AML leukemia: 38 and 34 samples, 7129 attributes. ‘Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring’, Science, Vol. 286, 1999.
Breast cancer: 78 and 19 samples, 24481 attributes. ‘Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer’, Nature, Vol. 415, 2002.
Central nervous system: 30 and 30 samples, 7192 attributes. ‘Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression’, Nature, Vol. 415, 2002.
MLL_Leukemia: 57 and 15 samples, 12582 attributes. ‘MLL Translocations Specify A Distinct Gene Expression Profile that Distinguishes A Unique Leukemia’, Nature Genetics, Vol. 30, 2002.
Large B-cell Lymphoma: 24 and 23 samples, 4026 attributes. ‘Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling’, Nature, Vol. 403, 2000.
122 2896
Non-traditional Feature Selection (2)
Some potential solutions
‘Characterization of a family of algorithms for generalized
discriminant analysis on undersampled problems’, Journal of
Machine Learning Research. Vol. 6, 2005.
The singularity problem is solved by splitting the subspace into
regular and irregular parts.
The irregular part (the null space of the within-class scatter matrix) is fully
utilized to extract discriminant information.
‘Two-Dimensional PCA: A New Approach to Appearance-Based
Face Representation and Recognition’, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 26, 2004.
High dimensional data in 2D arrays are projected directly onto the
subspaces.
Size of covariance matrix can be reduced significantly.
Singularity is avoided.
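The 2DPCA idea above can be sketched as follows. This is a minimal illustration using NumPy, not the cited paper's code: the function names `fit_2dpca` and `project` and the random toy data are assumptions made for the example. Each sample stays a 2D array of shape m x n, the image covariance matrix is only n x n, and each sample is projected directly onto the leading eigenvectors.

```python
import numpy as np

def fit_2dpca(images, n_axes):
    """Return an (n, n_axes) projection matrix for images of shape (M, m, n)."""
    images = np.asarray(images, dtype=float)
    centered = images - images.mean(axis=0)
    # Image covariance matrix G = (1/M) * sum_k A_k^T A_k over centered images.
    # It is n x n, instead of (m*n) x (m*n) for PCA on flattened vectors.
    G = np.einsum("kij,kil->jl", centered, centered) / len(images)
    eigvals, eigvecs = np.linalg.eigh(G)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]    # indices of the largest ones
    return eigvecs[:, order]

def project(images, axes):
    """Project each 2D sample directly: (M, m, n) @ (n, d) -> (M, m, d)."""
    return np.asarray(images, dtype=float) @ axes

# Hypothetical toy data: 10 samples, each an 8 x 6 array.
rng = np.random.default_rng(0)
images = rng.normal(size=(10, 8, 6))
axes = fit_2dpca(images, n_axes=2)
features = project(images, axes)
```

With only 10 samples, G is still 6 x 6 and is estimated from 10 x 8 rows, which is why the small-sample singularity of ordinary PCA-based methods tends not to arise here.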
Non-traditional Feature Selection (3)
Other approaches:
Manifold learning
Manifold learning methods, e.g., Isomap and LLE, preserve the
local structure of the data distribution under the transform
Extract features suitable for k-NN classifiers
Can be used to reduce the dimensionality of biological data
Semi-supervised learning
What if 10% of the data are labeled and the remaining 90% are
unlabeled?
Build clusters around the labeled samples.
Samples in the same cluster are labeled as belonging to the same
class, assuming each class follows a normal distribution.
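The cluster-then-label idea can be sketched as a seeded clustering. This is a hedged illustration only: the function name `seeded_cluster_labels` and the toy points are hypothetical, and assigning points to the nearest centre by Euclidean distance corresponds to the simplifying assumption that every class is a spherical normal distribution with the same variance.

```python
import math

def seeded_cluster_labels(labeled, unlabeled, n_iter=10):
    """labeled: list of (point, cls); unlabeled: list of points.
    Returns one predicted class per unlabeled point."""
    classes = sorted({cls for _, cls in labeled})

    def centroid(points):
        return [sum(coords) / len(points) for coords in zip(*points)]

    # Seed one cluster per class from the labeled samples.
    centers = {c: centroid([p for p, y in labeled if y == c]) for c in classes}
    assigned = []
    for _ in range(n_iter):
        # Assign every unlabeled point to the nearest cluster centre.
        assigned = [min(classes, key=lambda c: math.dist(p, centers[c]))
                    for p in unlabeled]
        # Recompute each centre from its labeled seeds plus attracted points.
        for c in classes:
            members = [p for p, y in labeled if y == c]
            members += [p for p, a in zip(unlabeled, assigned) if a == c]
            centers[c] = centroid(members)
    return assigned

# Hypothetical toy data: two labeled seeds and four unlabeled points.
labeled = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]
unlabeled = [(0.5, 0.2), (4.8, 5.1), (1.0, 0.8), (4.0, 4.5)]
predicted = seeded_cluster_labels(labeled, unlabeled)
```

The labeled samples always stay in their own class's cluster, so the clusters are built around them rather than discovered from scratch.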
Challenge 2: Explainable and
Accurate Data Mining Methods
Current methods, such as SVMs,
discriminant analysis, and neural
networks, are ‘black box’ models.
The learned knowledge is hard for
biologists to understand.
Epigenetic Analysis: A Case Study
Epigenetic events dominate the growth of cancer and embryonic stem cells
These two types of cells are of great importance
Genes can be turned on/off through cytosine methylation or histone modifications
The logic of DNA methylation underlies the cells’ behaviors
Traditional methods (SVMs, ANNs) are ‘black box’ models
The learned knowledge is a set of trained connection weights or support vectors
Hard for biologists to understand
Wish to know:
Methylation status of CpG sites
CpG islands/promoter regions in DNA sequence
Cancer prediction
Adaptive Cascade Sharing Trees
(ACS4), Niu et al. 2007 (tomorrow’s talk)
Objective: learn human
understandable rules that
define the epigenetic process in
cancer and embryonic stem cells
Idea:
Adaptively partition the
numeric attributes into a
set of linguistic domains,
e.g., ‘Very High’, ‘High’,
‘Medium’, ‘Low’, ‘Very Low’
Method: clustering
Train a committee of trees
to select the most salient
features and predict by
voting
Method: tree learning
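The two ingredients named above (clustering numeric attributes into linguistic terms, then voting by a committee of trees) can be illustrated with a much simplified sketch. This is not the ACS4 algorithm: it assumes plain 1-D k-means for the partitioning, one-level trees (one per attribute) as committee members, and training accuracy as the salience measure. All function names and the toy methylation values are hypothetical.

```python
from collections import Counter

TERMS = ("Low", "Medium", "High")

def kmeans_1d(values, k, n_iter=20):
    """Plain 1-D k-means; returns the cluster centres in ascending order."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        # An empty cluster keeps its previous centre.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

def to_term(value, centers, terms=TERMS):
    """Map a numeric value to the linguistic term of its nearest centre."""
    return terms[min(range(len(centers)), key=lambda i: abs(value - centers[i]))]

def train_stump(term_column, labels):
    """One-level tree on a discretised attribute: term -> majority class."""
    counts = {}
    for term, y in zip(term_column, labels):
        counts.setdefault(term, Counter())[y] += 1
    rule = {term: c.most_common(1)[0][0] for term, c in counts.items()}
    default = Counter(labels).most_common(1)[0][0]
    accuracy = sum(rule[t] == y for t, y in zip(term_column, labels)) / len(labels)
    return rule, default, accuracy

def train_committee(rows, labels, size=2):
    """Keep the `size` attributes whose one-level trees fit the training data best."""
    members = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        centers = kmeans_1d(column, len(TERMS))
        rule, default, acc = train_stump([to_term(v, centers) for v in column], labels)
        members.append((acc, j, centers, rule, default))
    members.sort(key=lambda m: m[0], reverse=True)
    return members[:size]

def predict(committee, row):
    """Each member votes with its rule; the majority class wins."""
    votes = Counter()
    for _, j, centers, rule, default in committee:
        votes[rule.get(to_term(row[j], centers), default)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical methylation levels for 3 attributes; attribute 1 is noise.
rows = [[0.10, 0.40, 0.20], [0.15, 0.90, 0.25], [0.20, 0.45, 0.15],
        [0.85, 0.42, 0.80], [0.90, 0.88, 0.75], [0.80, 0.47, 0.90]]
labels = ["normal"] * 3 + ["cancer"] * 3
committee = train_committee(rows, labels, size=2)
selected = [member[1] for member in committee]
```

Each member's rule reads like "if attribute 0 is Low then normal", which is the kind of human-readable output the objective asks for.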
ACS4 method (2)
ACS4 method (3)
Dataset:
37 hESC, 33 non-hESC, 24 cancer cell lines, 9 normal cell lines.
1,536 attributes
Result
Just 2 attributes are enough to separate the 3 cell types
The 40 attributes selected by Fisher’s score in [1] are not needed.
Wet lab cost can be reduced by testing 2 attributes instead of 40.
Accuracy is better, except when compared with SVM, but SVM cannot tell us ‘why’.
Rules are easily understood by biologists, who can use them to design new wet lab
experiments that seek confirmation.
40 attributes: [1] ‘Human embryonic stem cells have a unique epigenetic signature’, Genome Research, Vol. 16, 2006
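For comparison, ranking attributes by Fisher's score can be sketched as below. The sketch assumes the common definition (between-class scatter divided by within-class scatter, computed per attribute); whether [1] used exactly this variant is an assumption. The function names and toy values are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Between-class scatter over within-class scatter for one attribute."""
    overall = mean(values)
    between = within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (mean(group) - overall) ** 2
        within += len(group) * pvariance(group)
    return between / within if within > 0 else float("inf")

def rank_attributes(rows, labels):
    """Return attribute indices sorted from most to least discriminative."""
    n_attrs = len(rows[0])
    scores = [fisher_score([r[j] for r in rows], labels) for j in range(n_attrs)]
    return sorted(range(n_attrs), key=lambda j: scores[j], reverse=True)

# Hypothetical toy data: 4 samples, 3 attributes; attribute 1 separates the classes.
rows = [[0.5, 0.1, 0.9], [0.4, 0.2, 0.1], [0.6, 0.9, 0.8], [0.5, 0.8, 0.2]]
labels = ["normal", "normal", "cancer", "cancer"]
ranking = rank_attributes(rows, labels)
```

A filter of this kind scores each attribute independently, so it says nothing about how few attributes are jointly sufficient; that is the gap between a top-40 list and a 2-attribute rule set.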
Challenge 3: Transfer Learning
In real life, data are hard to obtain
Biological experiments are expensive
However, biological data are related
Can we leverage the knowledge learned in one
task/domain/data set for prediction of another?
Humans often do this: having learned one language, they find it
easier to learn another
In Web mining, having learned to classify one web site, we can use
the abstract knowledge to help classify another web site
Challenge: can we leverage the knowledge learned
from one data set to classify/cluster/predict another?
Transfer Learning (Examples)
Problem:
How to propagate the classification
knowledge?
Difficulty: old and new data may have different distributions
[Figure: time axis marked t0 and t1, showing a night time period and a day time period]
Transfer Learning to Classify Web
[Dai, et al., 2007] 20 newsgroups (20,000
documents, 20 data sets)
Old                              New
comp.graphics (comp)             comp.sys.ibm (?)
comp.os.ms-windows.misc (comp)   comp.windows.x (?)
sci.crypt (sci)                  sci.med (?)
sci.electronics (sci)            sci.space (?)
Document-word co-occurrence
[Dai, et al. 2007]
[Figure: knowledge is transferred from the old data Di to the new data Do]
Transfer Learning: Related Works
Semi-supervised Learning
[Zhu, survey; Blum and Mitchell, “co-training”; Nigam et
al., “EM-based”; Zeng et al., “clustering”; Joachims,
“transductive”]
Distributions of training and test data are usually
assumed to be the same
Multi-task Learning
[Caruana, MLJ]
Multiple Di exist
Domain-specific knowledge is jointly learned so that tasks benefit each
other.
Focuses on how multiple tasks help each other
Semi-supervised Clustering
Same-distribution assumption, but it can be relaxed
when must-links are few
Transfer to Classify Web
Co-clustering is applied between words and
out-of-domain documents (new tasks)
Word clustering is constrained by the labels
of in-domain (Old) documents
The word clustering part in both domains
serves as a bridge
A Biological Transfer Problem
‘Promoter prediction analysis on the whole human
genome’, Nature Biotechnology, Vol. 22, 2004.
Most promoter prediction programs are effective on
individual chromosomes, e.g., Chr21, Chr22,
but fail to generalize to the whole-genome scale:
an average accuracy of only 65% is too low
Can we build a unifying model for transferring the
learned knowledge to other chromosomes
to predict across the whole genome?
to cluster other genes and protein arrays?
to classify related sequences?
Challenge 4: Exploiting the network
structure
We are short of labeled data
The matrix structures are very sparse if we only
have several hundred samples and a huge number
of attributes
Classification accuracy cannot be improved much
Gene expression data: tens or low hundreds of samples,
but tens of thousands of attributes (?)
Accuracy: generally less than 80%
Social Network Mining
Very large scale computational
analysis of gene and social
networks.
Social networks: a social structure
made of nodes (individuals or
organizations) tied by one or more
specific types of relations.
[Figure: example network with nodes Title, Author (Paper 1), Citation (Paper 2), Conference Name]
• Collective Classification
• Collective Recommendation
Social Net Mining: Engineering
meets Science
‘Empirical Analysis of an Evolving Social Network’, Science, Vol. 311,
2006.
A dynamic social network comprising 43,553 students, faculty, and staff at a
large university.
Interactions between individuals are inferred from time-stamped e-mail
headers recorded over one academic year and are matched with affiliations
and attributes.
Findings:
When two students are in the same class, they are on average 3 times more likely
to interact if they also share an acquaintance
Netflix Challenge and KDDCUP 2007
Blog Evolution (NEC Work)
Using Network Structure in Biology
‘Adaptive Response of a Gene Network to Environmental Changes by
Fitness-Induced Attractor Selection’, PLoS ONE, 2006.
The dynamics of a gene network are described by differential equations, e.g., a
simplified network involving only two gene nodes is formulated as:
[Equation: two-gene network model]
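To make the idea of a two-gene network described by differential equations concrete, the sketch below integrates a generic mutual-repression model. The functional form (synthesis repressed by the other gene, linear degradation), the parameter values, and the name `simulate_toggle` are assumptions chosen for illustration; this is not the formulation from the cited paper, and it omits any noise or fitness-dependent terms.

```python
def simulate_toggle(m1, m2, synthesis=4.0, degradation=1.0, dt=0.01, steps=5000):
    """Forward-Euler integration of two mutually repressing genes.

    Assumed model (illustrative only):
        dm1/dt = synthesis / (1 + m2^2) - degradation * m1
        dm2/dt = synthesis / (1 + m1^2) - degradation * m2
    """
    for _ in range(steps):
        dm1 = synthesis / (1.0 + m2 ** 2) - degradation * m1
        dm2 = synthesis / (1.0 + m1 ** 2) - degradation * m2
        m1, m2 = m1 + dt * dm1, m2 + dt * dm2
    return m1, m2

# Two initial conditions on either side of the diagonal m1 = m2.
state_a = simulate_toggle(2.0, 1.0)   # starts with gene 1 ahead
state_b = simulate_toggle(1.0, 2.0)   # starts with gene 2 ahead
```

With these assumed parameters the system has two stable states (attractors), one with gene 1 high and one with gene 2 high, and the initial condition decides which one is reached. Attractor selection concerns how a network moves between such states.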