
Lecture 10: Introduction to

Machine Learning

Course: Biomedical Informatics

Parisa Rashidi
Fall 2014

Reminder
Your project progress reports are due on Tuesday, 10/28
~2 pages in length (excluding references)
Formatted using IEEE style (see link)

Agenda
Machine learning
Today
Introduction to machine learning
Different types of machine learning methods

Walkthrough: a machine learning process

Later
More machine learning methods

NLP

Software
RapidMiner (link)

Artificial Intelligence
Artificial Intelligence (AI) has many subfields

Machine Learning (ML)


Natural Language Processing (NLP)
Vision

What is Learning?
Machine learning is programming computers to optimize a performance criterion using example data or past experience.

You were not made to live like beasts, but to follow virtue and knowledge.
(Dante Alighieri)
*ROBERTO BATTITI AND MAURO BRUNATO.
The LION way. Machine Learning plus Intelligent Optimization.

What We Talk About When We Talk About Learning
Learning general models from data of particular examples
Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
Example 1: adverse drug-drug interactions
Example 2: customer behavior:
"People who bought Blink also bought David and Goliath" (www.amazon.com)
Build a model that is a good and useful approximation to the data.

Relation with Other Fields
ML draws on ideas from many fields:
Statistics
Statistical physics
Economics
Neuroscience
Control theory
Computer science
Optimization

To Understand ML
You need:
Basic knowledge of computer science
Linear algebra
Calculus
Probability and statistics
Optimization

Example ML Algorithms
Linear regression
Decision trees, neural networks, support vector machines, ...
[Figure: a simple decision tree over total energy and main frequency, with leaves Stand, Run, Sit, and Walk; Figure: support vector machines]
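A minimal scikit-learn sketch of a small decision tree like the one in the figure; the two features and the tiny dataset are assumptions for illustration only:

    # Minimal sketch: a decision tree over two assumed features (total energy, main frequency).
    # The tiny dataset below is invented for illustration.
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[0.1, 1.0],   # low energy, low frequency
         [0.2, 1.2],
         [5.0, 2.5],   # high energy, moderate frequency
         [9.0, 4.0]]   # very high energy, high frequency
    y = ["sit", "sit", "walk", "run"]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["total_energy", "main_frequency"]))
    print(tree.predict([[0.15, 1.1]]))  # expected: 'sit'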

Generic Applications
Almost everywhere
Speech recognition, face recognition, search engines,
bioinformatics, fraud detection
And it will be everywhere
Smart homes, smart vehicles, smart cities

Biomedical Application
Mobile health monitoring solutions
Electronic Health Record (EHR) mining
Genome-wide associations (GWAS)
Smart homes for the elderly
Biomarker discovery


Challenges & Competitions
Many competitions at Kaggle: http://www.kaggle.com/competitions
Example: predict the likelihood that an HIV patient's infection will become less severe
A great way to improve your skills (and maybe make some money!)

Supervised vs. Unsupervised Learning

Supervised Machine Learning
Goal is prediction
Example:
Input: examples of benign and malignant tumors, defined in terms of tumor shape, radius, ...
Output: predict whether a previously unseen example is benign or malignant
[Figure: tumor examples → machine learning algorithm → model; new instance → model → benign or malignant?]

Supervised Learning Toy Example: Classification
Example: surgery risk
Differentiating between low-risk and high-risk patients
[Figure: scatter plot of cell size uniformity (x) vs. cell shape uniformity (y) with low-risk and high-risk regions]
Rule: IF x > a AND y > b THEN low-risk

Supervised Learning Toy Example: Regression
Example: child mortality
x: maternal education
y: child mortality
y = g(x | θ), where g() is the model and θ are its parameters
Linear model: y = wx + w0
[Figure: scatter plot of maternal education (x) vs. child mortality (y) with a fitted line]
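A minimal numpy sketch of fitting the linear model y = wx + w0 by least squares; the education/mortality numbers are invented for illustration:

    # Minimal sketch of the linear model y = w*x + w0 fit by least squares.
    # The (maternal education, child mortality) numbers are invented for illustration.
    import numpy as np

    x = np.array([2, 4, 6, 8, 10, 12], dtype=float)     # years of maternal education
    y = np.array([90, 75, 62, 50, 41, 30], dtype=float) # child mortality (per 1000)

    w, w0 = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
    print(f"y = {w:.2f}*x + {w0:.2f}")
    print("prediction at x = 7:", w * 7 + w0)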

Supervised Learning: Uses
Prediction of future cases: use the rule to predict the output for future inputs
Knowledge extraction: the rule is easy to understand
Compression: the rule is simpler than the data it explains
Outlier detection: exceptions that are not covered by the rule, e.g., fraud

Unsupervised Machine Learning
Also known as data mining
Goal is knowledge discovery
Example:
Input: a DNA sequence as a long string of {A, C, G, T}
Output: frequent subsequences (gene patterns)
[Figure: DNA sequence (AACGTAACGGGACTCCAC) → data mining algorithm → model → gene pattern]

Unsupervised Learning
Example: Learning Associations
It started with market basket analysis
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.

Unsupervised Learning
Learning what normally happens
No labels
Example method:
Clustering: Grouping similar instances
Example applications
Image compression: Color quantization
Bioinformatics: Learning motifs
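A minimal clustering sketch with k-means in scikit-learn; the 2-D points are invented for illustration:

    # Minimal sketch of clustering (grouping similar instances) with k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one group
                  [8.0, 8.2], [7.8, 8.1], [8.3, 7.9]])  # another group

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster labels:", kmeans.labels_)
    print("cluster centers:", kmeans.cluster_centers_)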

You Don't Always Need Machine Learning!
Machine learning definition (supervised):
The ability to learn and to improve with experience instead of using pre-determined rules.
Consider the following two tasks:
Testing for prime numbers
Problem: Is m a prime number?
Solution: test divisors up to √m to see if m can be factored into two values.
Recognizing handwritten digits
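The prime-number task needs no learning; a minimal Python sketch of the trial-division solution:

    # Trial division up to sqrt(m) settles the prime-number question without any learning.
    import math

    def is_prime(m: int) -> bool:
        if m < 2:
            return False
        for d in range(2, math.isqrt(m) + 1):
            if m % d == 0:          # m factors into d * (m // d)
                return False
        return True

    print([n for n in range(2, 30) if is_prime(n)])  # 2, 3, 5, 7, 11, ...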

You Don't Always Need Machine Learning!
Unsupervised learning definition (rather unofficial):
Automatic analysis of data to extract previously unknown interesting patterns
Consider the following two tasks:
Regular expression matching
Problem: find all patterns matching the regular expression A*C.
Solution: simple string matching (finite state machine)
DNA sequence mining

When is Learning Needed?
There is no need to learn to calculate payroll
Learning is used when:
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech recognition)
The solution changes over time (routing on a computer network)
The solution needs to be adapted to particular cases (user biometrics)

Supervised vs. Unsupervised Learning
Supervised learning (learn from my example)
Goal: a program that performs a task as well as humans.
TASK: well defined (the target function)
EXPERIENCE: training data provided by a human
PERFORMANCE: metric error/accuracy on the task
Unsupervised learning (see what you can find)
Goal: to find some kind of structure in the data.
TASK: vaguely defined
No EXPERIENCE: no labeled data
No PERFORMANCE metric (but there are some evaluation metrics)
*TAKIS METAXAS, CS 315 Web Search and Data Mining

Terminology
A Simple Example: Tumor Classification
Benign: -1
Malignant: +1
[Table: tumor examples described by features (uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses) and a class label (benign = 2, malignant = 4), mapped here to -1/+1]

Terminology: Feature
Features = the set of attributes associated with an example
(aka independent variable in statistics)
[Table: each feature column of the tumor table, e.g., uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses]

Terminology: Instance
Example = an instance of data = data point = x_i
Each row of the table is a data instance.
[Table: each row of the tumor table is one instance]

Terminology: Label
Label = class = the feature to be predicted = category associated with an object
Denoted by y_i
(aka dependent variable in statistics)
Label is usually provided by an expert
[Table: the class label column of the tumor table (benign = 2, malignant = 4), mapped here to -1/+1]

Data Representation
We usually represent data in a matrix
Rows are instances, columns are features, and a final column holds the label (-1, +1, or ? if unknown)
Co-variance matrix (feature × feature)
Gram matrix (instance × instance)
Note: We can also assign a probability to each label (we'll discuss it later)
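A minimal numpy sketch of this matrix view; the numbers are invented, and np.cov plus a plain dot product stand in for the co-variance and Gram matrices:

    # Minimal sketch of the data matrix view: rows = instances, columns = features.
    import numpy as np

    X = np.array([[2., 5., 1., 1.],    # instance 1
                  [5., 4., 4., 5.],    # instance 2
                  [3., 2., 1., 1.]])   # instance 3
    y = np.array([-1, +1, -1])         # labels

    cov = np.cov(X, rowvar=False)      # co-variance matrix: feature x feature
    gram = X @ X.T                     # Gram matrix: instance x instance
    print(cov.shape, gram.shape)       # (4, 4) (3, 3)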

Summary of Key Terms


Instance = example = data point
Feature = independent variable
Class label = dependent variable
Decision boundary = separates examples in different
classes

Algorithms

Availability of Labeled Data


Supervised learning => when all data is labeled
Semi-supervised learning => when a small amount of data is labeled
Unsupervised learning => when data is not labeled
Transfer Learning => when labeled data is available in another domain
Active Learning => when the algorithm has access to a human oracle to
ask for labels of a few data points
[Flowchart: Do you have labeled data? Yes → supervised; a little → semi-supervised; no → unsupervised; in another domain → transfer learning; by asking an oracle → active learning]

Task Type
Categorical output: classification task (the learned model is a classifier)
Continuous output: regression task
Ordered output: ranking task
[Flowchart: What is your output type? Categorical → classification; continuous → regression; ordered → ranking]

Input Representation
The most common type: simple records in tables
Can be analyzed using regular machine learning techniques.
Most other data types are converted to this type.
(Not always: there are methods that directly process other data types.)
A simple record with columns ID, WGT, HGT, Cholesterol, and Risk (class), e.g.:
WGT: high, HGT: short, Cholesterol: 260, Risk: high
WGT: high, HGT: med, Cholesterol: 254, Risk: high
WGT: high, HGT: tall, Cholesterol: 142, Risk: med

Input Representation (cont.)
Images and video are preprocessed using vision techniques.
Text is preprocessed using NLP techniques.
Continuous measures along time (time series) are preprocessed using time series analysis.
Graphs are preprocessed using graph theory tools.
[Figure: examples of image, time series, text, and graph data]

More Details

Important Steps
1. Determine relevant features (expert knowledge)
2. Collect data (and label data)
3. Split labeled data into training and test datasets
4. Use training data to train the machine learning algorithm.
5. Predict labels of examples in the test data.
6. Evaluate the algorithm.

Features Are Important!
Should be rich enough to capture the problem
Should be simple enough to allow learning the model
Too many features
Makes learning more difficult
Not enough features
Impacts generalization power

Feature Extraction
Typically results in a significant reduction in dimensionality
Domain-specific
* Image taken from Jeff Howbert's slides


Important Steps
1. Determine relevant features (expert knowledge)
2. Collect data
3. Split labeled data into training and test datasets
4. Use training data to train the machine learning algorithm.
5. Predict labels of examples in the test data.
6. Evaluate the algorithm.

How to Split Data?
Holdout: training set, (validation set), test set
K-fold cross-validation
E.g., 10-fold cross-validation

Methods of Sampling
Holdout
E.g., reserve 2/3 for training and 1/3 for testing
Random subsampling
Cross-validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the remaining one
Leave-one-out: k = n
Stratified sampling
Bootstrap
Sampling with replacement
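A minimal scikit-learn sketch of holdout and 10-fold cross-validation; the synthetic dataset and the choice of classifier are assumptions for illustration:

    # Minimal sketch of holdout and k-fold cross-validation with scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)

    # Holdout: reserve 2/3 for training, 1/3 for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("holdout accuracy:", clf.score(X_test, y_test))

    # 10-fold cross-validation on the full labeled set
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print("10-fold mean accuracy:", scores.mean())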

Important Steps
1. Determine relevant features (expert knowledge)
2. Collect data
3. Split labeled data into training and test datasets
4. Use training data to train the machine learning algorithm.
5. Predict labels of examples in the test data.
6. Evaluate the algorithm.

Decision Boundary
We seek to find this boundary
[Figure: labeled examples plotted by x1 (radius) and x2 (uniformity), with benign and malignant regions, an outlier, the true decision boundary, and the learned decision boundary]

Why Noise?
Noise might be due to different reasons
Imprecision in recording the input data
Errors in labeling data
We might not have considered additional features
(latent, or hidden features)
When there is noise, the decision boundary becomes

more complex

Overfitting
Data are well described by our model, but the predictions do not generalize to new data.
A very rich hypothesis space
Training set too small

Overfitting and Underfitting
Underfitting
If your hypothesis is less complex than the actual function
E.g., using a straight line to model data generated by a third-order polynomial
Overfitting
If your hypothesis is more complex than the actual function
E.g., using a fifth-order polynomial to model data generated by a second-order polynomial
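A minimal numpy sketch of this effect, fitting polynomials of degrees 1, 2, and 9 to noisy data generated from a second-order polynomial; all numbers are invented:

    # Minimal sketch of underfitting vs. overfitting on data from a degree-2 polynomial.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 20)
    y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=2.0, size=x.size)  # true: degree 2

    for degree in (1, 2, 9):                       # underfit, good fit, overfit
        coeffs = np.polyfit(x, y, deg=degree)
        y_hat = np.polyval(coeffs, x)
        train_err = np.mean((y - y_hat) ** 2)
        print(f"degree {degree}: training MSE = {train_err:.2f}")
    # The degree-9 fit has the lowest training error but generalizes worst to new data.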

Bias-Variance
Bias = assumptions, restrictions on the model
Variance = variation of the prediction of the model
Simple linear model => high bias
Complex model => high variance
[Figure: under-fitting vs. over-fitting on the same data]

Important Steps
1. Determine relevant features (expert knowledge)
2. Collect data
3. Split labeled data into training and test datasets
4. Use training data to train the machine learning algorithm.
5. Predict labels of examples in the test data.
6. Evaluate the algorithm.

Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Model Comparison
How to compare the relative performance among
competing models?

Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
Confusion matrix:
                    PREDICTED Class=Yes      PREDICTED Class=No
ACTUAL Class=Yes    a: TP (true positive)    b: FN (false negative)
ACTUAL Class=No     c: FP (false positive)   d: TN (true negative)

Metrics for Performance Evaluation
                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)
Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
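A minimal scikit-learn sketch of the confusion matrix and accuracy; the label vectors are invented for illustration:

    # Minimal sketch: confusion matrix and accuracy for assumed predictions.
    from sklearn.metrics import confusion_matrix, accuracy_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # actual classes (1 = Yes, 0 = No), invented
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # predicted classes, invented

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TP, TN, FP, FN =", tp, tn, fp, fn)
    print("accuracy =", (tp + tn) / (tp + tn + fp + fn))
    print("accuracy (sklearn) =", accuracy_score(y_true, y_pred))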

Cost Matrix
                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    C(Yes|Yes)            C(No|Yes)
ACTUAL Class=No     C(Yes|No)             C(No|No)
C(i|j): cost of misclassifying a class j example as class i

Computing Cost of Classification
Cost matrix C(i|j):
                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    -1                    100
ACTUAL Class=No     1                     0
Model M1:
                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    150                   40
ACTUAL Class=No     60                    250
Accuracy = 80%, Cost = 3910
Model M2:
                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    250                   45
ACTUAL Class=No     5                     200
Accuracy = 90%, Cost = 4255
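A minimal numpy sketch of the cost computation for M1; note the C(Yes|No) = 1 and C(No|No) = 0 entries are an assumption chosen so that the reported cost of 3910 is reproduced:

    # Minimal sketch: combining a confusion matrix with a cost matrix C(i|j),
    # using the M1 counts from the slide (TP=150, FN=40, FP=60, TN=250).
    import numpy as np

    # Rows = actual class (Yes, No); columns = predicted class (Yes, No)
    confusion = np.array([[150, 40],
                          [60, 250]])
    cost = np.array([[-1, 100],     # C(Yes|Yes), C(No|Yes)
                     [1,   0]])     # C(Yes|No),  C(No|No)  (assumed values)

    total_cost = np.sum(confusion * cost)   # element-wise product, then sum
    accuracy = np.trace(confusion) / confusion.sum()
    print("accuracy =", accuracy)           # 0.80
    print("total cost =", total_cost)       # 3910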

Limitation of Accuracy
Consider a 2-class problem
Number of class 0 examples = 9990
Number of class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any class 1 example

Other Measures
Precision (p) = true positives / all items predicted as positive = a / (a + c)
Recall (r) = true positives / all actual positive items = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
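A minimal Python sketch of these measures from the counts a (TP), b (FN), c (FP); the numbers reuse the M1 counts above:

    # Precision, recall, and F-measure from confusion-matrix counts.
    a, b, c = 150, 40, 60            # TP, FN, FP

    precision = a / (a + c)          # true positives / all predicted positive
    recall = a / (a + b)             # true positives / all actual positive
    f_measure = 2 * recall * precision / (recall + precision)   # = 2a / (2a + b + c)
    print(precision, recall, f_measure)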

Triple Tradeoff
There is a tradeoff between three factors:
Complexity of the hypothesis space: C
Amount of training data: N
Generalization error on new data: E
As N increases, E decreases
As C increases, first E decreases, then E increases

Learning Curve
A learning curve shows how accuracy (or error) changes with varying sample size

More on Bias vs. Variance
Typical learning curve for high variance:
Test error still decreasing as m (training set size) increases; suggests a larger training set will help.
Large gap between training and test error.

*Andrew Y. Ng, Advice for applying Machine Learning, Stanford

More on Bias vs. Variance
Typical learning curve for high bias:
Even the training error is unacceptably high.
Small gap between training and test error.

*Andrew Y. Ng, Advice for applying Machine Learning, Stanford

Diagnosis
Fixes to try:
Try getting more training examples  =>  fixes high variance
Try a smaller set of features       =>  fixes high variance
Try a larger set of features        =>  fixes high bias
Try different features              =>  fixes high bias
*Andrew Y. Ng, Advice for applying Machine Learning, Stanford

Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Model Comparison
How to compare the relative performance among
competing models?
We will look at this next time!

Putting It All Together
Differentiate between walking and jogging using an accelerometer
Kwapisz et al., SIGKDD Explorations, 2010
Pipeline:
Data: sample d = (x, y, z) at 60 Hz
Preprocess: segment and label
Feature extraction: f_1, f_2, f_3, ...
Feature selection: select some features
Train
Evaluate
[Figure: a simple decision tree model over total energy and main frequency, with leaves Stand, Run, Sit, and Walk]
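A minimal end-to-end sketch of such a pipeline in Python; the simulated signal, window length, features, and classifier are all assumptions for illustration, not the setup of Kwapisz et al.:

    # Minimal sketch: segment a simulated accelerometer signal, extract two simple
    # features (total energy, dominant frequency), train a decision tree, evaluate.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def make_segment(freq_hz, amplitude, fs=60, seconds=2):
        """Simulate one labeled accelerometer magnitude segment sampled at fs Hz."""
        t = np.arange(0, seconds, 1 / fs)
        return amplitude * np.sin(2 * np.pi * freq_hz * t) + rng.normal(scale=0.1, size=t.size)

    def extract_features(segment, fs=60):
        """Feature extraction: total energy and dominant frequency of a segment."""
        energy = np.sum(segment ** 2)
        spectrum = np.abs(np.fft.rfft(segment))
        freqs = np.fft.rfftfreq(segment.size, d=1 / fs)
        return [energy, freqs[np.argmax(spectrum[1:]) + 1]]  # skip the DC component

    # Simulate walking (slower, smaller) vs. jogging (faster, larger) segments
    X, y = [], []
    for _ in range(50):
        X.append(extract_features(make_segment(freq_hz=1.5, amplitude=1.0))); y.append("walk")
        X.append(extract_features(make_segment(freq_hz=3.0, amplitude=2.5))); y.append("jog")

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))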

References
Slides partially based on:
Lecture notes for E. Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, 2010 (v1.0)

Resources for You

Tools
RapidMiner
Weka
R
scikit-learn
Matlab
More here

https://sites.google.com/site/parisar/links
(You can also find some publicly available free e-books

on machine learning)

Resources: Datasets
UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html

UCI KDD Archive:


http://kdd.ics.uci.edu/summary.data.application.html

Statlib: http://lib.stat.cmu.edu/
Delve: http://www.cs.utoronto.ca/~delve/


Resources: Journals
IEEE Transactions on Knowledge and Data Engineering
Journal of Machine Learning Research www.jmlr.org
Machine Learning
Neural Computation
Neural Networks
IEEE Transactions on Neural Networks
IEEE Transactions on Pattern Analysis and Machine
Intelligence
Annals of Statistics
Journal of the American Statistical Association
...


Resources: Conferences
International Conference on Knowledge Discovery and

Data Mining (KDD)


International Conference on Machine Learning (ICML)
European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Uncertainty in Artificial Intelligence (UAI)
Computational Learning Theory (COLT)
International Conference on Artificial Neural Networks
(ICANN)
International Conference on AI & Statistics (AISTATS)
International Conference on Pattern Recognition (ICPR)
...
