(BCO 086A)
Submitted By:
Anuja Sharma
Assistant Professor
CSE
Machine Learning
MODULE-I
Concept Learning and Decision Trees
Concept & Concept Learning
A Concept Learning Task – EnjoySport
Training Examples
EnjoySport – Hypothesis Representation
Hypothesis Representation
EnjoySport Concept Learning Task
Terminology
Concept Learning as a Search
EnjoySport – Hypothesis Space
General-to-Specific Ordering
More-General-Than Relation
FIND-S Algorithm
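A minimal sketch of FIND-S over the conjunctive hypothesis representation, using EnjoySport-style training data in Mitchell's formulation (the data values here are illustrative):

    # FIND-S: start with the most specific hypothesis and minimally
    # generalize it on each positive example; negatives are ignored.

    def find_s(examples):
        """examples: list of (attribute_tuple, label) pairs; label is 'Yes'/'No'."""
        h = None  # most specific hypothesis (no positive example seen yet)
        for x, label in examples:
            if label != 'Yes':
                continue  # FIND-S ignores negative examples
            if h is None:
                h = list(x)  # first positive example: copy it exactly
            else:
                # replace any attribute that disagrees with '?' (don't-care)
                h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
        return h

    # EnjoySport-style training data (illustrative)
    data = [
        (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
        (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
        (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
        (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
    ]
    print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']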
Candidate Elimination Algorithm
Consistent Hypothesis
Version Space & Candidate Elimination Algorithm
Compact Representation of Version Spaces
Example of Version Space
Candidate Elimination Example
Candidate Elimination Algorithm
Final Version Space
Candidate Elimination Algorithm – Example Final Version Space
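A compact sketch of the boundary-set updates for the same conjunctive representation (a simplification: in this hypothesis space the S boundary stays a single hypothesis; names are illustrative):

    # Candidate Elimination: maintain the specific boundary S and
    # the general boundary G; the version space lies between them.

    def covers(h, x):
        """Does hypothesis h classify instance x as positive?"""
        return all(a == '?' or a == b for a, b in zip(h, x))

    def more_general_eq(g, s):
        """True if g is equal to or more general than s."""
        return all(a == '?' or a == b for a, b in zip(g, s))

    def candidate_elimination(examples, domains):
        n = len(domains)
        S = []                        # empty list stands for the most specific boundary (phi)
        G = [tuple(['?'] * n)]        # most general hypothesis
        for x, label in examples:
            if label == 'Yes':
                G = [g for g in G if covers(g, x)]
                S = [tuple(x)] if not S else \
                    [tuple(a if a == b else '?' for a, b in zip(S[0], x))]
                S = [s for s in S if any(more_general_eq(g, s) for g in G)]
            else:
                S = [s for s in S if not covers(s, x)]
                new_G = set()
                for g in G:
                    if not covers(g, x):
                        new_G.add(g)
                        continue
                    for i in range(n):          # minimal specializations of g
                        if g[i] != '?':
                            continue
                        for v in domains[i]:
                            if v != x[i]:
                                spec = g[:i] + (v,) + g[i + 1:]
                                if all(more_general_eq(spec, s) for s in S):
                                    new_G.add(spec)
                # drop G members less general than another G member
                G = [g for g in new_G
                     if not any(h != g and more_general_eq(h, g) for h in new_G)]
        return S, G

    domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
               ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
    S, G = candidate_elimination(data, domains)   # data as in the FIND-S sketch above
    # S = [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
    # G contains ('Sunny','?','?','?','?','?') and ('?','Warm','?','?','?','?')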
Inductive Bias – A Biased Hypothesis
Inductive Bias – An Unbiased Learner
Inductive Bias – Formal Definition
Decision Trees
From Decision Trees to Logic
Machine Learning
MODULE-II
Neural Networks
Perceptron Node – Threshold Logic Unit
Inputs x_1, ..., x_n with weights w_1, ..., w_n feed a single node with output Z:
z = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i w_i \geq \theta \\ 0 & \text{if } \sum_{i=1}^{n} x_i w_i < \theta \end{cases}
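A minimal sketch of the unit (names are illustrative; the threshold \theta is written explicitly):

    def tlu(x, w, theta):
        """Threshold logic unit: z = 1 if sum(x_i * w_i) >= theta, else 0."""
        net = sum(xi * wi for xi, wi in zip(x, w))
        return 1 if net >= theta else 0

    # example matching the slides that follow: weights .4 and -.2, threshold .1
    print(tlu([.8, .3], [.4, -.2], .1))  # net = .26 >= .1 -> 1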
Learning Algorithm
Network: inputs x_1 and x_2 with initial weights w_1 = .4 and w_2 = -.2, threshold \theta = .1, output Z
Training set:
x_1   x_2   T
.8    .3    1
.4    .1    0
First Training Instance
Present (.8, .3): net = .8(.4) + .3(-.2) = .26 \geq \theta = .1, so Z = 1
T = 1, so no weight change is needed
Second Training Instance
Present (.4, .1): net = .4(.4) + .1(-.2) = .14 \geq \theta = .1, so Z = 1 but T = 0
The weights are therefore adjusted by the delta rule:
\Delta w_i = (T - Z) \cdot C \cdot x_i
Delta Rule Learning
\Delta w_{ij} = C (T_j - Z_j) x_i
Create a network with n input and m output nodes
Each iteration through the training set is an epoch
Continue training until error is less than some epsilon
Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
As can be seen from the node activation function, the decision surface is an n-dimensional hyperplane:
z = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i w_i \geq \theta \\ 0 & \text{if } \sum_{i=1}^{n} x_i w_i < \theta \end{cases}
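A sketch of the update loop on the two-instance training set above, keeping the threshold fixed at \theta = .1 as in the slides (the learning rate C and epoch count are illustrative):

    def train_perceptron(data, w, theta, C=0.1, epochs=10):
        """data: list of (inputs, target); applies Delta w_i = C*(T - Z)*x_i."""
        for _ in range(epochs):                 # each pass through data is one epoch
            for x, t in data:
                net = sum(xi * wi for xi, wi in zip(x, w))
                z = 1 if net >= theta else 0
                w = [wi + C * (t - z) * xi for wi, xi in zip(w, x)]
        return w

    data = [([.8, .3], 1), ([.4, .1], 0)]       # training set from the slides
    # converges to weights that output 1 for (.8, .3) and 0 for (.4, .1)
    print(train_perceptron(data, [.4, -.2], theta=.1))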
Linear Separability
Linear Separability and Generalization
Limited Functionality of Hyperplane
Gradient Descent Learning
[Figure: error landscape – Total Sum Squared (TSS) error plotted against weight values]
Deriving a Gradient Descent Learning Algorithm
Goal: decrease overall error (or other objective function) each time a weight is changed
Total Sum Squared error: E = \sum_i (T_i - Z_i)^2
Seek a weight-changing algorithm such that \partial E / \partial w_{ij} is negative
If such a formula can be found, then we have a gradient descent learning algorithm
The Perceptron/Delta rule is a gradient descent learning algorithm
Linearly separable problems have no local minima
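A worked version of this step, as a sketch assuming a linear output Z_j = \sum_i x_i w_{ij} and folding constant factors into the learning rate C:

    E = \sum_j (T_j - Z_j)^2, \qquad Z_j = \sum_i x_i w_{ij}
    \frac{\partial E}{\partial w_{ij}} = 2 (T_j - Z_j) \cdot \left( -\frac{\partial Z_j}{\partial w_{ij}} \right) = -2 (T_j - Z_j) \, x_i
    \Delta w_{ij} = -C' \, \frac{\partial E}{\partial w_{ij}} = C (T_j - Z_j) \, x_i

which is exactly the delta rule above, so each update moves downhill on E.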
Multi-layer Perceptron
Can compute arbitrary mappings
Assumes a non-linear activation function
Training Algorithms less obvious
Backpropagation learning algorithm not exploited until the 1980s
First of many powerful multi-layer learning algorithms
Responsibility Problem
[Figure: multi-layer network produces output 1 where 0 was wanted – which internal weights are responsible?]
Multi-Layer Generalization
Backpropagation
Multi-layer supervised learner
Gradient Descent weight updates
Sigmoid activation function (smoothed threshold logic)
Multi-layer Perceptron Topology
[Figure: multi-layer topology – node i in one layer feeds node j in the next]
Backpropagation Learning Algorithm
Until convergence (low error or other criteria) do:
Present a training pattern
Calculate the error of the output nodes (based on T - Z)
Calculate the error of the hidden nodes (based on the error of the output nodes, which is propagated back to the hidden nodes)
Continue propagating error back until the input layer is reached
Update all weights based on the standard delta rule with the appropriate error term \delta:
\Delta w_{ij} = C \delta_j Z_i
Activation Function and its Derivative
Node activation function f(net) is typically the sigmoid:
Z_j = f(net_j) = \frac{1}{1 + e^{-net_j}}
The derivative of the activation function is a critical part of the algorithm:
f'(net_j) = Z_j (1 - Z_j)
[Figure: sigmoid rising from 0 to 1 (value .5 at net = 0) and its derivative peaking at .25 at net = 0, plotted over net in [-5, 5]]
Backpropagation Learning Equations
\Delta w_{ij} = C \delta_j Z_i
\delta_j = (T_j - Z_j) f'(net_j)   [Output Node]
\delta_j = \left( \sum_k \delta_k w_{jk} \right) f'(net_j)   [Hidden Node]
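A minimal sketch of these equations for a single pattern (one hidden layer, list-of-lists weight matrices, and the learning rate are illustrative choices; bias weights are omitted for brevity):

    import math

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    def backprop_step(x, t, W_ih, W_ho, C=0.5):
        """One pattern presentation. W_ih[i][j]: input->hidden, W_ho[j][k]: hidden->output."""
        n_hid, n_out = len(W_ho), len(W_ho[0])
        # forward pass
        hidden = [sigmoid(sum(x[i] * W_ih[i][j] for i in range(len(x))))
                  for j in range(n_hid)]
        out = [sigmoid(sum(hidden[j] * W_ho[j][k] for j in range(n_hid)))
               for k in range(n_out)]
        # output deltas: delta_k = (T_k - Z_k) * f'(net_k), with f'(net) = Z(1 - Z)
        d_out = [(t[k] - out[k]) * out[k] * (1 - out[k]) for k in range(n_out)]
        # hidden deltas: delta_j = (sum_k delta_k * w_jk) * f'(net_j)
        d_hid = [sum(d_out[k] * W_ho[j][k] for k in range(n_out))
                 * hidden[j] * (1 - hidden[j]) for j in range(n_hid)]
        # delta rule updates: Delta w_ij = C * delta_j * Z_i
        for j in range(n_hid):
            for k in range(n_out):
                W_ho[j][k] += C * d_out[k] * hidden[j]
        for i in range(len(x)):
            for j in range(n_hid):
                W_ih[i][j] += C * d_hid[j] * x[i]
        return out

Called repeatedly over the training set until error is low, this implements the loop on the previous slide.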
Backpropagation Summary
Excellent Empirical results
Scaling – The pleasant surprise
Local minima very rare as problem and network complexity increase
Most common neural network approach
User-defined parameters make it more difficult to use
Number of hidden nodes, layers, learning rate, etc.
Many variants
Adaptive Parameters, Ontogenic (growing and pruning) learning
algorithms
Higher order gradient descent (Newton, Conjugate Gradient, etc.)
Recurrent networks
Inductive Bias
The approach used to decide how to generalize novel cases
Occam’s Razor – The simplest hypothesis which fits the data
is usually the best – Still many remaining options
A B C -> Z
A B’ C -> Z
A B C’ -> Z
A B’ C’ -> Z
A’ B’ C’ -> Z’
Overfitting
Noise vs. Exceptions revisited
The Overfit Problem
Newer powerful models can have very complex decision surfaces which can converge well on most training sets by learning noisy and irrelevant aspects of the training set in order to minimize error (memorization in the limit)
This makes them susceptible to overfit if not carefully considered
[Figure: TSS vs. epochs – training-set error keeps decreasing while validation/test-set error eventually rises]
Avoiding Overfit
Inductive Bias – Simplest accurate model
More Training Data (vs. overtraining - One epoch limit)
Validation Set (requires separate test set)
Backpropagation – Tends to build from simple model (0
weights) to just large enough weights (Validation Set)
Stopping criteria with any constructive model (Accuracy
increase vs Statistical significance) – Noise vs. Exceptions
Specific Techniques
Weight Decay, Pruning, Jitter, Regularization
Ensembles
Ensembles
Many different Ensemble approaches
Stacking, Gating/Mixture of Experts, Bagging, Boosting, Wagging, Mimicking,
Combinations
Multiple diverse models trained on same problem and then their outputs are
combined
The specific overfit of each learning model is averaged out
If models are diverse (uncorrelated errors) then even if the individual models
are weak generalizers, the ensemble can be very accurate
[Figure: models M1, M2, M3, ..., Mn feed a combining technique]
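A minimal sketch of the combining step as a majority vote, with a bootstrap resampler for bagging (the model objects and their predict interface are illustrative assumptions):

    import random
    from collections import Counter

    def bootstrap_sample(data):
        # bagging: each model trains on a resample of the data (with replacement)
        return [random.choice(data) for _ in data]

    def ensemble_predict(models, x):
        # combine diverse models M1..Mn by simple majority vote
        votes = [m.predict(x) for m in models]
        return Counter(votes).most_common(1)[0][0]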
Application Issues
Choose relevant features
Normalize features
Can learn to ignore irrelevant features, but will have to fight
the curse of dimensionality
The more data (training examples) the better
Slower training is acceptable for complex and production applications if accuracy improves (“the week phenomenon”)
Execution normally fast regardless of training time
Decision Trees - ID3/C4.5
Top down induction of decision trees
Highly used and successful
Attribute Features - discrete nominal (mutually exclusive) –
Real valued features are discretized
Search for the smallest tree is too complex (NP-hard)
C4.5 uses the common symbolic ML philosophy of a greedy iterative approach
Decision Tree Learning
Mapping by Hyper-Rectangles
[Figure: input space partitioned into hyper-rectangles along attributes A1 and A2]
ID3 Learning Approach
C is the current set of examples
A test on attribute A partitions C into {C1, C2, ..., Cw}, where w is the number of values of A
[Figure: C branching into partitions C1, C2, C3]
Decision Tree Learning Algorithm
Start with the Training Set as C and test how each attribute
partitions C
Choose the best A for root
The goodness measure is based on how well attribute A divides
C into different output classes – A perfect attribute would
divide C into partitions that contain only one output class
each – A poor attribute (irrelevant) would leave each
partition with the same ratio of classes as in C
20 questions analogy – good questions quickly minimize the
possibilities
Continue recursively until sets are unambiguously classified or a stopping criterion is reached
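A sketch of the goodness measure as entropy-based information gain, which ID3 uses to choose the best A (function names are illustrative):

    import math
    from collections import Counter

    def entropy(examples):
        """examples: list of (attribute_dict, cls) pairs."""
        counts = Counter(cls for _, cls in examples)
        total = len(examples)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def information_gain(examples, attr):
        """Gain(C, A) = Entropy(C) - sum_v |C_v|/|C| * Entropy(C_v)."""
        total = len(examples)
        remainder = 0.0
        for v in set(x[attr] for x, _ in examples):
            subset = [(x, c) for x, c in examples if x[attr] == v]
            remainder += len(subset) / total * entropy(subset)
        return entropy(examples) - remainder

    def best_attribute(examples, attrs):
        """Choose the best A for the root: maximum information gain."""
        return max(attrs, key=lambda a: information_gain(examples, a))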
ID3 Example and Discussion
14 examples. Uses information gain. Attributes which best discriminate between classes are chosen
[Figure: candidate splits on Humidity and Temperature]
ID3 - Conclusions
Good Empirical Results
Comparable application robustness and accuracy to neural networks – faster learning (though NNs are more natural with continuous features, both input and output)
Most used and well known of current symbolic systems -
used widely to aid in creating rules for expert systems
Nearest Neighbor Learners
Broad Spectrum
Basic K-NN, Instance Based Learning, Case Based Reasoning,
Analogical Reasoning
Simply store all or some representative subset of the
examples in the training set
Generalize on the fly rather than use pre-acquired hypothesis
- faster learning, slower execution, information retained,
memory intensive
Nearest Neighbor Algorithms
Nearest Neighbor Variations
How many examples to store
How do stored examples vote (distance weighted, etc.)
Can we choose a smaller set of near-optimal examples
(prototypes/exemplars)
Storage reduction
Faster execution
Noise robustness
Distance Metrics – non-Euclidean
Irrelevant Features – Feature weighting
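A minimal sketch of a distance-weighted vote, one of the variations listed above (Euclidean distance; names are illustrative):

    import math
    from collections import defaultdict

    def knn_predict(train, x, k=3):
        """train: list of (vector, cls). Distance-weighted vote of the k nearest."""
        dists = sorted((math.dist(xi, x), cls) for xi, cls in train)  # Euclidean
        votes = defaultdict(float)
        for d, cls in dists[:k]:
            votes[cls] += 1.0 / (d + 1e-9)   # closer neighbors get larger votes
        return max(votes, key=votes.get)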
Evolutionary Computation/Algorithms
Genetic Algorithms
Simulate “natural” evolution of structures via selection and
reproduction, based on performance (fitness)
Type of Heuristic Search - Discovery, not inductive in
isolation
Genetic Operators - Recombination (Crossover) and
Mutation are most common
Parent 1: 1 1 0 2 3 1 0 2 2 1 (fitness = 10)
Parent 2: 2 2 0 1 1 3 1 1 0 0 (fitness = 12)
Child:    2 2 0 1 3 1 0 2 2 1 (fitness = calculated or f(parents))
Evolutionary Algorithms
Start with initialized population P(t) – random, domain knowledge, etc.
Population usually made up of possible parameter settings for
a complex problem
Typically have fixed population size (like beam search)
Selection
Parent_Selection P(t) - Promising Parents used to create new
children
Survive P(t) - Pruning of unpromising candidates
Evaluate P(t) - Calculate fitness of population members.
Ranges from simple metrics to complex simulations.
Evolutionary Algorithm
Procedure EA
  t = 0;
  Initialize Population P(t);
  Evaluate P(t);
  Until Done {   /* Sufficiently “good” individuals discovered */
    t = t + 1;
    Parent_Selection P(t);
    Recombine P(t);
    Mutate P(t);
    Evaluate P(t);
    Survive P(t);
  }
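A runnable sketch of this procedure as a simple genetic algorithm; the fitness function, gene alphabet, and parameters are illustrative stand-ins, with tournament parent selection and truncation survival as one concrete choice:

    import random

    GENES, LENGTH, POP, GENS = [0, 1, 2, 3], 10, 20, 50

    def fitness(ind):                      # illustrative stand-in fitness
        return sum(ind)

    def crossover(p1, p2):                 # single-point recombination
        cut = random.randint(1, LENGTH - 1)
        return p1[:cut] + p2[cut:]

    def mutate(ind, rate=0.05):            # random changes in features
        return [random.choice(GENES) if random.random() < rate else g for g in ind]

    def select(population):                # Parent_Selection: tournament of 2
        a, b = random.sample(population, 2)
        return a if fitness(a) > fitness(b) else b

    population = [[random.choice(GENES) for _ in range(LENGTH)] for _ in range(POP)]
    for t in range(GENS):
        children = [mutate(crossover(select(population), select(population)))
                    for _ in range(POP)]
        # Survive: keep the best POP individuals of parents + children
        population = sorted(population + children, key=fitness, reverse=True)[:POP]
    print(max(population, key=fitness))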
EA Example
Goal: Discover a new automotive engine to maximize
performance, reliability, and mileage while minimizing
emissions
Features: CID (Cubic inch displacement), fuel system, # of
valves, # of cylinders, presence of turbo-charging
Assume - Test unit which tests possible engines and returns
integer measure of goodness
Start with population of random engines
Genetic Operators
Crossover variations - multi-point, uniform probability,
averaging, etc.
Mutation - Random changes in features, adaptive, different
for each feature, etc.
Others - many schemes mimicking natural genetics:
dominance, selective mating, inversion, reordering,
speciation, knowledge-based, etc.
Reproduction (terminology) – selection based on fitness, keeping the best around – supported in the algorithms
Critical to maintain balance of diversity and quality in the
population
Evolutionary Algorithms
There exist mathematical proofs that evolutionary techniques are efficient
search strategies
There are a number of different Evolutionary strategies
Genetic Algorithms
Evolutionary Programming
Evolution Strategies
Genetic Programming
Strategies differ in representations, selection, operators, evaluation, etc.
Most independently discovered, initially function optimization (EP, ES)
Strategies continue to “evolve”
Genetic Algorithm Comments
Much current work and extensions
Numerous application attempts. Can plug into many
algorithms requiring search. Has built-in heuristic. Could
augment with domain heuristics
“Lazy Man’s Solution” to any tough parameter search
Rule Induction
Creates a set of symbolic rules to solve a classification
problem
Sequential Covering Algorithms
Until no good and significant rules can be created
Create all first-order rules A_x -> Class_y
Score each rule based on goodness (accuracy) and significance
using the current training set
Iteratively (greedily) expand the best rules to n+1 attributes,
score the new rules, and prune weak rules to keep the total
candidate list at a fixed size (beam search)
Pick the one best rule and remove all instances from the
training set that the rule covers
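A simplified sketch of the covering loop; for brevity it grows one condition at a time with a beam of width 1, where the slides keep a fixed-size candidate list, and scores rules by accuracy (names are illustrative):

    def rule_covers(rule, x):
        """rule: dict of attribute -> required value."""
        return all(x.get(a) == v for a, v in rule.items())

    def accuracy(rule, examples, cls):
        covered = [(x, c) for x, c in examples if rule_covers(rule, x)]
        if not covered:
            return 0.0, 0
        correct = sum(1 for _, c in covered if c == cls)
        return correct / len(covered), len(covered)

    def learn_one_rule(examples, attrs, cls, max_len=3):
        """Greedily add the single best attribute=value condition."""
        rule = {}
        while len(rule) < max_len:
            best, best_acc = None, accuracy(rule, examples, cls)[0]
            for a in attrs:
                if a in rule:
                    continue
                for v in set(x[a] for x, _ in examples):
                    cand = dict(rule, **{a: v})
                    acc, n = accuracy(cand, examples, cls)
                    if n > 0 and acc > best_acc:
                        best, best_acc = cand, acc
            if best is None:
                break
            rule = best
        return rule

    def sequential_covering(examples, attrs, cls, min_acc=0.9):
        rules = []
        while True:
            rule = learn_one_rule(examples, attrs, cls)
            acc, n = accuracy(rule, examples, cls)
            if n == 0 or acc < min_acc:
                break   # no good and significant rule can be created
            rules.append(rule)
            # remove all instances the rule covers from the training set
            examples = [(x, c) for x, c in examples if not rule_covers(rule, x)]
        return rules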
Rule Induction Variants
Ordered Rule lists (decision lists) - naturally supports
multiple output classes
A=Green and B=Tall -> Class 1
A=Red and C=Fast -> Class 2
Else Class 1
Placing new rules at beginning or end of list
Unordered rule lists for each output class (must handle
multiple matches)
Rule induction can handle noise by no longer creating new
rules when gain is negligible or not statistically significant
Conclusion
Many new algorithms and approaches being proposed
Application areas rapidly increasing
Amount of available data and information growing
User desire for more adaptive and user-specific computer
interaction
This need for specific and adaptable user interaction will make
machine learning a more important tool in user interface
research and applications
Thank You!