Data Mining
COMP527: Data Mining
Dr Robert Sanderson
(azaroth@liv.ac.uk)
Dept. of Computer Science
University of Liverpool
2008
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text as Data
Text Mining: Text as Language
Revision for Exam
Prediction / Regression
Linear Regression
Logistic Regression
Support Vector Regression
Regression Trees
Classification tries to determine which class an instance belongs to,
based on the known classes of training instances: it generates a model
and applies it to new instances. The model can take many forms (rules,
a tree, a graph, vectors...). The output is the class which the new
instance is predicted to belong to.
Regression instead fits a formula to the data. As with SVM, the
formula can itself serve as the model used for classification: it
might learn the formula for the probability of a particular class from
0..1 and then return the most likely class.
For example, instead of determining that the weather will be 'hot',
'warm', 'cool' or 'cold', we may need to be able to say with some
degree of accuracy that it will be 25 degrees or 7.5 degrees, even if
7.5 never appeared in the temperature attribute of the training data.
Express the 'class' as a linear combination of the attributes with
determined weights, eg:
x = w₀ + w₁a₁ + w₂a₂ + ... + wₙaₙ
where wⱼ is a weight and aⱼ is an attribute value.
The predicted value for instance i is then found by putting the
attribute values of i into the appropriate a slots.
So we need to learn the weights that minimize the error between actual
value and predicted value across the training set.
(Sounds like Perceptron, right?)
To determine the weights, we try to minimize the sum of the squared
error across all the training instances:
∑ᵢ (xᵢ − ∑ₖ wₖaᵢₖ)²
where xᵢ is the actual value for instance i and the second half is the
predicted value, applying all k weights to the k attribute values of
instance i.
Simple case: Method of Least Squares
w = ∑ᵢ (xᵢ − avg(x))(yᵢ − avg(y)) / ∑ᵢ (xᵢ − avg(x))²
solves the simple case of y = b + wx, with b = avg(y) − w·avg(x).
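As a concrete illustration, a minimal sketch of the least squares
calculation in Python (numpy assumed; the data values are made up):

import numpy as np

def least_squares(x, y):
    # Fit y = b + w*x by the method of least squares.
    w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - w * x.mean()  # the line passes through (avg(x), avg(y))
    return b, w

# eg: noisy points around y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])
b, w = least_squares(x, y)
print(b, w)  # roughly 2 and 3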
We could apply a function to each attribute instead of just
multiplying by a weight.
For example:
x = c + f₁(a₁) + f₂(a₂) + ... + fₙ(aₙ)
where fⱼ is some function (eg square, log, square root, modulo 6,
etc).
Of course determining the appropriate function is a problem!
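A sketch of the idea in Python: guess a transformation per attribute,
then fit the remaining weights by least squares as before. The log and
square transforms below are illustrative guesses, not a recipe:

import numpy as np

a1 = np.array([1.0, 2.0, 4.0, 8.0])
a2 = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([1.5, 3.1, 7.4, 15.2])  # class values to fit

# design matrix: a column of 1s for c, then f1(a1) = log, f2(a2) = square
A = np.column_stack([np.ones_like(a1), np.log(a1), a2 ** 2])
coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
print(coeffs)  # c, plus a weight for each transformed attribute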
Instead of fitting the data to a straight line, we can try to fit it to a
logistic curve (a flat S shape).
This curve gives values between 0 and 1, and hence can be used
for probability.
We won't go into how to work out the coefficients, but the result has
the same form as the linear case:
x = c + w₁a₁ + w₂a₂ + ... + wₙaₙ
where x is then squashed through the logistic curve to give a
probability between 0 and 1.
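A small sketch of the prediction step (plain numpy; the coefficient
values are made up for illustration, in practice they are learned):

import numpy as np

def logistic(z):
    # The flat S-shaped curve: maps any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

c, w = -1.0, np.array([0.8, -0.3])  # illustrative coefficients
a = np.array([2.0, 1.0])            # one instance's attribute values

p = logistic(c + w @ a)             # probability of the class
print(p, "yes" if p > 0.5 else "no")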
We looked at the maximum margin hyperplane, which involved
learning a hyperplane to distinguish two classes. Could we learn
a prediction hyperplane in the same way?
That would allow the use of kernel functions for the nonlinear case.
Goal is to find a function that has at most ε deviation in prediction
from the training set, while being as flat as possible. This creates a
tube of width 2ε around the function. Points that do not fall within
the tube are support vectors.
Because we are also trying to flatten the function, bad choices for ε
can be problematic.
If ε is so big that the tube encloses all the points, the function
will simply find the mean. If ε is 0, then all instances are support
vectors. Too small and there will be too many support vectors; too
large and the function will be too flat to be useful.
We can replace the dot product in the regression equation with a
kernel function to perform nonlinear support vector regression:
x = b + ∑ᵢ αᵢ a⁽ⁱ⁾∙a
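In practice a library usually handles this. A minimal sketch with
scikit-learn (my choice of library, not part of the lecture): its SVR
class exposes the tube width as epsilon, and the kernel parameter
swaps the dot product for a kernel function:

import numpy as np
from sklearn.svm import SVR

# toy nonlinear data: y = sin(x) plus a little noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)

model = SVR(kernel="rbf", epsilon=0.1).fit(X, y)
print(model.predict([[1.5]]))  # prediction for a new instance
print(len(model.support_))     # how many support vectors were kept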
The problem with linear regression is that most data sets are not linear.
The problem with nonlinear regression is that it's even more
complicated!
Enter Regression Trees and Model Trees.
Idea: Use a tree structure (divide and conquer) to split up the
instances, such that we can more accurately apply a linear model to
only the instances that reach each leaf node.
So branches are normal decision tree tests, but instead of a class
value at the leaf, we have some way to predict or specify the value.
Regression Trees: the leaf nodes hold the average value of the
instances that reach them.
Model Trees: the leaf nodes hold a (linear) regression model to
predict the value of the instances that reach them.
So a regression tree is a model tree with a constant value at each leaf.
Issues to consider:
– Building
– Pruning / Smoothing
We know that we need to construct a tree, with a linear model at each
node and an attribute split at non-leaf nodes.
To split, we need to determine which attribute to split on, and where
to split it. (Remember that all attributes are numeric.)
Witten (p245) proposes Standard Deviation Reduction (SDR): treat the
standard deviation of the class values as a measure of the error at
the node, and maximise the reduction in that value for each split:
SDR = sd(T) − ∑ᵢ (|Tᵢ| / |T|) × sd(Tᵢ)
where T is the set of instances at the node and the Tᵢ are the subsets
produced by the split.
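A minimal sketch of the SDR calculation (numpy assumed; generating the
candidate splits is left to the caller):

import numpy as np

def sdr(parent, subsets):
    # Standard Deviation Reduction for one candidate split.
    # parent:  class values of all instances reaching the node
    # subsets: list of arrays, the class values falling into each branch
    n = len(parent)
    weighted = sum(len(s) / n * np.std(s) for s in subsets)
    return np.std(parent) - weighted

# eg: splitting cleanly into low and high halves gives a large reduction
print(sdr(np.array([1, 2, 3, 10, 11, 12]),
          [np.array([1, 2, 3]), np.array([10, 11, 12])]))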
It turns out that the value predicted at the bottom of the tree is generally
too coarse, probably because it was built against only a small subset of
the data.
We can fine-tune the value by building a linear model at each node
along with the regular split, and then sending the value from the leaf
back up the path to the root of the tree, combining it with the value
at each step:
p' = (np + kq) / (n + k)
where p' is the prediction passed up, p is the prediction passed to
this node, q is the value predicted by the model at this node, n is
the number of instances that reach the node below, and k is a
smoothing constant.
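The smoothing step itself is tiny; a sketch (the value k = 15 is an
assumption here, used only to make the example concrete):

def smooth(p, q, n, k=15):
    # Combine the prediction from below (p) with this node's own
    # model prediction (q), weighted by the n instances below.
    return (n * p + k * q) / (n + k)

# a value built on few instances is pulled towards this node's model
print(smooth(p=10.0, q=7.0, n=5))    # 5 instances below: mostly q
print(smooth(p=10.0, q=7.0, n=500))  # many instances below: mostly p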
Pruning can also be accomplished using the models built at each
node.
We can estimate the error at each node by taking the model's actual
error on the training set and multiplying by (n + v) / (n − v), where
n is the number of instances that reach the node and v is the number
of parameters in the linear model for the node.
We do this multiplication to avoid underestimating the error on new
data, since the raw figure is measured on the data the model was
trained against.
If the estimated error is lower at the parent, the leaf node can be
dropped.
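A sketch of that estimate (the function name is mine):

def estimated_error(train_error, n, v):
    # Inflate the training-set error to estimate error on new data.
    # n: instances reaching the node, v: parameters in its model.
    return train_error * (n + v) / (n - v)

# a model with many parameters relative to its data is penalised hard
print(estimated_error(2.0, n=20, v=3))   # 2.0 * 23/17, about 2.7
print(estimated_error(2.0, n=20, v=15))  # 2.0 * 35/5 = 14.0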
MakeTree(instances):
    SD = sd(instances)        # standard deviation of all class values
    root = new Node(instances)
    split(root)
    prune(root)

split(node):
    if len(node) < 4 or sd(node) < 0.05 * SD:
        node.type = LEAF
    else:
        node.type = INTERIOR
        foreach attribute a:
            foreach possibleSplitPosition s in a:
                calculateSDR(a, s)
        splitNode(node, maximumSDR)   # split on the best attribute/position
        split(node.left)
        split(node.right)

prune(node):
    if node.type == INTERIOR:
        prune(node.left)
        prune(node.right)
        node.model = new linearRegression(node)
        if subTreeError(node) > error(node):
            node.type = LEAF

subTreeError(node):
    if node.type == INTERIOR:
        return (len(node.left) * subTreeError(node.left) +
                len(node.right) * subTreeError(node.right)) / len(node)
    else:
        return error(node)
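For comparison with the pseudocode, a runnable regression tree using
scikit-learn (my choice of library; its CART-style DecisionTreeRegressor
stores a constant value at each leaf, like the regression trees above,
and does not build model trees):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(10 * rng.rand(80, 1), axis=0)
y = 2.0 * X.ravel() + rng.randn(80)

# min_samples_leaf=4 echoes the len(node) < 4 stopping rule above
tree = DecisionTreeRegressor(min_samples_leaf=4).fit(X, y)
print(tree.predict([[2.5]]))  # the average value of the leaf reached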
Some regression/model trees:
CHAID (Chi-squared Automatic Interaction Detector). 1980.
Can be used for either continuous or nominal classes.
CART (Classification And Regression Trees). 1984.
Uses entropy or Gini to choose the attribute, with a binary split
on the selected attribute.
M5: Quinlan's model tree inducer (of C4.5 fame). 1992.
● Introductory statistical text books, still!
● Witten, 3.7, 4.6, 6.5
● Dunham, 3.2, 4.2
● Han, 6.11