PARAMETER MODELLING / PARAMETER LEARNING
1. The data miner specifies the structure of the model, with certain numeric parameters left unspecified, e.g. a parameterized mathematical function or equation over a set of numeric attributes (the attributes used in the model are chosen based on domain knowledge about which attributes are informative in predicting the target variable); i.e. the data miner specifies the form of the model and the attributes.
2. Data mining then calculates the best parameter values given a particular set of training data. This is the goal: tune the parameters so that the model fits the data = parameter learning (PL).
Instance-space view: shows the space broken up into regions by decision boundaries; examples in each region should have similar target variables, and homogeneous regions help predict the target variable of a new, unseen instance.

LINEAR DISCRIMINANT FUNCTIONS
A linear split = a linear classifier, which is a weighted sum of the values of the various attributes.
This is a linear discriminant because it discriminates between the classes, and the function of the decision boundary is a linear combination (a weighted sum) of the attributes (y = mx + b, where y and x are attributes).
- 2D = line, 3D = plane, 4D+ = hyperplane
Using the value of class(x) or f(x) we are then able to classify a new instance.
The more general form of a linear discriminant function, a numeric classification model: f(x) = w0 + w1*x1 + w2*x2 + ...
Parameterized model: the weights of the linear function (wi) are the parameters.
When the weights change, the answer and the classification will also change accordingly.
This is supervised segmentation using a mathematical function of multiple attributes.
The weights' magnitudes are loosely interpreted as the importance of the features.
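A minimal numpy sketch of the general form above; the weights, attribute values, and the threshold at f(x) = 0 are illustrative assumptions, not values from the notes.

```python
import numpy as np

# Hypothetical parameters: w0 is the intercept, the rest weight two attributes.
# Their magnitudes loosely reflect how important each feature is.
w = np.array([-1.0, 0.5, 2.0])            # [w0, w1, w2]

def f(x):
    """General linear discriminant: f(x) = w0 + w1*x1 + w2*x2."""
    return w[0] + np.dot(w[1:], x)

x_new = np.array([1.2, 0.8])              # a new, unseen instance
print(f(x_new))                           # approx. 1.2 -> positive side of the boundary
print("class +" if f(x_new) > 0 else "class -")
```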
OBJECTIVE FUNCTION FOR LINEAR DISCRIMINANTS - WHAT PARAMETERS SHOULD WE USE?
Define an objective function (loss function) that represents our goal and can be calculated for a particular set of weights and a particular set of data; then find the optimal values for the weights by maximizing or minimizing the objective function.
A loss function also determines how much penalty should be assigned to an instance based on the error in the model's predicted value. Different loss functions will result in different answers => we seek to minimise errors.
***The weights are "best" only if we believe that the objective function truly represents what we want to achieve (faith and experience).

LINEAR DISCRIMINANT FUNCTIONS FOR SCORING & RANKING INSTANCES
We need a score that will rank cases by the likelihood of belonging to one class or the other.
f(x) will be zero when x is sitting on the decision boundary (such points are essentially randomly classified, 50:50); f(x) will be relatively small when x is near the boundary (these points are sensitive to the position of the line; highly uncertain); and f(x) will be large (and positive) when x is far from the boundary in the positive direction.
f(x) therefore gives a ranking of the instances by their (estimated) likelihood of belonging to the class of interest => but it is not a probability estimate.
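Extending the earlier sketch (same hypothetical weights): scoring several made-up instances and ranking them by f(x), which orders them by estimated likelihood of class membership without being a probability.

```python
import numpy as np

w = np.array([-1.0, 0.5, 2.0])                     # same hypothetical weights as before
X = np.array([[1.2, 0.8],                          # far on the positive side
              [2.1, 0.45],                         # almost on the boundary
              [0.0, 0.1]])                         # on the negative side

scores = w[0] + X @ w[1:]                          # f(x) for every instance
print(scores)               # approx. [ 1.2, -0.05, -0.8 ]: a ranking, not probabilities
print(np.argsort(-scores))  # [0 1 2], highest score first
```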
LINEAR DISCRIMINANT #1: SUPPORT VECTOR MACHINES (SVM)
Classify instances based on a linear function of the features.
Objective function: maximise the margin = the fattest bar that fits between the classes.
Once the widest bar is found, the linear discriminant will be the center line through the bar.
The margin-maximizing boundary gives the maximal leeway for classifying new points that may fall in the margin.
If the data are not linearly separable, the best fit is some balance between a fat margin and a low total error penalty. The penalty for a misclassified point is proportional to its distance from the margin boundary, so if possible the SVM will make only "small" errors.
Objective fn = Max(margin - total penalty), where each misclassified point contributes a penalty proportional to |distance(point, margin boundary)|.
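One standard way to write that trade-off is the soft-margin (hinge-loss) objective. The sketch below is a hedged numpy illustration with made-up data and an arbitrary penalty weight C, not the notes' exact formulation.

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM objective (to minimize):
    0.5*||w||^2 (a wider margin corresponds to a smaller ||w||)
    + C * sum of hinge losses (penalty grows with distance past the margin)."""
    margins = y * (X @ w + b)                  # y must be coded as -1 / +1
    hinge = np.maximum(0.0, 1.0 - margins)     # 0 if correctly outside the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y))   # separable here: no hinge penalty
```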
LINEAR REGRESSION
Standard linear regression procedures minimize the sum or mean of the squares of these errors, which gives the procedure its common name: least squares regression.
2 types of objective functions: 1. sum of absolute errors (more robust); 2. sum of squared errors (more convenient mathematically).
SHORTCOMINGS OF LEAST SQUARES REGRESSION: a serious drawback is that it is very sensitive to the data: erroneous or otherwise outlying data points can severely skew the resultant linear function, because large errors are penalized heavily.
*For systems that build and apply models automatically, the modelling needs to be more robust. **Choose the objective function to optimize, and do so with the ultimate business application in mind.
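A small illustration of that sensitivity, assuming numpy: both objectives are minimized here by a crude grid search over made-up data with one injected outlier (a real least-squares fit would use the closed-form solution instead).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[-1] += 30.0                                   # one erroneous / outlying point

def sse(params):                                # sum of squared errors
    m, b = params
    return np.sum((y - (m * x + b)) ** 2)

def sae(params):                                # sum of absolute errors
    m, b = params
    return np.sum(np.abs(y - (m * x + b)))

grid = [(m, b) for m in np.linspace(0, 4, 81) for b in np.linspace(-5, 5, 81)]
print("least squares fit :", min(grid, key=sse))   # slope pulled up by the outlier
print("least abs-error fit:", min(grid, key=sae))  # stays close to the true slope of 2
```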
LINEAR DISCRIMINANT #2: LOGISTIC REGRESSION
Logit gives accurate estimates of class probability (it estimates the probability that a new instance belongs to the class of interest).
Logit = a model for the probability of class membership; it is a class probability estimation model and not a regression model; usually used for binary problems (Y/N).
***It uses the same linear function f(x) (the magnitude of f(x) is used as a proxy for the probability), but given that probability ranges from 0 to 1, we need to convert the f(x) distance into a probability => we use f(x) as a measure of the log-odds of the event of interest => f(x) is an estimate of the log-odds that x belongs to the positive class.

LIKELIHOOD
The odds of an event = p(occurring) : p(not occurring). So, for example, if the event has an 80% probability of occurrence, the odds are 80:20 or 4:1. 90% is 90:10, i.e. odds of 9, and ln(9) = 2.19. A probability of 0.001 is 1:999, i.e. odds of about 0.001001, and ln = -6.9.
f(x) = ln( p(occurring) / p(not occurring) )
p+(x) represents the model's estimate of the probability of class membership of a data item represented by feature vector x.
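A worked version of those numbers (numpy assumed): converting probabilities to odds and log-odds, and inverting the log-odds back into a probability with the logistic function.

```python
import numpy as np

def log_odds(p):
    """ln(p / (1 - p)): the quantity the linear function f(x) estimates."""
    return np.log(p / (1 - p))

def prob_from_log_odds(z):
    """Invert the log-odds with the logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

for p in (0.8, 0.9, 0.001):
    print(p, "odds:", round(p / (1 - p), 6), "log-odds:", round(log_odds(p), 2))
# 0.8   -> odds 4.0       log-odds  1.39
# 0.9   -> odds 9.0       log-odds  2.2
# 0.001 -> odds 0.001001  log-odds -6.91

print(prob_from_log_odds(2.1972))    # approx. 0.9: recovering the probability from f(x)
```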
To summarise the Logit model: 1) for probability estimation, Logit uses the same linear model as linear discriminants use for classification and linear regression uses for estimating numeric target values; 2) the output of the Logit model is interpreted as the log-odds of class membership; 3) these log-odds can be translated directly into the probability of class membership; 4) we fit ln(odds) with the linear function of the attributes by optimising the parameters.

WHAT'S THE OBJECTIVE FUNCTION?
Consider a function g that computes the likelihood that a particular labeled example belongs to its correct class, given a set of parameters w that produces class probability estimates: g is p+(x) for a positive example and 1 - p+(x) for a negative example. We would like p+(x+) = 1 and p+(x-) = 0. The g function gives the model's estimated probability of seeing x's actual class given x's features.
The model (set of weights w) that gives the highest product of g over the training examples is the model that gives the highest likelihood to the data: the maximum likelihood model. The maximum likelihood model on average gives the highest probabilities to the positive examples and the lowest probabilities to the negative examples.
The target variable itself is not a representation of the probability of class membership. The training data comprise a set of statistical draws from the underlying probabilities, rather than representing the underlying probabilities themselves; the Logit tries to estimate those probabilities with a linear-log-odds model based on the observed data.
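A hedged numpy sketch of that likelihood objective: g is evaluated for each (made-up) training example and the log of the product, the log-likelihood, is summed; the example weight vectors are arbitrary.

```python
import numpy as np

def p_plus(X, w, b):
    """Model's estimated probability that each instance belongs to the positive class."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def log_likelihood(X, y, w, b):
    """Sum of log g(x, w): log p+(x) for positives, log(1 - p+(x)) for negatives.
    The maximum-likelihood weights are the ones that maximize this sum."""
    p = p_plus(X, w, b)
    g = np.where(y == 1, p, 1.0 - p)
    return np.sum(np.log(g))

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, 0, 0])
print(log_likelihood(X, y, np.array([1.0, 1.0]), 0.0))  # higher (better) likelihood ...
print(log_likelihood(X, y, np.array([0.0, 0.0]), 0.0))  # ... than an uninformative model
```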
Logit =/= regression: the way to distinguish classification from regression is via the target variable - is it categorical or numerical? **In logistic regression the model produces a numeric estimate, but the target values are still categorical.

THE DIFFERENCE BETWEEN LOG REG & TREE INDUCTION
A classification tree uses decision boundaries that are perpendicular to the instance-space axes; a linear classifier can use decision boundaries of any direction or orientation.
Classification trees select a single attribute at a time, whereas linear classifiers use a weighted combination of all attributes.
A classification tree is a piecewise classifier that segments the instance space recursively using a divide-and-conquer approach, and it can cut the space up arbitrarily finely into very small regions. A linear classifier places a single decision surface through the entire space; it has great freedom in the orientation of the surface, but it is limited to a single division into 2 segments (a linear equation), uses all of the variables, and must fit the entire data space.
Accuracy & comprehensibility will both be important factors in deciding which method to use: Logit model = an audience with a strong background in statistics; a decision tree, if it is not too large = an audience without a strong statistics or mathematics background.

GENERALISATION PERFORMANCE
After determining that (say) the classification tree is the better method (choose the higher mean accuracy and the lower standard deviation), you will then use the entire data set to build the model.
1 Different modeling procedures = different performance on the same data.
2 Different training sets = different generalisation performance.
3 Different test sets may = different estimates of the generalisation performance.
4 A change in training set size = different generalisation performance for the resultant model.

LEARNING CURVES
Generalization performance generally improves as more training data become available, up to a point (the learning curve is steep initially; the marginal advantage of having more data decreases). A plot of the generalization performance against the amount of training data is called a learning curve.
Difference between a learning curve and a fitting curve: a learning curve shows the generalization performance (performance only on testing data), plotted against the amount of training data used. A fitting graph shows the generalization performance & the performance on the training data, but plotted against model complexity (for a fixed amount of training data).
For smaller training-set sizes, Logit yields better generalization accuracy than tree induction; as the training sets get larger, tree induction soon becomes more accurate.
Tree induction: overfits more; better for larger datasets; a tree can represent nonlinear relationships between the features and the target; it is more flexible, leading it to overfit more with small data but to model more complex regularities with larger training sets.
Logit: performs better for smaller datasets; since logistic regression is only a line, it cannot capture everything; it has less flexibility, which allows it to overfit less with small data but keeps it from modeling the full complexity of the data.
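A minimal sketch of how points on two learning curves could be computed, assuming scikit-learn is available; the synthetic dataset, model settings, and training-set sizes are illustrative choices, not from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for n in (50, 200, 1000, 3000):            # growing amounts of training data
    logit = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
    print(n, round(logit.score(X_te, y_te), 3), round(tree.score(X_te, y_te), 3))
# Plotting these generalization accuracies against n gives the two learning curves.
```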
WHAT IS OVERFITTING?
Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data. Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points => we want models that apply to the general population.
Generalization is the property of a model or modeling process whereby the model applies to data that were not used to build the model.
A model that is 100% accurate on the training data = a table model => for a customer who is not part of the historical dataset, the lookup will fail since there will be no exact match (no generalisation = memorising the training data).

HOW TO RESOLVE OVERFITTING?
All data mining procedures have the tendency to overfit to some extent, some more than others. There is no single choice or procedure that will eliminate overfitting. The best strategy is to recognize overfitting and to manage complexity in a principled way.
There is a trade-off between model complexity (which lets the model better capture the real complexities and be more accurate) and the possibility of overfitting.

HOW TO IDENTIFY OVERFITTING?
HOLDOUT DATA: hold out some data (a test set, a "lab test") for which we know the target variable but which is not used to build the model. The model is then used to predict the values of the holdout data, and this helps estimate the generalization performance by comparing the predicted values with the hidden true values.
FITTING GRAPH: a fitting graph shows the accuracy of a model as a function of complexity. There is more overfitting as the complexity increases. If the model is not allowed to be complex enough -> it is not very accurate. When models get too complex -> they look very accurate on the training data, but they are overfitting -> the training accuracy diverges from the holdout (generalization) accuracy.
Base error rate = the % of churn cases in the population. (A corresponding baseline for a regression model is a simple model that always predicts the mean or median value of the target variable.)
Complexity: logistic regression = the number of attributes; classification tree = the number of nodes (not leaves).
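A hedged fitting-graph sketch (scikit-learn assumed, synthetic noisy data): training accuracy keeps rising with tree complexity while holdout accuracy flattens or drops, and the divergence signals overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 2, 4, 8, 16, None):                  # complexity = allowed tree depth
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, round(m.score(X_tr, y_tr), 3), round(m.score(X_ho, y_ho), 3))
```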
OVERFITTING IN TREE INDUCTION
Tree induction finds important, predictive individual attributes recursively (dividing a parent node into child nodes to get purer subsets). If you keep splitting on attributes, you are eventually left with a single instance at each leaf node, which is pure => in effect a lookup table, whereas a model should give a nontrivial classification (do better than a lookup table). Generally, a procedure that grows trees until the leaves are pure tends to overfit: allowed to grow without bound, decision trees can fit any data to arbitrary precision.
The sweet spot represents the best trade-off between the extremes of (i) not splitting the data at all (using the average target value) and (ii) building a complete tree out until the leaves are pure.

OVERFITTING IN MATHEMATICAL FUNCTIONS
#1: More complexity: adding more variables (more attributes); adding non-linear terms (which increases flexibility). #2: As you increase the dimensionality via more attributes, you can perfectly fit larger and larger sets of arbitrary points. => To avoid overfitting: manual or automatic attribute selection.
Comparing SVM to Logit: Logit appears to overfit more; the SVM tends to be less sensitive to individual examples, and the SVM procedure incorporates complexity control.

AVOIDING OVERFITTING
1. Tree Induction
Stop growing a tree before it gets too complex (see the sketch after this list):
1) The simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf (the minimum-instance stopping criterion). Using the data at the leaf to make a statistical estimate of the value of the target variable on a very small subset of data is inaccurate; the minimum amount is determined based on experience.
2) Conduct a hypothesis test at every leaf to determine whether the observed difference in information gain could have been due to chance. If the hypothesis test concludes that it was likely not due to chance, then the split is accepted.
Prune back a tree that is too large:
Pruning means cutting off leaves and branches and replacing them with leaves. Estimate whether replacing a set of leaves or a branch with a leaf would reduce accuracy; if not, then go ahead and prune.
Alternatively, build trees of varying sizes and then compare, using CV, which of the trees perform better.
2. Linear Models
Regularisation: optimise fit and simplicity together. Complexity control via regularization works by adding to the objective function a penalty for complexity (depending on whether the objective function is being maximized or minimized, the sign of the penalty has to change accordingly), e.g. maximize [likelihood (or accuracy) of the fit - lambda * penalty(w)]. The lambda term is simply a weight that determines how much importance the optimization procedure should place on the penalty, compared to the data fit. So when you increase the number (and size) of the parameters, the penalty grows and is harmful to the objective function. We can choose lambda using (nested) cross-validation; that lambda is then used to learn a regularized model on all the training data.
There are different types of penalties that can be applied:
- The most commonly used penalty is the sum of the squares of the weights, sometimes called the L2-norm of w. L2-norm + standard least squares linear regression = ridge regression. Functions can fit data better if they are allowed to have very large positive and negative weights; the sum of the squares of the weights gives a large penalty when weights have large absolute values.
- The sum of the absolute values of the weights (rather than the squares) is known as the L1-norm. L1-norm + standard least squares linear regression = the lasso. It ends up zeroing out many coefficients; since these coefficients are the multiplicative weights on the features, L1-regularization effectively performs an automatic form of feature selection.
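A ridge-vs-lasso sketch, assuming scikit-learn; the synthetic data have two informative features and eight irrelevant ones, and the alpha values (scikit-learn's name for the penalty weight lambda) are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # 10 features, only 2 actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)             # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty
print(np.round(ridge.coef_, 2))   # all coefficients shrunk, none exactly zero
print(np.round(lasso.coef_, 2))   # most irrelevant coefficients driven to exactly 0
```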
CROSS VALIDATION
Cross-validation is a more sophisticated holdout training and testing procedure => 1) it gives statistics such as the mean and variance of the accuracy, which help us understand how the performance is expected to vary across datasets; 2) it makes better use of a limited dataset (it computes its estimates over all the data); 3) the purpose of CV is to use the original labeled data efficiently to estimate the performance of a modeling procedure.
How does it work?
1 Cross-validation begins by splitting a labeled dataset into k partitions called folds.
2 Typically, k will be five or ten. These folds are created randomly.
3 Cross-validation then iterates training and testing k times, in a particular way.
4 In each iteration, one fold is held out for testing and the other k-1 folds are combined to form the training data.
So, in each iteration we have (k-1)/k of the data used for training and 1/k used for testing. Each iteration produces one model, and thereby one estimate of generalization performance, for example one estimate of accuracy.
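A minimal k-fold cross-validation loop (scikit-learn assumed, k = 5, synthetic data); it reports the mean and standard deviation of the fold accuracies mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

accs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))   # 1/k of the data held out each time
print(round(np.mean(accs), 3), round(np.std(accs), 3))
```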
Similarity and Distance
The similarity between objects can be measured by the distance between the objects: data are represented as feature vectors, hence the closer two objects are, the more similar they are.

NEAREST NEIGHBOUR REASONING
The most similar instances are called nearest neighbours.
Heuristic for nearest neighbours: given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar to the new example (this step is still unsupervised). Then we predict the new example's target value based on the nearest neighbors' (known) target values (this makes it supervised).
Combining functions --> prediction. Classification: highest count / majority vote; regression: mean or median of the neighbours. Note that in retrieving neighbors we do not use the target variable, because we are trying to predict it -> it will not enter into the distance calculation.
CLASSIFICATION: calculate the distances from the new instance and find the closest neighbours; the k nearest neighbours are retrieved and their known target variables (classes) are consulted; decide on the new entry's predicted target variable via a majority vote.
PROBABILITY ESTIMATION: it is usually important not just to classify a new example but to estimate its probability, i.e. to assign a score to it; a score gives more information than just a Yes/No decision. Probability = (number of the k neighbours who said "Yes") / (number who said "Yes" + number who said "No"). You might want to use more than just three nearest neighbors to compute the probability estimates.
REGRESSION: once we can retrieve the nearest neighbors, we can use the average (about 42 in the worked example) or the median (40) of their target values.
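A bare-bones nearest-neighbour sketch in numpy with toy data and k = 3, covering the classification vote and the probability estimate; the points and labels are invented.

```python
import numpy as np

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([1, 1, 0, 0, 0])                  # known classes of the neighbours
x_new = np.array([1.1, 1.0])

d = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))    # Euclidean distances
nn = np.argsort(d)[:3]                               # indices of the 3 nearest neighbours
votes = y_train[nn]
print("predicted class:", np.bincount(votes).argmax())   # majority vote
print("P(class 1):", round(votes.mean(), 2))             # 2 of the 3 neighbours say "yes"
```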
HOW TO DECIDE WHICH K TO USE?
The main parameter here is k; using different k's, you can compare the performance (using CV) to decide which k you want to use.
Odd numbers are convenient for breaking ties in majority-vote classification.
Too small a k = overfitting (1-NN effectively shades each pixel of the instance-space plot according to its single nearest training point); the greater k is, the more the estimates are smoothed out among the neighbors. Increase k to the maximum possible (so that k = n) and the entire dataset is used for every prediction: 1) for classification, predict the majority class in the entire dataset; 2) for regression, the average of all the target values; 3) for class probability estimation, the base rate probability.

SOME ARE NEARER THAN OTHERS?
Nearest-neighbor methods often use weighted voting or similarity-moderated voting, such that each neighbor's contribution is scaled by its similarity. **We can also use all neighbors but scale each by its similarity (scaling weight = 1 / square of the distance; the contributions are proportional to the weights and add up to one) => a weighted probability (see the sketch after this section).
Benefits of using weighted scoring: it reduces the importance of deciding how many neighbors to use => we can avoid committing to a k by retrieving a very large number of instances (e.g., all instances, k = n) and depending upon distance weighting to moderate the influences.

COMPLEXITY CONTROL
Nearest-neighbour classifiers allow irregular concept boundaries; they do not impose any particular geometric form on the classifier and form boundaries in instance space tailored to the specific training data.
1-NN overfits strongly => when classifying each training data point, the retrieval of that training point itself as its own nearest neighbor means its own value for the target variable is used to predict itself --> perfect (training) classification. An n-NN model, at the other extreme, does not allow much complexity at all in the model.

ISSUES WITH NEAREST NEIGHBOUR METHODS AS PREDICTIVE MODELS
Intelligibility: there are 2 aspects to it: the justification of a specific decision and the intelligibility of an entire model (what is difficult is to explain more deeply what knowledge has been mined from the data) = kNN methods lack specific decision models.
Dimensionality and domain knowledge: 1) numeric attributes may have vastly different ranges; 2) there is a problem with having too many attributes, or many that are irrelevant to the similarity judgment: the curse of dimensionality = incidental information => instance similarity can be confused and misled by the presence of many irrelevant attributes. To fix this problem: feature selection, the judicious determination of the features that should be included in the data mining model => done manually by the data miner, using background knowledge about which attributes are relevant. One can also inject domain knowledge into the similarity calculation by tuning the similarity/distance function manually => assigning different weights to the different attributes.
Computational efficiency: training is very fast because it usually involves only storing the instances; the main computational cost of a nearest neighbor method is borne by the prediction/classification step, when the database must be queried to find the nearest neighbors of a new instance ==> this is expensive and might take a long while.
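A sketch of the similarity-moderated voting described above, in which every training instance contributes, weighted by 1/distance^2; the data are the same invented points as the earlier nearest-neighbour sketch.

```python
import numpy as np

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([1, 1, 0, 0, 0])
x_new = np.array([1.1, 1.0])

d2 = ((X_train - x_new) ** 2).sum(axis=1)     # squared distances (assumed nonzero here)
w = 1.0 / d2                                  # closer neighbours weigh more
w = w / w.sum()                               # contributions add up to one
print("weighted P(class 1):", round(float(np.sum(w * y_train)), 3))
```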
CLUSTERING
The idea of finding natural groupings in the data may be called unsupervised segmentation, or more simply clustering. => Unsupervised modeling does not focus on a target variable. => Objects within a group are similar to each other, but objects in different groups are not similar.
2 main purposes: 1. discovery of overall distribution patterns and interesting correlations among the data attributes; 2. data reduction: cluster to get one group for applications.
Requirements of clustering: for small, low-dimensional data sets with numerical attributes, clustering algorithms work well; large data sets need to deal with 1) scalability, 2) different types of attributes (categorical, numerical), 3) the fact that not all clusters are shaped like a circle, 4) the domain knowledge required to determine input parameters and evaluate results, 5) noisy data (outliers have a large impact on clustering; we also want insensitivity to the order of the input records), 6) high dimensionality, and 7) interpretability and usability.
Choice of clustering technique: depends on the type of data available and on the particular purpose and application (the choice depends on domain knowledge).
A common method for focusing on the clusters themselves is to represent each cluster by its cluster center, or centroid = not necessarily one of the instances; it is the geometric center of a group of instances.

1. Hierarchical Clustering
a. Using Euclidean distance renders points more similar to each other if they are closer to each other in the plane.
Circles chart: circles labeled 1-5 are placed over the points to indicate clusters. It is a clustering because it groups the points by their similarity. Notice that the only overlap between clusters is when one cluster contains other clusters. The circles actually represent a hierarchy of clusterings: the most general (highest-level) clustering is just the single cluster that contains everything (cluster 5 in the example); the lowest-level clustering is when we remove all the circles, and the points themselves are six (trivial) clusters.
Dendrogram: shows explicitly the hierarchy of the clusters. The numbers at the joins of the dendrogram correspond to the numbered circles. It creates a collection of ways to group the points --> clipping the dendrogram allows the data analyst to see the groupings before deciding on the number of clusters. Because the y-axis represents the distance between clusters, the dendrogram can also give an idea of where natural clusters occur and reveal outliers.
Hierarchical clusterings generally are formed by starting with each node as its own cluster; clusters are then merged iteratively until only a single cluster remains.
The distance function between clusters is called the linkage function, e.g. the Euclidean distance between the closest points in each of the clusters.
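A small hierarchical-clustering sketch, assuming SciPy is available; single linkage corresponds to "the Euclidean distance between the closest points in each cluster", and the two blobs of points are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

Z = linkage(points, method="single", metric="euclidean")   # the merge history
labels = fcluster(Z, t=2, criterion="maxclust")             # "clip" the tree at 2 clusters
print(labels)   # scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram itself
```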
CLUSTERING: NEAREST NEIGHBOURS (K-MEANS)
Input: the number of clusters k and a database containing n objects. Output: a set of k clusters that minimizes the squared-error criterion.
The k-means algorithm (see the sketch after this section):
- The algorithm starts by creating k initial cluster centers, usually randomly, but sometimes by choosing k of the actual data points, by being given specific initial starting points by the user, or via a pre-processing of the data to determine a good set of starting centers.
- The clusters corresponding to these cluster centers are formed by determining which is the closest center to each point.
- Next, for each of these clusters, its center is recalculated by finding the actual centroid of the points in the cluster.
- The process simply iterates: since the cluster centers have shifted, we need to recalculate which points belong to each cluster. Once these are reassigned, we might have to shift the cluster centers again.
- The k-means procedure keeps iterating until there is no change in the clusters (or possibly until some other stopping criterion is met).
Evaluation of k-means:
- There is no guarantee that a single run of the k-means algorithm will result in a good clustering: a single clustering run will often terminate at a local optimum, and the result depends on the initial centroid locations (-> random initial partitioning).
- The k-means algorithm is scalable and efficient on large data sets: it is generally relatively fast, because it only computes the distances between each data point and the cluster centers on each iteration. Hierarchical clustering is generally slower, as it needs to know the distances between all pairs of clusters on each iteration, which at the start is all pairs of data points.
- It works well when the clusters are compact clouds that are well separated.
- The mean must be defined (so it is not suitable for categorical attributes), and it is sensitive to noisy and outlier data (-> the mean calculation).
How to determine a good value for k? One answer is simply to experiment with different k values and see which ones generate good results (no CV). The value of k can be decreased if some clusters are too small and overly specific, and increased if some clusters are too broad and diffuse. k-means is often used for exploratory data mining.

K-MEDOIDS
Medoid: the most centrally located data point in a cluster. Algorithm: find k clusters in the n objects by first arbitrarily determining a representative object (medoid) for each cluster; each remaining object is clustered with the medoid to which it is the most similar; the medoids are then iteratively replaced by one of the non-medoids as long as the quality of the clustering is improved (e.g. the actual squared error E is reduced). ***It uses the same distance measure as k-means.
Compared to k-means: it is more robust to noise and outliers, but the processing is more costly than k-means.

K-MEDIAN
Use the Manhattan distance and the median instead of the mean.
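A compact k-means sketch in numpy following the steps above: k actual data points as initial centers, assignment to the closest center, centroid recomputation, and iteration until nothing changes. The data and k are made up, and the empty-cluster corner case is ignored for brevity.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k actual points as seeds
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                            # closest center for each point
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # no change in the clusters -> stop
            break
        centers = new_centers
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
centers, assign = kmeans(X, k=2)
print(np.round(centers, 2))     # roughly the centroids of the two synthetic blobs
```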
Plain Accuracy (1 - Error Rate)
Classification accuracy is popular (it is very easy to measure), but it is usually too simplistic. Accuracy is a common evaluation metric that is often used in data mining studies because it reduces classifier performance to a single number and it is very easy to measure. (The example counts used in the notes: T = 110 instances in total, P = 61 actual positives.)
We need a way to decompose and count the different types of correct and incorrect decisions made by a classifier ==> the CONFUSION MATRIX.
Confusion matrix: the combinations of the class we predict and the actual class; a type of contingency table. A confusion matrix for a problem involving n classes is an n x n matrix with the columns labeled by the actual classes and the rows by the predicted classes.
A confusion matrix separates out the decisions made by the classifier, making explicit how one class is being confused for another => different sorts of errors may be dealt with separately = false positives and false negatives.
A bad outcome = the positive example [alarm!]; a good outcome = the negative example [uninteresting].

Problems with Unbalanced Classes
Consider a classification problem where one class is rare ==> classifiers are often used to sift through a large population of normal or uninteresting entities; because the unusual or interesting class is rare among the general population, the class distribution is unbalanced or skewed. As the class distribution becomes more skewed, evaluation based on accuracy breaks down. THEREFORE: we need more information about the data.

PROBLEMS WITH UNEQUAL COSTS AND BENEFITS
A problem with simple classification accuracy as a metric is that it makes no distinction between false positive and false negative errors = it makes the tacit assumption that both errors are equally important. We should estimate the cost or benefit of each decision a classifier can make. Once aggregated, these will produce an expected profit (or expected benefit or expected cost) estimate for the classifier.

USING APPROPRIATE METRICS
Mean-squared error (which measures the accuracy of a regression model) is computed on the value of the target variable, which is e.g. the number of stars that a user would give as a rating for a movie.

EXPECTED VALUE
The expected value framework decomposes data-analytic thinking into (i) the structure of the problem, (ii) the elements of the analysis that can be extracted from the data, and (iii) the elements of the analysis that need to be acquired from other sources.
Method: expected value calculation. The possible outcomes of a situation are enumerated, and the expected value is then the weighted average of the outcome values, where the weight given to each value is its probability of occurrence: EV = p(o1)*v(o1) + p(o2)*v(o2) + ... **Each oi is a possible decision outcome; p(oi) is its probability and v(oi) is its value. The probabilities often can be estimated from the data (ii), but the business values often need to be acquired from other sources (iii).
ERROR RATES: these probabilities (of errors and correct decisions) can be estimated from the confusion matrix: each oi corresponds to one of the possible (predicted, actual) combinations.
COSTS & BENEFITS: while the probabilities can be estimated from data, the costs and benefits often cannot. A cost-benefit matrix specifies, for each (predicted, actual) pair, the cost or benefit of making such a decision.
Specifically, to calculate the expected benefit (or cost) of targeting consumer x: expected benefit = p(R | x) * v(R) + [1 - p(R | x)] * v(NR), where p(R | x) is the estimated probability that x responds and v(R), v(NR) are the values (benefits or costs) of a response and of no response. Using this equation, we can now compute and compare the expected profits for various models and other targeting strategies.
The class priors, p(p) and p(n), specify the likelihood of seeing positive and negative instances, respectively; factoring them out allows us to separate the influence of class imbalance from the predictive power of the model.
***What we care about is, in aggregate, how well each model does: what is its expected value.
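A small expected-value sketch (numpy assumed): the outcome probabilities p(oi) are taken from an illustrative confusion matrix whose counts match the T = 110, P = 61 totals above, while the benefit/cost values in the cost-benefit matrix are invented business numbers, not from the notes.

```python
import numpy as np

confusion = np.array([[56, 7],        # rows = predicted p/n, columns = actual p/n
                      [5, 42]])       # 56 TP, 7 FP, 5 FN, 42 TN (110 total, 61 positives)
value = np.array([[99.0, -1.0],       # benefit of a TP, cost of a FP (hypothetical)
                  [0.0, 0.0]])        # FN and TN assumed to be worth nothing here

p_outcome = confusion / confusion.sum()          # p(o_i) estimated from the counts
expected_profit = np.sum(p_outcome * value)      # weighted average of the outcome values
print(round(expected_profit, 2))
```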
