Contents :
Data Mining
Next Generation Data Mining Techniques
What is Data Mining? Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.
What is a Decision Tree? A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification.
A decision tree consists of 3 types of nodes:1. Decision nodes - commonly represented by squares 2. Chance nodes - represented by circles 3. End nodes - represented by triangles
Example :
Rule Induction
Rule induction is one of the major forms of data mining
and is perhaps the most common form of knowledge discovery in unsupervised learning systems. In Rule induction all possible patterns are systematically pulled out of the data and then an accuracy and significance are added to them that tell the user how strong the pattern is and how likely it is to occur again.
What is a Rule?
In rule induction systems the rule itself is of a simple form
of if this and this and this then this. For example a rule that a supermarket might find in their data collected from scanners would be:
if pickles are purchased then ketchup is purchased. Or If paper plates then plastic forks
Accuracy - How often is the rule correct? Coverage - How often does the rule apply?
What is a Rule?
In some cases accuracy is called the confidence of the rule
and coverage is called the support. Accuracy and coverage appear to be the preferred ways of naming these two measurements.
Rule If breakfast cereal purchased then milk purchased. Accuracy 85% Coverage 20%
15%
6%
The left hand side is called the antecedent and the right
can be used either for understanding better the business problems that the data reflects or for performing actual predictions against some predefined prediction target.
Target the antecedent.
Target the consequent. Target based on accuracy.
Discovery
These systems provide both a very detailed view of the
data where significant patterns Target the antecedent. These systems thus display a nice combination of both micro and macro views:
Macro Level - Patterns that cover many situations are
provided to the user that can be used very often and with great confidence and can also be used to summarize the database. Micro Level - Strong rules that cover only a very few situations can still be retrieved by the system and proposed to the end user
Discovery
These systems provide both a very detailed view of the
data where significant patterns Target the antecedent. These systems thus display a nice combination of both micro and macro views:
Macro Level - Patterns that cover many situations are
provided to the user that can be used very often and with great confidence and can also be used to summarize the database. Micro Level - Strong rules that cover only a very few situations can still be retrieved by the system and proposed to the end user
Prediction
Each rule by itself can perform prediction - the
consequent is the target and the accuracy of the rule is the accuracy of the prediction. But because rule induction systems produce many rules for a given antecedent or consequent there can be conflicting predictions with different accuracies. This can be done in a variety of ways by summing the accuracies as if they were weights or just by taking the prediction of the rule with the maximum accuracy.
Prediction
Antecedent bread butter eggs cheese Consequent milk milk milk milk Accuracy 35% 65% 35% 40% Coverage 30% 20% 15% 8%
they imply that there is useful predictive information in the database that can be exploited - namely that there is something far from independent between the antecedent and the consequent.
Accuracy Low Accuracy High Rule is often correct and can be used often. Rule is often correct but can be only rarely used. Rule is rarely correct but can be used often. Rule is rarely correct and can be only rarely used.
For instance you may have a rule that is 100% accurate but is only
applicable in 1 out of every 100,000 shopping baskets. You can rearrange your shelf space to take advantage of this .