David W. Miller Semantic Web Spring 2002 Department of Computer Science University of Georgia
www.cs.uga.edu/~miller/SemWeb
Query to General-Purpose Search Engine: +camp +basketball north carolina two weeks
Tough Tasks
Find pages that belong in the search engine.
Supervised learning
1. Expert labels example texts with classes
2. Machine learning algorithm produces a rule that tends to agree with the expert's classifications
Machine Learning for Text Classification, David D. Lewis, AT&T Labs
[Diagram: text representation → model induction → profiles/rules]
Example word-count vector for a document:

a:35  block:1  computer:12  field:4  leg:1  machine:7  of:44  paper:3
perspective:2  rate:1  reinforcement:5  science:9  survey:2  the:56
this:11  underrated:1
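A word-count vector like the one above can be built directly from a document's text. A minimal sketch (the sample sentence is made up for illustration):

```python
from collections import Counter

def bag_of_words(text):
    """Map each word in the text to its occurrence count."""
    return Counter(text.lower().split())

counts = bag_of_words("the block is a block of paper the the")
print(counts["the"])    # 3
print(counts["block"])  # 2
```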
Bayes Method
Pick the most probable class, given the evidence (the document d):

    c* = argmax_j Pr(c_j | d)

Bayes Rule:

    Pr(c_j | d) = Pr(d | c_j) Pr(c_j) / Pr(d)
Bayes Method
Generates conditional probabilities that particular words occur in a document, given that it belongs to a particular category.
A larger vocabulary generates better probability estimates.
Each category is given a threshold p against which it judges whether a document falls into that classification.
Documents may fall into one category, more than one, or none at all.
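The method above can be sketched as a multinomial naive Bayes classifier with Laplace smoothing. The training data, function names, and categories here are made-up illustrations, and for simplicity the sketch picks the argmax class rather than applying a per-category threshold p:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, category) pairs labeled by an expert.
    Returns class priors and Laplace-smoothed word likelihoods."""
    cat_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: n / len(docs) for c, n in cat_counts.items()}
    likelihoods = {}
    for c in cat_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab))
                          for w in vocab}
    return priors, likelihoods, vocab

def classify(words, priors, likelihoods, vocab):
    """Pick argmax_c [log Pr(c) + sum_w log Pr(w | c)];
    words outside the vocabulary are ignored."""
    scores = {c: math.log(p) + sum(math.log(likelihoods[c][w])
                                   for w in words if w in vocab)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = [(["ball", "game", "score"], "sports"),
        (["game", "win"], "sports"),
        (["election", "vote"], "politics")]
priors, likelihoods, vocab = train_nb(docs)
```

With this toy data, `classify(["game", "ball"], priors, likelihoods, vocab)` returns `"sports"`.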
Rocchio Method
Each document d is represented as a vector within a given vector space V:
Documents with similar content have similar vectors.
Each dimension of the vector space represents a word selected via a feature selection process.
Rocchio Method
Values of d(i) for a document d are calculated as a combination of the statistics TF(w,d) and DF(w):
TF(w,d) (term frequency) is the number of times word w occurs in document d.
DF(w) (document frequency) is the number of documents in which word w occurs at least once.
Rocchio Method
The inverse document frequency is calculated as

    IDF(w) = log(|D| / DF(w))

    d(i) = TF(wi, d) * IDF(wi)

d(i) is called the weight of the word wi in the document d.
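The two formulas above translate directly to code. A minimal sketch, representing each document as a list of words (the sample data is made up):

```python
import math
from collections import Counter

def idf(word, docs):
    """IDF(w) = log(|D| / DF(w)), where DF(w) counts documents containing w."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def weight(word, doc, docs):
    """d(i) = TF(w_i, d) * IDF(w_i)."""
    tf = Counter(doc)[word]
    return tf * idf(word, docs)

docs = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
w = weight("a", docs[0], docs)   # 2 * log(3/1)
```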
Rocchio Method
Based on word weight heuristics, the word wi is an important indexing term for a document d if it occurs frequently in that document.
However, words that occur frequently in many documents spanning many categories are weighted as less important.
Decision Trees
Internal nodes are labeled by terms.
Branches (departing from a node) are labeled by tests on the weight that the term has in a test document.
Leaves are labeled by categories.
Decision Tree
The classifier categorizes a test document d by recursively testing the weights of the terms labeling the internal nodes until a leaf node is reached. The label of that leaf node is then assigned to the document.
Most decision trees are binary trees.
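The recursive testing described above can be illustrated with a tiny hand-built binary tree. The terms, thresholds, and category names here are invented for illustration; a real tree would be induced from labeled training data:

```python
def classify(weights):
    """Walk a hand-built binary tree: each internal node tests one
    term's weight in the test document; leaves are categories."""
    if weights.get("basketball", 0.0) > 0.5:       # root node
        if weights.get("college", 0.0) > 0.3:      # internal node
            return "college-sports"
        return "pro-sports"
    return "other"

print(classify({"basketball": 0.9, "college": 0.7}))  # college-sports
```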
Decision Tree
Fully grown trees tend to have decision rules that are overly specific and therefore generalize poorly to new documents.
Pruning and growing methods for such decision trees are therefore normally a standard part of classification packages.
K-Nearest Neighbor
Features
All instances correspond to points in an n-dimensional Euclidean space.
Classification is delayed until a new instance arrives.
Classification is done by comparing the feature vectors of the different points.
The target function may be discrete- or real-valued.
K-Nearest Neighbor Learning, Dipanjan Chakraborty
1-Nearest Neighbor
K-Nearest Neighbor
An arbitrary instance x is represented by its feature vector (a1(x), a2(x), ..., an(x)), where ai(x) denotes the i-th feature.
The Euclidean distance between two instances is

    d(xi, xj) = sqrt( sum for r=1 to n of (ar(xi) - ar(xj))^2 )

Find the k nearest neighbors whose distance from the test case falls within a threshold p. If x of those k nearest neighbors are in category ci, then assign the test case to ci; otherwise it is unmatched.
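A minimal k-NN sketch using the Euclidean distance above with a simple majority vote. The training data is made up, and for brevity this omits the distance threshold p and the unmatched case:

```python
import math
from collections import Counter

def knn_classify(test, train, k):
    """train: list of (feature_vector, category) pairs.
    Sorts by Euclidean distance and votes among the k nearest."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    nearest = sorted(train, key=lambda pair: dist(test, pair[0]))[:k]
    votes = Counter(c for _, c in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
print(knn_classify((1, 1), train, 3))  # a
```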
K-Nearest Neighbor Learning, Dipanjan Chakraborty
Effectiveness Measures
                Truth: Yes   Truth: No
    System Yes      a            b
    System No       c            d

recall = a/(a+c)
precision = a/(a+b)
accuracy = (a+d)/(a+b+c+d)
utility = any weighted combination of a, b, c, d
F-measure = 2a/(2a+b+c)
others
Machine Learning for Text Classification, David D. Lewis, AT&T Labs
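The measures above follow directly from the contingency table. A minimal sketch (the counts in the example call are made up):

```python
def metrics(a, b, c, d):
    """a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "accuracy": (a + d) / (a + b + c + d),
        "f_measure": 2 * a / (2 * a + b + c),
    }

m = metrics(8, 2, 4, 6)
print(m["precision"])  # 0.8
```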
Conclusions
The performance of a classifier depends strongly on the choice of data used for evaluation.
Dense category spaces become problematic for unique categorization, since many documents share characteristics.
Credits
*This Presentation is Partially Based on Those of Others Listed Below*
Supervised Machine Learning Based Text Categorization
Machine Learning for Text Classification
Automatically Building Internet Portals using Machine Learning
Web Search Machine Learning
Resources
Text Categorization Using Weight-Adjusted k-Nearest Neighbor Classification
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
Text Categorization with Support Vector Machines
Learning to Extract Symbolic Knowledge from the WWW
An Evaluation of Statistical Approaches to Text Categorization
A Comparison of Two Learning Algorithms for Text Categorization
Machine Learning in Automated Text Categorization
Full List of Resources can be found at: http://webster.cs.uga.edu/~miller/SemWeb/Presentation/ACT.html