David W. Miller Semantic Web Spring 2002 Department of Computer Science University of Georgia
www.cs.uga.edu/~miller/SemWeb
Query to General-Purpose Search Engine: +camp +basketball north carolina two weeks
Tough Tasks
Find pages that belong in the search engine.
Supervised learning
1. Expert labels example texts with classes
2. Machine learning algorithm produces a rule that tends to agree with the expert's classifications
Machine Learning for Text Classification, David D. Lewis, AT&T Labs
[Diagram: text representation → model induction → profiles/rules]
Example word-count vector for a document:

a:35  block:1  computer:12  field:4  leg:1  machine:7  of:44  paper:3
perspective:2  rate:1  reinforcement:5  science:9  survey:2  the:56
this:11  underrated:1
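A word-count vector like the one above can be built directly from a document's text. A minimal sketch (the sample sentence is made up for illustration):

```python
from collections import Counter

def bag_of_words(text):
    """Map each word in the text to its occurrence count."""
    return Counter(text.lower().split())

counts = bag_of_words("the block is a block of paper the the")
print(counts["the"])    # 3
print(counts["block"])  # 2
```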
Bayes Method
Pick the most probable class, given the evidence (the document d):

    c* = argmax_j Pr(c_j | d)

Bayes Rule:

    Pr(c_j | d) = Pr(d | c_j) Pr(c_j) / Pr(d)
Bayes Method
Generates conditional probabilities that particular words occur in a document, given that it belongs to a particular category.
A larger vocabulary generates better probability estimates.
Each category is given a threshold p against which it judges whether a document falls into that classification.
Documents may fall into one category, more than one, or none at all.
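The method above can be sketched as a multinomial naive Bayes classifier with Laplace smoothing. The training data, function names, and categories here are made-up illustrations, and for simplicity the sketch picks the argmax class rather than applying a per-category threshold p:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, category) pairs labeled by an expert.
    Returns class priors and Laplace-smoothed word likelihoods."""
    cat_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: n / len(docs) for c, n in cat_counts.items()}
    likelihoods = {}
    for c in cat_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab))
                          for w in vocab}
    return priors, likelihoods, vocab

def classify(words, priors, likelihoods, vocab):
    """Pick argmax_c [log Pr(c) + sum_w log Pr(w | c)];
    words outside the vocabulary are ignored."""
    scores = {c: math.log(p) + sum(math.log(likelihoods[c][w])
                                   for w in words if w in vocab)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = [(["ball", "game", "score"], "sports"),
        (["game", "win"], "sports"),
        (["election", "vote"], "politics")]
priors, likelihoods, vocab = train_nb(docs)
```

With this toy data, `classify(["game", "ball"], priors, likelihoods, vocab)` returns `"sports"`.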
Rocchio Method
Each document d is represented as a vector within a given vector space V:
Documents with similar content have similar vectors.
Each dimension of the vector space represents a word selected via a feature selection process.
Rocchio Method
Values of d(i) for a document d are calculated as a combination of the statistics TF(w,d) and DF(w):
TF(w,d) (term frequency) is the number of times word w occurs in document d.
DF(w) (document frequency) is the number of documents in which word w occurs at least once.
Rocchio Method
The inverse document frequency is calculated as

    IDF(w) = log(|D| / DF(w))

    d(i) = TF(wi, d) * IDF(wi)

d(i) is called the weight of the word wi in the document d.
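The two formulas above translate directly to code. A minimal sketch, representing each document as a list of words (the sample data is made up):

```python
import math
from collections import Counter

def idf(word, docs):
    """IDF(w) = log(|D| / DF(w)), where DF(w) counts documents containing w."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def weight(word, doc, docs):
    """d(i) = TF(w_i, d) * IDF(w_i)."""
    tf = Counter(doc)[word]
    return tf * idf(word, docs)

docs = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
w = weight("a", docs[0], docs)   # 2 * log(3/1)
```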
Rocchio Method
Based on word weight heuristics, the word wi is an important indexing term for a document d if it occurs frequently in that document.
However, words that occur frequently in many documents spanning many categories are weighted as less important.
Decision Trees
Internal nodes are labeled by terms.
Branches (departing from a node) are labeled by tests on the weight that the term has in a test document.
Leaves are labeled by categories.
Decision Tree
The classifier categorizes a test document d by recursively testing the weights of the terms labeling the internal nodes until a leaf node is reached. The label of that leaf node is then assigned to the document.
Most decision trees are binary trees.
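The recursive testing described above can be illustrated with a tiny hand-built binary tree. The terms, thresholds, and category names here are invented for illustration; a real tree would be induced from labeled training data:

```python
def classify(weights):
    """Walk a hand-built binary tree: each internal node tests one
    term's weight in the test document; leaves are categories."""
    if weights.get("basketball", 0.0) > 0.5:       # root node
        if weights.get("college", 0.0) > 0.3:      # internal node
            return "college-sports"
        return "pro-sports"
    return "other"

print(classify({"basketball": 0.9, "college": 0.7}))  # college-sports
```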
Decision Tree
Fully grown trees tend to have decision rules that are overly specific and therefore generalize poorly to new documents.
Pruning and growing methods for such decision trees are therefore normally a standard part of classification packages.
K-Nearest Neighbor
Features
All instances correspond to points in an n-dimensional Euclidean space.
Classification is delayed until a new instance arrives.
Classification is done by comparing the feature vectors of the different points.
The target function may be discrete- or real-valued.
K-Nearest Neighbor Learning, Dipanjan Chakraborty
1-Nearest Neighbor
K-Nearest Neighbor
An arbitrary instance x is represented by its feature vector (a1(x), a2(x), ..., an(x)), where ai(x) denotes the i-th feature.
The Euclidean distance between two instances is

    d(xi, xj) = sqrt( sum for r=1 to n of (ar(xi) - ar(xj))^2 )

Find the k nearest neighbors whose distance from the test case falls within a threshold p. If x of those k nearest neighbors are in category ci, then assign the test case to ci; otherwise it is unmatched.
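A minimal k-NN sketch using the Euclidean distance above with a simple majority vote. The training data is made up, and for brevity this omits the distance threshold p and the unmatched case:

```python
import math
from collections import Counter

def knn_classify(test, train, k):
    """train: list of (feature_vector, category) pairs.
    Sorts by Euclidean distance and votes among the k nearest."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    nearest = sorted(train, key=lambda pair: dist(test, pair[0]))[:k]
    votes = Counter(c for _, c in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
print(knn_classify((1, 1), train, 3))  # a
```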
K-Nearest Neighbor Learning, Dipanjan Chakraborty
Effectiveness Measures
                Truth: Yes   Truth: No
    System Yes      a            b
    System No       c            d

recall = a/(a+c)
precision = a/(a+b)
accuracy = (a+d)/(a+b+c+d)
utility = any weighted combination of a, b, c, d
F-measure = 2a/(2a+b+c)
others
Machine Learning for Text Classification, David D. Lewis, AT&T Labs
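The measures above follow directly from the contingency table. A minimal sketch (the counts in the example call are made up):

```python
def metrics(a, b, c, d):
    """a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "accuracy": (a + d) / (a + b + c + d),
        "f_measure": 2 * a / (2 * a + b + c),
    }

m = metrics(8, 2, 4, 6)
print(m["precision"])  # 0.8
```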
Conclusions
The performance of a classifier depends strongly on the choice of data used for evaluation.
Dense category spaces become problematic for unique categorization, since many documents share characteristics.
Credits
*This Presentation is Partially Based on Those of Others Listed Below*
Supervised Machine Learning Based Text Categorization
Machine Learning for Text Classification
Automatically Building Internet Portals using Machine Learning
Web Search Machine Learning
Resources
Text Categorization Using Weight-Adjusted k-Nearest Neighbor Classification
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
Text Categorization with Support Vector Machines
Learning to Extract Symbolic Knowledge from the WWW
An Evaluation of Statistical Approaches to Text Categorization
A Comparison of Two Learning Algorithms for Text Categorization
Machine Learning in Automated Text Categorization
Full List of Resources can be found at: http://webster.cs.uga.edu/~miller/SemWeb/Presentation/ACT.html