
John Retterer-Moore, Qinghao Wu
Project Report

Problem Statement

Our goal in this project was to build a classifier that can determine what
genre of news a given article or forum post relates to. Ideally, such a
classifier would be able to identify most of the categories widely used in
the world of news, although choosing which set of categories to use is
hard, since different news sources categorize news slightly differently:
one source may use "science" where another has "technology", and the two
categories would contain subtly different sets of articles that are hard
to distinguish. We ended up choosing one set of categories based on a good
source of training/test data we found, although many alternate
categorizations could be used; it would be interesting to see whether our
method maintained its effectiveness on those categorizations. This is a
useful problem to solve, since it has many applications. For example, a
search engine may want to aggregate news from many sources on a specific
topic, like business news, and so it may want to scan a wide variety of
sources for their content rather than rely on the classifications the
sources themselves use, since, as noted above, different sources may
classify news in subtly distinct ways. A classifier could also be used to
analyze trends in reporting: for example, do articles about politics tend
to use more words related to emotion than articles about science? Now
that our problem is clearly specified, let's describe the training and
test data we used and how we built our classifier to accurately
categorize news.

Data
We got our data from http://qwone.com/~jason/20Newsgroups/, a
collection of about 20,000 documents from newsgroup forums in the
late 1990s. The documents were split into 20 categories, but we
combined some similar categories and left out a few to get 5 broad
categories similar to ones a news website might use: automotive news,
political news, sports news, computer news, and religion news. The data
we used consisted of approximately 1000 training documents and 1000
test documents for each category, all forum posts from various
newsgroups related to that category. The data had been at least
partially filtered for filler words, so many common filler words like
"a" and "the" did not appear in it, although some still remained and
had to be dealt with by our feature selection methods. Overall, this
data gave us a large pool of documents to train our classifier with,
and the documents also had a fairly high concentration of useful words
(e.g. the word "team" appeared many times in the sports posts), so it
was a good set of data to build our classifier on.

Method
We start by figuring out how many times each word appears in each
set of training data, as word frequency is the most basic metric to
consider when classifying documents. We filter out any words that
appear fewer than 50 times in a category's 1000 documents, to reduce
the number of words we need to consider, since words that appear so
rarely are unlikely to appear in the documents we want to classify
later. That also helps remove uncommon filler words like "although" or
"between" that appear infrequently across all categories of news. We
then store the word-frequency pairs separately for each category of
document. A future direction that would certainly improve our results
would be an easy system for reducing a word to its root (i.e. a
stemmer). For example, if "run", "runs", "runner", "runners", and "ran"
each had 45 occurrences, we wouldn't consider any of them, when really
the root word "run" had 225 occurrences and should be considered.
Unfortunately, we couldn't find any efficient, easy-to-implement
methods for this word-stemming issue, so it remains a future direction
for our project rather than a current one.
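A minimal sketch of this counting-and-filtering step in Python (the
function and variable names are illustrative, not taken from our actual
code):

    from collections import Counter

    MIN_COUNT = 50  # drop words seen fewer than 50 times in a category

    def count_words(docs):
        """Count how often each word appears across one category's documents."""
        counts = Counter()
        for doc in docs:
            counts.update(doc.lower().split())
        return counts

    def filter_rare(counts, min_count=MIN_COUNT):
        """Keep only the words that appear at least min_count times."""
        return {word: c for word, c in counts.items() if c >= min_count}

    # Illustrative input: category name -> that category's training documents.
    category_docs = {
        "sports":   ["the team won the game"] * 60,
        "politics": ["the senate passed the bill"] * 60,
    }
    category_counts = {cat: filter_rare(count_words(docs))
                       for cat, docs in category_docs.items()}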
For each word, for each category, we assign the word a weight for
that category representing how strongly correlated that word's
appearance is with the article belonging to that category. We
considered a few variants on TF-IDF to calculate the weight of each
word. One would be the frequency of the word in that specific category
divided by the frequency of the word across all categories, to measure
how closely the word is tied to the specific category. We initially
decided against this because it gives the same weight to all words
that are exclusive to a category: if "soccer" only appeared in sports
and appeared many times, and "turnover" only appeared in sports and
appeared only once, both would end up with weight 1, when "turnover"
might just be an unusual word that happened to show up once in sports,
while "soccer" clearly has a strong connection to sports if it appears
many times exclusively in sports articles. We then considered squaring
the category-specific frequency and dividing by total frequency, but
this had the opposite issue: words that appeared many, many times
across all categories, like the handful of filler words that remained
in the data, would get massive scores, while words that were exclusive
to one category would get less of a bonus. The traditional TF-IDF
method would have a similar issue. In the end we went with a
compromise between the two extremes: category-specific frequency
divided by total frequency, solving the problem with uncommon
category-specific words via the aforementioned filter that removes
words appearing fewer than 50 times in a category. The filter should
remove any word that is not strongly linked to at least one category,
and dividing by total frequency should handle the words that are
strongly linked to more than one category. So for each word, we have a
weight for that word for each category.
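A minimal sketch of this weighting step, continuing from the
per-category counts in the previous sketch:

    from collections import Counter

    def compute_weights(category_counts):
        """Weight of a word in a category = its count in that category
        divided by its total count across all categories (TF/DF)."""
        total = Counter()
        for counts in category_counts.values():
            total.update(counts)
        return {cat: {word: c / total[word] for word, c in counts.items()}
                for cat, counts in category_counts.items()}

    weights = compute_weights(category_counts)
    # A word seen only in sports gets weight 1.0 there; a word spread
    # evenly across all categories gets a weight near 1 / (number of
    # categories) in each.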
For any new news article we want to classify, we calculate a score for
each category by adding up that category's weights for the words in the
article. If the highest of those total scores is above a certain
threshold (larger for longer articles), we classify the article as
belonging to that category. This method performs quite well on the test
data from the 20 newsgroups, as you can see in the results section, and
our own tests on random Google News articles suggest that it works
reasonably well there too.
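A minimal sketch of the scoring step, using the weights from the
previous sketch; scaling the threshold linearly with article length and
the 0.05 constant are illustrative assumptions, not the exact rule we
tuned:

    def classify(article, weights, per_word_threshold=0.05):
        """Sum each category's weights over the article's words and
        return the best-scoring category, or "others" if even the best
        score is below a threshold that grows with article length."""
        words = article.lower().split()
        scores = {cat: sum(wt.get(word, 0.0) for word in words)
                  for cat, wt in weights.items()}
        best = max(scores, key=scores.get)
        if scores[best] < per_word_threshold * len(words):
            return "others"  # no category fits well enough
        return best

    print(classify("the team won the game in overtime", weights))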

What is new?
(1) We achieved better feature selection by changing the TF-IDF
formula. Instead of the traditional TF-IDF formula, we use TF/DF (a
word's frequency in a category divided by its total frequency across
all categories) as the weight of each word when doing feature
selection. Because we already did some preprocessing while calculating
term frequencies, we keep every remaining word as a feature in this
step. We do not use the traditional TF-IDF formula because we find that
many words occur in all kinds of news with relatively high term
frequency; such words should not be treated as features, or should get
very low weight. Using TF/DF as the weight ensures that words occurring
in all categories get a low weight and do not hurt the accuracy of our
classifier, while words that occur in only one category get a high
weight.
(2) We improved the algorithm's efficiency and the classifier's
accuracy by deleting low-frequency words. We used a large amount of
training data, so we expect the term frequency of genuine features in
each category to be relatively high, and we therefore deleted words
whose term frequency fell below a specific value. This saves a lot of
time that would otherwise be spent on meaningless words. Moreover, if a
word we deleted in one category is a feature of another category, that
word's weight in the other category becomes higher, improving the
accuracy of our classifier. Conversely, if we did not delete a
low-frequency word that occurs in no other category, it would receive a
relatively high weight even though it is not a real feature, which
would hurt our classifier.
(3) We use a simple new strategy to score how likely a news item is to
belong to each specific category. In our algorithm, each category has
many features, and each feature has a weight. When our classifier
judges which category a given news item belongs to, it checks which
features the item contains and computes the sum of their weights for
each category; the item is assigned to the category with the highest
score. Our strategy is simple, but it gets good results.
(4) We can catch the case where a given news item does not belong to
any of our categories by calculating the ratio of the number of
features it contains to its length. Our classifier cannot classify all
categories of news, but it can report that an item belongs to none of
our categories; it gives this result when the scores for every category
are all very low. Because long articles naturally contain more features
and short articles contain fewer, we take the length of the item into
consideration: only when the ratio of the number of features to the
length of the item is relatively low does our classifier say the item
belongs to no category. A sketch of this rule appears below.
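The sketch below shows this rejection rule in isolation, under the
assumption that a "feature" is any word carrying a weight in some
category; the 0.1 cutoff is an assumed value:

    def belongs_to_no_category(article, weights, min_ratio=0.1):
        """Reject the article when too small a fraction of its words
        are features of any category (the 0.1 cutoff is an assumption)."""
        words = article.lower().split()
        feature_vocab = {word for wt in weights.values() for word in wt}
        hits = sum(1 for word in words if word in feature_vocab)
        return hits / max(len(words), 1) < min_ratio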

Results
We used 600 test documents from each category, and the confusion
matrix of the results is below.
Confusion matrix (rows: actual category, columns: predicted category)

             Politics   Comp   Auto   Religion   Sports   Others
Politics        577       1      2       19         1        0
Comp             58     515      7        8         6        0
Auto             77      21    473       20         9        0
Religion         99      12      0      488         1        0
Sports           54      12      2       15       516        1

From the confusion matrix, we can compute the recall and precision of
our classifier for each category, as well as its overall accuracy.

             Recall     Precision
Politics     96.17%     66.71%
Comp         85.83%     91.80%
Auto         78.83%     97.72%
Religion     81.33%     88.73%
Sports       86.00%     96.81%

Overall accuracy: 85.63%

From these results, we can see that the classifier has different
recall and precision on different categories. Politics has a relatively
high recall but low precision: political news has obvious features, but
some news in other categories is also related to politics, so it is
sometimes misclassified as political news. The overall accuracy of our
classifier is 85.63%, which is an acceptable result.
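For reference, the figures above follow directly from the confusion
matrix; the short script below reproduces them (up to rounding), taking
rows as actual categories and columns as predicted ones:

    cats = ["Politics", "Comp", "Auto", "Religion", "Sports"]
    matrix = [  # columns: Politics, Comp, Auto, Religion, Sports, Others
        [577,   1,   2,  19,   1, 0],  # actual Politics
        [ 58, 515,   7,   8,   6, 0],  # actual Comp
        [ 77,  21, 473,  20,   9, 0],  # actual Auto
        [ 99,  12,   0, 488,   1, 0],  # actual Religion
        [ 54,  12,   2,  15, 516, 1],  # actual Sports
    ]
    for i, name in enumerate(cats):
        recall = matrix[i][i] / 600  # 600 test documents per category
        precision = matrix[i][i] / sum(row[i] for row in matrix)
        print(f"{name}: recall {recall:.2%}, precision {precision:.2%}")

    correct = sum(matrix[i][i] for i in range(len(cats)))
    print(f"overall accuracy: {correct / 3000:.2%}")  # 85.63%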
