Abstract
The purpose of this paper is to build a classifier for a community based question answering service that is able to classify user submitted questions into a set of predefined categories arranged in a hierarchical manner. We use the Naive Bayes algorithm for classification and later extend it with the ability to use the hierarchical structure of the category tree. After implementing such a classifier we experiment with it in a number of different scenarios and try to find its advantages and disadvantages over the flat classification approach. We find that the biggest advantage of the hierarchical classification approach over flat classification is its efficiency. We were also able to increase the classification accuracy by assigning different weights to different parts of the question.
1 Introduction
With today's advancement in information technology no one is surprised to see a computer performing a task that originally involved some human intelligence or intuition. One of those tasks is the classification of various contents. With so much information available on the World Wide Web it is very hard to find the data that you need. To introduce some order into this big pile of information on the global network, some companies, such as Yahoo!, are trying to classify it into categories. Of course, dealing with such a massive amount of information requires the right classification techniques to be involved: no one would want to manually read everything and then try to find the category it best fits in. Such classification of web content is a very common application for automated classifiers. The purpose of this paper is to build a classifier able to classify user submitted questions into a set of predefined categories arranged in a hierarchical manner, and to compare it with the traditional flat classification approach. In services such as Yahoo! Answers this would significantly help the user to choose the category that his or her question is most relevant to. This way there is no need for the user to go through all the available categories and decide which is the most relevant to the question; the classifier does that and presents a list of categories that best match the question. In fact, such a classifier is already employed by Yahoo! Answers, but no documentation about its implementation details has been publicly revealed.
There are a number of different methods designed for this classification assignment, such as Naive Bayes, decision trees or Support Vector Machines (SVM). Generally the effectiveness of each method can vary depending on the set of features used for training, but given a proper amount of decent training data such classifiers can perform their task surprisingly well. However, none of the above techniques is able to classify a set of attributes into categories that are represented as a hierarchical structure. The traditional binary classifier (also known as a flat classifier), given a set of predefined attributes, is able to make a distinction between one class and the rest of them. Thus it can only predict one category that the attributes best fit in.
The main focus of this paper is devoted to making a classifier that is able to classify a set of attributes into categories arranged in a hierarchical manner. Since our categories reside in a hierarchically organized structure, there is always one top level class which can have an arbitrary number of subclasses, and each subclass has at most one parent class. We design our classifier to adopt the top-down approach, so the first task for such a hierarchical classifier is to distinguish between the top level categories. Once that has been done, it can go deeper into the category tree and try to find the best suiting subclass. It continues down the category tree until it reaches a leaf class. Each such iteration puts different weights on an attribute according to the level we are classifying at. For example, given the categories music, music/jazz, music/blues, health and health/diet, the word rhythm would very clearly indicate the category at the top level, but when classifying between the subcategories blues and jazz it would give us no useful information. This example shows that it is very important to experiment with different sets of attributes at different levels of classification in order to achieve decent accuracy.
The data that we use to train our classifier is obtained from Yahoo! Answers. It is basically a huge amount of user submitted questions that can contain some additional information, such as a detailed description of the question or other details. This data is extracted from plain HTML files, then preprocessed and inserted into a database for easier data manipulation. Our category hierarchy also shares the same structure as the one Yahoo! Answers uses. Using the top-down approach makes our classifier very sensitive at the first step.
If we fail to correctly determine the top class category we can never reach our destination, because every node in our category tree has at most one parent, so there is only one path to any lower level class. To solve this issue we build our classifier to propose more than one category for a given question. This also enables the user to select the proposed category that he or she thinks is the most relevant to the submitted question.
The following sections of the paper describe related works and how they differ from our approach. We also explain our system architecture and working methods in detail. Later we experiment with the hierarchical Naive Bayes classifier in a number of different scenarios and try to find its advantages and disadvantages over the flat classification approach. We find that the biggest advantage of the hierarchical classification approach over flat classification is its efficiency. We were also able to increase the classification accuracy by assigning different weights to different parts of the question.
2 Related Works
There are already a number of published papers proposing original methods for the hierarchical classification approach. Susan Dumais and Hao Chen [DC00] conclude that a hierarchical classification procedure used to classify Web content achieves an improvement in accuracy compared with the flat classification approach. To obtain those results they used a reduced-dimension binary-feature version of the SVM algorithm, which is assumed to be very efficient for both initial learning and real-time classification. The classifier was trained and tested with a large, heterogeneous collection of web content. The authors kept their focus on the case where the category tree is comprised of two levels. Second level category models were trained using different sets, either from the same top level category or across all categories. Later, the scores from the top and second level models were combined using different combinations of rules. As a result the SVM model was extended with the ability to use a hierarchical category structure, and there was a slight improvement in both performance and accuracy.
Ashwin K Pulijala and Susan Gauch [PG04] employed a top-down hierarchical classification approach to classify a heterogeneous collection of web content. Rather than building one big classifier, the authors decided to build a hierarchy of classifiers that focus only on a small set of classes at each level of the category tree, those relevant to the current task. Using this approach each task is divided into a set of sub-tasks, which are later classified more accurately and efficiently by the corresponding classifier. Furthermore, the accuracy is improved because the classifiers can identify and ignore the commonalities between the subtopics of a specific class.
Our approach differs from the ones mentioned above by some additional procedures involved in the classification. We extend the Naive Bayes classification algorithm with the ability to classify a set of attributes into hierarchically arranged categories. To carry out this task we employ the top-down approach, which makes our classifier very sensitive at the first step of the classification, predicting the top level category. Failure to correctly guess the top level category would always lead to faulty results. To deal with this issue, we design our classifier to suggest the three categories with the highest probabilities, so that the user can then select the one that is most relevant. The data that we use is a huge set of user submitted questions obtained from Yahoo! Answers. To make our classifier more universal we use a very big number of features to train it, and we make sure that there is a relatively high number of features in each category. To speed up the process and make it easier to manipulate the data, we have decided to extract all the questions into a relational database. Later we experiment by including different sets of features in our training data. We also try to get better classification accuracy by including features from different parts of a question and putting different weights on them. As in the previous work by Baykan et al. [BHW08], we have decided to use a single word as a feature. This decision was made because in many cases a typical question in our data source is a relatively short sentence with just a few words in it.
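The single-word feature extraction described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation; the tokenization rule (lowercasing, keeping only letters, digits and apostrophes) is our assumption:

```python
import re
from collections import Counter

def extract_features(text):
    """Split a question into single-word features (lowercased)."""
    # keep runs of letters, digits and apostrophes; punctuation is discarded
    return re.findall(r"[a-z0-9']+", text.lower())

question = "What is the difference between blues and jazz?"
features = extract_features(question)
counts = Counter(features)  # per-question feature counts
```

Counting features this way makes each question a small bag of words, which fits the short questions typical of the data source.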
3 Data flow
We design the application to classify the question data and also to test the correctness of the results. Therefore, to categorise the data as correctly as possible, a few steps need to be taken. The data flow diagram in figure 1 illustrates how data is processed by the system in terms of inputs and outputs. It breaks the process into steps and reveals the phases in which the data is affected by the application.
The input of the application is 20 GB of HTML files with Yahoo! Answers question data. These web pages contain redundant data, which is not useful for classification. The most relevant information, such as the question, description, details and categories, is extracted and then saved into the database. In a further step different preprocessing techniques are applied to prepare the data for the categorization process. We use a supervised classification method based on statistical calculations, which, as is known, demands the data to be trained on. In our case we treat the training process as an aggregation of selected features: counters of the features and categories are collected and stored into the database. Later on they are applied in the probability computation to classify the specified question data. According to our task, the output is the suggested categories. To assure classification accuracy we propose a few categories instead of a unique one, which could be classified wrongly.
together in the classification process rather than being two separate features. Not only should that improve the classification accuracy a bit, but we also have fewer features to store in the database, thus saving some memory and improving querying performance.
The reason for the third table, named catfeatures, is the fact that in our case each feature can belong to a number of categories. This is dictated by the hierarchical structure of the category tree: there is always one top level category and an arbitrary number of lower level categories. So every feature can belong to many categories and each category, in turn, can have many features, thus creating a many-to-many relation between the two mentioned entities. We have decided to represent each single word as one feature.
The parent field is only used in lower level categories because top level categories have no parent. The counter field shows the number of features that the current category contains, while the quest_counter field represents the number of questions that the current category contains. These counter fields are used to compute the prior and likelihood probabilities in the Naive Bayes formula. The level in the category tree is shown in the hlevel field: the lower the level, the bigger the value set in hlevel, and top level categories have a value of 1 in this attribute. The counter field in the catfeatures table represents the number of occurrences of the feature with feat_id that reside in the current category (determined by cat_id). Besides the name of the feature (feature), the features table also has a counter field, which shows how often the feature is encountered in the vocabulary (its frequency).
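The three tables described above can be sketched as a relational schema. This is a hypothetical reconstruction using SQLite; beyond the fields named in the text (parent, hlevel, counter, quest_counter, feature, feat_id, cat_id), the column types and the name column are our assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (
    cat_id        INTEGER PRIMARY KEY,
    name          TEXT,
    parent        INTEGER REFERENCES categories(cat_id),  -- NULL for top level
    hlevel        INTEGER,   -- 1 for top level, larger values deeper in the tree
    counter       INTEGER,   -- number of features in this category
    quest_counter INTEGER    -- number of questions in this category
);
CREATE TABLE features (
    feat_id INTEGER PRIMARY KEY,
    feature TEXT UNIQUE,     -- the word itself
    counter INTEGER          -- frequency of the word in the whole vocabulary
);
-- many-to-many relation between categories and features
CREATE TABLE catfeatures (
    cat_id  INTEGER REFERENCES categories(cat_id),
    feat_id INTEGER REFERENCES features(feat_id),
    counter INTEGER,         -- occurrences of the feature in this category
    PRIMARY KEY (cat_id, feat_id)
);
""")

# a top level category: no parent, hlevel 1
conn.execute("INSERT INTO categories VALUES (1, 'Music', NULL, 1, 0, 0)")
top_level = conn.execute(
    "SELECT name FROM categories WHERE hlevel = 1").fetchall()
```

The catfeatures junction table is what lets the same word carry different counts in different categories, which the hierarchical classifier relies on.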
5 Classification
5.1 Naive Bayes probabilistic model
The Naive Bayes algorithm is one of the most popular methods used for text classification. As the name suggests, this method is based on the Bayes theorem (1) for calculating conditional probabilities.
P(A | B) = P(A) P(B | A) / P(B)   (1)

where P(A) is the prior probability, the probability of event A; P(A | B) is the conditional probability of A given B; P(B | A) is the likelihood, the conditional probability of B given A; and P(B) is the evidence, the probability of B, which acts as a normalizing constant. Using equation (1) we can construct a probability model of a document D with a set of features F1, ..., Fn being in class C, as shown in formula (2).

P(C | F1, ..., Fn) = P(C) P(F1, ..., Fn | C) / P(F1, ..., Fn)   (2)
If the set of features in a document does not supply enough information to distinguish between a set of classes, we choose the class with the higher prior probability, i.e. the probability of a document occurring in that class. Knowing the prior probability P(C) and the likelihood P(F1, ..., Fn | C), we do not need the evidence P(F1, ..., Fn). This is because P(F1, ..., Fn) has a constant value throughout all the classes: its value does not change for different classes. After removing the denominator of the fraction in (2) we get the simplified model shown in equation (3).

P(C | F1, ..., Fn) ∝ P(C) ∏_{i=1}^{n} P(Fi | C)   (3)
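As a toy illustration of equation (3), the unnormalized class scores can be computed directly. The two classes, the priors and the per-feature likelihoods below are made-up numbers for illustration, not values from the paper's data:

```python
# Hypothetical two-class example: priors P(C) and per-feature likelihoods P(F|C)
prior = {"music": 0.5, "health": 0.5}
likelihood = {
    "music":  {"rhythm": 0.30, "diet": 0.01},
    "health": {"rhythm": 0.01, "diet": 0.25},
}

def score(cls, features):
    """Unnormalized posterior P(C) * prod_i P(F_i | C) from equation (3)."""
    s = prior[cls]
    for f in features:
        s *= likelihood[cls][f]
    return s

doc = ["rhythm"]
best = max(prior, key=lambda c: score(c, doc))
# "rhythm" points strongly to the music class
```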
To decide which class a document belongs to, one possible decision rule would be to use the class that has the highest probability. This is known as the MAP (maximum a posteriori) decision rule. According to [MRS08], for a set of features F1, ..., Fn the class is chosen as shown in equation (4).

c_MAP = argmax_c P(c) ∏_{i=1}^{n} P(Fi | c)   (4)

Because in equation (4) we multiply many probabilities, which usually tend to be pretty small values, there is a rather high possibility of floating point underflow. One possible solution to this problem is to add logarithms of the probabilities rather than multiply their real values. Because the logarithm function is monotonic and because of the identity (5), the class with the highest probability will still have the highest logarithm value.

log(x * y) = log(x) + log(y)   (5)

c_MAP = argmax_c [ log P(c) + Σ_{i=1}^{n} log P(Fi | c) ]   (6)

In the case of text classification, the prior probability P(c) can be expressed as the following fraction:

P(c) = Nc / N   (7)

where Nc is the number of documents in class c and N is the total number of documents. The likelihood P(F | c) is calculated as follows:

P(F | c) = Fc / Σ_{F'} F'c   (8)

Here, Fc is the number of occurrences of feature F in the training documents of class c, and Σ_{F'} F'c is the total number of features in the training documents of class c. The critical disadvantage of equation (8) is encountered when we have to calculate the probability of a feature-class combination that did not occur in the training data: a single occurrence of such a feature forces the whole equation to evaluate to zero. A simple technique called Laplace smoothing is used to avoid this awkward situation. It simply adds one to each count, as shown in equation (9).

P(F | c) = (Fc + 1) / Σ_{F'} (F'c + 1)   (9)

As can be seen from figure 3, the hierarchical approach allows us to save a lot of time and resources, because we do not need to include all classes in the classification process. Assume that the document being classified belongs to the class named 24, as depicted in figure 3. A flat classifier would iterate through all the classes in the tree, compute all the probabilities and return the class with the highest probability. However, by employing the simple technique described above, we significantly reduce the number of classes involved in the classification process, which in turn remarkably increases the performance of our algorithm. At first, such a classifier needs to classify the document between the three top level classes 1, 11 and 12. It finds the one with the highest probability, in our case 11, and then applies classification only to its two children classes, 23 and 24. After distinguishing between those two nodes our classifier is ready to return the result. In this case only five classes truly participated in the classification procedure: three top level (1, 11, 12) and two second level (23, 24) categories.
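The top-down walk over the example tree can be sketched as follows. The scoring function here is only a stand-in for the Naive Bayes log-probability of equation (6), and the hard-coded scores are made up so that the walk ends at class 24 as in the figure:

```python
# Example category tree from figure 3: the (virtual) root's children are the
# top level classes; 23 and 24 are subclasses of 11.
children = {None: [1, 11, 12], 1: [], 11: [23, 24], 12: [], 23: [], 24: []}

def log_score(cls, features):
    """Stand-in for log P(c) + sum_i log P(F_i | c) from equation (6)."""
    fake = {1: -9.0, 11: -2.0, 12: -7.0, 23: -3.5, 24: -1.2}  # illustrative
    return fake[cls]

def classify_top_down(features):
    node, evaluated = None, 0
    while children[node]:                      # descend until a leaf is reached
        candidates = children[node]
        evaluated += len(candidates)
        node = max(candidates, key=lambda c: log_score(c, features))
    return node, evaluated

leaf, n = classify_top_down(["some", "words"])
# only 5 classes are scored: the top level (1, 11, 12), then 11's children (23, 24)
```

A flat classifier would instead score every candidate class; the saving grows with the size and depth of the category tree.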
To prepare the training and test data we employ procedure 1. It has three parameters: cntTrain, cntTest and cntCat. The cntTrain and cntTest variables define the amount of questions per category for the training and test data respectively, while cntCat determines how many top level categories are chosen. In this procedure we first select the questions assigned to predefined or randomly chosen categories. Predefined categories are basically chosen manually according to a particular domain; we select categories which belong to domains that do not overlap with each other. The selected questions are then divided into the training and test data accordingly, with the amount of questions for each category determined by the procedure's input parameters. The total amount of questions in each category should not be less than 100. This is because categories with too small an amount of total questions could be assigned only to the training data set and would then never be tested. For the various experiments we use different input parameters in order to obtain training and test datasets of various sizes.

  foreach selected category do
      cnt := 0;
      foreach question in the category do
          if cnt < cntTrain then
              insert question into the training data table;
              cnt := cnt + 1;
          else if cnt >= cntTrain and cnt < cntTrain + cntTest then
              if question is not in test data then
                  insert question into the test data table;
                  cnt := cnt + 1;
              end
          end
      end
  end

Figure 4: Collecting training and test data
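Procedure 1 amounts to a per-category split controlled by cntTrain and cntTest. A minimal in-memory sketch follows (the paper inserts into database tables instead; the data here is invented):

```python
def split_questions(questions_by_category, cnt_train, cnt_test):
    """For each category, put the first cnt_train questions into the training
    set and the next cnt_test questions into the test set."""
    train, test = [], []
    for category, questions in questions_by_category.items():
        cnt = 0
        for q in questions:
            if cnt < cnt_train:
                train.append((category, q))
                cnt += 1
            elif cnt < cnt_train + cnt_test:
                if (category, q) not in test:      # skip duplicate questions
                    test.append((category, q))
                    cnt += 1
    return train, test

data = {"music": [f"q{i}" for i in range(10)]}
train, test = split_questions(data, cnt_train=6, cnt_test=2)
```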
7 Evaluation
In this section we present the results of comparing the flat Naive Bayes classifier and the one that uses the hierarchical category structure. To evaluate the accuracy of each classifier we calculate the F-measure based on the precision and recall values retrieved in a number of different scenarios. We determine this value for three different classifiers.
Assume that an attribute A belongs to class C. We use the classifier to find the class with the highest probability for attribute A. If the classifier predicts the correct class for this attribute, in our case class C, then we have a True Positive (TP) case. Otherwise, if the classifier categorizes this attribute into a class other than C, such a case is labeled as a False Negative (FN) in the confusion matrix. Similarly we obtain the False Positive (FP) and True Negative (TN) cases. After summing up all such cases in the classification results we can compute the precision and recall values using the following simple equations:
First we estimate the F-score for the Flat Naive Bayes classifier (FNB). Later we compare it with two versions of classifiers that use the hierarchical classification approach. As mentioned in the sections above, we design the classifier to pick three categories and propose those to the user, so in this case we assume that if at least one of those categories is predicted correctly, the classification result is correct. Later in this section this classifier is referred to as Hierarchical Naive Bayes 3 (HNB3). The third classifier that we evaluate is similar to the previously described one, but instead of suggesting the three categories with the highest probabilities, it only suggests one, like the flat classifier. In the evaluation results this classifier is labeled as Hierarchical Naive Bayes 1 (HNB1).
Precision = TP / (TP + FP)   (10)

Recall = TP / (TP + FN)   (11)
As can be seen from equation 10, the intuitive meaning of the precision value can be expressed in words as the percentage of positive predictions that are correct. Similarly, recall shows us the percentage of positively labeled instances that were predicted as positive. After obtaining the precision and recall values, we can very easily compute the F-measure for our classifier. The formula for calculating it is given in equation 12.
F = 2 * (Precision * Recall) / (Precision + Recall)   (12)
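Equations (10)-(12) translate directly into code; the confusion-matrix counts below are hypothetical:

```python
def f1(tp, fp, fn):
    """F-measure from equations (10)-(12) with evenly weighted P and R."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 80 correct predictions, 20 false positives, 20 false negatives:
# precision = recall = 0.8, so F1 = 0.8
score = f1(tp=80, fp=20, fn=20)
```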
The expression in equation 12 is also known as the F1 value because the Precision and Recall values are evenly weighted. All the classifier evaluations in this paper are based on this formula.
In the first case we increase the amount of questions that each category contains. In the second case we try to increase the number of categories that reside in the training data set and figure out which classification method is more appropriate in such a situation.
Figure 6 also shows us that the increases in accuracy of both hierarchical classifiers follow a very similar pattern. These classifiers react to the increased amount of training data more sensitively than the flat one. By increasing the number of questions in the training data from 4500 to 22000, HNB1 and HNB3 gave us 20.9% and 21.4% improvements in accuracy respectively. The FNB had an increase of 19.4%.
- question (later labeled as Q)
- more detailed description of a question (Desc)
- additional details (Det)
- answers to that question (Answ)
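One simple way to realize per-part weights is to multiply each part's word counts by its weight; the paper does not specify its implementation, so this repetition scheme and the example weights are our assumptions:

```python
from collections import Counter

def weighted_features(parts, weights):
    """Count word features, multiplying each count by its part's weight."""
    counts = Counter()
    for name, text in parts.items():
        w = weights.get(name, 0)        # weight 0 excludes the part entirely
        for word in text.lower().split():
            counts[word] += w
    return counts

parts = {"Q": "jazz rhythm", "Desc": "smooth rhythm", "Answ": "try blues"}
counts = weighted_features(parts, {"Q": 4, "Desc": 3, "Det": 2, "Answ": 1})
# "rhythm" appears in Q (weight 4) and Desc (weight 3): weighted count 7
```

The weighted counts can then feed the likelihood estimates of equations (8) and (9) unchanged.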
As we expected, increasing the amount of training data resulted in a more accurate classification outcome. The F-scores of the flat classifier and the first version of the hierarchical Naive Bayes classifier are very similar: in all cases they vary only by 0.1% to 1.4%. In the first step the HNB1 reacted very sensitively to the increased amount of training data and showed almost the same accuracy as the flat classifier. However, the further increases in the size of the training data were not as noticeable as in the case of the FNB. The HNB3 was more noticeably ahead of the other classifiers.
Below we present three tables with the Precision, Recall and F-score values for the HNB3 classification results. In each table we define different weights for each part of the question. In total we test the classification results in three different cases. As before, in each case we gradually increase the amount of questions in our training data set.
7.3.1 Case I
In the first case we train our classifier with 4500 questions divided into 45 categories, which results in 100 questions per category. Table 1 illustrates the results of 7 different classification iterations; in each iteration we assign different weights to the different parts of the question. The test data that we use to test our classifier in this case is composed of 867 questions and 45 categories; each category in the test data set has roughly 20 questions. It is very important to note that the tables below present the average Precision, Recall and F1 values. Those were computed taking into account all 42 categories that reside in our test data.
Table 1: weight combinations (Q, Desc, Det, Answ) tested in Case I

  Q   Desc  Det  Answ
  1   1     1    0
  1   1     1    1
  3   2     1    4
  4   3     2    1
  4   2     1    3
  1   3     2    0
  1   0     0    0

From table 1 we can clearly see that the most important part is the question itself. Including the description and additional details in the classification process does not actually have a big effect on the F1 value. However, putting bigger weights on those two attributes than on the question itself can significantly lower the accuracy of our classification results (row 7). In general, it seems a good idea to include all available parts of the question: each one of them can increase the accuracy if appropriately weighted. The best accuracy was achieved when putting weights of 4, 3, 2 and 1 on the question, description, details and answers respectively. Compared with the case of using the question alone, we got a 6.5% improvement in the F-score.

7.3.2 Case II

In the second case we use 13090 questions in our training data set. Again they are divided into 45 different categories; each category contains about 290 questions. We tested the classifier with 1260 questions that were split into 42 categories. Table 2 shows the achieved results.

Table 2: weight combinations (Q, Desc, Det, Answ) tested in Case II

  Q   Desc  Det  Answ
  1   1     1    0
  1   1     1    1
  3   2     1    4
  4   3     2    1
  4   2     1    3
  4   3     1    2

Again, as in the previous case, it confirms that in order to achieve the highest accuracy we need to include all the parts of the question that are available. The two best weight combinations in this case are situated at rows 5 and 7. Yet again, the overall best weight combination remains 4, 3, 2, 1. This time the improvement in the F-score after finding the best weight combination was a little lower, 4.8%. We suppose that this is because the classifier was trained with a noticeably bigger training data set and gathered enough information to better distinguish between the different categories. That way the additional information had a lower impact on the classification accuracy.
7.3.3 Case III

In the third case we increase the training data set even further, so it takes a lot of time for the classifier to finish the classification. Table 3 sums up the results achieved in this case.

Table 3: weight combinations (Q, Desc, Det, Answ) tested in Case III

  Q   Desc  Det  Answ
  1   1     1    0
  1   1     1    1
  3   2     1    4
  4   3     2    1

This case follows the same scenario as the previous ones. The overall best weight combination remains 4, 3, 2, 1 (row 5). As predicted in the previous case, after further increasing the amount of training data, the improvement in the F-measure achieved by assigning the weights was even smaller, 1.9%.

8 Conclusions

After completing the experiments described in the previous section, it is clearly visible that the hierarchical (HNB1) and flat (FNB) classifiers are very close to each other in terms of classification accuracy. By increasing the amount of training data from 4500 to 22000 questions, the difference in accuracy between those two classifiers varied only by 0.1% to 1.4%. As expected, the HNB3 was able to demonstrate the best accuracy amongst the tested classification methods: compared with FNB, the improvement in the F-measure ranged up to 5.76%. Using the hierarchical classification approach again makes it much faster than the flat classifier. However, since we need three classification iterations with this classifier, it cannot complete the classification as fast as HNB1. In the same case as described above, this classifier took 6 to 8 minutes to classify 4200 questions. This is a little worse than in the HNB1 case but still a major improvement compared with the results of the flat classifier.

The idea of assigning different weights to different parts of the question seems to be a helpful technique for increasing the classification accuracy. By assigning appropriate weights to the different elements of a question we were able to improve the F-score by up to 6.5%. Weighting different attributes seems to have a lower impact on the classification accuracy when we train our classifier with a larger amount of training data: the classifier gathers enough information to better distinguish between the different categories, and any additional information becomes less helpful. It is very important to assign suitable weight values to each part of the question. The overall best weight combination for question, description, details and answers is 4, 3, 2 and 1 respectively. It is not advised to assign a relatively high weight to the description or details, because only 55% of all the questions in our data source contain a description and only 13% of the questions have additional details; in all other cases those fields are left blank. Including all the answers to a question in the classification process can have a positive effect on the classification outcome. However, such features should be weighted fairly low, because answers can occasionally cover a very broad range of topics that are not necessarily closely related to the question.

As for pursuing better classification accuracy, possible future work could experiment with the different feature selection models described by [YP97]. Another approach would be to assume that a question can reside in a category that is not necessarily located at a leaf of the category tree. Additional classification techniques need to be employed for this method to produce decent results; a possible solution could be to set a threshold value at each level of the category tree, as proposed by [TZL06].
References
[BHW08] Eda Baykan, Monika Henzinger, and Ingmar Weber. Web page language identification based on URLs. Proc. VLDB Endow., 1(1):176-187, 2008.

[DC00] Susan T. Dumais and Hao Chen. Hierarchical classification of Web content. In Nicholas J. Belkin, Peter Ingwersen, and Mun-Kew Leong, editors, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 256-263, Athens, GR, 2000. ACM Press, New York, US.

[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, July 2008.

[PG04] Ashwin Pulijala and Susan Gauch. Hierarchical text classification. In International Conference on Cybernetics and Information Technologies, Systems and Applications: CITSA 2004, pages 257, 2004.

[TZL06] Lei Tang, Jianping Zhang, and Huan Liu. Acclimatizing taxonomic semantics for hierarchical content classification. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 384-393, New York, NY, 2006.

[YP97] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning, pages 412-420, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.