
Hierarchical Classification Approach in Community-Based Question Answering Services

Artur Baniukevic banart@cs.aau.dk, Dovydas Sabonis sabonis@cs.aau.dk

Computer Science Department Aalborg University May 28, 2009

Abstract
The purpose of this paper is to build a classifier for a community-based question answering service that is able to classify user submitted questions into a set of predefined categories arranged in a hierarchical manner. We use the Naïve Bayes algorithm for classification and later extend it with the ability to use the hierarchical structure of the category tree. After implementing such a classifier we experiment with it in a number of different scenarios and try to find its advantages and disadvantages over the flat classification approach. We find that the biggest advantage of the hierarchical classification approach over flat classification is its efficiency. We were also able to increase the classification accuracy by assigning different weights to different parts of the question.

1 Introduction

With today's advancement in information technology no one is surprised to see a computer performing a task that originally involves some human intelligence or intuition. One of those tasks is the classification of various contents. With so much information available on the World Wide Web it is very hard to find the data that you need. To introduce some order into this big pile of information on the global network some companies, such as Yahoo!, are trying to classify it into categories. Of course, dealing with such a massive amount of information requires the right classification techniques: no one would want to manually read everything and then try to find the category that it best fits in. Such classification of web contents is a very common application for automated classifiers. The purpose of this paper is to build a classifier that is able to classify user submitted questions into a set of predefined categories arranged in a hierarchical manner, and to compare it with the traditional flat classification approach. In services such as Yahoo! Answers this would significantly help the user to choose the category that his or her question is most relevant to. This way there is no need for the user to go through all the available categories and decide which is the most relevant to the question: the classifier does that and presents a list of categories that best match the question. In fact, such a classifier is already employed by Yahoo! Answers, but no documentation about its implementation details has been publicly revealed.

There are a number of different methods designed for this classification assignment, such as Naïve Bayes, decision trees or Support Vector Machines (SVM). Generally the effectiveness of each method can vary depending on the set of features used for training, but given a proper amount of decent training data such classifiers can perform their task surprisingly well. However, none of the above techniques is able to classify a set of attributes into categories that are represented as a hierarchical structure. The traditional binary classifier (also known as a flat classifier), given a set of predefined attributes, is able to make a distinction between one class and the rest of them. Thus it can only predict the one category that the attributes best fit in.

The main focus of this paper is devoted to making a classifier that is able to classify a set of attributes into categories that are arranged in a hierarchical manner. Since our categories reside in a hierarchically organized structure, there is always one top level class which can have an arbitrary number of subclasses, and each subclass has at most one parent class. We design our classifier to adopt the top-down approach, so the first task for such a hierarchical classifier is to distinguish between the top level categories. Once that has been done it can go deeper into the category tree and try to find the best suiting subclass. It continues down the category tree until it reaches a leaf class. Each such iteration puts different weights on an attribute according to the level we are classifying at. For example, given the categories music, music/jazz, music/blues, health and health/diet, the word rhythm would very clearly indicate the category at the top level, but when classifying between the subcategories blues and jazz it would give us no useful information. This example shows that it is very important to experiment with different sets of attributes at different levels of classification in order to achieve decent accuracy.

The data that we are using to train our classifier is obtained from Yahoo! Answers. It is basically a huge amount of user submitted questions that can contain some additional information, such as a detailed description of the question or other details. This data is extracted from plain HTML files, then preprocessed and inserted into a database for easier data manipulation. Our category hierarchy also shares the same structure as the one Yahoo! Answers is using. Using the top-down approach makes our classifier very sensitive at the first step. If we fail to correctly determine the top class category we can never reach our destination, because every node in our category tree has at most one parent, so there is only one path to any of the lower level classes. To solve this issue we build our classifier to be able to propose more than one category for a given question. This also enables the user to select the one of the proposed categories that he or she thinks is the most relevant to the submitted question.

The following sections of the paper describe related works and how they differ from our approach. We also explain in detail our system architecture and working methods. Later we experiment with the hierarchical Naïve Bayes classifier in a number of different scenarios and try to find its advantages and disadvantages over the flat classification approach. We find that the biggest advantage of the hierarchical classification approach over flat classification is its efficiency. We were also able to increase the classification accuracy by assigning different weights to different parts of the question.

2 Related Works
There are already a number of published papers proposing original methods for the hierarchical classification approach. Susan Dumais and Hao Chen [DC00] conclude that a hierarchical classification procedure used to classify Web content achieves an improvement in accuracy compared with the flat classification approach. To obtain those results they were using a reduced-dimension binary-feature version of the SVM algorithm, which is assumed to be very efficient for both initial learning and real-time classification. The classifier was trained and tested with a large, heterogeneous collection of web content. The authors kept their focus on the case study where the category tree comprises two levels. Second level category models were trained using different sets, either from the same top level category or across all categories. Later, the scores from the top and second level models were combined using different combination rules. As a result the SVM model was extended with the ability to use a hierarchical category structure, and there was a slight improvement in both performance and accuracy.

Ashwin K. Pulijala and Susan Gauch [PG04] employed a top-down hierarchical classification approach to classify a heterogeneous collection of web content. Rather than building one big classifier, the authors decided to build a hierarchy of classifiers that focus only on a small set of classes at each level in the category tree, those relevant to the current task. Using this approach each task is divided into a set of sub-tasks, which are later classified more accurately and efficiently by the corresponding classifier. Furthermore, the accuracy is improved because the classifiers can identify and ignore the commonalities between subtopics of a specific class.

Our approach differs from the ones mentioned above by some additional procedures involved in the classification. We extend the Naïve Bayes classification algorithm with the ability to classify a set of attributes into hierarchically arranged categories. To carry out this task we employ the top-down approach, which makes our classifier very sensitive at the first step in the classification: predicting the top level category. Failure to correctly guess the top level category would always lead us to faulty results. To deal with this issue, we design our classifier to be able to suggest the three categories with the highest probabilities, so that the user can then select the one that is most relevant. The data that we are using is a huge set of user submitted questions obtained from Yahoo! Answers. To make our classifier more universal we use a very large number of features to train it, and we make sure that there is a relatively high number of features in each category. To speed up the process and make it easier to manipulate the data, we have decided to extract all the questions into a relational database. Later we experiment by including different sets of features in our training data. We also try to get better classification accuracy by including features from different parts of a question and putting different weights on them. As in the previous work by Baykan et al. [BHW08] we have decided to use a single word as a feature. This decision was made because in many cases a typical question in our data source is a relatively short sentence with just a few words in it.

3 Data flow

We design the application to classify the question data and also to test the correctness of the results. Therefore, to categorize a question as correctly as possible, a few steps need to be taken. The data flow diagram in figure 1 illustrates how data is processed by the system in terms of inputs and outputs. It breaks the process into steps and reveals the phases in which the data is affected by the application.

Figure 1: Data flow

The input of the application is 20 GB of HTML files with Yahoo! Answers question data. These web pages contain redundant data which is not useful for classification. The most relevant information, such as the question, description, details and categories, is extracted and saved into the database. In a further step, different preprocessing techniques are applied to prepare the data for the categorization process. We use a supervised classification method based on statistical calculations, which therefore demands training data. In our case we treat the training process as an aggregation of selected features: counters of the features and categories are collected and stored in the database. Later on, they are applied in the probability computation to classify the given question data. Accordingly, the output is a set of suggested categories. To improve the classification accuracy we propose a few categories instead of a single one, which could be classified wrongly.

4 Preparing the data

The data that we got from Yahoo! is stored in plain HTML files. To make it usable for classification we first need to convert it into an appropriate form. In all we have 1663663 HTML documents; each document contains a single question, all the replies to it and additional HTML formatting information that can be ignored in this case. Since we are dealing with such a huge amount of data, storing any additional information that does not directly affect the classification results would be a waste of resources. That is why we only extract those parts of the HTML that represent the user submitted question, as well as some additional details about it.

4.1 Data export

The first task in the data preparation process is to extract the information that we need from a formatted HTML document. To accomplish this task we have written a small HTML parser that goes through all HTML tags in a document and extracts only those text parts that are closely related to the user submitted question. Usually this information is only one tenth the size of the document itself, so a great amount of memory is saved. After extracting the needed information from the document, it then needs to be stored somewhere for further usage. For this task we have implemented a loader which loads the relevant parts of a document into either an XML format or a database server. We found that using a database is more convenient in this case because of the remarkably large amount of data involved.
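The parser itself is not listed in the paper. As a rough sketch of the idea, a minimal extractor built on Python's standard html.parser module could look as follows; the class names used to locate the question parts are hypothetical, since the actual Yahoo! Answers markup is not documented here:

import sys
from html.parser import HTMLParser

class QuestionExtractor(HTMLParser):
    # Hypothetical CSS class names; the real Yahoo! Answers markup differs.
    INTERESTING = {"question", "description", "details", "best-answer"}

    def __init__(self):
        super().__init__()
        self.parts = {}      # part name -> list of text fragments
        self.current = None  # the part currently being read

    def handle_starttag(self, tag, attrs):
        css = dict(attrs).get("class", "")
        if css in self.INTERESTING:
            self.current = css

    def handle_endtag(self, tag):
        self.current = None  # crude: any closing tag ends the current part

    def handle_data(self, data):
        if self.current and data.strip():
            self.parts.setdefault(self.current, []).append(data.strip())

extractor = QuestionExtractor()
with open(sys.argv[1], encoding="utf-8") as f:
    extractor.feed(f.read())
print(extractor.parts)  # only the question-related fraction of the document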

4.2 Data preprocessing

Once we have the needed information stored in a database we take additional steps to prepare the data for the classification. The first step in this phase is to filter the stop words - the words that give us no additional information when classifying the document. Additionally, we also eliminate all the words that are composed of numerical symbols, because there was quite a big number of such information-less features in our database. Once those words are eliminated we employ the Porter stemming algorithm to stem all the words. This way similar words are grouped together in the classification process rather than being treated as separate features. Not only should that improve the classification accuracy a bit, but we also have fewer features to store in the database, thus saving some memory and improving querying performance.
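As an illustration of this pipeline, here is a minimal sketch using the Porter stemmer from NLTK (an assumption; the paper does not say which implementation was used) with an abbreviated stop word list:

import re
from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "in"}  # abbreviated list
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, drop stop words and purely numerical tokens, stem the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens
            if t not in STOP_WORDS   # stop words carry no class information
            and not t.isdigit()]     # numerical tokens are information-less
    return [stemmer.stem(t) for t in kept]

print(preprocess("The 2 guitars playing in blues music"))
# e.g. ['guitar', 'play', 'blue', 'music'] - similar words share one stem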

4.3 Data model

The architecture of the database is illustrated in figure 2. There are two main tables in our database: categories and features. The reason to include a third table, named catfeatures, is the fact that in our case each feature can belong to a number of categories. This is dictated by the hierarchical structure of the category tree: every feature always has one top level category and an arbitrary number of lower level categories. Each category, in turn, can have many features, thus creating a many-to-many relation between the two mentioned entities. We have decided to represent each single word as one feature.

Figure 2: Database schema

The parent_id field in the categories table is used to determine the parent of the current category. This field is only used in lower level categories because top level categories have no parent. The counter field shows the number of features that the current category contains, while the quest_counter field represents the number of questions that the current category contains. These counter fields are used to compute the prior and likelihood probabilities in the Naïve Bayes formula. The level in the category tree is stored in the hlevel field: the lower the level, the bigger the value set in this field, with top level categories having a value of 1. The counter field in the catfeatures table represents the number of occurrences of the feature with feat_id that reside in the current category (determined by cat_id). Besides the name of the feature (feature), the features table also has a counter column, which shows how often this feature is encountered in the vocabulary (its frequency).

4.4 Feature selection

Feature selection is the process of choosing a subset of variables while eliminating features that carry no useful information. It contributes to classifier accuracy. Our feature selection, however, is specific to the Yahoo! Answers system. In general we decided to choose words as features based on the results of [BHW08], where a reasonable amount of training data combined with words as features can lead to good performance using Naïve Bayes. In other words, a feature in our case is a question data token which is selected after preprocessing. The question data extracted from the HTML files consists of the question, its description, details and the best answer. Our aim is to investigate whether each of these question data units contributes equally and which part is the most significant. Moreover, we bring forward an assignment controlling the contribution of each unit. To achieve that we introduce question data element weights: we assign each element a weight from 1 to 4, and thus the contribution of the element changes. These weights are mainly applied while computing the likelihood probabilities in the Naïve Bayes classifier. If we do not use some question data element, we simply use weight 0 for that element, or do not include the element in the set of attributes selected from the question table to be preprocessed. Consequently, we experiment with feature selection using various weights with different training data in order to improve the classification results.
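Concretely, the weighting can be pictured as follows: each token contributes its element's weight, instead of 1, to the per-class feature counts that feed the likelihood estimates. This is our own minimal sketch of that bookkeeping, not the authors' code:

# Example weights for the four question data elements (0 disables one).
WEIGHTS = {"question": 4, "description": 3, "details": 2, "answers": 1}

def weighted_counts(parts, counts):
    """parts maps an element name to its list of preprocessed tokens;
    each token adds its element's weight to the running feature counts."""
    for element, tokens in parts.items():
        w = WEIGHTS.get(element, 0)
        if w == 0:
            continue  # weight 0 excludes the element entirely
        for token in tokens:
            counts[token] = counts.get(token, 0) + w
    return counts

counts = weighted_counts(
    {"question": ["guitar", "blue"], "answers": ["guitar"]}, {})
print(counts)  # {'guitar': 5, 'blue': 4}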

5 Classification

5.1 Naïve Bayes probabilistic model

The Naïve Bayes algorithm is one of the most popular methods used for text classification. As the name suggests, this method is based on the Bayes theorem (1) for calculating conditional probabilities:

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)} \qquad (1)$$

where P(A) is the prior probability of event A, P(A|B) is the conditional probability of A given B, P(B|A) is the likelihood, i.e. the conditional probability of B given A, and P(B) is the evidence, the probability of B, which acts as a normalizing constant.

Using equation (1) we can construct a probability model of a document D with a set of features F1, ..., Fn being in class C, as shown in formula (2):

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\,P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)} \qquad (2)$$

P(C) is the prior probability of a document occurring in class C. If the set of features in a document does not supply enough information to distinguish between a set of classes, we choose the class that has the higher prior probability. Knowing the prior probability P(C) and the likelihood P(F1, ..., Fn | C), the most probable class can be determined without the evidence P(F1, ..., Fn). This is because P(F1, ..., Fn) is the same constant value throughout all the classes; its value does not change from class to class. After removing the denominator of the fraction in (2) and making the naïve assumption that the features are conditionally independent given the class, we get the simplified model shown in equation (3):

$$P(C \mid F_1, \ldots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C) \qquad (3)$$

5.1.1 Flat Naïve Bayes classifier

Using the above described model we can now construct a flat Naïve Bayes classifier. Having a probability model, we now need to define our decision rule. The most common choice in this situation would be to use the class that has the highest probability. This is known as the MAP (maximum a posteriori) decision rule. According to [MRS08], for a set of classes C it can be defined as follows:

$$C_{map} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(F_i \mid c) \qquad (4)$$

Because in equation (4) we multiply a lot of probabilities, which usually tend to be pretty small values, there is a rather high possibility of floating point underflow. One possible solution to this problem is to add the logarithms of the probabilities rather than multiply their real values. Because the logarithm function is monotonic and because of identity (5), the class with the highest probability will still have the highest logarithm value.

$$\log(x \cdot y) = \log(x) + \log(y) \qquad (5)$$

So a better implementation of the MAP decision rule would look like this:

$$C_{map} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{n} \log P(F_i \mid c) \right] \qquad (6)$$

In the case of text classification, the prior probability P(c) can be expressed as the following fraction:

$$P(c) = \frac{N_c}{N} \qquad (7)$$

where N_c is the number of documents in class c and N is the total number of documents. The likelihood P(F|c) is calculated as follows:

$$P(F \mid c) = \frac{F_c}{\sum_{F'} F'_c} \qquad (8)$$

Here, F_c is the number of occurrences of feature F in the training documents of class c, and the denominator sums the counts of all features F' in the training documents belonging to class c. The critical disadvantage of equation (8) is encountered when we calculate the probability of a feature-class combination that did not occur in the training data: a single occurrence of such a feature forces the whole expression to evaluate to zero. A simple technique called Laplace smoothing is used to avoid this awkward situation. It simply adds one to each count, as shown in equation (9):

$$P(F \mid c) = \frac{F_c + 1}{\sum_{F'} (F'_c + 1)} \qquad (9)$$
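To make the formulas concrete, here is a compact sketch of equations (6), (7) and (9) in Python, assuming the per-class feature counts have already been aggregated (the variable names are ours):

import math

def train(feature_counts, doc_counts):
    """feature_counts: class -> {feature: count}; doc_counts: class -> N_c.
    Returns log priors (eq. 7) and Laplace-smoothed log likelihoods (eq. 9)."""
    vocab = {f for counts in feature_counts.values() for f in counts}
    n_docs = sum(doc_counts.values())
    log_prior, log_like = {}, {}
    for c, counts in feature_counts.items():
        log_prior[c] = math.log(doc_counts[c] / n_docs)
        total = sum(counts.values()) + len(vocab)  # add one per vocabulary word
        log_like[c] = {f: math.log((counts.get(f, 0) + 1) / total) for f in vocab}
    return log_prior, log_like, vocab

def classify(features, log_prior, log_like, vocab):
    """MAP decision rule of equation (6); summing logs avoids underflow."""
    def score(c):
        return log_prior[c] + sum(log_like[c][f] for f in features if f in vocab)
    return max(log_prior, key=score)

lp, ll, v = train({"music": {"rhythm": 3}, "health": {"diet": 2}},
                  {"music": 4, "health": 4})
print(classify(["rhythm"], lp, ll, v))  # music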

5.1.2 Introducing hierarchy

To extend the flat Naïve Bayes classifier with the ability to use hierarchically arranged categories we follow the straightforward and simple idea illustrated in figure 3. We start classifying at the top level of the category tree. Then we select the category with the highest probability and apply the classification only to its children. Since we know that all the documents reside at the lowest level of the category tree, we continue this process until we reach a leaf node, where no further classification is possible.

Figure 3: Hierarchical classification approach

As can be seen from the figure above, this way we are able to save a lot of time and resources: we do not need to include all classes in the classification process. Assume that the document being classified belongs to the class named 24, as depicted in figure 3. A flat classifier would iterate through all the classes in the tree, compute all the probabilities and return the class with the highest probability. However, by employing the simple technique described above, we are able to significantly reduce the number of classes involved in the classification process, which in turn remarkably increases the performance of our algorithm. At first, such a classifier needs to classify a document between the three top level classes 1, 11 and 12. It finds the one with the highest probability, in our case 11, and then applies the classification only to its two children classes, 23 and 24. After distinguishing between those two nodes our classifier is ready to return the result. In this case only five classes truly participated in the classification procedure: three top level (1, 11, 12) and two second level (23, 24) categories.
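Below is a minimal sketch of this top-down descent (the tree representation and names are ours), where any per-level scorer, such as the flat classifier above restricted to the given candidates, can be plugged in:

def classify_top_down(features, tree, best_of):
    """tree maps a parent to its child categories ('' denotes the root);
    best_of(features, candidates) returns the most probable candidate.
    Only the siblings at each level are ever scored."""
    node = ""
    while tree.get(node):                  # descend until a leaf is reached
        node = best_of(features, tree[node])
    return node

# The category tree from figure 3: top level 1, 11, 12; class 11 has 23, 24.
tree = {"": ["1", "11", "12"], "11": ["23", "24"]}
stub = lambda f, cands: "11" if "11" in cands else "24"  # stands in for eq. (6)
print(classify_top_down(["rhythm"], tree, stub))  # 24 - only 5 classes scored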

6 Training and test data

One of the crucial factors in classification quality is the selection of training and test data; low-quality selected data leads to low-quality classification. As the size of the question data is very large, we use only a subset of all questions for the various experiments. As shown in figure 4, the bulk_questions table, where all questions reside, is transformed into the training data table and the test data table. The test data is also selected from the bulk_questions table because we want to evaluate the accuracy of the classifier against a category predefined before classification.

Figure 4: Collecting training and test data

To prepare training and test data we employ procedure 1. There are three parameters: cntTrain, cntTest and cntCat. The cntTrain and cntTest variables define the amount of questions per category for the training data and test data respectively; cntCat determines how many top level categories are chosen. In the procedure we first select questions assigned to predefined or randomly chosen categories. Predefined categories are basically chosen manually according to a particular domain; we select categories which belong to domains that do not overlap. The total amount of questions in each category should not be less than 100. This is because categories with too small a total number of questions could be assigned only to the training data set and would then never be tested. The selected questions are divided into the training and test data accordingly; the amount of questions for each category is determined by the procedure's input parameters. For the various experiments we use different input parameters in order to get training and test datasets of various sizes.

Procedure DivToTrTst(cntTrain, cntTest, cntCat)
Output: Training data, Test data

    classes <- cntCat top level classes, predefined or randomly selected;
    remove all subcategories where counter < 100;
    cnt <- 0;
    while there is a row in bulk_questions do
        read question from bulk_questions table;
        if question is from classes then
            if cnt < cntTrain then
                if question is not in training data then
                    insert question into the training data table;
                    cnt <- cnt + 1;
                end
            else if cnt >= cntTrain and cnt < cntTrain + cntTest then
                if question is not in test data then
                    insert question into the test data table;
                    cnt <- cnt + 1;
                end
            end
        end
    end
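Read as code, and interpreting the counter per category (which is how the per-category quotas described above must behave), the procedure amounts to the following sketch; the duplicate checks against already-inserted questions are omitted for brevity:

def divide_to_train_test(questions, classes, cnt_train, cnt_test):
    """questions: iterable of (question_id, category) rows from bulk_questions;
    classes: the selected top level classes. Fills the per-category training
    quota first, then the test quota, and skips the remaining questions."""
    train, test, seen = [], [], {}
    for qid, cat in questions:
        if cat not in classes:
            continue
        cnt = seen.get(cat, 0)
        if cnt < cnt_train:
            train.append((qid, cat))
        elif cnt < cnt_train + cnt_test:
            test.append((qid, cat))
        seen[cat] = cnt + 1
    return train, test

rows = [(i, "music" if i % 2 else "health") for i in range(10)]
tr, te = divide_to_train_test(rows, {"music", "health"}, 2, 1)
print(len(tr), len(te))  # 4 2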

7 Evaluation

In this section we present the results of comparing the flat Naïve Bayes classifier and the one that uses the hierarchical category structure. To evaluate the accuracy of each classifier we calculate the F-measure based on the precision and recall values that were retrieved in a number of different scenarios. We determine this value for three different classifiers. In the first case we estimate the F-score for the flat Naïve Bayes classifier (FNB). Later we compare it with two versions of classifiers that use the hierarchical classification approach. As mentioned in the sections above, we design the classifier to pick three categories and propose those to the user, so in this case we assume that if at least one of those categories is predicted correctly, the classification result is correct. Later in this section this classifier is referred to as Hierarchical Naïve Bayes 3 (HNB3). The third classifier that we evaluate is similar to the previously described one, but instead of suggesting the three categories with the highest probabilities, it only suggests one, like the flat classifier. In the evaluation results this classifier is labeled Hierarchical Naïve Bayes 1 (HNB1).

7.1 Measurement explanation

For evaluating the accuracy of our classification results we are using the F-measure. The F-measure or F-score is a common method used to evaluate the accuracy of a supervised prediction system. To calculate the F-measure it is first necessary to determine the precision and recall values, thoroughly studied in [Seb02]. This can be achieved by constructing a confusion matrix first. A typical confusion matrix can be constructed as illustrated in figure 5.

Figure 5: Structure of the confusion matrix

Assume that in our test data there is an attribute A that actually belongs to class C. We run our classifier to find the class with the highest probability for attribute A. If the classifier predicts the correct class for this attribute, in our case class C, then we have a True Positive (TP) case. Otherwise, if the classifier assigns this attribute to a class other than C, such a case is labeled as a False Negative (FN) in the confusion matrix. Similarly we obtain the False Positive (FP) and True Negative (TN) cases. After summing up all such cases in the classification results we can compute the precision and recall values using the following simple equations:

$$Precision = \frac{TP}{TP + FP} \qquad (10)$$

$$Recall = \frac{TP}{TP + FN} \qquad (11)$$

As can be seen from equation (10), the intuitive meaning of the precision value can be expressed in words as the percentage of positive predictions that are correct. Similarly, recall shows us the percentage of positively labeled instances that were predicted as positive. After obtaining the precision and recall values, we can very easily compute the F-measure for our classifier. The formula for calculating it is given in equation (12):

$$F = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \qquad (12)$$

The expression in equation (12) is also known as the F1 value because the precision and recall values are evenly weighted. All the classifier evaluations in this paper are based on this formula.
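As a worked example of equations (10)-(12), precision, recall and F1 can be computed directly from the confusion matrix counts; the counts below are made up for the demonstration:

def f1_score(tp, fp, fn):
    """Precision (eq. 10), recall (eq. 11) and F1 (eq. 12) from the counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

print(f1_score(tp=50, fp=30, fn=20))  # approx. (0.625, 0.714, 0.667)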

7.2 Comparing different classification methods

This section provides the results that we got when comparing the three different classification methods (FNB, HNB1 and HNB3) described earlier. The classifiers were tested in two different scenarios. First we test how the amount of training data affects the classification accuracy of each classifier. In this case we do not include more categories in the training data set; instead, we just increase the amount of questions that each category contains. In the second case we try to increase the number of categories that reside in the training data set and figure out which classification method is more appropriate in such a situation.

7.2.1 Dependence on the size of training data

The first experiment that we made to compare the accuracy of each classifier was changing the amount of questions in the training data set. During this experiment our training data set was composed of 45 different categories. We tested the classifiers in three different cases, each time increasing the number of questions in each category. In the first case we had 100 questions in each category, which makes 4500 questions in total. Later this number was increased to 13000 and 22000 questions, as shown in figure 6. As test data we used 867, 1260 and 4200 questions respectively for the three cases mentioned before.

Figure 6: Varying amount of training data

As we expected, increasing the amount of training data resulted in a more accurate classification outcome. The F-scores of the flat and the first version of the hierarchical Naïve Bayes classifier are very similar: in all cases they vary only by 0.1% to 1.4%. In the first step the HNB1 reacted very sensitively to the increased amount of training data and showed almost the same accuracy as the flat classifier. However, further increases in the size of the training data were not as noticeable as in the case of FNB. The HNB3 was more noticeably ahead of the other classification methods: compared with the flat classifier the improvement achieved in F-score ranged from 3.68% to 5.76%. Figure 6 also shows us that the increases in accuracy of both hierarchical classifiers follow a very similar pattern. These classifiers react to the increased amount of training data more sensitively than the flat one. By increasing the number of questions in the training data from 4500 to 22000, HNB1 and HNB3 gave us respectively 20.9% and 21.4% improvement in accuracy, while the FNB had an increase of 19.4%.

7.2.2 Classification efficiency

As was revealed after the experiments, the major advantage of HNB1 is the classification efficiency. In the case of having 21490 questions in the training data set and testing with 4200 questions from 42 categories, the HNB1 was able to finish this task in 4 to 5 minutes. Compared with the FNB, which accomplished the same task in 24 to 50 minutes, this seems to be a significant improvement.

7.3 Experimenting with features

Since the main focus of this paper is directed towards the concept of the HNB3 classifier, we no longer take FNB and HNB1 into consideration. Because we use structured data from Yahoo!, we can further experiment by changing the weights of each part of a question. As mentioned before, the typical question in our data source is composed of the following four parts:

- the question itself (later labeled as Q)
- a more detailed description of the question (Desc)
- additional details (Det)
- the answers to the question (Answ)

Below we present three tables with the precision, recall and F-score values for the HNB3 classification results. In each table we define different weights for each part of the question. In total we test the classification results in three different cases; as before, in each case we gradually increase the amount of questions in our training data set.

7.3.1 Case I

In the first case we train our classifier with 4500 questions divided into 45 categories, which results in 100 questions contained in each category. Table 1 illustrates the results of 7 different classification iterations. In each iteration we assign different weights to different parts of the question. The test data that we are using to test our classifier in this case is composed of 867 questions and 45 categories; each category in the test data set has more or less 20 questions. It is very important to note that in the tables below we present the average precision, recall and F1 values. Those were computed taking into account all 42 categories that reside in our test data.

Q  Desc  Det  Answ  P       R       F1
1  1     1    0     0.5183  0.5251  0.4955
1  1     1    1     0.5339  0.5330  0.5010
3  2     1    4     0.5162  0.5318  0.4990
4  3     2    1     0.5427  0.5541  0.5277
4  2     1    3     0.5304  0.5452  0.5133
1  3     2    0     0.4959  0.5030  0.4699
1  0     0    0     0.5183  0.5252  0.4955

Table 1: Case I: Experimenting with weights

From table 1 we can clearly see that the most important part is the question itself. Including the description and additional details in the classification process does not actually have a big effect on the F1 value. However, if we put bigger weights on those two attributes than on the question itself, it can significantly lower the accuracy of our classification results (row 6). In general, it seems a good idea to include all parts of the question: each one of them can increase the accuracy if appropriately weighted. The best accuracy was achieved when putting weights 4, 3, 2 and 1 on the question, description, details and answers respectively. Compared with the case of using the question alone, we got a 6.5% improvement in the F-score.

7.3.2 Case II

In the second case we use 13090 questions in our training data set. Again they are divided into 45 different categories; each category contains about 290 questions. We tested the classifier with 1260 questions that were split into 42 categories. Table 2 shows the achieved results.

Q  Desc  Det  Answ  P       R       F1
1  1     1    0     0.5821  0.5833  0.5699
1  1     1    1     0.6073  0.5968  0.5886
3  2     1    4     0.6025  0.5896  0.5841
4  3     2    1     0.6133  0.6023  0.5975
4  2     1    3     0.6067  0.5929  0.5882
4  3     1    2     0.6115  0.6031  0.5965

Table 2: Case II: Experimenting with weights

Again, as in the previous case, it confirms that in order to achieve the highest accuracy we need to include all the parts of the question that are available. The two best weight combinations in this case are situated in rows 4 and 6. Yet again, the overall best weight combination remains 4, 3, 2, 1. This time the improvement in the F-score after finding the best weight combination was a little bit lower: 4.8%. We suppose that this is due to the fact that the classifier was trained with a noticeably bigger training data set and gathered enough information to better distinguish between the different categories. That way the additional information had a lower impact on the classification accuracy.

7.3.3 Case III

The third case gives us the best results. This is a good example that the amount of training data has the most influence on the classification accuracy. In this case our training data set comprised 21490 questions in total. Again, those were divided into 45 categories, with approximately 480 questions in each. The test data in this case is composed of 4200 questions evenly split into 42 categories, which gives us 100 questions for each category. Notice that the number of classification iterations in this case is only 4, while in the first case we were able to experiment with 7. This is because the amount of our training data got bigger and it takes a lot of time for the classifier to finish the classification. Table 3 sums up the results achieved in this case.

Q  Desc  Det  Answ  P       R       F1
1  1     1    0     0.6134  0.6155  0.6080
1  1     1    1     0.6246  0.6202  0.6082
3  2     1    4     0.6092  0.6117  0.6016
4  3     2    1     0.6263  0.6269  0.6195

Table 3: Case III: Experimenting with weights

This case follows the same scenario as the previous ones. The overall best weight combination remains 4, 3, 2, 1 (row 4). As predicted in the previous case, after further increasing the amount of training data, the improvement in the F-measure achieved after assigning the weights was even smaller: 1.9%.

8 Conclusions

After completing the experiments described in the previous section, it is clearly visible that the hierarchical (HNB1) and flat (FNB) classifiers are very close to each other in terms of classification accuracy. By increasing the amount of training data from 4500 to 22000 questions the difference in accuracy between those two classifiers varied only by 0.1% to 1.4%. As expected, the HNB3 was able to demonstrate the best accuracy amongst the tested methods. Compared with FNB, the improvement in the F-measure ranged up to 5.76%. Using the hierarchical classification approach, again, makes it much faster than the flat classifier. However, since it needs three classification iterations, it cannot complete the classification as fast as HNB1. In the same case as described above this classifier took 6 to 8 minutes to classify 4200 questions. This is a little bit worse than in the HNB1 case but still a major improvement compared with the results of the flat classifier.

The idea of assigning different weights to different parts of the question seems to be a helpful technique for increasing the classification accuracy. By assigning appropriate weights to the different elements of a question we were able to improve the F-score by up to 6.5%. Weighting different attributes seems to have a lower impact on the classification accuracy when we train our classifier with a larger amount of training data. This way the classifier gathers enough information to better distinguish between the different categories and any additional information becomes less helpful. It is very important to assign suitable weight values for each part of the question. The overall best weight combination for question, description, details and answers is accordingly 4, 3, 2 and 1. It is not advised to assign a relatively high weight to the description or details, because 55% of all the questions in our data source contain a description and only 13% of the questions have additional details; in all other cases those fields are left blank. Including all the answers to the question in the classification process can have a positive effect on the classification outcome. However, such features should be weighted fairly low, because occasionally answers can cover a very broad range of topics that are not necessarily closely related to the question.

As for pursuing better classification accuracy, possible future work could experiment with the different feature selection models described by [YP97]. Another approach would be to assume that a question can reside in a category which is not necessarily located at a leaf of the category tree. Additional classification techniques need to be employed for this method to produce decent results; a possible solution could be to set a threshold value at each level of the category tree, as proposed by [TZL06].

References

[BHW08] Eda Baykan, Monika Henzinger, and Ingmar Weber. Web page language identification based on URLs. Proc. VLDB Endow., 1(1):176-187, 2008.

[DC00] Susan T. Dumais and Hao Chen. Hierarchical classification of Web content. In Nicholas J. Belkin, Peter Ingwersen, and Mun-Kew Leong, editors, Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, pages 256-263, Athens, GR, 2000. ACM Press, New York, US.

[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, July 2008.

[PG04] Ashwin Pulijala and Susan Gauch. Hierarchical text classification. In International Conference on Cybernetics and Information Technologies, Systems and Applications: CITSA 2004, pages 257-262, Orlando, FL, 2004.

[Seb02] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1-47, March 2002.

[TZL06] Lei Tang, Jianping Zhang, and Huan Liu. Acclimatizing taxonomic semantics for hierarchical content classification. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 384-393, New York, NY, USA, 2006. ACM.

[YP97] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning, pages 412-420, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
