
1st International Conference of Recent Trends in Information and Communication Technologies

A Comparative Study on Different Types of Approaches to Arabic Text Classification
Rasha Mamoun*, Mahmoud Ali Ahmed

Faculty of Mathematical Science, University of Khartoum, Sudan

Abstract

The number of Arabic documents published online is in the order of millions, a result of the progress in digital data and storage technology. Storing these data in a systematic way has therefore become an important issue. The aim of this paper is to highlight the most important algorithms applied to text document classification and to give a comparative study of the different approaches to text categorization used in the available Arabic text classification literature. It compares the accuracy of the algorithms used and their results within that literature. Several algorithms, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naïve Bayesian (NB), Distance-Based, C5.0, J48 and Rocchio, have been proposed for Arabic document classification. The authors used different measures to test their classifiers and different corpora. The experimental results in the literature illustrate that the Support Vector Machine algorithm is an efficient method, giving a generalization accuracy of about 90%.

Keywords: Text classification, Arabic text classification, classifier.

*Corresponding author: rashamamoun@yahoo.com

IRICT 2014 Proceeding


12th -14th September, 2014, Universiti Teknologi Malaysia, Johor, Malaysia

1 Introduction

Today, huge amounts of information are easily available from online text documents, and a lot of knowledge is hidden within these documents. When the right intelligent tools are identified and applied to these documents, this knowledge can be easily extracted. Machine learning (ML) deals with the design and development of algorithms and techniques that allow a system to learn from data in order to improve its performance [1]. The problem facing most researchers in the field of machine learning is how to select the most suitable model or algorithm for a given application [2]. Data mining is one of the most important machine learning applications. [4, 3] define data mining as the process of analyzing the large quantities of data stored in computers and discovering patterns in them. Machine learning and data mining techniques are used to automatically discover patterns in documents and classify them [5]. Data mining reduces costs in time and money, which has made it popular in science, analysis and healthcare. Text classification is one of the most important research issues and a common technique in data mining: it labels documents with thematic categories from a predefined set known in advance [6].

Despite the huge growth in the number of Arabic documents available online and the development of computer hardware that provides enough computing power, there is a lack of work in this area [7, 8, 9, 10, 11]. Some researchers attribute this to the fact that classifying Arabic text is more difficult than classifying other languages such as English and other European languages [7]. This is due to the high inflection and richness of the Arabic language, which increase the size of the feature vectors [12, 13]. On the other hand, the difficulty of selecting the most suitable model for a given application makes the task more complex [2].

This paper presents a brief scientific overview of text classification algorithms and compares them. Different types of algorithms have been used to classify Arabic text in the literature. A comparison between the accuracy of these classifiers and their results is presented.

2 Text classification

Text classification is one of the most important research issues and a common technique in data mining. It is used to extract useful information from large amounts of unstructured data. [14, 15] define text categorization as the process of classifying or structuring documents into a predefined set of categories according to a group structure that is known in advance. The term text categorization is sometimes taken to mean sorting documents by content, while text classification is used for any kind of assignment of documents to classes, such as sorting by author, by publisher, or by language [16], but in general the two terms are used with the same meaning. Text classification has been used in many applications such as e-mail filtering and routing, news monitoring, spam filtering, automated indexing of scientific articles in e-libraries, searching for interesting information on web sites, authorship attribution, and classification of news articles [1]. There are two main approaches to text categorization: rule based and machine learning based [15]. The main difference between the two approaches is that in the machine learning approach the classification rules are defined automatically using sample labeled documents, while in the rule-based approach the classification rules are created manually by experts [7]. Text classification systems can be separated into two main phases: the information retrieval phase and the classification phase [17].

3 Text classification Algorithms

An algorithm is a well-defined procedure that takes data as input and, after a finite number of steps, terminates and produces output in the form of models or patterns [18]. There are many text classification algorithms with good performance. The performance of a classification algorithm is greatly affected by the quality of the data source; irrelevant and redundant features may reduce the quality of the result [5]. The accuracy of a classifier can be expressed in terms of Recall, Precision and F1. Precision and recall are widely used evaluation measures in Information Retrieval and Machine Learning [19]. Precision is the ratio of the documents correctly categorized as responsive to all documents categorized as responsive, while Recall is the ratio of responsive documents found and categorized correctly to the total number of responsive documents in the full collection. F1 is a weighted average of precision and recall [20]. The most important supervised text classification algorithms covered in our survey are Support Vector Machine, K-Nearest Neighbor, Naïve Bayes, and Decision Trees [1].
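
As a small worked illustration of these measures, the following Python sketch (with hypothetical counts, not taken from any of the surveyed experiments) computes precision, recall and F1 for a single category:

def precision_recall_f1(tp, fp, fn):
    # tp: documents correctly assigned to the category (true positives)
    # fp: documents wrongly assigned to the category (false positives)
    # fn: documents of the category that were missed (false negatives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=10, fn=20))  # approximately (0.9, 0.818, 0.857)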

3.1. Support Vector Machine (SVM)

The support vector machine [21] is a statistical algorithm based on the Structural Risk Minimization principle [5]. SVM classifiers are binary classifiers. The general idea of SVM is to find a hypothesis that can guarantee the lowest true error, and the main goal of SVM learning is to find the optimal separating hyper-plane that has the maximal margin to both sides [7, 22]. The need for both a positive and a negative training set is the main characteristic of SVM compared to other text classification algorithms [1]. It uses the hyper-plane to separate the positive from the negative data in the n-dimensional space [5].

Advantages: it has been recognized as one of the most effective text classification methods [7]. Disadvantages: it needs a large number of features and a large training set to work properly, because a clearer optimal hyper-plane is obtained as the number of features increases [22, 23, 24].
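
The following is a minimal sketch of an SVM text classifier, assuming the scikit-learn library and two tiny hypothetical documents; it illustrates the general technique rather than the exact setup of any surveyed paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training documents and category labels.
train_docs = ["goal match team player", "bank market price stock"]
train_labels = ["sport", "economy"]

# Documents are turned into tf.idf feature vectors, then a linear SVM
# learns the separating hyper-plane between the two categories.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)
print(model.predict(["the team scored a late goal"]))  # likely ['sport']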

3.2. K-Nearest Neighbor

K-Nearest Neighbor (kNN) is a well-known pattern recognition algorithm [1] and a statistical learning algorithm. The idea is that, in the multidimensional space, the document being classified is represented as a point; we look around it for the k nearest neighbors among the training documents and use the categories of these k nearest neighbors to weight the candidate categories. Only the k nearest neighbors are considered [25]. The probability that a document belongs to a category is measured by how that category is distributed among the nearest neighbors [5]. KNN depends only on comparing the features of the test article with the features of the training articles in order to find the K nearest training articles [23].

Advantages: it is a simple and effective algorithm that can be easily implemented [27]. It performs well on classification tasks with multi-categorized documents. Disadvantages: the success of the classification depends on the value of k, so selecting the value of k is a major drawback of the classifier [12]. Unlike the support vector machine, this algorithm becomes slow as the size of the training set grows, and it requires more time to classify objects [21].
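
A minimal kNN sketch follows, again assuming scikit-learn and hypothetical documents; the value of k (n_neighbors) and the cosine distance are illustrative choices, not those of the surveyed studies.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_docs = ["goal match team player", "coach season league win",
              "bank market price stock", "trade economy export growth"]
train_labels = ["sport", "sport", "economy", "economy"]

# The test document is compared to every training document and the
# majority category among its k nearest neighbours is returned.
model = make_pipeline(TfidfVectorizer(),
                      KNeighborsClassifier(n_neighbors=3, metric="cosine"))
model.fit(train_docs, train_labels)
print(model.predict(["final match of the league"]))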

3.3. Naïve Bayes

The Naïve Bayes classifier is based on applying Bayes' theorem with strong independence assumptions. It is a classification approach based on probability: the Naïve Bayesian classifier assumes that the features of the input feature vector are statistically independent [12, 25].

Advantages: this method requires only a small amount of training data to estimate the parameters necessary for classification. The Naïve Bayes classifier is a very simple classifier that works very well on numerical and textual data. It is very effective and non-parametric in nature [19]. Disadvantages: it has trouble dealing with noise or irrelevant features in the training data.
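
A minimal Naïve Bayes sketch is shown below, assuming scikit-learn's multinomial variant and hypothetical documents; the surveyed papers do not specify a particular implementation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["goal match team player", "bank market price stock"]
train_labels = ["sport", "economy"]

# Word counts are modelled per category under the assumption that the
# features (words) are statistically independent given the category.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["market prices rose today"]))  # likely ['economy']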

3.4. Decision Trees

A decision tree is a special form of a rule set in which each non-terminal node represents a decision on the considered data item. Decision trees classify documents by sorting them according to the values of their features [19].

Advantages: this method works on data of any type and is fast even in the presence of a large number of attributes. Disadvantages: the algorithm may be unable to assign a document to exactly one category when rules from different rule sets apply at the same time [25]. The main risk when implementing a decision tree is that it overfits the training data when an alternative tree would generalize better [5].
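
The sketch below illustrates a decision-tree text classifier using scikit-learn's CART implementation; this is an assumption made for illustration, since the surveyed work relied on C5.0 and J48, which are related but distinct decision-tree algorithms.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

train_docs = ["goal match team player", "coach season league win",
              "bank market price stock", "trade economy export growth"]
train_labels = ["sport", "sport", "economy", "economy"]

# Each internal node tests one word feature; limiting the depth is a
# common way to reduce overfitting of the training data.
model = make_pipeline(CountVectorizer(),
                      DecisionTreeClassifier(max_depth=3, random_state=0))
model.fit(train_docs, train_labels)
print(model.predict(["stock market report"]))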

4 Literature Survey

Despite the increase in the number of available Arabic documents, there is a lack of research in Arabic text categorization [7, 8, 9, 10, 11]. There is also only one commercial automatic Arabic text categorizer, referred to as the Sakhr Categorizer [26]. The following studies deal with Arabic text categorization:

4.1. Support Vector Machine Text Classifier for Arabic Articles: Ant Colony Optimization-based Feature Subset Selection.

The author implemented a support vector machine (SVM), a Naïve Bayesian (NB) classifier and a k-nearest neighbor (KNN) text classifier for Arabic articles. The research investigated the effectiveness of six state-of-the-art feature subset selection (FSS) methods, and an Ant Colony Optimization based feature subset selection (ACO-based FSS) method was presented. The results were reported in terms of the Precision, Recall and F1 accuracy measures. Because no standard Arabic corpus was available, the author used an in-house Arabic corpus consisting of 1445 documents classified into 9 categories. In the experimental results the SVM classifier, at 90%, outperformed the Naïve Bayesian (NB) and k-nearest neighbor (KNN) classifiers [7].

4.2. Arabic Text Categorization.

The author compared the performance of three classifiers for Arabic text categorization: Naïve Bayes, K-Nearest Neighbor and Distance-Based. Recall, precision, error rate and fallout were used to compare the accuracy of the classifiers. Because, as the author mentions, no Arabic corpus was available, an in-house Arabic corpus consisting of 1000 documents divided into 10 categories was used. Unclassified documents were preprocessed by removing punctuation marks and stopwords, and each document was then represented as a vector of words. Word stemming was used to reduce the dimensionality of the feature vectors. The accuracy of the classifiers was expressed in terms of Precision, Recall, Fallout and error rate. The experimental results show that the Naïve Bayes classifier outperforms the other two (more than 95%) [13].
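
The preprocessing steps described above (punctuation removal, stop-word removal and stemming) could be sketched in Python as follows; NLTK's Arabic stop-word list and ISRI stemmer are assumptions made for illustration, not the tools actually used in [13].

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("stopwords", quiet=True)  # provides the Arabic stop-word list
arabic_stopwords = set(stopwords.words("arabic"))
stemmer = ISRIStemmer()

def preprocess(text):
    # Remove punctuation marks and digits, keeping word characters and spaces.
    text = re.sub(r"[^\w\s]|\d", " ", text)
    # Drop stop words and reduce the remaining tokens to their stems.
    return [stemmer.stem(token) for token in text.split()
            if token not in arabic_stopwords]

Each preprocessed document can then be represented as a vector of the remaining stems, for example with term-frequency or tf.idf weights.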

4.3. An Intelligent System for Arabic Text Categorization.

The authors used the K-Nearest Neighbor and Rocchio classifiers for the classification process. They mention that, because no Arabic corpus was available, they used an in-house Arabic corpus consisting of 1132 documents collected from the three main Egyptian newspapers El Ahram, El Akhbar, and El Gomhoria, divided into 6 categories. According to the experimental results, they found that the most suitable stemming algorithm for the Arabic language is a hybrid of statistical and light stemmers. In the classification process the results showed that the Rocchio classifier has an advantage over the k-nearest neighbor classifier in both time and accuracy (more than 90%) [12].

4.4. Arabic Text Classification Using Support Vector Machines.

The authors applied the Support Vector Machine (SVM) model to their classification process. Because no Arabic corpus was available, as the authors note, they used an in-house Arabic corpus consisting of 1132 documents collected from the three main Egyptian newspapers El Ahram, El Akhbar and El Gomhoria during the period from August 1998 to September 2004, classified into 6 categories. Two experiments were used to test the different classifiers: the first used the training set as the test set, and the second used the leave-one-out testing method. They compared their results with Naïve Bayes, K-Nearest Neighbor and Rocchio classifiers. The results of the SVM, Naïve Bayes and Rocchio classifiers were very high, and the classification accuracy tended to 100% when the training set was used as the test set. This was because the classifiers had already seen the test documents during the training phase, which made it trivial to classify those documents. Using the leave-one-out method for testing gives more realistic results: a single document from the original sample is used as the test data and the remaining documents as the training data, and this is repeated so that each document in the sample is used once as the test data. The experimental results show that when the feature set is large enough, more than 4000 features, the SVM outperforms the other classifiers, with a classification rate exceeding 90% [22].
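
The leave-one-out procedure described above can be sketched with scikit-learn as follows; the documents, labels and linear SVM are hypothetical placeholders rather than the authors' actual setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["goal match team", "bank market price",
        "coach league win", "trade export growth"]
labels = ["sport", "economy", "sport", "economy"]

# Each document is held out once as the test set while the remaining
# documents form the training set; the per-fold scores are then averaged.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, docs, labels, cv=LeaveOneOut())
print(scores.mean())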

4.5. Automated Arabic Text Categorization Using SVM and NB.

The author investigated the Naïve Bayesian (NB) method and the Support Vector Machine (SVM) algorithm. The data set used was the Saudi Newspapers (SNP) corpus, which consists of 5121 Arabic documents of different lengths belonging to 7 categories. Three evaluation measures (Recall, Precision, and F1) were used as the basis of the comparison. The analysis showed that all measures vary among categories: the "Sport" category had a neat F1 of 94.2%, while the "General" category had a noticeably poor Recall of 41.7% using NB. These poor results indicate that the "General" category highly overlapped with other categories. The experimental results showed that the SVM algorithm outperformed the other classifier (about 77%) [19].

4.6. Performance of KNN and SVM classifiers on full word Arabic articles.

The authors evaluated the K-nearest neighbor (KNN) algorithm and the support vector machine (SVM) algorithm. They used the CHI statistic as a ranking metric, used full-word features, and considered tf.idf as the weighting method for feature selection. The number of training articles was 2235 and the number of testing articles was 29; the articles covered just two categories, Sport and Economy. The authors noticed that some articles were classified differently by the two classifiers, which they attributed to the occurrence of features of both categories in the same article as well as to the different learning and classification methods used. The SVM had better micro recall than KNN, the micro precision values of both classifiers were similar, with KNN a bit better, and the micro-averaged F1 of both classifiers was high. When the number of features was low, the classifiers had similar performance, with an advantage for SVM, but SVM outperformed KNN as the number of features increased. The prediction time of SVM was also better than that of KNN. Overall, the experiments showed that SVM had a better micro-averaged F1 and prediction time, and the SVM classifier in general gives better accuracy (95%) [23].

4.7 Automatic Arabic Text Classification.

Using seven different available Arabic corpora, the authors applied two well-known statistical techniques, SVM and C5.0, to classify Arabic text documents. Two data mining packages were used: the open source RapidMiner, which provided the implementation of the SVM algorithm, and Clementine, which provided the C5.0 decision tree algorithm. Accuracy was computed by dividing the number of correctly classified documents by the total number of documents in the testing dataset. The C5.0 classifier, in general, gave better accuracy (78.42%) [8].
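
As a one-line illustration, the accuracy measure described above is simply the fraction of correctly classified test documents (the counts below are hypothetical):

def accuracy(correctly_classified, total_test_documents):
    # Accuracy = correctly classified test documents / all test documents.
    return correctly_classified / total_test_documents

print(accuracy(157, 200))  # 0.785, i.e. 78.5%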

4.8 A Comparative Study for Arabic Text Classification Algorithms Based on Stop
Words.

The authors compared three techniques: Support Vector Machine (SVM) with Sequential Minimal Optimization (SMO), Naïve Bayesian (NB), and J48. Their main objective was to measure the accuracy of each classifier based on stop-word elimination. The Arabic data set contains 2363 documents divided into six categories: Sport, Economic, Medicine, Politic, Religion, and Science. The experiments were conducted using the Waikato Environment for Knowledge Analysis (WEKA), where J48, NB, and SMO are already implemented. The dataset was tested using the percentage-split (holdout) method, where 60% of the data was used for training and the remaining 40% for testing. The accuracy of each classifier was measured by the percentage-split (holdout) and K-fold cross validation methods. The results show that the SMO classifier achieves the lowest error rate and the highest accuracy, and that the time needed to build the SMO model is the smallest. The best classification accuracy they reported was 96% for SMO [9].
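
The percentage-split (holdout) evaluation described above could be sketched as follows; scikit-learn's LinearSVC is used here only as a rough stand-in for WEKA's SMO, and the documents are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC  # rough stand-in for WEKA's SMO classifier

docs = ["goal match team", "coach league win", "player score goal",
        "bank market price", "trade export growth", "stock trade market"]
labels = ["sport", "sport", "sport", "economy", "economy", "economy"]

# 60% of the documents are used for training and the remaining 40% for testing.
train_x, test_x, train_y, test_y = train_test_split(
    docs, labels, train_size=0.6, stratify=labels, random_state=0)
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_x, train_y)
print(accuracy_score(test_y, model.predict(test_x)))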

4.9 Arabic Text Classification Using K-NN and Naïve Bayes.

The authors evaluated the use of K-NN and NB to classify Arabic text. They used an in-house corpus consisting of 242 documents belonging to six categories. One of the difficulties noted by the authors was finding an Arabic corpus for evaluating text categorization; another was the huge number of features or keywords found in the Arabic language, so they extracted a reduced keyword feature set to improve performance. The K-fold cross validation method was used to test the accuracy. The results showed that the K-NN classifier had the better performance: its average accuracy was about 84%, while the NB classifier achieved 73% [10].

5 Methods

An analytic, comparative approach was used in order to:
1. Explain the different types of approaches to text categorization.
2. Give a brief overview of the available Arabic text classification literature.
3. Compare the results reported in the literature.
4. Analyze the comparison.

6 Results and Discussion

Comparing the algorithms mentioned above, Support Vector Machine (SVM), K-nearest neighbor (KNN), Naïve Bayesian (NB), Distance-Based, C5.0, J48, and Rocchio, with regard to accuracy, the findings of the comparison are shown in Table 1.

Table 1. Comparative results

| No. | Main author | No. of documents | No. of categories | Classifiers used | Best classifier | Accuracy | Notes |
|-----|-------------|------------------|-------------------|------------------|-----------------|----------|-------|
| 4.1 | Mesleh, A.M. | 1445 | 9 | SVM/NB/KNN | SVM | 90% | - |
| 4.2 | Duwairi, R. | 1000 | 10 | NB/KNN/Distance-Based | NB | - | accuracy value not reported |
| 4.3 | Syiam, M.M. | 1132 | 6 | KNN/Rocchio | Rocchio | 98% | - |
| 4.4 | Gharib, T.F. | 1132 | 6 | SVM/NB/KNN/Rocchio | SVM | 90% | large feature size |
| 4.5 | Alsaleem, S. | 5121 | 7 | SVM/NB | SVM | 77% | - |
| 4.6 | Hmeidi, I. | 2235 | 2 | KNN/SVM | SVM | 95% | large feature size |
| 4.7 | Al-Harbi, S. | 17,658 | 1 to 7 per corpus | SVM/C5.0 | C5.0 | 78% | - |
| 4.8 | Al-Shargabi, B. | 2363 | 6 | SVM (SMO)/Naïve Bayesian/J48 | SMO | 96% | - |
| 4.9 | Bawaneh, M.J. | 242 | 6 | KNN/NB | KNN | 84% | small feature set |
Analyzing the results in the above table, the following was found:

1. There is a lack of work in the Arabic text classification field, and almost all of the authors used their own corpora, which they justified by the lack of Arabic corpora in general [7, 8, 9, 10, 11].
2. The authors used different types of measures to test their classifiers. Almost all of them used Recall, Precision, and F1; precision and recall are widely used evaluation measures in Information Retrieval and Machine Learning [19].
3. Support Vector Machine, Naïve Bayesian and K-Nearest Neighbor were used most frequently.
4. [23] showed that the support vector machine generalizes well in a high-dimensional feature space, which [2, 24] had considered to be a disadvantage of the algorithm.
5. There are differences between the classifiers in accuracy, error rate, and the time taken to build the classification model [9].
6. The experimental results showed that the Support Vector Machine classifier outperformed K-Nearest Neighbor, Rocchio and Naïve Bayesian [7, 22, 23, 25, 9].

7. C5.0 showed more accurate results than the Support Vector Machine on the same corpora, by about 10% [8]. This might be due to the use of different datasets, methods, and measures; for example, accuracy there was computed by dividing the number of correctly classified documents by the total number of documents in the testing dataset, and SVM classifiers are binary classifiers while C5.0 allows two or more outcomes [27].

7 Conclusion and future work

This paper has investigated text classification for the Arabic language. In conclusion, SVM is recommended for relatively large corpora, while C5.0 is recommended for both smaller and larger corpora. Using different measures to test classifiers may lead to different results regarding the accuracy of an algorithm, and using different corpora also affects the reported accuracy.

Future work will focus on comparing the same algorithms for both English and Arabic text classification, in order to determine the correlation between the language and the accuracy of the algorithms.

ACKNOWLEDGMENT

The authors would like to thank Dr. Adil Ali, head of the Department of Information Technology, Garden City College of Science and Technology, for his support and helpful comments.

References

1. Dharmadhikari, C.S., Ingle, M. and Kulkarni, P. "Empirical Studies on Machine Learning Based Text Classification Algorithms," Advanced Computing: An International Journal (ACIJ), Vol. 2, 2011.
2. Saleeb, H. "Information Retrieval: A Framework for Recommending Text Based Classification Algorithms". Pace University, Ph.D. Thesis, 2002.
3. Witten, I.H., Frank, E. and Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques, 2011.
4. Olson, D.L. and Delen, D. Advanced Data Mining Techniques. Springer-Verlag
Berlin Heidelberg, German, 2008.
5. Khan, A., Baharudin, B., Lee, L.H. and Khan, K.K. "A Review of Machine Learning Algorithms for Text-Documents Classification," Journal of Advances in Information Technology, Vol. 1, 2010.
6. Sebastiani, F. "Machine learning in automated text categorization," ACM compu-
ting surveys, vol. 34, 2002.

7. Mesleh, A.M.A. Support Vector Machine Text Classifier for Arabic Articles: Ant Colony Optimization-based Feature Subset Selection. The Arab Academy for Banking and Financial Sciences, Ph.D. Thesis, 2008.
8. Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed, M. S. and Al-Rajeh, A.
"Automatic Arabic Text Classification," Proceedings of The 9th International Con-
ference on the Statistical Analysis of Textual Data, Lyon- France, 2008.
9. Al-Shargabi, B., AL-Romimah, W. and Olayah, F. "A Comparative Study for Ara-
bic Text Classification Algorithms Based on Stop Words Elimination," ISWSA
'11 Proceedings of the 2011 International Conference on Intelligent Semantic Web-
Services and Applications, Article No. 11, 2011.
10. Bawaneh, M.J., Alkoffash, M.S. and Alrebea, A.I. "Arabic Text Classification using K-NN and Naïve Bayes," Journal of Computer Science, Vol. 4, 2008.
11. Al-Ansary, S., Nagi, M. and Adly, N. " Building an International Corpus of Ara-
bic (ICA): Progress of Compilation Stage," Bibliotheca Alexandrina, 2008.
12. Syiam, M. M., Fayed, Z. T. and Habib, M. B."An intelligent system for Arabic text
categorization," IJICIS, Vol.6, 2006.
13. Duwairi, R. "Arabic Text Categorization," The International Arab Journal of Infor-
mation Technology, Vol. 4, 2007.
14. Khreisat, L. "A machine learning approach for Arabic text classification using N-
gram frequency statistics," Journal of Informatics, Volume 3, 2009.
15. Pawar, P.Y. and Gawande, S. H. "A Comparative Study on Different Types of Ap-
proaches to Text Categorization," International Journal of Machine Learning and
Computing, Vol. 2, 2012.
16. Jackson, P. and Moulinier, I. Natural language processing for online applications:
text retrieval, extraction, and categorization. John Benjamins Publishing Company,
Amsterdam, 2002.
17. Mahinovs, A. and Tiwari, A. Text Classification Method Review. Decision Engi-
neering Report Series. Cranfield University, 2007.
18. Hand, D., Mannila, H. and Smyth P. Principles of Data Mining. The MIT Press
Cambridge, Massachusetts London England, 2001.
19. Alsaleem, S. "Automated Arabic Text Categorization Using SVM and NB." Inter-
national Arab Journal of e-Technology, Vol. 2, 2011.
20. White Paper: Control Sets in Relativity Assisted Review Control Sets: Introducing
Precision, Recall, and F1 into Relativity Assisted Review, kCura Corporation, 2013
21. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
22. Gharib, T.F., Habib, M.B. and Fayed, Z.T. "Arabic Text Classification Using Sup-
port Vector Machines,"
http://wwwhome.cs.utwente.nl/~badiehm/PDF/ISCA2009.pdf.
23. Hmeidi, I., Hawashin, B. and El-Qawasmeh, E. "Performance of KNN and SVM classifiers on full word Arabic articles," Advanced Engineering Informatics, pp. 106-111, 2008.
24. Pitigala, S., Li C., Seo, S. "A Comparative Study of Text Classification Approaches
for Personalized Retrieval in PubMed," IEEE International Conference on Bioin-
formatics and Biomedicine Workshops, 2011.
25. Murthy, K.N. "Advances in Automatic text categorization". librarians' Digital li-
brary, DRTC Workshop on Semantic Web, Bangalore. India, 2003.
26. Sakhr Company: http://www.sakhr.com.
27. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J. and Steinberg, D. "Top 10 algorithms in Data Mining," Knowl Inf Syst, 14:1-37, 2008.
