
Automatic Part-of-Speech Tagger Based Arabic Language

Jabar H. Yousif 1, Tengku M. T. Sembok 2

1 Faculty of Computing & Information Technology,
Sohar University, P.O. Box 44, P. Code 311,
Sohar, Sultanate of Oman
2 Faculty of Information Science and Technology,
Universiti Kebangsaan Malaysia (UKM),
43600 UKM Bangi, Selangor, Malaysia.
Jyousif@soharuni.edu.om, tmts@ftsm.ukm.my

Abstract

An automatic part-of-speech tagger for the Arabic language based on neural network approaches, namely the Multilayer Perceptron (MLP) and the Fully Recurrent Neural Network (FRNN), has been developed and evaluated. In addition, a support vector machine (SVM) tagger has been developed and evaluated. The neural approaches not only learn the associations (word-to-tag mappings) from a representative training data set, but also generalize to unseen exemplars. Moreover, the RNN approach exhibits a complex and dynamic behavior and has proven capable of representing such structures. The experiments show that the proposed taggers are not only highly accurate (99.99%), but also require a lower processing time and a smaller amount of data for the learning phase. Likewise, the POS disambiguation problem was successfully solved.

Keywords: Arabic part-of-speech, Arabic morphology, Natural language processing, neural networks.

1. Introduction

The Part of Speech (POS) is a classification of words according to their meanings and functions. The POS tagger is a crucial foundation for most Natural Language Processing (NLP) applications such as machine translation, information extraction, speech recognition, as well as grammar and spelling checkers. Nevertheless, the accuracy of POS tagging is determined by factors such as ambiguous words and phrases, unknown words, and multi-part words. There are specific features that stimulate scientists to adopt neural network based solutions for such problems [6]. Their main features are massive parallelism, uniformity, generalization ability, distributed representation and computation, learnability, trainability and adaptivity. Neural approaches have performed successfully in many areas of artificial intelligence such as image processing, NLP, speech recognition, pattern recognition and classification tasks.

The Recurrent Neural Network (RNN) is a network of neurons with feedback connections, which is biologically more plausible and computationally more powerful than other adaptive models such as Hidden Markov Models (HMM), feed-forward networks and Support Vector Machines (SVM). SVMs are supervised learning methods used for binary classification and regression tasks. They belong to a family of generalized linear classifiers. The main advantage of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin.

Recently, Arabic language processing has become a primary focus of research and commercial development. Indeed, any momentous Arabic NLP application often includes a fast and accurate POS tagger as one of its main core components. The reliability of the POS tagger depends largely on its ability to address matters in an expeditious, expressive, inclusive, accurate and portable manner [3, 4]. POS tagger systems have been implemented using various methods such as the "Rule-Based model" [7, 11], the "Statistical model" [5, 9] and the "Neural Network model" [1, 12, 14, 15, 17]. In addition, rule-based models [4, 11] and Support Vector Machine (SVM) models have been proposed to address and implement the Arabic POS tagger [8, 20].

The Rule-Based and Statistical models need a vast amount of data to implement the POS tagger. Furthermore, these models cannot represent the hierarchical and complex structures commonly found in natural languages. In order to solve some of these problems, an automatic POS tagger for the Arabic language based on neural network approaches (MLP, RNN) and SVM has been developed and evaluated. The neural based approaches not only learn the associations (word-to-tag mappings) from a representative training data set, but also generalize to unseen exemplars. Moreover, the RNN approach exhibits a complex and dynamic behavior and has proven capable of representing these structures.

2. Arabic language characteristics

Arabic is one of the six official languages of the United Nations and the mother tongue of 250 million people. The vast increase in its usage on the internet has supported the propagation and transmission of information in the Arab world. Arabic language processing has recently become a primary focus of research and commercial development [2, 13].

The nature and structure of Arabic words is highly derivative and inflective. Moreover, Arabic words are often compound structures which should syntactically be regarded as phrases rather than single words [3]. Unlike Latin-based alphabets, the orientation of writing in Arabic is from right to left, which makes Arabic writing differ distinctly from languages such as English and Spanish.

The Arabic word can be divided into three types: noun, verb and particle. A noun is a word that describes a person or thing, like "madrasa", which means school. A verb is a word that describes an action and takes three forms (past, present, imperative). Lastly, a particle is a word that serves a single purpose and is preceded or succeeded by a noun or a verb, such as prepositions, interjections and conjunctions.

3. Related work

This section surveys previous work in the fields of part-of-speech tagging and the use of neural networks.

Schmid [17] successfully demonstrated that a Net-Tagger with a context window of three preceding words and two succeeding words, trained on a large corpus called the Penn Treebank corpus, performed considerably well compared to statistical approaches based on the "Trigram model" and the "Hidden Markov model" (HMM).

Diab et al. [8] implemented an SVM approach to automatically tokenize, POS tag and annotate base phrases (BPs) in Arabic text. They achieved an Fβ=1 score of 99.12 for tokenizing, an accuracy of 95.49% for tagging, and an Fβ=1 score of 92.08 for chunking.

Pérez-Ortiz [15] used Discrete-Time Recurrent Neural Networks (DTRNN) for POS tagging of ambiguous words from the sequential information stored in the network's state. The experiments were performed to compute the error rates when tagging text taken from the Penn Treebank corpus.

Ahmed [1] used an MLP tagger with three layers trained with the error back-propagation learning algorithm. The tagger was implemented on the SUSANNE English tagged corpus consisting of 156,622 words. The MLP tagger was trained using 85% of the corpus. Based on the tag mappings learned, the MLP tagger demonstrated an accuracy of 90.04% on test data that also included words unseen during training.

Khoja [11] proposed APT, an Arabic Part-of-Speech Tagger that uses a combination of both statistical and rule-based techniques, as such a combination was believed to achieve the highest accuracy rates. The tags of the APT tagset are mainly derived from the BNC English tagset, which was modified with some concepts from traditional Arabic grammar.

4. Tagger Configuration & Design

The main components of the proposed taggers are summarized in this section. The system is divided into two phases, as depicted in Figure 1.

4.1 Pre-processing phase

This phase is written and implemented using VBA commands for Excel. The pre-processing phase is designed and utilized to achieve the following tasks:

a) Text Normalization: Text normalization is the first stage of the system and provides the input text in a suitable form to be processed later. In general, the input text can be configured either as a text file or as an XML file. Therefore, the system is designed to disregard all HTML tags and extract the pure contents of the document.

[Figure 1: System Architecture. Input Text enters the Automatic Interface, passes through the pre-processing phase (Text Normalization, Text Tokenization, Text Encoding) and the processing phase (SVM-POS Tagger, MLP-POS Tagger, FRNN-POS Tagger), and produces the Output Tagging.]

b) Text Tokenization: Text tokenization is a module that scans the document and splits the text into simple tokens such as numbers, punctuation, symbols, and words. An algorithm has been developed to implement and perform the text tokenization task.
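
The paper implements this module with VBA commands for Excel; purely as an illustration of the idea, a minimal Python sketch of such a tokenizer is given below. The token pattern and the sample sentence are assumptions, not the authors' implementation:

```python
import re

# Illustrative token pattern: integer numbers, words (Latin or Arabic letters),
# or single punctuation/symbol characters.
TOKEN_PATTERN = re.compile(r"\d+|[^\W\d_]+|[^\w\s]")

def tokenize(text):
    """Split raw text into simple tokens: numbers, words, punctuation and symbols."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("ذهب الولد إلى المدرسة سنة 2005."))
# ['ذهب', 'الولد', 'إلى', 'المدرسة', 'سنة', '2005', '.']
```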

c) Text Encoding: Text encoding is the process that transforms the input data into a suitable form which the network can identify and use. Subsequently, every word in the sentence is associated with a bit-vector whose size is equal to the number of different grammatical categories (parts of speech) in a specific language.
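
As an illustration of this bit-vector (one-hot) encoding, a minimal Python sketch follows. The miniature three-tag set is a hypothetical example; the proposed taggers use vectors of 23 elements built the same way:

```python
import numpy as np

def encode_tag(tag, tagset):
    """Return a one-hot bit-vector whose length equals the number of grammatical categories."""
    vector = np.zeros(len(tagset))
    vector[tagset.index(tag)] = 1.0
    return vector

# Hypothetical three-category tagset for illustration only.
TAGSET = ["noun", "verb", "particle"]
print(encode_tag("verb", TAGSET))   # [0. 1. 0.]
```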

4.2 The processing phase

This section presents the design and implementation of the processing phase using the NeuroSolutions for Excel software. This phase involves the following tasks:

a) The MLP tagger [21, 23]: This task is designed to implement the POS tagger for Arabic text using the Multilayer Perceptron technique. The architecture of the network has one hidden layer, 23 processing elements tagged as input, and 23 processing elements tagged as output. The maximum number of epochs is 1000. The TanhAxon is implemented as the transfer function in the hidden and output layers; the TanhAxon applies a bias and a tanh function to each neuron in the layer.
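
Outside NeuroSolutions, an equivalent topology can be sketched in a few lines of Python/NumPy. This is an illustrative forward pass only; the hidden-layer width and the weight initialization are assumptions, since the paper fixes only the 23 input/output processing elements and the tanh transfer function:

```python
import numpy as np

rng = np.random.default_rng(0)

class MLPTagger:
    """23-input / 23-output perceptron with one tanh hidden layer.

    The hidden-layer width (46) is an assumption made for this sketch.
    """
    def __init__(self, n_in=23, n_hidden=46, n_out=23):
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        hidden = np.tanh(x @ self.W1 + self.b1)      # TanhAxon: bias plus tanh per neuron
        return np.tanh(hidden @ self.W2 + self.b2)   # one activation per output tag

x = np.zeros(23)
x[5] = 1.0                                           # one-hot encoded input word
print(int(np.argmax(MLPTagger().forward(x))))        # index of the highest-scoring tag
```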

b) The Recurrent tagger [22]: This task is designed to implement the POS tagger for Arabic text using the Fully Recurrent Neural Network technique. The tagger has one fully connected hidden synapse (the Synapse family implements the dynamic linear mapping characteristics of the neuron; a synapse connects two layers of axons) with 230 processing elements as input and 23 processing elements as output. A static controller for a network with a multi-channel tapped delay line memory structure (TDNNAxon), which has 10 taps, is used as the input axon. Likewise, in both the hidden and output layers the TanhAxon is implemented as the transfer function. The RNN has 23 processing elements (columns) tagged as input and 23 processing elements (rows) tagged as output.
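
The following Python sketch illustrates the idea of combining a 10-tap delay line over 23-dimensional encodings (giving the 230-element input) with a recurrent tanh layer. It is not the NeuroSolutions implementation; the weight shapes and initialization are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N_TAGS, N_TAPS = 23, 10          # 23 PEs per time step, 10-tap delay line -> 230 inputs

W_in  = rng.normal(scale=0.1, size=(N_TAGS * N_TAPS, N_TAGS))  # input synapse (230 -> 23)
W_rec = rng.normal(scale=0.1, size=(N_TAGS, N_TAGS))           # feedback (recurrent) synapse
bias  = np.zeros(N_TAGS)

def tag_sequence(encoded_words):
    """Tag a sequence of one-hot word vectors with a toy fully recurrent network."""
    delay_line = [np.zeros(N_TAGS)] * N_TAPS          # tapped delay line memory (TDNN-style)
    state = np.zeros(N_TAGS)
    tags = []
    for x in encoded_words:
        delay_line = delay_line[1:] + [x]             # shift in the newest encoded word
        stacked = np.concatenate(delay_line)          # 230-element input vector
        state = np.tanh(stacked @ W_in + state @ W_rec + bias)   # tanh layer with feedback
        tags.append(int(np.argmax(state)))
    return tags

words = [np.eye(N_TAGS)[i] for i in (0, 5, 12)]       # three one-hot encoded words
print(tag_sequence(words))
```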

c) The SVM tagger [20]: This task is used to implement the POS tagger for Arabic text using support vector machine algorithms. The SVM architecture has 23 PEs as the input set x_i and 23 PEs as the output set d_i, and has no hidden layer. The maximum number of epochs is 1000 and the step size is set to 0.01. The learning algorithm is based on the Adatron algorithm, which is extended to the RBF network.
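
A toy kernel Adatron for a single tag-versus-rest decision is sketched below; the full 23-output setting would train one such machine per tag. The kernel width, learning rate and data are illustrative assumptions, not the NeuroSolutions configuration:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_adatron(X, y, epochs=1000, eta=0.01, gamma=1.0):
    """Train one binary tag-vs-rest classifier with labels y in {-1, +1}."""
    n = len(X)
    K = np.array([[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * np.sum(alpha * y * K[i])
            alpha[i] = max(0.0, alpha[i] + eta * (1.0 - margin))   # clip multipliers at zero
    return alpha

def predict(x, X, y, alpha, gamma=1.0):
    return np.sign(sum(a * yi * rbf(x, xi, gamma) for a, yi, xi in zip(alpha, y, X)))

# Tiny artificial example: two 23-dimensional encoded words, one per class.
X = np.vstack([np.eye(23)[3], np.eye(23)[7]])
y = np.array([1.0, -1.0])
alpha = kernel_adatron(X, y)
print(predict(np.eye(23)[3], X, y, alpha))   # 1.0
```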

5. Experiments and Results

The experiments were carried out using the Arabic tagset and Arabic corpus proposed by Khoja [10], which includes 50,000 words. The tagset contains 177 tags, divided into various categories. The extraction of words into basic roots is not considered in this study; we assume that the words were segmented before POS tagging began. The experiments covered the three taggers proposed in this paper: the SVM tagger, the MLP tagger and the FRNN tagger.

The input text is encoded into a suitable form and then divided into three categories: training data sets, cross-validation data sets and test data sets. Cross-validation computes the error on a held-out data set at the same time that the network is being trained with the training set.

The Genetic Algorithm (GA) is used as a heuristic optimization method for finding the best network parameters [18]. It starts with an initial population of randomly created bit strings; these initial samples are decoded and applied to the problem. The experiments used the GA to improve the learning rule parameters such as the step size and the momentum value, which enables the optimization of the momentum values for all gradient components in the NeuroSolutions software that use momentum. Besides, it is used to determine the number of processing elements. Likewise, to allow the better-fitting specimens in the population to reproduce at a higher rate, a selection method based on the roulette wheel selection technique is used.
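
A compact sketch of this scheme is shown below, with bit strings encoding the step size and momentum and roulette-wheel selection driving reproduction. The bit widths, parameter ranges, mutation rate and fitness function are assumptions for illustration; crossover is omitted for brevity:

```python
import random

BITS = 8   # assumed precision per encoded parameter

def decode(bits):
    """Decode a bit string into (step size, momentum) over assumed ranges."""
    step = int("".join(map(str, bits[:BITS])), 2) / (2 ** BITS - 1) * 0.1   # 0 .. 0.1
    momentum = int("".join(map(str, bits[BITS:])), 2) / (2 ** BITS - 1)     # 0 .. 1
    return step, momentum

def roulette_select(population, fitnesses):
    """Roulette wheel selection: reproduction probability proportional to fitness."""
    chosen = random.choices(population, weights=fitnesses, k=len(population))
    return [individual[:] for individual in chosen]

def evolve(fitness_fn, pop_size=20, generations=50, p_mutate=0.05):
    pop = [[random.randint(0, 1) for _ in range(2 * BITS)] for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [fitness_fn(*decode(ind)) for ind in pop]
        pop = roulette_select(pop, fitnesses)
        for ind in pop:                                   # bit-flip mutation
            for i in range(len(ind)):
                if random.random() < p_mutate:
                    ind[i] ^= 1
    return decode(max(pop, key=lambda ind: fitness_fn(*decode(ind))))

# Hypothetical fitness; in the paper it would come from the MSE of a trained network.
print(evolve(lambda step, m: 1.0 / (abs(step - 0.01) + abs(m - 0.7) + 1e-3)))
```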

There are several ways to test the network's performance. Usually the MSE (mean squared error) is used; it is two times the average cost and is computed as follows:

MSE = \frac{\sum_{j=0}^{P} \sum_{i=0}^{N} (d_{ij} - y_{ij})^2}{N \cdot P}

where P is the number of output processing elements, N is the number of exemplars in the data set, y_{ij} is the network output for exemplar i at processing element j, and d_{ij} is the desired output for exemplar i at processing element j.
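
The same quantity can be computed in a few lines of Python/NumPy; the array shapes (exemplars by output processing elements) and the sample values are illustrative:

```python
import numpy as np

def mse(desired, output):
    """MSE = sum_j sum_i (d_ij - y_ij)^2 / (N * P) over N exemplars and P output PEs."""
    n_exemplars, n_outputs = desired.shape
    return float(np.sum((desired - output) ** 2) / (n_exemplars * n_outputs))

d = np.array([[1.0, 0.0], [0.0, 1.0]])   # desired outputs (N = 2 exemplars, P = 2 PEs)
y = np.array([[0.9, 0.1], [0.2, 0.8]])   # network outputs
print(mse(d, y))                          # 0.025
```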

The best network results for the training data of the proposed taggers, reported as the minimum MSE, are given in Table 1. Figures 2, 3 and 4 illustrate the training graphs of the proposed MLP, SVM and FRNN taggers respectively.

Table 1. Minimum and final MSE values of the proposed taggers

             SVM         MLP         FRNN
Epoch#       433         1000        1000
Min. MSE     0.0390025   0.0001036   0.0206866
Final MSE    0.0390053   0.0001034   0.0206866

[Figure 2. The graph of MSE versus epoch (training and cross-validation) for the MLP tagger]

[Figure 3. The graph of MSE versus epoch (training and cross-validation) for the SVM tagger]

[Figure 4. The graph of MSE versus epoch (training and cross-validation) for the FRNN tagger]

6. Comparison and conclusion

6.1 Comparison with related work

The comparison study has to be done carefully because the features used here do not match those in the previous studies. Moreover, comparing the proposed taggers with other existing taggers is a difficult matter, because tagger accuracy relies on numerous parameters such as language complexity (ambiguous words, ambiguous phrases), the language type (English, Arabic, Chinese, etc.), the training data magnitude, the tag-set size and the evaluation measurement criteria. Tag-set size has a great impact on the tagging process.

The proposed taggers are assessed using the accuracy measurement, besides the MSE aspects. In a comparison of the proposed taggers with the results of several taggers of other researchers, the accuracy of the proposed taggers reaches a superior rate when GA optimization techniques are used to improve the values of the momentum rate and the step size. The proposed taggers (MLP, SVM and FRNN) achieved an accuracy of 99.99% in the last experiments when using the GA optimization process. Table 2 summarizes the comparison information, which is also illustrated in Figures 5 and 6.

Table 2. The comparison results of the proposed taggers and other taggers

               Schmid [17]   Pérez [15]    Ahmed [1]     Khoja [11]   Diab [8]    Proposed
POS type       NN            DT-RNN        NN            Rule based   SVM         NN
Lang.          Eng.          Eng.          Eng.          Ar.          Ar.         Ar.
Corpus size    4.5*10^6      0.0465*10^6   0.1566*10^6   0.05*10^6    4519 sen.   0.05*10^6
Train data %   44.4          100           85            100          80          10
Tag size       48            19            48            131          19          131
Acc. %         96.22         92            90.4          90           95.49       99.99

[Figure 5. Comparison result (proposed taggers using GA optimization): accuracy of the SVM, MLP and FRNN taggers versus Schmid 1994, Pérez-Ortiz 2001, Ahmed 2002, Khoja 2001b and Diab 2004]

[Figure 6. Comparison result (proposed taggers without using GA optimization)]

6.2 Conclusion

This research was mainly aimed at implementing and utilizing an automatic POS tagging system based on neural network methods, which has the ability to tag Arabic texts mechanically. We have demonstrated different kinds of taggers which can solve the problem of Arabic part-of-speech tagging. The new approaches are highly accurate, with low processing time and high-speed word tagging. The results are greatly encouraging, with correct assignments between 86% and 99.99%, depending on whether genetic algorithm optimization is used for the values of network variables such as the momentum rate and the step size.

7. Future Work

The research has demonstrated the applicability of neural-based tagging techniques to Arabic tagger systems. Still, some modules can be added to improve the pre-processing phase, such as affix extraction and segmentation.

REFERENCES

[1] Ahmed. "Application of Multilayer Perceptron Network for Tagging Parts-of-Speech", Proceedings of the Language Engineering Conference (LEC'02), IEEE, 2002.
[2] Al-Sulaiti, Latifa. "Online corpus". http://www.comp.leeds.ac.uk/latifa/research.htm.
[3] Attia, M. "A large-scale computational processor of the Arabic morphology and applications", MSc thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University, 2000.
[4] Beesley, K. "Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans", ACL Arabic NLP Workshop, Toulouse, 2001.
[5] Brants, T. "TnT - a statistical part-of-speech tagger", Proceedings of the 6th ANLP Conference, Seattle, WA, 2000.
[6] Brill, E. "Unsupervised learning of disambiguation rules for part of speech tagging", Proceedings of the Third ACL Workshop on Very Large Corpora, 1995.
[7] Brill, E. "A simple rule-based part-of-speech tagger", Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing, pp 152-155, Trento, IT, 1992.
[8] Diab, M., Hacioglu, K. & Jurafsky, D. "Automatic tagging of Arabic text: from raw text to base phrase chunks", Proceedings of HLT-NAACL-04, 2004.
[9] Giménez, J. & Màrquez, L. "Fast and accurate part-of-speech tagging: The SVM approach revisited", Proceedings of the International Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria, 2003.
[10] Khoja, S., Garside, R. & Knowles, G. "An Arabic tagset for the morphosyntactic tagging of Arabic", Corpus Linguistics 2001, Lancaster University, Lancaster, UK, 2001.
[11] Khoja, S. "APT: Arabic part-of-speech tagger", Proceedings of the Student Workshop at the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), Carnegie Mellon University, Pennsylvania, 2001.
[12] Ma, Q., Uchimoto, K., Murata, M. & Isahara, H. "Elastic neural networks for part of speech tagging", Proceedings of IJCNN'99, pp 2991-2996, Washington, DC, 1999.
[13] Mahtab, N. & Choukri, K. "Survey on Industrial needs for Language Resources", 2005. Online: http://www.nemlar.org/Publications/Nemlar-report-ind-needs_web.pdf.
[14] Marques, N. C. & Lopes, G. P. "Using neural nets for Portuguese part-of-speech tagging", Proceedings of the 5th International Conference on the Cognitive Science of Natural Language Processing, Dublin City University, Ireland, 1996.
[15] Pérez-Ortiz, J. A. & Forcada, M. L. "Part-of-speech tagging with recurrent neural networks", Proceedings of the International Joint Conference on Neural Networks (IJCNN 2001), IEEE, pp 1588-1592, 2001.
[16] Principe, J. C., Euliano, N. R. & Lefebvre, W. C. "Neural and adaptive systems: fundamentals through simulations", John Wiley & Sons, NY, 2000.
[17] Schmid, H. "Part-of-speech tagging with neural networks", Proceedings of COLING-94, Kyoto, Japan, pp 172-176, 1994.
[18] Srinivas, M. & Patnaik, L. M. "Genetic algorithms: a survey", IEEE Computer 27(6), pp 17-26, 1994.
[19] Weischedel, R., et al. "Coping with ambiguity and unknown words through probabilistic models", Computational Linguistics 19(2), pp 359-382, 1993.
[20] Yousif, J. H. & Sembok, T. M. T. "Arabic part-of-speech tagger based support vector machines", International Symposium on Information Technology, Kuala Lumpur Convention Centre, ISBN 978-1-4244-2328-6, IEEE, Malaysia, August 26-29, 2008.
[21] Yousif, J. H. & Sembok, T. M. T. "Design and Implement an Automatic Neural Tagger Based Arabic Language for NLP Applications", Asian Journal of Information Technology 5(7), pp 784-789, ISSN 1682-3915, 2006.
[22] Yousif, J. H. & Sembok, T. M. T. "Recurrent Neural Approach based Arabic Part-of-Speech Tagging", International Conference on Computer and Communication Engineering (ICCCE'06), Vol. 2, ISBN 983-43090-1-5, IEEE, Kuala Lumpur, Malaysia, May 9-11, 2006.
[23] Yousif, J. H. & Sembok, T. M. T. "Arabic Part-of-Speech Tagger Based Neural Networks", International Arab Conference on Information Technology (ACIT 2005), ISSN 1812-0857, Amman, Jordan, 2005.