
CHAPTER 6

KANNADA MORPHOLOGICAL ANALYZER AND GENERATOR

From a computational perspective, the creation and availability of a morphological
analyzer and generator (MAG) for a language is important. Morphology plays a significant
role in NLP, as seen in applications such as MT, QA systems, IE, IR, spell checking and
lexicography. Morphological analyzers and generators are essential for languages with
rich inflectional and derivational morphology, such as Kannada. A morphological analyzer
returns all the morphemes, and their grammatical categories, associated with a particular
word form. Conversely, given a root word and grammatical information, a morphological
generator produces the corresponding word form. Developing full-fledged morphological
analyzer and generator tools for a highly agglutinative language like Kannada is a
challenging task. To build such tools, one has to take care of the morphological
peculiarities of the language. Some peculiarities of Kannada, such as the usage of
classifiers and the excessive presence of vowel harmony, make it morphologically complex
and thus a challenge in NLG.

Table 6.1: Input/Output examples for morphological analyzer

Input          Output

AnegaLu        Ane + gaLu

hOguttEne      hOgu + utt + Ene

The morphological analyzer segments a given word into its component morphemes and
assigns the correct morpho-syntactic information to each. Table 6.1 shows examples of
morphological analysis of Kannada words. The morphological generator combines the
constituent morphemes to produce the actual word. Table 6.2 shows examples of
morphological generation of Kannada words.

Table 6.2: Input/output Examples for Morphological Generator

Input               Output

Ane + gaLu          AnegaLu

hOgu + utt + Ene    hOguttEne

The first section of this chapter explains the development of a Rule Based
Morphological Analyzer and Generator (RBMAG) system for Kannada, which incorporates the
morphological information and peculiarities of the language. The proposed RBMAG system
was developed using a finite state transducer (FST).

The second section of this chapter explains the development of a statistical
morphological analyzer for Kannada verbs. It is a paradigm-based morphological analyzer
developed using a machine learning approach. The system was designed using a sequence
labelling approach, and training, testing and evaluation were performed with the SVM
algorithm.

6.1 CHALLENGES IN KANNADA MORPHOLOGY

Kannada belongs to the south Dravidian family of languages. Kannada morphology is
characterized as agglutinative or concatenative, i.e., words are formed by adding suffixes
to the root word in a series. When suffixes attach to the root word, several
morphophonemic changes take place. The order in which suffixes attach to a root word
determines the morpho-syntax. Kannada is a verb-final inflectional language with
relatively free word order. Out of 38 basic characters, 330 conjuncts are formed by
combining vowels and consonants. There are more than 10,000 basic stems (root words) in
Kannada, and more than a million morphed variants are formed from more than 5000 distinct
character variants [158,159].

The complexity of developing a morphological analyzer for a Dravidian language like
Kannada is comparatively higher than for languages like English. Most words may change
spelling when stems are inflected. In an agglutinative language like Kannada, a root word
is normally affixed with several morphemes to generate thousands of word forms. To build
an effective morphological analyzer, one should carefully analyze and identify all these
roots and morphemes. The next challenge is the design of the morphological structure and
the generation of a well-organized corpus that covers, as far as possible, all types of
inflections.

6.1.1 Agglutination of Kannada Language

Agglutination is the most critical feature of Kannada. Owing to the highly
agglutinating nature of the language and the morphophonemic variations that take place at
the point of agglutination, it is very difficult to mark word boundaries. For example,
consider the verb form OdikoMDiddavana -> "the one (masculine) who was reading". The
meaningful parts of this word are as follows:

Odu  + i   + koLLu + MD  + u   + iru  + dd  + a  + avanu    + a

Root + VBP + AUXV  + PST + VBP + AUXV + PST + RP + PRON-3SM + ACC

The above word consists of ten meaningful parts: one root word (Root), two Verbal
Participles (VBP), two Auxiliary Verbs (AUXV), two Past Tense Markers (PST), one Relative
Participle (RP), one Pronoun (PRON-3SM) and one accusative marker (ACC).
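
The segmentation above can be written as a small data structure, which is roughly what an
analyzer is expected to return for this word. The sketch below is illustrative only: the
names SEGMENTS, tag_counts and lexical_form are hypothetical, and the morpheme/tag pairs
are copied from the example rather than produced by any analysis.

```python
from collections import Counter

# Morpheme/tag pairs for OdikoMDiddavana, copied from the example above.
SEGMENTS = [
    ("Odu", "Root"), ("i", "VBP"), ("koLLu", "AUXV"), ("MD", "PST"), ("u", "VBP"),
    ("iru", "AUXV"), ("dd", "PST"), ("a", "RP"), ("avanu", "PRON-3SM"), ("a", "ACC"),
]


def tag_counts(segments):
    """Count how often each grammatical tag occurs in a segmentation."""
    return Counter(tag for _, tag in segments)


def lexical_form(segments):
    """Join the morphemes into the lexical-level string used in the text."""
    return " + ".join(morpheme for morpheme, _ in segments)


print(len(SEGMENTS), dict(tag_counts(SEGMENTS)))
print(lexical_form(SEGMENTS))
```

Counting the tags reproduces the breakdown given in the text: ten parts, with two each of
VBP, AUXV and PST.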

6.1.2 Types and Features of Kannada Words

In general, there are three types of Kannada words: i) namapada (declinable words or
nouns), ii) kriyapada (conjugable words or verbs) and iii) avyaya (uninflected words).
Nouns, pronouns and adjectives are declinable words and are inflected for case, number
and gender. Conjugable words are inflected to mark differences of person, gender, number,
aspect, mood and tense. Kannada words have three genders: masculine, feminine and neuter.
Declinable and conjugable words have two numbers: singular and plural. The singular has
no distinguishing marker. The plural marker is usually gaLu, with the following
exceptions. Masculine nouns ending in a (e.g., huDuga) and some feminine nouns ending in
u (e.g., heMgasu) form the plural with aru. Feminine nouns ending in i (e.g., huDugi) or
e (e.g., atte) form the plural with yaru. For kinship terms (e.g., aNNa), the plural
marker is often aMdiru. Some nouns have irregular plurals, such as makkaLu, the plural of
magu.
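
The plural rules in this section amount to a short decision procedure. The following is a
minimal sketch under stated assumptions: the function name plural_marker and its
parameters are hypothetical, gender and kinship status are supplied by the caller (a real
system would read them from the lexicon), and the rule for feminine nouns ending in u is
applied to every such noun although the text says it holds only for some of them.

```python
# Irregular plurals are listed explicitly; the text gives only this one example.
IRREGULAR_PLURALS = {"magu": "makkaLu"}


def plural_marker(noun, gender, kinship=False):
    """Return the plural marker for a noun, or None if its plural is irregular."""
    if noun in IRREGULAR_PLURALS:
        return None
    if kinship:
        return "aMdiru"
    if gender == "masculine" and noun.endswith("a"):
        return "aru"
    if gender == "feminine" and noun.endswith("u"):
        # Simplification: the text restricts this rule to *some* feminine nouns.
        return "aru"
    if gender == "feminine" and noun.endswith(("i", "e")):
        return "yaru"
    return "gaLu"  # default plural marker


for noun, gender, kin in [("huDuga", "masculine", False), ("huDugi", "feminine", False),
                          ("aNNa", "masculine", True), ("mara", "neuter", False)]:
    print(noun, plural_marker(noun, gender, kin))
```

The function returns only the marker; joining it to the stem needs the sandhi rules
discussed later in the chapter.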

6.1.3 Noun Cases and Characteristic Suffixes

The case system of Kannada is similar to those of other south Dravidian languages
like Tamil, Telugu and Malayalam. Nouns usually end in a, e, i, u, A or in a consonant
[158]. Various suffixes are added to the noun stem to indicate different relationships
between the noun and other constituents of the sentence. Which suffix is used for a
particular case depends on the type of noun and its final character. For example, the
dative case suffix is chosen by the criteria shown in Table 6.3.

Table 6.3: Dative Case Characteristic Suffixes for Nouns

Noun type              Ends with    Dative suffix   Example noun   Dative form

Neuter noun            a            kke             mara           marakke
Neuter noun            e, i, u      ge              mane           manege
Neuter noun            consonants   ige             Uru            Urige
Neuter determinative   -            akke            idu            idakke
Rational noun          -            nige            aNNa           aNNanige
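
Table 6.3 can be read as a lookup from noun type and ending class to a dative suffix. The
sketch below is illustrative only. The names DATIVE_SUFFIX, attach and dative are
hypothetical, and the ending class is passed in explicitly because the table files Uru
under "consonants". The sandhi step (dropping a stem-final u before a vowel-initial
suffix) is an assumption chosen to reproduce the five example forms, not a rule stated in
the text.

```python
# (noun type, ending class) -> dative suffix, copied from Table 6.3.
DATIVE_SUFFIX = {
    ("neuter", "a"): "kke",
    ("neuter", "e,i,u"): "ge",
    ("neuter", "consonant"): "ige",
    ("neuter determinative", None): "akke",
    ("rational", None): "nige",
}

VOWELS = "aAiIuUeEoO"


def attach(stem, suffix):
    """Join stem and suffix. Assumed sandhi: final u drops before a vowel."""
    if stem.endswith("u") and suffix[0] in VOWELS:
        stem = stem[:-1]
    return stem + suffix


def dative(noun, noun_type, ending=None):
    """Build the dative form using the Table 6.3 suffix for this noun class."""
    return attach(noun, DATIVE_SUFFIX[(noun_type, ending)])


print(dative("mara", "neuter", "a"))          # marakke
print(dative("Uru", "neuter", "consonant"))   # Urige
print(dative("aNNa", "rational"))             # aNNanige
```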

Table 6.4 below shows the different cases and their corresponding characteristic
suffixes for nouns.

Table 6.4: Noun Cases and their Characteristic Suffixes

Case                     Singular suffixes                    Plural suffixes

Nominative (Prathama)    vu / yu / u / nu                     gaLu / ru / Mdiru / yaru
Accusative (Dwitiya)     vannu / yannu / annu / nannu         gaLannu / Mdirannu / yarannu / rannu
Instrumental (Tritiya)   diMda / yiMda / iMda / niMda         gaLiMda / MdiriMda / yariMda / riMda
Dative (Chaturthi)       kke / ge / ige / nige / akke         gaLige / Mdirige / yarige / rige
Ablative (Panchami)      deseyiMda                            gaLadeseyiMda / MdiradeseyiMda /
                                                              yaradeseyiMda / radeseyiMda
Genitive (Shashti)       da / ya / ina / na / a / vina / ra   gala / Mdira / yara / ra
Locative (Saptami)       dalli / yalli / alli / nalli         gaLalli / Mdiralli / aralli / ralli
Vocative (Sambodhana)    E / vE / A / I                       gale / MdirE / yare

6.1.4 Verb Morphology

Compared with other Dravidian languages like Malayalam, the morphological structure
of Kannada is more complex because verbs inflect for person, number and gender (PNG)
markings. In verb morphology, each root word combines with auxiliaries that indicate
aspect, mood, causation, attitude etc. The structure of the verbal complex is unique,
which makes it very challenging to capture in a format that a machine can analyze and
generate. The formation of the verbal complex also involves the arrangement of the verbal
units and the interpretation of their combinatory meaning. Phonology also plays a small
role in word formation, through the morphophonemic and sandhi rules that account for the
shape changes due to inflection. To resolve the computational challenges in verb
morphological analysis, I have classified verbs into 35 distinct paradigms, and verbs are
grouped by paradigm class.

6.1.4.1 Inflections and Features of Kannada Verbs

Verb forms can be broadly classified into two types: finite verbs and non-finite verbs.
Finite verbs usually occur at the end of the sentence and, with the exception of clitics
or reportives, can have nothing added to them. The general word order is
Subject-Object-Verb. Finite forms include imperatives, present and past forms marked with
PNG, modals and verbal/participle nouns. In the affirmative the tense can be past,
present or future; the negative form does not take tense. Non-finite verbs, in contrast,
cannot stand alone and must be followed by some other form. Non-finite verb forms include
infinitives, verbal and adjectival participles and tense-marked verb stems [158]. The
non-past denotes both present and future tenses and, unlike Malayalam (another south
Dravidian language), all tenses have different tense markers in Kannada. Mood is another
important feature of Kannada and is associated with statements of fact versus
possibility, supposition, etc. [160]. Four moods are expressed in Kannada: infinitive,
imperative, affirmative and negative. Kannada also has some additional modal forms:
indicative, conditional, optative, potential, monitory and conjunctive.

Kannada also has past verb stems, in addition to simple verb stems, that are used in
forming the past tense, past participles, conditionals and some other constructions. The
contingent form is another distinguishing feature of Kannada that is not present in other
Dravidian languages [161]. Past verbs are broadly classified into two types: regular and
irregular (or semi-regular). Regular verbs form the past by adding id to the verb stem.
Irregular verbs form it by adding one of the past tense markers shown in Table 6.18.

6.1.4.1.1 The Infinitive

The infinitive is a non-finite form of the verb that occurs together with other verbs,
auxiliary verbs (modals), negative morphemes and some other forms. There are two types of
infinitive in Kannada, al and Okke. Both are added to the verb root to generate other
word forms, as shown in Table 6.5.

Table 6.5: The Infinitive

Case     Example and Meaning

Case 1   baru (come) + al + illa (negative) = baralilla (didn't come)
Case 2   nODu (see) + Okke = nODOkke (to see)

6.1.4.1.2 The Imperative

Based on various degrees of politeness and deference, Kannada verbs exhibit a number
of forms that express commands or exhortations. These imperatives also change with the
verb type, which usually depends on the ending of the verb. Table 6.6 below lists the
imperatives with an example of each.

Table 6.6: The Imperative

Degree of politeness           i stem,          o stem,          bA, tA stem,       Other stems,
                               e.g. kuDi        e.g. tego        e.g. bA, baru      e.g. hOgu
                               (drink)          (take)           (come)             (go)

Impolite, casual (masculine)   kuDiyO           tegoLLO          bArO               hOgO
Impolite, casual (feminine)    kuDiyE           tegoLLE          bArE               hOgE
Nonpolite                      kuDi             tego             bA                 hOgu
Polite, plural                 kuDIri           tegoLLi          banni              hOgi
Very polite                    kuDiyiri         tegoLLiri        banniri            hOgiri
Ultrapolite                    tAvu kuDiyiri    tAvu tegoLLiri   tAvu banniri       tAvu hOgiri

6.1.4.1.3 Negative Imperatives

These forms command someone not to do something. There are two ways of indicating
negative imperatives, using bAradu and bEDa. The first, bAradu (historically a form of
bA/baru, "come"), may be added to the infinitive a to generate a negative verb form. It
may also be added to the infinitive al or l together with an emphatic E to generate
strong negative verb forms. Table 6.7 illustrates the negative imperative with examples.

Table 6.7: The Negative Imperative

Case     Example and Meaning

Case 1   hOgu + a + bAradu = hOgabAradu (don't go), or
         hOgu + bAradu = hOgbAradu (don't go)
Case 2   hOgu + al + E + bAradu = hOgalEbAradu (definitely don't go), or
         hOgu + l + E + bAradu = hOglEbAradu (definitely don't go)

Similarly, the second way of indicating negative imperatives uses the negative modals
bEDa (the negative modal of bEku: want, need, must, should) and kUDadu (must not).
Table 6.8 illustrates these negative imperatives with examples.

Table 6.8: The Negative Imperative

Case     Example and Meaning

Case 1   mADu + a + bEDa = mADabEDa, or
         mADu + bEDa = mADbEDa (don't do)
Case 2   mADu + a + bEDi = mADabEDi, or
         mADu + bEDi = mADbEDi (please don't do)
Case 3   mADu + al + E + bEDa = mADalEbEDa, or
         mADu + l + E + bEDa = mADlEbEDa (must not do)
Case 4   mADu + a + kUDadu = mADakUDadu, or
         mADu + kUDadu = mADkUDadu (must not do)

6.1.4.1.4 Optative

The optative is usually used with the first and third persons in Kannada and is formed
by adding i to the infinitive. It often translates into English as "let" when used in the
affirmative, and means "shall", "should" or "may" in the interrogative, as shown in
Table 6.9.

Table 6.9: The Optative

Case            Example and Meaning

Affirmative     avaLu mADu + al + i = avaLu mADali (let her do)
Interrogative   avaLu yAvAga mADu + al + i = avaLu yAvAga mADali?
                (when shall/should/may she do?)

6.1.4.1.5 Hortative

This form is made by adding ONa to the verb stem and can be translated either as
"let's (do something)" or, in an interrogative sentence, as "shall we (do something)?".
Table 6.10 illustrates this with examples.

Table 6.10: The Hortative

Case            Example and Meaning

Affirmative     mADu + ONa = mADONa (let's do)
Interrogative   Enu + mADu + ONa = Enu mADONa? (what shall we do?)

6.1.4.1.6 Participle

Kannada has some non-finite verb forms called participles that function verbally or
adjectivally, or have some special syntactic function in the sentence. Participles may be
affirmative, in which case they can be marked for tense, or negative. The important
verbal participles in Kannada are shown in Table 6.11, with an example of each.

Table 6.11: Verbal Participle

Participle                  Description and Example

Present verbal participle   Description: formed by adding A to the verb stem + utt (present
                            tense marker), followed by a finite verb or verb phrase.
                            Example: avnu yOcane mADuttA kuLittiddanu (he was sitting
                            thinking)

Past verbal participle      Description: if the past tense marker of the verb is id, add i
                            to the verb stem, followed by a finite verb or verb phrase.
                            Otherwise add u to the past verb stem, followed by a finite
                            verb or verb phrase.
                            Case 1: mADi Urige baMdenu (having done, I came to the village)
                            Case 2: hOgi + biTTu + banni (go and come)

Negative verbal participle  Description: used to express the negative of both the present
                            and past verbal participles. Formed by adding ade to the verb
                            stem or to the negative stem illa.
                            Case 1: nODu + ade = nODade (without seeing)
                            Case 2: illa + ade = illade (not being/having been)

Verbal/participle noun      Description: the most common verbal participle nouns are the
                            neuter singular adu and personal verbal nouns like avanu,
                            avaLu, avaru etc.
                            Case 1: mADu + O + adu = mADOdu (the act/fact of doing, that
                            which does)
                            Case 2: nODu + O + avaru = nODOvavaru (those (people) who see)

Negative verbal/participle  Description: formed by affixing the negative adjectival
noun                        participle ada to the demonstrative pronouns.
                            Case 1: mADu + ada + adu = mADadadu ((the act/fact of) not
                            doing, that which does/did not do)
                            Case 2: illa + ada + adu = illadadu or illaddu (the act/fact
                            of not being, that which is/was not)
                            Case 3: hOgu + ada + avaLu = hOgadavaLu (the woman who
                            does/did not go)

6.1.4.1.7 Modal auxiliaries

Kannada has a number of modal auxiliaries, each of which may have several different
meanings, as shown in Table 6.12.

Table 6.12: Modal Auxiliaries

Modal auxiliary   Description and Example

bEku              Description: the modal bEku (is wanted, needed; must, should, ought) is
                  used in different situations with different meanings.
                  Case 1: nAnu hOga bEku (I ought/need/want to go)
                  Case 2: nIvu nALe illi ira bEku (you must/should be here tomorrow)
                  Case 3: nIvu avanannu nODi ira bEku (you must have seen him)

bEDa (bEDi)       Description: the negative of the modal bEku is bEDa (should not, must
                  not, need not). The more polite or plural form is bEDi.
                  Case 1: bara bEDa (don't come)
                  Case 2: baradira bEDa (come)

kUDadu            Description: similar to the negative modal bEDa, but the strongest
                  negative modal. It is also attached to the infinitive, like bEDa and
                  bAradu.
                  Example: bara kUDadu (should/must not come)

bahudu            Description: also attached to the infinitive, with the meaning
                  "(someone) can/may (do something)".
                  Example: nODa bahudu (can/might see)

bAradu            Description: the negative of bahudu is bAradu.
                  Example: nODa bAradu (can't/shouldn't see)

Ar                Description: the negative contingent with PNG markers is Ar, attached
                  to the verbal infinitive to give the meaning "cannot/might not".
                  Example: nagu + al + Ar + enu = nagalArenu (cannot/might not laugh)

6.1.4.1.8 Dative-stative verbs

Two verb stems, baru and Agu, are frequently used as dative-stative verbs in Kannada.
These verbs have a habitual sense when they stand alone and normally do not take tense
markers. The verb iru (be), indicating possession, has the meaning "have" in English. The
second verb, Agu (become), acts as an aspect marker indicating finality.

6.1.4.1.9 Verbal aspect markers

Aspect markers are very similar to main verbs in their morphology and syntax, but
semantically they do not express a lexical meaning as main verbs do. In Kannada, the
verbal aspect marker is usually added to the past verbal participle. Table 6.13 shows the
verbal aspect markers used in Kannada, with their meanings and examples.

Table 6.13: Verbal Aspect Markers

Aspect marker   Meaning, Description and Example

biDu            Meaning: completion (perfective, definiteness).
                Description: the aspectual biDu is attached to the past verbal
                participle. It is homophonous with the lexical verb biDu (leave) and has
                a similar tense formation. The aspectual biDu can also be attached to the
                lexical verb biDu.
                Case 1: avanu biddu biTTanu (he fell down)
                Case 2: biTTubiDu (let go)

hOgu            Meaning: completion (sometimes involuntary or accidental).
                Description: the aspectual hOgu indicates that something has changed from
                one state to another. It is homophonous with the lexical verb hOgu (go)
                and has a similar tense formation. It is attached to the past verbal
                participle.
                Case 1: anna beMdu hOgide (the rice has gotten overcooked)
                Case 2: keTTu hOgu (get spoiled)

ADu             Meaning: continuity, duration (with some verbs reciprocal or
                competitive).
                Description: homophonous with the lexical verb ADu (play), with a similar
                tense formation. It is also attached to the past verbal participle.
                Case 1: avaru ODADidaru (they ran around)
                Case 2: avaru kAdADidaru (they fought with each other)

koDu            Meaning: benefactive.
                Description: homophonous with the lexical verb koDu (give), with a
                similar tense formation. It is also attached to the past verbal
                participle.
                Example: avanu kate baredu koTTanu (he wrote the story for someone's
                benefit)

nODu            Meaning: attemptive, experimentive.
                Description: homophonous with the lexical verb nODu (see), with a similar
                tense formation. It is usually used with transitive verbs and rarely with
                intransitive ones.
                Example: avanu kAfi kuDidu nODidanu (he tried drinking/tasted the coffee)

hAku            Meaning: exhaustive, malefactive.
                Description: homophonous with the lexical verb hAku (put, place); takes
                regular (weak) tense formation. It is usually used with transitive verbs.
                Example: avanu dOseyella tiMdu hAkidanu (he ate up all the pancakes,
                against our wishes)

koLLu           Meaning: reflexive, self-benefactive.
                Description: homophonous with the lexical verb koLLu (buy, take,
                acquire), with a similar tense formation.
                Example: avanu kate baredu koMDanu (he wrote a story for himself)

6.1.4.1.10 The causative suffix (isu)

The causative suffix isu (or yisu) can be added to any verb stem to make a causative
verb out of a non-causative one, as in Table 6.14.

Table 6.14: The causative suffix (isu)

Case     Example and Meaning

Case 1   kali (learn; intransitive) + isu -> kalisu (teach; causative)
Case 2   mADu (do) + isu -> mADisu (make (someone) do; intransitive -> causative)
         mADisu + isu -> mADisisu (make (someone) make (someone) do; double causative)

6.1.4.1.11 The conditional suffix (are)

The conditional suffix "are" is used in Kannada to express "if (something)" in
English, and the same form is used for all persons. The suffix is usually attached to the
past verb stem to create the conditional form, as Table 6.15 illustrates.

Table 6.15: The conditional suffix

Case     Example     Meaning

Case 1   mADidare    if (someone) does (something), (then ...)
Case 2   hOdare      if (someone) goes, (then ...)

6.1.4.1.12 Verbs marked with tense and PNG

The PNG and tense markers concatenated to the verb stem are the two important aspects
of verb morphology. The verbal inflectional morphemes attach to the verb and provide
information about syntactic aspects such as number, person, case-ending relation and
tense. Table 6.16 shows the various PNG suffixes that can be attached to any verb root.

Table 6.16: PNG-Suffixes

Person   Number     Gender               Present   Future    Past      Contingent

First    Singular   Masculine/Feminine   Ene       enu, e    enu, e    Enu
First    Plural     Masculine/Feminine   Eve       evu       evu       Evu
Second   Singular   Masculine/Feminine   I, Iye    i, iye    i, iye    Izha
Second   Plural     Masculine/Feminine   Iri       iri       iri       Iri
Third    Singular   Masculine            Ane       anu       anu       Anu
Third    Singular   Feminine             Ale       aLu       aLu       ALu
Third    Plural     Masculine/Feminine   Are       aru       aru       Aru
Third    Singular   Neuter               ide       udu       itu       Itu
Third    Plural     Neuter               ive       Avu       avu       Avu
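
As an illustration of how tense and PNG markers concatenate, the sketch below builds
regular past forms from a few rows of Table 6.16. It is a minimal sketch under stated
assumptions: the names PAST_PNG and regular_past are hypothetical, only verbs that take
the regular past marker id are covered, and dropping the root-final u before id is an
assumed sandhi step that matches forms quoted in this chapter (nODidanu, hAkidanu,
ODADidaru).

```python
# Past-tense PNG suffixes from Table 6.16, keyed by (person, number, gender).
PAST_PNG = {
    (1, "sg", "m/f"): "enu",
    (1, "pl", "m/f"): "evu",
    (3, "sg", "m"): "anu",
    (3, "sg", "f"): "aLu",
    (3, "pl", "m/f"): "aru",
}


def regular_past(root, person, number, gender):
    """Root + id (regular past marker) + PNG suffix."""
    # Assumed sandhi: a root-final u is dropped before the vowel-initial marker.
    stem = root[:-1] if root.endswith("u") else root
    return stem + "id" + PAST_PNG[(person, number, gender)]


print(regular_past("nODu", 3, "sg", "m"))     # nODidanu
print(regular_past("ODADu", 3, "pl", "m/f"))  # ODADidaru
```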
6.2 CLASSIFYING VERB PARADIGMS AND SYSTEM DESIGN

The first step in implementing the morphological analyzer is to classify the verb
paradigms from a computational perspective. In most cases the problem arises from the
past tense markers, which change from one paradigm to another [160]. I have classified
Kannada verbs into 35 different paradigms by considering all the situations, discussed
previously, that generate different word forms. Every root word in a paradigm inflects
with the same set of inflections. In other words, every word in a paradigm undergoes
similar orthographic changes (sandhi changes), or shows the same inflectional behavior,
when a suffix is added to it. Consider the two words ELu and bILu. When inflected with
the past tense and PNG markers, they form eddenu and biddenu. As these two words show the
same orthographic changes, they are grouped under the same paradigm. Table 6.17
illustrates these paradigms with an example of each.

Table 6.17: Verb Paradigms

Paradigm   Past tense   Description and Example
           marker

Class-1    -tt-         Verbs ending with 'Ayu', 'Iyu', 'ILu'; e.g. sAyu, Iyu, kILu
Class-2    -tt-         Verbs ending with 'ru', 'aLu', 'uLu'; e.g. heru, aLu, uLu
Class-3    -tt-         Verbs ending with 'aLu', 'uLu'; e.g. aLu, uLu
Class-4    -Mt-         Verbs ending with 'illu'; e.g. nillu
Class-5    -t-          Verbs ending with 'i' and 'e'; e.g. kali, mere, koLe
Class-6    -t-          Verbs ending with 'ULu'; e.g. hULu
Class-7    -t-          Verbs ending with 'Olu', 'Ulu', 'Elu'; e.g. jOlu, nUlu, hElu
Class-8    -d-          Verbs ending with 'Ayu', 'Oyu', 'Eyu', 'Iyu'; e.g. kAyu, sIyu
Class-9    -d-          Verbs ending with 'Agu', 'Ogu'; e.g. hOgu, Agu
Class-10   -d-          Verbs ending with 'are'; e.g. bare
Class-11   -d-          Verbs ending with 'ge' and 'gi'; e.g. age, agi
Class-12   -d-          Verbs ending with 'yyu'; e.g. koyyu, geyyu, hoyyu
Class-13   -d-          Verbs ending with 'nnu'; e.g. annu, tinnu, ennu
Class-14   -d-          Verbs ending with 'Eyu'; e.g. gEyu, nEyu
Class-15   -d-          Verbs ending with 'Ayu'; e.g. Ayu
Class-16   -dd-         Verbs ending with 'iru'; e.g. iru
Class-17   -dd-         Verbs ending with 'kaLu'; e.g. kaLu
Class-18   -dd-         Verbs ending with 'ILu', 'ELu'; e.g. bILu, ELu
Class-19   -dd-         Verbs ending with 'Eyu'; e.g. mEyu
Class-20   -dd-         Verbs ending with 'ellu'; e.g. gellu
Class-21   -id-         Verbs ending with 'ADu', 'ODu'; e.g. ADu, nODu, tODu
Class-22   -id-         Verbs ending with 'TTu', 'ddu', 'bbu', 'ttu', 'llu', 'ccu';
                        e.g. aTTu, addu, ubbu, kuttu, cellu, heTTu, beccu, hottu
Class-23   -id-         Verbs ending with 'Oru', 'Eru'; e.g. tOru, sEru
Class-24   -id-         Verbs ending with 'ju', 'su'; e.g. mOju, aMkurisu
Class-25   -id-         Verbs ending with 'MTu', 'Mju', 'Mcu'; e.g. IMTu, aMju, hoMcu
Class-26   -id-         Verbs ending with 'ELu', 'ILu'; e.g. hELu, sILu
Class-27   -nd-         Verbs ending with 'Eyu', 'Oyu'; e.g. bEyu, nOyu
Class-28   -nd-         Verbs ending with 'A' (aru); e.g. taru (tA), baru (bA)
Class-29   -nd-         Verbs ending with 'ollu', 'ellu', 'allu'; e.g. kollu, mellu, sallu
Class-30   -nD-         Verb stems ending with 'ANu'; e.g. kANu
Class-31   -nD-         Verbs ending with 'oLLu'; e.g. koLLu
Class-32   -T-          Verbs ending with 'aDu', 'eDu', 'oDu', 'iDu', 'uDu';
                        e.g. aDu, keDu, koDu, iDu, uDu
Class-33   -k-          Verbs ending with 'ggu' and 'gu'; e.g. oggu, sigu, nagu
Class-34   -d-          Verbs ending with 'kAyu'; e.g. kAyu, dArikAyu
Class-35   -nD-         Verbs ending with 'ko'; e.g. baggiko, bEDiko
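
The paradigm grouping in Table 6.17 can be sketched as a lexicon lookup from verb root to
class and past tense marker. The code below is illustrative only: the names PARADIGMS,
VERB_CLASS, past_marker and class18_past_1sg are hypothetical, and only three classes are
included. The Class-18 stem change (drop the final Lu and shorten the long vowel, written
in upper case in this romanization) is an assumption inferred solely from the two forms
quoted in the text, ELu -> eddenu and bILu -> biddenu.

```python
# A few rows of Table 6.17: class number -> (past tense marker, example verbs).
PARADIGMS = {
    9: ("d", ["hOgu", "Agu"]),
    18: ("dd", ["bILu", "ELu"]),
    21: ("id", ["ADu", "nODu", "tODu"]),
}

# Lexicon: verb root -> paradigm class. Endings alone are ambiguous in the
# table (e.g. 'Eyu' appears in several classes), so each root is listed.
VERB_CLASS = {verb: cls for cls, (_, verbs) in PARADIGMS.items() for verb in verbs}


def past_marker(verb):
    """Look up the past tense marker of a verb through its paradigm class."""
    return PARADIGMS[VERB_CLASS[verb]][0]


def class18_past_1sg(verb):
    """Past first person singular for Class-18 verbs (assumed stem change)."""
    if VERB_CLASS[verb] != 18:
        raise ValueError("only Class-18 verbs are handled in this sketch")
    stem = verb[:-2]                     # drop final "Lu"
    stem = stem[:-1] + stem[-1].lower()  # shorten the long vowel: E -> e, I -> i
    return stem + past_marker(verb) + "enu"


print(class18_past_1sg("ELu"), class18_past_1sg("bILu"))  # eddenu biddenu
```

Because both verbs share one class, a single rule produces both forms, which is the point
of grouping verbs by paradigm.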

Designing a morphological system that can generate all possible word forms for a
given verb root was the next important stage. Fig. 6.1 shows the proposed flowchart for
verb morphology. The meaning of each abbreviation used in the flowchart is given in
Table 6.18.

Table 6.18: Abbreviations in the System Design

Abbreviation                     Examples

PAST: Past tense marker          tt, Mt, t, d, dd, id, Md, D, T, kk, MD, ttidd, Mtidd,
                                 tidd, didd, ddidd, idd, Mdidd, Didd, Tidd, kkidd, MDidd
PRESENT: Present tense marker    utt, yutt, uttidd, yuttidd, iyutt, iyuttidd
FUTURE: Future tense marker      uv, yuv
PNG: Person Number Gender        As shown in Table 6.16
PN: Pronoun                      avanu, avaLu, avaru, adu
PP: Past Participle              u, i, (null)
TM: Tense Marker                 PRESENT, PAST, FUTURE
INF: Infinitive                  isi, is, isal, isalu, isid, iyisid, sid, al, alu, i, sal,
                                 salu, si, s, a

The different levels show the different verb forms that can be derived from the same
root word. For example, the morphemes associated with the verb form OdisinODuttAne are:

Odu + isi + nODu + utt + Ane

Root (Level 1) + INF (Level 2) + AUXV (Level 3) + PRESENT (Level 2) + PNG (Level 3)

6.3 PROPOSED RULE BASED MAG

The proposed rule based MAG tool was developed using the AT&T Finite State Machine.
This section describes the work required to create the proposed rule based MAG system.

6.3.1 Information Required to Build MAG

The following information is required to build a morphological analyzer and generator:

6.3.1.1 Lexicon

The list of stems and affixes, together with basic information about them (noun stem,
verb stem, etc.).

6.3.1.2 Morphotactics

The model of morpheme ordering that explains which classes of morphemes can follow
other classes of morphemes inside a word, e.g., the rule that the Kannada plural morpheme
follows the noun stem rather than preceding it.

6.3.1.3 Orthographic rules

These are spelling rules used to model the changes that occur in a word, usually when
two morphemes combine. For example, insert yu on the surface tape just when the lexical
tape has a morpheme ending in e (or i, etc.) and the next morphemes are tt (PRES) and
Ane (3SM):

beLe + insert yu + PRES(tt) + 3SM(Ane) -> beLe-yu-tt-Ane = beLeyuttAne

[Fig. 6.1: Proposed flowchart for verb morphology (Root, INF, TM, PP, RP, NEG, PN and PNG
nodes; abbreviations as in Table 6.18)]

6.3.2 Creation of rules using FST

An FST is essentially a finite state automaton that works on two (or more) tapes. The
most common way to think about a transducer is as a kind of translating machine that
reads from one tape and writes onto the other. For example, on one tape we read an
inflected noun and on the other we write its root followed by +N +PL, or the other way
around, as shown in Fig. 6.2. A pair of identical symbols means: read a symbol on one
tape and write the same symbol on the other tape. Similarly, a pair that couples +N with
the empty symbol means: read a +N symbol on one tape and write nothing on the other.

Fig. 6.2: FST working principle (lexical level and surface level)

FSTs are bidirectional, so they can be used for both analysis and generation, and they
act as a two-level morphology [162, 163]. A word is represented as a correspondence
between a lexical level and a surface level. The lexical level represents a simple
concatenation of the morphemes making up a word, whereas the surface level represents the
actual spelling of the final word.
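
The two-tape idea can be illustrated with a toy, single-path transducer written as a list
of symbol pairs. This is a minimal sketch and not the AT&T tool's rule format: the names
ARCS, transduce, generate and analyze are hypothetical, the machine accepts exactly one
word, and the pairs are chosen to match the mara+N+PL -> maragaLu example used later in
this chapter.

```python
EPS = ""  # the empty symbol

# Each arc pairs a lexical-level symbol with a surface-level symbol.
ARCS = [("mara", "mara"), ("+N", EPS), ("+PL", "gaLu")]


def transduce(text, arcs, source, target):
    """Read `text` on one tape (index `source`) and write the other (`target`).

    Returns None when the input is not accepted by this single-path machine.
    """
    out, pos = [], 0
    for arc in arcs:
        symbol = arc[source]
        if not text.startswith(symbol, pos):
            return None
        pos += len(symbol)
        out.append(arc[target])
    return "".join(out) if pos == len(text) else None


def generate(lexical):
    """Lexical level -> surface level."""
    return transduce(lexical, ARCS, 0, 1)


def analyze(surface):
    """Surface level -> lexical level (the same arcs, read the other way)."""
    return transduce(surface, ARCS, 1, 0)


print(generate("mara+N+PL"))  # maragaLu
print(analyze("maragaLu"))    # mara+N+PL
```

The same arc list serves both directions, which is what makes a transducer usable for
analysis and generation alike.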

6.3.3 Architecture of Proposed MAG Model

Using all the relevant morphological feature information of Kannada words, well-defined
sandhi rules were created based on FSTs. The architecture of the proposed rule based MAG
system is shown in Fig. 6.3.

For the morphological generator, if the string containing the root word and its
morphemic information is accepted by the automaton, the first level generates the
corresponding root word and morpheme units, as shown in Fig. 6.4.

Fig. 6.3: Architecture of proposed MAG model

Here beLe is the root word, V indicates that the category of the root word is verb,
PRES and FUT are the tense markers for present and future tense respectively, and 3SM is
the PNG marker for third person singular masculine.

Fig. 6.4: Example for Morphotactics Rule

The output of the first level becomes the input of the second level, where the
orthographic (sandhi) rules are handled as shown in Fig. 6.5. If it gets accepted then it
generates the inflected word.

Fig. 6.5: Application of Sandhi Rule

The sandhi rules should be written in such a way that, if the root word ends with e
and the next morphemes are tt(PRES) or Ane(3SM), then insert yu immediately
after the root word. Fig. 6.6 below shows the corresponding sandhi rule.

Fig. 6.6: Example for Sandhi Rule
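
The two levels described above can be sketched in a few lines: a morphotactics step that
turns tags into morpheme units, followed by a sandhi step that inserts yu. This is an
illustrative sketch only, not the FST rules of the proposed system. The names MORPHEMES,
morphotactics, sandhi and generate are hypothetical, and the tag table holds just the
tags from the beLe example.

```python
# Level 1 (morphotactics): tag -> morpheme. The category tag V has no surface form.
MORPHEMES = {"V": "", "PRES": "tt", "3SM": "Ane"}


def morphotactics(lexical):
    """Split 'root+TAG+TAG' into the root and its morpheme units."""
    root, *tags = lexical.split("+")
    return [root] + [MORPHEMES[tag] for tag in tags if MORPHEMES[tag]]


def sandhi(units):
    """Level 2: insert yu after a root ending in e when tt or Ane follows."""
    root, *rest = units
    if root.endswith("e") and rest and rest[0] in ("tt", "Ane"):
        root += "yu"
    return root + "".join(rest)


def generate(lexical):
    """Feed the output of level 1 into level 2, as in the architecture above."""
    return sandhi(morphotactics(lexical))


print(generate("beLe+V+PRES+3SM"))  # beLeyuttAne
print(generate("mADu+V+PRES+3SM"))  # mADuttAne
```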

Based on the inflections and differences, all possible Morphotactic and Sandhi rules
were written for all forms of nouns and verbs using FST.

Example Rules for Noun Morphology

####Plural "gaLu"#####
[b2] [<epsilon>] / __ gaLu
This rule works as follows. Suppose a root word is given as input in the form
Root+N+PL, for example mara+N+PL, where mara (tree) is the root word. The + symbol is
treated as [b2], which is already defined in the alphabets file. +N is replaced by
[<epsilon>], i.e., the empty string, and +PL is replaced by the plural marker gaLu, which
is added to the root word. The tool then gives the output maragaLu.

Example Rules for Verb Morphology

####PRESENT tt#####
[b2] [<epsilon>] / __ tt

This rule works as follows. Suppose a root word is given as input in the form
Root+V+PRES, for example Odu+V+PRES, where Odu (read) is the root word. The + symbol is
treated as [b2], which is already defined in the alphabets file and is common to both
nouns and verbs. The tag +V is replaced by [<epsilon>], that is, by the empty string,
and the tag +PRES is replaced by the present tense marker tt, so the tool gives the
output Odutt.
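
The effect of the two example rules can be mimicked with a short Python sketch. This is
only an approximation of the rule behaviour, not the FST rule syntax shown above; the
table TAG_TO_MARKER and the function generate_surface are hypothetical names, and only
the four tags discussed in the text are covered.

```python
# Illustrative sketch only: tag-to-marker substitution for Root+CATEGORY+FEATURE input.
# TAG_TO_MARKER and generate_surface are hypothetical names, not part of the tool.

TAG_TO_MARKER = {
    "N": "",       # category tag is replaced by the empty string (epsilon)
    "V": "",       # same for verbs
    "PL": "gaLu",  # plural marker
    "PRES": "tt",  # present tense marker
}


def generate_surface(tagged):
    """Turn e.g. 'mara+N+PL' into 'maragaLu' by replacing each tag with its marker."""
    root, *tags = tagged.split("+")  # '+' plays the role of the boundary symbol [b2]
    return root + "".join(TAG_TO_MARKER[tag] for tag in tags)


print(generate_surface("mara+N+PL"))
print(generate_surface("Odu+V+PRES"))
```

Splitting on the boundary symbol rather than doing substring replacement avoids
accidentally rewriting a tag that is a prefix of a longer tag.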

6.4 MORPHOLOGICAL ANALYZER USING MACHINE LEARNING APPROACH

The proposed morphological analyzer model was developed with a supervised machine
learning approach based on Support Vector Machines (SVM), a popular classification and
regression technique. In supervised machine learning, the training corpus consists of
pairs of input objects and their desired outputs.

Generally, in the morphological analysis process, a complex word form is transformed
into its root and suffixes. In the machine learning approach, all the rules, including
complex spelling rules, are handled by the classification task. Machine learning
approaches need only corpora with linguistic information and do not require any hand
coded morphological rules [164]. The morphological or linguistic rules are automatically
extracted from the annotated corpora. In the proposed method, sequence labelling is used
to align the parallel corpus, and the morphological analysis problem is converted into
a classification problem.

6.4.1 Corpus development

The performance of the morphological analyzer depends greatly on the aligned
morphological corpora, which should be large, of good quality and representative. The
sequence of steps involved in the corpus development is shown in Fig. 6.7.

6.4.1.1 Romanization

The SVM tool used here supports only Roman characters, whereas text in Dravidian
languages such as Kannada is encoded in Unicode. Mapping files were therefore created
and used to map from Unicode to Roman and vice versa. The Romanized, aligned
input-output corpus, consisting of the most commonly used verbs selected from all verb
paradigms, was created manually.
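
As a rough illustration of what such a Unicode-to-Roman mapping involves, the following
Python sketch Romanizes a few Kannada characters. The mapping tables are a small
hypothetical excerpt, not the mapping files used in this work; they follow the
convention visible in the examples of this chapter (capital letters for long vowels and
retroflex consonants) and cover only a handful of characters.

```python
# Illustrative sketch only: a tiny, hypothetical excerpt of a Unicode-to-Roman mapping.
# The actual mapping files used for the corpus are not reproduced here.

VIRAMA = "\u0ccd"  # suppresses the inherent vowel 'a' of a consonant
CONSONANTS = {
    "\u0c95": "k", "\u0c97": "g", "\u0ca4": "t", "\u0ca8": "n",
    "\u0cae": "m", "\u0cb0": "r", "\u0cb2": "l", "\u0cb3": "L",
}
VOWEL_SIGNS = {"\u0cbe": "A", "\u0cbf": "i", "\u0cc1": "u"}  # dependent vowels
VOWELS = {"\u0c86": "A"}  # independent vowels


def romanize(word):
    """Map a Kannada Unicode string to Roman letters, one code point at a time."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # A consonant carries an inherent 'a' unless a vowel sign or virama follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch != VIRAMA:
            raise ValueError(f"no mapping for U+{ord(ch):04X}")
    return "".join(out)


print(romanize("\u0cae\u0cb0"))                          # Kannada spelling of mara
print(romanize("\u0c95\u0cb2\u0cbf\u0ca4\u0ca8\u0cc1"))  # Kannada spelling of kalitanu
```

The reverse direction (Roman to Unicode) would need the inverse tables and is omitted.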

Fig. 6.7: Preprocessing steps

6.4.1.2 Segmentation

This step involves four stages: grapheme segmentation, syllable splitting, C-V
(Consonant-Vowel) representation and morpheme segmentation. In the first stage, every
Romanized word in the corpus is segmented into Kannada graphemes. In the second stage,
these graphemes are split into their consonant and vowel units. In the third stage, the
markers -C and -V are appended to the consonant and vowel units respectively. Finally,
the symbol * is used to indicate the morpheme boundaries in the output data.
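
The C-V representation and the boundary marking can be sketched in Python as follows.
The sketch assumes, for simplicity, that every Roman letter is one unit and that the
vowel letters are a, A, i, I, u, U, e, E, o and O; the function names are hypothetical
and the manual segmentation of the word into morphemes is taken as given.

```python
# Illustrative sketch only: C-V tagging of input units and '*' marking of output labels.
# Assumes one Roman letter per unit; cv_tag and boundary_labels are hypothetical names.

VOWEL_LETTERS = set("aAiIuUeEoO")


def cv_tag(word):
    """Append -V to vowel letters and -C to all other letters."""
    return [f"{ch}-{'V' if ch in VOWEL_LETTERS else 'C'}" for ch in word]


def boundary_labels(morphemes):
    """Label every letter with itself; the last letter of each morpheme gets '*'."""
    labels = []
    for morpheme in morphemes:
        labels.extend(morpheme[:-1])
        labels.append(morpheme[-1] + "*")
    return labels


print(cv_tag("kalitanu"))
print(boundary_labels(["kali", "t", "anu"]))
```

For the word kalitanu segmented as kali, t and anu, the two lists printed correspond to
the input and output rows of an aligned training sample.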

6.4.1.3 Alignment and Mapping

Using sequence labelling approaches [165], the segmented input and output words were
aligned vertically, segment by segment, with a space between each input segment and its
output label. Table 6.19 shows a sample alignment of input and output data in the
corpus.

Table 6.19: Sample Training Data Format

Input    k-C  a-V  l-C  i-V  t-C  a-V  n-C  u-V
Output   k    a    l    i*   t*   a    n    u*
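
A minimal sketch of writing such a sample in a vertical, one-pair-per-line layout is
given below. The exact file layout expected by the training tool is an assumption here,
and the function name to_training_lines is hypothetical.

```python
# Illustrative sketch only: vertical alignment of input units and output labels.
# The one-pair-per-line layout is an assumption about the training file format.


def to_training_lines(input_units, output_labels):
    """Pair each input unit with its output label, separated by a single space."""
    if len(input_units) != len(output_labels):
        raise ValueError("input and output sequences must have the same length")
    return [f"{unit} {label}" for unit, label in zip(input_units, output_labels)]


inputs = ["k-C", "a-V", "l-C", "i-V", "t-C", "a-V", "n-C", "u-V"]
outputs = ["k", "a", "l", "i*", "t*", "a", "n", "u*"]
print("\n".join(to_training_lines(inputs, outputs)))
```

The length check makes explicit why sequences of unequal length have to be repaired
before they can be aligned.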

6.4.1.4 Dissolving Mismatch

When input and output words are mapped, a mismatch may occur because the number of
input units is either larger or smaller than the number of output units. In the first
case, when there are more input units than output units, the mismatch can be solved by
inserting a null symbol $ into the output vector, based on the morphosyntactic rules.
Consider the Kannada verb kaliyuttAne, which has 11 segments in the input sequence and
10 segments in the output sequence. Due to the morphosyntactic rule, the y that occurs
in the input sequence has no counterpart in the output sequence. To equalize the input
and output data in such a situation, y is mapped to the empty symbol $ in the training
data, as shown in Table 6.20.
In the second case, as opposed to the first, the number of segments in the input
sequence is smaller than that in the corresponding output sequence. To illustrate this
situation, consider the Kannada verb OdidaLu, which has 7 segments in the input sequence
but 8 segments in the output sequence. To overcome this problem, using the
morphosyntactic rule, the first occurrence of d-C in the input sequence is mapped to the
two output segments d and u, written as the single label du*, in the training data.

Table 6.20: Dissolving Mismatch

Case 1 (kaliyuttAne)
  Input sequence (11 segments):            k-C | a-V | l-C | i-V | y-C | u-V | t-C | t-C | A-V | n-C | e-V
  Mismatched output labels (10 segments):  k | a | l | i* | u | t | t* | A | n | e*
  Corrected output labels (11 segments):   k | a | l | i* | $ | u | t | t* | A | n | e*

Case 2 (OdidaLu)
  Input sequence (7 segments):             O-V | d-C | i-V | d-C | a-V | L-C | u-V
  Mismatched output labels (8 segments):   O | d | u* | i | d* | a | L | u*
  Corrected output labels (7 segments):    O | du* | i | d* | a | L | u*
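
The two repairs can be sketched in Python with a simple greedy heuristic. This is only
a stand-in for the morphosyntactic rules that were applied to the corpus: it looks one
label ahead to decide whether an input unit has no counterpart (insert $) or an output
label has no counterpart (merge it into the previous label). The function name
dissolve_mismatch is hypothetical.

```python
# Illustrative sketch only: a greedy one-step-lookahead heuristic, not the
# morphosyntactic rules actually used. dissolve_mismatch is a hypothetical name.


def letter(label):
    """Strip the morpheme boundary marker from an output label."""
    return label.rstrip("*")


def dissolve_mismatch(input_units, output_labels):
    """Return output labels repaired to have one label per input unit."""
    aligned, j = [], 0
    for unit in input_units:
        src = unit.split("-")[0]
        cur = letter(output_labels[j]) if j < len(output_labels) else None
        nxt = letter(output_labels[j + 1]) if j + 1 < len(output_labels) else None
        if cur == src:
            aligned.append(output_labels[j])
            j += 1
        elif nxt == src and aligned:
            # Case 2: the output has an extra label; merge it into the previous one.
            aligned[-1] += output_labels[j]
            aligned.append(output_labels[j + 1])
            j += 2
        else:
            # Case 1: the input unit has no output counterpart; map it to '$'.
            aligned.append("$")
    if j != len(output_labels):
        raise ValueError("could not align all output labels")
    return aligned


case1_in = ["k-C", "a-V", "l-C", "i-V", "y-C", "u-V", "t-C", "t-C", "A-V", "n-C", "e-V"]
case1_out = ["k", "a", "l", "i*", "u", "t", "t*", "A", "n", "e*"]
case2_in = ["O-V", "d-C", "i-V", "d-C", "a-V", "L-C", "u-V"]
case2_out = ["O", "d", "u*", "i", "d*", "a", "L", "u*"]
print(dissolve_mismatch(case1_in, case1_out))
print(dissolve_mismatch(case2_in, case2_out))
```

On the two verbs discussed above, the heuristic reproduces the corrected label
sequences; words with several adjacent mismatches would need the full rule set.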

6.4.2 Morphological Analyzer Model Creation

The architecture of the proposed model consists mainly of three phases, as shown in
Fig. 6.8.

Fig. 6.8: Architecture of Morphological Analyzer Model

The pre-processed parallel corpus consists of sequences of input characters and their
corresponding output labels. The parallel corpus, consisting of more than 200,000 words,
was used for training with the SVMTlearn tool, and the morphological analyzer model
called model-I was generated. Model-I is used to analyze and identify the different
morphemes associated with a given input test word. Similarly, another model called
model-II was created for assigning grammatical classes to each morpheme in a word; this
second model was trained using sequences of morphemes and their grammatical categories.
The working principle is as follows. The test input word is given to the trained
model-I, which predicts a label for each input segment. In the next phase, the segmented
morphemes are given to the trained model-II, which predicts the grammatical categories
of the segmented morphemes for the given input word.
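
The step that connects the two models, turning the labels predicted by model-I into the
morphemes passed to model-II, can be sketched as follows. The trained models themselves
are not reproduced; the label sequences are written out by hand, and the function name
labels_to_morphemes is hypothetical.

```python
# Illustrative sketch only: decoding a predicted label sequence into morphemes.
# The labels below are typed by hand; no trained model is involved.


def labels_to_morphemes(labels):
    """Group labels into morphemes: '*' closes a morpheme, '$' is dropped."""
    morphemes, current = [], ""
    for label in labels:
        if label == "$":
            continue  # null symbol: the input unit has no output counterpart
        if label.endswith("*"):
            morphemes.append(current + label[:-1])
            current = ""
        else:
            current += label
    if current:
        morphemes.append(current)
    return morphemes


print(labels_to_morphemes(["k", "a", "l", "i*", "$", "u", "t", "t*", "A", "n", "e*"]))
print(labels_to_morphemes(["O", "du*", "i", "d*", "a", "L", "u*"]))
```

The resulting morpheme lists are the kind of sequence on which the second model assigns
grammatical categories.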

6.5 SYSTEM EVALUATION AND PERFORMANCE

Developing a MAG that covers all types of word forms is a challenging task. The
developed rule based MAG is capable of analyzing and generating a list of twenty
thousand nouns, around three thousand verbs and a relatively smaller list of adjectives.
The uniqueness of the developed MAG is its capacity to generate and analyze transitive,
causative and tense forms, in addition to passive constructions, auxiliaries and verbal
nouns. The performance of the developed system can be substantially improved by adding
more rules, such as rules for complex morphology. The accuracy of the developed MAG can
also be improved by checking it against more and more different types of word lexicons.
A rule based machine translation system from English to Kannada was developed using the
proposed MAG. Fig. 6.9 shows a command line screenshot of the developed RBMAG.
In the second proposed method, a parallel corpus consisting of more than 200,000 words
was used for training with the SVMTlearn tool, and the morphological analyzer model was
generated. Using the SVMTagger, more than 50,000 different verb forms were tested,
selected from two standard dictionaries [166,167] and from the Amrita POS tagged corpus.
The performance of the system was evaluated using the SVMTeval tool, and the incorrect
outputs were noted. In contrast to the rule based approach, the performance of this
system is considerably increased by adding to the training corpus those input words
whose outputs were incorrect during testing and evaluation. In the experiment, the
system achieved a very competitive accuracy of 96.25% for Kannada verbs.
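
The evaluate-and-extend cycle described above can be sketched in Python as follows. The
sketch does not call SVMTagger or SVMTeval; the predictor is a toy dictionary lookup,
the two test items are illustrative only and are not results of this work, and the
function name evaluate is hypothetical.

```python
# Illustrative sketch only: accuracy computation and collection of misanalyzed words.
# The toy predictor and the two test items are not experimental data.


def evaluate(predict, gold_pairs):
    """Return (accuracy, errors) where errors are the pairs the predictor got wrong."""
    errors = [(word, gold) for word, gold in gold_pairs if predict(word) != gold]
    accuracy = (len(gold_pairs) - len(errors)) / len(gold_pairs)
    return accuracy, errors


known = {"kalitanu": "kali*t*anu*"}          # toy stand-in for a trained model
gold = [("kalitanu", "kali*t*anu*"), ("OdidaLu", "Odu*id*aLu*")]

accuracy, errors = evaluate(lambda word: known.get(word), gold)
print(accuracy, errors)

# Misanalyzed words are added to the training data before retraining.
known.update(dict(errors))
print(evaluate(lambda word: known.get(word), gold)[0])
```

In the actual system, the update step corresponds to extending the training corpus and
retraining the model rather than editing a lookup table.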

Fig. 6.9: Command Line Output of Morph Tool

Sample screenshots of the developed MAG model for nouns are shown in Fig. 6.10 and
6.11. Similarly, Fig. 6.12 and 6.13 show screenshots of the developed MAG model for
verbs.

Fig. 6.10: Screenshot of Kannada Morph analyzer for Noun

Fig. 6.11: Screenshot of Kannada Morph generator for Noun

Fig. 6.12: Screenshot of Kannada Morph analyzer for Verb

Fig. 6.13: Screenshot of Kannada Morph generator for Verb

6.6 SUMMARY

This chapter presents the part of the research work that deals with the design and
development of a morphological analyzer and generator for Kannada language using rule
based as well as statistical approaches.
Development of a morphological analyzer and generator for all types of word forms is a
challenging task for an agglutinative language like Kannada. The implementations aimed
to incorporate more lexical information of Kannada language with good semantic features,
which will solve the morphological problem more effectively. The performance of the
statistical approach depends on large aligned parallel corpora covering all types of
word categories. On the other hand, the performance of the rule based approach depends
on having all types of simple and complex linguistic rules, in order to cover all types
of word forms.
The performance of the developed rule based system can be substantially improved by
adding more rules and by checking against more and more different types of word
lexicons. The performance of the statistical approach can be improved by increasing the
corpus size to cover other word categories such as nouns, pronouns and adverbs.

6.7 PUBLICATIONS

1. Antony P J, Anand Kumar M and Soman K P: Paradigm Based Morphological
Analyzer for Kannada Language Using Machine Learning Approach, International
journal on-Advances in Computer Science and Technology (ACST), ISSN 0973-6107,
Vol 3 No. 4, 2010, pp. 457-481.
2. Ramasami Veerappan, Antony P J and Soman K P: A Rule based Kannada
Morphological Analyzer and Generator using Finite State Transducer, International
journal on Computer Application -IJCA (0975-8887), Volume 27 No. 10, August
2011.
