
English-Telugu Rule Based Machine Translation system

A Thesis submitted for the degree of

Master of Science (by research)
in the School of Engineering
By
R.SRIBADRI NARAYANAN
Centre for Excellence in Computational Engineering
Amrita School of Engineering

Amrita Vishwa Vidyapeetham University Coimbatore 641105


March, 2012


Amrita School of Engineering
Amrita Vishwa Vidyapeetham, Coimbatore – 641105

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled ENGLISH-TELUGU RULE BASED MACHINE TRANSLATION SYSTEM submitted by R.SRIBADRI NARAYANAN (Reg. No.: CB.EN.M*CEN09009) for the award of the degree of Master of Science (by research) in the School of Engineering, is a bonafide record of the research work carried out by him under my guidance. He has satisfied all the requirements put forth for the project and has completed all the formalities regarding the same to the fullest of my satisfaction.


Ettimadai, Coimbatore. Date:

DR. K P SOMAN

RESEARCH GUIDE AND HEAD, CEN.

Amrita School of Engineering,

Amrita Vishwa Vidyapeetham, Coimbatore

641105

Centre for Excellence in Computational Engineering.

DECLARATION

I, R.SRIBADRI NARAYANAN (REG. NO.: CB.EN.M*CEN09009), hereby declare that this thesis entitled ENGLISH-TELUGU RULE BASED MACHINE TRANSLATION SYSTEM is the record of the original work done by me under the guidance of Dr. K P Soman, Head, Centre for Excellence in Computational Engineering, Amrita School of Engineering, Coimbatore and to the best of my knowledge this work has not formed the basis for the award of any degree / diploma / associateship / fellowship or a similar award, to any candidate in any University.

Countersigned by

Place: Ettimadai Date:

Signature of the Student

K P SOMAN

PROFESSOR AND HEAD, CEN, AMRITA VISHWA VIDYAPEETHAM, COIMBATORE.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my guide, Dr. K. P. Soman, for his support, valuable suggestions and constant encouragement throughout the project. I would like to thank Dr. S. Rajendran, who spent an enormous amount of time guiding us and resolving our problems whenever it was necessary.

I would like to thank Ms. Mallika V, Research Associate, Computational Engineering and Networking, for sharing her linguistic knowledge, for her full support and for the many ideas she contributed to scale up the system.

I am very grateful to my friend Mr. Saravanan S., who has immense experience and meticulously helped to shape my work on the project. I extend my gratitude to Mr. Sankara Narayanan for his valuable suggestions and ideas. I would also like to thank Mr. Senthil for his support in my research work and for his valuable ideas.

I extend my cordial thanks to all the teaching and non-teaching staff of the Department of Computational Engineering and Networking for the help rendered at various phases of the project work.

I express my thanks to my parents and friends, who always stood by me with their valuable suggestions and help.

ABSTRACT

Translation from one language to another plays a vital role in sharing information between languages. For example, Indian languages have epics such as the Ramayana and the Mahabharata, which are life-transforming stories and should be made available in all other languages. Similarly, many advanced or recent technological topics should be translated into our Indian languages. For this purpose we have developed an English to Telugu machine translation system. The system takes an English sentence as input and produces a Telugu sentence as output. Before the Telugu output is produced, the English sentence goes through a series of processes: parsing, reordering, lexicalization, transliteration and morphology.

The parser gives the grammatical tree structure of the English sentence. For this purpose we use the Stanford parser, which gives better results than other parsers.

In reordering, we reorder the English sentence to match the Telugu word order. An English sentence has Subject-Verb-Object (SVO) order, whereas Telugu uses Subject-Object-Verb (SOV) order. The sentences are reordered using reordering rules.

Lexicalization is the process of changing the English words to Telugu words. Using an English-Telugu bilingual dictionary, English words are looked up and replaced with Telugu words.

Transliteration is done using a Support Vector Machine (SVM) based approach developed at CEN, Amrita University. It is mainly used for named entities and for words which are not available in the bilingual dictionary.

Morphology is done for the grammatical words. Morphology plays a vital role in Telugu, because the language is rich in inflection and agglutinative in nature. We have used SVMTool for the morphological analyzer and a data driven approach for the morphological generator.

The final process is integrating the tools in a single platform and producing the Telugu output.

CONTENTS

Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 ISSUES IN MACHINE TRANSLATION
Chapter 2 Literature Survey
2.1 MACHINE TRANSLATION
2.2 THE NECESSITY OF MACHINE TRANSLATION
2.3 DIFFERENT CATEGORIES OF MACHINE TRANSLATION SYSTEMS
2.4 VARIOUS APPROACHES TO MACHINE TRANSLATION
2.4.1 LINGUISTICS OR RULE BASED APPROACH
2.4.2 NON-LINGUISTIC APPROACHES
2.4.3 HYBRID APPROACH
2.5 MORPHOLOGICAL ANALYZER AND GENERATOR
Chapter 3 Overview Of Telugu Language
3.1 DEMOGRAPHIC INFORMATION
3.2 GENERIC AFFILIATION AND HISTORY
3.3 THE TELUGU SCRIPT
3.3.1 ORIGIN AND DEVELOPMENT
3.3.2 TELUGU ALPHABET
3.4 COMPUTATIONAL GRAMMAR OF TELUGU
3.4.1 NOUNS
3.4.2 VERBS
Chapter 4 Overview Of English-Telugu Machine Translation System
4.1 PARSER
4.2 REORDERING
4.3 DICTIONARY
4.4 TRANSLITERATION
4.5 MORPHOLOGICAL ANALYZER
4.5.1 INTRODUCTION
4.5.2 DATA CREATION FOR SUPERVISED LEARNING
4.5.3 IMPLEMENTATION OF MORPHOLOGICAL ANALYZER MODULE
4.6 MORPHOLOGICAL GENERATOR
4.6.1 INTRODUCTION
4.6.2 MORPHOLOGICAL GENERATOR FOR TELUGU
4.6.3 DIFFICULTIES IN MORPHOLOGICAL GENERATION FOR TELUGU
4.6.4 FORMATION OF INFLECTIONAL TABLE
4.6.5 METHODOLOGY
Chapter 5 Results
5.1 TESTING AND RESULTS
5.2 DISCUSSION
5.3 SCREEN SHOT OF MORPHOLOGICAL ANALYZER
5.4 TESTING AND RESULTS
5.5 DISCUSSION
5.6 SCREEN SHOT OF MORPHOLOGICAL GENERATOR
5.7 TESTING AND RESULTS
5.8 DISCUSSION
5.9 SCREEN SHOT OF ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM
Chapter 6 Conclusion
References
Publication

LIST OF FIGURES

Fig. 2.1. Illustrates different approaches of machine translation system
Fig. 2.2. DAWG
Fig. 4.1. General block diagram for English-Telugu machine translation system
Fig. 4.2. Example to illustrate morphological analyzer
Fig. 4.3. Formation of paradigm
Fig. 4.4. Steps involved in preprocessing data for SVM model
Fig. 4.5. SVM model for morphological analyzer
Fig. 4.6. Illustration for training module 1 and 2 in SVM
Fig. 4.7. Overview of morphological generator system
Fig. 4.8. Grammatical tree structure
Fig. 4.9. Reordering of "She is writing a letter"
Fig. 4.10. Lexicalization
Fig. 5.1. GUI for morphological analyzer-verb
Fig. 5.2. GUI for morphological analyzer-noun
Fig. 5.3. GUI for morphological generator-verb
Fig. 5.4. GUI for morphological generator-noun
Fig. 5.5. GUI for English-Telugu machine translation system
Fig. 5.6. GUI for English-Telugu machine translation system

LIST OF TABLES

Table 2.1. An example to illustrate the direct approach to machine translation system
Table 2.2. An example to illustrate the interlingua representation
Table 2.3. An example to illustrate the transfer approach
Table 4.1. Database information
Table 4.2. Grouping of words in 'ADU' paradigm
Table 4.3. Sample input for SVM model
Table 4.4. Verb paradigm
Table 4.5. Noun paradigm
Table 4.6. Morpho-lexical forms
Table 5.1. Testing results of morphological analyzer-noun
Table 5.2. Testing results of morphological analyzer-verb
Table 5.3. Testing results of morphological generator-noun
Table 5.4. Testing results of morphological generator-verb
Table 5.5. Testing results of translation system

ABBREVIATIONS

CV        Consonant-Vowel
DAWG      Directed Acyclic Word Graph
FAMT      Fully Automatic Machine Translation
FAHQMT    Fully Automatic High Quality Machine Translation
FST       Finite State Transducers
HAMT      Human Aided Machine Translation
MAHT      Machine Aided Human Translation
MT        Machine Translation
NLP       Natural Language Processing
PCFG      Probabilistic Context Free Grammar
POS       Parts Of Speech
SOV       Subject-Object-Verb
SVO       Subject-Verb-Object
SVM       Support Vector Machine
XML       Extensible Markup Language

CHAPTER 1 INTRODUCTION

Machine translation is the task of automatically translating text in a source language into a target language. It can be considered an area of applied research that draws ideas and techniques from linguistics, computer science, artificial intelligence, translation theory, and statistics. Even though machine translation was envisioned as a computer application in the 1950s and research has continued for 60 years, it is still considered to be an open problem.

In a linguistically diverse country like India, machine translation is an important and highly appropriate technology for localization. Human translation in India dates back to ancient times, as is evident from the various works of philosophy, arts, mythology, religion and science that have been translated among ancient and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. At present, human translation in India finds application mainly in administration, media and education and, to a lesser extent, in business, arts, and science and technology.

India has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of India. English is the language most widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages. Only about 5% of the population speaks English.

In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is done manually, and the use of automation is largely restricted to word processing. Two specific examples of high-volume manual translation are the translation of news from English into local languages and the translation of annual reports of government departments and public sector units among English, Hindi and the local language. Many resources in English, such as news, weather reports and books, are manually translated into Indian languages. Of these, news and weather reports from all around the world are the ones most often translated by human translators. Human translation is slow and costly compared with machine translation; automatic machine translation is chosen over human translation because it is faster and cheaper.

1.1 ISSUES IN MACHINE TRANSLATION

Natural language processing has many challenges, of which the biggest is the inherent ambiguity of natural language. Machine translation systems have to deal with ambiguity and various other natural language phenomena. In addition, the linguistic diversity between the source and target languages makes machine translation a bigger challenge. This is particularly true for widely divergent languages such as English and the Indian languages. The major structural differences between English and Indian languages can be summarized as follows. English is a highly positional language with rudimentary morphology and a default sentence structure of SVO. Indian languages are highly inflectional, with rich morphology, relatively free word order, and a default sentence structure of SOV [3]. In addition, there are many stylistic differences. For example, it is common in English to see very long sentences that use abstract concepts as subjects and string several clauses together. Such constructions are not natural in Indian languages, and this leads to major difficulties in producing good translations. Compared to Hindi, Telugu is rich in morphology and is an agglutinative language. As is recognized all over the world, with the current state of the art in machine translation it is not possible to have fully automatic, high-quality, general-purpose machine translation. Practical systems need to handle ambiguity and the other complexities of natural language processing by relaxing one or more of the above dimensions.

CHAPTER 2 LITERATURE SURVEY

2.1 MACHINE TRANSLATION

Machine translation is one of the oldest, largest and most active areas in natural language processing, and was one of man's oldest dreams. It became a reality in the twentieth century in the form of computer programs capable of translating a wide variety of texts from one natural language into another. Yet there are no 'translating machines' capable of taking text in any language and producing a perfect translation in any other language without human intervention or assistance. The programs developed so far produce 'raw' translations of texts in relatively well-defined subject domains, which can be revised manually to give good-quality translated texts at an economically feasible rate, or which in their unrevised state can be read and understood by experts in the subject for information purposes.

2.2 THE NECESSITY OF MACHINE TRANSLATION

Machine translation is an important technology for localization, and is particularly relevant in a linguistically diverse country like India, which has 18 constitutional languages written in 10 different scripts. Translation among these languages is therefore very important, but it is not possible to translate all the required resources manually. Only a small percentage of people in our country speak English, and though Hindi is our official language, not everyone knows Hindi. Hence the need for machine translation. Moreover, resources such as text books, literature and other valuable material might be available only in one specific language. For example, consider a work of literature that is available in a language known only to a few people and is needed by others who have no knowledge of that language. It will be difficult for those people to use the resource, because the language acts as a communication barrier. In this situation we seek the help of human translators. But it is tedious to find a translator who knows both the language in which the literature was written and the language known to the user. Human translation is also time consuming and very expensive, and if the resource to be translated is huge, it is practically impossible for humans to translate it in its entirety in a short span of time. The practical solution to this problem is to design a machine which can perform the translation automatically.
2.3 DIFFERENT CATEGORIES OF MACHINE TRANSLATION SYSTEMS

The three categories of machine translation systems are as follows [1].

MACHINE AIDED HUMAN TRANSLATION can range from automatic look-up programs to systems which are practically fully automatic but which require the translator to approve each sentence. Examples of some of the more successful software of this type are the Translator's Workbench of Trados and INK Tools.

HUMAN AIDED MACHINE TRANSLATION also covers a broad range of systems. Human intervention can mean pre-editing of the source language (SL) text by a person skilled at using the machine translation system, in order to make the SL text easier for the computer to understand, or it can mean interactive intervention, in which the computer may ask the translator questions about the meaning of the SL text. Some of the most irritating MT systems have used this approach, requiring the translator to sit in front of the computer terminal and answer such questions as:

The word 'pen' means:
a writing pen
a play pen
a pig pen

Human intervention can also mean post-editing to check the translation and fix mistakes made by the computer. It should be noted that the pre-editing and glossary compilation required for HAMT typically require a trained linguist who can parse the syntax of the sentence, not simply a translator who understands the foreign language and can express it in his or her own language.

The most primitive system is the one which requires pre-editing, since the computer cannot handle the text unless a human converts the natural language (NL) into a semi-artificial language which is easier for the computer to understand. The ideal is when the automatic translation is so good that all that is necessary is to check the translation and change a few details. Interactive intervention can fall anywhere in between.

FULLY AUTOMATED MACHINE TRANSLATION systems exist, and although they may suit the needs of people who have to search through mountains of information and only need to get a very general idea of the contents of a document (a good example is provided by the low-quality needs of the military and the intelligence agencies), high-quality translation of truly natural language which is really fully automatic hardly exists. Fully Automatic High Quality Machine Translation (FAHQMT) systems either require the compilation of extensive glossaries or are restricted to specific sub-worlds or sublanguages, or both.

2.4 VARIOUS APPROACHES TO MACHINE TRANSLATION

Since the idea of using machines for language translation was first put forward, many different approaches to machine translation have been proposed, implemented and put into use. The main approaches to machine translation are shown in Figure 2.1.

FIGURE 2.1 ILLUSTRATES DIFFERENT APPROACHES OF MACHINE TRANSLATION SYSTEM

2.4.1 LINGUISTICS OR RULE BASED APPROACH

Rule based approaches require linguistic knowledge during translation. They use grammar rules and computer programs to analyse the text and determine the grammatical information and features of each word in the source language, and then translate it by replacing each word with a lexicon entry or word that has the same context in the target language. The rule based approach is the principal methodology that was developed in machine translation. Linguistic knowledge is required in order to write the rules for this type of approach, and these rules play a vital role at the different levels of translation.

2.4.1.1 DIRECT APPROACH

The direct translation approach can be considered the first approach to machine translation. It involves analysing morphological information, identifying the constituents, reordering the words of the source language according to the word order pattern of the target language, replacing the source language words with target language words using a lexical dictionary for that particular language pair and, as a last step, inflecting the words appropriately to produce the translation.

TABLE 2.1 AN EXAMPLE TO ILLUSTRATE THE DIRECT APPROACH TO MACHINE TRANSLATION

Input Sentence in English                 He came late to school yesterday
Morphological Analysis                    He come PAST late to school yesterday
Constituent Identification                <He> <come PAST> <late> <to school> <yesterday>
Word Reordering                           <He> <yesterday> <to school> <late> <come PAST>
Dictionary Lookup                         అతను తునన పాఠశాల చివరలో వచిి
Inflect (the final translated sentence)   అతను తునన పాఠశాల చివరిలో వచిింది

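
The stages of the direct approach shown in Table 2.1 can be sketched in a few lines of code. This is only a minimal illustration and not the system described in this thesis: the lemma table, the hand-marked constituents, the target order and the placeholder target tokens (`TE_...`) are all hypothetical, and stand in for a real morphological analyzer, constituent identifier, reordering rule set and bilingual dictionary.

```python
# Toy direct-translation pipeline mirroring the rows of Table 2.1.
# All lexical data below is hypothetical and covers only the example sentence.

# Surface form -> (lemma, feature); stands in for a morphological analyzer.
MORPH = {"came": ("come", "PAST")}

# Hand-marked constituents of the example; a real system would identify them.
CONSTITUENTS = [["He"], ["came"], ["late"], ["to", "school"], ["yesterday"]]

# Target order as indices into CONSTITUENTS: subject, time, place, manner, verb.
TARGET_ORDER = [0, 4, 3, 2, 1]

# Placeholder bilingual dictionary; TE_* tokens stand for Telugu entries.
LEXICON = {
    "he": "TE_he",
    "yesterday": "TE_yesterday",
    "to school": "TE_to-school",
    "late": "TE_late",
    "come": "TE_come",
}


def analyse(constituent):
    """Replace inflected words by 'lemma+FEATURE' where the analyzer knows them."""
    out = []
    for word in constituent:
        lemma, feat = MORPH.get(word, (word, None))
        out.append(lemma if feat is None else f"{lemma}+{feat}")
    return out


def translate_direct(constituents, order):
    analysed = [analyse(c) for c in constituents]   # morphological analysis
    reordered = [analysed[i] for i in order]        # word reordering
    target = []
    for constituent in reordered:
        phrase = " ".join(constituent)
        lemma, _, feat = phrase.partition("+")
        word = LEXICON[lemma.lower()]               # dictionary lookup
        # Inflection is only marked here; a generator would realise the feature.
        target.append(word if not feat else f"{word}<{feat}>")
    return target


print(translate_direct(CONSTITUENTS, TARGET_ORDER))
```

Keeping each stage as a separate step makes the weakness of the direct approach visible: every stage works on the output of the previous one with no shared sentence structure, so an error in constituent identification propagates unchanged to the final output.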
2.4.1.2 INTERLINGUA APPROACH
The interlingua approach to machine translation aims at transforming the texts in the source language into a common representation applicable to many languages, from which the text is then translated into the target language. The interlingua approach sees machine translation as a two-stage process:
1. Analysing and transforming the source language texts into a common language-independent representation.
2. Generating the text in the target language from the common language-independent form.

TABLE 2.2 AN EXAMPLE TO ILLUSTRATE THE INTERLINGUA REPRESENTATION

Predicate     Reach
Agent         Boy (Number: Singular)
Theme         Hospital (Number: Singular)
Instrument    Ambulance (Number: Singular)
Tense         FUTURE
After
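
As an illustration, the predicate frame of Table 2.2 could be held in a small language-independent data structure from which a generator for any target language starts. The sketch below is hypothetical: the class names, field names and the `gloss_sov` helper are assumptions introduced here, and the helper only emits a verb-final English gloss rather than real target-language text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    concept: str
    number: str = "Singular"


@dataclass(frozen=True)
class Frame:
    predicate: str
    agent: Entity
    theme: Entity
    instrument: Optional[Entity]
    tense: str


# The complete rows of Table 2.2 as a frame.
frame = Frame(
    predicate="Reach",
    agent=Entity("Boy"),
    theme=Entity("Hospital"),
    instrument=Entity("Ambulance"),
    tense="FUTURE",
)


def gloss_sov(f: Frame) -> List[str]:
    """Linearise the frame verb-finally, as a generator for an SOV language might."""
    parts = [f.agent.concept]
    if f.instrument is not None:
        parts.append(f"by-{f.instrument.concept}")
    parts.append(f.theme.concept)
    parts.append(f"{f.predicate}.{f.tense}")
    return parts


print(gloss_sov(frame))
```

Because the frame carries roles rather than word order, the same object could feed generators for several target languages, which is the main attraction of the interlingua design.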

2.4.1.3 TRANSFER APPROACH

The transfer model involves three stages: analysis, transfer, and generation. In the analysis stage, the source language sentence is parsed, and the sentence structure and the constituents of the sentence are identified. In the transfer stage, transformations are applied to the source language parse tree to convert the structure to that of the target language. The generation stage translates the words and expresses the tense, number, gender etc. in the target language.

TABLE 2.3 AN EXAMPLE TO ILLUSTRATE THE TRANSFER APPROACH

Input Sentence         He will come to school in bus
Analysis               <he> <will come> <to school> <in bus>
Transfer               <he> <in bus> <to school> <will come>
Generation (Output)    అతను బస్ుు లో పాఠశాల వస్ాాయి

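
The transfer step of Table 2.3 can be sketched as a single structural rule over labelled chunks. This is a minimal illustration under stated assumptions: the chunk labels (`NP`, `V`, `PP`) and the rule itself (keep everything before the verb, reverse everything after it, put the verb last) are introduced here because they reproduce the reorderings in Tables 2.1 and 2.3; they are not the reordering rules of the system described later in this thesis.

```python
def transfer_svo_to_sov(chunks):
    """Reorder (label, text) chunks from verb-medial to verb-final order.

    Assumes exactly one chunk labelled 'V'. Chunks before the verb keep their
    order, chunks after the verb are reversed, and the verb is moved to the end.
    """
    verb_at = next(i for i, (label, _) in enumerate(chunks) if label == "V")
    before = chunks[:verb_at]
    verb = chunks[verb_at]
    after = chunks[verb_at + 1:]
    return before + after[::-1] + [verb]


# Analysis row of Table 2.3.
analysis = [("NP", "he"), ("V", "will come"), ("PP", "to school"), ("PP", "in bus")]

transferred = transfer_svo_to_sov(analysis)
print(" ".join(f"<{text}>" for _, text in transferred))
```

The rule operates on structure only; translating the words and expressing tense, number and gender is left to the generation stage.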
RELATED WORKS
The rule based approach is widely used in developing machine translation systems for Indian languages.
2.4.2 NON-LINGUISTIC APPROACHES
The non-linguistic approaches are those which do not explicitly require any linguistic knowledge to translate texts in the source language to the target language. The only resource required by this type of approach is data: either dictionaries, for the dictionary based approach, or bilingual and monolingual corpora, for the empirical or corpus based approaches.
2.4.2.1 DICTIONARY BASED APPROACH
The dictionary based approach to machine translation uses a dictionary for the language pair to translate the texts in the source language to the target language. In this approach, translation is done at the word level. The dictionary lookup can be preceded by pre-processing stages that analyse the morphological information and lemmatize the word to be retrieved from the dictionary. This kind of approach can be used to translate the phrases in a sentence but has been found to be least useful in translating a full sentence. It is very useful in accelerating human translation, by providing meaningful word translations and limiting the work of humans to correcting the syntax and grammar of the sentence.
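
A word-level dictionary lookup with a lemmatizing pre-processing step can be sketched as follows. The lemma table, the dictionary entries and the placeholder target tokens (`TE_...`) are hypothetical and only serve to show the mechanism; unknown words are flagged rather than guessed, so that a human translator can fill them in.

```python
# Hypothetical lemmatiser table: inflected form -> dictionary headword.
LEMMAS = {"books": "book", "reads": "read"}

# Hypothetical bilingual dictionary; TE_* tokens stand for Telugu entries.
DICTIONARY = {"she": "TE_she", "read": "TE_read", "book": "TE_book"}


def translate_words(sentence):
    """Translate word by word, keeping the source word order."""
    out = []
    for token in sentence.lower().split():
        lemma = LEMMAS.get(token, token)                  # pre-processing
        out.append(DICTIONARY.get(lemma, f"<{token}?>"))  # lookup, flag unknowns
    return out


print(translate_words("She reads old books"))
```

The output keeps the source word order and drops all inflection, which is why this approach helps a human translator with vocabulary but is of little use for translating full sentences.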

2.4.2.2 EMPIRICAL OR CORPUS BASED APPROACHES

The corpus based approaches do not require any explicit linguistic knowledge to translate a sentence. Instead, a bilingual corpus of the language pair and a monolingual corpus of the target language are required to train the system. This approach has attracted a lot of interest world-wide from the late 1980s until now.

2.4.2.2.1 EXAMPLE BASED APPROACH
This approach to machine translation is a technique based mainly on how human beings interpret and solve problems. Humans normally split a problem into sub-problems, solve each sub-problem by recalling how they solved similar problems in the past, and integrate the partial solutions to solve the problem as a whole. This approach needs a huge bilingual corpus of the language pair between which translation has to be performed.

2.4.2.2.2 STATISTICAL BASED APPROACH
The statistical approach to machine translation generates translations using statistical methods whose parameters are derived by analysing bilingual corpora. Even though designing a statistical system for a particular language pair is a rapid process, the effort lies in creating bilingual corpora for that language pair, since the corpora are the foundation of this approach. To obtain good translations from this approach, a corpus of more than two million words is needed when designing a system for a particular domain, and an even larger one when designing a general system for the language pair.
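
For reference, the classical noisy-channel formulation of statistical translation makes the role of the two corpora explicit. It is a standard textbook form and is not taken from this thesis; the symbols are introduced here for illustration, with $e$ the English source sentence and $t$ a candidate Telugu sentence.

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid e) \;=\; \arg\max_{t} P(e \mid t)\, P(t)
```

In this formulation the translation model $P(e \mid t)$ is estimated from the bilingual corpus and the language model $P(t)$ from the monolingual target-language corpus, which is why both resources are required.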
RELATED WORKS

Recently Google released an alpha version of an English to Telugu machine translation system, developed using the statistical approach [15]. An online version of the system is available. The output of the system is good for simple and frequently used sentences, and is acceptable because a huge bilingual corpus is available to its developers.

2.4.3 HYBRID APPROACH

The hybrid machine translation approach makes use of the advantages of both statistical and rule-based translation methodologies. Commercial translation systems such as Asia Online and Systran provide systems that were implemented using this approach. Hybrid machine translation approaches differ in a number of aspects:

1. Rule-based system with post-processing by the statistical approach
2. Statistical translation system with pre-processing by the rule based approach

2.5 PARSER
Parsing is the process of analyzing a text, made of a sequence of tokens, to determine its grammatical structure with respect to a given formal grammar. The two approaches for developing parsers are the top down approach and the bottom up approach. Examples of parsers include XML parsers, the Stanford parser, LL parsers and LR parsers, many of which are available as open software.
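
The top down approach can be illustrated with a very small recursive-descent parser. The grammar and lexicon below are hypothetical and cover only one example sentence; this sketch is unrelated to the Stanford parser used in this work and is meant only to show how a tree is built from the start symbol downwards.

```python
# Toy top-down (recursive-descent) parser. Hypothetical grammar:
#   S -> NP VP ;  NP -> PRON | DET N ;  VP -> AUX V NP
LEXICON = {"she": "PRON", "is": "AUX", "writing": "V", "a": "DET", "letter": "N"}


def parse(sentence):
    """Return a nested-tuple parse tree; raises on words or orders not covered."""
    tokens = [(LEXICON[w.lower()], w) for w in sentence.split()]
    tree, pos = parse_s(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing input after a complete sentence")
    return tree


def expect(tokens, pos, tag):
    """Consume one token with the given tag and return (leaf, next position)."""
    if pos < len(tokens) and tokens[pos][0] == tag:
        return (tag, tokens[pos][1]), pos + 1
    raise ValueError(f"expected {tag} at position {pos}")


def parse_s(tokens, pos):
    np, pos = parse_np(tokens, pos)
    vp, pos = parse_vp(tokens, pos)
    return ("S", np, vp), pos


def parse_np(tokens, pos):
    if pos < len(tokens) and tokens[pos][0] == "PRON":
        leaf, pos = expect(tokens, pos, "PRON")
        return ("NP", leaf), pos
    det, pos = expect(tokens, pos, "DET")
    noun, pos = expect(tokens, pos, "N")
    return ("NP", det, noun), pos


def parse_vp(tokens, pos):
    aux, pos = expect(tokens, pos, "AUX")
    verb, pos = expect(tokens, pos, "V")
    obj, pos = parse_np(tokens, pos)
    return ("VP", aux, verb, obj), pos


print(parse("She is writing a letter"))
```

Each grammar rule becomes one function, so the call structure mirrors the tree being built. A bottom up parser would instead start from the tagged words and combine them into larger constituents.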
2.6 MORPHOLOGICAL ANALYZER AND GENERATOR

Various NLP research groups have developed different methods and algorithms for morphological analysis. Some of the algorithms are language dependent and some are language independent. The methods used in morphological analysis include the following:
1. Finite State Transducer (FST)
2. Stemmer Algorithm
3. Corpus Based Approach
4. DAWG (Directed Acyclic Word Graph)
5. Paradigm Based Approach

2.6.1 FINITE STATE TRANSDUCERS

FST-based morphological analyzers and generators are widely implemented for many languages [4]. FST systems are mainly used in speech recognition and speech processing while building language models. A morphological analyzer and generator can be built in a bidirectional manner using an FST [2].

RELATED WORKS

The FST-based approach is one of the popular methods for developing morphological generators and analyzers. Using FST, morphological analyzers and generators have been developed for Tamil, Malayalam and Kannada at AMRITA-CEN.

2.6.2 STEMMER

A stemmer [6] is used for stripping affixes. It uses a set of rules containing a
list of stems and replacement rules.

E.g. writing → write + ing

For a stemmer program we have to specify all possible affixes with their
replacement rules.

E.g. ational → ate     relational → relate
     tional → tion     conditional → condition

The most widely used stemmer algorithm is the Porter algorithm. The algorithm
is available free of cost at http://www.tartarus.org/martin/PorterStemmer/.
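The replacement rules above can be sketched in a few lines of Python. This is only an illustration of suffix replacement, not the full Porter algorithm: the rule list holds just the two rules quoted in the text, and the function name `apply_rules` is an assumption made for this example.

```python
# Ordered (suffix, replacement) rules taken from the two examples in the text.
# The longer suffix is listed first so that it is tried before the shorter one.
RULES = [
    ("ational", "ate"),
    ("tional", "tion"),
]

def apply_rules(word, rules=RULES):
    """Apply the first rule whose suffix matches; otherwise return the word unchanged."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

if __name__ == "__main__":
    for w in ("relational", "conditional", "village"):
        print(w, "->", apply_rules(w))
```

A real stemmer adds many more rules and conditions on the remaining stem; the sketch only shows how a rule table drives the stripping.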
RELATED WORKS
There have been some attempts to develop stemmers for Indian languages as well. IIT
Bombay and NCST Bombay have developed a stemmer for Hindi [Manish, Anantha].

2.6.3 CORPUS BASED APPROACH

A corpus is a large collection of written text belonging to a particular language.
A raw corpus can be used for morphological analysis: the method takes the raw corpus
as input and produces a segmentation of the word forms observed in the text. Such
segmentation resembles morphological segmentation.
RELATED WORKS
Morfessor 1.0, developed at Helsinki University, is a corpus-based, language-independent morphological segmentation program. LTRC Hyderabad successfully developed a corpus-based morphological analyzer; the program combines the paradigm-based approach with the corpus-based approach.

2.6.4 DAWG (DIRECTED ACYCLIC WORD GRAPH)

A DAWG is a very efficient data structure for lexicon representation and fast string matching, with a great variety of applications. This method has been successfully implemented for the Greek language by the University of Patras, Greece. The DAWG data structure can be used for both morphological analysis and generation. The approach is language independent: it does not use any morphological rules or any other special linguistic information. The method can be tested for Indian languages also. Figure 2.2 shows an example of a DAWG. In the figure, A, B, C, U, L, T and S are the different states from one node to another.

FIGURE 2.2 AN EXAMPLE FOR DAWG
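To make the idea of a DAWG concrete, the following Python sketch builds a trie from a small word list and then merges identical sub-graphs, which is what distinguishes a DAWG from a plain trie. The word list, class name and function names are assumptions chosen for illustration; they are not taken from Figure 2.2 or from the Patras implementation.

```python
class Node:
    """A graph node: outgoing labelled edges plus an end-of-word flag."""
    def __init__(self):
        self.edges = {}
        self.final = False

def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(node, registry):
    """Bottom-up merge of equivalent nodes; returns the canonical node."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Two nodes are equivalent if they agree on finality and on their
    # (label, canonical child) pairs; canonical children are kept alive
    # by the registry, so their id() values are stable.
    key = (node.final, tuple(sorted((ch, id(c)) for ch, c in node.edges.items())))
    return registry.setdefault(key, node)

def build_dawg(words):
    registry = {}
    root = minimize(build_trie(words), registry)
    return root, len(registry)  # len(registry) = number of distinct nodes

def contains(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final

if __name__ == "__main__":
    words = ["CAT", "CATS", "CUT", "CUTS", "BUT", "BUTS"]  # hypothetical lexicon
    root, size = build_dawg(words)
    print(size, contains(root, "CUTS"), contains(root, "BAT"))
```

For this word list the trie needs 12 nodes while the merged graph needs 6, because the shared endings "T" and "TS" are stored once.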

2.6.5 PARADIGM APPROACH

A paradigm defines all the word forms of a given stem and also provides a feature structure with every word form. The paradigm-based approach is efficient for inflectionally rich languages.

This scheme, or a variant of it, has been used widely in NLP. The linguist or language expert is asked to provide different tables of word forms covering the words in a language. Each word-form table covers a set of roots, which means that the roots follow the pattern (or paradigm) implicit in the table for generating their word forms. Almost all Indian language morphological analyzers are developed using this method. Based on the paradigms, the program generates add-delete strings for analysis. The paradigm approach relies on the finding that the different types of word paradigms are based on their morphological behavior.
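The add-delete strings mentioned above can be illustrated with a small Python sketch. It derives a (delete, add) pair from one root and one of its word forms, then applies the pair to another root assumed to follow the same paradigm. The romanized forms and the function names are illustrative assumptions, not entries from an actual paradigm table.

```python
def add_delete(root, form):
    """Return (delete, add): remove `delete` from the end of the root, then append `add`."""
    i = 0
    while i < min(len(root), len(form)) and root[i] == form[i]:
        i += 1  # length of the common prefix
    return root[i:], form[i:]

def apply_add_delete(root, delete, add):
    """Generate a word form for another root of the same paradigm."""
    if delete and not root.endswith(delete):
        raise ValueError(f"{root!r} does not end in {delete!r}")
    return root[: len(root) - len(delete)] + add

if __name__ == "__main__":
    # Hypothetical model pair: root 'ammu' and an inflected form 'ammAli'.
    delete, add = add_delete("ammu", "ammAli")
    print(delete, add)
    # Applied to another root assumed to belong to the same paradigm.
    print(apply_add_delete("koTTu", delete, add))
```

Running the same pair in reverse (strip `add`, restore `delete`) gives the analysis direction, which is why one table can serve both analyzer and generator.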


RELATED WORKS

The ANUSAARAKA research group has developed a language-independent, paradigm-based morphological compiler program for Indian languages.

Words are categorized as nouns, verbs, adjectives, adverbs and postpositions. Each category is classified into certain types of paradigms based on morphophonemic behavior. For example, the noun Uru (village) belongs to a different paradigm class from Abbayi (boy), as they behave differently morphophonemically.

In this work, machine learning using SVMTool is used to implement the morphological analyzer, and the paradigm approach is used for the morphological generator.

CHAPTER 3

OVERVIEW OF TELUGU LANGUAGE

Historically, the Telugu language has also been known by the names amdhram, tenu(m)gu and gentoo [8].

3.1 DEMOGRAPHIC INFORMATION

Telugu is one of the major Scheduled Languages of India. It is the second
most popular language in India [10]. Its speakers are mainly concentrated in South
India. It is the official language of Andhra Pradesh and the second most widely spoken
language in Tamil Nadu and Karnataka. Considerable numbers of Telugu-speaking
minorities live in Maharashtra, Orissa, Madhya Pradesh and West Bengal.
Considerable numbers of Telugu speakers have also migrated to Mauritius,
South Africa and, more recently, to the U.S.A., the UK and Australia.

3.2 GENETIC AFFILIATION AND HISTORY

Telugu [10] belongs to the South-Central branch of the Dravidian family of
languages. It is the most widely spoken Dravidian language. It is the only literary
language outside the South-Dravidian branch. Its literature goes back to the 11th century
A.D. Its ancient forms are attested in inscriptions dating back to 200 A.D.
3.3 THE TELUGU SCRIPT

3.3.1 ORIGIN AND DEVELOPMENT

Telugu is written in the Telugu script, which is derived from the Ashokan Brahmi [8] used in South India circa the 2nd century A.D. The Southern Brahmi, also known as Dravidian Brahmi, gave rise to the vengi-calukyan script, also known as the Telugu-Kannada script. By the end of the 13th century A.D., the Telugu and Kannada scripts had separated. In the early Telugu-Kannada script, no orthographic distinction was made between the short mid vowels (e, o) and the long mid vowels (E, O).

3.3.2 TELUGU ALPHABET

The primary units of the Telugu [10] alphabet are syllables; therefore, it should rightly be called a syllabary and, most appropriately, a mixed alphabetic-syllabic script. Unlike in the Roman alphabet used for English, in the Telugu alphabet the correspondence between the symbols (graphemes) and sounds (phonemes) is more or less exact. However, there are some differences between the alphabet and the phonemic inventory of Telugu. The overall pattern consists of 60 symbols: 16 vowels, 3 vowel modifiers and 41 consonants.

NOTABLE FEATURES

Type of writing system: syllabic alphabet in which all consonants have an inherent vowel. Diacritics, which can appear above, below, before or after the consonant they belong to, are used to change the inherent vowel.

When they appear at the beginning of a syllable, vowels are written as independent letters.

When certain consonants occur together, special conjunct symbols are used which combine the essential parts of each letter.

Direction of writing: left to right in horizontal lines.

VOWELS

CONSONANTS

CONJUNCT CONSONANTS

VOWEL MODIFIERS

3.4 COMPUTATIONAL GRAMMAR OF TELUGU

In Telugu [9], morphology plays a crucial role not only in generating
numerous word forms from nouns and verbs but also in determining their shapes. As
heads of noun phrases, nouns carry distinct morphological inflections indicating the
various syntactic and semantic functions expressed in a proposition. Word order, unlike
in English, does not determine the syntactic relations between a noun and its governing
verb.

3.4.1 NOUNS

A noun [9] in Telugu is inflected in a complex way. Nouns in Telugu
characteristically carry markings of gender, number, person and case.

A number of nouns in Telugu change their form before the markings of
gender, number, person and case. Systematic changes occur in the base,
particularly when it is inflected for non-nominative cases such as accusative, dative,
instrumental, ablative and locative. Conventionally, the non-nominative base of a noun is
also known as the oblique base or oblique form. However, it should be noted that such a
base is neither unique nor common.
GENDER MARKING ON NOUN
Though the inflection classes are insensitive to gender distinctions, there are
distinctions of gender discernible from the morphology of agreement on verbs, adjectives,
possessives, predicate nominals, numerals and deictic categories. It is necessary to
identify four distinctions in gender: in the singular, nouns indicating

Human males, and
Other than human males;

and in the plural, nouns indicating

Humans, and
Non-humans.

This distinction is necessitated by the distribution of nouns indicating human females, which are grouped with neuter nouns in the singular but with human males in the plural. However, a number of nouns denoting human males end in -du, and a number of nouns denoting human females end in -di.

NUMBER MARKING IN TELUGU NOUNS

Telugu nouns usually occur in two numbers, singular and plural. However, only plural nouns are explicitly marked. For a large number of nouns the plural suffix is -lu, while for some nouns of the human male category the plural suffix alternant is -ru.

GENDER-NUMBER-PERSON MARKING ON NOUNS

Telugu nouns, when they function as nominal predicates, show agreement with the gender, number and person of the surface subject of the clause. Pronominalized possessive nouns (possessors) show agreement (in gender, number and person) with the nouns of possession and function as heads of possessive phrases. In these two cases nouns are marked by pronominal suffixes of the relevant gender-number-person. The person marking on nouns is, however, explicit only in the 1st and 2nd person, both singular and plural. In the 3rd person, only the number is marked explicitly and not the person.

CASES: CASE MARKERS AND POST-POSITIONS

Nouns are usually inflected for case by case markers and post-positions to indicate their semantic-syntactic function in clausal predication. The terms case markers and post-positions roughly correspond to the Type-1 and Type-2 post-positions of Krishnamurti and Gwynn. They use the term post-positions in a sense corresponding to prepositions in English. However, they make a distinction between two types of post-positions, viz. Type-1 and Type-2, based on criteria such as freedom of distribution (bound and free) and the nature of composition of post-positions (Type-1 post-positions are attached to Type-2 post-positions and not vice versa).

Telugu uses a wide variety of case markers and post-positions, and their combinations, to indicate various relations between nouns and verbs or between nouns. Case suffixes and post-positions fall into two types, viz. "grammatical" and "semantic" (locational and directional). Grammatical case suffixes are those which express grammatical case relations such as nominative, accusative, dative, instrumental, genitive, comitative, vocative and causal. The semantic cases include nouns inflected for location in time and space. Nouns attached to various combinations of adverbial nouns and case markers or post-positions express many more such relations.

3.4.2 VERBS

A verb [16] denotes the state of, or action by, a substance. A Telugu verb may be
finite or non-finite. All finite verbs and some non-finite verbs can occur, according to
the situation, before the utterance-final juncture /#/, characterized by one of the following
terminal contours: rising pitch, meaning question; level pitch; falling pitch, meaning command.
A finite verb does not occur before any of the non-final junctures. On the
morphological level, no non-finite verb contains a morpheme indicating person; this
statement should not, however, be taken to mean that all finite verbs necessarily
contain a morpheme indicating person. Since any verb, finite or non-finite, occurs
only after some marked juncture, by definition of these junctures, all verbs have
phonetic stress or prominence on their first syllable, which is invariably part of the root.

Almost every Telugu verb has a finite and a non-finite form. A finite form is
one that can stand as the main verb of a sentence and occur before a final pause (full
stop). A non-finite form cannot stand as a main verb and rarely occurs before a final
pause.
FINITE VERBS

The eight finite forms of the modern Telugu verb may be arranged in three
structural types, which are set up according to differences in the grouping of the
three substitution classes:

Stem or inflection root
Tense-mode suffix
Personal suffix(es)

The paradigms of the finite forms of a simple verbal base are given below under the three structural types: ammu (to sell), with two allomorphs: amm- before a vowel.

Type 1: stem + personal suffix

1. Imperative:
   Singular  -u       amm - u         (sell)
   Plural    -andi    ammu - andi

Type 2: stem + tense-mode suffix

2. Admonitive or abusive:
   On account of semantic restrictions, many verbs cannot occur in this mood. Only a few bases, such as kAlu (to burn), kUlu (to fall), cAvu (to die) and pagulu (to break), occur.
   E.g. nIyilli kUlu - may your house fall

3. Obligative (in all persons): -Ali
   amma - Ali         I, we, you (sg., pl.), he, she, it
Type 3: stem + tense-mode suffix + personal suffix

4. Habitual-future or non-past: -t-
   ammu - t - Anu     I shall sell
   ammu - t - Am      we shall sell
   ammu - t - Ava     you shall sell (singular)
   ammu - t - Aru     you shall sell (plural)
   ammu - t - Adu     he shall sell
   ammu - tun - di    she/it shall sell
   ammu - t - Ay      they shall sell
5. Past tense: -i-
   ammu - i - Anu*    I sold
   ammu - i - Am      we sold
   ammu - i - Ava     you sold (singular)
   ammu - i - Aru     you sold (plural)
   ammu - i - Adu     he sold
   ammu - in - di     she/it sold
   ammu - i - Aru     they sold

6. Hortative: -d-
   ammu - d - Am      let us sell, or we shall sell

7. Negative tense: -a-
   ammu - a - nu      I (do, did, and shall) not sell
   ammu - a - m       we (do, did, and shall) not sell
   ammu - a - va      you (do, did, and shall) not sell
   ammu - a - Du      he (does, did, and shall) not sell
   ammu - a - du      she/it (does, did, and shall) not sell
   ammu - a - ru      they (do, did, and shall) not sell

8. Negative imperative or prohibitive: -Ak-
   ammu - Ak - a      you (sg.) don't sell
   ammu - Ak - andi   you (pl.) don't sell
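The three structural types above differ only in which slots are filled, which a short Python sketch can make explicit. It joins the morphemes in the order stem, tense-mode suffix, personal suffix and reproduces the segmented notation used in the paradigms; it does not apply any sandhi, so it does not claim to produce surface forms. The function names are assumptions for this example, and the suffix list is copied from the negative-tense paradigm.

```python
# Personal suffixes of the negative tense (-a-), copied from paradigm 7 above.
NEGATIVE_PERSONAL = ["nu", "m", "va", "Du", "du", "ru"]

def compose(stem, tense_mode=None, personal=None):
    """Join the filled slots with '-' in the order stem, tense-mode, personal."""
    return "-".join(m for m in (stem, tense_mode, personal) if m)

def paradigm(stem, tense_mode, personal_suffixes):
    """Type 3 forms: one segmented form per personal suffix."""
    return [compose(stem, tense_mode, p) for p in personal_suffixes]

if __name__ == "__main__":
    print(compose("amm", personal="u"))          # Type 1: stem + personal suffix
    print(compose("ammu", tense_mode="Ali"))     # Type 2: stem + tense-mode suffix
    for form in paradigm("ammu", "a", NEGATIVE_PERSONAL):
        print(form)                              # Type 3: all three slots
```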
NON-FINITE VERBS

There are ten non-finite verb forms, which may be arranged into two structural types:

Unbound
Bound

Type 1:

1. Present participle    -tu     ammu-tU      selling
2. Past participle       -i      ammu-i       having sold
3. Concessive            -inA    ammu-inA     even though sold
4. Conditional           -itE    ammu-itE     if sold
5. Infinitive            -a      ammu-a       to sell
6. Negative participle   -aka    amm-aka      not selling
7. Habitual adjective    -E      amm-E        that sells
8. Past adjective        -ina    amma-ina     that sold
9. Negative adjective    -ani    ammu-ani     not selling

Type 2:

Bound present -t-: ammu-t occurs with any finite form of the verb un- (to be) and also with a few non-finite forms.

Example:
   ammu-t-unnAnu      I am selling
   ammu-t-un-nA       even selling (now)
   ammu-t-un-tE       if selling
   ammu-t-un-na       that selling

CHAPTER 4

OVERVIEW OF ENGLISH- TELUGU MACHINE TRANSLATION SYSTEM

The English to Telugu machine translation system is developed by integrating five modules, namely the parser, reordering, lexicalization, transliteration and morphological generator.

[Block diagram: English sentence (input text) → Stanford parser → Reordering → Lexicalization → Transliteration → Morphological generation → Telugu sentence (output text)]

FIGURE 4.1 GENERAL BLOCK DIAGRAM FOR ENGLISH-TELUGU MACHINE TRANSLATION

4.1 PARSER

The Stanford parser is used to generate the grammatical tree structure and part-of-speech (POS) categories for the given English sentence. The Stanford parser is a lexicalized PCFG parser. Compared with the other existing parsers it provided better results, and so it is integrated into the present system.

4.2 REORDERING

Reordering plays a vital role in overcoming the structural difference between English and Telugu. English sentences follow the Subject-Verb-Object (SVO) order, whereas Telugu follows the Subject-Object-Verb (SOV) order. To overcome this difference, reordering rules are applied at the source-language level. The set of reordering rules for Telugu has been adopted from the reordering rules developed for Tamil.
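A single SVO-to-SOV rule can be sketched in Python as follows. The sketch assumes a parse tree written as nested tuples with Penn Treebank style labels and applies only one hypothetical rule (move the verb of each VP after its complements); the actual rule set adopted from Tamil is not reproduced here.

```python
def reorder(tree):
    """Recursively move the verb of every VP to the end of that VP."""
    if isinstance(tree, str):          # a leaf word
        return tree
    label = tree[0]
    children = [reorder(child) for child in tree[1:]]
    if label == "VP":
        verbs = [c for c in children if not isinstance(c, str) and c[0].startswith("VB")]
        others = [c for c in children if c not in verbs]
        children = others + verbs
    return (label, *children)

def leaves(tree):
    """Read the words off the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

if __name__ == "__main__":
    # Hypothetical parse of "Rama ate a mango".
    tree = ("S",
            ("NP", ("NNP", "Rama")),
            ("VP", ("VBD", "ate"),
                   ("NP", ("DT", "a"), ("NN", "mango"))))
    print(" ".join(leaves(reorder(tree))))
```

The reordered English word sequence is what the later lexicalization and morphological generation steps consume.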

4.3 DICTIONARY

A well-groomed, comprehensive bilingual dictionary, specially made from the point of view of translation, is an essential component of a translation system. A prototype of one such dictionary has been created for the present English-Telugu machine translation system. The bilingual dictionary was collected from various resources such as the internet and books. At present the dictionary contains 26000 words belonging to different grammatical categories.

TABLE 4.1 DATABASE INFORMATION

4.4 TRANSLITERATION

An SVM-based English-to-Telugu transliterator is used for transliteration. Transliteration is mainly applied to words that are not available in the bilingual dictionary.
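The interplay between dictionary lookup and transliteration can be sketched as a lookup with a fallback. In the Python sketch below the two dictionary entries use glosses that appear elsewhere in this thesis (village, boy); the function names and the stand-in transliterator are assumptions, since the SVM-based transliterator itself is not shown.

```python
def lexicalize(words, dictionary, transliterate):
    """Replace each word by its dictionary entry, or transliterate it if it is missing."""
    result = []
    for word in words:
        key = word.lower()
        if key in dictionary:
            result.append(dictionary[key])
        else:
            result.append(transliterate(word))
    return result

def placeholder_transliterate(word):
    """Stand-in for the SVM-based transliterator: only marks the word."""
    return f"<translit:{word}>"

if __name__ == "__main__":
    # Tiny hypothetical dictionary with romanized Telugu entries.
    dictionary = {"village": "Uru", "boy": "abbAyi"}
    print(lexicalize(["boy", "Coimbatore", "village"], dictionary, placeholder_transliterate))
```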

4.5 MORPHOLOGICAL ANALYZER

4.5.1 INTRODUCTION

A morphological analyzer takes a word as input and produces the analysis of the word as output. Here the morphological analyzer is a module in which the input is a Telugu word and the output is the analysis of the given Telugu word.

FIGURE 4.2 EXAMPLE TO ILLUSTRATE MORPHOLOGICAL ANALYZER

Before explaining the module, let us first look at the inflections that are to be considered. Telugu verbs inflect for tense, person, gender and number. Nouns inflect for plural, oblique, case and postpositions. The structure of the verbal complex is unique, and capturing this complexity in a machine-analyzable and generatable format is a challenging task. Inflections of Telugu verbs include finite, infinite, adjectival, adverbial and conditional markers. The verbs are classified into a certain number of paradigms based on their inflections. For computational purposes we have 37 paradigms of verbs, each with 160 inflections.

Sixty-seven paradigms are identified for Telugu nouns. Each paradigm has 117 sets of inflected forms. Based on the nature of the inflections, the root words are classified into groups. A corpus with all the morphological information has been prepared, so the machine by itself captures the morphological rules. Morphological analysis of nouns is less complex than that of verbs. The detailed list of paradigms and the possible inflections of the verbs and nouns is given in the Appendix.

A Support Vector Machine (SVM) is used for the classification task. The tool has three modules [13]: 1. SVMTlearn, 2. SVMTagger and 3. SVMTeval. SVMTlearn is used for training the system with the manually created corpus. SVMTagger is used for tagging a sequence of words using the previously learned SVM model. SVMTeval is used for evaluating the final output.

4.5.2 DATA CREATION FOR SUPERVISED LEARNING

1. The first step involves data creation (corpus development) for the morphological analyzer and classifying the verbs and nouns into paradigm types. Each root word inflects for different grammatical features, but the nature of these inflections is the same within each paradigm type. Verbs inflect for grammatical features such as tense, person, number, gender, non-finiteness, infiniteness, conditional negation, emphasis and interrogation. Nouns inflect for plural number, and postpositions may follow the case marker immediately or after a space. Figure 4.3 illustrates the formation of paradigms.

FIGURE 4.3 FORMATION OF PARADIGM

2. The second step is to collect the list of words which fall under the paradigms of verbs and nouns. Table 4.2 illustrates some of the words and their inflections under the paradigm ADu.

TABLE 4.2 GROUPING OF WORDS IN 'ADU' PARADIGM

PARADIGM 1: ADU

LIST OF WORDS     INFLECTIONS
1. ATADU          1. tunnAnu
2. IdADu          2. tunnAmu
3. KoniyADu       3. Anu
4. koTTADu        4. Amu
...               5. tAnu
                  ...

3. The third step is pre-processing the corpus for the morphological analyzer [12]. The steps involved in pre-processing are shown in Figure 4.4.

FIGURE 4.4 STEPS INVOLVED IN PRE-PROCESSING DATA FOR SVM MODEL

The pre-processing steps are Romanization, segmentation, alignment and mapping, and mismatch handling.

ROMANIZATION: The set of most commonly used noun and verb forms is generated manually for the input structure, and the output structure is developed similarly. These data are converted to Romanized form using the Unicode-to-Roman mapping file.

SEGMENTATION: After Romanization, every word in the corpus is segmented based on the Telugu graphemes, and each grapheme, which is syllabic, is further segmented into consonants and vowels. A consonant is marked with "C" and a vowel with "V". This is called the C-V (Consonant-Vowel) representation. Morpheme boundaries (the end of each morpheme) are indicated by the "*" symbol in the output data.
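The C-V representation can be sketched in Python as below. The tagging rule is inferred from the sample sequences given in Section 4.5.2, not stated in the text: a consonant directly followed by a vowel is tagged "-C", a vowel directly after a consonant is tagged "-V", and a word-initial vowel or a consonant with no following vowel is left untagged. The sketch also assumes a romanization with one character per sound; the vowel set and function name are assumptions.

```python
VOWELS = set("aAiIuUeEoO")  # assumed single-character romanized vowels

def cv_segments(word):
    """Split a romanized word into segments with C/V tags."""
    segments = []
    for i, ch in enumerate(word):
        if ch in VOWELS:
            after_consonant = i > 0 and word[i - 1] not in VOWELS
            segments.append(ch + "-V" if after_consonant else ch)
        else:
            before_vowel = i + 1 < len(word) and word[i + 1] in VOWELS
            segments.append(ch + "-C" if before_vowel else ch)
    return segments

if __name__ == "__main__":
    for word in ("lEstunnAnu", "ADanu"):
        print("|".join(cv_segments(word)) + "|")
```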

ALIGNMENT AND MAPPING: The segmented syllables are aligned vertically as shown in Table 4.3. The input segments are mapped one-to-one onto the labeled output segments. The first column represents the input data in C-V representation and the second represents the output data labels. "*" indicates the morpheme boundaries.

TABLE 4.3 SAMPLE INPUT FOR SVM MODEL

MISMATCHING: Mismatching is the key problem in mapping between the input and output data. It occurs in two cases: the number of input units is either larger or smaller than the number of output units. The problem is solved by inserting the null symbol "$" into the output data or by combining two output units, based on the morpho-syntactic rules, so that each input segment is mapped to exactly one output segment. After mapping, the machine learning tool is used for training on the data. This type of problem sometimes occurs for nouns also.

Case 1:

Input sequence:
l-C|E-V|s|t-C|u-V|n|n-C|A-V|n-C|u-V|
(10 segments)

Output sequence (mismatched):
l|E*|t|u|n|n|A*|n|u*|
(9 segments)

Corrected output sequence:
l|E*|$|t|u|n|n|A*|n|u*|
(10 segments)

In Case 1 the input sequence has more segments than the output sequence. The Telugu verb lEstunnAnu has 10 segments in the input sequence but only 9 in the output. The "s" in the input sequence becomes null due to a morpho-syntactic rule, so there is no output segment to map it to. For this reason, in the training data "s" is mapped to the "$" symbol ($ indicates null). The number of input units is now equal to the number of output units, as shown in the corrected output sequence.
Case 2:

(A)
Input sequence:
A|D-C|a-V|n-C|u-V|
(5 segments)
Output sequence (mismatched):
A|D|u*|a*|n|u*|
(6 segments)
Corrected output sequence:
A|Du*|a*|n|u*|
(5 segments)

(B)
Input sequence:
A|v-C|A-V|m-C|e-V|
(5 segments)
Output sequence (mismatched):
A|v|u*|A|m|e*|
(6 segments)
Corrected output sequence:
A|vu*|A|m|e*|
(5 segments)

In Case 2 the input sequence has fewer units than the output. (A) and (B) are examples of Case 2 for a verb and a noun respectively. The Telugu verb ADanu has 5 units in the input sequence but 6 units in the output. Due to a morpho-syntactic change, the unit "D-C" in the input corresponds to two output segments, "D" and "u*". For this reason, in training "D-C" is mapped to the combined label "Du*", as shown in the corrected output sequence. The input and output sequences now have an equal number of units, and the mismatch is resolved. The same happens for nouns, as shown in (B).

There are some cases in which Case 1 and Case 2 occur together. Such mismatches are solved by applying the rules of both cases. An example with the Telugu noun Urikeduru is shown below.

Input sequence:
U|r-C|i-V|K-C|e-V|d-C|u-V|r-C|u-V|
(9 segments)
Output sequence (mismatched):
U|r|u*|i*|K|e|d|u|r|u*|
(10 segments)
Corrected output sequence:
U|ru*|i*|$|e|d|u|r|u*|
(9 segments)
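The purpose of the corrections is that every input segment receives exactly one output label. A minimal Python sketch of that check, and of how morphemes can be read back from a label sequence, is given below. The function names are assumptions; the sequences are the Case 1 data for lEstunnAnu.

```python
def segments(sequence):
    """Split a '|'-delimited sequence into its segments."""
    return [s.strip() for s in sequence.strip().strip("|").split("|")]

def align(input_seq, output_seq):
    """Pair input segments with output labels; fail if the counts differ."""
    inp, out = segments(input_seq), segments(output_seq)
    if len(inp) != len(out):
        raise ValueError(f"mismatch: {len(inp)} input vs {len(out)} output segments")
    return list(zip(inp, out))

def morphemes(output_seq):
    """Drop null labels ('$') and cut the label string at morpheme boundaries ('*')."""
    text = "".join(label for label in segments(output_seq) if label != "$")
    return [m for m in text.split("*") if m]

if __name__ == "__main__":
    inp = "l-C|E-V|s|t-C|u-V|n|n-C|A-V|n-C|u-V|"
    corrected = "l|E*|$|t|u|n|n|A*|n|u*|"
    for pair in align(inp, corrected):
        print(pair)
    print(morphemes(corrected))
```

Each (input, label) pair is one training instance in the two-column format described for the SVM model.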
4.5.3 IMPLEMENTATION OF MORPHOLOGICAL ANALYZER MODULE

Using a machine learning approach, a morphological analyzer for Telugu is developed. Separate engines are developed for nouns and verbs. Morphological analysis is redefined as a classification task using machine learning. Three phases are involved in the morphological analyzer:

1. Pre-processing.
2. Segmentation of morphemes.
3. Identifying morphemes.

FIGURE 4.5 SVM MODEL FOR MORPHOLOGICAL ANALYZER

Figure 4.5 gives an outline of the morphological analyzer system. In this machine learning approach, two training modules are created for the morphological analyzer, referred to as module-I and module-II. In the first module the system is trained using the sequence of input characters and their corresponding output labels; this module is used for identifying morpheme boundaries. For example, the noun form abbAYilu (boys) has two morphemes, 'abbAYi' (boy) and 'lu' (plural), and the boundaries between them are learned in module-I. The system is similarly trained with a large corpus. In module-II, the sequence of morphemes and their grammatical categories are used for training, so that a grammatical class is assigned to each morpheme. For example, for abbAYilu two grammatical categories are assigned: abbAYi as root and lu as plural suffix. This morpheme information is trained in module-II. Figure 4.6 depicts the training by SVM module-I and module-II.

PRE-PROCESSING: In pre-processing, the given word is first Romanized. The Romanized word is then segmented into syllables according to the Telugu grapheme segmentation. These segmented syllables are further split for the C-V representation.

SEGMENTATION OF MORPHEMES: Pre-processed words are segmented into morphemes according to the morpheme boundaries. The input sequence is given to training module-I, which predicts an output label for each input segment.

IDENTIFYING MORPHEMES: The segmented morphemes are given to training module-II, which predicts the grammatical category of each segmented morpheme.

FIGURE 4.6 AN ILLUSTRATION FOR TRAINING MODULE 1 AND 2 IN SVM

The system is trained on the word 'abbAYilu'. When the system comes across a similar word such as 'AvUlu', the SVM modules give the correct morphological interpretation.

4.6 MORPHOLOGICAL GENERATOR

4.6.1 INTRODUCTION

The morphological generator is developed using a data-driven approach. In this approach three different modules are developed. The first module takes the lemma and POS category as input and gives the lemma's paradigm number and the word's stem as output. The second module takes the morpho-lexical information as input and gives its index number as output. In the third module, a suffix table is used to generate the word form with the information from the first two modules.

4.6.2 MORPHOLOGICAL GENERATOR FOR TELUGU

Different methods are available for morphological generation. The most familiar is the rule-based morphological generator. The rule-based approach needs linguistic knowledge, as it requires morpho-phonemic rules and a morpheme dictionary. In the present approach, rules and dictionaries are not needed; it requires only a suffix table and code for paradigm classification. The information given as input to the morphological generator is 1. lemma, 2. word_class and 3. morpho-lexical information. The lemma specifies the word whose form is to be generated, the word class specifies the grammatical category and the morpho-lexical information specifies the type of inflection. The input to the morphological generator is given in the form lemma + word_class + morpho-lexical information. The morpho-lexical information is extracted from the morphological analyzer tool for Telugu. An example of the morphological generator system is given below.

Adu + verb + past + 3SM → Adadu
ప్లే + verb + past + 3SM → ఆడాడు

3SM = Third Person Singular Male.
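The three-module design can be sketched with three small lookup tables in Python. All table contents below are invented for illustration: the paradigm number, the stem 'AD', the index 0 and the suffix 'ADu' are assumptions chosen so that the example mirrors the ADu + verb + past + 3SM case above; they are not taken from the actual suffix tables.

```python
# Module 1 (hypothetical data): (lemma, word class) -> (paradigm number, stem)
PARADIGMS = {("ADu", "verb"): (1, "AD")}

# Module 2 (hypothetical data): morpho-lexical information -> index number
MORPH_INDEX = {"past+3SM": 0}

# Module 3 (hypothetical data): paradigm number -> list of suffixes by index
SUFFIX_TABLE = {1: ["ADu"]}

def generate(lemma, word_class, morph_info):
    """Look up paradigm and stem, then the suffix index, then join stem and suffix."""
    paradigm, stem = PARADIGMS[(lemma, word_class)]
    index = MORPH_INDEX[morph_info]
    return stem + SUFFIX_TABLE[paradigm][index]

if __name__ == "__main__":
    print(generate("ADu", "verb", "past+3SM"))
```

Because generation reduces to table lookups, adding a new word only requires assigning it to an existing paradigm; no new rule has to be written.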
4.6.3 DIFFICULTIES IN MORPHOLOGICAL GENERATION FOR TELUGU
Developing a morphological generator is a tedious job, because every word in
Telugu has multiple inflections. The inflections include auxiliaries, clitics,
adjectival, adverbial, finite, non-finite and conditional forms of verbs. The
number of inflected forms varies from word to word. To solve this problem, Telugu
verbs are classified based on tense markers and inflection. Verbs are classified
into thirty-six paradigms, which are listed in Table 4.4.

TABLE 4.4 VERB PARADIGM
ADu
aruvu
avvu
Cavu
Ceppu
Ceyi
Cudu
Cupimcu
eduvu
Ivvu
Kaavu
Kadulcu
Kalu
Kaluvu
Konu
Koyyi
Kudurcu
Kurco
Kuriyu
kuTTi
Lee
moopu
Padu
Pannu
pettu
Piluvu
Pogudu
Poo
Puyyi
Rayyi
Tannu
Tee
Tiyuu
Umdu
valayu
vellu
Nouns are classified into sixty-five paradigms, which are listed in Table 4.5.
TABLE 4.5 NOUN PARADIGM
Abbayi
Baludu
Bandi
Bendu
Bonu
Buddi
Cenu
Cillu
Dari
Enimidi
Gadi
Goru
Goyyi
Guddu
Gudi
Gumdu
Guudu
Illu
Jamtuvu
Kalcar
Kalu
Kannu
Kilu
Kota
Koti
koTTu
Kotu
Kundelu
Mamdi
Manishi
Medadu
Menu
Meku
Metuku
Nokaru
Nuru
Okati
Palu
Pamdiri
Pandem
Pani
Papam
Pelli
Pennu
Pette
Pillavadu
Pimdi
Puli
Pustakam
Putti
Puvu
Raatri
Rayyi
Rani
Remdu
Riksha
Saari
Samdadi
Snehitudu
Taragati
Tennu
Tiragali
Uru
Velu
veyyi
4.6.4 FORMATION OF INFLECTIONAL TABLE
The initial work is the collection of Telugu words. Telugu words are collected
manually from books and from the internet using an information retrieval process.
The collected words are classified into separate groups, formed on the basis of
similarity between the words. For example, the root word 'ADu' inflects as
Adikadu, Adanu, AdtunnAnu, etc. All these inflected words are tabulated. Paradigm
and inflection tables are formed from the collected data, separately for nouns
and verbs. There are 36 paradigms for verbs and 65 paradigms for nouns. The most
frequently used morpho-lexical forms of verbs and nouns are selected. The
morpho-lexical forms are created in a fixed order that is followed for all the
paradigms, and the morpho-lexical information list is created from these forms.
In the inflection table, each row corresponds to a morpho-lexical information
entry and each column to a paradigm number. The inflection table for verbs is
given in Table 4.6.

TABLE 4.6 MORPHO-LEXICAL FORMS

        P-1        P-2         P-3        P-4        P-5
ML-1    u          vu          pu         nu         yi
ML-2    utunnAnu   ustunnAnu   utunnAnu   uTunAnnu   ustunnAnu
ML-3    utunnAmu   stunnAmu    tunnAmu    TunAmu     stunnAmu
ML-4    Anu        sAnu        pAnu       nnanu      sAnu
ML-5    Amu        sAmu        pAmu       nnamu      sAmu
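
As a rough sketch of how such an inflection table could be held in memory, the entries of Table 4.6 can be stored row by row and addressed by a morpho-lexical row label and a paradigm column label. The names `PARADIGMS`, `SUFFIX_ROWS` and `lookup_suffix` are illustrative assumptions, not identifiers from the thesis system.

```python
# Column labels of Table 4.6 (paradigm numbers).
PARADIGMS = ["P-1", "P-2", "P-3", "P-4", "P-5"]

# Rows of Table 4.6: morpho-lexical index -> suffix for each paradigm column.
SUFFIX_ROWS = {
    "ML-1": ["u", "vu", "pu", "nu", "yi"],
    "ML-2": ["utunnAnu", "ustunnAnu", "utunnAnu", "uTunAnnu", "ustunnAnu"],
    "ML-3": ["utunnAmu", "stunnAmu", "tunnAmu", "TunAmu", "stunnAmu"],
    "ML-4": ["Anu", "sAnu", "pAnu", "nnanu", "sAnu"],
    "ML-5": ["Amu", "sAmu", "pAmu", "nnamu", "sAmu"],
}


def lookup_suffix(ml_index: str, paradigm: str) -> str:
    """Return the suffix stored at (row, column) of the inflection table."""
    column = PARADIGMS.index(paradigm)   # ValueError for an unknown paradigm
    return SUFFIX_ROWS[ml_index][column]  # KeyError for an unknown row


print(lookup_suffix("ML-2", "P-1"))
```

Storing the table row by row mirrors the printed layout, so a new morpho-lexical form is added as one more row and a new paradigm as one more entry in every row.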

4.6.5 METHODOLOGY

FIGURE 4.7 OVERVIEW OF MORPHOLOGICAL GENERATOR SYSTEM

The block diagram of the Telugu morphological generator is shown in Figure 4.7.
The implementation of the present system differs from that of other
morphological generators. The information given to the morphological generator
is the lemma or root word, the word class and the morpho-lexical information.
The lemma or root word with its POS tag information is romanized. For the
romanized root word, the paradigm number has to be found. The paradigm number
corresponds to the column index of the inflection table. The morpho-lexical
information of the required word class is given by the user as input. From the
morpho-lexical information list, the index number of the corresponding input is
identified; this corresponds to the row index. The row and column index numbers
thus obtained are sent to the noun/verb suffix table. The input word class
determines which suffix table (noun or verb) is selected. Stemming is applied to
the root word. The suffix selected from the inflection table is concatenated
with the stemmed root word.
The above process is explained with an example.
STEP 1
Let us consider that the input to the system is ఆడు (ADu) + verb + Present Tense.
1. ఆడు is the lemma.
2. Verb is the word_class.
3. Present Tense is the morpho-lexical information.
STEP 2
ఆడు is romanized and we get the output ADu.
STEP 3
The romanized ADu is given as input to the verb paradigm table and we get as
output the paradigm number of ADu, which is 1. This is the column index for
Table 4.6 (Morpho-Lexical forms).

STEP 4

The lemma 'ADu' is sent to the stemming process and the output is 'AD'.

STEP 5

With the morpho-lexical information we have to find the morpho-lexical index
number. In this case, for the present tense, it is ML-2. This is the row index
for Table 4.6 (Morpho-Lexical forms).

STEP 6

Now, with the help of the row index and the column index, we can find the
morpho-lexical information, which is 'utunnAnu'.

STEP 7

Now we have to concatenate the stem 'AD' and the morpho-lexical information
'utunnAnu' to produce the output ADutunnAnu.
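
The seven steps can be sketched end to end as follows. This is a minimal illustration, not the thesis implementation: it assumes the lemma is already romanized (Step 2 is skipped), it hard-codes only the entries needed for the ADu example, and it takes the suffix 'utunnAnu' from row ML-2, column P-1, where Table 4.6 lists it. The stemming rule (strip the paradigm's final 'u') and all names in the code are assumptions made for the example.

```python
# Step 3: verb paradigm table, lemma -> paradigm number (column index).
VERB_PARADIGM_NUMBER = {"ADu": 1}

# Step 5: morpho-lexical information list, label -> row index.
MORPHO_LEXICAL_INDEX = {"Present Tense": 2}

# Step 6: verb suffix table, (row, column) -> suffix, taken from Table 4.6.
VERB_SUFFIX_TABLE = {(2, 1): "utunnAnu"}

# Step 4 (assumption): the ending removed from the lemma in each paradigm.
STEM_ENDING = {1: "u"}


def stem(lemma: str, paradigm: int) -> str:
    """Strip the paradigm-specific ending from the lemma."""
    ending = STEM_ENDING[paradigm]
    if not lemma.endswith(ending):
        raise ValueError(f"{lemma!r} does not end with {ending!r}")
    return lemma[: -len(ending)]


def generate(lemma: str, word_class: str, info: str) -> str:
    """Generate an inflected verb form by table lookup and concatenation."""
    if word_class != "verb":
        raise NotImplementedError("only the verb suffix table is sketched here")
    column = VERB_PARADIGM_NUMBER[lemma]       # Step 3: column index
    row = MORPHO_LEXICAL_INDEX[info]           # Step 5: row index
    suffix = VERB_SUFFIX_TABLE[(row, column)]  # Step 6: suffix lookup
    return stem(lemma, column) + suffix        # Steps 4 and 7


print(generate("ADu", "verb", "Present Tense"))
```

Because generation reduces to two dictionary lookups and one concatenation, covering a new word only requires adding it to the paradigm table, provided it falls under an existing paradigm.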

The working of the English to Telugu machine translation system is explained
with a simple example.

STEP 1

Consider the input sentence 'She is writing a letter'.

STEP 2

The input sentence is given to the parser to get the grammatical tree structure
and the Parts Of Speech categories. The grammatical tree structure is shown in
Figure 4.8.

FIGURE 4.8 GRAMMATICAL TREE STRUCTURE

STEP 3

The reordering rule is applied to the English sentence.

FIGURE 4.9 REORDERING OF SHE IS WRITING A LETTER

STEP 4

For the given English words, the equivalent Telugu words are found in the
bilingual dictionary.

FIGURE 4.10 LEXICALIZATION

STEP 5

The next step is morphological generation for the verb.

VERB MORPHOLOGICAL GENERATION

Input: vrAYU (write) + V + present + 3SF

Output: vrAstundi

STEP 6

FINAL OUTPUT

English: She is writing a letter
Transliteration: Ame oka aksharamu vrAstundi
Telugu: ఆమె ఒక అక్షరము రాస్తా ఉంది
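
The translation steps for this sentence can be sketched as a toy pipeline. Everything below is an illustrative assumption rather than the thesis code: the parse is written by hand instead of coming from a parser, the reordering rule is reduced to moving the verb group to the end of the clause, the bilingual dictionary holds only the three words needed, and the morphological generator is replaced by a one-entry lookup that returns the form given in Step 5.

```python
# Step 2 (assumption): hand-built parse standing in for the parser output.
PARSE = {"subject": ["She"], "verb": ["is", "writing"], "object": ["a", "letter"]}

# Step 4: toy bilingual dictionary (romanized Telugu).
BILINGUAL_DICT = {"she": "Ame", "a": "oka", "letter": "aksharamu"}

# Toy analysis of the English verb group: words -> (Telugu lemma, tense).
VERB_GROUP = {("is", "writing"): ("vrAYU", "present")}

# Agreement features contributed by the subject.
SUBJECT_AGREEMENT = {"she": "3SF"}

# Step 5: stub for the verb morphological generator (one memorised form).
VERB_FORMS = {("vrAYU", "present", "3SF"): "vrAstundi"}


def reorder(parse):
    """Step 3: place the verb group after the object."""
    return [(role, parse[role]) for role in ("subject", "object", "verb")]


def translate(parse):
    """Reorder, look up words, and generate the inflected verb."""
    agreement = SUBJECT_AGREEMENT[parse["subject"][0].lower()]
    output = []
    for role, words in reorder(parse):
        if role == "verb":
            lemma, tense = VERB_GROUP[tuple(w.lower() for w in words)]
            output.append(VERB_FORMS[(lemma, tense, agreement)])
        else:
            output.extend(BILINGUAL_DICT[w.lower()] for w in words)
    return " ".join(output)


result = translate(PARSE)
print(result)
```

The sketch makes the dependency between the stages visible: the verb form cannot be generated until the subject supplies its person, number and gender features.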


CHAPTER 5 RESULTS

5.1 TESTING AND RESULTS
The morphological analyzers for Telugu nouns and verbs are tested separately, and
the results are given in Tables 5.1 and 5.2.

TABLE 5.1 TESTING RESULTS OF MORPHOLOGICAL ANALYZER FOR NOUN

NUMBER OF NOUNS TESTED         150
NUMBER OF CORRECT OUTPUT        94
NUMBER OF INCORRECT OUTPUT      56
ACCURACY (%)                    62.6

TABLE 5.2 TESTING RESULTS OF MORPHOLOGICAL ANALYZER FOR VERB

NUMBER OF VERBS TESTED         200
NUMBER OF CORRECT OUTPUT       117
NUMBER OF INCORRECT OUTPUT      83
ACCURACY (%)                    58.5

5.2 DISCUSSION

The morphological analyzer is tested separately for nouns and verbs. The system
is tested with 150 nouns and 200 verbs. The accuracy of the system is 62.6
percent for nouns and 58.5 percent for verbs. Incorrect output occurs mainly for
words which do not fall under the classified paradigms.
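
The accuracy figures follow from the counts in Tables 5.1 and 5.2 as the share of correct outputs among the tested words. The short sketch below reproduces that arithmetic; the function name `accuracy_percent` is chosen here for illustration only. For nouns the exact value is about 62.67, which appears in the table as 62.6.

```python
def accuracy_percent(correct: int, tested: int) -> float:
    """Percentage of tested words for which the output was correct."""
    if tested <= 0:
        raise ValueError("the number of tested words must be positive")
    return 100.0 * correct / tested


noun_accuracy = accuracy_percent(94, 150)   # counts from Table 5.1
verb_accuracy = accuracy_percent(117, 200)  # counts from Table 5.2
print(f"{noun_accuracy:.2f} {verb_accuracy:.2f}")
```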


5.3 SCREEN SHOT OF MORPHOLOGICAL ANALYZER

Screenshots of the morphological analyzer for verb and noun are given below.

FIGURE 5.1 GUI FOR MORPHOLOGICAL ANALYZER-VERB

FIGURE 5.2 GUI FOR MORPHOLOGICAL ANALYZER-NOUN


5.4 TESTING AND RESULTS

Morphological generation for verbs and nouns is tested separately, and the
results are given in Table 5.3 and Table 5.4.

TABLE 5.3 TESTING RESULTS OF MORPHOLOGICAL GENERATOR FOR NOUN

NUMBER OF NOUNS TESTED         300
NUMBER OF CORRECT OUTPUT       174
NUMBER OF INCORRECT OUTPUT     126
ACCURACY (%)                    58

TABLE 5.4 TESTING RESULTS OF MORPHOLOGICAL GENERATOR FOR VERB

NUMBER OF VERBS TESTED         200
NUMBER OF CORRECT OUTPUT       107
NUMBER OF INCORRECT OUTPUT      93
ACCURACY (%)                    53.5
5.5 DISCUSSION
Morphological generation is tested separately for nouns and verbs. The system is
tested with 300 nouns and 200 verbs. The accuracy of the system is 58 percent for
nouns and 53.5 percent for verbs. Incorrect output occurs mainly for words which
do not fall under the classified paradigms. The accuracy of the system can be
improved by considering more special cases, clitics and negative forms.


5.6 SCREEN SHOT OF MORPHOLOGICAL GENERATOR

Screenshots of the morphological generator for verb and noun are given below.

FIGURE 5.3 GUI FOR MORPHOLOGICAL GENERATOR-VERB

FIGURE 5.4 GUI FOR MORPHOLOGICAL GENERATOR-NOUN


5.7 TESTING AND RESULTS

The system is tested with simple sentences. The output sentences are classified
into three categories: 1. Good, 2. Understandable and 3. Bad.

TABLE 5.5 TESTING RESULTS OF TRANSLATION SYSTEM

TESTING RESULTS                          NUMBER    ACCURACY (%)
NUMBER OF TESTED SENTENCE                450
NUMBER OF GOOD TRANSLATION               128       28.44
NUMBER OF UNDERSTANDABLE TRANSLATION     227       50.44
NUMBER OF BAD TRANSLATION                 95       21.11
5.8 DISCUSSION
The English to Telugu machine translation system is tested with 450 simple
sentences. The output is categorized into three types, namely good,
understandable and bad. Bad translation occurs mainly due to the following
reasons:
1. Non-availability of lexicon entries in the bilingual dictionary.
2. Incorrect reordering output (cases like exclamatory sentences, question types
and negative sentences).
3. Limited morphological inflection.
A set of tested sentences is attached as an Excel file and the output is compared
with the Google translator system. Since morphological generation is not
available in the Google translator, the outputs of our translation system are
morphologically better than Google's, so the translations in our system are more
meaningful and understandable. But the lexicon of Google is larger than that of
our translation system; therefore, lexicon wise, Google's translation system
works better. The online system is


5.9 SCREEN SHOT OF ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM

FIGURE 5.5 GUI FOR ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM

FIGURE 5.6 GUI FOR ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM


CHAPTER 6 CONCLUSION

Machine translation plays a key role in breaking the language barrier. In India
in particular, there are different states and each state has its own languages,
so it is difficult to follow a single language throughout the country. A lot of
research is needed in this field to handle the difficulties. Telugu is the second
most spoken language in India, so it is important to have a translation system
for the Telugu language.

The morphological analyzer is based on the Support Vector Machine (SVM), a new
state-of-the-art approach. We have demonstrated a new methodology for preparing
the data used for the machine learning approaches. We have not used any morpheme
dictionary; instead, our system identifies the morpheme boundaries from the
trained model. The accuracy obtained from the different machine learning tools
shows that the SVM based machine learning tool gives better results than the
other machine learning tools.

The morphological analyzer and generator have been developed with limited
linguistic knowledge resources. In the future, people who have good knowledge of
Telugu can use the system and provide an enhanced output.



PUBLICATION

INTERNATIONAL JOURNAL

[1] R. SriBadri Narayanan, Saravanan S. and Dr. Soman K.P., Amrita University,
Coimbatore, India, "Data Driven Suffix List And Concatenation Algorithm For
Telugu Morphological Generator," In Proceedings of International Journal Of
Engineering Science and Technology, vol. 3, no. 8, pp. 6712-6717, August 2011.

NATIONAL CONFERENCE

[1] Ramasamy Veerappan, R. SriBadri Narayanan, and Dr. K. P. Soman, Amrita
University, Coimbatore, India, "Translation Based Support System for Smart
Education," In Proceedings of NCILC, 2011.


APPENDIX

MARKERS

GIVEN BELOW ARE THE INFLECTIONS CONSIDERED FOR TELUGU VERBS

1. PRESENT TENSE MARKERS <PRESENT_TENSE> tunnA, TunnA, tunTE, TunTE,
Tum~m, tU, TU, to~m, To~m.
2. PAST TENSE MARKERS <PAST_TENSE> nnA, sunnA, A, sA, DA, cA, ppA, lcA,
slA, tA, LLA, TTA, ccA, kunnA, kua~m, ia~m, ccA, ia~mcA, se, de, ce, ppe,
te, ue, rce, nne, ye.
3. FUTURE TENSE MARKERS <FUTURE_TENSE> TA, ddA, A, tA, tua~m, ia~mcu,
su, u, cu, ccu, dcu.
4. CLITIC <CLITIC> vO, nO, rO, dO, lO, lA, kO, sai, si, stu, akA, nnA, lE.
5. AUXILIARY VERBS <AUX> nivvu, vaccu, valayu, pO, ua~mdu, cUdu, peTTu,
pArEyi, veyyi, avvu, mugia~mcu, cUpu, daluvu, manu, cupia~mcu, veLLu, goTTu,
beTTu, sAgu, tIru.
6. NEGATIVE MARKERS <NEG> aka, akua~mDA, akpoyinA, akapotE, a, akpotEnE,
akunnA.
7. PRONOUNS <PRONOUN> vanni, aTTua~mdi, naTTua~mdi.
8. NOUNS <NOUNS> ammA, ayyA, nakkara, annamATa, nEkkara.
9. ADJECTIVE <ADJECTIVE> anavasara~m.
10. ADVERBIAL ADJECTIVE a~mduku, a~mduvalana, a~mduna, aTuva~mTi,
aTlu, aTlugA.
11. POST POSITIONS <PP> lOga, lOpuna, dAkA, koddi, kadA, gAni, kanuka, kadu,
gUDA, kAbOlu, kAni, gAdA, annA, kUDA, mua~mdu, ni, a~mTA, a~mTE,
aMTu, mAku, baTTi, gAni, kUDa, mAllE, mari, gala, bO, lA, sariki, dagu
nua~mDu, galugu, joccu, jAlu, baDuvu, tappa, pATiki, varaku, ka~mTE.
12. IMPERATIVE SUFFIX a~mDi, lEa~mDi.
13. IMPERATIVE NEGATIVE SUFFIX aka~mDi.

GIVEN BELOW ARE THE INFLECTIONS CONSIDERED FOR TELUGU NOUNS

1. POST-POSITIONS <PP> a~mTE, O, gAni, gUDA, kAkua~mDA, gA, lEkua~mDA,
vu, ki, ni, runibaTTi, lA, lAa~mTi, aDuduna, aDugunua~mci, aDuguki, eDuTaki,
bataTa, bayaTanua~mDi, badulegA, cEta, cOTiki, cOTO, cOTOnua~mDi,
cOTinua~mci, gua~mDa, guria~mci, gADa, ka~mTE, kedurugA, kosa~m, kOraku,
malle, lO, lOgUDA, lOki, nua~mDi, lOpala, lOpali, lOpalanua~mDi, mIda, mIdaku,
mIdanua~mDi, madya, madyaki, madyalOnua~mDi, madyalOki, medalukoni,
mua~mdu, naDuma, naDumaki, ni, nua~mDi, pai, paiki, painua~mDi, pakka,
painua~mdi, pakkaku, pakkalO, pakkanua~mDi, prakAra~m, stAnAniki, stAna~m,
stAna~mlO, stAna~mlOnua~Di, valana, vadd, vaddaku, vaddanua~mDi,
venukanua~mDi, venuka, venukaku, taravAta, taravAnua~mDi, venuka, venukaku,
taravAta, taravAtanua~mDi, tO, gUDA, tOpATu, gAka, daggara, daggaralO,
daggaraku, daggaranua~mDi, dRushTilO, yOkka, dvArA.
2. PRONOUNS <pro> Ayana, Ame, atanu, gAru, di, vi, taravAta, vADu, vAru, vaipu.
3. ADJECTIVE <Adj> ayinA, ayina.
Paradigm List
For example, Verb ఆడు have the following paradigms
ఆడేస్ాాను
ఆడేస్ుాన్ానను
ఆడేదాాం
ఆడేస్ాను
ఆడేసింది
TELUGU PARADIGM LIST – VERB
Paradigm 1
అయిాకనుక
ఆడడం
అయిాకదు
ఆడుతూ
అయిాగూడ
ఆడేయ్యాలి
Paradigm 4
ఆడకూడదు
చావాలి
ఆడేసలాకాతు
చావకూడదు
Paradigm 2
చసలాకాతు
అరవను
చచిినట్లే
అరవలేను
చచిినట్లేగా
అరవలేదు
Paradigm 5
అరవక
చెప్పక
అరిచేయ్కుండా
చెప్పకుండ
Paradigm 3
చెప్ిపన
అయిాకదా
చెప్తానన
అయిాగాతు
చెప్పకపోయిన్ా


Paradigm 6

Paradigm 11

చేసలా
కాస్ుాంట్ే
చేస్ుాంట్ే
కాయ్కపోతే
చెయ్ాకపోతే
కాస్ుాన్ాన
చేస్ుాన్ాన
కాస ై
చేస ై
కాయ్డం
Paradigm 7
Paradigm 12
చతడవదుా
కదలిను
చతడన్ేవదుా
కదలిలేను
చతడమయకు
కదలిడంలేదు
చతడకూడదు
కదలిక
చతడన్ేకూడదు
కదలికుండ
Paradigm 8
Paradigm 13
చతప్ిస్ాాను
కాలక
చతప్ిస్ుాన్ానను
కాలకుండ
చతప్ిదాాం
కాలిిన
చతప్ించాను
కాలుస్ుానన
చతప్ించింది
కాలకపోయిన్ా
Paradigm 9
Paradigm 14
ఏడుస్ాాను
కలిసింది
ఏడుస్ుాన్ానను
కలవను
ఏడుదాాం
కలవలేను
ఏడాిను
కలవడంలేదు
ఏడచింది
కలవక
Paradigm 10
Paradigm 15
ఇస్ాానననమయట్
కొన్ేస్ుాననప్పట్ికి
ఇస్తా కాతు
కొన్ేస్ుాననందుకు
ఇదతా లే
కొన్ేస్ుాననందువలన
ఇస్తా కనుక
కొన్ేస్ుాననట్లవంట్ి
ఇస్తా కదా

కొన్ేస్ుాననందున
కుడుతుననట్లేగా
Paradigm 16
Paradigm 21
కోస్ాాను
లేస్ాాను
కోస్ుాన్ానను
లేస్ుాన్ానను
కోదాాం
లేదాాం
కోస్ాను
లేచాను
కోసింది
లేచింది
Paradigm 17
Paradigm 22
కుదురచిసింది
మోప్తతాను
కుదరిను
మోప్తతున్ానను
కుదరిలేను
మోప్తదాం
కుదరిలేదు
మోపాను
కుదరిక
మోప్ింది
Paradigm 18
Paradigm 23
కూరచిను
ప్డేసలా
కూరచిలేను
ప్డేస్ుాంట్ే
కూరచిడంలేదు
ప్డేయ్కపొతే
కూరచిక
ప్డేస్ుాన్ాన
కూరచికుండ
ప్డేయి
Paradigm 19
Paradigm 24
కురిసలా
ప్ననకకదు
కురుస్ుాంట్ే
ప్ననకకదా
కురియ్కపోతే
ప్ననకననమయట్ా
కురుస్ుాన్ాన
ప్న్ేనయ్కతు
కురిస ై
ప్ననకట్లింది
Paradigm 20
Paradigm 25
కుట్ిినకొదిా
ప్రిచడమంట్ృ
కుడుతుననట్లే
ప్రచడమన్ాన
కుట్ిినచో
ప్రచడమననమయట్
కుడుతుననచో

ప్రచడమంట్ే
ప్ూయ్కపోయిన్ా
ప్రచడంకాతు
ప్ూసలా
ప్ూస్ుాంట్ే
Paradigm 26
ప్ ట్టియ్యాలి
Paradigm 31
ప్ ట్ికూడదు
రాయ్ను
ప్ ట్ేిసలాకాతు
రాయ్లేను
ప్ ట్ేిసినట్లే
రాయ్డంలేదు
ప్ ట్ేిసినట్లేగా
రాయ్క
రాయ్కుండ
Paradigm 27
ప్ిలుస్ుానన
Paradigm 32
ప్ిలవకపోయిన్ా
తననలేను
ప్ిలిసలా
తననడంలేదు
ప్ిలుస్ుాంట్ే
తననక
ప్ిలవకపోతే
తననకుండ
తన్ేనసిన
Paradigm 28
పొగుడాాను
Paradigm 33
పొగుడుాన్ానను
తేకుండ
పొగుడుదాం
తెచిిన
పొగిడాను
తేస్ుానన
పొగిడచంది
తేకపోయిన్ా
తెసలా
Paradigm 29
పోతాను
Paradigm 34
పోతున్ానను
తీస్ుాననప్పట్ి
పోదాం
తీస్ుాననప్పట్ినుంచి
పోయ్యను
తీస్ుాననప్పట్ికీ
పోయింది
తీస్ుాననందుకు
తీస్ుాననందువలన
Paradigm 30
ప్ూసిన
Paradigm 35
ప్ూస్ుానన
ఉండలేదు


ఉండక

ఉండేయ్కుండా

వలచలేదు

ఉండేసిన
Paradigm 37
ఉండేస్ుానన
వచేియ్కుండా
Paradigm 36
వచేిసిన
వలచేస్ాను
వచేిస్ుానన
వలచేసింది
రాకపోయిన్ా
వలయ్ను
వచేిస్
వలచలేను

For example, Noun ఊరు have the
బెండతను
following paradigms
బెండయిన్ా
ఊరాయ్న
బెండయిన
ఊరామె
Paradigm 5
ఊరతను
బోన్ాయిన
ఊరయిన్ా
బోన్ామె
ఊరయిన
బోనతను
బోనయిన్ా
Paradigm 1
బోనయిన
అబాా య్యయ్న
Paradigm 6
అబాాయ్యమె
బుడాాయిన
అబాా
య్తను
బుడాామె
అబాా
య్యిన్ా
బుడాతను
అబాా
య్యిన
బుడాయిన్ా
Paradigm 2
బుడాయిన
బాలుడాయిన
Paradigm 7
బాలుడామె
చేన్ాయ్న
బాలుడతను
చేన్ామె
బాలుడయిన్ా
చేనతను
బాలుడయిన
చేనయిన్ా
Paradigm 3
చేనయిన
బండాయిన
Paradigm 8
బండామె
చిలయేయిన
బండతను
చిలయేమె
బండయిన్ా
చిలేతను
బండయిన
చిలేయిన్ా
Paradigm 4
చిలేయిన
బెండాయిన
Paradigm 9
బెండామె
దారాయ్న

దారామె
గుడాాయిన
దారతను
గుడాామె
దారయిన్ా
గుడాతను
గుడాయిన్ా
దారయిన‌
గుడాయిన
Paradigm 10
Paradigm 15
ఎతుమిదాయ్న
గుడాయిన‌
ఎతుమిదామె
గుడామె
ఎతుమిదతను
గుడతను
ఎతుమిదయిన్ా
గుడయిన్ా
ఎతుమిదయిన
గుడయిన
Paradigm 11
Paradigm 16
గదాయ్న
గుండాయిన
గదామె
గుండామె
గదతను
గుండతను
గదయిన్ా
గుండయిన్ా
గదయిన‌
గుండయిన
Paradigm 12
Paradigm 17
గచరాయ్న
గూడాయిన
గచరామె
గూడామె
గచరతను
గూడతను
గచరయిన్ా
గూడయిన్ా
గచరయిన
గూడయిన
Paradigm 13
Paradigm 18
గొయ్యాయిన
ఇలయేయిన
గొయ్యామె
ఇలయేమె
గొయ్ాతను
ఇలేతను
గొయ్ాయిన్ా
ఇలేయిన్ా
గొయ్ాయిన
ఇలేయిన
Paradigm 14
Paradigm 19


జంతువాయిన

జంతువామె

జంతువ

తను

జంతువ

యిన్ా

జంతువ

యిన

Paradigm 20

కలిరాయిన

కలిరామె

కలిరతను

కలిరయిన్ా

కలిరయిన

Paradigm 21

కాలయయిన

కాలయమె

కాలతను

కాలయిన్ా

కాలయిన

Paradigm 22

కన్ానయిన

కన్ానమె

కననతను

కననయిన్ా

కననయిన

Paradigm 23

కీలయయ్న

కీలయమె

కీలతను

కీలయిన్ా

కీలయిన

Paradigm 24
కోట్ాయిన
కోట్ామె
కోట్తను
కోట్యిన్ా
కోట్యిన
Paradigm 25
కోట్ాయ్న
కోట్ామె
కోట్తను
కోట్యిన్ా
కోట్యిన
Paradigm 26
కొట్ాియ్న
కొట్ాిమె
కొట్ితను
కొట్ియిన్ా
కొట్ియిన‌
Paradigm 27
కోట్ాయిన
కోట్ామె
కోట్తను
కోట్యిన్ా
కోట్యిన
Paradigm 28
కుందేలయయ్న
కుందేలయమె
కుందేలతను
కుందేలయిన్ా
కుందేలయిన

Paradigm 29
మేకయిన‌
మందాయ్న
Paradigm 34
మందామె
మెతుకాయిన
మందతను
మెతుకామె
మందయిన్ా
మెతుకతను
మందయిన
మెతుకయిన్ా
Paradigm 30
మెతుకయిన
మతుషాయిన
Paradigm 35
మతుషామె
న్ౌకరాయ్న
మతుషతను
న్ౌకరామె
మతుషయిన్ా
న్ౌకరతను
మతుషయిన
న్ౌకరయిన్ా
Paradigm 31
న్ౌకరయిన
మెదడాయిన
Paradigm 36
మెదడామె
నతరాయ్న
మెదడతను
నతరామె
మెదడయిన్ా
నతరతను
మెదడయిన
నతరయిన్ా
Paradigm 32
నతరయిన‌
మేన్ాయిన
మేన్ామె
Paradigm 37
మేనతను
ఒకట్ాయిన
మేనయిన్ా
ఒకట్ామె
మేనయిన
ఒకట్తను
ఒకట్యిన్ా
Paradigm 33
ఒకట్యిన
మేకాయ్న
మేకామె
Paradigm 38
మేకతను
పాలయయ్న
మేకయిన్ా
పాలయమె
పాలతను

పాలయిన్ా
ప్ ళ్ాతను
పాలయిన
ప్ ళ్ాయిన్ా
ప్ ళ్ాయిన
Paradigm 39
ప్ందిరాయ్న
Paradigm 44
ప్ందిరామె
ప్ న్ానయ్న
ప్ందిరతను
ప్ న్ానమె
ప్ందిరయిన్ా
ప్ ననతను
ప్ందిరయిన
ప్ ననయిన్ా
Paradigm 40
ప్ ననయిన‌
ప్ందెమయయ్న
Paradigm 45
ప్ందెమయమె
ప్ ట్ాియ్న
ప్ందెమతను
ప్ ట్ాిమె
ప్ందయిన్ా
ప్ ట్ితను
ప్ందయిన
ప్ట్ియిన్ా
Paradigm 41
ప్ ట్ియిన
ప్న్ాయ్న
Paradigm 46
ప్న్ామె
ప్ిలేవాడాయిన
ప్నతను
ప్ిలేవాడామె
ప్నయిన్ా
ప్ిలేవాడతను
ప్నయిన
ప్ిలేవాడయిన్ా
Paradigm 42
ప్ిలేవాడయిన
పాప్మయయిన
Paradigm 47
పాప్మయమె
ప్ిండాయిన
పాప్మతను
ప్ిండామె
పాప్మయిన్ా
ప్ిండతను
పాప్మయిన
ప్ిండయిన్ా
Paradigm 43
ప్ిండయిన
ప్ ళ్ళాయిన
Paradigm 48
ప్ ళ్ళామె
ప్తలయయ్న

ప్తలయమె
రాణాయ్న
ప్తలతను
రాణామె
ప్తలయిన్ా
రాణతను
ప్తలయిన
రాణి అయిన్ా
Paradigm 49
రాణి అయిన‌
ప్తస్ాక మయయిన
Paradigm 54
ప్తస్ాకమయమె
రాయ్యయిన
ప్తస్ాకమతను
రాయ్యమె
ప్తస్ాకమయిన్ా
రాయ్తను
ప్తస్ాకమయిన
రాయ్యిన్ా
Paradigm 50
రాయ్యిన
ప్తట్ాియిన
ప్తట్ాిమె
Paradigm 55
ప్తట్ితను
రెండాయిన
ప్తట్ియిన్ా
రెండామె
ప్తట్ియిన
రెండతను
Paradigm 51
రెండయిన్ా
ప్ూలయయ్న
రెండయిన
ప్ూలయమె
Paradigm 56
ప్ూలతను
రిక్షా అయ్న
ప్ూవయిన్ా
రిక్షా ఆమె
ప్ూవయిన
రిక్షా అతను
Paradigm 52
రిక్షా అయిన్ా
రాతాాయ్న
రిక్షా అయిన
రాతాామె
Paradigm 57
రాతాతను
స్ారాయిన
రాతాయిన్ా
స్ారామె
రాతాయిన
స్ారతను
Paradigm 53
స్ారయిన్ా

స్ారయిన
తిరగలయిన్ా
తిరగలయిన
Paradigm 58
స్ందడాయిన
Paradigm 63
స్ందడామె
ఊరాయ్న
స్ందడతను
ఊరామె
స్ందడయిన్ా
ఊరతను
స్ందడయిన
ఊరయిన్ా
ఊరయిన
Paradigm 59
సలనహితుడాయిన
Paradigm 64
సలనహితుడామె
వేలయయ్న
సలనహితుడతను
వేలయమె
సలనహితుడయిన్ా
వేలతను
సలనహితుడయిన
వేలయిన్ా
వేలయిన
Paradigm 60
తరగతాయ్న
Paradigm 65
తరగతామె
వెయ్యాయిన
తరగతతను
వెయ్యామె
తరగతయిన్ా
వెయ్ాతను
తరగతయిన
వెయ్యాయిన్ా
వెయ్యాయ్
Paradigm 61
తెన్ానయ్న
తెన్ానమె
తెననతను
తెననయిన్ా

తెననయిన

Paradigm 62

తిరగలయయిన

తిరగలయమె

తిరగలతను
