
Exploiting Linguistic Knowledge to Address Representation and Sparsity

Issues in Dependency Parsing of Indian Languages

Thesis submitted in partial fulfillment


of the requirements for the degree of

Ph.D
in
Computational Linguistics

by

Riyaz Ahmad Bhat


200922001
riyaz.bhat@research.iiit.ac.in

International Institute of Information Technology


Hyderabad - 500 032, INDIA
January 2017
Copyright © Riyaz Ahmad Bhat, 2017
All Rights Reserved
To My Parents
Acknowledgments

Finally, the journey of intense enthusiasm and frustration has reached its culmination. It has been an
enormously fruitful and fulfilling experience and I thoroughly enjoyed it despite the ups and downs. I
am thankful to a lot of people who directly or indirectly helped me complete this dissertation. Here, I
would like to thank everyone who has helped and supported me during my Ph.D.
First and foremost, I would like to thank Prof. Dipti Misra Sharma for being an outstanding mentor.
I am grateful for her guidance, and especially for providing me consistent feedback while allowing me
enough freedom to grow as a researcher. It has been a great pleasure working with her. I look forward
to more collaborations with her in the future.
Besides my advisor, I would like to thank the other members of my thesis committee for their valu-
able feedback on my dissertation: Dr. Sriram Venkatapathy, Prof. Amba Kulkarni, Prof. Girish Nath
Jha, and Prof. Sudeshna Sarkar. Moreover, I would like to thank my teachers (the unsung heroes), Late
Prof. Lakshmi Bai, Prof. Vineet Chaitanya, Dr. Radhika Mamidi, Dr. Soma Paul, Prof. Aditi Mukherjee
and Dr. Manish Shrivastava for instilling in me a passion for linguistic research and language technology, for their critical feedback on my work, and for other life lessons.
I would also like to thank my colleagues who are (or were) part of the Language Technology group
here at IIIT-Hyderabad: Irshad Ahmad Bhat, Praveen Dakwale, Himani Chaudhry, Rafiya Begum, Sand-
hya Jena, Himanshu Sharma, Sambhav Jain, Naman Jain, Maaz Nomani, Aniruddha Tammewar, Karan
Singla, Rishabh Srivastava, Bhasha Agrawal, Pruthwik Mishra, Pratibha Rani, Sukhada Sharma, Juhi
Tandon, Silpa Kanneganti, Vandan Mujadia and others. Special thanks go to Praveen Dakwale, Anand
Mishra, Himani Chaudhry, Rafiya Begum, Reenu and many other graduate students for sharing the joys
and sorrows of the Ph.D journey and for sticking together during the tough periods!
I would like to thank my friends from my hometown who, every now and then, helped me take some
time off from my hectic Ph.D schedule. Thank you Aameer, Atif, Aqib, Asif, Shakir, Khursheed and
other friends for giving me some joyful moments which deflated the work pressure.
Most importantly, I would like to thank my family, especially my parents and my brothers for their
support and endless love. This adventure was only possible due to their enormous support and trust in
me. I sincerely thank them for their love and everyday prayers to God for my successful life.

Thank you very much, everyone!

Abstract

Recent trends in natural language processing (NLP) show the ever-increasing popularity of dependency-based analysis of natural language texts. Dependency representations offer simplicity, compactness and
transparent encoding of predicate-argument structure. In the last decade and a half, a number of de-
pendency treebanks have been built and various parsing algorithms have been proposed for automatic
dependency analysis across a wide range of languages. Over the years, it has been observed that morphologically rich and free word order languages, unlike fixed word order languages, are harder to parse,
regardless of the parsing technique used. On the one hand, rich morphology provides explicit cues for
parsing, while on the other hand it worsens the problem of data sparsity as it leads to high lexical diver-
sity and variation in word order. In this thesis, we aim to address this trade-off for accurate and robust
parsing of morphologically rich Indian languages. We present novel strategies to effectively represent
morphology in the parsing models and also to mitigate the effect of its trade-offs.
We propose to represent morphosyntactic information as higher-order features under the Markovian
assumption. More specifically, we use the history of a transition-based parser to extract and propagate
morphological information such as case and grammatical agreement as higher-order features for parsing
nominal nodes. Despite its benefits, rich morphology can also pose a multitude of challenges to statis-
tical parsing. The most prominent issue is related to sampling bias towards canonical structures of a
language. As current parsers are mostly trained on formal texts, even a slight deviation from canonical
word order can severely affect their performance. To overcome this bias, we propose a sampling tech-
nique to generate training instances with diverse word orders from the available canonical structures.
We show that linearly interpolated models trained on diverse views of the same data can effectively
parse both canonical and non-canonical texts. Similarly, to mitigate the effect of lexical sparsity, we
use supervised domain adaptation techniques for training parsers on lexically more diverse annotations
from augmented Hindi and Urdu treebanks. We demonstrate that a feedforward neural network-based
dependency parser trained on augmented, harmonized Hindi and Urdu data performs significantly better
than the parsing models trained separately on their individual datasets. Furthermore, we explore lexical
semantics as a viable alternative to more training data for parsing semantically rich but sparse depen-
dency annotations in Indian language treebanks. We show that lexical semantics in the form of discrete
and continuous features such as ontological categories, Brown clusters and word embeddings can play
a major role in disambiguating highly rich dependency relations.

Contents

Chapter Page

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Parsing Indian languages: Challenges and Issues . . . . . . . . . . . . . . . . . . . . 1
1.2 Goals of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Primary Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Auxiliary Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Related Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 General Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Dependency Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Parsing framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Transition-based Dependency Parsing . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Oracle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Non-Projectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Indian Language Treebanking: Grammar Formalism and Annotation Procedure . . . . . . . . 15


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Computational Pāṇinian Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 The Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1.1 Dependency Relations and Labels . . . . . . . . . . . . . . . . . . . 17
3.2.2 Annotation Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2.1 Intra-Chunk Expansion . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4 Improving Parsing of Hindi and Urdu by Modeling Syntactically Relevant Phenomena . . . . 24


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Transition-based Parsing with Rich Syntactic Features . . . . . . . . . . . . . . . . . . 25
4.2.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Modeling Syntactically Relevant Features . . . . . . . . . . . . . . . . . . . . 30
4.2.2.1 Case Marking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.2.1.1 PSD Models . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2.2 Role of Case Information in Parsing . . . . . . . . . . . . . . . . . 35
4.2.2.3 Grammatical Agreement . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2.4 Complex Predication . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.2.5 Identification of Ezafe . . . . . . . . . . . . . . . . . . . . . . . . . 44


4.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


4.3.1 Feature Ablation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.2 Comparison with the IL Feature Representation . . . . . . . . . . . . . . . . . 50
4.4 Related Work and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5 Non-Projectivity and Scrambling: Trade-off for Morphological Richness . . . . . . . . . . . 54


5.1 Non-projectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1.1 Dependency Graph and its properties . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1.1 Condition of Projectivity . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1.2 Relaxations of Projectivity . . . . . . . . . . . . . . . . . . . . . . 55
5.1.2 Evaluation of Tree Constraints on IL Treebanks . . . . . . . . . . . . . . . . . 56
5.1.3 Analysis and Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.3.1 Relative Clause Constructions . . . . . . . . . . . . . . . . . . . . . 58
5.1.3.2 Clausal Complements . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.1.3.3 Conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1.3.4 Genitive Constructions . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.3.5 Control Constructions . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.3.6 Quantified Expressions . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1.3.7 Other Finite Clauses . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1.4 Parsing Non-projective Structures . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Scrambling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6 Enhancing Data-driven Parsing of Hindi and Urdu by Leveraging their Typological Similarity 70
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2 Divergence between Hindi and Urdu . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.1 Lexical Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.1.1 Bifurcation of Urdu Lexicon . . . . . . . . . . . . . . . . . . . . . 72
6.2.1.1.1 Multinomial Naive Bayes . . . . . . . . . . . . . . . . . . 76
6.2.1.1.2 Etymological Data . . . . . . . . . . . . . . . . . . . . . 76
6.2.1.1.3 Experiments and Results . . . . . . . . . . . . . . . . . . 77
6.2.1.2 Perso-Arabic Borrowings in Urdu Text: A Quantitative Analysis . . 78
6.2.2 Syntactic Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2.2.1 Comparing Probability Distributions using Jensen-Shannon Divergence 81
6.3 Common Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.1 Hindi-Urdu Transliteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.1.1 Transliteration Pair Extraction and Character Alignment . . . . . . . 86
6.3.1.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.1.3 Extrinsic Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4 Resource Sharing and Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Related Work and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7 Adding Semantics to Data-driven Parsing of Semantically-oriented Dependency Representations 100


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.2 Effect of Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3 Lexical Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.3.1 IndoWordNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.3.1.1 Sense Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3.2 Distributional Representations . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3.2.1 Brown Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3.2.2 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.4.1 Parsing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5 Feature Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5.1 Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5.2 Non-linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

8 Data-driven Parsing of Kashmiri: Setting up Parsing Pipeline with Preliminary Experiments . 117
8.1 About Kashmiri . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.2 Setting up a Parsing Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.2.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2.2 Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.2.3 Chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
8.2.4 Dependency Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2.4.0.1 V2: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2.4.0.2 Pronominal Cliticization: . . . . . . . . . . . . . . . . . . 124
8.2.4.1 Inter-Annotator Agreement Study . . . . . . . . . . . . . . . . . . . 126
8.2.4.2 Parsing Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.2.4.2.1 Intra-chunk parsing: . . . . . . . . . . . . . . . . . . . . . 127
8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

9 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129


9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
9.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
List of Figures

Figure Page

2.1 Dependency tree of Example sentence 1. . . . . . . . . . . . . . . . . . . . . . . . . . 10


2.2 Transition sequence for Example sentence 1 based on Arc-eager algorithm. . . . . . . 12

3.1 Hierarchy of dependency labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


3.2 Dependency trees of examples (2)-(4). . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Dependency tree showing inter-chunk dependencies for Example 5. . . . . . . . . . . 20
3.4 Dependency tree showing word-level dependencies for Example 5. . . . . . . . . . . . 20
3.5 An example of transition-based chunk expansion. . . . . . . . . . . . . . . . . . . . . 21

4.1 Case marker ‘ne’ as second-order Markov neighbor for the arc (khāyā, Maryam). . . . 36
4.2 Parser configuration capturing the interaction between third-order lexical features. . . . 36
4.3 Stacking of case markers attached to their respective heads in the stack. . . . . . . . . 37
4.4 Parser configuration showing agreement features. . . . . . . . . . . . . . . . . . . . . 40
4.5 Dependency tree of Example sentence 13. . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 Dependency tree of Example sentence 14. . . . . . . . . . . . . . . . . . . . . . . . . 41
4.7 Dependency trees of Examples (19) and (20). . . . . . . . . . . . . . . . . . . . . . . 45
4.8 Impact of case and agreement features on different verb arguments and genitives. . . . 49

5.1 Parser configuration showing heads of the main and relative clauses. . . . . . . . . . . 64
5.2 Dependency parse tree for Example 29. . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Multiple argument scrambling in Example 29. Dashed edges represent scrambling. . . 66

6.1 Relative distribution of Arabic, Hindi, Persian and Urdu alphabets (consonants only). . 74
6.2 Learning curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 The plot shows percentage of Perso-Arabic words across text types of Urdu. . . . . . . 79
6.4 The plot shows the distribution of Indic and Perso-Arabic words across POS categories. 79
6.5 Relative comparison of JS divergences between eight Indian Languages. . . . . . . . . 82
6.6 Relative comparison of JS divergences between Hindi, Urdu and domains of Hindi. . . 83
6.7 Performance of different systems on vowel prediction and letter disambiguation. . . . . 89
6.8 Impact of different length ngrams on bidirectional SHMM model. . . . . . . . . . . . 89

7.1 Accuracies of each label type in Hindi and Urdu test sets. . . . . . . . . . . . . . . . . 103
7.2 Fine-grained and coarse-grained dependency labels for verb arguments and adjuncts. . 105
7.3 Sample hierarchy of categories in Hindi Wordnet. . . . . . . . . . . . . . . . . . . . . 107
7.4 Ontological nodes of chāt representing its verbal and nominal senses. . . . . . . . . . 108


7.5 Two ontologies corresponding to two nominal senses of the word kuttā. . . . . . . . . 108
7.6 Learning curves for optimal length of bit strings (number of clusters). . . . . . . . . . 114
7.7 Learning curves for optimal dimensionality of word embeddings. . . . . . . . . . . . . 114
7.8 Impact of lexical semantics on parsing of different label types in Hindi test sets. . . . . 115
7.9 Impact of lexical semantics on parsing of different label types in Urdu test sets. . . . . 115

8.1 Impact of affix-based features on tag-wise performance on our POS tagger. . . . . . . 121
8.2 Dependency tree of an example sentence from the Kashmiri treebank. . . . . . . . . . 123
8.3 Dependency tree of example sentence 35 from the Kashmiri Treebank. . . . . . . . . . 124
8.4 Intra-chunk dependency annotation of Example 39 in the Kashmiri Treebank. . . . . . 125
List of Tables

Table Page

3.1 Some major dependency relations depicted in Figure 3.1. . . . . . . . . . . . . . . . . 18


3.2 Statistics on training, testing and development sets. . . . . . . . . . . . . . . . . . . . 23

4.1 Feature template for setting the baseline for parsing Hindi and Urdu. . . . . . . . . . . 27
4.2 Feature template used in previous works on Parsing Hindi and other Indian Languages. 28
4.3 Performance of Morphological analyzers, POS taggers and chunkers on Hindi and Urdu. 28
4.4 Baseline Parsing accuracies of Hindi using the extended ZN feature template. . . . . . 29
4.5 Baseline Parsing accuracies of Urdu using the extended ZN feature template. . . . . . 29
4.6 Accuracies (LAS) for Leave-one-out and Only-one evaluation. . . . . . . . . . . . . . 30
4.7 Description of case clitics and their annotations in the Hindi and Urdu treebanks. . . . 31
4.8 Performance (in terms of accuracy) of SVM-based PSD models. . . . . . . . . . . . . 33
4.9 Leave-one-out experiments to evaluate the importance of individual features for PSD. . 34
4.10 Feature template capturing interaction of lexical and non-lexical case features. . . . . . 37
4.11 Agreement features relevant for parsing Hindi and Urdu. . . . . . . . . . . . . . . . . 39
4.12 Distribution of nominal-light verb pairs and adjacent nominal-literal verb pairs. . . . . 42
4.13 Performance of our SVM-based classifier on complex predicate identification. . . . . . 43
4.14 NPMI Scores and the origin of few Ezafe Constructions in the Urdu Treebank. . . . . . 45
4.15 Parsing accuracies of Hindi after incorporating morphosyntactic features. . . . . . . . 47
4.16 Parsing accuracies of Urdu after incorporating morphosyntactic features. . . . . . . . . 47
4.17 Parsing accuracies (LAS) showing the impact of different feature extraction procedures. 48
4.18 Comparison of accuracies showing the impact of features capturing complex predication. 48
4.19 Accuracies (LAS) for Leave-one-out and Only-one evaluation. . . . . . . . . . . . . . 50
4.20 Comparison of IL feature template with the ZN feature template. . . . . . . . . . . . . 51

5.1 Non-projectivity measures of Dependency Structures in IL treebanks. . . . . . . . . . 57


5.2 Sources of Non-projectivity in Hindi/Urdu dependency treebanks. . . . . . . . . . . . 57
5.3 Impact of different encoding schemes on parsing of non-projective arcs. . . . . . . . . 63
5.4 Impact of different encoding schemes on parsing of Hindi and Urdu test sets. . . . . . 63
5.5 Impact of different heuristics on parsing of Hindi and Urdu test sets. . . . . . . . . . . 64
5.6 Comparison of results on scrambled and normal views of the test set. . . . . . . . . . . 68
5.7 Comparison of results on different domains of Hindi. . . . . . . . . . . . . . . . . . . 69

6.1 Morphological paradigm of khabar. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73


6.2 Relative distribution of Arabic, Hindi, Persian and Urdu alphabets (consonants only). . 75
6.3 Statistics of etymological data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


6.4 Performance of our language identification model on the test set. . . . . . . . . . . . . 77


6.5 Relatively frequent partial trees in the Urdu treebank. . . . . . . . . . . . . . . . . . . 83
6.6 Feature template used for learning the emission parameters. . . . . . . . . . . . . . . . 85
6.7 Comparison of accuracies of available system on internet with our system. . . . . . . . 88
6.8 Performance of noisy-channel model in resolving homograph ambiguity. . . . . . . . . 88
6.9 Performance of different modules of a dependency parsing pipeline. . . . . . . . . . . 90
6.10 Comparison of lexical and POS-tag merging rates. . . . . . . . . . . . . . . . . . . . . 91
6.11 Some examples of homographs from the Hindi and Urdu treebanks. . . . . . . . . . . 92
6.12 OOV rates of Hindi and Urdu development sets. . . . . . . . . . . . . . . . . . . . . . 93
6.13 Choice of model architecture and other hyperparameters. . . . . . . . . . . . . . . . . 94
6.14 Hyperparameters of our neural network models tuned on development sets. . . . . . . 96
6.15 Results of different supervised domain adaptation methods. . . . . . . . . . . . . . . . 97
6.16 Comparison of parsing & tagging accuracy of Hindi parser & tagger on multiple data sets. 98

7.1 Mappings of CPG dependencies to PropBank numbered arguments adapted from [189]. 103
7.2 Impact of annotation granularity on parsing Hindi and Urdu test data. . . . . . . . . . 106
7.3 Additional cluster and WordNet-based features for parsing Hindi and Urdu. . . . . . . 111
7.4 Impact of IndoWordNet and Brown clusters on Hindi and Urdu parsers. . . . . . . . . 113
7.5 Impact of IndoWordNet and word embeddings on Hindi and Urdu parsers. . . . . . . . 113

8.1 Tokenization problem in Kashmiri texts. . . . . . . . . . . . . . . . . . . . . . . . . . 119


8.2 Feature template used in POS tagging experiments. . . . . . . . . . . . . . . . . . . . 120
8.3 POS tagging accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.4 New chunk tags. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.5 Feature template used in the chunking experiments. w denotes a word, p is the POS tag. 122
8.6 Comparison of Chunking Accuracies using different feature models. . . . . . . . . . . 122
8.7 Kappa statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.8 Case suffixes in Kashmiri (adapted from Koul and Wali). . . . . . . . . . . . . . . . . 127
8.9 Effect of different features on our Arc-eager parser. . . . . . . . . . . . . . . . . . . . 127
8.10 Intra-chunk parsing accuracies on expanded version of the Kashmiri treebank. . . . . . 128
Chapter 1

Introduction

Natural language text is an agglomeration of naturally occurring sentences, where predicate-argument


relations like “who did what to whom, when and where?” are the essence of the semantics expressed in
them. An automatic analysis of a natural language text in terms of the relations expressed between words
in a sentence would prove extremely useful to applications which are more semantically oriented like
question-answering and relation extraction systems. Dependency-based frameworks, which are among the most widely used grammar formalisms in the parsing community, provide an explicit encoding of the predicate-argument structure. Therefore, an automatic text analysis grounded in the dependency framework would essentially provide a semantically-oriented representation of a sentence to downstream applications. In
order to facilitate automatic dependency analysis of a text, various approaches and parsing algorithms
have been proposed in the past decade. Most of these approaches are data-driven and rely on the use
of machine learning algorithms to learn the model parameters from manually annotated corpora called
treebanks. However, the reliance of such approaches on manually annotated treebanks restricts their
applicability to only resource-rich languages such as English. Besides, these approaches often yield
unsatisfactory results when applied to languages that are morphologically rich in nature [185]. In this
thesis, we propose novel strategies for addressing the issues that rich morphology poses to statistical
parsing of Indian languages, which are not only morphologically rich in nature but resource-poor as well. In particular, we focus on various linguistic factors such as morphosyntactic interactions, typological similarities and selectional restrictions for accurate and robust parsing of (mostly) Hindi and Urdu texts.

1.1 Parsing Indian languages: Challenges and Issues


Indian languages are morphologically rich in nature. While their word morphology explicitly rep-
resents grammatical or semantic relations in a sentence, it allows considerable freedom in constituent
ordering. The rich morphological nature of a language can prove challenging for a statistical parser, as noted by Tsarfaty et al. [185]. Here, we list some of the challenges Indian languages pose to
parsing:

• Lexical Diversity: Lexical diversity in a natural language text is the ratio of word types to tokens. Indian languages exhibit high lexical diversity due to their rich inflectional and/or agglutinative nature. A direct implication of high lexical diversity is a high rate of out-of-vocabulary words unseen in the annotated data [185]. Parsing Indian languages is thus associated with increased lexical data sparseness. Take, for example, the case of Kashmiri, one of the languages studied in this thesis: nominal modifiers like demonstratives, quantifiers and adjectives often agree with their head noun in case, e.g. in the noun phrase yam-is ak-is bad-is ladak-as (this-DAT one-DAT big-SG.M.DAT boy-SG.M.DAT), all the dependent words (modifiers) agree with the head ladak-as ‘boy’ in case information, which is represented by the dative/accusative marker (-is/-as). In English, all these modifiers have a single, context-invariant form, while in Kashmiri they take many forms, agreeing with the head noun's grammatical case.
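The notions of lexical diversity and out-of-vocabulary rate used above can be sketched in a few lines; the toy word lists below are invented for illustration and are not drawn from any treebank:

```python
def type_token_ratio(tokens):
    """Lexical diversity: the ratio of distinct word types to tokens."""
    return len(set(tokens)) / len(tokens)

def oov_rate(test_tokens, train_tokens):
    """Fraction of test tokens never seen in the training vocabulary."""
    vocab = set(train_tokens)
    return sum(1 for t in test_tokens if t not in vocab) / len(test_tokens)

# Toy illustration: inflected forms of the same lemma inflate the type count.
train = ["ladka", "ladke", "ladkon", "ghar", "ghar"]
test = ["ladki", "ghar", "ladke"]
print(type_token_ratio(train))  # 0.8 (4 types over 5 tokens)
print(oov_rate(test, train))    # 0.333... ("ladki" is unseen)
```

A richly inflected language pushes the type-token ratio up, and with it the out-of-vocabulary rate on held-out data, which is precisely the lexical sparsity problem described here.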

• Case Ambiguity: In most of the Indian languages, case marking is widely used to mark the semantic or grammatical roles of nominals with respect to verbal predicates. However, case markers and/or adpositions are among the most ambiguous lexical categories. In Hindi, for example, case markers and case roles do not have a one-to-one mapping; each case marker is distributed over a number of case roles. Of the six case markers Hindi has, only the ergative case marker is unambiguous [25]. Although case markers are good indicators of the relation a nominal bears in a sentence, the phenomenon of case syncretism limits their ability to accurately identify the role of the nominal while parsing.
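The one-to-many mapping between case markers and case roles can be pictured as a simple lookup. The sketch below uses Pāṇinian-style karaka labels, but the marker-to-role inventory is a deliberately truncated assumption for illustration, not the actual distribution found in any treebank:

```python
# Hypothetical, truncated mapping of Hindi case markers to karaka-style
# case roles; the real treebank distribution is broader than shown here.
CASE_ROLES = {
    "ne": ["k1"],        # ergative: the only unambiguous marker
    "ko": ["k2", "k4"],  # accusative/dative: patient or recipient
    "se": ["k3", "k5"],  # instrumental or ablative source
}

def ambiguous_markers(mapping):
    """Return the markers that map to more than one case role."""
    return sorted(m for m, roles in mapping.items() if len(roles) > 1)

print(ambiguous_markers(CASE_ROLES))  # ['ko', 'se']
```

A parser that keyed the role decision on the marker alone would have to guess among the listed roles for every ambiguous entry; this is why case cues must be combined with other features.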

• Argument Scrambling and Non-projectivity: Since case markers (either clitics or affixes) in Indian languages carry the information about the relations between words, these words can freely change their positions in a sentence. Argument scrambling often leads to discontinuities in syntactic constituents and to long-distance dependencies, thus posing a challenge to parsing. Given appropriate pragmatic conditions, a simple sentence (containing a single verb) in ILs allows n factorial¹ (n!) permutations. Potential n! scramblings would worsen the data sparsity problem, as most of the valid structures may never show up in a limited-sized treebank.
Non-projectivity is another challenge that any statistical parser has to face when parsing Indian languages, since the phenomenon is very common in them. In one of our studies on non-projectivity in Indian language treebanks, we observed that non-projective structures occur in as many as 23-25% of sentences (cf. Chapter 5, Table 5.1). Non-projectivity has been shown to pose problems to both grammar formalisms and syntactic parsing [110]. In the case of parsing, the most widely used transition-based dependency parsers (arc-eager and arc-standard) cannot handle non-projective structures directly due to the projectivity constraint.
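The projectivity constraint has a compact formulation: with an artificial root at position 0, a tree is projective if and only if no two dependency arcs cross in the linear order of the sentence. A minimal check under that formulation (the head-vector encoding is a common CoNLL-style convention, assumed here for illustration):

```python
def is_projective(heads):
    """heads[i] is the head position of token i+1; 0 is the artificial root.
    Returns True iff no two arcs cross in the linear order of the sentence."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i in range(len(arcs)):
        for j in range(i + 1, len(arcs)):
            (a, b), (c, d) = arcs[i], arcs[j]
            if a < c < b < d or c < a < d < b:
                return False  # the two arcs interleave, i.e. cross
    return True

# Token 2 attaches to token 4 while token 3 attaches to token 1,
# so arcs (4, 2) and (1, 3) cross: the tree is non-projective.
print(is_projective([0, 4, 1, 1]))  # False
print(is_projective([2, 0, 2, 2]))  # True: a projective tree
```

Arc-eager and arc-standard systems can only derive trees for which this check succeeds; the encoding schemes evaluated in Chapter 5 work around exactly this limitation.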

In addition to the above-listed linguistically-oriented challenges, Indian languages lack the resources needed for training accurate and robust parsers.
¹ n is the number of chunks in a sentence.

• Lack of Resources: Most of the Indian languages do not have the necessary annotated corpora (treebanks) that form the bedrock of statistical parsing. To date, Hindi and Urdu are the only Indian languages for which reasonably large treebanks are available. Although there are treebanks available for a few other languages as well, such as Bengali, Tamil and Telugu, their size is limited to 1,000-1,500 sentences. Parsing Indian languages under such resource-scarce settings suffers from high lexical sparsity. To circumvent challenges such as lexical diversity, parsers would need to be trained on large treebanks.

1.2 Goals of the Thesis


In light of the above discussed issues, we seek to develop parsing models that are capable of ex-
ploiting morphosyntactic interactions relevant for parsing and tackling the problem of sparsity. More
specifically, the goals of the thesis are:

1. to capture morphosyntactic correlations in the parsing models, and

2. to leverage existing linguistic resources for addressing lexical and syntactic sparsity.

The research reported in this thesis presents novel strategies to achieve these goals without the need
for additional training data.

1.3 Primary Contributions


The major contributions of the thesis to data-driven parsing of Indian languages can be summarized
as:

1. Efficient Representation of Morphosyntactic Information: Case marking, grammatical agreement
and complex predication are essential components of Hindi and Urdu grammar. Modeling
these phenomena properly should, therefore, improve parsing of these languages. We have shown
that existing feature models do not model these phenomena efficiently and demonstrated how
these morphosyntactic phenomena can be modeled over a rich history of parsing decisions in a
transition-based parsing system. We used the history of our transition-based parser to extract
and propagate case information unlike previous approaches where the information is copied in a
preprocessing step. Our results showed that using case markers as a second-order (and even third-
order) feature in Hindi and Urdu improves parsing and additionally removes the dependency of
the parser on preprocessing tools. Similarly, we modeled complex predicates as single lexical
units and empirically showed that treating them as composite units, rather than treating their
components (host nominal and light verb) as separate, independent lexical units, facilitates
better modeling of their argument structure. Likewise, we incorporated grammatical agreement
into the model and introduced heuristics for ezafe identification in Urdu, achieving a significant
boost in parsing performance.

2. Effective Parsing of Non-projective and Scrambled Structures: Besides parsing of formal
texts, we also proposed effective ways of parsing the non-canonical and discontinuous structures
which mainly arise in conversational data due to the morphological richness of a language. Firstly, we
showed that the majority of discontinuous structures in Indian languages contain strong syntactic cues
that can help in parsing these prominent but sparse structures. Based on these cues, we designed
heuristics that can be easily extracted from the parse history of a transition-based parser. Secondly,
we proposed a sampling technique to generate training instances with diverse word orders from
the available canonical structures. We showed that linearly interpolated models trained on diverse
views of the same data can effectively parse both canonical and non-canonical texts.

3. Resource Sharing and Augmentation–There’s no Data like More Data: Based on the assump-
tion that Hindi and Urdu represent the same language, we proposed to augment and share their
individual resources (treebanks) to improve the performance of their parsers. To facilitate re-
source sharing and augmentation, we proposed a state-of-the-art statistical transliteration model
for harmonizing their orthographic differences, while their lexical divergences are resolved by
learning cross-register word embeddings. We augmented their harmonized treebanks and trained
neural network-based parsing models using different supervised domain adaptation techniques.
We empirically showed that our augmented models perform significantly better than the models
trained separately on the individual treebanks. Moreover, we also demonstrated that the individ-
ual parsing models trained on harmonized Hindi and Urdu resources can be used interchangeably
to parse both Hindi and Urdu texts with near state-of-the-art results.

4. Lexical Semantics for Parsing Semantically-rich Dependencies: We proposed lexical semantics
as a complementary feature for parsing semantically rich dependency annotations in Indian
language treebanks. We showed that lexical semantics in the form of discrete and continuous fea-
tures such as ontological categories, Brown clusters and word embeddings can play a major role
in disambiguating highly rich CPG dependencies.2 We proposed simple feature combinations to
incorporate WordNet and cluster features in the linear parsing model. We also proposed to use
retrofitting for incorporating WordNet information in a neural network parsing model. By model-
ing lexical semantics in our parsing models, we achieved very significant improvements in parsing
of both Hindi and Urdu test sets. The improvements are particularly prominent in dependencies
related to verb argument structure.

5. First Dependency Parser for Kashmiri: We developed the first statistical parser for Kashmiri
along the lines of our work on Hindi and Urdu. Kashmiri is an under-resourced language: it has
hardly been digitized, although it is rich in literature and has 5.6 million speakers worldwide3 . The
2 The dependency relations that are defined in the Computational Pān.inian Grammar framework.
3 https://www.ethnologue.com/language/kas/view

parser is trained on a treebank of 1,500 sentences annotated with part-of-speech categories, chunk
types and word-level dependency relations. Annotations at each level are done manually using the
existing guidelines for Indian language treebanking, modified wherever necessary. We addressed
some of the phenomena specific to Kashmiri, such as verb-second (v2) word order and pronominal
cliticization, and proposed linguistically relevant analyses for them.
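As a simple illustration of the interpolation mentioned in the second contribution, combining two models reduces to a convex combination of their weight vectors; the feature names and mixing weight below are illustrative, not the settings used in our experiments:

```python
def interpolate(w_canonical, w_scrambled, lam=0.5):
    """Linearly interpolate two parameter vectors, stored as sparse
    feature -> weight maps.  `lam` weights the model trained on canonical
    treebank data; (1 - lam) weights the model trained on automatically
    generated scrambled variants of the same sentences."""
    features = set(w_canonical) | set(w_scrambled)
    return {f: lam * w_canonical.get(f, 0.0) + (1 - lam) * w_scrambled.get(f, 0.0)
            for f in features}

# Hypothetical feature weights from the two views of the data.
w1 = {"s0.pos=NN": 1.0, "b0.pos=VB": 0.4}
w2 = {"s0.pos=NN": 0.2, "s0.case=erg": 0.8}
merged = interpolate(w1, w2, lam=0.5)
```

A feature seen in only one model keeps half of its weight here, so evidence from either view of the data still contributes at parsing time.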

1.4 Auxiliary Contributions


1. Grammar-driven Chunk Expander: We proposed a grammar-driven approach to expand the
intra-chunk dependencies for training word-level dependency parsers. The expander uses the well-
known shift-reduce parsing strategy [146] coupled with manually crafted production rules to ex-
press dependency relations at the word level. Unlike the approach proposed by Kosaraju et al. [106],
our approach does not rely on preprocessing of data and thus avoids error propagation; it is also
robust and highly accurate.

2. State-of-the-art Transliteration Models: We proposed highly accurate bidirectional transliteration
models for Devanagari and Perso-Arabic scripts. The models are trained using the structured
perceptron algorithm [51] and use sentence-level decoding over n-best transliterations to resolve
homograph ambiguity.

1.5 Related Publications


A major part of the work described in this thesis has previously been presented in the publications
listed below. The total number of citations for these publications is 50 [source: Google Scholar, January
20174 ].

Journal Papers

1. Riyaz Ahmad Bhat, Irshad Ahmad Bhat, and Dipti Misra Sharma. “Improving Transition-
based Dependency Parsing of Hindi and Urdu by Modeling Syntactically Relevant Phenom-
ena.” In ACM Transactions on Asian and Low-Resource Language Information Processing
(TALIP), 2016.
2. Riyaz Ahmad Bhat, Irshad Ahmad Bhat, Naman Jain, and Dipti Misra Sharma. “Bridging
the Script and Lexical Barrier between Hindi and Urdu for Resource Sharing and Augmen-
tation.” Under review at Natural Language Engineering (NLE).

Conference Papers
4 https://goo.gl/FtvWF8

3. Riyaz Ahmad Bhat, Irshad Ahmad Bhat, Naman Jain, and Dipti Misra Sharma. “A House
United: Bridging the Script and Lexical Barrier between Hindi and Urdu.” In Proceedings
of the 26th International Conference on Computational Linguistics (COLING 2016).
4. Riyaz Ahmad Bhat, and Dipti Misra Sharma. “Non-projective Structures in Indian Lan-
guage Treebanks.” In Proceedings of the 11th Workshop on Treebanks and Linguistic The-
ories (TLT11), 2012.
5. Riyaz Ahmad Bhat, Sambhav Jain, and Dipti Misra Sharma. “Experiments on Depen-
dency Parsing of Urdu.” In Proceedings of the 11th Workshop on Treebanks and Linguistic
Theories (TLT11), 2012.
6. Sambhav Jain, Naman Jain, Aniruddha Tammewar, Riyaz Ahmad Bhat, and Dipti Misra
Sharma. “Exploring Semantic Information in Hindi WordNet for Hindi Dependency Pars-
ing.” In Proceedings of the Sixth International Joint Conference on Natural Language Pro-
cessing (IJCNLP), 2013.
7. Riyaz Ahmad Bhat, Naman Jain, Dipti Misra Sharma, Ashwini Vaidya, Martha Palmer,
James Babani, and Tafseer Ahmed. “Adapting Predicate Frames for Urdu PropBanking.” In
Proceedings of LT4CloseLang: Language Technology for Closely Related Languages and
Language Variants, 2014.
8. Riyaz Ahmad Bhat, Shahid Mushtaq Bhat, and Dipti Misra Sharma. “Towards building
a Kashmiri Treebank: Setting up the Annotation Pipeline.” In Proceedings of the Ninth
International Conference on Language Resources and Evaluation (LREC), 2014.

Other relevant publications during my PhD, which are not part of this thesis, are as
follows:

9. Riyaz Ahmad Bhat, and Dipti Misra Sharma. “A Dependency Treebank of Urdu and its
Evaluation.” In Proceedings of the Sixth Linguistic Annotation Workshop, pp. 157-165.
Association for Computational Linguistics, 2012.
10. Itisree Jena, Riyaz Ahmad Bhat, Sambhav Jain, and Dipti Misra Sharma. “Animacy Anno-
tation in the Hindi Treebank.” In Proceedings of the 7th Linguistic Annotation Workshop &
Interoperability with Discourse (LAW-VII & ID), 2013.
11. Riyaz Ahmad Bhat and Dipti Misra Sharma. “Animacy Acquisition Using Morphological
Case.” In Proceedings of the Sixth International Joint Conference on Natural Language
Processing (IJCNLP), 2013.
12. Riyaz Ahmad Bhat, Rajesh Bhatt, Annahita Farudi, Prescott Klassen, Bhuvana Narasimhan,
Martha Palmer, Owen Rambow et al. “The Hindi/Urdu Treebank Project.” In the Handbook
of Linguistic Annotation (edited by Nancy Ide and James Pustejovsky), Springer Press.
2015.

13. Maaz Anwar Nomani, Riyaz Ahmad Bhat, Dipti Misra Sharma, Ashwini Vaidya, Martha
Palmer, and Tafseer Ahmed. “A Proposition Bank of Urdu.” In Proceedings of Tenth Inter-
national Conference on Language Resources and Evaluation (LREC), 2016.
14. Juhi Tandon, Himani Chaudhry, Riyaz Ahmad Bhat and Dipti Sharma. “Conversion from
Paninian Karakas to Universal Dependencies for Hindi Dependency Treebank.” In Proceed-
ings of the 10th Linguistic Annotation Workshop, ACL, 2016.

1.6 Thesis Overview


• Chapter 2. In this chapter, we provide the necessary background for the thesis, particularly
focusing on different approaches to parsing and our choice of parsing paradigm and its formal
description.

• Chapter 3. This chapter is devoted to Computational Pān.inian Grammar (CPG) formalism that
lies at the heart of Indian language treebanking. Specifically, we discuss the application of CPG
to Hindi and Urdu in this chapter. Since the relations are only marked at chunk/constituent level,
we also discuss our grammar-driven approach to expand the intra-chunk dependencies, thereby
expressing the dependency relations at the word level.

• Chapter 4. In this chapter, we describe different strategies to represent morphosyntactic information
in a linear transition-based parsing system. We show how the rich parse history of an
arc-eager transition system can be leveraged to extract crucial information such as case markers
and agreement features. The chapter also describes a more viable representation of complex
predicates for identification of their argument roles. Moreover, we present certain heuristics to
identify ezafe in Urdu texts.

• Chapter 5. In this chapter, we present an extensive study of non-projective structures in Indian
language treebanks. We explore pseudo-projective transformations to projectivize these structures
so that they can be parsed using the arc-eager parsing algorithm. We also identify linguistic cues
that can help in their identification while parsing. Besides non-projectivity, we also propose a
sampling technique to handle argument scrambling. We show that linear interpolation of model
parameters learned from normal treebank data and automatically generated scrambled structures
can be effective in parsing canonical as well as non-canonical data.

• Chapter 6. This chapter discusses orthographic, lexical and syntactic differences between Hindi
and Urdu texts and proposes novel strategies to harmonize them. We provide a quantitative analy-
sis of these divergences and point out the fact that Hindi and Urdu resources can be augmented and
shared if their differences are bridged. More importantly, we describe different domain adaptation
strategies to augment and share the annotation resources for learning better parsing models.

• Chapter 7. This chapter addresses the issues in parsing highly granular dependency annotations
in Indian language treebanks. We explore auxiliary resources like IndoWordNet and data-driven
distributional similarity methods to mitigate the effect of granularity on parsing. We empirically
show that lexical semantics in the form of discrete and continuous features such as ontological
categories, Brown clusters and word embeddings can play a major role in disambiguating the rich
predicate-argument relations.

• Chapter 8. Following our discussion of Indian language treebanking and parsing in the previous
chapters, in this chapter we discuss our efforts to build a reasonably large dependency treebank
for data-driven parsing of Kashmiri. We discuss the basic annotation guidelines and the major
differences in annotation for Kashmiri-specific phenomena like v2, illustrating the annotation
scheme with appropriate sentences taken from the treebank. We report preliminary parsing
experiments using feature representations similar to those we proposed for parsing the Hindi
and Urdu treebanks.

• Chapter 9. In this chapter, we provide the concluding remarks and outline directions for possible
future research.

Chapter 2

General Background

A significant part of the research work presented in this thesis is based on the application of deterministic
transition systems to dependency parsing of unrestricted natural language text. In this chapter, we
provide the necessary background on dependency parsing, particularly covering the inner workings of a
transition-based parsing system. Moreover, we also discuss the different oracles that underlie the learning
process of such a parsing system.

2.1 Dependency Parsing


Dependency parsing is an approach to automatic syntactic analysis of natural language text based
on dependency grammar [108]. The basic assumption underlying a dependency grammar is that sen-
tential structure primarily consists of words linked by binary, asymmetrical relations called dependency
relations. A dependency relation holds between a pair of words in which one word, called the head, syn-
tactically dominates the other, called the dependent. Formally, these dependencies are represented as
X →l Y, meaning "Y depends on X"; X is the head of Y, Y is a dependent of X, and l encodes the
type of dependency. Essentially, the goal of dependency parsing is to elucidate these binary word-
level dependencies in a labeled dependency graph. Consider an input sentence as a string of words
W = w0, ..., wn, n ≥ 1, where w0 is a dummy ROOT symbol. A dependency tree for W is a labeled
directed graph T = (V, A), where V = {wi | i ∈ [0, n]} is the set of words, and A is a set of labeled arcs
(wi, l, wj). An arc (wi, l, wj) encodes a labeled dependency wi →l wj, where l is a permissible depen-
dency label from L = {li | i ∈ [0, m]}. The direction of an arc is defined by the sign of the inequality: if j > i
for (wi, wj) ∈ A, the arc is right-directed, otherwise it is left-directed. Stated precisely,
dependency parsing tries to automatically construct a well-formed labeled dependency graph T for an
input sentence W. A dependency graph T is well-formed if it is acyclic and connected, as the one shown
in Figure 2.1.
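Since each word has a single head, a tree can be stored compactly as a map from dependents to heads; well-formedness then amounts to every word reaching ROOT without revisiting a node. The sketch below, with names of our own choosing, follows directly from the definitions:

```python
def is_well_formed(heads):
    """Check that a head assignment forms a well-formed dependency tree:
    every word reaches the dummy ROOT (index 0) by following heads,
    which rules out both cycles and disconnected fragments.

    `heads` maps 1-based token positions to head positions.
    """
    for start in heads:
        seen = {start}
        node = start
        while node != 0:
            node = heads.get(node)
            if node is None or node in seen:  # dangling head or cycle
                return False
            seen.add(node)
    return True

print(is_well_formed({1: 2, 2: 0, 3: 2}))  # -> True
print(is_well_formed({1: 2, 2: 1}))        # -> False (cycle, never reaches ROOT)
```

With the single-head representation fixed by the map itself, acyclicity plus reachability of ROOT is equivalent to connectedness, so one traversal per word suffices.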

(1) un kī nimāz-e janāzā jāmā masjid mẽ Maqsōd Alī ne paRhāyī .
he of prayer funeral jama mosque in Maqsod Ali Erg teach.Perf.3FSg .
'His funeral prayer was led by Maqsod Ali in Jama masjid.'

[Figure: dependency tree with labeled arcs (root, k2, k7p, r6, k1, pof cn, lwg psp, rsym) drawn over the words un kī nimāz-e janāzā jāmā masjid mẽ Maqsōd Alī ne paRhāyī .]

Figure 2.1: Dependency tree of Example sentence 1.

During the last two decades, a number of algorithms have been proposed for automatic depen-
dency parsing of unrestricted natural language text. These algorithms are used in conjunction with ma-
chine learning techniques to learn accurate dependency parsers. Syntactically annotated corpora, called
treebanks, are usually at the heart of these approaches, since they provide the information necessary for
learning accurate parsers in supervised machine learning settings. The approaches can be broadly cate-
gorized as graph-based and transition-based. Graph-based parsers use a near-exhaustive search over the
graphical representation of a sentence to find a maximum scoring dependency graph, while transition-
based parsers use a local greedy search to derive a dependency tree. Graph-based methods were first
explored for dependency parsing by Eisner [65], who proposed an O(n^3) parsing algorithm based on dy-
namic programming and a generative learning model. The transition-based approach was first explored
by Kudo and Matsumoto [109] for Japanese and by Yamada and Matsumoto [198] for English. Both
methods have their strengths and weaknesses. While graph-based parsers are very accurate, they run in
quadratic time for non-projective parsing using the Chu-Liu-Edmonds algorithm and in cubic time for pro-
jective parsing with Eisner's algorithm [132]. Transition-based parsers, on the other hand, have linear time
complexity but are less accurate. However, recent advancements in transition-based parsing have
minimized the accuracy gap between the two approaches while compromising little on efficiency
[76, 77, 202, 203]. Next we discuss our choice of parsing paradigm and give its formal description.

2.2 Parsing framework


In this thesis we use the transition-based dependency parsing paradigm [149] to experiment with parsing
of Indian language texts. In the last decade, transition-based parsers have gained popularity due to
their efficiency. Transition-based greedy parsers allow us to carry out a wide range of experiments in
reasonable time on commodity hardware. Moreover, the arc-eager system, which is the basis of our
parser, nearly follows an incremental parsing strategy, making it cognitively plausible as well [147].

2.2.1 Transition-based Dependency Parsing


Transition-based dependency parsing aims to predict a transition sequence from an initial configura-
tion to some terminal configuration which derives a target dependency parse tree for an input sentence.
In data-driven settings, such an optimal transition sequence is predicted using a classifier. Even though
quite advanced machine learning algorithms like neural networks and structured prediction algorithms
[48, 201] have been used to train the classifier, it has been observed that even simple memory-based
algorithms work well for the task [83].
In the last two decades, a number of incremental parsing algorithms have been proposed for parsing
natural language text. In this thesis, we restrict our choice to the arc-eager system [146], one of the
most popular transition systems. It defines a set of configurations for a sentence
w1,...,wn, where each configuration C = (S, B, A) consists of a stack S, a buffer B, and a set of depen-
dency arcs A. For each sentence, the parser starts with an initial configuration where S = [ROOT], B =
[w1,...,wn] and A = ∅, and terminates with a configuration C if the buffer is empty and the stack contains
the ROOT. The parse tree derived from a transition sequence is given by A. Denoting Si and Bj as the
ith and jth elements on the stack and buffer, the arc-eager system defines four types of transitions (t):

1. A LEFT-ARC(l) adds an arc Bj → Si to A with label l, where Si is the node on top of the stack and
Bj is the first node in the buffer, and pops the node Si from the stack. It has as a precondition
that the token Si is not the artificial root node 0 and does not already have a head.

2. A RIGHT-ARC(l) adds an arc Si → Bj to A with label l, where Si is the node on top of the stack
and Bj is the first node in the buffer, and pushes the node Bj onto the stack.

3. The REDUCE transition pops the top node of the stack and is subject to the precondition that
the node has a head.

4. The SHIFT transition removes the first node from the buffer and pushes it onto the stack.
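A minimal executable rendering of these four transitions is sketched below; this is an illustration of the transition system as defined above, not our actual parser implementation, and the class and method names are ours:

```python
class ArcEager:
    """Minimal arc-eager configuration: a stack, a buffer, and an arc set.

    Tokens are 1-based indices; 0 is the artificial ROOT, which starts on
    the stack.  Arcs are stored as (head, label, dependent) triples.
    """
    def __init__(self, n):
        self.stack = [0]
        self.buffer = list(range(1, n + 1))
        self.arcs = set()
        self.head = {}  # dependent -> head, used to check preconditions

    def left_arc(self, label):
        s, b = self.stack[-1], self.buffer[0]
        assert s != 0 and s not in self.head  # ROOT and attached words excluded
        self.arcs.add((b, label, s))
        self.head[s] = b
        self.stack.pop()                      # LEFT-ARC pops the stack

    def right_arc(self, label):
        s, b = self.stack[-1], self.buffer[0]
        self.arcs.add((s, label, b))
        self.head[b] = s
        self.stack.append(self.buffer.pop(0))  # RIGHT-ARC pushes the dependent

    def reduce(self):
        assert self.stack[-1] in self.head  # only attached words may be reduced
        self.stack.pop()

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def terminal(self):
        return not self.buffer

# A two-word sentence where word 2 heads word 1 and is itself the root.
c = ArcEager(2)
c.shift()            # stack [0, 1], buffer [2]
c.left_arc("mod")    # adds 2 -> 1 and pops 1
c.right_arc("root")  # adds ROOT -> 2 and pushes 2
assert c.terminal() and c.arcs == {(2, "mod", 1), (0, "root", 2)}
```

Note that LEFT-ARC pops the stack while RIGHT-ARC pushes the buffer front, which is what makes the system "eager": arcs are added at the earliest possible opportunity.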

As an illustration of the arc-eager parsing algorithm, we derive the transition sequence for an ex-
ample sentence from the Hindi treebank. The transition sequence is derived by the oracle presented in
Algorithm 1, guided by the gold tree representation (Ggold ) of the sentence shown in Figure 2.1. The algo-
rithm derives 2n-1 transitions for a sentence of length n1 . The overall derivation process is shown in
Figure 2.2.

1: if c = (S|i, j|B, A) and (j, l, i) ∈ Agold then
2:     t ← LEFT-ARC(l)
3: else if c = (S|i, j|B, A) and (i, l, j) ∈ Agold then
4:     t ← RIGHT-ARC(l)
5: else if c = (S|i, j|B, A) and ∃k[k < i ∧ ∃l[(k, l, j) ∈ Agold ∨ (j, l, k) ∈ Agold]] then
6:     t ← REDUCE
7: else
8:     t ← SHIFT
9: return t

Algorithm 1: Standard oracle for the arc-eager parsing algorithm, adapted from Goldberg and Nivre [76].
1 Including dummy ROOT node.
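Algorithm 1 can be rendered in code as follows; the driver loop, the unlabeled transitions and the dependent-to-head representation of the gold tree are simplifications of ours:

```python
def static_oracle(stack, buffer, gold_head):
    """Static oracle for arc-eager parsing (Algorithm 1): return the
    transition forced by the gold tree in the current configuration.
    `gold_head` maps each dependent to its gold head (0 = dummy ROOT)."""
    s, b = stack[-1], buffer[0]
    if gold_head.get(s) == b:
        return "LEFT-ARC"
    if gold_head.get(b) == s:
        return "RIGHT-ARC"
    # Reduce if some word k < s still needs an arc to or from b.
    if any(k < s and (gold_head.get(k) == b or gold_head.get(b) == k)
           for k in stack):
        return "REDUCE"
    return "SHIFT"

def derive(gold_head):
    """Run the oracle over a sentence of len(gold_head) words and collect
    the transition sequence (labels omitted for brevity)."""
    stack, buffer, seq = [0], sorted(gold_head), []
    while buffer:
        t = static_oracle(stack, buffer, gold_head)
        seq.append(t)
        if t in ("SHIFT", "RIGHT-ARC"):
            stack.append(buffer.pop(0))
        else:  # LEFT-ARC and REDUCE both pop the stack
            stack.pop()
    return seq

# Gold tree: word 2 is the root, with dependents 1 and 3.
print(derive({1: 2, 2: 0, 3: 2}))  # -> ['SHIFT', 'LEFT-ARC', 'RIGHT-ARC', 'RIGHT-ARC']
```

On a non-projective gold tree this oracle still terminates, but the crossing arcs are simply never built, which is the limitation taken up in Section 2.2.3.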

Transition Stack Buffer A
[ROOT] [un kī nimāz-e ..] 0/
SHIFT [ROOT un] [kī nimāz-e janāzā ..]
RIGHT-ARC(lwg psp) [ROOT un kī] [nimāz-e janāzā jāmā ..] A ∪ lwg psp(un, kī)
REDUCE [ROOT un] [nīmāz-e janāzā jāmā ..]
LEFT-ARC(r6) [ROOT] [nimāz-e janāzā jāmā ..] A ∪ r6(nimāz, un)
SHIFT [ROOT nimāz-e] [janāzā jāmā masjid ..]
RIGHT-ARC(r6) [ROOT nimāz-e janāzā] [jāmā masjid ..] A ∪ r6(nimāz-e, janāzā)
SHIFT [.. janāzā jāmā] [masjid mẽ Maqsōd ..]
LEFT-ARC(pof cn) [ROOT nimāz-e janāzā] [masjid mẽ Maqsōd ..] A ∪ pof cn(masjid, jāmā)
SHIFT [.. janāzā masjid] [mẽ Maqsōd Alī ..]
RIGHT-ARC(lwg psp) [.. masjid mẽ] [Maqsōd Alī ne ..] A ∪ lwg psp(masjid, mẽ)
SHIFT [.. mẽ Maqsōd] [Alī ne padhāyī .]
LEFT-ARC(pof cn) [.. masjid mẽ] [Alī ne padhāyī .] A ∪ pof cn(Alī, Maqsōd)
SHIFT [.. mẽ Alī] [ne padhāyī .]
RIGHT-ARC(lwg psp) [.. masjid Alī ne] [padhāyī .] A ∪ lwg psp(Alī, ne)
REDUCE [.. masjid Alī] [padhāyī .]
LEFT-ARC(k1) [.. masjid mẽ] [padhāyī .] A ∪ k1(padhāyī, Alī)
REDUCE [.. janāzā masjid] [padhāyī .]
LEFT-ARC(k7p) [ROOT nimāz-e janāzā] [padhāyī .] A ∪ k7p(padhāyī, masjid)
REDUCE [ROOT nimāz-e] [padhāyī .]
LEFT-ARC(k2) [ROOT] [padhāyī .] A ∪ k2(padhāyī, nimāz)
RIGHT-ARC(root) [ROOT padhāyī] [.] A ∪ root(ROOT, padhāyī)
RIGHT-ARC(rsym) [ROOT padhāyī .] [] A ∪ rsym(padhāyī, .)
REDUCE [ROOT padhāyī] []
REDUCE [ROOT ] []

Figure 2.2: Transition sequence for Example sentence 1 based on Arc-eager algorithm.

2.2.2 Oracle
Transition-based parsers use an oracle to learn, from a gold-standard tree, the sequence of actions
they should take in order to derive it back. Until quite recently, these oracles were defined as functions
from trees to transition sequences, mapping each gold-standard tree to a single sequence of actions,
even if more than one sequence of actions can potentially derive it2 . In the parsing literature, these
oracles have been referred to as static, greedy oracles. Goldberg and Nivre, in their recent works [76, 77],
have redefined these oracles as relations from configurations to transitions. These oracles, aptly called
dynamic oracles, allow the learner to choose dynamically from the transitions defined as optimal at a
given parser configuration.
2 This type of ambiguity is defined as spurious ambiguity.

Algorithm 2 details the learning process of an arc-eager parser with a dynamic oracle using a vanilla
perceptron. At line 7, the set of optimal transitions is derived based on whether they can successfully
derive the gold tree once applied to the given configuration C3 . In addition, the algorithm also defines
suboptimal transitions in terms of the loss of gold dependencies that cannot be retrieved once these transi-
tions are applied (cf. CHOOSE NEXT EXP , lines 2-3). More precisely, the parser follows its own model
predictions on parser configurations once the model has reached a good region of the parameter space,
which is usually the case after a few iterations. Such exploration of non-optimal transitions helps mitigate
the problem of error propagation in greedy transition systems.
of error propagation in greedy transition systems.

1: w ← 0
2: for i = 1 → ITERATIONS do
3:     for sentence x with gold tree Ggold in corpus do
4:         C ← Cs (x)
5:         while C is not terminal do
6:             tp ← argmax_t w · φ(C, t)
7:             ZERO_COST ← {t | o(t; C, Ggold ) = true}
8:             to ← argmax_{t ∈ ZERO_COST} w · φ(C, t)
9:             if tp ∉ ZERO_COST then
10:                w ← w + φ(C, to ) − φ(C, tp )
11:            tn ← CHOOSE_NEXT(i, tp , ZERO_COST)
12:            C ← tn (C)
13: return w

1: function CHOOSE_NEXT_AMB (i, t, ZERO_COST)
2:     if t ∈ ZERO_COST then
3:         return t
4:     else
5:         return RANDOM_ELEMENT(ZERO_COST)

1: function CHOOSE_NEXT_EXP (i, t, ZERO_COST)
2:     if i > k and RAND() > p4 then
3:         return t
4:     else
5:         return CHOOSE_NEXT_AMB (i, t, ZERO_COST)

Algorithm 2: The perceptron learning algorithm for the transition-based parser using a dynamic oracle, adapted from Goldberg and Nivre [76].

As we noted above, the advantage of using a dynamic oracle is that it can efficiently handle spurious
ambiguity and error propagation during training. Error propagation has been the major
bottleneck in the performance of static, greedy transition-based systems [133]. Addressing this issue
has led to an increase in the performance of transition-based parsers on a range of languages (an
average of 1.2% LAS, cf. [76]) and bridged the gap with globally optimized graph-based models like
MST [132] and beam-search-based transition systems [202], without altering the time complexity
of these parsers, which remains O(n).

3 o(t; C, Ggold ) = true if t is optimal, false otherwise.
4 p regulates the percentage of non-optimal transitions to be explored, while k ensures that the model is in a good region of
the parameter space.
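The exploration policy of Algorithm 2 (CHOOSE_NEXT_AMB and CHOOSE_NEXT_EXP) can be sketched as below; the default values of k and p are placeholders, not the values used in our experiments:

```python
import random

def choose_next_amb(predicted, zero_cost):
    """Follow the model's prediction when it is optimal; otherwise pick
    any zero-cost transition (resolving spurious ambiguity at random)."""
    if predicted in zero_cost:
        return predicted
    return random.choice(sorted(zero_cost))  # sorted only for determinism

def choose_next_exp(i, predicted, zero_cost, k=2, p=0.9):
    """Exploration policy: once past the first k training iterations, the
    model's own (possibly non-optimal) prediction is followed outright
    with probability 1 - p, exposing the parser to configurations that
    result from its own mistakes."""
    if i > k and random.random() > p:
        return predicted
    return choose_next_amb(predicted, zero_cost)
```

During the warm-up iterations (i ≤ k) the policy never strays from the zero-cost set, so the weights first settle in a reasonable region before error states are explored.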

2.2.3 Non-Projectivity
A close look at the arc-eager oracle (Algorithm 1) reveals that the arc-eager system is re-
stricted to projective trees or, in simple words, disallows crossing arcs. For example, line 5 of the
algorithm clearly prohibits S0 from having dependents in the buffer. As Nivre [148] remarks, natural lan-
guages allow grammatical constructions that violate the condition of projectivity. For languages
where non-projectivity is common, one may use the arc-eager system with the caveat that non-
projective arcs will never be parsed correctly. In the case of Hindi, for example, we may lose ≥ 2% of arcs5
which are non-projective (see Chapter 5 for figures on other Indian languages).
Given such a loss of accuracy in Hindi and other Indian languages, we need a workaround to
tackle non-projective structures in our arc-eager parser. As a possible solution, we use the pseudo-
projective transformations of Nivre and Nilsson [152]. Since dependency trees are labeled, we
can transform the non-projective arcs while preserving the lift information in their dependency labels.
At parsing time, an inverse transformation based on breadth-first search can be applied to recover the non-
projective arcs efficiently. There is, however, a trade-off between parsing accuracy and parsing
time, as these transformations can increase the cardinality of the label set by a factor of n-square6 .
Nevertheless, we will use the encoding schemes proposed by Nivre and Nilsson and explore and evaluate
them for different Indian languages.
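A single-arc lifting pass in the spirit of these transformations can be sketched as follows; the interval test and the shortest-arc tie-breaking are simplifications of ours, not the exact scheme of Nivre and Nilsson:

```python
def is_nonproj(d, heads):
    """Simplified interval test: treat the arc (heads[d], d) as
    non-projective if some word strictly between its endpoints has its
    head outside the span they delimit."""
    lo, hi = sorted((heads[d], d))
    return any(not (lo <= heads[k] <= hi) for k in range(lo + 1, hi))

def projectivize(heads, labels):
    """Pseudo-projective transformation in the spirit of Nivre and
    Nilsson [152]: re-attach a non-projective dependent to its head's
    head, one step at a time, recording the original head's label in the
    dependency label ("head" encoding) so a parse-time inverse
    transformation can restore the arc."""
    heads, labels = dict(heads), dict(labels)
    while True:
        nonproj = [d for d in sorted(heads)
                   if heads[d] != 0 and is_nonproj(d, heads)]
        if not nonproj:
            return heads, labels
        d = min(nonproj, key=lambda x: abs(x - heads[x]))  # shortest arc first
        h = heads[d]
        if "|" not in labels[d]:
            labels[d] = labels[d] + "|" + labels[h]  # remember where it came from
        heads[d] = heads[h]  # lift one step towards the root

heads = {1: 2, 2: 0, 3: 5, 4: 2, 5: 4}  # the arc 5 -> 3 crosses 2 -> 4
labels = {1: "det", 2: "root", 3: "obj", 4: "vmod", 5: "inf"}
print(projectivize(heads, labels)[0])  # -> {1: 2, 2: 0, 3: 4, 4: 2, 5: 4}
```

Here word 3 is lifted from its head 5 to 5's head 4, and its label becomes "obj|inf", which is the extra label material responsible for the growth of the label set mentioned above.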

Concluding Remarks. All the parsing experiments in this thesis are carried out using our implementation
of the arc-eager system with a dynamic oracle. We use an averaged perceptron and a
feedforward neural network to train the parser, and use pseudo-projective transformations to handle
non-projectivity. Our implementation, including the state-of-the-art models, will be made available for
download from the author's web page.

5 Ignoring these 2% of arcs in our data would mean that our parser would be 2% less accurate. Moreover, non-projective arcs
are found in 15% of sentences in the Hindi treebank, which implies that our projective parser cannot produce an accurate
parse for these sentences.
6 At least in the case of the most informative encoding scheme, i.e., head + path; n is the cardinality of the original label set.

Chapter 3

Indian Language Treebanking: Grammar Formalism and Annotation


Procedure

This chapter is devoted to the Computational Pān.inian Grammar (CPG) formalism that underlies the
dependency annotation scheme used for Indian language treebanking. Specifically, we discuss the ap-
plication of CPG to Hindi and Urdu. Besides, we also discuss the annotation procedure adopted to
build treebanks based on the formalism. The annotation procedure restricts manual annotation to
inter-chunk dependencies. In this regard, we also discuss our grammar-driven approach to automatically
expand the intra-chunk dependencies, thereby expressing the dependency relations at the word level.

3.1 Introduction
The need for manually annotated linguistic resources is widely acknowledged in the field of compu-
tational linguistics (CL) and natural language processing (NLP). In the NLP-CL research community,
a great deal of effort has been put into the creation of these linguistic resources due to the reliance of
basic as well as advanced NLP applications on manual annotations. Specifically, syntactic treebanking
projects have generated a lot of interest in the community due to their manifold usage. A syntactic tree-
bank is, by definition, a set of syntactic trees capturing the syntactic or semantic structure of sentences.
Creation of these treebanks has interested both linguists and computational linguists. For the former,
they provide insights into the linguistic theory they have been built upon, while the latter use them for
the development of data-driven parsers.
Treebanks can vary with respect to the formalism that determines the choice of syntactic represen-
tation of sentences. The most popular syntactic representations adopted by major treebanking projects
are either based on phrase structure or dependency grammars1 . However, other formalisms are also
used, albeit rarely, like Lexical Functional Grammar (LFG) [179] and Head-driven phrase structure
grammar (HPSG) [174]. In the past two decades, dozens of phrase structure and dependency treebanks
have been created for languages such as Arabic, Chinese, Czech, English, French, German, and many
1 see https://en.wikipedia.org/wiki/Treebank for the existing treebanking projects.

more. Among the most popular are the English Penn Treebank [129], the Prague Dependency
Treebank [39] and the German TIGER treebank [40].
Treebanking efforts for languages like English and Czech started in the last decade of the 20th century.
The interest in Indian language treebanking, however, started only recently with the development of a pilot tree-
bank for Hindi [14], which later culminated in a multi-layered and multi-representational treebanking
project for Hindi and Urdu [31]. The treebanking efforts have not been restricted to Hindi
and Urdu, though; treebanks have also been created for a number of other Indian languages,
including Bengali, Telugu and Tamil. The grammar formalism underlying the syntactic
representation of sentences in these treebanks is Computational Pān.inian Grammar. The formalism
motivates a dependency-based representation of sentence structure, which is important given that
dependency representations have been advocated for morphologically rich languages. In addition,
they offer a compact representation and explicitly encode predicate-argument structure. In what fol-
lows, we discuss the CPG formalism, the annotation scheme based on CPG and the annotation
procedure adopted for treebanking of Indian languages.

3.2 Computational Pān.inian Grammar


Computational Pān.inian Grammar (CPG) is a grammar formalism that takes concepts and insights
from Pān.inian Grammar (PG) for computational processing of a natural language text [20]. Pān.ini was
an Indian grammarian who is credited with writing a comprehensive grammar of Sanskrit. The under-
lying theory of his grammar provides a framework for the syntactico-semantic analysis of a sentence.
The grammar treats a sentence as a series of modified-modifier relations where one of the elements
(usually a verb) is the primary modified. This brings it close to a dependency analysis model as pro-
posed in Tesnière’s Dependency Grammar [182]. The application of this grammar for the automatic
syntactic analysis of Sanskrit texts can be found in the works of Kulkarni et al. [114], Kulkarni and
Ramakrishnamacharyulu [113], and Kulkarni [112].
The syntactico-semantic relations between lexical items provided by the Pān.inian grammatical model
can be split into two types2 :

• Kāraka: These are semantically related to a verb as the direct participants in the action denoted
by a verb root. The grammatical model has six ‘kārakas’, namely ‘kartā’ (the doer), ‘karma’
(the locus of action’s result in transitive sentences [20]), ‘karan.a’ (instrument), ‘sampradāna’
(recipient), ‘apādāna’ (source), and ‘adhikaran.a’ (location). These relations provide crucial
information about the main action stated in a sentence.

• Non-kāraka: These relations include reason, purpose, possession, adjectival or adverbial modi-
fications etc.
2 The complete set of dependency relation types can be found in [24].

The relations are marked through ‘vibhaktis’. The term ‘vibhakti’ can be approximately translated
as inflection, covering both nominal inflections (number, gender, person and case) and verbal inflections
(tense, aspect and modality (TAM)). The kāraka–vibhakti correspondence is not one to one: a kāraka
(in fact, any of the relations) may occur with different vibhaktis under different conditions. One of the
kārakas is expressed through agreement features; this can be either the ‘kartā’ or the ‘karma’. The noun
(kāraka) that agrees with the verb appears in the nominative case.
Since Sanskrit is typologically related to the Indian languages (Indo-Aryan ones in particular), the
Pān.inian grammatical model was a natural choice for the formal representation of these languages.
Initially, the model was applied and adapted for Hindi [14]. It was later extended to Urdu [26], Telugu
[192], Bengali and Kashmiri [29].

3.2.1 The Scheme


As mentioned above, the theoretical model chosen for the dependency annotation of Hindi and
other Indian languages was primarily designed for Sanskrit. Applying it to Hindi and other modern
Indian languages was thus not straightforward; the model needed some modifications to accommodate
the specific linguistic properties of these languages. The modifications related to Hindi are discussed
in [14].

3.2.1.1 Dependency Relations and Labels

The relations in the scheme are split into inter-chunk and intra-chunk relations. The inter-chunk
relations are represented in Figure 3.1; glosses and definitions of these relations are given in Table 3.1.
The purpose of choosing a hierarchical model for relation types was to allow certain relations to be
underspecified. Going to a finer level of granularity does not add much information at the syntactic
level and may lead to inconsistencies in annotation. For example, several verb-verb head-modifier
relations are annotated simply as ‘vmod’, as most of these relations are better interpreted at the
discourse level. Hence, it was decided to leave out the finer degrees of relation type for such cases at
the sentence level of annotation.

[Figure 3.1, a tree flattened by extraction: ‘mod’ at the root branches into ‘vmod’, ‘nmod’, ‘jjmod’ and
‘rbmod’; ‘vmod’ into ‘varg’ and ‘vad’; ‘nmod’ into ‘adj’, ‘r6’, ‘relc’, ‘rs’, etc.; ‘varg’ into the kāraka
labels k1, k2*, k3, k4*, k5, k7*; ‘vad’ into rt, rh, ras, adv, k*u, k*s, etc.; and the starred labels into
their subtypes, e.g. k2* into k2, k2p, k2g; k4* into k4, k4a; k7* into k7, k7p, k7t; ras into ras-k1,
ras-k2; k*u into k1u, k2u; k*s into k1s, k2s.]

Figure 3.1: Hierarchy of dependency labels.

Kāraka Meaning Non-Kāraka Meaning


k1 Agent/Subject/Doer rt Purpose
k2* Theme/Patient/goal rh Cause
k3 Instrument ras Associative
k4* Recipient/Experiencer r6 Genitives
k5 Source relc Modification by Relative Clause
k7* Spatio-temporal rs Noun Complements (Appositive)
k*u Comparative adv Verb modifier
k*s Noun/Adjective Complements adj Noun modifier

Table 3.1: Some major dependency relations depicted in Figure 3.1.

Apart from the relations provided in Figure 3.1, the scheme also has:

• Some inter-chunk labels which do not strictly express a ‘dependency’ relation. However, these
relations are included in the scheme to label the arcs which connect two nodes to complete a tree.
There are mainly three such relation labels: ccof, pof and fragof. ‘ccof’ occurs on an arc attaching
any node to a conjunct, ‘pof’ connects the parts of a multi-part single lexical unit (multi-word
expression) like a complex predicate, while ‘fragof’ is used for non-projecting words separated
from their heads. In quantifier floating constructions, for instance, the floating quantifier is
treated as a frag(ment)of of the quantified expression.

• A few intra-chunk relations, such as ‘nmod adj’ and ‘jjmod intf’, which are automatically annotated
at a later stage of treebank development.

The following three examples from Hindi-Urdu illustrate the concepts discussed so far.

(2) Atif kitāb paRhegā .

Atif book read-Fut.3MSg .
‘Atif will read a/the book.’
(3) darvāzā kal khulegā .


door tomorrow open-Fut.3MSg .
‘The door will open tomorrow .’

(4) Atif soyegā .


Atif sleep-Fut.3MSg .
‘Atif will sleep .’

[Figure: three dependency trees — paRhegā ‘read’ with arcs k1 → Atif and k2 → kitāb ‘book’;
khulegā ‘open’ with arcs k1 → darvāzā ‘door’ and k7t → kal ‘tomorrow’; soyegā ‘sleep’ with arc
k1 → Atif.]

Figure 3.2: Dependency trees of examples (2)-(4).

The DS trees for Examples (2)-(4) are shown in Figure 3.2. In Example (2), the transitive verb
‘read’ heads the sentence with its two dependents marked, in the Pān.inian framework, as k1 (kartā,
‘doer’) and k2 (karma, approximately ‘patient’). Example (3) is an inchoative construction; the
participants of the action ‘open’ are marked as per the scheme with ‘door’ as k1 (kartā, ‘doer’) and
‘tomorrow’ as k7t (kālādhikaran.a, time). As per the theory, the argument of an inchoative or unaccusative
(intransitive) verb is the ‘kartā’ of the action denoted by the verb. Similarly, in Example (4), the single
participant, ‘Atif’, of the intransitive (unergative) verb ‘sleep’ is the kartā of the action denoted by the
verb.

3.2.2 Annotation Procedure


The DS in the Hindi and Urdu treebanks is built on top of morphologically analyzed, POS-tagged
and chunked data. These pre-DS analyses are first produced by state-of-the-art tools,3 built in-house,
and are then followed by human post-editing. After human validation, the DS annotation takes place in
two stages: (a) inter-chunk and (b) intra-chunk. In the first stage, dependencies are marked manually
between chunks4 (approximations of constituents or phrases), while in the second stage, dependencies
between words in a chunk are marked automatically. This two-stage annotation process is illustrated
in Figures 3.3 and 3.4 for Example 5. As shown in Figure 3.3, in the first stage, dependencies are
marked between chunks without specifying the relations between words in a chunk. In the second stage,
dependencies are expanded and expressed at word level using a semi-automatic procedure, as shown in
Figure 3.4.

3 The toolkits can be downloaded from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php


4 This way DS trees look like phrase structure trees with the non-terminals being chunk/phrase types.

(5) guzashtah hafte bahut hī tez hawāom se kafī sāre darakhat jaR se ukhaR gaye .
last week very Emp fast wind by lot all trees roots from pull go .
‘Last week a lot of trees were uprooted due to strong winds.’

[Figure: inter-chunk dependency arcs over the chunks of Example 5 — the finite verb chunk ukhaR gaye
(VGF) is the root, with k7p to guzashtah hafte (NP), rh to bahut hī tez hawāom se (NP), k1 to kafī
sāre darakhat (NP), k5 to jaR se (NP), and rsym to the final punctuation (BLK).]
Figure 3.3: Dependency tree showing inter-chunk dependencies for Example 5.

[Figure: word-level dependency tree for Example 5 — the inter-chunk arcs of Figure 3.3 (k7p, rh, k1,
k5, rsym) now hold between the chunk head words, with additional intra-chunk arcs such as adj, intf,
lwg rp and lwg psp, and POS tags (jj, nn, intf, rp, psp, vm, vaux, sym) below each word.]
Figure 3.4: Dependency tree showing word-level dependencies for Example 5.

Such a two-stage strategy has been adopted to reduce the time-consuming manual labor which is
the hallmark of syntactic annotation. The treebank statistics in Table 3.2 clearly indicate that the manual
annotation effort in these treebanks is roughly halved (compare the chunk counts with the token counts)
by following this strategy. The automatic expansion of intra-chunk dependencies is motivated by the
fact that these dependencies are highly predictable and can be marked deterministically once the chunk
boundaries are specified. Since in the pre-DS annotations words are grouped into chunks of appropriate
type, intra-chunk word dependencies can be specified with high precision by merely identifying the
head word.
To complete the annotations in these treebanks by expressing dependencies at the word level instead
of chunk level, a highly accurate chunk expander is required. Next we discuss the specifics of the chunk
expander used to express dependencies at word level in the Hindi and Urdu treebanks.

3.2.2.1 Intra-Chunk Expansion

The first attempt at expanding intra-chunk dependencies in Indian language treebanks was by Kosaraju
et al. [106]. They defined the annotation guidelines and the expansion procedure to automatically anno-
tate the intra-chunk dependencies in the Hindi treebank. The expansion mainly relies on the computation
of the head word in a chunk. Once the head word is identified, other words are attached to it with appro-
priate labels. Additional rules are specified to handle cases where dependencies involve words excluding
the head word. Instead of working with the same procedure for expanding the intra-chunk dependencies
in the Urdu treebank, we propose an alternative procedure based on a well-known linear-time parsing

algorithm of Nivre [146]. The algorithm follows the shift-reduce parsing strategy which we already
discussed in Chapter 2. It uses a context-free grammar as its oracle to predict the optimal transition
given the state of the buffer and the stack. In our context-free grammar, production rules5 are of the
form X → Y, where X and Y are the gold POS tags of the head and dependent words, as shown in the
toy grammar in Figure 3.5 (above-left). Unlike Kosaraju et al. [106], our expander does not need prior
computation of the head word of a chunk; rather, it implicitly derives it from the context-free production rules.
The expander is initialized with an empty stack and the tokens of the chunk in the buffer. We do
not use an artificial ROOT node, since the head of the chunk has to be attached to the rest of the tree
via some token in the sentence rather than the artificial ROOT. The oracle predicts either a LEFT-ARC
or a RIGHT-ARC if there is a production rule relating the top node of the stack S (S0 ) and the first node
of the buffer B (B0 ). The ambiguity between the REDUCE and SHIFT transitions is resolved based on
the presence or absence of a dependency relation between a node below S0 in the stack and B0 : B0 is
pushed onto the stack if no such dependency exists; otherwise S0 is popped from the stack. Finally,
the parser terminates with the head node in the stack. Given the toy grammar in Figure 3.5 (above-left),
a dependency tree is generated from the transition sequence as shown in Figure 3.5 (above-right and
below, respectively). The chunk used in the figure is the second NP in the dependency tree of Example
5.
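The expansion procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the released expander: the rule format (a map from head/dependent POS pairs to labels, mirroring the toy grammar of Figure 3.5) and the simplified REDUCE condition (pop S0 once it has received a head) are assumptions made for the sketch.

```python
# Toy grammar in the spirit of Figure 3.5: (head POS, dependent POS) -> label.
GRAMMAR = {
    ("NN", "JJ"): "nmod_adj", ("NN", "PSP"): "lwg_psp",
    ("JJ", "INTF"): "lwg_intf", ("INTF", "RP"): "lwg_rp",
}

def expand_chunk(tags):
    """Arc-eager expansion of one chunk: returns (head index, arcs)."""
    stack, buf, arcs = [], list(range(len(tags))), []
    has_head = [False] * len(tags)
    while buf:
        b0 = buf[0]
        if stack and not has_head[stack[-1]] and (tags[b0], tags[stack[-1]]) in GRAMMAR:
            s0 = stack.pop()                       # LEFT-ARC: S0 depends on B0
            arcs.append((b0, GRAMMAR[(tags[b0], tags[s0])], s0))
            has_head[s0] = True
        elif stack and (tags[stack[-1]], tags[b0]) in GRAMMAR:
            s0 = stack[-1]                         # RIGHT-ARC: B0 depends on S0
            arcs.append((s0, GRAMMAR[(tags[s0], tags[b0])], b0))
            has_head[b0] = True
            stack.append(buf.pop(0))               # arc-eager pushes B0
        elif stack and has_head[stack[-1]]:
            stack.pop()                            # REDUCE: S0 already attached
        else:
            stack.append(buf.pop(0))               # SHIFT
    while len(stack) > 1 and has_head[stack[-1]]:
        stack.pop()                                # final REDUCEs
    return stack[0], arcs
```

On the chunk of Figure 3.5 (POS sequence INTF RP JJ NN PSP for "bohat hī tez hawāom se"), the sketch reproduces the transition sequence shown there and returns the noun (index 3, hawāom) as the chunk head.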
Toy grammar (above-left):
NN → JJ | PSP
JJ → INTF
INTF → RP

Desired tree (above-right), for the chunk “... bohat hī tez hawāom se ...”: hawāom is the head, with
arcs lwg rp(bohat, hī), lwg intf(tez, bohat), nmod adj(hawāom, tez) and lwg psp(hawāom, se).

Transition             Stack         Buffer                     A
—                      []            [bohat hī tez hawāom se]   ∅
SHIFT                  [bohat]       [hī tez hawāom se]
RIGHT-ARC(lwg rp)      [bohat hī]    [tez hawāom se]            A ∪ lwg rp(bohat, hī)
REDUCE                 [bohat]       [tez hawāom se]
LEFT-ARC(lwg intf)     []            [tez hawāom se]            A ∪ lwg intf(tez, bohat)
SHIFT                  [tez]         [hawāom se]
LEFT-ARC(nmod adj)     []            [hawāom se]                A ∪ nmod adj(hawāom, tez)
SHIFT                  [hawāom]      [se]
RIGHT-ARC(lwg psp)     [hawāom se]   []                         A ∪ lwg psp(hawāom, se)
REDUCE                 [hawāom]      []

Figure 3.5: An example of transition-based chunk expansion. Above left: a context-free grammar,
above right: a desired dependency tree, bottom: a transition sequence of the arc-eager system.
5 Each rule in the grammar is associated with an appropriate dependency label.

We used the annotation guidelines of Kosaraju et al. and our knowledge of Hindi and Urdu grammar
to write the context-free production rules for both languages.6 To achieve complete coverage of
our context-free grammar on the Hindi and Urdu treebanks, we ran the expander multiple times (∼20
iterations) on the full treebank data (excluding the evaluation set). The grammar was updated whenever
the expander generated a forest for a chunk instead of a fully connected tree.7 The expander converged
with a grammar8 containing around 203 labeled production rules: 107 for head-final and 96 for
head-initial dependencies.
To ensure the quality of expanded dependencies, we evaluated our expander on 500 sentences from
the Hindi and Urdu treebanks. The evaluation sets were manually annotated with intra-chunk depen-
dency structures by the same annotators who annotated the respective treebanks. Our expander per-
formed with a labeled attachment score (LAS) of 99.7% and 99.35% on both evaluation sets while the
accuracy of the rule-based system of Kosaraju et al. is 97.8% and 96.32% respectively.9 Our expander
is not only resource light but also more accurate. There are two major problems in the procedure of
Kosaraju et al. [106]. Firstly it suffers from the problem of error propagation; if head computation goes
wrong the chunk dependencies will also be wrongly annotated. Secondly it does not cover all the rules
for expansion and directly attaches a word to the head word of a chunk. In our case, error propagation
is not an issue since we do not apply prior head computation. However, coverage is a major issue faced
by all the grammar-driven approaches. In our case, our system will fail to generate a fully connected
tree for a chunk if the necessary production rules are missing in the grammar. It implies that there is a
natural way to know when to update the grammar, while Kosaraju et al. use a fall back strategy in such
cases.
Due to the aforementioned problems in the previous approach for chunk expansion, we expanded
the Hindi treebank afresh with our chunk expander. All the experiments in this thesis are conducted on
the expanded versions of Hindi and Urdu treebanks. The statistics of the two treebanks are provided in
Table (3.2).
We split the treebank data in a ratio of 80:10:10 for training, testing and tuning the parsers. For
both treebanks, the internal structure of the annotation files is preserved; however, we randomly
distribute the files across the training, testing and development sets. Each document or annotation file
mainly contains newswire articles.
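The file-level split described above can be sketched as follows. The 80:10:10 proportions are from the text, while the function name and the fixed seed are illustrative assumptions; splitting at the file level keeps each document intact within one set.

```python
import random

def split_files(files, seed=0):
    """Randomly distribute annotation files 80:10:10 across
    training, testing and development sets, keeping files intact."""
    files = list(files)
    random.Random(seed).shuffle(files)   # reproducible shuffle
    n = len(files)
    n_train = int(0.8 * n)
    n_test = int(0.1 * n)
    train = files[:n_train]
    test = files[n_train:n_train + n_test]
    dev = files[n_train + n_test:]
    return train, test, dev
```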
6 Our grammar-driven expander for Hindi and Urdu is available at https://github.com/riyazbhat/Shift-Reduce-Chunk-Expander.
7 It should be noted that the expander is built with the purpose of completing the annotation process; it is never intended
to be used as a text processing tool like a parser. That is why we aim at full coverage on the treebank data rather than on an
unseen data set.
8 The complete context-free grammar can be downloaded from https://github.com/ltrc/Shift-Reduce-Chunk-Expander/blob/master/grammar/grammar.json.
9 LAS is relatively low on the Urdu test set when expanded using the system of Kosaraju et al. The reason seems to be the
lack of expansion rules peculiar to Urdu, since the expander was originally designed for Hindi; it can only handle intra-chunk
word dependencies which are similar in Hindi and Urdu. The small loss of ∼0.5% LAS by our system is due to irregularities
in both treebanks which are hard to generalize, such as foreign words (POS-tagged as “UNK”) modified by local vocabulary words.

Hindi Urdu
Count of
Training Testing Development Training Testing Development
Tokens 3,47,744 43,556 43,556 1,53,317 19,065 19,065
Chunks 1,87,029 23,418 23,417 72,319 9,010 9,010
Sentences 16,629 2,077 2,077 5,432 677 677

Table 3.2: Statistics on training, testing and development sets used in all the experiments reported
in this thesis.

3.3 Summary
In this chapter, we discussed the annotation procedure and the grammar formalism used to build
dependency treebanks for Indian languages. Since the manual annotation is restricted to inter-chunk
dependencies, we proposed and evaluated a new grammar-driven algorithm to automatically annotate
dependency structures within chunks.

Chapter 4

Improving Dependency Parsing of Hindi and Urdu by Modeling Syntactically Relevant Phenomena

In recent years, transition-based parsers have shown promise in terms of efficiency and accuracy.
Though these parsers have been extensively explored for multiple Indian languages, there is still con-
siderable scope for improvement by properly incorporating syntactically relevant information. In this
chapter, we enhance transition-based parsing of Hindi and Urdu by redefining the features and feature
extraction procedures that have been previously proposed in the parsing literature of Indian languages.
We propose and empirically show that properly incorporating syntactically relevant information like
case marking, complex predication and grammatical agreement in an arc-eager parsing model can sig-
nificantly improve parsing accuracy. Our experiments show an absolute improvement of ∼2% LAS for
parsing of both Hindi and Urdu over a competitive baseline which uses rich features like part-of-speech
(POS) tags, chunk tags, cluster ids and lemmas. We also propose some heuristics to identify ezafe
constructions in Urdu texts which show promising results in parsing these constructions.

4.1 Introduction
In the last few years, the availability of treebanks has facilitated research in data-driven dependency
parsing in multiple Indian languages (ILs). Most of the work on parsing Indian languages has tried to
identify the features that are relevant and important for reasonably accurate parsing of these languages,
mostly using the graph-based and transition-based parsing frameworks. The features that have been tried
and explored are morpho-syntactic features like lemmas, POS tags, chunk tags, case information and
grammatical agreement [10, 28, 87, 88]; semantic features like animacy [9]; wordnet hierarchies [93];
numeric features like word embeddings [11] and supertags [11]. In this chapter, we revisit some of these
features and feature interactions and redefine them for transition-based parsing of Hindi and Urdu. We
show that IL-specific feature models are too shallow to capture the necessary syntactic phenomena of
these languages. On the other hand, state-of-the-art feature models used in transition-based systems,
despite being rich, do not model the relevant syntactic phenomena of ILs at all. To address this deficiency

in these feature models, we define syntactically relevant features and feature interactions over a rich
history of parse decisions in a transition-based parser and empirically show their profound impact on
the accuracy of Hindi and Urdu parsing.
Case marking, grammatical agreement and complex predication are essential components of Hindi
and Urdu grammar; modeling these phenomena properly should therefore improve parsing of these
languages. We show that existing feature models do not capture these phenomena efficiently and
demonstrate how they can be modeled over a rich history of parsing decisions in a transition-based
parsing system. We use the history of our transition-based parser to extract and propagate case
information, unlike previous approaches where the information is copied in a preprocessing step. Our
results show that using case markers as a second-order feature in Hindi and Urdu improves parsing and
additionally removes the parser's dependence on preprocessing tools. Similarly, we model complex
predicates as single lexical units and empirically show that, instead of treating their components (host
nominal and light verb) as separate, independent lexical units, such a composite treatment facilitates
better modeling of their argument structure. Likewise, we incorporate grammatical agreement into the
model and introduce heuristics for ezafe identification in Urdu, achieving a significant boost in parsing
performance.
Furthermore, while Hindi dependency parsing has been extensively studied, there are very few stud-
ies on dependency parsing of Urdu [7, 28]. The work reported in this chapter is a step towards filling
this gap. In addition to modeling syntactic phenomena for better parsing, this work would be the first
exhaustive study on Urdu dependency parsing.
The remainder of the chapter is organized as follows. In §4.2, we show how to incorporate rich
syntactic features into our parsing model. In §4.3, we discuss the experiments and results based on the
proposed feature modeling. In §4.4, we discuss related work that also models morphological/syntactic
features for dependency parsing. We conclude the chapter with possible future directions in §4.5.

4.2 Transition-based Parsing with Rich Syntactic Features


Transition-based parsers offer a rich history of parsing decisions which can be used to define higher-
order Markov features. More importantly, modeling higher-order features in transition-based parsers
can be achieved without compromising their efficiency. In natural language processing (NLP), higher-
order features are essential to capture structural dependencies exhibited by linguistic units such as POS
tags in a tag sequence or treelets in a syntactic tree. In parsing, the importance of these features has
been shown in both graph-based and transition-based methods [103, 203]. For transition-based parsers
in particular, Zhang and Nivre [203] have shown very significant improvements in accuracy using rich
non-local features defined over previous parsing decisions of the parser.
Syntactic features related to case marking and complex predication have been found to be very
useful in previous work on dependency parsing of Hindi texts [10, 15, 87, 88]. In this section, we
show that these works have not efficiently captured the interactions between different syntactic
phenomena and have failed to properly incorporate them into the parsing model. With respect to a
transition-based parsing system, we demonstrate how syntactic phenomena such as case marking,
complex predication and grammatical agreement, and their interactions, can be properly modeled at
the syntactic level for better parsing of Hindi and Urdu. As these phenomena are an essential part of
Hindi and Urdu grammar, modeling them properly should improve parsing of these languages. Case
marking can play a substantial role in improving the label score (LS) of a parser: a nominal marked
with case can only take a subset of relations, thereby reducing the search space of possible labels.
Similarly, addressing complex predication and grammatical agreement can be of great help. Treating
complex predicates as composite units during parsing may improve the performance of the parser,
whereas treating the host and light verb of a complex predicate individually, independent of each
other, may prevent the parser from correctly identifying the arguments related to it. Grammatical
agreement, on the other hand, may help to parse dislocated genitives and non-oblique verb arguments
more accurately. Incorporating these phenomena correctly can improve the parsing of the core
arguments of a verb, as they are directly or indirectly related to verb argument structure. Moreover,
we also propose some heuristics to identify ezafe constructions in Urdu. Ezafe is an enclitic short
vowel ‘e’ which joins two nouns, a noun and an adjective, or an adposition and a noun into a
possessive relationship. We show that ezafe constructions can be identified even without an explicit
morphological cue.

4.2.1 Baseline
To put our results and findings into perspective, we first setup a baseline using the state-of-the-art
features used in transition-based parsing with a linear model1 and discuss the results based on it. We
use the feature template of Zhang and Nivre [203] and include other relevant features like chunk tags,
lemma of a word and its cluster id to setup an improved and challenging baseline. Cluster features are
incorporated into the feature template similarly to Täckström et al. [181], while chunk tags and lemmas
are incorporated similarly to POS tags and words respectively. In addition to the state-of-the-art feature
model, we also show comparison with the feature model(s) defined for Hindi and other Indian languages
in works like [10, 105, 150]. The extended feature template of Zhang and Nivre (ZN) is shown in Table
4.1, while the feature model defined for Indian languages is shown in Table 4.2.
1 In Table 4.1, base features are combined to create more features to capture non-linearities in the data. Feature combination
is essential in our case as our underlying classification model is linear.

Single words S0 wp; S0 t; S0 w; S0 r; S0 rp; S0 rt; S0 c; S0 cp; S0 p; S0 t; B0 wp; B0 w; B0 wt; B0 r;
B0 rp; B0 rt; B0 c; B0 cp; B0 p; B0 t; B1 wp; B1 wt; B1 w; B1 r; B1 rp; B1 rt; B1 c;
B1 cp; B1 p; B1 t; B2 wp; B2 wt; B2 w; B2 r; B2 rp; B2 rt; B2 c; B2 cp; B2 p; B2 t;
Word pairs S0 wpB0 wp; S0 wtB0 wt; S0 rB0 r; S0 rpB0 rp; S0 rB0 rp; S0 rpB0 r; S0 rpB0 p;
S0 pB0 rp; S0 rtB0 rt; S0 rB0 rt; S0 rtB0 r; S0 rtB0 t; S0 tB0 rt; S0 cB0 c; S0 cpB0 p;
S0 cpB0 cp; S0 cB0 cp; S0 cpB0 c; S0 pB0 cp; S0 wpB0 w; S0 wtB0 w; S0 wB0 r;
S0 rB0 w; S0 wB0 c; S0 cB0 w; S0 wB0 wp; S0 wpB0 p; S0 pB0 wp; S0 wB0 wt; S0 wtB0 t;
S0 tB0 wt; S0 wB0 w; S0 pB0 p; B0 pB1 p; S0 tB0 t; B0 tB1 t; B0 rB1 r; B1 rB2 r; B0 cB1 c;
B1 cB2 c;
Word triplets B0 pB1 pB2 p; B0 tB1 tB2 t; B0 rB1 rB2 r; B0 cB1 cB2 c; S0 pB0 pB1 p; S0 tB0 tB1 t;
S0 rB0 rB1 r; S0 cB0 cB1 c; S0h pS0 pB0 p; S0h tS0 tB0 t; S0h rS0 rB0 r; S0h cS0 cB0 c;
S0 pS0l pB0 p; S0 tS0l tB0 t; S0 rS0l rB0 r; S0 cS0l cB0 c; S0 pS0r pB0 p; S0 tS0r tB0 t;
S0 rS0r rB0 r; S0 cS0r cB0 c; S0 pB0 pB0l p; S0 tB0 tB0l t; S0 rB0 rB0l r; S0 cB0 cB0l c
Distance S0 wd; S0 rd; S0 cd; S0 pd; S0 td; B0 wd; B0 rd; B0 cd; B0 pd; B0 td; S0 wB0 wd;
S0 rB0 rd; S0 cB0 cd; S0 pB0 pd; S0 tB0 td
Valency S0 wvr ; S0 rvr ; S0 cvr ; S0 pvr ; S0 tvr ; S0 wvl ; S0 rvl ; S0 cvl ; S0 pvl ; S0 tvl ; B0 wvl ;
B0 rvl ; B0 cvl ; B0 pvl ; B0 tvl
Unigrams S0h w; S0h r; S0h c; S0h p; S0h t; S0 l; S0l w; S0l r; S0l c; S0l p; S0l t; S0l l; S0r w; S0r r;
S0r c; S0r p; S0r t; S0r l; B0l w; B0l r; B0l c; B0l p; B0l t; B0l l
Third Order S0h2 w; S0h2 r; S0h2 c; S0h2 p; S0h2 t; S0h l; S0l2 w; S0l2 r; S0l2 c; S0l2 p; S0l2 t; S0l2 l;
S0r2 w; S0r2 r; S0r2 c; S0r2 p; S0r2 t; S0r2 l; B0l2 w; B0l2 c; B0l2 p; B0l2 t; B0l2 l;
S0 pS0l pS0l2 p; S0 pS0r pS0r2 p; S0 tS0l tS0l2 t; S0 tS0r tS0r2 t
Labels S0 wsr ; S0 rsr ; S0 csr ; S0 psr ; S0 tsr ; S0 wsl ; S0 rsl ; S0 csl ; S0 psl ; S0 tsl ; B0 wsl ;
B0 rsl ; B0 csl ; B0 psl ; B0 tsl

Table 4.1: Feature template for setting the baseline for parsing Hindi and Urdu. The template
extends the feature template defined in Zhang and Nivre. w denotes word, p denotes POS tag, d
denotes distance, v denotes valency and S|Bil , S|Bir denote the left and rightmost children of S|Bi , c
denotes cluster, r denotes root of a word and t denotes chunk tag.

Furthermore, to parse in realistic settings, we report parsing results using predicted POS tags and
chunk tags. We train and test the classifiers on the same training and testing sets that are used for
parsing. We use Collins' structured perceptron [51] with second-order structural features for both POS
tagging and chunking, and the second-order Viterbi algorithm for inference. For POS tagging we also
use morphological features like affixes (up to length 4), the length of a word and its cluster id (bit
strings of length 10), while for chunking we additionally use POS tags in a window of 5 words. All
these features were tuned on the development sets of both Hindi and Urdu. The results of both POS
tagging and chunking are reported in Table 4.3. For lemmatization, and to obtain morphological
features like gender, number and person, we use the paradigm-based morphological analyzers of both
languages [21, chapter 3].
The morphological analyzers are built in-house and have high coverage on the treebank data.2 These
analyzers provide multiple analyses per word. We use POS-based heuristics and the linear model of
Malladi and Mannem [126] to prune morphological analyses that are irrelevant to a word in a given
context. The POS heuristics help in selecting the syntactically relevant lemma for a word, while the
linear model is used to select the best gender-number feature pair: among the multiple gender-number
feature pairs corresponding to a selected lemma, we select the one that receives the highest score from
the model. Finally, cluster ids are generated using Brown's clustering algorithm [41].
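Brown cluster ids are hierarchical bit strings, and a common way to use them as features (following Täckström et al.) is to emit the full path along with shorter prefixes of it, which act as coarser word classes. The sketch below illustrates this; the toy word-to-cluster map and the prefix lengths are assumptions for illustration only.

```python
# Emit Brown-cluster features for a word: the full bit-string path plus
# shorter prefixes, which serve as coarser, hierarchical word classes.
# The cluster map below is a toy example, not the real clustering output.

CLUSTERS = {"kitab": "110100101100", "darvaza": "110100111010"}

def cluster_features(word, prefix_lengths=(4, 6, 12)):
    bits = CLUSTERS.get(word)
    if bits is None:
        return []                       # unknown word: no cluster feature
    return ["c%d=%s" % (p, bits[:p]) for p in prefix_lengths]
```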

Single words S0 w; S0 r; S0 p; B0 w; B0 r; B0 p; B1 w; B1 p; B2 p; B0l w; B0l p; S0l p; S0r p; S0 t;


B0 t; S0r t; S0 agr; B0 agr; S0r agr; S0 c; B0 c; S0r c;
Word pairs S0 pB0 p; S0 fB0 f;
Unigrams S0 l;
Third Order S0r2 l;
Distance S0 B0 d;
Valency B0 vl ;

Table 4.2: Feature template used in previous works on Parsing Hindi and other Indian Languages.
c denotes case and tense, aspect and modal auxiliaries (TAM); agr denotes agreement features; r
denotes lemma of the corresponding tree node; underscripted l and r denote the left and right child
of the corresponding tree nodes, t denotes chunk tag and f represents conjoined gender, number,
case, tense, aspect, modality and chunk tag features.

Language   Morph. Analysis        POS tagging     Chunking
           Lemma (%)   GN (%)     Accuracy (%)    Gold (%)   Predicted (%)
Urdu       88.12       89.87      92.92           97.70      95.25
Hindi      90.65       93.46      96.02           98.87      97.50

Table 4.3: Performance of the morphological analyzers, POS taggers and chunkers on the Hindi and
Urdu test sets. Punctuation is excluded in the evaluation of the morphological analyzers. GN stands
for the gender and number morphological features; Gold and Predicted refer to chunking accuracy
with gold and predicted POS tags.

In Tables 4.4 and 4.5 baseline accuracies are reported based on the feature template defined in Table
4.1. The accuracies are high for both Hindi and Urdu3 . One reason for that is the abundance of local
2 http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php
3 In comparison with the Hindi parser, the Urdu parser is less accurate which could be due to smaller training data used for
training the parser. However, a Hindi parser trained on 5,432 sentences randomly selected from the Hindi training data still
performs better than the Urdu parser trained on the same number of sentences. Its accuracy in terms of UAS, LS and LAS
is 92.09%, 89.04% and 86.10% respectively. Given these figures, it seems Urdu structures are harder to parse than the Hindi
structures.

intra-chunk structures in the treebanks. These structures are easier to parse, and transition-based systems in particular are good at predicting local dependencies. Ambati et al. [9] observed that chunk tag information is useful for root prediction in Hindi. In the treebanks, chunk tags carry finiteness information about verbs, which proves helpful in root prediction since finite verbs are mainly clause and/or sentence heads. The chunk information leads to improvements in root prediction in both Hindi and Urdu parsing by around 6% LAS over a simple POS-based baseline. Implicitly, chunk tags also capture information about chunk boundaries, which is important for intra-chunk dependencies. This chunk information helped to predict intra-chunk relations better; in particular, nmod adj improved by 10% and 15% LS in Hindi and Urdu respectively. Clusters also proved useful; we experimented with cluster ids of varied lengths and found cluster ids of length 12 to be optimal on the Hindi and Urdu development sets. However, the improvements for Urdu are lower than for Hindi. This could be due to the fact that the Urdu monolingual data that we use to learn the clusters is smaller in size: it contains 5M sentences, while the Hindi monolingual data has 8M sentences. Finally, to gauge the individual importance of each feature type, we performed ablation experiments. The results are reported in Table 4.6.
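As a concrete illustration of the cluster-id feature: a word's Brown cluster is a bit-string path in the cluster hierarchy, and a fixed-length cluster id is simply a prefix of that path. A minimal sketch, with an illustrative word-to-path mapping (not the actual induced clusters):

```python
# Sketch: deriving fixed-length cluster ids from Brown cluster bit-strings.
# Assumes clusters are stored as {word: bit-string path}, as produced by
# standard Brown clustering tools; 12 is the prefix length tuned above.

def cluster_id(word, clusters, prefix_len=12):
    """Return the bit-string prefix used as a cluster feature, or None."""
    path = clusters.get(word)
    if path is None:
        return None
    return path[:prefix_len]

# toy cluster paths (illustrative)
clusters = {"kitab": "110100101101001", "kahani": "110100101100110"}
cluster_id("kitab", clusters)  # '110100101101'
```

Shorter prefixes give coarser clusters; the same mapping thus yields features at several granularities from a single clustering run.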

Feature | Development UAS | LS | LAS | Test UAS | LS | LAS
POS | 92.82 | 89.50 | 86.89 | 92.77 | 89.51 | 86.79
+Chunk | 93.38 (+0.56) | 89.69 (+0.19) | 87.45 (+0.56) | 93.31 (+0.54) | 89.56 (+0.05) | 87.42 (+0.63)
+Lemma | 93.39 (+0.01) | 89.95 (+0.26) | 87.61 (+0.16) | 93.24 (−0.07) | 89.73 (+0.17) | 87.52 (+0.10)
+Clusters | 93.51 (+0.12) | 90.21 (+0.26) | 87.82 (+0.21) | 93.52 (+0.28) | 90.01 (+0.28) | 87.77 (+0.25)

Table 4.4: Baseline parsing accuracies of Hindi using the ZN feature template extended with chunk, lemma and cluster features. Parenthesized values are improvements over the preceding row.

Feature | Development UAS | LS | LAS | Test UAS | LS | LAS
POS | 88.19 | 84.44 | 80.60 | 88.14 | 84.41 | 80.57
+Chunk | 88.55 (+0.36) | 84.70 (+0.26) | 81.01 (+0.41) | 88.46 (+0.32) | 84.59 (+0.18) | 81.00 (+0.43)
+Lemma | 88.70 (+0.15) | 84.80 (+0.10) | 81.13 (+0.12) | 88.69 (+0.23) | 84.77 (+0.18) | 81.11 (+0.11)
+Clusters | 88.81 (+0.11) | 84.88 (+0.08) | 81.34 (+0.21) | 88.77 (+0.08) | 84.84 (+0.07) | 81.19 (+0.08)

Table 4.5: Baseline parsing accuracies of Urdu using the ZN feature template extended with chunk, lemma and cluster features. Parenthesized values are improvements over the preceding row.

Feature | Hindi Leave-one-out | Hindi Only-one | Urdu Leave-one-out | Urdu Only-one
POS | 86.43 | 86.89 | 77.24 | 80.60
Chunk | 86.98 | 86.06 | 79.96 | 80.10
Lemma | 87.14 | 83.69 | 80.22 | 77.85
Clusters | 87.61 | 85.09 | 81.13 | 77.06

Table 4.6: Accuracies (LAS) for Leave-one-out and Only-one evaluation.

4.2.2 Modeling Syntactically Relevant Features


Having discussed the baseline features and the corresponding results, we now discuss ways to model the syntactic phenomena of Hindi and Urdu in our parsing model. The empirical results that show the impact of modeling these phenomena are discussed in §4.3.

4.2.2.1 Case Marking

Case markers express binary relations between entities and events and have been observed to be highly ambiguous. From a computational perspective, their relational function makes them attractive for the identification and extraction of entity relations in a text; however, their highly ambiguous nature overshadows their usability. The role of case markers in parsing Indian languages has been discussed in works like [9, 10, 28], but their sense disambiguation has not been addressed yet. In this section, we discuss the role of case clitics in parsing Hindi and Urdu and also address their sense disambiguation and its subsequent impact on parsing.
In Hindi and Urdu, adpositions follow their complements, which is the typical behavior of lexical and functional heads in a head-final language. Mohanan [138] classified Hindi adpositions into two main categories based on their form and syntactic distribution: clitic-postpositions (case clitics) and nonclitic-postpositions (postpositions) (see also [42]). While case clitics are single words (clitics), postpositions are composite words marked with an oblique form of the genitive ‘kā’. In Hindi and Urdu, case clitics can mark grammatical functions like subject and object, correspond to predicate roles like agent and patient, or carry semantic information like animacy and definiteness [138]. When they correspond to predicate roles or grammatical functions, case clitics do not unambiguously map to a single predicate/grammatical role (henceforth case role); a case clitic can mark different case roles in different contexts. In this section, we discuss how to identify the role that a case clitic maps to in a given context. The roles marked by these case clitics correspond to the relations that a nominal bears with its head in the Hindi and Urdu treebanks. These relations are described in Table 4.7.

Clitic4 | Case | Function | Frequency (Hindi, Urdu) | Entropy (Hindi, Urdu)
se | instrumental / ablative | mk1 ‘causee’, adv ‘manner’, k2 ‘theme’, k5 ‘source’, k2u ‘comparison’, k3 ‘instrument’, k4 ‘recipient’, k7t ‘temporal’, rh ‘cause’, ras-k* ‘comitative’ | 5,645, 3,046 | 3.767, 3.972
tak | locative | k7p ‘spatial’, k7t ‘temporal’, k7 ‘abstract locative’ | 796, 305 | 2.131, 2.580
ko | accusative / dative | k1 ‘obligational subject’, jk1 ‘affected agent’, k2 ‘patient/theme’, k2g ‘goal’, k4 ‘recipient’, k4a ‘experiencer’, k7t ‘temporal’ | 7,110, 2,766 | 2.071, 2.051
mẽ | locative | k7p ‘spatial’, k7t ‘temporal’, k7 ‘abstract locative’ | 10,129, 4,452 | 1.662, 1.888
par | locative | k7p ‘spatial’, k7t ‘temporal’, k7 ‘abstract locative’ | 3,966, 1,997 | 1.418, 1.758

Table 4.7: Description of case clitics and their annotations in the Hindi and Urdu treebanks. Boldfaced relations are the most frequent ones.

There are seven case clitics in Hindi and Urdu, namely ne ‘Ergative’, kā ‘Genitive’ (of), ko ‘Dative/Accusative’, se ‘Instrumental’ (by/with/from/through), mẽ ‘Locative’ (in), par ‘Locative’ (on), and tak ‘Locative’ (to). Among these, ne and kā are excluded from disambiguation, since the former unambiguously marks a single case role (‘agent’), while for the latter we do not have labeled data for training.5 Therefore, we only focus on the other five case clitics. Throughout this chapter, we will use the terms case clitics and postpositions interchangeably and refer to the task of case clitic sense disambiguation as PSD (postposition sense disambiguation).
Hovy et al. [90] identified the complement and the governing head as the two most indispensable and effective features in their work on sense disambiguation of English prepositions. In a cross-linguistic study, Svenonius [180] identified the s-selectional constraint6 as one of the defining properties of adpositions. In Hindi and Urdu, ergative ‘ne’ s-selects nouns with the semantic property of agency, while locatives such as ‘mẽ’ s-select locational semantics (either temporal or spatial). Each case clitic s-selects according to its individual senses: locative ‘mẽ’, for example, will restrict its complement to be either a temporal or a spatial expression based on the sense prevalent in a given context. The s-selectional
4 cf.[42, 169] for detailed description and examples of the senses listed in this table.
5 The Hindi and Urdu treebanks do not recognize the different senses of genitive case and mark all its senses uniformly.
6 S-selection denotes the ability of predicates to determine the semantic content of their arguments.

property of a case clitic is thus a reflection of its sense, and modeling the sense disambiguation of case clitics with their complements is therefore linguistically justified. The case governor, on the other hand, complements the s-selectional property of an adposition in cases where the complements have the same semantic content. For example, for certain roles like experiencer subject/object, recipient/beneficiary and patient, dative/accusative ‘ko’ s-selects an animate complement. In such cases, it is the case governor, i.e. the verb, that helps discriminate the different senses. Consider Example 6: since the governing verb ‘yād āyī’ is a psychological predicate, the nominal marked by ‘ko’ (‘Maryam’) is necessarily its experiencer.

(6) Maryam-ko kahānī yād āyī .
    Maryam-Dat story remember come-Perf.3FSg .
    ‘Maryam remembered the story.’

Apart from the complement and the governing head, there are other features that are relevant for PSD of Hindi and Urdu case clitics but not for English prepositions, due to the different morphological nature of these languages. These features include the other case clitics in the sentence surrounding the case clitic considered for disambiguation, and the aspect of the governing verb. Consider the case of ‘ko’. If the arguments of a verb are separately marked by two clitics, one by dative/accusative ‘ko’ and another by ergative ‘ne’, the case role of the noun marked with ‘ko’ will most probably be the recipient or theme/patient (see Example 7). However, if the verb has no ‘ne’-marked argument, the ‘ko’-marked noun will probably be the experiencer or the obligational subject7 (see Example 8). Similarly, if a verb has two arguments, one marked by ‘ko’ and the other by ‘se’, the ‘se’-marked argument can rarely be the patient or theme in such a context. There is also a high correlation between the aspect of a governing verb and case marking: ergative case is licensed by a verb carrying perfective aspect, and dative ‘ko’ on an obligational subject is licensed in modal contexts [42]. Finally, to address the problem of data sparsity due to lexical features, we also use cluster ids from Brown clustering as a complementary feature.

(7) Shahid-ne Atif-ko kitāb dī .
    Shahid-Erg Atif-Dat book give-Perf.3FSg .
    ‘Shahid gave Atif a book.’

(8) Shahid-ko Kashmir jānā paRā .
    Shahid-Dat Kashmir go-Inf had to .
    ‘Shahid had to go to Kashmir.’

4.2.2.1.1 PSD Models We carried out a range of experiments to explore the effectiveness of the aforementioned lexical features in the disambiguation task. We also carried out experiments differentiating the impact of individual features using leave-one-out evaluation. We formulate the task of postposition
7 The ergative-marked noun may sometimes be dropped because of the pro-drop nature of both Hindi and Urdu. In such cases ‘ko’ cannot mark the obligational subject.

sense disambiguation as a typical supervised classification problem. To build the sense disambiguation models, we use support vector machines (SVMs) [57] as the supervised learning algorithm. The hyperparameters of the SVM are tuned separately for each case clitic on the development set using grid search over the parameter space. All the PSD experiments are conducted using the Scikit-Learn8 package [160]. We learn two types of models: a separate model for each case clitic, and a single model for all of them. The accuracies of both model types for Hindi and Urdu case clitics are reported in Table 4.8.
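A minimal sketch of one such per-clitic model, an SVM tuned by grid search as described above; the feature names, toy instances and sense labels are illustrative, not taken from the treebanks:

```python
# Sketch of a per-clitic PSD classifier: categorical features are one-hot
# encoded and an SVM is tuned by grid search. The toy data and feature
# names are illustrative, not from the Hindi/Urdu treebanks.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

train = [  # one instance per occurrence of the clitic 'ko'
    ({"complement": "Maryam", "head": "aayii", "other_clitic": "none"}, "k4a"),
    ({"complement": "Shahid", "head": "lagii", "other_clitic": "none"}, "k4a"),
    ({"complement": "Nadiya", "head": "huii", "other_clitic": "none"}, "k4a"),
    ({"complement": "Atif", "head": "dii", "other_clitic": "ne"}, "k4"),
    ({"complement": "Ram", "head": "diyaa", "other_clitic": "ne"}, "k4"),
    ({"complement": "Sita", "head": "bhejaa", "other_clitic": "ne"}, "k4"),
]
X, y = zip(*train)

model = Pipeline([("vec", DictVectorizer()), ("svm", SVC())])
grid = GridSearchCV(model, {"svm__C": [0.1, 1, 10]}, cv=2)
grid.fit(list(X), list(y))

pred = grid.predict([{"complement": "Atif", "head": "dii", "other_clitic": "ne"}])
```

In the experiments proper, one such model is trained per clitic (plus one joint model over all five), with the grid search run on the development set.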

Case Clitic | Hindi Baseline | Hindi SVM-PSD | Urdu Baseline | Urdu SVM-PSD
tak | 65.43 | 81.48 | 40.00 | 57.33
par | 77.43 | 88.50 | 75.63 | 82.16
ko | 48.50 | 86.09 | 57.95 | 76.68
se | 25.16 | 56.73 | 14.50 | 53.75
mẽ | 55.24 | 84.02 | 62.21 | 82.31
Single Model | 50.97 | 76.33 | 51.12 | 69.46

Table 4.8: Performance (in terms of accuracy) of SVM-based PSD models on sense disambiguation of Hindi and Urdu case clitics. Baseline accuracies are based on the most frequent sense for both single and individual models.

The baseline accuracies for comparison are set using the most frequent sense of a case clitic, for both model types. The single model turned out to be more accurate than the separate models; the obvious reason seems to be the increase in the number of labeled examples for relations that are shared across case clitics.
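The most-frequent-sense baseline can be computed directly from the annotated senses; the counts below are toy values, not treebank statistics:

```python
# Most-frequent-sense baseline: always predict the sense a clitic is most
# often annotated with in the training data. Sense lists are toy values.
from collections import Counter

def mfs_baseline(train_senses, test_senses):
    """Accuracy of predicting the training-set majority sense on test data."""
    majority = Counter(train_senses).most_common(1)[0][0]
    return sum(s == majority for s in test_senses) / len(test_senses)

train = ["k7p", "k7p", "k7t", "k7"]   # senses of 'par' in training (toy)
test = ["k7p", "k7t", "k7p", "k7p"]
mfs_baseline(train, test)  # 0.75
```

This is the standard sanity check for WSD-style tasks: a model is only useful to the extent that it beats this majority-sense predictor.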
The amount of ambiguity in a case clitic profoundly affects the performance of its individual disambiguation model. As reported in Table 4.8, the PSD models for case clitics of lower entropy are more accurate than those for clitics of higher entropy. In Table 4.9, the impact of individual features is quantified using leave-one-out evaluation. As already discussed, the most crucial feature is the complement of a case clitic, since there is a high correlation between the sense of a case marker and the semantics of its complement due to the s-selectional constraint. The next most important feature for the disambiguation of a case clitic is the presence of other case clitics in a fixed context window; their presence helps to reduce the search space of possible senses. The importance of these
8 http://scikit-learn.org/stable/

features can be represented in a hierarchy as complement > case clitics > governing head > clusters >
aspect.

S.No. | Left-out | Hindi Accuracy (%) | Urdu Accuracy (%)
1. | Complement | 57.98 | 52.49
2. | Governing Head | 74.02 | 67.36
3. | Case Clitics | 72.11 | 59.90
4. | Aspect | 75.73 | 69.18
5. | Clusters | 74.70 | 68.25
 | All Features | 76.33 | 69.46

Table 4.9: Leave-one-out experiments to evaluate the importance of individual features for PSD on Hindi and Urdu development sets.

To extract the features relevant for PSD of a case clitic, we parse the test data using the parser built on the feature templates discussed so far (i.e., excluding the PSD feature). Since we aim to increase labeled accuracy, using unlabeled parse trees generated by a parser to extract features is not contrary to our goal. We bank on the parser's ability to correctly identify the attachments between words, so that we can reliably extract the relevant features from unlabeled parse trees to train PSD models and thereby increase labeled parsing accuracy. These features are extracted from parse trees as follows:

• Complement: the parent of a case clitic,
• Governing head: the grandparent of a case clitic,
• Case Clitics: case clitics attached to the siblings of its parent, and
• Aspect: auxiliaries of its grandparent.
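Assuming a simple head-indexed token representation (id, form, POS, head), the four features above can be read off a parsed sentence roughly as follows; the tags PSP (case clitic) and VAUX (auxiliary) follow the usual IL treebank tagset, everything else is illustrative:

```python
# Sketch: extracting the four PSD features from a parsed sentence.
# A token is (id, form, pos, head); head 0 marks the root.

def psd_features(tokens, clitic_id):
    by_id = {t[0]: t for t in tokens}
    head = lambda t: by_id.get(t[3])
    clitic = by_id[clitic_id]
    complement = head(clitic)                            # parent of the clitic
    governor = head(complement) if complement else None  # grandparent
    siblings = [t for t in tokens
                if complement and t[3] == complement[3] and t[0] != complement[0]]
    other_clitics = [c[1] for s in siblings for c in tokens
                     if c[3] == s[0] and c[2] == "PSP"]  # clitics on siblings
    aspect = [t[1] for t in tokens
              if governor and t[3] == governor[0] and t[2] == "VAUX"]
    return complement, governor, other_clitics, aspect

# "Maryam ko kahānī yād āyī thī": ko -> Maryam -> āyī; thī is an auxiliary
sent = [(1, "Maryam", "NNP", 4), (2, "ko", "PSP", 1),
        (3, "kahānī", "NN", 4), (4, "āyī", "VM", 0), (5, "thī", "VAUX", 4)]
comp, gov, oc, asp = psd_features(sent, 2)
```

Here `comp` is ‘Maryam’, `gov` is the verb ‘āyī’, and `asp` collects its auxiliaries, mirroring the bullet list above.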

Clusters are induced from large raw corpora of Hindi and Urdu using the Brown clustering algorithm [41]. We use the implementation of Brown clustering by Liang [119] to induce these clusters.
Even without using the parser for feature extraction, the first two features in the hierarchy can be trivially extracted from the context surrounding the current case clitic. The complement is almost always the first nominal preceding the case clitic, while other case clitics can be searched for in a fixed window. Similarly, the governing head and its aspect can be extracted using POS information: they are usually taken to be the first verb following the current case clitic, together with its auxiliaries. However, we observed

that using parse trees for feature extraction works much better: the PSD models performed worse when the aforementioned heuristics were used to generate the features. For brevity, we do not report the results of the PSD models using these heuristics. We also skip the results obtained when the gold parse trees were used to generate these features, even though those models performed slightly better.

4.2.2.2 Role of Case Information in Parsing

Case markers should supposedly play a vital role in dependency parsing, since their primary function is to mark a relation between a complement nominal and its governing head. In §4.2.2.1 on PSD, we discussed how the interactions between nominal arguments, predicates and their TAM markers can help to predict the sense, or relation, marked by a case marker. None of these interactions are captured in the feature template of Zhang and Nivre [203], which could be due to the nature of the languages they experimented with: neither Chinese nor English has a rich case marking system. The only feature that could use case marking to help parse the nodes on the stack and buffer is the second-order unigram feature S0r w, i.e., the rightmost child of the word on top of the stack. It only implicitly captures the interaction between the rightmost dependent (case marker) of a word (complement) on top of the stack and the first word (governing head) in the buffer.
The interactions that could possibly benefit parsing languages with rich morphological case could
be:

• between the complement, the case marker and the governing head (lexical and non-lexical),
• between the complement, the case marker, the governing head and its aspect (both lexical and non-lexical),
• case markers as unigram features,
• between the case markers of a word on top of the stack and the first n words in the buffer, etc.

These interactions are incorporated into the feature template of Zhang and Nivre as shown in Table 4.10. We extract the case information related to the nodes on top of the stack and buffer during the feature extraction process of the parser. In previous work on parsing Hindi and other Indian languages, case information is copied onto the related nodes in a preprocessing step: the head word of a chunk is identified using a set of heuristics, such as the position of a word in a chunk and its POS tag, and the case information is copied onto it. The major drawback of copying case information before parsing is that it may lead to error propagation due to wrong identification of the head word. In contrast, the head word that a particular case marker should be attached to can be identified more accurately by the parser itself, thanks to the rich features available to it at a given parser configuration. Moreover, due to the head-final nature of Indian languages, case markers (clitics and postpositions) are attached to their heads before these heads, in turn, are attached to their respective heads. Therefore, it is easy to use case markers as second-order features (assuming a second-order Markov assumption defined vertically in a tree) for the attachment between their nominal heads and the governing predicates, as shown in

Figure 4.1. Similarly, we can extract TAM markers of a verb from the auxiliaries that immediately
follow it.
[Tree: khāyā → Maryam, ām; Maryam → ne]

Figure 4.1: Case marker ‘ne’ as a second-order Markov neighbor for the arc (khāyā, Maryam).

The Hindi and Urdu treebanks use the Prague-style analysis of coordination, where the coordinating conjunction is treated as the head [81]. McDonald and Nivre [133] have observed that transition-based systems are less accurate for this analysis of coordination, which leads to long-distance dependencies. However, in the case of Hindi and Urdu, case markers in coordinated noun structures can help to parse these structures accurately. We use the parse history in the stack to extract the case marker on the rightmost child of a conjunction and use it as a feature. This is a third-order feature for the arc between the conjunction in the stack and its head in the buffer, as shown in Figure 4.2. The importance of this feature seems obvious: an ergative case marker on the rightmost child of a coordinating conjunction, for example, would definitely guide the parser towards the correct prediction. The parser will wait for attachment until a verb with perfective aspect arrives on top of the buffer, which is unlikely otherwise. The same holds true for other case-marked coordinated structures as well. At a specific parser configuration for sentence 9, Figure 4.2 shows the interaction between the dative case marker ‘ko’ and the modal auxiliary ‘paRā’ of the verb ‘jānā’. Such an interaction provides enough context for the parser to parse the coordinated structure headed by ‘aur’ as an obligational subject of the verb ‘jānā’.

(9) Shahid aur Atif ko Kashmir jānā paRā .
    Shahid and Atif Dat Kashmir go-Inf had to .
    ‘Shahid and Atif had to go to Kashmir.’

Stack: [aur → Shahid, Atif; Atif → ko]    Buffer: [jānā paRā]

Figure 4.2: Parser configuration capturing the interaction between the third-order lexical feature ‘ko’ and the modality marker ‘-nā paRā’ for parsing S0 (‘aur’) and B0 (‘jānā’).

In addition, we also use the parse history to retrieve case markers from partially parsed noun phrases in the stack to guide the parsing of nominal words on top of the stack and governing verbs in the buffer. This feature is the second best feature for PSD in both Hindi and Urdu (cf. Table 4.9), hence the motivation for using it for parsing as well. We maintain a list of the case markers in partial structures in the stack and use it as a feature (both as a single concatenated feature and with each case marker as a separate feature) whenever S0 is a noun and B0 is a verb. To ensure that we only use case markers which are relevant to the current predicate (B0), we filter them based on the parent of their complement: we use only those case markers as features whose complements are either yet to be parsed or are already dependents of B0. Case stacking in a specific parser configuration for Example 7 is shown in Figure 4.3. Case stacking implicitly captures information about the arguments of a verb in the buffer before they are attached to it. This may help to parse non-oblique argument(s) better (‘kitāb’ in this case).
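The filtering step can be sketched as follows; the token-id representation and arc map are illustrative, not the parser's internal data structures:

```python
# Sketch of the case-stacking feature: collect case markers on partially
# parsed nominals in the stack, keeping only those whose complement is
# still unattached or is already a dependent of the buffer-front verb B0.
# 'arcs' maps dependent id -> head id for arcs built so far.

def stacked_case_features(stack_nominals, case_of, arcs, b0):
    """stack_nominals: token ids of nominals in the stack;
    case_of: nominal id -> attached case marker (or absent)."""
    feats = []
    for n in stack_nominals:
        marker = case_of.get(n)
        if marker is None:
            continue
        head = arcs.get(n)
        if head is None or head == b0:   # yet to be parsed, or dependent of B0
            feats.append(marker)
    return feats

# Example 7 configuration: Shahid(1)-ne and Atif(2)-ko in the stack, dī at id 4
feats = stacked_case_features([1, 2], {1: "ne", 2: "ko"}, {}, b0=4)
# feats == ['ne', 'ko']
```

If Shahid were already attached to some other head, its marker would be filtered out, leaving only ‘ko’ as a feature for B0.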

Stack: [Shahid → ne-Erg; Atif → ko-Dat]    Buffer: [kitāb dī-Perf]

Figure 4.3: Stacking of case markers attached to their respective heads in the stack.

Single words | S0c; S0cd; B0c; B0cd; B1c; B1cd; B2c; B2cd; S0cS0cd; B0cB0cd; B1cB1cd; B2cB2cd;
Word pairs | S0wcpB0wcd; S0wcdB1wcd; S0wcdB2wcd; S0wB0c; S0wB0cd; S0cB0w; S0cdB0w; S0cB0c; S0cdB0c; S0cB1c; S0cdB1c; S0cdB1cd; S0cB2c; S0cdB2c; S0cdB2cd;
Word triplets | B0cB1cB2c; S0cB0cB1c;
Valency | S0cvr; S0cdvr; S0cvl; S0cdvl; B0cvl; B0cdvl;
Labels | S0csrB0c; S0cdsrB0c; S0cslB0c; S0cdslB0c; B0cslS0c; B0cdslS0c

Table 4.10: Feature template capturing the interaction of lexical and non-lexical features with lexical and non-lexical case information. c denotes case and aspect; cd denotes disambiguated case.

4.2.2.3 Grammatical Agreement

Agreement in Hindi and Urdu is exhibited between a verb and one of its non-oblique arguments, between the head noun of a genitive construction and the genitive case marker, and between a noun and its modifying adjective. Agreement is usually expressed in terms of gender and number. In the case of genitives, agreement probably allows the head noun to move around, leading to dislocated genitive constructions. Both the Hindi and Urdu treebanks contain a considerable number of dislocated genitives: around 2% of the non-projective structures in the Hindi treebank are due to dislocated genitives, while in the Urdu treebank the figure is 3%. Non-projective structures are known to be tough for parsing; in particular, the arc-eager algorithm cannot handle them directly. Capturing agreement in the parsing model could benefit parsing in multiple ways. For instance, it can complement case marking. When both arguments of a transitive verb are non-oblique, the one which agrees with the verb is the subject (kartā) and the other is the object (karma) (see Example 10). Similarly, if one of the arguments is non-oblique and the other is marked by the ergative case marker, the verb agrees with the non-oblique argument, which is most probably the theme/object (karma) of the verb (see Example 11). In the case of genitives, the nominal that agrees with the genitive case marker is the head of the genitive construction (see Example 12). Thus agreement can improve both the label and the attachment score of a parser.

(10) Maryam ām khā rahī hai .
     Maryam-3FSg mango-3MSg eat be-Prog.3FSg be-Prs .
     ‘Maryam is eating a mango.’
     [Dependency labels: Maryam → kartā, ām → karma]

(11) Maryam-ne ām khāyā .
     Maryam-Erg mango-3MSg eat-Perf.3MSg .
     ‘Maryam ate a mango.’
     [Dependency labels: Maryam → kartā, ām → karma]

(12) Shahid kī behan doctor hai .
     Shahid of-3FSg sister-3FSg doctor be-Prs .
     ‘Shahid’s sister is a doctor.’
     [Dependency labels: Shahid → r6 (head: behan), behan → k1, doctor → k1s]

Contrary to expectations, incorporating agreement into the parsing model has been disappointing in almost all the works on parsing Hindi and Urdu [9, 10, 28]; it has led to a drop in parsing accuracies in most of them. Upon close inspection of the feature models (of transition-based systems) that use agreement, we realized that agreement is captured wrongly: the agreement features of the nodes to be parsed are usually irrelevant. In the case of genitives, it is not the agreement between the possessor and the possessed but that between the possessed and the genitive marker which is relevant, as shown in the dependency graph of Example 12; however, the feature templates capture the agreement between the possessor and the possessed, which is irrelevant. Similarly, predicate agreement between the main verb and its argument is mostly irrelevant: the agreement features are mostly present on the auxiliaries of a verb, as shown in Example 10, so capturing the agreement between the main verb and its argument is uninformative. Due to these wrong interactions, there is usually a drop in parsing performance. To make agreement relevant, we could incorporate agreement features in the following way:

• agreement between the rightmost child (genitive marker) of a possessed noun (top of the stack) in a genitive construction and the first n words in the buffer.
• agreement between the noun on top of the stack and the auxiliaries following the verb in the buffer.

In our parsing model, we capture these interactions as per the feature template shown in Table 4.11.

Single words | S0wgn; B0wgn; B1wgn; B2wgn
Word pairs | S0wgnB0wgn; S0wgnB1wgn; S0wgnB2wgn; B0wgnB1wgn
Word triplets | B0gnB1gnB2gn; S0gnB0gnB1gn
Unigrams | S0hgn; S0lgn; S0rgn; B0lgn
Third Order | S0h2gn; S0l2gn; S0r2gn; B0l2gn
Labels | S0gnsr; S0gnsl; B0gnsl; S0gnB0gnsl

Table 4.11: Agreement features relevant for parsing Hindi and Urdu. g denotes gender and n denotes grammatical number.

As mentioned above, agreement, if captured properly, helps in better parsing of discontinuous genitive constructions. We explain this by considering the discontinuous genitive construction shown in Example 13. If we consider agreement between the top nodes of the stack and the buffer, ‘Nadiya’ would be attached to ‘itvār’, since their agreement features match and they are adjacent in the linear order.9 However, ‘Nadiya’ should be attached to ‘checkup’, which is its actual syntactic
9 kā is a clitic marker which is assumed to be prosodically dependent on its head on the left.

head (see Figure 4.5). This cannot be achieved in an arc-eager model due to the projectivity constraint: ‘checkup’ is popped off the stack once it is attached to ‘hogā’, so it will never be available for attachment to ‘Nadiya’. However, if we replace the agreement features of ‘Nadiya’ with the features of ‘kā’, the parser attaches ‘Nadiya’ to ‘hogā’ instead of ‘itvār’; the attachment between ‘Nadiya’ and ‘itvār’ is blocked due to an agreement violation. Even though ‘hogā’ is not the syntactic head of ‘Nadiya’, it is its linear head after the projective transformation of the edge (Nadiya, checkup) (see the dotted edge in Figure 4.5). By applying reverse transformations using breadth-first search [152], we can retrieve the attachment between ‘Nadiya’ and ‘checkup’.

(13) Nadiya kā itvār ko checkup hogā .
     Nadiya-3FSg of-3MSg sunday-3FSg Dat checkup-3MSg will-3MSg .
     ‘Nadiya’s checkup will be on Sunday.’

Stack: [Nadiya-FS → kā-MS]    Buffer: [itvār-FS ko checkup-MS hogā-MS]

Figure 4.4: Parser configuration showing agreement features between words in the stack and the buffer. Dotted lines show agreement between the rightmost child of S0 and B2&3, while the dashed line shows agreement between S0 and B0.

[Tree of (13): checkup → Nadiya (r6); Nadiya → kā (lwg psp); hogā → itvār (k7t), checkup (pof); itvār → ko (lwg psp)]

Figure 4.5: Dependency tree of Example sentence 13. The dotted line shows the projective transformation of the non-projective edge (Nadiya, checkup). r6↑pof and pof↓ are the transformed edge labels encoding the lifting operation according to the Head+Path scheme of Nivre and Nilsson [152].
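The substitution just described, replacing the possessed noun's agreement features with those of its genitive marker, reduces to a simple feature-match check. A minimal sketch, with an illustrative (gender, number) encoding rather than the parser's actual feature representation:

```python
# Sketch of the agreement check used to score candidate attachments:
# features are (gender, number) tuples; None acts as a wildcard for
# underspecified values. The encoding is illustrative.

def agree(feats_a, feats_b):
    """True if the two feature tuples do not conflict."""
    return all(a == b or a is None or b is None
               for a, b in zip(feats_a, feats_b))

kaa = ("m", "sg")      # genitive marker kā: masculine singular
checkup = ("m", "sg")  # linear head after the projective transformation
itvaar = ("f", "sg")   # adjacent but non-agreeing noun

agree(kaa, checkup), agree(kaa, itvaar)  # (True, False)
```

With the features of ‘kā’ standing in for ‘Nadiya’, the attachment to ‘itvār’ is blocked by the mismatch while the agreeing head remains available.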

4.2.2.4 Complex Predication

In the Hindi and Urdu treebanks, there are 44,546 and 17,672 predicates respectively, of which
more than half have been identified as noun-verb and adjective-verb complex predicates (NVC) at the
dependency level. Typically, a noun-verb complex predicate like chorī ‘theft’ karnā ‘to do’ (steal) has
two components: a noun chorī and a light verb karnā. The verbal component in NVCs has reduced

predicating power (although it is inflected for person, number, and gender agreement as well as tense,
aspect and mood) and its nominal complement is considered the true predicate [44, 190]. Since the
light verb governs the agreement and case marking on its argument(s), the arguments are treated as
dependents of the light verb instead of the host nominal in the Indian language treebanks as shown in
Figure 4.6.

(14) Bhensah ke MPDO office mẽ māzorīn kā aik ijlās monaqid kiyā gayā .
     Bhensah of MPDO office in paraplegic of one conference-3MSg hold do-3MSg go .
     ‘A conference was organized for paraplegics in the MPDO office of Bhensah.’

Figure 4.6: Dependency tree of an example sentence from the Urdu treebank showing the annotation of the complex predicate monaqid kiyā. All the arguments of the complex predicate and the host nominal monaqid are attached to the light verb kiyā. Note that the theme ijlās ‘conference’ agrees in gender and number with the light verb kiyā.

It is usually the case that the arguments of a complex predicate can be inferred from its light verb. In Example 15, the light verb kī ‘do’ provides enough information about the valency of the complex predicate: even if we mask the host nominal, we can still infer that the complex predicate is transitive and that kahānī is its theme argument. However, oblique arguments can hardly be disambiguated by looking at the light verb alone. In Example 16, we may infer that the ko-marked noun is the indirect object (recipient) of the action ‘give’, whereas it is actually the direct object (theme); compare Example 17, where the ko-marked noun is indeed the recipient.

(15) Nadiya-ne kahānī yād kī .
     Nadiya-3FSg.Erg story-3FSg.Nom memory do-Perf.3FSg .
     ‘Nadiya memorized the story.’

(16) Nadiya-ne Shahid-ko dhakā diyā .
     Nadiya-3FSg.Erg Shahid-3MSg.Acc push give-Perf.3FSg .
     ‘Nadiya pushed Shahid.’

(17) Nadiya-ne Shahid-ko phōl diyā .
     Nadiya-3FSg.Erg Shahid-3MSg.Dat flower-3FSg.Nom give-Perf.3FSg .
     ‘Nadiya gave Shahid a flower.’

To empirically verify this observation, we trained Hindi and Urdu parsers treating complex predicates as single lexical units. We used the gold information in the dependency trees to concatenate the host nominal and light verb into a single lexical predicate without altering the tree. For example, the complex predicate chorī karnā after this modification looks like chorī chorī-karnā. We do not remove

the host nominal as an independent node from the tree, as its individual nominal modifiers would otherwise be lost (see [24, p. 32]). Interestingly, the parsing accuracy on the development set increased by almost 1% LS in both languages. We observed improvements mainly on theme, location and experiencer arguments in both languages, by more than 2% LAS. Moreover, treating complex predicates as single lexical units even improved the attachment accuracy, by an average of 0.3% UAS. Based on these observations, we propose to treat complex predicates as single units while parsing so that their arguments can be efficiently disambiguated. To do that, however, we first need to identify those nominal-verb pairs which act as complex predicates. For this task, we use an approach similar to that of Begum et al. [15], who proposed a maximum entropy model that uses linguistically motivated features to identify pairs of consecutive noun/adjective-verb pairs as complex predicates. Specifically, they used a set of features which include the constituent words of a complex predicate and their semantic properties, the count of occurrences of the nominal host with other verbs, the presence of determiners before the nominal host, and the presence of postpositions after it. Instead, we use a binary SVM classifier [57] to build the identification system, with additional features. Since complex predicates are annotated at the dependency level in the IL treebanks, we extract host-nominal/light-verb pairs and consecutive noun/adjective-verb pairs to create the training and evaluation data. The statistics of the data are reported in Table 4.12.
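The lexical merge used in the oracle experiment above is a simple rewrite of the light verb's form; a sketch over an illustrative token list (the (id, form) layout is not the treebank format):

```python
# Sketch of the complex-predicate merge: the light verb's form is rewritten
# as host + "-" + light verb, while the host nominal is kept as an
# independent node so its own modifiers are preserved.

def merge_complex_predicates(tokens, nvc_pairs):
    """tokens: list of [id, form]; nvc_pairs: (host_id, light_verb_id) pairs."""
    merged = {t[0]: list(t) for t in tokens}   # copy, leaving input intact
    for host, lv in nvc_pairs:
        merged[lv][1] = merged[host][1] + "-" + merged[lv][1]
    return [merged[t[0]] for t in tokens]

sent = [[1, "Nadiya-ne"], [2, "chorī"], [3, "kī"]]
merged_sent = merge_complex_predicates(sent, [(2, 3)])
# -> [[1, 'Nadiya-ne'], [2, 'chorī'], [3, 'chorī-kī']]
```

Because only the form of the light verb changes, the tree itself, and hence all attachments of the host's modifiers, is left untouched, as described above.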

Data | Hindi NVC | Hindi Non-NVC | Urdu NVC | Urdu Non-NVC
Training | 11,210 | 20,876 | 7,125 | 5,456
Testing | 1,457 | 2,495 | 1,778 | 1,341
Development | 1,542 | 2,699 | 1,607 | 1,434

Table 4.12: Distribution of nominal-light verb pairs and adjacent nominal-literal verb pairs in the training, testing and development sets of Hindi and Urdu.

Although the features proposed by Begum et al. are linguistically motivated and appropriate for the task, there are many other easily extractable features that can prove beneficial. One such feature is the strength of association between a word pair that can form a complex predicate. Complex predicates are multi-word expressions which are supposed to be highly cohesive, as their constituent words tend to co-occur more often. To measure the strength of association between the constituent words of a complex predicate, we employ Normalized Pointwise Mutual Information (NPMI), a widely used association measure for collocation extraction [49]. The strength of association between two words x and y based on NPMI can be formally defined as:

npmi(x; y) = pmi(x; y) / (− log p(x, y))        (4.1)

where,

pmi(x; y) = log [ p(x, y) / (p(x) p(y)) ]        (4.2)

p(x, y) is the joint probability of x and y, while p(x) and p(y) are their marginal probabilities. NPMI
is bounded in [-1, +1], where -1 means the words never occur together, 0 means independence, and +1
means they always co-occur. The probability distributions are learned from 8M and 5M monolingual
corpora of Hindi and Urdu, respectively.
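The NPMI computation above can be sketched in code. This is a minimal illustration, estimating the probabilities in Equations 4.1 and 4.2 from bigram counts over a tokenized corpus; the token list in the usage below is an invented toy example, not treebank data:

```python
import math
from collections import Counter

def npmi(tokens, x, y):
    """NPMI of the adjacent word pair (x, y): +1 when the words always
    co-occur, 0 at independence, -1 when they never co-occur."""
    bigrams = list(zip(tokens, tokens[1:]))
    bi = Counter(bigrams)
    if bi[(x, y)] == 0:
        return -1.0              # pmi -> -inf; NPMI lower bound
    n = len(bigrams)
    p_xy = bi[(x, y)] / n
    if p_xy == 1.0:
        return 1.0               # degenerate single-bigram corpus
    # marginals over first/second bigram positions keep NPMI in [-1, +1]
    p_x = sum(c for (a, _), c in bi.items() if a == x) / n
    p_y = sum(c for (_, b), c in bi.items() if b == y) / n
    pmi = math.log(p_xy / (p_x * p_y))   # Equation (4.2)
    return pmi / -math.log(p_xy)         # Equation (4.1)
```

In practice the counts would come from the large monolingual corpora mentioned above rather than a small token list.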
We computed the NPMI score of all the complex predicates in the Hindi and Urdu treebanks and
found positive NPMI scores in around 87% and 85% of the cases, respectively. An indication of the
association between two consecutive nominal-verb words should therefore be a strong cue for the
identification task. The use of association measures for complex predicate identification is similarly
motivated in Butt et al. [44], who use simple bigram extraction in conjunction with Chi-Square
(χ2 ) to study the distributional patterns of a fixed class (Noun + Verb) of complex predicates in Urdu.
Furthermore, we also observed that case markers are less likely to follow nominals that form complex
predicates. Therefore, apart from the base features used by Begum et al., we use the NPMI scores of the
consecutive nominal-verb pair and the nominal-case marker pair, as well as the association of a nominal
host with the most frequent light verbs in the treebank.10 However, we replace the wordnet-based
semantic property of a word with its cluster id. We use an SVM [57] to build our classification model
and tune its hyperparameters with a grid search on the development set. The results of our experiments
are reported in Table 4.13. Following Begum et al., we set the baseline using the lexical forms of the
nominal host and the light verb.
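The hyperparameter tuning step can be sketched as a generic grid search over the development set. The code below is a schematic driver, not our actual experimental setup: `train_fn` and `eval_fn` stand in for SVM training and development-set scoring, and the grid values in the usage are placeholders.

```python
from itertools import product

def grid_search(train_fn, eval_fn, grid):
    """Train one model per hyperparameter combination and keep the one
    that scores best on the development set."""
    best_score, best_params = float("-inf"), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = eval_fn(train_fn(**params))
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

With scikit-learn, a similar loop is provided by GridSearchCV over an SVC estimator (using cross-validation rather than a fixed development set).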

Model           Hindi                      Urdu
                Testing    Development     Testing    Development
Baseline        82.14      82.65           80.26      80.34
Begum et al.    90.74      90.55           87.75      87.89
This work       92.09      91.99           89.68      89.74

Table 4.13: Performance of our SVM-based classifier on complex predicate identification in the Hindi
and Urdu treebank testing and development sets.

As shown in Table 4.13, using just the lexical forms of the host nominal and the light verb gives
reasonable accuracies. In comparison to Begum et al., our model performs better by an average of
∼1.5%. This shows the importance of association scores in the identification task.
Unlike Begum et al., who use the predictions from their model as boolean features indicating whether
a verb is a light verb or not, we propose to treat the identified complex predicates as single units. We
concatenate the word form of a nominal host with the light verb whenever the light verb is the first
word in the buffer and treat it as a normal word-based feature (i.e., B0w).11 This lexical feature further
interacts with other features in a parser configuration, such as POS and dependency label,12 to generate
richer features that help to parse the arguments of a complex predicate more accurately.

10 We used the 10 most frequent light verbs in both treebanks.
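The buffer-front concatenation just described can be sketched as follows. The feature-string format, the example sentence and the NVC pair set are illustrative assumptions, with the host nominal taken to be the token immediately preceding the light verb:

```python
def b0_word_feature(sentence, b0_index, nvc_pairs):
    """Word-form feature for the first buffer item (B0w). If this token
    and its preceding nominal were identified as a complex predicate,
    the two word forms are emitted as a single lexical unit."""
    b0 = sentence[b0_index]
    prev = sentence[b0_index - 1] if b0_index > 0 else None
    if prev is not None and (prev, b0) in nvc_pairs:
        return f"B0w={prev}_{b0}"   # host nominal + light verb as one unit
    return f"B0w={b0}"
```

The returned string then combines with POS- and label-based features of the configuration in the usual feature templates.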

4.2.2.5 Identification of Ezafe

Ezafe is an enclitic short vowel ‘e’ that joins two nouns, a noun and an adjective, or an adposition
and a noun into a possessive relationship.13 In Urdu, ezafe is a loan construction from Persian. It
originated from the Old Iranian relative pronoun ‘-hya’, which in Middle Iranian changed into y/i, a
device for nominal attribution [38]. The Urdu ezafe construction functions similarly to its Persian
counterpart. In both languages, the ezafe construction is head-initial, which differs from the typical
head-final nature of these languages. As in Persian, the Urdu ezafe lacks prosodic independence; it
attaches to the word to its left, which is the head of the ezafe construction. It is pronounced as a unit with
the head and licenses a modifier to its right. This is in contrast to the Urdu genitive construction, which
conforms to the head-final pattern typical of Urdu. The genitive marker leans on the modifier of the
genitive construction, not on the head, and is pronounced as a unit with it. Example (18) shows a typical
genitive construction in Urdu, while Example (19) shows an ezafe construction.
The ezafe construction in Urdu can also indicate relationships other than possession [171]. In the
Urdu treebank, when an ezafe construction is used to show a possessive relationship, it is annotated
similarly to genitive constructions indicating possession with an “r6” label as in Example (19). The
head noun ‘owner’ possesses the modifying noun ‘throne’. However, in Example (20) ezafe does not
indicate a possessive meaning. In such cases “nmod” (noun modifier) is used instead of “r6”. The
adjective ‘bright’ does not stand in a possession relation to the head noun ‘day’, but simply modifies it
in an attributive manner.

(18) Yasin-kā   qalam
     Yasin-Gen  pen
     ‘Yasin’s pen’

(19) sāhib-e   takht
     owner-Ez  throne
     ‘The owner of the throne’

(20) roz-e   roshan
     day-Ez  bright
     ‘Bright day’
11 Since Hindi and Urdu are head-final languages, explicit representation of a complex predicate only makes sense when it
is in the buffer. Nevertheless, we also tried representing it in the stack, but that did not improve the accuracy.
12 See the features involving B0w in the feature templates defined so far.
13 There are also a few cases where an ezafe construction is formed by a demonstrative and a noun. Two of these cases are
mukām-e hāzā ‘this place’ and masjid-e hāzā ‘this mosque’. ‘hāzā’ is a proximal demonstrative borrowed from Arabic.

sāhib-e ‘owner’ ──r6──▶ takht ‘throne’        roz-e ‘day’ ──nmod──▶ roshan ‘bright’

Figure 4.7: Dependency trees of Examples (19) and (20).

Ezafe is barely used in spoken language, while it is found frequently in literary texts and newspapers
that use a high literary register [171]. Ezafe is a productive phenomenon in Urdu,14 though it is
assumed to be limited to words of Persian origin [38]. Unlike genitives, where the genitive markers are
explicitly present, the ezafe marker, like other diacritics in Urdu, is dropped in written text. This poses a
challenge to parsing such constructions, since the parser has no explicit cue available to identify two
consecutive nominals as an ezafe construction. There are, however, some implicit cues that can be used to
identify them. One is the fact that the construction is mostly formed using words of Persian origin. The
other is that these constructions behave like multi-word expressions whose constituent words are highly
correlated and tend to co-occur. With a few exceptions, ezafe constructions in the Urdu treebank are very
cohesive and have constituent words mainly of Persian origin. We use the multinomial Bayes model
proposed in [30] to identify the origin of a word in the Urdu treebank, while we use NPMI to measure
cohesion between the constituent words of an ezafe construction. In Table 4.14, we report the NPMI
scores of constituent words for a sample of ezafe constructions from the Urdu treebank. Except for a few
cases in the treebank, the scores are positive for almost all the constructions (more than 90%). Regarding
the origin of the constituent words, we observed quite a few cases where the dependent nominal is not a
Persian word; however, the head word is always Persian.
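A word-origin classifier of this kind can be sketched as a multinomial naive Bayes model over character bigrams. This is a minimal stand-in for the model of [30]: the n-gram order, add-one smoothing, and the toy training words in the usage are our own assumptions.

```python
import math
from collections import Counter

def char_bigrams(word):
    """Character bigrams of a word with boundary padding."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

class OriginNB:
    """Multinomial naive Bayes over character bigrams with add-one
    smoothing; predicts the language of origin of a word."""

    def fit(self, words, labels):
        self.priors = Counter(labels)
        self.counts = {y: Counter() for y in self.priors}
        for w, y in zip(words, labels):
            self.counts[y].update(char_bigrams(w))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def _log_prob(self, word, y):
        c, total = self.counts[y], sum(self.counts[y].values())
        lp = math.log(self.priors[y])
        for g in char_bigrams(word):
            lp += math.log((c[g] + 1) / (total + len(self.vocab) + 1))
        return lp

    def predict(self, word):
        return max(self.counts, key=lambda y: self._log_prob(word, y))
```

In practice such a model would be trained on sizeable lists of words labeled for origin rather than a handful of examples.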

Example                Gloss                    Persian origin   NPMI
maslah-e-Kashmir       Kashmir issue            maslah            0.40
sadr-e-congress        Congress president       sadr              0.10
māhirīn-e-nafsiyāt     Psychology specialists   both              0.50
qābil-e-gor            Noteworthy               both              0.40
rokun-e-assembly       Assembly member          rokun             0.50
khārāj-e-aqīdat        Tribute                  both              0.70
sadr-e-cricket         Cricket president        sadr             -0.10
izhār-e-māzirat        Expressing apology       both             -0.02

Table 4.14: Normalized Pointwise Mutual Information scores and the origin of a few ezafe
constructions in the Urdu treebank.
14 Some of the novel uses of ezafe in Urdu are dard-e disco ‘pain of disco’ and cement-e Kashmir ‘cement of Kashmir’.
The former is used in a Bollywood song, while the latter is a tagline used by a local cement factory in Jammu and Kashmir,
India.

It should be noted here that a high NPMI score and the Persian origin of two adjacent words do
not by themselves mean that the word pair forms an ezafe construction. While applying these heuristics
to the development set, we observed many false positives, such as musalmān qaum ‘Muslim community’,
whose words were identified as Persian and also had high NPMI scores. Therefore, instead of identifying
ezafe constructions beforehand, we identify them during parsing using a few features based on the above
observations. These features are complementary to the rich features that can be extracted from a parser
configuration. Since ezafe is head-initial, we compute the NPMI between the nominal words S0 (potential
head) and B0 (potential modifier) if both S0 and B0 are Persian or S0 is Persian, and add the score as a
feature (S0B0npmi ) together with the language of origin of both words (S0origin , B0origin ). However, we
constrain these heuristics to the following edge types: NN → NN, NN → JJ and NN → NST.
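These heuristic features can be sketched as below. The feature-string names mirror S0B0npmi, S0origin and B0origin from the text, while the NPMI scorer and origin classifier are passed in as callables (stubbed in the usage, since both were sketched above):

```python
def ezafe_features(s0, b0, s0_tag, b0_tag, npmi_fn, origin_fn):
    """Ezafe cue features for the S0 (potential head) / B0 (potential
    modifier) pair, restricted to NN->NN, NN->JJ and NN->NST edges."""
    s0_org, b0_org = origin_fn(s0), origin_fn(b0)
    feats = [f"S0origin={s0_org}", f"B0origin={b0_org}"]
    allowed = {("NN", "NN"), ("NN", "JJ"), ("NN", "NST")}
    # the head of an ezafe construction must be of Persian origin
    if (s0_tag, b0_tag) in allowed and s0_org == "persian":
        feats.append(f"S0B0npmi={npmi_fn(s0, b0):.2f}")
    return feats
```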

4.3 Experiments and Results


In the above section, we have discussed how to integrate different morphologically and syntactically
relevant features into a transition-based parsing model of Hindi and Urdu. Here we discuss the results
and the impact each feature had on the performance of the parsers and put down our observations. To
carry out the parsing experiments, we use an arc-eager transition system with a dynamic oracle. The
model parameters are learned using the averaged perceptron algorithm. The parsing models are trained
for 15 epochs with the default exploration hyperparameters [76].
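The averaged perceptron update at the core of this learner can be sketched as follows. This is a generic binary version for illustration, not the structured learner used in the parser, and the indexed feature encoding in the usage is an invented toy one:

```python
def averaged_perceptron(examples, epochs=15):
    """examples: list of (feature-index list, label in {-1, +1}).
    Mistake-driven updates; returns the average of all intermediate
    weight vectors, which is less prone to overfitting than the
    final weights."""
    w, w_sum, t = {}, {}, 0
    for _ in range(epochs):
        for feats, y in examples:
            t += 1
            score = sum(w.get(f, 0.0) for f in feats)
            if y * score <= 0:                  # misclassified: update
                for f in feats:
                    w[f] = w.get(f, 0.0) + y
            for f in w:                          # accumulate running sum
                w_sum[f] = w_sum.get(f, 0.0) + w[f]
    return {f: s / t for f, s in w_sum.items()}
```

Practical implementations avoid the per-step summation by recording the timestamp of each weight's last update.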
Following from our discussion in §4.2.2.2, case markers seem crucial for argument differentiation
in parsing. Adding case markers as a separate feature has been rewarding: we achieved an increase
in parsing accuracy of ∼0.7% LAS in Hindi and ∼1.0% LAS in Urdu. The accuracies are
listed in Tables 4.15 and 4.16. The improvements are prominent on the dependencies related to the
verb argument structure. Relations like k1, k2, k4, k4a, k1s and k2s improved substantially with the
addition of the case feature, as shown in Figure 4.8. Using the case feature of the rightmost child of a
coordinating conjunction improved their accuracy by 1% LAS in Hindi and 4% LAS in Urdu. Similarly,
incorporating the possible sense of a case marker suggested by the PSD models further improved parsing,
increasing the accuracy by another ∼0.3%.

Feature               Development                                  Test
                      UAS            LS             LAS            UAS            LS             LAS
Baseline              93.51          90.21          87.82          93.52          90.01          87.77
+Lexical Case         94.01 (+0.50)  91.01 (+0.80)  88.70 (+0.88)  93.90 (+0.38)  90.71 (+0.70)  88.45 (+0.68)
+Case Disambiguated   94.11 (+0.10)  91.24 (+0.23)  88.95 (+0.25)  94.09 (+0.19)  91.10 (+0.39)  88.85 (+0.40)
+Agreement            94.23 (+0.12)  91.45 (+0.21)  89.12 (+0.17)  94.22 (+0.13)  91.38 (+0.28)  89.07 (+0.22)
+Complex Predicates   94.26 (+0.03)  91.65 (+0.20)  89.34 (+0.22)  94.23 (+0.01)  91.56 (+0.18)  89.25 (+0.18)

Table 4.15: Parsing accuracies of Hindi after incorporating information about case, complex
predication and agreement in the baseline feature template. Values in parentheses are improvements
over the preceding row.

Feature               Development                                  Test
                      UAS            LS             LAS            UAS            LS             LAS
Baseline              88.81          84.88          81.34          88.77          84.84          81.19
+Lexical Case         89.79 (+0.98)  86.09 (+1.21)  82.35 (+1.01)  89.79 (+1.02)  86.01 (+1.17)  82.24 (+1.05)
+Case Disambiguated   90.01 (+0.22)  86.34 (+0.25)  82.68 (+0.33)  90.00 (+0.21)  86.20 (+0.19)  82.43 (+0.19)
+Agreement            90.17 (+0.16)  86.52 (+0.18)  82.85 (+0.17)  90.14 (+0.14)  86.42 (+0.22)  82.57 (+0.14)
+Complex Predicates   90.30 (+0.13)  86.83 (+0.31)  83.19 (+0.34)  90.26 (+0.12)  86.71 (+0.29)  82.95 (+0.38)
+Ezafe                90.43 (+0.13)  87.04 (+0.21)  83.38 (+0.19)  90.39 (+0.13)  86.92 (+0.21)  83.21 (+0.26)

Table 4.16: Parsing accuracies of Urdu after incorporating information about agreement, case,
complex predicates and ezafe in the baseline feature template. Values in parentheses are improvements
over the preceding row.

Similarly, incorporating agreement into the parsing models led to substantial improvements. Even
though the overall improvement in UAS is small, dislocated genitives improved by around 5% UAS in
Hindi and 7% UAS in Urdu. There is also a minor improvement of 2% LS on non-oblique core verb
arguments.
As we have mentioned earlier, we extract the features related to case marking and agreement from
the parser configurations as higher-order features. However, in the previous works on parsing Hindi
and other Indian languages, these features are added to the head node in a pre-processing step. These
features are then used as first-order features for any edge involving the head word. We have argued that
copying these features in a pre-processing step would lead to error propagation. We conducted separate
experiments to evaluate the performance of both extraction procedures. The results for comparison
are reported in Table 4.17. As shown in the Table, we gained improvements of around 0.3% LAS for
both Hindi and Urdu by incorporating these features during the parsing process. More importantly, our
extraction procedure removes the dependency of the Hindi and Urdu transition-based parsers on the
preprocessing tools.

Feature        Hindi                           Urdu
               Development    Test             Development    Test
First-order    89.03          89.01            82.98          82.91
Higher-order   89.34 (+0.31)  89.25 (+0.24)    83.38 (+0.40)  83.21 (+0.30)

Table 4.17: Parsing accuracies (LAS) showing the impact of the two feature extraction procedures
used to extract features related to case marking and agreement. Values in parentheses are improvements
over the first-order procedure.

Furthermore, we also gained significantly in accuracy by propagating complex predicates as single
lexical units. The gains are more prominent in Urdu, where dependencies like k7p, k4a and k2
improved by 6%, 3% and 1.5%, respectively. There are similar gains in Hindi as well, though not as
significant. In Table 4.18, we compare the results of our approach of treating complex predicates as
single lexical units with the approach of Begum et al. [15]. Following Begum et al., we add a boolean
indicator to a verb if it is identified as a light verb by our identification system (see §4.2.2.4). The
feature is used whenever there is an edge involving the identified light verb. We also combined this
feature with our features for complex predicates in the Combined setting. As shown in Table 4.18, our
approach performs better than that of Begum et al. Moreover, the combination of both approaches
(Combined setting) did not help, which shows that they are not complementary.

Feature        Hindi                           Urdu
               Development    Test             Development    Test
Begum et al.   89.15          89.09            82.95          82.74
This Work      89.34 (+0.19)  89.25 (+0.16)    83.19 (+0.24)  82.95 (+0.21)
Combined       89.33 (−0.01)  89.26 (+0.01)    83.20 (+0.01)  82.95 (+0.00)

Table 4.18: Comparison of parsing accuracies (LAS) showing the impact of features capturing
complex predication. The improvements are over the systems containing all the features up to
complex predicates, as shown in Tables 4.15 and 4.16.

In addition to the improvements from capturing complex predication, we also improved the parsing of
ezafe constructions by 35% LAS (covering almost 90% of the ezafe constructions in the test set). The
improvements are particularly encouraging for ezafe constructions that are unknown to the training
model. However, there are also ∼15% false positives, which mainly result from a combination
of high NPMI scores and the Persian origin of adjacent word pairs. An interesting case, which is parsed as
ezafe, involves the word pair nājāyiz tāloqāt ‘illicit relationship’. Both words are identified as Persian
and have a high NPMI score.
Figure 4.8 draws an overall sketch of the improvements that all these rich morpho-syntactic features
brought in the parsing of dependencies that define the verb argument structure in Hindi and Urdu.

[Bar plots comparing LAS obtained with cluster features vs. the rich morpho-syntactic features for
the dependency labels k1, k2, k4, k4a, k1s, k2s and r6; left plot: Hindi, right plot: Urdu.]

Figure 4.8: Impact of case and agreement features on different verb arguments and genitives in
Hindi (left plot) and Urdu (right plot). The definitions of the dependency labels used in the graphs
can be found in Table 3.1.

4.3.1 Feature Ablation Experiments


To evaluate the individual importance of the proposed features, we used leave-one-out and only-one
evaluation: we ran the same experiment multiple times, each time excluding or including exactly one
feature. The results can be found in Table 4.19. The figures in Table 4.19 suggest that the features
related to case marking are the most important for parsing both languages. Interestingly, the features
related to lexical case marking and the disambiguated case role seem to be correlated. Even though
features related to case marking are the most important individual features, leaving them out did not lead
to a substantial drop in accuracy. Similarly, there is only a slight drop in LAS when the disambiguated
case features are omitted, while their individual contribution is very high. Conversely, features related
to agreement and complex predication seem to be complementary to case marking and capture crucial
information about both languages.
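The two ablation protocols can be expressed as one experiment driver; `train_eval` is a placeholder for training and scoring a parser with a given set of feature groups, and the toy scorer in the usage is invented:

```python
def ablation(features, train_eval):
    """Run leave-one-out and only-one experiments over feature groups,
    plus the all-features and baseline settings for comparison."""
    runs = {("all", None): train_eval(list(features)),
            ("baseline", None): train_eval([])}
    for f in features:
        runs[("leave-one-out", f)] = train_eval([g for g in features if g != f])
        runs[("only-one", f)] = train_eval([f])
    return runs
```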

Feature              Hindi                        Urdu
                     Leave-one-out   Only-one     Leave-one-out   Only-one
Lexical Case         88.74           88.70        82.60           82.35
Case Disambiguated   89.14           88.29        83.08           82.04
Agreement            89.12           88.04        83.19           81.54
Complex Predicates   89.12           88.34        83.03           81.79
Ezafe                -               -            83.19           81.49
Baseline             -               87.82        -               81.34
All features         89.34           -            83.38           -

Table 4.19: Accuracies (LAS) for leave-one-out and only-one evaluation. All features and Baseline
serve for comparison: the former includes all the features, while the latter excludes them all.

4.3.2 Comparison with the IL Feature Representation


Even though the original ZN feature model15 defines higher-order features, it lacks feature
interactions that are beneficial for parsing morphologically-rich languages like Hindi and Urdu. The
reason for this deficiency is that the feature model was originally proposed and evaluated on English
and Chinese, which differ significantly from Hindi and Urdu in morphology and syntax. The results in
Tables 4.15 and 4.16 above clearly show the impact of modeling syntactic phenomena of Hindi and Urdu
on top of the (extended) ZN feature model. The IL feature model (Table 4.2), on the other hand, is very
shallow and hardly defines any higher-order features, which are necessary for parsing in general. However,
it does model features related to case marking and grammatical agreement. Though individually both
models suffer from feature incompleteness, together they can complement each other.
We discussed the impact of the higher-order representation of different morphological and syntactic
features above. For comparison, we evaluate here an alternative representation of the morpho-syntactic
features discussed in this chapter. Instead of our representations, we complement the original ZN feature
model [203] with the feature interactions related to chunk tags, lemmas, case and agreement as they are
represented in the IL feature model. We performed multiple experiments to compare and evaluate the
first-order16 representation of these features in the IL model against the higher-order representation in
our extended ZN model. In Table 4.20, we report the results using the IL feature template and the ZN
template extended with chunk tags, lemmas, case and agreement features according to the IL template
(cf. Table 4.2).
15 The feature model proposed in the work of Zhang and Nivre [203] does not define any features related to chunk tags,
lemmas, clusters etc.
16 These features are mostly defined as first-order features in the IL feature model.

Indian Language Feature Template
Feature       Hindi                                         Urdu
              UAS            LS             LAS             UAS            LS             LAS
POS           90.09          85.03          82.71           85.53          80.31          76.70
+Chunk        90.65 (+0.56)  85.31 (+0.28)  83.10 (+0.39)   85.76 (+0.23)  80.46 (+0.15)  76.89 (+0.19)
+Lemma        92.75 (+2.10)  88.86 (+3.55)  86.33 (+3.23)   86.01 (+0.25)  81.70 (+1.24)  77.92 (+1.03)
+Case         93.01 (+0.26)  89.55 (+0.69)  86.96 (+0.63)   87.30 (+1.29)  83.55 (+1.85)  79.64 (+1.72)
+Agreement    92.85 (−0.07)  89.55 (+0.01)  86.87 (−0.08)   87.27 (−0.01)  83.56 (+0.03)  79.67 (+0.01)

Zhang & Nivre’s Feature Template
POS           92.77          89.51          86.79           88.14          84.41          80.57
+Chunk        93.02 (+0.25)  89.79 (+0.28)  87.03 (+0.24)   88.28 (+0.14)  84.64 (+0.23)  80.79 (+0.22)
+Lemma        93.13 (+0.11)  89.92 (+0.13)  87.15 (+0.12)   88.42 (+0.14)  84.74 (+0.10)  80.90 (+0.11)
+Case         93.27 (+0.14)  90.26 (+0.34)  87.47 (+0.32)   88.65 (+0.23)  85.03 (+0.29)  81.17 (+0.27)
+Agreement    93.24 (−0.03)  90.27 (+0.01)  87.45 (−0.02)   88.59 (−0.03)  84.99 (+0.05)  81.12 (+0.01)

Table 4.20: Parsing performance of Hindi and Urdu using the feature template defined in previous
works on transition-based parsing of Indian languages, compared with the ZN feature template.
Extra features are added to the ZN template according to their interactions in the Indian Language
feature template. Values in parentheses are improvements over the preceding row.

In comparison with the ZN feature template, the POS-based features in the IL template are very shallow,
which is clearly reflected in the parsing accuracies obtained with the two templates. The ZN feature model
is better than the IL model by around ∼4% LAS in both Hindi and Urdu. Interestingly, features related
to case marking and lemmas complement the POS-based features in the IL template. As shown in Table
4.20, these features lead to large improvements over the POS-based features. However, a similar feature
representation of chunk tags, lemmas and case marking in the original ZN feature model performs com-
paratively worse than our higher-order representation of these features. Our rich representation of chunk
tags and lemmas significantly improves parsing as opposed to the local representation of these features
(compare the improvements from chunk tags and lemmas in the lower half of Table 4.20 with the improve-
ments in Tables 4.4 and 4.5). Similarly, our higher-order representation of case improved results by an
average of 0.8% LAS (see Tables 4.15 and 4.16), while the first-order representation of these features as
per the IL model improved parsing by only an average of 0.3% LAS. In the case of agreement features,
there is a drop in accuracies, while our representation of agreement further improved the results.

4.4 Related Work and Comparison
The application of beam search and global learning [202] and the introduction of dynamic oracles
[76, 77] have created a paradigm shift in transition-based parsing methods. Both methods handle error
propagation effectively, which was the major drawback of greedy transition-based parsing systems. Fur-
thermore, Zhang and Nivre [203] introduced higher-order features for a beam-search transition-based
parser, which further improved the performance of these parsers. Our results on Hindi and Urdu attest
to the significance of these rich higher-order features cross-linguistically. In comparison to the POS-based
features in the IL feature model, the ZN features are very rich. However, these features are still in-
complete and lack essential feature interactions that are relevant for parsing morphologically rich
languages like Hindi and Urdu.
Unlike fixed-order languages like English, morphologically-rich languages have been shown to pose
a multitude of challenges to statistical parsing. The challenges are mostly related to model architecture
(cascaded vs joint modeling of morphological and syntactic prediction), representation of morphological
information and lexical diversity [185, 186]. In this chapter, we have dealt with the feature representation
and modeling of morphological information in a transition-based parser of Hindi and Urdu. Our results
show that proper representation of morphological information is fruitful for parsing these languages.
Among different morphological features, case and agreement have been widely explored in statistical
parsing of morphologically-rich languages. Grammatical agreement has been repeatedly shown to give
mixed results, while case information has proved to be highly beneficial across languages [18, 74, 130,
185]. Seeker and Kuhn [172] used case markers as an underspecified filtering device that guides the
parser by restricting its search space. They have reported significant improvements over the state-of-
the-art parsing models of Czech, German, and Hungarian. Bengoetxea and Gojenola [17] proposed
a stacking-based approach to propagate relevant morphological features like case and TAM markers
across head and modifier nodes in a parse tree. They evaluated their approach on Basque and reported
significant gains in accuracy. The benefit of feature propagation is that it improves visibility of certain
non-local features like agreement. However, parser stacking is not necessary to propagate these non-
local features. Instead they can be used as higher-order features during parsing itself (cf. §4.2.2.2 and
§4.2.2.3). Likewise, Goldberg and Elhadad [74] have also captured agreement intelligently in their
easy-first dependency parsing model of Hebrew. They have captured agreement directly on the arcs
between relevant word pairs. Apart from dependency parsing, agreement has also been reported to be
beneficial for phrase structure parsing if efficiently incorporated in the parsing model [75, 184]. Finally,
Hohensee and Bender [88] have shown the impact of different morphological features across a range of
languages.17 They have reported very high improvements in dependency parsing of 21 languages.
Complex predicates are very common in South Asian languages like Hindi and Urdu. They differ
significantly from simple predicates with respect to their form and structure (cf. §4.2.2.4). The issues
related to the representation of complex predicates in lexical resources like treebanks [14] and wordnets
17 The results reported by Hohensee and Bender [88] have been invalidated due to a bug in their implementation (see [130,
p. 169]).

[35] have been extensively discussed and investigated. However, to the best of our knowledge, there is no
work that discusses their representation in parsing. Begum et al. [15] have studied the impact of complex
predicates on dependency parsing of Hindi texts. Their work, however, deals with the identification of
complex predicates during parsing rather than their representation. In this work, we have proposed
an alternative representation of complex predicates for parsing. We model the host nominal and light
verb of a complex predicate as a single lexical item which significantly improved the parsing of their
argument structure. Our approach is similar to [45, 56, 68, and the references therein], who propose to
model multiword expressions as single lexical units to improve parsing. However, these works collapse
a multiword expression into a single node, whereas in our case the tree structure remains intact: we do
not remove the host nominal from the tree, so it retains its individual modifiers apart from the arguments
it shares with the light verb.
As in Urdu, ezafe constructions are not explicitly marked (with a diacritic) in Persian text.
Nourian et al. [154] have proposed to represent ezafe explicitly by adding an indicator to the related
POS tags in the Persian treebank. They showed that such an explicit representation substantially im-
proves dependency parsing with both gold and predicted POS tags. Unlike in Persian, ezafe in Urdu is
limited to a subset of words, mainly of Persian origin (see §4.2.2.5 for exceptional cases). Moreover,
ezafe is not explicitly represented in the Urdu treebank. Therefore, instead of identifying ezafe beforehand
at the level of POS tagging, we add a few heuristics to our parsing model for its identification.

4.5 Summary
In this chapter, we have thoroughly explored transition-based dependency parsing of Hindi and Urdu.
We have explored rich morphological features derived from local contexts and also from the parse his-
tory for improved parsing of verb arguments. We have brought substantial improvements in parsing
accuracies over the state-of-the-art for both languages. We have mainly explored and discussed syntac-
tically relevant phenomena of Hindi and Urdu and the ways to capture their interactions in an arc-eager
parsing system. In particular, we have explored case and TAM interactions, case ambiguity, complex
predication and ezafe identification in Hindi and Urdu and showed how properly handling these phe-
nomena can improve parsing of these languages.

Chapter 5

Non-Projectivity and Scrambling: Trade-off for Morphological Richness

In recent years non-projective structures have been widely studied across different languages. These
dependency structures have been reported to restrict the parsing efficiency and pose problems for gram-
matical formalisms. Non-projective structures are particularly frequent in morphologically rich lan-
guages like Czech and Hindi [111, 127]. In Hindi a major chunk of parse errors are due to non-projective
structures [91], which motivates a thorough analysis of these structures at both the linguistic and formal
levels. In this chapter, we study non-projectivity in Indian languages (ILs), which are morphologically
rich with relatively free word order. We present a formal characterization and linguistic categorization
of non-projective dependency structures across four Indian language treebanks. We also show that for
each type of non-projective structure there exists a definitive syntactic cue which can be used in conjunc-
tion with pseudo-projective transformations for its accurate parsing. In addition to non-projectivity, we
also show how to mitigate the effect of sampling bias on a parser that is trained only on the canonical
structures of a language, and propose a simple resource-light method to parse non-canonical structures
in morphologically rich languages.

5.1 Non-projectivity
Non-projective structures, in contrast to projective dependency structures, contain a node with a dis-
continuous yield. These structures are common in natural languages and particularly frequent in morpho-
logically rich languages with flexible word order like Czech and German. In the recent past, the formal
characterization of non-projective structures has been thoroughly studied, motivated by the challenges
these structures pose to dependency parsing [85, 110, 133]. Other studies have tried to provide an
adequate linguistic description of non-projectivity in individual languages [82, 127]. Mannem et al.
[127] have done a preliminary study on the Hyderabad dependency treebank (HyDT), a pilot dependency
treebank of Hindi containing 1,865 sentences annotated with dependency structures. They identified
the different non-projective construction types present in the treebank. Recently, Kulkarni
et al. [115] have discussed the issues of non-projectivity in Sanskrit. They found that adjectival
and genitive relations are mostly involved in Sannidhi violations (or violations of the planarity constraint)
in Sanskrit. In this chapter, we present our analysis of non-projectivity across four IL treebanks. ILs are
morphologically rich; grammatical relations are expressed via the morphology of words rather than through
syntax. This allows words in these languages to move around in the sentence structure. As we will see in
subsequent sections, such movements quite often lead to non-projectivity in the dependency structure.
We studied treebanks of four Indian languages, viz. Hindi (Indo-Aryan), Urdu (Indo-Aryan), Bengali
(Indo-Aryan) and Telugu (Dravidian). They all have an unmarked Subject-Object-Verb (SOV) word or-
der; however, the order can be altered under appropriate pragmatic conditions. Movement of arguments
and modifiers away from their heads is the major phenomenon that induces non-projectivity in
these languages.
In this section, we discuss the constraints and measures evaluated by Kuhlmann and Nivre [111]
and Nivre [148]. We evaluate these measures on IL treebanks, followed by an adequate linguistic
description of non-projective structures, focusing on the identification and categorization of grammatical
structures that can readily undergo non-projectivity and the possible reasons for it. Despite the
fact that non-projective structures are comparatively difficult to parse, they contain strong syntactic cues
which can play a great role in their identification. We show that the pseudo-projective transformations of
Nivre and Nilsson [151], coupled with these syntactic cues, can effectively parse non-projective structures
in Indian languages like Hindi and Urdu.

5.1.1 Dependency Graph and its properties


In this section, we give a formal definition of a dependency tree, and subsequently define different
constraints on dependency trees such as projectivity, planarity and well-nestedness. In our discussion of
these constraints, we borrow standard definitions and notation from [111].
Dependency Tree: A dependency tree T = (V, A) is a directed graph where V = {wi | i ∈ [0, n]} is an
ordered set of words and A is a set of arcs representing dependency relations on V. Every dependency tree satisfies two
properties: a) it is acyclic, and b) every node has in-degree 1, except the root node, which has in-degree 0.

5.1.1.1 Condition of Projectivity

The condition of projectivity, in contrast to acyclicity and the in-degree constraint, concerns the interaction between the
dependency relations and the projection of these relations onto the linear order of nodes in a sentence.
Projectivity: A dependency tree T is projective if it satisfies the following condition: for every arc i → j and every node υ ∈ (i, j),
υ ∈ Subtreei . Otherwise T is non-projective.
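The condition can be checked directly from a head array, as in the following minimal sketch (the array encoding and function name are ours for illustration, not from the parsers used in this thesis):

```python
def is_projective(heads):
    """Check the projectivity condition on a dependency tree.

    heads[i-1] is the head of word i (words are numbered 1..n;
    head 0 is the artificial root). An arc i -> j is projective
    iff every word strictly between i and j is in Subtree_i.
    """
    n = len(heads)
    for d in range(1, n + 1):
        h = heads[d - 1]
        for v in range(min(h, d) + 1, max(h, d)):
            # walk up from v; it must reach h before reaching the root
            a = v
            while a != 0 and a != h:
                a = heads[a - 1]
            if a != h:
                return False
    return True

print(is_projective([2, 0, 4, 2]))  # True: no arc spans a non-descendant
print(is_projective([2, 0, 2, 1]))  # False: arc 1 -> 4 spans word 2, which is not in Subtree_1
```

The check is quadratic in sentence length, which is adequate for treebank analysis; arcs from the artificial root are always projective under this definition.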

5.1.1.2 Relaxations of Projectivity

As Nivre [148] remarks, natural languages permit grammatical constructions that violate the condition
of projectivity. In the following, we define the global and edge-based constraints that have been proposed
to relax projectivity.

• Planarity: A dependency tree T is non-planar if there are two edges i1 ↔ j1 , i2 ↔ j2 in T such that
i1 < i2 < j1 < j2 . Otherwise T is planar.

Planarity is a relaxation of projectivity and a strictly weaker constraint than it. Non-planarity can be
visualized as ‘crossing arcs’ in the horizontal representation of a dependency tree.

• Well-nestedness: A dependency tree is ill-nested if two disjoint non-projective subtrees interleave. Two
disjoint subtrees l1 , r1 and l2 , r2 interleave if l1 < l2 < r1 < r2 . A dependency tree is well-nested if no
two of its non-projective subtrees interleave [37].

• Gap Degree: The gap degree of a node in a dependency tree is the number of gaps in its projec-
tion. A gap is a pair of nodes (xn , xn+1 ) adjacent in πx (the projection of x) such that xn+1 − xn > 1. The gap
degree of a node gd(xn ) is the number of such gaps in its projection, and the gap degree of a sentence
is the maximum among the gap degrees of its nodes [111]. The gap degree corresponds to the maximal
number of times the yield of a node is interrupted. A node with gap degree > 0 is non-projective.

• Edge Degree: For any edge in a dependency tree, the edge degree is the number of con-
nected components in the span of the edge which are not dominated by the parent node of the
edge; edi↔ j is the number of components in span(i, j) which do not belong to π parenti↔ j .
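The gap degree measure is straightforward to compute from a head array; the sketch below is our own illustrative code, not the evaluation scripts behind Table 5.1:

```python
from collections import defaultdict

def gap_degree(heads):
    """Gap degree of each node and of the whole sentence.

    heads[i-1] is the head of word i (1-based; 0 = artificial root).
    A gap is a pair of nodes adjacent in the sorted projection of a
    node whose sentence positions differ by more than 1.
    """
    children = defaultdict(list)
    for d, h in enumerate(heads, start=1):
        children[h].append(d)

    def projection(x):          # the yield of x, in sentence order
        nodes = [x]
        for c in children[x]:
            nodes.extend(projection(c))
        return sorted(nodes)

    gd = {}
    for x in range(1, len(heads) + 1):
        pi = projection(x)
        gd[x] = sum(1 for a, b in zip(pi, pi[1:]) if b - a > 1)
    return gd, max(gd.values())

# word 4 depends on word 1 across word 2's subtree: the projection of
# node 1 is {1, 4}, which has one gap, so the sentence has gap degree 1
gd, sent_gd = gap_degree([2, 0, 2, 1])
print(sent_gd)  # 1
```

A sentence-level gap degree of 0 coincides with projectivity, which is why the treebank statistics reported below break non-projective structures down by gd1, gd2 and gd3.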

5.1.2 Evaluation of Tree Constraints on IL Treebanks


In this section, we present an experimental evaluation of the dependency tree constraints mentioned
in the previous section on the dependency structures across IL treebanks. Among the treebanks, the Hindi
treebank, owing to its relatively large size, provides good insight into the possible construction types that
permit non-projectivity in ILs. The Urdu and Bengali treebanks, though comparatively smaller in size,
show similar construction types permitting non-projectivity. Telugu, on the other hand, as reflected by
the analysis of the Telugu treebank, does not have any non-projective structures. The types of
potentially non-projective constructions and the phenomena inducing non-projectivity in them are listed in Table
5.2. In Table 5.1, we report the percentage of structures that satisfy various graph properties across IL
treebanks. Among the IL treebanks, Urdu has 23%, Hindi 15% and Bengali 5% non-projective structures.
These figures are comparable to other free word order languages like Czech and Danish, which
have non-projectivity in 23% (out of 73,088 sentences) and 15% (out of 4,393 sentences) of their sentences, respectively
[85, 111]. In the Hindi and Urdu treebanks, the highest gap degree and edge degree for non-projective structures
are 3 and 4 respectively, which tallies with the previous results on the Hindi treebank [127]. As shown in Table
5.1, planarity accounts for more data than projectivity, while almost all the structures are well-nested:
Hindi has 99.7%, Urdu 98.3% and Bengali 99.8% well-nested structures. Despite the
high coverage of the well-nestedness constraint in these languages, there are linguistic phenomena which
give rise to ill-nested structures. The roughly 1% of ill-nested structures are not annotation errors but are
linguistically justified. Phenomena observed upon close inspection of the treebanks
include extraposition and topicalization of verbal arguments across clausal conjunctions. Extraposition, as a

reason behind ill-nestedness, is also observed by Maier and Lichte [124]. Sentence (21) shows a typical
ill-nested dependency analysis of a sentence from the Hindi treebank. In this sentence, vyaktī ‘person’ in
the complement clause is relativized by an extraposed relative clause, which itself contains a nominal expression
esā koi jawāb ‘any such answer’ relativized by another extraposed relative clause.

(21) unhone kahā ki āj aisā koi bhī vyaktī nahīn hai, jiske pās esā koi jawāb ho, jo
He said that today such any Emp person not is, who near such any answer has, which
sabke liye vaiddhā ho .
all for valid is .
‘He said that today there is no such person, who has any such answer which is valid for all .’

              
<ROOT> unhone kahā ki āj aisā koi bhī vyaktī nahīn hai, jiske pās esā koi jawāb ho, jo sabke liye vaiddhā ho

Languages   gd1    gd2   gd3    ed1    ed2   ed3   ed4    Non-proj  Non-planar  Ill-nested  Non-proj & Planar  Non-proj Edges
Hindi       14.56  0.28  0.02   14.24  0.45  0.11  0.03   14.85     13.62       0.19        1.24               1.65
Urdu        20.58  1.31  0.12   19.20  1.97  0.56  0.22   22.12     20.11       1.66        2.00               2.59
Bangla      5.47   0.00  0.00   5.24   0.16  0.08  0.00   5.47      3.91        0.16        1.25               0.97
(gd = gap degree, ed = edge degree)

Table 5.1: Non-projectivity measures of Dependency Structures in IL treebanks.

S.No.  Phenomenon                                                 Bengali  Hindi  Urdu

1      Discontinuous Genitives                                    0.256    0.152  0.528
2      Extraposed Relative Clauses                                0.167    0.466  0.543
3      Topicalization and Scrambling out of Infinitival Clauses   -        0.024  -
4      Topicalization out of Finite Clauses                       0.145    0.041  0.122
5      Quantifier Floating                                        -        0.006  0.009
6      Scrambling out of Coordination                             -        0.005  -
7      Conditionals                                               0.145    0.231  0.571
8      Clausal Complements                                        0.256    0.725  0.817

Table 5.2: Sources of non-projectivity in IL dependency treebanks. The three language columns
give the relative frequency of occurrence of non-projective DS in the Bengali treebank (1,279 sen-
tences) and in subsets of the Hindi and Urdu treebanks (20,705 and 3,226 sentences respec-
tively).

5.1.3 Analysis and Categorization


In this section, we discuss the different types of constructions that allow non-projectivity and the lin-
guistic phenomena that induce it. We also list the syntactic cues, wherever

available, for the accurate parsing of these structures. Our study of IL treebanks revealed a number of
construction types with non-projectivity, namely Genitive Constructions, Relative Clause Constructions, Con-
ditionals, Clausal Complements, Control Constructions, Coordinated Constructions, Quantified Expressions and
Other Finite Clauses. Some of these formatives, like conditionals, are inherently discontinuous; a majority
of them, however, are projective in their canonical order and can be rendered non-projective under appropri-
ate pragmatic conditions via movement. The movement phenomena observed behind non-
projectivity in IL treebanks are: a) Topicalisation, b) Extraposition, c) Quantifier floating, d) NP Extraction,
and e) Scrambling (any movement other than a–d).
Below we discuss a few of the above-mentioned construction types and the reasons behind the non-
projectivity in them. The examples discussed are from the Hindi and Urdu treebanks.

5.1.3.1 Relative Clause Constructions

In Hindi, Urdu and Bengali, relative clauses have three different orders: they can be left-adjoined,
placed immediately before their head noun; embedded, placed immediately after the head noun; or
extraposed, placed post-verbally, away from the head noun. Since extraposed relative clauses are sepa-
rated from the head noun, this dislocation generates discontinuity in the structure. In example (22), the
nominal expression mojuda swaroop ‘current form’ in the main clause is modified by the extraposed
relative clause; the projection of the head noun mojuda swaroop ‘current form’ is interrupted by its
parent hai ‘is’. Although it is mainly extraposed relative clauses that generate discontinuity, there
are instances in the IL treebanks where even left-adjoined relative clauses are separated from their
head noun by some verbal argument.

(22) iskā mojuda swaroop theory ādharit hai jisko practical ādhirit banāyā jāyegā .
Its current form theory based is which practical based made will be .
‘Its current form is theory based which will be made practice based .’

      
<ROOT> iskā mojuda swaroop theory aadharit hai jisko practical aadhirit banāyā jāyegā

Syntactic Cue: Pronouns or demonstratives in the relative clause and/or in the main clause, such as ‘usa–
jisa’, ‘wo–jo’ etc. In the above example, the cues are the oblique pronouns ‘iskā’ in the main clause and
‘jisko’ in the relative clause.

5.1.3.2 Clausal Complements

Clausal complements, introduced by a complementizer (ki in Hindi/Urdu, je in Bengali), are placed
post-verbally in Hindi-Urdu and Bengali. If the head predicate licensing the clausal complement is other
than the verb, the canonical order is such that the head is positioned pre-verbally and its complement
is extraposed; in this order the structure has an inherent discontinuity. Example (23) shows an extraposed
complement clause of an expletive yaha ‘it’. The head element and the complement clause are at a

distance from each other: the verb likhā hai ‘is written’ in the main clause interferes in the projection
of yaha ‘it’, making the structure discontinuous. Extraposed complement clauses are the major source
of non-projective structures in IL treebanks; in the Hindi treebank, around 42% of non-projective structures
are due to extraposed clausal complements of a non-verbal predicate.

(23) jismẽ yaha bhī likhā hai ki Togadiya jordār dhamāke mẽ māre jāyenge .
In which this also written is that Togadiya powerful blast in killed will be .
‘In which it is also written that Togadiya will be killed in a powerful blast .’

      
<ROOT> jismẽ yaha bhī likhā hai ki Togadiya jordār dhmaake mẽ māre jāyenge

Syntactic Cue: The complementizer ‘ki’ and a proximal pronoun/demonstrative in the main clause.

5.1.3.3 Conditionals

Conditional clauses in Hindi, Urdu and Bengali have an inherently discontinuous structure. In conditionals,
the condition and the consequence clauses are connected by paired connectives (agar–to, ‘if–then’ in
Hindi-Urdu), each connective placed at the initial position of the clause it governs. In the
annotation scheme followed, each clause is separately dominated by its governing connective, and
the connective heading the condition clause is attached to the consequence clause verb, since it is
thematically dependent on it (it specifies the condition for the consequence); the sentence is thus rooted
at the connective dominating the consequence clause, as shown in example (24). The canonical order
of the condition and consequence clauses together with their respective connectives inherently generates
discontinuity.

(24) yadī vaha aisā karne mẽ vifal rehtī hai to hum padosī rājyon ke uplabdh mārg se
If it so doing in fails then we neighboring states of available routes from
jarorī vastuein bhejne kī koshish karenge .
essential things send of try will .
‘If it fails in doing so then we will try to send essential things from available routes of neighbor-
ing states .’

             
<ROOT> yadī vaha aisā karne mẽ vifal rehtī hai to hum padosī rājyon ke uplabdh mārg se jarorī vastuein bhejne kī koshish karenge

Syntactic Cue: Paired connectives such as ‘jab–tab’, ‘yadī–to’ etc. in the main and subordinate clauses.
Even though there is no connective in the main clause in the above example, the relationship between
the two clauses can easily be inferred from the connective ‘yadī’ in the conditional clause.

5.1.3.4 Genitive Constructions

In genitive constructions, the genitive-marked nominal is easily dislocated from the head noun. The
study of IL treebanks shows a varied number of movements out of genitive constructions: the genitive-marked
noun can either be extraposed or be extracted towards the left. The extraction towards the left is, however,
widespread, with a good number of instances in all the treebanks except the Telugu treebank. In example (25), the
genitive-marked pronoun jiskī ‘whose’ has been extracted from its base position to the sentence-initial
position, crossing the subject of the sentence.

(25) jiskī rāshtra ko bhārī kīmat adā karnī padī .


for which country Acc heavy cost pay had to .
‘For which the country had paid a heavy cost .’

    
<ROOT> jiskī rāshtra ko bhārī kīmat adā karnī padī thī

Apart from genitive-marked nominals, there are a few instances of extraction of other case-marked (locative
and ablative) arguments or adjuncts of a head noun. These cases are, however, negligible in comparison
to the dislocation of genitive-marked nominals and the discontinuity that genitive constructions allow. The
reason is the agreement of the genitive marker with the head noun in number and gender, which
gives the genitive-marked nominal extra freedom to move around in the sentence. This property is not
shown by other case markers in Hindi and Urdu.
Syntactic Cue: Agreement between the genitive marker and the possessed noun.

5.1.3.5 Control Constructions

In the ILs under study, verbs can select non-finite complements and adverbial clauses marked with in-
finitive or participle inflections (-kar and -nā in Hindi-Urdu). In such bi-clausal combinations, the non-finite
clause has a null subject controlled by a syntactic argument of the main verb. In IL treebanks such
arguments, which thematically belong to both verbs but are syntactically governed only by the main
verb, are annotated as children of the main verb alone, in view of the single-headedness constraint of
dependency trees. Interestingly, in these control constructions, individual arguments of the non-finite verb
can move around and cross the shared argument (the child of the main verb), generating discontinuity in the
non-finite clause. There are varied occurrences of such discontinuous non-finite clauses in IL treebanks;
example (26) is a reflection of this. In this example, the complement of
the verb dhorāne ‘to repeat’ has moved past the shared arguments unhone ‘he’ and āj ‘today’, which now
intervene between it and the non-finite verb dhorāne ‘to repeat’, separating the complement from its head
and leading to discontinuity in the non-finite clause.

(26) mantrī na banne kā apnā vādā unhone āj dhorāne se perhez kiyā .
minister not become of his promise he-Erg today repeat Abl refrain did .
‘Today he refrained from repeating his promise of not becoming the minister .’

        
<ROOT> mantrī na banne kā apnā vādā unhone āj dhohrāne se parhez kiyā

5.1.3.6 Quantified Expressions

In quantified expressions, quantifiers canonically precede the head noun. However, as with the other phe-
nomena discussed in previous sections, the quantifier can be left floating by dislocating it from its head.
The dislocation is not restricted to a particular direction; it can be either leftward or right-
ward. Quantifier float can introduce discontinuity into the projection of a quantified expression if the head
noun crosses an element which it does not dominate. In example (27), the noun vote ‘vote’ has moved from
its canonical position, leaving the quantifier kam ‘less’ floating in situ, in such a way that the subject
musalmānon ne ‘muslims’ and the adverbial isliye ‘because’ interfere with the projection of the head
noun vote ‘vote’, causing non-projectivity, as can be seen in the tree given.

(27) unhone kahā ki Mayawati ko vote isliye musalmānon ne kam diye kyunki Mayawati
He said that Mayawati Dat vote because Muslims Erg less gave since Mayawati
hukumat unkī tawaqāt par nahīn utarīn .
Government their expectations on didn’t fulfilled .
‘He said that Muslims gave fewer votes to Mayawati since the Mayawati Government did not meet their
expectations .’

             
<ROOT> unhone kahā ki Mayawati ko vote isliye musalmānon ne kam diye kyunki Mayawati hukumat unkī tawaqāt par nahīn utarī

5.1.3.7 Other Finite Clauses

Verb arguments have the freedom to move around in the clause structure in ILs. In IL treebanks, arguments
either of the main clause or of a subordinate clause may move from their canonical position and occupy
the sentence- or clause-initial position. Non-projectivity is induced when an argument or adverbial
moves: a) from the main clause to the sentence-initial position, crossing a sentence-initial subordinate
or coordinate conjunction; b) from an extraposed subordinate clause to the sentence-initial position; or
c) from an extraposed subordinate clause to the clause-initial position, crossing the subordinate conjunction.
In example (28), senā ‘army’, an argument of the subordinate clause verb legī ‘take’, has moved across
the projection of the main clause verb.

(28) senā hamein nahī lagtā jaldī is tarah kā koi faisalā legī .
Army we not feel soon this way of any decision take .
‘We don’t feel(that) army would take any such decision soon .’

       
<ROOT> senā hamein nahī lagtā jaldī is tarah kā koi faisalā legī

5.1.4 Parsing Non-projective Structures
As we noted in Chapter 2, the arc-eager parsing algorithm is restricted to projective trees and cannot
parse non-projective arcs. Furthermore, we saw in §5.1 that Hindi and Urdu have around 2% non-
projective arcs, which implies a loss of 2% UAS if we use the arc-eager system as is. To
prevent this loss of accuracy in Hindi and Urdu parsing, we apply the pseudo-projective transformations of
Nivre and Nilsson [152] to projectivize non-projective arcs. Since dependency trees are labeled,
we can linearize the non-projective arcs by moving their heads up in the tree while preserving the lift
information in their dependency labels. Nivre and Nilsson [152] define three encoding schemes to
preserve information about the linear transformation of non-projective arcs. Here, we briefly discuss
these encodings with respect to Example 25.

1. Head Scheme: The lift information is encoded in a new label of the form d↑h, where d is the
dependency relation between the syntactic head and the dependent in the non-projective repre-
sentation, and h is the dependency relation that the syntactic head has with its own head in the
original non-projective structure.

r6↑k2
k1
k2
pof

jiskī rashtra ko bhārī kīmat adā karnī padī thī

2. Head+Path Scheme: This scheme, in addition, modifies every arc along the lifting path (l↓ instead
of l).

r6↑k2
k1
k2↓
pof

jiskī rashtra ko bhārī kīmat adā karnī padī thī

3. Path Scheme: This is the same as the Head+Path encoding but drops the information about the syntactic
head of the lifted arc (d↑ instead of d↑h).

r6↑
k1
k2↓
pof

jiskī rashtra ko bhārī kīmat adā karnī padī thī
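The lifting operation of the Head scheme can be sketched as follows. This is our own illustrative code, not the implementation used in the experiments; the caret '^' stands in for the ↑ in labels like r6↑k2:

```python
def projectivize_head_scheme(heads, labels):
    """Pseudo-projective transformation with the Head encoding (a sketch
    of Nivre and Nilsson's scheme). heads[i-1] is the head of word i
    (1-based; 0 = artificial root); labels[i-1] is its dependency label.
    """
    heads, labels = list(heads), list(labels)

    def non_projective_arcs():
        arcs = []
        for d in range(1, len(heads) + 1):
            h = heads[d - 1]
            for v in range(min(h, d) + 1, max(h, d)):
                a = v
                while a not in (0, h):      # climb from v towards the root
                    a = heads[a - 1]
                if a != h:                  # v is not in Subtree_h
                    arcs.append((h, d))
                    break
        return arcs

    while True:
        np_arcs = non_projective_arcs()
        if not np_arcs:
            return heads, labels
        # lift the shortest non-projective arc first
        h, d = min(np_arcs, key=lambda arc: abs(arc[0] - arc[1]))
        if "^" not in labels[d - 1]:        # record only the first lift: d^h
            labels[d - 1] += "^" + labels[h - 1]
        heads[d - 1] = heads[h - 1]         # re-attach d to the head of h

# Simplified version of Example 25: jiski -> kimat (r6) is non-projective,
# so jiski is lifted to the verb and its label becomes r6^k2
print(projectivize_head_scheme([3, 4, 4, 0], ["r6", "k1", "k2", "root"]))
# ([4, 4, 4, 0], ['r6^k2', 'k1', 'k2', 'root'])
```

Lifts are repeated until the tree is projective; the Path and Head+Path schemes differ only in which labels along the lifting path are additionally rewritten.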

At parsing time, an inverse transformation using breadth-first search can be applied to recover the non-
projective arcs efficiently. There is, however, a trade-off between parsing accuracy and parsing

time, as these transformations can increase the cardinality of the label set to n-square1 .
Nevertheless, we use the encoding schemes proposed by Nivre and Nilsson and evaluate them for
Hindi and Urdu. We apply the pseudo-projective transformations to the training sentences as a pre-processing step
and the inverse transformations to the parsed test sentences as a post-processing step. The impact of the pseudo-
projective transformations on the parsing of non-projective arcs and on the overall test sets is reported
in Tables 5.3 and 5.4 respectively. All the encoding schemes show more or less similar results on both
test sets; however, the Head scheme gives overall better results than the other two. On average, the
pseudo-projective transformations helped to parse 73% of the non-projective arcs in the Hindi and Urdu test sets correctly,
while improving the overall parsing accuracy by an average of 2% LAS.

Hindi Urdu
Score
Head Path Head+Path Head Path Head+Path
UAS 76.57 77.40 76.33 67.47 65.89 65.89
LS 84.85 85.92 84.62 78.68 78.68 78.53
LAS 72.90 73.85 72.54 60.43 58.71 58.85

Table 5.3: Impact of different encoding schemes on parsing of non-projective arcs in Hindi and
Urdu test sets.

Hindi Urdu
Score
Baseline Head Path Head+Path Baseline Head Path Head+Path
UAS 89.68 91.99 92.03 91.98 84.86 86.60 86.57 86.54
LS 84.29 86.98 86.95 86.83 76.46 78.39 78.46 78.29
LAS 81.82 84.14 84.10 84.02 72.77 74.43 74.30 74.23

Table 5.4: Impact of different encoding schemes on parsing of Hindi and Urdu test sets.

In the previous section, we identified several syntactic cues in relative clauses, conditionals, com-
plement clauses and genitives, which together cover ∼95% of the non-projectivity in the Hindi and Urdu treebanks.
In Chapter 4, we showed how agreement between the genitive marker and the head noun can be mod-
eled for better parsing of dislocated genitives. Here, we show how to model the syntactic cues in the other
non-projective structures. The goal of modeling these cues is that the parser should attach the dependent
node of a non-projective arc to its linear head with an appropriate label, so that the non-projective
arc can be correctly decoded after parsing. Consider the case of non-projectivity in the relative clause con-
struction in Example 22. Figure 5.1 shows a parser configuration in which the linear head of the relative
clause is the top node of the stack and the head of the relative clause is the top node of the buffer. In
this configuration, the two oblique pronouns in the subtrees of S0 and B0 are strong indicators that B0,
as the relative clause, would modify S0 . The information about the syntactic head of the relative clause
1 At least in the case of the most informative encoding scheme, i.e., Head+Path. n is the cardinality of the original label set.

can be learned by using a combined feature representing the pronoun in the subtree of S0 and the depen-
dency relation of its head. Similarly, the syntactic cues in the other types of non-projective structures are
explicitly used as higher-order features in the relevant parser configurations. Adding these heuristics led
to further improvements in the parsing of both the Hindi and Urdu test sets. With respect to non-projectivity,
these heuristics improved the parsing of a further ∼5% of non-projective structures. The results are reported
in Table 5.5.
None nmod relc↑k1
k1↓ None
r6 None None

ROOT iskā mojuda swarop theory ādharit hai jisko practical ādhirit banāyā jāyegā .

Stack Buffer
hai banāyā-jayegā

mojuda swarop jisko

iskā

Figure 5.1: Parser configuration showing heads of the main and relative clauses on top of the stack and
buffer.

Hindi Urdu
Model
UAS LS LAS UAS LS LAS
¬Heuristics 91.99 86.98 84.14 86.60 78.39 74.43
+Heuristics 92.19 (+0.2) 87.08 (+0.1) 84.28 (+0.14) 86.85 (+0.25) 78.58 (+0.19) 74.61 (+0.18)

Table 5.5: Impact of different heuristics on parsing of Hindi and Urdu test sets.
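The relative-clause cue described above can be sketched as a feature extractor over the S0 and B0 subtrees. This is a toy illustration: the pronoun lists, the data structures and the feature-string format are our assumptions, not the parser's actual feature templates:

```python
def relc_cue_features(s0, b0):
    """Higher-order cue feature for extraposed relative clauses.

    s0 and b0 are subtrees given as dicts {"forms": [...], "deprel": str}.
    The feature fires when the S0 subtree contains a correlative pronoun
    and the B0 subtree a relative pronoun (e.g. the iska ... jisko pair),
    combining the pair with the dependency relation of S0's head.
    """
    CORRELATIVE = {"iska", "uska", "wo", "usa", "yaha"}   # main-clause side
    RELATIVE = {"jiski", "jisko", "jo", "jiske"}          # relative-clause side
    feats = []
    dem = next((w for w in s0["forms"] if w in CORRELATIVE), None)
    rel = next((w for w in b0["forms"] if w in RELATIVE), None)
    if dem and rel:
        feats.append("relc_cue=%s+%s+%s" % (dem, rel, s0["deprel"]))
    return feats

s0 = {"forms": ["iska", "mojuda", "swaroop", "hai"], "deprel": "root"}
b0 = {"forms": ["jisko", "banaya", "jayega"], "deprel": "nmod_relc"}
print(relc_cue_features(s0, b0))  # ['relc_cue=iska+jisko+root']
```

In practice such a feature would be conjoined with the usual S0/B0 templates of the transition-based parser, so that the RIGHT-ARC decision attaching the extraposed clause receives direct evidence from the pronoun pair.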

5.2 Scrambling
Every language is assumed to have a basic, default or unmarked constituent order. In fixed word or-
der languages, word order defines the grammatical structure of a clause and the grammatical relations
therein, while in free word order languages morphology determines the grammatical relations between
constituents in a clause. In the latter case, constituent ordering mostly has a pragmatic function, determin-
ing the information structure of a sentence; words or constituents can therefore leave their canonical
positions to achieve a certain pragmatic effect. Scrambling of arguments is common particularly
in day-to-day conversation, while less prevalent in formal text or speech. Argument scrambling often
leads to discontinuities in syntactic constituents and to long-distance dependencies. In Indian languages,
since case markers (either clitics or affixes) carry the information about the relations between words,
words can change their positions in a sentence freely. Given appropriate pragmatic conditions, a
simple sentence (containing a single verb) in ILs allows n factorial2 (n!) permutations. With respect to
parsing, scrambling of arguments is a major concern: such constituent/chunk scramblings worsen
the data sparsity problem, since data-driven parsers are usually trained on a treebank of limited size in which
many of the valid structures may never show up. Indian language treebanks are annotated on newswire
text, a type of formal text, so the chances of any deviation from the canonical word order are low.
Consider the simple Hindi sentence shown in Example 29 with four constituents or chunks. The sentence
can be permuted in 24 different ways. We parsed all of its 24 permutations with a parser3 trained on the
Hindi treebank training set. Among the 24 parses, only one exactly matched the actual annotation.
The overall performance in terms of UAS, LS and LAS is 72.5%, 61.67% and 52.5% respectively, which
is almost 30% LAS below its performance on the same-domain test set (cf. Table 5.6). The parser
seems to follow the canonical order of Hindi strictly. For example, whenever the object kitāb ‘book’ has
moved from the position left-adjacent to the verb dī ‘give’ to some other position, it is hardly ever identified
as k2 ‘theme’, as shown in Figure 5.2.

(29) Ram ne Gopal ko kitāb dī .


Ram-3MSg Erg Gopal-3MSg Dat book-3FSg.Nom give-Perf .
Ram gave Gopal a book .

root

k4|k2
k2|k7t
k1 rsym

Gopal-ko kitāb Ram-ne dī .

Figure 5.2: Dependency parse tree for Example 29. Relations on the left of the pipe are true labels
while the ones on the right are predicted.

Interestingly, none of the 24 structures were wrongly tagged or chunked by our POS tagger and
chunker. The obvious reason is that both POS tagging and chunking depend mostly on the
focus word itself and its local context (words within a constituent). Parsing decisions, however, de-
pend, to say the least, on word pairs and their linear order, which determines the arc directionality.
In the case of an arc-eager parser, LEFT-ARC and RIGHT-ARC capture the head directionality parameter
prevalent in a language. The parameters for these transitions are learned from the treebank, and thus they
will be biased toward the choice of directionality in the treebank data. In this section, we show how to
mitigate this bias by learning from automatically created multiple views of the training data.
To correct the sampling bias in the training set, a simple approach is to either oversample the training
examples belonging to the minority class or undersample the examples belonging to the majority
2n is the number of chunks in a sentence.
3 Since scrambling mainly affects chunks or constituents as a whole, parsing experiments in this section are restricted to
inter-chunk parsing only.

class [191]. Both sampling techniques assume the presence of observations belonging to the minority
class(es) in the training data. In the case of parsing, training data that represents only a sample of
syntactic structures will suffer from the non-representation of certain classes that encode valid arc direc-
tionalities, in addition to the possible under-representation of some output classes. Therefore, it
is not straightforward to use either sampling technique to address the problem of argument scrambling.
We instead generate training examples for different arc directionalities by re-orienting the gold syntac-
tic trees. The synthetic structures are similar to the gold trees but differ with respect to the arc direction
of certain edges. The sampling procedure is shown in Figure 5.3 for Example 29.

dī dī

Ram Gopal kitāb . Ram kitāb Gopal .


=⇒
ne ko ne ko

dī
dī

Ram Gopal kitāb .


=⇒ kitāb Ram
ne
Gopal
ko
.
ne ko

dī dī

Ram Gopal kitāb .


=⇒ Ram kitāb Gopal .
ne ko ne ko

Figure 5.3: Multiple argument scrambling in Example 29. Dashed edges represent scrambling.

There is one problem with this approach. Consider a data set of n syntactic trees, each containing 10
nodes on average. Allowing all possible scramblings, there would be around n × 3.6 million (10!) possible
permutations. Training on such huge data may not be feasible. In order to restrict the permutations, we
filter out a scrambling if it violates certain constraints. The constraints applied concern projectivity,
the obliqueness of an argument and its out-degree: we only scramble, within a clause, noun phrases that are
oblique and have an out-degree of 0. The steps to generate the permutations of a sentence in the training
data are shown in Algorithm 3.

Given a sentence as a series of chunks C = c1 , . . . , cn and a dependency graph T = (V, A),
where V are chunks and A are arcs of type (ci , c j ):

• Find VG = ∪1≤i≤n {ci : chunk-ID(ci ) = VGF}

• Find VN = ∪ci ∈VG {c j ∈ child(ci ) : chunk-ID(c j ) = case-marked NP}

• Generate all permutations of VN keeping C − VN intact.

Algorithm 3: Algorithm to generate permutations of a chunked sentence.
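Algorithm 3 can be sketched in code as follows. This is illustrative: the chunk-tag names (VGF for the finite verb chunk, NP for noun chunks) and the flat input format are our assumptions, and the projectivity filter mentioned above is omitted since the single-clause example tree is flat:

```python
from itertools import permutations

def scramble(chunks, heads, restricted=True):
    """Generate scrambled chunk orders of a sentence (sketch of Algorithm 3).

    chunks: list of (form, tag) pairs; the case marker is folded into the
            form, e.g. ("Ram-ne", "NP").
    heads:  heads[i-1] is the head chunk of chunk i (1-based; 0 = root).
    Only case-marked NP chunks that are children of a verb chunk and have
    out-degree 0 are permuted; in the restricted setting the verb chunk
    keeps its canonical position, in the unrestricted setting it moves too.
    """
    n = len(chunks)
    out_degree = [0] * (n + 1)
    for h in heads:
        if h:
            out_degree[h] += 1
    verb_ids = {i for i in range(1, n + 1) if chunks[i - 1][1] == "VGF"}
    movable = [i for i in range(1, n + 1)
               if chunks[i - 1][1] == "NP"
               and heads[i - 1] in verb_ids
               and out_degree[i] == 0]
    if not restricted:
        movable += sorted(verb_ids)
    slots = sorted(movable)
    for perm in permutations(movable):
        order = list(range(1, n + 1))
        for slot, chunk_id in zip(slots, perm):
            order[slot - 1] = chunk_id
        yield [chunks[i - 1][0] for i in order]

chunks = [("Ram-ne", "NP"), ("Gopal-ko", "NP"), ("kitab", "NP"),
          ("di", "VGF"), (".", "SYM")]
heads = [4, 4, 4, 0, 4]     # flat tree of Example 29: all chunks head to di
print(len(list(scramble(chunks, heads, restricted=False))))  # 24
```

On the chunked version of Example 29 this yields the 4! = 24 unrestricted orders discussed above (the sentence-final punctuation chunk never moves), and 3! = 6 orders in the restricted setting.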

Furthermore, we define two types of scrambling: restricted and unrestricted. In the former case we
only scramble the noun phrases/chunks, keeping the clause head, i.e. the verb phrase, intact, while in
the latter case we move around the noun phrases as well as the verb phrase. Examples 30, 31 and
32 show both restricted and unrestricted scramblings of sentence 29. In Example 30 the verb dī ‘give’ is
in clause-final position, which is its canonical position, while in Examples 31 and 32 it has been moved
away from the canonical position.

(30) Ram ne kitāb Gopal ko dī .


Ram Erg kitab Gopal Dat give .
Ram gave Gopal a book .

(31) kitāb dī Ram ne Gopal ko .


book give Ram Erg Gopal Dat .
Ram gave Gopal a book .

(32) Ram ne kitāb dī Gopal ko .


Ram Erg book give Gopal Dat .
Ram gave Gopal a book .

We permuted the Hindi test set using Algorithm 3 and generated 2,374 and 11,239 sentences in the
restricted and unrestricted settings respectively. In order to gauge the performance of our parser trained
on the Hindi treebank training set, we tested it on these permuted sentences and on their corresponding normal
sentences. The results are reported in Table 5.6. As expected, there is a substantial drop in accuracy
in both settings, though it is more severe in the unrestricted setting. In unrestricted permutations, there is
no predefined order of constituents, while in restricted permutations at least the verb phrase always
occupies the final position of the clause. The latter are therefore predicted more accurately, given that
the parser is trained on newswire text, where the oblique arguments of a verbal predicate occur to
its left.
S.No.  Type                  UAS    LS     LAS
1.     Restricted            86.86  83.53  76.37
       Restricted-Normal     92.33  84.37  81.89
2.     Unrestricted          55.00  60.77  47.15
       Unrestricted-Normal   92.13  84.19  81.72
3.     All-Scrambled         64.58  66.81  55.81
       All-Normal            90.90  84.51  81.71

Table 5.6: Comparison of results on scrambled and normal views of the test set. The X-Normal sets
contain the original (unscrambled) counterparts of the scrambled sentences in the restricted and unrestricted settings.

To make the parser more robust and generic, so that it can handle scrambling effectively, we created
multiple views of the training data using Algorithm 3. In the unrestricted setting, we
generated 54,501 additional sentences that capture different possible argument orderings in Hindi. Once we have
generated the permutations of the training data, we can use them to train the parser in multiple ways [59].
One simple way is to augment the original data with the permuted data and learn a single model; another
is to train separate models on the two views of the training data and interpolate them at
inference time. In the latter case, the two parsing models are linearly interpolated by combining their
perceptron weights (cf. line 6 of Algorithm 2) as shown in Equation 5.1.

t̂ = argmax_t φ(c_i, t) · (λ_n^i w_n + λ_s^i w_s),   c_i ∈ C          (5.1)

wn and ws are the weight matrices learned from the canonical and non-canonical training data respectively, c is a given parsing configuration and t is a transition. The λ weights are the perplexity scores from trigram language models learned over the POS sequences of the canonical and non-canonical training data. To make sure that the language models capture the order of constituents, we combine the POS tag of a noun with the lexical form of its case marker, as in NN-ne NN-ko NN VM. We tested both approaches on multiple domains of Hindi; the results are reported in Table 5.7. As the table shows, model interpolation is more robust than the augmented model. For conversation data in particular, which may well contain deviations from the canonical order, the UAS increased by almost 1%. We also improved LAS on the original test set, albeit marginally. The augmented model, on the other hand, performs best on the scrambled test set but worst on the other domains, where it drops accuracy by an average of 3% UAS. Nevertheless, both models handle scrambled data better than the original model. Even though the approach is simple and resource-light, it handles scrambled structures with an absolute improvement of 18% over the original model. The approach is also generic and can be applied to other morphologically rich languages and other types of scrambling as well.
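At inference time, the interpolation of Equation 5.1 amounts to scoring each transition under a weighted combination of the two weight matrices. The sketch below is illustrative: it assumes the feature vector and the weight matrices are available as NumPy arrays and that the λ weights have already been derived from the POS-ngram language models, as described above.

```python
import numpy as np

def best_transition(phi, w_canon, w_scram, lam_canon, lam_scram):
    """Return the index of the highest-scoring transition under
    linearly interpolated perceptron weights (cf. Equation 5.1).

    phi      : feature vector of the current configuration, shape (F,)
    w_canon  : weights learned on canonical data, shape (T, F)
    w_scram  : weights learned on scrambled data, shape (T, F)
    lam_*    : per-sentence interpolation weights, assumed precomputed
               from the POS-ngram language-model perplexity scores
    """
    w = lam_canon * w_canon + lam_scram * w_scram
    scores = w @ phi              # one score per transition
    return int(np.argmax(scores))
```

Setting one λ to zero recovers the corresponding single model, so the interpolated parser degrades gracefully when a sentence looks strongly canonical or strongly scrambled.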

                  Canonical            Scrambled            Augmented            Interpolated
Domain            UAS    LS     LAS   UAS    LS     LAS    UAS    LS     LAS    UAS    LS     LAS
Test-canonical    90.90  84.51  81.71 89.33  83.49  79.71  90.11  84.63  81.22  90.91  84.70  81.84
Test-scrambled    64.58  66.81  55.81 86.11  80.40  76.44  86.88  80.90  77.42  82.60  80.27  73.90
Conversation      75.75  68.07  62.06 75.02  67.32  60.06  75.68  68.61  62.15  76.61  68.41  62.60
Boxoffice         86.21  60.95  59.36 82.85  57.12  54.83  81.46  58.61  56.32  86.26  61.05  59.46
Cricket           85.70  58.32  56.24 83.31  57.16  53.56  81.44  57.80  54.77  85.82  58.47  56.40
Gadget            81.63  55.90  54.13 77.47  53.46  51.04  77.68  54.91  52.78  81.60  55.85  54.11
Recipe            81.31  59.24  54.79 78.34  56.67  51.47  77.44  55.31  50.87  81.31  59.24  54.79

Table 5.7: Comparison of results on different domains of Hindi. Results in bold are the best results, while bold italics show the results of the unmarked model on marked data. The conversation data contains 1,600 sentences, while the other domains contain around 500 sentences each.

5.3 Summary
In this chapter, we have looked at the constraints and formal properties of dependency structures across
Indian languages that have been defined in the literature on dependency grammars. We did an in depth
study of dependency structures that allow non-projectivity, identifying the possible reasons that these
languages offer for discontinuous yield of governing categories. We have identified a list of grammat-
ical formatives that are potential sites of discontinuity in the closely related languages namely Hindi,
Urdu and Bengali. We also showed that pseudo-projective transformations of Nivre and Nilsson with
some heuristics are enough to parse these structures accurately. Similarly, we also discussed the issue
of scrambling in morphologically rich languages and empirically showed that scrambling of oblique
arguments can be handled efficiently without a need for further annotation.

Chapter 6

Enhancing Data-driven Parsing of Hindi and Urdu by Leveraging their


Typological Similarity

In computational linguistics, Hindi and Urdu are not viewed together as a monolithic entity and have received separate attention with respect to their text processing. From part-of-speech tagging to machine translation, models are trained separately for Hindi and Urdu despite the fact that they represent the same language. The main reasons are their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. In this chapter, we propose a simple but efficient approach to bridge the lexical and orthographic differences between Hindi and Urdu texts. With respect to text processing, addressing these differences would be beneficial in the following ways: (a) instead of training separate models, their individual resources can be augmented to train single, unified models for better generalization, and (b) their individual text processing applications can be used interchangeably under varied resource conditions.
To remove the script barrier, we learn accurate statistical transliteration models which use sentence-
level decoding to resolve word ambiguity. Similarly, we learn cross-register word embeddings from the
harmonized Hindi and Urdu corpora to nullify their lexical divergences. We demonstrate the effect of
text harmonization on the Hindi and Urdu dependency parsing under two scenarios: (a) resource sharing,
and (b) resource augmentation. We demonstrate that a neural network-based dependency parser trained
on augmented, harmonized Hindi and Urdu resources performs significantly better than the parsing
models trained separately on the individual resources. We also show that we can achieve near state-of-
the-art results when the parsers are used interchangeably.

6.1 Introduction
Hindi and Urdu are spoken primarily in northern India and Pakistan and together constitute the third most widely spoken language in the world.1 They are two standardized registers of what has been called the
Hindustani language, which belongs to the Indo-Aryan language family. Masica [131] explains that,
while they are different languages officially, they are not even different dialects or sub-dialects in a
linguistic sense; rather, they are different literary styles based on the same linguistically defined sub-
dialect. He further explains that at the colloquial level, Hindi and Urdu are nearly identical, both in
terms of core vocabulary and grammar. However, at formal and literary levels, vocabulary differences
begin to loom much larger (Hindi drawing its higher lexicon from Sanskrit and Urdu from Persian and
Arabic) to the point where the two styles/languages become mutually unintelligible. In written form, it is not only the vocabulary but the very way Urdu and Hindi are written that makes one believe they are two separate languages: Hindi is written in Devanagari, and Urdu in a modified Perso-Arabic script. Given these differences in script and vocabulary, Hindi and Urdu are socially, and even officially, considered two separate languages. These apparent divergences have also led to parallel efforts in resource creation and application building in computational linguistics. The Hindi-Urdu treebanking project is one such example, where the differences between Hindi and Urdu texts have led to the creation of separate treebanks [34, 194]. Pursuing them separately in computational linguistics does make sense: if the two texts differ in form and vocabulary, they cannot be processed with the same models unless the differences are accounted for and addressed. In this chapter, we aim to resolve the differences between Hindi and Urdu texts to facilitate
sharing of their resources. We provide a quantitative analysis of their divergences and show that they
diverge least syntactically, while lexical differences are significantly higher. To bridge the lexical and the
orthographic differences between Hindi and Urdu, we propose a simple yet efficient approach based on
machine transliteration and distributional similarity. We learn accurate machine transliteration models
for the common orthographic representation of their texts. To resolve their lexical divergences, we learn
cross-register word embeddings from the harmonized Hindi and Urdu corpora. Finally, we empirically
demonstrate the impact of text harmonization on the dependency parsing of both Hindi and Urdu under
varied supervised training conditions. We show that a neural network-based parser trained on augmented
treebanks sets the new benchmark for dependency parsing of both Hindi and Urdu.
The remainder of the chapter is organized as follows. In §6.2, we discuss the lexical and syntactic divergences between Hindi and Urdu and quantify them using statistical measures over multiple genres and domains. In §6.3, we discuss the common orthographic representation of Hindi and Urdu texts and present an extrinsic evaluation of the different representations on the dependency parsing pipeline. In §6.4, we conduct experiments on the dependency parsing of Hindi and Urdu under two scenarios: the first concerns sharing of resources under resource-poor conditions, and the second concerns augmentation of resources under resource-rich
1 see http://www.ethnologue.com/statistics/size and https://en.wikipedia.org/wiki/List_of_

languages_by_number_of_native_speakers

conditions. We compare and discuss related work in §6.5 and conclude the chapter with possible future directions in §6.6.

6.2 Divergence between Hindi and Urdu


In both Linguistics and Computational Linguistics literature on Hindi and Urdu [131, 141, 164, 170],
questions related to their similarity have been raised but the similarities or divergences have never been
quantified in the true sense of the word. In particular, how their similarities affect their computational
relationship has hardly been addressed. In this section, we will address these questions from a compu-
tational perspective and provide a quantitative analysis of their differences and similarities.

6.2.1 Lexical Variation


As stated by Masica [131], Hindi and Urdu differ significantly in their lexicon. They borrow their literary vocabulary heavily from two separate, typologically unrelated (or distant) sources. An overview of such borrowings, particularly of grammatical words, can be found in [170]. Despite the extensive borrowings from Persian and Arabic, the Urdu lexicon retains a sufficiently large portion of Indic words, which it shares with Hindi due to their common ancestry. To get a clear picture of the quantity and nature of Perso-Arabic borrowings in Urdu, we propose a language identification-based approach to bifurcate the Urdu lexicon according to the source of its words. Based on the source of each word, we then quantify the distribution of Perso-Arabic borrowings across genres of Urdu text and across grammatical classes.

6.2.1.1 Bifurcation of Urdu Lexicon

To quantify Perso-Arabic borrowings in different text genres of Urdu, we need to identify the source language of each token. In corpus linguistics, statistical association measures like log-likelihood, χ2 etc. are used to extract the keywords of a text based on the statistical significance of their frequencies in the given corpora.2 However, corpus linguists are growing skeptical about their use (cf. [36, 99] for more details), and both corpora also have to be in the same script for their vocabularies to be comparable. Instead, we pose the problem as token-level language identification in word-mixed or code-mixed corpora. In this way, the problem is similar to language identification in code-switching/mixing data, which has recently drawn the attention of researchers in the NLP community [177]. However, the usage of Perso-Arabic words in Urdu is not a case of code-switching or mixing but of borrowing. In the latter case, words are naturalized in the target language: they undergo phonetic, morphological and semantic changes. There will therefore be additional challenges in the identification task, as the foreign words may have lost their native morphology (see Table 6.1).
2 In our case, words typical of Urdu text would be mainly of Perso-Arabic origin.

Token-level language identification is a sub-problem of document-level language identification, where the task is to identify the language a given document is written in. However, language identification at the word level is more challenging than a typical document-level identification problem: the number of features available at the document level is much higher than at the word level. The features available for word-level identification are word morphology, syllable structure and the phonemic (letter) inventory of the language(s).
As mentioned above, the problem is even more complex in Urdu, as the borrowed words are naturalized: they do not necessarily carry the inflections of their source language(s) and do not retain their identity as such (they undergo phonetic changes as well). For example, khabar ‘news’, which is an Arabic word, declines as per the morphological paradigm of feminine nominals in Hindi and Urdu, as shown in Table (6.1). Despite such challenges, if we look at the character histogram in Figure (6.1; see the corresponding relative frequencies in Table 6.2), we can still identify the etymology of a sufficiently large portion of the Urdu vocabulary just by using letter-based heuristics. For example, neither Arabic nor Persian has aspirated consonants, while Hindi does: the aspirated bilabial plosives bʱ, pʰ; the aspirated postalveolar affricates tʃʰ, dʒʱ; the aspirated retroflex plosive ɖʱ; the aspirated velar plosives gʱ, kʰ; etc. Conversely, the following sounds occur only in Arabic and Persian: the postalveolar fricative ʒ; the dental fricatives θ, ð; the pharyngeal fricative ħ; the uvular fricative χ; etc. Using these heuristics, we could identify 2,682 tokens as Indic and 3,968 as either Persian or Arabic out of the 12,223 unique tokens in the Urdu treebank [26].
Given the high correlation between a language and its letters or letter clusters, ngram-based approaches have been successfully applied to both document-level and word-level language identification tasks (for more details see [62, 66, 100, 121, 145]).
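The letter-based heuristic amounts to a simple set-membership test. In the sketch below, the character sets are illustrative subsets of the diagnostic sounds discussed above, written in IPA; the full inventories are larger, and the function name is our own.

```python
# Sounds that, among these languages, occur only in Hindi (aspirates)
# or only in Perso-Arabic vocabulary (certain fricatives).
# Illustrative subsets, not the full diagnostic inventories.
INDIC_ONLY = {"bʱ", "pʰ", "tʃʰ", "dʒʱ", "ɖʱ", "gʱ", "kʰ"}
PERSO_ARABIC_ONLY = {"ʒ", "θ", "ð", "ħ", "χ"}

def letter_heuristic(segments):
    """Classify a word, given as a list of IPA segments, by its
    diagnostic letters; return None when no diagnostic letter occurs."""
    segs = set(segments)
    if segs & INDIC_ONLY:
        return "Indic"
    if segs & PERSO_ARABIC_ONLY:
        return "Perso-Arabic"
    return None
```

Words containing no diagnostic letter (the majority of the vocabulary) are left undecided by the heuristic, which is what motivates the statistical classifier of the next section.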

          Singular   Plural
Direct    khabar     khabarain
Oblique   khabar     khabarom

Table 6.1: Morphological paradigm of khabar.

[Bar chart: relative frequency (y-axis) of each consonant of the alphabet, in IPA3 (x-axis), for Arabic, Hindi, Persian and Urdu.]

Figure 6.1: Relative distribution of Arabic, Hindi, Persian and Urdu alphabets (consonants only).
3 http://www.langsci.ucl.ac.uk/ipa/IPA_chart_%28C%292005.pdf

IPA Arabic Hindi Persian Urdu
bH 0.0 0.006 0.0 0.005
Z 0.0 0.0 0.002 0.0
T 0.007 0.0 0.002 0.002
è 0.021 0.0 0.011 0.014
X 0.01 0.0 0.013 0.009
D 0.007 0.0 0.002 0.003
sQ 0.011 0.0 0.007 0.008
tQ 0.011 0.0 0.006 0.006
dQ 0.007 0.0 0.002 0.004
Q 0.039 0.0 0.015 0.017
DQ 0.003 0.0 0.002 0.002
G 0.006 0.0 0.004 0.003
f 0.031 0.0 0.017 0.012
tSh 0.0 0.003 0.0 0.002
q 0.025 0.0 0.013 0.012
ãH 0.0 0.002 0.0 0.001
gH 0.0 0.001 0.0 0.001
kh 0.0 0.007 0.0 0.004
dZH 0.0 0.001 0.0 0.002
N 0.0 0.075 0.0 0.035
ï 0.0 0.008 0.0 0.0
ph 0.0 0.004 0.0 0.002
S 0.012 0.012 0.031 0.011
ù 0.0 0.009 0.0 0.0
th 0.0 0.003 0.0 0.001
”th 0.0 0.006 0.0 0.007
d”H 0.0 0.01 0.0 0.001
tS 0.0 0.011 0.005 0.007
b 0.046 0.012 0.05 0.035
d 0.0 0.012 0.0 0.005
g 0.0 0.021 0.017 0.016
H 0.071 0.04 0.072 0.086
k 0.025 0.05 0.034 0.08
dZ 0.018 0.017 0.014 0.02
m 0.077 0.032 0.064 0.062
l 0.134 0.04 0.036 0.052
n 0.063 0.063 0.084 0.067
p 0.0 0.02 0.012 0.018
s 0.031 0.036 0.047 0.049
r 0.059 0.114 0.108 0.085
t 0.0 0.013 0.0 0.004
”t 0.053 0.059 0.057 0.048
V 0.033 0.025 0.026 0.021
j 0.015 0.041 0.016 0.021
d” 0.038 0.023 0.079 0.034

Table 6.2: Relative distribution of Arabic, Hindi, Persian and Urdu alphabets (consonants only).

To distinguish between native (Indic) and foreign (Perso-Arabic) words in the Urdu lexicon, we formulate the problem as binary classification. We model the classification problem using Multinomial Naive Bayes, whose parameters are learned using smoothed ngram-based language models.

6.2.1.1.1 Multinomial Naive Bayes Given a word w to classify into one of k classes c1 , c2 , ... , ck ,
we will choose the class with the maximum conditional probability:

c* = argmax_{c_i} p(c_i | w) = argmax_{c_i} p(w | c_i) · p(c_i)

The prior distribution p(c) of a class is estimated from the respective training sets shown in Table (6.3).
Each training set is used to train a separate letter-based language model to estimate the probability of
word w. The language model p(w) is implemented as an ngram model using the IRSTLM-Toolkit [70]
with Kneser-Ney smoothing. The language model is defined as:
p(w) = ∏_{i=1}^{n} p(l_i | l_{i−k}^{i−1})

where l is a letter and k is a parameter indicating the amount of context used (e.g., k = 4 means 5-gram
model).

6.2.1.1.2 Etymological Data To prepare training and testing data marked with etymological information for our classification experiments, we used the Online Urdu Dictionary4 (henceforth OUD). OUD has been prepared under the supervision of the E-government Directorate of Pakistan. Apart from basic definitions and meanings, it provides etymological information for more than 120K Urdu words. Since the dictionary is freely5 available and, unlike manual annotation, requires no expertise to extract word etymologies, we could mark etymological information on a reasonably sized word list in a limited time frame. The statistics are provided in Table (6.3). We use Indic as a cover term for all words that come from Sanskrit, Prakrit, Hindi or other local languages.

Language   Data Size   Average Token Length
Arabic     6,524       6.8
Indic      3,002       5.5
Persian    4,613       6.5

Table 6.3: Statistics of etymological data.


4 http://182.180.102.251:8081/oud/default.aspx
5 We are not aware of an offline version of OUD.

6.2.1.1.3 Experiments and Results We carried out a range of experiments to explore the effect of data size and ngram order on classification performance. All experiments were carried out on the etymological data discussed above. We split the data into training, testing and development sets with a ratio of 80:10:10 using stratified sampling. The parameters of our model, such as the ngram order, were tuned on the development set. The results of our language identification model are reported in Table (6.4). The baseline is set using the most frequent class label.

Model           Precision   Recall   F1-Score
Baseline        40.02       50.34    40.62
Indic           88.56       89.23    89.11
Perso-Arabic    94.43       95.15    95.21
Micro-average   91.50       92.19    91.84

Table 6.4: Performance of our language identification model on the test set.

To explore the effect of data size and ngram order on classification performance, we carried out multiple experiments on the training and development sets. We varied the training size by 1% per training iteration and the order of the ngrams from 1 to 5. For each ngram order, 100 experiments were carried out, i.e., 400 experiments overall. The impact of training size and ngram order on classification performance is shown in Figure (6.2). As expected, at every iteration the additional data points introduced into the training data increased the performance of the model. With a mere 3% of the training data, we could reach a reasonable accuracy of 85% in terms of F-score (micro-average).

[Line plot: F-score (y-axis) against training data size (x-axis, ×10⁴) for 1-gram to 4-gram models.]

Figure 6.2: Learning curves.

As with the increase in data size, increasing the ngram order markedly improved the results. Interestingly, unigram-based models converge faster than the higher-order ngram-based models. The obvious reason seems to be the small, finite set of characters that a language operates with (∼37 in Arabic, ∼39 in Persian and ∼48 in Hindi): a small set of (unique) words is probably enough to capture at least a single instance of each character, and since no new ngram is introduced by subsequent additions of new tokens to the training data, the accuracy stabilizes. The accuracy with higher-order ngrams, however, kept increasing with the data size, though the gains were marginal beyond 5-grams. The abrupt increase after 8,000 training instances is probably due to the addition of an unknown bigram sequence(s) to the training data; in particular, the recall of Perso-Arabic increased by 2.2%.

6.2.1.2 Perso-Arabic Borrowings in Urdu Text: A Quantitative Analysis

Above, we discussed our approach to classifying Urdu words according to their origin. Here we look at different text types and quantify the relative frequency with which Perso-Arabic words are distributed across them. We consider text from genres such as religious, academic and newswire writing, as well as an Urdu translation of the Quran and online blogs. The religious texts include books on different Islamic topics like prayer and ethics, while the Quranic text contains only the verses of the Quran translated into Urdu. The newspaper articles include data crawled from different newspapers printed in India and Pakistan. We also collected text from online blogs in Urdu to study the influence of Perso-Arabic words on non-scholarly writing. The academic text mainly includes novels, prose and poetry from Urdu textbooks taught in different schools in India. We take around 50K tokens as a sample from each text type and use our binary classifier to classify each token as either Perso-Arabic or Indic. The relative frequencies of each class across these text samples are shown in Figure 6.3. In all the samples, Perso-Arabic and Indic tokens are distributed with an almost equal ratio of 52:48, which probably explains the mutual unintelligibility of the Hindi and Urdu literary forms [131]. Type counts, however, are unevenly distributed, with a ratio of 65:35. The gap between the type and token ratios clearly shows that Indic words are used more frequently in Urdu texts than Perso-Arabic words. This could imply that Indic words are largely of a grammatical or functional nature, as words of these categories tend to occur quite often in a text. The higher type counts of Perso-Arabic words, on the other hand, imply that they may be proper names, technical terms and cultural terms that signify the association between Urdu speakers/writers and Perso-Arabic culture.

[Bar chart: relative frequency (%) of Indic (I) and Perso-Arabic (PA) tokens and types across the Quran, religious, newspaper, academic and blog samples.]

Figure 6.3: The plot shows percentage of Perso-Arabic words across text types of Urdu. I stands
for Indic while PA stands for Perso-Arabic.

[Bar chart: number of Indic and Perso-Arabic word types in the Urdu treebank for each part-of-speech tag.]

Figure 6.4: The plot shows the distribution of Indic and Perso-Arabic words across grammatical
categories in the Urdu treebank.

To gain further insight into the nature of Perso-Arabic borrowings in Urdu, we also studied their distribution across grammatical classes. In Figure 6.4, we plot the relative frequencies of Perso-Arabic words in the Urdu treebank against their part-of-speech tags. Verbs, together with other grammatical classes like adpositions (PSP), demonstratives and pronouns, have Indic origin in the Urdu treebank, while nominals, adverbs of manner, time and place, and conjunctions are borrowed from Persian and Arabic. Although simple verbs are predominantly Indic, the host nominals in complex predicates are almost always borrowed from Perso-Arabic [30]. This spreads Perso-Arabic borrowings across all the grammatical classes. However, the predominantly Indic verb base probably explains why Hindi and Urdu do not deviate much grammatically or structurally (cf. §6.2.2).

6.2.2 Syntactic Variation


To measure the syntactic differences between Hindi and Urdu, we consider four major word-order pa-
rameters that are used in the language typology literature [55, 78]. These parameters are as follows:

1. the order of subject, object and verb,

2. the order of possessive (genitive) and head noun,

3. the order of adposition and noun, and

4. the order of adjective and noun.

Arabic and Persian differ from Hindi with respect to these parameters, and we assume that these languages might have influenced the syntax of Urdu as well, apart from its vocabulary. Arabic is a head-initial language: verbs precede their arguments in a clause (VSO), while head nouns come before their modifiers (parameters 2-4). Although Persian is SOV like Hindi, it behaves like Arabic with respect to the other three parameters (2-4). It therefore seems reasonable to examine these parameters for any syntactic variation between Hindi and Urdu due to the influence of Arabic and Persian on Urdu. As a matter of fact, Urdu, unlike Hindi, has some constructions (related to the parameters listed above) which are head-initial in nature. These include the well-known Persian ezafe construction, a head-initial possessive construction, and a number of Persian and Arabic PPs wherein the adpositions license their objects to the right. These constructions have already been discussed in the Urdu linguistics literature [43, 170].
To quantify the variation with respect to the parameters listed above, and any other structural divergences between Hindi and Urdu, we extract and compare the structures relevant to these parameters from the Hindi and Urdu treebanks. Any parametric variation related to the order of subject, object and verb is hard to capture, since both Hindi and Urdu allow scrambling of arguments; in that scenario, it is not easy to judge whether a particular variation in order is a genuine influence of Arabic. As it happens, we could not find any VSO structures in the Urdu treebank. Nevertheless, we extract delexicalized partial trees of depth one from both treebanks for comparison across all the parameters. These treelets, which correspond to production rules, also include the arc direction. We also consider POS and chunk trigram sequences for capturing these syntactic variations and for relative comparison with other Indian languages. Nerbonne and Wiersma [143] show that POS trigrams can be used to account for syntactic differences between two corpora.6
6 See the discussion in the paper about the use of POS tag sequences as an approximation of syntax.

• Partial Trees: Partial trees provide direct access to any structural variations between two languages. In Hindi and Urdu, the head-directionality parameters can be studied from the partial trees extracted from their respective treebanks. The drawback of using partial trees is that longer partial trees may be treated as idiosyncrasies due to data sparsity.

• POS Trigrams: POS trigrams implicitly capture the syntactic parameters of a language. For example, a trigram such as ‘DEM ADJ NN’ suggests that adjectives are followed by nouns, which captures the fourth parameter listed above.

• Chunk Trigrams: Chunk trigrams capture the parametric variation related to the order of major constituents in a clause, such as subject and object. For example, ‘NP NP VP/VGF’ suggests that the language prefers SOV order.

6.2.2.1 Comparing Probability Distributions using Jensen-Shannon Divergence

Kilgarriff [98] describes the use of cross-entropy to assess the homogeneity of a text corpus. He shows how perplexity, the exponentiated cross-entropy of a corpus with itself, can be interpreted as a measure of corpus self-similarity: the lower the perplexity, the more homogeneous the corpus. Cross-entropy and perplexity have also been used in other works to measure the similarity between two corpora. With respect to domain adaptation and cross-lingual parsing, Søgaard [176] and Plank and Van Noord [163] have used different information-theoretic measures for sentence and domain selection. For cross-lingual parsing, Søgaard [176] used perplexity over POS trigram sequences to select a subset of sentences from the source languages that are similar to the target language. Similarly, Plank and Van Noord [163] used the Jensen-Shannon (JS) divergence [120] to select the most similar domains for training and better generalization on the target test domain. In this work, we use the Jensen-Shannon divergence to compare Urdu with Hindi texts syntactically. Unlike perplexity, the measure is bounded by 1 and can therefore be used as a similarity or distance measure. Given a source probability distribution P and a target probability distribution Q, the JS divergence between them is defined as:

JSD(P ‖ Q) = ½ D(P ‖ M) + ½ D(Q ‖ M)          (6.1)

where M = ½ (P + Q)
The Jensen-Shannon divergence is a smoothed version of the Kullback-Leibler (KL) divergence D(P ‖ Q), a classical measure of ‘distance’ between two probability distributions. KL is unsuitable for distributions derived via maximum-likelihood estimation, as it is undefined whenever an event has zero probability under the target distribution.
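The computation over POS-trigram distributions can be sketched as follows; the function names are ours, and base-2 logarithms keep the divergence bounded by 1.

```python
from collections import Counter
from math import log2

def trigram_distribution(tag_sequences):
    """Maximum-likelihood distribution over POS trigrams."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (Equation 6.1). Events absent from one
    distribution get probability 0, which JS, unlike KL, handles
    gracefully because the mixture M assigns them non-zero mass."""
    def kl(a, m):
        return sum(pa * log2(pa / m[e]) for e, pa in a.items() if pa > 0)
    m = {e: 0.5 * (p.get(e, 0.0) + q.get(e, 0.0))
         for e in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give a divergence of 0, and distributions with disjoint support give exactly 1, which is what makes the measure usable as a bounded distance between corpora.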

To put the similarities between Hindi and Urdu into perspective, we also compare Hindi with other Indian languages, namely Bengali, Gujarati, Kashmiri, Marathi, Punjabi and Telugu,7 and with a few domains of Hindi that contain texts other than newswire, namely box-office, cricket, gadgets and recipes. The probability distributions are derived via maximum-likelihood estimation from the respective annotated corpora. For relative comparison, the JS divergences are shown in Figures 6.5 and 6.6. In both cases, Urdu is relatively similar to Hindi. While the high similarity between Hindi and Urdu in the left plot is intuitive, their high similarity in the right plot also makes sense, since both the Hindi and Urdu treebanks contain newswire texts. Interestingly, the chunk divergences are higher than the POS divergences, which could be because scrambling is more prevalent at the constituent or chunk level; POS trigrams capture local dependencies, which vary less in order in Indian languages.
We manually analyzed the partial trees from the Urdu treebank which are not present in the Hindi treebank. A few of these partial trees are shown in Table 6.5. They capture the phenomenon of ezafe, which Urdu has borrowed from Persian, as well as prepositions from both Arabic and Persian. These structures constitute less than 1% of the Urdu treebank and would not be a bottleneck in either resource sharing or resource augmentation. The JS divergences in the right plot also encourage us to suggest Urdu as another domain of Hindi, or vice versa.

[Bar chart: JS divergence between Hindi and urd, ben, guj, hin, kas, mar, pan, tel, computed over POS-tag and chunk trigram distributions.]

Figure 6.5: Relative comparison of JS divergences between Hindi, Urdu and six other Indian languages. We did not have chunked Gujarati data to compute the probabilities, hence that divergence is not shown. All the languages are represented by their three-letter ISO codes. Note that hin represents the test set of the Hindi treebank, which contains newswire text.
7 Bengali, Gujarati, Marathi and Punjabi belong to the Indo-Aryan language family. Kashmiri belongs to the Dardic group of the Indo-Aryan family, while Telugu is a Dravidian language.

[Bar chart: JS divergence between Hindi and urd, box-office, cricket, gadget, hin, recipe, computed over POS-tag, chunk and partial-tree distributions.]

Figure 6.6: Relative comparison of JS divergences between Hindi, Urdu and four different domains of Hindi.

Partial Tree     Example                Gloss
NN → NNP         māh-e ramzān           ‘month of Ramadan’
NN → NST         zer-e ilāj             ‘under treatment’
PRP → NST        bād azān               ‘after sometime’
PRP → PSP        az khud                ‘by self’
NNP → PSP        barā-e hind            ‘for India’
NN → NST QC      andron ek sāl          ‘within one year’
NN → JJ          qābil-e qadr           ‘worth appreciation’
NN → JJ PSP      lamhā-e ākhir tak      ‘till the last moment’
NN → PSP QC      bashamol do ladkiyān   ‘together with two girls’

Table 6.5: Relatively frequent partial trees in the Urdu treebank.

6.3 Common Representation


In the above section, we empirically showed that there is considerable overlap between Hindi and
Urdu texts, grammatical as well as lexical. Despite these similarities, we cannot use Hindi text
processing tools for Urdu as such, and vice versa, because the two varieties are written in two distinct
orthographies: Hindi uses Devanagari while Urdu uses Perso-Arabic. To address this problem, we need
to represent both Hindi and Urdu texts in a single script. For this purpose, we can use either Devanagari
or Perso-Arabic, i.e., we can transliterate Hindi texts from Devanagari to Perso-Arabic or Urdu texts
from Perso-Arabic to Devanagari. Either way, transliteration between these two scripts is a non-trivial
task. There are genuine cases of character ambiguity due to one-to-many character mappings in
both directions of transliteration. A detailed description of the challenges in Hindi-Urdu transliteration
can be found in the works of Malik et al. [125], Jawaid and Ahmed [94] and Lehal and Saini [118]. In

addition to character ambiguity, Perso-Arabic to Devanagari transliteration also has to deal with missing
short vowels in Urdu texts. In Urdu writing, short vowels are rarely represented, even though the Perso-Arabic
script has the provision for their representation. They are dropped because readers can easily infer
them from the context. A major drawback of dropping short vowels in Urdu writing is that
it generates homographs. For example, without an appropriate short vowel on the first letter, هوا could
mean ‘air’ (havā) or ‘become’ (huā) depending on the context. Such homographs would lead to ambiguity
in the Devanagari script: there would be more than one genuine Devanagari representation for each
homograph, since Devanagari represents each phoneme uniquely and explicitly. Word-level
transliteration models usually do not deal with such word ambiguity and leave it unresolved. However, we need our
transliteration model to pick the transliteration that best fits the sentential context.
It should be noted that both Devanagari and Perso-Arabic are natural choices for the common
representation. The use of a third script (e.g., the Roman script) for this purpose would be computationally
expensive, as we would need to transliterate both the Hindi and the Urdu resources, and the transliteration
errors would also double. More importantly, if we chose a third script, we would have to manually develop
a reasonably-sized corpus of transliteration pairs for training the transliteration models. Transliteration
pairs in Devanagari and Perso-Arabic scripts, on the other hand, can be automatically extracted from
the corpora available in these scripts (see §6.3.1.1 for more details).
To measure the suitability of both scripts for the common representation of Hindi and Urdu texts, we
perform an extrinsic evaluation on a dependency parsing pipeline comprising POS tagging, chunking
and dependency parsing. The script that maximizes accuracy across the pipeline is taken to be the more
feasible choice for uniformly representing Hindi and Urdu texts for computational purposes.

6.3.1 Hindi-Urdu Transliteration


Hindi-Urdu transliteration has received a lot of attention from the NLP research community of
South Asia [116, 117, 118, 125]. It has been seen as a way to break the barrier that makes the two varieties
look different, although they are facets of the same language. Owing to the efforts of different researchers,
a couple of transliteration tools are available online that produce good transliterations bidirectionally. Despite
the amount of work on Hindi-Urdu transliteration, we could not find a single offline tool that we could
use for our experiments, and most of the systems do not meet our requirements: we need
a system that gives accurate word-level transliterations and also resolves word ambiguity at
the sentence level.
Most of the existing works on Hindi-Urdu transliteration have considered basic rule-based models which
use character tables coupled with a set of heuristics to resolve ambiguous mappings. A few exceptions
are the works of Sajjad et al. [166] and Srivastava and Bhat [178], who use phrase-based SMT and a
generative joint source-channel model, respectively, for Hindi-Urdu transliteration. In general, statistical approaches
like the noisy-channel model and its variants (like the joint source-channel model) are the most studied
methods for supervised machine transliteration [6, 80, 101, 157]. Recently, structured prediction models with
global learning and heterogeneous emissions have been shown to perform better than the noisy-channel

models [16, 200]. In this work, we model Hindi-Urdu transliteration as a structured prediction problem
using a linear model. Our transliteration model is basically a second-order Hidden Markov Model
(SHMM), formally represented in Equation 6.2. We denote the sequence of letters in a word in the source
script as boldface s and the sequence of hidden states, which correspond to letter sequences in the target
script, as boldface t. A basic HMM model has the following parameters:

$$\hat{\mathbf{t}} = \operatorname*{arg\,max}_{t_1 \cdots t_n} \prod_{i=1}^{n} \underbrace{P(t_i \mid t_{i-1}, t_{i-2})}_{\text{transition probabilities}} \; \underbrace{P(s_i \mid t_i)}_{\text{emission probabilities}} \qquad (6.2)$$

where

$s_1 \cdots s_n$ is the letter sequence in the source script, and

$t_1 \cdots t_n$ is the corresponding letter sequence in the target script.

Instead of maximum likelihood estimates, we use the structured perceptron of Collins [51] to learn the
model parameters. Given input training data of aligned character sequences $D = d_1 \ldots d_n$, a vector
feature function $\vec{f}(d)$, and an initial weight vector $\vec{w}$, the algorithm performs two steps for each training
example $d_i \in D$:

• Decode: $\hat{t} = \operatorname*{arg\,max}_{t_1 \cdots t_n} (\vec{w} \cdot \vec{f}(d))$

• Update: $\vec{w} = \vec{w} + \vec{f}(d) - \vec{f}(\hat{t})$

In addition to global learning, the structured perceptron also allows us to use feature-based emissions. We
replace the basic multinomial emissions $P(s_i \mid t_i)$ with the feature-based emissions $\vec{w} \cdot \vec{f}(d)$. The feature
template used to learn the emissions is shown in Table 6.6. We use Viterbi search for decoding in the case
of Devanagari to Perso-Arabic transliteration, while for Perso-Arabic to Devanagari transliteration we use
beam search to decode the best letter sequence in the target script. The reason for using beam-search
decoding in the Perso-Arabic to Devanagari direction is to extract the n-best transliterations for resolving
word ambiguity.
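A minimal sketch of such a globally trained transliteration model is given below. For brevity it uses a first-order transition (the model above is second-order), a small unigram-context subset of the feature template in Table 6.6, and toy data; the letter mappings and names are illustrative, not the actual Hindi-Urdu inventory:

```python
from collections import defaultdict

def feats(src, i):
    # Unigram context features around source position i (a small subset
    # of the template in Table 6.6).
    pad = ["<s>", "<s>"] + list(src) + ["</s>", "</s>"]
    return [f"u{k}={pad[i + 2 + k]}" for k in range(-2, 3)]

def viterbi(src, tags, w):
    # First-order Viterbi decode with feature-based emissions (w . f).
    V = [{t: sum(w[(f, t)] for f in feats(src, 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(src)):
        emit = {t: sum(w[(f, t)] for f in feats(src, i)) for t in tags}
        V.append({}); back.append({})
        for t in tags:
            score, prev = max((V[i - 1][p] + w[(p, t)], p) for p in tags)
            V[i][t] = score + emit[t]
            back[i][t] = prev
    t = max(V[-1], key=V[-1].get)
    out = [t]
    for i in range(len(src) - 1, 0, -1):
        t = back[i][t]
        out.append(t)
    return out[::-1]

def train(data, tags, epochs=10):
    # Structured perceptron: decode, then reward gold features and
    # penalise features of the (wrong) prediction.
    w = defaultdict(float)
    for _ in range(epochs):
        for src, gold in data:
            pred = viterbi(src, tags, w)
            if pred != gold:
                for i in range(len(src)):
                    for f in feats(src, i):
                        w[(f, gold[i])] += 1.0
                        w[(f, pred[i])] -= 1.0
                for i in range(1, len(src)):
                    w[(gold[i - 1], gold[i])] += 1.0
                    w[(pred[i - 1], pred[i])] -= 1.0
    return w

# Toy letter mappings (hypothetical): k -> K, b -> B
data = [("kb", ["K", "B"]), ("bk", ["B", "K"]), ("kk", ["K", "K"])]
w = train(data, tags=["K", "B"])
```

Keeping the k-best entries at each Viterbi cell, instead of the single best, turns this into the beam-search decoder used for extracting n-best transliterations.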

Ngram Features
Unigrams l_{i-4}; l_{i-3}; l_{i-2}; l_{i-1}; l_i; l_{i+1}; l_{i+2}; l_{i+3}; l_{i+4}
Bigrams l_{i-4}l_{i-3}; l_{i-3}l_{i-2}; l_{i-2}l_{i-1}; l_{i-1}l_i; l_il_{i+1}; l_{i+1}l_{i+2}; l_{i+2}l_{i+3}; l_{i+3}l_{i+4}
Trigrams l_{i-4}l_{i-3}l_{i-2}; l_{i-3}l_{i-2}l_{i-1}; l_{i-2}l_{i-1}l_i; l_il_{i+1}l_{i+2}; l_{i+1}l_{i+2}l_{i+3}; l_{i+2}l_{i+3}l_{i+4}
Tetragrams l_{i-4}l_{i-3}l_{i-2}l_{i-1}; l_{i-3}l_{i-2}l_{i-1}l_i; l_{i-2}l_{i-1}l_il_{i+1}; l_{i-1}l_il_{i+1}l_{i+2}; l_il_{i+1}l_{i+2}l_{i+3}; l_{i+1}l_{i+2}l_{i+3}l_{i+4}

Table 6.6: Feature template used for learning the emission parameters.

In the case of Perso-Arabic to Devanagari transliteration, to resolve the word ambiguity discussed above,
we perform sentence-level decoding on the n-best transliterations from the perceptron model. We use a
noisy-channel model and exact Viterbi search to find the most likely Hindi (Devanagari) sentence. The
noisy-channel model can be formally defined as follows:

$$h^* = \operatorname*{arg\,max}_{h} \; p(h) \times p(h \mid u) \qquad (6.3)$$

where p(h) is the language model score, which gives a prior distribution over the most likely sentences in Hindi,
and p(h|u) is the perceptron score, which indicates how likely the Hindi (Devanagari) sentence h is a
word-by-word transliteration of the Urdu sentence u. Since p(h|u) is not a probability score, we assign
uniform probabilities to all the transliteration options, redefining our model without p(h|u) as:

$$h^* = \operatorname*{arg\,max}_{h} \; p(h) \qquad (6.4)$$

Thus, our model relies only on the language model to find the best sentence from the n-best transliterations.
We use a trigram language model with Kneser-Ney smoothing learned from a 40M-sentence multi-domain
Hindi corpus. Here, it should be noted that it is plausible to score Urdu sentences in Devanagari using
a language model trained on Hindi data, since there is considerable overlap between Hindi and Urdu
grammar and vocabulary.
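Equation 6.4 reduces sentence-level disambiguation to a language-model search over the per-word n-best lists. A beam-search sketch follows; the log-probabilities here are made up for illustration, whereas the real system uses the trigram Kneser-Ney model mentioned above:

```python
def best_sentence(options, lm_logprob, beam=8):
    """Beam search for the most likely sentence under Eq. 6.4: only the
    language model scores matter, since the per-word transliteration
    options are treated as uniform. `lm_logprob(w, u, v)` is an assumed
    trigram scorer returning log P(w | u, v)."""
    beams = [(0.0, ("<s>", "<s>"), [])]
    for opts in options:
        expanded = []
        for score, (u, v), hist in beams:
            for w in opts:
                expanded.append((score + lm_logprob(w, u, v), (v, w), hist + [w]))
        expanded.sort(key=lambda x: x[0], reverse=True)
        beams = expanded[:beam]          # keep only the top `beam` hypotheses
    return beams[0][2]

# Toy log-probabilities (made up) favouring the reading havā 'air'
def toy_lm(w, u, v):
    table = {("havā", "<s>", "<s>"): -1.0, ("huā", "<s>", "<s>"): -1.5,
             ("chalī", "<s>", "havā"): -0.5, ("chalī", "<s>", "huā"): -3.0}
    return table.get((w, u, v), -5.0)

# Each inner list holds the n-best Devanagari readings of one Urdu token
sent = best_sentence([["havā", "huā"], ["chalī"]], toy_lm)
assert sent == ["havā", "chalī"]
```

With a single option per word the search degenerates to scoring one sentence, so only genuinely ambiguous tokens add work.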

6.3.1.1 Transliteration Pair Extraction and Character Alignment

Like any other supervised machine learning approach, supervised machine transliteration requires a
sizeable list of transliteration pairs to learn the model parameters. However, such lists are not readily
available and are expensive to create manually. Sajjad et al. [167, 168] have proposed algorithms to
automatically mine transliteration pairs from parallel corpora. Sajjad et al. [167] propose an iterative
algorithm based on phrase-based SMT coupled with a filtering technique, while Sajjad et al. [168]
model transliteration mining as an interpolation of transliteration and non-transliteration sub-models;
the model parameters are learned via an EM procedure and the transliteration pairs are mined by setting
an appropriate threshold. In this work, we use a simple edit distance-based approach to extract the
transliteration pairs from the translation pairs. Since Hindi and Urdu are phonetically similar, a simple
rule-based approach should be more robust than its statistical counterpart. Evidence for this can be found
in Sajjad et al. [166], who show that Hindi-Urdu transliteration models perform better when trained on
transliteration pairs mined using simple edit distance-based measures rather than their SMT-based
approach.
We use the sentence aligned ILCI Hindi-Urdu parallel corpora [96] to extract the transliteration pairs.
Initially, the parallel corpus is word-aligned using GIZA++ [155], and the alignments are refined using
the grow-diag-final-and heuristic [102]. We extract all the word pairs which occur as 1-to-1 alignments
in the word-aligned corpus as potential transliteration equivalents. We extracted a total of 54,035 translation pairs from the parallel corpus of 50,000 sentences. To further complement the translation pairs,

we also extracted 66,668 pairs from IndoWordNet [142] synset mappings.8 A rule-based approach with
an edit distance metric is used to extract the transliteration pairs from these translation pairs. To compute
the edit distances, we use the Hindi-Urdu character mappings presented in [116]. Short vowels in
Hindi (and Urdu, if any) words are ignored while computing the edit distances, since they are frequently
omitted in Urdu writing. We compute the Levenshtein distance between the translation pairs based on
insertion, deletion and replacement operations. For each translation pair, we compare the letters via their
mappings in the character mapping table. Finally, the distance scores are normalized by the length of the
longest string in a translation pair. Translation pairs with a normalized score below a small threshold of
∼0.1 are considered transliteration pairs. Using this procedure, we extracted 21,972 transliteration pairs
from the Hindi-Urdu parallel corpus and 24,614 transliteration pairs
from the Hindi-Urdu synsets.
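The mining procedure above can be sketched as follows; the character mapping and short-vowel set are toy romanized stand-ins for the actual Hindi-Urdu mapping table of [116]:

```python
def normalized_distance(hin, urd, char_map, short_vowels):
    """Normalised Levenshtein distance between a Hindi and an Urdu word,
    comparing letters through a character-mapping table and ignoring
    short vowels (which Urdu writing usually omits)."""
    a = [c for c in hin if c not in short_vowels]
    b = [c for c in urd if c not in short_vowels]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = b[j - 1] in char_map.get(a[i - 1], {a[i - 1]})
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (0 if same else 1))  # substitution
    return d[m][n] / max(m, n, 1)

def is_transliteration_pair(hin, urd, char_map, short_vowels, threshold=0.1):
    return normalized_distance(hin, urd, char_map, short_vowels) <= threshold

# Toy mapping over romanised segments (hypothetical)
cmap = {"k": {"k"}, "t": {"t"}, "b": {"b"}}
assert is_transliteration_pair("kitab", "ktb", cmap, short_vowels={"i", "a"})
```

Here "kitab" and its vowel-less counterpart "ktb" match exactly once short vowels are stripped, which is the typical Hindi-Urdu case.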
Once we have mined the transliteration pairs from the parallel corpora, we character-align them for
training and testing the transliteration models. We again use GIZA++ for the alignment task. GIZA++
produces three types of alignments from the transliteration pairs: 1→1, 1→many and 1→∅. There
can also be ∅→1 alignments, where target-string characters are left unaligned. Out of these four
alignment types, we modified the ∅→1 alignments. If we kept these alignments as such at training
time, we would need to introduce ∅ symbols in the test strings before decoding, which is not a trivial task. Instead, we
modify these alignments by merging the unaligned target character with the preceding aligned pair; if it is the
first character, it is merged with the succeeding aligned pair.
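This alignment post-processing can be sketched as below, assuming alignments come as (source, target) pairs with None marking an unaligned target character (a simplification of GIZA++'s actual output format):

```python
def fix_null_alignments(pairs):
    """Merge unaligned target characters into a neighbouring aligned pair:
    attach to the preceding pair when one exists, otherwise hold the
    material and attach it to the succeeding pair."""
    out = []
    pending = ""          # unaligned target material seen before any pair
    for src, tgt in pairs:
        if src is None:
            if out:
                s, t = out[-1]
                out[-1] = (s, t + tgt)   # merge into the previous pair
            else:
                pending += tgt           # no previous pair yet
        else:
            out.append((src, pending + tgt))
            pending = ""
    return out

# An unaligned target character merged leftwards into the previous pair
aligned = fix_null_alignments([("b", "bh"), (None, "a"), ("t", "t")])
assert aligned == [("b", "bha"), ("t", "t")]
```

After this step every training pair has a source character on the left side, so decoding never has to hypothesise empty source positions.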

6.3.1.2 Experiments and Results

We train two structured perceptron models on the transliteration pairs discussed above, maintaining
an 80:10:10 data split for training, testing and tuning both models. Additionally, we manually transliterated
1,000 Urdu sentences into Devanagari script to tune and evaluate the noisy-channel model which
we use on top of the Perso-Arabic to Devanagari transliteration system to resolve word ambiguity. The
parameters such as the number of training iterations, the order of the n-gram context and the number of
transliterations considered by the noisy-channel model are tuned on the respective development sets. We
found that the top 5 transliterations gave the best results.
To compare our results with existing systems, we choose HUMT,9 Malerkotla (MAL)10 and
SANGAM,11 which are available on the internet, and use SMT-based transliteration as a baseline. The
baseline model is a phrase-based machine translation system (PSMT) for transliteration built using the
Moses toolkit.12 We train the system with the default settings, except that the distortion limit is set
to zero (reordering is not important for transliteration). We tune the phrase-based SMT models using
minimum error rate training (MERT) on the development set for each system. The language model is a
8 http://www.cfilt.iitb.ac.in/˜sudha/bilingual_mapping.tar.gz
9 http://www.sanlp.org/HUMT/HUMT.aspx
10 translate.malerkotla.co.in/
11 http://sangam.learnpunjabi.org/
12 http://www.statmt.org/moses/

4-gram model estimated from 40 million Hindi and 6 million Urdu monolingual corpora using modified
Kneser-Ney smoothing.
The HUMT system is described in [125]. It uses finite-state transducers coupled with a phoneme-based
mapping scheme between Hindi and Urdu. We could not find a detailed description of the MAL system's
method documented anywhere; as per the official web page of the system, it appears to also
use character mappings between Hindi and Urdu for transliteration. SANGAM is a bidirectional system
described in [116, 117, 118]. It uses manually crafted rules over character mappings between Hindi and Urdu
and applies a number of pre- and post-processing steps, such as normalization, spelling correction and
stemming, to enhance performance. It also uses a bilingual word list for direct lookup of transliteration
equivalents. We list the performance of all these systems in Table 6.7 for comparison. The performance
of our noisy-channel models is reported in Table 6.8.

System Devanagari → Perso-Arabic Perso-Arabic → Devanagari


PSMT 96.23% 74.30%
HUMT 90.34% 40.75%
MAL 93.78% 78.23%
SANGAM 97.38% 87.56%
SHMM 98.03% 88.03%

Table 6.7: Comparison of the accuracies of systems available on the internet with our system.

System Testing Set Development Set


SHMM 94.21% 94.63%
+Noisy-channel Model 96.37% 96.72%

Table 6.8: Performance of the noisy-channel model in resolving word ambiguity in Perso-Arabic to Devanagari transliteration.

As shown in Table 6.7, we have established a new best system for transliterating Hindi and Urdu
texts bidirectionally. Our Devanagari to Perso-Arabic system outperforms SANGAM by 0.65%, and there
is also an improvement of 0.47% over SANGAM in the case of Perso-Arabic to Devanagari transliteration.
Of the two transliteration directions, Perso-Arabic to Devanagari performs worse because of the
missing vowels in the Urdu texts. The impact of missing vowels on the performance of each system is
shown in Figure 6.7. More or less all the systems suffer due to missing vowels in the source text. However,
our model beats all the other models by a significant margin in predicting missing vowels. In our Perso-Arabic
to Devanagari system, vowel prediction has greatly benefited from the higher-order n-gram context
surrounding a source letter. It is interesting to note that the performance of this system is correlated
with correct vowel prediction: as shown in Figure 6.8, as the performance of the system improves in
vowel prediction, so does its overall accuracy. Interestingly, all the systems cope well with the ambiguity
problem in Devanagari to Perso-Arabic transliteration, as shown in Figure 6.7.
Furthermore, our noisy-channel model improved the results of our basic perceptron model for Perso-Arabic
to Devanagari transliteration, increasing the accuracy by an absolute 2% on the test set. This
clearly shows how often homographs are generated due to missing vowels in the Perso-Arabic script. The
accuracies in Table 6.8 are reported on each token in the test set, which probably explains why transliteration
on this test set is more accurate than in Table 6.7, where accuracies are reported on unique
words. Moreover, the test set derived from the ILCI corpus (health and tourism domains) and the WordNet
synsets contains typical Urdu words, while the new test set contains newswire text with relatively common
and easier words. Common and simple words, such as grammatical words, contribute more
to the accuracy because of their sheer frequency.

[Bar chart: accuracy (0 to 100) of HUMT, PSMT, MAL, SANGAM and SHMM on vowel prediction and letter disambiguation.]

Figure 6.7: Performance of different systems on vowel prediction (Perso-Arabic → Devanagari) and letter disambiguation (Devanagari → Perso-Arabic).

[Line chart: accuracy (0 to 100) of Perso-Arabic → Devanagari, Devanagari → Perso-Arabic and vowel prediction as a function of n-gram order (1-gram to 5-gram).]

Figure 6.8: Impact of different-length n-grams on the bidirectional SHMM model.

6.3.1.3 Extrinsic Evaluation

As already mentioned, the script that fares better in an extrinsic evaluation on a dependency parsing
pipeline will be chosen for the common representation of Hindi and Urdu texts. In other words, we
choose whichever of the Devanagari and Perso-Arabic scripts produces similar or better results on the
dependency parsing pipelines of both Hindi and Urdu. For training and evaluation in each script, we made
two copies of the Hindi and Urdu training and evaluation sets. We used a feed-forward neural network with
a single layer of hidden units for training the POS tagging, chunking and dependency parsing models. The
POS tagging and chunking models use simple second-order structural features, while the dependency parsers
use rich lexical and non-lexical features related to nodes in the stack and the buffer (see §6.4.1 for more
details). The hyperparameters of the neural network models are tuned on the respective development
sets. For each module, we trained two models, one in each script. In Table 6.9, we report the results
on the respective evaluation sets in both scripts, using both gold and predicted features. Using gold
features, we keep the impact of POS and chunk tags neutral for chunking and parsing in both scripts; in
this way, we capture the impact of orthographic representation on each module independently.

Data                 POS tagging   Chunking   Parsing (LAS)
Gold Features
Hindi Devanagari        96.48        98.40        91.70
Hindi Perso-Arabic      96.00        98.35        91.52
Urdu Perso-Arabic       93.13        96.62        88.08
Urdu Devanagari         93.42        96.65        88.38
Predicted Features
Hindi Devanagari          -          97.84        88.32
Hindi Perso-Arabic        -          97.58        87.95
Urdu Perso-Arabic         -          95.60        81.67
Urdu Devanagari           -          96.03        82.28

Table 6.9: Performance of different modules of a dependency parsing pipeline trained and evaluated on Hindi and Urdu treebank data in Devanagari and Perso-Arabic scripts.

The results presented in Table 6.9 clearly favor the Devanagari script over Perso-Arabic for the orthographic
representation of Hindi and Urdu texts. For all the modules, accuracy decreases in the Perso-Arabic script,
while there is a significant increase in accuracy in the Devanagari script. The results may at first seem
counter-intuitive given the performance of the two transliteration systems: since the Devanagari
to Perso-Arabic transliteration model produces more accurate transliterations, representing the Hindi
and Urdu data in the Perso-Arabic script should have produced better results, which it did not. The reason
mainly lies in the fact that Devanagari represents information explicitly while the Perso-Arabic script does
not. Devanagari is a phonetic script which represents phonemes uniquely and explicitly; Perso-Arabic, on
the other hand, is not. Certain Perso-Arabic letters such as و (vāv), ی (ye) and ا (alif) represent different
phonemes in different contexts, and short vowels are mostly dropped in Perso-Arabic writing as they
are assumed to be redundant. Ambiguous letters and the absence of short vowels in Perso-Arabic writing
generate homographs, which are semantically and grammatically ambiguous words. Given these facts,
the better results in the Devanagari script are clearly due to the less ambiguous representation of
words in this script, as can be seen in Table 6.10, which shows the impact of transliteration on
lexical and POS-tag merging.

Script Lexical Merging (%) Tag Ambiguity (%)

Hindi in Perso-Arabic -11.87 +3.12

Urdu in Devanagari +11.26 -2.51

Table 6.10: Comparison of lexical and POS-tag merging rates in Devanagari and Perso-Arabic
transliterated data.

We define the lexical merging rate as the percentage drop in the size of the vocabulary (type
count) after transliteration. Similarly, tag ambiguity captures the increase in ambiguity of POS categories
due to lexical merging. Both the lexical merging and tag ambiguity rates are higher in the case
of Devanagari to Perso-Arabic transliteration, which explains the drop in accuracies when the models
are trained and evaluated on data represented in the Perso-Arabic script. There is an 11% drop in type
counts when we transliterate the Hindi treebank data into Perso-Arabic, while there is a
similar percentage increase in vocabulary when the Urdu treebank is represented in Devanagari. Interestingly,
not all lexical merging leads to syntactic ambiguity, nor does all lexical expansion resolve it. In the Hindi
treebank, the Perso-Arabic script created 3% homographs which are syntactically ambiguous, while the
Devanagari script resolves the ambiguity of around 3% of the words in the Urdu treebank. Table 6.11 provides
sample homographs from both treebanks represented in the Perso-Arabic script. Finally, it would be safe to
conclude that Devanagari is better suited for automatic text processing of Hindi and Urdu texts, as our
empirical and theoretical analyses suggest.

Homographs Gloss Syntactic Category
mẽ / maĩ ‘in’, ‘I’ PSP, PRP
kyā / kiyā ‘what’, ‘do’ WQ, VM
diyā / dayā ‘give’, ‘candle’, ‘sympathy’ VM, NN, NN
rahā / rihā ‘live’, ‘release’ VM, JJ
dūr / daur ‘far’, ‘time’ NN, NST
pul / pal ‘bridge’, ‘moment’ NN, NN
kal / kul ‘tomorrow’, ‘total’ NST, QF
patr / putr ‘letter’, ‘son’ NN, NN
zilā / jalā ‘district’, ‘burn’ NN, VM
tū / to ‘you’, ‘so’ PRP, CC
havā / huā ‘air’, ‘become’ NN, VM
dukh / dekh ‘sorrow’, ‘see’ NN, VM

Table 6.11: Some examples of homographs from the Hindi and Urdu treebanks represented in
Perso-Arabic script (shown here in romanized form, as read with different short vowels). PSP, PRP,
WQ, VM, NN, NST, QF, JJ and CC represent postpositions, pronouns, question words, verbs, common
nouns, spatio-temporal nouns, quantifiers, adjectives and conjunctions respectively.
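The lexical merging and tag ambiguity rates reported in Table 6.10 can be computed roughly as follows; this is a sketch with toy romanized data, and the helper names as well as the vowel-dropping stand-in for transliteration are our own:

```python
def lexical_merging_rate(vocab_before, vocab_after):
    """Percentage change in type count after transliteration; a negative
    value means word forms merged (cf. Table 6.10)."""
    return 100.0 * (len(vocab_after) - len(vocab_before)) / len(vocab_before)

def tag_ambiguity(tagged_tokens, transliterate):
    """Rough change in the share of POS-ambiguous types caused by
    transliteration. `tagged_tokens` is a list of (word, tag) pairs."""
    def tags_per_type(pairs):
        d = {}
        for w, t in pairs:
            d.setdefault(w, set()).add(t)
        return d
    before = tags_per_type(tagged_tokens)
    after = tags_per_type((transliterate(w), t) for w, t in tagged_tokens)
    ambiguous_after = sum(1 for ts in after.values() if len(ts) > 1)
    ambiguous_before = sum(1 for ts in before.values() if len(ts) > 1)
    return 100.0 * (ambiguous_after - ambiguous_before) / len(before)

# Toy data: dropping short vowels (as Perso-Arabic writing does) merges
# 'kal' and 'kul' into one syntactically ambiguous form
drop_vowels = lambda w: "".join(c for c in w if c not in "aiu")
toks = [("kal", "NST"), ("kul", "QF"), ("pul", "NN")]
merging = lexical_merging_rate({w for w, _ in toks},
                               {drop_vowels(w) for w, _ in toks})
assert round(merging, 1) == -33.3
```

On the toy data, merging is negative (the vocabulary shrinks) while tag ambiguity rises, mirroring the Hindi-in-Perso-Arabic row of Table 6.10.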

6.4 Resource Sharing and Augmentation


In the above section, we empirically showed that the Devanagari script can serve as a common representation
for both Hindi and Urdu texts without harming the performance of text processing applications
such as a POS tagger or a parser. Given that Hindi and Urdu are syntactically and grammatically
similar, resource sharing and augmentation should become feasible just by removing the script
barrier. A common orthographic representation would, however, only affect the part of the Hindi and Urdu
vocabularies that is shared; it will not fill the lexical gaps. To further harmonize the texts, we have
to deal with their lexical divergences. It is the lexical differences between Hindi and Urdu texts
that leave them mutually unintelligible [131]. The severity of these differences can be clearly seen by
comparing the OOV rates (the percentage of words in an evaluation set that do not occur in the training data)
of the Hindi and Urdu evaluation sets. As shown in Table 6.12, almost half of the tokens in the Hindi and
Urdu development sets are missing from the Urdu and Hindi training sets respectively. Such excessive
OOV rates would worsen the problem of lexical data sparsity for any statistical model.
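The OOV rates in Table 6.12 correspond to a simple token-level computation, sketched here with toy data:

```python
def oov_rate(train_tokens, eval_tokens):
    """Percentage of evaluation tokens whose word form never occurs
    in the training data (cf. Table 6.12)."""
    train_vocab = set(train_tokens)
    unseen = sum(1 for w in eval_tokens if w not in train_vocab)
    return 100.0 * unseen / len(eval_tokens)

# Toy data: two of four evaluation tokens are out of vocabulary
train = ["ghar", "jā", "rahā", "hai"]
dev = ["ghar", "makān", "hai", "intizār"]
assert oov_rate(train, dev) == 50.0
```

Running this over a Hindi training set and an Urdu development set (both in Devanagari) gives the cross-variety rates in the table; same-variety rates are much lower because only genuinely rare words are unseen.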

Training Development OOV (%)
Hindi Urdu 51.6
Urdu Urdu 15.40
Urdu Hindi 62.76
Hindi Hindi 22.06

Table 6.12: OOV rates of Hindi and Urdu development sets. Both Hindi and Urdu data sets are
represented in Devanagari.

Lexical data sparseness is one of the major challenges for data-driven approaches to natural
language processing. A common approach to bridging the lexical gaps between source and target data
is to use distributional similarity-based methods [58]. These methods exploit Harris' distributional
hypothesis, which states that words occurring in the same contexts tend to have similar meanings [84].
To address the lexical differences between Hindi and Urdu, distributional similarity-based methods
therefore seem an appropriate choice. Consider the case of pratikshā and intizār: both words are
semantic equivalents meaning ‘waiting’. pratikshā is a Sanskrit word used in Hindi texts, while intizār
is its Perso-Arabic equivalent used in Urdu texts. In both Hindi and Urdu, pratikshā and intizār form
complex predicates with similar light verbs, one such complex predicate being pratikshā/intizār kar
‘wait do’. The complex predicate takes a genitive-marked theme argument, licenses ergative case on
its agentive argument in the perfective aspect, and can take similar tense, aspect and modal auxiliaries.
Even though pratikshā and intizār are different word forms, they have identical syntactic distributions,
which can be used as an approximation of their semantic similarity.
To capture the similarity between Sanskritic and Perso-Arabic words in the Hindi and Urdu vocabularies, we
can apply distributional similarity methods to the union of the harmonized (same-script)
Hindi and Urdu corpora. Augmenting a source domain corpus with target domain data to learn
distributional representations of words is a common practice for addressing the lexical sparseness encountered in
domain adaptation tasks [46, 162]. We could also use bilingual word clustering or word embedding
approaches, which have been used to address the loss of lexical information in delexicalized parsing
[181, 195]. Täckström et al. [181] learn cross-lingual word clusters by jointly maximizing the likelihood
of the source and target monolingual data using word alignments as soft constraints, while Xiao
and Guo [195] learn interlingual word representations by training deep neural network models using seed
bilingual word pairs as pivots for building connections across languages. In the case of Hindi and Urdu,
the former approach is the simpler and more direct way to capture distributional similarity: the similar
grammar and partially shared vocabularies ensure that semantically similar Hindi and Urdu words
are assigned similar distributional representations. The cross-lingual approaches, on the other hand,
are computationally complex and capture distributional similarity only indirectly, through a seed
bilingual lexicon.

Distributional similarity can be incorporated in a statistical model either through word clusters or
word vectors [188]. Similar to Collobert et al. [53] and Chen and Manning [48], we represent lexical
units in the input layer of our neural network model by word embeddings instead of one-hot vectors.
We augment the Hindi monolingual data with the transliterated Urdu data and use the word2vec toolkit13 to
learn the word embeddings. The toolkit provides an efficient implementation of the continuous bag-of-words
(CBOW) and skip-gram (SG) approaches of Mikolov et al. [135] for computing distributed representations
of words. The CBOW model learns to predict the word in the middle of a symmetric window
based on the sum of the vector representations of the words in the window. The SG model, on the other
hand, predicts the surrounding context words based on the current word; it tries to maximize classification
of a word based on another word in the same sentence [135]. We consider context windows of 2 to 5 words
to either side of the central element and vary the vector dimensionality within the 50 to 100 range in steps
of 10. The model choice, window size and vector dimensionality were selected on the development set in a
POS tagging task.
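The window-based context extraction underlying both architectures can be illustrated as follows; the actual training uses the word2vec toolkit, and with gensim ≥ 4 the equivalent call would be roughly `Word2Vec(corpus, sg=1, window=1, vector_size=50, min_count=2)`:

```python
def skipgram_pairs(sentence, window):
    """Generate the (target, context) training pairs the skip-gram model
    sees for one sentence; the CBOW model instead predicts the target
    from the sum of the context vectors. Illustrative only."""
    pairs = []
    for i, w in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((w, sentence[j]))
    return pairs

# Toy romanized sentence; with window=1 each word pairs with its neighbours
sent = ["rām", "ghar", "gayā"]
assert skipgram_pairs(sent, 1) == [("rām", "ghar"), ("ghar", "rām"),
                                   ("ghar", "gayā"), ("gayā", "ghar")]
```

Because the Hindi and transliterated Urdu corpora are concatenated before training, Sanskritic and Perso-Arabic synonyms that share contexts (pratikshā and intizār with the light verb kar, for instance) end up with similar vectors.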

Model Window Dimension Min. Count


Skip Gram 1 50 2

Table 6.13: Choice of model architecture and other hyperparameters.

Once both Hindi and Urdu texts are represented in the same distributional space, we model the sharing
and augmentation of their resources as a supervised domain adaptation task. Supervised domain adaptation
assumes the availability of annotated data in both the source and target domains to improve model
performance in the target domain. We discuss the best practices in supervised domain adaptation and
evaluate their performance for Hindi and Urdu resource sharing and augmentation. An overview of the
supervised domain adaptation methods14 can be found in [59]; we summarize them briefly here.

• The SRCONLY method trains a single model on the source data while ignoring any target data.

• The TGTONLY method trains a single model only on the target data and acts as the state-of-the-
art baseline.

• In the ALL method, we simply train our machine learning algorithm on the union of the Hindi
and Urdu training sets.

• In the WEIGHTED method, we down-weight the instances of the larger data set so as to train a
single unbiased model. Weighting ensures that the data set with the larger number of instances does
not wash out the effect the smaller data set may have on the model parameters. The
weights are appropriately chosen by cross-validation.
13 https://code.google.com/p/word2vec/
14 We do not use the feature augmentation method of Daumé III [59]. The method creates general and domain-specific
versions of each categorical feature, and there is no straightforward way to do that in a neural network-based model that uses
distributed word representations.

• In the LININT method, we train SRCONLY and TGTONLY models and linearly interpolate their
predictions at the inference time. The interpolation weights are separately tuned for Hindi and
Urdu via cross-validation.

• The PRIOR method was first introduced by Chelba and Acero [47] in the context of a maximum entropy
classifier. The main idea of this approach is to use the SRCONLY model as a prior on the weights
of the target model during training. In our neural network model, we simply replace the
regularization term with λ||w − w_s||², where w_s is the weight vector of the SRCONLY model.
The PRIOR method ensures that the model trained on the target data prefers weights
that are similar to the weights of the SRCONLY model, unless the data demands otherwise [59].

Among these supervised methods, SRCONLY will address resource sharing, while WEIGHTED, LIN-
INT and PRIOR are the methods for resource augmentation.
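The PRIOR regularizer and the LININT interpolation can be sketched in plain Python as follows; this is illustrative only, as in practice both sit inside the neural network's training and inference code:

```python
def prior_sgd_step(w, w_src, grad_loss, lr=0.1, lam=0.01):
    """One SGD step for the PRIOR method: the usual loss gradient plus
    the gradient of lam * ||w - w_src||^2, which pulls the target-model
    weights towards the SRCONLY weights instead of towards zero."""
    return [wi - lr * (gi + 2.0 * lam * (wi - si))
            for wi, si, gi in zip(w, w_src, grad_loss)]

def linint(p_src, p_tgt, alpha):
    """LININT: linear interpolation of the SRCONLY and TGTONLY output
    distributions; alpha is tuned by cross-validation."""
    return [alpha * ps + (1.0 - alpha) * pt for ps, pt in zip(p_src, p_tgt)]

# With a zero loss gradient, a PRIOR step moves w strictly towards w_src
w_src = [1.0, -0.5]
w_new = prior_sgd_step([0.0, 0.0], w_src, grad_loss=[0.0, 0.0])
assert all(abs(n - s) < abs(0.0 - s) for n, s in zip(w_new, w_src))
```

Setting lam to zero recovers plain unregularized training, while a very large lam effectively clamps the target model to the SRCONLY weights.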

6.4.1 Experiments and Results


To carry out the parsing experiments, we use the arc-eager transition system which we formalized
in Chapter 2. The arc-eager algorithm defines four types of transitions to derive a parse tree, namely: 1)
Shift, 2) Left-Arc, 3) Right-Arc, and 4) Reduce. To predict these transitions, a classifier is employed.
We follow Chen and Manning [48] and use a non-linear neural network to predict these transitions
for any parser configuration. The neural network model is a standard feed-forward neural network
with a single layer of hidden units; the output layer uses a softmax function for probabilistic multi-class
classification. The model is trained by minimizing the cross-entropy loss with ℓ2-regularization over the
entire training data. We also use mini-batch AdaGrad for optimization and apply dropout [48].
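A minimal sketch of the four transitions is given below; preconditions (e.g. that Left-Arc requires a headless stack top) are omitted for brevity, and the indices and labels are toy values:

```python
def arc_eager_step(stack, buffer, arcs, action, label=None):
    """Apply one arc-eager transition. `stack` and `buffer` hold word
    indices; `arcs` collects (head, label, dependent) triples."""
    if action == "SHIFT":                # move front of buffer onto stack
        stack.append(buffer.pop(0))
    elif action == "LEFT-ARC":           # buffer front governs stack top
        dep = stack.pop()
        arcs.append((buffer[0], label, dep))
    elif action == "RIGHT-ARC":          # stack top governs buffer front
        head = stack[-1]
        stack.append(buffer.pop(0))
        arcs.append((head, label, stack[-1]))
    elif action == "REDUCE":             # pop a node that has its head
        stack.pop()
    return stack, buffer, arcs

# Parse a toy two-word sentence "rām gayā" (0: root, 1: rām, 2: gayā)
stack, buffer, arcs = [0], [1, 2], []
for act, lbl in [("SHIFT", None), ("LEFT-ARC", "k1"), ("RIGHT-ARC", "root")]:
    arc_eager_step(stack, buffer, arcs, act, lbl)
assert arcs == [(2, "k1", 1), (0, "root", 2)]
```

At each configuration, the classifier described above scores these actions from features of the stack and buffer, and the highest-scoring legal one is applied.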
From each parser configuration, we extract features related to the top four nodes in the stack, the top four
nodes in the buffer, the leftmost and rightmost children of the top two nodes in the stack, and the leftmost
child of the top node in the buffer. For each node, we use the distributed representation of its lexical
form, POS tag, chunk tag and/or dependency label. We use Hindi and Urdu monolingual corpora
to learn the distributed representations of the lexical units. The Hindi monolingual data contains around
40M sentences, while the Urdu data is comparatively small, at around 6M sentences. The
distributed representations of non-lexical units such as POS, chunk and dependency labels are randomly
initialized within the range of -0.01 to 0.01 [48].
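For instance, the initialization of a non-lexical embedding can be sketched as follows (dimension 20, as in Table 6.14; the function name is our own):

```python
import random

def init_embedding(dim=20, scale=0.01):
    """Uniform initialization in [-scale, scale] for POS/chunk/label units."""
    return [random.uniform(-scale, scale) for _ in range(dim)]

pos_vec = init_embedding()   # e.g. the embedding for one POS tag
```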
In addition to the existing structural features, we explicitly use case markers as a second-order lexical
feature for oblique nominals and as a third-order feature for coordinating conjunctions in NP coor-
dination constructions. Case markers are highly correlated with dependency relations and have been
observed to boost parsing accuracy significantly [10].
To decode a transition sequence, we use the dynamic oracle proposed by Goldberg and Nivre [76]
instead of the vanilla static oracle. The dynamic oracle allows training with exploration, which helps mitigate
the effect of error propagation. We use the same value for the exploration hyperparameter as suggested
by Goldberg and Nivre [76].
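A simplified sketch of the exploration step (the function name and the zero-cost set loosely follow Goldberg and Nivre's formulation; the probability value is illustrative):

```python
import random

def next_transition(predicted, zero_cost, p_explore=0.9):
    """Pick the transition to apply during training.

    zero_cost: the set of transitions the dynamic oracle judges optimal
    (they make no gold arc unreachable) in the current configuration.
    """
    if predicted in zero_cost:
        return predicted              # model already on an optimal path
    if random.random() < p_explore:
        return predicted              # explore: follow the model's error
    return next(iter(zero_cost))      # otherwise stay on an optimal path
```

By occasionally following its own mistakes, the parser sees (and learns to recover from) configurations that a static oracle would never produce.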

To train the POS tagging and chunking models, we use a neural network architecture similar to the one discussed
above. Unlike Collobert et al. [53], we do not learn separate transition parameters; instead, we include
the structural features in the input layer of our model along with the other lexical and non-lexical units. For POS
tagging, we use second-order structural features, the 2 words on either side of the current word, and the last 3
letters of the current word. For chunking, we additionally use the POS tags of the current word and of the 2 words
on either side of it.
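The word-level feature window for tagging can be sketched as follows (feature names are made up):

```python
def pos_features(words, i):
    """Features for tagging words[i]: a +/-2 word window plus a 3-letter suffix."""
    pad = ["<s>", "<s>"] + words + ["</s>", "</s>"]
    j = i + 2
    return {
        "w-2": pad[j - 2], "w-1": pad[j - 1], "w0": pad[j],
        "w+1": pad[j + 1], "w+2": pad[j + 2],
        "suf3": pad[j][-3:],
    }

feats = pos_features(["vaha", "ghar", "jaatii", "hai"], 2)
```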
In any non-linear neural network model, a number of hyperparameters need to be tuned for optimal
performance. Tuning these parameters is usually as cumbersome as designing appropriate feature combinations
in a linear model. The hyperparameters include the number of hidden units, the choice of activation
function, the learning rate, dropout, the dimensionality of input units, etc. Furthermore, we had to tune these
parameters for each task separately. Interestingly, we found that the hyperparameters take
similar optimal values for all three tasks, which could be due to the fact that POS tagging, chunking
and parsing are correlated with each other. There is, however, some variation in learning rate and dropout.
The tuned parameters of our neural network models are listed in Table 6.14.

Learning Rule   Learning Rate   #Hidden Units               Activation Function
Adagrad         0.01 − 0.03     200                         Rectilinear

Batch Size      Dropout         Dim. of non-lexical units   Iterations
20              0.2 − 0.5       20                          15

Table 6.14: Hyperparameters of our neural network models tuned on development sets.

After tuning the hyperparameters of our neural network models, we trained multiple models for both
Hindi and Urdu to evaluate the performance of each domain adaptation method. To use uniform POS
and chunk features, we used 10-fold jackknifing to assign these features to the training data instead of
using gold features. For chunking, we used the auto POS features from the best performing POS tagger.
Similarly for parsing, we used the best POS and chunk taggers to generate these features. The results of
our experiments are reported in Table 6.15.
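The jackknifing procedure can be sketched as follows (train_tagger and tag are placeholders for the actual training and decoding routines):

```python
def jackknife(sentences, train_tagger, tag, k=10):
    """Assign automatic tags to the training data itself: each fold is
    tagged by a model trained on the remaining k-1 folds."""
    folds = [sentences[i::k] for i in range(k)]
    tagged = []
    for i, held_out in enumerate(folds):
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_tagger(rest)
        tagged.extend(tag(model, s) for s in held_out)
    return tagged
```

This way, the automatic POS and chunk features seen at training time have the same error distribution as the features the parser will see at test time.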

Source   Target   SRCONLY   TGTONLY   ALL     WEIGHTED   LININT   PRIOR
POS tagging
Hindi    Urdu     89.34     93.42     93.74   93.77      93.93    93.52
Urdu     Hindi    86.06     96.48     96.50   96.54      96.61    96.22
Chunking
Hindi    Urdu     92.53     96.03     96.40   96.31      96.52    96.13
Urdu     Hindi    90.27     97.64     97.77   97.63      97.71    97.44
Dependency Parsing
Hindi    Urdu     78.93     82.28     82.64   82.53      82.65    82.32
Urdu     Hindi    75.12     88.32     88.41   88.37      88.39    88.18

Table 6.15: Results of different supervised domain adaptation methods on Hindi and Urdu resource
sharing and augmentation.

For resource augmentation, it is encouraging to note that all the methods led to some improvement
over the TGTONLY baseline. Surprisingly, the PRIOR method did not perform well in our case, even
though it was one of the best methods reported by Daumé III [59] with a linear model. LININT, on the
other hand, consistently performed better than the other methods. Overall, we achieved substantial
improvements in accuracy on all three tasks for both Hindi and Urdu; the improvements for
Urdu are particularly prominent. Our augmentation results clearly show that Hindi and Urdu annotations can
be complementary to each other. In our case, the large number of annotations in the Hindi treebank proved
useful for parsing Urdu, which has a relatively small treebank.
In the resource sharing experiments, the SRCONLY models did not perform on par with the TGTONLY
baseline models. The performance is better for Urdu (as a target domain) than for Hindi, which again
could be attributed to the sheer size of the Hindi training data. The larger gaps in accuracy between the
SRCONLY and TGTONLY models can be attributed to the domain shift problem inherent in data-driven
approaches to natural language processing. To verify this empirically, we explored the impact of domain
shift on the Hindi tagger and parser trained on newswire data by applying them to Hindi texts other than
news articles. For this task we used annotated data from four different domains of Hindi: cricket, recipes,
gadgets and box office. Each domain contains around 500 sentences annotated with POS tags, chunks
and dependency structures. The accuracies of the Hindi parser and tagger on these domains
are shown in Table 6.16. Comparing the accuracies of the Hindi POS tagger and parser on Urdu and
the four domains, the performance on the Urdu test set is comparable; the tagging and parsing
accuracies on the gadget and recipe data are in fact lower than the accuracies on the Urdu test data. This
encourages us to suggest that Hindi and Urdu can be considered two separate domains of the
same language, at least in a computational sense, and that their tools can be used interchangeably instead of
being built separately.

                      Parsing
Domains        UAS      LS       LAS      POS tagging
Cricket        87.90    85.26    79.72    94.02
Box-office     86.64    83.43    78.98    89.55
Gadget         83.27    81.30    75.35    85.93
Recipe         81.12    79.31    71.37    88.95
Urdu           86.13    83.15    78.93    89.34

Table 6.16: Comparison of parsing and tagging accuracy of Hindi parser and POS tagger on Urdu
test data and four different domains of Hindi.

6.5 Related Work and Comparison


In linguistics, Hindi and Urdu are indisputably recognized as the same language [131]. Their literary
vocabularies, however, are so divergent that they are mutually unintelligible. In computational linguistics,
their text processing is modeled separately due to their visible orthographic and lexical divergences.
In a theoretical study, Riaz [165] argues that the lexical divergences between Hindi and Urdu hinder the
interoperability of their computational resources. On the basis of the extensive variation in Hindi and
Urdu vocabularies, he argues that any method that relies on maximum likelihood estimation may not
work jointly for both Hindi and Urdu. A similar idea is presented in [164], which shows that Hindi and
Urdu share a common grammar while their literary lexicons vary extensively.
Despite the heavy lexical differences, Hindi and Urdu share the same grammar. This fact has motivated
a few researchers to explore the possibility of sharing their resources. Sinha and Mahesh [175] have
proposed a simple strategy to build an English-Urdu machine translation system that uses Hindi as
a bridge language. To derive Urdu translations of English sentences, the output of an English-Hindi MT
system is converted to Urdu using lexical mappings between Hindi and Urdu words, lexical and
syntactic disambiguation rules, and a transliteration module.
Adeeba and Hussain [2] have used a transliteration-based approach to create an Urdu WordNet from
an existing Hindi WordNet [142]. They used a rule-based transliteration system to convert the lexical
database of the Hindi WordNet to Urdu (Perso-Arabic). They manually pruned typical Sanskrit words
that are not used in Urdu texts and added additional entries specific to Urdu. Similarly Ahmed and
Hautli [5] have proposed to use a simple transliteration-based approach to access Hindi WordNet for
Urdu texts, instead of creating a separate WordNet.
Mukund et al. [140] have explored the use of a Hindi-specific POS tagger and chunker on Urdu texts. Both
the training and testing data are transliterated to a common form for model transfer. They show that the
Hindi POS tagger performs considerably worse on Urdu text, suffering an absolute loss of 32.5% in accuracy
compared to an Urdu-specific tagger. Their observation is the same for chunking, though the results are not
reported due to the lack of an evaluation set. Visweswariah et al. [193] explore the complementary role of

linguistic resources present in Hindi and Urdu for better system performance. They show improvements
in machine translation, bitext alignment and POS tagging. In particular, for POS tagging, they
improved the Hindi tagger by 6% absolute using predictions from an Urdu POS tagger
trained on Urdu tagged data translated into Hindi.
Our work differs from these related works in multiple ways. Firstly, we show that Urdu text processing
suffers substantially from the use of an ambiguous script. For computational purposes, we argued for representing
Urdu texts in Devanagari instead of Perso-Arabic. To that end, we proposed an efficient and accurate
transliteration method that resolves the lexical ambiguity arising from missing short vowels and
ambiguous characters in Perso-Arabic writing. Secondly, in addition to resource sharing, we also show that
resource augmentation can improve the performance of the individual text processing modules of Hindi and
Urdu. Furthermore, to mitigate the effect of lexical sparsity, we use a distributional similarity-based
method in addition to transliteration.

6.6 Summary
In this chapter, we have explored the possibility of sharing and augmenting annotation resources of
Hindi and Urdu to improve the performance of their dependency parsing models. To bridge the script
and lexical differences between Hindi and Urdu texts, we have proposed a simple and efficient technique
based on script transliteration and distributional similarity. We have shown that we can easily abstract
away from the orthographic differences between Hindi and Urdu texts by representing their lexicons
in the same distributional space. To demonstrate the effect of text harmonization, we have shown that
by bridging their script and lexical differences we can enhance the performance of Hindi and Urdu
dependency parsers by simply merging their training data. Moreover, our experimental results suggest
that Hindi and Urdu parsers can even be used interchangeably with reasonable accuracies.
In the future, we would like to explore the possibility of merging the semantic role annotations in
the Hindi and Urdu treebanks for training a better semantic role labeler. It would also be interesting
to see whether our observations related to resource sharing between Hindi and Urdu would hold for
applications other than parsing.

Chapter 7

Adding Semantics to Data-driven Parsing of Semantically-oriented


Dependency Representations

The CPG-based dependencies used for Indian language treebanking are semantically rich. Unlike pop-
ular dependencies such as Stanford and Universal dependencies [60, 153], they are fine-grained and
almost approximate semantic roles. Due to the lack of large treebanks in Indian languages, the
semantic richness of the dependency relations may not be favourable for parsing: it may aggravate the problem
of data sparsity that already exists in parsing Indian languages due to their morphological richness
and scarce annotated resources. In this chapter, we explore lexical semantics to mitigate the effect of
semantic richness of CPG-based dependencies. We explore and compare pre-existing lexical databases
such as WordNets and data-driven distributional semantics for this purpose. We use rich lexical repre-
sentations of words in linear and non-linear transition-based dependency parsers and show their impact
on parsing of semantically-oriented CPG dependencies.

7.1 Introduction
Treebanks are undoubtedly of paramount importance for robust data-driven parsing [52, 108, 186]. A
treebank is a rich resource of syntactic annotations carefully produced by human annotators. While
it takes a considerable amount of time and human effort to build a treebank of reasonable size and quality,
treebanks still often under-represent the structures of a language. Due to the huge cost involved in data annotation,
most treebanks are of limited size and mainly represent text from newswire articles. To complement
sparse annotations, almost all treebanks are built on naturally occurring texts annotated with
part-of-speech (POS) tags. POS tags can improve the generalization of a parser by abstracting away lexical
differences across similar syntactic contexts. However, traditional syntactic categories have a limited
role in differentiating richer syntactic relations such as subject and object relations. Theoretically, it
is well known that such relations can be better disambiguated by using lexical semantics [54]. The
need for richer information invoked several efforts in the direction of annotating higher-order linguistic
information in treebanks [95, 139, 183, 199].

Attempts have been made to utilize hand-annotated semantic information for constituency parsing
[73, 123] as well as dependency parsing [8, 23, 158, 159]. However, acquiring such information for new
sentences remains a challenge. This led to the exploration of pre-existing lexical databases for
accessing semantic information useful for parsing. Xiong et al. [196] used two lexical resources, HowNet1 [61]
and TongYiCi CiLin [134], for parsing the Penn Chinese Treebank [197]. Agirre et al. [3] demonstrated
that semantic classes obtained from the English WordNet [137] help to obtain significant improvements in
that semantic classes obtained from English WordNet [137] help to obtain significant improvements in
both PP attachment and PCFG parsing. Similarly, for dependency parsing, Agirre et al. [4] utilized the
English WordNet semantic classes and improved parsing accuracies. Apart from lexical databases, dis-
tributional representation of words have also been extensively used for reducing data sparsity in many
NLP applications including parsing [46, 48, 104, 187].
The role of lexical semantics is more prominent when tagsets are highly fine-grained and semantically
oriented. As we discussed in Chapter 3, CPG dependency labels capture subtle nuances of sentence
structure. Unlike Stanford dependencies [60], which capture very basic surface-level relations, CPG
dependencies make rich distinctions between syntactic relations. Moreover, they include relations that
correlate with the semantic roles of verbal predicates. There have been a few attempts to study the role of
lexical semantics in parsing CPG dependencies in the Hindi treebank. Bharati et al. [23] and Ambati et al.
[8] illustrated that encoding animacy information in the parsing models could bring substantial
improvements in the disambiguation of certain dependency relations. They performed their experiments
on a limited set of sentences from the Hindi treebank in which nouns had been annotated with the
following animacy categories: human, non-human, inanimate, time and place. Instead of using manual
annotations, Jain et al. [93] used the Hindi WordNet to add semantic information to a parsing model of
Hindi. However, they only used the ontologies of the synset to which a given word belongs, ignoring
the information related to synonyms in the synset and the relations, such as hyponymy and hypernymy,
that hold between synsets. Moreover, no comparison has been done with distributional semantics,
which has been shown to give competitive results [19].
In this chapter, we propose to use lexical semantics for mitigating the effect of high granularity of
CPG dependencies on parsing Indian language treebanks. Unlike the previous works, we use traditional
lexical semantics from lexical databases as well as distributional semantics induced from raw corpora.
Moreover, we also use lexical semantics from WordNets in a non-linear neural network architecture of
a transition-based parser. We propose a simple and efficient way to incorporate WordNet information in
the neural network parser that best captures the relations between ontologies and synonyms in synsets.
We also combine WordNet-based information and distributional information in our linear and non-linear
parsing models to explore their complementary role.
The remainder of the chapter is organised as follows. In §7.2, we discuss the impact of annotation
granularity on dependency parsing of Hindi and Urdu and motivate the use of lexical semantics for
disambiguation of fine-grained dependency types. In §7.3, we describe the lexical resources and the
approaches that we use to induce rich lexical information for parsing CPG-based dependencies. We
1 http://www.keenage.com

present our experimental setup in §7.4, which we use to carry out the parsing experiments. In §7.5,
we discuss the issues related to the representation of lexical and distributional semantics in our parsing
models. Experiments and results are discussed in §7.6 and finally, we summarize the chapter in §7.7.

7.2 Effect of Granularity


In Chapter 3, we discussed the hierarchical annotation scheme used to build large-scale
treebanks for Indian languages. The annotation scheme broadly categorizes dependency relations into
inter-chunk and intra-chunk relations. Intra-chunk dependencies are mostly syntactic in nature and capture
information such as case, tense and aspect for nouns and verbs, while inter-chunk relations capture
dependencies between larger syntactic units and usually involve a verbal predicate. Unlike Stanford
dependencies (henceforth SD) and its variants such as Universal dependencies (henceforth UD), CPG dependencies
are highly fine-grained: they capture subtle semantic aspects of the dependency relations expressed in
sentences. As a result, the label score of a parser trained on CPG dependencies is usually lower than that of a
parser trained on either SD or UD dependencies (for comparison, see the first and last rows of Table 7.2).
Coarse-grained annotation schemes such as SD, on the other hand, facilitate the annotation process and
are also assumed to lead to better parser performance.
To understand the impact of granularity on parser performance, we demarcate the label error along
sub-groups of dependency relations in the annotation hierarchy. Figure 7.1 shows the demarcation of label
accuracies of our parsing models trained on the Hindi and Urdu treebanks. Among the different label
types, our parsers can accurately identify lwg and other, while varg and vad seem to be the least predictable.
varg and vad represent dependency relations involving a verbal predicate. These dependency relations
are highly fine-grained and have been shown to be strongly correlated with PropBank labels [189]. The
correlation between PropBank and CPG dependencies (particularly varg and vad) is clearly visible from
the mapping in Table 7.1, adapted from [189].

[Bar chart: label accuracy (%) for Hindi and Urdu across the label types varg, vad, nmod, lwg and other.]

Figure 7.1: Accuracies of each label type in Hindi and Urdu test sets. lwg or local word group de-
pendency labels are used for intra-chunk dependencies, while other labels include non-dependency
relations such as root, conjunction, part-of and fragof.

CPG Dependencies PropBank labels


k1 (agent/subject); k4a (experiencer) Arg0
k2 (theme/patient) Arg1
k4 (beneficiary) Arg2
k1s (attribute) Arg2-ART
k5 (source) Arg2-SOU
k2p (goal) Arg2-GOL
k3 (instrument) Arg3
sent-adv (epistemic adv) ArgM-ADV
rh (cause/reason) ArgM-CAU
rd (direction) ArgM-DIR
rad (discourse) ArgM-DIS
k7p (location) ArgM-LOC
adv (manner) ArgM-MNR
rt (purpose) ArgM-PRP
k7t (time) ArgM-TMP

Table 7.1: Mappings of CPG dependencies to PropBank numbered arguments adapted from [189].

It might sound reasonable to drop the finer distinctions in the dependency annotations, since they would
nonetheless be captured in the PropBank annotations. This would not only remove redundancy between the
dependency and PropBank annotations but may also increase the label accuracy of the parsers. However,
preserving the distinctions at the dependency level would, on the one hand, facilitate PropBank annotation [189]
and, on the other hand, boost the performance of downstream applications such

as semantic role labeling, word sense disambiguation, relation extraction, etc. In a recent study on
Stanford dependencies, it has been shown that reducing the granularity of the annotation scheme does not
necessarily increase parsing accuracy [136]. This may not hold for CPG-based dependencies,
as they are semantic rather than syntactic in nature. To empirically gauge the impact of the granularity of
CPG-based dependencies on parsing, we conduct experiments on the Hindi and Urdu treebanks. In each
experiment, we reduce the granularity of related dependency relations using their hierarchical representation.
We focus, in particular, on verb argument and adjunct relations. We derive the coarse-grained
annotations by moving up the tagset hierarchy (see Chapter 3, Figure 7.3) in a right-to-left, bottom-up
fashion. At each tree level, we first merge labels in vad and then in varg. Unlike vad labels, we do not
merge all the varg relations; we keep the distinctions related to the core arguments of a verb, namely k1, k2
and k4. The procedure is graphically depicted in Figure 7.2.

[Tree diagrams depicting merging steps (1)-(4): fine-grained vad labels (rt, rh, ras, adv, k*u, k*s) and varg labels (k1, k2*, k3, k4*, k5, k7*) are successively collapsed into coarser vad and varg sets, retaining the core-argument distinctions k1, k2* and k4*.]

Figure 7.2: Fine-grained and coarse-grained dependency labels for verb arguments and adjuncts.

We use the linear transition-based parser to conduct the experiments, all of which use gold POS tag and
chunk features. The results are reported in Table 7.2. Surprisingly, reducing granularity in first-level vad
relations such as ras* has no positive impact on parsing accuracy. Merging second-level vad relations
leads to some improvement in LS, but not as much as expected. Coarsening varg relations, however, leads
to substantial improvements in LS at both levels of granularity. Our results show that the finer distinctions
in varg relations at both levels are harder to differentiate. Nevertheless, there is an overall improvement
of ∼1.5% LS on both

Hindi and Urdu test sets. More importantly, with the reduced tagset, the label score of the CPG parser is now
on par with that of the UD parser. Interestingly, UAS seems invariant to changes in the granularity of
dependency labels.

Dependencies   Hindi (UAS / LS / LAS)                          Urdu (UAS / LS / LAS)                           Labels Merged
CPG            95.88 / 92.95 / 91.13                           93.97 / 90.18 / 88.28                           -
vad-(1)        95.84 (−0.04) / 92.93 (−0.02) / 91.08 (−0.05)   94.03 (+0.06) / 90.47 (+0.29) / 88.50 (+0.22)   2
vad-(2)        95.77 (−0.07) / 92.90 (−0.03) / 91.01 (−0.07)   93.97 (−0.06) / 90.27 (−0.20) / 88.39 (−0.11)   7
vad-(3)        95.78 (+0.01) / 92.96 (+0.06) / 91.08 (+0.07)   94.09 (+0.12) / 90.37 (+0.10) / 88.46 (+0.07)   15
varg-(1)       95.77 (−0.01) / 93.75 (+0.79) / 91.81 (+0.73)   94.06 (−0.03) / 91.04 (+0.67) / 89.06 (+0.60)   4
varg-(2)       95.85 (+0.08) / 93.77 (+0.02) / 91.84 (+0.03)   94.09 (+0.03) / 91.04 (0.00) / 89.06 (0.00)     2
varg-(3)       95.89 (+0.04) / 93.88 (+0.11) / 91.96 (+0.12)   94.09 (0.00) / 91.02 (−0.02) / 89.07 (+0.01)    3
vad-(4)        95.89 (0.00) / 93.96 (+0.08) / 92.03 (+0.07)    94.12 (+0.03) / 91.16 (+0.14) / 89.22 (+0.15)   6
varg-(4)       95.91 (+0.02) / 94.42 (+0.46) / 92.44 (+0.41)   94.16 (+0.04) / 91.69 (+0.53) / 89.75 (+0.53)   3
UD             95.02 / 94.97 / 92.05                           -                                               -

Table 7.2: Impact of annotation granularity on parsing Hindi and Urdu test data.

By reducing annotation granularity, we were able to bridge the performance gap between
LS and UAS, but at the cost of losing crucial information that could be important for downstream
applications. An alternative to coarsening the annotations is to use higher-order lexical
information to capture the subtle semantic distinctions in the tagset. The reason to use word semantics is the
high correlation between the semantics of a word and the role it plays. This correlation is enforced
by the s-selectional constraints imposed by a predicate on its arguments [86, 97]. Semantic constraints
ensure that words carry the essential semantics to fill specific argument roles. For example, prototypically,
agentive roles are played by animate nouns, while patientive roles are played by
inanimate nouns [54]. To encode word semantics in our parsing models, we rely on the distributional
representation of words and the rich lexical information in traditional databases such as WordNets.

7.3 Lexical Semantics


In this section, we discuss the resources and the approaches that we use to extract rich lexical infor-
mation for parsing semantically rich annotations in Hindi and Urdu treebanks. Particularly, we give a
brief account of IndoWordNet and discuss two widely used data-driven distributional similarity-based
methods.

7.3.1 IndoWordNet
WordNets are lexical databases primarily composed of synsets and the semantic relations connecting
them. Synsets are sets of synonyms representing a common concept. They are linked by semantic
relations such as hypernymy (is-a), meronymy (part-of) and troponymy (manner-of). IndoWordNet is a linked
structure of the wordnets of major Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan families
[35, 142]. It has been created from the Hindi WordNet using the expansion approach [35].
In IndoWordNet, each synset is also mapped to an ontology, i.e., a hierarchical organization of
concepts; more specifically, an ontology categorizes entities and actions. The ontologies used in
IndoWordNet consist of around 200 different categories. Figure 7.3 illustrates a typical ontology used in
the Hindi WordNet, where the synset "pen, qalam, etc." is the leaf node and each intermediate node
assigns the synset to some semantic category. As we move up the hierarchy, the scope of a category increases as
it becomes more and more generic. Some of these categories, such as time, place and animacy, encode the
s-selectional constraints imposed by predicates on their arguments. These categories can, therefore,
be argued to be quite useful for parsing semantically-oriented dependencies. However, extracting
relevant information from WordNets is not straightforward due to lexical ambiguity: sense selection is a
major bottleneck in their use. Here, we discuss some useful strategies to select contextually
relevant information.

Noun → Inanimate → Object → Artifact → pen, qalam, ..

Figure 7.3: Sample hierarchy of categories in Hindi Wordnet.

7.3.1.1 Sense Selection

Lexical ambiguity is a common phenomenon in natural languages. A word can have multiple meanings
or senses, which vary across the different contexts of its usage. Even though IndoWordNet lists all the
possible senses of a word, choosing the contextually appropriate sense is a non-trivial task. Here, we
discuss different approaches to selecting the contextually relevant sense of a word from IndoWordNet.

• Category-based Sense Selection: The syntactic category of a word provides an essential initial cue
for sense selection. Consider the word chāt: it can mean either 'lick' or 'snacks'. In the former
sense, chāt has the syntactic category of a verb, while in the latter it is a noun. The ontological nodes
representing both senses are shown in Figure 7.4.

Noun → Inanimate → Object → Artifact
Verb → Verb Of Action → Bodily Action

Figure 7.4: Ontological nodes of chāt representing its verbal and nominal senses.

• Intra-category Sense Selection: It is a known fact that words may be ambiguous not only across
different syntactic categories but also within the same category. Figure 7.5 shows two different
ontologies representing the 'dog' and 'pawl' meanings of the word kuttā. To select the relevant sense
of a word after syntactic disambiguation, we explore the following strategies:

Noun → Animate → Fauna → Mammal
Noun → Inanimate → Object → Artifact
Figure 7.5: Two ontologies corresponding to two nominal senses of the word kuttā.

1. First Sense: Among the different senses of a word, we select the first sense listed in the
WordNet for its POS tag. WordNets usually list the senses of a word in descending
order of their usage frequency, i.e., the first listed sense is usually the predominant one.
2. WSD: Although the first sense captures the predominant usage of a word, it is inappropriate for
the word's infrequent usages. We therefore need to select the contextually appropriate sense. To
this end, we use Extended Lesk, a classical word sense disambiguation algorithm [13].
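A heavily simplified gloss-overlap version of Lesk (the actual system uses Extended Lesk over IndoWordNet glosses and related synsets; the senses and glosses below are invented for illustration):

```python
def lesk(context_words, senses):
    """Pick the sense whose gloss overlaps most with the context;
    ties fall back to the first (predominant) sense."""
    context = set(context_words)
    best_id, best_overlap = senses[0][0], -1
    for sense_id, gloss in senses:
        overlap = len(context & set(gloss))
        if overlap > best_overlap:
            best_id, best_overlap = sense_id, overlap
    return best_id

# Two invented senses of kuttaa ('dog' vs. 'pawl'), glossed in English.
senses = [("kuttaa.n.01", ["domestic", "animal", "that", "barks"]),
          ("kuttaa.n.02", ["pawl", "part", "of", "a", "machine"])]
picked = lesk(["the", "machine", "part", "slipped"], senses)
```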

7.3.2 Distributional Representations


Lexical resources like WordNets are manually created and are therefore expensive. Data-driven
approaches to word meaning provide a cheaper alternative: they approximate word meaning from large
quantities of raw data by exploiting the distributional similarity of words. Distributional
similarity methods build on Harris' distributional hypothesis, which states that words that occur in the
same contexts tend to have similar meanings [84]. There are multiple approaches to inducing
distributional representations of words from raw corpora. In this chapter, we discuss the two most widely used
approaches.

7.3.2.1 Brown Clustering

The Brown clustering algorithm is a hierarchical clustering algorithm that clusters words to maximize
the mutual information of bigrams [41]. It is essentially a class-based bigram language model. The
algorithm runs in O(N·C²) time, where N is the size of the vocabulary and C is the number of clusters.
Due to the hierarchical nature of the induced clustering, we can choose a word's cluster at several levels of
the hierarchy, which compensates for poor clusters of small numbers of words. One demerit
of Brown clustering is that it is based solely on bigram statistics; it does not consider word usage in a
wider context, which is crucial for uncovering the semantics of a word. Brown clusters have proven
useful for semi-supervised learning in multiple NLP tasks, including syntactic parsing [46, 104, 119,
187]. To induce Brown clusters for parsing Hindi and Urdu texts, we use Percy Liang's implementation of the
Brown clustering algorithm [119]. We perform the clustering on Hindi and Urdu raw corpora
containing 40M and 6M sentences respectively.
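Because the clustering is hierarchical, each word receives a bit-string path whose prefixes denote coarser clusters. A common way to turn this into features at several levels of the hierarchy (the paths below are invented):

```python
def cluster_features(word, paths, prefix_lengths=(4, 6, 10)):
    """Bit-string prefix features at several levels of the Brown hierarchy."""
    path = paths.get(word, "")
    return {"bc%d" % p: path[:p] for p in prefix_lengths}

# Invented paths: distributionally similar words share long prefixes.
paths = {"ghar": "0010110110", "makaan": "0010110111"}
f1 = cluster_features("ghar", paths)
f2 = cluster_features("makaan", paths)
```

Short prefixes give coarse, well-populated classes; long prefixes approach word identity.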

7.3.2.2 Word Embeddings

Another approach to word representation is to learn word embeddings. Word embeddings are dense,
low-dimensional, real-valued vectors. Each dimension of the embedding represents a latent feature
of the word that supposedly captures its useful syntactic and semantic properties. Currently, word
embeddings are the building blocks of most NLP applications.
We use the word2vec toolkit2 to learn distributed representations of the Hindi and Urdu vocabularies. The
toolkit provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram
(SG) approaches of Mikolov et al. [135] to compute distributed representations of words. The CBOW
model learns to predict the word in the middle of a symmetric window based on the sum of the vector
representations of the words in the window. The SG model, on the other hand, predicts the surrounding
context words based on the current word; it tries to maximize classification of a word based on another
word in the same sentence [135]. In this work, we only consider the SG model for experimentation. We use
the default values of the hyperparameters to train the SG model, except for the context window and vector
dimensionality. We consider context windows of 2 to 5 words to either side of the central element and vary
vector dimensionality within the 10 to 100 range in steps of 10. The model is trained on the same data that
we used for Brown clustering.
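As an illustration of the skip-gram objective, the sketch below enumerates the (center, context) training pairs that the SG model is trained to score for a given window size; the tokens are illustrative:

```python
def skipgram_pairs(tokens, window):
    """Enumerate the (center, context) pairs the skip-gram model trains on:
    each word predicts its neighbours within the symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# With a window of 2, each word predicts up to two words on either side.
print(skipgram_pairs(["ravi", "ne", "kitab", "parhi"], 2))
```

The toolkit additionally subsamples frequent words and samples the window size per token, which this sketch omits.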

7.4 Experimental Setup


To experiment with enriched lexical representations for parsing CPG-based dependencies, we use
transition-based parsing systems with linear and non-linear learning models. We use Brown clusters
as additional features in the linear model, while we employ a non-linear neural network architecture
2 https://code.google.com/archive/p/word2vec/

to make effective use of distributed word representations. In the following, we provide further details of
the parsing models used in our experiments.

7.4.1 Parsing Models


Our parsing models are based on the transition-based dependency parsing paradigm [149]. In particular,
we use the arc-eager transition system, one of the most widely used transition systems.
The arc-eager algorithm defines four types of transitions to derive a parse tree, namely: 1) Shift, 2) Left-
Arc, 3) Right-Arc, and 4) Reduce (see Chapter 2 for more details). To predict these transitions, we use
a linear and a non-linear classifier. The linear model is used so that we can effectively combine cluster-based
features with existing lexical and non-lexical features. We use a vanilla perceptron to incorporate word
clusters as additional features. In Table 7.3, we list the simple and complex cluster-based features that
can be extracted from a parser configuration. These features are used in addition to the POS, chunk and
lemma-based features which we discussed in Chapter 4.
To use word embeddings effectively for parsing, we follow Chen and Manning [48] and use a non-linear
neural network to learn these transitions. The model is a standard feed-forward neural
network with a single layer of hidden units and the ReLU activation function. The output layer uses the softmax
function for probabilistic multi-class classification. The model is trained by minimizing cross-entropy
loss with l2 regularization over the entire training data. We use mini-batch AdaGrad for opti-
mization and apply dropout [48]. In both parsing models, we use the dynamic oracle proposed by
Goldberg and Nivre [76] to derive transition sequences during training. The dynamic oracle allows training
with exploration, which helps mitigate the effect of error propagation. We use the same value for the
exploration hyperparameter as suggested by Goldberg and Nivre [76].
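The four arc-eager transitions can be sketched as follows. This is an illustrative sketch only: it omits the preconditions a full implementation must enforce (e.g. Left-Arc requires that the stack top does not yet have a head), and the example words and labels are invented:

```python
# Tokens are integers (0 is the dummy root); arcs are (head, label, dependent).
def shift(stack, buf, arcs):
    stack.append(buf.pop(0))

def left_arc(stack, buf, arcs, label):
    # front of the buffer becomes the head of the stack top
    arcs.append((buf[0], label, stack.pop()))

def right_arc(stack, buf, arcs, label):
    # stack top becomes the head of the front of the buffer,
    # which is then pushed onto the stack
    arcs.append((stack[-1], label, buf[0]))
    stack.append(buf.pop(0))

def reduce(stack, buf, arcs):
    stack.pop()

# Derivation for an invented verb-final 3-word sentence, e.g. 'ravi kitab parhta':
stack, buf, arcs = [0], [1, 2, 3], []
shift(stack, buf, arcs)              # push 'ravi'
shift(stack, buf, arcs)              # push 'kitab'
left_arc(stack, buf, arcs, "k2")     # kitab <- parhta
left_arc(stack, buf, arcs, "k1")     # ravi <- parhta
right_arc(stack, buf, arcs, "root")  # root -> parhta
print(arcs)  # three arcs: (3, 'k2', 2), (3, 'k1', 1), (0, 'root', 3)
```

The classifier's job is to choose, at each configuration, which of these four transitions (and which label) to apply.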

7.5 Feature Representation


There are multiple representation issues that we have to consider for the efficient use of higher-order
lexical information in our parsing models. The issues concern the representation of semantic
relations, synsets, word clusters and word embeddings in our linear and/or non-linear models. In
this section, we elaborate on these representation issues with respect to our learning models.

7.5.1 Linear Model


WordNet: IndoWordNet provides related synsets and ontologies for content words in the scheduled lan-
guages of India. While synsets provide fine-grained information, as they group words of similar meaning or
concept, ontologies group related synsets (e.g. animates can be grouped under the same ontology), thereby
reducing granularity. To use either of them in our parsing models, we have two representation choices.
In the case of synsets, we can either use each synonym in a synset as a separate feature or use the unique id
of a synset as a representative feature. Similarly, we can use an ontology as a single string feature or
use its individual categories as independent features. It seems more reasonable to use synset ids rather than
the synonyms in a synset as features: synset ids efficiently represent the concept captured
by a synset, and they also have a smaller effect on the dimensionality of our sparse feature vectors. Likewise,
using an ontology as a single feature would leave out any relation between ontologies which may otherwise
be captured by their individual categories. Consider the following ontologies that represent time. Treated
as whole strings, they would all constitute distinct features, even though they share Time as one of the
categories in the hierarchy.

• Time

• Descriptive→Time

• Inanimate→Abstract→Time

• Inanimate→Abstract→Time→Period

• Inanimate→Abstract→Time→Season

• Inanimate→Abstract→Time→Mythological Period

In each parser configuration, we use information related to both the synsets and the ontologies of the top word in
the stack, the top three words in the buffer, the leftmost and rightmost children of the top word in the stack, and the
leftmost child of the top word in the buffer. We also create more complex features by combining these
WordNet features with other features in a parser configuration. These features are listed in Table 7.3.

Single words S0 c; B0 c; B1 c; B2 c
Word pairs S0 cB0 c; S0 cpB0 p; S0 cpB0 cp; S0 cB0 cp; S0 cpB0 c; S0 pB0 cp; S0 wB0 c; S0 cB0 w; B0 cB1 c; B1 cB2 c;
Word triplets B0 cB1 cB2 c; S0 cB0 cB1 c; S0h cS0 cB0 c; S0 cS0l cB0 c; S0 cS0r cB0 c; S0 cB0 cB0l c
Distance S0 cd; S0 pd; B0 cd; S0 cB0 cd
Valency S0 cvr ; S0 cvl ; B0 cvl
Unigrams S0h c; S0l c; S0r c; B0l c
Third Order S0h2 c; S0l2 c; S0r2 c; B0l2 c
Labels S0 csr ; S0 csl ; B0 csl

Table 7.3: Additional cluster and WordNet-based features for parsing Hindi and Urdu. w denotes
word, p denotes POS tag, d denotes distance, v denotes valency and S|Bil , S|Bir denote the left and
rightmost children of S|Bi , c denotes WordNet-based feature or cluster id, and t denotes chunk tag.

Brown Clusters: Brown clustering performs hierarchical clustering over words to form a binary tree.
Each leaf node in the tree is a cluster of distributionally and semantically similar words. Due to the
hierarchical nature of the clustering, we can also merge clusters to reduce their granularity, which
easily allows us to try different levels of cluster granularity. We can use either the original clusters or the
merged clusters as features. The path from the root to each leaf is represented as a bit string, where the
ith bit is 0 if the path branches left at depth i and 1 otherwise. Leaves with longer common path prefixes
are more semantically related. Similar to the WordNet-based features, we add the bit strings of the top words in
the stack and the buffer as features using the feature template in Table 7.3.
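A sketch of how a few of these templates could be instantiated from a parser configuration; the words and bit strings below are invented for illustration:

```python
def cluster_template_features(stack, buf, clusters, prefix=10):
    """Instantiate a few of the cluster templates of Table 7.3
    (S0c, B0c, B1c and the pair S0cB0c) for one parser configuration.
    The prefix length controls cluster granularity."""
    def c(word):
        # back off to a NULL cluster for out-of-vocabulary words
        return clusters.get(word, "NULL")[:prefix]
    feats = []
    if stack:
        feats.append("S0c=" + c(stack[-1]))
    if buf:
        feats.append("B0c=" + c(buf[0]))
    if len(buf) > 1:
        feats.append("B1c=" + c(buf[1]))
    if stack and buf:
        feats.append("S0cB0c=" + c(stack[-1]) + "|" + c(buf[0]))
    return feats

clusters = {"kitab": "0110", "ladka": "1010"}
print(cluster_template_features(["ladka"], ["kitab"], clusters, prefix=2))
```

Each feature string is then hashed into the perceptron's sparse weight vector like any other lexical feature.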

7.5.2 Non-linear Model


Word Embeddings: In our non-linear model, we use distributed representations of lexical features in
the form of word embeddings. Unlike the sparse representations in linear models, word embeddings allow
words that are close in the embedding space to share model parameters, thus providing an efficient
solution to the problem of data sparsity. Moreover, word embeddings can also improve the correlation
between words and dependency labels, assuming they capture semantic aspects of a word. From each
parser configuration, we extract features related to the top four nodes in the stack, the top four nodes in the
buffer, the leftmost and rightmost children of the top two nodes in the stack, and the leftmost child of the
top node in the buffer. For each node, we use the distributed representation of its lexical form, POS tag,
chunk tag and/or dependency label. The distributed representations of non-lexical units such as POS tags,
chunk tags and dependency labels are randomly initialized within the range of -0.01 to 0.01 [48].
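A minimal sketch of this initialization; the symbol names are illustrative:

```python
import random

def init_embeddings(symbols, dim, scale=0.01, seed=0):
    """Randomly initialize embeddings for non-lexical symbols
    (POS tags, chunk tags, dependency labels) uniformly in [-scale, scale]."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-scale, scale) for _ in range(dim)]
            for s in symbols}

emb = init_embeddings(["NN", "VM", "k1"], dim=4)
print(emb["NN"])
```

The embeddings are then updated by backpropagation together with the rest of the network parameters.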

WordNet: Unlike in a linear model, WordNet information can be represented easily and efficiently in
a non-linear neural network model. Similar to non-lexical units such as POS and chunk tags, ontological
information can be represented by randomly initialized embeddings in the input layer. To capture
similarity between ontologies, we compose their embeddings from the embeddings of their component
categories. This ensures that ontologies with overlapping categories lie closer in the embedding
space. In addition to ontologies, we can also add synset-related information to the model quite
easily. Synsets can be used to improve both randomly initialized and pre-trained word embeddings us-
ing retrofitting [69]. The idea is to update word embeddings using their synonyms such that
some distance metric, e.g. Euclidean distance, within synsets is minimized. Retrofitting has been shown
to work better than pre-trained embeddings in a number of NLP tasks such as word similarity and senti-
ment analysis [69]. In this work, we retrofit both randomly initialized and pre-trained word embeddings
of Hindi and Urdu using IndoWordNet synsets.
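The retrofitting update can be sketched as follows. This is a simplified version of the objective in [69] with all weights set to one, not the exact implementation we used; the toy vectors are illustrative:

```python
def retrofit(embeddings, synsets, iters=10):
    """Iteratively move each word vector toward the average of its
    synonyms' vectors while keeping it close to its original value
    (both terms weighted equally here)."""
    new = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iters):
        for word, syns in synsets.items():
            syns = [s for s in syns if s in new]
            if word not in new or not syns:
                continue
            dim = len(new[word])
            new[word] = [
                (embeddings[word][d] + sum(new[s][d] for s in syns))
                / (1 + len(syns))
                for d in range(dim)
            ]
    return new

# Toy 1-d example: synonyms 'a' and 'b' are pulled toward each other.
vecs = retrofit({"a": [1.0], "b": [0.0]}, {"a": ["b"], "b": ["a"]})
print(abs(vecs["a"][0] - vecs["b"][0]))
```

The anchor term `embeddings[word][d]` is what keeps a retrofitted vector from drifting arbitrarily far from its distributional estimate.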

7.6 Experiments and Results


We conducted a series of experiments to explore the effect of both traditional and distributional
semantics on the linear and non-linear parsing models for Hindi and Urdu. In both models, we added
distributional semantics and WordNet semantics separately on top of the baseline models. To explore their
complementary role in parsing, we also used both types of semantic features simultaneously. For the linear
parsing model, we set the baseline using POS, chunk and lemma features (see Chapter 4). The baseline
for the non-linear parser uses randomly initialized embeddings for lexical as well as non-lexical
input units. The hyperparameters of each model, such as the learning rate, dropout and number of iterations,
are tuned on the development sets. The results are reported in Tables 7.4 and 7.5. The reported results are
obtained using the best hyperparameter values, the best sense selection strategy, the optimal length of bit
strings and the optimal embedding dimensionality. As shown in Figures 7.6 and 7.7, we achieved the best results
using bit strings of length 10 and word embeddings of 80 dimensions. Similar to Jain et al. [93], the first-sense
selection strategy provided better results than an off-the-shelf WSD algorithm. Among the WordNet-
based features, only the synonyms and the ontological categories associated with a word improved
over the baseline; information related to semantic relations such as hypernymy
did not improve the parsing accuracy.

                          Hindi                                Urdu
Features     UAS           LS            LAS           UAS           LS            LAS
Baseline     93.24         89.73         87.52         88.69         84.77         81.11
Clusters     93.52 (+0.28) 90.01 (+0.28) 87.77 (+0.25) 88.77 (+0.08) 84.84 (+0.07) 81.19 (+0.08)
Ontos        93.25 (+0.01) 90.17 (+0.44) 87.85 (+0.33) 88.71 (+0.02) 85.07 (+0.30) 81.33 (+0.22)
+Synsets     93.27 (+0.02) 90.21 (+0.04) 87.90 (+0.05) 88.77 (+0.06) 85.15 (+0.08) 81.37 (+0.04)
Combined     93.55 (+0.31) 90.26 (+0.53) 87.98 (+0.46) 88.79 (+0.10) 85.23 (+0.46) 81.43 (+0.32)

Table 7.4: Impact of IndoWordNet and Brown clusters on Hindi and Urdu linear transition-based
parsers. Improvements are shown in parentheses.

                          Hindi                                Urdu
Features     UAS           LS            LAS           UAS           LS            LAS
Baseline     93.13         90.04         87.13         88.21         84.62         80.57
Embeddings   93.41 (+0.28) 90.74 (+0.70) 87.82 (+0.69) 88.94 (+0.73) 85.91 (+1.29) 81.73 (+1.16)
Ontos        93.17 (+0.04) 90.63 (+0.59) 87.55 (+0.42) 88.49 (+0.28) 85.27 (+0.65) 81.26 (+0.69)
+Synsets     93.23 (+0.06) 90.75 (+0.12) 87.67 (+0.12) 88.52 (+0.03) 85.38 (+0.11) 81.36 (+0.10)
Combined     93.37 (+0.24) 90.93 (+0.89) 87.84 (+0.71) 88.59 (+0.38) 85.51 (+0.89) 81.47 (+0.90)

Table 7.5: Impact of IndoWordNet and word embeddings on Hindi and Urdu non-linear neural
network parsers. Improvements are shown in parentheses.

[Plot: LAS as a function of bit-string length (2 to 17 bits) for Hindi and Urdu.]

Figure 7.6: Learning curves for optimal length of bit strings (number of clusters).

[Plot: LAS as a function of embedding dimensionality (20 to 100) for Hindi and Urdu.]

Figure 7.7: Learning curves for optimal dimensionality of word embeddings.

In the linear parsing models, WordNet-based features provided substantial improvements, while distri-
butional features (Brown clusters) provided somewhat smaller improvements. The reverse holds for the
non-linear parsing models: on both the Hindi and Urdu test sets, word embeddings led to a 0.9%
average improvement in LS, while WordNet features provided only a 0.4% average improvement.
This could be due to the fact that IndoWordNet lacks coverage for almost half of the Hindi and Urdu
treebank vocabularies. Finally, the combination of WordNet and distributional features further improved
the results, although their individual contributions did not add up linearly.
Although there is still a large gap between the attachment (UAS) and label (LS) scores of our parsers, we
have substantially improved the label scores over the baselines. In particular, there are large im-
provements on the dependency relations that were harder to disambiguate in our granularity experiments
(see §7.2). More specifically, the label score of varg dependencies in the Hindi and Urdu test sets increased by
8% and 5% respectively. The improvements over the baseline for each label type are shown in Figures
7.8 and 7.9. The scores used in these figures are from the more accurate non-linear parsing models. Our
results clearly show the importance of using lexical semantics to capture subtle distinctions between
fine-grained CPG dependencies.

[Bar chart: labeled accuracy of the baseline vs. the semantics-enriched model for the label types
varg, vad, nmod, lwg and other.]

Figure 7.8: Impact of lexical semantics on parsing of different label types in the Hindi test set.

[Bar chart: labeled accuracy of the baseline vs. the semantics-enriched model for the label types
varg, vad, nmod, lwg and other.]

Figure 7.9: Impact of lexical semantics on parsing of different label types in the Urdu test set.

7.7 Summary
In this chapter, we have explored lexical semantics as a complementary source of features for parsing the
semantically rich dependency annotations in the Hindi and Urdu treebanks. We have shown that lexical
semantics in the form of discrete and continuous features, such as ontological categories, Brown clusters
and word embeddings, can play a major role in disambiguating fine-grained CPG dependencies. We have
proposed simple feature combinations to incorporate WordNet and cluster features in the linear parsing
model, and we have proposed retrofitting as a way to incorporate WordNet information in a neural network
parsing model. By modeling lexical semantics in our parsing models, we have achieved significant
improvements in parsing both the Hindi and Urdu test sets. The improvements are particularly promi-
nent in dependencies related to verb argument structure.

Chapter 8

Data-driven Parsing of Kashmiri: Setting up Parsing Pipeline with


Preliminary Experiments

Kashmiri is a resource-poor language with very few computational and linguistic resources available for
its text processing. In this chapter, we present our experiments on data-driven parsing of Kashmiri along
the lines of our work on Hindi and Urdu. Due to the lack of essential tools and resources for Kashmiri, we
start by manually annotating Kashmiri text with the information relevant for parsing. More precisely,
we build a dependency treebank for Kashmiri, which contains sentences annotated with part-of-speech
(POS), chunk and dependency information. Subsequently, the annotations in the treebank are used to
train the first dependency parser for Kashmiri.

8.1 About Kashmiri


Kashmiri belongs to the Dardic sub-group of the Indo-Aryan language family. It is spoken primarily in
the Kashmir Valley, in Jammu and Kashmir. According to the census of 2001, it has approximately
5,632,698 speakers throughout India and Pakistan. It is one of the 22 scheduled languages of India1 .
Kashmiri is mainly written in modified Perso-Arabic and Devanagari scripts; however, the Perso-Arabic
script is more common today.
Kashmiri is a V2 language like German, in which tensed clauses are subject to the verb-second constraint:
the finite verbal element in these clauses always occurs in the second position, i.e., the position imme-
diately following the first phrasal constituent [32, 33, 89]. In cases where an auxiliary verb
carries the tense information, the auxiliary occupies the second position in the clause while the main verb
occupies the final position. Consider Examples 33 and 34 for an illustration of the V2 phenomenon in
Kashmiri. In Example 33, the auxiliary chu ‘is’ occupies the clause-second position while the verb divān
‘give’ occurs at the end. The tensed verb dits ‘gave’ follows the first phrasal constituent in Example 34.

(33) Shahid chu Irshadas kitāb divān .


Shahid-3MSg be-Prs Irshad-3MSg.Dat book-3FSg give-Prog .
1 http://www.ethnologue.com/language/kas

‘Shahid is giving a book to Irshad .’

(34) Shahidan dits Irshadas kitab .


Shahid-3MSg.Erg. give-Perf Irshad-3MSg.Dat book .
‘Shahid gave Irshad a book.’

However, there are some exceptions to the V2 phenomenon in Kashmiri. The exceptions include relative
and adverbial clauses, which adhere to typical verb-final behavior: in these clauses, the tensed verb
always comes at the end instead of occupying the clause-second position. Similarly, in the presence of a
question word, the tensed verb or auxiliary occupies the V3 position, since V2 is occupied by the question word.
Kashmiri is an inflectionally rich language. Nouns and verbs inflect for different sets of grammatical infor-
mation. Nominals inflect for number, gender and case, which are realized through a single portmanteau
morph; e.g. in the noun insaan-an (human-MSgErg/Acc), ‘-an’ can be either an ergative or an accusative
marker, and it also carries gender and number information. Further, nominal modifiers like
demonstratives, quantifiers and adjectives agree with their head noun for case; e.g. in the NP yam-is
ak-is bad-is ladak-as (this-Dat one-Dat big-MSgDat boy-MSgDat), all the dependent words (modifiers)
agree with the head ladake ‘boy’ in case, which is represented by the dative marker
-is/-as. Similarly, verbs carry tense, aspect and mood (TAM) information and show agreement proper-
ties. Kashmiri verbs can also carry emphatic, honorific, negative and interrogative markers. In addition
to these pragmatic markers, verbs exhibit an interesting phenomenon of pronominal cliticization (see
Section §8.2.4 for more details): in a discourse, verbs can be marked with pronominal suffixes encoding
information about their core arguments. These enclitics show grammatical agreement with the corre-
sponding argument, e.g. shong-us (slept-1MSg = ‘I slept’). Furthermore, Kashmiri has morphological
causatives and passives, i.e. causative and passive forms are derived by a morphological process: suffixation
of ‘-inaav’ to the root form produces a causative, and suffixation of ‘-ni’ (in addition to the passive
auxiliary āv ‘come’) produces a passive. Besides inflectional morphology, Kashmiri nouns and verbs
also exhibit derivational morphology (for more details see [107]).

8.2 Setting up a Parsing Pipeline


In Chapter 3, we discussed the general procedure for treebanking of Indian languages based on the
CPG formalism. Prior to the annotation of dependency structures, the text is tokenized and then semi-
automatically morphologically analyzed, POS tagged and chunked. The semi-automatic nature of the
annotation process implies the availability of tools like a morphological analyzer, POS tagger and chun-
ker. However, due to the lack of such basic resources for Kashmiri, we annotate the necessary infor-
mation manually from scratch. In what follows, we discuss the tokenization, POS tagging, chunking and
inter-chunk dependency annotation of Kashmiri text, which will subsequently be used for setting up the
parsing pipeline for Kashmiri.

8.2.1 Tokenization
Tokenization of texts written in the Perso-Arabic script is a non-trivial task. The script poses two
problems: space omission and space insertion. The space character has hardly any significance as a
word-boundary marker in visual word identification and can thus be omitted altogether; it is needed,
rather, to generate the correct typography of a word [63], which plays a considerable role in the readability
of the text. However, due to the impact of technology, which is itself largely under the influence
of English, the space character has become a more or less standard word-boundary marker. Although
this addresses the problem of space omission, the space character remains an unreliable cue for word
segmentation: it has now acquired two functions in languages written in the Perso-
Arabic script, to separate words and to generate correct typography. In the Urdu treebank [25], the
problem of tokenization is tackled by correcting the wrong word segmentations obtained using the space charac-
ter as a word-boundary marker. Human annotators identify the wrong segmentations and join the word
segments using “ ”. At a later stage of treebank development, “ ” is replaced automatically with the zero-
width non-joiner character “ZWNJ” 2 , which converts the text into its natural form (by removing the
extra “ ” character) and addresses the space insertion problem. Similar to Urdu treebanking, we to-
kenize Kashmiri text using the space character and then correct the wrong segmentations manually. The
segmentation errors are corrected during the POS tagging of the text. A few examples of tokenization
errors and the proposed modifications are shown in Table 8.1. Note that the space insertion problem is
most prominent in Perso-Arabic borrowings in both Kashmiri and Urdu.

Initial Representation Gloss Intermediate Representation Final Representation


ghunah gār sinner ghunah gār ghunahgār
sehat mand healthy sehat mand sehatmand
zimi dārī responsibility zimi dārī zimidārī
tele vision television tele vision television
tali kani under/below tali kani telekani

Table 8.1: Tokenization problem in Kashmiri texts.
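The automatic replacement step can be sketched as follows. The underscore placeholder and the example word are purely illustrative, not the actual marker used in the treebank; only the ZWNJ code point (U+200C) is fixed:

```python
ZWNJ = "\u200c"  # zero-width non-joiner

def restore_orthography(token, joiner="_"):
    """Replace the annotation-time joiner (illustrative: '_') with ZWNJ,
    restoring natural orthography without reintroducing a space."""
    return token.replace(joiner, ZWNJ)

fixed = restore_orthography("sehat_mand")
print(fixed == "sehat" + ZWNJ + "mand")  # → True
```

Because ZWNJ is zero-width, the joined token renders as one visual word while the segments remain typographically unconnected.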

8.2.2 Part-of-Speech Tagging


Part-of-speech (POS) tagging is the process of assigning a grammatical category, or word category, to a
word based on its grammatical function as well as its definition or meaning. POS tagging is of paramount
importance for many NLP applications; for parsing in particular, it reduces data sparsity. Besides,
prior disambiguation of words into their POS tags also helps in the annotation of dependency structures.
Words can usually be categorized into very few POS categories like verb, noun, adjective, adverb etc.
However, more fine-grained POS tag sets are used by different annotation projects, including treebanking
projects. For example, the Penn Treebank [128] uses 47 POS categories, the Brown corpus [71, 72] has
2 http://en.wikipedia.org/wiki/Zero-width_non-joiner

87 categories, while the Universal POS tag set contains a mere 12 universal categories [161]. To
tag Kashmiri text, we chose the current version of the Indian Language Machine Translation (ILMT) POS
tag set without any additional changes [22]. The ILMT POS tag set contains 32 tags. Following the
tagging guidelines, we manually tagged around 61,741 words (3,409 sentences) of the available Kashmiri
corpus.
The tagged corpus is used to build a statistical POS tagger for Kashmiri. The data set is split into training,
development and testing sets in an 80:10:10 ratio. We used a structured perceptron with second-order
structural features to build the POS tagger. In the baseline, we used a context window of 4 words,
capturing for every word the 2 preceding and 2 following words. To address the data sparsity and
OOV problems caused by the morphological richness of Kashmiri, we used the first and last n letters
of a word in different combinations. The overall feature set is shown in Table 8.2.

Word Unigrams wi , wi−1 , wi−2 , wi+1 , wi+2


Word Bigrams wi−2 wi−1 , wi−1 wi , wi wi+1 , wi+1 wi+2
Word Affixes pri , sui
Word length 1 if leni > 4 else 0

Table 8.2: Feature template used in POS tagging experiments. Affixes of a word are generated up
to 4 characters in different sequential combinations like first two letters, first three letters and first
four letters etc.

Model Precision Recall F-Score

Baseline 77.46 77.53 77.12


+Affixes 82.92 83.34 82.97

Table 8.3: POS tagging accuracy.

As shown in Table 8.3, the baseline score is quite low. The possible reasons for such a low score
are the small size of our training set and the lexical variation due to the rich inflectional nature of Kashmiri,
both of which aggravate the data sparsity problem. To address data sparsity, we used affix
information of words as letter n-gram features of orders ≤4. We expect this information to capture
morphological paradigms of words that are marked similarly for inflectional categories like number,
gender, person and case, and for tense, aspect and modality. Incorporating these features in our perceptron model
increased the accuracy by 5.85% over the baseline. The impact of affix information on the relevant word
classes can be clearly seen in Figure 8.1. The accuracies of word classes like nouns and noun modifiers
(see §8.1 for details) that inflect for different grammatical information improved substantially.
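The affix features of Table 8.2 can be sketched as follows; the example word insaan-an is from §8.1, and the feature-name strings are illustrative:

```python
def affix_features(word, max_n=4):
    """Prefix and suffix letter n-grams (orders 1..max_n) plus the
    word-length feature from Table 8.2."""
    feats = []
    for n in range(1, min(max_n, len(word)) + 1):
        feats.append("pre%d=%s" % (n, word[:n]))
        feats.append("suf%d=%s" % (n, word[-n:]))
    # binary feature: is the word longer than 4 characters?
    feats.append("len>4=%d" % (1 if len(word) > 4 else 0))
    return feats

# The suffix n-grams expose the '-an' ending shared across a whole
# inflectional paradigm (e.g. ergative/accusative marking).
print(affix_features("insaanan"))
```

Because paradigm members share suffixes, these features let the perceptron generalize over unseen inflected forms.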

[Bar chart: tag-wise F-scores of the baseline vs. the affix-feature model for the tags DEM, INTF,
JJ, NEG, NN, NNP, NST, PRP, PSP, QC, QF, QO, RB, RP, VAUX, VM, WQ and CC.]

Figure 8.1: Impact of affix-based features on tag-wise performance of our POS tagger.

8.2.3 Chunking
Chunking is the process of breaking down a sentence into fragments, usually bigger than words, called
constituents, phrases or chunks. It is an analysis of a sentence which identifies and distinguishes word
groups like noun groups, verb groups etc., without specifying their internal structure or their relations
in a sentence. Breaking a sentence into word groups or chunks is an important step towards sentence
parsing [1, 144], as chunks represent the minimal units that express major grammatical relations in a sen-
tence. Besides, chunking is also important for expediting the dependency annotation: unlike POS tags,
chunk tags are few in number and capture information spread over a set of words, such as finiteness
and case. The ILMT guidelines for chunking specify 12 chunk tags, among which 4 tags are particu-
larly reserved for verb groups to express the grammatical information they mark. We followed the ILMT
guidelines and manually chunked the already POS-tagged corpus of 3,409 sentences (41,678 chunks).
However, we made some significant changes to address the V2 phenomenon in Kashmiri. Unlike in other
Indian languages, auxiliaries in Kashmiri can stand away from the main verb, in the second position of
a clause, as a separate constituent; the notion of a verb group, in which the verb and its auxiliaries act as a
highly cohesive unit, is thus weaker in Kashmiri. We have therefore introduced a few additional chunk tags for
verb and auxiliary projections not used in the ILMT guidelines, treating every verb and auxiliary as a separate
unit rather than as part of a bigger verb group. The additional chunk tags are presented in Table 8.4.

S.No. Tag Description
1 AUXP All Auxiliaries
2 VCM Main Verb with separate tense auxiliary
3 VCF Tense verb
4 VCNN Gerund
5 VCNF Non-finite Verb
6 VCINF Infinitive

Table 8.4: New chunk tags.

We use the AUXP tag for auxiliary verbs since they can clearly stand alone as constituents in Kashmiri,
particularly in V2 clauses. Besides, to differentiate main verbs that carry tense from those
whose tensed auxiliaries (and/or aspectual and modal auxiliaries) stand away from them in the clause-
second position, we introduce two separate tags, namely VCF and VCM, for them respectively.
We used the manually chunked data to build an automatic chunker, using the averaged perceptron
for this purpose. The data is split 80:10:10 for training, tuning and testing the chunker. The features
used for learning are shown in Table 8.5. We set the baseline with a simple context window of the 2
preceding and 2 following words. The chunking accuracies are reported in Table 8.6.

Word Unigrams wi , wi−1 , wi−2 , wi+1 , wi+2


Tag Unigrams pi , pi−1 , pi−2 , pi+1 , pi+2
Word Bigrams wi−2 wi−1 , wi−1 wi , wi wi+1 , wi+1 wi+2
Tag Bigrams pi−2 pi−1 , pi−1 pi , pi pi+1 , pi+1 pi+2
Tag Trigrams pi−2 pi−1 pi , pi−1 pi pi+1 , pi pi+1 pi+2

Table 8.5: Feature template used in the chunking experiments. w denotes a word, p is the POS tag.

Model Precision Recall F-Score

Baseline 80.40 81.87 80.70


Gold POS-tag 92.35 93.77 92.30
Auto POS-tag 84.93 86.20 85.25

Table 8.6: Comparison of Chunking Accuracies using different feature models.

Using POS tags as features has obvious benefits for chunking, as chunk tags can largely be predicted
deterministically once the POS tags are known. A jump of 12% in F-score clearly shows the importance of
POS-tag features in chunking. Even though automatically predicted POS tags degrade the accuracy due to error
propagation, there is still a 5% F-score improvement over the baseline.

8.2.4 Dependency Annotation
Of the 3,409 sentences manually POS tagged and chunked, we annotated 1,500 sentences with
dependency structures at the chunk level. We used the dependency annotation guidelines proposed in
[14, 24]. The guidelines are based on the Computational Pān.inian Grammar formalism, inspired by the
ancient Indian grammarian Pān.ini [21]. Figure 8.2 shows the annotation of an example sentence
from the treebank based on these guidelines. The labels starting with ‘k’ are Pān.ini’s karaka relations,
which are central to the CPG formalism. Note that a karaka relation is a grammatical relation that holds
between a verb and its arguments or adjuncts.

(35) ām tor chi khori hinzi adji taliken mot tissue āsān .
common way be-Prs.3FSg heel of bone under thick tissue exist .
Usually, there is a thick tissue beneath the bone of the heel.

[Dependency tree of (35): the main verb āsān heads the clause, with dependents attached via the
relations root, adv, aux, r6, k7p and k1.]

Figure 8.2: Dependency tree of an example sentence from the Kashmiri treebank showing inter-
chunk annotations.

Although we could have used the annotation guidelines originally proposed for Hindi as such, we
made some minor changes to explicitly represent some of the phenomena prevalent in Kashmiri. With
respect to V2, we have already proposed some changes at the chunk level so that the V2 phenomenon can be
explicitly represented in the tree structure. Besides, we also propose to segment pronominal clitics
out of verbs and to treat them as verb arguments when the pronouns they refer to are null. Both
phenomena are discussed below with appropriate examples from the treebank.

8.2.4.0.1 V2: Regarding the annotation of V2 clauses in which tensed auxiliaries occupy the clause-
second position, we have two choices: either treat the tensed auxiliary as the head of the clause and attach
the arguments to the main verb, or treat tensed auxiliaries as they are treated in the Hindi-Urdu treebanks, i.e.,
as dependents of the main verb. The first option may seem appealing, since it emphasizes the V2 position;
however, such an analysis of V2 clauses will always render them non-projective. Instead, we choose
the second analysis, which does not present any such complications. It also restricts the analysis of finite
clauses to a single, uniform representation across Indian languages under the CPG formalism. The two
analyses are represented in Figures 8.2 and 8.3 respectively.

[Dependency tree of (35) with the tensed auxiliary chi as the head of the clause: the edge from
āsān to tor crosses the edge from the dummy root to chi.]

Figure 8.3: Dependency tree of example sentence 35. The tensed auxiliary is treated as the head of
the clause. The edge from āsān to tor is non-projective due to the crossing of the edge from the
dummy root to the auxiliary verb chi.
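Projectivity of such trees can be checked mechanically by testing every pair of arcs for crossing in the linear order; a minimal sketch, with invented example trees:

```python
def is_projective(heads):
    """Check projectivity of a dependency tree. heads[i] is the head of
    token i+1 (tokens are 1-based; 0 is the dummy root). A tree is
    non-projective iff two arcs cross in the linear order."""
    spans = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in spans:
        for l2, r2 in spans:
            if l1 < l2 < r1 < r2:  # the two arcs cross
                return False
    return True

# A chain-like 3-token tree is projective.
print(is_projective([2, 0, 2]))     # → True
# A 4-token tree where arcs (1,3) and (2,4) cross, mirroring the kind of
# crossing that a clause-medial auxiliary head would introduce.
print(is_projective([4, 0, 1, 2]))  # → False
```

A check like this makes it easy to quantify how often each analysis of V2 clauses would produce non-projective trees in the treebank.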

Even though we treat auxiliaries as dependents of the main verb, the V2 constraint makes Kashmiri dependency
structures differ from those of Hindi-Urdu in multiple ways. One such difference concerns complement
clause constructions. In Hindi and Urdu, complement clauses can introduce non-projectivity if they modify an
existing complement of the main verb (see the dependency tree of Example 23). In
Kashmiri, however, non-projectivity hardly occurs if the main verb occupies the second position. Similarly, if
the light verb in a light verb construction carries tense, the complex predicate becomes discontinuous in the
linear order of words, i.e., the light verb and the host nominal are separated from each other, which is not
desirable for parsing (see Figure 4.6 for an example of complex predicates in Hindi-Urdu). A similar
case is that of passivization: the auxiliary that reflects passivization mostly sits in the second position, away
from the main verb. Examples 36, 37 and 38 show a complement clause, a complex predicate and a
passive construction respectively.

(36) khabar chi yi zi kashīri hinden bālai alākan manz pov vāryāh shīn .
news be-PRS this that kashmir of mountainous areas in drop lot snow .
‘It is reported that there has been a lot of snowfall in mountainous areas of Kashmir.’

(37) Nadiyan kar kahānī yād .
Nadiya-3FSg.Erg do-Perf story memory .
‘Nadiya memorized the story.’

(38) Nadiyas āv yenām dini .
Nadiya-3FSg.Dat come-Pass prize give .
‘Nadiya was given a prize.’

8.2.4.0.2 Pronominal Cliticization: In addition to v2, Kashmiri shows another interesting property
called pronominal cliticization [79, 107], a known characteristic of Romance languages [12]. The
phenomenon is, however, not uncommon in South-Asian languages and can also be found in Punjabi,
Maithili, Sindhi, etc. [67]. It is a morpho-syntactic property associated with verbs in Kashmiri: in
appropriate discourse conditions, verbs are inflected with case suffixes which are governed by their
pronominal arguments [107]. Example 39 shows pronominal suffixation on the tensed auxiliary chi in the
absence of an overt third person pronoun.

(39) urdu vial chis chandā māmo wanān .
Urdu vial-3MPl be-Prs.3MPl.Dat chanda maamoo call-Prog .
‘Urdu speakers call it Chanda Mamu.’

Before motivating a specific treatment of these clitic markers in the Kashmiri treebank, we first provide
a descriptive overview based on Koul and Wali [107]. In Kashmiri, these clitics are restricted to
argument positions and show the following properties:

1. They agree with the governing pronoun in number and person only.

2. They are cued to the pronoun’s case form.

3. They are obligatory for second person pronouns.

4. They are obligatory for ergative marked first and third person null pronouns and optional in their
presence.

5. They are in complementary distribution with dative marked first and third person pronouns.

As per properties 3-5, these clitics always co-occur with pro-drop. This implies that verb arguments
are always explicitly represented in Kashmiri, either by pronouns/noun phrases or by clitic markers on
verbs. It is important to note that the agreement between a pronominal suffix and its governing pronoun
is not the primary agreement: in Kashmiri, verbs agree with one of their non-oblique/nominative argu-
ments in number, gender and person, while the pronominal clitics agree with their governing oblique
pronoun.
In the case of pro-drop, it thus seems reasonable to treat the pronominal clitic as a proper verb argument.
Since the clitic is cued to the corresponding pronoun’s case, the role of the argument can also be
easily identified computationally. Based on these considerations, we segment the pronominal clitics
from the verb whenever the corresponding pronouns are dropped. Since we do not have a morphological
analyzer, we perform the segmentation manually and treat the segmented clitic as a proper argument of the
main verb, completing its argument structure. Our analysis of pronominal clitics is similar to the morpheme-
based syntactic annotation of morphologically rich languages like Hebrew, Turkish and Arabic [122,
156, 173]. Our analysis of pronominal clitics is shown in the annotation of Example 39 in Figure 8.4.
[Dependency tree for example 39: urdu vial ch -is chandā māmo wanān .
(POS: NNP PSP VAUX PRP NNPC NNP VM SYM; relations shown include root, k1, aux, k2, k2g, rsym.)]

Figure 8.4: Intra-chunk dependency annotation of Example 39 showing the treatment of third
person pronominal clitic -is.
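Although we segment the clitics manually for lack of a morphological analyzer, the step could in principle be approximated with a suffix list. The sketch below is purely illustrative: the suffix inventory is a hypothetical subset, and a real segmenter would also need POS and context checks to avoid false splits.

```python
# Hypothetical subset of pronominal clitic suffixes (not the full
# Kashmiri inventory).
CLITIC_SUFFIXES = ("-is", "-as", "-an")

def segment_clitic(verb_form):
    """Split a verb form into (stem, clitic) if it ends in a known
    clitic suffix; otherwise return (verb_form, None)."""
    for suffix in CLITIC_SUFFIXES:
        bare = suffix.lstrip("-")
        if verb_form.endswith(bare) and len(verb_form) > len(bare):
            return verb_form[: -len(bare)], suffix
    return verb_form, None
```

On the auxiliary of Example 39, such a lookup would yield the split ch + -is shown in Figure 8.4.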

8.2.4.1 Inter-Annotator Agreement Study

We carried out an inter-annotator agreement study on a set of 100 sentences to gauge the quality of the
treebank. The data set was annotated separately by two expert linguists; good agreement on the data set
assures that the annotations are reliable. The data set contains 1,132 head-dependent dependency
chains marked with dependency relations from a tag set of 50 tags. The agreement measured
is chunk-based: for each chunk in a sentence, agreement is measured with respect to its relation with
the head it modifies. Inter-annotator agreement was measured using Cohen’s kappa [50], the most
widely used agreement coefficient for annotation tasks with categorical data. The agreement statistics
shown in Table 8.7 suggest that the annotators have a good understanding of the annotation guidelines
and of the morpho-syntax involved in the corpus. The major disagreements are observed for the major
dependency labels, namely k1 ‘agent/subject’, k2 ‘patient/theme/object’ and k1s ‘noun complement’,
which indicates that the arguments bearing k1, k2 and k1s relations with a verb are the most confusable
grammatical relations in the treebank; morpho-syntactic cues apparently serve as a poor guide in certain
contexts. Other reasons for disagreement include sentence length (lower agreement for sentences with
>50 words), higher degrees of argument scrambling, and ambiguity in morpho-syntactic cues (case
suffixes) or the lack thereof.

No. of Annotations   Agreement   P(a)    P(e)    κ
1132                 880         0.777   0.089   0.756

Table 8.7: Kappa statistics.
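Cohen's kappa corrects the observed agreement P(a) for the agreement P(e) expected by chance: κ = (P(a) − P(e)) / (1 − P(e)). The figures in Table 8.7 can be reproduced directly:

```python
def cohens_kappa(p_a, p_e):
    """Cohen's kappa: chance-corrected inter-annotator agreement."""
    return (p_a - p_e) / (1.0 - p_e)

# Figures from Table 8.7: 880 of the 1,132 relations were agreed upon,
# and the expected chance agreement P(e) is 0.089.
p_a = 880 / 1132                  # observed agreement, ≈ 0.777
kappa = cohens_kappa(p_a, 0.089)  # ≈ 0.756
```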

8.2.4.2 Parsing Experiments

The annotated corpora is used to build a first dependency parser of Kashmiri. The data set is split
into training, testing and development sets by the ratio of 80:10:10 for training, testing and tuning the
parsing model. We used our implementation of arc-eager parser which uses dynamic oracle to learn
the model parameters. In addition, we projectivized non-projective arcs in the training set using the
pseudo-projective transformation of Nivre and Nilsson [151]. Current version of Kashmiri treebank has
0.02 non-projective edges, which is, however, much lower compared to other Indian languages [27].
We experimented with different feature sets both gold as well as automated. The results are reported in
Table 8.9.
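The pseudo-projective transformation of Nivre and Nilsson lifts each non-projective arc to the head's head until the tree is projective, encoding the lift in the arc label so it can be undone after parsing. A simplified sketch of the lifting step (the label encoding here is a bare-bones variant of their scheme, not our exact implementation):

```python
def _dominates(heads, h, d):
    # True iff h is an ancestor of d (0 = artificial root).
    while d != 0:
        d = heads[d]
        if d == h:
            return True
    return h == 0

def _arc_projective(heads, h, d):
    lo, hi = sorted((h, d))
    return all(_dominates(heads, h, k) for k in range(lo + 1, hi))

def pseudo_projectivize(heads, labels):
    """Lift non-projective arcs to the head's head until the tree is
    projective, recording each lift in the dependent's label so the
    transformation can be undone after parsing."""
    heads, labels = list(heads), list(labels)
    lifted = True
    while lifted:
        lifted = False
        for d in range(1, len(heads)):
            h = heads[d]
            if h != 0 and not _arc_projective(heads, h, d):
                labels[d] += "|" + labels[h]   # encode the lift path
                heads[d] = heads[h]            # move head one step up
                lifted = True
    return heads, labels
```

Each lift moves a dependent's head strictly closer to the root, so the loop terminates, and the augmented labels let a post-processing step restore the original non-projective arcs.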
The baseline for parsing is set using just the raw tokens. In the predicted setting, POS and chunk features
are extracted using the POS tagger and chunker presented earlier in this chapter. Since Kashmiri has very
rich morphology and nominals are marked for case, we can use case marking as an explicit cue for
identifying certain dependency relations, particularly those related to verb argument structure. Unlike in
Hindi and Urdu, case marking in Kashmiri is part of word morphology. We extract the case suffix of a
nominal based on a suffix list (cf. Table 8.8) and use it as a feature in an appropriate parser configuration
(see Chapter 6 for more details on the role of case markers in parsing and how to incorporate them into
the parsing model). The suffix list of case markers is based on the Modern Grammar of Kashmiri by
Koul and Wali [107]. As shown in the results table, case information substantially increases accuracy
over the POS and chunk-based features in both the gold and predicted settings.
Besides, we also used WordNet3 features in our parsing model. These features are represented similarly
to those used in the Hindi and Urdu parsing models, as discussed in Chapter 7. With the addition of
these semantic features, we further improved parsing performance. Despite these improvements, our
parser is still not accurate enough; to improve accuracy further, we need to annotate more data to address
the problem of data sparsity. Even though the results are not that promising, we have set a baseline for
further research on Kashmiri dependency parsing.

Masculine Feminine
Case
Singular Plural Singular Plural
Ergative -an -av -i/an -av
Dative -as -an -i -an
Ablative -i -av -i -av
Genitive -un, -uk, -hund, -sund

Table 8.8: Case suffixes in Kashmiri (adapted from Koul and Wali). We have only listed the base
form of genitive, since the possessive suffix not only declines according to the number and gender
of the possessed noun but also according to its case.
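The suffix-based case extraction can be sketched as a longest-suffix lookup against Table 8.8. Since several suffixes are case-ambiguous, the lookup returns a set of candidate cases; the mapping below is a simplified, illustrative subset of the table, not our full suffix list:

```python
# Simplified subset of Table 8.8; ambiguous suffixes map to a set
# of candidate cases, leaving disambiguation to the parser.
SUFFIX_TO_CASES = {
    "an": {"Ergative", "Dative"},
    "av": {"Ergative", "Ablative"},
    "as": {"Dative"},
    "i":  {"Ergative", "Dative", "Ablative"},
}

def case_feature(nominal):
    """Return candidate cases for a nominal, trying longer suffixes first."""
    for suffix in sorted(SUFFIX_TO_CASES, key=len, reverse=True):
        if nominal.endswith(suffix) and len(nominal) > len(suffix):
            return SUFFIX_TO_CASES[suffix]
    return set()
```

For instance, the ergative subject Nadiyan of Example 37 and the dative subject Nadiyas of Example 38 would receive different candidate-case features, which is exactly the cue the parser exploits.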

Features      Gold                                          Predicted
              UAS            LS             LAS             UAS            LS             LAS
Baseline      44.11          35.17          30.30           44.11          35.17          30.30
+POS          68.23 (+24.12) 50.51 (+15.33) 45.46 (+15.16)  53.28 (+9.17)  45.47 (+10.30) 39.18 (+8.88)
+Chunk        72.25 (+4.02)  52.35 (+1.85)  49.17 (+3.71)   54.14 (+0.86)  47.20 (+1.73)  40.57 (+1.39)
+Case         73.71 (+1.46)  56.05 (+3.70)  52.89 (+3.72)   55.21 (+1.07)  52.05 (+4.85)  44.89 (+4.32)
+WordNet      73.76 (+0.05)  56.54 (+0.49)  53.31 (+0.42)   55.23 (+0.02)  52.62 (+0.57)  45.29 (+0.40)

Table 8.9: Effect of different features on our arc-eager parser (parenthesized values are gains over the previous row).

8.2.4.2.1 Intra-chunk parsing: As in the Hindi and Urdu treebanks, only inter-chunk dependen-
cies are manually annotated in the Kashmiri treebank. However, to train a parser that produces complete
parse trees, we need to expand the intra-chunk dependencies in the Kashmiri treebank. For this purpose,
we used the grammar-driven chunk expander formalized in Chapter 3. For treebanks that use the
ILMT POS tag set, the chunk expander needs minimal tuning (in terms of addition or deletion of
context-free rules); fortunately, for the Kashmiri treebank, the existing rules worked very well. We evalu-
ated the performance of the expander on a manually annotated test set of 100 sentences, on which it
produced quite accurate results, achieving an LAS of around 99.32%.

3 We could not use distributional semantics due to the unavailability of large-scale raw data in Kashmiri.
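The expander's core idea can be illustrated with a toy rule table: each non-head token in a chunk is attached to the chunk head with a POS-determined label. The rules and the default label below are illustrative stand-ins, not the actual rule set of Chapter 3:

```python
# Toy rule table mapping a dependent's POS tag to an intra-chunk label.
# Both the rules and the default label are illustrative.
RULES = {"PSP": "lwg__psp", "VAUX": "lwg__vaux", "JJ": "nmod__adj"}

def expand_chunk(tokens, tags, head):
    """Return (dependent, head, label) arcs for one chunk, where `head`
    is the index of the chunk head within the chunk."""
    return [
        (tokens[i], tokens[head], RULES.get(tag, "lwg__cont"))
        for i, tag in enumerate(tags)
        if i != head
    ]
```

Applied to a chunk such as khori hinzi (NN PSP) with the noun as head, this attaches the postposition to the noun, which is the kind of trivial, highly regular dependency that explains the expander's near-perfect LAS.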
We finally trained our parser on the expanded treebank. To train and evaluate the parser, we used the
expanded versions of the data sets used for inter-chunk parsing, with features similar to those used for
inter-chunk parsing. The results are reported in Table 8.10. Compared to inter-chunk parsing (see Table
8.9), our intra-chunk parser is quite accurate. The main reason for this improvement is that intra-chunk
dependencies are quite trivial and easy to predict; moreover, they form the bulk of the dependencies in a
syntactic tree.

Features        Gold                                        Predicted
                UAS           LS            LAS             UAS           LS            LAS
POS             80.31         71.78         69.73           75.09         65.20         63.15
+Chunk          80.82 (+0.51) 72.18 (+0.40) 70.11 (+0.38)   75.51 (+0.42) 65.39 (+0.19) 63.52 (+0.37)
+Lexical Case   81.25 (+0.43) 76.44 (+4.26) 72.78 (+2.67)   76.81 (+1.30) 69.31 (+3.92) 65.79 (+2.27)
+WordNet        81.35 (+0.10) 77.04 (+0.60) 73.18 (+0.40)   76.93 (+0.12) 69.86 (+0.55) 66.11 (+0.32)

Table 8.10: Intra-chunk parsing accuracies on the expanded version of the Kashmiri treebank (parenthesized values are gains over the previous row).

8.3 Summary
In this chapter, we have presented our efforts towards dependency parsing of Kashmiri text. More pre-
cisely, we have presented a dependency treebank of Kashmiri in which 3,409 sentences are POS tagged
and chunked, while 1,500 sentences are annotated with dependency structures. We made a few changes
to the guidelines [22] to accommodate the v2 phenomenon and pronominal cliticization in Kashmiri.
To assure the quality of the treebank, we carried out an inter-annotator agreement study; a kappa value
of κ=0.76 shows sufficient agreement between the two expert annotators and thus assures the quality of
the treebank. Finally, the treebank data is used to set up a dependency parsing pipeline for Kashmiri
text, which includes a tokenizer, a POS tagger, a chunker and a complete word-level dependency parser.

Chapter 9

Summary and Future Work

In this chapter, we summarize the main concepts and contributions of the thesis, and briefly discuss
possible directions for extending the research presented here.

9.1 Conclusion
This thesis targets problems in the dependency analysis of Indian languages, which are morphologically
rich in nature. Dependency analysis of text is a prerequisite for many applications such as
information extraction, sentiment analysis, text summarization, and machine translation. Over the last
decade, many parsing algorithms have been proposed for data-driven dependency parsing of natural
language text. It has been observed that these statistical parsing techniques usually underperform on
morphologically rich languages, primarily because morphological richness leads to high variation in
word order and creates high lexical diversity.
We have proposed various strategies to tackle the problems posed by the rich morphological nature of
Indian languages and to enhance their parsing performance. Our primary goals in this thesis have been
to present efficient representations of morphosyntactic interactions in parsing models and to tackle
high lexical and syntactic diversity.
Firstly, we have presented efficient representations for core grammatical phenomena such as case mark-
ing and agreement in a non-deterministic transition-based parser. The proposed representations exploit
the correlation between morphology and syntax, and significantly improve the performance of these
parsers on Hindi and Urdu test sets. Secondly, we have proposed various strategies to address the issues
concerning annotation sparsity. We have shown that existing linguistic resources can be leveraged for
mitigating the effect of both lexical and syntactic sparsity. Concerning syntactic sparsity, we have pro-
posed a simple sampling technique to create training instances for parsing non-canonical texts, and have
identified syntactic cues for parsing of sparse non-projective structures. Similarly, for lexical diversity,
we have presented a harmonization technique for sharing and augmenting Hindi and Urdu treebanks for
training more diverse parsing models. Further, we also proposed to use lexical semantics for accurately
parsing highly rich dependency relations in Indian language treebanks. We have empirically shown that

the proposed methods perform reasonably well on Hindi and Urdu test sets. Finally, we have developed
the first dependency parser for Kashmiri along the lines of our work on parsing Hindi and Urdu.

9.2 Future Research Directions


While this thesis has presented effective strategies for tackling issues of feature representation and
data sparsity in the statistical parsing of Indian languages, many opportunities for extending its scope
remain. Some future research directions include the following:

1. The methods and approaches presented in this thesis can be applied to more Indian languages to
evaluate their efficiency.

2. Our parsers, despite being robust, may still require non-trivial adaptations for parsing of social
media data, particularly code mixed data.

3. The parsing results can be further improved by using more advanced learning architectures for
transition-based parsers such as Long Short-Term Memory networks [64].

4. For those Indian languages without treebanks, cross-lingual techniques can be used for parsing,
such as bitext projection methods [92] and direct transfer strategies using distributed interlingual
word representations [195].

Bibliography

[1] S. P. Abney. Parsing by chunks. Springer, 1992.


[2] F. Adeeba and S. Hussain. Experiences in building the Urdu wordnet. Proceedings of Asian
Language Resources collocated with IJCNLP, 2011.
[3] E. Agirre, T. Baldwin, and D. Martinez. Improving parsing and PP attachment performance with
sense information. Proceedings of ACL-08: HLT, 2008.
[4] E. Agirre, K. Bengoetxea, K. Gojenola, and J. Nivre. Improving dependency parsing with se-
mantic classes. In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics, 2011.
[5] T. Ahmed and A. Hautli. Developing a basic lexical resource for Urdu using Hindi WordNet.
Proceedings of CLT10, Islamabad, Pakistan, 2010.
[6] Y. Al-Onaizan and K. Knight. Machine transliteration of names in Arabic text. In Proceed-
ings of the ACL-02 workshop on Computational approaches to semitic languages, pages 1–13.
Association for Computational Linguistics, 2002.
[7] W. Ali and S. Hussain. Urdu dependency parser: a data-driven approach. In Proceedings of
Conference on Language and Technology (CLT10), SNLP, Lahore, Pakistan, 2010.
[8] B. R. Ambati, P. Gade, C. Gsk, and S. Husain. Effect of minimal semantics on dependency
parsing. In Proceedings of the Student Research Workshop, Borovets, Bulgaria, 2009. Association
for Computational Linguistics.
[9] B. R. Ambati, S. Husain, S. Jain, D. M. Sharma, and R. Sangal. Two methods to incorporate
local morphosyntactic features in Hindi dependency parsing. In Proceedings of the NAACL HLT
2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 22–30.
Association for Computational Linguistics, 2010.
[10] B. R. Ambati, S. Husain, J. Nivre, and R. Sangal. On the role of morphosyntactic features in
Hindi dependency parsing. In Proceedings of the NAACL HLT 2010 First Workshop on Statisti-
cal Parsing of Morphologically-Rich Languages, pages 94–102. Association for Computational
Linguistics, 2010.
[11] B. R. Ambati, T. Deoskar, and M. Steedman. Using ccg categories to improve Hindi dependency
parsing. In ACL (2), pages 604–609, 2013.
[12] S. R. Anderson. A-morphous morphology, volume 62. Cambridge University Press, 1992.

[13] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In
International Joint Conference on Artificial Intelligence, volume 18, 2003.
[14] R. Begum, S. Husain, A. Dhwaj, D. M. Sharma, L. Bai, and R. Sangal. Dependency annotation
scheme for indian languages. In IJCNLP, pages 721–726. Citeseer, 2008.
[15] R. Begum, K. Jindal, A. Jain, S. Husain, and D. M. Sharma. Identification of conjunct verbs in
Hindi and its effect on parsing accuracy. In Proceedings of the 12th international conference on
Computational linguistics and intelligent text processing-Volume Part I, pages 29–40. Springer-
Verlag, 2011.
[16] K. Bellare, K. Crammer, and D. Freitag. Loss-sensitive discriminative training of machine
transliteration models. In Proceedings of Human Language Technologies: The 2009 Annual
Conference of the North American Chapter of the Association for Computational Linguistics,
Companion Volume: Student Research Workshop and Doctoral Consortium, pages 61–65. Asso-
ciation for Computational Linguistics, 2009.
[17] K. Bengoetxea and K. Gojenola. Application of feature propagation to dependency parsing.
In Proceedings of the 11th International Conference on Parsing Technologies, pages 142–145.
Association for Computational Linguistics, 2009.
[18] K. Bengoetxea, K. Gojenola, and A. Casillas. Testing the effect of morphological disambiguation
in dependency parsing of basque. In Proceedings of the Second Workshop on Statistical Parsing
of Morphologically Rich Languages, pages 28–33. Association for Computational Linguistics,
2011.
[19] K. Bengoetxea, E. Agirre, J. Nivre, Y. Zhang, and K. Gojenola. On wordnet semantic classes
and dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pages 649–655, Baltimore, Maryland, June
2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/
P14-2106.
[20] A. Bharati, V. Chaitanya, R. Sangal, and K. Ramakrishnamacharyulu. Natural Language Pro-
cessing: A Paninian Perspective. Prentice-Hall of India, 1995.
[21] A. Bharati, V. Chaitanya, R. Sangal, and K. Ramakrishnamacharyulu. Natural language process-
ing: a Paninian perspective. Prentice-Hall of India New Delhi, 1995.
[22] A. Bharati, R. Sangal, D. M. Sharma, and L. Bai. Anncorra: Annotating corpora guidelines
for POS and Chunk annotation for Indian languages. Technical report, (TR-LTRC-31), LTRC,
IIIT-Hyderabad, 2006.
[23] A. Bharati, S. Husain, B. Ambati, S. Jain, D. Sharma, and R. Sangal. Two semantic features
make all the difference in parsing accuracy. Proc. of ICON, 8, 2008.
[24] A. Bharati, D. S. S. Husain, L. Bai, R. Begam, and R. Sangal. Anncorra: Treebanks for indian
languages, guidelines for annotating Hindi treebank (version–2.0), 2009.
[25] R. Bhat and D. Sharma. A dependency treebank of Urdu and its evaluation. In Proceedings of
6th Linguistic Annotation Workshop (ACL HLT 2012). Jeju, Republic of Korea. 2012.

[26] R. A. Bhat and D. M. Sharma. A dependency treebank of Urdu and its evaluation. In Proceed-
ings of the Sixth Linguistic Annotation Workshop, pages 157–165. Association for Computational
Linguistics, 2012.
[27] R. A. Bhat and D. M. Sharma. Non-projective structures in indian language treebanks. In The 11th
International Workshop on Treebanks and Linguistic Theories, pages 25–30. Edições Colibri,
2012.
[28] R. A. Bhat, S. Jain, and D. M. Sharma. Experiments on dependency parsing of Urdu. In The 11th
International Workshop on Treebanks and Linguistic Theories, 2012.
[29] R. A. Bhat, S. M. Bhat, and D. M. Sharma. Towards building a Kashmiri treebank: setting up the
annotation pipeline. In Ninth International Conference on Language Resources and Evaluation
(LREC-2014), pages 748–752, 2014.
[30] R. A. Bhat, N. Jain, A. Vaidya, M. Palmer, T. A. Khan, D. M. Sharma, and J. Babani. Adapting
predicate frames for Urdu propbanking. In Proceedings of LT4CloseLang: Language Technology
for Closely Related Languages and Language Variants, 2014.
[31] R. A. Bhat, R. Bhatt, A. Farudi, P. Klassen, B. Narasimhan, M. Palmer, O. Rambow, D. M.
Sharma, A. Vaidya, S. R. Vishnu, et al. The Hindi/Urdu treebank project. In Handbook of
Linguistic Annotation. Springer Press, 2015.
[32] R. Bhatt. Verb movement in Kashmiri. U. Penn Working Papers in Linguistics, 2, 1995.
[33] R. Bhatt. Verb movement and the syntax of Kashmiri, volume 46. Springer Netherlands, 1999.
[34] R. Bhatt, B. Narasimhan, M. Palmer, O. Rambow, D. M. Sharma, and F. Xia. A multi-
representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the Third Lin-
guistic Annotation Workshop, pages 186–189. Association for Computational Linguistics, 2009.
[35] P. Bhattacharyya. Indowordnet. In N. Calzolari (Conference Chair), K. Choukri, B. Maegaard, J. Mariani, J. Odijk,
S. Piperidis, M. Rosner, and D. Tapias, editors, Proceedings of the Seventh conference on Inter-
national Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010. European
Language Resources Association (ELRA). ISBN 2-9517408-6-7.
[36] D. Biber. Methodological issues regarding corpus-based analyses of linguistic variation. Literary
and Linguistic Computing, 5:257–269, 1990.
[37] M. Bodirsky, M. Kuhlmann, and M. Möhl. Well-nested drawings as models of syntactic structure.
In Tenth Conference on Formal Grammar and Ninth Meeting on Mathematics of Language, 2009.
[38] T. Bögel, M. Butt, and S. Sulger. Urdu ezafe and the morphology-syntax interface. Proceedings
of LFG08, 2008.
[39] A. Böhmová, J. Hajič, E. Hajičová, and B. Hladká. The Prague Dependency Treebank: Three-
Level Annotation Scenario. In A. Abeillé, editor, Treebanks: Building and Using Syntactically
Annotated Corpora. Kluwer Academic Publishers, 2001.
[40] S. Brants, S. Dipper, S. Hansen, W. Lezius, and G. Smith. The TIGER Treebank. In Proceedings
of the Workshop on Treebanks and Linguistic Theories, Sozopol, 2002.

[41] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram
models of natural language. Computational linguistics, 18(4):467–479, 1992.
[42] M. Butt and T. H. King. The status of case. In Clause structure in South Asian languages, pages
153–198. Springer, 2004.
[43] M. Butt and T. H. King. Urdu ezafe and the morphology-syntax interface. Proceedings of LFG08.
CSLI Publications, Stanford, 2008.
[44] M. Butt, T. Bögel, A. Hautli, S. Sulger, and T. Ahmed. Identifying Urdu complex predication via
bigram extraction. In COLING, pages 409–424, 2012.
[45] M. Candito and M. Constant. Strategies for contiguous multiword expression analysis and de-
pendency parsing. In ACL 14-The 52nd Annual Meeting of the Association for Computational
Linguistics. ACL, 2014.
[46] M. Candito, E. H. Anguiano, and D. Seddah. A word clustering approach to domain adaptation:
Effective parsing of biomedical texts. In Proceedings of the 12th International Conference on
Parsing Technologies, pages 37–42. Association for Computational Linguistics, 2011.
[47] C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot.
Computer Speech & Language, 20(4), 2006.
[48] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), volume 1, pages 740–750, 2014.
[49] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography.
Computational linguistics, 16(1):22–29, 1990.
[50] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Mea-
surement, 20(1):37–46, 1960.
[51] M. Collins. Discriminative training methods for hidden markov models: Theory and experiments
with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in
natural language processing-Volume 10, pages 1–8. Association for Computational Linguistics,
2002.
[52] M. Collins, L. Ramshaw, J. Hajič, and C. Tillmann. A statistical parser for czech. In Proceedings
of the 37th annual meeting of the Association for Computational Linguistics on Computational
Linguistics. Association for Computational Linguistics, 1999.
[53] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language
processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2011.
[54] B. Comrie. Language universals and linguistic typology: Syntax and morphology. University of
Chicago press, 1989.
[55] B. Comrie. Language universals and linguistic typology: Syntax and morphology. University of
Chicago press, 1989.
[56] M. Constant, A. Sigogne, and P. Watrin. Discriminative strategies to integrate multiword expres-
sion recognition and parsing. In Proceedings of the 50th Annual Meeting of the Association for

Computational Linguistics: Long Papers-Volume 1, pages 204–212. Association for Computa-
tional Linguistics, 2012.
[57] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
[58] J. R. Curran. From distributional to semantic similarity. PhD thesis, 2004.
[59] H. Daumé III. Frustratingly easy domain adaptation. Proceedings of Association of Computa-
tional Linguistics, 2007.
[60] M.-C. De Marneffe and C. D. Manning. The stanford typed dependencies representation. In Col-
ing 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evalua-
tion, pages 1–8. Association for Computational Linguistics, 2008.
[61] Z. Dong and Q. Dong. Hownet Chinese-English conceptual database. Technical report, Technical
Report Online Software Database, Released at ACL. http://www. keenage. com, 2000.
[62] T. Dunning. Statistical identification of language. Computing Research Laboratory, New Mexico
State University, 1994.
[63] N. Durrani and S. Hussain. Urdu word segmentation. In Human Language Technologies: The
2010 Annual Conference of the North American Chapter of the Association for Computational
Linguistics. Association for Computational Linguistics, 2010.
[64] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith. Transition-based dependency
parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference on Natu-
ral Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July 2015.
Association for Computational Linguistics.
[65] J. M. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Pro-
ceedings of the 16th conference on Computational linguistics-Volume 1, pages 340–345. Associ-
ation for Computational Linguistics, 1996.
[66] H. Elfardy and M. T. Diab. Token level identification of linguistic code switching. In COLING
(Posters), pages 287–296, 2012.
[67] M. B. Emeneau. India as a linguistic area. Language, 32(1):3–16, 1956.
[68] G. Eryiğit, T. Ilbay, and O. A. Can. Multiword expressions in statistical dependency parsing. In
Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages,
pages 45–55. Association for Computational Linguistics, 2011.
[69] M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith. Retrofitting word vectors
to semantic lexicons. In Proc. of NAACL, 2015.
[70] M. Federico, N. Bertoldi, and M. Cettolo. Irstlm: an open source toolkit for handling large scale
language models. In Interspeech, pages 1618–1621, 2008.
[71] W. N. Francis and H. Kucera. Brown corpus manual. Brown University, 1979.
[72] W. N. Francis and H. Kucera. The Brown Corpus: A Standard Corpus of Present-Day Edited
American English, 1979. Brown University Linguistics Department.

[73] S. Fujita, F. Bond, S. Oepen, and T. Tanaka. Exploiting Semantic Information for HPSG Parse
Selection. In ACL 2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, 2007.
Association for Computational Linguistics.
[74] Y. Goldberg and M. Elhadad. Easy first dependency parsing of modern hebrew. In Proceedings of
the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages,
pages 103–107. Association for Computational Linguistics, 2010.
[75] Y. Goldberg and M. Elhadad. Word segmentation, unknown-word resolution, and morphological
agreement in a hebrew parsing system. Computational Linguistics, 39(1):121–160, 2013.
[76] Y. Goldberg and J. Nivre. A dynamic oracle for arc-eager dependency parsing. In COLING,
pages 959–976, 2012.
[77] Y. Goldberg and J. Nivre. Training deterministic parsers with non-deterministic oracles. Trans-
actions of the association for Computational Linguistics, 1:403–414, 2013.
[78] J. H. Greenberg. Some universals of grammar with particular reference to the order of meaningful
elements. Universals of language, 2:73–113, 1963.
[79] G. A. Grierson. On pronominal suffixes in the Kashmiri language. Journal of the Asiatic Society
of Bengal, 64(4), 1895.
[80] L. Haizhou, Z. Min, and S. Jian. A joint source-channel model for machine transliteration. In
Proceedings of the 42nd Annual Meeting on association for Computational Linguistics, page 159.
Association for Computational Linguistics, 2004.
[81] J. Hajic, J. Panevová, E. Hajicová, P. Sgall, P. Pajas, J. Štepánek, J. Havelka, M. Mikulová,
Z. Zabokrtskỳ, and M. Š. Razımová. Prague dependency treebank 2.0. CD-ROM, Linguistic
Data Consortium, LDC Catalog No.: LDC2006T01, Philadelphia, 98, 2006.
[82] E. Hajicová, J. Havelka, P. Sgall, K. Veselá, and D. Zeman. Issues of projectivity in the prague
dependency treebank. Prague Bulletin of Mathematical Linguistics, 81, 2004.
[83] J. Hall, J. Nivre, and J. Nilsson. Discriminative classifiers for deterministic dependency pars-
ing. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 316–323.
Association for Computational Linguistics, 2006.
[84] Z. S. Harris. Distributional structure. Word, 1954.
[85] J. Havelka. Beyond projectivity: Multilingual evaluation of constraints and measures on non-
projective structures. In Proceedings of the 45th Annual Meeting of the Association for Compu-
tational Linguistics, volume 45, 2007.
[86] G. Hirst. Semantic interpretation and the resolution of ambiguity. Cambridge University Press,
1992.
[87] M. Hohensee. It’s Only Morpho-Logical: Modeling Agreement in Cross-Linguistic Dependency
Parsing. PhD thesis, University of Washington, 2012.
[88] M. Hohensee and E. M. Bender. Getting more from morphology in multilingual dependency
parsing. In Proceedings of the 2012 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pages 315–326. Association for

Computational Linguistics, 2012.
[89] P. Hook and A. Manaster-Ramer. The position of the finite verb in Germanic and Kashmiri.
towards a typology of v2 languages. Germanic Linguistics, 1985.
[90] D. Hovy, S. Tratz, and E. Hovy. What’s in a preposition?: dimensions of sense disambiguation for
an interesting word class. In Proceedings of the 23rd International Conference on Computational
Linguistics: Posters, pages 454–462. Association for Computational Linguistics, 2010.
[91] S. Husain and B. Agrawal. Analyzing parser errors to improve parsing accuracy and to inform
treebanking decisions. Linguistic Issues in Language Technology, 7(1), 2012.
[92] R. Hwa, P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak. Bootstrapping parsers via syntactic
projection across parallel texts. Natural Language Engineering, 11(3):311–325, 2005.
[93] S. Jain, N. Jain, A. Tammewar, R. A. Bhat, and D. M. Sharma. Exploring semantic information
in Hindi WordNet for Hindi dependency parsing. In International Joint Conference on Natural
Language Processing, Nagoya, Japan, 14-18 October 2013., pages 189–197, 2013.
[94] B. Jawaid and T. Ahmed. Hindi to Urdu conversion: beyond simple transliteration. In Conference
on Language and Technology, 2009.
[95] I. Jena, R. A. Bhat, S. Jain, and D. M. Sharma. Animacy annotation in the Hindi treebank. LAW
VII & ID, page 159, 2013.
[96] G. N. Jha. The TDIL program and the Indian Language Corpora Initiative (ILCI). In Proceedings
of the Seventh Conference on International Language Resources and Evaluation (LREC 2010).
European Language Resources Association (ELRA), 2010.
[97] J. J. Katz and J. A. Fodor. The structure of a semantic theory. Language, 39(2):170–210, 1963.
[98] A. Kilgarriff. Comparing corpora. International Journal of Corpus Linguistics, 6(1):97–133.
John Benjamins Publishing Company, 2001.
[99] A. Kilgarriff. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory,
1(2):263–276, 2005.
[100] B. King and S. P. Abney. Labeling the languages of words in mixed-language documents using
weakly supervised methods. In HLT-NAACL, pages 1110–1119, 2013.
[101] K. Knight and J. Graehl. Machine transliteration. Computational Linguistics, 24(4):599–612,
1998.
[102] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of the
2003 Conference of the North American Chapter of the Association for Computational Linguis-
tics on Human Language Technology-Volume 1, pages 48–54. Association for Computational
Linguistics, 2003.
[103] T. Koo and M. Collins. Efficient third-order dependency parsers. In Proceedings of the 48th
Annual Meeting of the Association for Computational Linguistics, pages 1–11. Association for
Computational Linguistics, 2010.
[104] T. Koo, X. Carreras Pérez, and M. Collins. Simple semi-supervised dependency parsing. In 46th
Annual Meeting of the Association for Computational Linguistics, pages 595–603, 2008.

[105] P. Kosaraju, S. R. Kesidi, V. B. R. Ainavolu, and P. Kukkadapu. Experiments on Indian lan-
guage dependency parsing. In Proceedings of the ICON10 NLP Tools Contest: Indian Language
Dependency Parsing, 2010.
[106] P. Kosaraju, S. Husain, B. R. Ambati, D. M. Sharma, and R. Sangal. Intra-chunk dependency an-
notation: expanding Hindi inter-chunk annotated treebank. In Proceedings of the Sixth Linguistic
Annotation Workshop, pages 49–56. Association for Computational Linguistics, 2012.
[107] O. Koul and K. Wali. Modern Kashmiri Grammar. Dunwoody Press, 2006.
[108] S. Kübler, R. McDonald, and J. Nivre. Dependency parsing. Synthesis Lectures on Human
Language Technologies, 1(1):1–127, 2009.
[109] T. Kudo and Y. Matsumoto. Japanese dependency analysis using cascaded chunking. In Proceed-
ings of the 6th Conference on Natural Language Learning-Volume 20, pages 1–7. Association for
Computational Linguistics, 2002.
[110] M. Kuhlmann and M. Mohl. Mildly context-sensitive dependency languages. In Proceedings of
the 45th Annual Meeting of the Association for Computational Linguistics, volume 45, 2007.
[111] M. Kuhlmann and J. Nivre. Mildly non-projective dependency structures. In Proceedings of the
COLING/ACL on Main conference poster sessions. Association for Computational Linguistics,
2006.
[112] A. Kulkarni. A deterministic dependency parser with dynamic programming for Sanskrit. In
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013),
pages 157–166, 2013.
[113] A. Kulkarni and K. Ramakrishnamacharyulu. Parsing Sanskrit texts: Some relation specific is-
sues. In Proceedings of the 5th International Sanskrit Computational Linguistics Symposium. DK
Printworld (P) Ltd, 2013.
[114] A. Kulkarni, S. Pokar, and D. Shukl. Designing a constraint based parser for Sanskrit. In Sanskrit
Computational Linguistics, pages 70–90. Springer, 2010.
[115] A. Kulkarni, P. Shukla, P. Satuluri, and D. Shukl. How free is ‘free’ word order in
Sanskrit? Sanskrit Syntax, page 269, 2015.
[116] G. S. Lehal and T. S. Saini. A Hindi to Urdu transliteration system. In Proceedings of ICON-
2010: 8th International Conference on Natural Language Processing, Kharagpur, 2010.
[117] G. S. Lehal and T. S. Saini. Development of a complete Urdu-Hindi transliteration system. In
COLING (Posters), pages 643–652, 2012.
[118] G. S. Lehal and T. S. Saini. Sangam: A Perso-Arabic to Indic script machine transliteration
model. In Proceedings of the 11th International Conference on Natural Language Processing,
2014.
[119] P. Liang. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute
of Technology, 2005.
[120] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information
Theory, 37(1):145–151, 1991.

[121] M. Lui, J. H. Lau, and T. Baldwin. Automatic detection and language identification of multilin-
gual documents. Transactions of the Association for Computational Linguistics, 2:27–40, 2014.
[122] M. Maamouri, A. Bies, T. Buckwalter, and W. Mekki. The Penn Arabic Treebank: Building a
large-scale annotated Arabic corpus. In NEMLAR Conference on Arabic Language Resources
and Tools, 2004.
[123] A. MacKinlay, R. Dridan, D. McCarthy, and T. Baldwin. The effects of semantic annotations
on precision parse ranking. In Proceedings of the First Joint Conference on Lexical and Com-
putational Semantics-Volume 1: Proceedings of the main conference and the shared task, and
Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation. Association
for Computational Linguistics, 2012.
[124] W. Maier and T. Lichte. Characterizing discontinuity in constituent treebanks. In Formal Gram-
mar. Springer, 2011.
[125] M. G. Malik, C. Boitet, and P. Bhattacharyya. Hindi Urdu machine transliteration using finite-
state transducers. In Proceedings of the 22nd International Conference on Computational
Linguistics-Volume 1, pages 537–544. Association for Computational Linguistics, 2008.
[126] D. K. Malladi and P. Mannem. Statistical morphological analyzer for Hindi. In IJCNLP, pages
1007–1011, 2013.
[127] P. Mannem, H. Chaudhry, and A. Bharati. Insights into non-projectivity in Hindi. In Proceedings
of the ACL-IJCNLP 2009 Student Research Workshop, pages 10–17. Association for Computa-
tional Linguistics, 2009.
[128] M. Marcus, M. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English:
The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[129] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and
B. Schasberger. The Penn Treebank: Annotating predicate argument structure. In Proceedings of
the workshop on Human Language Technology, pages 114–119. Association for Computational
Linguistics, 1994.
[130] Y. Marton, N. Habash, and O. Rambow. Dependency parsing of modern standard Arabic with
lexical and inflectional features. Computational Linguistics, 39(1):161–194, 2013.
[131] C. P. Masica. The Indo-Aryan Languages. Cambridge University Press, 1993.
[132] R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. Non-projective dependency parsing using
spanning tree algorithms. In Proceedings of the conference on Human Language Technology and
Empirical Methods in Natural Language Processing, pages 523–530. Association for Computa-
tional Linguistics, 2005.
[133] R. T. McDonald and J. Nivre. Characterizing the errors of data-driven dependency parsing mod-
els. In EMNLP-CoNLL, pages 122–131, 2007.
[134] J.-j. Mei and Y. Gao. Tongyi cilin (a Chinese thesaurus). China: Shanghai Lexicographical
Publishing House, 1996.

[135] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781, 2013.
[136] S. Mille, A. Burga, G. Ferraro, and L. Wanner. How does the granularity of an annotation scheme
influence dependency parsing performance? In COLING (Posters), pages 839–852, 2012.
[137] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11),
1995.
[138] T. Mohanan. Argument structure in Hindi. Center for the Study of Language (CSLI), 1994.
[139] S. Montemagni, F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli,
F. Fanciulli, M. Massetani, R. Raffaelli, et al. Building the Italian syntactic-semantic treebank.
In Treebanks. Springer, 2003.
[140] S. Mukund, R. Srihari, and E. Peterson. An information-extraction system for Urdu—a resource-
poor language. ACM Transactions on Asian Language Information Processing (TALIP), 9(4):15,
2010.
[141] C. M. Naim. Introductory Urdu, 2 volumes. Revised edition, 1999.
[142] D. Narayan, D. Chakrabarti, P. Pande, and P. Bhattacharyya. An experience in building the Indo
WordNet: a WordNet for Hindi. In First International Conference on Global WordNet, Mysore,
India, 2002.
[143] J. Nerbonne and W. Wiersma. A measure of aggregate syntactic distance. In Proceedings of
the Workshop on linguistic Distances, pages 82–90. Association for Computational Linguistics,
2006.
[144] G. Neumann, C. Braun, and J. Piskorski. A divide-and-conquer strategy for shallow parsing of
German free texts. In Proceedings of the sixth conference on Applied natural language process-
ing. Association for Computational Linguistics, 2000.
[145] D. Nguyen and A. S. Dogruoz. Word level language identification in online multilingual com-
munication. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language
Processing, 2013.
[146] J. Nivre. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th
International Workshop on Parsing Technologies (IWPT), 2003.
[147] J. Nivre. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on
Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57. Association
for Computational Linguistics, 2004.
[148] J. Nivre. Constraints on non-projective dependency parsing. In Eleventh Conference of the
European Chapter of the Association for Computational Linguistics (EACL), 2006.
[149] J. Nivre. Algorithms for deterministic incremental dependency parsing. Computational Linguis-
tics, 34(4):513–553, 2008.
[150] J. Nivre. Parsing Indian languages with MaltParser. In Proceedings of the ICON09 NLP Tools
Contest: Indian Language Dependency Parsing, pages 12–18, 2009.

[151] J. Nivre and J. Nilsson. Pseudo-projective dependency parsing. In Proceedings of the 43rd
Annual Meeting on Association for Computational Linguistics, pages 99–106. Association for
Computational Linguistics, 2005.
[152] J. Nivre and J. Nilsson. Pseudo-projective dependency parsing. In Proceedings of the 43rd
Annual Meeting on Association for Computational Linguistics, pages 99–106. Association for
Computational Linguistics, 2005.
[153] J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald,
S. Petrov, S. Pyysalo, N. Silveira, et al. Universal dependencies v1: A multilingual treebank
collection. In Proceedings of the 10th International Conference on Language Resources and
Evaluation (LREC 2016), 2016.
[154] A. Nourian, M. S. Rasooli, M. Imany, and H. Faili. On the importance of ezafe construction in
Persian parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume
2: Short Papers), pages 877–882, Beijing, China, July 2015. Association for Computational
Linguistics.
[155] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Compu-
tational Linguistics, 29(1):19–51, 2003.
[156] K. Oflazer, B. Say, D. Hakkani-Tür, and G. Tür. Building a Turkish treebank. In A. Abeillé,
editor, Treebanks: Building and Using Parsed Corpora. Kluwer Academic Publishers, 2003.
[157] J.-H. Oh and K.-S. Choi. An English-Korean transliteration model using pronunciation and con-
textual rules. In Proceedings of the 19th International Conference on Computational Linguistics-
Volume 1, pages 1–7. Association for Computational Linguistics, 2002.
[158] L. Øvrelid. Empirical evaluations of animacy annotation. In Proceedings of the 12th Conference
of the European Chapter of the Association for Computational Linguistics (EACL), 2009.
[159] L. Øvrelid and J. Nivre. When word order and part-of-speech tags are not enough–Swedish
dependency parsing with rich linguistic features. In Proceedings of the International Conference
on Recent Advances in Natural Language Processing (RANLP), 2007.
[160] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret-
tenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of
Machine Learning Research, 12:2825–2830, 2011.
[161] S. Petrov, D. Das, and R. McDonald. A universal part-of-speech tagset. In N. Calzolari,
K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis,
editors, Proceedings of the Eighth International Conference on Language Resources and Eval-
uation (LREC-2012), pages 2089–2096, Istanbul, Turkey, 2012. European Language Resources
Association (ELRA).
[162] B. Plank and A. Moschitti. Embedding semantic similarity in tree kernels for domain adaptation
of relation extraction. In ACL (1), pages 1498–1507, 2013.

[163] B. Plank and G. Van Noord. Effective measures of domain similarity for parsing. In Proceedings
of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1, pages 1566–1576. Association for Computational Linguistics, 2011.
[164] K. Prasad and S. M. Virk. Computational evidence that Hindi and Urdu share a grammar but not
the lexicon. In Proceedings of the 24th International Conference on Computational Linguistics
(COLING 2012), 2012.
[165] K. Riaz. Urdu is not Hindi for information access. In Workshop on Multilingual Information
Access, SIGIR, 2009.
[166] H. Sajjad, N. Durrani, H. Schmid, and A. Fraser. Comparing two techniques for learning translit-
eration models using a parallel corpus. In Proceedings of 5th International Joint Conference on
Natural Language Processing, pages 129–137, Chiang Mai, Thailand, November 2011. Asian
Federation of Natural Language Processing.
[167] H. Sajjad, A. Fraser, and H. Schmid. An algorithm for unsupervised transliteration mining with
an application to word alignment. In Proceedings of the 49th Annual Meeting of the Associa-
tion for Computational Linguistics: Human Language Technologies-Volume 1, pages 430–439.
Association for Computational Linguistics, 2011.
[168] H. Sajjad, A. Fraser, and H. Schmid. A statistical model for unsupervised and semi-supervised
transliteration mining. In Proceedings of the 50th Annual Meeting of the Association for Com-
putational Linguistics: Long Papers-Volume 1, pages 469–477. Association for Computational
Linguistics, 2012.
[169] A. Saksena. Case marking semantics. Lingua, 56(3):335–343, 1982.
[170] R. L. Schmidt. Urdu, an Essential Grammar. Psychology Press, 1999.
[171] R. L. Schmidt. Urdu: An Essential Grammar. Routledge, 2013.
[172] W. Seeker and J. Kuhn. Morphological and syntactic case in statistical dependency parsing.
Computational Linguistics, 39(1):23–55, 2013.
[173] K. Sima'an, A. Itai, Y. Winter, A. Altman, and N. Nativ. Building a tree-bank of Modern Hebrew
text. Traitement Automatique des Langues, 42(2), 2001.
[174] K. Simov and P. Osenova. Practical annotation scheme for an HPSG treebank of Bulgarian. In
Proc. of the 4th Intern. Workshop on Linguistically Interpreted Corpora (LINC), 2003.
[175] R. M. K. Sinha and K. Mahesh. Developing English-Urdu machine translation via Hindi. In
Third Workshop on Computational Approaches to Arabic-Script-based Languages, 2009.
[176] A. Søgaard. Data point selection for cross-language adaptation of dependency parsers. In Pro-
ceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies: short papers-Volume 2, pages 682–686. Association for Computational
Linguistics, 2011.
[177] T. Solorio, E. Blair, S. Maharjan, S. Bethard, M. Diab, M. Gohneim, A. Hawwari, F. AlGhamdi,
J. Hirschberg, A. Chang, and P. Fung. Overview for the first shared task on language identification
in code-switched data. In Proceedings of The First Workshop on Computational Approaches to

Code Switching, held in conjunction with EMNLP 2014., Doha, Qatar, 2014. ACL.
[178] R. Srivastava and R. A. Bhat. Transliteration systems across Indian languages using parallel
corpora. In Proceedings of the 27th Pacific Asia Conference on Language, Information, and Com-
putation (PACLIC 27), pages 390–398. Department of English, National Chengchi University,
2013.
[179] S. Sulger, M. Butt, T. H. King, P. Meurer, T. Laczkó, G. Rákosi, C. B. Dione, H. Dyvik, V. Rosén,
K. De Smedt, A. Patejuk, O. Cetinoglu, I. W. Arka, and M. Mistica. ParGramBank: The ParGram
parallel treebank. In Proceedings of the 51st Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, August 2013. Association for
Computational Linguistics.
[180] P. Svenonius. Adpositions, particles and the arguments they introduce. Argument Structure,
108:63, 2007.
[181] O. Täckström, R. McDonald, and J. Uszkoreit. Cross-lingual word clusters for direct transfer of
linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pages 477–487.
Association for Computational Linguistics, 2012.
[182] L. Tesnière. Éléments de syntaxe structurale. Librairie C. Klincksieck, 1959.
[183] J. Thuilier and L. Danlos. Semantic annotation of French corpora: animacy and verb semantic
classes. In LREC 2012-The Eighth International Conference on Language Resources and Evalu-
ation. European Language Resources Association (ELRA), 2012.
[184] R. Tsarfaty and K. Sima'an. Modeling morphosyntactic agreement in constituency-based parsing
of Modern Hebrew. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing
of Morphologically-Rich Languages, pages 40–48. Association for Computational Linguistics,
2010.
[185] R. Tsarfaty, D. Seddah, Y. Goldberg, S. Kübler, M. Candito, J. Foster, Y. Versley, I. Rehbein,
and L. Tounsi. Statistical parsing of morphologically rich languages (spmrl): what, how and
whither. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of
Morphologically-Rich Languages, pages 1–12. Association for Computational Linguistics, 2010.
[186] R. Tsarfaty, D. Seddah, S. Kübler, and J. Nivre. Parsing morphologically rich languages: Intro-
duction to the special issue. Computational Linguistics, 39(1):15–22, 2013.
[187] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for
semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
[188] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for
semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
[189] A. Vaidya, J. D. Choi, M. Palmer, and B. Narasimhan. Analysis of the Hindi proposition bank us-
ing dependency structure. In Proceedings of the 5th Linguistic Annotation Workshop. Association

for Computational Linguistics, 2011.
[190] A. Vaidya, M. Palmer, and B. Narasimhan. Semantic roles for nominal predicates: Building a
lexical resource. In NAACL HLT 2013, page 126, 2013.
[191] J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning
from imbalanced data. In Proceedings of the 24th international conference on Machine learning.
ACM, 2007.
[192] C. Vempaty, V. Naidu, S. Husain, R. Kiran, L. Bai, D. Sharma, and R. Sangal. Issues in analyzing
Telugu sentences towards building a Telugu treebank. Computational Linguistics and Intelligent
Text Processing, 2010.
[193] K. Visweswariah, V. Chenthamarakshan, and N. Kambhatla. Urdu and Hindi: Translation and
sharing of linguistic resources. In Proceedings of the 23rd International Conference on Com-
putational Linguistics: Posters, pages 1283–1291. Association for Computational Linguistics,
2010.
[194] F. Xia, O. Rambow, R. Bhatt, M. Palmer, and D. M. Sharma. Towards a multi-representational
treebank. In The 7th International Workshop on Treebanks and Linguistic Theories. Groningen,
Netherlands, pages 159–170, 2009.
[195] M. Xiao and Y. Guo. Distributed word representation learning for cross-lingual dependency pars-
ing. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning,
page 119, 2014.
[196] D. Xiong, S. Li, Q. Liu, S. Lin, and Y. Qian. Parsing the Penn Chinese Treebank with semantic
knowledge. In Natural Language Processing–IJCNLP 2005. Springer, 2005.
[197] N. Xue, F.-D. Chiou, and M. Palmer. Building a large-scale annotated Chinese corpus. In Pro-
ceedings of the 19th international conference on Computational linguistics-Volume 1. Associa-
tion for Computational Linguistics, 2002.
[198] H. Yamada and Y. Matsumoto. Statistical dependency analysis with support vector machines. In
Proceedings of IWPT, volume 3, pages 195–206, 2003.
[199] A. Zaenen, J. Carletta, G. Garretson, J. Bresnan, A. Koontz-Garboden, T. Nikitina, M. C.
O’Connor, and T. Wasow. Animacy encoding in English: why and how. In Proceedings of the
2004 ACL Workshop on Discourse Annotation, pages 118–125. Association for Computational
Linguistics, 2004.
[200] D. Zelenko and C. Aone. Discriminative methods for transliteration. In Proceedings of the 2006
Conference on Empirical Methods in Natural Language Processing, pages 612–617. Association
for Computational Linguistics, 2006.
[201] Y. Zhang and S. Clark. A tale of two parsers: investigating and combining graph-based and
transition-based dependency parsing using beam-search. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing, pages 562–571. Association for Computa-
tional Linguistics,
2008.

[202] Y. Zhang and S. Clark. A tale of two parsers: investigating and combining graph-based and
transition-based dependency parsing using beam-search. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing, pages 562–571. Association for Computa-
tional Linguistics, 2008.
[203] Y. Zhang and J. Nivre. Transition-based dependency parsing with rich non-local features. In
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies, pages 188–193. Association for Computational Linguistics, 2011.

