Computational Lexicography
for
Natural Language Processing

Edited by
Bran Boguraev and Ted Briscoe

LONGMAN
London and New York
… grammar 39
B. Boguraev and T. Briscoe
2.2 The Longman tape and its computational counterpart 46
2.3 On-line access: simple mode 49
… Conclusion
… Semantic types of LDOCE verbs
… 'Dative' alternations
3.3 A detailed analysis of the LDOCE classes 77
3.3.1 Intransitive verbs 78
3.3.2 Linking verbs with adverbial complementation 78
3.3.3 Adjective complementation 79
3.4 …
5.3 Inaccuracies in the … data 124
… 125
5.4.1 System design 126
… Morphological generation 131
… 132
5.5 Future developments 133
5.6 Conclusion 133
5.7 Notes
… Lexical database access 143
6.4 Measuring … 144
6.4.1 Word counts and word frequencies 145
6.4.2 Linear and logarithmic measures 146
6.4.3 An information-theoretic approach 146
… Conclusion
8 … W. Meijs and M. den Broeder 171
8.1 Introduction 171
8.4.1 Criteria for distinguishing types of NMDs 176
… 181
8.5 Syntactic characteristics of NMDs 183
… 184
8.5.3 Some syntactic statistics of NMDs 186
8.6 Conclusion 189
8.7 Notes 190
9 A tractable machine dictionary as a resource for computational semantics 217
9.2.3 Approach III: A lexicon-producer 217
9.3 The utilisation of semantic information from LDOCE 220
9.3.1 A lexicon-consumer 221
9.3.2 Collative semantics 222
9.4 Conclusion 228
10 Conclusion
B. Boguraev and T. Briscoe 229
Appendices
A Lexical database: user guide 233
A.1 Overview 233
A.2 Access by spellings 236
A.3 Access by non-spellings 246
B.4 Object Equi verbs 248
B.5 Equi verbs …
E The Longman semantic codes 269
E.1 Subject field codes 269
Index 293
Author Index 307
Contributors
xi
Foreword
In the last few years the attention of researchers and developers, both in industry and academia, working in the fields of linguistics, computational linguistics, artificial intelligence, psycholinguistics, cognitive science and information technology, has been drawn to the importance of lexical resources by a number of converging factors.

On the theoretical side, the major contemporary linguistic theories are assigning to the lexicon an increasingly central role. On the application side, the availability of lexical resources in machine-readable form is becoming a major concern for the language industry. This field, which is emerging as an autonomous sector of the information industries, includes both computer assistance to traditional applied linguistics professions (including, for example, lexicography, translation and language teaching) and development of computational systems based on natural language processing (as required, for example, in office automation, speech analysis and synthesis, natural language interfaces, automatic indexing and abstracting, information retrieval, machine translation, and, more generally, support for communication).

Computational linguists and language industry developers recognise that, for real world applications, it is of fundamental importance that natural language processing systems are able to deal with tens and even hundreds of thousands of lexical items. Consequently, the development of large lexical knowledge bases has emerged as probably the most urgent, expensive, and time-consuming task facing linguistics, computational linguistics, and artificial intelligence.

Computational lexicography and lexicology is beginning to emerge as a discipline in its own right: witness the number of specialised workshops (Automating the Lexicon in a Multilingual Environment, The Lexical Entry, The Lexicon in Theoretical and Computational
Perspectives, Lexical Semantics), conferences (Advances in Lexicology, Standardisation in Lexicography, Electronic Dictionaries, The Lexicon in a Multilingual Environment, Lexicology and Lexicography, Words and World Representations), panel discussions (Machine-Readable Dictionaries), specialist/ad hoc working groups (Computational Dictionaries and the Computer), and publications (such as the Special Issue on the Lexicon of the journal Computational Linguistics).

At the same time, it is becoming apparent that, even though human users and computer programs emphasise different aspects of lexical information and require different data structures, explicit and accessible lexical knowledge bases can greatly facilitate human use of … refers to two major complementary issues. The first concerns the current growing efforts to establish new large lexical knowledge bases, in such a manner that they will serve, through generalised natural language processing modules, … applications … a treasure store of data and information …

Computational Lexicography for Natural Language Processing constitutes the first systematic …

Antonio Zampolli
May 1988

Acknowledgements

The editors and all of the contributors to this volume wish to … the Commission of the European … not only supported much of the research described here, through the provision of grants and fellowships to ourselves and Hiyan Alshawi, but also supported Bran Boguraev during the editing of this volume. Chapter 2 draws on material which has previously appeared in two conference papers …
Chapter 1
Introduction
cerning either the nature of the information which the lexicon should
contain or how it should be represented (for example, Ingria, 1988).
The task of constructing a realistic lexicon for a natural language,
such as English, is formidable, not only because of the absence of a
well-articulated theory of what it should contain, but also because of
the enormous number of words to be dealt with. The Oxford English
Dictionary (OED) contains entries representing approximately 250,000 independent words. However, even the OED still does not list many words from specialised fields. Walker and Amsler (1986) highlight this … by automated natural language processing systems. This development is comparatively more recent (see Walker and Zampolli, 1988) and has been made feasible by the advent of computer typesetting techniques, which have ensured the availability of machine-readable versions of most published dictionaries.

There are several advantages and at least one major disadvantage to the use of machine-readable dictionaries (MRDs) in research on natural language processing. Firstly, since there is a considerable tradition behind the production of dictionaries for human consumption, we might hope that they will provide a suitable starting point for defining the contents of a lexicon for machine use. Secondly, since many published dictionaries are large, much of the construction work has already been done for the computational linguist. On the other hand, published dictionaries are produced with the human reader in mind and therefore make many inconvenient assumptions from the point of view of processing by machine; for example, the assumption that the user can understand definitions of word senses written in English. Most of the research reported in this book has two facets: firstly, the development of automated (or semi-automated) techniques to make various types of information in MRDs accessible for machine use, and secondly, the subsequent use of this information in evaluating and improving both natural language processing systems of various types and the linguistic theories which lie behind them. All of the work reported in this book is based on the machine-readable version of the Longman Dictionary of Contemporary English (LDOCE). For reasons which will be made clear below, this dictionary is uniquely suitable for computational lexicography.

We have called this line of research with MRDs computational lexicography in recognition of the fact that, although strictly speaking we are not in the business of dictionary construction, it is certain that the lexicons which are derived from MRDs for use by machine will be very different from conventional published dictionaries, both in terms of how they organise and how they represent information. In addition, although the techniques described in this book are primarily of relevance to developing lexicons for machine use, increasing use is being made of computational techniques in the development of dictionary databases for human use. For example, the … for the research reported in subsequent chapters. In addition we briefly survey representative work in computational lexicography which is not discussed later in other chapters (either because it involves MRDs other than LDOCE or because the aims of the work are not central to issues of natural language processing). In this way, we hope that this chapter will serve as a tutorial overview of work in computational lexicography in general, as well as providing the foundation for a proper understanding of the specific work reported in this book.

1.1 Natural language processing

As we remarked above, the goal of research on natural language processing (NLP) is the automation of the processes of language comprehension, production and acquisition in both written and spoken media. Research on NLP is undertaken by (computational) linguists, psychologists, computer scientists and system engineers from slightly different perspectives. However, all share the goal of developing a fully explicit (and therefore programmable) theory of these processes. Although no comprehensive theory has emerged yet or appears likely to do so in the foreseeable future, in practice many workable NLP systems can be constructed on the basis of partial understanding of some of these processes. For example, intelligible text-to-speech synthesisers have been built (for example, Allen et al., 1987). Existing speech synthesis systems do not attempt to understand the input, but rather rely on a more superficial linguistic analysis of the words in the text and their syntactic organisation into sentences. In addition to speech synthesis, significant progress has been made in constructing workable speech recognition systems (see Fallside and Woods, 1985), which convert a stream of spoken words into their written counterparts, in question answering systems (for example, Bolc and Jarke, 1986), which respond to a query concerning some limited domain, for example train timetables, and in translation between languages (for example, Nirenburg, 1987), again within some limited domain. Rather less progress has been made in achieving general language understanding systems and in language generation or language learning systems, but much research
2. Morphological knowledge about the internal structure of words; for example, that the phoneme /s/ or /z/ attached to an English noun makes it plural or that re- attached to a verb means 'do … again'

… paragraph. However, the rule breaks down in the case of an adjective such as man-eating:

That man-eating fish
* That fish is man-eating
We can represent these syntactic rules as phrase structure rules of the general form shown below:

Mother → Daughter1 Daughter2 … Daughtern

in which one mother syntactic category may contain one or more daughter categories, and categories consist of a name followed by optional further syntactic features. A grammar written in this notation which will generate all and only the grammatical examples above is shown below:

1. S → NP VP
2. VP → V AP[prd +]
3. NP → Det N
4. NP → Det AP[prd -] N
5. AP[prd x] → A[prd x]

(1) states that a sentence (S) may consist of a noun phrase (NP) and a verb phrase (VP). (2) states that a VP may consist of a verb and an adjective phrase (AP) marked as [prd +]. The feature prd takes two values (+ or -) and is used to represent whether an adjectival category is predicative or non-predicative (attributive). Rules (3) and (4) state that a noun phrase can consist of a determiner (Det) and noun (N) with an optional intervening AP; in (4) this AP is marked as [prd -]. Rule (5) contains the feature prd with a variable value x, where x is a variable ranging over the possible values of prd; its value must be identical on both mother and daughter. Thus APs can be either [prd +] or [prd -].

The process of applying a grammar to an input sentence to determine its grammaticality and appropriate syntactic structure is known as (syntactic) parsing. However, in order to mechanically and automatically apply this grammar to one of the sentences above it is necessary to provide a lexicon which tells the parser which lexical syntactic categories are associated with the words in the input. Given such a lexicon, the parse proceeds in stages:

1. Look each word up in the lexicon and replace it with its lexical category, working from left to right.

2. Match the grammar rules against the resulting sequence of categories and connect up the mother and daughter categories as specified by each rule.

3. Try to build a complete tree with an S node as root, filling in the gaps with further rules which match the unconnected categories in the partial tree.

For example, given the input:

That man-eating fish is beautiful

the first stage could yield:

Det   A[prd -]    N     V   A[prd +]
That  man-eating  fish  is  beautiful

The second stage would yield the partial tree (shown here as a labelled bracketing):

[NP Det [AP[prd -] A[prd -]] N]   [VP V [AP[prd +] A[prd +]]]
 That man-eating         fish      is  beautiful

In the course of this matching, rule (5), which specifies a value for the feature prd only as the variable x, is instantiated with the appropriate feature value. Thus if the alternative lexical category for beautiful had been chosen, rule (2) would not have matched the second AP and we would have needed to redo the first stage of the parsing process.
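The staged lookup-and-match procedure just described can be sketched in code. This is a minimal toy illustration of our own (all names are ours, not the book's): the variable x of rule (5) is handled by instantiating the rule once for each value of prd, and the toy lexicon gives beautiful only its predicative entry.

```python
# Toy lexicon: each word maps to (category, prd-value) pairs;
# prd is None where the feature is irrelevant.
LEXICON = {
    "that":       [("Det", None)],
    "man-eating": [("A", "-")],   # attributive only: [prd -]
    "fish":       [("N", None)],
    "is":         [("V", None)],
    "beautiful":  [("A", "+")],   # predicative: [prd +]
}

# Rules (1)-(5); rule (5) is instantiated once per value of prd.
RULES = [
    (("S", None),  [("NP", None), ("VP", None)]),              # (1)
    (("VP", None), [("V", None), ("AP", "+")]),                # (2)
    (("NP", None), [("Det", None), ("N", None)]),              # (3)
    (("NP", None), [("Det", None), ("AP", "-"), ("N", None)]), # (4)
    (("AP", "+"),  [("A", "+")]),                              # (5), x = +
    (("AP", "-"),  [("A", "-")]),                              # (5), x = -
]

def derive(cat, words, i):
    """Yield every position j such that cat can span words[i:j],
    backtracking over rule choices and lexical ambiguity."""
    if i < len(words):                       # stage 1: lexical lookup
        for lexcat in LEXICON.get(words[i], []):
            if lexcat == cat:
                yield i + 1
    for mother, daughters in RULES:          # stages 2-3: connect categories
        if mother == cat:
            yield from derive_seq(daughters, words, i)

def derive_seq(cats, words, i):
    """Derive a sequence of daughter categories left to right."""
    if not cats:
        yield i
        return
    for j in derive(cats[0], words, i):
        yield from derive_seq(cats[1:], words, j)

def grammatical(sentence):
    """A sentence is grammatical if an S node spans all of its words."""
    words = sentence.lower().split()
    return any(j == len(words) for j in derive(("S", None), words, 0))

print(grammatical("that man-eating fish is beautiful"))  # True
print(grammatical("that fish is man-eating"))            # False
```

The second sentence fails exactly as the text explains: rule (2) demands AP[prd +], but man-eating is listed only as A[prd -].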
The third stage yields the complete tree:

[S [NP Det [AP[prd -] A[prd -]] N]   [VP V [AP[prd +] A[prd +]]]]
    That man-eating         fish      is  beautiful

Gazdar (1987) gives a more detailed informal description of the parsing and matching processes used in grammars of this type, and Winograd (1983) describes and explains a variety of parsing algorithms which could be used as the basis for an automated parser for such grammars. We have presented this example in some detail because it illustrates clearly the interplay between general rules (grammar) and lexicon. The lexicon given is adequate for this particular grammar, but would need to change almost every time new rules were introduced into the grammar or further words added into the lexicon. For example, to add a non-copular verb, such as loves, we would need to distinguish further between different types of verb, both in the grammar and lexicon, to prevent the generation of examples such as:

* That fish loves beautiful

In general, there will be an intimate connection between the general rules incorporated into a grammar (or NLP system) and the nature of entries in the lexicon. The lexicon provides the information not predictable from the rules, which feeds the rules and ensures they function correctly. This example also demonstrates the need for both broad part of speech and subcategory information (represented by syntactic features) in the lexicon. This issue is the subject matter of chapters 3, 4 and 5.

… the fact that the first syllable of whisper carries main stress might be felt to follow from a rule assigning main stress to all initial syllables of English polysyllabic words. In this case, only exceptions to this rule, such as division, need main stress to be explicitly marked in their lexical entries. This example again emphasises the point that what counts as idiosyncratic, unpredictable information depends almost entirely on …

Word meaning, like pronunciation, does not appear to be predictable on general grounds either. We observed that pronunciation and meaning appear not to be linked in a principled fashion, and there are, at most, only very weak relationships between meaning and part of speech. For example, although many nouns refer to physical objects or things, many do not. Therefore, the meaning(s) of words will be defined in their lexical entries. However, this conclusion is complicated by considerations of morphology rather similar to those which arose with part of speech information. For example, the meaning of think does not appear to be predictable, but the meaning of rethink is predictable on the basis of a morphological rule which combines the meaning of the prefix re- with the meaning of think. These considerations suggest that morphological knowledge interacts even more intimately with the lexicon.

So far we have been using the term word uncritically and without definition. We have also implicitly assumed the conventional view of a dictionary as a list of words in our discussion of the role of the lexicon in NLP systems. However, the common sense concept of a word as an indivisible unit is inaccurate. Many words are morphologically complex forms constructed from a number of more basic morphemes, for example, re+think. Insofar as the phonological, syntactic and semantic properties of these derived words are predictable by morphological rule, they do not need to be listed separately in the lexicon. In this sense, morphological knowledge is more intimately connected with the lexicon because it provides a set of rules for organising the lexicon in a maximally economic form. However, many derived words are not fully productive in this fashion; for example, reproduce in one sense means something more specialised, although related to, produce again, whilst the semantic … entries for bound morphemes, such as re-, which cannot stand alone and are not usually felt to be words. In addition, it may not contain entries for some derived words whose behaviour is predictable on the basis of morphological rules. An entry in the lexicon will represent those facts about a word (or morpheme) which are unpredictable or idiosyncratic with respect to the other rules in the system. This will
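The economy argument above can be made concrete. In this toy illustration (our own, not the book's), the lexicon lists only unpredictable entries; a re- rule derives regular forms such as rethink on demand, while lexicalised exceptions such as reproduce carry their own listed entry that overrides the rule.

```python
# Only idiosyncratic facts are listed; predictable forms are derived.
LEXICON = {
    "think":     {"cat": "V", "meaning": "consider"},
    "produce":   {"cat": "V", "meaning": "bring into existence"},
    # NOT simply 'produce again', so it must be listed explicitly:
    "reproduce": {"cat": "V", "meaning": "breed"},
}

def look_up(word):
    """Listed (idiosyncratic) entries take priority over derivation."""
    if word in LEXICON:
        return LEXICON[word]
    # Morphological rule: re- + V means 'V again'.
    base = LEXICON.get(word[2:]) if word.startswith("re") else None
    if base and base["cat"] == "V":
        return {"cat": "V", "meaning": base["meaning"] + " again"}
    return None   # neither listed nor derivable

print(look_up("rethink"))    # derived: 'consider again'
print(look_up("reproduce"))  # listed exception: 'breed'
```

The priority given to listed entries is exactly the sense in which a lexical entry records only what is "unpredictable or idiosyncratic with respect to the other rules in the system".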
… they raise the possibility of wider commercial application of NLP technology. A text-to-speech synthesis or question-answering system with an exhaustive knowledge of English syntax and morphology will still be woefully inadequate if its vocabulary is small.

1.2 Computational lexicography

A great number of NLP systems are strikingly limited in their range of application. This lack of versatility is to a large extent due to the small number of lexical entries typically available to them. A recent workshop on linguistic theory and computer applications (Whitelock …) … already digested, categorised, indexed and, most importantly, available in machine-readable form, can be suitably used, if not to get a sizable …

… further exacerbated by the use of different representations for similar information. Below, we reproduce lexical entries for acknowledge from the BBN CFG system (Ingria, 1988), the IRUS system (Bates et al., 1986), and the Alvey morphological and syntactic analyser (see Carroll and Grover, this volume).

[ACKNOWLEDGE
 Category: V
 Base: acknowledge
 Features: (TRANSITIVE … (PASSIVIZABLE …))]
;; 1, 2, 3, 4, 5
(acknowledge … ((v +) (n -) (subcat …)) acknowledge_1)
(acknowledge … ((v +) (n -) (subcat …)) acknowledge_2)

These entries contain rather similar information concerning the syntactic classification of words into parts of speech and further, more fine-grained categories. However, this information is encoded in such different ways that the lexicons associated with each system would not be intersubstitutable between systems. Every NLP system has its own ideas and conventions concerning the content, organisation, and structure of its lexicon. Such a state of affairs is partly justified by differences, theoretic or organisational, in the individual systems' approaches to the task chosen (see, for example, …).

… lexicons for specific systems, as well as providing a useful resource for investigating various lexicographic properties of the language. … having considerably more accumulated experience of lexicography than researchers in NLP. It therefore seems expedient to capitalise on this experience as much as possible rather than reinvent the wheel. Once the decision to make use of existing dictionaries has been made, the problems of computational lexicography become those of modification and conversion of existing MRDs to a database capable of exploitation by machine. These problems fall into two broad categories.

Firstly, published dictionaries are organised for human use and rely heavily on the user's background linguistic and common sense knowledge to retrieve and comprehend the information they contain. Secondly, this information is usually presented in an informal rather than systematic fashion and often rests on inappropriate linguistic models, from the perspective of NLP. In the next two sections we explore these two problems in greater detail.
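The non-intersubstitutability point above can be illustrated in code. This sketch is ours, and the output formats are only loosely modelled on the fragmentary entries quoted above (the field names and the `acknowledge_1` suffix are assumptions for illustration): the same neutral record must be re-serialised differently for each system.

```python
# One system-neutral record for a verb entry.
entry = {"base": "acknowledge", "cat": "V", "transitive": True}

def record_style(e):
    """Emit a bracketed attribute/value record (BBN-like layout)."""
    feats = "(TRANSITIVE)" if e["transitive"] else "()"
    return (f"[{e['base'].upper()}\n"
            f" Category: {e['cat']}\n"
            f" Base: {e['base']}\n"
            f" Features: {feats}]")

def sexpr_style(e):
    """Emit an s-expression of feature pairs (Alvey-like layout),
    with (v +) (n -) encoding the major category of a verb."""
    v = "+" if e["cat"] == "V" else "-"
    n = "+" if e["cat"] == "N" else "-"
    subcat = "(subcat NP)" if e["transitive"] else "(subcat)"
    return f"({e['base']} ((v {v}) (n {n}) {subcat}) {e['base']}_1)"

print(record_style(entry))
print(sexpr_style(entry))
```

Even for this three-field record the two serialisations share no syntax, which is why lexicons built for one system cannot simply be loaded into another.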
… other part of speech. It is possible to divide the entry into a number of distinct components. … field, and 'is intended to represent a single morpheme' (Ritchie et al., 1986:4, emphasis added). The … form … contains the headword … between them in a separate field: thus the subject code for the first sense of the noun … In addition, the machine-readable version of LDOCE contains information not printed in the published dictionary. Word senses can be tagged with subject and box codes, notational devices which, in a very compact system of representation, encode semantic notions: the overall context in which a word sense is most likely to appear (for example, politics, religion, language) and selectional restrictions on verbs, nouns and compound phrases. For the entries … above, the box code for the first sense is -_L_X--__S, where X in the fifth position of a ten-dimensional character vector denotes …

… (see chapter 6) and the grammatical coding system (see chapters 3, 4 and 5) from LDOCE demonstrate that even here things are not so straightforward. For example, few dictionaries, including LDOCE, represent syllables explicitly in pronunciation fields (except when a stress marker occurs). Therefore, it is necessary to parse the pronunciation field to assign syllable boundaries (see chapter 6). Similarly, the grammar coding system employed in LDOCE derives from a specific linguistic model (see chapter 3) which is not appropriate for automated parsing for some classes of NLP systems (see chapter 4).
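The syllable boundaries that the pronunciation field leaves implicit can be recovered algorithmically. The following is a sketch of one common general idea, onset maximisation, not the actual procedure of chapter 6, and it uses a deliberately tiny toy phoneme inventory: each boundary is placed so that the following syllable receives the longest consonant cluster that can legally begin a word.

```python
VOWELS = set("aeiou@")                       # toy phoneme inventory
LEGAL_ONSETS = {"", "p", "t", "k", "s", "st", "str", "pr", "sp", "w"}

def syllabify(phonemes):
    """Split a flat phoneme string into syllables by onset maximisation."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    syllables, start = [], 0
    for v1, v2 in zip(vowel_positions, vowel_positions[1:]):
        cluster = phonemes[v1 + 1:v2]        # consonants between two vowels
        # give the next syllable the longest legal onset of the cluster
        for k in range(len(cluster) + 1):
            if cluster[k:] in LEGAL_ONSETS:
                break
        boundary = v1 + 1 + k
        syllables.append(phonemes[start:boundary])
        start = boundary
    syllables.append(phonemes[start:])
    return syllables

print(syllabify("wisp@"))   # a toy rendering of 'whisper'
```

A real implementation would, of course, work over the dictionary's actual phoneme symbols and use the stress markers that LDOCE does print as additional boundary evidence.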
… Therefore, in order to analyse the codes it is necessary to tease apart syntactic and semantic facts and to recover information which is implicit, rather than explicit, in the coding scheme. Ultimately, this process can never be 100% successful; so, in order to derive a reliable (grammatically-subcategorised) lexicon, some form of semi-automatic, human-aided compilation of the target lexicon, rather than batch processing, is required (see chapter 5).

These problems are compounded by the diversity of potential theories of phonology and syntax which might be deployed with the lexicon, each of which may represent similar information in diverse ways or, worse, draw a rather different division between information in the lexicon and the rule component within the grammar … approaches based on templates or key words seem promising (for example, Calder and Lindert, 1987); however, these techniques cannot hope to solve the problem …

Turning to sense definitions in textual form, there is no explicit restriction on the language employed. Thus it might appear that in order to make use of this information an NLP system would require very sophisticated language understanding capabilities, creating a catch-22 situation. In fact, it is not quite this bad, because the language used in definitions tends to be of a rather restricted form, typically consisting of a noun phrase containing a head or genus term and modifiers or differentiae (cf. the LDOCE definition for the first sense of bubble: 'a hollow ball of liquid containing air or gas'). In addition, as chapters 7 and 8 indicate clearly, the liberal use of derivational morphology to extend the definitional vocabulary, and the use of this vocabulary in more than one sense, makes the situation less ideal. However, it is fair …

… by machine. In this, LDOCE is unique amongst MRDs in attempting a formal representation of some aspects of meaning, and this no doubt accounts for some of its popularity with NLP researchers. However, there are problems concerning the accuracy and completeness of this type of information in LDOCE … none of the work reported in this book makes significant use of these codes (see Byrd et al., 1987, for further discussion and description of work with LDOCE box codes).

A similar problem arises concerning the diversity of theories which might want to exploit pronunciation information from an MRD; the same issue arises, perhaps to an even greater extent, with semantic knowledge and its concomitant assumptions about the nature of lexical entries. Small's (1980) theory of word expert parsing requires detailed knowledge of the architecture of the parser, acquaintance with a specialised language for writing word experts, judgement of what constitutes linguistically relevant information and how to represent it procedurally, and readiness to bring arbitrary amounts of more general, common world knowledge into the lexicon; no division is made between syntactic, semantic or pragmatic knowledge. The point here is not so much whether the theory in question has value, but that this is an instance of a lexicon for an NLP system whose organisation the theory imposes …

Sowa and Way (1986) present a different theory of text interpretation in context, relying on conceptual graphs. The implications for the content and structure of a lexicon to be used by the parser are non-trivial; in particular, issues of reconciling formal notions of representation of meaning, like lambda-abstraction, with conceptual graphs and operations on concepts …
… the formalisation … can be achieved. This poses the general question of how to establish the common ground assumed by the individual lexicographers during the process of writing definitions … there is no implication that … a definition can …

Machine-readable sources … still primarily for … dictionaries can afford to be quite … to rely to an enormous extent on their readers' judgement and intelligence …

The semi-formal codes which form a component of an entry can present considerable problems for automated processing, because of the semi-formal nature of the representation schemes employed and the willingness of lexicographers to make minor modifications to these schemes … Examination of LDOCE suggests that the code field follows a well-defined system in which commas, semi-colons and colons are used to delimit and abbreviate sequences of codes constructed from letter and number pairs. As an example, consider a grammar code field whose expanded, unabbreviated form is shown below:

sense-no 1   head: T1
             head: T5
             optional (to be)   head: X1 right
             optional (to be)   head: X7 right

… immediately after them, depending presumably on the individual lexicographer's interpretation of the instructions for laying out entries. These deviations mostly went undetected because of the lack of any checking of the structure of lexical entries (see Michiels, …) … typesetting commands, such as the instruction to insert a thin space … which modifies the interpretation of the code, but some … Finally, there are the problems caused by errors of omission or commission, where codes are either misapplied or left out.

These problems and other aspects of the LDOCE grammar coding system are discussed in greater detail in chapters 3, 4 and 5. Similar problems arise, to a greater or lesser extent, with the other formal information in the lexical entry (see chapter 6 for a discussion of pronunciation fields).

… the language of definitions is a restricted subset of English drawn from a restricted defining vocabulary, typically noun phrases (see chapter 8). There are a variety of approaches to the extraction of sense information, ranging from probabilistic assignment of sense numbers to tokens of the definitional vocabulary which occur in particular definitions (see chapter 9) through to genus and differentiae spotting systems (see chapters 7 and 8), none of which require automated comprehension of the definition. However, all of these techniques fail when faced with circularity of definition or heavy reliance on cross references. In these situations, the real definition is not to be found in the entry associated with the relevant head word. Human dictionary users cope with this situation fairly well, despite the fact that the organisation of a published dictionary makes following a chain of cross references tedious
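The delimiting conventions described above suggest a simple expansion step. The sketch below is ours, built on assumptions the fragment supports (semicolons separate code groups, a group starts with a capital-letter class, commas abbreviate repeated classes so that "T1,5" means T1 and T5, and a trailing parenthesis such as "(to be)" qualifies the whole group); the real LDOCE conventions have further wrinkles that chapters 3 to 5 document.

```python
import re

def expand_codes(field):
    """Expand an abbreviated LDOCE-style grammar code field into
    (code, qualifier) pairs, e.g. 'T1,5' -> T1 and T5."""
    expanded = []
    for group in field.split(";"):
        m = re.match(r"([A-Z])([\d,]+)\s*(\(.*\))?", group.strip())
        if not m:
            continue                      # unparseable group: skip it
        letter, numbers, qualifier = m.groups()
        for n in numbers.split(","):
            expanded.append((letter + n, qualifier))
    return expanded

print(expand_codes("T1,5; X1,7 (to be)"))
```

On the hypothetical field "T1,5; X1,7 (to be)" this reproduces the shape of the expanded listing above: T1, T5, then X1 and X7, each qualified by "(to be)".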
1.3 Overview of work with MRDs in NLP

… assumption to make, given the reality of human linguistic performance and the practical issues of keeping printed dictionaries within reasonable size … to be able to perform at least some sort of morphological processing. An interface to an MRD must regard this as a dynamic functional process and must provide, in addition to the static object … data encoded as a set of morphemes with associated features held in the online dictionary … for textual input, may only require access to lexical entries … access via hyphenation, and so forth. In the following … ous speech will require primary access … it is possible that it will need to constrain the access process … incoming speech and syntactic predictions about forthcoming words may help the word recognition process. In this situation, the optimal access query may take the form of a set of (partial) constraints on the target word, such as: bisyllabic, a word beginning and ending with a voiceless stop, which can function as a prenominal adjective (see …

Dictionaries are more structured than free text, but not so structured that they will fit neatly into a conventional database system with fixed format records; for example … raises a number of more specific problems … In the following … the development of effective and efficient algorithms for … word lists. Such resources have been compiled, for example, by … Yannakoudakis (1983) … 57,000 words … from The Teacher's Word Book of 30,000 Words (Thorndike …). Mitton (1986) has initiated a project on using the OALD to correct certain classes of misspelt words. Word lists have also been generated from machine-readable sources
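The "set of (partial) constraints" query described above can be modelled as a set of predicates, each checking one field of a lexical entry; an entry matches when every predicate holds. This is a sketch of the idea, not the book's interface: the three-entry database is invented, and spelling here stands in for the pronunciation field a real system would consult.

```python
VOICELESS_STOPS = set("ptk")

# Invented mini-database; a real one would come from the MRD itself.
DATABASE = [
    {"spelling": "petit",  "syllables": 2, "prenominal": True},
    {"spelling": "tepid",  "syllables": 2, "prenominal": True},
    {"spelling": "pretty", "syllables": 2, "prenominal": True},
]

# The example query from the text: bisyllabic, beginning and ending
# with a voiceless stop, usable as a prenominal adjective.
query = [
    lambda e: e["syllables"] == 2,
    lambda e: e["spelling"][0] in VOICELESS_STOPS,
    lambda e: e["spelling"][-1] in VOICELESS_STOPS,
    lambda e: e["prenominal"],
]

matches = [e["spelling"] for e in DATABASE
           if all(constraint(e) for constraint in query)]
print(matches)
```

Only "petit" survives all four constraints ("tepid" and "pretty" fail the final-segment test), illustrating how partial phonological and syntactic evidence from incoming speech can narrow the candidate set before full recognition.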
Chapter 1  Introduction
1.3 Overview of work with MRDs in NLP

(Nor need such access necessarily involve a good morphological analysis component, capable of handling derivational and inflectional morphology alike.) Briscoe (1985) lists various current projects in speech recognition and synthesis in the UK (sited at Edinburgh University, Leicester Polytechnic, the Joint Speech Research Unit at Cheltenham, the University of Cambridge and the IBM Scientific Centre at Winchester), all of which make use of sources like OALD or Collins for the task of compiling special-purpose word lists transcribed into project-specific phonemic alphabets, incorporating primary and secondary stress assignment and marking of syllable boundaries.

Work with MRDs spans systems of various types: support for natural language processing tasks, the design of lexicographers' workstations, or the lexical analysis of dictionary data on a large scale. Cowie (1983) presents a system for analysing descriptive texts into hierarchically structured fragments of taxonomically organised knowledge.

Following some work of a more empirical nature in the sixties, Amsler (1980), in the course of a major comprehensive investigation into the content and structure of an MRD, presents conclusive evidence that a dictionary contains a non-trivial amount of information which can be semi-automatically structured into a semantic hierarchy of defining concepts. More recent work aiming to extract semantic information from standard dictionary definitions and to build taxonomies of defining concepts shades off into research which attempts to uncover additional, implicit, structure in a dictionary, and to construct hierarchically structured networks of concepts for use in semantic processing. This requires more sophisticated processing of sense definitions than is attempted in the work cited above. It is this underlying goal which lies behind much of the work that analyses definitions (see chapters 7, 8 and 9); we discuss semantic processing in section 1.3.6 below.

1.3.3 Access / browsing

Much of the work described in this book requires a (set of) program(s) for the extraction of lexical data from a machine-readable source. Until recently, few projects reported in any detail on this aspect of work with MRDs. This situation, however, is rapidly changing. In the context of working with a single dictionary (where the term 'database' is often used in the literature not in its full computational sense), a major Waterloo-based effort for computerising the New OED has provided the context for a study into normalising a machine-readable source (Kazman, 1986) and for developing a special-purpose data model for a dictionary database (Tompa, 1986; Gonnet and Tompa, 1987).

The Lexical Systems group at IBM Yorktown Heights has been working on WordSmith, an automated on-line dictionary system designed to offer browsing functionality: users can retrieve words, from a number of dictionaries (for example, W7, LDOCE, The Collins Thesaurus and several Collins bilingual dictionaries), which are close to a given word along dimensions such as spelling, meaning and sound (Byrd et al., 1987). Some of the tasks WordSmith has been applied to, namely developing techniques for segmenting and matching word spellings and generating pronunciations for unknown words, are described in Byrd and Chodorow (1985).

The more demanding context of working with multiple dictionaries presents an additional set of problems. Particularly important here is the issue of normalising all sources to the same internal format, while maintaining the flexibility and power of a fully functional database design, so that, for example, browsing operations like those above are possible and perform equally efficiently on all available MRDs. These, and related, questions are being investigated by the Lexical Systems project at IBM (Neff et al., 1988).

Other work has investigated methods for constructing, and browsing through, a network of sense relations derived from dictionary definitions and, in particular, from their synonym and cross-reference pointers. This activity has been carried out in the context of George Miller's WORDNET project (Miller, 1985), whose goal is to develop a system, appropriately organised and indexed, equipped with navigational aids for examining complex conceptual spaces without having to conform to the conventional alphabetic arrangement of dictionary data. The system is currently aimed at human users; a suitable functional counterpart, however, offering equivalent freedom of association, could be of great utility to a computer program for, say, robust interpretation of free text.

Recent work by Huttenlocher, Shipman and Zue of the Massachusetts Institute of Technology has investigated an alternative model of lexical access to the one commonly utilised by current approaches to speech recognition (Shipman and Zue, 1982; Huttenlocher and Zue, 1983). Instead of applying template-matching techniques, which become inadequate for tasks requiring large vocabularies, they have analysed the particular knowledge about language and speech available in dictionaries, and have developed a special representation of the speech signal based on broad phonological constraints. Classification of words using these phonological categories achieves the partitioning of the 20,000-word MRD into relatively small equivalence classes, thus greatly reducing the space of possible word candidates on lookup and making the whole process of lexical access less sensitive to speaker variation and other variabilities in the speech signal (see chapter 6 for further details and an evaluation of this work).

The desire for powerful browsing capabilities leads not only to considering how best to adapt existing database technology to the needs of different lexical projects. CD-ROM and similar low-cost, high-capacity distribution media offer a new perspective on computerised access to existing reference works (Hodgkin, 1987, discusses the implementation of the New OED on CD-ROM, as well as presenting a more general introduction to the promises and limitations of the medium).

Finally, the computerisation of a machine-readable dictionary source is not only concerned with providing quick access to a mass of data via conventional database technology; explorations are under way into alternative organisations of the dictionary, the most promising of these being hypertext (Raymond and Tompa, 1987).

Recent insights in linguistics, and in particular developments in grammatical theory (for example, Kay's Functional Unification Grammar (Kay, 1984b) and PATR-II (Shieber, 1984, 1985)), have renewed interest in the syntactic information encoded in dictionaries.
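The broad-class partitioning investigated by Shipman and Zue can be sketched compactly. This is an illustrative reconstruction only: the broad-class inventory and the pronunciations below are invented for the example and do not reproduce the MIT category set.

```python
from collections import defaultdict

# Illustrative broad phonological classes; the actual category inventory
# used in the MIT work differs, and pronunciations here are simplified.
BROAD = {
    "p": "stop", "t": "stop", "k": "stop", "b": "stop", "d": "stop", "g": "stop",
    "s": "fric", "z": "fric", "f": "fric", "v": "fric",
    "m": "nasal", "n": "nasal",
    "aa": "vowel", "iy": "vowel", "ey": "vowel", "ow": "vowel", "ae": "vowel",
}

def signature(phones):
    """Map a pronunciation to its broad-class signature."""
    return tuple(BROAD[p] for p in phones)

def partition(pron_dict):
    """Group words into equivalence classes by broad-class signature."""
    classes = defaultdict(list)
    for word, phones in pron_dict.items():
        classes[signature(phones)].append(word)
    return classes

prons = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "aa", "g"],
    "see": ["s", "iy"],
    "fee": ["f", "iy"],
}
classes = partition(prons)
# 'cat' and 'dog' share (stop, vowel, stop); 'see' and 'fee' share (fric, vowel)
```

On a realistic lexicon the interesting empirical question, addressed in chapter 6, is how large these equivalence classes turn out to be, since class size bounds how much a coarse transcription narrows the candidate set.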
The LDOCE grammar codes (and the comparison between the coding systems of LDOCE and OALD) allow for detailed syntactic subcategorisation of individual word senses. It is only natural to attempt to use this information about the idiosyncratic distributional behaviour of words for syntactic analysis. Since the system employed by LDOCE for grammatical tagging is based on the descriptive grammatical framework of Quirk et al. (1972), the mapping of the original dictionary format into a lexical entry appropriate to one of the syntactic theories mentioned above is not trivial (see chapter 4).

1.3.6 Semantic processing

Most of the work mentioned in the previous sections makes use of the form and function of a dictionary entry. The work on disambiguation and dictionary browsing appears to be an exception, in that the associated techniques and methods need access to a definition (or a subject field code, a synonym, or a cross-reference). This is, however, somewhat misleading, as the semantic content of the definition is not really made use of.

There is still another aspect of the work on NLP systems in general to which recent research in making use of the information available from MRDs is particularly relevant. Language processing programs fall within the larger category of knowledge-based systems (see below). In order to carry out their (semi-)intelligent functions, such systems need significant amounts of (pragmatic) knowledge about the real world, or at least about a particular domain of discourse. A common problem addressed during the design process is that of acquiring such knowledge, and there is strong hope that ways can be found to localise and extract some of it from suitable machine-readable sources, namely dictionaries or encyclopaedias, even though these present much more loosely defined, placed and structured raw data. Ultimately, the goal is to relate natural language words to an underlying taxonomy of concepts, typically the one which binds together the defining concepts in a dictionary. This would involve a range of activities considerably more difficult than those which reduce to, say, simple lexical look-up, part-of-speech extraction, and its mapping into a data structure suitable for subsequent use by, for example, an on-line parser.

Current views on automatic natural language processing tend to agree that there seems to be a continuum between the minimal semantic knowledge implied by the use of a particular word (word sense) and the specialised (or expert) knowledge relevant to its use in a given domain context (see, for example, Wilks, 1977, or, more recently, Cater, 1987). For a practical NLP system there are very pragmatic reasons for distinguishing between lexical semantics and specialised world knowledge. It would be unreasonable to expect to find any of the latter in an MRD. It would, however, be of enormous utility if most of the former could be derived from a machine-readable source, no matter whether it is presented in terms of decomposition into semantic markers (Katz and Fodor, 1963), formulae constructed from semantic primitives (Wilks, 1977), frame-based structures (Hirst, 1987), logical predicate/function symbols with associated sortal information encoded in the form of meaning postulates (Grosz and Stickel, 1983), or by some other means. In this context, there are attempts to compile some lexical semantics from an MRD into lexicons for NLP systems. Chapter 9 describes a project to use LDOCE for automatically constructing semantic definitions (formulae) for individual word senses, for use in a preference semantics framework.

Still, we should not expect to be able to generate a complete semantic component for an NLP system semi-automatically, in the same way in which we are attempting to flesh out the syntactic one.

An analysis of the cases where heavy use has been made of the in...
...able for a backward chaining expert system. Even within the area of NLP there is no firm consensus on what kinds of structure are best suited for capturing the knowledge useful for language interpretation and understanding. Nonetheless, it is possible to observe a common theme in a large number of language processing systems which, in addition to the relatively narrowly defined language-specific data, make use of more general, real-world knowledge, as well as of more specialised, domain- and task-dependent knowledge. More often than not this kind of knowledge is represented using a scheme based on the general notions of frame-like concepts with slot-like role descriptions, organised along a generalisation/specialisation (inheritance) hierarchy. Most of the recent work on knowledge representation, including FRL, KRL, NETL, AIMDS, UNITS or KL-ONE (see Brachman and Webber, 1980, or Mark, 1981, for details), adopts this view.

An early precedent was set at the Cambridge Language Research Unit (Masterman et al., 1957), where the problem of word sense disambiguation was tackled by exploiting the structure of Roget's Thesaurus. More recently, Amsler (1983) makes certain assumptions about the (usually implicit) taxonomic structure of the defining concepts in a dictionary; his approach relies on this structure being made explicit prior to the enterprise (not an unreasonable standpoint). Lesk's method (1986a) is even more direct, in that all it requires is a machine-readable source, and not of any particular dictionary at that. He uses heuristics depending on the overlap between words in the dictionary-provided sense definitions for the words across a context window.

We already mentioned (see 1.3.2) that, since dictionaries embody an additional and implicit structure within which their defining concepts are organised, they promote efforts for an initial compilation of taxonomies from definitions, with the aim of extracting a variety of relationships among... Similar concerns have motivated the work of Markowitz and the implementation of a program for the analysis of complex nominals.
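Lesk's overlap heuristic is simple enough to sketch. The following is an illustrative reconstruction, not Lesk's original code, and the sense definitions are toy examples (loosely modelled on his well-known pine/cone illustration): choose the sense of a word whose definition shares the most words with the definitions of the context words.

```python
# Toy reconstruction of Lesk-style disambiguation: pick the sense whose
# dictionary definition shares the most words with the definitions of the
# surrounding context word. Definitions here are invented for illustration.

SENSES = {
    "pine": {
        "pine#1": "a kind of evergreen tree with needle-shaped leaves",
        "pine#2": "to waste away through sorrow or illness",
    },
    "cone": {
        "cone#1": "a solid body which narrows to a point",
        "cone#2": "the fruit of an evergreen tree",
    },
}

def overlap(def_a, def_b):
    """Number of distinct words shared by two definitions."""
    return len(set(def_a.split()) & set(def_b.split()))

def disambiguate(word, context_word):
    """Choose the sense of `word` whose definition best overlaps the context word's senses."""
    best_sense, best_score = None, -1
    for sense, definition in SENSES[word].items():
        score = max(overlap(definition, d) for d in SENSES[context_word].values())
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(disambiguate("pine", "cone"))   # 'pine#1' (shares 'evergreen' and 'tree')
```

Nothing here depends on the structure of any particular dictionary, which is exactly the appeal of the method noted above.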
This largely disposes of the need for genus term disambiguation, and has made it possible to concentrate on developing robust parsing techniques which are capable of extracting more information from the dictionary and are, in that sense, superior to the approaches proposed by, for example, Ahlswede (1983), ... (1983) and Gaviria (1983). Alshawi's definitions analysis procedure is usually more reliable for precisely locating the semantic heads of definitions, and it achieves further semantic precision by using other information present in the definition, for example modifiers and predications. The result of the analysis process is a frame-like structure with filled-in slots, supporting a more accurate classification of concepts in a particular domain (see, for instance, Haas and Hendrix, 1983); one could imagine a program used to extract data relevant to these questions from the definitions, say by finding instantiations of the hyponyms in their definitions. Within such a framework, on-line access to a semantic hierarchy of defining concepts is a necessary prerequisite for the collection and analysis of empirical data, with the ultimate aim being the derivation of a set of interpretation rules, eventually generalised through the hierarchy to an extent where they can be used to disambiguate a wide range of non-lexicalised compounds.

It would seem that the only application within the broad classification of natural language processing where MRDs have not found use so far is that of language generation. Just a quick glimpse at the lexical-level problems to be solved in the process of generating text will suffice to show why MRDs have not been seriously exploited in this context. Ritchie (1987) mentions some of the questions to be considered during the process of lexical selection: what is the input to this process, a syntactic structure, a semantic representation, or a mixture of both? What are the dynamics of the lexical selection process, i.e. how does the preference for a particular word affect the subsequent choice of words? What is the architecture of the lexical selection mechanism?

1.4 Reliability and utility of MRDs

Some broader aspects of working with machine-readable dictionaries have already been discussed (for example in 1.3.2 and 1.3.6), together with the classes of application which critically require access to such sources. It now seems possible to ask questions like, for example: what is the cost (the manpower effort spent on a project) of attempting to harness what can sometimes be a bulky and unwieldy object?
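The notion of a semantic hierarchy of defining concepts can be illustrated with a deliberately crude sketch. Everything below is hypothetical: real definitions-analysis programs (such as Alshawi's, chapter 7) parse the definition properly rather than taking the first content word as the genus.

```python
# Toy sketch of deriving a hierarchy of defining concepts: take the head
# of each definition's genus phrase as the parent of the defined word.
# Definitions and the head-finding rule are simplified for illustration.

DEFS = {
    "hammer":    "a tool with a heavy metal head",
    "tool":      "an instrument used to do work",
    "container": "a box or vessel for holding things",
}

STOP = {"a", "an", "the", "or"}

def genus(definition):
    """Crude genus extraction: the first content word of the definition."""
    for word in definition.split():
        if word not in STOP:
            return word

hierarchy = {word: genus(d) for word, d in DEFS.items()}
# {'hammer': 'tool', 'tool': 'instrument', 'container': 'box'}
```

Chaining the parent links (hammer is a tool, a tool is an instrument, ...) is what yields the semantic hierarchy over which interpretation rules can then be generalised; it is also where the circularity problems discussed below become visible.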
Below are some more detailed points, touching upon the overall packaging, coverage and reliability of data in machine-readable sources, likely to be of interest to researchers in computational linguistics.

The first, and in a way most obvious, concern is with the expectations for the lexical coverage of the dictionary. Notwithstanding the status of information stated explicitly in the lexicon (viz. the full entry model of the lexicon; see Jackendoff, 1975) or derived on an 'as-needed' basis (viz. the impoverished entry model), the fact remains that no dictionary can be expected to offer complete coverage: one must admit that 'there does not exist a list of English that enumerates the true lexical units of the language...' (p.173). Walker and Amsler (1986) present empirical evidence of the huge disparity between the contents of what is, by most standards, a perfectly respectable dictionary (W7), and the vocabulary behind the New York Times News Service. Church's work on stress assignment (1985) is partly motivated by a very similar observation.

Even when the appropriate lexical entry is found, there is still the question of how far we can rely on it. However carefully a proof-reading procedure is applied by the publisher, errors remain. Since the tape was used for typesetting as well, the possibility of typographically induced errors would be removed (see the next section). Still, this would not guarantee safe processing: the published version contains, for example, a few dozen entries in which a parenthesised fragment has a missing opening or closing bracket. An oversight of this kind on the tape meant that, prior to its subsequent use by LISP programs, a considerable amount of time was lost trying to read the tape into memory without losing track of boundaries between individual entries. The only way to develop such a grammar is by applying it to a large number of entries and manually inspecting the results: a trial-and-error process which could have been altogether avoided by an explicitly defined grammar made available to the lexicographers prior to the dictionary development effort. The same point is made, in a larger context and more emphatically, by Kazman (1986), who has developed a grammar and associated software for parsing the full entries of the OED.

Even when seemingly stringent constraints are imposed on dictionary definitions, as with LDOCE's controlled defining vocabulary, only the core words of that vocabulary are considered central. Much more serious, and cause for many failings of the definitions analysis programs described above, is the somewhat liberal interpretation of the phrase 'only easily understood derivatives'. Thus, allowing derivational morphology to creep into the basic set gives some lexicographer the power to use 'container' for the definition of box2(1), even though only the verb 'contain' is considered to be primitive. Elsewhere, container(2) is defined as 'a very large, usu. metal box...', thus clearly violating the promise of non-circular definitions, described in simpler terms. A program attempting to use the semantic hierarchy derived from analysis of these terms is destined for the equivalent of an infinite loop.

A different kind of trap is introduced by an equally liberal use in the definitions of phrasal verbs made up from verbs and particles taken from the restricted vocabulary. For example, the second meaning of contain(2) is given as 'to hold back, keep under control...'. While both hold (as a verb) and back (as a noun) are within the core vocabulary, 'hold back', with its own meaning as a phrasal verb, is not. A fundamental assumption on which Alshawi's definitions analysis program (see chapter 7) is based has been violated. Not surprisingly, the result of the conversion of this particular definition into a semantic structure...

1.5 Organisation of the contributions

This book contains contributions written by three separate groups of researchers who have undertaken significant work with the machine-readable version of LDOCE. All of this work is focused on enhanc...
When the customary grammatically coded transcriptions based on broad phonological classes are used, assessments of expected class size can, even when weighted by word frequency, be misleading. He proposes an alternative, information-theoretic measure and argues that it is superior. Several LDOCE-based experiments are reported that, through using this improved methodology, shed new light on some questions of speech recogniser design.

...determined by a hierarchy of patterns in which less specific patterns dominate the more specific ones. This ensures that reasonable, incomplete analyses of the definitions are produced when more complete analyses are not possible, resulting in a relatively robust analysis mechanism. Thus the work reported addresses two robustness problems faced by current experimental natural language processing systems: coping with an incomplete lexicon and with incomplete knowledge of phrasal constructions.

Chapter 8: Meaning and structure in dictionary definitions

The aim of the LINKS project is the development of a semantic database in which the meaning descriptions in the LDOCE are stored in a systematically related way. The underlying theoretical framework is that of Dik's (1978b) stepwise lexical decomposition, which does not make use of an abstract semantic metalanguage but instead, guided by a well-defined economy principle, reduces the meanings of lexical items, via a stepwise network of chains of meaning descriptions, to a restricted set of basic lexical items. The underlying assumption, at the outset, was that such an approach would be a feasible option because of LDOCE's use, in all its meaning descriptions, of a restricted vocabulary.

In this chapter Vossen, Meijs and den Broeder outline the basic methodology for the project. First, they apply an appropriate grammatical coding to the words in the meaning descriptions, deriving a grammatically coded corpus of meaning descriptions. Subsequently they develop a syntactic and semantic typology, based on the findings in this corpus. This involves a parser-grammar distinguishing premodifiers, kernels, postmodifiers, etc., and an explanation of the semantic effects of different types of kernels and the structures in which they occur, classified into four main types: links, linkers, shunters and synonyms. The rationale for this typology is discussed at length and finally...

...lar kind of text. Dictionaries have particular promise because (a) the semantic structure of text may be more exposed in them than in other forms of text and (b) many are now in machine-readable form and are amenable to analysis by large-scale computational methods. They identify some convergence between the view of computational semantics presented, computational lexicography, and knowledge acquisition, in terms of common issues and problems they share. This convergence is illustrated using some work by the authors that attempts to extract semantic information from LDOCE and use that semantic information in two kinds of computational semantics that reflect the general position on computational semantics set forth.

1.6 Notes

1 Winograd (1983) and Allen (1987) both provide excellent introductions to natural language processing and some aspects of linguistic theory.

2 By convention, asterisks are used to mark examples considered to be ungrammatical.

3 This observation, though, is not totally correct, because the part of speech of many words is predictable from their morphological form;
4 ... structure grammars of this type.

5 The first two entries are due to Bob Ingria. The numbers in the annotations refer to the definition numbers within the entry for ac-...

6 COBUILD stands for COLLINS Birmingham University International Language Database.

9 Note that the process of language production from some underlying meaning representation should be distinguished from the functions carried out by various document generation aids, where MRDs have already been applied on a large scale, for example for spelling correction within the CRITIQUE system.

10 For example, body is part of the definitional vocabulary and has as its central (1) meaning 'the whole of a person'. However, parliament is defined as 'a law-making body', utilising the meaning of body(5): 'a number of people who do something together'.

11 We adopt the convention (in this and subsequent chapters) that, within a homograph, the relevant LDOCE word sense is indicated by a bracketed number (for example, contain(2)).

Chapter 2
Placing the dictionary on-line
Hiyan Alshawi, Bran Boguraev and David Carter

Computer tapes, typically typesetting ones carrying the data given by the publisher to the printer, are the usual medium for distributing dictionaries in machine-readable form, and the Longman Dictionary is no exception. The information on such tapes is read directly and interpreted by a program which drives the typesetting device used to produce the masters for the printed version of the dictionary.

While clearly a convenient way to separate the compilation of a dictionary and the keying-in of the data from the actual production of the book, this process does not presuppose the use of the tape for purposes other than typesetting; indeed, the tape is simply a by-product of the process of creating the printed book. This is the root of the problem of mounting the electronic source of a dictionary in a form suitable for NLP programs: typesetting information on its own does not give a sufficient handle on the underlying structure. The printed dictionary follows certain lexicographic and typographic principles, and the primary underlying assumption at the publisher's end is that a dictionary is a printed, hand-held object to be used by an intelligent human. Visual presentation of the information is critical both from the point...
2.1 From printed page to computer memory

Typographic conventions make the structure of an entry immediately obvious to the reader: homograph numbers (in superscripts), sense numbers (in brackets), semi-idiomatic phrases (marked by small bold letters), and so on; Figure 2.1 gives examples of these.

Figure 2.1 A fragment of LDOCE entry (pair)

Clearly, all this information must be present on the tape. The EBCDIC code is used as the basis of the electronic encoding and, to account for the complexity of a dictionary entry (or a page), control sequences mediate the mapping between the information on the tape and its corresponding image on the printed page. This presents yet another problem: access to the information on the tape is made more complicated than it would initially seem, since control characters perform two distinct functions. It is easier to infer a direct correspondence between a character and its corresponding symbol in printed form than to establish the effect a control sequence might have. For instance, how is the end of a field achieved: by a command declaring a different font, by an explicit command, or by a combination of these? Secondly, the analysis of the character sequence on the tape is not entirely trivial, since a large number of the EBCDIC characters are unprintable on conventional output devices, and additional effort has to go into unscrambling them into visible objects. (See, for example, Amsler, 1987b, for some more detailed comments.)

The fundamental problems of mounting a typesetting tape for on-line access come from two related factors. First, the information which drives the typesetting machine is typically in the form of a character stream, and there is very little (explicit) structure imposed on this information: a control instruction in the stream, say 'switch to bold', does not map onto a unique logical element of the entry. Second, since any computer program wishing to make use of the lexical information in a dictionary must have access to some kind of underlying representation for that dictionary (one both suitable for algorithmic processing and felicitous to the published book), the process of recovering the structure of a dictionary from the typographical information held on the tape introduces additional complications (..., 1982: 216).
Figure 2.2 A sample tape stream for typesetting pair (extract)

The simplest approach simply indicates the beginning and the end of the entry in the character stream, and separates the headword from the rest. More ambitious projects must give careful consideration to questions concerning the micro-structure of a dictionary entry: in particular, how to represent the different fragments of an entry (headword, pronunciation field, definitions for individual word senses, cross-references), a (semi-)idiomatic phrase incorporating the main headword, or a word with a related (for example synonymous or opposite) meaning to the one being defined.

A further manifestation of the problem of context, already mentioned in passing, is that control characters on the tape are typically begin markers; there is no standard way of indicating the end of a logical functional unit within an entry. For example, the scope of a bold font (indicated by the ASCII character Hex46) may be terminated, among others, by a change to another font, by the end of a sense record, or even by a superscript character. It may also be set globally, by proclaiming defaults within particular fields of an entry (for instance those containing a headword or a sense), which the typesetting program then interprets appropriately.

Ideally, the process of unloading all the information from a tape and loading a structured version of the lexical data in the dictionary into a database would involve parsing the complete source to make its implicit structure explicit. The complexities of such a process are discussed in detail by Kazman (1986). One of the results of his work is the development of a full grammar for the entire OED, and a special mark-up language capable of representing and conveying the structural properties of its entries. Attempts to build on existing experience and to develop general-purpose tools for facilitating the process of mounting a typesetting tape on-line rely on a combination of an application-neutral format and a suite of easily customisable programs.

The first two points above are related to fundamental questions facing every attempt at building a dictionary database, and we elaborate on them below. A particularly good analysis of the issues connected with parsing MRD sources into a standard form, as well as of those leading to the design of such a general-purpose representation for the lexical information contained in a range of different dictionary tapes, is presented in Neff et al. (1988).

The remainder of this chapter discusses the particular format of the Longman tape, its conversion into a form suitable for symbolic processing, and the design and implementation of two software systems which provide access to the dictionary on-line. We aim to demonstrate the influence of different system requirements and machine configurations on the functionality and design of the dictionary interface. We conclude by sketching a generalised interface system, capable of offering flexible, multipath access to versions of LDOCE independently of the hardware configurations it is mounted on.
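The 'begin markers only' problem can be made concrete with a small sketch. The marker syntax below is invented for illustration (the real tape uses raw EBCDIC control bytes, not named braces): a scanner must close the current font scope whenever any new begin marker arrives, since no explicit end marker will.

```python
# Illustrative scanner for a stream with begin-only markers: a font scope
# ends when the next marker begins, never via an explicit end code.
# Marker names are invented; the real tape uses EBCDIC control characters.
import re

MARKER = re.compile(r"\{(\w+)\}")   # e.g. {bold}, {sup}, {roman}

def segment(stream):
    """Split a marked-up stream into (font, text) spans."""
    spans, font, pos = [], "roman", 0
    for m in MARKER.finditer(stream):
        if m.start() > pos:
            spans.append((font, stream[pos:m.start()]))
        font = m.group(1)   # a new begin marker implicitly ends the old scope
        pos = m.end()
    if pos < len(stream):
        spans.append((font, stream[pos:]))
    return spans

print(segment("{bold}pair{sup}1{roman} n two things"))
# [('bold', 'pair'), ('sup', '1'), ('roman', ' n two things')]
```

Even this toy version shows why recovering entry structure is context-dependent: the meaning of a marker (does a superscript end a headword? a sense?) depends on which field the scanner believes it is in.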
2 2.2 The Longman tape and its computational counterpart
on-line Chapter
Placing the dictionary
the other
characters interspersed in the text. On LDOCE should be organized in order to facilitate its use byNLP sys-
of the overall entry by presenting some in convenient form
fer some indication
themselves
structure
With records
tems. Firstly, it is necessary to have the entries a
be
2.3
The tape falls into this second category;'Figure storage is needed, holding the complete dictionary in main mem-
since
Longman
for rivet
low shows how a fragment of an LDOCE entry is
encode ory would about
require 20 Megabytes of storage (the exact amount
at source. is organised
The information into a sequence
lines.
of records, a
depends on the format of dictionary and indexing information) and
record may be on a line of its own
or
splitthe rst several
across Every this is currently an expensive option.
a
line starts with sequence of digits, of which SIX
denoteidentifier.
unique The natural data format for Lisp programming (and one for which
the record builtin facilities
sequential number, and the remaining two encode char
there are in Prolog and other symbol manipulation
This is followed by a number of fields specic to that
record; the languages) is list structures, or more specifically Lisp s-expressions.
brackets to indicate the
acter < is a eld separator. (We use curly These can be used to represent arbitrarily complex nested records in
of a non-printable character on the source tape: thus (*CA} both the ling system and main memory. The records of the LDOCE
occurrence
EBCDIC character represented by the hexadecimal dig- tape were therefore transformed in a preprocessing phase into brack-
stands for the
and eted Lisp structures.
its C A.)
The reformatted, or lispied, machine-readable source of the Long-
man tape offers the convenience, from the point of view of a Lisp or a
to fasten with RIVETslz...
rivet2 v 1 [T13X9]to cause
Prolog program, of reading complete LDOCE entries in a form imme-
diately usable by client programs, thus avoiding the need to parse and
28289801<R0154300<rivet
28289902<02< < unpack the raw tape format used by the typesetter. All the information
28290005<v< in the source tape, as supplied by Longman, has been retained in this
ii
28290208<to
xst)J
23290107<oroo<ri;x9<ii;\zvt<
cause to as en in version; what this version offers however, above and beyond the origi-
nal lexical data, is the use of Lisp bracketing to structure entries into
28290318<(*CA)RIVET(*CB){*46)s{*44)(*BA}:
single s-expressions and the conversion of the source character stream
into a text-only ASCII encoding. Thus there are no control characters,
of the Longman tape stream
Figure 2.3 Fragment which means that all a client program needs to do in order to access a
into
particular entry is to position a le pointer at the offset which marks
Thus 01 is the headword record, which is further brokendown its beginning and to perform a single Lisp read.
fur-
two elds: serial number and headword (this may incorporate In addition, the internal structure of an LDOCE entry is indicated,
ther information concerning syllable boundaries,
02 encodes the
spelling yariants and where appropriate, by additional level(s) of Lisp bracketing. Where
capitalisation). Similarly, homograph an inforrnationci the original tape indicated separate logical (and physical) records, cor-
and additional data concerning segmentation responding to different entry elds in its printed form, we retain this
homograph number, the example en-
is
stress patterns in compound entries (this empty in indication of structure by grouping the individual records into sub-
identies a record With
try). The identier for definition code, 07.,code and box code.
expressions of the s-expression for the complete entry. For instance,
four elds: sense number, grammar code, Subject
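As a rough illustration, the record layout just described, six digits of sequence number, two digits of record identifier, then <-separated fields, can be sketched in Python (a hypothetical helper of our own, not part of the systems described in this chapter):

```python
def parse_tape_line(line):
    """Split one line of the tape format described above.

    The first six digits are the record sequence number, the next two
    the record-type identifier; '<' separates the fields that follow.
    """
    seq = int(line[:6])
    rec_type = line[6:8]
    fields = line[8:].split("<")[1:]   # drop the text before the first '<'
    return seq, rec_type, fields

print(parse_tape_line("28289801<R0154300<rivet"))
```

Applied to the headword record of Figure 2.3 this yields (282898, '01', ['R0154300', 'rivet']).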
atoms of the form *XY have been preserved in the lispified format. The master tape uses such non-printing characters to encode, for example, font changes, special characters and the phonetic symbols prefixing headwords, and typesetting software which depends on this information can thus make use of it in a straightforward fashion. The information itself has not been changed by the lispification; the individual elements within an entry are simply far easier to access. Finally, the original field separator (<) has been retained; the character ! is an escape character, preceding symbols with special meaning to Lisp. At the head of every entry in the new format is placed a copy of the head word(s) in base orthographic form, i.e. without the ...

rivet2 v 1 [T1;X9] to cause to fasten with RIVETs:...

[Figure: the lispified form of the rivet entry. The tape records of Figure 2.3 are grouped into a single s-expression, with escaped separators such as !< and !; and preserved atoms such as *CA RIVET *CB ...; the full example is not recoverable from this scan.]

2.3 On-line access: simple mode

The simpler of the two access systems is, in essence, a text editing facility, written ... on the ... 3081 mainframe. After a period of experimenting with the on-line source, a number of points needed resolving; none of them, however, required setting up the on-line source as a fully functional database. The preprocessing needs to be done only once, and ... can be fast and reliable ... Checks had to be carried out to ensure that ... the tape. Other points were resolved by ad hoc solutions, such as sequential scanning of files, or extracting subsets of such files which will fit in main memory, where the facilities native to the operating system are not adequate for the purpose in hand. (Exactly the same problem arises with Prolog, since the Prolog database facility refers to the knowledge base that Prolog maintains in main memory.) In principle, given that the
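The single-read access pattern described in 2.2, position a file pointer at a stored offset and read one s-expression, can be sketched as follows (in Python rather than Lisp; the toy s-expression reader and the miniature data are ours, not the actual lispified LDOCE):

```python
import io

def read_sexpr(stream):
    """Read one balanced s-expression from the current position (toy reader)."""
    depth, chars = 0, []
    while True:
        c = stream.read(1)
        if not c:
            break
        chars.append(c)
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth == 0:
                break
    return "".join(chars)

# A toy "lispified dictionary": one s-expression per entry, plus an index
# mapping each headword to the character offset of its entry.
entries = ["(rivet (v) (to cause to fasten))", "(blow (n v) (a hard stroke))"]
index, pos = {}, 0
for e in entries:
    index[e[1:e.index(" ")]] = pos
    pos += len(e) + 1

dictionary = io.StringIO(" ".join(entries))
dictionary.seek(index["blow"])       # position the file pointer at the offset ...
entry = read_sexpr(dictionary)       # ... and perform a single read
print(entry)
```

The design choice this illustrates is that no parsing of the raw typesetting stream happens at query time; all structure was imposed once, during lispification.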
dictionary is now in a Lisp-readable format, a powerful virtual memory system might be able to manage access to the internal Lisp structures of the entire dictionary. This is, however, expensive, as well as not always available; an alternative, and a more general, solution is outlined below.

A series of systems in Cambridge are implemented in Lisp running under Unix. They all make use of an efficient dictionary access system which services requests for s-expression entries made by the client programs. The lispified form of the dictionary has been converted into a random access file, paired together with indexing information from which the disc addresses of dictionary entries for words and compounds can be retrieved. Once a request for an entry has been served, the process becomes dormant, ready to be reactivated by subsequent queries.

In situations like this, where it is not envisaged that the information from the tape will have to be altered once installed in secondary storage, this simple and conventional access strategy is perfectly adequate. The use of standard database indexing techniques (see, for example, Wiederhold, 1983) makes it possible for an active dictionary process to be very undemanding with respect to main memory utilisation. For reasons of efficiency and flexibility of customisation, namely the use of LDOCE by different client programs and from different Lisp and/or Prolog systems, the dictionary system is implemented in C ...

The general applicability of the lispified source has been proved by the mounting of LDOCE on a completely different hardware configuration: a XEROX single-user workstation running ... lexicographic software ... The access method relies on building a search tree, a structure which allows direct access ... virtual memory allows the working subset of the dictionary to be kept in main memory. ...

The main access route into LDOCE for most of our current applications is via the headword and homograph fields (see Figure 2.4). Options exist in the access software to specify which particular homograph (or homographs) for a lexical item is required. The early process of lispification was designed to bring together not only dictionary entries corresponding to different homographs, but also lexicalised compounds for which the argument word appears as the initial word of the compound. Thus, the primary index for blow allows access to two different verb homographs (for example blow1), two different noun homographs (for example blow2), 10 compounds (for example blow off or blow-by-blow1), and all 14 of the dictionary entries (which are not necessarily found in subsequent positions in the dictionary) related to blow. The only application so far making significant use of this ... the motivation ... is that any serious parser must be able to recognise compounds before it segments its input into separate words.

2.4 On-line access: flexible mode

There are essentially two different ways in which an MRD can be used (see also 1.3 for some further discussion). The predominant technique to date involves an arbitrary amount of preprocessing, typically a batch derivation of the appropriate information ... a way of transforming the raw data into a usable repository of lexical knowledge.
can aid the development of particular NLP systems. The assumption is that an analysis of the accumulated data in the dictionary will reveal regularities which can then be exploited for the task at hand. In fact, the rest of this book presents a number of projects which have sought, and applied, such regularities.

Work of this kind depends critically not only on the availability of a machine-readable equivalent of a printed dictionary, but also on a software system capable of providing fast interactive access into the on-line source through various access routes. Operational natural language processing systems clearly will have well-defined requirements as far as their lexicons are concerned, and once the format of lexical resources has been settled, retrieval of individual entries can be implemented fairly efficiently using standard computational and linguistic techniques (see for example Russell et al., 1986). In contrast, the placing of a dictionary on-line, with the intention of making it available to a number of different research projects which need to locate and collate dictionary samples satisfying a wide range of constraints, requires an efficient and flexible system for management and retrieval of linguistic data.

Conventional database management systems (DBMSs) are not well suited for on-line dictionary support (see for example Tompa, 1986, and Boguraev et al., 1987b, for further discussion). In particular, when the entire dictionary is viewed as a lexical knowledge base, more complex retrieval requests, of a relational kind for example, become quite difficult.

Firstly, there is the nature of the data in a dictionary: typically, it contains far too much free text (definitions, examples, cross-reference pointers, glosses on usage, and so forth) to fit easily into the concept of structured data. On the other hand, the highly structured and formalised encoding of other types of information (found in, for example, the part of speech, hyphenation or pronunciation fields) makes a dictionary equally unsuitable for on-line access by information retrieval methods.

The second factor is due to the nature of the only source of machine-readable dictionaries so far available. The problems associated with the logical decoding of the flat character stream which reflects the visual organisation of the data on the tape have already been discussed (see 2.1 above). These typically include not only the difficult problem of parsing a dictionary entry, but also the issue of devising a suitable representation for the potentially huge amount of linguistic data; one which does not limit in any way the language processing functions that can be supported or constrain the complexity of the computational counterpart of a dictionary entry.

Finally, there is the nature of the data structures themselves. A text processing application, typically written in Lisp or Prolog, requires that its lexical data is represented in a compatible form, say Lisp s-expressions of arbitrary complexity. Therefore, even if we choose to remain neutral with respect to representation details, we still face the problem of interfacing to a vast number of symbolic s-expressions held in secondary storage. This problem arises from the unsuitability of conventional data models for handling the complex data structures underlying any sophisticated symbolic processing. Partly, this is due to the inherent restrictions such models impose on the class of data structure they can represent easily ... storing dictionary data (Byrd et al., 1988).

In the short term, alternative approaches reduce the complexity of the problem by limiting themselves to applying the machine-readable source of a dictionary to a small class of similar tasks, and building customised interfaces offering relatively narrow access channels into the on-line data. Thus IBM's WordSmith system (Byrd and Chodorow, 1985) is concerned primarily with providing a browsing functionality which supports retrieval of words close to a given word along the
Secondly, the system should impose a phonologically motivated structure on pronunciations, which are typically given as a string of phonemes and stress markers. This will allow the user to specify a constraint on, say, the onset of the second syllable of a word, whose position in the phoneme string will not be the same for all words, and will not always be indicated by stress markers (see 2.4.3 below). The straight indexing approach described earlier for headword-based access cannot in general provide sufficiently flexible access routes.

Thirdly, the user or program client should be free to specify different types of constraint in any combination. We cannot assume in advance that information of a given type will always be present in great enough quantities to allow efficient retrieval. For example, if the system is being used by an automatic speech recogniser, then at one point in the signal significant information on pronunciation may be available, while at another only information such as that derived by Alshawi's definitions analyser (Alshawi, this volume) might be used for further guidance during the word recognition process.

It is clear that in order to make full use of the computerised LDOCE, we need a dictionary access system with proper DBMS functionality, capable of efficient retrieval of entries satisfying selection criteria applying at various levels of linguistic description. The design of the system described here allows precisely such heterogeneous requests. What we offer is a software environment buffering the user from the typically baroque and idiosyncratic format of the raw dictionary source and allowing, via a carefully crafted interface, multiple entry points and arbitrarily complex access paths into the on-line lexical knowledge base.
...driven parser providing quite specific constraints. In this case, the stronger, more specific constraints must be used for access, and the weaker ones only for checking the entries retrieved. To achieve this, the system must be able to estimate in advance what the ...

Syllable boundaries are assigned using the phonotactic constraints given in Gimson (1980) and the maximal onset principle (Selkirk, 1978), where these yield a ... Such decomposition is necessary for speech applications ... For example, the LDOCE pronunciation of bedouin ... violates the constraint that a syllable whose peak is /u/ ...

2.4.3 Design and implementation

The design and implementation of the database system described here reflect the characteristics of the Longman tape already discussed earlier. The link between a client program and the lispified dictionary is provided by a pointer file and a constraint file, whose nature and motivation will be described below. Finally, the strategy used for indexing entries by the words in their definition texts reflects the fact that it is the semantic content of these words that is likely to be of interest to the user. This has two main consequences: ...
Constructing access paths. Constructing access paths is straightforward in the grammatical, semantic and definition-word cases: a list of pointers is constructed for every suitable grammar code, semantic code and definition-text word found in the dictionary. Pronunciations, however, are treated differently ... A file is created containing all the entry pointer lists and, just before each list, its length. As described below, this allows the system to estimate the size of a search without having to read several lists in the pointer file.

A menu-driven graphical interface is provided by means of which the user constructs a query. A tree can be constructed from either a WORD node alone, or by instructing the system to build a tree from the entry for a specified word, and then editing it. Once the tree is built, either a partial search (to estimate its cost) or a full search can be ordered. A partial search indicates which constraints the system would use to look up candidate entries, and which ones it would apply as tests to entries that satisfy the look-up in a full search. When the user selects a node, the resulting menu, shown in Figure ..., allows constraints to be specified. In terminal nodes of the PRONUNCIATION subtree, ... matches any single symbol, ... matches any sequence of symbols, and the remaining symbols have the phonetic values defined for them in ...

[Figure: the constraint menu for pronunciation nodes; largely unrecoverable in this scan.]

In a search, the system must read (a) a pointer from the constraint file and (b) a complete entry from the dictionary. The most efficient search strategy involves using the most specific constraints as look-up keys (more specific keys ultimately yielding fewer entries). The optimal number of constraints to use is found by balancing the number of pointers that will have to be read, which increases with the number of look-up keys, against the ... In a full search, the chosen constraints are not merely displayed but are also acted on. The pointer lists for the look-up constraints are intersected, the number of pointers resulting is displayed and, at the user's option, the corresponding entries are read from the dictionary, the test constraints are applied to them, and the surviving entries are displayed. Applying tests to a dictionary entry involves reanalysing the relevant parts of it in the same way as when the database is constructed.

2.4.4 An example

As an example, suppose the user wishes to see all entries for three-syllable nouns which describe movable solid objects, whose second syllable has a schwa as peak, and whose third syllable has a coda that is a voiced stop. He constructs the tree shown in Figure 2.7 and selects the partial search option. This returns the information shown in Figure 2.8.

Estimated pointer+entry reading time:
12.5+85.3=97.6 seconds (502 entries)

Figure 2.8 Interactive query construction: estimated statistics

Because of the expected large number of entries in the result and the time that would be taken to read them, the user decides to look only at the entries for such words whose definitions contain the word camera. He adds the relevant constraint to the tree (the system checking, as he does so, that camera is a valid key) and orders another partial search. This time, the statistics are more manageable. A full search is therefore ordered, in which the definition word camera is used as the only look-up key, and the other constraints are this time all used as tests. This returns the entries for the words clapperboard and Polaroid (Figure ...).
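The look-up strategy just described, intersect the pointer lists of the most specific constraints and then apply the remaining constraints as tests to the entries actually read, can be sketched as follows (a Python illustration with made-up data; all names are ours, not those of the Cambridge system):

```python
def search(pointer_lists, tests, fetch_entry, max_keys=2):
    """Intersect the most specific pointer lists, then test each survivor."""
    keys = sorted(pointer_lists, key=len)[:max_keys]   # fewest pointers first
    candidates = set(keys[0]).intersection(*map(set, keys[1:]))
    results = []
    for ptr in sorted(candidates):
        entry = fetch_entry(ptr)                       # one disc read per survivor
        if all(test(entry) for test in tests):
            results.append(entry)
    return results

# Toy data: entry "pointers" are keys into a dict instead of disc addresses.
entries = {
    1: ("clapperboard", "n", "a board held up in front of the camera"),
    2: ("Polaroid", "n", "a type of camera"),
    3: ("blow", "n", "a hard stroke"),
}
nouns = [1, 2, 3]     # pointer list for the grammar code 'n'
camera = [1, 2]       # pointer list for the definition word 'camera'
hits = search([nouns, camera], [lambda e: e[1] == "n"], entries.__getitem__)
print([e[0] for e in hits])
```

The balancing act in the text corresponds to the max_keys parameter: more keys shrink the candidate set (fewer entry reads) at the price of more pointer-list reads.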
The full search returns the entries for clapperboard and Polaroid:

clapperboard /.../ n ... the details of the scene to be filmed, written on a board held up in front of the camera ...

Polaroid /.../ n 1 [U] a material with which glass is treated in order to make light shine less brightly through it, used in making SUNGLASSES, car windows, etc. 2 [C] also ... a type of camera that produces a finished photograph only seconds after the picture has been taken

[Further figure fragments survive only partially: a command-driven version of the same query construction, e.g.

> set [pron s2 onset] to "s p"
> add [pron s3]
> delete [syn cat]
> psearch on [s2]

and the constraint tree itself, roughly:

[sem  = [box = 5,J]
 syn  = [cat = n]
 pron = [s2 = [stress = ?
               onset  = *
               peak   = schwa
               coda   = *]]] ]

... DBMS functionality ... In this sense the lexical ... manager [could be turned] into a dictionary server node (of the kind discussed by ...). Interactive graphics only simplify the end-user task of constructing search specifications. The design is sufficiently modular to allow easy modification, and the system would be capable of functioning on a conventional minicomputer or mainframe (as long as it supported random ...).

2.6 Notes

1 car is a Lisp construct which means the first element of a list.
2 Unix is a trademark of Bell Laboratories.
Chapter 3: An analysis of the grammar coding system

Eric Akkerman
3.1 Introduction
The analysis of the LDOCE coding system, which is the subject of this
chapter, was carried out as a part of the recently completed ASCOT
project (see 3.5). The main goals of this project were the development
of a lexical database system and an associated scanning system, to be
employed in (semi-) automatic syntactic analysis.
The resulting ASCOT software package consists of two major components: ...
possible basis for the ASCOT lexicon, because they contain an extensive grammatical coding system and are available on computer tape.
Detailed analysis of the entry structure in the two dictionaries and of
the respective grammatical coding systems led to the conclusion that
LDOCE was most suitable for our purposes. Also, the computer-tape
On the basis of LDOCE, the actual ASCOT lexicon was created in four stages (for the details we refer to Akkerman et al. (1988a); a general outline of the project, with emphasis on the lexical products developed in the course of it, is presented in Akkerman et al. (1988b)):

1. A computer program was developed to transform the original LDOCE files in such a way that the various types of information became optimally accessible.
2. The few imperfections which were still left (due either to the ... or to the structure of LDOCE) were corrected manually. This resulted in the first intermediary file, which is thus essentially a corrected, maximally structured version of LDOCE.
3. ...
4. ... created the actual ASCOT lexicon (Aslex).

In Aslex, the entries, together with their wordclass code and information about inflection and spelling, are stored in an L-tree format ... A codefile associated with the lexicon provides additional information pertaining to each lemma (cf. Skolnik, 1980). Of course, the ASCOT lexicon can itself be used as a lexical resource in another system. In the remainder of this chapter we present a comparison of the grammar coding systems employed in OALD and LDOCE.

3.2 A comparison of the OALD and LDOCE grammar coding systems

Both OALD and LDOCE make use of an extensive grammatical coding system to provide syntactic information about their lemmata. The LDOCE grammar codes give detailed information about the behaviour of ... English words (Procter, 1978:xxviii). The majority of codes consist of a capital letter, which may be followed by a number and/or by a lower case letter. Apart from these, the most common codes may also contain additional information ... The big letter codes denote major grammatical properties of verbs, adjectives ... basic types of ... nouns: C, U, GC, GU, P, S, Wn1, Wn2, Wn3.
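The code shape just described, a capital letter optionally followed by a number and/or a lower-case letter, can be sketched as a small parser (a hypothetical Python helper of our own, slightly extended to cover two-letter capitals such as Wa/Wn/Wv; it is no part of ASCOT):

```python
import re

# One code: a capital letter (optionally a second, lower-case, letter as
# in Wa/Wn/Wv), then an optional number, then an optional lower-case
# suffix, e.g. T1, X9, Wa5, T1a.
CODE = re.compile(r"([A-Z][a-z]?)([0-9]*)([a-z]?)")

def parse_codes(field):
    """Parse a bracketed LDOCE grammar-code field such as '[T1;X9]'."""
    codes = []
    for part in re.split(r"[;,]", field.strip("[]")):
        m = CODE.fullmatch(part.strip())
        if m:
            codes.append(m.groups())
    return codes

print(parse_codes("[T1;X9]"))
```

For the rivet entry above this yields [('T', '1', ''), ('X', '9', '')], i.e. the two codes T1 and X9.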
Those indicating various special features of adjectives (Wa3, Wa4, Wa5), nouns (N, R), verbs (Wv4, Wv5, Wv6) and adverbs (H). This is shown most clearly in ...

For the specific meaning of all big letter (and number) codes we refer to appendix F. Most important for our purposes are of course the codes mentioned under (1), (2) and (3). Especially the comprehensiveness of the noun codes deserves some attention: apart from the standard codes [C] and [U], LDOCE also uses the codes [GC] and [GU] to indicate special properties of nouns representing groups, like committee [GC] (which in the singular form can take either a singular or a plural verb) and clientele [GU] (which has no separate plural form and can take either a singular or a plural verb). The codes [S] and [P] are used for nouns that are used only with singular or with plural verbs respectively (for example, undertow [S], police [P]). It should be noted that these codes give more than information about subject-verb concord ...

... used inconsistently. The [V3] code is given to verbs taking an NP and a to-infinitive, whereas Quirk et al. (1985) make a distinction between:

transitive verbs (B8): We like all parents to visit the school
complex transitive verbs (C4): They expected James to win the race (obj + obj compl)
ditransitive verbs (D6): We asked the students to attend the lecture

... a surface structure combination that is normally associated with ... the underlying grammatical structure ... always combined with a number code ... Maybe the LDOCE lexicographers have decided, for ... reasons, on this non-committal approach.
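As an illustration of the noun codes discussed above, the subject-verb agreement possibilities they imply might be tabulated as follows (our own simplification for exposition, not part of LDOCE or ASCOT):

```python
# Verb agreement allowed for a singular-form noun carrying each code
# ("sg"/"pl" = singular/plural verb), simplified from the text above.
AGREEMENT = {
    "C":  {"sg"},        # countable, singular form: singular verb
    "U":  {"sg"},        # uncountable: singular verb
    "GC": {"sg", "pl"},  # group noun such as committee: either verb
    "GU": {"sg", "pl"},  # group noun with no separate plural (clientele)
    "S":  {"sg"},        # used only with singular verbs (undertow)
    "P":  {"pl"},        # used only with plural verbs (police)
}

def agreement(code):
    """Return the verb forms a singular-form noun with this code may take."""
    return AGREEMENT.get(code, set())

print(sorted(agreement("GC")))
```

A table like this makes the point of the text concrete: the codes carry more than a count/mass distinction, since [GC], [GU], [S] and [P] also constrain concord.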
... he then argues ... a way of refining code ... (1982:124). The little letter codes in LDOCE give detailed information about certain aspects of verb complementation. Some of them are concerned with the position of adverb particles and prepositions which are part of a verbal combination. Others indicate more specific grammatical properties; an example is code a, which in combination with a number ... Apart from the formal code symbols described so far, the LDOCE codes also contain various kinds of additional information. This can be:

A Words that may (or must) be used in combination with the lemma. The verbs ascribe and deceive both get this code in

He ascribed his failure to bad luck
He deceived her into the belief that (...),

although there is clearly a difference between the relation of ascribe and to (prepositional verb) and of deceive and into (cf. the discussion of verbal combinations below).

B Grammatical information of various kinds, for example: always sing. in form, BE + complement / adjunct ...

C ... information (for example [T1, no pass.]).

3.2.2 The OALD coding system

The formal OALD grammatical coding system is limited to nouns and verbs. The information contained in the LDOCE codes for adjectives and adverbs is either not given in OALD, or accounted for in various ways:

If an adjective can only be used predicatively, this is usually indicated ...

If adjectives have -er / -est comparative / superlative, this is indicated after the wordclass information, for example, high /../ adj (-er, -est) (cf. LDOCE code [Wa1]).

Certain grammatical information is presented in a rather non-formal way ... the information contained in the LDOCE codes [GU], [GC], [S] and [P] is often not given ... for example, grouse /.../ n (pl unchanged) (cf. LDOCE [Wn3]); ... /../ n (sing only) (cf. LDOCE [S]). A combination is also possible.

With ditransitive verbs, [VP12] stands for the order S + vt + noun / pronoun (IO) + noun / pronoun (DO), and [VP13] stands for S + vt + noun / pronoun (DO) + preposition + noun / pronoun. The corresponding LDOCE code is [D1 ...]. Note that, although most ditransitive verbs will have both [VP12] and [VP13], in some cases this may be a useful distinction.

In two ... patterns, OALD makes a distinction between verbs that indicate physical perceptions and other verbs ([VP18A/B] and [VP19A/B]); this distinction seems to be only semantically motivated.

[VP16B] = S + vt + noun / pronoun (DO) + as / like / as if + noun / clause

[VP16A] accounts for transitive constructions con... The following table illustrates the different treatment in OALD and LDOCE of the following constructions:
   OALD                                  LDOCE
1  belt ... [VP2C] ~ up                  belt up v adv [I0]
2  blare vi, vt [VP2A,C] ~ out           blare v [I0 (out)]
3  bail vt [VP15B] ~ sb out              bail out v adv 1 [T1]; 2 [T1 (out)]
4  belch vt, vi [VP6A,15B] ~ out         belch v ...

Whereas OALD's code [VP15B] is a useful pattern code, code [VP2C] is much too comprehensive. The term adverbial adjunct covers a number of syntactically different structures, like for example:

go away
it's getting on for midnight

Again, this pattern code describes a certain order of elements, while ignoring the concept of grammatical cohesion between the verb and the adjunct. With prepositional verbs, this is even more striking (cf. ...): the codes fail to capture the specific relationship between the verb and the particle / preposition.

Moreover, often when a pattern code indicates that a particle should follow the verb, the relevant particle is not specified (or only in an example sentence). With code [VP2C] this happens more often than not: it is a logical result of the comprehensiveness of the pattern. However, it also occurs with pattern [VP15], for example with array, see, baste, beckon. ... that verb, as follows:

leave 4 [VP15A, 13, 22, 19B, 2C, 24, ...]

... particle when the combination means to stop (cf. LDOCE: leave off v adv [T1a,4a,I0]).

The conclusion of our comparison of a number of important general aspects of the grammar coding systems of OALD and LDOCE is that LDOCE's grammatical coding system is both more comprehensive and more detailed than that of OALD, especially where the coding of nouns, adjectives and adverbs is concerned. Furthermore, in LDOCE the coding of verbs is more clearly structured and grammatically sounder than that of OALD. Although OALD is sometimes more specific in its verb codes than LDOCE, a serious shortcoming of OALD is the fact that many OALD patterns represent superficial surface structures rather than underlying grammatical structures. In LDOCE, this is only the case with codes [V3/4] and ... (which are not dealt with more satisfactorily ...).

3.3 A detailed analysis of the LDOCE coding system

Following our comparison of the coding systems of OALD and LDOCE, we subjected the LDOCE coding system to a detailed analysis. In this critical assessment, LDOCE's grammatical approach was systematically compared with that of Quirk et al. (1985). As a result, a number of code combinations emerged that are grammatically questionable or even incorrect. In this section we will present a short survey of our most important findings and conclusions; in Akkerman et al. (1988a) a whole chapter is dedicated to this subject.

... of these code combinations are only used, it seems, to deal with certain classes of verbs that fall outside the table of codes, but for which the code is used to give at least an indication of their grammatical behaviour. Examples are code [I2] (intransitive verb followed by infinitive ...)
3.3.2 Linking verbs with adverbial complementation

Whereas, according to Quirk et al. (1985), the category of copulas that have adverbial complementation is fairly small, the number of verbs in LDOCE that have code [L9] (linking verbs with adverbial complementation) is enormous: 564 verbs. Examples of the use of this ...

He is easy to please
*An easy man to please
He was anxious to please his guests
An anxious man to please his guests

where both adjectives are coded [B3] in LDOCE, while anxious can ... noun ... In the TOSCA lexicon ...
(Department of Alpha-informatics, Amsterdam University), aims at a reimplementation of the LSP grammar in the PARSPAT formalism (cf. de Jong and Masereeuw (1987)). The LSP (Linguistic String Project) grammar is a comprehensive computer grammar of English that was developed by Naomi Sager (cf. Sager 1981). PARSPAT is a parser-generator developed at the Computer department of the Faculty of Arts of the University of Amsterdam. Both TOSCA and PARSCOT have evaluated the ASCOT lexicon with a view to its incorporation in their respective analysis systems. The gist of their evaluation is that in general the LDOCE classifications are very useful, but that in their respective grammars certain subclassifications must be distinguished which are not accounted for in LDOCE.

TOSCA requires detailed information about the grammatical properties of pronouns and determiners, and a further subclassification of adverbs (into connective adverbs, exclusive adverbs, particularising adverbs, etc.; cf. Quirk et al. (1985)). In LDOCE, pronouns always have the code [Wp], which refers to a table in the dictionary that provides more specific information. Because this is of no use to a parsing program, a detailed formal coding system was developed for determiners and pronouns in the ASCOT lexicon, in which information is conveyed like "used with plural count nouns" (for determiners) and "3rd person plural" (for pronouns). Hopefully, at a future stage it will also be possible to include a further subclassification of adverbs. PARSCOT basically uses a classification of the position a word can take in a sentence (as do the LDOCE codes); only when so-called expression rules are applied are the elements in a sentence put into the right position.

3.4 Conclusion

Thus, it may be concluded that the comprehensive coding system employed in LDOCE makes it a very suitable basis for the creation of a computerised lexicon. However, when that lexicon is to be used in a system dedicated to automatic grammatical analysis, it will often have to be adapted and/or expanded to a certain extent, depending on the grammatical formalism that is the basis for the analysis system in question. In the ASCOT lexicon a number of such adaptations were made, geared to the Nijmegen TOSCA project. Hopefully, in the future ASCOT can be extended even more, so that on the basis of LDOCE a computerised lexicon will have been created that can be employed successfully in various natural language processing tasks.

3.5 Notes

I would like to thank Willem Meijs, Hetty Voogt-van Zutphen and Pieter Masereeuw for their stimulating contributions to the many discussions about the subject of this chapter; thanks also go to Willem Meijs for his suggestions for improvement of the text. ASCOT, which stands for Automatic Scanning system for Corpus Oriented Tasks, ...
The lexicon necessary was a research project supported by the Foundation for Linguistic Re-
like that of ASCOT, but needs one more level of subclassication for
search, under project number 300169-004. The Foundation is funded
most wordclasses. Examples are APREQ (adjectives or participles that by the Netherlands organisation for the advancement of pure research
can modify a numeral (an additional five people, NCOUNT2 (count (Z.W.O.) The project started on 1 March 1984 at the English Depart-
able nouns which can directly preceded by a preposition
be (he came ment of the University of Amsterdam and it was finished on 1 March
by car)), VCOLLECTIVE (verbs which require a plural, collective, 1987. Additional funding was supplied by the Arts Faculty of the Uni-
aggregate or coordinated object (collect tools / dust)). Furthermore, versity of Amsterdam for the period of 1 March 1987 1 July 1987, to
LSP makes use of categories that are especially useful for parsing, such round off computational work.
that, in combination with
as NTIME, which denotes temporal nouns
adjectives like last, can form a time-adverbial without a preposition (for
example last week). The use of such categories can prevent certain am-
1. By the term multi-word lexical items we mean fixed collocations: combinations of two or more words.
IN COUNTRY
AT TIME
ON SUBJECT

The links between Asia/modern and country/time are at least partly retrievable by the thesauric links between the members of these pairs (Michiels, 1982:124-5).

6. Key to numbers: anaemia = 1, ...

9. Apart from TOSCA and PARSCOT, two other projects can be mentioned here: the Centre for Lexical Information at Nijmegen University (CELEX) has drawn on the work of the ASCOT project for the development of its lexical database of English and may incorporate (parts of) the ASCOT lexicon. The LEXALYSE project (English department, University of Amsterdam) aims at employing the software developed in the ASCOT and LINKS projects (cf. Vossen et al. in this volume) in a system dedicated to automatic semantic-syntactic analysis of discourse texts.
4 Utilising the LDOCE grammar codes

4.1 Introduction
...consult relatively small lexicons, typically generated by hand (Bobrow, 1978). Two exceptions to this generalisation are the Linguistic String Project (Sager, 1981) and the IBM CRITIQUE (formerly EPISTLE) Project (Heidorn et al., 1982; Byrd, 1983); the former employs a dictionary of approximately 10,000 words, most of which are specialist medical terms, the latter has well over 100,000 entries, gathered from machine readable sources. However, to our knowledge no evaluation of LDOCE, or any other MRD, as a source for such information has been published.

Recent syntactic theorising has seen a steady relocation of ... to the logical form or underlying predicate-argument structure ...

* Sally devoured
Sally devoured the food
* Sally put the book
Sally put the book on the table
* Sally told (that) the man was ugly
Sally told Bill (that) the man was ugly

Figure 4.1

Many verbs can take more than one type of complement without significant change in meaning; for example, believe in the sense of 'think something is true' can take a NP, a NP and adjectival, nominal or infinitival predicative complement, or a sentential complement, as Figure 4.2 illustrates. These (meaning preserving) alternatives are often called grammatical alternations.

Sally believes (that) the story is wrong
...

Figure 4.2

In addition, verbs taking predicative complements may impose different interpretations on these complements; for example, persuade and promise both take a NP followed by an infinitival complement, as Figure 4.3 illustrates. However, in the case of ... Although this latter fact is not purely syntactic, if we want the parser to produce ... for its input, then facts of this type must be marked in the lexical entries of relevant verbs.

Most syntacticians have assumed that the subcategorisation of words and participation in various alternations is arbitrary, in the sense that it is unpredictable from other facts about the words in question, for example their meaning. Gazdar (1982) and Pollard and Sag (1987) make this point using examples of the kind illustrated in Figure 4.4. In the first two pairs, the verbs involved have very similar meanings but require complements which differ in a semantically insignificant fashion. In the third example, all the verbs mean something like 'become'; all these verbs will combine with poetical to yield grammatical sentences, but only got can legitimately combine with sent more leaflets. On the basis of this type of evidence many assume that lexical entries must directly encode all the syntactic environments in which a word can legitimately occur.
Kim { grew / got / turned out / ended up / waxed } { poetical / a success / sent more leaflets / doing all the work / to like anchovies }

Figure 4.5

These verbs, which fall into the so-called unaccusative class (Perlmutter, 1978), can be contrasted with members of the unergative class, which undergo a different alternation, in which the theme argument is not present in the intransitive usage, as Figure 4.6 shows.

Figure 4.6
...feature SUBCAT, encoding each different complement structure that word can take (regardless of whether the sense of the word changes). In addition, SUBCAT values can be linked to distinct semantic translations so that different interpretations of predicative complements simply follow from differing SUBCAT values.

This approach to subcategorisation predicts that not only the type but also the range of complement structures, and the interpretation of predicative complements, are all unpredictable from other properties of the word in question. Other linguists have argued that at least some aspects of subcategorisation are predictable from elements of meaning, such as the semantic or thematic roles which a verb imposes on its arguments ... Such accounts would simplify theories of subcategorisation because they would allow lexical entries to be much more economical than theories such as GPSG suggest.

In section 4.2, we describe the format of the LDOCE grammar coding system and the difficulties which arise in extracting the information conveyed by the code fields in individual entries. In 4.3, we discuss the informational content of the LDOCE coding system and the theory of grammar which lies behind it, supplementing the discussion in Akkerman (this volume). We focus in particular on the separation of syntactic and semantic information conveyed by the codes, and the inference of further semantic distinctions concerning the interpretation of predicative complements from the conjunction of codes found in the code field associated with an individual verb sense. We demonstrate a system ... to evaluate whether verbs undergoing the dative alternation fall into coherent semantic classes, and thus to explore further the issue raised above concerning the predictability of subcategorisation information from other properties of the word.

4.2 The format of the grammar codes of LDOCE

Akkerman (this volume) presents a detailed description of the grammar coding system, and appendix F contains the outline of the ... Figure 4.7 shows the entry of the verb believe as it appears in the published dictionary, on the typesetting tape and after restructuring. These grammar codes convey the information that believe can occur in the syntactic environments illustrated in Figure 4.2 above.

The capital letters encode information about the way a word 'works in a sentence' or about the position it can fill (Procter, 1978: xxviii); the numbers give information about 'the way the rest of a phrase or clause is made up in relation to the word described' (ibid.). For example, T denotes a transitive verb with one object, while 5 specifies that what follows the verb must be a sentential complement introduced by that. (The small letters, for example a in the code above, provide further information typically related to the status of various complementisers, adverbs and prepositions in compound verb constructions: for example, the a with T5 indicates that the word that can be left out between the verb and the following clause.) As another example, [V3] introduces a word sense in which ... In addition, codes can be qualified with words or phrases which provide further information concerning the linguistic context in which the described item is likely, and able, to occur; for example [D1(to)] or [L(to be)1]. Sets of codes, separated by semicolons, are associated with individual word senses in the lexical entry for a particular item, as Figure 4.8 illustrates.
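The anatomy just described, a capital letter, an optional parenthesised qualifier, a number and an optional small letter, with commas separating codes and semicolons separating code sets, can be made concrete with a small parser. This sketch is illustrative only and is not the decompacting program discussed in this chapter; the field syntax it assumes is a simplification of the real tape format.

```python
import re

# One grammar code: capital letter(s), optional qualifier in parentheses,
# optional number, optional small letter, e.g. 'T5a' or 'X(to be)1'.
CODE_RE = re.compile(r'([A-Z]+)\s*(?:\(([^)]*)\))?\s*(\d)?([a-z])?')

def parse_code_field(field):
    """Split an LDOCE-style code field such as 'T5a,b; V3; X(to be)1,7'
    into dictionaries, one per code.  Within a comma-separated group, a
    bare number or small letter inherits the preceding capital letter
    and qualifier, as in 'T5a,b' or 'X(to be)1,7'."""
    codes = []
    for group in field.split(';'):
        current_cap, current_qual = None, None
        for item in group.split(','):
            item = item.strip()
            m = CODE_RE.fullmatch(item)
            if m and m.group(1):                       # full code, e.g. 'T5a'
                current_cap, current_qual = m.group(1), m.group(2)
                num, small = m.group(3), m.group(4)
            else:                                      # abbreviated, e.g. 'b' or '7'
                num = item if item.isdigit() else None
                small = item if item.isalpha() and item.islower() else None
            codes.append({'cap': current_cap, 'qualifier': current_qual,
                          'number': num, 'small': small})
    return codes

print(parse_code_field('T5a,b; V3; X(to be)1,7'))
```

Run on the code field for sense 3 of believe, this yields five codes: T5a, T5b, V3, X(to be)1 and X(to be)7.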
believe v 3 [T5a,b; V3; X (to be)1, (to be)7]

[Figure 4.7: the grammar code field for believe, shown as printed in the dictionary (above), as it appears on the typesetting tape with font-control characters such as *46 around to be, and after restructuring into one record per code, e.g. 'head: V3', 'head: X1 right optional (to be)', 'head: X7 right optional (to be)']
punctuation and font changes. However, discovering the syntax of the system is difficult, since no explicit description is available from Longman and the code is geared more towards visual presentation than formal precision; for example, words which qualify codes, such as to be in Figure 4.7, appear in italics and therefore will be preceded by the font control character *46. But sometimes the thin space control character *64 also appears; the insertion of this code is based solely on visual criteria, rather than on the informational structure of the dictionary. Similarly, the choice of font can be varied for reasons of appearance, and occasionally information normally associated with one field of an entry is shifted into another to create a more compact or elegant printed entry.

In addition to the noise generated by the fact that we are working with a typesetting tape geared to visual presentation, rather than a database, there are errors and inconsistencies in the use of the grammar code system. Examples of errors, illustrated in Figure 4.10, include the code for the noun promise, which contains a comma in place of a semicolon; that for the verb scream, in which a colon delimiter occurs before the end of the field; and that for the verb like, where a grammatical label occurs inside a code field.

promise n 1 [C(of),C5,5; U]
scream v 3 [T1,5: (OUT); I0]
like v 1 ...; 2 [T3,4; ...]

Figure 4.10

In the medium term this approach, though time consuming, will be of some utility for producing more reliable lexicons for natural language processing. However, in the short term, the necessity to cope with such errors provides much of the motivation for the interactive approach to lexicon development (see Carroll and Grover, this volume), since this allows the restructuring programs to be progressively refined as these problems emerge. Any attempt at batch processing without extensive initial testing of this kind would inevitably result in an incomplete and possibly inaccurate lexicon.

4.3 The content of the grammar codes

Once the grammar codes have been restructured, it still remains to be shown that the information they encode is going to be of some utility for NLP systems. The grammar code system used in LDOCE is based quite closely on the descriptive grammatical framework of Quirk et al. (1972, 1985); Akkerman (this volume) provides a more detailed comparison. The codes are doubly articulated: capital letters represent the grammatical relations which hold between a verb and its arguments, and numbers represent subcategorisation frames which a verb can appear in. Most of the subcategorisation frames are specified by syntactic category, but some are very ill-specified; for instance, 9 is defined as 'needs a descriptive word or phrase'. In practice many adverbial and predicative complements will satisfy this code when attached to a verb; for example, put [X9], where the code marks a locative adverbial prepositional phrase (Put the chair nearer the fire / He put a match to his ...).
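The kind of cleanup the restructuring stage has to perform on the typesetting tape can be illustrated with a small filter. The control codes used here follow the examples in the text (*46 before italicised material, *64 the thin space); that *44 closes an italic span is an assumption made for the example, and the real tape format is of course richer than this sketch.

```python
import re

# Typesetting control sequences: *46 opens italics (assumed: *44 closes
# them) and *64 is a thin space inserted on purely visual grounds.  For
# recovering the informational content of a code field we delete all three.
CONTROL = re.compile(r'\*(?:46|44|64)')

def strip_controls(raw):
    """Remove typesetting control codes and collapse leftover whitespace."""
    text = CONTROL.sub('', raw)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_controls('X(*46to be*44)1'))   # X(to be)1
```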
Sally believed Bill to be sensible
Sally believed that Bill is sensible
Sally believed peanuts to be bad for her
? Sally believed peanuts to be sensible

Sally persuaded Bill to be sensible
* Sally persuaded that Bill is sensible
Sally persuaded Bill that she is sensible
? Sally persuaded peanuts to be bad for her

Figure 4.11

In the main, the code numbers determine a unique subcategorisation. Thus the entries can be used to select the appropriate VP rules from the grammar (assuming a GPSG-style approach to subcategorisation) and the relevant word senses of a verb in a particular grammatical context can be determined. However, if the parsing system is intended to produce a representation of the predicate-argument structure of input sentences, individual codes only give partial indications of the semantic nature of the relevant sense of the verb.

The solution we have adopted is to derive a semantic classification of the particular sense of the verb under consideration on the basis of the complete set of codes assigned to that sense. In any subcategorisation frame which involves a predicative complement there will be a non-transparent relationship between the superficial syntactic form and the underlying relations in the sentence. In these situations the parser can use the semantic type of the verb to compute this relationship; for example, ... classify verbs as ...

* John forces that the Earth is round.

... if a verb takes a direct object and a sentential complement ... Secondly, ... it will function as the underlying logical subject of the predicate complement. Michiels proposed rules for doing this for infinitive complement codes; however, there seems to be no principled reason not to extend this approach to computing the underlying relations in other types of VP, as well as in cases of NP, AP and PP predication (see Williams (1980) for further discussion).

... representation of the lexical entry for a particular word, in the sense that this intermediate representation could be further transformed into a format suitable for most current parsing systems. For example, if the input were the third sense of believe, as in Figure 4.7, the program would generate the (partial) entry shown in Figure 4.12 below. The four parts correspond to different syntactic realisations of the third sense of the verb believe. Takes indicates the syntactic category of the subject and complements required for a particular realisation. Type indicates the arity of a predicate and whether it is a raising or equi verb under that syntactic realisation.

The five rules which are applied to the grammar codes associated with a verb sense are ordered in a way which reflects the filtering of the verb sense through a series of syntactic tests. Verb senses with an ... code are classified as Subject Raising. Next, verb senses which contain a V or X code and one of the [D5], [D5a], [D6] or [D6a] codes are classified as Object Equi. Then, verb senses which contain a V or X code and a [T5] or [T5a] code in the associated grammar code field (but none of the D codes mentioned above) are classified as Object Raising. Verb senses with a [V] or [X(to be)] code (but no [T5] or [T5a] codes) are classified as Object Equi. Finally, verb senses containing a [T2], [T3] or [T4] code, or an [I2], [I3] or [I4] code, are classified as Subject Equi. Figure 4.13 gives examples of each type.

Figure 4.13

The Object Raising and Object Equi rules attempt to exploit ... as convenient labels for what we regard as a semantic distinction; the actual output of the translation program is a specification of the mapping from superficial syntactic form to underlying predicate-argument structure. Clearly, there are other syntactic and semantic tests for this distinction (see, for example, Perlmutter and Soames, 1979:472), but these are the only ones which are explicit in the LDOCE coding system.

Once the semantic type for a verb sense has been determined, the sequence of codes in the associated code field is translated, as before, on a code-by-code basis. However, when a predicative complement ...
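The ordered tests can be read as a first-match-wins decision list over the set of codes in a sense's code field. The sketch below is a reconstruction from the prose rather than the authors' program; the trigger for the first (Subject Raising) rule is unreadable in our copy of the text, so it is left as a stub, and the prefix matching on code strings is a simplifying assumption.

```python
def classify_sense(codes):
    """Assign a semantic type to a verb sense from its LDOCE grammar
    codes (strings like 'T5a', 'V3', 'X(to be)1').  The rules are tried
    in order and the first that matches wins."""
    def has(*prefixes):
        return any(code.startswith(p) for code in codes for p in prefixes)

    def subject_raising_trigger():
        # The condition for this first rule is damaged in the source text;
        # only the rule's position in the ordering is recoverable.
        return False

    if subject_raising_trigger():
        return 'Subject Raising'
    if has('V', 'X') and has('D5', 'D6'):          # [D5],[D5a],[D6],[D6a]
        return 'Object Equi'
    if has('V', 'X') and has('T5') and not has('D5', 'D6'):
        return 'Object Raising'                    # [T5] or [T5a], no D codes
    if has('V', 'X(to be)') and not has('T5'):
        return 'Object Equi'
    if has('T2', 'T3', 'T4', 'I2', 'I3', 'I4'):
        return 'Subject Equi'
    return None

# believe sense 3, [T5a,b; V3; X(to be)1,7], comes out as Object Raising,
# while persuade, with a D5 code, comes out as Object Equi.
print(classify_sense(['T5a', 'T5b', 'V3', 'X(to be)1', 'X(to be)7']))
print(classify_sense(['T1(of)', 'D5', 'V3']))
```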
4.4 Lexical entries for PATR-II

The output of the grammar code translation program can be used to derive entries which are appropriate for particular grammatical formalisms. To demonstrate that this is possible we implemented a system which constructs dictionary entries for the PATR-II system (Shieber, 1984 and references therein). PATR-II was chosen because it has been reimplemented in Cambridge and was therefore available; however, the task would be nearly identical if we were constructing entries for a system based on GPSG, Functional Unification Grammar (Kay, 1984a) or Lexical-Functional Grammar (Kaplan and Bresnan, 1982). As noted ...

The DAG for a lexical item is constructed from its lexical entry, which contains a set of templates for each syntactically distinct variant. Templates are themselves abbreviations for unifications which define the DAG. For example, the basic entry and associated DAG for the verb storm are illustrated in Figure 4.15. The template Dyadic defines the way in which the syntactic arguments to the verb contribute to the predicate-argument structure of the sentence, while the template TakesNP defines what syntactic arguments storm requires; thus, the information that storm is transitive and that it is a two-place predicate is kept distinct. Consequently, the system can represent the fact that some verbs which take two syntactic arguments are nevertheless one-place predicates.
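The division of labour between a syntactic template such as TakesNP and a semantic template such as Dyadic can be mimicked by treating templates as feature bundles merged by unification. This is an illustrative analogue in plain Python dictionaries, not Shieber's PATR-II notation, and the template contents are invented for the example.

```python
def unify(a, b):
    """Merge two nested feature dictionaries, failing on any conflict,
    in the spirit of PATR-II template expansion."""
    out = dict(a)
    for key, val in b.items():
        if key in out:
            if isinstance(out[key], dict) and isinstance(val, dict):
                out[key] = unify(out[key], val)
            elif out[key] != val:
                raise ValueError(f'unification failure at {key}')
        else:
            out[key] = val
    return out

# Syntactic template: which arguments the verb takes.
TakesNP = {'cat': 'V', 'args': {'subj': 'NP', 'obj': 'NP'}}
# Semantic templates: how many argument places the predicate has.
Dyadic  = {'trans': {'arity': 2}}
Monadic = {'trans': {'arity': 1}}

# 'storm' is transitive *and* a two-place predicate ...
storm = unify(TakesNP, Dyadic)
# ... but a verb taking two syntactic arguments could equally be paired
# with Monadic, keeping syntax and semantic arity distinct.
weird = unify(TakesNP, Monadic)
print(storm['trans']['arity'], weird['trans']['arity'])
```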
The modified version of PATR-II that we have implemented contains only a small dictionary and constructs entries automatically from restructured LDOCE entries for most verbs that it encounters.

persuade v 1 [T1 (of); D5a,b] to cause to feel certain; CONVINCE: She was persuaded ... 2 [T1 (into, out of); V3] to cause to do something by reasoning, arguing, begging, etc.: try to persuade him to ... | Nothing would persuade him to take us to Cornwall

marry v 1 [I0; T1] to take (a person) in marriage: He never married ... 2 [T1] (of a priest or official) to perform the ceremony of marriage for (2 people): An old friend married them 3 [T1 (to)] to cause to take in marriage: She married her daughter to a rich man

As well as carrying over grammar codes, the PATR-II lexicon system has been modified to include word sense numbers, which are derived from LDOCE. Thus, the analysis of a sentence by the PATR-II system now represents its syntactic and underlying predicate-argument structure and the particular senses of the words (as defined in LDOCE) which are relevant in the grammatical context. Figure 4.16 illustrates the dictionary entries for marry and persuade constructed by the system.

[Figure 4.16: the entries for marry and persuade constructed by the system. Each LDOCE sense expands to one or more realisations, e.g. for marry ((Sense 1) ((Takes NP) (Type 1)) ((Takes NP NP) (Type 2))), and for persuade ((Takes NP NP SBar) (Type 3)) and ((Takes NP NP Inf) (Type 3 ObjectEqui)); in PATR-II form each realisation is a template sequence such as V TakesNP Dyadic, V IntransNP Monadic or V TakesNPSBar Triadic, with <head trans sense-no> bound to the LDOCE sense number]

In Figure 4.17 we show one of the two analyses produced by PATR-II for a sentence containing these two verbs. The other analysis produces the same predicate-argument structure but incorporates the second sense of marry. Thus, the output from this version of PATR-II represents the information that further semantic analysis need only consider the second sense of persuade and the first and second senses of marry; this rules out one further sense of each, as defined in LDOCE.

parse> uther might persuade gwen to marry in cornwall

[cat: SENTENCE
 head: [form: finite
        agr: [per: p3, num: sg]
        aux: true
        trans: [pred: persuade
                sense-no: 2
                arg1: [ref: uther, sense-no: 1]
                arg2: [ref: gwen, sense-no: 1]
                arg3: [pred: marry
                       sense-no: 2
                       arg1: [ref: gwen, sense-no: 1]
                       arg2: [ref: cornwall, sense-no: 1]]]]]

Figure 4.17

4.5 Evaluation

The utility of the work reported above rests ultimately on the accuracy of the lexical entries which can be derived from the LDOCE tape. We have not attempted a systematic analysis of the entries which would result if the decompacting and grammar code translation programs were applied to the entire dictionary. In Section 4.3 we outlined some of the errors in the grammar codes which are problematic for the decompacting stage. However, mistakes or omissions in the assignment of grammar codes represent a more serious problem. While inconsistencies or errors in the application of the grammar coding system in some ...
... do not have [T5] codes anywhere in their entries are elect, love, represent and require. None of these verbs take sentential complements and therefore they appear to be counterexamples to our Object Raising rule. In addition, Moulin et al. (1985) note that our Object Raising rule would assign mean to this category incorrectly ... criteria should be employed (Perlmutter and Soames, 1979:460f.). However, only two of these criteria are explicit in the coding system.

acknowledge v 1 [T1,4,5 (to)] to agree to the truth of; recognise the fact or existence (of) ... 2 [T1 (as); X (to be) 1,7] to recognise, accept, or admit (as) ...

hear v 1 [I0; T1] to receive and understand (sounds) by using the ears: I can't hear very well. | I heard him ... 2 [T1,5a] to be told or informed: I heard that they were defeated ...

Figure 4.20

... the LDOCE lexicographers typically define two different word senses, one of which is marked (perhaps among other codes) [T5] and the other with a V1 code. Analysis of these senses suggests that this approach is justified in three cases, but unmotivated in five; for example, acknowledge(1),(2) (unjustified) versus hear(1),(2) (justified) (see Figure 4.20). The other four cases we interpreted as unmotivated ...

On the basis of the results obtained, we explored the possibility of modifying the Object Raising rule to take account of the co-occurrence of [T5] and [T5a] codes and V or X codes within a homograph, rather than within a word sense. An exhaustive search of the dictionary ... Subject and Object Equi coded under the same sense; in the translation program this ambiguity is resolved by selecting the type appropriate for each ...
... classified as Object Raising. Allow(1) and permit(1) appear here incorrectly because they are coded [T4] in the code field to capture examples such as:

They do not allow / permit smoking in their house.

These verbs should probably be coded ... by hand.

4.6 The Dative Alternation

In order to evaluate the proposals of Levin (1985), that aspects of a verb's subcategorisation, and in particular the range of alternative complement structures it can take, are predictable from the semantic class of the verb and its predicate-argument (thematic) structure, we examined verbs in LDOCE which are coded to undergo the dative alternation. This alternation was chosen because it is probably ... a programme of simplifying lexical entries by predicting aspects of syntactic behaviour on the basis of membership in semantic classes. If such an account is invoked for too many cases, then arguments based on semantic classes will begin to lose their empirical force. On the other hand, the claim that lexical extension into the transfer of possession class licences the dative alternation for change of position verbs implies that verbs which are not members of the change of position or transfer of possession classes will not alternate in this way. Levin argues that the alternation involves a change ... to the patient or theme, and is associated with most verbs that describe transfer of possession, but goes on to say that lexical extension allows verbs of the change of position class to become members of the transfer of possession class; the productivity of this process of lexical extension might partially explain why so many more verbs show this alternation than other alternations.

The verbs were extracted from LDOCE using the lexical database system described by Alshawi et al. (this volume), in conjunction with the decompacting program described in 4.2. Undoubtedly this sample does not exhaust the class of verbs which can undergo the dative alternation. However, it provides a realistic-sized sample which can be used ... The verbs were presented in short example sentences intended to bring out the relevant sense from LDOCE. These example sentences were divided into three groups by the second author, using LDOCE as a guide, according to whether they allowed the dative alternation or required two NP complements or a NP followed by a PP introduced by to or for. Examples from each class are shown below.

The examples (which had been presented in alphabetical order) were then sorted into further semantic classes to see if any trends could be observed. Verb senses were left unclassified with respect to the dative alternation where three or more participants objected to the original classification. Membership of the relevant semantic classes is shown by a scale (0-5) which represents the number of participants who thought the verb sense belonged to the class. In the cases where a participant left a question blank ...

65 verb senses were judged to be change of position or transfer of possession verbs (on the basis of a score of 3 or more) and 66 were not. Of those which fell into these classes 54 (84%) alternate and 11 (16%) do not. Of those which do not fall into these classes 41 (64%) alternate and 25 (36%) do not. These results suggest that a weak version of Levin's claims receives some support, in that if a verb sense is in the relevant classes then there is a higher chance that it will undergo the dative alternation. However, stronger versions of her hypothesis do not stand up: many verbs not in these classes undergo the dative alternation and some transfer of possession verbs do not alternate.
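The class-membership figures amount to a two-by-two contingency table, class membership against alternation. A short sketch, using the counts as quoted in the text, recomputes the alternation rate for each group:

```python
# Raw counts reported in the text: 65 verb senses judged change of
# position / transfer of possession (score of 3 or more), 66 not.
counts = {
    ('in_class', 'alternates'): 54, ('in_class', 'does_not'): 11,
    ('out_class', 'alternates'): 41, ('out_class', 'does_not'): 25,
}

def alternation_rate(group):
    """Proportion of a group's verb senses that undergo the alternation."""
    yes = counts[(group, 'alternates')]
    no = counts[(group, 'does_not')]
    return yes / (yes + no)

for group in ('in_class', 'out_class'):
    print(group, round(100 * alternation_rate(group)))
```

The in-class rate exceeds the out-of-class rate, which is the weak version of the claim; that both rates are well above zero is what undermines the stronger versions.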
NP-NP:   She forbade him the house        (forbid(3))   Position/Possession?

NP-PP:   She posed the problem to him     (pose(3))     Position/Possession?

Either:  She quoted him some poetry       (quote(1))    Position/Possession?

Figure 4.24

We assume that neither complement structure is more basic, and that no rule of dative movement is used to derive one structure from the other. Rather, we assume that verbs select more or less freely from a set of potential complement structures, and that verb senses (judged against LDOCE entries) must remain the same across the (dative) alternation. The questionnaire asked participants to confirm that the verb sense was in the correct group and to say whether it fell into the classes of verbs of change of position or transfer of possession. Five linguists completed this questionnaire (including the second author).

The first column of Figure 4.25 gives the number of verb senses which fall into each subclass and an approximate percentage figure representing the proportion of the total sample in that class. As can be seen, the largest class represents only 15% of the sample. In addition, the 'others' class contains 12% of the sample, consisting of miscellaneous verbs which we were unable to classify further. This subclassifying exercise tends to suggest that the verbs in the sample are relatively semantically disparate and do not fall into a unified semantic class, as implied by Levin (1985). The second column shows the mean score for membership of the change of position or transfer of possession classes. Buy/Sell, Pay/Charge and Pass/Throw verbs were judged to be paradigm members of these classes, as were Assign/Give verbs (whose mean score was reduced by the inclusion of more metaphorical giving verbs, such as render(3) or afford(3)). All the Pass/Throw verbs alternate, as Levin (1985) would predict, and they all appear to undergo lexical extension from change of position to transfer of possession (Figure 4.26).
[Figure 4.25: for each subclass in the sample (Buy/Sell, Pay/Charge, Pass/Throw, Assign/Give, Construct, Obtain/Find, Save/Take and others), the number of verb senses and approximate proportion of the total sample, the mean membership score (0-5) for the change of position or transfer of possession classes, and the numbers of alternating and non-alternating senses]

She tossed the corner the ball
She tossed him the ball

Figure 4.26

There is a clear default inference of transfer of possession, although this can be overridden, as in: ...

With the other subclasses, as columns 3 and 5 indicate, the situation is not so clear-cut. Only two-thirds of the Assign/Give class alternate, although this is the class which most clearly involves a central inference of change of possession, as the oddity of the following example illustrates:

? She gave the book to Susan but Bill took it before Susan could.

Compare this with the more natural: ...

In this class, the verbs which do not alternate select the NP-NP complement structure. In the case of the Buy/Sell class, purchase(1) appears not to alternate. Thus within these subclasses, which all appear to fall into the broader classes of change of position or transfer of possession, we can see differential patterns of behaviour with respect to the dative alternation, which, at best, only indicates a trend in support of Levin's (1985) claims.

The Construct and Obtain/Find classes were also judged to fall into the broader classes of change of position or transfer of possession, but judgements were less clear-cut. With these verbs the inference of change of possession seems less central; for example:
She brought him a drink but he was already asleep

… class of verbs, implying that it will not occur with verbs which do not fall, or cannot be extended, into this class. These subclasses illustrate that verbs which fall outside the change of position or transfer of possession class do alternate with as much, or more, regularity than other subclasses which fall within it.

Figure 4.28

The remaining classes also fall outside the broad class of change of position or transfer of possession verbs, but mostly do not show a consistent preference to alternate or not. The Save/Take verbs do not alternate, and they occur readily with a reflexive object:

(7) She quoted poetry to the great wall
    ? She quoted the great wall poetry

Figure 4.30

We tested this claim by marking all verb senses in the sample with the feature +/- LAT(inate). 41 (31%) of verbs in the sample are Latinate and 90 (69%) are not. Of those which are Latinate, 27 (65%) do alternate and 14 (35%) do not. Of those which are not Latinate, …two (e.g. … bisyllabic but not Latinate) … The great majority of words in the sample are mono- or bi-syllabic, so the hypothesis does not make very strong predictions. However, amongst the monosyllabic verbs there are many which do not alternate: wish(4), fine(1), save(3), give(13), yield(3), and so forth.

Grimshaw and Prince (1986) offer a more sophisticated version of the phonological hypothesis, arguing that all alternating verbs consist phonologically of a single prosodic foot. A single foot can consist of one (stressed) syllable, or of two syllables with stress falling on one and the other extrametrical. For the 27 verbs which had stress on the final syllable, the single foot constraint made correct predictions in 14 cases, where either the initial syllable was extrametrical or the dative alternation was blocked. In the remaining 13 cases, either the initial syllable was extrametrical and alternation was nevertheless blocked (afford(3), allow(3)), or the verbs alternated despite not being single feet (e.g. prepare(2), procure(1,2), refund(1), ensure(2)). Most of our examples which go against the single foot constraint were given low scores for membership of the transfer of possession class; however, refund(1), remit(2), repay(1) and reimburse(1) were given high scores, do alternate, and do not consist of single feet. Similarly, charge(1) and fine(1) were given high scores, do not alternate, but do consist of single feet.

One broad correlation which did emerge from our sample is that the more idiomatic the usage, the less likely alternation is to occur.

She gave her family all her time
She gave all her time to her family

She gave him the point
She gave the point to him

She gave them the President
She gave the President to them

The meal gave him indigestion (give(…), cause)
The meal gave indigestion to him

She gave him the information (give(…), tell)
She gave the information to him

She gave him her hand (give(15), support)
She gave her hand to him

Figure 4.32

The first example illustrates a fairly productive meaning of give which does not alternate but can take a wide range of deverbal theme arguments. The second two examples illustrate relatively specialised senses of give which severely constrain the nature of the theme argument but which alternate nevertheless. Finally, we present two examples of pay, both of which alternate, but in the second the theme argument …

She paid him the money (pay(1))
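Counts of the kind reported above reduce to a small contingency table of Latinateness against alternation. The sketch below shows one way such a table might be computed; the sample data is invented for illustration and is not the chapter's actual verb-sense data set.

```python
# Sketch: tabulating dative alternation against a hand-assigned
# +/- LAT(inate) feature. The sample senses below are illustrative.

def alternation_by_latinate(senses):
    """senses: iterable of (name, is_latinate, alternates) triples.
    Returns {is_latinate: (total, number_alternating)}."""
    table = {True: [0, 0], False: [0, 0]}
    for _name, lat, alt in senses:
        table[lat][0] += 1
        if alt:
            table[lat][1] += 1
    return {k: tuple(v) for k, v in table.items()}

sample = [
    ("donate(1)", True, False),   # Latinate, does not alternate
    ("refund(1)", True, True),    # Latinate, alternates anyway
    ("give(1)",   False, True),
    ("wish(4)",   False, False),
]
counts = alternation_by_latinate(sample)
print(counts)  # {True: (2, 1), False: (2, 1)}
```

Percentages such as the 65% alternation rate for Latinate senses then follow directly from the pairs of counts.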
4.7 Conclusion

Most systems for NLP require vocabularies substantially larger than those typically developed for theoretical purposes. The evaluation of the LDOCE grammar coding described in these experiments suggests that it is sufficiently detailed and accurate (for verbs) to make the on-line production of the syntactic component of lexical entries for a demonstration parsing system both viable and labour saving. However, the success of the programs described above in producing useful lexical entries depends directly on the accuracy rate of the assignments in the source dictionary. Correcting the mistakes and omissions would be a non-trivial exercise. This is part of the motivation for the interactive, rather than batch-mode, … applications …
4.8 Notes

We would like to thank Geoffrey Leech, Beth Levin, Steve Pulman, Graham Russell and Karen Sparck Jones for their comments on various drafts, which substantially improved this chapter. We are entirely responsible for any remaining errors. Part of the research reported here was funded by the UK Science and Engineering Research Council.

5 A computational lexicon for English

John Carroll and Claire Grover

5.1 Background

… the grammatical nature of the English lexicon. … community is being carried out jointly by groups at the Universities of Cambridge, Lancaster and Edinburgh. The goal of these three closely related projects is to produce directly compatible rule systems and associated software, capable of functioning together as an integrated system for morphological and syntactic parsing of texts. The projects aim to deliver, respectively, a sentence grammar of English together with a grammatically-indexed lexicon, a combined inflectional and derivational morphological analyser and dictionary, …
… of a current approach and summarises the further work we plan to do, directed towards deriving a large lexicon from LDOCE.

The requirements of a range of potential applications and our diverse user community motivate, in particular, the need for a morphological and syntactic analyser with wide coverage of English grammar and vocabulary. Grover et al. (1987) describes the sentence grammar formalism and current coverage of the English grammar in detail. Russell et al. (1986) describes the morphological analyser and dictionary …

The next section summarises the detailed requirements on the target lexicon imposed by the morphological analyser and the sentence grammar, and outlines a processing strategy for deriving appropriate lexical entries from LDOCE. We then argue that such a strategy must be open to frequent manual intervention due to various types of inaccuracy in the grammatical coding in LDOCE. In section 5.4, we describe our methodology together with a lexicon development system …

5.2 The target lexicon

Both the grammar and morphology toolkit projects have adopted an … handle semantics, there is still the need to provide a minimal, theoretical … of verbs, and their semantic types must be made available in the lexical entries.

LDOCE is an obvious starting point for developing such a detailed and substantial lexicon. Apart from the obvious motivation of attempting to derive a large list of words from a computerised source, LDOCE is particularly relevant to this project since it offers, through the system of grammar codes, as has been discussed in Akkerman (this volume), a great deal of detail about the grammatical behaviour of individual words. LDOCE contains only base and irregular forms (it excludes regular inflectional variants); the entries in a derived lexicon would thus be of exactly the right type for merging with the existing hand-crafted lexicon.

[V +, N -, BAR 0, PAST -, CAT V, FIX NOT, REG +, COMPOUND NOT, AT +, LAT +, SUBCAT NP]
[V -, N +, BAR 0, POSS -, PLU -, PART -, PRO -, COUNT +, INFL +, NFORM NORM, PN -, PER 3, CAT N, FIX NOT, COMPOUND NOT, AT +, LAT +, SUBCAT NULL]

Figure 5.1 Lexical entries for address
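Entries of the kind shown in Figure 5.1 can be pictured as simple feature/value mappings, and the merge with an existing hand-crafted lexicon as a union over spellings. The sketch below is illustrative only: the dict representation and the precedence policy (hand-crafted entries win) are assumptions, not the projects' actual data structures.

```python
# Sketch: lexical entries as feature/value mappings, after Figure 5.1,
# and a merge of a derived lexicon into a hand-crafted one.

address_verb = {"V": "+", "N": "-", "BAR": 0, "CAT": "V", "FIX": "NOT",
                "REG": "+", "COMPOUND": "NOT", "AT": "+", "LAT": "+",
                "SUBCAT": "NP"}
address_noun = {"V": "-", "N": "+", "BAR": 0, "CAT": "N", "FIX": "NOT",
                "COUNT": "+", "PER": 3, "NFORM": "NORM",
                "SUBCAT": "NULL"}

def merge_lexicons(hand_crafted, derived):
    """Union of the two lexicons; where a spelling occurs in both,
    the hand-crafted entry list is kept (an assumed policy)."""
    merged = dict(derived)
    merged.update(hand_crafted)
    return merged

derived = {"address": [address_verb, address_noun]}
hand = {"believe": [{"CAT": "V", "SUBCAT": "SFIN"}]}
lexicon = merge_lexicons(hand, derived)
print(sorted(lexicon))  # ['address', 'believe']
```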
Figure 5.2a gives a complete list of the feature names and potential values which may occur as part of the lexical entry for a given word or morpheme and which are relevant to the sentence grammar. Figure 5.2b lists the features which are relevant only to the operation of the word grammar. Briefly, they are used in the following way: FIX encodes whether a morpheme is a prefix, a suffix or neither; LAT encodes whether a morpheme is latinate or not; AT encodes whether a stem combines with the suffix +ation or the suffix +ion; INFL marks whether an inflectable morpheme has a regular past tense form or not; COMPOUND identifies the category of a compound; STEM, which is category-valued, indicates the type of …

Nouns are defined as [N +, V -, BAR 0], verbs as [N -, V +, BAR 0], adjectives as [N +, V +, BAR 0] and prepositions as [N -, V -, BAR 0]. Minor categories do not have the features N, V and BAR. Instead they are usually defined simply by their SUBCAT feature; for example, the complementiser that is defined as [SUBCAT THAT]. Most other features can then be …

The features appearing on certain categories in addition to the sets defined above are INV, SUBCAT and NEG, which are relevant to verbal categories; DEF and SUBCAT, applicable to nominal categories; AFORM and SUBCAT for adjectival categories; and SUBCAT alone for prepositional categories. Additionally, SUBCAT, AGR, DEF, QUA and POSS appear on determiners, SUBCAT and AFORM appear on degree modifiers of adjectives, SUBCAT and CONJN appear on conjunction words, and PFORM appears on particles. SUBCAT must be specified for all lexical entries. Wh-words of any category also have the features WH, UB and EVER.

These sentence grammar features are used as follows: PRD indicates whether a category can appear in predicative position or not; AGR, which is category-valued, encodes information about the category that item is able to agree with; NEG indicates whether a …

Figure 5.2
a. Features used by the sentence grammar: BAR {-1 0 1 2}, LOC {+ -}, PAST {+ -}, AGR <cat>, VFORM {…}, WH {+ -}, PLU {+ -}, UB {…}, PER {1 2 3}, EVER {+ -}, CASE {NOM ACC}, CONJN {+ -}, COUNT {+ -}, SUBCAT {…}, PN {+ -}, NFORM {…}, PART {+ -}, POSS {+ -}, DEF {+ -}, PRO {+ -}, PFORM {…}; with the abbreviations VERBALHEAD {VFORM FIN AUX PAST AGR PRD}, NOMHEAD {NFORM PLU PER CASE PRO PN POSS COUNT}, ADJHEAD {QUA ADV NUM NEG PART AGR DEF PRD}, PREPHEAD {PFORM LOC PRO PRD}.
b. Features used by the word grammar: …
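The major-category conventions quoted above lend themselves to a small lookup. The sketch below is a hypothetical illustration of how an entry's N/V/BAR values identify its category; the function name and dict encoding are assumptions.

```python
# Sketch: the [N, V, BAR] major-category definitions from the text.

MAJOR = {
    "noun":        {"N": "+", "V": "-", "BAR": 0},
    "verb":        {"N": "-", "V": "+", "BAR": 0},
    "adjective":   {"N": "+", "V": "+", "BAR": 0},
    "preposition": {"N": "-", "V": "-", "BAR": 0},
}

def category_of(entry):
    """Return the major category whose N/V/BAR values the entry carries,
    or None for minor categories defined by SUBCAT alone."""
    for name, feats in MAJOR.items():
        if all(entry.get(f) == v for f, v in feats.items()):
            return name
    return None

print(category_of({"N": "+", "V": "+", "BAR": 0}))  # adjective
print(category_of({"SUBCAT": "THAT"}))              # None (minor category)
```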
… PER indicates whether a nominal is first, second or third person; PLU whether it is singular or plural; COUNT whether a noun is a mass noun or a count one; … NUM appears only on … which are ordinals ([NUM ORD]) and on the relative pro… ; WH appears only on wh-words, and …

… terisation of the morphological and syntactic … of morphemes and words. Some features (e.g. FIX) are specific to bound morphemes; …, however, are relevant to open class vocabulary and closed classes … the features N, V and BAR …

The values of VFORM, FIN and some of the other features in the sample entries for believe are predictable from the part of speech … expected to be present on verbs … entries which are the base, uninflected form … is not predictable from either …

5.3 Inaccuracies in the LDOCE source data

As discussed in Akkerman (this volume, chapter 4), most (if not all) of the features … CAT, are … recoverable from the majority of MRDs; LDOCE appears to be an exception … depends on the assignment of codes to word senses. For example, … the entry for promise(1) … a tagging of major … where a comma (rather than semi-colon) … is an error … errors which cause processing problems.
… the entry for upset(3) 'to cause to worry', which should contain a code licensing sentential subjects, as in That Mary is pregnant upset Bill. Some information is simply not encoded in the LDOCE entries; for example, the information represented by the morphological features AT and LAT does not appear in LDOCE. These features play an important role in the analysis of derivational variants, and are necessary for the correct working of the word grammar. If they are not present, many morphologically productive, but non-existent, lexical forms will be potentially analysable by the lexicon system. At the sentence grammar level, too, there seem to …

5.4 A methodology and a system for lexicon development

… (no MRD provides a complete, consistent and totally accurate source of lexical information). In the light of this, the solution we have adopted is a … entries are … incomplete, and eventually added to the target lexicon when they are deemed satisfactory. This methodology is conceptually very simple, and full software support for it in its bare essentials need correspondingly be quite minimal. However, two major design considerations need to be taken into account for such a system to be usable, and thus for the methodology to be viable.
… the lexical … format … as a template, with … paired … a SUBCAT value … follows from the system's provision of powerful facilities to help the user check the results of the automatic translation stage, and its flexibility in allowing the user to modify the form of the translations produced.

5.4.2 The automatic translation phase

The first phase in deriving lexical entries for the target lexicon is an automatic translation of the grammar code field of a source entry into …

believe 1; 3 [T5a,b; V3; X (to be) 7]

((Takes NP NP) 7) -> NP
((Takes NP SBar) 7) -> SFIN
((Takes NP NP Inf) (Type 2 ORaising)) -> OR
((Takes NP NP Inf) (Type 2 OEqui)) -> OE
((Takes NP SBar) (Type 2))
(or ((Takes NP NP Inf) (Type 2 ORaising))
    ((Takes NP NP NP) (Type 2 ORaising))
    ((Takes NP NP AuxInf) (Type 2 ORaising))
    ((Takes NP NP AP) (Type 2 ORaising)))

Figure 5.3 Lexical templates derived for the third sense of believe

The next stage takes these templates into the representation used by the target lexicon, in our case a set of categories each containing a list of feature/value pairs. Which features and values appear in the final categories is determined by two mechanisms, both of which may be customised by the user.

The second mechanism uses the sets of entry completion rules and multiplication rules which are defined to the morphological analyser. Whenever a lexical entry matches the pattern in one of the rules it is either padded out with new feature/value pairs (by entry completion rules), or completely new entries, based on the existing one, are generated (by multiplication rules). Once a SUBCAT value for an entry has been decided, these rules are applied to the entry to flesh it out with a default set of features and associated feature values. These entry completion and multiplication rules may be edited by the user to exercise a fine degree of control over the output of the translation process. Figure 5.5 shows an entry completion rule which assigns the feature/value pair [AUX -] to verb entries which are unspecified for the feature AUX. Figure 5.6 shows a multiplication rule which generates a passive entry ([VFORM EN, PRD +]).

Add_AUX: …

Figure 5.5
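Entry completion and multiplication rules of the kind just described can be sketched as pattern/action pairs. The rule notation below is an assumption for illustration, not the system's actual format.

```python
# Sketch: an entry completion rule pads a matching entry with defaults;
# a multiplication rule spawns a new entry from a matching one.

def complete(entry, pattern, defaults):
    """Entry completion: if `entry` matches `pattern`, fill in any of
    `defaults` that are still unspecified."""
    if all(entry.get(f) == v for f, v in pattern.items()):
        for f, v in defaults.items():
            entry.setdefault(f, v)
    return entry

def multiply(entry, pattern, overrides):
    """Multiplication: if `entry` matches `pattern`, return the original
    plus a new entry differing in `overrides`."""
    if all(entry.get(f) == v for f, v in pattern.items()):
        return [entry, {**entry, **overrides}]
    return [entry]

verb = {"CAT": "V", "SUBCAT": "NP"}
# Add_AUX analogue: verbs unspecified for AUX receive [AUX -]
complete(verb, {"CAT": "V"}, {"AUX": "-"})
# passive analogue: a transitive entry also yields a passive variant
entries = multiply(verb, {"SUBCAT": "NP"}, {"VFORM": "EN", "PRD": "+"})
print(len(entries), verb["AUX"])  # 2 -
```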
Mult_PASSIVES: (_ [VFORM EN, PRD -] _) …

Figure 5.6

… lexical categories, each of which is compatible with those of the lexical …

To check the plausibility of a definition of a word, the user may at any time request the LDE to display all the frames that the word (given its definition) may fill. Each such frame is shown paired with the SUBCAT value of its slot and with the word taking the place of the slot marker. The grammaticality or otherwise of these instantiated frames assists the user in the task of judging the correctness of the word's definition. As well as providing an outright check of the correctness of a definition, syntactic frames can also be used to help the user choose between related but nevertheless distinct subcategorisation possibilities.

SFIN: They [...] that someone is something
OR:   They [...] someone to be something
      They [...] there to be a problem
OE:   They [...] that someone is something
      They [...] there to be a problem

Figure 5.7 Syntactic subcategorisation frames (abbreviated)
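The frame-display facility can be pictured as simple template substitution. The sketch below abbreviates the Figure 5.7 inventory; the function and the exact frame texts beyond those quoted are assumptions.

```python
# Sketch: instantiating subcategorisation frames for a word, with the
# word replacing the "[...]" slot marker, after Figure 5.7.

FRAMES = {
    "SFIN": ["They [...] that someone is something"],
    "OR":   ["They [...] someone to be something",
             "They [...] there to be a problem"],
    "OE":   ["They [...] someone to be something"],
}

def instantiate(word, subcats):
    """Yield (SUBCAT value, instantiated frame) pairs for the word."""
    for sc in subcats:
        for frame in FRAMES.get(sc, []):
            yield sc, frame.replace("[...]", word)

for sc, sent in instantiate("believe", ["SFIN", "OR"]):
    print(sc, ":", sent)
# SFIN : They believe that someone is something
# OR : They believe someone to be something
# OR : They believe there to be a problem
```

The user judges each printed sentence for grammaticality, which indicates whether the hypothesised SUBCAT value is correct.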
However, the initial automatic translation phase might in some cases assign incorrect SUBCAT values … in some formalism … For example, [SUBCAT OR] might be assigned when it should … The NP/NOPASS example above highlights one of the prob­lems: semantically bleached fra[mes] … transformational alterna[tions] … can interact with … and can break the frames: …

Figure 5.9 Syntactic frames applied to an incorrect definition of persuade

… grammaticality … even (as in this ex…) … phrasal … The grammar code system … but does not represent systematically whether a particular transitive sense has a passive … there are two types of …

The generator may be invoked on a word during the second, interactive phase of derivation …
Words such as believance in the affixed list generated from believe above represent morphological irregularity rather than an incorrect feature specification in the base form. The user may edit the surface form and definition of words produced by the generator in exactly the same way as words taken directly from LDOCE. In the case of believance, the surface form would be changed to belief, and its definition inspected and perhaps checked using the syntactic frames. When the new form is saved it is added to the derived lexicon as a new (non-productive) irregular entry. (The analyser can be run in a mode where non-productive entries are preferred to productive ones.)

Other non-existent forms in the generated list (such as cobelieve) may just be ignored, since they will not form part of the output of the system, although their definitions will still be implicit in the lexicon and morphology component of the analyser (because co- and believe will be there). It is assumed that this overgeneration is harmless, though, as such forms will not occur in actual input to the analyser.

5.5 Future developments

Although we find that the entries produced by the automatic translation phase of our system are on the whole satisfactory, given that they may be edited and checked with the syntactic frames and morphological generator, there is still much room for improvement. Our aim is to generate only syntactic definitions for derived entries, and we take this information from the obvious place, the LDOCE grammar code fields. However, there are other fields in a source entry that are potentially of some use. We believe that taking this information into account when assigning values to features would remove many of the shortcomings we have observed in derived entries.

One potential extra source of information is the box codes, which contain information about the type of subject and object a verb requires. For example, a verb like kick has the box code ___D__V_C which means that it has an animal or human subject and a concrete object (either animate or inanimate; see chapter 1). A verb such as die, on the other hand, … This appears … in their box codes (as …U) and can be used to specify the value of their AGR feature to be [V -, N +, BAR 2, PLU +].

Another source of further information might be the phonology field. For example, it is important to the word grammar that entries are marked as either latinate (LAT +) or not (LAT -), since some affixes attach only to latinate stems (for example -ity), whilst other affixes attach only to non-latinate stems (for example -hood). Since latinate stems tend to be longer than non-latinate ones, it might be possible for the system to hypothesise a LAT value for an entry on the basis of the number of syllables in its phonology field. The user could then confirm or reject this hypothesis.

5.6 Conclusion

Hand-crafting the entries in the substantial lexicons typically required by practical NL systems is often not feasible, and certainly never desirable. The evaluation of the LDOCE grammar coding system suggests that it is sufficiently detailed and accurate (for verbs) to make the on-line production of the syntactic component of lexical entries both viable and labour saving. However, the less than 100% accuracy of the assignments of codes in the source dictionary suggests that a system using LDOCE for lexicon development must embody a methodology allowing rapid, interactive and semi-automatic generation and testing of lexical entries on a large scale.

We have described a lexicon development environment, which embodies a practical approach to using an existing MRD for the construction of a substantial computerised lexicon. The system splits the derivation of target lexical entries into two phases: an automatic translation of the source data into definitions in the form required by the target lexicon, followed by semi-automatic correction and refinement of these definitions, using two tools which tap the user's judgment of grammaticality, into a set of fully checked base (and perhaps also irregular) entries.
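The phonology-field heuristic proposed in section 5.5 above (hypothesise a LAT value from syllable count, then let the user confirm or reject it) can be sketched as follows. The crude vowel-group syllable counter and the two-syllable threshold are assumptions made for illustration.

```python
# Sketch: hypothesising LAT from the number of syllables in a
# pronunciation string, with capitalised letters and '@' standing in
# for vowel phonemes.

VOWELS = set("aeiouAEIOU@")

def syllable_count(pron):
    """Count maximal vowel groups, one per syllable (illustrative)."""
    count, in_vowel = 0, False
    for ch in pron:
        if ch in VOWELS:
            if not in_vowel:
                count += 1
            in_vowel = True
        else:
            in_vowel = False
    return count

def hypothesise_lat(pron, threshold=2):
    """Longer stems are guessed latinate; the user then confirms."""
    return "+" if syllable_count(pron) > threshold else "-"

print(hypothesise_lat("gOlf"))      # - (one syllable: guess non-latinate)
print(hypothesise_lat("k@n'sId@"))  # + (three syllables: guess latinate)
```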
6 LDOCE and speech recognition

David Carter

6.1 Introduction

…

6.2 Constructing and using the hybrid lexicon

… and determining the distribution of the numbers of word candidates returned by queries containing whatever phonetic information the front end is assumed to provide.

If a front end were able to extract the full phonemic content of the input, lexicon lookup would be a routine process of access according to phoneme strings, and it would be unnecessary to allow for partial phonetic knowledge. The first reason that this is unrealistic is that the full phonemic … capabilities.

A number of studies along these lines have been carried out; among the best-known are Shipman and Zue (1982), Huttenlocher and Zue (1983), and Huttenlocher (1985). Typically, such studies use the transcription provided by a hypothetical front end to partition the entries in a machine-readable dictionary into equivalence classes, where all entries whose pronunciations are transcribed to the same symbol sequence are placed in the same class. Statistics are then computed based on the sizes of these classes: maximum class size, expected class size, percentage of singleton (one-member) classes, and so on. These statistics may either treat each word in the same way or be weighted according to word frequency.

For example, a manner of articulation transcription, which classifies … The hybrid lexicon was constructed from LDOCE and one version of the MRC dictionary database (Coltheart, 1981). Pronunciations were taken from the former, and word frequencies (those of Kucera and Francis, 1967) from the latter. These two types of information were specified for 12,850 words, which therefore constituted the hybrid lexicon.
The pronunciation field of an LDOCE entry consists of a string of symbols representing phonemes, stress values, and separators between alternative pronunciations. British (received pronunciation) and American pronunciations are separated by a double vertical bar when they differ; alternatives within one dialect are separated by commas. In addition, certain pronunciation symbols represent an alternation between two sounds; for example, the symbol /ɪ/ may be realised as either /ɪ/ or /ə/, and italic /ə/ may be realised as /ə/ or may be omitted altogether.

Only the British pronunciations were accessed in constructing the hybrid lexicon. Even so, the extraction of every alternative pronunciation would have been a non-trivial task; alternatives are typically not represented in full but only where they differ from the first pronunciation. To determine exactly which symbols in the first pronunciation are subsumed by the hyphen(s) in the second is not easy, and, especially where only a portion for the middle of a word is shown, appears to require phonetic or phonological knowledge. For the entry mistranslate, for example, the knowledge that s and z can substitute for one another in certain contexts would be needed to deduce that the final hyphen stood for /leɪt/ and not for, say, /sleɪt/ or /eɪt/.

In constructing the hybrid lexicon, therefore, only the first (and presumably most common) pronunciation for an entry was used. Where alternatives are generated by multiple-valued symbols such as /ɪ/, one … Stress values in LDOCE mark the beginnings of syllables with those stress values; the beginnings of unstressed syllables are not marked. However, some of the transcriptions studied in this chapter, and also the construction of the LDB, depend on the identification of all syllable boundaries.

Syllable boundaries are indicated for the written forms of the words, but these cannot reliably be applied to pronunciations, firstly because they are intended primarily to indicate hyphenation conventions rather than any directly phonological information, and secondly because of the complex relationship between English written and spoken forms. Therefore, in order to find syllable boundaries, a parsing algorithm was developed to group sequences of phonemes and stress markers into syllables, and within syllables, into onset (initial consonant cluster), peak (vowel group) and coda (final consonant cluster). The algorithm made use of the phonotactic constraints on British English pronunciation given in Gimson (1980). These constraints specify (a) the sequences of phonemes that can occur in each category: for example /l/ is a possible onset but /ml/ is not, so the syllable boundary in streamlined cannot … Under the Maximal Onset Principle (MOP), a boundary is placed as far to the left as possible in order to maximise the following onset. Thus for example the boundary in slipstream is placed after the /p/, even though no constraints would be broken by placing it after the second /s/.

In ambiguous cases like this it is often difficult to say where the syllable boundary really is. The adoption of the MOP does not represent a commitment to any strong theoretical position; rather, it was selected as a simple and quick decision procedure, and one that, moreover, does not discriminate explicitly between stressed and unstressed syllables, as would, for example, the principle that ambiguities should … prevent /ə/ from having either a null coda or … as coda. To cope with such cases, if no parse is possible without breaking a constraint, then the parse that breaks the minimum number of co-occurrence constraints is accepted, the MOP again being used in the case of a tie.
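The Maximal Onset Principle parse can be sketched as follows, with letters standing in for phonemes and a toy onset inventory in place of Gimson's full phonotactic constraints (stress markers and coda constraints are omitted).

```python
# Sketch: MOP syllabification. Each boundary is cut as far left as the
# onset inventory allows, maximising the following syllable's onset.

ONSETS = {(), ("s",), ("l",), ("m",), ("r",), ("t",), ("p",),
          ("s", "t"), ("s", "t", "r"), ("s", "l"), ("p", "r")}
VOWELS = set("aeiou")

def syllabify(phones):
    # find maximal runs of vowels: the syllable peaks
    peaks, i = [], 0
    while i < len(phones):
        if phones[i] in VOWELS:
            j = i
            while j < len(phones) and phones[j] in VOWELS:
                j += 1
            peaks.append((i, j))
            i = j
        else:
            i += 1
    sylls, start = [], 0
    for (_, end1), (start2, _) in zip(peaks, peaks[1:]):
        cluster = phones[end1:start2]   # consonants between two peaks
        cut = 0                         # leftmost legal cut = maximal onset
        while cut < len(cluster) and tuple(cluster[cut:]) not in ONSETS:
            cut += 1
        sylls.append(phones[start:end1 + cut])
        start = end1 + cut
    sylls.append(phones[start:])
    return ["".join(s) for s in sylls]

print(syllabify(list("slipstream")))  # ['slip', 'stream']
```

As in the text, the boundary in slipstream falls after the p, because str is the largest legal onset for the second syllable.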
Chapter 6 6.3 Transcriptions, equivalence classes and consistency classes
LDOCE and speech recognition
placed in the third linear search, this would take a time proportional to the square of the
case of a tie.The /6/ sound in together is therefore
Words like kvass, and errors such as that for starsh, cannot size of the lexicon, typically involving tens or hundreds of millions of
syllable.
be dealt with in this way, since /kv/ and /II/ are not English legal comparisons between symbol sequences and entries. For this reason,
what peaks and codas they combine with; however, to my knowledge, every lexicon study to date within this paradigm has
onsets no matter
used an assign and count procedure or something equivalent to it.
such recalcitrant cases are quite rare.
Unfortunately, however, assign and count only yields the same re-
sults as assign and lookup (and is therefore valid) for a certain class
equivalence classes and of transcriptions: those for which the symbol sequences resulting from
6.3 Transcriptions,
classes transcribing two input words are always either identical or inconsistent
consistency with one another. If the transcription is capable of yielding two non-
identical but consistent symbol sequences, such as (6.2), then assign
The partitioning of the lexicon into equivalence
classesincarried
orderoutto and count is invalid.
measure the effectiveness of a transcription is most easily
what I will call an assign and count procedure, as follows. Vowel LiquidOrGlide WeakFricative
by g
Stop Vowel LiquidOrGlide I (6.2)
transcribe each word in the lexicon according to the current tran-
(a) For if
scription; example, a manner transcription is
always applied to each seg-
ment, the words
golfand gu1f(and, for LDOCE, only those
words) will
add it to the equivalence class dened by the resulting sequence receive the symbol sequence given in (6.1) above. This reects the fact
(b)
of symbols; that the corresponding front end would be unable to distinguish those
two words from each other, but could distinguish them from all others
processing the whole lexicon, examine the equiva-
(c) nally, after in the lexicon. Assign and count will create a class containing only
lence classes for uniqueness, average size, or whatever quantities those two words, and assign and lookup will, when either of them is
are deemed appropriate. input, retrieve them and only them.
But non-identical, non-exclusive sequences will arise whenever it
does reect would hap-
However, this procedure not
In
Very directlywhat is not assumed that a front end will always transcribe a word in the
pen if a performed
front the
the front end transcription in question. In that case, the incoming signal, representing an uttered word, would receive a (partial) transcription, which would then be used to extract from the lexicon all those entries matching it. A procedure that reflects this process more directly is that of assign and lookup: for each entry in the lexicon, to:

(a) transcribe the pronunciation of the entry according to the current transcription;
(b) extract from the lexicon a class of all those entries whose pronunciations match the resulting symbol sequence;
(c) gather statistics from that consistency (not equivalence) class.

From a computational point of view, assign and count is much easier and cheaper to carry out. Each entry in the lexicon need only be accessed once, and the time taken is therefore roughly proportional to the size of the lexicon. For assign and lookup, however, step (b) involves searching the lexicon once for every word in it. Using an exhaustive search, assign and lookup is far more expensive; assign and count is equivalent to it only on the assumption that every utterance of a word receives the same transcription.

Such an assumption would, in fact, be unrealistic, given the variability in speech. Even discounting the possibility of errors, the amount of information the front end can extract from a given segment in a given word will vary between different utterances of the word. Suppose, for example, that the front end is able to make a manner transcription of every segment and that, in addition, it can recognise vowels uniquely on a random 50% of occasions. This means that if either golf or gulf is uttered, there is a 50% chance of two candidate words being retrieved from the lexicon, and a 50% chance of just one candidate being retrieved.

But the probability that assign and count will put golf and gulf in the same class for this transcription is not 50% but 25%. This will occur only when, in the simulation, a full (phonemic) transcription of the vowel is made for neither word. Thus assign and count will effectively predict, incorrectly, that the two words can be distinguished three times out of four.¹

Unfortunately, therefore, when there is any element of randomness or variability in the transcription, the more computationally tractable assign and count method is invalid; and any realistic simulation of a front end must involve variability. We therefore need to ask:
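The golf/gulf calculation can be checked with a short simulation. The sketch below is illustrative rather than the chapter's procedure: it assumes a hypothetical front end that always reduces consonants to a manner class and recognises the vowel phonemically on a random 50% of occasions, and a toy two-word lexicon.

```python
import random

# Hypothetical two-word lexicon and transcription scheme (not from the text).
LEXICON = {"golf": ("g", "o", "l", "f"), "gulf": ("g", "u", "l", "f")}
MANNER = {"g": "stop", "l": "liquid", "f": "fricative", "o": "V", "u": "V"}

def transcribe(phones, rng):
    out = []
    for p in phones:
        if MANNER[p] == "V" and rng.random() < 0.5:
            out.append(p)            # vowel recognised phonemically
        else:
            out.append(MANNER[p])    # only the manner class is recovered
    return tuple(out)

def trial_lookup(word, rng):
    """Assign and lookup: transcribe the uttered word, then retrieve
    every lexicon entry consistent with that transcription."""
    t = transcribe(LEXICON[word], rng)
    def consistent(phones):
        return all(s == p or s == MANNER[p] for s, p in zip(t, phones))
    return [w for w, ph in LEXICON.items() if consistent(ph)]

def trial_count(rng):
    """Assign and count: transcribe each entry once and group identical
    transcriptions into equivalence classes."""
    classes = {}
    for w, ph in LEXICON.items():
        classes.setdefault(transcribe(ph, rng), []).append(w)
    return classes

rng = random.Random(0)
n = 20_000
two_candidates = sum(len(trial_lookup("golf", rng)) == 2 for _ in range(n)) / n
same_class = sum(any(len(c) == 2 for c in trial_count(rng).values())
                 for _ in range(n)) / n
print(two_candidates)   # close to 0.5: lookup confuses the pair half the time
print(same_class)       # close to 0.25: count does so only a quarter of the time
```

The discrepancy between the two estimates is exactly the three-times-out-of-four error described above.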
6.4 Measuring classes

The second claim I set out to substantiate in this chapter is that measures based purely on class size, whether of equivalence classes or consistency classes, are not the most accurate guide to the power of a transcription.

The first point to note is that certain statistics that can be derived from equivalence classes are meaningless for consistency classes. These are what one might call class-based as opposed to word-based statistics. For example, in an experiment where equivalence classes are legitimate, there are two ways one might calculate the average class size. The first, class-based, way is simply to divide the total number of words in classes (i.e. the number of words in the lexicon) by the number of classes to get a mean class size. The second, word-based, way is to find the expected class size: the average, over all words in the lexicon, of the size of the class each word falls into. In the latter case, larger classes will be counted more often, and therefore the expected class size is always at least as large as the mean. Word-based statistics describe the performance of the putative front end more directly: they describe the classes that arise from the average input word. For this reason, word-based statistics can easily be biased to reflect differences in word frequencies: for example, to find the class size expected for a word chosen at random from a corpus (rather than from the lexicon) one merely weights the contribution of each class size according to the frequency in that corpus of the word giving rise to it. (Just as word-based statistics produce more pessimistic values than class-based ones, so frequency weighting also tends to increase class sizes. This is because more frequent words tend to be shorter, having fewer segments to be categorised and therefore often falling into larger, less discriminating classes.)

But my claim here is more radical than this: it is not (merely) that word-based statistics are superior to, and more widely applicable than, class-based ones, but that neither type, even when frequency weighted, is really appropriate.

Earlier we saw that Huttenlocher (1985) found that a manner of articulation transcription partitioned a 20 000 word lexicon into equivalence classes of at most 223 items. Such figures might lead one to conclude that if a manner categorisation can be performed, a major part (perhaps the major part) of the word recognition task has been accomplished. If, even in the worst case, the set of candidate words is only just over 1% of the original lexicon, is the problem not then 99% solved? The answer, we will now see, is that it is not.

6.4.1 Word counts and word frequencies

Measures based on class sizes are less than ideal for two main reasons. Briefly, these are that, even when frequency-weighted, they do not take proper account of the effects of word frequency; and that a logarithmic analysis is more appropriate than the customary linear one. In this section I attempt to substantiate these two claims and then present an alternative, information-theoretic measure that suffers from neither drawback.

Consider two possible equivalence class partitionings, A and B, of the same lexicon. In partitioning A, the words in a given class have widely varying frequencies; in the absence of conclusive clues from subsequent processing, the most frequent word in such a class has a high probability of being the correct one. By contrast, in partitioning B, all the words in a given class have similar frequencies, so none of them is outstandingly probable.

To get some idea of the magnitude of this difference, a frequency-ordered sublexicon was constructed from the 10 000 most frequent words in the hybrid lexicon described in section 6.2. The total frequency of, on the one hand, the most frequent word from each class in A and, on the other hand, the most frequent word from each class in B, was calculated as a fraction of the total frequency of all the words in the sublexicon. The two fractions were 0.846 and 0.164. In other words, one would expect the strategy of simply guessing that the uttered word is the most frequent in its class, without attempting any further analysis, to yield the correct result 84.6% of the time for partitioning A, but only 16.4% of the time for partitioning B.

Thus, other things being equal (i.e. ignoring possible differences in the contribution of subsequent processing), partitioning A is far more useful than partitioning B.
A guess of this kind is most reliable when one word in each class is outstandingly probable while the probabilities of the others are small. This will tend to occur when the frequencies of different classes are fairly uniform, as in A but not in B. The expected class size alone fails to capture this important factor.

These arguments apply to consistency classes as well as equivalence classes. A consistency class may consist of words of widely varying frequencies, like the equivalence classes of A, or of words of similar frequencies, like those of B; and again, the former distribution will be much more helpful.

6.4.2 Linear and logarithmic measures

If a transcription applied to a lexicon of 10 000 words gives an expected class size of 100, or 1% of the lexicon, then the contribution of the transcription towards identifying the word is more like 50% than 99%. This is because on average it reduces the candidate set by a factor of 100, leaving a further 100-fold reduction to be achieved in some other way. Similarly, a reduction from 10 000 to 1000 would represent a contribution of about 25%. Thus a more helpful figure than the expected class size is the value of

    (1 - \log ExpectedClassSize / \log LexiconSize) \times 100%        (6.3)

Formula (6.3) often sheds quite a different light on the usefulness of a transcription.

6.4.3 An information-theoretic approach

An alternative approach, with a more rigorous motivation, is that of calculating the amount of uncertainty, entropy or unextracted information present before and after the transcription is applied. The entropy of a lexicon L, where word w_i has probability p_i of occurring, and successive words are assumed to be independent, is

    U(L) = -\sum_{w_i \in L} p_i \log p_i        (6.4)

(all logarithms are to base two). Intuitively, the entropy is, roughly, the average number of bits that would be required to encode a word chosen at random from the lexicon according to the probability distribution, using the most efficient coding scheme possible. For example, if N is a power of two and all words are equally probable (so that p_i = 1/N for all i), then each word can be assigned a code of \log N bits. The expected remaining uncertainty after the transcription is

    U(L|T) = -\sum_i P_{C_i} \sum_{w_j \in C_i} q_{j|C_i} \log q_{j|C_i}        (6.5)

where q_{j|C_i} is the probability of word w_j occurring given that a member of class C_i has occurred:

    q_{j|C_i} = p_j / P_{C_i} if w_j \in C_i, and 0 otherwise        (6.6)

and P_{C_i} is the probability of any of the words in C_i occurring, i.e. P_{C_i} = \sum_{w_j \in C_i} p_j. The percentage of information extracted (PIE) by the transcription is then

    (1 - U(L|T) / U(L)) \times 100%
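Formulas (6.4) to (6.6) can be put to work on a toy example. The word probabilities and the two partitionings below are invented for illustration; the functions mirror the definitions above, with PIE computed as (1 - U(L|T)/U(L)) x 100.

```python
import math

# Toy lexicon: word -> probability of occurrence (invented figures).
probs = {"the": 0.4, "of": 0.3, "cat": 0.2, "dog": 0.1}
# Partitioning A mixes frequent and rare words within each class;
# partitioning B groups words of similar frequency together.
A = [["the", "cat"], ["of", "dog"]]
B = [["the", "of"], ["cat", "dog"]]

def U(words):
    """Entropy of the lexicon, formula (6.4); logarithms to base two."""
    return -sum(probs[w] * math.log2(probs[w]) for w in words)

def U_given(partition):
    """Remaining uncertainty U(L|T), formulas (6.5) and (6.6)."""
    total = 0.0
    for cls in partition:
        P = sum(probs[w] for w in cls)      # P_Ci
        for w in cls:
            q = probs[w] / P                # q_{j|Ci}
            total -= P * q * math.log2(q)
    return total

def pie(partition):
    """Percentage of information extracted by the partitioning."""
    words = [w for cls in partition for w in cls]
    return (1 - U_given(partition) / U(words)) * 100

print(round(pie(A), 1), round(pie(B), 1))
# A and B have identical class sizes, yet their PIE values differ.
```

The point of the example is exactly the one argued above: expected class size cannot distinguish A from B, while the entropy-based measure can.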
The measure also distinguishes, for the 10 000 word sublexicon, between the two partitionings of the previous section, because the value of -\sum P_{C_i} \log P_{C_i} is maximised, for \sum P_{C_i} = 1, when all the P_{C_i} are equal; this is more nearly the case for partitioning A than for partitioning B. Thirdly, the measure is clearly logarithmic; in fact, if all the p_i are equal and all classes are the same size, then it reduces to (6.3).

6.4.4 Case Study Two

Just as, earlier, we found that the methodologically incorrect assign and count procedure could give rise to misleading results, it is now time to investigate whether there are in practice significant differences between the predictions arising from class size statistics and from the theoretically more appropriate PIE statistic. It turns out that, again, there can be important discrepancies. (For a fuller account, see Carter, 1987.)

(Unstr = unstressed)

The PIE figures suggest that although stressed syllables are considerably more informative than unstressed if they are transcribed phonemically, there is no significant difference for a manner class transcription. This is in spite of the fact that the ECS figures derived from the same data would support Huttenlocher's conclusions.

Altmann (personal communication) suggests a possible explanation for the superiority of stressed syllables only for a full transcription. He hypothesises '...that this result is due to an uneven distribution of the 44 or so phonemes across stressed and unstressed syllables. For instance, one might expect to find all the vowel sounds in stressed position, but only some in unstressed position. Similarly for the distribution of consonant clusters. It would follow, I think, that a full phonemic transcription of the stressed segments would be more informative simply because of these uneven distributions.' If such an account is true, it provides an explanation for the increased informativeness of stressed syllables which does not depend on the fortuitous assignment of lexical stress to the (otherwise) most informative portion of the word. It could be that the difference in informativeness disappears for a manner transcription because phenomena such as reduction apply mainly in unstressed syllables. The figures he presents do not themselves allow this interpretation to be tested; but similar figures were obtained using an LDOCE-derived lexicon and equivalence
classes: a statistic to which PIE is, I have argued, superior. It is therefore necessary to return to the question of the relative informativeness of stressed, randomly chosen and evenly spaced segments, and derive PIE figures here too. In order to do so accurately, one must improve the treatment of stress in monosyllables and also introduce word frequency weighting.

Altmann (1986), and, so far, our Case Study One, treated monosyllabic words as always being unstressed. It is more realistic to assume that, for careful continuous speech, content (open class, lexical) words will be stressed, and function (closed class, grammatical) words will not. When this assumption is made and frequency weighting is introduced, the proportion of stressed segments in the LDOCE-derived lexicon becomes 42.4%, or almost exactly 3/7. Therefore experiment (iii) was adjusted to transcribe phonemically 42.4% of segments at random, whereas in experiment (iv), the effect of maximally even spacing was achieved by transcribing phonemically segments 7n + k, 7n + k + 2 and 7n + k + 4 in a word, for all integer n, where k is selected at random from 0, 1, ..., 6 for each word.

Under these conditions, the PIE values obtained (with the uniqueness figures from the table in 6.3.1 reproduced for comparison) are:

    Experiment           % Unique   PIE
    (i)   All mid         69.2%     94.3%
    (ii)  Stressed full   79.2%     97.3%
    (iii) Random full     78.7%     96.7%
    (iv)  Even spacing    78.7%     97.0%

The pattern of the earlier, uniqueness results is, perhaps coincidentally, repeated almost exactly here: there is no great difference between the results of (ii), (iii) and (iv). The conclusions are therefore also unchanged.

6.5 Summary, discussion and conclusions

In this chapter I have criticised two of the practices commonly adopted in assessing hypothetical front ends for large-vocabulary speech recognisers. The first practice, that of partitioning the lexicon into equivalence classes, was shown not in general to be applicable to the kinds of front end behaviour that one would wish to model; a more general notion, that of consistency classes, was introduced. It was noted that the experimental method needed to obtain consistency classes relies, in general, on the availability of a flexible LDB of the kind described in chapter 2. Further, it was shown that the inappropriate use of equivalence classes can, at least in principle, give rise to misleading conclusions about promising directions for front end design.

The second practice criticised was the use of class size statistics for assessing transcriptions. These statistics were shown both to be insensitive to variations in word frequency and, intuitively, to give an over-optimistic picture of a transcription's usefulness. A more appropriate measure, that of percentage of information extracted (PIE), was introduced and shown to suffer from neither of these drawbacks. It was also shown that PIE values, used with consistency classes, can lead to different and more reliable conclusions about front end design. Specifically, stressed syllables are more informative than unstressed only if they are transcribed phonemically, and not, as had previously been thought, if a manner transcription is made.

However, there are other assumptions and restrictions which we have not criticised and which characterise both the present study and most of the others discussed here. For example, further work is clearly needed to allow for the possibility of errors both in segmenting the signal and in transcribing segments. Within the paradigm assumed here, an error would effectively lead to a word not being included in its own consistency class.

Adda et al. (1987) attempt to minimise the likelihood of such errors by basing their transcriptions not on phonemes but on rough spectral features which are more directly represented in the speech signal and hence more likely to be extractable without errors. They report, for a 17 000 word French lexicon, a frequency-weighted ECS of 132 words for the spectral-feature transcription as opposed to 53 words for a manner transcription. If we approximate PIE by the formula (6.3) (which is the best approximation possible if only the ECS and lexicon size are known), these correspond to values of 50% and 59% respectively. Thus the use of a more realistically extractable transcription with the same number of symbols decreases the PIE significantly. Adda et al. make two further concessions to phonetic reality: they allow multiple phonetic representations of words to reflect lexical variability, and they assume that two neighbouring segments of the same spectral class will not be distinguished but will be transcribed as a single symbol. These two eminently realistic concessions increase the ECS from 132 words to 1111, decreasing the approximate PIE from 50% to only 28%, or less than half what it is for a manner class, segment-by-segment, single-pronunciation experiment. Early in this chapter we saw that Huttenlocher calculated an ECS of 34 words for a six-way manner transcription, representing the elimination, on average, of 99.83% of the vocabulary; but it now seems, ignoring any relevant differences between French and English (see below) and the fact that PIEs cannot be derived exactly from ECSs, that we can ex-
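Returning for a moment to experiment (iv): the maximally even spacing scheme described above can be sketched as follows. The value of k is assumed here to range over 0 to 6, so that exactly three of every seven segment positions are chosen.

```python
import random

def even_spacing_segments(n_segments, rng):
    """Experiment (iv): transcribe phonemically the segments numbered
    7n + k, 7n + k + 2 and 7n + k + 4 for all integer n, with k chosen
    at random for each word (assumed here to range over 0-6)."""
    k = rng.randrange(7)
    return [i for i in range(n_segments) if (i - k) % 7 in (0, 2, 4)]

rng = random.Random(1)
picked = even_spacing_segments(14, rng)
print(picked)   # exactly 3 of every 7 positions, spread as evenly as possible
```

Whatever k is drawn, the chosen positions fall into the residues k, k + 2 and k + 4 modulo 7, so the 3/7 proportion and the maximal spread are guaranteed.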
Notes

1  The same sort of discrepancies can in theory arise even without randomness or variability. For example, if segments in stressed syllables consistently receive a more detailed transcription than those in unstressed, then the golf/gulf problem can arise between pairs like refuse (noun) and revise (assuming that the stress pattern is not part of the transcription).

2  Huttenlocher treated all monosyllabic words as stressed. Consequently, transcribing only segments in unstressed syllables will lead to an unrealistically large ECS, because monosyllables, which comprise the majority of words in text, will all map onto a null symbol sequence. Recognising this, Huttenlocher presented expected and maximum class sizes for transcribing unstressed segments in polysyllabic words only, but omitted to do the same for transcribing only stressed segments, thus not allowing any direct comparison.

Chapter 7: Analysing the dictionary definitions

for the FRUMP system (DeJong, 1979), a system designed to achieve a high degree of robustness. The problem does not disappear when dealing with limited discourse domains of the type encountered in database query and expert system interfaces. This is because of the large number of synonyms and specialised words that can occur, and because of the difficulty of delimiting discourse domains exactly.

A different problem faced by designers of natural language understanding systems is how to provide for graceful failure of sentence analysis. There is thus the need to produce reasonable incomplete interpretations of sentences when complete analyses are not possible. This situation can occur because of gaps in the grammatical knowledge of the system or because the system is faced with extragrammatical input. This chapter shows how a possible solution to this partial analysis problem can be applied to the vocabulary problem in the context of large machine readable dictionaries.

More specifically, we will see how word sense definitions from the Longman Dictionary of Contemporary English (Procter, 1978; henceforth LDOCE)
Coping with unknown words, and robust phrasal analysis, can both be thought of as instances of a more general natural language interpretation problem. This is the problem of coping with incomplete knowledge of language use: lexical knowledge in the first case and knowledge of phrasal structure in the second. The unavoidable incompleteness of the knowledge of language use available to a language processing system means that trying to achieve robust natural language processing involves developing effective mechanisms for dealing with this problem. The research reported in this chapter is intended to be a contribution to this development effort.

The next two sections will discuss the kind of output that may be produced from processing dictionary definitions, and give examples of the results of processing LDOCE definitions produced by an implemented definition analyser. Some problems that were encountered are then discussed. Later sections motivate and explain the basic analysis algorithm, and then describe and illustrate details of analysis and structure building rules. Finally, some remarks are made about the performance of the current implementation and necessary further research.

7.2 Definition analysis

There are various possibilities for the kind of structures useful for language understanding that may be derived from dictionary definitions. These include meaning postulates (Carnap, 1952) expressed in some logic; constraints or semantic formulae based on semantic primitives (Katz and Fodor, 1963; Wilks, 1975b); and structures carrying information enabling the classification of the new word sense with respect to an existing classification of entities in a discourse domain. The word sense entries contain additional semantic information that could be combined, or used in conjunction, with the structures produced from processing word sense definition texts. This information is available as box codes that give selectional restrictions, and subject codes that indicate the typical discourse domain of word sense usage (these codes occur in the machine-readable version of the dictionary, but not in the published version).

LDOCE definitions are written using a restricted definition vocabulary, a rule that is largely respected throughout the definition texts. Some ways in which the definitions diverge from a strict interpretation of this rule are discussed later. It should be remarked here that the LDOCE restricted definition vocabulary has more in common with a basic English vocabulary than a set of semantic primitives. (A list of the words in the restricted definition vocabulary is given in an appendix to the published version (Procter, 1978) of the dictionary.)

If the output of processing LDOCE definitions was in the form of meaning postulates, then the logic expressions produced would have a new symbol for the word sense being defined, along with symbols corresponding to the senses of words in the definition vocabulary. Similarly, producing semantic primitive formulae would involve building new formulae by putting together formulae corresponding to the word senses of the definition vocabulary.

For the third possible form of output listed earlier, we need a (hand-coded) classification of the central senses of the definition vocabulary, together with a classification of concepts in the particular domain of discourse in terms of these word senses. The descriptions of implementations by Bobrow and Webber (1980), Mark (1981), and Alshawi (1987) show how such a classification can be organised and used during text processing. The LDOCE definition for a new word sense is processed using the mechanism described in this chapter in order to extract sufficient information for including the new word sense in such a classification. A language processing application that depended on a classification of concepts in the discourse domain should then be able to carry out its application task despite the occurrence of a new word in an input sentence.

Extracting the information necessary for classification will of course include locating superordinates in the definitions (which define the so-called ISA relation), as is done in the work reported by Amsler, as well as the kind of information presented in the next section. This way of dealing with unknown words in language processing applications still requires good solutions to the problem of choosing between alternative possible word senses (Walker and Amsler, 1986, have used the LDOCE subject codes for this purpose) and to the problems involved in the classification process (see Schmolze and Lipkis, 1983).
(For certain text processing systems, such as retrieval systems, such an approach may not be acceptable.)

7.3 Analysis examples

Examples of definition analyses are now given. Oddities in the semantic structures are often due to peculiarities of the current analysis grammar and output format, and I would not wish to argue for their correctness, especially in view of the problems discussed later. (The analysis system retrieves definitions from a lispified version of the LDOCE type-setting tape; items preceded by an asterisk are Lisp atoms corresponding to font control characters present on the type-setting tape (see Alshawi et al., 1985; Alshawi et al., this volume).) Figure 7.3 shows the analysis of a noun definition; note, for example, the information conveyed by the phrase sometimes used in hedges: HEDGE is capitalised because it is not part of the restricted definition vocabulary (but is defined in terms of this vocabulary elsewhere).
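The capitalisation convention can be illustrated with a short sketch: any word of a definition that falls outside the restricted vocabulary is written in capitals, as HEDGE is in Figure 7.3. The vocabulary fragment below is hypothetical; the real list runs to roughly two thousand words.

```python
# Hypothetical fragment of the LDOCE restricted defining vocabulary;
# the full list is an appendix to Procter (1978).
RESTRICTED = {"a", "type", "of", "small", "tree", "with", "hard",
              "wood", "sometimes", "used", "in"}

def mark_non_vocabulary(definition):
    """Capitalise any definition word outside the restricted vocabulary,
    mirroring the HEDGE convention in the analyser's output."""
    return " ".join(w if w in RESTRICTED else w.upper()
                    for w in definition.split())

print(mark_non_vocabulary(
    "a type of small tree with hard wood sometimes used in hedges"))
# a type of small tree with hard wood sometimes used in HEDGES
```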
(hornbeam)
(a type of small tree with hard wood, sometimes used in *CA HEDGE *CB s)

Figure 7.3

Verb sense definitions are, in general, infinitive verb phrases with adverbials (often prepositional phrases) and additional restrictions on the semantic class of agents and objects. Figures 7.4 to 7.6 give some examples.

((CLASS SEND)
 (OBJECT ((CLASS INSTRUMENT) (OTHER-CLASSES WEAPON)
          (PROPERTIES ((MODIFIER MODERN)))))
 (ADVERBIAL ((CASE INTO) (FILLER ((CLASS SKY))))))

Figure 7.4

(club)
(to beat or strike with a heavy stick (*CA CLUB *CB))

Figure 7.5

(bushy)
((of hair) growing thickly: a bushy beard / tail)

((CLASS PROPERTY)
 (PREDICATION ((CLASS GROW) (MANNER THICKLY)))
 (RESTRICTED-TO ((CLASS HAIR))))

(mug)
(undomesticated)

(overland)
(across land and not by sea or by air)

((MANNER ((CASE ACROSS) (FILLER ((CLASS LAND))))))

Figure 7.9

LDOCE definitions for lexicalised compound nouns and phrasal verbs are handled in exactly the same way as noun and verb definitions. Two examples of the structures generated for such definitions are given in Figures 7.10 and 7.11.

(roller coaster)
(a kind of small railway with sharp slopes and curves, popular in amusement parks)

((CLASS RAILWAY) (COLLECTIVE KIND)
 (PROPERTIES ((SMALL))))

Figure 7.10

(bring out)

Figure 7.11

7.4 Some problems

The current implementation is able to locate the heads, and the correct semantic cases, of dictionary definitions in most cases, although the examples above are untypical in the amount of additional information they recover from the definitions. Some quantitative remarks about the performance of the system are given later. This section briefly discusses a number of problems that were encountered while testing the implemented system.

In some respects the information conveyed by the output structures, being too closely tied to the surface definitions, only provides constraints for further semantic analysis. Perhaps the most important case of this is that the relationships implicit in compound nouns and certain prepositional phrase adverbials cannot, in general, be made more explicit without further interpretation apparatus (see, for example, Alshawi, 1987) beyond that available to the definition analyser. The phrasal context can, however, sometimes allow further specification of relationships implicit in prepositions, for instance derivation of PURPOSE from for in cases exemplified by the noun sense of launch (although, of course, errors can result from attempting to make relationships more explicit in this way). The actual words appearing in the semantic structures are, on the other hand, further disambiguated than might be assumed given the high degree of polysemy of many of the words in the restricted vocabulary. This is because the analysis process identifies the syntactic category of these words, and because of the LDOCE rule that only the most central senses of words from the restricted vocabulary should appear in definitions (but see the remarks below on phrasal verbs).

The fact that definition texts are often not analysed completely
means that information present in a definition is sometimes not recovered, as illustrated by the following example. In this case the usual purpose of nails is not recovered.

(nail)
(a thin piece of metal with a point at one end and a flat head at the other, for hammering into a piece of wood, usu. to fasten the wood to something else)

A problem related to the one just mentioned is that only the simplest forms of definition are handled. Difficulties are also caused by phrasal verbs formed from words of the defining vocabulary: the occurrence of look after and bring up complicates the analysis of the following sense definition for foster.

A definition may also refer to another sense of the same homograph word, for example the sense defined immediately before it. In such cases the system produces a structure containing the special symbol *previous-sense*; an example is the definition shown in Figure 7.14.

(the wood of this tree)

Figure 7.14

7.5 Phrasal analysis hierarchies

The analysis method depends on a hierarchy of phrasal analysis patterns in which more specific patterns appear below more general ones. The way in which the analyser applies the hierarchy of patterns is as follows. Starting at the top of the hierarchy, a pattern is matched against the input definition. If the match with this pattern succeeds,
then a match is attempted with each of its daughter patterns (i.e. the more specific forms of this pattern placed immediately below it in the hierarchy). This procedure is repeated recursively, so that we end up with the most specific patterns that match the definition. Analysis rules have the form shown in Figure 7.15:

(<rule identifier> (phrasal pattern) (daughter identifier) ...)

Figure 7.15

The rule identifier also appears in semantic structure building rules. These two types of rule are kept separate in order to allow different kinds of output structures to be generated for the same analysis grammar. Building semantic structures is basically a simple process of fleshing out templates provided by the semantic structure building rules, using variable bindings generated by the matching algorithm. The following section gives some examples of analysis and structure building rules, explaining the notation in which the phrasal patterns are written. The notation currently provides a limited number of facilities, but it should be clear that these facilities could be extended in various ways while remaining within the overall framework of applying a hierarchy of patterns.

7.6 Analysis rules

Here n-110 is a daughter of n-100, and n-135 is a daughter of n-130 (not shown). More mnemonic identifiers for these rules might be Noun-Phrase, NP, and NP-with-relative. In the phrasal pattern part of these rules, the initial n restricts the pattern to matching definitions for senses with lexical category noun. The other pattern elements match zero or more items in the input, subject to restrictions in terms of lexical features. Numbers at the end of pattern elements simply distinguish different occurrences of elements with the same properties. Examples of pattern elements and what they can match are given in Figure 7.17.

    +noun     exactly one noun
    +0det     zero or one determiner
    +1noun    one or more nouns
    *0adj     zero or more adjectives
    *0pp      an arbitrary segment of input words
    for       the word for

Figure 7.17
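A minimal sketch of pattern matching in the spirit of Figure 7.17 follows. The element encoding, the matcher and the example tokens are all hypothetical simplifications of the analyser described above: each element is tried longest-match-first with backtracking, and the bindings it returns are the raw material for structure building.

```python
# Hypothetical element forms: ("+", cat) exactly one item of category cat;
# ("+0", cat) zero or one; ("+1", cat) one or more; ("*0", cat) zero or
# more; ("word", w) the literal word w.
def match(pattern, tokens, bindings=None):
    """Return a list of (element, matched tokens) pairs, or None."""
    if bindings is None:
        bindings = []
    if not pattern:
        return bindings if not tokens else None
    (kind, val), rest = pattern[0], pattern[1:]
    if kind == "word":
        if tokens and tokens[0][0] == val:
            return match(rest, tokens[1:],
                         bindings + [((kind, val), tokens[:1])])
        return None
    lo, hi = {"+": (1, 1), "+0": (0, 1), "+1": (1, None), "*0": (0, None)}[kind]
    limit = len(tokens) if hi is None else min(hi, len(tokens))
    for n in range(limit, lo - 1, -1):          # longest match first
        span = tokens[:n]
        if all(cat == val for _, cat in span):
            result = match(rest, tokens[n:],
                           bindings + [((kind, val), span)])
            if result is not None:
                return result
    return None

# A made-up definition fragment, tagged as (word, category) pairs:
tokens = [("a", "det"), ("small", "adj"), ("railway", "noun"),
          ("with", "prep"), ("sharp", "adj"), ("slopes", "noun")]
np_pattern = [("+0", "det"), ("*0", "adj"), ("+1", "noun"),
              ("word", "with"), ("*0", "adj"), ("+1", "noun")]
for element, span in match(np_pattern, tokens):
    print(element, [w for w, _ in span])   # each element with its words
```

The bindings printed at the end are what a structure building rule would use to flesh out its template.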
(predication (object-of (class deceive)))

The last element is an example of an element with subsidiary patterns; it is matched using a rule associated with the subsidiary pattern, as mentioned earlier. One subsidiary pattern of *0ppmod, for example, is (for *0pp), and a subsidiary pattern passive-pred, (be +ven), is used for passive verb constructions. Pattern elements with such subsidiary patterns allow for recursion, in much the same way as conventional context free phrase structure grammars, and for a more compact set of patterns. Under this interpretation of pattern elements, the phrasal patterns of rules n-110 and n-135 will match the set of definitions exemplified earlier, the more specific of the two being tried only if the more general patterns above it succeed.

The semantic structure given earlier for the definition (a foolish person) was produced by fleshing out the matching rule's template, leaving out substructures associated with uninstantiated pattern elements, and applying this structure building process recursively.

Figure 7.18

There is also an optional further stage which applies transformations specified as attached procedures associated with items in the structure building rules, for example the item predication. This phase gives greater freedom than would be possible by the use of structure building templates alone; for example, it allows moving items (such as those indicating negation) upwards out of substructures.

The analysis algorithm follows all paths from successful matches. This does not lead to inefficiency, because it is rare for several disjoint paths to be followed successfully down the pattern hierarchy, and because the implementation maintains a well-formed substring table to avoid a certain amount of redundant computation. Alternative analyses arise when there is more than one most specific successful analysis rule. At present one such analysis is chosen by an over-simplistic heuristic that basically prefers analyses accounting for more words of the input definition.

7.7 Performance remarks

Some definitions involve complex cross references, so these are not taken into account in the figures given below. The results of the test were as follows. The semantic head was identified correctly in most cases, although often little further structure was recovered. Thus only identifying the head is much more typical than might be suggested by the examples given earlier for illustrative purposes, and it is worth
mentioning that the development effort was only a few man-months for each of the program and grammar, which, compared with other natural language processing systems we have developed recently at Cambridge, represents a relatively small effort.

7.8 Conclusion

The work carried out so far seems to suggest that dictionary definitions can be analysed with a reasonable degree of success using hierarchies of phrasal patterns, but it still remains to be demonstrated that this technique can enable an actual natural language application system to cope effectively with unknown words. Although dictionary definitions exhibit a rich variety of forms, these are mostly variations on a manageably small number of basic forms,

...improved earlier drafts of this paper.
Chapter 8: Meaning and structure in dictionary definitions

8.1 Introduction

The basic methodology for the project is as follows: first, an appropriate grammatical coding is applied to the words of the restricted vocabulary and their inflected forms. This coding is then automatically inserted in all of the meaning descriptions, the outcome being a grammatically-coded corpus of meaning descriptions. Subsequently, a syntactic typology is developed for the structures of the meaning descriptions of each of the major parts of speech (POS): nouns, verbs,
8 8.2 Preparing meaning descriptions for examination
definitions Chapter
and structure in dictionary
Meaning
the book section
Apply in 1.2.1 for brief description).
see a
for each of them.
and adjectives, resulting
in parser-grammars
Applying these grammars to the corpus should then lead to syntactically analysed meaning descriptions in which it is possible to systematically identify premodifiers, kernels, postmodifiers, etc. In addition (and partly parallel to the syntactic typology) a semantic typology is being developed. Incorporating both these typologies into a relational database system should ultimately make it possible to trace the horizontal and vertical links between words: showing what the hyponyms or hyperonyms of a given word are, what kind of properties are involved in their pre- or post-modification structures, etc. The expectation is that a database of this kind can be very useful in (semi-)automatically selecting, for the words of a text, only those meanings that are relevant within a particular context. Imagine we are reading a report on a surgical operation; in one of the first sentences we are told that 'the instruments were ...'. Which meanings do we have in mind? A database of the kind envisaged here could restrict attention to meanings from the relevant (here: medical) subject field.

In this chapter we report on the tagging of the meaning descriptions with POS-codes (section 8.2), on a typology of the meaning descriptions of nouns (section 8.3), and on their semantic (section 8.4) and syntactic (section 8.5) characteristics.

8.2 Preparing meaning descriptions for examination

In order to get a complete picture of the semantic content of LDOCE we have created files containing, for each MD, the following information:

- the orthographic form of the entry;
- the part(s) of speech of the entry (on tape, not in the book; see book section 1.2.1 in chapter 1 for a brief description);
- the subject field codes, specifying the sociolinguistic register and the domain to which a sense is restricted (on tape, not in the book; see book section 1.2.1 in chapter 1 for a brief description);
- the meaning descriptions (also see Appendix E).

These are input to a pattern matching program developed at the institute for computational linguistics of the University of Amsterdam. Using it we can look for any sequence of words in the meaning descriptions, but also for POS-codes, subject field codes, etc., or combinations of these.

As stated by Paul Procter in the General Introduction, the meaning descriptions are formulated in terms of a controlled vocabulary (CV) of easily understood words and their derivatives. We counted approximately 2200 words in the list in the back of the book. Since this seemed like a manageable number, we felt that the CV provided a unique opportunity to pre-edit the corpus of meaning definitions relatively cheaply: we manually tagged the CV words with their possible part(s) of speech and automatically inserted the codes into the MDs. A morphological analyzer was used to treat inflected and derived forms.
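A minimal sketch of what such a pattern matcher does follows. This is not the Amsterdam program itself (that program is described in van der Steen, 1982, cited in the notes); the token and pattern representations are assumptions made for illustration:

```python
# Illustrative pattern matcher over coded meaning descriptions:
# a pattern is a sequence of tests, each of which may constrain
# the word, its POS code, or both.

def matches(pattern, tokens):
    """tokens: list of (word, pos) pairs; pattern: list of dicts
    with optional 'word' and 'pos' keys. Match at the start of
    the token sequence."""
    if len(pattern) > len(tokens):
        return False
    return all(
        tok[0] == p.get("word", tok[0]) and tok[1] == p.get("pos", tok[1])
        for p, tok in zip(pattern, tokens)
    )

md = [("a", "Det"), ("doctor", "N"), ("who", "Pron"), ("gives", "V")]
# Look for MDs beginning Det + Noun + the relative pronoun "who":
pattern = [{"pos": "Det"}, {"pos": "N"}, {"word": "who"}]
found = matches(pattern, md)
```

The same mechanism extends naturally to searching over subject field codes by adding a third element to each token tuple.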
'
codes. Al thou g h indenite article and letter. In our CV list its POS specication reects
forms etc.), w hich recei 'v ed appropriate
P at ticiple '
i
still
'
was
more
caused by several reasons: ambiguous. Tagging them with their POS listed in LDOCE, therefore,
meaning descriptions). is very effective. The second step in tagging reduced the number of
CV: uncoded tokens from over 34 000 to slightly less than 13000
i. not listed in the
which are
(among
the nationality
.
n names).
bad-tempered;neverending
.
-
n
descriptions of the two examples
unabl e emerging previously given look like the following:
'
' .
patience
'
,
derivatives: ,
_
Words marked references: as cross Entry Meaning Description
,,
tinderbox
.
.
anaesthetic; .
anaesthetist
amoj
n. u -
a
u.
doctormg]Wh0[p0l givesiv
ASia , African ammo]anaestheticimmmxx]
other names
bylllanotherlpopo]doctorlNOI"
- I
'
(L
1!.
cricket 11.
similar; however; everybody ,
others- Specialist 8.[Dg]
doctorlNo]whom givesle]
containing special symbols treatmentho] inUU]alDUI particuIarle waleOJ
ii. strings Grim] tong-m]certainleDg}kindsle] [10} 0
Because the words marked as cross references were assumed to be synonymous with their targets, it turned out that many of them would in effect be ordinary entries. Having available the POS of any word listed as an entry, we even considered the possibility of recoding all the words in the MDs this way. However, for our first step we decided to work with the CV-list instead: looking at the tagging results, the number of different items in the whole CV proved sufficient for our purpose.

The next step involved making an extensive inventory of sequences of words with particular POS-tags and the expression of regularities in these sequences in the form of grammar rules. Simultaneously, the different structural properties were evaluated extensively with regard to the semantic implications they embody.

8.3 Nominal meaning descriptions: a syntactic-semantic typology

In what follows we will mainly be concerned with the Meaning Descriptions (MDs) of nouns.
The patterns can be considered on the following levels:

a. word sequence:  a man who looks ...
b. POS sequence:   Det Noun RelPronoun Verb ...

i. Entry     Meaning Description
   flamingo  a tall tropical water bird ...   (kernel: bird)

We are only interested in distinguishing types of these patterns: a typology of the structures of nominal MDs. Much more of the semantic weight that the phrase structure expresses is embodied by the kernel, as will be revealed below. There are, however, also other types of kernel that are rather meaningless in comparison with the rest of the MD.

8.4 Semantic characteristics of nominal meaning descriptions

From a semantic point of view it soon becomes apparent that the distribution of semantically relevant information often cuts across the various syntactic structures, as in 'a storied building' or 'the front part of the body below the chest'.
A frequent type of structure is the of-complement with a noun or an NP. Where the noun of the complement refers directly to an entity, the structure is straightforward; but the nominalisation may also refer to a phrase with a non-nominal structure: a VP. MDs with such a structure will be called Shunters. Their kernels do not refer to entities in the world to which the entry itself belongs; the information is expressed by the verbal or adjectival kernel of the underlying VP:

Entry          Meaning Description
adornment      an act which consists of adorning
closing price  the price of the shares of a firm, business, etc. ... on the stock exchange
detonation     the noise of ...

It is not possible, however, to paraphrase these MDs the way Linkers are paraphrased; rather we could paraphrase them with 'consists of'.

Something like shunting occurs also in structures that consist of a very general, uninformative kernel immediately followed by a relative clause:

vi. Entry      Meaning Description
   camper      a person who camps
   cultivator  a person who cultivates
   destroyer   a person who destroys

These MDs are formally the same as the MDs containing Linkers. The difference is that the kernels contribute as much information as ordinary Links: one can hardly say that their major function is only to relate the complement to the entry. If we speak in terms of semantic weight, the kernel and the complement are more or less in balance. Therefore, we treat these kernels as ordinary Links, with this addition: equal weight should be given to the contribution of the of-complement. Another type of structure in which of-complementation is involved can be found in the following examples:
vii. Entry         Meaning Description
    acting         the art of representing a character, esp. on the stage or for a film
    angling        the sport of catching fish with a hook and line
    arrival        the act of coming
    beauty queen   the winner of a beauty competition
    admixture      a substance that is added to another in a mixture
    anthropophagy  the eating of human flesh by other human beings
    aspiration     the pronunciation of the letter h
    leader         a person who guides or directs a group, movement, etc.

Such an Argument may in turn have the role of Experiencer or Theme (in the terminology of Dik, 1978a) of the relation expressed. Because of the fairly proportional semantic content of the MDs in examples (vii.a) and (vii.b), the distinction between Links, Linkers and Shunters for MDs with an of-complement or a relative clause (i.e. the decision whether some item is sufficiently general or vague in meaning to function as Linker or Shunter) is not always easy to draw.
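A rough sketch of how such a typology might be applied mechanically is given below. The word lists and the one-word-MD test are illustrative assumptions, not the authors' actual decision procedure:

```python
# Classify a nominal MD as Link / Linker / Shunter / Synonym from
# its kernel. The membership lists are toy stand-ins for the
# generality and derivational criteria discussed in the text.

LINKERS = {"kind", "type", "piece", "part"}      # very general kernels
SHUNTERS = {"act", "state", "quality", "noise"}  # deverbal/deadjectival

def classify(md_tokens, kernel):
    if len(md_tokens) == 1:
        return "Synonym"          # one-word MD, likely a synonym
    if kernel in SHUNTERS:
        return "Shunter"
    if kernel in LINKERS:
        return "Linker"
    return "Link"                 # ordinary hyperonym kernel

kinds = [
    classify(["the", "act", "of", "annoying"], "act"),             # annoyance
    classify(["a", "tall", "tropical", "water", "bird"], "bird"),  # flamingo
    classify(["patience"], "patience"),                            # one-word MD
]
```

Run on the chapter's examples, annoyance comes out as a Shunter, flamingo as an ordinary Link, and a one-word MD as a Synonym.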
Besides these considerations there are also other, more concrete and verifiable indications. Inspection of large samples of MDs reveals that certain words function as kernels far more often than others. The following diagrams show the frequency of occurrence of the various types:

[Diagram: frequency of use (low / med / high) of the kernel types: Synonyms (ii), (iv), (vii.b); Linkers (iii); Shunters (v), (vi); with '+' and '-' marking the typical frequency band of each type.]

[Table: frequency of common kernel words (act, state, sport, ceremony, part, piece, type, noise, lungs, dog, bird, drink, person, something, substance) as token, as kernel in NMDs, followed by 'of', and followed by a relative pronoun; e.g. act occurs 1505 times and person 3523 times as kernel.]

8.5 Syntactic characteristics of nominal meaning descriptions

As one might expect, the language used in LDOCE (and in dictionaries in general, we assume) turns out to be a restricted subset of English in terms of the structures used. On the one hand, the MDs contain little variety of structure; on the other hand, there is a predominance of certain specific structures.

8.5.1 Disturbing elements

At least two features of the MDs are unique: the use of all kinds of meta symbols (words like old use, rare, fml), sometimes in brackets, and the frequent insertion of bracketed parts at any position in the MD. As for the meta symbols, we found that they will not hinder us: all types of kernels and their features can still be found. Some entries that can be both noun and adjective had in fact two different MDs rolled into one by the use of brackets.
Bracketed parts also occur within the pre-kernel part, and in that case they interrupt the structure of the NP. MDs containing bracketed parts in initial or pre-kernel position will therefore be analysed separately.

Yet another special case are the idioms (we coded them EXPR, for expression). 4281 of a total of 6501 strings tagged EXPR occur in initial position, like this one:

Entry  Meaning Description
abode  of/with no fixed abode (EXPR) (META: law) having no place as a regular home

These special elements do not immediately discriminate between Links, Linkers, Shunters and Synonyms. To detect the kernel we necessarily have to analyse all of the text that precedes the kernel and the first word that marks the end of the kernel. Therefore we have developed a partial grammar of NMDs that analyses them into a Determiner component, a Premodifier component, the Kernel, and the first word that marks the beginning of the Postmodifier component.
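The partial grammar just described can be sketched as a simple splitter. Treating the last word before the first postmodifier marker as the Kernel is an assumption made for illustration, not the authors' exact rule:

```python
# Split a POS-tagged NMD into Determiner, Premodifier, Kernel and
# Postmodifier components. The kernel is taken to be the last
# token before the first postmodifier marker ("of", relative
# pronouns), which marks the beginning of the Postmodifier.

POSTMOD_MARKERS = {"of", "who", "which", "that"}

def parse_nmd(tokens):
    """tokens: list of (word, pos) pairs; returns the components."""
    det, rest = [], list(tokens)
    if rest and rest[0][1] == "Det":
        det = [rest.pop(0)]
    cut = next((i for i, t in enumerate(rest) if t[0] in POSTMOD_MARKERS),
               len(rest))
    pre_kernel, postmod = rest[:cut], rest[cut:]
    kernel = [pre_kernel.pop()] if pre_kernel else []
    return {"det": det, "premod": pre_kernel,
            "kernel": kernel, "postmod": postmod}

parts = parse_nmd([("a", "Det"), ("tall", "Adj"), ("tropical", "Adj"),
                   ("water", "N"), ("bird", "N")])
```

On the flamingo example this yields Det = a, Premodifier = tall tropical water, Kernel = bird, and an empty Postmodifier.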
This MD contains a kernel ('body' or 'body parts') which is composed of two coordinated kernels, of which the second ('body parts') is a compound, followed by an of-complement containing an NP with a kernel composed of two coordinated kernels ('person' or 'animal').

Exact figures cannot be given before we have analysed all MDs. Therefore, we can only give figures for the MDs with structures that can be found by the pattern matching program. For the total number of 41 034 MDs we get the following distribution:

Structure                    Number   Percentage of total
Det (premod) Kernel (postmod)  23 810   58%
(premod) Kernel (postmod)       5 006   12%

[Table: types of determiner found, with Total 23 810.]

Since only a very limited number of MDs (489) fall outside this part of the grammar, or also have codes for other parts of speech, we decided to skip these MDs for the time being: the grammar has a determiner component that only deals with a, an, any of, or the. Skipped MDs will be analysed afterwards. The MDs containing any of often have a special structure, 'any of several types of', which very frequently occurs in MDs of animals, for example:

Entry   Meaning Description
gannet  any of several types of ...

We present some figures that give a slight indication of the distribution of the finer types of structure which came up during that work. For that purpose we will only look at the first two groups of MDs: Det (premod) Kernel, and (premod) Kernel. It is not always possible to decide whether the kernel was preceded by a modifier or not, since words following the determiner can be noun or adjective, for example:

Entry  Meaning Description
drink  a liquid suitable for swallowing
Although many of these MDs appear to be pre-modified, the number of MDs without pre-modification is still twice as high. A finer breakdown gives:

Type of structure          Number   Percentage of total
Det 'pure' Noun            14 212
Det Noun relative clause    1 993   14%
Det Noun of-complement      5 864   41%

The of-complements account for a significant percentage, which suggests that the claim that dictionary definitions characteristically involve hyponym/hyperonym relationships should be reconsidered. Notice, finally, that the number given for compounds in the above diagram should not be taken too seriously at this stage: this number may well turn out to be larger after more detailed analysis of structures which at present appear to be ambiguous.

Meaning descriptions without determiner component

MDs without an initial determiner occur less frequently. Again we can make the same subdivision between MDs that have premodification and those that have not. Looking only at the first word of the MD, the following picture emerges:

First word                      Number   Percentage of total
Noun                             2 648   53%
Noun or Adjective (ambiguous)    1 338   27%

Many of the ambiguous MDs consist of one word only: 34%, or a total number of 455; many of them could therefore be Synonyms of the entry. If we look at the finer distribution of those MDs that unambiguously start with a noun (total number 2 597), we again see the influence of those Synonyms on the numbers found: in many cases the MD consisted of one word (which could function as a noun), and such cases were marked as cross references by the Longman lexicographers.

8.6 Conclusion

The figures presented in the previous section give only a rough indication as regards the distribution of the various types of NMDs in the dictionary. Nevertheless, we have been able to show that next to the ordinary structure of meaning descriptions (MDs) of nouns, in which the syntactic kernel is a regular hyperonym of the entry (Link), there are at least three other types of structures based on their kernels (Linkers, Shunters and Synonyms) that make up a large part of the dictionary. Although it may be difficult to explain these structural properties, for
MDs containing a Shunter a fairly plausible explanation can be given: they describe the meaning via the most informative element in the MD, which is somehow derived from the entry itself. Often the complement of the Shunter is the verb, noun or adjective from which the entry is derived:

Entry      Meaning Description
annoyance  the act of annoying
accuracy   the quality of being accurate
alchemist  a person who studied or practised alchemy

The large number of MDs having a Shunter as a kernel thus seems to point to a habit of dictionary makers to include many derivatives as ordinary entries, even though they have a systematic, predictable meaning.

Not all of the MDs can be handled straightforwardly in a parser: problems arise with idiosyncratic, non-transparent and ambiguous structures, though not all are in all circumstances equally problematic. A common problematic case has, in fact, already been mentioned: the words that can either be noun or adjective (for example, liquid). Ambiguity of structure mainly refers to the interpretation of coordination, which can occur on all levels of the MDs. The analysed MDs will be input to a semantic analysis that must deal with problems such as multiple senses of the words in the MD and circularity of MDs.

8.7 Notes

1 'Links in the lexicon' is a project supported by the Netherlands organization for pure research (Z.W.O.) for a period of three years from 1st February 1986, carried out at the English Department of the University of Amsterdam. Participating research assistants: Marianne den Broeder, ... van den Hurk and ... Vossen.

2 Information on the program can be found in van der Steen (1982).

3 All examples in this chapter are from the Longman Dictionary of Contemporary English (Procter, 1982).

4 Some of the words that could not be coded using the CV list, with their frequency in the meaning descriptions:

Word (initial word)  Related CV word(s)  Frequency in MDs
advantage            advance (?)          149
agreement            agree                233
aircraft             air and craft        239
argument             argue                173
This variation in the output is not problematic as long as we are dealing with large numbers (the differences in output never varied by more than 1 during our sessions with the program). More precise figures will be available when we have analysed all definitions.

Chapter 9
A tractable machine dictionary as a resource for computational semantics

9.1 Introduction
An important difference arises here with those logic-based approaches [...]. In recent years the focus of work in natural language processing (NLP) has shifted from describing language as a product to describing it as a process, under the influence of human memory research; but the shift is largely cosmetic, since the database approach is still applied [...] front ends. A recent paper, in spite of its title ('... Episodic Memory for the Integration of Syntax and Semantics'), is in fact very much a return to early CD concerns.

A further difference arises between so-called subsymbolic approaches within connectionism (Smolensky, 1987) and those usually called localist (for example, Cottrell and Small, 1983; Waltz and Pollack, 1985). This difference, which is as yet in no way settled, bears very much on the subject matter of this chapter: in the subsymbolic approach to computational semantics there would be no symbols corresponding to word senses; they would be simply different aspects of a single non-symbolic representation and would correspond (if to anything) to nothing discrete at all. That is not to deny that, in certain cases, [...]. Our aim in the present chapter is to produce a large-scale database for doing computational semantics, and to describe its utilisation (including Fass's work on Collative Semantics, described in some detail in section 9.3.2). The difference between these two notions of connectionism, as they apply to issues of word sense for computational semantics, is precisely the issue that is set out in the next section of the chapter as the key issue in computational semantics at the moment.

Given that the purpose of dictionaries is to provide definitions of words and their senses, it might well be expected that, of all forms of text, it would be in dictionaries that the semantic structure of language would be the most explicit and hence the most accessible for examination and comparison with the semantic structure of knowledge representations. And indeed, the semantic structure of dictionaries has been analysed and compared to the underlying organisation of knowledge, and similarities have been observed. Dictionary entries commonly contain a genus and differentia, and the genus terms of dictionary entries can be assembled into large hierarchies (Amsler, Byrd, and Heidorn, 1985). Likewise, in the study of knowledge representation, a semantic network is viewable as a hierarchy, and common principles seem to underlie both. Those principles guide those faced with the problem of knowledge acquisition: extracting information from language text and using that information to build new knowledge structures or update existing ones. As evidence for this convergence, it is notable that issues that must be addressed regarding computational semantics are also issues in knowledge acquisition and computational lexicography.

Can computational semantics be based on such a notion as the word sense? The answer given in this chapter will be yes. Many people have claimed that there never was lexical ambiguity until dictionaries were produced in roughly the form we have them now: on this view lexical ambiguity is a product of scholarship, a social product. Translation between languages, as well as more mundane understanding tasks, had been carried out for millennia before such scholarly products existed and therefore cannot require them. A certain kind of cognitive psychologist may find this position very much to their taste: for them the idea of lexical ambiguity, never found outside formal representations, is a peripheral phenomenon, with subscripted symbols such as play1, play2, etc., designating disjoint classes of things (as Wittgenstein [...]). The subscripting procedure itself offers no real problem. The real problem faces those of us who want to construct a database: different dictionaries distinguish different numbers of senses for a single word; worse, those different segmentations of usage into discrete senses may not even be inclusive of one another, for sometimes different dictionaries segment usage into quite different senses. Dictionaries could, in principle, be tuned to a given sense segmentation (though it might serve no practical purpose), just as people can be if exposed to the same dictionaries.
9.2 The extraction of semantic information from LDOCE

The preparers of LDOCE claim that the entries are defined using a controlled vocabulary of about 2000 words and a simple and regular syntax, and that the entries also use another special set of primitives organised into a hierarchy: the subject codes (which we will call pragmatic codes, to avoid confusion with the controlled vocabulary). The hierarchy consists of main headings such as ENGINEERING and subheadings like ELECTRICAL; these are used to classify word senses, so that, for example, one sense of current is marked ENGINEERING/ELECTRICAL while another sense carries a different code. We will refer to the words of the controlled vocabulary as primitives and to all other words in LDOCE as non-primitives.

Figure 9.1 shows some basic data derived from our analysis of the machine-readable tape of LDOCE (because of a tape error, words that follow zone alphabetically have not been analysed). The published list of the controlled vocabulary contains 2219 words. We have removed the 58 prefixes and suffixes that are listed as primitives, and have also removed 35 primitives that did not have heads. Furthermore, the analysis shows that some words are used frequently in definitions and yet are not part of the controlled vocabulary: for example, the word aircraft is not part of the controlled vocabulary, yet it is used 267 times in sense definitions. About thirty such words have been added to the list of primitives, giving 2166 primitives.

The interesting thing to note from Figure 9.1 is the extremely high number of phrasal senses: over 24 000 of the 74 000 senses defined in LDOCE are senses of phrases beginning with a word.

Having described LDOCE, we now outline three approaches to the extraction of semantic information from LDOCE, which are explained in more detail later in the section. These approaches are extensions of fairly well-established lines of research. The approach in section 9.2.1 is in the spirit of Sparck Jones's (1964) investigation of the semantic classification of the uses of words. In section 9.2.2, an attempt is made to develop an empirically motivated controlled vocabulary, in the spirit of Amsler's (1980) work on the role of defining vocabularies in dictionaries. Section 9.2.3 describes the construction of a large-scale parser for the extraction of genus and differentia terms, expanding upon other similar work (for example, Chodorow, Heidorn and Byrd, 1985; Alshawi, Boguraev, and Briscoe, 1985; Boguraev and Briscoe, [...]).
For NLP applications we assume that at least some dictionaries do contain at least some of the knowledge that such applications need; no philosophical position at all need be taken with regard to the basic questions presented in the introduction. A neutral attitude is taken to the status of word sense distinctions. The question here is to ask what the consequences are of too many or too few sense distinctions, and we come to see that the consequences depend on purpose. (Some more specific purposes might be paraphrasing, translation, or story-interpretation.) Regarding syntactic information, one approach simply assumes it (section 9.2.3), while another derives it (section 9.2.2).

The position taken with regard to the sufficiency of LDOCE for collecting semantic information about word senses is that all the information is in LDOCE and can be extracted, but we will need to look beyond the sense definition of a word to find the information about it.

9.2.1 Approach I: Obtaining co-occurrence statistics from LDOCE (Tony Plate)

We now describe a technique for extracting semantic information from text (specifically LDOCE) that does not require any semantic information to bootstrap it. Central to this technique is that all sentences in which a word is used are treated as information about the use of that word, rather than just the definition of the word. This technique has far fewer stages of bootstrapping than other techniques: information about all primitives is collected all at once, rather than by traversing a tree-like network of sense definitions. The bootstrapping is entirely internal (in contrast to the techniques of sections 9.2.2 and 9.2.3); the first phase, in which co-occurrence statistics are collected for words, can be seen as bootstrapping for later phases.

Among the reasons for interest in co-occurrence data:

4 Experimental findings in psychological experiments support measures based on frequency of co-occurrence as measures of the strength of the semantic relationship between words.

5 Co-occurrence data could be used by NLP systems to reduce (but not entirely resolve) lexical ambiguity in any text.

Co-occurrence data records the frequencies of co-occurrence of pairs of words within some textual unit. (This textual unit could be a phrase, a sentence, or a paragraph, but unless otherwise stated the textual unit referred to here is a sense definition.) The independent frequencies of occurrence of words in a textual unit are also important and are used in the measures defined below.
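Collecting co-occurrence data of this kind can be sketched as follows; the toy definitions stand in for LDOCE sense definitions, which are the textual units in the actual work:

```python
# For each pair of words, count the number of textual units
# (definitions) in which both occur; also keep each word's
# independent frequency of occurrence over units.

from collections import Counter
from itertools import combinations

definitions = [
    "a union by marriage",
    "a marriage ceremony",
    "a ceremony of union",
]

word_freq = Counter()
pair_freq = Counter()
for d in definitions:
    words = sorted(set(d.split()))   # one count per unit, pairs in canonical order
    word_freq.update(words)
    pair_freq.update(combinations(words, 2))

n_marriage_ceremony = pair_freq[("ceremony", "marriage")]
```

Sorting the word set before pairing guarantees that each unordered pair is counted under a single canonical key.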
The dictionary is a suitable text for this purpose, and the structure of a dictionary provides small, easily identifiable, coherent units of text over which co-occurrence statistics can be collected. In its raw form the amount of information must be reduced to make it comprehensible, and this must be done without eliminating large amounts of substructure (there are quite a few levels of structure visible in the data). What is needed is a way to cut from the co-occurrence data the most significant relationships; one useful measure is the conditional probability of co-occurrence. For example, the set of words that have a very close relationship to marriage was selected with the criterion

(9.1)   max(Pr(X | Y) - Pr(X), Pr(Y | X) - Pr(Y)) >= 0.03

with Y = marriage. The resulting sub-matrix can be given to the PATHFINDER program, which converts the completely connected network represented by the sub-matrix into a sparsely connected network. These networks provided the first indication that co-occurrence data contains much semantic information, and they have proved very interesting to examine. One such network is presented in Figure 9.2 (overleaf), which shows a network of words that have a very close relationship to marriage.

We also conducted some comparisons of the matrices with judgements of relatedness made by human subjects, who were asked to rate the relatedness of pairs of words. Several experiments were conducted, with similar results, one of which will be described briefly here.
semantics Chapter 9 9.2 The extraction of semantic information from LDOCE
LDOCE as a resource for computational
The correlation between the human judgements and the conditional probability of co-occurrence was 0.66. This is a high correlation figure and indicates that conditional probability of co-occurrence is strongly related to human judgements of semantic relatedness.

Since the measure captures semantic similarity, it can be used to find sets of words that are semantically related to a particular word. This is easily done by selecting the words for which the similarity of concept (based on equations involving conditional probability of co-occurrence) is greatest. We will discuss their application to the resolution of lexical ambiguity here.

It should be stated that we do not see statistics of co-occurrence as being able to solve all the problems of natural language understanding, or even just lexical disambiguation. Rather, we think that there is some useful semantic information that can be extracted from co-occurrence statistics, and we wish to see what applications such information can be put to, and to investigate how a subsystem using co-occurrence information could be built so that it would constitute a useful part of a natural language understanding system.
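A correlation figure like the 0.66 reported above can be computed with a standard Pearson coefficient. The rating data below is invented for illustration; it is not the data from the experiment.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: human relatedness ratings vs. conditional probabilities
human = [4.5, 1.0, 3.8, 2.2]
model = [0.30, 0.02, 0.22, 0.10]
r = pearson(human, model)
```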
Figure 9.2 Network of words related to marriage.

Discussion

Statistically based techniques have the potential to overcome several serious problems facing the practical application of other techniques. One of these problems is that most systems would require enormous numbers of handcrafted knowledge structures if they were to have any useful coverage. Another problem is psychological plausibility (and it is not clear at all how these systems are to be made more psychologically plausible). Systems based on statistical techniques, especially if they were implemented in a parallel framework, would seem potentially more psychologically plausible. A statistically based local connectionist system such as the one described here might eventually be implemented in a distributed connectionist model, and we …

The first, unique, advantage of LDOCE is that its vocabulary of definition is limited (to about 2000 words). … The second advantage … The last advantage, and the most important to the continuation of this work, is that a dictionary provides a better source of word senses than running text: it provides a starting point from which to work, and it also helps to meet the criticism of text-scanning connectionist techniques, that they only gather information about the most commonly used senses of words. The fact that there are sense definitions …

The work of Sparck Jones, referred to below as KSJ, arose from the need for efficient … Of twelve possible semantic relations, synonymy was chosen as the fundamental one, … a feature of natural language … of the kind found in a thesaurus. A thesaurus thus constructed serves as a word sense disambiguation … without requiring excessive computing resources, and takes less than an hour to build. Ways of transcending storage limitations are …
(3) The method of co-occurrence faces the opposite problem: the semantic relations it yields seem to be too vague and imprecise, rather than too narrow. This problem is expected to be … the sense definitions are used to provide information … The text need only use enough senses of the words it does use … not all words, but should make frequent use of … senses.

(4) KSJ's work was performed over twenty years ago; the computing resources we now have allow experiments to be done on real-world size vocabularies. Modern computing power also makes it possible that the techniques developed might be usable in real systems. Also, there are now vast quantities of text available in machine-readable form: dictionaries, encyclopaedias and much less constrained text. All this provides good reason for expanding on work done in the 60s. Techniques developed then were sometimes left aside because of lack of resources, and now it is possible to further develop and use those techniques.

Four aspects of building a Machine Tractable Dictionary (MTD) from LDOCE are discussed next; a flow chart of the construction process is also presented later.

First, … word senses defined at an earlier cycle are used to define those of a later cycle. In this way, the size of the KDV expands with each cycle until, after three cycles, all 2219 words from the LDOCE controlled vocabulary are accounted for. The remainder of the vocabulary is expected to be defined in the next defining cycle … automatically by some bootstrapping process. The knowledge structures used in this particular study are called integrated semantic units (ISUs); a discussion of the ISUs is given in the next section. Though the preliminary study reported here uses a KDV of around 1200 words, the number can probably be reduced to about 1000.

Second, the use of defining cycles helps to identify vacuous circular definitions. Circular definitions that use circles of just two words pose special problems for building an MTD from an MRD. For example, in LDOCE a trip is defined as a journey, and a journey as a trip. An MTD built from an MRD should be free of such circular definitions. One way to overcome them is to try to include just one of the words involved as a KDV word, but not the other. The word selected for the KDV will be the one whose first three senses fulfil the criteria of a defining cycle given earlier.

Thirdly, when constructing an MTD, use of the defining cycles ensures that all definitions of words and their senses are built containing only words that already have definitions. In the case of LDOCE, use of the defining cycles sorts out words in the 2219 LDOCE controlled vocabulary whose definitions include words outside that vocabulary; this has proved to be not uncommon in LDOCE definitions.
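Both the two-word circles (the trip/journey case) and the cycle-by-cycle expansion of the covered vocabulary can be sketched mechanically. The toy genus mapping, definitions and seed vocabulary below are invented for illustration, not the real LDOCE data.

```python
def two_word_cycles(genus):
    """Find pairs of words that define each other through their genus
    terms, like LDOCE's trip -> journey and journey -> trip."""
    cycles = set()
    for word, g in genus.items():
        if word != g and genus.get(g) == word:
            cycles.add(frozenset((word, g)))
    return cycles

def defining_cycles(definitions, seed):
    """Assign each word the cycle at which it becomes definable using
    only already-covered words (seed words count as cycle 0)."""
    covered = dict.fromkeys(seed, 0)
    cycle = 0
    while True:
        cycle += 1
        addable = [w for w, defn in definitions.items()
                   if w not in covered and all(x in covered for x in defn)]
        if not addable:
            return covered
        for w in addable:
            covered[w] = cycle

# Invented toy data
genus = {"trip": "journey", "journey": "trip", "ammeter": "instrument"}
circles = two_word_cycles(genus)

definitions = {"journey": ["move", "place"],
               "trip": ["journey"],
               "holiday": ["trip", "rest"]}
covered = defining_cycles(definitions, seed=["move", "place", "rest"])
```

The fixed-point loop terminates as soon as no new word can be defined, which is exactly where circular definitions outside the covered set would show up.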
Fourthly, in building an MTD, the main senses of these empirically-found KDV words are taken as the semantic primitives of the MTD.

The KDV is obtained through the formation and expansion of an initial hunch set. The initial hunch set of our 1200 KDV words is derived from the use of two criteria, which are word frequency and the conceptual simplicity of the main sense(s) of a word. Regarding the use of word frequency, … the words of the hunch set. This explains, in part, why only about half of the words of the KDV appear among the 1200 most frequently used words. Using the hunch set, we have shown that the 2219-word controlled vocabulary in LDOCE can be described in terms of a smaller, more suitable controlled vocabulary, i.e., a KDV. We plan to develop a new, smaller hunch set formed by a more motivated application of the criteria.

The proximity data is first processed by the PATHFINDER program, then by a cluster analysis program. The output of this process is a set of clusters of related word senses. Each cluster represents a group of candidate word senses to be included in the specification of the linguistic and world knowledge … in the definitions of word senses. An ISU is an enriched … possible candidates … The general principle is for the word senses defined at an earlier defining cycle to be processed before those defined at a later cycle, and for single words to be processed before phrasal words, no matter whether they are phrasal verbs, phrasal prepositions, or any other phrases.

The bootstrapping process is shown schematically in Figure 9.3. There are four units in the diagram: LDOCE; LAAL, the Language Analyzer And Learner; the ISU Database; and … (word sense numbers are tagged in the definition texts). As the definitions are processed by LAAL, more and more ISUs are produced. The size of the ISU database thus grows until eventually all the definitions are processed and we obtain an MTD in the form of a Prolog database of ISUs.

Figure 9.3 Flowchart of the bootstrapping process.

… knowledge in the representation of each word sense. The general form of an ISU is as follows:

isu(Wordsense,
    belong(Superordinate),
    …)

An ISU integrates the linguistic and world knowledge associated with Wordsense. Although the general form of the ISU is uniform across all four open-class categories of English words, the actual specification … of their respective superordinates and the associated world knowledge. Initial efforts at hand-crafting a small set of ISUs, and at adapting Huang's XTRA (1984, 1985) parser to the use of ISUs, have been successful.

Word sense hierarchies and learning

As can be seen from our review of MRD-related research, much work has been done on determining semantic hierarchies of genus terms in dictionary definitions. However, no semantic hierarchies of the defining senses … demand'. Thus bank has at least two superordinate terms: land and place. However, in a semantic hierarchy of word senses, a tangled hierarchy is an exception if not totally non-existent. In this case, bank1 … How hierarchies are formed in LDOCE, and how suitable they are for computational purposes, is open to empirical study. An ideal system of semantic hierarchies facilitates disambiguation and generalisation processes and is indispensable to language analysis and learning. The present effort to establish semantic hierarchies of the defining senses of the KDV used in LDOCE shares the same assumption as Karen Sparck Jones's (1967) study on dictionary circles, i.e., that useful semantic information can be obtained from analyses of dictionary definitions for purposes of text analysis in general.
Sparck Jones aimed …

The determination of the KDV of our vocabulary, together with an algorithm which catches missing defining senses of the KDV and which is available to LAAL, approximately solves the problem of extracting a minimum set of semantic primitives from a monolingual dictionary, a problem currently believed to be computationally intractable (Dailey, 1986).

The bootstrapping process requires the LAAL system to be able to acquire useful world knowledge information from information available in the dictionary. An example of useful world knowledge information is preferred case fillers such as the agent, the patient, the location, and so forth. Both definition texts and example sentences are sources of such information. Preferred agents and patients are often given in parentheses in word sense definitions. Although a limited number of example sentences are given with the definition of a particular sense, there are, usually, more examples containing the word in that particular sense elsewhere throughout the dictionary. To construct ISUs for the word senses of a word, tentative versions of the ISUs are obtained from analysing the word sense definitions and the examples that go with them. Box codes are often helpful in setting up the initial specifications of an ISU. Example sentences available outside the definition texts of a particular word provide interesting material for the learning process. Learning from these examples often results in the confirmation or modification of the specifications of tentative versions of the ISUs. LAAL analyses these examples and produces parse trees. Fillers of the same category across all examples, for example all identified agents, are grouped together in a list. Generalisations along the semantic hierarchies of word senses are made on these categorised lists of word senses.

This knowledge acquisition process is especially useful when the box codes given in LDOCE for several closely-related word senses are identical. For example, LDOCE recognises three senses for the word marry: marry1 means 'to take (a person) in marriage'; marry2 means 'to perform the ceremony of marriage for (two people)'; and marry3 means 'to cause to take in marriage'. For marry3 we do not find any information in parentheses, nor do we have any information on preferred agents; neither are there more examples besides the one given in the definition. The only example available reads: She wants her daughter to marry a rich man. LAAL infers from the patient of the sentence (her daughter) that the pronoun she stands for parent1, which means the father or mother of a person. Thus the preferred agent of marry3 is parent1. Representing preferred case role fillers in terms of word senses has a great advantage over a system of semantic features as used by LDOCE, in that it results in finer semantic distinctions between word senses. An important guideline of this learning process is cautiously guarding against overgeneralisations and overspecifications. Generalisations are made only on the basis of ample evidence, and in the absence of counter evidence.
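The generalisation step just described, grouping the observed agent fillers and climbing the sense hierarchy to the most specific sense that covers them all (as in the inference of parent1 for marry3), might be sketched like this. The hierarchy, the function name and the observed fillers are illustrative assumptions, not LAAL's actual procedure.

```python
def generalise(fillers, parent):
    """Most specific sense covering all observed fillers, found by
    walking up a word-sense hierarchy (child sense -> parent sense)."""
    def ancestors(sense):
        chain = [sense]
        while sense in parent:
            sense = parent[sense]
            chain.append(sense)
        return chain
    common = set(ancestors(fillers[0]))
    for f in fillers[1:]:
        common &= set(ancestors(f))
    for sense in ancestors(fillers[0]):  # nearest common ancestor first
        if sense in common:
            return sense
    return None

# Invented hierarchy: mother1 / father1 -> parent1 -> person1
parent = {"mother1": "parent1", "father1": "parent1", "parent1": "person1"}
preferred_agent = generalise(["mother1", "father1"], parent)
```

Requiring several fillers before generalising is one way of encoding the "ample evidence" guideline mentioned above.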
9.2.3 Approach III: A lexicon-producer (Brian M. Slator)

In this section and the next one a lexicon-producer/consumer system is outlined. The system is composed of two distinct and separable parts (a lexicon-producer and a lexicon-consumer), each based on different principles and, of course, designed for different purposes. The lexicon-producer is well along in its development, using LDOCE. The lexicon-consumer is in the planning stages, waiting for a clear picture to develop of the lexicon it will have to work with. Figure 9.4 shows an overview of the system.
The first phase of frame construction uses LDOCE's precise grammatical codes to distinguish among the senses of words, but with only general semantic and pragmatic information, such as is easy to extract from the dictionary. However, when the needs of the knowledge-based parser (the lexicon-consumer, operating over non-dictionary text) increase beyond this initial representation (as is the case whenever, say, resolving lexical ambiguity or making non-trivial attachment decisions), the frame representations are enriched by appeal to parse trees constructed from the dictionary entries of the relevant word senses. That is, the text of the definition entry itself is analysed, to extract genus and differentia terms (Slator and Wilks, 1987). This additional information further enriches the semantic structures.

LDOCE definition 'clausoids' (for lack of a better word) are typically one or more complex phrases composed of zero or more prepositional phrases, noun phrases, and/or relative clauses. The syntax of the definition entries is relatively uniform, and developing a grammar for the bulk of LDOCE has not proven to be an intractable problem. Chart parsing was selected for this system because of its utility as a grammar testing and development tool.

The chart parser accepts LDOCE definitions as Lisp lists and produces phrase-structure trees. This parser is driven by a context-free grammar of 100+ rules and has a lexicon composed of the 2219 words in the LDOCE controlled vocabulary. The parser is left-corner and bottom-up, with top-down filtering and early constituent tests (taken from Slocum, 1985).
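The left-corner parser itself is not reproduced here, but the idea of parsing definition phrases with a small context-free grammar can be illustrated with a minimal bottom-up (CKY-style) recogniser. The grammar fragment and category names below are invented, loosely modelled on a definition phrase like 'instrument for measuring current'; they are not the 100+ rules of the actual system.

```python
def cky(words, lexical, binary):
    """Bottom-up recognition with a grammar in Chomsky normal form.
    table[(i, j)] holds the categories spanning words[i:j]."""
    n = len(words)
    table = {}
    for i, w in enumerate(words):
        table[(i, i + 1)] = {cat for cat, vocab in lexical if w in vocab}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cats = set()
            for k in range(i + 1, j):
                for parent, left, right in binary:
                    if left in table[(i, k)] and right in table[(k, j)]:
                        cats.add(parent)
            table[(i, j)] = cats
    return table

# Invented grammar fragment for definition-like phrases
lexical = [("NP", {"instrument", "current"}),
           ("V", {"measuring"}),
           ("P", {"for"})]
binary = [("NP", "NP", "PP"),   # instrument [for measuring current]
          ("PP", "P", "VP"),    # for [measuring current]
          ("VP", "V", "NP")]    # measuring [current]
words = "instrument for measuring current".split()
chart = cky(words, lexical, binary)
ok = "NP" in chart[(0, len(words))]
```

A chart of this kind makes grammar debugging easy, which is the utility the authors cite for choosing chart parsing.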
The grammar is still being tuned, but currently covers the language of content word definitions in LDOCE to a factor exceeding 90%. This chart parser is not, we emphasise, a parser for English; it is a parser for the language of LDOCE definitions (Longmanese), and in fact only the open class (content word) portions of that language; definitions of closed class (function) words are not analysed. The grammar is augmented to suppress certain frequently occurring misinterpretations (for example, comma-separated nouns taken as two individual noun phrases); and, with certain minor exceptions, no procedure associates constituents with what they modify. Hence, there is little or no motivation for assigning elaborate or competing syntactic structures, since the choice of one over the other has no semantic consequence (Pulman, 1985). Therefore, the parser also has a longest string (fewest constituents) preference. A tree interpreter extracts semantic information from these phrase-structure definition trees.

Figure 9.4 Lexicon-producer/consumer.

The lexicon-producer converts LDOCE dictionary entries into lexical semantic structures (a frame-based knowledge representation), intended for knowledge-based parsing. Each lexical semantic structure (frame), as constructed, is part of one or more hierarchies. Preference disagreement, or preference breaking, can be expected in typical text. Preference breaking occurs when, for example, a verb like drink, which prefers an animate agent, is used as in

My car drinks gasoline (Wilks, 1978)

To proceed when these breakdowns occur, it is necessary to relax grammatical or semantic constraints. This relaxation is typically done by travelling up a hierarchy of constraints (the obverse of preferences), testing at each level, and keeping track of accumulated deviance (in order to compare semantic density). In this context, then, 'inferencing' entails relaxing constraints by climbing hierarchies and then recursively applying a pattern matcher.
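The relaxation procedure just described, climb the constraint hierarchy, test at each level, accumulate deviance, can be sketched as follows. The hierarchy and the type sets for Wilks's car/drink example are invented for illustration.

```python
def match_with_relaxation(preferred, candidate_types, supertype):
    """Try to satisfy a preference; on failure, relax it by climbing the
    constraint hierarchy, accumulating a deviance penalty per level."""
    level, deviance = preferred, 0
    while True:
        if level in candidate_types:
            return deviance
        if level not in supertype:
            return None  # unsatisfiable even after full relaxation
        level = supertype[level]
        deviance += 1

# Invented hierarchy/types for "My car drinks gasoline": drink prefers
# an animate agent, but car is only a physical object.
supertype = {"animate1": "physical_object1", "physical_object1": "entity1"}
car_types = {"machine1", "physical_object1", "entity1"}
deviance = match_with_relaxation("animate1", car_types, supertype)
```

The returned deviance is exactly the quantity that later allows competing readings to be compared for semantic density.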
The output of the chart parser, a phrase-structure tree, is passed to an interpreter for pattern matching and inferencing. The first step picks off the dominating phrase and, after restructuring it into genus and feature components (by reference to the genus), inserts it into the frame currently under construction. As the strategies for pattern-matching are developed, more detailed differentia information is extracted, beyond the distinction between a genus and its modifiers. The genus relationship can trivially be viewed as an IS-A relation; for example, an ammeter is an instrument for measuring electric current. The frame created for each word sense from its definition, then, represents … this intensional … for eventual preference matching. For example, it becomes reasonable to create a slot in the AMMETER frame that is labeled PURPOSE and is filled with MEASURING. This kind of information is precisely what is needed to compute case roles and preferences.

The output of the former (the lexicon-producer) is a knowledge-base for the latter (the lexicon-consumer). The output of parsing the dictionary is a lexicon of word-sense frames, with each frame implicitly positioned in multiple, pre-existing hierarchies.

9.3 The utilisation of semantic information from LDOCE

This section describes how the semantic information derived from the Longman dictionary by the processes described in section 9.2 can be mapped directly into the kinds of knowledge structures used in two kinds of semantic processing: Preference Semantics and Collative Semantics.

Section 9.3.1 describes a recent implementation of Preference Semantics in a semantic parser. Preference Semantics is a theory of language in which the meaning for a text is represented by a semantic structure that is built out of smaller semantic components. Links between components in the structure are created on the basis of coherence and preference. The semantic representation computed for a text is the one with the most semantically dense structure of the competing readings. Semantic density is a property of structures that have strong preferences regarding their constituents. These preferences are compared in terms of the existence of preference-matching features, the lack of preference-breaking features, and the length of the inference chains needed to justify each sense selection and constituent attachment decision.

Section 9.3.2 outlines Collative Semantics, which principally addresses the phenomena of lexical ambiguity and semantic relations. Seven kinds of semantic relation are investigated: literal, metonymic, metaphorical, anomalous, novel, inconsistent and redundant relations. Collative Semantics has been implemented in an NLP program called meta5, which analyses sentences, discriminates the seven kinds of semantic relation between pairs of word senses in these sentences, and resolves any lexical ambiguity in them. Section 9.3.2 focusses on the form of knowledge representation called sense-frames, and the …

9.3.1 A lexicon-consumer (Brian M. Slator)

The lexicon of frames created by the lexicon-producer of section 9.2.3 constitutes a text-specific knowledge source for use by a knowledge-based Preference Semantics parser for text. The job of Preference Semantics is to … the competing readings … and to choose among them by finding the one that is the most semantically dense, and hence preferred. The lexicon of word-sense frames and the original text are presented to the Preference Semantics parser, which is under development. We envision a goal-directed, non-deterministic parser that keeps all plausible interpretations alive but only pursues the one most highly preferred. The text is processed left to right, attachments are made immediately, and constituents are constructed locally to be applied globally (since we do make weak claims for psychological plausibility). In those cases where no satisfactory preference decisions can immediately be made, we foresee an extended mode where previous context, if any, … level, orientation. In the absence of useful context, new frames are …, and then processing can continue with these new semantic structures until some preferred reading can be found. This strategy is both a run-time optimisation and an application of the least effort principle of psychologically plausible language processing. The parsing will be robust in that some structure will be returned for every input, no matter how ill-formed or 'garden-pathological' it is. Parsing will be directed by a top-level planning component to dynamically compute deviance scores, allowing the pursuit of the currently most preferred parse as a strategy for enforcing the 'no failure' robustness policy. We anticipate a macro-text structure formulation to be applied by appeal to global coherence measures derived from an analysis of preferred pragmatic (subject) codes appearing in the text (Walker and Amsler, 1986).

The suppositions of this approach are that: …
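The selection of the most semantically dense reading, scored by preference matches, preference breaks and inference-chain length as described in section 9.3, might be sketched like this. The scores, the field names and the 0.5 weighting of chain length are invented for illustration; the actual system's scoring is not specified here.

```python
def density(reading):
    """Score a reading: preference matches count for it; preference
    breaks and long inference chains count against it."""
    return (reading["matches"]
            - reading["breaks"]
            - 0.5 * reading["chain_length"])

def most_preferred(readings):
    """Choose the semantically densest of the competing readings."""
    return max(readings, key=density)

# Invented scores for two competing sense assignments
readings = [
    {"sense": "play10", "matches": 3, "breaks": 0, "chain_length": 1},
    {"sense": "play11", "matches": 1, "breaks": 1, "chain_length": 2},
]
best = most_preferred(readings)
```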
… This material … pays particular attention to the structure of dictionaries. In sense-frames, word senses perform the functions of semantic primitives; hence sense-frames are much like Quillian's (1968) planes, and much like the circular organisation of a real dictionary. The lexicon-consumer approach limits itself to … taking the dictionary more or less as a largely self-contained unit for analysis, rather than determining the … of a word by examining its incidence of occurrence throughout the dictionary, as the first approach (section 9.2.1) does. A means has been developed by which word senses perform all the functions of the semantic primitives in theories of NLP such as Conceptual Dependency (for example, Schank, 1973, 1975a, 1975b), though they are also part of English, the object language being represented.

It was noted in the introduction that dictionary entries commonly contain a genus and differentia, and that the genus terms of dictionary entries can be assembled … (Amsler, 1980; Chodorow, Byrd, and Heidorn, 1985). Approaches to extracting such information from machine-readable dictionaries include sprouting (Chodorow, Byrd, and Heidorn, 1985), … disambiguators (Markowitz, Ahlswede and Evens), … formulas …, constructing taxonomies from Webster's Seventh …, and recent work at IBM (Binot and Jensen) on interpreting definition parse trees, by applying a pattern matcher and an inference mechanism that assigns MYCIN-like numbers (Shortliffe, 1976) to … attachment alternatives; … rule-based and semantic network-based systems, such as schema theory (Rumelhart and Ortony, 1977), KRL (Bobrow and Winograd, 1977), FRL (Roberts and Goldstein, 1977), and KL-ONE (Brachman, 1979) …

Likewise with sense-frames: a sense-frame contains a genus and differentiae and belongs to a semantic hierarchy, which is a network of genus terms, as we explain below. Collative Semantics has four components …: sense-frames and semantic vectors are … the knowledge representation … of word senses. Sense-frames consist of two major parts, the arcs and the node. The arcs part of a sense-frame …
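Genus extraction of the kind cited here typically takes the genus to be the head of the defining phrase. A crude head-finding heuristic of that sort (an illustrative sketch, not any of the cited authors' actual algorithms; the stop-word lists are invented) can be written as:

```python
STOPS = {"that", "which", "who", "of", "for", "with", "used", "in", "to"}
ARTICLES = {"a", "an", "the"}

def genus_term(definition):
    """Take the genus to be the last word of the initial noun phrase,
    i.e. the word just before the first preposition or relative marker."""
    head = None
    for w in (t.strip(",;()") for t in definition.lower().split()):
        if w in STOPS:
            break
        if w not in ARTICLES:
            head = w
    return head

g1 = genus_term("an instrument for measuring electric current")
g2 = genus_term("a large wild animal that lives in forests")
```

As the text notes below, real noun definitions are far more varied than this heuristic assumes, which is why head finding for nouns is the hard case.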
… preference … fill in [it1, steal1, valuables1]. Common dictionary practice is followed in that word senses are listed separately for each part of speech and numbered by frequency of occurrence. Hence … the cell [use1, it1] contains the noun sense shepherd1, while the cell [it1, shepherd1, sheep1] contains the verb sense. In Figure 9.5, all the alphabetic symbols the reader sees are word senses with their own sense-frames, except for symbols used as labels.

… for an organism. The sense-frames differ because female1 has the cell [property, female1], which signifies that being female is a property of females, whereas male1 has the cell [property, male1]. The sense-frames also differ in their assertions: the sense-frame for male1 has the assertion that any entity which it is applied to is male, whereas the sense-frame for female1 has the assertion [sex1, female1].

sf(male1,
   [[arcs, [[preference, organism1]]],
    [node1, [[property, male1]]]]).

sf(female1,
   [[arcs, [[preference, organism1]]],
    [node1, [[property, female1]]]]).

Figure 9.6 Sense-frames for record_player1 and tape_recorder1.
… and play11 is 'to perform on (a musical instrument)'.

sf(play10,
   [[arcs, [[supertype, [create1, operate1]]]],
    [node2, [[agent, [preference, human_being1]],
             [object, [preference, recording1]],
             [instrument, [preference, playback1]]]]]).

sf(play11,
   [[arcs, [[supertype, [perform1, use1]]]],
    [node2, [[agent, [preference, human_being1]],
             [instrument, [preference, musical_instrument1]]]]]).

Figure 9.9 Sense-frames for play10 and play11 (verb).

Play10 appears in the sense-frames of record_player1 and tape_recording1 (Figure 9.6). The preferred instrument of play10 is playback1, which is the genus term of record_player1 and tape_recording1; the …

Differentia information for nouns must be organised into a list of properties that can each be represented as a cell. Differentia information for adjectives, adverbs and so on needs to be grouped as preference versus assertion information. Differentia information for verbs, prepositions and so on must be sorted into case label, preference, and assertion information. Such demands are beyond the abilities of the best current extraction techniques. Recognising genus terms in dictionary definitions appears to be less difficult than extracting differentiae and, according to Chodorow, Byrd, and Heidorn (1985), extracting genus terms for verbs is less complex than doing the same for nouns. They found that although the genus term for both verb and noun definitions is typically the head of the defining phrase, head finding for verb definitions was relatively straightforward, while noun definitions were much more complicated because of their greater variety. Hence it makes sense to attempt to extract genus terms for adjectives and adverbs (where they exist) before turning to nouns. As for extracting differentiae from dictionary definitions, it looks as if the definitions of nouns will be the least difficult, because the differentia in their definitions has only to be transformed into lists of sense-frame cells. Adjectives and adverbs would probably be the ones to try next. Verb definitions are most likely to be the hardest to analyse, because three types of differentia information must be sorted out.
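The preference test implied by Figure 9.9, a filler satisfies a slot if it is the preferred sense or reaches it along the IS-A chain of genus terms (so record_player1 satisfies play10's instrument preference for playback1), can be sketched as follows. The GENUS table is a toy fragment, not the real sense-frame lexicon.

```python
# Toy genus fragment: each sense points at its genus (IS-A parent)
GENUS = {"record_player1": "playback1", "tape_recording1": "playback1",
         "playback1": "instrument1", "violin1": "musical_instrument1"}

def satisfies(sense, preferred):
    """A filler satisfies a preference if it is the preferred sense or
    reaches it by following the IS-A chain of genus terms."""
    while sense is not None:
        if sense == preferred:
            return True
        sense = GENUS.get(sense)
    return False

ok = satisfies("record_player1", "playback1")   # play10's instrument slot
bad = satisfies("violin1", "playback1")         # violin1 does not reach it
```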
9.4 Conclusion

The chapter has argued that a branch of computational semantics that merits attention is one that seeks to develop more natural language-like schemes for knowledge representation, and noted that, in doing so, the theoretical interests of computational semantics will come to overlap increasingly with those in knowledge acquisition and computational lexicography. The chapter focussed on the practical problem of developing methods for the extraction of semantic information from machine readable dictionaries, a specialised form of natural language text, and the use of that information to build knowledge structures. In focussing on this problem from the perspective of computational semantics, it has argued that these questions must be investigated, and adequate research and development resources committed to them, if there is to be robust, large-scale machine comprehension of English in the near future.

Chapter 10

Conclusion

Bran Boguraev and Ted Briscoe

… references confirms). MRDs are being used because they represent the most accessible source of information for building NLP systems which have realistic sized lexicons. There is clearly a feeling that the basic
techniques of (syntactic) parsing, some types of semantic processing,
spoken word recognition, and so forth, are well enough understood to
warrant either testing with large lexicons or commercial deployment in
a limited range of applications.
The chapters of this book demonstrate that useful information can
be extracted from at least one MRD source and the work reviewed
in chapter 1 suggests that many of the techniques discussed by con-
tributors to this volume generalise to other MRDs, as well as LDOCE.
Recently, there have been proposals to standardise the format of MRDs
(Amsler, 1987) so that the various sources of different MRDs will
be compatible and easily interchangeable between different research
groups. The success of this enterprise would mean that, in princi-
ple, any MRD would be straightforwardly loadable into the Lexical
Database System described in chapter 2 and the Lexicon Development
Environment described in chapter 5 (as well as many similar software
systems developed by other groups). As a consequence the MRD-
derived data available to researchers would increase massively and, no
doubt, much
valuable information could be extracted through cross-dictionary comparison and merging of the kind which is beginning to
be undertaken by the IBM Lexical Systems Project (see Byrd et al., 1987).
will
On the other hand,
not solve many of
a
the
massive
…problems which arise in the computational exploitation of these sources. To date, no substantial dictionary intended for computational use has been produced by a dictionary publisher. The great majority of MRDs remain typesetting tapes, requiring considerable preprocessing before they are usable by machine. LDOCE was, until recently, unique in that the MRD source contained some database-like attributes and structure. The publication of COBUILD (Sinclair, 1987) represents a major development in this context because, not only is this dictionary based on linguistic data derived from a large computerised corpus of English, but the published dictionary itself was also produced from an intermediate (and more detailed) computerised database constructed from the corpus. We may yet see MRDs which, from inception, are genuine lexical databases, and which produce dictionaries in a format which will yield more easily to automatic extraction and deployment. However, in the meantime there remains the difficult question concerning how far it is worth exploiting existing MRDs. Clearly, there is a direct relationship between the amount of (probably semi-automatic) processing required to extract information and the amount of research resources needed to recover this information. There is comparatively little in current MRDs which is extractable without the use of sophisticated and robust processing techniques. For example, even the grammar codes and pronunciation information in LDOCE (see chapters 4 and 6) require semi-automatic extraction techniques to guarantee 100% success. Most information in MRDs requires the deployment of robust natural language processing techniques in order to extract reliably the information needed to support these very techniques. Fortunately, though, the situation is not completely circular, because most MRDs (and especially LDOCE) are written in a specialised subset of English.

Nevertheless, the problems of extraction are usually great enough to motivate serious consideration of whether the information extracted warrants the effort. One obvious lesson is that there is little point in repeating the chore of developing extraction techniques for each new project which requires access to a MRD. Therefore, the development of systems (such as that described in chapter 2) which endeavour to separate the task of extraction from that of deployment of the information extracted is well worthwhile. Such systems convert the MRD into a general resource capable of opportunistic use, both as a source of lexical data and in unforeseen applications, limited only by the capabilities of the extraction programs (and the reliability and completeness of the chosen MRD). It is likely that further work on extraction of information from MRDs will need to focus on the development of crude but robust techniques geared specifically to this task, similar in spirit to Alshawi's phrasal analyser (see chapter 7). Probabilistic techniques developed for the analysis of textual corpora may be particularly appropriate in this context (see Garside et al., 1987). In the current context of research and development, new dictionary projects may follow the lead of the COBUILD project; in this case there may be a productive convergence of interests and techniques between computational linguists and lexicographers, both in the production and the use of the next generation of dictionaries.
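The separation argued for here — extract once from the raw source into a neutral lexical database, then let many consumers query that database — can be sketched as a two-stage pipeline. The tape format, field names and grammar codes below are invented for illustration; they are not the real LDOCE format.

```python
# Hypothetical raw typesetting-tape lines: headword, part of speech and a
# grammar-code field packed into one string (illustrative format only).
RAW_TAPE = [
    "abandon|v|T1",
    "abandon|n|U",
    "abbey|n|C",
]

def extract(lines):
    """Extraction stage: parse the raw source once into a neutral database."""
    db = {}
    for line in lines:
        head, pos, gcode = line.split("|")
        db.setdefault(head, []).append({"pos": pos, "grammar": gcode})
    return db

def lookup(db, head, pos=None):
    """Deployment stage: consumers query the database, never the tape."""
    senses = db.get(head, [])
    return [s for s in senses if pos is None or s["pos"] == pos]

db = extract(RAW_TAPE)
print(lookup(db, "abandon", pos="v"))  # [{'pos': 'v', 'grammar': 'T1'}]
```

Only `extract` ever needs to know the tape format, so a new application costs one query, not a new extraction program.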
Appendix A
Lexical database user guide

A.1 Overview
…from all entries whose head words or phrases either match the pattern in their entirety or, in the case of head phrases, have a component word that matches the pattern. Non-alphabetic characters (including spaces) are ignored in matching. Thus a call with a wild card near the beginning may give rise to a very long search. This is because the search uses a left-to-right discrimination net. A similar situation obtains, of course, when using a normal printed dictionary by hand; one must know how the spelling of a word begins to be able to find it.
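A left-to-right discrimination net of this kind can be sketched as a character trie: matching consumes the pattern from the left, so a wild card near the front forces the search to fan out over most of the net. This is an illustrative sketch, not the LDB's actual implementation.

```python
def build_trie(words):
    """Build a character trie (left-to-right discrimination net)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def match(node, pattern):
    """Yield words matching pattern; '?' matches any single character."""
    if not pattern:
        if "$" in node:
            yield ""
        return
    first, rest = pattern[0], pattern[1:]
    if first == "?":
        branches = [k for k in node if k != "$"]
    else:
        branches = [first] if first in node else []
    for ch in branches:
        for suffix in match(node[ch], rest):
            yield ch + suffix

trie = build_trie(["coffee", "coffer", "coin", "cot"])
print(sorted(match(trie, "coff??")))  # ['coffee', 'coffer']
```

Because the net discriminates on the first characters, `coff??` prunes almost everything immediately, whereas a pattern like `??ffee` would have to explore every top-level branch.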
One may select one matching word, or all of the entries whose head words match the pattern (for example, coff?? matches coffee).

[Figure A.2: a word window for coffee]

Box codes are ten-place sequences following the marker box in the definition. Each place codes for a particular type of meaning, roughly:

1. …;
2. Subpart of variety;
3. Register (humorous, formal, etc.);
4. Period of use;
5. Semantic type of object described (for a noun) or qualified (for an adjective), or of subject (for a verb);
6. Language of origin, if not English;
7. Neologism or not;
8. Illustration pointer (although illustrations are not part of the machine-readable LDOCE);
9. Cross-references (yes or no);
10. Semantic type of object, if verb.
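Reading a ten-place box code off into named fields can be sketched as below. The field names follow the list above (generically where a place is not recoverable here), and the sample code string and its values are invented for illustration.

```python
# Names for the ten places of a box code, following the list above
# (place 1 is named generically; its meaning is not given here).
PLACES = [
    "place_1", "subpart_of_variety", "register", "period_of_use",
    "semantic_type_described", "language_of_origin", "neologism",
    "illustration_pointer", "cross_references", "semantic_type_of_object",
]

def parse_box_code(code):
    """Split a ten-character box code into a dict of named places."""
    if len(code) != 10:
        raise ValueError("box codes are ten-place sequences")
    return dict(zip(PLACES, code))

fields = parse_box_code("----H----T")  # invented sample code
print(fields["semantic_type_described"])  # 'H'
```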
Selecting the [all] option results in every entry in the list being displayed. If the grammar, subject and box codes for the coffee entry are selected, the display in Figure A.3 results.

A.3.1 The WORD node

Initially the tree consists of only a WORD node.
[Figure A.3: windows specifying the grammar, subject and box codes for coffee]
A.3 Access by non-spellings
An entry explanation window, and a new search window, are created (see chapter 6). Buttoning on a node with the left mouse button brings up a node-specific menu of things that can be done to it, as explained below; selecting with the middle button will delete the node, if deletion is allowed. The information given would be used for tests, and indicates roughly how long a search would take and how many entries would be retrieved.
Get.specified.template prompts in the explanation window for a word or phrase, and builds the tree for the entry of that word. Starting from the template of an entry like the one wanted, rather than building a tree from scratch, reduces the risk of errors.

Get.random.template is like Get.specified.template, but picks an entry at random. This is mainly useful for learning how to use the LDB.

Insert.to.left inserts a new SYLLABLE node to the left of the current one.

A.3.3 The SYLLABLE node

Buttoning on a SYLLABLE node enables one to add constraints on that syllable.

Add.syllable adds an (additional) SYLLABLE node to the right of any existing ones. The syllable node initially has a daughter node *, a wild card that indicates that the content of the syllable is unspecified.

Add.sopc adds nodes for STRESS, ONSET, PEAK and CODA (see A.3.4), where these do not already exist. Initially they have wild card daughters to indicate that they are unspecified.

A.3.4 The STRESS, ONSET, PEAK and CODA nodes

Buttoning on one of these gives a structured menu of phonemes or stress values in a phonetic font. A constituent can be constrained by selecting it, which gives the following options:

Change.phonemes.to.broad changes any phonemes in the pronunciation to representations of their broad manner-of-articulation classes. The classes used are defined by the value of a variable.

Make.disjunction inserts an OR between a phoneme and its broader class, replacing the phoneme by a disjunction. A phoneme can also be expressed as a conjunction (AND) of SPE features; one can switch back and forth between features and phonemes, adding and deleting either.
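The effect of Change.phonemes.to.broad — collapsing each phoneme into a broad manner-of-articulation class while leaving the wild card * alone — can be sketched as a table lookup. The class inventory and phoneme symbols below are assumptions for illustration; in the LDB the classes come from the value of a variable.

```python
# Assumed broad manner-of-articulation classes (illustrative only).
BROAD_CLASS = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop", "k": "stop", "g": "stop",
    "f": "fricative", "v": "fricative", "s": "fricative", "z": "fricative",
    "m": "nasal", "n": "nasal",
    "l": "liquid", "r": "liquid",
    "i": "vowel", "e": "vowel", "a": "vowel", "o": "vowel", "u": "vowel",
}

def to_broad(phonemes):
    """Replace each phoneme by its broad class; '*' (unspecified) is kept."""
    return [BROAD_CLASS.get(p, p) if p != "*" else "*" for p in phonemes]

print(to_broad(["k", "o", "f", "i"]))  # ['stop', 'vowel', 'fricative', 'vowel']
```

Searching on the broadened form trades precision for recall, which is useful when only rough phonetic information is known.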
A.3.7 The GRAMMAR node

… 2. an inflectional variant thereof, which will be altered to its root form; 3. a word that …

A.3.8 The CATEGORY node

Buttoning on a CATEGORY node gives a menu of all the categories, together with the explanation window.

A search window is created by Do.lookup to display the results of the search. Initially, only the pointer lists deriving from the query have been intersected; no entries have been looked up or tests applied. If there are few entries or pointers, looking up and testing will be done automatically; otherwise it will only be done when the user left-buttons in the window and selects a command in the resulting menu. These commands are as follows.

A.3.10 The SEMANTICS node

Buttoning on a SEMANTICS node presents a choice of adding subject codes, box codes or definition words (see A.3.11–A.3.13). Multiple instances of these are interpreted conjunctively.

Buttoning on a SUBJECT node gives a structured menu of two-letter subject codes, some of which have two-letter subcodes. If one selects a code and holds the button down, an explanation window for its meaning is displayed.

Display.entries opens a subwindow to the right of the search window, and an explanation window for every entry satisfying the query; the [all] option causes all such entries to be displayed.
Appendix B
Semantic types of LDOCE verbs

B.3 Subject Equi verbs
begin(1) concur(2) delight in(1) involve(2) own to(1) quit(1) serve(5)
condescend(1) demand(1) forget(1) itch(3) pant(4) recall(1) set about(1)
1763(3) conduce to(1) depose(2) forget about(1) jib at(1) pay for(1) reckon on(2) set out(2)
begrudge(1) bid fair(1) confess(1) deride(1) forswear(1) justify(1) pertain to(1) recollect(1) shirk(1)
blanch(2) confess(2) descend to(1) frown on(1) keep(11) petition(2) refuse(1) should(1)
blink at(1) confide(1) deserve(1) funk(1) keep from(2) pine(3) regret(1) shrink from(1)
blush(2) connive(1) detest(1) gem) keep on at(1) plan(1) rejoice(1) shudder(1)
bother(3) consent(1) disclaim(1) get at(1) kick against(1) play(3) relish(1) shun(1)
break off(1) consider(1) discontinue(1) get around to(1) knock off(2) play at(1) remember(2) sicken of(1)
burn(6) consist in(1) discourage(2) get away with(1) know about(1) play at(2) repent(1) smile(2)
burst(3) conspire(1) disdain(2) get down to(1) lament(1) pledge(1) require(1) stand(8)
burst out(1) conspire(2) dislike(1) get out of(1) lead to(1) plot(5) resent(1) stand(12)
bust out(3) contemplate(2) do with(1) get round to(1) learn(1) plump for(1) resist(1) stand for(2)
B.4 Object Equi verbs

acknowledge(2), adjure(1), advise(1), aid(1), allow(2), allure(1), appoint(1), arrange for(1), ask(4), assign(4), assist(1), attribute to(1), authorize(1), badger(1), bargain for(1), bid(2), bill(2), bludgeon into(1), bluff into(1), bribe(1), bring(2), bring(5), bring in(3), bully(1), buzz(3), cable(1), call on(2), catch(3), caution(1), cause(1), charge with(1), charge with(2), come down on(1), command(1), commission(1), compel(1), condemn(3), condemn(4), condition(3), confess(3), conjure(1), connive at(1), consider(2), constrain(1), credit with(1), dare(5), debar from(1), decide(4), dedicate to(1), defy(2), delegate(2), depend on(1), depute(1), deputize(2), design(2), designate(2), detail(1), direct(3), doom(1), expect(5), forbid, force, frighten into(1), instruct(3), intend(2), introduce to(1), inure to(1), inveigle into(1), invite(2), invite(3), itch for(1), join with … in(1), keep(10), keep from(1), know(4), lead(2), lead on(1), legislate, order(1), organize(1), overhear(1), persuade(2), pester(1), petition(1), phone(1), pick(1), pick on(1), plead with(1), pledge(2), plume upon(1), pray(3), preclude from(1), predestinate(1), provoke into(1), push(3), push on(2), put down as(1), put down to(1), put off(1), put up to(1), reckon(1), reckon on(1), reduce to(4), re-educate(1), regard as(1), rely on(2), remember as(1), remind(1), represent(1),
set(8), shape(1), show(1), show(9), sign(2), signal(2), supplicate(1), suppose(3), suspect(2), take(18), talk into(1), talk out of(1), tell, tempt(1), tempt(2), thank(2), time(1), timetable(1), watch(1), watch(5), watch for(1), wean from(1), worry at(1), yearn for(1)

C.1.1 Assign/Give

They delivered the groceries to her. (deliver(2))
NP-PP[+Lat?] Position/Possession?: 5

She donated the book to him. (donate(1))
NP-PP[?Lat] Position/Possession?: 5
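Each entry in these appendices pairs a verb sense with a syntactic frame (NP-PP, NP-NP or Either), a latinate marking, and a Position/Possession score. One convenient machine representation — with hypothetical field names — is a small record type:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DativeEntry:
    verb: str                  # headword
    sense: int                 # LDOCE sense number
    frame: str                 # "NP-PP", "NP-NP" or "Either"
    latinate: str              # e.g. "+Lat?", "-Lat", "?Lat"
    pos_poss: Optional[int]    # Position/Possession score, None if unrecorded

entries = [
    DativeEntry("deliver", 2, "NP-PP", "+Lat?", 5),
    DativeEntry("donate", 1, "NP-PP", "?Lat", 5),
]

# e.g. find all senses recorded with the NP-PP frame
np_pp = [e.verb for e in entries if e.frame == "NP-PP"]
print(np_pp)  # ['deliver', 'donate']
```

Keeping the uncertain values (`?` markings, missing scores) explicit in the record preserves the tentativeness of the paper classification.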
Appendix C
"Dative" alternations

C.1 Fine-grained semantic classes of verbs

Either[?Lat] Position/Possession?: …

She mixed him a drink. (mix(2))
Either[?Lat?] Position/Possession?: 4

She gave her family all her time. (give(8), set aside)
Either Position/Possession?: …

She poured him a drink. (pour(7))
Either[?Lat] Position/Possession?: 5

She rendered him a service. (render(3))
NP-NP[+Lat?] Position/Possession?: …

She pulled him a beer. (pull(1), (5))
Either[-Lat] Position/Possession?: 4

She/the house yielded him some shelter. (yield(3))
Either[?Lat] Position/Possession?: …

She prepared him a meal. (prepare(2))
Either[+Lat] Position/Possession?: 4
…Lat] Position/Possession?: 4

C.1.3 Obtain/Find

She obtained the book for him. (obtain(1))
NP-PP[+Lat] Position/Possession?: 5

… (bring(1))
Either[-Lat] Position/Possession?: …

She chose him a new coat. (choose(1))
Either[-Lat] Position/Possession?: …

She picked a … (pick(2))
Either[-Lat] Position/Possession?: 5

She sent him a book. (send(1))
Either[-Lat] Position/Possession?: 5

C.1.4 Pay/Charge

She charged him some money. (charge(1))
NP-NP[+Lat?] Position/Possession?: 4

She fined him a pound. (fine(1))
NP-NP[+Lat?] Position/Possession?: 4

She mulcted him a pound for… (mulct(1), fine)
NP-NP[-Lat] Position/Possession?: 4

C.1.5 Say/Teach

She preached him the word of God. (preach(1))
Either[?Lat?] Position/Possession?: 0
She quoted him some poetry. (quote(1))
Either[+Lat] Position/Possession?: 0

She quoted him a price. (quote(3))
Either Position/Possession?: 0

She read him a book. (read(3))
Either[-Lat] Position/Possession?: 0

She sung him a nursery rhyme. (sing(1))
Either[-Lat] Position/Possession?: 0

She showed him the book. (show(1))
Either[-Lat] Position/Possession?: 0

She taught him history. (teach(1))
Either[-Lat] Position/Possession?: 2

C.1.6 Pass/Throw

She chucked him the ball. (chuck(1))
Either[-Lat] Position/Possession?: 5

C.1.7 Allow/Forbid

She allowed him some money. (allow(4))
NP-NP?[+Lat?] Position/Possession?: 1

She denied him nothing/the money. (deny(3))
NP-NP[+Lat?] Position/Possession?: 0

She refused him a kiss. (refuse(1))
NP-NP[+Lat?] Position/Possession?: 0

She forbade him the house. (forbid(3))
NP-NP[-Lat] Position/Possession?: 0

She gave him enough time to… (give(6), allow)
NP-NP Position/Possession?: 0

She accorded him permission to… (accord(2))
Either[+Lat?] Position/Possession?: 1

She spared him five minutes. (spare(4))
Either[-Lat] Position/Possession?: 0
She flung him the ball. (fling(1))
Either[-Lat] Position/Possession?: 5

She passed him the bread. (pass(5))
Either[-Lat] Position/Possession?: 5

She passed him the ball. (pass(7))
Either Position/Possession?: 5

She passed him a fake coin. (pass(1), (3))
Either Position/Possession?: 5

C.1.8 Save/Take

She saved him a pound. (save(3))
NP-NP[?Lat?] Position/Possession?: 1

She saved him a journey. (save(4))
NP-NP Position/Possession?: 0

She spared him a visit. (spare(2))
NP-NP[-Lat] Position/Possession?: 0

She spared him her opinion. (spare(3))
NP-NP Position/Possession?: 0

… NP-PP[-Lat] Position/Possession?: 0

The book gave him some ideas. (give(5), produce, supply)
NP-NP Position/Possession?: 3*

The meal gave him indigestion. (give(7), cause pain)
NP-NP Position/Possession?: 2

She/The medicine ensured him some sleep. (ensure(2))
Either[?Lat] Position/Possession?: 1

She bought him a present. (buy(1))
Either[-Lat] Position/Possession?: 4

She sold him a car. (sell(6))
Either[-Lat] Position/Possession?: 5

She stood him a drink. (stand(1), (7))
Either[+Lat?] Position/Possession?: 5

C.1.10 Greet/Wish

She wished her brother a safe journey. (wish(4))
Either[-Lat] Position/Possession?: 0

C.1.11 Offer

… (make(1), (7))
… Position/Possession?: 2

C.1.13 Concede

NP-NP[?Lat] Position/Possession?: 3

She owed him loyalty. (owe(2))
Either Position/Possession?: 0

She owed him a lot. (owe(3))
Either Position/Possession?: 1
… (leave(4))
Either[-Lat] Position/Possession?: 4
Either[?Lat] Position/Possession?: 1
Appendix D
Lexicon development environment user guide

D.1 The main command menu

[Figure D.1: the main command menu, including Edit, Reset and Initialise]
Initialise reloads from disc all the data needed by the LDE. The data comprises the morphology system compiled files, the sentence-level GPSG grammar, and the definitions of the LDE frames and subcategorisation translations. The command should be issued if any changes have been made to the morphological description or sentence-level grammar, and it is desired that these changes should be noticed by the LDE. The command asks for confirmation before this process is started, since reloading all these files takes a few minutes.
…command that this is the next word it should present to the user. This, and all the other commands as well, do nothing if the prompt is answered with just <return>. Repeatedly invoking Next Word steps through each head word in LDOCE in alphabetic order, so this command may be used to help derive a new lexicon from LDOCE systematically.

[Figure D.2: a word window]
[Figure D.3: the list of categories displayed for believe — GPSG feature matrices with features such as BAR, SUBCAT, VFORM, AGR, NFORM, PRD, AUX and NEG]
Defn opens a Definition Window (see next section) displaying the definition of the currently selected word.
Frames finds those syntactic frames in which the word, given its definition, will fit, and displays these frames in a new window with the word substituted in them. Each frame is represented by a phrase or sentence with one word replaced by _. A set of categories is associated with each frame; for a frame to be applicable to a word (i.e. for the word to be able to fill the _ slot), the definition of the word must include a category matching one of the frame's set of categories. Frames can be selected and edited in the Frames window, some of them enlarged, from the main LDE menu. The file containing the derived lexicon is remembered, for when the definition is later to be written to it; the normal TEDIT commands are available on a menu attached to the side of the window.
D.5 Editing subcategorisation translations

Syntactic frames are applied to definitions in two stages. The first stage transforms the nodes into theory-neutral templates. The second stage is to translate these templates into subcategorisation values, using the subcategorisation translations provided. Ed opens a window in which the translations may be edited (Figure D.5). Subcategorisation translations may be installed from a file, in a way similar to syntactic frames, by invoking Install.from.file.
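The two stages described above — frame to theory-neutral template, then template to subcategorisation value — can be sketched as two lookups. The template names and SUBCAT values below are invented for illustration; they are not the LDE's actual tables.

```python
# Stage 2: assumed translation table from theory-neutral templates
# to subcategorisation values (illustrative names and values only).
SUBCAT_TRANSLATIONS = {
    "NP": "NP",
    "NP NP": "OR",
    "NP SFIN": "SFIN",
}

def frame_to_template(frame):
    """Stage 1: reduce a frame (with '_' marking the word) to a template."""
    return " ".join(tok for tok in frame.split() if tok != "_")

def translate(frame):
    """Both stages: frame -> template -> subcategorisation value."""
    return SUBCAT_TRANSLATIONS[frame_to_template(frame)]

print(translate("_ NP SFIN"))  # 'SFIN'
```

Because the templates are theory-neutral, only the second table needs changing when the target grammar formalism changes.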
…attached to the window; there is, however, no analogous Create command.

[Figure D.5: syntactic frames, with their associated GPSG categories, e.g. [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT OR]]

There are also three other commands in the menu:

Create automatically creates a candidate set of categories for the…
Appendix E
The Longman semantic codes

E.1 Subject field codes

…or it is combined with a locality restriction. The third letter is either O, U, X or Y, which stand for the continents America, Europe, Asia and Australia, and Africa. The fourth position refers to local sub-areas; its codes range over all the letters of the alphabet and the numbers 0–8. For example, GAQA refers to the restriction of the main field (GA, games) to the location Argentina.
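A four-letter code such as GAQA can be decoded position by position: the first two letters name the subject field and the last two the locality restriction. Only GA (games) and QA (Argentina) are given by the text; every other table entry and name below is an invented placeholder.

```python
# Illustrative decoder for four-letter subject-field codes.
# GA -> games and QA -> Argentina come from the text; the tables
# would otherwise need filling from the full Longman code list.
FIELDS = {"GA": "games"}
LOCALITIES = {"QA": "Argentina"}

def decode(code):
    """Split a four-letter code into its field and locality restriction."""
    field, locality = code[:2], code[2:]
    return FIELDS.get(field, field), LOCALITIES.get(locality, locality)

print(decode("GAQA"))  # ('games', 'Argentina')
```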
…noun (N01) and adjective (A01) and cross-reference (XX):

C Concrete:
  Q Animate:
    H Human.
    A Animal.
    P Plant.
  I Inanimate:
    S Solid.
    L Liquid.
    G Gas.
E.3 Explanation of POS codes

D0 determiner
N0 noun
P0 pronoun
A0 adjective
I0 preposition
C0 conjunctor
NS plural noun
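The POS codes above are flat two-character labels, while the semantic type codes of E.2 form a small hierarchy (H, A and P are kinds of Q; Q and I are kinds of C). Both lend themselves to table lookup; the parent links below encode only what the lists above state.

```python
POS_CODES = {
    "D0": "determiner", "N0": "noun", "P0": "pronoun", "A0": "adjective",
    "I0": "preposition", "C0": "conjunctor", "NS": "plural noun",
}

# Parent links of the semantic type codes: Human/Animal/Plant under
# Animate, Solid/Liquid/Gas under Inanimate, both under Concrete.
PARENT = {"Q": "C", "I": "C", "H": "Q", "A": "Q", "P": "Q",
          "S": "I", "L": "I", "G": "I"}

def subsumes(general, specific):
    """True if the general type code covers the specific one."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False

print(POS_CODES["NS"])     # 'plural noun'
print(subsumes("C", "H"))  # True: Human is Animate, hence Concrete
print(subsumes("Q", "S"))  # False: Solid is Inanimate
```

A subsumption check of this kind is what lets a selectional restriction stated at a general level (e.g. Animate) accept any of its more specific types.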
[Table of codes]
Appendix F
The Longman grammar coding system

[Table: the Longman grammar codes. The letters A, B, C, D, E, F, GC, GU, H, … head the columns; the numbers code what the verb is followed by — 1: one or more nouns or pronouns; 2: the infinitive without to; 3: the to-infinitive; 4: the -ing form; 6: a wh- word; 7: an adjective; 8: a past participle — with a further row for verbs not followed by anything.]
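The number part of a Longman grammar code can be read off from the table, so far as its rows are legible above. The sketch below covers only those recoverable rows; the code `T3` used to demonstrate splitting a letter from its number is a hypothetical example.

```python
# Meanings of the number part of a grammar code, from the table rows
# recoverable above (other numbers omitted).
NUMBER_MEANINGS = {
    1: "followed by one or more nouns or pronouns",
    2: "followed by the infinitive without to",
    3: "followed by the to-infinitive",
    4: "followed by the -ing form",
    6: "followed by a wh- word",
    7: "followed by an adjective",
    8: "followed by a past participle",
}

def describe(code):
    """Split a code like 'T3' into its letter and number description."""
    letter, number = code[0], int(code[1:])
    return letter, NUMBER_MEANINGS.get(number, "(not listed here)")

print(describe("T3"))  # ('T', 'followed by the to-infinitive')
```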
Bibliography

Aarts F, Aarts J (1982) English Syntactic Structures: Functions and Categories in Sentence Analysis. Pergamon Press Ltd, Oxford.
Aarts J M G, Calbert J P (1979) Metaphor and Non-Metaphor: The Semantics of Adjective-Noun Combinations. Niemeyer Verlag, Tübingen.
Aarts J, van den Heuvel T (1985) Computational tools for the syntactic analysis of corpora. Linguistics 23: 303–35.
Adda G, Esknazi M, Stern P E (1987) The use of rough spectral features for large vocabulary recognition. Proceedings of the European Conference on Speech Technology, Edinburgh, pp 171–4.
Ahlswede T (1983) A linguistic string grammar of adjective definitions from Webster's Seventh Collegiate Dictionary. Master's thesis, Illinois Institute of Technology, Chicago, Illinois.
Akkerman E, Masereeuw P C, Meijs W J (1985) Designing a Computerized Lexicon for Linguistic Purposes: ASCOT Report No 1. Rodopi, Amsterdam.
Akkerman E, Meijs W J, Voogt-van Zutphen H J (1987) Grammatical tagging in ASCOT. In Meijs W J (ed) Corpus Linguistics and Beyond: Proceedings of the Seventh International ICAME Conference. Rodopi, Amsterdam, pp 181–93.
Akkerman E, Meijs W J, Voogt-van Zutphen H J (1988a, forthcoming) A Computerized Lexicon for Word-Level Tagging: ASCOT Report No 2. Rodopi, Amsterdam.
Akkerman E, Meijs W J, Voogt-van Zutphen H J (1988b, forthcoming) ASCOT: a computerized lexicon with an associated scanning system. In Ihalainen O, Kytö M, Rissanen M (eds) Proceedings of the Eighth International Conference on English Language Research on Computerized Corpora (provisional title). Rodopi, Amsterdam.
Alshawi H (1987) Memory and Context for Language Interpretation. Cambridge University Press, Cambridge.
Amsler R A (1984a) Machine-readable dictionaries. In Williams M (ed) Annual Review of Information Science and Technology, Vol 19. American Society for Information Science, pp 161–223.
Amsler R A (1984b) Lexical knowledge bases. Panel session … Theoretical …, Las Cruces, NM, pp 16–19.
Amsler R A (1987b) How do I turn this book on? Proceedings of the Third Annual Conference of the UW Centre for the New Oxford English Dictionary: The Uses of Large Text Databases, Waterloo, Canada, pp 76–88.
Aull A M, Zue V W (1984) Lexical stress and its application to large vocabulary speech recognition. Paper presented at the 108th meeting of the Acoustical Society of America, Minneapolis, Minnesota.
Bates M, Moser M G, Stallard D (1986) The IRUS transportable natural language database interface. In Kershberg (ed) Expert …
Bobrow R, Webber B (1980) Knowledge representation for syntactic/semantic processing. Proceedings of the First National Conference on Artificial Intelligence.
Boguraev B K (1979) Automatic Resolution of Linguistic Ambiguities. Doctoral thesis, also available as Technical Report No. 11, Computer Laboratory, University of Cambridge.
Boguraev B K, Briscoe E J (1987) Large lexicons for natural language processing: exploiting the grammar coding system of LDOCE. Computational Linguistics 13(3–4).
Boguraev B K, Carter D M, Briscoe E J (1987b) A multi-purpose interface to an on-line dictionary. Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, Copenhagen, Denmark, pp 63–9.
Boguraev B K, Copestake A, Sparck Jones K (1988, forthcoming) Inference in natural language front ends for databases. In Sernadas A (eds) Knowledge and Data (DS-2). North-Holland, Amsterdam.
Boguraev B K, Carroll J, Briscoe E J, Pulman S, Russell G, Ritchie G D, Black A, Grover C (1988, forthcoming) The lexical component of a natural language toolkit. In Walker D, Zampolli A, Calzolari N (eds) …
Briscoe E J (1985) Report of the Dictionary Syndicate. Alvey Speech Club Workshop, Warwick University.
Briscoe E J, Grover C, Boguraev B K, Carroll J (1987) A formalism and environment for the development of a large grammar of English. Proceedings of the Tenth International Joint Conference on Artificial Intelligence, Milan, pp 703–8.
Byrd R J (1983) Word formation in natural language processing systems. Proceedings of the Eighth International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, pp 704–5.
Byrd R J (1988, forthcoming) Dictionary systems for office practice. In Walker D, Zampolli A, Calzolari N (eds) Automating the Lexicon: Research and Practice in a Multilingual Environment. Cambridge University Press, Cambridge.
Byrd R, Calzolari N, Chodorow M, Klavans J, Neff M, Rizk O (1987) Tools and methods for computational lexicology. Report RC 12642, IBM Yorktown Heights.
Byrd R J, Chodorow M (1985) Using an on-line dictionary to find rhyming words and pronunciations for unknown words. Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, pp 277–83.
…Andersson K S B (1986) DAM — a dictionary … corrector. Information Processing and Management 19(2): 101–8.
… Intelligence (IJCAI-87), Milan, Italy, pp 715–17.
Carnap R (1952) Meaning postulates. Philosophical Studies 3: 65–73.
Carter D M (1987) An information-theoretic approach to phonetic dictionary access. Computer Speech and Language 2: 1–11.
Carter D M, Boguraev B K, Briscoe E J (1987) Lexical stress and phonetic information: which segments are most informative. Proceedings of the European Conference on Speech Technology, Edinburgh, pp 235–8.
Cater A (1987) Conceptual primitives and their metaphorical relationships. In Reilly E (ed) Communication Failure in Dialogue. North-Holland, Amsterdam.
Charniak E (1972) Toward a model of children's story comprehension. MIT AI Memo 266, Massachusetts Institute of Technology, Cambridge, MA.
Chodorow M S, Byrd R J, Heidorn G E (1985) Extracting semantic hierarchies from a large on-line dictionary. Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, pp 299–304.
Chomsky N, Halle M (1968) The Sound Pattern of English. Harper and Row, New York.
…and Co. Ltd, Glasgow.
…of a Natural Language Lexicon. Program in Linguistics and Cognitive Science, Brandeis University, Waltham, Massachusetts.
Hayes P J, Mouradian G V (1981) Flexible parsing. American Journal of Computational Linguistics 7(4): 232–42.
Heidorn G, Jensen K, Miller L, Byrd R, Chodorow M (1982) The EPISTLE text-critiquing system. IBM Systems Journal 21(3): 305–26.
Hirst G (1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press, Cambridge.
Hobbs J R (1987) World knowledge and word meaning. Proceedings of the 3rd Workshop on Theoretical Issues in Natural Language Processing (TINLAP-3), Las Cruces, New Mexico, pp 20–5.
Hodgkin A (1987) The uses of large text databases. Proceedings of the Third Annual Conference of the UW Centre for the New Oxford English Dictionary: The Uses of Large Text Databases, Waterloo, Canada, pp 9–16.
Hornby A S (ed) (1980) Oxford Advanced Learner's Dictionary of Current English, 3rd edition (11th impression). Oxford University Press, Oxford.
Huang X (1984) A computational treatment of gapping, right node raising and reduced conjunction. Proceedings of the 10th International Congress on Computational Linguistics (Coling84), Stanford, California, pp 243–6.
Huang X (1985) Machine translation in the SDCG (semantic definite clause grammars) formalism. Proceedings of the Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Colgate University, New York, pp 135–44.
Huttenlocher D P (1985) Exploiting sequential phonetic constraints in recognizing spoken words. Massachusetts Institute of Technology, Artificial Intelligence Laboratory, AI Memo 867.
Huttenlocher D P, Zue V W (1983) Phonotactic and lexical constraints in speech recognition. Proceedings of the National Conference on Artificial Intelligence (AAAI-83), Washington, DC, pp 172–6.
Ingria R (1984) Complement types in English. Report No. 5684, Bolt Beranek and Newman Inc, Cambridge, Massachusetts.
Ingria R (1988, forthcoming) Lexical information for parsing systems: points of convergence and divergence. In Walker D, Zampolli A, Calzolari N (eds) Automating the Lexicon: Research and Practice …
Johansson S, Atwell E, Garside R, Leech G (1986) The Tagged LOB Corpus: Users' Manual. Norwegian Computing Centre for the Humanities, Bergen.
Johnson M (1985) Computer aids for comparative dictionaries. Linguistics 23(2): 285–302.
Johnson T (1985) Natural Language Computing: the Commercial Applications. Ovum Press Ltd, London.
de Jong J, Masereeuw P (1987) PARSCOT: a new implementation of the LSP grammar. In Meijs W (ed) Corpus Linguistics and Beyond. Rodopi, Amsterdam, pp 195–206.
Kaplan R, Bresnan J (1982) Lexical-functional grammar: a formal system for grammatical representation. In Bresnan J (ed) The Mental Representation of Grammatical Relations. The MIT Press, Cambridge, Massachusetts, pp 173–281.
Katz J J, Fodor J A (1963) The structure of a semantic theory. Language 39: 170–210.
Kay M (1984a) Functional unification grammar: a formalism for machine translation. Proceedings of the 10th International Congress on Computational Linguistics (Coling84), Stanford, California.
Kay M (1984b) The dictionary server (panel on Machine-Readable Dictionaries). Proceedings of the 10th International Congress on Computational Linguistics (Coling84), Stanford, California, p 461.
Kay M, Kaplan R (1981) Phonological Rules and Finite State Transducers. Paper presented at the Annual Meeting of the Association for Computational Linguistics, New York.
Kazman R (1986) Structuring the text of the Oxford English Dictionary through finite state transduction. Technical Report TR-B-20, Department of Computer Science, University of Waterloo, Waterloo, Ontario.
Kegl J (1987) The boundary between word knowledge and world knowledge. Proceedings of the 3rd Workshop on Theoretical Issues in Natural Language Processing (TINLAP-3), Las Cruces, New Mexico, pp 26–31.
Lehrer A (1974) Semantic Fields and Lexical Structure. North-Holland, Amsterdam.
Lemmens M, Wekker H (1986) Grammar in English Learners' Dictionaries. Niemeyer Verlag, Tübingen (Lexicographica, Series Maior 16).
Lenat D, Prakash M, Shepherd M (1986) CYC: using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine 6(4): 65-92.
Lesk M (1986a) Information in Data: Using the Oxford English Dictionary on a Computer. Summary of a conference on Information in Data held in the Centre for the New OED, University of Waterloo, in November 1985 (also in ACM SIGIR Forum 20(1-2)).
Lesk M (1986b) Why I Want the OED on My Computer, When I'm Likely to Have It. SIGCUE Newsletter, 2nd Quarter.
Levin B (1985) Lexical semantics in review: an introduction. In Levin B (ed) Lexical Semantics in Review. Lexicon Working Papers 1, Massachusetts Institute of Technology, pp 1-62.
Levin B (1988, forthcoming) Approaches to lexical semantic representation. In Walker D, Zampolli A, Calzolari N (eds) Automating the Lexicon: Research and Practice in a Multilingual Environment. Cambridge University Press, Cambridge.
…COLA INCURVO TERRAM DIMOVIT ARATRO. First stage translation into English with the aid of Roget's Thesaurus. Report (ML84) ML92, Cambridge Language Research Unit, Cambridge.
Meijs W (1985) Linguistically Useable Meaning Characterisations in the Lexicon (LINKS). Project Description, English Department, University of Amsterdam, Amsterdam.
Meijs W (1986a) Links in the lexicon: the dictionary as a corpus. Icame News 10: 26-78.
Meijs W (1986b) Lexical organisation from three different angles. Journal of the Association of Literary and Linguistic Computing 13(1).
Meijs W (1988a, forthcoming) Spreading the word: knowledge activation in a functional perspective. In Connolly J, Dik S (eds) Functional Grammar and the Computer. Foris, Dordrecht.
Meijs W (1988b, forthcoming) Morphology in the dictionary, with special reference to LDOCE. In Lachlan Mackenzie J, Todd R (eds) Festschrift für Hans Heinrich Meier (provisional title). Free University Press, Amsterdam.
Mellish C S (1985) Computer Interpretation of Natural Language Descriptions. Ellis Horwood, Chichester.
Michiels A (1983) Automatic analysis of texts. Proceedings of the conference (Informatics 7) held by the Aslib Informatics Group and the Information Retrieval Group of the British Computer Society,
Pollock J (1982) Spelling error detection and correction by computer: some notes and bibliography. Journal of Documentation 38(4): 282-91.
Rosenbaum P S (1967) The Grammar of English Predicate Complement Constructions. MIT Press, Cambridge, Massachusetts.
Rumelhart D E, Ortony A (1977) The representation of knowledge in memory. In Anderson R C, Spiro R J, Montague W E (eds) Schooling and the Acquisition of Knowledge. Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Shortliffe E H (1976) Computer-Based Medical Consultations: MYCIN. Elsevier Science Publishers B.V., North-Holland, Amsterdam.
Simmons R (1973) Semantic networks: their computation and use for understanding English sentences. In Schank R C, Colby K M (eds) Computer Models of Thought and Language. W.H. Freeman, San Francisco.
Sparck Jones K (1964) Synonymy and Semantic Classification. Doctoral thesis, University of Cambridge (also published in Edinburgh Information Technology Series (EDITS), Michaelson S and Wilks Y (eds), Edinburgh University Press, Edinburgh, Scotland, 1986).
Sparck Jones K (1967) Dictionary Circles. Report SP-3304, System Development Corporation, Santa Monica, California.
van der Steen G J (1982) A treatment of queries in large text corpora. In Johansson S (ed) Computer Corpora in English Language Research. Norwegian Computing Centre for the Humanities, Bergen, pp 49-65.
Stockwell R P, Schachter P, Partee B H (1973) The Major Syntactic Structures of English. Holt, Rinehart and Winston, New York.
Streeter L A (1978) The acoustic determination of phrase boundary perception. Journal of the Acoustical Society of America 64(6): 1582.
Stubbs J (1986) A Database-in-Waiting: The OED Becomes the New OED. Paper presented to the conference on Computers and the Humanities, University of Toronto, Toronto.
Stubbs J, Tompa F (1984) Waterloo and the New Oxford English Dictionary Project. Paper presented to the Twentieth Annual Conference on Editorial Problems, University of Toronto, Toronto.
Thompson H (1983) Natural language processing: a critical analysis of the structure of the field, with some implications for parsing. In Sparck Jones K, Wilks Y (eds) Automatic Natural Language Parsing. Ellis Horwood, Chichester, pp 22-31.
Thorndike E L, Lorge I (1944) The Teacher's Word Book of 30 000 Words. Teachers College Press, Teachers College, Columbia University, New York.
Tompa F (1986) Database design for a dictionary of the future. Preliminary report, Centre for the New Oxford English Dictionary, University of Waterloo, Waterloo, Ontario.
Tubach J-P, Boe L J (1986) Quantitative knowledge on word structure, from a phonetic corpus, with application to large vocabularies recognition systems. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Tokyo, Japan, pp 614.
Tucker A, Nirenburg S (1984) Machine translation. In Williams M (ed) Annual Review of Information Science and Technology (ARIST), vol. 19. American Society for Information Science.
…Department of Linguistics, University of Amsterdam, Amsterdam.
Voogt-van Zutphen H J (1988, forthcoming) Towards a lexicon of functional grammar. In Connolly J, Dik S (eds) Functional Grammar and the Computer. Foris, Dordrecht.
Vossen P, den Broeder M, Meijs W J (1988, forthcoming) The LINKS project: building a semantic database for linguistic applications. Proceedings 8th Icame Conference, Helsinki.
W7 (1967) Webster's Seventh New Collegiate Dictionary. G. & C. Merriam Company, Springfield, Massachusetts.
Walker D, Amsler R (1986) The use of machine-readable dictionaries in sublanguage analysis. In Grishman R, Kittredge R (eds) Analyzing Language in Restricted Domains. Lawrence Erlbaum Associates, Hillsdale, New Jersey, pp 69-83.
Walker D, Zampolli A, Calzolari N (eds) (1988) Automating the Lexicon: Research and Practice in a Multilingual Environment. Cambridge University Press, Cambridge.
Waltz D (1983) Artificial Intelligence: an assessment of the state-of-the-art and recommendation for future directions. The AI Magazine 4(3): 55-67.
Waltz D L, Pollack J B (1985) Massively parallel parsing: a strongly interactive model of natural language interpretation. Cognitive Science 9: 51-74.
Warren B (1978) Semantic Patterns of Noun-Noun Compounds. Gothenburg Studies in English 41, Göteborg: Acta Universitatis Gothoburgensis.
Weischedel R M, Black J E (1980) Responding intelligently to unparsable inputs. American Journal of Computational Linguistics 6(2): 97-109.
Whitelock P, Wood M, Somers H, Johnson R, Bennett P (eds) (1987) Linguistic Theory and Computer Applications. Academic Press, New York.
Wiederhold G (1983) Database Design. McGraw-Hill, New York.
Wilensky R, Arens Y (1980) PHRAN: a phrasal natural language understander. Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp 117-21.
Wilks Y A (1973) An artificial intelligence approach to machine translation. In Schank R C, Colby K M (eds) Computer Models of Thought and Language. W.H. Freeman, San Francisco.
Winograd T (1983) Language as a Cognitive Process. Addison-Wesley, Reading, Massachusetts.
Author index
Warren B, 31
Way E, 17
Webber B, 28, 155
Weischedel R, 164
Wekker H, 82
Whitelock P, 10
Wiederhold G, 50
Wilensky R, 163, 168
Wilks Y, 27, 154, 163, 193, 195, 218-19, 225
Williams E, 96
Winograd T, 8, 39, 193, 223
Wong D, 223
Woods W, 3
Yannakoudakis E, 21, 33
Zamora A, 21
Zampolli A, 2
Zue V, 25, 136, 146