Computational Lexicography
for
Natural Language Processing

Edited by
Bran Boguraev and Ted Briscoe

LONGMAN
London and New York
… grammar 39
B. Boguraev and T. Briscoe
2.2 The Longman tape and its computational counterpart 46
2.3 On-line access: simple mode 49
… Conclusion
… Semantic types of LDOCE verbs
… 'Dative' alternations
3.3 A detailed analysis of the LDOCE classes 77
3.3.1 Intransitive verbs 78
3.3.2 Linking verbs with adverbial complementation 78
3.3.3 Adjective complementation 79
3.4 …
5.3 Inaccuracies in the … data 124
… 125
5.4.1 System design 126
… Morphological generation 131
… 132
5.5 Future developments 133
5.6 Conclusion 133
5.7 Notes
… Lexical database access 143
6.4 Measuring … 144
6.4.1 Word counts and word frequencies 145
6.4.2 Linear and logarithmic measures 146
6.4.3 An information-theoretic approach 146
… Conclusion
8 … W. Meijs and M. den Broeder 171
8.1 Introduction 171
8.4.1 Criteria for distinguishing types of NMDs 176
… 181
8.5 Syntactic characteristics of NMDs 183
… 184
8.5.3 Some syntactic statistics of NMDs 186
8.6 Conclusion 189
8.7 Notes 190
9 A tractable machine dictionary as a resource for computational semantics 217
9.2.3 Approach III: A lexicon-producer 217
9.3 The utilisation of semantic information from LDOCE 220
9.3.1 A lexicon-consumer 221
9.3.2 Collative semantics 222
9.4 Conclusion 228
10 Conclusion
B. Boguraev and T. Briscoe 229
Appendices
A Lexical database: user guide 233
A.1 Overview 233
A.2 Access by spellings 236
A.3 Access by non-spellings 246
B.4 Object Equi verbs 248
B.5 Equi verbs …
E The Longman semantic codes 269
E.1 Subject field codes 269
Index 293
Author Index 307
Contributors
xi
Foreword
In the last few years the attention of researchers and developers, both in industry and academia, working in the fields of linguistics, computational linguistics, artificial intelligence, psycholinguistics, cognitive science and information technology, has been drawn to the importance of lexical resources by a number of converging factors.

On the theoretical side, the major contemporary linguistic theories are assigning to the lexicon an increasingly central role. On the application side, the availability of lexical resources in machine-readable form is becoming a major concern for the language industry. This field, which is emerging as an autonomous sector of the information industries, includes both computer assistance to traditional applied linguistics professions (including, for example, lexicography, translation and language teaching) and development of computational systems based on natural language processing (as required, for example, in office automation, speech analysis and synthesis, natural language interfaces, automatic indexing and abstracting, information retrieval, machine translation, and, more generally, support for communication).

Computational linguists and language industry developers recognise that, for real world applications, it is of fundamental importance that natural language processing systems are able to deal with tens and even hundreds of thousands of lexical items. Consequently, the development of large lexical knowledge bases has emerged as probably the most urgent, expensive, and time-consuming task facing linguistics, computational linguistics, and artificial intelligence.

Computational lexicography and lexicology is beginning to emerge as a discipline in its own right: witness the number of specialised workshops (Automating the Lexicon in a Multilingual Environment, The Lexical Entry, The Lexicon in Theoretical and Computational
Perspectives, Lexical Semantics), conferences (Advances in Lexicology, Standardisation in Lexicography, Electronic Dictionaries, The Lexicon in a Multilingual Environment, Lexicology and Lexicography, Words and World Representations), panel discussions (Machine-Readable Dictionaries), specialist/ad hoc working groups (Computational Dictionaries and the Computer), and publications (such as the Special Issue on the Lexicon of the journal Computational Linguistics).

At the same time, it is becoming apparent that, even though human users and computer programs emphasise different aspects of lexical information and require different data structures, explicit and accessible lexical knowledge bases can greatly facilitate human use of … refers to two major complementary issues. The first concerns the current growing efforts to establish new large lexical knowledge bases, in such a manner that they will serve, through generalised natural language processing modules, … applications … a treasure store of data and information …

Computational Lexicography for Natural Language Processing constitutes the first systematic …

Antonio Zampolli
May 1988

Acknowledgements

The editors and all of the contributors to this volume wish to … the Commission of the European … not only supported much of the research described here, through the provision of grants and fellowships to ourselves and Hiyan Alshawi, but also supported Bran Boguraev during the editing of this volume. Chapter 2 draws on material which has previously appeared in two conference papers …
Chapter 1
Introduction
cerning either the nature of the information which the lexicon should
contain or how it should be represented (for example, Ingria, 1988).
The task of constructing a realistic lexicon for a natural language,
such as English, is formidable, not only because of the absence of a
well-articulated theory of what it should contain, but also because of
the enormous number of words to be dealt with. The Oxford English
Dictionary (OED) contains entries representing approximately 250,000 independent words. However, even the OED still does not list many words from specialised fields. Walker and Amsler (1986) highlight this … by automated natural language processing systems. This development is comparatively more recent (see Walker and Zampolli, 1988) and has been made feasible by the advent of computer typesetting techniques, which have ensured the availability of machine-readable versions of most published dictionaries.

There are several advantages and at least one major disadvantage to the use of machine-readable dictionaries (MRDs) in research on natural language processing. Firstly, since there is a considerable tradition behind the production of dictionaries for human consumption, we might hope that they will provide a suitable starting point for defining the contents of a lexicon for machine use. Secondly, since many published dictionaries are large, much of the construction work has already been done for the computational linguist. On the other hand, published dictionaries are produced with the human reader in mind and therefore make many inconvenient assumptions from the point of view of processing by machine; for example, the assumption that the user can understand definitions of word senses written in English. Most of the research reported in this book has two facets: firstly, the development of automated (or semi-automated) techniques to make various types of information in MRDs accessible for machine use, and secondly, the subsequent use of this information in evaluating and improving both natural language processing systems of various types and the linguistic theories which lie behind them. All of the work reported in this book is based on the machine-readable version of the Longman Dictionary of Contemporary English (LDOCE). For reasons which will be made clear below, this dictionary is uniquely suitable for computational lexicography.

We have called this line of research with MRDs computational lexicography in recognition of the fact that, although strictly speaking we are not in the business of dictionary construction, it is certain that the lexicons which are derived from MRDs for use by machine will be very different from conventional published dictionaries, both in terms of how they organise and how they represent information. In addition, although the techniques described in this book are primarily of relevance to developing lexicons for machine use, increasing use is being made of computational techniques in the development of dictionary databases for human use. For example, the … for the research reported in subsequent chapters. In addition we briefly survey representative work in computational lexicography which is not discussed later in other chapters (either because it involves MRDs other than LDOCE or because the aims of the work are not central to issues of natural language processing). In this way, we hope that this chapter will serve as a tutorial overview of work in computational lexicography in general, as well as providing the foundation for a proper understanding of the specific work reported in this book.

1.1 Natural language processing

As we remarked above, the goal of research on natural language processing (NLP) is the automation of the processes of language comprehension, production and acquisition in both written and spoken media. Research on NLP is undertaken by (computational) linguists, psychologists, computer scientists and system engineers from slightly different perspectives. However, all share the goal of developing a fully explicit (and therefore programmable) theory of these processes. Although no comprehensive theory has emerged yet or appears likely to do so in the foreseeable future, in practice many workable NLP systems can be constructed on the basis of partial understanding of some of these processes. For example, intelligible text-to-speech synthesisers have been built (for example, Allen et al., 1987). Existing speech synthesis systems do not attempt to understand the input, but rather rely on a more superficial linguistic analysis of the words in the text and their syntactic organisation into sentences. In addition to speech synthesis, significant progress has been made in constructing workable speech recognition systems (see Fallside and Woods, 1985), which convert a stream of spoken words into their written counterparts, in question answering systems (for example, Bolc and Jarke, 1986), which respond to a query concerning some limited domain, for example train timetables, and in translation between languages (for example, Nirenburg, 1987), again within some limited domain. Rather less progress has been made in achieving general language understanding systems and in language generation or language learning systems, but much research
2. Morphological knowledge about the internal structure of words; for example, that the phoneme /s/ or /z/ attached to an English noun makes it plural or that re- attached to a verb means 'do … again'

… paragraph. However, the rule breaks down in the case of an adjective such as man-eating:

That man-eating fish
* That fish is man-eating
We can represent these syntactic rules as phrase structure rules of the general form shown below:

Mother → Daughter1 Daughter2 … Daughtern

in which one mother syntactic category may contain one or more daughter categories, and categories consist of a name followed by optional further syntactic features. A grammar written in this notation which will generate all and only the grammatical examples above is shown below:

1. S → NP VP
2. VP → V AP[prd +]
3. NP → Det N
4. NP → Det AP[prd -] N
5. AP[prd x] → A[prd x]

(1) states that a sentence (S) may consist of a noun phrase (NP) and a verb phrase (VP). (2) states that a VP may consist of a verb and an adjective phrase (AP) marked as [prd +]. The feature prd takes two values (+ or -) and is used to represent whether an adjectival category is predicative or non-predicative (attributive). Rules (3) and (4) state that a noun phrase can consist of a determiner (Det) and noun (N) with an optional intervening AP; in (4) this AP is marked as [prd -]. Rule (5) contains the feature prd with a variable value x, where x is a variable ranging over the possible values of prd; its value must be identical on both mother and daughter. Thus APs can be either [prd +] or [prd -].

The process of applying a grammar to an input sentence to determine its grammaticality and appropriate syntactic structure is known as (syntactic) parsing. However, in order to mechanically and automatically apply this grammar to one of the sentences above it is necessary to provide a lexicon which tells the parser which lexical syntactic categories are associated with the words in the input. Given such a lexicon, the parse proceeds in stages:

1. Look each word up in the lexicon and replace it with its lexical category, working from left to right.

2. Match the grammar rules against the resulting sequence of categories and connect up the mother and daughter categories as specified by each rule.

3. Try to build a complete tree with an S node as root, filling in the gaps with further rules which match the unconnected categories in the partial tree.

For example, given the input:

That man-eating fish is beautiful

the first stage could yield:

Det   A[prd -]    N     V   A[prd +]
That  man-eating  fish  is  beautiful

The second stage would yield the partial tree (shown here as a labelled bracketing):

[NP Det [AP[prd -] A[prd -]] N]   [VP V [AP[prd +] A[prd +]]]
 That man-eating         fish      is  beautiful

In the course of this matching, rule (5), which specifies a value for the feature prd only as the variable x, is instantiated with the appropriate feature value. Thus if the alternative lexical category for beautiful had been chosen, rule (2) would not have matched the second AP and we would have needed to redo the first stage of the parsing process.
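The staged lookup-and-match procedure just described can be sketched in code. This is a minimal toy illustration of our own (all names are ours, not the book's): the variable x of rule (5) is handled by instantiating the rule once for each value of prd, and the toy lexicon gives beautiful only its predicative entry.

```python
# Toy lexicon: each word maps to (category, prd-value) pairs;
# prd is None where the feature is irrelevant.
LEXICON = {
    "that":       [("Det", None)],
    "man-eating": [("A", "-")],   # attributive only: [prd -]
    "fish":       [("N", None)],
    "is":         [("V", None)],
    "beautiful":  [("A", "+")],   # predicative: [prd +]
}

# Rules (1)-(5); rule (5) is instantiated once per value of prd.
RULES = [
    (("S", None),  [("NP", None), ("VP", None)]),              # (1)
    (("VP", None), [("V", None), ("AP", "+")]),                # (2)
    (("NP", None), [("Det", None), ("N", None)]),              # (3)
    (("NP", None), [("Det", None), ("AP", "-"), ("N", None)]), # (4)
    (("AP", "+"),  [("A", "+")]),                              # (5), x = +
    (("AP", "-"),  [("A", "-")]),                              # (5), x = -
]

def derive(cat, words, i):
    """Yield every position j such that cat can span words[i:j],
    backtracking over rule choices and lexical ambiguity."""
    if i < len(words):                       # stage 1: lexical lookup
        for lexcat in LEXICON.get(words[i], []):
            if lexcat == cat:
                yield i + 1
    for mother, daughters in RULES:          # stages 2-3: connect categories
        if mother == cat:
            yield from derive_seq(daughters, words, i)

def derive_seq(cats, words, i):
    """Derive a sequence of daughter categories left to right."""
    if not cats:
        yield i
        return
    for j in derive(cats[0], words, i):
        yield from derive_seq(cats[1:], words, j)

def grammatical(sentence):
    """A sentence is grammatical if an S node spans all of its words."""
    words = sentence.lower().split()
    return any(j == len(words) for j in derive(("S", None), words, 0))

print(grammatical("that man-eating fish is beautiful"))  # True
print(grammatical("that fish is man-eating"))            # False
```

The second sentence fails exactly as the text explains: rule (2) demands AP[prd +], but man-eating is listed only as A[prd -].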
The third stage yields the complete tree:

[S [NP Det [AP[prd -] A[prd -]] N]   [VP V [AP[prd +] A[prd +]]]]
    That man-eating         fish      is  beautiful

Gazdar (1987) gives a more detailed informal description of the parsing and matching processes used in grammars of this type, and Winograd (1983) describes and explains a variety of parsing algorithms which could be used as the basis for an automated parser for such grammars. We have presented this example in some detail because it illustrates clearly the interplay between general rules (grammar) and lexicon. The lexicon given is adequate for this particular grammar, but would need to change almost every time new rules were introduced into the grammar or further words added into the lexicon. For example, to add a non-copular verb, such as loves, we would need to distinguish further between different types of verb, both in the grammar and lexicon, to prevent the generation of examples such as:

* That fish loves beautiful

In general, there will be an intimate connection between the general rules incorporated into a grammar (or NLP system) and the nature of entries in the lexicon. The lexicon provides the information not predictable from the rules, which feeds the rules and ensures they function correctly. This example also demonstrates the need for both broad part of speech and subcategory information (represented by syntactic features) in the lexicon. This issue is the subject matter of chapters 3, 4 and 5.

… the fact that the first syllable of whisper carries main stress might be felt to follow from a rule assigning main stress to all initial syllables of English polysyllabic words. In this case, only exceptions to this rule, such as division, need main stress to be explicitly marked in their lexical entries. This example again emphasises the point that what counts as idiosyncratic, unpredictable information depends almost entirely on …

Word meaning, like pronunciation, does not appear to be predictable on general grounds either. We observed that pronunciation and meaning appear not to be linked in a principled fashion, and there are, at most, only very weak relationships between meaning and part of speech. For example, although many nouns refer to physical objects or things, many do not. Therefore, the meaning(s) of words will be defined in their lexical entries. However, this conclusion is complicated by considerations of morphology rather similar to those which arose with part of speech information. For example, the meaning of think does not appear to be predictable, but the meaning of rethink is predictable on the basis of a morphological rule which combines the meaning of the prefix re- with the meaning of think. These considerations suggest that morphological knowledge interacts even more intimately with the lexicon.

So far we have been using the term word uncritically and without definition. We have also implicitly assumed the conventional view of a dictionary as a list of words in our discussion of the role of the lexicon in NLP systems. However, the common sense concept of a word as an indivisible unit is inaccurate. Many words are morphologically complex forms constructed from a number of more basic morphemes, for example, re+think. Insofar as the phonological, syntactic and semantic properties of these derived words are predictable by morphological rule, they do not need to be listed separately in the lexicon. In this sense, morphological knowledge is more intimately connected with the lexicon because it provides a set of rules for organising the lexicon in a maximally economic form. However, many derived words are not fully productive in this fashion; for example, reproduce in one sense means something more specialised, although related to, produce again, whilst the semantic … entries for bound morphemes, such as re-, which cannot stand alone and are not usually felt to be words. In addition, it may not contain entries for some derived words whose behaviour is predictable on the basis of morphological rules. An entry in the lexicon will represent those facts about a word (or morpheme) which are unpredictable or idiosyncratic with respect to the other rules in the system. This will
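The economy argument above can be made concrete. In this toy illustration (our own, not the book's), the lexicon lists only unpredictable entries; a re- rule derives regular forms such as rethink on demand, while lexicalised exceptions such as reproduce carry their own listed entry that overrides the rule.

```python
# Only idiosyncratic facts are listed; predictable forms are derived.
LEXICON = {
    "think":     {"cat": "V", "meaning": "consider"},
    "produce":   {"cat": "V", "meaning": "bring into existence"},
    # NOT simply 'produce again', so it must be listed explicitly:
    "reproduce": {"cat": "V", "meaning": "breed"},
}

def look_up(word):
    """Listed (idiosyncratic) entries take priority over derivation."""
    if word in LEXICON:
        return LEXICON[word]
    # Morphological rule: re- + V means 'V again'.
    base = LEXICON.get(word[2:]) if word.startswith("re") else None
    if base and base["cat"] == "V":
        return {"cat": "V", "meaning": base["meaning"] + " again"}
    return None   # neither listed nor derivable

print(look_up("rethink"))    # derived: 'consider again'
print(look_up("reproduce"))  # listed exception: 'breed'
```

The priority given to listed entries is exactly the sense in which a lexical entry records only what is "unpredictable or idiosyncratic with respect to the other rules in the system".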
… they raise the possibility of wider commercial application of NLP technology. A text-to-speech synthesis or question-answering system with an exhaustive knowledge of English syntax and morphology will still be woefully inadequate if its vocabulary is small.

1.2 Computational lexicography

A great number of NLP systems are strikingly limited in their range of application. This lack of versatility is to a large extent due to the small number of lexical entries typically available to them. A recent workshop on linguistic theory and computer applications (Whitelock …) … already digested, categorised, indexed and, most importantly, available in machine-readable form, can be suitably used, if not to get a sizable …

… further exacerbated by the use of different representations for similar information. Below, we reproduce lexical entries for acknowledge from the BBN CFG system (Ingria, 1988), the IRUS system (Bates et al., 1986), and the Alvey morphological and syntactic analyser (see Carroll and Grover, this volume).

[ACKNOWLEDGE
 Category: V
 Base: acknowledge
 Features: (TRANSITIVE … (PASSIVIZABLE …))]
;; 1, 2, 3, 4, 5
(acknowledge … ((v +) (n -) (subcat …)) acknowledge_1)
(acknowledge … ((v +) (n -) (subcat …)) acknowledge_2)

These entries contain rather similar information concerning the syntactic classification of words into parts of speech and further, more fine-grained categories. However, this information is encoded in such different ways that the lexicons associated with each system would not be intersubstitutable between systems. Every NLP system has its own ideas and conventions concerning the content, organisation, and structure of its lexicon. Such a state of affairs is partly justified by differences, theoretic or organisational, in the individual systems' approaches to the task chosen (see, for example, …).

… lexicons for specific systems, as well as providing a useful resource for investigating various lexicographic properties of the language. … having considerably more accumulated experience of lexicography than researchers in NLP. It therefore seems expedient to capitalise on this experience as much as possible rather than reinvent the wheel. Once the decision to make use of existing dictionaries has been made, the problems of computational lexicography become those of modification and conversion of existing MRDs to a database capable of exploitation by machine. These problems fall into two broad categories.

Firstly, published dictionaries are organised for human use and rely heavily on the user's background linguistic and common sense knowledge to retrieve and comprehend the information they contain. Secondly, this information is usually presented in an informal rather than systematic fashion and often rests on inappropriate linguistic models, from the perspective of NLP. In the next two sections we explore these two problems in greater detail.
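The non-intersubstitutability point above can be illustrated in code. This sketch is ours, and the output formats are only loosely modelled on the fragmentary entries quoted above (the field names and the `acknowledge_1` suffix are assumptions for illustration): the same neutral record must be re-serialised differently for each system.

```python
# One system-neutral record for a verb entry.
entry = {"base": "acknowledge", "cat": "V", "transitive": True}

def record_style(e):
    """Emit a bracketed attribute/value record (BBN-like layout)."""
    feats = "(TRANSITIVE)" if e["transitive"] else "()"
    return (f"[{e['base'].upper()}\n"
            f" Category: {e['cat']}\n"
            f" Base: {e['base']}\n"
            f" Features: {feats}]")

def sexpr_style(e):
    """Emit an s-expression of feature pairs (Alvey-like layout),
    with (v +) (n -) encoding the major category of a verb."""
    v = "+" if e["cat"] == "V" else "-"
    n = "+" if e["cat"] == "N" else "-"
    subcat = "(subcat NP)" if e["transitive"] else "(subcat)"
    return f"({e['base']} ((v {v}) (n {n}) {subcat}) {e['base']}_1)"

print(record_style(entry))
print(sexpr_style(entry))
```

Even for this three-field record the two serialisations share no syntax, which is why lexicons built for one system cannot simply be loaded into another.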
… other part of speech. It is possible to divide the entry into a number of distinct components. … field, and 'is intended to represent a single morpheme' (Ritchie et al., 1986:4, emphasis added). The … form … contains the headword … between them in a separate field: thus the subject code for the first sense of the noun … In addition, the machine-readable version of LDOCE contains information not printed in the published dictionary. Word senses can be tagged with subject and box codes, notational devices which, in a very compact system of representation, encode semantic notions: the overall context in which a word sense is most likely to appear (for example, politics, religion, language) and selectional restrictions on verbs, nouns and compound phrases. For the entries … above, the box code for the first sense is -_L_X--__S, where X in the fifth position of a ten-dimensional character vector denotes …

… (see chapter 6) and the grammatical coding system (see chapters 3, 4 and 5) from LDOCE demonstrate that even here things are not so straightforward. For example, few dictionaries, including LDOCE, represent syllables explicitly in pronunciation fields (except when a stress marker occurs). Therefore, it is necessary to parse the pronunciation field to assign syllable boundaries (see chapter 6). Similarly, the grammar coding system employed in LDOCE derives from a specific linguistic model (see chapter 3) which is not appropriate for automated parsing for some classes of NLP systems (see chapter 4).
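The syllable boundaries that the pronunciation field leaves implicit can be recovered algorithmically. The following is a sketch of one common general idea, onset maximisation, not the actual procedure of chapter 6, and it uses a deliberately tiny toy phoneme inventory: each boundary is placed so that the following syllable receives the longest consonant cluster that can legally begin a word.

```python
VOWELS = set("aeiou@")                       # toy phoneme inventory
LEGAL_ONSETS = {"", "p", "t", "k", "s", "st", "str", "pr", "sp", "w"}

def syllabify(phonemes):
    """Split a flat phoneme string into syllables by onset maximisation."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    syllables, start = [], 0
    for v1, v2 in zip(vowel_positions, vowel_positions[1:]):
        cluster = phonemes[v1 + 1:v2]        # consonants between two vowels
        # give the next syllable the longest legal onset of the cluster
        for k in range(len(cluster) + 1):
            if cluster[k:] in LEGAL_ONSETS:
                break
        boundary = v1 + 1 + k
        syllables.append(phonemes[start:boundary])
        start = boundary
    syllables.append(phonemes[start:])
    return syllables

print(syllabify("wisp@"))   # a toy rendering of 'whisper'
```

A real implementation would, of course, work over the dictionary's actual phoneme symbols and use the stress markers that LDOCE does print as additional boundary evidence.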
… Therefore, in order to analyse the codes it is necessary to tease apart syntactic and semantic facts and to recover information which is implicit, rather than explicit, in the coding scheme. Ultimately, this process can never be 100% successful; so, in order to derive a reliable (grammatically-subcategorised) lexicon, some form of semi-automatic, human-aided compilation of the target lexicon, rather than batch processing, is required (see chapter 5).

These problems are compounded by the diversity of potential theories of phonology and syntax which might be deployed with the lexicon, each of which may represent similar information in diverse ways or, worse, draw a rather different division between information in the lexicon and the rule component within the grammar … approaches based on templates or key words seem promising (for example, Calder and Lindert, 1987); however, these techniques cannot hope to solve the problem …

Turning to sense definitions in textual form, there is no explicit restriction on the language employed. Thus it might appear that in order to make use of this information an NLP system would require very sophisticated language understanding capabilities, creating a catch-22 situation. In fact, it is not quite this bad, because the language used in definitions tends to be of a rather restricted form, typically consisting of a noun phrase containing a head or genus term and modifiers or differentiae (cf. the LDOCE definition for the first sense of bubble: 'a hollow ball of liquid containing air or gas'). In addition, as chapters 7 and 8 indicate clearly, the liberal use of derivational morphology to extend the definitional vocabulary, and the use of this vocabulary in more than one sense, makes the situation less ideal. However, it is fair …

… by machine. In this, LDOCE is unique amongst MRDs in attempting a formal representation of some aspects of meaning, and this no doubt accounts for some of its popularity with NLP researchers. However, there are problems concerning the accuracy and completeness of this type of information in LDOCE … none of the work reported in this book makes significant use of these codes (see Byrd et al., 1987, for further discussion and description of work with LDOCE box codes).

A similar problem arises concerning the diversity of theories which might want to exploit pronunciation information from an MRD; the same issue arises, perhaps to an even greater extent, with semantic knowledge and its concomitant assumptions about the nature of lexical entries. Small's (1980) theory of word expert parsing requires detailed knowledge of the architecture of the parser, acquaintance with a specialised language for writing word experts, judgement of what constitutes linguistically relevant information and how to represent it procedurally, and readiness to bring arbitrary amounts of more general, common world knowledge into the lexicon; no division is made between syntactic, semantic or pragmatic knowledge. The point here is not so much whether the theory in question has value, but that this is an instance of a lexicon for an NLP system whose organisation the theory imposes …

Sowa and Way (1986) present a different theory of text interpretation in context, relying on conceptual graphs. The implications for the content and structure of a lexicon to be used by the parser are non-trivial; in particular, issues of reconciling formal notions of representation of meaning, like lambda-abstraction, with conceptual graphs and operations on concepts …
… the formalisation … can be achieved. This poses the general question of how to establish the common ground assumed by the individual lexicographers during the process of writing definitions … there is no implication that … a definition can …

Machine-readable sources … still primarily for … dictionaries can afford to be quite … to rely to an enormous extent on their readers' judgement and intelligence …

The semi-formal codes which form a component of an entry can present considerable problems for automated processing, because of the semi-formal nature of the representation schemes employed and the willingness of lexicographers to make minor modifications to these schemes … Examination of LDOCE suggests that the code field follows a well-defined system in which commas, semi-colons and colons are used to delimit and abbreviate sequences of codes constructed from letter and number pairs. As an example, consider a grammar code field whose expanded, unabbreviated form is shown below:

sense-no 1   head: T1
             head: T5
             optional (to be)   head: X1 right
             optional (to be)   head: X7 right

… immediately after them, depending presumably on the individual lexicographer's interpretation of the instructions for laying out entries. These deviations mostly went undetected because of the lack of any checking of the structure of lexical entries (see Michiels, …) … typesetting commands, such as the instruction to insert a thin space … which modifies the interpretation of the code, but some … Finally, there are the problems caused by errors of omission or commission, where codes are either misapplied or left out.

These problems and other aspects of the LDOCE grammar coding system are discussed in greater detail in chapters 3, 4 and 5. Similar problems arise, to a greater or lesser extent, with the other formal information in the lexical entry (see chapter 6 for a discussion of pronunciation fields).

… the language of definitions is a restricted subset of English drawn from a restricted defining vocabulary, typically noun phrases (see chapter 8). There are a variety of approaches to the extraction of sense information, ranging from probabilistic assignment of sense numbers to tokens of the definitional vocabulary which occur in particular definitions (see chapter 9) through to genus and differentiae spotting systems (see chapters 7 and 8), none of which require automated comprehension of the definition. However, all of these techniques fail when faced with circularity of definition or heavy reliance on cross references. In these situations, the real definition is not to be found in the entry associated with the relevant head word. Human dictionary users cope with this situation fairly well, despite the fact that the organisation of a published dictionary makes following a chain of cross references tedious
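The delimiting conventions described above suggest a simple expansion step. The sketch below is ours, built on assumptions the fragment supports (semicolons separate code groups, a group starts with a capital-letter class, commas abbreviate repeated classes so that "T1,5" means T1 and T5, and a trailing parenthesis such as "(to be)" qualifies the whole group); the real LDOCE conventions have further wrinkles that chapters 3 to 5 document.

```python
import re

def expand_codes(field):
    """Expand an abbreviated LDOCE-style grammar code field into
    (code, qualifier) pairs, e.g. 'T1,5' -> T1 and T5."""
    expanded = []
    for group in field.split(";"):
        m = re.match(r"([A-Z])([\d,]+)\s*(\(.*\))?", group.strip())
        if not m:
            continue                      # unparseable group: skip it
        letter, numbers, qualifier = m.groups()
        for n in numbers.split(","):
            expanded.append((letter + n, qualifier))
    return expanded

print(expand_codes("T1,5; X1,7 (to be)"))
```

On the hypothetical field "T1,5; X1,7 (to be)" this reproduces the shape of the expanded listing above: T1, T5, then X1 and X7, each qualified by "(to be)".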
1.3 Overview of work with MRDs in NLP

… assumption to make, given the reality of human linguistic performance and the practical issues of keeping printed dictionaries within reasonable size … to be able to perform at least some sort of morphological processing. An interface to an MRD must regard this as a dynamic functional process and must provide, in addition to the static object … data encoded as a set of morphemes with associated features held in the online dictionary … for textual input, may only require access to lexical entries … access via hyphenation, and so forth. In the following … ous speech will require primary access … it is possible that it will need to constrain the access process … incoming speech and syntactic predictions about forthcoming words may help the word recognition process. In this situation, the optimal access query may take the form of a set of (partial) constraints on the target word, such as: bisyllabic, a word beginning and ending with a voiceless stop, which can function as a prenominal adjective (see …

Dictionaries are more structured than free text, but not so structured that they will fit neatly into a conventional database system with fixed format records; for example … raises a number of more specific problems … In the following … the development of effective and efficient algorithms for … word lists. Such resources have been compiled, for example, by … Yannakoudakis (1983) … 57,000 words … from The Teacher's Word Book of 30,000 Words (Thorndike …). Mitton (1986) has initiated a project on using the OALD to correct certain classes of misspelt words. Word lists have also been generated from machine-readable sources
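The "set of (partial) constraints" query described above can be modelled as a set of predicates, each checking one field of a lexical entry; an entry matches when every predicate holds. This is a sketch of the idea, not the book's interface: the three-entry database is invented, and spelling here stands in for the pronunciation field a real system would consult.

```python
VOICELESS_STOPS = set("ptk")

# Invented mini-database; a real one would come from the MRD itself.
DATABASE = [
    {"spelling": "petit",  "syllables": 2, "prenominal": True},
    {"spelling": "tepid",  "syllables": 2, "prenominal": True},
    {"spelling": "pretty", "syllables": 2, "prenominal": True},
]

# The example query from the text: bisyllabic, beginning and ending
# with a voiceless stop, usable as a prenominal adjective.
query = [
    lambda e: e["syllables"] == 2,
    lambda e: e["spelling"][0] in VOICELESS_STOPS,
    lambda e: e["spelling"][-1] in VOICELESS_STOPS,
    lambda e: e["prenominal"],
]

matches = [e["spelling"] for e in DATABASE
           if all(constraint(e) for constraint in query)]
print(matches)
```

Only "petit" survives all four constraints ("tepid" and "pretty" fail the final-segment test), illustrating how partial phonological and syntactic evidence from incoming speech can narrow the candidate set before full recognition.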
Chapter 1  Introduction
1.3 Overview of work with MRDs in NLP

(Nor need such access necessarily involve a good morphological analysis component, capable of handling derivational and inflectional morphology alike.) Briscoe (1985) lists various current projects in speech recognition and synthesis in the UK (sited at Edinburgh University, Leicester Polytechnic, the Joint Speech Research Unit at Cheltenham, the University of Cambridge and the IBM Scientific Centre at Winchester), all of which make use of sources like OALD or Collins for the task of compiling special-purpose word lists transcribed into project-specific phonemic alphabets, incorporating primary and secondary stress assignment and marking of syllable boundaries.

Work with MRDs spans systems of various types: support for natural language processing tasks, the design of lexicographers' workstations, or the lexical analysis of dictionary data on a large scale. Cowie (1983) presents a system for analysing descriptive texts into hierarchically structured fragments of taxonomically organised knowledge.

Following some work of a more empirical nature in the sixties, Amsler (1980), in the course of a major comprehensive investigation into the content and structure of an MRD, presents conclusive evidence that a dictionary contains a non-trivial amount of information which can be semi-automatically structured into a semantic hierarchy of defining concepts. More recent work aiming to extract semantic information from standard dictionary definitions and to build taxonomies of defining concepts shades off into research which attempts to uncover additional, implicit, structure in a dictionary, and to construct hierarchically structured networks of concepts for use in semantic processing. This requires more sophisticated processing of sense definitions than is attempted in the work cited above. It is this underlying goal which lies behind much of the work that analyses definitions (see chapters 7, 8 and 9); we discuss semantic processing in section 1.3.6 below.

1.3.3 Access / browsing

Much of the work described in this book requires a (set of) program(s) for the extraction of lexical data from a machine-readable source. Until recently, few projects reported in any detail on this aspect of work with MRDs. This situation, however, is rapidly changing. In the context of working with a single dictionary (where the term 'database' is often used in the literature not in its full computational sense), a major Waterloo-based effort for computerising the New OED has provided the context for a study into normalising a machine-readable source (Kazman, 1986) and for developing a special-purpose data model for a dictionary database (Tompa, 1986; Gonnet and Tompa, 1987).

The Lexical Systems group at IBM Yorktown Heights has been working on WordSmith, an automated on-line dictionary system designed to offer browsing functionality: users can retrieve words, from a number of dictionaries (for example, W7, LDOCE, The Collins Thesaurus and several Collins bilingual dictionaries), which are close to a given word along dimensions such as spelling, meaning and sound (Byrd et al., 1987). Some of the tasks WordSmith has been applied to, namely developing techniques for segmenting and matching word spellings and generating pronunciations for unknown words, are described in Byrd and Chodorow (1985).

The more demanding context of working with multiple dictionaries presents an additional set of problems. Particularly important here is the issue of normalising all sources to the same internal format, while maintaining the flexibility and power of a fully functional database design, so that, for example, browsing operations like those above are possible and perform equally efficiently on all available MRDs. These, and related, questions are being investigated by the Lexical Systems project at IBM (Neff et al., 1988).

Other work has investigated methods for constructing, and browsing through, a network of sense relations derived from dictionary definitions and, in particular, from their synonym and cross-reference pointers. This activity has been carried out in the context of George Miller's WORDNET project (Miller, 1985), whose goal is to develop a system, appropriately organised and indexed, equipped with navigational aids for examining complex conceptual spaces without having to conform to the conventional alphabetic arrangement of dictionary data. The system is currently aimed at human users; a suitable functional counterpart, however, offering equivalent freedom of association, could be of great utility to a computer program for, say, robust interpretation of free text.

Recent work by Huttenlocher, Shipman and Zue of the Massachusetts Institute of Technology has investigated an alternative model of lexical access to the one commonly utilised by current approaches to speech recognition (Shipman and Zue, 1982; Huttenlocher and Zue, 1983). Instead of applying template-matching techniques, which become inadequate for tasks requiring large vocabularies, they have analysed the particular knowledge about language and speech available in dictionaries, and have developed a special representation of the speech signal based on broad phonological constraints. Classification of words using these phonological categories achieves the partitioning of the 20,000-word MRD into relatively small equivalence classes, thus greatly reducing the space of possible word candidates on lookup and making the whole process of lexical access less sensitive to speaker variation and other variabilities in the speech signal (see chapter 6 for further details and an evaluation of this work).

The desire for powerful browsing capabilities leads not only to considering how best to adapt existing database technology to the needs of different lexical projects. CD-ROM and similar low-cost, high-capacity distribution media offer a new perspective on computerised access to existing reference works (Hodgkin, 1987, discusses the implementation of the New OED on CD-ROM, as well as presenting a more general introduction to the promises and limitations of the medium).

Finally, the computerisation of a machine-readable dictionary source is not only concerned with providing quick access to a mass of data via conventional database technology; explorations are under way into alternative organisations of the dictionary, the most promising of these being hypertext (Raymond and Tompa, 1987).

Recent insights in linguistics, and in particular developments in grammatical theory (for example, Kay's Functional Unification Grammar (Kay, 1984b) and PATR-II (Shieber, 1984, 1985)), have renewed interest in the syntactic information encoded in dictionaries.
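The broad-class partitioning investigated by Shipman and Zue can be sketched compactly. This is an illustrative reconstruction only: the broad-class inventory and the pronunciations below are invented for the example and do not reproduce the MIT category set.

```python
from collections import defaultdict

# Illustrative broad phonological classes; the actual category inventory
# used in the MIT work differs, and pronunciations here are simplified.
BROAD = {
    "p": "stop", "t": "stop", "k": "stop", "b": "stop", "d": "stop", "g": "stop",
    "s": "fric", "z": "fric", "f": "fric", "v": "fric",
    "m": "nasal", "n": "nasal",
    "aa": "vowel", "iy": "vowel", "ey": "vowel", "ow": "vowel", "ae": "vowel",
}

def signature(phones):
    """Map a pronunciation to its broad-class signature."""
    return tuple(BROAD[p] for p in phones)

def partition(pron_dict):
    """Group words into equivalence classes by broad-class signature."""
    classes = defaultdict(list)
    for word, phones in pron_dict.items():
        classes[signature(phones)].append(word)
    return classes

prons = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "aa", "g"],
    "see": ["s", "iy"],
    "fee": ["f", "iy"],
}
classes = partition(prons)
# 'cat' and 'dog' share (stop, vowel, stop); 'see' and 'fee' share (fric, vowel)
```

On a realistic lexicon the interesting empirical question, addressed in chapter 6, is how large these equivalence classes turn out to be, since class size bounds how much a coarse transcription narrows the candidate set.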
The LDOCE grammar codes (and the comparison between the coding systems of LDOCE and OALD) allow for detailed syntactic subcategorisation of individual word senses. It is only natural to attempt to use this information about the idiosyncratic distributional behaviour of words for syntactic analysis. Since the system employed by LDOCE for grammatical tagging is based on the descriptive grammatical framework of Quirk et al. (1972), the mapping of the original dictionary format into a lexical entry appropriate to one of the syntactic theories mentioned above is not trivial (see chapter 4).

1.3.6 Semantic processing

Most of the work mentioned in the previous sections makes use of the form and function of a dictionary entry. The work on disambiguation and dictionary browsing appears to be an exception, in that the associated techniques and methods need access to a definition (or a subject field code, a synonym, or a cross-reference). This is, however, somewhat misleading, as the semantic content of the definition is not really made use of.

There is still another aspect of the work on NLP systems in general to which recent research in making use of the information available from MRDs is particularly relevant. Language processing programs fall within the larger category of knowledge-based systems (see below). In order to carry out their (semi-)intelligent functions, such systems need significant amounts of (pragmatic) knowledge about the real world, or at least about a particular domain of discourse. A common problem addressed during the design process is that of acquiring such knowledge, and there is strong hope that ways can be found to localise and extract some of it from suitable machine-readable sources, namely dictionaries or encyclopaedias, even though these present much more loosely defined, placed and structured raw data. Ultimately, the goal is to relate natural language words to an underlying taxonomy of concepts, typically the one which binds together the defining concepts in a dictionary. This would involve a range of activities considerably more difficult than those which reduce to, say, simple lexical look-up, part-of-speech extraction, and its mapping into a data structure suitable for subsequent use by, for example, an on-line parser.

Current views on automatic natural language processing tend to agree that there seems to be a continuum between the minimal semantic knowledge implied by the use of a particular word (word sense) and the specialised (or expert) knowledge relevant to its use in a given domain context (see, for example, Wilks, 1977, or, more recently, Cater, 1987). For a practical NLP system there are very pragmatic reasons for distinguishing between lexical semantics and specialised world knowledge. It would be unreasonable to expect to find any of the latter in an MRD. It would, however, be of enormous utility if most of the former could be derived from a machine-readable source, no matter whether it is presented in terms of decomposition into semantic markers (Katz and Fodor, 1963), formulae constructed from semantic primitives (Wilks, 1977), frame-based structures (Hirst, 1987), logical predicate/function symbols with associated sortal information encoded in the form of meaning postulates (Grosz and Stickel, 1983), or by some other means. In this context, there are attempts to compile some lexical semantics from an MRD into lexicons for NLP systems. Chapter 9 describes a project to use LDOCE for automatically constructing semantic definitions (formulae) for individual word senses, for use in a preference semantics framework.

Still, we should not expect to be able to generate a complete semantic component for an NLP system semi-automatically, in the same way in which we are attempting to flesh out the syntactic one.

An analysis of the cases where heavy use has been made of the in...
...able for a backward chaining expert system. Even within the area of NLP there is no firm consensus on what kinds of structure are best suited for capturing the knowledge useful for language interpretation and understanding. Nonetheless, it is possible to observe a common theme in a large number of language processing systems which, in addition to the relatively narrowly defined language-specific data, make use of more general, real-world knowledge, as well as of more specialised, domain- and task-dependent knowledge. More often than not this kind of knowledge is represented using a scheme based on the general notions of frame-like concepts with slot-like role descriptions, organised along a generalisation/specialisation (inheritance) hierarchy. Most of the recent work on knowledge representation, including FRL, KRL, NETL, AIMDS, UNITS or KL-ONE (see Brachman and Webber, 1980, or Mark, 1981, for details), adopts this view.

An early precedent was set at the Cambridge Language Research Unit (Masterman et al., 1957), where the problem of word sense disambiguation was tackled by exploiting the structure of Roget's Thesaurus. More recently, Amsler (1983) makes certain assumptions about the (usually implicit) taxonomic structure of the defining concepts in a dictionary; his approach relies on this structure being made explicit prior to the enterprise (not an unreasonable standpoint). Lesk's method (1986a) is even more direct, in that all it requires is a machine-readable source, and not of any particular dictionary at that. He uses heuristics depending on the overlap between words in the dictionary-provided sense definitions for the words across a context window.

We already mentioned (see 1.3.2) that, since dictionaries embody an additional and implicit structure within which their defining concepts are organised, they promote efforts for an initial compilation of taxonomies from definitions, with the aim of extracting a variety of relationships among... Similar concerns have motivated the work of Markowitz and the implementation of a program for the analysis of complex nominals.
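Lesk's overlap heuristic is simple enough to sketch. The following is an illustrative reconstruction, not Lesk's original code, and the sense definitions are toy examples (loosely modelled on his well-known pine/cone illustration): choose the sense of a word whose definition shares the most words with the definitions of the context words.

```python
# Toy reconstruction of Lesk-style disambiguation: pick the sense whose
# dictionary definition shares the most words with the definitions of the
# surrounding context word. Definitions here are invented for illustration.

SENSES = {
    "pine": {
        "pine#1": "a kind of evergreen tree with needle-shaped leaves",
        "pine#2": "to waste away through sorrow or illness",
    },
    "cone": {
        "cone#1": "a solid body which narrows to a point",
        "cone#2": "the fruit of an evergreen tree",
    },
}

def overlap(def_a, def_b):
    """Number of distinct words shared by two definitions."""
    return len(set(def_a.split()) & set(def_b.split()))

def disambiguate(word, context_word):
    """Choose the sense of `word` whose definition best overlaps the context word's senses."""
    best_sense, best_score = None, -1
    for sense, definition in SENSES[word].items():
        score = max(overlap(definition, d) for d in SENSES[context_word].values())
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(disambiguate("pine", "cone"))   # 'pine#1' (shares 'evergreen' and 'tree')
```

Nothing here depends on the structure of any particular dictionary, which is exactly the appeal of the method noted above.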
This largely disposes of the need for genus term disambiguation, and has made it possible to concentrate on developing robust parsing techniques which are capable of extracting more information from the dictionary and are, in that sense, superior to the approaches proposed by, for example, Ahlswede (1983), ... (1983) and Gaviria (1983). Alshawi's definitions analysis procedure is usually more reliable for precisely locating the semantic heads of definitions, and it achieves further semantic precision by using other information present in the definition, for example modifiers and predications. The result of the analysis process is a frame-like structure with filled-in slots, supporting a more accurate classification of concepts in a particular domain (see, for instance, Haas and Hendrix, 1983); one could imagine a program used to extract data relevant to these questions from the definitions, say by finding instantiations of the hyponyms in their definitions. Within such a framework, on-line access to a semantic hierarchy of defining concepts is a necessary prerequisite for the collection and analysis of empirical data, with the ultimate aim being the derivation of a set of interpretation rules, eventually generalised through the hierarchy to an extent where they can be used to disambiguate a wide range of non-lexicalised compounds.

It would seem that the only application within the broad classification of natural language processing where MRDs have not found use so far is that of language generation. Just a quick glimpse at the lexical-level problems to be solved in the process of generating text will suffice to show why MRDs have not been seriously exploited in this context. Ritchie (1987) mentions some of the questions to be considered during the process of lexical selection: what is the input to this process, a syntactic structure, a semantic representation, or a mixture of both? What are the dynamics of the lexical selection process, i.e. how does the preference for a particular word affect the subsequent choice of words? What is the architecture of the lexical selection mechanism?

1.4 Reliability and utility of MRDs

Some broader aspects of working with machine-readable dictionaries have already been discussed (for example in 1.3.2 and 1.3.6), together with the classes of application which critically require access to such sources. It now seems possible to ask questions like, for example: what is the cost (the manpower effort spent on a project) of attempting to harness what can sometimes be a bulky and unwieldy object?
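The notion of a semantic hierarchy of defining concepts can be illustrated with a deliberately crude sketch. Everything below is hypothetical: real definitions-analysis programs (such as Alshawi's, chapter 7) parse the definition properly rather than taking the first content word as the genus.

```python
# Toy sketch of deriving a hierarchy of defining concepts: take the head
# of each definition's genus phrase as the parent of the defined word.
# Definitions and the head-finding rule are simplified for illustration.

DEFS = {
    "hammer":    "a tool with a heavy metal head",
    "tool":      "an instrument used to do work",
    "container": "a box or vessel for holding things",
}

STOP = {"a", "an", "the", "or"}

def genus(definition):
    """Crude genus extraction: the first content word of the definition."""
    for word in definition.split():
        if word not in STOP:
            return word

hierarchy = {word: genus(d) for word, d in DEFS.items()}
# {'hammer': 'tool', 'tool': 'instrument', 'container': 'box'}
```

Chaining the parent links (hammer is a tool, a tool is an instrument, ...) is what yields the semantic hierarchy over which interpretation rules can then be generalised; it is also where the circularity problems discussed below become visible.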
Below are some more detailed points, touching upon the overall packaging, coverage and reliability of data in machine-readable sources, likely to be of interest to researchers in computational linguistics.

The first, and in a way most obvious, concern is with the expectations for the lexical coverage of the dictionary. Notwithstanding the status of information stated explicitly in the lexicon (viz. the full entry model of the lexicon; see Jackendoff, 1975) or derived on an 'as-needed' basis (viz. the impoverished entry model), the fact remains that no dictionary can be expected to offer complete coverage: one must admit that 'there does not exist a list of English that enumerates the true lexical units of the language...' (p.173). Walker and Amsler (1986) present empirical evidence of the huge disparity between the contents of what is, by most standards, a perfectly respectable dictionary (W7), and the vocabulary behind the New York Times News Service. Church's work on stress assignment (1985) is partly motivated by a very similar observation.

Even when the appropriate lexical entry is found, there is still the question of how far we can rely on it. However carefully a proof-reading procedure is applied by the publisher, errors remain. Since the tape was used for typesetting as well, the possibility of typographically induced errors would be removed (see the next section). Still, this would not guarantee safe processing: the published version contains, for example, a few dozen entries in which a parenthesised fragment has a missing opening or closing bracket. An oversight of this kind on the tape meant that, prior to its subsequent use by LISP programs, a considerable amount of time was lost trying to read the tape into memory without losing track of boundaries between individual entries. The only way to develop such a grammar is by applying it to a large number of entries and manually inspecting the results: a trial-and-error process which could have been altogether avoided by an explicitly defined grammar made available to the lexicographers prior to the dictionary development effort. The same point is made, in a larger context and more emphatically, by Kazman (1986), who has developed a grammar and associated software for parsing the full entries of the OED.

Even when seemingly stringent constraints are imposed on dictionary definitions, as with LDOCE's controlled defining vocabulary, only the core words of that vocabulary are considered central. Much more serious, and cause for many failings of the definitions analysis programs described above, is the somewhat liberal interpretation of the phrase 'only easily understood derivatives'. Thus, allowing derivational morphology to creep into the basic set gives some lexicographer the power to use 'container' for the definition of box2(1), even though only the verb 'contain' is considered to be primitive. Elsewhere, container(2) is defined as 'a very large, usu. metal box...', thus clearly violating the promise of non-circular definitions, described in simpler terms. A program attempting to use the semantic hierarchy derived from analysis of these terms is destined for the equivalent of an infinite loop.

A different kind of trap is introduced by an equally liberal use in the definitions of phrasal verbs made up from verbs and particles taken from the restricted vocabulary. For example, the second meaning of contain(2) is given as 'to hold back, keep under control...'. While both hold (as a verb) and back (as a noun) are within the core vocabulary, 'hold back', with its own meaning as a phrasal verb, is not. A fundamental assumption on which Alshawi's definitions analysis program (see chapter 7) is based has been violated. Not surprisingly, the result of the conversion of this particular definition into a semantic structure...

1.5 Organisation of the contributions

This book contains contributions written by three separate groups of researchers who have undertaken significant work with the machine-readable version of LDOCE. All of this work is focused on enhanc...
When the customary grammatically coded transcriptions based on broad phonological classes are used, assessments of expected class size can, even when weighted by word frequency, be misleading. He proposes an alternative, information-theoretic measure and argues that it is superior. Several LDOCE-based experiments are reported that, through using this improved methodology, shed new light on some questions of speech recogniser design.

...determined by a hierarchy of patterns in which less specific patterns dominate the more specific ones. This ensures that reasonable, incomplete analyses of the definitions are produced when more complete analyses are not possible, resulting in a relatively robust analysis mechanism. Thus the work reported addresses two robustness problems faced by current experimental natural language processing systems: coping with an incomplete lexicon and with incomplete knowledge of phrasal constructions.

Chapter 8: Meaning and structure in dictionary definitions

The aim of the LINKS project is the development of a semantic database in which the meaning descriptions in the LDOCE are stored in a systematically related way. The underlying theoretical framework is that of Dik's (1978b) stepwise lexical decomposition, which does not make use of an abstract semantic metalanguage but instead, guided by a well-defined economy principle, reduces the meanings of lexical items, via a stepwise network of chains of meaning descriptions, to a restricted set of basic lexical items. The underlying assumption, at the outset, was that such an approach would be a feasible option because of LDOCE's use, in all its meaning descriptions, of a restricted vocabulary.

In this chapter Vossen, Meijs and den Broeder outline the basic methodology for the project. First, they apply an appropriate grammatical coding to the words in the meaning descriptions, deriving a grammatically coded corpus of meaning descriptions. Subsequently they develop a syntactic and semantic typology, based on the findings in this corpus. This involves a parser-grammar distinguishing premodifiers, kernels, postmodifiers, etc., and an explanation of the semantic effects of different types of kernels and the structures in which they occur, classified into four main types: links, linkers, shunters and synonyms. The rationale for this typology is discussed at length and finally...

...lar kind of text. Dictionaries have particular promise because (a) the semantic structure of text may be more exposed in them than in other forms of text and (b) many are now in machine-readable form and are amenable to analysis by large-scale computational methods. They identify some convergence between the view of computational semantics presented, computational lexicography, and knowledge acquisition, in terms of common issues and problems they share. This convergence is illustrated using some work by the authors that attempts to extract semantic information from LDOCE and use that semantic information in two kinds of computational semantics that reflect the general position on computational semantics set forth.

1.6 Notes

1 Winograd (1983) and Allen (1987) both provide excellent introductions to natural language processing and some aspects of linguistic theory.

2 By convention, asterisks are used to mark examples considered to be ungrammatical.

3 This observation, though, is not totally correct, because the part of speech of many words is predictable from their morphological form;
4 ... structure grammars of this type.

5 The first two entries are due to Bob Ingria. The numbers in the annotations refer to the definition numbers within the entry for ac-...

6 COBUILD stands for COLLINS Birmingham University International Language Database.

9 Note that the process of language production from some underlying meaning representation should be distinguished from the functions carried out by various document generation aids, where MRDs have already been applied on a large scale, for example for spelling correction within the CRITIQUE system.

10 For example, body is part of the definitional vocabulary and has as its central (1) meaning 'the whole of a person'. However, parliament is defined as 'a law-making body', utilising the meaning of body(5): 'a number of people who do something together'.

11 We adopt the convention (in this and subsequent chapters) that, within a homograph, the relevant LDOCE word sense is indicated by a bracketed number (for example, contain(2)).

Chapter 2
Placing the dictionary on-line
Hiyan Alshawi, Bran Boguraev and David Carter

Computer tapes, typically typesetting ones carrying the data given by the publisher to the printer, are the usual medium for distributing dictionaries in machine-readable form, and the Longman Dictionary is no exception. The information on such tapes is read directly and interpreted by a program which drives the typesetting device used to produce the masters for the printed version of the dictionary.

While clearly a convenient way to separate the compilation of a dictionary and the keying-in of the data from the actual production of the book, this process does not presuppose the use of the tape for purposes other than typesetting; indeed, the tape is simply a by-product of the process of creating the printed book. This is the root of the problem of mounting the electronic source of a dictionary in a form suitable for NLP programs: typesetting information on its own does not give a sufficient handle on the underlying structure. The printed dictionary follows certain lexicographic and typographic principles, and the primary underlying assumption at the publisher's end is that a dictionary is a printed, hand-held object to be used by an intelligent human. Visual presentation of the information is critical both from the point...
2.1 From printed page to computer memory

Typographic conventions make the structure of an entry immediately obvious to the reader: homograph numbers (in superscripts), sense numbers (in brackets), semi-idiomatic phrases (marked by small bold letters), and so on; Figure 2.1 gives examples of these.

Figure 2.1 A fragment of LDOCE entry (pair)

Clearly, all this information must be present on the tape. The EBCDIC code is used as the basis of the electronic encoding and, to account for the complexity of a dictionary entry (or a page), control sequences mediate the mapping between the information on the tape and its corresponding image on the printed page. This presents yet another problem: access to the information on the tape is made more complicated than it would initially seem, since control characters perform two distinct functions. It is easier to infer a direct correspondence between a character and its corresponding symbol in printed form than to establish the effect a control sequence might have. For instance, how is the end of a field achieved: by a command declaring a different font, by an explicit command, or by a combination of these? Secondly, the analysis of the character sequence on the tape is not entirely trivial, since a large number of the EBCDIC characters are unprintable on conventional output devices, and additional effort has to go into unscrambling them into visible objects. (See, for example, Amsler, 1987b, for some more detailed comments.)

The fundamental problems of mounting a typesetting tape for on-line access come from two related factors. First, the information which drives the typesetting machine is typically in the form of a character stream, and there is very little (explicit) structure imposed on this information: a control instruction in the stream, say 'switch to bold', does not map onto a unique logical element of the entry. Second, since any computer program wishing to make use of the lexical information in a dictionary must have access to some kind of underlying representation for that dictionary (one both suitable for algorithmic processing and felicitous to the published book), the process of recovering the structure of a dictionary from the typographical information held on the tape introduces additional complications (..., 1982: 216).
Figure 2.2 A sample tape stream for typesetting pair (extract)

The simplest approach simply indicates the beginning and the end of the entry in the character stream, and separates the headword from the rest. More ambitious projects must give careful consideration to questions concerning the micro-structure of a dictionary entry: in particular, how to represent the different fragments of an entry (headword, pronunciation field, definitions for individual word senses, cross-references), a (semi-)idiomatic phrase incorporating the main headword, or a word with a related (for example synonymous or opposite) meaning to the one being defined.

A further manifestation of the problem of context, already mentioned in passing, is that control characters on the tape are typically begin markers; there is no standard way of indicating the end of a logical functional unit within an entry. For example, the scope of a bold font (indicated by the ASCII character Hex46) may be terminated, among others, by a change to another font, by the end of a sense record, or even by a superscript character. It may also be set globally, by proclaiming defaults within particular fields of an entry (for instance those containing a headword or a sense), which the typesetting program then interprets appropriately.

Ideally, the process of unloading all the information from a tape and loading a structured version of the lexical data in the dictionary into a database would involve parsing the complete source to make its implicit structure explicit. The complexities of such a process are discussed in detail by Kazman (1986). One of the results of his work is the development of a full grammar for the entire OED, and a special mark-up language capable of representing and conveying the structural properties of its entries. Attempts to build on existing experience and to develop general-purpose tools for facilitating the process of mounting a typesetting tape on-line rely on a combination of an application-neutral format and a suite of easily customisable programs.

The first two points above are related to fundamental questions facing every attempt at building a dictionary database, and we elaborate on them below. A particularly good analysis of the issues connected with parsing MRD sources into a standard form, as well as of those leading to the design of such a general-purpose representation for the lexical information contained in a range of different dictionary tapes, is presented in Neff et al. (1988).

The remainder of this chapter discusses the particular format of the Longman tape, its conversion into a form suitable for symbolic processing, and the design and implementation of two software systems which provide access to the dictionary on-line. We aim to demonstrate the influence of different system requirements and machine configurations on the functionality and design of the dictionary interface. We conclude by sketching a generalised interface system, capable of offering flexible, multipath access to versions of LDOCE independently of the hardware configurations it is mounted on.
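The 'begin markers only' problem can be made concrete with a small sketch. The marker syntax below is invented for illustration (the real tape uses raw EBCDIC control bytes, not named braces): a scanner must close the current font scope whenever any new begin marker arrives, since no explicit end marker will.

```python
# Illustrative scanner for a stream with begin-only markers: a font scope
# ends when the next marker begins, never via an explicit end code.
# Marker names are invented; the real tape uses EBCDIC control characters.
import re

MARKER = re.compile(r"\{(\w+)\}")   # e.g. {bold}, {sup}, {roman}

def segment(stream):
    """Split a marked-up stream into (font, text) spans."""
    spans, font, pos = [], "roman", 0
    for m in MARKER.finditer(stream):
        if m.start() > pos:
            spans.append((font, stream[pos:m.start()]))
        font = m.group(1)   # a new begin marker implicitly ends the old scope
        pos = m.end()
    if pos < len(stream):
        spans.append((font, stream[pos:]))
    return spans

print(segment("{bold}pair{sup}1{roman} n two things"))
# [('bold', 'pair'), ('sup', '1'), ('roman', ' n two things')]
```

Even this toy version shows why recovering entry structure is context-dependent: the meaning of a marker (does a superscript end a headword? a sense?) depends on which field the scanner believes it is in.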
2 2.2 The Longman tape and its computational counterpart
on-line Chapter
Placing the dictionary
the other
characters interspersed in the text. On LDOCE should be organized in order to facilitate its use byNLP sys-
of the overall entry by presenting some in convenient form
fer some indication
themselves
structure
With records
tems. Firstly, it is necessary to have the entries a
be
2.3
The tape falls into this second category;'Figure storage is needed, holding the complete dictionary in main mem-
since
Longman
for rivet
low shows how a fragment of an LDOCE entry is
encode ory would about
require 20 Megabytes of storage (the exact amount
at source. is organised
The information into a sequence
lines.
of records, a
depends on the format of dictionary and indexing information) and
record may be on a line of its own
or
splitthe rst several
across Every this is currently an expensive option.
a
line starts with sequence of digits, of which SIX
denoteidentifier.
unique The natural data format for Lisp programming (and one for which
the record builtin facilities
sequential number, and the remaining two encode char
there are in Prolog and other symbol manipulation
This is followed by a number of fields specic to that
record; the languages) is list structures, or more specifically Lisp s-expressions.
brackets to indicate the
acter < is a eld separator. (We use curly These can be used to represent arbitrarily complex nested records in
of a non-printable character on the source tape: thus (*CA} both the ling system and main memory. The records of the LDOCE
occurrence
EBCDIC character represented by the hexadecimal dig- tape were therefore transformed in a preprocessing phase into brack-
stands for the
and eted Lisp structures.
its C A.)
The reformatted, or lispied, machine-readable source of the Long-
man tape offers the convenience, from the point of view of a Lisp or a
to fasten with RIVETslz...
rivet2 v 1 [T13X9]to cause
Prolog program, of reading complete LDOCE entries in a form imme-
diately usable by client programs, thus avoiding the need to parse and
28289801<R0154300<rivet
28289902<02< < unpack the raw tape format used by the typesetter. All the information
28290005<v< in the source tape, as supplied by Longman, has been retained in this
ii
28290208<to
xst)J
23290107<oroo<ri;x9<ii;\zvt<
cause to as en in version; what this version offers however, above and beyond the origi-
nal lexical data, is the use of Lisp bracketing to structure entries into
28290318<(*CA)RIVET(*CB){*46)s{*44)(*BA}:
single s-expressions and the conversion of the source character stream
into a text-only ASCII encoding. Thus there are no control characters,
of the Longman tape stream
Figure 2.3 Fragment which means that all a client program needs to do in order to access a
into
particular entry is to position a le pointer at the offset which marks
Thus 01 is the headword record, which is further brokendown its beginning and to perform a single Lisp read.
fur-
two elds: serial number and headword (this may incorporate In addition, the internal structure of an LDOCE entry is indicated,
ther information concerning syllable boundaries,
02 encodes the
spelling yariants and where appropriate, by additional level(s) of Lisp bracketing. Where
capitalisation). Similarly, homograph an inforrnationci the original tape indicated separate logical (and physical) records, cor-
and additional data concerning segmentation responding to different entry elds in its printed form, we retain this
homograph number, the example en-
is
stress patterns in compound entries (this empty in indication of structure by grouping the individual records into sub-
identies a record With
try). The identier for definition code, 07.,code and box code.
expressions of the s-expression for the complete entry. For instance,
four elds: sense number, grammar code, Subject
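As a rough illustration, the record layout just described, six digits of sequence number, two digits of record identifier, then <-separated fields, can be sketched in Python (a hypothetical helper of our own, not part of the systems described in this chapter):

```python
def parse_tape_line(line):
    """Split one line of the tape format described above.

    The first six digits are the record sequence number, the next two
    the record-type identifier; '<' separates the fields that follow.
    """
    seq = int(line[:6])
    rec_type = line[6:8]
    fields = line[8:].split("<")[1:]   # drop the text before the first '<'
    return seq, rec_type, fields

print(parse_tape_line("28289801<R0154300<rivet"))
```

Applied to the headword record of Figure 2.3 this yields (282898, '01', ['R0154300', 'rivet']).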
atoms of the form *XY have been preserved in the lispified format. The master tape uses such non-printing characters to encode, for example, font changes, special characters and the phonetic symbols prefixing headwords, and typesetting software which depends on this information can thus make use of it in a straightforward fashion. The information itself has not been changed by the lispification; the individual elements within an entry are simply far easier to access. Finally, the original field separator (<) has been retained; the character ! is an escape character, preceding symbols with special meaning to Lisp. At the head of every entry in the new format is placed a copy of the head word(s) in base orthographic form, i.e. without the ...

rivet2 v 1 [T1;X9] to cause to fasten with RIVETs:...

[Figure: the lispified form of the rivet entry. The tape records of Figure 2.3 are grouped into a single s-expression, with escaped separators such as !< and !; and preserved atoms such as *CA RIVET *CB ...; the full example is not recoverable from this scan.]

2.3 On-line access: simple mode

The simpler of the two access systems is, in essence, a text editing facility, written ... on the ... 3081 mainframe. After a period of experimenting with the on-line source, a number of points needed resolving; none of them, however, required setting up the on-line source as a fully functional database. The preprocessing needs to be done only once, and ... can be fast and reliable ... Checks had to be carried out to ensure that ... the tape. Other points were resolved by ad hoc solutions, such as sequential scanning of files, or extracting subsets of such files which will fit in main memory, where the facilities native to the operating system are not adequate for the purpose in hand. (Exactly the same problem arises with Prolog, since the Prolog database facility refers to the knowledge base that Prolog maintains in main memory.) In principle, given that the
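The single-read access pattern described in 2.2, position a file pointer at a stored offset and read one s-expression, can be sketched as follows (in Python rather than Lisp; the toy s-expression reader and the miniature data are ours, not the actual lispified LDOCE):

```python
import io

def read_sexpr(stream):
    """Read one balanced s-expression from the current position (toy reader)."""
    depth, chars = 0, []
    while True:
        c = stream.read(1)
        if not c:
            break
        chars.append(c)
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth == 0:
                break
    return "".join(chars)

# A toy "lispified dictionary": one s-expression per entry, plus an index
# mapping each headword to the character offset of its entry.
entries = ["(rivet (v) (to cause to fasten))", "(blow (n v) (a hard stroke))"]
index, pos = {}, 0
for e in entries:
    index[e[1:e.index(" ")]] = pos
    pos += len(e) + 1

dictionary = io.StringIO(" ".join(entries))
dictionary.seek(index["blow"])       # position the file pointer at the offset ...
entry = read_sexpr(dictionary)       # ... and perform a single read
print(entry)
```

The design choice this illustrates is that no parsing of the raw typesetting stream happens at query time; all structure was imposed once, during lispification.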
dictionary is now in a Lisp-readable format, a powerful virtual memory system might be able to manage access to the internal Lisp structures of the entire dictionary. This is, however, expensive, as well as not always available; an alternative, and a more general, solution is outlined below.

A series of systems in Cambridge are implemented in Lisp running under Unix. They all make use of an efficient dictionary access system which services requests for s-expression entries made by the client programs. The lispified form of the dictionary has been converted into a random access file, paired together with indexing information from which the disc addresses of dictionary entries for words and compounds can be retrieved. Once a request for an entry has been served, the process becomes dormant, ready to be reactivated by subsequent queries.

In situations like this, where it is not envisaged that the information from the tape will have to be altered once installed in secondary storage, this simple and conventional access strategy is perfectly adequate. The use of standard database indexing techniques (see, for example, Wiederhold, 1983) makes it possible for an active dictionary process to be very undemanding with respect to main memory utilisation. For reasons of efficiency and flexibility of customisation, namely the use of LDOCE by different client programs and from different Lisp and/or Prolog systems, the dictionary system is implemented in C ...

The general applicability of the lispified source has been proved by the mounting of LDOCE on a completely different hardware configuration: a XEROX single-user workstation running ... lexicographic software ... The access method relies on building a search tree, a structure which allows direct access ... virtual memory allows the working subset of the dictionary to be kept in main memory. ...

The main access route into LDOCE for most of our current applications is via the headword and homograph fields (see Figure 2.4). Options exist in the access software to specify which particular homograph (or homographs) for a lexical item is required. The early process of lispification was designed to bring together not only dictionary entries corresponding to different homographs, but also lexicalised compounds for which the argument word appears as the initial word of the compound. Thus, the primary index for blow allows access to two different verb homographs (for example blow1), two different noun homographs (for example blow2), 10 compounds (for example blow off or blow-by-blow1), and all 14 of the dictionary entries (which are not necessarily found in subsequent positions in the dictionary) related to blow. The only application so far making significant use of this ... the motivation ... is that any serious parser must be able to recognise compounds before it segments its input into separate words.

2.4 On-line access: flexible mode

There are essentially two different ways in which an MRD can be used (see also 1.3 for some further discussion). The predominant technique to date involves an arbitrary amount of preprocessing, typically a batch derivation of the appropriate information ... a way of transforming the raw data into a usable repository of lexical knowledge.
can aid the development of particular NLP systems. The assumption is that an analysis of the accumulated data in the dictionary will reveal regularities which can then be exploited for the task at hand. In fact, the rest of this book presents a number of projects which have sought, and applied, such regularities.

Work of this kind depends critically not only on the availability of a machine-readable equivalent of a printed dictionary, but also on a software system capable of providing fast interactive access into the on-line source through various access routes. Operational natural language processing systems clearly will have well-defined requirements as far as their lexicons are concerned, and once the format of lexical resources has been settled, retrieval of individual entries can be implemented fairly efficiently using standard computational and linguistic techniques (see for example Russell et al., 1986). In contrast, the placing of a dictionary on-line, with the intention of making it available to a number of different research projects which need to locate and collate dictionary samples satisfying a wide range of constraints, requires an efficient and flexible system for management and retrieval of linguistic data.

Conventional database management systems (DBMSs) are not well suited for on-line dictionary support (see for example Tompa, 1986, and Boguraev et al., 1987b, for further discussion). In particular, when the entire dictionary is viewed as a lexical knowledge base, more complex retrieval requests, of a relational kind for example, become quite difficult.

Firstly, there is the nature of the data in a dictionary: typically, it contains far too much free text (definitions, examples, cross-reference pointers, glosses on usage, and so forth) to fit easily into the concept of structured data. On the other hand, the highly structured and formalised encoding of other types of information (found in, for example, the part of speech, hyphenation or pronunciation fields) makes a dictionary equally unsuitable for on-line access by information retrieval methods.

The second factor is due to the nature of the only source of machine-readable dictionaries so far available. The problems associated with the logical decoding of the flat character stream which reflects the visual organisation of the data on the tape have already been discussed (see 2.1 above). These typically include not only the difficult problem of parsing a dictionary entry, but also the issue of devising a suitable representation for the potentially huge amount of linguistic data; one which does not limit in any way the language processing functions that can be supported or constrain the complexity of the computational counterpart of a dictionary entry.

Finally, there is the nature of the data structures themselves. A text processing application, typically written in Lisp or Prolog, requires that its lexical data is represented in a compatible form, say Lisp s-expressions of arbitrary complexity. Therefore, even if we choose to remain neutral with respect to representation details, we still face the problem of interfacing to a vast number of symbolic s-expressions held in secondary storage. This problem arises from the unsuitability of conventional data models for handling the complex data structures underlying any sophisticated symbolic processing. Partly, this is due to the inherent restrictions such models impose on the class of data structure they can represent easily ... storing dictionary data (Byrd et al., 1988).

In the short term, alternative approaches reduce the complexity of the problem by limiting themselves to applying the machine-readable source of a dictionary to a small class of similar tasks, and building customised interfaces offering relatively narrow access channels into the on-line data. Thus IBM's WordSmith system (Byrd and Chodorow, 1985) is concerned primarily with providing a browsing functionality which supports retrieval of words close to a given word along the
Secondly, the system should impose a phonologically motivated structure on pronunciations, which are typically given as a string of phonemes and stress markers. This will allow the user to specify a constraint on, say, the onset of the second syllable of a word, whose position in the phoneme string will not be the same for all words, and will not always be indicated by stress markers (see 2.4.3 below). The straight indexing approach described earlier for headword-based access cannot in general provide sufficiently flexible access routes.

Thirdly, the user or program client should be free to specify different types of constraint in any combination. We cannot assume in advance that information of a given type will always be present in great enough quantities to allow efficient retrieval. For example, if the system is being used by an automatic speech recogniser, then at one point in the signal significant information on pronunciation may be available, while at another only information such as that derived by Alshawi's definitions analyser (Alshawi, this volume) might be used for further guidance during the word recognition process.

It is clear that in order to make full use of the computerised LDOCE, we need a dictionary access system with proper DBMS functionality, capable of efficient retrieval of entries satisfying selection criteria applying at various levels of linguistic description. The design of the system described here allows precisely such heterogeneous requests. What we offer is a software environment buffering the user from the typically baroque and idiosyncratic format of the raw dictionary source and allowing, via a carefully crafted interface, multiple entry points and arbitrarily complex access paths into the on-line lexical knowledge base.
...driven parser providing quite specific constraints. In this case, the stronger, more specific constraints must be used for access, and the weaker ones only for checking the entries retrieved. To achieve this, the system must be able to estimate in advance what the ...

Syllable boundaries are assigned using the phonotactic constraints given in Gimson (1980) and the maximal onset principle (Selkirk, 1978), where these yield a ... Such decomposition is necessary for speech applications ... For example, the LDOCE pronunciation of bedouin ... violates the constraint that a syllable whose peak is /u/ ...

2.4.3 Design and implementation

The design and implementation of the database system described here reflect the characteristics of the Longman tape already discussed earlier. The link between a client program and the lispified dictionary is provided by a pointer file and a constraint file, whose nature and motivation will be described below. Finally, the strategy used for indexing entries by the words in their definition texts reflects the fact that it is the semantic content of these words that is likely to be of interest to the user. This has two main consequences: ...
Constructing access paths. Constructing access paths is straightforward in the grammatical, semantic and definition-word cases: a list of pointers is constructed for every suitable grammar code, semantic code and definition-text word found in the dictionary. Pronunciations, however, are treated differently ... A file is created containing all the entry pointer lists and, just before each list, its length. As described below, this allows the system to estimate the size of a search without having to read several lists in the pointer file.

A menu-driven graphical interface is provided by means of which the user constructs a query. A tree can be constructed from either a WORD node alone, or by instructing the system to build a tree from the entry for a specified word, and then editing it. Once the tree is built, either a partial search (to estimate its cost) or a full search can be ordered. A partial search indicates which constraints the system would use to look up candidate entries, and which ones it would apply as tests to entries that satisfy the look-up in a full search. When the user selects a node, the resulting menu, shown in Figure ..., allows constraints to be specified. In terminal nodes of the PRONUNCIATION subtree, ... matches any single symbol, ... matches any sequence of symbols, and the remaining symbols have the phonetic values defined for them in ...

[Figure: the constraint menu for pronunciation nodes; largely unrecoverable in this scan.]

In a search, the system must read (a) a pointer from the constraint file and (b) a complete entry from the dictionary. The most efficient search strategy involves using the most specific constraints as look-up keys (more specific keys ultimately yielding fewer entries). The optimal number of constraints to use is found by balancing the number of pointers that will have to be read, which increases with the number of look-up keys, against the ... In a full search, the chosen constraints are not merely displayed but are also acted on. The pointer lists for the look-up constraints are intersected, the number of pointers resulting is displayed and, at the user's option, the corresponding entries are read from the dictionary, the test constraints are applied to them, and the surviving entries are displayed. Applying tests to a dictionary entry involves reanalysing the relevant parts of it in the same way as when the database is constructed.

2.4.4 An example

As an example, suppose the user wishes to see all entries for three-syllable nouns which describe movable solid objects, whose second syllable has a schwa as peak, and whose third syllable has a coda that is a voiced stop. He constructs the tree shown in Figure 2.7 and selects the partial search option. This returns the information shown in Figure 2.8.

Estimated pointer+entry reading time:
12.5+85.3=97.6 seconds (502 entries)

Figure 2.8 Interactive query construction: estimated statistics

Because of the expected large number of entries in the result and the time that would be taken to read them, the user decides to look only at the entries for such words whose definitions contain the word camera. He adds the relevant constraint to the tree (the system checking, as he does so, that camera is a valid key) and orders another partial search. This time, the statistics are more manageable. A full search is therefore ordered, in which the definition word camera is used as the only look-up key, and the other constraints are this time all used as tests. This returns the entries for the words clapperboard and Polaroid (Figure ...).
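The look-up strategy just described, intersect the pointer lists of the most specific constraints and then apply the remaining constraints as tests to the entries actually read, can be sketched as follows (a Python illustration with made-up data; all names are ours, not those of the Cambridge system):

```python
def search(pointer_lists, tests, fetch_entry, max_keys=2):
    """Intersect the most specific pointer lists, then test each survivor."""
    keys = sorted(pointer_lists, key=len)[:max_keys]   # fewest pointers first
    candidates = set(keys[0]).intersection(*map(set, keys[1:]))
    results = []
    for ptr in sorted(candidates):
        entry = fetch_entry(ptr)                       # one disc read per survivor
        if all(test(entry) for test in tests):
            results.append(entry)
    return results

# Toy data: entry "pointers" are keys into a dict instead of disc addresses.
entries = {
    1: ("clapperboard", "n", "a board held up in front of the camera"),
    2: ("Polaroid", "n", "a type of camera"),
    3: ("blow", "n", "a hard stroke"),
}
nouns = [1, 2, 3]     # pointer list for the grammar code 'n'
camera = [1, 2]       # pointer list for the definition word 'camera'
hits = search([nouns, camera], [lambda e: e[1] == "n"], entries.__getitem__)
print([e[0] for e in hits])
```

The balancing act in the text corresponds to the max_keys parameter: more keys shrink the candidate set (fewer entry reads) at the price of more pointer-list reads.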
The full search returns the entries for clapperboard and Polaroid:

clapperboard /.../ n ... the details of the scene to be filmed, written on a board held up in front of the camera ...

Polaroid /.../ n 1 [U] a material with which glass is treated in order to make light shine less brightly through it, used in making SUNGLASSES, car windows, etc. 2 [C] also ... a type of camera that produces a finished photograph only seconds after the picture has been taken

[Further figure fragments survive only partially: a command-driven version of the same query construction, e.g.

> set [pron s2 onset] to "s p"
> add [pron s3]
> delete [syn cat]
> psearch on [s2]

and the constraint tree itself, roughly:

[sem  = [box = 5,J]
 syn  = [cat = n]
 pron = [s2 = [stress = ?
               onset  = *
               peak   = schwa
               coda   = *]]] ]

... DBMS functionality ... In this sense the lexical ... manager [could be turned] into a dictionary server node (of the kind discussed by ...). Interactive graphics only simplify the end-user task of constructing search specifications. The design is sufficiently modular to allow easy modification, and the system would be capable of functioning on a conventional minicomputer or mainframe (as long as it supported random ...).

2.6 Notes

1 car is a Lisp construct which means the first element of a list.
2 Unix is a trademark of Bell Laboratories.
Chapter 3: An analysis of the grammar coding system

Eric Akkerman
3.1 Introduction
The analysis of the LDOCE coding system, which is the subject of this
chapter, was carried out as a part of the recently completed ASCOT
project (see 3.5). The main goals of this project were the development
of a lexical database system and an associated scanning system, to be
employed in (semi-) automatic syntactic analysis.
The resulting ASCOT software package consists of two major components: ...
possible basis for the ASCOT lexicon, because they contain an extensive grammatical coding system and are available on computer tape.
Detailed analysis of the entry structure in the two dictionaries and of
the respective grammatical coding systems led to the conclusion that
LDOCE was most suitable for our purposes. Also, the computer-tape
On the basis of LDOCE, the actual ASCOT lexicon was created in four stages (for the details we refer to Akkerman et al. (1988a); a general outline of the project, with emphasis on the lexical products developed in the course of it, is presented in Akkerman et al. (1988b)):

1. A computer program was developed to transform the original LDOCE files in such a way that the various types of information became optimally accessible.
2. The few imperfections which were still left (due either to the ... or to the structure of LDOCE) were corrected manually. This resulted in the first intermediary file, which is thus essentially a corrected, maximally structured version of LDOCE.
3. ...
4. ... created the actual ASCOT lexicon (Aslex).

In Aslex, the entries, together with their wordclass code and information about inflection and spelling, are stored in an L-tree format ... A codefile associated with the lexicon provides additional information pertaining to each lemma (cf. Skolnik, 1980). Of course, the ASCOT lexicon can itself be used as a lexical resource in another system. In the remainder of this chapter we present a comparison of the grammar coding systems employed in OALD and LDOCE.

3.2 A comparison of the OALD and LDOCE grammar coding systems

Both OALD and LDOCE make use of an extensive grammatical coding system to provide syntactic information about their lemmata. The LDOCE grammar codes give detailed information about the behaviour of ... English words (Procter, 1978:xxviii). The majority of codes consist of a capital letter, which may be followed by a number and/or by a lower case letter. Apart from these, the most common codes may also contain additional information ... The big letter codes denote major grammatical properties of verbs, adjectives ... basic types of ... nouns: C, U, GC, GU, P, S, Wn1, Wn2, Wn3.
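The code shape just described, a capital letter optionally followed by a number and/or a lower-case letter, can be sketched as a small parser (a hypothetical Python helper of our own, slightly extended to cover two-letter capitals such as Wa/Wn/Wv; it is no part of ASCOT):

```python
import re

# One code: a capital letter (optionally a second, lower-case, letter as
# in Wa/Wn/Wv), then an optional number, then an optional lower-case
# suffix, e.g. T1, X9, Wa5, T1a.
CODE = re.compile(r"([A-Z][a-z]?)([0-9]*)([a-z]?)")

def parse_codes(field):
    """Parse a bracketed LDOCE grammar-code field such as '[T1;X9]'."""
    codes = []
    for part in re.split(r"[;,]", field.strip("[]")):
        m = CODE.fullmatch(part.strip())
        if m:
            codes.append(m.groups())
    return codes

print(parse_codes("[T1;X9]"))
```

For the rivet entry above this yields [('T', '1', ''), ('X', '9', '')], i.e. the two codes T1 and X9.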
Those indicating various special features of adjectives (Wa3, Wa4, Wa5), nouns (N, R), verbs (Wv4, Wv5, Wv6) and adverbs (H). This is shown most clearly in ...

For the specific meaning of all big letter (and number) codes we refer to appendix F. Most important for our purposes are of course the codes mentioned under (1), (2) and (3). Especially the comprehensiveness of the noun codes deserves some attention: apart from the standard codes [C] and [U], LDOCE also uses the codes [GC] and [GU] to indicate special properties of nouns representing groups, like committee [GC] (which in the singular form can take either a singular or a plural verb) and clientele [GU] (which has no separate plural form and can take either a singular or a plural verb). The codes [S] and [P] are used for nouns that are used only with singular or with plural verbs respectively (for example, undertow [S], police [P]). It should be noted that these codes give more than information about subject-verb concord ...

... used inconsistently. The [V3] code is given to verbs taking an NP and a to-infinitive, whereas Quirk et al. (1985) make a distinction between:

transitive verbs (B8): We like all parents to visit the school
complex transitive verbs (C4): They expected James to win the race (obj + obj compl)
ditransitive verbs (D6): We asked the students to attend the lecture

... a surface structure combination that is normally associated with ... the underlying grammatical structure ... always combined with a number code ... Maybe the LDOCE lexicographers have decided, for ... reasons, on this non-committal approach.
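As an illustration of the noun codes discussed above, the subject-verb agreement possibilities they imply might be tabulated as follows (our own simplification for exposition, not part of LDOCE or ASCOT):

```python
# Verb agreement allowed for a singular-form noun carrying each code
# ("sg"/"pl" = singular/plural verb), simplified from the text above.
AGREEMENT = {
    "C":  {"sg"},        # countable, singular form: singular verb
    "U":  {"sg"},        # uncountable: singular verb
    "GC": {"sg", "pl"},  # group noun such as committee: either verb
    "GU": {"sg", "pl"},  # group noun with no separate plural (clientele)
    "S":  {"sg"},        # used only with singular verbs (undertow)
    "P":  {"pl"},        # used only with plural verbs (police)
}

def agreement(code):
    """Return the verb forms a singular-form noun with this code may take."""
    return AGREEMENT.get(code, set())

print(sorted(agreement("GC")))
```

A table like this makes the point of the text concrete: the codes carry more than a count/mass distinction, since [GC], [GU], [S] and [P] also constrain concord.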
... he then argues ... a way of refining code ... (1982:124). The little letter codes in LDOCE give detailed information about certain aspects of verb complementation. Some of them are concerned with the position of adverb particles and prepositions which are part of a verbal combination. Others indicate more specific grammatical properties; an example is code a, which in combination with a number ... Apart from the formal code symbols described so far, the LDOCE codes also contain various kinds of additional information. This can be:

A Words that may (or must) be used in combination with the lemma. The verbs ascribe and deceive both get this code in

He ascribed his failure to bad luck
He deceived her into the belief that (...),

although there is clearly a difference between the relation of ascribe and to (prepositional verb) and of deceive and into (cf. the discussion of verbal combinations below).

B Grammatical information of various kinds, for example: always sing. in form, BE + complement / adjunct ...

C ... information (for example [T1, no pass.]).

3.2.2 The OALD coding system

The formal OALD grammatical coding system is limited to nouns and verbs. The information contained in the LDOCE codes for adjectives and adverbs is either not given in OALD, or accounted for in various ways:

If an adjective can only be used predicatively, this is usually indicated ...

If adjectives have -er / -est comparative / superlative, this is indicated after the wordclass information, for example, high /../ adj (-er, -est) (cf. LDOCE code [Wa1]).

Certain grammatical information is presented in a rather non-formal way ... the information contained in the LDOCE codes [GU], [GC], [S] and [P] is often not given ... for example, grouse /.../ n (pl unchanged) (cf. LDOCE [Wn3]); ... /../ n (sing only) (cf. LDOCE [S]). A combination is also possible.

With ditransitive verbs, [VP12] stands for the order S + vt + noun / pronoun (IO) + noun / pronoun (DO), and [VP13] stands for S + vt + noun / pronoun (DO) + preposition + noun / pronoun. The corresponding LDOCE code is [D1 ...]. Note that, although most ditransitive verbs will have both [VP12] and [VP13], in some cases this may be a useful distinction.

In two ... patterns, OALD makes a distinction between verbs that indicate physical perceptions and other verbs ([VP18A/B] and [VP19A/B]); this distinction seems to be only semantically motivated.

[VP16B] = S + vt + noun / pronoun (DO) + as / like / as if + noun / clause

[VP16A] accounts for transitive constructions con... The following table illustrates the different treatment in OALD and LDOCE of the following constructions:
   OALD                                  LDOCE
1  belt ... [VP2C] ~ up                  belt up v adv [I0]
2  blare vi, vt [VP2A,C] ~ out           blare v [I0 (out)]
3  bail vt [VP15B] ~ sb out              bail out v adv 1 [T1]; 2 [T1 (out)]
4  belch vt, vi [VP6A,15B] ~ out         belch v ...

Whereas OALD's code [VP15B] is a useful pattern code, code [VP2C] is much too comprehensive. The term adverbial adjunct covers a number of syntactically different structures, like for example:

go away
it's getting on for midnight

Again, this pattern code describes a certain order of elements, while ignoring the concept of grammatical cohesion between the verb and the adjunct. With prepositional verbs, this is even more striking (cf. ...): the codes fail to capture the specific relationship between the verb and the particle / preposition.

Moreover, often when a pattern code indicates that a particle should follow the verb, the relevant particle is not specified (or only in an example sentence). With code [VP2C] this happens more often than not: it is a logical result of the comprehensiveness of the pattern. However, it also occurs with pattern [VP15], for example with array, see, baste, beckon. ... that verb, as follows:

leave 4 [VP15A, 13, 22, 19B, 2C, 24, ...]

... particle when the combination means to stop (cf. LDOCE: leave off v adv [T1a,4a,I0]).

The conclusion of our comparison of a number of important general aspects of the grammar coding systems of OALD and LDOCE is that LDOCE's grammatical coding system is both more comprehensive and more detailed than that of OALD, especially where the coding of nouns, adjectives and adverbs is concerned. Furthermore, in LDOCE the coding of verbs is more clearly structured and grammatically sounder than that of OALD. Although OALD is sometimes more specific in its verb codes than LDOCE, a serious shortcoming of OALD is the fact that many OALD patterns represent superficial surface structures rather than underlying grammatical structures. In LDOCE, this is only the case with codes [V3/4] and ... (which are not dealt with more satisfactorily ...).

3.3 A detailed analysis of the LDOCE coding system

Following our comparison of the coding systems of OALD and LDOCE, we subjected the LDOCE coding system to a detailed analysis. In this critical assessment, LDOCE's grammatical approach was systematically compared with that of Quirk et al. (1985). As a result, a number of code combinations emerged that are grammatically questionable or even incorrect. In this section we will present a short survey of our most important findings and conclusions; in Akkerman et al. (1988a) a whole chapter is dedicated to this subject.

... of these code combinations are only used, it seems, to deal with certain classes of verbs that fall outside the table of codes, but for which the code is used to give at least an indication of their grammatical behaviour. Examples are code [I2] (intransitive verb followed by infinitive ...)
3.3.2 Linking verbs with adverbial complementation

Whereas, according to Quirk et al. (1985), the category of copulas that have adverbial complementation is fairly small, the number of verbs in LDOCE that have code [L9] (linking verbs with adverbial complementation) is enormous: 564 verbs. Examples of the use of this ...

He is easy to please
*An easy man to please
He was anxious to please his guests
An anxious man to please his guests

where both adjectives are coded [B3] in LDOCE, while anxious can ... noun ... In the TOSCA lexicon ...
(Department of Alpha-informatics, Amsterdam University), aims at a reimplementation of the LSP grammar in the PARSPAT formalism (cf. de Jong and Masereeuw (1987)). The LSP (Linguistic String Project) grammar is a comprehensive computer grammar of English that was developed by Naomi Sager (cf. Sager 1981). PARSPAT is a parser-generator developed at the Computer department of the Faculty of Arts of the University of Amsterdam. Both TOSCA and PARSCOT have evaluated the ASCOT lexicon with a view to its incorporation in their respective analysis systems. The gist of their evaluation is that in general the LDOCE classifications are very useful, but that in their respective grammars certain subclassifications must be distinguished which are not accounted for in LDOCE.

TOSCA requires detailed information about the grammatical properties of pronouns and determiners, and a further subclassification of adverbs (into connective adverbs, exclusive adverbs, particularising adverbs, etc.; cf. Quirk et al. (1985)). In LDOCE, pronouns always have the code [Wp], which refers to a table in the dictionary that provides more specific information. Because this is of no use to a parsing program, a detailed formal coding system was developed for determiners and pronouns in the ASCOT lexicon, in which information is conveyed like "used with plural count nouns" (for determiners) and "3rd person plural" (for pronouns). Hopefully, at a future stage it will also be possible to include a further subclassification of adverbs. PARSCOT basically uses a classification of the position a word can take in a sentence (as do the LDOCE codes); only when so-called expression rules are applied are the elements in a sentence put into the right position.

3.4 Conclusion

Thus, it may be concluded that the comprehensive coding system employed in LDOCE makes it a very suitable basis for the creation of a computerised lexicon. However, when that lexicon is to be used in a system dedicated to automatic grammatical analysis, it will often have to be adapted and/or expanded to a certain extent, depending on the grammatical formalism that is the basis for the analysis system in question. In the ASCOT lexicon a number of such adaptations were made, geared to the Nijmegen TOSCA project. Hopefully, in the future ASCOT can be extended even more, so that on the basis of LDOCE a computerised lexicon will have been created that can be employed successfully in various natural language processing tasks.

3.5 Notes

I would like to thank Willem Meijs, Hetty Voogt-van Zutphen and Pieter Masereeuw for their stimulating contributions to the many discussions about the subject of this chapter; thanks also go to Willem Meijs for his suggestions for improvement of the text. ASCOT, which stands for Automatic Scanning system for Corpus Oriented Tasks, ...
The lexicon necessary was a research project supported by the Foundation for Linguistic Re-
like that of ASCOT, but needs one more level of subclassication for
search, under project number 300169-004. The Foundation is funded
most wordclasses. Examples are APREQ (adjectives or participles that by the Netherlands organisation for the advancement of pure research
can modify a numeral (an additional five people, NCOUNT2 (count (Z.W.O.) The project started on 1 March 1984 at the English Depart-
able nouns which can directly preceded by a preposition
be (he came ment of the University of Amsterdam and it was finished on 1 March
by car)), VCOLLECTIVE (verbs which require a plural, collective, 1987. Additional funding was supplied by the Arts Faculty of the Uni-
aggregate or coordinated object (collect tools / dust)). Furthermore, versity of Amsterdam for the period of 1 March 1987 1 July 1987, to
LSP makes use of categories that are especially useful for parsing, such round off computational work.
that, in combination with
as NTIME, which denotes temporal nouns
adjectives like last, can form a time-adverbial without a preposition (for
example last week). The use of such categories can prevent certain am-
1. By the term multi-word lexical items we mean fixed collocations: combinations of two or more words.
IN COUNTRY
AT TIME
ON SUBJECT

The links between Asia/modern and country/time are at least partly retrievable by the thesauric links between the members of these pairs (Michiels, 1982:124-5).

6. Key to numbers: anaemia = 1, ...

9. Apart from TOSCA and PARSCOT, two other projects can be mentioned here: the Centre for Lexical Information at Nijmegen University (CELEX) has drawn on the work of the ASCOT project for the development of its lexical database of English and may incorporate (parts of) the ASCOT lexicon. The LEXALYSE project (English department, University of Amsterdam) aims at employing the software developed in the ASCOT and LINKS projects (cf. Vossen et al. in this volume) in a system dedicated to automatic semantic-syntactic analysis of discourse texts.
4 Utilising the LDOCE grammar codes

4.1 Introduction
...consult relatively small lexicons, typically generated by hand (Bobrow, 1978). Two exceptions to this generalisation are the Linguistic String Project (Sager, 1981) and the IBM CRITIQUE (formerly EPISTLE) Project (Heidorn et al., 1982; Byrd, 1983); the former employs a dictionary of approximately 10,000 words, most of which are specialist medical terms, the latter has well over 100,000 entries, gathered from machine readable sources. However, to our knowledge no evaluation of LDOCE, or any other MRD, as a source for such information has been published.

Recent syntactic theorising has seen a steady relocation of ... to the logical form or underlying predicate-argument structure ...

* Sally devoured
Sally devoured the food
* Sally put the book
Sally put the book on the table
* Sally told (that) the man was ugly
Sally told Bill (that) the man was ugly

Figure 4.1

Many verbs can take more than one type of complement without significant change in meaning; for example, believe in the sense of 'think something is true' can take a NP, a NP and adjectival, nominal or infinitival predicative complement, or a sentential complement, as Figure 4.2 illustrates. These (meaning preserving) alternatives are often called grammatical alternations.

Sally believes (that) the story is wrong
...

Figure 4.2

In addition, verbs taking predicative complements may impose different interpretations on these complements; for example, persuade and promise both take a NP followed by an infinitival complement, as Figure 4.3 illustrates. However, in the case of ... Although this latter fact is not purely syntactic, if we want the parser to produce ... for its input, then facts of this type must be marked in the lexical entries of relevant verbs.

Most syntacticians have assumed that the subcategorisation of words and participation in various alternations is arbitrary, in the sense that it is unpredictable from other facts about the words in question, for example their meaning. Gazdar (1982) and Pollard and Sag (1987) make this point using examples of the kind illustrated in Figure 4.4. In the first two pairs, the verbs involved have very similar meanings but require complements which differ in a semantically insignificant fashion. In the third example, all the verbs mean something like 'become'; all these verbs will combine with poetical to yield grammatical sentences, but only got can legitimately combine with sent more leaflets. On the basis of this type of evidence many assume that lexical entries must directly encode all the syntactic environments in which a word can legitimately occur.
Kim { grew / got / turned out / ended up / waxed } { poetical / a success / sent more leaflets / doing all the work / to like anchovies }

Figure 4.5

These verbs, which fall into the so-called unaccusative class (Perlmutter, 1978), can be contrasted with members of the unergative class, which undergo a different alternation, in which the theme argument is not present in the intransitive usage, as Figure 4.6 shows.

Figure 4.6
...feature SUBCAT, encoding each different complement structure that word can take (regardless of whether the sense of the word changes). In addition, SUBCAT values can be linked to distinct semantic translations so that different interpretations of predicative complements simply follow from differing SUBCAT values.

This approach to subcategorisation predicts that not only the type but also the range of complement structures, and the interpretation of predicative complements, are all unpredictable from other properties of the word in question. Other linguists have argued that at least some aspects of subcategorisation are predictable from elements of meaning, such as the semantic or thematic roles which a verb imposes on its arguments ... Such accounts would simplify theories of subcategorisation because they would allow lexical entries to be much more economical than theories such as GPSG suggest.

In section 4.2, we describe the format of the LDOCE grammar coding system and the difficulties which arise in extracting the information conveyed by the code fields in individual entries. In 4.3, we discuss the informational content of the LDOCE coding system and the theory of grammar which lies behind it, supplementing the discussion in Akkerman (this volume). We focus in particular on the separation of syntactic and semantic information conveyed by the codes, and the inference of further semantic distinctions concerning the interpretation of predicative complements from the conjunction of codes found in the code field associated with an individual verb sense. We demonstrate a system ... to evaluate whether verbs undergoing the dative alternation fall into coherent semantic classes, and thus to explore further the issue raised above concerning the predictability of subcategorisation information from other properties of the word.

4.2 The format of the grammar codes of LDOCE

Akkerman (this volume) presents a detailed description of the grammar coding system, and appendix F contains the outline of the ... Figure 4.7 shows the entry of the verb believe as it appears in the published dictionary, on the typesetting tape and after restructuring. These grammar codes convey the information that believe can occur in the syntactic environments illustrated in Figure 4.2 above.

The capital letters encode information about the way a word 'works in a sentence' or about the position it can fill (Procter, 1978: xxviii); the numbers give information about 'the way the rest of a phrase or clause is made up in relation to the word described' (ibid.). For example, T denotes a transitive verb with one object, while 5 specifies that what follows the verb must be a sentential complement introduced by that. (The small letters, for example a in the code above, provide further information typically related to the status of various complementisers, adverbs and prepositions in compound verb constructions: for example, the a with T5 indicates that the word that can be left out between the verb and the following clause.) As another example, [V3] introduces a word sense in which ... In addition, codes can be qualified with words or phrases which provide further information concerning the linguistic context in which the described item is likely, and able, to occur; for example [D1(to)] or [L(to be)1]. Sets of codes, separated by semicolons, are associated with individual word senses in the lexical entry for a particular item, as Figure 4.8 illustrates.
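The anatomy just described, a capital letter, an optional parenthesised qualifier, a number and an optional small letter, with commas separating codes and semicolons separating code sets, can be made concrete with a small parser. This sketch is illustrative only and is not the decompacting program discussed in this chapter; the field syntax it assumes is a simplification of the real tape format.

```python
import re

# One grammar code: capital letter(s), optional qualifier in parentheses,
# optional number, optional small letter, e.g. 'T5a' or 'X(to be)1'.
CODE_RE = re.compile(r'([A-Z]+)\s*(?:\(([^)]*)\))?\s*(\d)?([a-z])?')

def parse_code_field(field):
    """Split an LDOCE-style code field such as 'T5a,b; V3; X(to be)1,7'
    into dictionaries, one per code.  Within a comma-separated group, a
    bare number or small letter inherits the preceding capital letter
    and qualifier, as in 'T5a,b' or 'X(to be)1,7'."""
    codes = []
    for group in field.split(';'):
        current_cap, current_qual = None, None
        for item in group.split(','):
            item = item.strip()
            m = CODE_RE.fullmatch(item)
            if m and m.group(1):                       # full code, e.g. 'T5a'
                current_cap, current_qual = m.group(1), m.group(2)
                num, small = m.group(3), m.group(4)
            else:                                      # abbreviated, e.g. 'b' or '7'
                num = item if item.isdigit() else None
                small = item if item.isalpha() and item.islower() else None
            codes.append({'cap': current_cap, 'qualifier': current_qual,
                          'number': num, 'small': small})
    return codes

print(parse_code_field('T5a,b; V3; X(to be)1,7'))
```

Run on the code field for sense 3 of believe, this yields five codes: T5a, T5b, V3, X(to be)1 and X(to be)7.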
believe v 3 [T5a,b; V3; X (to be)1, (to be)7]

[Figure 4.7: the grammar code field for believe, shown as printed in the dictionary (above), as it appears on the typesetting tape with font-control characters such as *46 around to be, and after restructuring into one record per code, e.g. 'head: V3', 'head: X1 right optional (to be)', 'head: X7 right optional (to be)']
punctuation and font changes. However, discovering the syntax of the system is difficult, since no explicit description is available from Longman and the code is geared more towards visual presentation than formal precision; for example, words which qualify codes, such as to be in Figure 4.7, appear in italics and therefore will be preceded by the font control character *46. But sometimes the thin space control character *64 also appears; the insertion of this code is based solely on visual criteria, rather than on the informational structure of the dictionary. Similarly, the choice of font can be varied for reasons of appearance, and occasionally information normally associated with one field of an entry is shifted into another to create a more compact or elegant printed entry.

In addition to the noise generated by the fact that we are working with a typesetting tape geared to visual presentation, rather than a database, there are errors and inconsistencies in the use of the grammar code system. Examples of errors, illustrated in Figure 4.10, include the code for the noun promise, which contains a comma in place of a semicolon; that for the verb scream, in which a colon delimiter occurs before the end of the field; and that for the verb like, where a grammatical label occurs inside a code field.

promise n 1 [C(of),C5,5; U]
scream v 3 [T1,5: (OUT); I0]
like v 1 ...; 2 [T3,4; ...]

Figure 4.10

In the medium term this approach, though time consuming, will be of some utility for producing more reliable lexicons for natural language processing. However, in the short term, the necessity to cope with such errors provides much of the motivation for the interactive approach to lexicon development (see Carroll and Grover, this volume), since this allows the restructuring programs to be progressively refined as these problems emerge. Any attempt at batch processing without extensive initial testing of this kind would inevitably result in an incomplete and possibly inaccurate lexicon.

4.3 The content of the grammar codes

Once the grammar codes have been restructured, it still remains to be shown that the information they encode is going to be of some utility for NLP systems. The grammar code system used in LDOCE is based quite closely on the descriptive grammatical framework of Quirk et al. (1972, 1985); Akkerman (this volume) provides a more detailed comparison. The codes are doubly articulated: capital letters represent the grammatical relations which hold between a verb and its arguments, and numbers represent subcategorisation frames which a verb can appear in. Most of the subcategorisation frames are specified by syntactic category, but some are very ill-specified; for instance, 9 is defined as 'needs a descriptive word or phrase'. In practice many adverbial and predicative complements will satisfy this code when attached to a verb; for example, put [X9], where the code marks a locative adverbial prepositional phrase (Put the chair nearer the fire / He put a match to his ...).
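The kind of cleanup the restructuring stage has to perform on the typesetting tape can be illustrated with a small filter. The control codes used here follow the examples in the text (*46 before italicised material, *64 the thin space); that *44 closes an italic span is an assumption made for the example, and the real tape format is of course richer than this sketch.

```python
import re

# Typesetting control sequences: *46 opens italics (assumed: *44 closes
# them) and *64 is a thin space inserted on purely visual grounds.  For
# recovering the informational content of a code field we delete all three.
CONTROL = re.compile(r'\*(?:46|44|64)')

def strip_controls(raw):
    """Remove typesetting control codes and collapse leftover whitespace."""
    text = CONTROL.sub('', raw)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_controls('X(*46to be*44)1'))   # X(to be)1
```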
Sally believed Bill to be sensible
Sally believed that Bill is sensible
Sally believed peanuts to be bad for her
? Sally believed peanuts to be sensible

Sally persuaded Bill to be sensible
* Sally persuaded that Bill is sensible
Sally persuaded Bill that she is sensible
? Sally persuaded peanuts to be bad for her

Figure 4.11

In the main, the code numbers determine a unique subcategorisation. Thus the entries can be used to select the appropriate VP rules from the grammar (assuming a GPSG-style approach to subcategorisation) and the relevant word senses of a verb in a particular grammatical context can be determined. However, if the parsing system is intended to produce a representation of the predicate-argument structure of input sentences, individual codes only give partial indications of the semantic nature of the relevant sense of the verb.

The solution we have adopted is to derive a semantic classification of the particular sense of the verb under consideration on the basis of the complete set of codes assigned to that sense. In any subcategorisation frame which involves a predicative complement there will be a non-transparent relationship between the superficial syntactic form and the underlying relations in the sentence. In these situations the parser can use the semantic type of the verb to compute this relationship; for example, ... classify verbs as ...

* John forces that the Earth is round.

... if a verb takes a direct object and a sentential complement ... Secondly, ... it will function as the underlying logical subject of the predicate complement. Michiels proposed rules for doing this for infinitive complement codes; however, there seems to be no principled reason not to extend this approach to computing the underlying relations in other types of VP, as well as in cases of NP, AP and PP predication (see Williams (1980) for further discussion).

... representation of the lexical entry for a particular word, in the sense that this intermediate representation could be further transformed into a format suitable for most current parsing systems. For example, if the input were the third sense of believe, as in Figure 4.7, the program would generate the (partial) entry shown in Figure 4.12 below. The four parts correspond to different syntactic realisations of the third sense of the verb believe. Takes indicates the syntactic category of the subject and complements required for a particular realisation. Type indicates the arity of a predicate and whether it is a raising or equi verb under that syntactic realisation.

The five rules which are applied to the grammar codes associated with a verb sense are ordered in a way which reflects the filtering of the verb sense through a series of syntactic tests. Verb senses with an ... code are classified as Subject Raising. Next, verb senses which contain a V or X code and one of the [D5], [D5a], [D6] or [D6a] codes are classified as Object Equi. Then, verb senses which contain a V or X code and a [T5] or [T5a] code in the associated grammar code field (but none of the D codes mentioned above) are classified as Object Raising. Verb senses with a [V] or [X(to be)] code (but no [T5] or [T5a] codes) are classified as Object Equi. Finally, verb senses containing a [T2], [T3] or [T4] code, or an [I2], [I3] or [I4] code, are classified as Subject Equi. Figure 4.13 gives examples of each type.

Figure 4.13

The Object Raising and Object Equi rules attempt to exploit ... as convenient labels for what we regard as a semantic distinction; the actual output of the translation program is a specification of the mapping from superficial syntactic form to underlying predicate-argument structure. Clearly, there are other syntactic and semantic tests for this distinction (see, for example, Perlmutter and Soames, 1979:472), but these are the only ones which are explicit in the LDOCE coding system.

Once the semantic type for a verb sense has been determined, the sequence of codes in the associated code field is translated, as before, on a code-by-code basis. However, when a predicative complement ...
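The ordered tests can be read as a first-match-wins decision list over the set of codes in a sense's code field. The sketch below is a reconstruction from the prose rather than the authors' program; the trigger for the first (Subject Raising) rule is unreadable in our copy of the text, so it is left as a stub, and the prefix matching on code strings is a simplifying assumption.

```python
def classify_sense(codes):
    """Assign a semantic type to a verb sense from its LDOCE grammar
    codes (strings like 'T5a', 'V3', 'X(to be)1').  The rules are tried
    in order and the first that matches wins."""
    def has(*prefixes):
        return any(code.startswith(p) for code in codes for p in prefixes)

    def subject_raising_trigger():
        # The condition for this first rule is damaged in the source text;
        # only the rule's position in the ordering is recoverable.
        return False

    if subject_raising_trigger():
        return 'Subject Raising'
    if has('V', 'X') and has('D5', 'D6'):          # [D5],[D5a],[D6],[D6a]
        return 'Object Equi'
    if has('V', 'X') and has('T5') and not has('D5', 'D6'):
        return 'Object Raising'                    # [T5] or [T5a], no D codes
    if has('V', 'X(to be)') and not has('T5'):
        return 'Object Equi'
    if has('T2', 'T3', 'T4', 'I2', 'I3', 'I4'):
        return 'Subject Equi'
    return None

# believe sense 3, [T5a,b; V3; X(to be)1,7], comes out as Object Raising,
# while persuade, with a D5 code, comes out as Object Equi.
print(classify_sense(['T5a', 'T5b', 'V3', 'X(to be)1', 'X(to be)7']))
print(classify_sense(['T1(of)', 'D5', 'V3']))
```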
4.4 Lexical entries for PATR-II

The output of the grammar code translation program can be used to derive entries which are appropriate for particular grammatical formalisms. To demonstrate that this is possible we implemented a system which constructs dictionary entries for the PATR-II system (Shieber, 1984 and references therein). PATR-II was chosen because it has been reimplemented in Cambridge and was therefore available; however, the task would be nearly identical if we were constructing entries for a system based on GPSG, Functional Unification Grammar (Kay, 1984a) or Lexical-Functional Grammar (Kaplan and Bresnan, 1982). As noted ...

The DAG for a lexical item is constructed from its lexical entry, which contains a set of templates for each syntactically distinct variant. Templates are themselves abbreviations for unifications which define the DAG. For example, the basic entry and associated DAG for the verb storm are illustrated in Figure 4.15. The template Dyadic defines the way in which the syntactic arguments to the verb contribute to the predicate-argument structure of the sentence, while the template TakesNP defines what syntactic arguments storm requires; thus, the information that storm is transitive and that it is a two-place predicate is kept distinct. Consequently, the system can represent the fact that some verbs which take two syntactic arguments are nevertheless one-place predicates.
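The division of labour between a syntactic template such as TakesNP and a semantic template such as Dyadic can be mimicked by treating templates as feature bundles merged by unification. This is an illustrative analogue in plain Python dictionaries, not Shieber's PATR-II notation, and the template contents are invented for the example.

```python
def unify(a, b):
    """Merge two nested feature dictionaries, failing on any conflict,
    in the spirit of PATR-II template expansion."""
    out = dict(a)
    for key, val in b.items():
        if key in out:
            if isinstance(out[key], dict) and isinstance(val, dict):
                out[key] = unify(out[key], val)
            elif out[key] != val:
                raise ValueError(f'unification failure at {key}')
        else:
            out[key] = val
    return out

# Syntactic template: which arguments the verb takes.
TakesNP = {'cat': 'V', 'args': {'subj': 'NP', 'obj': 'NP'}}
# Semantic templates: how many argument places the predicate has.
Dyadic  = {'trans': {'arity': 2}}
Monadic = {'trans': {'arity': 1}}

# 'storm' is transitive *and* a two-place predicate ...
storm = unify(TakesNP, Dyadic)
# ... but a verb taking two syntactic arguments could equally be paired
# with Monadic, keeping syntax and semantic arity distinct.
weird = unify(TakesNP, Monadic)
print(storm['trans']['arity'], weird['trans']['arity'])
```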
The modified version of PATR-II that we have implemented contains only a small dictionary and constructs entries automatically from restructured LDOCE entries for most verbs that it encounters.

persuade v 1 [T1 (of); D5a,b] to cause to feel certain; CONVINCE: She was persuaded ... 2 [T1 (into, out of); V3] to cause to do something by reasoning, arguing, begging, etc.: try to persuade him to ... | Nothing would persuade him to take us to Cornwall

marry v 1 [I0; T1] to take (a person) in marriage: He never married ... 2 [T1] (of a priest or official) to perform the ceremony of marriage for (2 people): An old friend married them 3 [T1 (to)] to cause to take in marriage: She married her daughter to a rich man

As well as carrying over grammar codes, the PATR-II lexicon system has been modified to include word sense numbers, which are derived from LDOCE. Thus, the analysis of a sentence by the PATR-II system now represents its syntactic and underlying predicate-argument structure and the particular senses of the words (as defined in LDOCE) which are relevant in the grammatical context. Figure 4.16 illustrates the dictionary entries for marry and persuade constructed by the system.

[Figure 4.16: the entries for marry and persuade constructed by the system. Each LDOCE sense expands to one or more realisations, e.g. for marry ((Sense 1) ((Takes NP) (Type 1)) ((Takes NP NP) (Type 2))), and for persuade ((Takes NP NP SBar) (Type 3)) and ((Takes NP NP Inf) (Type 3 ObjectEqui)); in PATR-II form each realisation is a template sequence such as V TakesNP Dyadic, V IntransNP Monadic or V TakesNPSBar Triadic, with <head trans sense-no> bound to the LDOCE sense number]

In Figure 4.17 we show one of the two analyses produced by PATR-II for a sentence containing these two verbs. The other analysis produces the same predicate-argument structure but incorporates the second sense of marry. Thus, the output from this version of PATR-II represents the information that further semantic analysis need only consider the second sense of persuade and the first and second senses of marry; this rules out one further sense of each, as defined in LDOCE.

parse> uther might persuade gwen to marry in cornwall

[cat: SENTENCE
 head: [form: finite
        agr: [per: p3, num: sg]
        aux: true
        trans: [pred: persuade
                sense-no: 2
                arg1: [ref: uther, sense-no: 1]
                arg2: [ref: gwen, sense-no: 1]
                arg3: [pred: marry
                       sense-no: 2
                       arg1: [ref: gwen, sense-no: 1]
                       arg2: [ref: cornwall, sense-no: 1]]]]]

Figure 4.17

4.5 Evaluation

The utility of the work reported above rests ultimately on the accuracy of the lexical entries which can be derived from the LDOCE tape. We have not attempted a systematic analysis of the entries which would result if the decompacting and grammar code translation programs were applied to the entire dictionary. In Section 4.3 we outlined some of the errors in the grammar codes which are problematic for the decompacting stage. However, mistakes or omissions in the assignment of grammar codes represent a more serious problem. While inconsistencies or errors in the application of the grammar coding system in some ...
... do not have [T5] codes anywhere in their entries are elect, love, represent and require. None of these verbs take sentential complements and therefore they appear to be counterexamples to our Object Raising rule. In addition, Moulin et al. (1985) note that our Object Raising rule would assign mean to this category incorrectly ... criteria should be employed (Perlmutter and Soames, 1979:460f.). However, only two of these criteria are explicit in the coding system.

acknowledge v 1 [T1,4,5 (to)] to agree to the truth of; recognise the fact or existence (of) ... 2 [T1 (as); X (to be) 1,7] to recognise, accept, or admit (as) ...

hear v 1 [I0; T1] to receive and understand (sounds) by using the ears: I can't hear very well. | I heard him ... 2 [T1,5a] to be told or informed: I heard that they were defeated ...

Figure 4.20

... the LDOCE lexicographers typically define two different word senses, one of which is marked (perhaps among other codes) [T5] and the other with a V1 code. Analysis of these senses suggests that this approach is justified in three cases, but unmotivated in five; for example, acknowledge(1),(2) (unjustified) versus hear(1),(2) (justified) (see Figure 4.20). The other four cases we interpreted as unmotivated ...

On the basis of the results obtained, we explored the possibility of modifying the Object Raising rule to take account of the co-occurrence of [T5] and [T5a] codes and V or X codes within a homograph, rather than within a word sense. An exhaustive search of the dictionary ... Subject and Object Equi coded under the same sense; in the translation program this ambiguity is resolved by selecting the type appropriate for each ...
... classified as Object Raising. Allow(1) and permit(1) appear here incorrectly because they are coded [T4] in the code field to capture examples such as:

They do not allow / permit smoking in their house.

These verbs should probably be coded ... by hand.

4.6 The Dative Alternation

In order to evaluate the proposals of Levin (1985), that aspects of a verb's subcategorisation, and in particular the range of alternative complement structures it can take, are predictable from the semantic class of the verb and its predicate-argument (thematic) structure, we examined verbs in LDOCE which are coded to undergo the dative alternation. This alternation was chosen because it is probably ... a programme of simplifying lexical entries by predicting aspects of syntactic behaviour on the basis of membership in semantic classes. If such an account is invoked for too many cases, then arguments based on semantic classes will begin to lose their empirical force. On the other hand, the claim that lexical extension into the transfer of possession class licences the dative alternation for change of position verbs implies that verbs which are not members of the change of position or transfer of possession classes will not alternate in this way. Levin argues that the alternation involves a change ... to the patient or theme, and is associated with most verbs that describe transfer of possession, but goes on to say that lexical extension allows verbs of the change of position class to become members of the transfer of possession class; the productivity of this process of lexical extension might partially explain why so many more verbs show this alternation than other alternations.

The verbs were extracted from LDOCE using the lexical database system described by Alshawi et al. (this volume), in conjunction with the decompacting program described in 4.2. Undoubtedly this sample does not exhaust the class of verbs which can undergo the dative alternation. However, it provides a realistic-sized sample which can be used ... The verbs were presented in short example sentences intended to bring out the relevant sense from LDOCE. These example sentences were divided into three groups by the second author, using LDOCE as a guide, according to whether they allowed the dative alternation or required two NP complements or a NP followed by a PP introduced by to or for. Examples from each class are shown below.

The examples (which had been presented in alphabetical order) were then sorted into further semantic classes to see if any trends could be observed. Verb senses were left unclassified with respect to the dative alternation where three or more participants objected to the original classification. Membership of the relevant semantic classes is shown by a scale (0-5) which represents the number of participants who thought the verb sense belonged to the class. In the cases where a participant left a question blank ...

65 verb senses were judged to be change of position or transfer of possession verbs (on the basis of a score of 3 or more) and 66 were not. Of those which fell into these classes 54 (84%) alternate and 11 (16%) do not. Of those which do not fall into these classes 41 (64%) alternate and 25 (36%) do not. These results suggest that a weak version of Levin's claims receives some support, in that if a verb sense is in the relevant classes then there is a higher chance that it will undergo the dative alternation. However, stronger versions of her hypothesis do not stand up: many verbs not in these classes undergo the dative alternation and some transfer of possession verbs do not alternate.
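The class-membership figures amount to a two-by-two contingency table, class membership against alternation. A short sketch, using the counts as quoted in the text, recomputes the alternation rate for each group:

```python
# Raw counts reported in the text: 65 verb senses judged change of
# position / transfer of possession (score of 3 or more), 66 not.
counts = {
    ('in_class', 'alternates'): 54, ('in_class', 'does_not'): 11,
    ('out_class', 'alternates'): 41, ('out_class', 'does_not'): 25,
}

def alternation_rate(group):
    """Proportion of a group's verb senses that undergo the alternation."""
    yes = counts[(group, 'alternates')]
    no = counts[(group, 'does_not')]
    return yes / (yes + no)

for group in ('in_class', 'out_class'):
    print(group, round(100 * alternation_rate(group)))
```

The in-class rate exceeds the out-of-class rate, which is the weak version of the claim; that both rates are well above zero is what undermines the stronger versions.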
NP-NP:   She forbade him the house        (forbid(3))   Position/Possession?

NP-PP:   She posed the problem to him     (pose(3))     Position/Possession?

Either:  She quoted him some poetry       (quote(1))    Position/Possession?

Figure 4.24

We assume that neither complement structure is more basic, and that no rule of dative movement is used to derive one structure from the other. Rather, we assume that verbs select more or less freely from a set of potential complement structures, and that verb senses (judged against LDOCE entries) must remain the same across the (dative) alternation. The questionnaire asked participants to confirm that the verb sense was in the correct group and to say whether it fell into the classes of verbs of change of position or transfer of possession. Five linguists completed this questionnaire (including the second author).

The first column of Figure 4.25 gives the number of verb senses which fall into each subclass and an approximate percentage figure representing the proportion of the total sample in that class. As can be seen, the largest class represents only 15% of the sample. In addition, the 'others' class contains 12% of the sample, consisting of miscellaneous verbs which we were unable to classify further. This subclassifying exercise tends to suggest that the verbs in the sample are relatively semantically disparate and do not fall into a unified semantic class, as implied by Levin (1985). The second column shows the mean score for membership of the change of position or transfer of possession classes. Buy/Sell, Pay/Charge and Pass/Throw verbs were judged to be paradigm members of these classes, as were Assign/Give verbs (whose mean score was reduced by the inclusion of more metaphorical giving verbs, such as render(3) or afford(3)). All the Pass/Throw verbs alternate, as Levin (1985) would predict, and they all appear to undergo lexical extension from change of position to transfer of possession (Figure 4.26).
[Figure 4.25: for each subclass in the sample (Buy/Sell, Pay/Charge, Pass/Throw, Assign/Give, Construct, Obtain/Find, Save/Take and others), the number of verb senses and approximate proportion of the total sample, the mean membership score (0-5) for the change of position or transfer of possession classes, and the numbers of alternating and non-alternating senses]

She tossed the corner the ball
She tossed him the ball

Figure 4.26

There is a clear default inference of transfer of possession, although this can be overridden, as in: ...

With the other subclasses, as columns 3 and 5 indicate, the situation is not so clear-cut. Only two-thirds of the Assign/Give class alternate, although this is the class which most clearly involves a central inference of change of possession, as the oddity of the following example illustrates:

? She gave the book to Susan but Bill took it before Susan could.

Compare this with the more natural: ...

In this class, the verbs which do not alternate select the NP-NP complement structure. In the case of the Buy/Sell class, purchase(1) appears not to alternate. Thus within these subclasses, which all appear to fall into the broader classes of change of position or transfer of possession, we can see differential patterns of behaviour with respect to the dative alternation, which, at best, only indicates a trend in support of Levin's (1985) claims.

The Construct and Obtain/Find classes were also judged to fall into the broader classes of change of position or transfer of possession, but judgements were less clear-cut. With these verbs the inference of change of possession seems less central; for example:
She brought him a drink but he was already asleep

… class of verbs, implying that it will not occur with verbs which do not fall, or cannot be extended, into this class. These subclasses illustrate that verbs which fall outside the change of position or transfer of possession class do alternate with as much, or more, regularity than other subclasses which fall within it.

Figure 4.28

The remaining classes also fall outside the broad class of change of position or transfer of possession verbs, but mostly do not show a consistent preference to alternate or not. The Save/Take verbs do not alternate, and they occur readily with a reflexive object:

(7) She quoted poetry to the great wall
    ? She quoted the great wall poetry

Figure 4.30

We tested this claim by marking all verb senses in the sample with the feature +/- LAT(inate). 41 (31%) of verbs in the sample are Latinate and 90 (69%) are not. Of those which are Latinate, 27 (65%) do alternate and 14 (35%) do not. Of those which are not Latinate, …two (e.g. … bisyllabic but not Latinate) … The great majority of words in the sample are mono- or bi-syllabic, so the hypothesis does not make very strong predictions. However, amongst the monosyllabic verbs there are many which do not alternate: wish(4), fine(1), save(3), give(13), yield(3), and so forth.

Grimshaw and Prince (1986) offer a more sophisticated version of the phonological hypothesis, arguing that all alternating verbs consist phonologically of a single prosodic foot. A single foot can consist of one (stressed) syllable, or of two syllables with stress falling on one and the other extrametrical. For the 27 verbs which had stress on the final syllable, the single foot constraint made correct predictions in 14 cases, where either the initial syllable was extrametrical or the dative alternation was blocked. In the remaining 13 cases, either the initial syllable was extrametrical and alternation was nevertheless blocked (afford(3), allow(3)), or the verbs alternated despite not being single feet (e.g. prepare(2), procure(1,2), refund(1), ensure(2)). Most of our examples which go against the single foot constraint were given low scores for membership of the transfer of possession class; however, refund(1), remit(2), repay(1) and reimburse(1) were given high scores, do alternate, and do not consist of single feet. Similarly, charge(1) and fine(1) were given high scores, do not alternate, but do consist of single feet.

One broad correlation which did emerge from our sample is that the more idiomatic the usage, the less likely alternation is to occur.

She gave her family all her time
She gave all her time to her family

She gave him the point
She gave the point to him

She gave them the President
She gave the President to them

The meal gave him indigestion (give(…), cause)
The meal gave indigestion to him

She gave him the information (give(…), tell)
She gave the information to him

She gave him her hand (give(15), support)
She gave her hand to him

Figure 4.32

The first example illustrates a fairly productive meaning of give which does not alternate but can take a wide range of deverbal theme arguments. The second two examples illustrate relatively specialised senses of give which severely constrain the nature of the theme argument but which alternate nevertheless. Finally, we present two examples of pay, both of which alternate, but in the second the theme argument …

She paid him the money (pay(1))
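Counts of the kind reported above reduce to a small contingency table of Latinateness against alternation. The sketch below shows one way such a table might be computed; the sample data is invented for illustration and is not the chapter's actual verb-sense data set.

```python
# Sketch: tabulating dative alternation against a hand-assigned
# +/- LAT(inate) feature. The sample senses below are illustrative.

def alternation_by_latinate(senses):
    """senses: iterable of (name, is_latinate, alternates) triples.
    Returns {is_latinate: (total, number_alternating)}."""
    table = {True: [0, 0], False: [0, 0]}
    for _name, lat, alt in senses:
        table[lat][0] += 1
        if alt:
            table[lat][1] += 1
    return {k: tuple(v) for k, v in table.items()}

sample = [
    ("donate(1)", True, False),   # Latinate, does not alternate
    ("refund(1)", True, True),    # Latinate, alternates anyway
    ("give(1)",   False, True),
    ("wish(4)",   False, False),
]
counts = alternation_by_latinate(sample)
print(counts)  # {True: (2, 1), False: (2, 1)}
```

Percentages such as the 65% alternation rate for Latinate senses then follow directly from the pairs of counts.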
4.7 Conclusion

Most systems for NLP require vocabularies substantially larger than those typically developed for theoretical purposes. The evaluation of the LDOCE grammar coding described in these experiments suggests that it is sufficiently detailed and accurate (for verbs) to make the on-line production of the syntactic component of lexical entries for a demonstration parsing system both viable and labour saving. However, the success of the programs described above in producing useful lexical entries depends directly on the accuracy rate of the assignments in the source dictionary. Correcting the mistakes and omissions would be a non-trivial exercise. This is part of the motivation for the interactive, rather than batch-mode, … applications …
4.8 Notes

We would like to thank Geoffrey Leech, Beth Levin, Steve Pulman, Graham Russell and Karen Sparck Jones for their comments on various drafts, which substantially improved this chapter. We are entirely responsible for any remaining errors. Part of the research reported here was funded by the UK Science and Engineering Research Council.

5 A computational lexicon for English

John Carroll and Claire Grover

5.1 Background

… the grammatical nature of the English lexicon. … community is being carried out jointly by groups at the Universities of Cambridge, Lancaster and Edinburgh. The goal of these three closely related projects is to produce directly compatible rule systems and associated software, capable of functioning together as an integrated system for morphological and syntactic parsing of texts. The projects aim to deliver, respectively, a sentence grammar of English together with a grammatically-indexed lexicon, a combined inflectional and derivational morphological analyser and dictionary, …
… of a current approach and summarises the further work we plan to do, directed towards deriving a large lexicon from LDOCE.

The requirements of a range of potential applications and our diverse user community motivate, in particular, the need for a morphological and syntactic analyser with wide coverage of English grammar and vocabulary. Grover et al. (1987) describes the sentence grammar formalism and current coverage of the English grammar in detail. Russell et al. (1986) describes the morphological analyser and dictionary …

The next section summarises the detailed requirements on the target lexicon imposed by the morphological analyser and the sentence grammar, and outlines a processing strategy for deriving appropriate lexical entries from LDOCE. We then argue that such a strategy must be open to frequent manual intervention due to various types of inaccuracy in the grammatical coding in LDOCE. In section 5.4, we describe our methodology together with a lexicon development system …

5.2 The target lexicon

Both the grammar and morphology toolkit projects have adopted an … handle semantics, there is still the need to provide a minimal, theoretical … of verbs, and their semantic types must be made available in the lexical entries.

LDOCE is an obvious starting point for developing such a detailed and substantial lexicon. Apart from the obvious motivation of attempting to derive a large list of words from a computerised source, LDOCE is particularly relevant to this project since it offers, through the system of grammar codes, as has been discussed in Akkerman (this volume), a great deal of detail about the grammatical behaviour of individual words. LDOCE contains only base and irregular forms (it excludes regular inflectional variants); the entries in a derived lexicon would thus be of exactly the right type for merging with the existing hand-crafted lexicon.

[V +, N -, BAR 0, PAST -, CAT V, FIX NOT, REG +, COMPOUND NOT, AT +, LAT +, SUBCAT NP]
[V -, N +, BAR 0, POSS -, PLU -, PART -, PRO -, COUNT +, INFL +, NFORM NORM, PN -, PER 3, CAT N, FIX NOT, COMPOUND NOT, AT +, LAT +, SUBCAT NULL]

Figure 5.1 Lexical entries for address
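Entries of the kind shown in Figure 5.1 can be pictured as simple feature/value mappings, and the merge with an existing hand-crafted lexicon as a union over spellings. The sketch below is illustrative only: the dict representation and the precedence policy (hand-crafted entries win) are assumptions, not the projects' actual data structures.

```python
# Sketch: lexical entries as feature/value mappings, after Figure 5.1,
# and a merge of a derived lexicon into a hand-crafted one.

address_verb = {"V": "+", "N": "-", "BAR": 0, "CAT": "V", "FIX": "NOT",
                "REG": "+", "COMPOUND": "NOT", "AT": "+", "LAT": "+",
                "SUBCAT": "NP"}
address_noun = {"V": "-", "N": "+", "BAR": 0, "CAT": "N", "FIX": "NOT",
                "COUNT": "+", "PER": 3, "NFORM": "NORM",
                "SUBCAT": "NULL"}

def merge_lexicons(hand_crafted, derived):
    """Union of the two lexicons; where a spelling occurs in both,
    the hand-crafted entry list is kept (an assumed policy)."""
    merged = dict(derived)
    merged.update(hand_crafted)
    return merged

derived = {"address": [address_verb, address_noun]}
hand = {"believe": [{"CAT": "V", "SUBCAT": "SFIN"}]}
lexicon = merge_lexicons(hand, derived)
print(sorted(lexicon))  # ['address', 'believe']
```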
Figure 5.2a gives a complete list of the feature names and potential values which may occur as part of the lexical entry for a given word or morpheme and which are relevant to the sentence grammar. Figure 5.2b lists the features which are relevant only to the operation of the word grammar. Briefly, they are used in the following way: FIX encodes whether a morpheme is a prefix, a suffix or neither; LAT encodes whether a morpheme is latinate or not; AT encodes whether a stem combines with the suffix +ation or the suffix +ion; INFL marks whether an inflectable morpheme has a regular past tense form or not; COMPOUND identifies the category of a compound; STEM, which is category-valued, indicates the type of …

Nouns are defined as [N +, V -, BAR 0], verbs as [N -, V +, BAR 0], adjectives as [N +, V +, BAR 0] and prepositions as [N -, V -, BAR 0]. Minor categories do not have the features N, V and BAR. Instead they are usually defined simply by their SUBCAT feature; for example, the complementiser that is defined as [SUBCAT THAT]. Most other features can then be …

The features appearing on certain categories in addition to the sets defined above are INV, SUBCAT and NEG, which are relevant to verbal categories; DEF and SUBCAT, applicable to nominal categories; AFORM and SUBCAT for adjectival categories; and SUBCAT alone for prepositional categories. Additionally, SUBCAT, AGR, DEF, QUA and POSS appear on determiners, SUBCAT and AFORM appear on degree modifiers of adjectives, SUBCAT and CONJN appear on conjunction words, and PFORM appears on particles. SUBCAT must be specified for all lexical entries. Wh-words of any category also have the features WH, UB and EVER.

These sentence grammar features are used as follows: PRD indicates whether a category can appear in predicative position or not; AGR, which is category-valued, encodes information about the category that item is able to agree with; NEG indicates whether a …

Figure 5.2
a. Features used by the sentence grammar: BAR {-1 0 1 2}, LOC {+ -}, PAST {+ -}, AGR <cat>, VFORM {…}, WH {+ -}, PLU {+ -}, UB {…}, PER {1 2 3}, EVER {+ -}, CASE {NOM ACC}, CONJN {+ -}, COUNT {+ -}, SUBCAT {…}, PN {+ -}, NFORM {…}, PART {+ -}, POSS {+ -}, DEF {+ -}, PRO {+ -}, PFORM {…}; with the abbreviations VERBALHEAD {VFORM FIN AUX PAST AGR PRD}, NOMHEAD {NFORM PLU PER CASE PRO PN POSS COUNT}, ADJHEAD {QUA ADV NUM NEG PART AGR DEF PRD}, PREPHEAD {PFORM LOC PRO PRD}.
b. Features used by the word grammar: …
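The major-category conventions quoted above lend themselves to a small lookup. The sketch below is a hypothetical illustration of how an entry's N/V/BAR values identify its category; the function name and dict encoding are assumptions.

```python
# Sketch: the [N, V, BAR] major-category definitions from the text.

MAJOR = {
    "noun":        {"N": "+", "V": "-", "BAR": 0},
    "verb":        {"N": "-", "V": "+", "BAR": 0},
    "adjective":   {"N": "+", "V": "+", "BAR": 0},
    "preposition": {"N": "-", "V": "-", "BAR": 0},
}

def category_of(entry):
    """Return the major category whose N/V/BAR values the entry carries,
    or None for minor categories defined by SUBCAT alone."""
    for name, feats in MAJOR.items():
        if all(entry.get(f) == v for f, v in feats.items()):
            return name
    return None

print(category_of({"N": "+", "V": "+", "BAR": 0}))  # adjective
print(category_of({"SUBCAT": "THAT"}))              # None (minor category)
```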
… PER indicates whether a nominal is first, second or third person; PLU whether it is singular or plural; COUNT whether a noun is a mass noun or a count one; … NUM appears only on … which are ordinals ([NUM ORD]) and on the relative pro… ; WH appears only on wh-words, and …

… terisation of the morphological and syntactic … of morphemes and words. Some features (e.g. FIX) are specific to bound morphemes; …, however, are relevant to open class vocabulary and closed classes … the features N, V and BAR …

The values of VFORM, FIN and some of the other features in the sample entries for believe are predictable from the part of speech … expected to be present on verbs … entries which are the base, uninflected form … is not predictable from either …

5.3 Inaccuracies in the LDOCE source data

As discussed in Akkerman (this volume, chapter 4), most (if not all) of the features … CAT, are … recoverable from the majority of MRDs; LDOCE appears to be an exception … depends on the assignment of codes to word senses. For example, … the entry for promise(1) … a tagging of major … where a comma (rather than semi-colon) … is an error … errors which cause processing problems.
… the entry for upset(3) 'to cause to worry', which should contain a code licensing sentential subjects, as in That Mary is pregnant upset Bill. Some information is simply not encoded in the LDOCE entries; for example, the information represented by the morphological features AT and LAT does not appear in LDOCE. These features play an important role in the analysis of derivational variants, and are necessary for the correct working of the word grammar. If they are not present, many morphologically productive, but non-existent, lexical forms will be potentially analysable by the lexicon system. At the sentence grammar level, too, there seem to …

5.4 A methodology and a system for lexicon development

… (no MRD provides a complete, consistent and totally accurate source of lexical information). In the light of this, the solution we have adopted is a … entries are … incomplete, and eventually added to the target lexicon when they are deemed satisfactory. This methodology is conceptually very simple, and full software support for it in its bare essentials need correspondingly be quite minimal. However, two major design considerations need to be taken into account for such a system to be usable, and thus for the methodology to be viable.
… the lexical … format … as a template, with … paired … a SUBCAT value … follows from the system's provision of powerful facilities to help the user check the results of the automatic translation stage, and its flexibility in allowing the user to modify the form of the translations produced.

5.4.2 The automatic translation phase

The first phase in deriving lexical entries for the target lexicon is an automatic translation of the grammar code field of a source entry into …

believe 1; 3 [T5a,b; V3; X (to be) 7]

((Takes NP NP) 7) -> NP
((Takes NP SBar) 7) -> SFIN
((Takes NP NP Inf) (Type 2 ORaising)) -> OR
((Takes NP NP Inf) (Type 2 OEqui)) -> OE
((Takes NP SBar) (Type 2))
(or ((Takes NP NP Inf) (Type 2 ORaising))
    ((Takes NP NP NP) (Type 2 ORaising))
    ((Takes NP NP AuxInf) (Type 2 ORaising))
    ((Takes NP NP AP) (Type 2 ORaising)))

Figure 5.3 Lexical templates derived for the third sense of believe

The next stage takes these templates into the representation used by the target lexicon, in our case a set of categories each containing a list of feature/value pairs. Which features and values appear in the final categories is determined by two mechanisms, both of which may be customised by the user.

The second mechanism uses the sets of entry completion rules and multiplication rules which are defined to the morphological analyser. Whenever a lexical entry matches the pattern in one of the rules it is either padded out with new feature/value pairs (by entry completion rules), or completely new entries, based on the existing one, are generated (by multiplication rules). Once a SUBCAT value for an entry has been decided, these rules are applied to the entry to flesh it out with a default set of features and associated feature values. These entry completion and multiplication rules may be edited by the user to exercise a fine degree of control over the output of the translation process. Figure 5.5 shows an entry completion rule which assigns the feature/value pair [AUX -] to verb entries which are unspecified for the feature AUX. Figure 5.6 shows a multiplication rule which generates a passive entry ([VFORM EN, PRD +]).

Add_AUX: …

Figure 5.5
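Entry completion and multiplication rules of the kind just described can be sketched as pattern/action pairs. The rule notation below is an assumption for illustration, not the system's actual format.

```python
# Sketch: an entry completion rule pads a matching entry with defaults;
# a multiplication rule spawns a new entry from a matching one.

def complete(entry, pattern, defaults):
    """Entry completion: if `entry` matches `pattern`, fill in any of
    `defaults` that are still unspecified."""
    if all(entry.get(f) == v for f, v in pattern.items()):
        for f, v in defaults.items():
            entry.setdefault(f, v)
    return entry

def multiply(entry, pattern, overrides):
    """Multiplication: if `entry` matches `pattern`, return the original
    plus a new entry differing in `overrides`."""
    if all(entry.get(f) == v for f, v in pattern.items()):
        return [entry, {**entry, **overrides}]
    return [entry]

verb = {"CAT": "V", "SUBCAT": "NP"}
# Add_AUX analogue: verbs unspecified for AUX receive [AUX -]
complete(verb, {"CAT": "V"}, {"AUX": "-"})
# passive analogue: a transitive entry also yields a passive variant
entries = multiply(verb, {"SUBCAT": "NP"}, {"VFORM": "EN", "PRD": "+"})
print(len(entries), verb["AUX"])  # 2 -
```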
Mult_PASSIVES: (_ [VFORM EN, PRD -] _) …

Figure 5.6

… lexical categories, each of which is compatible with those of the lexical …

To check the plausibility of a definition of a word, the user may at any time request the LDE to display all the frames that the word (given its definition) may fill. Each such frame is shown paired with the SUBCAT value of its slot and with the word taking the place of the slot marker. The grammaticality or otherwise of these instantiated frames assists the user in the task of judging the correctness of the word's definition. As well as providing an outright check of the correctness of a definition, syntactic frames can also be used to help the user choose between related but nevertheless distinct subcategorisation possibilities.

SFIN: They [...] that someone is something
OR:   They [...] someone to be something
      They [...] there to be a problem
OE:   They [...] that someone is something
      They [...] there to be a problem

Figure 5.7 Syntactic subcategorisation frames (abbreviated)
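The frame-display facility can be pictured as simple template substitution. The sketch below abbreviates the Figure 5.7 inventory; the function and the exact frame texts beyond those quoted are assumptions.

```python
# Sketch: instantiating subcategorisation frames for a word, with the
# word replacing the "[...]" slot marker, after Figure 5.7.

FRAMES = {
    "SFIN": ["They [...] that someone is something"],
    "OR":   ["They [...] someone to be something",
             "They [...] there to be a problem"],
    "OE":   ["They [...] someone to be something"],
}

def instantiate(word, subcats):
    """Yield (SUBCAT value, instantiated frame) pairs for the word."""
    for sc in subcats:
        for frame in FRAMES.get(sc, []):
            yield sc, frame.replace("[...]", word)

for sc, sent in instantiate("believe", ["SFIN", "OR"]):
    print(sc, ":", sent)
# SFIN : They believe that someone is something
# OR : They believe someone to be something
# OR : They believe there to be a problem
```

The user judges each printed sentence for grammaticality, which indicates whether the hypothesised SUBCAT value is correct.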
However, the initial automatic translation phase might in some cases assign incorrect SUBCAT values … in some formalism … For example, [SUBCAT OR] might be assigned when it should … The NP/NOPASS example above highlights one of the prob­lems: semantically bleached fra[mes] … transformational alterna[tions] … can interact with … and can break the frames: …

Figure 5.9 Syntactic frames applied to an incorrect definition of persuade

… grammaticality … even (as in this ex…) … phrasal … The grammar code system … but does not represent systematically whether a particular transitive sense has a passive … there are two types of …

The generator may be invoked on a word during the second, interactive phase of derivation …
Words such as believance in the affixed list generated from believe above represent morphological irregularity rather than an incorrect feature specification in the base form. The user may edit the surface form and definition of words produced by the generator in exactly the same way as words taken directly from LDOCE. In the case of believance, the surface form would be changed to belief, and its definition inspected and perhaps checked using the syntactic frames. When the new form is saved it is added to the derived lexicon as a new (non-productive) irregular entry. (The analyser can be run in a mode where non-productive entries are preferred to productive ones.)

Other non-existent forms in the generated list (such as cobelieve) may just be ignored, since they will not form part of the output of the system, although their definitions will still be implicit in the lexicon and morphology component of the analyser (because co- and believe will be there). It is assumed that this overgeneration is harmless, though, as such forms will not occur in actual input to the analyser.

5.5 Future developments

Although we find that the entries produced by the automatic translation phase of our system are on the whole satisfactory, given that they may be edited and checked with the syntactic frames and morphological generator, there is still much room for improvement. Our aim is to generate only syntactic definitions for derived entries, and we take this information from the obvious place, the LDOCE grammar code fields. However, there are other fields in a source entry that are potentially of some use. We believe that taking this information into account when assigning values to features would remove many of the shortcomings we have observed in derived entries.

One potential extra source of information is the box codes, which contain information about the type of subject and object a verb requires. For example, a verb like kick has the box code ___D__V_C which means that it has an animal or human subject and a concrete object (either animate or inanimate; see chapter 1). A verb such as die, on the other hand, … This appears … in their box codes (as …U) and can be used to specify the value of their AGR feature to be [V -, N +, BAR 2, PLU +].

Another source of further information might be the phonology field. For example, it is important to the word grammar that entries are marked as either latinate (LAT +) or not (LAT -), since some affixes attach only to latinate stems (for example -ity), whilst other affixes attach only to non-latinate stems (for example -hood). Since latinate stems tend to be longer than non-latinate ones, it might be possible for the system to hypothesise a LAT value for an entry on the basis of the number of syllables in its phonology field. The user could then confirm or reject this hypothesis.

5.6 Conclusion

Hand-crafting the entries in the substantial lexicons typically required by practical NL systems is often not feasible, and certainly never desirable. The evaluation of the LDOCE grammar coding system suggests that it is sufficiently detailed and accurate (for verbs) to make the on-line production of the syntactic component of lexical entries both viable and labour saving. However, the less than 100% accuracy of the assignments of codes in the source dictionary suggests that a system using LDOCE for lexicon development must embody a methodology allowing rapid, interactive and semi-automatic generation and testing of lexical entries on a large scale.

We have described a lexicon development environment, which embodies a practical approach to using an existing MRD for the construction of a substantial computerised lexicon. The system splits the derivation of target lexical entries into two phases: an automatic translation of the source data into definitions in the form required by the target lexicon, followed by semi-automatic correction and refinement of these definitions, using two tools which tap the user's judgment of grammaticality, into a set of fully checked base (and perhaps also irregular) entries.
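The phonology-field heuristic proposed in section 5.5 above (hypothesise a LAT value from syllable count, then let the user confirm or reject it) can be sketched as follows. The crude vowel-group syllable counter and the two-syllable threshold are assumptions made for illustration.

```python
# Sketch: hypothesising LAT from the number of syllables in a
# pronunciation string, with capitalised letters and '@' standing in
# for vowel phonemes.

VOWELS = set("aeiouAEIOU@")

def syllable_count(pron):
    """Count maximal vowel groups, one per syllable (illustrative)."""
    count, in_vowel = 0, False
    for ch in pron:
        if ch in VOWELS:
            if not in_vowel:
                count += 1
            in_vowel = True
        else:
            in_vowel = False
    return count

def hypothesise_lat(pron, threshold=2):
    """Longer stems are guessed latinate; the user then confirms."""
    return "+" if syllable_count(pron) > threshold else "-"

print(hypothesise_lat("gOlf"))      # - (one syllable: guess non-latinate)
print(hypothesise_lat("k@n'sId@"))  # + (three syllables: guess latinate)
```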
6 LDOCE and speech recognition

David Carter

6.1 Introduction

…

6.2 Constructing and using the hybrid lexicon

… and determining the distribution of the numbers of word candidates returned by queries containing whatever phonetic information the front end is assumed to provide.

If a front end were able to extract the full phonemic content of the input, lexicon lookup would be a routine process of access according to phoneme strings, and it would be unnecessary to allow for partial phonetic knowledge. The first reason that this is unrealistic is that the full phonemic … capabilities.

A number of studies along these lines have been carried out; among the best-known are Shipman and Zue (1982), Huttenlocher and Zue (1983), and Huttenlocher (1985). Typically, such studies use the transcription provided by a hypothetical front end to partition the entries in a machine-readable dictionary into equivalence classes, where all entries whose pronunciations are transcribed to the same symbol sequence are placed in the same class. Statistics are then computed based on the sizes of these classes: maximum class size, expected class size, percentage of singleton (one-member) classes, and so on. These statistics may either treat each word in the same way or be weighted according to word frequency.

For example, a manner of articulation transcription, which classifies … The hybrid lexicon was constructed from LDOCE and one version of the MRC dictionary database (Coltheart, 1981). Pronunciations were taken from the former, and word frequencies (those of Kucera and Francis, 1967) from the latter. These two types of information were specified for 12,850 words, which therefore constituted the hybrid lexicon.
The pronunciation field of an LDOCE entry consists of a string of symbols representing phonemes, stress values, and separators between alternative pronunciations. British (received pronunciation) and American pronunciations are separated by a double vertical bar when they differ; alternatives within one dialect are separated by commas. In addition, certain pronunciation symbols represent an alternation between two sounds; for example, the symbol /ɪ/ may be realised as either /ɪ/ or /ə/, and italic /ə/ may be realised as /ə/ or may be omitted altogether.

Only the British pronunciations were accessed in constructing the hybrid lexicon. Even so, the extraction of every alternative pronunciation would have been a non-trivial task; alternatives are typically not represented in full but only where they differ from the first pronunciation. To determine exactly which symbols in the first pronunciation are subsumed by the hyphen(s) in the second is not easy, and, especially where only a portion for the middle of a word is shown, appears to require phonetic or phonological knowledge. For the entry mistranslate, for example, the knowledge that s and z can substitute for one another in certain contexts would be needed to deduce that the final hyphen stood for /leɪt/ and not for, say, /sleɪt/ or /eɪt/.

In constructing the hybrid lexicon, therefore, only the first (and presumably most common) pronunciation for an entry was used. Where alternatives are generated by multiple-valued symbols such as /ɪ/, one … Stress values in LDOCE mark the beginnings of syllables with those stress values; the beginnings of unstressed syllables are not marked. However, some of the transcriptions studied in this chapter, and also the construction of the LDB, depend on the identification of all syllable boundaries.

Syllable boundaries are indicated for the written forms of the words, but these cannot reliably be applied to pronunciations, firstly because they are intended primarily to indicate hyphenation conventions rather than any directly phonological information, and secondly because of the complex relationship between English written and spoken forms. Therefore, in order to find syllable boundaries, a parsing algorithm was developed to group sequences of phonemes and stress markers into syllables, and within syllables, into onset (initial consonant cluster), peak (vowel group) and coda (final consonant cluster). The algorithm made use of the phonotactic constraints on British English pronunciation given in Gimson (1980). These constraints specify (a) the sequences of phonemes that can occur in each category: for example /l/ is a possible onset but /ml/ is not, so the syllable boundary in streamlined cannot … Under the Maximal Onset Principle (MOP), a boundary is placed as far to the left as possible in order to maximise the following onset. Thus for example the boundary in slipstream is placed after the /p/, even though no constraints would be broken by placing it after the second /s/.

In ambiguous cases like this it is often difficult to say where the syllable boundary really is. The adoption of the MOP does not represent a commitment to any strong theoretical position; rather, it was selected as a simple and quick decision procedure, and one that, moreover, does not discriminate explicitly between stressed and unstressed syllables, as would, for example, the principle that ambiguities should … prevent /ə/ from having either a null coda or … as coda. To cope with such cases, if no parse is possible without breaking a constraint, then the parse that breaks the minimum number of co-occurrence constraints is accepted, the MOP again being used in the case of a tie.
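The Maximal Onset Principle parse can be sketched as follows, with letters standing in for phonemes and a toy onset inventory in place of Gimson's full phonotactic constraints (stress markers and coda constraints are omitted).

```python
# Sketch: MOP syllabification. Each boundary is cut as far left as the
# onset inventory allows, maximising the following syllable's onset.

ONSETS = {(), ("s",), ("l",), ("m",), ("r",), ("t",), ("p",),
          ("s", "t"), ("s", "t", "r"), ("s", "l"), ("p", "r")}
VOWELS = set("aeiou")

def syllabify(phones):
    # find maximal runs of vowels: the syllable peaks
    peaks, i = [], 0
    while i < len(phones):
        if phones[i] in VOWELS:
            j = i
            while j < len(phones) and phones[j] in VOWELS:
                j += 1
            peaks.append((i, j))
            i = j
        else:
            i += 1
    sylls, start = [], 0
    for (_, end1), (start2, _) in zip(peaks, peaks[1:]):
        cluster = phones[end1:start2]   # consonants between two peaks
        cut = 0                         # leftmost legal cut = maximal onset
        while cut < len(cluster) and tuple(cluster[cut:]) not in ONSETS:
            cut += 1
        sylls.append(phones[start:end1 + cut])
        start = end1 + cut
    sylls.append(phones[start:])
    return ["".join(s) for s in sylls]

print(syllabify(list("slipstream")))  # ['slip', 'stream']
```

As in the text, the boundary in slipstream falls after the p, because str is the largest legal onset for the second syllable.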
Chapter 6 6.3 Transcriptions, equivalence classes and consistency classes
LDOCE and speech recognition
placed in the third linear search, this would take a time proportional to the square of the
case of a tie.The /6/ sound in together is therefore
Words like kvass, and errors such as that for starsh, cannot size of the lexicon, typically involving tens or hundreds of millions of
syllable.
be dealt with in this way, since /kv/ and /II/ are not English legal comparisons between symbol sequences and entries. For this reason,
what peaks and codas they combine with; however, to my knowledge, every lexicon study to date within this paradigm has
onsets no matter
used an assign and count procedure or something equivalent to it.
such recalcitrant cases are quite rare.
Unfortunately, however, assign and count only yields the same re-
sults as assign and lookup (and is therefore valid) for a certain class
equivalence classes and of transcriptions: those for which the symbol sequences resulting from
6.3 Transcriptions,
classes transcribing two input words are always either identical or inconsistent
consistency with one another. If the transcription is capable of yielding two non-
identical but consistent symbol sequences, such as (6.2), then assign
The partitioning of the lexicon into equivalence
classesincarried
orderoutto and count is invalid.
measure the effectiveness of a transcription is most easily
what I will call an assign and count procedure, as follows. Vowel LiquidOrGlide WeakFricative
by g
Stop Vowel LiquidOrGlide I (6.2)
transcribe each word in the lexicon according to the current tran-
(a) For if
scription; example, a manner transcription is
always applied to each seg-
ment, the words
golfand gu1f(and, for LDOCE, only those
words) will
add it to the equivalence class dened by the resulting sequence receive the symbol sequence given in (6.1) above. This reects the fact
(b)
of symbols; that the corresponding front end would be unable to distinguish those
two words from each other, but could distinguish them from all others
processing the whole lexicon, examine the equiva-
(c) nally, after in the lexicon. Assign and count will create a class containing only
lence classes for uniqueness, average size, or whatever quantities those two words, and assign and lookup will, when either of them is
are deemed appropriate. input, retrieve them and only them.
But non-identical, non-exclusive sequences will arise whenever it
does reect would hap-
However, this procedure not
In
Very directlywhat is not assumed that a front end will always transcribe a word in the
pen if a performed
front the
the front end transcription in question. In that case, the incoming signal, representing an uttered word, would receive a (partial) transcription, which would then be used to extract from the lexicon all those entries matching it. A procedure that reflects this process more directly is that of assign and lookup: for each entry in the lexicon, to:

(a) transcribe the pronunciation of the entry according to the current transcription;
(b) extract from the lexicon a class of all those entries whose pronunciations match the resulting symbol sequence;
(c) gather statistics from that consistency (not equivalence) class.

From a computational point of view, assign and count is much easier and cheaper to carry out. Each entry in the lexicon need only be accessed once, and the time taken is therefore roughly proportional to the size of the lexicon. For assign and lookup, however, step (b) involves searching the lexicon once for every word in it. Using an exhaustive search, assign and lookup is far more expensive; assign and count is equivalent to it only on the assumption that every utterance of a word receives the same transcription.

Such an assumption would, in fact, be unrealistic, given the variability in speech. Even discounting the possibility of errors, the amount of information the front end can extract from a given segment in a given word will vary between different utterances of the word. Suppose, for example, that the front end is able to make a manner transcription of every segment and that, in addition, it can recognise vowels uniquely on a random 50% of occasions. This means that if either golf or gulf is uttered, there is a 50% chance of two candidate words being retrieved from the lexicon, and a 50% chance of just one candidate being retrieved.

But the probability that assign and count will put golf and gulf in the same class for this transcription is not 50% but 25%. This will occur only when, in the simulation, a full (phonemic) transcription of the vowel is made for neither word. Thus assign and count will effectively predict, incorrectly, that the two words can be distinguished three times out of four.¹

Unfortunately, therefore, when there is any element of randomness or variability in the transcription, the more computationally tractable assign and count method is invalid; and any realistic simulation of a front end must involve variability. We therefore need to ask:
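The golf/gulf calculation can be checked with a short simulation. The sketch below is illustrative rather than the chapter's procedure: it assumes a hypothetical front end that always reduces consonants to a manner class and recognises the vowel phonemically on a random 50% of occasions, and a toy two-word lexicon.

```python
import random

# Hypothetical two-word lexicon and transcription scheme (not from the text).
LEXICON = {"golf": ("g", "o", "l", "f"), "gulf": ("g", "u", "l", "f")}
MANNER = {"g": "stop", "l": "liquid", "f": "fricative", "o": "V", "u": "V"}

def transcribe(phones, rng):
    out = []
    for p in phones:
        if MANNER[p] == "V" and rng.random() < 0.5:
            out.append(p)            # vowel recognised phonemically
        else:
            out.append(MANNER[p])    # only the manner class is recovered
    return tuple(out)

def trial_lookup(word, rng):
    """Assign and lookup: transcribe the uttered word, then retrieve
    every lexicon entry consistent with that transcription."""
    t = transcribe(LEXICON[word], rng)
    def consistent(phones):
        return all(s == p or s == MANNER[p] for s, p in zip(t, phones))
    return [w for w, ph in LEXICON.items() if consistent(ph)]

def trial_count(rng):
    """Assign and count: transcribe each entry once and group identical
    transcriptions into equivalence classes."""
    classes = {}
    for w, ph in LEXICON.items():
        classes.setdefault(transcribe(ph, rng), []).append(w)
    return classes

rng = random.Random(0)
n = 20_000
two_candidates = sum(len(trial_lookup("golf", rng)) == 2 for _ in range(n)) / n
same_class = sum(any(len(c) == 2 for c in trial_count(rng).values())
                 for _ in range(n)) / n
print(two_candidates)   # close to 0.5: lookup confuses the pair half the time
print(same_class)       # close to 0.25: count does so only a quarter of the time
```

The discrepancy between the two estimates is exactly the three-times-out-of-four error described above.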
6.4 Measuring classes

The second claim I set out to substantiate in this chapter is that measures based purely on class size, whether of equivalence classes or consistency classes, are not the most accurate guide to the power of a transcription.

The first point to note is that certain statistics that can be derived from equivalence classes are meaningless for consistency classes. These are what one might call class-based as opposed to word-based statistics. For example, in an experiment where equivalence classes are legitimate, there are two ways one might calculate the average class size. The first, class-based, way is simply to divide the total number of words in classes (i.e. the number of words in the lexicon) by the number of classes to get a mean class size. The second, word-based, way is to find the expected class size: the average, over all words in the lexicon, of the size of the class each word falls into. In the latter case, larger classes will be counted more often, and therefore the expected class size is always at least as large as the mean. Word-based statistics describe the performance of the putative front end more directly: they describe the classes that arise from the average input word. For this reason, word-based statistics can easily be biased to reflect differences in word frequencies: for example, to find the class size expected for a word chosen at random from a corpus (rather than from the lexicon) one merely weights the contribution of each class size according to the frequency in that corpus of the word giving rise to it. (Just as word-based statistics produce more pessimistic values than class-based ones, so frequency weighting also tends to increase class sizes. This is because more frequent words tend to be shorter, having fewer segments to be categorised and therefore often falling into larger, less discriminating classes.)

But my claim here is more radical than this: it is not (merely) that word-based statistics are superior to, and more widely applicable than, class-based ones, but that neither type, even when frequency weighted, is really appropriate.

Earlier we saw that Huttenlocher (1985) found that a manner of articulation transcription partitioned a 20 000 word lexicon into equivalence classes of at most 223 items. Such figures might lead one to conclude that if a manner categorisation can be performed, a major part (perhaps the major part) of the word recognition task has been accomplished. If, even in the worst case, the set of candidate words is only just over 1% of the original lexicon, is the problem not then 99% solved? The answer, we will now see, is that it is not.

6.4.1 Word counts and word frequencies

Measures based on class sizes are less than ideal for two main reasons. Briefly, these are that, even when frequency-weighted, they do not take proper account of the effects of word frequency; and that a logarithmic analysis is more appropriate than the customary linear one. In this section I attempt to substantiate these two claims and then present an alternative, information-theoretic measure that suffers from neither drawback.

Consider two possible equivalence class partitionings, A and B, of the same lexicon. In partitioning A, the words in a given class have widely varying frequencies; in the absence of conclusive clues from subsequent processing, the most frequent word in such a class has a high probability of being the correct one. By contrast, in partitioning B, all the words in a given class have similar frequencies, so none of them is outstandingly probable.

To get some idea of the magnitude of this difference, a frequency-ordered sublexicon was constructed from the 10 000 most frequent words in the hybrid lexicon described in section 6.2. The total frequency of, on the one hand, the most frequent word from each class in A and, on the other hand, the most frequent word from each class in B, was calculated as a fraction of the total frequency of all the words in the sublexicon. The two fractions were 0.846 and 0.164. In other words, one would expect the strategy of simply guessing that the uttered word is the most frequent in its class, without attempting any further analysis, to yield the correct result 84.6% of the time for partitioning A, but only 16.4% of the time for partitioning B.

Thus, other things being equal (i.e. ignoring possible differences in the contribution of subsequent processing), partitioning A is far more useful than partitioning B.
A guess of this kind is most reliable when one word in each class is outstandingly probable while the probabilities of the others are small. This will tend to occur when the frequencies of different classes are fairly uniform, as in A but not in B. The expected class size alone fails to capture this important factor.

These arguments apply to consistency classes as well as equivalence classes. A consistency class may consist of words of widely varying frequencies, like the equivalence classes of A, or of words of similar frequencies, like those of B; and again, the former distribution will be much more helpful.

6.4.2 Linear and logarithmic measures

If a transcription applied to a lexicon of 10 000 words gives an expected class size of 100, or 1% of the lexicon, then the contribution of the transcription towards identifying the word is more like 50% than 99%. This is because on average it reduces the candidate set by a factor of 100, leaving a further 100-fold reduction to be achieved in some other way. Similarly, a reduction from 10 000 to 1000 would represent a contribution of about 25%. Thus a more helpful figure than the expected class size is the value of

    (1 - \log ExpectedClassSize / \log LexiconSize) \times 100%        (6.3)

Formula (6.3) often sheds quite a different light on the usefulness of a transcription.

6.4.3 An information-theoretic approach

An alternative approach, with a more rigorous motivation, is that of calculating the amount of uncertainty, entropy or unextracted information present before and after the transcription is applied. The entropy of a lexicon L, where word w_i has probability p_i of occurring, and successive words are assumed to be independent, is

    U(L) = -\sum_{w_i \in L} p_i \log p_i        (6.4)

(all logarithms are to base two). Intuitively, the entropy is, roughly, the average number of bits that would be required to encode a word chosen at random from the lexicon according to the probability distribution, using the most efficient coding scheme possible. For example, if N is a power of two and all words are equally probable (so that p_i = 1/N for all i), then each word can be assigned a code of \log N bits. The expected remaining uncertainty after the transcription is

    U(L|T) = -\sum_i P_{C_i} \sum_{w_j \in C_i} q_{j|C_i} \log q_{j|C_i}        (6.5)

where q_{j|C_i} is the probability of word w_j occurring given that a member of class C_i has occurred:

    q_{j|C_i} = p_j / P_{C_i} if w_j \in C_i, and 0 otherwise        (6.6)

and P_{C_i} is the probability of any of the words in C_i occurring, i.e. P_{C_i} = \sum_{w_j \in C_i} p_j. The percentage of information extracted (PIE) by the transcription is then

    (1 - U(L|T) / U(L)) \times 100%
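Formulas (6.4) to (6.6) can be put to work on a toy example. The word probabilities and the two partitionings below are invented for illustration; the functions mirror the definitions above, with PIE computed as (1 - U(L|T)/U(L)) x 100.

```python
import math

# Toy lexicon: word -> probability of occurrence (invented figures).
probs = {"the": 0.4, "of": 0.3, "cat": 0.2, "dog": 0.1}
# Partitioning A mixes frequent and rare words within each class;
# partitioning B groups words of similar frequency together.
A = [["the", "cat"], ["of", "dog"]]
B = [["the", "of"], ["cat", "dog"]]

def U(words):
    """Entropy of the lexicon, formula (6.4); logarithms to base two."""
    return -sum(probs[w] * math.log2(probs[w]) for w in words)

def U_given(partition):
    """Remaining uncertainty U(L|T), formulas (6.5) and (6.6)."""
    total = 0.0
    for cls in partition:
        P = sum(probs[w] for w in cls)      # P_Ci
        for w in cls:
            q = probs[w] / P                # q_{j|Ci}
            total -= P * q * math.log2(q)
    return total

def pie(partition):
    """Percentage of information extracted by the partitioning."""
    words = [w for cls in partition for w in cls]
    return (1 - U_given(partition) / U(words)) * 100

print(round(pie(A), 1), round(pie(B), 1))
# A and B have identical class sizes, yet their PIE values differ.
```

The point of the example is exactly the one argued above: expected class size cannot distinguish A from B, while the entropy-based measure can.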
The measure also distinguishes, for the 10 000 word sublexicon, between the two partitionings of the previous section, because the value of -\sum P_{C_i} \log P_{C_i} is maximised, for \sum P_{C_i} = 1, when all the P_{C_i} are equal; this is more nearly the case for partitioning A than for partitioning B. Thirdly, the measure is clearly logarithmic; in fact, if all the p_i are equal and all classes are the same size, then it reduces to (6.3).

6.4.4 Case Study Two

Just as, earlier, we found that the methodologically incorrect assign and count procedure could give rise to misleading results, it is now time to investigate whether there are in practice significant differences between the predictions arising from class size statistics and from the theoretically more appropriate PIE statistic. It turns out that, again, there can be important discrepancies. (For a fuller account, see Carter, 1987.)

(Unstr = unstressed)

The PIE figures suggest that although stressed syllables are considerably more informative than unstressed if they are transcribed phonemically, there is no significant difference for a manner class transcription. This is in spite of the fact that the ECS figures derived from the same data would support Huttenlocher's conclusions.

Altmann (personal communication) suggests a possible explanation for the superiority of stressed syllables only for a full transcription. He hypothesises '...that this result is due to an uneven distribution of the 44 or so phonemes across stressed and unstressed syllables. For instance, one might expect to find all the vowel sounds in stressed position, but only some in unstressed position. Similarly for the distribution of consonant clusters. It would follow, I think, that a full phonemic transcription of the stressed segments would be more informative simply because of these uneven distributions.' If such an account is true, it provides an explanation for the increased informativeness of stressed syllables which does not depend on the fortuitous assignment of lexical stress to the (otherwise) most informative portion of the word. It could be that the difference in informativeness disappears for a manner transcription because phenomena such as reduction apply mainly in unstressed syllables. The figures he presents do not themselves allow this interpretation to be tested; but similar figures were obtained using an LDOCE-derived lexicon and equivalence
classes: a statistic to which PIE is, I have argued, superior. It is therefore necessary to return to the question of the relative informativeness of stressed, randomly chosen and evenly spaced segments, and derive PIE figures here too. In order to do so accurately, one must improve the treatment of stress in monosyllables and also introduce word frequency weighting.

Altmann (1986), and, so far, our Case Study One, treated monosyllabic words as always being unstressed. It is more realistic to assume that, for careful continuous speech, content (open class, lexical) words will be stressed, and function (closed class, grammatical) words will not. When this assumption is made and frequency weighting is introduced, the proportion of stressed segments in the LDOCE-derived lexicon becomes 42.4%, or almost exactly 3/7. Therefore experiment (iii) was adjusted to transcribe phonemically 42.4% of segments at random, whereas in experiment (iv), the effect of maximally even spacing was achieved by transcribing phonemically segments 7n + k, 7n + k + 2 and 7n + k + 4 in a word, for all integer n, where k is selected at random from 0, 1, ..., 6 for each word.

Under these conditions, the PIE values obtained (with the uniqueness figures from the table in 6.3.1 reproduced for comparison) are:

    Experiment           % Unique   PIE
    (i)   All mid         69.2%     94.3%
    (ii)  Stressed full   79.2%     97.3%
    (iii) Random full     78.7%     96.7%
    (iv)  Even spacing    78.7%     97.0%

The pattern of the earlier, uniqueness results is, perhaps coincidentally, repeated almost exactly here: there is no great difference between the results of (ii), (iii) and (iv). The conclusions are therefore also unchanged.

6.5 Summary, discussion and conclusions

In this chapter I have criticised two of the practices commonly adopted in assessing hypothetical front ends for large-vocabulary speech recognisers. The first practice, that of partitioning the lexicon into equivalence classes, was shown not in general to be applicable to the kinds of front end behaviour that one would wish to model; a more general notion, that of consistency classes, was introduced. It was noted that the experimental method needed to obtain consistency classes relies, in general, on the availability of a flexible LDB of the kind described in chapter 2. Further, it was shown that the inappropriate use of equivalence classes can, at least in principle, give rise to misleading conclusions about promising directions for front end design.

The second practice criticised was the use of class size statistics for assessing transcriptions. These statistics were shown both to be insensitive to variations in word frequency and, intuitively, to give an over-optimistic picture of a transcription's usefulness. A more appropriate measure, that of percentage of information extracted (PIE), was introduced and shown to suffer from neither of these drawbacks. It was also shown that PIE values, used with consistency classes, can lead to different and more reliable conclusions about front end design. Specifically, stressed syllables are more informative than unstressed only if they are transcribed phonemically, and not, as had previously been thought, if a manner transcription is made.

However, there are other assumptions and restrictions which we have not criticised and which characterise both the present study and most of the others discussed here. For example, further work is clearly needed to allow for the possibility of errors both in segmenting the signal and in transcribing segments. Within the paradigm assumed here, an error would effectively lead to a word not being included in its own consistency class.

Adda et al. (1987) attempt to minimise the likelihood of such errors by basing their transcriptions not on phonemes but on rough spectral features which are more directly represented in the speech signal and hence more likely to be extractable without errors. They report, for a 17 000 word French lexicon, a frequency-weighted ECS of 132 words for the spectral-feature transcription as opposed to 53 words for a manner transcription. If we approximate PIE by the formula (6.3) (which is the best approximation possible if only the ECS and lexicon size are known), these correspond to values of 50% and 59% respectively. Thus the use of a more realistically extractable transcription with the same number of symbols decreases the PIE significantly. Adda et al. make two further concessions to phonetic reality: they allow multiple phonetic representations of words to reflect lexical variability, and they assume that two neighbouring segments of the same spectral class will not be distinguished but will be transcribed as a single symbol. These two eminently realistic concessions increase the ECS from 132 words to 1111, decreasing the approximate PIE from 50% to only 28%, or less than half what it is for a manner class, segment-by-segment, single-pronunciation experiment. Early in this chapter we saw that Huttenlocher calculated an ECS of 34 words for a six-way manner transcription, representing the elimination, on average, of 99.83% of the vocabulary; but it now seems, ignoring any relevant differences between French and English (see below) and the fact that PIEs cannot be derived exactly from ECSs, that we can ex-
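Returning for a moment to experiment (iv): the maximally even spacing scheme described above can be sketched as follows. The value of k is assumed here to range over 0 to 6, so that exactly three of every seven segment positions are chosen.

```python
import random

def even_spacing_segments(n_segments, rng):
    """Experiment (iv): transcribe phonemically the segments numbered
    7n + k, 7n + k + 2 and 7n + k + 4 for all integer n, with k chosen
    at random for each word (assumed here to range over 0-6)."""
    k = rng.randrange(7)
    return [i for i in range(n_segments) if (i - k) % 7 in (0, 2, 4)]

rng = random.Random(1)
picked = even_spacing_segments(14, rng)
print(picked)   # exactly 3 of every 7 positions, spread as evenly as possible
```

Whatever k is drawn, the chosen positions fall into the residues k, k + 2 and k + 4 modulo 7, so the 3/7 proportion and the maximal spread are guaranteed.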
Notes

1  The same sort of discrepancies can in theory arise even without randomness or variability. For example, if segments in stressed syllables consistently receive a more detailed transcription than those in unstressed, then the golf/gulf problem can arise between pairs like refuse (noun) and revise (assuming that the stress pattern is not part of the transcription).

2  Huttenlocher treated all monosyllabic words as stressed. Consequently, transcribing only segments in unstressed syllables will lead to an unrealistically large ECS, because monosyllables, which comprise the majority of words in text, will all map onto a null symbol sequence. Recognising this, Huttenlocher presented expected and maximum class sizes for transcribing unstressed segments in polysyllabic words only, but omitted to do the same for transcribing only stressed segments, thus not allowing any direct comparison.

Chapter 7: Analysing the dictionary definitions

for the FRUMP system (DeJong, 1979), a system designed to achieve a high degree of robustness. The problem does not disappear when dealing with limited discourse domains of the type encountered in database query and expert system interfaces. This is because of the large number of synonyms and specialised words that can occur, and because of the difficulty of delimiting discourse domains exactly.

A different problem faced by designers of natural language understanding systems is how to provide for graceful failure of sentence analysis. There is thus the need to produce reasonable incomplete interpretations of sentences when complete analyses are not possible. This situation can occur because of gaps in the grammatical knowledge of the system or because the system is faced with extragrammatical input. This chapter shows how a possible solution to this partial analysis problem can be applied to the vocabulary problem in the context of large machine readable dictionaries.

More specifically, we will see how word sense definitions from the Longman Dictionary of Contemporary English (Procter, 1978; henceforth LDOCE)
Coping with unknown words, and robust phrasal analysis, can both be thought of as instances of a more general natural language interpretation problem. This is the problem of coping with incomplete knowledge of language use: lexical knowledge in the first case and knowledge of phrasal structure in the second. The unavoidable incompleteness of the knowledge of language use available to a language processing system means that trying to achieve robust natural language processing involves developing effective mechanisms for dealing with this problem. The research reported in this chapter is intended to be a contribution to this development effort.

The next two sections will discuss the kind of output that may be produced from processing dictionary definitions, and give examples of the results of processing LDOCE definitions produced by an implemented definition analyser. Some problems that were encountered are then discussed. Later sections motivate and explain the basic analysis algorithm, and then describe and illustrate details of analysis and structure building rules. Finally, some remarks are made about the performance of the current implementation and necessary further research.

7.2 Definition analysis

There are various possibilities for the kind of structures useful for language understanding that may be derived from dictionary definitions. These include meaning postulates (Carnap, 1952) expressed in some logic; constraints or semantic formulae based on semantic primitives (Katz and Fodor, 1963; Wilks, 1975b); and structures carrying information enabling the classification of the new word sense with respect to an existing classification of entities in a discourse domain. The word sense entries contain additional semantic information that could be combined, or used in conjunction, with the structures produced from processing word sense definition texts. This information is available as box codes that give selectional restrictions, and subject codes that indicate the typical discourse domain of word sense usage (these codes occur in the machine-readable version of the dictionary, but not in the published version).

LDOCE definitions are written using a restricted definition vocabulary, a rule that is largely respected throughout the definition texts. Some ways in which the definitions diverge from a strict interpretation of this rule are discussed later. It should be remarked here that the LDOCE restricted definition vocabulary has more in common with a basic English vocabulary than a set of semantic primitives. (A list of the words in the restricted definition vocabulary is given in an appendix to the published version (Procter, 1978) of the dictionary.)

If the output of processing LDOCE definitions was in the form of meaning postulates, then the logic expressions produced would have a new symbol for the word sense being defined, along with symbols corresponding to the senses of words in the definition vocabulary. Similarly, producing semantic primitive formulae would involve building new formulae by putting together formulae corresponding to the word senses of the definition vocabulary.

For the third possible form of output listed earlier, we need a (hand-coded) classification of the central senses of the definition vocabulary, together with a classification of concepts in the particular domain of discourse in terms of these word senses. The descriptions of implementations by Bobrow and Webber (1980), Mark (1981), and Alshawi (1987) show how such a classification can be organised and used during text processing. The LDOCE definition for a new word sense is processed using the mechanism described in this chapter in order to extract sufficient information for including the new word sense in such a classification. A language processing application that depended on a classification of concepts in the discourse domain should then be able to carry out its application task despite the occurrence of a new word in an input sentence.

Extracting the information necessary for classification will of course include locating superordinates in the definitions (which define the so-called ISA relation), as is done in the work reported by Amsler, as well as the kind of information presented in the next section. This way of dealing with unknown words in language processing applications still requires good solutions to the problem of choosing between alternative possible word senses (Walker and Amsler, 1986, have used the LDOCE subject codes for this purpose) and to the problems involved in the classification process (see Schmolze and Lipkis, 1983).
(For certain text processing systems, such as retrieval systems, such an approach may not be acceptable.)

7.3 Analysis examples

Examples of definition analyses are now given. Oddities in the semantic structures are often due to peculiarities of the current analysis grammar and output format, and I would not wish to argue for their correctness, especially in view of the problems discussed later. (The analysis system retrieves definitions from a lispified version of the LDOCE type-setting tape; items preceded by an asterisk are Lisp atoms corresponding to font control characters present on the type-setting tape (see Alshawi et al., 1985; Alshawi et al., this volume).) Figure 7.3 shows the analysis of a noun definition; note, for example, the information conveyed by the phrase sometimes used in hedges: HEDGE is capitalised because it is not part of the restricted definition vocabulary (but is defined in terms of this vocabulary elsewhere).
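The capitalisation convention can be illustrated with a short sketch: any word of a definition that falls outside the restricted vocabulary is written in capitals, as HEDGE is in Figure 7.3. The vocabulary fragment below is hypothetical; the real list runs to roughly two thousand words.

```python
# Hypothetical fragment of the LDOCE restricted defining vocabulary;
# the full list is an appendix to Procter (1978).
RESTRICTED = {"a", "type", "of", "small", "tree", "with", "hard",
              "wood", "sometimes", "used", "in"}

def mark_non_vocabulary(definition):
    """Capitalise any definition word outside the restricted vocabulary,
    mirroring the HEDGE convention in the analyser's output."""
    return " ".join(w if w in RESTRICTED else w.upper()
                    for w in definition.split())

print(mark_non_vocabulary(
    "a type of small tree with hard wood sometimes used in hedges"))
# a type of small tree with hard wood sometimes used in HEDGES
```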
(hornbeam)
(a type of small tree with hard wood, sometimes used in *CA HEDGE *CB s)

Figure 7.3

Verb sense definitions are, in general, infinitive verb phrases with adverbials (often prepositional phrases) and additional restrictions on the semantic class of agents and objects. Figures 7.4 to 7.6 give some examples.

((CLASS SEND)
 (OBJECT ((CLASS INSTRUMENT) (OTHER-CLASSES WEAPON)
          (PROPERTIES ((MODIFIER MODERN)))))
 (ADVERBIAL ((CASE INTO) (FILLER ((CLASS SKY))))))

Figure 7.4

(club)
(to beat or strike with a heavy stick (*CA CLUB *CB))

Figure 7.5

(bushy)
((of hair) growing thickly: a bushy beard / tail)

((CLASS PROPERTY)
 (PREDICATION ((CLASS GROW) (MANNER THICKLY)))
 (RESTRICTED-TO ((CLASS HAIR))))

(mug)
(undomesticated)

(overland)
(across land and not by sea or by air)

((MANNER ((CASE ACROSS) (FILLER ((CLASS LAND))))))

Figure 7.9

LDOCE definitions for lexicalised compound nouns and phrasal verbs are handled in exactly the same way as noun and verb definitions. Two examples of the structures generated for such definitions are given in Figures 7.10 and 7.11.

(roller coaster)
(a kind of small railway with sharp slopes and curves, popular in amusement parks)

((CLASS RAILWAY) (COLLECTIVE KIND)
 (PROPERTIES ((SMALL))))

Figure 7.10

(bring out)

Figure 7.11

7.4 Some problems

The current implementation is able to locate the heads, and the correct semantic cases, of dictionary definitions in most cases, although the examples above are untypical in the amount of additional information they recover from the definitions. Some quantitative remarks about the performance of the system are given later. This section briefly discusses a number of problems that were encountered while testing the implemented system.

In some respects the information conveyed by the output structures, being too closely tied to the surface definitions, only provides constraints for further semantic analysis. Perhaps the most important case of this is that the relationships implicit in compound nouns and certain prepositional phrase adverbials cannot, in general, be made more explicit without further interpretation apparatus (see, for example, Alshawi, 1987) beyond that available to the definition analyser. The phrasal context can, however, sometimes allow further specification of relationships implicit in prepositions, for instance derivation of PURPOSE from for in cases exemplified by the noun sense of launch (although, of course, errors can result from attempting to make relationships more explicit in this way). The actual words appearing in the semantic structures are, on the other hand, further disambiguated than might be assumed given the high degree of polysemy of many of the words in the restricted vocabulary. This is because the analysis process identifies the syntactic category of these words, and because of the LDOCE rule that only the most central senses of words from the restricted vocabulary should appear in definitions (but see the remarks below on phrasal verbs).

The fact that definition texts are often not analysed completely
means that information present in a definition is sometimes not recovered, as illustrated by the following example. In this case the usual purpose of nails is not recovered.

(nail)
(a thin piece of metal with a point at one end and a flat head at the other, for hammering into a piece of wood, usu. to fasten the wood to something else)

A problem related to the one just mentioned is that only the simplest forms of definition are handled. Difficulties are also caused by phrasal verbs formed from words of the defining vocabulary: the occurrence of look after and bring up complicates the analysis of the following sense definition for foster.

A definition may also refer to another sense of the same homograph word, for example the sense defined immediately before it. In such cases the system produces a structure containing the special symbol *previous-sense*; an example is the definition shown in Figure 7.14.

(the wood of this tree)

Figure 7.14

7.5 Phrasal analysis hierarchies

The analysis method depends on a hierarchy of phrasal analysis patterns in which more specific patterns appear below more general ones. The way in which the analyser applies the hierarchy of patterns is as follows. Starting at the top of the hierarchy, a pattern is matched against the input definition. If the match with this pattern succeeds,
then a match is attempted with each of its daughter patterns (i.e. the more specific forms of this pattern placed immediately below it in the hierarchy). This procedure is repeated recursively, so that we end up with the most specific patterns that match the definition. Analysis rules have the form shown in Figure 7.15:

(<rule identifier> (phrasal pattern) (daughter identifier) ...)

Figure 7.15

The rule identifier also appears in semantic structure building rules. These two types of rule are kept separate in order to allow different kinds of output structures to be generated for the same analysis grammar. Building semantic structures is basically a simple process of fleshing out templates provided by the semantic structure building rules, using variable bindings generated by the matching algorithm. The following section gives some examples of analysis and structure building rules, explaining the notation in which the phrasal patterns are written. The notation currently provides a limited number of facilities, but it should be clear that these facilities could be extended in various ways while remaining within the overall framework of applying a hierarchy of patterns.

7.6 Analysis rules

Here n-110 is a daughter of n-100, and n-135 is a daughter of n-130 (not shown). More mnemonic identifiers for these rules might be Noun-Phrase, NP, and NP-with-relative. In the phrasal pattern part of these rules, the initial n restricts the pattern to matching definitions for senses with lexical category noun. The other pattern elements match zero or more items in the input, subject to restrictions in terms of lexical features. Numbers at the end of pattern elements simply distinguish different occurrences of elements with the same properties. Examples of pattern elements and what they can match are given in Figure 7.17.

    +noun     exactly one noun
    +0det     zero or one determiner
    +1noun    one or more nouns
    *0adj     zero or more adjectives
    *0pp      an arbitrary segment of input words
    for       the word for

Figure 7.17
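A minimal sketch of pattern matching in the spirit of Figure 7.17 follows. The element encoding, the matcher and the example tokens are all hypothetical simplifications of the analyser described above: each element is tried longest-match-first with backtracking, and the bindings it returns are the raw material for structure building.

```python
# Hypothetical element forms: ("+", cat) exactly one item of category cat;
# ("+0", cat) zero or one; ("+1", cat) one or more; ("*0", cat) zero or
# more; ("word", w) the literal word w.
def match(pattern, tokens, bindings=None):
    """Return a list of (element, matched tokens) pairs, or None."""
    if bindings is None:
        bindings = []
    if not pattern:
        return bindings if not tokens else None
    (kind, val), rest = pattern[0], pattern[1:]
    if kind == "word":
        if tokens and tokens[0][0] == val:
            return match(rest, tokens[1:],
                         bindings + [((kind, val), tokens[:1])])
        return None
    lo, hi = {"+": (1, 1), "+0": (0, 1), "+1": (1, None), "*0": (0, None)}[kind]
    limit = len(tokens) if hi is None else min(hi, len(tokens))
    for n in range(limit, lo - 1, -1):          # longest match first
        span = tokens[:n]
        if all(cat == val for _, cat in span):
            result = match(rest, tokens[n:],
                           bindings + [((kind, val), span)])
            if result is not None:
                return result
    return None

# A made-up definition fragment, tagged as (word, category) pairs:
tokens = [("a", "det"), ("small", "adj"), ("railway", "noun"),
          ("with", "prep"), ("sharp", "adj"), ("slopes", "noun")]
np_pattern = [("+0", "det"), ("*0", "adj"), ("+1", "noun"),
              ("word", "with"), ("*0", "adj"), ("+1", "noun")]
for element, span in match(np_pattern, tokens):
    print(element, [w for w, _ in span])   # each element with its words
```

The bindings printed at the end are what a structure building rule would use to flesh out its template.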
(predication (object-of (class deceive)))

The last element is an example of an element with subsidiary patterns; it is matched using a rule associated with the subsidiary pattern, as mentioned earlier. One subsidiary pattern of *0ppmod, for example, is (for *0pp), and a subsidiary pattern passive-pred, (be +ven), is used for passive verb constructions. Pattern elements with such subsidiary patterns allow for recursion, in much the same way as conventional context free phrase structure grammars, and for a more compact set of patterns. Under this interpretation of pattern elements, the phrasal patterns of rules n-110 and n-135 will match the set of definitions exemplified earlier, the more specific of the two being tried only if the more general patterns above it succeed.

The semantic structure given earlier for the definition (a foolish person) was produced by fleshing out the matching rule's template, leaving out substructures associated with uninstantiated pattern elements, and applying this structure building process recursively.

Figure 7.18

There is also an optional further stage which applies transformations specified as attached procedures associated with items in the structure building rules, for example the item predication. This phase gives greater freedom than would be possible by the use of structure building templates alone; for example, it allows moving items (such as those indicating negation) upwards out of substructures.

The analysis algorithm follows all paths from successful matches. This does not lead to inefficiency, because it is rare for several disjoint paths to be followed successfully down the pattern hierarchy, and because the implementation maintains a well-formed substring table to avoid a certain amount of redundant computation. Alternative analyses arise when there is more than one most specific successful analysis rule. At present one such analysis is chosen by an over-simplistic heuristic that basically prefers analyses accounting for more words of the input definition.

7.7 Performance remarks

Some definitions involve complex cross references, so these are not taken into account in the figures given below. The results of the test were as follows. The semantic head was identified correctly in most cases, although often little further structure was recovered. Thus only identifying the head is much more typical than might be suggested by the examples given earlier for illustrative purposes, and it is worth
mentioning that the development effort was only a few man-months for each of the program and grammar, which, compared with other natural language processing systems we have developed recently at Cambridge, represents a relatively small effort.

7.8 Conclusion

The work carried out so far seems to suggest that dictionary definitions can be analysed with a reasonable degree of success using hierarchies of phrasal patterns, but it still remains to be demonstrated that this technique can enable an actual natural language application system to cope effectively with unknown words. Although dictionary definitions exhibit a rich variety of forms, these are mostly variations on a manageably small number of basic forms,

...improved earlier drafts of this paper.
Chapter 8: Meaning and structure in dictionary definitions

8.1 Introduction

The basic methodology for the project is as follows: first, an appropriate grammatical coding is applied to the words of the restricted vocabulary and their inflected forms. This coding is then automatically inserted in all of the meaning descriptions, the outcome being a grammatically-coded corpus of meaning descriptions. Subsequently, a syntactic typology is developed for the structures of the meaning descriptions of each of the major parts of speech (POS): nouns, verbs,
8 8.2 Preparing meaning descriptions for examination
definitions Chapter
and structure in dictionary
Meaning
the book section
Apply in 1.2.1 for brief description).
see a
for each of them.
and adjectives, resulting
in parser-grammars
Applying these grammars to the corpus should then lead to syntactically analysed meaning descriptions in which it is possible to systematically identify premodifiers, kernels, postmodifiers, etc. In addition (and partly parallel to the syntactic typology) a semantic typology is being developed. Incorporating both these typologies into a relational database system should ultimately make it possible to trace the horizontal and vertical links between words: showing what the hyponyms or hyperonyms of a given word are, what kind of properties are involved in their pre- or post-modification structures, etc. The expectation is that a database of this kind can be very useful in (semi-)automatically selecting, for the words of a text, only those meanings that are relevant within a particular context. Imagine we are reading a report on a surgical operation; in one of the first sentences we are told that 'the instruments were ...'. Which meanings do we have in mind? A database of the kind envisaged here could restrict attention to meanings from the relevant (here: medical) subject field.

In this chapter we report on the tagging of the meaning descriptions with POS-codes (section 8.2), on a typology of the meaning descriptions of nouns (section 8.3), and on their semantic (section 8.4) and syntactic (section 8.5) characteristics.

8.2 Preparing meaning descriptions for examination

In order to get a complete picture of the semantic content of LDOCE we have created files containing, for each MD, the following information:

- the orthographic form of the entry;
- the part(s) of speech of the entry (on tape, not in the book; see book section 1.2.1 in chapter 1 for a brief description);
- the subject field codes, specifying the sociolinguistic register and the domain to which a sense is restricted (on tape, not in the book; see book section 1.2.1 in chapter 1 for a brief description);
- the meaning descriptions (also see Appendix E).

These are input to a pattern matching program developed at the institute for computational linguistics of the University of Amsterdam. Using it we can look for any sequence of words in the meaning descriptions, but also for POS-codes, subject field codes, etc., or combinations of these.

As stated by Paul Procter in the General Introduction, the meaning descriptions are formulated in terms of a controlled vocabulary (CV) of easily understood words and their derivatives. We counted approximately 2200 words in the list in the back of the book. Since this seemed like a manageable number, we felt that the CV provided a unique opportunity to pre-edit the corpus of meaning definitions relatively cheaply: we manually tagged the CV words with their possible part(s) of speech and automatically inserted the codes into the MDs. A morphological analyzer was used to treat inflected and derived forms.
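A minimal sketch of what such a pattern matcher does follows. This is not the Amsterdam program itself (that program is described in van der Steen, 1982, cited in the notes); the token and pattern representations are assumptions made for illustration:

```python
# Illustrative pattern matcher over coded meaning descriptions:
# a pattern is a sequence of tests, each of which may constrain
# the word, its POS code, or both.

def matches(pattern, tokens):
    """tokens: list of (word, pos) pairs; pattern: list of dicts
    with optional 'word' and 'pos' keys. Match at the start of
    the token sequence."""
    if len(pattern) > len(tokens):
        return False
    return all(
        tok[0] == p.get("word", tok[0]) and tok[1] == p.get("pos", tok[1])
        for p, tok in zip(pattern, tokens)
    )

md = [("a", "Det"), ("doctor", "N"), ("who", "Pron"), ("gives", "V")]
# Look for MDs beginning Det + Noun + the relative pronoun "who":
pattern = [{"pos": "Det"}, {"pos": "N"}, {"word": "who"}]
found = matches(pattern, md)
```

The same mechanism extends naturally to searching over subject field codes by adding a third element to each token tuple.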
'
codes. Al thou g h indenite article and letter. In our CV list its POS specication reects
forms etc.), w hich recei 'v ed appropriate
P at ticiple '
i
still
'
was
more
caused by several reasons: ambiguous. Tagging them with their POS listed in LDOCE, therefore,
meaning descriptions). is very effective. The second step in tagging reduced the number of
CV: uncoded tokens from over 34 000 to slightly less than 13000
i. not listed in the
which are
(among
the nationality
.
n names).
bad-tempered;neverending
.
-
n
descriptions of the two examples
unabl e emerging previously given look like the following:
'
' .
patience
'
,
derivatives: ,
_
Words marked references: as cross Entry Meaning Description
,,
tinderbox
.
.
anaesthetic; .
anaesthetist
amoj
n. u -
a
u.
doctormg]Wh0[p0l givesiv
ASia , African ammo]anaestheticimmmxx]
other names
bylllanotherlpopo]doctorlNOI"
- I
'
(L
1!.
cricket 11.
similar; however; everybody ,
others- Specialist 8.[Dg]
doctorlNo]whom givesle]
containing special symbols treatmentho] inUU]alDUI particuIarle waleOJ
ii. strings Grim] tong-m]certainleDg}kindsle] [10} 0
Because the words marked as cross references were assumed to be synonymous with their targets, it turned out that many of them would in effect be ordinary entries. Having available the POS of any word listed as an entry, we even considered the possibility of recoding all the words in the MDs this way. However, for our first step we decided to work with the CV-list instead: looking at the tagging results, the number of different items in the whole CV proved sufficient for our purpose.

The next step involved making an extensive inventory of sequences of words with particular POS-tags and the expression of regularities in these sequences in the form of grammar rules. Simultaneously, the different structural properties were evaluated extensively with regard to the semantic implications they embody.

8.3 Nominal meaning descriptions: a syntactic-semantic typology

In what follows we will mainly be concerned with the Meaning Descriptions (MDs) of nouns.
The patterns can be considered on the following levels:

a. word sequence:  a man who looks ...
b. POS sequence:   Det Noun RelPronoun Verb ...

i. Entry     Meaning Description
   flamingo  a tall tropical water bird ...   (kernel: bird)

We are only interested in distinguishing types of these patterns: a typology of the structures of nominal MDs. Much more of the semantic weight that the phrase structure expresses is embodied by the kernel, as will be revealed below. There are, however, also other types of kernel that are rather meaningless in comparison with the rest of the MD.

8.4 Semantic characteristics of nominal meaning descriptions

From a semantic point of view it soon becomes apparent that the distribution of semantically relevant information often cuts across the various syntactic structures, as in 'a storied building' or 'the front part of the body below the chest'.
A frequent type of structure is the of-complement with a noun or an NP. Where the noun of the complement refers directly to an entity, the structure is straightforward; but the nominalisation may also refer to a phrase with a non-nominal structure: a VP. MDs with such a structure will be called Shunters. Their kernels do not refer to entities in the world to which the entry itself belongs; the information is expressed by the verbal or adjectival kernel of the underlying VP:

Entry          Meaning Description
adornment      an act which consists of adorning
closing price  the price of the shares of a firm, business, etc. ... on the stock exchange
detonation     the noise of ...

It is not possible, however, to paraphrase these MDs the way Linkers are paraphrased; rather we could paraphrase them with 'consists of'.

Something like shunting occurs also in structures that consist of a very general, uninformative kernel immediately followed by a relative clause:

vi. Entry      Meaning Description
   camper      a person who camps
   cultivator  a person who cultivates
   destroyer   a person who destroys

These MDs are formally the same as the MDs containing Linkers. The difference is that the kernels contribute as much information as ordinary Links: one can hardly say that their major function is only to relate the complement to the entry. If we speak in terms of semantic weight, the kernel and the complement are more or less in balance. Therefore, we treat these kernels as ordinary Links, with this addition: equal weight should be given to the contribution of the of-complement. Another type of structure in which of-complementation is involved can be found in the following examples:
vii. Entry         Meaning Description
    acting         the art of representing a character, esp. on the stage or for a film
    angling        the sport of catching fish with a hook and line
    arrival        the act of coming
    beauty queen   the winner of a beauty competition
    admixture      a substance that is added to another in a mixture
    anthropophagy  the eating of human flesh by other human beings
    aspiration     the pronunciation of the letter h
    leader         a person who guides or directs a group, movement, etc.

Such an Argument may in turn have the role of Experiencer or Theme (in the terminology of Dik, 1978a) of the relation expressed. Because of the fairly proportional semantic content of the MDs in examples (vii.a) and (vii.b), the distinction between Links, Linkers and Shunters for MDs with an of-complement or a relative clause (i.e. the decision whether some item is sufficiently general or vague in meaning to function as Linker or Shunter) is not always easy to draw.
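A rough sketch of how such a typology might be applied mechanically is given below. The word lists and the one-word-MD test are illustrative assumptions, not the authors' actual decision procedure:

```python
# Classify a nominal MD as Link / Linker / Shunter / Synonym from
# its kernel. The membership lists are toy stand-ins for the
# generality and derivational criteria discussed in the text.

LINKERS = {"kind", "type", "piece", "part"}      # very general kernels
SHUNTERS = {"act", "state", "quality", "noise"}  # deverbal/deadjectival

def classify(md_tokens, kernel):
    if len(md_tokens) == 1:
        return "Synonym"          # one-word MD, likely a synonym
    if kernel in SHUNTERS:
        return "Shunter"
    if kernel in LINKERS:
        return "Linker"
    return "Link"                 # ordinary hyperonym kernel

kinds = [
    classify(["the", "act", "of", "annoying"], "act"),             # annoyance
    classify(["a", "tall", "tropical", "water", "bird"], "bird"),  # flamingo
    classify(["patience"], "patience"),                            # one-word MD
]
```

Run on the chapter's examples, annoyance comes out as a Shunter, flamingo as an ordinary Link, and a one-word MD as a Synonym.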
Besides these considerations there are also other, more concrete and verifiable indications. Inspection of large samples of MDs reveals that certain words function as kernels far more often than others. The following diagrams show the frequency of occurrence of the various types:

[Diagram: frequency of use (low / med / high) of the kernel types: Synonyms (ii), (iv), (vii.b); Linkers (iii); Shunters (v), (vi); with '+' and '-' marking the typical frequency band of each type.]

[Table: frequency of common kernel words (act, state, sport, ceremony, part, piece, type, noise, lungs, dog, bird, drink, person, something, substance) as token, as kernel in NMDs, followed by 'of', and followed by a relative pronoun; e.g. act occurs 1505 times and person 3523 times as kernel.]

8.5 Syntactic characteristics of nominal meaning descriptions

As one might expect, the language used in LDOCE (and in dictionaries in general, we assume) turns out to be a restricted subset of English in terms of the structures used. On the one hand, the MDs contain little variety of structure; on the other hand, there is a predominance of certain specific structures.

8.5.1 Disturbing elements

At least two features of the MDs are unique: the use of all kinds of meta symbols (words like old use, rare, fml), sometimes in brackets, and the frequent insertion of bracketed parts at any position in the MD. As for the meta symbols, we found that they will not hinder us: all types of kernels and their features can still be found. Some entries that can be both noun and adjective had in fact two different MDs rolled into one by the use of brackets.
Bracketed parts also occur within the pre-kernel part, and in that case they interrupt the structure of the NP. MDs containing bracketed parts in initial or pre-kernel position will therefore be analysed separately.

Yet another special case are the idioms (we coded them EXPR, for expression). 4281 of a total of 6501 strings tagged EXPR occur in initial position, like this one:

Entry  Meaning Description
abode  of/with no fixed abode (EXPR) (META: law) having no place as a regular home

These special elements do not immediately discriminate between Links, Linkers, Shunters and Synonyms. To detect the kernel we necessarily have to analyse all of the text that precedes the kernel and the first word that marks the end of the kernel. Therefore we have developed a partial grammar of NMDs that analyses them into a Determiner component, a Premodifier component, the Kernel, and the first word that marks the beginning of the Postmodifier component.
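The partial grammar just described can be sketched as a simple splitter. Treating the last word before the first postmodifier marker as the Kernel is an assumption made for illustration, not the authors' exact rule:

```python
# Split a POS-tagged NMD into Determiner, Premodifier, Kernel and
# Postmodifier components. The kernel is taken to be the last
# token before the first postmodifier marker ("of", relative
# pronouns), which marks the beginning of the Postmodifier.

POSTMOD_MARKERS = {"of", "who", "which", "that"}

def parse_nmd(tokens):
    """tokens: list of (word, pos) pairs; returns the components."""
    det, rest = [], list(tokens)
    if rest and rest[0][1] == "Det":
        det = [rest.pop(0)]
    cut = next((i for i, t in enumerate(rest) if t[0] in POSTMOD_MARKERS),
               len(rest))
    pre_kernel, postmod = rest[:cut], rest[cut:]
    kernel = [pre_kernel.pop()] if pre_kernel else []
    return {"det": det, "premod": pre_kernel,
            "kernel": kernel, "postmod": postmod}

parts = parse_nmd([("a", "Det"), ("tall", "Adj"), ("tropical", "Adj"),
                   ("water", "N"), ("bird", "N")])
```

On the flamingo example this yields Det = a, Premodifier = tall tropical water, Kernel = bird, and an empty Postmodifier.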
This MD contains a kernel ('body' or 'body parts') which is composed of two coordinated kernels, of which the second ('body parts') is a compound, followed by an of-complement containing an NP with a kernel composed of two coordinated kernels ('person' or 'animal').

Exact figures cannot be given before we have analysed all MDs. Therefore, we can only give figures for the MDs with structures that can be found by the pattern matching program. For the total number of 41 034 MDs we get the following distribution:

Structure                    Number   Percentage of total
Det (premod) Kernel (postmod)  23 810   58%
(premod) Kernel (postmod)       5 006   12%

[Table: types of determiner found, with Total 23 810.]

Since only a very limited number of MDs (489) fall outside this part of the grammar, or also have codes for other parts of speech, we decided to skip these MDs for the time being: the grammar has a determiner component that only deals with a, an, any of, or the. Skipped MDs will be analysed afterwards. The MDs containing any of often have a special structure, 'any of several types of', which very frequently occurs in MDs of animals, for example:

Entry   Meaning Description
gannet  any of several types of ...

We present some figures that give a slight indication of the distribution of the finer types of structure which came up during that work. For that purpose we will only look at the first two groups of MDs: Det (premod) Kernel, and (premod) Kernel. It is not always possible to decide whether the kernel was preceded by a modifier or not, since words following the determiner can be noun or adjective, for example:

Entry  Meaning Description
drink  a liquid suitable for swallowing
Although many of these MDs appear to be pre-modified, the number of MDs without pre-modification is still twice as high. A finer breakdown gives:

Type of structure          Number   Percentage of total
Det 'pure' Noun            14 212
Det Noun relative clause    1 993   14%
Det Noun of-complement      5 864   41%

The of-complements account for a significant percentage, which suggests that the claim that dictionary definitions characteristically involve hyponym/hyperonym relationships should be reconsidered. Notice, finally, that the number given for compounds in the above diagram should not be taken too seriously at this stage: this number may well turn out to be larger after more detailed analysis of structures which at present appear to be ambiguous.

Meaning descriptions without determiner component

MDs without an initial determiner occur less frequently. Again we can make the same subdivision between MDs that have premodification and those that have not. Looking only at the first word of the MD, the following picture emerges:

First word                      Number   Percentage of total
Noun                             2 648   53%
Noun or Adjective (ambiguous)    1 338   27%

Many of the ambiguous MDs consist of one word only: 34%, or a total number of 455; many of them could therefore be Synonyms of the entry. If we look at the finer distribution of those MDs that unambiguously start with a noun (total number 2 597), we again see the influence of those Synonyms on the numbers found: in many cases the MD consisted of one word (which could function as a noun), and such cases were marked as cross references by the Longman lexicographers.

8.6 Conclusion

The figures presented in the previous section give only a rough indication as regards the distribution of the various types of NMDs in the dictionary. Nevertheless, we have been able to show that next to the ordinary structure of meaning descriptions (MDs) of nouns, in which the syntactic kernel is a regular hyperonym of the entry (Link), there are at least three other types of structures based on their kernels (Linkers, Shunters and Synonyms) that make up a large part of the dictionary. Although it may be difficult to explain these structural properties, for
MDs containing a Shunter a fairly plausible explanation can be given: they describe the meaning via the most informative element in the MD, which is somehow derived from the entry itself. Often the complement of the Shunter is the verb, noun or adjective from which the entry is derived:

Entry      Meaning Description
annoyance  the act of annoying
accuracy   the quality of being accurate
alchemist  a person who studied or practised alchemy

The large number of MDs having a Shunter as a kernel thus seems to point to a habit of dictionary makers to include many derivatives as ordinary entries, even though they have a systematic, predictable meaning.

Not all of the MDs can be handled straightforwardly in a parser: problems arise with idiosyncratic, non-transparent and ambiguous structures, though not all are in all circumstances equally problematic. A common problematic case has, in fact, already been mentioned: the words that can either be noun or adjective (for example, liquid). Ambiguity of structure mainly refers to the interpretation of coordination, which can occur on all levels of the MDs. The analysed MDs will be input to a semantic analysis that must deal with problems such as multiple senses of the words in the MD and circularity of MDs.

8.7 Notes

1 'Links in the lexicon' is a project supported by the Netherlands organization for pure research (Z.W.O.) for a period of three years from 1st February 1986, carried out at the English Department of the University of Amsterdam. Participating research assistants: Marianne den Broeder, ... van den Hurk and ... Vossen.

2 Information on the program can be found in van der Steen (1982).

3 All examples in this chapter are from the Longman Dictionary of Contemporary English (Procter, 1982).

4 Some of the words that could not be coded using the CV list, with their frequency in the meaning descriptions:

Word (initial word)  Related CV word(s)  Frequency in MDs
advantage            advance (?)          149
agreement            agree                233
aircraft             air and craft        239
argument             argue                173
This variation in the output is not problematic as long as we are dealing with large numbers (the differences in output never varied by more than 1 during our sessions with the program). More precise figures will be available when we have analysed all definitions.

Chapter 9
A tractable machine dictionary as a resource for computational semantics

9.1 Introduction
An important difference arises here with those logic-based approaches [...]. In recent years the focus of work in natural language processing (NLP) has shifted from describing language as a product to describing it as a process, under the influence of human memory research; but the shift is largely cosmetic, since the database approach is still applied [...] front ends. A recent paper, in spite of its title ('... Episodic Memory for the Integration of Syntax and Semantics'), is in fact very much a return to early CD concerns.

A further difference arises between so-called subsymbolic approaches within connectionism (Smolensky, 1987) and those usually called localist (for example, Cottrell and Small, 1983; Waltz and Pollack, 1985). This difference, which is as yet in no way settled, bears very much on the subject matter of this chapter: in the subsymbolic approach to computational semantics there would be no symbols corresponding to word senses; they would be simply different aspects of a single non-symbolic representation and would correspond (if to anything) to nothing discrete at all. That is not to deny that, in certain cases, [...]. Our aim in the present chapter is to produce a large-scale database for doing computational semantics, and to describe its utilisation (including Fass's work on Collative Semantics, described in some detail in section 9.3.2). The difference between these two notions of connectionism, as they apply to issues of word sense for computational semantics, is precisely the issue that is set out in the next section of the chapter as the key issue in computational semantics at the moment.

Given that the purpose of dictionaries is to provide definitions of words and their senses, it might well be expected that, of all forms of text, it would be in dictionaries that the semantic structure of language would be the most explicit and hence the most accessible for examination and comparison with the semantic structure of knowledge representations. And indeed, the semantic structure of dictionaries has been analysed and compared to the underlying organisation of knowledge, and similarities have been observed. Dictionary entries commonly contain a genus and differentia, and the genus terms of dictionary entries can be assembled into large hierarchies (Amsler, Byrd, and Heidorn, 1985). Likewise, in the study of knowledge representation, a semantic network is viewable as a hierarchy, and common principles seem to underlie both. Those principles guide those faced with the problem of knowledge acquisition: extracting information from language text and using that information to build new knowledge structures or update existing ones. As evidence for this convergence, it is notable that issues that must be addressed regarding computational semantics are also issues in knowledge acquisition and computational lexicography.

Can computational semantics be based on such a notion as the word sense? The answer given in this chapter will be yes. Many people have claimed that there never was lexical ambiguity until dictionaries were produced in roughly the form we have them now: on this view lexical ambiguity is a product of scholarship, a social product. Translation between languages, as well as more mundane understanding tasks, had been carried out for millennia before such scholarly products existed and therefore cannot require them. A certain kind of cognitive psychologist may find this position very much to their taste: for them the idea of lexical ambiguity, never found outside formal representations, is a peripheral phenomenon, with subscripted symbols such as play1, play2, etc., designating disjoint classes of things (as Wittgenstein [...]). The subscripting procedure itself offers no real problem. The real problem faces those of us who want to construct a database: different dictionaries distinguish different numbers of senses for a single word; worse, those different segmentations of usage into discrete senses may not even be inclusive of one another, for sometimes different dictionaries segment usage into quite different senses. Dictionaries could, in principle, be tuned to a given sense segmentation (though it might serve no practical purpose), just as people can be if exposed to the same dictionaries.
9.2 The extraction of semantic information from LDOCE

The preparers of LDOCE claim that the entries are defined using a controlled vocabulary of about 2000 words and a simple and regular syntax, and that the entries also use another special set of primitives organised into a hierarchy: the subject codes (which we will call pragmatic codes, to avoid confusion with the controlled vocabulary). The hierarchy consists of main headings such as ENGINEERING and subheadings like ELECTRICAL; these are used to classify word senses, so that, for example, one sense of current is marked ENGINEERING/ELECTRICAL while another sense carries a different code. We will refer to the words of the controlled vocabulary as primitives and to all other words in LDOCE as non-primitives.

Figure 9.1 shows some basic data derived from our analysis of the machine-readable tape of LDOCE (because of a tape error, words that follow zone alphabetically have not been analysed). The published list of the controlled vocabulary contains 2219 words. We have removed the 58 prefixes and suffixes that are listed as primitives, and have also removed 35 primitives that did not have heads. Furthermore, the analysis shows that some words are used frequently in definitions and yet are not part of the controlled vocabulary: for example, the word aircraft is not part of the controlled vocabulary, yet it is used 267 times in sense definitions. About thirty such words have been added to the list of primitives, giving 2166 primitives.

The interesting thing to note from Figure 9.1 is the extremely high number of phrasal senses: over 24 000 of the 74 000 senses defined in LDOCE are senses of phrases beginning with a word.

Having described LDOCE, we now outline three approaches to the extraction of semantic information from LDOCE, which are explained in more detail later in the section. These approaches are extensions of fairly well-established lines of research. The approach in section 9.2.1 is in the spirit of Sparck Jones's (1964) investigation of the semantic classification of the uses of words. In section 9.2.2, an attempt is made to develop an empirically motivated controlled vocabulary, in the spirit of Amsler's (1980) work on the role of defining vocabularies in dictionaries. Section 9.2.3 describes the construction of a large-scale parser for the extraction of genus and differentia terms, expanding upon other similar work (for example, Chodorow, Heidorn and Byrd, 1985; Alshawi, Boguraev, and Briscoe, 1985; Boguraev and Briscoe, [...]).
For NLP applications we assume that at least some dictionaries do contain at least some of the knowledge that such applications need; no philosophical position at all need be taken with regard to the basic questions presented in the introduction. A neutral attitude is taken to the status of word sense distinctions. The question here is to ask what the consequences are of too many or too few sense distinctions, and we come to see that the consequences depend on purpose. (Some more specific purposes might be paraphrasing, translation, or story-interpretation.) Regarding syntactic information, one approach simply assumes it (section 9.2.3), while another derives it (section 9.2.2).

The position taken with regard to the sufficiency of LDOCE for collecting semantic information about word senses is that all the information is in LDOCE and can be extracted, but we will need to look beyond the sense definition of a word to find the information about it.

9.2.1 Approach I: Obtaining co-occurrence statistics from LDOCE (Tony Plate)

We now describe a technique for extracting semantic information from text (specifically LDOCE) that does not require any semantic information to bootstrap it. Central to this technique is that all sentences in which a word is used are treated as information about the use of that word, rather than just the definition of the word. This technique has far fewer stages of bootstrapping than other techniques: information about all primitives is collected all at once, rather than by traversing a tree-like network of sense definitions. The bootstrapping is entirely internal (in contrast to the techniques of sections 9.2.2 and 9.2.3); the first phase, in which co-occurrence statistics are collected for words, can be seen as bootstrapping for later phases.

Among the reasons for interest in co-occurrence data:

4 Experimental findings in psychological experiments support measures based on frequency of co-occurrence as measures of the strength of the semantic relationship between words.

5 Co-occurrence data could be used by NLP systems to reduce (but not entirely resolve) lexical ambiguity in any text.

Co-occurrence data records the frequencies of co-occurrence of pairs of words within some textual unit. (This textual unit could be a phrase, a sentence, or a paragraph, but unless otherwise stated the textual unit referred to here is a sense definition.) The independent frequencies of occurrence of words in a textual unit are also important and are used in the measures defined below.
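Collecting co-occurrence data of this kind can be sketched as follows; the toy definitions stand in for LDOCE sense definitions, which are the textual units in the actual work:

```python
# For each pair of words, count the number of textual units
# (definitions) in which both occur; also keep each word's
# independent frequency of occurrence over units.

from collections import Counter
from itertools import combinations

definitions = [
    "a union by marriage",
    "a marriage ceremony",
    "a ceremony of union",
]

word_freq = Counter()
pair_freq = Counter()
for d in definitions:
    words = sorted(set(d.split()))   # one count per unit, pairs in canonical order
    word_freq.update(words)
    pair_freq.update(combinations(words, 2))

n_marriage_ceremony = pair_freq[("ceremony", "marriage")]
```

Sorting the word set before pairing guarantees that each unordered pair is counted under a single canonical key.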
The dictionary is a suitable text for this purpose, and the structure of a dictionary provides small, easily identifiable, coherent units of text over which co-occurrence statistics can be collected. In its raw form the amount of information must be reduced to make it comprehensible, and this must be done without eliminating large amounts of substructure (there are quite a few levels of structure visible in the data). What is needed is a way to cut from the co-occurrence data the most significant relationships; one useful measure is the conditional probability of co-occurrence. For example, the set of words that have a very close relationship to marriage was selected with the criterion

(9.1)   max(Pr(X | Y) - Pr(X), Pr(Y | X) - Pr(Y)) >= 0.03

with Y = marriage. The resulting sub-matrix can be given to the PATHFINDER program, which converts the completely connected network represented by the sub-matrix into a sparsely connected network. These networks provided the first indication that co-occurrence data contains much semantic information, and they have proved very interesting to examine. One such network is presented in Figure 9.2 (overleaf), which shows a network of words that have a very close relationship to marriage.

We also conducted some comparisons of the matrices with judgements of relatedness made by human subjects, who were asked to rate the relatedness of pairs of words. Several experiments were conducted, with similar results, one of which will be described briefly here.
semantics Chapter 9 9.2 The extraction of semantic information from LDOCE
LDOCE as a resource for computational
The correlation between the human judgements and the conditional probability of co-occurrence was 0.66. This is a high correlation figure and indicates that conditional probability of co-occurrence is strongly related to human judgements of semantic relatedness.

Since the measure captures semantic similarity, it can be used to find sets of words that are semantically related to a particular word. This is easily done by selecting the words for which the similarity of concept (based on equations involving conditional probability of co-occurrence) is greatest. We will discuss their application to the resolution of lexical ambiguity here.

It should be stated that we do not see statistics of co-occurrence as being able to solve all the problems of natural language understanding, or even just lexical disambiguation. Rather, we think that there is some useful semantic information that can be extracted from co-occurrence statistics, and we wish to see what applications such information can be put to, and to investigate how a subsystem using co-occurrence information could be built so that it would constitute a useful part of a natural language understanding system.
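A correlation figure like the 0.66 reported above can be computed with a standard Pearson coefficient. The rating data below is invented for illustration; it is not the data from the experiment.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: human relatedness ratings vs. conditional probabilities
human = [4.5, 1.0, 3.8, 2.2]
model = [0.30, 0.02, 0.22, 0.10]
r = pearson(human, model)
```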
Figure 9.2 Network of words related to marriage.

Discussion

Statistically based techniques have the potential to overcome several serious problems facing the practical application of other techniques. One of these problems is that most systems would require enormous numbers of handcrafted knowledge structures if they were to have any useful coverage. Another problem is psychological plausibility (and it is not clear at all how these systems are to be made more psychologically plausible). Systems based on statistical techniques, especially if they were implemented in a parallel framework, would seem potentially more psychologically plausible. A statistically based local connectionist system such as the one described here might eventually be implemented in a distributed connectionist model, and we …

The first, unique, advantage of LDOCE is that its vocabulary of definition is limited (to about 2000 words). … The second advantage … The last advantage, and the most important to the continuation of this work, is that a dictionary provides a better source of word senses than running text: it provides a starting point from which to work, and it also helps to meet the criticism of text-scanning connectionist techniques, that they only gather information about the most commonly used senses of words. The fact that there are sense definitions …

The work of Sparck Jones, referred to below as KSJ, arose from the need for efficient … Of twelve possible semantic relations, synonymy was chosen as the fundamental one, … a feature of natural language … of the kind found in a thesaurus. A thesaurus thus constructed serves as a word sense disambiguation … without requiring excessive computing resources, and takes less than an hour to build. Ways of transcending storage limitations are …
(3) The method of co-occurrence faces the opposite problem: the semantic relations it yields seem to be too vague and imprecise, rather than too narrow. This problem is expected to be … the sense definitions are used to provide information … The text need only use enough senses of the words it does use … not all words, but should make frequent use of … senses.

(4) KSJ's work was performed over twenty years ago; the computing resources we now have allow experiments to be done on real-world size vocabularies. Modern computing power also makes it possible that the techniques developed might be usable in real systems. Also, there are now vast quantities of text available in machine-readable form: dictionaries, encyclopaedias and much less constrained text. All this provides good reason for expanding on work done in the 60s. Techniques developed then were sometimes left aside because of lack of resources, and now it is possible to further develop and use those techniques.

Four aspects of building a Machine Tractable Dictionary (MTD) from LDOCE are discussed next; a flow chart of the construction process is also presented later.

First, … word senses defined at an earlier cycle are used to define those of a later cycle. In this way, the size of the KDV expands with each cycle until, after three cycles, all 2219 words from the LDOCE controlled vocabulary are accounted for. The remainder of the vocabulary is expected to be defined in the next defining cycle … automatically by some bootstrapping process. The knowledge structures used in this particular study are called integrated semantic units (ISUs); a discussion of the ISUs is given in the next section. Though the preliminary study reported here uses a KDV of around 1200 words, the number can probably be reduced to about 1000.

Second, the use of defining cycles helps to identify vacuous circular definitions. Circular definitions that use circles of just two words pose special problems for building an MTD from an MRD. For example, in LDOCE a trip is defined as a journey, and a journey as a trip. An MTD built from an MRD should be free of such circular definitions. One way to overcome them is to try to include just one of the words involved as a KDV word, but not the other. The word selected for the KDV will be the one whose first three senses fulfil the criteria of a defining cycle given earlier.

Thirdly, when constructing an MTD, use of the defining cycles ensures that all definitions of words and their senses are built containing only words that already have definitions. In the case of LDOCE, use of the defining cycles sorts out words in the 2219 LDOCE controlled vocabulary whose definitions include words outside that vocabulary; this has proved to be not uncommon in LDOCE definitions.
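Both the two-word circles (the trip/journey case) and the cycle-by-cycle expansion of the covered vocabulary can be sketched mechanically. The toy genus mapping, definitions and seed vocabulary below are invented for illustration, not the real LDOCE data.

```python
def two_word_cycles(genus):
    """Find pairs of words that define each other through their genus
    terms, like LDOCE's trip -> journey and journey -> trip."""
    cycles = set()
    for word, g in genus.items():
        if word != g and genus.get(g) == word:
            cycles.add(frozenset((word, g)))
    return cycles

def defining_cycles(definitions, seed):
    """Assign each word the cycle at which it becomes definable using
    only already-covered words (seed words count as cycle 0)."""
    covered = dict.fromkeys(seed, 0)
    cycle = 0
    while True:
        cycle += 1
        addable = [w for w, defn in definitions.items()
                   if w not in covered and all(x in covered for x in defn)]
        if not addable:
            return covered
        for w in addable:
            covered[w] = cycle

# Invented toy data
genus = {"trip": "journey", "journey": "trip", "ammeter": "instrument"}
circles = two_word_cycles(genus)

definitions = {"journey": ["move", "place"],
               "trip": ["journey"],
               "holiday": ["trip", "rest"]}
covered = defining_cycles(definitions, seed=["move", "place", "rest"])
```

The fixed-point loop terminates as soon as no new word can be defined, which is exactly where circular definitions outside the covered set would show up.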
Fourthly, in building an MTD, the main senses of these empirically-found KDV words are taken as the semantic primitives of the MTD.

The KDV is obtained through the formation and expansion of an initial hunch set. The initial hunch set of our 1200 KDV words is derived from the use of two criteria, which are word frequency and the conceptual simplicity of the main sense(s) of a word. Regarding the use of word frequency, … the words of the hunch set. This explains, in part, why only about half of the words of the KDV appear among the 1200 most frequently used words. Using the hunch set, we have shown that the 2219-word controlled vocabulary in LDOCE can be described in terms of a smaller, more suitable controlled vocabulary, i.e., a KDV. We plan to develop a new, smaller hunch set formed by a more motivated application of the criteria.

The proximity data is first processed by the PATHFINDER program, then by a cluster analysis program. The output of this process is a set of clusters of related word senses. Each cluster represents a group of candidate word senses to be included in the specification of the linguistic and world knowledge … in the definitions of word senses. An ISU is an enriched … possible candidates … The general principle is for the word senses defined at an earlier defining cycle to be processed before those defined at a later cycle, and for single words to be processed before phrasal words, no matter whether they are phrasal verbs, phrasal prepositions, or any other phrases.

The bootstrapping process is shown schematically in Figure 9.3. There are four units in the diagram: LDOCE; LAAL, the Language Analyzer And Learner; the ISU Database; and … (word sense numbers are tagged in the definition texts). As the definitions are processed by LAAL, more and more ISUs are produced. The size of the ISU database thus grows until eventually all the definitions are processed and we obtain an MTD in the form of a Prolog database of ISUs.

Figure 9.3 Flowchart of the bootstrapping process.

… knowledge in the representation of each word sense. The general form of an ISU is as follows:

isu(Wordsense,
    belong(Superordinate),
    …)

An ISU integrates the linguistic and world knowledge associated with Wordsense. Although the general form of the ISU is uniform across all four open-class categories of English words, the actual specification … of their respective superordinates and the associated world knowledge. Initial efforts at hand-crafting a small set of ISUs, and at adapting Huang's XTRA (1984, 1985) parser to the use of ISUs, have been successful.

Word sense hierarchies and learning

As can be seen from our review of MRD-related research, much work has been done on determining semantic hierarchies of genus terms in dictionary definitions. However, no semantic hierarchies of the defining senses … demand'. Thus bank has at least two superordinate terms: land and place. However, in a semantic hierarchy of word senses, a tangled hierarchy is an exception if not totally non-existent. In this case, bank1 … How hierarchies are formed in LDOCE, and how suitable they are for computational purposes, is open to empirical study. An ideal system of semantic hierarchies facilitates disambiguation and generalisation processes and is indispensable to language analysis and learning. The present effort to establish semantic hierarchies of the defining senses of the KDV used in LDOCE shares the same assumption as Karen Sparck Jones's (1967) study on dictionary circles, i.e., that useful semantic information can be obtained from analyses of dictionary definitions for purposes of text analysis in general.
Sparck Jones aimed …

The determination of the KDV of our vocabulary, together with an algorithm which catches missing defining senses of the KDV and which is available to LAAL, approximately solves the problem of extracting a minimum set of semantic primitives from a monolingual dictionary, a problem currently believed to be computationally intractable (Dailey, 1986).

The bootstrapping process requires the LAAL system to be able to acquire useful world knowledge information from information available in the dictionary. An example of useful world knowledge information is preferred case fillers such as the agent, the patient, the location, and so forth. Both definition texts and example sentences are sources of such information. Preferred agents and patients are often given in parentheses in word sense definitions. Although a limited number of example sentences are given with the definition of a particular sense, there are, usually, more examples containing the word in that particular sense elsewhere throughout the dictionary. To construct ISUs for the word senses of a word, tentative versions of the ISUs are obtained from analysing the word sense definitions and the examples that go with them. Box codes are often helpful in setting up the initial specifications of an ISU. Example sentences available outside the definition texts of a particular word provide interesting material for the learning process. Learning from these examples often results in the confirmation or modification of the specifications of tentative versions of the ISUs. LAAL analyses these examples and produces parse trees. Fillers of the same category across all examples, for example all identified agents, are grouped together in a list. Generalisations along the semantic hierarchies of word senses are made on these categorised lists of word senses.

This knowledge acquisition process is especially useful when the box codes given in LDOCE for several closely-related word senses are identical. For example, LDOCE recognises three senses for the word marry: marry1 means 'to take (a person) in marriage'; marry2 means 'to perform the ceremony of marriage for (two people)'; and marry3 means 'to cause to take in marriage'. For marry3 we do not find any information in parentheses, nor do we have any information on preferred agents; neither are there more examples besides the one given in the definition. The only example available reads: She wants her daughter to marry a rich man. LAAL infers from the patient of the sentence (her daughter) that the pronoun she stands for parent1, which means the father or mother of a person. Thus the preferred agent of marry3 is parent1. Representing preferred case role fillers in terms of word senses has a great advantage over a system of semantic features as used by LDOCE, in that it results in finer semantic distinctions between word senses. An important guideline of this learning process is cautiously guarding against overgeneralisations and overspecifications. Generalisations are made only on the basis of ample evidence, and in the absence of counter evidence.
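The generalisation step just described, grouping the observed agent fillers and climbing the sense hierarchy to the most specific sense that covers them all (as in the inference of parent1 for marry3), might be sketched like this. The hierarchy, the function name and the observed fillers are illustrative assumptions, not LAAL's actual procedure.

```python
def generalise(fillers, parent):
    """Most specific sense covering all observed fillers, found by
    walking up a word-sense hierarchy (child sense -> parent sense)."""
    def ancestors(sense):
        chain = [sense]
        while sense in parent:
            sense = parent[sense]
            chain.append(sense)
        return chain
    common = set(ancestors(fillers[0]))
    for f in fillers[1:]:
        common &= set(ancestors(f))
    for sense in ancestors(fillers[0]):  # nearest common ancestor first
        if sense in common:
            return sense
    return None

# Invented hierarchy: mother1 / father1 -> parent1 -> person1
parent = {"mother1": "parent1", "father1": "parent1", "parent1": "person1"}
preferred_agent = generalise(["mother1", "father1"], parent)
```

Requiring several fillers before generalising is one way of encoding the "ample evidence" guideline mentioned above.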
9.2.3 Approach III: A lexicon-producer (Brian M. Slator)

In this section and the next one a lexicon-producer/consumer system is outlined. The system is composed of two distinct and separable parts (a lexicon-producer and a lexicon-consumer), each based on different principles and, of course, designed for different purposes. The lexicon-producer is well along in its development, using LDOCE. The lexicon-consumer is in the planning stages, waiting for a clear picture to develop of the lexicon it will have to work with. Figure 9.4 shows an overview of the system.
The first phase of frame construction uses LDOCE's precise grammatical codes to distinguish among the senses of words, but with only general semantic and pragmatic information, such as is easy to extract from the dictionary. However, when the needs of the knowledge-based parser (the lexicon-consumer, operating over non-dictionary text) increase beyond this initial representation (as is the case whenever, say, resolving lexical ambiguity or making non-trivial attachment decisions), the frame representations are enriched by appeal to parse trees constructed from the dictionary entries of the relevant word senses. That is, the text of the definition entry itself is analysed, to extract genus and differentia terms (Slator and Wilks, 1987). This additional information further enriches the semantic structures.

LDOCE definition 'clausoids' (for lack of a better word) are typically one or more complex phrases composed of zero or more prepositional phrases, noun phrases, and/or relative clauses. The syntax of the definition entries is relatively uniform, and developing a grammar for the bulk of LDOCE has not proven to be an intractable problem. Chart parsing was selected for this system because of its utility as a grammar testing and development tool.

The chart parser accepts LDOCE definitions as Lisp lists and produces phrase-structure trees. This parser is driven by a context-free grammar of 100+ rules and has a lexicon composed of the 2219 words in the LDOCE controlled vocabulary. The parser is left-corner and bottom-up, with top-down filtering and early constituent tests (taken from Slocum, 1985).
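The left-corner parser itself is not reproduced here, but the idea of parsing definition phrases with a small context-free grammar can be illustrated with a minimal bottom-up (CKY-style) recogniser. The grammar fragment and category names below are invented, loosely modelled on a definition phrase like 'instrument for measuring current'; they are not the 100+ rules of the actual system.

```python
def cky(words, lexical, binary):
    """Bottom-up recognition with a grammar in Chomsky normal form.
    table[(i, j)] holds the categories spanning words[i:j]."""
    n = len(words)
    table = {}
    for i, w in enumerate(words):
        table[(i, i + 1)] = {cat for cat, vocab in lexical if w in vocab}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cats = set()
            for k in range(i + 1, j):
                for parent, left, right in binary:
                    if left in table[(i, k)] and right in table[(k, j)]:
                        cats.add(parent)
            table[(i, j)] = cats
    return table

# Invented grammar fragment for definition-like phrases
lexical = [("NP", {"instrument", "current"}),
           ("V", {"measuring"}),
           ("P", {"for"})]
binary = [("NP", "NP", "PP"),   # instrument [for measuring current]
          ("PP", "P", "VP"),    # for [measuring current]
          ("VP", "V", "NP")]    # measuring [current]
words = "instrument for measuring current".split()
chart = cky(words, lexical, binary)
ok = "NP" in chart[(0, len(words))]
```

A chart of this kind makes grammar debugging easy, which is the utility the authors cite for choosing chart parsing.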
The grammar is still being tuned, but currently covers the language of content word definitions in LDOCE to a factor exceeding 90%. This chart parser is not, we emphasise, a parser for English; it is a parser for the language of LDOCE definitions (Longmanese), and in fact only the open class (content word) portions of that language; definitions of closed class (function) words are not analysed. The grammar is augmented to suppress certain frequently occurring misinterpretations (for example, comma-separated nouns taken as two individual noun phrases); and, with certain minor exceptions, no procedure associates constituents with what they modify. Hence, there is little or no motivation for assigning elaborate or competing syntactic structures, since the choice of one over the other has no semantic consequence (Pulman, 1985). Therefore, the parser also has a longest string (fewest constituents) preference. A tree interpreter extracts semantic information from these phrase-structure definition trees.

Figure 9.4 Lexicon-producer/consumer.

The lexicon-producer converts LDOCE dictionary entries into lexical semantic structures (a frame-based knowledge representation), intended for knowledge-based parsing. Each lexical semantic structure (frame), as constructed, is part of one or more hierarchies. Preference disagreement, or preference breaking, can be expected in typical text. Preference breaking occurs when, for example, a verb like drink, which prefers an animate agent, is used as in

My car drinks gasoline (Wilks, 1978)

To proceed when these breakdowns occur, it is necessary to relax grammatical or semantic constraints. This relaxation is typically done by travelling up a hierarchy of constraints (the obverse of preferences), testing at each level, and keeping track of accumulated deviance (in order to compare semantic density). In this context, then, 'inferencing' entails relaxing constraints by climbing hierarchies and then recursively applying a pattern matcher.
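The relaxation procedure just described, climb the constraint hierarchy, test at each level, accumulate deviance, can be sketched as follows. The hierarchy and the type sets for Wilks's car/drink example are invented for illustration.

```python
def match_with_relaxation(preferred, candidate_types, supertype):
    """Try to satisfy a preference; on failure, relax it by climbing the
    constraint hierarchy, accumulating a deviance penalty per level."""
    level, deviance = preferred, 0
    while True:
        if level in candidate_types:
            return deviance
        if level not in supertype:
            return None  # unsatisfiable even after full relaxation
        level = supertype[level]
        deviance += 1

# Invented hierarchy/types for "My car drinks gasoline": drink prefers
# an animate agent, but car is only a physical object.
supertype = {"animate1": "physical_object1", "physical_object1": "entity1"}
car_types = {"machine1", "physical_object1", "entity1"}
deviance = match_with_relaxation("animate1", car_types, supertype)
```

The returned deviance is exactly the quantity that later allows competing readings to be compared for semantic density.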
The output of the chart parser, a phrase-structure tree, is passed to an interpreter for pattern matching and inferencing. The first step picks off the dominating phrase and, after restructuring it into genus and feature components (by reference to the genus), inserts it into the frame currently under construction. As the strategies for pattern-matching are developed, more detailed differentia information is extracted, beyond the distinction between a genus and its modifiers. The genus relationship can trivially be viewed as an IS-A relation; for example, an ammeter is an instrument for measuring electric current. The frame created for each word sense from its definition, then, represents … this intensional … for eventual preference matching. For example, it becomes reasonable to create a slot in the AMMETER frame that is labeled PURPOSE and is filled with MEASURING. This kind of information is precisely what is needed to compute case roles and preferences.

The output of the former (the lexicon-producer) is a knowledge-base for the latter (the lexicon-consumer). The output of parsing the dictionary is a lexicon of word-sense frames, with each frame implicitly positioned in multiple, pre-existing hierarchies.

9.3 The utilisation of semantic information from LDOCE

This section describes how the semantic information derived from the Longman dictionary by the processes described in section 9.2 can be mapped directly into the kinds of knowledge structures used in two kinds of semantic processing: Preference Semantics and Collative Semantics.

Section 9.3.1 describes a recent implementation of Preference Semantics in a semantic parser. Preference Semantics is a theory of language in which the meaning for a text is represented by a semantic structure that is built out of smaller semantic components. Links between components in the structure are created on the basis of coherence and preference. The semantic representation computed for a text is the one with the most semantically dense structure of the competing readings. Semantic density is a property of structures that have strong preferences regarding their constituents. These preferences are compared in terms of the existence of preference-matching features, the lack of preference-breaking features, and the length of the inference chains needed to justify each sense selection and constituent attachment decision.

Section 9.3.2 outlines Collative Semantics, which principally addresses the phenomena of lexical ambiguity and semantic relations. Seven kinds of semantic relation are investigated: literal, metonymic, metaphorical, anomalous, novel, inconsistent and redundant relations. Collative Semantics has been implemented in an NLP program called meta5, which analyses sentences, discriminates the seven kinds of semantic relation between pairs of word senses in these sentences, and resolves any lexical ambiguity in them. Section 9.3.2 focusses on the form of knowledge representation called sense-frames, and the …

9.3.1 A lexicon-consumer (Brian M. Slator)

The lexicon of frames created by the lexicon-producer of section 9.2.3 constitutes a text-specific knowledge source for use by a knowledge-based Preference Semantics parser for text. The job of Preference Semantics is to … the competing readings … and to choose among them by finding the one that is the most semantically dense, and hence preferred. The lexicon of word-sense frames and the original text are presented to the Preference Semantics parser, which is under development. We envision a goal-directed, non-deterministic parser that keeps all plausible interpretations alive but only pursues the one most highly preferred. The text is processed left to right, attachments are made immediately, and constituents are constructed locally to be applied globally (since we do make weak claims for psychological plausibility). In those cases where no satisfactory preference decisions can immediately be made, we foresee an extended mode where previous context, if any, … level, orientation. In the absence of useful context, new frames are …, and then processing can continue with these new semantic structures until some preferred reading can be found. This strategy is both a run-time optimisation and an application of the least effort principle of psychologically plausible language processing. The parsing will be robust in that some structure will be returned for every input, no matter how ill-formed or 'garden-pathological' it is. Parsing will be directed by a top-level planning component to dynamically compute deviance scores, allowing the pursuit of the currently most preferred parse as a strategy for enforcing the 'no failure' robustness policy. We anticipate a macro-text structure formulation to be applied by appeal to global coherence measures derived from an analysis of preferred pragmatic (subject) codes appearing in the text (Walker and Amsler, 1986).

The suppositions of this approach are that: …
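The selection of the most semantically dense reading, scored by preference matches, preference breaks and inference-chain length as described in section 9.3, might be sketched like this. The scores, the field names and the 0.5 weighting of chain length are invented for illustration; the actual system's scoring is not specified here.

```python
def density(reading):
    """Score a reading: preference matches count for it; preference
    breaks and long inference chains count against it."""
    return (reading["matches"]
            - reading["breaks"]
            - 0.5 * reading["chain_length"])

def most_preferred(readings):
    """Choose the semantically densest of the competing readings."""
    return max(readings, key=density)

# Invented scores for two competing sense assignments
readings = [
    {"sense": "play10", "matches": 3, "breaks": 0, "chain_length": 1},
    {"sense": "play11", "matches": 1, "breaks": 1, "chain_length": 2},
]
best = most_preferred(readings)
```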
… This material … pays particular attention to the structure of dictionaries. In sense-frames, word senses perform the functions of semantic primitives; hence sense-frames are much like Quillian's (1968) planes, and much like the circular organisation of a real dictionary. The lexicon-consumer approach limits itself to … taking the dictionary more or less as a largely self-contained unit for analysis, rather than determining the … of a word by examining its incidence of occurrence throughout the dictionary, as the first approach (section 9.2.1) does. A means has been developed by which word senses perform all the functions of the semantic primitives in theories of NLP such as Conceptual Dependency (for example, Schank, 1973, 1975a, 1975b), though they are also part of English, the object language being represented.

It was noted in the introduction that dictionary entries commonly contain a genus and differentia, and that the genus terms of dictionary entries can be assembled … (Amsler, 1980; Chodorow, Byrd, and Heidorn, 1985). Approaches to extracting such information from machine-readable dictionaries include sprouting (Chodorow, Byrd, and Heidorn, 1985), … disambiguators (Markowitz, Ahlswede and Evens), … formulas …, constructing taxonomies from Webster's Seventh …, and recent work at IBM (Binot and Jensen) on interpreting definition parse trees, by applying a pattern matcher and an inference mechanism that assigns MYCIN-like numbers (Shortliffe, 1976) to … attachment alternatives; … rule-based and semantic network-based systems, such as schema theory (Rumelhart and Ortony, 1977), KRL (Bobrow and Winograd, 1977), FRL (Roberts and Goldstein, 1977), and KL-ONE (Brachman, 1979) …

Likewise with sense-frames: a sense-frame contains a genus and differentiae and belongs to a semantic hierarchy, which is a network of genus terms, as we explain below. Collative Semantics has four components …: sense-frames and semantic vectors are … the knowledge representation … of word senses. Sense-frames consist of two major parts, the arcs and the node. The arcs part of a sense-frame …
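Genus extraction of the kind cited here typically takes the genus to be the head of the defining phrase. A crude head-finding heuristic of that sort (an illustrative sketch, not any of the cited authors' actual algorithms; the stop-word lists are invented) can be written as:

```python
STOPS = {"that", "which", "who", "of", "for", "with", "used", "in", "to"}
ARTICLES = {"a", "an", "the"}

def genus_term(definition):
    """Take the genus to be the last word of the initial noun phrase,
    i.e. the word just before the first preposition or relative marker."""
    head = None
    for w in (t.strip(",;()") for t in definition.lower().split()):
        if w in STOPS:
            break
        if w not in ARTICLES:
            head = w
    return head

g1 = genus_term("an instrument for measuring electric current")
g2 = genus_term("a large wild animal that lives in forests")
```

As the text notes below, real noun definitions are far more varied than this heuristic assumes, which is why head finding for nouns is the hard case.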
… preference … fill in [it1, steal1, valuables1]. Common dictionary practice is followed in that word senses are listed separately for each part of speech and numbered by frequency of occurrence. Hence … the cell [use1, it1] contains the noun sense shepherd1, while the cell [it1, shepherd1, sheep1] contains the verb sense. In Figure 9.5, all the alphabetic symbols the reader sees are word senses with their own sense-frames, except for symbols used as labels.

… for an organism. The sense-frames differ because female1 has the cell [property, female1], which signifies that being female is a property of females, whereas male1 has the cell [property, male1]. The sense-frames also differ in their assertions: the sense-frame for male1 has the assertion that any entity which it is applied to is male, whereas the sense-frame for female1 has the assertion [sex1, female1].

sf(male1,
   [[arcs, [[preference, organism1]]],
    [node1, [[property, male1]]]]).

sf(female1,
   [[arcs, [[preference, organism1]]],
    [node1, [[property, female1]]]]).

Figure 9.6 Sense-frames for record_player1 and tape_recorder1.
… and play11 is 'to perform on (a musical instrument)'.

sf(play10,
   [[arcs, [[supertype, [create1, operate1]]]],
    [node2, [[agent, [preference, human_being1]],
             [object, [preference, recording1]],
             [instrument, [preference, playback1]]]]]).

sf(play11,
   [[arcs, [[supertype, [perform1, use1]]]],
    [node2, [[agent, [preference, human_being1]],
             [instrument, [preference, musical_instrument1]]]]]).

Figure 9.9 Sense-frames for play10 and play11 (verb).

Play10 appears in the sense-frames of record_player1 and tape_recording1 (Figure 9.6). The preferred instrument of play10 is playback1, which is the genus term of record_player1 and tape_recording1; the …

Differentia information for nouns must be organised into a list of properties that can each be represented as a cell. Differentia information for adjectives, adverbs and so on needs to be grouped as preference versus assertion information. Differentia information for verbs, prepositions and so on must be sorted into case label, preference, and assertion information. Such demands are beyond the abilities of the best current extraction techniques. Recognising genus terms in dictionary definitions appears to be less difficult than extracting differentiae and, according to Chodorow, Byrd, and Heidorn (1985), extracting genus terms for verbs is less complex than doing the same for nouns. They found that although the genus term for both verb and noun definitions is typically the head of the defining phrase, head finding for verb definitions was relatively straightforward, while noun definitions were much more complicated because of their greater variety. Hence it makes sense to attempt to extract genus terms for adjectives and adverbs (where they exist) before turning to nouns. As for extracting differentiae from dictionary definitions, it looks as if the definitions of nouns will be the least difficult, because the differentia in their definitions has only to be transformed into lists of sense-frame cells. Adjectives and adverbs would probably be the ones to try next. Verb definitions are most likely to be the hardest to analyse, because three types of differentia information must be sorted out.
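The preference test implied by Figure 9.9, a filler satisfies a slot if it is the preferred sense or reaches it along the IS-A chain of genus terms (so record_player1 satisfies play10's instrument preference for playback1), can be sketched as follows. The GENUS table is a toy fragment, not the real sense-frame lexicon.

```python
# Toy genus fragment: each sense points at its genus (IS-A parent)
GENUS = {"record_player1": "playback1", "tape_recording1": "playback1",
         "playback1": "instrument1", "violin1": "musical_instrument1"}

def satisfies(sense, preferred):
    """A filler satisfies a preference if it is the preferred sense or
    reaches it by following the IS-A chain of genus terms."""
    while sense is not None:
        if sense == preferred:
            return True
        sense = GENUS.get(sense)
    return False

ok = satisfies("record_player1", "playback1")   # play10's instrument slot
bad = satisfies("violin1", "playback1")         # violin1 does not reach it
```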
9.4 Conclusion

The chapter has argued that a branch of computational semantics that merits attention is one that seeks to develop more natural language-like schemes for knowledge representation, and noted that, in doing so, the theoretical interests of computational semantics will come to overlap increasingly with those in knowledge acquisition and computational lexicography. The chapter focussed on the practical problem of developing methods for the extraction of semantic information from machine readable dictionaries, a specialised form of natural language text, and the use of that information to build knowledge structures. In focussing on this problem from the perspective of computational semantics, it has argued that these questions must be investigated, and adequate research and development resources committed to them, if there is to be robust, large-scale machine comprehension of English in the near future.

Chapter 10

Conclusion

Bran Boguraev and Ted Briscoe

… references confirms). MRDs are being used because they represent the most accessible source of information for building NLP systems which have realistic sized lexicons. There is clearly a feeling that the basic
techniques of (syntactic) parsing, some types of semantic processing,
spoken word recognition, and so forth, are well enough understood to
warrant either testing with large lexicons or commercial deployment in
a limited range of applications.
The chapters of this book demonstrate that useful information can
be extracted from at least one MRD source and the work reviewed
in chapter 1 suggests that many of the techniques discussed by con-
tributors to this volume generalise to other MRDs, as well as LDOCE.
Recently, there have been proposals to standardise the format of MRDs
(Amsler, 1987) so that the various sources of different MRDs will
be compatible and easily interchangeable between different research
groups. The success of this enterprise would mean that, in princi-
ple, any MRD would be straightforwardly loadable into the Lexical
Database System described in chapter 2 and the Lexicon Development
Environment described in chapter 5 (as well as many similar software
systems developed by other groups). As a consequence the MRD-
derived data available to researchers would increase massively and, no
doubt, much
valuable information could be extracted through cross-dictionary comparison and merging of the kind which is beginning to
be undertaken by the IBM Lexical Systems Project (see Byrd et al., 1987).
will
On the other hand,
not solve many of
a
the
massive
…problems which arise in the computational exploitation of these sources. To date, no substantial dictionary intended for computational use has been produced by a dictionary publisher. The great majority of MRDs remain typesetting tapes, requiring considerable preprocessing before they are usable by machine. LDOCE was, until recently, unique in that the MRD source contained some database-like attributes and structure. The publication of COBUILD (Sinclair, 1987) represents a major development in this context because, not only is this dictionary based on linguistic data derived from a large computerised corpus of English, but the published dictionary itself was also produced from an intermediate (and more detailed) computerised database constructed from the corpus. We may yet see MRDs which, from inception, are genuine lexical databases, and which produce dictionaries in a format which will yield more easily to automatic extraction and deployment. However, in the meantime there remains the difficult question concerning how far it is worth exploiting existing MRDs. Clearly, there is a direct relationship between the amount of (probably semi-automatic) processing required to extract information and the amount of research resources needed to recover this information. There is comparatively little in current MRDs which is extractable without the use of sophisticated and robust processing techniques. For example, even the grammar codes and pronunciation information in LDOCE (see chapters 4 and 6) require semi-automatic extraction techniques to guarantee 100% success. Most information in MRDs requires the deployment of robust natural language processing techniques in order to extract reliably the information needed to support these very techniques. Fortunately, though, the situation is not completely circular, because most MRDs (and especially LDOCE) are written in a specialised subset of English.

Nevertheless, the problems of extraction are usually great enough to motivate serious consideration of whether the information extracted warrants the effort. One obvious lesson is that there is little point in repeating the chore of developing extraction techniques for each new project which requires access to a MRD. Therefore, the development of systems (such as that described in chapter 2) which endeavour to separate the task of extraction from that of deployment of the information extracted is well worthwhile. Such systems convert the MRD into a general resource capable of opportunistic use, both as a source of lexical data and in unforeseen applications, limited only by the capabilities of the extraction programs (and the reliability and completeness of the chosen MRD). It is likely that further work on extraction of information from MRDs will need to focus on the development of crude but robust techniques geared specifically to this task, similar in spirit to Alshawi's phrasal analyser (see chapter 7). Probabilistic techniques developed for the analysis of textual corpora may be particularly appropriate in this context (see Garside et al., 1987). In the current context of research and development, new dictionary projects may follow the lead of the COBUILD project; in this case there may be a productive convergence of interests and techniques between computational linguists and lexicographers, both in the production and the use of the next generation of dictionaries.
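The separation argued for here — extract once from the raw source into a neutral lexical database, then let many consumers query that database — can be sketched as a two-stage pipeline. The tape format, field names and grammar codes below are invented for illustration; they are not the real LDOCE format.

```python
# Hypothetical raw typesetting-tape lines: headword, part of speech and a
# grammar-code field packed into one string (illustrative format only).
RAW_TAPE = [
    "abandon|v|T1",
    "abandon|n|U",
    "abbey|n|C",
]

def extract(lines):
    """Extraction stage: parse the raw source once into a neutral database."""
    db = {}
    for line in lines:
        head, pos, gcode = line.split("|")
        db.setdefault(head, []).append({"pos": pos, "grammar": gcode})
    return db

def lookup(db, head, pos=None):
    """Deployment stage: consumers query the database, never the tape."""
    senses = db.get(head, [])
    return [s for s in senses if pos is None or s["pos"] == pos]

db = extract(RAW_TAPE)
print(lookup(db, "abandon", pos="v"))  # [{'pos': 'v', 'grammar': 'T1'}]
```

Only `extract` ever needs to know the tape format, so a new application costs one query, not a new extraction program.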
Appendix A
Lexical database user guide

A.1 Overview
…from all entries whose head words or phrases either match the pattern in their entirety or, in the case of head phrases, have a component word that matches the pattern. Non-alphabetic characters (including spaces) are ignored in matching. Thus a call with a wild card near the beginning may give rise to a very long search. This is because the search uses a left-to-right discrimination net. A similar situation obtains, of course, when using a normal printed dictionary by hand; one must know how the spelling of a word begins to be able to find it.
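A left-to-right discrimination net of this kind can be sketched as a character trie: matching consumes the pattern from the left, so a wild card near the front forces the search to fan out over most of the net. This is an illustrative sketch, not the LDB's actual implementation.

```python
def build_trie(words):
    """Build a character trie (left-to-right discrimination net)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def match(node, pattern):
    """Yield words matching pattern; '?' matches any single character."""
    if not pattern:
        if "$" in node:
            yield ""
        return
    first, rest = pattern[0], pattern[1:]
    if first == "?":
        branches = [k for k in node if k != "$"]
    else:
        branches = [first] if first in node else []
    for ch in branches:
        for suffix in match(node[ch], rest):
            yield ch + suffix

trie = build_trie(["coffee", "coffer", "coin", "cot"])
print(sorted(match(trie, "coff??")))  # ['coffee', 'coffer']
```

Because the net discriminates on the first characters, `coff??` prunes almost everything immediately, whereas a pattern like `??ffee` would have to explore every top-level branch.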
One may select one matching word, or all of the entries whose head words match the pattern (for example, coff?? matches coffee).

[Figure A.2: a word window for coffee]

Box codes are ten-place sequences following the marker box in the definition. Each place codes for a particular type of meaning, roughly:

1. …;
2. Subpart of variety;
3. Register (humorous, formal, etc.);
4. Period of use;
5. Semantic type of object described (for a noun) or qualified (for an adjective), or of subject (for a verb);
6. Language of origin, if not English;
7. Neologism or not;
8. Illustration pointer (although illustrations are not part of the machine-readable LDOCE);
9. Cross-references (yes or no);
10. Semantic type of object, if verb.
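Reading a ten-place box code off into named fields can be sketched as below. The field names follow the list above (generically where a place is not recoverable here), and the sample code string and its values are invented for illustration.

```python
# Names for the ten places of a box code, following the list above
# (place 1 is named generically; its meaning is not given here).
PLACES = [
    "place_1", "subpart_of_variety", "register", "period_of_use",
    "semantic_type_described", "language_of_origin", "neologism",
    "illustration_pointer", "cross_references", "semantic_type_of_object",
]

def parse_box_code(code):
    """Split a ten-character box code into a dict of named places."""
    if len(code) != 10:
        raise ValueError("box codes are ten-place sequences")
    return dict(zip(PLACES, code))

fields = parse_box_code("----H----T")  # invented sample code
print(fields["semantic_type_described"])  # 'H'
```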
Selecting the [all] option results in every entry in the list being displayed. If the grammar, subject and box codes for the coffee entry are selected, the display in Figure A.3 results.

A.3.1 The WORD node

Initially the tree consists of only a WORD node.
[Figure A.3: windows specifying the grammar, subject and box codes for coffee]
A.3 Access by non-spellings
An entry explanation window, and a new search window, are created (see chapter 6). Buttoning on a node with the left mouse button brings up a node-specific menu of things that can be done to it, as explained below; selecting with the middle button will delete the node, if deletion is allowed. The information given would be used for tests, and indicates roughly how long a search would take and how many entries would be retrieved.
Get.specified.template prompts in the explanation window for a word or phrase, and builds the tree for the entry of that word. Starting from the template of an entry like the one wanted, rather than building a tree from scratch, reduces the risk of errors.

Get.random.template is like Get.specified.template, but picks an entry at random. This is mainly useful for learning how to use the LDB.

Insert.to.left inserts a new SYLLABLE node to the left of the current one.

A.3.3 The SYLLABLE node

Buttoning on a SYLLABLE node enables one to add constraints on that syllable.

Add.syllable adds an (additional) SYLLABLE node to the right of any existing ones. The syllable node initially has a daughter node *, a wild card that indicates that the content of the syllable is unspecified.

Add.sopc adds nodes for STRESS, ONSET, PEAK and CODA (see A.3.4), where these do not already exist. Initially they have wild card daughters to indicate that they are unspecified.

A.3.4 The STRESS, ONSET, PEAK and CODA nodes

Buttoning on one of these gives a structured menu of phonemes or stress values in a phonetic font. A constituent can be constrained by selecting it, which gives the following options:

Change.phonemes.to.broad changes any phonemes in the pronunciation to representations of their broad manner-of-articulation classes. The classes used are defined by the value of a variable.

Make.disjunction inserts an OR between a phoneme and its broader class, replacing the phoneme by a disjunction. A phoneme can also be expressed as a conjunction (AND) of SPE features; one can switch back and forth between features and phonemes, adding and deleting either.
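The effect of Change.phonemes.to.broad — collapsing each phoneme into a broad manner-of-articulation class while leaving the wild card * alone — can be sketched as a table lookup. The class inventory and phoneme symbols below are assumptions for illustration; in the LDB the classes come from the value of a variable.

```python
# Assumed broad manner-of-articulation classes (illustrative only).
BROAD_CLASS = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop", "k": "stop", "g": "stop",
    "f": "fricative", "v": "fricative", "s": "fricative", "z": "fricative",
    "m": "nasal", "n": "nasal",
    "l": "liquid", "r": "liquid",
    "i": "vowel", "e": "vowel", "a": "vowel", "o": "vowel", "u": "vowel",
}

def to_broad(phonemes):
    """Replace each phoneme by its broad class; '*' (unspecified) is kept."""
    return [BROAD_CLASS.get(p, p) if p != "*" else "*" for p in phonemes]

print(to_broad(["k", "o", "f", "i"]))  # ['stop', 'vowel', 'fricative', 'vowel']
```

Searching on the broadened form trades precision for recall, which is useful when only rough phonetic information is known.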
A.3.7 The GRAMMAR node

… 2. an inflectional variant thereof, which will be altered to its root form; 3. a word that …

A.3.8 The CATEGORY node

Buttoning on a CATEGORY node gives a menu of all the categories, together with the explanation window.

A search window is created by Do.lookup to display the results of the search. Initially, only the pointer lists deriving from the query have been intersected; no entries have been looked up or tests applied. If there are few entries or pointers, looking up and testing will be done automatically; otherwise it will only be done when the user left-buttons in the window and selects a command in the resulting menu. These commands are as follows.

A.3.10 The SEMANTICS node

Buttoning on a SEMANTICS node presents a choice of adding subject codes, box codes or definition words (see A.3.11–A.3.13). Multiple instances of these are interpreted conjunctively.

Buttoning on a SUBJECT node gives a structured menu of two-letter subject codes, some of which have two-letter subcodes. If one selects a code and holds the button down, an explanation window for its meaning is displayed.

Display.entries opens a subwindow to the right of the search window, and an explanation window for every entry satisfying the query; the [all] option causes all such entries to be displayed.
Appendix B
Semantic types of LDOCE verbs

B.3 Subject Equi verbs
begin(1) concur(2) delight in(1) involve(2) own to(1) quit(1) serve(5)
condescend(1) demand(1) forget(1) itch(3) pant(4) recall(1) set about(1)
1763(3) conduce to(1) depose(2) forget about(1) jib at(1) pay for(1) reckon on(2) set out(2)
begrudge(1) bid fair(1) confess(1) deride(1) forswear(1) justify(1) pertain to(1) recollect(1) shirk(1)
blanch(2) confess(2) descend to(1) frown on(1) keep(11) petition(2) refuse(1) should(1)
blink at(1) confide(1) deserve(1) funk(1) keep from(2) pine(3) regret(1) shrink from(1)
blush(2) connive(1) detest(1) gem) keep on at(1) plan(1) rejoice(1) shudder(1)
bother(3) consent(1) disclaim(1) get at(1) kick against(1) play(3) relish(1) shun(1)
break off(1) consider(1) discontinue(1) get around to(1) knock off(2) play at(1) remember(2) sicken of(1)
burn(6) consist in(1) discourage(2) get away with(1) know about(1) play at(2) repent(1) smile(2)
burst(3) conspire(1) disdain(2) get down to(1) lament(1) pledge(1) require(1) stand(8)
burst out(1) conspire(2) dislike(1) get out of(1) lead to(1) plot(5) resent(1) stand(12)
bust out(3) contemplate(2) do with(1) get round to(1) learn(1) plump for(1) resist(1) stand for(2)
B.4 Object Equi verbs

acknowledge(2), adjure(1), advise(1), aid(1), allow(2), allure(1), appoint(1), arrange for(1), ask(4), assign(4), assist(1), attribute to(1), authorize(1), badger(1), bargain for(1), bid(2), bill(2), bludgeon into(1), bluff into(1), bribe(1), bring(2), bring(5), bring in(3), bully(1), buzz(3), cable(1), call on(2), catch(3), caution(1), cause(1), charge with(1), charge with(2), come down on(1), command(1), commission(1), compel(1), condemn(3), condemn(4), condition(3), confess(3), conjure(1), connive at(1), consider(2), constrain(1), credit with(1), dare(5), debar from(1), decide(4), dedicate to(1), defy(2), delegate(2), depend on(1), depute(1), deputize(2), design(2), designate(2), detail(1), direct(3), doom(1), expect(5), forbid, force, frighten into(1), instruct(3), intend(2), introduce to(1), inure to(1), inveigle into(1), invite(2), invite(3), itch for(1), join with … in(1), keep(10), keep from(1), know(4), lead(2), lead on(1), legislate, order(1), organize(1), overhear(1), persuade(2), pester(1), petition(1), phone(1), pick(1), pick on(1), plead with(1), pledge(2), plume upon(1), pray(3), preclude from(1), predestinate(1), provoke into(1), push(3), push on(2), put down as(1), put down to(1), put off(1), put up to(1), reckon(1), reckon on(1), reduce to(4), re-educate(1), regard as(1), rely on(2), remember as(1), remind(1), represent(1),
set(8), shape(1), show(1), show(9), sign(2), signal(2), supplicate(1), suppose(3), suspect(2), take(18), talk into(1), talk out of(1), tell, tempt(1), tempt(2), thank(2), time(1), timetable(1), watch(1), watch(5), watch for(1), wean from(1), worry at(1), yearn for(1)

C.1.1 Assign/Give

They delivered the groceries to her. (deliver(2))
NP-PP[+Lat?] Position/Possession?: 5

She donated the book to him. (donate(1))
NP-PP[?Lat] Position/Possession?: 5
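Each entry in these appendices pairs a verb sense with a syntactic frame (NP-PP, NP-NP or Either), a latinate marking, and a Position/Possession score. One convenient machine representation — with hypothetical field names — is a small record type:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DativeEntry:
    verb: str                  # headword
    sense: int                 # LDOCE sense number
    frame: str                 # "NP-PP", "NP-NP" or "Either"
    latinate: str              # e.g. "+Lat?", "-Lat", "?Lat"
    pos_poss: Optional[int]    # Position/Possession score, None if unrecorded

entries = [
    DativeEntry("deliver", 2, "NP-PP", "+Lat?", 5),
    DativeEntry("donate", 1, "NP-PP", "?Lat", 5),
]

# e.g. find all senses recorded with the NP-PP frame
np_pp = [e.verb for e in entries if e.frame == "NP-PP"]
print(np_pp)  # ['deliver', 'donate']
```

Keeping the uncertain values (`?` markings, missing scores) explicit in the record preserves the tentativeness of the paper classification.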
Appendix C
"Dative" alternations

C.1 Fine-grained semantic classes of verbs

Either[?Lat] Position/Possession?: …

She mixed him a drink. (mix(2))
Either[?Lat?] Position/Possession?: 4

She gave her family all her time. (give(8), set aside)
Either Position/Possession?: …

She poured him a drink. (pour(7))
Either[?Lat] Position/Possession?: 5

She rendered him a service. (render(3))
NP-NP[+Lat?] Position/Possession?: …

She pulled him a beer. (pull(1), (5))
Either[-Lat] Position/Possession?: 4

She/the house yielded him some shelter. (yield(3))
Either[?Lat] Position/Possession?: …

She prepared him a meal. (prepare(2))
Either[+Lat] Position/Possession?: 4
…Lat] Position/Possession?: 4

C.1.3 Obtain/Find

She obtained the book for him. (obtain(1))
NP-PP[+Lat] Position/Possession?: 5

… (bring(1))
Either[-Lat] Position/Possession?: …

She chose him a new coat. (choose(1))
Either[-Lat] Position/Possession?: …

She picked a … (pick(2))
Either[-Lat] Position/Possession?: 5

She sent him a book. (send(1))
Either[-Lat] Position/Possession?: 5

C.1.4 Pay/Charge

She charged him some money. (charge(1))
NP-NP[+Lat?] Position/Possession?: 4

She fined him a pound. (fine(1))
NP-NP[+Lat?] Position/Possession?: 4

She mulcted him a pound for… (mulct(1), fine)
NP-NP[-Lat] Position/Possession?: 4

C.1.5 Say/Teach

She preached him the word of God. (preach(1))
Either[?Lat?] Position/Possession?: 0
She quoted him some poetry. (quote(1))
Either[+Lat] Position/Possession?: 0

She quoted him a price. (quote(3))
Either Position/Possession?: 0

She read him a book. (read(3))
Either[-Lat] Position/Possession?: 0

She sung him a nursery rhyme. (sing(1))
Either[-Lat] Position/Possession?: 0

She showed him the book. (show(1))
Either[-Lat] Position/Possession?: 0

She taught him history. (teach(1))
Either[-Lat] Position/Possession?: 2

C.1.6 Pass/Throw

She chucked him the ball. (chuck(1))
Either[-Lat] Position/Possession?: 5

C.1.7 Allow/Forbid

She allowed him some money. (allow(4))
NP-NP?[+Lat?] Position/Possession?: 1

She denied him nothing/the money. (deny(3))
NP-NP[+Lat?] Position/Possession?: 0

She refused him a kiss. (refuse(1))
NP-NP[+Lat?] Position/Possession?: 0

She forbade him the house. (forbid(3))
NP-NP[-Lat] Position/Possession?: 0

She gave him enough time to… (give(6), allow)
NP-NP Position/Possession?: 0

She accorded him permission to… (accord(2))
Either[+Lat?] Position/Possession?: 1

She spared him five minutes. (spare(4))
Either[-Lat] Position/Possession?: 0
She flung him the ball. (fling(1))
Either[-Lat] Position/Possession?: 5

She passed him the bread. (pass(5))
Either[-Lat] Position/Possession?: 5

She passed him the ball. (pass(7))
Either Position/Possession?: 5

She passed him a fake coin. (pass(1), (3))
Either Position/Possession?: 5

C.1.8 Save/Take

She saved him a pound. (save(3))
NP-NP[?Lat?] Position/Possession?: 1

She saved him a journey. (save(4))
NP-NP Position/Possession?: 0

She spared him a visit. (spare(2))
NP-NP[-Lat] Position/Possession?: 0

She spared him her opinion. (spare(3))
NP-NP Position/Possession?: 0

… NP-PP[-Lat] Position/Possession?: 0

The book gave him some ideas. (give(5), produce, supply)
NP-NP Position/Possession?: 3*

The meal gave him indigestion. (give(7), cause pain)
NP-NP Position/Possession?: 2

She/The medicine ensured him some sleep. (ensure(2))
Either[?Lat] Position/Possession?: 1

She bought him a present. (buy(1))
Either[-Lat] Position/Possession?: 4

She sold him a car. (sell(6))
Either[-Lat] Position/Possession?: 5

She stood him a drink. (stand(1), (7))
Either[+Lat?] Position/Possession?: 5

C.1.10 Greet/Wish

She wished her brother a safe journey. (wish(4))
Either[-Lat] Position/Possession?: 0

C.1.11 Offer

… (make(1), (7))
… Position/Possession?: 2

C.1.13 Concede

NP-NP[?Lat] Position/Possession?: 3

She owed him loyalty. (owe(2))
Either Position/Possession?: 0

She owed him a lot. (owe(3))
Either Position/Possession?: 1
… (leave(4))
Either[-Lat] Position/Possession?: 4
Either[?Lat] Position/Possession?: 1
Appendix D
Lexicon development environment user guide

D.1 The main command menu

[Figure D.1: the main command menu, including Edit, Reset and Initialise]
Initialise reloads from disc all the data needed by the LDE. The data comprises the morphology system compiled files, the sentence-level GPSG grammar, and the definitions of the LDE frames and subcategorisation translations. The command should be issued if any changes have been made to the morphological description or sentence-level grammar, and it is desired that these changes should be noticed by the LDE. The command asks for confirmation before this process is started, since reloading all these files takes a few minutes.
…command that this is the next word it should present to the user. This, and all the other commands as well, do nothing if the prompt is answered with just <return>. Repeatedly invoking Next Word steps through each head word in LDOCE in alphabetic order, so this command may be used to help derive a new lexicon from LDOCE systematically.

[Figure D.2: a word window]
[Figure D.3: the list of categories displayed for believe — GPSG feature matrices with features such as BAR, SUBCAT, VFORM, AGR, NFORM, PRD, AUX and NEG]
Defn opens a Definition Window (see next section) displaying the definition of the currently selected word.
Frames finds those syntactic frames in which the word, given its definition, will fit, and displays these frames in a new window with the word substituted in them. Each frame is represented by a phrase or sentence with one word replaced by _. A set of categories is associated with each frame; for a frame to be applicable to a word (i.e. for the word to be able to fill the _ slot), the definition of the word must include a category matching one of the frame's set of categories. Frames can be selected and edited in the Frames window, some of them enlarged, from the main LDE menu. The file containing the derived lexicon is remembered, for when the definition is later to be written to it; the normal TEDIT commands are available on a menu attached to the side of the window.
D.5 Editing subcategorisation translations

Syntactic frames are applied to definitions in two stages. The first stage transforms the nodes into theory-neutral templates. The second stage is to translate these templates into subcategorisation values, using the subcategorisation translations provided. Ed opens a window in which the translations may be edited (Figure D.5). Subcategorisation translations may be installed from a file, in a way similar to syntactic frames, by invoking Install.from.file.
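The two stages described above — frame to theory-neutral template, then template to subcategorisation value — can be sketched as two lookups. The template names and SUBCAT values below are invented for illustration; they are not the LDE's actual tables.

```python
# Stage 2: assumed translation table from theory-neutral templates
# to subcategorisation values (illustrative names and values only).
SUBCAT_TRANSLATIONS = {
    "NP": "NP",
    "NP NP": "OR",
    "NP SFIN": "SFIN",
}

def frame_to_template(frame):
    """Stage 1: reduce a frame (with '_' marking the word) to a template."""
    return " ".join(tok for tok in frame.split() if tok != "_")

def translate(frame):
    """Both stages: frame -> template -> subcategorisation value."""
    return SUBCAT_TRANSLATIONS[frame_to_template(frame)]

print(translate("_ NP SFIN"))  # 'SFIN'
```

Because the templates are theory-neutral, only the second table needs changing when the target grammar formalism changes.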
…attached to the window; there is, however, no analogous Create command.

[Figure D.5: syntactic frames, with their associated GPSG categories, e.g. [N -, V +, BAR 0, AGR [N +, V -, BAR 2, NFORM NORM, PER 3, PLU +, COUNT +, CASE NOM], SUBCAT OR]]

There are also three other commands in the menu:

Create automatically creates a candidate set of categories for the…
Appendix E
The Longman semantic codes

E.1 Subject field codes

…or it is combined with a locality restriction. The third letter is either O, U, X or Y, which stand for the continents America, Europe, Asia and Australia, and Africa. The fourth position refers to local sub-areas; its codes range over all the letters of the alphabet and the numbers 0–8. For example, GAQA refers to the restriction of the main field (GA, games) to the location Argentina.
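A four-letter code such as GAQA can be decoded position by position: the first two letters name the subject field and the last two the locality restriction. Only GA (games) and QA (Argentina) are given by the text; every other table entry and name below is an invented placeholder.

```python
# Illustrative decoder for four-letter subject-field codes.
# GA -> games and QA -> Argentina come from the text; the tables
# would otherwise need filling from the full Longman code list.
FIELDS = {"GA": "games"}
LOCALITIES = {"QA": "Argentina"}

def decode(code):
    """Split a four-letter code into its field and locality restriction."""
    field, locality = code[:2], code[2:]
    return FIELDS.get(field, field), LOCALITIES.get(locality, locality)

print(decode("GAQA"))  # ('games', 'Argentina')
```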
…noun (N01) and adjective (A01) and cross-reference (XX):

C Concrete:
  Q Animate:
    H Human.
    A Animal.
    P Plant.
  I Inanimate:
    S Solid.
    L Liquid.
    G Gas.
E.3 Explanation of POS codes

D0 determiner
N0 noun
P0 pronoun
A0 adjective
I0 preposition
C0 conjunctor
NS plural noun
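The POS codes above are flat two-character labels, while the semantic type codes of E.2 form a small hierarchy (H, A and P are kinds of Q; Q and I are kinds of C). Both lend themselves to table lookup; the parent links below encode only what the lists above state.

```python
POS_CODES = {
    "D0": "determiner", "N0": "noun", "P0": "pronoun", "A0": "adjective",
    "I0": "preposition", "C0": "conjunctor", "NS": "plural noun",
}

# Parent links of the semantic type codes: Human/Animal/Plant under
# Animate, Solid/Liquid/Gas under Inanimate, both under Concrete.
PARENT = {"Q": "C", "I": "C", "H": "Q", "A": "Q", "P": "Q",
          "S": "I", "L": "I", "G": "I"}

def subsumes(general, specific):
    """True if the general type code covers the specific one."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False

print(POS_CODES["NS"])     # 'plural noun'
print(subsumes("C", "H"))  # True: Human is Animate, hence Concrete
print(subsumes("Q", "S"))  # False: Solid is Inanimate
```

A subsumption check of this kind is what lets a selectional restriction stated at a general level (e.g. Animate) accept any of its more specific types.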
[Table of codes]
Appendix F
The Longman grammar coding system

[Table: the Longman grammar codes. The letters A, B, C, D, E, F, GC, GU, H, … head the columns; the numbers code what the verb is followed by — 1: one or more nouns or pronouns; 2: the infinitive without to; 3: the to-infinitive; 4: the -ing form; 6: a wh- word; 7: an adjective; 8: a past participle — with a further row for verbs not followed by anything.]
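The number part of a Longman grammar code can be read off from the table, so far as its rows are legible above. The sketch below covers only those recoverable rows; the code `T3` used to demonstrate splitting a letter from its number is a hypothetical example.

```python
# Meanings of the number part of a grammar code, from the table rows
# recoverable above (other numbers omitted).
NUMBER_MEANINGS = {
    1: "followed by one or more nouns or pronouns",
    2: "followed by the infinitive without to",
    3: "followed by the to-infinitive",
    4: "followed by the -ing form",
    6: "followed by a wh- word",
    7: "followed by an adjective",
    8: "followed by a past participle",
}

def describe(code):
    """Split a code like 'T3' into its letter and number description."""
    letter, number = code[0], int(code[1:])
    return letter, NUMBER_MEANINGS.get(number, "(not listed here)")

print(describe("T3"))  # ('T', 'followed by the to-infinitive')
```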
Bibliography

Aarts F, Aarts J (1982) English Syntactic Structures: Functions and Categories in Sentence Analysis. Pergamon Press Ltd, Oxford.
Aarts J M G, Calbert J P (1979) Metaphor and Non-Metaphor: The Semantics of Adjective-Noun Combinations. Niemeyer Verlag, Tübingen.
Aarts J, van den Heuvel T (1985) Computational tools for the syntactic analysis of corpora. Linguistics 23: 303–35.
Adda G, Esknazi M, Stern P E (1987) The use of rough spectral features for large vocabulary recognition. Proceedings of the European Conference on Speech Technology, Edinburgh, pp 171–4.
Ahlswede T (1983) A linguistic string grammar of adjective definitions from Webster's Seventh Collegiate Dictionary. Master's thesis, Illinois Institute of Technology, Chicago, Illinois.
Akkerman E, Masereeuw P C, Meijs W J (1985) Designing a Computerized Lexicon for Linguistic Purposes: ASCOT Report No 1. Rodopi, Amsterdam.
Akkerman E, Meijs W J, Voogt-van Zutphen H J (1987) Grammatical tagging in ASCOT. In Meijs W J (ed) Corpus Linguistics and Beyond: Proceedings of the Seventh International ICAME Conference. Rodopi, Amsterdam, pp 181–93.
Akkerman E, Meijs W J, Voogt-van Zutphen H J (1988a, forthcoming) A Computerized Lexicon for Word-Level Tagging: ASCOT Report No 2. Rodopi, Amsterdam.
Akkerman E, Meijs W J, Voogt-van Zutphen H J (1988b, forthcoming) ASCOT: a computerized lexicon with an associated scanning system. In Ihalainen O, Kytö M, Rissanen M (eds) Proceedings of the Eighth International Conference on English Language Research on Computerized Corpora (provisional title). Rodopi, Amsterdam.
Alshawi H (1987) Memory and Context for Language Interpretation. Cambridge University Press, Cambridge.
Amsler R A (1984a) Machine-readable dictionaries. In Williams M (ed) Annual Review of Information Science and Technology, Vol 19. American Society for Information Science, pp 161–223.
Amsler R A (1984b) Lexical knowledge bases. Panel session … Theoretical …, Las Cruces, NM, pp 16–19.
Amsler R A (1987b) How do I turn this book on? Proceedings of the Third Annual Conference of the UW Centre for the New Oxford English Dictionary: The Uses of Large Text Databases, Waterloo, Canada, pp 76–88.
Aull A M, Zue V W (1984) Lexical stress and its application to large vocabulary speech recognition. Paper presented at the 108th meeting of the Acoustical Society of America, Minneapolis, Minnesota.
Bates M, Moser M G, Stallard D (1986) The IRUS transportable natural language database interface. In Kershberg (ed) Expert …
Bobrow R, Webber B (1980) Knowledge representation for syntactic/semantic processing. Proceedings of the First National Conference on Artificial Intelligence.
Boguraev B K (1979) Automatic Resolution of Linguistic Ambiguities. Doctoral thesis, also available as Technical Report No. 11, Computer Laboratory, University of Cambridge.
Boguraev B K, Briscoe E J (1987) Large lexicons for natural language processing: exploiting the grammar coding system of LDOCE. Computational Linguistics 13(3–4).
Boguraev B K, Carter D M, Briscoe E J (1987b) A multi-purpose interface to an on-line dictionary. Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, Copenhagen, Denmark, pp 63–9.
Boguraev B K, Copestake A, Sparck Jones K (1988, forthcoming) Inference in natural language front ends for databases. In Sernadas A (eds) Knowledge and Data (DS-2). North-Holland, Amsterdam.
Boguraev B K, Carroll J, Briscoe E J, Pulman S, Russell G, Ritchie G D, Black A, Grover C (1988, forthcoming) The lexical component of a natural language toolkit. In Walker D, Zampolli A, Calzolari N (eds) …
Briscoe E J (1985) Report of the Dictionary Syndicate. Alvey Speech Club Workshop, Warwick University.
Briscoe E J, Grover C, Boguraev B K, Carroll J (1987) A formalism and environment for the development of a large grammar of English. Proceedings of the Tenth International Joint Conference on Artificial Intelligence, Milan, pp 703–8.
Byrd R J (1983) Word formation in natural language processing systems. Proceedings of the Eighth International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, pp 704–5.
Byrd R J (1988, forthcoming) Dictionary systems for office practice. In Walker D, Zampolli A, Calzolari N (eds) Automating the Lexicon: Research and Practice in a Multilingual Environment. Cambridge University Press, Cambridge.
Byrd R, Calzolari N, Chodorow M, Klavans J, Neff M, Rizk O (1987) Tools and methods for computational lexicology. Report RC 12642, IBM Yorktown Heights.
Byrd R J, Chodorow M (1985) Using an on-line dictionary to find rhyming words and pronunciations for unknown words. Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, pp 277–83.
…Andersson K S B (1986) DAM — a dictionary … corrector. Information Processing and Management 19(2): 101–8.
… Intelligence (IJCAI-87), Milan, Italy, pp 715–17.
Carnap R (1952) Meaning postulates. Philosophical Studies 3: 65–73.
Carter D M (1987) An information-theoretic approach to phonetic dictionary access. Computer Speech and Language 2: 1–11.
Carter D M, Boguraev B K, Briscoe E J (1987) Lexical stress and phonetic information: which segments are most informative. Proceedings of the European Conference on Speech Technology, Edinburgh, pp 235–8.
Cater A (1987) Conceptual primitives and their metaphorical relationships. In Reilly E (ed) Communication Failure in Dialogue. North-Holland, Amsterdam.
Charniak E (1972) Toward a model of children's story comprehension. MIT AI Memo 266, Massachusetts Institute of Technology, Cambridge, MA.
Chodorow M S, Byrd R J, Heidorn G E (1985) Extracting semantic hierarchies from a large on-line dictionary. Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, pp 299–304.
Chomsky N, Halle M (1968) The Sound Pattern of English. Harper and Row, New York.
…and Co. Ltd, Glasgow.
…of a Natural Language Lexicon. Program in Linguistics and Cognitive Science, Brandeis University, Waltham, Massachusetts.
Hayes P J, Mouradian G V (1981) Flexible parsing. American Journal of Computational Linguistics 7(4): 232–42.
Heidorn G, Jensen K, Miller L, Byrd R, Chodorow M (1982) The EPISTLE text-critiquing system. IBM Systems Journal 21(3): 305–26.
Hirst G (1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press, Cambridge.
Hobbs J R (1987) World knowledge and word meaning. Proceedings of the 3rd Workshop on Theoretical Issues in Natural Language Processing (TINLAP-3), Las Cruces, New Mexico, pp 20–5.
Hodgkin A (1987) The uses of large text databases. Proceedings of the Third Annual Conference of the UW Centre for the New Oxford English Dictionary: The Uses of Large Text Databases, Waterloo, Canada, pp 9–16.
Hornby A S (ed) (1980) Oxford Advanced Learner's Dictionary of Current English, 3rd edition (11th impression). Oxford University Press, Oxford.
Huang X (1984) A computational treatment of gapping, right node raising and reduced conjunction. Proceedings of the 10th International Congress on Computational Linguistics (Coling84), Stanford, California, pp 243–6.
Huang X (1985) Machine translation in the SDCG (semantic definite clause grammars) formalism. Proceedings of the Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Colgate University, New York, pp 135–44.
Huttenlocher D P (1985) Exploiting sequential phonetic constraints in recognizing spoken words. Massachusetts Institute of Technology, Artificial Intelligence Laboratory, AI Memo 867.
Huttenlocher D P, Zue V W (1983) Phonotactic and lexical constraints in speech recognition. Proceedings of the National Conference on Artificial Intelligence (AAAI-83), Washington, DC, pp 172–6.
Ingria R (1984) Complement types in English. Report No. 5684, Bolt Beranek and Newman Inc, Cambridge, Massachusetts.
Ingria R (1988, forthcoming) Lexical information for parsing systems: points of convergence and divergence. In Walker D, Zampolli A, Calzolari N (eds) Automating the Lexicon: Research and Practice …
Johansson S, Atwell E, Garside R, Leech G (1986) The Tagged LOB Corpus: Users' Manual. Norwegian Computing Centre for the Humanities, Bergen.
Johnson M (1985) Computer aids for comparative dictionaries. Linguistics 23(2): 285–302.
Johnson T (1985) Natural Language Computing: the Commercial Applications. Ovum Press Ltd, London.
de Jong J, Masereeuw P (1987) PARSCOT: a new implementation of the LSP grammar. In Meijs W (ed) Corpus Linguistics and Beyond. Rodopi, Amsterdam, pp 195–206.
Kaplan R, Bresnan J (1982) Lexical-functional grammar: a formal system for grammatical representation. In Bresnan J (ed) The Mental Representation of Grammatical Relations. The MIT Press, Cambridge, Massachusetts, pp 173–281.
Katz J J, Fodor J A (1963) The structure of a semantic theory. Language 39: 170–210.
Kay M (1984a) Functional unification grammar: a formalism for machine translation. Proceedings of the 10th International Congress on Computational Linguistics (Coling84), Stanford, California.
Kay M (1984b) The dictionary server (panel on Machine-Readable Dictionaries). Proceedings of the 10th International Congress on Computational Linguistics (Coling84), Stanford, California, p 461.
Kay M, Kaplan R (1981) Phonological Rules and Finite State Transducers. Paper presented at the Annual Meeting of the Association for Computational Linguistics, New York.
Kazman R (1986) Structuring the text of the Oxford English Dictionary through finite state transduction. Technical Report TR-B-20, Department of Computer Science, University of Waterloo, Waterloo, Ontario.
Kegl J (1987) The boundary between word knowledge and world knowledge. Proceedings of the 3rd Workshop on Theoretical Issues in Natural Language Processing (TINLAP-3), Las Cruces, New Mexico, pp 26–31.
Lehrer A (1974) Semantic Fields and Lexical Structure. North-Holland, Amsterdam.
Lemmens M, Wekker H (1986) Grammar in English Learners' Dictionaries. Niemeyer Verlag, Tübingen (Lexicographica, Series Maior 16).
Lenat D, Prakash M, Shepherd M (1986) CYC: using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine 6(4): 65-92.
Lesk M (1986a) Information in Data: Using the Oxford English Dictionary on a Computer. Summary of a conference on Information in Data held in the Centre for the New OED, University of Waterloo, in November 1985 (also in ACM SIGIR Forum 20(1-2)).
Lesk M (1986b) Why I Want the OED on My Computer, When I'm Likely to Have It. SIGCUE Newsletter, 2nd Quarter.
Levin B (1985) Lexical semantics in review: an introduction. In Levin B (ed) Lexical Semantics in Review. Lexicon Working Papers 1, Massachusetts Institute of Technology, pp 1-62.
Levin B (1988, forthcoming) Approaches to lexical semantic representation. In Walker D, Zampolli A, Calzolari N (eds) Automating the Lexicon: Research and Practice in a Multilingual Environment. Cambridge University Press, Cambridge.
…COLA INCURVO TERRAM DIMOVIT ARATRO. First stage translation into English with the aid of Roget's Thesaurus. Report (ML84) ML92, Cambridge Language Research Unit, Cambridge.
Meijs W (1985) Linguistically Useable Meaning Characterisations in the Lexicon (LINKS). Project Description, English Department, University of Amsterdam, Amsterdam.
Meijs W (1986a) Links in the lexicon: the dictionary as a corpus. Icame News 10: 26-78.
Meijs W (1986b) Lexical organisation from three different angles. Journal of the Association of Literary and Linguistic Computing 13(1).
Meijs W (1988a, forthcoming) Spreading the word: knowledge activation in a functional perspective. In Connolly J, Dik S (eds) Functional Grammar and the Computer. Foris, Dordrecht.
Meijs W (1988b, forthcoming) Morphology in the dictionary, with special reference to LDOCE. In Lachlan Mackenzie J, Todd R (eds) Festschrift für Hans Heinrich Meier (provisional title). Free University Press, Amsterdam.
Mellish C S (1985) Computer Interpretation of Natural Language Descriptions. Ellis Horwood, Chichester.
Michiels A (1983) Automatic analysis of texts. Proceedings of the conference (Informatics 7) held by the Aslib Informatics Group and the Information Retrieval Group of the British Computer Society,
Pollock J (1982) Spelling error detection and correction by computer: some notes and bibliography. Journal of Documentation 38(4): 282-91.
Rosenbaum P S (1967) The Grammar of English Predicate Complement Constructions. MIT Press, Cambridge, Massachusetts.
Rumelhart D E, Ortony A (1977) The representation of knowledge in memory. In Anderson R C, Spiro R J, Montague W E (eds) Schooling and the Acquisition of Knowledge. Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Shortliffe E H (1976) Computer-Based Medical Consultations: MYCIN. Elsevier Science Publishers B.V., North-Holland, Amsterdam.
Simmons R (1973) Semantic networks: their computation and use for understanding English sentences. In Schank R C, Colby K M (eds) Computer Models of Thought and Language. W.H. Freeman, San Francisco.
Sparck Jones K (1964) Synonymy and Semantic Classification. Doctoral thesis, University of Cambridge (also published in Edinburgh Information Technology Series (EDITS), Michaelson S and Wilks Y (eds), Edinburgh University Press, Edinburgh, Scotland, 1986).
Sparck Jones K (1967) Dictionary Circles. Report SP-3304, System Development Corporation, Santa Monica, California.
van der Steen G J (1982) A treatment of queries in large text corpora. In Johansson S (ed) Computer Corpora in English Language Research. Norwegian Computing Centre for the Humanities, Bergen, pp 49-65.
Stockwell R P, Schachter P, Partee B H (1973) The Major Syntactic Structures of English. Holt, Rinehart and Winston, New York.
Streeter L A (1978) The acoustic determination of phrase boundary perception. Journal of the Acoustical Society of America 64(6): 1582.
Stubbs J (1986) A Database-in-Waiting: The OED Becomes the New OED. Paper presented to the conference on Computers and the Humanities, University of Toronto, Toronto.
Stubbs J, Tompa F (1984) Waterloo and the New Oxford English Dictionary Project. Paper presented to the Twentieth Annual Conference on Editorial Problems, University of Toronto, Toronto.
Thompson H (1983) Natural language processing: a critical analysis of the structure of the field, with some implications for parsing. In Sparck Jones K, Wilks Y (eds) Automatic Natural Language Parsing. Ellis Horwood, Chichester, pp 22-31.
Thorndike E L, Lorge I (1944) The Teacher's Word Book of 30 000 Words. Teachers College Press, Teachers College, Columbia University, New York.
Tompa F (1986) Database design for a dictionary of the future. Preliminary report, Centre for the New Oxford English Dictionary, University of Waterloo, Waterloo, Ontario.
Tubach J-P, Boe L J (1986) Quantitative knowledge on word structure, from a phonetic corpus, with application to large vocabularies recognition systems. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Tokyo, Japan, pp 614.
Tucker A, Nirenburg S (1984) Machine translation. In Williams M (ed) Annual Review of Information Science and Technology (ARIST), vol. 19. American Society for Information Science.
…Department of Linguistics, University of Amsterdam, Amsterdam.
Voogt-van Zutphen H J (1988, forthcoming) Towards a lexicon of functional grammar. In Connolly J, Dik S (eds) Functional Grammar and the Computer. Foris, Dordrecht.
Vossen P, den Broeder M, Meijs W J (1988, forthcoming) The LINKS project: building a semantic database for linguistic applications. Proceedings 8th Icame Conference, Helsinki.
W7 (1967) Webster's Seventh New Collegiate Dictionary. G. & C. Merriam Company, Springfield, Massachusetts.
Walker D, Amsler R (1986) The use of machine-readable dictionaries in sublanguage analysis. In Grishman R, Kittredge R (eds) Analyzing Language in Restricted Domains. Lawrence Erlbaum Associates, Hillsdale, New Jersey, pp 69-83.
Walker D, Zampolli A, Calzolari N (eds) (1988) Automating the Lexicon: Research and Practice in a Multilingual Environment. Cambridge University Press, Cambridge.
Waltz D (1983) Artificial Intelligence: an assessment of the state-of-the-art and recommendation for future directions. The AI Magazine 4(3): 55-67.
Waltz D L, Pollack J B (1985) Massively parallel parsing: a strongly interactive model of natural language interpretation. Cognitive Science 9: 51-74.
Warren B (1978) Semantic Patterns of Noun-Noun Compounds. Gothenburg Studies in English 41, Göteborg: Acta Universitatis Gothoburgensis.
Weischedel R M, Black J E (1980) Responding intelligently to unparsable inputs. American Journal of Computational Linguistics 6(2): 97-109.
Whitelock P, Wood M, Somers H, Johnson R, Bennett P (eds) (1987) Linguistic Theory and Computer Applications. Academic Press, New York.
Wiederhold G (1983) Database Design. McGraw-Hill, New York.
Wilensky R, Arens Y (1980) PHRAN: a phrasal natural language understander. Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp 117-21.
Wilks Y A (1973) An artificial intelligence approach to machine translation. In Schank R C, Colby K M (eds) Computer Models of Thought and Language. W.H. Freeman, San Francisco.
Winograd T (1983) Language as a Cognitive Process. Addison-Wesley, Reading, Massachusetts.
Author index
Warren B, 31
Way E, 17
Webber B, 28, 155
Weischedel R, 164
Wekker H, 82
Whitelock P, 10
Wiederhold G, 50
Wilensky R, 163, 168
Wilks Y, 27, 154, 163, 193, 195, 218-19, 225
Williams E, 96
Winograd T, 8, 39, 193, 223
Wong D, 223
Woods W, 3
Yannakoudakis E, 21, 33
Zamora A, 21
Zampolli A, 2
Zue V, 25, 136, 146