




Lexis in Contrast

Studies in Corpus Linguistics

Studies in Corpus Linguistics aims to provide insights into the way a corpus can
be used, the type of findings that can be obtained, the possible applications of
these findings as well as the theoretical changes that corpus work can bring into
linguistics and language engineering. The main concern of SCL is to present
findings based on, or related to, the cumulative effect of naturally occurring
language and on the interpretation of frequency and distributional data.

General Editor
Elena Tognini-Bonelli

Consulting Editor
Wolfgang Teubert

Advisory Board
Michael Barlow, Rice University, Houston
Robert de Beaugrande, UAE
Douglas Biber, Northern Arizona University
Chris Butler, University of Wales, Swansea
Wallace Chafe, University of California
Stig Johansson, Oslo University
M. A. K. Halliday, University of Sydney
Graeme Kennedy, Victoria University of Wellington
John Laffling, Heriot-Watt University, Edinburgh
Geoffrey Leech, University of Lancaster
John Sinclair, University of Birmingham
Piet van Sterkenburg, Institute for Dutch Lexicology, Leiden
Michael Stubbs, University of Trier
Jan Svartvik, University of Lund
H-Z. Yang, Jiao Tong University, Shanghai
Antonio Zampolli, University of Pisa

Volume 7
Lexis in Contrast: Corpus-based approaches
Edited by Bengt Altenberg and Sylviane Granger

Lexis in Contrast
Corpus-based approaches

Edited by
Bengt Altenberg
University of Lund

Sylviane Granger
Université Catholique de Louvain

John Benjamins Publishing Company



The paper used in this publication meets the minimum requirements of American
National Standard for Information Sciences - Permanence of Paper for Printed
Library Materials, ANSI Z39.48-1984.

Cover design: Françoise Berserik

Cover illustration from original painting Random Order
by Lorenzo Pezzatini, Florence, 1996.

Library of Congress Cataloging-in-Publication Data

Lexis in contrast : corpus-based approaches / edited by Bengt Altenberg, Sylviane Granger.
p. cm. (Studies in Corpus Linguistics, ISSN 1388-0373 ; v. 7)
Includes bibliographical references and index.
1. Lexicology--Data processing. 2. Contrastive linguistics--Data processing. 3.
Lexicography--Data processing. 4. Translating and interpreting--Data processing. I.
Altenberg, Bengt. II. Granger, Sylviane, 1951- III. Series.
P326.5.D38 LA495 2002
ISBN 90 272 2277 0 (Eur.) / 1 58811 090 7 (US) (Hb; alk. paper)


© 2002 John Benjamins B.V.

No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any
other means, without written permission from the publisher.
John Benjamins Publishing Co., P.O. Box 36224, 1020 ME Amsterdam, The Netherlands
John Benjamins North America, P.O. Box 27519, Philadelphia PA 19118-0519, USA

Table of contents



List of contributors


Recent trends in cross-linguistic lexical studies
Bengt Altenberg and Sylviane Granger


Cross-Linguistic Equivalence

Two types of translation equivalence
Raphael Salkie

Functionally complete units of meaning across English and Italian: Towards a corpus-driven approach
Elena Tognini Bonelli

Causative constructions in English and Swedish: A corpus-based contrastive study
Bengt Altenberg

Contrastive Lexical Semantics

Polysemy and disambiguation cues across languages: The case of Swedish få and English get
Åke Viberg

A cognitive approach to Up/Down metaphors in English and Shang/Xia metaphors in Chinese
Lan Chun

From figures of speech to lexical units: An English-French contrastive approach to hypallage and metonymy
Michel Paillard

Corpus-based Bilingual Lexicography

The role of parallel corpora in translation and multilingual lexicography
Wolfgang Teubert

Bilingual lexicography, overlapping polysemy, and corpus use
Victòria Alsina and Janet DeCesaris

Computerised set expression dictionaries: Analysis and design
Sylviane Cardey and Peter Greenfield

Making a workable glossary out of a specialised corpus: Term extraction and expert knowledge
Christine Chodkiewicz, Didier Bourigault and John Humbley

Translation and Parallel Concordancing

Translation alignment and lexical correspondences: A methodological reflection
Olivier Kraif

The use of electronic corpora and lexical frequency data in solving translation problems
François Maniez

A computer tool for cross-linguistic research
Patrick Corness

General index

Author index


Preface

Most of the articles in this volume represent a selection of papers presented at
the "Contrastive Linguistics and Translation Studies. Empirical Approaches"
conference organised by Sylviane Granger at the Catholic University of
Louvain in February 1999. All the contributions have been revised to fit the
special theme of the volume. In addition, two contributions have been added to
the original selection of papers: the introductory survey by Bengt Altenberg
and Sylviane Granger and Wolfgang Teubert's article on the importance of
translations in cross-linguistic lexical research.
The contributions reflect three striking tendencies that emerged during
the conference. One is the rapidly growing interest in corpus-based approaches
to the study of lexis, in particular the use of multilingual corpora, shared by
researchers working in widely differing fields - contrastive linguistics, lexicology, lexicography, terminology, computational linguistics, machine translation
and other branches of natural language processing.
The second tendency finds its expression in the wealth of methodological
approaches represented at the conference, especially as regards the kinds of corpora used and the ways in which multilingual lexical information can be
extracted from corpora and exploited for various purposes. This methodological diversity reflects to some extent the types of monolingual and multilingual
corpora available at the time of the conference, but it is above all a healthy and
promising sign of the vitality and desire for reorientation in a number of related
fields where not only the object of research (lexis) but also the methodology (the
use of corpora) are rapidly expanding and demanding increasing attention.
However, no matter what the purpose of the individual contributions may
be, whether theoretical or practical, the driving force that unites them all is easily recognisable as the third and perhaps most fundamental tendency to
have emerged from the conference: a common desire to give the cross-linguistic study of lexis a firm empirical foundation.
We have divided the articles into four main groups reflecting what we
regard as some major concerns and aspects of the field: the exploration of
cross-linguistic equivalence, contrastive lexical semantics, corpus-based multilingual lexicography, and translation and parallel concordancing.
The conference brought together researchers from a wide range of countries and this is reflected in the diversity of the languages covered in the articles:
English, Catalan, Chinese, Czech, Finnish, French, German, Italian, Lithuanian,
Spanish and Swedish.
In preparing this volume we have benefited from the generous help of several people. Apart from the contributors themselves, we wish to thank an
anonymous reviewer for many valuable comments and suggestions, Helen
Swallow for her meticulous examination of the manuscript, and Kees Vaes and
Elena Tognini-Bonelli for the confidence they have shown in entrusting us with
the task of editing this volume.
Bengt Altenberg and Sylviane Granger
Lund and Louvain-la-Neuve, Autumn 2001

List of contributors

Victòria Alsina
Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Bengt Altenberg
Department of English, University of Lund
Didier Bourigault
Centre National de la Recherche Scientifique, Equipe de recherche en syntaxe et
sémantique, Université Toulouse-le-Mirail
Sylviane Cardey
Centre de recherche en linguistique Lucien Tesnière, Université de Franche-Comté
Christine Chodkiewicz
Centre National de la Recherche Scientifique, Centre de terminologie et de néologie, Laboratoire de linguistique informatique, Université Paris 13
Janet DeCesaris
Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Peter Greenfield
Centre de recherche en linguistique Lucien Tesnière, Université de Franche-Comté
John Humbley
Centre de terminologie et de néologie, Laboratoire de linguistique informatique,
Université Paris 13
Patrick Corness
School of International Studies and Law, Coventry University
Sylviane Granger
Centre for English Corpus Linguistics, Université Catholique de Louvain
Olivier Kraif
Laboratoire d'Ingénierie Linguistique et de Linguistique Appliquée, Université de
Nice Sophia Antipolis
Lan Chun
Department of English, Beijing Foreign Studies University
François Maniez
Centre de Recherche en Terminologie et Traduction, Département des Langues
Etrangères Appliquées, Université Lumière Lyon II
Michel Paillard
Département d'Études Anglophones, Université de Poitiers
Raphael Salkie
School of Languages, University of Brighton
Wolfgang Teubert
Department of English, University of Birmingham
Elena Tognini Bonelli
Università degli Studi di Lecce and The Tuscan Word Centre
Åke Viberg
Department of Linguistics, University of Lund



Recent trends in cross-linguistic lexical studies
Bengt Altenberg and Sylviane Granger

1. Lexis and contrastive linguistics

1.1 Lexis: an expanding universe

The days are long gone when lexis was thought of as an unruly chaos, a
prison, to use Di Sciullo & Williams' (1987: 3) words, "[which] contains only
the lawless, and [where] the only thing the inmates have in common is their
lawlessness". Following this period of neglect, during which lexis was most definitely the poor relation of grammar and syntax, there has been a radical
restructuring of priorities, and the lexicon now features high on the agenda, in
both theoretical and applied linguistics. As a result, there is a general trend
towards lexically oriented approaches to language in which what were formerly
regarded as syntactic phenomena have increasingly come to be viewed as projections of lexical properties. This development is noticeable in most branches of
linguistics, formal as well as functional.1 One influential strand of this development is the empiricist movement that is sometimes called British contextualism, most clearly represented by John Sinclair and his colleagues. Sinclair
(1987a) attributes this dramatic turnabout to two concurring factors:
Halliday's model of language and the advent of computers.
In 1966, in an article entitled "Lexis as a Linguistic Level", Halliday called for
recognition of a lexical level alongside the universally recognised grammatical
level. From the start, however, he insisted that lexis was not to be viewed as
totally separate from grammar: "If therefore one speaks of a lexical level, there
is no question of asserting the independence of such a level, whatever this
might mean; what is implied is the internal consistency of the statements and
their referability to a stated model" (1966: 152). Alongside the grammatical
and the lexical levels, there is also a lexico-grammatical level where lexical
restrictions intersect with grammatical ones. The main argument offered by
Halliday in support of a lexical level is the existence of collocations, i.e. combinatory restrictions which are neither grammatical nor semantic but which
reflect the "habitual or customary places" of words (Firth 1957: 12). The
acceptability of strong tea and powerful car and the relative unacceptability of powerful tea and strong car demonstrate the existence of restrictions which depend
on the syntagmatic relations into which words enter. Collocations are essentially based on probabilities, with words having a higher or lower likelihood of
occurring together. But on the whole this probability is extremely low and, as a
result, verification of Halliday's probabilistic approach relies on the existence of
large corpora and computational techniques.2
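
The probabilistic view of collocation sketched above can be made concrete with a small example. The following is a minimal sketch only: it scores adjacent word pairs with pointwise mutual information (PMI), which is one common association measure and is our choice for illustration, not a measure proposed by Halliday or Firth. The function name `pmi_scores` and the toy token list are likewise invented for the purpose.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return the PMI of every adjacent word pair in a token list.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw counts (illustrative assumption, no smoothing).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy data, invented for illustration; far too small for real conclusions.
toy = ("strong tea and a powerful car , strong tea again , "
       "a powerful car and strong coffee").split()
scores = pmi_scores(toy)
```

In this toy sample the pairs strong tea and powerful car receive positive scores, while powerful tea never occurs and so receives no score at all. With a corpus this small the figures mean nothing, which is precisely why the approach depends on large corpora.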
Without the advent of computers the approach to lexis propounded by
Halliday would never have had the tremendous impact it has already had and
continues to have on the field of linguistics. Computers have made it possible
to store ever larger collections of texts in electronic form and to analyse them
using increasingly sophisticated, versatile and user-friendly software tools. But
whereas grammar and semantics involve a high degree of abstraction, and are
therefore relatively difficult to access using computer technology, lexis lends
itself perfectly to the form-based research at which computers excel, whether
those forms be letters, word spaces, punctuation marks or, indeed, words. Take
frequency counts for example: an ideal field of enquiry in which to use computational techniques. For the first time ever, linguists have been able to rely on
non-impressionistic large-scale frequency data. Although the reliability of frequency studies was questioned from a relatively early stage, this did not put an
end to them but, instead, merely prompted corpus linguists to gather bigger
and more tightly controlled corpora.
These two factors have contributed to bringing the study of words to the
forefront of linguistic research, along with a change of name from vocabulary
to lexis. But it is not only the name which has changed. It has become an altogether different phenomenon, in three ways in particular.
First and foremost, lexis and grammar are now seen as interdependent.
This idea, first introduced by Halliday, was further developed by Sinclair, who
criticised the traditional decoupling of lexis and grammar and claimed that it
was "more fruitful to start by supposing that lexical and syntactic choices correlate, than that they vary independently of each other" (1991: 104). This interrelation of grammar and lexis is one of the key features in the new corpus-based
Longman Grammar of Spoken and Written English (Biber et al. 1999), which
gives pride of place to lexico-grammatical associations, both grammatical
associations of lexical words and lexical associations of grammatical structures.
Closely linked with this development is the fact that lexis has now been
firmly placed on the syntagmatic axis. While paradigmatic relations for a long
time dominated lexical studies, the pendulum now seems to have swung in the
opposite direction so that it is now on the analysis of co-occurrence relations
that attention is focused. This new emphasis on "the company words keep", to
use Firth's expression, has led to the discovery of a wide range of word combinations or multi-word units, which vary in fixedness and idiomaticity.
The third major change which has taken place in perceptions about lexis is
that it is now recognised as displaying a much higher degree of stylistic differentiation than had previously been thought. In the case of English, the analysis
of corpora has led to the discovery of a wide range of dialectal differences related to regional provenance (American English, Indian English), age (teenager
English), sex (female lexis), time (Middle English lexis), social class, as well as
diatypic differences in terms of field, mode and tenor (spoken lexis, ESP lexis,
informal lexis).
Lexis has undergone a dramatic transformation and come out less
autonomous, more open to other layers of language, notably grammar, composed of both single words and multi-word units and entering into a complex
network of paradigmatic and syntagmatic relations.
1.2 The revival of Contrastive Linguistics
Like lexicology, contrastive linguistics now also occupies a dominant position
in linguistics, but it has reached this position via a rather different route.
Whereas in the case of lexis, its time had come, contrastive linguistics had
already had its glory days back in the 1960s, before falling into disfavour, principally because of its association with structuralism. What we are now witnessing is thus more of a revival, and a dramatic one at that.
When Contrastive Analysis (CA) emerged as a scholarly discipline in the
decades after World War II, it was regarded mainly as an applied branch of linguistics serving practical pedagogical purposes in foreign and second language
teaching. In accordance with the linguistic climate of the time (structuralism,
early generative grammar), phonology and grammar held centre stage, while
lexis played a subordinate role.3 The high hopes it had raised, namely that similarities and differences between languages could predict, or at least explain, problems in foreign and second language learning and make language teaching
more efficient, were largely thwarted. For a time CA became a suspect field of
study, especially in the United States (on the history and deficiencies of CA, see
Ringbom 1994, Sajavaara 1996, Chesterman 1998). However, in Europe CA
continued to thrive and large contrastive projects were established in the 1970s,
comparing English and other European languages. There, in particular, the
view persisted that CA still had much to offer, not only to language pedagogy,
but also to translation theory, the description of particular languages, language
typology and the study of language universals (on various early approaches, see
Di Pietro 1971, James 1980 and Krzeszowski 1990).
Now CA, or contrastive linguistics (CL) as it is increasingly called, is
again an active and expanding field which generates lively theoretical and
methodological discussion. A large number of research projects, conferences
and journals are devoted to cross-linguistic work of various kinds, especially in
Europe. And lexis, moreover, is very much the focus of attention.
Broadly speaking, there are three main reasons for this, although they are
closely interrelated and difficult to separate. Internationalisation and the gradual integration of Europe have created an increasing demand for multilingual
and cross-cultural competence, for translation, interpreting and foreign language teaching. The importance of accurate and efficient communication
across language boundaries has become a concern not only of linguists and
teachers but of governments, commercial institutions and international organisations. As a result, there has been a rapidly increasing awareness of the need
for large-scale cross-linguistic research.
At the same time, there have been important developments within linguistics. A growing interest in real-life communication has shifted the focus away
from the earlier preoccupation with abstract language (sub)systems and the
reliance on the native speaker's intuition as the main source of linguistic
knowledge in the direction of natural discourse and empirical data as evidence
for linguistic observations. The earlier tendency, fostered by structuralism and
early generative grammar, to regard language as consisting of autonomous systems (with phonology and grammar in the centre) has given way to a more
complex and dynamic view of language which allows greater interaction
between the systems and fuzzier boundaries between them. As mentioned,
lexis has acquired a more central position in several respects: the concept of the
lexical item has expanded and the interdependence between lexical choice and
contextual factors has led to a growing tendency to enrich the lexicon with
information of a grammatical, semantic and pragmatic nature (see e.g. Atkins
et al. 1994). These tendencies have had a profound influence on lexical CL.
A third important reason for the revival of contrastive studies is the computer revolution and the possibility of analysing natural language on the basis
of large text corpora. This has opened up new possibilities of research on the
basis of bilingual or multilingual corpora and experiments in natural language
processing, e.g. in the fields of machine translation, information retrieval and
computational lexicography. Corpora provide empirical data for linguistic theories and practical applications or serve as testing grounds for linguistic and
computational models. The information gained from corpora is both richer
and more reliable than that derived from introspection.
These new developments have brought about a revival of interest in CL. CL
now permeates a number of fields inside and outside linguistics and its impact
has been especially strong in areas concerned with natural language processing,
such as machine translation and computational lexicography. Indeed, the
analysis of individual languages has even been described as forming a part of
CL (Weigand 1998b: vii). The new tendencies have also given rise to increased
cooperation between experts from a number of fields: linguistics, lexicography,
translation, computer science, psychology and cognitive science. Even if the
problems of describing and relating many languages are as formidable as ever,
great advances have been made in identifying and addressing the issues and
there is new hope and great vitality in the field.

2. Multilingual corpora
2.1 Types of corpora
As we have seen, one factor that has influenced the contrastive study of lexis
more than any other is the computer revolution and the development of multilingual corpora. Several types of multilingual corpora need to be distinguished.
Unfortunately, the terminology used to describe the different types is inconsistent and confusing (for some different typologies, see Baker 1995, 1999 and
Hartmann 1996). We shall here use the typology and terms set out in Figure 1
(cf. Johansson 1998:47).

Multilingual corpora
    Comparable corpora
    Translation corpora

Figure 1. Types of multilingual corpora

Depending on the number of languages involved, one distinction that can
be made is that between bilingual and multilingual corpora. To simplify matters, we shall use multilingual as a general inclusive term and only be more
specific when necessary. A more important distinction is that between comparable corpora and translation corpora. Comparable corpora consist of original
texts in each language, matched as far as possible in terms of text type, subject
matter and communicative function. Corpora of this kind can either be
restricted to some specific domain (e.g. genetic engineering, contract law, job
interviews) or be large balanced corpora representing a wide range of text
types. Translation corpora consist of original texts in one language and their
translations into one or several other languages. If the translations go in one
direction only (from language A to language B) they are unidirectional; if they
go in both directions (from language A to language B and from language B to
language A) they are bidirectional. The term parallel corpus is sometimes
used as an umbrella term for both comparable and translation corpora, but it
seems more appropriate for aligned translation corpora, where a unit (paragraph, sentence or phrase) in the original text is linked to the corresponding
unit in the translation (see Section 2.2).4
Each of these types has its advantages and disadvantages (see Aijmer et al.
1996, Teubert 1996, Johansson 1998). Comparable corpora represent natural
language use within the genres they contain and are unaffected by various
translation effects (see below). Domain-specific corpora are especially useful
for terminological studies. If comparability is taken in a broad sense, very
large balanced corpora representing a wide range of genres and text types can
serve as comparable corpora. Since corpus size and large quantities of data are
important factors in contrastive lexical research, they are especially useful in
collocation studies and as control corpora for results derived from translation corpora.
The problem with comparable corpora is, somewhat paradoxically, the
comparability of the data. It is difficult, and in some cases impossible, to know
what to compare, i.e. to relate expressions with comparable meaning and function in the languages compared. Moreover, unlike translation corpora, comparable corpora cannot reveal sets of cross-linguistic equivalents in cases where
one or both languages provide a choice of alternatives (unless these have been
identified in advance). Another problem with comparable corpora is their
functional and stylistic comparability. If the source texts of the corpora are not
selected according to the same principles, any comparison is bound to be
uncertain. For these reasons, the use of comparable corpora is either limited to
restricted domains or to very large balanced corpora where such factors as
topic, register, and communicative function can be controlled.
Translation corpora have the advantage of keeping meaning and function
constant across the compared languages.5 They also make it possible to discover cross-linguistic variants, i.e. alternative ways of rendering a particular meaning or function in the target language. By reversing this process, i.e. starting
from the range of variants discovered in language B and observing how these
are rendered in language A, it is possible to discover paradigms of cross-linguistic correspondences (see Section 5.2).
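
The two-step procedure just described (collect the variants of an item in language B, then look up how each variant is rendered back in language A) can be sketched as follows. This is a minimal illustration under stated assumptions: the correspondence pairs are invented toy data, and the function names `variants` and `paradigm` are hypothetical. A real study would derive the pairs from an aligned translation corpus.

```python
from collections import Counter

# Toy correspondence pairs (language A expression, language B expression).
# Invented for illustration; not taken from any corpus.
pairs = [
    ("drug", "médicament"), ("drug", "drogue"), ("drug", "drogue"),
    ("medicine", "médicament"), ("narcotic", "drogue"),
]

def variants(pairs, item, from_a=True):
    """Count the renderings of `item` found in the other language."""
    counts = Counter()
    for a, b in pairs:
        if from_a and a == item:
            counts[b] += 1
        elif not from_a and b == item:
            counts[a] += 1
    return counts

def paradigm(pairs, item):
    """Map each B-variant of an A-item back to its A-renderings."""
    return {b: variants(pairs, b, from_a=False) for b in variants(pairs, item)}

forward = variants(pairs, "drug")   # variants of "drug" in language B
back = paradigm(pairs, "drug")      # each variant traced back to language A
```

Here `forward` lists the French renderings of drug with their frequencies, and `back` shows which English items compete for each of those renderings. Together they give a small paradigm of correspondences.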
The disadvantage of using translation corpora is that translations tend to
retain traces of the source language (translationese; see e.g. Gellerstam
1986, 1996) or display other general characteristics of translated texts (see
Baker 1993, Schmied and Schäffler 1996). The results based on translation corpora therefore have to be verified on the basis of original text corpora. Another
disadvantage of translation corpora is that they rarely provide a full or balanced representation of the languages compared. By definition they are
restricted to genres and text types that are translated, which tends to confine
them to certain written text types. Moreover, what is translated tends to vary
from one language to another: for reasons of cultural dominance certain text
types may be translated in one direction but not in the other. As a result, translation corpora are seldom large and well balanced, a fact which limits their usefulness for certain types of cross-linguistic studies.
It is obvious from this comparison of the advantages and disadvantages of
the two main types of multilingual corpora that they should be seen as complementary sources of cross-linguistic data. The possibility of combining comparable and translation corpora, thus taking advantage of the specific merits of
both types, has also been recognised in various contrastive projects, e.g. in the
composition of the English-Norwegian Parallel Corpus (see Johansson 1998)
and the English-Swedish Parallel Corpus (see Altenberg and Aijmer 2000) and
in the cross-linguistic methodology advocated by Teubert (1996).
The cross-linguistic insights gained from translation corpora obviously
increase considerably if more than two languages can be compared. One interesting example of a multilingual bidirectional translation corpus involving a
number of languages is the Oslo Multilingual Corpus.6 The basis of this corpus
is the English-Norwegian Parallel Corpus (ENPC) (Johansson 1998), which is
closely linked to similar English-Swedish and English-Finnish translation corpora. By extending the ENPC to include translations between English,
German, Dutch and Portuguese, it will be possible to compare six languages
using English original texts as a starting point.
2.2 Text alignment and search tools
To be maximally useful, translation corpora must be aligned in such a way that a
unit in the original text is linked to the corresponding unit in the translated
text. The linked units can then be displayed together and compared, and parallel concordancers and other multilingual search tools can be applied to the
aligned texts.
Translation corpora can be aligned paragraph by paragraph or, more commonly, sentence by sentence, but experiments are also being made to align
translation corpora at phrase and word level.7 Automatic sentence-level alignment, which was first developed for the French and English versions of the
Canadian Hansard (see e.g. Brown et al. 1991, Gale and Church 1991), is normally based on statistical matching of features that link corresponding sentences in the source and target texts, such as sentence length (in terms of words
or characters), typographical features (e.g. initial capitals, punctuation marks)
and cognate words, but there are also programs that make use of a combination
of statistical feature matching and a bilingual lexicon of unambiguous equivalents in the languages involved (see Hofland 1996, Hofland and Johansson
1998).8 The main obstacle to automatic sentence alignment is represented by
cases where a sentence in the original text has been divided into two (or more)
sentences in the translation or, conversely, where two (or more) sentences in
the original text have been combined into one in the translation. Sentence-level
alignment programs generally achieve a high degree of accuracy, but the result
has to be checked and corrected manually. Multilingual alignment, i.e. alignment of a source text and its translations into several languages, has also been
carried out with good results (see e.g. Hofland and Johansson 1998:98f.).
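
The idea of matching sentences by length, and the difficulty caused by split or merged sentences, can be illustrated with a short sketch. It is a deliberately simplified stand-in, not the probabilistic model of Gale and Church or Brown et al.: the cost of a grouping is just the absolute difference in character length, the penalty value of 5 for groupings other than one-to-one is an arbitrary assumption, and the function name `align_by_length` and the example sentences are invented.

```python
def align_by_length(src, tgt):
    """Align two sentence lists using character length only.

    Dynamic programming over 1-1, 1-2 and 2-1 groupings ("beads").
    Cost of a bead = absolute difference in total character length,
    plus a small penalty for beads that are not 1-1 (assumed value).
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    moves = [(1, 1, 0), (1, 2, 5), (2, 1, 5)]  # (src step, tgt step, penalty)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment with 1-1, 1-2 and 2-1 beads")
    # Walk back from the end to recover the chosen beads in order.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return beads[::-1]

# Invented example: the second source sentence is split in the translation.
src = ["It was late.", "He went home and he slept for a long time."]
tgt = ["Il était tard.", "Il est rentré chez lui.", "Il a dormi longtemps."]
beads = align_by_length(src, tgt)
```

In this example the second English sentence is linked to two French sentences, the kind of one-to-two correspondence identified above as the main obstacle to automatic alignment. Even with a more refined cost model, the output would still need the manual checking mentioned in the text.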
Efforts have also been made to align parallel texts at word or phrase level
(see e.g. Church and Gale 1991, Kay and Röscheisen 1993, Merkel 1999:113ff.).
This is a much more difficult task than sentence alignment, since a given word in
the source text may be rendered by many translation equivalents and structural
paraphrases, and sometimes none at all. Word alignment programs must therefore rely heavily on bilingual lexicons, contextual pattern matching and sophisticated statistical techniques. Since perfect word alignment is difficult to
achieve, most text alignment programs used today are sentence-based. A survey
of various alignment techniques and an examination of two major problems
confronting word alignment, viz. the lack of isomorphism of lexical units across
languages and the semantic discrepancy between source and target expressions
that is often found in translation corpora, is presented by Kraif in this volume.
Text alignment is a prerequisite for parallel concordancers and other multilingual tools. These vary in approach and degree of sophistication. Here we
shall distinguish two main types: (1) parallel concordancers and search tools
(browsers) which operate on previously aligned corpora and which identify
and present a search word (or expression) in its context together with the corresponding aligned unit in the other language, and (2) word-based concordancers pairing lines of text on the basis of computed word correspondences in
the compared languages.
In the first type the user selects a search item in L1 or L2 as input and either
(a) leaves the equivalents in the other language open, or (b) pre-selects one or
several potential equivalents in the other language. In the former case the program presents all the aligned sentence pairs containing the search item in one
of the languages and it is up to the user to identify any relevant equivalents in
the aligned output. This is illustrated in the following example, which shows a
small sample of a search for drug(s) (in bold) in the sentence-aligned English-French Canadian Hansard corpus using the web-based TransSearch interface.9

Police have to comfort and question the victims of murderers, rapists, armed
bandits, drug dealers
Les policiers doivent réconforter et interroger les victimes de meurtriers, de
violeurs, de bandits armés et de trafiquants de drogue

It means that cheaper generic drugs will not be available to them.
Cela veut dire qu'ils ne pourront plus obtenir de médicaments génériques bon

Each time they stop a car they never know whether the driver is armed, on
drugs, a hood or an upstanding member of the community.
Chaque fois qu'il arrête une voiture, il ne sait jamais si le conducteur est armé,
drogué, ou s'il s'agit d'un truand ou d'un membre respecté de la collectivité.

Many young people feel either rejected or marginalized in society which creates
additional problems of crime and drug and alcohol abuse.
Dans notre société, bien des jeunes se sentent rejetés ou marginalisés, ce qui
occasionne d'autres problèmes de criminalité, de toxicomanie et d'alcoolisme.

If a pre-selected equivalent of the search item is specified, the program only
presents the aligned sentence pairs that contain the search item and the pre-selected equivalent in the other language. This is illustrated in the following
example, which shows a small sample in KWIC format from a TransSearch
bilingual query for drug(s) translated either as médicament(s) or drogue(s).
...withdrawal of Bill C-91 which gives brand name drugs a 20-year market monopoly
... du projet de loi C-91 qui donne aux fabricants de médicaments brevetés un monopole de 20 ans
... the Canadian people access to information as to drug safety and efficacy.
... l'information sur l'innocuité et l'efficacité des médicaments
A lot of the drugs that come into this country....
Bon nombre de drogues introduites dans notre pays...
...who were lured into prostitution, hooked on drugs and exploited...
...dans la prostitution, rendues dépendantes de la drogue et exploitées...
...organized crime hides the profits of the drug trade, international smuggling,...
...camoufler les profits du commerce de la drogue, de la contrebande internationale....
...moving to coastal communities if the drug trade continues the way it has.
...dans les localités côtières, au train où va le trafic de drogues.

These bilingual concordances yield a wealth of information, notably on the most
frequent multiword units (drug abuse/dealers/cartels/smugglers/barons/trafficking/
trade) and their equivalents in the other language. In the case of drogue and drug
they are the ideal starting point from which to uncover the rules governing the
choice between the singular and plural form in the two languages.10
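The two query modes described above (open equivalents versus pre-selected equivalents) can be illustrated with a minimal Python sketch. It is not a description of how TransSearch itself works: the function name bilingual_search, the representation of the corpus as a list of (English, French) tuples and the regular expressions are our own assumptions, and the three sample pairs are abridged from the examples above.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical representation: one (English, French) tuple per aligned pair.
AlignedPair = Tuple[str, str]

SAMPLE: List[AlignedPair] = [
    ("A lot of the drugs that come into this country",
     "Bon nombre de drogues introduites dans notre pays"),
    ("the Canadian people access to information as to drug safety and efficacy.",
     "l'information sur l'innocuité et l'efficacité des médicaments"),
    ("additional problems of crime and drug and alcohol abuse.",
     "d'autres problèmes de criminalité, de toxicomanie et d'alcoolisme."),
]


def bilingual_search(pairs: List[AlignedPair],
                     source_pattern: str,
                     target_pattern: Optional[str] = None) -> List[AlignedPair]:
    """Return the aligned pairs whose source side matches source_pattern.

    Without target_pattern, every pair containing the search item is
    returned and the user has to spot the equivalents (open mode).
    With target_pattern, only pairs whose target side also matches are
    kept (pre-selected equivalent mode).
    """
    source_re = re.compile(source_pattern, re.IGNORECASE)
    target_re = re.compile(target_pattern, re.IGNORECASE) if target_pattern else None
    hits = []
    for source, target in pairs:
        if not source_re.search(source):
            continue
        if target_re is not None and not target_re.search(target):
            continue
        hits.append((source, target))
    return hits


# Open query: all pairs containing drug or drugs.
open_hits = bilingual_search(SAMPLE, r"\bdrugs?\b")

# Restricted query: drug(s) rendered as médicament(s) or drogue(s).
restricted_hits = bilingual_search(
    SAMPLE, r"\bdrugs?\b", r"\b(médicaments?|drogues?)\b")
```

With these three pairs the open query returns all of them, including the pair in which drug is rendered by toxicomanie, whereas the restricted query returns only the two pairs containing drogues or médicaments. The sketch thus shows in miniature why the open mode is better suited to discovering unexpected equivalents.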
The sophistication of sentence-based concordancers or browsers varies, but
most programs allow the user to choose which of the languages he wishes to
regard as the source language (L1) and which as the target language (L2), to use
wildcards, and to restrict the search by means of various contextual conditions
or word-class tags (if the corpus is tagged for word-class). Some examples of
various types of (paragraph-based or sentence-based) multilingual browsers are
ParaConc (Barlow 1995), Multiconcord (Woolls 1998), the Translation Corpus
Explorer (Ebeling 1998) and the Pedant Bilingual Concordance (Ridings 1998). A
detailed demonstration of how Microsoft Word can be used to align source texts
and translations and be combined effectively with a mark-up program and the
parallel concordancer Multiconcord is given by Corness in this volume.
Word-based concordance programs are closely related to word alignment
and are consequently more problematic. This type makes use of a statistical
matching technique which creates an index indicating which words in L1 tend
to correspond to which words in L2. It takes just one search word as input and
uses the pre-computed index of word correspondences to align concordance
lines in L1 with their translations in L2 (see e.g. Church and Gale 1991).
Obviously, this is a complicated statistical task and the outcome depends on
the efficiency of the index and on the closeness of the translation. Parallel concordance programs of this type are still in an experimental stage, and the most
robust and immediately useful multilingual search tools available today are
therefore concordancers and browsers of the first type.
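The idea of a pre-computed index of word correspondences can be illustrated with a deliberately simple Python sketch. We assume a sentence-aligned corpus and use the Dice coefficient over sentence-level co-occurrence as the association measure; this is our own choice for the illustration and should not be taken as the measure used in the work cited above. The toy sentence pairs and all function names are invented.

```python
import re
from collections import Counter
from itertools import product
from typing import Dict, List, Optional, Tuple

# Invented toy corpus of (English, French) aligned segments.
TOY_PAIRS = [
    ("the drug trade", "le commerce de la drogue"),
    ("the drug problem", "le problème de la drogue"),
    ("the wool trade", "le commerce de la laine"),
]


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pairs) -> Dict[str, List[Tuple[str, float]]]:
    """Map each source word to target words ranked by Dice coefficient."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(tokenize(src)), set(tokenize(tgt))
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    index: Dict[str, List[Tuple[str, float]]] = {}
    for (s, t), n in joint.items():
        dice = 2 * n / (src_freq[s] + tgt_freq[t])
        index.setdefault(s, []).append((t, dice))
    for candidates in index.values():
        # Highest score first; ties are broken alphabetically.
        candidates.sort(key=lambda item: (-item[1], item[0]))
    return index


def word_concordance(pairs, index, word: str):
    """Pair each source line containing `word` with its target line and
    the top-ranked counterpart (None if it is absent from that line)."""
    candidates = index.get(word.lower())
    if not candidates:
        return []
    best = candidates[0][0]
    lines: List[Tuple[str, str, Optional[str]]] = []
    for src, tgt in pairs:
        if word.lower() in tokenize(src):
            lines.append((src, tgt, best if best in tokenize(tgt) else None))
    return lines


index = build_index(TOY_PAIRS)
drug_lines = word_concordance(TOY_PAIRS, index, "drug")
```

On this toy data drug is paired with drogue and trade with commerce. Function words are less well behaved: the receives the same top score for le, de and la, which hints at why the outcome of such programs depends so heavily on the quality of the index and on the closeness of the translation.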
Even if fully automatic and accurate word alignment and word-based concordancing programs may be a utopian goal, there is no doubt that multilingual research tools, however constructed, are extremely useful instruments for
anyone concerned with lexical CL, for theoretical as well as practical purposes.
By allowing the user to compare an L1 keyword in its context with its counterpart in another language they make it possible to arrive at empirically founded,
richer and much more delicate descriptions of translation equivalents. This is
also amply demonstrated in the studies in the present volume, many of which
depend, implicitly or explicitly, on various kinds of alignment and parallel concordance techniques.
Some uses of multilingual corpora
Multilingual text corpora can be used for a variety of purposes in contrastive lexical studies. Their main uses can be summarised as follows (cf. Johansson 1998):

- they offer a firm empirical basis for cross-linguistic lexical studies, providing richer and more reliable information about the degree of correspondence between lexical items in different languages than comparisons based on introspection;
- they give new insights into the lexis of the languages compared, insights that are likely to be missed in studies of monolingual corpora;
- they can be used for a range of comparative purposes and increase our knowledge of language-specific, typological and cultural differences, as well as of universal features;
- they can be used to study lexical systems as well as the contextual use of lexical items, and thus provide information about paradigmatic as well as syntagmatic lexical relations;
- they can serve to disambiguate polysemous items, reveal the degree of mutual correspondence of lexical items in different languages, and uncover cross-linguistic sets of translation equivalents in the languages compared;
- they are of theoretical as well as practical importance: theoretically, they provide input data for lexical models and serve as testbeds for lexical theories and hypotheses; practically, they are essential for applications in a number of fields, such as multilingual lexicography and terminology, natural language processing, machine-assisted translation, translator training, information retrieval, and language teaching;
- they illuminate lexical differences between original texts and translations and can be used for studies of individual translation problems and strategies, as well as of language-related and universal translation effects.

In the following sections we shall give a brief survey of some of these uses and
indicate some major tendencies in corpus-based contrastive studies of lexis in
the last decade. The emphasis will be on theoretical and methodological
approaches to the study of lexis, but we shall also touch briefly on some developments in multilingual lexicography (Section 6) and machine-assisted translation (Section 7).

3. Theoretical and methodological issues

3.1 Some contrastive approaches
Traditionally, CL has been described as involving three methodological steps:
description, juxtaposition and comparison (see e.g. Krzeszowski 1990:35). The
description includes selection of the items to be compared and a preliminary
characterisation of these in terms of some language-independent theoretical
model. The juxtaposition involves a search for, and identification of, cross-linguistic equivalents. In the comparison proper the degree and type of correspondence between the compared items are specified.


Modern lexical CL often follows this procedure, but a characteristic feature
of recent corpus-based contrastive work is the great variety of approaches
employed. This is largely due to the expansion of the field and the new research
possibilities that multilingual corpora and search tools offer. The methodology
chosen and the delicacy of the analysis depend to a large extent on the purpose
of the analysis, e.g. whether it is primarily theoretical (focusing on a contrastive description of the languages involved) or practical (intended to serve
the needs of a particular application). This in turn may determine the role that
the corpus is allowed to play in the analysis. One distinction that is sometimes
made in corpus linguistics, and which is also applicable to CL, is that between
corpus-based and corpus-driven approaches (see e.g. Francis 1993 and
Tognini-Bonelli 2001 and in this volume). The former may involve any work,
theory-driven or data-driven, that makes use of a corpus for language
description, but it is also used in a restricted sense to refer to studies which start
from a model postulating a cross-linguistic difference or similarity on theoretical grounds and use a multilingual corpus to confirm, refute or enrich the theory. The latter approach, on the other hand, may start from an implicit or loosely formulated assumption but uses the corpus primarily to discover types and
degrees of cross-linguistic correspondence and to arrive at theoretical statements. In practice, however, the distinction may be slight. The difference lies
rather in the importance attached to the initial assumptions and the role that
the data play in the analysis. Here we shall use the term corpus-based as an
umbrella term covering both types of corpus-informed studies.
In the following sections we shall briefly examine some of the theoretical
and methodological issues involved and how these have been approached in
some recent corpus-based contrastive studies of lexis.
3.2 Tertium comparationis and translation equivalence
Any cross-linguistic comparison presupposes that the compared items are in
some sense similar or comparable. That is, to be able to say that certain categories in two languages are similar or different it is necessary that they have
some common ground, or tertium comparationis. For lexis it is obvious that the
compared items should express the same thing, i.e. have the same (or at least
similar) meaning and pragmatic function (see James 1980: 90f.). However,
what exactly this thing is is not always obvious, and the problem of identifying
a tertium comparationis in CL has been discussed a great deal in the past (see
e.g. James 1980:169ff., Krzeszowski 1990, and Chesterman 1998:27ff.).

Krzeszowski (1990: 23f.) has distinguished seven types of equivalence: statistical equivalence, translation equivalence, system equivalence, semantico-syntactic equivalence, rule equivalence, substantive equivalence and pragmatic
equivalence. However, although there is something to say for this taxonomic
approach, it seems that the only way we can be sure that we are comparing like
with like is to rely on translation equivalence (see James 1980: 178).
Chesterman (1998: 37ff.) develops this in the following way. Any notion of
equivalence is a matter of judgement. Similarly, cross-linguistic equivalence is
not absolute, but a matter of judgement or, more precisely, translation competence. On this view, estimations of any kind of equivalence that involves
meaning must be based on translation competence, precisely because such
estimations require the ability to move between utterances in different languages. Translation competence, after all, involves the ability to relate two
things (ibid.: 39).
The fact that equivalence is a relative concept also has another consequence. It is not realistic to proceed from a tertium comparationis that is based
on identity of meaning. For one thing, this would be putting the cart before
the horse and we would run the risk of methodological circularity: the result of
the contrastive analysis would be no more than the initial assumption (cf.
Krzeszowski 1990: 20). For another, the area we want to explore is often fuzzy
and impossible to define satisfactorily (e.g. epistemic modality or pragmatic
particles). In such cases we cannot start from a tertium comparationis that is
founded on equivalence in a strict sense (identity of meaning). Instead, what
we have to do, and what we generally do, is to start from a perceived or
assumed similarity between cross-linguistic items (cf. James 1980: 168f.).
Viewed in this way, CL becomes a way of refining initial assumptions of similarity. Chesterman (1998:58) expresses this as follows:
In this methodology, the tertium comparationis is thus what we aim to arrive
at, after a rigorous analysis; it crystallizes whatever is (to some extent) common to X and Y. It is thus an explicit specification of the initial comparability
criterion, but it is not identical with it, hence there is no circularity here.
Using an economic metaphor, we could say that the tertium comparationis
thus arrived at adds value to the initial perception of comparability, in that the
analysis has added explicitness, precision, perhaps formalization; it may also
have provided added information, added insights, added perception.

The crucial role that translation equivalence plays in CL has important
methodological consequences. We have already described the differences
between comparable corpora and translation corpora (Section 2.1). When
items are compared across comparable corpora, it is difficult to know if we are
comparing like with like. Any judgement about cross-linguistic equivalence (or
similarity) must be based on the researcher's translation competence. This is
true at both ends of the analysis: initially, when items are selected for comparison, and finally, when the results of the comparison are evaluated. When we use
translation corpora the situation is different. Although we normally start with
an initial assumption about cross-linguistic similarity (the very basis for
comparing anything at all), we can place more reliance on the translations
found in the corpus. The corpus can be said to lend an element of empirical
inter-subjectivity to the concept of equivalence, especially if the corpus represents a variety of translators.
However, despite the usefulness of translation corpora, to what extent can
we trust the translations we find in them? Can we treat all the translations that
turn up as cross-linguistic equivalents? There does not seem to be a simple
answer to this question. In one sense, every translation is worth considering as
a potential translation equivalent as it reflects the translator's competence.
However, translations are rarely literal renderings of the original. Translators
transfer texts from one language (and culture) to another and the translation
therefore tends to deviate in various ways from the original. We have already
mentioned possible translation effects (traces of the source language or universal translation strategies), and they may involve additions, omissions and
various kinds of free renderings that are either uncalled for or motivated by
cultural and communicative considerations.11
How, then, can we determine which translations should be regarded as
equivalents in a stricter sense? One solution has been to resort to the procedure of back-translation (see Ivir 1983, 1987), i.e. to restrict the comparison
to forms in L2 that can be translated back into the original forms in L1. This is
likely to eliminate irrelevant differences that are due to the translator's idiosyncrasies or motivated by particular communicative or textual strategies.
Another solution is to rely on recurrent translation patterns, i.e. to resort
to a quantitative notion of translation equivalence (cf. Krzeszowski 1990:27). If
several translators have used the same translation, this obviously increases its
relevance. However, this too implies a risk: by restricting the comparison to
recurrent translations we may throw away valuable evidence and miss the
cross-linguistic insights that unexpected translations often provide.
A variant of this approach which combines Ivir's idea of back-translation
and a quantitative notion of equivalence is to calculate what has been called the
mutual correspondence (or translatability) of two items in a bidirectional
translation corpus (see Altenberg 1999). If an item x in language A is always
translated by y in language B and, conversely, item y in language B is always
translated by x in language A, they will have a mutual correspondence of 100%.
If they are never translated by each other their mutual correspondence will be
0%. In other words, the higher the mutual correspondence value is, the greater
the equivalence between the compared items is likely to be. Although the
mutual correspondence of categories in different languages seldom reaches
100% in a translation corpus (even 80% seems to be a comparatively high
value), a statistical measure of translation equivalence can be a valuable diagnostic of the degree of correspondence between items or categories in different
languages (see e.g. Altenberg 1999 and Ebeling 1999: 257ff.). However, it does
not tell us where to draw the line between equivalence and non-equivalence.
Ultimately, the notion of equivalence is a matter of judgement, reflecting either
the researcher's or the translator's bilingual competence.12 Both involve a
judgement of translation equivalence.
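The calculation can be made explicit with a small Python sketch. We assume the formulation MC = (A_t + B_t) × 100 / (A_s + B_s), where A_s and B_s are the numbers of occurrences of the compared items in the source texts of languages A and B, and A_t and B_t are the numbers of times each item is translated by the other. This is how we understand the measure proposed in Altenberg (1999); the function names and the counts are invented for illustration.

```python
def directional_correspondence(translated: int, occurrences: int) -> float:
    """Percentage of source occurrences rendered by the compared item."""
    if occurrences <= 0:
        raise ValueError("the item must occur at least once in the source texts")
    return translated * 100 / occurrences


def mutual_correspondence(a_t: int, a_s: int, b_t: int, b_s: int) -> float:
    """Mutual correspondence (in per cent) of item x in A and item y in B.

    a_s: occurrences of x in the source texts of language A
    a_t: how many of these are translated by y
    b_s: occurrences of y in the source texts of language B
    b_t: how many of these are translated by x
    """
    if a_s + b_s <= 0:
        raise ValueError("at least one item must occur in the source texts")
    return (a_t + b_t) * 100 / (a_s + b_s)


# Invented counts: x occurs 50 times and is rendered by y 30 times;
# y occurs 70 times and is rendered by x 30 times.
a_to_b = directional_correspondence(30, 50)       # 60.0
b_to_a = directional_correspondence(30, 70)       # about 42.9
mc = mutual_correspondence(30, 50, 30, 70)        # 50.0
```

Keeping the two directional values alongside the combined figure is a deliberate choice in this sketch: a single percentage conceals any asymmetry between the two translation directions, and it is still left to the researcher to decide what value counts as equivalence.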
3.3 Language system vs. language use
In the past, contrastive analysis was chiefly concerned with comparisons of
abstract systems across languages. However, corpora reflect language use, and
translation equivalence is always equivalence-in-context (Chesterman
1998:31). This broadens the scope of contrastive analysis. The aim is to account
for both language systems and language use, i.e. the task is not only to identify
translation equivalents and systematic correspondences between categories in
different languages, but to specify to what extent and in what respect they
express the same thing and where similarities and differences should be located in a model of linguistic description.
The extended scope of corpus-based CL creates theoretical as well as
methodological problems. As has been pointed out by Salkie (1997) in a comparison of English but and French mais, translation equivalents in two languages seldom have the same distribution and seldom have 100% correspondence in multilingual corpora. This raises a number of important questions.
For example, how regular does an observed difference have to be in order to
count as systematic (rather than random or unpredictable)? Where should the
difference be located: in the language system (langue) or in language use
(parole)? To what extent can linguistic (sub)systems be isolated from each
other, and in what ways do they interact? (See Salkie in this volume for further
discussion of this question.)


The fact that translation equivalents seldom have 100% correspondence in
translation corpora has been demonstrated in a number of studies. In
Altenberg's (1999) comparison of adverbial connectors in English and Swedish
not even cognate or functionally similar items like instead : i stället and on the
other hand : å andra sidan reach a mutual correspondence of 80%. The correspondence of cognate or functionally similar verb pairs across languages tends
to be surprisingly low. For example, Altenberg's comparison of the prototypical causative verbs make in English and få in Swedish (this volume) reveals a
mutual correspondence of only 52%. Similarly, Viberg's (1996a:161) comparison of the cognate verb pairs go/gå and give/ge in English and Swedish shows
that they are only translated by each other in about a third of the cases, and the
mutual correspondence of the primary possession verbs get and få in the same
languages is shown to be as low as 15% (Viberg, this volume).
It is obvious that a low degree of mutual correspondence between functionally related items has several explanations. In the case of Viberg's verb pairs
the reason is the diverging polysemy and the different meaning extensions that
verbs tend to develop in different languages (see Section 4.1). In the case of the
English and Swedish connectors examined by Altenberg, some of the differences are clearly system-related. For example, connectors with zero correspondence reveal the existence of lexical gaps in either language: the Swedish
explanatory connector nämligen has no exact counterpart in English and the
English transitional connector now has no counterpart in Swedish. Items with
intermediate correspondence values often illustrate differences in the stylistic
or functional status of the connectors in the two languages. This is typically
revealed by an asymmetrical translation tendency. For example, English therefore is more often translated into Swedish därför than the other way round,
because därför is a more common and stylistically more neutral resultive connector in Swedish than therefore is in English.
However, there is also evidence of system interchange. This is clearly
revealed in Altenberg's comparison of causative English make and Swedish få in
the present volume. In both languages the periphrastic causative verb construction with make and få can be replaced by alternative constructions, such
as a synthetic causative verb or a structurally reorganised causative construction. Epistemic modality is another area where different subsystems tend to
interact. For example, as shown by Aijmer (1999) in her comparison of epistemic possibility in English and Swedish, when there is a gap in the Swedish system of modal auxiliaries, it can be filled by a modal adverb. Similarly, when
English may and Swedish kan are not good equivalents, the translators tend to
choose a corresponding adverb or a combination of modal elements.
A similar tendency is revealed in Johansson's (1997) multilingual comparison of the generic pronoun man in German and Norwegian and its counterparts in English. Many languages have a generic pronoun (e.g. man in German
and the Scandinavian languages, one in English, and on in French), but their
frequency and stylistic status vary from language to language. Consequently,
translations between such languages tend to display different tendencies
depending on the direction of the translation. When a generic pronoun is translated from a language where it is comparatively infrequent (such as English)
into a language where it is relatively frequent (such as the Scandinavian languages and, in particular, German and French), it is generally rendered by a
generic pronoun in the target language. However, translations in the opposite
direction show a different tendency. The generic pronoun in the source language is less often translated by a generic pronoun in the target language.
Instead, it tends to be rendered by a range of syntactically restructured impersonal expressions, such as non-finite clauses, agent-less passives, imperatives
and nominalisations. These cross-linguistic differences suggest that the tertium
comparationis needs to be defined at the intersection of several structural systems. Further examples of system interaction will be given in Section 4.2.
The shift from one construction in the source language to another in the
target language is often accompanied by a change of viewpoint. For example, in
changing an original active clause with generic man as subject into either a construction with a specific personal pronoun (e.g. I, he or she) or an impersonal
passive or non-finite construction, the translator can in some sense be said to
view the situation expressed in the source language from a different perspective. A shift in perspective of a different kind is examined by Salkie (this volume) under the term modulation and used as a way of explaining the various
unexpected translations of the German adverb kaum into English and of the
English verb contain into French.
We see, then, that translation corpora confront the researcher with a
wealth of different translation types reflecting various degrees of cross-linguistic correspondence. Broadly speaking, these can be said to range from
highly recurrent expected translation equivalents to a bewildering variety of
unexpected renderings, many of which cross the boundaries between linguistic subsystems and at first sight seem to defy classification. It may be tempting
to dismiss such unexpected cases as products of the translator's performance,
but there is generally a good reason behind the choice of translation. It is the
task of the contrastive researcher to evaluate the corpus data as far as possible
and try to see the patterns lurking behind the translators' resourcefulness and
behind the most unexpected renderings that turn up in translation corpora.

4. Types of cross-linguistic correspondence

Languages divide up semantic space in different ways. This is a natural consequence of the fact that the conceptual world evolves differently in different languages, for historical, cultural, geographical and social reasons. As a result,
complete equivalence between words and expressions in different languages is
rather unusual, just as it is unusual to find exact synonyms within one language. This lack of cross-linguistic correspondence is manifested in different
ways. The number of concepts encoded in the vocabulary may differ from one
language to another. Moreover, the conceptual systems may differ in structure.
Familiar examples of this are the ways in which colours and kinship are encoded in different languages. Swedish, for example, has no common term corresponding to English uncle or French oncle but has to make a distinction
between farbror 'father's brother' and morbror 'mother's brother'.
One consequence of this is that words that are treated as translation equivalents in bilingual dictionaries tend to have different ranges of meaning. An
example of this is the relationship between the French, English and German
words bois : wood : Holz and forêt : forest : Wald (see Svensén 1993:141). Bois has
a wider meaning than wood, and wood a wider meaning than Holz; conversely,
Wald has a wider meaning than forest, and forest has a wider meaning than forêt.
As a result, the meanings of wood and Wald only partly overlap, and the same is
true of forest and bois. In other words, there is not complete equivalence
between any of the words. Partial overlap of a similar kind is revealed by
Teubert (1996) in his analysis of English diary and calendar and German
Tagebuch, Kalender and Almanach.
The divergent meaning extensions that have evolved in different languages
are especially striking in high-frequency words expressing certain basic meanings. This is clearly illustrated by verbs of motion, perception, and cognition,
which occur in most languages with roughly the same basic meanings. At the
same time, they are highly polysemous owing to various types of universal and
language-specific meaning extensions (see e.g. Viberg 1996a).13 The complex
cross-linguistic differences these give rise to can be described in terms of such
general processes as lexical specification (or elaboration), schematisation (or
abstraction), grammaticalisation, metaphorical extension, and idiomatisation.
Cross-linguistically, these developments result in complex patterns of partially
overlapping polysemy. Differences of this kind are not only a major problem
for language learners, they have also become one of the major stumbling blocks
for machine translation and one reason why the lexicon is often described as
the bottleneck of natural language processing (see e.g. Calzolari 1996: 3 and
Sinclair et al. 1996: 174). To identify and describe these patterns is a challenge
for lexical CL.
However, cross-linguistic equivalence is not only a matter of semantic content. Since the meaning of words is also determined by their grammatical and
lexical environment (syntagmatic relations like colligation and collocation) as
well as by the situation in which they are used (style, pragmatics), similarities
and differences in these respects must also be considered when cross-linguistic
equivalence is determined. In other words, equivalence is a complex phenomenon: it involves several levels of linguistic description, and both paradigmatic
and syntagmatic relations. We shall not attempt to give a detailed description
of various types of cross-linguistic correspondence here. Instead, we shall make
a broad distinction between three types of cross-linguistic relationships:
(a) overlapping polysemy (items in two languages have roughly the same
meaning extensions)
(b) diverging polysemy (items in two languages have different meaning
extensions)
(c) no correspondence (an item in one language has no obvious equivalent in
another language)
It should be added that polysemy is not a clear-cut notion. Whether a lexical
item can be assigned a certain number of meanings (polysemy) or should be
regarded as vague or underspecified with regard to particular items in another
language is often difficult to determine. However, it is obvious that translation
corpora offer a fertile basis for exploring issues of this kind. In the rest of this
section we shall give examples of some recent studies that have explored various types of correspondence. Since overlapping polysemy (in its strictest sense)
is relatively uncommon (see however Alsina and DeCesaris, this volume), we
shall concentrate on the last two types distinguished above. The difference
between paradigmatic and syntagmatic relations will be discussed separately in
Section 5.


4.1 Diverging polysemy
Diverging polysemy is a very common phenomenon in contrastive studies of
lexis. In a series of studies focusing on high-frequency verbs with similar basic
meanings in English and Swedish, Viberg (1996a and b, 1998, 1999, this volume) has explored the divergent patterns of polysemy characterising verbs of
motion (such as go : gå and verbs for running, putting and pulling) and
physical contact verbs (verbs for hitting) in the English-Swedish Parallel
Corpus. Using a general typological framework, partly inspired by Miller and
Johnson-Laird (1976), Talmy (1985) and the frame semantics model proposed
by Fillmore and Atkins (1992), he demonstrates that verbs that are usually
treated as translation equivalents in dictionaries display surprisingly low
mutual correspondence in the corpus, a fact which is due to their various divergent meaning extensions and reflected in a wide range of translations in both
languages. Viberg's studies are a good illustration of how theory and cross-linguistic data can interact in a fruitful way. The data serve to test the validity of a
language-independent semantic framework, while the framework provides a
stable basis for refined descriptions of language-specific and typological lexical
differences, as well as of universal semantic categories and principles of meaning extension.
In his contribution to the present volume Viberg compares the Swedish
possession verb få with its closest English equivalent get and, more briefly, with
its equivalents in Finnish and French. Starting from basic sense distinctions of
få and get established on the basis of the original texts, he uses their translation
equivalents to determine their degree of cross-linguistic correspondence.
Viberg finds great conceptual similarities, as regards both their basic and their
extended meanings, but the lexicalisation patterns are very language-specific
and their mutual translatability low. Another important finding is that the
meanings of both verbs can to a large extent be disambiguated by the syntactic
frames in which they occur. Some meanings, however, have to be inferred from
semantic and pragmatic cues in the linguistic and extra-linguistic context.
A good example of the complexity of cross-linguistic (and intralinguistic)
lexical relationships is the multiple correspondences revealed by
Chodkiewicz et al. (in this volume) in their comparison of the French legal
term procédure and the English term proceedings in the French and English versions of the European Convention on Human Rights. Both terms are highly
polysemous and consequently have multiple equivalents in the other language:
proceedings has no less than twelve translation equivalents in the French subcorpus
and procédure has six in the English subcorpus. Although these correspondences do not create any problems of comprehension, their description is
a great challenge for the lexicographer and terminologist.
A special variant of diverging polysemy can be said to exist when a single
word in one language is variously rendered by several items in another language. A well-known example of this is the verb think in English which regularly corresponds to several verbs in other languages (cf. Nuyts 1997). German,
for instance, has to use at least three different verbs (denken, glauben and
finden) to express the main meanings covered by think. A similar differentiation is required in Swedish where the main counterparts are tänka ('cogitation'), tycka ('subjective evaluation') and tro ('belief'), as illustrated in the following examples (from Aijmer 1998:278):

What are you thinking of?                     Vad tänker du på?
I think Stockholm is a beautiful city         Jag tycker Stockholm är en vacker stad
I think Stockholm is the capital of Sweden    Jag tror Stockholm är Sveriges huvudstad

English think thus represents a complex case of polysemy and semantic fuzziness
that forces a semantic (and lexical) distinction in other languages.14 The different
meanings must be distinguished pragmatically by means of contextual cues and
background knowledge. In cases of this kind translation corpora can help to
specify not only the choices that have to be made in other languages, but also the
conditions that determine the choices and the semantic range covered by the different alternatives. This is well illustrated in three closely related studies by
Simon-Vandenbergen (1998), Aijmer (1998) and Mauranen (1999), who examine the Dutch, Swedish and Finnish equivalents of the epistemic use of I think
(i.e. excluding its dynamic cogitation sense corresponding to Swedish tänka or
German denken). Since Aijmer also makes use of translations into German and
Norwegian, the three studies give a broad multilingual picture of the main equivalents of think in five languages. As indicated in Table 1, the field covered by the
Germanic verbs can be seen as describing a continuum between two poles: verifiable probability-based opinion (the verbs in the left-hand column) and
impression-based subjective evaluation (the verbs in the right-hand column).
Contextual factors determining the choice of verb along the continuum (as well
as the use of other related verbs like Dutch dunken and lijken and Swedish tyckas
'seem') include such features as type of evidence involved (e.g. direct observation
or experience), type of certainty and verifiability and type of speaker authority.


Table 1: Translation equivalents of epistemic think in some Germanic languages




No correspondence
There are also cases where an item in one language has no obvious equivalent in
another language. One familiar grammatical example of this is the English progressive, which has no equivalent in many languages and is therefore difficult to
translate in a systematic way. Broadly speaking, this kind of cross-linguistic difference is revealed in two ways in translation corpora (although the distinction
is not clear-cut and there is a great deal of overlap between the two tendencies):
either the lack of a clear equivalent in the target language results in a large number of zero translations, indicating that the translators have great difficulties
finding a suitable target item, or in a wide range of translations, indicating that
the translators find it necessary to render the source item in some way but, in the
absence of a single prototypical equivalent, vary their renderings according to
the context. We shall illustrate both tendencies briefly here.
A characteristic feature of the Germanic languages is the frequent use of
lightly stressed pragmatic particles of various kinds, especially in spoken discourse. Familiar examples are the German modal particles ja, doch and schon.
The meaning of these particles is difficult to pinpoint and describe in dictionaries, partly because they tend to be multifunctional and partly because their
function tends to be pragmatic and highly context-dependent. Although many
of them can be described as having a modal function, often corresponding to
modal auxiliaries or modal adverbs in other languages, they also tend to have
various interactive or interpersonal functions without any direct lexical counterpart in other languages. They are therefore interesting to study contrastively
on the basis of translation corpora. Some recent studies will illustrate this.
In a study based on the English-Swedish Parallel Corpus, Aijmer (1996)
examines the Swedish particles ju, väl, nog and visst and their translations into
English. For each particle the translations display a great variety of renderings
representing a wide range of categories, from adverbs and modal auxiliaries to
full verbs and comment clauses. In some cases (especially in the case of väl)
questions and tag questions are used as approximate renderings. However, the
most striking result is the high frequency of zero translations, especially in the
case of ju, nog and visst. The difficulty of rendering these particles into English
is illustrated particularly clearly by ju ('as you know'), which lacks a translation
in 71% of the cases. Moreover, a great proportion of the renderings are unique
(singleton) translations and the most common English rendering (after all)
only represents 5% of the examples. Yet, despite the lack of an obvious English
counterpart, each particle has a translation profile of its own, reflecting its
complex pragmatic function. Hence, the translations can help to specify the
functional identity of the particles. This identity cannot be described in terms
of a single dimension but, like the translations of epistemic think into Dutch or
Swedish, rather in terms of a combination of grammatical, modal and interactive features, involving syntactic position, type of evidence (e.g. belief, inference, hearsay, etc.), type of authority involved (first, second and third person)
and interactive appeal (e.g. soliciting the listener's confirmation). A very similar picture emerges from Johansson and Løken's (1997) study of Norwegian
particles and their correspondences in English and Johansson's (1998) study of
the noun mind and its Norwegian translations.
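
The notion of a translation profile can be made concrete with a small computational sketch. The Python fragment below is an illustration only: the function name translation_profile and the ten sample renderings are assumptions introduced here, and the sample is invented toy data, not figures from Aijmer (1996). It computes, for one source item, the share of zero translations, the share of singleton renderings, and a ranked list of renderings.

```python
from collections import Counter

def translation_profile(renderings):
    """Summarise how one source item is rendered in a translation corpus.

    `renderings` holds one entry per corpus occurrence of the source item;
    None marks a zero translation (hypothetical input convention).
    """
    total = len(renderings)
    if total == 0:
        raise ValueError("no occurrences to profile")
    counts = Counter(r for r in renderings if r is not None)
    zero = sum(1 for r in renderings if r is None)
    # Renderings that occur exactly once in the sample.
    singletons = sum(1 for c in counts.values() if c == 1)
    # Each rendering with its share of all occurrences, most frequent first.
    ranked = [(r, c / total) for r, c in counts.most_common()]
    return zero / total, singletons / total, ranked

# Invented toy sample: ten occurrences of one particle.
sample = [None] * 7 + ["after all", "of course", "you know"]
zero_share, singleton_share, ranked = translation_profile(sample)
print(zero_share, singleton_share, ranked)
```

Comparing such profiles across particles is one simple way of showing that items without an obvious counterpart can still differ systematically in how they are rendered.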

5. Paradigmatic and syntagmatic perspectives

5.1 The lexical unit
So far we have tacitly assumed that the lexical items compared across languages
are easy to define and identify. However, although the definition and demarcation of a lexical unit may be fairly straightforward in theory (see e.g. Cruse
1986: 23ff.), it is often problematic in practice and notoriously difficult for a
computer. Teubert (1996:243f.), for example, compares the task confronting a
computer with that of a human being trying to make sense of a totally unfamiliar language. In contrast with the orthographical word, which has no consistent relationship to meaning, a lexical unit (or lexical item) can be defined as a
stable pairing of form and meaning (cf. Sinclair 1998). What complicates the
picture is that lexical units may consist of several words and that multiword
units tend to be unstable in form, lexically as well as grammatically (see e.g.
Moon 1996). Moreover, many meanings are difficult to specify without considering co-occurrence phenomena in the linguistic context. As a result, in corpus-based studies the researcher must either know what to look for, or rely on
collocation software for spotting potential lexical units in corpora.


As mentioned, the meaning of lexical units must be determined with
respect to two linguistic dimensions, the paradigmatic and the syntagmatic.
The paradigmatic dimension reflects how the senses of the words in a language
are related to each other and, cross-linguistically, to the senses distinguished in
other languages. Monolingually, these relations are typically described in terms
of synonymy, antonymy, hyponymy, meronymy, etc. (see
Cruse 1986:84ff.). Closely related to this dimension are various ways of organising vocabulary in terms of lexical sets or fields or in terms of prototypes.
From a contrastive point of view it has been attractive to work with typological
and universal categories and to use these as a basis for the comparison.
The syntagmatic dimension relates words to the linguistic context, lexically, semantically, and grammatically. Syntagmatic phenomena are typically
described in terms of lexical co-occurrence (collocation), semantic preferences
(e.g. case roles, selection restrictions, semantic prosody) and syntactic function
(e.g. syntactic dependency or valency). Cross-linguistically, this dimension has
also led to various attempts to establish language-independent or universal categories (e.g. in frame semantics) against which the vocabulary of different languages can be compared.
Although the paradigmatic and syntagmatic axes are two clearly distinguishable lexical dimensions in theory, they are closely related and difficult to
separate in practice. The reason for this is that the meaning of a lexical item (its
paradigmatic status) can only be determined on the basis of the context in
which it occurs (its syntagmatic status). In fact, it is the syntagmatic patterning
of words that determines what we can regard as a lexical unit in the first place.
However, it is important to add that the two dimensions affect the interpretation of lexical categories in different ways. As pointed out by Sinclair et al.
(1996: 176), closed-class words (or grammatical words) tend to have little
independent meaning and therefore have to be accounted for mainly in terms
of their co-text and grammatical or textual function. At the other end of the
spectrum are rare and specialised open-class words, such as technical terms,
which usually have little to do with the phraseology that surrounds them
(ibid. 176) and which can therefore easily be listed in multilingual term banks.
Most other open-class words exist somewhere in between these two extremes:
they tend to be polysemous and their contrastive description generally has to
take account of several layers of conceptual and contextual factors.
In the rest of this section we shall describe some approaches that have a
predominantly paradigmatic or syntagmatic bias and conclude with some
attempts to reconcile the two perspectives. It should be added that a clear distinction between the two dimensions is often difficult to make, and in some
cases our account is simply based on the lexical difference mentioned above:
open-class items tend to invoke a paradigmatic approach, closed-class items a
syntagmatic approach.
5.2 Some paradigmatic approaches
As we have seen, in order to be able to describe the type and degree of correspondence between lexical items in different languages, the basis of the comparison (the tertium comparationis) has to be primarily semantic or functional. Moreover, it is essential that the model of description that is used is language-independent and, preferably, based on typologically interesting or universal categories. In addition, the comparison must consider various principles
of lexical and semantic organisation, including the structure of the vocabularies of natural languages, the nature of lexical relations, the relationship
between meaning, concepts and the world, and the nature of polysemy (cf.
Kittay and Lehrer 1992: 2). A few paradigmatic approaches of this kind that
have been used in lexical CL with some success will be mentioned briefly here:
semantic features, prototypicality and semantic fields.
One concept that has long been central to the idea of the lexicon as an
organised system is semantic decomposition. Interconnections within the lexicon have often been analysed in terms of shared primitive components or features. As pointed out by James (1980: 89) and Kittay and Lehrer (1992: 9), one
motivation for this has been economy of description: a small number of components can be used to define a large number of words; another, which is of
particular interest for lexical CL, is that semantic primitives offer an attractive (and potentially universal) basis for lexical comparisons across languages.
Semantic primitives have consequently been used a great deal in lexical CL,
implicitly or explicitly, either as a tertium comparationis or as part of the contrastive analysis (see James 1980:89ff.). However, semantic decomposition has
been criticised and is still a controversial issue in lexical semantics (see Kittay
and Lehrer 1992: 9). One objection has been that the number of semantic
primitives is arbitrary (and theory-dependent), another that differences in
meaning between lexical items are conceptual and cannot be captured in terms
of abstract linguistic features.15
Some of the problems inherent in semantic decomposition can be avoided by
resorting to the notion of prototypicality (see Rosch 1975, Taylor 1995).
Prototypicality indicates degrees (or the best fit) of category membership and is a
fuzzier notion than other semantic relations. If we accept that meanings can be
fuzzy and are better described in cognitive (rather than purely linguistic) terms,
certain lexical relations can be characterised more adequately in terms of prototypes. Prototypicality is therefore often used in cognitively oriented taxonomies
and semantic field studies based on typological universals (cf. Viberg 1996:159f.).
Another concept with a long history is that of semantic field, i.e. the conceptual domain within which lexemes are organised by specific semantic relations such as synonymy, hyponymy, incompatibility, antonymy, etc. (see e.g.
Lehrer 1974, Schwarze 1985, Kittay 1987; on semantic relations, see Cruse
1986). Familiar examples of such fields are colour terms, cooking, parts of
the body, visual perception (see James 1980: 86ff.) and verbs of motion
(Viberg 1996a). The claim of this approach is that the meaning of words must
be understood, in part, in relation to other words that articulate a given content
domain and that stand in the relation of affinity and contrast to the word(s) in
question. Thus to understand the meaning of the verb to sauté requires that we
understand the contrastive relation to deep fry, broil, boil, and also to affinitive
terms like cook and the syntagmatic relations to pan, pot, and the many food
items one might sauté (Kittay and Lehrer 1992:34).
Many of the corpus-based multilingual studies mentioned in the previous
sections have been essentially paradigmatic in character (e.g. those by Viberg).
A translation-based variant of great theoretical and methodological interest is
Dyvik's (1998) demonstration of how sense distinctions between English and
Norwegian lexemes can be made by means of successive bidirectional comparisons of translation correspondences in the English-Norwegian Parallel
Corpus. Developing Ivir's (1983) notion of back-translation, Dyvik proceeds
in three steps. Starting from polysemous Norwegian lexemes like tak ('cover',
'roof', 'grip'), selskap ('companionship', 'society', 'firm', 'party') and god ('good',
'nice', 'fine', etc.), he first examines their translations into English (tak, for
instance, is rendered by roof, ceiling, cover, grip and hold). He then reverses the
perspective and examines how these English translations are rendered in
Norwegian. Reversing the procedure a second time (from Norwegian into
English), he arrives at a structured picture of the senses and sense relations of
both the Norwegian and English lexemes. The method can be used to define
such lexical properties as ambiguity, vagueness and synonymy, as well as lexical
fields, feature-specified hierarchies and overlap relations within these fields
(e.g. prototypicality, hyponymy). Like Viberg, Dyvik uses translation data to
objectify criteria and distinctions derived from, or supporting, a semantic
model, giving lexical semantics an empirical foundation.
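
The first two steps of this procedure (translations into English, followed by back-translations into Norwegian) can be sketched in a few lines of Python. This is a drastically simplified illustration and not Dyvik's actual method: the function name sense_groups and the grouping criterion are assumptions made here. The English translations of tak are those cited above, but the back-translations dekke and grep are hypothetical entries added for the example, not corpus findings. Two English translations are placed in the same group when they share a back-translation other than the source lexeme.

```python
def sense_groups(source, forward, backward):
    """Group the translations of `source` by shared back-translations."""
    first_image = forward.get(source, set())
    # Back-translations of each translation, ignoring the source itself.
    links = {t: backward.get(t, set()) - {source} for t in first_image}
    groups = []
    for t in sorted(first_image):
        merged, kept = {t}, []
        for group in groups:
            if any(links[t] & links[u] for u in group):
                merged |= group      # shared back-translation: same group
            else:
                kept.append(group)
        groups = kept + [merged]
    return sorted(sorted(g) for g in groups)

# Step 1: Norwegian -> English (translations of tak as cited in the text).
NO_EN = {"tak": {"roof", "ceiling", "cover", "grip", "hold"}}
# Step 2: English -> Norwegian (hypothetical back-translations).
EN_NO = {
    "roof": {"tak"},
    "ceiling": {"tak"},
    "cover": {"tak", "dekke"},
    "grip": {"tak", "grep"},
    "hold": {"tak", "grep"},
}

groups = sense_groups("tak", NO_EN, EN_NO)
print(groups)
```

With these toy data, grip and hold fall together because they share a back-translation, while the other translations remain separate. The second reversal described above, which refines the picture further, is not modelled in this sketch.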


5.3 Some syntagmatic approaches

As mentioned earlier, the corpus revolution has brought the contextual patterning of words into focus in recent years. The use of idioms and collocations
tends to be highly language-specific and the syntagmatic aspect of lexis is consequently of great contrastive interest, theoretically as well as practically. It is of
great importance for the FL learner (see e.g. Roos 1976, Bahns 1993, Granger
1998, Howarth 1996, 1998) and it is absolutely essential in natural language
processing (NLP) fields, such as machine-assisted translation and the creation
of multilingual lexical databases. Interesting attempts have been made to
extract collocations from existing bilingual dictionaries and monolingual corpora for storage in multilingual electronic lexicons (see Section 6). One example is the DECIDE project (see Grefenstette et al. 1996), which used a combination of these sources to collect speech act nouns and their verb collocates in
English, French and German (e.g. make/proffer etc. an apology, einen Vorschlag
machen/annehmen etc.) together with information about corpus frequencies
and syntactic behaviour in a multilingual database.
An interesting alternative to this approach is an experiment carried out at
Pisa (see Peters 1996), in which collocations in two domain-specific and topic-controlled monolingual corpora (in English and Italian) were matched with the
aid of a bilingual dictionary. The strategy was (a) to identify collocates of nouns
in a corpus representing one language on the basis of their mutual information
value (see Church and Hanks 1989), (b) to select potential translation blocks in
the other language on the basis of an English-Italian lexical database, and (c) to
identify similar sets of contexts in a comparable corpus for the other language.
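
Step (a) relies on the mutual information measure, which compares the observed co-occurrence frequency of two words with the frequency expected if they were independent: I(x, y) = log2(P(x, y) / (P(x) P(y))). The Python sketch below illustrates this calculation only. The function name, the window handling and the toy token list are assumptions made here, and the sketch is not a reconstruction of the Pisa experiment or of the exact counting scheme in Church and Hanks (1989).

```python
import math
from collections import Counter

def mutual_information(tokens, window=5):
    """Score ordered word pairs (x, y) where y follows x within `window` tokens.

    Probabilities are estimated as raw frequencies divided by corpus size,
    so I(x, y) = log2(f(x, y) * N / (f(x) * f(y))).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1:i + window]:
            pairs[(x, y)] += 1
    return {
        (x, y): math.log2(f_xy * n / (unigrams[x] * unigrams[y]))
        for (x, y), f_xy in pairs.items()
    }

# Invented toy corpus; window=2 restricts the count to adjacent words.
tokens = "make an apology make an offer".split()
scores = mutual_information(tokens, window=2)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

On real corpora the scores are unreliable for very low frequencies, so a minimum frequency threshold is normally applied before pairs are ranked.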
Like pragmatic particles, prepositions are closed-class items whose meanings are difficult to define without considering the context in which they occur.
Their functional importance varies from language to language and since they
tend to have many language-specific and idiosyncratic uses they are interesting
to study in a contrastive perspective. Three recent studies based on translation
corpora illustrate this: Schmied's (1998) comparison of German mit and
English with, Paulussen's (1999) contrastive investigation of English preposition/particle on/up, Dutch op and French sur, and Fabricius-Hansen's (1999)
study of German bei and its translations into English and Norwegian. These
prepositional studies are interesting in several ways. They all demonstrate the
usefulness of translation corpora in specifying the functions of items that derive
their meanings largely from the context. The great diversity of translation
equivalents encountered in the corpora also underlines the inadequacy of earlier contrastive and lexicographical descriptions that are not based on natural
corpus data. Two of the studies (Paulussen and Fabricius-Hansen) also demonstrate the usefulness of a cognitive framework in describing prepositional
meanings, at least those at the less idiosyncratic end of the semantic spectrum.
Some of the uses of the items examined in these studies are characterised as
metaphorical extensions of a prototypical literal or concrete core meaning.
Although the difference between the literal and figurative uses of an item can be
regarded as a paradigmatic phenomenon, it can normally only be established on
the basis of the linguistic context, i.e. syntagmatically. This is clearly illustrated in
two studies in the present volume comparing various figurative uses of lexis in
different languages. Lan Chun demonstrates the usefulness of a cognitive
approach in describing the metaphorical uses of up/down in English and
shang/xia in Chinese. On the basis of random samples from two monolingual
corpora, one English and one Chinese, she reveals remarkable similarities
between the metaphorical domains of the examined items in the two languages, a
finding which supports the idea that there is a universal spatial metaphorical system and that our abstract reasoning is at least partially metaphorical.
Paillard compares the use of two other types of figurative expression,
hypallage and metonymy, in English and French. Hypallage involves constructions in which the normal function of an element is changed (by syntactic
transposition, conversion or ellipsis) to create a marked effect (e.g. Melissa
shook her doubtful curls), while metonymy involves the replacement of a term
by another term that is closely associated with it (e.g. redneck). Both types exist
on a cline from complete lexicalisation to linguistic creativity. Corpus-based
investigations of these types of figurative language are problematic since they
cannot be searched for on the basis of form, and Paillard therefore uses a mixture of sources in his study: dictionaries, textual examples and a sample from a
translation corpus. Paillard demonstrates that the use of the two types tends to
be diametrically opposed in English and French, in terms of both frequency
and availability: while hypallage is more common in English, metonymy is in
some respects more readily tolerated in French. This divergence appears to
reflect interesting cross-linguistic differences: English allows greater syntactic
flexibility in terms of movement, part-of-speech conversion and ellipsis,
whereas French permits greater semantic freedom in the relationship between
argument and predicate.


5.4 Combining the paradigmatic and syntagmatic perspectives

Since corpora always present lexical items in their linguistic context, corpus-based contrastive studies of lexis can hardly avoid paying at least some attention to the syntagmatic patterning of the compared items, even if the primary
concern is their paradigmatic relationship. Hence, the paradigmatic and syntagmatic perspectives are often fused and the distinction is to a large extent a
matter of emphasis. However, it may be useful to end this section with a brief
account of an approach whose goal is a conscious attempt to reconcile the two
perspectives.
A clearly corpus-driven approach to the study of lexis is that advocated by
Sinclair (1998). Following the tradition of J. R. Firth (1957) in his definition of
meaning as function in context, Sinclair proposes a model in which the paradigmatic and syntagmatic dimensions of lexical items can be determined by studying the contextual patterning or co-selection of words in text corpora. Five
categories of co-selection are posited as components of a lexical item (ibid.:
14f.): a formally invariable core, its semantic prosody (roughly, its associated
attitudinal or pragmatic meaning), collocation (lexical co-occurrence), colligation (grammatical patterning) and semantic preference (link to a co-occurring
lexical field).
Cross-linguistic applications of this corpus-driven approach which
explore the possibility of identifying units of meaning on the basis of their contextual environment in one language and linking them with functionally
equivalent units in another have been tested in several European projects
involving several languages (see e.g. Sinclair et al. 1996, Sinclair 1996, Teubert
1996). Although the ultimate goal of these projects has generally been to create
multilingual lexicons for machine translation (MT) or machine-assisted translation (MAT) (see Section 7), they are of considerable contrastive interest. In
the projects reported by Sinclair (1996) and Teubert (1996) the starting point is
a number of pre-selected words which are studied on the basis of concordances
from monolingual corpora representing the compared languages. For each
word, recurrent contextual patterns specifying the different meanings of the
word are identified and a translation equivalent is defined for each meaning by
the researchers.
A characteristic feature of these projects is that the translation equivalents,
though inspired by monolingual corpora, are established on the basis of the
researchers' translation competence. An interesting variant of this approach is
illustrated in Tognini Bonelli's (1996) comparison of the English adjective real
and its Italian counterparts reale and vero. The comparison, which is based on
two broadly comparable monolingual corpora, involves several steps. First, the
meanings and functions of the English adjective are established on the basis of
its formal patterning in the concordance of an English corpus. Second, the
same process is repeated for the most likely Italian translation equivalent, reale,
on the basis of an Italian corpus. The meanings and functions of the two items
are then compared and cross-linguistic matches and mismatches identified.
Since the uses of reale do not cover all the uses of real, another Italian equivalent
is postulated and tested to see if it can fill the functional gaps that reale fails to
match. The result is a cross-linguistic description of the compared items that
can be stored in a bilingual database of comparable units of meaning. The limitation of this approach is that the comparison is confined to prima facie translation equivalents postulated by the researcher, either on the basis of intuition,
a bilingual dictionary or a translation corpus. The strength is that it can take
full advantage of the large amounts of data provided in the two monolingual
corpora to identify and describe syntagmatic and paradigmatic patterns of the
compared items.
In her contribution to this volume Tognini Bonelli explores the functional
equivalence of expressions containing the English word case and the Italian
word caso in concordances from broadly comparable English and Italian corpora. Starting with an analysis of the contextual patterning of the English multiword forms in the case of, in case of and in case in the English corpus, she
repeats the analysis for their prima facie translation equivalents nel caso di, in
caso di and se per caso in the Italian corpus. For each of these pairs the functionally complete units of meaning are identified and compared. The functional
equivalence of the items is found to be surprisingly close, especially in the
case of the first two pairs.
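
The concordance-based procedure described in this section starts from the contextual patterning of a node word. As a rough illustration of what such a concordance involves, the Python sketch below produces keyword-in-context lines for a node such as case. The function name kwic, the span parameter and the sample sentence are assumptions introduced here, and the sample is invented rather than drawn from the corpora used in the studies discussed.

```python
def kwic(tokens, node, span=3):
    """Return (left context, node, right context) for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

# Invented sample text containing two multiword patterns with "case".
text = "in case of fire break the glass in the case of doubt ask"
concordance = kwic(text.split(), "case", span=2)
for left, node, right in concordance:
    print(f"{left:>10}  {node}  {right}")
```

Sorting such lines by the words immediately to the left or right of the node is what makes recurrent patterns like in case of and in the case of visible to the analyst.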

6. Bilingual and multilingual lexicography

Since the publication of the Collins Cobuild English Language Dictionary in
1987, text corpora have become a well-established ingredient in monolingual
lexicography. The advantages of using corpora to ensure authenticity and
empirical adequacy in lexicography are well documented in Sinclair (1985 and
1987b). In 1994 this practice was extended to bilingual lexicography with the
appearance of the Oxford-Hachette English-French, French-English Dictionary,
which makes use of two monolingual corpora, one in English and one in
French (see Atkins 1994:xix). Since then, the use of translation corpora has also
become an important supplementary feature in bilingual lexicography, e.g. in
the Bilingual Canadian Dictionary project (see Roberts and Montgomery
1996). Both types of corpora are invaluable resources in bilingual lexicography:
monolingual corpora in the structuring of the lexical entries, in supplying natural examples and in verifying the target language equivalents, and translation
corpora in enriching the inventory of target language equivalents (cf. e.g.
Dickens and Salkie 1996, Teubert 1996 and the many projects described in
Gellerstam et al. 1996). In the 1990s, the main emphasis has been on the creation of multilingual databases and terminology systems for machine translation as well as for human use (see e.g. Bläser 1995). Most of these projects have
been devoted to specific lexical domains, such as perception verbs and nouns
(e.g. Ostler 1995, Heid 1995).
An interesting approach to multilingual lexicography is used in the
Contrastive Verb Valency Dictionary of Dutch, French and English project at
the University of Ghent (see Simon-Vandenbergen et al. 1996). Although corpora play a subordinate role in this project, it deserves to be mentioned here
because it represents a recent and very interesting contrastive development of a
long tradition of valency studies (for a good survey of earlier work, see Devos
1996). It has several distinctive features: (a) it is multilingual, comparing verbs
in three languages; (b) it is multidirectional and truly contrastive in that it pays
equal attention to all three languages and gives each the same systematic treatment (on this requirement, see Fisiak 1981:3); (c) it is multi-layered, providing
syntactic, semantic and stylistic information for each entry. The starting point
of the comparison is verbal lexemes that are judged to be prototypical equivalents in the three languages. The lexemes of each language are then analysed
separately in terms of their syntax and semantics on the basis of monolingual
corpora, bilingual dictionaries and introspection. In the final stage, the
descriptions arrived at in this way are contrasted and different degrees of cross-linguistic similarity established. The result is a three-way multilingual electronic lexicon which provides rich contrastive information about the lexical
structure of verbs in the three languages and about such phenomena as translation equivalence, overlapping and diverging polysemy, and ranges of meaning.
The multidirectional approach also produces a more fine-grained cross-linguistic analysis of the lexemes than would otherwise have been possible. This is
illustrated by Devos's (1996: 35–37) comparative sketch of Dutch kijken,
French regarder and English look. A unidirectional analysis that takes, say,
Dutch as its point of departure would result in a lexical entry separating only
five out of the nine meanings that need to be distinguished in a multidirectional perspective. Moreover, a multidirectional approach gives a clearer picture of
the conceptual ranges of the three verbs: English look and French regarder have
more meaning extensions than Dutch kijken.
Many of the contributions to the present volume demonstrate convincingly how bilingual and multilingual lexicography can be enriched by means of
corpora of various kinds. Alsina and DeCesaris examine two interesting issues
in bilingual lexicography: how the degree of overlapping polysemy and data
from a monolingual corpus can be used to improve the structure of lexical entries in three general-purpose English-Spanish dictionaries and one
English-Catalan dictionary. By comparing the sense distinctions made for
three polysemous English adjectives, cold, high and odd, in existing monolingual and bilingual dictionaries, they identify potential areas for improvement
in the design of the dictionary entries, the ordering of equivalents, and the
treatment of idioms and set phrases. This information is then compared with
the distribution of senses in the British National Corpus. Their conclusion is
that bilingual lexicography should pay more attention to overlapping polysemy, i.e. symmetrical equivalence relations established on the basis of the languages involved. Although a single monolingual corpus is of limited use in this
respect, it plays an important role for the ordering of equivalents and for the
selection of fixed expressions in a bilingual dictionary.
Cardey and Greenfield report on experiences gained from the construction
of a multilingual dictionary system intended for the automatic recognition and
translation of set expressions in four languages. Focusing on set expressions
involving names of animals and parts of the body collected from various dictionaries, they examine the variability and various types of ambiguity associated with such multiword expressions. They conclude that, although only
human researchers can identify and disambiguate set expressions and organise
the entries in a multilingual electronic lexicon, the computer is a very useful
tool in collecting, organising and verifying the use of multiword expressions.
NLP systems depend heavily on powerful lexical databases that provide
language-specific as well as cross-linguistic information about the paradigmatic
and syntagmatic patterning of lexical items. An important development in this
direction is the creation of databases storing networks of lexical relations within and across languages. An example of this is the EuroWordNet project (see
e.g. Vossen 1998, Ide et al. 1998). It is a multilingual development of the
American (Princeton) WordNet system and intended to serve a variety of
applications such as machine-assisted translation and information retrieval.
So far, however, the wordnets of this project are derived from existing machine-readable dictionaries rather than from corpora and their usefulness has been
called into question (see Teubert this volume). Different possibilities of creating multilingual databases from various sources (machine-readable dictionaries, corpora, language-specific wordnets) are discussed in Steffens (1995) and
Teubert (1998).

Machine translation and machine-assisted translation

Much of the recent upsurge of interest in corpus-based CL has its origin in the
growing need for machine translation (MT) or machine-assisted translation
(MAT). When computers began to be used for language processing in the
1950s and 1960s one of the first priorities was MT. However, the results were
disappointing, partly because the computational resources were insufficient,
but mainly because the complexity of simulating human translation of unrestricted text was underestimated. Neither the mentalist linguistic models nor
the statistical processing techniques (the two competing approaches that
were used) could cope adequately with the problem of transferring one language in use to another (see e.g. Sinclair et al. 1996).
MT and MAT rely heavily on large multilingual computational lexicons or
databases (see Section 6). In the 1980s great efforts were made to extract and
formalise the lexical information contained in conventional machine-readable
dictionaries. However, traditional dictionaries are rarely detailed, systematic,
explicit and reliable enough for NLP, and although valuable experience was
gained in the development of large-scale lexical databases from such sources,
the reliability and coverage of the resulting lexicons proved to be inadequate
(see e.g. Ide and Véronis 1995, Steffens 1995:2f.). Maniez (this volume) identifies lexical ambiguity as one of the main problems confronting machine translation. Taking three concrete problems as a point of departure (the translation into French of the English compound sedimentation rate, the ambiguity of
the expression based on, and discontinuous collocations), he demonstrates
the need for lexical databases that include frequently used compounds and collocations and information about the frequency of the different meanings of
polysemous lexical items. The emphasis of his article is not so much on the usefulness of computer tools and corpora in the field of machine-assisted translation, which is taken for granted, but rather on the need to collect data that
can improve expert systems and tools for translators. To automate the disambiguation process it is necessary to build lexical databases which include a comprehensive description of syntactic and lexical ambiguities that are likely to
appear in particular domains, together with statistical information about various co-occurrence phenomena derived from corpora.
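The kind of lexical database called for here can be illustrated with a minimal sketch. It is not Maniez's system: the class `Entry`, the function `translate`, the French equivalents and all frequency counts are assumptions invented for illustration. The sketch stores compounds as entries in their own right and, for each entry, the frequency of its competing equivalents, so that a compound such as sedimentation rate is matched as a unit before its parts are looked up separately.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    # Maps each candidate equivalent to a frequency count.
    # The counts used below are invented, not taken from any corpus.
    senses: dict = field(default_factory=dict)

    def best(self):
        # Pick the most frequent equivalent.
        return max(self.senses, key=self.senses.get)


# Hypothetical English-French lexicon keyed by token tuples,
# so that compounds and single words share one lookup table.
LEXICON = {
    ("sedimentation", "rate"): Entry({"vitesse de sédimentation": 9,
                                      "taux de sédimentation": 1}),
    ("rate",): Entry({"taux": 7, "vitesse": 2, "tarif": 1}),
    ("sedimentation",): Entry({"sédimentation": 10}),
}


def translate(tokens, lexicon=LEXICON, max_len=3):
    """Greedy longest-match lookup: compounds win over their parts."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in lexicon:
                out.append(lexicon[key].best())
                i += n
                break
        else:
            # Unknown token: pass it through unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

With these invented counts, rate on its own is rendered by its most frequent equivalent, whereas the compound receives the equivalent stored for the whole unit. A realistic database would also need the domain labels and co-occurrence statistics mentioned above.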
The main outcome of these experiments has therefore been a growing
awareness of the need for text corpora as an additional source of data and of
closer collaboration between computational linguists, lexicographers and corpus linguists (see Ide and Véronis 1995). As a result, more recent work on MT
has increasingly turned to text corpora for such tasks as lexical acquisition, disambiguation and analysis. Corpus-based methods are central for example-based and statistical MT (e.g. Brown et al. 1990), for the extraction and structuring of multilingual term banks and for the creation of translation memories
(e.g. Heyn 1998), and various support tools for translators (Isabelle et al. 1993,
Merkel 1999:25ff.).
Multilingual corpora have revived the hopes of achieving, if not automatic
translation (except in specific domains), at least robust MAT. A number of projects are now at work trying to recover the wealth of information stored in
translation corpora in various ways. This work involves at least two tasks. One
is to recover, organise and recycle the information available in translation corpora; the other is to develop a very rich network of meaning relationships
between categories in the languages involved or, to use the words of Sinclair
et al. (1996:174), to internalise the expertise of bilingual humans.
The first task relies heavily on the fact that the meaning and use of words can
generally be deduced from the linguistic context in which they occur and that
the translations serve to distinguish these meanings. Several approaches have
been used. One is to create translation banks, i.e. recurrent source-target pairings stored in a translation memory that can be called upon as an aid in translating new texts. This approach is especially useful for the translation of domain-specific texts where words and expressions tend to recur (see Ahrenberg and
Merkel 1996, Merkel 1998: 43–61, Heyn 1998). Another is the procedure proposed by Sinclair et al. (1996), which extends the tradition of monolingual collocational studies to multilingual corpora: words and word combinations are
disambiguated by means of their translations and by their linguistic context in
both languages. The distinctive patterns recognised in this way are then formalised and stored in a large multilingual database which can be used for
machine-assisted and human translation. Variants of this approach, involving a
number of languages (English, French, German, Italian, Spanish and Swedish),
are described in Sinclair et al. (1996) and Teubert (this volume).
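The first of these approaches, the translation memory, can be sketched in a few lines. This is only an illustration under stated assumptions: the class `TranslationMemory`, its methods, the similarity threshold and the example sentences are hypothetical and do not describe any of the systems cited above. Similarity is measured with the standard Python `difflib.SequenceMatcher`, which is one simple option among many.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy store of recurrent source-target pairings (hypothetical design)."""

    def __init__(self):
        self._pairs = {}  # source segment -> target segment

    def add(self, source, target):
        self._pairs[source] = target

    def lookup(self, segment, threshold=0.75):
        """Return (target, score) for the most similar stored source
        segment, or None if nothing reaches the threshold."""
        best = None
        for source, target in self._pairs.items():
            # Character-based similarity between 0.0 and 1.0.
            score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (target, score)
        return best


# Example pairing, invented for illustration.
tm = TranslationMemory()
tm.add("Press the start button.", "Appuyez sur le bouton de démarrage.")

exact = tm.lookup("Press the start button.")   # identical segment
fuzzy = tm.lookup("Press the stop button.")    # near match, offered for post-editing
miss = tm.lookup("Hello world")                # too dissimilar, returns None
```

An exact hit can be reused directly, while a near match is only a suggestion that the translator must adapt. This is why the approach pays off mainly in domain-specific texts, where segments recur with little variation.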

The way forward

There is no doubt that the use of text corpora has revolutionised contrastive
lexical analysis. The wealth of natural language data represented in corpora not
only provides a more detailed and accurate picture of the cross-linguistic correspondences of lexical items, it also greatly improves the quality and usefulness of multilingual lexicons and various types of translation tools.
However, having said this, it is necessary to emphasize some of the challenges
that lie ahead. In his criticism of earlier models of (monolingual) lexical description, Sinclair (1998:14) points out that an analysis that aims to account for both
the paradigmatic and syntagmatic patternings of lexical items calls for nothing
less than a comprehensive redescription of each language. In the same vein,
Teubert (1996: 238), commenting on the impressive developments in multilingual lexicography and NLP, states that further improvement depends on reanalysing the languages involved from scratch with the aid of multilingual corpora. These statements do not necessarily imply that earlier contrastive work is
useless, but they underline the inadequacy of much research in the past and, in
particular, the magnitude of the work that lies ahead. Considering the vast size
and complexity of the vocabulary of a single language and the enormous task of
comparing the lexis of even two languages, these statements serve as a healthy
reminder that the revolution in lexical CL has only just begun. Despite the revitalisation of the field that multilingual corpora have created and the many useful
tools that are now available, a host of problems remain to be solved and many
challenges need to be overcome if real progress is to be made. It may be useful to
mention some of these challenges here. In particular, there is a need for

• a stronger coordination of activities and cooperation across related disciplines, in particular corpus linguistics, computational linguistics, lexicography, natural language processing and translation studies;
• increased efforts to integrate theoretical modelling and empirical studies
of language use, to incorporate both the paradigmatic and syntagmatic
dimensions of lexis, and to relate language-internal and cross-linguistic
lexical relations in a systematic way;
• further refinement of corpus-based contrastive methodology, especially as
regards the combined use of comparable and translation corpora;
• intensified efforts to create larger, more comprehensive and generally
accessible multilingual corpora, especially translation corpora relating
more than two languages;
• further development of multilingual software tools in such areas as word
and phrase alignment, parallel concordancing, lexical databases, translators' workbenches, computer-assisted translation, and multilingual systems
in which corpora, electronic lexicons and grammars are linked in a user-friendly way.
To achieve these goals much research and development will be required in the
future. To judge from the great vitality in the field, the many promising corpus-based projects that are in progress all over the world and, not least, the variety
of approaches represented in the present volume, there are good reasons to be
hopeful about the future of lexical CL. There are indeed exciting times ahead.


1. For an overview of this lexical reorientation, see Faber & Mairal Usón (1999:
Chapter 1). Some approaches representing different theoretical traditions reflecting this
development are Lexical Functional Grammar (see e.g. Bresnan 1982), Generalised Phrase
Structure Grammar (e.g. Gazdar et al. 1985) and its descendant Head-Driven Phrase
Structure Grammar (e.g. Pollard and Sag 1987), Word Grammar (Hudson 1984),
Functional Grammar (Dik 1978, 1989), Systemic Functional Grammar (Halliday 1994),
Role and Reference Grammar (Van Valin 1993, Van Valin and LaPolla 1997), the Functional
Lexematic Model (Faber and Mairal Usón 1999, Butler 1998, 1999), Cognitive Grammar
(Langacker 1987, 1991), Frame Semantics (e.g. Fillmore and Atkins 1992) and Construction
Grammar (Fillmore 1988, Goldberg 1995). It is also clearly manifested in such lexicological
enterprises as the MIT Lexicon Project (e.g. Rappaport and Levin 1988, Levin 1993) and the
WordNet Project (e.g. Miller and Fellbaum 1991).
2. "By far the majority of lexical items have a relative frequency in current English of less
than 20 per million. The chance probability of such items occurring adjacent to each other
diminishes to less than 1 in 2,500,000,000!" (Clear 1993:274)
3. In his summary of the development of CA up to 1980, James (1980: 83) says that the
structuralist movement in linguistics, and the allied Audio-Lingual Method, with their
emphasis on the priority of grammatical patterns, tended, in contrast to the layman's view,
to neglect the role which vocabulary undoubtedly plays in the process of communication.
4. In addition to these types there are other corpora involving translations: corpora of
original texts and translations in the same language and corpora of translated texts in different languages. These are especially useful for translation studies and for investigations of
systematic translation effects (see Baker 1993, Schmied and Schäffler 1996) but of little use
in CL.
5. The use of translations for contrastive studies of lexis has a long history. Viberg (this
volume) mentions the work of Wandruszka (1969), who used a non-electronic corpus of 60
publications in six Germanic and Romance languages, partly inspired by Bally (1950). The
use of electronic translation corpora in CL is a comparatively recent phenomenon, but its
roots can be traced back to the 1960s, i.e. the first decade of computer corpora. The first
attempt to assemble a bidirectional electronic translation corpus for contrastive studies
seems to have been made by Rudolf Filipović and his collaborators in the Yugoslav Serbo-Croatian-English Contrastive Project at the University of Zagreb (see e.g. Filipović 1969,
1971). The corpus compiled for this project consisted of half the Brown Corpus (Francis and
Kučera 1979), which was translated into Serbo-Croatian, and a smaller corpus of original
Serbo-Croatian texts translated into English.
6. For a description of the Oslo project, see
sprik/index.html. Another interesting example of a truly multilingual translation corpus is
that created by the Trans-European Language Resources Infrastructure (TELRI). The corpus
uses Plato's Republic as a point of departure and includes aligned translations into more than
twenty languages (see Erjavec et al. 1998; for studies of translation equivalence on the basis
of this corpus, see Teubert et al. 1997 and Teubert this volume).
7. For some useful surveys of different approaches to text alignment, see e.g. Merkel
(1999: 28ff.), Oakes and McEnery (2000), Simard et al. (2000) and Véronis (2000). For
experiments in pairing corresponding units in comparable corpora, see Peters (1996).
8. For some pioneering work on sentence alignment, see Brown et al. (1990), Brown et al.
(1991), Gale and Church (1991), Simard et al. (1992).
9. The example is inspired by an illustration in Church and Gale (1991). For more information on the TransSearch bilingual concordancing tool, see Simard et al. (1993) and the
TransSearch website:
10. In English the underlying principle is mainly syntactic. The singular form is used in
premodifying position (drug abuse, drug prevention, drug-related, drug-free) and the plural
form in all the other cases (to import drugs, the war against drugs). In French the main factor
is the countable/uncountable status of the noun. When it is uncountable the singular is used:
lutter contre la drogue, les barons de la drogue, le milieu de la drogue. The plural form is only
used when the referent is countable: 10% de toutes les drogues, la cocaïne et les autres drogues.
In some cases, both forms are possible: trafic de drogue(s), saisies de drogue(s).
11. Translation effects, whether induced by the source language or universal strategies, are
seldom violations of the target language system in professional translations, but quantitative
deviations from the target language norm (see Schmied and Schäffler 1996). As such they
are of course eligible as potential translation equivalents. What importance should be
attached to them depends on their naturalness, which can only be evaluated against the
norm provided by a large reference corpus of original texts representing the target language.
12. Alternatively, the definition of equivalence may be determined by the theoretical model
used for the contrastive description. If the model requires a certain kind of formal correspondence, or if it draws the line between semantics and pragmatics in a particular way, this
may be a legitimate reason for having a stricter definition of cross-linguistic equivalence.
However, it is important that the grounds for the definition are made clear.
13. This is usually demonstrated by the number of subentries that are needed to describe
them in monolingual dictionaries. A good English example is the verb run, which is given 42
numbered senses in WordNet and 31 in the Longman Dictionary of Contemporary English
(cf. Viberg 1998:346).
14. The polysemy of think can be explained diachronically. Historically the English verb
think represents a merger of two Old English verbs, þyncan 'seem' and its causative (or factitive) variant þencan 'cause to seem to oneself' (cf. Persson 1993).
15. The controversy over lexical decomposition is well reflected in Cruse's (1986: 22) dismissal of the terms semantic features and semantic components: "Representing complex
meanings in terms of simpler ones is as problem-ridden in theory as it is indispensable in
practice. I would like my semantic traits to carry the lightest possible burden of theory. No
claim is made, therefore, that they are primitive, functionally discrete, universal, or drawn
from a finite inventory. Nor is it assumed that the meaning of any word can be exhaustively
characterised by any finite set of them."
16. Two other approaches that have attempted to broaden the view of lexical meaning and
given it a more lexico-grammatical and cognitively oriented basis are frame semantics
(Fillmore 1985, Fillmore and Atkins 1992) and the pragmatically oriented contrastive model
suggested by Weigand (1998a). Frame semantics has been used as a language-neutral lexical
framework in the creation of corpus-based multilingual dictionary fragments for machine
translation and multilingual lexicography (see e.g. Heid 1995, 1996, Ostler 1995). Several
(partly corpus-based) studies exploring the conceptual field of emotion are presented in
Weigand (1998b). Another interesting attempt to combine syntagmatic and paradigmatic
approaches is the Functional Lexematic Model, as demonstrated in Faber and Mairal Usón
(1999) and Butler (1998, 1999).

Ahrenberg, L. and Merkel, M. 1996. On translation corpora and translation support tools:
A project report. In Aijmer et al. (eds), 183–200.
Aijmer, K. 1996. Swedish modal particles in a contrastive perspective. Language Sciences
18: 393–427.
Aijmer, K. 1998. Epistemic predicates in contrast. In Johansson and Oksefjell (eds),
Aijmer, K. 1999. Epistemic possibility in an English-Swedish perspective. In Hasselgård
and Oksefjell (eds), 301–326.
Aijmer, K., Altenberg, B. and Johansson, M. (eds). 1996. Languages in Contrast. Papers from
a Symposium on Text-based Cross-linguistic Studies. Lund: Lund University Press.
Aijmer, K., Altenberg, B. and Johansson, M. 1996. Text-based contrastive studies in
English. Presentation of a project. In Aijmer et al. (eds), 73–85.
Altenberg, B. 1999. Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In Hasselgård and Oksefjell (eds), 249–268.
Altenberg, B. and Aijmer, K. 2000. The English-Swedish Parallel Corpus: A resource for
contrastive research and translation studies. In Corpus Linguistics and Linguistic
Theory, C. Mair and M. Hundt (eds), 15–33. Amsterdam and Atlanta: Rodopi.
Atkins, B. T. S. 1994. A corpus-based dictionary. In The Oxford-Hachette French Dictionary,
xix–xxvi. Oxford and Paris: Oxford University Press/Hachette Livre.
Atkins, B. T. S., Levin, B. and Zampolli, A. 1994. Computational approaches to the lexicon:
an overview. In Computational Approaches to the Lexicon, B. T. Sue Atkins and A.
Zampolli (eds), 17–45. Oxford and New York: Oxford University Press.
Bahns, J. 1993. Lexical collocations: a contrastive view. ELT Journal 47: 56–63.
Baker, M. 1993. Corpus linguistics and translation studies: Implications and applications.
In Baker et al. (eds), 233–250.
Baker, M. 1995. Corpora in translation studies: An overview and some suggestions for
future research. Target 7: 223–243.
Baker, M. 1999. The role of corpora in investigating the linguistic behaviour of professional
translators. International Journal of Corpus Linguistics 4: 281–298.
Baker, M., Francis, G. and Tognini Bonelli, E. (eds). 1993. Text and Technology. In Honour of
John Sinclair. Amsterdam and Philadelphia: Benjamins.
Bally, Ch. 1950. Linguistique générale et linguistique française. 3rd ed. Berne: A. Francke.
Barlow, M. 1995. ParaConc: a concordancer for parallel texts. Computers and Texts 10: 14–16.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. 1999. Longman Grammar of
Spoken and Written English. Harlow: Longman.
Bläser, B. 1995. TransLexis: An integrated environment for lexicon and terminology management. In Steffens (ed.), 159–173.
Botley, S. P., McEnery, A. M. and Wilson, A. (eds). 2000. Multilingual Corpora in Teaching
and Research. Amsterdam and Atlanta: Rodopi.
Bresnan, J. (ed.) 1982. The Mental Representation of Grammatical Relations. Cambridge,
Mass.: MIT Press.
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer,
R. L. and Roossin, P. S. 1990. A statistical approach to machine translation.
Computational Linguistics 16: 79–85.
Brown, P., Lai, J. and Mercer, R. 1991. Aligning sentences in parallel corpora. In Proceedings
of 29th Annual Meeting of the Association for Computational Linguistics (Morristown,
NJ), 169–176.
Butler, C. S. 1998. Enriching the Functional Grammar lexicon. In The Structure of the
Lexicon in Functional Grammar, H. Olbertz, K. Hengeveld and J. Sánchez García (eds),
171–194. Amsterdam and Philadelphia: Benjamins.
Butler, C. S. 1999. Some contributions of corpus linguistics to the functional lexematic
model. In Estudios funcionales sobre léxico, sintaxis y traducción. Un homenaje a
Leocadio Martín Mingorance, M.-J. Feu Guijarro and S. Molina Plaza (eds), 19–37.
Cuenca: Universidad de Castilla La Mancha.
Calzolari, N. 1996. Lexicon and corpus: a multi-faceted interaction. In Gellerstam et al.
(eds), 3–16.
Chesterman, A. 1998. Contrastive Functional Analysis. Amsterdam: Benjamins.
Church, K. W. and Gale, W. A. 1991. Concordances for parallel texts. In Proceedings of the
Seventh Annual Conference of the UW Centre for the New OED and Text Research
(Oxford), 40–62. Oxford: Oxford University Press and Waterloo, Ontario: UW Centre
for the New OED and Text Research.
Church, K. W. and Hanks, P. 1989. Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Meeting of ACL (Vancouver, B.C.), 76–83.
Clear, J. 1993. From Firth principles. Computational tools for the study of collocation. In
Baker et al. (eds), 271–292.
Cowie, A. P. (ed.). 1998. Phraseology. Theory, Analysis, and Applications. Oxford: Clarendon Press.
Cruse, D. A. 1983. Review of A. Wierzbicka, Lingua mentalis: The semantics of natural language. Journal of Linguistics 19: 265–272.
Cruse, D. A. 1986. Lexical Semantics. Cambridge: Cambridge University Press.
Devos, F. 1996. Contrastive verb valency: Overview, criteria, methodology and applications. In Simon-Vandenbergen et al. (eds), 15–81.
Di Pietro, R. J. 1971. Language Structures in Contrast. Rowley, Mass.: Newbury House.
Dickens, A. and Salkie, R. 1996. Comparing bilingual dictionaries with a parallel corpus.
In Gellerstam et al. (eds), 551–559.
Dik, S. C. 1978. Functional Grammar. Amsterdam: North-Holland.
Dik, S. C. 1991. Functional grammar. In Linguistic Theory and Grammatical Description, F.
Droste and J. Joseph (eds), 247–274. Amsterdam and Philadelphia: Benjamins.
Di Sciullo, A.-M. and Williams, E. 1987. On the Definition of Word. MIT Press: Cambridge,
Dyvik, H. 1998. A translational basis for semantics. In Johansson and Oksefjell (eds),
Ebeling, J. 1998. The Translation Corpus Explorer: A browser for parallel texts. In
Johansson and Oksefjell (eds), 101–112.
Ebeling, J. 1999. Presentative Constructions in English and Norwegian. A Corpus-based
Contrastive Study. Oslo: Unipub forlag.
Erjavec, T., Lawson, A. and Romary, L. (eds). 1998. East Meets West: A Compendium of
Multilingual Resources. Mannheim: Institut für Deutsche Sprache/TELRI Association e.V.
Faber, P. B. and Mairal Usón, R. 1999. Constructing a Lexicon of English Verbs. Berlin and
New York: Mouton de Gruyter.
Fabricius-Hansen, C. 1999. Bei dieser Gelegenheit – on this occasion – ved denne anledningen. German bei – a puzzle in a translational perspective. In Hasselgård and Oksefjell
(eds), 231–248.
Filipović, R. 1969. The choice of the corpus for the contrastive analysis of Serbo-Croatian
and English. The Yugoslav Serbo-Croatian-English Contrastive Project, B. Studies 1,
37–46. Institute of Linguistics, University of Zagreb.
Filipović, R. 1971. The Yugoslav Serbo-Croatian-English Contrastive Project. In Papers in
Contrastive Linguistics, G. Nickel (ed.), 107–114. Cambridge: Cambridge University
Fillmore, C. J. 1985. Frames and the semantics of understanding. Quaderni di Semantica 6:
Fillmore, C. J. 1988. The mechanisms of Construction Grammar. Proceedings of the 14th
Annual Meeting of the Berkeley Linguistic Society, 35–55. Berkeley: University of
Fillmore, C. J. and Atkins, B. T. 1992. Toward a frame-based lexicon: The semantics of RISK
and its neighbors. In Lehrer and Kittay (eds), 75–102.
Firth, J. R. 1957. A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis
(special volume of the Philological Society), Oxford, 1–32.
Fisiak, J. 1981. Some introductory notes concerning contrastive linguistics. In Contrastive
Linguistics and the Language Teacher, Fisiak (ed.), 1–11. Oxford: Pergamon.
Francis, G. 1993. A corpus-driven approach to grammar. In Baker et al. (eds), 137–156.
Francis, W. N. and Kučera, H. 1979. Manual of Information to Accompany a Standard Sample
of Present-Day Edited American English, for Use with Digital Computers. Department of
Linguistics, Brown University, Providence, RI.
Gale, W. and Church, K. W. 1991. A program for aligning sentences in bilingual corpora. In
Proceedings of 29th Annual Meeting of the Association for Computational Linguistics
(Morristown, NJ), 177–184.
Gazdar, G., Klein, E., Pullum, G., and Sag, I. 1985. Generalized Phrase Structure Grammar.
Cambridge, Mass.: Harvard University Press.
Gellerstam, M. 1986. Translationese in Swedish novels translated from English. In
Translation Studies in Scandinavia, L. Wollin and H. Lindquist (eds), 88–95. Lund:
CWK Gleerup.
Gellerstam, M. 1996. Translations as a source for cross-linguistic studies. In Aijmer et al.
(eds), 53–62.
Gellerstam, M., Järborg, J., Malmgren, S-G., Norén, K., Rogström, L. and Röjder Papmehl,
C. (eds). 1996. Euralex '96 proceedings I–II. Papers submitted to the Seventh EURALEX
International Congress on Lexicography in Göteborg, Sweden. Göteborg: Department of
Swedish, University of Göteborg.
Goldberg, A. 1995. A Construction Grammar Approach to Argument Structure. Chicago:
Chicago University Press.
Granger, S. 1996. From CA to CIA and back: an integrated approach to computerized bilingual and learner corpora. In Aijmer et al. (eds), 37–51.
Granger, S. 1998. Prefabricated patterns in advanced EFL writing: collocations and formulae. In Cowie (ed.), 145–160.
Grefenstette, G., Heid, U., Schultze, B. M., Fontenelle, T. and Gerardy, C. 1996. The DECIDE
project: Multilingual collocation extraction. In Gellerstam et al. (eds), 93–107.
Guillemin-Flescher, J. 1981. Syntaxe comparée du français et de l'anglais. Problèmes de traduction. Paris: Ophrys.
Halliday, M. A. K. 1966. Lexis as a linguistic level. In In Memory of J. R. Firth, C. E. Bazell,
J. C. Catford, M. A. K. Halliday and R. H. Robins (eds), 148–162. London: Longmans.
Halliday, M. A. K. 1994. An Introduction to Functional Grammar. 2nd ed. London: Edward
Hartmann, R. R. K. 1996. Contrastive textology and corpus linguistics: On the value of parallel texts. Language Sciences 18: 947–957.
Hasselgård, H. and Oksefjell, S. (eds). 1999. Out of Corpora. Studies in Honour of Stig
Johansson. Amsterdam and Atlanta: Rodopi.
Heid, U. 1995. Relating parallel monolingual lexicon fragments for translation purposes.
In Steffens (ed.), 231–251.
Heid, U. 1996. Creating a multilingual data collection for bilingual lexicography from parallel monolingual lexicons. In Gellerstam et al. (eds), 573–590.
Heyn, M. 1998. Translation memories: insights and prospects. In Unity in Diversity.
Current Trends in Translation Studies, L. Bowker, M. Cronin, D. Kenny and J. Pearson
(eds), 123–136. Manchester: St. Jerome Publishing.
Hofland, K. 1996. A program for aligning English and Norwegian sentences. In Research in
Humanities Computing, S. Hockey, N. Ide and G. Perissinotto (eds), 165–178. Oxford:
Oxford University Press.
Hofland, K. and Johansson, S. 1998. The Translation Corpus Aligner: A program for automatic alignment of parallel texts. In Johansson and Oksefjell (eds), 87–100.
Howarth, P. A. 1996. Phraseology in English Academic Writing. Some Implications for
Language Learning and Dictionary Making. Tübingen: Niemeyer.
Howarth, P. A. 1998. The phraseology of learners' academic writing. In Cowie (ed.),
Hudson, R. 1984. Word Grammar. Oxford: Blackwell.
Ide, N., Greenstein, D. and Vossen, P. (eds). 1998. Special issue on EuroWordNet. Computers
and the Humanities 32 (2–3).
Ide, N. and Véronis, J. 1995. Knowledge extraction from machine-readable dictionaries:
An evaluation. In Steffens (ed.), 19–34.
Isabelle, P., Dymetman, M., Foster, G., Jutras, J-M., Macklovitch, E., Perrault, F., Ren, X. and
Simard, M. 1993. Translation analysis and translation automation. Proceedings of the
Fifth International Conference on Theoretical and Methodological Issues in Machine
Translation (TMI93), Kyoto, 201–217.
Ivir, V. 1983. A translation-based model of contrastive analysis. Jyväskylä Cross-Language
Studies 9: 171–178.
Ivir, V. 1987. Functionalism in contrastive analysis and translation studies. In
Functionalism in Linguistics, R. Dirven and V. Fried (eds), 471–481. Amsterdam and
Philadelphia: Benjamins.
James, C. 1980. Contrastive Analysis. London: Longman.
Johansson, S. 1997. Using the English-Norwegian Parallel Corpus – a corpus for contrastive analysis and translation studies. In Lewandowska-Tomaszczyk and Melia
(eds), 282–296.
Johansson, S. 1998. On the role of corpora in cross-linguistic research. In Johansson and
Oksefjell (eds), 1–24.
Johansson, S. and Løken, B. 1997. Some Norwegian discourse particles and their English
correspondences. In Sounds, Structures and Senses. Essays Presented to Niels Davidsen-Nielsen on the Occasion of his Sixtieth Birthday, C. Bache and A. Klinge (eds), 149–170.
Odense: Odense University Press.
Johansson, S. and Oksefjell, S. (eds). 1998. Corpora and Cross-linguistic Research.
Amsterdam and Atlanta: Rodopi.
Kay, M. and Röscheisen, M. 1993. Text-translation alignment. Computational Linguistics
19: 121–142.
Kittay, E. F. 1987. Metaphor: Its Cognitive Force and Linguistic Structure. Oxford: Clarendon
Kittay, E. F. and Lehrer, A. 1992. Introduction. In Lehrer and Kittay (eds), 1–18.
Krzeszowski, T. P. 1990. Contrasting Languages. Berlin: Mouton de Gruyter.
Langacker, R. W. 1987. Foundations of Cognitive Grammar, Vol. I: Theoretical Prerequisites.
Stanford: Stanford University Press.
Langacker, R. W. 1991. Foundations of Cognitive Grammar, Vol. II. Stanford: Stanford
University Press.
Lehrer, A. 1974. Semantic Fields and Lexical Structure. Amsterdam: North-Holland.
Lehrer, A. and Kittay, E. F. (eds). 1992. Frames, Fields, and Contrasts. New Essays in Semantic
and Lexical Organization. Hillsdale, N.J.: Lawrence Erlbaum.
Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago:
University of Chicago Press.
Lewandowska-Tomaszczyk, B. and Melia, P. J. (eds). 1997. Practical Applications in Language
Corpora. Lodz: Lodz University.
Mauranen, A. 1999. Form and sense relations as seen through parallel corpora. Paper presented at the Third European TELRI Seminar on Translation Equivalence Theory
and Practice (Montecatini, 1997). Mannheim: TELRI.
Merkel, M. 1999. Understanding and Enhancing Translation by Parallel Text Processing.
Department of Computer and Information Science, University of Linköping.
Miller, G. A. and Johnson-Laird, P. N. 1976. Language and Perception. Cambridge, Mass.:
Harvard University Press.
Miller, G. A. and Fellbaum, C. 1991. Semantic networks in English. Cognition 41: 197–229.
Moon, R. 1996. Data, description, and idioms in corpus lexicography. In Gellerstam et al.
(eds), 245–256.
Nuyts, J. 1997. How do you think? In A Fund of Ideas: Recent Developments in Functional
Grammar, C. S. Butler, J. H. Connolly, R. A. Gatward & R. M. Vismans (eds), 3–18.
Amsterdam: IFOTT.
Oakes, M. and McEnery, T. 2000. Bilingual text alignment – an overview. In Botley et al.
(eds), 1–37.
Ostler, N. 1995. Perception vocabulary in five languages – towards an analysis using frame
elements. In Steffens (ed.), 219–230.
Paulussen, H. 1999. A Corpus-based Contrastive Analysis of English on/up, Dutch op and
French sur within a Cognitive Framework. Ph.D. dissertation, Faculty of Letters and
Philosophy, University of Gent.
Persson, G. 1993. Think in a panchronic perspective. Studia Neophilologica 65: 3–18.
Peters, C. 1996. From parallel to comparable text corpora. In Gellerstam et al. (eds),
Pollard, C. and Sag, I. 1994. Head-Driven Phrase Structure Grammar. Chicago: University of
Chicago Press.
Rappaport, M. and Levin, B. 1988. What to do with theta-roles. In Syntax and Semantics
21: Thematic Relations, W. Wilkins (ed.), 7–36. New York: Academic Press.
Ridings, D. 1998. PEDANT: Parallel texts in Göteborg. Lexikos 8: 243–268.
Ringbom, H. 1994. Contrastive analysis. In The Encyclopedia of Language and Linguistics,
R.E. Asher and J.M.Y. Simpson (eds), 737–742. Oxford: Pergamon Press.
Roberts, R. P. and Montgomery, C. 1996. The use of corpora in bilingual lexicography. In
Gellerstam et al. (eds), 457–464.
Roos, E. 1976. Contrastive collocational analysis. Papers and Studies in Contrastive
Linguistics 5: 65–75.
Rosch, E. 1975. Cognitive representations of semantic categories. Journal of Experimental
Psychology 104: 192–233.
Sajavaara, K. 1996. New challenges for contrastive linguistics. In Aijmer et al. (eds), 17–36.
Salkie, R. 1997. Naturalness and contrastive linguistics. In Lewandowska-Tomaszczyk and
Melia (eds), 297–312.


Schmied, J. 1998. Differences and similarities of close cognates: English with and German
mit. In Johansson and Oksefjell (eds), 255–275.
Schmied, J. and Schäffler, H. 1996. Approaching translationese through parallel and translation corpora. In Synchronic Corpus Linguistics. Papers from the Sixteenth International
Conference on English Language Research on Computerized Corpora (ICAME 16), C. E.
Percy, C.F. Meyer and I. Lancashire (eds), 41–56. Amsterdam and Atlanta: Rodopi.
Schwarze, C. (ed.). 1985. Beiträge zu einem kontrastiven Wortfeldlexikon
Deutsch–Französisch. Tübingen: Gunter Narr.
Simard, M., Foster, G., Hannan, M-L., Macklovitch, E. and Plamondon, P. 2000. Bilingual
text alignment: where do we draw the line? In Botley et al. (eds), 38–64.
Simard, M., Foster, G. F. and Isabelle, P. 1992. Using cognates to align sentences in bilingual
corpora. In Proceedings of the Fourth International Conference on Theoretical and
Methodological Issues in Machine Translation (TMI92) (Montreal), 67–81.
Simard, M., Foster, G. F. and Perrault, F. 1993. TransSearch: un concordancier bilingue.
Centre d'innovation en technologies de l'information. Laval: Canada.
Simon-Vandenbergen, A-M. 1998. I think and its Dutch equivalents in parliamentary
debates. In Johansson and Oksefjell (eds), 297–317.
Simon-Vandenbergen, A-M., Taeldeman, J. and Willems, D. 1996. Introducing CONTRAGRAM or why we need contrastive verb valency. In Simon-Vandenbergen et al. (eds),
Simon-Vandenbergen, A-M., Taeldeman, J. and Willems, D. (eds). 1996. Aspects of
Contrastive Verb Valency. Studia Germanica Gandensia 40, University of Gent.
Sinclair, J. 1985. Lexicographic evidence. In Dictionaries, Lexicography and Language
Learning, R. Ilson (ed.), 81–92. Oxford: Pergamon.
Sinclair, J. 1987a. Collocation: a progress report. In Language Topics. Essays in Honour of
Michael Halliday, R. Steele and T. Threadgold (eds), 319–331. Amsterdam and
Philadelphia: Benjamins.
Sinclair, J. (ed.). 1987b. Looking Up: An Account of the COBUILD Project in Lexical
Computing and the Development of the Collins COBUILD English Language Dictionary.
London: HarperCollins.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, J. 1996a. An international project in multilingual lexicography. In Sinclair et al.
(eds), 179–196.
Sinclair, J. 1996b. Cross-language semantic links. Data-driven multilingual lexicons.
Unpublished report from Workshop on Multilingual Lexical Semantics, 19–21 June
1998. Institut für deutsche Sprache/The Tuscan Word Centre/PAROLE German
Sinclair, J. 1998. The lexical item. In Weigand (ed.), 1–24.
Sinclair, J., Payne, J. and Pérez Hernández, C. (eds). 1996. Corpus to corpus: A study of translation equivalence. Special issue of International Journal of Lexicography 9: 179–276.
Steffens, P. 1995. Introduction. In Steffens (ed.), 1–15.
Steffens, P. (ed.). 1995. Machine Translation and the Lexicon. Third International EAMT
Workshop, Heidelberg, April 26–28 1993. Berlin/New York: Springer.
Svensén, B. 1993. Practical Lexicography. Principles and Methods of Dictionary-Making.
Oxford: Oxford University Press.

Talmy, L. 1985. Lexicalization patterns: semantic structures in lexical forms. In Language
Typology and Syntactic Description, Vol. 3, T. Shopen (ed.), 57–149. Cambridge:
Cambridge University Press.
Taylor, J. 1995. Linguistic Categorization. Prototypes in Linguistic Theory. 2nd ed. Oxford:
Clarendon Press.
Teubert, W. 1996. Comparable or parallel corpora? In Sinclair et al. (eds), 238–264.
Teubert, W. (ed.). 1998. Workshop on Multilingual Lexical Semantics. Mannheim: Institut für
deutsche Sprache/The Tuscan Word Centre.
Teubert, W., Tognini Bonelli, E. and Volz, N. (eds). 1998. Proceedings of the Third European
Seminar Translation Equivalence, Montecatini Terme, Italy, October 16–18, 1997.
Mannheim/The Tuscan Word Centre: The TELRI Association e. V.
Tognini Bonelli, E. 1996. Towards translation equivalence from a corpus linguistic perspective. In Sinclair et al. (eds), 197–217.
Tognini Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam & Philadelphia: Benjamins.
Van Valin, R. D. 1993. A synopsis of Role and Reference Grammar. In Advances in Role and
Reference Grammar, R. D. Van Valin (ed.), 1–164. Amsterdam and Philadelphia:
Van Valin, R. D. and LaPolla, R. J. 1997. Syntax: Structure, Meaning and Function.
Cambridge: Cambridge University Press.
Véronis, J. (ed.). 2000. Parallel Text Processing. Alignment and Use of Translation Corpora.
Berlin: Kluwer Academic Publishers.
Viberg, Å. 1993. Crosslinguistic perspectives on lexical organization and lexical progression. In Progression and Regression in Language. Sociocultural, Neuropsychological and
Linguistic Perspectives, K. Hyltenstam and Å. Viberg (eds), 340–385. Cambridge:
Cambridge University Press.
Viberg, Å. 1996a. Cross-linguistic lexicology. The case of English go and Swedish gå. In
Aijmer et al. (eds), 151–182.
Viberg, Å. 1996b. The meanings of Swedish dra 'pull': a case study of lexical polysemy. In
Gellerstam et al. (eds), 293–308.
Viberg, Å. 1998. Contrasts in polysemy and differentiation. Running and putting in
English and Swedish. In Johansson and Oksefjell (eds), 343–376.
Viberg, Å. 1999. Polysemy and differentiation in the lexicon. Verbs of physical contact in
Swedish. In Cognitive Semantics. Meaning and Cognition, J. Allwood and P. Gärdenfors
(eds), 87–129. Amsterdam: Benjamins.
Vinay, J. P. and Darbelnet, J. 1969. Stylistique comparée du français et de l'anglais. Paris:
Vossen, P. (ed.). 1998. EuroWordNet: A Multilingual Database with Lexical Semantic
Networks. Dordrecht: Kluwer Academic.
Wandruszka, M. 1969. Sprachen. Vergleichbar und unvergleichlich. München: Piper.
Weigand, E. 1998a. Contrastive lexical semantics. In Weigand (ed.), 25–44.
Weigand, E. (ed.). 1998b. Contrastive Lexical Semantics. Amsterdam and Philadelphia:
Wools, D. 1998. Multiconcord. Birmingham: CFL Software Development.


Cross-Linguistic Equivalence

Two types of translation equivalence*

Raphael Salkie


Translation equivalence is an elusive notion which has been debated vigorously
in the literature. If a source text and target text diverge in some way, we need to
set up two levels of analysis so that they are different on one level but equivalent
on the other. The difficult challenge is to define these levels rigorously: as Gutt
(1991) rightly points out, a popular solution which distinguishes the meaning
of a text from its communicative effect is flawed because the former notion is
vague and the latter is untestable.
This paper argues that translation corpora offer a new perspective on this
old issue. One of the advantages of corpora is that they reveal patterns which
would be difficult to find otherwise. With a monolingual corpus, the patterns
usually involve a phenomenon occurring more frequently than expected. The
phenomenon in question is typically one of these:

linguistic items such as words and phrases

association patterns (Biber et al. 1998:58) between items

If a word occurs more frequently in a corpus than we would expect, then that is
information in the corpus which is of interest to researchers.1 Similarly, if two
items occur together more frequently than we would expect, we have found a
regular association pattern which calls for explanation.2 This information was
not deliberately entered into the corpus by the writers of the texts. What happens is that language users make a series of unconscious choices which the corpus incorporates and which the analyst can find using frequency counts,
assuming that the corpus is large enough to yield patterns which are statistically significant.

With a translation corpus, the patterns that are revealed involve correspondences between words and expressions in different languages. Unlike monolingual corpora, the interesting cases with translations tend to be those where the
correspondence is less frequent than anticipated. For example, if an English
word in our corpus is translated by the expected French word less frequently
than we would anticipate, and other translations are used instead, then we have
found a puzzle which needs an explanation. With translation corpora our
expectations are based on our linguistic competence as bilinguals. For more
thorough statements of correspondence we turn to a good bilingual dictionary,
though we often find that the corpus contains correspondences which are not
mentioned in dictionaries, as we shall see below.
This paper looks at two examples of unexpected correspondences that
were found in a translation corpus. The patterns that came to light were different, and we examine why this might be. We also consider the implications for
lexicographers, translators and contrastive linguists. We made use of the
INTERSECT corpus (Salkie 1997, 2000), consisting of about 1.5 million words
in French and English, and about 800,000 words in German and English.
Details of the corpus are given in Appendix 1.

2. The data
We looked at two words in the corpus: the German word kaum and its equivalents in English (Appendix 2), and the English word contain with its counterparts in French (Appendix 3).3
For kaum the dictionaries lead us to expect English translations using
hardly, along with the less common alternatives scarcely or barely. The corpus
produced 61 instances, 38 of them in the fiction texts and 23 in non-fiction.
The fiction examples contain one of the expected equivalents 32 times:
hardly 18
scarcely 10
barely 4

The other correspondences were a negative expression (13–14 in Appendix 2),
almost + negative (15–16), but (17–18), and three instances of time expressions
using as soon as or upon (19–24). As the fiction is mostly from the 18th or 19th
century, these findings are unsurprising, since scarcely and barely perhaps
have an old-fashioned feel to them in contemporary English.

In the non-fiction examples, on the other hand, the expected equivalents
are far less frequent:
hardly 5
scarce 1

(The scarce example is from the Communist Manifesto, another 19th century
text). The remaining 17 cases cluster as follows:
negative expression 5
little 4
almost + negative 3
hard/difficult/impossible 3
less and less 1
largely + negative 1

Thus only a quarter (26%) of the non-fiction examples are expected. Note,
however, that the unexpected translations are not random: they fall into four
groups with only two singleton translations (of which less and less is closely
akin to little, and largely + negative is similar to almost + negative). A set of patterns thus emerges from the non-fiction equivalents of kaum, though it is not
the single pattern that the dictionary leads us to expect, and which we find in
the fiction equivalents.
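
The contrast between the two text types can be checked directly from the counts given above. The following Python sketch is an illustration only: the figures are those reported in the text, while the function and variable names are assumptions introduced here.

```python
def expected_share(expected: int, total: int) -> float:
    """Percentage of instances rendered by one of the expected equivalents."""
    return 100 * expected / total


# Counts for kaum as reported in the text.
fiction_expected = 18 + 10 + 4      # hardly, scarcely, barely
fiction_total = 38
nonfiction_expected = 5 + 1         # hardly, scarce
nonfiction_total = 23

fiction_share = expected_share(fiction_expected, fiction_total)
nonfiction_share = expected_share(nonfiction_expected, nonfiction_total)

print(f"fiction: {fiction_share:.0f}% expected")
print(f"non-fiction: {nonfiction_share:.0f}% expected")
```

The non-fiction figure reproduces the 26% quoted above; the fiction figure is not stated in the text but follows from the same counts.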
Compare this with the results for contain and its French counterparts. A total
of 295 examples were found, of which 171 (58%) were translated by the expected
contenir. In the remaining 124 examples only two significant groupings emerged:
32 examples using figurer + preposition, and 9 using publier (all the latter were in
United Nations documents, which suggests a house style). Some of the other renderings are given in Appendix 3: they contain a large number of singleton equivalents, including about 20 where there is no apparent expression corresponding to
contain in the other language (4–18 are a small sample).
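
One way to set the two sets of figures side by side is to ask what proportion of the unexpected renderings fall into a recurrent grouping. The sketch below does this for the non-fiction kaum data and the contain data. The counts come from the text; the "grouped share" itself is an illustrative measure assumed here, not one proposed in this paper, and the names are hypothetical.

```python
def profile(total: int, expected: int, groupings: dict[str, int]) -> dict[str, float]:
    """Summarise an item's translations as two percentages.

    expected_pct: share of all instances using the expected equivalent.
    grouped_pct:  share of the unexpected renderings that belong to a
                  recurrent grouping (as opposed to one-off solutions).
    """
    unexpected = total - expected
    grouped = sum(groupings.values())
    return {
        "expected_pct": 100 * expected / total,
        "grouped_pct": 100 * grouped / unexpected,
    }


# kaum in the non-fiction texts: 23 instances, 6 expected (hardly, scarce).
kaum = profile(23, 6, {
    "negative expression": 5,
    "little": 4,
    "almost + negative": 3,
    "hard/difficult/impossible": 3,
})

# contain: 295 instances, 171 rendered by contenir.
contain = profile(295, 171, {
    "figurer + preposition": 32,
    "publier": 9,
})

for name, result in (("kaum", kaum), ("contain", contain)):
    print(name, {key: round(value) for key, value in result.items()})
```

On these figures the expected equivalent is rarer for kaum than for contain, but the unexpected renderings of kaum are far more likely to belong to a recurrent grouping.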

3. Labelling the types of equivalence

We thus need to distinguish a case like kaum → English, where the expected
translation equivalent (at least in the non-fiction texts) is relatively rare, from a
case like contain → French, where the expected equivalent may be relatively
common but where there are many unique equivalents. One possible pair of
labels would be to call kaum 'translationally ambiguous' into English, but contain
'translationally vague' into French. These labels capture one important
part of the distinction: a translator who has to translate kaum into English is in a
similar situation to a linguist who wants to state the meanings of an ambiguous
word. The linguist is faced with multiple senses, and the translator has to consider multiple equivalents. A translator who is required to translate contain into
French is like a linguist confronted by a vague word: in the monolingual case,
just listing all the possible interpretations of the word that have been found so
far in different contexts misses the point that it is the contexts rather than the
word which are doing most of the work. In the translation situation, simply listing all the translation equivalents found so far misses the point that it is the
translator who is doing most of the work by creating a new solution each time.
I would argue, however, that these labels are not suitable because they suggest that the translational behaviour of kaum and contain is directly linked to
their semantics: an ambiguous word might be thought to be translationally
ambiguous into any L2, and a vague word might be thought to be translationally vague into any L2. These suggestions are not correct: the English word hard,
for example, is ambiguous between the senses 'not easy' and 'not soft'. Whether
hard is also translationally ambiguous depends on the L2: it is not translationally ambiguous into French, where dur has both senses, but it is translationally
ambiguous into German, where schwer / schwierig correspond to the first sense
and hart to the second. Similar counterexamples can easily be constructed for
vagueness and translational vagueness.
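
The point that translational ambiguity depends on the language pair can be made concrete with a small Python sketch. The data are the equivalents of hard mentioned above; the test itself (an item counts as translationally ambiguous into an L2 if no single L2 word covers all of its senses) is a simplifying assumption made here for illustration, and the names are hypothetical.

```python
# Senses of English "hard" and the equivalents given in the text.
EQUIVALENTS = {
    "French": {"not easy": {"dur"}, "not soft": {"dur"}},
    "German": {"not easy": {"schwer", "schwierig"}, "not soft": {"hart"}},
}


def translationally_ambiguous(senses: dict[str, set[str]]) -> bool:
    """True if no single L2 word is available for every sense."""
    shared = set.intersection(*senses.values())
    return not shared


for language, senses in EQUIVALENTS.items():
    print(language, translationally_ambiguous(senses))
```

The same word thus receives a different answer for each L2, which is why the label cannot simply be read off the semantics of the source item.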
A better pair of labels emerges if we focus instead on the demands that
kaum and contain make on translators. For a word like kaum, the strategy that
skilled translators seem to have adopted can be stated informally like this:
Don't choose the expected equivalent, because the corpus shows that it is not
very common. Instead, look at a wider range of equivalents in a bilingual dictionary based on a translation corpus.4

For a word like contain, on the other hand, the implicit strategy looks like this:
Use the expected equivalent if you can. If you can't, don't bother to look at a
wider range of equivalents in a bilingual dictionary, because they are unlikely
to work in your specific context. Instead, invent a new equivalent of your own.

The two strategies are near the extreme ends of a spectrum of translation
strategies. At one extreme are items which are always translated the same way:
an example might be television → French. At the other end is the unlikely but
logically possible case of items which have different equivalents each time they
occur.5 Items at the former extreme can be called translationally systematic,
while items with unpredictable equivalents can be called translationally unsystematic. Thus kaum → English is closer to the systematic end of this spectrum,
while contain → French is nearer the unsystematic end. (The difference
between the kaum and television cases is that television → French is simple
while kaum → English is complex.) To translate kaum successfully you need a
good (= exhaustive) dictionary. To find an equivalent for contain a dictionary
will help up to a point, but not as much as a good (= creative) translator.

4. Reasons for the two types of equivalence

In the case of kaum we seem to have learned something about the linguistic systems of English and German: information that could in principle be captured in an enriched bilingual dictionary which used corpus findings to record lexical information about these two systems. With contain the emphasis is
on unique creative solutions by translators, and this takes us away from the
underlying systems and firmly into textual practice. It appears, then, that
translational systematicity is best regarded as a relation between two linguistic
systems, whereas translational unsystematicity is a relation between textual
practice in two languages.
There are, however, some reasons for doubt. Firstly, the distinction
between translationally systematic and unsystematic items is to some extent an
artefact of the size of the corpus. If we had a far larger translation corpus, some
of the unique solutions in the contain cases would no doubt occur more than
once, in which case we could argue that consistent tendencies had emerged
which could in principle be recorded in a bilingual dictionary. When we call an
item translationally unsystematic, is that just a way of saying that we haven't
looked hard enough yet to find the system?
Secondly, the distinction between the underlying linguistic system and
textual practice is highly problematic when we look at translations. Since
Saussure distinguished langue from parole it has been a fundamental principle
of linguistics that these two domains are distinct. When a translation corpus
reveals systematic differences in the textual practice of two languages, however,
the question arises of where these differences are located. Does the textual difference that we have proposed between English and French reflect a difference
in the underlying systems of the two languages? Either possible answer to this
question leads to difficult problems. If the answer is yes, then we are committed
to the principle that the underlying linguistic system of a language can include
frequency rules which determine how often the resources in that system are
used. Many linguists would feel uncomfortable with such a principle. If the
answer is no, on the other hand, we have to find an alternative way to account
for differences in textual practice between languages. It is not clear what this
alternative way might be.
Thirdly, the differences between the kaum case and the contain case are
perhaps not as clear-cut as we have claimed so far. The two words occupy very
different places in the linguistic systems of their respective languages. Kaum is a
degree modifier, and so is its expected equivalent hardly, but what they can
modify is not the same. This is not surprising: items in closed grammatical
classes normally behave differently across languages. Contain, on the other
hand, is a verb that enters into relations of hyponymy with other expressions: at
the more general level we have the verb have (cf. example (34) in Appendix 3,
where the French uses avoir), and at the more specific level there are cases
where X contains Y can be specified as Y is published in X (cf. example (18)
in Appendix 3):
(33) Each document can [[contain]] one header, or one footer, or both.
(34) Un document peut avoir un en-tête ou un pied de page ou les deux à la
(17) Mr. LABERGE (Canada), on behalf of the sponsors, who had been joined
by Pakistan and Thailand, introduced the draft resolution [[contained]] in
document A/C.3/43/L.77 entitled "Human rights and mass exoduses".
(18) M. LABERGE (Canada) présente au nom des auteurs, auxquels se sont
joints le Pakistan et la Thaïlande, le projet de résolution publié sous la cote
A/C.3/43/L.77, intitulé "Droits de l'homme et exodes massifs".

These differences between kaum and contain clearly influence the type and
amount of creativity that translators deploy when they deal with these words.
Here we see that the underlying systems of different languages influence textual practice.

5. Understanding translators' resourcefulness

The two types of equivalence identified here thus raise difficult conceptual
issues. If we are to gain insight into the issues, two types of further research are
necessary. Firstly, we need larger translation corpora and more studies which
try to systematise the data that they produce. Different types of translation
equivalence, of the kind discussed here, have not been identified in the past
because the data was not available. As we remarked earlier, large and representative corpora are necessary if we hope to find significant linguistic patterns in
them. As empirical work of this kind with translation corpora progresses, the
distinction between translational systematicity and translational complexity
may become clearer or turn out to be illusory; and other types of translation
equivalence may emerge.
Secondly, we need conceptual clarification. The following remarks are
intended to be a small step in this direction. In both kaum-type examples and
contain-type examples, the translator has departed much of the time from the
most direct translation and has been resourceful in finding an equivalent which
works for this text in this context. The skill involved in creating a good translation
is partly linguistic and partly literary: in this respect, translating is like any type
of writing. For contrastive linguistics the literary dimension is not a primary concern: our task is to find linguistic patterns and explain them. Is there, though, a
way of isolating the linguistic part of translators' resourcefulness?
Here I think that the notion of modulation is helpful. Vinay and Darbelnet
(1958: 51) define this term as a change in the point of view from which a situation is regarded. This is a rather vague definition, and the examples given by
Vinay and Darbelnet and others such as Chuquet and Paillard (1987: 26–38)
and Van Hoof (1989: 126–130) cover a very wide area. I think that it is nonetheless an accurate description of much translational resourcefulness.
Consider again the German word kaum. If we were to attempt to specify the
sense of this word, it would be something like 'zero plus a small increment on
some scale'. If we now outline the sense of the various English equivalents of
kaum, we get something like this:
zero (38 in Appendix 2)
small quantity (48–54)
negative (38–46)
almost zero (56–58)
almost not (60)
mostly not (70)

These are modulations in the sense used here: different ways of viewing the
same situation. In some cases the meaning is arguably identical to the meaning
of kaum: a small quantity of something is the same as zero plus a small increment.
In other cases the meaning is not identical: zero is not the same as zero
plus a small increment. If we can compare all the translations of kaum in this
way, and then look at the same conceptual area with other language-pairs, we
can start to get a picture of how many ways this same situation can be viewed.
We will be compiling a kind of multilingual thesaurus, where a large number of
ways of representing the same concept is displayed. In another paper I have
outlined a practical framework for compiling such a multilingual thesaurus
(cf. Salkie 1999).
For the issues raised in this paper, taking modulation as a starting point has
two advantages. Firstly, the notion of modulation is located conveniently in
between linguistic systems and textual practice. In our analysis of kaum we are
not just talking about semantic equivalents of kaum in other languages, which
would be a comparison of linguistic systems. Nor are we simply talking about
textual creativity in two languages, which would be a matter of textual practice.
We are talking about different ways of viewing the same situation, which is
partly a semantic matter but is also partly textual and stylistic. The semantic
part probably involves some process of semantic decomposition as a result of
which the point of view is shifted. We need to collect evidence about equivalents of kaum in various languages, and we will then be able to draw links
between systemic constraints and textual creativity. In the case of English we
will be able to explain more accurately the avoidance of hardly which is evident
in the (non-fiction) data: we will have a basis for conceiving of the linguistic
system and its textual realisation as separate but related.
Secondly, modulation falls just on the right side of the line which separates
linguistic resourcefulness from literary resourcefulness. It is creative, but not
to such an extent that it cannot be systematised. Translations where the source
text and target text diverge more radically than modulation are on the other
side of the line, and can be ignored by contrastive linguists. Thus Vinay and
Darbelnet's most radical translation strategy, which they call adaptation,
involves replacing the situation referred to in the source text by a new situation
in the target text (1958: 52–4). As Chuquet and Paillard (1987: 10) note, this
goes beyond the kind of phenomenon which a linguistic approach to translation should aim to analyse. With modulation, on the other hand, at least we are
dealing with the same situation. It remains to be seen, of course, whether these
conceptual distinctions can be maintained in the light of data from large translation corpora, but at least they offer a starting point.
Consider now the equivalents of contain. The question once again is
whether any particular translation into French is just a case of modulation, or
whether a different situation is being represented. In example (25–6) from
Appendix 3, where contain corresponds to a construction with reposer, it seems
clear that the English and French sentences are not semantically equivalent, but
this is nonetheless an instance of modulation: the situation is being described
from a different point of view (arguably the French sentence is also more specific than the English):
(25) The big kitchen table was covered with wicker baskets [[containing]] the
(26) La grande table de la cuisine était couverte de panetons d'osier où reposait

Compare this with example (45–6):

(45) Amongst the number of letters we found waiting for us at Naples was one
[[containing]] an unexpected piece of information – a chair at the
College de France had fallen vacant and my name had been several times
mentioned in connexion with it;
(46) Dans l'important courrier qui, depuis longtemps, nous attendait à Naples,
une lettre m'apprenait brusquement que, se trouvant vacante une chaire
au Collège de France, mon nom avait été plusieurs fois prononcé; [FICTION\GIDE]

Here we might be reluctant to accept that this is modulation: to contain some
information is not the same situation as to tell someone something. They are
very close, however, and in context perhaps they are the same in the relevant
Finally, consider example (47–8):
(47) Apart from this, although the CGT and CTC did not have problems similar to those [[contained]] in the CSTC complaint, it was pointed out, at
the meeting with UTC officials, that this organisation had also lost several
trade unionists who had been murdered or disappeared.
(48) Cela mis à part, alors que la CGT et la CTC ne connaissaient pas des problèmes similaires à ceux qui sont dénoncés dans la plainte de la CSTC, les
dirigeants de l'UTC ont indiqué que cette organisation avait également
perdu divers syndicalistes qui étaient morts ou avaient disparu.

In this case we would probably be even less willing to accept that the words contained and dénoncés are modulations of each other: but taking the sentences as
a whole, perhaps this continues to be an instance of modulation.

By distinguishing modulation from other cases we thus have a criterion for
deciding which translations should form raw material for our multilingual thesaurus and which are too divergent and idiosyncratic. The criterion is not
always easy to apply, as the contain examples show, but it offers a starting point
for delimiting and organising the types of translation equivalence where we are
likely to find systematic regularities.6

6. Conclusion
The problems of translation equivalence, and of langue versus parole (the
underlying system of a language as opposed to the use of this system in texts),
have usually been seen as conceptual ones. One of the benefits of using translation corpora as sources of data is that they bring a new empirical dimension to
bear on these problems. Ideally the findings from such corpora can clarify the
conceptual aspects of the problem; at the same time, this conceptual clarity will
bring new vigour to our empirical work. In the process, contrastive linguistics
will be able to make distinctive contributions to translation theory and to linguistic theory in general.

* Constructive criticism by the editors of this volume improved this paper a great deal,
and I thank them both. They are not responsible for any remaining defects.

1. The notion 'more frequently than we would expect' is usually itself based on data from a
corpus. Normally we take a large, representative corpus as the standard against which we
compare a new corpus. We might, for instance, take the British National Corpus (BNC) as
such a standard for English. If we then take a new corpus we would compare normed word
frequencies in our corpus with the frequencies of the same words in the BNC. Any statistically significant differences would be of interest to researchers.
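
A comparison of normed frequencies of this kind might be sketched as follows. The note does not name a significance test; the log-likelihood statistic used here is one common choice in corpus work and is an assumption made for illustration, as are the function names and all the counts and corpus sizes.

```python
import math


def per_million(count: int, corpus_size: int) -> float:
    """Norm a raw count to a rate per million words."""
    return count * 1_000_000 / corpus_size


def log_likelihood(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """Log-likelihood (G2) for one word observed in two corpora."""
    total = count_a + count_b
    g2 = 0.0
    for observed, size in ((count_a, size_a), (count_b, size_b)):
        # Expected count if the word were equally frequent in both corpora.
        expected = size * total / (size_a + size_b)
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical figures: 150 hits in a 1-million-word corpus under study,
# 5,000 hits in a 100-million-word reference corpus.
new_rate = per_million(150, 1_000_000)
ref_rate = per_million(5_000, 100_000_000)
g2 = log_likelihood(150, 1_000_000, 5_000, 100_000_000)

print(f"per million: {new_rate:.0f} vs {ref_rate:.0f}; G2 = {g2:.1f}")
```

A G2 value above 3.84 corresponds to p < 0.05 with one degree of freedom, so a value well above that threshold would mark the word as worth a closer look.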
2. With association patterns we base the expected frequency of co-occurrence on the frequency of the two items in the corpus as a whole. If one word in every 12 in a corpus is the
word the, then we would expect the to collocate with any word w once for every 12 instances
of w. If we discover that the collocates much more often than this with w, then we have found
a linguistically interesting association pattern between the and w.
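
The calculation described in this note can be sketched in Python. The toy corpus, the function names and the decision to count only the position immediately before w are assumptions made for illustration.

```python
from collections import Counter


def expected_cooccurrences(tokens: list[str], node: str, collocate: str) -> float:
    """Expected number of times `collocate` fills a given slot next to `node`,
    based only on the overall frequency of the two items."""
    counts = Counter(tokens)
    return counts[node] * counts[collocate] / len(tokens)


def observed_cooccurrences(tokens: list[str], node: str, collocate: str) -> int:
    """Number of instances of `node` immediately preceded by `collocate`."""
    return sum(
        1
        for previous, current in zip(tokens, tokens[1:])
        if previous == collocate and current == node
    )


# Toy corpus, invented for this example.
tokens = "the cat sat on the mat and the dog saw the cat near the old mat".split()

expected = expected_cooccurrences(tokens, "cat", "the")
observed = observed_cooccurrences(tokens, "cat", "the")
print(f"expected {expected:.3f}, observed {observed}")
```

Where the observed figure is well above the expected one, as it is for the and cat in this toy corpus, the pair is a candidate association pattern; with real data the difference would still need to be tested for significance.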
3. The choice of the word contain was prompted by the interesting discussion of the
semantic field of inclusion in Chesterman (1998).

4. No such dictionary exists at the moment, so this is a pious hope rather than a practical
strategy. In the meantime a translator can be advised to consult a translation corpus directly,
or to build a corpus using translation memory software such as TRADOS.
5. Something approaching this situation occurs in the translation of poetry, where the
need to find a rhyme with a neighbouring word can be the prime consideration. Gutt
(1991: 106–7) discusses this poem by Morgenstern:
Ein Wiesel
sass auf einem Kiesel
inmitten Bachgeriesel.
Das raffinierte Tier
tat's um des Reimes Willen.
A weasel
sat on a pebble
in the middle of a ripple of a brook
The shrewd
did it for the sake of the rhyme
Gutt suggests various English translations, for example: a weasel perched on an easel; a ferret
nibbling a carrot; a mink sipping a drink; a hyena playing a concertina; a lizard shaking its gizzard. Thus the translations of Wiesel and Kiesel can be regarded as almost completely unpredictable in this text. Whether this still counts as translation is, of course, debatable.

6. For more discussion of modulation, see Salkie (to appear).

Biber, D., S. Conrad & R. Reppen. 1998. Corpus linguistics. Cambridge: Cambridge
University Press.
Chesterman, A. 1998. Contrastive functional analysis. Amsterdam: John Benjamins.
Chuquet, H. & M. Paillard. 1987. Approche linguistique des problèmes de traduction. Gap:
Gutt, E.-A. 1991. Translation and relevance. Oxford: Blackwell.
Salkie, R. 1997. INTERSECT: parallel corpora and contrastive linguistics. Contragram
Newsletter 11 (Oct 1997), 6–9. Available on the Web:
Salkie, R. 1999. How can linguists profit from parallel corpora? Paper presented at the
Parallel Corpus Symposium, University of Uppsala, April 1999. (To appear in the proceedings).
Salkie, R. 2000. Quelques questions méthodologiques dans l'exploitation des corpus multilingues. In Corpus: méthodologie et applications linguistiques, M. Bilger (ed), 180–195.
Paris: Champion.
Salkie, R. To appear. A new look at modulation. In Proceedings of Maastricht Conference on
Translation and Meaning, April 2000, M. Thelen (ed.).
Van Hoof, H. 1989. Traduire l'anglais: théorie et pratique. Louvain-la-Neuve: Duculot.
Vinay, J.-P. & J. Darbelnet. 1958. Stylistique comparée du français et de l'anglais. Paris: Didier.

Appendix 1: Corpus texts

Extracts from Genesis, Exodus and Psalms.

Canhans (Canadian
Extracts from Canadian Hansard (Reports of proceedings in the Canadian Parliament).

Céline, Voyage au bout de la nuit.
Extracts from B. Stoker, Dracula and H.G. Wells, The Invisible Man
Extracts from J. Verne, Le tour du monde en quatre-vingts jours; S. Germain, Jours de colère; A. de Saint-Exupéry, Le petit prince; A. Camus, L'hôte.
Gide, L'immoraliste.
Extracts from stories by Malraux and Maupassant.
Extracts from Robbe-Grillet, La jalousie.
Extracts from Sartre, La Nausée.



Instrs (Instructions)
Instructions for Xerox ScanWorx User Manual and various domestic appliances: Braun MultiPractic deluxe food processor; Fisher-Price All-in-one Kitchen Centre; Concertmate-750 keyboard; Sony Radio Cassette-Corder.

Intorgs (international Esprit
EU document: Proposal For A Council Decision Adopting The First European Strategic Programme For Research And Development In Information Technology (Esprit)
International Labour Organisation. Reports of the Committee on Freedom of Association, 246th
EU document: Maastricht Treaty
United Nations: Report on committee meeting
United Nations: Report on committee meeting
Royal Bank of Canada newsletter
Reports of the Joint Canadian House of Commons/Senate Special Committee on Canadian Foreign Policy.

Canadian National Library newsletter
Canadian armed forces discussion documents
Canadian forestry information
More Canadian armed forces discussion documents
More Canadian foreign policy discussion documents
Information about France from the French embassy in London

LM92/GW92 Extracts from Le Monde 1992 and their translation in Guardian Weekly.
LM93/GW93 Extracts from Le Monde 1993 and their translation in Guardian Weekly.

Information from the International Organization for Standardization website.
Information from the Institut Pasteur website.
Extracts from the International Telecommunication Union CCITT Blue Book SECTION 10

Comps (Company
Information from Hoechst website
Information from Deutsche Telecom website
Information from Siemens website

Büchner, Lenz & Leonce und Lena; Kafka, Die
Dickens, A Christmas Carol

Intorgs (international Esprit
EU document: Proposal For A Council Decision Adopting The First European Strategic Programme For Research And Development In Information Technology (Esprit)
United Nations documents: General Assembly

Manual for employees of SAP (German translation and localisation company) in use of software

Short news items from the German News website, April 1996.

Constitutions of the FRG, Austria and Switzerland.
Speeches by Roman Herzog, President of the FRG.
Marx-Engels, Communist Manifesto.

Appendix 2: Equivalents of kaum in English

Hardly (a selection from 18 examples)
(1) [[Kaum]] hatte sie sich umgedreht, zog sich schon Gregor unter dem Kanapee hervor und streckte und blähte sich.
(2) Hardly had she turned her back when Gregor came from under the sofa and
stretched and pulled himself out. [FICTION\GERFICT]
(3) Obgleich schon so ziemlich an gespenstische Gesellschaft gewöhnt, bangte
Scrooge vor der stummen Erscheinung doch so sehr, daß seine Knie wankten und
er [[kaum]] noch stehen konnte, als er sich ihr zu folgen bereit machte.
(4) Although well used to ghostly company by this time, Scrooge feared the silent
shape so much that his legs trembled beneath him, and he found that he could
hardly stand when he prepared to follow it. [FICTION\DICKENS]
Scarcely (a selection from 10 examples)
(5) Nicht doch, meine Liebe, die Blumen sind ja [[kaum]] welk, die ich zum Abschied
brach, als wir aus dem Garten gingen.
(6) LENA: Not so, my dear, these flowers, which I picked in parting as we left the gardens, are scarcely wilted. [FICTION\GERFICT]
(7) Als Scrooge wieder erwachte, war es so finster, daß er das Fenster [[kaum]] von
den Wänden seines Zimmers unterscheiden konnte.
(8) When Scrooge awoke, it was so dark, that looking out of bed, he could scarcely distinguish the transparent window from the opaque walls of his chamber. [FICTION\DICKENS]
Barely (a selection from 4 examples)
(9) Immerfort nur auf rasches Kriechen bedacht, achtete er [[kaum]] darauf, daß kein
Wort, kein Ausruf seiner Familie ihn störte.
(10) Intent on crawling as fast as possible, he barely noticed that not a single word, not
an ejaculation from his family, interfered with his progress. [FICTION\GERFICT]
(11) Er gab dem Löschhut einen letzten Druck und fand [[kaum]] Zeit, in das Bett zu
wanken, bevor er in tiefen Schlaf sank.
(12) He gave the cap a parting squeeze, in which his hand relaxed; and had barely time
to reel to bed, before he sank into a heavy sleep. [FICTION\DICKENS]
A selection of other translations
(13) [[Kaum]] zu glauben, wie rasch und munter die beiden Jungen darangingen.
(14) You wouldn't believe how those two fellows went at it! [FICTION\DICKENS]
(15) Man versuche es einmal und senke sich in das Leben des Geringsten und gebe es
wieder in den Zuckungen, den Andeutungen, dem ganzen feinen, [[kaum]]
bemerkten Mienenspiel

(16) People should try to plunge themselves into real life and to reproduce it in the tiny
movements, the little hints, and in the fine, almost imperceptible play of features.
(17) Obgleich sie die Schule [[kaum]] einen Augenblick hinter sich gelassen hatten,
befanden sie sich doch plötzlich mitten in den lebendigsten Straßen der Stadt
(18) Although they had but that moment left the school behind them, they were now in
the busy thoroughfares of a city [FICTION\DICKENS]
(19) Aber [[kaum]] war er wieder heraus, als er, obgleich noch keine Tänzer dastanden,
wieder aufzuspielen begann,
(20) But scorning rest, upon his reappearance, he instantly began again, though there
were no dancers yet [FICTION\DICKENS]
(21) Nun aber warteten oft beide, der Vater und die Mutter, vor Gregors Zimmer,
während die Schwester dort aufräumte, und [[kaum]] war sie herausgekommen,
mußte sie ganz genau erzählen, wie es in dem Zimmer aussah
(22) But now, both of them often waited outside the door, his father and his mother,
while his sister tidied his room, and as soon as she came out she had to tell them
exactly how things were in the room [FICTION\GERFICT]
(23) Und [[kaum]] hatten die Frauen mit dem Kasten, an den sie sich ächzend drückten, das Zimmer verlassen, als Gregor den Kopf unter dem Kanapee hervorstieß,
um zu sehen, wie er vorsichtig und möglichst rücksichtsvoll eingreifen könnte.
(24) As soon as the two women had got the chest out of his room, groaning as they
pushed it, Gregor stuck his head out from under the sofa to see how he might
intervene as kindly and cautiously as possible. [FICTION\GERFICT]

Hardly (5 examples)
(25) [[Kaum]] einer wisse, wofür die SPD stehe und wogegen sie sei.
(26) He said that hardly anyone knows what the SPD stands for and what it is against.
(27) Allerdings dürfte der Mannschaftskapitaen [[kaum]] von Anfang an aufgeboten
(28) To be sure, the team captain could hardly be mobilized at once.
(29) Die Risiken, mit denen wir es heute zu tun haben, sind [[kaum]] geringer.
(30) The risks confronting us today are hardly of lesser magnitude. [POLITICS\HERZOG]
(31) Man wird der deutschen Öffentlichkeit wohl [[kaum]] Unrecht tun, wenn man
behauptet, daß zu viele bei der Nennung des Wortes Islam vor allem Begriffe wie
inhumanes Strafrecht assoziieren.
(32) It would hardly be doing the German public an injustice to claim that too many of
us mainly associate terms such as inhumane penal law with the word Islam.
(33) Die Schülerzahlen stiegen in den alten Bundesländern um 2.5 Prozent, dieser Wert
sei aber bei der Schaffung neuer Planstellen [[kaum]] berücksichtigt worden.
(34) The number of pupils increased by 2.5 per cent, but this hardly had been taken
into account for the establishment of new posts. [NEWS\NEWAP96]
Scarce (1 example)
(35) Die Bourgeoisie hat in ihrer [[kaum]] hundertjährigen Klassenherrschaft massenhaftere und kolossalere Produktionskräfte geschaffen als alle vergangenen
Generationen zusammen.
(36) The bourgeoisie, during its rule of scarce one hundred years, has created more
massive and more colossal productive forces than have all preceding generations
together. [POLITICS\MANIF]
Negative expression (5 examples)
(37) Dies bedeutet mit anderen Worten, dass zwar [[kaum]] Zweifel hinsichtlich der
strategischen Bedeutung der fünf festgestellten breiten Bereiche und bezüglich des
Umfangs der Gesamtanstrengungen bestehen, die in den nächsten zehn Jahren
erforderlich sind, um mit den Wettbewerbern gleichzuziehen, dass aber für die
detaillierten FuE-Ziele [INTORGS\ESPRIT]
(38) In other words whereas there are no doubts about the strategic importance for the
next 10 years of the five broad areas identified and of the size of the overall effort
necessary to catch up with the competition, the detailed R & D objectives
(39) Streitkräfte dieser Art werden in absehbarer Zeit [[kaum]] zur Verfügung stehen.
(40) Such forces are not likely to be available for some time to come. [INTORGS\UN]
(41) Wenn die Bonner Pläne wahr gemacht würden drohe ein [[kaum]] wiedergutzumachender Schaden, schreibt Schulte in einem Brief an den Chef der
Koalitionsfraktion im Bundestag, Schäuble.
(42) In a letter to Mr. Schäuble, head of the coalition's parliamentary group, Schulte
writes that the realization of the government plans contains the risk of irreparable
damage. [NEWS\NEWAP96]
(43) Schwierigkeiten machen vor allem zwei Phosphatersatzstoffe in Waschmitteln und
aus Papierfabriken, die biologisch [[kaum]] abbaubar sind.
(44) Two phosphate substitutes in particular were problematic because they are not
bio-degradable. These substitutes are two laundry detergent ingredients and pulp
mill by-products. [NEWS\NEWAP96]
(45) Der Steuerzahlerbund erwartet 1996 [[kaum]] Entlastungen bei den Abgaben.
(46) The union of the tax payers does not expect a reduction of taxes for the year 1996.
Little (4 examples)
(47) Die vorgesehene Lockerung des Kündigungsschutzes wirke sich in der Metall- und
Elektroindustrie [[kaum]] aus.
(48) The planned relaxation of laws regarding layoffs will likely have little effect in the
steel and electronics industries. [NEWS\NEWAP96]
(49) Das Bruttoinlandsprodukt lag damit [[kaum]] noch über dem Vorjahreswert.
(50) The gross national product therfore was only little higher than last year.
(51) Er sei Ausdruck reiner Machtpolitik und habe mit den religiösen Grundlagen
[[kaum]] etwas gemein.
(52) Fundamentalism was an expression of power politics and had little in common
with religious fundamentals. [NEWS\NEWAP96]
(53) Danach gibt es für die Arbeitslosen in Deutschland [[kaum]] Aussicht auf
(54) According to them, there is only little positive prospect for unemployed people in
Germany. [NEWS\NEWAP96]
Almost + negative (3 examples)
(55) Die Belastung des Abwassers mit Schwermetallen hat ein [[kaum]] noch nennenswertes Niveau erreicht.
(56) The contamination of the wastewater with heavy metals has fallen to an almost
insignificant level. [COMPS\HOECHST]
(57) Sie haben [[kaum]] eine Chance, ein menschenwürdiges Leben zu führen.
(58) These people have almost no chance of a life in dignity. [POLITICS\HERZOG]
(59) [[Kaum]] ein Unterzeichner des Pamphlets habe je ein Buch von Annemarie
Schimmel gelesen.
(60) Almost none of the persons who signed the letter ever read a book of hers, he
added. [NEWS\NEWAP96]
Hard/difficult/seems impossible (3 examples)
(61) Haemischer Kommentar von SPD-Fraktionsvize Wolfgang Thierse: Die Politik
schwächt eben so ungemein, dass man sich [[kaum]] auf den Beinen halten kann.
(62) SPD-faction vice-president Wolfgang Thierse sneered: Politics can debilitate so
much that it is hard to keep going. [NEWS\NEWAP96]
(63) Vor diesem Hintergrund ist ein effektives Gebäudemanagement ohne DV-Unterstützung [[kaum]] noch vorstellbar.
(64) Against this backdrop, it is difficult to imagine efficient building management
without computer support. [COMPS\SIEMENS]
(65) Nach dem nein aus Kiel ist die erforderliche 2/3-Mehrheit im Bundesrat
[[kaum]] noch zu erreichen, da auch die Länder Hessen, Berlin, Nordrhein-Westfalen und Sachsen-Anhalt mit nein stimmen oder sich der Stimme enthalten wollen.
(66) After the no coming from Kiel, the required two-thirds majority seems impossible to achieve in the Bundesrat, as Hesse, Berlin, North Rhine-Westphalia and
Saxony Anhalt also want to vote no or abstain. [NEWS\NEWAP96]

Less and less (1 example)

(67) In der Tat legt sich diese Bundesregierung [[kaum]] mehr für die Ärmsten in der
Gesellschaft ins Zeug. Aber verursacht hat sie die Konjunkturflaute nicht, stellt die
(68) Government indeed is less and less interested in backing the poorest in society, but
that hasn't caused the recession. [NEWS\NEWAP96]
Largely + negative (1 example)
(69) Manche nehmen die ermutigenden Tendenzen und Fortschritte aber auch einfach
nicht zur Kenntnis, weil Erfolge, so spektakulär sie auch sein mögen, weniger
dramatische Bilder abgeben als Katastrophen und deshalb von den Medien
[[kaum]] beachtet und berichtet werden.
(70) Some people simply fail to appreciate encouraging trends and progress because
success stories, no matter how spectacular, do not provide such dramatic pictures
as disasters and are therefore largely ignored by the media. [POLITICS\HERZOG]

Appendix 3: Equivalents of contain in French

Contenir (a selection from 171 examples)
(1) And my baggage [[contains]] apparatus and appliances.
(2) - Et mes bagages contiennent des appareils, un matériel. [FICTION\ENGLISH]
(3) Unconverted document file [[containing]] formatting information.
(4) Fichier de document non converti contenant des informations de formatage.
(5) In 1984, the database [[contained]] close to three million bibliographic records,
and was growing at an annual rate of 400 000 records.
(6) En 1984, la base de données contenait près de 3 millions de notices bibliographiques et augmentait de quelque 400 000 notices par an. [MISC\CANLIB]
Figurer (a selection from 32 examples)
(7) - No WRU signals should be [[contained]] within the pre-recorded message up
to the last code expression CI
(8) - Aucun signal WRU ne doit figurer dans le message préenregistré jusqu'à la
dernière expression de code CI. [SCI-TECH\TELECOM]
(9) The complaint presented by the Central Organisation of Workers (CGT) is [[contained]] in a communication dated 30 May 1985.
(10) La plainte figure dans une communication de la Centrale générale des travailleurs
(CGT) du 30 mai 1985. [INTORGS\ILO]
(11) These comments, [[containing]] important information on the situation, were
made by those we interviewed and I have done my utmost to transcribe them as
faithfully as possible.

(12) Ces commentaires, parmi lesquels figurent des informations importantes sur la
situation, ressortissent entièrement à la responsabilité des personnes rencontrées
et je me suis efforcé d'en rendre compte aussi fidèlement que possible.
(13) Immediately after the Committee's consideration of the case the Government's
reply [[contained]] in a communication dated 12 May 1986 was received.
(14) Immédiatement après avoir examiné le cas, le comité a reçu la réponse du gouvernement qui figurait dans une communication datée du 12 mai 1986.
(15) In that connection, he drew attention to the relevant explanations [[contained]] in
paragraphs 60 to 65 of document E/CN.4/1988/24 and in paragraph 60 of the
interim report.
(16) A cet égard, M. Pohl renvoie la Commission aux explications figurant dans les
paragraphes 60 à 65 du document E/CN.4/1988/24 et dans le paragraphe 60 du
rapport intérimaire. [INTORGS\UN1]
Publier (a selection from 9 examples)
(17) Mr. LABERGE (Canada), on behalf of the sponsors, who had been joined by
Pakistan and Thailand, introduced the draft resolution [[contained]] in document
A/C.3/43/L.77 entitled Human rights and mass exoduses.
(18) M. LABERGE (Canada) présente au nom des auteurs, auxquels se sont joints le
Pakistan et la Thaïlande, le projet de résolution publié sous la cote A/C.3/43/L.77,
intitulé Droits de l'homme et exodes massifs. [INTORGS\UN2]
A selection of other translations (about 50 types)
(19) It [[contains]] in all some twenty acres, quite surrounded by the solid stone wall
above mentioned.
(20) Il comprend quelque vingt acres de terres entièrement ceintes, comme je l'ai dit,
par un solide mur de pierres. [FICTION\ENGLISH]
(21) I confessed that her country terrified me quite definitely more than the whole
sum total of threats, actual, hidden and unforeseen which I found it [[contained]]
(22) et quant à son pays il m'épouvantait tout bonnement plus que tout l'ensemble de
menaces directes, occultes et imprévisibles que j'y trouvais
(23) He observed that the butchers' stalls [[contained]] neither mutton, goat, nor
(24) Il avait bien remarqué que moutons, chèvres ou porcs, manquaient absolument
aux étalages des bouchers indigènes [FICTION\FRENCH]
(25) The big kitchen table was covered with wicker baskets [[containing]] the dough.
(26) La grande table de la cuisine était couverte de panetons d'osier où reposait la pâte.
(27) These works, and the pleasure they [[contain]], can be learned like a foreign language

(28) Ces oeuvres, et le plaisir qu'elles apportent, peuvent être apprises comme une
langue étrangère [FICTION\MALMAU]
(29) In order to enjoy the features and functions of this unit to their fullest, be sure to
carefully read this manual and follow the instructions [[contained]] herein.
(30) Afin d'apprécier au mieux les fonctions et les caractéristiques de cet instrument,
lisez attentivement ce manuel et suivez les instructions y inclues.
(31) The preview window [[contains]] pause options at the top of the window.
(32) Le haut de la fenêtre de visualisation comporte des options de pause.
(33) Each document can [[contain]] one header, or one footer, or both.
(34) Un document peut avoir un en-tête ou un pied de page ou les deux à la fois.
(35) It is simply a cache in the department of printed books, to which only our librarians have a key, and which [[contains]] a number of books which although
extremely evil, are sometimes very precious to bibliophiles and have a high market
(36) C'est tout simplement une cachette du département des imprimés dont les conservateurs ont seuls la clef et dans laquelle on enferme certains livres fort mauvais,
mais quelquefois très précieux pour les bibliophiles, et de grande valeur vénale.
(37) General Assembly resolution 2248(S-V), which [[contained]] the political
mandate and framework for the activities of the United Nations Council for
(38) la résolution 2248(S-V) de l'Assemblée générale, qui définit le mandat du
Conseil des Nations Unies pour la Namibie et le cadre politique de ses activités.
(39) the committee would like to recall the principle [[contained]] in the Workers'
Representatives Recommendation
(40) le comité tient à rappeler le principe énoncé dans la recommandation
(41) When asked if she had the letters [[containing]] the death threats she had received,
she replied
(42) A la question de savoir si elle possédait les lettre de menaces de mort qu'elle avait
reçues, Mme Avella a répondu [INTORGS\ILO]
(43) he had not hidden the fact that Chile wished to keep those territories because of
the wealth they [[contained]].
(44) il n'a pas caché que si le Chili tenait à garder ces territoires, c'était en raison de
leur richesse. [INTORGS\UN1]
(45) Amongst the number of letters we found waiting for us at Naples was one [[containing]] an unexpected piece of information: a chair at the College de France
had fallen vacant and my name had been several times mentioned in connexion
with it;
(46) Dans l'important courrier qui, depuis longtemps, nous attendait à Naples, une lettre m'apprenait brusquement que, se trouvant vacante une chaire au Collège de
France, mon nom avait été plusieurs fois prononcé. [FICTION\GIDE]
(47) Apart from this, although the CGT and CTC did not have problems similar to
those [[contained]] in the CSTC complaint, it was pointed out, at the meeting
with UTC officials, that this organisation had also lost several trade unionists who
had been murdered or disappeared.
(48) Cela mis à part, alors que la CGT et la CTC ne connaissaient pas des problèmes
similaires à ceux qui sont dénoncés dans la plainte de la CSTC, les dirigeants de
l'UTC ont indiqué que cette organisation avait également perdu divers syndicalistes qui étaient morts ou avaient disparu. [INTORGS\ILO]

Functionally complete units of meaning across English and Italian
Towards a corpus-driven approach
Elena Tognini Bonelli
If meaning is function in context, as Firth used to put it, then equivalence of meaning is equivalence of function in context. What the translator is doing when translating or interpreting is taking decisions all the
time about what is the relevant context within which this functional
equivalence is being established. (Halliday 1992a: 16)


This study addresses the issue of comparing words and expressions across languages and proposes an approach where meaning, whether denotational
and/or connotational and/or pragmatic, is seen as encoded by and intertwined with formal lexico-grammatical realisations in the verbal context.
Starting from such a perspective it would not make sense to identify a certain
function in a language solely from a grammatical or lexical point of view and
expect an equivalent grammatical or lexical match in another language. It is
argued that whether the starting point is lexical or grammatical, an analyst sensitive to the cumulative effect of usage (what Firth called repeated language
events) will be led by the evidence to identify multiword lexico-grammatical
items that operate within well-defined semantic platforms and perform specific functions at the pragmatic level.
If we consider the comparative angle it is proposed that these multiword
units only become available for comparison across languages or translation
when they are functionally complete (Tognini Bonelli 1996a), that is when all
the components that are necessary for the unit to function have been identified. This study will try to demonstrate that this is possible and, indeed, the
only way forward.
The approach adopted here is a step towards that which has been referred
to as corpus-driven (Tognini Bonelli 2001) and I will start by identifying the
tenets of such an approach (Section 2) and differentiating it from a more traditional corpus-based approach with a view to outlining the implications for
language description in general and contrastive linguistics and translation in
particular.1 I will then (Section 3) go on to define and exemplify what I mean by
functionally complete units of meaning, which I take to be the minimal currency units when comparing languages. In Section 4 I will discuss the implications for translation and contrastive linguistics. In Section 5 I will illustrate the
approach comparing a given function and its formal realisations across
two languages.

2. The corpus-driven approach

It should be noted that general work which makes use of a corpus as evidence
for language description is usually referred to as corpus-based. I use this term in
a more restricted sense to refer specifically to work where the corpus is used
mainly to expound on, or exemplify, existing theories, that is theories which
were not necessarily derived with initial reference to a corpus.
It is important to note that although the evidence of the corpus may indeed
seem to support, at least partially, a pre-existing non-corpus-derived theoretical statement, the corpus-based approach does not really go as far as querying
traditional units of investigation which are taken as given, even though they
could be questioned in the light of corpus evidence. Traditional distinctions
such as the one between lexis and grammar are taken for granted and so this
type of corpus-based investigation is usually happy to go along with distinctions
between lexicons dealing with lexical units (usually words) on the one hand and
grammars (studying grammatical frames) on the other.2 This approach does
not allow for the fact that the enormous amount of evidence now available is
bound to challenge language description and offer fascinating new insights
into language (Sinclair 1991:4). To start, therefore, with units derived from traditional descriptions, often based on very little evidence3, is not only not sufficient anymore, it is dangerous.
Perhaps the most important change brought about by corpus work and
one which is not recognised by the corpus-based approach is a change in the
unit of currency, that is in the unit of linguistic investigation. The traditional
water-tight separation between lexis and grammar does not really hold in the
light of corpus evidence. This is why the linguist deciding to investigate corpus
evidence with an open mind, rather than pre-set beliefs, will accept that even
the units of investigation will have to be re-defined and s/he will have to come
to terms with units which are neither fully grammatical nor purely lexical, but a
mixture of the two.4 This type of unit is not the type that has been studied and
analysed in traditional grammar books, nor listed in traditional dictionaries.
The reason why we can now call it a unit, and indeed adopt it as the new currency unit, is that the interrelations between the lexical and the grammatical elements in it are so strong and systematic that they cannot be ignored anymore.
Frequency distributions and patterns of co-selection determine the size and
shape of the unit. But a new approach is needed to account for this new unit.
The corpus-driven approach (discussed in some detail in Tognini Bonelli
1996b and 2001; see also Hunston and Francis 2000 for a corpus-driven
approach to grammar), in contrast to the corpus-based approach, constitutes a
methodology that uses a corpus beyond the selection of examples to support
linguistic argument or to validate a theoretical statement. The commitment of
the scholar is to the integrity of the data as a whole, and descriptions aim to be
comprehensive, rather than selective, with respect to the corpus evidence for a
particular topic of research. Here the corpus is not used just as a repository of
examples to back pre-defined theories. The theoretical statements, as well as
the comments or recommendations made, arise directly from, and reflect, the
evidence provided by the corpus. The new unit of description can be safely
posited and explored in this framework. Linguistic description is arrived at,
step by step, from the observation of language usage; recurrent language events
and frequency distributions are expected to form the basis of linguistic categories; the absence of a pattern is considered potentially meaningful.
Of course, many issues and queries related to the corpus itself become relevant when one adopts a corpus-driven approach. The representativeness of the
corpus has to be assessed, and so should the sampling criteria used in the creation of the corpus. Now that corpora containing hundreds of millions of
words are available, even the question of what corpus size is adequate, and for
what type of enquiry, should be addressed. Indeed, as Halliday points out, the
corpus should be seen as a theoretical construct (1992b) because what may
seem to be just evidence, and more evidence, contains the parameters of a very
specific view of language.
Querying pre-defined theoretical statements does not mean that the
activity of analysing corpus evidence should be a-theoretical. The initial
assumptions of the enquiry, though, should always be made clear and, above
all, should be testable against the evidence of the corpus. The sections below
will try to exemplify the corpus-driven approach and, in particular, explore a
methodology for the identification of what I have called the new currency unit
in linguistic description and, more particularly, the description and identification of equivalent units across languages.

3. The new currency: Functionally complete units of meaning

The central proposal of the theory is (...) to split up meaning or function into a
series of component functions. Each function will be defined as the use of some
language form or element in relation to some context.
(Firth, A Synopsis of Linguistic Theory, 1968: 173)

The initial assumption here relates to a view of form and meaning as strictly and
systematically interconnected, indeed as two aspects of the same phenomenon:
language seen as function in context. This view of language, originally proposed by J. R. Firth, is adopted as a fundamental tenet by Sinclair (1991: 7),
who, reporting on corpus work of the 80s, explains:
Soon it was realised that form could actually be a determiner of meaning, and a
causal connection was postulated, inviting arguments from form to meaning.
Then a conceptual adjustment was made, with the realisation that the choice of
a meaning, anywhere in a text, must have a profound effect on the surrounding
choices. It would be futile to imagine otherwise. There is ultimately no distinction between form and meaning.

From the perspective adopted in this study, the implications of the above claim
are very important. We are assuming that, given certain formal parameters in
the context of a word, it is possible to arrive at a reliable meaning by formalising
the evidence of language usage. We are assuming that a variation in the formal
profile of a word or an expression will always lead us to a change in meaning.
I will now briefly discuss the use of the word fork, as a noun and as a verb, to
show how a series of steps in formalisation of the context reliably indicates
meaning differentiations. The complete concordance of fork from a corpus of
The Economist (9.38 million words) and the Wall Street Journal (6.36 million
words) shows a total of 28 instances. Below I will present and discuss just a few
examples that illustrate the interrelation between form and meaning.

Of the instances present in the corpus, five show a very consistent collocation with the word knife in the left co-text as in:
Use a knife (right hand) and a fork (left hand)
conservatives who use a knife and fork to eat their red meat

The meaning here is indeed the implement we use to eat, and this specific collocational patterning is associated with all instances with this meaning.
Another meaning of fork is the point at which a road or a path divides into
two parts. This is the meaning of fork in three instances of the concordance and
it is always associated with the word road as a collocate as in:
We are really at a fork in the road in terms of
At every fork of the road there were

As a variation of this meaning of fork as bifurcation we find in the concordance
six instances where fork always appears in capital letters and shows a very
strong collocation in the left co-text with the adjectives North and South:
rally of his supporters at the South Fork Ranch
the real estate division of North Fork Bank & Trust Co.

It is interesting to note that although the meaning of fork here is the same as
fork in the road, the contextual patterning is different. It consistently forms
part of place names and institutions. This usage is mainly American and, in the
concordance, strictly confined to the Wall Street Journal corpus.
All the remaining instances of fork in the concordance are uses of the
phrasal verb fork out. The examination of these instances leads us to consider
the issue of co-selection, that is the habitual selection of two or more items
together, beyond the simple patterns of collocation seen above. As I mentioned, patterns of co-selection in text shown up by corpus work are so strong
and can usually be identified so clearly that they lead us to question the extent
of the unit of meaning (Sinclair 1991, 1996, 1998) traditionally associated with
the word, and, only in the case of well established idioms, with the phrase. This
is where we can observe more clearly what I have called the change in currency
which becomes evident when adopting a corpus-driven approach. It is not as if
traditional linguistics has totally ignored the issue of co-selection. Indeed
idiomatic expressions and phrasal verbs (such as fork out), which are examples
of co-selection, have been studied from all angles, but until the advent of large
corpora it has not been possible to see how all-pervasive this issue is. The evidence from the corpus tends to point to the fact that what Sinclair (1991) calls
the phraseological tendency is not limited to standard idiomatic expressions

Elena Tognini Bonelli

and phrasal verbs, but affects all words.5 Moreover, in the case of phrasal verbs,
for example, traditional grammars have tended to identify the fixed core of the
phrasal verb, but little if any attention has been given to other, perhaps not as
immediately visible, patterns of co-selection obtaining between the fixed core
and its own co-text. A corpus-driven approach aims to go beyond the identification of a fixed idiomatic core: it considers closely the patterns that link up the
core to its environment and tries to quantify and assess their inbuilt variability.
Let us consider now the series of steps whereby we can place the phraseological core fork out in relation to its co-text and identify the ultimate function associated with the unit. If we consider the right co-text of fork out we find a strong
collocation with words such as Pounds, Dollars followed by numerals (the quotations are unchanged from the electronic form of the text, and words denoting
currencies are clearly a way of avoiding the special characters £, $, etc.):
     Taurus member firms may be reluctant to  fork out  Pounds sterling 50m-70m
                                   it had to  fork out  a further Dollars 2.4m
        tax payers might ask why they should  fork out  for the benefit of shareholders
               and his employer will have to  fork out  an extra Pound sterling 7..95 in nics.
     business travellers who are prepared to  fork out  for the full fare.
              In Japan, Big Mac fans have to  fork out  Yen 391 for this feast.
                           BAT would have to  fork out  close to Dollars 1 billion to raise its
            several hundred people, ready to  fork out  the Pounds sterling 8 (Dollars 12.50)
 means losing medical benefits and having to  fork out  for expensive child-care
the Germans and the Japanese will be wise to  fork out  even if America profits.
This patterning is supported by other words such as dm, Yen, cash, fare, money
and establishes what Sinclair (1996, 1998) has called semantic preference: to
fork out is related to the activity of paying money, usually exact amounts.
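The count behind such a statement can be sketched in a few lines of Python. The sketch is only an illustration, not the procedure followed in this study: the function name right_collocates, the three-word window and the whitespace tokenisation are assumptions, and the sample consists of three lines drawn from the fork out concordance.

```python
from collections import Counter


def right_collocates(lines, node, window=3):
    """Count words occurring within `window` tokens to the right of `node`.

    `node` is a sequence of lowercase tokens, e.g. ("fork", "out").
    Tokenisation is a plain whitespace split (an assumption made for brevity).
    """
    node = tuple(node)
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - len(node) + 1):
            if tuple(tokens[i:i + len(node)]) == node:
                start = i + len(node)
                counts.update(tokens[start:start + window])
    return counts


# Sample lines taken from the concordance of "fork out".
sample = [
    "several hundred people, ready to fork out the Pounds sterling 8 (Dollars 12.50)",
    "it had to fork out a further Dollars 2.4m",
    "BAT would have to fork out close to Dollars 1 billion to raise its",
]
counts = right_collocates(sample, ("fork", "out"))
print(counts.most_common(5))
```

On this small sample Dollars is the only repeated right collocate; run over a full concordance, the same count would bring out the currency words on which the semantic preference rests.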
If we consider now the left co-text we note that the verb fork out is preceded
mainly by different forms of the modal verb have to. Other collocates are may,
might, would, could, should and establish a strong co-selection pattern with
modals in general; some of their lexical equivalents, such as reluctant to, are prepared to, will be wise to, agreed to, refused to, should make it easier to, support this
tendency. The cumulative effect of such instances points to a semantic
prosody (Louw 1993, Sinclair 1996, Stubbs 1996) which has to do with pressure and unwillingness. People who have to fork out are certainly not pleased
about it and do it only under pressure or in case of real need. Semantic prosodies represent the functional choice which links meaning to purpose (Sinclair
1996); they delineate, in other words, the outer limit of the unit of meaning
where the co-text merges with the context and a certain item achieves a purpose in a certain environment.
Looking at the systematicity of the patterns we have identified above, we
are led to support the notion of an extended unit of meaning where collocational and colligational patterning (that is lexical and grammatical choices
respectively) are intertwined to build up a multi-word unit with a specific
semantic preference, associating the formal patterning with a semantic field,
and an identifiable semantic prosody, performing an attitudinal and pragmatic
function in the discourse. The unit thus identified is truly functionally complete
(Tognini Bonelli 1996a) in that it merges the two dimensions, the contextual
one and the functional one.

4. Implications for contrastive linguistics and translation: Progressive steps between form and function
From the point of view of the comparison of two languages this study argues
that the assumptions of a correlation between form and meaning on the one
hand and the postulation of a functionally complete unit of meaning on the
other are the crucial stepping stones to identifying a network of equivalences.
This approach of course entails a communicative view of language, where the
linguistic choices made are seen as primarily functional. This is where it
becomes crucial to identify systematically the formal patterns associated with
the semantic preference and the semantic prosody: only when functionally
complete will a unit of meaning be available as a possible choice to the translator or for comparison to the contrastive linguist.
Before I go on to propose a methodology that makes use of a set of corpora
for translation or contrastive linguistics, I would like to say a few words on the
process of translation itself (see also Tognini Bonelli 1996a). The main and
perhaps the most obvious point to be made is that both the text, encoding
meaning, and the context in which the text itself is embedded, vary. Translation
presupposes displaced situationality (Neubert 1985, Viaggio 1992) at both the
linguistic and the extra-linguistic levels. At the purely linguistic level the translator will negotiate equivalence of meaning in a displaced context, that is from
SL to TL. This will involve assessment of the two different linguistic systems
and analysis of the formal contextual features that realise the same function;
the linguist will identify two units of meaning which are comparable in spite of
the displaced context.

At the extra-linguistic level the situational features will also be displaced in
that the context will invariably refer to, and reflect, a different culture, a different situation and different participants. Two different levels of interaction will
also have to be assessed and accommodated: the original interactive process
between SL writer and his/her SL audience and the one between translator and
his/her TL audience. The translator here has the task of reproducing, re-creating, the original interaction to a different audience, in a different situation, taking into account the fact that the original text and the translation may even
have a different purpose altogether.
The steps that the translator will take to negotiate equivalence at the extralinguistic level will account for the strategies s/he will adopt in order to transfer
and report the original interaction to a new target audience. This stage can be
seen as a reporting strategy, whereby the original interactive process has the
status of a report within another, new, interactive process.6 This framework is
taken to allow for shifts and changes of purpose between source text and translation, and for specific interventions of the translator vis-à-vis his own target
audience, for example.
Given these two levels in the translation process, it is important to understand that I am assuming a difference between what I called a unit of meaning,
whether in the source or in the target language, and a unit of translation: I
maintain that while units of meaning are defined contextually (that is, by
examining the verbal co-text of the chosen word or phrase and identifying the
patterns of co-selection), units of translation are defined mainly strategically
by means of explicit balancing decisions taken by the translator in order to
achieve an effect or purpose equivalent to the original (Nida 1964). These balancing decisions will be possible (a) once comparable units of meaning have
been isolated in the source and the target language, in the light of (b) the perceived role of the translator as go-between linking two cultures and two specific situations, as well as of (c) the function of the translated text vis-à-vis the
new target audience.
Although both the linguistic and the extra-linguistic levels must be taken
into account for the translation to be successful, this article will present only a
methodology for identifying and evaluating sets of comparable units of meaning. The strategic steps that may influence the translator to opt for one unit of
translation rather than another and the linguistic realisation of these steps are
excluded from the present enquiry.7 In this respect, the methodology illustrated here may be of relevance to other linguists who also work across languages
but are not necessarily interested in a translated output as such.

The methodology I will illustrate tries to locate the words and phrases that
encode a function in L1 and the other words and phrases, inevitably different from the first set, that yield a comparable unit of meaning in L2. In
other words, my aim here is to trace, through a series of steps correlating formal patterning with function in L1 and L2, the boundaries of sets of functionally complete units of meaning in the two languages.
We should note that the initial hypothesis positing one or more tentative
matches between two or more prima facie units of meaning in SL and TL has to
rely on the translator's intuition or past experience. Traditionally, standard reference works such as bilingual dictionaries attempted to provide this information. Recently we have witnessed the emergence of translation corpora (also
referred to as parallel corpora), which are corpora of texts that stand in a translational relationship to each other, that is to say the texts can each be a translation of an absent original or one of them can be the original and the other(s)
translation(s). I maintain that the use of a translation corpus at this stage, if
available, will give us the benefit of such input in a more reliable manner and
provide us with a range of possible translation pairs that have already been
identified and used by translators, in other words verified by actual translation
usage. I maintain, however, that in the framework of a corpus-driven approach
the definition of a functionally complete unit of meaning cannot be confined
solely to the evidence from a translation corpus. Each unit of meaning has to be
contextualised and its formal components identified in the light of a type of
corpus evidence which is not subject to the restrictions of mediated language.
At this stage, therefore, the linguist will need to substantiate his/her observations using two comparable corpora (one L1 and one L2); these are corpora
whose components are chosen to be similar samples of their respective languages in terms of external criteria such as spoken vs. written language, register, etc. The identification and matching of form and function of the equivalence pair will take place in each of the two sets of comparable corpora.8
The formalisation of the regularities exhibited by the evidence will allow a
series of progressive steps which will deconstruct an initial chosen function
into its formal components, and vice versa. The aim is to ascertain functional
equivalence, that is, the equivalence obtaining between functionally complete
units of meaning. Here I will distinguish three methodological stages (see
Table 1 below). The first step works within L1 and consists in identifying and
classifying the formal patterning in the context of a given word or expression
against the evidence of an L1 corpus (see Johns 1991: 4); this is followed by the
matching of a specific meaning/function to each specific pattern. Step 2 in the
process will consider both L1 and L2 and will posit a prima facie translation
equivalent for each meaning/function. If a translation corpus is available, the
process will be enriched by access to translations. If no translation corpus is
available, as in the case of this study, this step has to rely on information taken
from reference books or intuition on the part of the analyst. Step 3 will start
from a function in L2, realised by the prima facie equivalent, and will deconstruct it into its formal realisations (collocational and colligational patterning);
in a way it will replicate the process of step 1, but the other way around.
Table 1: Methodological steps

Step 1  Comparable Corpus                             from Formal Patterning/L1 to Function(s)/L1
Step 2  Translation Corpus / Translator's Experience  identify a prima facie translation equivalent for each Function/L1 (Function/L1 → Function/L2)
Step 3  Comparable Corpus                             from Function/L2 (as realised by a translation equivalent) to Formal Patterning/L2
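
One way to picture what the three steps produce is a small record per unit of meaning, filled in once for L1 and once for L2 and then compared. The Python sketch below is a hypothetical illustration only: the class UnitOfMeaning, its field names and the compare function are assumptions introduced here, and the sample values anticipate the analysis of in case of and in caso di in Section 5.2. Comparing labels by string equality is a crude stand-in for the analyst's judgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitOfMeaning:
    """Hypothetical record of a functionally complete unit of meaning."""
    core: str
    language: str
    collocates: frozenset
    semantic_preference: str
    semantic_prosody: str


def compare(l1_unit, l2_unit):
    """Report on which functional dimensions two units carry the same label."""
    return {
        "semantic_preference":
            l1_unit.semantic_preference == l2_unit.semantic_preference,
        "semantic_prosody":
            l1_unit.semantic_prosody == l2_unit.semantic_prosody,
    }


# Steps 1 and 3 fill in one record per language; step 2 pairs them.
english = UnitOfMeaning(
    core="in case of",
    language="English",
    collocates=frozenset({"accident", "emergency", "fire", "trouble"}),
    semantic_preference="disaster areas",
    semantic_prosody="provision for disaster",
)
italian = UnitOfMeaning(
    core="in caso di",
    language="Italian",
    collocates=frozenset({"necessità", "guerra", "urgenza"}),
    semantic_preference="disaster areas",
    semantic_prosody="provision for disaster",
)
print(compare(english, italian))
```

The collocate sets differ, as one would expect across languages; in this sketch the equivalence rests on the preference and the prosody.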

I believe it is very important to be strict and systematic about the specific formal patterning associated with a given item. By looking at the patterns on the
vertical axis of a concordance and identifying larger syntagmatic units on the
horizontal axis and by considering the frequency distributions, the researcher
(whether s/he is a translator, a bilingual lexicographer or a contrastive linguist) will be able to assess not only what is possible, but also what is likely
within two different linguistic systems; specific appropriateness to context will
be evaluated against the evidence, and the full value of a translator's own chosen or
inadvertent deviations from the norm can be assessed against the range of variations present in the L2 corpus. Issues that could not really be addressed before
the advent of corpora because of the need for a large amount of evidence
(cumulative connotational tendencies or specific register characteristics, for
example) can now be observed and become tangible, often being simply
identified by alphabetising the context of a word.
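
Alphabetising the context is a simple operation to reproduce. The following Python sketch is a minimal illustration under stated assumptions: the function names kwic and sort_by_right are invented here, matching is a plain case-insensitive substring search on the first occurrence in each line, and the four lines are the fork citations quoted earlier in this chapter.

```python
def kwic(lines, node):
    """Split each line into (left, node, right) around the first match of `node`.

    Matching is a case-insensitive substring search; lines without the
    node are skipped. This is a simplification chosen for brevity.
    """
    hits = []
    for line in lines:
        pos = line.lower().find(node.lower())
        if pos == -1:
            continue
        end = pos + len(node)
        hits.append((line[:pos].strip(), line[pos:end], line[end:].strip()))
    return hits


def sort_by_right(hits):
    """Alphabetise concordance lines on their right co-text."""
    return sorted(hits, key=lambda hit: hit[2].lower())


lines = [
    "At every fork of the road there were",
    "conservatives who use a knife and fork to eat their red meat",
    "We are really at a fork in the road in terms of",
    "Use a knife (right hand) and a fork (left hand)",
]
for left, node, right in sort_by_right(kwic(lines, "fork")):
    print(f"{left:>40}  {node}  {right}")
```

Sorted on the right co-text, the two road instances end up next to each other, which is the kind of vertical patterning the analyst reads off a concordance.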
These points become very relevant when one considers the implications of
translating from and into one's own mother tongue. The process leading from
formal patterning to function and vice versa can be related to the process of
decoding and encoding in language. In translation the norm is for the translator to translate (that is, to encode) into his/her own mother tongue, where it
is assumed s/he can be more sensitive to the demands of appropriateness. With
corpus evidence at hand and with a methodology to identify systematically the
relevant lexical and grammatical profiles of a word or expression and relate
them to connotational weight and pragmatic function, this approach will
reduce the gap existing between translating from and into one's own mother
tongue. Even when dealing with a language other than their own mother
tongue translators will be able to identify tangibly the norm and the range of
variation from it and make choices in the light of that evidence.

5. Navigating across English and Italian: The expression in (the) case (of)
Sections 5.1 and 5.2 below propose an analysis of two expressions that incorporate the words case and caso in order to introduce a circumstantial element.
Section 5.3 will present a third example, where the conjunction in case will be
compared with the Italian se per caso. I will use two sets of general corpora, one
of English and one of Italian, which at the time of writing were the best I could
access in terms of comparable corpora, although not explicitly put together
according to the same criteria.9 Step two, positing a prima facie translation
equivalent, which should ideally make use of a translation corpus, will rely here
on standard reference works and the linguist's experience and intuition. The
findings reported below are to be taken as indicative of the methodological
steps proposed, but they would still need to be explored further and validated
in the light of more exhaustive evidence. The citations discussed below are
reported in order to illustrate patterning and they represent a reduced sample
of the overall concordance analysed from the corpus.
5.1 From in the case of to nel caso di
The expression in the case of is, from a grammatical point of view, a complex
preposition introducing what Halliday called a circumstance of matter
(1985: 142). Our first step will be to consider its co-text and analyse it into its
formal lexico-grammatical constituents. Looking at the right co-text of the
concordance below, we find a noticeable presence of the definite article the.10
This, coupled with the other strong pattern (the presence of proper names),
points to a strong function related to specificity. The function here is to present
individual examples, considered for their particular characteristics.
       of subsidies can be illustrated  in the case of  Australia, where the est
  on shifts in values. As we shall see  in the case of  London's motorways, the
 period is likely to be lengthier than  in the case of  Spain, because of the we
  end in itself. This is especially so  in the case of  experiments which can
even a reasonable thing to assume; and  in the case of  relatively minor ills
     awakening of enlightened optimism  in the case of  the Liberals, and the
     allowed myself to break this rule  in the case of  the USSR the data
   an expert witness on the truth drug  in the case of  the Boston strangler.
   children and primitive artists, but  in the case of  the caricaturist the
   not be able to perform efficiently.  in the case of  the distance runner

The semantic preference associated with this is very varied: people alternate
with countries, tangible objects with less tangible ones. In terms of semantic
prosody we are not associating a particular evaluation with the instance presented. We could perhaps see the introduction of specificity as the ultimate
function of this complex preposition.
To understand better the neutrality attached to the specific cases introduced by in the case of it might be interesting here to open a brief parenthesis
and consider the collocational profile of a grammatically parallel expression in
English, namely in the event of, extracted from the same corpus. For lack of
space, I will not present the whole concordance here, only a few examples of the
collocates: unavoidable nationalisation, a major disaster, a breach of the rules,
great national emergency, a war, company failure, hostilities, trouble, etc. The
negative semantic prosody attached to the collocates of this expression is very
strong and regular, so much so that even the only neutral word (an election)
is turned negative by what follows: of a politically hostile party.
The Italian prima facie translation equivalent of in the case of posited in
stage two is nel caso di and stage three will go through the same process of deconstruction in the Italian concordance, identifying the formal patterning present in the co-text:
  peggiorativa, come, per esempio,  nel caso degli  homines novi, uomini
ambienti simulanti l'acqua di mare  nel caso degli  acciai superferritici e le
               del rumore termico)  nel caso degli  algoritmi a minima
          formaggio. Diverso il il  nel caso dei    bambini. Poiché il loro
    si riproduce, altre volte come  nel caso del    terzo sonetto vediamo
   la collaborazione del paziente:  nel caso del    fumo, deve voler smettere.
   l'approccio più diretto, almeno  nel caso del    carcinoma midollare della
  olidaristiche e corporative come  nel caso del    Lord Spleen di Giovanni
rambe le eventualità si presentano  nel caso dell'  Orlando furioso,
  informazioni di tipo diagnostico  nel caso della  patologia neoplastica

Here we note the merging of the preposition di with the definite article which
gives rise to del, dell', dei, degli, delle as substitutable for di. The function of
specificity is very obvious because this merging between the preposition and
the article is present in all the instances. In the right co-text we find nouns
which, as in the English material, show quite a lot of variation with no strong
collocational pattern. In terms of semantic preference it is interesting to note
two rather prominent areas. Firstly, the area of technical and scientific terminology (acciai superferritici, algoritmi, carcinoma midollare, amminoacido,
patologia neoplastica, etc.) which accounts for 31% of the instances; secondly
the area of literary analysis (homines novi, terzo sonetto, Lord Spleen,
l'Orlando Furioso, etc.) which accounts for 21.5% of the instances. At the level
of semantic prosody again we could say that a fairly objective function of specificity is the only identifiable one, as in the English equivalent.
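Percentages of this kind come from labelling each concordance line with a semantic field and counting. The Python sketch below shows the arithmetic on the ten citations displayed above only, so its output necessarily differs from the 31% and 21.5% reported for the full concordance. The function name field_shares is an assumption, and so is the label "other" for bambini and fumo, which the discussion does not classify.

```python
from collections import Counter


def field_shares(labelled):
    """Return the percentage of instances falling in each semantic field."""
    counts = Counter(field for _, field in labelled)
    total = len(labelled)
    return {field: 100 * n / total for field, n in counts.items()}


# Right collocates of nel caso di in the displayed sample, labelled by hand.
# "other" is a hypothetical label for the two items not classified in the text.
sample = [
    ("homines novi", "literary"),
    ("acciai superferritici", "technical"),
    ("algoritmi", "technical"),
    ("bambini", "other"),
    ("terzo sonetto", "literary"),
    ("fumo", "other"),
    ("carcinoma midollare", "technical"),
    ("Lord Spleen", "literary"),
    ("Orlando furioso", "literary"),
    ("patologia neoplastica", "technical"),
]
shares = field_shares(sample)
for field, share in sorted(shares.items()):
    print(f"{field}: {share:.1f}%")
```

A ten-line sample cannot stand in for the whole concordance; the point is only that the figures are reproducible once the labelling decisions are made explicit.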
We have now established a first set of translation equivalents. The correspondence is not only between two multi-word units which incorporate the
same lexical word case/caso, or indeed between grammatical functions. The
equivalence has been evaluated at the level of functionally complete units of
meaning. The evidence of a divergence in semantic preference and/or prosody
will be of great help to the translator, for example, and will allow him/her to
avoid those instances of rather infelicitous translationese (Gellerstam 1986)
which may stem from an involuntary contravening of the unstated semantic
preference. The case discussed above is, in spite of the differences in semantic
preference, a fairly felicitous case of equivalence.
One word of warning. The difference in semantic preference apparent
between the English and the Italian in the concordance discussed above needs
to be confirmed, to ensure that it is not the result of an imbalance in the selection of the texts included in the corpus and therefore skewed towards a language variety or reflecting a specific topic. Semantic prosodies are often linked
to language varieties and seem to become more systematic and restricted the
more specific and restricted the variety is. This point raises the issue of representativeness of a given corpus and, in our case here, the comparability of the
L1 and L2 corpora. Unfortunately, it is still often the case that an analyst will be
presented with a set of L1/L2 comparable corpora as a fait accompli, without
any real access to information on the criteria according to which these corpora
have been assembled and certainly without any say in corpus design. This is to a
certain extent inevitable given the fact that corpora tend to be very large nowadays and beyond the undertaking of a single individual. However, it also means
that the user will all too often not be in a position to evaluate the evidence
properly for lack of information and will not be able to influence the representativeness of the texts included in the corpus.
My position with respect to representativeness is rather pessimistic; I
would go along with Leech (1991: 27) in saying that the assumption of representativeness
must be regarded largely as an act of faith as at present we still
have no way of ensuring it or evaluating it objectively, although a lot of work is
being done in this direction. I believe it is of paramount importance that the
analyst, who will have to judge in the end whether the semantic preferences and
prosodies reflect topic-dependency or are inbuilt in the language, be at least
able to assess the specific criteria used in corpus building, and access a list of all
the texts included in the corpus and information on the sampling criteria
adopted. Last but not least, I think we should remember that corpus work,
whether monolingual or bilingual, is above all comparative work, where the
analyst must never tire of comparing across different varieties, different situations, different languages, different corpora. Only thus will s/he arrive at a balanced statement in language description.
5.2 From in case of to in caso di
From a grammatical point of view in case of is, like in the case of above, a complex preposition introducing, in Hallidayan terms, a circumstance of cause or
condition (1985: 140). It is interesting to note that the difference between the
two (the first one introducing a circumstance of matter, the second one of
cause or condition) is only brought about by the presence or absence of the
definite article as an explicit signal of specificity. From the corpus we get some
information on frequency: compared with in the case of (550 instances in 20m
words) this complex preposition is not very frequent (56 instances in total), a
fact that points to its specialised function:
man being, and I wanted to be sure  in case of  a sudden emergency that we gave
ever. This will minimize your loss  in case of  accident or theft. One pound
   of milk, a jar of pureed prunes  in case of  constipation. Other tips.
    mast and sail were constructed  in case of  engine failure. There were
    the place up kept an eye on it  in case of  further vandalism or moves from
      it in polythene kitchen wrap  in case of  involuntary incontinence, and
    be there to pick up the pieces  in case of  massive calamity. Such
 er not. One of us should be here,  in case of  more Lady Alices, don't you
        don't, and closed her eyes  in case of  something terrible. Nothing fata
 Under-ripe berries were preferred  in case of  transport hold-ups, and were

Starting now with the first step focusing on formal patterning, we identify
some of the most frequent collocates as repeated co-occurrences in the right
co-text: accident/s, attack, emergency, fire, trouble, need and difficulty are among
the more frequent. This collocational profile already shows a strongly negative
semantic preference for what could be termed disaster areas and this is reinforced by other words which belong to the same semantic field:
a burn
an urgent telegram
renewed difficulty
further questioning
massive calamity

What we have called semantic prosody, the overall function of the expression,
could here be termed provision for disaster. This prosody is so strong that the
only instance where the noun following in case of is rather neutral, viz.
One of us should be here, in case of more Lady Alices, don't you think?

is understood along the same lines, and the possibility of a Lady Alice, or
someone like her, appearing at the door is interpreted as unappealing to say the
least. The statement here carries an obvious ironical intention and the clash in
semantic prosodies can be seen as the formal realisation of this ironical intention (Louw 1993).11
Having thus identified an extended unit of meaning in English, the second
step will posit, as a prima facie equivalent, the Italian in caso di. As a start we can
say that in caso di has the same grammatical function as the English counterpart, and again, in terms of frequency, it is rather rare (the Italian corpus contains 83 instances in total, compared with 319 nel caso di).
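The frequency comparison can be made explicit with a little arithmetic. In the Python sketch below the function names per_million and ratio are assumptions, as is the reading that both English counts refer to the same 20m-word corpus; since the size of the Italian corpus is not stated here, only the ratio between the two Italian forms is computed.

```python
def per_million(hits, corpus_size):
    """Normalised frequency: occurrences per million running words."""
    return hits * 1_000_000 / corpus_size


def ratio(bare, articled):
    """Frequency of the article-less form relative to the form with the article."""
    return bare / articled


ENGLISH_CORPUS_SIZE = 20_000_000  # "20m words", as reported in the text

english = {"in the case of": 550, "in case of": 56}
italian = {"nel caso di": 319, "in caso di": 83}

for phrase, hits in english.items():
    print(phrase, per_million(hits, ENGLISH_CORPUS_SIZE))

print(round(ratio(english["in case of"], english["in the case of"]), 3))
print(round(ratio(italian["in caso di"], italian["nel caso di"]), 3))
```

On these figures the article-less form is roughly a tenth as frequent as the form with the article in English and roughly a quarter as frequent in Italian, so it is the rarer member of the pair in both corpora.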
The third step will de-generalise the prima facie equivalent into its formal components:
   posporre l'intervento dell'esercito.  In caso di  calamità naturali dirige
   questa castagna e non aprirla se non  in caso di  gran necessità. Cammina
     riguarda l'arruolamento volontario  in caso di  guerra. A sedici anni Giovan
  avrebbe preso e come le avrebbe usate  in caso di  incendio. Accennava i movime
fossero più pericolose di quelle grandi  in caso di  incidenti a catena a velocità
    adeguati alle loro esigenze di vita  in caso di  infortunio, malattia, invalidità
 epilessia che garantisce una copertura  in caso di  morte o invalidità permanente
giova ricordare che il sistema bancario  in caso di  riduzione del personale , non
  con Mitterrand alla presidenza poiché  in caso di  successo non sarebbe né
   meglio avere i capelli super-puliti.  In caso di  un invito ultimo-momento,

At the collocational level we find repeated instances of the words necessità,
guerra and urgenza. The negative semantic preference for disaster areas is
clearly noticeable and the other words present in the right co-text emphasise
this preference for the negative:
riduzione del personale
brusca frenata
impedimento permanente
rilascio accidentale
debolezza organica
giudizio negativo
risposta negativa

The semantic prosody can again be termed provision for disaster as the extended unit of meaning has the overall function of hypothesising the possibility of
something unpleasant happening and offering a guarded damage-limitation
statement about it. It is interesting to note that as in the instance with more
Lady Alices we find here an example that is apparently neutral:
con Mitterrand alla presidenza poiché in caso di successo non sarebbe né servito a

This instance, though, is seen to fit the pattern once we look at the wider context, where it becomes apparent that the spokesman who is talking is conservative and, for him, the success of Mitterrand would be certainly perceived as a
major disaster.
The equivalence thus established between in case of and in caso di takes
account of the general semantic field and semantic prosody and can be said to
be satisfactory at the wider functional level. In the light of evidence from larger
corpora it will be possible to validate a stronger collocational profile which
would certainly provide a welcome guide to appropriateness for the researcher
who works across languages.
5.3 From in case to se per caso
Next I would like to consider briefly the conjunction in case.13 The conjunction
works at the level of the clause and therefore it can be expected that the local
patterning in terms of collocation will be less strong than with the prepositions.
In my data there is no specific collocational restriction. At the colligational
level one notices a strong presence of personal pronouns: I, you, he, she, we,
which point to a certain interactiveness and colloquialness of the texts14 on the
one hand and an association with the narrative genre on the other. The element
of colloquialness seems to be confirmed by the frequent use of phrasal verbs,
lay ahead, put off, turns out, catch up, mixed in, brushed against, etc. In terms of
semantic preference there is again a lot of variation, but a number of words and
expressions certainly confirm the feeling of informality.
In most cases the analysis requires a wider context to identify the functionally
complete unit and I will not present the full concordance of this expression for
lack of space. The citations below can help identify the overall semantic prosody:
               He looked at her now with alarm,  in case  she might do the room an injury...
        it is better to be bribed than to bribe  in case  something goes wrong down the line...
         Claude, Fernet and I will be there too  in case  the wolves start snapping...
                          Tear gas, small arms,  in case  they won't come back by themselves...
                      He poured himself another  in case  the abbess forgot to suggest it...
                      I left a bundle in my bed  in case  anyone looked but they were both
avoid passing them too close to the male's body  in case  they brush against him accidentally...
He didn't tell you how to get in touch with him  in case  I should arrange another party?
  The pistol's right there beside the bed, just  in case  the pimp has an attack of amnesia...
   cannibals are perfectly nice people and just  in case  you are wondering what this team is doing in the bush...

In these instances the tone and what we have called the semantic prosody is
clear: there is an element of guarded damage limitation and provision for disaster (as with in case of) but the speaker/writer/narrator also seems to be smiling
half ironically or tongue-in-cheek at some situation and sharing with
his/her reader a knowing wink.
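The citations above are concordance lines: each occurrence of the node expression is shown with some left and right context. A minimal sketch of how such lines could be pulled from running text follows. It assumes the corpus is available as a plain string; the function name `kwic` and the 40-character window are illustrative choices, not the software used for this study.

```python
import re


def kwic(text, node, width=40):
    """Return (left, node, right) triples for every occurrence of `node`.

    `width` is the number of characters of context kept on each side.
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()].strip()
        right = text[match.end():match.end() + width].strip()
        rows.append((left, match.group(0), right))
    return rows


# Two of the citations quoted above, joined into one toy text.
sample = (
    "He poured himself another in case the abbess forgot to suggest it. "
    "I left a bundle in my bed in case anyone looked but they were both"
)

for left, node, right in kwic(sample, "in case"):
    # Right-align the left context so that the node forms a column.
    print(f"{left:>40} | {node} | {right}")
```

Note that a plain string search of this kind also returns instances of in case of, so the conjunction would still have to be separated from the preposition by hand or by a further filter.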
A prima facie translation equivalent for Italian is se per caso:
Una torre senza ragni è sospetta: se per caso ne trovaste una, fuggite via subito...
E soprattutto non cadermi addosso se per caso c'è un urto violento...
si guardavano intorno per vedere se per caso il lupo li seguiva, ma lui ovviamente...
calzati con le scarpe da footing infangate e se per caso non si posino sul divano rivestito di un materiale
Venne la ragazza per chiedergli se per caso il cibo non fosse stato cucinato...
chiede notizie di Mozart, se per caso lei conosce la sua Marcia Turca...
decise di scoprire se per caso si erano trovate a New York nello stesso
una pentola, dico una, con il coperchio? se per caso gli riesce a convincere una a salire, pardon, a...
Attraversava il sentiero; e se per caso un qualche grosso scarabeo zuccone...

In the citations above, again we find that the collocational patterning is not
strong but the verbs vedere ('to see whether') and chiedere ('to ask whether') are
often present in the left context of the conjunction. Other verbs which occur
only once, but reinforce a general semantic preference for 'discovering
whether', are scoprire and spiare, often in the context of a fairy tale (see also the
use of words such as lupo/lupa, 'wolf' and 'she-wolf').
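The observation about verbs in the left context can be checked mechanically by counting the words that occur within a few positions to the left of the node. The sketch below is only an illustration: the function name `left_collocates`, the four-word span and the whitespace tokenisation (which leaves punctuation attached to words) are all assumptions, and the three input lines are taken from the citations above.

```python
from collections import Counter


def left_collocates(lines, node, span=4):
    """Count the words found up to `span` positions to the left of `node`."""
    node_tokens = node.lower().split()
    n = len(node_tokens)
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == node_tokens:
                # Only the words before the node are counted.
                counts.update(tokens[max(0, i - span):i])
    return counts


citations = [
    "si guardavano intorno per vedere se per caso il lupo li seguiva",
    "decise di scoprire se per caso si erano trovate a New York",
    "Venne la ragazza per chiedergli se per caso il cibo non fosse stato cucinato",
]

for word, freq in left_collocates(citations, "se per caso").most_common():
    print(word, freq)
```

On a full concordance the same count would show whether vedere and chiedere really dominate the left context or whether the impression rests on a handful of lines.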
Perhaps the most noticeable difference between the English and the Italian,
though, is the fact that in case is used when someone is mentioning a possible
future situation or hypothetical event as a reason for doing something. In
Italian this reference to a future/hypothetical event is not there, so we have
instances like si guardavano intorno per vedere se per caso il lupo li seguiva where
the best translation would perhaps be a more neutral 'to see if by any chance
the wolf was following/had followed them', which would allow a reference to
a past situation.
At the level of the semantic prosody we find some instances like the first
one above, 'A tower without spiders is suspect; se per caso you were to find one
you should run away immediately', which share the tongue-in-cheek tone of the
English instances. Others, though, do not seem to take this
light, semi-ironical attitude to the subject presented. In the example 'she
decided to discover se per caso they had been in New York at the same time',
we no longer have a conditional clause but an indirect interrogative clause; here
the writer is really talking of a fairly neutral possibility, and simply adding to it
the element of discovery. A translator would probably opt for a translation that
does not include in case, which is consistently associated with irony. We have to
conclude that the prosody which was remarkably regular in English is not as
systematic in Italian. The element of tongue-in-cheekness remains as a possible
choice at the paradigmatic level, allowing the linguist or the translator to identify se per caso as an equivalent of in case when the time reference allows it.
However, the type of special effects identified by Louw (1993), such as irony or hidden
attitudinal stance, which depend on a clash with "a sufficiently
expected background of expected collocations" (ibid.), cannot be reliably identified in the Italian data. A translator working with English as his/her TL would
have to be aware that the neutral possibility introduced by se per caso could
become tinted with ironic connotations when introduced by in case.
In terms of functionally complete units of meaning we have analysed here
what seemed a possible translation equivalent, but the function of the two
expressions has been shown to differ quite a lot after all. At the grammatical
level the difference in time reference is quite noticeable and could give rise to
mistakes in the translation. At the level of the semantic prosody it could generate a trap for the unaware translator because the correspondence is similar but
not as systematic.

6. Conclusion
The approach to establishing functional equivalence, whether for contrastive
or translation purposes, proposed in this article advocates the use of comparable corpora in stages one and three, and a translation corpus in stage two. The
use of comparable or even 'relatively comparable' corpora is seen as an
absolute necessity in order to establish equivalence and it is argued that it
would be impossible to reliably identify functionally complete units of meaning without the help of the evidence from the corpus. I maintain that it is also
necessary to use a translation corpus to posit a set of prima facie equivalents,
but such corpora are still not widely available and I have not been able to access
one for this study. As a result this study only partially exemplifies the model it
advocates, but I hope it points a way forward to a methodology that will bring
together the translators' experience (as from the translation corpus) and the
input, the richness and the variability of two natural languages (as from the
comparable corpora). This methodology is offered as potentially useful to all
researchers working across languages, contrastive linguists, bilingual lexicographers and translators alike. Of course the input of the translators is given a
certain priority and their input is more directly channelled into the procedure
at the level of the translation corpus.
What I have tried to show in this study is a way of establishing and evaluating the comparability of units of meaning across languages which takes into
account language events which, in Firth's words, are "typical, recurrent and
repeatedly observable" (1957: 35). The assumption that words do not live in
isolation but in strict semantic and functional relationship with other words
has led me to posit the notion of functionally complete units of meaning. To sum
up we can characterise them in this way:
1. They can be identified by looking at patterns of co-selection in the context
of a word or expression. They involve collocational (lexical) and colligational
(grammatical) choices and therefore cannot be defined solely in lexical or
grammatical terms. They also involve a semantic preference, realised by words
which belong to the same semantic field, and a specific semantic prosody at the
pragmatic and connotational level.
2. They are syntagmatic units in that they interrelate with other words and,
through a process of co-selection, they form a multi-word unit which becomes
available as a single choice on the paradigmatic axis.15
3. Only when these multi-word units are functionally complete do they
become available as translation equivalents or as comparable units of meaning
between two languages.
I hope to have demonstrated that there is no way in which the information
gathered from corpus evidence simply by observing the repeated patterns of
co-selection can be found in standard works of reference. The examples of
multi-word units chosen here share the same lexical core, the word case. What
accounts for their varying degrees of correspondence in terms of their semantic
preference and semantic prosody cannot be severed from their very individual
pattern of co-selection. It would not make sense to attempt a translation without first being fully aware of that specific semantic preference and that specific
semantic prosody. In the examples discussed in the context of English and
Italian we have found that the match was good in most respects. But this should
never be assumed, and a comparison with other languages will indeed prove
this point.


1. See Tognini Bonelli (2001) for an account of the integration of this approach with the
building up of a network of translation equivalents.
2. This is not always the case. Some models which recognise a lexicon and a grammar
attempt to integrate the two (for an overview, see Faber & Mairal Usón 1999). One may
wonder, however, whether the theoretical positing of a dichotomy between lexis and grammar does not in itself affect the actual appreciation of the strict interconnections and overlaps between the two.
3. Stubbs (1993: 8–9) explicitly points out that "much linguistics is based on invented sentences. In addition, often only a very small number of invented sentences are discussed" and
he goes on to warn us that "it is easy to forget or ignore how little data, either invented sentences or real texts, is actually analysed in the most influential literature in twentieth century
linguistics". The linguists he quotes in this context range from Saussure, Bloomfield,
Chomsky and Lyons to Austin and Searle; but even Firth and Halliday turn out to analyse
very little text.
4. Sinclair (1991 and ff.) is fond of saying that a corpus can prove anything and the opposite of anything, and a theory that can account for corpus evidence specifically is needed in
order to do justice to the data.
5. Sinclair defines the idiom principle or phraseological tendency and points out that "The
principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be
analysable into segments" (1991: 110).
6. The view of this second stage in the translation process I am proposing is based on
Sinclair's position on the function of reporting structures in discourse (1981). Applied to
translation, at the extra-linguistic level, this strategy will account for the role the translator
has in (1) assessing his bridging task between SL and TL, source audience (i.e. the audience
of the original writer, in the SL) and target audience (i.e. the audience of the translator in the
TL), as well as the specific genre and function of the text (narrative, technical, persuading,
advertising, etc.); (2) taking into account the correspondences between function and formal
realisations across different languages established at the purely linguistic level; (3) reporting
the original message (as negotiated in the interaction between source writer and his own
audience) to his/her (the translator's) own target audience.
7. A translation corpus (see below) can be used to shed light on the process of translation
itself. For an interesting discussion on the process of translation and the use of corpora see
Baker (1996) and (1998). The most common use of a translation corpus, however, remains
the access to translations as products where the translated corpora reveal cross-linguistic
correspondences and differences that are impossible to discover in a monolingual corpus.
8. Access to a set of two (or more) truly comparable corpora is not always possible. At the
time of writing, although some monolingual corpora which claim to be representative exist
and are accessible, sets of comparable corpora in different languages are still difficult to
9. The English corpora used here are: the Economist corpus (containing 9.38 million
words from the journal of the same name) and the Wall Street Journal corpus (containing 6.36
million words from the journal of the same name). As a general corpus I refer to the
Birmingham corpus, which is the original 20 million word corpus of contemporary English on
which the Cobuild project was initially based. These are now part of the holdings of the Bank
of English. The Italian corpus I use contains 4.5 million words of contemporary Italian and
is a part of the holdings of the Istituto di Linguistica Computazionale at the Università degli
Studi di Pisa. I would like to acknowledge here the generosity of all those who provide corpora for research, and in this case Prof. Zampolli, Director of ILC, Pisa and Jeremy Clear,
Director of Cobuild Ltd.
10. The concordance for in the case of is taken from the Birmingham Corpus of
Contemporary English (20 million words).
11. In his seminal article (1993) Louw defines semantic prosody as "a consistent aura of
meaning with which a form is imbued by its collocates" and discusses the special effects due
to a clash in semantic prosodies: "Evidence is emerging that departures in speech or writing
from the expected profiles of semantic prosodies, if they are not intended as ironic, may
mark the speaker's real attitude even where s/he is at pains to conceal it" (ibid.: 157).
12. Un invito-ultimo-momento is 'an invitation at the last moment'. In Italy some people
are rather sensitive to this because it implies not really having been planned in at the party
and having been invited only because someone has called out or, worse, because the host has
suddenly realised that 13 people were going to sit at the table. This instance goes along with
the trend of in case + something unpleasant taking place.
13. I found a total of 532 instances of the conjunction in case in the Birmingham Corpus
(20 million words).
14. Biber (1994) identifies text types in contrast to register and genre on the basis of
shared linguistic co-occurrence patterns. Among the linguistic features analysed to identify
text types one is pronouns; Biber points out that they are "relatively interactive and colloquial in communicative function" (ibid.: 389).
15. This is the case even when a very simple collocational pattern is the only expansion on
the core word: consider the example of fork, the meaning of which was differentiated, at the
collocational level, by a different pattern of co-selection: knife and fork on the one hand and
fork in the road on the other.

Baker, M. 1993. Corpus linguistics and translation studies. In Baker et al. (eds), 233–250.
Baker, M. 1996. Corpus-based translation studies: the challenges that lie ahead. In
Terminology, LSP and Translation: Studies in Language Engineering, in Honour of Juan
Sager, H. Somers (ed.), 175–186. Amsterdam and Philadelphia: Benjamins.
Baker, M. 1998. Réexplorer la langue de la traduction: une approche par corpus. Meta
43: 480–485.
Baker, M., Francis, G. and Tognini Bonelli, E. (eds) 1993. Text and Technology. In Honour of
John Sinclair. Amsterdam and Philadelphia: Benjamins.
Biber, D. 1994. Representativeness in corpus design. In Current Issues in Computational
Linguistics in Honour of Don Walker, A. Zampolli, N. Calzolari and M. Palmer (eds).
Linguistica Computazionale IX–X. Giardini Editori e Stampatori in Pisa and Kluwer
Academic Publishers.
Faber, P. B. and Mairal Usón, R. 1999. Constructing a Lexicon of English Verbs. Berlin and New
York: Mouton de Gruyter.
Firth, J. R. 1957. Papers in Linguistics 1934–1951. London: Oxford University Press.
Firth, J. R. 1968. A synopsis of linguistic theory: 1930–55. In Selected Papers of J. R. Firth
1952–59, F. R. Palmer (ed.), 168–205. London and Harlow: Longmans.
Francis, G. 1993. A corpus-driven approach to grammar. In Baker et al. (eds), 137–156.
Gellerstam, M. 1986. Translationese in Swedish novels translated from English. In
Translation Studies in Scandinavia, L. Wollin and H. Lindquist (eds), 88–95. Lund:
CWK Gleerup.
Halliday, M. A. K. 1992a. Language theory and translation practice. In Rivista
Internazionale di Tecnica della Traduzione, No. 0, 27–58. Udine: Campanotto Editore.
Halliday, M. A. K. 1992b. Language as a system and language as an instance: the corpus as a
theoretical construct. In Directions in Corpus Linguistics, J. Svartvik (ed.), 61–77. Berlin
and New York: Mouton de Gruyter.
Halliday, M. A. K. 1985. An Introduction to Functional Grammar. London: Edward Arnold.
Hunston, S. and Francis, G. 2000. Pattern Grammar: a Corpus-driven Approach to the Lexical
Grammar of English. Amsterdam and Philadelphia: Benjamins.
Louw, B. 1993. Irony in the text or insincerity in the writer? The diagnostic potential of
semantic prosodies. In Baker et al. (eds), 157–176.

Johns, T. 1991. Should you be persuaded. Two samples of data-driven learning materials.
In Classroom Concordancing. ELR Journal 4: 1–16. University of Birmingham.
Neubert, A. 1985. Text and Translation. Leipzig: VEB Verlag Enzyklopädie.
Nida, E. 1964. Towards a Science of Translating. Leiden: J. Brill.
Sinclair, J. M. 1981. Planes of discourse. In The Two-fold Voice: Essays in Honour of Ramesh
Mohan, S. N. A. Rizvi (ed.), 70–89. Salzburg: University of Salzburg.
Sinclair, J. M. (ed.) 1987. Looking Up: an Account of the COBUILD Project in Lexical
Computing. London: Collins.
Sinclair, J. M. 1991. Corpus Concordance Collocation. Oxford: Oxford University Press.
Sinclair, J. M. (ed.) 1996. Corpus to Corpus. Studies in Translation Equivalence. Special issue
of the International Journal of Lexicography 9 (3).
Sinclair, J. M. 1996. The search for units of meaning. TEXTUS 9: 75–106.
Sinclair, J. M. 1998. The lexical item. In Contrastive Lexical Semantics, E. Weigand (ed.),
1–24. Amsterdam and Philadelphia: John Benjamins.
Stubbs, M. 1993. British traditions in text analysis. In Baker et al. (eds), 1–33.
Tognini Bonelli, E. 1996a. Towards translation equivalence from a corpus linguistics perspective. In Sinclair (ed.), 197–217.
Tognini Bonelli, E. 1996b. Corpus Theory and Practice. TWC Monographs, Birmingham:
Tognini Bonelli, E. 2000. Il corpus in classe: da una nuova concezione della lingua a una
nuova concezione della didattica. In Linguistica e Informatica: Corpora, Multimedialità
e Percorsi di Apprendimento, R. Rossini Favretti (ed.), 93–108. Roma: Bulzoni.
Tognini Bonelli, E. 2000. Things that can and do go wrong in language teaching: Revisiting
the seven sins. In the light of corpus evidence. Linguistica e Filologia 11.
Tognini Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam and Philadelphia:
Viaggio, S. 1992. Contesting Paul Newmark. In Rivista Internazionale di Tecnica della
Traduzione, No. 0: 27–58. Udine: Campanotto Editore.

Causative constructions
in English and Swedish
A corpus-based contrastive study
Bengt Altenberg


High-frequency verbs are often problematic for foreign language learners. The
reason for this is that, while they tend to express basic universal meanings and
consequently have equivalents in most languages, they have also undergone
various meaning extensions resulting in a high degree of polysemy and language-specific uses (cf. Viberg 1996). As a consequence, superficial cross-linguistic similarities often conceal treacherous differences.
An interesting example of this was revealed in a recent study by Altenberg
and Granger (2001) of the lexical and grammatical patterning of the verb
make in the International Corpus of Learner English, which showed that
French-speaking and Swedish EFL learners deviated in interesting ways from
native American students' use of the verb.1 While both learner groups underused (and misused) delexical make (e.g. make a decision, make a point), they
were clearly differentiated in their treatment of causative make (e.g. make sb
happy, make sb believe sth). As shown in Table 1, the French-speaking learners
significantly underused causative make with adjective and noun complements
(e.g. make sth possible, make sb a star), whereas the Swedish learners revealed
an equally significant overuse of causative make with adjective and verb complements (e.g. make sth easier, make sb understand).

Table 1. Causative uses of make by EFL and native US students

Another interesting finding was that the learners' treatment of causative
make seldom resulted in clear errors but in a number of rather clumsy constructions, suggesting that the learners tended to opt for a semantically and
grammatically decomposed make + object + complement pattern in cases
where a native writer would prefer a synthetic causative verb alternative (e.g.
make people come closer instead of bring people closer):
(1) So a recession can actually make people come closer to each other (bring
people closer)
(2) The most difficult thing about this is to make its inhabitants open their
eyes (open its inhabitants' eyes)
(3) ... the differences are made to vanish (are eliminated)
(4) There will always be pressure from the outside to make us change (change us)

From a Swedish perspective, these results raise several interesting questions.
How can the Swedish learners' overuse of causative make be explained? Do they
overgeneralise a dominant English pattern (intralingual influence) or are they
affected by transfer from Swedish (interlingual influence)? A purely intralingual explanation is not very plausible, however, since the Swedish and French-speaking learners display fundamentally different tendencies. If intralingual
influence had been the main conditioning factor, we would expect both learner
groups to behave in the same way. This leaves interlingual influence, i.e. transfer from L1, as a more likely explanation.

Causatives in English and Swedish

The assumption that the Swedish learners' overuse of causative make may be
the result of transfer from Swedish is intuitively supported by the similarity
between the basic causative constructions in the two languages. As shown in
Table 2, English causatives with make can be divided into three main 'analytical' types, as I will call them, depending on whether the complement following the object is an adjective phrase (type A), an infinitive clause (type B) or
a noun phrase (type C). Swedish has corresponding constructions with the
verbs göra (types A and C) and få (type B).

Table 2. Main causative constructions in English and Swedish

Type    English                              Swedish
A       make + Object + Adjective phrase:    göra + Object + Adjective phrase:
        She made him happy                   Hon gjorde honom lycklig
B       make + Object + Infinitive:          få + Object + Infinitive:
        He made her laugh                    Han fick henne att skratta
C       make + Object + Noun phrase:         göra + Object + Prep. phrase:
        They made it their home              De gjorde det till sitt hem

Semantically and syntactically the constructions are very similar in the two
languages. They are all complex-transitive structures (cf. Quirk et al.
1985: 1195) in which the raised object and the complement are notionally
equivalent to the subject and predication of a related clause which expresses the
result of the causative event (cf. Juffs 1996 and Song 1996). The differences are
relatively superficial: English has one prototypical causative verb, Swedish has
two; the English B construction has a bare infinitive, the Swedish infinitive is
preceded by the marker att 'to'; the complement of the C construction is a
noun phrase in English but a prepositional phrase in Swedish.
Apart from these analytical constructions, both languages have various
other ways of expressing causative relations. For example, in English there are
many synthetic causative verbs in which the resulting state or event is fused
with the causative meaning of make into a single verb form: make sth fall = fell
sth, make sb believe = convince sb. In addition, causative relations can be
expressed by verbs other than make, such as cause, force, get, have and let.
Moreover, cause-effect relations can be expressed by conjunctions (e.g.
because, so that), by adverbial expressions (e.g. because of NP), by verbs (e.g.
cause, result in) and in a number of other ways. The same applies in Swedish.
Although both languages have various resources to express causative relations, it is reasonable to describe the analytical patterns as the basic or prototypical causatives in the two languages. Since there is a striking cross-linguistic
parallelism between these constructions, it is natural to assume that Swedish
learners might be tempted to use the semantically and grammatically decomposed make + NP + complement pattern even in cases where a native writer
would prefer a synthetic alternative, with examples like (1)–(4) as a result.
However, in the absence of good contrastive descriptions of causative constructions in English and Swedish this can only be a hypothesis. Our knowledge of the relative frequency of various alternatives and of the prototypicality
of the analytical constructions in the two languages is very limited, and hardly
anything is known about the degree of correspondence of the various alternatives across the two languages. It is the purpose of this study to find out something about this and, if possible, throw some light on the Swedish learners'
overuse of causative make constructions. For this purpose, the learner study
will here be supplemented with a contrastive examination of the main
causative options available in English and Swedish and their distribution in a
parallel corpus of English and Swedish texts.

Aim and material

For practical reasons, I will limit my study to the B construction in the two languages.2 The following questions will be explored:

How dominant are the B constructions in the two languages?
To what extent are the B constructions retained in translations between the two languages?
Which are the main causative alternatives and how often are they used?
How can contrastive data help to explain the Swedish learners' overuse of English B constructions?

The study is based on the English-Swedish Parallel Corpus (see Aijmer et al.
1996). As shown in Table 3, the corpus consists of 40 English text samples and
their translations into Swedish and 40 Swedish text samples and their translations into English. The samples are 10,000–15,000 words in length and half of
them are drawn from fiction, half from non-fiction texts. Within each genre,
the source texts from the two languages have been matched as far as possible in
terms of purpose, subject matter and register, which means that the corpus can
be treated both as a comparable corpus and as a translation corpus (on this
distinction, see Johansson 1998).
Table 3. Size and composition of the English-Swedish Parallel Corpus
[Column headings: Text samples; No. of words; Fiction, Non-fiction, Total. Row labels: Eng. original, Swe. translation; Swe. original, Eng. translation.]

Method
The composition of the corpus makes it possible to compare the languages in
several ways (cf. Aijmer et al. 1996):
(a) Source texts ↔ source texts. By comparing the use of the B constructions in
the original English and Swedish texts we can get an indication of their frequency and relative importance in each language.
(b) Source texts → translations. Using the original texts as a starting point and
comparing them with the corresponding translations into the other language, we can find out how English causative make is translated into Swedish
and how Swedish causative få is translated into English. This will give an indication of the main translation equivalents used to render the B constructions
in each language and the relative importance of these equivalents.
(c) Translations → source texts. Using the translations as a starting point and
comparing them with the corresponding source texts in the other language, we can find out which Swedish source constructions have ended up
as causative make constructions in the English translations and which
English source constructions have ended up as causative få constructions
in the Swedish translations. This reversed approach will give an indication of the range of source constructions that have been used as a point of
departure for the B constructions in the target language. Studying the
translations in this direction will be a useful supplement to approach (b)
and serve as a check on possible translation effects (cf. Johansson 1998).
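Approaches (b) and (c) amount to reading the same set of aligned source-translation pairs in two directions. The sketch below illustrates the idea. It assumes each aligned pair has already been labelled by hand with a construction type; the `AlignedPair` record, the label strings and the six toy pairs are invented for illustration and are not data from the corpus.

```python
from collections import Counter
from typing import NamedTuple


class AlignedPair(NamedTuple):
    source_lang: str   # language of the original text: "en" or "sv"
    source_type: str   # construction used in the original
    target_type: str   # construction used in the translation


def forward(pairs, source_lang, construction):
    """Approach (b): how is `construction` in the originals translated?"""
    return Counter(
        p.target_type
        for p in pairs
        if p.source_lang == source_lang and p.source_type == construction
    )


def backward(pairs, target_lang, construction):
    """Approach (c): which source constructions end up as `construction`
    in translations into `target_lang`? (Two-language corpus assumed.)"""
    return Counter(
        p.source_type
        for p in pairs
        if p.source_lang != target_lang and p.target_type == construction
    )


# Invented toy data, for illustration only.
pairs = [
    AlignedPair("en", "make+NP+Vinf", "få+NP+att+Vinf"),
    AlignedPair("en", "make+NP+Vinf", "få+NP+att+Vinf"),
    AlignedPair("en", "make+NP+Vinf", "synthetic verb"),
    AlignedPair("sv", "få+NP+att+Vinf", "make+NP+Vinf"),
    AlignedPair("sv", "synthetic verb", "make+NP+Vinf"),
    AlignedPair("sv", "göra+NP+Adj", "make+NP+Vinf"),
]

print(forward(pairs, "en", "make+NP+Vinf"))   # Swedish translations of make
print(backward(pairs, "en", "make+NP+Vinf"))  # Swedish sources of make
```

Keeping the two directions as separate counts makes it easy to see when an equivalent is frequent in translations but rare in original texts, which is the kind of translation effect approach (c) is meant to expose.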

Analytical B constructions in source texts and translations

The relative frequencies of English make and Swedish få in analytical B constructions in the English and Swedish source texts and translations are shown
in Table 4.

Table 4. Causative make and få (type B) in source texts and translations
(n/100,000 words)

Two striking tendencies emerge from the figures. First, make has a much more
dominant position as a causative B verb in the English source texts than få has
in the Swedish source texts. This suggests that causative få has greater competition from alternative expressions in Swedish than make has in English. One
way of uncovering these alternatives will be to look at the Swedish sources of
causative make in the English translations. Second, in the translations this tendency is reversed: causative få is much more common in the Swedish translations than make is in the English translations. This indicates, somewhat paradoxically, that, whereas the Swedish translators regard få as a natural means of
rendering causative B constructions in their translations and tend to overuse it
as a result, the English translators display the opposite tendency: they seem to
underuse analytical make, evidently preferring other alternatives. Another reason could of course be that its main source, Swedish få, is relatively infrequent in the Swedish original texts. To determine this we shall have to look
more closely at the English sources and translations of causative få.
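The comparison of source texts and translations rests on frequencies normalised per 100,000 words, so that subcorpora of different sizes can be set side by side. A minimal sketch of that calculation follows; the function names and all the numbers are invented for illustration and do not reproduce the figures in Table 4.

```python
def per_100k(count: int, corpus_words: int) -> float:
    """Normalised frequency: occurrences per 100,000 running words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return count * 100_000 / corpus_words


def translation_ratio(rate_in_translations: float, rate_in_originals: float) -> float:
    """Above 1 suggests overuse in translations, below 1 underuse."""
    return rate_in_translations / rate_in_originals


# Invented counts and corpus sizes, for illustration only.
originals = per_100k(60, 600_000)
translations = per_100k(90, 600_000)
print(originals, translations, translation_ratio(translations, originals))
```

A ratio of this kind only describes the direction and size of the difference; whether the difference is statistically significant would need a separate test.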

Swedish equivalents of English causative make

To find out more about the causative options in the two languages, let us first
examine the Swedish equivalents of English B constructions in the corpus. As
mentioned, these can be established by looking at how make has been translated into Swedish and by looking at the Swedish sources of make in the English
translations. As shown in Table 5, five main types of equivalents can be distinguished, four involving a causative verb of some kind and one miscellaneous
category in which the causative relation is expressed in various other ways.

Table 5. Swedish equivalents of English causative make (type B)
Types of Swedish equivalents (columns: Swedish translations, Swedish sources):
Congruent construction with få
Other causative verb + NP + Vinf/Vfin
Causative verb + NP + Adj (type A)
Synthetic causative verb
Miscellaneous other

The most common Swedish equivalent in the corpus is a congruent B construction with få as a causative verb. It is especially common in the Swedish
translations (54%) but only accounts for a third of the examples in the source
texts (32%). This disproportion confirms the picture of Swedish få given in
Table 4. The Swedish translators obviously regard analytical få as the prototypical equivalent of the corresponding English constructions, overusing it as a
result. However, despite its status as the most natural Swedish equivalent of
make, it has strong competition from other alternatives, especially in the
Swedish source texts where the miscellaneous category is very common (44%).
The second most common equivalent (disregarding the miscellaneous category) is the use of a causative verb other than få (15%), appearing either in a
congruent construction with an infinitive complement or with a following
finite object clause. This type, too, is especially common in the Swedish translations. The following list includes all the verbs of this kind in the corpus (source
texts as well as translations):
komma NP + Vinf
tvinga NP + Vinf
göra att + finite clause
låta NP + Vinf
se till att + finite clause
säga till NP + finite clause
be NP + Vinf
göra så att + finite clause
ha NP + Vinf
tillhålla NP + Vinf
vinnlägga sig om + Vinf


With the exception of the high-frequency verbs komma 'come' and göra 'make',
these verbs are generally more specific in meaning than få, indicating varying
types of coercion, manner and modality (cf. tvinga 'force', tillhålla 'admonish',
låta 'let', be 'ask'). In addition, the choice of verb is determined by selection
restrictions, most of them requiring a human subject (e.g. se till 'see to', säga till
'tell', be 'ask', göra så att 'do so that').
A third, less common, Swedish alternative is to use the causative verb göra
followed by an adjective complement instead of an infinitive, i.e. an A construction rather than the B construction (cf. Table 2):

English version:            Swedish equivalent:
make NP feel dizzy (2)      göra NP yr (2) ‘make NP dizzy’
make NP feel better         göra NP bättre till mods ‘make NP better at heart’
make NP feel calmer         göra NP lugnare ‘make NP calmer’
make NP feel cheerful       göra NP glad ‘make NP cheerful’
make NP feel less grim      göra NP mindre tryckt ‘make NP less depressed’
make NP feel uneasy         göra NP underlig till mods ‘make NP uneasy at heart’
make NP feel worse          göra NP olustig ‘make NP uneasy’
make NP look foolish        göra NP löjlig ‘make NP ridiculous’
make NP look handsome       göra NP vackrare ‘make NP more handsome’
make NP look whiter         göra NP vitare ‘make NP whiter’
make NP look real           få NP riktig ‘get NP real’
make NP sweat               göra NP svettig ‘make NP sweaty’

Interestingly, in the great majority of these cases the only difference between
the English and Swedish versions is that a copular verb of perception (feel or
look) is present in the English version and absent in the Swedish one. In other
words, it seems as if these verbs tend to be redundant in Swedish.³ In one case
the Swedish verb is få rather than göra (få NP riktig ‘get NP real’). Få is used as
an alternative causative A verb to indicate that some degree of effort is involved
in the action and that the outcome is successful (cf. Viberg, this volume).
Another rather rare Swedish alternative is to use a synthetic verb conflating
the resulting state or event with the causative meaning of make:
English version:               Swedish equivalent:
make NP eat                    mata NP ‘feed NP’
make NPs differ                skilja NPs ‘distinguish NPs’
make NP emerge (by washing)    tvätta fram NP ‘wash forth NP’
make NP go further             dryga ut NP ‘make-last NP’
make NP stand                  ställa NP ‘put NP’
make NP stay behind            hålla NP kvar ‘keep NP behind’
make NP think of               påminna NP om ‘remind NP of’
make NP turn down              dra NP neråt ‘pull NP down’
make NP wet the hair           vattenkamma NP ‘watercomb NP’

As this list indicates, the synthetic alternatives are mainly restricted to cases
where the complement verb of the corresponding analytical construction is
intransitive (e.g. make NP stand = ställa ‘put’ NP). When the complement verb
is transitive, the object has to be incorporated into the synthetic verb in some
way (e.g. make NP wet the hair = vattenkamma ‘watercomb’ NP, where kamma
‘comb’ implies hair). This complication may be one of the reasons why synthetic verbs are generally less common as alternatives to B constructions than
to A constructions in the corpus (cf. Altenberg 1998).
In addition to these Swedish alternatives, all of which contain a causative
verb of some kind, there is a large number of other Swedish variants (called
‘miscellaneous’ in Table 5) in which the causative elements are reorganised
grammatically in various ways. These variants are especially common as
Swedish sources of analytical English constructions. Three recurrent subtypes
can be distinguished in the material:
i. the cause is implied and omitted
ii. the result is expressed in a finite clause and the cause by a different syntactic element
iii. the result is nominalised or replaced by a nominal expression
i. When the cause is unspecified or implied in the context there is often no need
to use a causative construction as long as the result is expressed. This is illustrated in (5), where the agentless passive causative verb (is made) in the English
original is left out in the Swedish translation, and in (6), where the causative
verb in the English translation corresponds to a modal auxiliary expressing
obligation (måste ‘had to’) in the Swedish original:
(5) Brand, the hero of the poems is made to say (RH)
    Brand, diktens hjälte, säger
    ‘Brand, the poem's hero, says’

(6) Dag måste lova att inte föra det vidare. (MG)
    ‘Dag had to promise ...’
    She made Dag promise not to pursue the matter any further.

ii. Generally, however, the cause is specified in both versions but encoded by a
syntactic element other than the subject in the Swedish text. A common
Swedish strategy (in source texts as well as translations) is to express the
causative result in a finite clause and indicate the cause in the form of an adverbial of reason:
English version:               Swedish equivalent:
X makes NP feel like crying    NP blir gråtfärdig av X ‘NP becomes cry-ready of X’
X makes NP feel bad            NP mår inte bra av X ‘NP does not feel well from X’
X made NP groan                NP jämrade sig vid X ‘NP groaned at X’
X made NP laugh                NP skrattade åt X ‘NP laughed at X’
X made NP blow up              därför svällde NP ut ‘therefore NP swelled out’
X almost made NP explode       Då höll NP på att smälla av ‘then NP almost ...’

Something similar is illustrated in the following examples, where the interrogative wh-pronoun (the causative subject) in the English version is rendered by
an interrogative adverb (varför ‘why’, hur ‘how’) in the Swedish version:
What makes you say that?                      Varför säger du det?
What makes you think I don't know?            Varför tror du inte jag skulle göra det?
What made you think I was looking for him?    Hur visste du att det är honom jag letar efter?

Alternatively, the cause and the result can be expressed in two finite clauses
linked by a subordinator indicating result or purpose:
(7) Normally it takes a lot more than that to make me feel outnumbered. (JB)
    I normala fall tarvas det betydligt fler än så för att jag ska känna mig i [minoritet]
    ‘so that I shall feel myself in a minority’

iii. Another common Swedish alternative is to nominalise the result and
encode it syntactically as the direct object in a monotransitive or ditransitive
construction. The causee is either implied or expressed as the direct object:
English version:      Swedish equivalent:
make NP change        åstadkomma förändringar ‘achieve changes’
make NP think of      föra tanken till ‘bring the thought to’
make NP appear to     ge intryck av att ‘give the impression that’
make NP stink         ge NP dåligt rykte ‘give NP a bad reputation’

Alternatively, the result can be rendered by a prepositional phrase indicating
the goal or result of the causative event while the causee is retained as the
direct object of the causative verb. Most of these variants are set expressions in Swedish:
make NP look like a fool    göra NP till åtlöje ‘make NP to ridicule’
make NP go                  hålla NP i gång ‘keep NP in motion’
make NP eat their words     sätta NP på plats ‘put NP in place’
make NP observe rituals     tvinga NP in i ritualer ‘force NP into rituals’
make the money go round     sätta pengarna i rörelse ‘put the money in motion’

The fact that these miscellaneous constructions are more common in the
Swedish source texts than in the translations suggests two things. First, despite
their formal variation, they represent important causative alternatives in
Swedish. At the same time, the Swedish translators obviously find it easier to
render them as analytical causatives than to retain them in their translations.
As we shall see, the same tendency is evident in the English texts.

English equivalents of Swedish få

Let us now reverse the perspective and look at the English equivalents of analytical Swedish B constructions with få. Table 6 shows the main English types of
equivalents used either as translations of Swedish få or as English sources of få
in the Swedish translations.
Table 6. English equivalents of Swedish causative få (type B)
Types of English equivalents              English translations    English sources
Congruent construction with make
Other causative verb + NP + Vnfin/Vfin
Synthetic causative verb
Various other constructions


The most common English equivalent is a congruent analytical construction
with make. Hence, make can indeed be described as the main English equivalent of Swedish få and the picture of make and få as mutually corresponding
causative B verbs is confirmed. In fact, the relative frequency of make as an
equivalent of få in the English translations and source texts is exactly the same
(44%) as that of få as an equivalent of make in the Swedish texts (cf. Table 5).
The proportion of make is slightly higher in the English translations (49%)
than in the English source texts (43%), which suggests that the translators
regard it as a particularly natural and handy alternative and tend to overuse it
as a result. This contradicts the picture given in Table 4 where make appears to
be underrepresented in the English translations. However, the higher proportion of analytical make in the English translations demonstrated in Table 6
clearly indicates its attractiveness to the translators, and even if the increased
use is not so dramatic as that of få in the Swedish translations (cf. Table 5), the
term ‘overuse’ seems justified in both cases.
Yet, despite its favoured position as the most common alternative, the relative frequency of make does not reach 50% even in the translations. Just like få
in Swedish, English make has strong competition from other causative variants. The second most common English alternative (disregarding the miscellaneous category) is to use a causative verb other than make, followed either by a
non-finite clause or, exceptionally, a finite object clause. As shown in Table 7,
these verbs are especially frequent in the English source texts, but their proportion is in fact higher in the translations, where they account for no less than
28% of the examples. This means that the English translators tend to rely rather
heavily on a smaller set of causative verbs in their renderings of Swedish få.
Table 7. Other causative English verbs
Type of causative verb            English translations    English sources
get NP + Vinf
cause NP + Vinf
lead NP + Vinf
ensure (that) + finite clause
set (off) NP + Ving
stop NP + Ving
allow NP + Vinf
encourage NP + Vinf
have NP + Ving
send NP + Ving
adapt NP + Vinf
compel NP + Vinf
enable NP + Vinf
get NP + Ving
have NP + Vinf
have NP + Ved
induce NP + Vinf
leave NP + Vinf
persuade NP + Vinf
render NP + Vinf
rouse NP + Vinf



Like the corresponding Swedish verbs, these English variants are generally more
specific in meaning than make, indicating various types of coercion and modality (cf. compel, persuade, encourage, enable, allow), requiring a particular type of
subject (human or non-human), or specifying the outcome of the causative
event (e.g. prevention). The majority of the verbs take an infinitive complement
(the outstanding choice in the translations) but quite a few take an ing-participle.
Have also occurs with a past participle and ensure with a finite object clause.
A third English alternative is to use a synthetic causative verb conflating the
causative meaning of make and the meaning of the complement verb or predicate (e.g. make NP rise = lift NP). This alternative is much more common as an
English source (14%) than as a translation (6%) of the analytical Swedish få
construction, which indicates that the translators find it more difficult to
retrieve a synthetic English equivalent than the readily available analytical construction. Yet, to judge from the corpus, synthetic verbs are a more important
causative alternative in English than in Swedish (cf. Table 5). Some examples of
synthetic English verbs in the material are:
Swedish version:                  English equivalent:
få NP att slappna av (3)          relax NP (3)
få NP att brista (2)              break NP (2)
få NP att lossna (2)              remove/loosen NP
få NP att öka takten/farten       quicken NP (2)
få NP att mjukna/vekna            soften NP (2)
få NP att slå ner/bort blicken    outstare NP (2)
få NP att bibehållas              keep NP
få NP att explodera               explode NP
få NP att fladdra                 ruffle NP
få NP att fällas ihop             collapse NP
få NP att gapa                    astonish NP
få NP att gå (snett)              head NP (diagonally)
få NP att hoppa högt              startle NP
få NP att inse                    teach NP
få NP att lyfta                   lift NP
få NP att mörkna                  thicken NP
få NP att resa sig                rouse NP
få NP att slå sig                 warp NP
få NP att spricka                 burst NP
få NP att tappa koncepterna       unnerve NP
få NP att tova sig                mat NP
få NP att tystna                  silence NP
få NP att övergå till             turn NP to
få tiden att gå                   pass the time

As this list demonstrates, the great majority of the synthetic English verbs correspond to an analytical Swedish construction with an intransitive (or reflexive) verb complement. Hence, the pattern of the Swedish synthetic verbs has a
clear parallel in English: analytical constructions with transitive complements
cannot easily be transformed into synthetic verbs unless the object of the transitive verb can be incorporated into the synthetic verb in some way (e.g. få NP
att slå ner blicken ‘make NP turn down the gaze’ → outstare NP; få NP att tappa
koncepterna ‘make NP lose his nerve’ → unnerve NP). But even in cases where
the analytical construction has an intransitive verb, a synthetic alternative is
often not lexically available (cf. make sb cry → *cry sb). The choice of a synthetic
verb is thus restricted in both languages, lexically as well as grammatically,
whereas the analytical construction is nearly always possible.
In addition to these English variants the corpus also contains a large group
of miscellaneous alternatives in which the causative relation is reorganised
grammatically in various ways. These alternatives are especially common as
English sources of Swedish analytical causatives (25%) and, like their Swedish
counterparts, they highlight the formal variation of causative expressions.
Although they are not as common as their Swedish counterparts in the corpus
(cf. Table 5), their structural patterning is very similar to that of the Swedish
ones. If we ignore cases where the cause is implied and therefore omitted, the
same subtypes can be distinguished as in the Swedish texts.
i. The result of the causative event can be expressed in a finite clause and the
cause rendered by an adverbial of reason:
Swedish version:                                English equivalent:
Ett ekonomiskt avgörande fick NP att åka        NP went for financial reasons
X fick NP att ramla ur stolen                   NP fell about in his chair at this
En impuls fick NP att öppna Y                   On impulse NP opened Y

Alternatively, the cause and the result can be expressed in separate clauses
linked by a subordinator indicating result:
(8) Han hade sinnesnärvaro nog och turen att rulla över på rygg, och det
    fick honom att känna sig lugnare. (JC1T)
    ‘made him feel safer’
    He had the sense, and luck, to roll this time onto his back so
    that [...] he was more safe.


ii. In the great majority of cases, however, the result of the causative event is
rendered by a nominal expression acting either as the direct object or as a
prepositional complement of the verb. In the latter case, the English verb is typically causative, the causee functions as the direct object, and the prepositional
complement expresses the goal or result of the causative event. Many of these
examples are set expressions or restricted collocations:
Swedish version:                         English equivalent:
få NP att framstå som X (2)              turn/make NP into X
få NP att gråta (2)                      bring NP to tears, have NP in tears
få NP att leva                           bring NP to life
få NP att upphöra                        bring NP to an end
få NP att klättra uppför väggarna        drive NP up the wall
få NP att fälla tårar                    have NP in tears
få NP att acceptera                      lull NP into acceptance
få NP att läsa                           lure NP into reading
få NP att råka i panik                   push NP into panic
få NP att spänna av                      put NP at ease
få NP att gå samman                      turn NP into a group
få NP att övergå till terrorism          turn NP to terrorism

When the nominalised result is represented as the direct object in the English
version, the causee generally acts as the indirect object:
Swedish version:                      English equivalent:
få NP att tro                         give NP cause to believe
få NP att tveka                       give NP pause for thought
få NP att känna självförtroende       give NP confidence
få NP att likna                       give NP the look of
få NP att rysa                        give NP a chill
få NP att se ut som                   give NP the air of
få NP att verka                       give NP the appearance of
få NP att förtvivla                   bring despair to NP

Sometimes the causee appears as a possessive modifier:

Swedish version:                 English equivalent:
få NP att visa förtrolighet      invite NP's confidence
få NP att sluta flina            wipe the smile off NP's face
få NP att börja gråta            bring tears to NP's eyes
få NP att lyssna                 get the ear of NP

If we compare the miscellaneous categories in the English and Swedish texts,
we find surprisingly similar structural patterns. The main difference lies in the
proportion of the subtypes used: in the Swedish texts the preferred strategy is
to express the result in a finite clause and indicate the causative relation by an
adverbial of reason or by a subordinator indicating result or purpose, while the
tendency to nominalise the result is less prominent. The English texts display
the opposite tendency: finite alternatives to the analytical construction are less
common, while nominalised results are very frequent. But all subtypes are used
in both languages, and they are more common in the source texts than in the
translations of both languages. Hence, the translators often find it easier to render them by analytical constructions than to retain them.

Contrastive summary
As this contrastive survey has demonstrated, English and Swedish have a surprisingly similar range of resources for expressing causative relations. The
main types in both languages are:

analytical constructions with make in English and få in Swedish
other causative verbs + NP + Vnfin/Vfin
synthetic causative verbs
miscellaneous other constructions

These have roughly the same rank order in the two languages but their proportions differ somewhat. The English texts display a more frequent use of other
causative verbs and synthetic verbs, whereas the Swedish texts have more constructions of the miscellaneous type, especially finite causative variants. In
addition, the Swedish texts have a tendency to convert analytical B constructions into A constructions (with an adjective complement) in cases where a
copular verb of perception (feel, look) can be omitted.
In both languages the proportion of these types also differs in the source
texts and the translations. Broadly speaking, the source texts display a more
even distribution of the different alternatives, while the translations tend to
rely more on the two most common types, especially the analytical construction. This suggests that the latter are easier to use and more readily retrieved by
the translators of both languages.
In both languages the analytical construction is the most common alternative, even if it seldom accounts for more than 50% of the examples. It is
especially common in the translations, where it is not only used to render analytical counterparts in the source language but frequently replaces other
source constructions.⁴ In other words, the analytical construction tends to be
overused by the translators in both directions. In this respect the translators
behave much like the advanced Swedish learners. The difference is that while
the learners overuse the analytical construction in their L2, the translators do
it when they translate into their own language.
The reason for this is no doubt the different status of the causative options
and the restrictions that determine the choice between them. In terms of frequency alone, the analytical construction represents the prototypical or
unmarked choice in both languages. This, in turn, reflects linguistic conditions of various kinds. For example, synthetic verbs are not always lexically
available in either English or Swedish, and when they are, they are mainly used
as alternatives to analytical constructions with an intransitive verb complement. Constructions with causative verbs other than make and få are also
restricted, mainly because such verbs tend to have more specific meanings and
be subject to various selection restrictions. The use of the miscellaneous other
constructions revealed in the material is also constrained in various ways, for
example because they restructure the causative elements or because they tend
to be idiomatic or collocationally restricted. Many alternatives are also stylistically marked, being either more formal or informal than the analytical variant
(cf. induce sb to do sth and drive sb up the wall). By contrast, the analytical construction is linguistically and contextually unmarked and can therefore nearly
always be used: it is lexically unrestricted (always available), semantically more
general, grammatically more versatile, and stylistically more neutral than the
other alternatives.

Conclusions
The main contrastive conclusion that can be drawn from this study is that, on
the whole, English and Swedish provide a very similar range of causative
options. In both languages the dominant choice is the analytical construction
with equivalent high-frequency verbs (make and få), but there are also a number
of competing alternatives (other causative verbs, synthetic verbs and various
grammatically reorganised causatives), all of which tend to be lexically, grammatically or stylistically restricted and therefore more difficult for learners.
Overuse of a target structure can either be explained as the result of overgeneralisation of an L2 pattern (intralingual influence) or as the result of transfer from L1 (interlingual influence). As we have seen, the analytical construction can be regarded as the unmarked causative in English. It is therefore reasonable to assume that learners, even advanced learners, will tend to overgeneralise this construction at the expense of more marked alternatives. The
problem with this explanation is that French-speaking learners do not overuse
this construction in their L2 writing, as would be expected if intralingual influence had been the decisive factor. Consequently, we have to turn to transfer as a
more plausible explanation.
Transfer, too, can be linked to the notion of markedness. As we have seen,
the analytical construction can also be regarded as the unmarked form in
Swedish. According to Hyltenstam (1984: 43), learners are likely to substitute
unmarked categories from their native language for corresponding marked
categories in the target language, whereas marked structures are seldom transferred, especially when the corresponding target category is unmarked.⁵ This
prediction is clearly applicable to the Swedish learners' overuse of the analytical
construction in English.
However, transfer can also be explained by the concept of prototypicality
and by learners' judgements of the similarity between L1 and L2. What they
perceive as prototypical and semantically transparent in their L1 determines
what they transfer to their L2 (see Ellis 1994: 326 and Kellerman 1983, 1986).
This perception does not seem to be affected by their experience of or proficiency in L2, which would explain why advanced Swedish learners tend to
overuse the analytical construction, while French-speaking learners do not. To
Swedish learners the similarity between the prototypical causatives in English
and their L1 is obviously more striking than it is to French-speaking learners.
This ‘psychotypology’, to use a term from Kellerman, can also be expected
to retard second language development. Categories that are perceived as prototypical, unmarked or transparent are usually adopted early by learners and run
the risk of becoming linguistic ‘teddy bears’ that continue to be favoured in
later stages of the learning process at the expense of less common and more differentiated target alternatives (cf. Hasselgren 1994).
The contrastive picture that emerges from this study thus suggests that the
Swedish learners' overuse of causative make with verb complements is the
effect of transfer supported by cross-linguistic similarity. Learners who are
unfamiliar with less common causative alternatives in English are likely to
overuse the dominant target pattern and treat it as a lexico-grammatical ‘teddy
bear’, especially if it is easy to transfer from their native language.


Methodologically, the study has demonstrated two other things. One is the
usefulness of combining corpus-based interlanguage research with contrastive
investigations based on parallel corpora. As Granger (1996: 46) has pointed
out, results derived from learner corpora can only be reliably interpreted as
being evidence of transfer if supported by clear [contrastive] descriptions.
Such descriptions require empirical bilingual data from comparable corpora
or translation corpora. Even if it has not been possible to make a detailed investigation of the factors determining the choice between the causative options in
English and Swedish in the present study, the corpus has clearly revealed the
causative paradigms in the two languages and the degree of correspondence
between them. This is a good starting point for further contrastive research of
causatives in the future.

1. For a description of the International Corpus of Learner English (ICLE) and the methodology of corpus-based interlanguage research, see Granger (1993, 1998).
2. For a contrastive study of the A construction in English and Swedish, see Altenberg
3. An inspection of the A constructions in the corpus shows a corresponding drift in the
opposite direction: A constructions without a copular verb of perception in the Swedish
texts tend to be represented by B constructions with such verbs in the English versions (cf.
Altenberg forthcoming).
4. Despite the competition from other causative options in both languages, the analytical
constructions are often translated into each other: a calculation of their mutual translatability in the corpus shows a cross-linguistic correspondence of 52% (on this concept, see
Altenberg 1999).
5. Chinese ESL learners provide a good example of this. Chinese, being poor in derivational morphology, has no synthetic causative verbs. As a result, Chinese learners tend to
transfer the analytical Chinese shi ‘make’ construction to their L2, greatly overusing make
causatives in their English (see Wong 1983 and Juffs 1996: 152).
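
The mutual translatability mentioned in the notes above can be illustrated with a small calculation. The sketch below assumes that the measure is computed as the number of times two items are translated into each other, expressed as a percentage of their combined occurrences in the source texts; this is one reading of the measure discussed in Altenberg (1999), not a formula stated in the present chapter. The function name and the counts are hypothetical and are not the corpus figures.

```python
def mutual_correspondence(a_to_b: int, b_to_a: int,
                          a_source: int, b_source: int) -> float:
    """Percentage of source-text occurrences of items A and B
    that are translated by each other (assumed formula)."""
    total = a_source + b_source
    if total == 0:
        raise ValueError("no source occurrences to compare")
    return (a_to_b + b_to_a) * 100 / total


# Hypothetical counts, for illustration only (not taken from the corpus):
# A = English analytical make, B = Swedish analytical få.
mc = mutual_correspondence(a_to_b=40, b_to_a=60, a_source=90, b_source=110)
print(f"mutual correspondence: {mc:.0f}%")
```

Under this reading, a value of 100% would mean that the two constructions are always translated into each other, and 0% that they never are.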

Aijmer, K., Altenberg, B. and Johansson, M. (eds). 1996. Languages in Contrast. Papers from
a Symposium on Text-based Cross-linguistic Studies. Lund: Lund University Press.
Aijmer, K., Altenberg, B. and Johansson, M. 1996. Text-based contrastive studies in
English. Presentation of a project. In Aijmer et al. (eds) 1996: 73–85.
Altenberg, B. 1999. Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In Out of Corpora. Studies in Honour of Stig Johansson, H. Hasselgård
and S. Oksefjell (eds), 249–268. Amsterdam: Rodopi.
Altenberg, B. forthcoming. Advanced Swedish learners' use of causative make: A contrastive
background study. In Computer Learner Corpora, Second Language Acquisition and
Foreign Language Teaching, S. Granger, J. Hung and S. Petch-Tyson (eds). Amsterdam and
Philadelphia: Benjamins.
Altenberg, B. and Granger, S. 2001. The grammatical and lexical patterning of make in
native and non-native student writing. Applied Linguistics 22: 173–194.
Ellis, R. 1994. The Study of Second Language Acquisition. Oxford: Oxford University Press.
Granger, S. 1993. The International Corpus of Learner English. In English Language
Corpora: Design, Analysis and Exploitation, J. Aarts, P. de Haan and N. Oostdijk (eds),
57–69. Amsterdam: Rodopi.
Granger, S. 1996. From CA to CIA and back: An integrated approach to computerized
bilingual and learner corpora. In Aijmer et al. (eds) 1996: 37–51.
Granger, S. 1998. The computerized learner corpus: a versatile new source of data for SLA
research. In Learner English on Computer, S. Granger (ed.), 3–18. London and New
York: Addison Wesley Longman.
Hasselgren, A. 1994. Lexical teddy bears and advanced learners: a study into the ways
Norwegian students cope with English vocabulary. International Journal of Applied
Linguistics 4: 237–260.
Hyltenstam, K. 1984. The use of typological markedness conditions as predictors in second
language acquisition: The case of pronominal copies in relative clauses. In Second
Language: A Crosslinguistic Perspective, R. Andersen (ed.), 39–58. Rowley, Mass.:
Newbury House.
Johansson, S. 1998. On the role of corpora in crosslinguistic research. In Corpora and
Cross-linguistic Research: Theory, Method, and Case Studies, S. Johansson and S.
Oksefjell (eds), 3–24. Amsterdam and Atlanta: Rodopi.
Juffs, A. 1996. Learnability and the Lexicon. Theories and Second Language Acquisition
Research. Amsterdam and Philadelphia: John Benjamins.
Kellerman, E. 1983. Now you see it, now you don't. In Language Transfer in Language
Learning, S. Gass and L. Selinker (eds), 112–134. Rowley, Mass.: Newbury House.
Kellerman, E. 1986. An eye for an eye: Crosslinguistic constraints on the development of the
L2 lexicon. In Crosslinguistic Influence in Second Language Acquisition, E. Kellerman
and M. Sharwood-Smith (eds), 35–48. New York: Pergamon Institute of English.
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. 1985. A Comprehensive Grammar of the
English Language. London: Longman.
Song, J. J. 1996. Causatives and Causation. A Universal-Typological Perspective. London:
Viberg, Å. 1996. Cross-linguistic lexicology. The case of English go and Swedish gå. In
Aijmer et al. (eds) 1996: 151–182.
Wong, S. C. 1983. Overproduction, underlexicalisation, and unidiomatic usage in the
make causatives of Chinese speakers. Language Learning and Communication 2:


Contrastive Lexical Semantics

Polysemy and disambiguation cues across languages*
The case of Swedish få and English get
Åke Viberg


Languages are at the same time very similar and very diverse. At a fundamental,
cognitive level, there are certain similarities even between languages that are
genetically and geographically widely separated. Simultaneously, there are
often important semantic differences between cognates in closely related languages such as English go and Swedish gå (Viberg 1999a). Crosslinguistic lexicology (Viberg 1996) is concerned with this complex relationship of similarity
and divergence between languages at the lexical level. It combines and tries to
strike a balance between a number of approaches such as lexical universals
(Berlin & Kay 1969, Goddard & Wierzbicka 1994, Newman 1996), linguistic
relativity (Gumperz & Levinson 1996), lexical typology (Talmy 1985) and contrastive lexical analysis (Schwarze 1985). Crosslinguistic studies of the lexicon
are relevant also to applied fields such as the second language lexicon (Hatch &
Brown 1995, Singleton 1999) and the lexicon in machine translation (Dorr
1993, Wanner 1996).
This paper will be concerned in particular with the nature of multiple
meanings from a crosslinguistic perspective and with the interaction between
word meaning and linguistic context in the disambiguation process. Words
with multiple meanings are analyzed differently in various theoretical frameworks. The term ‘multiple meanings’ is intended to be neutral with respect to
the notions polysemy and homonymy. Polysemy is in general used to refer to
the case where the same word (lemma) is used with multiple meanings that
are somehow related, whereas homonymy is used to refer to the case where different words (lemmas) with totally unrelated meanings happen to be expressed
by the same form. Studies concerned with the polysemy of words are concerned with the principles for linking various meanings and with the explanation and motivation of the links.
A basic difference between theories concerns the nature of the primary
meaning (taken as a neutral term): whether it is an ideal case (prototype) from
which the other meanings represent deviations or of a more general or abstract
type, which in some sense covers all the others and is not necessarily realized in
pure form in any context. The former position has been predominant in more
recent times, in particular with reference to prototypical meaning (e.g.
Tsohatzidis (ed.) 1990, Taylor 1995, Geeraerts 1997). The second position was
taken by Roman Jakobson (1936) in his famous study of the Russian case system, in which he aimed at describing an invariant general meaning
(Gesamtbedeutung) which was independent of all the varying individual meanings (Sonderbedeutungen) induced by the contexts in which a case was used.
The general meaning of a case was primarily dependent on the oppositions
into which it entered with the other cases in the language system. More recently, a similar position has been taken by Pustejovsky (1995) and Poesio (1996)
with respect to lexical meaning. Recent theories of the latter type have introduced the term underspecification. Lexical representations are underspecified
with respect to the actual meanings of words which appear in actual text, where
meaning is filled in from the linguistic context.
The terms ambiguity and disambiguation often appear in contexts where
comprehension is the major concern and are in principle neutral with respect to
the distinction between polysemy and homonymy. When a listener or reader is
confronted with a word form with multiple meanings, it is not possible to
decide whether the word is homonymous or polysemous until the appropriate
meaning has been identified. In order to make the distinction, disambiguation
must already have been achieved. Disambiguation is also the first step in the
translation of a word with multiple meanings. This paper will mainly be concerned with the contrasting use of syntactic and semantic cues in the disambiguation process of words with multiple meanings which are the primary
translation equivalents across languages. The analysis is based on the assumption that the primary meaning could best be represented as a prototype, but the
problems with establishing a primary meaning and links to the various extended meanings will only be briefly discussed in this paper. (See Viberg 1999b for an
analysis along these lines of the polysemous Swedish verb slå 'strike, hit, beat'.)

Polysemy and disambiguation cues across languages

Multiple meanings are common in verbs, particularly the most frequent
ones. Among the 20 most frequent verbs in English, we find the four basic verbs
of possession have, get, take and give. With a concrete object such as a camera,
these verbs are readily interpreted as verbs of possession: Jane has a camera,
Mary gave Jane a camera, etc. Very often, however, these verbs have an atypical
object such as a headache or an idea: Peter has an idea, The noise gave Peter a
headache. Such uses are often referred to as abstract possession, but this is only
a cover term which conceals the problems involved in interpreting such expressions. There are also meanings which extend into other semantic fields such as
motion: Eve got up early in the morning, The plane took off. In addition, some of
these basic verbs have grammatical meanings such as Peter has left, Alexandra
has to go or Mary got Jane to collaborate. The verbs of possession, especially the
most frequent ones, are therefore a good testing ground for various approaches
to multiple meanings. In this paper, the Swedish possession verb få will be
compared with its closest equivalent in English, get, and, more briefly, with its
correspondents in French and Finnish. The analysis is based on translation
corpora, that is, corpora of original texts and their translations (Johansson 1998).
The availability of computerized translation corpora is likely to breathe new
life into the method known as comparison of translations. A major earlier
work B.C. (before computer corpora) is Wandruszka (1969), which is based on
60 publications in six Germanic and Romance languages. Wandruszka identifies Bally (1950) as the originator of the technique of comparing translations.

2. The Swedish verb få

2.1 A brief look at major meanings and syntactic frames
The comparison of Swedish and English that will be presented in this paper is
based on the complete English-Swedish Parallel Corpus, ESPC (Aijmer et al.
1996, Altenberg & Aijmer 2000), which contains original text samples in
English and Swedish together with their translations. The text samples represent both fiction and non-fiction and the total number of words from each
source language is about half a million. The distribution of the meanings and
syntactic frames of få (2043 occurrences in all) in the complete set of Swedish
originals in the ESPC is shown in Table 1. For comparison, the distribution in a
monolingual Swedish corpus, the Stockholm-Umeå Corpus (SUC 1997), is
also shown. The texts in the SUC represent a wide range of genres and were
chosen according to principles similar to the ones used for the Brown corpus.

The total number of words is also in the same range: 1 million. In this corpus,
there are 4588 occurrences of the verb få. 1009 of these were selected randomly
and coded for meaning and syntactic frame. In the following sections, the individual meanings of få will be briefly presented.
When the verb refers to Possession it has an NP as object with a concrete
noun as head. The NP can also consist of a pronoun referring to such a concrete object, but pronominalized NPs will not be dealt with here, since they involve problems of a general nature, which are not specifically related to verbs of possession.

Table 1. The major meanings of få in the English-Swedish Parallel Corpus (ESPC) and
in the Stockholm-Umeå Corpus (SUC). Proportion %: ESPC N=2043, SUC N=1009

Possession
  Syntactic frame: få + NP[Concrete]
  Example: Per fick en kamera. (lit. 'Per got a camera')
  English: Per got a camera.

Abstract possession
  Syntactic frame: få + NP[Abstract]
  Example: Per fick en idé. (lit. 'Per got an idea')
  English: Per got an idea.

Modal: Permission/Obligation
  Syntactic frame: få + VP[Infinitive]
  Example: Per fick sälja kameran. (lit. 'Per got sell camera-the')
  English: 1. Per was allowed to sell his camera. 2. Per had to sell his camera.

Inchoative
  Syntactic frame: få + VP[Infinitive] [V: se, höra, veta]
  Example: Per fick se en älg. (lit. 'Per got see an elk')
  English: Per caught sight of an elk.
  Proportion %: 4.5

Causative
  Syntactic frame: få + NP + att VP[Infinitive]
  Example: Per fick oss att skratta. (lit. 'Per got us to laugh')
  English: Per made us laugh.

Syntactic frame: få + Particle + NP
  Example: Per fick upp dörren. (lit. 'Per got up door-the')
  English: Per managed to open the door.

Syntactic frame: få + NP + ADJ[Result]
  Example: Per fick benen fria. (lit. 'Per got legs-the free')
  English: Per got his legs free.

Beneficiary/Maleficiary
  Syntactic frame: få + NP + Participle
  Example: Per fick bilen reparerad/stulen. (lit. 'Per got car-the repaired/stolen')
  English: Per got his car repaired/stolen.

Various other alternatives

The NP can also have an abstract noun as head as in Per fick en idé 'Per got
an idea'. This case is referred to as abstract possession which, as mentioned
above, is only a cover term for a number of problematic cases, some of which
will be commented on in greater detail in Section 3.2. In any event, the interpretation in this case is closely related to the meaning of the abstract noun.
There are several frequent uses where få is combined with an infinitive.
The most important cases are the ones where the infinitive is bare (without the
infinitive marker att). With one important exception mentioned below, få in
this construction expresses deontic modality, but the interpretation is
ambiguous in principle and can either be permission or obligation. Certain
main verbs may strongly suggest one of the two alternative interpretations,
but the choice is ultimately motivated by pragmatic factors. The modal meaning of få is always deontic (root-, agent-oriented), never epistemic (possibility
or certainty). When the following verb is one of the two perception verbs se
'see' or höra 'hear' or the cognitive verb veta 'know', there is an alternative
interpretation which is more frequent than the modal ones. In combination
with these verbs, få usually has an inchoative sense, even if the modal interpretations are still possible.
When få is combined with an NP as object followed by an infinitive combined with the infinitive marker att, the interpretation is causative: Per fick oss
att skratta 'Per made us laugh'. In this use, the subject of få may be agentive,
which is an excluded reading in the constructions discussed earlier. The agentive interpretation is more or less obligatory when få is combined with a spatial
particle such as upp 'up', ut 'out' or in 'in': Per fick upp dörren 'Per managed to
open (lit. got up) the door'. The meaning of få in this type of construction is
close to 'succeed'. It implies an active attempt on the part of the subject. The
same applies when there is a resultative adjective as complement: Per fick benen
fria 'Per got his legs free'.
When the object of få is combined with a past participle, the subject of få is
psychologically affected by the outcome of the event described by the participle
and can be interpreted as a Beneficiary or Maleficiary: Per fick bilen reparerad/stulen
'Per got his car repaired/stolen'.
What is notable about få is the extent to which the syntactic frame can be
used as a cue for disambiguation except in cases involving the distinction
between concrete and abstract possession or between the two modal meanings
of permission and obligation.
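
The role of the syntactic frame as a disambiguation cue can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the frame labels and the names FRAME_TO_MEANINGS, candidate_meanings and is_disambiguated_by_frame are hypothetical, the label 'success (active attempt)' paraphrases the description above rather than quoting a term from Table 1, and the frame is assumed to have been identified already (for instance by manual coding).

```python
# Toy lookup from the syntactic frame of the verb to its candidate meanings.
# All frame labels are hypothetical identifiers chosen for this sketch.
FRAME_TO_MEANINGS = {
    "NP": ["concrete possession", "abstract possession"],
    "bare_infinitive": ["permission", "obligation"],
    "bare_infinitive_se_hora_veta": ["inchoative", "permission", "obligation"],
    "NP_att_infinitive": ["causative"],
    "particle_NP": ["success (active attempt)"],
    "NP_result_adjective": ["success (active attempt)"],
    "NP_participle": ["beneficiary/maleficiary"],
}


def candidate_meanings(frame):
    """Return the meanings compatible with a frame; unknown frames map to 'other'."""
    return FRAME_TO_MEANINGS.get(frame, ["other"])


def is_disambiguated_by_frame(frame):
    """True when the frame alone leaves exactly one candidate meaning."""
    return len(candidate_meanings(frame)) == 1


if __name__ == "__main__":
    for frame in FRAME_TO_MEANINGS:
        print(frame, candidate_meanings(frame), is_disambiguated_by_frame(frame))
```

In this sketch only the NP frame and the bare-infinitive frames return more than one candidate, which mirrors the exceptions noted above; choosing among those candidates would require semantic or pragmatic information that the lookup does not model.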

2.2 Major translation equivalents in English, French and Finnish

In Table 2, representative examples of the major meanings of få are given
together with their translations into English, French and Finnish. The Swedish
and English versions are taken from the ESPC, whereas the other versions are
extracted from a small corpus prepared by the present author. The source of the
examples from the ESPC is indicated by a text code. For an explanation of these
codes, see Altenberg et al. (1999). The examples are taken from Ingmar
Bergman's (1987) autobiography Laterna magica (IB) and the novel Blackwater
(KE) by Kerstin Ekman (1993) and their translations into the languages mentioned above. When få is used in its basic meaning involving receiving a concrete object as possession, the most common equivalent in English is get and in
Finnish saada 'get, receive'. The most common translation into French is some
form of avoir 'have'. With an abstract noun as object such as kväljningar 'nausea', a change of construction is relatively frequent in the translations. What literally means 'He got nausea' in Swedish is rendered as He felt sick in English,
'A nausea arose in him' in French and 'Him (partitive case) began nauseate' in
Finnish. When få is used as a modal verb indicating either Permission as in (3)
or Obligation as in (4), English and French generally use modal verbs as translations, whereas Finnish to a great extent uses saada, the primary equivalent
even in contexts involving concrete possession.

Table 2. Translations of få associated with various meanings

1. Concrete Possession
  Swedish: Nu kommer det här med kinematografen. Det var min bror som fick den. IB
  English: That was when the cinematograph affair occurred. My brother was the one who got it.
  French: Alors arrive cette histoire du cinématographe. Le cinématographe c'est mon frère qui l'a eu.
  Finnish: Nyt on tämän kinematografiasian vuoro. Kojeen sai minun veljeni.

2. Abstract possession
  Swedish: Han fick kväljningar.
  English: He felt sick.
  French: Une nausée monta en lui.
  Finnish: Johania alkoi

3. Modal: Permission
  Swedish: Annie visste inte ens om hon fick lämna köket.
  English: Annie didn't even know if she was allowed to leave the kitchen.
  French: Annie ne savait même pas si elle pouvait quitter la cuisine.
  Finnish: Annie ei tiennyt edes, saiko hän lähteä pois keittiöstä.

4. Modal: Obligation
  Swedish: Så fick Annie sätta sig i en fåtölj vid kaffebordet. KE
  English: So Annie had to sit down in an armchair by the coffee table.
  French: Annie dut donc s'asseoir dans un fauteuil près de la table basse.
  Finnish: Annie siis sai luvan istua nojatuoliin

5. Inchoative
  Swedish: Men jag får veta det snart. KE
  English: But I'll find out soon.
  French: Mais je le saurai
  Finnish: Mutta saan kyllä

6. Causative
  Swedish: Nånting – var det en lukt? – fick honom att tänka på fisk. KE
  English: Something – was it a smell? – made him think of fish.
  French: Quelque chose – fut-ce une odeur? – le fit penser à du poisson.
  Finnish: Jokin – hajuko ehkä? – sai hänet ajattelemaan kalaa.

7. Beneficiary/Maleficiary
  Swedish: nu har jag fått fyra nya Hakkapeliitta stulna i natt. KE
  English: I had four new Hakkapeliitta tyres stolen last night.
  French: Cette nuit, on m'a volé quatre pneus Hakkapeliitta tout
  Finnish: multa on viime yönä viety neljä uutta

The inchoative meaning of få appearing with certain mental verbs as in (5)
is often left unexpressed, or signalled by the choice of an inchoative mental
verb instead of a stative one (e.g. find out instead of know in English). The use of
få as a periphrastic causative is exemplified in (6). In this case the most frequent
equivalent in English is make and in French faire 'make'; in Finnish saada is also
used to express this meaning. In (7), the construction with an object followed
by a past participle is shown.
In the comparison of original and translated texts, the word in the translated text which corresponds to a specific instance of the word under discussion in
the source text will be called a translation. The term translation equivalent
will be used in a more restricted sense. The most frequent translations of få into
English are shown in Table 3, where the English verbs have been classified into
semantic fields according to their basic meaning, which means that extended
uses are included in the counts except for have to, which is counted separately.
As can be observed, the two major fields are Possession and Modality. When få
is used as a periphrastic causative, the most common translation is make. The
major meanings of Swedish få are thus reflected in the basic meaning of the
most frequent translations.
In total, there are 2043 occurrences of få in the Swedish originals. The most
frequent translations in English shown in Table 3 below account for 60.3% of
all translations. In addition, there is a considerable number of other translations which occur only a few times (many occur only once). The most frequent
English translation is get, which is the closest semantic equivalent of få in its
prototypical meaning (see 3.1), but it occurs in only 11.7% of the cases.
Another frequent translation is have as a main verb, which accounts for 7.6%. If
the occurrences of have to, which reach 6.6%, had not been accounted for separately, have would actually be the most frequent translation.

Table 3. The most frequent translation equivalents of få in English

have to
be allowed
Various other alternatives

A similar survey of the most frequent translations into French is given in
Table 4, based on extracts from six Swedish novels.
Table 4. The most frequent translations of få in French (based on extracts from 6 Swedish
novels)

'have'          58
'give'          19
'receive'       15
'find'          11
'offer'          5
'take'           5
faire 'make'    17
Various other alternatives

* including être obligé

The most frequent translations represent 47.4% of all the translations. The
most notable result is that få does not have any direct equivalent in French even
in its prototypical use as a possession verb. The use of avoir 'have' (usually in a
perfective form) to indicate a change of possession represents a semantic extension of the prototype. The verbs recevoir 'receive' and obtenir 'acquire' correspond semantically more closely to få but do not have a frequency in French
which is comparable to that of få in Swedish, even if only the uses as a possession verb are counted.
The pattern of polysemy which distinguishes få has an interesting areal distribution. The Norwegian cognate få shares most of the meaning patterns with
the Swedish verb, whereas the Danish cognate få has similar uses only as a verb
of possession. The closest correspondent of få as a modal verb in Danish is må,
a cognate of English may. Interestingly, Finnish has the etymologically unrelated verb saada, which has a pattern of polysemy that closely resembles the
Swedish and Norwegian one. As seen in Table 5, saada is the only translation
that reaches a relatively high frequency. It represents 50.8% of the translations,
which is a high percentage for a verb with a complex pattern of polysemy.
Table 5. The most frequent translation equivalents of få in Finnish (based on extracts
from 6 Swedish novels)

'ought to'
Various other alternatives


3. Individual meanings of få
In this section, the individual meanings of få will be discussed in greater depth
although many problems have to be dealt with rather cursorily since the verb
cuts across a number of complex semantic areas, such as possession, causation
and modality, which have been studied intensively in recent years. The major
translations of the various meanings of få will also be presented. In this case,
too, it will only be possible to present a broad outline. There is an earlier study
of få by Wagner (1976) which forms part of a contrastive analysis of modal
verbs in Swedish and German.

3.1 Possession
The concept of Possession is complex and can only be discussed briefly in this
paper. Even in the most straightforward case, where the object is concrete, possession can be construed in various ways. Miller & Johnson-Laird (1976: 565)
use the following example to illustrate this: He owns an umbrella but she's borrowed it, though she doesn't have it with her (see also Heine, 1997, on possession).
Using partly different terms than Miller & Johnson-Laird, we can say that He
owns an umbrella refers to Ownership, whereas she borrowed it refers to
Temporary Possession. Ownership presupposes certain socially regulated rights
to use an object which is regarded as the property of a certain individual. These
rights can be transferred permanently (e.g. as a gift) or temporarily (e.g. as a
loan). The exact social norms motivating the lexicalization patterns are complex
and vary a great deal between different cultures. The last part of the example, she
doesn't have it with her, refers to Physical Possession. Availability for immediate
use seems to be the crucial notion behind this meaning. In the prototypical case,
Possession involves both Ownership and Physical Possession, which can be
combined as in the traditional text-book example: Peter gave Mary an apple (in
her hand, which she could keep). Temporary Possession is a possible but
marked interpretation with a verb such as give (Peter gave Mary a book as a loan.)
When få refers to some aspect of concrete possession, the translation is predominantly a verb of possession. In English the most frequent translations of this
meaning are (absolute frequency within parentheses): get (74), have (51), give (33),
receive (19), acquire (7) and obtain (4). Together these verbs account for 72.3% of
the total number of occurrences of få as a concrete verb of possession (N=260).
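
The percentage can be checked directly from the absolute frequencies just given. The short Python sketch below merely redoes the arithmetic; the variable names are illustrative, and nothing beyond the six frequencies and N=260 is taken from the corpus.

```python
# Absolute frequencies of the six most frequent English translations
# of the verb in its concrete possession sense, as listed in the text.
counts = {
    "get": 74,
    "have": 51,
    "give": 33,
    "receive": 19,
    "acquire": 7,
    "obtain": 4,
}
total_concrete = 260  # N for concrete possession, as stated in the text

covered = sum(counts.values())            # occurrences covered by the six verbs
share = 100 * covered / total_concrete    # percentage of all concrete uses

print(f"{covered} of {total_concrete} occurrences = {share:.1f}%")
# prints "188 of 260 occurrences = 72.3%"
```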
When give is used as a translation, it usually appears in the passive. The
French verb donner 'give' is also a rather frequent translation but generally
appears in the active form with the generic subject on 'one' as in the following
example (Finnish, as is usually the case, uses saada 'get, receive'):
(1) Swedish: Jag fick varm choklad och smörgås med ost. (IB)
    English: I was given hot chocolate and cheese
    French: On m'a donné du chocolat et une tartine avec du
    Finnish: Sain kuumaa kaakaota ja

The passive is in fact used in the translations with a rather wide range of (mostly) verbs of possession such as be granted, be handed. The frequent use of passives in the translations is a reflection of the fact that få is inherently non-agentive (cf. 4.1 concerning get).

3.2 Abstract possession
All verbs belong to a small number of dynamic classes which form a Dynamic
System that cuts across all verbal semantic fields. In essence, a verb can either
designate a state (no change) or a change, for example know (State) vs. realize
(Change) or have, own (State) vs. get, lose (Change). Changes can either be
inchoatives, which means they are pure changes without any indication of the
cause, or causatives, which indicate a cause. Compare Harry died (Inchoative)
and Peter killed Harry (Causative) or Harry lost his camera (Inchoative) and
Peter stole the camera from Harry (Causative). Within a language, there are a
number of ways to form complex (surface) predicates which fulfill the same
function as a simple verb and in several cases can be used to paraphrase simple
verbs (usually with some change in meaning). One such device is the use of
Verb + Abstract Noun instead of a simple verb: ask => put a question, visit =>
pay a visit to, etc. In Swedish and English, the most basic verbs of possession
meaning 'have', 'get' and 'give' in combination with abstract nouns form a very
productive system generating complex predicates which represent states,
inchoatives or causatives. The same dynamic contrasts are basic when complex
predicates are formed with adjectives: be, become and make. Sometimes it is
possible to form a complete set of parallel predicates as shown schematically in
Table 6 below (taken from Viberg 1981), in which various ways to form emotive predicates related to happiness are shown. To express the inchoative, for
example, it is possible to say either X fick glädje 'X got happiness', X blev glad 'X
became (got) happy' or to use a passive (gladdes) or reflexive (gladde sig) form
of the verb glädja, which in its basic form has a causative meaning.
Table 6. Basic possession verbs as dynamic operators with an abstract noun as object.

Word class     State                 Inchoative                     Causative
Noun           X hade glädje av Y    X fick glädje av Y             Y gav X glädje 'give happiness'
Adjective      X var glad (åt Y)     X blev glad (åt Y)             Y gjorde X glad 'make happy'
Emotion verb                         X gladdes/gladde sig (åt Y)    Y gladde X

Even though the discussion in this paper will be focused around the use of
få and get, it is important to stress that the use of these verbs to form complex
inchoative predicates in combination with abstract nouns is part of the more
general pattern involving have and give. In spite of the fact that this use is very
productive, there are obviously restrictions as to which abstract nouns can
appear in such combinations. A much larger corpus than the one used in this
study is required to pin down these restrictions. However, it is possible to identify certain semantic fields of abstract nouns which are frequent in such combinations. One distinct group is constituted by the nouns belonging to the field
Physical Contact such as a blow, a punch, a kick. Such nouns can be combined with the basic possession verbs (except the stative have) to form complex predicates in all four languages considered here:
(2) Swedish: Han sparkade och han bet en av dem i armen och fick ett slag i nacken. (KE)
    English: (kicking out, biting one of them in the arm, and received a blow on the back of his neck.
    French: Il donna des coups de pied et en mordit un au bras, et reçut un coup sur la nuque.
    Finnish: Hän potki ja hän puri jotakuta kasivarteen ja sai iskun niskaansa.

The translations use parallels to the Swedish construction although the closest
equivalent of få appears only in the Finnish translation with the verb saada.
Even if such expressions exist in all four languages, a simple verb of Physical
Contact in the passive form is a common translation. The French translation
has an active form with the generic subject on 'one' (cf. 3.1):
(3) Swedish: Jag avstängdes från skolgång och fick mycket stryk. (IB)
    English: I was removed from school and severely
    French: On m'a renvoyé de l'école et on m'a beaucoup battu.
    Finnish: keskeytettiin ja minua kuritettiin ankarasti.

The largest group appears to be nouns of Verbal Communication such as
order, offer, promise or answer as in the following example:
(4) Swedish: Åke frågade på nytt om men fick samma svar.
    English: Åke again asked about the rake handle, but was given the same
    French: Åke posa une nouvelle fois une question sur le manche de râteau mais obtint la même
    Finnish: Åke kysyi mutta sai saman
Even if be given corresponds to få in the English translation of this particular
example, get an answer is also possible in English. French uses obtenir 'obtain'
as a translation, while Finnish uses the primary equivalent saada. At a general
level, the languages are rather similar with respect to the formation of complex
predicates of verbal communication using basic possession verbs and abstract
nouns, even if important contrasts can be found with respect to specific combinations of Verb + Abstract Noun. Verbal communication verbs in the passive
form are quite frequent in the English translations, for example:
(5) Kungens tjänare och de som tjänade andra stormän med häst och rustning fick här löften om stora privilegier. (AA)
    The King's servants and those who served other great men of the realm with horses and weapons were promised great privileges,

The following list contains a number of similar cases, most of them representing Verbal Communication:
få + N[abstr]      Verb[passive]

få besked          be advised
få namnet          be named
få nej 'no'        be refused
få kritik          be criticized
få beröm           be praised
få löfte           be promised
få tillstånd       be allowed
få stöd            be supported
få tröst           be consoled
få besök           be visited
få stryk           be thrashed, be beaten

In addition to examples of this kind, expressions with a verb of possession in
the passive form followed by an abstract noun are also found:
få + N[abstr]      PossVerb[passive] + N[abstr]

få råd             be given advice
få order           be given orders
få fria händer     be given a free hand
få uppgift         be given a task
få tillstånd       be granted permission

There are also a number of cases where the translation involves a total change
of grammatical roles. Such examples are sometimes used even when there is a
normal, direct translation. The following example, which represents the normal case, involves an abstract noun from the field Cognition, which is also
fairly frequent in the material:
(6) Medan han sakta promenerade genom de smutsiga gatornas snöslask fick han en idé. (GT)
    While he slowly strolled through the slush in the dirty streets, he got an idea.

Besides this more or less direct translation of få en idé as get an idea, there is also a
translation such as the one found in the following example, where I got the idea is
translated as The idea came to me. This is an example of a change from
Experiencer as subject to Stimulus as subject which is characteristic of Mental
Verbs in general:

(7) Jag fick tanken tidigt på morgonen (AP)
    The idea came to me in the morning.

Various other examples of radical change of role structure in the translations
are given below. It appears that such changes are particularly frequent when
Abstract Possession is involved, although they cannot be said to reach a very high
frequency even in this case. The following example literally reads something
like 'Sweden got a changed military situation':
(8) Efter förlusten av Finland fick Sverige ett helt förändrat militärt läge. (AA)
    After the loss of Finland Sweden's military situation was completely

In the next example the literal translation of the original is 'She got increased
blood pressure from rhododendrons':
(9) Hon avskydde allt som var spikrakt i trädgårdssammanhang och fick förhöjt blodtryck av rhododendron och silvergranar. (ARP)
    She loathed anything dead straight in gardens, and rhododendrons and silver spruce made her blood pressure

3.3 Modal
The verb få in combination with a verb in the infinitive generally has a modal
meaning. Få signals primarily what van der Auwera & Plungian (1998) analyze
as participant-external modality of the deontic type (deontic possibility = permission and deontic necessity = obligation). Following the interesting proposals in Winter & Gärdenfors (1995), the external power could either be a participant of the speech situation or a third party. These distinctions are expressed in
subtle ways, as for example in the interaction between the choice of pronouns
and sentence mood: Får jag gå? 'May I leave?' (listener in power), Du får gå 'You
may leave' (speaker in power) vs. Jag får gå 'I may leave' (third party in power).
The translation corpus is particularly well suited to studying the contrast
between Permission and Obligation, since this distinction usually requires different translations. Which alternative applies is a pragmatic question. An
example like Han fick åka hem 'He få-PAST go home' can be translated either as
He was allowed/could go home or He had to go home depending on the context.
In an example like Han hoppades få åka hem 'He was hoping to be allowed to go
home' Permission (or perhaps Possibility) is involved. In the translation corpus, it is possible to find examples which come close to minimal pairs. In the
following example, Obligation is the correct interpretation and this is also
reflected in the English translation. The passage is taken from a novel (P. C.
Jersild, Babels hus 1985) and describes what happens when someone arrives at
a hospital. The presupposition is that someone who feels ill wants to stay at the
hospital.
(10) Den som inte är sjuk är följaktligen frisk och får åka hem igen.
     The person who is not ill is consequently well and has to go back home.

In the following example taken from the same novel, another patient wants to
leave the hospital after an operation. In this case, Permission is the appropriate
interpretation, which is reflected in the translation:
(11) Han skulle förmodligen snart få åka hem.
     He would presumably be allowed [to go] home soon.

The ambiguity is quite obvious to native speakers of Swedish. For example, if a
parent happens to tell the children to keep quiet using the phrase Nu får ni hålla
tyst! 'Now you must(/may) keep quiet', the children are likely to answer Får vi?
'May we?' (with stress on får 'may' and mockingly surprised intonation).
Intuitively, Permission appears to be the default interpretation, even if the children are well aware of the intended meaning in the preceding example.
Although both Permission and Obligation are frequent as meanings of få,
Permission appears to be most frequent. In legal texts, where ambiguity is not
tolerated, få can only express Permission (at least in the present corpus).
The semantic relations are set out in Figure 1 along the lines of Langacker's
(1988) usage-based model. As a modal, få has Permission as a default interpretation (symbolized by a box with double lines), which can be extended to cover
Obligation (box enclosed in single lines and semantic extension symbolized by
a broken arrow). What both of these meanings have in common is that some
external actor is in power, usually a human agent with social authority, but
power could also reside in some other actor such as a natural force: Vi fick gå
hem på grund av regnet 'We had to go home because of the rain'. This more
schematic meaning which is shared by Permission and Obligation is symbolized as a box with broken lines. It is related to the more specific meanings via
specialization (unbroken arrow).

Figure 1. Schematic network representing the modal meanings of få (schematic node: External power)

The contrast between Permission and Obligation is usually clearly reflected in the English translations. The major translations of modal få are shown in
Table 7. When the interpretation is Permission, the major translations are be
allowed to and can as in the following examples:
(12) Får man meta i sjön, sa han. (SC)
     Are you allowed to fish in the lake? he asked.

(13) Jo, sa Pettersson. För tio kronor får ni hyra roddbåten. (SC)
     Yes, said Pettersson. You can hire the rowing boat for ten kronor.

In legal Swedish, få seems to be used exclusively in the Permission sense. The
major translation in such texts is may, which is actually the most frequent
translation of få in the Permission sense, but it is primarily found in this text
type, which is the motivation for regarding can and be allowed to as the major
translations in this sense. May is a domain-specific translation, dominating in
legal language from which the following example is taken:
(14) Visering får begränsas även i övrigt och får förenas med de villkor som kan behövas. (UTL)
     The issue of a visa may be restricted in other respects and may be subject to such conditions as may be necessary.

Table 7. English translation equivalents of få as a modal verb

Permission
  be allowed to
  must (negated)
  should (negated)
  Various other cases

Obligation
  have to
  Various other cases

* Predominantly in legal texts
** An instance is counted as ZERO only when få has been specifically omitted. Cases where
få is contained in a longer passage which has been omitted are marked 'untranslated passage'
in the original coding (included under Various other cases in the tables).


The most common translations account for 61% in the Permission sense and
60% in the Obligation sense. The dominant translation of få referring to
Obligation is have to, but must is also used to some extent:
(15) Det var så djup snö att han fick leda cykeln sista biten. (AP)
     The snow was so deep he had to push the bike the last bit.
(16) Jag kan inte påstå att jag tyckte om att höra Siiri vräka ur sig
     detta, men man får komma ihåg att hon var upprörd. (AP)
     I can't say I liked hearing Siiri pouring all this out, but you must remember
     that she was upset.

When få is negated, there is a well-known difference between Swedish and
English, which is mentioned in most school grammars of English. In Swedish,
negated permission is expressed as 'X is not permitted to do S', for example
Peter får inte röka 'Peter is not allowed to smoke', whereas in English negated
permission is rather expressed as an obligation not to do something: Peter must
not smoke = 'Peter is obliged not to smoke'.
(17) Här får du inte rensa, Aron! (GT)
     You must not weed here, Aron!

A negated form of should is also relatively frequent as a translation of negated få:

(18) Slåttern fick inte ta en timme. (SC)
     Hay-making shouldn't take an hour!

In legal texts, negated permission is translated by may not:

(19) En utlänning får inte hållas i förvar med stöd av 2 § första stycket 2 längre tid än 48 timmar. (UTL)
     An alien may not be detained pursuant to Section 2(1)(2) for more than 48 hours.

In a relatively large number of cases, marked as ZERO in Table 7, få as a modal
verb is not translated, which may be taken as a sign that it sometimes has a
rather weak modal force in Swedish, as in the following example:
(20) I morgon ska det bli skönt att få tala med Stanley. (LH)
     Tomorrow it will be a joy to talk to Stanley.


3.3 Inchoative
When the main verb is one of the perception verbs se 'see' or höra 'hear' or the
cognitive verb veta 'know', få usually has an inchoative reading, as in the following example with se 'see':
(21) Då kom Alfrida, mora hans, ut och fick se att han stod där (TL)
     Then his mother, Alfrida, came out and saw him standing there

Once one has the possibility of seeing something because it comes within the
field of vision, one usually also sees it. The correlation between the inchoative
meaning and the combination of få with the perceptual/cognitive verbs se, höra
and veta is not total. Occasionally, få can have a modal meaning even in combination with these verbs.
The most common translation is Zero, i.e. få does not have a translation, as
in the following English and Finnish translations. In the French example, the
verb voir 'see' appears in a perfective form which signals the inchoative meaning:
(22) Utanför Lill-Olas bod stod det bilar och när Åke fick se att det var öppet
     ville han in och köpa nya flugor. (KE)
     English: There were cars outside Lill-Ola's fishing-tackle booth and when Åke
     saw the shop was open, he said he wanted to get some more flies.
     French: Åke vit que c'était
     Finnish: [...] kun Åke näki olevan auki

The major English translations are displayed in Table 8. A fairly frequently used
possibility is that of changing the main verb into another main verb which
incorporates an inchoative meaning. The choice of translation depends on the
main verb. Få se may be translated catch sight of, whereas få veta can be translated as find out or learn, be told. The last two alternatives can also translate få höra.
Table 8. English translation equivalents of få + VInfinitive [Cogn, Perc]

find out (få veta)
be told (få veta, få höra)
catch (a glimpse, sight of)




3.4 Causative
When få appears in the syntactic frame få + NP + att VPInfinitive, it has a
causative meaning (see also Altenberg, this volume). The most common translation is make in English and faire 'make' in French, whereas Finnish in most
cases uses the general equivalent saada:
(23) Han var vid sitt lynne [...] och fick alla att skratta. (IB)
     English: He was in his most merry mood [...] and made everyone laugh.
     French: Il était de son humeur la plus joyeuse [...] et faisait rire tout le monde.
     Finnish: Hän oli hilpeimmällä tuulellaan [...] ja sai kaikki

As can be observed in Table 9, which gives a summary of the most frequent
translations in English, the most common of these in the causative use is make,
but get and cause are also used with some frequency.
Table 9. English translation equivalents of få as a periphrastic causative

Various other cases




3.5 Success and related senses

Usually få has a non-agentive (non-intentional) subject but there are a few syntactic frames where the subject has a strong tendency to be agentive. Actually,
an agentive interpretation is possible but not very frequent when få is used as a
periphrastic causative, as in the following example, where the matrix verb
försöka 'try' explicitly signals intention:
(24) Jag förstod att de inbillade sig att världen var som lärarna eller
     föräldrarna försökte få dem att tro: utan hemligheter. (AP)
     I realized they all imagined the world was just what the teachers and parents
     tried to get them to believe: with no secrets.

Non-agentive readings are common even when the subject is human in the
frame få + NP + att VPInfinitive. In addition, inanimate subjects which do not allow
an agentive interpretation are fairly frequent in this construction. There is,
however, a set of syntactic frames where the subject has a strong tendency to be
agentive. Most of these frames have a low frequency. The most frequent frame
of this type, which accounts for almost 5% of the occurrences of få, is få +
Particle + NP:

(25) Med ett mjukt ryck fick roddaren upp farten igen. (KE)
     With a soft jerk, the oarsman got up speed again to keep it at a distance.

An example like this one implies that the subject made an active attempt to
achieve something and succeeded. In Table 1, this meaning is labelled Success. The
attempt is intentionally controlled, but whether the attempt succeeds or not
cannot be controlled, and a further implication is that the act required a greater
than usual amount of effort or skill. A sentence such as Peter fick upp dörren, which
literally means 'Peter got up the door', should be translated as Peter managed to
open the door or Peter got the door open rather than simply Peter opened the door,
which has the straightforward equivalent Peter öppnade dörren in Swedish.
Although, strictly speaking, a human agent can never control all the conditions
that affect the outcome of a certain intended act, we normally take it for granted
that a simple act like opening a door will succeed. Attempt and success are invoked
only when the outcome is in some sense unlikely or problematic.
The intentional reading is often explicitly signalled by other linguistic cues
in the immediate context. No less than 17 (out of a total of 95) examples appear
in the wider frame för att VPInfinitive '(in order) to VPInfinitive', which marks intention. In 10 cases, a verb marking Attempt or Success/Failure appears in a matrix
clause governing få.
Although the majority (81%) of the examples of få + Particle have the
Intentional Success reading, there are some clear exceptions. In most of them få +
Particle serves as a Mental predicate. The most frequent cases (9 examples) consist
of the phrase få för sig att S 'get the (wrong or weakly motivated) idea that S':
(26) I samma veva fick en del personer för sig att de måste
     informera Nora om mamma och pappa. (MG)
     Then a number of people got it into their heads that they had to inform
     Nora about her mother and father.

The most common translation of få in combination with a particle is get, which
accounts for 35% of the translations. That is a higher proportion than for most
of the other uses of få and this indicates that get also has a strong association
with Success (or Human Interest in a more general sense). The rest of the translations consist of a wide range of verbs, many of which are characterized as taking an agentive subject.
Attempt and Success are also involved in most occurrences of få in the
frame få + NP + ADJresult, in which få is combined with an object followed by a
resultative adjective. This use is, however, infrequent, which makes the generalization tentative.


(27) Sedan gäller det bara att komma på hur vi skall få hunden fri. (PCJ)
     Then it's only a matter of how we'll set the dog free.

The appearance of a PP (usually spatial) after the object often serves as an extra
cue for the interpretation and may change the meaning in various directions.
In certain cases, it has an effect similar to that of a spatial particle and introduces a success interpretation:
(28) Men Birger fick honom på benen. (KE)

Birger got him upright.

A use related to the ones discussed earlier in this section is få in the frame få +
NP + Participle. The NP in this frame serves semantically as an object of the
verb in past participle form, whereas the subject of få has an interest in the outcome and could best be characterized as an Experiencer of either benefit or
harm (Beneficiary or Maleficiary). Usually, the subject is simultaneously the
Possessor of the object. In the following two examples, the subject is a
Beneficiary (the second example literally means 'get a prototype financed'):
(29) Medan de uppfinnare inte har den upplevelsen som strävat flera
     år för att få sin senaste idé accepterad och som äntligen fått
     ett erbjudande av att få en prototyp finansierad. (BB)
     An inventor, who has been struggling for several years to have his/her latest
     idea accepted and who has finally got an offer to get financing for his/her
     prototype based on this idea, may not have the same feeling of uncertainty.

The following is a clear example of the subject as Maleficiary:

(30) jag minns inte vad jag företog mig, antagligen klättrade jag på
     hyllor och hängde i krokar för att slippa få tårna uppätna. (IB)
     I don't remember what I did, probably climbed on to shelves or hung from
     hooks to avoid having my toes

The combination of Beneficiary and Maleficiary meaning is well-known from
crosslinguistic studies of the Dative. Like the Success reading, it focuses on the
Human Interest domain.
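
The frame-to-reading correspondences described in Sections 3.3 to 3.5 can be summarized as a simple lookup. The following is a minimal sketch in Python, not part of the original study: the frame labels are plain strings invented for this illustration, and the function returns only the default reading suggested by the frame. As noted above, the defaults have exceptions (for instance få för sig att, or the occasional modal reading with perception verbs), and the frame alone cannot separate Permission from Obligation.

```python
from typing import Optional

# Perception/cognition verbs that favour the inchoative reading of få + infinitive.
PERCEPTION_COGNITION_VERBS = {"se", "höra", "veta"}

# Default reading per syntactic frame, following the descriptions in the text.
# The frame labels are hypothetical strings chosen for this sketch.
FRAME_CUES = {
    "få + NP + att VPInf": "Causative",
    "få + Particle + NP": "Success",
    "få + NP + ADJresult": "Success",
    "få + NP + Participle": "Beneficiary/Maleficiary",
}


def default_reading(frame: str, main_verb: Optional[str] = None) -> str:
    """Return the default reading suggested by the frame (illustrative only)."""
    if frame == "få + VInf":
        if main_verb in PERCEPTION_COGNITION_VERBS:
            return "Inchoative"
        # The frame alone does not separate Permission from Obligation;
        # that choice depends on further contextual cues.
        return "Modal (Permission or Obligation)"
    return FRAME_CUES.get(frame, "Unresolved")


if __name__ == "__main__":
    print(default_reading("få + VInf", "se"))
    print(default_reading("få + VInf", "gå"))
    print(default_reading("få + Particle + NP"))
```

A lookup of this kind only narrows down the range of readings; the remaining ambiguity has to be resolved by cues from the wider context.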

4. English get
The verb get is particularly versatile with respect to the number of clause types
it can enter into (Quirk et al. 1985:720) and it also has a large number of senses
which are both lexical and grammatical. There are several detailed studies, two
of which will be mentioned here. Johansson & Oksefjell (1996) focus on the
verb's constructional flexibility and also account for its distribution in various
text categories. Get turns out to be particularly characteristic of spoken English
and of less formal fiction, whereas it is underrepresented in informative prose.
Gronemeyer (1997, 1999) is centered around the polysemy of the verb and also
accounts for the diachronic development of its meanings from Middle English
to present-day English. Since the polysemy and the varied syntactic frames of get
have already been treated in considerable detail in earlier studies, this section
will concentrate on the most important contrastive relationships as reflected in
data from the English-Swedish Parallel Corpus. The major meanings and syntactic frames of get in this corpus are set out in Table 10.
Table 10. The major meanings of get. English originals

Frame                          Example
get + NP                       Peter got a book
have got + NP                  Peter has got a book
Modal: Obligation
have got to + VPInfinitive     Peter has got to come
gotta + VPInfinitive           Peter gotta come
get + ADJ/Participle           Peter got angry
get + PastPart (by NP)         Peter got killed (by a gunman)
get + NP + to VPInfinitive     Peter got Harry to leave
get + Particle                 Peter got up/in/out
get + PP                       Peter got to Berlin
get + NP + PP                  Peter got the buns out of the oven    7,1
get + Particle + NP
Various other cases

N = 967


The most common translations of get in Swedish are shown in Table 11. The
most frequent translation is få, but this verb does not cover more than 20,9% of
the total number of cases. Some other possession verbs, in particular ha
'have' (9,5%) and ta 'take' (5,7%), also reach a relatively high frequency as translations. However, the second most frequent equivalent is a motion verb, komma
'come', which represents 11,3% of the translations, and the inchoative verb bli
'become' translates get in 8,3% of the cases. In the following sections, the
translations will be discussed in relation to the major meanings of get.


Table 11. The most frequent Swedish equivalents of English get

komma 'come'
stiga 'step'
kliva 'stride'
resa sig 'rise'
bli 'become'
Total other equivalents
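
The percentages quoted in the running text can be tallied to see how much of the distribution the five named verbs account for. The sketch below is a minimal Python illustration that uses only the figures given in the text (with decimal commas converted to points); the variable names are illustrative, and the remainder it computes is simply what is left of 100%, not a figure reported in the source.

```python
# Shares of Swedish equivalents of get, as reported in the running text.
# Keys are the Swedish verbs; values are percentages of all translations.
reported_shares = {"få": 20.9, "komma": 11.3, "ha": 9.5, "bli": 8.3, "ta": 5.7}

# Share covered by the five named verbs, and what remains for all other verbs.
covered = round(sum(reported_shares.values()), 1)
remainder = round(100 - covered, 1)

# Rank the equivalents from most to least frequent.
ranked = sorted(reported_shares.items(), key=lambda item: item[1], reverse=True)

for verb, share in ranked:
    print(f"{verb:6s} {share:5.1f}%")
print(f"named equivalents together: {covered}%; all other equivalents: {remainder}%")
```

On these figures, the five verbs named in the text cover a little over half of the translations, which is consistent with the observation that no single equivalent dominates.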




4.1 Possession
Like Swedish få, the verb get in its prototypical use as a possession verb combines the notion of CHANGE and POSSESSION and, as noted above, få is the
dominant translation of get in this meaning. One of the major differences in
comparison with få is that English get can refer to an intentional, controlled
action even in its basic use as a verb of possession. The closest translation in this
case is skaffa, which is often used in the reflexive form, as in the following
example, but can also be used as a simple transitive verb:
(31) Why don't we get a microwave?
     Varför skaffar vi oss inte en


That get can be used with an agentive subject is also reflected in the fact that it
can appear in a ditransitive syntactic frame. In this case, too, skaffa can be used
as a translation. Another quite close translation of agentive get is hämta 'fetch',
which is relatively frequent when get has an active meaning:
(32) and so she had told the maid to get her some champagne. (RDA)
     och därför hade hon bett kammarjungfrun att hämta lite champagne åt

4.2 Motion
The uses of get as a motion verb, which are displayed in Table 12, are particularly interesting and represent 37% of the total number of occurrences of the verb.
Verbs of motion can be divided into subject-centered verbs of motion such as
walk and run, which describe the displacement of the subject, and object-centered verbs of motion, such as throw and put, which describe the displacement
of the object. Get is primarily used as a subject-centered motion verb, which is a
meaning the Swedish verb få does not have. The few cases where få is used as a
translation of get in this meaning are not equivalent in this respect. The table
also includes cases of abstract (or metaphorical) motion which tend to require
a rather free translation. On the other hand, when get is used as an object-centered motion verb, få is the dominant translation, usually in the frame få +
Particle. The second most frequent alternative, ta 'take', is less common and the
remaining translations only appear once or twice.
Table 12. Major Swedish equivalents of get as a motion verb

Subject-centered motion
komma 'come'
ta (sig) 'take' + refl.
gå 'go'
stiga 'step, rise'
kliva 'step, stride'
resa sig 'rise, get up'
hinna 'get ... in time'

Object-centered motion
ta 'take'








The dominant translation of get as a subject-centered motion verb is komma
'come'. The reason for this is probably related to the fact that the semantics of
komma involve a point of view tied to ego or a main character in various ways.
This is related to the human interest domain, something which is characteristic
of the basic verbs of possession in general.
(33) Help me to get to the crossroads safely. (PDJ)
     Hjälp mig att komma till vägkorsningen utan att något händer.

Another relatively frequent translation is ta sig, which requires the subject to be
active and implies a certain effort on the part of the subject:
(34) She had to get to the crossroads and catch the bus. (PDJ)
     Hon måste ta sig till vägskälet och


The verb hinna, which is a language-specific hyponym of succeed ('get ... in
time') and appears as a translation with moderate frequency, is also related to
the human interest domain:


(35) Diana heard her say, But I must get to Marks and Spencer before
     they close. (ST)
     Diana hörde hur hon sa: Men jag måste hinna till Marks and Spencer
     innan de stänger.

The translations discussed so far are neutral with respect to Manner of Motion.
There are two closely related meanings of get as a subject-centered motion verb,
which appear in the frame get + Particle (+ PP) and tend to be translated with
verbs indicating Manner of Motion. The reason for this is that the displacement is of very limited extent, while, at the same time, the movement of the
human body is extensive. The first subtype is related to the entrance into and
exit from vehicles (get on/get out, off) and is usually translated with kliva (på/av)
'step, stride', stiga (på/av) 'step, rise' or gå (på/av) 'go, walk':
(36) Dalgliesh got out of the Jaguar
     Dalgliesh klev ur Jaguaren
(37) The train stopped and more people got on. (AT)
     Tåget stannade och fler passagerare steg på.

The other subtype is the meaning 'get up, get out of bed/get to bed', which is translated with the same set of verbs. When get up refers to a change from sitting to
standing position, resa sig 'rise' in the reflexive form is the dominant translation:
(38) He got up and put on the kettle and he sat down again where my
     ma always sat. (RDO)
     Han reste sig, satte på tekitteln och satte sig ner igen där min mamma
     alltid satt.

Observe that the Source of the Motion (such as car, bed, chair) is usually not
explicitly mentioned in the syntactic frame of the verb but must be inferred
from the wider context in order to yield a correct translation.
As a motion verb, get is also used fairly frequently in metaphorical expressions. Usually the PP in the syntactic frame refers to an abstract Place or even to
an event as in the example below (cf. the event structure metaphor treated in
Lakoff (1993), which involves many spatial concepts):
(39) Impetuous, he rages on: It really turned bad soon after the
     divorce, when I tried to get down to writing again. (BR)
     Han fortsätter med en plötslig häftighet: Det blev verkligt illa strax
     efter skilsmässan när jag försökte komma igång med skrivandet igen.

As already mentioned, få is the dominant translation when get is used as an object-centered motion verb. When it appears it usually has the success reading described
in Section 3.5:


(40) Ma and Pa were at the front door of a dirty old house, trying to get
     a key in the lock. (ST)
     Mamma och pappa stod framför ett smutsigt gammalt hus och försökte få
     in en nyckel i låset.
(41) There's enough petrol for this afternoon, I expect, but how am I
     going to get the children to school tomorrow morning? (FW)
     Det finns väl bensin så det räcker för i eftermiddag, tror jag, men hur ska jag
     kunna få barnen till skolan i morgon?

The relative prominence of the feature human interest in the meaning of get is
most probably also reflected in the feature Success associated with få + Particle
in Swedish, but this component represents a much stronger degree of human
interest and in many cases få cannot be used as a translation. The most frequent
alternative in this case is ta 'take', but very often some more specific verb is used
such as plocka 'pick'. Various types of spatial metaphor are also relatively frequent when get is used as an object-centered motion verb:
(42) I can't get any sense out of her.
     Jag kan inte få ett vettigt ord ur henne.

4.3 Grammaticalized meanings
Get has acquired a number of grammaticalized uses in English but only the
ones with an inchoative meaning reach relatively high frequencies in the corpus. Consequently, the grammatical uses will only be discussed rather briefly
here in spite of their theoretical interest.
4.3.1 Modal
The forms (have) got to, (have) gotta can be used to express modal obligation.
All but one of the 17 examples are translated by måste 'must', which expresses
strong obligation in Swedish:
(43) But when you're in big business like I am, you've got to be hot
     stuff at arithmetic. (RD)
     Men när man gör affärer i den här storleksklassen så måste man vara slängd i
(44) To Eisenhower he [Patton] exploded: My men can eat their
     belts, but my tanks have gotta have gas. (MH)
     Patton rasade mot Eisenhower: Mina karlar kan äta sina livremmar, men
     tanksen måste ha soppa.


4.3.2 Causative
In the frame get + NP + to VPInfinitive, get has a causative meaning, but there are
only 16 examples of this type. Exactly half of them are translated by få:
(45) You should get Stuart to narrate our schooldays together. (JB)
     Ni borde få Stuart att berätta om vår


As was shown earlier, få in its use as a periphrastic causative was most frequently translated by make. There is another frame where get has primarily a
causative meaning: get + NP + Participle. The most common translation is få in
this case also (10 out of 23 examples):
(46) But how was I going to get the check replaced? (SG)
     Men hur skulle jag få checken utbytt?

4.3.3 Inchoative
When get is combined with an adjective, bli 'become' is the dominant translation (40 out of 58 get + ADJ), as in the following simple example:
(47) I don't ever want to get old. (JB)
     Jag vill aldrig någonsin bli gammal.

Bli is quite common also when get is combined with an adjectival participle,
but in that case (according to varying lexical constraints) the most frequent
type of translation is a reflexive verb, for example gifta sig 'get married', skilja sig
'get divorced', klä (på) sig 'get dressed', intressera sig 'get interested', vänja sig 'get
used to'. The following is a typical example:
(48) And get involved he does, daily, in the lives of all. (LT)
     Nog engagerar han sig alltid; dagligen och i alla bybornas liv.

The reflexive in these examples serves to topicalize the NP that ends up as subject. Sometimes the inchoative element is strengthened with the aspectual
verb börja 'begin':
(49) Before Baby got interested in boys, she would help my mother
     make dresses for her. (NG)
     Innan Baby började intressera sig för pojkar brukade hon hjälpa min mor att
     sy klänningar åt henne.

In Swedish, there is also a semi-productive inchoative verbal suffix -na,
which appears in a few translations: kallna (from the adj. kall) 'get cold', tröttna
(adj. trött) 'get tired', fastna (adj. fast) 'get stuck'.


There are also examples of two other frames with an inchoative meaning in
the corpus. The frame get + to + VPInfinitive is primarily found in the phrase get
to know (translated lära känna), but there are a few other examples like the following:
(50) People in communities like his own, in other areas of the Transvaal,
     got to hear of him; (NG)
     Folk i samhällen liknande hans eget, i andra delar av Transvaal, fick höra talas
     om honom;

Another, related, frame with inchoative meaning is get + to + VPing. It is, however, only attested once in the present corpus:
(51) But this morning, when Mr Harris didn't turn up, and
     Marion didn't either, we got to wondering (FW)
     Men i morse, när varken mr Harris eller Marion kom, började vi undra,

4.3.4 The passive
Clear cases of the so-called get-passive are represented in examples with an
explicit by-phrase expressing the agent. Such examples are, however, infrequent and in many cases it is hard to draw a clear line between the get-passive
and the inchoative use of get. Out of the 25 examples classified as passive, 9 are
translated by the bli-passive in Swedish and 4 by the morphological s-passive:
(52) Did he get picked up? (SG)
     Blev han tagen av polisen?
(53) We were pen pals after he got sent to San Luis. (SG)
     Vi blev brevvänner efter att han skickats till San Luis.
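
The counts reported for the grammaticalized uses of get in Section 4.3 can be converted into proportions for easier comparison across frames. The following is a minimal Python sketch that uses only counts stated in the text; the dictionary labels and the helper name share are illustrative, and "8 of 16" is this sketch's reading of "exactly half" of the 16 causative examples.

```python
# (hits, total) pairs taken from the counts stated in Section 4.3.
reported_counts = {
    "(have) got to / gotta -> måste": (16, 17),   # all but one of 17
    "get + NP + to VPInf -> få": (8, 16),         # exactly half of 16
    "get + NP + Participle -> få": (10, 23),
    "get + ADJ -> bli": (40, 58),
    "get-passive -> bli-passive": (9, 25),
    "get-passive -> s-passive": (4, 25),
}


def share(hits: int, total: int) -> float:
    """Percentage of examples covered by the named translation, one decimal."""
    return round(100 * hits / total, 1)


for label, (hits, total) in reported_counts.items():
    print(f"{label}: {hits}/{total} = {share(hits, total)}%")
```

Because the totals are small (between 16 and 58 examples per frame), the resulting percentages should be read as rough indications rather than stable estimates.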

5. Conclusion: universal and language-specific structuring

To resume the theme of the opening lines of this paper, human languages are at
once characterized by universality and an enormous variability both across languages and within languages. At a general level, få and get resemble one another
with respect to their semantic extension. Etymologically få is derived from a
physical action verb fånga meaning 'catch', whereas get is derived from 'seize'. The
latter meaning is also a common source for verbs meaning 'have' in European
languages. The rise of the possession verb meaning represents a focusing of the
result and a gradual bleaching of the components related to manner of action
and agentivity. The latter component has virtually disappeared from få, whereas
get can still have an agentive reading as a possession verb.
The further extensions into areas of grammatical meaning such as modal,
causative and inchoative also show many parallels at a general level. The pattern represents quite a common path of meaning extension cross-linguistically.
Matisoff (1991) describes a pattern of grammaticalization characteristic of
Southeast Asian languages such as Thai, Vietnamese, Khmer and Lahu which
in many respects resembles that of Swedish få, Finnish saada and English get. In
Southeast Asian languages, verbs meaning 'get, obtain' have characteristically
developed meanings such as 'manage/get to', 'have to/must' and 'be able to'.
The various senses are related to a quite high degree to distinct syntactic
frames. The first two meanings tend to appear when 'get' is a pre-head auxiliary, whereas 'be able to' appears in post-head auxiliary position.
It appears, however, that the meaning 'get' ('come to possess') is not generally lexicalized as a simple verb in the world's languages. The situation found in
French, where there is no direct equivalent of få, seems to be rather common.
(In particular, it seems that 'take' can be extended to cover this meaning in a
number of languages.) The pattern of meaning extension characteristic of 'get'
is closely related to the patterns found for other basic possession verbs, in particular for 'give'. According to Newman's (1996) systematic study of 'give'
across a wide range of languages, it is clear that this verb tends to extend into
the grammatical areas of benefactive, permission/enablement and causation,
which all have parallels in languages where 'get' is (being) grammaticalized.
This indicates that there are great similarities across languages with respect to
the conceptual core.
In spite of the strong universality at the conceptual level, the lexicalization
patterns are very language-specific at a more detailed level. The overall mutual
translatability of få and get is remarkably low. Få is translated with get in only
12% of the cases and even if get has få as a translation almost twice as often, in
21% of the cases, that is still a relatively low figure. This prompts a detailed contrastive analysis, which can be used for applied purposes such as translation
and language teaching. In this study, attention has been paid in particular to
the cues for disambiguation. It turns out that the syntactic frame plays an
important role in narrowing down the range of possible meanings of få and get
to an extent that is unusual in the case of other words. However, additional cues
often have to be taken into consideration in order to identify the exact sense.
For example, the choice between a permission and an obligation reading of få is
decided primarily on the basis of pragmatic factors, which have to be worked
out in more detail. For abstract possession, the semantic composition of the
head noun of the object plays an important role in the choice of an appropriate
translation. Since such uses are numerous and involve a wide range of abstract
nouns, a more complete description requires the study of much larger corpora,
which are available only in monolingual form. It is clear, however, that translation corpora serve an important function in sharpening the questions we
would like to answer using the large monolingual corpora that are now available at the touch of a key.

* This work has been carried out within the Crosslinguistic Lexicology (Swed.
Tvärspråklig lexikologi) project, which receives financial support from the Swedish Council
for Research in the Humanities and Social Sciences. For a presentation of the project, see
Viberg (1996).

Aijmer, K., Altenberg, B. & Johansson, M. 1996. Text-based contrastive studies in English.
Presentation of a project. In Languages in contrast. Papers from a Symposium on Text-based Cross-linguistic Studies [Lund Studies in English 88], K. Aijmer, B. Altenberg & M.
Johansson (eds), 73–85. Lund: Lund University Press.
Altenberg, B. & Aijmer, K. 2000. The English-Swedish Parallel Corpus: A resource for contrastive research and translation studies. In Corpus Linguistics and linguistic theory,
C. Mair & M. Hundt (eds), 15–33. Amsterdam and Atlanta: Rodopi.
Altenberg, B., Aijmer, K. & Svensson, M. 1999. The English-Swedish Parallel Corpus: Manual.
Department of English, University of Lund. (Also at
Bally, Ch. 1950. Linguistique générale et linguistique française. 3e éd. Berne: A. Francke.
Berlin, B. & Kay, P. 1969. Basic color terms. Berkeley: University of California Press.
Dorr, B. 1993. Machine translation. A view from the lexicon. Cambridge: The MIT Press.
Geeraerts, D. 1997. Diachronic prototype semantics: a contribution to historical lexicology.
Oxford: Clarendon.
Goddard, C. & Wierzbicka, A. (eds). 1994. Semantic and lexical universals. Theory and
empirical findings. Amsterdam: Benjamins.
Gronemeyer, C. 1997. A semantic and syntactic account of the polysemy in get. Licentiate of
Philosophy Thesis. Dept. of Linguistics, Lund University.
Gronemeyer, C. 1999. On deriving complex polysemy: the grammaticalization of get.
English Language and Linguistics 3(1): 1–39.


Gumperz, J. & Levinson, S. (eds). 1996. Rethinking linguistic relativity. Cambridge: Cambridge
University Press.
Hatch, E. & Brown, C. 1995. Vocabulary, semantics and language education. Cambridge:
Cambridge University Press.
Heine, B. 1997. Possession. Cambridge: Cambridge University Press.
Jakobson, R. 1936. Beitrag zur allgemeinen Kasuslehre: Gesamtbedeutungen der russischen
Kasus. Reprinted in: Selected writings II: words and language, 23–71. The Hague: Mouton.
Johansson, S. 1998. On the role of corpora in cross-linguistic research. In S. Johansson & S.
Oksefjell (eds), 3–24.
Johansson, S. & Oksefjell, S. 1996. Towards a unified account of the syntax and semantics of
get. In Using corpora for language research, J. Thomas & M. Short (eds), 57–75. London
& New York: Longman.
Johansson, S. & Oksefjell, S. (eds). 1998. Corpora and cross-linguistic research. Theory,
method, and case studies. Amsterdam: Rodopi.
Lakoff, G. 1993. The contemporary theory of metaphor. In Metaphor and thought, A.
Ortony (ed), 202–251. Cambridge: Cambridge University Press.
Langacker, R. 1988. A usage-based model. In Topics in cognitive linguistics, B. Rudzka-Ostyn (ed), 127–161. Amsterdam: Benjamins.
Matisoff, J. 1991. Areal and universal dimensions of grammatization in Lahu. In
Approaches to grammaticalization. Vol. II [Typological studies in language 19:2], E. C.
Traugott & B. Heine (eds), 383–453. Amsterdam and Philadelphia: John Benjamins.
Miller, G. A. & Johnson-Laird, Ph. 1976. Language and perception. Cambridge: Cambridge
University Press.
Newman, J. 1996. Give. A cognitive linguistic study. Berlin & New York: Mouton de Gruyter.
Poesio, M. 1996. Semantic ambiguity and perceived ambiguity. In Semantic ambiguity and
underspecification [CSLI Lecture Notes 55], K. van Deemter & S. Peters (eds), 159–201.
Pustejovsky, J. 1995. The generative lexicon. Cambridge, MA: Bradford.
Quirk, R., Greenbaum, S., Leech, G. & Svartvik, J. 1985. A comprehensive grammar of the
English language. London & New York: Longman.
Schwarze, C. (Hrsg.) 1985. Beiträge zu einem kontrastiven Wortfeldlexikon Deutsch–Französisch.
Tübingen: Gunter Narr Verlag.
Singleton, D. 1999. Exploring the second language mental lexicon. Cambridge: Cambridge
University Press.
SUC (1997). SUC 1.0. Stockholm-Umeå Corpus. Produced by Dept. of Linguistics, Umeå
University and Dept. of Linguistics, Stockholm University. CD-Rom.
Talmy, L. 1985. Lexicalization patterns: semantic structures in lexical forms. In Language
Typology and Syntactic Description. Vol. III, T. Shopen (ed), 57–149. Cambridge:
Cambridge University Press.
Taylor, J. 1995. Linguistic categorization. Prototypes in linguistic theory. 2nd ed. Oxford:
Clarendon Press.
Tsohatzidis, S. (ed). 1990. Meanings and prototypes. Studies in linguistic categorization.
London: Routledge.
van der Auwera, J. & Plungian, V. A. 1998. Modality's semantic map. Linguistic Typology 2:

Viberg, Å. 1981. Emotiva predikat i svenskan och några andra språk. In Studier i kontrastiv
lexikologi, 61–99. (In Swedish. Studies in contrastive lexicology.) Ph. D. diss. Dept. of
Linguistics, Stockholm University.
Viberg, Å. 1996. Crosslinguistic lexicology. The case of English go and Swedish gå. In
Languages in contrast. Papers from a Symposium on Text-based Cross-linguistic Studies
[Lund Studies in English 88], K. Aijmer, B. Altenberg & M. Johansson (eds), 151–182.
Lund: Lund University Press.
Viberg, Å. 1999a. The polysemous cognates Swedish gå and English go. Universal and language-specific characteristics. Languages in Contrast 2: 87–115.
Viberg, Å. 1999b. Polysemy and differentiation in the lexicon. Verbs of physical contact in
Swedish. In Cognitive semantics. Meaning and cognition, J. Allwood & P. Gärdenfors
(eds), 87–129. Amsterdam: Benjamins.
Wagner, J. 1976. Eine kontrastive Analyse von Modalverben des Deutschen und
Schwedischen. IRAL XIV(1): 49–66.
Wandruszka, M. 1969. Sprachen. Vergleichbar und unvergleichlich. München: Piper.
Wanner, L. (ed). 1996. Lexical choice. Special issue of Machine Translation 11(1–3): 1–216.
Winter, S. & Gärdenfors, P. 1995. Linguistic modality as expressions of social power.
Nordic Journal of Linguistics 18(2): 167–199.

A cognitive approach to Up/Down metaphors in English and Shang/Xia metaphors in Chinese
Lan Chun

1. Introduction
This is a contrastive study of spatial metaphors in English and Chinese carried
out within the framework of cognitive semantics. It is assumed in the study that
there exists an intermediate level, cognition, between language and the physical
world (Svorou 1994, Gärdenfors 1996, Geiger & Rudzka-Ostyn 1993, Langacker
1987, Lakoff 1987), and an experiential view of cognition is adopted. This view,
also known as experiential realism, hypothesizes that basic-level categories and
image schemas are the two kinds of preconceptual structure directly meaningful
to us. One way in which abstract conceptual structure arises from these two
kinds of preconceptual structure is by metaphorical mapping.
The cognitive approach ascribes the following basic features to metaphor:
1. Metaphor is conceptual in nature: it is a cognitive device which enables us
to organize our conceptualization of the world.
2. Metaphor is composed of two domains, a relatively clearly structured
source domain and a relatively less clearly structured target domain. It is a
mapping of the schematic structure of the source domain onto that of the
target domain.
3. Metaphorical mappings are not arbitrary but are grounded in our physical
experience. Once a metaphorical mapping is set up, it will impose its structure on real life and be made real in different ways.
Two English spatial terms, namely up and down, and two Chinese spatial terms,
namely shang (up) and xia (down), constitute the main research issues of this
study. Following one of the basic assumptions of cognitive semantics that
semantic structure is equated with conceptual structure, which commonly
gives rise to a prototype-based network (Smith 1993: 531, Geiger & Rudzka-Ostyn 1993: 1), each of the four spatial terms is regarded as capturing a conceptual structure with prototypical models, and metaphorical extensions
developed out of those prototypical models. To distinguish the linguistic term
from the conceptual structure, the former will be referred to as up, down, shang
and xia, and the latter as UP, DOWN, SHANG and XIA.
The study is based on a Chinese corpus and an English corpus and has the
following objectives:
1. to determine the metaphorical extensions along which UP/DOWN and
SHANG/XIA develop;
2. to explicate the experiential bases of the metaphorical extensions uncovered on the one hand, and the realizations of those metaphorical extensions in everyday life on the other, which, according to Lakoff (1993: 244),
are two sides of the same coin;
3. to discover the similarities and differences between the ways English and
Chinese speakers conceptualize other domains via their UP/DOWN and
SHANG/XIA metaphors.
As recognized by Yu (1996) and Stibbe (1996), the cognitive approach to
metaphor now faces three main challenges. First of all, more cross-linguistic and
cross-cultural research needs to be done before sound evidence can be produced
for the claim of the cognitive approach that abstract reasoning is partly
metaphorical. Secondly, to what extent and in what manner cognitive universals
and variations exist across cultures and languages still remains to be explored.
Thirdly, during the past two decades research into the cognitive approach to
metaphor has relied heavily on a narrow range of unnatural data, sometimes
made on the spot to fit a pre-set theory. A closer look at a representative range of
contemporary examples taken from natural language sources, considered in as
full a context as possible, is therefore called for (cf. Schönefeld 1999).
In view of these challenges, the present study contributes to cognitive
semantic research in metaphor in the following ways. First, it offers a systematic
contrastive analysis of the metaphorical extensions of two English spatial terms
and two Chinese spatial terms. Second, evidence is provided from the analysis
for the cognitive claim that metaphorical mapping of the image-schematic
structure of the source domain onto that of the target domain gives rise to
abstract concepts and abstract reasoning. Evidence is also provided for the possible existence of a universal spatial metaphorical system, which has so far largely remained a speculation. Third, the study contributes to research methodology: it shows that, handled properly, a corpus-based approach towards data collection and analysis can be fruitfully exploited in the field of cognitive semantics; it also demonstrates how two typologically different languages can be
brought together for comparative purposes within a cognitive framework.

2. UP, DOWN, SHANG and XIA as image-schematic concepts

UP, DOWN, SHANG and XIA each activates an image-schematic concept
depicting a movement or a particular location of a trajector in relation to a
landmark along the vertical axis. When UP/DOWN and SHANG/XIA depict a
movement of the trajector, they will be referred to as dynamic UP/DOWN and
dynamic SHANG/XIA. When they depict a particular location of the trajector,
they will be referred to as static UP/DOWN and static SHANG/XIA. Figures 1
and 2 are graphic representations of dynamic UP/SHANG and dynamic DOWN/XIA respectively.
Examples of the dynamic type are:
(1) The camera is panning up a girl's body.
(2) The unemployment rate has gone up to 4%.
Figure 1: Schema for dynamic UP/SHANG (axes: vertical, horizontal)
Figure 2: Schema for dynamic DOWN/XIA (axes: vertical, horizontal)

(3) women pashang shanding.
we climb up mountain top
We climbed up to the top of the mountain
(4) qiwen shangsheng dao 38 du.
temperature up rise to 38 degrees
The temperature has risen to 38 degrees
(5) She sat down, perching on the edge of the armchair.
(6) Cut your shopping down to twice a week.
(7) women zouxia [...]
we walk down mountain slope
We walked down the mountain
(8) qiwen [...] dao lingxia 10 du.
temperature down drop to zero down 10 degrees
The temperature has dropped to 10 degrees below zero

When the trajector is stationary, we get static UP/SHANG and static DOWN/XIA,
as captured in Figures 3 and 4.
Figure 3: Schema for static UP/SHANG (axes: vertical, horizontal)
Figure 4: Schema for static DOWN/XIA (axes: vertical, horizontal)

Examples of the static type are:

(9) He is up in his own bedroom.
(10) They were two goals up at half time.
(11) hongqi zai caochang shangkong piaoyang.
red flag at playground up sky [...]
The red flag is flying in the wind over the playground

(12) nanxing diwei zai nuxing diwei zhi shang.
man status at women status of up
Men's status is above women's status
(13) He could see the house down below.
(14) Brazil was two down against France at half time.
(15) zhongzi maizai dixia.
seed bury at ground down
The seeds are buried deep down in the earth
(16) nuxing diwei zai nanxing diwei zhi xia.
woman status at man status of down
Women's status is below men's status

When static SHANG and XIA depict a contact between the trajector and the
landmark, they constitute a special case for which there is no counterpart in the
case of static UP and DOWN. I call this special use of SHANG and XIA contact
SHANG and contact XIA, which are represented by Figures 5 and 6. It should
be noted that in the case of contact SHANG, the trajector not only touches, but
is also supported by, the landmark; and in the case of contact XIA, the trajector
is covered or pressed by the landmark.
Figure 5: Schema for contact SHANG (axes: vertical, horizontal)
Figure 6: Schema for contact XIA (axes: vertical, horizontal)

Examples of contact SHANG and XIA are:

(17) baozhi [...] yi zhi bi.
newspaper up-contact place-ing one NC pen
There is a pen on the newspaper
(18) baozhi [...] yi pian [...]
newspaper up-contact have one NC [...]
There is an article in the newspaper
(19) hui [...] yi ge fayan.
meeting up-contact have one NC speech
There is a speech at the meeting
(20) baozhi [...] yi zhi [...]
newspaper down-contact [...] one NC [...]
There is a pen under the newspaper
(21) gangban zai juda de yali [...]
steel board at huge pressure down-contact change shape
The steel board bent under the enormous pressure
(22) zai shichang jingji zuoyong xia, wujia you sheng you jiang.
at market economy function down-contact, price have rise have fall
Under the influence of the market economy, prices rise and fall

When the image schemas of UP/DOWN and SHANG/XIA are used to structure other domains outside space, i.e. when we give other non-spatial domains
a vertical axis, a trajector and a landmark, as in examples (2), (4), (6), (8), (10),
(12), (19), and (22), they will be regarded as metaphorical extensions of UP/DOWN and SHANG/XIA.

3. Research methodology
This study is based on samples from a Chinese corpus and an English corpus.
The English corpus chosen is the 5-million-word Word Bank of the Collins
Cobuild English Language Dictionary (1996), from which 5728 instances of up
and 4781 instances of down were retrieved. This English corpus is mainly made
up of written material taken from three sources, viz. newspapers, magazines
and books published in the UK after 1990. The Chinese corpus, which is made
up of about 1.8 million characters of written material, was assembled by the
author by downloading Chinese newspapers and magazines published
between 1 April and 30 June 1998 from their web-sites and by downloading
books of contemporary Chinese writers published after 1995 which can be read
from the internet. From this corpus, 7621 instances of shang and 4387
instances of xia were retrieved.
The software Microsoft Access was used to process the data. A database
was built up for up, down, shang and xia separately. A random list was created
and about 10% of the instances of up/down and shang/xia were randomly
selected. 529 instances of up, 431 instances of down, 750 instances of shang
and 434 instances of xia formed the final database.
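The sampling step described above can be sketched in Python. This is only an illustrative sketch: the study itself used Microsoft Access, and the function names `sample_instances` and `sampling_fraction`, the fixed seed and the exact 10% rate are assumptions introduced here. The instance counts are those reported in the text.

```python
import random

# Instances retrieved from the two corpora and instances kept in the final
# database, as reported in the text.
RETRIEVED = {"up": 5728, "down": 4781, "shang": 7621, "xia": 4387}
FINAL = {"up": 529, "down": 431, "shang": 750, "xia": 434}


def sample_instances(instances, rate=0.10, seed=0):
    """Draw a reproducible simple random sample of about `rate` of the instances.

    `rate` and `seed` are hypothetical parameters, not values from the study.
    """
    rng = random.Random(seed)
    k = round(len(instances) * rate)
    return rng.sample(instances, k)


def sampling_fraction(term):
    """Share of the retrieved instances of `term` that reached the final database."""
    return FINAL[term] / RETRIEVED[term]


for term in RETRIEVED:
    print(f"{term}: {sampling_fraction(term):.1%} of retrieved instances analysed")
```

The fractions computed from the reported counts fall between roughly 9% and 10%, in line with the "about 10%" stated above.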
Each record of up, down, shang and xia was analysed in accordance with the
following parameters: prototype model (static or contact or dynamic), trajector,
landmark, path, and metaphorical extension. When analysing up or down in a
particular verb-particle construction, such as pick up, or cut down, the present
study followed Lindner (1981) and Morgan (1997) in recognizing the contribution of up or down to the meaning of the whole phrase. However, since my
interest was not in up/down as a word, but in up/down as encoding the concept
UP/DOWN, I did not make a distinction between up/down as a preposition,
adjective or adverb.
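The analysis parameters listed above can be pictured as one record per corpus instance. The Python sketch below is an assumption about how such a record might look, not a description of the author's Access database: the class name `Record`, the field names and the coding of the three sample sentences (taken from examples (1), (2) and (9)) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    term: str                  # "up", "down", "shang" or "xia"
    text: str                  # the corpus instance
    model: str                 # "static", "contact" or "dynamic"
    trajector: str
    landmark: str
    path: Optional[str]        # None when the trajector is stationary
    extension: Optional[str]   # None when the use is purely spatial


# Illustrative coding only; the trajector, landmark and path values are guesses.
EXAMPLES = [
    Record("up", "The camera is panning up a girl's body.",
           "dynamic", "camera view", "girl's body", "upward", None),
    Record("up", "The unemployment rate has gone up to 4%.",
           "dynamic", "unemployment rate", "4%", "upward",
           "A Larger Quantity Is Up"),
    Record("up", "He is up in his own bedroom.",
           "static", "he", "bedroom", None, None),
]


def share_metaphorical(records):
    """Proportion of records that carry a metaphorical extension."""
    if not records:
        return 0.0
    return sum(r.extension is not None for r in records) / len(records)


def model_counts(records):
    """Number of records per prototype model."""
    return Counter(r.model for r in records)


print(model_counts(EXAMPLES))
print(f"{share_metaphorical(EXAMPLES):.1%} metaphorical")
```

Tallies of this kind are what the percentages per prototype model and per metaphorical extension reported later rest on.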


4. SHANG and XIA
4.1 Prototypical vs. metaphorical meanings
SHANG and XIA originated as purely spatial concepts. This is reflected in the
earliest pictographic characters inscribed on oracle-bones excavated from Yin
(capital of the Shang Dynasty).
Evidence in the Chinese corpus shows that SHANG and XIA are mainly
used for the conceptualization of a certain stage or a certain process in the following four target domains: QUANTITY, SOCIAL HIERARCHY, TIME, and
STATES. The metaphorical extensions identified are:
1. A Larger Quantity Is Shang / A Smaller Quantity Is Xia
2. A Higher Status Is Shang / A Lower Status Is Xia
3. An Earlier Time Is Shang / A Later Time Is Xia
4. A More Desirable State Is Shang / A Less Desirable State Is Xia

The mapping of the image-schematic structures of SHANG and XIA onto that
of their target domains and the relationship this mapping has with its experiential grounding and its realizations in real life are roughly represented in
Figure 7. In the figure we see that the image-schematic structures of SHANG
and XIA emerge directly from our everyday bodily experience. They are then
projected onto the abstract target domains through metaphorical mappings.
As a result, the target domains receive a spatial structure and become indirectly
meaningful to us. The metaphorical mappings, once established, then impose
their structures on real life and become realized in various ways.

Figure 7: Mapping of SHANG and XIA onto their target domains (labels: source domain of SPACE, directly emerging, target domains, being realized)

Among the 750 occurrences of shang analysed, only 34.7% are found to be
cases of the dynamic model. This shows that, literally or metaphorically,
SHANG is less often used to depict the trajectory followed by a moving trajector than the location of a stationary trajector.
As many as 72.3% of the 750 instances analysed carry metaphorical meanings. This demonstrates how often SHANG is used metaphorically.
The percentages of the three models of SHANG and the distribution of the
metaphorical extensions detected are presented in Tables 1 and 2.
Table 1: The three prototypical models of SHANG
Prototype model            Percentage of 750
Non-dynamic SHANG
  (a) contact SHANG
  (b) static SHANG
Dynamic SHANG

Table 2: The metaphorical extensions of SHANG
Target domain      Metaphorical extension            Number   Percentage of 543   Percentage of 750
STATES             A More Desirable State Is Shang   328
QUANTITY           A Larger Quantity Is Shang
TIME               An Earlier Time Is Shang
SOCIAL HIERARCHY   A Higher Status Is Shang

Of the 434 occurrences of xia analysed, about 45.9% are instances of the
dynamic model. The remaining 54.1% are either cases of the static model or of
the contact model. This shows that XIA is quite well balanced between its nondynamic side and its dynamic side, although the former occurs slightly more
often than the latter.
As many as 77.7% of the 434 instances of xia carry metaphorical meanings.
The statistical findings are presented in Tables 3 and 4.
Table 3: The three prototypical models of XIA
Prototype model            Percentage of 434
Non-dynamic XIA
  (a) contact XIA
  (b) static XIA
Dynamic XIA

Table 4: The metaphorical extensions of XIA
Target domain      Metaphorical extension          Number   Percentage of 337   Percentage of 434
STATES             A Less Desirable State Is Xia
TIME               A Later Time Is Xia
SOCIAL HIERARCHY   A Lower Status Is Xia
QUANTITY           A Smaller Quantity Is Xia




4.2 Four metaphorical extensions

In this section, we shall discuss the four metaphorical extensions observed for
SHANG and for XIA in turn. Following the claims of experiential realism that
conceptual metaphors arise from bodily experience and, once set up, will then
impose their structures on real life, in presenting the metaphorical extensions I
shall try to work out their experiential grounding on the one hand and their
realizations in real life on the other.

(a) Quantity
A Larger Quantity Is Shang.
A Smaller Quantity Is Xia.

Experiential Grounding: When more of a substance or of physical objects is
added to a container or pile, the level goes up (see Lakoff & Johnson
1980: 15–16, Lakoff 1993: 240, Johnson 1987, Goatly 1997).
Realizations of the Metaphor: Man-made objects like thermometers and stock
market graphs exhibit a clear correlation between Larger Quantity and
SHANG and between Smaller Quantity and XIA.
The following special cases have been identified in the data:

Increase in salary is shang / Decrease in salary is xia.
Increase in costs is shang / Decrease in costs is xia.
Increase in prices is shang / Decrease in prices is xia.
Increase in inflation rate is shang / Decrease in inflation rate is xia.
Increase in temperature is shang / Decrease in temperature is xia.
Increase in speed is shang / Decrease in speed is xia.
Increase in volume/pitch of voice is shang / Decrease in volume/pitch of voice is xia.

Examples of the above special cases are:

(23) gongzi shangtiao
salary up adjust
a rise in the salary
(24) xiaofei xia jiang
consumption down fall
a drop in consumption
(25) wujia shangzhang
price up rise
a rise in the price
(26) wujia xia die
price down drop
a drop in the price
(27) tonghuo pengzhang shangyang
inflation rate up rise
a rise in the inflation rate
(28) tonghuo pengzhang xia jiang
inflation rate down fall
a drop in the inflation rate
(29) wendu [...]
temperature up rise
a rise in the temperature
(30) sudu xia jiang
speed down fall
a drop in the speed
(31) shengyin shangyang
voice up rise
a rise in the voice
(32) shuliang xia tiao
number down adjust
a decrease in the number

(b) Social Hierarchy

A Higher Status Is Shang.
A Lower Status Is Xia.

Experiential Grounding: In ancient society, a man's status was associated with
his physical strength, and the latter in turn was typically correlated with his
physical size. A man who is bigger and taller is usually stronger and hence in a
better position to win a fight than a shorter and smaller man. The victor in a
fight is typically on top of the loser (Lakoff & Johnson 1980: 15–16, Lakoff
1993, Johnson 1987).
Realizations of the Metaphor:
Architecture: Take the halls in the Forbidden City as an example. To go to any of
the halls, one needs to climb a lot of stairs. Those halls symbolize the emperor's
status and power in people's eyes and are therefore raised far above ground level.
Rituals: In ancient China the throne of the emperor was always situated in a place
several steps higher than the seats for his ministers. Within family households,
the seat for the patriarch was also situated in a higher place or in a place considered to be higher. People kowtowed in front of officials to acknowledge their humbleness. The rebellious were forced to kneel down to repent of their sin.
Social practices: In a name list the names of VIPs come at the top of a page. In
the prize-giving ceremonies at sporting events the champion stands a step
higher than the contestant who came second, who in turn stands a step higher
than the contestant in third place.
Below are some examples:
(33) shangdiao
up move
move to a higher social position
(34) xiafang
down place
be moved to a lower social position
(35) shangqing xiada
up feeling down reach
for the feelings of those at the top to reach those at the bottom
(36) shangji bumen
up step bureau
those bureaus of a higher level
(37) xiaji [...]
down step bureau
those bureaus of a lower level

(38) shangzuo
up seat
seat for VIP
(39) xiazuo
down seat
seat for less important people
(40) shangliu shehui
up stream society
the upper class society
(41) xialiu [...]
down stream society
the lower class society

(c) Time
An Earlier Time Is Shang.
A Later Time Is Xia.

These two metaphors fit into the larger system of TIME-AS-SPACE metaphor
noted by many researchers (see e.g. Lakoff & Johnson 1980, Lakoff 1993, Alverson
1994, Svorou 1994, Allan 1995, Yu 1996). In particular, they arise from two special cases.
Special case 1: Times are fixed locations arranged along a vertical landscape. An
earlier time is above a later time.
It is reflected in expressions like the ones listed below:
(42) shang yi dai
up a generation
the older generation
(43) xia yi dai
down a generation
the younger generation
(44) shangci
up time
last time
(45) xia ci
down time
next time

(46) shang ban nian
up half year
the first six months of a year
(47) xia ban nian
down half year
the second six months of a year
(48) shangxun
up ten-days
the first ten days of a month
(49) xia xun
down ten-days
the last ten days of a month

Special case 2: Human beings (with their belongings) move downwards
towards the future. They can nevertheless go upwards to revisit an earlier time.
It is reflected in expressions like:
(50) you ci shangsu dao hanchao
from here up trace to han dynasty
trace to the Han Dynasty from this point
(51) yanzhe lishi de changhe ni er shang
along history long river against stream up
to go up stream against the river of history
(52) jianchi xia qu
insist down go
carry on till the future
(53) yi dai yi dai chuan xia lai
one generation one generation pass down come
to pass down generation after generation

The two cases are consistent with each other in that both entail an earlier time
being above a later time.
Experiential grounding:
1. Human beings have detectors for motion and for objects/locations in their
visual systems, yet they have no detectors for time. It thus makes sense
from a biological point of view that time should be understood in terms of
space (Lakoff 1993:218).
2. In the history of human evolution, the conceptions of spatial relations are
developed much earlier than those of temporal relations (Akhundov

3. In the process of individual growth, the conception of spatial relations is
also acquired before that of temporal relations (Akhundov 1986: 21–22).
Realizations of the Metaphor:
Man-made objects: In a typical calendar, an earlier time is usually put either in
front of or above a later time.
Rituals: Offerings to the ancestral spirits were always placed on top of a sacrificial altar raised above ground level.
Social practices: When drawing a family tree, one always puts the oldest generation at the top of the page and then traces down to the youngest generation,
rather than vice versa.

(d) States
A More Desirable State Is Shang.
A Less Desirable State Is Xia.

This is a special case of the Event Structure Metaphor (Lakoff 1993, Yu 1996),
which claims that various aspects of event structure, including notions like states,
changes, processes, actions, causes, purposes, and means are characterized cognitively via metaphors in terms of space, motion, and force (Lakoff 1993:220).
Experiential grounding: The human body stands upright, with the head at the
top and the feet at the bottom. Humans and most other mammals lie down
when they sleep and stand up when they wake. Dead people are in a physically
recumbent position.
Realizations of the Metaphor:
Physical symptoms: A drooping posture is typically associated with sadness and
depression; an erect posture is typically associated with more positive emotional states such as happiness and cheerfulness.
Literary works: In literary works it is common for the pursuit of a desirable purpose to take the form of an actual upward journey, such as mountain climbing.
The following specific metaphorical extensions are identified within the target
domain of STATES:

Higher Morality Is Shang/Lower Morality Is Xia.
Better Quality Is Shang/Poorer Quality Is Xia.
In Public Is Shang/In Private Is Xia.
Greater Intensity Is Shang/Lesser Intensity Is Xia.
Fulfilment Of A (Positive) Action Is Shang/Fulfilment Of A (Negative) Action Is Xia.
(54) shang de
up virtue
grand virtue
(55) xia jian
down humble
of low morality
(56) shang shi
up gentleman
gentleman with high morality
(57) xia shi
down gentleman
gentleman with low morality
(58) shang pin
up rank
of the best quality
(59) xia pin
down rank
of the poorest quality
(60) shang shi
up market
be on sale
(61) xia shi
down market
be off sale
(62) shang ban
up office
go to work
(63) xia ban
down office
leave work

(64) shang ke
up class
have class
(65) xia ke
down class
class is over
(66) dang shang jingli
become up manager
get to the post of manager
(67) diu xia haizi
drop down child
leave the child unattended

5. UP and DOWN
5.1 Prototypical vs. metaphorical meanings
In their prototypical dynamic and static models, UP and DOWN are used to
denote the physical position or the changes in the physical position of a trajector along a vertical axis. Extended from these two prototypical models, UP and
DOWN are also used to talk about and to construct a certain stage or changes
over a period of time in other abstract domains.
Evidence in the English corpus shows that UP and DOWN are mainly used
for the conceptualization of changes in the same four target domains as
SHANG and XIA: QUANTITY, SOCIAL HIERARCHY, TIME, and STATES. The metaphorical extensions identified are:

A Larger Quantity Is Up. / A Smaller Quantity Is Down.
A Higher Status Is Up. / A Lower Status Is Down.
A Later Time Is Up. / A Later Time Is Down.
A More Desirable State Is Up. / A Less Desirable State Is Down.

Since these metaphorical mappings for the most part share the same experiential grounding as their Chinese counterparts, I will not repeat their experiential
bases in the following descriptions. As for the realizations of those metaphorical mappings, attention will only be paid to cases where a distinctively English
way of realizing a particular metaphor has been detected.
The analysis of the 529 instances of up shows that UP is used in its dynamic
model in 97.7% of the cases. This is certainly different from SHANG and seems
to suggest that while SHANG is used to structure both a certain stage of its target domains and a certain change taking place in its target domains (with a bias
towards the former), UP is almost always used to denote the changes going on
in its target domains.
Altogether, 87.6% of the records of up analysed are found with metaphorical extensions. This is even higher than the 72.3% found in the case of shang.
Tables 5 and 6 present the statistical findings.
Table 5: The two prototypical models of UP
Prototype model      Percentage of 529
Dynamic UP
Static UP




Table 6: The metaphorical extensions of UP
Target domain      Metaphorical extension         Number   Percentage of 463   Percentage of 529
STATES             A More Desirable State Is Up
QUANTITY           A Larger Quantity Is Up
SOCIAL HIERARCHY   A Higher Status Is Up
TIME               A Later Time Is Up





Of the 431 instances of down analysed, 94.4% belong to the dynamic model.
Comparing this with XIA, which is well balanced between its static side and its
dynamic side, we notice a sharp contrast.
45.4% of all the instances of down are found to have metaphorical extensions. This is much less than the 77.7% of xia. It is interesting to notice that
while up is more often used metaphorically than shang, down is less often used
metaphorically than xia. The reason for this will only be established by further
research. Tables 7 and 8 present the statistical results.
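As a quick arithmetic check on the proportions quoted in the text, the Python sketch below converts each reported percentage of metaphorical uses back into an approximate number of instances. The sample sizes and percentages are those given in the text; the names `REPORTED` and `implied_count` are introduced here for illustration. Reading the bases printed in the tables of metaphorical extensions (543, 337, 463 and 196) as the corresponding counts is an assumption.

```python
# (sample size, reported percentage of metaphorical uses), as given in the text.
REPORTED = {
    "shang": (750, 72.3),
    "xia": (434, 77.7),
    "up": (529, 87.6),
    "down": (431, 45.4),
}


def implied_count(sample_size, percent):
    """Number of instances that a reported percentage corresponds to."""
    return sample_size * percent / 100


for term, (n, pct) in REPORTED.items():
    print(f"{term}: about {implied_count(n, pct):.1f} of {n} instances")
```

Under that assumption, each implied count lies within one instance of the base printed in the corresponding table.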
Table 7: The two prototypical models of DOWN
Prototype model      Percentage of 431
Dynamic DOWN
Static DOWN




Table 8: The metaphorical extensions of DOWN
Target domain      Metaphorical extension           Number   Percentage of 196   Percentage of 431
STATES             A Less Desirable State Is Down
QUANTITY           A Smaller Quantity Is Down
SOCIAL HIERARCHY   A Lower Status Is Down
TIME               A Later Time Is Down





In what follows, each of the metaphorical extensions observed for UP and
DOWN is discussed briefly.
5.2 Four metaphorical extensions

(a) Quantity
A Larger Quantity Is Up.
A Smaller Quantity Is Down.

The following specific cases have been found among the corpus data:

Increase In Salary Is Up / Decrease In Salary Is Down.
Increase In Costs Is Up / Decrease In Costs Is Down.
Increase In Price Is Up / Decrease In Price Is Down.
Increase In Inflation Rate Is Up / Decrease In Inflation Rate Is Down.
Increase In Temperature Is Up / Decrease In Temperature Is Down.
Increase In Speed Is Up / Decrease In Speed Is Down.
Increase In Size Is Up / Decrease In Size Is Down.
(68) The football star can expect up to 300,000 pounds a week.
(69) The nurses have offered to scale down their pay demands to a lower figure.
(70) The costs have been multiplied up many times.
(71) Is there any way we can prune the costs down still further?
(72) The dealers bid up all the good pieces, to keep out private buyers.
(73) The price of milk should be down next week.
(74) The inflation rate is going up again.
(75) The new government promised to bring the inflation rate down.
(76) The sun warmed up the seat nicely.
(77) After a warm and sunny day, the temperature will be down to 10 degrees tomorrow.
(78) You'll have to speak up a bit, we can't hear you above the noise of the traffic.
(79) The radio station faded the music down to give a special news broadcast.
(80) She has blown up the pictures she took with her mom.
(81) You've slimmed down such a lot since we last met!

(b) Social Hierarchy

A Higher Status Is Up.
A Lower Status Is Down.

One special way of realizing this pair of metaphors was noted in the English corpus:
Religious beliefs: In Christianity, God and Jesus are up in Heaven, Satan and the
other devils are down in Hell.
Below are a few examples:

(82) The upper strata of society
(83) Paleo is an upmarket resort.
(84) Your request will be handed up to the board of directors.
(85) Are the citizens still refusing to yield up the town?
(86) He has moved up the social ladder quite a lot since we last met.
(87) The downfall of a dictator
(88) We sell a lot of down-market books.
(89) A national strike would bring the government down.
(90) Why do the English look down on everything foreign?

(c) Time
A Later Time Is Up.
A Later Time Is Down.

Two special cases have also been found with TIME PASSING IS MOTION
ALONG VERTICAL AXIS in English. Consider the following examples:

(91) from 1918 up to 1945
(92) from the Middle Ages up to the present day
(93) They were using charcoal right up to my day.
(94) Up until the early sixties there was no shortage of power.
(95) Up to now they've had very little to say.

Examples like the above suggest the existence of Special Case 1: time is moving
upward from the past towards the future. This is different from special case 1
noted in Chinese.
Now consider some other examples:
(96) It had been occupied as a palace by all our kings and queens down to James I.
(97) There has been a chapel here down all the years my family has lived in
this house.
(98) The custom has been carried down from the 18th century.

Expressions like these suggest the existence of Special Case 2: human beings
(with their belongings) move downward from the past toward the future. This
is the same as special case 2 noted in Chinese.
Unlike the situation in Chinese, the two special cases in English are not
consistent with each other. This inconsistency results in a conflict between
Towards A Later Time Is Up and Towards A Later Time Is Down.
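The contrast drawn here can be made explicit with a small Python sketch. The dictionary encodes, for each special case described in the text, which vertical direction is associated with a later time; the names `LATER_DIRECTION` and `is_consistent` are introduced here purely for illustration.

```python
# Direction associated with a LATER time in each special case, as described
# in the text for Chinese (SHANG/XIA) and English (UP/DOWN).
LATER_DIRECTION = {
    "Chinese": {"special case 1": "down", "special case 2": "down"},
    "English": {"special case 1": "up", "special case 2": "down"},
}


def is_consistent(cases):
    """True when every special case places later time in the same direction."""
    return len(set(cases.values())) == 1


for language, cases in LATER_DIRECTION.items():
    status = "consistent" if is_consistent(cases) else "in conflict"
    print(f"{language}: special cases are {status}")
```

On this encoding the two Chinese cases agree, while the two English cases point in opposite directions, which is the conflict described above.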

(d) States
A More Desirable State Is Up.
A Less Desirable State Is Down.

This is a piece of evidence for the existence of the Event Structure Metaphor in
English. The following specific cases have been observed in the data:

Into Consciousness Is Up/ Into Unconsciousness (or Death) Is Down.

Into A More Active State Is Up/ Into A Less Active State Is Down.
Virtue Is Up/ Depravity Is Down.
Into a State Of Cheerfulness Is Up/ Into A State Of Depression Is Down.
Improvement In Appearances Is Up/ Worsening In Appearances Is Down.
Increase In Brightness Is Up/ Decrease In Brightness Is Down.
Increase In Force Is Up/ Decrease In Force Is Down.
Increase In Thickness Is Up/ Decrease In Thickness Is Down.
Into Existence Is Up/ Out of Existence Is Down.
Into A State Of Operation Is Up/ Out Of A State Of Operation Is Down.
Towards Completeness Is Up/ Towards Finality Is Down.

(99) When did you wake up this morning?

(100) One of the brothers was gunned down outside his home in London.
(101) Now that I'm the mother of two children I'm up at 6 every morning.
(102) Jane was down with a cold last week, so she didn't come to work.

(103) She is an upstanding citizen.

(104) That was a low-down thing to do.
(105) You need a holiday to cheer you up.
(106) The young man seemed to be loaded down with the worries of fatherhood.
(107) Are we going to dress up for the wedding, or is it informal?
(108) The model dressed down so that nobody could recognize her on the streets.
(109) The new paint will brighten up the house.
(110) Dim the stage lights down during scene 3.
(111) The wind is up.
(112) I hope the wind keeps down, or the sea will be too rough for sailing.
(113) The mist has thickened up since this morning. I don't think it's
safe to go out now.
(114) The paint has been thinned down too much.
(115) New towns are sprouting up all over the country as part of the
government's plan to find homes for the increasing population.
(116) She waited until the laughter had died down.
(117) I hated that old car, I had to crank it up every morning to get it started.
(118) Make sure you shut down the computer before leaving the room.
(119) I'm sorry, the hotel is booked up.
(120) The shop will be closing down for good on Saturday, so everything
is half price.


From the above analysis of both the Chinese and the English data, it can be seen
that remarkable similarities mark the metaphorical extensions detected for
SHANG/XIA and UP/DOWN. The similarities are mainly reflected in the following three ways:
1. Both SHANG/XIA and UP/DOWN are used to structure the same four target domains, namely QUANTITY, SOCIAL HIERARCHY, TIME and STATES.
2. Within these four target domains, what is oriented xia is also oriented
down, and what is oriented shang is also oriented up (except that An Earlier
Time Is Shang, but A Later Time Is Up).
3. The metaphorical extensions detected are found to be arranged in largely
comparable order of frequency, with A More Desirable State Is Shang/Up // A
Less Desirable State Is Xia/Down being the most frequently occurring
metaphorical extension for all the four concepts.
Some discrepancies between Chinese and English have also been observed:
1. It has been found that UP and DOWN are predominantly used in their
dynamic model while SHANG and XIA are well balanced between their non-dynamic side and their dynamic side, with a bias towards the former. To put
this in another way, while SHANG and XIA are used to denote both the location of a stationary trajector and the orientation of a moving trajector, UP and
DOWN are predominantly used to capture the latter rather than the former.
2. SHANG and XIA carry a special prototypical model called contact
SHANG and contact XIA. With contact SHANG, the trajector rests upon and is
supported by the landmark; with contact XIA, the trajector stays below and is
covered or pressed by the landmark. No such special case has been found with
UP and DOWN.
3. Within the domain of TIME, Chinese has a pair of conceptual metaphors
in agreement with each other, namely An Earlier Time Is Shang/A Later Time Is
Xia; English, by contrast, has a pair of metaphors in the reverse direction,
namely A Later Time Is Up/A Later Time Is Down.
It must be emphasized that these discrepancies do not diminish the overall
remarkable similarities between the metaphorical extensions found for
SHANG and XIA and for UP and DOWN. This study suggests the following:
1. Since QUANTITY, SOCIAL HIERARCHY, TIME and STATES are abstract domains
important to our thinking, the fact that in both Chinese and
English they are organized by the metaphorical mappings of the image-schematic structures of SHANG/XIA and UP/DOWN illustrates that our
abstract reasoning is at least partly metaphorical.
2. That Chinese and English exhibit remarkable similarities in the metaphorical extensions of SHANG/XIA and UP/DOWN is a piece of evidence that there
may indeed exist a universal spatial metaphorical system as predicted by
Johnson (1992) and Sinha (1995). The Event Structure Metaphor and the
TIME-AS-SPACE metaphor in particular may be strong candidates for universal metaphorical mappings.

Akhundov, M. 1986. Conceptions of Space and Time. Cambridge, MA: MIT Press.
Allan, K. 1995. The anthropocentricity of the English word(s) back. Cognitive Linguistics 6:
Alverson, H. 1994. Semantics and Experience: Universal Metaphors of Time in English,
Mandarin, Hindi, and Sesotho. Baltimore: Johns Hopkins University Press.
Bickel, B. 1997. Spatial operations in deixis, cognition, and culture: Where to orient oneself
in Belhare. In Language and Conceptualization, Nuyts & Pederson (eds), 46–83.
Cambridge: Cambridge University Press.
Gärdenfors, P. 1996. Mental representation, conceptual spaces and metaphors. Synthese
106: 21–47.
Geiger, R. and Rudzka-Ostyn, B. (eds) 1993. Conceptualizations and Mental Processing in
Language. Berlin: Mouton de Gruyter.
Goatly, A. 1997. The Language of Metaphors. London: Routledge.
Johnson, M. 1987. The Body in the Mind. Chicago: University of Chicago Press.
Johnson, M. 1992. Philosophical implications of cognitive semantics. Cognitive Linguistics
3 (4): 345–366.
Lakoff, G. and Johnson, M. 1980. Metaphors We Live By. Chicago: University of Chicago
Press.
Lakoff, G. 1987. Women, Fire, and Dangerous Things. Chicago: University of Chicago Press.
Lakoff, G. 1993. The contemporary theory of metaphor. In Metaphor and Thought,
Ortony (ed.), 202–251. Cambridge: Cambridge University Press.
Langacker, R. 1987. Foundations of Cognitive Grammar. Stanford: Stanford University Press.
Leech, G. 1983. Principles of Pragmatics. London: Longman.
Lindner, S. 1981. A Lexico-semantic Analysis of English Verb Particle Constructions with OUT
and UP. Ph.D. dissertation. University of California, San Diego.
Morgan, P. 1997. Figuring out figure out: Metaphor and the semantics of English verb-particle construction. Cognitive Linguistics 8 (4): 327–359.
Schönefeld, D. 1999. Corpus linguistics and cognitivism. International Journal of Corpus
Linguistics 4: 137–171.
Sinha, C. 1995. Introduction. Cognitive Linguistics 6: 7–9.
Smith, M. 1993. Cases as conceptual categories: Evidence from German. In
Conceptualizations and Mental Processing in Language, R. Geiger and B. Rudzka-Ostyn
(eds), 530–545. Berlin: Mouton de Gruyter.
Stibbe, A. 1996. Metaphor and Alternative Conceptions of Illness. Ph.D. dissertation.
Lancaster University.
Svorou, S. 1994. The Grammar of Space. Amsterdam: John Benjamins.
Yu, N. 1996. The Contemporary Theory of Metaphor: A Perspective from Chinese. Ph.D. dissertation. University of Arizona.

From figures of speech to lexical units

An English-French contrastive approach
to hypallage and metonymy
Michel Paillard

Introduction: methodological issues

The aim of this paper is to examine specific instances of the figures of speech
known as hypallage and metonymy, whether from textual or lexicographic
material, and to show that in some areas the availability of these syntactico-semantic patterns differs substantially in English and in French. The reason for
examining both corpus and dictionary data is that while contrast observed in
translated passages of fiction may be partly attributed to stylistic choice and
subjectivity, lexicalized examples of the same contrastive patterns provide recognized evidence.
Using corpus data to investigate these syntactico-semantic patterns is far
from straightforward as they cannot be searched for on the basis of form. They
are special cases of Adjective+Noun or Noun+Noun phrases which even elaborate tagging procedures could not adequately sort out. Besides, systematic
scrutiny of a 100,000-word bilingual press corpus yielded only a dozen occurrences of those types, several of which are discussed below. This seems to indicate that they are less frequent in journalistic prose than in either literary style
or everyday vocabulary.¹ The systematic study of a much larger corpus might
modify the conclusions reached in this paper.

Hypallage
Figure of speech and syntactic shift
Hypallage is defined as an interchange in syntactic relationship between two
terms (Webster's Collegiate Dictionary, e.g. You are lost to joy for Joy is lost to
you) or as the transposition of the natural relations of two elements in a
proposition (Concise Oxford Dictionary, e.g. Melissa shook her doubtful curls).
More specifically, hypallage characterizes phrases in which the (apparent) syntactic scope of a qualifying term does not coincide with its (real) semantic
scope. As the difference between the two examples above suggests, this can
apply to several types of syntactic structure. For the sake of clarity we shall distinguish between three of them.
Type 1 involves a syntactic shift (of one element) or inversion (of two elements):
(1) Ce marchand accoudé sur son comptoir avide (Victor Hugo)
    for: Ce marchand avide accoudé sur son comptoir
    The greedy shopkeeper, resting his elbows on his counter²
(2) besmirched / With rainy marching in the painful field (Henry V)
    for: With painful marching in the rainy field
    salies par des marches pluvieuses à travers la plaine ardue [Translation by François-Victor Hugo]

These examples (quoted in Suhamy 1981: 54) clearly belong to dated poetic
style. The translation of (1) restores the real syntactic scope of the adjective. In
the translation of (2), the marked effect of hypallage is fully maintained in
marches pluvieuses, but only partly so in the second phrase as ardu(e) commonly
collocates if not with plaine (which would be felt as a contradiction) at least with
other topographic terms such as chemin (chemin ardu : steep/difficult path).
The trope still routinely appears in modern literature, creating an impressionistic effect, as highlighted by Fromilhague (1995:43) with regard to example (3):
D'où la présence marquée de l'hypallage dans les textes qui visent à restituer
des associations impressionnistes étrangères à la logique: l'écriture artiste et
l'esthétique fin de siècle en fournissent des exemples nombreux.³
(3) les fleurs de paulownias, d'un mauve pluvieux du ciel parisien (Colette)
    the paulownia flowers, a rainy mauve against the sky of Paris
(4) des cocktails d'une écœurante et inutile complication (Modiano)
    sickening and uselessly sophisticated cocktails

Type 2, illustrated in examples (5–6), adds a change in syntactic category, typically Adjective for Adverb or vice versa:
(5) I picked a careful way through the lobby to the privies on the starboard quarter. (W. Golding, Close Quarters)
    Je me frayai avec précaution un chemin jusqu'aux toilettes côté tribord
(6) Still and quiet and almost looking flimsily aged at ten years old (J. Gardam, God on the Rocks)
    Sage et tranquille, l'air fragile et presqu'âgé à dix ans⁴

Such data can be related to the well-known syntactic versatility of English
adverbs in -ly (from manner to sentence adverbs). As argued by Larreya &
Méry (1992), hypallage is part of the more general flexibility of movement
(including raising) which characterizes English syntax. In the fields of qualification and modality, for instance, both languages have raisings such as She is
easy to please (Elle est facile à contenter) instead of It is easy to please her. French,
however, does not have She is likely/certain to win or the epistemic interpretation of You're sure to like the film.
Type 3 further involves ellipsis of one term, as in the following passages,
where the semantically implied but syntactically omitted element has to be
reintroduced in the French version:
(7) They thought in the winter it must be damn cold. They thought of the ten drizzling miles to Handleyford. (V.S. Pritchett, Many are
    Ils pensaient aux dix miles à parcourir sous la pluie pour gagner
(8) Did you imagine that in the vicinity of Noah's palace (oh, he wasn't poor, that Noah) there dwelt a convenient example of every species on earth? (J. Barnes, A History of the World in 10 1/2 Chapters)
    On aurait pu penser qu'à proximité du palais de Noé (il n'était pas à plaindre, allez, ce Noé), par un heureux hasard, résidait un exemplaire de chaque espèce vivant sur Terre.
(9) Ancien président de NBC, M. Joseph Angotti suggère: Pour l'essentiel, cela ne doit rien à un choix rédactionnel. Il s'agit d'une logique économique. C'est ce qu'il y a de plus facile, de plus paresseux et de moins cher à couvrir. (Le Monde diplomatique, August 1998)
    A former vice-president of NBC, Joseph Angotti suggests that most of that crime coverage is not editorially driven, it's economically driven. It's the easiest, cheapest, laziest news to cover. (Le Monde diplomatique, August 1998, English Edition)

Example (9), from the parallel corpus mentioned above, is more complex for
the following reasons:

– The French text includes a quotation which was originally in English.
– In the English version, the first two adjectives pave the way for a tolerable if
  semantically distorted third superlative + infinitive construction. Clearly,
  while easy and cheap apply to the coverage, lazy can only qualify the journalist.
– The French adjective paresseux is in second position and therefore does not
  directly take the infinitive as its complement. The sequence plus paresseux
  à couvrir would definitely be unacceptable.

Hypallage in lexicalized phrases

The pattern of ellipsis and the contrast between English and French in this
respect are clearest in cases of lexicalized hypallage such as the following:
(10) Foreign Office
     Foreign Affairs Office
     Ministère des Affaires étrangères
(11) art theft
     theft of works of art
     vol d'œuvres d'art
(12) lucid interval
     an interval during which a mentally ill person is lucid
     moment de lucidité
(13) restricted area
     area where speed is restricted / area where access is restricted
     zone à vitesse limitée / zone à accès réglementé
(14) white wedding
     a wedding at which the bride wears a white dress
     mariage en blanc⁵
(15) extreme sports
     Designating sports performed in a hazardous environment, involving a high physical risk. (The Oxford Dictionary of New Words,
     sports de l'extrême
(16) compassionate fare
     A significantly reduced airline fare made available to people who are travelling to attend a funeral or visit someone very ill. (A Dictionary of Today's Words)⁶
     ? tarif accordé pour motif familial grave

Similarly elliptical compounds such as happy hour, topless bar, sick bag doggedly resist translation into French.⁷ They require some form of transposition, and
in some cases are simply borrowed (e.g. standing ovation). Variants of example
(15) seem to be creeping in (ski extrême, skieurs de l'extrême) and one recently
found its way into a report by Le Monde on the rescue of two British mountaineers, following due contextual preparation:
(17) C'est une véritable opération commando qui a été menée, dimanche 31
janvier, dans des conditions extrêmement périlleuses, deux alpinistes britanniques bloqués depuis quatre jours à près de 4000 mètres d'altitude.
(...) Il fallait faire vite et être précis. C'était une course contre le temps et
la montre, raconte Pascal Brun (...) Ce sauvetage extrême exigeait un
appareil puissant et disposant d'une moindre prise au vent qui continuait
de souffler en rafales. (Le Monde, 2 February 1999, p. 10)

A few phrases of this type have come to be common to the two languages:
(18) fast lane
     voie rapide
(19) happy days
     jours heureux
(20) masked ball
     bal masqué
(21) musical chairs
     chaises musicales

But fully lexicalized cases of hypallage are few and far between in French:
(22) (tomber en) panne sèche
     panne lors de laquelle le réservoir d'essence est sec
     also: (tomber en) panne d'essence
     run out of petrol/gas
(23) de guerre lasse (il finit par accepter)
     las de résister, il finit par accepter
     he grew tired of resisting and finally accepted [Robert & Collins Dictionary]

In their outstanding studies of English word formation, both Adams (1973: 87)
and Tournier (1985: 212) emphasize the role of ellipsis in such patterns. In the
words of Adams (1973: 87):
Some of these could be seen as three-word structures with an ellipted second
element; confidential secretary might be explained by a phrase like confidential
work secretary.

Tournier's treatment of l'hypallage lexicalisée as a shift of meaning (alongside the countable/uncountable or transitivity parameters of polysemy) can be
questioned insofar as the admittedly problematic syntax of such phrases leaves
the meanings of the components unchanged. They are best dealt with, as
Adams chooses to do, as a special type of Adjective + Noun compound.

Both authors group the type illustrated above with arguably different
structures: plastic surgeon quoted by Tournier or criminal lawyer quoted by
Adams do raise problems of analysis and translation but they should in fact be
treated as derivationally related to plastic surgery and criminal law respectively
(cf. Coates 1971).

Metonymy
Metonymy in exocentric and endocentric compounds
The contrastive picture is quite different, and in some respects reversed, where
metonymy is concerned. Metonymies in which the vehicle, in the terms of
Leech (1969: 151), names a distinctive part or concrete characteristic of the
entity referred to, or tenor, are not uncommon in French:
(24) des gros bras
     rednecks, musclemen
(25) le rouge-gorge
     the (redbreast) robin
(26) le petit écran
     the small screen
(27) une ceinture noire
     a black belt

These are exocentric compounds (rednecks are not necks), also called bahuvrihi
compounds. Both languages have cols blancs (white-collar workers), even
though the French phrase is labelled as traduction de l'anglais in Le Petit
Robert, but only American English has wetbacks, ouvrier agricole mexicain
entré illégalement aux Etats-Unis (Robert & Collins English-French Dictionary).
The pattern is indeed even more widespread in English in endocentric compounds (a bag lady is a lady) such as the following. The distinctive feature selected tends to be very specific and the semantic shortcut from vehicle to tenor can
be spectacular. Translation into French is often problematic:
(28) bag lady
     clocharde [Robert & Collins Dictionary, which conversely gives tramp as a translation for clochard(e)]
(29) lollipop lady / man
     (Brit) contractuel(le) qui fait traverser la rue aux enfants [Robert & Collins Dictionary]
(30) red-brick university
     (Brit: often pej) université de fondation récente [Robert & Collins Dictionary]
(31) Ivy League
     (US) les huit grandes universités privées du nord-est [Robert & Collins Dictionary]
(32) latchkey child (a child who is alone at home after school until a parent returns from work, Concise Oxford Dictionary)
     ?? enfant à la clé [Robert & Collins Dictionary]
(33) jet set
     le ou la jet set

Nominalization and discreteness

English on the other hand seems to resist some types of abstract-for-concrete
metonymy. Although there are many examples of long-standing, fully lexicalized state or action nouns in either English or French (an administration, an
introduction, a building, a facility, etc.) English less readily allows a nominalized
predicate to refer to a specific occurrence, or to the agent, place or instrument
of the process. The following examples are from Chuquet & Paillard (1987),
Astington (1983), Guillemin-Flescher (1981) and the parallel corpus mentioned above:
(34) société de consommation
     consumer society
(35) à la réception
     at the reception desk⁸
(36) l'allongement de la scolarité
     the raising of the school-leaving age
(37) Une signalisation totalement différente sera alors indispensable, pour permettre en particulier le freinage automatique des rames.
     An entirely different signalling system will be essential, particularly to allow automatic braking of trains.
(38) Mais les régions dominées par la guérilla sont aussi les zones où s'est développée la culture de la coca. (Le Monde diplomatique, July 1998)
     However, the regions dominated by the guerrilla movements also happen to be the areas in which the growing of coca is particularly widespread. (Le Monde diplomatique, English Edition)
(39) La confusion entre information et divertissement, désormais réunis par le lien sacré de l'audience, a parfois des effets politiques et sociaux dévastateurs. (Le Monde diplomatique, August 1998)
     The blurring of the dividing line between information and entertainment, both of which are now governed by the iron law of audience ratings, can have dangerous political and social effects. (Le Monde diplomatique, English Edition)

Such differences are relevant to at least two areas of linguistic analysis:

– the crucial question of nominalization and the related issues of concretization and discreteness: Langacker (1987) offers different cognitive representations of the verb and of the noun in pairs such as explode/explosion. Defrancq &
  Willems (1996) examine the polysemy of nominalizations on a scale of concreteness. For example, French construction can refer either to the process of
  building or to the resulting edifice whereas édification only refers to a process.
– the diverging strategies of English and French in sentence orientation and
  semantic compatibility in argument structure. Detailed contrastive work carried out within the theoretical framework of Culioli's Théorie des opérations
  énonciatives (Guillemin-Flescher 1981, Celle 1997) shows that while French
  routinely associates heterogeneous predicates and arguments in terms of their
  degree of animacy or abstractness, English requires a higher degree of homogeneity. Guillemin-Flescher (1981) uses a large corpus of works of fiction and
  their published translations to show, for instance, that the English versions will
  tend to avoid associating nouns referring to inanimate entities with verbs normally taking animate subjects:
(40) sa conscience le taraudait (M. Tournier, Vendredi)
     he suffered pangs of conscience (Translation by N. Denny)

This explains the frequent need in English translations of this type to fall back
on concrete nouns as the syntactic heads of arguments. Straightforward examples are provided by lexicalized phrases such as (34–39). The structure is more
complex in textual material such as (41–44), where various grammatical factors are involved: quantification (41), collocation and metaphor (43), coordination of arguments (44). Rearrangements are then required.⁹
(41) Ses plus extrêmes audaces, par certains côtés, sont des naïvetés.
     His most audacious tricks, in some respects, are mere lack of experience.
(42) Il est vrai que, mieux que tous les sondages, les banques connaissent l'intimité économique des Français.
     It is true that, better than any opinion poll, the banks know the intimate details of French people's economic life.
(43) Des héritiers, dont les ancêtres ont immigré depuis déjà deux siècles, persistent à y cultiver la citoyenneté britannique.
     The descendants of the first migrants who landed some two centuries ago still cultivate the art of being true British citizens.
(44) Une route buissonnière un peu déglinguée rejoint Ermelo et la fraîcheur de ses cascades.
     A rough cross-country road leads to Ermelo and its cooling waterfalls.

Failure by non-native speakers to recognize and respect such differences can
lead to grammatically well-formed but non-idiomatic expressions. Celle
(1997: 148) notes that literal translations of the phrases highlighted in (43) and
(44) would not be acceptable in English:

Their descendants still cultivate British citizenship.

A rough country road leads to Ermelo and the coolness of its waterfalls.

Conclusion
Hypallage and metonymy are found on a cline from complete lexicalization to
literary creativity. On the basis of the data examined in this paper, which
should be supplemented by quantitative corpus-based analysis, the limits
imposed on the use of these patterns in English and in French appear to be diametrically opposed: in the type of metonymy just examined, French characteristically tolerates a greater degree of semantic heterogeneity between argument
and predicate. Through hypallage, English characteristically allows greater
syntactic flexibility in the form of movement and ellipsis.

1. The sample examined is part of a 500,000-word journalistic corpus now being created
at the University of Poitiers for concordance processing. It consists of articles from Le Monde
diplomatique published over a three-year period (1998–2000) and their English translations
made available to subscribers in electronic form. It will be matched by a multilingual fiction
corpus as part of a joint research project named PLECI (Poitiers-Louvain Echange de
Corpus Informatisés).

2. The French versions of examples (1) to (7) are my translations unless otherwise stated.

3. Hence the marked presence of hypallage in texts aiming to conjure up impressionistic
associations alien to logic: over-elaborate writing and fin de siècle aesthetic standards offer
many examples of it. (My translation.)

4. Examples (6) and (8) are from Khalifa, J.C., Fryd, M. & Paillard, M. 1998. La version
anglaise aux concours. Paris: Colin.

5. Not to be confused with mariage blanc, which is a metaphor (unconsummated marriage).

6. Cf. Lerner and Belkin (1993).

7. I am grateful to the colleagues who offered suggestions on this point during and after
the Symposium in Louvain, particularly François Maniez from the University of Lyon 2.

8. An illustration of this problem is to be found in Van Roey et al. (1988: 583): Veuillez
passer à la réception can be translated by either Please go to reception or Please go to the reception desk.

9. Examples (41) and (42) are borrowed from Astington (1983); (43) and (44) from
Celle (1997).

Adams, V. 1973. An Introduction to Modern English Word-Formation. London: Longman.
Astington, E. 1983. Equivalences. Translation Difficulties and Devices, French-English,
English-French. Cambridge University Press.
Celle, A. 1997. Quand l'objet est un nom de procès. In La transitivité, M.L Groussier
(ed.), Cahiers Charles V 23: 139–172. Université de Paris 7.
Chuquet, H. et Paillard, M. 1987. Approche linguistique des problèmes de traduction, anglais
↔ français. Paris et Gap: Ophrys.
Coates, J. 1971. Denominal Adjectives: A Study in Syntactic Relationships between
Modifier and Head. Lingua 27: 160–169.
Culioli, A. 1990. Pour une linguistique de l'énonciation. Opérations et représentations. Paris et
Gap: Ophrys.
Defrancq, B. and Willems, D. 1996. De l'abstrait au concret. Une réflexion sur la polysémie
des noms déverbaux. In Les noms abstraits. Histoire et théories, N. Flaux, M. Glatigny
and D. Samain (eds), 221–230. Lille: Presses Universitaires du Septentrion.
Dupriez, B. 1984. Gradus. Les procédés littéraires (Dictionnaire). Paris: Union Générale
d'Editions [Collection 10/18].
Fromilhague, C. 1995. Les figures de style. Paris: Nathan [Collection 128].
Guillemin-Flescher, J. 1981. Syntaxe comparée du français et de l'anglais. Paris et Gap:
Ophrys.
Kleiber, G. 1994. Nominales. Essais de sémantique référentielle. Paris: Colin.
Lakoff, G. and Johnson, M. 1980. Metaphors We Live By. Chicago: The University of Chicago
Press.
Langacker, R. 1987. Nouns and Verbs, in Communications 53 (1991): 103–153. Paris:
Editions du Seuil.
Larreya, P. et Méry, R. 1992. On the Syntactic Productiveness of Hypallage. Travaux du
CIEREC 76: 143–160. Université de Saint-Etienne.
Leech, G. 1969. A Linguistic Guide to English Poetry. London: Longman.
Rainer, F. 1996. La polysémie des noms abstraits. In Les noms abstraits. Histoire et théories,
N. Flaux, M. Glatigny and D. Samain (eds), 117–128. Lille: Presses Universitaires du
Septentrion.
Suhamy, H. 1981. Les figures de style. Paris: Presses Universitaires de France. [Collection
Que sais-je?]
Tournier, J. 1985. Introduction descriptive à la lexicogénétique de l'anglais contemporain.
Paris: Champion-Slatkine.
Tournier, J. 1988. Précis de lexicologie anglaise. Paris: Nathan.
Ullmann, S. 1967. Semantics. An Introduction to the Science of Meaning. Oxford: Blackwell.
Van Hoof, H. 1989. Traduire l'anglais. Théorie et pratique. Paris et Louvain: Duculot.

The Concise Oxford Dictionary, 1995.
Le Nouveau Petit Robert, Dictionnaire de la langue française, 1993.
Robert & Collins, French-English, English-French Dictionary, 1995.
Webster's New Collegiate Dictionary, 1979.
The Oxford Dictionary of New Words, 1997.
Lerner, S. and Belkin, G. S. 1993. Trash Cash, Fizzbos, and Flatliners. A Dictionary of Today's
Words. Boston and New York: Houghton Mifflin.
Van Roey, J., Granger, S. and Swallow, H. 1988. Dictionnaire des faux amis anglais-franais.
Paris & Gembloux: Duculot.


Corpus-based Bilingual Lexicography

The role of parallel corpora in translation

and multilingual lexicography
Wolfgang Teubert

The need for translation

Globalisation has led to an increased demand for translation. Twenty years ago,
when people in Europe who had bought a satellite dish were given the option of
choosing between TV programmes broadcast in many languages, it was
believed that this would lead to an increase in learning of foreign languages, not
only English as the global interlingua, but also European languages of some
regional importance such as French and German. But these expectations were
not fulfilled. While, in their professional lives, more and more people are learning to function in a bilingual or multilingual environment, it seems that, apart
from a traditionally small polyglot elite, in their private lives they tend to cling
to the language they grew up with. Suddenly we find not only periodicals but
also daily newspapers being translated. There is a daily German edition of the
Financial Times, and in France, Germany, and Italy the International Herald
Tribune now comes together with an English language edition of a prominent
local newspaper. The EuroNews TV channel is transmitted in several languages,
and other channels will follow suit shortly. Globalisation of the media has
opened up a new market for instant translation.
Alongside the need for instant translation of texts, most of which will soon
be forgotten (the common fate of most media coverage), there is also the necessity to translate agreements, contracts and all other documents that could have
a legal impact (such as product descriptions and user instructions) into other
languages. Whatever their source language (increasingly English, as we are all
aware), these texts have to be localised, translated into the language(s) of the
country whose jurisdiction is involved. As long as legal systems are not globalised, courts will accept documents as evidence only if they exist in the official
language(s) of the country in which the court is situated.
It is one of the central principles of the European Union that only those
texts issued by the European authorities which have been translated into the
official EU languages become legally binding in the member states. Therefore,
all the new countries which have applied for membership, the so-called newly
associated countries (NACs), will have to translate a corpus of (ultimately) 12
million words of EU documents into their languages before they can join. For
the existing EU, the Commission's Translation Service, the largest translation
agency in the world, produces translations of all relevant new documents in the
official EU languages; and here again it is questionable whether these translations will ever be referred to. Kaisa Koskinen tells us that often, all the Finnish
participants will have already read the original non-Finnish document (or even
taken part in drafting it) and the Finnish translation arriving two months later
contains no new information for them. The Finnish authorities have also been
notoriously reluctant to rely (or admit to relying) on the Finnish versions, preferring to use English or occasionally French translations which are perceived
as more reliable or even judicially more valid than the Finnish ones
(Koskinen 2000: 51–52). This certainly does not mean the Finns will give up
their right to have Finnish translations. Rather, it is another indication that our
complex modern environment demands the production and translation of
texts not to be read but to be there in case a need for them arises.
At the European Parliament, we encounter the paradoxical situation that
less than 10% of its budget is spent on parliamentary work proper, while more
than 90% is spent on interpretation and translation. Translation, together with
the necessity to write texts in a foreign language, is the most remarkable challenge linguistics has ever faced. To prepare people to cope with multilingual situations it is not enough to teach them foreign languages; there is also a need to
give them tools (printed and electronic bilingual dictionaries) that actually
serve their purpose. It is time to develop a new generation of dictionaries, dictionaries suitable for assisting translation not only into the translator's native
language but also into a foreign language, dictionaries that give their users the
proper translation equivalent for each semantic unit they have to deal with.

Cognitive linguistics: a model for cross-linguistic lexicology?

Meaning is the core issue of translation. A translator produces a paraphrase of a

The role of parallel corpora in translation and multilingual lexicography

text in another language. Meaning and meaning alone links a paraphrase to the
original text. The more similar text and paraphrase are in their meanings, the
more satisfactory the paraphrase. But many linguists are rather coy on the issue
of translation. They are much more interested in contrasting languages and their
vocabularies from a typological point of view. But how can one contrast vocabularies without using texts and their translations as a tertium comparationis?
The history of Machine Translation (MT) is closely related to the history of
cognitive linguistics. But for mainstream cognitive linguistics, meaning is not a
discourse feature but a feature of the mind, in the form of mental representations of concepts. Concepts are, in this model, universal and part of the language of thought or mentalese, and they can be mapped onto the speaker's
native-language vocabulary, even though we cannot assume a one-to-one relationship between universal concepts and the words of a given natural language.
Is this an approach that could be adopted by cross-linguistic lexicology? Is a
word in one language the equivalent of a word in another language because
they can both be mapped, somehow, to the same concept?
Words are language signs, are symbols; they can be studied from two
points of view: content (or meaning) and form. Content or meaning cannot be
separated from form. What then is the form of the universal concepts cognitive
linguists talk about? How can we describe their meaning without using natural
language? If there really are universal concepts, if we are told what they look like
and what exactly they mean, cross-linguistic lexicography could and should
use them. Universal concepts could help us define what translation equivalence
is. As long as the conceptual ontologies used in MT describe their concepts in
pretty much the same way as dictionaries do, using English or any other language for their definitions, it is hard to see how cross-linguistic lexicography
can profit from this approach.
The core issue of translation is meaning. For each semantic unit of the
source text, there has to be an equivalent in the target text. Therefore cross-linguistic lexicography in quest of meaning must pay close attention to the practice
of translators. It is they who invent the translation equivalents for lexical expressions. For these translation equivalents are not discovered, they are invented.
Translators deal in texts, and they undertake to paraphrase a text in a different
language so that the paraphrase will mean almost the same as the original text.
In order to carry out their task, they have to understand the text. This means
that they interpret the text. Text interpretation, however, is an action, not a
process. Only human beings can do it. All computers can do is carry out
processes. Therefore computers cannot translate in the sense in which translation is
generally understood. This is why the classical approach to MT will necessarily
fail whenever the goal is to translate general language texts without any need for
post-editing. Using concepts does not help as long as we have to treat concepts
just as we have to treat natural language words. This is really nothing new. In his
book The Possibility of Language (1996) Alan K. Melby, one of the founders of
the discipline of Machine Translation, has given us a thorough account of why
machine translation based on conceptual ontologies cannot work. But the MT
community has paid little attention. So at present cross-linguistic lexicology can
learn but little from Machine Translation.

The quest for the perfect language

Ever since the Tower of Babel, the Ursprache has been replaced by a multitude
of mutually incomprehensible vernaculars, and the complaints about the corruption of our (natural) languages have not abated. In the language of Adam
and Eve, we are told, words still had their proper meaning; they represented the
Platonic ideas, concepts like apple and snake in metaphysical purity, and not
yet subject to the distortions that came to pass once the Golden Age was over.
Since then there have been a multitude of attempts to recreate this original perfect language as a symbolic algorithm guaranteeing instant communication
and perfect understanding. It is a belief cherished in their hearts by many
members of the AI and MT communities. In a newspaper article written by
Chris Partridge which appeared in the London Times in July 1997, reiterating
how close science is to complete success in Machine Translation, we find the
revealing heading "Language is the last barrier to global communication". The
message of this article is that once we replace our deficient, decayed and corrupt native languages by a linguistic system free of the contingencies, idiosyncrasies and anomalies so typical of natural language, the problem of global
communication will be solved. It is a way of looking at meaning that we also
find in Hilary Putnam's article "The Meaning of Meaning" (1975). In it he suggests that the closest we can hope to get to distinguishing the meaning of the
word elm from the meaning of the word beech is to ask not the language community but the expert. But even what the experts tell us is no more than an
approximation. For Putnam, the true meaning of the word elm is what elms are
in reality and what sets them apart from other trees, for instance beeches. In his
view, the category elm would exist even if there was no one to be aware of it.
Among cognitive linguists, we find a similar desire to posit an ideal language, common to all human beings. Since Noam Chomsky is primarily interested in finding and formulating the universal and innate laws which make our
language work, he is more concerned with syntax (regarding language as some
kind of formal algorithm) than with semantics, which he regards as a secondary
and contingent phenomenon. But scholars originally close to him, such as the
language philosopher Jerry Fodor, include semantics in their search for language universals. In 1975, Fodor published the seminal Language of Thought in
which he discusses the universal nature of concepts as cognitive phenomena.
Since then, cognitive linguists have been busy exploring the world of concepts,
and there seem to be few who would agree on their nature, their number, on
what they mean and how they differ from words of natural languages. An interesting selection of competing ideas can be found in the anthology Language and
Thought. Indeed it seems that each author represented in this collection has his
or her unique definition of concept (Carruthers/Boucher 1998). In the meantime, Jerry Fodor's Language of Thought has been renamed Mentalese, and
Steven Pinker apparently thinks that this is the language people all over the
world think in, regardless of which language they speak (Pinker 1996).
The quest for a perfect language that today unites many cognitive linguists,
philosophers of the mind and experts in the field of MT has a long tradition.
Umberto Eco, in his La ricerca della lingua perfetta nella cultura europea (1993),
has described the endless attempts to either reconstruct the Ursprache or create
a perfect language in which every correct sentence is a true sentence, in which
every expression has only one meaning and in which every word is forever
linked with the metaphysical reality it designates. All of these attempts were
doomed to failure. But the fascination emanating from this idea seems to be
inexhaustible, to this very day.

The interlingua approach to multilingual lexicography

In the remaining sections of this article, I want to show that
1. using a conceptual ontology as an interlingua does not help us with translation;
2. it is not words that are translated but translation units (units of meaning)
in the form of compounds, multi-word units, collocations or set phrases;
3. parallel corpora are repositories of translation units and their equivalents
in the target language, and that these translation units and their equivalents can be processed and re-used in subsequent translations.
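
The third claim, that stored translation units can be re-used, may be illustrated by a minimal Python sketch. It assumes a tiny, hypothetical translation memory: the names TRANSLATION_UNITS and segment are invented for this illustration and do not belong to any existing tool, and the entries are collocations cited later in this chapter. Instead of translating word by word, the sketch looks up the longest stored unit at each position of the text.

```python
from typing import Dict, List, Tuple

# Hypothetical mini translation memory: French translation units mapped to
# German equivalents (collocations cited in this chapter, for illustration).
TRANSLATION_UNITS: Dict[Tuple[str, ...], str] = {
    ("travaux", "des", "champs"): "Ackerbau",
    ("travaux", "de", "la", "guerre"): "Kriegshandwerk",
    ("travail",): "Arbeit",
}


def segment(tokens: List[str],
            units: Dict[Tuple[str, ...], str] = TRANSLATION_UNITS
            ) -> List[Tuple[str, str]]:
    """Greedy longest-match segmentation into stored translation units.

    Returns (source segment, equivalent) pairs. Tokens that are not part
    of any stored unit are returned with an empty equivalent.
    """
    longest = max(len(unit) for unit in units)
    result: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in units:
                result.append((" ".join(candidate), units[candidate]))
                i += n
                break
        else:
            # No stored unit starts here: pass the token through untranslated.
            result.append((tokens[i], ""))
            i += 1
    return result
```

Calling segment("les travaux des champs".split()) treats travaux des champs as one unit rather than three words. A real translation memory would of course have to deal with inflection, discontinuous units and competing equivalents, none of which this sketch attempts.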


I will look at the words work, travail and Arbeit. For the interlingua model, I
will use the entry for work in the current Internet version of the Princeton
WordNet (, a set of seven English translations of
Plato's Republic, and the French and German translations of the same book. I
intend to show that conceptual ontologies representing general language cannot be language-neutral, and that while the concepts they feature may correspond to the word senses of one language, they do not match word senses in
another language. I will then show that conceptual ontologies, even if they were
language-neutral, would still not facilitate translation since the concepts they
feature correspond in principle to single words, whereas texts are translated by
translation units often larger than a single word. Finally I intend to demonstrate how the translational knowledge contained in parallel corpora can be
used to increase the productivity and quality of human translation.
Table 1 presents the WordNet entry for the noun work. It lists seven word
senses, called synsets (sets of synonyms). Synsets may consist of only one word
(if there is no other word having a word sense synonymous with it) (synsets 1
and 7), or of several words, if all those words have the word sense in common
(synsets 2 to 6). Each entry gives a definition in brackets, but it is not clear if the
language used in the definitions is thought to be a controlled language in the
sense that the definitions are unambiguous, or if it is thought to be plain
English, with all its fuzziness and polysemy. In addition, we find general language sentences illustrating the particular word sense. The idea is that within
these examples, each element of the relevant synset can be used without a
change of meaning.

Table 1. WordNet entries for work

1. work (activity directed toward making or doing something; she checked several points needing further work)
2. work, piece of work (something produced or accomplished through the effort or activity or agency of a person or thing: it is not regarded as one of his more memorable works; the symphony was hailed as an ingenious work; he was indebted to the pioneering work of John Dewey; the work of an active imagination; erosion is the work of wind or water over time)
3. job, employment, work (the occupation for which you are paid; he is looking for a job; a lot of people are out of work)
4. study, work (applying the mind to learning and understanding a subject (especially by reading); mastering a second language requires a lot of work; no schools offer graduate study in interior design)
5. oeuvre, work, body of work (the total output of a writer or artist (or a substantial part of it); he studied the entire Wagnerian oeuvre; Picasso's work can be divided into periods)
6. workplace, work (a place where work is done; he arrived at work early today)
7. work ((physics) a manifestation of energy; the transfer of energy from one physical system to another expressed as the product of a force and the distance through which it moves a body in the direction of that force; work equals force times distance)

It must be borne in mind that WordNet was not set up as a conceptual
ontology. George A. Miller, its designer and original creator, is first of all a psychologist, and he set it up because he was interested, among many other things,
in how people associate ideas. For him, it was of no importance whether these
ideas correspond to language-independent, universal cognitive concepts or
just to words of the English language. When linguists first became interested in
WordNet, they interpreted it as a thesaurus of the English language, and the
relationships accounted for were not relationships between concepts but relationships between different senses of different words. Thus, sense 4 of work and
one of the senses of study are identical, and work and study are therefore synonymous in this sense. WordNet provides other relationships as well, such as
hypernyms, hyponyms and meronyms for the different word senses of a word.
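
The synset organisation just described can be made concrete with a small Python sketch. It is an illustration only: the dictionary simply restates the seven synsets of Table 1, and the names WORK_SYNSETS, synonyms and shared_senses are assumptions made for this example, not part of WordNet or of any WordNet software.

```python
# The seven synsets listed in Table 1 for the noun "work"
# (sense number -> members of the synset).
WORK_SYNSETS = {
    1: ["work"],
    2: ["work", "piece of work"],
    3: ["job", "employment", "work"],
    4: ["study", "work"],
    5: ["oeuvre", "work", "body of work"],
    6: ["workplace", "work"],
    7: ["work"],
}


def synonyms(word, sense, synsets=WORK_SYNSETS):
    """Return the other members of the synset for the given sense."""
    return [member for member in synsets[sense] if member != word]


def shared_senses(word_a, word_b, synsets=WORK_SYNSETS):
    """Return the sense numbers whose synset contains both words."""
    return [sense for sense, members in synsets.items()
            if word_a in members and word_b in members]
```

Here synonyms("work", 4) yields study, and shared_senses("work", "study") yields sense 4 only, which is the relationship described above: the two words are synonymous in this one sense, not in general.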
WordNet differs from a traditional thesaurus in that its basic unit is explicitly not the word, but the word's senses. (Implicitly, this is also true of traditional thesauri, but there the senses are often not properly identified.) The word
sense, however, is an abstraction; it is, strictly speaking, not inherent to language, but part of the lexicographer's interpretation of the meaning of a word.
Otherwise it would be possible to decide on the basis of linguistic evidence how
many senses a given word has. But this is a crucial point on which dictionaries,
even if they are comparable in size, tend to differ. What we would like to know
is whether word senses, in WordNet, are thought to be concepts or not. If concepts are universal mental entities (or composed of universal mental entities)
or if they are categories corresponding to entities of some language-external
reality, we would expect them to be of a different nature from word senses. Yet
in WordNet the question is left open whether the featured word senses are simply lexicographers' hypotheses concerning English vocabulary or if they are
thought to be universal (and therefore conceptually identical with the concepts
of cognitive linguistics) at least to the degree that in principle, for each word
sense, there is a corresponding word for it in other languages, or a lexical gap,
or a distinction not present in the English unit of meaning so that there are two
(or more) equivalents of this one unit.
The possibility of identifying WordNet word senses with the concepts of
cognitive linguistics proved too attractive to be ignored. It became the underlying idea of the major project EuroWordNet, funded by the European
Commission and involving many EU languages (
index.html). For each language, local WordNets have been established, and a
language-independent ontology provides the framework in which to match
word senses.
Thus, WordNet has indeed come to be seen as a universal conceptual ontology, as a model of a universal interlingua, with a clear-cut, finite set of senses or
concepts, onto which the words of any language can be mapped so that, ideally,
for each natural language expression, there would be a language-independent
conceptual representation. WordNet would be the answer to the quest for a perfect language. But does it work?
In this experiment, I use the Internet version of the Princeton WordNet,
and I am interested in how far the synsets for each sense of the noun work, that
is the set of words corresponding to each unit of meaning, can be matched to
seven different English translations of Plato's Republic. These translations constitute a set of paraphrases of the original Greek text. Each translator, we can
assume, strives to render the text as closely to the original as possible, so that
the paraphrases should turn out to be largely synonymous. Therefore, if we
look at all occurrences of the noun work in one translation (our master version), we would expect to find in the other translations either also the word
work or, in view of the word sense in which work is used in a given occurrence,
one of the synonyms of the pertaining synset.
I have worked only with the Princeton WordNet, because it is more detailed
and more elaborate than the localised versions for other languages. This is why
I have had to choose an indirect way of assessing the value of the EuroWordNet
approach for translations. Instead of comparing the original text with its paraphrase in another language, I compare different paraphrases (in the same language) of the original Greek text. If the word senses of WordNet reflect universal concepts, one concept could then be assigned to each occurrence of the
noun work, and, assuming that the translators interpret the Greek text in the
same way, they would all assign the same concept to a given occurrence. If this
is the case, the WordNet approach indeed seems a viable method for computer-assisted translation. If the lexical variation displayed in our set of translations
does not map onto the synsets of WordNet, then either WordNet is deficient
(and what thesaurus isn't?) or we would have to assume that the meaning of the
underlying Greek text cannot be mapped so easily onto the concepts or word
senses featured by WordNet. In this case the translator using the word work
would be using it not in a sense identified by WordNet but in a slightly different one. This would be an indication that the meaning of words is generally more
fuzzy than the tradition of displaying word senses in dictionaries (or concepts
in ontologies) would have us believe. If this is the case, then it is not possible to
assign in a procedural, algorithmic, controllable way the proper word sense to a
given occurrence of a word in a text.
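
The logic of the experiment can be summarised in a short Python sketch. It is only a schematic illustration: the two rows are hypothetical (they use equivalents discussed in this section but are not rows of Table 2), and the names SYNSETS, classify and coverage are invented for the example. For each occurrence of work in the master version, every equivalent found in another translation is checked against the synsets assigned to that occurrence.

```python
# Synsets 1 and 3 for the noun "work", as listed in Table 1.
SYNSETS = {
    1: {"work"},
    3: {"job", "employment", "work"},
}


def classify(equivalent, assigned_senses, master_word="work"):
    """Say how an equivalent relates to the synsets assigned to an occurrence."""
    if equivalent == master_word:
        return "same word"
    if any(equivalent in SYNSETS[sense] for sense in assigned_senses):
        return "synset synonym"
    return "outside synset"


def coverage(rows):
    """Fraction of equivalents that are the master word or a synset synonym."""
    labels = [classify(equivalent, senses)
              for _, senses, equivalents in rows
              for equivalent in equivalents]
    covered = sum(label != "outside synset" for label in labels)
    return covered / len(labels)


# Hypothetical rows: (occurrence, assigned synsets, equivalents elsewhere).
ROWS = [
    ("occurrence A", [3], ["job", "work", "occupation"]),
    ("occurrence B", [1, 3], ["labour", "profession", "work"]),
]
```

If the synsets captured the translators' choices, coverage(ROWS) would be close to 1. For the two hypothetical rows it is 0.5, because occupation, labour and profession fall outside the assigned synsets.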
In this experiment I use seven different translations, the first of which is available electronically and therefore used as master version.1 Table 2 displays the evidence. We have, in Translation 1, i.e. in the translation chosen as the master version, 24 occurrences of the noun work. The first column identifies the citation in
standardized form. The second column gives the word we find in the original
Greek text. The third to eighth columns give the word we find in the six translations with which the master version is compared. The ninth column gives the
synset that I would assign to work in the given citation. Words in bold face are the
synonyms found in the relevant synset. Empty spaces indicate that in these cases
there was no corresponding word (usually as a result of the fact that the original
text was not translated word by word but in larger units of meaning).
Table 2. Work and its equivalents in the Plato translations
Columns: Origin | Original | Transl. 2 | Transl. 3 | Transl. 4 | Transl. 5 | Transl. 6 | Transl. 7 | Synset
(Only some cell entries are preserved, and their column positions cannot be recovered; each preserved entry is listed after the citation it follows.)

332e | érgon
352e | érgon
352e | érgon
353a | érgon
353a | érgon
353b | érgon | work vb
353c | ergázomai | work vb
353c | to érgon
353d | érgon
353e | ta érga | functions; function
369e | érgon | product of his labour; product of his work; work
370b | érgon | job [= task]; work; (Transl. 3) result of his labour
370c | párergon | do vb
371c | demiourgía | work vb; occupation; 3
372a | [e]rgazomai | work vb; work vb; work vb; work vb; work vb; work vb
374a | ergazomai | job or; practice vb; exercise vb; practise vb; work vb
375a | phulaké | quality of guarding; quality of keeping
380a | érga
416c | phulaké | guardians; guardians
421c | érgon
442b | prátto
501b | apergázomai | work vb; out vb; [in] doing; 1, 3; 1, 3; professions; 1, 3
535d | misopono | intellectual trouble
553c | ergázomai

The table shows that in the first nine cases we find the Greek word érgon, and
hence the English word work in our master version (Translation 1, see above),
being used in the sense of function or end. This is not a word sense featured in
WordNet for work, and actually work, even though it is the standard translation,
does not really fit in well, as illustrated in example 352e in the master translation:
Would you be willing to define the work of a horse or of anything else to be that
which one can do only with it or best with it?

Some of the other word sense assignments can be disputed as well. It is not surprising that the only synonym occurring in the translations listed in WordNet is
job, belonging to synset 3. There are others, such as labour, occupation, profession, practice and business, that, in the contexts of our citations, are certainly
synonymous with work but could not easily be subsumed under synsets 1 or 3.
In the example given in WordNet for word sense 3, he is looking for a job, it is not
possible to substitute profession or business for job (and I am not entirely sure
that looking for a job is always the same as looking for work). In a similar way, this
would also be true of practice and labour as synonyms of work in synset 1.
However, the results of this experiment are far from conclusive. Can
WordNet really be improved so that it reliably indicates synonyms for the word
senses it posits? Does it make sense to expect multiple translations of the same
text to be synonymous? Is the whole experiment not seriously flawed because it
focuses on a single word, which is not necessarily the unit of translation? The
subsequent sections will consider these questions.

Translation practice
In this section I shall set out to demonstrate the practice of translation, using
the example of the French and German versions of Plato's Republic. I shall look
for the word travail and the word Arbeit, the standard equivalents of the
English noun work. As in the preceding section, I shall not compare the original
Greek text with its French and German paraphrases, but compare two paraphrases of the same text. Each paraphrase (or translation) is an interpretation
of the original text. Interpretation presupposes text understanding, something
that happens in people's minds and involves intentionality. However, if
text understanding involves intentionality, in the sense in which John Searle
has defined intentionality (Searle 1992), then it is an action, not a process.
While humans can carry out both actions and processes, computers are good
only for processes, and this is why computers cannot paraphrase texts. Actions
presuppose intentionality, and differ from processes in that the outcome of an
action is not predetermined; it always involves some degree of arbitrariness.
This is also one of the reasons why the conceptual ontology approach does not
work. It takes a human being to assign a word sense to a text word, because such
an assignment requires an understanding of the text. But if it is an action, then
it involves arbitrariness.
Paraphrases are not procedural mappings of an original text; they are the
results of acts of interpretation. They may be as close in meaning to the original
text as possible, but they can never be identical with it. Comparing the French
and German paraphrases of a Greek original doubles the semantic difference
obtaining between the original text and its paraphrase. Therefore the equivalence between the two paraphrases is looser than that between original and
translation. It is this looseness that gives us, in a nutshell, a view of the range of
options, the infinite design space translators have in their work.


As we can see in the comparison of the French and German translations of
Plato's Republic, texts are not translated word for word. Quite often the translation units, the text segments that are translated as a whole, are larger than the
single word; they are phrases of two, three or many more words. The equivalents of these translation units do not have to be phrases of the same or a similar
structure; a collocation can become a clause; a whole clause can be reduced to a
single word; singulars become plurals and vice versa. Table 3 shows the citations for travail/travaux and their German equivalents.2
Table 3. Plato's Republic: travail / travaux and their German equivalents (16 citations)

No. | French | German
01 | son travail de cordonnier | seine Schusterei
02 | tout le travail de fabrication des autres objets | alle andere handwerkliche Arbeit
03 | travail mal fait | schlechte Leistungen
04 | travail de la poterie | Töpferei
05 | en supplément du travail sérieux | nur eine Nebenbeschäftigung
06 | un travail considérable | eine grosse Aufgabe
07 | son travail | schmutzig-kleinliche Arbeit
08 | le travail manuel | niedrige Handwerksarbeit
09 | ses travaux seront moins réussis | er wird schlechtere Arbeit leisten
10 | les travaux des artisans | die Betriebsleistungen
11 | les travaux de la guerre | das Kriegshandwerk
12 | les travaux d'artisans | die handwerkliche Arbeit
13 | travaux des champs | Ackerbau
14 | travaux du corps que de ceux de l'âme | körperliche und geistige Arbeit
15 | utiles à nos travaux | beim Gewerbe brauchbar
16 | s'associer avec le sexe mâle dans tous ses travaux | an alle Arbeiten wie ein Mann

There are eight instances of travail and eight instances of travaux. For travail,
the German standard equivalent Arbeit is only used in two cases, and in one
more case Arbeit has become part of the compound Handwerksarbeit.
Following similar patterns, travail de cordonnier becomes Schusterei and travail
de la poterie becomes Töpferei. The singular travail mal fait becomes the plural
schlechte Leistungen (is this WordNet sense 1 or 4?). Citation 5 needs more context: (philosophy is) une chose à ne pratiquer qu'en supplément du travail sérieux
(something to be practised only in addition to serious work), which reads in
German: Philosophie sei nur eine Nebenbeschäftigung (is only a side job), in
spite of the difference in structure an appropriate equivalent. In citation 7, the
German translator added schmutzig-kleinlich, which is not called for in the
original text. To a lesser degree, this seems also to be the case for citation 8: travail manuel does not imply menial work but niedere Handwerksarbeit does.
Here we find banausía in the Greek text, implying the praxis of a mere
mechanical art (cf. Teubert 1996).
In three of the eight citations of the plural travaux, the German equivalent is
the singular Arbeit, and only in one case is it the plural Arbeiten. We shall see in
the next section that this phenomenon is not unique to the Plato translations
but also quite common in EU documents. Citations 11 and 13 show stable collocations: travaux de la guerre and travaux des champs, which have their standard
German equivalents: Kriegshandwerk and Ackerbau. Citations 10 and 12 give the
same collocation travaux des artisans, with the German equivalents
Berufsleistungen and handwerkliche Arbeit. The Greek text has the neutral érga
(pl., works) for citation 10, and banausía (see above) for citation 12. The wider
context of citation 12 reveals a pejorative condition not found in the French and
German equivalents: leurs corps mutilés par leurs travaux d'artisans,
Berufsleistungen and handwerkliche Arbeit (cf. citation 2: here handwerkliche
Arbeit is the equivalent of travail de fabrication des autres objets). In citation 15,
the equivalent of travaux is Gewerbe, a collective noun for craftsmen's shops.
The Greek text just has the neutral érga. Generally, travaux seems to mean a set
of continuous and coherent activities, while Arbeiten is usually not used to designate activities, but the results of work. Only in citation 16 do we find the plural
Arbeiten, because here all the different kinds of work men do are implied.
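
The counts given in the discussion of Table 3 can be checked mechanically. The following Python sketch restates the sixteen citation pairs and tallies how the German side renders travail and travaux. It is an illustration under stated assumptions: the names TABLE_3 and tally are invented for the example, the equivalents for citations 4 and 13 (Töpferei, Ackerbau) are taken from the discussion above, and "compound" simply means any other word ending in -arbeit.

```python
# (French citation, German equivalent) pairs of Table 3, citations 1 to 16.
TABLE_3 = [
    ("son travail de cordonnier", "seine Schusterei"),
    ("tout le travail de fabrication des autres objets",
     "alle andere handwerkliche Arbeit"),
    ("travail mal fait", "schlechte Leistungen"),
    ("travail de la poterie", "Töpferei"),
    ("en supplément du travail sérieux", "nur eine Nebenbeschäftigung"),
    ("un travail considérable", "eine grosse Aufgabe"),
    ("son travail", "schmutzig-kleinliche Arbeit"),
    ("le travail manuel", "niedrige Handwerksarbeit"),
    ("ses travaux seront moins réussis", "er wird schlechtere Arbeit leisten"),
    ("les travaux des artisans", "die Betriebsleistungen"),
    ("les travaux de la guerre", "das Kriegshandwerk"),
    ("les travaux d'artisans", "die handwerkliche Arbeit"),
    ("travaux des champs", "Ackerbau"),
    ("travaux du corps que de ceux de l'âme", "körperliche und geistige Arbeit"),
    ("utiles à nos travaux", "beim Gewerbe brauchbar"),
    ("s'associer avec le sexe mâle dans tous ses travaux",
     "an alle Arbeiten wie ein Mann"),
]


def tally(rows, french_form):
    """Count how the German side renders citations containing french_form."""
    counts = {"Arbeit": 0, "Arbeiten": 0, "compound": 0, "other": 0}
    for french, german in rows:
        if french_form not in french.split():
            continue
        tokens = german.split()
        if "Arbeit" in tokens:
            counts["Arbeit"] += 1
        elif "Arbeiten" in tokens:
            counts["Arbeiten"] += 1
        elif any(token.lower().endswith("arbeit") for token in tokens):
            counts["compound"] += 1
        else:
            counts["other"] += 1
    return counts
```

For travail the tally gives Arbeit twice and one compound (Handwerksarbeit); for travaux it gives the singular Arbeit three times and the plural Arbeiten once. In both cases the remaining citations are rendered by quite different expressions.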
Table 4. Plato's Republic: Arbeit(en) and its French equivalents (30 citations)

No. | German | French
01 | ihre Arbeit | leur œuvre propre
02 | sie zu gemeinsamer Arbeit unfähig machen | être incapables d'agir en commun les uns avec les autres
03 | ein Winzermesser, gefertigt für diese Arbeit | la serpette fabriquée à cet effet
04 | jeder leistet seine Arbeit | chacun d'eux destine le produit de son
05 | der richtige Zeitpunkt für eine Arbeit | le bon moment pour un travail
06 | schönere Arbeit leisten | réussir mieux
07 | die eigene Arbeit versäumen | laisser en sommeil son activité
08 | zu keiner anderen Arbeit taugen | être impropres à toute autre fonction
09 | Körperkräfte für Arbeit besitzen | leur force physique les rend aptes aux efforts pénibles
10 | vor allen anderen Tätigkeiten sollte er Ruhe haben und tüchtige Arbeit leisten | et donnant congé aux autres tâches et l'accomplir comme il fallait
11 | alle andere handwerkliche Arbeit | encore tout le travail de fabrication des autres objets
12 | seine eigentliche Arbeit vernachlässigen | en négligeant l'ouvrage qui est devant
13 | ihn an einer aufmerksamen Arbeit hindern | en gênant le type d'attention qu'on doit exercer
14 | der geistigen Arbeit schuld geben | accuser la philosophie
15 | für die Arbeit | pour faire une poterie
16 | schlechtere Arbeit | travaux moins réussis
17 | die Arbeit des Schusters machen | accomplir l'ouvrage d'un cordonnier
18 | an alle Arbeiten wie der Mann herangehen | s'associer avec le sexe mâle dans tous ses travaux
19 | dieselben Arbeiten machen | faire les mêmes choses
20 | die geistige Arbeit | la pensée
21 | die handwerkliche Arbeit | les travaux des artisans
22 | eine hervorragende Arbeit | une très belle exécution
23 | unermüdliche Arbeitsfreude | qui ait du goût pour toutes les sortes
24 | in seiner Arbeitslust | dans son goût de l'effort
25 | in seiner Arbeitslust | dans l'amour de l'effort
26 | schmutzig-kleinliche Arbeit |
27 | zu körperlicher und geistiger Arbeit | incapables de travaux du corps que de ceux de l'âme
28 | die von ihrer Hände Arbeit leben | qui travaillent
29 | wenn eine solche Arbeit Stückwerk ist | si cet objet lui aussi se trouve être quelque chose de peu net
30 | praktische Arbeit | œuvres d'habileté humaine

Table 4 gives 30 citations for Arbeit(en), and their French equivalents. Only in
citations 4, 5 and 11 do we find the standard equivalent travail; in citations 16,
18, 21 and 27 it is travaux (for the reason, see above). Other recurrent equivalents are effort (citations 9, 23) and efforts (citations 24, 25); l'ouvrage (12, 17);
œuvre (1) and œuvres (30); singular equivalents are effet (3), activité (7), fonction (8), choses (19), exécution (22) and objet (29). For the collocation geistige
Arbeit we find philosophie (14), pensée (20) and travaux de l'âme (27). None of
these equivalents seems out of place, but no German-French dictionary offers
such a wealth of options. The really interesting finding is that in six instances the
equivalent of a nominal phrase is a verbal phrase: zu gemeinsamer Arbeit unfähig
(2) becomes incapables d'agir en commun; schönere Arbeit leisten (6) becomes
réussir mieux; tüchtige Arbeit leisten (10) becomes l'accomplir comme il fallait;
aufmerksame Arbeit (13) becomes l'attention qu'on doit exercer; für die Arbeit
(15) (where potters are mentioned in the wider context) becomes faire une
poterie; and finally we find qui travaillent (28) as a very loose equivalent of die
von ihrer Hände Arbeit leben (who make a living from their labour).

The role of parallel corpora in translation and multilingual lexicography

The evidence extracted from the comparison of the French and the
German versions of the Republic shows clearly that translators do not translate
single, decontextualised words by assigning to them the word senses featuring
in a dictionary or an ontology. Rather, translators carry out their task by slicing
the text into translation units, semantic conglomerates which are translated as
a whole. But even where they translate one single word by another single word,
they do this not by assigning a specific word sense to the word but by forming a
hypothesis on the basis of the context in which this word occurs. We do not
know why a translator paraphrases gefertigt für diese Arbeit by fabriquée à cet
effet, but paraphrases zu keiner anderen Arbeit taugen by être impropres à toute
autre fonction. Are not fonction and effet used in a very similar way? Are they
synonymous? It is impossible to tell from the limited evidence we have. In a
large parallel corpus we can expect to find a reasonable number of occurrences
of each translation unit, allowing us to generalize. If there is one unit where
Arbeit is always translated as effet and never as fonction, then we would have
found a reliable equivalent. But if in the same context effet and fonction are
about equally common, it would seem that, in the given context, these two
words are synonymous.
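
The generalization step just described can be sketched in Python. The sketch is illustrative only: the citation counts are invented, the two translation units are hypothetical stand-ins, and the names equivalent_profile and classify, as well as the 20 per cent tolerance used to decide that two equivalents are "about equally common", are assumptions that do not come from the study.

```python
from collections import Counter, defaultdict


def equivalent_profile(citations):
    """Map each translation unit to a Counter of its observed equivalents."""
    profile = defaultdict(Counter)
    for unit, equivalent in citations:
        profile[unit][equivalent] += 1
    return profile


def classify(counter, tolerance=0.2):
    """Label a translation unit from the distribution of its equivalents.

    'reliable'  : only one equivalent was ever observed.
    'synonyms?' : the shares of the two most frequent equivalents differ
                  by at most `tolerance` (an assumed cut-off).
    'mixed'     : anything else.
    """
    total = sum(counter.values())
    ranked = counter.most_common()
    if len(ranked) == 1:
        return "reliable"
    (_, first), (_, second) = ranked[0], ranked[1]
    if (first - second) / total <= tolerance:
        return "synonyms?"
    return "mixed"


# Invented counts for two hypothetical translation units.
citations = (
    [("für diese Arbeit", "effet")] * 6
    + [("keine andere Arbeit", "effet")] * 5
    + [("keine andere Arbeit", "fonction")] * 4
)

profile = equivalent_profile(citations)
labels = {unit: classify(counts) for unit, counts in profile.items()}
print(labels)
```

With a real parallel corpus, the list of citations would come from the aligned sentence pairs rather than from hand-made counts, and the tolerance would have to be chosen with the size of the sample in mind.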
In this section, I have shown that the translator's design space is much larger
than the language-neutral conceptual ontology (or the traditional bilingual dictionary) would lead us to believe. However, the evidence this comparison is based
on is too limited and the variation encountered too large to show how information derived from parallel corpora can be incorporated into a bilingual dictionary
or a translation platform that would actually help with translating texts.

. The parallel corpus: A repository of re-usable translation units

Recently a team at the Multilingual Research Group of the Institut für Deutsche
Sprache in Mannheim, headed by Valérie Kervio-Berthou, assembled a 30-million-word French-German Parallel Corpus, with the acronym GeFrePaC. The
work was supported by a grant from ELRA (the European Language Resources
Agency), and it is now being distributed by this agency. About two thirds of this
corpus consists of documents issued by the European Commission (and downloaded from the CELEX database), and one third consists of the German and
French versions of the European Parliament's verbatim record of proceedings.
The corpus is part-of-speech tagged, aligned on the sentence level and TEI-encoded. More information on GeFrePaC can be found in Kervio-Berthou (2000).

GeFrePaC will be used in Mannheim for a project in bilingual lexicography.
The goal is to test the hypothesis that the translational knowledge implicitly
contained in a parallel corpus complements the traditional bilingual dictionary.
While, owing to its limited size, it will not cover as many words as the average
comprehensive French-German/German-French dictionary, it will contain
many relevant translation units and their equivalents that tend to be overlooked
by lexicographers not working with a parallel corpus (and that is the majority).
Bilingual lexicographers have always been aware of the fact that texts are
often translated in units larger than the single word. For a long time they have
aimed to include compounds, multi-word units, significant collocations, set
phrases and idioms. But until the arrival of corpora it was left to the lexicographer's skills to sift the evidence and to decide what to enter in the dictionary.
Usually they relied on monolingual dictionaries and on their own observations.
The results were often arbitrary or even idiosyncratic. With the availability of
monolingual corpora the quality of bilingual dictionaries quickly improved.
Corpus linguistics provided the methodology to identify semantic conglomerates such as compounds or collocations, using a combination of statistical and
grammatical approaches. It was then possible to enter the most relevant of these
conglomerates, together with their presumed equivalents, in the dictionaries.
Usually, bilingual dictionaries refer to corpora only to validate entry candidates selected on the basis of other principles. In Elena Tognini Bonelli's (1996)
words, this is still only the corpus-based (as opposed to the corpus-driven)
approach. But even where bilingual dictionaries record the evidence encountered in monolingual corpora, they still have to rely on the lexicographer's
bilingual competence to determine the translation equivalent of any semantic
conglomerate. This equivalent will, under normal circumstances, not be
wrong. But it will not necessarily reflect the translation practice of the community of French-German and German-French translators.
This practice is what parallel corpora record. They are repositories of
translation units and their equivalents in the target language, and these translation units may be words within a given context or the semantic conglomerates
mentioned above. Since there are, to date, no bilingual dictionaries of general
language based on parallel corpora, we still do not know to what extent they
can complement, improve and validate existing dictionaries. However, there is
reason to believe that the additional evidence a parallel corpus of just 30 million words will provide is enough to take the traditional concept of printed dictionaries to its limits. Instead of a multi-volume printed dictionary there is now
the option of a bilingual database of translation units and their target language
equivalents. In the more distant future, a translation platform may offer the
translator a tentative breakdown of the text to be translated into translation
units and possible target equivalents.
Unfortunately the GeFrePaC corpus was finalised only in June 2000, when
it was too late to exploit it systematically for this analysis. What I present here
are occurrences of Arbeit and travail together with their equivalents as they
were extracted by hand from the corpus, amounting to less than 40 citations
per language pair. This represents only a tiny fraction of the total number of
occurrences. All citations were taken from EU documents, none from the
European Parliament verbatim record of proceedings.
In the case of the EU documents in our corpus, it is often impossible to
say which is the original text and which is the translation. While it is safe to
assume that there will be hardly any German original texts, in a number of
instances the French text will also be a translation, in this case of an English
text. Earlier drafts may well have been written in Spanish or German or other
EU languages. Since each language version of the final text is legally binding, we
can, in principle, assume that semantically a German and a French version are
closer to each other than the translations of Plato's Republic. GeFrePaC is not a
corpus of general language; it records, rather, the legal and administrative language used in the European Commission. The French version therefore differs
from the legal and administrative language used in France, and the same is
mutatis mutandis true of the German version. It is a special jargon of its own,
and the different language versions this jargon is used in are linked by the continuous practice of translation. This strong continuity in practice is one important reason why the translators of the EU legal documents have less freedom,
less design space, than the translators of Plato's Republic. In their efforts to link
the different language versions of a text as closely as possible they preserve the
original structure wherever possible, at text level, at paragraph level, at sentence level, and at the level of the translation unit.
From a methodological point of view, the special language used in EU documents facilitates the extraction of translation units and their equivalents, as
well as the processing of this data into lexicographical results. This is why they
are a good starting point for testing parallel corpora in dictionary making.
More balanced parallel corpora, representing general language as it is used in
newspapers, magazines, fiction and general-purpose books, will be more difficult to handle. Nevertheless, I believe our citations of Arbeit and travail usefully
show how corpus data can be processed. In the tables that follow, the translation units are given in italics, with the keyword in bold face. T1, T2 etc. refer to
different texts, which may have been translated by different people. Table 5 presents the citations for Arbeit.
Table 5. EU documents: Arbeit and its French equivalents (20 citations)

No. Text  German                                                 French
01  T1    an der Arbeit der Organisation teilnehmen              participer aux travaux de
02  T1    die Arbeit der neuen Organisation                      les activités de la nouvelle organisation
03  T1    die Arbeit der Organisation                            les activités de l'organisation
04  T2    sich an ihrer Arbeit beteiligen                        participer à ses travaux
05  T2    seine Arbeit zügig durchführen                         s'acquitter promptement de sa tâche
06  T2    ein Jahresbericht über die Arbeit der Behörde          un rapport annuel sur l'activité de
07  T2    die Arbeit der Behörde                                 les travaux de l'Autorité
08  T2    die Arbeit des Schiedsgerichts                         la tâche du tribunal arbitral
09  T2    die Arbeit an Bord                                     le travail à bord
10  T3    die Arbeit des weltweiten Netzes stärken               renforcer le fonctionnement du réseau
11  T3    die Modalitäten der Arbeit dieser Gruppen              les modalités de fonctionnement de ces
12  T3    die Modalitäten der Arbeit ihrer Gruppe                les modalités de fonctionnement de chaque groupe
13  T4    die sich auf die Kosten einer solchen Arbeit beziehen  relatives au coût de ce travail
14  T4    seine Arbeit abschließen                               mener à leur terme ces travaux
15  T4    Bericht über die Arbeit des Ausschusses                rapport sur les travaux du Comité
16  T5    die Arbeit von Nichtregierungsorganisationen           le travail des organisations non
17  T5    die Arbeit der Kommission                              les travaux de la commission
18  T5    Organisation und Umfang der Arbeit                     l'organisation et l'étendue des travaux
19  T6    in seine Arbeit einbeziehen                            d'associer à ces travaux
20  T7    ohne eine andere zusätzliche Arbeit                    sans autre main d'œuvre

The most surprising result is that the standard equivalent travail occurs only
three times, in citations 9, 13 and 16. Other nouns in the singular are tâche (5,
8; same text), fonctionnement (10, 11, 12), activité (6) and main d'œuvre (20).
In all the other instances, the singular Arbeit corresponds to a plural noun:
travaux (1, 4, 7, 14, 15, 17, 18, 19; spread over most texts) and activités (2, 3;
same text). Looking at the collocates, we find that organisation can be a modifier of travaux (1), activités (2, 3) and travail (16). This does not provide a clear
picture but may indicate that in this context pattern travail, travaux and activités are synonymous. It seems reasonable to assume that organisation belongs
to the same semantic field as autorité (7; collocate of travaux), tribunal arbitral
(8; collocate of tâche), groupe, groupes (11, 12; collocates of fonctionnement),
comité (15; collocate of travaux), commission (17; collocate of travaux).
Travaux, in this context, is not only the most frequent equivalent (1, 7, 15, 17)
but also the one spread over most texts, and this can be understood as an indication that travaux is always appropriate if followed by such modifiers.
Fonctionnement occurs in only one text, perhaps due to a translator's whim. We
find participer à ses travaux (4) alongside mener à leur terme ces travaux (14)
and d'associer à ces travaux (19), but s'acquitter de sa tâche (5). Again, travaux
seems to be the best choice for this pattern.
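
The frequency and dispersion figures quoted for Table 5 can be reproduced mechanically. The following Python sketch is only a convenience for checking the tallies; the data are the head nouns of the French equivalents as read off Table 5, and the variable names are illustrative assumptions.

```python
from collections import Counter, defaultdict

# (citation number, text label, head noun of the French equivalent),
# read off Table 5 by hand.
TABLE_5 = [
    (1, "T1", "travaux"), (2, "T1", "activités"), (3, "T1", "activités"),
    (4, "T2", "travaux"), (5, "T2", "tâche"), (6, "T2", "activité"),
    (7, "T2", "travaux"), (8, "T2", "tâche"), (9, "T2", "travail"),
    (10, "T3", "fonctionnement"), (11, "T3", "fonctionnement"),
    (12, "T3", "fonctionnement"), (13, "T4", "travail"),
    (14, "T4", "travaux"), (15, "T4", "travaux"), (16, "T5", "travail"),
    (17, "T5", "travaux"), (18, "T5", "travaux"), (19, "T6", "travaux"),
    (20, "T7", "main d'œuvre"),
]

# How often each equivalent occurs ...
frequency = Counter(equivalent for _, _, equivalent in TABLE_5)

# ... and in how many different texts it is found.
texts = defaultdict(set)
for _, text, equivalent in TABLE_5:
    texts[equivalent].add(text)

for equivalent, count in frequency.most_common():
    print(f"{equivalent:15} {count:2}  texts: {sorted(texts[equivalent])}")
```

Frequency alone does not separate travail from fonctionnement (three citations each); it is the second figure, the number of texts an equivalent is spread over, that shows fonctionnement to be confined to a single text.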
Table 6. EU documents: Arbeiten and its French equivalents (17 citations, all in text T2)

No. German                                   French
01  an den Arbeiten der Behörde teilnehmen   participer aux travaux de l'Autorité
02  im Zusammenhang mit Arbeiten             au titre des opérations
03  die Unterbrechung von Arbeiten           de suspendre les opérations
04  Einzelheiten der dort durchgeführten     détails sur les travaux dont elles font l'objet
05  Tätigkeiten für Arbeiten                 activités connexes au titre des opérations
06  für Arbeiten nicht mehr benötigt         n'est plus nécessaire au titre des opérations
07  die mit Arbeiten zusammenhängen          connexes au titre des opérations
08  im Zusammenhang mit Arbeiten des         à raison des opérations du contractant
09  die Arbeiten des Unternehmers behindern  gêner les activités de l'exploitant
10  Arbeiten (in Übersicht)                  opérations (en sommaire)
11  Dauer der Arbeiten (in Übersicht)        durée des opérations (en sommaire)
12  Fortschritt der Arbeiten (in Übersicht)  avancement des travaux (en sommaire)
13  Überprüfung der Arbeiten (in Übersicht)  inspection des opérations (en sommaire)
14  Erträge aus den Arbeiten (in Übersicht)  recettes tirées des opérations (en sommaire)
15  im Verlauf seiner Arbeiten               dans la conduite des opérations
16  seine Arbeiten selbständig durchführen   agir de façon autonome
17  Auskunft über Arbeiten                   renseignements au sujet des opérations

Table 6 presents the citations of Arbeiten. Again it comes as a surprise that in 12
out of 17 instances, it is not travaux but opérations that corresponds to the plural
noun Arbeiten (2, 3, 5, 6, 7, 8, 10, 11, 13, 14, 15, 17). As all these citations are taken
from one text, it is impossible to decide if they can be generalised. Apart from
opérations, we find also travaux (1, 4, 12) and activités (9). There is even a case
where the noun Arbeiten corresponds to the verb agir (16). Unfortunately, the
citations do not indicate a difference in usage between opérations and travaux.
Table 7 below presents the citations of travail. It seems that travail corresponds more often to Arbeit (or Arbeits-) than the other way round.

Table 7. EU documents: travail and its German equivalents (20 citations)

No. Text  French                                                              German
01        les conditions de travail                                           die Arbeitsbedingungen
02        plans de travail
03        un plan de travail
04        les plus hautes qualités de travail                                 ein Höchstmaß an Leistungsfähigkeit
05        le volume de travail du Tribunal des                                Arbeitsanfall des Gerichtshofs
06        au travail à bord                                                   bei der Arbeit an Bord
07        un groupe de travail                                                eine Arbeitsgruppe
08        les heures de travail normales                                      die normale Arbeitszeit
09        un groupe de travail d'experts                                      eine Arbeitsgruppe von Sachverständigen
10        la pression de travail                                              der Betriebsdruck
11        le travail du caoutchouc                                            das Bearbeiten von Kautschuk
12        un simple travail de surface                                        eine einfache Oberflächenbearbeitung
13        un travail de tirage de fils                                        eine Auszieharbeit
14        les machines-outils pour le travail des métaux                      die metallbearbeitenden
15  T4    les organes de travail                                              die Arbeitsgeräte
16  T4    les coupeuses pour le travail du papier ou du carton                die Papier- und Pappeschneidemaschinen
17  T5    un appel d'offres en vue d'un complément de travail de conception  eine Ausschreibung für weitere
18  T5    les permis de travail                                               die Arbeitserlaubnis
19  T5    les accidents du travail                                            die Arbeitsunfälle
20  T5    les lock-out ou autres conflits du                                  Aussperrungen oder sonstige

Indeed, out of our 20 citations, we find Arbeit/Arbeits- in eleven instances (1, 2, 3, 5, 6, 7, 8, 9, 15,
18, 19), and Arbeit is also part of Auszieharbeit, the equivalent of travail de tirage de
fils (13). In these twelve citations there is only one occurrence of Arbeit proper (6),
while everywhere else Arbeit is part of a nominal compound. Recurrent are plan(s)
de travail (Arbeitsplan, -pläne; 2, 3) and groupe de travail (Arbeitsgruppe; 7, 9).
N+de+travail collocations correlate with a compound noun beginning with
Arbeits-: conditions de travail (Arbeitsbedingungen; 1), volume de travail (Arbeitsanfall; 5), heures de travail (Arbeitszeit; 8), organes de travail (Arbeitsgeräte; 15),
permis de travail (Arbeitserlaubnis; 18). The same French pattern, but a different
German word, is found in qualités de travail (Leistungsfähigkeit; 4). (There are also
two instances of the correlation Leistung/travail/travaux in Plato's Republic, see
above.) If the work in question is carried out by machines, we find Bearbeiten (11,
14) and Bearbeitung (12) instead of Arbeit. In citation 14, there is also a change in
the structure of the collocation: the French prepositional phrase here corresponds
to an adjective. In citation 10, pression de travail and Betriebsdruck are standardised terms, and this also seems to be true in the case of citation 16. In citation 17
travail de conception correlates with Konzeptionen, which looks like an idiosyncratic solution not to be generalised. Finally, in citation 20 we find Betriebsunruhen where I would have expected the more common Arbeitsunruhen. But an
explanation could be that Arbeitsunruhen is commonly used for trouble caused by
the workforce and therefore cannot designate les lock-out caused by managers.
Table 8. EU documents: travaux and its German equivalents (9 citations)

No. Text  French                                                German
01  T6    travaux effectués à l'aide de fils brodeurs en métal  mit Metallfäden ausgefüllte Sticharbeiten
02  T7    les travaux de la Cour                                die Arbeiten des Hofes
03  T7    lorsque les travaux ont été suffisamment avancés      bei zufriedenstellendem Fortschritt der
04  T7    les travaux de réhabilitation urgents                 dringende Rehabilitationsmaßnahmen
05  T7    l'ensemble des travaux des ONG                        die Gesamttätigkeit der NRO
06  T7    les travaux de ce groupe                              die Arbeiten dieser Gruppe
07  T7    les travaux du groupe se sont                         die Gruppe ist in ihrer Arbeit
08  T7    dans le cadre de ces travaux                          im Rahmen dieser Arbeit
09  T8    les marchés de travaux

Table 8 shows the citations of travaux. Out of the nine citations of travaux, four
correspond to the standard equivalent Arbeiten (or -arbeiten) (1, 2, 3, 6), and
only two to Arbeit (7, 8), whereas out of 20 citations for Arbeit, eight corresponded to travaux (see above). In citation 4, travaux de réhabilitation must be
rendered as Rehabilitationsmaßnahmen, while in citation 5, Gesamttätigkeit
seems to be more elegant than die Gesamtheit der Arbeiten. In 7 there is an
attractive rephrasing of the original structure: the French nominal modifier is a
subject in the German version, while the French subject is an adverbial phrase
modifier in the German text. Finally, in citation 9, travaux is rendered by
Dienstleistungs-, which makes sense in this context.
My analysis is based on such a small number of citations that it is very difficult to generalise. It is recurrence as a parameter which allows us to determine
whether a correlation between translation unit and translation equivalent is
sufficiently established for it to be safely re-used in the translation of new texts.
Recurrence is also an important issue for the automatic identification and
extraction of translation units and their equivalents. Two other methods are
statistical procedures (allowing us to determine the significance of co-occurrence of text elements) and grammatical operations (POS-tagging for determining the syntactic structure of a collocation). But even the very restricted
analysis of just a few score citations of the kind presented here yields results that
can well be used to complement existing bilingual dictionaries. In Table 9 these
results are juxtaposed with the (abbreviated) entry for Arbeit/travail in the
PONS Großwörterbuch Französisch-Deutsch (1996).
Table 9. Corpus evidence vs. dictionary evidence

Corpus evidence

Arbeit & equivalents
travaux (4)
activités (2)
fonctionnement (2)

Arbeiten & equivalents
opérations (12)
travaux (3)

travail & equivalents
Arbeits- (10)
Betriebs- (2)
Leistungs- -Arbeit
- -bearbeitung

travaux & equivalents
Arbeiten (3)
Arbeit (2)

Dictionary evidence

Arbeit & equivalents
01. (Tätigkeit) travail
02. (Arbeitsplatz) travail
03. (Produkt) travail
04. (schriftliches Werk) travail, ouvrage
05. SCOL (Klassen~) devoir, contrôle
06. UNIV mémoire, dissertation
07. (Mühe) travail
08. (Aufgabe) travail, tâche

travail & equivalents
01. (activité) Arbeit ...
02. (tâche) Arbeit ...
03. (activité professionnelle) Arbeit... Schwarzarbeit... Nachtarbeit... Zeitarbeit
04. pl (ensemble de tâches)... die Bauarbeiten... ~aux de champs Feldarbeit...
05. (réalisation) Arbeit... Werk
06. (publication) Arbeit
07. ECON Arbeit; division du ~ Arbeitsteilung
08. (façonnage) Bearbeiten, Bearbeitung... ~de qc Bearbeitung einer S.,
09. (fonctionnement) Arbeit... Funktion,
10. (effet) [Ein]wirkung...
11. (bois/métal) Arbeiten
12. PHYS Arbeit
13. MED... [Geburts]wehen...


My goal has been to show that the evidence of parallel corpora can complement
traditional translation aids, such as printed dictionaries, termbanks and even
translation memories. This evidence can be used to compile better bilingual dictionaries. However, as parallel corpora keep growing in size, the traditional form
of the printed dictionary will give way to a bilingual database. Bilingual databases
can cope better with larger translation units, and they can also be used as input for
translation platforms which provide translators with translation options among
which they can select the appropriate equivalent. What matches a translation unit
with its equivalent in a given target language is not some abstract property of the
language system in general or the systems of two specific languages but the continuous practice of generations of translators. It is this received practice that bilingual
dictionaries have always endeavoured to capture. Now, with the advent of parallel
corpora, this goal can finally be approached.
Not all the practice of translators is adequate or appropriate. It is up to the
community of bilingual speakers to decide whether a translation is adequate or
appropriate. Many of the equivalents translators have come up with are questionable or idiosyncratic and should not be re-used if the translation unit
occurs in a new text to be translated. Not all the evidence extracted from a parallel corpus should go into a bilingual database or printed dictionary. But how
can we distinguish good practice from bad?
For an automatic distinction, the only parameter available is recurrence.
We can act on the assumption that a successful solution to a recurrent problem
will, in due course, outnumber less successful attempts. If, in comparable contexts, a given adjective/noun collocation occurs eight times and is translated
five times by the same equivalent and three times by different equivalents, we
are justified in assuming that the recurrent equivalent, established by practice,
can safely be re-used in a new translation of this collocation.
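
The recurrence criterion can be stated as a small decision rule. The Python sketch below is a minimal illustration under stated assumptions: the function name reusable_equivalent, the requirement that the dominant equivalent account for more than half of the occurrences, and the minimum count of two are all choices made for this example, not thresholds proposed in the text. The equivalents are placeholders.

```python
from collections import Counter


def reusable_equivalent(equivalents, min_share=0.5, min_count=2):
    """Return the dominant equivalent if it recurs often enough, else None.

    `equivalents` lists the equivalent chosen at each occurrence of one
    translation unit in comparable contexts. Both thresholds are assumed
    values for illustration.
    """
    if not equivalents:
        return None
    best, count = Counter(equivalents).most_common(1)[0]
    if count >= min_count and count / len(equivalents) > min_share:
        return best
    return None


# The case described above: eight occurrences, five with the same
# equivalent and three with three different ones (placeholder labels).
observed = ["equivalent_A"] * 5 + ["equivalent_B", "equivalent_C", "equivalent_D"]
print(reusable_equivalent(observed))

# No equivalent dominates here, so nothing is proposed for re-use.
print(reusable_equivalent(["equivalent_A", "equivalent_A",
                           "equivalent_B", "equivalent_B", "equivalent_C"]))
```

A rule of this kind only ranks what translators have actually done; whether the dominant equivalent is also adequate remains a judgement for the community of bilingual speakers.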
I do not believe that Machine Translation has a future except for texts written in the controlled language of a narrowly restricted domain. It cannot be the
answer to the growing demand for general language translations. Conceptual
ontologies as they are customarily used in MT have two major deficiencies:
they cannot deal with inherent word meanings but only with externally
assigned word senses, and they fail to account for the fact that a large part of the
vocabulary of general language consists of words whose meaning becomes
concrete only within the context they are used in or as part of a semantic conglomerate. Conceptual ontologies contain decontextualised concepts, concepts in their paradigmatic relationship, but deprived of their syntagmatic relationships. Conceptual ontologies attempt to categorise the reality we
encounter independently of language, and this is why they cannot deal with the
symbolic nature of language.
Language is discourse; it is the universe of texts that has been produced by a
language community. Texts are concatenations of semantic conglomerates, of
words, of collocations, of set phrases, of linguistic symbols which can only be
studied from the point of view of form and meaning. Whatever the meaning of
such a conglomerate, it does not refer to some language-external reality.
Meaning is the history of all earlier occurrences of this conglomerate, and it is to
these that it refers. In this history of occurrences we find citations where the conglomerate was paraphrased or explained, and other instances where it was used
within a specific context. It is these citations, and not the language-external reality, which are symbolised by a linguistic sign. For there is no other way of introducing a unit of meaning into the discourse than by explaining what it means. A
word whose meaning has never been explained does not refer to anything.
Corpus linguistics provides the methodology to take linguistics, and lexicology in particular, beyond the single word as the basic semantic unit. Rather
than decontextualising words and describing their meanings in the isolation of
a lexical entry, corpus linguistics breaks down the border between syntax and
the lexicon by identifying semantic conglomerates in corpora, combining the
parameters of recurrence, statistical significance and syntactic categorisation.
Corpus linguistics elucidates the meanings of these units of meaning by
extracting their paraphrases and their usage from the corpus.
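
How recurrence and a statistical score might be combined when looking for semantic conglomerates can be shown with a toy example. The Python sketch below uses pointwise mutual information (PMI) over adjacent word pairs as a simple stand-in for a significance measure; the text does not name a particular statistic, so this choice, the function name pmi_scores, the minimum pair count and the four invented sentences are all assumptions. Syntactic categorisation (for instance via POS tags) is left out.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_pair_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    Pairs occurring fewer than `min_pair_count` times are dropped first,
    which acts as the recurrence filter.
    """
    words = Counter()
    pairs = Counter()
    for tokens in sentences:
        words.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))
    n_words = sum(words.values())
    n_pairs = sum(pairs.values())

    scores = {}
    for (first, second), count in pairs.items():
        if count < min_pair_count:
            continue
        p_pair = count / n_pairs
        p_first = words[first] / n_words
        p_second = words[second] / n_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Four invented, already tokenised sentences.
toy_corpus = [
    ["die", "geistige", "Arbeit"],
    ["die", "geistige", "Arbeit"],
    ["die", "Arbeit", "der", "Kommission"],
    ["der", "Bericht"],
]

scores = pmi_scores(toy_corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

On a corpus of realistic size, a measure with better behaviour for rare events would normally be preferred, and the surviving candidates would still need to be checked against their syntactic pattern.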
Multilingual parallel corpora can be understood as repositories of paraphrases of translation units. The meaning of a translation unit in the source language is its equivalent in the target language. If there are several equivalents of a
translation unit and if these equivalents are not synonymous, then this translation unit has several meanings. While traditional bilingual lexicography has
often assigned word senses according to the established practice in the source
language, it is now possible to define the senses of translation units on the basis
of their non-synonymous equivalents in the target language. What is a translation unit in relation to one target language does not have to be one in relation to
another. It is the target language that determines the unit of meaning.
My analysis of Arbeit and travail in Plato's Republic and in EU legal documents shows, I hope, that actual translation practice offers a wider choice of
options and a larger design space for translation than the traditional bilingual
dictionary. The two corpora used in this study are not representative of parallel
corpora in general. In the case of Plato's Republic, I compared the French and
the German translation of a Greek text, not an original text with its translation.
The EU documents as a parallel corpus are unique in the sense that it is not
possible to distinguish source language from target language. Translation studies have taught us, however, that it is important to know which is the source
language, which is the target language. Translation is a unidirectional activity,
and there are good reasons why bilingual dictionaries cannot be (and perhaps
should not be) reversible. The EU corpus, therefore, does not permit us to
reassess the important issue of reversibility. Here we need the evidence of reciprocal parallel corpora, consisting of original texts in all the languages involved,
together with translations into all the languages. These corpora can then tell us
to what extent the practice of translating from language A into language B can
have an effect on translating from language B into language A. These parallel
corpora will not lead to fully reversible bilingual dictionaries, but they will provide evidence for a renewed discussion of this issue. Parallel corpora will make
faster and better translations possible. Multilingual corpus linguistics will contribute to monolingual and bilingual lexicology.


The following translations of Plato's Republic were used:

Paul Shorey; Cambridge, Mass.: Harvard University Press (available electronically and
therefore used as master version)
Desmond Lee (1974), 2nd ed.; London: Penguin.
Francis MacDonald Cornford (1941); Oxford: Oxford University Press.
W. H. D. Rouse (1984), revised ed.; New York: New American Library.
B. Jowett (1871); New York: Vintage Books
A. D. Lindsay (1957); New York: E. P. Dutton.
John Llewelyn Davies/David James Vaughan (1997); Ware: Wordsworth Editions.
Access to the Plato Parallel Corpus (the Republic in ca. 20 different language versions,
many of them aligned on the sentence level) can be obtained from:

Carruthers, P. and Boucher, J. (eds) 1998. Language and Thought. Interdisciplinary Themes. Cambridge: Cambridge University Press.
Eco, U. 1993. La ricerca della lingua perfetta nella cultura europea. Roma: Laterza.
Fodor, J. 1975. The Language of Thought. New York: Crowell.
Kervio-Berthou, V. 2000. GeFRePac. Deliverable 3: ELRA Final Report. Mannheim: IDS.
Koskinen, K. 2000. Institutional illusions: Translating in the EU Commission. The Translator 6: 49–66.
Melby, A.K. 1996. The Possibility of Language. Amsterdam: Benjamins.
Pinker, S. 1994. The Language Instinct. New York: William Morrow.
Putnam, H. 1975. The meaning of meaning. Reprinted in H. Putnam, Mind, Language and Reality. Philosophical Papers 2. Cambridge: Cambridge University Press.
Teubert, W. 1996. The concept of work in Europe. In Conceiving of Europe: Diversity in Unity, A. Musolff, C. Schäffner and M. Townson (eds), 129–145. Aldershot:
Tognini Bonelli, E. 1996. Corpus: Theory and Practice. Birmingham: TWC.

Bilingual lexicography, overlapping polysemy, and corpus use

Victòria Alsina and Janet DeCesaris


Both lexicographers and researchers interested in improving the quality and usefulness of dictionaries have welcomed the advent and availability of large
computerized corpora. Representative bilingual or multilingual corpora are
possible in specialized fields because in these well-defined situations set in
multilingual environments the subject domains are quite restricted. Bilingual
or multilingual corpora consisting of texts based either on translations produced by highly trained professionals or on comparable text production thus
play an essential role in ensuring that specialized dictionaries, glossaries and
terminologies actually reflect the language used in the workplace. However, the
tasks and data facing the general language bilingual lexicographer are rather
different in nature from the delimited contexts just mentioned: the kind of corpus which proves most useful in the construction of bilingual dictionaries is
not yet well defined. While many modern monolingual dictionaries depend
heavily on corpus-based data, bilingual lexicography has yet to determine what
type of corpus best serves the needs of general bilingual dictionaries. This
would seem to be yet another manifestation of the fact that bilingual lexicography lags behind monolingual lexicography (Hartmann and James 1998:15).
Many researchers have noted that the typology of potential users of general
bilingual dictionaries is quite varied, ranging from advanced learners to experienced translators (Al-Kasimi 1983: 154–157, Tomaszczyk 1983: 46). Bilingual
dictionaries of this sort are used for both encoding and decoding by speakers of
two different languages with several levels of language skills and thus must
incorporate a great deal of grammatical and pragmatic information. In corpus-based bilingual lexicography, the two alternatives previously hinted at are parallel corpora, which contain one set of texts in two or more languages, and
comparable corpora, which contain texts in several languages with the same or
similar composition. Teubert presents a cogent discussion of both types
(1996: 245–249), and concludes (rightly, we think) that ideally, parallel corpora should be viewed as complementary to comparable corpora (1996: 252).
Parallel corpora run the risk of presenting data produced under the special
conditions of translation, which may be significantly different from regular
native-speaker production. It is a well-known fact in translation theory that
phenomena pertaining to the make-up of the source text tend to be transferred to the target text, whether they manifest themselves in a negative transfer
(i.e., deviations from normal, codified practices of the target system), or in the
form of positive transfer (i.e., greater likelihood of selecting features which do
exist and are used in any case) (Toury 1995: 275). It is true that interference
need not be seen as an undesirable trait in translation. Indeed, its undesirability is always a function of a host of socio-cultural factors, which may therefore
be said to condition our law, and communities differ in terms of their resistance to interference, especially of the negative type (Toury 1995: 277).
Nevertheless, the inevitable presence of interference or transfer in translated
texts does bear directly on the data to be found in parallel corpora, and makes
us question the reliability of parallel corpora as the primary source of data for a
general language bilingual dictionary. Comparable corpora would seem
preferable for this type of dictionary project, but their use will not be addressed
in this paper because we know of no such corpus data available for general purpose language in the language combinations we discuss.2
The two main problems we have mentioned with bilingual corpora, the
presence of interference and unavailability, do not plague monolingual corpora of English. Since reliable, contemporary corpus data is widely available for
English, we decided to see how it could be used to improve the information
currently provided in English/Spanish and English/Catalan dictionaries. In
order to determine the possible role for monolingual corpus data in the preparation of these dictionaries, we must first identify the main problems that beset
existing dictionaries. We have therefore chosen three non-derived, polysemous
adjectives in English that we were sure to find amply covered in current
English/Spanish and English/Catalan dictionaries and in a corpus of English:
cold, high and odd. Existing dictionary entries for these words were analyzed to
pinpoint what needed improvement, and then the British National Corpus was
consulted to see how it might help resolve the issues resulting from the dictionary analysis. We conclude that data from a monolingual corpus proves useful
for addressing some of the main problems associated with providing equivalents for adjectives in a general-purpose bilingual dictionary, such as order of
presentation, repetition of equivalents due to what we will define as overlapping polysemy, and decisions regarding examples, but has little bearing on the
issue of delimiting possible contexts in which the equivalent provided by the
dictionary is appropriate.


We looked up the entries for the three adjectives in three English/Spanish bilingual dictionaries and one English/Catalan dictionary. The bilingual dictionaries consulted in the case of Spanish were The Oxford Spanish Dictionary (OSD),
Larousse Gran Diccionario Español-Inglés/English-Spanish (GL), and Simon &
Schuster's International Dictionary English-Spanish/Spanish-English (S&S),
which were chosen for the following reasons. First, we were interested in analyzing entries in recently published dictionaries which would reflect contemporary usage. The OSD in particular is noteworthy in this respect, as its first
edition was published in 1994. Second, we deliberately included dictionaries
produced by both British and American publishers. Third, it has been our personal experience as translators and teachers of translation that all three of these
dictionaries are useful, that is to say, we ourselves use them and recommend
them to our students. The English/Catalan dictionary analyzed is the
Diccionari anglès-català published by Gran Enciclopèdia Catalana (DAC),
which is the most comprehensive bilingual dictionary for this language combination currently available.
The entries from bilingual dictionaries were compared with those from
three monolingual dictionaries, the second edition of the Collins Cobuild
English Dictionary (Cobuild), the Cambridge International Dictionary of English
(CIDE) and the third edition of the American Heritage Dictionary (AHD3). The
choice of these particular dictionaries was also not random: Cobuild is the
prime example of a corpus-based dictionary in English, and because it is aimed
at advanced learners of the language its target audience coincides to a large
extent with the users of the bilingual dictionaries under examination. CIDE is
addressed to the same target audience, states that a corpus was used in its
preparation although it does not purport to be corpus-based in the same way
as does Cobuild, and has a very nice way of dealing with polysemy in that senses
are clearly grouped together under differentiated basic concepts. AHD3 covers
American English, a variety of the language which is explicitly included in the
bilingual dictionaries and not, we feel, well represented in Cobuild (it is somewhat better represented in CIDE). Although AHD3 is not based on a corpus, it
does claim to rank the order of senses on the basis of usage, as opposed to the
historical order of senses employed by other well-known American dictionaries such as Merriam-Webster's Collegiate Dictionary.
The corpus consulted, as mentioned above, was the British National
Corpus (BNC). The BNC was designed "to characterise the state of contemporary British English in its various social and generic uses" (Aston and Burnard
1998: 28). It includes both informative and imaginative texts, and comprises
90% written texts and 10% spoken texts. In spite of the design features of the
BNC that might lead to controversial linguistic generalizations about general
purpose English, we believe it provides a sufficiently accurate picture of British
English to allow comparison of data culled from it with data from dictionaries.
The numbers of examples of the three adjectives in this corpus are as follows:

28,698 examples in 3,243 texts
6,438 examples in 1,592 texts
4,478 examples in 1,595 texts

For the purposes of this article, we decided to examine 500 examples of each
adjective randomly chosen by the search function, taken from both written and
oral language, and with only one example from any given text. Although 500 is
only 1.7%, 7.8%, and 11.2% respectively of the totals available for these words,
this number proved workable from a practical standpoint in terms of downloading and producing a clean set of examples.
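
The sampling constraint described above (a fixed number of random examples, with at most one example per text) can be sketched as follows. This is only an illustration under stated assumptions: the input format (pairs of text identifier and concordance line) and the function names are hypothetical, and the BNC search function used for the study may well work differently.

```python
import random
from collections import defaultdict


def sample_one_per_text(hits, n, seed=0):
    """Pick up to n concordance lines, at most one from any given text.

    hits is an iterable of (text_id, line) pairs; this layout is an
    assumption made for the sketch, not the BNC's own export format.
    """
    by_text = defaultdict(list)
    for text_id, line in hits:
        by_text[text_id].append(line)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    text_ids = sorted(by_text)
    chosen = rng.sample(text_ids, min(n, len(text_ids)))
    return [(text_id, rng.choice(by_text[text_id])) for text_id in chosen]


def sample_share(sample_size, total):
    """Percentage of all available examples covered by the sample."""
    return 100 * sample_size / total


# Totals reported above for the three adjectives.
for total in (28698, 6438, 4478):
    print(f"{total}: {sample_share(500, total):.1f}%")
```

Sampling text identifiers first, and only then one line within each chosen text, guarantees the one-example-per-text condition by construction.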

3. Analysis of the dictionary entries and comparison with information extracted from the corpus sample

The three adjectives, cold, high and odd, exhibit varying degrees of polysemy, as
can be seen in the summary of the definitions listed in the monolingual dictionaries in Table 1. AHD3 consistently makes more sense distinctions than the
other two monolingual dictionaries. We believe this difference may be attributed
to two factors: (1) AHD3 is a more comprehensive dictionary, with more entries
than either of the other two dictionaries; and (2) unlike Cobuild and CIDE,
AHD3 is not addressed to foreign learners of English, and thus includes less frequent, even uncommon, senses of words which are unlikely to be consulted by
advanced learners but nevertheless are not insignificant in the context of a comprehensive dictionary for native speakers. The three dictionaries differ somewhat
from one another in the number of senses assigned to each adjective, but in this
respect we think the best guide is CIDE, which has made a noteworthy effort to
limit itself to only the basic senses (which is particularly important in bilingual
dictionaries), whereas the other two, especially AHD3, tend to assign a separate
sense or subsense to every nuance of meaning. We will therefore be referring
mainly to CIDE when we discuss the number of senses of each word.
Table 1. Definitions in the monolingual dictionaries

        AHD3                                    Cobuild                             CIDE
cold    10 senses, divided into 18 subsenses    8 senses, plus some expressions     2 senses
high    13 senses, divided into 22 subsenses    15 senses, plus some expressions    7 senses
odd     7 senses, divided into 9 subsenses      5 senses, plus some expressions     5 senses


3.1 The adjective cold 3

Cold is a polysemous adjective in English with two main senses: (1) 'having a
low temperature', and (2) the metaphorical sense of 'unfeeling' or 'unfriendly'.
Both of these senses correspond almost exactly to the two main senses of frío, in
Spanish, and fred, in Catalan. Tables 2 and 3 below show the order of senses in
the monolingual and bilingual dictionaries respectively.
Table 2. Treatment of cold in monolingual dictionaries
[Columns: AHD3 (10 total senses), Cobuild (8 total senses), CIDE (2 total senses). Rows (senses of cold): low temperature; cold colors; trail or scent.]

Table 3. Treatment of cold in bilingual dictionaries
[Columns: OSD (4 total senses), GL (6 total senses), S&S (12 total senses), DAC (Catalan) (1 total sense). Rows (senses of cold): low temperature; cold colors; trail or scent.]

The data in Tables 2 and 3 show that all the dictionaries consulted, even the
one with the simplest structure (DAC), give the sense 'low temperature' first,
thus reflecting the intuition of the lexicographers that it is the most frequently
used sense. The sense 'unfriendly' or 'unfeeling' is generally, but not always,
given in second position. Some dictionaries acknowledge up to five additional
different senses, although we feel that these could be included in one of the two
main senses (as in CIDE). The varying number of senses in the dictionaries
reflects two different problems: (1) exactly what constitutes a separate sense is
not always clear, even to trained lexicographers; and (2) some dictionaries list
highly lexicalized examples as separate senses, even though the meaning could
be included as part of an earlier sense. This latter issue is particularly evident in
the case of bilingual dictionaries and explains why there may be more senses
listed for a word in a bilingual dictionary than in a monolingual dictionary.
It is a fact that English cold and Spanish frío / Catalan fred generally coincide in terms of physical reference and possible metaphorical contexts. This
can be seen from the entry for cold in the OSD, in which the same equivalent
(frío) is provided for cold numerous times.
cold1 adj 1 <water/weather/drink> frío; I'm ~ tengo frío; my feet are ~ tengo los pies
fríos, tengo frío en los pies; it's ~ today/in here hoy/aquí hace frío; the soup is ~ la
sopa está fría; I'm getting ~ me está entrando frío; it's getting ~ está empezando a
hacer frío; your dinner's getting ~ se te está enfriando la comida; the water has gone ~
el agua se ha enfriado; the engine starts straight from ~ without fail el motor arranca
en frío sin fallar; the trail has gone ~ se han borrado las huellas; the news was already
~ la noticia ya estaba pasada or añeja; no, you're still ~, getting ~er (in game) no,
frío, más frío; → blow2 vi 1(a)
2 (a) (unfriendly, unenthusiastic) <person/stare/color> frío; I got a very ~ reception
me recibieron con mucha frialdad or muy fríamente, la recepción que me dieron fue
muy fría; to be ~ TO or WITH sb tratar a alguien con frialdad, estar*/ser* frío con
algn; to go ~ on sth: I went ~ on the idea (colloq) la idea dejó de hacerme gracia
(fam); to leave sb ~: that leaves me ~ (colloq) (eso) me deja frío or tal cual (fam),
(eso) no me da ni frío ni calor (fam) (b) (impersonal) <logic> frío; keeping to the ~
facts ateniéndose únicamente a los hechos
3 (unconscious) → out2 1(b)
4 (without preparation) sin ninguna preparación; I came to the job ~ empecé el trabajo sin ninguna preparación; I was expected to start from ~ esperaban que
empezara sin ninguna preparación.

Figure 1. Entry for cold in the OSD

Such repetition is in no way limited to the OSD, but quite commonplace in
entries that represent what we call overlapping polysemy. In overlapping polysemy, a word in one language is polysemous, and there exists an equivalent word in
the other language that, by and large, exhibits the same polysemy. Overlapping
polysemy is a manifestation of what Sinclair (1996: 179) termed parallels
between the textual environment of a word in one language and a word that is
used to translate it in another. At this point we are not as concerned with the
causes behind overlapping polysemy as we are with its effects on bilingual lexicography, although we might speculate that a cognitive linguistics approach to
metaphor in language could be enlightening. We have found examples of overlapping polysemy in English on the one hand and Spanish and Catalan on the
other in all word classes, for example: verbs, Eng. run / Sp. correr, Cat. córrer
('walk quickly' and 'run a risk'); prepositions, Eng. before / Sp. antes, Cat. abans in
both spatial and temporal contexts; nouns, Eng. dough / Sp. and Cat. pasta referring both to a mixture of flour and water and to money; and adverbs, Eng. naturally / Sp. naturalmente, Cat. naturalment meaning 'in a natural (as opposed to
unnatural) way' or expressing the expectedness of an outcome. The existence of
overlapping polysemy has not gone unnoticed in the literature; for example,
Tognini-Bonelli (1996: 207–214) discusses a case of overlapping polysemy with
reference to English real / Italian reale in some detail from the perspective of using
corpora to identify translation equivalents.
Given the overlapping polysemy exhibited by cold/frío-fred, we might have
expected that the entries in the bilingual dictionaries would have taken advantage of the overlap and would hence turn out to be simpler and shorter than
those in the monolingual dictionaries. In fact, however, several situations occur:
the structure in the DAC is quite simple; the structure in the OSD is somewhat
more complex, but is less so than that of either GL or S&S, both of which contain long entries with many senses. In fact, in these latter two dictionaries it
appears that little to no attempt has been made to organize the material.
After our initial analysis of the entries for cold in the bilingual dictionaries,
we are now in a position to identify areas in which the dictionaries differed
from one another and which are, perhaps, potential points for improvement:

- the design of the entry, to take advantage of overlapping polysemy when it exists;
- the criteria for ordering the equivalents in the entry; and
- decisions determining which set phrases or idioms should be afforded equivalent translations in the entry.
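
The first of these points, designing an entry so that it takes advantage of overlapping polysemy, can be illustrated with a small sketch that groups the senses of an entry by the equivalent they share. The sense list below is a simplified paraphrase of the OSD entry for cold in Figure 1; the data layout, the function names and the "overlap ratio" measure are assumptions introduced only for illustration and are not part of any dictionary's actual structure.

```python
from collections import defaultdict

# Simplified senses of "cold", loosely following Figure 1 (OSD).
# Each item: (sense number, paraphrased sense label, main equivalent).
OSD_COLD = [
    ("1", "low temperature", "frío"),
    ("2a", "unfriendly, unenthusiastic", "frío"),
    ("2b", "impersonal", "frío"),
    ("4", "without preparation", "sin ninguna preparación"),
]


def group_by_equivalent(senses):
    """Collect the senses that are translated by the same equivalent."""
    groups = defaultdict(list)
    for number, label, equivalent in senses:
        groups[equivalent].append((number, label))
    return dict(groups)


def overlap_ratio(senses):
    """Share of senses covered by the single most widely used equivalent."""
    groups = group_by_equivalent(senses)
    largest = max(len(members) for members in groups.values())
    return largest / len(senses)


for equivalent, members in group_by_equivalent(OSD_COLD).items():
    print(equivalent, [number for number, _ in members])
print(overlap_ratio(OSD_COLD))
```

A high ratio signals that a single equivalent could head the entry once, with the remaining senses listed as exceptions, instead of being repeated under every sense.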

We now turn to the corpus data relating to cold, to see if it bears on any or all of
these issues. We were able to use 489 of the 500 examples containing cold that
were downloaded from the BNC; in 11 examples the context provided by the
search was not explicit enough to determine the sense of cold being used. As
seen in Table 4, the sample showed that the low temperature sense is by far the
most frequently used in English.
Table 4. Cold in the BNC sample
[Columns: senses of cold; number of examples in BNC sample. Rows: low temperature; giving impression of low temperature; expression: cold war; expression: in the cold light of morning/day/dawn; expression: cold comfort; expression: cold feet; expression: cold shoulder; other collocations.]


The group of 358 includes several collocations in which the sense 'low temperature' was clear to us (e.g. cold sweat), and 22 figurative uses such as that exemplified by the expression reality closed its cold hand around her, in which the sense of
low temperature was still evident. The sample included a significant number of
lexicalized collocations, which are particularly important for bilingual dictionaries since they can constitute exceptions to the almost perfect equivalence between
cold and frío/fred. The most frequent of these collocations was cold war (35 cases),
the equivalent of which is the loan translation guerra fría / guerra freda, and
this large number of cases no doubt reflects the fact that much of the textual base
of the BNC is journalistic and thus concerned with politics. There is also a third,
metaphorical, sense of cold which we identify as 'giving the impression of low
temperature'. This sense is used in contexts in which cold is applied to nouns
referring to color, light or appearance (e.g. cold grey/gleam/outlines/full moon).
All of the above-mentioned senses exemplified in the corpus data, and others which did not appear in our 500 examples but which no doubt would have
turned up in a larger sample from the corpus, such as cold sore, are present in
the dictionaries examined. In addition, the dictionaries assigned the most frequent meaning to the first sense in the entry. The only exception to these observations is in the cold light of day, which surprisingly is not present in any of the
dictionaries in spite of being the second most frequent lexicalized collocation.
Moreover, the meaning of this expression is opaque and would not be immediately understood by foreign speakers.

3.2 The adjective high

Tables 5 and 6 show the order of senses for the polysemous adjective high in the
monolingual and bilingual dictionaries consulted. We may first note that, as in
the case of cold, AHD3 makes more sense distinctions than the other two dictionaries. The first two or three senses refer to physical height in all three dictionaries; what is perhaps surprising is the variation in the order of the other senses:
for example, the sense referring to the foul smell of meat is third out of thirteen
in AHD3, last of seven in CIDE, and absent from Cobuild.
Table 5. Treatment of high in monolingual dictionaries
[Columns: AHD3 (13 total senses, with 22 subsenses), Cobuild (15 total senses), CIDE (7 total senses). Rows (senses of high): above average; mental state; bad smelling.]

Table 6. Treatment of high in bilingual dictionaries
[Columns: OSD (6 total senses), GL (27 total senses), S&S (18 total senses), DAC (Catalan) (1 total sense). Rows (senses of high): above average; mental state; bad smelling.]


Table 6 clearly shows that the treatment of high in the bilingual dictionaries is
even more diverse. At one extreme there is GL, which distinguishes 27 different
senses, not counting the large number of examples provided; for its part, S&S
includes 18 senses. We note that in both these dictionaries the number of senses
is greater than that given by the monolingual dictionaries, although it must be
said that neither of these two bilingual dictionaries includes subsenses, which
might have lowered the number of main senses substantially. OSD divides its
entry into 6 senses, and the English/Catalan dictionary, DAC, as it did with
cold, lists only one basic sense divided into two subsenses, one for the physical
sense and the other for the metaphorical sense. Although the structure of the
entry is much simpler in the DAC than in the other dictionaries, the entry itself
is not much shorter because there are many examples under each subsense.
And, like the monolingual dictionaries discussed above, the bilingual dictionaries list equivalents for the physical reference of the adjective first and then
differ as to the order of the derived senses presented.
The study of the 500 examples taken from the BNC (492 of which we were
able to use) confirms that high is indeed a highly polysemous word. Table 7
summarizes the distribution of senses we found in our sample. Perhaps the
most striking aspect of this information is the fact that the historically original
sense of 'extending (relatively) far upwards' or 'placed at a great distance from
the bottom', i.e. the sense giving a physical description that is listed first by all
the dictionaries examined, is obviously not the most frequent, although it is by
no means rare. The most frequently found sense was that of 'situated at the top
part of the scale' when applied to nouns describing objectively measurable
qualities, such as pressure (13), rate(s) (18), degree (10), price (10), cost(s) (10),
level (29), proportion (9). A closely related sense, but with the further component of 'good' or 'positive', is also frequently found, and occurs when the adjective is paired with nouns describing qualities involving a subjective assessment:
quality (13), class (3), standard(s) (17), reputation (4), and performance (3). Yet
another frequent meaning is that of 'important, above others in its class',
which we found in collocations such as high court (16), high school (12), high
commissioner (7), and high priest (3). There are a few cases in which high is
applied to sound, as in high pitch. Finally, high is present in a number of collocations such as high street, high profile, high priority, high time, etc.

Table 7. High in the BNC sample
[Columns: senses of high; number of examples in BNC sample. Rows: top of scale; top of scale + good; top of musical scale; important, above others; physical height; expression: high street; expression: high profile; other collocations.]
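
Collocate counts of the kind reported for high (for instance level, rate(s) or court) could be obtained from a set of concordance lines along the following lines. This is a minimal sketch, assuming the lines are available as plain strings; the function name and the toy lines are hypothetical, and only the word immediately to the right of the adjective is counted, with no lemmatisation (so rate and rates would be tallied separately).

```python
import re
from collections import Counter


def right_collocates(lines, node):
    """Count the word immediately following the node word in each line."""
    pattern = re.compile(rf"\b{re.escape(node)}\s+([a-z]+)", re.IGNORECASE)
    counts = Counter()
    for line in lines:
        for match in pattern.finditer(line):
            counts[match.group(1).lower()] += 1
    return counts


# Invented lines, used only to show the shape of the output.
toy_lines = [
    "The high court ruled today.",
    "She went to high school.",
    "The High Court met again.",
    "Prices were high.",
]
print(right_collocates(toy_lines, "high").most_common())
```

Such raw counts only propose candidates: deciding whether a frequent pair is a lexicalized collocation or an ordinary use of one of the senses remains a manual, lexicographic judgement.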
To summarize up to this point, we have seen that the bilingual dictionaries
examined take little notice of overlapping polysemy, even when it is almost
complete (the case of cold/frío/fred). The main exception to this observation is
the DAC, which gives one main equivalent and then several expressions. The
bilingual dictionaries differ from one another with regard to the order of
equivalents, especially in the case of figurative senses, and with regard to which
equivalents are included. And, finally, relatively small samples from the corpus
contained frequent fixed expressions and collocations that are legitimate candidates for translation equivalents. Not all of these expressions are included in
the dictionaries under examination.

3.3 The adjective odd

To judge from the number of senses listed in AHD3, odd is not as polysemous as
either high or cold, and examination of the bilingual dictionaries shows us that
it exhibits practically no overlapping polysemy with its equivalents. The five
descriptors used by CIDE for odd, viz. 'strange', 'separated', 'numbers', 'not
often', and 'approximately' (as an affix), do not correspond to either one or two
adjectives in Spanish or Catalan. Since overlapping polysemy does not really
come into play with this adjective, we must concentrate on the other two issues
that we have identified as needing improvement: order of equivalents and
inclusion of fixed expressions. Table 8 shows the order of senses in the
monolingual dictionaries (we have included the historical order of senses,
according to the Oxford English Dictionary, in the final column as a point of
comparison), and Table 9 the order of equivalents in the bilingual dictionaries.

Table 8. Treatment of odd in monolingual dictionaries
[Columns: AHD3 (7 total senses), Cobuild (5 total senses), CIDE (5 total senses), historical order (Oxford English Dictionary). Rows: senses of odd, including 'not often'.]

Table 9. Treatment of odd in bilingual dictionaries
[Columns: OSD (4 total senses), GL (19 total senses), S&S (9 total senses), DAC (Catalan) (5 total senses). Rows: senses of odd, including 'not often'.]

The sample from the BNC shows that odd is much more common in the
sense of 'strange' than in any of the other senses; the mathematical sense and
the sense of 'not matched' or 'not part of a pair', which are historically older
and, we believe, still important senses of this word, were quite few in number.
The figures from the corpus sample are given in Table 10.

Table 10. Odd in the BNC sample
[Columns: senses of odd; number of examples in BNC sample. Rows: numbers = not even; not matched; expression: odd jobs; expression: odd man out.]


The expression odd man out, which is present in the bilingual dictionaries and
explicitly explained in both Cobuild and CIDE, turns out to be less frequent
than the phrase odd jobs. If we compare these findings with the entries in the
bilingual dictionaries, we see that OSD and DAC list the 'strange' equivalent
first, but note that S&S begins with suelto 'unmatched', historically the first
meaning, while GL begins with impar 'not even' and gives the reader the
'strange' sense in 13th position.
This example brings up an important issue in relation to corpus-based lexicography, namely how to evaluate senses with relatively few occurrences in the
corpus. The 'not matched' sense of odd was relatively infrequent in our sample,
which might lead lexicographers to omit it from a bilingual dictionary, but we
believe that would be an error because this meaning cannot be derived from
other information and it is perceived by speakers as a basic sense of the word.
Frequency data alone are not enough to determine the inclusion of a sense in a
general purpose dictionary. By contrast, the same number of occurrences
should be interpreted quite differently when dealing with an expression. Our
sample shows that the expression odd jobs is quite frequent and always used in
the plural (with the meaning of 'several unrelated, not regular jobs'), and this
information can help guide lexicographers in their choice of examples.
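
The observation that odd jobs always occurs in the plural is the kind of pattern that can be checked mechanically once a sample has been downloaded. The sketch below is an assumption-laden illustration: it presumes concordance lines as plain strings and a noun with a regular -s plural, and the function names and toy lines are invented for the example.

```python
import re


def number_profile(lines, adjective, noun):
    """Count singular and plural occurrences of adjective + noun.

    Only regular plurals in -s are handled in this sketch.
    """
    adj, nn = re.escape(adjective), re.escape(noun)
    singular = re.compile(rf"\b{adj}\s+{nn}\b", re.IGNORECASE)
    plural = re.compile(rf"\b{adj}\s+{nn}s\b", re.IGNORECASE)
    return {
        "singular": sum(len(singular.findall(line)) for line in lines),
        "plural": sum(len(plural.findall(line)) for line in lines),
    }


def always_plural(profile):
    """True when the expression is attested, and only in the plural."""
    return profile["plural"] > 0 and profile["singular"] == 0


# Invented lines, used only to show the behaviour of the functions.
toy_lines = [
    "He did odd jobs around the house.",
    "Odd jobs kept him busy all summer.",
]
print(always_plural(number_profile(toy_lines, "odd", "job")))
```

A result of this kind argues for citing the expression in its plural form in the entry, instead of under a singular headword-style example.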


A major problem with equivalents in bilingual dictionaries is the identification
of the range of semantic contexts in which the equivalent provided by the dictionary can be used. Adjectives that have more than one sense are used in a
variety of lexical contexts, so it might seem that the starting point for determining which equivalents should be included in a bilingual dictionary is the information, and specifically the sense distinctions, provided by a monolingual dictionary. Several bilingual dictionaries covering the language combinations we
have considered here are based on the information from monolingual dictionaries, although this is not openly stated. For instance, it seems that the order
of equivalents presented in Simon and Schuster's International Dictionary corresponds to the order of senses as presented in Merriam-Webster's Collegiate
Dictionary of English (which is historical). However, the sense distinction in a
monolingual dictionary, whether it be addressed to native speakers or to
advanced-level foreign-language learners, may, in practice, not be right in the
context of a bilingual dictionary precisely because both languages are not taken
into account from the very beginning. In our opinion, the fact that bilingual
dictionaries do not seem to be conceived of as bilingual, contrastive works, but
instead take as their starting point a monolingual dictionary (or, at most, a
description of only one of the languages), does not always yield optimal results.
Since, by and large, bilingual dictionaries are not written from the perspective
of both languages, they do not take into account phenomena such as overlapping polysemy, which can only make sense when there are two languages
involved. The prime role played by both languages in identifying translation
equivalents explains why data from a monolingual corpus will not be relevant
to this important issue for bilingual lexicography, because a monolingual corpus is simply not constructed from the standpoint of two languages.
Although we do not believe that a monolingual corpus can resolve problems resulting from overlapping polysemy, we have seen that it can play an
important role in two other, equally important, issues facing the bilingual lexicographer, namely choosing the order in which to present equivalents and
determining which fixed expressions to include in a specific entry. In a case like
that of odd, the fact that the word is used much more often in the sense of
'strange' than in the mathematical sense of 'not even' can be used as an argument for the equivalent of the more frequent sense coming first (this argument
seems to have been used by OSD and DAC). We are not arguing explicitly for
the order of equivalents always to be based on corpus data, but the GL entry
that buries the 'strange' sense in the 13th position (out of a total of 19) is not, in
our opinion, as useful to the reader as it could be. In the case of cold, the expression in the cold light of day/morning/dawn is relatively frequent in a large corpus
of English, yet contrastive analysis shows that it is opaque to Spanish and
Catalan speakers; here the corpus data helps us to determine which expressions
should warrant translations in the dictionary entry.
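
The ordering argument made here can be expressed as a simple procedure: take the senses in the order an entry currently lists them and re-sort them by their frequency in a corpus sample. The sketch below is illustrative only; the sense labels are paraphrases, and the counts are invented placeholders, not the figures from our BNC sample.

```python
def order_by_frequency(senses, counts):
    """Sort senses by descending corpus count; ties keep the entry's order."""
    return sorted(senses, key=lambda sense: -counts.get(sense, 0))


def position(senses, sense):
    """1-based position of a sense in an ordered list of senses."""
    return senses.index(sense) + 1


# A hypothetical entry order for "odd" and invented sample counts.
entry_order = ["not even", "unmatched", "occasional", "strange"]
invented_counts = {"strange": 300, "not even": 12, "unmatched": 9}

reordered = order_by_frequency(entry_order, invented_counts)
print(reordered)
print(position(entry_order, "strange"), position(reordered, "strange"))
```

Senses without any attestation in the sample (here 'occasional') are moved to the end and not dropped, in keeping with the point made above that frequency data alone should not determine whether a sense is included.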
We thus conclude by saying that a monolingual corpus does have a role to
play in the preparation of a general purpose bilingual dictionary. Bilingual dictionaries that do not take frequency data into account in the organization of
their entries, such as S&S and GL, have been criticized in this article for the way
they arrange the information the lexicographers have chosen to include.
However, we should like to point out that we ourselves often use these dictionaries because they provide a wealth of information, especially in the form of
translation equivalents. The information is there but it is poorly organized, and
translators, whose professional obligations require them to search for the right
equivalent, are willing to spend the time and effort necessary to plod through
the entries. Conversely, a bilingual dictionary with significant gaps in coverage,
no matter how well organized the information may be, is going to be found
lacking. That is precisely our experience with the DAC, a dictionary with
well-structured entries, as we have seen in the cases of cold and high; nevertheless, in our opinion the dictionary contains too few translation equivalents for
the level of advanced learners, not to mention that of translators. Those of us
who teach know that second-language learners, some of the main users of general purpose bilingual dictionaries, often do not read whole entries; that is why
the structure of these reference books needs to be the best possible. We hope to
have suggested at least two ways in which use of a monolingual corpus can be
fruitful in this task.


Notes

Work on this paper was supported by grant PB960305 from the Spanish Ministry of
Education and Culture, which we hereby acknowledge.
2. We note that the institute we are affiliated with, the Institut Universitari de Lingüística
Aplicada of Pompeu Fabra University, is currently building a multilingual corpus for some
languages for specific purposes that includes parallel production in English, Spanish and
3. In order to show the range of senses for the three adjectives as clearly as possible, we
present the data from the monolingual dictionaries first, followed by that from the bilingual
dictionaries, although in carrying out our research we started with the bilingual dictionaries
because as teachers of lexicography and translation we knew the entries would prove to be
different from one another.

References

Al-Kasimi, A. M. 1983. The interlingual/translation dictionary. Dictionaries for translation. In Lexicography: Principles and Practice, R. R. K. Hartmann (ed.), 153–162. London: Academic Press.
American Heritage Dictionary, 3rd ed. 1992. Boston: Houghton Mifflin.
Aston, G. and Burnard, L. 1998. The BNC Handbook. Edinburgh: Edinburgh University
Cambridge International Dictionary of English. 1995. Cambridge: Cambridge University
Collins Cobuild English Dictionary, 2nd ed. 1995. London: HarperCollins.
Diccionari anglès-català. 1983. Barcelona: Gran Enciclopèdia Catalana.
Hartmann, R. R. K. and James, G. 1998. Dictionary of Lexicography. London and New York:
Larousse Gran Diccionario Español-Inglés/English-Spanish. 1983. Mexico City: Larousse.
Merriam-Webster's Collegiate Dictionary, 10th ed. 1996. Springfield, Mass.: Merriam
Oxford English Dictionary, 2nd ed. 1991. Oxford: Oxford University Press.
The Oxford Spanish Dictionary. 1994. Oxford: Oxford University Press.
Simon & Schuster International Dictionary English-Spanish/Spanish-English. 1971. New York: Simon & Schuster.
Sinclair, J. 1996. An International Project in Multilingual Lexicography. International Journal of Lexicography 9: 179–196.
Teubert, W. 1996. Comparable or Parallel Corpora? International Journal of Lexicography 9: 238–264.
Tognini-Bonelli, E. 1996. Towards Translation Equivalence from a Corpus Linguistics Perspective. International Journal of Lexicography 9: 197–217.
Tomaszczyk, J. 1983. On bilingual dictionaries. The case for bilingual dictionaries for foreign language learners. In Lexicography: Principles and Practice, R. R. K. Hartmann (ed.), 41–51. London: Academic Press.
Toury, G. 1995. Descriptive Translation Studies and Beyond. Amsterdam and Philadelphia: John Benjamins.

Computerised set expression dictionaries

Analysis and design
Sylviane Cardey and Peter Greenfield
Although machines are useful in advancing and verifying the work of the linguist, there remains much core work for which only the linguist is competent.
In the area of lexis, such work is essentially lexicographic in nature (conception, understanding and organisation). To provide evidence for this claim,
problems and results arising from the construction of computerised set
expression dictionaries are presented, as well as the problems encountered in
the automatic recognition of set expressions in texts.


There is a tendency to forget that the role of linguists, lexicologists and lexicographers is the one that is the most important in the creation of dictionaries
even if these are electronic, automated or computerised. This is also true of
applications of natural language processing such as aided or automatic translation. What we attempt to show in this paper is that the results do indeed
depend in large part, if not wholly, upon the analytical power and intuition of
the linguist, the computer serving only as a tool for collecting, organising and
verifying either what the linguist needs or what the linguist intuitively thinks.
We present results arising from the construction of set expression dictionary systems and explain how these systems have been implemented. These
systems are concerned with the automatic recognition and translation of set
expressions in four languages (English, French, Italian and Spanish). In one
dictionary (Limame 1998) the set expressions contain the names of animals
(metaphoric use), as illustrated in example (1), in two others (Thomas 1998,
Morgadinho 1999), parts of the body, as in example (2).

(1) to kill the goose that lays the golden egg
(2) to breathe down someone's neck

2. Set expressions in a cross-linguistic perspective

Sociologically, we can postulate universals, that is to say, the identical conception of reality all over the world. But the world includes societies, and societies involve languages, and this is why it is interesting to take account of this
aspect, because the language of a linguistic community forges the identity of a
people, a fact which is evident in certain expressions.
Let us look at the French expression (3) and the English expression (4).
(3) mettre la charrue avant les boeufs
(4) put the plough before the oxen

The word-for-word translations of these expressions are syntactically correct,
but semantically they do not refer to the same reality, as the meaning of (4) is
not metaphorical.
The cultural dimension has to be taken into account in the process of
translation. Where a French speaker says
(5) quand les poules auront des dents

an English speaker says

(6) when pigs have wings

Culture obviously plays a role in understanding set expressions; inevitably,
nations do not perceive the world in the same way. While the English and
French expressions in (7) and (8) are lexically different, their underlying meaning is identical.
(7) blind as a bat, to kill two birds with one stone
(8) aveugle comme une taupe, faire d'une pierre deux coups

Language reflects reality; in consequence, terms such as cat, nose, dog form the
basis of numerous locutions. These expressions convey a certain environment and
certain habits. For the English, as for the French, the monkey symbolises cleverness, the fox cunning and the mule stubbornness. However, locutions constructed
upon the experiences of everyday life differ from one community to another,
because what is experienced is similar for each person living within a given culture. Clearly, no Westerner sees an elephant in the way that a Congolese does.
Sometimes the translator is faced with two possible English translations of
a given French expression (see 9–11), which causes problems for lexicographic
(9) à vol d'oiseau
(10) as the crow flies
(11) in a bee-line
Certain expressions come from fables:
(12) vendre la peau de l'ours
(13) vendere la pelle dell'orso
(14) sell the bear's skin
(15) se parer des plumes du paon
(16) coprirsi con le penne del pavone
(17) tuer la poule aux oeufs d'or
(18) kill the goose that lays the golden eggs

Whole communities acquire culture from their literature, both contemporary
and past. The Bible, in particular, has bequeathed a great many expressions, as
illustrated by examples 19 to 23.
(19) adorer le veau d'or
(20) worship the golden calf
(21) adorare il vitello d'oro
(22) un chien vivant vaut mieux qu'un lion mort
(23) meglio un asino vivo che un dottore morto

3. Set expression dictionaries

Idioms are the exceptions that prove the rule: they do not get their meaning
from the meanings of their syntactic parts. (Katz 1973 in Moon 1998: 15)

It is now generally recognised that set expressions pose problems for natural language processing. Furthermore, the existence of set expressions is one of the universal characteristics of natural languages. The importance of this phenomenon
became fully apparent as a result of attempts to construct automatic translation
systems. For example, expressions (24) and (25) have nothing in common, either
at the lexical level or at the level of syntax, but they have the same meaning.

(24) se payer la tête de quelqu'un
(25) to make fun of somebody

Set sequences can be subcategorised. Fraser (in Moon 1998: 15) establishes a
hierarchy of seven degrees of idiom frozenness from L6 (completely free) to L0
(completely frozen) and argues that no true idiom can belong to level 6.
Whatever the classification, what really matters is to make apparent the problems posed by set sequences. One can distinguish between the following types
of set expression:
a. pragmatagms (Mel'čuk 1995: 176–177), which are the result of usage and
pragmatic conventions, as illustrated in (26) and (27).
b. set sequences which are semantically blocked; they belong to one of the following three subtypes:
idiomatic expressions, which are composed of non-compositional elements, as in (28) and (29);
semi-idiomatic expressions, where one of the elements is non-compositional and the others are transparent, as in (30);
quasi-idiomatic expressions, which have both a compositional and an
idiomatic meaning, as in (31).
c. non-compositional syntactic sequences such as (32), whose syntactic
structure is malformed.
(26) à consommer avant (best before)
(27) c'est la vie (that's life)
(28) couper les cheveux en quatre (literally: to cut the hair in four; to be a perfectionist)
(29) lever le coude (literally: to lift the elbow; to like a drop)
(30) coûter les yeux de la tête (literally: to cost the eyes of the head; to be terribly
(31) ouvrir la bouche (literally: to open one's mouth; to say something)
(32) mettre pied à terre (literally: to put foot to earth; to dismount (from a horse))

The first stage in compiling a computerised dictionary of set expressions is to
consult existing general-purpose paper dictionaries (mono- or bilingual) and
then to carry out corpus searches on newspapers such as Le Monde, for example. In order to carry out these searches, it is necessary to define one or several
search items. We searched for set expressions which contained names of animals and parts of the body. Our initial search enabled us to retrieve a considerable number of set expressions, some of which were new in the sense that they
are not listed in current dictionaries, but also yielded many irrelevant sequences
which did not contain set expressions because the computer was unable to distinguish between set expressions and fully compositional, free sequences (see
Sections 7 and 8).
Many of the new expressions yielded by our corpus search resulted from
the high degree of variability involved in set expressions. Examples (33) and
(34) illustrate the modification of a set expression by the insertion of spurious
or non-canonical words (Moon 1998: 174). In (33) two prepositional phrases
are added to the standard set phrase prendre le taureau par les cornes, while in
(34) the adjective noirs is added to chats in the expression avoir d'autres chats
(33) Le président a pris le taureau du chômage par les cornes de la fiscalité.
(34) Le premier ministre a d'autres chats noirs à fouetter avant d'arriver à ses fins.

This technique is often used by the press to draw the reader's attention. Given
that the degree of frozenness is entirely arbitrary, all expressions possess a
degree of liberty and in consequence every expression can be unfrozen according to the speaker's or writer's needs. It is always possible to unfreeze an expression, even the most frozen, in order to produce an amusing effect or to surprise.
However, unfrozenness is impalpable and arbitrary: anybody can add, delete or
substitute components of set expressions at will. This variability makes corpus
searches extremely time-consuming.

4. Building a set expression inventory

En terminologie, un corpus est un ensemble de textes homogènes, c'est-à-dire
traitant du même domaine, rédigés et utilisés par le même type de personnes
et dans des conditions semblables.1 (Dauphin 1997: 17)

One of the aims of our work was the development of a system able to recognise
set expressions in context. To do this, we first had to draw up an inventory of set
expressions which would subsequently be used for designing our recognition
rules. Even if we had an exhaustive inventory of the set expressions, this does
not mean that they would easily be recognised in context by a machine. As
illustrated in the preceding section, set expressions are not always stable.
We first examined more than 50 specialised and non-specialised mono- and bilingual dictionaries, including current dictionaries and dictionaries of
idioms, in a variety of forms (paper, CD-ROM and on-line). These dictionaries
are too numerous to be cited in this paper; but the major ones are listed in the
Reference section (Chan 1999, Limame 1998, Morgadinho 1999, Thomas
1998). The sequences that we found in these dictionaries met the first condition of frozenness, that is to say, a conventional usage. They are recognised as
set expressions by users of the language, and any other sequences would look
strange to a native speaker. For example, to indicate the date of validity of a
product, French would say À consommer avant whereas English could only
use Best before (meilleur avant), whilst Polish speakers say Data spożycia... (date de
consommation) and Germans Mindestens haltbar bis (utilisable au moins
jusqu'à). The meaning of these expressions is transparent for a native speaker, and they cannot be replaced by equivalent sequences, even if we could say in
French À garder jusqu'à, Ne pas manger après, Date limite d'utilisation. We
can thus be certain that a given sequence is not an ad hoc metaphor, but that it
possesses the same meaning for all speakers of the language. We then looked at
journalistic text corpora such as Le Monde on CD-ROM in order to see if there
were other set expressions in current use which do not appear in the dictionaries. We also integrated some well known expressions frequently used in spoken
language which had not been found in the corpora that we had examined.
The starting point in the search for set expressions in electronic corpora was
the individual words (animal or parts of the body). As one might expect, we had
to deal with a very large number of non-idiomatic uses of the words. It took a
considerable length of time to sort the relevant sequences manually in the search
for set expressions but there was no other solution. As we had expected, we
found new expressions (33–36), many of which are metaphorical extensions of
existing set expressions such as se jeter dans les bras (de quelqu'un) in (35).
(35) À ceux qui désespèrent et qui voudraient se jeter dans les bras tendus du
parti de
(36) En stigmatisant "l'Europe des technocrates", il pointe du doigt les carences
futures du traité de
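
The keyword-driven search described above can be pictured as a simple concordance pass: every occurrence of a search word is pulled out with its surrounding context, and a lexicographer then sorts idiomatic from free uses by hand. The following is a minimal sketch only; the function name `kwic`, the context width and the sample sentences are assumptions made for illustration and are not taken from the systems described in this chapter.

```python
import re

def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word hit."""
    hits = []
    # \b stops 'bras' from matching inside a longer word such as 'embrasser'.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

# Invented sample text: one metaphorical use, one literal use, one non-hit.
sample = ("Il voudrait se jeter dans les bras tendus du parti. "
          "Elle s'est cassé le bras en tombant. Il faut l'embrasser.")

for left, word, right in kwic(sample, "bras"):
    print(f"{left:>30} [{word}] {right}")
```

The sketch retrieves both the metaphorical and the literal use of bras, which is exactly why the manual sorting stage reported here cannot be skipped.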

5. Criteria for delimiting and selecting set expressions

The first problem is determining where the set expression starts and ends. The
difficulty with sequences such as (37) and (38) is that of deciding whether the
whole sequence needs to be included or only part of it. In this case as well as in
examples (39–44) only the non-verbal part was included (par cœur, à gorge
déployée, etc.). The same decision was taken in the case of examples (47) and
(48) in spite of the fact that à cœur, unlike par cœur, is not listed independently
in dictionaries. However, in other cases, such as (48), it seemed justified to
include the verb (avoir du cœur).
(37) apprendre par cœur
(38) connaître par cœur
(39) verb + à gorge déployée (sing, laugh)
(40) verb + au doigt et à l'œil (obey, listen)
(41) (être) au cœur de quelque chose
(42) (arriver) comme un cheveu sur la soupe
(43) (être) dans les bras de Morphée
(44) (avoir) quelque chose sur les bras (Avec cette affaire sur les bras, je ne finirai pas avant demain)
(45) avoir quelque chose à cœur
(46) prendre à cœur de faire
(47) tenir à cœur
(48) avoir du cœur

The other important decision to be made in selecting set expressions for inclusion in a multilingual dictionary is whether to include only expressions that
belong to the same register or to list all translations, whatever the register. We
opted for the latter. Our dictionary therefore contains set expressions belonging to the following registers: literary (49), colloquial (50–51) and slang (52).
(49) nourrir un serpent sur son sein
(50) se faire du mauvais sang
(51) donner sa langue au chat
(52) l'avoir dans l'os

The number of expressions involving parts of the body retrieved using these
selection criteria was 595.

6. Automatic recognition of set expressions

Having established our list of set expressions, we then had to deal with the
problem of recognising set expressions in the source text (to be translated),
identifying translation equivalents for automatic translation and building a
bilingual dictionary. A close scrutiny of the set expressions shows that certain
parts are fixed whilst others are not. Examples (53–56) show that certain elements can be inserted or omitted, while examples (57–58) illustrate variations
in word order.
(53) enlever une (belle) épine du pied
(54) aller (droit) au cœur
(55) sell the bear's skin (before one has caught the bear)
(56) vendere la pelle dell'orso
(57) mettre quelqu'un au pied du mur
(58) il a été mis au pied du mur

In order to recognise set expressions automatically, it is crucial to identify the
fixed part of a set expression, which is composed of those elements that are indispensable to the existence of the expression, be they variable or not. It is also necessary to take into account all the possible variants, as illustrated in (59–61).
(59) se taper/se cogner la tête contre les murs
(60) All is fish that comes to/in his/her net.
(61) briser/déchirer le cœur

For each expression, all the alterations have to be listed. These alterations are
found at several levels. At the morphological level, variations in verbal and
nominal inflexion are found, as in (62) and (63). At the lexical level, there are
cases of omission and addition of elements (33–35) as well as synonymous paradigms (59–61).
(62) Porter la main sur Pierre, ça ne me serait jamais venu à l'idée.
(63) Tu as porté la main sur Pierre.

In addition, there are also alterations at the structural level since many set
expressions can undergo syntactic transformations such as passivisation, nominalisation or pronominalisation (see Moon 1998: 104–119).
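
One way to picture the listing of alterations is as a pattern in which each fixed slot carries its paradigm of admissible forms and a small number of free words may be inserted between slots. The sketch below is an illustration under stated assumptions, not the recognition rules actually used by the authors: the slot notation, the function `compile_expression` and the limit of two inserted words are all hypothetical, and the inflected forms are listed by hand rather than generated by a morphological component.

```python
import re

def compile_expression(slots, max_inserted=2):
    """Build a regex from slots; each slot is a list of admissible forms.

    Up to max_inserted free words are tolerated between consecutive slots,
    which covers insertions such as 'du chômage' in example (33).
    """
    gap = r"\s+(?:\w+\s+){0,%d}?" % max_inserted
    alternatives = [
        "(?:" + "|".join(re.escape(form) for form in forms) + ")"
        for forms in slots
    ]
    return re.compile(r"\b" + gap.join(alternatives) + r"\b", re.IGNORECASE)

# Paradigm from (59): taper and cogner are admissible, s'appuyer is not.
TETE = [["se taper", "se cogner", "se tape", "se cogne"],
        ["la tête"],
        ["contre les murs"]]

# Fixed part of 'prendre le taureau par les cornes', as modified in (33).
TAUREAU = [["prendre", "prend", "pris"], ["le taureau"], ["par les cornes"]]

tete = compile_expression(TETE)
taureau = compile_expression(TAUREAU)

print(bool(tete.search("Il se cogne la tête contre les murs.")))
print(bool(tete.search("Il s'appuie la tête contre les murs.")))
print(bool(taureau.search(
    "Le président a pris le taureau du chômage par les cornes de la fiscalité.")))
```

A pattern of this kind only finds candidate sequences. It cannot tell a literal use from an idiomatic one, and it does not handle passivisation or pronominalisation, so the structural alterations mentioned above would need separate treatment.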

7. Set expressions and ambiguity

The great majority of set expressions are also susceptible of a literal interpretation. There are several reasons for this. Firstly, at the lexical level, there are few
words that only exist in set sequences. Most exceptions to this tendency involve
the use of archaic terms, as in (64). Secondly, the syntactic structure is usually
perfectly normal, although there are some exceptions such as (65). Finally,
there is a semantic reason: the idiomatic sense of a set expression is usually
based on its original, literal, sense, as illustrated by example (66).
(64) chercher noise
(65) il y a anguille sous roche
(66) se mettre la corde au cou

An expression is ambiguous when it possesses several senses, which can be:

1. literal and frozen

(67) se faire taper sur les doigts
literal sense: to rap sb's fingers
frozen sense: to rap sb's knuckles

2. frozen and accidental

(68) jouer sa peau
frozen sense: to risk one's neck
accidental sense: il a tué un ours ce matin, il a joué sa peau contre une
bouteille de whisky

3. several frozen senses

(69) prendre le mors aux dents
a. of a horse: to bolt
b. of a human being: to get carried away by passion, anger

8. Resolving ambiguities
In spite of all the difficulties involved, recognition and translation of a number
of set expressions are possible even when these include parts which are not
fixed. Strangely enough, it seems that the ambiguity problem can be resolved
by examining precisely those elements within idiomatic expressions which can
be free. As we have seen, both se taper la tête contre les murs and se cogner la tête
contre les murs are possible, whilst s'appuyer la tête contre les murs is not. There
is widespread research underway establishing and examining the different possibilities for insertion and transformation. The question is whether this work,
which is likely to require much time and effort, will make it possible to distinguish between morphologically identical free sequences and set expressions. In
other words, will computers be able to distinguish between the literal and the
idiomatic use of sequences such as (70–75)?
(70) cook one's goose
(71) see the elephant
(72) on lui a mis l'affaire sur le dos
(73) on lui a mis le fagot sur le dos
(74) il prend la mouche
(75) Jean a le cafard

Some syntactic criteria could be used, such as the fact that set expressions cannot be relativised (*Je lui ai dévoilé mon cœur qui était triste) or pronominalised. However, there are many exceptions to these rules, as illustrated by
examples (76–78).
(76) Je me suis payé sa tête qui s'y est franchement prêtée.
(77) Le peuple a réclamé sa tête qui ne valait pas grand chose.
(78) Il donne sa langue au chat, je donne la mienne aussi.

In reality, there is a great deal of variability in acceptance of these structures;
some native speakers find them acceptable, others reject them. This is partly
due to the fact that set expressions are often part of informal language, which is
less codified than standard language.
The following are some of the cues and strategies that can be used to
resolve ambiguities:
1. using semantic information, whereby the interpretation of a set expression
can be based on a free element. Thus, in example (79), if the subject of prendre
is human, then the expression means to become angry, whilst if the subject is
an animal it means to bolt;
2. using contextual cues, for example in specifying to which domain the text
belongs. This enables certain ambiguous readings to be excluded. For example,
avocat in example (80) cannot be an avocado; the context will probably serve
to indicate that the referent could not be a fruit;
3. using interactive systems by means of which the computer consults the
user in order to obtain pragmatic information capable of aiding the disambiguation process;
4. giving priority to statistically dominant structures. Statistical analysis of
corpora could indicate that a given set expression is more likely to have one
meaning than another;
5. preserving the ambiguity in the hope that it is also present in the target language, as, for example, in (81).

(79) prendre le mors aux dents
(80) se faire l'avocat du diable (to play the devil's advocate)
(81) to take the bull by the horns (prendre le taureau par les cornes)

Although we are aware that all ambiguities cannot be resolved, we have applied
a set of rules which give reasonably good results. These rules involve:
a. applying restrictions to the free elements of frozen sequences; for example,
in the sequence noun1 avoir noun2 à l'œil, if noun2 is human, the meaning
will be to keep an eye on; if noun2 is an object, the meaning is to get something for nothing;
b. examining the type of prepositional group, for example, garder la main de
la fillette (literal sense) vs garder la main au jeu (to play first);
c. specifying the domain of use: ouvrir l'œil has a literal meaning in the medical field, but otherwise means to be watchful;
d. analysing the syntactic structure of a given sentence, since some transformations are not applicable to frozen sequences; for example, sa langue a été
donnée au chat can only have a literal sense while donner sa langue au chat has
two meanings (literal and metaphorical: to give in or up).
In cases where disambiguation of frozen sequences is not possible, the best general strategy seems to be to give preference to the frozen interpretation. However,
we are not totally satisfied by this method as it does not always yield good results.
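
Rule (a), together with the fallback just described, can be sketched as a lookup keyed on the semantic class of the free element. This is a toy illustration only: the lexicon `SEMANTIC_CLASS`, the table `RULES`, the function `interpret` and the choice of which reading serves as the default are assumptions made for the example, whereas the glosses are those given in this chapter.

```python
# Toy semantic lexicon; the words and class labels are assumptions for this sketch.
SEMANTIC_CLASS = {
    "voleur": "human",
    "ministre": "human",
    "repas": "object",
    "cheval": "animal",
}

# Readings per semantic class of the free element. The "default" entry is the
# reading kept when the class is unknown (an arbitrary choice in this sketch).
RULES = {
    "avoir X à l'œil": {
        "human": "to keep an eye on",
        "object": "to get something for nothing",
        "default": "to keep an eye on",
    },
    "prendre le mors aux dents": {
        "animal": "to bolt",
        "human": "to get carried away by passion, anger",
        "default": "to get carried away by passion, anger",
    },
}

def interpret(expression, free_element):
    """Choose a reading from the semantic class of the free element."""
    readings = RULES[expression]
    semantic_class = SEMANTIC_CLASS.get(free_element)
    return readings.get(semantic_class, readings["default"])

print(interpret("avoir X à l'œil", "voleur"))
print(interpret("avoir X à l'œil", "repas"))
print(interpret("prendre le mors aux dents", "cheval"))
```

As with the general strategy discussed above, a fixed default will sometimes be wrong, and a realistic system would need a far larger semantic lexicon than this one.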

9. The MultiCoDiCT dictionary system

In this section we aim to show how set expressions are represented in the
MultiCoDiCT dictionary system (Multilingual Collocation Dictionary Centre
Tesnière) (Greenfield et al. 1999).
As stated earlier, set expressions raise problems as regards their formal
recognition, but they also present semantic problems. This facilitates neither
their access nor their representation in dictionaries. A first step is to choose a
canonical representation for set expressions and to organise them by meaning
equivalences within and across languages, so that the set expressions can be
subsequently accessed in different but prescribed manners (Greenfield 1998).
The MultiCoDiCT dictionary system includes the following:
1. a specialised language for encoding set expressions (and, indeed, collocations in general) in the context of meaning equivalences in a given domain and
across a given set of languages;

2. a kernel which is both language- and domain-independent and which provides access facilities to the set expressions in their encoded form;
3. domain-dependent dictionaries, where a dictionary is a structure comprising a specific set of set expressions for a given domain and across a given set of languages.
The specialised language includes a natural language-independent component allowing the definition of the meaning equivalence relations that can occur between set
expressions within and across languages, and language-specific components
where a given language's typology is defined, this being used in the canonical
descriptions. We have opted for a coding based on canonical forms, for both
word forms and collocations. This decision has been taken in order to ensure
maximum re-usability of the data for possible different ends (manual and automatic). The system as currently implemented is for manual access.

Dictionary headwords
Many specialised dictionaries, whether mono- or multilingual, are restricted to
a special area of knowledge drawn, for example, from the sciences, technology
or skills. This restriction allows them to include numerous words and collocations excluded from general dictionaries and to incorporate everyday words
exclusively in their specialised meanings within a particular field.
In the case of a manually accessed dictionary, the following questions arise:
how are the dictionary's headwords chosen? As idiomatic collocations and
expressions contain more than a single word, under which word should they be
included? Ideally, from the human user's point of view, it should be possible to
find the collocations and idiomatic expressions using any of the relevant constituent words as search items. If this is the case, the user does not have to try
two or three possible headwords before finding the right one. However, as long
as dictionaries are published in book form, space will restrict the duplication of
information. The lexicographer thus has to choose the headword to access the
collocation (Roberts 1996: 189).
It could be argued that the expression should be located under the first
important headword that it contains. But what is an important word for a non-specialist in the area? Furthermore, what is the status of candidate headwords
in synonymous expressions in the same or other languages represented in the
dictionary? In view of these difficulties, we have adopted a multiple headword
system. The direct headwords are the canonical forms of those terms present in
the expression that have been chosen as headwords by the lexicographer, whilst
indirect headwords are those in synonymous expressions in the same or another language (see examples below). The expression can only be accessed via
these headwords, whether direct or indirect.

MultiCoDiCT dictionary examples

In the following sections we give examples of expressions drawn from two of the
MultiCoDiCT dictionaries and the problems that we have had to solve. Both
dictionaries are bilingual and reversible and the languages involved are French
and Spanish. One dictionary [Parts of the body] is concerned with the
metaphorical usage of parts of the body (Morgadinho 1999), the other
[Tourism] focuses on collocations in the field of tourism (Chan Ng 1999).

Expressions with multiple headwords

In a MultiCoDiCT dictionary an expression can have several headwords. For
example, in the bi-directional French-Spanish [Parts of the body] dictionary,
the French expression
(82) donner les yeux de la tête pour quelque chose

appears under four different headwords. It appears under the direct headwords
oeil and tête but also indirectly under dedo and mano as it can be translated by
two different expressions in Spanish:
(83) dar un dedo de la mano por algo
(84) dar una mano por una cosa
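
The multiple headword system can be sketched as an index that files an expression under its own headwords (direct) and under the headwords of its equivalents in the other language (indirect). The data layout and the function `build_index` below are hypothetical and are not the MultiCoDiCT encoding; the assignment of dedo and mano to (83) and of mano to (84) is an assumption consistent with the four headwords mentioned above.

```python
from collections import defaultdict

# Example (82) with its two Spanish equivalents (83) and (84).
ENTRY = {
    "expression": "donner les yeux de la tête pour quelque chose",
    "direct": ["oeil", "tête"],
    "translations": [
        {"expression": "dar un dedo de la mano por algo",
         "direct": ["dedo", "mano"]},
        {"expression": "dar una mano por una cosa",
         "direct": ["mano"]},
    ],
}

def build_index(entries):
    """Map every headword to the expressions reachable from it."""
    index = defaultdict(set)
    for entry in entries:
        for headword in entry["direct"]:
            index[headword].add((entry["expression"], "direct"))
        # Headwords of the equivalents give indirect access to the source expression.
        for translation in entry["translations"]:
            for headword in translation["direct"]:
                index[headword].add((entry["expression"], "indirect"))
    return index

index = build_index([ENTRY])
for headword in sorted(index):
    print(headword, sorted(index[headword]))
```

Because the index is built by program, listing an expression under every relevant headword costs nothing, which is the advantage over the printed dictionary noted in the preceding section.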

On the other hand, one and the same headword can appear in several expressions. The following example is taken from the same dictionary, the headword
being the French noun main.
Source language: French
Headword: main
Spanish: mano
Expressions and translations:


avoir la main<n(f,s)> malheureuse (jeu\casser tout) | tener mala mano<n(f,s)>
avoir la main<n(f,s)> malheureuse (jeu\casser tout) | tener las manos<n(f,p)> de trapo<n(m,s)> (casser tout)
avoir les mains<n(f,p)> nettes (fig) | tener las manos<n(f,p)> limpias
à pleines mains<n(f,p)> | a manos<n(f,p)> llenas
à portée de la main<n(f,s)> (fig) | a la mano<n(f,s)>
tenir quelqu'un entre ses mains<n(f,p)> | tener uno a otro en su mano<n(f,s)> (fig)
forcer la main<n(f,s)> |
haut les mains<n(f,p)> ! | arriba las manos<n(f,s)> !
avoir sous la main<n(f,s)> | tener a mano<n(f,s)>
de main<n(f,s)> de maître<n(m,s)> | de mano<n(f,s)> maestra
être comme deux doigts<n(m,p)> de la | ser uña<n(f,s)> y carne<n(f,s)>
se laver les mains<n(f,p)> de quelque | lavarse uno las manos<n(f,p)> en algo

This entry shows the format in which entries are presented.
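
The angle-bracket annotations in the entry above lend themselves to mechanical processing. The sketch below reads each annotation as part of speech, gender and number (n = noun, f/m = feminine/masculine, s/p = singular/plural); that reading is an assumption inferred from the examples, since the notation is not defined in this chapter, and the names `Headword`, `headwords` and `plain` are hypothetical.

```python
import re
from typing import List, NamedTuple

class Headword(NamedTuple):
    form: str
    pos: str      # assumed: 'n' = noun
    gender: str   # assumed: 'f' / 'm'
    number: str   # assumed: 's' / 'p'

# Matches a word immediately followed by an annotation such as <n(f,s)>.
TAG = re.compile(r"(\w+)<(\w)\((\w),(\w)\)>")

def headwords(encoded: str) -> List[Headword]:
    """Extract the annotated headwords from an encoded expression."""
    return [Headword(*m.groups()) for m in TAG.finditer(encoded)]

def plain(encoded: str) -> str:
    """Strip the annotations to recover the expression as displayed."""
    return TAG.sub(r"\1", encoded)

encoded = "tener las manos<n(f,p)> de trapo<n(m,s)>"
print(headwords(encoded))
print(plain(encoded))
```

Keeping the annotation inline means a single encoded string serves both for display and for building the headword index.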

Expressions which have several possible meaning equivalents

The translation of expressions is divided between those which have the same
meaning in the source and target languages and those where there is a different
meaning. The computerised presentation distinguishes between the two cases.
Synonymous equivalents
The following example illustrates equivalents in the French-Spanish [Parts of
the body] dictionary:


être coude à coude

être côte à côte

estar hombro a hombro

Under the headword coude in the French to Spanish direction, alongside the
Spanish translation of the French expression containing coude, the dictionary
also presents the synonymous French expression in order to draw attention to
cases of synonymy in the source language. The same holds true in the case of
the French headword côte.
In some cases no fewer than three synonymous expressions are found:



estar uno hasta los pelos

en avoir par-dessus la tête

avoir les oreilles rebattues
en avoir par-dessus les yeux

Note that synonymy can involve single words. The following example shows an
equivalence between a semi-frozen French expression and two single words in


forcer la main


Polysemous equivalents
The French-Spanish [Parts of the body] dictionary contains the following
example of polysemous equivalents:


avoir la main malheureuse

(jeu\tout casser)

tener mala mano


avoir la main malheureuse

(jeu\tout casser)

tener las manos de trapo

(tout casser)

In order to indicate the difference in meaning between the two languages, the
ambiguous source expression in French is repeated. Beside each instance of the
expression the lexicographer has indicated the particularities in terms of sense.
Each of these particularities is also placed beside each of the Spanish expressions to indicate which French sense the Spanish expression corresponds to.
The particularities are written in French so as to reinforce the message that it is
the French expression that is polysemous.
The inverse direction is also handled in this dictionary:


arrimar el hombro

donner un coup d'épaule


arrimar el hombro

travailler activement


Language variants
Several of the MultiCoDiCT dictionaries take language variants into account,
as in the following example from the [Tourism] dictionary which includes
variants in Spanish:



billete (Espagne)
boleto (Américanisme)
tiquet (Espagne)
billete de avión (Espagne)
boleto de avión (Américanisme)
tiquete<n(m,s)> (Américanisme)

billet d'avion
billet de bus

10. Conclusion
In this paper, certain problems have been brought to light, particularly those
concerning the collection of the data and its representation in computerised
dictionaries for translation, as well as problems posed by the recognition of
expressions in context and in their translation.
We are of the opinion that although machines are useful in advancing and
verifying the work of the linguist, there remains much core work which only
the linguist is competent to carry out (conception, understanding and organisation), and such work is also essentially manual in nature. Even if it is considered that hand-collected sets of citations cannot give robust information concerning relative frequencies (Moon 1998: 47), we think that frequency alone is
not sufficient. Indeed, the sequences that a human translator will have difficulty translating are likely to be sequences that he or she has never come across
before. We are thus of a different opinion, especially in view of the fact that: Par
rapport la macrostructure dun dictionnaire bilingue gnral, celle dun dictionnaire bilingue spcialis est beaucoup plus rduite, mais elle contient des
termes que les dictionnaires gnraux nont pas (Marello 1996 : 39)2. There is a
need for specialised dictionaries in all sorts of fields, as general dictionaries are
often silent on specific terms. Corpora have proved to be extremely useful in
uncovering these expressions but they will always be limited. Exclusive reliance
on them might therefore lead lexicographers to miss important expressions
which the corpus used does not contain. We think (and this is the line we
have followed ourselves) that corpus research, existing dictionaries and
human intuition should be used concurrently if we are to succeed in building
systems capable of recognising set expressions in context, as well as multilingual dictionaries for automatic and human translation.

1. The term corpus denotes a set of homogeneous texts, i.e. texts covering the same field,
written and used by the same type of people and under similar conditions.
2. As compared to the macrostructure of a general bilingual dictionary, the macrostructure of a specialised bilingual dictionary is not as large, but it contains terms which are
absent from general dictionaries.

Cardey, S. and Greenfield, P. 1997. Ambiguïté et traitement automatique des langues. Que
peut faire l'ordinateur? In Actes du 16e Congrès International des Linguistes, Paris,
20–25 juillet 1997. Elsevier Sciences, sous forme de CD ROM.
Chan Ng, R. 1999. Prototype informatisé de dictionnaire du tourisme français-espagnol-français. Mémoire de DEA, Centre Tesnière, Besançon, France.
Dauphin, E. 1997. Etude de corpus : un préalable pour l'adaptation de systèmes de traduction automatique aux besoins des utilisateurs in TA-TAO. In Recherches de pointe et
applications immédiates, A. Clas et P. Bouillon (eds), 15–34. Montréal: AUPELF-UREF.
Greenfield, P. 1998. L'espace de l'état et les invariants de l'état des dictionnaires terminologiques spécialisés de collocations multilingues. In Actes de la 1re Rencontre
Linguistique Méditerranéenne, Le Figement Lexical, Tunis, les 17–18 et 19 septembre
1998, 271–283.
Greenfield, P., Cardey, S., Achche, S., Chan Ng, R., Galliot, J., Gavieiro, E., Morgadinho, H.,
Petit, E. 1999. Conception de systèmes de dictionnaires de collocations multilingues,
le projet MultiCoDiCT. Colloque international AUPELF Réseau Lexicologie,
Terminologie, Traduction, Beirut, November 1999 (à paraître).
Limame, D. 1998. Vers une reconnaissance automatique des expressions figées, théorie et
application, anglais, français, italien. Mémoire de DEA, Centre Tesnière, Besançon,
Marello, C. 1994. Les différents types de dictionnaires bilingues. In Les dictionnaires
bilingues, H. Béjoint and P. Thoiron (eds), 35–51. Louvain-la-Neuve: Duculot.
Mel'čuk, I. A. 1995. Phrasemes in language and phraseology in linguistics. In Idioms:
Structural and psychological Perspectives, M. Everaert, E.-J. van der Linden, A. Schenk
and R. Schreuder (eds), 167–232. Hillsdale, N.J.: Lawrence Erlbaum.
Moon, R. 1998. Fixed expressions and idioms in English, a corpus-based approach. Oxford:
Clarendon Press.


Morgadinho, H. 1999. Dictionnaire électronique français-espagnol. Expressions figées.
Mémoire de Maîtrise d'espagnol, mention Industries de la langue, Besançon, France.
Roberts, R.-P. 1994. Le traitement des collocations et des expressions idiomatiques. In Les
dictionnaires bilingues, H. Béjoint and P. Thoiron (eds), 181–202. Louvain-la-Neuve:
Thomas, I. 1998. Analyse, reconnaissance et traduction des expression figées: vers un traitement automatique. Mémoire de DEA, Centre Tesnière, Besançon, France.

Making a workable glossary out of a specialised corpus
Term extraction and expert knowledge
Christine Chodkiewicz, Didier Bourigault and John Humbley
The aim of this paper is to show what remains to be done, especially in terms of
lexicographical treatment by subject specialists, in order to turn a glossary
obtained through computer-assisted term extraction into a tool that can be used
by professional translators. It is argued that the use made of the term extractor
gives a high-quality result, but that further treatment is still required, in which
linguistic and specialist knowledge are inextricably linked. The example which
serves as a demonstration is that of the elaboration of a human rights glossary
intended, inter alia, for the translators of the European Court of Human Rights.
The corpus used consisted of the complete texts of the European Convention on
Human Rights, plus its eleven protocols and 36 court decisions. The term
extractor Lexter first identified candidate terms in French, and then a legal
expert manually discarded irrelevant sequences and paired the English-language equivalent in the aligned text. This yielded a bilingual glossary of equivalents, which was not considered immediately exploitable, mainly because of a
high percentage of multiple correspondences in both languages. An analysis of
these multiple equivalences indicates that many are due to purely linguistic phenomena, and can be easily dealt with, but many others require expert knowledge
to unravel. The example of procédure/procedure + proceedings is given to illustrate this.

The Human Rights terminology project

Automatic or computer-assisted term extraction makes it possible to create
dictionaries or terminology tools which can be tailor-made for a specific purpose or a specific type of user. In many fields dictionaries are only found for the
most general level of specialised knowledge, and bilingual or multilingual dictionaries are uncommon in restricted domains where the national systems and
terminology differ greatly, such as law. It was this situation that caused the
European Court of Human Rights to regret the lack of an up-to-date dictionary to help not only its own translators working in the two official languages,
English and French, but also those dealing with other languages, notably those
of the countries of Eastern and Central Europe.
As the Court produces all its official texts in English and French, both versions being considered equally legal and binding, it was possible to produce a
corpus based on its own founding texts (the Convention and the protocols
added to it) and a number of the decisions made by the Court (36 in the present
corpus; at the time of writing over 850 decisions had been handed down). Use
could thus be made of decades of painstaking searches for equivalents by translators, and inconsistencies of the past could be pinpointed and corrected.
After the initial glossary was obtained in English and French, work was
begun on adding Polish and Romanian equivalents, using the results of the
term extraction from the French texts.

Term extraction
As the aim of the project was to produce a specific glossary for users with well-defined needs, the choice fell on a strategy of extracting terms from a corpus
representing a sizeable proportion of the actual production of the Court itself,
rather than of enriching existing dictionaries.
The exceptional nature of the corpus used merits some comment. The
Convention, the protocols and the decisions are issued simultaneously in
English and French, which enjoy equal official status. Moreover, the quality of
the translations is such that it is well-nigh impossible to tell which is the original and which the translation. At the same time, the texts are so densely packed
with legal terminology that it was considered more useful in the long run to
build a specialised vocabulary, rather than simply use a translation memory, as
has been done for parliamentary debates, which are much wider in scope. The
comparable nature of the two sub-corpora enables the terminologist to assume
that the terms they contain will be equally authoritative, and the objection
sometimes made of recycling stale translations can thus be firmly rejected. This
does not mean, however, that the extraction will not show inconsistencies; one
such area, of which the translators themselves are aware, is that of names of
institutions (Cour de cassation becomes, in English, Supreme Court, Court of
Cassation as well as Cour de cassation), and the results confirm this intuition.
Standardisation of the Court's terminology is a spin-off of the project.
The corpus (the Convention, protocols and the 36 decisions handed down in
1995) was provided by the Court itself in both English and French in ASCII files.
The first task was to align the two texts, using the highly structured framework of
the two texts (number of decision, sections of decisions, numbers of sections,
numbers of paragraphs, identification of quotations). In addition, references to
past cases, articles of law, proper names, etc. were deleted. The sentences of both
corpora were then aligned, using strong punctuation marks. As may be expected,
some discrepancies between English and French punctuation were noted, and so
some human intervention was necessary to pair off sentences correctly.
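
The sentence-level pairing described above can be illustrated with a minimal sketch. It is not the Flex programs used in the project: the regular expression, the function names and the choice of full stop, question mark and exclamation mark as strong punctuation are assumptions made here for illustration. A paragraph whose sentence counts differ in the two languages is returned as None, which corresponds to the cases that required human intervention.

```python
import re

# Assumption for this sketch: a sentence ends at ".", "!" or "?" followed by whitespace.
STRONG_PUNCT = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Split a paragraph into sentences at strong punctuation marks."""
    return [s.strip() for s in STRONG_PUNCT.split(paragraph) if s.strip()]


def align_paragraph(french, english):
    """Pair sentences one-to-one within an already aligned paragraph.

    Returns a list of (French, English) pairs, or None when the two
    versions do not contain the same number of sentences, so that the
    paragraph can be paired by hand.
    """
    fr = split_sentences(french)
    en = split_sentences(english)
    if len(fr) != len(en):
        return None
    return list(zip(fr, en))
```

Restricting the pairing to paragraphs that are already matched by their structural position keeps a punctuation discrepancy from shifting every later sentence pair.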
The two resulting corpora contained 12 131 sentences each, and around
300 000 words. The preparation of the programs (written in Flex under Linux)
and the verification of the part-of-speech tagging took about twenty hours of
work. The next phase was to extract French candidate terms, using Lexter. This
term extractor performs a morpho-syntactic analysis of a corpus, identifying
boundaries between syntagms in order to produce a list of noun phrases which
are likely to be terminological units (candidate terms). These noun phrases are
arranged in series so that the terminologist can look up all those in which an
element occurs, either as a head or an expansion. The hypertext unit gives
access to the contexts of the corpus in which the candidate terms occur
(Bourigault 1993, Bourigault et al. 1996).

Figure 1. The Lexter interface. Each candidate term extracted is shown in its context in
the left-hand boxes. The terminologist pairs the terms with those in the corresponding
English-language contexts in the right-hand boxes.

The terminologist first validates the terms (yes, no, ?), using Lexter's terminology hypertext interface (see Figure 1). Then comes the matching phase. Up
to the present, Lexter has only been used to extract French candidate terms, but
the aligned English corpus made it possible to identify the equivalent English
terms after minor modifications to the hypertext interface. The technology used
can be described as unsophisticated, as no statistical matching has been attempted. However, it is claimed that this approach bears comparison with more
sophisticated techniques which yield matches for the more frequent candidate
terms. Nevertheless, research is under way to develop more sophisticated tools.
At the time of writing, the glossary is roughly two-thirds finished in its first
stage. Its state of development is summarised in Table 1.
Table 1. Number of candidate terms processed, discarded, kept or aligned by
specialised terminologists

                 Candidate terms   Processed   Discarded       Kept            Aligned
Frequency = 1    12 193            6 375       2 720 (43 %)    3 655 (57 %)    1 183
Frequency > 1     4 283            3 185       1 058 (33 %)    2 127 (66 %)    2 127
Total            16 476            9 560       3 778 (40 %)    5 483 (60 %)    3 310

This means that at the time of writing the terminologist had dealt with 9 560
terms out of the 16 476 French candidates which Lexter proposed (of which
only 4 283 occurred more than once), decided to keep 5 483 of these, and
aligned 3 310 with a corresponding English term.
It should be noted that the terminologists kept more hapaxes as candidate
terms (57%) than they discarded, casting doubt on the generally accepted view
that important terms always occur frequently.
Before the glossary can be submitted to the Court translators, however, a
number of improvements have to be made, the most important being the question of multiple equivalents in both languages.


Multiple equivalents: general problems

In many cases one candidate term in English corresponds to one candidate
term in French. For example, friendly settlement is the English equivalent of
règlement amiable in the seven different sentences in which it occurs. There
remain, however, a large number of candidate terms which have more than one
equivalent in the other language; of terms with a frequency of two or more, one
in four have more than one equivalent; out of 981 multiword terms with a frequency of two, 168 have two equivalents. This was considered too high a proportion of multiple equivalents, forcing the translator to look up the contexts
through the Lexter hypertext interface, and precluding a complete paper version. In order to reduce this proportion, attention was given firstly to systematising textual variants, which amounts to a further step of lemmatisation, and
secondly to highlighting some of the differences between the two legal systems
which created this multiplicity of equivalents.
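
The proportion of candidate terms with more than one equivalent can be obtained from the validated term pairs with a few lines of code. The following is a minimal sketch, not part of the Lexter tool: the function names and the representation of the pairs as (French, English) tuples are assumptions, and the sample pairs are taken from examples discussed in this paper.

```python
from collections import defaultdict


def equivalents_by_term(pairs):
    """Map each French term to the set of English equivalents it was paired with."""
    table = defaultdict(set)
    for french, english in pairs:
        table[french].add(english)
    return table


def share_with_multiple(table):
    """Return the fraction of terms that have more than one equivalent."""
    if not table:
        return 0.0
    multiple = sum(1 for equivalents in table.values() if len(equivalents) > 1)
    return multiple / len(table)


# Illustrative pairs only; a real run would read the aligned glossary.
PAIRS = [
    ("règlement amiable", "friendly settlement"),
    ("règlement amiable", "friendly settlement"),
    ("procédure d'extradition", "extradition procedure"),
    ("procédure d'extradition", "extradition proceedings"),
]
```

Storing the equivalents in a set means that repeated occurrences of the same pairing, such as règlement amiable and friendly settlement, count only once.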
Identification of multiword terms is one means of limiting the number of
equivalents. Deciding exactly what makes up a multiword term unit has been a bone of contention, and has long been discussed, especially in
French-language terminology (Assal and Delavigne 1993, Rondeau 1979). The
final decision is made by the subject field expert, who determines, not without
difficulty, what is to be considered a significant unit. However, a certain
amount of formal regularity can be used to put together phrases that belong
together. For example, sous tutelle is in fact always found in the phrase placement sous tutelle or placer sous tutelle, and should be listed accordingly.
Single-word terms are often more ambiguous than multiword terms.
Thus, lawful corresponds to légal, légitime, licite, régulier, but lawful acts of war
always corresponds to actes licites de guerre, and lawful restriction to restriction
légitime. There is in addition a link between formal regularity and subject
domain. For example, régulier is found regularly (!) as an equivalent of lawful
in texts dealing with the entry into, and presence of, foreigners in a state (e.g.
entrer régulièrement [territoire], étranger résidant régulièrement sur le territoire),
prompting the terminologist to deal first with such phrasal units rather than
with isolated terms.
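
One way of giving phrasal units priority over isolated terms in a glossary lookup is to try the longest matching entry first. The sketch below is an illustration under stated assumptions, not a description of the project's software: the function name, the max_len parameter and the dictionary layout are hypothetical, and the entries are limited to the lawful examples given above.

```python
# Illustrative glossary: multiword entries sit alongside the ambiguous single word.
GLOSSARY = {
    "lawful": ["légal", "légitime", "licite", "régulier"],
    "lawful acts of war": ["actes licites de guerre"],
    "lawful restriction": ["restriction légitime"],
}


def lookup(tokens, start, glossary, max_len=5):
    """Return (phrase, equivalents) for the longest entry starting at tokens[start].

    Longer phrases are tried before shorter ones, so a multiword term is
    preferred to the single-word term it contains. Returns None if no
    entry matches.
    """
    longest = min(max_len, len(tokens) - start)
    for length in range(longest, 0, -1):
        phrase = " ".join(tokens[start:start + length])
        if phrase in glossary:
            return phrase, glossary[phrase]
    return None
```

With these entries, lawful acts of war yields a single equivalent, whereas lawful followed by a word not listed in the glossary falls back to the four equivalents of the single-word term.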
One aspect of textuality which has to be regularised for lexicographical
purposes is anaphora. While it is perfectly normal for épuisement to be used
without further qualification if the form épuisement des voies de recours internes
has already been used, it is obvious that the lexicon will not have separate
entries for the short and the complete form. It should be noted that short forms
are also found when the complete form of the term is implied though not actually stated in a text: thus detention is used for detention preceding trial when a
time limit is mentioned (The judge extended the detention by four months). It
should not be assumed from this that the short forms are mechanically derived
from the complete forms but, knowing the expertise of the target users, it is sufficient for the latter forms to be listed in the glossary.
Another linguistic aspect of lemmatisation concerns syntactic variants
which occur in the texts. Thus, liberté provisoire has as English equivalents not
only provisional release, but also released on a provisional basis, released provisionally, provisionally released, (a person) provisionally at liberty. This confirms
the translators' rule of thumb that abstract nouns often translate best into
English verb forms (comparution usually appears in the English text as to
appear (for trial)). Grouping of different morpho-syntactic forms of a term
under an appropriate headword, usually a noun form, significantly reduces the
number of multiple equivalents, though some forms attested locally may be
considered too atypical to merit keeping in a glossary (such as: such an application cannot be a remedy whose exhaustion is required, which will probably not be
indicated as a possible form of non-épuisement des voies de recours internes).
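
The effect of grouping syntactic variants under a headword can be shown with a small sketch. It assumes that the terminologist supplies the variant-to-headword table by hand; the table below, the choice of provisional release as headword and the function name are illustrative assumptions based on the liberté provisoire example, not the project's actual data.

```python
# Hand-built table (an assumption): each attested variant points to its headword.
VARIANT_TO_HEADWORD = {
    "released on a provisional basis": "provisional release",
    "released provisionally": "provisional release",
    "provisionally released": "provisional release",
    "(a person) provisionally at liberty": "provisional release",
}


def collapse(equivalents, variant_table=VARIANT_TO_HEADWORD):
    """Replace each variant by its headword and drop duplicates, keeping order."""
    headwords = []
    for equivalent in equivalents:
        headword = variant_table.get(equivalent, equivalent)
        if headword not in headwords:
            headwords.append(headword)
    return headwords
```

Applied to the five equivalents of liberté provisoire listed above, the function returns a single headword, so the entry no longer counts as a case of multiple equivalence.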

Synonymy
Two terms or expressions are said to be synonymous when they have exactly the
same meaning and are wholly interchangeable. Dual signifiants and an identical
signifié are the two criteria of synonymy.
Synonymy disturbs the lawyer in so far as he seeks a high degree of precision in the terminology he uses. He resents duplicates which appear to him a
linguistic waste or, at least, a factor of potential misunderstanding.
In legal texts there are very few true synonyms, but many partial or quasi-synonyms. These often result from the use of a generic term for a specific (e.g.
in French: convention/contrat; contrat is one of many types of conventions), but
there may be other reasons. However, in context this partial or quasi-synonymy
seldom generates problems. In fact, it is often of no practical consequence since
the text remains understandable despite a lack of accuracy of the terms used.
Nevertheless, from a theoretical point of view and in order to prepare a thoroughly reliable glossary, it is important to circumscribe the specific meaning of
each term.
Surprising as it may seem, comparing terms in two languages, French and
English in this instance, actually facilitates the task of the linguist who has to
deal with problems of synonymy. Using specialised and, to a lesser degree, general dictionaries of the two languages suggests interconnections and distinctions which might otherwise never have been made.

An extreme case: procedure

It soon became apparent when investigating those terms which have two or
more equivalents in the other language that it is with reference to procedure
(procédure) that approximative synonymy, to use Cornu's terminology
(Cornu 1990: 178), can be most usefully illustrated. In French or in English the
terms procédure/procedure when related to the term proceedings are particularly problematical. Proceedings has no fewer than twelve equivalents in the
French subcorpus, and procédure has six in the English. To complicate matters
further, many equivalents correspond to terms other than proceedings/procédure. The purpose of the following illustration is not to undertake a thorough
semantic analysis of the terms that we have singled out in both languages but,
more modestly, to try to show how such terms relate to one another.
The first step was to consult established dictionaries. In English, the word
procedure designates the mode or form of conducting judicial proceedings (as
distinguished from those branches of the law which define rights or prescribe
penalties), as defined by a general language dictionary (Oxford 1994), or the formal manner in which legal proceedings are brought, as a legal dictionary puts it
(Oxford 1990). The term proceedings is thus used to circumscribe the meaning of
procedure. However, the word proceedings (which is either too obvious or too
obscure to be defined in most legal dictionaries) is employed to define the term
procedure itself: proceedings is defined as the course of procedure in judicial action
or in a suit in litigation; legal action (Webster 1966) or the instituting or carrying
on of an action at law or process; any act done by authority of a court of law; any
step taken in a cause by either party (Oxford 1994). We shall see below that the
words cause and action are themselves defined using the term proceedings.
In French, procédure is defined as "cette branche de la science du droit
ayant pour objet de déterminer les règles d'organisation judiciaire, de compétence, d'instruction des procès et d'exécution des décisions" (Cornu 1987). No
single reference can be found to any word which might correspond, in our corpora, to proceedings, as will be seen below.
The question is then: would it be legitimate to present procedure as a synonym of proceedings? To answer this question we looked at the various terms
used in French by the Court's translators to render proceedings. The twelve different equivalents already mentioned are: procédure (in a large number of
cases), but also procès, litige, affaire, cause, action, instance and (less frequently)
poursuites, débats, recours, audience and contentieux.
We shall first examine those cases in which procédure corresponds to procedure and other equivalents, before presenting the twelve equivalents for proceedings in the French subcorpus.
Procédure corresponding to procedure and other equivalents
Clearly, procédure and procedure overlap to some degree and are thus at least
partly equivalent. Thus, the Convention specifies that La Cour établit son règlement et fixe sa procédure, which translates into English as: The Court shall draw
up its own rules and determine its own procedure. Garanties fondamentales de
procédure is expressed in English as fundamental guarantees of procedure and
the expression garanties procédurales is invariably translated as procedural safeguards. Criminal procedural law or penal procedure render la procédure pénale.
Procedural provisions or procedural rules are said to be dispositions ou règles
procédurales. The Continental codes are designated as Code of Criminal
Procedure or Code of Civil Procedure. Procedure in the Civil Courts is translated
as procédure devant les tribunaux civils. In both languages, procedural is generally used in opposition to substantive: thus, les questions soulevées par ce
texte étaient (...) de nature procédurale et non matérielle is translated by the
questions raised (...) were of a procedural nature not of a substantive nature.
But procédure corresponds to words or expressions other than procedure
(or proceedings, see below). Thus, procédure is sometimes translated as litigation. Procédure orale corresponds to hearing. Procédure judiciaire is sometimes
rendered as judicial process. Sans procédure adéquate is translated as without due
process. Pendant la procédure corresponds to pending trial. Sans autre forme de
procédure is rendered quite simply as without further formality. Procédure de
jugement is sometimes simply translated as trial. Procédure d'instruction corresponds to judicial investigation.
There are sentences in which the word procédure does not appear in the
French text but procedure is used in English to convey a French expression: Nul
ne peut être privé de sa liberté, sauf dans les cas suivants et selon les voies légales
corresponds to No one shall be deprived of his liberty save in the following cases in
accordance with a procedure prescribed by law.

Procédure corresponding to proceedings

Procédure very often corresponds to proceedings when an action or process (cf.
Oxford 1994) before a specific institution is involved. This correspondence is
thus in strict codistribution with names of courts or institutions:
procédure devant les organes de la Convention    proceedings before the Convention institutions
procédure de la Convention                       Convention proceedings
procédure devant la Cour de cassation            proceedings before the Supreme Court

Proceedings is also used to circumscribe the notion of procédure within a limited time-span and hence in codistribution with expressions of time:
début de la procédure                            beginning of the proceedings
à tout moment (à tout stade) de la procédure     at any stage of the proceedings
suspension de la procédure                       stay of the proceedings
conduite (poursuite) de la procédure             conduct of the proceedings
la procédure est toujours pendante               proceedings are still pending

But proceedings is also used in multiword candidate terms corresponding in
French to hyponyms of procédure:
procédure pénale                     criminal proceedings
procédure interne                    national proceedings/domestic proceedings
procédure de cassation               cassation proceedings
procédure de jugement                trial proceedings/court proceedings
procédure d'appel                    appeal proceedings
procédure de révision                rehearing proceedings
procédure de contrôle judiciaire     judicial review proceedings
procédure en chambre du conseil      review chamber proceedings
procédure non-contentieuse           non-contentious proceedings
procédure en diffamation             libel proceedings
procédure de règlement amiable       friendly settlement proceedings
procédure préliminaire               preliminary proceedings
procédure au principal               principal proceedings

The terminologist notes that these multiword terms have only one equivalent
in the corpus and thus pose no problem of multiple equivalence when taken
Proceedings also marks the steps taken in a cause (Oxford 1994):
engager une procédure                to take/bring/institute proceedings
suspendre une procédure              to stay/adjourn proceedings
rouvrir la procédure                 to reopen the proceedings

Here, the support verb is an indicator of the equivalence to be sought.

In some cases procedure and proceedings are used without distinction for
procédure. Thus, procédure d'extradition has two equivalents in the corpus:
extradition procedure or extradition proceedings. Similarly, frais de procédure is
either procedural cost or costs of the proceedings. Vice de procédure can be
expressed either by procedural defect or procedural deficiency but also by defect
in the proceedings or irregularity in the proceedings. The hypothesis is that the
English equivalents are true synonyms.

Proceedings and its equivalents other than procédure

1. Procès
Procès is an equivalent of proceedings in the following:
procès pénal                     criminal proceedings
parties au procès                parties to the proceedings

Procès is, however, the most common equivalent of trial:
procès contradictoire            adversarial trial

But trial itself corresponds to many words besides procès:
commit the accused for trial     renvoyer l'accusé en jugement
criminal trial                   audience pénale
to appear for trial              comparaître à l'audience
in the course of the trial       au cours des débats
pending trial                    pendant la procédure

Trial is sometimes used in a multiword term where, again, procès is not to be
found: trial judge thus designates what is called in French the juge de première
instance/premier juge/juge du fait/juge du fond (typical of French law); similarly,
the judge in charge of preparing the case for trial is called juge de la mise en état
(unknown in English law). To ensure appearance for trial corresponds to assurer
la représentation en justice; pre-trial detention or detention pending trial to
détention provisoire; to evade trial to se soustraire à l'action en justice; trial proceedings to procédure de jugement. Conversely, procès corresponds to many
words other than trial:
réouverture du procès            reopening of the case
droit à un procès                right to a hearing
entamer un procès                to bring an action
procès civil                     civil process
droit à un procès équitable has three equivalents: fair hearing/due process/right to
a fair trial

2. Affaire
Affaire corresponds to proceedings:
l'état de l'affaire              the state of the proceedings
affaires pénales                 criminal proceedings

But affaire is most often the equivalent of case:
la compétence de la Cour s'étend à toutes les affaires concernant l'interprétation et
l'application de la présente Convention    the jurisdiction of the Court shall
extend to all cases concerning the interpretation and application of the present Convention
renvoi de l'affaire              adjournment of the case
fond de l'affaire                merits of the case
affaires de diffamation          libel/defamation cases

Affaire also corresponds to matter:
fond de l'affaire (see above)            merits of the matter
règlement amiable de l'affaire           friendly settlement of the matter
toutes les affaires concernant l'interprétation et l'application de la présente
Convention    all matters concerning the interpretation and application of the
present Convention.

Finally, affaire can also correspond to many other words or expressions in
which various metonymies can be noted:
toutes les affaires dont l'examen n'est pas terminé    any application the examination of which has not been completed.
il sollicita un examen rapide de son affaire    he asked for his application to be
dealt with speedily
affaire d'une telle envergure    trial on such a scale
une affaire où l'on ignorait les faits    a situation where the true facts were
le juge fut chargé de l'instruction de l'ensemble de l'affaire    the judge (...) was
put in charge of the overall investigation
décision de joindre les affaires    decision of joinder


3. Litige
Litige can correspond to proceedings:
litige de nature pénale          proceedings that were of a criminal nature
somme en litige                  sum at stake in the proceedings

Litige is also an equivalent of case:
le litige auquel (les Hautes Parties Contractantes) sont parties    the case to which
(the High Contracting Parties) are parties
l'examen du litige               examination of the case
fond du litige                   merits of the case
l'objet du litige                the scope/the compass of the case

But litige also often corresponds to dispute:
somme en litige                  sum in dispute
litige concernant des droits et obligations de caractère civil    dispute concerning
civil rights and obligations
litige de nature privée          disputes between private parties
l'issue du litige                the outcome of the dispute
l'objet du litige                the subject-matter of the dispute

Finally, litige has various other equivalents:
le litige a été résolu           the matter has been resolved
contexte du litige               background to litigation
l'enjeu du litige pour l'intéressé    what is at stake for the applicant in the litigation

4. Cause
Cause is an infrequent equivalent of proceedings:
être appelé dans la cause        to join the proceedings
être maintenu dans la cause      to remain party to the proceedings

Instead, cause is more often an equivalent of case:
examiner la présente cause       to examine the instant case
faits/circonstances de la cause  facts/circumstances of the case
il n'a pas pu faire entendre sa cause devant un tribunal    he was not able to bring
his case before a tribunal
bien-fondé/fond de la cause      merits of the case

Strictly speaking, bien-fondé de la cause and fond de la cause are not synonyms in
French, so this is perhaps the only error of translation detected in our corpus.

However, cause has several other equivalents in the French subcorpus:
les juridictions répressives ne sont pas intervenues dans cette cause    the criminal
courts had not been involved in the matter
toute personne a droit à ce que sa cause soit entendue équitablement, publiquement, dans un délai raisonnable    everyone is entitled to a fair and public
hearing within a reasonable time
la Cour ajourna l'examen de la cause    the Court adjourned the hearings

It is interesting to note that cause (in our corpus) is never given as an equivalent
of cause (though the word exists in legal English) or of trial (though the words
procès and cause are generally regarded as synonyms in French, as we shall see
5. Action
Action is another occasional equivalent of proceedings:
action pnale/action publique criminal proceedings
action civile civil proceedings
action en confiscation condemnation proceedings/proceedings for forfeiture
engager/intenter/entamer une action to take/to bring proceedings
action en diffamation proceedings for/of libel (see also below)

The most common English equivalent of French action is simply action:

intenter/engager/porter une action (contre quelquun) to bring/to commence an
action (against someone)
examen de son action trial of his action
rayer une action du rle to strike out an action
action en responsabilit de lEtat action to establish the States liability
action en dommages-intrts action for damages
action en diffamation libel action/action of/for defamation

However, French action can correspond to other words in English:

X sest soustrait laction X evaded trial
action en responsabilit civile civil litigation
action publique prosecution

Conversely, English action corresponds to other terms in French:

persons against whom action is being taken with a view to deportation or extradition personne contre laquelle une procdure dexpulsion ou dextradition
est en cours

Christine Chodkiewicz, Didier Bourigault and John Humbley

‘the Court can decide on the whole merits of the action’ la Cour a plénitude de
‘to dismiss the applicant's action’ débouter
‘remedial action’ mesures de redressement
‘the provisions do not permit the action taken’ les dispositions n'autorisent pas les mesures prises

6. Instance
Instance corresponds to proceedings in the following collocations:
instance judiciaire ‘legal/judicial court proceedings’
instance pénale ‘criminal proceedings’
instance de renvoi en jugement ‘committal proceedings’
suspendre l'instance ‘to adjourn the proceedings’
l'engagement d'une instance ‘the institution of proceedings’
conduite de l'instance judiciaire ‘conduct of judicial proceedings’

But instance also corresponds to procedure (instance devant la Cour suprême ‘procedure before the Supreme Court’) or to action.

7. Poursuites
Poursuites also corresponds to proceedings:
poursuites judiciaires ‘legal proceedings’
poursuites pénales ‘criminal proceedings’
poursuites en cours ‘pending proceedings’
engager des poursuites contre ‘to bring proceedings against’
clôturer les poursuites ‘to close the proceedings’
(être poursuivi is translated by ‘to be subject to legal proceedings’)

8. Recours
Recours is an infrequent equivalent of proceedings:
introduire un recours to take/bring/institute proceedings
droit de recours devant un tribunal right to take proceedings before a court

But recours more usually corresponds to other terms or expressions:

introduire un recours (see above) to file/make/lodge an application or to file a
droit à un recours effectif ‘right to have an effective remedy’
épuisement des voies de recours (internes) ‘exhaustion of domestic remedies’


droit de recours individuel ‘right of individual recourse/right of individual petition’
recours gracieux ‘non-contentious claim’
recours au tribunal ‘appeal to the tribunal’
le recours doit être rejeté ‘the appeal must be dismissed’
exercer un droit de recours ‘to lodge an appeal’
juridiction de recours ‘appellate court/court of appeal’
débouter (quelqu'un) de son recours/rejeter un recours to dismiss (someone's)
recours en annulation ‘application for judicial review/plea of nullity’
recours de droit administratif ‘administrative law action’
recours administratif ‘administrative objection/non-contentious application/administrative appeal’

9. Audience
Audience is sometimes an equivalent of proceedings:
publicité de l'audience ‘publicity of the proceedings’

Audience also corresponds to trial:

audience pénale ‘criminal trial’
comparaître à l'audience ‘to appear for trial’

Audience is also often translated as hearing (or trial hearing):

audience d'appel ‘appeal hearing’
audience de réexamen ‘revision hearing’
audience publique ‘public hearing’

But audience also corresponds to other terms:

renvoyer l'audience ‘to adjourn the case’

10. Débats
Débats is an equivalent of proceedings:
débats judiciaires ‘judicial proceedings’

11. Contentieux
Contentieux is an equivalent of proceedings:
le contentieux des droits de l'homme ‘human rights proceedings’

But contentieux is also translated as litigation and dispute.


Discussion
The term proceedings is highly polysemous and has no single equivalent in
French (see Figure 2).























Figure 2. Cross-linguistic correspondences of French procédure and English proceedings. Frequent equivalents are indicated by an unbroken line, less frequent by a broken line.

If one term is to be highlighted as the most frequent equivalent of proceedings,
it is certainly procédure, but also, to a lesser extent, poursuites, which in our corpus
has no equivalent other than proceedings, and instance, which has two
equivalents, procedure and action. The least frequent equivalent of proceedings
is undoubtedly recours, as this term itself has many different equivalents in
English. The equivalents of contentieux and débats are not very significant
either, but it is not possible to draw a definite conclusion concerning these two
words since they appear infrequently in our corpus.
Three terms can be held to have exact equivalents in the other language:


procédure/procedure, action/action and cause/cause. But these terms are not systematically used to translate each other and all three French words often correspond to proceedings in English. As far as cause is concerned, it is difficult to
draw a conclusion as this term does not appear in English in our corpus; but
the instances in which it is used are probably fewer in English than in French
(which does not necessarily mean that its meaning is more restricted). Cause is
defined by the legal dictionary (Oxford 1990) as ‘a court action’. In French it is
often used in the expression en tout état de cause, meaning ‘à toute hauteur de la
procédure; à tout moment de l'instance (par opposition au seuil de l'instance)’
(Cornu 1987). However, French action and English action seem to have roughly
the same meaning: the term is defined in English as ‘a proceeding in which a
party pursues a legal right in a civil court’ (Oxford 1990) or ‘the taking of legal
steps to establish a claim or obtain a judicial remedy’ (Oxford 1994) and in
French as ‘voie de droit ouverte pour la protection d'un droit ou d'un intérêt
légitime’. This is corroborated by the fact that French action has been shown to
correspond most usually to English action. But we have seen that they cannot
be regarded as strictly equivalent, given other possible equivalents. As regards
procédure/procedure, it is quite clear that these two terms cannot be held to be
always equivalent. In many instances, as seen above, proceedings is certainly a
more appropriate equivalent of procédure than procedure, even though in some
cases proceedings and procedure are used interchangeably.
In French, procès, cause, litige, affaire are generally held to be synonymous
(cf. Cornu 1990, Robert 1990). One could easily think that the same is true in
English, as these four terms have at least two common denominators: proceedings (again), but also (in many instances) case (which is defined (Oxford 1990)
as ‘1. A court action. 2. A legal dispute. 3. The arguments, collectively, put forward by either side in a court action’ and in general language (Oxford 1994) as
‘a cause or suit brought into court for decision’). But these four terms, as seen
above, also correspond to other terms. The most common, not to say the
most appropriate, equivalent of procès is probably trial, but trial is clearly
inappropriate as a translation of cause or litige. In fact, even litige, though often
regarded as synonymous with procès, affaire or cause, has been said to designate
more exactly a ‘différend, désaccord, conflit dès le moment où
il éclate (...) pouvant faire l'objet d'une de solution indépendamment de tout
recours à la justice étatique’ (Cornu 1987, also Cornu 1990: 154). Litigation
and dispute are probably, therefore, better equivalents for litige than
proceedings, case or matter.
Except in relation to the term procédure and perhaps instance (defined by


Cornu 1987 as ‘la procédure engagée devant une juridiction; phase d'un procès;
plus précisément, la suite des actes et délais de cette procédure à partir de la
demande introductive d'instance jusqu'au jugement ou autres modes d'extinction de l'instance. On parle en ce sens du déroulement ou de la poursuite de
l'instance’) and poursuites (defined as ‘exercice d'une voie de droit pour contraindre une personne à exécuter ses obligations ou à se soumettre aux ordres de
la loi ou de l'autorité publique’), our conclusion is therefore that the term proceedings should be used sparingly (e.g. to translate audience, for which the most
appropriate translation is surely hearing), though it cannot be said that the
Court's translators ever used it wrongly. Again, the high linguistic quality of our
corpora, both French and English, should be emphasised. It is impossible
to tell whether the texts we have analysed were initially written in French or in
English and then translated into the other language. It is precisely because the
quality of both texts is so high that we can afford to be so meticulous.


Procédure/proceedings represent a complex and hitherto largely uncharted area
of legal terminology, which cannot be fully resolved simply by using semi-automatic term extraction. However, automatic processing presents several
major advances. Firstly, it enables the terminologist to view the total number of
occurrences of the candidate terms and their many equivalents. Secondly, the
immediate access given to all the texts in which the term candidates occur facilitates disambiguation, precision of equivalents and harmonisation of terms
chosen by the translators. Thirdly, Lexter gives priority to multiword term candidates, where the number of multiple equivalences is very much lower than in
the case of single-word terms, and these multiword equivalences can go
straight into the proposed glossary. There remains a significant number of
equivalences which can only be unravelled with specialist knowledge, especially single-word terms with high frequency. For these, the terminologist is obliged to print out the relevant portion of the corpus and engage in traditional legal
Future developments of Lexter aim at extracting verb group candidates,
which, as has been shown in the examples given, significantly reduce the problem of multiple equivalence, and, in the medium term, automatic extraction in
bilingual texts. As to the glossary in hand, three stages are projected: a term
base using Lexter's interface, which will be submitted to the Court's translators


for validation, once the number of multiple equivalents has been significantly
reduced; a paper glossary of the essential terms of human rights; finally a database containing both equivalences and translation memory. Only this sort of
tool will enable translators to respond to the ever-increasing pressure of work
resulting from the growing number of decisions to be translated.

A Concise Dictionary of Law. 1990. 2nd ed. Oxford: Oxford University Press.
Assal, A., et Delavigne, V. 1993. Le découpage des unités terminologiques complexes : limites des critères linguistiques. In Actes de la quatrième journée ERLA-GLAT, 175–193. Brest.
Bourigault, D. 1993. Analyse syntaxique locale pour le repérage de termes complexes dans un texte. In Revue t.a.l., 34/2. 105–117.
Bourigault, D., Gonzales-Muilez, I., Gros, C. 1996. LEXTER, a natural language tool for terminology extraction. In Actes du 7ème congrès international EURALEX, 771–779.
Cornu, G. (ed.) 1987. Vocabulaire juridique. Paris: Presses universitaires de France.
Cornu, G. 1990. Linguistique juridique. Paris: Montchrestien.
Dictionnaire des synonymes. 1990. Paris: Le Robert.
Oxford English Dictionary (CD-ROM). 1994. 2nd ed. Oxford: Oxford University Press.
Rondeau, G. (ed.) 1979. Table ronde sur les problèmes du découpage du terme. Montréal: Office de la langue française.
Third New International Dictionary. 1966. Springfield, Mass.: Merriam.


Translation and Parallel Concordancing

Translation alignment
and lexical correspondences*
A methodological reflection
Olivier Kraif


In the last few years much attention has been paid to the outcome of translation
alignment: Isabelle (1992) proposed using bilingual parallel texts, or bi-texts, i.e.
segmented and aligned translation corpora, as a ‘Corporate Memory’ for translators. He claimed that ‘existing translations contain more solutions to more
translation problems than any other existing resource’. Such a translation database, organised as a bilingual concordancer (as in the TransSearch Project, cf.
Simard et al. 1993), would store all the previously found solutions for a given
translation problem and allow the translator to recover them easily. Other
alignment-based tools, such as automatic verification, have a natural place in a
translator's workstation. Error detection can be implemented when translations are provided in aligned format. In the TransCheck system, Macklovitch
(1995a) shows how common errors such as deceptive cognates, calques and illicit
borrowings can be automatically detected in a bi-text framework. Other features, such as exhaustiveness (i.e. omission errors; cf. Isabelle et al. 1993) or terminological consistency (Macklovitch 1995b), can be tested. It is also possible
to verify automatically, in a reliable manner, the proper translation of specific
phrasal constructions such as dates or numerical expressions. The transduction grammar formalism seems to work very well in this kind of restricted
translation task.
In the more ambitious field of Example-Based Machine Translation (Sato
& Nagao 1990, Brown et al. 1990), aligned corpora form the cornerstone of the


system. The linguistic knowledge is stored implicitly in the recorded examples
of translation. The success of the system depends on the huge quantity of
aligned sentences that constitute mutual translations.
Another interesting application is the automatic extraction of bilingual
lexicons. Many studies (Dunning 1993, Dagan et al. 1993, Gaussier & Langé
1995) have shown how to use statistical filters to pair lexical units that have a
similar distribution in each part of the bi-text. As a large proportion of these
similar units are translation equivalents, they can be useful in establishing
bilingual (or multilingual) glossaries for empirical observation.
In order to align parallel texts, several techniques have been implemented
which have yielded satisfactory results. Even when they take advantage of lexical information, most of the systems work at sentence level (Brown et al. 1991,
Simard et al. 1992, Kay & Röscheisen 1993, Gale & Church 1991). Indeed, it is a
well-known fact that the hypothesis of parallelism does not hold below sentence level, and lexical alignment appears to be a far more complex problem.
However, some systems have yielded encouraging results in producing lexical
alignment (Brown et al. 1993).
Given the huge variety of algorithms and techniques devoted to alignment,
we are now entering an evaluation phase, and some large-scale projects such as
Arcade (Langlais et al. 1998) set out to give a coherent framework for the definition
and evaluation of the aligning task. In this project two different tasks
have been tested: sentence alignment and lexical spotting (i.e. finding lexical correspondences for a given list of test words). The evaluation task consists of two
steps: given a test corpus, we first have to determine a gold standard, i.e. a manually constructed alignment that is considered to be exact. Then we have to
implement a metric in order to make a quantitative comparison of any other
alignment with the standard. In both the sentence track and the word track,
two kinds of difficulty resulted from the definition of a standard alignment:
segmentation discrepancy and correspondence problems. Detailed criteria
were given to human aligners and annotators in order to cope with inconsistencies, but the lexical spotting task, compared with sentence alignment, rapidly
proved problematic.
After giving a precise definition of what bilingual alignment involves, we
will go on to describe various problems associated with alignment at word
level. We will then show the inconsistency of such a concept, and draw a line
between the extraction of lexical correspondences and the alignment task from
a general point of view. We believe that only a proper definition of the concepts
of alignment and correspondence that takes account of the actual practice of


translation can produce reliable criteria for the creation of a gold standard that
can be used for the purpose of evaluation.

The concept of alignment

The standard concept of alignment can be summed up as follows:
Aligning consists in finding correspondences, in bilingual parallel corpora,
between textual segments that are translation equivalents.
Translation equivalence is above all a global property of the translation of a
text. It is not a linguistic property, but a pragmatic one: the translation arrived
at is a result of interpretative choices that are made in a specific situational context. As Sager (1994:186) says:
While the cognitive and linguistic equivalents are mainly established at the
level of the sentence or in smaller units during the translation phase, the pragmatic equivalents have to be selected first in the preparation phase and at the
level of the text type before being also realised in smaller units at appropriate
points in the document.

These extra-linguistic parameters are linked to many factors at the pragmatic
level: text typology, text intention, receptors, dynamic equivalence (cf. Nida &
Taber 1982), cultural adaptation, conceptual background and so on.1
Translation equivalence is a relationship between messages entrenched in
two given contexts and backgrounds: the source and the target context. This
global equivalence does not imply equivalence at the level of linguistic units. In
the following example, the original advertisement for golf items is not translated at word level (Henry 1991:15):
(1) To make your greens come true
Pour faire putt de velours

The French version includes a pun, as in English: it refers to the expression faire
patte de velours, which means ‘to sheathe its claws’ (of a cat). Putt is a particular
stroke in golf, and the translation plays on the paronymy between putt and patte.
This example illustrates the fact that the equivalence holds at a global and
an abstract level. The two versions work in the same way, although using different linguistic means. In this case the relevant features are the pun and the
theme. Depending on the function of the message, some features are more relevant than others, and have to be maintained in translation whatever the cost


(while other features are lost): these may be the conceptual content or rhetorical figures, stylistic devices, formal features such as alliteration, and so on.
Therefore, to segment and to establish correspondence between segments,
we have to make a specific assumption about the translation. We might call it
translational compositionality. This concept is developed by Isabelle (1992):
For translation to be possible at all, translational equivalence must be compositional in some sense; that is, the translation of a text must be a function of the
translation of its parts, down to the level of some finite number of primitive
equivalences (say between words and phrase).

I do not completely agree with Isabelle when he presents compositionality as a
condition of the possibility of translation. Compositionality may be a characteristic of the process of translation, but remains a relative notion as far as the
product of translation is concerned. In fact, the translational compositionality of
a bilingual corpus determines exactly the level at which it is possible to align it.
In more formal terms, the compositionality assumption leads to the definition of a specific corpus structure: the bi-text. Generally speaking, a bi-text is a
quadruple <T1,T2,Fs,A> where T1 and T2 are mutual translations (the direction of the translation is irrelevant), Fs is a segmentation function which divides
the texts into a set of smaller units (e.g. paragraphs, sentences, phrases), and A is
the alignment of these units, i.e. a subset of the product Fs(T1) × Fs(T2).
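The quadruple can be pictured as a small data structure. The following is only an illustrative sketch: the class name `BiText`, its field names and the choice of storing A as pairs of segment indices are assumptions made for the example, not part of the definition above.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class BiText:
    """Hypothetical container for the quadruple <T1, T2, Fs, A>."""
    t1: str                                # first text
    t2: str                                # second text (its translation)
    segment: Callable[[str], List[str]]    # Fs: splits a text into units
    alignment: Set[Tuple[int, int]]        # A: pairs of unit indices, a subset of Fs(T1) x Fs(T2)

    def aligned_pairs(self) -> List[Tuple[str, str]]:
        """Return the aligned units as (unit of T1, unit of T2) pairs."""
        units1, units2 = self.segment(self.t1), self.segment(self.t2)
        return [(units1[i], units2[j]) for i, j in sorted(self.alignment)]
```

Storing index pairs rather than strings keeps A independent of the surface text, so the same alignment can be reused with a finer or coarser segmentation function only if the indices are recomputed.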
This general definition can lead to different kinds of bi-text: Fs can produce
either a complete or a fragmentary partition of the texts, or a hierarchical partition where different levels are simultaneously involved (paragraph, sentence,
words). Moreover, we can focus on particular alignments with several restrictions. For instance, Isabelle & Simard (1996) define a monotone alignment in
terms of three constraints:
- no crossing correspondences; i.e. the segments must appear in the same
order in both texts.
- no partially overlapping segments: two different segments that appear in
different pairings cannot share the same portion of text. For instance, the
phrase Machine Aided Translation would not yield two segments: Machine
Aided and Aided Translation.
- no discontinuous correspondences; i.e. there are no discontinuous segments,
such as Machine [...] Translation in the previous example.
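The three constraints can be checked mechanically. The sketch below is one possible reading, not Isabelle & Simard's implementation: it assumes that each segment is given as a sorted list of unit positions (for example word indices) and that a pairing is a tuple (source segment, target segment); the function names are hypothetical.

```python
def is_contiguous(segment):
    """A segment is contiguous if its positions form an unbroken run."""
    return not segment or list(segment) == list(range(segment[0], segment[-1] + 1))


def is_monotone(alignment):
    """Check the three constraints on a list of (source, target) pairings."""
    for side in (0, 1):
        segments = [tuple(pairing[side]) for pairing in alignment]
        # No discontinuous segments.
        if not all(is_contiguous(seg) for seg in segments):
            return False
        # No partial overlap: two different segments may not share positions.
        for a in segments:
            for b in segments:
                if a != b and set(a) & set(b):
                    return False
    # No crossing: sorted by source position, target positions must not decrease.
    ordered = sorted((p for p in alignment if p[0] and p[1]), key=lambda p: p[0][0])
    target_starts = [p[1][0] for p in ordered]
    return target_starts == sorted(target_starts)
```

With the words of Machine Aided Translation numbered 0, 1, 2, the pairing of `[0, 1]` and `[1, 2]` with two different targets is rejected as overlapping, and a segment `[0, 2]` is rejected as discontinuous.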
Most existing alignment systems use this kind of monotone alignment. Indeed,
in the current state of the art, the possibility of automatic alignment is strongly


conditioned by the parallelism of the corpora. As Gaussier & Langé (1995: 71)
have defined it, parallelism consists in the conjunction of two criteria: one-to-one matching and monotony:
One-to-one matching means that each segment of one text has a correspondence in the other text. In fact, this condition is never completely realised,
because translation induces additions and omissions. Therefore, this criterion is
more or less satisfied, depending on the particularities of the translation.
Monotony, as previously defined, is also a relative property. In general,
however, inversion of the sequence of segments is rare.

Alignment techniques
As Simard & Plamondon (1996) point out, alignment techniques can produce
two different kinds of result:
alignment involving a parallel segmentation of both texts into smaller logical
units (such as paragraphs, sentences or even phrases), in such a way that the nth
segment of source text and the nth segment of target text are mutual translations.
a bi-text map involving a set of points (x,y), called anchor points, where x
and y refer to precise locations in the source and the target text that denote portions of text corresponding to one another.
The latter case is very general, because it does not presuppose a previous segmentation. But a bi-text map is not a very useful form of bi-text, as it does not directly
indicate correspondences between textual units as in bilingual concordances: it
only establishes connections between text areas. We consider the bi-text map as a
preliminary and intermediate step for the achievement of a full alignment.
In the following discussion, I will give examples of sentence alignment, but
the problems are the same for every kind of segmentation compatible with

What is alignment?
Bilingual alignment is not a negligible problem, as translation does not preserve unit boundaries. Practically, a sentence can be translated by two or more
sentences, or can simply be omitted. At every stage the alignment algorithm
has to determine the appropriate clustering of units in order to respect the


translation equivalence property. We can illustrate this by the example in the
following table, extracted from an English translation of Jules Verne's novel De
la terre à la lune (which is a part of the BAF corpus, developed at the CITI of
Montreal, which has been used as a benchmark in the Arcade Project; cf.
Langlais et al. 1998 and Simard 1998:489).
Table 1. Example of sentence alignment

English text | French text
P1 Here we are at the 10th of August, exclaimed J.T. Maston one morning, only four months to the 1st of December. | P1 Nous voilà au 10 août, dit un matin J.-T. Maston P2 Quatre mois à peine nous séparent du premier décembre !
(no corresponding sentence) | P3 Enlever le moule intérieur, calibrer l'âme de la pièce, charger la Columbiad, tout cela est à faire !
P2 We shall never be ready in time! | P4 Nous ne serons pas prêts !

We can write this alignment as follows:

A = {[P1; P1 P2], [∅; P3], [P2; P4]}

It is also possible to represent these clusters as a sequence of n-p transitions,
called an alignment path:
A = (1-2), (0-1), (1-1)
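The step from clusters to a path is simply a count of the sentences on each side of every cluster. A minimal sketch, assuming that clusters are stored as pairs of lists of sentence labels (the function name `alignment_path` is hypothetical):

```python
def alignment_path(clusters):
    """Turn a list of (source sentences, target sentences) clusters
    into the corresponding sequence of n-p transitions."""
    return [(len(source), len(target)) for source, target in clusters]


# The alignment of Table 1: English sentences first, French sentences second.
table_1 = [(["P1"], ["P1", "P2"]), ([], ["P3"]), (["P2"], ["P4"])]
path = alignment_path(table_1)
```

Here `path` holds the three transitions (1-2), (0-1) and (1-1) as tuples of integers.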
Figure 1 gives a two-dimensional representation of this path, with T1 and T2 on
the X and Y axes. The alignment is represented by the surfaces involved in the
segment pairings:

Figure 1. Two-dimensional representation of an alignment


If we draw a chart representing the complete translation of Verne's novel, we
get a general view of the path, as shown in Figure 2.

[Figure: Corpus Verne, sentence alignment; axes: French version, English version]

Figure 2. A complete alignment path

The more parallel the translation is, the closer the path is to the diagonal of the

General framework
Several methods have been developed to calculate this kind of path automatically. They are usually implemented within a probabilistic framework: by estimating the probability of all possible paths, the algorithm can find the best-scoring one, i.e. the one with the highest probability.
Given a function p(A) which estimates the probability of alignment A, the
algorithm has to find:
A* = argmax_A p(A)
Naturally, this task of maximisation creates great problems of computation:
the number of possible paths is in O(n!) (where n represents the number of
sentences). A Viterbi algorithm which considers simultaneously all the subpaths that share the same beginning can reduce the computation to O(n^2), but
it is still a considerable problem.
A simpler method of reducing search space is to consider only the paths
that are not too far from the diagonal. This is a direct implication of the parallelism hypothesis: if omissions, additions and inversions are marginal, the path
cannot diverge too much from the diagonal.
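The search described above can be sketched as a dynamic programme over the grid of sentence positions. This is a simplified illustration, not the code of any of the systems cited: it minimises a cost (for instance a negative log-probability) instead of maximising p(A), the set of allowed transitions and the `band` parameter that discards cells too far from the diagonal are assumptions, and the `cost` function is supplied by the caller.

```python
TRANSITIONS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]


def best_path(n_src, n_tgt, cost, band=None):
    """Return the cheapest sequence of transitions from (0, 0) to (n_src, n_tgt).

    cost(i0, i1, j0, j1) scores the cluster pairing source sentences i0..i1-1
    with target sentences j0..j1-1. If band is given, cells whose distance
    from the diagonal exceeds it are skipped.
    """
    best = {(0, 0): (0.0, None, None)}  # cell -> (total cost, previous cell, transition)
    for i in range(n_src + 1):
        for j in range(n_tgt + 1):
            if (i, j) == (0, 0):
                continue
            if band is not None and abs(j - i * n_tgt / max(n_src, 1)) > band:
                continue
            candidates = []
            for di, dj in TRANSITIONS:
                prev = (i - di, j - dj)
                if prev in best:
                    total = best[prev][0] + cost(i - di, i, j - dj, j)
                    candidates.append((total, prev, (di, dj)))
            if candidates:
                best[(i, j)] = min(candidates)
    if (n_src, n_tgt) not in best:
        return None  # the band was too narrow to reach the end of both texts
    path, cell = [], (n_src, n_tgt)
    while cell != (0, 0):
        _, prev, step = best[cell]
        path.append(step)
        cell = prev
    return path[::-1]
```

Each cell is visited once and examines a fixed number of predecessors, which is where the quadratic bound comes from; with a band of constant width the number of visited cells grows only linearly with the length of the texts.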


Another way of reducing search space is a preliminary extraction of a rough but
reliable bi-text map, based on superficial clues. Chapter separators, titles, headers and sometimes paragraph markers can yield information of great interest to
produce a quick and acceptable pre-alignment (Gale & Church 1991). Other
superficial clues are the chains that remain invariant in translation, such as
proper nouns or numbers (Gaussier & Langé 1995). If one had to align a text
and its translation manually in a completely unknown language, one would use
exactly the same superficial, straightforward information. I have shown elsewhere (Kraif 1999) that such chains can be used to align 20% to 50% of the different texts in the BAF corpus (with less than 1% error rate).

Alignment clues
Once the search space has been reduced, we can evaluate the probability of each
possible sentence cluster in order to calculate the global probabilities of each
path. Different kinds of information are available for this estimation.

Segment length
Gale & Church (1991) and Brown et al. (1991) simultaneously developed a length-based method which yielded good results on the Canadian Hansard Corpus.2 The
principle of this method is very simple: a long segment will probably be translated
by a long segment in the target language, and a short segment by a short one.
Indeed, Gale & Church show empirically that the ratio of the source and target
lengths corresponds approximately to a normal distribution. Note that it is possible to compute the segment lengths in two ways: as the number of characters or
the number of words in the segment. According to Gale & Church, the length in
characters seems to be a little more reliable in the case of translations between
English and French (the variance of the ratio is slightly smaller). Using the average
and the variance of this ratio as specific parameters, depending on the language
pairs involved, they compute the probability of a cluster as a combination of two
factors: the probability of length ratio and the probability of transition. These latter probabilities were determined in an empirical way in the case of the Gale &
Church corpus, considering only six of the most frequent types of transition, viz.:
One sentence - one sentence: p(1-1) = 0.89
One sentence - zero sentence and reciprocally: p(1-0) = p(0-1) = 0.0099
Two sentences - one sentence and reciprocally: p(2-1) = p(1-2) = 0.089
Two sentences - two sentences: p(2-2) = 0.011
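The combination of the two factors can be sketched as follows. The transition probabilities are those listed above; everything else is an assumption made for illustration: the default `mean_ratio` and `variance` are placeholder parameters that would have to be estimated for the language pair, the two-tailed normal probability is one common way of scoring the length difference, and the function names are hypothetical.

```python
import math

TRANSITION_PROB = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}


def length_prob(len_src, len_tgt, mean_ratio=1.0, variance=6.8):
    """Two-tailed probability of the observed length difference, assuming
    a normally distributed, variance-normalised difference (placeholder parameters)."""
    if len_src == 0 and len_tgt == 0:
        return 1.0
    mean_len = (len_src + len_tgt / mean_ratio) / 2
    delta = (len_tgt - len_src * mean_ratio) / math.sqrt(mean_len * variance)
    return math.erfc(abs(delta) / math.sqrt(2))


def cluster_cost(src_lengths, tgt_lengths):
    """Negative log of (transition probability x length probability).
    Lengths are given per sentence, e.g. in characters."""
    transition = (len(src_lengths), len(tgt_lengths))
    p = TRANSITION_PROB[transition] * length_prob(sum(src_lengths), sum(tgt_lengths))
    return -math.log(p) if p > 0 else math.inf
```

A cluster type outside the six listed transitions raises a `KeyError` in this sketch, which mirrors the restriction to the most frequent transition types.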

All the other alignment clues are based on the lexical content of the segment. They
come from a very straightforward heuristic: word pairings can lead to segment
pairings. If two segments are translation equivalents, they will probably include
more lexical units that are translation equivalents than any independent segments
would. To take the lexical information into account, one just needs to know which
units are potential equivalents. This linguistic knowledge can be extracted from
various sources including bilingual dictionaries and bilingual corpora.

Bilingual dictionaries
To be usable for this purpose, dictionaries have to be available in electronic format. Moreover, in technical fields, it is not always easy to find a dictionary that
is consistent with the corpus concerned.
Bilingual corpora
It is also possible to extract a list of lexical equivalents directly from a bilingual corpus. Indeed, translation equivalents usually have very similar distributions in
both texts. These distributions can be converted into a mathematical form and
then be compared quantitatively. In the K-vec method, developed by Fung &
Church (1994), both texts are divided into K equal segments. Then, for each word
(here the words are treated as lexical units), it is possible to compute a vector representing its occurrence in each segment: with 1 for the ith co-ordinate if the word
appears in the ith segment, otherwise 0. Thus, when both words have 1 for the
same co-ordinate, one can say that they co-occur. This model of co-occurrence
(cf. Melamed 1998) makes it possible to calculate the similarity of two distributions by several measures based on probabilities and information theory.
In two texts divided into N segments, for two words W1 and W2 occurring in
each text in N1 and N2 segments respectively, and co-occurring in N12 segments, you can easily compute their mutual information:

I = \log \frac{N \cdot N_{12}}{N_1 N_2}

If N1 and N2 are not too small (>3), then beyond a certain threshold of mutual
information (I>2), it is highly improbable that the N12 co-occurrences are due
to chance: you can assume that they are linked by a special contrastive relation,
which may be translational equivalence. For rarer events (N1 or N2 ≤ 3), other


measures, such as the likelihood ratio (Dunning 1993) or the t-score (Fung &
Church 1994), are more suitable.
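The occurrence vectors and the mutual information score can be illustrated with a short sketch. It is not the K-vec implementation of Fung & Church: the texts are assumed to be given as lists of tokens, the segments are cut by token position, and the logarithm is taken to base 2, which is an assumption since the base is not specified above. The function names are hypothetical.

```python
import math


def kvec(tokens, k):
    """Map each word to a binary vector of length k: coordinate i is 1
    if the word occurs in the i-th of k roughly equal segments."""
    size = math.ceil(len(tokens) / k)
    vectors = {}
    for position, word in enumerate(tokens):
        vectors.setdefault(word, [0] * k)[position // size] = 1
    return vectors


def mutual_information(v1, v2):
    """log2(N * N12 / (N1 * N2)) for two occurrence vectors of equal length."""
    n = len(v1)
    n1, n2 = sum(v1), sum(v2)
    n12 = sum(a & b for a, b in zip(v1, v2))
    if n12 == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2(n * n12 / (n1 * n2))
```

For example, with `en = kvec(english_tokens, 4)` and `fr = kvec(french_tokens, 4)`, the score of a candidate pair is `mutual_information(en["cat"], fr["chat"])`. On very small counts the score is unreliable, which is why the thresholds on N1 and N2 matter.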
The problem with the K-vec method is that segments are big (because the system has no knowledge about the real sentence alignment) and the co-occurrence model is very imprecise. The finer the alignment, the more exact the
word pairing obtained.
As there is an interrelation between segment pairing and word pairing,
some systems work in an iterative framework (Kay & Röscheisen 1993, Débili
& Sammouda 1992). From a rough prealignment of the corpus they extract a
list of word correspondences. From these correspondences they then compute
a finer alignment. From this new alignment they extract a new and more complete set of word pairings. And so on, until the alignment has reached stability.

Formal resemblance
Another way of determining lexical equivalence is to focus on cognate words
which share common etymological roots, such as the French word correspondance and the English word correspondence. Cognates are defined by Simard
et al. (1992) as word pairs which share the same first four characters (4-grams),
including also invariant chains such as proper nouns and numbers. Simard et
al. show empirically that cognateness is strongly correlated with translation
equivalence. On the basis of a probabilistic model, they estimate the probability of a segment cluster given its cognateness. This model, combined with the
length-based model, yielded significant improvements of the results achieved
by Gale & Church. In previous work, we have shown that a special filtering of cognate words can give a very precise and complete prealignment: in the case of the
BAF corpus, we obtained 80% of the full alignment, with a very low error rate
(about 0.5%). Of course, the exploitation of formal similarities depends on the
languages involved. In the case of related languages such as English and French,
cognateness is important. In the case of technical texts we can expect to observe
cognates even between unrelated languages, because technical and scientific
terms usually share common Graeco-Latin roots.
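The four-character criterion reported above can be sketched in a few lines of Python. This is only an illustration of the definition as summarised here, not the implementation used by Simard et al.: the function name, the case folding and the rule that tokens containing digits must match exactly are assumptions made for the example.

```python
def is_cognate(word1, word2, prefix_len=4):
    """Rough cognate test: the two words share their first `prefix_len` characters.

    Assumptions (not from the source): comparison is case-insensitive, and
    tokens containing digits count as cognates only if they are identical.
    """
    a, b = word1.lower(), word2.lower()
    if any(ch.isdigit() for ch in a) or any(ch.isdigit() for ch in b):
        return a == b  # invariant strings such as numbers
    if len(a) < prefix_len or len(b) < prefix_len:
        return False   # too short to share a full prefix
    return a[:prefix_len] == b[:prefix_len]


def cognate_pairs(segment1, segment2):
    """List every (word1, word2) pair across two token lists that passes the test."""
    return [(w1, w2) for w1 in segment1 for w2 in segment2 if is_cognate(w1, w2)]
```

A criterion this crude will also accept false friends and reject genuine cognates whose spelling diverges early in the word, which is why some filtering of the candidate pairs is needed in practice.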

The concept of lexical correspondence

Usually, lexical correspondences are treated as a particular case of alignment.
In the Arcade project, for instance, lexical spotting is seen as a simpler subproblem of full alignment. Brown et al. (1990) give the following example of
what can be described as word alignment:

(2) The poor don't have any money
Les pauvres sont démunis
A={(The ; Les) (poor ; pauvres) (don't have any money ; sont démunis)}

Even if it is generally admitted that the condition of quasi-monotony does not
hold in this case, the supposed one-to-one matching seems to justify the concept of word alignment. Let us examine the problems that are involved here.
Segmentation discrepancy
From a monolingual point of view, a lexical unit is defined in terms of syntactic
and semantic autonomy. A compound expression can be characterised by the
conjunction of several criteria:

- a certain degree of semantic non-compositionality,
- a more or less syntactically frozen structure,
- a certain recurrence.

We will not discuss the complexity of this problem. The definition of a lexical
unit is a difficult problem in linguistics, and no consensus has been reached so
far in the linguistic community.
In any case, it appears that the units emerging from lexical alignment do
not have lexical consistency, depending only on the structural homology
between the related segments. For instance, another translation of the previous
sentence results in different units:
(3) The poor don't have any money
Les pauvres n'ont pas d'argent
A={(The ; Les) (poor ; pauvres) (don't have ; n'ont pas) (any ; d')
(money ; argent)}

Lexical alignment yields non-lexical compounds, but it can also break up genuine lexical units. For example, we can align the English, French and Italian
expressions in different ways:
(4) To be the very devil
Avoir le diable au corps
Avere il diavolo in corpo
French/Italian: A ={(Avoir ; Avere) (le ; il) (diable ; diavolo) (au ; in)
(corps ; corpo)}
English / French: A = {(To be the very devil ; Avoir le diable au corps)}

In this case we have word-for-word correspondence inside the lexical unit
across Italian and French. The problem is: should the lexical alignment be
allowed to break up lexical compounds, when it is possible?

Semantic discrepancies
Another problem is semantic discrepancy, which is common between a text
and its translation. The following example is extracted from a European
Parliament report.3
(5) the marking of banknotes for the benefit of the blind and partially sighted
l'émission de billets de banque identifiables par les aveugles et par les personnes à vision réduite
[literally: the issue of banknotes identifiable by the blind and partially
sighted persons]

The phenomenon of semantic discrepancy is frequently found in the practice
of translating. This can be explained by the importance of the extra-linguistic
level. Translation, as Pergnier notes (1993: 23), is not only an operation
between two different languages, it is first a transformation between messages,
involving the whole pragmatic and conceptual background.4 As Pergnier
(1993: 75) says, "the equivalence at both levels, between two utterances and
between the signs that they include, does not exist before the translation, but is
a consequence of it" [my translation].
Thus the contrastive level, i.e. the possible equivalence between signs of
different systems, is secondary: it is a result of translation as an act of communication, as shown in Figure 3.

[Figure 3: diagram relating Text 1 and Text 2. Translational equivalence holds at the pragmatic and extra-linguistic level; at the linguistic level there is only a mediated contrastive relation.]

Figure 3. The level of translation equivalence

As a result, lexical alignment based on semantic criteria is very often unclear. In
these two sentences
(6) the various policies for access to employment for disabled people
les différentes politiques mises en œuvre pour permettre l'accès des personnes
handicapées à l'emploi
[literally: the various policies implemented to allow disabled people to
access a job]

divergent solutions are possible for the following phrases:

A={(for ; mises en œuvre pour permettre)}
or else, if we take omissions into account:
A={(∅ ; mises en œuvre) (for ; pour) (∅ ; permettre)}
These semantic discrepancies, combined with segmentation difficulties, create
very complex configurations in lexical alignment. Consider the following case:
(7) The assessment of the official cause of death is a piece of information vital to
these registers.
Pour la bonne tenue de ces registres, l'évaluation des cas de mortalité constatés
par les autorités apporte des informations importantes.
[literally: For the good keeping of these registers, the evaluation of causes of
death noted by the authorities gives important information]

In these sentences we observe correspondences between discontinuous units:

A={(vital ; importantes […] pour la bonne tenue de ces registres)}
There are thus two possible alignments of the following phrases:
A={(cause of death ; cas de mortalité) (official ; constatés par les autorités)}
A={(official cause of death ; cas de mortalité constatés par les autorités)}
Since semantic discrepancy and segmentation inconsistency are not discrete
phenomena, but follow a continuum of intensity, the determination of reliable
criteria to solve this kind of alignment is almost impossible.
Recently great attention has been given to automatically extracted bilingual glossaries. Indeed, as we have seen before, probabilistic models make it
possible to extract lexical correspondences by comparing the distribution of
lexical items in a parallel corpus. Large-scale evaluations, as in the Arcade project, have been designed to test these methods and to guide the construction of
a gold standard, established on the basis of a test corpus, in order to benchmark
the different systems. In order to cope with the problems inherent in the concept of lexical alignment and delineate more clearly the task of automatic lexical pairing, we propose a redefinition of the concept of lexical correspondence.

Lexical correspondences
We agree with Debili (1997:200) that lexical alignment is neither one-to-one, nor
sequential, nor compact. Correspondences are fuzzy and contextual. He therefore
proposes to distinguish between lexical correspondence, where the mutual translation can be validated by a bilingual dictionary, and contextual correspondence
(1997:203), i.e. translation that depends on a specific context. But we do not subscribe to this point of view. The attestation of a dictionary is a somewhat arbitrary
criterion, and it does not reflect the inherent continuity of the phenomena.
We prefer to distinguish two different kinds of task: alignment and the
determination of correspondences. Indeed, lexical correspondence can be
defined in a very restricted sense:
A lexical correspondence is a relation of denotational (conceptual, extra-linguistic) equivalence between two lexical units in the context of two segments
that are translation equivalents.
This definition raises the following issues:

- Lexical units are linguistically defined, in a monolingual context. By adopting a broad definition of lexical units, including compounds, phraseology
and even terms, it is possible to avoid the issue of segmentation inconsistency. If the problem is shifted to a monolingual point of view, its resolution appears to be far more reasonable.
- We focus on the contextual sense of the lexical unit (referring to the opposition between signe type and signe occurrence made by Rastier 1991:96).
- Monotony and one-to-one matching are no longer presumed, in accordance with empirical observations.

We feel that lexical alignment is a nebulous notion which inherits most of the
misleading statements from the first generation of MT systems. For instance, in
this case:
(8) the marking of banknotes for the benefit of the blind and partially sighted
l'émission de billets de banque identifiables par les aveugles et par les personnes à vision réduite.

We can draw the following correspondences:

C={(banknotes ; billets de banque) (blind ; aveugles) (partially sighted ; personnes à vision réduite)}
The rest of each sentence is simply normal translation residue, due to the divergences between the two versions. These divergences can have a linguistic cause
(e.g. morphosyntactic or lexical differences) or not (e.g. conceptual inferences).
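The split between corresponding units and residue in example (8) can be made concrete with a short Python sketch. It assumes whitespace tokenisation and treats each lexical unit as a contiguous token sequence; the function name and both simplifications are hypothetical, chosen only to illustrate the distinction.

```python
def residue(tokens, units):
    """Return the tokens of a segment that are not covered by any lexical unit.

    `units` holds the lexical units on one side of the correspondences;
    multiword units are matched as contiguous token sequences (a simplification).
    """
    covered = [False] * len(tokens)
    for unit in units:
        words = unit.split()
        n = len(words)
        for start in range(len(tokens) - n + 1):
            window_free = not any(covered[start:start + n])
            if tokens[start:start + n] == words and window_free:
                covered[start:start + n] = [True] * n
                break  # mark only the first free occurrence of the unit
    return [tok for tok, done in zip(tokens, covered) if not done]


# Correspondences from example (8), as (English unit, French unit) pairs.
C = [("banknotes", "billets de banque"),
     ("blind", "aveugles"),
     ("partially sighted", "personnes à vision réduite")]

english = "the marking of banknotes for the benefit of the blind and partially sighted".split()
french = ("l'émission de billets de banque identifiables par les aveugles "
          "et par les personnes à vision réduite").split()

english_residue = residue(english, [en for en, _ in C])
french_residue = residue(french, [fr for _, fr in C])
```

On this reading, words such as marking and l'émission fall into the residue: nothing on the other side denotes the same thing, even though the two segments as wholes are translation equivalents.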

Maximal resolution alignment

This kind of lexical correspondence differs from sub-sentence alignment. We
define a special kind of alignment that is very often confused with lexical correspondence:
A maximal resolution alignment is a matching of the smallest possible segments in accordance with the principle of translational compositionality.
This kind of alignment does respect the criteria of parallelism, except for
monotony below the sentence level. In such an alignment, the syntactic characterisation of the segments is not determined: it can be a word, a phrase, a whole
sentence, or even a paragraph. This depends on whether the translation is literal or not: if the translation of a sentence cannot be decomposed, the sentence
has to be considered as a whole.
Translation spotting, as defined in the Arcade project, appears to be a kind
of maximal alignment, and yet it is fragmentary: it focuses on segments that
contain some specific lexical units. For instance, looking for the correspondence of the French word apporter, it yields the alignment between the boldfaced segments:
(9) A meeting held in Brussels […] went a long way towards meeting the concerns expressed by the Honourable Member.
Une réunion, qui s'est tenue à Bruxelles […] a permis d'accentuer l'effort pour
apporter des éléments concrets de réponse aux préoccupations exprimées par
l'honorable parlementaire.

The notions of translational compositionality and maximality capture very
neatly the criteria of translation spotting. In discussions about the appropriateness of aligning peas with pois in the phrases green peas and petits pois, the non-compositionality of this translation pair gives a very clear solution: petits pois
and green peas cannot be decomposed.

The characteristics of lexical correspondence and maximal alignment are
summed up in Table 2.

Table 2. Characteristics of Lexical Correspondence and Maximal Alignment

| | Lexical Correspondence | Maximal Alignment |
|---|---|---|
| | Monolingual, lexical unit level | Segmentation depends on structural homology between texts. It is based on both translational compositionality and on maximality: the segments cannot be decomposed |
| | Usually one-to-one relations between some lexical units, and the rest is residual. Many-to-many relations are also possible. | Quasi-bijection, quasi-monotony below sentence level. |
| Syntactic nature of the segments | Lexical unit: words, compounds, set phrases, terms. | No syntactic consistency: word, phrase, sentence, paragraph. |
| Pairing criterion | Denotational identity (in the occurrence context). | Translation equivalence |

To illustrate these two concepts, we give another example:

(10) Confidential secret service information on applicants for European civil
service posts
Récolte de données à caractère personnel par les services secrets d'un État
membre sur les candidats aux concours organisés par les institutions européennes

The maximal alignment could be as follows:

A={(Confidential ; à caractère personnel) (secret service ; par les services
secrets) (∅ ; d'un État membre) (information ; Récolte de données) (on ; sur)
(applicants ; les candidats) (for European civil service posts ; aux concours
organisés par les institutions européennes)}
And we can extract the following lexical correspondences:
C={(confidential ; personnel) (secret service ; services secrets) (information ;
données) (on ; sur) (applicant ; candidat) (European ; européennes)}

Conclusion
These reflections aim at defining and clarifying the key concepts of alignment
and correspondence in the field of bi-text exploitation and evaluation. We
make a distinction between two different types of bilingual pairing: the alignment of the smallest segments that are considered as translational equivalents
(in accordance with the principle of translational compositionality), and the
lexical correspondences which concern stable lexical units (in a broad sense)
having the same denotational content. In fact, inside two aligned sentences,
there is no need to have all lexical units correspond with each other. Semantic
discrepancies between a sentence and its translation can be very important,
and the assumption of quasi-bijection does not hold at the lexical level.
This distinction opens up a number of new possibilities:

- the development of more consistent criteria in order to establish benchmark corpora in the field of evaluation,
- a more accurate interpretation of the meaning of contrastive phenomena
which emerge from a bi-text. The sets of textual segments constituting a bi-text are not linked by specific linguistic properties, but by translational
equivalence, which is defined at an extra-linguistic level. Of course, contrastive regularities can be observed at different levels: morpho-syntax,
lexicology, terminology and phraseology. But these regularities are not
rules: they emerge statistically from the recurrence of translation facts.

* Many thanks to Kim Van den Broecke, Hélène Ledouble and Luc Bardolph for their
helpful assistance in the editing of this article.

1. "Dynamic equivalence is therefore to be defined in terms of the degree to which receptors of the message in the receptor language respond to it in substantially the same manner
as the receptors in the source language." (Nida and Taber 1982:24).
2. The Canadian Hansard Corpus consists of French/English Canadian parliamentary
proceedings, available at

3. These reports can be found at

4. "Dire que la traduction opère sur des messages, c'est en effet proclamer qu'elle est un
acte de communication (ou d'échange linguistique) avant d'être un acte de comparaison
inter-linguale." (Pergnier 1993:23)
[literally: To say that translation operates on messages is indeed to proclaim that it is an
act of communication (or of linguistic exchange) before being an act of inter-lingual comparison.]


Brown, P., Cocke, J., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R. and Roossin, P. 1990.
A statistical approach to machine translation. Computational Linguistics 16: 79–85.
Brown, P., Della Pietra, S. and Mercer, R. 1993. The mathematics of statistical machine
translation: parameter estimation. Computational Linguistics 19: 263–311.
Brown, P., Lai, J. and Mercer, R. 1991. Aligning sentences in parallel corpora. In Proceedings
of the 29th Annual Meeting of the Association for Computational Linguistics, 169–176.
Berkeley, CA.
Dagan, I., Church, K. W. and Gale, W. 1993. Robust bilingual word alignment for machine
aided translation. In Proceedings of the Workshop on Very Large Corpora, Academic and
Industrial Perspectives, 1–8.
Debili, F. 1997. L'appariement: quels problèmes?. In Actes des 1ères JST FRANCIL de
l'AUPELF UREF, 199–206. Avignon.
Debili, F. and Sammouda, E. 1992. Appariements de phrases de textes bilingues Français-Anglais et Français-Arabes. In Actes de COLING-92, 528524. Nantes.
Dunning, T. 1993. Accurate methods for the statistics of surprise and coincidence.
Computational Linguistics 19: 61–74.
Fung, P. and Church, K. W. 1994. K-vec: A new approach for aligning parallel texts. In
Proceedings of the 15th International Conference on Computational Linguistics,
1096–1102. Kyoto.
Gale, W. and Church, K. W. 1991. A program for aligning sentences in bilingual corpora. In
Proceedings of the 29th Annual Meeting of the ACL, 177–184. Berkeley, CA.
Gaussier, E. and Langé, J.-M. 1995. Modèles statistiques pour l'extraction de lexiques
bilingues. T.A.L. 36 (1–2): 133–155.
Isabelle, P. 1992. La bi-textualité: vers une nouvelle génération d'aides à la traduction et à la
terminologie. Meta XXXVII (4): 721–731.
Isabelle, P., Dymetman, M., Foster, G., Jutras, J. M. and Macklovitch, E. 1993. Translation
analysis and translation automation. In Proceedings of the 5th International Conference
on Theoretical and Methodological Issues in MT. Kyoto.
Israël, F. and Lederer, M. 1991. La liberté en traduction. Actes du colloque international tenu
à l'E.S.I.T. les 7, 8 et 9 juin 90. Paris. Didier Erudition, Coll. traductologie.
Kay, M. and Röscheisen, M. 1993. Text-translation alignment. Computational Linguistics
19: 121–142.
Kraif, O. 1999. Identification des cognats et alignement bi-textuel: une étude empirique. In
Actes de la 6ème conférence annuelle sur le Traitement Automatique des Langues
Naturelles. TALN 99, 205–214. Cargèse, France.
Langé, J.-M. and Gaussier, E. 1995. Alignement de corpus multilingues au niveau des
phrases. T.A.L. 36 (1–2): 133–155.
Langlais, Ph., Simard, M. and Veronis, J. 1998. Methods and practical issues in evaluating
alignment techniques. In Proceedings of 36th Annual Meeting of the Association for
Computational Linguistics and 17th International Conference on Computational
Linguistics. Montréal, Canada.

Macklovitch, E. 1995a. Can terminological consistency be validated automatically?. In
Proceedings of the IVèmes Journées scientifiques, lexicommatiques et dictionnairiques,
organized by Aupelf-Uref. Lyon, France.
Macklovitch, E. 1995b. The future of MT is now, and Bar-Hillel was (almost entirely)
right. Centre d'innovation en technologies de l'information (CITI). Laval, Canada.
[Available at]
Melamed, I. D. 1998. Models of co-occurrence. In Technical Report #9805. Institute for
Research in Cognitive Science, University of Pennsylvania, Philadelphia, PA. [Available
Nida, E. A. and Taber, C. R. 1982. The Theory and Practice of Translation. Leiden: Brill.
Pergnier, M. 1993. Les fondements socio-linguistiques de la traduction. Lille: Presses
Universitaires de Lille.
Rastier, F. 1989. Sens et textualité. Paris: Hachette, Coll. HU.
Sager, J. C. 1994. Language Engineering and Translation: Consequences of Automation.
Amsterdam: John Benjamins.
Sato, S. and Nagao, M. 1990. Towards memory-based translation. In Proceedings of COLING-90, 247–252. Helsinki.
Simard, M., Foster, G. and Isabelle, P. 1992. Using cognates to align sentences. In
Proceedings of the Fourth International Conference on Theoretical and Methodological
Issues in Machine Translation, 67–81. Montréal, Canada.
Simard, M., Foster, F. and Perrault, F. 1993. TransSearch: un concordancier bilingue.
Centre d'innovation en technologies de l'information (CITI), Laval, Canada.
[Available at URL]
Simard, M. 1998. The BAF: a corpus of English-French bitext. In Proceedings of First
International Conference on Language Resources and Evaluation, 489–494. Granada,

The use of electronic corpora and lexical frequency data in solving translation problems

François Maniez
Polysemy creates difficulties for human translators, and even greater difficulties for automatic translation programs. Starting from an example of such a
difficulty (the translation into French of the compound sedimentation rate), a
case is made for using computerized databases that include the most frequently used compounds and collocations. Attention is also given to measures of
lexical frequency that take into account the various meanings of a polysemous
lexical item. The study of a case of syntactic ambiguity demonstrates that taking into account the frequency of the collocation "based on" in two separate corpora (scientific and non-scientific) makes it possible to achieve disambiguation in an automatic translation program if it is made sensitive to lexical
environment. A third example deals with the failure to detect a non-contiguous collocation. The factors that contribute to the correct interpretation of
such lexical constructs by bilingual translators are examined, as well as the
ways in which their recognition can be replicated in the creation of automatic
translation software.

Disambiguation techniques and automatic translation

The first step in automatic language processing with a view to translation from
English into a foreign language is automatic part-of-speech (POS) tagging. A
number of tagging programs exist and have been used with large corpora,
including Brill's stochastic tagger (see Brill 1993), and their accuracy is improving: they can generally identify the POS of about 95% of all
words that are processed. The CLAWS4 automatic tagger (described in Garside
1987) that was used with the 100-million-word British National Corpus (BNC)
produced erroneous tags for only an estimated 1.7% of all words, with approximately 4.7% of the tags labeled as ambiguity tags, i.e. cases in which the automatic tagger was unable to decide which was the correct category, for instance
between VVD (past tense verb) and VVN (past participle).
A trickier problem is that of word-sense disambiguation, for which there is
no totally reliable automatic processing to date. Part of the Brown Corpus
(about 670 000 words) has been manually tagged at Princeton University,
using semantic tags that correspond to the word senses of the WordNet
database (Miller 1990), and attempts are in progress to match the WordNet project
with other European languages (Vossen 1996). The DEFI project team at the
University of Liège (Michiels 1996) has developed an automatic word-sense disambiguation project that matches words drawn from the context in which an
ambiguous word is found with the words contained in the dictionary definitions for its various meanings, using the large Collins-Robert machine-readable
dictionary (MRD). In another experiment involving an MRD, the Collins
English-German dictionary, Neff & McCord (1990) successfully used monolingual resources comparable to WordNet synsets to match a given polysemous
word with one of the collocates provided in entries and thus determine the correct translation in context. Sutcliffe & Slater (1995) also describe a method that
uses word association in order to achieve word-sense disambiguation, and point
out that higher levels of performance can be obtained in restricted contexts
where domain-specific word-sense frequency data can be exploited.
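The idea of matching context words against the words of dictionary definitions can be illustrated with a simplified overlap count, in the spirit of what is often called the Lesk approach. The sketch below is not the DEFI implementation: the function name, the toy glosses for rate and the sample context are all hypothetical, and a real system would add lemmatisation, stop-word handling and a full machine-readable dictionary.

```python
def pick_sense(context, definitions, stopwords=frozenset()):
    """Choose the sense whose definition shares the most words with the context.

    `definitions` maps a sense label to its gloss. Ties are resolved in favour
    of the sense listed first, which is an arbitrary choice made for this sketch.
    """
    context_words = {w.lower() for w in context.split()} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in definitions.items():
        gloss_words = {w.lower() for w in gloss.split()} - stopwords
        overlap = len(context_words & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical toy glosses for two senses of "rate", labelled by French equivalent.
RATE_GLOSSES = {
    "taux": "quantity or amount of one thing in relation to another percentage",
    "vitesse": "speed of movement or change measured per hour",
}

chosen = pick_sense("sedimentation speed measured in mm per hour", RATE_GLOSSES)
```

With these toy glosses the context shares four words with the second definition and only one with the first, so the second sense is selected.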
Experiments in the field of word-sense disambiguation are also being carried out by the Rank Xerox research team in Grenoble. Encouraging results
have been obtained by Dini et al. (1998), using 45 upper-level WordNet semantic tags and merging them with functional tags into single tags in order to
achieve disambiguation. The current limitations of their unsupervised learning algorithm lie in the fact that it is only able to learn relations among bigrams.
The fact that WordNet does not distinguish between homonymy and polysemy
has also proved to be something of a hurdle.

Corpora
It has become commonplace to argue that computer science has now revolutionized the study of language, and the contribution of corpus linguistics to the
fields of lexicography and phraseology has been widely demonstrated. My aim,
therefore, is not so much to prove the usefulness of computerized tools in the
field of translation as to define the role of man in the collection of the data that
are needed to improve expert systems or to assist in the compilation of reference tools that can be of use to translators.
Considering the huge amount of data that is now available in electronic
form, the creation of an electronic corpus is no longer a matter of gathering a
sufficient quantity of information, but rather of selecting the data that are relevant to the linguistic mechanisms one wishes to focus on. In the case of bilingual corpora, the issue of choice between parallel and comparable corpora has
been raised by many authors, for example Teubert (1996). As regards monolingual corpora, the main problem that needs to be solved before compilation is
the desired level of homogeneity and what subdivision of a given language is
best suited to one's research. Also, as has been pointed out by Clear (1996), the
issue of corpus size must be addressed if one is concerned with the statistical
significance of the data on which linguistic observations are based.
For the present study, I have used two separate electronic corpora: the first
consists of articles published in Time Magazine in the past ten years (TIME 20TH
CD-ROM: approximately 10 million words); the second is a collection of medical
articles taken from a CD-ROM (Internal Medicine 1993), with a total of approximately 18 million words. For the study of high-frequency lexical items, I have also
used subsets of these two corpora: all the articles published in Time Magazine in
the year 1991 (henceforth referred to as Time91) and a subsection of articles published by the Journal of the American Medical Association in the year 1993, which I
called Corpumed. Both subsets were analyzed with John Bradley's TACT program
(Bradley 1989), which was developed at the University of Toronto.
Taking as a starting point some concrete translation problems encountered
by our own medical students, I searched our corpora for lexical frequency and
co-occurrence data that might help a human translator to solve the problems
that were created by some instances of lexical or syntactic ambiguity, with a
view to formalizing some of our findings so as to make them ready to use in the
framework of an automatic translation program.

Some examples of misinterpretation

Polysemous lexical units
(1) Hematocrit was 0.38, with an elevated white blood cell count of 13.3 × 10⁹/L.
The Westergren method for erythrocyte sedimentation rate was 103 mm/h.
L'hématocrite était de 0.38 (38%), et la numération leucocytaire était élevée
(13,3 × 10⁹/l). La (VSG) vitesse de sédimentation globulaire (sanguine / des
hématies / érythrocytaire), mesurée par la méthode de Westergren, était de
103 mm/h.

The uses of the word rate can be divided into the following five categories (sorted by descending order of frequency in our general corpus). The French equivalents are given for each meaning:
a. standard of reckoning obtained by expressing the quantity or amount of
one thing in relation to another (taux, pourcentage)
b. measure of charge or cost (prix, tarif)
c. speed of movement, change, etc; pace (vitesse, rythme, cadence)
d. measure of value (ordre, classe)
e. case, as in "at any rate" (cas)
It is worth noting that our distinctions are a reflection of the division into
French equivalents. For instance, some monolingual dictionaries will combine
meanings (b) and (d) into one category.
Example (1) was given in its original context to a class of graduate students
who specialize in medical translation, and they were asked to translate it into
French. Although a speed unit (mm/h) was mentioned, about half of them
misunderstood rate, translating it by an equivalent usually reserved for meaning (a), the word taux (it is quite possible that some of them were misled by the
mention of hematocrit in the previous sentence, as the result of this test is typically given in the form of a percentage).
As the compound sedimentation rate is very frequently used in medical literature, one may safely assume that its translation would not have posed much
of a problem for today's best translation software, provided a specialized dictionary was included. However, it is worth considering the ways in which lexical statistics might help human translators in their task.
First of all, one can safely assume that the number of polysemous words
that are used in scientific literature is limited. If one could draw up a list of the
most frequently used polysemous words in a given field, one might consider
various ways of signaling their presence to the translator, provided his original
text was in machine-readable form. As a test, I examined a list of the words
whose frequency was higher than 200 in the Corpumed corpus (the figure 200
was chosen because the frequency for rate was 225 in that corpus). If we set
aside function words, the majority of our hits were monosemous nouns (blood,
bone, breast, calcium, cancer, data, diagnosis, levels, patients, symptoms, weight),
and the only polysemous words that were in the same order of frequency as rate
were the nouns stroke, table and trial.
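The frequency test just described can be sketched in Python as follows. This is not the procedure actually run with TACT: the tokenisation, the function names and the idea of checking the frequent words against a hand-compiled polysemy list are assumptions made for illustration, and the list itself contains only the four nouns named above.

```python
import re
from collections import Counter

# Hand-compiled polysemy list; the four nouns are those named in the text.
POLYSEMOUS = {"rate", "stroke", "table", "trial"}

def frequent_words(text, threshold, function_words=frozenset()):
    """Return {word: count} for words occurring more than `threshold` times."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word: n for word, n in counts.items()
            if n > threshold and word not in function_words}

def flag_polysemous(frequencies, polysemous=POLYSEMOUS):
    """Keep only the frequent words that appear in the polysemy list."""
    return sorted(word for word in frequencies if word in polysemous)
```

On a corpus the size of Corpumed the threshold would be set to 200, and the words returned by flag_polysemous would be the candidates to signal to the translator.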
It thus seems feasible to establish a list of the most frequently used polysemous words in scientific English in order to point out possible ambiguities to
the translator. However, if the aim is to assist both human translators and
machines in their task, it might prove useful to add some measure of frequency
information to such a list. If we once again consider the examples given above,
it is more than likely that in a medical publication the meanings of the words
stroke, table and trial will be those that would translate into French as accident
vasculaire cérébral, tableau and essai, not as coup, table or procès.
Bearing this in mind, it might prove useful to provide the translator with
statistical information about the relative frequency of a given lexical item in
specialized literature as opposed to everyday speech, as well as potential differences in the proportions in which the various meanings of polysemous words
are used depending on the type of language. Table 1 shows that the word rate is
quite representative of such context-dependent variation.
Table 1. French translation for the word rate in the TIME91 and CORPUMED corpora.

| Uses of rate | TIME91 (1.8 M) | CORPUMED (306,000) |
|---|---|---|
| Absolute frequency of rate | | |
| Normalized frequency of rate (per 500,000 words) | | |
| Collocation types | | |
| a) pourcentage, taux | 54 (48%) | 19 (90%) |
| b) tarif, prix | 14 (12%) | |
| c) vitesse, rythme | 11 (10%) | 2 (10%) |
| | 33 (29%) | |


The table shows two things. The first is that the word rate is used almost five
times as frequently in the medical corpus as in the general corpus. Another
point worthy of note is that meaning (b) is not to be found in the medical corpus; meaning (c) is likewise hardly ever used. It should be noted, however, that
searching the general corpus for the plural form (rates) gave us very different
results (meaning (a): 187 occurrences, meaning (b): 23 occurrences, meaning (c):
4 occurrences), which makes it worth considering the case for differentiating collocational statistics in the singular and plural forms.
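Because the two corpora differ greatly in size, raw counts have to be rescaled before they can be compared. The following Python sketch shows the arithmetic for a frequency normalized per 500,000 words and for a percentage share. The function names are hypothetical, and pairing the 225 occurrences of rate with a corpus size of 306,000 words is an assumption based on the figures quoted for Corpumed.

```python
def normalized_frequency(count, corpus_size, per=500_000):
    """Rescale a raw count to a frequency per `per` words of running text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * per / corpus_size

def share(count, total):
    """Percentage of `total` represented by `count`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# Assumed pairing of figures: 225 occurrences of "rate" in about 306,000 words.
corpumed_rate = normalized_frequency(225, 306_000)

# Share of meaning (a) among the general-corpus collocation types in Table 1.
meaning_a_share = share(54, 54 + 14 + 11 + 33)
```

The same two functions could be applied separately to the singular and plural forms if their collocational statistics are to be kept apart.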

In conclusion, we can see that in the medical corpus, meanings (a) and (c)
are the most frequent and that in order to determine the exact meaning of the
word rate, it is very often sufficient to examine the word that immediately precedes it (its premodifier). One can thus consider including several different
kinds of information in a dictionary that lists compounds and collocations:
1. A list of all the premodifiers of a given lexical item
2. The most frequently found meaning of that particular lexical item, depending on which premodifier is used, or on which type of specialized language is
used. In the case of medical literature, for example, one may want to assign
meaning (a) as the default value for rate; consequently, one might consider
including in the dictionary (or displaying as the result of a given query) only
those word combinations with a meaning other than percentage.
3. The most frequently used translation of a given word combination when a
choice has to be made between several possible translations of the premodifier,
possibly depending on the type of specialized language one is dealing with.
Thus, the translator could be told that success rate is more frequently translated
as taux de réussite in a medical context than as pourcentage de réussite; however,
both forms exist and are acceptable in everyday use. In the case of the collocates
for rate in meaning (a), including this type of information when entering data
would allow users to test the acceptability of collocations, leading them to discard pourcentage d'intérêt as a possible translation of interest rate. As suggested
in the case of rate and rates, one might also include information concerning the
proportion of the meanings of the singular and plural forms, provided it is statistically significant.
In the case of a dictionary devoted only to collocations in scientific English, one
can easily imagine that the high level of repetition of the same lexical combinations, which has been confirmed by various statistical measurements of lexical
richness, will make it unnecessary to resort to any kind of detailed semantic
analysis. However, in the case of a reference work which aims to describe the
general use of a given language, the model suggested by Fontenelle (1997) using
the lexical functions defined by Mel'čuk (1994) seems best suited to a comprehensive description of collocational structures in all their diversity.
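The three kinds of information listed above can be pictured as a single lexicon record. The following is only a minimal sketch: the class name, field names and domain labels are hypothetical, and the single entry shown simply restates the success rate example given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """Hypothetical entry for a dictionary of compounds and collocations."""
    headword: str
    default_sense: dict = field(default_factory=dict)  # domain -> sense label
    translations: dict = field(default_factory=dict)   # (premodifier, domain) -> ranked list

    def premodifiers(self):
        """All premodifiers recorded for the headword (item 1 of the list)."""
        return sorted({premodifier for premodifier, _ in self.translations})

    def ranked_translations(self, premodifier, domain):
        """Translations of a combination, most frequent first (item 3 of the list)."""
        return self.translations.get((premodifier, domain), [])


rate = LexiconEntry(
    headword="rate",
    default_sense={"medical": "a"},  # item 2: meaning (a) as the medical default
    translations={
        ("success", "medical"): ["taux de réussite", "pourcentage de réussite"],
    },
)

print(rate.premodifiers())
print(rate.ranked_translations("success", "medical")[0])
```

A combination that has not been recorded returns an empty list, which a query interface could treat as a signal that the proposed translation is unattested.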
. Syntactic ambiguity
The extract given below was also misinterpreted by half the students asked to
translate it into French. In this case, however, the mistake is likely to be reproduced by the morpho-syntactic analyzers of most of today's automatic translation programs.

(2) Not infrequently, patients with Gaucher's disease are initially presumed to
have lymphoproliferative disorders or childhood soft tumors based on
their abdominal mass and suppressed blood cell counts.
Il n'est pas rare que l'hypertrophie abdominale et la cytopénie sanguine des
patients induisent (provoquent) initialement le diagnostic de maladie lymphoproliférative ou de tumeur des parties molles chez l'enfant dans les cas de
maladie de Gaucher.

In this case, the students who misunderstood the sentence interpreted based as
the past participle of a verb form whose auxiliary had been deleted together
with the pronoun of an underlying relative clause (soft tumors that are based
on their abdominal mass), failing to recognize based on as a complex
preposition that is synonymous with because of or due to. The semantic analysis
that is necessary in order to rule out the erroneous interpretation is beyond the
capacity of today's automatic translation programs, but the root of the mistake
probably also lies in the fact that the reduction of a relative clause in this fashion seems to be more frequent in French than in English. It should also be
noted that in many cases the fact that the prepositional clause introduced by
based on is placed at the beginning of a sentence prevents such ambiguity.
As in the case of example (1), I examined the occurrences of the expression
based on in the same corpora (Time91 and Corpumed). The results are shown
in Table 2.
Table 2. Frequencies of based and based on used as a complex preposition (CP)
in the two corpora.

Columns: (1) occurrences of based and their relative frequency in the corpus;
(2) occurrences of based on and their % of all occurrences of based;
(3) occurrences of based on as a CP (start of sentence);
(4) occurrences of based on as a CP;
(5) total % of occurrences of based on as a CP.

            (1)             (2)          (3)    (4)    (5)
TIME91      300 (0.016%)    181 (60%)
CORPUMED    145 (0.047%)    138 (95%)

Bearing in mind that Time91 has approximately six times as many tokens as
Corpumed, it appears that the frequency of based is three times as great in the
medical corpus as in the Time corpus, while the frequency of based on is five times
as great and the frequency of based on as a complex preposition ten times as high.
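
The first two ratios can be checked against the figures that survive in Table 2. The sketch below is only an illustration: the variable and function names are my own, and because the percentages in the table are rounded, the results are approximate.

```python
# Figures taken from Table 2; the percentages are rounded in the source,
# so the ratios computed here are approximate.
table2 = {
    "TIME91":   {"based": 300, "based_pct": 0.016, "based_on": 181},
    "CORPUMED": {"based": 145, "based_pct": 0.047, "based_on": 138},
}


def based_on_pct(row):
    """Relative frequency of 'based on', as a percentage of all tokens."""
    return row["based_pct"] * row["based_on"] / row["based"]


ratio_based = table2["CORPUMED"]["based_pct"] / table2["TIME91"]["based_pct"]
ratio_based_on = based_on_pct(table2["CORPUMED"]) / based_on_pct(table2["TIME91"])

print(f"based:    {ratio_based:.1f} times as frequent in Corpumed")
print(f"based on: {ratio_based_on:.1f} times as frequent in Corpumed")
```

The two values come out close to three and a little under five, in line with the rounded ratios quoted above.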

It thus seems that including such collocations in a bilingual scientific lexicon (whether it is machine-readable or not), together with their frequency in a corpus
in the way that has been suggested above for rate, could be of use to the translator.
. Non-contiguous collocations
In example (3), misinterpretation resulted from failure to identify a collocation
(mixed results) which is frequently used in a non-scientific context.
(3) Results of trials of selective gut decontamination have been mixed. The
general consensus is that although some infections can be avoided, overall
mortality is not reproducibly influenced.

The following translation appeared in the French edition of the Journal of the
American Medical Association:
Les résultats des essais cliniques sur la décontamination digestive sélective ont
été analysés. Le consensus général est de dire que, si quelques infections sont
évitables, la mortalité globale n'est pas modifiée de façon reproductible.

In this instance, the translator wrongly assumed that the English sentence contained a passive form whose underlying active equivalent was <somebody has
mixed the results>, whereas in fact mixed is obviously used as an adjective.
Most probably the translator believed that mixing results was a way of referring to a common tool of medical statistics, the combining of results from trials
in which similar protocols were used (a better French translation would have
been: Les résultats des essais cliniques sur la décontamination digestive sélective
ont été mitigés).
It seems that human intervention is still required in order to solve such
syntactic ambiguities. I submitted example (4) to the CLAWS 4 grammatical
tagging program (Garside 1987) created at the Unit for Computer Research on
the English Language (UCREL) of Lancaster University (the program has been
used for the tagging of the British National Corpus). Even though such tagging
generally proves useful in the disambiguation process, the results listed below
show that here, too, mixed was interpreted as being part of a passive verb form:
(4) Results_NN2 of_IO trials_NN2 of_IO selective_JJ gut_NN1 decontamination_NN1 have_VH0 been_VBN mixed_VVN ._.
The_AT general_JJ consensus_NN1 is_VBZ that_CST although_CS
some_DD infections_NN2 can_VM be_VBI avoided_VVN ,_, overall_JJ
mortality_NN1 is_VBZ not_XX reproducibly_RR influenced_VVN ._.

In order to develop translation software that could solve or reduce such problems, it is worth analyzing which factors play a role in the correct interpretation
of such ambiguous forms. I found three possible factors, two of which rely on
purely lexical knowledge:
a. Previous knowledge of verbs that are synonymous with mixed and are
known to co-occur with results.
Searching our corpus for various occurrences of the word results revealed that
combine is generally used instead of mix to express the compilation of results
known as meta-analysis in medical literature. Example (5) (Reidenberg 1993)
actually provides a definition of this procedure:
(5) How best to combine the results of different clinical trials to produce a single valid conclusion has been an issue in clinical pharmacology and the
rest of medicine since literature reviews were first conducted. Although
formal statistical methodology for combining clinical trial results, or meta-analysis, is an improvement over earlier methods of less formal literature
review and interpretation, one must not let the rigor and formality of the
statistics give the analysis more credibility than the underlying data

Storing collocations such as combine results in a collocation database in the VERB-NOUN category could be a step towards avoiding misinterpretation
of strings like mix results, provided the user had access to the collocates of polysemous words on a semi-automatic basis. A search carried out in our Internal
Medicine 93 corpus showed that the collocation appeared in 46 articles in its
contiguous form. Conversely, mix is never found to co-occur in the active form
with results, and whenever the form mixed co-occurred with results, it was
always used as an adjective.
b. Storage of collocations such as mixed results.
The storage and automatic retrieval of such collocations obviously seems to be
a sine qua non for correct interpretation. In the previous instance, a computer
program with a well-documented data base was able to achieve what humans
achieve through their awareness of polysemy. However, the task of a human
translator is not quite completed when the correct meaning has been assigned
to the adjective mixed, as several French equivalents can be used depending on
what the English node word is. Table 3 summarizes the use of collocates for
mixed in our corpora and indicates those that are listed in two monolingual
dictionaries, Webster's Encyclopedic Unabridged Dictionary, 1989 (WEUD) and
The American Heritage Dictionary, 1998 (AHD). The suggested translation equivalents are taken from the Robert and Collins Senior Dictionary (1995),
except those for signals, reviews and messages, which are my own.
Table 3. Frequency of collocations with the word mixed in the two Time corpora and
their normalized frequency (number of occurrences per 1 million words).

Collocation   Freq. in 2-million-word subset   Freq. in 10-million-word corpus   Suggested translation       WEUD   AHD
              Abs.    Norm.                    Abs.    Norm.
              11      5.5                      26      2.6                       signaux, messages
                                               25      2.5                       race mixte
                                               21      2.1                       avantage incertain
                                               21      2.1                       sentiments contraires,
                                               21      2.1                       avis partagés
                                               19      1.9                       résultats mitigés, bilan
                                               16      1.6                       signaux, messages
                                               10      1                         économie mixte
                                               10      1


A brief comparison of the figures that were obtained in the larger corpus and its
subset (10 million vs. 2 million words) demonstrates that it is necessary to use a
large corpus when searching for co-occurrence data that concern infrequently
used lexical items, as only two out of ten of the listed collocations occurred
more than twice in the smaller corpus, a threshold under which statistical significance may be considered doubtful.
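
The normalization used in Table 3 and the frequency threshold mentioned above can be written out as follows. This is a minimal sketch: the function names are hypothetical, and the corpus sizes are the approximate figures (2 million and 10 million words) given in the text.

```python
SUBSET_SIZE = 2_000_000    # approximate size of the smaller corpus
CORPUS_SIZE = 10_000_000   # approximate size of the larger corpus


def per_million(count, corpus_size):
    """Normalized frequency: occurrences per 1 million words."""
    return count * 1_000_000 / corpus_size


def above_threshold(count, minimum=3):
    """True if a collocation occurs more than twice (the threshold discussed above)."""
    return count >= minimum


print(per_million(11, SUBSET_SIZE))   # first row of Table 3, smaller corpus
print(per_million(26, CORPUS_SIZE))   # first row of Table 3, larger corpus
print(above_threshold(2), above_threshold(11))
```

Normalizing in this way makes counts from corpora of different sizes comparable, but it does not make a raw count of one or two any more reliable, which is why the threshold is applied to the absolute figure.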
c. Identification of collocations that occur in a non-contiguous form.
Most programs that automatically retrieve collocations from computer corpora isolate recurring multi-word strings (as is the case with Collgen, the collocation generator that comes with the TACT software) or provide concordances
for two words that have been selected by the user according to certain contiguousness parameters. Generally, the more intervening words there are between
the node word and its collocate, the less likely it is to be identified as a statistically significant instance of co-occurrence. Needless to say, a large distance between the two components of a collocation is also an obstacle to human understanding. When I asked French students of English to translate example
(3), two thirds of those who misunderstood the sentence knew of the collocation, and claimed that they would have had no trouble understanding it in a
shorter sentence such as Results have been mixed.
Automatic identification of collocations would no doubt be made easier if
their components were stored together with the various grammatical forms in
which they co-occur, if possible in descending order of probability of occurrence. I searched for such differences in the TIME91 corpus and in a subset of
our large medical corpus (Internal Medicine 93). The figures for the occurrences of mixed are shown in Table 4.
Table 4. Grammatical status of mixed in the TIME91 and Internal Medicine 93 corpora

Grammatical status for mixed     TIME91   % of all   I.M. 93   % of all
verb forms (active)
verb forms (passive)
adjectival uses (predicative)
adjectival uses (attributive)

Such figures could be used as a basis for prioritization in algorithms designed to solve such translation problems. In the case of mixed, the following steps
could be followed: if grammatical analysis suggests that a passive form was
used, and if the preposition with does not follow the occurrence of mixed, then
the previous context could be scanned for occurrences of the word's most frequently used collocates (results, response, attitudes, reactions, feelings in medical
literature). If one of them was found, then the appropriate translation equivalent could be provided. If not, the sentence would be translated with the equivalent passive structure in the target language.
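
The steps just described can be sketched in a few lines of code. What follows is only an illustration under stated assumptions: the function names and the three return labels are hypothetical, the input is assumed to be CLAWS-style word_TAG text as in example (4), and a passive is recognized simply as a VVN tag preceded by a form of be.

```python
# Collocates of "mixed" listed in the text for medical literature.
MEDICAL_COLLOCATES = {"results", "response", "attitudes", "reactions", "feelings"}


def parse_tagged(text):
    """Split CLAWS-style 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def interpret_mixed(tagged, collocates=MEDICAL_COLLOCATES):
    """Return (reading, collocate) for the first occurrence of 'mixed', or None."""
    for i, (word, tag) in enumerate(tagged):
        if word.lower() != "mixed":
            continue
        # Step 1: does the tagging suggest a passive (be + past participle)?
        passive = tag == "VVN" and i > 0 and tagged[i - 1][1].startswith("VB")
        # Step 2: is "mixed" followed by the preposition "with"?
        followed_by_with = i + 1 < len(tagged) and tagged[i + 1][0].lower() == "with"
        if not passive or followed_by_with:
            return ("as tagged", None)
        # Step 3: scan the previous context for a known collocate.
        for previous_word, _ in reversed(tagged[:i]):
            if previous_word.lower() in collocates:
                return ("adjective", previous_word.lower())
        return ("passive", None)
    return None


example_4 = ("Results_NN2 of_IO trials_NN2 of_IO selective_JJ gut_NN1 "
             "decontamination_NN1 have_VH0 been_VBN mixed_VVN ._.")
print(interpret_mixed(parse_tagged(example_4)))
```

Applied to the first sentence of example (4), the sketch overrides the passive reading because results occurs in the preceding context. Choosing the French equivalent would then be a matter of looking up the collocation in the bilingual data.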
However, one issue has yet to be addressed. Of the above-listed methods,
which is easiest to formalize and adapt to automatic corpus processing with a
view to generating collocations to be used in an automatic contextual retrieval
If we consider the case for our first possibility, i.e. the suppression of an
erroneous interpretation through the previous storage of a collocate with a
higher probability of occurring with a given base, we can see that this method is difficult to apply in the case of automatic translation. The subtle difference between mix and combine, if it could be expressed with the help of semantically
distinctive features, would need to be weighed in relationship to the type of
language that is used. In our particular case, the use of mixing results (as
opposed to combining results in example 5) may sound strange in scientific
prose, but is acceptable in everyday speech. The same is true of the French equivalents (although the noun mélange is much more common than the verb
mélanger in medical literature, the verb does occur in the specialized vocabulary of medicine). Actually, trying to reproduce such cognitive processes automatically would most probably prove too costly in terms of computer memory,
since it would require:
a. storage of all the collocations that match a given grammatical pattern (in
this case, VERB-OBJECT NOUN PHRASE) for all the words of the text
to be translated.
b. elimination of some possible choices (such as MIX-RESULTS) based on the
existence of synonymous collocates (such as combine) that are more frequently used; establishing a data base of potential synonyms would in itself
require preliminary work, especially in terms of designing its structure.
The second possibility seems to be better suited to automatic data processing,
since generating collocations from computer-encoded texts is a relatively easy
task. However, the amount of noise in relation to the signal needs to be
emphasized. After processing a French medical corpus with the TACT collocation generator, I found that eliminating function words left us with only 6% of
all the forms that had initially been retrieved by the program. Most function
words are short, but since word length cannot be the sole criterion for paring
down the lists of collocations that are generated by the program, it is necessary
to eliminate certain lexical combinations. Table 5 provides an example of the
collocations that were obtained from a 200 000-word gastroenterology corpus
after preparatory work of this kind.
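
The filtering step can be illustrated with a simple extraction of contiguous word pairs. This is a minimal sketch and does not reproduce the TACT collocation generator: the stoplist is a tiny hypothetical sample, the token sequence is invented for the example, and only two-word combinations are counted.

```python
from collections import Counter

# Tiny illustrative stoplist of French function words (an assumption);
# a usable list would be far longer.
FUNCTION_WORDS = {"de", "la", "le", "les", "des", "du", "un", "une", "et", "en", "à"}


def contiguous_collocations(tokens, min_freq=2, stoplist=FUNCTION_WORDS):
    """Count adjacent word pairs, discarding rare pairs and pairs with a function word."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count
        for pair, count in pairs.items()
        if count >= min_freq and not (set(pair) & stoplist)
    }


tokens = ("augmentation significative de la pression et "
          "augmentation significative du débit").split()
print(contiguous_collocations(tokens))
```

Even in this toy sequence, most of the adjacent pairs contain a function word and are discarded, which gives some idea of the proportion of noise reported above.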
The homogeneity in terms of grammatical categories is particularly striking, as 18 word combinations out of 20 are of the NOUN-ADJECTIVE type.
The high frequency of this structure is rather typical of medical literature (and
perhaps of scientific writing in general). A further look at the table reveals a
clear distinction between compounds that belong to the specialized lexicon
and collocations proper, with bases (aspect, augmentation) that are frequently
used in non-scientific texts.
As to the identification of non-contiguous collocations, the problem is twofold. First, an automatic analysis of the kind that was summarily described
above would considerably slow down any automatic translation program
because of the sheer number of such collocations. Second, their retrieval from
computer-encoded texts would require the setting of a maximum span value
for the search (which is possible in most concordance programs) and would
have the same effect (in example (3), mixed and results are 8 words apart). In
order to fine-tune any search module that uses this span function, it would be
necessary to integrate data that list frequencies of occurrence in the non-contiguous form for each collocation (to take an example drawn from Table 5, one
can easily predict that such statistics would reveal that aspects observés will be
found in a non-contiguous form more frequently than atteinte vasculaire), so
as to use such functions only where necessary.
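
A span-limited search of the kind offered by concordance programs can be sketched as follows. The function name and the max_span parameter are hypothetical, and the span is counted here as the number of words intervening between the node and its collocate.

```python
def cooccurrences(tokens, node, collocate, max_span=10):
    """Return (node_index, collocate_index, gap) for each pair within max_span words."""
    node_positions = [i for i, t in enumerate(tokens) if t.lower() == node]
    collocate_positions = [i for i, t in enumerate(tokens) if t.lower() == collocate]
    hits = []
    for n in node_positions:
        for c in collocate_positions:
            gap = abs(n - c) - 1  # number of intervening words
            if 0 <= gap <= max_span:
                hits.append((n, c, gap))
    return hits


sentence = "Results of trials of selective gut decontamination have been mixed".split()
print(cooccurrences(sentence, "mixed", "results", max_span=8))
print(cooccurrences(sentence, "mixed", "results", max_span=5))
```

With a span of 8 the pair in example (3) is retrieved, whereas a narrower span misses it. Storing the typical gap for each collocation would allow the wider, slower search to be reserved for those pairs that need it.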
Table 5. Collocations with a frequency of 4 starting with the letter A in the French
gastroenterology corpus.

. Conclusion
The results obtained in the attempt to solve the various translation problems
that have been discussed here seem to demonstrate the benefits that can be
derived from automatic processing of machine-readable corpora, but they also
show the limits of this approach. Human intervention remains necessary at a
number of stages of the data gathering and formatting process. In the examples
I have chosen to examine, ambiguity is always a consequence of the polysemous nature of a given lexical item, and word-sense disambiguation cannot be
achieved without identifying and analyzing either a syntactic structure that is
itself ambiguous or the collocate for that lexical item in a given context.
We seem, therefore, to be confronted with a double task. First, what is
needed is a comprehensive description of the syntactical ambiguities that occur
in a given language and a corpus that lists examples of such structures, so that
lexical co-occurrence phenomena can be examined and studied for disambiguation. Second, and such a task seems achievable in the case of scientific literature, it is necessary to establish a list of the most frequently used polysemous
words in order to establish a certain number of translation rules that are based
on statistically confirmed lexical co-occurrence data.

References

Ahlswede, T. & Even, M. 1988. Generating a relational lexicon from a machine-readable
dictionary. International Journal of Lexicography. Special issue edited by F. Frawley &
R. Smith.
Benson, M. 1985. A Combinatory Dictionary of English. Dictionaries 7: 189–200.
Brill, E. & Marcus, M. 1993. Tagging an unfamiliar text with minimal human supervision.
ARPA Technical Report.
Church, K., Gale, W., Hanks, P., Hindle, D. & Moon, R. 1994. Lexical Substitutability. In
Computational Approaches to the Lexicon, Atkins and Zampolli (eds), 153–177. Oxford:
Oxford University Press.
Clear, J. 1996. Technical implications of multilingual corpus lexicography. International
Journal of Lexicography 9: 265–276.
Cowie, A. P. 1986. Strategies for dealing with idioms, collocations and routine formulae in
dictionaries. In Workshop on Automating the Lexicon 15–23 May 1986. Grosseto, Italy.
Dini, L., Di Tomaso, V. & Segond, F. 1998. Word sense disambiguation with functional relations. Language Resource and Evaluation Conference, Granada, May 98.
Fontenelle, T. 1994. Towards the construction of a collocational database for translation
students. META 39: 47–56.
Fontenelle, T. 1997. Turning a Bilingual Dictionary into a Lexical-Semantic Database.
Tübingen: Max Niemeyer.
Garside, R. 1987. The CLAWS word-tagging system. In The Computational Analysis of
English, R. Garside, G. Leech and G. Sampson (eds), 30–41. London: Longman.
Heid, U. 1992. Décrire les collocations: deux approches lexicographiques et leur application dans un outil informatisé. Terminologie et Traduction, 2–3.
Heid, U. 1994. On ways words work together: Topics in lexical combinatorics. In Euralex'94:
Proceedings of the Sixth Euralex International Congress, Martin et al. (eds), 226–257.
Heid, U. 1994. Relating lexicon and corpus: Computational support for corpus-based lexicon building in DELIS. In Euralex'94: Proceedings of the Sixth Euralex International
Congress, Martin et al. (eds), 459–471. Amsterdam.
Knowles, F. E. 1986. Computational lexicography and lexical databases. In Proceedings of
the 13th International ALLC Conference April, 1986. Norwich: Association for Literary
and Linguistic Computing.
Knowles, F. & Roe, P. 1994. LSP and the notion of distribution as a basis for lexicography. In
Euralex'94: Proceedings of the Sixth Euralex International Congress, Martin et al. (eds),
306–319. Amsterdam.
Lakoff, G. 1993. The syntax of metaphorical semantic roles. In Semantics and the Lexicon, J.
Pustejovsky (ed.), 27–36. Dordrecht: Kluwer Academic.
Mel'čuk, I. & Wanner, L. 1994. Towards an efficient representation of restricted lexical
cooccurrence. In Euralex'94: Proceedings of the Sixth Euralex International Congress,
Martin et al. (eds), 325–338. Amsterdam.
Michiels, A. 1996. An experiment in translation selection and word sense discrimination
using the metalinguistic apparatus of two computerized dictionaries. DEFI Technical
Report, 24. University of Liège. [Available at
Miller, G. A. (ed). 1990. WordNet: An on-line lexical database. International Journal of
Lexicography 3.
Neff, M. & McCord, M. 1990. Acquiring lexical data from machine-readable dictionary
resources for machine translation. In Proceedings of the 3rd International Conference on
Theoretical and Methodological Issues in Machine Translation of Natural Language,
University of Texas at Austin, 85–90.
Reidenberg, M. 1993. Clinical Pharmacology. Journal of the American Medical Association
270: 192.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Smadja, F. A. & McKeown, K. R. 1990. Automatically extracting and representing collocations for language generation. In 28th Annual Meeting of the Association for
Computational Linguistics, 252–259. New York.
Sutcliffe, R. & Slater, B. 1995. Disambiguation by association as a practical method:
Experiments and findings. Journal of Quantitative Linguistics 2: 43–52.
Teubert, W. 1996. Comparable or Parallel Corpora? International Journal of Lexicography
9: 238–264.
Thoiron, P. & Béjoint, H. 1989. Pour un index cumulatif et évolutif de co-occurents. Meta
Vossen, P. 1996. Right or wrong: combining lexical resources in the EuroWordNet project.
In Euralex'96, 715–728. University of Göteborg.
Zgusta, L. 1967. Multiword lexical units. Word 23: 578–587.

Computer software and electronic corpora

Bradley, J. 1989. TACT. Copyright (c) 1989 John Bradley, University of Toronto.
JAMA & ARCHIVES JOURNALS, American Medical Association, 1994 Complete
Collection. Ovid Technologies.
Time Almanac 1993. Compact Publishing.

A computer tool for cross-linguistic research
Patrick Corness


Substantial monolingual text corpora exist in many languages and the number
and size of these increase daily, providing a vast resource for empirical linguistic research. Individual researchers can readily build their own corpora suited
to particular needs. The potential for data extraction from corpora is significantly enhanced by their annotation, based on internationally agreed markup
conventions and standards.
With the advent of parallel concordancing software for PCs (programs of this kind, such as ParaConc, were first developed for the Macintosh platform), the
scope for contrastive linguistics research based on translation corpora has
expanded considerably. The growth of translation corpora for contrastive
studies is dependent on the additional human and technical resources needed
to align the original text with its translation as well as on the availability and
suitability of the translations. Many researchers wish to explore texts of their
own choice, involving them in the processes of deriving equivalent electronic
text in at least two languages and, as a minimum, aligning the texts at paragraph or sentence level.
In view of the advances in automatic annotation techniques and of the
potentially great additional value given to a corpus by annotation, one might
conclude that serious involvement in the field of contrastive analysis requires
familiarity with the relevant standards, methodology and state-of-the-art automatic tagging software. However, the actual scope and power of automated text
annotation techniques, impressive as they are, are still under development. The capability of even the most sophisticated techniques currently available does not cater for all needs that contrastive linguists may have. At the same
time, it should be recognised that parallel corpora with only minimal annotation, or even no annotation at all, are also an extremely valuable resource. For
many concordancing purposes, non-annotated texts are perfectly adequate,
and this applies also to parallel concordancing. It is likely that contrastive linguistics will continue for some time to grow not only through advances in
automatic annotation but also through contributions dependent on manual
manipulation of the data, for example by text editing software such as Microsoft
Word, in conjunction with a parallel concordancing program.
It is thanks to automated techniques, permitting the treatment of ever larger corpora, that contrastive analysis is presently coming into its own as an
empirical science. It is logical, therefore, to pursue the enhancement of tools
for automatic corpus analysis in order to enable researchers to work with
expanding corpora and to concentrate their effort on those tasks which have to
be performed manually. The role of manual procedures, including the checking of the results of automated analysis, remains significant. Some of the cognitive limitations that apply to machine translation (MT), as opposed to computer aided translation (CAT), are found when a high level of automation is
attempted in corpus analysis. Contrastive analysis is more akin to CAT than to
MT in its methodology, though the findings may contribute to both. As with
translation itself, much can be achieved in the contrastive study of languages
with the assistance of straightforward tools. A great deal is possible with
aligned, though unannotated, corpora.
Multiconcord (Woolls 1998) was designed primarily as a tool to assist in the
teaching and learning of translation. It is a parallel concordancing program for
Windows which works with aligned bilingual corpora, enabling a search to be
made for all examples of a given expression and, simultaneously, for the respective translations which occur in the target text. The resulting parallel concordance can be viewed on the screen, sorted and saved as a text file (see Figs. 6–9).
The pedagogical value of raising awareness of contrastive features, even on a
small scale, is self-evident. If it can be demonstrated that a small, unannotated
corpus can reveal important contrastive features, providing insights that go
beyond what a substantial bilingual desk dictionary can provide, students
should come to appreciate better the significance of context. They should then
see bilingual dictionaries in their correct perspective and perceive the value of
exploring translation strategies further with the help of better translation corpora and corpus analysis tools.


The semantic and pragmatic significance of textual context could be revealed to students through Tim Johns's Data Driven Learning (DDL)
approach to monolingual corpora, via MicroConcord (Johns & Scott 1993).
Now Multiconcord takes DDL a step further, exposing learners to authentic
translation corpora for similar purposes, facilitating empirical problem solving in a bilingual, contrastive dimension.
It is the aim of the present paper to point out how Multiconcord, as one tool
on the translator's workbench, can work effectively in conjunction with
Minmark, its associated markup program, and Microsoft Word for the alignment and editing of results. Multiconcord returns up to 250 hits for a given
search. Considering the permutations that are possible with the specification
of context words, this is adequate for most learning purposes and contrastive
data can be readily discovered that suggest a wider range of translation equivalents than even major printed bilingual dictionaries typically include.
I will outline procedures and techniques for preparing a bilingual translation corpus for teaching and learning purposes and as a springboard for
research. This is based on a small English/French corpus of approximately
3,000 words in each language. Then I will show some sample results of a
Multiconcord search of George Orwell's novel Nineteen Eighty-Four and its
translation into Czech and Lithuanian.

. Techniques
The first step is to open in Word the source and target texts constituting the
required bilingual corpus. In the File menu, Page Setup is selected and under
the Margins tab the left and right margins are adjusted to permit viewing both
files in parallel windows on the screen. For the text which is to appear on the
left of the screen, the right margin is set to approximately 11 cm and for the
text which is to appear on the right, a correspondingly wide left margin is set.
By selecting Arrange All in the Window menu and then dragging the edges of
the panes, the parallel arrangement seen in Figure 1 is achieved. The
English/French translation corpus sampled here is taken from TransIt-TIGER
English-French (Corness et al. 1997).
The texts can now be aligned. Multiconcord requires texts aligned at paragraph level; the program uses an algorithm to identify the sentences containing
the equivalents of the source language search words in the target text for display
in the concordance. The first step in the alignment process is to Select All text in the Edit menu and click on the Numbering icon so that Word numbers the start
of each paragraph. However, a caution is needed here. If there are numerals at
the beginning of paragraphs in the text itself, Word replaces these numbers by
bullets (paragraph markers in the form of small circles, squares, diamonds,
etc.) without warning. In this case, it is vital to click on the
Bullets icon first, selecting the option not to replace existing numbers by bullets. Clicking the numbering icon then replaces bullets by numbering without
affecting the real numbers (see Figure 2).

Figure 1. Texts in parallel windows

Figure 2. Texts with automatic paragraph numbering

Editing either text to add or remove paragraph breaks will result in automatic changes in the numbering, enabling alignment to be checked and adjusted.
Once the alignment has been completed, the numbering is no longer needed and
may be removed. The aligned texts should now be saved as plain text files.
The next stage is to mark up the texts so that they can be recognised and
searched by Multiconcord. For this the accompanying Minmark program is
used. A text is selected via the Minmark File menu (see Figure 3).

Figure 3. Selecting a text file for markup

By providing a valid filename for the marked-up counterparts of the selected texts at the prompt, pairs of files ready for Multiconcord are generated (see
Figure 4). At this stage, it is essential to create pairs of files with names which
are identical except for the extensions. English files must have the extension
.en, French files .fr, German files .de, Spanish files .es etc.
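
The naming convention can be checked before a search is attempted. The following is a minimal sketch and is not part of Minmark or Multiconcord: the function name is hypothetical, and the file names are invented for the demonstration, which runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def file_pairs(directory, source_ext=".en", target_ext=".fr"):
    """Return (source, target) paths whose names differ only in the extension."""
    pairs = []
    for source in sorted(Path(directory).glob(f"*{source_ext}")):
        target = source.with_suffix(target_ext)
        if target.exists():
            pairs.append((source, target))
    return pairs


# Demonstration with invented file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample.en", "sample.fr", "orphan.en"):
        (Path(tmp) / name).write_text("", encoding="utf-8")
    found = [(s.name, t.name) for s, t in file_pairs(tmp)]

print(found)
```

A source file without a counterpart in the target language (orphan.en in the demonstration) is simply left out of the list of pairs.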
Figure 4. Saving marked-up files

Minmark inserts the following markers needed by Multiconcord. The
beginning and end of the text are marked as <body> and </body> respectively.
Paragraph breaks are marked by <p> and sentence breaks by <s>. The marked-up text can be viewed in Word and edited further if it is required, but any alterations to the paragraph structure at this stage entail the manual addition or
removal of <p> markers as appropriate. A quick count of <p> markers in each
text can be done in Word via the Replace facility in the Edit menu. Replacing
<p> by <p> will not change the text, but Word will report the number of
replacements made each time.
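
The same check can be made outside Word. The sketch below is only an illustration: the function names are hypothetical, and the two sample strings are invented, since the exact layout of Minmark output is assumed here rather than reproduced.

```python
def paragraph_count(text):
    """Number of <p> markers in a marked-up text."""
    return text.count("<p>")


def same_paragraph_count(source_text, target_text):
    """True if both texts contain the same number of paragraph markers."""
    return paragraph_count(source_text) == paragraph_count(target_text)


# Invented samples: the translation merges two sentences, but the
# paragraph structure is unchanged.
english = "<body><p><s>One. <s>Two.<p><s>Three.</body>"
french = "<body><p><s>Un et deux.<p><s>Trois.</body>"

print(paragraph_count(english), paragraph_count(french))
print(same_paragraph_count(english, french))
```

Equal counts do not prove that the paragraphs correspond, but unequal counts show at once that the alignment needs to be corrected.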
Where the translation merges two sentences of the original into one, or
splits one sentence of the original into two, Multiconcord detects the difference
in the number of sentences within the corresponding paragraphs and the algorithm is usually able to locate the correct place in the target text. The user has the
option to select paragraph view to compare whole paragraphs when necessary.
The first step when running Multiconcord (see Figure 5 below) is to select
the Search language and Target language. Any filenames in the selected directory valid for the selected language pair will then appear in the Available Files list.
Any number of the available files may be selected for searching. The search
word or words are then typed in the 'Search for' box and added, followed by a
click on the Start Search button.
A report on each file searched, giving the number of hits, appears after a
few seconds, then the search results can be viewed and sorted alphabetically on
the search word itself or on the first, second or third word to left or right of the
search word (see Figure 6 below).
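Sorting on a context word can be sketched as follows. This is a minimal illustration in Python and not the Multiconcord implementation; the function names are assumptions, and the three hits are English sentences from the novel quoted later in this chapter.

```python
import re


def tokenize(sentence):
    """Split a sentence into lower-case word tokens."""
    return re.findall(r"\w+", sentence.lower())


def context_word(hit, search_word, offset):
    """Return the word at `offset` from the search word.

    offset 0 is the search word itself, -1 to -3 are words to the left,
    +1 to +3 are words to the right; an empty string is returned if the
    position falls outside the sentence.
    """
    tokens = tokenize(hit)
    if search_word not in tokens:
        return ""
    position = tokens.index(search_word) + offset
    return tokens[position] if 0 <= position < len(tokens) else ""


def sort_hits(hits, search_word, offset=0):
    """Sort concordance lines alphabetically on the chosen context word."""
    return sorted(hits, key=lambda hit: context_word(hit, search_word, offset))


hits = [
    "He picked up his pen again and wrote:",
    "She picked the stove up and shook it.",
    "He picked up his glass and drained it at a gulp.",
]

first_right = sort_hits(hits, "picked", offset=1)   # sort on first word to the right
first_left = sort_hits(hits, "picked", offset=-1)   # sort on first word to the left
print(first_right[0])
```

Sorting on the first word to the right brings the separated variant (picked the stove up) together, apart from the lines in which the particle follows the verb directly.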


Figure 5. Search options in Multiconcord

Figure 6. Multiconcord search results


All examples are initially marked C1 (category 1). They can be changed to C2,
C3 or C4, representing categories determined by the user, and the results re-sorted. Four categories may prove insufficient, however, and more sophisticated categorisation and sorting can be done subsequently in Word (see under
Figure 7).

Figure 7. Saving concordance results

Figure 8. Initial results file in Word (unformatted)


Figure 9. Results with some initial formatting

The results as currently sorted are saved in the Test screen, which also permits
the creation of cloze tests based on the search words (see Figure 7 above).
The results may be saved in a file with search and target language interleaved.
For sorting of the data, however, tabular format is more practical. If the saved file
is opened in Word, it does not at first appear in a very usable form (see Figure 8
above), but if the text is converted to a table in Word, selecting Tabs as the separator, a parallel results file is created, which can be easily edited and sorted.
The first step might be to change to bold typeface (using Edit/Replace) all
occurrences of the search word in the search text and the respective translations
in the target text (see Figure 9 above). A third column can be added, categorisation descriptions or codes entered here and the file re-sorted on this column.
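The tabular workflow described above can also be sketched in Python. The sketch assumes that each saved line holds a search-language hit and its translation separated by a tab, which is an assumption about the file layout made for illustration; the category labels and the function names are likewise hypothetical, and the three sentence pairs are taken from the examples quoted later in this chapter.

```python
import csv
import io

# Assumed layout: one hit per line, search text and target text separated by a tab
saved_results = (
    "He picked up his glass and drained it at a gulp.\tZdvihl sklenku a naráz ji vypil.\n"
    "He picked up his pen again and wrote:\tOpět uchopil pero a psal:\n"
    "She picked the stove up and shook it.\tZvedla vařič a zatřásla jím.\n"
)


def load_rows(tab_separated_text):
    """Convert tab-separated text into a list of [search, target] rows."""
    reader = csv.reader(io.StringIO(tab_separated_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def categorise(row):
    """Assign a hypothetical category code based on the verb in the translation."""
    target = row[1].lower()
    if "uchopil" in target:
        return "grasp"
    if "zdvihl" in target or "zvedl" in target:
        return "raise"
    return "other"


def add_category_column(rows):
    """Append a third column holding the category code."""
    return [row + [categorise(row)] for row in rows]


def sort_on_category(rows):
    """Re-sort the table on the third (category) column."""
    return sorted(rows, key=lambda row: row[2])


table = sort_on_category(add_category_column(load_rows(saved_results)))
for search_text, target_text, category in table:
    print(category, "|", search_text, "|", target_text)
```

In practice the category codes would be entered by hand after inspecting each pair; the automatic rule above merely stands in for that manual step.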

Sample parallel concordancing results

The purpose of the limited experiment described here is to show an example of
how Multiconcord can be used to explore a corpus consisting of a single novel in
English (George Orwell's Nineteen Eighty-Four) and its translations into Czech
and Lithuanian.
Bearing in mind that English phrasal verbs typically have a broad semantic
range, it was decided to check the variety of translations of a selected phrasal
verb suggested by a major English-Czech desk dictionary and an English-Czech
dictionary of phrasal verbs and to compare these with the translations
found in Nineteen Eighty-Four. The phrasal verb pick up was chosen, more or
less at random.
The four-volume English-Czech dictionary by Hais and Hodek
(1991-1993) gives 58 different translations of pick up, as follows:
dát dohromady; dodat si; dohánět; dostat; chopit se; chytat/chytit/chytnout;
chytat za slovo; chytit se; koupit; nabalit si; nabrat; najít; naložit; narazit na;
nasbírat; navázat známost; objevit; podat uvazovací lano na (přístavní bóji);
pochytit; posbírat; postavit se na nohy; probudit k životu; přibrat; přidat; přijet si
pro; rozpálit; sbalit; sbalit své věci; sbírat (se)/sebrat (se); sehnat; seznámit se
náhodou; schrastit; splašit; stlouci/stloukat; uklidit; ukořistit; vybrat si
(spoluhráče); vytáhnout; vzchopit se; vzít do ruky; vzít s sebou; vzmáhat se;
zabrat; zadržet; zachránit; zachytit; zajistit; zajmout; zaměřit se na; zastavit se
pro; zatknout; zesílit; zlepšovat se; znovu najít; znovu sledovat; zotavit se; zrychlit;
zvedat (se)/zvednout (se)

The English-Czech dictionary of phrasal verbs by Lukáš Vodička (1992) offers
22 additional translations, with some references to usage but without contextualised examples:
být připravený zaplatit; dozvědět se; kárat; mít se k zaplacení; nahodit; nalodit;
napojit se; napomínat; naskočit; navázat (na téma); opravovat; oživit se; přijít k;
rozjet se; rozkopat; spřátelit se; svézt; vydělávat si; vyzvednout (si); vzít si (e.g.
taxi); získat; znovu se chytit

In George Orwell's Nineteen Eighty-Four there are 25 occurrences of pick up in
all, and 12 different translations of it were found in the Czech version (see
Table 1). Eleven cases were identified where there was a close semantic match
between one of the 58 Czech equivalents of pick up suggested by Hais and
Hodek (H&H) and occurrences in the Orwell novel. These are (an approximate general equivalent in English is shown for each expression) vzít do ruky
'take hold of'; zachycovat/zachytit 'catch, detect' (note 1); zastavit se pro 'call for, collect'; zvednout 'raise':
(1) As Winston wandered towards the table his eye was caught by a round,
smooth thing that gleamed softly in the lamplight, and he picked it up.
Winston přikročil ke stolu a jeho pozornost upoutala okrouhlá hladká
věcička, která se jemně leskla ve světle lampy; vzal ji do ruky.
(2) He picked up his pen half-heartedly, wondering whether he could find
something more to write in the diary.
Lhostejně vzal do ruky pero a uvažoval, zda přijde ještě na něco, co by zapsal do deníku.
cf. H&H:
he always picked up your personal stuff and looked at it
vzal vždycky do ruky váš osobní materiál a prohlédl si jej
(3) In a place like this the danger that there would be a hidden microphone was
very small, and even if there was a microphone it would only pick up sounds.
Na takovém místě hrozilo minimální nebezpečí, že by tam byl skrytý
mikrofon, a i kdyby tam byl, zachycoval by pouze zvuky.
(4) Any sound that Winston made, above the level of a very low whisper, would
be picked up by it, moreover, so long as he remained within the field of vision
which the metal plaque commanded, he could be seen as well as heard.
Každý zvuk, který Winston vydal a jenž byl hlasitější než velmi tiché
šeptání, obrazovka zachycovala; a co víc, pokud zůstával v zorném poli
kovové desky, bylo ho vidět a slyšet.
(5) To keep your face expressionless was not difficult, and even your breathing
could be controlled, with an effort: but you could not control the beating
of your heart, and the telescreen was quite delicate enough to pick it up.
Nebylo těžké zachovat bezvýraznou tvář a s jistým úsilím mohl člověk
kontrolovat i dech; nikoli však bušení srdce, obrazovka byla natolik citlivá,
že je zachycovala.
(6) He and Julia had spoken only in low whispers, and it would not pick up
what they had said, but it would pick up the thrush.
Hovořili s Julií sice jen šeptem a mikrofon by nezachytil, co říkali, ale
zachytil by drozda.
(7) He and Julia had spoken only in low whispers, and it would not pick up
what they had said, but it would pick up the thrush.
Hovořili s Julií sice jen šeptem a mikrofon by nezachytil, co říkali, ale
zachytil by drozda.
(8) There were no telescreens, of course, but there was always the danger of
concealed microphones by which your voice might be picked up and recognized; besides, it was not easy to make a journey by yourself without
attracting attention.
Obrazovky tu samozřejmě nebyly, ale mohly tu být skryté mikrofony, kterými
mohli zachytit a dešifrovat váš hlas; kromě toho nebylo snadné
vydat se sám na cestu, aniž to vyvolalo pozornost.
cf. H&H:
zachytit: enemy planes picked up by our radar installations
(9) Perhaps you could pick it up at my flat at some time that suited you?
Možná byste se pro něj mohl někdy zastavit u mě doma, až se vám to bude
hodit.
cf. H&H:
I'll pick you up at your house
zastavím se pro tebe doma; přijedu si pro tebe domů
(10) O'Brien picked up the cage, and, as he did so, pressed something in it.
O'Brien zvedl klec a něco na ní stiskl.
(11) She picked the stove up and shook it.
Zvedla vařič a zatřásla jím.
cf. H&H:
He bent down to pick up his hat

Two cases were found where a potential translation equivalent given in H&H
was adopted in the Orwell translation but where there was not a close semantic
match, viz. nabrat 'take up'; posbírat 'gather up':
(12) With the tip of his finger he picked up an identifiable grain of whitish dust
and deposited it on the corner of the cover, where it was bound to be shaken off if the book was moved.
Nabral špičkou prstu drobné zrnko bělavého prachu a položil je na roh
desek; kdyby s deníkem někdo pohnul, musel by je setřást.
cf. H&H:
nabrat pletacím drátem [with a knitting needle];
where did you pick up with that queer fellow?;
pick up speed nabrat rychlost
(13) 'Pick up those pieces,' he said sharply.
'Posbírejte to,' řekl zostra.
cf. H&H: bits of information, souvenirs he had picked up all over the world

Finally, there were twelve examples of translations of pick up in Nineteen
Eighty-Four which were not found in H&H, using the verbs nadzvednout se
'rise'; popadnout 'seize, snatch'; uchopit 'grasp'; vzít 'take'; vzít si 'take with you';
zdvihnout 'raise'. Only one of these verbs (vzít si) is mentioned in Vodička:
(14) The girl picked herself up and pulled a bluebell out of her hair.
Dívka se nadzvedla a vytáhla si z vlasů modrý zvonek.
(15) The dark-haired girl behind Winston had begun crying out 'Swine! Swine!
Swine!' and suddenly she picked up a heavy Newspeak dictionary and
flung it at the screen.
Tmavovlasá dívka za Winstonem začala vykřikovat 'Svině!' a zničehonic
popadla těžký slovník newspeaku a mrštila jím do obrazovky.
(16) He picked up his pen again and wrote:
Opět uchopil pero a psal:
(17) He drank another mouthful of gin, picked up the white knight and made a
tentative move.
Vlil do sebe další doušek ginu, uchopil bílého jezdce a zkusmo táhl.
(18) He turned back to the chessboard and picked up the white knight again.
Vrátil se k šachovnici a znovu uchopil bílého jezdce.
(19) He picked up the children's history book and looked at the portrait of Big
Brother which formed its frontispiece.
Vzal dětskou učebnici dějepisu a zadíval se na portrét Velkého bratra na
titulní straně.
(20) Someone had picked up the glass paperweight from the table and smashed
it to pieces on the hearth-stone.
Někdo vzal ze stolu skleněné těžítko a rozbil ho na kusy o krb.
(21) O'Brien picked up the cage and brought it across to the nearer table.
O'Brien vzal klec a přenesl ji k bližšímu stolu.
(22) He picked up the white knight and moved it across the board.
Vzal bílého jezdce a táhl jím po šachovnici.
(23) 'Let's pick up a gin on the way.'
'Cestou si vezmeme gin.'
cf. Vodička:
vzít si (e.g. taxi)
(24) He picked up his glass and drained it at a gulp.
Zdvihl sklenku a naráz ji vypil.
(25) He saw Julia pick up her glass and sniff at it with frank curiosity.
Viděl, jak Julie zdvihla sklenku a přivoněla k ní s upřímnou zvědavostí.

To summarise, for twelve out of the twenty-five occurrences of a randomly selected lexical unit in the novel six different plausible translations are found which do
not occur in the authoritative bilingual dictionary used as a point of reference
(though one of them is mentioned in the dictionary of phrasal verbs). Additionally,
on the corpus evidence, two translations given in the dictionary are found to have
a broader range of semantic equivalents than the dictionary mentions.
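The tallies in this summary can be re-derived from the counts given in Table 1. The Python sketch below is an illustration only: the variable and function names are assumptions, and the dictionary set lists just those H&H equivalents that also occur in the Czech translation, not the full list of 58.

```python
from collections import Counter

# Occurrences of each Czech translation of "pick up", as given in Table 1
corpus_counts = Counter({
    "vzít do ruky": 2, "zachycovat": 3, "zachytit": 3, "zastavit se pro": 1,
    "zvednout": 2, "nabrat": 1, "posbírat": 1, "nadzvednout se": 1,
    "popadnout": 1, "uchopit": 3, "vzít": 4, "vzít si": 1, "zdvihnout": 2,
})

# H&H equivalents that are attested in the translation (subset of the 58 listed)
dictionary_equivalents = {
    "vzít do ruky", "zachytit", "zastavit se pro", "zvednout", "nabrat", "posbírat",
}

# zachycovat/zachytit are treated as aspects of the same verb
aspect_pairs = {"zachycovat": "zachytit"}


def in_dictionary(translation):
    """Check a translation against the dictionary set, normalising aspect pairs."""
    return aspect_pairs.get(translation, translation) in dictionary_equivalents


total_occurrences = sum(corpus_counts.values())
missing = {verb: n for verb, n in corpus_counts.items() if not in_dictionary(verb)}
missing_occurrences = sum(missing.values())

print(total_occurrences, len(missing), missing_occurrences)
```

The three figures correspond to the twenty-five occurrences, the six translations absent from the dictionary and the twelve occurrences they account for.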
Table 1. Summary of comparison between H&H and translations of pick up in Nineteen Eighty-Four

Eleven translations of pick up   Two translations of pick up   Twelve translations of pick up
in Orwell matching those         found in H&H but not          in Orwell not found in H&H
found in H&H                     matching Orwell
-------------------------------  ----------------------------  --------------------------------
vzít do ruky (2)                 nabrat (1)                    nadzvednout se (1)
zachycovat (3) / zachytit (3)    posbírat (1)                  popadnout (1)
zastavit se pro (1)                                            uchopit (3)
zvednout (2)                                                   vzít (4)
                                                               vzít si (1) [found in Vodička]
                                                               zdvihnout (2)

If the evidence of the Lithuanian translation of Nineteen Eighty-Four is compared with an English-Lithuanian dictionary, a similar discrepancy is found. As
equivalents of pick up, the English-Lithuanian/Lithuanian-English Dictionary by
Bronius Piesarskas & Bronius Svecevičius (1997) gives the following:

surinkti, pakelti, pasitaisyti, pagerėti, pagauti, greit išmokti, pavežti, atsitiktinai
susipažinti, išgelbėti (skęstantį), sugauti (bėglį)

Of these, the Lithuanian translation attests one example of surinkti 'gather up',
four of pakelti 'raise' and one of pagauti 'catch'. Nineteen out of twenty-five
occurrences of pick up are thus unaccounted for by the dictionary.
Although forms of paimti 'take' are found as translations in ten cases and
there are also two examples of pasiimti 'take/take with you', neither of these
verbs is given in the dictionary as an equivalent of pick up. This is a parallel phenomenon to the omission of the Czech verb vzít 'take', occurring four times in
the translation, from the English-Czech dictionary and raises the question as to
whether there is a tendency for dictionary compilers to focus on more specialised, less frequent, meanings of the phrasal verb while omitting more common ones.
Other equivalents of pick up found in the Lithuanian translation are
užrašyti 'record', užfiksuoti 'note', stverti 'seize, snatch' and atsisėsti 'sit up'.

Translation equivalents
A comparison of the two translations shows something of the translators'
respective strategies. To consider this, the various meanings of pick up in the
English text can be categorised and contextualised, so that translation of meaning in context can be assessed and other potential factors then considered.
One significant use of pick up is in relation to the concept of the all-pervasive surveillance by Big Brother which is central to the theme of the novel. The
Lithuanian version reveals a greater variety of expression here, different
semantic components being selected for emphasis. The detection of conversations by hidden microphones is rendered throughout in the Czech translation
by the verb zachycovat/zachytit 'catch, detect', whereas there are four different
equivalents in the Lithuanian version.
Only in one case is the verb pagauti 'catch' found, corresponding closely to
the Czech zachycovat/zachytit 'catch, detect':
(26) English
Any sound that Winston made, above the level of a very low
whisper, would be picked up by it, moreover, so long as he
remained within the field of vision which the metal plaque
commanded, he could be seen as well as heard.
Lithuanian
Jis pagaudavo bet kurį Vinstono sukeltą garsą, bent kiek
smarkesnį už tylų šnibždesį; dar daugiau kol jis neišeidavo iš
lėkštės apimamo ploto, galėdavo būti ne tik girdimas bet
ir matomas.
Czech
Každý zvuk, který Winston vydal a jenž byl hlasitější než
velmi tiché šeptání, obrazovka zachycovala; a co víc, pokud
zůstával v zorném poli kovové desky, bylo ho vidět a slyšet.

There are examples in which, by using the verb užrašyti 'record', the translator
has introduced a semantic component not explicit in pick up but derived from
the wider situational context of the novel, indicating that conversations were
universally recorded and used as evidence:
(27) English
He and Julia had spoken only in low whispers, and it would not
pick up what they had said, but it would pick up the thrush.
Lithuanian
Jiedu su Džulija kalbasi tiktai pašnibždom, ir mikrofonas
nepajėgtų užrašyti jų žodžių, bet strazdą užrašytų.
Czech
Hovořili s Julií sice jen šeptem a mikrofon by nezachytil, co
říkali, ale zachytil by drozda.

(28) English
In a place like this the danger that there would be a hidden
microphone was very small, and even if there was a microphone it would only pick up sounds.
Lithuanian
Tikimybė, kad tokioje vietoje paslėptas mikrofonas, buvo labai
maža; net jeigu ir yra mikrofonas, tai užrašys tik garsus.
Czech
Na takovém místě hrozilo minimální nebezpečí, že by tam
byl skrytý mikrofon, a i kdyby tam byl, zachycoval by pouze
zvuky.

The example of užfiksuoti 'note' is a similar case. The phrase galėjo viską užfiksuoti also emphasises the capability of the all-powerful state to record everything that anybody said:
(29) English
To keep your face expressionless was not difficult, and even
your breathing could be controlled, with an effort: but you
could not control the beating of your heart, and the telescreen was quite delicate enough to pick it up.
Lithuanian
Išlaikyti veidą nereikšmingą buvo nesunku, pasistengus galima suvaldyti ir kvėpavimą, bet širdies plakimo taip lengvai
nesukontroliuosi, o teleekranas buvo pakankamai jautrus ir
galėjo viską užfiksuoti.
Czech
Nebylo těžké zachovat bezvýraznou tvář a s jistým úsilím
mohl člověk kontrolovat i dech; nikoli však bušení srdce,
obrazovka byla natolik citlivá, že je zachycovala.

In one case, by contrast, the semantic component 'detect' is subsumed under
the component 'recognize':
(30) English
There were no telescreens, of course, but there was always the
danger of concealed microphones by which your voice might
be picked up and recognized;
Lithuanian
Žinoma, čia nėra teleekranų, bet gali būti paslėptų
mikrofonų, iš kurių tavo balsas būtų atpažintas.
Czech
Obrazovky tu samozřejmě nebyly, ale mohly tu být skryté
mikrofony, kterými mohli zachytit a dešifrovat váš hlas;

The most common meaning of pick up is 'take hold of', with or without the
additional semantic component 'raise', only mildly inherent in the particle. As
already mentioned, there are 12 examples of the use of paimti or pasiimti 'take'
as the translation of this concept and 4 of pakelti 'raise'. The context does not
always provide clear authority for variation between paimti and pakelti, as can
be seen from the following examples of similar contexts, but variety for its own
sake can be a valid stylistic decision. Of course, raising one's glass could have a
very different meaning in English from simply picking it up:
(31) English
He picked up his glass and drained it at a gulp.
Lithuanian
Jis paėmė <- raise> stiklinę ir vienu ypu išgėrė.
Czech
Zdvihl <+ raise> sklenku a naráz ji vypil.

(32) English
He saw Julia pick up her glass and sniff at it with frank curiosity.
Lithuanian
Jis matė, kaip Džulija pakėlė <+ raise> taurę ir neslėpdama
smalsumo pauostė.
Czech
Viděl, jak Julie zdvihla <+ raise> sklenku a přivoněla k ní s
upřímnou zvědavostí.

The Czech translation of pick up in this general sense also shows greater variation, in that the Czech verbs nabrat 'gather up', uchopit 'grasp', vzít 'take', vzít do
ruky 'take hold of', vzít si 'take with you', zdvihnout 'raise' and zvednout 'raise'
respectively are met where plain paimti is found in Lithuanian.
Pick up in the sense of 'collect from somewhere and take away' has an
equivalent fixed expression in Czech. The Lithuanian version here is rather
more descriptive ('call at my house and take [it] with you'), using the general
verb pasiimti just mentioned:
(33) English
Perhaps you could pick it up at my flat at some time that suited you?
Lithuanian
Gal galėtumėt kokiu patogiu laiku užeiti pas mane namo ir
Czech
Možná byste se pro něj mohl někdy zastavit u mě doma, až se
vám to bude hodit.

Lithuanian stverti 'seize, snatch' is found where a sudden, impulsive action is
indicated by the context. This verb incorporates the semantic component 'suddenly', explicit in the original English context. The Czech version has the verb
popadnout ('seize, snatch'), which in itself expresses the impulsiveness of the
action, yet a reinforcing adverb zničehonic 'all of a sudden' is also included:
(34) English
The dark-haired girl behind Winston had begun crying out
'Swine! Swine! Swine!' and suddenly she picked up a heavy
Newspeak dictionary and flung it at the screen.
Lithuanian
Tamsiaplaukė mergina už Vinstono nugaros pradėjo šaukti
'Kiaulė! Kiaulė! Kiaulė!', paskui stvėrė storą naujakalbės
žodyną ir metė jį į ekraną.
Czech
Tmavovlasá dívka za Winstonem začala vykřikovat 'Svině!' a
zničehonic popadla těžký slovník newspeaku a mrštila jím
do obrazovky.

Both the Czech and Lithuanian versions render pick up by an unambiguous
verb meaning 'gather up':

(35) English
'Pick up those pieces,' he said sharply.
Lithuanian
'Surinkit tas šukes,' griežtai pasakė jis.
Czech
'Posbírejte to,' řekl zostra.

The reflexive pick oneself up is also rendered in both versions by an unambiguous verb meaning 'sit up':
(36) English
The girl picked herself up and pulled a bluebell out of her hair.
Lithuanian
Mergina atsisėdo ir išsiėmė iš plaukų katilėlį,
Czech
Dívka se nadzvedla a vytáhla si z vlasů modrý zvonek.

Conclusion
The outcome of the present experiment suggests that parallel corpora are a
resource that cannot be ignored in translation studies. Data extracted from
translation corpora offer considerable potential for contrastive analysis of the
respective patterns of linguistic forms which express given semantic content.
This view is supported by Rūta Marcinkevičienė, whose paper on parallel corpora and bilingual lexicography starts from the position that 'parallel corpora
(i.e. texts of source language and target language, aligned on the level of sentence) can considerably improve bilingual dictionaries and other tools of
translators' (Marcinkevičienė 1998:40).
Bilingual lexicographers may have reservations concerning the validity of
translation corpora as a source of empirical evidence for the improvement of
bilingual dictionaries, as translators may be subject to interference from the
language of the original. Wolfgang Teubert writes that
it still remains to be seen what [parallel corpora] really can contribute to multilingual lexicography [...] Translations, however good and near-perfect they
may be (but rarely are), cannot but give a distorted picture of the language they
represent. (Teubert 1996:247)

An example is given by Martin Gellerstam (1996). Comparing original Swedish
novels and English novels translated into Swedish, he has shown that certain
linguistic features in Swedish are overused by Swedish translators under the
influence of English.
However, it cannot follow from this that translation studies should ignore
the evidence of translation corpora; rather it means that this evidence of the
intuitive knowledge of translators should be considered alongside the evidence
of comparable corpora representing writing by native speakers of the respective
languages. Translation corpora yield, inter alia, valuable evidence of translation problems and translation strategies, especially if alternative versions are
included. Insights into the sources of such problems and of the motivation of
strategies for their solution are central to pedagogy and to academic research in
this field.


Note
1. Zachycovat/zachytit are considered here as different aspects of the same verb.

Primary Sources
Erjavec, T., Lawson, A. & Romary, L. (eds). 1998. East Meets West: a compendium of multilingual resources. Mannheim: TELRI.
Orvelas, Džordžas 1991. 1984-ieji. [Translated into Lithuanian by Virgilijus Čepliejus]
Vilnius: Vyturys.
Orwell, George 1949. Nineteen Eighty-Four: a novel. Harmondsworth: Penguin Books.
Orwell, George 1949. Nineteen Eighty-Four. New York: New American Library.
Orwell, George 1991. 1984. Praha: Naše vojsko. [Anonymous Czech translation]

Secondary sources
Aijmer, K., Altenberg, B. & Johansson, M. (eds). 1996. Languages in contrast: papers from a
symposium on text-based cross-linguistic studies, Lund 4-5 March 1994. Lund: Lund
University Press.
Corness, P. J., Daniels, C. R., Deepwell, F. H., Haydon, D., Holland, M., Read, F., Thompson,
D., Thompson, J. 1997. TransIt-TIGER English-French. London: Hodder & Stoughton.
Gellerstam, M. 1996. Translations as a source for cross-linguistic studies. In Languages in
contrast: papers from a symposium on text-based cross-linguistic studies, Lund 4-5 March
1994, K. Aijmer, B. Altenberg & M. Johansson (eds), 53-62. Lund: Lund University
Press.
Hais, K. & Hodek, B. 1991-1993. English-Czech dictionary (4 vols.). Praha: Academia.
Johns, T. F. & Scott, M. 1993. MicroConcord: an introduction to the practices and principles of
concordancing in language teaching. Oxford: Oxford University Press.
Johns, T. F. & Scott, M. 1993. MicroConcord. Oxford: Oxford Electronic Publishing.
Marcinkevičienė, R. 1998. Parallel corpora and bilingual lexicography. In Germanic and
Baltic Linguistic Studies and Translation: proceedings of the international conference held
at the University of Vilnius, Lithuania, 22-24 April 1998, A. Useniene (ed.), 40-47.
Piesarskas, B. & Svecevičius, B. 1997. English-Lithuanian/Lithuanian-English dictionary.
Vilnius: Žodynas.
Teubert, W. 1996. Comparable or parallel corpora? International Journal of Lexicography 9
(3): 238-264.
Vodička, L. 1992. Anglicko-český slovník frázových sloves. Praha: Fragment a Práh.
Woolls, D. 1998. Multiconcord [multilingual concordancing program, incorporating
Minmark markup program]. Birmingham: CFL Software Development. [Funded by the Lingua Office of the
European Union.]

General index

adjective 32, 33, 98, 103, 112, 123, 138, 145,

157, 175-178, 180, 208, 211, 218, 219,
223-226, 298, 299, 302
adverb 19, 20, 106, 157, 177, 323
aligning 42, 44, 271-273, 285, 288, 307
alignment 10, 11, 13, 39, 40, 45-48, 271288, 309, 311
ambiguity 29, 35, 36, 120, 133, 149, 238240, 291-293, 296, 297, 304
anaphora 253
animacy 182
annotation 307, 308
Arbeit 194, 199-203, 205-210, 212
automatic translation 37, 231, 233, 237,
291, 293, 297, 302, 303
avoir 56, 69, 70, 125, 126, 235, 237, 241, 281

back-translation 17, 29
balanced corpus 9
based on 36, 291, 297
bi-text 271, 272, 274, 275, 278, 287
bilingual concordancer 271
bilingual corpus 44, 47, 216, 274, 279, 288,
293, 308, 309
bilingual dictionary 21, 30, 33-35, 43, 52,
54, 55, 81, 190, 203, 204, 210-229, 235,
237, 247, 279, 284, 305, 308, 309, 319,
bilingual glossary 249, 283
bilingual lexicon 10
British National Corpus (BNC) 35, 60,
216, 218, 222, 224, 226, 229, 291, 298
Brown Corpus 40, 121, 292

Canadian Hansard Corpus 11, 278

case 33, 83-92
caso 33, 83-90
CAT 308
Catalan viii, 35, 216, 217, 219-221, 224,
225, 228, 229
causative 19, 41, 97-116, 123, 125, 129, 136,
137, 145, 147
causative construction 19, 105
causative verb 19, 98, 99, 102, 103, 105,
106, 108, 109
Chinese viii, 31, 115, 116, 151, 152, 156,
157, 167, 171-174
co-occurrence 5, 26, 27, 32, 37, 60, 94, 209,
279, 289, 293, 300, 304
co-selection 32, 75, 77, 78, 80, 91, 92, 94
cognate 10, 19, 127, 280
cognitive linguistics 149, 174, 190, 191,
195, 196, 221
cognitive semantics 48, 150-153, 174
cognitive universals 152
cold 35, 145, 171, 216, 218-225, 228
colligation 22, 32
collocate 60, 77, 207, 300, 301, 304
collocation 8, 22, 26, 27, 32, 42, 44, 47, 77,
78, 88, 95, 182, 200-202, 208, 209, 211,
222, 241, 242, 291, 298-303, 305
comparable corpus 7-9, 16, 17, 30, 40, 81,
83, 91, 93, 100, 115, 216, 293, 324
complex preposition 83, 84, 86, 297
compound 36, 180, 200, 208, 281, 291, 294
computational lexicography 7, 305
computational linguistics vii, 38, 42, 44,
45, 94, 288, 305

conceptual ontology 191-196, 199, 203,
211, 212
concordance 13, 33, 47, 76, 77, 82-85, 89,
93, 95, 183, 303, 305, 308, 309
concordancer 13, 42, 271
connector 19
contain 20, 52-60, 68-71
contextual correspondence 284
contrastive analysis 5, 16, 18, 28, 43, 45, 46,
127, 147, 152, 228, 307, 308, 324
contrastive linguistics vii, 3, 5, 6, 43, 46, 57,
60, 61, 74, 79, 307, 308
corporate memory 271
corpus-based vii, viii, 4, 14, 15, 18, 26, 29,
31, 32, 36-39, 41, 43, 46, 74, 75, 94, 97,
115, 153, 183, 187, 204, 215, 217, 226,
247, 305
corpus-based dictionary 41, 217
corpus-based lexicography 226
corpus-driven 15, 32, 43, 73-78, 81, 94, 204
cross-linguistic lexicology 48, 116, 119,
148, 150, 190-192
Czech viii, 309, 315, 316, 320-325

Danish 127
data-driven learning 95
dictionary entry 35, 216, 218, 228
disambiguation 37, 119, 120, 123, 147, 240,
241, 266, 291, 292, 298, 304, 305
domain-specific corpora 8
down 31, 151-157, 161-173
Dutch 10, 24, 26, 30, 34, 35, 46, 47

electronic lexicon 34, 35

ellipsis 31, 177-179, 183
English viii, ix, x, 5, 6, 9-11, 18-21, 23-26,
29-37, 39-48, 52-55, 57-61, 64, 73, 8385, 87, 89, 90, 92-94, 97-116, 119, 121,
124, 125, 127-131, 133-137, 139-141,
144, 147-152, 156, 167, 170-175, 177185, 189-191, 194-196, 198, 199, 205,
216-222, 224, 226-229, 231-233, 236,
247, 249-256, 258, 261, 264-267, 273,

276, 278, 280, 281, 287, 289, 291, 292,

295-299, 301, 304, 305, 309, 311, 315,
316, 319-326
English-Norwegian Parallel Corpus 9, 10,
29, 45
English-Swedish Parallel Corpus 9, 23, 25,
41, 100, 121, 140, 148
equivalence viii, 15-18, 21, 22, 33-35, 40,
46-49, 51, 53, 55-57, 60, 73, 79-81, 85,
88, 91, 95, 191, 199, 222, 229, 242, 245,
257, 258, 259, 266, 273, 274, 276, 279,
280, 282, 284, 287
EU documents 190, 201, 205, 213
European Court of Human Rights 249, 250
European Parliament 190, 203, 205, 282
EuroWordNet 35, 45, 48, 196, 306
experiential grounding 157, 160, 162, 164,
165, 167
experiential realism 151, 160

f 19, 23, 98, 101-104, 107-113, 119, 121147

faire 125, 137, 202
figure of speech 175, 176
Finnish viii, 10, 23, 24, 121, 124, 125, 127,
128, 130, 136, 137, 147, 190
fixed expression 323
fork 76-78, 94
frame semantics 23, 27, 39, 41
French viii, 10, 11, 18, 20, 21, 23, 30, 31, 3337, 40, 41, 46, 52-56, 59, 68, 97, 98,
114, 121, 124-128, 130, 136, 137, 147,
175, 177, 178-185, 189, 190, 194, 199205, 208, 209, 213, 231-233, 236, 243245, 249-258, 260, 261, 264-266, 273,
278, 280-282, 285, 287, 289, 291, 294299, 301, 302, 309, 311, 325
functional equivalence 33, 73, 81, 91
functionally complete unit of meaning 33,
73, 74, 76, 79, 81, 85, 90, 91

German viii, 10, 20, 21, 24, 25, 30, 37, 40,
43, 47, 52, 54, 55, 57, 65, 127, 174, 189,

General index

194, 199-205, 208, 209, 213, 292, 311

get 19, 23, 99, 111, 119, 121, 125, 128-131,
get-passive 146
grammaticalisation 21
grammaticalized meanings 144
Greek 196-199, 201, 213

headword 242-244, 254

high 35, 216, 219, 223-225, 228
homonymy 119, 120, 292
Human Rights terminology 249
hypallage 31, 175-179, 183, 184
hyponomy 29
hyponym 142

idiom 92, 234

idiom principle 92
idiomaticity 5
in case 33, 78, 83, 86-90, 93
in case of 33, 78, 86-89
in caso di 33, 86-88
in the case of 33, 82-84, 86, 93
inchoative 123, 125, 129, 135, 136, 140,
interference 216, 324
interlanguage 115
interlingua approach 193
International Corpus of Learner English
(ICLE) 97, 115, 116
Italian viii, 30, 33, 37, 73, 83-85, 87, 89, 90,
92, 93, 221, 231, 281, 282

journalistic prose 175

kaum 20, 52-58, 64

landmark 153, 155-157, 173

language system 18, 40, 120, 211

language teaching 5, 6, 14, 95, 116, 147, 325

language use 8, 18, 38
langue 18, 55, 60
legal documents 205, 212
legal terminology 250, 266
lemmatisation 253, 254
lexical alignment 272, 281-284
lexical correspondence 280, 284-286
lexical database 30, 305
lexical decomposition 41
lexical field 32
lexical item 6, 22, 26, 27, 32, 47, 95, 291,
295, 296, 304
lexical relations 14, 28, 29, 35, 38
lexical semantics viii, 28, 29, 43, 47, 48, 95,
lexical unit 26, 27, 281, 282, 284, 319
lexico-grammatical 4, 5, 41, 73, 83, 114
lexicology vii, 5, 48, 116, 119, 148, 150,
190-192, 212, 213, 287
literary creativity 183
Lithuanian viii, 309, 315, 319-326

machine translation 32, 36, 37, 191-193,

211, 284, 288, 289, 308
machine-readable dictionary 36, 45, 292,
304, 305
make 19, 30, 97-116, 125, 137, 145
markup 307, 309, 326
metaphor 16, 45, 143, 144, 149, 151, 152,
160, 162, 163, 165, 167, 171, 173, 174,
182, 184, 221, 236
metaphorical 21, 31, 142, 143, 151-153, 156,
157, 159, 160, 165, 167-169, 172, 173,
219, 220, 222, 224, 232, 241, 243, 305
metaphorical extension 21, 157, 173
metaphorical mapping 151, 152
metonymy 175, 180, 181, 183
MicroConcord 309, 325
Microsoft Word 13, 308, 309
Minmark 309, 311, 326
mixed 298, 299, 301, 303
modal auxiliaries 19, 25, 105
modal particles 25, 41

modality 16, 19, 103, 109, 123, 125, 127,
132, 149, 150, 177
modulation 20, 57-61
monolingual corpus 14, 30-35, 51, 52, 93,
148, 204, 216, 217, 227, 228, 293, 309
monolingual dictionary 40, 204, 215, 217,
218, 220, 221, 224, 226, 227, 229, 294,
motion 21, 23, 29, 106, 107, 121, 140-144,
163-165, 170
MultiCoDiCT dictionary system 241
Multiconcord 13, 48, 307-309, 311, 312,
315, 326
multilingual corpus vii, 7, 9, 10, 13, 15, 18,
37, 38, 42, 213, 215, 228, 304
multilingual dictionary 35, 41, 237, 247,
multilingual lexicography viii, 14, 33-35,
38, 41, 47, 189, 193, 229, 324
multilingual thesaurus 58, 60
multiple equivalents 23, 54, 252-254, 267
multiword term 253, 257, 258, 266
mutual correspondence 14, 17-19, 23
mutual information 30, 42, 279
mutual translatability 23, 147

natural language processing 30, 35, 36, 38

nel caso di 33, 83, 84, 87
nominalization 181, 182
non-compositional 234
Norwegian 9, 10, 20, 24, 26, 29, 30, 43-45,
116, 127
noun 26, 40, 76, 87, 98, 99, 122, 123, 125,
129, 131, 148, 175, 180, 182, 194, 196,
197, 199, 201, 206-208, 211, 243, 252,
254, 299, 302

obligation 105, 123, 125, 132-135, 144, 147

odd 35, 216, 218, 225-227
order of equivalents 225-228
order of senses 218, 219, 223, 226, 227
Oslo Multilingual Corpus 10
overlapping polysemy 22, 35, 215, 217,
221, 225, 227

ParaConc 13, 42, 307

paradigmatic relations 5
parallel concordancer 13
parallel concordancing viii, 39, 269, 307,
308, 315
parallel corpus 8-10, 23, 25, 29, 41-43, 45,
46, 48, 61, 81, 100, 115, 121, 140, 148,
178, 181, 189, 193, 194, 203-205, 211-213, 216, 229, 273, 283, 288, 305, 308,
parallel texts 10, 42-46, 271, 272, 288
parliamentary debates 47, 250
parole 18, 55, 60
paronymy 273
part-of-speech tagging 251
partial overlap 21
particle 25, 26, 30, 123, 137-139, 142-144,
157, 174, 322
periphrastic causative 125, 137, 145
permission 123, 125, 132-135, 147
phrasal verb 77, 78, 315, 316, 320
phraseology 27, 43, 45, 78, 247, 284, 287,
pick up 157, 316-324
Polish 236, 250
polysemy 19, 22-24, 28, 34, 35, 41, 48, 97,
119, 120, 127, 140, 148, 150, 179, 182,
195, 215, 217, 218, 221, 225, 227, 245,
291, 292, 299
Portuguese 10
possession 19, 23, 121-123, 125-132, 140-142, 146-149
premodifier 296
primary meaning 120
procédure 23, 249, 255-258, 261, 264-266
proceedings 23, 249, 255-266
pronoun 20, 106, 122, 297
prototype 120, 127, 139, 148, 152, 157, 247
prototypical 19, 25, 31, 34, 99, 103, 113,
114, 120, 125, 126, 128, 141, 152, 157,
167, 173
prototypicality 28, 29, 99, 114
psychotypology 114
pun 273

quasi-idiomatic 234

rate 36, 224, 293-296, 298

recurrence 209, 211, 212, 281, 287
register 9, 81, 82, 94, 100, 237
restricted domain 211
Romanian 250

saada 125, 127, 128, 130, 137, 147

se per caso 33, 83, 88-90
segmentation 272, 274, 275, 281, 283, 284
selection restrictions 27, 103, 113
semantic extension 133, 146
semantic features 28, 41
semantic field 29, 60, 79, 87, 88, 91, 207
semantic preference 32, 78, 79, 84, 85, 87-89, 91, 92
semantic prosody 27, 32, 78, 79, 84, 85, 87-93
semantic scope 176
semantic unit 190, 191, 212
semi-idiomatic 234
sense distinction 227
sentence alignment 10, 40, 272, 275, 280,
set expression 231, 233-236, 238-240
set expression dictionaries 231, 233
set phrase 235
shang 31, 151-157, 159-161, 163-168, 172,
shift of meaning 179
source domain 151, 152
Spanish viii, 35, 37, 205, 216, 217, 219-221,
225, 228, 229, 231, 243-246, 311
specialised corpus 249
specialised dictionaries 242, 246
Swedish viii, 9, 10, 19, 21, 23-26, 37, 41, 44,
48, 94, 97-116, 119-121, 124-127, 129,
130, 133-135, 138, 140-142, 144-148,
150, 324
synonymous equivalents 212, 244
synset 195-199

syntactic frame 122, 123, 136, 141, 143, 147

syntactic shift 176
syntagmatic relations 4, 5, 22, 29
synthetic causative 19, 99, 109, 112, 115

TACT program 293

target domain 151, 152, 165
term extraction 249, 250, 266
terminology vii, 7, 14, 34, 42, 85, 94, 249-255, 266, 267, 287
tertium comparationis 15, 16, 20, 28, 191
text alignment 10, 11, 40, 46, 47
textual context 309
thesaurus 58, 60, 195, 196
trajector 153-157, 159, 167, 173
transfer 17, 80, 98, 114-116, 216
translation vii, viii, 6-11, 13-26, 29-48, 51-55, 57, 58, 60-62, 73, 74, 79-85, 89-95,
100, 101, 105, 109, 115, 119-121, 124-128, 130-136, 138, 140-145, 147, 148,
150, 176, 178, 180, 183, 184, 189-194,
196-200, 203-205, 209, 211, 212, 213,
216, 217, 221, 222, 225, 227-229, 231-233, 237, 239, 244, 246, 247, 250, 260,
265-267, 269, 271, 272-282, 284, 285,
287-289, 291-294, 296-299, 301-305,
307-309, 312, 318-325
translation aids 211
translation corpus 7-11, 13, 16-22, 24, 25,
30, 31, 33, 34, 37-41, 43, 45, 47, 48, 51,
52, 54, 55, 57, 58, 60, 61, 81-83, 91, 93,
100, 115, 121, 132, 148, 271, 307-309,
324, 325
translation equivalence 15-18, 34, 40, 46-48, 51, 57, 60, 95, 191, 229, 273, 276,
translation equivalent 17, 32, 33, 53, 82-84,
89, 90, 125, 190, 204, 209, 301, 318
translation memory 37, 61, 250, 267
translation platform 203, 205
translation practice 94, 199, 204, 212
translation process 80, 93
translation strategy 17, 54, 58, 308, 325
translation studies vii, 38, 39, 41, 42, 44, 45,
94, 148, 213, 229, 324

translation unit 203, 205, 209, 211, 212
translational compositionality 274, 285,
translational systematicity 55, 57
translational unsystematicity 55
translationese 9, 44, 47, 85, 94
translator training 14
translator's workbench 309
translator's workstation 271
travail 194, 199-202, 205-210, 212
typological 14, 23, 27, 29, 116, 191

underspecification 120, 149

unit of meaning 32, 33, 73, 74, 76-81, 85,
87, 88, 90-92, 95, 193, 195-197, 212
unit of translation 80, 199
universal 14, 17, 21, 23, 27, 28, 31, 40, 41,
97, 116, 146, 149, 150, 153, 173, 174,
191, 193, 195, 196, 233
up 30, 31, 46, 151-157, 167-174, 316-324

valency 27, 34, 43, 47

verb 19, 20, 23, 24, 29, 30, 34, 40, 41, 43, 46,
47, 56, 76-78, 97-99, 102-106, 108-115, 120-123, 125-132, 135-147, 157,
174, 182, 207, 237, 254, 258, 266, 292,
297-299, 302, 315, 316, 320, 321, 323-325
verb of possession 121, 123, 127-129, 131,
141, 142

word alignment 11, 13, 281, 288

word formation 179
work 194, 196-199

xia 31, 151-157, 159-161, 163-168, 172,


Author index

Achche 247
Adams 179, 180, 184
Ahlswede 304
Ahrenberg 37, 41
Aijmer 8, 9, 19, 24, 25, 41, 44, 46, 48, 100,
101, 115, 116, 121, 148, 150, 325
Akhundov 165, 174
Al-Kasimi 215, 229
Allan 163, 174
Altenberg vii, viii, ix, 3, 9, 18, 19, 41, 97,
105, 115, 116, 121, 124, 136, 148, 150,
Alverson 163, 174
Assal 253, 267
Astington 181, 184
Aston 218, 229
Atkins 6, 23, 34, 39, 41-43, 304

Bahns 30, 42
Baker 7, 9, 39, 42, 43, 93-95
Bally 39, 42, 121, 148
Barlow 13, 42
Bejoint 247, 248, 305
Belkin 184, 185
Benson 304
Berlin 119, 148
Biber 5, 42, 51, 61, 94
Bickel 174
Bläser 34, 42
Botley 42, 46, 47
Boucher 193, 213
Bourigault ix, 249, 252, 267
Bradley 293, 306

Bresnan 39, 42
Brill 95, 289, 291, 304
Brown 10, 37, 40, 42, 44, 119, 121, 149, 271,
272, 278, 280, 288
Burnard 218, 229
Butler 39, 41, 42, 46

Calzolari 22, 42, 94

Cardey ix, 35, 231, 247
Carruthers 193, 213
Celle 182-184, 246
Chan 236, 243, 247
Chesterman 6, 15, 16, 18, 42, 60, 61
Chuquet 57, 58, 61, 181, 184
Church 10, 13, 30, 40, 42, 44, 272, 278-280,
288, 304
Clear 39, 42, 93, 293, 304
Coates 180, 184
Cocke 42, 288
Conrad 42, 61
Corness ix, 13, 307, 309, 325
Cornu 255, 265-267
Cowie 43-45, 304
Cruse 26, 27, 29, 41, 43
Culioli 182, 184

Dagan 272, 288

Daniels 325
Darbelnet 48, 57, 58, 62
Dauphin 235, 247
Debili 280, 284, 288
Deepwell 325
Defrancq 182, 184

Delavigne 253, 267
Della Pietra 42, 288
Devos 34, 43
Di Pietro 6, 43
Di Sciullo 3, 43
Di Tomaso 304
Dickens 34, 43
Dik 39, 43
Dini 292, 304
Dorr 119, 148
Dunning 272, 280, 288
Dupriez 184
Dymetman 45, 288
Dyvik 29, 43

Ebeling 13, 18, 43

Eco 193, 213
Erjavec 40, 43, 325
Even 304

Faber 39, 41, 43, 92, 94

Fabricius-Hansen 30, 31, 43
Fellbaum 39, 46
Filipovic 40, 43
Fillmore 23, 39, 41, 43
Finnegan 42
Firth 4, 5, 32, 42-44, 73, 76, 91, 92, 94
Fisiak 34, 43
Fodor 193, 213
Fontenelle 44, 296, 304, 305
Foster 45, 47, 288, 289
Francis 15, 40, 42-44, 75, 94, 213
Fromilhague 176, 184
Fung 279, 280, 288

Gale 10, 13, 40, 42, 44, 272, 278, 280, 288, 304
Galliot 247
Gärdenfors 48, 132, 150, 151, 174
Garside 291, 298, 305
Gaussier 272, 275, 278, 288
Gavieiro 247
Gazdar 39, 44

Geeraerts 120, 148

Geiger 151, 152, 174
Gellerstam 9, 34, 42-44, 46, 48, 85, 94, 324,
Gerardy 44
Goatly 160, 174
Goddard 119, 148
Goldberg 39, 44
Gonzales-Muilez 267
Granger vii, viii, ix, 3, 30, 44, 97, 115, 116,
Greenbaum 116, 149
Greenfield ix, 35, 231, 241, 247
Greenstein 45
Grefenstette 30, 44
Gronemeyer 140, 148
Guillemin-Flescher 44, 181, 182, 184
Gumperz 119, 149
Gutt 51, 61

Hais 316, 325

Halliday 3, 4, 39, 44, 47, 73, 75, 83, 92, 94
Hanks 30, 42, 304
Hannan 47
Hartmann 7, 44, 215, 229
Hasselgård 41, 43, 44, 116
Hasselgren 114, 116
Hatch 119, 149
Haydon 325
Heid 34, 41, 44, 305
Heine 128, 149
Heyn 37, 44
Hindle 304
Hodek 316, 325
Hofland 10, 44, 45
Holland 43, 45, 325
Howarth 30, 45
Hudson 39, 45
Hunston 75, 94
Hyltenstam 48, 114, 116

Ide 35-37, 44, 45, 123, 131

Isabelle 37, 45, 47, 271, 274, 288, 289

Israël 288
Ivir 17, 29, 45

Jakobson 120, 149

James 6, 15, 16, 28, 29, 39, 45, 171, 213,
215, 229
Järborg 44
Jelinek 42, 288
Johansson 7-10, 13, 20, 26, 41-45, 47, 48,
100, 101, 115, 116, 121, 140, 148-150,
Johns 81, 95, 174, 309, 325
Johnson 23, 46, 128, 149, 160, 162, 163,
173, 174, 184
Johnson-Laird 23, 46, 128, 149
Juffs 99, 115, 116
Jutrac 45

Kay 10, 45, 119, 148, 272, 280, 288

Kellerman 114, 116
Kervio-Berthou 203, 213
Kittay 28, 29, 43, 45
Kleiber 184
Klein 44
Knowles 305
Koskinen 190, 214
Kraif x, 11, 271, 278, 288
Krzeszowski 6, 14-16, 45
Kucera 40, 44

Lafferty 42, 288

Lai 42, 164, 288
Lakoff 143, 149, 151, 152, 160, 162-165,
174, 184, 305
Langacker 39, 45, 133, 149, 151, 174, 182,
Langé 272, 275, 278, 288
Langlais 272, 276, 288
LaPolla 39, 48
Larreya 177, 184
Lawson 43, 325
Lederer 288

Leech 42, 85, 116, 149, 174, 180, 184, 305

Lerner 184, 185
Levin 39, 41, 46
Levinson 119, 149
Lewandowska-Tomaszczyk 45, 46
Limame 231, 236, 247
Lindner 157, 174
Løken 26, 45
Louw 78, 87, 90, 93, 94

Macklovitch 45, 47, 271, 288, 289

Mairal Usón 39, 41, 43, 92, 94
Malmgren 44
Marcinkeviciene 324, 325
Marcus 304
Marello 246, 247
Matisoff 147, 149
Mauranen 24, 46
McCord 292, 305
McEnery 40, 42, 46
McKeown 305
Melcuk 296, 305
Melamed 279, 289
Melby 192, 214
Melia 45, 46
Mercer 42, 288
Merkel 10, 37, 40, 41, 46
Méry 177, 184
Michiels 292, 305
Miller 23, 39, 46, 128, 149, 195, 292, 305
Montgomery 34, 46
Moon 26, 46, 222, 233-235, 238, 246, 247,
Morgadinho 231, 236, 243, 247, 248
Morgan 157, 174

Nagao 271, 289

Neff 292, 305
Neubert 79, 95
Newman 119, 147, 149
Nida 80, 95, 273, 287, 289
Norén 44
Nuyts 24, 46, 174

Oakes 40, 46
Oksefjell 41, 43-45, 47, 48, 116, 140, 149
Orvelas 325
Orwell 309, 315, 316, 318, 325
Ostler 34, 41, 46

Paillard x, 31, 57, 58, 61, 175, 181, 183, 184
Paulussen 30, 31, 46
Payne 47
Pérez Hernández 47
Pergnier 282, 287, 289
Perrault 47, 289
Persson 41, 46
Peters 30, 40, 46, 149
Petit 180, 185, 247
Piesarskas 320, 326
Pinker 193, 214
Plamondon 47, 275
Plungian 132, 149
Poesio 120, 149
Pollard 39, 46
Pullum 44
Pustejovsky 120, 149, 305
Putnam 192, 214

Quirk 99, 116, 139, 149

Rainer 184
Rappaport 39, 46
Rastier 284, 289
Read 67, 70, 156, 190, 228, 325
Reidenberg 299, 305
Ren 45
Reppen 61
Ridings 13, 46
Ringbom 6, 46
Roberts 34, 46, 242, 248
Roe 305
Rogström 44
Rjder Papmehl 44
Romary 43, 325
Rondeau 253, 267
Roos 30, 46
Roossin 42, 288
Rosch 28, 46
Röscheisen 10, 45, 272, 288
Rudzka-Ostyn 149, 151, 152, 174

Sag 39, 44, 46
Sager 94, 273, 289
Sajavaara 6, 46
Salkie x, 18, 20, 34, 43, 46, 51, 52, 58, 61
Sammouda 280, 288
Sato 271, 289
Schäffler 9, 39, 40, 47
Schmied 9, 30, 39, 40, 47
Schönefeld 152, 174
Schultze 44
Schwarze 29, 47, 119, 149
Scott 309, 325
Segond 304
Simard 40, 45, 47, 271, 272, 274-276, 280,
288, 289
Simon-Vandenbergen 24, 34, 43, 47
Sinclair 3, 4, 22, 26, 27, 32, 33, 36-38, 42,
47, 48, 74, 76-78, 92-95, 221, 229, 305
Singleton 26, 53, 119, 149
Sinha 173, 174
Slater 292, 305
Smadja 305
Smith 116, 152, 174, 304
Song 99, 116
Steffens 36, 42, 44-47
Stibbe 152, 174
Stubbs 78, 92, 95
Suhamy 176, 185
Sutcliffe 292, 305
Svartvik 94, 116, 149
Svecevicius 320, 326
Svensén 21, 47
Svensson 148
Svorou 151, 163, 174
Swallow viii, 185

Taber 273, 287, 289

Taeldeman 47
Talmy 23, 48, 119, 149
Taylor 28, 48, 120, 149
Teubert vii, x, 8, 9, 21, 26, 32, 34, 36-38, 40,
48, 189, 201, 214, 216, 229, 293, 305,
324, 326
Thoiron 247, 248, 305
Thomas 149, 231, 236, 248
Thompson 325
Tognini Bonelli x, 15, 32, 33, 42, 48, 73-75,
79, 92, 94, 95, 204, 214, 229
Tomaszczyk 45, 46, 215, 229
Tournier 179, 180, 185
Toury 216, 229
Tsohatzidis 120, 149

Ullmann 185

van der Auwera 132, 149

Van Hoof 57, 62, 185
Van Roey 184, 185
Van Valin 39, 48
Véronis 36, 37, 40, 45, 48, 288
Viaggio 79, 95
Viberg x, 19, 21, 23, 29, 39, 40, 48, 97, 104,
116, 119, 120, 129, 148, 150
Vinay 48, 57, 58, 62
Vodicka 316, 318, 319, 326
Volz 48
Vossen 35, 45, 48, 292, 306

Wandruszka 39, 48, 121, 150

Wanner 119, 150, 305
Weigand 7, 41, 47, 48, 95
Wierzbicka 43, 119, 148
Willems 47, 182, 184
Williams 3, 43
Wilson 42
Winter 132, 150
Wong 115, 116

Woolls 308, 326

Yu 152, 163, 165, 174

Zampolli 41, 42, 93, 94

Zgusta 306

In the series STUDIES IN CORPUS LINGUISTICS (SCL) the following titles have been
published thus far:
1. PEARSON, Jennifer: Terms in Context. 1998.
2. PARTINGTON, Alan: Patterns and Meanings. Using corpora for English language research and teaching. 1998.
3. BOTLEY, Simon and Anthony Mark McENERY (eds.): Corpus-based and Computational Approaches to Discourse Anaphora. 2000.
4. HUNSTON, Susan and Gill FRANCIS: Pattern Grammar. A corpus-driven approach to
the lexical grammar of English. 2000.
5. GHADESSY, Mohsen, Alex HENRY and Robert L. ROSEBERRY (eds.): Small Corpus
Studies and ELT. Theory and practice. 2001.
6. TOGNINI-BONELLI, Elena: Corpus Linguistics at Work. 2001.
7. ALTENBERG, Bengt and Sylviane GRANGER (eds.): Lexis in Contrast. Corpus-based
approaches. 2002.
8. STENSTRÖM, Anna-Brita, Gisle ANDERSEN and Ingrid Kristine HASUND: Trends in
Teenage Talk. Corpus compilation, analysis and findings. n.y.p.