
Probing the Properties of Determinologization - the DiaSketch
Jakob Halskov
Dept. of Computational Linguistics
Copenhagen Business School
e-mail: jh.id@cbs.dk

Abstract
Identifying recurrent usage patterns of terms in non-specialized contexts may act as a filtering device
and thus help increase the precision of web-based term extraction algorithms. This article presents a
corpus-driven approach to the study of determinologization and investigates the claim in Melby [16]
that terms are always characterized by special reference irrespective of the context in which they are
used. An implementation of a system called the DiaSketch (Diachronic wordSketch) is outlined. The
DiaSketch can detect changing co-occurrence patterns of mother terms in diachronic corpora and thus
indirectly assess their termhood. Analyzing the usage of a term from the domain of Information
Technology (IT) in specialized and non-specialized contexts by means of DiaSketches, it is shown that
termhood seems to be a gradable property.

1 Introduction

Assessing termhood is the key challenge to Automatic Terminology Recognition (ATR) software, whether based on statistical or linguistic methods. When using the dynamic but chaotic Internet as a basis for the extraction of new terminology, however, the properties of termhood become even more complex than with small, static corpora of highly specialized discourse. Strings of natural language containing elements which would function as terms behind the domain wall may not do so in other contexts.
The goal of this article is to present the thoughts and theories behind the DiaSketch, a
system which can detect changing co-occurrence patterns of mother terms in diachronic
corpora and thereby assess the specificity of reference. The study was largely inspired by the
following quote from Rita Temmerman:
An attempt at getting more insight into how the meaning of terms evolves [...] could be
a major research topic for Terminology [26] p. 15


Section 2 of the article discusses what determinologization is, why it is an interesting phenomenon to study and how one might operationalize the definition of determinologization in [18]. While section 3 then briefly summarizes the discussion in Cabré ([3], [4], [5]), Kageura [13], Melby [16], Pearson [20] and Sager ([21], [22], [23]) about the distinction between terms and words, section 4 relates the DiaSketch approach to four theoretical schools within the science of terminology, namely the General Theory of Terminology, the Communicative Theory of Terminology ([4]), Socioterminology ([9], [10], [11]) and Sociocognitive Terminology ([26]). Drawing heavily on work by Evert ([7], [8]), Kilgarriff ([14], [15]) and Schulze [24], section 5 then proceeds to a description of various implementation issues, and section 6 finally gives an example of the output from a beta version of the DiaSketch implementation.

2 Determinologization

This section discusses what determinologization is, why it should be studied and how one
might proceed to study it with statistical methods from corpus linguistics.

2.1 What is determinologization?

It is not surprising that conceptual fuzziness tends to occur when non-specialists use terminology in non-specialized communicative contexts. It seems intuitive that what is a term (representing a clear-cut concept) to one person may be a (possibly unknown) word representing a fuzzy category to another person who lacks the specialist knowledge required to decode the term fully and correctly. It also seems probable that traces of this conceptual fuzziness can be registered in linguistic usage. While one-off cases of creative or fuzzy usage of terms pose no problem to web-based ATR software, it is a different story when large numbers of non-specialists use terms from a domain, forming strong collocations which, formally speaking, may resemble terminological neologisms while not functioning as such.
Although determinologization has received little attention and has yet to be studied in a quantitative framework, it has been defined as "the ways in which terminological usage and meaning can 'loosen' when a term captures the interest of the general public" [17] p. 112 and "det at ei eksisterende terminologisk ordform går over i allmennspråket"1 [19] p. 112. The

1 That an existing terminological unit enters general language.

definition in Meyer and Mackintosh is the more specific of the two and groups the semantic/pragmatic changes caused by determinologization into two types:
1) Maintien des aspects fondamentaux du sens terminologique
2) Dilution du sens terminologique d'origine [18] pp. 202, 205
To illustrate the difference between 1) preservation and 2) dilution of a terminological sense, we can consider the two phrases from The New York Times (1999) below:

A large server

The Internet business model needs a reboot

When the term server is modified by the conceptually fuzzy adjective large, the reference of the combined phrase seems to have become less accurate. Are we talking about a server which is physically large, or about a server which is equipped with large amounts of RAM? In spite of the superficial fuzziness, the reference to the domain-specific concept2 seems to be largely intact. However, this is not the case with the figurative use of the term reboot. When taking the context into consideration, it becomes obvious that reboot no longer refers to the original domain-specific concept of shutting down and restarting an operating system, but is being used in the more general sense of starting something afresh.
Clearly, there are many intermediate stages between determinologization of types 1) and 2), also known as sense modulation and sense selection in lexical semantics [6], and it is not at all clear whether terms which are exposed to this phenomenon will come to represent increasingly fuzzy categories or not. The rest of the article will outline an approach which may eventually answer this and other questions regarding determinologization.

2 A computer that controls or performs a particular job for all the computers in a network (definition from MacMillan English Dictionary for Advanced Learners, 2002).

2.2 Why should we study determinologization?

Having reviewed existing definitions of determinologization, we need to justify our interest in this phenomenon. While determinologization ought to be an important field of research in Socioterminology (see section 4.4), it also has important implications for the optimization of term extraction algorithms. The extensive usage of terminology from a domain like IT by vast numbers of non-experts in a variety of communicative settings complicates the automatic extraction task. Although determinologized usage can be avoided by using corpora which have been manually compiled and are known to represent specialized communication between experts, such corpora are expensive to come by and age swiftly (especially in a domain like IT). Thus using the Internet as a dynamic and inexhaustible treasure trove of terms is becoming increasingly appealing to computational terminologists [2], but this involves tackling a number of problems caused by determinologization. While these problems cannot be fully answered through statistical analysis of terminological usage in large general language corpora, we can at least get some indication of the linguistic characteristics of this phenomenon and thus a better understanding of factors which are important to the notions of termhood and domain.
Clearly, the concepts of certain domains are more exposed to determinologization than others. The domain of IT has been chosen as the testing ground for the present study, but why this particular domain?
[...] computerese has transcended its fundamental purpose: to describe and explain
computing. Although it still fulfills its original function, it frequently steps outside
these bounds to describe the human condition. Conversely, in the computer industry,
the human condition is frequently explained in terms of technological metaphors.
[1] p. xiv
As a technical subject field, IT needs concepts which require a high degree of determinacy, but at the same time these specialized concepts are highly popularized, making them particularly exposed to determinologization. While this is also the case for domains like medicine, appliances and technology in general, IT is special in that neology by terminologization (metaphorical extension of general language lexical units) is much more frequent in this domain.
Having established what determinologization is and where to look for it, the following
section will explicate how the phenomenon can be investigated within the framework of
corpus linguistics.


2.3 How can we study determinologization?

While the definition in section 2.1 provided a starting point for an empirical investigation of the linguistic properties of determinologization, it is necessary to explicate the theory of meaning adopted in this study in order to arrive at an operational definition of conceptual fuzziness and determinologization.
From a statistical NLP perspective
it is [...] natural to think of meaning as residing in the distribution of contexts over
which words and utterances are used [25] p. 16
In a contextual theory of meaning, syntax and semantics are intimately related and interdependent. Every word has a certain semantic potential, but it is the context, i.e. neighbouring words (and sometimes extra-linguistic context), which activates a particular sense in a particular case. This theory of meaning is highly pragmatic and descriptive and can be summarized by the famous Wittgenstein aphorism: "The meaning of a word is its use in the language" [27]. Based on a contextual theory of meaning, determinologization can be described as the process by which the combinatory potential, i.e. the relational co-occurrence patterns, of a term starts to resemble that of a comparable lexical unit from general language.
While conceptual fuzziness can be measured synchronically, we should not forget that determinologization is controlled by an extra-linguistic factor, namely the degree of diffusion of domain concepts into the consciousness of the general public. The spread of domain-specific concepts into non-specialized discourse is a diachronic phenomenon, and a description of the linguistic properties of determinologization can thus only be attempted by comparing a series of synchronic assessments of conceptual fuzziness. Such assessments are performed by computing lexical profiles of terms in diachronic corpora.
Lexical profiling has primarily been used within lexicography, and an example of an implementation for English is described in [14] and [15]. The basic technique involves calculating the strength of association of key relational collocates of a word in a part-of-speech-tagged corpus and subsequently listing these collocates ordered by relation and statistical salience. Figure 1 shows such a lexical profile, also known as a "wordsketch", of the term server in the British National Corpus (BNC)3.

3 http://www.natcorp.ox.ac.uk/ (March 4, 2005)


Figure 1 - Lexical profile of server as computed by SketchEngine4

4 http://www.sketchengine.co.uk/ (March 4, 2005)
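To make the listing step concrete, here is a minimal Python sketch (with hypothetical scores for server, not real BNC output; the actual association statistics are described in section 5.3) which groups precomputed (relation, collocate, score) triples by grammatical relation and orders each group by salience, essentially the layout of figure 1:

    from collections import defaultdict

    # Hypothetical association scores for "server"; a real profile would be
    # computed from a PoS-tagged corpus with the statistics of section 5.3.
    triples = [
        ("modifier", "web", 98.2), ("modifier", "proxy", 54.1),
        ("modifier", "mail", 47.3), ("object_of", "configure", 31.7),
        ("subject_of", "crash", 22.9), ("object_of", "reboot", 12.4),
    ]

    def word_sketch(rows):
        """Group (relation, collocate, score) rows by relation and sort
        each relation's collocates by descending statistical salience."""
        sketch = defaultdict(list)
        for relation, collocate, score in rows:
            sketch[relation].append((collocate, score))
        for collocates in sketch.values():
            collocates.sort(key=lambda pair: pair[1], reverse=True)
        return dict(sketch)

    for relation, collocates in word_sketch(triples).items():
        print(relation, collocates)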

So far wordsketching has only been applied to synchronic studies of lexical units, but the
technique seems very promising for diachronic studies of how terminological units behave in
general language corpora. Retrieving significant collocational changes from wordsketch to
wordsketch in successive time slices of a general language corpus will yield what we might
call a DiaSketch (Diachronic wordSketch). Classifying the speed and manner in which
changes might register in such a DiaSketch will then help us gain a greater knowledge of the
linguistic properties of determinologization and perhaps allow us to refine the definition
proposed in section 2.1. Before proceeding to an actual test run of the DiaSketch
implementation in section 6, sections 3 and 4 will summarize the theoretical debate on the
issue of termhood.

3 Terms vs. words

This section discusses the notion of termhood as a prelude to the summary of theoretical
approaches to terminology in section 4.

3.1 The ideal


A term is typically defined as a lexical unit which represents a concept inside a domain or "a verbal designation of a general concept in a specific subject field" (ISO 1087-1/ISO 12620). Ideally, there is a one-to-one mapping between a domain-specific concept and the term which designates or labels it. Although there may be terminological variants (like hard disk and hard drive in the domain of IT), these refer to the same clear-cut concept, and this synonymy, or superficial ambiguity [16] p. 55, does not impede efficient and unambiguous specialized communication between domain experts.
Ideally, a distinctive feature of terms, as opposed to words, is that the 1:1 correspondence between a term and the concept it labels is impervious to linguistic context, and the meaning of a term can thus be fully decoded irrespective of the context in which it was used. Words, on the other hand, represent fuzzy categories which partly overlap with adjacent categories, and for their meaning to be fully decoded one normally needs to consider both the linguistic and communicative contexts. In fact, statistical analysis of large amounts of actual usage is needed to cluster the meanings of a polysemous word into a list of predominant word senses, and such a list will necessarily be open-ended and highly dynamic due to the fundamental ambiguity [16] p. 55 of general language. When we move beyond neat, synchronic samples of expert-expert communication within a highly specialized subdomain, however, the clear-cut line between terms and words starts to blur.

3.2 Critical voices


Cabré's Theory of Doors (section 4.3) explains why we have yet to see a decisive theory of the term which allows us to distinguish it from the word. The research object of terminology is multidimensional, and the symbolic dimension (the level of the term) and the representational dimension (the level of the concept) do not, in themselves, give us the full picture. Since terms can rarely be distinguished from words by any formal means, as pointed out in [13] and [23], we need to consider their communicative function as well. The representational function of terms has been studied in detail within the conceptually oriented framework of the General Theory of Terminology (section 4.1), but Pearson argues that the communicative function of terms in actual usage has largely been ignored:
it is futile to propose differences between words and terms without reference to the
circumstances in which they are used [we need to consider] what happens when
terms are actually used in text rather than simply as labels for concepts in knowledge
structures [20] pp. 7-8
Kageura speculates along the same lines that termhood is perhaps more like an aspectual category [13] p. 26, and so does Cabré:

45

a lexical unit is by itself neither terminological nor general but [...] it is general by
default and acquires special or terminological meaning when this is activated by the
pragmatic characteristics of the discourse [my emphasis] [5] pp. 189-190
Sager has argued that the key to distinguishing terms from words is a theory of reference [21] and a multidimensional model of knowledge space. Words, he argued, map onto notions through general reference, while terms map onto concepts (more restricted segments of knowledge space) through special reference. Twenty years later, in [23], he re-emphasizes that terms are basically the end products of a double evolutionary process of abstraction and subsequent specification in natural language. The process starts with unique reference (proper names), the abstraction of which is general reference (nouns), and ends with special reference (terms).
To avoid misunderstanding and make it possible to enhance the knowledge of mankind, Sager argues that notions (or general representations) must be refined into concepts or bundles of judgment. From this perspective a term is then
the name given to a set of judgements considered pro tem as a unit representing a
scientifically defined concept [23] p. 53
By pro tem Sager acknowledges the dynamic nature of natural language and the fact that concepts may degenerate into notions through a usage-governed process of determinologization. Although it seems convincing that termhood is a function of context (rather than a property which is established a priori), the next section will introduce a counter-argument presented as an analogy.

3.3 Melby's analogy


In [16] Alan Melby introduces an interesting clay/stone analogy of the difference between words (lexical units) and terms (terminological units):
A word is thus a chunk of pliable clay and a term is a hard stone. One can think of a
stone as a blob of clay that has become transformed into a solid object through some
chemical process, just as a terminological unit receives a fixed meaning within the
context of a certain domain of knowledge. [16] pp. 52-53

Figure 2 - Melby's clay/stone analogy


In Melby's view termhood is not a gradable property:
the continuum between a very general text and a highly domain-specific text is not a
gray scale along which words gradually become terms and terms gradually become
words [...] The ratio of the mix may change gradually, but usually words are words
and terms are terms, and different processing applies to each. [16] p. 53
The reason why domain-specific concepts are non-overlapping and can be ordered into hierarchical conceptual structures (ontologies) is that they were defined using a metalanguage (general language) and are protected from conceptual fuzziness by a wall of conventions surrounding the domain:
When we create a narrow, well thought-out domain we build a wall around it so that
from inside the domain the universe appears to be orderly and computable [16] p. 101
Maintaining this wall allows domain experts to optimize the balance between three principles which are vital to achieving efficient specialist communication, namely precision, economy of expression and appropriateness [22]. Extreme precision could be achieved by always citing the complete definition of the given concept, but this is disallowed by the principle of economy of expression, and thus a compromise is gradually reached on the basis of appropriateness, which is essentially the norm established by domain experts over time. Ideally, this balance results in a means of communication which approximates an artificial (or controlled) language.
However, when a domain attracts the sustained interest of the general public, and terms from the domain are used by non-specialists outside the wall, we can no longer assume that the linguistic and communicative contexts have no bearing on the meaning of these terms. It is the hypothesis of this article that terms under these circumstances may come to represent fuzzy categories with a prototypical core, rather than clear-cut, non-overlapping concepts. In answer to the question posed at the beginning of this section: lexical units which function as terms in some contexts do not necessarily fulfil the criteria for termhood (such as special reference) in other contexts. The test run of the DiaSketch implementation in section 6 will provide more evidence against Melby's claim that termhood is not a gradable property.


4 Theories of terminology

This section will further elaborate the discussion of termhood by juxtaposing the viewpoints of four theoretical schools of terminology and finally position the DiaSketch approach in this theoretical framework.

4.1 General Theory of Terminology (GTT)
Central to the science of terminology has been the apex of the semantic triangle, namely the concept. Extra-linguistic reality (objects or referents) is classified by identifying the distinctive properties of the corresponding mental representations, storing these properties in attribute-value matrices and ordering the resulting clear-cut concepts into conceptual hierarchies or ontologies. In classical terminology (as advocated in the posthumously published works of Eugen Wüster [28]) concepts are thus static, universal and non-overlapping, and the position of a particular concept in a given hierarchy is precisely determined by its definition, which typically specifies a genus proximum (nearest superordinate concept) and differentia specifica (specific differences).
This objectivist approach to terminology is computationally tractable and has proven extremely successful in fields like knowledge engineering, ontology-based Information Retrieval and terminological standardization. The last few years, however, have seen a vigorous theoretical debate in which the explanatory adequacy of GTT as an all-embracing theory has been questioned. The attacks on GTT come from many branches of linguistics, establishing new terminological schools such as Socioterminology ([9], [10], [11]), Sociocognitive Terminology [26] and the Communicative Theory of Terminology [4].
While some scholars contest the very status of terminology as an independent scientific discipline:
there is no substantial body of literature which could support the proclamation of
terminology as a separate discipline and there is not likely to be. Everything of
importance that can be said about terminology is more appropriately said in the
context of linguistics, or information science or computational linguistics [22] p. 1
other scholars criticize GTT for ignoring actual usage:
Wüster developed a theory about what terminology should be [my emphasis] in order
to ensure unambiguous plurilingual communication and not about what terminology
actually is in its great variety and plurality [5] p. 167
Harsher critics claim that:
Traditional Terminology confuses principles, i.e. objectives to be aimed at, with facts
which are the foundation of a science. By raising principles to the level of facts, it
converts wishes into reality [26] p. 15
As long as one recognizes that the primary objectives of GTT are knowledge structuring and standardization, rather than a descriptive account of terminological usage in various contexts, I find this bias perfectly legitimate, however. The following sections will briefly review alternative paradigms in terminology and finally explicate the theoretical foundations of the present study on determinologization.

4.2 Sociocognitive Terminology

While the GTT approach must be counted among the positivist or objectivist theories of science, Sociocognitive Terminology, as advocated in Temmerman [26], is a hermeneutic or experientialist theory. The premise of experientialism is that reality does not exist independently of the perceiving subject. All knowledge comes from experience, and meaning cannot be completely objectified because it always involves a subject and is perceived and expressed through an inescapable filter (natural language). Inspired by recent findings in Cognitive Science which suggest that there is no clear separation between general and specialized knowledge, [26] thus claims that terms, more often than not, represent categories (notions in Sager's terminology) which are as fuzzy and dynamic as those represented by words.


Temmerman argues that clear-cut concepts which are not prototypical to some extent are extremely rare outside of exact sciences like Mathematics and Chemistry [26] p. 223. The analytical (intensional) definitions used in GTT are thus often inadequate because prototypical categories with gradable membership cannot be captured in a logical or ontological structure. The core of Temmerman's criticism of GTT is that it rejects the unity of the linguistic sign by dissociating form (the term) from content (the concept) and thus reducing terms to context-independent labels for things.
In her Sociocognitive Terminology Temmerman speaks of Units of Understanding (UU), rather than of concepts. These UUs typically have prototype structure and are in constant evolution. UUs can rarely be intensionally defined but should be interpreted by means of templates of understanding which are composed of different modules of information depending on the receiver and the context.
On the whole I agree with Temmerman that GTT needs to be extended in various directions to achieve descriptive and explanatory adequacy, and I believe it is correct that terms, in certain contexts, represent categories as fuzzy as those represented by words, but I disagree with her that this is the general case. I think it requires a process of determinologization, and this process is only initiated when the domain to which the term belongs catches the interest of non-specialists. While the Sociocognitive approach to terminology shakes the very foundations of classical terminology, the Communicative Theory of Terminology outlined in the next section is much more inclusive of GTT.

4.3 Communicative Theory of Terminology

Cabré ([4], [5]) claims that the research object of terminology is not concepts, nor units of understanding, but rather Terminological Units (TU).
At the core of the knowledge field of terminology we, therefore, find the terminological
unit seen as a polyhedron with three viewpoints: the cognitive (the concept), the
linguistic (the term) and the communicative (the situation) [5] p. 187
While GTT accounts for one dimension of the terminological polyhedron, namely the conceptual one, it fails to consider the other dimensions. This does not mean that GTT is flawed, because TUs are such complex and multidimensional phenomena that they can hardly be accessed on all fronts at once. It does mean, however, that GTT can only be an ancillary component in a more comprehensive theory, the outline of which has only recently manifested itself. Cabré argues that:
it is impossible to account for the complexity of terminology within a single theory [...]
a number of integrated and complementary theories are required which deal with the
different facets of terms [3] pp. 12-13
Although Sager already discussed the communicative dimension of terms in [22], Cabré reiterates his arguments and calls for a Communicative Theory of Terminology (CTT) in which "each one of the three dimensions [the cognitive, linguistic and communicative], while being inseparable in the terminological unit, permits a direct access to the object" [5] p. 187.

4.4 Socioterminology
Like Sociocognitive Terminology and CTT, Socioterminology, as outlined in Gambier ([9], [10]) and Gaudin [11], also argues that GTT needs to be extended. Socioterminology is basically a functionalist approach to terminology, which stipulates that we should include contextual factors like language change and social practices in the study of terminology:
Un terme ne peut pas être vu seulement par rapport à un système (adéquation de la
désignation, rattachement à un réseau de notions) : il est aussi à voir dans son
fonctionnement, sur le terrain des contradictions sociales. (Qui utilise quoi ? Qui
innove ? Comment et par qui les termes se diffusent-ils ? Comment s'opèrent les
réajustements terminologiques, les reformulations ? Etc.)5 [10] p. 320

5 A term cannot be viewed exclusively with respect to a system (the adequacy of the designation, inclusion in a network of concepts): it should also be viewed with respect to its function in the field of social contradictions. (Who uses what? Who innovates? How and by whom are terms spread? How are terminological readjustments and reformulations brought about? Etc.)

Its focal point is thus linguistic reality, or terminological performance in Chomskyan terms, rather than terminological standardization or knowledge structuring as such:
En rupture avec les usages traditionnels : consultation d'experts, travaux sur les corpus
limités, ignorance de la dimension orale, une attitude plus linguistique (la linguistique
étant essentiellement une science descriptive) suppose que les termes soient étudiés
dans leur dimension interactive et discursive6. [11] p. 295

6 Breaking with traditional approaches (consulting experts, working with limited corpora, ignoring the oral dimension), a more linguistic attitude (linguistics being essentially a descriptive science) presupposes that terms are studied in their interactive and discursive dimension.
This admittedly simplified survey of GTT and three newer theoretical schools shows how the
frameworks of sociolinguistics, cognitive science and communication theory are being
applied to terminology to increase the explanatory adequacy of classical terminological
theory. The following section will describe the theoretical foundations of the DiaSketch
approach to determinologization.

4.5 The DiaSketch approach

Having summed up the viewpoints of GTT, CTT, Sociocognitive Terminology and Socioterminology, it is now time to position the DiaSketch approach in this theoretical framework. Owing to its principles of synchrony and monosemy, the evolution of terminological meaning and a phenomenon like determinologization cannot be studied in a GTT framework. The framework of Sociocognitive Terminology as presented in [26] does not seem attractive from a computational, corpus linguistic perspective, since Temmerman's Units of Understanding do not seem to offer a coherent alternative to the conceptual analysis of classical terminology. As a corpus-driven approach, the theoretical foundations of the DiaSketch are best described as a mixture of CTT and Socioterminology: CTT highlights the impact that communicative context has on terminological meaning, and Socioterminology stresses the functional aspects of terms. These aspects are reflected in the composition of the corpora on which the DiaSketch analyses are based (see section 5.2).

5 Methodology and issues

A corpus-based description of how domain-specific terms are used in general language faces two obvious problems:
1. since terms are specific to a domain, we must expect their frequency of occurrence to be relatively low outside the domain in question;
2. when lexical units which function as terms in specialized discourse (e.g. bus, server, driver) occur in non-specialized contexts, the most frequent senses are likely to be the non-specialized ones.
While the context of the domain allows us to presume monosemy, language outside Melby's wall is rife with polysemy. Mother terms thus need to be disambiguated (sense tagged) before any reliable DiaSketching can take place.
Table 1 lists the twenty terms from the ANSDIT8 terminology compilation which have the highest average relative frequency in a 1.8M word fragment of a specialized corpus (PcPlus) and a 64M word fragment of a newspaper corpus (New York Times). Not surprisingly, the lemmas seem to represent very superordinate concepts, which all - with the possible exception of software, PC, computer, Internet and CD - have one or more general language senses in addition to the domain-specific sense. These general language senses may cause more or less noise (the majority of the windows in the New York Times are physical ones), but it is obvious that candidate mother terms need to be sense-tagged before they can be subjected to diasketching.

Table 1 - Terms from ANSDIT present in both LSP and LGP corpora, ranked by average relative frequency in 2001 (2000 rank in parentheses); query: [lemma=term & pos="N.*"]

     Lemma             NYT (abs)  PcP (abs)  NYT (rel)7  PcP (rel)  Avg. rel.
1    window (2)        6013       3946       92.7        2113.1     1102.9
2    PC (1)            2409       3823       37.1        2047.3     1042.2
3    file (4)          5001       3481       77.1        1864.1     970.6
4    software (5)      5714       2385       88.1        1277.2     682.7
5    image (13)        7301       2303       112.6       1233.3     672.9
6    user (7)          5021       2250       77.4        1204.9     641.2
7    information (10)  20778      1373       320.4       735.3      527.8
8    Internet (6)      17633      1399       271.9       749.2      510.5
9    feature (9)       10149      1493       156.5       799.5      478.0
10   service (11)      24993      1044       385.4       559.1      472.2
11   code              4131       1538       63.7        823.6      443.7
12   computer (16)     14812      1228       228.4       657.6      443.0
13   button (15)       1137       1345       17.5        720.3      368.9
14   screen (19)       4244       1207       65.4        646.4      355.9
15   object (12)       2073       1243       32.0        665.6      348.8
16   CD (17)           2872       1211       44.3        648.5      346.4
17   web (161)         1910       1212       29.5        649.0      339.2
18   server (14)       902        1221       13.9        653.9      333.9
19   memory (18)       4716       1066       72.7        570.9      321.8
20   package (26)      5174       1041       79.8        557.5      318.6

7 Relative frequencies are given in occurrences per million running words.
8 American National Standard Dictionary of Information Technology (app. 5,500 terms), www.incits.org

5.1 The Yarowsky algorithm

Sense tagging can be performed using supervised or unsupervised methods. Since supervised methods presuppose a large corpus which has been manually disambiguated, they are labour-intensive and not appealing in the present case. In his landmark paper from 1995, Yarowsky [29] proposes and evaluates a semi-unsupervised WSD algorithm which achieves accuracy rates rivaling those of the best supervised methods. It does so by making two very simple but powerful assumptions, namely that polysemy is restricted by the fact that polysemous lexical units typically have one sense per discourse and one sense per collocation. Thanks to these two assumptions, the algorithm only needs a few seed collocates (sense indicators), and can then use the assumptions as "bridges" to new contexts from which more (or better) sense indicators can be retrieved in an iterative fashion.
The basic steps in the Yarowsky algorithm are as follows (a minimal sketch of steps 3-6 is given after table 3):
1. identify all occurrences of the polysemous word and store the contexts
2. for each possible sense identify a seed word (e.g. through vs. pop-up for window)
3. using these sense indicators, extract a seed set of manually disambiguated occurrences (table 2)
4. for each collocation type in the seed set compute log(P(sense_a|collocation_i) / P(sense_b|collocation_i)), estimated as log((f(sense_a, collocation_i) + 0.5) / (f(sense_b, collocation_i) + 0.5))9
5. order the collocations by the absolute value of the log-likelihood ratio to get a decision list (table 3)
6. apply the decision list to all contexts from step 1 to classify more occurrences
7. optionally apply the one-sense-per-discourse assumption to classify more occurrences
8. iterate steps 4-7
9. stop when the decision list is unchanged; new data can now be sense tagged

Table 2 - Seed set of manually disambiguated occurrences of window

at 5 a.m. when the bullet crashed through the <window> , law enforcement officials said . The    phys
When your turn comes , bark your order through the <window> . If for any reason they ignore     phys
she was shooting skeet , staring from the castle <window> , looking through the gloaming at t   phys
little children . The sea is visible through the <windows> of the room where these gargantuan   phys
or zoom-in tools that left me stranded in a pop-up <window> . My computer never crashed .       comp
lure to get you to wade through a swamp of pop-up <windows> and banners hawking their pr        comp
, often used on Web sites to create pop-up <windows> and navigational aids , can be embedd      comp
...

9 Additive smoothing (+0.5) is used to avoid division by zero.

Table 3 - Initial decision list based on a seed set of 32 disambiguated occurrences of the polysemous word window

LogL   phys. sense  comp. sense  collocation    pos  sense
-1.46  0            14           pop-up         any  comp
-1.40  0            12           pop-up         -1   comp
 1.36  11           0            through        -2   phys
 1.28  9            0            the            -1   phys
 0.85  3            0            visible        any  phys
 0.85  17           2            through        any  phys
-0.70  0            2            create         any  comp
-0.70  0            2            programs       any  comp
-0.70  0            2            computer       any  comp
 0.70  2            0            stained-glass  any  phys
...
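The following Python fragment is a minimal sketch of steps 3 through 6 under simplifying assumptions: contexts are pre-tokenized, only single-word collocations are considered (ignoring the positional constraints shown in the pos column of table 3), and a toy seed set in the spirit of table 2 is hard-coded. It computes the smoothed base-10 log-likelihood ratios of table 3 and applies the resulting decision list:

    import math
    from collections import Counter

    # Toy seed set in the spirit of table 2: (context tokens, sense label).
    seed = [
        (["crashed", "through", "the", "window"], "phys"),
        (["staring", "from", "the", "castle", "window"], "phys"),
        (["stranded", "in", "a", "pop-up", "window"], "comp"),
        (["create", "pop-up", "windows", "and", "navigational", "aids"], "comp"),
    ]

    def build_decision_list(labeled):
        """Steps 4-5: score every collocate with the smoothed ratio
        log10((f(phys)+0.5) / (f(comp)+0.5)) and sort by absolute value."""
        phys, comp = Counter(), Counter()
        for tokens, sense in labeled:
            for tok in tokens:
                if tok.startswith("window"):
                    continue  # skip the node word itself
                (phys if sense == "phys" else comp)[tok] += 1
        scores = {t: math.log10((phys[t] + 0.5) / (comp[t] + 0.5))
                  for t in set(phys) | set(comp)}
        return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

    def classify(tokens, decision_list):
        """Step 6: the highest-ranked collocate found in the context decides."""
        for collocate, score in decision_list:
            if collocate in tokens:
                return "phys" if score > 0 else "comp"
        return None  # undecided; left for a later iteration

    decision_list = build_decision_list(seed)
    print(classify(["a", "pop-up", "window", "appeared"], decision_list))  # comp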

This kind of (virtually) unsupervised approach will be used to sense tag all occurrences of candidate mother terms in the final implementation of the DiaSketch, but even DiaSketches of candidate mother terms which have not been semantically disambiguated may yield interesting results, as can be seen in the case study of section 6.

5.2

Corpus annotation and CQL

In a study of determinologization, which is essentially an aspect of language change, the


dimension of time is of course the main variable. To gain a better understanding of the
linguistic properties of conceptual fuzziness, however, the dimension of text type should also
be examined. The corpora listed in table 4 are included in the study and represent highly
specialized discourse (computer science papers from The Computer Journal 10 ), popular
science (the British computer magazine PcPlus 11 ), technical, online discourse (newsgroup
postings) and general language (newspaper
corpora from the Gigaword corpus which
includes New York Times).
In order to identify relational cooccurrences the corpora are PoS tagged (Penn
12

tagset) and lemmatized with the TreeTagger


and

subsequently

phrase

chunked

with

Yamcha13. An example of an annotated corpus


fragment can be seen in table 5 (where B
indicates the beginning of a phrase and I
indicates a non-boundary). The corpora are
finally converted into the special format
required by Corpus WorkBench [24]. CWB

Table 3 initial decision list based on a


seed set of 32 disambiguated occurrences of the polysemous word window
phys. comp.
LogL sense sense collocation pos sense
-1.46 0
14
pop-up
any comp
-1.40 0
12
pop-up
-1 comp
1.36 11
0
through
-2 phys
1.28 9
0
the
-1 phys
0.85 3
0
visible
any phys
0.85 17
2
through
any phys
-0.70 0
2
create
any comp
-0.70 0
2
programs
any comp
-0.70 0
2
computer
any comp
0.70 2
0
stained-glass any phys
...
....
...
...
... ...

includes a Corpus Query Language (CQL),


which allows sophisticated queries using regular expressions over combinations of positional
and/or structural attributes. In the case of the DiaSketch implementation we use the four
positional attributes token, PoS, lemma and chunk.

10

http://www3.oup.co.uk/computer_journal/ (March 7, 2005)


http://www.pcplus.co.uk/ (March 7, 2005)
12
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger (March 4, 2005)
13
http://chasen.org/~taku/software/yamcha (March 4, 2005)
11

55
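As an illustration of the annotation format, the sketch below (assuming a tab-separated vertical file laid out as in table 5; the file name and encoding are hypothetical) reads such a corpus slice into (token, PoS, lemma, chunk) tuples, which is all the later extraction steps need:

    from typing import Iterator, NamedTuple

    class Position(NamedTuple):
        token: str
        pos: str
        lemma: str
        chunk: str

    def read_vertical(path: str) -> Iterator[Position]:
        """Yield one (token, PoS, lemma, chunk) tuple per corpus position,
        assuming the tab-separated layout of table 5."""
        with open(path, encoding="latin-1") as f:   # assumed encoding
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) == 4:                # skip structural markup
                    yield Position(*fields)

    # e.g. collect all common noun positions from a (hypothetical) slice:
    # nouns = [p for p in read_vertical("nyt_1999.vrt") if p.pos.startswith("NN")]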

5.3 Contingency tables and UCS

Stefan Evert [8] describes the benefits and pitfalls of a number of statistical association measures implemented by Evert himself in the Utilities for Co-occurrence Statistics (UCS)14 toolkit. He favours relational co-occurrence over simple positional co-occurrence because the former reduces noise from grammatically unrelated n-grams and leads to more meaningful results [7]. The UCS tables generated by Evert's Perl scripts are called frequency signatures and contain four values: the joint frequency (O11), the marginal frequencies (O12, O21) and the sample size (N). This corresponds to a classical four-celled contingency table like table 6, where O11 is the number of times partition and server co-occur in the sample (of size N) of all noun bigrams in the corpus, O12 is the number of times partition occurs with another noun in this sample and O21 the number of times server occurs with another noun.

Table 6 - Contingency table

               v = server         v ≠ server
u = partition  O11                O12
u ≠ partition  O21                O22
u = partition  E11 = (R1*C1)/N    E12 = (R1*C2)/N
u ≠ partition  E21 = (R2*C1)/N    E22 = (R2*C2)/N

Comparing the observed frequencies with the expected frequencies (E11-E22) provides a measure of the strength of association of the two lemmas. This association measure can be based on a number of statistical models, but in the case study in section 6 Evert's implementation of Fisher's exact test is used, because it provides p-values which are not approximations and because it "is now generally accepted as the most appropriate test for independence in a 2-by-2 contingency table"15. With these p-values it is straightforward to enforce a cut-off level of significance (for example p < 10^-6).

14 http://www.collocations.de (March 4, 2005)
15 Evert (2004) - http://www.collocations.de/AM/ (March 13, 2005)
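To illustrate the computation, the following sketch derives the full contingency table from the four values of a frequency signature and obtains the p-value from Fisher's exact test. It uses scipy as a stand-in for Evert's UCS implementation, and the counts are hypothetical:

    import math
    from scipy.stats import fisher_exact

    def association_strength(o11, o12, o21, n):
        """Complete the 2-by-2 contingency table of table 6 and return the
        negative base-10 logarithm of the Fisher exact test p-value."""
        o22 = n - o11 - o12 - o21
        _, p = fisher_exact([[o11, o12], [o21, o22]])
        return -math.log10(p)

    # Hypothetical counts for (partition, server) among all noun bigrams:
    strength = association_strength(o11=42, o12=310, o21=760, n=2_000_000)
    print(strength, strength > 6)   # passes the p < 10^-6 cut-off?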


5.4 Computing DiaSketches with CQL and UCS

The original wordsketches as implemented in the Sketch Engine [15] plot statistically salient co-occurrence pairs from a set of some twenty grammatical relations. The present implementation of the DiaSketch accepts only nouns as input and identifies significant co-occurrences which are noun or adjective modifiers of the node, or predicates which subcategorize for the node as subject or object. In the case of the SUBJ_OF relation, a CQL query like

[pos="NN.*"][pos="WDT"]?[chunk=".*-VP"]*[pos="VV[ZPDG]?"]17

will retrieve (virtually) all these relations in the given corpus slice.

Table 7 - SUBJ_OF examples from the New York Times

agency/NN/I-NP has/VHZ/B-VP been/VBN/I-VP recruited/VVN/I-VP to/TO/I-VP help/VV/I-VP convince/VV/I-VP
model/NN/I-NP that/WDT/B-NP made/VVD/B-VP
schools/NNS/I-NP serving/VVG/B-VP
industry/NN/I-NP might/MD/B-VP not/RB/I-VP exist/VV/I-VP

By specifying that the PoS of the rightmost verb must be simple present (VVZ), simple past (VVD), gerund (VVG) or infinitive (VVP), we filter out passives (which have PoS=VVN) from this set of SUBJ_OF relations. Moreover, by setting the matching strategy to "longest", we make sure that we get the main verb of complex VPs like to help convince (cf. table 7). If longer relative clauses intervene, however, the pattern will simply match the first main verb. This can only be avoided by carrying out computationally expensive full parsing.
The lemma pairs (i.e. agency/convince, model/make etc.) are then extracted and piped into UCS, yielding the N value mentioned in section 5.3. In case we want a sketch of the term server, the general query is simply transformed into a specific query by substituting [lemma="server"] for [pos="NN.*"]. Such a query will then provide all the O11, O12 and O21 values needed to complete the contingency table and compute the co-occurrence statistics.

17 A noun in plural or singular, possibly followed by a relative pronoun, any number of VP chunk elements and finally a mandatory full verb.
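Putting the pieces together, the sketch below (hypothetical data and function names; the real implementation runs CQL queries against CWB and pipes the pairs into UCS) counts O11, O12 and O21 for one node lemma over a list of extracted (noun, verb) lemma pairs and keeps the collocates that pass both the significance cut-off and the 2% marginal-frequency filter used in section 6:

    import math
    from collections import Counter
    from scipy.stats import fisher_exact

    def significant_collocates(pairs, node, min_logp=6.0, min_share=0.02):
        """pairs: (noun_lemma, verb_lemma) tuples for one relation, e.g.
        SUBJ_OF. Returns the verbs significantly associated with the node."""
        n = len(pairs)                             # sample size N
        node_verbs = Counter(v for u, v in pairs if u == node)
        verb_totals = Counter(v for _, v in pairs)
        node_total = sum(node_verbs.values())      # marginal frequency of node
        results = []
        for verb, o11 in node_verbs.items():
            o12 = node_total - o11                 # node with another verb
            o21 = verb_totals[verb] - o11          # verb with another noun
            o22 = n - o11 - o12 - o21
            _, p = fisher_exact([[o11, o12], [o21, o22]])
            if -math.log10(p) > min_logp and o11 > min_share * node_total:
                results.append((verb, round(-math.log10(p), 1)))
        return sorted(results, key=lambda vs: vs[1], reverse=True)

    # Toy data: "server" strongly selects "crash" as its subject.
    pairs = ([("server", "crash")] * 12 + [("server", "run")] * 3 +
             [("agency", "convince")] * 40 + [("model", "make")] * 50)
    print(significant_collocates(pairs, "server"))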


6 Case study: "server"

According to the MacMillan English Dictionary for Advanced Learners (2002) the lexical unit server has five senses:
1. a computer
2. player who starts to play
3. large spoon/fork/etc
4. sb who helps in church
5. sb who brings food
Judging by the collocations in the general language DiaSketch for server (figure 4), senses 2 through 5 seem to be virtually absent (except for the collocation altar server, which is indicative of sense 4), so in this case semantic tagging was not a critical issue. Defining what counts as a significant collocation can be difficult, but in this case study we only include those co-occurrences which defeat the null hypothesis at a significance level of p < 0.000001 (or -log10(p) > 6)18 and where the number of co-occurrences (O11) exceeds 2% of the marginal frequency of the term in question (O21). While the absolute number of occurrences of the lemma server in this relation is approximately the same in the two corpora (some 800 per time slice), the relative frequencies of course differ tremendously.
While figure 3 charts the most significant modifiers of the term server through time in a 6M word slice of the British computer magazine PcPlus, figure 4 lists the most significant modifiers of the same term in a 915M word slice of the New York Times (NYT) corpus. Strength of association, as measured in negative logarithmic p-values, is indicated along the y-axis, and the collocation candidates are listed along the x-axis. A striking difference between the two figures is that the three noun modifiers network, Internet and computer are prominent collocations in the newspaper corpus (throughout the timeframe) but do not occur in the DiaSketch for the corpus of computer magazines. While network server and Internet server are infrequent19 variants of the term web server (which is highly salient in both corpora), computer server is a case of determinologized usage. In this compound the noun modifier is used as a kind of domain label to distinguish the IT sense of the head noun from other senses possible outside this domain (for example the altar server). The expression is no more fuzzy than server on its own, but the modifier would be redundant in specialist communication and might cause noise in an ATR system.
The only example of a collocation referring to a fuzzy category is powerful server, which climbs above the strict threshold values in two time slices of the NYT corpus (Fall of 1994 and 1999). While the adjectival modifier in PcPlus (virtual) combines with the mother term to form a clear-cut subordinate concept in a generic relation to server, the collocation of powerful with server modulates the special reference of server so that the combined phrase no longer refers to a clear-cut domain-specific concept. What exactly is denoted by powerful server? Does powerful refer to the storage capacity of the server as measured in gigabytes, to its clock rate as measured in MHz, or rather to the data transmission rate of its network interface as measured in gigabits per second?

18 Mail correspondence with Stefan Evert has led me to believe that this is the standard significance level for collocation strength as computed by means of contingency tables.
19 http://scholar.google.com (May 4, 2005): computer server (680 hits), Internet server (3,320 hits), network server (3,800 hits), web server (62,800 hits)
Figure 3 - Modifiers of the term server in PcPlus (2000-2003). [Chart: association strength (-log p, 0-120) on the y-axis; the collocates mail, application, proxy, web, your, FTP, DNS, font, NIS, central and virtual on the x-axis; one data series per time slice 2000-2003.]

Figure 4 - Modifiers of the term server in New York Times (1994-2001)


7 Conclusion

Based on a lengthy discussion of the theoretical foundations of terminology (sections 3 and 4), we arrived at 1) a functional definition of the term in which termhood is contingent upon context and 2) an operational definition of determinologization as the process by which the relational co-occurrence patterns of a term20 in non-specialized discourse come to resemble those of comparable lexical units from the general vocabulary. This contextual theory of terminological meaning was the basis for the implementation of the DiaSketch (section 5), with which an example of conceptual fuzziness was identified by extracting salient relational collocations of the term server in a specialized and a general language corpus, respectively (section 6). Judging by the DiaSketches of the term server in a 9-year fragment of the New York Times corpus, we must conclude that Melby's stone/clay analogy as described in section 3.3 seems to be inaccurate and that we are dealing with a greyscale where lexical units which invoke clear-cut concepts in some contexts come to represent increasingly prototypical categories in other contexts.

7.1 Further work

Although the DiaSketches of a single term seem to indicate that termhood is gradable, usage patterns of a wider range of (sense tagged) mother terms21 need to be analyzed to see if the tendencies from the case of server can be generalized. While examining co-occurrence patterns for relations like SUBJ_OF might provide a richer description of determinologized usage, this may not be feasible due to data sparseness. However, it seems likely that even identifying simple positional co-occurrence patterns will make it possible to improve the precision of ATR software using the Internet as a corpus. A list of modifiers typically used with mother terms in non-specialized discourse could for example be used as a document classification device or a simple filtering device. The possible increase in precision brought about by such a filter would then need to be evaluated by running the web-based ATR system with and without the filter.
Finally, it would be interesting to see if certain terms are less context sensitive than others. Are terms which have been terminologized by metaphor (e.g. bus, icon, mouse) more susceptible to subsequent determinologization than terms formed by formal neology (e.g. byte) or by compounding (e.g. operating system)?

20 Strictly speaking, a lexical unit which functions as a term in specialized communicative contexts.
21 The list in table 1 could be a starting point.


References
[1] Barry, John A. (1993) Technobabble, MIT Press
[2] Baroni, Marco (2004) "BootCaT: Bootstrapping Corpora and Terms from the Web" In: Proceedings of LREC 2004
[3] Cabré Castellví, María Teresa (1999) "Do We Need an Autonomous Theory of Terms?" In: Terminology 5:1, pp. 5-19, John Benjamins
[4] Cabré Castellví, María Teresa (2000) "Elements for a theory of terminology: Towards an alternative paradigm" In: Terminology 6:1, pp. 35-57, John Benjamins
[5] Cabré Castellví, María Teresa (2003) "Theories of terminology - their description, prescription and explanation" In: Terminology 9:2, pp. 163-199, John Benjamins
[6] Cruse, D. A. (1986) Lexical Semantics, Cambridge University Press
[7] Evert, Stefan; Brigitte Krenn (2003) "Computational approaches to collocations" Introductory course at the European Summer School on Logic, Language and Information (ESSLLI)
[8] Evert, Stefan (2004) The Statistics of Word Co-occurrences: Word Pairs and Collocations, PhD thesis, University of Stuttgart
[9] Gambier, Yves (1991) "Travail et vocabulaire spécialisés : prolégomènes à une socioterminologie" In: Meta 36(1), pp. 8-15, Les Presses de l'Université de Montréal
[10] Gambier, Yves (1987) "Problèmes terminologiques des pluies acides : pour une socioterminologie" In: Meta 32(3), pp. 314-320, Les Presses de l'Université de Montréal
[11] Gaudin, François (1993) "Socioterminologie : du signe au sens, construction d'un champ" In: Meta 38(2), pp. 293-301, Les Presses de l'Université de Montréal
[12] Järvi, Outi (2001) "From Precise Terms to Fuzzy Words - from Bad to Worse in Terminology Science?" In: IITF Journal vol. 12, no. 1-2, pp. 85-88
[13] Kageura, Kyo (2002) The Dynamics of Terminology - A Descriptive Theory of Term Formation and Terminological Growth, John Benjamins
[14] Kilgarriff, Adam; David Tugwell (2001) "WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography" In: Proceedings of ACL 2001, Toulouse, France, pp. 32-38
[15] Kilgarriff, Adam (2004) "The Sketch Engine" In: Proceedings of the 11th EURALEX International Congress
[16] Melby, Alan K. (1995) The Possibility of Language, John Benjamins
[17] Meyer, Ingrid; Kristen Mackintosh (2000) "When terms move into our everyday lives: An overview of de-terminologization" In: Terminology 6:1, pp. 111-138, John Benjamins
[18] Meyer, Ingrid; Kristen Mackintosh (2000) "L'étirement du sens terminologique : aperçu du phénomène de la déterminologisation" In: Le Sens en Terminologie, ed. by Henri Béjoint and Philippe Thoiron, Presses universitaires de Lyon, pp. 198-217
[19] Myking, Johan (2000) "Sosioterminologi - Ein modell for Norden?" In: I Terminologins tjänst - Festskrift för Heribert Picht på 60-årsdagen, University of Vaasa
[20] Pearson, Jennifer (1998) Terms in Context, John Benjamins
[21] Sager, Juan C.; David Dungworth (1980) English Special Languages, Brandstetter Verlag, Wiesbaden
[22] Sager, Juan C. (1990) A Practical Course in Terminology Processing, John Benjamins
[23] Sager, Juan C. (1999) "In search of a foundation: Towards a theory of the term" In: Terminology 5:1, pp. 41-57, John Benjamins
[24] Schulze, Bruno M. (1994) Entwurf und Implementierung eines Anfragesystems für maschinelle Textcorpora, Master's thesis, Institut für maschinelle Sprachverarbeitung (IMS), Stuttgart University
[25] Schütze, Hinrich; Christopher D. Manning (1999) Foundations of Statistical Natural Language Processing, MIT Press
[26] Temmerman, Rita (2000) Towards New Ways of Terminology Description: The Sociocognitive Approach, John Benjamins
[27] Wittgenstein, Ludwig (1997) Philosophical Investigations, translated by G. E. M. Anscombe, Basil Blackwell, orig. 1953
[28] Wüster, Eugen (1991) Einführung in die allgemeine Terminologielehre und terminologische Lexikographie, Bonn: Romanistischer Verlag, 3. Auflage, orig. 1979
[29] Yarowsky, David (1995) "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods" In: Proceedings of ACL 33, pp. 189-196

