
Corpus Linguistics and Stylistics
PALA Summer School, Maribor, 2014

In this lecture...
Stylistics and style
Combining stylistics + corpus linguistics
Examples of studies combining corpus
linguistics and stylistics
Analysis of genres
Analysis of the works by particular authors
Analysis of individual texts
Analysis of variation inside texts

Corpus Tools
WMatrix

Stylistics
Stylistics is the study of
literature using methods,
theories and concepts from
linguistics
(Leech and Short 2007: 1)
It is "[...] the study of the
relationship between linguistic
form and literary function [...]"
(Leech and Short 2007: 3).

Linguistic style
Style is the way in which
language is used
(Leech and Short 2007: 31)
[S]tyle consists in choices
made from the repertoire
of the language.
(Leech and Short 2007: 31)

Linguistic style
Stylistic choice is
limited to those aspects
of linguistic choice which
concern alternative
ways of rendering the
same subject matter
(Leech and Short 2007: 31)
e.g. horse vs. steed but
not horse vs. dog

Linguistic style
Style and genre, e.g. science fiction,
romance novels, etc.
Style and author
Style and text
Style and parts of texts (e.g. the
narration or speech of different
characters)

Ways of analysing style


Analysts' intuitions
Manual comparative analysis

Ways of analysing style


Style and comparison
Even if style is defined as that variety of
language which correlates with context, the
recognition and analysis of styles are
squarely based on comparison. The essence
of variation, and thus of style, is difference,
and differences cannot be analysed and
described without comparison.
(Enkvist 1973: 21)

Ways of analysing style


Manual comparative analysis:
OK for shorter texts/extracts

Comparative analysis using computers:


Corpus linguistic methods/tools
Especially useful for longer texts,
e.g. prose fiction

Combining corpus linguistics and stylistics
The "corpus turn" (Leech and Short
2007: 284).
Ongoing trend in stylistics to use
methods and tools from corpus linguistics
for the analysis of literary and other texts.
Usually referred to as "corpus stylistics"
Other terms:
digital stylistics (Louw 2008)
electronic text analysis (Adolphs 2006)

Examples of studies
Combining corpus linguistics and
stylistics
Analysis of genres
Analysis of the works by particular authors
Analysis of individual texts
Analysis of variation inside texts

Genre style
Biber (1988): multivariate statistical techniques
factor analysis
many different variables
variables = linguistic features (e.g. passive
constructions)

e.g. narrative versus non-narrative texts


important variables = past tense verbs, 3rd
person pronouns, perfect aspect, present
participle clauses
High scores = narrative
Low scores = non-narrative

A range, not a dichotomy

Between the top text-types and the bottom
text-types there exists a whole range of
text-types in the middle: it's not just a
two-way distinction.
Note also that spoken and written genres are
mixed together along the dimension
narrative / non-narrative

Genre style: direct speech


Corpus-based study of
speech, writing and
thought presentation
(Semino and Short
2004)

Genre style: direct speech


Corpus of approx. 260,000 words of
(late) 20th century written British
English
120 text samples of approx. 2,000
words each, amounting to a total of
258,348 words.

Genre style: direct speech


Corpus divided into three sections:
prose fiction (87,709 words),
newspaper news reports (83,603 words), and
biography and autobiography (87,036 words)

Each genre section is further divided
into a "serious" and a "popular" subsection.

Genre style: direct speech


Corpus tagged manually
<sptag cat=NRS next=DS s=0.37 w=7>
The theme park's manager, Mike Slattery
said:
<sptag cat=DS next=NRS s=1.63 w=18>
By closing Crinkley Bottom, the council
has shot Morecambe in the foot. And I'm
out of a job.
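
To make the annotation format concrete, here is a minimal Python sketch that reads the sample above and counts stretches per category. It assumes that every stretch of text is introduced by an <sptag ...> element with unquoted attribute=value pairs, exactly as on the slide. The function name parse_sptags is hypothetical, and the reading of w as a word count is an assumption based on the sample (the slide does not define the attributes).

```python
import re
from collections import Counter

# Sample annotation copied from the slide; attribute values are unquoted.
SAMPLE = """<sptag cat=NRS next=DS s=0.37 w=7>
The theme park's manager, Mike Slattery
said:
<sptag cat=DS next=NRS s=1.63 w=18>
By closing Crinkley Bottom, the council
has shot Morecambe in the foot. And I'm
out of a job.
"""

TAG_RE = re.compile(r"<sptag\s+([^>]*)>")


def parse_sptags(text):
    """Return one (attributes, stretch_text) pair per <sptag ...> element."""
    # With one capture group, split() yields:
    # [text_before_first_tag, attrs1, text1, attrs2, text2, ...]
    parts = TAG_RE.split(text)
    units = []
    for attrs, body in zip(parts[1::2], parts[2::2]):
        attr_dict = dict(pair.split("=", 1) for pair in attrs.split())
        units.append((attr_dict, " ".join(body.split())))
    return units


units = parse_sptags(SAMPLE)
category_counts = Counter(attrs["cat"] for attrs, _ in units)

for attrs, body in units:
    # Compare the annotated w value with a simple whitespace word count.
    print(attrs["cat"], attrs["w"], len(body.split()))
```

Counting the cat values over a whole annotated corpus in this way would yield frequency tables of the kind reported for DS.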

Genre style: direct speech

Section of the corpus    Number of instances of DS
Whole corpus             2,974
Fiction                  1,569
Press                    770
(Auto)biography          635

Fiction sub-section      Number of instances of DS
Serious                  629
Popular                  940
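
Because the three sections differ slightly in size, raw counts are easier to compare once they are normalised. The sketch below is an illustration only: it takes the section sizes and DS counts reported on these slides and converts them to instances per 1,000 words. The helper name per_thousand is hypothetical, and the slides themselves report raw counts only.

```python
# Section sizes (in words) and DS counts as reported on the slides.
words = {"Fiction": 87_709, "Press": 83_603, "(Auto)biography": 87_036}
ds = {"Fiction": 1_569, "Press": 770, "(Auto)biography": 635}


def per_thousand(count, total_words):
    """Normalised frequency: instances per 1,000 words."""
    return 1000 * count / total_words


rates = {section: per_thousand(ds[section], words[section]) for section in words}

for section, rate in rates.items():
    print(f"{section:16s} {rate:5.2f} DS per 1,000 words")
```

On these figures, fiction has roughly 17.9 instances of DS per 1,000 words, against about 9.2 for the press and 7.3 for (auto)biography.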

Authorial style
Studies attempting to fingerprint authors:
i.e. to identify linguistic items that
distinguish the works by one author from
those of others.
Burrows (1987): study of Jane Austen's
novels focusing on closed-class words,
such as the, and, of, a and to.
Burrows found that these words can
distinguish the works of different authors,
different novels, and even the words
spoken by different characters.

Authorial style
Hoover (2002) studied a series of corpora
containing chunks from novels by different
authors.
For example, he looked at a corpus
containing the first 30,000 words of 29
novels by 17 different authors.
The distribution of the 300 most frequent
words in the corpus as a whole correctly
clusters 15 out of 17 novels.
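
As a rough illustration of the first step in this kind of study, the Python sketch below builds, for each text, a profile of relative frequencies for the most frequent words in the corpus as a whole. It is not Hoover's procedure: the tokeniser, the function name frequency_profiles and the toy texts are all assumptions, and the clustering step that would compare the profiles is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple tokeniser: lower-case runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_profiles(texts, top_n=300):
    """Relative frequencies of the corpus-wide top_n words, one list per text."""
    tokens = {name: tokenize(text) for name, text in texts.items()}
    corpus_counts = Counter(tok for toks in tokens.values() for tok in toks)
    vocab = [word for word, _ in corpus_counts.most_common(top_n)]
    profiles = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        profiles[name] = [counts[word] / len(toks) for word in vocab]
    return vocab, profiles


# Toy texts standing in for 30,000-word novel samples.
toy_texts = {
    "text_a": "The house and the garden of the old man.",
    "text_b": "It was a day of rain and of wind, and it was cold.",
}
vocab, profiles = frequency_profiles(toy_texts, top_n=5)
print(vocab)
print(profiles)
```

Texts whose profiles are similar would end up close together in a subsequent cluster analysis.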

Authorial style
An analysis of the most frequent word
sequences (n-grams) can also be useful,
e.g.
of the
in the
to the
it was
he was
and the

Authorial style
Mahlberg (2007, 2009, 2012)
Corpus stylistics and Dickens's
fiction
Also shows that analysis of
frequent word sequences
(clusters) can be useful.
Clusters containing body parts
his hands in his pockets
his head on one side
his hands upon his

Text style
Stubbs's (2005) study of
Joseph Conrad's Heart of
Darkness, first published in
1899.
Marlow, the protagonist and
first-person narrator, tells of
how he was contracted to
travel up a river in the Belgian
Congo, in order to find an ivory
trader called Kurtz, who was
the subject of stories of
madness and suspect
practices. However, Kurtz dies
while travelling back down the
river.

Text style
Main themes
hypocrisy of the colonizers
unreliability of progress and civilization
breakdowns in communication
Light vs. dark
Restraint vs. frenzy
Appearance vs. reality
Marlow's unreliable and distorted knowledge
(Stubbs 2005: 8-9)

Text style
Used WordSmith Tools (Scott 2007)
Compared one novel with a corpus of
fictional texts of around 700,000 words
Overused words in the novel include: seemed,
mystery, darkness, absurd, horror, terror,
desolation
Several words concern uncertainty,
perception and knowledge.
Coincide with some of the novel's themes

Text style
Stubbs shows how the application of
corpus methods can provide:
further justification for well-established
interpretations,
new insights into the language and
meaning potential of the text.

Text style: variation inside texts


Culpeper (2002) used WordSmith Tools to
do a key-word analysis of the speech of the
main characters in Romeo and Juliet
A file with the words spoken by each
character was compared to a reference
corpus containing the words of all the
other characters.
Findings are relevant to an understanding
of how the characters are linguistically
constructed (characterisation).
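
The comparison described here can be sketched in Python as follows. The sketch assumes log-likelihood as the keyness statistic, which is one of the statistics commonly used for key-word analysis; the slides do not say which statistic was used in the study. The function names, the threshold and the toy data are all illustrative.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a study corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, min_ll=3.84):
    """Words relatively more frequent in the study corpus, ranked by keyness.
    3.84 is the chi-square critical value for p < 0.05 with 1 d.f."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    results = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep only overused (positively key) words
            score = log_likelihood(a, b, c, d)
            if score >= min_ll:
                results.append((word, a, score))
    return sorted(results, key=lambda item: item[2], reverse=True)


# Toy data: one character's words versus everyone else's words.
character = ["if"] * 10 + ["the"] * 10
others = ["if"] * 2 + ["the"] * 38
result = keywords(character, others)
for word, freq, score in result:
    print(f"{word} ({freq})  LL = {score:.2f}")
```

In the toy data only if comes out as key for the character, because the is relatively more frequent in the other characters' speech.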

Text style: variation inside texts


Juliet's key-words (raw frequencies in
brackets):
If (31), Or (25), Sweet (16), Be (59), News
(9), My (92), Night (27), I (138), Would
(20), Yet (18), Thou (71), Words (5),
Name (11), Nurse (20), Tybalt's (6), Send
(7), Husband (7), That (82), Swear (5)

Text style: variation inside texts


Key-words such as if, or, would, yet can be
related to Juliet's tendency to express
uncertainty and anxiety throughout the play:

"I fear it is: and yet, methinks, it should
not, For he hath still been tried a holy man"
(IV.iii.)
[Context: Wondering
whether the Friar has

Corpus tools
Corpus tools make comparison
relatively easy
WordSmith Tools (Scott 2007)
WMatrix (Rayson 2009)
AntConc (Anthony 2011)
MLCT (Piao)

Summary
Style is the way in which language is
used.
The notion of style is fundamentally
based on comparison
Corpus linguistic methods are
relevant to the analysis of style in
fiction/literature.
They have been applied to the
analysis of genres, authors and texts.
Manual analysis and interpretation of

Summary
[...] corpus stylistics is not
purely a quantitative study
of literature. Rather, it is still
a qualitative stylistic
approach to the study of the
language of literature,
combined with or supported
by corpus-based
quantitative methods and
technology.
(Ho 2011: 10)

References
Culpeper, J. (2009) 'Keyness: words, parts-of-speech and semantic categories in the character-talk of Shakespeare's Romeo and Juliet', International Journal of Corpus Linguistics, 14(1): 29-59.
Ho, Y. (2011) Corpus Stylistics in Principles and Practice: A Stylistic Exploration of John Fowles' The Magus. London: Continuum.
Leech, G. (2008) Language in Literature: Style and Foregrounding. Harlow, UK: Pearson.
Louw, B. (2008) 'Consolidating empirical method in data-assisted stylistics: towards a corpus-attested glossary of literary terms', in Zyngier, S., Bortolussi, M., Chesnokova, A. and Auracher, J. (eds) Directions in Empirical Literary Studies, pp. 243-264. Amsterdam: Benjamins.
Mahlberg, M. (2007) 'Clusters, key clusters and local textual functions in Dickens', Corpora, 2(1): 1-31.
Mahlberg, M. (2009) 'Corpus stylistics and the Pickwickian watering-pot', in Baker, P. (ed.) Contemporary Corpus Linguistics, pp. 47-63. London: Continuum.
Mahlberg, M. (2012) Corpus Stylistics and Dickens's Fiction. London: Routledge.
McIntyre, D. (2010) 'Dialogue and characterization in Quentin Tarantino's Reservoir Dogs: a corpus stylistic analysis', in McIntyre, D. and Busse, B. (eds) Language and Style, pp. 162-182. Basingstoke: Palgrave.
McIntyre, D. and Walker, B. (2010) 'How can corpora be used to explore the language of poetry and drama?', in McCarthy, M. and O'Keeffe, A. (eds) The Routledge Handbook of Corpus Linguistics. London: Routledge.
Widdowson, H. G. (2008) 'The novel features of text. Corpus analysis and stylistics', in Gerbig, A. and Mason, O. (eds) Language, People, Numbers: Corpus Stylistics and Society, pp. 293-304. Amsterdam: Rodopi.

WMatrix

WMatrix
Web-based corpus tool
Developed by Paul Rayson at
Lancaster University
Automated grammatical and
semantic analysis of texts/corpora
A web-based front end for CLAWS
and USAS

WMatrix
Using a web interface:
Texts are uploaded onto the WMatrix
server (at Lancaster)
The upload procedure automatically
adds
(i) Grammatical or Part of Speech
(POS) tags;
(ii) Semantic tags

WMatrix
CLAWS grammatical (POS) tagger.
CLAWS = Constituent Likelihood
Automatic Word-tagging System
USAS semantic tagger
USAS = UCREL Semantic Analysis System
(UCREL = University Centre for Computer
Corpus Research on Language)

WMatrix
USAS
Assigns tags to each word using a
hierarchical framework of
categorization
Based originally on McArthur's
(1981) Longman Lexicon of
Contemporary English

The 21 Top Level Semantic Categories of the USAS Tag-set

A  GENERAL & ABSTRACT TERMS
B  THE BODY & THE INDIVIDUAL
C  ARTS & CRAFTS
E  EMOTION
F  FOOD & FARMING
G  GOVERNMENT & PUBLIC DOMAIN
H  ARCHITECTURE, HOUSING & THE HOME
I  MONEY & COMMERCE (IN INDUSTRY)
K  ENTERTAINMENT
L  LIFE & LIVING THINGS
M  MOVEMENT, LOCATION, TRAVEL, TRANSPORT
N  NUMBERS & MEASUREMENT
O  SUBSTANCES, MATERIALS, OBJECTS, EQUIPMENT
P  EDUCATION
Q  LANGUAGE & COMMUNICATION
S  SOCIAL ACTIONS, STATES & PROCESSES
T  TIME
W  WORLD & ENVIRONMENT
X  PSYCHOLOGICAL ACTIONS, STATES & PROCESSES
Y  SCIENCE & TECHNOLOGY
Z  NAMES &

WMatrix
G - Government and the public domain
G1    Government, politics and elections
G1.1  Government, etc.
G1.2  Politics
G2    Crime, law and order
G3    War, defence and the army: weapons
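
The hierarchical structure of the tags can be illustrated with a short Python sketch that expands a tag into its chain of parent categories. The function name tag_path is hypothetical, the labels are only those of the G category shown on this slide, and the sketch ignores any further markers that real USAS tags may carry.

```python
# Labels for the G category only, as listed on the slide.
LABELS = {
    "G": "Government and the public domain",
    "G1": "Government, politics and elections",
    "G1.1": "Government, etc.",
    "G1.2": "Politics",
    "G2": "Crime, law and order",
    "G3": "War, defence and the army: weapons",
}


def tag_path(tag):
    """Expand a tag such as 'G1.2' into its chain of categories,
    from the top level down: ['G', 'G1', 'G1.2']."""
    letter, rest = tag[0], tag[1:]
    path = [letter]
    if rest:
        numbers = rest.split(".")
        for depth in range(1, len(numbers) + 1):
            path.append(letter + ".".join(numbers[:depth]))
    return path


for level in tag_path("G1.2"):
    print(level, "-", LABELS[level])
```

Grouping tagged words by the first element of the path gives counts for the top-level categories, while the full tag keeps the finer distinctions.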

WMatrix
Allows analysis of texts at:
the word level
the grammatical level (POS)
and the semantic level

WMatrix
Allows text comparison at:
the word level
the grammatical level (POS)
and the semantic level

WMatrix
Keyness
Word level: key-words
Grammatical level: key-POS
Semantic level: key-concepts
