Not a branch of linguistics, like socio~, psycho~
Not a theory of linguistics
A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject
Assessment for this course is to use corpus/corpora to investigate something
This lecture may give you some ideas of the kind of thing you can do
Lexicology
Grammatical studies
Study of language variation
Historical linguistics
Contrastive analysis and translation theory
Study of language acquisition (psycholinguistics)
Language teaching
Study of behaviour of individual words
Particularly useful for dictionary construction (lexicography)
Can identify more and less frequently occurring words
More interesting is HOW words are used
  Syntax
  Meaning
Most frequent words are function words (the, of, and, to, a are 5 most frequent words in LOB)
If corpus is small, it can only give an indicative snapshot of word usage
LOB (1m words): hundreds of words occur less than 10 times
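Frequency observations of this kind can be reproduced on any text with a few lines of Python. A minimal sketch, assuming a crude regular-expression tokeniser and a toy sentence standing in for a corpus (it is not LOB data); the names `word_frequencies` and `rare_words` are invented for this illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word tokens (a deliberately crude tokeniser)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def rare_words(freqs, threshold=10):
    """Return, in alphabetical order, the words occurring fewer than `threshold` times."""
    return sorted(w for w, n in freqs.items() if n < threshold)

# Toy text standing in for a corpus; not taken from LOB.
sample = "the cat sat on the mat and the dog sat by the door"
freqs = word_frequencies(sample)

print(freqs.most_common(3))            # most frequent tokens first
print(rare_words(freqs, threshold=2))  # words seen only once in the toy text
```

On a real corpus the same two calls give the ranked frequency list and the long tail of rare words; the smaller the corpus, the larger the share of words that fall under the threshold.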
For dictionary construction, need bigger corpus
Monitor corpus, constantly updated and added to
Traditional lexicography: collection of slips by experts
  OED took 50 years and includes 5m citations, sorted and edited manually
Same idea, but more systematic
Dictionary as descriptive rather than pre- (or pro-) scriptive
Collins COBUILD
  Birmingham corpus (20m words, 1980s)
  Bank of English corpus (415m words in Oct 2000)
70m words of transcripts of BBC broadcasts
  Used as basis of BBC English dictionary
Cambridge Language Survey
Longmans corpus of American English, and use of BNC for (BrE) dictionary
Concordancing
  Lists occurrences of word in context
  Identify syntactic use of word
  Identify range of meanings
  Identify relative frequency of different uses/meanings
Collocation
  What words occur together?
  Compare distribution of close synonyms
Can be interesting to compare meanings/uses given by dictionaries with
  http://www.collins.co.uk/corpus/CorpusSearch.aspx
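Both techniques are simple to prototype. A minimal sketch, assuming a tokenised text held in memory: `concordance` prints key-word-in-context lines and `collocates` counts the words in a small window around a node word. The function names, the window sizes and the toy sentence are all assumptions made for illustration; real concordancers work on much larger corpora and usually add statistical association measures.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=3):
    """Return key-word-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Toy sentence invented for illustration; not from any real corpus.
text = "strong tea and strong coffee but powerful engines and powerful computers"
tokens = tokenize(text)

for line in concordance(tokens, "strong"):
    print(line)

# Comparing the neighbours of two close synonyms:
print(collocates(tokens, "strong", window=1))
print(collocates(tokens, "powerful", window=1))
```

Comparing the two collocate counts is the kind of check that shows how close synonyms are distributed differently.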
constructions) Use in different registers (eg narrative vs argumentative) or modes (eg written vs spoken)
Appositives (eg George Bush, US president or US president George Bush)
See CF Meyer, Can you really study language variation in linguistic corpora? American Speech 79.4 (2004) 339-355
Genuine titles, pseudotitles, descriptives
  Junichiro Koizumi, the Japanese prime minister
  Gerald Ford, former president of the USA
  Osama bin Laden, America's no.1 enemy
Looked at how appositives (esp. pseudotitles) are used differently in newspaper reports from different countries, and how descriptives become pseudotitles
Simple past vs perfective verb forms
Use of modals can~may, shall~will
Use of passive, and means/reasons to avoid
  eg especially in translation
Most try to investigate the factors that determine choice of one construction over another
  Lexical
  Grammatical
  Stylistic
  etc
and tools need to be available for examples to be extracted
Corpus may need to be sufficiently large to get good number of examples
If comparing registers/subject domains/modes, corpus needs to reflect these
Both lexical and grammatical studies often contrast usage by mode, domain, register etc.
Sociolinguists often interested in other aspects, eg sex, age, social class of author or audience; historical linguists interested in change over time
Recent corpora (eg BNC) have included this information in header mark-up
Simple examples
  lovely used more by females than males
  What does cool mean?
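A comparison like the lovely example needs frequencies normalised by the amount of text each group produced, since the groups rarely contribute equal numbers of words. A minimal sketch, assuming each text has already been reduced to a (group, tokens) pair read from its header metadata; the function name `per_million` and the utterances are invented for illustration and are not BNC data.

```python
from collections import Counter

def per_million(texts, word):
    """Relative frequency of `word` per million tokens, grouped by a metadata value.

    `texts` is a list of (group, tokens) pairs, e.g. ("female", ["what", "a", ...]).
    """
    hits, totals = Counter(), Counter()
    for group, tokens in texts:
        totals[group] += len(tokens)
        hits[group] += sum(1 for t in tokens if t == word)
    # Skip groups with no tokens to avoid dividing by zero.
    return {g: 1_000_000 * hits[g] / totals[g] for g in totals if totals[g]}

# Invented toy utterances, only to show the calculation; not BNC data.
texts = [
    ("female", "what a lovely day".split()),
    ("female", "that is lovely".split()),
    ("male", "what a day".split()),
    ("male", "that is lovely too".split()),
]
rates = per_million(texts, "lovely")
for group, rate in sorted(rates.items()):
    print(group, round(rate))
```

The same function works for any header attribute (age band, social class, mode), as long as the grouping value is available for every text.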
Are there lexical and grammatical factors that can help us to classify text genres?
Biber used statistical measures to identify stylistic factors that co-occurred, and could therefore be definitional of text types and genres
  Eg conjuncts like therefore, nevertheless and use of passive together
Factor analysis
  choose a range of features to measure, see which ones are correlated
  does not (necessarily) predetermine analysis (except obviously you
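The "see which ones are correlated" step can be sketched as a pairwise Pearson correlation over per-text feature rates. This is only the first step towards factor analysis, which goes on to extract underlying factors from the correlation matrix, and it is not a reconstruction of Biber's procedure. The feature names, the threshold and the rates below are invented for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(features, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs whose |r| is at least `threshold`."""
    pairs = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 2)))
    return pairs

# Invented rates per 1,000 words for five texts; illustrative only.
features = {
    "conjuncts": [1.0, 2.0, 6.0, 7.0, 8.0],
    "passives": [3.0, 4.0, 11.0, 13.0, 15.0],
    "contractions": [20.0, 18.0, 5.0, 3.0, 2.0],
}
for a, b, r in correlated_pairs(features):
    print(a, b, r)
```

Features that rise together (positive r) or trade off against each other (negative r) are candidates for belonging to the same underlying factor.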
Similar things can be done with historical texts, though (obviously) these are more limited in terms of genre
Also, diachronic studies can compare texts from different periods (again as long as you compare like for like as much as possible)
Topics:
Nevalainen in J. Engl. Ling (2000) used Corpus of Early English Correspondence (U. Helsinki) to track sex roles in linguistic innovation
Popular theory that females more innovative, and males follow trends
She analysed sex-of-author differences in three linguistic changes between 16th and 20th century:
  Replacement of ye by you in subject position
  Replacement of 3rd-person verb suffix -th by -s
  Reduction in use of multiple negatives and use of any and ever instead
Parallel corpora
  texts + their translations
  preferably aligned
Comparable corpora
  Texts in different languages but of a similar nature
  What parallels are there in genre characteristics?
Aligned corpus allows search for word or phrase and its translation
  How is it translated?
  Is it translated consistently?
Of interest in studies of translationese
  Translated text too influenced by original
  Certain constructions more prevalent in translation than in native text
Evidence of explicitation
  Translation is often more explicit than original
  Sometimes, explanation added for foreign reader
  But often, just a reflection of the translator's effort (eg
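Searching an aligned corpus for a word and its translation can be sketched in a few lines. This assumes the corpus is already sentence-aligned and held as (source, target) pairs; the function names and the English-French sentences are hypothetical, invented for illustration. Matching is by plain substring, so a real tool would want proper tokenisation and, ideally, word-level alignment.

```python
def find_translations(aligned, phrase):
    """Return (source, target) pairs whose source sentence contains `phrase`."""
    phrase = phrase.lower()
    return [(src, tgt) for src, tgt in aligned if phrase in src.lower()]

def share_translated_as(hits, candidate):
    """Fraction of hits whose target sentence contains `candidate`."""
    if not hits:
        return 0.0
    matches = sum(candidate.lower() in tgt.lower() for _, tgt in hits)
    return matches / len(hits)

# Hypothetical sentence-aligned English-French pairs, invented for illustration.
aligned = [
    ("I like the house.", "J'aime la maison."),
    ("The house is big.", "La maison est grande."),
    ("He stayed at home.", "Il est resté chez lui."),
]

hits = find_translations(aligned, "house")
for src, tgt in hits:
    print(src, "|", tgt)

# How consistently is the word rendered by one candidate translation?
print(share_translated_as(hits, "maison"))
```

A share well below 1.0 for the expected equivalent points to the pairs worth reading by hand, which is where inconsistent or more explicit renderings tend to show up.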
First-language acquisition
  CHILDES database (Child Language Data Exchange System) http://childes.psy.cmu.edu/
  Transcriptions of conversations with (and between) young children
  Includes software to help extract data
Second-language acquisition
  Learner corpora, notably ICLE http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Icle/icle.htm