
Corpus Linguistics workshop

By
Dr. Wessam Ibrahim of Tanta University
Setting: Faculty of Education
Time: 8-11 Sept. 2014
Corpus tools:
1. CQPweb (Lancaster University). Advantage: easiest to use. Disadvantage: cannot upload my own corpus
User name: azzaabdeen
Password: azzaabdeen
2. AntConc. Advantage: can upload my own corpus. Disadvantage: complicated, but useful for part-of-speech
tagging. The only available one
3. Wmatrix: semantic domains and tagging. Free only for very small corpora
4. WordSmith

Dr. Wessam's email: weselmasry@yahoo.com

Introduction
What is a corpus?
A methodology to save time
Corpus: "body"; plural: corpora. A collection of texts readable by computer, saved as plain text files so
the computer can read them (machine readable)
The text has to be saved in Word, then as a text file
PDF files have to be converted from PDF to plain text. A PDF saved as an image only cannot be used as
a corpus
Why use a corpus:
For research work:
1. testing a hypothesis
2. avoiding researcher bias: a large corpus helps the researcher avoid bias, because the data can support or refute a hypothesis
3. helps spot common and rare phenomena, especially in large corpora
4. helps us make generalizations
What is corpus linguistics?
1. Empirical: authentic data that provides evidence
What is empiricism:
a. Scientific/philosophical
b. symptoms of evidence
It needs a computer: software for analysis
2. Quantitative and qualitative analytical techniques: we can get frequencies from a corpus, but these
frequencies have to be interpreted. Say something about the pattern and explain it

How to analyze?
1. Frequency: search for the frequency of a certain item so that we can make claims. For example, the word
"wash" is more frequent with women than with men, so we could make claims about why women use it more.
2. Tag a corpus: adding information to your corpus. We do the tagging while compiling. For example, to
compare men and women, the corpus should be tagged for who is who (this part is given by a 25-year-old
middle-class female). Tagging involves a lot of work.
3. Key words: what is the corpus about? There should be a reference corpus to compare the uploaded
corpus to, i.e. corpus A and a corpus B that is more general and larger (for example, Arabic newspapers,
which are more representative of Arabic). Statistics will work out the results and give the key words: a
key word is a word whose frequency in corpus A is statistically significant compared with corpus B.
4. Concordance: the little context in which the query word is used, a certain number of words before and
after the query (key word). A list of all the occurrences of a word or phrase in a corpus, given in the
context of the sentence it occurs in, sometimes called KWIC (key word in context): before + key word + after
5. Collocation: words that tend to come together. They can have ideological connotations. For example,
the words "spinster" and "bachelor" have ideological connotations because of their collocates:
Bachelor: unmarried, man, happy, gentleman
Spinster: married woman, elderly, cold-hearted, witch
The generalization that comes out of this may be that the English language is sexist
Another example: the Muslim Brotherhood in Egyptian and British newspapers
In Egyptian newspapers the collocates give the connotation of violence, crackdown, source of trouble
In British newspapers the collocates include "sit-in", which implies peaceful protest and gives a positive
image of the Ikhwan as victims, whereas the Egyptian image is negative because of a certain agenda the
media is following.
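The key word step in the list above (comparing corpus A against a larger reference corpus B) can be sketched in a few lines of Python. This is only an illustration, not what any of the workshop tools do internally: it assumes the common log-likelihood keyness formula, a very crude tokenizer, and a cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom). The function names and the cut-off are assumptions, not part of the workshop material.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, strip simple punctuation (a crude tokenizer)."""
    tokens = (w.strip(PUNCT) for w in text.lower().split())
    return [t for t in tokens if t]

def log_likelihood(a, b, c, d):
    """a, b: frequency of the word in study / reference corpus; c, d: corpus sizes."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_text, reference_text, cutoff=3.84):
    """Words that are significantly MORE frequent in the study corpus (assumed cut-off)."""
    study = Counter(tokenize(study_text))
    ref = Counter(tokenize(reference_text))
    c, d = sum(study.values()), sum(ref.values())
    result = []
    for word, a in study.items():
        b = ref.get(word, 0)
        ll = log_likelihood(a, b, c, d)
        if ll >= cutoff and a / c > b / d:
            result.append((word, round(ll, 2)))
    return sorted(result, key=lambda pair: -pair[1])
```

Calling `keywords(study_text, reference_text)` returns the candidate key words with their scores, highest first. The scores still have to be interpreted, as with any frequency result.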
Features of Corpus:
1. Very large: size is related to the kind of research questions
2. Representative: different genres to talk about general things. News reports should be in focus
3. Machine readable
4. Often annotated: tagging is extra information you add, either according to syntax (e.g. n, v) or according
to semantic domain, for example categorizing words that belong to groups of semantic features: food,
family, sports. The software will do it. Wmatrix is the only software for semantic annotation; most of
the others give syntactic annotation
5. Representative: corpora are big so as to be representative of language variation. A corpus should be
large enough to establish norms/patterns and to reveal cases of usual uses.
6. Annotated: tagging the corpus with information such as the age, sex and class of the speakers
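Speaker tagging of the kind described above (age, sex, class) can be pictured as metadata attached to each stretch of text. The sketch below is an assumption about how such tags might be stored and used, not the format of any particular tool; the `Utterance` class, its field names and the `per_thousand` helper are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    sex: str
    age: int
    social_class: str

def per_thousand(utterances, word, **restrictions):
    """Frequency of `word` per 1,000 tokens, only in utterances matching the metadata."""
    selected = [u for u in utterances
                if all(getattr(u, key) == value for key, value in restrictions.items())]
    tokens = [t for u in selected for t in u.text.lower().split()]
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word) / len(tokens)

# Invented toy data, only to show the call
corpus = [
    Utterance("I wash the dishes", "female", 25, "middle"),
    Utterance("I watch the match", "male", 30, "middle"),
]
women = per_thousand(corpus, "wash", sex="female")
men = per_thousand(corpus, "wash", sex="male")
```

Normalizing per 1,000 tokens matters because the male and female parts of a corpus are rarely the same size, so raw counts cannot be compared directly.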

Types of Corpora
1. Specialized corpus: For example:
Genre: the language of newspaper
Time: 2005 till present
Place: texts published in Egypt
2. General corpus: needs to be larger. For example, the British National Corpus (BNC) has about 100 million
words of spoken and written British English. In general corpora we can search for things such as discourse
markers, transitivity, modals or any other grammatical feature.
There are two corpora: LOB, the Lancaster-Oslo/Bergen Corpus (British English, 1961), and
FLOB, the Freiburg-LOB Corpus (British English, 1991). The American corpus from 1961 is the Brown Corpus
3. Multilingual corpus: English and Arabic or American English and British English
4. Parallel corpus: two corpora in two different languages, e.g. English and Arabic
5. Learner corpus: language produced by people learning a particular language, e.g. the International
Corpus of Learner English. Adjectives expressing feelings are the same as Americans'
6. Historical or diachronic corpus, e.g. the Helsinki Corpus: 1.5 million words of texts from 700 AD to 1700 AD
7. Monitor corpus: continually added to, e.g. the Bank of English (COCA: a free American corpus)
- Size of corpus is based on your purpose: what do you want to do with it? Specialized corpora do not
have to be big; it depends on the purpose.
Demographic data: everyday conversation
Governed data: TV language
Types of Searches
A single word: book
A phrase: book the hotel
One word or another: clever, smart
Wild cards in words: hat, hit, hot
Wild cards as words: the * man
Part of speech: love_NN1
Headword searches: {list/lists/listing}
Lemma search: word derivatives {light/verb} {lights/N} {lit, lighted, lighting}
Restricted searches: only the news genre or only female speakers; set the restrictions before embarking on
the corpus search
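Some of the search types above can be imitated with regular expressions. The sketch below uses Python's `re` module as a stand-in; it is not the query syntax of CQPweb, AntConc or WordSmith, and the sample sentence is invented for the illustration.

```python
import re

# Invented sample text
text = ("The old man took his hat and hit the hot road. "
        "The tall man is clever; the boy is smart.")

def find(pattern):
    """Return all non-overlapping matches of the pattern, ignoring case."""
    return re.findall(pattern, text, flags=re.IGNORECASE)

single_word = find(r"\bman\b")                # a single word
either = find(r"\b(?:clever|smart)\b")        # one word or another
wildcard_in_word = find(r"\bh\wt\b")          # one unknown letter inside a word
wildcard_as_word = find(r"\bthe \w+ man\b")   # one unknown word inside a phrase
```

Part-of-speech, headword and lemma searches need a tagged corpus, so they cannot be reproduced with plain patterns like these.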
Coping with too many concordance lines:
- Thin the concordance, e.g. to 100 lines. Look at 30 lines, then another 30, until there are no new patterns
- Use a small number of lines to form a hypothesis, then carry out other searches
- Use collocation or key words. Get the collocates of each key word
- All choices should be based on statistical significance
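Concordance lines and thinning, as described above, can be sketched as follows. This is a minimal illustration that assumes the text is already split into tokens. The function names `kwic` and `thin`, the default span and the fixed random seed are choices made for the example, not settings of any of the workshop tools.

```python
import random

def kwic(tokens, query, span=4):
    """Key word in context: (words before, key word, words after) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == query.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, token, right))
    return lines

def thin(lines, n, seed=0):
    """Keep a random sample of n concordance lines, in their original order."""
    if len(lines) <= n:
        return list(lines)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(lines)), n))
    return [lines[i] for i in keep]
```

For example, `thin(kwic(tokens, "story"), 30)` gives a first batch of up to 30 lines to read. Changing `seed` gives a different batch for the next round.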

Collocation
The systematic co-occurrence of words in use. First, the key word is the node word.
Fixed: e.g. telephone operator (fixed relationship)
Variable: tell me a story
a story to tell
Non-idiomatic: told a story
tell a story
telling a story
Some collocates are based on a certain ideology
Idiomatic collocation: kick the bucket
The node word is the word whose collocates I want to search for. How large should the span be?
AntConc: -1/+1, which can be changed to -5/+5
It is important to specify the span of collocates before you do the statistics
5 words before the node word + 5 words after
Log-likelihood: most frequently used
Mutual Information (MI): in a small corpus we can use mutual information for statistical significance. It
measures the strength of association between two words (collocates)
MI is mainly used to get at the ideology of the producer: the words through which the journalist imposed
this ideology. It picks up subtle patterns that are statistically significant and carry an ideology. It
creates an entity. It has a cut-off, which says when to call a collocation statistically significant. MI is
a measure of effect size showing the strength or salience of a collocation
MI cut-off = 3: MI of 3 and above measures very strong association
MI is a base-2 logarithm: a pair occurring 2 times as often as expected gives MI = 1; 8 times as often gives MI = 3
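The span and the MI score discussed above can be worked through in a short Python sketch. It assumes the simple textbook definition, MI = log2(observed / expected), with expected = f(node) x f(collocate) / corpus size. Real tools may adjust the expected value for the span size, so their figures can differ. Both function names are hypothetical.

```python
import math
from collections import Counter

def collocate_counts(tokens, node, span=5):
    """Count every word found within `span` words before or after each occurrence of the node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """MI = log2(observed / expected), under the simple definition assumed here."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)
```

With a node and a collocate that each occur 100 times in a 10,000-word corpus, one co-occurrence is expected by chance. Observing 2 gives MI = 1 and observing 8 gives MI = 3, which matches the cut-off mentioned above.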
Colligation: a word collocates with a certain part of speech
Semantic preference: the collocates belong to the same semantic domain:
A glass of: water, lemon, juice (collocates with the domain of cold beverages)
Semantic prosody: words used for a special feeling or effect: negative or positive connotation
Semantic preference: a common semantic field around a word, e.g.
consequence + adjectives related to logic/importance
Semantic/discourse prosody: cause as a noun (rare): aim (positive)
Cause as a verb: bad (negative)
Semantic prosody explains connotation

Corpus Software Tools
WordSmith 5
AntConc
CQPweb
Wmatrix

Wmatrix
1. Word list: frequency
2. Open file: corpus text file
Choose text now, then OK
If you want to change the text, choose the button to change the selection.
3. Make a word list now
WordSmith Tools
File, Settings, Utilities, Window
- Word list
- Click File
- Open
- Choose texts from my computer
- Browse
- Choose my file from My Computer
Change the selection in case you need to change it (all books, one book, two books)
Select
- Tick on the ruler
- OK
If I need to change after loading: highlight, then Clear
- Make a word list now
- All words appear arranged by frequency, from most to least (top down)
- Frequency: number of occurrences
- Percentages: the frequency of occurrences compared to the whole text
How to save a word list?

- Click File, then tick Save as a word list
- Save twice: as an Excel sheet and as a word list
- Click Function, then Concord: a window opens and the key word appears with its frequency
(concordances highlighted)
- Window, Sort now, Yes with Concord
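The word list described above (words ordered by frequency, with percentages, saved in a form Excel can open) can be approximated outside WordSmith with a short Python sketch. This is not WordSmith's own procedure. The tokenizer is deliberately crude, and the function names and CSV column names are assumptions made for the example.

```python
import csv
from collections import Counter

PUNCT = ".,;:!?\"'()"

def word_list(text):
    """Return (word, frequency, percent of all tokens), most frequent first."""
    tokens = [w.strip(PUNCT).lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, freq, round(100 * freq / total, 2))
            for word, freq in counts.most_common()]

def save_word_list(rows, path):
    """Write the word list as a CSV file, which spreadsheet programs can open."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "frequency", "percent"])
        writer.writerows(rows)
```

A typical call would be `save_word_list(word_list(open("corpus.txt", encoding="utf-8").read()), "wordlist.csv")`, where both file names are placeholders.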
