Introduction To Text Mining and Natural Language Processing: Judith Risse

Introduction to Text Mining and Natural Language Processing
BIF-30806 January 2010
Judith Risse
Outline

Literature and Databases
Natural Language Processing
Information Retrieval
Question Answering
Information Extraction
Indexing
Document Classification
Exercises
Definitions
Natural Language Processing (NLP)
the study of automated generation and understanding of natural human languages (Wikipedia)
Text Mining
extract high quality (previously unknown) information from large amounts of unstructured text
Biomedical Literature
communication of scientific discoveries
peer-reviewed and community reviewed
provides additional information of experimental results
base for annotation of biological databases
Literature Databases

NCBI Bookshelf
PubMed Central
PubMed

PubMed
currently 19,476,540 citations (Jan 27, 2010)
5414 journals in Medline
unique identifier PMID
entries contain author, journal and title info
more than 50% also abstracts
links to full-text articles
Medical Subject Headings (MeSH)
[Figure: PubMed growth. x-axis: year, 1950 to 2007; y-axis: No of publications in millions, 0 to 21; series: entries per year, total No of entries]
PubMed (3)
NLM 2008
A scientific article
journal specific format
sections
print style
type of article
review
letter
document format
html
pdf
Article content
Full-text
title
authors
abstract
body
Tables
Figures
References
Biomedical Language
domain specific terminology
cytosolic, erythroid precursor

polysemic words
e.g. Drosophila gene names: coitus interruptus, lost in space
acronyms
APC (activated protein C), mdh (malate dehydrogenase)
low frequency words
anaphora (references)
Overexpression of FumRs and Frds1 resulted in the best citrate-producing strain in the presence of trace manganese concentrations. This strain gave a maximum yield of .
Biomedical Language (2)
synonyms/creating new terms
typographical variants
malic dehydrogenase
L-malate dehydrogenase
NAD-L-malate dehydrogenase
malic acid dehydrogenase
NAD-dependent malic dehydrogenase
NAD-malate dehydrogenase
NAD-malic dehydrogenase
malate (NAD) dehydrogenase
MDH
L-malate-NAD+ oxidoreductase
Natural Language Processing

create computational models of language
multi-disciplinary
information technology, linguistics, artificial intelligence, statistics
machine learning, rule-based, regular expressions
statistical properties of language
grammatical, morphological, syntactic and semantic features
Grammatical Features
Grammar
rules governing a language
syntax and morphology
Part of speech (POS)
noun, verb, adjective, adverb, preposition
depends on context in sentence
Brill tagger (Eric Brill, PhD thesis, 1993)
http://www.cst.dk/online/pos_tagger/uk/index.html
http://en.wikipedia.org/wiki/Brill_Tagger
Morphological Features
structure of words
inflection
enzyme and enzymes (plural form)
catalyse, catalyses, catalysing (verb inflection)
word-formation
earth, earthworm (compounding)
dependent, independent (derivation)
stemming and lemmatisation
reduction of words to common base form
am, are, is → be
catalyse, catalyses, catalysing → catalys
Porter Stemmer (tartarus.org/martin/PorterStemmer)
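The difference between stemming and lemmatisation can be illustrated with a rough Python sketch. This is not the Porter algorithm: the suffix list, the lemma table and the function names are assumptions made up for this example, and a real stemmer applies many more rules.

```python
# Toy suffix stripper and lemma lookup (NOT the Porter stemmer).
SUFFIXES = ("ing", "es", "e", "s")              # assumed, tried in this order
LEMMAS = {"am": "be", "are": "be", "is": "be"}  # assumed lookup table


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Return the dictionary base form if known, else the word itself."""
    return LEMMAS.get(word.lower(), word.lower())


stems = [toy_stem(w) for w in ("catalyse", "catalyses", "catalysing")]
lemmas = [toy_lemma(w) for w in ("am", "are", "is")]
```

Stemming cuts words down to a shared string that need not be a word itself, whereas lemmatisation needs a dictionary to map irregular forms to their base form.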

Syntactic Features
relationships between words in a sentence

noun-phrase, verb-phrase
subject object relationships
POS Tagged Sentence
(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) (CD three) (NNS months) (IN in) (NNS rats) (WP who) (VBD were) (VBN injected) (IN in) (DT the) (NN spine) (IN with) (DT a) (NN gene) (IN that) (NNS triggers) (VBZ endorphins) (. .)
Pain - Proper singular noun
vanished - Verb, past tense
for - Preposition
at - Preposition
least - Superlative adjective
three - Cardinal number
months - Plural noun
in - Preposition
rats - Plural noun
who - wh-pronoun
were - Verb, past tense
injected - Verb, past participle
in - Preposition
the - Determiner
spine - Singular noun
with - Preposition
a - Determiner
gene - Singular noun
that - Preposition
triggers - Plural noun
endorphins - Verb, 3rd ps. sing. present
. - Final punctuation
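Tagger output in the "(TAG word)" form shown on this slide can be turned into pairs for further processing. The sketch below is one possible way to do that in Python; the regular expression and the function name are assumptions, and only the first part of the sentence is used as sample input.

```python
import re

# Matches one "(TAG token)" group; the bracket format is taken from the slide.
TAGGED = re.compile(r"\((\S+) (\S+)\)")


def parse_tagged(text):
    """Return a list of (token, tag) pairs from a bracketed tagged sentence."""
    return [(token, tag) for tag, token in TAGGED.findall(text)]


sample = "(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) (CD three) (NNS months) (. .)"
pairs = parse_tagged(sample)

# Select all noun-like tokens (tags starting with NN).
nouns = [token for token, tag in pairs if tag.startswith("NN")]
```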
Semantic Features
meaning of words given the context
dictionaries, thesauri
Gene Ontology
Contextual Analysis
Guilt by association
Co-occurrence analysis
bag of words
statistical analysis of word frequency
Word frequency
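Bag of words and co-occurrence counting can be sketched in a few lines of Python. The three example texts are invented for illustration, and counting each word pair once per document is only one of several possible definitions of co-occurrence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; each string stands for one sentence or abstract.
docs = [
    "mdh encodes malate dehydrogenase",
    "malate dehydrogenase converts malate to oxaloacetate",
    "apc encodes activated protein c",
]

# Bag of words: word order is discarded, only counts are kept.
bags = [Counter(doc.split()) for doc in docs]

# Co-occurrence: count each unordered word pair once per document.
pair_counts = Counter()
for bag in bags:
    for pair in combinations(sorted(bag), 2):
        pair_counts[pair] += 1
```

Pairs that co-occur in many documents, such as ("dehydrogenase", "malate") here, are candidates for an association ("guilt by association").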

Exercise 1

take a gene/protein name of your interest
query PubMed and retrieve 1 abstract
Take a look at what the Porter stemmer does using the abstract
Describe what problems might occur from stemming
Porter Stemmer
http://maya.cs.depaul.edu/~classes/ds575/porter.html
Coffee Break
Tasks of NLP

Information Extraction (IE)
Question Answering (QA)
Information Retrieval (IR)

machine translation
text proofing
speech recognition
optical character recognition (OCR)
Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Introduction to Information Retrieval (Cambridge University Press, 2008)
Indexing

Tokenization
Case Folding (TNFalpha, Tnfalpha → tnfalpha)
Stemming
Stop-word removal (e.g. at, be, from, this)
Boolean Queries
Vector Space Model queries
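The indexing steps can be chained into a small preprocessing function. The following Python sketch is only an illustration: the tokenization pattern and the function name are assumptions, the stop-word list holds just the examples from the slide, and stemming is left out.

```python
import re

STOP_WORDS = {"at", "be", "from", "this"}  # the examples given on the slide


def index_terms(text):
    """Tokenize, case-fold and remove stop words; stemming is left out here."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)          # tokenization
    folded = [token.lower() for token in tokens]         # case folding
    return [t for t in folded if t not in STOP_WORDS]    # stop-word removal


terms = index_terms("TNFalpha and Tnfalpha differ from this tnfalpha")
```

After case folding, all three spellings of the gene name end up as the same index term.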

Zipf's Law
A small number of words occur very often
Those high frequency words are often function words (e.g. prepositions)
Most words have a low frequency
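A rank-frequency list makes this pattern visible. The Python sketch below uses an invented toy sentence, so the numbers are only illustrative; a real corpus shows the effect far more clearly.

```python
from collections import Counter

# Hypothetical toy text, not taken from any corpus.
text = "the gene of the fly and the gene of the worm and the mouse"

counts = Counter(text.split())
ranked = counts.most_common()  # [(word, frequency), ...], most frequent first

# Under Zipf's law, frequency is roughly proportional to 1 / rank.
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, rank * freq)

# Words that occur only once form the long low-frequency tail.
singletons = [word for word, freq in ranked if freq == 1]
```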
Boolean Queries
Combination of query terms with boolean operators

AND, OR, NOT
Google, PubMed
high recall, low precision
unranked result
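Boolean operators map directly onto set operations on the lists of documents that contain each term. A minimal Python sketch, with invented terms and document ids:

```python
# Hypothetical posting sets: term -> set of ids of documents containing it.
postings = {
    "prostaglandin": {1, 2, 4},
    "inhibit": {2, 3, 4},
    "aspirin": {4, 5},
}
all_docs = {1, 2, 3, 4, 5}

both = postings["prostaglandin"] & postings["inhibit"]    # AND
either = postings["prostaglandin"] | postings["aspirin"]  # OR
without = both - postings["aspirin"]                      # AND NOT
not_aspirin = all_docs - postings["aspirin"]              # NOT
```

The result is a plain set: a document either matches or it does not, which is why the result list is unranked.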

The vector space model

term weight: $w = (1 + \log \mathrm{TF}) \cdot \log(N / \mathrm{DF})$
term frequency (TF)
inverse document frequency (IDF)
corpus size (N)
the vector points in word space
each dimension corresponds to a word or phrase
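The term weight (1 + log TF) log(N/DF) can be computed directly. In the Python sketch below the function name and the example numbers are assumptions, DF is read as the number of documents containing the term, and the natural logarithm is used because the slide does not state a base.

```python
import math


def term_weight(tf, df, n_docs):
    """Weight (1 + log TF) * log(N / DF); zero if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)


# Hypothetical numbers: a term seen 3 times in a document and present in
# 10 of 1000 documents, versus a term present in every document.
rare = term_weight(tf=3, df=10, n_docs=1000)
common = term_weight(tf=3, df=1000, n_docs=1000)
```

A term that occurs in every document gets weight zero, so it does not help to tell documents apart.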
Nat Rev Genet (2002) 3: pp 601-610
IR Evaluation
A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
Introduction to Information Retrieval (Cambridge University Press, 2008)
document collection
test cases of information need, as queries
measure of relevance
Evaluation (2)
Precision
What fraction of the returned results are relevant to the information need?
Recall
What fraction of the relevant documents in the collection were returned by the system?
F-score
harmonic mean of precision and recall: $F = \frac{2pr}{p + r}$
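Precision, recall and F-score follow directly from the set of returned documents and the set of relevant documents. A small Python sketch with invented document ids (the function name is an assumption):

```python
def evaluate(returned, relevant):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(returned & relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical run: 4 documents returned, 3 of them among 6 relevant ones.
p, r, f = evaluate({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
```

Here precision is 3/4 and recall is 3/6, and the harmonic mean lies between the two, closer to the lower value.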

Exercise 2
Compare the retrieval of abstracts between PubMed and Phasar (www.bioinformatics.nl/biometa/applet.html or twoquid.cs.ru.nl/applet.html) given the question:
What does prostaglandin inhibit?
How many results do you get?
Give examples of answers to the question.
Give 5 PMIDs of papers you would read given the results in each search.
Which of the systems was more helpful and why?
Coffee Break
Question Answering
question posed in human language
answer extracted from unstructured text
more developed in generic domain
difficult in biomedical domain
Information Extraction & Text Mining
extract structured information from unstructured text
Named Entity Recognition
identify relationships
e.g. protein-protein interactions
Information Extraction
extract meaning from a text combines:

pos-tagging
ontologies
regular expressions
Nat Rev Genet (2002) 3: pp 601-610
Named Entity Recognition

tagging of biological entities
gene/protein names
gene symbols
high precision in generic NLP (0.9 F-score)
difficult in biology
complex terms, synonyms, disambiguation
typographical variations
no use of official symbols
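A very simple form of entity tagging is a dictionary of patterns that absorb some typographical variation. The Python sketch below is a hypothetical illustration only: the two patterns, the names and the example sentence are assumptions, and none of the tools mentioned in this lecture is claimed to work this way.

```python
import re

# Hypothetical mini-dictionary: canonical name -> pattern covering variants.
# Optional hyphens or spaces absorb some typographical variation.
PATTERNS = {
    "malate dehydrogenase": re.compile(
        r"\b(?:mal(?:ate|ic)(?: acid)?[- ]dehydrogenase|MDH)\b", re.IGNORECASE
    ),
    "activated protein C": re.compile(r"\b(?:activated protein C|APC)\b"),
}


def tag_entities(text):
    """Return (start, matched text, canonical name) tuples, sorted by start."""
    found = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), name))
    return sorted(found)


entities = tag_entities("Malic dehydrogenase (MDH) was compared with APC.")
```

A dictionary like this cannot disambiguate: every "APC" is tagged as activated protein C, whatever the acronym stands for in the text at hand.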

Challenges of NLP
Abbreviation
punctuation can be confused with end of sentence
Wash. (Washington) with wash.
Decimal points
apostrophes: To split or not to split?
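The abbreviation problem shows up as soon as sentences are split on full stops. The following Python sketch uses an assumed, far from complete abbreviation list and an invented example text; it handles "Wash." and decimal points, but also shows how the rule then fails on the verb "wash.".

```python
ABBREVIATIONS = {"wash.", "mr.", "e.g."}  # assumed, far from complete


def split_sentences(text):
    """Split after a token ending in '.', unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


parts = split_sentences(
    "He moved to Wash. in 2009. The yield was 0.5 g. Then they wash. It dried."
)
```

The last two sentences are wrongly merged, because after case folding the verb "wash." looks like the abbreviation "Wash.".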
Challenges (2)
hyphens
single or multiple words?
data-base vs. data base vs. database
carry-over?
simple stemming
operate, operating, operates, operation, operative, operatives, operational → oper
case folding
brown car vs Mr. Brown
Anaphora
co-references
one expression referring to another
The monkey took the banana and ate it.
strictly only local antecedent statements
Sortal anaphora
this gene, the virus
resolution required for increased recall
Exercise 3
compare NER programmes

retrieve one PubMed abstract
http://biocreative.sourceforge.net/bionlp_tools_links.html
NLProt
TerMine
Whatizit (http://www.ebi.ac.uk/webservices/whatizit/info.jsf)
What are the differences in recognized entities?
Do they miss any obvious entities?
Indexing
Inverted Index (Inverted File)

for each word in the collection (dictionary) list occurrence and frequency
size of index is proportional to size of corpus
remove stopwords, use stemming for more efficient index
classic version is a boolean index
can also contain positional information
sparse matrix
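An inverted index with positional information can be built with nested dictionaries. The Python sketch below uses three invented documents; storing positions per document is an assumption made for this example and may differ from how a given indexing tool counts positions.

```python
from collections import defaultdict

# Hypothetical mini-collection: document id -> text.
docs = {
    1: "deterministic models of deterrence",
    2: "detonation and deterrence",
    3: "deterministic detonation of deterministic models",
}

# term -> {doc id -> list of word positions within that document}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for position, term in enumerate(text.lower().split()):
        index[term].setdefault(doc_id, []).append(position)

# Number of documents per term and total number of occurrences per term.
doc_freq = {term: len(posts) for term, posts in index.items()}
total_occ = {
    term: sum(len(p) for p in posts.values()) for term, posts in index.items()
}
```

Most terms occur in only a few documents, which is why the term-by-document matrix is sparse and is stored as lists like these.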
Example
deterministic
number of docs containing the term: 20
document ids: 73 89 90 106 173 194 233 243 251 252 255 257 258 267 276 281 304 312 315 326
total # of occurrences: 27
term position in counted words: 36822 44643 45285 53003 53061 86740 86743 97082 116618 121984 125750 125952 125968 126039 127633 128882 128978 129048 133781 133789 138493 140946 140947 152011 156191 157881 163490

deterrence
number of docs containing the term: 1
document ids: 60
total # of occurrences: 4
term position in counted words: 30309 30345 30444 30452

detonation
number of docs containing the term: 2
document ids: 263 264
total # of occurrences: 4
term position in counted words: 131781 131956 131995 132303
Suffix Array
A suffix array is an array that contains all the pointers to the text suffixes listed in lexicographical order.
Text is seen as one long string
A text suffix is a substring from given position till end of string
position refers to beginning of word
return all occurrences of string W in large text A
Example:
the word: abracadabra
1. create all suffixes
2. sort suffixes alphabetically
3. resulting suffix array
Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring
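The abracadabra example can be reproduced with a short Python sketch. It works on characters rather than word starts, uses 0-based positions, and builds the array by plain sorting, which is fine for a short string but not how large texts are indexed; the function names are assumptions.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically (0-based)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return sorted start positions of every occurrence of pattern."""
    suffixes = [text[i:] for i in suffix_array]  # fine for short texts only
    i = bisect_left(suffixes, pattern)           # first suffix >= pattern
    hits = []
    while i < len(suffixes) and suffixes[i].startswith(pattern):
        hits.append(suffix_array[i])
        i += 1
    return sorted(hits)


text = "abracadabra"
sa = build_suffix_array(text)
occurrences = find_all(text, sa, "abra")
```

Because the suffixes are sorted, all suffixes that begin with the pattern sit next to each other, so one binary search finds the start of the block.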
Document Classification
assign a document to a class given its content

manual (ad hoc)
rule-based
decision tree
machine learning approaches
Statistical Text Classification
training documents for each class
supervised learning
test data or new data
training data and test data have to be similar
Naive Bayes
Naive: all words in text are considered independent
Bayes: uses Bayes' theorem
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
prior probability: $P(A)$
posterior probability: $P(A \mid B)$
Basic Probability Theory

Given A represents an event, the probability of A occurring is $0 \le P(A) \le 1$
Joint probability $P(A, B) = P(A \cap B)$
Conditional probability $P(A \mid B)$
Chain rule $P(A, B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$
Application to Document Classification

probability of a word belonging to category C
probability of a document belonging to category C given its words
wikipedia.org
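A naive Bayes text classifier can be sketched compactly in Python. The version below is one common variant, not necessarily the one behind the formulas on this slide: it works with log probabilities and add-one smoothing, both of which are assumptions, and the training sentences are invented in the spirit of Exercise 4.

```python
import math
from collections import Counter


def train(labelled_docs):
    """labelled_docs: list of (category, text). Returns the model parts."""
    doc_counts, word_counts, vocab = Counter(), {}, set()
    for category, text in labelled_docs:
        doc_counts[category] += 1
        words = text.lower().split()
        word_counts.setdefault(category, Counter()).update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Pick the category C maximising log P(C) + sum of log P(word | C)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        counts = word_counts[category]
        denom = sum(counts.values()) + len(vocab)             # add-one smoothing
        score = math.log(doc_counts[category] / total_docs)   # prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / denom)
        scores[category] = score
    return max(scores, key=scores.get)


# Hypothetical training sentences.
model = train([
    ("rugby", "the scrum won the ball"),
    ("rugby", "a try was scored after the scrum"),
    ("tennis", "she served an ace"),
    ("tennis", "the ace won the set"),
])
label = classify("a scrum and a try", model)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on longer documents; the "naive" part is that each word contributes independently of all others.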
Coffee Break
Exercise 4
Try to apply naive Bayes to a selection of sentences using
http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes/
with rugby.txt and tennis.txt as training and test data.
If you have it implemented, try using this in combination with the Porter Stemmer (http://bionlp.stanford.edu/bionlp.pl)
Added Challenge
From sequence to abstract to NER
MSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQR DEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGM DLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAE LKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL
retrieve UniprotID via BLAST (take best hit)
retrieve gene name using getz (GeneName field)
retrieve relevant abstracts from PubMed in Medline format using eSearch and eFetch with the gene name
extract all protein/gene names from these abstracts
http://bionlp.stanford.edu/webservices.html
how do they relate to the original protein?
compare to the output of ebiMed using the gene name (http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp)
Helpful resources
http://www-nlp.stanford.edu/links/statnlp.html
http://nlp.stanford.edu/IRbook/html/htmledition/mybook.html
www.biocreative.org
Drosophila gene names: http://www.curioustaxonomy.net/gene/fly.html
Further Reading
Introduction to Information Retrieval
Cambridge University Press ISBN 978-0-521-86571-5
The Text Mining Handbook
Cambridge University Press ISBN-13 978-0-521-83657-9
