Artificial Intelligence
Natural Language Processing Dr Alexiei Dingli
Aims of NLP?
Trying to make computers talk
Give computers the linguistic abilities of humans
1/7/2012
1940s - 1950s
Turing's (1936) model of algorithmic computation
McCulloch-Pitts neuron (McCulloch and Pitts, 1943): a simplified model of the neuron as a kind of computing element (propositional logic)
Kleene (1951, 1956): finite automata and regular expressions
Shannon (1948): probabilistic models of discrete Markov processes applied to automata for language
Chomsky (1956): finite state machines as a way to characterize a grammar
1940s - 1950s
Speech and language processing: Shannon
the metaphor of the noisy channel
entropy as a way of measuring the information capacity of a channel
1940s - 1950s
One of the earliest applications of computers Major attempts in US and USSR
Russian to English and reverse
A formal definition of grammars and languages provides the basis for automatic syntactic processing of NL expressions
A procedural approach to the meaning of a sentence provides the basis for automatic semantic processing of NL expressions
Newell and Simon - the Logic Theorist and the General Problem Solver Early natural language understanding systems
Domains: combination of pattern matching and keyword search; simple heuristics for reasoning and question-answering
A large dictionary: compute the likelihood of each observed letter sequence given each word in the dictionary (Joshi and Hopely, 1999; Karttunen, 1999)
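The idea above can be sketched in a few lines. This is a minimal illustration, not the historical system: each dictionary word is scored by the likelihood of the observed letter sequence, assuming (as a toy channel model) that each letter is transmitted correctly with probability 0.9 and confused with one of the other 25 letters otherwise.

```python
# Toy noisy-channel word scoring: P(observed | word) under a naive
# per-letter channel model (an illustrative assumption, not a real model).

def likelihood(observed, word, p_correct=0.9):
    """Likelihood of the observed letter sequence given a dictionary word."""
    if len(observed) != len(word):
        return 0.0
    p = 1.0
    for o, w in zip(observed, word):
        p *= p_correct if o == w else (1 - p_correct) / 25
    return p

def best_word(observed, dictionary):
    """Return the dictionary word that best explains the observation."""
    return max(dictionary, key=lambda w: likelihood(observed, w))

print(best_word("cxt", ["dog", "cat"]))  # cat
```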
Testable psychological models of human language processing based on transformational grammar Resources
First online corpora: the Brown corpus of American English; DOC (Dictionary on Computer), an on-line Chinese dialect dictionary.
first to attempt to build an extensive (for the time) grammar of English (based on Halliday's systemic grammar)
1983-1993
Return of finite-state models
Finite-state phonology and morphology (Kaplan and Kay, 1981) Finite-state models of syntax by Church (1980).
Return of empiricism
1994-1999
Major changes Probabilistic and data-driven models had become quite standard Parsing, part-of-speech tagging, reference resolution, and discourse processing
Algorithms incorporate probabilities Evaluation methodologies from speech recognition and information retrieval. commercial exploitation (speech recognition, spelling and grammar correction) need for language-based information retrieval and information extraction.
Competitive evaluations
Parsing (Dejean and Tjong Kim Sang, 2001), information extraction (NIST, 2007a; Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), word sense disambiguation (Palmer et al., 2001; Kilgarriff and Palmer, 2000), question answering (Voorhees and Tice, 1999), and summarization (Dang, 2006).
Statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003)
Topic modeling (Blei et al., 2003)
Effective applications could be constructed from systems trained on unannotated data alone Use of unsupervised techniques
Elements of a Language
Phonemes Morphemes Syntax Semantics
Smallest phonetic unit in a language, capable of conveying a distinction in meaning
Every language has a discrete set of phonemes describing all its possible sounds
Eg: "M", in "man," and "c", in "can," are phonemes.
A meaningful linguistic unit Consisting of a root word or a word element that cannot be divided into smaller meaningful parts.
Eg: "Pick" and "s", in the word "picks," are morphemes
Exercise
Word      Morpheme        Phoneme
Bay       Bay (1)         B + ay (2)
Pots      ?               ?
A         ?               ?
Teacher   ?               ?
Exercise
Word      Morpheme         Phoneme
Bay       Bay (1)          B + ay (2)
Pots      Pot + s (2)      P + o + t + s (4)
A         A (1)            A (1)
Teacher   Teach + er (2)   T + ea + ch + e + r (5)
Syntax
Describes the constituent structure of NL expressions
(I (am sorry)), Dave, (I ((can't do) that))
Grammars are used to describe the syntax of a language Syntactic analysers and surface realisers assign a syntactic structure to a string/semantic representation on the basis of a grammar
Syntax
It is useful to think of this structure as a tree:
A parse tree represents the syntactic structure of a string according to some formal grammar: the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the grammar
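A minimal way to sketch such a tree in code is as nested tuples, where interior nodes carry non-terminal labels and leaves are the terminal words (the grammar labels below are assumptions for illustration):

```python
# Parse tree as nested tuples: ("LABEL", child, child, ...); a bare
# string is a leaf (terminal word).
tree = ("S",
        ("NP", "I"),
        ("VP", ("V", "am"), ("ADJ", "sorry")))

def leaves(t):
    """Collect the terminal words at the leaf nodes, left to right."""
    if isinstance(t, str):
        return [t]
    label, *children = t
    words = []
    for c in children:
        words.extend(leaves(c))
    return words

print(" ".join(leaves(tree)))  # I am sorry
```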
Methods in syntax
Words - syntactic tree
Algorithm: parser Resources used: Lexicon + Grammar Symbolic : hand-written grammar and lexicon Statistical : grammar acquired from treebank
A parser checks for correct syntax and builds a data structure.
Treebank: a text corpus in which each sentence has been annotated with its syntactic structure. Syntactic structure is commonly represented as a tree, hence the name treebank.
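A symbolic parser in the sense above can be sketched with a hand-written lexicon and grammar. This toy recursive-descent parser assigns a tree to a word string for the grammar S → NP V NP, NP → Name | Det N (the lexicon and rules are illustrative assumptions, not a real grammar):

```python
# Hand-written lexicon: word -> syntactic category.
LEXICON = {"John": "Name", "Mary": "Name", "loves": "V",
           "the": "Det", "dog": "N"}

def parse_np(words, i):
    """Try NP -> Name | Det N starting at position i."""
    if i < len(words) and LEXICON.get(words[i]) == "Name":
        return ("NP", words[i]), i + 1
    if (i + 1 < len(words) and LEXICON.get(words[i]) == "Det"
            and LEXICON.get(words[i + 1]) == "N"):
        return ("NP", words[i], words[i + 1]), i + 2
    return None, i

def parse(words):
    """Return an S tree if the sentence matches S -> NP V NP, else None."""
    np1, i = parse_np(words, 0)
    if np1 is None or i >= len(words) or LEXICON.get(words[i]) != "V":
        return None
    np2, j = parse_np(words, i + 1)
    if np2 is None or j != len(words):
        return None
    return ("S", np1, ("V", words[i]), np2)

print(parse("John loves the dog".split()))
```

A statistical parser would instead induce the grammar and its rule probabilities from a treebank rather than writing them by hand.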
Syntax applications
For spell checking
*its a fair exchange: no syntactic tree
It's a fair exchange: OK, syntactic tree
Syntax to meaning
John loves Mary → love(j,m)
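The mapping above can be sketched as a tiny compositional step: look up the semantic constants for the names and the predicate for the verb, then assemble the logical form (the constant and predicate tables below are assumptions for illustration):

```python
# Toy semantic lexicon: names map to logical constants, verbs to predicates.
CONSTANTS = {"John": "j", "Mary": "m"}
PREDICATES = {"loves": "love"}

def logical_form(sentence):
    """Map a 'Name Verb Name' sentence to predicate(constant,constant)."""
    subj, verb, obj = sentence.split()
    return f"{PREDICATES[verb]}({CONSTANTS[subj]},{CONSTANTS[obj]})"

print(logical_form("John loves Mary"))  # love(j,m)
```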
Semantics
Where the hell'd you get that idea, HAL?
Dave, although you took thorough precautions in the pod against my hearing you, I could see your lips move
Lexical semantics
Compositional semantics
A swear word that some people use when they are annoyed or surprised or to emphasize something
Pragmatics
Knowledge about the kind of actions that speakers intend by their use of sentences
REQUEST: HAL, open the pod bay door. STATEMENT: HAL, the pod bay door is open. INFORMATION QUESTION: HAL, is the pod bay door open?
Discourse
Where the hell'd you get that idea, HAL?
Dave and Frank were planning to disconnect me
Much of language interpretation is dependent on the preceding discourse/dialogue
Ambiguity
I made her duck
I cooked duck for her.
I cooked duck belonging to her.
I caused her to quickly lower her head or body.
Ambiguity
Sound-to- text issues:
Recognise speech vs. wreck a nice beach.
Ambiguity vs paraphrase
Ambiguity : the same sentence can mean different things Paraphrase: There are many ways of saying the same thing.
Beer, please. Can I have a beer? Give me a beer, please. I would like beer. I'd like a beer, please.
Applications of NLP
IE IR QA Dialogue Systems
How did Socrates die? Finding an answer may require reasoning. In this example, "die" has to be linked with drinking poisoned wine.
Evaluating QA Systems
The biggest independent evaluations of question answering systems have been carried out at TREC (Text Retrieval Conference) Five hundred factoid questions are provided and the groups taking part have a week in which to process the questions and return one answer per question. No changes are allowed to your system between the time you receive the questions and the time you submit the answers.
A Generic QA Framework
[Diagram: Questions → Search Engine over the Document Collection → top n documents → Document Processing → Answers]
A search engine is used to find the n most relevant documents in the document collection These documents are then processed with respect to the question to produce a set of answers which are passed back to the user Most of the differences between question answering systems are centred around the document processing stage
A Simplified Approach
The answers to the majority of factoid questions are easily recognised named entities, such as countries, cities, dates, people's names, etc. The relatively simple techniques of gazetteer lists and named entity recognisers allow us to locate these entities within the relevant documents, the most frequent of which can be returned as the answer. This leaves just one issue that needs solving: how do we know, for a specific question, what the type of the answer should be?
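The simplified approach can be sketched as follows: find all strings of the expected answer type in the retrieved documents (here via a toy gazetteer, an illustrative assumption) and return the most frequent one.

```python
from collections import Counter

# Toy gazetteer: entity string -> entity type (illustrative assumption).
GAZETTEER = {"Paris": "CITY", "Rome": "CITY", "France": "COUNTRY"}

def answer(docs, expected_type):
    """Most frequent entity of the expected type across the documents."""
    counts = Counter(
        tok for doc in docs for tok in doc.split()
        if GAZETTEER.get(tok) == expected_type)
    return counts.most_common(1)[0][0] if counts else None

docs = ["Paris is in France", "Paris and Rome are capitals"]
print(answer(docs, "CITY"))  # Paris
```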
Problems (1)
The most frequently occurring instance of the right type might not be the correct answer.
For example, if you are asking when someone was born, it may be that their death was more notable and hence will appear more often (e.g. John F. Kennedy's assassination).
There are many questions for which correct answers are not named entities:
How did Ayrton Senna die? in a car crash
Problems (2)
The gazetteer lists and named entity recognisers are unlikely to cover every type of named entity that may be asked about:
Even those types that are covered may well not be complete. It is of course relatively easy to build new lists, e.g. Birthstones.
Dialogue (1)
A sequence of utterances
Exchange of information among multiple dialogue participants
Stays coherent over time
Driven by a certain goal
finding the most suitable restaurant in a foreign city, booking the cheapest flight to a given city, controlling the state of the devices in a home, or the goal might also be the interaction itself (chatting)
Dialogue (2)
The most natural means of communication for humans, perceived as very expressive, efficient and robust. However, dialogue is a very complex protocol:
follow certain conventions or protocols that are adopted by participants
humans usually use their extensive knowledge and reasoning capabilities to understand the conversational partner; the dialogue utterances are often imperfect: ungrammatical or elliptical
Ellipsis
People often utter partial phrases to avoid repetition
A: At what time is Titanic playing? B: 8pm A: And The 5th element?
Deixis
Some words can only be interpreted in context:
Previous context (anaphora)
The monkey took the banana and ate it
Temporal/spatial
The man behind me will be dead tomorrow. (Who is the man? When does he die?)
Indirect Meaning
The meaning of a discourse may be far from literal.
B: I cant reach him. A: There is the telephone. B: I am not in my office. A: Okay.
Turn Taking
People seem to know very well when they can take their turn
There is little overlap (~5%)
Gaps are often a few tenths of a second
Turn taking appears fluid, but it is not obvious why
Conversational fillers
Phrases like a-ha, yes, hmm or eh are often uttered to fill the pauses of the conversation, or to indicate attention or reflection
The challenge here is to recognize when they should be understood as a request for turn taking and when they should be ignored
Dictation machine
Recognizes any utterance
N-gram language model
Often speaker dependent
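A minimal bigram language model of the kind a dictation system relies on can be sketched like this: estimate P(next word | previous word) from counts over a toy corpus (no smoothing, so unseen bigrams get probability 0):

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count word bigrams, with <s> as a sentence-start marker."""
    counts = defaultdict(Counter)
    for s in sentences:
        words = ["<s>"] + s.split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts

def prob(counts, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

counts = train_bigrams(["open the door", "open the window"])
print(prob(counts, "the", "door"))  # 0.5
```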
Dialogue Manager
Coordinates the activity of all components
Maintains a representation of the current state of the dialogue
Communicates with external applications
Decides about the next dialogue step
Three types of DM
Finite-state
dialogue flow determined by a finite state automata
Frame-based
form filling
Plan-based
problem solving
in practice, you often find extended versions or combinations of the above-mentioned approaches!
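The first type can be sketched very directly: the dialogue flow is a fixed automaton, and each user input moves the system to the next state. The states and prompts below are assumptions for illustration, not from any real system:

```python
# Finite-state dialogue manager: state -> (system prompt, next state).
STATES = {
    "ask_city": ("Which city are you flying to?", "ask_date"),
    "ask_date": ("On which date?", "confirm"),
    "confirm": ("Shall I book it?", "done"),
}

def run_dialogue(user_inputs):
    """Walk the automaton, collecting (prompt, reply) pairs."""
    state, transcript = "ask_city", []
    for reply in user_inputs:
        if state == "done":
            break
        prompt, nxt = STATES[state]
        transcript.append((prompt, reply))
        state = nxt
    return transcript

for prompt, reply in run_dialogue(["Rome", "Friday", "yes"]):
    print(prompt, "->", reply)
```

The rigidity is visible in the code: the system always has the initiative, and the user can only answer the current question.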
Frame Based
Plan Based
Take a problem solving approach
There are goals to be reached Plans are made to reach those goals The goals and plans of the other participants must be iteratively inferred or predicted
Text-To-Speech
Transforms the surface realization into an acoustic representation (a sound signal)
Typical parameters
Commercial systems:
small vocabulary (~100 words) closed domain system initiative
Research systems:
larger (but still small) vocabulary (~10000 words) closed domain (limited) mixed initiative
Different Initiatives
System-initiative
system always has control, user only responds to system questions
User-initiative:
user always has control, system passively answers user questions
Mixed-initiative:
control switches between system and user using fixed rules
Variable-initiative:
control switches between system and user dynamically based on participant roles, dialogue history, etc.
There is no single most convenient modality (different modalities have different advantages):
entering a day of the week: click on a calendar
entering a Zip code: use the keyboard
performing commands: speech
complex queries: express them as typed natural language
Case Study
Eliza Comic Companions
Eliza
Eliza, the first chatterbot: a computer program that mimics human conversation.
Joseph Weizenbaum, Massachusetts Institute of Technology
User types in natural language ELIZA then analyzes the user's statement and generates some response which it types out. http://www-ai.ijs.si/eliza/eliza.html
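An Eliza-style exchange can be sketched with keyword patterns and pronoun-swapped reflections, in the spirit of Weizenbaum's program (these particular rules are illustrative, not the original script):

```python
import re

# Pronoun reflections applied to the matched fragment.
REFLECTIONS = {"i": "you", "am": "are", "my": "your"}

def reflect(text):
    return " ".join(REFLECTIONS.get(w, w) for w in text.lower().split())

# Ordered (pattern, response template) rules; last rule is a catch-all.
RULES = [
    (r"i am (.*)", "How long have you been {}?"),
    (r"i don't (.*)", "Why do you not {}?"),
    (r".*", "Please tell me more."),
]

def respond(statement):
    for pattern, template in RULES:
        m = re.match(pattern, statement.lower())
        if m:
            return template.format(*(reflect(g) for g in m.groups()))

print(respond("I am a human"))  # How long have you been a human?
```

Note that nothing here understands the input; the "World Knowledge" and "Complex Assertions" failures shown on the following slides fall straight out of this purely pattern-based design.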
World Knowledge
User: I am a human. Eliza: How long have you been a human?
Wh-Questions
User: Who are you? Eliza: Would you prefer it if I were not? User: What do you want me to tell? Eliza: Does that question interest you?
Complex Assertions
User: I don't have a problem. I'm quite happy, I have a girlfriend, I study. Everything is OK! Eliza: Why do you not have any problem, I'm quite happy, you have a girlfriend, you study. Everything is OK!
But they may provide users with the most natural interface for many applications
Wizard of Oz
Putting it together
Why is NE important?
NE provides a foundation from which to build more complex IE systems
Relations between NEs can provide tracking, ontological information and scenario building
Tracking (co-reference): Dr Head, John, he
Learning Systems
use statistics or other machine learning developers do not need expertise require large amounts of annotated training data some changes may require re-annotation of the entire training corpus
Typical NE pipeline
Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
Entity finding (gazetteer lookup, NE grammars)
Coreference (alias finding, orthographic coreference, etc.)
Export to database / XML
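The gazetteer-lookup step of this pipeline can be sketched in a few lines: scan the tokenised text for known entity names and tag them (the gazetteer entries below are illustrative assumptions):

```python
# Toy gazetteer: surface form -> entity type.
GAZETTEER = {"London": "Location", "Acme": "Organization", "Smith": "Person"}

def find_entities(tokens):
    """Return (token, type) pairs for tokens found in the gazetteer."""
    return [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]

tokens = "Dr Smith moved Acme to London".split()
print(find_entities(tokens))
```

A real system would combine this lookup with NE grammars to handle multi-word names and entities missing from the lists.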
GATE
IR
IE
A couple of approaches
Active learning to reduce annotation burden
Supervised learning; Adaptive IE; the Melita methodology
Recall: how many retrieved documents are relevant, out of all the relevant documents?
IE Measures Examples
If I ask the librarian to search for books on cars, there are 10 relevant books in the library, and out of the 8 he found, only 4 seem to be relevant. What is his precision, recall and f-measure?
IE Measures Answers
If I ask the librarian to search for books on cars, there are 10 relevant books in the library, and out of the 8 he found, only 4 seem to be relevant. What is his precision, recall and f-measure?
Precision = 4/8 = 50%
Recall = 4/10 = 40%
F = (2*50*40)/(50+40) = 44.4%
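The same worked example, as code: precision, recall and the balanced F-measure from counts of retrieved, relevant and correctly retrieved items.

```python
def precision_recall_f(retrieved, relevant, correct):
    """P = correct/retrieved, R = correct/relevant, F = harmonic mean."""
    p = correct / retrieved
    r = correct / relevant
    f = 2 * p * r / (p + r)
    return p, r, f

# The librarian example: 8 books found, 10 relevant in total, 4 correct.
p, r, f = precision_recall_f(retrieved=8, relevant=10, correct=4)
print(f"P={p:.0%} R={r:.0%} F={f:.1%}")  # P=50% R=40% F=44.4%
```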
Adaptive IE
What is IE?
Automated ways of extracting structured information from unstructured or partially structured machine-readable files
What is AIE?
Performs tasks of traditional IE Exploits the power of Machine Learning in order to adapt to
complex domains having large amounts of domain dependent data different sub-language features different text genres
What is adaptable?
New domain information: based upon an ontology which can change
Different sub-language features: POS, noun chunks, etc.
Different text genres: free text, structured, semi-structured, etc.
Different types: text, string, date, name, etc.
Amilcare
Tool for adaptive IE from Web-related texts
Specifically designed for document annotation Based on (LP)2 algorithm
*Linguistic Patterns by Learning Patterns
Covering algorithm based on Lazy NLP Trains with a limited amount of examples Effective on different text types
free texts semi-structured texts structured texts
1. Best overall accuracy
2. Best result on the speaker field
3. No results below 75%
IE by example (1)
the seminar at 4 pm will ... How can we learn a rule to extract the seminar time?
IE by example (2)
IE by example (3)
Deep approach
Uses syntactic information
Uses semantics (named entities, etc.)
Heuristics (world rules, e.g. a brother is male)
Additional knowledge
Multi Slot
Extract several concepts simultaneously
Tom is the brother of Mary.
Brother(Tom, Mary)
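A minimal multi-slot extraction sketch: one pattern fills both slots of the Brother(X, Y) relation at once (the pattern below is an illustrative assumption, not a learned rule):

```python
import re

# One pattern captures both arguments of the relation simultaneously.
PATTERN = re.compile(r"(\w+) is the brother of (\w+)")

def extract_brother(text):
    """Return ('Brother', X, Y) if the relation is found, else None."""
    m = PATTERN.search(text)
    return ("Brother", m.group(1), m.group(2)) if m else None

print(extract_brother("Tom is the brother of Mary."))
# ('Brother', 'Tom', 'Mary')
```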
Bottom Up
Starts from a specific rule and relaxes it
Top Down
Starts from a general rule and specialises it
Overfitting
When the learner fits not only the underlying model but also the noise
Metadata extraction
Metadata extraction consists of two types:
Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.) Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
IE as an alternative to IR
IE returns knowledge at a much deeper level than traditional IR Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool. Even if results are not always accurate, they can be valuable if linked back to the original text
E.g.
<SUCCESSION-1>
ORGANIZATION : New York Times POST : "president" WHO_IS_IN : Russell T. Lewis WHO_IS_OUT : Lance R. Primis
<DOC> <DOCID> wsj93_050.0203 </DOCID> <DOCNO> 930219-0013. </DOCNO> <HL> Marketing Brief: @ Noted.... </HL> <DD> 02/19/93 </DD> <SO> WALL STREET JOURNAL (J), PAGE B5 </SO> <CO> NYTA </CO> <IN> MEDIA (MED), PUBLISHING (PUB) </IN> <TXT> <p> New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent. </p> </TXT> </DOC>
Answer (1)
<SUCCESSION-2> ORGANIZATION : "New York Times" POST : "general manager" WHO_IS_IN : "Russell T. Lewis" WHO_IS_OUT : "Lance R. Primis" <SUCCESSION-3> ORGANIZATION : "New York Times" POST : "executive vice president" WHO_IS_IN : WHO_IS_OUT : "Russell T. Lewis"
Answer (2)
<SUCCESSION-4> ORGANIZATION : "New York Times" POST : "deputy general manager" WHO_IS_IN : WHO_IS_OUT : "Russell T. Lewis" <SUCCESSION-5> ORGANIZATION : "New York Times Co." POST : "president" WHO_IS_IN : "Lance R. Primis" WHO_IS_OUT :
Answer (3)
<SUCCESSION-6> ORGANIZATION : "New York Times Co." POST : "chief operating officer" WHO_IS_IN : "Lance R. Primis" WHO_IS_OUT :
Questions?