Artificial Intelligence
Natural Language Processing
Dr Alexiei Dingli

Aims of NLP?
- Trying to make computers talk
- Give computers the linguistic abilities of humans


1940s - 1950s
- Turing's (1936) model of algorithmic computation
- McCulloch-Pitts neuron (McCulloch and Pitts, 1943): a simplified model of the neuron as a kind of computing element (propositional logic)
- Kleene (1951, 1956): finite automata and regular expressions

- Shannon (1948): probabilistic models of discrete Markov processes applied to automata for language
- Chomsky (1956): finite-state machines as a way to characterize a grammar

1940s - 1950s
Speech and language processing
- Shannon:
  - the metaphor of the noisy channel
  - entropy as a way of measuring the information capacity of a channel
- Foundational research in phonetics
- First machine speech recognizers (early 1950s)
  - 1952, Bell Labs: a statistical system that could recognize any of the 10 digits from a single speaker (Davis et al., 1952)


1940s - 1950s
Machine translation (MT): one of the earliest applications of computers
- Major attempts in the US and USSR: Russian to English and the reverse
- The Georgetown University, Washington system translated sample texts in 1954

The ALPAC report (1964)
- Assessed the research results of groups working on MT
- Concluded:
  - MT is not possible in the near future
  - Funding for MT should cease!
  - Basic research should be supported
  - Word-to-word translation does not work; linguistic knowledge is needed

1950s - 1970s Symbolic paradigm


- Formal language theory and generative syntax
- 1957: Noam Chomsky's Syntactic Structures
  - A formal definition of grammars and languages
  - Provides the basis for automatic syntactic processing of NL expressions
- 1967: Woods' procedural semantics
  - A procedural approach to the meaning of a sentence
  - Provides the basis for automatic semantic processing of NL expressions


1950s - 1970s Symbolic paradigm


Parsing algorithms
- Top-down and bottom-up; dynamic programming
Transformations and Discourse Analysis Project (TDAP)
- Harris, 1962
- Reimplemented as a cascade of finite-state transducers by Joshi and Hopely (1999) and Karttunen (1999)

1950s - 1970s Symbolic paradigm


- AI summer of 1956: John McCarthy, Marvin Minsky, Claude Shannon, and Nathaniel Rochester
  - Work on reasoning and logic
- Newell and Simon: the Logic Theorist and the General Problem Solver
- Early natural language understanding systems
  - Single domains
  - Combination of pattern matching and keyword search
  - Simple heuristics for reasoning and question answering
- Late 1960s: more formal logical systems



1950s - 1970s Statistical paradigm


- Bayesian methods applied to the problem of optical character recognition
- Bledsoe and Browning (1959): Bayesian text recognition
  - Uses a large dictionary
  - Computes the likelihood of each observed letter sequence given each word in the dictionary, multiplying the likelihoods for each letter
- Mosteller and Wallace (1964): Bayesian methods applied to the problem of authorship attribution on The Federalist Papers

- Testable psychological models of human language processing based on transformational grammar
- Resources
  - First online corpora: the Brown corpus of American English
  - DOC (Dictionary on Computer), an on-line Chinese dialect dictionary

Symbolic vs statistical approaches

Symbolic:
- Based on hand-written rules
- Requires linguistic expertise
- No frequency information
- More brittle and slower than statistical approaches
- Often more precise than statistical approaches
- Error analysis is usually easier than for statistical approaches

Statistical:
- Supervised or unsupervised
- Rules acquired from large corpora
- Not much linguistic expertise required
- Robust and quick
- Requires large (annotated) corpora
- Error analysis is often difficult

1970-1983 Statistical paradigm


- Speech recognition algorithms
- Hidden Markov models (HMMs) and the metaphors of the noisy channel and decoding
- Jelinek, Bahl, Mercer, and colleagues at IBM's Thomas J. Watson Research Center; Baker at Carnegie Mellon University


1970-1983 Logic-based paradigm


- Q-systems and metamorphosis grammars (Colmerauer, 1970, 1975)
- Definite Clause Grammars (Pereira and Warren, 1980)
- Functional Grammar (Kay, 1979)
- Lexical Functional Grammar (LFG) (Bresnan and Kaplan, 1982)

1970-1983 Natural Language Understanding


- SHRDLU: simulated a robot embedded in a world of toy blocks (Winograd, 1972a)
  - Accepts natural-language text commands: "Move the red block on top of the smaller green one"
  - Notable for its complexity and sophistication
- First attempt to build an extensive (for the time) grammar of English (based on Halliday's systemic grammar)

1970-1983 Natural Language Understanding


- Yale School: a series of language understanding programs
  - Conceptual knowledge (scripts, plans, goals, ...)
  - Human memory organization
  - Network-based semantics (Quillian, 1968)


1983-1993
Return of finite-state models
- Finite-state phonology and morphology (Kaplan and Kay, 1981)
- Finite-state models of syntax (Church, 1980)

Return of empiricism
- Probabilistic models throughout speech and language processing
  - IBM Thomas J. Watson Research Center: probabilistic models of speech recognition
- Data-driven approaches spread from speech to part-of-speech tagging, parsing, attachment ambiguities, and semantics
- New focus on model evaluation
- Considerable work on natural language generation



1994-1999
Major changes
- Probabilistic and data-driven models had become quite standard: parsing, part-of-speech tagging, reference resolution, and discourse processing algorithms incorporate probabilities
- Evaluation methodologies borrowed from speech recognition and information retrieval
- Commercial exploitation (speech recognition, spelling and grammar correction)
- Need for language-based information retrieval and information extraction
- Increases in the speed and memory of computers
- Rise of the Web


1994-1999 Resources and corpora


- Disk space becomes cheap
- Machine-readable text becomes common
- US funding emphasises large-scale evaluation on real data
- 1994: The British National Corpus is made available, a balanced corpus of British English
- Mid 1990s: WordNet (Fellbaum & Miller), a computational thesaurus developed by psycholinguists
- The World Wide Web used as a corpus



2000-2008 Empiricist trends 1


- Spoken and written material widely available
  - Linguistic Data Consortium (LDC), ...
  - Annotated collections (standard text sources with various forms of syntactic, semantic, and pragmatic annotations): Penn Treebank (Marcus et al., 1993), PropBank (Palmer et al., 2005), TimeBank (Pustejovsky et al., 2003b), ...
- More complex traditional problems became cast as supervised machine learning: parsing and semantic analysis
- Competitive evaluations
  - Parsing (Dejean and Tjong Kim Sang, 2001)
  - Information extraction (NIST, 2007a; Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003)
  - Word sense disambiguation (Palmer et al., 2001; Kilgarriff and Palmer, 2000)
  - Question answering (Voorhees and Tice, 1999)
  - Summarization (Dang, 2006)


2000-2008 Empiricist trends 2


More serious interplay with the statistical machine learning community
- Support vector machines (Boser et al., 1992; Vapnik, 1995)
- Maximum entropy techniques (multinomial logistic regression) (Berger et al., 1996)
- Graphical Bayesian models (Pearl, 1988)

2000-2008 Empiricist trends 2


Largely unsupervised statistical approaches
- Statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003)
- Topic modeling (Blei et al., 2003)
- Effective applications could be constructed from systems trained on unannotated data alone
- Use of unsupervised techniques

Elements of a Language
- Phonemes
- Morphemes
- Syntax
- Semantics


From sounds to language


- Linked with language understanding; carried out by the auditory cortex
- The basic sounds of language are Phonemes ("sound")
  - The smallest phonetic unit in a language capable of conveying a distinction in meaning
  - Every language has a discrete set of phonemes describing all its possible sounds
  - E.g. "m" in "man" and "c" in "can" are phonemes
- The basic units of words are Morphemes ("to change form")
  - A meaningful linguistic unit consisting of a root word or a word element that cannot be divided into smaller meaningful parts
  - E.g. "pick" and "s" in the word "picks" are morphemes

NATO Phonetic Alphabet


A - Alpha     B - Bravo     C - Charlie   D - Delta
E - Echo      F - Foxtrot   G - Golf      H - Hotel
I - India     J - Juliet    K - Kilo      L - Lima
M - Mike      N - November  O - Oscar     P - Papa
Q - Quebec    R - Romeo     S - Sierra    T - Tango
U - Uniform   V - Victor    W - Whiskey   X - X-ray
Y - Yankee    Z - Zulu

0 - Zero          1 - Wun (One)    2 - Two    3 - Tree (Three)  4 - Fower (Four)
5 - Fife (Five)   6 - Six          7 - Seven  8 - Ait (Eight)   9 - Niner (Nine)

. - Decimal (point)    . - (Full) stop
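As a quick illustration, a small Python sketch that spells a word using the letters of the alphabet above (the function name is mine, not from the slides; digits and punctuation could be added to the table the same way):

NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliet",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}

def spell_nato(word):
    # Look each letter up in the table, skipping anything not in it
    return " ".join(NATO[c] for c in word.upper() if c in NATO)

print(spell_nato("NLP"))  # November Lima Papa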


Exercise
Word      Morphemes    Phonemes
Bay       Bay (1)      B + ay (2)
Pots      ?            ?
A         ?            ?
Teacher   ?            ?


Exercise
Word      Morphemes        Phonemes
Bay       Bay (1)          B + ay (2)
Pots      Pot + s (2)      P + o + t + s (4)
A         A (1)            A (1)
Teacher   Teach + er (2)   T + ea + ch + e + r (5)


Syntax structure of language


Languages have structure:
- Not all sequences of words over the given alphabet are valid
- When a sequence of words is valid (grammatical), a natural structure can be induced on it


Syntax
Describes the constituent structure of NL expressions:
  (I (am sorry)), Dave, (I ((can't do) that))
- Grammars are used to describe the syntax of a language
- Syntactic analysers and surface realisers assign a syntactic structure to a string / semantic representation on the basis of a grammar

Syntax
It is useful to think of this structure as a tree:
- It represents the syntactic structure of a string according to some formal grammar
- The interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals

Syntax tree example


The tree for "John often gives a book to Mary":

(S (NP John)
   (VP (Adv often)
       (V gives)
       (NP (Det a) (N book))
       (PP (Prep to) (NP Mary))))
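If you want to experiment with such trees, the bracketed form above can be loaded and drawn with NLTK (assuming NLTK is installed; this snippet is my own illustration, not from the lecture):

from nltk import Tree  # pip install nltk

t = Tree.fromstring(
    "(S (NP John)"
    " (VP (Adv often) (V gives)"
    " (NP (Det a) (N book))"
    " (PP (Prep to) (NP Mary))))"
)
t.pretty_print()   # draws the constituency tree as ASCII art
print(t.leaves())  # ['John', 'often', 'gives', 'a', 'book', 'to', 'Mary']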


Methods in syntax
Words → syntactic tree
- Algorithm: parser
- Resources used: lexicon + grammar
  - Symbolic: hand-written grammar and lexicon
  - Statistical: grammar acquired from a treebank
- A parser checks for correct syntax and builds a data structure
- Difficulty: coverage and ambiguity

Treebank: a text corpus in which each sentence has been annotated with syntactic structure. The structure is commonly represented as a tree, hence the name treebank.
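A minimal symbolic parser in this style, using NLTK with a hand-written toy grammar (the grammar is illustrative, not the lecture's):

import nltk  # pip install nltk

# Hand-written toy grammar and lexicon (the symbolic approach above)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | 'Mary' | Det N
VP -> V NP PP
PP -> P NP
Det -> 'a'
N -> 'book'
V -> 'gives'
P -> 'to'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John gives a book to Mary".split()):
    tree.pretty_print()  # one tree per parse; an ambiguous string yields several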

Syntax applications
For spell checking:
- *Its a fair exchange → no syntactic tree
- It's a fair exchange → OK, syntactic tree

To construct the meaning of a sentence
To generate a grammatical sentence



Syntax to meaning
John loves Mary → love(j, m)
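A toy sketch of how such a mapping might be computed compositionally (the representation and names are my own simplification, not the lecture's formalism):

# Each word denotes either a constant or a function; meanings combine
# up the syntax tree to yield the logical form love(j,m).
lexicon = {
    "John": "j",
    "Mary": "m",
    "loves": lambda subj, obj: f"love({subj},{obj})",
}

def interpret(subject, verb, obj):
    # S -> NP VP: apply the verb's meaning to the two NP meanings
    return lexicon[verb](lexicon[subject], lexicon[obj])

print(interpret("John", "loves", "Mary"))  # -> love(j,m)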


Semantics
"Where the hell'd you get that idea, HAL?"
"Dave, although you took thorough precautions in the pod against my hearing you, I could see your lips move."


Lexical semantics - Meaning of words

To get:
1. come to have or hold; receive
2. succeed in attaining, achieving, or experiencing; obtain
3. experience, suffer, or be afflicted with
4. move in order to pick up, deal with, or bring
5. bring or come into a specified state or condition
6. catch, apprehend, or thwart
7. come or go eventually or with some difficulty
8. move or come into a specified position or state
...

An idea:
1. a thought or suggestion about a possible course of action
2. a mental impression
3. a belief
4. (the idea) the aim or purpose

The hell:
1. a place regarded in various religions as a spiritual realm of evil and suffering, often depicted as a place of perpetual fire beneath the earth to which the wicked are sent after death
2. a state or place of great suffering
3. a swear word that some people use when they are annoyed or surprised

Lexical semantics

Who is the master?
- Context?
- Semantic relations?


Compositional semantics

Where the hell did you get that idea?
- "the hell": a swear word that some people use when they are annoyed or surprised, or to emphasize something
- "get that idea": have this belief


Semantics issues in NLP


- Definition and representation of meaning
- Meaning construction
- Semantic relations
- Interaction between semantics and syntax


Pragmatics
Knowledge about the kind of actions that speakers intend by their use of sentences:
- REQUEST: HAL, open the pod bay door.
- STATEMENT: HAL, the pod bay door is open.
- INFORMATION QUESTION: HAL, is the pod bay door open?

Speech act analysis (politeness, irony, greeting, apologizing, ...)



Discourse
"Where the hell'd you get that idea, HAL?"
"Dave and Frank were planning to disconnect me."

Much of language interpretation is dependent on the preceding discourse/dialogue.

Linguistic knowledge in NLP - summary

- Phonetics and phonology: knowledge about linguistic sounds
- Morphology: knowledge of the meaningful components of words
- Syntax: knowledge of the structural relationships between words
- Semantics: knowledge of meaning
- Pragmatics: knowledge of the relationship of meaning to the goals and intentions of the speaker
- Discourse: knowledge about linguistic units larger than a single utterance

Ambiguity
I made her duck
- I cooked duck for her.
- I cooked the duck belonging to her.
- I caused her to quickly lower her head or body.


Ambiguity
Sound-to-text issues:
- "Recognise speech" vs "Wreck a nice beach"

Speech act interpretation
- Can you switch on the computer?
- Question or request?


Ambiguity vs paraphrase
- Ambiguity: the same sentence can mean different things
- Paraphrase: there are many ways of saying the same thing
  - Beer, please.
  - Can I have a beer?
  - Give me a beer, please.
  - I would like beer.
  - I'd like a beer, please.


Applications of NLP
- Information Extraction (IE)
- Information Retrieval (IR)
- Question Answering (QA)
- Dialogue Systems


What is Question Answering?


The main aim of QA is to present the user with a short answer to a question rather than a list of possibly relevant documents. As it becomes more and more difficult to find answers on the WWW using standard search engines, question answering technology will become increasingly important.


Question Types (1)


Clearly there are many different types of questions:
- When was Mozart born?
  - Requires a single fact as an answer
  - The answer may be found verbatim in text, i.e. "Mozart was born in 1756"
- How did Socrates die?
  - Finding an answer may require reasoning
  - In this example "die" has to be linked with drinking poisoned wine

Question Types (2)


- How do I assemble a bike?
  - The full answer may require fusing information from many different sources
  - The complexity can range from simple lists to script-based answers
- Is the Earth flat?
  - Requires a simple yes/no answer


Evaluating QA Systems
- The biggest independent evaluations of question answering systems have been carried out at TREC (the Text REtrieval Conference)
- Five hundred factoid questions are provided, and the groups taking part have a week in which to process the questions and return one answer per question
- No changes are allowed to your system between the time you receive the questions and the time you submit the answers

A Generic QA Framework
[Diagram: Questions → Search Engine (over the Document Collection) → top n documents → Document Processing → Answers]

- A search engine is used to find the n most relevant documents in the document collection
- These documents are then processed with respect to the question to produce a set of answers, which are passed back to the user
- Most of the differences between question answering systems are centred around the document processing stage; a toy end-to-end version is sketched below
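A runnable toy of this framework; the term-overlap retrieval and capitalised-token "processing" below are crude stand-ins for real components, invented for illustration:

def search_engine(question, docs, top_n=2):
    # IR step: rank documents by word overlap with the question
    q_terms = set(question.lower().split())
    return sorted(docs, key=lambda d: -len(q_terms & set(d.lower().split())))[:top_n]

def process_documents(question, docs):
    # Stand-in for the system-specific answer-extraction stage:
    # collect capitalised tokens as candidate answers
    return [tok for d in docs for tok in d.split() if tok[0].isupper()]

docs = ["Mozart was born in Salzburg", "Beethoven was born in Bonn"]
top = search_engine("where was Mozart born", docs)
print(process_documents("where was Mozart born", top))
# ['Mozart', 'Salzburg', 'Beethoven', 'Bonn'] -- candidates, not yet ranked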


A Simplified Approach
- The answers to the majority of factoid questions are easily recognised named entities, such as countries, cities, dates, people's names, etc.
- The relatively simple techniques of gazetteer lists and named entity recognisers allow us to locate these entities within the relevant documents, the most frequent of which can be returned as the answer
- This leaves just one issue that needs solving: how do we know, for a specific question, what the type of the answer should be?


A Simplified Approach (1)


The simplest way to determine the expected type of an answer is to look at the words which make up the question:
- who suggests a person
- when suggests a date
- where suggests a location


A Simplified Approach (2)


Clearly this division does not account for every question, but it is easy to add more complex rules:
- country suggests a location
- how much suggests an amount of money
- author suggests a person
- birthday suggests a date
- college suggests an organization

These rules can be easily extended as we think of more questions to ask; a minimal version is sketched below.
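A minimal sketch of such rules as code; the rule list mirrors the slides, and everything else is illustrative:

# More specific patterns are listed first so they win over bare "who"/"when".
RULES = [
    ("how much", "MONEY"),
    ("birthday", "DATE"),
    ("author", "PERSON"),
    ("country", "LOCATION"),
    ("college", "ORGANIZATION"),
    ("who", "PERSON"),
    ("when", "DATE"),
    ("where", "LOCATION"),
]

def expected_answer_type(question):
    q = question.lower()
    for keyword, answer_type in RULES:
        if keyword in q:
            return answer_type
    return "UNKNOWN"

print(expected_answer_type("When was Mozart born?"))         # DATE
print(expected_answer_type("Which college did he attend?"))  # ORGANIZATION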



Problems (1)
- The most frequently occurring instance of the right type might not be the correct answer.
  - For example, if you are asking when someone was born, it may be that their death was more notable and hence will appear more often (e.g. John F. Kennedy's assassination).
- There are many questions for which correct answers are not named entities:
  - How did Ayrton Senna die? → in a car crash

Problems (2)
The gazetteer lists and named entity recognisers are unlikely to cover every type of named entity that may be asked about:
- Even those types that are covered may well not be complete
- It is, of course, relatively easy to build new lists, e.g. birthstones

Does a gazetteer of people's names contain all the names?
- Amber
- Precious
- Diamond
- Asia
- Summer
- Holly

Are these people's names?



Dialogue (1)
- A sequence of utterances
- Exchange of information among multiple dialogue participants
- Stays coherent over time
- Driven by a certain goal:
  - finding the most suitable restaurant in a foreign city
  - booking the cheapest flight to a given city
  - controlling the state of the devices in a home
  - or the goal might be the interaction itself (chatting)

Dialogue (2)
- The most natural means of communication for humans; perceived as very expressive, efficient and robust
- However, dialogue is a very complex protocol:
  - participants follow certain conventions or protocols
  - humans usually use their extensive knowledge and reasoning capabilities to understand the conversational partner
  - dialogue utterances are often imperfect: ungrammatical or elliptical

Ellipsis
People often utter partial phrases to avoid repetition:
  A: At what time is Titanic playing?
  B: 8pm
  A: And The 5th Element?

It is necessary to keep track of the conversation to complete such phrases.



Deixis
Some words can only be interpreted in context:
- Previous context (anaphora)
  - The monkey took the banana and ate it.
- Future context (cataphora)
  - Give me that. The book by the lamp.
- Temporal/spatial
  - The man behind me will be dead tomorrow. (Who is the man? When does he die?)

Indirect Meaning
The meaning of a discourse may be far from literal:
  B: I can't reach him.
  A: There is the telephone.
  B: I am not in my office.
  A: Okay.

Undertones and implications are often employed for effect or efficiency.



Turn Taking
People seem to know very well when they can take their turn:
- There is little overlap (5%)
- Gaps are often a few tenths of a second
- Appears fluid, but it is not obvious why

A computational model of overlap does not exist, which causes problems for dialogue systems.


Conversational fillers
Phrases like "a-ha", "yes", "hmm" or "eh" are often uttered to fill the pauses of the conversation, or to indicate attention or reflection. The challenge is to recognize when they should be understood as a request for turn taking and when they should be ignored.

Most common dialogue domains


- Flight and train timetable information and reservation
- Smart homes
- Automated directory enquiries
- Yellow pages enquiries
- Weather information


Components of a Dialogue System


Automatic Speech Recognition


Transforms speech to text. Two basic types:
- Grammar-based ASR
  - The set of accepted phrases is defined by regular/context-free grammars (i.e. a language model in the form of a grammar)
  - Usually speaker independent
- Dictation machine
  - Recognizes any utterance
  - N-gram language model
  - Often speaker dependent
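To make the "N-gram language model" concrete, a tiny bigram sketch (the corpus and the numbers are invented for illustration):

from collections import Counter

corpus = "open the pod bay door please open the door".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("open", "the"))  # 1.0 in this toy corpus
print(bigram_prob("the", "pod"))   # 0.5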


Natural Language Understanding


Analyzes a textual utterance and returns its formal semantic representation:
- Logical formula
- Named entities
- etc.


Dialogue Manager
- Coordinates the activity of all components
- Maintains a representation of the current state of the dialogue
- Communicates with external applications
- Decides on the next dialogue step


Three types of DM
Finite-state
- dialogue flow determined by a finite-state automaton

Frame-based
- form filling

Plan (task) based
- a dynamic plan is constructed to reach the dialogue goal

In practice, you often find extended versions or combinations of the above approaches!

Finite State Automata
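A minimal finite-state dialogue manager sketch in the spirit of this approach (the states and prompts are invented for illustration, not taken from the slides):

# Each state maps to (system prompt, next state); None ends the dialogue.
STATES = {
    "ask_origin":      ("Where are you travelling from?", "ask_destination"),
    "ask_destination": ("Where are you travelling to?",   "ask_date"),
    "ask_date":        ("On which date?",                 "confirm"),
    "confirm":         ("Shall I book that?",             None),
}

def run_dialogue():
    state, answers = "ask_origin", {}
    while state is not None:
        prompt, next_state = STATES[state]
        answers[state] = input(prompt + " ")  # system initiative: fixed flow
        state = next_state
    return answers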


Frame Based


Plan Based
Takes a problem-solving approach:
- There are goals to be reached
- Plans are made to reach those goals
- The goals and plans of the other participants must be iteratively inferred or predicted

Potential for handling complicated dialogues, but:
- suffers from today's technological limitations
- in more complex cases the planning problem can become computationally intractable

Example: bathroom consultant

Natural Language Generation


Produces a textual utterance (the so-called surface realization) from an internal (formal) representation of the answer. The surface realization can include formatting information:
- Speaking style, pauses
- Background sounds

Text-To-Speech
Transforms the surface realization into an acoustic representation (sound signal).


Typical parameters
Commercial systems:
- small vocabulary (~100 words)
- closed domain
- system initiative

Research systems:
- larger (but still small) vocabulary (~10,000 words)
- closed domain
- (limited) mixed initiative

Different Initiatives
System initiative
- the system always has control; the user only responds to system questions

User initiative
- the user always has control; the system passively answers user questions

Mixed initiative
- control switches between system and user using fixed rules

Variable initiative
- control switches between system and user dynamically, based on participant roles, dialogue history, etc.

Multi Modal Dialogue Systems


Several possible input/output modalities to communicate with dialogue systems:
- speech, text, pointing, graphics, gestures, face configurations, body positions, emotions, etc.

There is no single most convenient modality (different modalities have different advantages):
- entering a day of the week: click on a calendar
- entering a Zip code: use the keyboard
- performing commands: speech
- complex queries: express them as typed natural language

Several modalities are useful:
- when one modality is not applicable, e.g. eyes or hands are busy, or a silent environment
- or when one is difficult to use, e.g. small devices with a limited keyboard and small screen

Case Study
- Eliza
- Comic
- Companions


Eliza
Eliza, the first chatterbot: a computer program that mimics human conversation.
- Joseph Weizenbaum, Massachusetts Institute of Technology
- The user types in natural language; ELIZA then analyzes the user's statement and generates some response, which it types out.
- http://www-ai.ijs.si/eliza/eliza.html

ELIZA Sample conversation


User:  Men are all alike.
ELIZA: IN WHAT WAY?
User:  They're always bugging us about something or other.
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User:  Well, my boyfriend made me come here.
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User:  He says I'm depressed much of the time.
ELIZA: I AM SORRY TO HEAR YOU ARE DEPRESSED
User:  It's true. I am unhappy.
ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
User:  I need some help, that much seems certain.
ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
User:  Perhaps I could learn to get along with my mother.
ELIZA: TELL ME MORE ABOUT YOUR FAMILY


Eliza's strategy Keywords


- The input is read and inspected for the presence of a keyword
- When a keyword is found, the sentence is transformed according to a rule associated with the keyword
- Otherwise a content-free remark or, under certain conditions, an earlier transformation is retrieved
- The text so computed or retrieved is then printed out

ELIZA Example transformations


"I am X" "How long have you been X? I am very unhappy. HOW LONG HAVE YOU BEEN VERY UNHAPPY? "X YOU Y ME" "What makes you think I Y you? It seems that you hate me. WHAT MAKES YOU THINK THAT I HATE YOU?


ELIZA Why does it work?


- ELIZA plays a Rogerian psychotherapist, so it can pretend to know almost nothing of the real world:
  "I went for a long boat ride" - "Tell me about boats"
- It is not a real dialogue (no real communication):
  - ELIZA generates a reaction (often a rather non-committal one) to the user's statement
  - The user cannot ask for information
  - The user cannot issue an order

ELIZA Does it really work?


Eliza: Hello. I am ELIZA. How can I help you?
You:   I went to Paris yesterday.
Eliza: Can you elaborate on that?
You:   I took the train.
Eliza: Can you elaborate on that?
You:   The train was late.
Eliza: Please go on.
You:   All the seats were full.
Eliza: Please go on.
You:   A little boy was having a tantrum.
Eliza: Please go on.

ELIZA Does it really work?


Indexicals (I, you, here, now)
  User:  I am having fun right now.
  Eliza: Does it please you to believe that I am having fun right now.

World knowledge
  User:  I am a human.
  Eliza: How long have you been a human?


When Eliza fails


Yes/no questions
  User:  Are you a good therapist?
  Eliza: Would you prefer it if I were a good therapist?
  User:  I'm wondering about your skills as a therapist.
  Eliza: Does it please you to believe I am wondering about my skills of therapist?

Wh-questions
  User:  Who are you?
  Eliza: Would you prefer it if I were not?
  User:  What do you want me to tell?
  Eliza: Does that question interest you?

Complex assertions
  User:  I don't have a problem. I'm quite happy, I have a girlfriend, I study. Everything is OK!
  Eliza: Why do you not have any problem, I'm quite happy, you have a girlfriend, you study. Everything is OK!

Eliza is not intelligent but...


- Regardless of what people believe or know about the inner workings of computers, they talk about them and interact with them as social entities.
- Weizenbaum (1976) notes that many people continued to believe in ELIZA's abilities even after the program's operation was explained to them.
- People act toward computers as if they were people and expect that computers should be able to understand their needs and be capable of interacting with them naturally.
- Given these predispositions, speech- and language-based systems are not required to be intelligent, but they may provide users with the most natural interface for many applications.

The Comic Avatar


Wizard of Oz


Putting it together


The Companions Architecture


The Companions Robot


The Companions Interface 1


The Companions Interface 2


What is Named Entity Recognition?


Identification of proper names in texts, and their classification into a set of predefined categories of interest:
- Persons
- Organisations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
- Various other types as appropriate

Why is NE important?
- NE provides a foundation from which to build more complex IE systems
- Relations between NEs can provide tracking, ontological information and scenario building
- Tracking (co-reference): Dr Head, John, he


Two kinds of approaches


Knowledge engineering
- rule based
- developed by experienced language engineers
- makes use of human intuition
- requires only a small amount of training data
- development can be very time-consuming
- some changes may be hard to accommodate

Learning systems
- use statistics or other machine learning
- developers do not need expertise
- require large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus


Typical NE pipeline
- Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
- Entity finding (gazetteer lookup, NE grammars)
- Coreference (alias finding, orthographic coreference, etc.)
- Export to database / XML
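A toy sketch of the entity-finding step via gazetteer lookup; the gazetteer contents are invented, and real systems combine lookup with NE grammars:

GAZETTEER = {
    "London": "LOCATION",
    "Malta": "LOCATION",
    "IBM": "ORGANIZATION",
    "John": "PERSON",
}

def find_entities(tokens):
    # Return (token, type) pairs for tokens found in the gazetteer
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

print(find_entities("John works for IBM in London".split()))
# [('John', 'PERSON'), ('IBM', 'ORGANIZATION'), ('London', 'LOCATION')]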

GATE and ANNIE


- GATE (General Architecture for Text Engineering) is a framework for language processing
- ANNIE (A Nearly New Information Extraction system) is a suite of language processing tools which provides NE recognition
- GATE also includes:
  - plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages, etc.
  - tools for visualising and manipulating ontologies
  - ontology-based information extraction tools
  - evaluation and benchmarking tools

GATE


Information Extraction vs. Retrieval

[Diagram: IR returns whole documents, while IE returns structured facts extracted from them]


A couple of approaches
Active learning to reduce annotation burden
- supervised learning
- Adaptive IE
- the Melita methodology

Automatic annotation of large repositories
- largely unsupervised
- Armadillo


The Seminar Announcements Task


- Created by the Carnegie Mellon School of Computer Science
- The task: retrieve
  - Speaker
  - Location
  - Start time
  - End time
  from seminar announcements received by email



Seminar Announcements Example


Dr. Steals presents in Dean Hall at one am.

becomes

<speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.

Information Extraction Measures


Precision: how many of the retrieved documents are relevant?

  Precision = (relevant documents retrieved) / (documents retrieved)

Recall: how many of all the relevant documents were retrieved?

  Recall = (relevant documents retrieved) / (all relevant documents)

F-measure: the weighted harmonic mean of precision and recall

  F = 2 * Precision * Recall / (Precision + Recall)


IE Measures Examples
If I ask the librarian to search for books on cars, there are 10 relevant books in the library, and out of the 8 he found only 4 are relevant. What is his precision, recall and F-measure?


IE Measures Answers
If I ask the librarian to search for books on cars, there are 10 relevant books in the library, and out of the 8 he found only 4 are relevant. What is his precision, recall and F-measure?

Precision = 4/8 = 50%
Recall = 4/10 = 40%
F = (2 * 50 * 40) / (50 + 40) = 44.4%
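The same computation as a small helper, a straightforward transcription of the formulas above:

def precision_recall_f(relevant_retrieved, retrieved, relevant):
    p = relevant_retrieved / retrieved
    r = relevant_retrieved / relevant
    f = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f

print(precision_recall_f(4, 8, 10))  # (0.5, 0.4, 0.444...)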

Adaptive IE
What is IE?
- Automated ways of extracting structured information from unstructured or partially structured machine-readable files

What is Adaptive IE?
- Performs the tasks of traditional IE
- Exploits the power of machine learning in order to adapt to:
  - complex domains having large amounts of domain-dependent data
  - different sub-language features
  - different text genres
- Considers the usability and accessibility of the system important


What is adaptable?
- New domain information: based upon an ontology which can change
- Different sub-language features: POS, noun chunks, etc.
- Different text genres: free text, structured, semi-structured, etc.
- Different types: text, string, date, name, etc.

Amilcare
- Tool for adaptive IE from Web-related texts
- Specifically designed for document annotation
- Based on the (LP)2 algorithm (Linguistic Patterns by Learning Patterns)
- Covering algorithm based on Lazy NLP
- Trains with a limited amount of examples
- Effective on different text types:
  - free texts
  - semi-structured texts
  - structured texts
- Uses GATE and ANNIE for preprocessing



CMU: detailed results


System   speaker  location  stime  etime  All slots
(LP)2      77.6     75.0     99.0   95.5     86.0
BWI        67.7     76.7     99.6   93.9     83.9
HMM        76.6     78.6     98.5   62.1     82.0
SRV        56.3     72.3     98.5   77.9     77.1
Rapier     53.0     72.7     93.4   96.2     77.3
Whisk      18.3     66.4     92.6   86.0     64.9

1. Best overall accuracy
2. Best result on the speaker field
3. No results below 75%

RULIE: Rule Unification for learning IE


System   speaker  location  stime  etime  All slots
(LP)2      77.6     75.0     99.0   95.5     86.0
RULIE      82.0     80.0     99.0   98.0     89.7


IE by example (1)
"... the seminar at 4 pm will ..."

How can we learn a rule to extract the seminar time?


IE by example (2)


IE by example (3)


Shallow vs Deep Approaches


Shallow approach
- Uses syntax primarily: tokenisation, POS, etc.

Deep approach
- Uses syntactic information
- Uses semantics (named entities, etc.)
- Heuristics (world rules, e.g. a brother is male)
- Additional knowledge

Single vs Multi Slot


Single slot
- Extract one element at a time
  - The seminar is at 4pm.

Multi slot
- Extract several concepts simultaneously
  - Tom is the brother of Mary. → Brother(Tom, Mary)

Top-Down vs Bottom-Up


Top-down
- Starts from a generic rule and specialises it

Bottom-up
- Starts from a specific rule and relaxes it


Top Down


Bottom Up


Overfitting vs Underfitting


Underfitting
- When the learner does not manage to detect the full underlying model
- Produces excessive bias

Overfitting
- When the learner fits the model and the noise

Stages of document processing


- Document selection involves identification and retrieval of potentially relevant documents from a large set (e.g. the web) in order to reduce the search space. Standard or semantically-enhanced IR techniques can be used for this.
- Document pre-processing involves cleaning and preparing the documents, e.g. removal of extraneous information, error correction, spelling normalisation, tokenisation, POS tagging, etc.
- Document processing consists mainly of information extraction.

Metadata extraction
Metadata extraction consists of two types:
- Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.)
- Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves information extraction techniques, often with the help of an ontology.

IE for Document Access


- With traditional query engines, getting the facts can be hard and slow:
  - Where has the President visited in the last year?
  - Which places in Europe have had cases of bird flu?
  - Which search terms would you use to get this kind of information?
  - How can you specify that you want someone's home page?
- IE returns information in a structured way
- IR returns documents containing the relevant information somewhere (if you're lucky)

IE as an alternative to IR
- IE returns knowledge at a much deeper level than traditional IR
- Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool
- Even if the results are not always accurate, they can be valuable if linked back to the original text

Try IE yourself ... (1)


Given a particular text, find all the successions...
- Hint: there are 6, including the one below
- Hint: we do not have complete information

E.g.
<SUCCESSION-1>
  ORGANIZATION: New York Times
  POST: "president"
  WHO_IS_IN: Russell T. Lewis
  WHO_IS_OUT: Lance R. Primis


<DOC> <DOCID> wsj93_050.0203 </DOCID> <DOCNO> 930219-0013. </DOCNO> <HL> Marketing Brief: @ Noted.... </HL> <DD> 02/19/93 </DD> <SO> WALL STREET JOURNAL (J), PAGE B5 </SO> <CO> NYTA </CO> <IN> MEDIA (MED), PUBLISHING (PUB) </IN> <TXT> <p> New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent. </p> </TXT> </DOC>

Answer (1)
<SUCCESSION-2> ORGANIZATION : "New York Times" POST : "general manager" WHO_IS_IN : "Russell T. Lewis" WHO_IS_OUT : "Lance R. Primis" <SUCCESSION-3> ORGANIZATION : "New York Times" POST : "executive vice president" WHO_IS_IN : WHO_IS_OUT : "Russell T. Lewis"


Answer (2)
<SUCCESSION-4> ORGANIZATION : "New York Times" POST : "deputy general manager" WHO_IS_IN : WHO_IS_OUT : "Russell T. Lewis" <SUCCESSION-5> ORGANIZATION : "New York Times Co." POST : "president" WHO_IS_IN : "Lance R. Primis" WHO_IS_OUT :


Answer (3)
<SUCCESSION-6> ORGANIZATION : "New York Times Co." POST : "chief operating officer" WHO_IS_IN : "Lance R. Primis" WHO_IS_OUT :


Questions?
