
Introduction to Natural Language Processing

Ahmad M. Bakr
Computer and Systems Engineering Department, Faculty of Engineering, Alexandria University, Egypt

Agenda
Introduction.
Basic text processing techniques.
Information Retrieval.
Sentiment Analysis.
Named Entity Recognition.
Question Answering.
Relation Extraction.

Introduction
NLP is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages that humans use naturally, so that people can interface with computers in those languages. Natural language processing aims to teach computers to understand the way humans learn and use language.

Introduction
Speech processing: get flight information or book a hotel over the phone.
Information extraction: discover the names of people, and the events they participate in, from a document.
Machine translation: translate a document from one human language into another.
Question answering: find answers to natural language questions in a text collection or database.
Summarization: generate a short biography of Noam Chomsky from one or more news articles.

Text Processing
Text processing is the manipulation of text, especially the transformation of text from one format to another, usually from plain text (a set of paragraphs) to a form that is easy to use in calculations.
The Vector Space Model (VSM) is one of the forms used by applications to represent a document as a vector of its words:
d_j = (w_1, w_2, w_3, ..., w_n)

Each word is assigned a weight (e.g., TF-IDF):
Weight = Term Frequency * 1 / (Document Frequency)
In standard TF-IDF, the inverse factor is usually log(N / Document Frequency), where N is the number of documents in the collection.

The similarity between two documents can be calculated as the similarity between the vectors of these documents, for example the cosine of the angle between them.
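
The weighting and similarity steps above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and the simplified weight from the slide (TF * 1/DF); the function names `tf_idf_vectors` and `cosine_similarity` and the sample documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {word: weight} dict per document, with weight = TF * 1/DF."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append({word: count / df[word] for word, count in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Illustrative documents (not from the slides).
docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stocks fell sharply today",
]
vectors = tf_idf_vectors(docs)
```

Here `cosine_similarity(vectors[0], vectors[1])` is positive because the first two documents share words, while `cosine_similarity(vectors[0], vectors[2])` is 0.0 because they share none.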

Information Retrieval
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

Information Retrieval
Usually, information is indexed to speed up queries. An inverted index is one of the primary ways to index text by its words: it maps each word to the documents that contain it.
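
As a rough sketch of the idea, an inverted index can be held in a dictionary from words to sets of document ids. The names `build_inverted_index` and `search_all`, the whitespace tokenization, and the sample documents are assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_all(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

# Illustrative documents (not from the slides).
docs = ["new york is large", "york is old", "a new car"]
index = build_inverted_index(docs)
```

With these documents, `search_all(index, "new york")` returns only document 0, because it is the only one that contains both words.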

Information Retrieval
Can we use an inverted index to search for a sentence or phrase such as "A B C"?
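
A plain inverted index records only which documents contain a word, so it cannot tell "A B C" from "C B A". One common workaround, sketched below, is a positional index that also stores where each word occurs. This is a swapped-in technique for illustration only; it is not the Document Index Graph named on the following slide. The function names and sample documents are assumptions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents in which the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return set()
    hits = set()
    # Try every occurrence of the first word as a possible start of the phrase.
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(
                start + offset in index[word].get(doc_id, [])
                for offset, word in enumerate(words)
            ):
                hits.add(doc_id)
                break
    return hits

# Illustrative documents (not from the slides).
docs = ["a b c d", "c b a", "x a b c"]
pos_index = build_positional_index(docs)
```

All three documents contain the words a, b, and c, but `phrase_search(pos_index, "a b c")` matches only documents 0 and 2, where the words appear in that order.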

Information Retrieval
Document Index Graph

Sentiment Analysis
Sentiment analysis, or opinion mining, refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials.

Sentiment Analysis
Techniques:
Maintaining a list of words for each class.
Example: "This is a nice movie", "This is a bad movie".
Using classifiers that are trained with sentences for each class separately.
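
The word-list technique can be sketched as follows. The two word lists and the function name `classify_sentiment` are illustrative assumptions; a real list would be far larger, and this simple count ignores negation such as "not nice".

```python
# Illustrative word lists, one per class (assumed for this example).
POSITIVE = {"nice", "good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "awful", "terrible"}

def classify_sentiment(sentence):
    """Label a sentence by counting the words it shares with each class list."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

On the two example sentences from the slide, "This is a nice movie" is labeled positive and "This is a bad movie" is labeled negative.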

Named Entity Recognition


NER is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Named Entity Recognition
Approaches:
Database-based recognition (e.g., WordNet).
Rule-based models.
Statistical models (e.g., HMM and Maximum Entropy).

Named Entity Recognition
Wikipedia-based NER
Index all page titles.
Two-phase algorithm:
Phase one: given a text, search it for all titles.
Phase two: score the candidate titles.
What factors should the scoring formula consider?
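
The two phases might look like the sketch below. Everything specific here is an assumption made for illustration: the tiny `TITLES` set stands in for an index of all Wikipedia page titles, and the scoring formula (title length in words plus a bonus for capitalization) is only one possible answer to the question above, not the one used by the author.

```python
# Hypothetical title index; a real system would index every Wikipedia page title.
TITLES = {"alexandria", "alexandria university", "egypt", "university"}
MAX_TITLE_WORDS = 3  # longest title, in words, that phase one will look for

def find_candidates(text):
    """Phase one: return (start, end, surface) for every word n-gram that is a title."""
    words = [w.strip(".,!?") for w in text.split()]
    candidates = []
    for start in range(len(words)):
        last_end = min(start + MAX_TITLE_WORDS, len(words))
        for end in range(start + 1, last_end + 1):
            surface = " ".join(words[start:end])
            if surface.lower() in TITLES:
                candidates.append((start, end, surface))
    return candidates

def score(candidate):
    """Phase two: toy score = title length in words, plus 1 if every word is capitalized."""
    start, end, surface = candidate
    capitalized = all(w[0].isupper() for w in surface.split())
    return (end - start) + (1 if capitalized else 0)

def recognize(text):
    """Keep the best-scoring candidates, skipping any that overlap a better one."""
    chosen, used = [], set()
    for cand in sorted(find_candidates(text), key=score, reverse=True):
        span = set(range(cand[0], cand[1]))
        if not span & used:
            chosen.append(cand[2])
            used |= span
    return chosen
```

For the text "He studied at Alexandria University in Egypt.", phase one finds four candidate titles, and the scoring prefers the longer "Alexandria University" over the overlapping "Alexandria" and "University".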

Question Answering
What is Question Answering?
QA is a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions posed by humans in a natural language.

Question Answering
A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems can pull answers from an unstructured collection of natural language documents.

Question Answering
Question Classification
The question classifier module determines the type of the question and the type of the answer.
Examples:
1) "Who discovered x-rays?" should be classified into the type human (individual).
2) "Where is Alexandria located?" should be classified into the type place.
Rule-based approaches.
Using classifiers trained on the possible question types.
The question is put in the form of a parse tree to capture the relationships between its entities (i.e., subjects, objects, etc.). The main purpose of the parse tree is to understand the question and the links between its entities.
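
A rule-based classifier can be as simple as a table from question words to answer types. The rule table and the name `classify_question` below are assumptions for illustration; only the "human" and "place" types come from the examples above, and the parse-tree step is not covered here.

```python
# Illustrative rules: (question prefix, answer type). Order matters.
RULES = [
    ("who", "human"),
    ("where", "place"),
    ("when", "time"),
    ("how many", "quantity"),
]

def classify_question(question):
    """Return the answer type of the first rule whose prefix starts the question."""
    q = question.lower().strip()
    for prefix, answer_type in RULES:
        if q.startswith(prefix):
            return answer_type
    return "unknown"
```

With these rules, "Who discovered x-rays?" is classified as human and "Where is Alexandria located?" as place; questions that match no rule fall through to "unknown".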

Question Answering
Query Formulation
Apply text processing techniques to form a query from the question. Techniques include:
Stemming (swimming -> swim).
Adding synonyms (USA -> United States of America).
Giving weights to the words of the question (nouns take higher weights).
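
The three techniques can be combined in a small sketch. All names and data here are hypothetical: `naive_stem` is a crude stand-in for a real stemmer, `SYNONYMS` stands in for a thesaurus, and the `NOUNS` set stands in for a part-of-speech tagger.

```python
# Hypothetical resources; a real system would use a thesaurus and a POS tagger.
SYNONYMS = {"usa": ["united states of america"]}
NOUNS = {"usa", "alexandria"}

def naive_stem(word):
    """Crude suffix stripping: drop "ing", then undo a doubled final consonant."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word

def formulate_query(question):
    """Return {term: weight}; nouns get weight 2.0, all other words 1.0."""
    query = {}
    for raw in question.lower().split():
        word = raw.strip("?.,!")
        if not word:
            continue
        weight = 2.0 if word in NOUNS else 1.0
        query[naive_stem(word)] = weight
        # Synonyms inherit the weight of the word they expand.
        for synonym in SYNONYMS.get(word, []):
            query[synonym] = weight
    return query
```

For the question "Who is swimming in the USA?", the resulting query contains "swim" instead of "swimming" and adds "united states of america" with the same weight as "usa".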

Question Answering
Search knowledge base
The main target is to identify the paragraphs that possibly contain answers to the user's question.
The knowledge base is usually indexed.
Answer Extraction
Parse the candidate paragraphs to extract sentences with possible answers.
Construct the parse tree of the matching sentences.
The parse tree gives insights into the relationships between the entities of a candidate sentence.
Rank the possible answers based on their relevance to the question.
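
The final ranking step can be sketched without a parser by scoring each candidate sentence on the weighted query terms it contains. This is a simplification that skips the parse-tree analysis described above; the function name `rank_sentences`, the query weights, and the candidate sentences are assumptions for illustration.

```python
def rank_sentences(query_weights, sentences):
    """Sort sentences by the total weight of the query terms they contain, best first."""
    def relevance(sentence):
        words = {w.strip(".,!?").lower() for w in sentence.split()}
        return sum(weight for term, weight in query_weights.items() if term in words)
    return sorted(sentences, key=relevance, reverse=True)

# Illustrative weighted query for "Who discovered x-rays?" and candidate sentences.
query = {"discovered": 1.0, "x-rays": 2.0}
candidates = [
    "Marie Curie studied radioactivity.",
    "Wilhelm Roentgen discovered X-rays in 1895.",
    "X-rays are used in hospitals.",
]
ranked = rank_sentences(query, candidates)
```

The sentence that contains both query terms ranks first, the one that contains only "x-rays" ranks second, and the sentence with no query terms ranks last.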
