Text Databases
Roughly 90% of available information is unstructured or semi-structured (text); only about 10% is structured.
Cheap Hardware!
CPU
Disk
Network
                         Structured Data     Unstructured Data (Text)
Search (goal-oriented)   Data Retrieval      Document Retrieval
Discover                 Data Mining         Text Mining
Typical IR systems
Potential Applications
Customer comment analysis
Trend analysis
Information filtering and routing
Event tracking
News story classification
Web search
Sentiment analysis
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
[Figure: precision vs. recall for several classifiers (SVM, Decision Tree, SOM, Logistic Regression), after Dumais (1998)]
F-measure: the harmonic mean of precision and recall.
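To make these measures concrete, here is a minimal Python sketch; the document IDs and relevance judgments are invented for illustration.

# Sketch: precision, recall, and F-measure for one query (toy data).
def precision_recall_f(relevant: set, retrieved: set):
    hits = len(relevant & retrieved)      # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F-measure: harmonic mean of precision and recall
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f

relevant = {"d1", "d3", "d5", "d7"}   # judged relevant for the query
retrieved = {"d1", "d2", "d3"}        # returned by the system
p, r, f = precision_recall_f(relevant, retrieved)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
# precision = 2/3 ≈ 0.67, recall = 2/4 = 0.50, F ≈ 0.57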
Semantics
Data pre-processing
Feature Extraction
Task: Extract a good subset of words / phrases to represent documents
Document collection → all unique words/phrases → feature extraction → all good words/phrases
Stemming
Reduces morphological variants of a word to a common stem so they can match each other: e.g., a document containing words like fish and fisher may not be retrieved by a query containing fishing, since the word fishing is not explicitly contained in the document.
Not always desirable: e.g., {university, universal} → univers (in Porter's stemmer).
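A quick check with NLTK's Porter stemmer (assumes nltk is installed) shows both the useful conflation and the over-stemming case above.

# Sketch: Porter stemming with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fish", "fisher", "fishing", "university", "universal"]:
    print(word, "->", stemmer.stem(word))
# "fishing" stems to "fish", so it can now match documents containing "fish";
# "university" and "universal" both collapse to "univers" (over-stemming).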
Feature Extraction
Training documents
→ Identification of all unique words
→ Removal of stop words (non-informative words, e.g. {the, and, when, more})
→ Word stemming: removal of suffixes to generate word stems; grouping word variants increases relevance, e.g. {walker, walking} → walk
→ Term weighting: term frequency, importance of the term in the document
→ Document representation
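A minimal sketch of the whole pipeline; the stop-word list and the document are toy examples (a real system would use a much larger list).

# Sketch of the feature-extraction pipeline:
# tokenize -> remove stop words -> stem -> count term frequencies.
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "and", "when", "more", "a", "of"}   # toy list
stemmer = PorterStemmer()

def extract_features(text: str) -> Counter:
    tokens = text.lower().split()                         # all unique words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    stems = [stemmer.stem(t) for t in tokens]             # word stemming
    return Counter(stems)                                 # term frequencies

print(extract_features("the man walked and the walking man"))
# "walked" and "walking" are grouped under the stem "walk"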
Data Representation
Term Frequency
tf: Term Frequency weighting
wij = Freqij = the number of times the jth term occurs in document Di.
Drawback: does not reflect the importance of the term for document discrimination.
Ex.
D1 = ABRTSAQWAXAO
D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   4  1  0  1  1  1  1  1  1  1
D2   4  2  1  0  1  1  1  1  0  1
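These counts can be reproduced directly, treating each letter as a term.

# Sketch: term-frequency weighting w_ij = Freq_ij for the example above.
from collections import Counter

vocab = list("ABKOQRSTWX")
docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}

for name, text in docs.items():
    freq = Counter(text)
    print(name, [freq[term] for term in vocab])
# D1 [4, 1, 0, 1, 1, 1, 1, 1, 1, 1]
# D2 [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]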
Ex. A term-by-document frequency matrix over the terms SQL, index, regression, likelihood, and linear for documents d1 through d10.
TF-IDF
tfidf: Inverse Document Frequency weighting
wij = Freqij * log(N / DocFreqj)
N = the number of documents in the training document collection.
DocFreqj = the number of documents in the training collection in which the jth term occurs.
Advantage: reflects the importance of the term for document discrimination.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.
Ex. (N = 2; terms occurring in both documents get weight log(2/2) = 0, while log10(2/1) ≈ 0.3)

     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0
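A short sketch reproduces the table: only O and W (unique to D1) and K (unique to D2) keep nonzero weight.

# Sketch: tf-idf weighting w_ij = Freq_ij * log10(N / DocFreq_j)
# for the two example documents (each letter is a term).
from collections import Counter
from math import log10

vocab = list("ABKOQRSTWX")
docs = {"D1": Counter("ABRTSAQWAXAO"), "D2": Counter("RTABBAXAQSAK")}
N = len(docs)

doc_freq = {t: sum(1 for d in docs.values() if d[t] > 0) for t in vocab}

for name, freq in docs.items():
    weights = [freq[t] * log10(N / doc_freq[t]) for t in vocab]
    print(name, [round(w, 2) for w in weights])
# Terms occurring in both documents get weight 0 (log10(2/2) = 0).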
Term Entropy
wij = log(Freqij + 1.0) * (1 + entropy(wi))
where
entropy(wi) = (1 / log N) * Σj=1..N (Freqij / ni) * log(Freqij / ni)
ni = Σj Freqij = the total frequency of the ith term in the training collection.
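A sketch of this log-entropy weighting; the normalization of Freqij by the term's total collection frequency ni is the usual convention and is assumed here.

# Sketch: log-entropy weighting
#   w_ij = log(Freq_ij + 1) * (1 + entropy(w_i))
# with entropy(w_i) = (1/log N) * sum_j (Freq_ij/n_i) * log(Freq_ij/n_i).
import math

def log_entropy_weights(freq):   # freq[i][j] = Freq_ij, one row per term
    n_docs = len(freq[0])
    weights = []
    for row in freq:
        n_i = sum(row)           # total collection frequency of the term
        entropy = sum((f / n_i) * math.log(f / n_i)
                      for f in row if f > 0) / math.log(n_docs)
        weights.append([math.log(f + 1.0) * (1.0 + entropy) for f in row])
    return weights

# A term concentrated in one document keeps its weight (entropy = 0);
# a term spread evenly over all documents is damped to 0 (entropy = -1).
print(log_entropy_weights([[4, 0, 0], [1, 1, 1]]))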
Similarity Measures
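A common choice in the vector-space model is the cosine of the angle between two weighted term vectors; a minimal sketch, reusing the tf vectors from the earlier example.

# Sketch: cosine similarity between two term-weight vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [4, 1, 0, 1, 1, 1, 1, 1, 1, 1]   # D1 term frequencies from the tf example
d2 = [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]   # D2 term frequencies
print(round(cosine(d1, d2), 3))        # ≈ 0.921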
Text Categorization
Task: assignment of one or more predefined
categories to one document.
Categories: topics / themes
Text Categorization: Architecture
Training documents / new document d
→ preprocessing
→ weighting
→ feature selection
→ classifier (trained against the predefined categories)
→ category(ies) assigned to d
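A compact sketch of this architecture with scikit-learn; the training texts and category labels are invented for illustration.

# Sketch: text categorization with scikit-learn (toy data, invented labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = ["stocks fell sharply today", "the team won the match",
              "markets rally on earnings", "striker scores twice in final"]
train_labels = ["finance", "sports", "finance", "sports"]  # predefined categories

# preprocessing + weighting (tf-idf) + classifier, as in the architecture above
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

new_doc = "shares rise after strong earnings report"
print(model.predict([new_doc]))   # category assigned to the new document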
Document Clustering
Task: group documents so that documents within the same group are more similar to each other than to documents in other groups.
Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
Lack of semantics: similar words/concepts cannot be identified without outside help.
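A minimal clustering sketch in the same spirit (toy documents, number of clusters fixed by hand).

# Sketch: document clustering with k-means over tf-idf vectors (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["database index and SQL queries", "SQL query optimization",
        "linear regression and likelihood", "regression models for prediction"]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents about the same topic should land in one cluster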
Observation
M = U * S * V^T (the singular value decomposition of the document-term matrix M)
S = a diagonal matrix of singular values
U = a square matrix whose entries capture the similarity between documents
V = a square matrix whose entries capture the similarity between terms
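A numpy sketch of the decomposition (numpy returns V already transposed; the matrix here reuses the tf example from earlier).

# Sketch: singular value decomposition of a document-by-term matrix M.
import numpy as np

M = np.array([[4, 1, 0, 1, 1, 1, 1, 1, 1, 1],    # D1 (tf example above)
              [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]])   # D2

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, s, Vt.shape)      # s holds the singular values
# M is recovered as U @ np.diag(s) @ Vt; truncating s to the top-k values
# gives the low-rank approximation used in latent semantic analysis.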