Text Databases
Roughly 90% of available information is unstructured or semi-structured (text); only about 10% is structured.
Cheap Hardware!
CPU
Disk
Network
                         Structured Data     Unstructured Data (Text)
Search (goal-oriented)   Data Retrieval      Document Retrieval
Discover                 Data Mining         Text Mining
Typical IR systems
Potential Applications
Customer comment analysis
Trend analysis
Information filtering and routing
Event tracking
News story classification
Web search
Sentiment analysis
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
[Figure: precision vs. recall for several classifiers (SVM, Decision Tree, SOM, Logistic Regression), after Dumais (1998)]
F-measure: the harmonic mean of precision and recall.
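To make these measures concrete, here is a minimal Python sketch; the document IDs and relevance judgments are invented for illustration.

# Sketch: precision, recall, and F-measure for one query (toy data).
def precision_recall_f(relevant: set, retrieved: set):
    hits = len(relevant & retrieved)      # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F-measure: harmonic mean of precision and recall
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f

relevant = {"d1", "d3", "d5", "d7"}   # judged relevant for the query
retrieved = {"d1", "d2", "d3"}        # returned by the system
p, r, f = precision_recall_f(relevant, retrieved)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
# precision = 2/3 ≈ 0.67, recall = 2/4 = 0.50, F ≈ 0.57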
Semantics
Data pre-processing
Feature Extraction
Task: Extract a good subset of words / phrases to represent documents
Document collection → all unique words/phrases → feature extraction → all good words/phrases
Stemming
Reduces morphological variants of a word to a common stem so they can match each other: e.g., a document containing words like fish and fisher may not be retrieved by a query containing fishing, since the word fishing is not explicitly contained in the document.
Not always desirable: e.g., {university, universal} → univers (in Porter's stemmer).
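A quick check with NLTK's Porter stemmer (assumes nltk is installed) shows both the useful conflation and the over-stemming case above.

# Sketch: Porter stemming with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fish", "fisher", "fishing", "university", "universal"]:
    print(word, "->", stemmer.stem(word))
# "fishing" stems to "fish", so it can now match documents containing "fish";
# "university" and "universal" both collapse to "univers" (over-stemming).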
Feature Extraction
Training documents
→ Identification of all unique words
→ Removal of stop words (non-informative words, e.g. {the, and, when, more})
→ Word stemming: removal of suffixes to generate word stems; grouping word variants increases relevance, e.g. {walker, walking} → walk
→ Term weighting: term frequency, importance of the term in the document
→ Document representation
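A minimal sketch of the whole pipeline; the stop-word list and the document are toy examples (a real system would use a much larger list).

# Sketch of the feature-extraction pipeline:
# tokenize -> remove stop words -> stem -> count term frequencies.
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "and", "when", "more", "a", "of"}   # toy list
stemmer = PorterStemmer()

def extract_features(text: str) -> Counter:
    tokens = text.lower().split()                         # all unique words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    stems = [stemmer.stem(t) for t in tokens]             # word stemming
    return Counter(stems)                                 # term frequencies

print(extract_features("the man walked and the walking man"))
# "walked" and "walking" are grouped under the stem "walk"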
Data Representation
Term Frequency
tf: Term Frequency weighting
wij = Freqij = the number of times the jth term occurs in document Di.
Drawback: does not reflect the importance of the term for document discrimination.
Ex.
D1 = ABRTSAQWAXAO
D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   4  1  0  1  1  1  1  1  1  1
D2   4  2  1  0  1  1  1  1  0  1
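These counts can be reproduced directly, treating each letter as a term.

# Sketch: term-frequency weighting w_ij = Freq_ij for the example above.
from collections import Counter

vocab = list("ABKOQRSTWX")
docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}

for name, text in docs.items():
    freq = Counter(text)
    print(name, [freq[term] for term in vocab])
# D1 [4, 1, 0, 1, 1, 1, 1, 1, 1, 1]
# D2 [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]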
Ex. A term-by-document frequency matrix over the terms SQL, index, regression, likelihood, and linear for documents d1 through d10.
TF-IDF
tfidf: Inverse Document Frequency weighting
wij = Freqij * log(N / DocFreqj)
N = the number of documents in the training document collection.
DocFreqj = the number of documents in the training collection in which the jth term occurs.
Advantage: reflects the importance of the term for document discrimination.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.
Ex. (N = 2; terms occurring in both documents get weight log(2/2) = 0, while log10(2/1) ≈ 0.3)

     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0
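A short sketch reproduces the table: only O and W (unique to D1) and K (unique to D2) keep nonzero weight.

# Sketch: tf-idf weighting w_ij = Freq_ij * log10(N / DocFreq_j)
# for the two example documents (each letter is a term).
from collections import Counter
from math import log10

vocab = list("ABKOQRSTWX")
docs = {"D1": Counter("ABRTSAQWAXAO"), "D2": Counter("RTABBAXAQSAK")}
N = len(docs)

doc_freq = {t: sum(1 for d in docs.values() if d[t] > 0) for t in vocab}

for name, freq in docs.items():
    weights = [freq[t] * log10(N / doc_freq[t]) for t in vocab]
    print(name, [round(w, 2) for w in weights])
# Terms occurring in both documents get weight 0 (log10(2/2) = 0).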
Term Entropy
wij = log(Freqij + 1.0) * (1 + entropy(wi))
where
entropy(wi) = (1 / log N) * Σj=1..N (Freqij / ni) * log(Freqij / ni)
ni = Σj Freqij = the total frequency of the ith term in the training collection.
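A sketch of this log-entropy weighting; the normalization of Freqij by the term's total collection frequency ni is the usual convention and is assumed here.

# Sketch: log-entropy weighting
#   w_ij = log(Freq_ij + 1) * (1 + entropy(w_i))
# with entropy(w_i) = (1/log N) * sum_j (Freq_ij/n_i) * log(Freq_ij/n_i).
import math

def log_entropy_weights(freq):   # freq[i][j] = Freq_ij, one row per term
    n_docs = len(freq[0])
    weights = []
    for row in freq:
        n_i = sum(row)           # total collection frequency of the term
        entropy = sum((f / n_i) * math.log(f / n_i)
                      for f in row if f > 0) / math.log(n_docs)
        weights.append([math.log(f + 1.0) * (1.0 + entropy) for f in row])
    return weights

# A term concentrated in one document keeps its weight (entropy = 0);
# a term spread evenly over all documents is damped to 0 (entropy = -1).
print(log_entropy_weights([[4, 0, 0], [1, 1, 1]]))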
Similarity Measures
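A common choice in the vector-space model is the cosine of the angle between two weighted term vectors; a minimal sketch, reusing the tf vectors from the earlier example.

# Sketch: cosine similarity between two term-weight vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [4, 1, 0, 1, 1, 1, 1, 1, 1, 1]   # D1 term frequencies from the tf example
d2 = [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]   # D2 term frequencies
print(round(cosine(d1, d2), 3))        # ≈ 0.921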
Text Categorization
Task: assignment of one or more predefined
categories to one document.
Categories: topics / themes
Text Categorization: Architecture
Training documents / new document d
→ preprocessing
→ weighting
→ feature selection
→ classifier (trained against the predefined categories)
→ category(ies) assigned to d
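A compact sketch of this architecture with scikit-learn; the training texts and category labels are invented for illustration.

# Sketch: text categorization with scikit-learn (toy data, invented labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = ["stocks fell sharply today", "the team won the match",
              "markets rally on earnings", "striker scores twice in final"]
train_labels = ["finance", "sports", "finance", "sports"]  # predefined categories

# preprocessing + weighting (tf-idf) + classifier, as in the architecture above
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

new_doc = "shares rise after strong earnings report"
print(model.predict([new_doc]))   # category assigned to the new document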
Document Clustering
Task: group documents so that documents within the same group are more similar to each other than to documents in other groups.
Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
Lack of semantics: similar words/concepts cannot be identified without outside help.
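A minimal clustering sketch in the same spirit (toy documents, number of clusters fixed by hand).

# Sketch: document clustering with k-means over tf-idf vectors (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["database index and SQL queries", "SQL query optimization",
        "linear regression and likelihood", "regression models for prediction"]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents about the same topic should land in one cluster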
Observation
M = U * S * V^T (the singular value decomposition of the document-term matrix M)
S = a diagonal matrix of singular values
U = a square matrix whose entries capture the similarity between documents
V = a square matrix whose entries capture the similarity between terms
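A numpy sketch of the decomposition (numpy returns V already transposed; the matrix here reuses the tf example from earlier).

# Sketch: singular value decomposition of a document-by-term matrix M.
import numpy as np

M = np.array([[4, 1, 0, 1, 1, 1, 1, 1, 1, 1],    # D1 (tf example above)
              [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]])   # D2

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, s, Vt.shape)      # s holds the singular values
# M is recovered as U @ np.diag(s) @ Vt; truncating s to the top-k values
# gives the low-rank approximation used in latent semantic analysis.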