Advanced Quantitative Techniques

Advanced Quantitative Research Methodology, Lecture
Notes: Text Analysis II: Unsupervised Learning via
Cluster Analysis¹
Gary King
http://GKing.Harvard.Edu
December 23, 2011
¹ Copyright 2010 Gary King, All Rights Reserved.

Reading
Justin Grimmer and Gary King. 2010. Quantitative Discovery of
Qualitative Information: A General Purpose Document Clustering
Methodology.
http://gking.harvard.edu/files/abs/discov-abs.shtml

The Problem: Discovery from Unstructured Text
Examples: scholarly literature, news stories, medical information, blog
posts, comments, product reviews, emails, social media updates,
audio-to-text summaries, speeches, press releases, legal decisions, etc.

10 minutes of worldwide email = 1 LOC equivalent

An essential part of discovery is classification: "one of the most
central and generic of all our conceptual exercises. . . . the foundation
not only for conceptualization, language, and speech, but also for
mathematics, statistics, and data analysis. . . . Without classification,
there could be no advanced conceptualization, reasoning, language,
data analysis or, for that matter, social science research." (Bailey,
1994).

We focus on cluster analysis: discovery through (1) classification and
(2) simultaneously inventing a classification scheme
(We analyze text; our methods apply more generally)

Why Johnny Can't Classify (Optimally)
Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Bell(100) ≈ 10^28 × Number of elementary particles in the universe
Now imagine choosing the optimal classification scheme by hand!
That we think of all this as astonishing . . . is astonishing
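
The growth of Bell(n) can be checked directly. The following is a minimal sketch, assuming the standard Bell-triangle recurrence; the function name `bell_numbers` is an illustrative choice and does not come from the lecture.

```python
def bell_numbers(n_max):
    """Return [Bell(0), ..., Bell(n_max)] using the Bell triangle.

    Each new row starts with the last entry of the previous row; every
    further entry is the sum of its left neighbor and the entry above
    that neighbor. The first entry of row n is Bell(n).
    """
    bells = [1]  # Bell(0) = 1: the empty set has exactly one partition
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


bells = bell_numbers(100)
print(bells[2], bells[3], bells[5])  # the values quoted on the slide
print(len(str(bells[100])))          # number of decimal digits in Bell(100)
```

Python integers have arbitrary precision, so Bell(100) is computed exactly rather than approximated.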

Why HAL Can't Classify Either

The goal, an optimal application-independent cluster analysis
method, is mathematically impossible:

No free lunch theorem: every possible clustering method performs
equally well on average over all possible substantive applications

Existing methods:
Many choices: model-based, subspace, spectral, grid-based,
graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, who knows?!
The literature: little guidance on when methods apply
Deep problem in cluster analysis literature: no way to know which
method will work ex ante

If Ex Ante doesn't work, try Ex Post

Methods and substance must be connected (no free lunch theorem)

The usual approach fails: hard to do it by understanding the model

We do it ex post (by qualitative choice). For example:

Create long list of clusterings; choose the best

Too hard for mere humans!

An organized list will make the search possible
E.g.: consider two clusterings that differ only because one document
(of many) moves from category 5 to 6

Our Idea: Meaning Through Geography



We develop a (conceptual) geography of clusterings

A New Strategy
Make it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data, each
representing different (unstated) substantive assumptions (<15 mins)
3 (Too much for a person to understand, but organization will help)
4 Develop an application-independent distance metric between
clusterings, a metric space of clusterings, and a 2-D projection
5 Local cluster ensemble creates a new clustering at any point, based
on weighted average of nearby clusterings
6 A new animated visualization to explore the space of clusterings
(smoothly morphing from one into others)
7 Millions of clusterings, easily comprehended (takes about 10-15
minutes to choose a clustering with insight)
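
Step 5 can be illustrated with a small sketch. The slides say only that the local cluster ensemble is a weighted average of nearby clusterings; everything else here is an assumption made for illustration: the Gaussian kernel and its `bandwidth`, averaging pairwise co-membership matrices, and recovering clusters as connected components above a `threshold`. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def comembership(labels):
    """n x n matrix with 1.0 where two documents share a cluster."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def local_ensemble(clusterings, coords, point, bandwidth=1.0, threshold=0.5):
    """Build a new clustering at `point` in the 2-D space of clusterings.

    clusterings: list of label lists, one per existing clustering
    coords:      list of (x, y) positions of those clusterings in the map
    Assumed design: Gaussian kernel weights, weighted average of
    co-membership matrices, clusters = connected components of the
    graph whose edges are averaged co-memberships above `threshold`.
    """
    raw = [math.exp(-((x - point[0]) ** 2 + (y - point[1]) ** 2)
                    / (2 * bandwidth ** 2)) for x, y in coords]
    total = sum(raw)
    weights = [w / total for w in raw]

    n = len(clusterings[0])
    avg = [[0.0] * n for _ in range(n)]
    for w, labels in zip(weights, clusterings):
        m = comembership(labels)
        for i in range(n):
            for j in range(n):
                avg[i][j] += w * m[i][j]

    # Depth-first search over the thresholded graph to label components.
    new_labels = [-1] * n
    current = 0
    for start in range(n):
        if new_labels[start] != -1:
            continue
        new_labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if new_labels[j] == -1 and avg[i][j] > threshold:
                    new_labels[j] = current
                    stack.append(j)
        current += 1
    return weights, new_labels


# Three toy clusterings of four documents, placed far apart in the map.
toy_clusterings = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
toy_coords = [(0.0, 0.0), (10.0, 10.0), (-10.0, -10.0)]
weights, ensemble = local_ensemble(toy_clusterings, toy_coords, (0.0, 0.0))
print(ensemble)
```

Querying a point that sits on top of an existing clustering reproduces that clustering, because its kernel weight dominates; points between methods blend their co-memberships.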

Many Thousands of Clusterings, Sorted & Organized
You choose one (or more), based on insight, discovery, useful information,. . .
[Figure: "Space of Cluster Solutions". A 2-D map in which each point is one clustering, labeled by the method and distance metric that produced it (e.g., kmeans correlation, hclust pearson single, affprop maximum, kmedoids stand.euc, mixvmf, spec_max, som, rock, biclust_spectral, mult_dirproc). Two example solutions, "Cluster Solution 1" and "Cluster Solution 2", partition U.S. presidents (Roosevelt through Obama); cluster labels shown include "Other Presidents", "Roosevelt To Carter", "Reagan Republicans", and "Reagan To Obama".]

Application-Independent Distance Metric: Axioms

Metric based on 3 assumptions

1 Distance between clusterings: a function of the pairwise document
agreements (pairwise agreements imply triples, quadruples, etc.)
2 Invariance: Distance is invariant to the number of documents (for any
fixed number of clusters)
3 Scale: the maximum distance is set to log(num clusters)

Only one measure satisfies all three (the "variation of
information")
Meila (2007): derives same metric using different axioms (lattice
theory)
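
The variation of information between two clusterings C and C' is commonly written VI(C, C') = H(C) + H(C') - 2 I(C; C'), where H is entropy and I is mutual information. The sketch below computes it from two label vectors; the function name and the use of natural logarithms are assumptions for illustration, not details taken from the lecture.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A; B), using natural logarithms."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_information = sum(
        (c / n) * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in joint.items()
    )
    return entropy(count_a) + entropy(count_b) - 2 * mutual_information


# Relabeling the clusters does not change the partition, so distance is ~0.
print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))
# One big cluster versus an even split into two clusters: log(2).
print(variation_of_information([0, 0, 0, 0], [0, 0, 1, 1]), math.log(2))
```

Because the measure depends only on which documents are grouped together, it is invariant to how the clusters are numbered.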

Evaluators Rate Machine Choices Better Than Their Own

Scale: (1) unrelated, (2) loosely related, or (3) closely related

Table reports: mean(scale)

Pairs from             Overall Mean   Evaluator 1   Evaluator 2
Random Selection           1.38          1.16          1.60
Hand-Coded Clusters        1.58          1.48          1.68
Hand-Coding                2.06          1.88          2.24
Machine                    2.24          2.08          2.40

p.s. The hand-coders did the evaluation!

Evaluating Performance

Goals:
Validate Claim: computer-assisted conceptualization outperforms
human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research

We now present three evaluations:
Cluster Quality: RA coders
Informative discoveries: Experienced scholars analyzing texts
Discovery: You're the judge

Evaluation 1: Cluster Quality

What Are Humans Good For?

They can't: keep many documents & clusters in their head
They can: compare two documents at a time
⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

automated visualization to choose one clustering
many pairs of documents
for coders: (1) unrelated, (2) loosely related, (3) closely related
Quality = mean(within cluster) - mean(between clusters)
Bias results against ourselves by not letting evaluators choose clustering
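
The quality score in this design can be sketched in a few lines. This is a minimal illustration, assuming ratings are stored as (document i, document j, rating) triples on the 1-3 scale; the function name and data layout are hypothetical.

```python
from statistics import mean


def cluster_quality(labels, rated_pairs):
    """Quality = mean(within-cluster ratings) - mean(between-cluster ratings).

    labels:      cluster label for each document, indexed by document id
    rated_pairs: iterable of (i, j, rating) with rating in {1, 2, 3}
    """
    within = [r for i, j, r in rated_pairs if labels[i] == labels[j]]
    between = [r for i, j, r in rated_pairs if labels[i] != labels[j]]
    if not within or not between:
        raise ValueError("need at least one within and one between pair")
    return mean(within) - mean(between)


# Made-up ratings for four documents in two clusters.
example_labels = [0, 0, 1, 1]
example_pairs = [(0, 1, 3), (2, 3, 2), (0, 2, 1), (1, 3, 1)]
print(cluster_quality(example_labels, example_pairs))  # 2.5 - 1 = 1.5
```

A larger score means coders see pairs inside a cluster as more closely related than pairs that straddle clusters.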

[Figure: difference in cluster quality, (Our Method) - (Human Coders); axis ticks at 0.1, 0.2, and 0.3 on either side of zero.]

Lautenberg Press Releases
Lautenberg: 200 Senate Press Releases (appropriations, economy,
education, tax, veterans, . . . )

Policy Agendas Project
Policy Agendas: 213 quasi-sentences from Bush's State of the Union
(agriculture, banking & commerce, civil rights/liberties, defense, . . . )

Reuters Gold Standard
Reuters: financial news (trade, earnings, copper, gold, coffee, . . . ); gold
standard for supervised learning studies

Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:
2 clusterings selected with our method (biased against us)
2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplar
document, automated content summary)

Asked for (6 choose 2) = 15 pairwise comparisons

User chooses, so we only care about the one clustering that wins

Both cases have a Condorcet winner:

Immigration:
Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
Genetic testing:
Our Method 1 → {Our Method 2, K-Means 1, K-Means 2} → Dir Proc. 1 → Dir Proc. 2
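
A Condorcet winner is a candidate that wins every pairwise comparison it takes part in. The sketch below checks for one among six clusterings. The pairwise outcomes are hypothetical: they are generated to agree with the Immigration ordering reported above, since the scholars' raw comparisons are not given in the slides.

```python
from itertools import combinations


def condorcet_winner(candidates, pairwise_winner):
    """Return the candidate that beats every rival head-to-head, or None.

    pairwise_winner maps frozenset({a, b}) to whichever of a, b was preferred.
    """
    for candidate in candidates:
        rivals = [c for c in candidates if c != candidate]
        if all(pairwise_winner[frozenset((candidate, r))] == candidate
               for r in rivals):
            return candidate
    return None


# Hypothetical data: every comparison agrees with the reported ordering.
ranking = ["Our Method 1", "vMF 1", "vMF 2",
           "Our Method 2", "K-Means 1", "K-Means 2"]
pairs = list(combinations(ranking, 2))           # 6 choose 2 comparisons
pairwise_winner = {frozenset(pair): pair[0] for pair in pairs}

winner = condorcet_winner(ranking, pairwise_winner)
print(len(pairs), winner)  # 15 Our Method 1
```

When preferences cycle (A beats B, B beats C, C beats A), no candidate wins all of its comparisons and the function returns None.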

Evaluation 3: What Do Members of Congress Do?

- David Mayhew's (1974) famous typology
- Advertising
- Credit Claiming
- Position Taking
- Data: 200 press releases from Frank Lautenberg's office (D-NJ)
- Apply our method

Example Discovery
mult_dirproc
kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc
mixvmf
hclust correlationmixvmfVA
hclust binary complete
mcquitty
hclust pearson single affprop cosine
hclust pearson median
hclust correlation single hclust pearson mcquitty
hclust correlation median
mec hclust pearson average hclust correlation complete
hclust binary single hclust correlation averagehclust pearson complete
hclust binary average kmeans pearson
hclustpearson
hclust correlation
centroid som
centroid rock
hclust binary median hclust binary mcquitty
hclust canberra single
biclust_spectral hclust spearman complete
spec_man
spec_cos
hclust canberra
kmeans kendall median spec_mink
spec_euc
affprop maximum hclust canberra average spec_max
mspec_minkspec_canb
mspec_man
kmeans spearman kmeans manhattan mspec_max
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust binary centroid
hclust kendall single
hclust
hclust hclustspearman
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
hclust spearman kendall mcquitty
mcquitty
hclust canberra centroid hclust kendall complete
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
euclidean mediansingle
hclust maximum divisive
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
mcquitty clust_convex hclust correlation ward
hclust euclidean mcquitty
kmedoids euclidean kmedoids
hclust pearson wardstand.euc
hclustmaximum
hclust maximum
divisive centroidaffprop euclidean
median
euclidean hclust canberra mcquitty
hclust maximum average
hclust
hclust maximum
euclidean complete
complete
hclust maximum
hclust manhattan complete mcquitty dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
hclust manhattan ward affprop info.costs
kmeanshclust euclidean ward
euclidean hclust canberra complete
sot_euc
hclust binary ward
hclusthclust spearman
kendall ward ward
hclust maximum ward kmeans binary
kmeans maximum

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclust binary average
hclustpearson
hclust correlation
centroid som
centroid rock
hclust binary median
biclust_spectral
affprop cosine
hclust spearman complete
hclust binary mcquitty
kmeans pearson
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
hclustmaximum
hclust maximum
divisive
hclust
hclust
centroidaffprop euclidean
median
euclidean
maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
complete mcquitty dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
hclust canberra mcquitty
Red point: a clustering by

affprop info.costs
kmeanshclust
sot_euc
euclidean ward
hclust binary ward Affinity Propagation-Cosine
hclust maximum ward
kendall ward ward
kmeans binary (Dueck and Frey 2007)

kmeans maximum

Example Discovery
mult_dirproc
mixvmf kmeans correlation

divisive stand.euc
mixvmf
mcquitty
hclust binary average
hclustpearson
hclust correlation
centroid som
centroid rock
biclust_spectral
affprop cosine
kmeans pearson
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
hclustmaximum
hclust maximum
divisive
hclust
hclust
median
euclidean
maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
Red point: a clustering by

affprop info.costs
kmeanshclust
sot_euc
euclidean ward
hclust binary ward Affinity Propagation-Cosine
hclust maximum ward
kendall ward ward
kmeans binary (Dueck and Frey 2007)

kmeans maximum
Close to:
Mixture of von Mises-Fisher
distributions (Banerjee et. al.
2005)

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum
Space between methods:

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclustpearson
hclust correlation
centroid som
centroid rock

biclust_spectral
hclust canberra
spec_man
spec_cos
kmeans kendall median hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclustpearson
hclust correlation
centroid som
centroid rock

biclust_spectral
hclust canberra
spec_man
spec_cos
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum
local cluster ensemble

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattan
euclidean affprop
single manhattan
centroid
medianaverage
manhattan
hclust
hclust manhattan
single manhattan
hclust
hclusteuclidean
hclust
manhattan centroid
euclidean average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum Found a region with particularly

insightful clusterings

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc

hclust correlation single
mec
mixvmf
affprop cosine
mcquitty
hclust pearson mcquitty
hclust pearson average hclust correlation complete
Mixture:
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattanmedian manhattan
centroid
average
hclust
hclusteuclidean
hclust affprop
single
manhattan
euclidean median
divisive
manhattan
single
manhattan
hclust maximum
hclust single
euclidean centroid
hclust
hclust euclidean
manhattan average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc

mec
mixvmf
affprop cosine
mcquitty
Mixture:
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
0.39 Hclust-Canberra-McQuitty
hclust canberra spec_cos
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust manhattan
hclust kmedoids
manhattan
centroid
average
hclust
hclusteuclidean
hclust affprop
single
manhattan
euclidean median
divisive
manhattan
single
manhattan
hclust maximum
hclust single
euclidean centroid
hclust
hclust euclidean
manhattan average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc

mec
mixvmf
affprop cosine
mcquitty
Mixture:
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
kmeans canberra
hclust
hclust
hclust hclustkendall
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
hclust canberra centroid
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
hclust kendall complete
mspec_cos
mspec_canb
mspec_euc
0.30 Spectral clustering
hclust
hclust manhattan
hclust kmedoids
manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
kmedoids
clust_convex
euclidean
hclust correlation ward
kmedoids
hclust pearson
wardstand.euc
Random Walk
hclust divisive median
euclidean
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc

mec
mixvmf
affprop cosine
mcquitty
Mixture:
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
kmeans canberra
hclust
hclust
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
mspec_cos
mspec_canb
mspec_euc
hclust
hclust manhattan
hclust kmedoids
manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
kmedoids
clust_convex
euclidean
kmedoids
hclust pearson
wardstand.euc
Random Walk
euclidean
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
sot_euc
hclust binary ward
0.13 Hclust-Correlation-Ward
kendall ward ward
kmeans maximum

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc

mec
mixvmf
affprop cosine
mcquitty
Mixture:
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
kmeans canberra
hclust
hclust
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
mspec_cos
mspec_canb
mspec_euc
hclust
hclust manhattan
hclust kmedoids
manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
kmedoids
clust_convex
euclidean
kmedoids
hclust pearson
wardstand.euc
Random Walk
euclidean
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
sot_euc
hclust binary ward
kendall ward ward
hclust maximum ward
kmeans maximum
kmeans binary
0.09 Hclust-Pearson-Ward

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc

mec
mixvmf
affprop cosine
mcquitty
Mixture:
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
kmeans canberra
hclust
hclust
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
mspec_cos
mspec_canb
mspec_euc
hclust
hclust manhattan
hclust kmedoids
manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
kmedoids
clust_convex
euclidean
kmedoids
hclust pearson
wardstand.euc
Random Walk
euclidean
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
sot_euc
hclust binary ward
kendall ward ward
hclust maximum ward
kmeans maximum
kmeans binary
0.05 Kmediods-Cosine

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc

mec
mixvmf
affprop cosine
mcquitty
Mixture:
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
kmeans canberra
hclust
hclust
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
mspec_cos
mspec_canb
mspec_euc
hclust
hclust manhattan
hclust kmedoids
manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
kmedoids
clust_convex
euclidean
kmedoids
hclust pearson
wardstand.euc
Random Walk
euclidean
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
sot_euc
hclust binary ward
kendall ward ward
hclust maximum ward
kmeans maximum
kmeans binary
0.05 Kmediods-Cosine
Symmetric
(Metrics 1-6)

Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust
mcquitty
hclust
hclust manhattan
hclust kmedoids
manhattan
centroid
average
hclust
hclusteuclidean
hclust affprop
single
manhattan
euclidean median
divisive
manhattan
single
manhattan
hclust maximum
hclust single
euclidean centroid
hclust
hclust euclidean
manhattan average
hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
sot_euc
hclust binary ward
kendall ward ward
kmeans maximum
Clusters in this Clustering
Mayhew
Example Discovery
mult_dirproc
kmeans correlation
divisive stand.euc
mixvmf
mcquitty
hclustpearson
hclust correlation
centroid som
centroid rock
spec_man
spec_cos
hclust canberra
spec_euc
mspec_minkspec_canb
mspec_man
mspec_cos
mspec_canb
mspec_euc
kmeans canberra
hclust
kendall
spearman
kendall
hclust centroid
centroid
average
median
median
spearman average
single
hclust kendall mcquitty
hclust
hclust
hclust spearman
hclust
hclust manhattan
hclust
hclust
hclust
maximum
hclust
hclust
kmedoids
manhattan
manhattan
euclidean
manhattan
euclidean
euclidean
hclust
manhattan
median
affprop
single
median
divisive
single
mcquitty
average
single
manhattan
centroid
euclidean average
mcquitty
manhattan
centroid
manhattan
clust_convex

Credit Claiming, Pork:
Sens. Frank R. Lautenberg

hclustmaximum
hclust maximum
median
hclust
hclust maximum
euclidean complete
complete
hclust maximum
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
dist_cos
dismea
kmeanshclust
sot_euc
euclidean ward
hclust binary ward (D-NJ) and Robert Menendez
(D-NJ) announced that the U.S.
kendall ward ward
kmeans maximum
Department of Commerce has

awarded a $100,000 grant to the

South Jersey Economic

Credit Claiming
Development District
Pork
Mayhew
Example Discovery
[Figure: map of clusterings; points labeled by clustering method and
distance metric. Figure labels: Credit Claiming, Pork; Credit
Claiming, Legislation; Mayhew]
Credit Claiming, Legislation: As the Senate begins its recess,
Senator Frank Lautenberg today pointed to a string of victories in
Congress on his legislative agenda during this work period

Example Discovery
[Figure: map of clusterings; points labeled by clustering method and
distance metric. Figure labels: Clusters in this Clustering;
Advertising; Credit Claiming, Pork; Credit Claiming, Legislation;
Mayhew]
Advertising: Senate Adopts Lautenberg/Menendez Resolution Honoring
Spelling Bee Champion from New Jersey

Example Discovery: Partisan Taunting
[Figure: map of clusterings; points labeled by clustering method and
distance metric. Figure labels: Clusters in this Clustering; Partisan
Taunting; Credit Claiming, Pork; Credit Claiming, Legislation; Mayhew]
Partisan Taunting: Republicans Selling Out Nation on Chemical Plant
Security

Partisan Taunting: Senator Lautenberg's amendment would change the
name of ...the Republican bill...to "More Tax Breaks for the Rich and
More Debt for Our Grandchildren Deficit Expansion Reconciliation Act
of 2006"

Definition: Explicit, public, and negative attacks on another
political party or its members
Taunting ruins deliberation

In Sample Illustration of Partisan Taunting
Taunting ruins deliberation

- Senator Lautenberg Blasts Republicans as Chicken Hawks
[Government Oversight]
- The Scopes trial took place in 1925. Sadly, President Bush's veto
today shows that we haven't progressed much since then [Healthcare]
- Every day the House Republicans dragged this out was a day that
made our communities less safe. [Homeland Security]

[Image: Sen. Lautenberg on Senate Floor, 4/29/04]


Out of Sample Confirmation of Partisan Taunting
- Discovered using 200 press releases; 1 senator.

- Confirmed using 64,033 press releases; 301 senator-years.

- Apply supervised learning method: measure the proportion of press
releases in which a senator taunts the other party

[Figure: histogram; x-axis: Prop. of Press Releases Taunting (0.1 to
0.5); y-axis: Frequency (10 to 30)]
On Avg., Senators Taunt in 27% of Press Releases
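
Once a supervised classifier has labeled each press release, the measurement described on this slide reduces to counting. The sketch below is a minimal illustration under stated assumptions: the `predictions` rows, the senator names, and the function `taunting_proportions` are all hypothetical, and the classifier itself is not shown. It computes the share of taunting releases for each senator-year and then averages those shares.

```python
from collections import defaultdict

# Hypothetical classifier output, one row per press release:
# (senator, year, predicted_to_be_taunting)
predictions = [
    ("Senator A", 2005, True),
    ("Senator A", 2005, False),
    ("Senator A", 2005, False),
    ("Senator A", 2005, True),
    ("Senator B", 2005, False),
    ("Senator B", 2005, False),
    ("Senator B", 2005, False),
    ("Senator B", 2005, True),
]

def taunting_proportions(rows):
    """Return {(senator, year): share of releases labeled as taunting}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [taunting count, total count]
    for senator, year, is_taunt in rows:
        entry = counts[(senator, year)]
        entry[0] += int(is_taunt)
        entry[1] += 1
    return {key: taunts / total for key, (taunts, total) in counts.items()}

proportions = taunting_proportions(predictions)

# Unweighted mean across senator-years, the kind of summary the slide reports.
average = sum(proportions.values()) / len(proportions)

print(proportions)
print(average)
```

Averaging the per-senator-year shares weights every senator-year equally; pooling all releases first would instead weight prolific senators more heavily, so the two summaries can differ.
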

Advancing the Objective of Discovery
1) Conceptualization
Qualitative Methods (reading!)
2) Measurement
Quantitative Methods
3) Validation
Quantitative methods for conceptualization: aiding discovery

- Few formal methods designed explicitly for conceptualization
- Belittled: "Tom Swift and His Electric Factor Analysis Machine"
(Armstrong 1967)
- Evaluation methods measure progress in discovery
For more information:
http://GKing.Harvard.edu
