
Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis II: Unsupervised Learning via Cluster Analysis¹

Gary King
http://GKing.Harvard.Edu

December 23, 2011

¹ Copyright 2010 Gary King, All Rights Reserved.
Reading

Justin Grimmer and Gary King. 2010. "Quantitative Discovery of Qualitative Information: A General Purpose Document Clustering Methodology."
http://gking.harvard.edu/files/abs/discov-abs.shtml

Gary King (Harvard, IQSS) Quantitative Discovery from Text 2 / 23


The Problem: Discovery from Unstructured Text

Examples: scholarly literature, news stories, medical information, blog posts, comments, product reviews, emails, social media updates, audio-to-text summaries, speeches, press releases, legal decisions, etc.
10 minutes of worldwide email = 1 LOC (Library of Congress) equivalent
An essential part of discovery is classification: "one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or, for that matter, social science research" (Bailey, 1994).
We focus on cluster analysis: discovery through (1) classification and (2) simultaneously inventing a classification scheme
(We analyze text; our methods apply more generally)

Why Johnny Can't Classify (Optimally)

Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Bell(100) ≈ 10^28 × Number of elementary particles in the universe
Now imagine choosing the optimal classification scheme by hand!
That we think of all this as astonishing . . . is astonishing
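
The growth of Bell(n) can be checked directly. The sketch below is illustrative only: the function names `bell` and `partitions` are not from the lecture. `bell` uses the standard Bell-triangle recurrence, and `partitions` enumerates every partition of a small set by brute force, which is feasible only for very small n.

```python
def bell(n):
    """Return Bell(n), the number of ways to partition n objects."""
    row = [1]                        # row 0 of the Bell triangle; Bell(0) = 1
    for _ in range(n):
        nxt = [row[-1]]              # a new row starts with the last entry of the previous row
        for x in row:
            nxt.append(nxt[-1] + x)  # next entry = entry to the left + entry above it
        row = nxt
    return row[0]                    # the first entry of row n is Bell(n)


def partitions(items):
    """Yield every partition of a sequence as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):      # add `first` to one of the existing blocks
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p          # or place `first` in a block of its own


if __name__ == "__main__":
    for p in partitions("ABC"):
        print(" ".join("".join(block) for block in p))
    print([bell(n) for n in (2, 3, 5)])
    print(len(str(bell(100))), "digits in Bell(100)")
```

Enumeration stops being practical almost immediately, while the recurrence still returns the exact count for n = 100 because Python integers have arbitrary precision.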

Why HAL Can't Classify Either

The goal, an optimal application-independent cluster analysis method, is mathematically impossible:
No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications
Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, who knows?!
The literature: little guidance on when methods apply
Deep problem in cluster analysis literature: no way to know which method will work ex ante

If Ex Ante doesn't work, try Ex Post

Methods and substance must be connected (no free lunch theorem)
The usual approach fails: hard to do it by understanding the model
We do it ex post (by qualitative choice). For example:
Create long list of clusterings; choose the best
Too hard for mere humans!
An organized list will make the search possible
E.g.: consider two clusterings that differ only because one document (of many) moves from category 5 to 6

Our Idea: Meaning Through Geography

We develop a (conceptual) geography of clusterings

A New Strategy
Make it easy to choose best clustering from millions of choices

1 Code text as numbers (in one or more of several ways)
2 Apply all clustering methods we can find to the data, each representing different (unstated) substantive assumptions (<15 mins)
3 (Too much for a person to understand, but organization will help)
4 Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection
5 Local cluster ensemble creates a new clustering at any point, based on weighted average of nearby clusterings
6 A new animated visualization to explore the space of clusterings (smoothly morphing from one into others)
7 Millions of clusterings, easily comprehended (takes about 10-15 minutes to choose a clustering with insight)
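
Two of these steps can be sketched in a few lines. This is a rough illustration under stated assumptions, not the lecture's implementation. `doc_term_matrix` shows one common way to code text as numbers (word counts). `local_ensemble` shows one plausible reading of "weighted average of nearby clusterings": it averages pairwise co-membership matrices with weights that decay with distance in the 2-D space. The function names, the Gaussian kernel, and the `bandwidth` parameter are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def doc_term_matrix(docs):
    """Step 1 (one of several ways): code each document as a vector of word counts."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]


def comembership(labels):
    """n x n matrix with 1 where two documents share a cluster, else 0."""
    return [[int(a == b) for b in labels] for a in labels]


def local_ensemble(point, clusterings, positions, bandwidth=1.0):
    """Step 5 (assumed form): kernel-weighted average of co-membership matrices.

    positions[i] is the 2-D location of clusterings[i]. The Gaussian kernel
    and `bandwidth` are illustrative assumptions, not taken from the lecture.
    """
    weights = [math.exp(-math.dist(point, p) ** 2 / (2 * bandwidth ** 2))
               for p in positions]
    total = sum(weights)
    n = len(clusterings[0])
    avg = [[0.0] * n for _ in range(n)]
    for w, labels in zip(weights, clusterings):
        m = comembership(labels)
        for i in range(n):
            for j in range(n):
                avg[i][j] += w / total * m[i][j]
    return avg


if __name__ == "__main__":
    vocab, X = doc_term_matrix(["tax cut bill", "tax relief bill", "veterans health care"])
    print(vocab)
    print(X)
    S = local_ensemble((0.2, 0.0), [[0, 0, 1], [0, 1, 1]], [(0.0, 0.0), (1.0, 0.0)])
    print(S)
```

The averaged matrix is a similarity matrix, not yet a clustering. Producing the new clustering would require running some clustering method on it, which is omitted here.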



Many Thousands of Clusterings, Sorted & Organized
You choose one (or more), based on insight, discovery, useful information, . . .

[Figure: "Space of Cluster Solutions", a 2-D map in which each point is one clustering, labeled by the method that produced it (e.g., hclust with various distances and linkages, kmeans, kmedoids, affprop, divisive, rock, som, spec_*, mspec_*, biclust_spectral, clust_convex, dismea, mixvmf, mixvmfVA, mult_dirproc). Two example solutions, "Cluster Solution 1" and "Cluster Solution 2", group U.S. presidents (Roosevelt, Truman, Eisenhower, Kennedy, Johnson, Nixon, Ford, Carter, Reagan, HWBush, Clinton, Bush, Obama); cluster labels shown include "Other Presidents", "Reagan Republicans", "Roosevelt To Carter", and "Reagan To Obama".]

Application-Independent Distance Metric: Axioms

Metric based on 3 assumptions

1 Distance between clusterings: a function of the pairwise document agreements (pairwise agreements imply triples, quadruples, etc.)
2 Invariance: Distance is invariant to the number of documents (for any fixed number of clusters)
3 Scale: the maximum distance is set to log(num clusters)
Only one measure satisfies all three (the "variation of information")
Meila (2007): derives same metric using different axioms (lattice theory)
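
For readers who want to compute the variation of information, here is a minimal sketch using its standard entropy form, VI(a, b) = H(a) + H(b) - 2 I(a, b), with natural logarithms. The function names are illustrative, and any rescaling the lecture applies to fix the maximum distance is not reproduced here. The toy example mirrors the earlier case of two clusterings that differ only because one document moves between categories.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two clusterings of the same documents."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def variation_of_information(a, b):
    """VI(a, b) = H(a) + H(b) - 2 I(a, b)."""
    return entropy(a) + entropy(b) - 2 * mutual_information(a, b)


if __name__ == "__main__":
    base = [0] * 5 + [1] * 5        # ten documents in two clusters
    moved = [0] * 4 + [1] * 6       # one document moves to the other cluster
    shuffled = [0, 1] * 5           # a very different grouping
    print(variation_of_information(base, base))
    print(variation_of_information(base, moved))
    print(variation_of_information(base, shuffled))
```

Clusterings that differ by a single document sit close together under this distance, which is what lets an organized map of clusterings place near-duplicates next to each other.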



The Future of Political Science: 100 Perspectives
Edited by Gary King, Harvard University, Kay Lehman Schlozman, Boston College, and Norman H. Nie, Stanford University

"The list of authors in The Future of Political Science is a 'who's who' of political science. As I was reading it, I came to think of it as a platter of tasty hors d'oeuvres. It hooked me thoroughly."
Peter Kingstone, University of Connecticut

"In this one-of-a-kind collection, an eclectic set of contributors offer short but forceful forecasts about the future of the discipline. The resulting assortment is captivating, consistently thought-provoking, often intriguing, and sure to spur discussion and debate."
Wendy K. Tam Cho, University of Illinois at Urbana-Champaign

"King, Schlozman, and Nie have created a visionary and stimulating volume. The organization of the essays strikes me as nothing less than brilliant. . . It is truly a joy to read."
Lawrence C. Dodd, Manning J. Dauer Eminent Scholar in Political Science, University of Florida

Available March 2009: 304pp
Pb: 978-0-415-99701-0: $24.95
www.routledge.com/politics

Evaluators Rate Machine Choices Better Than Their Own

Scale: (1) unrelated, (2) loosely related, or (3) closely related
Table reports: mean(scale)

Pairs from            Overall Mean   Evaluator 1   Evaluator 2
Random Selection      1.38           1.16          1.60
Hand-Coded Clusters   1.58           1.48          1.68
Hand-Coding           2.06           1.88          2.24
Machine               2.24           2.08          2.40

p.s. The hand-coders did the evaluation!

Evaluating Performance

Goals:
Validate Claim: computer-assisted conceptualization outperforms human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research
We now present three evaluations
Cluster Quality: RA coders
Informative discoveries: Experienced scholars analyzing texts
Discovery: You're the judge

Evaluation 1: Cluster Quality

What Are Humans Good For?
- They can't: keep many documents & clusters in their head
- They can: compare two documents at a time
- ⇒ Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality
- automated visualization to choose one clustering
- many pairs of documents
- for coders: rate each pair as (1) unrelated, (2) loosely related, (3) closely related
- Quality = mean(within cluster) - mean(between clusters)
- Bias results against ourselves by not letting evaluators choose clustering
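
The quality measure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the input formats (`labels`, `pair_ratings`), the function name, and the toy data are assumptions made for the example. Ratings use the 1-3 coder scale from the slide.

```python
from statistics import mean

def cluster_quality(labels, pair_ratings):
    """Mean coder rating of same-cluster pairs minus mean rating of different-cluster pairs.

    labels: dict mapping document id -> cluster label (hypothetical input format).
    pair_ratings: dict mapping (doc_a, doc_b) -> coder rating on the 1-3 scale.
    """
    within, between = [], []
    for (a, b), rating in pair_ratings.items():
        # A pair counts as "within" when the clustering puts both documents together.
        (within if labels[a] == labels[b] else between).append(rating)
    if not within or not between:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return mean(within) - mean(between)

# Toy data, invented for illustration only.
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
pair_ratings = {
    ("d1", "d2"): 3,  # same cluster, closely related
    ("d3", "d4"): 2,  # same cluster, loosely related
    ("d1", "d3"): 1,  # different clusters, unrelated
    ("d2", "d4"): 1,
    ("d1", "d4"): 2,
}
quality = cluster_quality(labels, pair_ratings)
print(round(quality, 3))
```

A larger value means coders see pairs inside a cluster as more related than pairs across clusters; a value near zero means the clustering does no better than an arbitrary grouping on this measure.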

Evaluation 1: Cluster Quality

[Figure: cluster quality comparison, one panel per data set: Lautenberg Press Releases, Policy Agendas Project, Reuters Gold Standard. Horizontal axis ticks at 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, with the two sides labeled "(Our Method)" and "(Human Coders)".]

- Lautenberg: 200 Senate press releases (appropriations, economy, education, tax, veterans, . . . )
- Policy Agendas: 213 quasi-sentences from Bush's State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, . . . )
- Reuters: financial news (trade, earnings, copper, gold, coffee, . . . ); gold standard for supervised learning studies


Evaluation 2: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:
- 2 clusterings selected with our method (biased against us)
- 2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplar document, automated content summary)

Asked for $\binom{6}{2} = 15$ pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases: a Condorcet winner
- Immigration: Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
- Genetic testing: Our Method 1 → {Our Method 2, K-Means 1, K-Means 2} → Dir Proc. 1 → Dir Proc. 2
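
As a rough sketch of the bookkeeping behind this evaluation, the Python below enumerates the 15 pairs and checks for a Condorcet winner, that is, a clustering preferred to every other one in head-to-head comparison. The function name, the `prefers` data structure, and the toy judgments are assumptions for illustration; the judgments are simply made consistent with the immigration ordering listed above, since the individual comparisons are not given here.

```python
from itertools import combinations

def condorcet_winner(candidates, prefers):
    """Return the candidate preferred in every pairwise comparison, or None.

    prefers: dict mapping each unordered pair (a frozenset) to the preferred candidate.
    """
    for c in candidates:
        if all(prefers[frozenset((c, other))] == c
               for other in candidates if other != c):
            return c
    return None  # no Condorcet winner, e.g. when preferences cycle

clusterings = ["Our Method 1", "vMF 1", "vMF 2",
               "Our Method 2", "K-Means 1", "K-Means 2"]
pairs = list(combinations(clusterings, 2))  # 6 choose 2 = 15 comparisons

# Toy judgments (assumed): the evaluator prefers whichever clustering
# comes earlier in the ordering listed for the immigration data.
rank = {name: i for i, name in enumerate(clusterings)}
prefers = {frozenset(p): min(p, key=rank.get) for p in pairs}

winner = condorcet_winner(clusterings, prefers)
print(len(pairs), winner)
```

Only the winner matters for the user's choice, which is why the check stops at the first clustering that wins all five of its comparisons rather than building a full ranking.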
Evaluation 3: What Do Members of Congress Do?

- David Mayhew's (1974) famous typology
  - Advertising
  - Credit Claiming
  - Position Taking
- Data: 200 press releases from Frank Lautenberg's office (D-NJ)
- Apply our method


Example Discovery

[Figure: plot of clusterings, one labeled point per clustering method and distance metric. Labels include mult_dirproc, mixvmf, mixvmfVA, affprop cosine, affprop info.costs, kmeans correlation, kmeans pearson, kmedoids euclidean, hclust canberra mcquitty, hclust correlation ward, divisive stand.euc, spec_cos, mspec_euc, biclust_spectral, clust_convex, som, rock, sot_cor, sot_euc, mec.]

- Red point: a clustering by Affinity Propagation-Cosine (Dueck and Frey 2007)
- Close to: Mixture of von Mises-Fisher distributions (Banerjee et al. 2005)
- Space between methods: local cluster ensemble
- Found a region with particularly insightful clusterings
- Mixture:
  - 0.39 Hclust-Canberra-McQuitty
  - 0.30 Spectral clustering Random Walk (Metrics 1-6)

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23


Example Discovery

mult_dirproc
kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc

hclust pearson single


hclust pearson median
hclust correlation single
hclust correlation median
mec
mixvmf
hclust correlationmixvmfVA
affprop cosine
hclust binary complete
mcquitty
hclust pearson mcquitty
hclust pearson average hclust correlation complete
Mixture:
hclust binary single hclust correlation averagehclust pearson complete
hclust binary average kmeans pearson
hclustpearson
hclust correlation
centroid som
centroid rock
hclust binary median
hclust canberra single
biclust_spectral hclust spearman complete
hclust binary mcquitty

spec_man
0.39 Hclust-Canberra-McQuitty
hclust canberra spec_cos
kmeans kendall median hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
kmeans canberra
hclust binary centroid
hclust
hclust
hclust hclustkendall
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
hclust canberra centroid
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
hclust kendall complete
mspec_cos
mspec_canb
mspec_euc
0.30 Spectral clustering
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattanmedian manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
hclust euclidean mcquitty
kmedoids
clust_convex
euclidean
centroidaffprop euclidean
hclust correlation ward
kmedoids
hclust pearson
hclust canberra mcquitty
wardstand.euc
Random Walk
hclust divisive median
euclidean
hclust maximum average
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
complete mcquitty dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
kmeanshclust euclidean ward
euclidean hclust canberra complete
sot_euc
hclust binary ward
0.13 Hclust-Correlation-Ward
hclusthclust spearman
kendall ward ward
hclust maximum ward kmeans binary

kmeans maximum

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23


Example Discovery

mult_dirproc
kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc

hclust pearson single


hclust pearson median
hclust correlation single
hclust correlation median
mec
mixvmf
hclust correlationmixvmfVA
affprop cosine
hclust binary complete
mcquitty
hclust pearson mcquitty
hclust pearson average hclust correlation complete
Mixture:
hclust binary single hclust correlation averagehclust pearson complete
hclust binary average kmeans pearson
hclustpearson
hclust correlation
centroid som
centroid rock
hclust binary median
hclust canberra single
biclust_spectral hclust spearman complete
hclust binary mcquitty

spec_man
0.39 Hclust-Canberra-McQuitty
hclust canberra spec_cos
kmeans kendall median hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
kmeans canberra
hclust binary centroid
hclust
hclust
hclust hclustkendall
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
hclust canberra centroid
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
hclust kendall complete
mspec_cos
mspec_canb
mspec_euc
0.30 Spectral clustering
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattanmedian manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
hclust euclidean mcquitty
kmedoids
clust_convex
euclidean
centroidaffprop euclidean
hclust correlation ward
kmedoids
hclust pearson
hclust canberra mcquitty
wardstand.euc
Random Walk
hclust divisive median
euclidean
hclust maximum average
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
complete mcquitty dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
kmeanshclust euclidean ward
euclidean hclust canberra complete
sot_euc
hclust binary ward
0.13 Hclust-Correlation-Ward
hclusthclust spearman
kendall ward ward
hclust maximum ward

kmeans maximum
kmeans binary
0.09 Hclust-Pearson-Ward

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23


Example Discovery

mult_dirproc
kmeans correlation
hclust canberra ward sot_cor
divisive stand.euc

hclust pearson single


hclust pearson median
hclust correlation single
hclust correlation median
mec
mixvmf
hclust correlationmixvmfVA
affprop cosine
hclust binary complete
mcquitty
hclust pearson mcquitty
hclust pearson average hclust correlation complete
Mixture:
hclust binary single hclust correlation averagehclust pearson complete
hclust binary average kmeans pearson
hclustpearson
hclust correlation
centroid som
centroid rock
hclust binary median
hclust canberra single
biclust_spectral hclust spearman complete
hclust binary mcquitty

spec_man
0.39 Hclust-Canberra-McQuitty
hclust canberra spec_cos
kmeans kendall median hclust canberra average
spec_mink
spec_euc
spec_max
mspec_minkspec_canb
mspec_man
affprop maximum kmeans spearman kmeans manhattan mspec_max
kmeans canberra
hclust binary centroid
hclust
hclust
hclust hclustkendall
spearman
kendall
spearman
kendall
hclust spearman
hclust
hclust spearman
hclust canberra centroid
single
centroid
centroid
average
median
median average
single
kendall mcquitty
mcquitty
hclust kendall complete
mspec_cos
mspec_canb
mspec_euc
0.30 Spectral clustering
hclust
hclust manhattan
hclust kmedoids
manhattan
manhattanmedian manhattan
centroid
average
hclust
hclust
hclust
euclidean
hclust
euclidean
maximum
hclust
hclust
hclust manhattan
hclustmaximum
maximum
affprop
single
manhattanmedian
divisive
single
euclidean centroid
euclidean
manhattan
single
manhattan
average
mcquitty
hclust euclidean mcquitty
kmedoids
clust_convex
euclidean
centroidaffprop euclidean
hclust correlation ward
kmedoids
hclust pearson
hclust canberra mcquitty
wardstand.euc
Random Walk
hclust divisive median
euclidean
hclust maximum average
hclust
hclust maximum
euclidean
hclust maximum
hclust manhattan
complete
complete
complete mcquitty dist_ebinary
dist_binary
dist_fbinary
dist_minkowski
dist_canb
dist_max
hclust manhattan ward
dist_cos
dismea
affprop info.costs
(Metrics 1-6)
kmeanshclust euclidean ward
euclidean hclust canberra complete
sot_euc
hclust binary ward
0.13 Hclust-Correlation-Ward
hclusthclust spearman
kendall ward ward
hclust maximum ward

kmeans maximum
kmeans binary
0.09 Hclust-Pearson-Ward
0.05 Kmediods-Cosine

Gary King (Harvard, IQSS) Quantitative Discovery from Text 18 / 23


Example Discovery

[Figure: map of labeled clustering methods (hclust, kmeans, kmedoids, affprop, spectral, mixvmf, and others, under various distance metrics and linkages).]

Mixture:
0.39 Hclust-Canberra-McQuitty
0.30 Spectral clustering, Random Walk (Metrics 1-6)
0.13 Hclust-Correlation-Ward
0.09 Hclust-Pearson-Ward
0.05 Kmedoids-Cosine
0.04 Spectral clustering, Symmetric (Metrics 1-6)
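
The mixture weights on this slide can be read as a weighted blend of existing clusterings. The sketch below is one hedged way to picture that: it assumes the blend is formed by averaging pairwise co-membership matrices, which is a common way to build a cluster ensemble and may differ from the authors' exact procedure. The function names and the toy clusterings are hypothetical; only the six weights come from the slide.

```python
def comembership(labels):
    """n x n matrix with 1.0 where two documents share a cluster, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mix_clusterings(clusterings, weights):
    """Weighted average of co-membership matrices (assumed blending rule)."""
    if len(clusterings) != len(weights):
        raise ValueError("need one weight per clustering")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    n = len(clusterings[0])
    mixed = [[0.0] * n for _ in range(n)]
    for labels, w in zip(clusterings, weights):
        pairs = comembership(labels)
        for i in range(n):
            for j in range(n):
                mixed[i][j] += w * pairs[i][j]
    return mixed


# Weights as listed on the slide (they sum to 1).
slide_weights = [0.39, 0.30, 0.13, 0.09, 0.05, 0.04]

# Hypothetical toy input: three clusterings of four documents.
toy_clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
toy_weights = [0.5, 0.3, 0.2]
mixed = mix_clusterings(toy_clusterings, toy_weights)
```

Each entry of `mixed` is the weighted share of clusterings that put the two documents together, so a value near 1 means most of the weight agrees on pairing them. A final clustering could then be obtained by clustering this matrix, a step the sketch leaves out.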


Example Discovery

[Figure: map of labeled clustering methods, annotated "Mayhew". Panel: Clusters in this Clustering.]

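
The point labels in the figure appear to combine a clustering algorithm, a distance metric, and (for hierarchical methods) a linkage rule, for example hclust canberra mcquitty. As a rough illustration of how the metric alone changes what counts as "similar", the sketch below computes four of the named distances on two term-count vectors. The vectors and function names are hypothetical and are not taken from the lecture.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|).

    Coordinates where both entries are zero are skipped to avoid 0/0;
    that handling is a choice made for this sketch.
    """
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if a != 0 or b != 0)


# Hypothetical term counts for two short documents.
doc_a = [2, 0, 1, 0]
doc_b = [0, 0, 1, 3]

distances = {
    "euclidean": euclidean(doc_a, doc_b),
    "manhattan": manhattan(doc_a, doc_b),
    "maximum": maximum(doc_a, doc_b),
    "canberra": canberra(doc_a, doc_b),
}
```

Canberra weights each coordinate by its relative size, so a rare word that appears in only one document counts as much as a frequent one, whereas Euclidean and Manhattan distances are dominated by large raw differences. Feeding different metrics to the same algorithm is one reason the methods land at different points in the figure.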
Example Discovery

[Figure: map of labeled clustering methods, annotated "Mayhew". Panel: Clusters in this Clustering, showing Credit Claiming, Pork.]

Credit Claiming, Pork:
Sens. Frank R. Lautenberg (D-NJ) and Robert Menendez (D-NJ) announced that the U.S. Department of Commerce has awarded a $100,000 grant to the South Jersey Economic Development District

Example Discovery

[Figure: map of labeled clustering methods, annotated "Mayhew". Panel: Clusters in this Clustering, showing Credit Claiming, Pork; Credit Claiming, Legislation.]

Credit Claiming, Legislation:
As the Senate begins its recess, Senator Frank Lautenberg today pointed to a string of victories in Congress on his legislative agenda during this work period

Example Discovery

[Figure: map of labeled clustering methods, annotated "Mayhew". Panel: Clusters in this Clustering, showing Credit Claiming, Pork; Credit Claiming, Legislation; Advertising.]

Advertising:
Senate Adopts Lautenberg/Menendez Resolution Honoring Spelling Bee Champion from New Jersey

Example Discovery: Partisan Taunting

[Figure: map of labeled clustering methods, annotated "Mayhew". Panel: Clusters in this Clustering, showing Credit Claiming, Pork; Credit Claiming, Legislation; Advertising; Partisan Taunting.]

Partisan Taunting:
Republicans Selling Out Nation on Chemical Plant Security

Example Discovery: Partisan Taunting

[Figure: map of labeled clustering methods, annotated "Mayhew". Panel: Clusters in this Clustering, showing Credit Claiming, Pork; Credit Claiming, Legislation; Advertising; Partisan Taunting.]

Partisan Taunting:
Senator Lautenberg's amendment would change the name of ...the Republican bill...to More Tax Breaks for the Rich and More Debt for Our Grandchildren Deficit Expansion Reconciliation Act of 2006

Example Discovery: Partisan Taunting

[Figure: map of labeled clustering methods, annotated "Mayhew". Panel: Clusters in this Clustering, showing Credit Claiming, Pork; Credit Claiming, Legislation; Advertising; Partisan Taunting.]

Definition: Explicit, public, and negative attacks on another political party or its members

Taunting ruins deliberation

In Sample Illustration of Partisan Taunting

Taunting ruins deliberation

- Senator Lautenberg Blasts Republicans as Chicken Hawks [Government Oversight]
- The Scopes trial took place in 1925. Sadly, President Bush's veto today shows that we haven't progressed much since then [Healthcare]
- Every day the House Republicans dragged this out was a day that made our communities less safe. [Homeland Security]

Sen. Lautenberg on Senate Floor, 4/29/04


Out of Sample Confirmation of Partisan Taunting
- Discovered using 200 press releases; 1 senator.
- Confirmed using 64,033 press releases; 301 senator-years.
- Apply supervised learning method: measure the proportion of press releases in which a senator taunts the other party

[Figure: histogram of Prop. of Press Releases Taunting (x-axis, ticks 0.1 to 0.5) against Frequency (y-axis, ticks 10 to 30). Annotation: On Avg., Senators Taunt in 27% of Press Releases.]
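
The measurement step described above can be sketched in a few lines once a supervised classifier has labeled each press release as taunting or not. The sketch assumes the unit of analysis is the senator-year and that the reported average is the mean of per-unit proportions. The function name, the unit identifiers, and the toy labels are hypothetical and are not the lecture's data.

```python
from collections import defaultdict
from statistics import mean


def taunting_proportions(labeled_releases):
    """Share of releases labeled as taunting, per senator-year.

    labeled_releases: iterable of (unit_id, is_taunt) pairs, where
    is_taunt is the classifier's True/False label for one release.
    """
    totals = defaultdict(int)
    taunts = defaultdict(int)
    for unit, is_taunt in labeled_releases:
        totals[unit] += 1
        if is_taunt:
            taunts[unit] += 1
    return {unit: taunts[unit] / totals[unit] for unit in totals}


# Hypothetical classifier output for two senator-years.
toy_labels = [
    ("senator_a_2005", True), ("senator_a_2005", False),
    ("senator_a_2005", False), ("senator_a_2005", False),
    ("senator_b_2005", True), ("senator_b_2005", True),
    ("senator_b_2005", False), ("senator_b_2005", False),
]

proportions = taunting_proportions(toy_labels)
average_rate = mean(proportions.values())
```

Averaging the per-unit proportions gives every senator-year equal weight no matter how many releases it contains. Pooling all releases before dividing would instead weight prolific senators more heavily.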
Advancing the Objective of Discovery
1) Conceptualization

Qualitative Methods (reading!)

2) Measurement

Quantitative Methods

3) Validation
Quantitative methods for conceptualization: aiding discovery
- Few formal methods designed explicitly for conceptualization
- Belittled: Tom Swift and His Electric Factor Analysis Machine (Armstrong 1967)
- Evaluation methods measure progress in discovery

For more information:

http://GKing.Harvard.edu
