
Frontiers of

Computational Journalism
Columbia Journalism School
Week 1: Introduction, Clustering
September 16, 2016

Computational Journalism:
Definitions
Broadly defined, it can involve changing how stories are
discovered, presented, aggregated, monetized, and
archived. Computation can advance journalism by
drawing on innovations in topic detection, video analysis,
personalization, aggregation, visualization, and
sensemaking.
- Cohen, Hamilton, Turner, Computational Journalism, 2011

Computational Journalism:
Definitions
Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials' calendars
or meeting notes, and regulators' email messages that no
one today has time or money to mine. With a suite of
reporting tools, a journalist will be able to scan, transcribe,
analyze, and visualize the patterns in these documents.
- Cohen, Hamilton, Turner, Computational Journalism, 2011

Cohen et al. model

[Diagram: the Cohen et al. model. Computer science enters the pipeline from Data through Reporting to the User at several points: CS applied to data for reporting, CS for presentation/interaction, and CS filtering stories for the user.]

Examples of filters

Facebook news feed


What an editor puts on the front page
Google News
Reddit's comment system
Twitter
Techmeme
New York Times recommendation system

http://snap.stanford.edu/nifty

Kony 2012 early network, by Gilad Lotan

CS in Journalism

[Diagram: CS applied throughout the pipeline: Data → Reporting → Filtering → User, plus tracking Effects.]

Journalism as a cycle

[Diagram: Data → Reporting → Filtering → User → Effects, feeding back into new Data, with CS at each stage of the cycle.]
Journalism with algorithms


vs.
Journalism about algorithms

Websites Vary Prices, Deals Based on Users' Information


Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012

Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012

Computer Science in Journalism


Reporting
Presentation
Filtering
Tracking
Algorithmic accountability

Computational Journalism:
Definitions
the application of computer science to the problems
of public information, knowledge, and belief, by
practitioners who see their mission as outside of both
commerce and government.
- Jonathan Stray, A Computational Journalism Reading List,
2011

Course Structure
Unit 1: Filters
Information retrieval, TF-IDF, topic modeling, search engines, social filtering, filtering
system design.

Unit 2: Interpreting Data


Quantification, error, statistical basics, Bayesianism, prediction, competing
hypotheses, narratives.

Unit 3: Methods
Visualization, knowledge representation, social network analysis, privacy and
security, tracking flow and effects

[Concept map of course topics: Information Retrieval, Visualization, Clustering, Natural Language Processing, Text Analysis, Sociology, Filter Design, Social Network Analysis, Artificial Intelligence, Knowledge Representation, Graph Theory, Drawing Conclusions, Cognitive Science, Statistics, Epistemology.]

Administration
Assignment after each class
Some assignments require programming, but
your writing counts for more than your code!

Course blog
http://compjournalism.com

Final project
for 6-pt students only

Grading
Dual degree students
Pass/Fail.
Final project: paper, story, or software.

Non-journalism students
80% assignments
20% class participation

Vector representation of objects


Fundamental representation for many data mining, clustering,
machine learning, visualization, NLP, etc. algorithms.

    x = (x1, x2, x3, …, xN)ᵀ

Each xi is a numerical or categorical feature
N = number of features, or dimension
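For example, as a minimal NumPy sketch (the feature values here are invented, purely illustrative):

```python
import numpy as np

# A hypothetical object represented as a feature vector x = (x1, ..., xN)
x = np.array([1.0, 0.0, 3.5, 2.0])

N = x.shape[0]  # N = number of features (the dimension)
```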

Choosing Features
Journalism: How do we represent the world numerically?

    (x1, x2, x3, …, xN)ᵀ  →  (xf(1), xf(2), …, xf(k))ᵀ,  where k ≪ N

Machine learning: Which variables carry the most information?

Examples of vector representations


Obvious
o movies watched / items purchased
o Legislative voting history for a politician
o crime locations

Less obvious, but standard


o document vector space model
o psychological survey results

Tricky research problem: disparate field types


o Corporate filing document
o Wikileaks SIGACT

What can we do with vectors?


Predict one variable based on others
o this is called regression
o or maybe "classification"
o supervised machine learning

Group similar items together


o This is clustering
o or maybe "classification" with unknown categories
o unsupervised machine learning

Classification and Clustering


Classification is arguably one of the most central and
generic of all our conceptual exercises. It is the
foundation not only for conceptualization, language,
and speech, but also for mathematics, statistics, and
data analysis in general.
- Kenneth D. Bailey, Typologies and Taxonomies: An

Introduction to Classification Techniques

Interpreting High Dimensional Data

UK House of Lords voting record, 2000-2012.


M = 1043 lords by N = 1630 votes
2 = aye, 4 = nay, -9 = didn't vote

Distance metric
Intuitively: how (dis)similar are two items?
Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)

Distance metric
d(x, y) ≥ 0
- distance is never negative

d(x, x) = 0
- reflexivity: zero distance to self

d(x, y) = d(y, x)
- symmetry: x to y same as y to x

d(x, z) ≤ d(x, y) + d(y, z)
- triangle inequality: going direct is shorter
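These four properties can be checked numerically for Euclidean distance, a minimal NumPy sketch (not from the slides; the test points are invented):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: a valid distance metric."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

x, y, z = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 0.0])

assert euclidean(x, y) >= 0                                   # never negative
assert euclidean(x, x) == 0                                   # reflexivity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
```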

Distance matrix
Data matrix for M objects of N dimensions:

    X = ( x1 )   ( x1,1  x1,2  ⋯  x1,N )
        ( x2 ) = ( x2,1  x2,2  ⋯  x2,N )
        (  ⋮ )   (  ⋮    ⋮    ⋱   ⋮  )
        ( xM )   ( xM,1  xM,2  ⋯  xM,N )

Distance matrix:

    Dij = Dji = d(xi, xj)

    D = ( d1,1  d1,2  ⋯  d1,M )
        ( d2,1  d2,2  ⋯  d2,M )
        (  ⋮    ⋮    ⋱   ⋮  )
        ( dM,1  dM,2  ⋯  dM,M )
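A toy version of both matrices, sketched with NumPy broadcasting (the three points are invented):

```python
import numpy as np

# M = 3 objects, N = 2 dimensions
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 4.0]])

M = X.shape[0]
# D[i, j] = d(x_i, x_j): an M-by-M symmetric matrix with zero diagonal
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

assert D.shape == (M, M)
assert np.allclose(D, D.T)           # symmetry
assert np.allclose(np.diag(D), 0.0)  # zero distance to self
```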

We think of a cluster like this

Real data isn't so simple

Different clustering algorithms


Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches

Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split

K-means demo

http://www.paused21.net/off/kmeans/bin/
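The K-means partitioning idea (keep adjusting clusters until convergence) can be sketched in a few lines of NumPy. This is an illustrative implementation, not the demo's code; the tiny dataset and fixed initial centers are invented:

```python
import numpy as np

def kmeans(X, init_centers, iters=10):
    """Plain K-means sketch: alternate nearest-center assignment
    and centroid update."""
    centers = init_centers.astype(float)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
    return labels, centers

# two obvious groups in 2-D
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels, centers = kmeans(X, X[[0, 4]])
```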

Agglomerative: merging clusters

    put each item into its own leaf node
    while num clusters > 1:
        find the two closest clusters
        merge them

Divisive: splitting clusters

    put all items into one cluster
    while num clusters < num items:
        find the largest cluster
        split it so the pieces are as far apart as possible

Cluster-to-cluster distance can be measured by:
- complete link, or "max"
- single link, or "min"
- average
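The agglomerative pseudocode, together with the single/complete link options, can be sketched as follows (an illustrative NumPy implementation; the four-point dataset is invented):

```python
import numpy as np

def agglomerative(X, num_clusters, link="single"):
    """Agglomerative clustering sketch: start with singleton clusters,
    repeatedly merge the two closest, measuring cluster distance with
    single link (min) or complete link (max)."""
    clusters = [[i] for i in range(len(X))]  # each item starts as its own leaf
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    while len(clusters) > num_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairwise = [D[i, j] for i in clusters[a] for j in clusters[b]]
                d = min(pairwise) if link == "single" else max(pairwise)
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    return clusters

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
clusters = agglomerative(X, 2)
```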

Trees and Dendrograms

UK House of Lords voting clusters

UK House of Lords voting clusters


Algorithm instructed to separate MPs into five clusters. Output:
[Output: a cluster number (1–5) for each lord, e.g. 1 1 2 2 1 1 1 2 1 2 1 2 1 2 3 5 4 1 3 1 …]

Voting clusters with parties

[Output: each lord's party next to their assigned cluster, e.g. LDem 1, Con 1, Lab 2, Lab 2, Con 1 … Conservatives land almost entirely in cluster 1, Labour in cluster 2, Liberal Democrats mostly in cluster 1, while crossbenchers (XB) spread across clusters 1–4.]

Clustering Algorithm

Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of points.

Visualization

Input: data points (feature vectors).
Output: a picture of the points.

Dimensionality reduction
Problem: the vector space is high-dimensional, up to
thousands of dimensions. The screen is two-dimensional.
We have to go from
    x ∈ ℝᴺ
to much lower-dimensional points
    y ∈ ℝᴷ,  where K ≪ N
Probably K = 2 or K = 3.

This is called "projection"

Projection from 3 to 2 dimensions

Linear projections
Projects in a straight line
to closest point on
"screen." Mathematically,
y = Px
where P is a K by N matrix.
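A minimal numeric check of y = Px, with a hypothetical P that keeps the first two coordinates and throws out the third:

```python
import numpy as np

# P is a K-by-N matrix; here K = 2, N = 3: drop the third coordinate,
# i.e. "look along" the z axis.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

x = np.array([2.0, 3.0, 7.0])  # a point in 3-D
y = P @ x                      # its 2-D shadow on the screen
```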

Projection from 2 to 1 dimensions

Think of this as rotating to align the "screen" with


coordinate axes, then simply throwing out values of
higher dimensions.

Projection from 3 to 2 dimensions

Which direction should we look from?


Principal components analysis: find a linear projection
that preserves greatest variance

Take the first K eigenvectors of the covariance matrix,
corresponding to the largest eigenvalues. This gives a
K-dimensional sub-space for projection.
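That recipe, sketched in NumPy (an illustration under the slide's description, not the lecture's code; the synthetic data is invented):

```python
import numpy as np

def pca_project(X, K):
    """PCA sketch: project onto the K eigenvectors of the covariance
    matrix with the largest eigenvalues (greatest preserved variance)."""
    Xc = X - X.mean(axis=0)                          # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  # K largest
    return Xc @ top

# synthetic points that vary mostly along one 3-D direction
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t,
               0.1 * rng.normal(size=(100, 1)),
               0.01 * rng.normal(size=(100, 1))])
Y = pca_project(X, 2)
```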

Sometimes overlap is unavoidable

Real data isn't so simple

Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now
y = f(x)
for some nonlinear function f(). So it may not preserve
relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions

Multidimensional scaling
Idea: try to preserve distances between points "as much as
possible."
If we have the distances between all points in a distance matrix,
    Dij = |xi − xj| for all i, j
we can recover the original {xi} coordinates exactly (up to rigid
transformations). Like working out a country map if you know how
far away each city is from every other.

Multidimensional scaling
Torgerson's "classical MDS" algorithm (1952)

Reducing dimension with MDS


Notice: the dimension N is not encoded in the distance
matrix D (it's M by M, where M is the number of points).
The MDS formula (theoretically) allows us to recover point
coordinates {x} in any number of dimensions k.
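Torgerson's classical MDS can be sketched directly from a distance matrix (a minimal NumPy illustration; the three collinear points are invented, chosen so 1-D recovery is exact):

```python
import numpy as np

def classical_mds(D, K):
    """Torgerson's classical MDS sketch: recover K-dimensional coordinates
    from an M-by-M distance matrix via double-centering and
    eigendecomposition."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:K]  # K largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

# distances between three points on a line at positions 0, 3, 5
D = np.array([[0.0, 3.0, 5.0],
              [3.0, 0.0, 2.0],
              [5.0, 2.0, 0.0]])
Y = classical_mds(D, 1)
# the recovered 1-D coordinates reproduce the original distances
recovered = np.abs(Y[:, 0][:, None] - Y[:, 0][None, :])
```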

MDS Stress minimization


The formula actually minimizes "stress":

    stress(x) = Σi,j ( |xi − xj| − dij )²

Think of springs between every pair of points. The spring between
xi and xj has rest length dij.

Stress is zero if all high-dimensional distances are matched exactly
in the low dimension.

Multi-dimensional Scaling
Like "flattening" a
stretchy structure into
2D, so that distances
between points are
preserved (as much as
possible).

House of Lords MDS plot

Robustness of results
Regarding these analyses of congressional voting, we
could still ask:
Are we modeling the right thing? (What about other
legislative work, e.g. in committee?)
Are our underlying assumptions correct? (Do
representatives really have ideal points in a
preference space?)
What are we trying to argue? What will be the effect of
pointing out this result?

Why do clusters have meaning?

What is the connection between mathematical
and semantic properties?

No unique right clustering


Different distance metrics and clustering algorithms
give different results.
Should we sort incident reports by location, time,
actor, event type, author, cost, casualties?
There is only context-specific categorization.
And the computer doesn't understand your context.

Different libraries,
different categories
