Computational Journalism
Columbia Journalism School
Week 1: Introduction, Clustering
September 16, 2016
Computational Journalism:
Definitions
Broadly defined, it can involve changing how stories are
discovered, presented, aggregated, monetized, and
archived. Computation can advance journalism by
drawing on innovations in topic detection, video analysis,
personalization, aggregation, visualization, and
sensemaking.
- Cohen, Hamilton, Turner, Computational Journalism, 2011
Computational Journalism:
Definitions
Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials' calendars
or meeting notes, and regulators' email messages that no
one today has time or money to mine. With a suite of
reporting tools, a journalist will be able to scan, transcribe,
analyze, and visualize the patterns in these documents.
- Cohen, Hamilton, Turner, Computational Journalism, 2011
[Diagrams: how computer science fits into journalism: CS for presentation/interaction between reporting and the user; CS connecting data, reporting, and the user]
Filtering
[Diagram: a CS-based filter sitting between data/reporting and the user]
Examples of filters
http://snap.stanford.edu/nifty
CS in Journalism
[Diagram: ways CS combines with data, reporting, filtering, effects, and the user]
Journalism as a cycle
[Diagram: a cycle connecting data, reporting, filtering, the user, and effects, with CS at each step]
Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012
Computational Journalism:
Definitions
the application of computer science to the problems
of public information, knowledge, and belief, by
practitioners who see their mission as outside of both
commerce and government.
- Jonathan Stray, A Computational Journalism Reading List, 2011
Course Structure
Unit 1: Filters
Information retrieval, TF-IDF, topic modeling, search engines, social filtering, filtering
system design.
Unit 3: Methods
Visualization, knowledge representation, social network analysis, privacy and
security, tracking flow and effects
[Diagram: fields this course draws on: Information Retrieval, Visualization, Clustering, Natural Language Processing, Text Analysis, Sociology, Filter Design, Social Network Analysis, Artificial Intelligence, Knowledge Representation, Graph Theory, Drawing Conclusions, Cognitive Science, Statistics, Epistemology]
Administration
Assignment after each class
Some assignments require programming, but
your writing counts for more than your code!
Course blog
http://compjournalism.com
Final project
for 6-pt students only
Grading
Dual degree students: Pass/Fail. Final project: paper, story, or software.
Non-journalism students: 80% assignments, 20% class participation.
$$x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix}$$
Choosing Features
Journalism: how do we represent the world numerically?

$$x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix}$$
$$x \mapsto \begin{pmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{pmatrix}, \quad k \ll N$$

Machine learning: which variables carry the most information?
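As a toy illustration (hypothetical data, not from the lecture), one simple way to choose the k feature indices f(1)…f(k) is to rank features by variance: a feature that never changes cannot distinguish items, so it carries little information.

```python
import numpy as np

# Toy item-feature matrix: 6 items, 5 features (made-up data).
X = np.array([
    [1.0, 0.0, 3.2, 0.0, 1.1],
    [0.9, 0.0, 2.8, 0.0, 0.9],
    [1.1, 0.0, 0.1, 0.0, 1.0],
    [1.0, 0.0, 0.2, 0.0, 5.0],
    [1.2, 0.0, 3.0, 0.0, 4.8],
    [0.8, 0.0, 0.3, 0.0, 5.2],
])

k = 2  # keep k << N features

# Rank features by variance; constant features (columns 1 and 3) score zero.
variances = X.var(axis=0)
f = np.argsort(variances)[::-1][:k]   # indices of the k most variable features
X_reduced = X[:, f]

print("selected feature indices:", f)
print("reduced shape:", X_reduced.shape)
```

Real systems use richer criteria (TF-IDF weights, mutual information), but the shape of the operation is the same: pick k of the N coordinates.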
Distance metric
Intuitively: how (dis)similar are two items?
Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)
Distance metric
d(x, y) ≥ 0 - distances are never negative
d(x, x) = 0 - every item is at distance zero from itself
d(x, y) = d(y, x) - symmetry: x to y same as y to x
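A minimal Python sketch, using Euclidean distance as the example metric, that checks the axioms above on concrete points:

```python
import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

x, y, z = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]

assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, x) == 0                                   # identity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
print(euclidean(x, y))  # 5.0
```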
Distance matrix
Data matrix for M objects of N dimensions:

$$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & & \vdots \\ \vdots & & \ddots & \\ x_{M,1} & \cdots & & x_{M,N} \end{pmatrix}$$

Distance matrix

$$D_{ij} = D_{ji} = d(x_i, x_j) = \begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & & \vdots \\ \vdots & & \ddots & \\ d_{M,1} & \cdots & & d_{M,M} \end{pmatrix}$$
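A short numpy sketch (made-up data) that builds the M × M distance matrix D from a data matrix X:

```python
import numpy as np

# M = 3 objects in N = 2 dimensions (hypothetical data matrix X)
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 0.0]])

# D[i, j] = d(x_i, x_j): all pairwise Euclidean distances via broadcasting
diffs = X[:, None, :] - X[None, :, :]      # shape (M, M, N)
D = np.sqrt((diffs ** 2).sum(axis=-1))     # shape (M, M)

print(D)
assert np.allclose(D, D.T)            # symmetric
assert np.allclose(np.diag(D), 0.0)   # zero diagonal
```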
Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. single-link (MIN) and complete-link (MAX) merge criteria
Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
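The agglomerative merge loop can be sketched directly; this is a deliberately naive single-link (MIN) version on made-up 1-D points, not an optimized implementation:

```python
import numpy as np

def single_link_clusters(X, k):
    """Agglomerative clustering with single-link (MIN) merging: start with
    each point as its own cluster (the leaves), repeatedly merge the two
    closest clusters until k clusters remain. A simple O(M^3) sketch."""
    M = len(X)
    clusters = [[i] for i in range(M)]
    # precompute all pairwise distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    while len(clusters) > k:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # MIN criterion: distance between the closest pair of members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] += clusters[b]   # merge b into a
        del clusters[b]
    return clusters

X = np.array([[0.0], [0.1], [5.0], [5.2], [9.9]])
print(single_link_clusters(X, 3))  # [[0, 1], [2, 3], [4]]
```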
K-means demo
http://www.paused21.net/off/kmeans/bin/
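Beyond the demo, the k-means loop itself is short. A numpy sketch on made-up 2-D data: assign each point to its nearest centroid, then move each centroid to the mean of its cluster, and repeat.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means sketch: pick k random data points as initial centroids,
    then alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each point
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # update step: move each centroid to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two obvious blobs (hypothetical data)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels, centroids = kmeans(X, 2)
print(labels)
```

Note that k-means depends on the random initialization; production code restarts several times and keeps the best result.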
[Table: cluster assignments from average-linkage clustering; only the cluster numbers survived extraction]
[Table: legislators by party (Con, Lab, LDem, XB, Bp) and cluster assignment. Con members fall almost entirely in cluster 1, Lab in cluster 2, while XB members are spread across clusters 1–5]
Clustering Algorithm
Dimensionality reduction
Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional. We have to go from

$$x \in \mathbb{R}^N$$

to much lower-dimensional points

$$y \in \mathbb{R}^K, \quad K \ll N$$

Probably K = 2 or K = 3.
Linear projections
Each point is projected in a straight line to the closest point on the "screen." Mathematically,
y = Px
where P is a K × N matrix.
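A numpy sketch of y = Px with a random (hypothetical) projection matrix P, just to show the shapes involved; real systems choose P carefully, e.g. via PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 1000, 2            # high-dimensional in, 2-D out (for the screen)
x = rng.normal(size=N)    # one made-up high-dimensional point

# P is any K x N matrix; here a random projection, scaled for stability.
P = rng.normal(size=(K, N)) / np.sqrt(N)
y = P @ x                 # y = Px, a point on the 2-D "screen"

print(y.shape)  # (2,)
```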
Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now
y = f(x)
for some non-linear function f(). So it may not preserve relative distances, angles, etc.
Multidimensional scaling
Idea: try to preserve distances between points "as much as
possible."
If we have the distances between all points in a distance matrix,

$$D_{ij} = |x_i - x_j| \quad \text{for all } i, j$$

we can recover the original {x_i} coordinates exactly (up to rigid transformations). Like working out a country's map if you know how far each city is from every other.
Multidimensional scaling
Torgerson's "classical MDS" algorithm (1952)
$$\mathrm{stress}(x) = \sum_{i,j} \left( \lVert x_i - x_j \rVert - d_{ij} \right)^2$$
Multi-dimensional Scaling
Like "flattening" a
stretchy structure into
2D, so that distances
between points are
preserved (as much as
possible).
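A numpy sketch of the classical MDS recipe (double-center the squared distance matrix, then take the top eigenvectors) on a small hand-made distance matrix, a 3-4-5 right triangle; the recovered 2-D coordinates reproduce the pairwise distances.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson-style) MDS sketch: double-center the squared
    distance matrix to get a Gram matrix B, then use its top-k eigenvectors
    as coordinates. Recovers the points up to rigid transformation."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]      # indices of the top-k eigenvalues
    L = np.sqrt(np.maximum(vals[idx], 0))
    return vecs[:, idx] * L               # M x k coordinates

# distances between the corners of a 3-4-5 right triangle
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])
Y = classical_mds(D, k=2)

# the recovered points reproduce the original distances
D2 = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
print(np.allclose(D2, D))
```

For distances that are not exactly Euclidean, this gives an approximation rather than an exact recovery, which is where minimizing the stress function comes in.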
Robustness of results
Regarding these analyses of congressional voting, we
could still ask:
Are we modeling the right thing? (What about other
legislative work, e.g. in committee?)
Are our underlying assumptions correct? (do
representatives really have ideal points in a
preference space?)
What are we trying to argue? What will be the effect of
pointing out this result?
Different libraries,
different categories