do with Documents?
Jonathan Stray
Columbia Journalism School
Computational Journalism
Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials'
calendars or meeting notes, and regulators' email
messages that no one today has time or money to
mine. With a suite of reporting tools, a journalist will
be able to scan, transcribe, analyze, and visualize the
patterns in these documents.
- Cohen, Hamilton, and Turner, 2011
1. Robust Import
2. Robust Analysis
PDF dumps
Printed, scanned emails
A million pages scraped from an antique site
CD full of random files
4. Quantitative Summaries
5. Interactive Methods
Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al., 2012
Extracting yes/no answers from a database of Foreign Corrupt Practices Act cases.
Comparison by Ariana Giorgi
Things We Need
Dirty document corpora
A shared development platform