data science?
(This is a live list. Edits and additions are welcome.)
1) Here is a good toy problem: organize Twitter users into groups based on
the similarity of their tweets. To get started, you can use simple metrics
such as the number of words in a tweet, the average word length, the
standard deviation of word length, etc. Use a simple classifier or
clustering algorithm of your choice (e.g. see the chapter on Naive Bayes
text classification here: http://nlp.stanford.edu/IR-book/...).
You can use the Twitter Streaming API, as suggested by Neil Kodner, to
extract users' status updates, and the Enron email classification methods
suggested by Josh Wills. Run this on at least 1GB worth of tweets (you can
extract that much in less than a day unless you're using a dial-up
connection) and see whether your algorithm scales well. Then extract more
features with standard NLP methods (see How does one determine similarity
between people online?) and try to improve your classifier's performance.
It would be interesting to see how your groupings compare to Twitter's
'Similar Users' suggestions or TunkRank.
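To make the starting point in 1) concrete, here is a minimal sketch of the three simple metrics (word count, average word length, standard deviation of word length), averaged per user and fed to a bare-bones k-means. Everything here is an assumption for illustration: the function names, the (user, text) input format, the regex tokenizer, and the choice of k-means itself. Fetching tweets from the Streaming API is not shown, and the features are left unscaled, which you would probably want to revisit on real data.

```python
import math
import re
from collections import defaultdict


def tweet_features(text):
    """Return (word count, mean word length, std dev of word length)."""
    words = re.findall(r"\w+", text)  # crude tokenizer, an assumption
    if not words:
        return (0.0, 0.0, 0.0)
    lengths = [len(w) for w in words]
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return (float(len(words)), mean, math.sqrt(variance))


def user_profiles(tweets):
    """Average the per-tweet features over each user's tweets.

    tweets: iterable of (user, text) pairs (hypothetical input format).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for user, text in tweets:
        for i, value in enumerate(tweet_features(text)):
            sums[user][i] += value
        counts[user] += 1
    return {u: tuple(s / counts[u] for s in sums[u]) for u in sums}


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment


# Made-up sample data, only to show the calling convention.
sample = [
    ("alice", "ok"),
    ("alice", "yes no"),
    ("bob", "extraordinarily verbose commentary regarding statistics"),
    ("carol", "hi"),
    ("dave", "comprehensive multidimensional clustering evaluation methodology"),
]
profiles = user_profiles(sample)
users = sorted(profiles)
labels = kmeans([profiles[u] for u in users], k=2)
groups = dict(zip(users, labels))
```

On 1GB of tweets you would stream the input rather than hold it in a list, and the per-user sums above are already incremental, so that part should carry over.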
Update from Data 2.0 Conference: You can have full Firehose access now
(10,000 keyword filters for 30 cents/hr):
http://www.readwriteweb.com/arch...
3) Explore Where can I find large datasets open to the public? and What
data APIs or sources should be in my O'Reilly guide?,
http://www.reddit.com/r/datasets/
7) Coupons: http://paulbutler.org/archives/g...
8) http://www.heritagehealthprize.com/
9) What are some good learning projects to teach oneself about machine
learning?
12A) Linear A: See Kim Raymoure's answer: What are some computational
methods used in Linear A decipherment?
14) Michael E Driscoll's answer to What are some good summer programs
for PhD students interested in data science?
16) http://datavizchallenge.org/
17) List the directors that have directed at least 20 movies and acted in all
of them, using Internet Movie Database (IMDb) data:
http://www.imdb.com/interfaces , http://imdbpy.sourceforge.net/
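One way to approach the query in 17), once the IMDb data has been parsed, is with plain set operations. The sketch below assumes you have already reduced the data to (person, movie) pairs for directing and acting credits; that input format and the function name are hypothetical, and parsing the IMDb files or using IMDbPY is not shown. It reads "acted in all of them" as "acted in every movie they directed".

```python
from collections import defaultdict


def prolific_actor_directors(directed, acted, min_movies=20):
    """Return people who directed at least min_movies movies and acted
    in every movie they directed.

    directed, acted: iterables of (person, movie) pairs (assumed format).
    """
    directed_by = defaultdict(set)
    for person, movie in directed:
        directed_by[person].add(movie)

    acted_by = defaultdict(set)
    for person, movie in acted:
        acted_by[person].add(movie)

    return sorted(
        person
        for person, movies in directed_by.items()
        # Subset test: every directed movie is also an acting credit.
        if len(movies) >= min_movies and movies <= acted_by.get(person, set())
    )


# Made-up toy data with a lowered threshold, to show the calling convention.
toy_directed = [("A", "m1"), ("A", "m2"), ("B", "m3"), ("B", "m4"), ("C", "m5")]
toy_acted = [("A", "m1"), ("A", "m2"), ("A", "m9"), ("B", "m3")]
result = prolific_actor_directors(toy_directed, toy_acted, min_movies=2)
```

Using sets also removes duplicate credit rows, which may or may not be present in the raw data.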
19) http://www.hearstchallenge.com/
20) What are some good class projects for machine learning using
MapReduce?
24) http://overstockreclabprize.com/
29) Data.gov apps: What are the best apps built on top of open
government data?
44) What are the best algorithms for classifying the language of a text
snippet? Why?
45) Tokenising the visible English text of Common Crawl:
http://matpalm.com/blog/2011/12/...