What are some good toy problems in data science?
(This is a live list. Edits and additions welcome)

1) Here is a good toy problem: organize Twitter users into groups based on
the similarity of their tweets. To get started, you can use simple metrics such
as the number of words in a tweet, average word length, standard deviation
of word length, etc. Use a simple classification or clustering algorithm of your
choice (e.g. see the chapter on Naive Bayes text classification here:
http://nlp.stanford.edu/IR-book/...).
You can use the Twitter Streaming API, as suggested by Neil Kodner, to extract
users' status updates, and the Enron email classification methods suggested
by Josh Wills. Run this on at least 1 GB of tweets (you can extract that much
in less than a day unless you're using a dial-up connection) and see whether
your algorithm scales well. Extract more features with standard NLP methods
(see How does one determine similarity between people online?) and try
to improve your classifier's performance. It would be interesting to see how
your groupings compare to Twitter's 'Similar Users' suggestions or
TunkRank.
Update from the Data 2.0 Conference: you can have full Firehose access now
(10,000 keyword filters for 30 cents/hr): http://www.readwriteweb.com/arch...
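
To make the starting point in 1) concrete, here is a minimal sketch of the three simple metrics (word count, average word length, standard deviation of word length) plus a bare-bones k-means to group users. The function names (tweet_features, user_profile, kmeans) and the naive "first k points" initialization are my own assumptions for illustration, not something from the linked chapter or the Twitter API. It uses only the Python standard library (math.dist needs Python 3.8+).

```python
import math
import statistics


def tweet_features(text):
    """Return (word count, mean word length, std dev of word length) for one tweet."""
    lengths = [len(word) for word in text.split()]
    if not lengths:
        return (0, 0.0, 0.0)
    return (len(lengths), statistics.mean(lengths), statistics.pstdev(lengths))


def user_profile(tweets):
    """Average the per-tweet feature vectors into one vector per user."""
    rows = [tweet_features(t) for t in tweets]
    return tuple(statistics.mean(column) for column in zip(*rows))


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns one cluster label per point.

    Assumption: centroids start at the first k points, which is crude
    but keeps the sketch deterministic.
    """
    centroids = list(points[:k])

    def nearest(point):
        return min(range(k), key=lambda i: math.dist(point, centroids[i]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point)].append(point)
        # Recompute each centroid; keep the old one if a cluster went empty.
        centroids = [
            tuple(statistics.mean(column) for column in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return [nearest(point) for point in points]


if __name__ == "__main__":
    # Hypothetical users and tweets, just to show the call pattern.
    users = {
        "alice": ["short words only", "tiny text here"],
        "bob": ["ok", "yes no"],
        "carol": ["extraordinarily sesquipedalian vocabulary demonstrated"],
    }
    profiles = [user_profile(tweets) for tweets in users.values()]
    print(dict(zip(users, kmeans(profiles, k=2))))
```

Note that the raw features are on different scales, so word count will tend to dominate the distance; standardizing each feature first is probably worth trying before you judge the groupings.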

2) Find similar users on Delicious (product), as suggested by Andreas
Stuhlmüller: http://www.aiplayground.org/arti...

3) Explore Where can I find large datasets open to the public? and What
data APIs or sources should be in my O'Reilly guide? ,
http://www.reddit.com/r/datasets/

4) FAQ extraction from mailing lists, see http://mail-archives.apache.org/...


5) Find similar Quora Users by Interests and Segments: see What
interesting statistics could be computed from user statistics on Quora?

6) Run some stats on Facebook or Google Profiles. See Pete Warden's
and Paul Butler's exercises: http://petewarden.typepad.com/se... ,
http://petewarden.typepad.com/se... , http://paulbutler.org/archives/v...

7) Coupons: http://paulbutler.org/archives/g...

8) http://www.heritagehealthprize.com/

9) What are some good learning projects to teach oneself about machine
learning?

10) Kinect: Are there any cool hacks for Kinect?

11) A better spelling corrector: http://norvig.com/spell-correct....

12A) Linear A: See Kim Raymoure's answer: What are some computational
methods used in Linear A decipherment?

12B) Linear B: Quollaboration: Toy Data Analysis for Linear B

13) A murder mystery: http://www.networkworld.com/comm...

14) Michael E Driscoll's answer to What are some good summer programs
for PhD students interested in data science?

15) Object tracking: http://info.ee.surrey.ac.uk/Pers...

16) http://datavizchallenge.org/

17) List the directors that have directed at least 20 movies and acted in all
of them, using Internet Movie Database (IMDb) data:
http://www.imdb.com/interfaces , http://imdbpy.sourceforge.net/
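
The query in 17) boils down to a set comparison once you have credits in hand. Here is a rough sketch that assumes you have already flattened the IMDb data into (person, movie, role) tuples; that layout, the role strings "director"/"actor"/"actress", and the function name are assumptions of mine, not the format of the IMDb interface files or the IMDbPY API.

```python
from collections import defaultdict


def prolific_actor_directors(credits, min_movies=20):
    """Return people who directed at least min_movies movies and acted in all of them.

    credits: iterable of (person, movie, role) tuples (hypothetical layout).
    """
    directed = defaultdict(set)
    acted = defaultdict(set)
    for person, movie, role in credits:
        if role == "director":
            directed[person].add(movie)
        elif role in ("actor", "actress"):
            acted[person].add(movie)
    return sorted(
        person
        for person, movies in directed.items()
        # "<=" on sets is the subset test: every directed movie is also acted in.
        if len(movies) >= min_movies and movies <= acted.get(person, set())
    )


if __name__ == "__main__":
    # Made-up credits, with the threshold lowered so the toy data qualifies.
    sample = [
        ("A", "m1", "director"), ("A", "m1", "actor"),
        ("A", "m2", "director"), ("A", "m2", "actor"),
        ("B", "m1", "director"), ("B", "m3", "director"), ("B", "m3", "actor"),
    ]
    print(prolific_actor_directors(sample, min_movies=2))
```

On the real data, most of the work is likely to be in the parsing and in matching person names across the director and cast lists, not in this final step.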

18) Mashups: http://www.housingmaps.com/ , What data APIs or sources
should be in my O'Reilly guide?

19) http://www.hearstchallenge.com/

20) What are some good class projects for machine learning using
MapReduce?

21) Videolectures.net recommendations: http://www.r-bloggers.com/videol...

22) Materials identification: http://tunedit.org/challenge/mat...

23) http://www.executablepapers.com/ Also: What kind of collaboration
tools would reduce duplication of R&D effort in data analysis and sharing?

24) http://overstockreclabprize.com/

25) Data mining competitions: http://www.kaggle.com/ and
http://www.kdnuggets.com/dataset...

26) IEEE Vast: http://hcil.cs.umd.edu/localphp/...

27) The Mendeley API: http://dev.mendeley.com/ ,
http://dev.mendeley.com/datachal...

28) HIV Progression: http://www.kaggle.com/c/hivprogr...

29) Data.gov apps: What are the best apps built on top of open
government data?

30) HN search API: http://news.ycombinator.com/item...


31) Optimizing FX Trading Strategies: http://gociop.de/gecco-2011-indu...

32) Yahoo KDD cup: http://kddcup.yahoo.com

33) Analysis of Financial Data with Perl: http://perlmonks.org/index.pl?no...

34) Wide Finder challenge: http://www.tbray.org/ongoing/Whe...

35) Internet Search: http://himmele.blogspot.com/2011...

36) Life Tech: http://www.lifetechnologies.com/...

37) Downloadable patents to play with: http://www.google.com/googlebook...

38) Toy machine learning exercises: http://stackoverflow.com/questio...

39) Assignments in CS 194-16 course at Berkeley:
http://datascienc.es/schedule/

40) Topcoder, USPTO and NASA $50k data mining contest:
http://community.topcoder.com/nt...

41) Mathworks contests: http://www.mathworks.com/academi...

42) A data mining web app: https://github.com/entaroadun/hn...

43) KDD CUP: http://www.kdd.org/kddcup/

44) What are the best algorithms for classifying the language of a text
snippet? Why?

45) Tokenising the visible English text of Common Crawl:
http://matpalm.com/blog/2011/12/...

46) Build a MixRank clone: mixrank.com

47) Kaggle gesture challenge: http://www.kaggle.com/c/GestureC...

48) Yandex Relevance Prediction Challenge: http://imat-relpred.yandex.ru/en
(via KDnuggets: http://www.kdnuggets.com/2011/11... via Jeff
Dalton http://twitter.com/#!/JeffD)

49) Hit prediction: http://www.wired.com/underwire/2...

50) Find Facebook Users on Match.com by Using Face Recognition Tools:
http://artemyankov.com/post/1830...

51) Reddit recommender: http://www.reddit.com/r/redditde...
