Toy Problem List To Do in Data Science Domain

What are some good toy problems in data science?
(This is a live list. Edits and additions are welcome.)
1) Here is a good toy problem: organize Twitter users into groups based on
the similarity of their tweets. To get started, you can use simple metrics
such as the number of words in a tweet, the average word length, the
standard deviation of word length, etc. Use a simple classifier or
clustering algorithm of your choice (e.g., see the chapter on Naive Bayes
text classification here: http://nlp.stanford.edu/IR-book/...).
You can use the Twitter Streaming API, as suggested by Neil Kodner, to
extract users' status updates, and the Enron email classification methods
suggested by Josh Wills. Run this on at least 1 GB of tweets (you can
extract that much in less than a day unless you're on a dial-up
connection) and see whether your algorithm scales well. Then extract more
features with standard NLP methods (see "How does one determine similarity
between people online?") and try to improve your classifier's performance.
It would be interesting to see how your groupings compare to Twitter's
'Similar Users' suggestions or TunkRank.
Update from the Data 2.0 Conference: you can now get full Firehose access
(10,000 keyword filters for 30 cents/hr):
http://www.readwriteweb.com/arch...
2) Find similar users on Delicious (product), as suggested by Andreas
Stuhlmüller: http://www.aiplayground.org/arti...
3) Explore "Where can I find large datasets open to the public?" and "What
data APIs or sources should be in my O'Reilly guide?", plus
http://www.reddit.com/r/datasets/
4) FAQ extraction from mailing lists; see http://mail-archives.apache.org/...
5) Find similar Quora users by interests and segments: see "What
interesting statistics could be computed from user statistics on Quora?"
6) Run some stats on Facebook or Google Profiles. See Pete Warden's and
Paul Butler's exercises: http://petewarden.typepad.com/se...,
http://petewarden.typepad.com/se..., http://paulbutler.org/archives/v...
7) Coupons: http://paulbutler.org/archives/g...
8) http://www.heritagehealthprize.com/
9) What are some good learning projects to teach oneself about machine
learning?
10) Kinect: Are there any cool hacks for Kinect?
11) A better spelling corrector: http://norvig.com/spell-correct....
12A) Linear A: See Kim Raymoure's answer: What are some computational
methods used in Linear A decipherment?
12B) Linear B: Quollaboration: Toy Data Analysis for Linear B
13) A murder mystery: http://www.networkworld.com/comm...
14) Michael E Driscoll's answer to What are some good summer programs
for PhD students interested in data science?
15) Object tracking: http://info.ee.surrey.ac.uk/Pers...
16) http://datavizchallenge.org/
17) List the directors who have directed at least 20 movies and acted in
all of them, using Internet Movie Database (IMDb) data:
http://www.imdb.com/interfaces, http://imdbpy.sourceforge.net/
18) Mashups: http://www.housingmaps.com/, "What data APIs or sources should
be in my O'Reilly guide?"
19) http://www.hearstchallenge.com/
20) What are some good class projects for machine learning using
MapReduce?
21) Videolectures.net recommendations: http://www.r-bloggers.com/videol...
22) Materials identification: http://tunedit.org/challenge/mat...
23) http://www.executablepapers.com/ Also: "What kind of collaboration
tools would reduce duplication of R&D effort in data analysis and sharing?"
24) http://overstockreclabprize.com/
25) Data mining competitions: http://www.kaggle.com/ and
http://www.kdnuggets.com/dataset...
26) IEEE VAST: http://hcil.cs.umd.edu/localphp/...
27) The Mendeley API: http://dev.mendeley.com/,
http://dev.mendeley.com/datachal...
28) HIV Progression: http://www.kaggle.com/c/hivprogr...
29) Data.gov apps: What are the best apps built on top of open
government data?
30) HN search API: http://news.ycombinator.com/item...
31) Optimizing FX Trading Strategies: http://gociop.de/gecco-2011-indu...
32) Yahoo KDD Cup: http://kddcup.yahoo.com
33) Analysis of Financial Data with Perl:
http://perlmonks.org/index.pl?no...
34) Wide Finder challenge: http://www.tbray.org/ongoing/Whe...
35) Internet Search: http://himmele.blogspot.com/2011...
36) Life Tech: http://www.lifetechnologies.com/...
37) Downloadable patents to play with: http://www.google.com/googlebook...
38) Toy machine learning exercises: http://stackoverflow.com/questio...
39) Assignments in the CS 194-16 course at Berkeley:
http://datascienc.es/schedule/
40) Topcoder, USPTO and NASA $50k data mining contest:
http://community.topcoder.com/nt...
41) Mathworks contests: http://www.mathworks.com/academi...
42) A data mining web app: https://github.com/entaroadun/hn...
43) KDD Cup: http://www.kdd.org/kddcup/
44) What are the best algorithms for classifying the language of a text
snippet? Why?
45) Tokenising the visible English text of Common Crawl:
http://matpalm.com/blog/2011/12/...
46) Build a MixRank clone: mixrank.com
47) Kaggle gesture challenge: http://www.kaggle.com/c/GestureC...
48) Yandex Relevance Prediction Challenge:
http://imat-relpred.yandex.ru/en (via KDnuggets:
http://www.kdnuggets.com/2011/11... via Jeff Dalton
http://twitter.com/#!/JeffD)
49) Hit prediction: http://www.wired.com/underwire/2...
50) Find Facebook Users on Match.com by Using Face Recognition Tools:
http://artemyankov.com/post/1830...
51) Reddit recommender: http://www.reddit.com/r/redditde...
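
A possible starting point for problem 1, as a minimal sketch rather than a
reference solution: it computes the three simple metrics mentioned there
(word count, average word length, standard deviation of word length) per
tweet, averages them per user, and groups users with a small hand-written
k-means. The user names, the tweets, the helper names and the choice of
k-means with farthest-point seeding are all assumptions made for
illustration; nothing here calls the Twitter API.

```python
import math
import statistics


def tweet_features(text):
    """Return (word count, mean word length, std dev of word length)."""
    words = text.split()
    if not words:
        return (0.0, 0.0, 0.0)
    lengths = [len(w) for w in words]
    return (float(len(words)),
            statistics.fmean(lengths),
            statistics.pstdev(lengths))


def user_features(tweets):
    """Average the per-tweet features over all of one user's tweets."""
    rows = [tweet_features(t) for t in tweets]
    return tuple(statistics.fmean(col) for col in zip(*rows))


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    statistics.fmean(col) for col in zip(*members))
    return labels


# Hypothetical users and tweets, made up for this example.
users = {
    "alice": ["ok", "yes no", "go now"],
    "bob": ["hi", "so it is"],
    "carol": ["extraordinarily comprehensive documentation regarding "
              "infrastructure requirements"],
    "dave": ["multidisciplinary investigations frequently necessitate "
             "sophisticated instrumentation"],
}

names = list(users)
points = [user_features(users[name]) for name in names]
labels = kmeans(points, k=2)
for name, label in zip(names, labels):
    print(name, label)
```

The features are left unscaled here, so word count and word length
dominate the distance. On real data you would probably standardize each
feature first, and a 1 GB run would need a streaming or distributed
variant of the same idea.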

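The query in problem 17 can be sketched as a single pass over movie
records. The record layout below (title, list of directors, set of cast
names) and the sample data are assumptions made for illustration; they are
not the actual IMDb file format or the IMDbPY API, so real data would need
a loading step first.

```python
from collections import defaultdict


def actor_directors(movies, min_movies=20):
    """Directors with at least min_movies films who acted in all of them.

    movies: iterable of (title, directors, cast) records, a hypothetical
    layout where directors is a list of names and cast is a set of names.
    """
    directed = defaultdict(int)       # films directed, per person
    acted_in_own = defaultdict(int)   # of those, films they also acted in
    for _title, directors, cast in movies:
        for person in set(directors):
            directed[person] += 1
            if person in cast:
                acted_in_own[person] += 1
    return sorted(person for person, n in directed.items()
                  if n >= min_movies and acted_in_own[person] == n)


# Made-up records; the threshold is lowered to 2 to keep the demo small.
sample = [
    ("Film A", ["Dana"], {"Dana", "Eli"}),
    ("Film B", ["Dana"], {"Dana"}),
    ("Film C", ["Eli"], {"Eli"}),
    ("Film D", ["Eli"], {"Dana"}),
    ("Film E", ["Fay"], {"Fay"}),
]
result = actor_directors(sample, min_movies=2)
print(result)  # ['Dana']
```

Matching people by name is a simplification. With real data it is safer to
key on a person identifier, since different people can share a name.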