I just completed the second of two finals to end the first semester of Berkeley's
MIDS program--a new data science program created by the School of Information at
UC Berkeley. It was disappointingly easy and expensive ($13k per semester for 5
semesters for an online program). The level of comprehension required to do well
was about that of a Coursera course. That is not to say Coursera courses are
easy; they aren't if you really dig in. There is a higher level of accountability that
comes with a structured program, but the incremental learning that came with the
structure didn't make the degree worth it. I'm dropping the program today.
A Few (Huge) Caveats
For the past five years, I have been in academic settings and tech
companies where there was plenty of technical wisdom, mentorship, and
stretch projects.
SQL: If you can't get data, you can't analyze data. Whether you
retrieve data from a SQL database or Hadoop cluster with a SQL-language
layer on top of it, this is where you start.
http://sqlschool.modeanalytics.com/ is a great interactive learning
interface. O'Reilly's SQL Cookbook is a masterpiece that traverses all
levels of SQL proficiency.
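As a minimal sketch of the "getting data" step, here is a runnable aggregation query. It uses Python's built-in sqlite3 module and an in-memory table (the table and column names are illustrative, not from any particular warehouse):

```python
import sqlite3

# Build a tiny in-memory database so the query below is runnable anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (user_id INTEGER, plan TEXT, mrr REAL)")
conn.executemany(
    "INSERT INTO signups VALUES (?, ?, ?)",
    [(1, "pro", 49.0), (2, "free", 0.0), (3, "pro", 49.0), (4, "enterprise", 499.0)],
)

# A typical first analysis query: user counts and revenue per plan.
rows = conn.execute(
    """
    SELECT plan, COUNT(*) AS users, SUM(mrr) AS total_mrr
    FROM signups
    GROUP BY plan
    ORDER BY total_mrr DESC
    """
).fetchall()

for plan, users, total_mrr in rows:
    print(plan, users, total_mrr)
```

The same GROUP BY / aggregate pattern carries over directly to Hive or any SQL layer over a Hadoop cluster.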
Python: Code Academy or Udemy will teach you the basics. Python
can play two functions in the skill stack: 1) to conduct ad-hoc statistical
analysis as you would with R, 2) to do everything else. Python is
important for the "everything else." You might use it to get data from
APIs, scrape, write ETL jobs, refresh data in your warehouse, or retrain
models. This is the piece of the skill stack that moves you from being a Static
Data Scientist (one who works with data in a manual fashion) to a Live
Data Scientist (one who has automated many of the processes
contributing to data science output, loosely defined).
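A minimal sketch of the "everything else" role: a small extract-transform-load job that could be scheduled to refresh a warehouse table. The API response is simulated with a JSON string (in a real job you would fetch it with urllib or requests; the field names here are hypothetical):

```python
import json
import sqlite3

# Stand-in for an API response; a real job would fetch this over HTTP.
raw = json.dumps([
    {"user_id": 1, "steps": "10432"},
    {"user_id": 2, "steps": None},
    {"user_id": 3, "steps": "8201"},
])

def extract(payload):
    # Parse the raw API payload into Python records.
    return json.loads(payload)

def transform(records):
    # Coerce string counts to integers and drop rows with missing values.
    return [(r["user_id"], int(r["steps"])) for r in records if r["steps"] is not None]

def load(rows, conn):
    # Append the cleaned rows to a warehouse table (sqlite stands in here).
    conn.execute("CREATE TABLE IF NOT EXISTS daily_steps (user_id INTEGER, steps INTEGER)")
    conn.executemany("INSERT INTO daily_steps VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT COUNT(*), SUM(steps) FROM daily_steps").fetchone())
```

Put a script like this on a scheduler and the dataset refreshes without you touching it, which is the whole difference between the static and live modes described above.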
MapReduce: knowing how to write a MapReduce job is, for the most part, not necessary. To
build real MapReduce pipelines is a behemoth of a task that might be the
work of an early-stage startup Data Scientist, but shouldn't be if you have
a solid BI infrastructure team. This is why companies hire the rockstars
we know as backend and data engineers. Side note: if you ever meet one
and aren't sure what their company does, thank them for their service to
our country, regardless.
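For intuition about what those engineers are abstracting away, the map/shuffle/reduce phases can be sketched in a few lines of plain Python (a toy word count, with no distribution or fault tolerance):

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit (word, 1) pairs, as a Hadoop map task would.
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    # Shuffle phase: group all values by key across map outputs.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce phase: collapse each key's values into a single result.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["fox"])
```

The hard part of real MapReduce is everything this sketch omits: partitioning, retries, and moving data between machines, which is exactly why it's a team's job rather than a line item on a data scientist's checklist.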
Cleaning: plan to spend most of your time cleaning and transforming in these
languages/technologies. The analysis is the fast and fun part.
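To make the cleaning point concrete, here is a small sketch of the kind of normalization that eats most of the clock: stray whitespace, inconsistent casing, string dates in mixed formats, and missing values (the field names and formats are illustrative):

```python
from datetime import datetime

# Messy rows as they often arrive from an export or API.
raw_rows = [
    {"email": "  Alice@Example.COM ", "signup": "2015-05-01", "age": "34"},
    {"email": "bob@example.com", "signup": "05/02/2015", "age": ""},
]

def clean(row):
    # Normalize casing and whitespace.
    email = row["email"].strip().lower()
    # Treat empty strings as missing rather than failing on int("").
    age = int(row["age"]) if row["age"] else None
    # Try each known date format; leave the value missing if none match.
    signup = None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            signup = datetime.strptime(row["signup"], fmt).date()
            break
        except ValueError:
            continue
    return {"email": email, "signup": signup, "age": age}

cleaned = [clean(r) for r in raw_rows]
print(cleaned[0]["email"], cleaned[1]["signup"])
```

Only after rows look like this does the fast, fun analysis part begin.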
Footnotes (5/25/2015)
* This unit in Khan Academy, specifically, is negligible. I don't, however, believe that
probability theory has no place in statistics or ML. Quite the opposite: you absolutely
have to understand probability theory. I just don't think that combinatorics (guessing
the probability of red balls in an urn, cards in a deck, or the outcome of a dice roll) is an
essential step to understanding general probability theory.
** Unless you plan to be the sole Data Scientist at a pre-Series B company or are
running your own team, I would not make this a point of focus. I say this because if
you need to get data, every major tech company I've talked to implements Hive or
Pig to abstract away MapReduce. In the earlier days at Jawbone, we did have to
write our own MapReduce ETL, but this was only before we hired a squad of Data
Engineers to handle the heavy lifting of storing UP user behavior and weblogs. The
same has been largely true at Optimizely--if I wanted to get my hands dirty with a
project writing MapReduce jobs, I could, but my time would be more efficiently
spent doing my job versus attempting to do someone else's.