
Why I Left My Data Science Master's Program

I just completed the second of two finals, ending the first semester of Berkeley's
MIDS program, a new data science program created by the School of Information at
UC Berkeley. It was disappointingly easy and expensive ($13k per semester across 5
semesters for an online program). The level of comprehension required to do well
was about that of a Coursera course. That's not to say Coursera is easy; it
isn't if you really dig your heels in. There is a higher level of accountability that
comes with a structured program, but the incremental learning that came with the
structure didn't make the degree worth it. I'm dropping the program today.
A Few (Huge) Caveats
1. Data Science is a new field, in that it combines old fields in a new way. All the
content is out there; figuring out how to put it together is the challenge.

2. This is a new program for Berkeley, so they're still figuring it out.

3. I was a part-time student. It's difficult to enjoy what you're doing without fully
dedicating yourself to it. I've heard the same is true of other part-time programs
like Executive MBAs.

4. For the past five years, I have been in academic settings and tech companies
where there was plenty of technical wisdom, mentorship, and stretch projects.

5. Both tech companies I worked for were in growth phases at the time and thus
didn't sponsor formal programs. The economics would've been different if I were
at, say, an Amazon, Boeing, or Microsoft.

6. I have a clear idea of what I need to learn to be where I want to be in 10 years.
Doing this was less a matter of exploring than of learning what I need in the
fastest way possible. I still thoroughly believe in the value of most undergraduate
and graduate programs. This just wasn't the case for me.
Why I Joined the Program
The traditionalist in me thought a degree program was the way to go to develop
and signal a technical skill set to potential investors and employers. Taking an
honest look at the people I've worked with in the tech space, it's clear this matters
less for high-functioning individuals here than it does in other fields.
What I'm Doing Instead
When completing a project or pursuit, we're often at a loss for what to do with
our newfound free time. I chose to leave the program because I knew my free time
could be better used if managed at my own pace. At work, I've taken on two
additional projects. At home, I'm taking a refresher of Linear Algebra through Khan
Academy while spending two months on Logistic Regression through Coursera. By
self-pacing, I'm getting through about 1.5x the academic content I previously was.
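For context on the Coursera piece: logistic regression is fit by minimizing log loss, usually with gradient descent. Here's a minimal sketch in pure Python with one feature; the data and learning rate are made up for illustration, not taken from any course:

```python
import math

# Toy, made-up data: hours studied -> passed (1) or failed (0).
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient descent on log loss for p(y=1|x) = sigmoid(w*x + b).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the logit
        grad_w += err * x
        grad_b += err
    w -= lr * grad_w / len(xs)
    b -= lr * grad_b / len(xs)

print(sigmoid(w * 4.0 + b) > 0.5)  # a 4-hour studier is predicted to pass
```

In practice you'd reach for a library, but working the update rule by hand once is exactly the kind of thing two self-paced months buys you.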
Recommendations for Data People
Sign up for some online classes, get a pile of books, schedule two hours into every
week night, and sit at an empty desk working through them. Don't leave the desk.
Here are sets of resources you can look into, in order of importance, whether you're
learning this for the first time or as a refresher:

SQL: If you can't get data, you can't analyze data. Whether you
retrieve data from a SQL database or a Hadoop cluster with a SQL-language
layer on top of it, this is where you start.
http://sqlschool.modeanalytics.com/ is a great interactive learning
interface. O'Reilly's SQL Cookbook is a masterpiece that traverses all
levels of SQL proficiency.
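To show the shape of the day-one workflow, here's a minimal sketch using Python's built-in sqlite3 module; the events table and its schema are hypothetical stand-ins for whatever your warehouse holds:

```python
import sqlite3

# Hypothetical events table standing in for a real warehouse schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "signup"), (1, "click"), (2, "signup"), (2, "click"), (2, "click")],
)

# The kind of aggregation you'll write constantly: actions per user.
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS n_actions
    FROM events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
print(rows)  # [(1, 2), (2, 3)]
```

The dialect details shift between SQLite, Postgres, and Hive, but SELECT / GROUP BY / JOIN carry over everywhere.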

Full-Stack Data Science: Coursera offers a full-stack online
curriculum on a continuous basis for a reasonable price. This DOES NOT
teach you SQL. If you're in SF or NYC, you can attend General Assembly's
pricier in-person full-stack curriculum. This gives you a cursory
introduction to data storage, retrieval, prep, light analysis, and deeper
predictive and inferential analysis.

Python: Codecademy or Udemy will teach you the basics. Python
plays two roles in the skill stack: 1) conducting ad-hoc statistical
analysis as you would with R, and 2) doing everything else. Python is
important for the "everything else." You might use it to get data from
APIs, scrape, write ETL jobs, refresh data in your warehouse, or retrain
models. This is the piece of the skill stack that moves you from being a Static
Data Scientist (one who works with data in a manual fashion) to a Live
Data Scientist (one who has automated many of the processes
contributing to data science output, loosely defined).
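A sketch of the "everything else" role, assuming a hypothetical JSON payload like one an analytics API might return (the field names and values here are invented):

```python
import json

# Hypothetical payload, as it might come back from a product-analytics API.
raw = json.loads('''[
    {"user": "a", "event": "signup",   "ts": "2015-05-01"},
    {"user": "b", "event": "signup",   "ts": "2015-05-02"},
    {"user": "a", "event": "purchase", "ts": "2015-05-03"}
]''')

# A tiny ETL step: filter to the events you care about and aggregate
# before loading the result into a warehouse table or report.
purchases_by_user = {}
for row in raw:
    if row["event"] == "purchase":
        purchases_by_user[row["user"]] = purchases_by_user.get(row["user"], 0) + 1

print(purchases_by_user)  # {'a': 1}
```

Swap the string for a real HTTP response and the dict for a database write, put it on a scheduler, and you have the "Live Data Scientist" loop in miniature.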

Basic Statistics: Khan Academy Probability and Statistics.*

Linear Algebra and Multivariable Calculus: Go to a local college or
Khan Academy to brush up on Multivariable Calculus and Linear Algebra.
These curricula have been largely the same for the past five decades.

MapReduce/Hadoop: Focus on this last.** There are so many
technologies that enable SQL-like interfacing with Hadoop that knowing
how to write a MapReduce job is, for the most part, unnecessary. Building
real MapReduce pipelines is a behemoth of a task that might fall to an
early-stage startup Data Scientist, but shouldn't if you have
a solid BI infrastructure team. This is why companies hire the rockstars
we know as backend and data engineers. Side note: if you ever meet one
and aren't sure what their company does, thank them for their service to
our country, regardless.
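To make the term concrete, here's the map/shuffle/reduce pattern sketched as a toy word count in plain Python; it's an illustration of the idea, not a real Hadoop job (the framework handles the shuffle and distribution for you):

```python
from collections import defaultdict
from itertools import chain

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each document emits (word, 1) pairs.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group emitted values by key. In Hadoop this step is done
# for you, across machines, and is most of the engineering effort.
groups = defaultdict(list)
for key, value in chain.from_iterable(mapper(d) for d in documents):
    groups[key].append(value)

# Reduce: combine the grouped values per key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["the"])  # 3
```

Knowing this shape is enough to reason about what Hive or Pig compiles your query into, without ever writing the job yourself.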
Cleaning: Plan to spend most of your time cleaning and transforming data in these
languages/technologies. The analysis is the fast and fun part.
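"Cleaning" sounds abstract until you've seen a raw export. A minimal sketch, with a hypothetical messy extract (inconsistent case, stray whitespace, ad-hoc null markers) invented for illustration:

```python
# Hypothetical messy export: inconsistent case, whitespace, null markers.
raw_rows = [" Alice ,SF", "BOB,sf", "carol,", "N/A,nyc"]

NULL_MARKERS = {"", "n/a", "na", "none"}

def clean(cell):
    """Normalize one cell: trim, lowercase, map null markers to None."""
    cell = cell.strip().lower()
    return None if cell in NULL_MARKERS else cell

cleaned = []
for line in raw_rows:
    name, city = (clean(c) for c in line.split(","))
    if name is not None:  # drop rows missing the key field
        cleaned.append((name, city))

print(cleaned)  # [('alice', 'sf'), ('bob', 'sf'), ('carol', None)]
```

Multiply this by every column and every source system, and the time split described above follows.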
Footnotes (5/25/2015)
* This unit in Khan Academy, specifically, is negligible. I don't, however, believe that
Probability Theory has no place in Stats or ML. Quite the opposite: you absolutely
have to understand probability theory. I just don't think that combinatorics (guessing
the probability of red balls in an urn, cards in a deck, or the outcome of a dice roll) is
an essential step to understanding general probability theory.
** Unless you plan to be the sole Data Scientist at a pre-Series B company or are
running your own team, I would not make this a point of focus. I say this because,
if you need to get data, every major tech company I've talked to implements Hive or
Pig to abstract away MapReduce. In the earlier days at Jawbone, we did have to
write our own MapReduce ETL, but only before we hired a squad of Data
Engineers to handle the heavy lifting of storing UP user behavior and weblogs. The
same has been largely true at Optimizely: if I wanted to get my hands dirty with a
project writing MapReduce jobs, I could, but my time is more efficiently
spent doing my job than attempting to do someone else's.