There exists a hidden gap between the more idealized view of the
world given to data-science students and recent hires, and the issues
they often face getting to grips with real-world data science problems
in industry. All these new college courses in data analytics (they’re
almost all newly-minted courses) aim at teaching students the basics of
coding, statistics, data wrangling etc. However, the kind of challenges
you’re expected to overcome in an actual data science job within
industry are greatly under-represented.
The analytics course provides the data (and often the tools) and asks
you to generate an expected result, whereas the industry role may
provide you with neither the data, nor the appropriate tools, nor even
an idea of what constitutes an expected result. Plus the industry piece
usually comes with a more stringent deadline, no support, and a
requester with limited statistical understanding.
All things going well, the budding data scientist may find their long-
expected dataset is neither up-to-date nor at the level of granularity
required for the analysis. Google Analytics, arguably the most widely
used source for web-related user behaviour, has some vexing issues
that prevent it from lending itself to detailed analysis. Firstly, there’s
the difficulty of uniquely identifying web users, and secondly, there’s
the disquieting issue of GA presenting “estimates” of the total page
views, rather than the actual stat. So, a reasonable-sounding request is
rendered impossible due to irrelevant data. For example, if you’re
tasked with predicting retention rates for customers logging into
website X, then the GA data feed will be as good as useless on its own.
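To see why user-level identifiers matter here, consider a minimal sketch of a retention calculation. It assumes you have login events as (user ID, month) pairs, which is exactly what aggregated page-view estimates cannot give you. The function name, the data layout and the sample records are all hypothetical, made up purely for illustration.

```python
from collections import defaultdict


def monthly_retention(logins):
    """Return {month: share of the previous observed month's users who came back}.

    `logins` is an iterable of (user_id, "YYYY-MM") pairs (a hypothetical layout).
    Assumption: consecutive months in the sorted data are treated as adjacent,
    so a month with no logins at all is simply skipped over.
    """
    users_by_month = defaultdict(set)
    for user_id, month in logins:
        users_by_month[month].add(user_id)

    months = sorted(users_by_month)
    retention = {}
    for previous, current in zip(months, months[1:]):
        # Users seen in both months: impossible to compute without stable IDs.
        returning = users_by_month[previous] & users_by_month[current]
        retention[current] = len(returning) / len(users_by_month[previous])
    return retention


# Illustrative, made-up login records.
sample_logins = [
    ("a", "2024-01"), ("b", "2024-01"), ("c", "2024-01"), ("d", "2024-01"),
    ("a", "2024-02"), ("b", "2024-02"), ("e", "2024-02"),
]
print(monthly_retention(sample_logins))
```

The set intersection is the whole point: without a reliable per-user key there is nothing to intersect, and no amount of total page-view counts will stand in for it.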
As a result, if you are asked to perform what sounds like a major piece
of analytics work, design the bugger so that it can be easily re-run,
ideally at the click of a button and with the minimum effort from
yourself.
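One way to make an analysis re-runnable is to put the whole thing behind a single entry point that takes its input and output locations as arguments. The sketch below is only an illustration under stated assumptions: the input is a CSV with a numeric column called `value`, and the names `run_analysis` and `main` are hypothetical. Your real analysis will be far messier, but the shape is the same.

```python
import argparse
import csv
import statistics


def run_analysis(input_path, output_path):
    """Read a CSV with a numeric 'value' column (assumed) and write summary stats."""
    with open(input_path, newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]

    summary = {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

    # Write the result as a one-row CSV so every re-run produces the same layout.
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)
    return summary


def main(argv=None):
    """Parse the two paths and run the analysis end to end."""
    parser = argparse.ArgumentParser(description="Re-runnable analysis sketch")
    parser.add_argument("input_path", help="CSV file with a 'value' column")
    parser.add_argument("output_path", help="where to write the summary CSV")
    args = parser.parse_args(argv)
    return run_analysis(args.input_path, args.output_path)

# To use this as a script, call main() under an
# `if __name__ == "__main__":` guard at the bottom of the file.
```

When the refreshed dataset finally turns up, re-running is then a single command with two paths rather than an afternoon of retracing your steps.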
Ahh, yes, the classic. You’ve completed the analysis, compiled a nice
report and a few slides on the problem, and now you need to send the
data to someone for review. I’ll just paste the data with all the
customer details in plain-text into an email, what could go wrong?
Well, for one thing, it’s not too awfully difficult to auto-complete an
email to the wrong person in a contact list, and send your crown jewels
off to God-knows-where. Or you could do as a former colleague did, and
send the detailed financial analysis for one company, in error, to their
competitor!
Need to use company-wide standard data encryption huh?
There are reasons the good people in information security require that
any data you send out is encrypted. Security theatre is arguably
top of that list, ass-covering is probably second, but there are also
plenty of sound reasons apart from outward appearances of security.
If you’re not allowed to install some GPG client (as it would violate
security policies) encryption must be done using an encrypted file
format, like a password-protected Excel or encrypted zip. What’s that
you say? Encrypted zips get blocked by the email server, and the client
has no SFTP or file-sharing server? Tough. Never, ever, compromise
on your data security standards by taking some short-cut for ‘just this
one time’. At the end of the day, you will be left carrying the can,
while the person shouting at you for the analysis will go off on their
merry way while you’re looking for another job.
9. Analytics outputs are easily shared and understood.
Let’s face it—the majority of your audience will have not the slightest
clue how to evaluate any fundamentally detailed analysis. They will
prevaricate and pretend to understand, as displays of ignorance can be
seen as weakness. They will ask you to augment your analysis with
more features, claim that the analysis needs to be “mathematically
proven before it can be used”, and use all manner of distraction and
subterfuge to hide their bafflement. Some just look for certain
p-values, others rely on ‘gut-feel’ but you will see your detailed analysis
doubted, questioned and ignored. Or to put it another way… any
sufficiently advanced analysis will be indistinguishable from magic.
Therefore it’s your primary job to translate the results for the less
numerically-inclined, into language they can readily comprehend,
whether or not you’ve answered the question that has been posed.
10. The answer you’re looking for is there in the first place.
A bit like an Easter Egg hunt, there’s an implicit understanding that the
desired goal of any data science project is actually achievable, given a
bit of time searching around with the help of a few tools. However,
unlike the proverbial Easter Bunny, there is nobody out there
deliberately peppering your data with the nuggets of insight that will
help prove something. Want to find out why your click-thru rate on the
website is down this month? Want to figure out why customers prefer
product X rather than product Y? These queries come pre-loaded with
an expected outcome, often to the detriment of proper scientific
enquiry.