President's Corner

Aren’t We Data Science?


1 July 2013

Davidian

Last month, I shared this column with President-elect Nat Schenker and Past President Bob Rodriguez to
announce an ASA strategic initiative to promote engagement of statisticians in Big Data. I’m following that
announcement with an account of some of my recent experiences regarding data science, which inspire my
enthusiasm for this effort. One in particular serves as a metaphor for the disconnect between statistics and
data science we noted last month.

Around the time we were finalizing that column, Michelle Dunn, chair of the ASA Committee on Funded
Research, forwarded an email to me. Michelle thought I would be interested in learning from the press
release in the email that Eric Green would be speaking in Chapel Hill, North Carolina, 25 minutes from my
office in Raleigh, on April 23. In January, the director of the National Institutes of Health (NIH), Francis
Collins, announced the creation of a new NIH-wide position, the Associate Director for Data Science
(ADDS), to “capitalize on the exponential growth of biomedical research data.” Collins named Green, current
director of the National Human Genome Research Institute, as acting ADDS. Green is also co-chair of the
search committee charged with nominating the permanent ADDS.

Indeed, I was very interested! But what was even more interesting was the organization that had invited
Green to speak. The press release announced “a new collaboration called the National Consortium for Data
Science (NCDS) (aiming) to make North Carolina a national hub for data-intensive business and data science
research.” It went on to note that the NCDS had been launched at the Renaissance Computing Institute at The
University of North Carolina at Chapel Hill (UNC-CH) and included among its founding members
businesses, government organizations, and major research universities.

Rachel Schutt
Photo taken by Nina Krstic

I highlight that last group because, upon locating the NCDS website, I was astonished to review the list of
founding members and see that not only is my university (North Carolina State) a founding member, but so
are Duke University and UNC-CH, along with SAS Institute; Research Triangle Institute International;
NIH’s National Institute of Environmental Health Sciences; IBM; and several other institutions, businesses,
and government agencies that employ numerous statisticians. The member representatives listed on the
website from NC State, Duke, and UNC-CH are computer scientists/engineers, and among all 17
representatives, there is not one statistician.

Until I saw that email, I had no idea that the NCDS even existed. A quick check with my department head,
others in my department, and statistician friends at the other institutions listed (including Bob at SAS)
revealed that none of them did, either. I later learned that, of the 80 or so individuals participating in the
invitation-only NCDS Leadership Summit on “Data to Discovery: Genomes to Health” for which Green was
the keynote public speaker, only two are affiliated with an entity with the word “statistics” in its name (and
are known to me to be trained as statisticians).

I tell this story not to take issue with the formation of the NCDS, but because it is reminiscent of stories and
comments I have heard from many of you.

As we discussed in June, the field of data science has commanded considerable attention in the media and
among business and science leaders. It is described as a blend of computer science, mathematics, data
visualization, machine learning, distributed data management—and statistics. A New York Times article in
April reported that centers and institutes devoted to data science and Big Data are being created and curricula
and certificate and degree programs are being developed at a number of universities.

Rachel Schutt’s Introduction to Data Science class in the Columbia University Statistics Department,
where she is adjunct assistant professor. She is also senior research scientist at Johnson Research Labs.

Many of you have expressed concern that these and other data-oriented initiatives have been or are being
conceived on your campuses without involvement of or input from the department of statistics or similar unit.
I’ve been told of university administrators who have stated their perceptions that statistics is relevant only to
“small data” and “traditional” “tools” for their analysis, while data science is focused on Big Data, Big
Questions, and innovative new methods. I’ve also heard about presentations on data science efforts by
campus and agency leaders in which the word “statistics” was not mentioned. On the flip side, I have heard
from statistics faculty frustrated at the failure of their departments to engage proactively in such efforts.

In fact, some of you have directly asked the question that forms the title of this column.

I decided to contact a statistician who is at the forefront of data science to get her thoughts about the
challenges (and opportunities) these developments pose for our discipline and how we might confront them.
Rachel Schutt, who is featured in the Times article cited above, earned her PhD from the department of
statistics at Columbia University, where she is an adjunct faculty member. Upon graduation, Rachel took a
position at Google, where she became acquainted with the scope, practice, and jargon of data science before
moving to her current position at Johnson Research Labs. In fall 2012, she taught “Introduction to Data
Science” for the Columbia statistics department and is co-author of a book, Doing Data Science,
summarizing the course. I encourage you to visit the course website and read Rachel’s blog about the
evolving course activities.

Rachel generously spent well over an hour sharing her perspectives with me; I summarize our discussion of
only a few key topics here.

Data science is here to stay, Rachel says. There may be a lot of “hype,” but that might not be bad if it attracts
talented people to work on data-driven problems. And to statistics. Statistics has enormous potential to
contribute to data science. There are open research problems requiring that classical statistical methods in
sampling, design, and causal inference be “scaled up” to be feasible with massive data sets. Few of the
computer scientists and others who dominate the data science landscape are well-versed in these concepts,
and many take an “algorithmic” view of data analysis. Data science needs statistical thinking and new
foundational frameworks—for example, what is the “population” when one confronts the Big Data generated
by Google?

In fact, many businesses are beginning to collect data prospectively for internal testing and validation, and
there is little appreciation for the power of design principles. Statisticians could propel major advances
through development of “experimental design for the 21st century”!

What skills does a statistician need to engage in data science activities, and how should we be preparing
statistics students? In addition to a strong foundation in statistical theory, methods, and software, statistics
students should develop deep proficiency in programming, Rachel says. Coding skills—in R and in Python
including the use of Python as a scripting language—should be part of any modern statistics curriculum. And
statisticians must appreciate issues and tools associated with parallel computing, combining data from
disparate sources, and handling textual and streaming data.

Familiarity with data visualization techniques and popular tools like D3.js would be ideal and could enliven
curricula and projects. Exploratory data analysis, which is generally not taught formally in many statistics
programs, should be emphasized. Training in machine learning methods also is key. Not to mention
communication skills.

Rachel stressed the importance of exposure to “real world” problems—the disconnect between curriculum
and the “messiness” of the real world is greater than it has ever been. She advocates engaging local
businesses and research organizations to present case studies to students, as she did in her course. Not only
will this acquaint students with what they might confront, but also such interactions can forge connections
that can inspire needed statistical research.

What can we do as individuals, a profession, and an association to address the concerns noted above?
Rachel’s thinking? Sponsor and attend events that bridge disciplinary boundaries and afford opportunities to
interact with scientists with massive data problems such as the University of California at Davis 2013
Statistical Sciences Symposium: Analysis of Complex and Massive Data. The ASA could make a big impact
by sponsoring or collaborating in a conference on statistics and data science featuring top data scientists and
statisticians as speakers.

Participate in data science Meetup groups. There are scores of these in San Francisco, Washington, DC, New
York, Boston, and elsewhere—or consider forming one. We statisticians should seek these out and attend and
offer to speak, and we should encourage our students to do likewise. In fact, Rachel and several colleagues
have started The NYC Data Skeptics Meetup, which focuses on all aspects of data from a “skeptical
perspective” on the hype surrounding Big Data and data science.

Statisticians in academia interested in engaging in data science should seek sabbatical opportunities in
industry, and departments should reach out to industry data scientists and invite them to present seminars,
contribute to the curriculum, and serve as adjunct faculty. Departments can propose partnerships with
computer science, operations research, and other disciplinary units on campus to develop and team-teach
courses and to sponsor joint seminars and working groups. Such interactions will reveal areas in which
statistical research is needed.

Rachel noted in closing that she fears academic departments of statistics could be viewed as obsolete and be
phased out over the next decade if we do not evolve to embrace this challenge—data science is not going
away. She suggests we ask ourselves, “How would you feel if there were no departments of statistics 50 years
from now?” It is essential that we confront this head-on; otherwise, the many philosophical issues data
science presents demanding deep statistical thinking will not be addressed.

I am grateful to Rachel for sharing her candid views with me. She has convinced me that the ASA Big Data
initiative is an essential step toward addressing some of these challenges at the association level, laying the
groundwork for curriculum enhancements, significant engagement with stakeholders, and professional
development. We aren’t data science, but we have a critical role to play. I encourage you to consider steps
you can take locally to raise awareness of the importance of statistics in data science.

9 Comments

• Keith said:

I don’t think statisticians are alone in the challenges of understanding how their roles as individuals and
as a profession fit into the “Data Scientist” realm. The one point in the article that quoted Rachel Schutt
about statisticians needing to learn ‘R’ is one of the disconnects. As Bob Howe, from the University of
Washington, points out, statisticians need to learn to deal with data that does not fit into memory. I
don’t want to miss the main focus by pointing out that tools such as R deal with in-memory data sets. In
the Data Science space we are dealing with data that doesn’t fit in memory; it doesn’t fit on one
machine, and it may not even fit on 100 machines. This is where statisticians need to make the leap, because
all of their other skills are so important to Data Science. I myself am on the software side and I need to
learn more about statistics in order to compete and stay relevant in the Big Data (Data Science) space.
Many traditional roles that deal with data in some aspect all have something to add to their skill set in
order to make data science work for them and for them to advance their profession. It will be
interesting to see, with the shortage of skilled “Data Scientists,” how companies react both long and short
term. Do they hire multiple people (data scientists, software engineers, analysts, and others) to cover the
shortfall around this “new” science, or do they continue to hold out for people that make the leap by
merging their profession and skill sets with what have traditionally been other areas of expertise? Good
article, and I understand the challenge and to some extent a little frustration, but it’s all common with
shifts in paradigm of this size.

# 27 July 2013 at 1:09 pm


• Shelly said:

Thanks for a wonderful article.

Data Science is a term much used but not understood well. Can somebody explain the data with which
a Data Scientist starts off? Is it a direct output from MapReduce? Is it then put on a different platform
for analysis, and what kind of platform?

Would greatly appreciate it if somebody can throw light on it.


Thanks

# 30 July 2013 at 6:52 am


• Randy Bartlett said:

RE: We aren’t data science, but we have a critical role to play.


RESP: Most of your article was fine; this did not digest. Out here in the field (read ‘wild west’), we are
data scientists. By our definition, all data analysis involves underlying statistical assumptions. We
complete our applied education; many of us already know R and are masters of software. We have to
be; if we are not data scientists/business quants, then we have no future.
Computer science and IT are trying to annex machine learning, predictive modeling, and data mining.
The last thing we in the field want is for ASA to hand them over. Meanwhile, ASA has granted us two
essential items we requested 30 years ago, certification and CSP. Let’s see if there is still time to save
the profession.

The TOUGH question for those of you who publish papers with no data and are so removed from what
is going on out here is: ‘Are you still statisticians?’

# 31 July 2013 at 1:37 pm


• Abhijit said:

As an active participant, organizer and member of data science meetups in the DC area, I can say that
the participation of the academic community in our meetups is very limited, unless we have an
academic speaker. This is possibly not true of other similar meetups around the country. We have tried
to engage our academic brethren, but to limited success.

I myself am a trained statistician, but many of our members are not. They are coming from all domains,
all backgrounds. They don’t necessarily have the theoretical foundation someone similar to me might
have, but they have experience, insight and curiosity, which today counts for a bit more, IMO.

I agree with Randy that core areas like machine learning, predictive modeling are being annexed, but
how many stats and biostats programs even teach them? We have limited ourselves as stat departments
to the classical, not the modern. These areas have need for theoretical and foundational development,
and questions about how the Big Data world relates to our ideas about statistical inference and
modeling.

The biggest needs that statistics graduates will have to succeed in this environment are familiarity with
large data and cloud computing, programming skills, a solid toolbox of statistical methods, an openness
to develop and understand new methods (sounds like research), and most importantly, a desire and
interest to translate the fruits of the analysis into information that is usable and actionable. Our work
cannot stop at the analyst’s desk, it has to continue to the consultant’s chair and make the meaning of
our analyses intelligible.

There is plenty of talent out there with the computing skills to do a lot of the work that is in demand,
and more talent is needed, no doubt. The single biggest issue out there, IMO, is that lots of people have
data but have no idea what to do with that data. They don’t even have the questions. The exploration of
big data to understand the possibilities of information and actionable intelligence it contains is the big
story for the next 5 years. Can statistics, as a field, find a way to exploit this need and make it ours? If
we don’t, we know others are chomping at the bit to take over.

# 1 August 2013 at 9:27 pm


• Joshua Loftus said:

@Keith: Why do statisticians need to learn how to deal with data that does not fit into memory?
Several surveys linked to in this article http://finance.yahoo.com/news/most-data-isn-t-big-174703362.html
suggest that most data sets being analyzed by “data scientists” are in the megabyte to
gigabyte range.

# 19 August 2013 at 7:51 pm


• Vincent Granville said:

@Joshua: Data science is not just about analyzing data, it’s also about implementing algorithms that
process data automatically, to provide automated predictions and actions such as
– automated bidding systems
– estimating in real time the value of all houses in US
– high frequency trading
– matching an Ad with a user and a web page to maximize odds of conversions
– book and friend recommendations (Amazon, Facebook)
– analyzing NASA pictures to detect new planets or asteroids
– weather forecasts
– computational chemistry to simulate new molecules for cancer treatment
– tax fraud detection, terrorism detection

All this involves both statistical science and terabytes of data

# 27 September 2013 at 10:16 am


Welcome!

Amstat News is the monthly membership magazine of the American Statistical Association, bringing you
news and notices of the ASA, its chapters, its sections, and its members. Other departments in the magazine
include announcements and news of upcoming meetings, continuing education courses, and statistics awards.
