
Introduction to big data and deep learning
Na Lu
Xi’an Jiaotong University
An aphorism in machine learning:
Sometimes it's not who has the best
algorithm that wins; it's who has the
most data.
More data

• More data allows us to see new aspects


• More data allows us to see better aspects
• More data allows us to see different aspects
Big data

• What is Big data?


• Why Big data?
• When is Big data really a problem?
An example from Facebook
What is big data?

• In 2014, there were 7.3 billion mobile users, averaging 22 calls, 23 messages, and 110 status checks per day.

• Baidu, Tencent, Jingdong, Weixin, Weibo, Renren, Twitter, Facebook, Google+, LinkedIn (e.g., topic assignment; event detection)

• 90% of all data was generated during the past 3 years, from gene science to physics, from eCommerce to social media, from networks to smart phones…
What is Big data?

• Big data is similar to small data, but bigger


• Bigger data consequently requires different
approaches
– Techniques, tools and architectures
• To solve
– New problems
– And old problems in a better way
What is Big data?

• Transactions (eCommerce)
• Log records
• Social media, like Weixin, Facebook, Twitter…
• Sensors, such as RFID, cameras…
• Science research: gene data, images…
• Smart phones
• Manufacturing
• …
Traditional databases: relational (e.g., DB2)
Characteristics of Big data

• Volume, velocity, and variety (the 3 Vs)

The fourth V:
Veracity
(uncertainty
of the data)
Characteristics of Big data
More characteristics

Traditional       | Big data
Random sampling   | Whole dataset
Report            | Prediction
Relational        | Correlation
Two ways          | All angles


Report vs. Mining
Department: video sales   Year: 2000

              Q1        Q2
East        $54,302   $55,336
Atlanta      $3,590    $3,715
Boston       $5,174    $5,991
Miami        $1,425    $1,544
New Orleans  $3,029    $3,183

Why do we care?

• Big data is everywhere!


Realizing a competitive advantage

Banking and financial markets respondents (n=124) vs. global respondents (n=1,144):
• 2012: 71% / 63%
• 2011: 69% / 58%
• 2010: 36% / 37%

IBM 2012
Big banks in the US

• By the end of 2013, 96% of CIOs reported at least one Big Data project in progress, and 80% reported at least one Big Data project completed.
• 87% of CIOs reported their reasons for choosing Big Data:
– TIA (time to answer): a 100:1 time advantage
– Data mining
• 75% of CIOs reported that the relatively low investment is the main reason for choosing a Big Data solution.
More data about big banks

• Risk analysis: from 3 months to 3 hours
• Pricing: from 48 hours to 20 minutes
• User behavior analysis: from 72 hours to 20 minutes
• Model creation: from 150 models a year to 15,000 models a year
• Investment, traditional database vs. Big Data solution: 1:50
• User profiling cost: $300,000 (Hadoop) vs. $4,000,000 (traditional)
• Stock model cost: $200,000 (Hadoop) vs. $4,000,000 (traditional)
Other industries

• Supply chain: simulate and optimize supply chain flows;


reduce inventory and stock cost. (Wal-Mart, Amazon,
Jingdong etc.)
• Customer selection: identify customers with the greatest profit potential. (Banks etc.)
• Pricing: identify the price that will maximize yield, or profit. (Travel agents, real estate)
• Human capital: select the best employee for a particular task or job. (Companies, headhunters, and recruiting websites)
• Research and development
• ……
Job for Big Data

• Job opportunities are sky-rocketing…


• 1 for 100
• 100 for 1
– According to a TED talk on Big Data
– Alibaba hired 800 people for Big Data within 6 months in 2014
– Jingdong hired 300…
– An increasing number of startups on Big Data
– High salaries ($300,000)
Job Market for Big data
Characteristics of Big data

• Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structure of your databases.

• To gain value from this data, you must choose an alternative way to process it.
Big data scale

• Gigabyte: 10^9 bytes (university)

• Terabyte: 10^12 bytes (university and industry)

• Petabyte: 10^15 bytes (industry)

• Exabyte: 10^18 bytes

• Zettabyte: 10^21 bytes
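These powers of ten can be checked with a small helper; a minimal sketch, using only the unit labels and thresholds listed on this slide:

```python
# Map a raw byte count to the largest unit from the scale above that fits.
UNITS = [("GB", 10**9), ("TB", 10**12), ("PB", 10**15),
         ("EB", 10**18), ("ZB", 10**21)]

def to_unit(n_bytes):
    """Return e.g. '3.0 PB' for 3 * 10**15 bytes."""
    label, size = UNITS[0]
    for u, s in UNITS:
        if n_bytes >= s:
            label, size = u, s
    return "%.1f %s" % (n_bytes / size, label)

print(to_unit(3 * 10**15))  # → 3.0 PB
```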


Big data related techniques and
applications
• Big data related techniques:
– MapReduce (Hadoop)
– Locality Sensitive Hashing
– Page Rank
– Algorithms
• Applications:
– Web Search
– Recommender System
– Online Advertising
– ……
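MapReduce heads the list of techniques above. A minimal word-count sketch of its map, shuffle, and reduce phases in plain Python (illustrative only, not the Hadoop API; the sample documents are hypothetical):

```python
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    return [(w.lower(), 1) for w in doc.split()]

def shuffle(pairs):
    # Group values by key across all mapper outputs.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Sum the grouped counts for each word.
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["big data", "big deep learning"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 1, 'deep': 1, 'learning': 1}
```

Hadoop runs the same three phases, but distributes the mappers and reducers across a cluster and moves intermediate pairs over the network.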
What do you need to learn?

• Hadoop common
• Hadoop distributed file system (HDFS)
• Hadoop MapReduce
• Hbase
• Hive
• NoSQL
• Pig: a high-level data-flow language and execution framework for parallel computation
• Cloudera: a company supporting the open-source projects
• Julia: hot on Wall Street
• Spark: the newest Big Data solution (no reference yet)
Three different areas

• Back end: Linux, cluster, storage, HDFS
• MapReduce: Linux, shell, Java
• Data mining: Java, other GUI tools (SSAS, ODM, R)
Four directions

• Financial industry
• Medical and health care
• eCommerce and retail
• Science research
Why Big data?

• Key drivers of the growth of Big data include


– Increase of storage capacities
– Increase of processing power
– Availability of data
Machine learning

• Arthur Samuel's checkers program (20th century)


• Amazon's personalization algorithm
• Self-driving example
• Biopsy features (12 vs. 9)
Machine Learning
• Why do we need machine learning?
– It is very difficult to write programs that solve problems like recognizing the identity of a three-dimensional face image from a novel viewpoint, in new lighting conditions, in a cluttered scene.
• We do not know how the human brain does this job.
– It is hard to write programs that identify spam emails.
• No specific words or phrases are a definite indication of spam; a combination of features is needed for email classification.
• The program needs to keep changing.
Machine Learning

• Some tasks are best solved by machine learning


– Recognizing patterns
• Object in real scenes
• Facial identities or facial expressions
• Spoken words

– Recognizing anomalies
• Unusual sequences of credit card transactions
• Unusual patterns of sensor readings in a nuclear power plant
– Prediction
• Future stock prices or currency exchange rates
Neural Networks

• The number of neurons in the human brain is thought to be over 80 billion.

• In 2009, IBM developed a brain simulator that


replicated one billion human brain neurons
connected by ten trillion synapses.
Deep Learning

• Deep Learning makes MIT Tech Review’s list of top-10


breakthroughs of 2013.
– With massive amounts of computational power,
machines can now recognize objects and translate
speech in real time. Artificial intelligence is finally
getting smart.

• A Google deep-learning system that had been shown 10


million images from YouTube videos proved almost twice
as good as any previous image recognition effort at
identifying objects such as cats.
Experts in Deep Learning
• Geoffrey Hinton(UoT, Google)
– He is the co-inventor of the
back propagation and
contrastive divergence training
algorithms and is an important
figure in the deep learning
movement.
– Hinton joined Google in March
2013 when his company,
DNNresearch Inc, was
acquired.
– Hinton says, “When you get to
a trillion [parameters], you’re
getting to something that’s got
a chance of really
understanding some stuff.”
Experts in Deep Learning
• Yann LeCun(NYU, Facebook)
– He was a postdoctoral research
associate in Geoffrey Hinton's lab at
the University of Toronto.
– He developed a number of new machine learning methods, such as a biologically inspired model of image recognition called Convolutional Neural Networks, the "Optimal Brain Damage" regularization method, and the Graph Transformer Network method (similar to conditional random fields), which he applied to handwriting recognition and OCR.
Experts in Deep Learning
• Yoshua Bengio
– His main research ambition is
to understand principles of
learning that yield intelligence.
– His research is widely cited
(over 16000 citations found
by Google Scholar in early
2014, with an H-index of 55).
Experts in Deep Learning
• Jürgen Schmidhuber
– The recurrent neural
networks and deep feed
forward neural networks
developed in his research
group have won nine
international competitions
in pattern recognition and
machine learning.
What is deep learning

• Yann LeCun: Deep learning has come to


designate any learning method that can train a
system with more than 2 or 3 non-linear hidden
layers.
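As a rough illustration of LeCun's definition, here is a forward pass through a stack with four non-linear hidden layers, in NumPy; the layer sizes are arbitrary toy values and the weights are random, untrained placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# "Deep" in LeCun's sense: more than 2 or 3 non-linear hidden layers.
layer_sizes = [8, 16, 16, 16, 16, 4]   # input, 4 hidden layers, output
weights = [rng.standard_normal((a, b)) * 0.1
           for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    h = x
    for W in weights[:-1]:
        h = relu(h @ W)                # non-linear hidden layer
    return h @ weights[-1]             # linear output layer

print(forward(rng.standard_normal(8)).shape)  # → (4,)
```

Training such a stack is exactly where backpropagation (and, historically, pre-training) comes in; this sketch only shows the depth itself.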
What is deep learning
• Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using model architectures composed of multiple non-linear transformations.

• Deep learning is part of a broader family of machine


learning methods based on learning representations
of data.

• Deep learning attempts to define what makes better


representations and how to create models to learn
these representations.
What is deep learning

• Around 2003, Geoffrey Hinton, Yoshua Bengio


and Yann LeCun tried to revive the interest of
the machine learning community in the problem
of learning representations, as opposed to just
learning simple classifiers.

• In 2006-2007, some new results on unsupervised training (or unsupervised pre-training, followed by supervised fine-tuning) began to attract the attention of the machine learning community. (The three researchers above, plus Andrew Ng.)
What is deep learning
• One fact:
– Much of the recent practical applications of deep
learning use purely supervised learning based on
back-propagation, altogether not very different from
the neural nets of the late 80's and early 90's.
• Difference:
– What's different is that we can run very large and
very deep networks on fast GPUs (sometimes with
billions of connections, and 12 layers) and train them
on large datasets with millions of examples.
– We also have a few more tricks than in the past, such as a regularization method called "dropout", rectified non-linearities (ReLU) for the units, different types of spatial pooling, etc.
What is deep learning
• Presently, most of the algorithms within the scope of
deep learning are Neural Networks related, mainly
include
– RBM (Restricted Boltzmann Machine)
– Auto-Encoder
– DBN (Deep Belief Network)
– CNN (Convolutional Neural Network, or ConvNet for short)

• These models learn features which can be combined to represent objects and used for classification and object recognition.
• The states of the hidden layers are used as features.
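As a rough illustration of how an RBM's hidden states serve as features, here is a toy NumPy sketch of one contrastive-divergence (CD-1) weight update; the layer sizes and the sample vector are arbitrary toy values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny binary RBM: 6 visible units, 3 hidden units.
n_v, n_h = 6, 3
W = rng.standard_normal((n_v, n_h)) * 0.1
b_v = np.zeros(n_v)
b_h = np.zeros(n_h)

def cd1_update(v0, lr=0.1):
    """One CD-1 step on a single sample: data term minus reconstruction term."""
    p_h0 = sigmoid(v0 @ W + b_h)                 # hidden activation probs
    h0 = (rng.random(n_h) < p_h0).astype(float)  # sample binary hidden states
    p_v1 = sigmoid(h0 @ W.T + b_v)               # reconstruct the visibles
    p_h1 = sigmoid(p_v1 @ W + b_h)               # re-infer the hiddens
    return lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

v = np.array([1., 0., 1., 1., 0., 0.])
W += cd1_update(v)
features = sigmoid(v @ W + b_h)   # hidden-layer states used as features
print(features.shape)  # → (3,)
```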
Representation

Filters learned using the Predictive Sparse Decomposition algorithm, as described in the 2008 CBLL Tech Report CBLL-TR-2008-12-01: "Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition".
Deep learning architectures

• Auto-Encoder
• Restricted Boltzmann machine (RBM)
• Deep belief networks
• Convolutional neural networks
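The first architecture on the list can be sketched in a few lines: an auto-encoder compresses the input to a smaller code and reconstructs it, and training would minimize the reconstruction error. A toy NumPy sketch with random, untrained weights (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy auto-encoder: 8-dim input compressed to a 3-dim code.
W_enc = rng.standard_normal((8, 3)) * 0.1
W_dec = rng.standard_normal((3, 8)) * 0.1

def encode(x):
    return np.tanh(x @ W_enc)       # non-linear code (the learned features)

def decode(z):
    return z @ W_dec                # linear reconstruction

x = rng.standard_normal(8)
x_hat = decode(encode(x))
err = np.mean((x - x_hat) ** 2)     # reconstruction error to be minimized
print(x_hat.shape)  # → (8,)
```

After training, the 3-dim code plays the same role as the RBM's hidden states: a compact feature representation of the input.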
Deep Belief Network

Layer-wise training of a DBN


Convolutional neural networks

• Developed by Yann LeCun in the late 80s and


early 90s.

• At Bell Labs in the mid-1990s, a number of ConvNet-based systems for automatically reading the amounts on bank checks (printed or handwritten) were developed.
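The core operation of a ConvNet layer is a small filter slid across the image. A minimal NumPy sketch of this "valid" 2-D convolution (computed as cross-correlation, as deep learning frameworks do), with a hypothetical edge filter and toy image:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
edge = np.array([[1., -1.]])                      # horizontal-difference filter
print(conv2d(image, edge).shape)  # → (5, 4)
```

A real ConvNet stacks many such filter banks with non-linearities and pooling between them; LeCun's check-reading systems were built from exactly these pieces.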
Google X lab
• In mid-2014, Google said there were eight projects being developed
at Google X. As of late 2014, Google X projects that have been
revealed include:
– Google's self-driving car
– Project Wing, a drone delivery project
– Google Glass eyewear that includes a screen and camera
– Google contact lenses that monitor glucose in tears
– Project Loon, which provides internet service via balloons in the
stratosphere
– An airborne wind power company called Makani Power
– Lift Labs, makers of a tremor-cancelling spoon for Parkinson's
patients
– An artificial neural network for speech recognition and computer
vision
– The web of things
Google X lab

• Projects that Google X has considered and rejected include:
– a space elevator, which was deemed currently unfeasible;
– a hoverboard, which was determined to be too costly relative to the societal benefits;
– a user-safe jetpack, which was thought to be too loud and energy-wasting;
– and teleportation, which was found to violate the laws of physics.
Google X lab
• Google’s mysterious X lab built a neural network of 16,000
computer processors with one billion connections and let it
browse YouTube.

• The “brain” simulation was exposed to 10 million randomly


selected YouTube video thumbnails over the course of three
days and, after being presented with a list of 20,000 different
items, it began to recognize pictures of cats using a “deep
learning” algorithm.

• Picking up on the most commonly occurring images featured


on YouTube, the system achieved 81.7 percent accuracy in
detecting human faces, 76.7 percent accuracy when
identifying human body parts and 74.8 percent accuracy when
identifying cats.
Deep learning in Google

• Google is acquiring an AI startup called


DeepMind for more than 500 million dollars.

• DeepMind has recently hired several deep learning experts and recent graduates from Geoffrey Hinton's, Yann LeCun's, Yoshua Bengio's and Jürgen Schmidhuber's groups. One of the co-founders of DeepMind, Shane Legg, was a PhD student at IDSIA. Google and Facebook were in competition to buy out DeepMind.
Netflix competition

• The Netflix Prize was an open competition for the best


collaborative filtering algorithm to predict user ratings for
films, based on previous ratings without any other
information about the users or films, i.e. without the
users or the films being identified except by numbers
assigned for the contest.

• The Netflix Prize sought to substantially improve the


accuracy of predictions about how much someone is
going to enjoy a movie based on their movie preferences.
Netflix competition

• Netflix provided a training data set of


100,480,507 ratings that 480,189 users gave to
17,770 movies. Each training rating is a
quadruplet of the form <user, movie, date of
grade, grade>. The user and movie fields are
integer IDs, while grades are from 1 to 5
(integral) stars.

• $1M Grand prize was awarded to team “Bellkor’s


Pragmatic Chaos”.
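The quadruplet format above lends itself to a simple baseline: predict each movie's mean grade. A sketch with hypothetical sample ratings (not real Netflix data):

```python
from collections import defaultdict

# Hypothetical samples in the <user, movie, date of grade, grade> format.
ratings = [
    (1, 101, "2005-01-02", 4),
    (2, 101, "2005-03-15", 5),
    (1, 202, "2005-02-20", 2),
    (3, 202, "2005-04-01", 3),
]

# Baseline predictor: the mean grade of each movie.
sums = defaultdict(lambda: [0, 0])     # movie id -> [total grade, count]
for user, movie, date, grade in ratings:
    sums[movie][0] += grade
    sums[movie][1] += 1

def predict(movie):
    total, count = sums[movie]
    return total / count

print(predict(101))  # → 4.5
```

Netflix's own Cinematch system was the baseline to beat; the winning team improved on it by roughly 10% with far more elaborate collaborative-filtering ensembles.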
ILSVR challenge

• ImageNet Large Scale Visual Recognition Challenge


• ImageNet:
ILSVR challenge

• ImageNet
– 14,192,122 images, 21,841 categories
– Images found via web searches for WordNet noun synsets
– Hand-verified using Mechanical Turk
– Bounding boxes labeled for query objects
– New data for validation and testing each year
ILSVR challenge

• WordNet
– Source of the labels
– Semantic hierarchy
– Contains large fraction of English nouns
– Also used to collect other datasets like tiny images
(Torralba et al)
– Note that categorization is not the end/only goal, so
idiosyncrasies of WordNet may be less critical

Animal: fish, bird, mammal, invertebrate
Plant: tree, flower, vegetable
ILSVR challenge

• Taxonomy
ILSVR challenge

• The goal of this competition is to estimate the content of


photographs for the purpose of retrieval and automatic
annotation using a subset of the large hand-labeled
ImageNet dataset (10,000,000 labeled images depicting
10,000+ object categories) as training.

• Test images will be presented with no initial annotation --


no segmentation or labels -- and algorithms will have to
produce labelings specifying what objects are present in
the images.
ILSVR challenge

• Why large scale?

– Use of massive labeled datasets


– A large number of fine-grained classes
ILSVR challenge
2014 results:

Team name   | Entry description                                                | Localization error | Classification error
VGG         | combination of multiple ConvNets (by averaging)                  | 0.253231 | 0.07405
VGG         | combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.253501 | 0.07407
VGG         | combination of multiple ConvNets, including a net trained on images of different size (fusion by averaging); detected boxes not updated | 0.255431 | 0.07337
VGG         | combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes not updated | 0.256167 | 0.07325
GoogLeNet   | model with localization, ~26% top-5 val error                    | 0.264414 | 0.14828
GoogLeNet   | model with localization, ~26% top-5 val error, limiting number of classes | 0.264425 | 0.12724
VGG         | a single ConvNet (13 convolutional and 3 fully-connected layers) | 0.267184 | 0.08434
SYSU_Vision | compared the class-specific localization accuracy of solutions 1 and 2 on the validation set, then chose the better solution per class; generally, solution 2 outperformed solution 1 when there were multiple objects in the image or the objects were relatively small | 0.31899 | 0.14446
MIL         | 5 top instances predicted using FV-CNN                           | 0.337414 | 0.20734
MIL         | 5 top instances predicted using FV-CNN + class-specific window size rejection; flipped training images added | 0.33843 | 0.21023
ILSVR challenge

• The annotator was trained on 500 images, and evaluated on a total of 1,500 test-set images.

• The GoogLeNet classification error on this sample was estimated to be 6.8% (recall that the error on the full test set of 100,000 images is 6.7%).

• The human error was estimated to be 5.1%.
ILSVR challenge

Representative example of practical frustrations of labeling ILSVRC


classes. Aww, a cute dog! Would you like to spend 5 minutes scrolling
through 120 breeds of dog to guess what species it is?
ILSVR challenge
2014 Results

Team                  | Entry description                                     | Localization error | Classification error
MSRA Visual Computing | Multiple SPP-nets further tuned on validation set (A) | 0.354769 | 0.08062
MSRA Visual Computing | Multiple SPP-nets further tuned on validation set (B) | 0.354924 | 0.0806
MSRA Visual Computing | Multiple SPP-nets (B)                                 | 0.355568 | 0.082
MSRA Visual Computing | Multiple SPP-nets (A)                                 | 0.3562   | 0.08307
MSRA Visual Computing | A single SPP-net                                      | 0.36118  | 0.09079

MSR achieves 4.94% error on ImageNet, surpassing the human error rate of 5.1%! (2015, latest results)
ILSVR challenge

• I am one point on that curve, with 5.1% error. My labmates, with almost no training, are another point, with up to 15% error. Based on the above hypothetical calculations, it's not unreasonable to suggest that a group of very dedicated humans might push this down to 2% or so.

• 5 minutes of training: 5.1% error
• 15 minutes of training: ~3% error
Face detection method
• Back in 2001, two computer scientists, Paul Viola and Michael
Jones, triggered a revolution in the field of computer face detection.

• After years of stagnation, their breakthrough was an algorithm that


could spot faces in an image in real time. Indeed, the so-called
Viola-Jones algorithm was so fast and simple that it was soon built
into standard point and shoot cameras.

• The Viola-Jones algorithm first looks for vertical bright bands in an image that might be noses, then for horizontal dark bands that might be eyes, and then for other general patterns associated with faces.
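The speed of Viola-Jones comes from evaluating its rectangle (Haar-like) band features on an integral image, so any box sum costs only four lookups regardless of box size. A minimal NumPy sketch of that trick (the toy image is hypothetical):

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[i, j] = sum of img[:i+1, :j+1]."""
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] in O(1) via four table lookups."""
    total = ii[bottom - 1, right - 1]
    if top > 0:
        total -= ii[top - 1, right - 1]
    if left > 0:
        total -= ii[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.ones((6, 6))          # toy all-ones "image"
ii = integral_image(img)
print(box_sum(ii, 1, 1, 4, 4))  # 3x3 box of ones → 9.0
```

A Haar-like feature is then just the difference of two such box sums (e.g., a bright band minus an adjacent dark band), which is why the detector can run in real time.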
Face detection
• Today, Sachin Farfade and Mohammad Saberian at Yahoo Labs in
California and Li-Jia Li at Stanford University nearby, reveal a new
approach to the problem that can spot faces at an angle, even when
partially occluded. They say their new approach is simpler than
others and yet achieves state-of-the-art performance. (Feb. 16, 2015)

• The idea is to train a many-layered neural network using a vast


database of annotated examples, in this case pictures of faces from
many angles. (Deep convolutional neural network)

• To that end, Farfade and co-workers created a database of 200,000 images that included faces at various angles and orientations, and a further 20 million images without faces. They then trained their neural net in batches of 128 images over 50,000 iterations.
Face detection
• The result is a single algorithm that can spot faces from a wide
range of angles, even when partially occluded. And it can spot many
faces in the same image with remarkable accuracy.
• The team calls this approach the Deep Dense Face Detector
Conclusions

• The Big Data era is here
• Big Data changes everything, including how we live and, of course, how we do our financing
• Big Data is the new starting point for many possibilities
• Big Data is an asset, and can be capitalized
Conclusions

• Deep learning is an effective tool for big data learning.
• Deep learning is built on neural networks.
• Deep learning has significantly improved classification performance on big data.
End

Thank you.
