
Introduction to Deep Learning for NLP

Alfan Farizki Wicaksono (alfan@cs.ui.ac.id)
Information Retrieval Lab.
Fakultas Ilmu Komputer, Universitas Indonesia

9 October 2017
Deep Learning Tsunami
"Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences."
– Dr. Christopher D. Manning, Dec 2015

Christopher D. Manning. (2015). Computational Linguistics and Deep Learning. Computational Linguistics, 41(4), 701–707.
Tips
Before studying RNNs and the other Deep Learning architectures, it is recommended to first learn the following topics:

• Gradient Descent/Ascent
• Linear Regression
• Logistic Regression
• The concepts of Backpropagation and the Computational Graph
• Multilayer Neural Networks
References/Reading
• Andrej Karpathy's blog
  • http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Colah's blog
  • http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• The Deep Learning book by Yoshua Bengio
• Y. Bengio, Deep Learning, MLSS 2015
Deep Learning vs Machine Learning
• Deep Learning is a subset of Machine Learning
• Machine Learning is a subset of Artificial Intelligence

[Diagram: nested view of Artificial Intelligence (which also covers searching, knowledge representation and reasoning, and planning) containing Machine Learning, which in turn contains Deep Learning]
Machine Learning (rule-based)
Legend: "designed by humans" vs. "inferred automatically"

Input: "Buku ini sangat menarik dan penuh manfaat" (≈ "This book is very interesting and full of benefits")

Hand-crafted rules (designed by humans):
    if contains('menarik'):
        return positive
    ...

Predicted label: positive
Machine Learning (classical ML)
Legend: "designed by humans" vs. "inferred automatically"

The classification function is optimized from the inputs (features) and outputs.

Input: "Buku ini sangat menarik dan penuh manfaat"
Hand-designed feature extractor (designed by humans) → Representation
  Example: TF-IDF, syntactic information from a POS tagger, etc. Feature engineering!
Classifier (inferred automatically): learns a mapping from features to label
Predicted label: positive
Machine Learning (Representation Learning)
The features and the classification function are optimized jointly.

Input: "Buku ini sangat menarik dan penuh manfaat"
Learned feature extractor (e.g., Restricted Boltzmann Machine, Autoencoder, etc.) → Representation
Classifier: learns a mapping from features to label
Predicted label: positive
Machine Learning (Deep Learning)
Deep Learning learns the features!

Input: "Buku ini sangat menarik dan penuh manfaat"
Simple features → complex/high-level features → even higher-level features (Representation)
Classifier: learns a mapping from features to label
Predicted label: positive
History
The Perceptron (Rosenblatt, 1958)
• The story begins with the Perceptron in the late 1950s.
• The Perceptron consists of 3 layers: Sensory, Association, and Response.

http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/
Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological Review 65.6 (1958): 386.
The Perceptron (Rosenblatt, 1958)

The activation function is a non-linear function. In Rosenblatt's perceptron, the activation function was plain thresholding (a step function).

The perceptron was trained using Donald Hebb's learning rule.

At that time, it could already classify 20×20-pixel inputs!

http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/
The Organization of Behavior: A Neuropsychological Theory. D. O. Hebb. John Wiley And Sons, Inc., New York, 1949
The Fathers of Deep Learning(?)


https://www.datarobot.com/blog/a-primer-on-deep-learning/
The Fathers of Deep Learning(?)
• In 2006, these three developed ways to exploit deep neural networks and to overcome the problems of training them.
• Before that, many people had given up on neural networks — on their usefulness and on how to train them.
• They addressed the problem that neural networks had not yet been able to learn useful representations.

• Geoff Hinton has been snatched up by Google;
• Yann LeCun is Director of AI Research at Facebook;
• Yoshua Bengio holds a position as research chair for Artificial Intelligence at University of Montreal

https://www.datarobot.com/blog/a-primer-on-deep-learning/
The Fathers of Deep Learning(?)
• Automated learning of data representations and features is what the hype is all about!

https://www.datarobot.com/blog/a-primer-on-deep-learning/
Why wasn't "deep learning" successful before?

• In fact, complex neural networks had been invented long before.
• Even the Long Short-Term Memory (LSTM) network, which is now widely used in NLP, was invented in 1997 by Hochreiter & Schmidhuber.

• On top of that, people back then believed that neural networks "can solve everything!". So why couldn't they do it at the time?
Why wasn't "deep learning" successful before?

• Some reasons, from Ilya Sutskever:
  • http://yyue.blogspot.co.id/2015/01/a-brief-overview-of-deep-learning.html

• Computers were slow. So the neural networks of the past were tiny. And tiny neural networks cannot achieve very high performance on anything. In other words, small neural networks are not powerful.
• Datasets were small. So even if it was somehow magically possible to train LDNNs, there were no large datasets that had enough information to constrain their numerous parameters. So failure was inevitable.
• Nobody knew how to train deep nets. The current best object recognition networks have between 20 and 25 successive layers of convolutions. A 2-layer neural network cannot do anything good on object recognition. Yet back in the day everyone was very sure that deep nets cannot be trained with SGD, since that would've been too good to be true.
The Success of Deep Learning
One of the key factors is that training procedures have now been found that work in practice.

The success of Deep Learning hinges on a very fortunate fact: that well-tuned and carefully-initialized stochastic gradient descent (SGD) can train LDNNs on problems that occur in practice. It is not a trivial fact since the training error of a neural network as a function of its weights is highly non-convex. And when it comes to non-convex optimization, we were taught that all bets are off...

And yet, somehow, SGD seems to be very good at training those large deep neural networks on the tasks that we care about. The problem of training neural networks is NP-hard, and in fact there exists a family of datasets such that the problem of finding the best neural network with three hidden units is NP-hard. And yet, SGD just solves it in practice.

Ilya Sutskever, http://yyue.blogspot.co.at/2015/01/a-brief-overview-of-deep-learning.html


What is Deep Learning?
What is Deep Learning?
• In practice, Deep Learning = (deep) Artificial Neural Networks (ANNs)
• And a neural network is really just a stack of mathematical functions

Image courtesy: Google


What is Deep Learning?
In practical terms, (supervised) Machine Learning is:

Express the problem as a function F (with parameters θ), then automatically search for the parameters θ such that F produces exactly the desired output.

Y = F(X; θ)

X: "Buku ini sangat menarik dan penuh manfaat"
Predicted label: Y = positive
What is Deep Learning?
For Deep Learning, this function usually consists of a stack of many (usually similar) functions:

Y = F(F(F(X; θ3); θ2); θ1)

This picture is often called a computational graph; the stack of functions is often called a stack of layers.

[Diagram: "Buku ini sangat menarik dan penuh manfaat" → F(·; θ1) → F(·; θ2) → F(·; θ3) → Y = positive]
What is Deep Learning?
• The most common layer is the Fully-Connected Layer:

Y = F(X) = f(W·X + b)

• "a weighted sum of its inputs, followed by a non-linear function"
• Commonly used non-linear functions: Tanh (hyperbolic tangent), Sigmoid, ReLU (Rectified Linear Unit)

With N input units and M output units:
W ∈ R^(M×N), X ∈ R^N, b ∈ R^M

[Diagram: each of the M output units computes f(Σ_i w_i·x_i + b), i.e. a weighted sum of the N inputs X followed by the non-linearity f]
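As a concrete sketch of such a layer (an illustration, not from the slides): a fully-connected layer in NumPy with N = 3 inputs, M = 4 outputs, and tanh as the non-linearity.

```python
import numpy as np

def fully_connected(X, W, b, activation=np.tanh):
    """One fully-connected layer: a weighted sum of the inputs,
    followed by an element-wise non-linearity, Y = f(W.X + b)."""
    return activation(W @ X + b)

# Example with N = 3 input units and M = 4 output units
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))       # W in R^(M x N)
b = np.zeros(4)                   # b in R^M
X = np.array([1.0, -0.5, 2.0])    # X in R^N
Y = fully_connected(X, W, b)
print(Y.shape)                    # (4,)
```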
Why do we need "deep"?
• Humans organize their ideas and concepts hierarchically
• Humans first learn simpler concepts and then compose them to represent more abstract ones
• Engineers break up solutions into multiple levels of abstraction and processing
• It would be good to automatically learn / discover these concepts

Y. Bengio, Deep Learning, MLSS 2015, Austin, Texas, Jan 2015
(Bengio & Delalleau 2011)
Neural Networks
Y = f(W1·X + b1)

[Diagram: X → Y = f(W1·X + b1)]
Neural Networks
Y = f(W2·f(W1·X + b1) + b2)

[Diagram:]
H1 = f(W1·X + b1)
Y = f(W2·H1 + b2)
Neural Networks
Y = f(W3·f(W2·f(W1·X + b1) + b2) + b3)

[Diagram:]
H1 = f(W1·X + b1)
H2 = f(W2·H1 + b2)
Y = f(W3·H2 + b3)
The mathematical reason for going "deep"?

• A neural network with a single hidden layer of enough units can approximate any continuous function arbitrarily well.

• In other words, it can solve whatever problem you're interested in!

(Cybenko 1989, Hornik 1991)
The mathematical reason for going "deep"?
However ...

• "Enough units" can be a very large number. There are functions representable with a small but deep network that would require exponentially many units with a single layer.
• The proof only says that a shallow network exists; it does not say how to find it.
• Evidence indicates that it is easier to train a deep network to perform well than a shallow one.
• A more recent result gives an example of a very large class of functions that cannot be efficiently represented with a small-depth network.

(e.g., Hastad et al. 1986, Bengio & Delalleau 2011)
(Braverman, 2011)
Why do we need non-linearity?

[Diagram: a single unit computing f(Σ_i w_i·x_i + b), i.e. a weighted sum of its inputs followed by the non-linearity f]

Why do we need the non-linear function f?

H1 = W1·X + b1
H2 = W2·H1 + b2
Y = W3·H2 + b3

Without f, this chain of functions is still a linear function.

Data can be very complex, and the relationships in the data are not always linear; they can be non-linear. We need a representation that can capture this. (A small numerical check of the "still linear" point follows below.)
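A tiny NumPy check of this point (an illustrative sketch; the dimensions are arbitrary): stacking linear layers without a non-linearity collapses into a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
X = rng.normal(size=3)

# Two "layers" without a non-linearity ...
Y_stacked = W2 @ (W1 @ X + b1) + b2

# ... are exactly one linear layer with W = W2.W1 and b = W2.b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
Y_single = W @ X + b

print(np.allclose(Y_stacked, Y_single))  # True
```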
Training Neural Networks
Y = f(W3·f(W2·f(W1·X + b1) + b2) + b3)

• Randomly initialize all the parameters W1, b1, W2, b2, W3, b3.

• Define a cost function/loss function that measures how good your neural network is.
  • How far the predicted value is from the true value

• Iteratively adjust the parameter values so that the value of the loss function becomes minimal.
Training Neural Networks

[Diagram, built up over several slides: the input x = "Buku ini sangat baik dan mendidik" (≈ "This book is very good and educational") is fed through W(1) → h1 → W(2) → h2 → W(3) to produce the predicted label y = (0.3, 0.7) over (pos, neg), which is compared with the true label y' = (1, 0); the loss gradients ∂L/∂y, ∂L/∂W(3), ∂L/∂h2, ∂L/∂W(2), ∂L/∂h1, ∂L/∂W(1) then flow back down through the network.]

• Initialize the trainable parameters randomly
• Loop: x = 1 → #epoch:
  • Pick a training example
  • Compute the output by doing the feed-forward process
  • Compute the gradient of the loss w.r.t. the output
  • Backpropagate the loss, computing gradients w.r.t. the trainable parameters. It's like computing the contribution of each parameter to the error at the output.

(A sketch of this training loop follows below.)
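A minimal end-to-end sketch of this loop (assumptions, not from the slides: one hidden layer instead of the diagram's two, a sigmoid output, a squared-error loss, and synthetic toy data; every name and number is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2 features, binary label (1 if x1 + x2 > 1)
X = rng.uniform(size=(100, 2))
Y = (X.sum(axis=1) > 1.0).astype(float)

# Initialize trainable parameters randomly
W1, b1 = rng.normal(scale=0.5, size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=4), 0.0
alpha = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    for x, y_true in zip(X, Y):          # pick a training example
        # feed-forward
        h = np.tanh(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # gradient of the loss L = 0.5*(y - y_true)^2 w.r.t. the output,
        # then backpropagate towards each trainable parameter
        d_out = (y - y_true) * y * (1 - y)
        dW2, db2 = d_out * h, d_out
        d_h = d_out * W2 * (1 - h**2)
        dW1, db1 = np.outer(d_h, x), d_h
        # gradient-descent update
        W2 -= alpha * dW2; b2 -= alpha * db2
        W1 -= alpha * dW1; b1 -= alpha * db1

preds = [sigmoid(W2 @ np.tanh(W1 @ x + b1) + b2) > 0.5 for x in X]
print("training accuracy:", np.mean(np.array(preds) == Y))
```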
Gradient Descent (GD)

Take a small step in the direction of the negative gradient!

θ_n^(t+1) = θ_n^(t) − α_t · ∂f(θ^(t))/∂θ_n^(t)

[Figure: gradient-descent steps on an error surface]
https://github.com/joshdk/pygradesc
More Technical Details …
Gradient Descent (GD)

One of the simplest frameworks for optimization problems (multivariate optimization).

It is used to find a configuration of the parameters such that the cost function becomes optimal, in this case reaches a local minimum.

GD is an iterative method: it starts from a random point and slowly follows the negative direction of the gradient, so that it eventually moves to a critical point, which is hopefully a local minimum.

It is not guaranteed to reach the global minimum!
Gradient Descent (GD)

Problem: find the value of x such that the function f(x) = 2x⁴ + x³ − 3x² reaches a local minimum.

Say we start from x = 2.0:

[Figure: the iterates move down the curve towards the local minimum]

The GD algorithm converges at x = 0.699, which is a local minimum.
Gradient Descent (GD)

Algorithm:

For t = 1, 2, ..., Nmax:
    x_{t+1} = x_t − α_t · f'(x_t)
    If |f'(x_{t+1})| < ε then return "converged on critical point"
    If |x_t − x_{t+1}| < ε then return "converged on x value"
    If f(x_{t+1}) > f(x_t) then return "diverging"

α_t : the learning rate, or step size, at iteration t
ε : a very small number
Nmax : the maximum number of iterations (called an epoch if the iteration always runs to the end)

The algorithm starts by guessing a value for x₁!
Tip: choose an α_t that is neither too small nor too large. (A Python sketch of this algorithm follows below.)
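A direct Python sketch of this algorithm applied to the example above, f(x) = 2x⁴ + x³ − 3x² (the stopping thresholds and learning rate here are illustrative):

```python
def f(x):
    return 2 * x**4 + x**3 - 3 * x**2

def f_prime(x):
    return 8 * x**3 + 3 * x**2 - 6 * x

def gradient_descent(x, alpha=0.01, eps=1e-6, n_max=10_000):
    for _ in range(n_max):
        x_new = x - alpha * f_prime(x)
        if abs(f_prime(x_new)) < eps or abs(x_new - x) < eps:
            return x_new                      # converged
        if f(x_new) > f(x):
            raise RuntimeError("diverging")   # step size too large
        x = x_new
    return x

print(gradient_descent(2.0))  # ~0.699, the local minimum from the slide
```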
Gradient Descent (GD)

What if there are many parameters?

Find θ = θ1, θ2, …, θn such that f(θ1, θ2, …, θn) reaches a local minimum!

Start by guessing initial values for θ = θ1, θ2, …, θn, then:

while not converged:
    θ1^(t+1) = θ1^(t) − α_t · ∂f(θ^(t))/∂θ1^(t)
    θ2^(t+1) = θ2^(t) − α_t · ∂f(θ^(t))/∂θ2^(t)
    ...
    θn^(t+1) = θn^(t) − α_t · ∂f(θ^(t))/∂θn^(t)
Gradient Descent (GD)

Take a small step in the direction of the negative gradient!

θ_n^(t+1) = θ_n^(t) − α_t · ∂f(θ^(t))/∂θ_n^(t)

[Figure: gradient-descent steps on an error surface]
https://github.com/joshdk/pygradesc
Logistic Regression

Unlike Linear Regression, Logistic Regression is used for binary classification problems, so the output it produces is discrete! (1 or 0), (yes or no).

Say we are given 2 things:
1. Unseen data x whose label y we want to predict.
2. A logistic regression model with parameters θ0, θ1, …, θn that have already been determined.

P(y = 1 | x; θ) = σ(θ0 + θ1·x1 + ... + θn·xn) = σ(θ0 + Σ_{i=1}^{n} θi·xi)
P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)

σ(z) = 1 / (1 + e^(−z))
Logistic Regression

This can be drawn as a "single neuron":

P(y = 1 | x; θ) = σ(θ0 + θ1·x1 + ... + θn·xn) = σ(θ0 + Σ_{i=1}^{n} θi·xi)
P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)
σ(z) = 1 / (1 + e^(−z))

[Diagram: inputs x1, x2, x3 with weights θ1, θ2, θ3, plus a bias input +1 with weight θ0, feeding a single output unit]

with the sigmoid as the activation function
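A quick NumPy sketch of the prediction step of this single neuron (function names and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta0, theta):
    """P(y = 1 | x; theta) for logistic regression."""
    return sigmoid(theta0 + np.dot(theta, x))

x = np.array([0.5, 1.2, -0.3])
theta0, theta = -0.1, np.array([0.8, -0.4, 0.2])
p = predict_proba(x, theta0, theta)
print(p, 1 - p)   # P(y=1|x), P(y=0|x)
```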
Logistic Regression

The previous slide assumed that the parameters θ had already been determined.

What if they have not been determined yet?
We can estimate the parameters θ from the given training data {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))} (learning).

[Diagram: the same single neuron with inputs x1, x2, x3, +1 and weights θ1, θ2, θ3, θ0]
Logistic Regression
Learning

Let

h_θ(x) = σ(θ0 + θ1·x1 + ... + θn·xn) = σ(θ0 + Σ_{i=1}^{n} θi·xi)

The probability of each class (the equations on the previous slides) can be written compactly as:

P(y | x; θ) = (h_θ(x))^y · (1 − h_θ(x))^(1−y),   y ∈ {0, 1}

Given m training examples, the likelihood is:

L(θ) = ∏_{i=1}^{m} P(y(i) | x(i); θ)
     = ∏_{i=1}^{m} (h_θ(x(i)))^(y(i)) · (1 − h_θ(x(i)))^(1−y(i))
Logistic Regression
Learning

With MLE, we look for the parameter configuration that gives the maximum likelihood (= the maximum log-likelihood):

l(θ) = log L(θ)
     = Σ_{i=1}^{m} [ y(i) log h_θ(x(i)) + (1 − y(i)) log(1 − h_θ(x(i))) ]

Or, we look for the parameters that minimize the negative log-likelihood (the cross-entropy loss). This is our cost function!

J(θ) = − Σ_{i=1}^{m} [ y(i) log h_θ(x(i)) + (1 − y(i)) log(1 − h_θ(x(i))) ]
Logistic Regression
Learning

To be able to use Gradient Descent to find the parameters that minimize the cost function, we need to derive:

∂J(θ)/∂θ_j

It can be shown that for a single training example (x, y):

∂J(θ)/∂θ_j = (h_θ(x) − y)·x_j

So, the update of each parameter at every iteration, over all examples, is:

θ_j := θ_j − α Σ_{i=1}^{m} (h_θ(x(i)) − y(i))·x_j(i)
Logistic Regression
Learning

Batch Gradient Descent for learning the model:

initialize θ1, θ2, …, θn
while not converged:
    θ1 := θ1 − α Σ_{i=1}^{m} (h_θ(x(i)) − y(i))·x1(i)
    θ2 := θ2 − α Σ_{i=1}^{m} (h_θ(x(i)) − y(i))·x2(i)
    ...
    θn := θn − α Σ_{i=1}^{m} (h_θ(x(i)) − y(i))·xn(i)

(A NumPy sketch follows below.)
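A compact NumPy sketch of batch gradient descent for logistic regression (an illustration, not the slides' own code: the bias θ0 is folded in via a constant feature, the gradient is averaged over m, and the data is synthetic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, alpha=0.1, n_iter=1000):
    """Batch gradient descent on the cross-entropy loss.
    X: (m, n) feature matrix, y: (m,) labels in {0, 1}."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend x0 = 1 for the bias theta0
    theta = np.zeros(n + 1)
    for _ in range(n_iter):
        h = sigmoid(Xb @ theta)            # h_theta(x) for all examples
        grad = Xb.T @ (h - y)              # sum_i (h - y) * x_j
        theta -= alpha * grad / m          # averaging over m just rescales alpha
    return theta

# Tiny synthetic example: y = 1 when x1 + x2 > 1
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)
print(train_logreg(X, y))
```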
Logistic Regression
Learning

Stochastic Gradient Descent: we can make progress and update the parameters as soon as we see a single training example.

initialize θ1, θ2, …, θn
while not converged:
    for i := 1 to m do:
        θ1 := θ1 + α (y(i) − h_θ(x(i)))·x1(i)
        θ2 := θ2 + α (y(i) − h_θ(x(i)))·x2(i)
        ...
        θn := θn + α (y(i) − h_θ(x(i)))·xn(i)
Logistic Regression
Learning

Mini-Batch Gradient Descent for learning the model: instead of using a single sample for each update (as in online learning), the gradient is computed as the average/sum over a mini-batch of samples (e.g., 32 or 64 samples).

Batch Gradient Descent: the gradient is computed over all samples!
• A single step is computationally too expensive

Online learning: the gradient is computed for a single sample!
• Sometimes noisy
• The update steps are very small

(A mini-batch sketch follows below.)
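A sketch of the mini-batch variant of the same update loop (assumed details, for illustration only: shuffling each epoch and averaging the gradient over the mini-batch; batch size 32 as in the slide's example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_minibatch(X, y, alpha=0.1, batch_size=32, n_epochs=50, seed=0):
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])
    theta = np.zeros(n + 1)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        order = rng.permutation(m)                       # shuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            h = sigmoid(Xb[idx] @ theta)
            grad = Xb[idx].T @ (h - y[idx]) / len(idx)   # mini-batch average
            theta -= alpha * grad
    return theta
```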
Multilayer Neural Network (Multilayer Perceptron)

Logistic Regression can actually be seen as a neural network with several units in the input layer, one unit in the output layer, and no hidden layer.

[Diagram: inputs x1, x2, x3 and bias +1 with weights θ1, θ2, θ3, θ0 feeding a single output unit]

with the sigmoid as the activation function
Multilayer Neural Network (Multilayer Perceptron)

For example, a 3-layer NN with 3 input units, 2 hidden units, and 2 output units.

[Diagram: inputs x1, x2, x3 plus a bias unit +1 are connected to the 2 hidden units through the weights W11(1), W12(1), W13(1), W21(1), W22(1), W23(1) and biases b1(1), b2(1); the hidden units plus a bias unit +1 are connected to the 2 output units through the weights W11(2), W12(2), W21(2), W22(2) and biases b1(2), b2(2).]

W(1), W(2), b(1), b(2) are the parameters!
Multilayer Neural Network (Multilayer Perceptron)

Assuming the parameters are already known, how do we predict the label of an input (classification)?

In the previous example there are 2 units in the output layer. This setup is usually used for binary classification: the first unit produces the probability of the first class, and the second unit produces the probability of the second class.

We need to run the feed-forward process to compute the values produced at the output layer.
Multilayer Neural Network (Multilayer Perceptron)

Suppose that, for the activation function, we use the hyperbolic tangent:
f(x) = tanh(x)

To compute the output of the hidden layer:

z1(2) = W11(1)·x1 + W12(1)·x2 + W13(1)·x3 + b1(1)
z2(2) = W21(1)·x1 + W22(1)·x2 + W23(1)·x3 + b2(1)

a1(2) = f(z1(2))
a2(2) = f(z2(2))

This is just a matrix multiplication!

z(2) = W(1)·x + b(1) = [ W11(1) W12(1) W13(1) ; W21(1) W22(1) W23(1) ] · [ x1 ; x2 ; x3 ] + [ b1(1) ; b2(1) ]
Multilayer Neural Network (Multilayer Perceptron)

So, the complete feed-forward process, up to the outputs of the two output units, is:

z(2) = W(1)·x + b(1)
a(2) = f(z(2))
z(3) = W(2)·a(2) + b(2)
h_{W,b}(x) = a(3) = softmax(z(3))
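A NumPy sketch of this feed-forward pass for the 3–2–2 network above (the random weights are placeholders):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

def feed_forward(x, W1, b1, W2, b2):
    z2 = W1 @ x + b1                  # hidden pre-activation
    a2 = np.tanh(z2)                  # hidden activation, f = tanh
    z3 = W2 @ a2 + b2                 # output pre-activation
    a3 = softmax(z3)                  # class probabilities h_{W,b}(x)
    return a3

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)   # 3 inputs -> 2 hidden units
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)   # 2 hidden -> 2 output units
x = np.array([1.0, 0.5, -1.0])
print(feed_forward(x, W1, b1, W2, b2))          # two probabilities summing to 1
```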
Multilayer Neural Network (Multilayer Perceptron)
Learning

One possible cost function is the cross-entropy loss
(m is the number of training examples):

J(W, b) = −(1/m) Σ_{i=1}^{m} Σ_j y_{i,j} log(p_{i,j}) + (λ/2) Σ_{l=1}^{n_l−1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (W_{ji}^(l))²

(the last term is the regularization term)

This means that the cost function for a single sample (x, y) is:

J(W, b; x, y) = −Σ_j y_j log(h_{W,b}(x)_j) = −Σ_j y_j log(p_j)
Multilayer Neural Network (Multilayer Perceptron)
Learning

This time, however, our cost function is the squared error
(m is the number of training examples):

J(W, b) = (1/m) Σ_{i=1}^{m} (1/2)·||h_{W,b}(x(i)) − y(i)||² + (λ/2) Σ_{l=1}^{n_l−1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (W_{ji}^(l))²

(the last term is the regularization term)

This means that the cost function for a single sample (x, y) is:

J(W, b; x, y) = (1/2)·||h_{W,b}(x) − y||² = (1/2) Σ_j (h_{W,b}(x)_j − y_j)²
Multilayer Neural Network (Multilayer Perceptron)
Learning

Batch Gradient Descent:

initialize W, b
while not converged:
    W_{ij}^(l) := W_{ij}^(l) − α · ∂J(W, b)/∂W_{ij}^(l)
    b_i^(l) := b_i^(l) − α · ∂J(W, b)/∂b_i^(l)

How do we compute the gradients??
Multilayer Neural Network (Multilayer Perceptron)
Learning

Let (x, y) be a single training sample, and let ∂J(W, b; x, y) denote the partial derivative with respect to that single sample.

The per-sample terms ∂J(W, b; x, y) determine the overall partial derivative ∂J(W, b):

∂J(W, b)/∂W_{ij}^(l) = [ (1/m) Σ_{i=1}^{m} ∂J(W, b; x(i), y(i))/∂W_{ij}^(l) ] + λ·W_{ij}^(l)

∂J(W, b)/∂b_i^(l) = (1/m) Σ_{i=1}^{m} ∂J(W, b; x(i), y(i))/∂b_i^(l)

These are computed with the backpropagation technique.


Multilayer Neural Network (Multilayer Perceptron)
Learning
Back-Propagation

1. Run the feed-forward process.
2. For every output unit i in layer n_l (the output layer):

   δ_i^(n_l) = ∂J(W, b; x, y)/∂z_i^(n_l) = (a_i^(n_l) − y_i) · f'(z_i^(n_l))

3. For l = n_l − 1, n_l − 2, ..., 2:
   For every node i in layer l:

   δ_i^(l) = ( Σ_{j=1}^{s_{l+1}} W_{ji}^(l)·δ_j^(l+1) ) · f'(z_i^(l))

4. Finally:

   ∂J(W, b; x, y)/∂W_{ij}^(l) = a_j^(l)·δ_i^(l+1)
   ∂J(W, b; x, y)/∂b_i^(l) = δ_i^(l+1)

(A NumPy sketch of these steps follows below.)
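A NumPy sketch of these four steps for one sample, assuming tanh hidden units, sigmoid output units, and the squared-error loss from the earlier slide (so that the factor f'(z) of step 2 applies at the output); all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, W1, b1, W2, b2):
    """Gradients of J(W,b;x,y) = 0.5*||a3 - y||^2 for one sample,
    with tanh hidden units and sigmoid output units."""
    # 1. feed-forward
    z2 = W1 @ x + b1;  a2 = np.tanh(z2)
    z3 = W2 @ a2 + b2; a3 = sigmoid(z3)
    # 2. delta at the output layer: (a - y) * f'(z)
    delta3 = (a3 - y) * a3 * (1 - a3)          # sigmoid'(z) = a*(1-a)
    # 3. delta at the hidden layer: (W^T delta) * f'(z)
    delta2 = (W2.T @ delta3) * (1 - a2**2)     # tanh'(z) = 1 - tanh(z)^2
    # 4. gradients w.r.t. weights and biases
    dW2, db2 = np.outer(delta3, a2), delta3
    dW1, db1 = np.outer(delta2, x),  delta2
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)
grads = backprop_single(np.array([1.0, 0.5, -1.0]), np.array([1.0, 0.0]),
                        W1, b1, W2, b2)
```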
Multilayer Neural Network (Multilayer Perceptron)
Learning
Back-Propagation

Example: computing the gradient at the output ...

[Diagram: hidden activations a1(2), a2(2) and biases b1(2), b2(2) are connected through the weights W11(2), W12(2), W21(2), W22(2) to the output pre-activations z1(3), z2(3) and activations a1(3), a2(3).]

J(W, b; x, y) = (1/2)·(a1(3) − y1)² + (1/2)·(a2(3) − y2)²

∂J/∂a1(3) = (a1(3) − y1)

a1(3) = f(z1(3))  ⇒  ∂a1(3)/∂z1(3) = f'(z1(3))

z1(3) = W11(2)·a1(2) + W12(2)·a2(2) + b1(2)  ⇒  ∂z1(3)/∂W12(2) = a2(2)

∂J(W, b; x, y)/∂W12(2) = (∂J/∂a1(3)) · (∂a1(3)/∂z1(3)) · (∂z1(3)/∂W12(2))
                       = (a1(3) − y1) · f'(z1(3)) · a2(2)
Sensitivity – Jacobian Matrix
The Jacobian J is the matrix of partial derivatives of the network output vector y with respect to the input vector x:

J = [ J_{k,i} ],   J_{k,i} = ∂y_k/∂x_i

These derivatives measure the relative sensitivity of the outputs to small changes in the inputs, and can therefore be used, for example, to detect irrelevant inputs.

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks


More Complex Neural Networks
(Neural Network Architectures)

Recurrent Neural Networks (Vanilla RNN, LSTM, GRU)
Attention Mechanisms
Recursive Neural Networks
Recurrent Neural Networks
[Diagram: inputs X1 … X5 feed hidden states h1 … h5 (each connected to the next), which produce outputs O1 … O5.]

One of the famous Deep Learning architectures in the NLP community
Recurrent Neural Networks (RNNs)

We usually use RNNs to:
• Process sequences
• Generate sequences
• …
• In short … whenever there are sequences

Sequences are typically:
• Sequences of words
• Sequences of sentences
• Signals
• Speech
• Video (a sequence of images)
• …
Recurrent Neural Networks (RNNs)

[Diagram of input/output shapes: not an RNN (a vanilla feed-forward NN); sequence input (e.g. sentence classification); sequence output (e.g. image captioning); sequence input/output (e.g. machine translation).]

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Recurrent Neural Networks (RNNs)

RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector.

This can, in programming terms, be interpreted as running a fixed program with certain inputs and some internal variables.

In fact, it is known that RNNs are Turing-complete in the sense that they can simulate arbitrary programs (with proper weights).

http://binds.cs.umass.edu/papers/1995_Siegelmann_Science.pdf
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks (RNNs)

Suppose there are I input units, K output units, and H hidden units (the state).

The RNN computation for one sample:

h_t ∈ R^(H×1), x_t ∈ R^(I×1), y_t ∈ R^(K×1)
W^(xh) ∈ R^(H×I), W^(hh) ∈ R^(H×H), W^(hy) ∈ R^(K×H)

h_t = W^(xh)·x_t + W^(hh)·s_{t−1}
s_t = tanh(h_t)
y_t = W^(hy)·s_t
h_0 = 0

[Diagram: inputs X1, X2 feed states s1, s2 through W^(xh); the states are connected through W^(hh), and produce outputs Y1, Y2 through W^(hy).]
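A NumPy sketch of this forward computation (the dimensions I, H, K and the weight names follow the slide; the initialization and inputs are arbitrary):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Vanilla RNN forward pass over a sequence xs of input vectors."""
    H = W_hh.shape[0]
    s = np.zeros(H)                    # s_0 = tanh(h_0) = 0
    ys, states = [], []
    for x in xs:
        h = W_xh @ x + W_hh @ s        # h_t = W(xh).x_t + W(hh).s_{t-1}
        s = np.tanh(h)                 # s_t = tanh(h_t)
        ys.append(W_hy @ s)            # y_t = W(hy).s_t
        states.append(s)
    return ys, states

I, H, K = 4, 3, 2                      # input, hidden, output sizes
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, I))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(K, H))
xs = [rng.normal(size=I) for _ in range(5)]
ys, _ = rnn_forward(xs, W_xh, W_hh, W_hy)
```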
Recurrent Neural Networks (RNNs)

Back Propagation Through Time (BPTT)

The loss function depends on the activation of the hidden layer not only through its influence on the output layer:

δ_{i,t}^(h) = ( Σ_{j=1}^{K} W_{i,j}^(hy)·δ_{j,t}^(y) + Σ_{n=1}^{H} W_{i,n}^(hh)·δ_{n,t+1}^(h) ) · f'(h_{i,t})

At the output:        δ_{i,t}^(y) = ∂L_t/∂y_{i,t}
At every hidden step: δ_{i,t}^(h) = ∂L_t/∂h_{i,t}, with the boundary condition δ_{i,T+1}^(h) = 0 (there is nothing to the right of the last step)
Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks


Recurrent Neural Networks (RNNs)

Back Propagation Through Time (BPTT)

The same weights are reused at every timestep, so we sum over the whole sequence to get the derivatives with respect to the network weights:

∂L/∂W_{i,j}^(hh) = Σ_{t=1}^{T} δ_{j,t}^(h)·s_{i,t−1}

∂L/∂W_{i,j}^(hy) = Σ_{t=1}^{T} δ_{j,t}^(y)·s_{i,t}

∂L/∂W_{i,j}^(xh) = Σ_{t=1}^{T} δ_{j,t}^(h)·x_{i,t}
Recurrent Neural Networks (RNNs)

Back Propagation Through Time (BPTT)


Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013


Recurrent Neural Networks (RNNs)

Back Propagation Through Time (BPTT)


For example, for the state-to-state parameters:

∂L_t/∂W^(hh) = Σ_{k=1}^{t} (∂L_t/∂h_t) · (∂h_t/∂h_k) · (∂⁺h_k/∂W^(hh)),   1 ≤ k ≤ t

These terms are called temporal contributions: how W^(hh) at step k affects the cost at the later steps (t > k). The sum is cut off k steps back. Here ∂⁺ means the "immediate derivative", i.e. h_{k−1} is treated as a constant with respect to W^(hh).

∂h_t/∂h_k = (∂h_t/∂h_{t−1}) · (∂h_{t−1}/∂h_{t−2}) · … · (∂h_{k+1}/∂h_k)

Since h_k = W^(xh)·x_k + W^(hh)·s_{k−1}, the immediate derivative ∂⁺h_k/∂W^(hh) involves only s_{k−1}.
Recurrent Neural Networks (RNNs)

Vanishing & Exploding Gradient Problems


Bengio et al. (1994) said that "the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are caused by the explosion of the long term components, which can grow exponentially more than short term ones."

And "the vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlation between temporally distant events."

How can this happen? Look at one of the temporal components from the previous slide:

∂h_t/∂h_k = (∂h_t/∂h_{t−1}) · (∂h_{t−1}/∂h_{t−2}) · … · (∂h_{k+1}/∂h_k)

"In the same way a product of t − k real numbers can shrink to zero or explode to infinity, so does this product of matrices." (Pascanu et al.)

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.
Recurrent Neural Networks (RNNs)

Vanishing & Exploding Gradient Problems

The sequential Jacobian is commonly used to analyse how RNNs make use of context.

[Figure]

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks


Recurrent Neural Networks (RNNs)

Solutions to the Vanishing Gradient Problem

1) Use non-gradient-based training algorithms (Simulated Annealing, Discrete Error Propagation, etc.) (Bengio et al., 1994).
2) Define a new architecture inside the RNN cell, such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997).
3) For other methods, please refer to (Pascanu et al., 2013).

Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997
Y. Bengio, P. Simard, and P. Frasconi. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks, 1994
Recurrent Neural Networks (RNNs)

Variant: Bi-Directional RNNs


Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks


Recurrent Neural Networks (RNNs)
Sequential Jacobian (Sensitivity) for Bi-Directional RNNs

[Figure]

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks


Long Short-Term Memory (LSTM)

1. The LSTM architecture consists of a set of recurrently connected subnets, known as memory blocks.
2. These blocks can be thought of as a differentiable version of the memory chips in a digital computer.
3. Each block contains:
   1. Self-connected memory cells
   2. Three multiplicative units (gates):
      1. Input gate (the analogue of a write operation)
      2. Output gate (the analogue of a read operation)
      3. Forget gate (the analogue of a reset operation)

S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997
Long Short-Term Memory (LSTM)

[Figure]

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997
Long Short-Term Memory (LSTM)

The multiplicative gates allow LSTM memory cells to store and access information over long periods of time, thereby mitigating the vanishing gradient problem.

For example, as long as the input gate remains closed (i.e. has an activation near 0), the activation of the cell will not be overwritten by the new inputs arriving in the network, and can therefore be made available to the net much later in the sequence, by opening the output gate.

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997
Long Short-Term Memory (LSTM)

Another visualization of a single LSTM cell

[Figure]

S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997
Long Short-Term Memory (LSTM)

The computation inside the LSTM

[Figure with the LSTM equations]

S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997
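The slide's own equations are in the figure; as a hedged sketch, the standard LSTM cell update in the common formulation (input, forget, and output gates plus a candidate input) looks like this in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell (common formulation, not the slide's
    exact notation). W, U, b hold the parameters of the input gate (i),
    forget gate (f), output gate (o) and candidate input (g)."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # write gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # reset/forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # read gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate cell input
    c = f * c_prev + i * g                               # new cell state
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

I, H = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(H, I)) for k in "ifog"}
U = {k: rng.normal(scale=0.1, size=(H, H)) for k in "ifog"}
b = {k: np.zeros(H) for k in "ifog"}
h, c = lstm_step(rng.normal(size=I), np.zeros(H), np.zeros(H), W, U, b)
```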
Long Short-Term Memory (LSTM)
Preservation of gradient information in the LSTM

[Figure]

Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997
Example: RNNs for POS Tagger
(Zennaki, 2015)

[Diagram: the input words "I went to west java" feed hidden states h1 h2 h3 h4 h5, which produce the tags PRP VBD TO JJ NN.]
LSTM + CRF for Semantic Role Labeling
(Zhou and Xu, ACL 2015)

Attention Mechanism

"A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus."

Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
Attention Mechanism

A neural translation model WITHOUT an attention mechanism

[Figure]

Sutskever, Ilya et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014.
https://blog.heuritech.com/2016/01/20/attention-mechanism/

Attention Mechanism

A neural translation model WITHOUT an attention mechanism

[Figure]

Sutskever, Ilya et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014.
Attention Mechanism

Why do we need an attention mechanism?

• Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.
• … it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector.

(A small sketch of such a soft-search step follows below.)

Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
https://blog.heuritech.com/2016/01/20/attention-mechanism/
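A minimal sketch of this soft-search idea (an additive, Bahdanau-style scoring function, heavily simplified and not the paper's full model): given the encoder states and the decoder's current state, compute attention weights over the source positions and a context vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(dec_state, enc_states, W_dec, W_enc, v):
    """Additive (Bahdanau-style) attention, simplified.
    enc_states: (T, H) encoder states; dec_state: (H,) decoder state."""
    # score each source position: v^T tanh(W_dec s + W_enc h_t)
    scores = np.array([v @ np.tanh(W_dec @ dec_state + W_enc @ h)
                       for h in enc_states])
    weights = softmax(scores)           # attention distribution over positions
    context = weights @ enc_states      # weighted sum of encoder states
    return context, weights

T, H, A = 6, 5, 4                       # source length, state size, attention size
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(T, H))
dec_state = rng.normal(size=H)
W_dec, W_enc, v = rng.normal(size=(A, H)), rng.normal(size=(A, H)), rng.normal(size=A)
context, weights = additive_attention(dec_state, enc_states, W_dec, W_enc, v)
print(weights.round(2), weights.sum())  # the weights sum to 1
```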

Attention Mechanism

A neural translation model – WITH an attention mechanism

[Figure]

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, arXiv:1409.0473, 2016
Attention Mechanism

A neural translation model – WITH an attention mechanism

[Figure: each cell represents an attention weight between a source word and a target word in the translation.]

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, arXiv:1409.0473, 2016
Attention Mechanism

Simple attention networks for sentence classification

[Figure]

Colin Raffel, Daniel P. W. Ellis, Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems, Workshop track - ICLR 2016
Attention Mechanism

Hierarchical attention networks for sentence classification

[Figure]

Yang, Zichao, et al., Hierarchical Attention Networks for Document Classification, NAACL 2016
Attention Mechanism

Hierarchical attention networks for sentence classification
Task: predicting a document's rating

[Figure]

Yang, Zichao, et al., Hierarchical Attention Networks for Document Classification, NAACL 2016
Attention Mechanism


https://blog.heuritech.com/2016/01/20/attention-mechanism/
Attention Mechanism

[Figure]

https://blog.heuritech.com/2016/01/20/attention-mechanism/
Xu, Kelvin, et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (2016).
Attention Mechanism

[Figure]

https://blog.heuritech.com/2016/01/20/attention-mechanism/
Xu, Kelvin, et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (2016).
Attention Mechanism

Attention mechanisms for Textual Entailment

Given a premise–hypothesis pair, determine whether the two are contradictory, unrelated, or whether one logically entails the other.

For example:
• Premise: "A wedding party taking pictures"
• Hypothesis: "Someone got married"

The attention model is used to relate the words in the premise to the words in the hypothesis.

Tim Rocktaschel et al., Reasoning about Entailment with Neural Attention, ICLR 2016
Attention Mechanism

Attention mechanisms for Textual Entailment

Given a premise–hypothesis pair, determine whether the two are contradictory, unrelated, or whether one logically entails the other.

[Figure]

Tim Rocktaschel et al., Reasoning about Entailment with Neural Attention, ICLR 2016
Recursive Neural Networks


R. Socher, C. Lin, A. Y. Ng, and C.D. Manning. 2011a. Parsing Natural Scenes and Natural Language with
Recursive Neural Networks. In ICML
Recursive Neural Networks

p1 = g(W·[b; c] + bias)

[Figure]

Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP
2013
Recursive Neural Networks


Socher et al., Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, EMNLP
2013
Convolutional Neural Networks (CNNs) for Sentence Classification
(Kim, EMNLP 2014)

Recursive Neural Network for SMT Decoding.
(Liu et al., EMNLP 2014)

