Dr. Gilbert C.S. Lui (HKU, SAAS) STAT8301 (2016-2017) Summer 2017
Word Embeddings
Return to the SVD example of the words ship, boat, ocean, voyage, trip in a collection of 6 documents.
In the SVD term vectors of these 5 words, for example, the term ocean is represented by the vector $(1.02, 0.81)^T$.

Word Embeddings
What are word embeddings?
A collection of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
A mathematical mapping from a space with one dimension per word to a continuous vector space with much lower dimension.
What is word2vec?
A group of neural network models which are used to produce word embeddings.
Developed by a team of researchers led by Tomas Mikolov at Google in 2013.
Key features of these neural network models:
Input word vectors are projected to the embedding space of lower dimension before getting into the hidden layer.
The dimension of the input word vectors is reduced in the hidden layer, and the output word vector has the same dimension as the input word vector (cf. autoencoder).
The softmax function is used to compute the output vector.
Assume that the words of a sentence are represented by one-hot vectors.
Suppose that the size of the context of a word $w_c$ is $m$ (= 10, as suggested by the original authors). Then, the context of $w_c$ is given by
\[ \{w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}\}. \]
The sequence of words $\{w_{c-m}, \ldots, w_{c-1}, w_c, w_{c+1}, \ldots, w_{c+m}\}$ will have the following associated one-hot word vectors
\[ (x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c)}, x^{(c+1)}, \ldots, x^{(c+m)}). \]

Define two matrices $\mathcal{V} \in \mathbb{R}^{n \times |V|}$ and $\mathcal{U} \in \mathbb{R}^{|V| \times n}$, where $n$ is the dimension of the embedding space (or the number of hidden units in the hidden layer) and $|V|$ is the vocabulary size.
$\mathcal{V}$ is the input word matrix such that the $i$th column of $\mathcal{V}$ is an $n \times 1$ embedded vector for word $w_i$ when it is an input to the model.
$\mathcal{U}$ is the output word matrix such that the $j$th row of $\mathcal{U}$ is a $1 \times n$ embedded vector for word $w_j$ when it is an output of the model.
The $i$th column of $\mathcal{V}$ and the $j$th row of $\mathcal{U}$ are denoted as $v_i$ and $u_j^T$ respectively.
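A minimal NumPy sketch of these definitions, assuming a hypothetical vocabulary size of 5 and embedding dimension n = 2 with randomly drawn entries (the sizes, the seed and the variable names are illustrative and not taken from the slides). It suggests that multiplying the input word matrix by a one-hot vector simply selects the corresponding column.

```python
import numpy as np

vocab_size, n = 5, 2                      # hypothetical sizes for illustration
rng = np.random.default_rng(0)            # arbitrary seed

V = rng.uniform(-0.5, 0.5, size=(n, vocab_size))   # input word matrix, n x |V|
U = rng.uniform(-0.5, 0.5, size=(vocab_size, n))   # output word matrix, |V| x n

i = 2                                     # 0-based index of some word w_i
x = np.zeros(vocab_size)
x[i] = 1.0                                # one-hot vector for w_i

v_i = V @ x                               # input embedding: i-th column of V
u_i = U[i]                                # output embedding: i-th row of U

print(np.allclose(v_i, V[:, i]))          # the product equals the column lookup
```

In practice the one-hot product is usually replaced by the direct column lookup `V[:, i]`, which gives the same vector without building `x`.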
Continuous Bag of Words (CBOW) Model
The model works in the following steps:
Generate the one-hot word vectors for the context words $w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}$.
Projection step: obtain the embedded word vectors for the context
\[ v_{c-m} = \mathcal{V} x^{(c-m)}, \ldots, v_{c-1} = \mathcal{V} x^{(c-1)}, v_{c+1} = \mathcal{V} x^{(c+1)}, \ldots, v_{c+m} = \mathcal{V} x^{(c+m)}. \]
Combination step: average (or sum) these vectors to obtain
\[ h = \frac{v_{c-m} + \cdots + v_{c-1} + v_{c+1} + \cdots + v_{c+m}}{2m}. \]
Compute the score vector ($|V| \times 1$) in the output layer
\[ z = \mathcal{U} h. \]

Remarks
Instead of computing the embedded word vector for each context word separately, all of them can be computed by a single matrix multiplication:
\[ \tilde{\mathcal{V}} = \mathcal{V} X, \]
where the columns of $X$ are the input one-hot word vectors of the context words.
Similarly, the average of the embedded word vectors can be obtained by
\[ h = \frac{1}{2m} \tilde{\mathcal{V}} 1_{2m}, \]
where $1_{2m}$ denotes a $(2m \times 1)$ column vector whose elements are all 1.
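The projection, averaging and scoring steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `cbow_hidden_and_scores`, the 0-based word indices and the randomly drawn matrices are hypothetical and not part of the slides.

```python
import numpy as np

def cbow_hidden_and_scores(V, U, context_ids):
    """Projection, averaging and scoring steps of CBOW.

    V: (n, |V|) input word matrix, U: (|V|, n) output word matrix,
    context_ids: 0-based vocabulary indices of the 2m context words.
    """
    vocab_size = V.shape[1]
    k = len(context_ids)
    # Columns of X are the one-hot vectors of the context words.
    X = np.zeros((vocab_size, k))
    X[context_ids, np.arange(k)] = 1.0
    V_ctx = V @ X                          # embedded context vectors, n x 2m
    h = V_ctx @ np.ones(k) / k             # average via multiplication by a ones vector
    z = U @ h                              # score vector of length |V|
    return h, z

rng = np.random.default_rng(1)             # arbitrary seed
V = rng.uniform(-0.5, 0.5, size=(2, 5))    # hypothetical input word matrix
U = rng.uniform(-0.5, 0.5, size=(5, 2))    # hypothetical output word matrix
h, z = cbow_hidden_and_scores(V, U, context_ids=[4, 0, 2, 4])
print(h, z)
```

A word that occurs twice in the context (index 4 here) simply contributes its embedded vector twice to the average.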
Consider the example of "The cat jumped over the puddle."
After tokenization of the sentence, we have
{The, cat, jumped, over, the, puddle}.
Indeed, this sentence contains the following words
{cat, jumped, over, puddle, the},
which have been arranged in alphabetical order for the construction of the vocabulary $V$, $|V| = 5$. Also, all the words have been converted to lower case for easier handling.
Suppose $c = 3$ and $m = 2$. Then, the center word $w_c$ is jumped and the context $\{w_{c-2}, w_{c-1}, w_{c+1}, w_{c+2}\}$ is
{the, cat, over, the}.

The word vector of the center word $w_c$ is
\[ x^{(c)} = (0, 1, 0, 0, 0)^T. \]
The input word vectors of the context words $w_{c-2}, w_{c-1}, w_{c+1}, w_{c+2}$ are
\[ x^{(c-2)} = (0, 0, 0, 0, 1)^T, \quad x^{(c-1)} = (1, 0, 0, 0, 0)^T, \quad x^{(c+1)} = (0, 0, 1, 0, 0)^T \quad \text{and} \quad x^{(c+2)} = (0, 0, 0, 0, 1)^T. \]
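The construction of the vocabulary and the one-hot vectors for this sentence can be sketched as below. The tokenization (lower-casing, dropping the final period and splitting on spaces) and the helper name `one_hot` are simplifying assumptions of this sketch, and positions are 0-based, so the third word has c = 2.

```python
import numpy as np

sentence = "The cat jumped over the puddle."
tokens = sentence.lower().rstrip(".").split()       # simple whitespace tokenization
vocab = sorted(set(tokens))                         # alphabetical vocabulary
index = {word: k for k, word in enumerate(vocab)}   # 0-based positions

def one_hot(word):
    """Return the one-hot vector of `word` over the vocabulary."""
    x = np.zeros(len(vocab), dtype=int)
    x[index[word]] = 1
    return x

c, m = 2, 2                                         # 0-based c = 2 is the third word
center = tokens[c]
context = tokens[c - m:c] + tokens[c + 1:c + m + 1]

print(vocab)
print(center, one_hot(center))
for w in context:
    print(w, one_hot(w))
```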
Example of CBOW
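Provided that the dimension of the word embedding space is $n = 2$, the input word matrix and output word matrix are initialized randomly as usual in neural network training:
\[ \mathcal{V} = \begin{pmatrix} 0.023074 & 0.479901 & 0.432148 & 0.375480 & 0.364732 \\ 0.268008 & 0.424778 & 0.257104 & 0.148817 & 0.033922 \end{pmatrix} \]
and
\[ \mathcal{U}^T = \begin{pmatrix} 0.094491 & 0.490796 & 0.072921 & 0.104514 & 0.226080 \\ 0.443977 & 0.229903 & 0.172246 & 0.463000 & 0.154650 \end{pmatrix}. \]
The input embedded vectors of the context after projection are
\[ v^{(c-2)} = \begin{pmatrix} 0.364732 \\ 0.033922 \end{pmatrix}, \quad v^{(c-1)} = \begin{pmatrix} 0.023074 \\ 0.268008 \end{pmatrix}, \quad v^{(c+1)} = \begin{pmatrix} 0.432148 \\ 0.257104 \end{pmatrix}, \]
and
\[ v^{(c+2)} = \begin{pmatrix} 0.364732 \\ 0.033922 \end{pmatrix}. \]

The embedded vectors are then combined by simple averaging
\[ h = \begin{pmatrix} 0.068561 \\ 0.114317 \end{pmatrix}, \]
which are the values stored in the hidden layer.
The score vector in the output layer is
\[ z = \mathcal{U} h = \begin{pmatrix} 0.057232 \\ 0.059931 \\ 0.024690 \\ 0.045763 \\ 0.033179 \end{pmatrix}. \]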
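As a quick arithmetic check of the scoring step, taking the entries exactly as printed above, the first entry of the score vector is the inner product of the first row of the output word matrix with the hidden-layer vector:

```latex
z_1 = u_1^T h
    = (0.094491)(0.068561) + (0.443977)(0.114317)
    \approx 0.006478 + 0.050754
    = 0.057232 .
```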
For a center word $w_c$ and context size $m$, the embedded vectors $u_c, v_{c-m}, \ldots, v_{c-1}, v_{c+1}, \ldots, v_{c+m}$ can be obtained by the maximization of the log-probability (or minimization of cross-entropy)
\[
\begin{aligned}
J_1(\theta) &= \log P(w_c \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) \\
&= \log P(u_c \mid h) \\
&= \log \frac{\exp(u_c^T h)}{\sum_{j=1}^{|V|} \exp(u_j^T h)} \\
&= u_c^T h - \log \sum_{j=1}^{|V|} \exp(u_j^T h),
\end{aligned}
\]
where $\theta$ denotes the parameters we optimize.

Therefore, for context size $m$, the relevant embedded vectors for all center words and their associated contexts can be obtained by the maximization of the objective function
\[ J_{\text{CBOW}}(\theta) = \frac{1}{N - 2m} \sum_{t=1}^{N-2m} \left[ u_c^T h - \log \sum_{j=1}^{|V|} \exp(u_j^T h) \right], \]
where $\theta$ denotes the parameters we optimize, $N$ is the number of words in the corpus, and $u_c$ and $h$ are taken at the $t$th center word and its context.
Dynamic logistic regression essentially!
Then, the relevant embedded vectors are obtained by the Gradient Descent algorithm.
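The single-sample log-probability $J_1$ can be evaluated as in the following sketch. The function name `cbow_log_prob`, the randomly drawn values and the max-shift used for numerical stability are assumptions of this illustration and do not come from the slides.

```python
import numpy as np

def cbow_log_prob(U, h, c):
    """J_1 = u_c^T h - log sum_j exp(u_j^T h) for the 0-based center-word index c."""
    z = U @ h                                # scores u_j^T h for every word j
    z_max = z.max()                          # shift to avoid overflow in exp
    log_norm = z_max + np.log(np.exp(z - z_max).sum())
    return z[c] - log_norm

rng = np.random.default_rng(2)               # arbitrary seed
U = rng.uniform(-0.5, 0.5, size=(5, 2))      # hypothetical output word matrix
h = rng.uniform(-0.5, 0.5, size=2)           # hypothetical hidden-layer vector
print(cbow_log_prob(U, h, c=1))
```

The returned value is the logarithm of the softmax probability of the center word, so it is always negative, and maximizing it is the same as minimizing the cross-entropy against the one-hot target.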
The setup of the continuous Skip-Gram model is largely the same as that of the CBOW model, except that the input and output are reversed.
Since the objective of this model is to predict context words given a center word, the data samples are constructed in the following sense:
For the center word $w_c$, the context words are
\[ w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}. \]
The (input, output) pairs of words in the neural network are
\[ (w_c, w_{c-m}), \ldots, (w_c, w_{c-1}), (w_c, w_{c+1}), \ldots, (w_c, w_{c+m}). \]
Assume that the (input, output) pairs of words are independent of each other.
Define the corpus $\mathcal{D}$ whose elements are in the form of $(w_i, w_j)$, i.e., the collection of (input, output) pairs from the center words $w_c$ and their associated context words.
The input word vector, which is converted from the center word $w_c$, will be processed in the neural network model in the same way as before.
The outputs of this model are $y^{(c-m)}, \ldots, y^{(c-1)}, y^{(c+1)}, \ldots, y^{(c+m)}$.

Consider the example of "The cat jumped over the puddle." again.
Suppose $c = 3$ and $m = 2$. Then, the center word $w_c$ is jumped and the context $\{w_{c-2}, w_{c-1}, w_{c+1}, w_{c+2}\}$ is
{the, cat, over, the}.
The (input, output) pairs of words are
(jumped, the), (jumped, cat), (jumped, over), (jumped, the).
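The construction of the (input, output) pairs can be sketched as follows. The function name `skipgram_pairs` is hypothetical, and the handling of words near the sentence boundaries (their contexts are truncated rather than skipped) is an assumption of this sketch.

```python
def skipgram_pairs(tokens, m):
    """Collect (input, output) pairs (w_c, w_{c+j}) for 1 <= |j| <= m."""
    pairs = []
    for c, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= c + j < len(tokens):
                continue                     # skip the center itself and out-of-range positions
            pairs.append((center, tokens[c + j]))
    return pairs

tokens = ["the", "cat", "jumped", "over", "the", "puddle"]
D = skipgram_pairs(tokens, m=2)
print([p for p in D if p[0] == "jumped"])
```

For the center word jumped this yields the four pairs of the example, in the order of the context positions.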
Continuous Skip-Gram Model
Generate the one-hot vector for the center word $w_c$, i.e., $x^{(c)}$.
Projection step: obtain the embedded vector for the center word
\[ v_c = \mathcal{V} x^{(c)}. \]
No combination step is involved in the hidden layer:
\[ h = v_c. \]
Compute the score vector using
\[ z = \mathcal{U} h. \]
Compute the predicted probabilities in $y$ using the softmax function
\[ y = \mathrm{softmax}(z), \]
where the $i$th element of $y$ is computed by $\exp(u_i^T v_c) / \sum_{k=1}^{|V|} \exp(u_k^T v_c)$.
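These steps can be sketched in NumPy as below. The function name `skipgram_forward`, the 0-based index and the randomly drawn matrices are assumptions made for illustration only.

```python
import numpy as np

def skipgram_forward(V, U, c):
    """Return (h, z, y) for the center word with 0-based vocabulary index c."""
    x = np.zeros(V.shape[1])
    x[c] = 1.0                               # one-hot vector x^(c)
    h = V @ x                                # h = v_c, no averaging needed
    z = U @ h                                # score vector
    e = np.exp(z - z.max())                  # shifted for numerical stability
    y = e / e.sum()                          # softmax probabilities
    return h, z, y

rng = np.random.default_rng(3)               # arbitrary seed
V = rng.uniform(-0.5, 0.5, size=(2, 5))      # hypothetical input word matrix
U = rng.uniform(-0.5, 0.5, size=(5, 2))      # hypothetical output word matrix
h, z, y = skipgram_forward(V, U, c=1)
print(y, y.sum())
```

The same probability vector `y` is compared against each of the 2m context words, since the model produces one prediction per center word.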
Consider the (input, output) pairs of words for the center word jumped:
(jumped, the), (jumped, cat), (jumped, over), (jumped, the).
The one-hot vector for the input word is
\[ x^{(c)} = (0, 1, 0, 0, 0)^T. \]
Suppose that the dimension of the word embedding space is $n = 2$, and that the input word matrix and output word matrix are initialized randomly as usual in neural network training:
\[ \mathcal{V} = \begin{pmatrix} 0.023074 & 0.479901 & 0.432148 & 0.375480 & 0.364732 \\ 0.268008 & 0.424778 & 0.257104 & 0.148817 & 0.033922 \end{pmatrix} \]
and
\[ \mathcal{U}^T = \begin{pmatrix} 0.094491 & 0.490796 & 0.072921 & 0.104514 & 0.226080 \\ 0.443977 & 0.229903 & 0.172246 & 0.463000 & 0.154650 \end{pmatrix}. \]

Note that the same input word matrix and output word matrix as in the case of the CBOW model are used here.
The input embedded vector of the center word after projection is
\[ v^{(c)} = \begin{pmatrix} 0.479901 \\ 0.424778 \end{pmatrix}, \]
which are the values in the hidden layer.
The score vector in the output layer is
\[ z = \begin{pmatrix} 0.233938 \\ 0.333191 \\ 0.108161 \\ 0.146516 \\ 0.174188 \end{pmatrix}. \]
Example of Continuous Skip-Gram (Revisited)
The activated values in the output layer, i.e., the predicted probabilities, are
\[ y = \begin{pmatrix} 0.182938 \\ 0.165653 \\ 0.257558 \\ 0.199650 \\ 0.194201 \end{pmatrix}, \]
where the $i$th entry is computed by $\exp(z_i) / \sum_{j=1}^{5} \exp(z_j)$ and $z_i$ denotes the $i$th entry of $z$.
Indeed, each value of $y$ represents the predicted probability of each word in the vocabulary $V$.
The predicted word will be the word with the largest predicted probability.

Continuous Skip-Gram Model
Given an (input, output) pair of words $(w_c, w_j)$, $j = c-m, \ldots, c-1, c+1, \ldots, c+m$, the relevant embedded vectors are obtained by maximization of the log probability function (or minimization of cross-entropy)
\[
\begin{aligned}
J_2(\theta) &= \log P(w_j \mid w_c) \\
&= \log \frac{\exp(u_j^T v_c)}{\sum_{k=1}^{|V|} \exp(u_k^T v_c)} \\
&= u_j^T v_c - \log \sum_{k=1}^{|V|} \exp(u_k^T v_c).
\end{aligned}
\]
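The softmax step and the pairwise log-probability can be checked numerically. In the sketch below the score entries are given the signs (-, -, +, -, -); this sign pattern is an assumption of the sketch, chosen because it reproduces the predicted probabilities listed above to about four decimal places. The word order follows the alphabetical vocabulary of the example.

```python
import numpy as np

vocab = ["cat", "jumped", "over", "puddle", "the"]
# Score vector; the signs are an assumption of this sketch.
z = np.array([-0.233938, -0.333191, 0.108161, -0.146516, -0.174188])

y = np.exp(z) / np.exp(z).sum()              # softmax probabilities
predicted = vocab[int(np.argmax(y))]         # word with the largest probability

# Log-probability of the pair (jumped, over): log of the entry of y for "over".
log_prob_over = np.log(y[vocab.index("over")])

print(np.round(y, 6))
print(predicted, log_prob_over)
```

Under this assumption the word with the largest predicted probability is over, and the log-probability of a pair is just the logarithm of the corresponding entry of `y`.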
For a center word $w_c$ and context size $m$, the relevant embedded vectors can be obtained by the maximization of the log probability function (or minimization of cross-entropy)
\[
\begin{aligned}
J_3(\theta) &= \log P(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} \mid w_c) \\
&= \log \prod_{j=0, j \neq m}^{2m} P(w_{c-m+j} \mid w_c) \\
&= \log \prod_{j=0, j \neq m}^{2m} \frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|} \exp(u_k^T v_c)} \\
&= \sum_{j=0, j \neq m}^{2m} u_{c-m+j}^T v_c - 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c),
\end{aligned}
\]
where the independence of each (input, output) pair is assumed for the second equality.

For the context size $m$, the relevant embedded vectors for all center words and their associated contexts can be obtained by the maximization of the log probability function
\[ J_{\text{skip-gram}}(\theta) = \frac{1}{N - 2m} \sum_{c=1}^{N-2m} \left[ \sum_{j=0, j \neq m}^{2m} u_{c-m+j}^T v_c - 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c) \right], \]
where $\theta$ denotes the parameters we optimize and $N$ is the number of words in the corpus.
Then, the embedded vectors in $\mathcal{U}$ and $\mathcal{V}$ are obtained by the Stochastic Gradient Descent algorithm.
The computation of the normalization term $\log \sum_{k=1}^{|V|} \exp(u_k^T v_c)$ is expensive in the optimization when the vocabulary size $|V|$ is large.
The negative sampling procedure can further be used to solve this issue.
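The log-probability $J_3$ for one center word can be sketched as follows; the function name `skipgram_log_prob` and the randomly drawn values are hypothetical. The sketch also makes the cost of the normalization term visible: every evaluation needs one score per vocabulary word.

```python
import numpy as np

def skipgram_log_prob(U, v_c, context_ids):
    """J_3 = sum_j u_{c-m+j}^T v_c - 2m * log sum_k exp(u_k^T v_c)."""
    z = U @ v_c                              # one score per vocabulary word
    z_max = z.max()                          # shift to avoid overflow in exp
    log_norm = z_max + np.log(np.exp(z - z_max).sum())
    return z[context_ids].sum() - len(context_ids) * log_norm

rng = np.random.default_rng(4)               # arbitrary seed
U = rng.uniform(-0.5, 0.5, size=(5, 2))      # hypothetical output word matrix
v_c = rng.uniform(-0.5, 0.5, size=2)         # hypothetical center-word embedding
print(skipgram_log_prob(U, v_c, context_ids=[4, 0, 2, 4]))
```

With a vocabulary of realistic size, the product `U @ v_c` dominates the cost of each update, which is the motivation for avoiding the full normalization.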
Negative Sampling
The idea of negative sampling is related to that of noise-contrastive estimation (NCE) by Gutmann and Hyvärinen (2012), which reduces the problem of density estimation to that of binary classification, discriminating between samples from the data distribution and samples from a known noise distribution.
NCE postulates that a good model is capable of differentiating data from noise by means of logistic regression.
Instead of computing probabilities precisely as in NCE, the skip-gram model is concerned only with learning high-quality vector representations.
In negative sampling, NCE is simplified by drawing random pairs while retaining the quality of the vector representations.
Simply speaking, continuous skip-gram with negative sampling is regarded as the reformulation of the density estimation problem as binary classification with artificial generation of noise pairs.

Consider a pair $(w, c)$ of a center word and its context in the corpus $\mathcal{D}$. Define a binary random variable $D$ by
\[ D = \begin{cases} 1, & (w, c) \in \mathcal{D}, \\ 0, & (w, c) \notin \mathcal{D}. \end{cases} \]
N.B. Based on the previous definition of $\mathcal{D}$, the pair $(w, c)$ can further be re-expressed as a list of (input, output) pairs of words as before.
Then, $P(D = 1 \mid w, c, \theta)$ and $P(D = 0 \mid w, c, \theta)$ are the probabilities that $(w, c)$ comes and does not come from the corpus $\mathcal{D}$ respectively.
Suppose that $D$ is modelled by the logistic distribution
\[ P(D = 1 \mid w, c, \theta) = \frac{1}{1 + \exp(-u_w^T v_c)}, \]
where the parameters in $\theta$ are the word embedded vectors in $\mathcal{V}$ and $\mathcal{U}$.
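The logistic model for the binary variable can be sketched as below. The function name `prob_from_corpus` and the two embedded vectors are hypothetical values chosen only for illustration.

```python
import numpy as np

def prob_from_corpus(u_w, v_c):
    """P(D = 1 | w, c) = 1 / (1 + exp(-u_w^T v_c))."""
    return 1.0 / (1.0 + np.exp(-(u_w @ v_c)))

u_w = np.array([0.3, -0.1])                  # hypothetical embedded vectors
v_c = np.array([0.2, 0.4])
p1 = prob_from_corpus(u_w, v_c)              # probability that the pair is from the corpus
p0 = 1.0 - p1                                # P(D = 0 | w, c)
print(p1, p0)
```

Only a single inner product is needed per pair, in contrast to the sum over the whole vocabulary required by the softmax normalization.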
As mentioned in Mikolov's paper [the first one in the list], for the choice of $P_n(w)$, the unigram distribution $U(w)$ raised to the power of 3/4, i.e., $U(w)^{3/4}/Z$ where $Z$ is the normalizing constant, can significantly outperform the unigram and uniform distributions.
Choice of $K$ (the number of negative samples): 5-20 for small training datasets, 2-5 for large training datasets.

References
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and J. Dean (2013), Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119.
Mikolov, T., Chen, K., Corrado, G., and J. Dean (2013), Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations Workshop.
Mnih, A. and Y.W. Teh (2012), A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, 1751-1758.
Thank You!