STAT8301 Big Data Analytics
Chapter 5 Text Analytics - Appendix

Dr. Gilbert C.S. Lui
Department of Statistics and Actuarial Science,
The University of Hong Kong
2016-2017 Summer Semester

Chapter Outcomes
After completing this handout, you will know
  What word2vec is
Word Embeddings

Word Embeddings
What are word embeddings?
  A collection of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers.
  A mathematical mapping from a space with one dimension per word to a continuous vector space of much lower dimension.
How to generate this mapping?
  Neural networks (e.g. word2vec)
  Dimensionality reduction on the term-document matrix (e.g. LSA) or the word co-occurrence matrix
  Probabilistic models
  Explicit representation in terms of the context in which words appear
Uses of word embeddings
  Sentiment analysis
  Syntactic parsing
Example of Word Embeddings by LSA
Return to the SVD example of the words
  ship  boat  ocean  voyage  trip
in a collection of 6 documents.
The SVD term vectors of these 5 words are
\[
\begin{pmatrix}
0.95 & 0.28 & 1.02 & 1.52 & 0.56 \\
0.47 & 0.53 & 0.81 & 0.56 & 1.03
\end{pmatrix}.
\]
For example, the term ocean is represented by the vector
\[
\begin{pmatrix} 1.02 \\ 0.81 \end{pmatrix}.
\]

Word Embeddings by word2vec
What is word2vec?
  A group of neural network models which are used to produce word embeddings.
  Developed by a team of researchers led by Tomas Mikolov at Google in 2013.
Key features of these neural network models:
  Single hidden layer.
  No activation function is used in the hidden layer.
  Input word vectors are projected to the embedding space of lower dimension before entering the hidden layer.
  The dimension of the input word vectors is reduced in the hidden layer, and the output word vector has the same dimension as the input word vector (cf. autoencoder).
  The softmax function is used to compute the output vector.
Word Embeddings by word2vec
Suppose that we need to construct a model which will assign a probability to a sequence of tokens. Consider the following example:
  The cat jumped over the puddle.
After tokenization of the above sentence, i.e., converting it into a sequence of words, we have
  {The, cat, jumped, over, the, puddle}.
We attempt to predict or generate the center word jumped given the input context {The, cat, over, the, puddle}, i.e., the words around the center word.
This type of model is called a Continuous Bag of Words (CBOW) model.
Alternatively, we may want to predict or generate the words The, cat, over, the, puddle given the center word jumped.
This type of model is called a Continuous Skip-Gram model.

Continuous Bag of Words (CBOW) Model
Word vectors
Convert the words into a sequence of vectors of numbers, one-hot vectors.
Represent every word $w_i$, $i = 1, 2, \ldots, |V|$, as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and a single 1 at the index of that word in the sorted English vocabulary, where $|V|$ is the size of the vocabulary $V$.
Word vectors under this type of encoding would appear as
\[
w_{aardvark} = (1, 0, 0, \ldots, 0)^T, \quad
w_{a} = (0, 1, 0, \ldots, 0)^T, \quad
w_{at} = (0, 0, 1, \ldots, 0)^T, \quad \ldots, \quad
w_{zebra} = (0, 0, 0, \ldots, 1)^T.
\]
Each word is represented as a completely independent identity.
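The one-hot encoding described above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_vocab` and `one_hot` are hypothetical, and lower-casing the tokens before sorting the vocabulary is a choice made for this sketch.

```python
import numpy as np

def build_vocab(tokens):
    """Return the sorted list of distinct lower-cased tokens."""
    return sorted({t.lower() for t in tokens})

def one_hot(word, vocab):
    """Return the |V|-dimensional one-hot vector of `word`."""
    x = np.zeros(len(vocab))
    x[vocab.index(word.lower())] = 1.0
    return x

tokens = ["The", "cat", "jumped", "over", "the", "puddle"]
vocab = build_vocab(tokens)          # ['cat', 'jumped', 'over', 'puddle', 'the']
x_jumped = one_hot("jumped", vocab)  # 1 at index 1, 0 elsewhere
```

Because every vector has a single 1, any two distinct words are orthogonal, which is the sense in which each word is "a completely independent identity".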

Continuous Bag of Words (CBOW) Model
Assume that the words of a sentence are represented by one-hot vectors.
Suppose that the size of the context of a word $w_c$ is $m$ (= 10 as suggested by the original authors). Then, the context of $w_c$ is given by
\[
\{w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}\}.
\]
The sequence of words $\{w_{c-m}, \ldots, w_{c-1}, w_c, w_{c+1}, \ldots, w_{c+m}\}$ has the following associated one-hot word vectors
\[
(x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c)}, x^{(c+1)}, \ldots, x^{(c+m)}),
\]
where $x^{(i)}$ denotes the one-hot vector of word $w_i$ for $i = c-m, \ldots, c, \ldots, c+m$.
Define the corpus $\mathcal{D}$ whose elements are of the form
\[
(\{w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}\}, w_c).
\]

Continuous Bag of Words (CBOW) Model
Define two matrices $\mathbf{V} \in \mathbb{R}^{n \times |V|}$ and $\mathbf{U} \in \mathbb{R}^{|V| \times n}$, where $n$ is the dimension of the embedding space (or the number of hidden units in the hidden layer).
$\mathbf{V}$ is the input word matrix such that the $i$th column of $\mathbf{V}$ is an $n \times 1$ embedded vector for word $w_i$ when it is an input to the model.
$\mathbf{U}$ is the output word matrix such that the $j$th row of $\mathbf{U}$ is a $1 \times n$ embedded vector for word $w_j$ when it is an output of the model.
The $i$th column of $\mathbf{V}$ and the $j$th row of $\mathbf{U}$ are denoted by $v_i$ and $u_j^T$ respectively.
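The elements of the corpus $\mathcal{D}$, i.e. the (context, center word) samples, can be generated by sliding a window over the token sequence. The following is a minimal sketch; the function name `cbow_samples` is hypothetical, and skipping positions that lack a full window of $m$ words on each side is an assumption of this sketch.

```python
def cbow_samples(tokens, m):
    """Return (context, center) samples for every position with a full window.

    Positions closer than m words to either end of the sequence are skipped.
    """
    samples = []
    for c in range(m, len(tokens) - m):
        context = tokens[c - m:c] + tokens[c + 1:c + m + 1]
        samples.append((context, tokens[c]))
    return samples

tokens = ["the", "cat", "jumped", "over", "the", "puddle"]
samples = cbow_samples(tokens, m=2)
# samples[0] is (['the', 'cat', 'over', 'the'], 'jumped')
```

With $N = 6$ tokens and $m = 2$, this produces $N - 2m = 2$ samples.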
Continuous Bag of Words (CBOW) Model
The model works in the following steps:
Generate the one-hot word vectors for the context words $w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}$.
Projection step: obtain the embedded word vectors for the context
\[
v_{c-m} = \mathbf{V}x^{(c-m)}, \ldots, v_{c-1} = \mathbf{V}x^{(c-1)}, v_{c+1} = \mathbf{V}x^{(c+1)}, \ldots, v_{c+m} = \mathbf{V}x^{(c+m)}.
\]
Combination step: average (or, alternatively, sum) these vectors to obtain
\[
h = \frac{v_{c-m} + \cdots + v_{c-1} + v_{c+1} + \cdots + v_{c+m}}{2m}.
\]
Compute the score vector ($|V| \times 1$) in the output layer
\[
z = \mathbf{U}h.
\]
Compute the vector of predicted probabilities $\hat{y}$ ($|V| \times 1$) from the scores by the softmax function
\[
\hat{y} = \mathrm{softmax}(z),
\]
where the $i$th element of $\hat{y}$ is computed by $\exp(u_i^T h) / \sum_{j=1}^{|V|} \exp(u_j^T h)$.

Remarks
Instead of computing the embedded word vector for each context word separately, they can all be computed by a single matrix multiplication:
\[
\tilde{\mathbf{V}} = \mathbf{V}X,
\]
where the columns of $X$ are the input one-hot word vectors of the context words.
Similarly, the average of the embedded word vectors can be obtained by
\[
h = \frac{1}{2m} \tilde{\mathbf{V}} \mathbf{1}_{2m},
\]
where $\mathbf{1}_{2m}$ denotes a $(2m \times 1)$ column vector whose elements are all 1.
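The steps and the matrix form in the remarks can be sketched together in NumPy. This is an illustration under stated assumptions: the names `softmax`, `cbow_forward`, `V_in` and `U_out` are hypothetical, the matrices are drawn uniformly from (-0.5, 0.5) with an arbitrary seed, and the context indices are made up for the usage lines.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cbow_forward(V_in, U_out, context_ids):
    """Return (h, z, y_hat) for one CBOW sample.

    V_in is the n x |V| input word matrix, U_out the |V| x n output word
    matrix, and context_ids the vocabulary indices of the 2m context words.
    """
    size = V_in.shape[1]
    X = np.eye(size)[:, context_ids]   # columns are the one-hot context vectors
    V_tilde = V_in @ X                 # projection step for all context words at once
    h = V_tilde @ np.ones(len(context_ids)) / len(context_ids)  # averaging via the ones vector
    z = U_out @ h                      # score vector
    return h, z, softmax(z)

rng = np.random.default_rng(0)         # arbitrary seed, for reproducibility only
V_in = rng.uniform(-0.5, 0.5, size=(2, 5))
U_out = rng.uniform(-0.5, 0.5, size=(5, 2))
h, z, y_hat = cbow_forward(V_in, U_out, [4, 0, 2, 4])
```

Multiplying by one-hot columns only selects columns of the input matrix, so in practice the projection is usually implemented as an index lookup rather than a matrix product.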
Example of CBOW
Consider the example sentence
  The cat jumped over the puddle.
After tokenization of the sentence, we have
  {The, cat, jumped, over, the, puddle}.
Indeed, this sentence contains the following words
  {cat, jumped, over, puddle, the},
which have been arranged in alphabetical order for the construction of the vocabulary $V$, $|V| = 5$. Also, all the words have been converted to lower case for easier handling.
Suppose $c = 3$ and $m = 2$. Then, the center word $w_c$ is jumped and the context $\{w_{c-2}, w_{c-1}, w_{c+1}, w_{c+2}\}$ is
  {the, cat, over, the}.

Example of CBOW
The word vector of the center word $w_c$ is
\[
x^{(c)} = (0, 1, 0, 0, 0)^T.
\]
The input word vectors of the context words $w_{c-2}, w_{c-1}, w_{c+1}, w_{c+2}$ are
\[
x^{(c-2)} = (0, 0, 0, 0, 1)^T, \quad
x^{(c-1)} = (1, 0, 0, 0, 0)^T, \quad
x^{(c+1)} = (0, 0, 1, 0, 0)^T \quad \text{and} \quad
x^{(c+2)} = (0, 0, 0, 0, 1)^T
\]
respectively.
Example of CBOW
Provided that the dimension of the word embedding space is $n = 2$, the input word matrix and output word matrix are initialized randomly, as usual in neural network training:
\[
\mathbf{V} = \begin{pmatrix}
0.023074 & 0.479901 & 0.432148 & 0.375480 & 0.364732 \\
0.268008 & 0.424778 & 0.257104 & 0.148817 & 0.033922
\end{pmatrix}
\]
and
\[
\mathbf{U}^T = \begin{pmatrix}
0.094491 & 0.490796 & 0.072921 & 0.104514 & 0.226080 \\
0.443977 & 0.229903 & 0.172246 & 0.463000 & 0.154650
\end{pmatrix}.
\]
The input embedded vectors of the context after projection are
\[
v^{(c-2)} = \begin{pmatrix} 0.364732 \\ 0.033922 \end{pmatrix}, \quad
v^{(c-1)} = \begin{pmatrix} 0.023074 \\ 0.268008 \end{pmatrix}, \quad
v^{(c+1)} = \begin{pmatrix} 0.432148 \\ 0.257104 \end{pmatrix},
\]
and
\[
v^{(c+2)} = \begin{pmatrix} 0.364732 \\ 0.033922 \end{pmatrix}.
\]

Example of CBOW
The embedded vectors are then combined by simple averaging
\[
h = \begin{pmatrix} 0.068561 \\ 0.114317 \end{pmatrix},
\]
which are the values stored in the hidden layer.
The score vector in the output layer is
\[
z = \mathbf{U}h = (0.057232, 0.059931, 0.024690, 0.045763, 0.033179)^T.
\]
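The numbers in this example can be checked with a few lines of NumPy. One caveat: the matrices as printed carry no minus signs, and with all entries positive the average of the four context vectors is not (0.068561, 0.114317). The sketch below therefore assumes one particular sign pattern, under which the printed magnitudes of $h$ and $z$ are reproduced up to rounding. The pattern is an assumption and is not unique, and the sign of the unused puddle column is arbitrary. The names `V_in` and `U_out` are hypothetical.

```python
import numpy as np

# Magnitudes are taken from the example; the SIGNS are an assumption chosen
# so that the printed h and z are reproduced (up to rounding).
# Vocabulary order: cat, jumped, over, puddle, the.
V_in = np.array([[-0.023074, -0.479901, -0.432148, 0.375480,  0.364732],
                 [ 0.268008, -0.424778,  0.257104, 0.148817, -0.033922]])
U_out = np.array([[ 0.094491,  0.443977],
                  [ 0.490796,  0.229903],
                  [-0.072921, -0.172246],
                  [-0.104514,  0.463000],
                  [ 0.226080,  0.154650]])

context_ids = [4, 0, 2, 4]               # the, cat, over, the
h = V_in[:, context_ids].mean(axis=1)    # average of the embedded context vectors
z = U_out @ h                            # score vector in the output layer
```

Under this assumed pattern the third score (for over) is negative while the other four are positive.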
Example of CBOW
The activated values in the output layer, i.e., the predicted probabilities, are
\[
\hat{y} = (0.204546, 0.205099, 0.188457, 0.202213, 0.199685)^T,
\]
where the $i$th entry is computed by $\exp(z_i) / \sum_{j=1}^{5} \exp(z_j)$ and $z_i$ denotes the $i$th entry of $z$.
Indeed, each value of $\hat{y}$ represents the predicted probability of each word in the vocabulary $V$.
The predicted word will be the word with the largest predicted probability.
Here, the predicted word is jumped.

Continuous Bag of Words (CBOW) Model
At this moment, the matrices $\mathbf{V}$ and $\mathbf{U}$ are unknown.
Here, consider the loss measure, the cross-entropy function $H(\hat{y}, y)$, which is defined by
\[
H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j),
\]
where $y$ is the one-hot vector in the output layer.
Then, it can be simplified to
\[
H(\hat{y}, y) = -y_c \log(\hat{y}_c),
\]
where $c$ denotes the index of the center word in the one-hot vector.
When the prediction is perfect, i.e., $\hat{y}_c = 1$,
\[
H(\hat{y}, y) = -1 \log(1) = 0.
\]
When the prediction is very bad, e.g. $\hat{y}_c = 0.01$,
\[
H(\hat{y}, y) = -1 \log(0.01) \approx 4.605.
\]
This indicates that cross-entropy provides a good measure of the distance between $y$ and $\hat{y}$ as probability distributions.
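As a small illustration of the cross-entropy formula, the sketch below evaluates it for the predicted probabilities of the CBOW example and for the two limiting cases. The function name `cross_entropy` is hypothetical; terms with $y_j = 0$ are skipped so that $0 \cdot \log(0)$ is never evaluated.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_j y_j * log(y_hat_j) for a target distribution y."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = y > 0                       # only entries with y_j > 0 contribute
    return float(-np.sum(y[mask] * np.log(y_hat[mask])))

y_hat = [0.204546, 0.205099, 0.188457, 0.202213, 0.199685]  # predicted probabilities from the example
y = [0, 1, 0, 0, 0]                                         # one-hot vector of the center word jumped
loss = cross_entropy(y_hat, y)                              # equals -log(0.205099)

perfect = cross_entropy([1.0, 0.0], [1, 0])                 # perfect prediction
very_bad = cross_entropy([0.01, 0.99], [1, 0])              # equals -log(0.01)
```

Since all five predicted probabilities are close to 1/5, the loss for the example is close to $\log 5$, the value obtained from a uniform guess.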
Continuous Bag of Words (CBOW) Model
For a center word $w_c$ and context size $m$, the embedded vectors $u_c, v_{c-m}, \ldots, v_{c-1}, v_{c+1}, \ldots, v_{c+m}$ can be obtained by the maximization of the log-probability (or minimization of the cross-entropy)
\begin{align*}
J_1(\theta) &= \log P(w_c \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) \\
&= \log P(u_c \mid h) \\
&= \log \frac{\exp(u_c^T h)}{\sum_{j=1}^{|V|} \exp(u_j^T h)} \\
&= u_c^T h - \log \sum_{j=1}^{|V|} \exp(u_j^T h),
\end{align*}
where $\theta$ denotes the parameters we optimize.

Continuous Bag of Words (CBOW) Model
Therefore, for context size $m$, the relevant embedded vectors for all center words and their associated contexts can be obtained by the maximization of the objective function
\[
J_{CBOW}(\theta) = \frac{1}{N - 2m} \sum_{t=1}^{N-2m} \left[ u_c^T h - \log \sum_{j=1}^{|V|} \exp(u_j^T h) \right],
\]
where $\theta$ denotes the parameters we optimize and $N$ is the number of words in the corpus.
Dynamic logistic regression essentially!
Then, the relevant embedded vectors are obtained by the Gradient Descent algorithm.
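To make the optimization concrete, here is a minimal sketch of gradient steps on $J_1$ for a single sample. It is not the handout's implementation: the function names, the learning rate `lr=0.1`, the seed and the number of steps are assumptions. The gradients used are $\partial J_1 / \partial z = y - \hat{y}$, hence $\partial J_1 / \partial \mathbf{U} = (y - \hat{y}) h^T$ and $\partial J_1 / \partial h = \mathbf{U}^T (y - \hat{y})$, with each context vector receiving $1/(2m)$ of the latter.

```python
import numpy as np

def log_prob(V_in, U_out, context_ids, center_id):
    """J_1 = u_c^T h - log sum_j exp(u_j^T h) for one CBOW sample."""
    h = V_in[:, context_ids].mean(axis=1)
    z = U_out @ h
    return float(z[center_id] - np.log(np.sum(np.exp(z))))

def ascent_step(V_in, U_out, context_ids, center_id, lr=0.1):
    """One gradient ascent step on J_1 (i.e. descent on the cross-entropy)."""
    h = V_in[:, context_ids].mean(axis=1)
    z = U_out @ h
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()
    err = -y_hat
    err[center_id] += 1.0                 # y - y_hat, with y the one-hot target
    grad_U = np.outer(err, h)             # dJ_1/dU
    grad_h = U_out.T @ err                # dJ_1/dh
    U_new = U_out + lr * grad_U
    V_new = V_in.copy()
    for i in context_ids:                 # a word occurring twice in the context is updated twice
        V_new[:, i] += lr * grad_h / len(context_ids)
    return V_new, U_new

rng = np.random.default_rng(0)            # arbitrary seed
V_in = rng.uniform(-0.5, 0.5, size=(2, 5))
U_out = rng.uniform(-0.5, 0.5, size=(5, 2))
before = log_prob(V_in, U_out, [4, 0, 2, 4], 1)
for _ in range(100):
    V_in, U_out = ascent_step(V_in, U_out, [4, 0, 2, 4], 1)
after = log_prob(V_in, U_out, [4, 0, 2, 4], 1)   # expected to exceed `before`
```

Maximizing $J_1$ by gradient ascent is the same as minimizing the cross-entropy by gradient descent, since the two differ only in sign.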
Continuous Skip-Gram Model
The setup of the continuous Skip-Gram model is largely the same as that of the CBOW model, except that the input and output are reversed.
Since the objective of this model is to predict context words given a center word, the data samples are constructed as follows:
For the center word $w_c$, the context words are
\[
w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}.
\]
The (input, output) pairs of words in the neural network are
\[
(w_c, w_{c-m}), \ldots, (w_c, w_{c-1}), (w_c, w_{c+1}), \ldots, (w_c, w_{c+m}).
\]
Assume that the (input, output) pairs of words are independent of each other.
Define the corpus $\mathcal{D}$ whose elements are of the form $(w_i, w_j)$, i.e., the collection of (input, output) pairs from the center words $w_c$ and their associated context words.
The input word vector, which is converted from the center word $w_c$, is processed in the neural network model in the same way as before.
The outputs of this model are $\hat{y}^{(c-m)}, \ldots, \hat{y}^{(c-1)}, \hat{y}^{(c+1)}, \ldots, \hat{y}^{(c+m)}$.

Example of Continuous Skip-Gram
Consider the example of "The cat jumped over the puddle." again.
Suppose $c = 3$ and $m = 2$. Then, the center word $w_c$ is jumped and the context $\{w_{c-2}, w_{c-1}, w_{c+1}, w_{c+2}\}$ is
  {the, cat, over, the}.
The (input, output) pairs of words are
  (jumped, the), (jumped, cat), (jumped, over), (jumped, the).
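The construction of (input, output) pairs can be sketched as follows. The function name `skipgram_pairs` is hypothetical, and, as an assumption of this sketch, only center positions with a full window of $m$ words on each side are used.

```python
def skipgram_pairs(tokens, m):
    """Return (input, output) word pairs for every center word with a full window."""
    pairs = []
    for c in range(m, len(tokens) - m):
        for j in range(c - m, c + m + 1):
            if j != c:                       # the center word is not paired with itself
                pairs.append((tokens[c], tokens[j]))
    return pairs

tokens = ["the", "cat", "jumped", "over", "the", "puddle"]
pairs = skipgram_pairs(tokens, m=2)
# The first four pairs are those of the center word jumped:
# ('jumped', 'the'), ('jumped', 'cat'), ('jumped', 'over'), ('jumped', 'the')
```

Each center word contributes $2m$ pairs, so the pair (jumped, the) appears twice because the appears on both sides of jumped.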
Continuous Skip-Gram Model
Generate the one-hot vector for the center word $w_c$, i.e., $x^{(c)}$.
Projection step: obtain the embedded vector for the center word
\[
v_c = \mathbf{V}x^{(c)}.
\]
No combination step is involved in the hidden layer:
\[
h = v_c.
\]
Compute the score vector using
\[
z = \mathbf{U}h.
\]
Compute the predicted probabilities in $\hat{y}$ using the softmax function
\[
\hat{y} = \mathrm{softmax}(z),
\]
where the $i$th element of $\hat{y}$ is computed by $\exp(u_i^T v_c) / \sum_{k=1}^{|V|} \exp(u_k^T v_c)$.
Example of Continuous Skip-Gram (Revisited)
Consider the (input, output) pairs of words for the center word jumped:
  (jumped, the), (jumped, cat), (jumped, over), (jumped, the).
The one-hot vector for the input word is
\[
x^{(c)} = (0, 1, 0, 0, 0)^T.
\]
Suppose that the dimension of the word embedding space is $n = 2$, and that the input word matrix and output word matrix are initialized randomly, as usual in neural network training:
\[
\mathbf{V} = \begin{pmatrix}
0.023074 & 0.479901 & 0.432148 & 0.375480 & 0.364732 \\
0.268008 & 0.424778 & 0.257104 & 0.148817 & 0.033922
\end{pmatrix}
\]
and
\[
\mathbf{U}^T = \begin{pmatrix}
0.094491 & 0.490796 & 0.072921 & 0.104514 & 0.226080 \\
0.443977 & 0.229903 & 0.172246 & 0.463000 & 0.154650
\end{pmatrix}.
\]

Example of Continuous Skip-Gram (Revisited)
Note that the same input word matrix and output word matrix as in the case of the CBOW model are used here.
The input embedded vector of the center word after projection is
\[
v^{(c)} = \begin{pmatrix} 0.479901 \\ 0.424778 \end{pmatrix},
\]
which are the values in the hidden layer.
The score vector in the output layer is
\[
z = (0.233938, 0.333191, 0.108161, 0.146516, 0.174188)^T.
\]
Example of Continuous Skip-Gram (Revisited)
The activated values in the output layer, i.e., the predicted probabilities, are
\[
\hat{y} = (0.182938, 0.165653, 0.257558, 0.199650, 0.194201)^T,
\]
where the $i$th entry is computed by $\exp(z_i) / \sum_{j=1}^{5} \exp(z_j)$ and $z_i$ denotes the $i$th entry of $z$.
Indeed, each value of $\hat{y}$ represents the predicted probability of each word in the vocabulary $V$.
The predicted word will be the word with the largest predicted probability.
Here, the predicted word is over.

Continuous Skip-Gram Model
Given an (input, output) pair of words $(w_c, w_j)$, $j = c-m, \ldots, c-1, c+1, \ldots, c+m$, the relevant embedded vectors are obtained by maximization of the log probability function (or minimization of the cross-entropy)
\begin{align*}
J_2(\theta) &= \log P(w_j \mid w_c) \\
&= \log \frac{\exp(u_j^T v_c)}{\sum_{k=1}^{|V|} \exp(u_k^T v_c)} \\
&= u_j^T v_c - \log \sum_{k=1}^{|V|} \exp(u_k^T v_c).
\end{align*}
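The skip-gram example and the value of $J_2$ for the pair (jumped, over) can be checked numerically. As printed, the matrices and the score vector carry no minus signs, and with all entries positive the printed probabilities cannot be obtained from a softmax. The sketch below therefore assumes one sign pattern (an assumption, not unique, with the sign of the unused puddle column arbitrary) under which the printed probabilities are reproduced up to rounding. The names `V_in` and `U_out` are hypothetical.

```python
import numpy as np

# Magnitudes are taken from the example; the SIGNS are an assumption chosen
# so that the printed predicted probabilities are reproduced (up to rounding).
# Vocabulary order: cat, jumped, over, puddle, the.
V_in = np.array([[-0.023074, -0.479901, -0.432148, 0.375480,  0.364732],
                 [ 0.268008, -0.424778,  0.257104, 0.148817, -0.033922]])
U_out = np.array([[ 0.094491,  0.443977],
                  [ 0.490796,  0.229903],
                  [-0.072921, -0.172246],
                  [-0.104514,  0.463000],
                  [ 0.226080,  0.154650]])

v_c = V_in[:, 1]                          # projection of the center word jumped
z = U_out @ v_c                           # score vector in the output layer
y_hat = np.exp(z) / np.exp(z).sum()       # softmax
predicted = int(np.argmax(y_hat))         # index of the word with the largest probability
J2 = z[2] - np.log(np.exp(z).sum())       # log P(over | jumped)
```

Under this pattern only the score of over is positive, which is why over receives the largest probability, and $J_2$ for (jumped, over) is simply the logarithm of that probability.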
Continuous Skip-Gram Model
For a center word $w_c$ and context size $m$, the relevant embedded vectors can be obtained by the maximization of the log probability function (or minimization of the cross-entropy)
\begin{align*}
J_3(\theta) &= \log P(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} \mid w_c) \\
&= \log \prod_{j=0, j \neq m}^{2m} P(w_{c-m+j} \mid w_c) \\
&= \log \prod_{j=0, j \neq m}^{2m} \frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|} \exp(u_k^T v_c)} \\
&= \sum_{j=0, j \neq m}^{2m} u_{c-m+j}^T v_c - 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c),
\end{align*}
where the independence of each (input, output) pair is assumed for the second equality.

Continuous Skip-Gram Model
For the context size $m$, the relevant embedded vectors for all center words and their associated contexts can be obtained by the maximization of the log probability function
\[
J_{skip\text{-}gram}(\theta) = \frac{1}{N - 2m} \sum_{c=1}^{N-2m} \left[ \sum_{j=0, j \neq m}^{2m} u_{c-m+j}^T v_c - 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c) \right],
\]
where $\theta$ denotes the parameters we optimize and $N$ is the number of words in the corpus.
Then, the embedded vectors in $\mathbf{U}$ and $\mathbf{V}$ are obtained by the Stochastic Gradient Descent algorithm.
The computation of the normalization term $\log \sum_{k=1}^{|V|} \exp(u_k^T v_c)$ is expensive in the optimization when the vocabulary size $|V|$ is large.
A negative sampling procedure can be used to address this issue.
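The window objective $J_3$ and the role of the normalization term can be illustrated with a short sketch. The function name `window_log_prob`, the random matrices, the seed and the indices are assumptions made for illustration only.

```python
import numpy as np

def window_log_prob(V_in, U_out, center_id, context_ids):
    """J_3 = sum_o u_o^T v_c - 2m * log sum_k exp(u_k^T v_c) for one center word."""
    v_c = V_in[:, center_id]
    z = U_out @ v_c                         # |V| inner products: the costly part when |V| is large
    log_norm = np.log(np.sum(np.exp(z)))    # normalization term, shared by all 2m context words
    return float(np.sum(z[context_ids]) - len(context_ids) * log_norm)

rng = np.random.default_rng(1)              # arbitrary seed
V_in = rng.uniform(-0.5, 0.5, size=(2, 5))
U_out = rng.uniform(-0.5, 0.5, size=(5, 2))
J3 = window_log_prob(V_in, U_out, center_id=1, context_ids=[4, 0, 2, 4])
```

The score vector `z` requires one inner product per vocabulary word for every center word, which is the cost that grows with $|V|$.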
Negative Sampling
The idea of negative sampling is related to that of noise-contrastive estimation (NCE) by Gutmann and Hyvärinen (2012), which reduces the problem of density estimation to that of binary classification, discriminating between samples from the data distribution and samples from a known noise distribution.
NCE postulates that a good model is capable of differentiating data from noise by means of logistic regression.
Instead of computing probabilities precisely as in NCE, the skip-gram model is concerned only with learning high-quality vector representations.
In negative sampling, NCE is simplified by drawing random pairs while retaining the quality of the vector representations.
Simply speaking, continuous skip-gram with negative sampling can be regarded as a reformulation of the density estimation problem as binary classification with artificially generated noise pairs.

Negative Sampling
Consider a pair $(w, c)$ of a center word and its context in the corpus $\mathcal{D}$. Define a binary random variable $D$ by
\[
D = \begin{cases} 1 & (w, c) \in \mathcal{D} \\ 0 & (w, c) \notin \mathcal{D} \end{cases}.
\]
N.B. Based on the previous definition of $\mathcal{D}$, the pair $(w, c)$ can further be re-expressed as a list of (input, output) pairs of words as before.
Then, $P(D = 1 \mid w, c, \theta)$ and $P(D = 0 \mid w, c, \theta)$ are the probabilities that $(w, c)$ comes and does not come from the corpus $\mathcal{D}$ respectively.
Suppose that $D$ is modelled by the logistic distribution
\[
P(D = 1 \mid w, c, \theta) = \frac{1}{1 + \exp(-u_w^T v_c)},
\]
where the parameters in $\theta$ are the word embedded vectors in $\mathbf{V}$ and $\mathbf{U}$.
Negative Sampling
Then, the embedded vectors in $\mathbf{V}$ and $\mathbf{U}$ can be determined by the maximization of the objective function
\[
\prod_{(w,c) \in \mathcal{D}} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \notin \mathcal{D}} P(D = 0 \mid w, c, \theta),
\]
or
\begin{align*}
J_{NEG}(\theta) &= \sum_{(w,c) \in \mathcal{D}} \log P(D = 1 \mid w, c, \theta) + \sum_{(w,c) \notin \mathcal{D}} \log P(D = 0 \mid w, c, \theta) \\
&= \sum_{(w,c) \in \mathcal{D}} \log P(D = 1 \mid w, c, \theta) + \sum_{(w,c) \notin \mathcal{D}} \log \left( 1 - P(D = 1 \mid w, c, \theta) \right) \\
&= \sum_{(w,c) \in \mathcal{D}} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \notin \mathcal{D}} \log \left( 1 - \frac{1}{1 + \exp(-u_w^T v_c)} \right) \\
&= \sum_{(w,c) \in \mathcal{D}} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \notin \mathcal{D}} \log \frac{1}{1 + \exp(u_w^T v_c)}.
\end{align*}

Negative Sampling
Note that the $(w, c)$ pairs not coming from $\mathcal{D}$ are referred to as negative samples, which are sampled from the noise distribution $P_n(w)$.
Indeed, the $(w, c)$ pair can further be transformed into a number of (input, output) pairs.
For an (input, output) pair, the objective function is
\[
\log \frac{1}{1 + \exp(-u_{c-m+j}^T v_c)} + \sum_{k=1}^{K} \log \frac{1}{1 + \exp(\tilde{u}_k^T v_c)},
\]
where $\{\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_K\}$ are the embedded vectors of output words which are sampled from $P_n(w)$.
One may expect that the predicted probabilities for the noise output words are equal to zero!
For example, the pair (jumped, puddle) would produce predicted probability zero for the output word puddle since it is outside the context of jumped.
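The per-pair objective above can be sketched in NumPy as follows. This is a hedged illustration rather than a reference implementation: the function names, the random matrices, the seed and the word counts are assumptions, and the noise distribution used is the unigram distribution raised to the power 3/4, the choice reported by Mikolov et al. (2013). For simplicity the sketch does not exclude the true output word from the noise draws.

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` and normalized to sum to 1."""
    p = np.asarray(counts, dtype=float) ** power
    return p / p.sum()

def log_sigmoid(x):
    """log(1 / (1 + exp(-x))), computed in a numerically stable way."""
    return -np.logaddexp(0.0, -x)

def neg_sampling_objective(V_in, U_out, center_id, output_id, noise_ids):
    """log sigma(u_o^T v_c) + sum_k log sigma(-u_k^T v_c) over the K noise words."""
    v_c = V_in[:, center_id]
    positive = log_sigmoid(U_out[output_id] @ v_c)              # observed (input, output) pair
    negative = np.sum(log_sigmoid(-(U_out[noise_ids] @ v_c)))   # K noise pairs
    return float(positive + negative)

rng = np.random.default_rng(0)                        # arbitrary seed
V_in = rng.uniform(-0.5, 0.5, size=(2, 5))
U_out = rng.uniform(-0.5, 0.5, size=(5, 2))
counts = [1, 1, 1, 1, 2]                              # cat, jumped, over, puddle, the
P_n = noise_distribution(counts)
noise_ids = rng.choice(len(counts), size=5, p=P_n)    # K = 5 noise words
J = neg_sampling_objective(V_in, U_out, center_id=1, output_id=2, noise_ids=noise_ids)
```

Only $K + 1$ inner products are needed per pair, instead of the $|V|$ inner products required by the softmax normalization term.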
Negative Sampling
As mentioned in Mikolov's paper [the first one in the reference list], for the choice of $P_n(w)$, the unigram distribution $U(w)$ raised to the power of 3/4, i.e., $U(w)^{3/4}/Z$ where $Z$ is the normalizing constant, can significantly outperform the unigram and uniform distributions.
Choice of $K$: 5-20 for small training datasets, 2-5 for large training datasets.

References
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and J. Dean (2013), Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems. 26, 3111-3119.
Mikolov, T., Chen, K., Corrado, G., and J. Dean (2013), Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations Workshop.
Mnih, A. and Y.W. Teh (2012), A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, 1751-1758.
Gutmann, M.U. and A. Hyvärinen (2012), Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research. 13, 307-361.

Thank You!
