
Word Embedding

M. Soleymani
Sharif University of Technology
Fall 2017

Many slides have been adapted from Socher's lectures, cs224d, Stanford, 2017,
and some slides from Hinton's slides, “Neural Networks for Machine Learning”, Coursera, 2015.
One-hot coding
Distributed similarity based representations
• representing a word by means of its neighbors

• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)

• One of the most successful ideas of modern statistical NLP

Word embedding
• Store “most” of the important information in a fixed, small number of
dimensions: a dense vector
– Usually around 25 – 1000 dimensions

• Embeddings: distributional models with dimensionality reduction,

based on prediction
How to make neighbors represent words?
• Answer: With a co-occurrence matrix X
– options: full document vs windows

• Full word-document co-occurrence matrix

– will give general topics (all sports terms will have similar entries) leading to
“Latent Semantic Analysis”

• Window around each word

– captures both syntactic (POS) and semantic information
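A window co-occurrence count can be sketched in a few lines of Python; the toy corpus below is invented for illustration:

```python
from collections import defaultdict

def cooccurrence(corpus, m=1):
    """Count word co-occurrences within a symmetric window of size m."""
    counts = defaultdict(int)
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - m), min(len(sent), i + m + 1)):
                if i != j:
                    counts[(w, sent[j])] += 1
    return counts

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
counts = cooccurrence(corpus, m=1)
# ("i", "like") co-occurs twice across the corpus; the counts are symmetric
```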
LSA: Dimensionality Reduction based on word-doc matrix

• Keep only the k largest singular values of X (truncated SVD)
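A minimal LSA-style sketch with NumPy: take the SVD of a count matrix X and keep only the k largest singular values (X here is a made-up toy matrix, not real data):

```python
import numpy as np

def lsa_embed(X, k):
    """Project words to k dimensions by truncating the SVD of X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]  # one k-dim vector per word (row)

X = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.]])  # toy word-document counts
W = lsa_embed(X, k=2)
# W has one 2-d vector per word
```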

Directly learn low-dimensional word vectors
• Old idea. Relevant for this lecture:
– Learning representations by back-propagating errors. (Rumelhart et al., 1986)
– NNLM: A neural probabilistic language model (Bengio et al., 2003)
– NLP (almost) from Scratch (Collobert & Weston, 2008)
– A more recent, even simpler and faster model: word2vec (Mikolov et al., 2013)
NNLM: Trigram (Language Modeling)

Bengio et al., NNLM: A neural probabilistic language model, 2003.

• Semantic and syntactic features of previous words can help to predict
the features of the next word.

• Word embedding in the NNLM model helps us to find similarities

between pairs of words

“the cat got squashed in the garden on friday”

“the dog got flattened in the yard on monday”
A simpler way to learn feature vectors for words

Collobert and Weston, A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask
Learning, ICML 2008.
Part of a 2-D map of the 2500 most common words
Word2vec embedding

• word2vec: as originally described (Mikolov et al., 2013), a neural network
model using a two-layer network (i.e., not deep!) to perform dimensionality
reduction

• Very computationally efficient, good all-round model (good
hyperparameters already selected)
Skip-gram vs. CBOW
• Two possible architectures:
– given some context words, predict the center (CBOW)
• Predict center word from sum of surrounding word vectors
– given a center word, predict the contexts (Skip-gram)

Skip-gram Continuous Bag of words

• Embeddings that are good at predicting neighboring words are also
good at representing similarity
Details of Word2Vec
• Learn to predict surrounding words in a window of length m of every word.

• Objective function: maximize the average log probability of any context
word given the current center word:

  J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−m≤j≤m, j≠0} log p(w_{t+j} | w_t)

  – T: training set size
  – m: context size (usually 5-10)
  – v_w, u_w: vector representations of word w
  – θ: all parameters of the network

• Use a large training corpus to maximize it
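The (center, context) pairs summed over by this objective can be enumerated directly; a minimal sketch, assuming a pre-tokenized sentence:

```python
def skipgram_pairs(tokens, m=2):
    """List every (center, context) pair within a window of size m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "mat"], m=1)
# 8 pairs: each interior word contributes two, the two ends one each
```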
Word embedding matrix
• You will get the word vector by left-multiplying a one-hot vector by W

  W is a |V| × d matrix with one row per vocabulary word (aardvark, …, zebra);
  x is a one-hot vector with x_k = 1:

  h = x^T W = W_{k,·} = v_k        (the k-th row of the matrix W)

• w_o: context or output (outside) word
• w_I: center or input word

  h = x_I^T W = W_{I,·} = v_I
  score(w_o, w_I) = h^T W′_{·,o} = v_I^T u_o,   where W′_{·,o} = u_o

  P(w_o | w_I) = exp(u_o^T v_I) / Σ_{k=1}^{V} exp(u_k^T v_I)

Every word has 2 vectors:

– v_w: when w is the center word
– u_w: when w is the outside word (context word)
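The softmax P(w_o | w_I) over the two vector sets can be sketched with random toy matrices (V holds the center vectors v_w as rows, U the context vectors u_w; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
V = rng.normal(size=(vocab_size, dim))  # center vectors v_w (rows of W)
U = rng.normal(size=(vocab_size, dim))  # context vectors u_w (columns of W')

def p_context_given_center(I):
    """P(w_o | w_I) for every candidate context word o."""
    scores = U @ V[I]                   # u_o^T v_I for all o at once
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

p = p_context_given_center(2)
# p is a proper probability distribution over the 5 vocabulary words
```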
Details of Word2Vec
• Predict surrounding words in a window of length m of every word:

  P(w_o | w_I) = exp(u_o^T v_I) / Σ_{i=1}^{V} exp(u_i^T v_I)

  J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−m≤j≤m, j≠0} log p(w_{t+j} | w_t)

• Gradient with respect to the center vector v_I:

  ∂ log p(w_o | w_I) / ∂v_I
    = ∂/∂v_I [ u_o^T v_I − log Σ_{x=1}^{V} exp(u_x^T v_I) ]
    = u_o − Σ_{x=1}^{V} ( exp(u_x^T v_I) / Σ_{x′=1}^{V} exp(u_{x′}^T v_I) ) u_x
    = u_o − Σ_{x=1}^{V} p(w_x | w_I) u_x
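The closed-form gradient u_o − Σ_x p(w_x | w_I) u_x can be checked numerically against finite differences on random toy vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
Vsz, d = 6, 4
U = rng.normal(size=(Vsz, d))  # context vectors u_x
v_I = rng.normal(size=d)       # center vector v_I
o = 2                          # index of the observed context word

def log_p(v):
    """log p(w_o | w_I) as a function of the center vector."""
    s = U @ v
    return s[o] - np.log(np.exp(s).sum())

# Analytic gradient: u_o - sum_x p(w_x|w_I) u_x
s = U @ v_I
p = np.exp(s) / np.exp(s).sum()
grad = U[o] - p @ U

# Central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([(log_p(v_I + eps * np.eye(d)[i]) -
                 log_p(v_I - eps * np.eye(d)[i])) / (2 * eps)
                for i in range(d)])
# grad and num agree to high precision
```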
Training difficulties
• With large vocabularies, computing the full softmax gradient is not scalable:

  ∂ log p(w_o | w_I) / ∂v_I = u_o − Σ_{x=1}^{V} p(w_x | w_I) u_x

• Instead, use negative sampling: only sample a few words that do not
appear in the context as negative examples
– Similar to focusing mostly on positive correlations
Negative sampling
• k is the number of negative samples; maximize:

  log σ(u_o^T v_I) + Σ_{j=1}^{k} E_{w_j ~ P(w)} [ log σ(−u_j^T v_I) ]
• Maximize the probability that the real outside word appears; minimize the
probability that random words appear around the center word

• P(w) = U(w)^{3/4} / Z

– the unigram distribution U(w) raised to the 3/4 power

– The power makes less frequent words be sampled more often

Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
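The effect of the 3/4 power can be seen on made-up unigram counts (the words and counts below are invented for illustration):

```python
import numpy as np

counts = {"the": 5000, "cat": 120, "sat": 80, "aardvark": 3}  # toy counts
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

p_raw = freq / freq.sum()   # plain unigram distribution U(w)
p_neg = freq ** 0.75
p_neg /= p_neg.sum()        # U(w)^(3/4) / Z

# Rare words gain sampling probability relative to the raw unigram
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=p_neg)  # k = 5 negative samples
```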
What to do with the two sets of vectors?
• We end up with U and V from all the vectors u and v (in columns)
– Both capture similar co-occurrence information.

• The best solution is to simply sum them up:

Xfinal = U + V

• One of many hyperparameters explored in GloVe

Pennington et al., Global Vectors for Word Representation, 2014.

Summary of word2vec
• Go through each word of the whole corpus

• Predict surrounding words of each word

• This captures co-occurrence of words one at a time

• Why not capture co-occurrence counts directly?

Window based co-occurrence matrix: Example

Corpus; window length 1 (more common: 5-10)

Symmetric (irrelevant whether left or right context)


Pennington et al., Global Vectors for Word Representation, 2014.

How to evaluate word vectors?
• Related to general evaluation in NLP: Intrinsic vs extrinsic
• Intrinsic:
– Evaluation on a specific/intermediate subtask
– Fast to compute
– Helps to understand that system
– Not clear if really helpful unless correlation to real task is established
• Extrinsic:
– Evaluation on a real task
– Can take a long time to compute accuracy
– Unclear if the subsystem is the problem or its interaction with other subsystems
– If replacing exactly one subsystem with another improves accuracy -> Winning!
Intrinsic evaluation by word analogies
• Performance in completing word vector analogies:

  d = argmax_i  (x_b − x_a + x_c)^T x_i / ‖x_b − x_a + x_c‖

• Evaluate word vectors by how well their cosine distance after addition
captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?

Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, 2013.
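The analogy evaluation can be sketched directly: maximize cosine similarity with x_b − x_a + x_c while discarding the input words (the 2-d vectors below are contrived so that the offset really is linear):

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by cosine similarity, skipping the inputs."""
    q = vectors[b] - vectors[a] + vectors[c]
    q = q / np.linalg.norm(q)
    best, best_sim = None, -np.inf
    for w, x in vectors.items():
        if w in (a, b, c):
            continue  # discard the input words from the search
        sim = q @ x / np.linalg.norm(x)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

vecs = {"king":  np.array([1.0,  1.0]),
        "queen": np.array([1.0, -1.0]),
        "man":   np.array([0.2,  1.0]),
        "woman": np.array([0.2, -1.0])}
# man : woman :: king : queen
```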
Word Analogies

The linearity of the skip-gram model makes its vectors more suitable for such linear analogical reasoning
GloVe Visualizations: Company - CEO
GloVe Visualizations: Superlatives
Other fun word2vec analogies
Analogy evaluation and hyperparameters

Pennington et al., Global Vectors for Word Representation, 2014.

Analogy evaluation and hyperparameters
• More data helps, Wikipedia is better than news text!

Pennington et al., Global Vectors for Word Representation, 2014.

Another intrinsic word vector evaluation
Closest words to “Sweden” (cosine similarity)
Extrinsic word vector evaluation
• Extrinsic evaluation of word vectors: all subsequent NLP tasks can be
considered as downstream tasks
• One example where good word vectors should help directly: named
entity recognition
– finding a person, organization or location
Example: word classification
• Two options: train only classifier and fix word vectors or also train
word vectors

• Question: What are the advantages and disadvantages of training the

word vectors?
– Pro: better fit on training data
– Con: Worse generalization because the words move in the vector space
Example: word classification

• What is the major benefit of word vectors obtained by skip-gram or CBOW?

– Ability to also classify words accurately
• Countries cluster together -> classifying location words should be possible with word
vectors (even for countries that do not exist in the labeled training set)
– Fine tune (or learn from scratch for any task) and incorporate more
• Project sentiment into words to find most positive/negative words in corpus
Losing generalization by re-training word vectors
• Setting: Training classifier for movie
review sentiment of words
– in the training data we have “TV” and “telly”
– In the testing data we have “television”

• Originally they were all similar (from pre-training word vectors)

• What happens when we train the word vectors using labeled training data?
Losing generalization by re-training word vectors
• What happens when we train the
word vectors?
– Those that are in the training data move
– Words from pre-training that do NOT
appear in training stay

• Example:
– In training data: “TV” and “telly”
– Only in testing data: “television”
Losing generalization by re-training word vectors
Example: Using word2vec in Image Captioning
1. Take words as inputs
2. Convert them to word vectors
3. Use the word vectors as inputs to the sequence-generation LSTM
4. Map the output word vectors from this system back to natural language words
5. Produce words as answers
Word vectors: advantages
• It captures both syntactic (POS) and semantic information
• It scales
– Train on billion word corpora in limited time
• Can easily incorporate a new sentence/document or add a word to the
vocabulary
• Word embeddings trained by one group can be used by others
• There is a nice Python module for word2vec
– Gensim
• Mikolov et al., Distributed Representations of Words and Phrases and
their Compositionality, 2013.
• Pennington et al., Global Vectors for Word Representation, 2014.