
Neural Network Language Models and word2vec

Tambet Matiisen
8.10.2014
Sources
• Yoshua Bengio. Neural net language models.
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space.
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality.
• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations.
• Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting Similarities among Languages for Machine Translation.
Language models
• A language model captures the statistical
characteristics of sequences of words in a
natural language, typically allowing one to
make probabilistic predictions of the next
word given preceding ones.
• E.g. the standard “trigram” method:
count ( wt 2 wt 1wt )
P( wt | wt 2 , wt 1 ) 
count ( wt 2 wt 1 )
Neural network language models
• A neural network language model is a language
model based on neural networks, exploiting their
ability to learn distributed representations.
• A distributed representation of a word is a vector
of activations of neurons (real values) which
characterizes the meaning of the word.
• A distributed representation contrasts with a local representation, in which only one neuron (or very few) is active at a time.
NNLM architecture
Layers from bottom to top (V = vocabulary size, D = embedding size, H = hidden layer size):
• Sparse (one-hot) representations of word t-2 and word t-1 (V nodes each)
• V x D weights, shared between the context words
• Learned distributed representations of word t-2 and word t-1 (D nodes each)
• H x D weights for each context word
• Hidden layer that predicts the output from the features of the input words (H nodes)
• H x V weights
• Softmax output layer (V nodes, one unit per next word)
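A rough numpy sketch of one forward pass through this architecture; the layer sizes are illustrative and the weights are random placeholders rather than a trained model:

import numpy as np

V, D, H = 10000, 100, 50                      # vocabulary, embedding and hidden sizes (illustrative)
C = 0.01 * np.random.randn(V, D)              # shared V x D embedding table
W_hidden = 0.01 * np.random.randn(H, 2 * D)   # the two H x D blocks from the diagram, concatenated
W_out = 0.01 * np.random.randn(V, H)          # the H x V hidden-to-output weights (stored V x H)

def next_word_distribution(word_t2, word_t1):
    # Look up the distributed representations of the two context words.
    x = np.concatenate([C[word_t2], C[word_t1]])
    h = np.tanh(W_hidden @ x)                 # hidden layer, H nodes
    logits = W_out @ h
    e = np.exp(logits - logits.max())         # softmax over the vocabulary
    return e / e.sum()

probs = next_word_distribution(42, 7)         # probability of every possible next word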
word2vec
• An efficient implementation of the continuous
bag-of-words and skip-gram architectures for
computing vector representations of words.
• The word vectors can be used to significantly
improve and simplify many NLP applications.
CBOW architecture
Predicts the current word given the context.
• Inputs: sparse (one-hot) representations of the context words
• Input weights (NB! shared across the context words) = the learned distributed representations
• Output: softmax over the vocabulary
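A minimal sketch of the CBOW forward pass under these assumptions (random placeholder matrices, no training loop):

import numpy as np

V, D = 10000, 100
W_in = 0.01 * np.random.randn(V, D)     # input weights = distributed representations
W_out = 0.01 * np.random.randn(D, V)    # output weights

def cbow_predict(context_ids):
    # Average the context word vectors (the bag-of-words projection), then softmax.
    h = W_in[context_ids].mean(axis=0)
    logits = h @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()                  # probability of each word being the current word

probs = cbow_predict([3, 17, 25, 99])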


Skip-gram architecture
Predicts the surrounding words given the current word.
• Input: sparse (one-hot) representation of the current word
• Input weights = the distributed representation of the current word
• Output weights feed one softmax over the vocabulary for each surrounding position
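The skip-gram direction, sketched the same way: the current word's vector produces one softmax over the vocabulary, which is evaluated against each surrounding word during training (again with placeholder matrices):

import numpy as np

V, D = 10000, 100
W_in = 0.01 * np.random.randn(V, D)     # input weights = distributed representations
W_out = 0.01 * np.random.randn(D, V)    # output weights

def skipgram_predict(center_id):
    # One distribution over possible context words, reused for every window position.
    logits = W_in[center_id] @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

context_probs = skipgram_predict(42)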


Linguistic regularities

The word vector space implicitly encodes many regularities among words, e.g. vector(KINGS) – vector(KING) + vector(QUEEN) is close to vector(QUEENS).
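With pre-trained vectors loaded into gensim, the same regularity can be queried directly (the vector file name is a placeholder):

from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec binary format (placeholder file name).
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# vector(KINGS) - vector(KING) + vector(QUEEN) should land near vector(QUEENS).
print(wv.most_similar(positive=["kings", "queen"], negative=["king"], topn=3))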
Semantic-Syntactic Word Relationship test set

[Figure: accuracy of the models on the test set vs. training time (minutes, hours, days)]
From words to phrases
• Find words that appear frequently together and
infrequently in other contexts.
  score(w_i, w_j) = (count(w_i w_j) − δ) / (count(w_i) · count(w_j))
• The bigrams with score above the chosen
threshold are then used as phrases.
• The δ is a discounting coefficient that prevents too many phrases consisting of very infrequent words from being formed.
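A direct sketch of this scoring rule (the values of δ and the threshold are illustrative; gensim's Phrases class implements a closely related scorer):

from collections import Counter

def find_phrases(tokens, delta=5.0, threshold=1e-4):
    # Unigram and bigram counts over the token stream.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = []
    for (w1, w2), n in bigrams.items():
        # score = (count(w1 w2) - delta) / (count(w1) * count(w2))
        score = (n - delta) / (unigrams[w1] * unigrams[w2])
        if score > threshold:
            phrases.append((w1 + "_" + w2, score))
    return phrases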
Examples - analogy
Examples – distance (rare words)
Examples – addition
Parameters
• Architecture: skip-gram (slower, better for infrequent
words) vs CBOW (fast)
• The training algorithm: hierarchical softmax (better for
infrequent words) vs negative sampling (better for
frequent words, better with low dimensional vectors)
• Sub-sampling of frequent words: can improve both
accuracy and speed for large data sets (useful values
are in range 1e-3 to 1e-5)
• Dimensionality of the word vectors: usually more is
better, but not always
• Context (window) size: for skip-gram usually around
10, for CBOW around 5
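These options map roughly onto gensim's Word2Vec constructor (argument names assume the gensim 4.x API; the two-sentence corpus is a placeholder):

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"], ["a", "lazy", "dog"]]   # placeholder corpus

model = Word2Vec(
    sentences,
    sg=1,             # 1 = skip-gram, 0 = CBOW
    hs=0,             # 1 = hierarchical softmax
    negative=5,       # negative sampling (0 disables it)
    sample=1e-3,      # sub-sampling threshold for frequent words
    vector_size=100,  # dimensionality of the word vectors
    window=10,        # context size (around 10 for skip-gram, 5 for CBOW)
    min_count=1,
)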
Machine translation
using distributed representations
1. Build monolingual models of languages using
large amounts of text.
2. Use a small bilingual dictionary to learn a
linear projection between the languages.
3. Translate a word by projecting its vector
representation from the source language
space to the target language space.
4. Output the most similar word vector from
target language space as the translation.
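A numpy sketch of steps 2-4, with random placeholder vectors standing in for the two monolingual models and the bilingual dictionary; the projection is fitted here by ordinary least squares, whereas the paper optimises the same objective with gradient descent:

import numpy as np

n_pairs, d_src, d_tgt = 5000, 300, 300
X = np.random.randn(n_pairs, d_src)   # source-language vectors of the dictionary words
Z = np.random.randn(n_pairs, d_tgt)   # target-language vectors of their translations

# Step 2: learn a linear projection W minimising ||XW - Z||^2.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(src_vec, target_matrix):
    # Steps 3-4: project into the target space and return the nearest word index.
    z = src_vec @ W
    sims = target_matrix @ z / (np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(z))
    return int(np.argmax(sims))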
English vs Spanish
Translation accuracy

[Figure: translation accuracy for English → Spanish and English → Vietnamese]
How is this related to neuroscience?
How to calculate similarity matrix
import sys
import gensim

if len(sys.argv) < 3:
    print "Usage: matrix.py <vectorfile> <wordfile>"
    sys.exit(1)

# Load pre-trained vectors in word2vec binary format.
# (In gensim >= 4.0 this lives at gensim.models.KeyedVectors.load_word2vec_format.)
model = gensim.models.Word2Vec.load_word2vec_format(sys.argv[1], binary=True)

# Read the word list, one word per line.
with open(sys.argv[2]) as f:
    words = f.read().splitlines()

# Print the pairwise cosine similarity matrix as comma-separated rows.
for w1 in words:
    s = ""
    for w2 in words:
        if s != "": s += ","
        s += str(model.similarity(w1, w2))
    print s
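The script can then be run as, for example, python matrix.py vectors.bin words.txt > matrix.csv (the file names here are placeholders), which prints one comma-separated row of cosine similarities per word in the word list.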
Discovery of structural form - animals
Discovery of structural form - cities
