Sparse vector products

[Figure: the token «word» (id 1337) encoded as a 1-hot vector over n tokens — all zeros except a 1 at position 1337 — feeding a linear layer with one weight row per token, W_1 … W_n]
Sparse vector products

[Figure: the 1-hot vector for «word» (id 1337) multiplied by the weight matrix W_ij, i = 1…n, j = 1…h — what does this dot product compute?]
Embedding

[Figure: the same product resolved — since only component 1337 of the 1-hot vector is non-zero, the dot product with W_ij (i = 1…n, j = 1…h) simply returns row 1337 of W]
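A minimal sketch of this equivalence, assuming NumPy; the sizes n, h match the slide's notation, and the concrete values are illustrative:

import numpy as np

n, h = 10_000, 128            # vocabulary size, embedding size (assumed values)
W = np.random.randn(n, h)

one_hot = np.zeros(n)
one_hot[1337] = 1.0           # «word» has id 1337

# The full dot product over all n rows...
via_dot = one_hot @ W
# ...equals simply taking row 1337 — no multiplication needed.
via_lookup = W[1337]
assert np.allclose(via_dot, via_lookup)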
Embedding: word2vec

“Peace is a lie, there is only passion”

[Figure: the word2vec architecture — a 1-hot input over n tokens is projected by W_ij (i = 1…n, j = 1…h) into a hidden layer of h units, then by W_jk (j = 1…h, k = 1…n) back to an n-way output over the vocabulary, trained to predict the 1-hot vectors of surrounding context words]
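A minimal skip-gram-style sketch of this two-matrix architecture, assuming PyTorch; the sizes and word ids are made up for illustration:

import torch
import torch.nn as nn

n, h = 10_000, 128                     # vocabulary and hidden sizes (assumed)

model = nn.Sequential(
    nn.Embedding(n, h),                # W_ij: 1-hot input -> h-unit hidden layer
    nn.Linear(h, n, bias=False),       # W_jk: hidden layer -> n output scores
)

center = torch.tensor([1337])          # id of the input word
logits = model(center)                 # shape (1, n): one score per vocabulary token
context = torch.tensor([42])           # id of an observed context word (made up)
loss = nn.functional.cross_entropy(logits, context)  # softmax over all n tokens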
Embedding: word2vec

The distributional hypothesis: similar context = similar meaning.

[Figure: the same architecture, annotated — on the input side, the “embedding layer” just takes a row from the matrix (super fast); on the output side, the softmax problem: a dense layer over ~10^5 units (your CPUs gonna burn); in between, the hidden layer of h units]
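A minimal sketch of that asymmetry, assuming NumPy (sizes are illustrative): the input side is one row fetch, while the output softmax must touch every vocabulary token on every training example:

import numpy as np

n, h = 100_000, 128           # ~10^5 vocabulary tokens, hidden size (assumed)
W_in = np.random.randn(n, h)
W_out = np.random.randn(h, n)

# Input side: the "embedding layer" is a single row fetch — O(h), super fast.
hidden = W_in[1337]

# Output side: the softmax needs scores for ALL n tokens — O(n·h) per example.
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()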
More word embeddings

Faster softmax:
• hierarchical softmax, negative sampling (sketch below), …
• learn more

Alternative models: GloVe

Sentence level:
• Doc2vec, skip-thought (using RNNs)
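A minimal sketch of negative sampling, assuming NumPy; the function name, number of negatives, and word ids are made up for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(hidden, W_out, pos_id, neg_ids):
    # Score only the true context word and a few sampled "negative" words,
    # instead of computing a softmax over all n vocabulary tokens.
    pos = sigmoid(hidden @ W_out[:, pos_id])      # pushed towards 1
    neg = sigmoid(-(hidden @ W_out[:, neg_ids]))  # negatives pushed towards 0
    return -np.log(pos) - np.log(neg).sum()

h, n = 128, 10_000
hidden = np.random.randn(h)
W_out = np.random.randn(h, n)
loss = negative_sampling_loss(hidden, W_out, pos_id=42,
                              neg_ids=np.random.randint(0, n, size=5))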
To be continued...
in the NLP course