Christopher Moody
Stitch Fix, One Montgomery Tower, Suite 1200
San Francisco, California 94104, USA
chrisemoody@gmail.com
arXiv:1605.02019v1 [cs.CL] 6 May 2016

Abstract

Distributed dense word vectors have been shown to be effective at capturing token-level semantic and syntactic regularities [...] In contrast to continuous dense document representations, this formulation [...] organize document collections into a smaller set of prominent themes. In contrast to dense distributed representations, these document and topic representations are generally accessible to humans and [...]

[Figure: lda2vec model diagram for the example sentence "sally said the fox jumped over the", showing document weights over #topics, a context vector of #hidden units, and document-to-topic proportions sparsifying over training time from t=0 to t=10.]
Figure 3: Topics discovered by lda2vec in the Twenty Newsgroups dataset. The inferred topic label is
shown in the first row. The tokens with highest similarity to the topic are shown immediately below.
Note that the Twenty Newsgroups corpus contains corresponding newsgroups such as sci.space, sci.crypt, comp.windows.x and talk.politics.mideast.
Figure 4: Topics discovered by lda2vec in the Hacker News comments dataset. The inferred topic label
is shown in the first row. We form tokens from noun phrases to capture the unique vocabulary of this
specialized corpus.
Artificial sweeteners | Black holes       | Comic Sans      | Functional Programming | San Francisco
glucose               | particles         | typeface        | FP                     | New York
fructose              | consciousness     | Arial           | Haskell                | Palo Alto
HFCS                  | galaxies          | Helvetica       | OOP                    | NYC
sugars                | quantum mechanics | Times New Roman | functional languages   | New York City
sugar                 | universe          | font            | monads                 | SF
Soylent               | dark matter       | new logo        | Lisp                   | Mountain View
paleo diet            | Big Bang          | Anonymous Pro   | Clojure                | Seattle
diet                  | planets           | Baskerville     | category theory        | Los Angeles
carbohydrates         | entanglement      | serif font      | OO                     | Boston
Figure 5: Given an example token in the top row, the most similar words available in the Hacker News
comments corpus are reported.
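Similarity listings such as these are produced by ranking every token in the vocabulary by cosine similarity to the query token's vector. A minimal sketch of that lookup (the three-dimensional vectors below are toy stand-ins, not the learned Hacker News embeddings):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(query, vectors, topn=3):
    # Rank the vocabulary (excluding the query itself) by similarity to the query.
    ranked = sorted(
        (tok for tok in vectors if tok != query),
        key=lambda tok: cosine(vectors[query], vectors[tok]),
        reverse=True,
    )
    return ranked[:topn]

# Toy embedding table standing in for the trained word vectors.
vectors = {
    "Comic Sans": [0.9, 0.1, 0.0],
    "Helvetica":  [0.8, 0.2, 0.1],
    "typeface":   [0.7, 0.3, 0.0],
    "Bitcoin":    [0.0, 0.1, 0.9],
}

print(most_similar("Comic Sans", vectors))  # font-related tokens rank ahead of Bitcoin
```

In the full model the same lookup runs against the entire vocabulary, and topic vectors can be queried the same way to produce the per-topic token lists of Figure 3.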
processed datasets6 results. This allows us to capture phrases such as community policing measure and prominent figures such as Steve Jobs as single tokens. However, this tokenization procedure generates a vocabulary substantially different from the one available in the Palmetto topic coherence tool, and so we do not report topic coherences on this corpus. After preprocessing, the corpus contains 75 million tokens in 66 thousand documents with 110 thousand unique tokens. Unlike the Twenty Newsgroups analysis, word vectors are initialized randomly instead of using a library of pretrained vectors.

We train an lda2vec model using 40 topics and 256 hidden units and report the learned topics that demonstrate the themes present in the corpus. Furthermore, we demonstrate that word vectors and semantic relationships specific to this corpus are learned.

In Figure 4, five example topics discovered by lda2vec in the Hacker News corpus are listed. These topics demonstrate that the major themes of the corpus are reproduced and represented in learned topic vectors in a similar fashion as in LDA (Blei et al., 2003). The first, which we hand-label Housing Issues, has prominent tokens relating to housing policy issues such as housing supply (e.g. more housing) and costs (e.g. basic income and house prices). Another topic lists major internet portals, such as the privacy-conscious search engine Duck Duck Go (abbreviated in the corpus as DDG), as well as other major search engines (e.g. Bing) and home pages (e.g. Google+ and iGoogle). A third topic is that of the popular online currency and payment system Bitcoin, the abbreviated form of the currency btc, and the now-defunct Bitcoin trading platform Mt. Gox. A fourth topic considers salaries and compensation, with tokens such as current salary, more equity, and vesting, the process by which employees secure stock from their employers. A fifth example topic is that of technological hardware like HDMI and glossy screens and includes devices such as the Surface Pro and Mac Pro.

Figure 5 demonstrates that token similarities are learned in a similar fashion as in SGNS (Mikolov et al., 2013) but specialized to the Hacker News corpus. Tokens similar to the token Artificial sweeteners include other sugar-related tokens like fructose and food-related tokens such as paleo diet. Tokens similar to Black holes include physics-related concepts such as galaxies and dark matter.

6 A tokenized dataset is freely available at https://zenodo.org/record/49899
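The analogy queries reported in Figure 6 are scored with the 3COSMUL objective (Levy and Goldberg, 2014b) rather than by literal vector arithmetic. A minimal sketch, generalized to queries with several positive terms as in Figure 6 (the vectors are illustrative toys, not the learned embeddings, and cosines are shifted into [0, 1] so the products stay positive):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positives, negatives, vectors, eps=0.001):
    # 3COSMUL: multiply similarities to the positive terms and divide by
    # similarities to the negative terms; eps guards against division by zero.
    sim = lambda tok: (cosine(vectors[candidate], vectors[tok]) + 1) / 2
    numerator = math.prod(sim(p) for p in positives)
    denominator = math.prod(sim(n) for n in negatives) + eps
    return numerator / denominator

def analogy(positives, negatives, vectors):
    # Return the vocabulary token (excluding the query terms) with the best score.
    exclude = set(positives) | set(negatives)
    return max((tok for tok in vectors if tok not in exclude),
               key=lambda tok: cosmul_score(tok, positives, negatives, vectors))

# Toy vectors whose dimensions loosely encode (language, server-side, browser-side).
vectors = {
    "Javascript": [1.0, 0.1, 0.8],
    "browser":    [0.1, 0.0, 1.0],
    "server":     [0.1, 1.0, 0.0],
    "Node.js":    [0.9, 0.9, 0.1],
    "Helvetica":  [0.0, 0.1, 0.1],
}

# Javascript - browser + server
print(analogy(["Javascript", "server"], ["browser"], vectors))
```

Multiplying similarities, instead of adding offsets, keeps one strongly dissimilar term from dominating the query; the eps value and the [0, 1] rescaling follow common practice for this objective.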
Query                               | Result
California + technology             | Silicon Valley
digital + currency                  | Bitcoin
Javascript - browser + server       | Node.js
Mark Zuckerberg - Facebook + Amazon | Jeff Bezos
NLP - text + image                  | computer vision
Snowden - United States + Sweden    | Assange
Surface Pro - Microsoft + Amazon    | Kindle

Figure 6: Example linear relationships discovered by lda2vec in the Hacker News comments dataset. The first column indicates the example input query, and the second column indicates the token most similar to the input.

The Hacker News corpus devotes a substantial quantity of text to fonts and design, and the words most similar to Comic Sans are other popular fonts (e.g. Times New Roman and Helvetica) as well as font-related concepts such as typeface and serif font. Tokens similar to Functional Programming demonstrate similarity to other computer science-related tokens, while tokens similar to San Francisco include other large American cities as well as smaller cities located in the San Francisco Bay Area.

Figure 6 demonstrates that in addition to learning topics over documents and similarities to word tokens, linear regularities between tokens are also learned. The Query column lists a selection of tokens that when combined yield a token vector closest to the token shown in the Result column. The subtractions and additions of vectors are not evaluated literally, but instead take advantage of the 3COSMUL objective (Levy and Goldberg, 2014b). The results show that relationships between tokens important to the Hacker News community exist between the token vectors. For example, the vector for Silicon Valley is similar to both California and technology; Bitcoin is indeed a digital currency; Node.js is a technology that enables running Javascript on servers instead of in client-side browsers; Jeff Bezos and Mark Zuckerberg are the CEOs of Amazon and Facebook respectively; NLP and computer vision are fields of machine learning research primarily dealing with text and images respectively; Edward Snowden and Julian Assange are both whistleblowers who were primarily located in the United States and Sweden; and finally the Kindle and the Surface Pro are both tablets manufactured by Amazon and Microsoft respectively. In the above examples, semantic relationships between tokens encode attributes and features including location, currencies, server vs. client, leadership figures, machine learning fields, political figures, nationalities, companies, and hardware.

3.3 Conclusion

This work demonstrates a simple model, lda2vec, that extends SGNS (Mikolov et al., 2013) to build unsupervised document representations that yield coherent topics. Word, topic, and document vectors are jointly trained and embedded in a common representation space that preserves semantic regularities between the learned word vectors while still yielding sparse and interpretable document-to-topic proportions in the style of LDA (Blei et al., 2003). Topics formed in the Twenty Newsgroups corpus yield high mean topic coherences, which have been shown to correlate with human evaluations of topics (Röder et al., 2015). When applied to a Hacker News comments corpus, lda2vec discovers the salient topics within this community and learns linear relationships between words that allow it to solve word analogies in the specialized vocabulary of this corpus. Finally, we note that our method is simple to implement in automatic differentiation frameworks and can lead to more readily interpretable unsupervised representations.

References

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, March.

David Blei, Lawrence Carin, and David Dunson. 2010. Probabilistic Topic Models. IEEE Signal Processing Magazine, 27(6):55-65.

J Chang, S Gerrish, and C Wang. 2009. Reading tea leaves: How humans interpret topic models. Advances in . . . .

Laurent Charlin, Rajesh Ranganath, James McInerney, and David M Blei. 2015. Dynamic Poisson Factorization. ACM, September.
Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry Heck. 2016. Contextual LSTM (CLSTM) models for Large scale NLP tasks. arXiv.org, February.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv.org, July.

Matthew Honnibal and Mark Johnson. 2015. An Improved Non-monotonic Transition System for Dependency Parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373-1378, Stroudsburg, PA, USA. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv.org, December.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. Advances in Neural . . . , pages 3294-3302.

Quoc V Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. ICML, pages 1188-1196.

Omer Levy and Yoav Goldberg. 2014a. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302-308, Stroudsburg, PA, USA. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014b. Linguistic Regularities in Sparse and Explicit Word Representations. CoNLL, pages 171-180.

Omer Levy and Yoav Goldberg. 2014c. Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing . . . , pages 2177-2185.

Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press, Cambridge.

Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring Networks of Substitutable and Complementary Products. ACM, New York, New York, USA, August.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. NIPS, pages 3111-3119.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2012. Scikit-learn: Machine Learning in Python. arXiv.org, January.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, Stroudsburg, PA, USA. Association for Computational Linguistics.

Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the Space of Topic Coherence Measures. ACM, February.

S Tokui, K Oono, S Hido, and J Clayton. 2015. Chainer: a Next-Generation Open Source Framework for Deep Learning. learningsys.org.