
Tensor Decomposition for Topic Models: An Overview and Implementation

Chicheng Zhang, E.D. Gutiérrez, Alexander Asplund, Linda Pescatore
December 7, 2012

1 Introduction

The goal of a topic model is to characterize observed data in terms of a much smaller set of unobserved topics. Topic models have proven especially popular for information retrieval. Latent Dirichlet Allocation (LDA) is the most popular generative model used for topic modeling. Learning the optimal parameters of the LDA model efficiently, however, is an open question. As [2] point out, the traditional techniques for learning latent variables have major disadvantages when it comes to topic modeling. Straightforward maximum likelihood estimation does not produce a closed-form solution for LDA, and its approximations are NP-hard. Approaches relying on expectation-maximization (EM) have been the most popular way of learning LDA [4]. Unfortunately, such approaches suffer from a lack of guarantees about the quality of the locally optimal solutions they produce, and they also exhibit slow convergence. Another class of approaches, based on Markov chain Monte Carlo (MCMC), is prone to failure due, for instance, to non-ergodicity, and can also exhibit slow mixing. For these reasons, the tensor decomposition approach of [1], which operates on higher-order moments of the data, seems like a promising option for recovering the topic vectors. This approach has been applied successfully to other latent variable models, such as hidden Markov models and Gaussian mixture models [2][3].

1.1 Outline

The remainder of this paper, broadly based on [1] and [2], is structured as follows. In Section 2 we present a description of the LDA generative model, as well as a formulation of this model in terms of observed binary vectors, a latent topic matrix, and a latent topic mixture vector. In Section 3 we describe how tensor decomposition methods can be used to find the latent topic vectors. In Section 4 we outline several approaches for implementing the tensor decomposition and for estimating the empirical moments of the data, and in Section 5 we sketch sample complexity bounds for these approaches. In Section 6 we present experiments that evaluate the performance of the tensor decomposition methods. Finally, we conclude with a discussion of our results and of future work needed to evaluate the performance of tensor decomposition methods.

2 Latent Dirichlet Allocation

2.1 General Description

Suppose that we want to model multiple documents, each composed of multiple words, in terms of a smaller number of latent topics. Our observed data are merely the set of words occurring in each document. In the Latent Dirichlet Allocation (LDA) model, we make the simplifying assumption that each document is a bag of words, where the precise ordering of the words does not affect the semantic content of the document. We model each document as containing a mixture of $K$ unobserved topics, described by a topic mixture vector $h \in \Delta^{K-1}$ (where $\Delta^{K-1}$ denotes the $K$-dimensional simplex). For each document, $h$ is drawn independently according to a Dirichlet distribution with concentration parameter vector $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_K]$. We let $\alpha_0 = \sum_{k=1}^K \alpha_k$ denote the sum of this parameter vector, which is sometimes referred to as the precision of the Dirichlet distribution. We then assume that the topic $k$ pertaining to a word $t$ in the document is drawn independently from a multinomial with parameters $h$. Finally, the word type for that word is drawn independently from a multinomial with parameters given by the topic distribution vector $\mu_k \in \Delta^{|Voc|-1}$ that corresponds to topic $k$, where $|Voc|$ is the size of the lexicon (i.e., the number of word types) in the entire data set.
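As a concrete illustration, here is a minimal numpy sketch of this generative process; all sizes, seeds, and parameter values below are hypothetical and chosen only for the example.

import numpy as np

rng = np.random.default_rng(0)

K, vocab_size = 3, 1000            # hypothetical number of topics and lexicon size
num_docs, doc_length = 5, 50       # hypothetical corpus dimensions

alpha = np.array([0.5, 0.3, 0.2])                     # Dirichlet concentration parameters
Phi = rng.dirichlet(np.ones(vocab_size), size=K).T    # |Voc| x K topic matrix, columns mu_k

docs = []
for _ in range(num_docs):
    h = rng.dirichlet(alpha)                          # topic mixture for this document
    topics = rng.choice(K, size=doc_length, p=h)      # topic assignment for each token
    words = np.array([rng.choice(vocab_size, p=Phi[:, z]) for z in topics])
    docs.append(words)                                # observed word types (bag of words)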

2.2 Formulation of the Data

To make the tensor decomposition approach clear, we will need to represent our data as a set of binary vectors. Denote the $i$th $|Voc|$-dimensional canonical basis vector by $e_i$. Then let $x_t = e_i$ if the $t$th term in a document belongs to word type $i$. We also collect the $\mu_k$ vectors into a $|Voc|$-by-$K$ latent topic matrix $\Phi = (\mu_1, \ldots, \mu_K)$.

3 Tensor Decomposition Approach: Intuition

3.1 Why cross-moments?

With the above formulation in place, we can begin to make note of some properties of the hidden moments of the data. By the assumption that the topic mixture vector $h$ is drawn according to the Dirichlet distribution, we know that the expected value of the $k$th element of this vector is $E[h_k] = \alpha_k / \alpha_0$.

Due to our formulation above, for a word $x_1$ chosen from the set $\{x_t\}$, the expectation of $x_1$ conditional on the topic mixture vector $h$ is
\[ E[x_1 \mid h] = \Phi h = \sum_{k=1}^{K} h_k\, \mu_k. \]

This equation exhibits clearly the relationship between the observed $x_1$ and the hidden $\mu_k$, but it assumes knowledge of $h$, which is hidden. However, we can observe the marginal expectation $E[x_1]$, and recalling the relationship between marginal and conditional expectations we see that
\[ E[x_1] = E[E[x_1 \mid h]] = \Phi\, E[h] = \sum_{k=1}^{K} \frac{\alpha_k}{\alpha_0}\, \mu_k. \]

This exposition begins to give us an idea of how the moments of the data could help us recover the latent vectors $\mu_k$. However, note that under our topic model, the higher moments of a single word $x_1$ are trivial. Recall that a crucial part of the structure of the LDA model is that all the words in a document share a topic mixture vector $h$, but within-document information does not figure in the higher moments of single words. For this reason, we derive the cross-moments of pairs and triples of distinct words $x_1, x_2, x_3$ in $\{x_t\}$. Note that such pairs and triples of words are conditionally independent given $h$. This allows us to write the second-order cross-moment in terms of $\Phi$ and $h$ as
\[ E[x_1 \otimes x_2] = \Phi\, E[h \otimes h]\, \Phi^T = E[h \otimes h](\Phi^T, \Phi^T), \]
and similarly we can write the third-order cross-moment as
\[ E[x_1 \otimes x_2 \otimes x_3] = E[h \otimes h \otimes h](\Phi^T, \Phi^T, \Phi^T). \]
We have now written our observed cross-moments in terms of $\Phi$ and the cross-moments of $h$. Fortunately, due to our assumption that the topic mixture vector $h$ is drawn from a Dirichlet distribution, we can derive closed-form expressions for the cross-moments of $h$ in terms of the Dirichlet parameters (see Appendix A). Using these expressions, we can explicitly write our observed cross-moments solely in terms of $\Phi$ and $\alpha$:
\[ E[x_1 \otimes x_2] = \frac{1}{\alpha_0(\alpha_0+1)}\Big( \Phi\alpha \otimes \Phi\alpha + \sum_{k=1}^{K} \alpha_k\, (\mu_k \otimes \mu_k) \Big) \]
\[ E[x_1 \otimes x_2 \otimes x_3] = \frac{1}{\alpha_0(\alpha_0+1)(\alpha_0+2)}\Big( \Phi\alpha \otimes \Phi\alpha \otimes \Phi\alpha + \sum_{k=1}^{K} \alpha_k\, (\mu_k \otimes \mu_k \otimes \Phi\alpha + \mu_k \otimes \Phi\alpha \otimes \mu_k + \Phi\alpha \otimes \mu_k \otimes \mu_k) + \sum_{k=1}^{K} 2\alpha_k\, (\mu_k \otimes \mu_k \otimes \mu_k) \Big), \]
where $\Phi\alpha = \sum_{k=1}^{K} \alpha_k \mu_k$.

The last terms in the two expressions above are especially promising candidates for recovering the latent vectors $\mu_k$. By the following algebraic manipulations, we can isolate these terms as non-central moments of $x_1$, $x_2$, and $x_3$:
\[ M_1 := E[x_1] = \sum_{k=1}^{K} \frac{\alpha_k}{\alpha_0}\, \mu_k \]
\[ M_2 := E[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0+1}\, (M_1 \otimes M_1) \]
\[ M_3 := E[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0+2}\big( E[x_1 \otimes x_2 \otimes M_1] + E[x_1 \otimes M_1 \otimes x_2] + E[M_1 \otimes x_1 \otimes x_2] \big) + \frac{2\alpha_0^2}{(\alpha_0+2)(\alpha_0+1)}\, (M_1 \otimes M_1 \otimes M_1) \]
where $M_i$ denotes the $i$th non-central moment. Of special note is that these manipulations involve only the observed moments themselves and the scalar parameter $\alpha_0$, not the entire $\alpha$ vector, which is a necessary input for EM methods for LDA. Our newly defined non-central moments can now be written as linear combinations of tensor powers of the $\mu_k$ vectors:
\[ M_2 = \sum_{k=1}^{K} \frac{\alpha_k}{(\alpha_0+1)\alpha_0}\, (\mu_k \otimes \mu_k) \qquad (1) \]
\[ M_3 = \sum_{k=1}^{K} \frac{2\alpha_k}{(\alpha_0+2)(\alpha_0+1)\alpha_0}\, (\mu_k \otimes \mu_k \otimes \mu_k). \]
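To make these expressions concrete, the following minimal sketch builds the population $M_2$ and $M_3$ of equation (1) and its third-order companion directly from assumed values of $\Phi$ and $\alpha$; the sizes are hypothetical, and np.einsum expresses the tensor powers.

import numpy as np

rng = np.random.default_rng(1)
K, vocab_size = 3, 20                                  # hypothetical sizes
alpha = np.array([2.0, 1.0, 0.5])
alpha0 = alpha.sum()
Phi = rng.dirichlet(np.ones(vocab_size), size=K).T     # columns are the mu_k

# M2 = sum_k alpha_k / ((alpha0 + 1) alpha0) * mu_k (x) mu_k
w2 = alpha / ((alpha0 + 1) * alpha0)
M2 = np.einsum('k,ik,jk->ij', w2, Phi, Phi)

# M3 = sum_k 2 alpha_k / ((alpha0 + 2)(alpha0 + 1) alpha0) * mu_k (x) mu_k (x) mu_k
w3 = 2 * alpha / ((alpha0 + 2) * (alpha0 + 1) * alpha0)
M3 = np.einsum('k,ik,jk,lk->ijl', w3, Phi, Phi, Phi)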

3.2 Whitening Matrix

We now have moments, defined from the observed data and $\alpha_0$, that can be expressed as linear combinations of tensor powers of the variables of interest. If we can symmetrize our moments in some way and express them in terms of orthogonal matrices, techniques for decomposing into such matrices can in principle be used to recover the latent variables. Suppose we find any matrix $W$ that whitens the second moment: $M_2(W, W) := W^T M_2 W = I$. Using (1), this can be written as
\[ M_2(W, W) = \sum_{k=1}^{K} \frac{\alpha_k}{(\alpha_0+1)\alpha_0}\, W^T\mu_k \otimes W^T\mu_k = \sum_{k=1}^{K} \left( \sqrt{\tfrac{\alpha_k}{(\alpha_0+1)\alpha_0}}\, W^T\mu_k \right) \otimes \left( \sqrt{\tfrac{\alpha_k}{(\alpha_0+1)\alpha_0}}\, W^T\mu_k \right), \]
and defining $v_k := \sqrt{\tfrac{\alpha_k}{(\alpha_0+1)\alpha_0}}\, W^T\mu_k$, we see that $M_2(W, W) = \sum_{k=1}^{K} v_k \otimes v_k = I$.

In other words, the $v_k$ are orthonormal vectors, and the whitened moment matrix is amenable to an orthogonal matrix decomposition. From these $v_k$ it would be possible to recover the $\mu_k$ vectors, as the $v_k$'s are merely linear combinations of the $\mu_k$'s and $W$. However, note that the solution produced by such a decomposition would not be unique in the general case, only in the case where no two of the corresponding eigenvalues are equal. Fortunately, applying the same whitening matrix $W$ to the third moment yields the following:
\[ M_3(W, W, W) = \sum_{k=1}^{K} \frac{2\alpha_k}{(\alpha_0+2)(\alpha_0+1)\alpha_0}\, W^T\mu_k \otimes W^T\mu_k \otimes W^T\mu_k = \sum_{k=1}^{K} \frac{2}{\alpha_0+2}\sqrt{\frac{(\alpha_0+1)\alpha_0}{\alpha_k}}\, (v_k \otimes v_k \otimes v_k). \]
Thus, the observed third moment can also be decomposed in terms of orthonormal vectors $v_k$ that are linear combinations of the $\mu_k$'s and a whitening matrix $W$ that depends only on the observed data. The details of this tensor decomposition are covered in the Implementation section below.
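The following sketch (continuing the hypothetical numpy setup above) computes a whitening matrix from the rank-$K$ eigendecomposition of $M_2$ and forms the whitened tensor $T = M_3(W, W, W)$; it is an illustration, not our exact experimental code.

import numpy as np

def whiten(M2, M3, K):
    """Return W such that W^T M2 W = I_K and the whitened tensor T = M3(W, W, W)."""
    eigvals, eigvecs = np.linalg.eigh(M2)              # M2 is symmetric and PSD
    top = np.argsort(eigvals)[::-1][:K]                # keep the top-K eigenpairs
    A, lam = eigvecs[:, top], eigvals[top]
    W = A / np.sqrt(lam)                               # W = A diag(lam)^{-1/2}
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)    # multilinear transform of M3
    return W, T

# Example usage with the population moments built earlier:
# W, T = whiten(M2, M3, K)
# assert np.allclose(W.T @ M2 @ W, np.eye(K), atol=1e-8)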

4 Implementation

4.1 Empirical Estimation of Moments and Whitening Matrix

While the formulation of the data in terms of binary basis vectors $x_t$ is helpful for developing intuition for our technique, it is quite cumbersome from an implementation point of view: the storage complexity of such an implementation grows linearly in the number of word tokens. Since the order of words within documents does not matter for LDA, a much more compact representation in terms of word-type count vectors is possible; such a representation grows only in the number of word types. We have derived estimates of our empirical moments and their multilinear products in terms of such count vectors; these estimates are given in Appendix B. Another matter of practical concern is estimating the whitening matrix $W$. As [1] point out, if we take our empirical second moment and find its singular value decomposition $\hat{M}_2 = A \Lambda A^T$, then the matrix $W = A \Lambda^{-1/2}$ fulfills the property of whitening $\hat{M}_2$. Thus, $W$ can be efficiently estimated from our empirical moment estimators.
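A sketch of the count-vector estimators and the whitening step follows; it is a simplified version of the formulas in Appendix B, where counts is an assumed documents-by-word-types matrix and every document is assumed to contain at least two tokens.

import numpy as np

def empirical_first_two_moments(counts, alpha0):
    """counts: (num_docs, vocab_size) array of per-document word-type counts.
    Returns the empirical M1 and M2 built from count vectors."""
    counts = np.asarray(counts, dtype=float)
    num_docs, vocab = counts.shape
    n = counts.sum(axis=1)                              # word tokens per document
    M1 = (counts / n[:, None]).mean(axis=0)

    pairs = counts[:, :, None] * counts[:, None, :]     # c_l c_l^T for each document
    idx = np.arange(vocab)
    pairs[:, idx, idx] -= counts                        # subtract diag(c_l)
    E_x1x2 = (pairs / (n * (n - 1))[:, None, None]).mean(axis=0)

    M2 = E_x1x2 - (alpha0 / (alpha0 + 1)) * np.outer(M1, M1)
    return M1, M2

def whitening_from_M2(M2_hat, K):
    """W = A Lambda^{-1/2}, built from the top-K eigenpairs of the empirical M2."""
    vals, vecs = np.linalg.eigh(M2_hat)
    top = np.argsort(vals)[::-1][:K]
    return vecs[:, top] / np.sqrt(vals[top])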

4.2 Tensor Decomposition Approaches

Suppose the empirical versions of $M_1$, $M_2$, and $M_3$ have been computed; recall that the goal of the tensor decomposition is to recover $\mu_k$, $k = 1, 2, \ldots, K$. Several approaches have been introduced in [1], [2], [3], and [5] and are reviewed below. Our experiments focus on variants of the first two approaches only.

4.2.1 Tensor Power Method

First we find the whitening matrix $W$, defined as above, and define $T := M_3(W, W, W)$. A power-deflation approach can be used to recover the $v_k$ from this tensor: if we start with $u_0 = \sum_{k=1}^{K} c_k v_k + \epsilon$, then after $t$ iterations of $u_{t+1} = T(I, u_t, u_t)$ the (unnormalized) iterate is $u_t = \sum_{k=1}^{K} \lambda_k^{2^t - 1} c_k^{2^t} v_k$, so if $c_k \lambda_k$ is initially largest for some $k$, that component dominates throughout the run, and the convergence speed is governed by the ratio between the largest and second-largest values of $|c_k \lambda_k|$. Also note that $\lambda_k := T(v_k, v_k, v_k) = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_k}}$ can be used to recover $\alpha_k$. After a pair $(v_k, \lambda_k)$ is extracted, we can deflate to the new tensor $T - \lambda_k\, v_k \otimes v_k \otimes v_k$ and repeat recursively. Once the $v_k$ are recovered, since $v_k = \sqrt{\tfrac{\alpha_k}{\alpha_0(\alpha_0+1)}}\, W^T \mu_k$ and $\mu_k$ lies in the column space of $W$, we can recover $\mu_k = W c_k$ with $c_k = \sqrt{\tfrac{\alpha_0(\alpha_0+1)}{\alpha_k}}\,(W^T W)^{-1} v_k$.
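A minimal sketch of the power-deflation step follows (a hypothetical helper, not our exact experimental code); T is the whitened $K \times K \times K$ tensor and the routine returns the recovered $(v_k, \lambda_k)$ pairs.

import numpy as np

def tensor_power_deflation(T, K, n_restarts=10, n_iters=100, seed=None):
    """Recover (eigenvector, eigenvalue) pairs of a symmetric whitened tensor T
    of shape (K, K, K) by power iteration with deflation."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    vecs, lams = [], []
    for _ in range(K):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            u = rng.normal(size=K)
            u /= np.linalg.norm(u)
            for _ in range(n_iters):
                u = np.einsum('ijk,j,k->i', T, u, u)    # u <- T(I, u, u)
                u /= np.linalg.norm(u)
            lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # lambda = T(v, v, v)
            if lam > best_lam:
                best_v, best_lam = u, lam
        vecs.append(best_v)
        lams.append(best_lam)
        T = T - best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)  # deflate
    return np.column_stack(vecs), np.array(lams)        # columns are the v_k

The topic vectors $\mu_k$ can then be reconstructed from the returned $v_k$, $\lambda_k$, and $W$ as described above.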
4.2.2 SVD Method

The first two steps are similar to the tensor power approach. Once $T = M_3(W, W, W)$ is found, we project $T$ onto a matrix using a vector $\theta$:
\[ T(I, I, \theta) = \sum_{k=1}^{K} \lambda_k\, (v_k^T \theta)\, v_k \otimes v_k. \]
This can also be treated as a thin SVD of $T(I, I, \theta)$:
\[ T(I, I, \theta) = U S U^T = (v_1, \ldots, v_K)\, \mathrm{diag}\big(\lambda_1 (v_1^T\theta), \ldots, \lambda_K (v_K^T\theta)\big)\, (v_1, \ldots, v_K)^T. \]
So if we take the SVD of the matrix $T(I, I, \theta)$, then as long as $\lambda_1(v_1^T\theta), \ldots, \lambda_K(v_K^T\theta)$ are distinct (in the empirical version we require them to have a not-too-small gap; note that the tensor power approach does not have this problem), we can recover the $v_k$.

Following the tensor power approach, we can then recover first the $\mu_k$ and then the $\alpha_k$.
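A sketch of the projection-and-SVD step described above (assuming T is the whitened tensor and using a random unit vector for $\theta$):

import numpy as np

def svd_method(T, seed=None):
    """Recover the orthonormal v_k from the whitened tensor T by projecting it
    along a random direction theta and eigendecomposing the resulting matrix."""
    rng = np.random.default_rng(seed)
    K = T.shape[0]
    theta = rng.normal(size=K)
    theta /= np.linalg.norm(theta)
    M = np.einsum('ijk,k->ij', T, theta)       # T(I, I, theta), a symmetric K x K matrix
    eigvals, eigvecs = np.linalg.eigh(M)       # columns of eigvecs are the v_k (up to sign)
    return eigvecs, eigvals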
4.2.3 Pseudo-Inverse Method

Consider $G = (M_2)^{-1/2}\, M_3(I, I, \theta)\, (M_2)^{-1/2}$, where $(M_2)^{-1/2}$ denotes the pseudo-inverse square root. Suppose
\[ \Big( \sqrt{\tfrac{\alpha_1}{\alpha_0(\alpha_0+1)}}\,\mu_1, \ldots, \sqrt{\tfrac{\alpha_K}{\alpha_0(\alpha_0+1)}}\,\mu_K \Big) = U S V^T \]
is the thin SVD; then $M_2 = U S^2 U^T$ and
\[ M_3(I, I, \theta) = \frac{2}{\alpha_0+2}\, U S V^T\, \mathrm{diag}(\theta^T\mu_1, \ldots, \theta^T\mu_K)\, V S U^T, \]
so that
\[ G = \frac{2}{\alpha_0+2}\, U V^T\, \mathrm{diag}(\theta^T\mu_1, \ldots, \theta^T\mu_K)\, V U^T \]
can itself be treated as a thin SVD. So if we take the SVD of $G$ and keep the singular vectors corresponding to nonzero singular values, we obtain $(r_1, \ldots, r_K) = U V^T$ up to column permutation and signs. We then calculate $U S U^T (r_1, \ldots, r_K) = (M_2)^{1/2}(r_1, \ldots, r_K)$, which equals $U S V^T$ up to column permutation and signs; that is, its $k$th column is $\pm\sqrt{\alpha_k / (\alpha_0(\alpha_0+1))}\,\mu_k$. We can therefore recover
\[ \mu_k = \frac{(M_2)^{1/2} r_k}{\vec{1}^{\,T} (M_2)^{1/2} r_k}, \]
since the entries of $\mu_k$ sum to one, and recover $\alpha_k$ from the normalization constant via $\vec{1}^{\,T} (M_2)^{1/2} r_k = \pm\sqrt{\alpha_k / (\alpha_0(\alpha_0+1))}$. (Actually, we can also use $M_1$ to recover $\alpha$ as well, because $M_1 = \frac{1}{\alpha_0}(\mu_1, \ldots, \mu_K)\,\alpha$, so $\alpha = \alpha_0\, (\mu_1, \ldots, \mu_K)^{+} M_1$.)

4.2.4 Eigenvector Method

Suppose $U$ is an orthonormal basis of $M_2$'s column space. (We can take the SVD of $M_2$ to find $U$; more generally, if we can obtain any $U$ whose column space equals $M_2$'s column space, which is also $\Phi$'s column space, the same technique applies.) Consider the following matrix:
\[ (U^T M_3(I, I, \theta) U)(U^T M_2 U)^{-1} = \frac{2\,\alpha_0(\alpha_0+1)}{\alpha_0(\alpha_0+1)(\alpha_0+2)}\, (U^T\Phi)\,\mathrm{diag}(\alpha)\,\mathrm{diag}(\Phi^T\theta)\,(U^T\Phi)^T \big((U^T\Phi)\,\mathrm{diag}(\alpha)\,(U^T\Phi)^T\big)^{-1} = \frac{2}{\alpha_0+2}\, (U^T\Phi)\,\mathrm{diag}(\Phi^T\theta)\,(U^T\Phi)^{-1}. \]
So if we extract the eigenvectors of $(U^T M_3(I, I, \theta) U)(U^T M_2 U)^{-1}$ as $(r_1, \ldots, r_K)$, we can see that they are the columns of $U^T\Phi$, up to column permutation and scaling, and then $U r_k = U U^T \mu_k = \mu_k$ up to scaling. Note that with this method we cannot directly recover $\alpha$, and we can only recover the $\mu_k$ up to an explicit normalization, because each eigenvector has one degree of freedom in its scaling.
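A sketch of this eigenvector step (a hypothetical helper; M2_hat is the empirical second moment and M3_theta the empirical matrix $\hat{M}_3(I, I, \theta)$):

import numpy as np

def eigenvector_method(M2_hat, M3_theta, K):
    """Recover topic vectors (up to normalization) from the eigenvectors of
    (U^T M3(I,I,theta) U)(U^T M2 U)^{-1}."""
    vals, vecs = np.linalg.eigh(M2_hat)
    U = vecs[:, np.argsort(vals)[::-1][:K]]     # orthonormal basis of M2's column space
    B = (U.T @ M3_theta @ U) @ np.linalg.inv(U.T @ M2_hat @ U)
    _, R = np.linalg.eig(B)                     # eigenvectors, up to permutation/scaling
    Mu = U @ np.real(R)                         # candidate mu_k, up to scaling
    return Mu / Mu.sum(axis=0, keepdims=True)   # normalize each column to sum to one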

4.2.5 Eigenvalue Method Using Simultaneous Diagonalization

Same as the eigenvector approach, but we consider eigenvalues instead of eigenvectors. Note that if we carry out the diagonalization with different values of $\theta$, we get different $\mathrm{diag}(\Phi^T\theta)$. To randomly choose $\theta_k$, $k = 1, \ldots, K$, for simplicity we choose $\eta_k$, $k = 1, \ldots, K$ uniformly on the unit sphere and set $\theta_k = U \eta_k$. Denote $\Theta = (\theta_1, \ldots, \theta_K)^T$. For each topic $k$ we then observe the $K$-vector $L_k$ whose entries are (proportional to) $\theta_i^T \mu_k$, $i = 1, \ldots, K$. Note that $\mu_k = U c_k$, so we can recover $c_k$, because
\[ (L_1, \ldots, L_K) \propto \Theta\, (\mu_1, \ldots, \mu_K) = \begin{pmatrix} \eta_1^T \\ \vdots \\ \eta_K^T \end{pmatrix} U^T U\, (c_1, \ldots, c_K) = (\eta_1, \ldots, \eta_K)^T (c_1, \ldots, c_K), \]
so that $(c_1, \ldots, c_K) \propto \big((\eta_1, \ldots, \eta_K)^T\big)^{-1} (L_1, \ldots, L_K)$ and $(\mu_1, \ldots, \mu_K) = U (c_1, \ldots, c_K)$, which can then be normalized column-wise. A subtle issue is that we must diagonalize these matrices simultaneously. For example, when working with empirical moments, we can use one single $P$ that diagonalizes $(U^T \hat{M}_3(I, I, U\eta_1) U)(U^T \hat{M}_2 U)^{-1}$; then, although $P^{-1} (U^T \hat{M}_3(I, I, U\eta_i) U)(U^T \hat{M}_2 U)^{-1} P$ is not perfectly diagonal for the other $\eta_i$, $i \geq 2$, we still read off its diagonal elements. Note that if we diagonalized the matrices individually, the order of the eigenvalues would be shuffled within each matrix, so that we could not safely recover the $\mu_k$.
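A sketch of the simultaneous-diagonalization step (a hypothetical helper; M3_proj is assumed to be a function returning the empirical $\hat{M}_3(I, I, \theta)$ for a given $\theta$):

import numpy as np

def simultaneous_diagonalization(M2_hat, M3_proj, K, seed=None):
    """Recover topic vectors by diagonalizing one matrix and reading off the
    diagonals of the others in the same basis."""
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(M2_hat)
    U = vecs[:, np.argsort(vals)[::-1][:K]]              # basis of M2's column space
    inner_inv = np.linalg.inv(U.T @ M2_hat @ U)
    etas = rng.normal(size=(K, K))
    etas /= np.linalg.norm(etas, axis=0, keepdims=True)  # columns eta_k on the unit sphere
    B = [(U.T @ M3_proj(U @ etas[:, i]) @ U) @ inner_inv for i in range(K)]
    _, P = np.linalg.eig(B[0])                           # diagonalize the first matrix only
    P_inv = np.linalg.inv(P)
    # Row i of L holds the diagonal of P^{-1} B_i P, i.e. theta_i^T mu_k up to a constant.
    L = np.vstack([np.real(np.diag(P_inv @ B[i] @ P)) for i in range(K)])
    C = np.linalg.solve(etas.T, L)                       # mu_k = U c_k, solve for the c_k
    Mu = U @ C
    return Mu / Mu.sum(axis=0, keepdims=True)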

5 Theoretical Guarantees: Sample Complexity

First note that $\hat M_2$ ($\hat M_3$) will converge to $M_2$ ($M_3$), as can be seen by considering their vectorizations and applying McDiarmid's inequality. With probability $1 - \delta$,
\[ \sum_{i=1}^{d}\sum_{j=1}^{d} \big(\hat M_{2\,i,j} - M_{2\,i,j}\big)^2 < \frac{\big(1 + \sqrt{\ln(1/\delta)}\big)^2}{N}, \qquad \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} \big(\hat M_{3\,i,j,k} - M_{3\,i,j,k}\big)^2 < \frac{\big(1 + \sqrt{\ln(1/\delta)}\big)^2}{N}. \]
It is then straightforward to see that $\|M_2 - \hat M_2\| < E_P$ and $\|M_3(I, I, \theta) - \hat M_3(I, I, \theta)\| < E_T \|\theta\|$, with
\[ E_P = E_T = \frac{1 + \sqrt{\ln(1/\delta)}}{\sqrt{N}}. \]
Let $\hat W$ whiten $\hat M_2$ and set $W = \hat W (\hat W^T M_2 \hat W)^{-1/2}$; then $W$ whitens $M_2$, but its range may not equal $\mathrm{range}(M_2) = \mathrm{range}(\Phi)$. Matrix perturbation theory yields the unsurprising result that, for $E_P$ small,
\[ \|\hat W^T - W^T\| \leq \frac{4 E_P}{\sigma_K(\tilde\Phi)^2}, \qquad \|\hat W - W\| \leq \frac{6\,\sigma_1(\tilde\Phi)}{\sigma_K(\tilde\Phi)^2}\, E_P, \]
where $\tilde\Phi$ is the matrix whose $k$th column is $\sqrt{\alpha_k/(\alpha_0(\alpha_0+1))}\,\mu_k$ and $\sigma_k(\cdot)$ denotes the $k$th largest singular value. Next, we consider the tensor $\hat M_3(\hat W, \hat W, \hat W)$ that we are going to decompose. Then
\[ \|\hat M_3(\hat W, \hat W, \hat W) - M_3(W, W, W)\| < c\left( \frac{(\alpha_0+2)^{1/2}}{p_{\min}^{3/2}\,\sigma_K(\tilde\Phi)^2}\, E_P + \frac{(\alpha_0+2)^{3/2}}{p_{\min}^{3/2}\,\sigma_K(\tilde\Phi)^3}\, E_T \right) \]
for a universal constant $c$, where $p_{\min} = \min_k \alpha_k/\alpha_0$. Denote this deviation by $E$. Because $W$ whitens $M_2$, $M_3(W, W, W)$ has an orthogonal decomposition. Then for the SVD and tensor power decompositions of $\hat M_3(\hat W, \hat W, \hat W)$ we have the following results. For the SVD method, with probability $3/4$,
\[ \|\hat v_i - v_i\| < c_1 K^3 \sqrt{\alpha_0 + 2}\; E, \]
and for the tensor power method, with probability $3/4$,
\[ \|\hat v_i - v_i\| < c_2 \sqrt{\alpha_0 + 2}\; E. \]
If we look at the reconstruction accuracy, the triangle inequality bounds $\|\hat\mu_i - \mu_i\|$ by a sum of four terms involving $\|(\hat W - W)^T\|$, $\|\hat v_i - v_i\|$, and the difference of the normalization constants $|1/\hat Z_i - 1/Z_i|$. It turns out that the second and fourth terms dominate, and with probability $1 - \delta$ over the random examples given, the reconstruction error is bounded as follows. For the SVD approach, with probability $3/4$ over the randomness of the choice of $\theta$:
\[ \frac{c_1 K^3 (\alpha_0+2)^2}{p_{\min}^2\, \sigma_K(\tilde\Phi)^3}\,\cdot\,\frac{1 + \sqrt{\ln(1/\delta)}}{\sqrt{N}}; \]
for the tensor power approach, with probability $3/4$ over the randomness of the iteration start point:
\[ \frac{c_2 (\alpha_0+2)^2}{p_{\min}^2\, \sigma_K(\tilde\Phi)^3}\,\cdot\,\frac{1 + \sqrt{\ln(1/\delta)}}{\sqrt{N}}. \]
One remarkable aspect of these results is that the error bound of the SVD method depends polynomially on $K$ (as $K^3$), while the error bound of the tensor power method does not depend on $K$ at all. For both methods, these results guarantee that the error falls off inversely with the square root of the number of word tokens $N$ in the sample.

6 Experiments

6.1 Datasets

We decided to test the empirical algorithm on two real-world datasets. The first dataset is the Classic3/Classic4 dataset. Classic4 comprises four different collections of abstracts: CACM, CISI, CRAN, and MED. These collections roughly correspond to the topics of computer science, information science, aeronautics, and medicine, respectively. Classic3 is the same as Classic4, with the exclusion of CACM. The second dataset we used is the 20Newsgroups dataset. It consists of postings on 20 Usenet newsgroups, on diverse topics such as computers, religion, and politics. In order to evaluate the performance of the algorithm quantitatively, we had to set a ground truth for our datasets by assigning topic mixtures to the documents. We settled on assigning a single topic per document, which corresponds to $\alpha_0 = 0$. Each document in Classic3/Classic4 was assigned a topic label determined by the collection of abstracts it came from (therefore, $K = 3$ for Classic3 and $K = 4$ for Classic4). For 20Newsgroups, it did not seem appropriate to assign a separate topic to each newsgroup, since there is much topical overlap among groups; for instance, comp.windows.x and comp.os.ms-windows seem to share a great deal of vocabulary. Instead, we collected the groups into $K = 6$ topics, following [7]. These topics are listed in Appendix C.

6.2 Procedure

Our overall empirical algorithm was as follows:

1. Construct the empirical moments $\widehat{Pairs}$ and $\widehat{Triples}$ (i.e., $\hat M_2$ and $\hat M_3$).
2. Whiten: let $\hat W = A \Lambda^{-1/2}$, where $A \Lambda A^T$ is the SVD of $\widehat{Pairs}$.
3. Tensor decomposition:
(SVD method) Calculate the left singular vectors of $\widehat{Triples}(\hat W, \hat W, \hat W\theta)$, as in Section 4.2.2.
(Tensor power method, using deflation) Calculate the eigenvectors of $\widehat{Triples}(\hat W, \hat W, \hat W)$, as in Section 4.2.1, extracting one pair at a time.
(Tensor power method, simultaneous) Calculate the eigenvectors of $\widehat{Triples}(\hat W, \hat W, \hat W)$, as in Section 4.2.1, without deflating.
4. Reconstruct: $\hat Z_i = (\hat W \hat v_i)^T\, \widehat{Triples}(I, \hat W \hat v_i, \hat W \hat v_i)$ and $\hat\mu_i = (\hat W^{+})^T \hat v_i / \hat Z_i$.

We randomly divided the Classic3 and 20Newsgroups data into three folds, each composed of one-third of the documents. We then used a cross-validation scheme in which we tested the algorithm on each combination of two folds, while using the documents and topic labels of the third fold as held-out data to compute an estimate of the ground-truth moments and latent variable matrix $\Phi$. Suppose we wish to estimate the ground-truth distribution of our data from the $|Docs|$ documents in our held-out data, and we have a label vector $y \in \mathbb{R}^{|Docs|}$ as well as a count vector $c_\ell$ for each document $\ell$, where $c_{\ell i}$ is the count of word type $i$ in document $\ell$. Then we estimate the $i$th element of $\mu_k$ as
\[ \hat\mu_{k,i} = \frac{\sum_{\ell=1}^{|Docs|} c_{\ell i}\, 1\{y_\ell = k\}}{\sum_{i'=1}^{|Voc|} \sum_{\ell=1}^{|Docs|} c_{\ell i'}\, 1\{y_\ell = k\}}. \]

To assess the performance of the three decomposition techniques and compare it to the sample complexity bounds, we computed the $\hat\mu_k$ and empirical moments for varying sample sizes, on a logarithmic scale, using each of the three methods with $\alpha_0 = 0$. We then recorded the L2 error between the ground-truth estimates of the moments and $\mu_k$'s derived from the held-out data and the empirical moments and $\hat\mu_k$'s returned by the three tensor decomposition techniques. Note that tensor decomposition methods only return the $\Phi$ matrix up to a permutation; we used a bipartite matching algorithm, the Hungarian algorithm [6], to match the $\hat\mu_k$'s to the $\mu_k$'s.
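As an illustration of the matching step, SciPy's linear_sum_assignment solves the bipartite matching problem; a minimal sketch, with topics assumed to be stored as columns:

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_topics(Mu_hat, Mu_true):
    """Permute the columns of Mu_hat to best match Mu_true and return the
    permuted estimate together with the total L2 matching error."""
    K = Mu_true.shape[1]
    cost = np.array([[np.linalg.norm(Mu_hat[:, i] - Mu_true[:, j]) for j in range(K)]
                     for i in range(K)])
    row_ind, col_ind = linear_sum_assignment(cost)       # Hungarian-style matching [6]
    Mu_matched = np.empty_like(Mu_hat)
    Mu_matched[:, col_ind] = Mu_hat[:, row_ind]
    return Mu_matched, cost[row_ind, col_ind].sum()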

7 Results and Conclusion

Qualitative results showing that our implementation is working reasonably well are included in Appendices D and E. Figures 1, 2, and 3 show the error $E$, the L2 errors in estimation of the empirical moments, and the L2 error in estimation of the topic vectors, respectively. As can be seen, all the errors fall off with sample size, but they do not quite fall off as $1/\sqrt{N}$.

Performance seems roughly the same for all three methods, and despite the polynomial dependence of the SVD method's bound on the number of topics, the performance of this method seems similar on both datasets. There are several possible explanations for our results. First of all, we must question the quality of the ground truth we estimated. Mislabeling is a problem, and the ground-truth topics approached by LDA need not correspond to our topic labels at all. This concern is supported by the fact that the errors of the topic vector estimates seem to asymptote to some nonzero level. It might be instructive to run a traditional EM- or MCMC-based approach for learning LDA on our data, in order to compare with our tensor decomposition results. Finally, we note that we have not fine-tuned the iteration parameters of the tensor power approach, which could give this method a boost in performance. Further work needs to be done in order to assess tensor decomposition approaches for LDA. In addition to the suggestions above, it would also be interesting to compare the speed and efficiency with which the tensor decomposition approach approaches the ground truth against an EM-based approach, since the fact that tensor decomposition avoids the problem of local minima is a big advantage of this method.

References
[1] A. Anandkumar, D.P. Foster, D. Hsu, S.M. Kakade, Y.-K. Liu. Two SVDs Suffice: Spectral Decompositions for Probabilistic Topic Modeling and Latent Dirichlet Allocation. arXiv report, arXiv:1204.6703v3, 2012.
[2] A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, M. Telgarsky. Tensor Decompositions for Learning Latent Variable Models. arXiv report, arXiv:1210.7559, 2012.
[3] A. Anandkumar, D. Hsu, S.M. Kakade. A Method of Moments for Mixture Models and Hidden Markov Models. arXiv report, arXiv:1203.0683, 2012.
[4] D.M. Blei, A.Y. Ng, M.I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993-1022, 2003.
[5] D. Hsu, S.M. Kakade. Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions. To appear in the 4th Innovations in Theoretical Computer Science (ITCS), 2013. arXiv report, arXiv:1206.5766.
[6] H.W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 3:253-258, 1956.
[7] J. Rennie. http://qwone.com/~jason/20Newsgroups/

A Cross-Moments of Dirichlet-Distributed Vectors

If $h$ is drawn from a Dirichlet distribution and the concentration parameters are known, then these moments can easily be calculated. If $h \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_K)$, then
\[ E[h_i] = \frac{\alpha_i}{\alpha_0} \]
\[ E[h \otimes h]_{i,j} = E[h_i h_j] = \begin{cases} \dfrac{\alpha_i \alpha_j}{\alpha_0(\alpha_0+1)}, & i \neq j \\[1ex] \dfrac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}, & i = j \end{cases} \]
\[ E[h \otimes h \otimes h]_{i,j,k} = E[h_i h_j h_k] = \begin{cases} \dfrac{\alpha_i \alpha_j \alpha_k}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i, j, k \text{ distinct} \\[1ex] \dfrac{\alpha_i(\alpha_i+1)\alpha_k}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j \neq k \\[1ex] \dfrac{\alpha_i(\alpha_i+1)(\alpha_i+2)}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j = k \end{cases} \]

B Derivation of Empirical Moment Estimators and Products

B.1 Notes
For a specific document $\ell$, sums of the form $\sum_{i=1}^{|Voc|}$ and $\sum_{i,j=1}^{|Voc|}$ can be rewritten as $\sum_{i=1}^{|Voc_\ell|}$ and $\sum_{i,j=1}^{|Voc_\ell|}$ (where $|Voc_\ell|$ is the number of distinct word types used in document $\ell$), because we only need to consider words that actually occur in document $\ell$.

B.2 Empirical First and Second Moments


In the following, $c_\ell$ denotes the count vector of document $\ell$ and $|V_\ell|$ its number of word tokens.
\[ \hat M_1 = \hat E[x_1] = \frac{1}{|Docs|} \sum_{\ell=1}^{|Docs|} \frac{1}{|V_\ell|}\, c_\ell \]
\[ \hat M_2 = \hat E[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0+1}\, \hat M_1 \otimes \hat M_1 = \frac{1}{|Docs|} \sum_{\ell=1}^{|Docs|} \frac{1}{|V_\ell|(|V_\ell|-1)} \big( c_\ell\, c_\ell^T - \mathrm{diag}(c_\ell) \big) - \frac{\alpha_0}{\alpha_0+1}\, \hat M_1 \otimes \hat M_1 \]

B.3 Empirical Third Moment and Its Multilinear Products

Below, $\circ$ denotes the elementwise (Hadamard) product and $\theta$ a fixed test vector.
\[ \hat M_3 = \hat E[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0+2}\big( \hat E[x_1 \otimes x_2 \otimes \hat M_1] + \hat E[x_1 \otimes \hat M_1 \otimes x_2] + \hat E[\hat M_1 \otimes x_1 \otimes x_2] \big) + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\, \hat M_1 \otimes \hat M_1 \otimes \hat M_1 \]
\[ = \frac{1}{|Docs|}\sum_{\ell=1}^{|Docs|} \frac{1}{|V_\ell|(|V_\ell|-1)(|V_\ell|-2)}\Big[ c_\ell \otimes c_\ell \otimes c_\ell + 2\sum_{i} c_{\ell i}\, (e_i \otimes e_i \otimes e_i) - \sum_{i,j} c_{\ell i} c_{\ell j}\, (e_i \otimes e_i \otimes e_j) - \sum_{i,j} c_{\ell i} c_{\ell j}\, (e_i \otimes e_j \otimes e_i) - \sum_{i,j} c_{\ell i} c_{\ell j}\, (e_j \otimes e_i \otimes e_i) \Big] \]
\[ \quad - \frac{1}{|Docs|}\sum_{\ell=1}^{|Docs|} \frac{1}{|V_\ell|(|V_\ell|-1)}\,\frac{\alpha_0}{\alpha_0+2}\Big[ c_\ell \otimes c_\ell \otimes \hat M_1 + c_\ell \otimes \hat M_1 \otimes c_\ell + \hat M_1 \otimes c_\ell \otimes c_\ell - \sum_i c_{\ell i}\, (e_i \otimes e_i \otimes \hat M_1 + e_i \otimes \hat M_1 \otimes e_i + \hat M_1 \otimes e_i \otimes e_i) \Big] + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\, \hat M_1 \otimes \hat M_1 \otimes \hat M_1 \]

\[ \hat M_3(I, I, \theta) = \frac{1}{|Docs|}\sum_{\ell} \frac{1}{|V_\ell|(|V_\ell|-1)(|V_\ell|-2)}\big[ (\theta^T c_\ell)\, c_\ell c_\ell^T + 2\,\mathrm{diag}(c_\ell \circ \theta) - (c_\ell^T\theta)\,\mathrm{diag}(c_\ell) - (c_\ell \circ \theta)\, c_\ell^T - c_\ell\, (c_\ell \circ \theta)^T \big] \]
\[ \quad - \frac{1}{|Docs|}\sum_{\ell} \frac{1}{|V_\ell|(|V_\ell|-1)}\,\frac{\alpha_0}{\alpha_0+2}\big[ (c_\ell c_\ell^T - \mathrm{diag}(c_\ell))\, (\theta^T \hat M_1) + \big((c_\ell c_\ell^T - \mathrm{diag}(c_\ell))\theta\big)\, \hat M_1^T + \hat M_1\, \big((c_\ell c_\ell^T - \mathrm{diag}(c_\ell))\theta\big)^T \big] + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\, (\theta^T \hat M_1)\, \hat M_1 \hat M_1^T \]

\[ \hat M_3(\hat W, \hat W, \theta) = \frac{1}{|Docs|}\sum_{\ell} \frac{1}{|V_\ell|(|V_\ell|-1)(|V_\ell|-2)}\big[ (\theta^T c_\ell)\, (\hat W^T c_\ell)(\hat W^T c_\ell)^T + 2\, \hat W^T \mathrm{diag}(c_\ell \circ \theta) \hat W - (c_\ell^T\theta)\, \hat W^T \mathrm{diag}(c_\ell) \hat W - \big(\hat W^T (c_\ell \circ \theta)\big)(\hat W^T c_\ell)^T - (\hat W^T c_\ell)\big(\hat W^T (c_\ell \circ \theta)\big)^T \big] \]
\[ \quad - \frac{1}{|Docs|}\sum_{\ell} \frac{1}{|V_\ell|(|V_\ell|-1)}\,\frac{\alpha_0}{\alpha_0+2}\big[ \big((\hat W^T c_\ell)(\hat W^T c_\ell)^T - \hat W^T \mathrm{diag}(c_\ell) \hat W\big)(\theta^T \hat M_1) + \big(\hat W^T (c_\ell c_\ell^T - \mathrm{diag}(c_\ell))\theta\big)(\hat W^T \hat M_1)^T + (\hat W^T \hat M_1)\big(\hat W^T (c_\ell c_\ell^T - \mathrm{diag}(c_\ell))\theta\big)^T \big] + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\, (\theta^T \hat M_1)\, (\hat W^T \hat M_1)(\hat W^T \hat M_1)^T \]

\[ \hat M_3(I, \theta, \theta) = \frac{1}{|Docs|}\sum_\ell \frac{1}{|V_\ell|(|V_\ell|-1)(|V_\ell|-2)}\big[ (\theta^T c_\ell)^2\, c_\ell + 2\, (c_\ell \circ \theta \circ \theta) - 2\, (c_\ell^T\theta)\, (c_\ell \circ \theta) - \big(c_\ell^T(\theta \circ \theta)\big)\, c_\ell \big] \]
\[ \quad - \frac{1}{|Docs|}\sum_\ell \frac{1}{|V_\ell|(|V_\ell|-1)}\,\frac{\alpha_0}{\alpha_0+2}\big[ 2\, (\theta^T \hat M_1)\, \big((c_\ell^T\theta)\, c_\ell - (c_\ell \circ \theta)\big) + \big((c_\ell^T\theta)^2 - c_\ell^T(\theta \circ \theta)\big)\, \hat M_1 \big] + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\, (\theta^T \hat M_1)^2\, \hat M_1 \]

\[ \hat M_3(\theta, \theta, \theta) = \frac{1}{|Docs|}\sum_\ell \frac{1}{|V_\ell|(|V_\ell|-1)(|V_\ell|-2)}\big[ (\theta^T c_\ell)^3 + 2\, (\theta \circ \theta)^T (c_\ell \circ \theta) - 3\, (c_\ell^T\theta)\, \big(c_\ell^T(\theta \circ \theta)\big) \big] \]
\[ \quad - \frac{1}{|Docs|}\sum_\ell \frac{1}{|V_\ell|(|V_\ell|-1)}\,\frac{\alpha_0}{\alpha_0+2}\, \Big[ 3\, (\theta^T \hat M_1)\, \big((c_\ell^T\theta)^2 - c_\ell^T(\theta \circ \theta)\big) \Big] + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\, (\theta^T \hat M_1)^3 \]

C Partitioning of 20Newsgroups

20Newsgroups was partitioned into six classes, following [7]:

D Qualitative Results

E Illustrative Results

The results of the simultaneous power method, ECA, and tensor deflation seem very similar, so we present only the results of the simultaneous power method here. The following tables show the results of this method with different numbers of topics k.

E.1 Classic4

k = 10: While 3 of the natural topics are recovered, MED is barely present. Topic 4 is a mixture of topics.
k = 20: We chose to display 6 of the 20 topics. We see from columns 1-4 that all natural topics are recovered, but we also get duplicate topics, as in column 5, and mixed topics, as in column 6.

E.2 20Newsgroups

For 20Newsgroups, k = 10 and k = 20 produced fairly similar results; tables for both settings follow.


The six classes of 20Newsgroups (Appendix C):

1. comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
2. misc.forsale
3. rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
4. talk.politics.misc, talk.politics.guns, talk.politics.mideast
5. sci.crypt, sci.electronics, sci.med, sci.space
6. talk.religion.misc, alt.atheism, soc.religion.christian

Classic4, k = 10 (top stemmed words per topic):

Topic 1: inform, librari, system, comput, data, scienc, studi, user, research, method, servic, retriev, develop, program, search, base, book, index, process, oper
Topic 2: system, program, comput, languag, method, gener, problem, data, algorithm, present, design, time, structur, develop, paper, techniqu, discuss, function, equat, oper
Topic 3: flow, layer, boundari, pressur, number, heat, solut, equat, mach, theori, present, bodi, shock, transfer, effect, result, method, laminar, plate, wave
Topic 4: case, system, result, method, flow, present, problem, time, studi, patient, algorithm, effect, solut, bodi, growth, model, techniqu, obtain, cell, develop


Classic4, k = 20 (6 of the 20 topics displayed):

Topic 1: cell, structur, studi, data, scienc, activ, line, patient, bodi, relat, high, normal, marrow, languag, bone, type, growth, rat, tissu, strain
Topic 2: flow, boundari, layer, heat, number, solut, plate, effect, equat, theori, transfer, laminar, problem, compress, point, pressur, surfac, dimension, veloc, case
Topic 3: librari, inform, book, studi, research, journal, public, univers, develop, academ, librarian, system, report, work, catalog, cost, present, decis, scienc, paper
Topic 4: algorithm, method, program, time, present, data, number, paper, system, result, inform, set, problem, tabl, languag, structur, bodi, oper, flow, gener
Topic 5: languag, program, comput, gener, problem, sort, fortran, system, algorithm, list, structur, string, featur, present, translat, process, user, design, rule, file
Topic 6: number, librari, problem, list, bodi, journal, titl, creep, catalog, method, buckl, function, scienc, librarian, point, work, column, time, boundari, period

20Newsgroups, k = 10:

Topic 1: drive, disk, hard, floppi, system, file, do, format, scsi, control, problem, question, comput, set, softwar, compress, switch, bit, origin, copi
Topic 2: window, run, problem, card, file, applic, monitor, video, mail, manag, system, line, program, screen, do, font, color, question, driver, set
Topic 3: game, team, win, two, plai, player, come, run, score, season, cub, last, hockei, dai, seri, world, suck, tie, record, put
Topic 4: kei, chip, bit, order, encrypt, clipper, phone, secur, gun, govern, simm, escrow, number, public, de, nsa, two, mean, run, call
Topic 5: god, christian, jesu, point, post, question, mean, exist, church, christ, jew, bibl, find, law, group, religion, state, show, word, answer
Topic 6: mail, list, sale, address, window, phone, post, file, run, email, info, interest, send, question, number, do, read, group, advanc, back


20Newsgroups, k = 20:

Topic 1: game, team, run, win, two, plai, player, come, last, window, score, kei, file, season, seri, system, hockei, name, into, sound
Topic 2: window, problem, run, system, applic, driver, manag, do, color, file, video, mous, monitor, win, graphic, font, program, screen, set, card
Topic 3: god, christian, jesu, call, mean, christ, car, read, irq, live, bibl, group, religion, church, doesn, sin, sound, post, never, love
Topic 4: drive, disk, hard, floppi, system, scsi, format, do, question, file, control, softwar, set, sale, origin, come, mac, power, boot, program
Topic 5: mail, list, sale, run, address, interest, info, group, phone, post, chip, type, send, file, advanc, question, read, inform, problem, email
Topic 6: card, video, color, vga, driver, window, mail, graphic, monitor, mode, cach, bui, speed, address, phone, set, problem, number, list, fpu
Topic 7: car, file, problem, question, two, bike, monitor, bui, post, last, opinion, god, dai, gener, ride, road, didn, great, mac, softwar

