
International Journal of Computer Information Systems, Vol. 3, No. 4, 2011

The Enhanced Gibbs Sampling for Topic Model


Ms. Anjali Ganesh Jivani
Department of Computer Science & Engineering
The Maharaja Sayajirao University of Baroda
Vadodara, Gujarat, India
anjali_jivani@yahoo.com

Ms. Neha Ripal Soni
Department of Computer Engineering
SVIT, Gujarat Technological University
Vasad, Gujarat, India
neha_ripal@yahoo.co.in

Abstract- The topic model is an unsupervised probabilistic language model that automatically learns the topics or subjects contained in a collection of text documents; through these topics it can automatically organize, categorize and make browsable huge numbers of documents [3]. Specifically, the topic model considered here is based on the Latent Dirichlet Allocation (LDA) model, which has become a popular model for discrete data such as collections of text documents. Gibbs sampling, a Markov Chain Monte Carlo method, is a highly attractive inference method for the topic model because it is simple, fast and has very few adjustable parameters. In this paper we derive a scalable algorithm that reduces the space complexity of the original Gibbs sampling for the topic model. The concept used to reduce the space requirement is to partition the dataset into smaller sets and then execute the algorithm for each partition. This reduces the space requirement without any impact on the time complexity. The enhanced Gibbs sampling algorithm has been implemented and tested on four different datasets.

Keywords- topic model; Gibbs sampling; text mining; LDA; MCMC

I. INTRODUCTION

The topic model has been shown to work well on a wide variety of text collections, from emails to news articles [9][12], research literature abstracts [5], the World Wide Web, and so on. Although widely applied to text collections, the topic model is not limited to textual data; it is equally applicable to problems involving collections of any kind of discrete data, including data from domains such as collaborative filtering, content-based image retrieval and bioinformatics. It has recently been applied to identify biological concepts from a protein-related corpus. Furthermore, the topic model is not limited to extracting the latent structure from discrete collections of data; there are numerous possible extensions of it. For example, the topic model is readily extended to author-topic models, author-role-topic models and hidden Markov topic models for separating semantic and syntactic topics [9].

The topic model is a statistical language model that relates words and documents through topics. It is based on the idea that documents are made up of a mixture of topics, where topics are distributions over words. Latent Dirichlet Allocation (LDA) is the most efficient and accepted model for discrete data, especially text documents [2]. While Kevin Murphy [18] proposed approximate variational Bayesian methods to solve LDA, [5] proposed Gibbs sampling for inference in LDA; it has been reported in [5] that Gibbs sampling reaches equivalent accuracy faster than alternative methods. Gibbs sampling, a Markov Chain Monte Carlo method, is highly attractive because it is simple, fast and has very few adjustable parameters. The extensive research currently being carried out on topic models deals with parallel topic models [3] for improving the time complexity. To our knowledge, not much work has been done on improving the space requirement of Gibbs sampling for the topic model. The objective of this paper, as mentioned before, is to derive an algorithm which is more space efficient than the original Gibbs sampling for the topic model.

II. LATENT DIRICHLET ALLOCATION

The evolution of topic models started with probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999 [1]. Later, topic models based on Latent Dirichlet Allocation (LDA) became very popular. LDA was developed by David Blei, Andrew Ng, and Michael Jordan in 2003, allowing documents to have a mixture of topics [2]. Other topic models are generally extensions of LDA. LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. The details of LDA are given in [2] and the graphical model representation of LDA is shown in Fig. 1.

Figure 1. Graphical model representation of LDA

LDA assumes the following generative process for each document w in a corpus D:
1. Choose N ~ Poisson(ξ).
2. Choose θ ~ Dir(α).


3. For each of the N words w_n:
   (a) Choose a topic z_n ~ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

There are three levels to the LDA representation. The parameters and their significance are:
1. α and β are corpus-level parameters and are sampled once in the process of generating a corpus.
2. θ is a document-level variable, sampled once per document.
3. z and w are word-level variables and are sampled once for each word in each document.
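Purely as an illustration of this generative process, the short Python sketch below samples a toy corpus from the model. The corpus sizes and hyperparameter values are assumptions chosen for the example, not values taken from the paper, and the topic-word distributions are drawn from a symmetric Dirichlet, as in the smoothed model discussed in Section III.

```python
import numpy as np

rng = np.random.default_rng(0)
T, W, D = 3, 20, 5                 # toy numbers of topics, vocabulary size, documents
alpha, beta, xi = 1.0, 0.01, 50.0  # assumed symmetric hyperparameters and Poisson rate

# topic-word distributions phi_t, one per topic (smoothed-model view)
phi = rng.dirichlet(np.full(W, beta), size=T)          # shape (T, W)

corpus = []
for _ in range(D):
    N = rng.poisson(xi)                                # 1. document length N ~ Poisson(xi)
    theta = rng.dirichlet(np.full(T, alpha))           # 2. topic mixture theta ~ Dir(alpha)
    z = rng.choice(T, size=N, p=theta)                 # 3a. topic z_n ~ Multinomial(theta)
    w = np.array([rng.choice(W, p=phi[t]) for t in z]) # 3b. word w_n ~ p(w_n | z_n)
    corpus.append(w)
```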

To implement LDA we must compute the posterior distribution of the hidden variables θ and z given a document, i.e. solve the inference problem:

P(θ, z | w, α, β) = P(θ, z, w | α, β) / P(w | α, β)        (1)

This distribution is difficult to estimate because the denominator, a normalizing constant, is intractable to compute exactly. The key idea behind the LDA model for text data is to assume that the words in each document were generated by a mixture of topics, where a topic is represented as a multinomial probability distribution over words. The mixing coefficients for each document and the word-topic distributions are unobserved (hidden) and are learned from data using unsupervised learning methods. Blei et al. [2] introduced the LDA model within a general Bayesian framework and developed a variational algorithm for learning the model from data. Griffiths and Steyvers [12] subsequently proposed a learning algorithm based on collapsed Gibbs sampling. Both the variational and Gibbs sampling approaches have their advantages: the variational approach is arguably faster computationally, but the Gibbs sampling approach is in principle more accurate, since it asymptotically approaches the correct distribution.

III. GIBBS SAMPLING APPROACH

Gibbs sampling is an example of a Markov chain Monte Carlo algorithm. The algorithm is named after the physicist J. W. Gibbs, in reference to an analogy between the sampling algorithm and statistical physics. The algorithm was described by the brothers Stuart and Donald Geman in 1984, some eight decades after the passing of Gibbs. As mentioned before, Griffiths and Steyvers proposed collapsed Gibbs sampling for LDA.

A. The Smoothed LDA

Before discussing Gibbs sampling it is necessary to understand how LDA is smoothed, because of a problem with the original model. One problem that can arise with the original LDA model shown in Fig. 1 is that a new document outside the training set is likely to contain words that did not appear in any of the documents in the training corpus, and zero probability would be assigned to such words. To cope with this situation, the smoothed model proposed in [2], shown in Fig. 2, is used. The strategy is not to estimate the model parameters θ and φ explicitly, but instead to consider the posterior distribution over the assignments of words to topics, P(z | w). Estimates of θ and φ are then obtained by examining this posterior distribution. Evaluating P(z | w) requires solving a problem that has been studied in detail in Bayesian statistics and statistical physics: computing a probability distribution over a large discrete space. Here, α and β are hyperparameters specifying the nature of the priors on θ and φ. Although these hyperparameters could be vector-valued as in [5][3], for the purposes of this model we assume symmetric Dirichlet priors, with α and β each having a single value. These priors are conjugate to the multinomial distributions θ and φ, allowing us to compute the joint distribution P(w, z) by integrating out θ and φ.

Figure 2. Graphical model representation of smoothed LDA

B. The Gibbs Algorithm for LDA

After applying a number of steps to equation (1), the conditional distribution, as given in [5], is:

P(z_n = t | z_-n, w) ∝ ((Cwt(w_n, t) + β) / (Ct(t) + Wβ)) × ((Ctd(t, d_n) + α) / (Nd(d_n) + Tα))        (2)

where z_-n denotes the topic assignments of all words other than the current one. The terminology used in equation (2) is given in Tab. I.

TABLE I. TERMS AND THEIR MEANINGS FOR EQUATION (2)

Term            Meaning
Cwt(w_n, t)     Number of instances of word w_n assigned to topic t, not including the current one
Ct(t)           Total number of words assigned to topic t, not including the current one
Ctd(t, d_n)     Number of words assigned to topic t in document d_n, not including the current one
Nd(d_n)         Total number of words in document d_n, not including the current one

Having obtained the full conditional distribution, the Gibbs sampling algorithm is straightforward. The z_n variables are initialized to values in {1, 2, ..., T}, determining the initial state of the Markov chain. The chain is then run for a number of iterations, each time finding a new state by sampling each z_n from the distribution specified by equation (2). After enough iterations for the chain to approach the target distribution, the samples are taken after an appropriate lag to ensure that their autocorrelation is low.


The algorithm is presented in Fig. 3. With a set of samples from the posterior distribution P(z | w), statistics that are independent of the content of individual topics can be computed by integrating across the full set of samples. For any single sample we can estimate φ and θ from the value of z by:

φ(w, t) = (Cwt(w, t) + β) / (Ct(t) + Wβ)        (3)

θ(t, d) = (Ctd(t, d) + α) / (Nd(d) + Tα)        (4)

These values correspond to the predictive distributions over new words w and new topics z conditioned on w and z. The algorithm for Gibbs sampling LDA is shown in Fig. 3; the dimensions required by this algorithm are shown in Tab. II and the details of the arrays required are shown in Tab. III.

Input: document-word index, vocabulary-word index, vocabulary, parameter values.
Output: topic-wise word distribution.
Procedure:
// initialization of the Markov chain initial state
for all words of the corpus n ∈ [1, N] do
    sample topic t ~ Mult(1/T), set z(n) = t
    // increment the count variables
    Cwt(wid(n),t)++, Ctd(t,did(n))++, Ct(t)++
end for
// run the chain over the burn-in period and check for convergence;
// generally run for a fixed number of iterations, then take samples at an appropriate lag
for iteration i ∈ [1, ITER] do
    for all words of the corpus n ∈ [1, N] do
        t = z(n)
        // decrement all the count variables, so as not to include the current assignment
        Cwt(wid(n),t)--, Ctd(t,did(n))--, Ct(t)--
        for each topic t ∈ [1, T] do
            P(t) = (Cwt(wid(n),t) + β)(Ctd(t,did(n)) + α) / (Ct(t) + Wβ)
        end for
        sample new topic t from P(t)
        z(n) = t
        // increment all the count variables to reflect this new topic assignment
        Cwt(wid(n),t)++, Ctd(t,did(n))++, Ct(t)++
    end for
end for

Figure 3. Gibbs sampling algorithm for LDA
TABLE II. DIMENSIONS REQUIRED IN GIBBS ALGORITHM

Parameter   Description
D           Number of documents in corpus
W           Number of words in vocabulary
N           Total number of words in corpus
T           Number of topics
ITER        Number of iterations of Gibbs sampler

TABLE III. ARRAYS USED IN GIBBS ALGORITHM

Array       Description
wid(N)      Word ID of nth word
did(N)      Document ID of nth word
z(N)        Topic assignment to nth word
Cwt(W,T)    Count of word w in topic t
Ctd(T,D)    Count of topic t in document d
Ct(T)       Count of topic t
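To make the algorithm of Fig. 3 and the estimates in equations (3) and (4) concrete, here is a minimal runnable sketch using the arrays of Tab. III. The authors implemented their system in MATLAB, so this Python/NumPy version is an illustrative reconstruction, not their code; function and variable names are our own.

```python
import numpy as np

def gibbs_lda(wid, did, W, D, T, ITER, alpha, beta, rng=None):
    """Collapsed Gibbs sampling for LDA (a sketch in the spirit of Fig. 3).

    wid, did : length-N arrays giving the vocabulary id and document id of
               each word occurrence (the wid/did arrays of Tab. III).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(wid)
    z = rng.integers(0, T, size=N)          # z(N): random initial topic assignments
    Cwt = np.zeros((W, T))                  # count of word w in topic t
    Ctd = np.zeros((T, D))                  # count of topic t in document d
    Ct = np.zeros(T)                        # total words assigned to topic t
    for n in range(N):
        Cwt[wid[n], z[n]] += 1; Ctd[z[n], did[n]] += 1; Ct[z[n]] += 1

    for _ in range(ITER):
        for n in range(N):
            w, d, t = wid[n], did[n], z[n]
            # remove the current assignment from the counts
            Cwt[w, t] -= 1; Ctd[t, d] -= 1; Ct[t] -= 1
            # full conditional of equation (2), up to a constant factor
            p = (Cwt[w, :] + beta) / (Ct + W * beta) * (Ctd[:, d] + alpha)
            t = rng.choice(T, p=p / p.sum())
            z[n] = t
            Cwt[w, t] += 1; Ctd[t, d] += 1; Ct[t] += 1

    phi = (Cwt + beta) / (Ct + W * beta)                     # equation (3)
    theta = (Ctd + alpha) / (Ctd.sum(axis=0) + T * alpha)    # equation (4)
    return z, phi, theta
```

For the Cite Seer corpus of Tab. IV, for instance, this would be called with W = 683, D = 474 and, say, T = 30 and ITER = 1000. Note that the three length-N arrays wid, did and z are exactly the term that dominates the space analysis in the next subsection.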

C. Analysis of the Gibbs Algorithm

The time and space complexity of the Gibbs sampling algorithm shown in Fig. 3 are:

Time Complexity ~ O(ITER * N * T)
Space Complexity ~ O(3N + (D + W)T)

To understand the limitations of the existing algorithm, consider a million-document corpus with the following size parameters: D = 10^6, W = 10^4, N = 10^9. For this corpus it would be reasonable to run with T = 10^3 topics and ITER = 10^3 iterations. Using the space complexity expression given above, the required memory would be

3 * 10^9 + (10^6 + 10^4) * 10^3 ≈ 4 * 10^9, i.e. roughly 4 Gigabytes.

This memory requirement is beyond most desktop computers, and this makes Gibbs sampled topic model computation impractical for many purposes. As observed from the space complexity expression, the memory requirement grows with N, the total number of words in the corpus, which is multiplied by three (the wid, did and z arrays).
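As a quick check of this arithmetic (a worked example only; the actual bytes per entry depend on the data types used by an implementation):

```python
D, W, N, T = 10**6, 10**4, 10**9, 10**3

# three length-N arrays (wid, did, z) plus the count arrays Cwt (W x T),
# Ctd (T x D) and Ct (T)
entries = 3 * N + (W + D) * T + T
print(f"{entries:,} array entries")   # 4,010,001,000 -> roughly 4 * 10^9
```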


To reduce this space complexity problem, we have proposed the Enhanced Gibbs sampling algorithm.

IV. THE ENHANCED GIBBS SAMPLING ALGORITHM
Input: document-word index, vocabulary-word index, vocabulary, parameter values.
Output: topic-wise word distribution.
Procedure:
// initialization of the Markov chain initial state
for all partitions p ∈ [1, P] do
    for all words of the current partition p, n ∈ [1, N/P] do
        sample topic t ~ Mult(1/T), set z(n) = t
        // increment the count variables
        Cwt(wid(n),t)++, Ctd(t,did(n))++, Ct(t)++
    end for
end for
// run the chain over the burn-in period and check for convergence;
// generally run for a fixed number of iterations, then take samples at an appropriate lag
for all partitions p ∈ [1, P] do
    for iteration i ∈ [1, ITER] do
        for all words of the current partition p, n ∈ [1, N/P] do
            t = z(n)
            // decrement all the count variables, so as not to include the current assignment
            Cwt(wid(n),t)--, Ctd(t,did(n))--, Ct(t)--
            for each topic t ∈ [1, T] do
                P(t) = (Cwt(wid(n),t) + β)(Ctd(t,did(n)) + α) / (Ct(t) + Wβ)
            end for
            sample new topic t from P(t)
            z(n) = t
            // increment all the count variables to reflect this new topic assignment
            Cwt(wid(n),t)++, Ctd(t,did(n))++, Ct(t)++
        end for
    end for
end for

Figure 4. The Enhanced Gibbs sampling algorithm

The algorithm proposed in this paper reduces the space requirement of the original Gibbs algorithm. We apply the concept of partitioning the word set N and then executing the algorithm partition by partition, instead of loading the whole word set in a single run. With this we achieve a reduction in the space requirement, since the number of word entries held in memory shrinks, without any impact on the time complexity. After each run on a partition the result is stored in separate variables and there is no need to merge the results of the partitions; the count variables are treated as global variables across all partitions. The proposed Enhanced Gibbs sampling algorithm is shown in Fig. 4, and a code sketch of the partition-wise execution is given after the list of steps below.

Suppose we consider three partitions of the original word set N. The space complexity becomes:

Space Complexity ~ O(3 * N/P + (D + W)T), where P is the total number of partitions
                ~ O(3 * N/3 + (D + W)T)
                ~ O(N + (D + W)T)

The space requirement reduces considerably. Meanwhile, the time complexity becomes:

Time Complexity ~ O(ITER * (N/P) * T * P)
               ~ O(ITER * N * T)

The time complexity does not change, since the algorithm is executed as many times as the number of partitions, but on a smaller word set each time. The enhanced algorithm requires the following steps for execution:

- Read each document, perform tokenization, remove stop words, and apply case folding.
- Generate the document-word matrix.
- Generate the vocabulary of the unique words in the collection.
- From the document-word matrix, generate the sparse arrays containing the vocabulary index and document index of each word.
- Apply the Enhanced Gibbs sampling algorithm to extract the topics from the collection.
- Output the result.
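The following is a minimal sketch of the partition-wise execution described above, again in Python/NumPy rather than the authors' MATLAB, so names and structure are illustrative assumptions. For simplicity the partitions are passed as in-memory (wid, did) pairs; in practice each partition's index arrays would be loaded one at a time, so that only about N/P token entries are resident. Unlike Fig. 4, which initializes all partitions before sampling, this sketch initializes each partition immediately before sampling it.

```python
import numpy as np

def enhanced_gibbs_lda(partitions, W, D, T, ITER, alpha, beta, rng=None):
    """Partition-wise collapsed Gibbs sampling (a sketch in the spirit of Fig. 4).

    partitions : list of (wid, did) pairs, one pair of index arrays per partition.
    Only one partition's wid/did/z arrays need to be resident at a time, while
    the count arrays Cwt, Ctd and Ct are global across all partitions.
    """
    rng = np.random.default_rng() if rng is None else rng
    Cwt = np.zeros((W, T)); Ctd = np.zeros((T, D)); Ct = np.zeros(T)
    assignments = []                      # per-partition topic assignments, kept for output

    for wid, did in partitions:           # outer loop over partitions
        n_p = len(wid)
        z = rng.integers(0, T, size=n_p)  # initialize this partition's chain state
        for n in range(n_p):
            Cwt[wid[n], z[n]] += 1; Ctd[z[n], did[n]] += 1; Ct[z[n]] += 1

        for _ in range(ITER):             # run the chain on this partition only
            for n in range(n_p):
                w, d, t = wid[n], did[n], z[n]
                Cwt[w, t] -= 1; Ctd[t, d] -= 1; Ct[t] -= 1
                p = (Cwt[w, :] + beta) / (Ct + W * beta) * (Ctd[:, d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[n] = t
                Cwt[w, t] += 1; Ctd[t, d] += 1; Ct[t] += 1
        assignments.append(z)

    phi = (Cwt + beta) / (Ct + W * beta)                     # equation (3)
    theta = (Ctd + alpha) / (Ctd.sum(axis=0) + T * alpha)    # equation (4)
    return assignments, phi, theta
```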


V. IMPLEMENTATION OF THE ENHANCED GIBBS SAMPLING ALGORITHM

This algorithm was implemented and tested on four datasets by varying the parameter values. It was implemented using MATLAB 7.0.1.

A. The Datasets

To extract the topics we require a text dataset that is rich in different topics. There are a large number of textual datasets available that are suitable for this type of implementation, such as news articles, emails, literature, research papers and abstracts, and technical reports. The datasets that we used were:



1. The Cite Seer collection of scientific literature abstracts
2. The NIPS dataset of research papers
3. The Times Magazine articles
4. The Tehelka Magazine articles
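For reference, a minimal sketch of the preprocessing steps listed in Section IV (tokenization, stop-word removal, case folding, and construction of the wid/did index arrays) might look as follows; the letters-only tokenizer and the tiny stop-word list are illustrative assumptions, not the exact procedure used on these datasets.

```python
import re
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}   # illustrative subset

def preprocess(documents):
    """Turn raw documents into the vocabulary and the wid/did arrays of Tab. III."""
    tokenized = []
    for text in documents:
        tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]  # tokenize + case folding
        tokenized.append([t for t in tokens if t not in STOP_WORDS])  # stop-word removal

    vocabulary = sorted({t for doc in tokenized for t in doc})
    word_index = {w: i for i, w in enumerate(vocabulary)}

    wid, did = [], []
    for d, doc in enumerate(tokenized):
        for t in doc:
            wid.append(word_index[t])   # vocabulary index of this word occurrence
            did.append(d)               # document index of this word occurrence
    return vocabulary, np.array(wid), np.array(did)

# vocabulary, wid, did = preprocess(["Topic models relate words and documents ...", "..."])
```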

The results after preprocessing the four datasets are shown in Tab. IV. This output is then used for the next step, i.e. applying the Enhanced Gibbs sampling algorithm with different partitions.

TABLE IV. PREPROCESSING RESULTS FOR THE FOUR DATASETS

Parameter                   NIPS       Times Magazine   Tehelka Magazine   Cite Seer
No. of Total Words (N)      51515      29601            17184              8320
No. of Unique Words (W)     1485       3820             1772               683
No. of Documents (D)        90         420              125                474
Time Taken (seconds)        661.532    410.359          244.86             120.656

B. Output and Comparison of the Enhanced Algorithm

This is the second phase, i.e. applying both the original Gibbs sampling and the Enhanced Gibbs sampling algorithms once the preprocessing is completed. A number of successive iterations are made through the topic assignments, done by random sampling over the dataset. The proposed method does the same, but instead of working in a single pass over the whole dataset, the dataset is divided into successive partitions and the algorithm is applied to each partition. The output of both algorithms and their comparison is shown in Tab. V and Tab. VI. The algorithms were run on all the datasets with varying parameter values; we display only two outputs, related to the Cite Seer dataset, in this paper. Each dataset showed similar results, and there was a considerable reduction in the space requirement when the Enhanced Gibbs sampling was used.
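As a side note, the byte counts reported in Tab. V below can be reproduced directly from the array dimensions, assuming 8-byte double-precision storage (MATLAB's default numeric type; the paper does not state the element type, so this is our assumption):

```python
BYTES_PER_ELEMENT = 8   # assumed: double precision

def space_bytes(W, D, T, N):
    """Total bytes for the arrays of Tab. III: ct, ctd, cwt, did, wid, z."""
    elements = T + T * D + W * T + 3 * N
    return elements * BYTES_PER_ELEMENT

print(space_bytes(W=683, D=474, T=30, N=8320))       # 477600 (no partition)
print(space_bytes(W=683, D=474, T=30, N=8320 // 2))  # 377760 (partition = 2)
```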
TABLE V. OUTPUT AND COMPARISON OF BOTH ALGORITHMS

Arrays required          Original Algorithm (No Partition)    Proposed Algorithm (Partition = 2)
by the algorithm         Size          Bytes                  Size          Bytes
ct (1,T)                 1 x 30        240                    1 x 30        240
ctd (T,D)                30 x 474      113760                 30 x 474      113760
cwt (W,T)                683 x 30      163920                 683 x 30      163920
did (1,N)                1 x 8320      66560                  1 x 4160      33280
wid (1,N)                1 x 8320      66560                  1 x 4160      33280
z (1,N)                  1 x 8320      66560                  1 x 4160      33280
Space (bytes)                          477600                               377760
Time (secs)                            36.438                               36.063

In Tab. V the parameter values are T = 30, ITER = 1000, α = 1.0 and β = 0.01, whereas Tab. VI shows the results obtained with varying parameter values and partition counts.

TABLE VI. RESULTS WITH VARYING PARAMETER AND PARTITION VALUES

Parameters                         No Partition   Partition = 2   Partition = 3   Partition = 4
T = 30, ITER = 1000,
α = 1, β = 0.01
  Time (sec)                       36.438         36.063          35.734          35.64
  Space (Bytes)                    477600         377760          344488          327840
T = 10, ITER = 1000,
α = 0.05, β = 0.01
  Time (sec)                       20.516         20.344          20.078          20.094
  Space (Bytes)                    292320         192480          159208          142560

As can be seen from these observations, the space requirement reduces significantly, whereas the time taken reduces only marginally.

VI. CONCLUSION

The topic model is a statistical language model that relates words and documents through topics. It is based on the idea that documents are made up of a mixture of topics, where topics are distributions over words. Gibbs sampling for implementing LDA has been a very popular approach for topic models compared to alternative methods such as variational Bayes and expectation propagation. Gibbs sampling, a Markov Chain Monte Carlo method, is highly attractive because it is simple, fast and has very few adjustable parameters. While the time and space complexity of the topic model scales linearly with the number of documents in a collection, computations are only practical for modest-sized collections of up to hundreds of thousands of documents. In this paper we have proposed an enhanced Gibbs sampled topic model algorithm which scales better than the original, as the space complexity is considerably reduced. There are a number of possible extensions of topic models, such as author-topic models, author-role-topic models, topic models for images, and hidden Markov topic models. Parallel topic models are also an emerging area of interest. Future work will concentrate on such extensions of the topic model.

ACKNOWLEDGMENT

We are thankful to Prof. B. S. Parekh and Dr. S. K. Vij, our mentors and guides, for their invaluable support and technical guidance throughout this research work.

REFERENCES
[1] Hofmann, Thomas (1999). "Probabilistic Latent Semantic Indexing". Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf.




[2] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I.; Lafferty, John (January 2003). "Latent Dirichlet allocation". Journal of Machine Learning Research 3: pp. 993-1022. doi:10.1162/jmlr.2003.3.4-5.993. http://jmlr.csail.mit.edu/papers/v3/blei03a.html.
[3] David Newman, Padhraic Smyth, Mark Steyvers, "Scalable Parallel Topic Models", Journal of Intelligence Community Research and Development, 2006.
[4] T. L. Griffiths and M. Steyvers, "Finding scientific topics", Proc Natl Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
[5] Griffiths, T. L., and Steyvers, M., "Finding Scientific Topics", National Academy of Sciences, 101 (suppl. 1): 5228-5235, 2004.
[6] Griffiths, T. L., & Steyvers, M., "A probabilistic approach to semantic representation", In Proceedings of the 24th Annual Conference of the Cognitive Science Society, 2002.
[7] Griffiths, T., "Gibbs sampling in the generative model of Latent Dirichlet Allocation", Technical report, Stanford University, 2002.
[8] D. Newman, C. Chemudugunta, P. Smyth, M. Steyvers, "Analyzing Entities and Topics in News Articles using Statistical Topic Models", LNCS 3975, Intelligence and Security Informatics, Springer, 2006.
[9] R. M. Neal, "Probabilistic Inference Using Markov Chain Monte Carlo Methods", 1993. http://www.cs.utoronto.ca/_radford/review.abstract.html.
[10] G. Heinrich, "Parameter estimation for text analysis", Technical Report, 2004.
[11] D. Newman and S. Block, "Probabilistic topic decomposition of an eighteenth-century American newspaper", J. Am. Soc. Inf. Sci. Technol., 57(6): 753-767, 2006.
[12] Steyvers, M. & Griffiths, T., "Probabilistic topic models", In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis, Hillsdale, NJ: Erlbaum, 2007.
[13] Griffiths, T. L., & Yuille, A., "A primer on probabilistic inference", to appear in M. Oaksford and N. Chater (Eds.), The Probabilistic Mind: Prospects for Rational Models of Cognition, Oxford: Oxford University Press.
[14] D. Heckerman, "A Tutorial on Learning with Bayesian Networks", In Learning in Graphical Models, M. Jordan, Ed., MIT Press, Cambridge, MA, 1999.
[15] H. Guo and W. H. Hsu, "A survey of algorithms for real-time Bayesian network inference", AAAI/KDD/UAI-2002 Joint Workshop on Real-Time Decision Support and Diagnosis Systems, 1-12, Edmonton, Alberta, 2002.
[16] Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M., "Markov Chain Monte Carlo in Practice: A Roundtable Discussion", The American Statistician, Vol. 52, pp. 93-100, 1998.
[17] M. Girolami and A. Kaban, "On an equivalence between PLSI and LDA", In Proceedings of SIGIR 2003. http://citeseer.ist.psu.edu/girolami03equivalence.html.
[18] Kevin P. Murphy, "An introduction to graphical models", University of Columbia, 2001.

AUTHORS PROFILE

Ms. Anjali Ganesh Jivani is doing research in the field of Text Data Mining and pursuing a Ph.D. from The Maharaja Sayajirao University of Baroda, Gujarat, India. She is presently working as an Associate Professor at The M. S. University. She has published a number of national and international papers in conference proceedings as well as international journals related to her field. She has also co-authored a practice book on SQL and PL/SQL. She is a life member of the Computer Society of India (CSI) and ISTE.

Ms. Neha Ripal Soni completed her M.E. in Computer Science with first rank and is a gold medallist from D. D. University, Gujarat, India. She is presently working as an Associate Professor at SVIT, Gujarat Technological University. She has published a number of papers in the proceedings of national and international level conferences. She is a life member of the Computer Society of India (CSI) and ISTE.

