Matlab Topic Modeling Toolbox 1.4

Inquiries

Mark Steyvers
mark.steyvers@uci.edu

Authors

Mark Steyvers
mark.steyvers@uci.edu
University of California, Irvine
Department of Cognitive Sciences
3151 Social Sciences Plaza
Irvine, CA 92697-5100
 
Tom Griffiths
tom_griffiths@berkeley.edu
University of California, Berkeley
Department of Psychology
3210 Tolman Hall
Berkeley, CA 94720 USA


 

Installation & Licensing

  • Download the zipped toolbox (18 MB).
    NOTE:
    This toolbox now works with 64-bit compilers. If you need the old version of this toolbox, which contains the code for 32-bit compilers, download this version.
     
  • The program is free for scientific use. Please contact the authors if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. By using this software, you are agreeing to this license statement.
     
  • Type 'help function' at the Matlab command prompt, replacing 'function' with the name of the function, for more information on each function.
     
  • Read these notes on data format for a description of the input and output formats of the different topic models.
     
  • Note for Mac and Linux users: some of the Matlab functions are implemented with mex code (C code linked to Matlab). For Windows-based platforms, the DLLs are already provided in the distribution package. For other platforms, please compile the mex functions by executing "compilescripts" at the Matlab prompt.

Example Scripts

The LDA Model

exampleLDA1 extract topics with LDA model
exampleLDA2 extract multiple topic samples with LDA model
exampleLDA3 shows how to order topics according to similarity in usage
exampleVIZ1 visualize topics in a 2D map
exampleVIZ2 visualize documents in a 2D map

The AT (Author-Topic) Model

exampleAT1 extract topics with AT model
exampleAT2 extract multiple topic samples with AT model
   

The HMM-LDA Model

exampleHMMLDA1 extract topics and syntactic states with HMM-LDA model
exampleHMMLDA2 extract multiple topic samples with HMM-LDA model

The LDA-COL (Collocation) Model

exampleLDACOL1 extract topics and collocations with the LDA-COL model; also shows how to convert the LDA-COL model output so that the vocabulary and topic counts contain collocations
exampleLDACOL2 extract multiple topic samples with LDA-COL model
exampleLDACOL3 convert stream data as used by HMM-LDA model to collocation stream data as used by LDA-COL model

Applying Topic Models to Images

exampleimages1 simulates the "bars" example
exampleimages2 extract topics from handwritten digits and characters

Matlab Functions

Topic Extraction Models

GibbsSamplerLDA Extract topics with LDA model
GibbsSamplerAT Extract topics with AT model
GibbsSamplerHMMLDA Extract topics and syntactic states with HMM-LDA model
GibbsSamplerLDACOL Extract topics and collocations with LDA-COL model

Visualization / Interpretation

WriteTopics Write most likely entities (e.g. words, authors) per topic to a string and/or text file
WriteTopicMult Write topic-entity distributions for multiple entities to a string and/or text file
VisualizeTopics visualizes topics in 2D map
VisualizeDocs visualizes documents in 2D map based on topic distances
OrderTopics orders topics according to similarity in topic distributions over documents
CreateCollocationTopics create new vocabulary and topic counts containing collocations
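
The idea behind WriteTopics, listing the most likely entities for each topic, can be illustrated outside the toolbox with a small sketch. The Python code below is not the toolbox's implementation. The function name top_entities_per_topic, the layout of the count matrix (rows are entities, columns are topics), and ranking by raw counts are all assumptions made for illustration.

```python
from typing import List


def top_entities_per_topic(counts: List[List[int]],
                           vocab: List[str],
                           k: int = 3) -> List[List[str]]:
    """Return the k highest-count entities for each topic.

    counts[w][t] is assumed to hold how often entity w (a word, an
    author, ...) was assigned to topic t. This layout is hypothetical
    and chosen only for this illustration.
    """
    n_topics = len(counts[0])
    result = []
    for t in range(n_topics):
        # Sort entity indices by their count in topic t, largest first.
        ranked = sorted(range(len(vocab)),
                        key=lambda w: counts[w][t],
                        reverse=True)
        result.append([vocab[w] for w in ranked[:k]])
    return result


# Toy data: three entities, two topics.
toy_counts = [[5, 0],
              [1, 4],
              [3, 2]]
toy_vocab = ["a", "b", "c"]
top = top_entities_per_topic(toy_counts, toy_vocab, k=2)
```

Within a single topic, ranking by raw count gives the same order as ranking by a probability smoothed with a symmetric prior, because adding the same constant to every count does not change their order.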

Utilities

compilescripts compile all mex scripts
importworddoccounts imports text file with word-document counts into sparse matrix
stream_to_collocation_data utility to convert stream data from HMM-LDA model into stream data for LDA-COL model

Matlab Datasets

Psych Review Abstracts (bag of words)

bagofwords_psychreview document word counts
words_psychreview vocabulary

Psych Review Abstracts (word stream)

psychreviewstream successive word and document indices

Psych Review Abstracts (collocation word stream)

psychreviewcollocation successive word and document indices with function words removed

NIPS proceedings papers (bag of words)

bagofwords_nips document word counts
words_nips vocabulary
titles_nips titles of papers
authors_nips names of authors
authordoc_nips document author counts

NIPS proceedings papers (word stream)

nips_stream successive word and document indices
(note: the document indices in this dataset do not align with the bag-of-words dataset for NIPS)

NIPS proceedings papers (collocation stream)

nipscollocation successive word and document indices with function words removed

Image Data

binaryalphabet a set of handwritten digits and characters. See exampleimages2 for an application of topic models to this data

Release Notes

Version 1.4 (4/4/2011)

  • Changed the C code to be compatible with 64 bit compilers

Version 1.3.2 (12/20/2007)

  • Fixed a bug in the function "importworddoccounts"

Version 1.3.1 (1/6/2006)

  • Added the Mat files in Matlab 6 uncompressed format.
     
  • Made changes in MEX code for compatibility with linux C compilers

Version 1.3 (9/6/2005)

  • Rewrote the data format section
     
  • The LDA-COL (LDA-Collocation) model was added. This model allows the extraction of topics just as in the LDA model. In addition, it can simultaneously extract collocations (i.e., frequently occurring combinations of words). The LDA-COL model can also start from a previously saved state.
     
  • The LDA and AT models can now start from a previously saved state.
     
  • The input for the GibbsSamplerLDA function was changed. The input is now a set of word and document indices, not a sparse word-document count matrix.
     
  • The input for the GibbsSamplerAT function was changed. The first two inputs are a set of word and document indices, not a sparse word-document count matrix.
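
The difference between the two input formats mentioned above can be illustrated with a small sketch: a count matrix stores one number per word-document pair, while index vectors store one entry per word token. The Python code below is an illustration only and is not part of the toolbox. The function name counts_to_index_streams, the matrix orientation (rows are words, columns are documents), the 1-based indices, and the document-by-document token order are assumptions.

```python
from typing import List, Tuple


def counts_to_index_streams(counts: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Expand a word-document count matrix into two index vectors.

    counts[w][d] is assumed to be the number of times word w occurs in
    document d. The result has one entry per token: ws[i] is the word
    index and ds[i] the document index of token i (both 1-based here,
    following Matlab's indexing convention).
    """
    n_words = len(counts)
    n_docs = len(counts[0]) if n_words else 0
    ws: List[int] = []
    ds: List[int] = []
    for d in range(n_docs):
        for w in range(n_words):
            n = counts[w][d]
            # Repeat the (word, document) pair once per occurrence.
            ws.extend([w + 1] * n)
            ds.extend([d + 1] * n)
    return ws, ds


# Toy data: 2 words, 2 documents.
toy_counts = [[2, 0],
              [1, 3]]
ws, ds = counts_to_index_streams(toy_counts)
```

Both vectors always have the same length, equal to the total number of tokens in the corpus.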

 

References

LDA MODEL

Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.

Griffiths, T.L., Steyvers, M., & Tenenbaum, J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2), 211-244.

Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.

AT (AUTHOR-TOPIC) MODEL

Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.

Rosen-Zvi, M., Griffiths, T., Smyth, P., & Steyvers, M. (submitted). Learning author-topic models from text corpora.

HMM-LDA MODEL

Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (2004). Integrating Topics and Syntax. In Advances in Neural Information Processing Systems, 17.

LDA-COL MODEL

Griffiths, T.L., Steyvers, M., & Tenenbaum, J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2), 211-244. See pages 234-236.