Matlab Topic Modeling Toolbox 1.4 
Mark Steyvers
mark.steyvers@uci.edu
22 Frost
Irvine, CA 92617

The LDA Model
exampleLDA1      extract topics with the LDA model
exampleLDA2      extract multiple topic samples with the LDA model
exampleLDA3      shows how to order topics according to similarity in usage
exampleVIZ1      visualize topics in a 2D map
exampleVIZ2      visualize documents in a 2D map

The AT (Author-Topic) Model
exampleAT1       extract topics with the AT model
exampleAT2       extract multiple topic samples with the AT model

The HMMLDA Model
exampleHMMLDA1   extract topics and syntactic states with the HMMLDA model
exampleHMMLDA2   extract multiple topic samples with the HMMLDA model

The LDACOL (Collocation) Model
exampleLDACOL1   extract topics and collocations with the LDACOL model; also shows how to convert the LDACOL model output so that collocations appear in the vocabulary and topic counts
exampleLDACOL2   extract multiple topic samples with the LDACOL model
exampleLDACOL3   convert stream data as used by the HMMLDA model to collocation stream data as used by the LDACOL model

Applying Topic Models to Images
exampleimages1   simulates the "bars" example
exampleimages2   extracts topics from handwritten digits and characters
Topic Extraction Models
GibbsSamplerLDA      extract topics with the LDA model
GibbsSamplerAT       extract topics with the AT model
GibbsSamplerHMMLDA   extract topics and syntactic states with the HMMLDA model
GibbsSamplerLDACOL   extract topics and collocations with the LDACOL model

Visualization / Interpretation
WriteTopics               write the most likely entities (e.g., words, authors) per topic to a string and/or text file
WriteTopicMult            write topic-entity distributions for multiple entities to a string and/or text file
VisualizeTopics           visualize topics in a 2D map
VisualizeDocs             visualize documents in a 2D map based on topic distances
OrderTopics               order topics according to similarity in their distributions over documents
CreateCollocationTopics   create a new vocabulary and topic counts containing collocations

Utilities
compilescripts               compile all MEX scripts
importworddoccounts          import a text file of word-document counts into a sparse matrix
stream_to_collocation_data   convert stream data from the HMMLDA model into stream data for the LDACOL model
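As a rough sketch of how these functions fit together (the argument names, order, and parameter values below are illustrative, drawn from the example scripts; consult those scripts for the definitive calling conventions), a minimal LDA run on the Psych Review word-stream data might look like:

```matlab
% Minimal LDA workflow sketch -- parameter values are illustrative only.
load psychreviewstream;       % assumed to provide WS (word indices) and DS (document indices)
load words_psychreview;       % assumed to provide WO (the vocabulary cell array)

T      = 50;                  % number of topics
N      = 300;                 % number of Gibbs sampling iterations
ALPHA  = 50 / T;              % Dirichlet prior on document-topic distributions
BETA   = 0.01;                % Dirichlet prior on topic-word distributions
SEED   = 1;                   % random seed
OUTPUT = 1;                   % verbosity of sampler output

% WP: word-topic counts, DP: document-topic counts, Z: topic assignments per token
[WP, DP, Z] = GibbsSamplerLDA(WS, DS, T, N, ALPHA, BETA, SEED, OUTPUT);

% Write the most likely words per topic to a text file
WriteTopics(WP, BETA, WO, 10, 0.7, 4, 'topics.txt');
```

Re-running the sampler with different SEED values is the way the example scripts obtain multiple independent topic samples; the resulting counts can then be passed to the visualization functions such as VisualizeTopics.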
Psych Review Abstracts (bag of words)
bagofwords_psychreview   document-word counts
words_psychreview        vocabulary

Psych Review Abstracts (word stream)
psychreviewstream        successive word and document indices

Psych Review Abstracts (collocation word stream)
psychreviewcollocation   successive word and document indices with function words removed

NIPS proceedings papers (bag of words)
bagofwords_nips          document-word counts
words_nips               vocabulary
titles_nips              titles of papers
authors_nips             names of authors
authordoc_nips           document-author counts

NIPS proceedings papers (word stream)
nips_stream              successive word and document indices
(note: the document indices in this dataset do not align with the NIPS bag-of-words dataset)

NIPS proceedings papers (collocation stream)
nipscollocation          successive word and document indices with function words removed

Image Data
binaryalphabet           a set of handwritten digits and characters; see exampleimages2 for an application of topic models to these data
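The word-stream datasets above store a corpus as two parallel index vectors. A toy illustration of how this format relates to the bag-of-words representation (the vocabulary and counts here are made up, not taken from the toolbox data):

```matlab
% Toy illustration (hypothetical data) of the word-stream format:
% WS(k) is the vocabulary index of the k-th token in the corpus,
% and DS(k) is the index of the document that token occurs in.
WO = { 'topic' ; 'model' ; 'word' };   % vocabulary
WS = [ 1 2 2 3 1 ];                    % token stream: topic model model word topic
DS = [ 1 1 1 2 2 ];                    % documents:    doc1  doc1  doc1  doc2 doc2

% The same corpus as a sparse word-document count matrix:
% entry (w,d) = number of times word w occurs in document d;
% duplicate (w,d) pairs are accumulated by sparse()
WD = sparse(WS, DS, 1, numel(WO), max(DS));
full(WD)
% 'topic': once in doc1, once in doc2
% 'model': twice in doc1
% 'word' : once in doc2
```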
Version 1.4 (4/4/2011)
 Changed the C code to be compatible with 64-bit compilers
Version 1.3.2 (12/20/2007)
 Fixed a bug in the function "importworddoccounts"
Version 1.3.1 (1/6/2006)
 Added the MAT files in Matlab 6 uncompressed format.
 Made changes to the MEX code for compatibility with Linux C compilers
Version 1.3 (9/6/05)
 Rewrote the data format section
 The LDACOL (LDA Collocation) model was added. This model extracts topics just as in the LDA model; in addition, it can simultaneously extract collocations (i.e., frequently occurring combinations of words). The LDACOL model can also start from a previously saved state.
 The LDA and AT models can now start from a previously saved state.
 The input for the GibbsSamplerLDA function was changed. The input is now a set of word and document indices, not a sparse word-document count matrix.
 The input for the GibbsSamplerAT function was changed. The first two inputs are now a set of word and document indices, not a sparse word-document count matrix.
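For users with existing data in the old sparse count-matrix format, the expansion into word and document index vectors is straightforward; the following is a minimal sketch (variable names are illustrative, not part of the toolbox API):

```matlab
% Sketch: expand a sparse word-document count matrix WD into the
% word-index (WS) and document-index (DS) streams that the samplers
% now expect. Each nonzero count c at entry (w,d) becomes c tokens.
[w, d, c] = find(WD);     % word index, document index, count per nonzero entry
WS = [];
DS = [];
for k = 1:numel(c)
    WS = [WS, repmat(w(k), 1, c(k))];   % repeat word index c(k) times
    DS = [DS, repmat(d(k), 1, c(k))];   % repeat matching document index
end
```

Note that this expansion loses word order, so the resulting streams are adequate for the bag-of-words LDA and AT models but not for order-sensitive models such as HMMLDA.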
LDA MODEL
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.
Griffiths, T.L., Steyvers, M., & Tenenbaum, J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2), 211-244.
Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
AT (AUTHOR-TOPIC) MODEL
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.
Rosen-Zvi, M., Griffiths, T., Smyth, P., & Steyvers, M. (submitted). Learning author-topic models from text corpora.
HMMLDA MODEL
Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (2004). Integrating Topics and Syntax. In Advances in Neural Information Processing Systems, 17.
LDACOL MODEL
Griffiths, T.L., Steyvers, M., & Tenenbaum, J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2), 211-244. See pages 234-236.