Matlab Topic Modeling Toolbox 1.4 
Mark Steyvers
mark.steyvers@uci.edu
22 Frost
Irvine, CA 92617

The LDA Model
exampleLDA1      extract topics with the LDA model
exampleLDA2      extract multiple topic samples with the LDA model
exampleLDA3      shows how to order topics according to similarity in usage
exampleVIZ1      visualize topics in a 2D map
exampleVIZ2      visualize documents in a 2D map

The AT (Author-Topic) Model
exampleAT1       extract topics with the AT model
exampleAT2       extract multiple topic samples with the AT model

The HMMLDA Model
exampleHMMLDA1   extract topics and syntactic states with the HMMLDA model
exampleHMMLDA2   extract multiple topic samples with the HMMLDA model

The LDACOL (Collocation) Model
exampleLDACOL1   extract topics and collocations with the LDACOL model; also shows how to convert the LDACOL model output so that collocations appear in the vocabulary and topic counts
exampleLDACOL2   extract multiple topic samples with the LDACOL model
exampleLDACOL3   convert stream data as used by the HMMLDA model to collocation stream data as used by the LDACOL model

Applying Topic Models to Images
exampleimages1   simulates the "bars" example
exampleimages2   extracts topics from handwritten digits and characters
Topic Extraction Models
GibbsSamplerLDA      extract topics with the LDA model
GibbsSamplerAT       extract topics with the AT model
GibbsSamplerHMMLDA   extract topics and syntactic states with the HMMLDA model
GibbsSamplerLDACOL   extract topics and collocations with the LDACOL model

Visualization / Interpretation
WriteTopics               write the most likely entities (e.g., words, authors) per topic to a string and/or text file
WriteTopicMult            write topic-entity distributions for multiple entities to a string and/or text file
VisualizeTopics           visualize topics in a 2D map
VisualizeDocs             visualize documents in a 2D map based on topic distances
OrderTopics               order topics according to similarity in their distributions over documents
CreateCollocationTopics   create a new vocabulary and topic counts containing collocations

Utilities
compilescripts               compile all MEX scripts
importworddoccounts          import a text file of word-document counts into a sparse matrix
stream_to_collocation_data   convert stream data from the HMMLDA model into stream data for the LDACOL model
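As a rough sketch of how these functions fit together (the argument names, order, and parameter values below are illustrative, drawn from the example scripts; consult those scripts for the definitive calling conventions), a minimal LDA run on the Psych Review word-stream data might look like:

```matlab
% Minimal LDA workflow sketch -- parameter values are illustrative only.
load psychreviewstream;       % assumed to provide WS (word indices) and DS (document indices)
load words_psychreview;       % assumed to provide WO (the vocabulary cell array)

T      = 50;                  % number of topics
N      = 300;                 % number of Gibbs sampling iterations
ALPHA  = 50 / T;              % Dirichlet prior on document-topic distributions
BETA   = 0.01;                % Dirichlet prior on topic-word distributions
SEED   = 1;                   % random seed
OUTPUT = 1;                   % verbosity of sampler output

% WP: word-topic counts, DP: document-topic counts, Z: topic assignments per token
[WP, DP, Z] = GibbsSamplerLDA(WS, DS, T, N, ALPHA, BETA, SEED, OUTPUT);

% Write the most likely words per topic to a text file
WriteTopics(WP, BETA, WO, 10, 0.7, 4, 'topics.txt');
```

Re-running the sampler with different SEED values is the way the example scripts obtain multiple independent topic samples; the resulting counts can then be passed to the visualization functions such as VisualizeTopics.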
Psych Review Abstracts (bag of words)
bagofwords_psychreview   document-word counts
words_psychreview        vocabulary

Psych Review Abstracts (word stream)
psychreviewstream        successive word and document indices

Psych Review Abstracts (collocation word stream)
psychreviewcollocation   successive word and document indices with function words removed

NIPS proceedings papers (bag of words)
bagofwords_nips          document-word counts
words_nips               vocabulary
titles_nips              titles of papers
authors_nips             names of authors
authordoc_nips           document-author counts

NIPS proceedings papers (word stream)
nips_stream              successive word and document indices
(note: the document indices in this dataset do not align with the NIPS bag-of-words dataset)

NIPS proceedings papers (collocation stream)
nipscollocation          successive word and document indices with function words removed

Image Data
binaryalphabet           a set of handwritten digits and characters; see exampleimages2 for an application of topic models to these data
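The word-stream datasets above store a corpus as two parallel index vectors. A toy illustration of how this format relates to the bag-of-words representation (the vocabulary and counts here are made up, not taken from the toolbox data):

```matlab
% Toy illustration (hypothetical data) of the word-stream format:
% WS(k) is the vocabulary index of the k-th token in the corpus,
% and DS(k) is the index of the document that token occurs in.
WO = { 'topic' ; 'model' ; 'word' };   % vocabulary
WS = [ 1 2 2 3 1 ];                    % token stream: topic model model word topic
DS = [ 1 1 1 2 2 ];                    % documents:    doc1  doc1  doc1  doc2 doc2

% The same corpus as a sparse word-document count matrix:
% entry (w,d) = number of times word w occurs in document d;
% duplicate (w,d) pairs are accumulated by sparse()
WD = sparse(WS, DS, 1, numel(WO), max(DS));
full(WD)
% 'topic': once in doc1, once in doc2
% 'model': twice in doc1
% 'word' : once in doc2
```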
Version 1.4 (4/4/2011)
 Changed the C code to be compatible with 64-bit compilers
Version 1.3.2 (12/20/2007)
 Fixed a bug in the function "importworddoccounts"
Version 1.3.1 (1/6/2006)
 Added the MAT files in Matlab 6 uncompressed format.
 Made changes to the MEX code for compatibility with Linux C compilers
Version 1.3 (9/6/05)
 Rewrote the data format section
 The LDACOL (LDA Collocation) model was added. This model extracts topics just as in the LDA model; in addition, it can simultaneously extract collocations (i.e., frequently occurring combinations of words). The LDACOL model can also start from a previously saved state.
 The LDA and AT models can now start from a previously saved state.
 The input for the GibbsSamplerLDA function was changed. The input is now a set of word and document indices, not a sparse word-document count matrix.
 The input for the GibbsSamplerAT function was changed. The first two inputs are now a set of word and document indices, not a sparse word-document count matrix.
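For users with existing data in the old sparse count-matrix format, the expansion into word and document index vectors is straightforward; the following is a minimal sketch (variable names are illustrative, not part of the toolbox API):

```matlab
% Sketch: expand a sparse word-document count matrix WD into the
% word-index (WS) and document-index (DS) streams that the samplers
% now expect. Each nonzero count c at entry (w,d) becomes c tokens.
[w, d, c] = find(WD);     % word index, document index, count per nonzero entry
WS = [];
DS = [];
for k = 1:numel(c)
    WS = [WS, repmat(w(k), 1, c(k))];   % repeat word index c(k) times
    DS = [DS, repmat(d(k), 1, c(k))];   % repeat matching document index
end
```

Note that this expansion loses word order, so the resulting streams are adequate for the bag-of-words LDA and AT models but not for order-sensitive models such as HMMLDA.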
LDA MODEL
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.
Griffiths, T.L., Steyvers, M., & Tenenbaum, J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2), 211-244.
Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
AT (AUTHOR-TOPIC) MODEL
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.
Rosen-Zvi, M., Griffiths, T., Smyth, P., & Steyvers, M. (submitted). Learning author-topic models from text corpora.
HMMLDA MODEL
Griffiths, T.L., Steyvers, M., Blei, D.M., & Tenenbaum, J.B. (2004). Integrating Topics and Syntax. In Advances in Neural Information Processing Systems, 17.
LDACOL MODEL
Griffiths, T.L., Steyvers, M., & Tenenbaum, J.B. (2007). Topics in Semantic Representation. Psychological Review, 114(2), 211-244. See pages 234-236.