DATA REQUIREMENTS

To apply topic models to your own corpus of text, you first need to construct several Matlab variables. The exact format depends on which type of topic model you want to run. Below are the specifications for the input that each kind of topic model requires and the output that each model produces.

LDA MODEL

The input is a bag-of-words representation containing the number of times each word occurs in a document. The outputs are the topic assignments for each word token, together with counts of the number of times each word is assigned to a topic and the number of times each topic is assigned to a document.
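As a rough illustration of how the two count outputs relate to the token-level topic assignments, the sketch below tallies them with accumarray. The toy data and the names Z, wordTopicCounts, and docTopicCounts are hypothetical and chosen only for this example; the toolbox's own variable names and storage format may differ.

```matlab
% Hypothetical token-level data: 2 documents, 3 vocabulary words, 2 topics.
WS = [1 2 2 3 1 3];   % word index of each token
DS = [1 1 1 2 2 2];   % document index of each token
Z  = [1 2 2 1 1 2];   % topic assigned to each token (illustrative values)

nWords  = 3;
nDocs   = 2;
nTopics = 2;

% Count how often each word is assigned to each topic (nWords x nTopics).
wordTopicCounts = accumarray([WS(:) Z(:)], 1, [nWords nTopics]);

% Count how often each topic is assigned within each document (nDocs x nTopics).
docTopicCounts = accumarray([DS(:) Z(:)], 1, [nDocs nTopics]);

disp(wordTopicCounts);
disp(docTopicCounts);
```

Both matrices sum to the total number of tokens, which is a quick sanity check on any set of assignments.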

INPUT

OUTPUT

NOTES

If you have a text file with word-document counts and would like to convert it into the Matlab vectors WS and DS, use the conversion function importworddoccounts (type help importworddoccounts for more information). This function assumes that the text file is organized into three columns, where each row contains the document index, the word index, and the word count. For example, the rows

    1 2 10
    1 3 4
    2 2 6

should be read as "word 2 occurs 10 times in doc 1, word 3 occurs 4 times in doc 1, and word 2 occurs 6 times in doc 2".

To convert a text file of your vocabulary into a cell array of strings, use the native Matlab function textread. For example, if your vocabulary is stored in a text file "vocab.txt" with a different word on each row, then [ WO ] = textread( 'vocab.txt' , '%s' ) should convert it to an appropriate cell array of strings.
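To make the three-column layout concrete, here is a minimal sketch that expands the example rows above into per-token vectors. It assumes that WS and DS hold one entry per word token (the word index and the document index of that token); it is not the implementation of importworddoccounts, and the variable triples is hypothetical.

```matlab
% The example rows from the text: [docIndex wordIndex count].
triples = [1 2 10;
           1 3  4;
           2 2  6];

% Repeat each document and word index once per occurrence.
% repelem requires MATLAB R2015a or later.
DS = repelem(triples(:,1), triples(:,3))';  % document index of each token
WS = repelem(triples(:,2), triples(:,3))';  % word index of each token

fprintf('%d tokens in total\n', numel(WS));
```

Under this assumption the three rows expand to 20 tokens, 14 of them in document 1 and 6 in document 2.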

AT (Author-Topic) MODEL

The input is a bag-of-words representation containing the number of times each word occurs in a document. Also needed is a matrix indicating which authors are present on each document. The outputs are the topic and author assignments for each word token, together with counts of the number of times each word is assigned to a topic and the number of times each topic is assigned to an author.
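One way to represent the author information is a sparse indicator matrix. The sketch below assumes an authors-by-documents orientation, with a 1 wherever an author appears on a document; the name AD, the orientation, and the toy author lists are assumptions made for illustration and should be checked against the toolbox's help text.

```matlab
% Hypothetical author lists: document 1 has authors 1 and 2, and so on.
docAuthors = {[1 2], 2, [1 3]};
nAuthors = 3;
nDocs    = numel(docAuthors);

% Flatten the lists into parallel (author, document) index vectors.
authorIdx = [docAuthors{:}];
docIdx    = repelem(1:nDocs, cellfun(@numel, docAuthors));

% AD(a, d) = 1 if author a is listed on document d (assumed orientation).
AD = sparse(authorIdx, docIdx, 1, nAuthors, nDocs);

disp(full(AD));
```

A sparse matrix is a natural fit here because most authors appear on only a small fraction of the documents in a corpus.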

INPUT

OUTPUT

LDA-COL (Collocation) MODEL

The input is a stream-of-words representation that preserves the order of words as they appear in documents. The outputs are the topic assignments for each word token and the assignment of each token to either the collocation model or the topic model. The output also includes counts of the number of times each word is assigned to a topic and the number of times each topic is assigned to a document.
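The difference from a bag-of-words input is that token order is kept. The following sketch encodes a tokenized corpus as ordered index vectors; reusing the names WS, DS, and WO from the LDA notes is an assumption, docs is a hypothetical toy corpus, and any additional inputs the collocation model may require are not shown.

```matlab
% Hypothetical tokenized documents, words kept in their original order.
docs = { {'stock', 'market', 'fell'}, {'market', 'rally'} };

% Concatenate all tokens into one stream, preserving order.
allTokens = [docs{:}];

% WO: sorted vocabulary; wordIdx: vocabulary index of each token.
[WO, ~, wordIdx] = unique(allTokens);

WS = wordIdx(:)';                                     % word index per token, in stream order
DS = repelem(1:numel(docs), cellfun(@numel, docs));   % document index per token

disp(WS);
disp(DS);
```

Because WS follows the original word order, adjacent entries with the same DS value correspond to adjacent words in the text, which is the information a collocation or HMM-based model relies on.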

INPUT

OUTPUT

HMM-LDA MODEL

The input is a stream-of-words representation that preserves the order of words as they appear in documents. The outputs are the topic assignments for each word token and the assignment of each token to an HMM state. The output also includes counts of the number of times each word is assigned to a topic and to an HMM state, and the number of times each topic is assigned to a document.

INPUT

OUTPUT