To get started applying topic models to your own corpus of text, several Matlab variables need to be constructed. The exact format depends on which type of topic model you want to run. Below is a specification of the input that needs to be provided to each kind of topic model and of the output each model will produce.

The input is a bag-of-words representation containing the number of times each word occurs in a document. The outputs are the topic assignments to each word token, as well as counts of the number of times each word is assigned to a topic and the number of times a topic is assigned to a document.

**INPUT**

`WS`: a 1 x `N` vector where `WS(k)` contains the vocabulary index of the kth word token, and `N` is the number of word tokens. The word indices are not zero-based, i.e., min(`WS`) = 1 and max(`WS`) = `W` = the number of distinct words in the vocabulary.

`DS`: a 1 x `N` vector where `DS(k)` contains the document index of the kth word token. The document indices are not zero-based, i.e., min(`DS`) = 1 and max(`DS`) = `D` = the number of documents.

`WO`: a 1 x `W` cell array of strings where `WO{k}` contains the kth vocabulary item and `W` is the number of distinct vocabulary items. Not needed for running the Gibbs sampler, but necessary when writing the resulting word-topic distributions to a file using the `writetopics` Matlab function.
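To make the format concrete, here is a minimal sketch of the three input variables for a hypothetical two-document toy corpus (the corpus and all values are illustrative):

```matlab
% Hypothetical toy corpus: two documents over a three-word vocabulary.
WO = { 'topic' , 'model' , 'corpus' };   % 1 x W cell array of strings, W = 3

% Document 1 = "topic model topic", document 2 = "corpus model"
WS = [ 1 2 1 3 2 ];   % vocabulary index of each of the N = 5 word tokens
DS = [ 1 1 1 2 2 ];   % document index of each word token

% Sanity checks implied by the format description above
assert( min( WS ) == 1 && max( WS ) == numel( WO ) );
assert( min( DS ) == 1 && max( DS ) == 2 );
```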

**OUTPUT**

`WP`: a sparse `W` x `T` matrix, where `W` is the number of words in the vocabulary and `T` is the number of topics. `WP(i,j)` contains the number of times word `i` has been assigned to topic `j`.

`DP`: a sparse `D` x `T` matrix, where `D` is the number of documents. `DP(d,j)` contains the number of times a word token in document `d` has been assigned to topic `j`.

`Z`: a 1 x `N` vector containing the topic assignments, where `N` is the number of word tokens. `Z(k)` contains the topic assignment for token `k`.
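Since `WP` and `DP` store raw counts, smoothed word-topic and topic-document distributions can be recovered by normalizing them, as is standard for LDA. A minimal sketch, where the names `BETA` and `ALPHA` and their values are illustrative stand-ins for whatever smoothing hyperparameters were used by the sampler:

```matlab
% WP and DP are available after running the sampler; both are sparse.
[ W , T ] = size( WP );
D = size( DP , 1 );
BETA  = 0.01;      % hypothetical smoothing values -- match your sampler run
ALPHA = 50 / T;

% phi(i,j) = p( word i | topic j ); each column sums to one
phi = ( full( WP ) + BETA ) ./ repmat( full( sum( WP , 1 ) ) + W * BETA , W , 1 );

% theta(d,j) = p( topic j | document d ); each row sums to one
theta = ( full( DP ) + ALPHA ) ./ repmat( full( sum( DP , 2 ) ) + T * ALPHA , 1 , T );
```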

**NOTES**

If you have a text file with word-document counts and would like to convert it into the Matlab vectors `WS` and `DS`, use the conversion function `importworddoccounts`. This function assumes your text file is organized into three columns, where each row contains the document index, the word index, and the word count. For example, the three lines `1 2 10`, `1 3 4`, and `2 2 6` should be read as "word 2 occurs 10 times in doc 1, word 3 occurs 4 times in doc 1, and word 2 occurs 6 times in doc 2" (type `help importworddoccounts` for more information). To convert a text file of your vocabulary into a cell array of strings, use the `textread` function (a native Matlab function). For example, if your vocabulary is a text file "vocab.txt" with a different word on each row, then `[ WO ] = textread( 'vocab.txt' , '%s' )` should convert it to an appropriate cell array of strings.
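Putting these notes together, a typical import might look like the following sketch. The file names are hypothetical, and the output arguments of `importworddoccounts` are assumed from the description above; verify them with `help importworddoccounts`:

```matlab
% Hypothetical file names; the return values of importworddoccounts are
% assumed from its description -- check "help importworddoccounts".
[ WS , DS ] = importworddoccounts( 'worddoccounts.txt' );

% One vocabulary word per row in vocab.txt
[ WO ] = textread( 'vocab.txt' , '%s' );
```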

The input is a bag-of-words representation containing the number of times each word occurs in a document, plus a matrix listing the authors present on each document. The outputs are the topic and author assignments to each word token, as well as counts of the number of times each word is assigned to a topic and the number of times a topic is assigned to an author.

**INPUT**

`WS`: a 1 x `N` vector where `WS(k)` contains the vocabulary index of the kth word token, and `N` is the number of word tokens. The word indices are not zero-based, i.e., min(`WS`) = 1 and max(`WS`) = `W` = the number of distinct words in the vocabulary.

`DS`: a 1 x `N` vector where `DS(k)` contains the document index of the kth word token. The document indices are not zero-based, i.e., min(`DS`) = 1 and max(`DS`) = `D` = the number of documents.

`AD`: an `A` x `D` sparse matrix, where `A` is the number of distinct authors and `D` is the number of documents. `AD(a,d)` = 1 when author `a` is present on document `d` and zero otherwise.

`WO`: a 1 x `W` cell array of strings where `WO{k}` contains the kth vocabulary item and `W` is the number of distinct vocabulary items. Not needed for running the Gibbs sampler, but necessary when writing the resulting word-topic distributions to a file using the `writetopics` Matlab function.

`AN`: a 1 x `A` cell array of strings where `AN{k}` contains the kth author name and `A` is the number of distinct authors. Not needed for running the Gibbs sampler, but necessary when writing the resulting author-topic distributions to a file using the `writetopics` Matlab function.
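As an illustration of the `AD` format, the following hypothetical sketch builds the sparse author-document matrix from a per-document list of author indices:

```matlab
% Hypothetical example: 3 authors and 2 documents. Document 1 was written
% by authors 1 and 3, document 2 by author 2.
AN      = { 'author one' , 'author two' , 'author three' };
authors = { [ 1 3 ] , 2 };          % author indices for each document

A  = numel( AN );
D  = numel( authors );
AD = sparse( A , D );
for d = 1:D
    AD( authors{ d } , d ) = 1;     % mark each author as present on document d
end
```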

**OUTPUT**

`WP`: a sparse `W` x `T` matrix, where `W` is the number of words in the vocabulary and `T` is the number of topics. `WP(i,j)` contains the number of times word `i` has been assigned to topic `j`.

`AT`: a sparse `A` x `T` matrix, where `A` is the number of authors. `AT(a,j)` contains the number of times a word token associated with author `a` has been assigned to topic `j`.

`Z`: a 1 x `N` vector containing the topic assignments, where `N` is the number of word tokens. `Z(k)` contains the topic assignment for token `k`.

`X`: a 1 x `N` vector containing the author assignments, where `N` is the number of word tokens. `X(k)` contains the author assignment for token `k`.
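One useful consistency check on this output is that every token's sampled author must be among the authors listed for that token's document, as in this minimal sketch:

```matlab
% Every sampled author X(k) should be listed on the token's document DS(k)
ok = full( AD( sub2ind( size( AD ) , X , DS ) ) ) == 1;
assert( all( ok ) );
```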

The input is a stream-of-words representation preserving the order of words as they appear in documents. The outputs are the topic assignments to each word token and the assignments of tokens to either a collocation or the topic model. The output also includes counts of the number of times each word is assigned to a topic and the number of times a topic is assigned to a document.

**INPUT**

`WS`: a 1 x `N` vector where `WS(k)` contains the vocabulary index of the kth word token, and `N` is the number of word tokens. The word indices are not zero-based, i.e., min(`WS`) = 1 and max(`WS`) = `W` = the number of distinct words in the vocabulary. Note that the words are ordered according to their occurrence in documents, but that some words, such as stop words, can and should be omitted from this stream.

`DS`: a 1 x `N` vector where `DS(k)` contains the document index of the kth word token. The document indices are not zero-based, i.e., min(`DS`) = 1 and max(`DS`) = `D` = the number of documents.

`WW`: a `W` x `W` sparse matrix where `WW(i,j)` contains the count of the number of times that word `i` follows word `j` in the word stream.

`SI`: a 1 x `N` vector where `SI(k)` = 1 only if the kth word can form a collocation with the (k-1)th word, and `SI(k)` = 0 otherwise. This representation is necessary because some consecutive words cross document boundaries and should not be allowed to form a collocation. Similarly, some words, such as stop words, might have been deleted from the original word stream; any word at position `k` that follows a previously deleted word should have `SI(k)` = 0.

`WO`: a 1 x `W` cell array of strings where `WO{k}` contains the kth vocabulary item and `W` is the number of distinct vocabulary items. Not needed for running the Gibbs sampler, but necessary when writing the resulting word-topic distributions to a file using the `writetopics` Matlab function.
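The following hypothetical sketch derives `SI` and `WW` from the word stream, handling the document-boundary case; positions that followed a deleted stop word would additionally need `SI` set to 0 by hand:

```matlab
N = numel( WS );
W = max( WS );

% SI(k) = 1 only when token k is in the same document as token k-1; if stop
% words were removed from the stream, also zero SI at the positions that
% followed a deleted word.
SI = [ 0 , double( DS( 2:N ) == DS( 1:N-1 ) ) ];

% WW(i,j) = number of times word i follows word j within a document;
% sparse() sums the value 1 over repeated (i,j) pairs.
prev   = WS( 1:N-1 );
next   = WS( 2:N );
inside = SI( 2:N ) == 1;
WW = sparse( next( inside ) , prev( inside ) , 1 , W , W );
```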

**OUTPUT**

`WP`: a sparse `W` x `T` matrix, where `W` is the number of words in the vocabulary and `T` is the number of topics. `WP(i,j)` contains the number of times word `i` has been assigned to topic `j`.

`DP`: a sparse `D` x `T` matrix, where `D` is the number of documents. `DP(d,j)` contains the number of times a word token in document `d` has been assigned to topic `j`.

`WC`: a 1 x `W` vector where `WC(k)` contains the number of times word `k` led to a collocation with the next word in the word stream.

`C`: a 1 x `N` vector containing the topic/collocation assignments, where `N` is the number of word tokens. `C(k)` = 0 when token `k` was assigned to the topic model; `C(k)` = 1 when token `k` was assigned to a collocation with word token `k`-1.

`Z`: a 1 x `N` vector containing the topic assignments, where `N` is the number of word tokens. `Z(k)` contains the topic assignment for token `k`.
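A hypothetical sketch that reads the discovered collocations back out of the output, pairing each token with its predecessor whenever `C(k)` = 1:

```matlab
% C(k) = 1 always refers back to token k-1, so every hit has k >= 2
idx = find( C == 1 );
for n = 1:numel( idx )
    k = idx( n );
    fprintf( '%s %s\n' , WO{ WS( k-1 ) } , WO{ WS( k ) } );
end
```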

The input is a stream-of-words representation preserving the order of words as they appear in documents. The outputs are the topic assignments to each word token and the assignments of tokens to an HMM state. The output also includes counts of the number of times each word is assigned to a topic and to an HMM state, and the number of times a topic is assigned to a document.

**INPUT**

`WS`: a 1 x `N` vector where `WS(k)` contains the vocabulary index of the kth word token, and `N` is the number of word tokens. The vocabulary indices are not zero-based, i.e., they run from 1 to `W` = the number of distinct words in the vocabulary; the special index 0 is reserved for the end-of-sentence marker. Note that the words are ordered according to their occurrence in documents.

`DS`: a 1 x `N` vector where `DS(k)` contains the document index of the kth word token. The document indices are not zero-based, i.e., min(`DS`) = 1 and max(`DS`) = `D` = the number of documents.

`WO`: a 1 x `W` cell array of strings where `WO{k}` contains the kth vocabulary item and `W` is the number of distinct vocabulary items. Not needed for running the Gibbs sampler, but necessary when writing the resulting word-topic distributions to a file using the `writetopics` Matlab function.
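Since the end-of-sentence marker uses word index 0, here is a minimal sketch of what the token stream looks like (a hypothetical one-document example; the markers are assumed here to carry the document index of the sentence they close):

```matlab
% Hypothetical one-document stream with two sentences: "topic model" and
% "corpus model". The zeros are end-of-sentence markers, not vocabulary items.
WS = [ 1 2 0 3 2 0 ];
DS = [ 1 1 1 1 1 1 ];

nWords = sum( WS > 0 );   % number of real word tokens, excluding markers
```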

**OUTPUT**

`WP`: a sparse `W` x `T` matrix, where `W` is the number of words in the vocabulary and `T` is the number of topics. `WP(i,j)` contains the number of times word `i` has been assigned to topic `j`.

`DP`: a sparse `D` x `T` matrix, where `D` is the number of documents. `DP(d,j)` contains the number of times a word token in document `d` has been assigned to topic `j`.

`MP`: a sparse `W` x `S` matrix, where `S` is the number of HMM states. `MP(i,j)` contains the number of times word `i` has been assigned to HMM state `j`. Note that HMM state 1 represents the LDA model and states 2..`S` represent the "syntactic" HMM states.

`Z`: a 1 x `N` vector containing the topic assignments, where `N` is the number of word tokens. `Z(k)` contains the topic assignment for token `k`.

`X`: a 1 x `N` vector containing the HMM state assignments, where `N` is the number of word tokens. `X(k)` contains the assignment of the kth word token to an HMM state. Note that HMM state 1 represents the LDA model and states 2..`S` represent the "syntactic" HMM states.
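A minimal sketch that uses `X` to separate the tokens routed to the LDA model (state 1) from those captured by the syntactic states, assuming `X` has an entry for every token including the end-of-sentence markers:

```matlab
% Split tokens by HMM state, skipping the end-of-sentence markers (index 0)
isWord    = WS > 0;
content   = WS( isWord & X == 1 );   % tokens routed to the LDA model
syntactic = WS( isWord & X > 1 );    % tokens captured by states 2..S

% Fraction of real word tokens treated as "content" words
fracContent = numel( content ) / sum( isWord );
```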