Data Format

To apply topic models to your own corpus of text, you need to construct two Matlab variables: WD and WO. WD contains the word-document counts and WO contains the vocabulary.

WD is a sparse matrix of size W x D where W is the number of words in the vocabulary and D is the number of documents. WD(i,j) contains the frequency with which word i occurs in document j.

WO is a cell array of strings of size W x 1 where W is the number of words in the vocabulary.
WO{i} contains the string for word i.
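
As a minimal sketch of the two variables described above, the following builds WD and WO by hand for a toy corpus. The words, counts, and the helper variable names (wordIdx, docIdx, counts) are made up for illustration and are not part of the toolbox.

```matlab
% Toy corpus: 3 vocabulary words, 2 documents (illustrative values only).
WO = {'apple'; 'banana'; 'cherry'};   % W x 1 cell array of strings
W  = numel(WO);                       % number of words in the vocabulary
D  = 2;                               % number of documents

% One entry per nonzero count: word index, document index, frequency.
wordIdx = [2 3 2];
docIdx  = [1 1 2];
counts  = [10 4 6];

% W x D sparse matrix; WD(i,j) is the frequency of word i in document j.
WD = sparse(wordIdx, docIdx, counts, W, D);
```

Passing W and D explicitly keeps the matrix at the full vocabulary size even when the last words in the vocabulary never occur in any document.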

If you have a text file with word-document counts and would like to convert it into the Matlab sparse matrix format, use the conversion function importworddoccounts. This function assumes that your text file is organized into three columns, where each row contains the document index, the word index, and the word count. For example:
1 2 10
1 3 4
2 2 6
should be read as "word 2 occurs 10 times in doc 1, word 3 occurs 4 times in doc 1, and word 2 occurs 6 times in doc 2". Type help importworddoccounts for more information.
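
To make the three-column layout concrete, here is a rough sketch of the same conversion using only built-in Matlab functions. It is not the implementation of importworddoccounts, whose exact behavior and arguments are documented in its help text. The temporary file and the variable names T, docIdx, wordIdx, and counts are assumptions made for this example.

```matlab
% Write the three example rows to a temporary file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%d %d %d\n', [1 2 10; 1 3 4; 2 2 6]');
fclose(fid);

% Read the numeric table: one row per (document, word, count) triple.
T = load(fname, '-ascii');
delete(fname);

docIdx  = T(:, 1);   % column 1: document index
wordIdx = T(:, 2);   % column 2: word index
counts  = T(:, 3);   % column 3: word count

% Words index the rows and documents index the columns, so WD is W x D.
% Without explicit sizes, W and D are inferred from the largest indices present.
WD = sparse(wordIdx, docIdx, counts);
```

Note that sparse adds up the counts of any repeated (word, document) pair, so duplicate rows in the file are summed rather than overwritten.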

To convert a text file of your vocabulary into a cell array of strings, use the "textread" function (a native Matlab function). For example, if your vocabulary is a text file "vocab.txt" with a different word on each row:
ABANDON
ABDOMINAL
...
then [ WO ] = textread( 'vocab.txt' , '%s' ) converts this to an appropriate cell array of strings, where WO{1} equals 'ABANDON', WO{2} equals 'ABDOMINAL', and so on.
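
Recent Matlab documentation marks textread as not recommended and points to textscan instead. The following is a small sketch of the same vocabulary import with textscan. The temporary file stands in for vocab.txt, and the two words are the only entries written, purely for illustration.

```matlab
% Create a small stand-in vocabulary file with one word per row.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'ABANDON', 'ABDOMINAL');
fclose(fid);

% textscan returns a 1 x 1 cell whose single element is the W x 1 cell array of strings.
fid = fopen(fname, 'r');
C = textscan(fid, '%s');
fclose(fid);
delete(fname);

WO = C{1};   % WO{1} is 'ABANDON', WO{2} is 'ABDOMINAL'
```

With the '%s' format, entries are split on whitespace, so this approach assumes that each vocabulary entry is a single token without spaces. Once both variables are loaded, it is worth checking that numel(WO) equals size(WD,1).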