Function GibbsSamplerLDA

Runs the Gibbs sampler for the Latent Dirichlet Allocation (LDA) model

[ WP,DP,Z ] = GibbsSamplerLDA( WS,DS,T,N,ALPHA,BETA,SEED,OUTPUT ) runs the Gibbs sampler for the LDA model on bag-of-words data provided by the vectors WS and DS. WS(k) and DS(k) contain the word and document indices for the kth token. The maximum of WS is W, the vocabulary size. The maximum of DS is D, the number of documents. T is the number of topics. The first output is the sparse matrix WP, of size W x T, where WP(i,j) contains the number of times word i has been assigned to topic j. The second output is DP, a sparse D x T matrix, where DP(i,j) contains the number of times a word in document i has been assigned to topic j. The third output Z contains the topic assignments; Z(k) contains the topic assignment for token k.
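
As an illustration of the input layout described above, the following sketch builds WS and DS for a small made-up corpus and then calls the sampler. The toy data, the use of row vectors, and the particular values of N, ALPHA, BETA, SEED and OUTPUT are assumptions chosen for the example, not recommendations. The call itself is guarded so that the script still runs when the toolbox is not on the path.

```matlab
% Toy corpus (hypothetical): one row per token, [ document index , word index ].
tokens = [ 1 1; 1 2; 1 2; 2 3; 2 4; 2 1; 3 4; 3 4; 3 2 ];

DS = double( tokens(:,1)' );   % document index of each token
WS = double( tokens(:,2)' );   % word index of each token

W = max( WS );   % vocabulary size
D = max( DS );   % number of documents

T      = 2;      % number of topics
N      = 100;    % number of iterations
ALPHA  = 1;      % illustrative value only
BETA   = 0.01;   % illustrative value only
SEED   = 3;      % seed for the sampler's random number generator
OUTPUT = 1;      % show the iteration number only

% Call the sampler only if it can be found on the path.
if exist( 'GibbsSamplerLDA' , 'file' )
    [ WP , DP , Z ] = GibbsSamplerLDA( WS , DS , T , N , ALPHA , BETA , SEED , OUTPUT );
    % Per the description above, WP should be W x T, DP should be D x T,
    % and Z should hold one topic assignment per token.
end
```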

[ WP,DP,Z ] = GibbsSamplerLDA( WS,DS,T,N,ALPHA,BETA,SEED,OUTPUT,ZIN ) runs the sampler from the starting state ZIN, where ZIN(k) contains the topic assignment for token k, saved from a previous sample.
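
A minimal sketch of this two-stage use, assuming the toy data and parameter values shown (all of them made up for the example): the assignments Z returned by a first run are passed back in as ZIN to continue sampling for further iterations.

```matlab
% Hypothetical toy data, in the layout of WS and DS described above.
WS = [ 1 2 2 3 4 1 4 4 2 ];   % word index of each token
DS = [ 1 1 1 2 2 2 3 3 3 ];   % document index of each token

T = 2; ALPHA = 1; BETA = 0.01; SEED = 3; OUTPUT = 0;   % illustrative values

% Call the sampler only if it can be found on the path.
if exist( 'GibbsSamplerLDA' , 'file' )
    % First run: 100 iterations, no starting state supplied.
    [ WP , DP , Z ] = GibbsSamplerLDA( WS , DS , T , 100 , ALPHA , BETA , SEED , OUTPUT );

    % Second run: 50 further iterations, starting from the saved assignments Z.
    [ WP , DP , Z ] = GibbsSamplerLDA( WS , DS , T , 50 , ALPHA , BETA , SEED , OUTPUT , Z );
end
```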

NOTES

WS and DS should be in double precision.

N determines the number of iterations to run the Gibbs sampler.

ALPHA and BETA are the hyperparameters on the Dirichlet priors for the topic distributions (theta) and the topic-word distributions (phi), respectively.
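
The function returns counts rather than theta and phi themselves. One common way to turn the counts into point estimates is to add the corresponding hyperparameter and normalize; the sketch below does this for small made-up count matrices standing in for WP and DP. This is an assumption about how the outputs might be used, not something the function computes.

```matlab
% Hypothetical count matrices standing in for the sampler outputs.
WP = sparse( [ 3 0; 1 2; 0 4 ] );   % W = 3 words,     T = 2 topics
DP = sparse( [ 4 1; 0 5 ] );        % D = 2 documents, T = 2 topics
ALPHA = 0.5;                        % illustrative value only
BETA  = 0.1;                        % illustrative value only

WPf = full( WP );
DPf = full( DP );
[ W , T ] = size( WPf );

% phi(i,j): estimated probability of word i under topic j.
% Each column is normalized over the vocabulary.
phi = ( WPf + BETA ) ./ repmat( sum( WPf , 1 ) + W * BETA , W , 1 );

% theta(i,j): estimated probability of topic j in document i.
% Each row is normalized over the topics.
theta = ( DPf + ALPHA ) ./ repmat( sum( DPf , 2 ) + T * ALPHA , 1 , T );
```

With this normalization, each column of phi and each row of theta sums to one.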

SEED sets the seed for the random number generator.

OUTPUT determines the screen output by the sampler:
  0 = no output provided
  1 = show the iteration number only
  2 = show all output

The time to complete the procedure scales linearly with the number of topics and the number of iterations. The memory requirements scale linearly with the number of topics and documents.

A good setting for the number of iterations will depend on the number of topics and the complexity of the problem. For most problems, 500 to 2000 iterations will suffice.

Appropriate values for ALPHA and BETA depend on the number of topics and the number of words in the vocabulary. For most applications, good results can be obtained by setting ALPHA = 50 / T and BETA = 200 / W.
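
As a quick worked instance of this rule of thumb, assuming a hypothetical model with 50 topics and a vocabulary of 20000 words:

```matlab
T = 50;       % number of topics (hypothetical)
W = 20000;    % vocabulary size (hypothetical)

ALPHA = 50 / T;    % gives ALPHA = 1
BETA  = 200 / W;   % gives BETA = 0.01
```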

The sampler uses its own random number generator, so setting the seed for this function will not influence the random number seed for Matlab functions.

REFERENCES