Example 1 of running the HMM-LDA topic model

This example shows how to run the HMM-LDA Gibbs sampler on a small dataset to extract a set of topics and a set of syntactic states. Unlike the LDA and AT topic models, HMM-LDA does not require stop words to be removed from the corpus, and it makes use of the word order in the text. The code produces two count matrices: WP, which holds the number of times each word is assigned to each topic, and MP, which holds the number of times each word is assigned to each syntactic state.
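To make the two count matrices concrete, here is a minimal sketch on a made-up six-token stream. The names ws, z, x, wp, and mp, the toy values, and the convention that 0 means "not assigned" are assumptions for illustration only; the sampler's own encoding of Z and X may differ.

```matlab
% Toy stream: vocabulary index of each of six tokens (hypothetical data)
ws = [1 2 3 2 1 3];
% Hypothetical topic per token; 0 = token not assigned to any topic
z  = [1 0 2 2 1 0];
% Hypothetical syntactic state per token; 0 = token not assigned to any state
x  = [0 1 0 0 0 2];

W = 3; T = 2; NS = 2;   % vocabulary size, number of topics, number of states

% Count how often each word (row) is assigned to each topic (column)
inTopic = z > 0;
wp = accumarray([ws(inTopic)' z(inTopic)'], 1, [W T]);

% Count how often each word (row) is assigned to each syntactic state (column)
inState = x > 0;
mp = accumarray([ws(inState)' x(inState)'], 1, [W NS]);

disp(wp);
disp(mp);
```

Each row corresponds to a vocabulary entry and each column to a topic or state, which is the layout the description above gives for WP and MP.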

Choose the dataset

dataset = 1; % 1 = psych review; 2 = nips papers

if (dataset == 1)
    % Load the psych review word stream
    load 'psychreviewstream';

    % Set the parameters for the model
    T      = 50;     % number of topics
    NS     = 12;     % number of syntactic states
    N      = 200;    % number of iterations
    ALPHA  = 50 / T; % ALPHA hyperparameter
    BETA   = 0.01;   % BETA hyperparameter
    GAMMA  = 0.1;    % GAMMA hyperparameter
    SEED   = 2;      % random SEED

    filename1 = 'topics_psychreview_hmmlda_2.txt'; % text file showing topic-word distributions
    filename2 = 'states_psychreview_hmmlda_2.txt'; % text file showing HMM state-word distributions
end

if (dataset == 2)
    % Load the nips paper word stream
    load 'nips_stream';

    % Set the parameters for the model
    T      = 50;     % number of topics
    NS     = 16;     % number of syntactic states
    N      = 400;    % number of iterations
    ALPHA  = 50 / T; % ALPHA hyperparameter
    BETA   = 0.01;   % BETA hyperparameter
    GAMMA  = 0.1;    % GAMMA hyperparameter
    SEED   = 2;      % random SEED

    filename1 = 'topics_nips_hmmlda_2.txt'; % text file showing topic-word distributions
    filename2 = 'states_nips_hmmlda_2.txt'; % text file showing HMM state-word distributions
end

What output to show (0=no output; 1=iterations; 2=all output)

OUTPUT = 1;

Run the HMM-LDA Gibbs sampler

tic
[WP,DP,MP,Z,X]=GibbsSamplerHMMLDA( WS,DS,T,NS,N,ALPHA,BETA,GAMMA,SEED,OUTPUT);

fprintf( 'Elapsed time = %5.0f seconds\n' , toc );
	Iteration 0 of 200
	Iteration 10 of 200
	Iteration 20 of 200
	Iteration 30 of 200
	Iteration 40 of 200
	Iteration 50 of 200
	Iteration 60 of 200
	Iteration 70 of 200
	Iteration 80 of 200
	Iteration 90 of 200
	Iteration 100 of 200
	Iteration 110 of 200
	Iteration 120 of 200
	Iteration 130 of 200
	Iteration 140 of 200
	Iteration 150 of 200
	Iteration 160 of 200
	Iteration 170 of 200
	Iteration 180 of 200
	Iteration 190 of 200
Elapsed time =    92 seconds

Save the results to a file

if (dataset==1)
   save 'hmmldasingle_psychreview' WP DP MP Z X ALPHA BETA GAMMA N;
elseif (dataset==2)
   save 'hmmldasingle_nips' WP DP MP Z X ALPHA BETA GAMMA N;
end
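The two branches above differ only in the file name. As a hedged alternative, the function form of save accepts the file name as a variable, and load can return the saved variables in a struct. The toy values and the temporary file name below are illustrative assumptions, not part of the example.

```matlab
% Toy stand-ins for the sampler output (hypothetical values)
WP    = sparse([2 0; 0 1; 0 1]);
ALPHA = 1;
BETA  = 0.01;

% Build the file name once, then save with the function form of save
outfile = [tempname '.mat'];
save(outfile, 'WP', 'ALPHA', 'BETA');

% Load into a struct so the saved variables do not overwrite the workspace
R = load(outfile);
delete(outfile);

disp(full(R.WP));
```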

Calculate the most likely words in each topic and write to a cell array of strings

[S] = WriteTopics( WP , BETA , WO , 7 , 0.8 , 4 , filename1 );
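As a rough illustration of what "most likely words in each topic" means, the sketch below ranks words by the smoothed estimate (count + BETA) / (column total + W * BETA) computed from a toy count matrix. This is an assumption about the general approach, not the toolbox's WriteTopics implementation; wp, wo, pw, and K are hypothetical names and values.

```matlab
% Toy word-by-topic counts and vocabulary (hypothetical data)
wp   = [5 0; 1 4; 0 2; 3 3];
wo   = {'memory', 'signal', 'noise', 'model'};
BETA = 0.01;
K    = 2;                       % number of words to list per topic

[W, T] = size(wp);
% Smoothed estimate of the word distribution of every topic (columns sum to 1)
pw = (wp + BETA) ./ repmat(sum(wp, 1) + W * BETA, W, 1);

S = cell(T, 1);
for t = 1:T
    % Sort the words of topic t by decreasing probability
    [~, order] = sort(pw(:, t), 'descend');
    % Join the K most probable words into one string
    S{t} = strjoin(wo(order(1:K)), ' ');
end

disp(S);
```

The same ranking applies to a word-by-state count matrix, which is how the syntactic states are listed later in this example.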

Show the most likely words in the topics

fprintf( '\n\nMost likely words in the topics:\n' );
S( 1:T )

Most likely words in the topics:
ans = 
    'similarity bias strategies drug systematic biases conditions'
    'order serial search process parallel elements attention'
    'stimulus response stimuli responses color cs increase'
    'ss s change rate normal underlying practice'
    'self individual situations individuals those others consequences'
    'environment general behaviors constraints internal other external'
    '2 experiments single results experimental high trial'
    'personality variables measures research consistency issues cross'
    'pattern patterns changes critical false food sequences'
    'effects theories target cues predictions statistical inference'
    'processing human or presentation encoding recent times'
    'judgments probability frequency judgment event probabilities effect'
    'test matching representation conceptual verbal feedback particular'
    '2 relative previous 3 perceived distance apparent'
    'data model alternative assumptions general those empirical'
    'theory knowledge research studies new measurement memories'
    'attention features changes new objects category implicit'
    'problem processes independent specific solving independence domain'
    'behavior behavioral result goals interaction person activities'
    'group masking sensory inhibition groups depression target'
    'speech movement central motor accuracy timing attitude'
    'based spatial dynamic space sensitive simple surface'
    'only reasoning mental problems way standard possible'
    'more rather context than criterion less greater'
    'different same priming developmental emotional criteria traditional'
    'related function structures such other computational relations'
    'time temporal 2 functions simple scale linear'
    'processing language comprehension working sentence levels involved'
    'detection signal new noise strength latency mean'
    'system neural systems brain physiological arousal activity'
    'perception perceptual object dimensions image organization auditory'
    'data decision time making rt good mathematical'
    'they that important reading not evidence dual'
    'word words semantic activation network lexical phonological'
    'mechanisms children animals adaptive avoidance known associative'
    'do 1 component same natural but original'
    'information use integration based gender one production'
    'theoretical phenomena empirical such psychological common large'
    'capacity some 1st current present outcome separate'
    'models reinforcement response continuous random discrete rule'
    'memory recognition recall retrieval items item list'
    'social influence automatic controlled d iq x'
    'processes control cognitive primary tasks occur affect'
    [1x78 char]
    'discrimination power psychology motion causal value ratio'
    'choice stage error components errors all alternatives'
    [1x84 char]
    'performance task events tasks people selection level'
    'action basic experience research emotion cognition emotions'
    'other differences one relationships sex negative positive'
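
Entries such as [1x78 char] in the listing above are strings that the command-window display of a cell array abbreviates. A minimal sketch of printing every entry in full, using a toy cell array in place of S (its contents are made up for illustration):

```matlab
% Toy cell array standing in for S (hypothetical contents)
S = {'memory recognition recall'; repmat('x', 1, 90)};

% Format and print each entry on its own line; fprintf does not abbreviate long strings
lines = cell(numel(S), 1);
for t = 1:numel(S)
    lines{t} = sprintf('%2d: %s', t, S{t});
    fprintf('%s\n', lines{t});
end
```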

Calculate the most likely words in each syntactic state and write to a cell array of strings

[S] = WriteTopics( MP , BETA , WO , 7 , 0.8 , 4 , filename2 );

Show the most likely words in the syntactic states

fprintf( '\n\nMost likely words in the syntactic states:\n' );
S( 1:NS )

Most likely words in the syntactic states:
ans = 
    'model theory authors article process function hypothesis'
    'it presents there describes they we however'
    'which for a both explain whether how'
    'and are or'
    'in by with as on from for'
    'proposed shown used also not based suggested'
    'that to can may has have account'
    'discussed j presented a d r p'
    'the a this an these'
    'of'
    'is are be been'
    'effects evidence theories terms number models implications'