This example shows how to run the HMM-LDA Gibbs sampler on a small dataset to extract a set of topics and a set of syntactic states. Unlike the LDA and AT topic models, HMM-LDA does not require stop words to be removed from the corpus, and it makes use of the word order of the text. The code produces two count matrices: WP, which records how many times each word is assigned to each topic, and MP, which records how many times each word is assigned to each syntactic state.
Choose the dataset
dataset = 1; % 1 = psych review; 2 = nips papers

if (dataset == 1)
    % Load the psych review word stream
    load 'psychreviewstream';
    % Set the parameters for the model
    T = 50; % number of topics
    NS = 12; % number of syntactic states
    N = 200; % number of iterations
    ALPHA = 50 / T; % ALPHA hyperparameter
    BETA = 0.01; % BETA hyperparameter
    GAMMA = 0.1; % GAMMA hyperparameter
    SEED = 2; % random SEED
    filename1 = 'topics_psychreview_hmmlda_2.txt'; % text file showing topic-word distributions
    filename2 = 'states_psychreview_hmmlda_2.txt'; % text file showing hmm state-word distributions
end

if (dataset == 2)
    % Load the nips paper word stream
    load 'nips_stream';
    % Set the parameters for the model
    T = 50; % number of topics
    NS = 16; % number of syntactic states
    N = 400; % number of iterations
    ALPHA = 50 / T; % ALPHA hyperparameter
    BETA = 0.01; % BETA hyperparameter
    GAMMA = 0.1; % GAMMA hyperparameter
    SEED = 2; % random SEED
    filename1 = 'topics_nips_hmmlda_2.txt'; % text file showing topic-word distributions
    filename2 = 'states_nips_hmmlda_2.txt'; % text file showing hmm state-word distributions
end
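The loaded word stream supplies the variables WS, DS, and WO used later in this example. As a minimal sketch of that layout, assuming WS holds one vocabulary index per token, DS holds the document index of the same token, and WO is the vocabulary (the toy values below are hypothetical and are not taken from psychreviewstream; the real stream files may carry additional markers), the following rebuilds each document in its original word order:

```matlab
% Toy vocabulary and word stream (hypothetical data for illustration only)
WO = { 'the' ; 'model' ; 'predicts' ; 'memory' ; 'recall' };
WS = [ 1 2 3 4 1 4 2 3 5 ];   % WS(k): vocabulary index of token k
DS = [ 1 1 1 1 2 2 2 2 2 ];   % DS(k): document index of token k

% Rebuild each document as a string, keeping the token order of the stream
D = max( DS );
docs = cell( D , 1 );
for d = 1:D
    tokens = WO( WS( DS == d ) );          % words of document d, in order
    docs{ d } = strjoin( tokens( : )' , ' ' );
end

fprintf( '%s\n' , docs{ : } );
```

Because the tokens stay in sequence, a model such as HMM-LDA can condition on neighboring words, which a bag-of-words count matrix would not allow.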
What output to show (0=no output; 1=iterations; 2=all output)
OUTPUT = 1;
Run the HMM-LDA Gibbs sampler
tic
[WP,DP,MP,Z,X]=GibbsSamplerHMMLDA( WS,DS,T,NS,N,ALPHA,BETA,GAMMA,SEED,OUTPUT);
fprintf( 'Elapsed time = %5.0f seconds\n' , toc );
Iteration 0 of 200
Iteration 10 of 200
Iteration 20 of 200
Iteration 30 of 200
Iteration 40 of 200
Iteration 50 of 200
Iteration 60 of 200
Iteration 70 of 200
Iteration 80 of 200
Iteration 90 of 200
Iteration 100 of 200
Iteration 110 of 200
Iteration 120 of 200
Iteration 130 of 200
Iteration 140 of 200
Iteration 150 of 200
Iteration 160 of 200
Iteration 170 of 200
Iteration 180 of 200
Iteration 190 of 200
Elapsed time = 92 seconds
Save the results to a file
if (dataset==1)
    save 'hmmldasingle_psychreview' WP DP MP Z X ALPHA BETA GAMMA N;
elseif (dataset==2)
    save 'hmmldasingle_nips' WP DP MP Z X ALPHA BETA GAMMA N;
end
Calculate the most likely words in each topic and write to a cell array of strings
[S] = WriteTopics( WP , BETA , WO , 7 , 0.8 , 4 , filename1 );
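To make the step above more concrete, here is a rough sketch of how the most likely words per topic could be derived from a word-by-topic count matrix. It is not the WriteTopics implementation: it assumes rows of WP index words and columns index topics, uses the standard BETA-smoothed estimate of P(word | topic), and ignores the file output and the remaining numeric arguments. The toy counts and the variable K are hypothetical.

```matlab
% Toy count matrix: rows = words, columns = topics (hypothetical values)
WO = { 'memory' ; 'recall' ; 'signal' ; 'noise' ; 'the' };
WP = sparse( [ 9 0 ; 6 1 ; 0 8 ; 1 5 ; 2 2 ] );
BETA = 0.01;
K = 3;                                % number of words to list per topic

counts = full( WP );
[ W , T ] = size( counts );

% Smoothed estimate of P( word | topic ); each column sums to 1
phi = ( counts + BETA ) ./ repmat( sum( counts , 1 ) + W * BETA , W , 1 );

% For each topic, sort words by probability and keep the K most likely
S = cell( T , 1 );
for t = 1:T
    [ ~ , order ] = sort( phi( : , t ) , 'descend' );
    S{ t } = strjoin( WO( order( 1:K ) )' , ' ' );
end

S
```

With these toy counts the first topic lists 'memory recall the' and the second lists 'signal noise the'.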
Show the most likely words in the topics
fprintf( '\n\nMost likely words in the topics:\n' );
S( 1:T )
Most likely words in the topics:
ans =
'similarity bias strategies drug systematic biases conditions'
'order serial search process parallel elements attention'
'stimulus response stimuli responses color cs increase'
'ss s change rate normal underlying practice'
'self individual situations individuals those others consequences'
'environment general behaviors constraints internal other external'
'2 experiments single results experimental high trial'
'personality variables measures research consistency issues cross'
'pattern patterns changes critical false food sequences'
'effects theories target cues predictions statistical inference'
'processing human or presentation encoding recent times'
'judgments probability frequency judgment event probabilities effect'
'test matching representation conceptual verbal feedback particular'
'2 relative previous 3 perceived distance apparent'
'data model alternative assumptions general those empirical'
'theory knowledge research studies new measurement memories'
'attention features changes new objects category implicit'
'problem processes independent specific solving independence domain'
'behavior behavioral result goals interaction person activities'
'group masking sensory inhibition groups depression target'
'speech movement central motor accuracy timing attitude'
'based spatial dynamic space sensitive simple surface'
'only reasoning mental problems way standard possible'
'more rather context than criterion less greater'
'different same priming developmental emotional criteria traditional'
'related function structures such other computational relations'
'time temporal 2 functions simple scale linear'
'processing language comprehension working sentence levels involved'
'detection signal new noise strength latency mean'
'system neural systems brain physiological arousal activity'
'perception perceptual object dimensions image organization auditory'
'data decision time making rt good mathematical'
'they that important reading not evidence dual'
'word words semantic activation network lexical phonological'
'mechanisms children animals adaptive avoidance known associative'
'do 1 component same natural but original'
'information use integration based gender one production'
'theoretical phenomena empirical such psychological common large'
'capacity some 1st current present outcome separate'
'models reinforcement response continuous random discrete rule'
'memory recognition recall retrieval items item list'
'social influence automatic controlled d iq x'
'processes control cognitive primary tasks occur affect'
[1x78 char]
'discrimination power psychology motion causal value ratio'
'choice stage error components errors all alternatives'
[1x84 char]
'performance task events tasks people selection level'
'action basic experience research emotion cognition emotions'
'other differences one relationships sex negative positive'
Calculate the most likely words in each syntactic state and write to a cell array of strings
[S] = WriteTopics( MP , BETA , WO , 7 , 0.8 , 4 , filename2 );
Show the most likely words in the syntactic states
fprintf( '\n\nMost likely words in the syntactic states:\n' );
S( 1:NS )
Most likely words in the syntactic states:
ans =
'model theory authors article process function hypothesis'
'it presents there describes they we however'
'which for a both explain whether how'
'and are or'
'in by with as on from for'
'proposed shown used also not based suggested'
'that to can may has have account'
'discussed j presented a d r p'
'the a this an these'
'of'
'is are be been'
'effects evidence theories terms number models implications'
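Several states above list fewer than seven words (for example the state containing only 'of'). One plausible reading is that the 0.8 passed to WriteTopics acts as a cumulative-probability threshold, so a state dominated by a single word is cut off early. The sketch below illustrates that idea on toy counts; the exact cutoff rule, the variable names K and E, and the data are all assumptions, not the toolbox's documented behavior.

```matlab
% Toy state counts: rows = words, columns = syntactic states (hypothetical)
WO = { 'of' ; 'the' ; 'a' ; 'this' ; 'an' };
MP = [ 97 0 ; 1 45 ; 1 30 ; 1 15 ; 0 10 ];
BETA = 0.01;
K = 7;      % maximum number of words per state (assumed meaning)
E = 0.8;    % cumulative-probability threshold (assumed meaning)

[ W , NS ] = size( MP );
S = cell( NS , 1 );
for s = 1:NS
    % Smoothed estimate of P( word | state )
    p = ( MP( : , s ) + BETA ) / ( sum( MP( : , s ) ) + W * BETA );
    [ psorted , order ] = sort( p , 'descend' );
    % Assumed rule: list a word only while the mass listed before it is below E
    massbefore = cumsum( psorted ) - psorted;
    keep = find( massbefore < E , K );
    S{ s } = strjoin( WO( order( keep ) )' , ' ' );
end

S
```

Under this assumed rule the first toy state is listed as 'of' alone, while the second is listed as 'the a this'.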