Example 1: Running the Collocation Topic Model (LDACOL)

Load a dataset

dataset = 1; % 1 = psych review; 2 = nips

if (dataset == 1)
    fprintf( 'Loading Psych Review Abstracts - Collocation Data\n' );
    load 'psychreviewcollocation'; % load in variables: WW WO DS WS SI
    filenm = 'topics_psychreview_col.txt';
elseif (dataset == 2 )
    fprintf( 'Loading NIPS papers - Collocation Data\n' );
    load 'nipscollocation'; % load in variables: WW WO DS WS SI
    filenm = 'topics_nips_col.txt';
end
Loading Psych Review Abstracts - Collocation Data
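
To make the loaded variables concrete, the toy sketch below builds inputs of the same kind by hand. The conventions it assumes should be checked against the toolbox documentation: WS and DS hold one word index and one document index per token, SI flags tokens that may form a collocation with the preceding token, and WW(i,j) counts how often word i follows word j. The two-document corpus is made up for illustration.

```matlab
% Toy corpus (hypothetical): two documents, already tokenized.
docs = { {'signal','detection','theory'} , {'signal','detection','memory'} };

% Vocabulary in order of first appearance.
alltokens = [ docs{:} ];
WO = unique( alltokens , 'stable' );
W  = numel( WO );

WS = []; DS = []; SI = [];
for d = 1:numel( docs )
    [ ~ , idx ] = ismember( docs{d} , WO );
    WS = [ WS idx ];                             % word index of each token
    DS = [ DS d * ones( 1 , numel( idx ) ) ];    % document index of each token
    SI = [ SI 0 ones( 1 , numel( idx ) - 1 ) ];  % first token of a document has no predecessor
end

% Assumed convention: WW(i,j) = number of times word i follows word j.
follow = find( SI == 1 );
WW = sparse( WS( follow ) , WS( follow - 1 ) , 1 , W , W );
```

In this toy corpus "detection" follows "signal" twice, so the corresponding entry of WW is 2, which is the kind of evidence the model can use when deciding whether to chain the two words into a collocation.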

The number of topics

T = 100;

What output to show (0=no output; 1=iterations; 2=all output)

OUTPUT = 1;

Set the hyperparameters of the model

BETA   = 0.01;
ALPHA  = 50/T;
GAMMA0 = 0.1;
GAMMA1 = 0.1;
DELTA  = 0.1;

MAXC   = 4;   % maximum collocation length (in post-processing topics)

The number of iterations

N = 100;

The seed number

SEED = 1;

Run the Gibbs sampler (this might take a few minutes to finish)

tic
[ WP,DP,WC,C,Z ] = GibbsSamplerLDACOL( WS , DS , SI , WW , T , N , ALPHA , BETA , GAMMA0, GAMMA1 , DELTA , SEED , OUTPUT  );
toc

% convert topics to include collocations
[ WPNEW , DPNEW , WONEW ] = CreateCollocationTopics( C , Z , WO , DS , WS , T , MAXC );

fprintf( 'Writing collocation topics to file: %s\n' , filenm );
	Iteration 0 of 100;   Number of tokens in collocation = 0
	Iteration 10 of 100;   Number of tokens in collocation = 4736
	Iteration 20 of 100;   Number of tokens in collocation = 5767
	Iteration 30 of 100;   Number of tokens in collocation = 6279
	Iteration 40 of 100;   Number of tokens in collocation = 6514
	Iteration 50 of 100;   Number of tokens in collocation = 6539
	Iteration 60 of 100;   Number of tokens in collocation = 6680
	Iteration 70 of 100;   Number of tokens in collocation = 6714
	Iteration 80 of 100;   Number of tokens in collocation = 6626
	Iteration 90 of 100;   Number of tokens in collocation = 6607
Elapsed time is 21.063034 seconds.
Concatenating terms in stream...
Find all unique terms...
Find term indices (this might be slow)...
Writing collocation topics to file: topics_psychreview_col.txt
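
The iteration log above can be used for a rough check of whether the sampler has settled. The sketch below copies the token counts from that log and computes the relative change between consecutive reports. This is only an informal diagnostic chosen for illustration, not something the toolbox provides.

```matlab
% Token counts copied from the iteration log above (iterations 0, 10, ..., 90).
iter   = 0:10:90;
ntoken = [ 0 4736 5767 6279 6514 6539 6680 6714 6626 6607 ];

% Relative change between consecutive reports, skipping the first
% report, where the count is still zero.
relchange = abs( diff( ntoken( 2:end ) ) ) ./ ntoken( 2:end-1 );

for k = 1:numel( relchange )
    fprintf( 'Iterations %d to %d: relative change = %.3f\n' , ...
        iter( k+1 ) , iter( k+2 ) , relchange( k ) );
end
```

For this run the change falls from roughly 22% between iterations 10 and 20 to under 1% between iterations 80 and 90, which suggests that the number of tokens assigned to collocations has largely stabilized by N = 100.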

Post-process the vocabulary to include collocations as separate entries. Recalculate word-topic distributions with the expanded vocabulary.

if (dataset == 1)
    S = WriteTopics( WPNEW , BETA , WONEW , 20 , 0.7 , 4 , filenm );
    save 'ldacol_psychreview' WPNEW DPNEW WONEW WP DP WC C ALPHA BETA GAMMA0 GAMMA1 DELTA SEED N Z T;
elseif (dataset == 2)
    S = WriteTopics( WPNEW , BETA , WONEW , 40 , 0.7 , 4 , filenm );
    save 'ldacol_nips' WPNEW DPNEW WONEW WP DP WC C ALPHA BETA GAMMA0 GAMMA1 DELTA SEED N Z T;
end
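
As a sketch of what recalculating word-topic distributions can look like, the code below turns a small count matrix into smoothed probabilities using the standard estimate (count + BETA) / (column total + W * BETA) and lists the most probable entries per topic. The counts and vocabulary are hypothetical, and whether WriteTopics uses exactly this estimate internally is an assumption.

```matlab
% Hypothetical word-topic counts: 4 vocabulary entries x 2 topics.
WO   = { 'signal_detection' , 'memory' , 'recall' , 'signal' };
WP   = [ 8 0 ; 0 6 ; 2 3 ; 1 1 ];
BETA = 0.01;
K    = 2;                       % entries to list per topic

W = size( WP , 1 );
% Smoothed estimate: each column becomes a probability distribution.
P = ( WP + BETA ) ./ repmat( sum( WP , 1 ) + W * BETA , W , 1 );

for t = 1:size( WP , 2 )
    [ ~ , order ] = sort( P( : , t ) , 'descend' );
    fprintf( 'Topic %d: %s\n' , t , strjoin( WO( order( 1:K ) ) , ' ' ) );
end
```

Because collocations such as signal_detection are separate vocabulary entries after post-processing, they compete with single words for a place in the per-topic listing.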

Show some topics

S{1}( 1:90 )
S{2}( 1:90 )
S{3}( 1:90 )
S{4}( 1:90 )
S{5}( 1:90 )
S{6}( 1:90 )
S{7}( 1:90 )
S{8}( 1:90 )
S{9}( 1:90 )
S{10}( 1:90 )
ans =
specific authors personality authors_propose authors_show authors_argue authors_present va
ans =
choice continuous preference discrete alternatives consistent variable shown stochastic pr
ans =
visual spatial object space visual_system feature orientation line attentional color conto
ans =
signal_detection avoidance detection high low signal latency iq variance mean presence thr
ans =
memory recognition retrieval recall stored study familiarity cued retrieved forgetting ter
ans =
effect frequency proposed correct found occur degree acoustic reported results frequencies
ans =
motion direction physical points relative generates differential dimensional apparent dept
ans =
action support schema actions wide_range ambiguity involving identification personal conte
ans =
social individuals attitude discussed factors psychological relevant domains implications 
ans =
suggested research influence reviewed arousal evidence activity bem examines found recipro
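
The ten indexing statements above can also be written as a loop. The sketch below uses a stand-in cell array with made-up strings so that it runs on its own. It also guards against topic strings shorter than 90 characters, which would cause an indexing error with S{k}( 1:90 ).

```matlab
% Stand-in for the cell array S returned by WriteTopics (hypothetical strings).
S = { 'memory recognition retrieval recall' , ...
      'visual spatial object space visual_system' };

NSHOW = 90;   % maximum number of characters to print per topic
for k = 1:numel( S )
    n = min( NSHOW , length( S{k} ) );
    fprintf( 'Topic %d: %s\n' , k , S{k}( 1:n ) );
end
```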