Load a dataset

dataset = 1; % 1 = psych review; 2 = nips
if (dataset == 1)
    fprintf( 'Loading Psych Review Abstracts - Collocation Data\n' );
    load 'psychreviewcollocation'; % load in variables: WW WO DS WS SI
    filenm = 'topics_psychreview_col.txt';
elseif (dataset == 2)
    fprintf( 'Loading NIPS papers - Collocation Data\n' );
    load 'nipscollocation'; % load in variables: WW WO DS WS SI
    filenm = 'topics_nips_col.txt';
end

Loading Psych Review Abstracts - Collocation Data
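The loaded variables describe the corpus as a flat stream of tokens. As a rough illustration of that layout, the sketch below builds `WO`, `WS`, `DS` and `SI` for a two-document toy corpus. The toy data is hypothetical, and the meaning given to `SI` here (1 when a token has a predecessor in the same document and so could join it in a collocation) is an assumption rather than something confirmed by the data files.

```matlab
% Toy corpus (hypothetical data, not taken from the toolbox files)
docs = { 'signal detection theory' , 'signal detection model' };

tokens = {};   % all word tokens in corpus order
DS     = [];   % document index of each token
for d = 1:numel( docs )
    words  = strsplit( docs{d} , ' ' );
    tokens = [ tokens , words ];
    DS     = [ DS , d * ones( 1 , numel( words ) ) ];
end

% WO: vocabulary (unique words, sorted); WS: vocabulary index of each token
[ WO , ~ , WS ] = unique( tokens );
WS = WS(:)';

% Assumed meaning of SI: 1 if the token has a predecessor in the same document
SI = [ 0 , diff( DS ) == 0 ];
```

With this layout, `WO{ WS(k) }` recovers the word at token position `k` and `DS(k)` gives the document it belongs to.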

The number of topics

T = 100;

What output to show (0=no output; 1=iterations; 2=all output)

OUTPUT = 1;

Set the hyperparameters of the model

```
BETA = 0.01;
ALPHA = 50/T;
GAMMA0 = 0.1;
GAMMA1 = 0.1;
DELTA = 0.1;
MAXC = 4; % maximum collocation length (in post-processing topics)
```

The number of iterations

N = 100;

The seed number

SEED = 1;

This function might need a few minutes to finish

tic
[ WP,DP,WC,C,Z ] = GibbsSamplerLDACOL( WS , DS , SI , WW , T , N , ALPHA , BETA , GAMMA0, GAMMA1 , DELTA , SEED , OUTPUT );
toc

% convert topics to include collocations
[ WPNEW , DPNEW , WONEW ] = CreateCollocationTopics( C , Z , WO , DS , WS , T , MAXC );

fprintf( 'Writing collocation topics to file: %s\n' , filenm );

Iteration 0 of 100; Number of tokens in collocation = 0
Iteration 10 of 100; Number of tokens in collocation = 4736
Iteration 20 of 100; Number of tokens in collocation = 5767
Iteration 30 of 100; Number of tokens in collocation = 6279
Iteration 40 of 100; Number of tokens in collocation = 6514
Iteration 50 of 100; Number of tokens in collocation = 6539
Iteration 60 of 100; Number of tokens in collocation = 6680
Iteration 70 of 100; Number of tokens in collocation = 6714
Iteration 80 of 100; Number of tokens in collocation = 6626
Iteration 90 of 100; Number of tokens in collocation = 6607
Elapsed time is 21.063034 seconds.
Concatenating terms in stream...
Find all unique terms...
Find term indices (this might be slow)...
Writing collocation topics to file: topics_psychreview_col.txt
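The count reported at each iteration can be reproduced from a collocation indicator vector. The sketch below assumes that `C(k) = 1` marks token `k` as attached to the preceding token; that reading of `C` is an assumption, and the vector used here is made-up toy data. It also derives the length of each resulting phrase, which is the quantity that `MAXC` limits in post-processing.

```matlab
% Hypothetical collocation indicators for a 7-token stream
C = [ 0 1 0 0 1 1 0 ];

% Number (and fraction) of tokens attached to their predecessor
ncol = sum( C == 1 );
frac = ncol / numel( C );

% Each token with C == 0 starts a new phrase; label tokens by phrase
phraseid = cumsum( C == 0 );

% Length of every phrase (1 = single word, 2 = bigram, ...)
phraselen = accumarray( phraseid(:) , 1 )';

fprintf( 'Tokens in collocation = %d (%.0f%% of stream)\n' , ncol , 100 * frac );
fprintf( 'Longest phrase = %d tokens\n' , max( phraselen ) );
```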

Post-process the vocabulary to include collocations as separate entries. Recalculate the word-topic distributions with the expanded vocabulary.

if (dataset == 1)
    S = WriteTopics( WPNEW , BETA , WONEW , 20 , 0.7 , 4 , filenm );
    save 'ldacol_psychreview' WPNEW DPNEW WONEW WP DP WC C ALPHA BETA GAMMA0 GAMMA1 DELTA SEED N Z T;
elseif (dataset == 2)
    S = WriteTopics( WPNEW , BETA , WONEW , 40 , 0.7 , 4 , filenm );
    save 'ldacol_nips' WPNEW DPNEW WONEW WP DP WC C ALPHA BETA GAMMA0 GAMMA1 DELTA SEED N Z T;
end
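To make "word-topic distribution" concrete, the sketch below turns a matrix of word-topic counts into smoothed probabilities using the usual LDA estimate, (count + BETA) / (topic total + W*BETA), and ranks the words in each topic. The count matrix is a hypothetical toy example, and whether `WriteTopics` computes its ranking in exactly this way is an assumption.

```matlab
% Hypothetical counts: W = 3 words (rows) by T = 2 topics (columns)
WPTOY = [ 3 0 ; 1 2 ; 0 4 ];
BETA  = 0.01;
W     = size( WPTOY , 1 );

% Smoothed estimate of P( word | topic ); each column sums to 1
PWT = ( WPTOY + BETA ) ./ repmat( sum( WPTOY , 1 ) + W * BETA , W , 1 );

% Rank words within each topic from most to least probable
[ ~ , order ] = sort( PWT , 1 , 'descend' );

disp( PWT );
disp( order );
```

The smoothing term `BETA` keeps every word at a non-zero probability in every topic, including words that were never assigned to that topic.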

Show some topics

S{1}( 1:90 )
S{2}( 1:90 )
S{3}( 1:90 )
S{4}( 1:90 )
S{5}( 1:90 )
S{6}( 1:90 )
S{7}( 1:90 )
S{8}( 1:90 )
S{9}( 1:90 )
S{10}( 1:90 )

ans = specific authors personality authors_propose authors_show authors_argue authors_present va
ans = choice continuous preference discrete alternatives consistent variable shown stochastic pr
ans = visual spatial object space visual_system feature orientation line attentional color conto
ans = signal_detection avoidance detection high low signal latency iq variance mean presence thr
ans = memory recognition retrieval recall stored study familiarity cued retrieved forgetting ter
ans = effect frequency proposed correct found occur degree acoustic reported results frequencies
ans = motion direction physical points relative generates differential dimensional apparent dept
ans = action support schema actions wide_range ambiguity involving identification personal conte
ans = social individuals attitude discussed factors psychological relevant domains implications
ans = suggested research influence reviewed arousal evidence activity bem examines found recipro
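The ten display statements above index a fixed range `1:90`, which would raise an error for any topic string shorter than 90 characters. A loop with a length guard is one possible alternative; the sketch below uses a hypothetical two-element cell array in place of the real `S`, and the names `STOY`, `NCHAR` and `shown` are illustrative only.

```matlab
% Hypothetical stand-in for the cell array of topic strings
STOY  = { 'memory recognition retrieval recall' , repmat( 'visual ' , 1 , 20 ) };
NCHAR = 90;   % maximum number of characters to show per topic

shown = cell( size( STOY ) );
for t = 1:numel( STOY )
    % Truncate only when the string is longer than NCHAR
    n        = min( NCHAR , numel( STOY{t} ) );
    shown{t} = STOY{t}( 1:n );
    fprintf( 'Topic %d: %s\n' , t , shown{t} );
end
```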