A utility that converts the stream data used by the HMM-LDA model into the stream data required by the LDA-COL model.
[ WW , WO , WS , DS , SI ] = stream_to_collocation_data( DS_IN , WS_IN , WO_IN , stopwordsfile )
creates the new word-stream data. The input document and word indices are given by DS_IN and WS_IN respectively, and the vocabulary is given in the cell array WO_IN. The string stopwordsfile is the name of the file containing the stop words. The outputs WO, WS and DS are the vocabulary, word indices and document indices that remain after the stop words have been removed. The output WW is a matrix in which WW(w2,w1) is the number of times word w2 immediately follows word w1. The status vector SI has SI(k) = 1 at positions k where the previous word k-1 can be considered part of a collocation with the current word. SI(k) = 0 means that word k-1 did not immediately precede word k in the original text (because stop words stood between them), or that word k-1 ended the previous document.
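The definition of WW can be illustrated on a toy stream. The following is a minimal sketch, not the function itself: the stream values, the vocabulary size of 3, and the variable name same_doc are assumptions made up for illustration. It counts adjacent word pairs with sparse and skips pairs that straddle a document boundary.

```matlab
% Toy stream (assumed values): word index and document index of each token.
WS = [ 1 2 1 2 3 1 ];
DS = [ 1 1 1 1 2 2 ];
W  = 3;                          % vocabulary size of the toy example

w1 = WS( 1:end-1 );              % preceding word of each adjacent pair
w2 = WS( 2:end );                % following word of each adjacent pair
same_doc = diff( DS ) == 0;      % true where both words are in the same document

% sparse sums the values of repeated (row, column) subscripts,
% so each entry ends up holding a bigram count
WW = full( sparse( w2( same_doc ) , w1( same_doc ) , 1 , W , W ));
```

Here word 2 follows word 1 twice, so WW(2,1) is 2. The pair formed by the last token of document 1 and the first token of document 2 is not counted.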
n = length( WS_IN );
W = length( WO_IN ) + 1;
D = max( DS_IN );

w1 = WS_IN( 1:n-1 ) + 1; % add one to get sentence marker on index 1
w2 = WS_IN( 2:n ) + 1;

% find document boundaries
DB = diff( DS_IN );
notboundary = find( DB == 0 );
w1 = w1( notboundary );
w2 = w2( notboundary );

WW = sparse( w2 , w1 , ones( size( w1 )) , W , W );

% remove the sentence marker from these matrices
WW( 1,: ) = [];
WW( :,1 ) = [];

% load in stopwords file
[ stopwords ] = lower( textread( stopwordsfile , '%s' ));
WO_IN = lower( WO_IN );

% find word indices that are ok
[ WO , ind ] = setdiff( WO_IN , stopwords );
WW = WW( ind,ind );

[ SI , WS ] = ismember( WS_IN , ind );
whok = find( SI );

% mark the locations where collocation cannot be formed
SI( 2:end ) = SI( 2:end ) .* SI( 1:end-1 );

% also exclude document boundaries
DB = diff( DS_IN );
whboundary = find( DB == 1 );
SI( whboundary+1 ) = 0;

SI = double( SI( whok ));
WS = WS( whok );
DS = DS_IN( whok );
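The way the status vector is built can be traced on a small example. This is a hedged sketch of the same steps, not a replacement for the code above: the toy stream is invented, it assumes that index 0 in WS_IN marks a sentence boundary, and it assumes that vocabulary entry 3 is a stop word, so the kept indices ind are [ 1 2 ]. It uses a logical AND where the function uses an element-wise product.

```matlab
% Toy input (assumed values): 0 is taken to be the sentence marker,
% and word 3 is taken to be a stop word.
WS_IN = [ 0 1 2 3 1 2 1 ];
DS_IN = [ 1 1 1 1 1 2 2 ];
ind   = [ 1 2 ];                       % vocabulary indices that survive stop-word removal

[ SI , WS ] = ismember( WS_IN , ind ); % SI: token is kept, WS: its new vocabulary index
whok = find( SI );                     % positions of the kept tokens

% a collocation with the previous token is only possible if that token was kept too
SI( 2:end ) = SI( 2:end ) & SI( 1:end-1 );

% the first token of a new document cannot continue a collocation
whboundary = find( diff( DS_IN ) == 1 );
SI( whboundary+1 ) = 0;

% keep only the surviving tokens
SI = double( SI( whok ));
WS = WS( whok );
DS = DS_IN( whok );
```

With these inputs the result is SI = [ 0 1 0 0 1 ], WS = [ 1 2 1 2 1 ] and DS = [ 1 1 1 2 2 ]. The first kept token follows a sentence marker, the third follows a stop word and the fourth opens document 2, so all three get SI = 0.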