A utility to convert the stream data used by the HMM-LDA model into stream data suitable for the LDA-COL model.
[ WW , WO , WS , DS , SI ] = stream_to_collocation_data( DS_IN , WS_IN , WO_IN , stopwordsfile )
creates the new word stream data. The input document and word indices are provided by DS_IN and WS_IN respectively. The vocabulary is given in the cell array WO_IN. The string stopwordsfile contains the name of the file with stop words. The output WW is a matrix where WW(w2,w1) gives the frequency with which w2 follows w1. The status vector SI has SI(k) = 1 at positions k where the previous word k-1 can be considered part of a collocation with the current word. SI(k) = 0 means that word k-1 did not immediately precede word k in the original text, either because there were stop words in between or because word k-1 ended the previous document.
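The meaning of SI and WW can be illustrated on a toy stream. The following is a minimal sketch only, not the code of stream_to_collocation_data itself. It assumes that stop words are dropped from the stream, that WW counts only pairs with SI(k) = 1, and that the original vocabulary indices are kept (the real function returns its own WO). The toy vocabulary, the inline stop list, and the helper variables isStop, pos, ws, ds are hypothetical.

```matlab
% Toy input: one entry per token (assumed layout, for illustration only).
WO_IN = {'new', 'york', 'is', 'a', 'big', 'city'};
WS_IN = [1 2 3 4 5 6 1 2];   % "new york is a big city" | "new york"
DS_IN = [1 1 1 1 1 1 2 2];   % document index of each token
stopwords = {'is', 'a'};     % hypothetical stop list (normally read from stopwordsfile)

% Drop stop-word tokens, remembering each kept token's original position.
isStop = ismember(WO_IN(WS_IN), stopwords);
pos = find(~isStop);
ws  = WS_IN(pos);
ds  = DS_IN(pos);

% SI(k) = 1 only if token k-1 stood immediately before token k in the
% original text and both tokens belong to the same document.
n  = numel(ws);
SI = zeros(1, n);
for k = 2:n
    if pos(k) == pos(k-1) + 1 && ds(k) == ds(k-1)
        SI(k) = 1;
    end
end

% WW(w2,w1) counts how often w2 directly follows w1 (pairs with SI = 1 only).
W   = numel(WO_IN);
idx = find(SI == 1);
WW  = sparse(ws(idx), ws(idx-1), 1, W, W);   % sparse() sums repeated (w2,w1) pairs
```

In this toy stream, "big" gets SI = 0 because the stop words "is a" separated it from "york", and the second "new" gets SI = 0 because it starts a new document.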