
Processing Pipeline

Cory Schillaci edited this page Apr 30, 2015 · 12 revisions
  1. The raw data was concatenated into files of approximately 100 MB using the bash script destress/clean_data/combine_data.sh. These files are available on Mercury at /var/local/destress/combined.
  • The file cu2.xml includes posts from the user curiousallie, whose original file was missing a closing </posts> tag. This tag has been added to the end of cu2.xml by hand.
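The by-hand repair above can be automated. Below is an illustrative sketch (not part of the actual pipeline, and `ensureClosed` is a hypothetical helper) of appending a missing closing tag to a truncated XML dump:

```scala
// Hypothetical helper: append a closing tag to an XML dump whose
// writer was interrupted, leaving the final close tag missing.
def ensureClosed(xml: String, closeTag: String): String =
  if (xml.trim.endsWith(closeTag)) xml   // already well-terminated
  else xml + "\n" + closeTag             // repair by appending the tag
```

For cu2.xml this amounts to reading the file, calling `ensureClosed(contents, "</posts>")`, and writing it back.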
  2. The data was tokenized using destress/process_data/xmltweet.exe with the flex file destress/process_data/xmltweet.flex. This flex file was tweaked from the BIDMach default to ignore the contents of <base64> tags, discard some of the escaped HTML tags in the post bodies, and improve handling of some other LiveJournal-specific issues (see the commit history of xmltweet.flex). The bash script destress/process_data/tokenize_files.sh runs the flexer on all of the files in a specified directory. Tokenized representations are saved in /var/local/destress/tokenized.
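As a rough sketch of what those flex rules accomplish (the real pipeline uses xmltweet.flex, not this code), the tokenizer drops the contents of <base64> tags, strips remaining markup, and splits the text into lowercase word tokens:

```scala
// Illustrative approximation of the xmltweet.flex behavior using regexes.
def tokenize(post: String): Seq[String] = {
  // Discard the (often huge) payload inside <base64>...</base64> tags.
  val noBase64 = post.replaceAll("(?s)<base64>.*?</base64>", " ")
  // Strip any remaining markup tags.
  val noTags = noBase64.replaceAll("<[^>]+>", " ")
  // Lowercase and split on anything that is not a letter, digit, or apostrophe.
  noTags.toLowerCase.split("[^a-z0-9']+").filter(_.nonEmpty).toSeq
}
```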

  3. A master dictionary was then created using the scala/BIDMach script destress/process_data/merge_dicts.scala. This requires that utils.scala has been compiled into the main BIDMach jar*. The master dictionary is created by merging one dictionary at a time and trimming whenever the number of words exceeds 1 million. For the full dataset, trimming removed only words that occurred fewer than 41 times. The master dictionary and the corresponding counts are saved in /var/local/destress/tokenized as masterDict.sbmat and masterDict.dmat, respectively. The dictionary is ordered by word frequency in the tokenized dataset.
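The merge-and-trim strategy can be sketched with plain Scala maps in place of BIDMach Dict/SBMat objects (a simplified illustration, not the actual merge_dicts.scala logic):

```scala
// Fold per-file word-count dictionaries into one master map. Whenever the
// master grows past maxWords, drop words occurring fewer than minCount times.
def mergeDicts(dicts: Seq[Map[String, Long]],
               maxWords: Int,
               minCount: Long): Seq[(String, Long)] = {
  val merged = dicts.foldLeft(Map.empty[String, Long]) { (master, d) =>
    // Add this file's counts into the running master dictionary.
    val m = d.foldLeft(master) { case (acc, (w, n)) =>
      acc.updated(w, acc.getOrElse(w, 0L) + n)
    }
    // Trim infrequent words only when the dictionary gets too large.
    if (m.size > maxWords) m.filter(_._2 >= minCount) else m
  }
  merged.toSeq.sortBy(-_._2) // master dictionary is ordered by frequency
}
```

In the real run, maxWords was 1 million and the effective minCount was 41.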

  4. The dataset is then featurized by running

import featurizers._
featurizeMoodID("/var/local/destress/tokenized/","/var/local/destress/tokenized/",
     "/var/local/destress/featurized/","/var/local/destress/tokenized/fileList.txt")

This requires that featurizers.scala has been compiled into the main BIDMach jar*. Only <string> posts with <current_moodid> tags from the default list are currently featurized. Posts are represented in batches of 100,000 by two files, data<number>.smat.lz4 and data<number>.imat. These are saved in /var/local/destress/featurized.

  • For each post, a bag-of-words representation is saved as a column in the sparse matrix, where the rows correspond to the entries in the master dictionary. The word indices start at 0 in both. The master dictionary is saved in /var/local/destress/featurized. A dmat of the word counts over the valid posts only (used for sorting during preprocessing) is also saved here.
  • The dense IMat columns corresponding to the columns in the sparse matrix specify a userid number in the first row and the moodid in the second row. Note that users with no posts satisfying the above criteria are still assigned a userid number, so not all userids will appear.
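Conceptually, the per-post featurization pairs each sparse bag-of-words column with a two-row label column. A hedged sketch using plain Scala collections in place of BIDMach SMat/IMat columns (the function and its signature are illustrative, not the featurizers.scala API):

```scala
// For one post: map tokens to 0-based master-dictionary indices, count them
// into a sparse bag of words, and attach the (userid, moodid) label column.
def featurizePost(tokens: Seq[String],
                  dictIndex: Map[String, Int],
                  userid: Int,
                  moodid: Int): (Map[Int, Int], (Int, Int)) = {
  val bag = tokens
    .flatMap(dictIndex.get)                       // drop out-of-dictionary words
    .groupBy(identity)
    .map { case (i, hits) => i -> hits.size }     // index -> count
  (bag, (userid, moodid))
}
```

In the real pipeline these columns are accumulated in batches of 100,000 and written as data<number>.smat.lz4 / data<number>.imat pairs.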
  5. The dataset is then preprocessed into a form suitable for learning models with BIDMach using

import preprocessers._
preprocess(nrWords,indir,outdir,masterDict,testPercent:Float=0.0f,sort:Boolean=true,transformation:String="None")

For now, see the code for details.
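As a rough guide to the testPercent parameter only (this is guesswork from the signature above, assuming testPercent is a percentage in 0–100; see preprocessers.scala for the real logic), a held-out split over data columns might look like:

```scala
// Split a sequence of data columns into (train, test), holding out the
// first testPercent percent of columns as the test set.
def holdout[A](cols: Seq[A], testPercent: Float): (Seq[A], Seq[A]) = {
  val nTest = (cols.length * testPercent / 100).round
  (cols.drop(nTest), cols.take(nTest))
}
```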

* For instructions on compiling scala code with BIDMach dependencies, see issue #25.
