
Processing Pipeline

Cory Schillaci edited this page Apr 2, 2015 · 12 revisions
  1. The raw data was concatenated into files of approximately 100 MB each using the bash script destress/clean_data/combine_data.sh. The combined files are available on Mercury at /var/local/destress/combined.
  • The file cu2.xml includes the user curiousallie, whose file did not include a </posts> tag. This tag has been added to the end of cu2.xml by hand.
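
The concatenation step can be pictured as greedy packing of files into groups of roughly 100 MB. The Scala sketch below is only an illustration of that idea; the object `ChunkSketch`, the function `chunk`, and the packing rule itself are assumptions, and combine_data.sh may group files differently.

```scala
// Hypothetical sketch: the real grouping logic lives in combine_data.sh.
object ChunkSketch {
  /** Greedily packs (name, sizeInBytes) pairs, in order, into groups whose
    * total size stays at or below targetBytes whenever possible. A single
    * file larger than targetBytes ends up in a group of its own. */
  def chunk(files: Seq[(String, Long)], targetBytes: Long): Vector[Vector[String]] = {
    val start = (Vector.empty[Vector[String]], Vector.empty[String], 0L)
    val (done, current, _) = files.foldLeft(start) {
      case ((groups, cur, curSize), (name, size)) =>
        if (cur.nonEmpty && curSize + size > targetBytes)
          (groups :+ cur, Vector(name), size) // close the current group, start a new one
        else
          (groups, cur :+ name, curSize + size) // the file still fits in the current group
    }
    if (current.nonEmpty) done :+ current else done
  }
}
```

Calling `ChunkSketch.chunk(files, 100L * 1024 * 1024)` would give one list of file names per combined output file.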
  2. The data was tokenized using destress/process_data/xmltweet.exe with the flex file destress/process_data/xmltweet.flex. This flex file was modified from the BIDMach default to ignore the contents of <base64> tags and to properly parse the tag in ;</string>. The bash script destress/process_data/tokenize_files.sh runs the tokenizer on every file in a specified directory. The tokenized representations are saved in /var/local/destress/tokenized.
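
To illustrate what ignoring the contents of <base64> tags means, here is a rough Scala sketch that empties each <base64> element before splitting the text into tokens. It does not reproduce the rules in xmltweet.flex; the object `Base64SkipSketch`, both regular expressions, and the lowercasing are assumptions made for the example.

```scala
// Hypothetical sketch: the actual tokenizer is generated from xmltweet.flex.
object Base64SkipSketch {
  // Matches a <base64> element and everything inside it, across newlines.
  private val base64Block = "(?s)<base64>.*?</base64>".r
  // A deliberately crude token pattern: an XML tag, or a run of letters, digits and apostrophes.
  private val token = "</?[A-Za-z0-9_]+>|[A-Za-z0-9']+".r

  /** Empties every <base64> element, then returns the lowercased tokens. */
  def tokenize(xml: String): Vector[String] = {
    val withoutPayload = base64Block.replaceAllIn(xml, "<base64></base64>")
    token.findAllIn(withoutPayload).map(_.toLowerCase).toVector
  }
}
```

The tags themselves survive as tokens, but the encoded payload never reaches the dictionary.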

  3. A master dictionary was then created using the Scala/BIDMach script destress/process_data/merge_dicts.scala. This requires that utils.scala has been compiled into the main BIDMach jar. The master dictionary is built by merging in one dictionary at a time and trimming the least frequent words whenever the number of words exceeds 1 million. For the full dataset, this removed only words that occurred fewer than 41 times. The master dictionary and the corresponding counts are saved in /var/local/destress/tokenized as masterDict.sbmat and masterDict.dmat, respectively. The dictionary is ordered by word frequency in the tokenized dataset.
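
The merge-and-trim procedure can be sketched in plain Scala as follows. This is not the code in merge_dicts.scala, which uses BIDMach dictionaries. The object `MergeDictsSketch`, its function names, the rule of raising the count threshold by one until the dictionary fits, and the descending sort order are all assumptions.

```scala
// Hypothetical sketch: plain Scala maps stand in for BIDMach dictionaries.
object MergeDictsSketch {
  type Counts = Map[String, Long]

  /** Adds the counts of `next` into `acc`. */
  def merge(acc: Counts, next: Counts): Counts =
    next.foldLeft(acc) { case (m, (word, c)) => m.updated(word, m.getOrElse(word, 0L) + c) }

  /** Raises the count threshold one step at a time until at most maxWords
    * remain. Returns the trimmed dictionary and the threshold now in force;
    * words with a count below that threshold have been dropped. */
  @annotation.tailrec
  def trim(dict: Counts, maxWords: Int, threshold: Long): (Counts, Long) =
    if (dict.size <= maxWords) (dict, threshold)
    else {
      val next = threshold + 1
      trim(dict.filter { case (_, c) => c >= next }, maxWords, next)
    }

  /** Merges dictionaries one at a time, trimming after each merge, and
    * returns the words ordered by descending count plus the final threshold. */
  def buildMaster(dicts: Seq[Counts], maxWords: Int): (Vector[(String, Long)], Long) = {
    val (merged, threshold) = dicts.foldLeft((Map.empty[String, Long], 1L)) {
      case ((acc, thr), d) => trim(merge(acc, d), maxWords, thr)
    }
    (merged.toVector.sortBy { case (w, c) => (-c, w) }, threshold)
  }
}
```

Under this scheme a final threshold of 41 would correspond to dropping words that occurred fewer than 41 times. Trimming during the merge is lossy: a word dropped early restarts from zero if it reappears in a later dictionary.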

  4. The dataset was then featurized by running:

```scala
import featurizers._
featurizeMoodID("/var/local/destress/tokenized/","/var/local/destress/tokenized/","/var/local/destress/featurized/","/var/local/destress/tokenized/fileList.txt")
```

This requires that featurizers.scala has been compiled into the main BIDMach jar. Only <string> posts with <current_moodid> tags from the default list are currently featurized. Posts are stored in batches of 100,000, with each batch represented by two files, data<number>.smat.lz4 and data<number>.imat. These are saved in /var/local/destress/featurized.

  • For each post, a bag of words representation is saved as a column in the sparse matrix, where the rows correspond to the entries in the master dictionary.
  • Each column of the dense IMat describes the post in the matching column of the sparse matrix: the first row holds a userid number and the second row holds the moodid. Note that users with no posts satisfying the above criteria are still assigned a userid number, so not all userids will appear in the featurized data.
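
The layout described in these two bullets can be mimicked with ordinary Scala collections. The sketch below is an illustration only and does not use BIDMach's sparse or IMat types. The case class `Post`, the object `FeaturizeSketch`, and the choice to skip tokens missing from the dictionary are assumptions.

```scala
// Hypothetical sketch: plain collections stand in for the .smat.lz4 and .imat contents.
object FeaturizeSketch {
  /** One post: the author's userid number, the moodid, and the post's tokens. */
  final case class Post(userId: Int, moodId: Int, tokens: Seq[String])

  /** A sparse column: (row, count) pairs sorted by row, where the row is the
    * word's position in the master dictionary. */
  type SparseColumn = Vector[(Int, Int)]

  /** Returns one bag-of-words column per post, plus a parallel vector of
    * (userid, moodid) pairs playing the role of the two IMat rows. */
  def featurize(posts: Seq[Post], masterDict: Vector[String]): (Vector[SparseColumn], Vector[(Int, Int)]) = {
    val rowOf: Map[String, Int] = masterDict.zipWithIndex.toMap
    val columns = posts.toVector.map { p =>
      p.tokens
        .flatMap(t => rowOf.get(t)) // tokens absent from the dictionary are skipped
        .groupBy(identity)
        .map { case (row, hits) => (row, hits.size) }
        .toVector
        .sortBy(_._1)
    }
    val meta = posts.toVector.map(p => (p.userId, p.moodId))
    (columns, meta)
  }
}
```

Column i of the first result and entry i of the second describe the same post. Splitting the input with `posts.grouped(100000)` before calling `featurize` would give one pair of results per batch.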