Processing Pipeline
- The raw data was concatenated into files of approximately 100 MB using the bash script destress/clean_data/combine_data.sh. These files are available on Mercury at /var/local/destress/combined.
- The file cu2.xml includes the user curiousallie, whose file did not include a </posts> tag. This tag has been added to the end of cu2.xml by hand.
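A missing closing tag like the one in cu2.xml can be caught before tokenizing by comparing tag counts. The following is a minimal sketch, not part of the repository: the function name check_posts_tags is hypothetical, and it assumes opening tags look like <posts> or <posts ...>, which the notes above do not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report files whose opening and closing posts tags
# do not balance. Assumes opening tags are "<posts>" or "<posts ...>".
check_posts_tags() {
    local file="$1"
    local opens closes
    # grep -o prints one match per line, so wc -l counts occurrences.
    opens=$(grep -o '<posts[ >]' "$file" | wc -l | tr -d '[:space:]')
    closes=$(grep -o '</posts>' "$file" | wc -l | tr -d '[:space:]')
    if [ "$opens" -ne "$closes" ]; then
        echo "$file: $opens opening vs $closes closing posts tags"
        return 1
    fi
    return 0
}
```

It could be run over the combined files with a loop such as `for f in /var/local/destress/combined/*.xml; do check_posts_tags "$f"; done`, which prints one line per unbalanced file.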
-
The data was tokenized using destress/process_data/xmltweet.exe with the flex file destress/process_data/xmltweet.flex. This flex file was tweaked from the BIDMach default to ignore the contents of <base64> tags and to properly parse the tag in ;</string>. The bash script destress/process_data/tokenize_files.sh runs the flexer on all of the files in a specified directory. Tokenized representations are saved in /var/local/destress/tokenized.
-
A master dictionary was then created using the Scala/BIDMach script destress/process_data/merge_dicts.scala. This requires that utils.scala has been compiled into the main BIDMach jar. The master dictionary is created by merging one dictionary at a time, and trimming whenever the number of words exceeds 1 million. For the full dataset, only words which occurred fewer than 41 times were trimmed. The master dictionary and the corresponding counts are saved in /var/local/destress/tokenized as masterDict.sbmat and masterDict.dmat, respectively. The dictionary is ordered by word frequency in the tokenized dataset.
-
The dataset is then featurized by running
import featurizers._
featurizeMoodID("/var/local/destress/tokenized/","/var/local/destress/tokenized/","/var/local/destress/featurized/","/var/local/destress/tokenized/fileList.txt")
This requires that featurizers.scala has been compiled into the main BIDMach jar. Only <string> posts with <current_moodid> tags from the default list are currently featurized. Posts are represented in batches of 100,000 by two files, data<number>.smat.lz4 and data<number>.imat. These are saved in /var/local/destress/featurized.
- For each post, a bag of words representation is saved as a column in the sparse matrix, where the rows correspond to the entries in the master dictionary.
- Each column of the dense IMat corresponds to the same column of the sparse matrix and gives the userid number in the first row and the moodid in the second row. Note that users with no posts satisfying the above criteria are still assigned a userid number, so not all userids will appear in the featurized data.
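To make the column layout concrete, here is a small sketch of how one post maps to (row, count) pairs against a dictionary. It is an illustration only, not the featurizer itself: bag_of_words is a hypothetical name, the dictionary is assumed to be a plain-text file with one word per line (the real masterDict.sbmat is a BIDMach matrix file), rows are assumed to be numbered from 0 in file order, and tokens missing from the dictionary are assumed to be dropped.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "row count" pairs for one post, i.e. the
# nonzero entries of its bag-of-words column.
# Assumptions: one dictionary word per line, rows numbered from 0,
# whitespace-separated tokens, out-of-dictionary tokens dropped.
bag_of_words() {
    local dict_file="$1" post="$2"
    printf '%s\n' "$post" | awk -v dict="$dict_file" '
        # Load the dictionary: word -> row index in file order.
        BEGIN { n = 0; while ((getline w < dict) > 0) row[w] = n++ }
        # Count each token of the post that has a dictionary row.
        { for (i = 1; i <= NF; i++) if ($i in row) count[row[$i]]++ }
        # Emit the nonzero entries of the column.
        END { for (r in count) print r, count[r] }
    ' | sort -n
}
```

With a dictionary file containing the lines the, cat, sat, the call `bag_of_words dict.txt "the cat saw the dog"` prints `0 2` and `1 1`: row 0 (the) occurs twice, row 1 (cat) once, and the remaining tokens have no row. In the featurized files, the matching IMat column would carry the userid and moodid for that post.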