Processing Pipeline
- The raw data was concatenated into files of approximately 100 MB using the bash script `destress/clean_data/combine_data.sh`. These files are available on Mercury at `/var/local/destress/combined`.
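  The combining step is plain concatenation of the per-user XML files into roughly 100 MB chunks. As a rough illustration only (the real logic lives in `combine_data.sh`; the raw-data directory and chunk file names below are assumptions), the same chunking could be written in Scala as:

  ```scala
  import java.io.{File, FileOutputStream}
  import java.nio.file.Files

  // Illustrative sketch of the chunking done by combine_data.sh: append per-user
  // XML files to an output file until it reaches ~100 MB, then start a new chunk.
  // The raw-data directory and the chunk naming are assumptions, not the script's.
  object CombineDataSketch {
    val chunkBytes = 100L * 1024 * 1024          // target size of each combined file

    def main(args: Array[String]): Unit = {
      val inDir  = new File("/var/local/destress/raw")       // assumed raw-data location
      val outDir = new File("/var/local/destress/combined")
      var idx = 0
      var written = 0L
      var out = new FileOutputStream(new File(outDir, s"combined$idx.xml"))
      for (f <- inDir.listFiles.filter(_.getName.endsWith(".xml")).sortBy(_.getName)) {
        if (written >= chunkBytes) {             // current chunk is full; open the next one
          out.close(); idx += 1; written = 0L
          out = new FileOutputStream(new File(outDir, s"combined$idx.xml"))
        }
        val bytes = Files.readAllBytes(f.toPath)
        out.write(bytes)
        written += bytes.length
      }
      out.close()
    }
  }
  ```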
- The file `cu2.xml` includes the user curiousallie, whose file did not include a `</posts>` tag. This tag has been added to the end of `cu2.xml` by hand.
- The data was tokenized using `destress/process_data/xmltweet.exe` with the flex file `destress/process_data/xmltweet.flex`. This flex file was tweaked from the BIDMach default to ignore the contents of `<base64>` tags, discard some of the escaped HTML tags in the post bodies, and improve handling of some other LiveJournal-specific issues (see the commit history of `xmltweet.flex`). The bash script `destress/process_data/tokenize_files.sh` runs the flexer on all of the files in a specified directory. Tokenized representations are saved in `/var/local/destress/tokenized`.
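  As a rough picture of what `tokenize_files.sh` does, the loop below is a hedged Scala sketch (the real driver is a bash script, and the `xmltweet.exe` command-line arguments and output naming here are assumptions):

  ```scala
  import java.io.File
  import scala.sys.process._

  // Sketch of the driver loop in tokenize_files.sh: run the flexer once per combined
  // file and write the result into the tokenized directory. The flexer's actual
  // command-line interface is not documented here, so the arguments are assumed.
  object TokenizeFilesSketch {
    def main(args: Array[String]): Unit = {
      val inDir  = new File("/var/local/destress/combined")
      val outDir = "/var/local/destress/tokenized"
      val flexer = "destress/process_data/xmltweet.exe"
      for (f <- inDir.listFiles.filter(_.getName.endsWith(".xml"))) {
        val outName = s"$outDir/${f.getName.stripSuffix(".xml")}"
        val exit = Seq(flexer, f.getPath, outName).!   // run the flexer, wait for it to exit
        if (exit != 0) println(s"xmltweet failed on ${f.getName} (exit code $exit)")
      }
    }
  }
  ```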
- A master dictionary was then created using the Scala/BIDMach script `destress/process_data/merge_dicts.scala`. This requires that `utils.scala` has been compiled into the main BIDMach jar*. The master dictionary is created by merging one dictionary at a time and trimming whenever the number of words exceeds 1 million. For the full dataset, only words which occurred fewer than 41 times were trimmed. The master dictionary and the corresponding counts are saved in `/var/local/destress/tokenized` as `masterDict.sbmat` and `masterDict.dmat`, respectively. The dictionary is ordered by word frequency in the tokenized dataset.
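  The merge-and-trim logic itself is simple; the sketch below shows the idea with a plain Scala map standing in for the BIDMat dictionary objects that `merge_dicts.scala` actually works with (so it illustrates the logic, not the real script's code):

  ```scala
  import scala.collection.mutable

  // Illustrative version of the merge loop: fold per-file dictionaries into a master
  // word -> count map, and whenever the vocabulary exceeds 1 million words, drop the
  // least frequent ones. merge_dicts.scala does this with BIDMat Dict/SBMat objects.
  object MergeDictsSketch {
    val maxWords = 1000000

    def mergeAll(perFileDicts: Iterator[Map[String, Double]]): Map[String, Double] = {
      val master = mutable.HashMap.empty[String, Double]
      for (d <- perFileDicts) {
        for ((word, count) <- d)
          master(word) = master.getOrElse(word, 0.0) + count   // sum counts for shared words
        if (master.size > maxWords) {
          // Keep only the maxWords most frequent words (ties may keep slightly more).
          val cutoff = master.values.toSeq.sorted(Ordering[Double].reverse)(maxWords - 1)
          master.retain { case (_, c) => c >= cutoff }
        }
      }
      master.toMap   // the real script additionally orders the dictionary by frequency before saving
    }
  }
  ```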
- The dataset is then featurized by running

  ```scala
  import featurizers._
  featurizeMoodID("/var/local/destress/tokenized/", "/var/local/destress/tokenized/",
    "/var/local/destress/featurized/", "/var/local/destress/tokenized/fileList.txt")
  ```

  This requires that `featurizers.scala` has been compiled into the main BIDMach jar*. Only `<string>` posts with `<current_moodid>` tags from the default list are currently featurized. Posts are represented in batches of 100,000 by two files, `data<number>.smat.lz4` and `data<number>.imat`, which are saved in `/var/local/destress/featurized`.
- For each post, a bag-of-words representation is saved as a column in the sparse matrix, where the rows correspond to the entries in the master dictionary. The word indices start at 0 in both. The master dictionary is saved in `/var/local/destress/featurized`. A dmat of the word counts from the valid posts only, for sorting in preprocessing, is also saved there.
- The columns of the dense IMat correspond to the columns of the sparse matrix and specify a userid number in the first row and the moodid in the second row. Note that users with no posts satisfying the above criteria are still assigned a userid number, so not all userids will appear.
- The dataset is then preprocessed into a form suitable for learning models with BIDMach by calling `preprocess`, whose signature (with defaults) is

  ```scala
  import preprocessers._
  preprocess(nrWords, indir, outdir, masterDict,
             testPercent: Float = 0.0f, sort: Boolean = true, transformation: String = "None")
  ```

  For now, see the code for details.
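  An illustrative call might look like the following (all argument values and the output directory are made up for the example, and the exact type expected for `masterDict` should be checked against the code):

  ```scala
  import preprocessers._

  // Hypothetical invocation: the word count, directories, and masterDict argument are
  // placeholders chosen for illustration, not values used by the project.
  preprocess(100000,                                            // nrWords: vocabulary size to keep
             "/var/local/destress/featurized/",                 // indir: featurized batches
             "/var/local/destress/preprocessed/",               // outdir: hypothetical output directory
             "/var/local/destress/featurized/masterDict.sbmat", // masterDict (assumed to be a path here)
             testPercent = 10.0f,                               // hold out 10% of posts for testing
             sort = true,
             transformation = "None")
  ```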
* For instructions on compiling Scala code with BIDMach dependencies, see issue #25.