Framework Overview
Processing documents has three phases: ingest, analysis, and reporting. This page describes what happens in each phase and where it occurs in the code.
The entire process is driven by the tpkickoff.sh script in the bin/ folder; reading it is a good way to get a handle on how the process works.
We use both HDFS and HBase to store file information.
The folder structure for HDFS is:
    /texaspete
        regexes        (regular expressions for the grep search)
        reports/       (report template files)
        data/
            $IMG_ID/
                crossimg/      (cross-image scoring calculation data)
                extents/       (volume extents information)
                grep/          (file text search results)
                reports/       (raw report JSON data)
                text/          (text extracted from files)
                reports.zip    (the final report archive)
The HBase tables are:
- entries - contains information about all of the files on HDFS. The FSEntry class wraps around the entries in this table, using FSEntryHBaseInputFormat and FSEntryHBaseOutputFormat to provide a map-reduce-friendly way to read and write data about files. A minimal sketch of this pattern follows the list.
- images - contains information about the disk images that have been uploaded.
- hashes - contains hashes for seen, known-good, and known-bad files. This table is populated as images are added and also by the NSRL uploading tools. If a file was seen on a particular drive (as is the case with files added here during the ingest process), this table contains information about the hash of that drive, so there is a link between the file and the drive it was seen on.
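To make the map-reduce wrapping concrete, here is a minimal, hypothetical sketch of a job that reads rows from the entries table through FSEntryHBaseInputFormat and counts entries by file extension. It is not code from the project: the package paths, the mapper key type, and the FSEntry accessor used here are assumptions.

```java
// Hypothetical sketch of a map-reduce job reading the entries table through
// FSEntryHBaseInputFormat. Package paths, the mapper key type, and the
// FSEntry accessor are assumptions, not the project's actual API.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

import com.lightboxtechnologies.spectrum.FSEntry;                 // assumed package
import com.lightboxtechnologies.spectrum.FSEntryHBaseInputFormat; // assumed package

public class CountEntriesByExtension {

  public static class ExtensionMapper
      extends Mapper<Text, FSEntry, Text, LongWritable> {         // key type assumed

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(Text key, FSEntry entry, Context ctx)
        throws IOException, InterruptedException {
      // getExtension() is a hypothetical accessor standing in for whatever
      // FSEntry actually exposes about a file.
      ctx.write(new Text(entry.getExtension()), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "count entries by extension");
    job.setJarByClass(CountEntriesByExtension.class);
    job.setInputFormatClass(FSEntryHBaseInputFormat.class);
    job.setMapperClass(ExtensionMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    // Reducer, output path, and the table/image selection config are omitted.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```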
Ingest has two parts:
In the first part, we run the fsrip utility on a hard disk image twice: once to get information about the file system and once to get information about the files on the hard drive, using fsrip dumpfs and fsrip dumpimg respectively. We dump both of these records onto HDFS, using com.lightboxtechnologies.spectrum.Uploader for disk volume information and com.lightboxtechnologies.spectrum.InfoPutter for file data (the raw file data is actually just stuffed into HBase here and sorted out later). Note that we have two identifiers for each image: we take a hash of the image file as we begin the ingest process, and we also supply a "friendly name" for the image.
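As an illustration of the image-hash identifier mentioned above, the snippet below streams an image file through a message digest and produces a hex string. It is only a sketch: the project's actual digest algorithm and formatting are not specified here, so MD5 is used purely as an example.

```java
// Sketch of deriving an image identifier by hashing the image file.
// The digest algorithm (MD5 here) and hex formatting are assumptions used
// only for illustration; the project may use a different digest.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class ImageHash {

  public static String hashImage(String imagePath) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = Files.newInputStream(Paths.get(imagePath))) {
      byte[] buf = new byte[1 << 20];           // stream the image in 1 MB chunks
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b));     // hex-encode the digest bytes
    }
    return hex.toString();                      // stored alongside the "friendly name"
  }
}
```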
We then kick off the rest of the ingest process by invoking org.sleuthkit.hadoop.pipeline.Ingest. This in turn kicks off three separate map-reduce jobs, described below:
- com.lightboxtechnologies.spectrum.JsonImport - kicks off a map-reduce job which populates the HBase entries table with information from the hard drive files.
- com.lightboxtechnologies.spectrum.ExtentsExtractor and com.lightboxtechnologies.spectrum.ExtractData - these two steps put the raw file data into HBase, or links to the file data in the case where a file is large enough that we do not want to store its data in HBase (a sketch of this decision follows the list).
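The "small files inline, large files by reference" decision described in the last item can be illustrated with a short, hypothetical HBase sketch. The size threshold, column family, and column names below are assumptions, not values taken from the project.

```java
// Hypothetical sketch of storing a file's data inline in HBase when it is
// small, or only a pointer to its location on HDFS when it is large.
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FileDataPut {

  private static final long INLINE_THRESHOLD = 10L * 1024 * 1024; // assumed 10 MB cutoff
  private static final byte[] FAMILY = Bytes.toBytes("core");     // assumed column family

  public static Put forFile(byte[] rowKey, long size, byte[] data, String hdfsPath) {
    Put put = new Put(rowKey);
    if (size <= INLINE_THRESHOLD) {
      // Small file: store the raw bytes directly in the entries row.
      put.addColumn(FAMILY, Bytes.toBytes("data"), data);
    } else {
      // Large file: store only a pointer to the content kept on HDFS.
      put.addColumn(FAMILY, Bytes.toBytes("data_path"), Bytes.toBytes(hdfsPath));
    }
    return put;
  }
}
```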
At the end of the ingest process, we have rows in the HBase "entries" table describing all of the files on the hard drive, an entry in the "images" table with a bit of information about the hard drive image we have just uploaded, and some rows in the "hashes" table indicating that we know about some additional files that belong to this hard drive.
In the analysis step, we analyze the ingested data in order to create some reports about it. The results are either stored in report files or fed back into the entries table.
Analysis is kicked off in org.sleuthkit.hadoop.pipeline.Pipeline. The following classes have runPipeline() methods which are invoked from Pipeline.main(); they all kick off Hadoop map-reduce jobs to perform the operations described below.
- org.sleuthkit.hadoop.FSEntryTikaTextExtractor - iterates over the files in the entries table that belong to a hard drive with a given hash, and uses Tika to extract the raw text content from them. It puts the resulting text back into the entries table. (A minimal extraction sketch appears after this list.)
- org.sleuthkit.hadoop.GrepSearchJob - performs a search using Java regular expressions supplied within the file /texaspete/regexes. The match counts and the indexes of the matches are put back into the entries table.
- org.sleuthkit.hadoop.SequenceFsEntryText - creates sequence files from the text stored within the entries table. Text for a given file is only output to a sequence file if the grep search step (above) found matches. The Mahout classes that are used to tokenize, vectorize, and cluster documents cannot use the data we store in FSEntry as input directly, which is why we output that data to sequence files in this intermediate step.
- org.sleuthkit.hadoop.TokenizeAndVectorizeDocuments - converts the text sequence files previously generated into a sequence file containing tokens, then turns the tokenized files into vectors. The tokens are stored in the tokens/ directory and the vectors in the vectors/ directory. We use Mahout to do this.
- org.sleuthkit.hadoop.ClusterDocumentsJob - takes the vectors from the previous step and clusters them, organizing them into groups based on their similarity. The project uses canopy clustering (t1 and t2 are supplied as arguments to the runPipeline() function; cosine is hardcoded as the distance measure, and you can look in Pipeline.main() to see the defaults), followed by k-means using the resulting canopy centroid vectors as the "means".
- org.sleuthkit.hadoop.scoring.CrossImageScorerJob - compares the files on the hard drive being analyzed to those on the other drives in the system, and generates a number that represents how similar two drives are to one another. This code uses the file hashes stored in the hashes table. It also generates the report for the cross-drive scoring based on these similarity scores. The similarity is calculated by taking into account the number of files which match on each hard drive and the number of files each hard drive actually contains. (A sketch of this kind of score appears after this list.)
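To show what the Tika extraction step in FSEntryTikaTextExtractor boils down to, here is a minimal standalone sketch using the Tika facade class. Pulling the bytes out of HBase and writing the text back into the entries table is the part the project's map-reduce job adds around this.

```java
// Minimal Tika text-extraction sketch; this is an illustration, not project code.
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;

public class ExtractText {

  private static final Tika TIKA = new Tika();

  // Extracts plain text from a file's raw bytes (e.g. bytes read from the
  // entries table); returns an empty string if Tika cannot parse the content.
  public static String extract(byte[] fileBytes) {
    try (InputStream in = new ByteArrayInputStream(fileBytes)) {
      return TIKA.parseToString(in);   // detects the content type and extracts text
    } catch (Exception e) {
      return "";
    }
  }
}
```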
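The cross-image score weighs how many file hashes two drives share against how many files each drive contains. The exact formula used by CrossImageScorerJob is not spelled out here; the sketch below computes a Jaccard-style overlap over hash sets purely as an illustration of that idea.

```java
// Illustrative similarity between two drives based on their file-hash sets.
// This is a stand-in formula (Jaccard index), not necessarily the one the
// project uses.
import java.util.HashSet;
import java.util.Set;

public class DriveSimilarity {

  public static double score(Set<String> hashesA, Set<String> hashesB) {
    if (hashesA.isEmpty() && hashesB.isEmpty()) {
      return 0.0;
    }
    Set<String> shared = new HashSet<>(hashesA);
    shared.retainAll(hashesB);                    // files seen on both drives
    Set<String> union = new HashSet<>(hashesA);
    union.addAll(hashesB);                        // all files seen on either drive
    return (double) shared.size() / union.size(); // 0.0 = disjoint, 1.0 = identical
  }
}
```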
Reporting generates JSON report files from the analysis done. The reporting steps are mixed in with the analysis steps, but the final report files are not compiled until the following methods are called. A few report files are generated, and you can search for buildReport() methods to determine exactly where they are compiled:
- org.sleuthkit.hadoop.GrepReportGenerator - generates a report file containing information about the grep matches we've found. It stores the number of matches found with each regex on the image and a bit of information about where those matches were located (filename and context string).
- org.sleuthkit.hadoop.ClusterDocumentsJob - this class also contains map-reduce methods for generating reports based on the clustering data we've gathered (in addition to actually determining the clusters as described in the previous section).
- org.sleuthkit.hadoop.scoring.CrossImageScorerJob - generates a report detailing which drives are most similar to one another (see the previous section for a bit more detail).
- com.lightboxtechnologies.spectrum.HDFSArchiver - creates the reports.zip file. This moves the reports placed in the reports/ directory and the report template from the /texaspete/reports folder into the archive file. The resulting archive is available for download from the web interface or can be copied locally by hand. (A simplified archiving sketch appears at the end of this page.)
These report files are placed in an archive along with an HTML/JS web template, which lets you browse the data graphically.
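As a rough picture of the archiving step performed by HDFSArchiver, the sketch below zips a directory of report files into reports.zip using the standard java.util.zip classes. The real class works against HDFS paths and also pulls in the report template; this simplified version uses plain local I/O and omits the template.

```java
// Simplified archiving sketch: zip every file in a reports directory.
// HDFSArchiver itself reads from HDFS; local file I/O is used here only to
// keep the example short.
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipReports {

  public static void zipDirectory(String reportsDir, String zipPath) throws Exception {
    try (OutputStream out = Files.newOutputStream(Paths.get(zipPath));
         ZipOutputStream zip = new ZipOutputStream(out);
         DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(reportsDir))) {
      for (Path file : files) {
        if (!Files.isRegularFile(file)) {
          continue;                                   // skip subdirectories
        }
        zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
        Files.copy(file, zip);                        // stream the file bytes into the entry
        zip.closeEntry();
      }
    }
  }
}
```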