TwitterLogAnalyzer

Twitter Log Analysis Library

This code base contains a general purpose library for log data analysis.
Currently the library supports service to calculate average user time.
Library uses external merge sort to sort the log file. This approach helps in optimizing the memory available and uses disk storage for computation.
Custom analysis can be carried out on the sorted file. Currently, average user activity time is calculated using the sorted file.
Default policy is used to handle open/close tick values.

Code tested with following example:

Run Instructions:

Maven project. To build jar use: mvn install
Go to target folder and use jar "twitter-logs-analytics-1.0-SNAPSHOT-jar-with-dependencies.jar" OR
Go to folder "jar" and use jar "twitter-logs-analytics-1.0-SNAPSHOT-jar-with-dependencies.jar"
Use script (runLogAnalysisService.sh) inside "jar" folder to run the jar file
Script takes in 3 arguments: Number of lines per file, path to input file, output file directory path OR
Use following command to run jar file: java -DMAX_LINE_FILE={number of lines} -DMAX_CHARS_READ=30 -DPOLICY_NAME=DefaultOCPolicy -Xmx{memory size} -cp twitter-logs-analytics-1.0-SNAPSHOT-jar-with-dependencies.jar org.twitter.analytics.service.LogAnalyzerService {inputFilePath} {outputPath}

Current Approach:

Sort-based approach is used to calculate average value.
Sorting helps in sequentially arranging the values for a particular user in one file. This way in order to calculate average value only 2 ticks need to be considered at a time (low memory usage). If sorting was not used then we would have to maintain the open tick value in process memory, until the next tick is found. In case, the next tick value for a number of users is available after a million records, then in worst case the process can go out of memory.
Since the file is sorted and stored in disk, only 2 records are stored in memory at a time for a particular user to calculate the average. During the merge-sort phase available memory is used to sort the file in chunks.

Alternative Approach:

Alternatively, first the entire file can be scanned to count the total number of users as well as the number of entries per user.
Based on the count above, an adequate number of files can be created. These files will be similar to shards, where each file stores a particular number of user's data and the shard file vs user mapping is maintained.
Therefore, in the initial phase, based on maximum number of lines, data is stored in memory, and then based on the shard configuration, data is written to the appropriate file.
Now data can be read from one file at a time, where data for all the users in the file is maintained in memory. Since, each file consists of data for limited number of users, the process will be able to operate within the memory constraints.

Code Structure:

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
jar		jar
src		src
.gitignore		.gitignore
README.md		README.md
TwitterLogAnalyzer.iml		TwitterLogAnalyzer.iml
pom.xml		pom.xml

Provide feedback