Tweet Ingestion is an application designed to efficiently ingest a high volume of tweets and process them as follows:
- Produce a file that groups words from tweets and counts their frequency. Example output is as follows:
analytics 1
bigdata 3
kdn 1
smb 1
- Produce a running median of unique word counts from tweets. Example output is as follows:
11.0
12.5
14.0
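As an illustration of the first output, here is a minimal sketch of word counting. It is not the code in src/words_tweeted.py. The function names are hypothetical, words are assumed to be separated by whitespace, and the alphabetical ordering is inferred from the example output above.

```python
def count_words(lines):
    """Return a dict mapping each whitespace-separated word to its total count."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def format_counts(counts):
    """Render the counts one per line as 'word count', sorted by word."""
    return "\n".join("{0} {1}".format(word, counts[word]) for word in sorted(counts))


# Two made-up tweets that reproduce the example output shown above
tweets = ["bigdata analytics bigdata", "kdn smb bigdata"]
print(format_counts(count_words(tweets)))
```

The sketch uses only built-ins, so it should run unchanged on Python 2.7 and 3.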
This code is portable across Linux distributions, Mac OS, and Windows. The scripts were written for Python 2.7 and have not been tested on Python 3.x.
You are encouraged to use a Python virtual environment, created with virtualenv and populated with pip. NOTE (2015-07-18): The requirements file is currently empty because no modules outside the standard library are used.
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements
- os - operating system interface, used to portably join paths, remove files, etc.
- sys - interpreter interface, used to parse parameters from the command line
- heapq - heap queue (priority queue) algorithm, used to efficiently obtain the root (smallest item) of a heap stored in a list
- unittest - framework used to create and employ unit tests
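To show how heapq can support a running median, here is a minimal sketch of the common two-heap technique. It is an assumption that src/median_unique.py works along these lines. The names running_medians and unique_word_count are hypothetical.

```python
import heapq


def unique_word_count(line):
    """Number of distinct whitespace-separated words in one tweet."""
    return len(set(line.split()))


def running_medians(values):
    """Yield the median of all values seen so far, after each new value."""
    lower = []  # max-heap of the smaller half, stored as negated numbers
    upper = []  # min-heap of the larger half
    for value in values:
        # Push onto the lower half, then move its largest item to the upper half
        heapq.heappush(lower, -value)
        heapq.heappush(upper, -heapq.heappop(lower))
        # Rebalance so that lower holds as many items as upper, or one more
        if len(upper) > len(lower):
            heapq.heappush(lower, -heapq.heappop(upper))
        if len(lower) > len(upper):
            yield float(-lower[0])
        else:
            yield (-lower[0] + upper[0]) / 2.0


# Unique word counts of 11, 14 and 17 give the example medians shown earlier
for median in running_medians([11, 14, 17]):
    print(median)
```

Each new value costs O(log n) heap operations, and the median is read from the heap roots without sorting. Both heaps together still hold every value seen.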
Applications can be run separately or together from a shell script.
To run separately:
Both words_tweeted.py and median_unique.py accept the same two parameters:
- Input file: this is the file containing tweets separated by newlines and located in
- Output file: this is the file that is produced and located
$ git clone https://github.com/vchiapaikeo/tweet_ingestion.git
$ cd tweet_ingestion
$ python src/words_tweeted.py tweet_input/tweets.txt tweet_output/ft1.txt
$ python src/median_unique.py tweet_input/tweets.txt tweet_output/ft2.txt
To run from shell script:
This scenario is simpler and will execute both scripts back to back.
$ ./run.sh
Unit tests have been created to test functions within the app. Tests should be run from the command line in the top-level directory. Two test modules are available, one for each module in src, and can be executed as follows:
$ cd tweet_ingestion
$ python -m src.tests.unit_words_tweeted
$ cd tweet_ingestion
$ python -m src.tests.unit_median_unique
Output similar to the following should be returned:
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s
OK
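For anyone adding tests, here is a minimal sketch of what a unittest module can look like. The function under test, the class name, and the run_tests helper are all hypothetical stand-ins and are not taken from src/tests.

```python
import unittest


def unique_word_count(line):
    """Hypothetical stand-in for a function under test."""
    return len(set(line.split()))


class UniqueWordCountTest(unittest.TestCase):
    def test_repeated_words_count_once(self):
        self.assertEqual(unique_word_count("bigdata bigdata kdn"), 2)

    def test_empty_line_has_no_words(self):
        self.assertEqual(unique_word_count(""), 0)


def run_tests():
    """Load and run the test case above, returning the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(UniqueWordCountTest)
    return unittest.TextTestRunner().run(suite)


if __name__ == "__main__":
    run_tests()
```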
This script is a work in progress. A number of items could still be added to improve its functionality, performance, and robustness. A few of my favorite wish-list items are listed below.
- Reduce memory footprint in src/median_unique.py - the current implementation (submitted 2015-07-19) retains lists of streaming unique word counts per line in memory...
- Support multiple files in input dir using glob? Could this be something we want?
- Add more tests!
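One possible direction for the memory item above is a counting approach, sketched below. It is not the current implementation. It assumes the unique word count of a tweet has a known upper bound (max_count, a hypothetical parameter; 70 is a guess based on the 140-character tweet limit), so a fixed-size histogram can replace the stored list of counts.

```python
def running_medians_bounded(counts, max_count=70):
    """Yield running medians while storing only a histogram of counts."""
    histogram = [0] * (max_count + 1)  # histogram[v] = tweets seen with v unique words
    total = 0
    for count in counts:
        if not 0 <= count <= max_count:
            raise ValueError("count outside assumed range: {0}".format(count))
        histogram[count] += 1
        total += 1
        # 0-based ranks of the middle item(s) in sorted order
        lo_rank = (total - 1) // 2
        hi_rank = total // 2
        seen = 0
        lo_val = None
        hi_val = None
        for value, freq in enumerate(histogram):
            seen += freq
            if lo_val is None and seen > lo_rank:
                lo_val = value
            if seen > hi_rank:
                hi_val = value
                break
        yield (lo_val + hi_val) / 2.0
```

Memory stays constant regardless of how many tweets are processed, at the cost of scanning up to max_count + 1 buckets for each tweet.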