Given a suitably filtered stream of documents returned by a Twitter query, calculate real-time statistics and show the ranking of the most tweeted URLs since system activation.
The statistics must be updated on screen every N seconds.
They list the links grouped by domain, each with the number of times it has been tweeted:
Domain | Link | Frequency |
---|---|---|
foursquare.com | expanded.url.com/123 | 9 times |
foursquare.com | expanded.url.com/456 | 8 times |
youtube.com | ... | ... |
instagram.com | ... | ... |
... | ... | ... |
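The ranking above can be sketched as a nested map from domain to expanded URL to count. This is only a minimal in-memory illustration: the class name `UrlRanking` and its methods are hypothetical, and it says nothing about how RTwUP actually stores its counters (the project description mentions Redis for the user interface).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory counter: domain -> (expanded URL -> frequency). */
class UrlRanking {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one occurrence of an expanded URL under its domain. */
    void record(String domain, String expandedUrl) {
        counts.computeIfAbsent(domain, d -> new HashMap<>())
              .merge(expandedUrl, 1, Integer::sum);
    }

    /** Returns a snapshot of the k most frequent URLs of a domain, most frequent first. */
    List<Map.Entry<String, Integer>> top(String domain, int k) {
        Map<String, Integer> perDomain = counts.getOrDefault(domain, new HashMap<>());
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : perDomain.entrySet()) {
            entries.add(new AbstractMap.SimpleImmutableEntry<>(e));
        }
        // Sort by descending count; break ties alphabetically so the order is stable.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    public static void main(String[] args) {
        UrlRanking ranking = new UrlRanking();
        ranking.record("foursquare.com", "expanded.url.com/123");
        ranking.record("foursquare.com", "expanded.url.com/123");
        ranking.record("foursquare.com", "expanded.url.com/456");
        for (Map.Entry<String, Integer> row : ranking.top("foursquare.com", 10)) {
            System.out.println("foursquare.com | " + row.getKey() + " | " + row.getValue() + " times |");
        }
    }
}
```

A periodic task could call `top` for each domain every N seconds to refresh the table.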
The system has to use the Twitter APIs (through Twitter4j or Hosebird, for instance) to perform queries and retrieve Tweets, and must filter them suitably (e.g. according to the coordinates of a polygon centered on Rome, Milan or a city of your choice).
The links of interest are the ones retrieved from the entities/urls field of the Tweet JSON:
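As an illustration of the geographic filter, the following sketch tests whether a Tweet's coordinates fall inside a polygon using the standard ray-casting method. The class `GeoFilter` is hypothetical, and the rectangle around Rome uses rough, illustrative coordinates; it is not taken from RTwUP's code, which may instead rely on the filtering offered by the streaming API itself.

```java
/** Hypothetical geographic filter: keeps points that fall inside a polygon. */
class GeoFilter {
    /**
     * Ray-casting point-in-polygon test.
     * Each vertex is {longitude, latitude}; vertices are listed in order
     * around the boundary and the polygon is closed implicitly.
     */
    static boolean contains(double[][] polygon, double lon, double lat) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            // Does a horizontal ray going east from the point cross edge (j, i)?
            boolean crosses = (yi > lat) != (yj > lat)
                    && lon < (xj - xi) * (lat - yi) / (yj - yi) + xi;
            if (crosses) {
                inside = !inside;
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // Rough rectangle around Rome (illustrative values only).
        double[][] rome = {
            {12.3, 41.7}, {12.7, 41.7}, {12.7, 42.0}, {12.3, 42.0}
        };
        System.out.println(contains(rome, 12.4964, 41.9028)); // a point in central Rome
        System.out.println(contains(rome, 9.19, 45.4642));    // a point in Milan
    }
}
```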
- first of all, links have to be expanded, reversing the output of Twitter's shortening service (t.co);
- if the Tweet contains the expanded form of the URL, the count is assigned to it;
- if the Tweet contains a “shortened” form of the URL (e.g. bit.ly/13NHE7v, goo.gl/uJH2Y, http://instagr.am/p/S3l5rQjCcA/, etc.), it has to be expanded until the fully expanded form is obtained (possibly after several expansions); the count is then assigned to that form.
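The repeated expansion described in the list above can be sketched as a loop that follows one redirect at a time until no further hop exists. Everything here is an assumption made for illustration: the class `UrlExpander`, the `nextHop` function, the hop limit and the timeouts are hypothetical, and RTwUP may resolve links differently. The single-hop lookup is passed in as a function so the loop can be exercised without network access; `httpNextHop` shows one possible lookup based on an HTTP `HEAD` request.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper that follows shortener redirects until the final URL. */
class UrlExpander {
    /** Follows nextHop until it returns empty, a cycle is detected, or maxHops is reached. */
    static String expand(String url, Function<String, Optional<String>> nextHop, int maxHops) {
        Set<String> seen = new HashSet<>();
        String current = url;
        for (int hop = 0; hop < maxHops && seen.add(current); hop++) {
            Optional<String> next = nextHop.apply(current);
            if (!next.isPresent()) {
                return current; // no further redirect: fully expanded
            }
            current = next.get();
        }
        return current; // hop limit reached or redirect cycle detected
    }

    /** One possible single-hop lookup: returns the Location header of a 3xx response. */
    static Optional<String> httpNextHop(String url) {
        try {
            String absolute = url.contains("://") ? url : "http://" + url;
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(absolute).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we want to see each hop ourselves
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(3000);
            conn.setReadTimeout(3000);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            boolean redirect = status >= 300 && status < 400 && location != null;
            return redirect
                    ? Optional.of(URI.create(absolute).resolve(location).toString())
                    : Optional.empty();
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            return Optional.empty(); // treat unreachable or malformed URLs as final
        }
    }

    public static void main(String[] args) {
        // Offline demonstration with a made-up redirect table.
        Map<String, String> redirects = new HashMap<>();
        redirects.put("t.co/abc", "bit.ly/xyz");
        redirects.put("bit.ly/xyz", "http://example.com/page");
        System.out.println(expand("t.co/abc", u -> Optional.ofNullable(redirects.get(u)), 10));
    }
}
```

With network access, `expand(url, UrlExpander::httpNextHop, 10)` would follow real redirects; the hop limit and the cycle check keep a misbehaving shortener from blocking the stream.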
Starting from the final expanded form, domain information can be extracted to organize the current results.
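One simple way to derive the domain from an expanded URL is to parse it with `java.net.URI` and take its host. The sketch below is an assumption for illustration: the class `DomainExtractor` is hypothetical, and stripping a leading `www.` is a simplification rather than a full registrable-domain computation (for example, `m.youtube.com` would stay distinct from `youtube.com`).

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper that maps an expanded URL to the domain used for grouping. */
class DomainExtractor {
    static Optional<String> domainOf(String expandedUrl) {
        try {
            String trimmed = expandedUrl.trim();
            // URLs without a scheme (e.g. "expanded.url.com/123") have no parsable host.
            String absolute = trimmed.contains("://") ? trimmed : "http://" + trimmed;
            String host = URI.create(absolute).getHost();
            if (host == null) {
                return Optional.empty();
            }
            host = host.toLowerCase(Locale.ROOT);
            return Optional.of(host.startsWith("www.") ? host.substring(4) : host);
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        System.out.println(domainOf("http://www.youtube.com/watch?v=abc"));
        System.out.println(domainOf("http://instagr.am/p/S3l5rQjCcA/"));
        System.out.println(domainOf("expanded.url.com/123"));
    }
}
```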
This must be done in real time, using Apache Storm.
RTwUP is developed in Java.
To listen to Twitter's stream, Twitter4j was chosen, in particular its Streaming API support.
To process the Tweets in real time, Apache Storm was chosen.
The user interface is written as a Node.js application, making use of socket.io and Redis to display results in real time.
For more information, you can refer to the wiki pages.