Real-time data analysis is critical for both SMEs and large corporations across industries such as financial services, legal services, IT operations management, marketing, and advertising. It involves examining large volumes of real-time and historical data to support informed business decisions.
Big data is distinguished from regular data by its velocity, volume, and variety. Processing, storing, and analyzing data in real time calls for a distributed data pipeline, which sets this work apart from traditional batch-oriented big data applications.
This personal project applies the principles of large-scale parallel data processing (CS 6240 - NEU) to build a real-time processing pipeline from open-source tools. The goal is to capture, process, store, and analyze substantial data from diverse sources efficiently and at scale.
Twitter trend popularity and sentiment analysis is an excellent fit for a distributed data pipeline: as of October 2019, roughly 500 million tweets are posted worldwide every day, of which about 1% (around 5 million tweets) is publicly accessible.
The data pipeline is built from the following components: Apache Kafka for data ingestion, Apache Spark for real-time data processing, MongoDB for distributed storage and retrieval, and Apache Drill to connect MongoDB with Tableau for real-time analytics.
Twitter data is obtained through the Twitter Streaming API and streamed to Apache Kafka, where it is readily available to Apache Spark. Spark processes the tweets, performs sentiment classification, and stores the results in MongoDB. Trend popularity and sentiment are then analyzed through a comprehensive Tableau dashboard.
Note: Apache Drill plays a crucial role in connecting MongoDB with Tableau, thereby facilitating a smooth and cohesive integration. Further elaboration on Apache Drill will follow.
In this data pipeline architecture, the Twitter streaming producer utilizes Kafka to publish real-time tweets to the 'tweets-1' topic within an Apache Kafka broker. Subsequently, the Apache Spark Streaming Context subscribes to the 'tweets-1' topic, enabling the ingestion of tweets for further processing.
The Spark engine uses Spark Streaming to process the incoming tweets in micro-batches. Spark performs sentiment classification on each batch before persisting the results in MongoDB for distributed storage and retrieval.
Apache Drill serves as the connector between MongoDB and Tableau, establishing a seamless link between the two systems. Drawing on the real-time data in MongoDB, a live Tableau dashboard offers insight into the popularity and sentiment of trending topics on Twitter as they unfold.
The Kafka Twitter Streaming Producer is a crucial component responsible for publishing real-time tweets to the 'tweets-1' topic in the central Apache Kafka broker. Utilizing the twitter4j library for Twitter API integration, this producer captures streaming tweets written in English from various locations worldwide.
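For illustration, the sketch below mirrors this flow in Scala: a twitter4j stream listener forwards each incoming tweet to the 'tweets-1' Kafka topic. It is a minimal sketch rather than the repository's actual KafkaTwitterProducer.java; the broker address, placeholder OAuth values, and the English-only sample call are assumptions.

```scala
// Minimal sketch of a Twitter-to-Kafka producer; the repository's actual
// implementation is KafkaTwitterProducer.java. Broker address, OAuth
// placeholders, and the English-only sample are illustrative assumptions.
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import twitter4j.{StallWarning, Status, StatusDeletionNotice, StatusListener, TwitterStreamFactory}
import twitter4j.conf.ConfigurationBuilder

object TwitterKafkaProducerSketch {
  def main(args: Array[String]): Unit = {
    // Kafka producer that publishes raw tweet text to the 'tweets-1' topic.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // twitter4j configuration; in the project these values come from oAuth-tokens.txt.
    val config = new ConfigurationBuilder()
      .setOAuthConsumerKey("<consumer-key>")
      .setOAuthConsumerSecret("<consumer-secret>")
      .setOAuthAccessToken("<access-token>")
      .setOAuthAccessTokenSecret("<access-token-secret>")
      .build()

    val stream = new TwitterStreamFactory(config).getInstance()
    stream.addListener(new StatusListener {
      // Forward every incoming tweet's text to Kafka.
      override def onStatus(status: Status): Unit =
        producer.send(new ProducerRecord[String, String]("tweets-1", status.getText))
      override def onDeletionNotice(notice: StatusDeletionNotice): Unit = ()
      override def onTrackLimitationNotice(numberOfLimitedStatuses: Int): Unit = ()
      override def onScrubGeo(userId: Long, upToStatusId: Long): Unit = ()
      override def onStallWarning(warning: StallWarning): Unit = ()
      override def onException(ex: Exception): Unit = ex.printStackTrace()
    })

    // Sample the public stream, restricted to English-language tweets.
    stream.sample("en")
  }
}
```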
Apache Kafka serves as a distributed publish-subscribe messaging system and a robust queue, adept at handling high volumes of data. Its primary role is to facilitate the seamless transmission of messages between endpoints. Supporting both offline and online message consumption, Kafka ensures data persistence on disk and replicates messages within the cluster to ensure data integrity. It integrates seamlessly with Apache Spark for real-time streaming data analysis.
A critical dependency of Apache Kafka is Apache Zookeeper, a distributed configuration and synchronization service that acts as the coordination interface between Kafka brokers and consumers. By storing essential metadata such as topic information, broker details, and consumer offsets, Zookeeper enables Kafka to remain robust and fault tolerant even when a broker or a Zookeeper node fails.
Apache Spark, a high-speed and versatile distributed cluster computing framework, is the foundation of the project. It offers a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time data analysis.
The core component of Apache Spark, Spark Core, revolves around Resilient Distributed Datasets (RDDs) as the primary data abstraction. RDDs are immutable, partitioned collections of elements that can be processed in parallel. Spark's RDD lineage graph allows missing or damaged partitions to be recomputed after node failures, providing fault tolerance.
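As a small illustration (a sketch only, not part of the project code), the snippet below shows transformations being recorded lazily in an RDD's lineage, with the action triggering parallel execution; the sample data and application name are made up.

```scala
// A toy RDD example: map and filter are lazy transformations recorded in the
// lineage graph; count is the action that triggers parallel execution, and a
// lost partition can be recomputed from that lineage.
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // An immutable collection split across two partitions.
    val tweets = sc.parallelize(Seq("Spark is great", "Kafka rocks", "spark streaming"), numSlices = 2)

    val sparkTweets = tweets
      .map(_.toLowerCase)          // lazy transformation
      .filter(_.contains("spark")) // lazy transformation

    println(s"Tweets mentioning Spark: ${sparkTweets.count()}") // action: runs the job
    sc.stop()
  }
}
```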
Leveraging Spark Core, Spark Streaming performs real-time streaming analysis. The key abstraction used is Discretized Stream or DStream, representing continuous data streams. DStreams consist of a continuous series of RDDs, each containing data from a specific interval. The data processing involves both stateless and stateful transformations on the raw streaming tweets, preparing them for sentiment classification.
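The sketch below illustrates this stage under stated assumptions: it consumes the 'tweets-1' topic with Spark Streaming's Kafka direct stream, applies a stateless per-batch transformation, and maintains a stateful running hashtag count. The project's real logic lives in KafkaSparkProcessor.scala; the batch interval, consumer settings, and toy sentiment function here are placeholders.

```scala
// Minimal Spark Streaming sketch: consume 'tweets-1' from Kafka, apply a
// stateless transformation per micro-batch, and keep a stateful running count
// per hashtag. Broker address, batch interval, and the toy sentiment function
// are assumptions; the project's real processing is in KafkaSparkProcessor.scala.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaSparkSketch {
  // Placeholder sentiment classifier; the project uses a real classifier.
  def sentiment(text: String): String =
    if (text.contains("love") || text.contains("great")) "positive"
    else if (text.contains("hate") || text.contains("bad")) "negative"
    else "neutral"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-spark-sketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches (assumed)
    ssc.checkpoint("/tmp/spark-checkpoint")           // required for stateful transformations

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "tweet-processor",
      "auto.offset.reset" -> "latest"
    )

    // DStream of raw tweet texts from the 'tweets-1' topic.
    val tweets = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("tweets-1"), kafkaParams)
    ).map(_.value())

    // Stateless transformation: tag each tweet with a sentiment label.
    val labelled = tweets.map(text => (text, sentiment(text)))

    // Stateful transformation: running count of tweets per hashtag across batches.
    val hashtagCounts = tweets
      .flatMap(_.split("\\s+").filter(_.startsWith("#")))
      .map(tag => (tag, 1))
      .updateStateByKey[Int]((counts: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + counts.sum))

    labelled.print()
    hashtagCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```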
MongoDB serves as the distributed storage and retrieval system for the processed data from Spark. The results of sentiment classification are stored persistently in MongoDB, ensuring efficient data management and scalability.
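Continuing the sketch above, one plausible way to persist each processed batch is to write documents from within foreachRDD using the MongoDB Java driver. The connection string, database name ('twitter_db'), and collection name ('tweet_sentiments') are assumptions; the project may instead rely on the MongoDB Spark connector.

```scala
// Sketch of persisting the (tweet text, sentiment label) pairs produced above
// into MongoDB, batch by batch. Connection string, database, and collection
// names are assumptions.
import com.mongodb.client.MongoClients
import org.apache.spark.streaming.dstream.DStream
import org.bson.Document

object MongoSinkSketch {
  def saveToMongo(labelled: DStream[(String, String)]): Unit =
    labelled.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Open a client on the executor; a driver-side connection is not serializable.
        val client = MongoClients.create("mongodb://localhost:27017")
        try {
          val collection = client.getDatabase("twitter_db").getCollection("tweet_sentiments")
          records.foreach { case (text, label) =>
            collection.insertOne(new Document("text", text).append("sentiment", label))
          }
        } finally {
          client.close()
        }
      }
    }
}
```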
Apache Drill, an open-source SQL execution engine, enables SQL queries on non-relational databases and file systems. In this project, Drill plays a vital role in connecting MongoDB with Tableau, facilitating smooth data integration.
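As a rough illustration of what Drill adds, the snippet below runs plain SQL over a MongoDB collection through Drill's JDBC driver; Tableau reaches the same SQL surface via ODBC. The ZooKeeper address and the mongo.twitter_db.tweet_sentiments identifiers are assumptions carried over from the earlier sketches.

```scala
// Sketch of querying MongoDB through Apache Drill's JDBC driver. The 'mongo'
// prefix refers to Drill's MongoDB storage plugin; database and collection
// names are assumptions matching the earlier sketches.
import java.sql.DriverManager

object DrillQuerySketch {
  def main(args: Array[String]): Unit = {
    // Connect to a Drillbit registered in the local ZooKeeper ensemble.
    val connection = DriverManager.getConnection("jdbc:drill:zk=localhost:2181")
    try {
      val statement = connection.createStatement()
      val results = statement.executeQuery(
        "SELECT sentiment, COUNT(*) AS tweets FROM mongo.twitter_db.tweet_sentiments GROUP BY sentiment"
      )
      while (results.next()) {
        println(s"${results.getString("sentiment")}: ${results.getLong("tweets")}")
      }
    } finally {
      connection.close()
    }
  }
}
```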
Tableau, a powerful data visualization tool, utilizes the real-time data stored in MongoDB to create an interactive live dashboard. This dashboard offers a comprehensive analysis of trending topics on Twitter, presenting valuable insights into their popularity and sentiment in real-time.
Instructions to Set Up the Data Pipeline and Dashboard:
- Download Required Components:
  - Download Zookeeper, MongoDB, Apache Kafka, Apache Spark, and Apache Drill.
- Optional: Set Up the Spark Development Environment:
  - Optionally, follow the instructions here to set up the Spark development environment.
- Clone the Project Repository:
  - Clone the project repository to your local machine.
- Create a Twitter Developer Account:
  - Sign up for a Twitter developer account here.
- Update Twitter API Tokens:
  - Locate the 'oAuth-tokens.txt' file in the input directory of the project repository.
  - Update the file with your respective Twitter API keys and tokens.
- Start the Zookeeper Server:
  - Open a terminal and run the following command:
    /usr/local/zookeeper/bin/zkServer.sh start
- Start the Kafka Server:
  - Open another terminal and run the following command:
    /usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
- Create a Kafka Topic:
  - In a terminal, create a topic named "tweets-1" in Kafka using the following command:
    /usr/local/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic tweets-1
- Verify Topic Creation:
  - To verify that the topic has been created successfully, run the following command:
    /usr/local/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181
- Start the MongoDB Server:
  - Start the MongoDB server on your local machine.
- Start Apache Drill in Distributed Mode:
  - Follow the instructions here to start Apache Drill in distributed mode.
- Enable the MongoDB Storage Plugin in Apache Drill:
  - Configure MongoDB as a storage plugin for Apache Drill using the instructions found here.
- Run KafkaTwitterProducer.java:
  - Execute KafkaTwitterProducer.java with the appropriate arguments.
- Run KafkaSparkProcessor.scala:
  - Run KafkaSparkProcessor.scala with the appropriate arguments.
- Configure Tableau to Connect with MongoDB via Apache Drill:
  - Follow the instructions here to configure Tableau and connect it to MongoDB using Apache Drill.
The tools and IDE used are listed below, along with their download links.