This repo demonstrates two use cases.
The first one fetches content from a database, runs an ML algorithm on that data, and extracts insights from it. The base for this repo comes from a Ververica fork. Thanks.
See the slides for more context.
Second use case: fetch data from a Kafka topic, process it in Flink, and write the results to another Kafka topic.
This is an example that demonstrates fetching a data stream from a Kafka topic, processing it in Apache Flink, and writing the processed data to another Kafka topic. The streamed data is accumulated in 5-minute windows; a SQL-style query is run over each window, and the result is pushed to another Kafka topic.
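As a rough illustration of this pattern, here is a minimal PyFlink sketch of a 5-minute tumbling-window aggregation between two Kafka topics. The topic names, field names, and the SUM aggregation are assumptions for illustration only, not this repo's actual schema or query:

```python
# A minimal PyFlink sketch of the Kafka -> 5-minute window -> Kafka pattern.
# NOTE: topic names, field names, and the SUM aggregation are illustrative
# assumptions, not this repo's actual schema or query.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Source table: JSON records arriving on the input Kafka topic (assumed names).
t_env.execute_sql("""
    CREATE TABLE transactions (
        user_id STRING,
        amount  DOUBLE,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'transactions',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink table: per-window aggregates pushed to the output Kafka topic (assumed name).
t_env.execute_sql("""
    CREATE TABLE window_totals (
        user_id      STRING,
        window_end   TIMESTAMP(3),
        total_amount DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'window_totals',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# Accumulate the stream in 5-minute tumbling windows and run a SQL-style
# query over each window; wait() blocks while the streaming job runs.
t_env.execute_sql("""
    INSERT INTO window_totals
    SELECT user_id,
           TUMBLE_END(ts, INTERVAL '5' MINUTE) AS window_end,
           SUM(amount) AS total_amount
    FROM transactions
    GROUP BY user_id, TUMBLE(ts, INTERVAL '5' MINUTE)
""").wait()
```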
To keep things simple, this demo uses a Docker Compose setup that bundles up all the services you need:
```bash
# Build the images, start all services in the background, and verify they are running.
docker-compose build
docker-compose up -d
docker-compose ps
```
The following UI endpoints will be up as part of this setup:
- Flink Web UI - Analytics (http://localhost:8081)
- Flink Web UI - Source and Sink are Kafka (http://localhost:8082)
- Kafka UI (http://localhost:8080)
- Superset (http://localhost:8088)
What are people asking about most frequently on the Flink user mailing list? How can you make sense of such a huge amount of free-form text?
The model in this demo was trained using a popular topic modeling algorithm called LDA (Latent Dirichlet Allocation) and Gensim, a Python library with a good implementation of it. The trained model knows, to some extent, which combinations of words are associated with which topics, and can simply be passed as a dependency to the PyFlink job.
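As a rough sketch of how such a model is trained with Gensim, here is a minimal LDA example; the toy documents, topic count, and model path are placeholders, not the actual mailing-list corpus or settings used for this demo:

```python
# A minimal Gensim LDA training sketch. The toy documents, num_topics, and
# model path are placeholders, not the actual corpus or settings of this demo.
from gensim import corpora
from gensim.models import LdaModel

# Tokenized placeholder documents standing in for mailing-list threads.
documents = [
    ["checkpoint", "state", "backend", "rocksdb"],
    ["kafka", "connector", "offset", "consumer"],
    ["watermark", "event", "time", "window"],
]

dictionary = corpora.Dictionary(documents)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)

# Persist the model so it can be shipped as a dependency of the PyFlink job.
lda.save("lda.model")

# Inspect which word combinations the model associates with each topic.
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```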
Don't trust the model. 👹
Submit the job to the Flink cluster:

```bash
docker-compose exec jobmanager-nlp ./bin/flink run -py /opt/pyflink-nlp/pipeline.py -d
```
Once you get the `Job has been submitted with JobID <JobId>` green light, you can check and monitor its execution using the Flink Web UI.
To visualize the results, navigate to (http://localhost:8088) and log into Superset using:
username: admin
password: superset
There should be a default dashboard named "Flink User Mailing List" listed under Dashboards.
Second use case: fetch data from a Kafka topic, process it in Flink, and write the results to another Kafka topic.
There is a Kafka producer (pipelines/streaming.py) that creates mock data and pushes it to a Kafka topic. For this example, I took the credit card purchase information (transaction log) of users: the producer creates 100 mock records every 5 minutes and pushes them to the Kafka topic.
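As an illustration of what such a producer might look like, here is a minimal kafka-python sketch; the topic name, record schema, and field values are assumptions, not necessarily what pipelines/streaming.py does:

```python
# A minimal kafka-python sketch of a mock transaction producer.
# NOTE: the topic name, record schema, and batching cadence are assumptions,
# not necessarily what pipelines/streaming.py does.
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Emit a batch of 100 mock credit card transactions every 5 minutes.
    for _ in range(100):
        record = {
            "transaction_id": str(uuid.uuid4()),
            "user_id": f"user_{random.randint(1, 20)}",
            "amount": round(random.uniform(1.0, 500.0), 2),
            "type": random.choice(["ONLINE", "IN_STORE"]),
            "lat": round(random.uniform(40.0, 41.0), 5),
            "lon": round(random.uniform(-74.0, -73.0), 5),
            "ts": int(time.time() * 1000),
        }
        producer.send("transactions", value=record)
    producer.flush()
    time.sleep(300)
```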
Flink listens on this Kafka topic and consumes the data. What I am interested in is finding the list of users who show an unusual pattern of purchases within a 5-minute window. There could be many ways to get this insight. The way I did it: group the data for the last 5 minutes, find the sum of all the online transactions and the sum of all the in-store transactions, and then filter for the users who meet the following conditions:
- Number of total transactions > 5
- Online transactions total > in-store transactions total
- For the in-store transactions, the distance between two consecutive store locations > 2 miles
- The last location entry of the previous 5-minute data group is stored, so the first in-store purchase of the new window can be compared against it
If all of the above conditions are met, then this "could" be a potential fraud situation, and the results are written to another Kafka topic. A sketch of this per-window check follows below.
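Here is a plain-Python sketch of that per-user, per-window filtering logic; the field names and the haversine distance calculation are assumptions about the record schema, not the repo's exact implementation:

```python
# A plain-Python sketch of the per-user, per-window filtering logic.
# NOTE: field names and the haversine distance are assumptions about the schema.
from math import asin, cos, radians, sin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

def is_suspicious(window_txns, prev_window_last_location=None):
    """Check one user's transactions from a single 5-minute window."""
    # Condition 1: more than 5 transactions in the window.
    if len(window_txns) <= 5:
        return False

    # Condition 2: online transaction total must exceed the in-store total.
    online_total = sum(t["amount"] for t in window_txns if t["type"] == "ONLINE")
    in_store = [t for t in window_txns if t["type"] == "IN_STORE"]
    if online_total <= sum(t["amount"] for t in in_store):
        return False

    # Condition 3: any two consecutive in-store locations more than 2 miles
    # apart; the last location kept from the previous window is checked first.
    locations = [prev_window_last_location] if prev_window_last_location else []
    locations += [(t["lat"], t["lon"]) for t in in_store]
    for (lat1, lon1), (lat2, lon2) in zip(locations, locations[1:]):
        if miles_between(lat1, lon1, lat2, lon2) > 2:
            return True
    return False
```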
You can check the filtered transactions via the Kafka UI.
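Besides the Kafka UI, a quick way to spot-check the output is a small kafka-python consumer; the output topic name here is an assumption:

```python
# A small kafka-python consumer for spot-checking the output topic.
# NOTE: the topic name is an assumption; use whatever topic the job writes to.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "flagged_transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.value)  # one potentially fraudulent record per message
```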
And that's it!
For the latest updates on PyFlink, follow Apache Flink on Twitter.