
A Kubernetes microservice that ingests Steam data. It is deployed as a service that uses SQLite for storing process records, backed by an event-driven message queue managed by RabbitMQ. The queue is likewise deployed as a service and feeds Celery workers integrated with Elasticsearch and Redis for caching and ingestion.


Data Ingestion Pipeline

The gaming industry is currently one of the most prominent industries in the market, and of all the factors that determine the popularity of a game, reviews are paramount. This project analyses the Steam reviews dataset using an Elasticsearch engine deployed in a Kubernetes environment, where the data ingestion queues are handled by RabbitMQ, processing is handled by Celery, and data is cached in Redis.

[Figure: architecture diagram]

Demo Scenario

To run Kubernetes on the base system, one can use MicroK8s or Minikube. MicroK8s is the easiest and fastest way to get Kubernetes up and running. High availability is one of the premier qualities to look for in a cluster, so that it can withstand the failure of any component and continue serving workloads without interruption. Per the canonical documentation, a highly available cluster needs more than one node, a replicated datastore, and API services running on more than one node, and MicroK8s serves all three.

Installation

Following the canonical documentation, one can install Kubernetes with these commands:

  • sudo snap install microk8s --classic
  • microk8s status --wait-ready
  • microk8s enable dashboard dns registry istio

There are a few Python-based dependencies, which can be installed with pip3 install -r requirements.txt in each of the following directories:

  • async_backend
  • celery_app
  • ingestion_engine

For convenience, I am also aliasing microk8s kubectl to the short form kcdev, done by adding alias kcdev='microk8s kubectl' to the bashrc/zshrc profile.

The directory structure mirrors the commands used to bring up the Kubernetes environment, which are as follows (a quick way to verify the rollout is sketched after the list):

  • elasticsearch-setup-passwords auto
  • kcdev apply -f async_backend/deployment_config/service.yaml
  • kcdev apply -f async_backend/deployment_config/deployment.yaml
  • kcdev apply -f debug_pod/debug_pod.yaml
  • kcdev apply -f elasticsearch_config/elasticsearch_deployment.yaml
  • kcdev apply -f elasticsearch_config/elasticsearch_pv.yaml
  • kcdev apply -f elasticsearch_config/elasticsearch_service.yaml
  • kcdev apply -f message_queue/rabbitmq_deployment.yaml
  • kcdev apply -f message_queue/rabbitmq_service.yaml
  • kcdev apply -f redis/redis_pv.yaml
  • kcdev apply -f redis/redis_deployment.yaml
  • kcdev apply -f redis/redis_service.yaml
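
Once these are applied, the pods should come up. The following is a minimal sketch of verifying that with the official kubernetes Python client (pip3 install kubernetes), equivalent to running kcdev get pods; the default namespace is an assumption, since the manifests above may target another one.

  from kubernetes import client, config

  # Load credentials from the local kubeconfig; MicroK8s can export one
  # with: microk8s config > ~/.kube/config
  config.load_kube_config()

  v1 = client.CoreV1Api()
  for pod in v1.list_namespaced_pod(namespace="default").items:
      print(pod.metadata.name, pod.status.phase)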

Kubernetes Pods

Using Kubernetes, we have set up six microservice pods in this project, which can be seen in the figure below. These microservices run in an event-driven architecture and are given persistent volumes for their storage. A sketch of this event-driven flow follows the figure.

[Figure: the six microservice pods]
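
To make the event-driven flow concrete, here is a minimal sketch of a Celery task that takes review messages off the RabbitMQ queue and indexes them into Elasticsearch, with Redis as the result backend. The in-cluster hostnames, index name, and task body are illustrative assumptions, not the repository's exact code, and the indexing call assumes the v8 elasticsearch Python client.

  from celery import Celery
  from elasticsearch import Elasticsearch

  # Broker and backend URLs assume in-cluster service names.
  app = Celery(
      "ingestion",
      broker="amqp://guest:guest@rabbitmq-service:5672//",
      backend="redis://redis-service:6379/0",
  )
  es = Elasticsearch("http://elasticsearch-service:9200")

  @app.task
  def ingest_review(review: dict) -> str:
      # Index one Steam review document; Elasticsearch assigns the id.
      resp = es.index(index="steam_reviews", document=review)
      return resp["result"]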

In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.

[Figure: HorizontalPodAutoscaler status]
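
As an example, the sketch below creates an HPA for a Celery worker deployment with the kubernetes Python client; the deployment name, namespace, and CPU threshold are assumptions for illustration.

  from kubernetes import client, config

  config.load_kube_config()

  hpa = client.V1HorizontalPodAutoscaler(
      metadata=client.V1ObjectMeta(name="celery-worker-hpa"),
      spec=client.V1HorizontalPodAutoscalerSpec(
          scale_target_ref=client.V1CrossVersionObjectReference(
              api_version="apps/v1", kind="Deployment", name="celery-worker"),
          min_replicas=1,
          max_replicas=5,
          # Add replicas when average CPU utilisation crosses 70%.
          target_cpu_utilization_percentage=70,
      ),
  )
  client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
      namespace="default", body=hpa)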

Persistent Volumes

A PersistentVolume (PV) as shown in the figure below is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource.

[Figure: PersistentVolumes]

A PersistentVolumeClaim (PVC) as shown in the figure below is a request for storage by a user. It is similar to a Pod: Pods consume node resources, while PVCs consume PV resources. Pods can request specific levels of resources (CPU and memory), and claims can request specific sizes and access modes. A sketch of a matching PV/PVC pair follows the figure.

[Figure: PersistentVolumeClaims]
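
The sketch below creates such a PV/PVC pair with the kubernetes Python client, e.g. for the Redis pod; the names, hostPath, and 1Gi size are illustrative assumptions (the repository's actual values live in redis_pv.yaml).

  from kubernetes import client, config

  config.load_kube_config()
  core = client.CoreV1Api()

  # A 1 Gi hostPath volume; PVs are cluster-scoped resources.
  pv = client.V1PersistentVolume(
      metadata=client.V1ObjectMeta(name="redis-pv"),
      spec=client.V1PersistentVolumeSpec(
          capacity={"storage": "1Gi"},
          access_modes=["ReadWriteOnce"],
          host_path=client.V1HostPathVolumeSource(path="/mnt/data/redis"),
      ),
  )
  core.create_persistent_volume(body=pv)

  # The claim a pod mounts; it binds to a PV with sufficient capacity.
  pvc = client.V1PersistentVolumeClaim(
      metadata=client.V1ObjectMeta(name="redis-pvc"),
      spec=client.V1PersistentVolumeClaimSpec(
          access_modes=["ReadWriteOnce"],
          resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
      ),
  )
  core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)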

Results

Using port forwarding to render the results in a Flask application, I am able to show the process ids of Celery workers running concurrently and scaling at the same time. In the figure below, the process ids are shown on a scaffolded Flask template (a sketch of how the page gathers them follows the figure).

[Figure: Celery worker process ids in the Flask app]
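
Here is a minimal sketch of how such a page can collect the worker process ids with Celery's inspection API; the broker URL and route name are assumptions.

  from celery import Celery
  from flask import Flask, jsonify

  flask_app = Flask(__name__)
  celery_app = Celery(broker="amqp://guest:guest@rabbitmq-service:5672//")

  @flask_app.route("/workers")
  def workers():
      # stats() queries every live worker for runtime info, including its
      # pid; the mapping grows and shrinks as the workers scale.
      stats = celery_app.control.inspect().stats() or {}
      return jsonify({worker: info["pid"] for worker, info in stats.items()})

The page itself can then be reached from the host via kcdev port-forward, which is what the port forwarding above refers to.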

For querying the Steam reviews for exploration, the relevant columns are extracted: the language of the review, the review text itself, the game name, and whether the game is recommended. In the figure below, the query results are rendered on the same scaffolded Flask template (a sketch of the query follows the figure).

[Figure: Steam review query results]
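
A sketch of such a query with the elasticsearch Python client (v8-style API); the index name and the example game title are assumptions.

  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://elasticsearch-service:9200")

  # Fetch only the columns used for exploration.
  resp = es.search(
      index="steam_reviews",
      query={"match": {"game": "Dota 2"}},  # example game title
      source=["language", "review", "game", "recommended"],
  )
  for hit in resp["hits"]["hits"]:
      print(hit["_source"])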

