Utilising Pulumi to create an EKS infrastructure in minutes and deploy a scraping project that uses a distributed message queue. A minimal Pulumi sketch is shown after the table below.
Library/Technology | Purpose |
---|---|
Pulumi | Infrastructure as code tool for defining, deploying, and managing cloud infrastructure using familiar programming languages. |
pulumi-awsx | Higher-level abstraction for working with AWS infrastructure resources in Pulumi, simplifying the process of defining and managing AWS resources. |
pulumi-eks | Abstractions for working with Amazon EKS clusters in Pulumi, simplifying the deployment and management of Kubernetes clusters on AWS. |
Celery | Distributed task queue used by the worker to process scraped data asynchronously.
Docker | Containerization of the Scrapy project for portability and ease of deployment. |
Kubernetes | Orchestration platform for deploying and managing containerized applications. |
psycopg2-binary | Optional: PostgreSQL adapter for interacting with PostgreSQL databases if storing scraped data. |
python-dotenv | Loading environment variables for managing configuration settings securely. |
Scrapy | Web scraping framework for extracting data from IMDb's Top 250 movies list.
SQLAlchemy | Optional: ORM library for interacting with relational databases if storing scraped data. |
AWS EKS | Kubernetes service on AWS for deploying the Dockerized Scrapy scraper with scalability and reliability. |
AWS Secrets Manager | Securely handling sensitive configurations like API keys or database credentials. |
GitHub Actions | Manages CI/CD and automatic deployment on the main branch.
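For orientation, here is a minimal sketch of what a Pulumi program for this stack could look like using pulumi-awsx and pulumi-eks. The resource names, instance type, and node counts are illustrative assumptions; the actual program lives in `resources/iaac`.

```python
# __main__.py: illustrative sketch only; names and sizes are assumptions.
import pulumi
import pulumi_awsx as awsx
import pulumi_eks as eks

# A VPC with public and private subnets for the cluster nodes.
vpc = awsx.ec2.Vpc("scraper-vpc")

# A managed EKS cluster running inside that VPC.
cluster = eks.Cluster(
    "scraper-cluster",
    vpc_id=vpc.vpc_id,
    public_subnet_ids=vpc.public_subnet_ids,
    private_subnet_ids=vpc.private_subnet_ids,
    instance_type="t3.medium",
    desired_capacity=2,
    min_size=1,
    max_size=3,
)

# Export the kubeconfig so kubectl (or CI) can target the new cluster.
pulumi.export("kubeconfig", cluster.kubeconfig)
```

`pulumi up` (step 1.3) previews and applies whatever the real program defines; the sketch above only illustrates its shape.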
1.1 Pulumi setup
cd resources/iaac
If the venv has not been created yet:
python -m venv venv
Activate the venv:
source venv/bin/activate
pip install -r requirements.txt
1.2 Set up AWS
Get your access key ID and secret access key from AWS IAM, then configure your credentials in the AWS CLI (see the AWS CLI setup instructions):
aws configure
1.3 Spin up Pulumi
pulumi up
1.4 Spin down
pulumi destroy
Prerequisites: PostgreSQL and Redis. Connection settings for both are read from environment variables (see the sketch after the setup commands below).
Setup
cd scraper
python -m venv venv
source venv/bin/activate
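Both the scraper and the worker need PostgreSQL and Redis connection strings. A minimal sketch of loading them with python-dotenv is shown below; the module name, variable names, and defaults are assumptions rather than the project's actual configuration.

```python
# config.py (hypothetical module): load connection settings from a .env file.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env in the project root

# Assumed variable names; the defaults point at local services.
POSTGRES_URI = os.getenv("POSTGRES_URI", "postgresql://scraper:scraper@localhost:5432/imdb")
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
```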
2.1 Scraper Project
Crawls the IMDb Top 250 list, visits each movie page, and sends the scraped data to the queue.
scrapy crawl imdb
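A rough sketch of how the spider and an item pipeline could hand scraped movies to the queue is shown below. The CSS selectors, module paths, and the `save_movie` task are assumptions (the task itself is sketched in the worker section), and the selectors may not match IMDb's current markup.

```python
# spiders/imdb.py (illustrative): crawl the chart and yield one item per movie.
import scrapy


class ImdbSpider(scrapy.Spider):
    name = "imdb"  # matches `scrapy crawl imdb`
    start_urls = ["https://www.imdb.com/chart/top/"]

    def parse(self, response):
        # Follow every movie link found on the chart page.
        for href in response.css("a.ipc-title-link-wrapper::attr(href)").getall():
            yield response.follow(href, callback=self.parse_movie)

    def parse_movie(self, response):
        # Yield a plain dict; the pipeline below pushes it onto the queue.
        yield {"url": response.url, "title": response.css("h1 ::text").get()}


# pipelines.py (illustrative): enqueue each item instead of writing it directly.
class QueuePipeline:
    def process_item(self, item, spider):
        from scraper.tasks import save_movie  # assumed Celery task, see the worker sketch

        save_movie.delay(dict(item))  # the Celery worker upserts it into PostgreSQL
        return item
```

The pipeline would still need to be enabled through `ITEM_PIPELINES` in the Scrapy settings.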
2.2 Worker Project
Receives data from the scraper and upserts it into the database.
celery -A scraper worker --loglevel=info
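A minimal sketch of the worker side is shown below, assuming the Celery app is importable from a `scraper` package (to match `celery -A scraper`) and that items are upserted with psycopg2. The module, table, column, and task names are all assumptions.

```python
# scraper/tasks.py (illustrative): Celery app plus an upsert task.
import os

import psycopg2
from celery import Celery
from dotenv import load_dotenv

load_dotenv()

app = Celery("scraper", broker=os.getenv("REDIS_URL", "redis://localhost:6379/0"))


@app.task
def save_movie(movie: dict) -> None:
    """Upsert a single scraped movie into PostgreSQL."""
    conn = psycopg2.connect(
        os.getenv("POSTGRES_URI", "postgresql://scraper:scraper@localhost:5432/imdb")
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO movies (url, title)
            VALUES (%(url)s, %(title)s)
            ON CONFLICT (url) DO UPDATE SET title = EXCLUDED.title
            """,
            movie,
        )
    conn.close()
```

An SQLAlchemy model could replace the raw psycopg2 call if the ORM route listed in the table above is preferred.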