A simple and efficient web crawler for Python.
- Crawl web pages and extract links starting from a root URL recursively
- Concurrent workers and custom delay
- Handle relative and absolute URLs
- Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
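As an illustration of what handling relative and absolute URLs involves, here is a minimal sketch using only the standard library. It is not taken from the crawler's source; the helper names `to_absolute` and `is_http_url` are hypothetical and exist only for this example.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(page_url: str, href: str) -> str:
    """Resolve a link found on page_url into an absolute URL.

    Absolute hrefs are returned unchanged; relative ones are
    resolved against the page they were found on.
    """
    return urljoin(page_url, href)


def is_http_url(url: str) -> bool:
    """Return True only for http/https URLs (skips mailto:, javascript:, etc.)."""
    return urlparse(url).scheme in ("http", "https")


page = "https://github.com/solutions/ci-cd"
print(to_absolute(page, "/features"))                    # root-relative link
print(to_absolute(page, "pricing"))                      # path-relative link
print(to_absolute(page, "https://githubuniverse.com/"))  # already absolute
```

Filtering with something like `is_http_url` before queueing a link keeps non-web schemes out of the crawl.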
Install using pip:
pip install tiny-web-crawler
from tiny_web_crawler import Spider
from tiny_web_crawler import SpiderSettings
settings = SpiderSettings(
    root_url='http://github.com',
    max_links=2
)
spider = Spider(settings)
spider.start()
# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0
settings = SpiderSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False
)
spider = Spider(settings)
spider.start()
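To give a feel for how a worker count and a per-request delay interact, the following is a rough sketch built on the standard library's thread pool. It is an assumption-laden illustration, not the crawler's actual implementation: `fetch_with_delay` and `crawl_batch` are hypothetical names, and the "fetch" is a stand-in that performs no network request.

```python
from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_delay(url: str, delay: float) -> str:
    """Hypothetical stand-in for a page fetch: pause, then return the URL."""
    time.sleep(delay)  # politeness pause before each simulated request
    return url


def crawl_batch(urls: list[str], max_workers: int, delay: float) -> list[str]:
    """Process URLs on a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_with_delay(u, delay), urls))


urls = ["https://github.com/a", "https://github.com/b", "https://github.com/c"]
print(crawl_batch(urls, max_workers=2, delay=0.01))
```

In a scheme like this, more workers shorten the total run time while the delay limits how quickly each worker issues requests, so the two settings are usually tuned together.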
Crawled output sample for https://github.com
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
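Once a result like this is available as a Python dictionary, it can be summarized with a few lines of code. The sketch below assumes the result maps each crawled page to an object holding a `urls` list; `count_links` is a hypothetical helper, and the sample data is trimmed from the output shown here.

```python
def count_links(crawl_result: dict) -> dict:
    """Map each crawled page to the number of links found on it."""
    return {
        page: len(data.get("urls", []))
        for page, data in crawl_result.items()
        if isinstance(data, dict)  # ignore anything that is not a page entry
    }


# Sample data shaped like the crawl output (assumed structure).
sample = {
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    },
}

for page, total in count_links(sample).items():
    print(f"{page}: {total} links")
```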
Thank you for considering contributing.
- If you are a first-time contributor, you can pick a good-first-issue and get started.
- Please feel free to ask questions.
- Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
- We are working toward our first major release. Please check this issue and see if anything interests you.
- Install poetry on your system
pipx install poetry
- Clone the repo you forked
- Create a venv or use
poetry shell
- Run
poetry install --with dev
pre-commit install
pre-commit install --hook-type pre-push
- An issue exists or has been created that the PR addresses
- Tests are written for the changes
- All lint checks and tests pass