The crawled pages and the pending crawl pages are saved to MongoDB by default. If you want to use another database, you can implement your own class that inherits from `DbBase` (I use MongoDB by default, so the `MongoDbQueue` class is my implementation). Once you have your own `DbBase` subclass, set its name in the `DB_CLASS_NAME` variable in `project_settings.py`, and also provide a `DB_CONNECTION_STRING` and `DB_REPOSITORY_NAME` so the crawler can connect to your database.
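For illustration, here is a minimal sketch of what a custom queue backed by SQLite could look like. The exact interface `DbBase` requires is not documented here, so the `push`/`pop` method names, the constructor arguments, and the `db_base` import path are all assumptions; check the real `DbBase` class in this repository for the actual signatures.

```python
import sqlite3

from db_base import DbBase  # hypothetical import path; adjust to the real module


class SqliteDbQueue(DbBase):
    """A sketch of a crawl queue stored in SQLite instead of MongoDB."""

    def __init__(self, connection_string, repository_name):
        # DB_CONNECTION_STRING is reused as the SQLite file path in this example,
        # and DB_REPOSITORY_NAME as the table name.
        self.conn = sqlite3.connect(connection_string)
        self.table = repository_name
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} (url TEXT PRIMARY KEY)"
        )

    def push(self, url):
        # INSERT OR IGNORE skips duplicates, so a link is never queued twice.
        self.conn.execute(
            f"INSERT OR IGNORE INTO {self.table} (url) VALUES (?)", (url,)
        )
        self.conn.commit()

    def pop(self):
        # Take one pending URL off the queue, or return None when it is empty.
        row = self.conn.execute(f"SELECT url FROM {self.table} LIMIT 1").fetchone()
        if row is None:
            return None
        self.conn.execute(f"DELETE FROM {self.table} WHERE url = ?", (row[0],))
        self.conn.commit()
        return row[0]
```

With a class like that in place, the matching entries in `project_settings.py` might look like:

```python
DB_CLASS_NAME = "SqliteDbQueue"
DB_CONNECTION_STRING = "crawler.db"   # file path for SQLite in this example
DB_REPOSITORY_NAME = "pending_urls"
```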
In the current version, only links (`<a>` tags in the HTML) are saved to the database. If you want to customize the behavior of the spiders, you can also implement your own `HTMLParser` class to replace the `HtmlResolver` in my code. Don't forget to set your own parser's name in `HTML_RESOLVER_NAME` in `project_settings.py`.
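As an example, here is a sketch of a custom parser that collects image sources in addition to links. It builds on Python's standard `html.parser.HTMLParser`; whether `HtmlResolver` uses the same base class, and which methods the spiders actually call on it, are assumptions here, so adapt the class to match `HtmlResolver`'s real interface.

```python
from html.parser import HTMLParser


class ImageAwareResolver(HTMLParser):
    """Collects <a> hrefs like the default resolver, plus <img> srcs.

    The attribute names (links, images) are illustrative; match whatever
    interface the spiders expect from HtmlResolver.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs from html.parser.
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
```

You would then point the crawler at it with `HTML_RESOLVER_NAME = "ImageAwareResolver"` in `project_settings.py`.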
I started learning Python only a week ago, so I may have made some silly mistakes and left a few bugs. I would really appreciate any help improving my code. Thanks!