How to configure the PyCharm Debugger for Scrapy
- Run your currently selected crawler via PyCharm (`Shift + F10`) to have a baseline of settings that you can modify.
- Open the configuration via `Main Menu -> Run -> Edit Configurations`.
- Click on "Script Path" and select "Module name" from the drop-down menu:
  - Module name: `scrapy.cmdline` (a launcher-script alternative is sketched below)
  - Parameters: `runspider <crawler_name.py>` or `crawl <spider_name>`
    - for example: `runspider serlo_spider.py` or `crawl serlo_spider`
- Your "Working directory" should be set to the `/spiders/` folder
  - for example: `/home/<your_username>/PycharmProjects/oeh-search-etl/converter/spiders`
- You should now be able to use PyCharm's debugger by pressing `Shift + F9`.
After you're done, double-check that your debugger configuration matches the settings above.
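As an alternative to the run configuration, you can debug through a small launcher script that calls `scrapy.cmdline.execute` directly. Here is a minimal sketch; the filename `debug_runner.py` and the spider name `serlo_spider` are assumptions, so adjust them to your crawler:

```python
# debug_runner.py -- minimal debug launcher sketch; the filename and the
# spider name ("serlo_spider") are assumptions, adjust them to your crawler.
from scrapy.cmdline import execute

if __name__ == "__main__":
    # Equivalent to running "scrapy crawl serlo_spider" on the CLI:
    # set breakpoints in your spider, then debug this file (Shift + F9).
    execute(["scrapy", "crawl", "serlo_spider"])
```

Since `scrapy.cmdline.execute` behaves like invoking the `scrapy` CLI, the working directory still has to be inside the Scrapy project so that `scrapy.cfg` can be found.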
You can customize your debug run with additional CLI parameters and options, e.g. by dumping the collected `scrapy.Item`s into a `.json` file or saving a new logfile with each run of your debugger.
If you want to see a JSON dump of all scraped items, you can use the output options (`-o` / `-O`) in your parameters, e.g.:

```
scrapy crawl <spider_name> -O "spidername.json"
```
Reminder:
- `-O` overwrites the `.json` file with each new crawl
- `-o` appends to it
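If you don't want to pass the output flag on every run, the same behavior can be configured through Scrapy's `FEEDS` setting instead. A minimal sketch, assuming Scrapy >= 2.4 (the `overwrite` feed option is not available in older versions) and a hypothetical output filename:

```python
# settings.py -- feed export sketch; the output filename is an assumption.
FEEDS = {
    "serlo_spider.json": {
        "format": "json",    # serialize all scraped items as one JSON array
        "overwrite": True,   # True behaves like -O, False behaves like -o
        "encoding": "utf-8",
    },
}
```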
By default, Scrapy appends the log of the current run to the `scrapy.log` file in the project root folder. If you want to customize the filename or where it's saved, you can add

```
scrapy crawl <spider_name> --logfile "./logs/filename.log"
```

to the debug configuration parameters.
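The logfile location can also be pinned down in code via Scrapy's logging settings, so you don't have to repeat the flag in every run configuration. A minimal sketch; the path and level shown are assumptions:

```python
# settings.py -- logging sketch; filename, path and level are assumptions.
LOG_FILE = "logs/serlo_spider.log"  # same effect as --logfile
LOG_LEVEL = "DEBUG"                 # same effect as --loglevel / -L
LOG_FILE_APPEND = False             # start a fresh logfile per run (Scrapy >= 2.6)
```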
If you want to use both options at the same time, here's a short TL;DR version that you can copy-paste and customize to your needs. You can use either the `runspider` or the `crawl` command:

```
runspider <spider_name.py> -O "spider_name.json" --logfile "spider_name.log"
```

or:

```
crawl <spider_name> -O "spider_name.json" --logfile "spider_name.log"
```
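The same combination works in the launcher-script form sketched earlier; all filenames here are assumptions:

```python
# debug_runner.py -- combined sketch: JSON dump plus logfile in one debug run.
from scrapy.cmdline import execute

if __name__ == "__main__":
    execute([
        "scrapy", "crawl", "serlo_spider",
        "-O", "serlo_spider.json",        # overwrite the JSON dump on each crawl
        "--logfile", "serlo_spider.log",  # write the log to a fixed file
    ])
```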