Crawler downloads pdfs to project root directory #6829
Labels
1.x
Contributions wanted!
Looking for external contributions
P3
Low priority, leave it in the backlog
type:bug
Something isn't working
Milestone
Describe the bug
crawler = Crawler(output_dir="crawled_files") works OK. The defaults are a bit odd (extract_hidden_text=True, really?), but the bigger problem is that the crawler also follows links to PDFs and downloads them, and the downloaded files are not placed in output_dir. They land in the project root instead. I believe the underlying Selenium driver is just applying its default download behavior, and the PDF-link corner case isn't handled.
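As a possible stopgap until PDF links are handled properly, the crawl could skip them. Below is a minimal sketch of a URL filter, not a fix in the Crawler itself. The names NON_PDF_URL and is_crawlable are hypothetical and not part of Haystack. If the Crawler version in use accepts a list of URL regexes to restrict which links are followed (I believe filter_urls does this, but please verify against your version), the pattern string might be usable there as well.

```python
import re

# Hypothetical helper pattern (not part of Haystack).
# Matches any URL that does NOT end in ".pdf"; a query string or
# fragment after ".pdf" still counts as a PDF link. Case-insensitive.
NON_PDF_URL = r"(?i)^(?!.*\.pdf(?:[?#]|$)).*$"


def is_crawlable(url: str) -> bool:
    """Return True when the URL does not look like a link to a PDF file."""
    return re.search(NON_PDF_URL, url) is not None


if __name__ == "__main__":
    candidates = [
        "https://www.aig.com.hk/personal/travel-insurance",
        "https://example.com/brochures/policy.pdf",  # made-up URL for illustration
    ]
    # Keep only the links that are safe to hand to the crawler.
    print([u for u in candidates if is_crawlable(u)])
```

This only checks the URL text, so a link that serves a PDF without a .pdf suffix would still get through.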
Error message
No error is raised, but if you crawl sites that link to PDF brochures, you will see PDFs pile up in your project root folder over time.
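For anyone hitting this in the meantime, here is a rough cleanup sketch that could be run after a crawl. It assumes the stray files sit directly in the project root and end in .pdf; move_stray_pdfs is a hypothetical helper, not part of Haystack, and it uses only the standard library.

```python
import shutil
from pathlib import Path
from typing import List


def move_stray_pdfs(project_root: str, output_dir: str) -> List[Path]:
    """Move *.pdf files found directly in project_root into output_dir.

    Hypothetical cleanup helper. Returns the new paths of the moved files.
    """
    root = Path(project_root)
    target = Path(output_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved: List[Path] = []
    for pdf in sorted(root.glob("*.pdf")):
        if not pdf.is_file():
            continue
        destination = target / pdf.name
        if destination.exists():
            # Leave the file where it is rather than overwrite an existing one.
            continue
        shutil.move(str(pdf), str(destination))
        moved.append(destination)
    return moved


if __name__ == "__main__":
    for path in move_stray_pdfs(".", "crawled_files"):
        print(f"moved {path}")
```

Note that this moves every PDF in the root folder, including ones that were there before the crawl, so it is only safe when the root normally contains none.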
Expected behavior
Ideally, there would be a way to handle the PDFs: convert them to documents and let the user specify how they are converted.
Additional context
The default extract_hidden_text=True doesn't make sense: it usually extracts JavaScript code, which is not what you want in your documents.
To Reproduce
Steps to reproduce the behavior
Just crawl the following URLs with the default settings; you will see a bunch of PDFs appear in your project root folder.
travel_insurance_urls = [
    "https://www.hsbc.com.hk/insurance/products/travel",
    "https://www.aig.com.hk/personal/travel-insurance",
    "https://www.zurich.com.hk/en/products/travel",
    "https://www.bluecross.com.hk/en/Travel-Smart/Information",
    "https://www.moneysmart.hk/en/travel-insurance",
    "https://www.moneyhero.com.hk/en/travel-insurance?psCollapse=true",
]
FAQ Check
System: