
Crawler downloads pdfs to project root directory #6829

Closed
1 task done
augchan42 opened this issue Jan 26, 2024 · 2 comments
Labels
  • 1.x
  • Contributions wanted! — Looking for external contributions
  • P3 — Low priority, leave it in the backlog
  • type:bug — Something isn't working
Milestone

Comments

@augchan42
Contributor

augchan42 commented Jan 26, 2024

Describe the bug
crawler = Crawler(output_dir="crawled_files") works OK. The defaults are a bit screwy (hidden_text=True, really?), but the bigger problem is that the crawler also follows links to PDFs and downloads them, and those files are not placed in the output_dir. I believe the underlying Selenium driver is just doing its thing, and the PDF-link corner case isn't handled.

Error message
No error is raised, but if you crawl for a while you will invariably find a bunch of PDFs in your project root folder whenever the crawled sites link to PDF brochures.
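As an interim workaround (not part of Haystack; a minimal sketch assuming the stray PDFs land in the current working directory), the files can be swept into the intended output directory after crawling:

```python
import shutil
from pathlib import Path

def sweep_pdfs(output_dir: str, source_dir: str = ".") -> list:
    """Move PDFs that the Selenium driver dropped into source_dir
    (typically the project root) over to the crawler's output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        target = out / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved
```

Called right after crawler.crawl(...), this at least keeps the project root clean; the moved files could then be fed to a PDF converter such as Haystack's PDFToTextConverter.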

Expected behavior
Ideally, there would be a way to handle the PDFs, convert them to documents, and specify how you want them converted.

Additional context
The default extract_hidden_text=True doesn't make sense: it usually extracts JavaScript code, which is not what you want in your documents.

To Reproduce
Steps to reproduce the behavior
Just crawl the following URLs with the defaults; you will see a bunch of PDFs appear in your project root folder:

travel_insurance_urls = [
    "https://www.hsbc.com.hk/insurance/products/travel",
    "https://www.aig.com.hk/personal/travel-insurance",
    "https://www.zurich.com.hk/en/products/travel",
    "https://www.bluecross.com.hk/en/Travel-Smart/Information",
    "https://www.moneysmart.hk/en/travel-insurance",
    "https://www.moneyhero.com.hk/en/travel-insurance?psCollapse=true",
]

FAQ Check

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number): 1.24
  • Crawler - from haystack.nodes.connector import Crawler
@masci masci added the 1.x label Feb 5, 2024
@anakin87
Member

anakin87 commented Feb 5, 2024

Minimal reproducible example

# pip install farm-haystack[crawler]

from haystack.nodes import Crawler

crawler = Crawler(output_dir="crawled_files")
docs = crawler.crawl(urls=["https://www.hsbc.com.hk/insurance/products/travel"])

Some PDFs are created in the working directory.

Solutions

Making sure that PDF files are created in the output_dir: this involves investigating how Selenium handles file downloads; it shouldn't be much of an effort.
@augchan42 if you want to do this, feel free to open a PR.
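A sketch of what that fix might look like (an assumption, not the actual patch: the Chrome preference names below are standard Chromium download preferences, but where they would be wired into Haystack's Crawler is up to whoever opens the PR):

```python
import os

def chrome_download_prefs(output_dir: str) -> dict:
    """Build the Chrome 'prefs' dict that redirects downloads into
    output_dir and forces PDFs to be downloaded rather than opened
    in the built-in viewer."""
    return {
        "download.default_directory": os.path.abspath(output_dir),
        "download.prompt_for_download": False,
        "plugins.always_open_pdf_externally": True,
    }

# With Selenium, this would be applied roughly as:
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", chrome_download_prefs(output_dir))
```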

Enhancements

Handle the PDFs, convert them to documents, and let the user specify how they should be converted.

I would not prioritize this: it's a big change.

@anakin87 anakin87 added type:bug Something isn't working Contributions wanted! Looking for external contributions labels Feb 6, 2024
@masci masci added the P3 Low priority, leave it in the backlog label Feb 16, 2024
@masci masci added this to the 1.x-LTS milestone Feb 23, 2024
@anakin87
Member

Fixed in #7335.

@masci masci modified the milestones: 1.x-LTS, 1.26.0 Jun 3, 2024
3 participants