
Crawler downloads pdfs to project root directory #6829

Closed
1 task done
augchan42 opened this issue Jan 26, 2024 · 2 comments
Labels
  • 1.x
  • Contributions wanted! — Looking for external contributions
  • P3 — Low priority, leave it in the backlog
  • type:bug — Something isn't working
Milestone

Comments

@augchan42
Contributor

augchan42 commented Jan 26, 2024

Describe the bug
crawler = Crawler(output_dir="crawled_files") works OK. The defaults are a bit screwy (hidden_text=True, really?), but the bigger problem is that the crawler also follows links to PDFs and downloads them, and those files are not placed in the output_dir. I believe the underlying Selenium driver is just doing its thing, and the PDF-link corner case isn't handled.

Error message
No error is raised, but if you crawl for a while you will invariably find a bunch of PDFs in your project root folder whenever the crawled sites link to PDF brochures.
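As an interim workaround (not part of Haystack; a minimal sketch assuming the stray PDFs land in the current working directory), the files can be swept into the intended output directory after crawling:

```python
import shutil
from pathlib import Path

def sweep_pdfs(output_dir: str, source_dir: str = ".") -> list:
    """Move PDFs that the Selenium driver dropped into source_dir
    (typically the project root) over to the crawler's output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        target = out / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved
```

Called right after crawler.crawl(...), this at least keeps the project root clean; the moved files could then be fed to a PDF converter such as Haystack's PDFToTextConverter.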

Expected behavior
Ideally, there would be a way to handle the PDFs, convert them to documents, and specify how you want them converted.

Additional context
The default extract_hidden_text=True doesn't make sense: it usually extracts JavaScript code, which is not what you want in your documents.

To Reproduce
Steps to reproduce the behavior
Just crawl the following URLs with the defaults; you will see a bunch of PDFs appear in your project root folder:

travel_insurance_urls = [
    "https://www.hsbc.com.hk/insurance/products/travel",
    "https://www.aig.com.hk/personal/travel-insurance",
    "https://www.zurich.com.hk/en/products/travel",
    "https://www.bluecross.com.hk/en/Travel-Smart/Information",
    "https://www.moneysmart.hk/en/travel-insurance",
    "https://www.moneyhero.com.hk/en/travel-insurance?psCollapse=true",
]

FAQ Check

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number): 1.24
  • Crawler - from haystack.nodes.connector import Crawler
@masci masci added the 1.x label Feb 5, 2024
@anakin87
Member

anakin87 commented Feb 5, 2024

Minimal reproducible example

# pip install farm-haystack[crawler]

from haystack.nodes import Crawler

crawler = Crawler(output_dir="crawled_files")
docs = crawler.crawl(urls=["https://www.hsbc.com.hk/insurance/products/travel"])

Some PDFs are created in the working directory.

Solutions

Making sure that PDF files are created in the output_dir: this involves investigating how Selenium handles file downloads; it shouldn't be much of an effort.
@augchan42 if you want to do this, feel free to open a PR.
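A sketch of what that fix might look like (an assumption, not the actual patch: the Chrome preference names below are standard Chromium download preferences, but where they would be wired into Haystack's Crawler is up to whoever opens the PR):

```python
import os

def chrome_download_prefs(output_dir: str) -> dict:
    """Build the Chrome 'prefs' dict that redirects downloads into
    output_dir and forces PDFs to be downloaded rather than opened
    in the built-in viewer."""
    return {
        "download.default_directory": os.path.abspath(output_dir),
        "download.prompt_for_download": False,
        "plugins.always_open_pdf_externally": True,
    }

# With Selenium, this would be applied roughly as:
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", chrome_download_prefs(output_dir))
```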

Enhancements

Handle the PDFs, convert them to documents, and let the user specify how they should be converted.

I would not prioritize this: it's a big change.

@anakin87 anakin87 added type:bug Something isn't working Contributions wanted! Looking for external contributions labels Feb 6, 2024
@masci masci added the P3 Low priority, leave it in the backlog label Feb 16, 2024
@masci masci added this to the 1.x-LTS milestone Feb 23, 2024
@anakin87
Member

Fixed in #7335.

@masci masci modified the milestones: 1.x-LTS, 1.26.0 Jun 3, 2024
3 participants