bug: Crawler downloads pdfs to project root directory #7335

mohitlal31 · 2024-03-07T21:27:25Z

Related Issues

fixes Crawler downloads pdfs to project root directory #6829

Proposed Changes:

When the URL of a .pdf is passed to Selenium's webdriver.get() method, the browser automatically downloads the file to the current working directory. This issue happens here. For .html pages, by contrast, the crawler extracts the content itself and saves it to the provided output directory.
By default, we use Selenium's Chrome webdriver, which supports additional preferences for setting the download location of a file. These preferences are defined here.
When a crawler object is initialized with an output directory, we use that directory as the download location. Otherwise, the .pdf is downloaded to the current working directory, which is the current behaviour.
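
For reviewers less familiar with Chrome's download preferences, here is a minimal sketch of the idea, not the exact code in this PR. The helper names build_download_prefs and make_chrome_options are hypothetical, and resolving the path to an absolute one is my assumption about what Chrome expects. The "prefs" experimental option and the "download.default_directory" key are standard Chrome/Selenium settings.

```python
from pathlib import Path
from typing import Dict, Optional, Union


def build_download_prefs(output_dir: Optional[Union[str, Path]]) -> Dict[str, str]:
    """Hypothetical helper: build Chrome preferences for the download location.

    Returns an empty dict when no output directory is given, so Chrome keeps
    its default behaviour (downloading to the current working directory).
    """
    if output_dir is None:
        return {}
    # Assumption: an absolute path is the safest value to hand to Chrome.
    return {"download.default_directory": str(Path(output_dir).resolve())}


def make_chrome_options(output_dir: Optional[Union[str, Path]] = None):
    """Hypothetical helper: attach the preferences to Selenium Chrome options."""
    # Imported lazily so the prefs helper above can be used without Selenium.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    prefs = build_download_prefs(output_dir)
    if prefs:
        # "prefs" is Chrome's experimental option for user preferences.
        options.add_experimental_option("prefs", prefs)
    return options
```

The options object returned by make_chrome_options("output_files") would then be passed to webdriver.Chrome(options=...), so that a later webdriver.get() on a .pdf URL saves the file under output_files instead of the project root.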

How did you test it?

Added a unit test
Tested it using the following script

from haystack.nodes import Crawler

crawler = Crawler(output_dir="output_files")
docs = crawler.crawl(urls=["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"])

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

anakin87 · 2024-03-11T08:38:17Z

I tried your code and it works great!

Can you please add a release note, as described in the contributors guidelines?

anakin87

Thanks for your contribution! 🙏

bug: Crawler downloads pdfs to project root directory

49b52c7

mohitlal31 requested a review from a team as a code owner March 7, 2024 21:27

mohitlal31 requested review from davidsbatista and removed request for a team March 7, 2024 21:27

github-actions bot added topic:tests topic:crawler labels Mar 7, 2024

anakin87 self-requested a review March 11, 2024 08:31

Added a release note

d66bf8e

mohitlal31 requested a review from a team as a code owner March 11, 2024 15:56

mohitlal31 requested review from dfokina and removed request for a team March 11, 2024 15:56

improve release note

ea60c32

anakin87 approved these changes Mar 11, 2024

anakin87 merged commit 936e293 into deepset-ai:v1.x Mar 11, 2024
14 checks passed

anakin87 mentioned this pull request Mar 11, 2024

Crawler downloads pdfs to project root directory #6829

Closed

1 task

mohitlal31 deleted the bug_Crawler_downloads_pdfs_to_project_root_directory branch March 11, 2024 17:04

mohitlal31 mentioned this pull request Mar 12, 2024

Support for Crawler in Haystack 2.x #6609

Open
