bug: Crawler downloads pdfs to project root directory #7335

mohitlal31
Contributor

Related Issues

Proposed Changes:

  • When a .pdf URL is passed to Selenium's webdriver.get() method, the browser automatically downloads the file to the current working directory. For .html pages, by contrast, the crawler extracts the content and saves it to the provided output directory itself.
  • By default, we use Selenium's Chromium webdriver, which accepts additional preferences for setting the download location of a file.
  • When a Crawler object is initialized with an output directory, we use that directory as the download location. Otherwise, the .pdf is downloaded to the current working directory, which is the existing behaviour.
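
The approach in the bullets above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `build_download_prefs` is a hypothetical helper name, and the preference keys shown are the commonly used Chrome download preferences, assumed here rather than taken from the diff.

```python
from pathlib import Path
from typing import Dict, Optional, Union


def build_download_prefs(output_dir: Optional[Union[str, Path]] = None) -> Dict[str, object]:
    """Hypothetical helper: Chrome preferences that redirect downloads."""
    if output_dir is None:
        # No output directory given: set nothing, so the browser keeps
        # its default download location.
        return {}
    return {
        # Chrome expects an absolute path for the download directory.
        "download.default_directory": str(Path(output_dir).resolve()),
        # Save files without showing a "Save as" dialog.
        "download.prompt_for_download": False,
        # Download PDFs instead of opening them in the built-in viewer.
        "plugins.always_open_pdf_externally": True,
    }


prefs = build_download_prefs("output_files")

# With Selenium installed, the preferences would be attached roughly like this:
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", prefs)
#   driver = webdriver.Chrome(options=options)
```

Returning an empty dict when no output directory is given leaves the webdriver configuration untouched, which matches the fallback described in the last bullet.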

How did you test it?

  • Added a unit test
  • Tested it using the following script
from haystack.nodes import Crawler

crawler = Crawler(output_dir="output_files")
docs = crawler.crawl(urls=["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"])

Notes for the reviewer

Checklist

@mohitlal31 mohitlal31 requested a review from a team as a code owner March 7, 2024 21:27
@mohitlal31 mohitlal31 requested review from davidsbatista and removed request for a team March 7, 2024 21:27
@anakin87 anakin87 self-requested a review March 11, 2024 08:31
@anakin87
Member

I tried your code and it works great!

Can you please add a release note, as described in the contributors guidelines?

@mohitlal31 mohitlal31 requested a review from a team as a code owner March 11, 2024 15:56
@mohitlal31 mohitlal31 requested review from dfokina and removed request for a team March 11, 2024 15:56
@anakin87 anakin87 left a comment

Thanks for your contribution! 🙏

@anakin87 anakin87 merged commit 936e293 into deepset-ai:v1.x Mar 11, 2024
14 checks passed
@mohitlal31 mohitlal31 deleted the bug_Crawler_downloads_pdfs_to_project_root_directory branch March 11, 2024 17:04
2 participants