Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: fallback on core async atransform_documents method for MarkdownifyTransformer #27866

Merged
merged 14 commits into from
Dec 13, 2024

Conversation

rparkr
Copy link
Contributor

@rparkr rparkr commented Nov 3, 2024

Description

Implements the atransform_documents method for MarkdownifyTransformer using the asyncio built-in library for concurrency.

Note that this is mainly for API completeness when working with async frameworks rather than for performance, since the markdownify function is not I/O bound because it works with Document objects already in memory.

Issue

Fixes #27865

Dependencies

No new dependencies added, but markdownify is required since this PR updates the markdownify integration.

Tests and docs

  • Tests added
  • I did not modify the docstrings since they already described the basic functionality, and the API docs also already included a description. If it would be helpful, I would be happy to update the docstrings and/or the API docs.

Lint and test

  • format
  • lint
  • test

I ran formatting with make format, linting with make lint, and confirmed that tests pass using make test. Note that some unit tests pass in CI but may fail when running make_test. Those unit tests are:

  • test_extract_html (and test_extract_html_async)
  • test_strip_tags (and test_strip_tags_async)
  • test_convert_tags (and test_convert_tags_async)

The reason for the difference is that there are trailing spaces when the tests are run in the CI checks, and no trailing spaces when run with make test. I ensured that the tests pass in CI, but they may fail with make test due to the addition of trailing spaces.

- Add asynchronous method for transforming a single document (`_atransform_document`)
- Implement the asynchronous method for transforming a list of documents using async.gather
- Remove extra function definition; replace with asyncio.create_task()
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Nov 3, 2024
Copy link

vercel bot commented Nov 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Dec 13, 2024 10:30pm

@dosubot dosubot bot added the community Related to langchain-community label Nov 3, 2024
- Split the list comprehension across lines for better readability
@rparkr rparkr marked this pull request as draft November 3, 2024 11:57
- Fix synchronous unit tests for MarkdownifyTransformer
- Add asynchronous unit tests for MarkdownifyTransformer (duplicates of the synchronous tests)
- The unit tests passed in the dev container which was created in GitHub Codespaces using the devcontainer.json file from the Langchain repo
- However, the unit tests were failing in CI
- I updated the failed unit tests and will continue to debug using the CI checks
- A few unit tests continued to fail in CI, although they passed with `make test` in the dev container environment
- I updated the unit tests to fix the CI failures
- Fix more unit tests failures related to trailing spaces
@rparkr rparkr marked this pull request as ready for review December 4, 2024 21:24
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 4, 2024
Copy link
Member

@efriis efriis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because this isn't for performance on a cpu bound operation, I'll probably just delete the NotImplementedError stub and fallback on the default atransform_documents behavior for api completeness!

@efriis efriis changed the title community: implement the atransform_documents method for MarkdownifyTransformer community: fallback on core async atransform_documents method for MarkdownifyTransformer Dec 13, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Dec 13, 2024
@efriis efriis enabled auto-merge (squash) December 13, 2024 22:30
@efriis efriis self-assigned this Dec 13, 2024
@efriis efriis merged commit 12111cb into langchain-ai:master Dec 13, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Archived in project
2 participants