community: Add @mozilla/readability document transformer #27604

Closed

@CNSeniorious000 CNSeniorious000 commented Oct 24, 2024

Description

langchain-js already has a useful document transformer that uses @mozilla/readability to heuristically extract the main content of a web page. [docs] [source]

This PR introduces a new ReadabilityTransformer class to langchain_community/document_transformers, which leverages the python-readability library to do the same thing.

Dependencies:

python-readability: a standalone Python wrapper for @mozilla/readability

Note that no Node.js environment is needed. In regular CPython distributions, python-readability requires PythonMonkey to interpret JavaScript, and in Pyodide it uses the native JavaScript environment. So this package is available even if the user deploys LangChain apps on Cloudflare Workers.
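
For readers who want a feel for the shape of such a transformer, here is a minimal sketch rather than the code in this PR. Everything in it is an assumption for illustration: `Document` is a stand-in dataclass for the LangChain document type, `ReadabilityTransformerSketch` and `naive_extract` are hypothetical names, and the extractor simply keeps `<p>` text using the standard library instead of calling python-readability or the real Readability heuristics.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Any, Callable, Iterable


@dataclass
class Document:
    """Stand-in for the LangChain Document type (assumption, not the real class)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class _ParagraphText(HTMLParser):
    """Placeholder extractor: collects the text found inside <p> elements only."""

    def __init__(self) -> None:
        super().__init__()
        self._in_p = False
        self._buf: list[str] = []
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace so the output is clean prose.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def naive_extract(html: str) -> str:
    """Hypothetical stand-in for a Readability call; NOT the real heuristics."""
    parser = _ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


class ReadabilityTransformerSketch:
    """Maps raw-HTML documents to main-content documents, keeping metadata."""

    def __init__(self, extract: Callable[[str], str] = naive_extract) -> None:
        # The extractor is injected so a real Readability wrapper could be swapped in.
        self._extract = extract

    def transform_documents(self, documents: Iterable[Document]) -> list[Document]:
        # Build new documents instead of mutating the inputs.
        return [
            Document(
                page_content=self._extract(doc.page_content),
                metadata=dict(doc.metadata),
            )
            for doc in documents
        ]
```

Injecting the extractor as a callable keeps the sketch independent of any particular wrapper API; in practice the callable would delegate to the Readability wrapper instead of `naive_extract`.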

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 24, 2024
@dosubot dosubot bot added the community Related to langchain-community label Oct 24, 2024
@CNSeniorious000 CNSeniorious000 marked this pull request as draft October 24, 2024 03:29
@CNSeniorious000 CNSeniorious000 force-pushed the readability branch 2 times, most recently from e86d5d2 to 9700c11 Compare October 24, 2024 03:33
@CNSeniorious000 CNSeniorious000 marked this pull request as ready for review October 24, 2024 03:36
@CNSeniorious000 CNSeniorious000 force-pushed the readability branch 4 times, most recently from d5bb2de to f9b2004 Compare December 7, 2024 10:30
Collaborator

@ccurme ccurme left a comment

Thanks for this @CNSeniorious000.

It's unclear to me that the demand for this is high enough to justify the additional maintenance burden.

Would you be interested in publishing an OSS integration package (e.g., langchain-readability or similar)? We've written a walkthrough on this process here:

https://python.langchain.com/docs/contributing/how_to/integrations/

We are encouraging contributors of LangChain integrations to go this route. This way we don't have to be in the loop for reviews, you're able to properly integration test the package, and you have control over versioning.

Docs would continue to be maintained in the langchain repo.

Let me know what you think!

@ccurme ccurme closed this Dec 18, 2024