Add open-source text extraction libraries #293

garrethlee · 2024-09-27T01:01:54Z

Description

Refactored extraction logic to separate HTML cleaning and text extraction into distinct steps. This allows chaining the cleaning step from one library with the extraction step from another, enhancing flexibility and interoperability.

Context

Most extractors follow a two-step process:
1. Clean raw HTML into a sanitized representation (usually a stripped-down version of the HTML).
2. Convert the cleaned HTML to plaintext.
Readability, for example, only provides an HTML cleaning method and has no built-in plaintext conversion. To handle such cases, we now support chaining steps across libraries (e.g., clean_html from one library and extract from another).
Direct use of a single library, such as Trafilatura, is unaffected: its extract function still works on its own, while its clean_html is reserved for interoperability scenarios, such as pairing it with inscriptis.

Thus, each Extractor now exposes the two phases above as separate methods: clean_html and extract.
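The chaining idea can be illustrated with a minimal, standard-library-only sketch. Everything here is hypothetical and for illustration only: `TwoPhaseExtractor`, `ScriptStrippingCleaner`, `TagStrippingExtractor`, and `chain` are made-up names, not classes from this PR, and the toy cleaner and extractor merely stand in for real libraries such as Readability or inscriptis.

```python
import re
from html.parser import HTMLParser


class TwoPhaseExtractor:
    """Hypothetical interface: clean_html sanitizes markup, extract returns plaintext."""

    def clean_html(self, html: str) -> str:
        raise NotImplementedError

    def extract(self, html: str) -> str:
        raise NotImplementedError


class ScriptStrippingCleaner(TwoPhaseExtractor):
    """Toy stand-in for a cleaning-only library: removes <script> and <style> blocks."""

    _NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)

    def clean_html(self, html: str) -> str:
        return self._NOISE.sub("", html)


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


class TagStrippingExtractor(TwoPhaseExtractor):
    """Toy stand-in for a plaintext converter: keeps text nodes, drops tags."""

    def clean_html(self, html: str) -> str:
        return html  # this toy extractor does no cleaning of its own

    def extract(self, html: str) -> str:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return "\n".join(collector.chunks)


def chain(cleaner: TwoPhaseExtractor, extractor: TwoPhaseExtractor, html: str) -> str:
    """Run the cleaning step of one extractor, then the extraction step of another."""
    return extractor.extract(cleaner.clean_html(html))


html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><script>var x=1;</script><h1>Title</h1>"
    "<p>Hello <b>world</b></p></body></html>"
)
chained = chain(ScriptStrippingCleaner(), TagStrippingExtractor(), html)
unchained = TagStrippingExtractor().extract(html)
```

In this sketch, `unchained` still contains the script and style contents, while `chained` contains only the visible text, which is the kind of benefit the cross-library pairing is meant to provide.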

Changes

Added clean_html as a standalone method in extractors.
Refactored the logic in applicable extractors to separate the cleaning and extraction steps.
Integrated new text extraction libraries (readabilipy, readability, resiliparse) to extend functionality and improve coverage.

… initialization
- Added a default `clean_html` method to the `BaseExtractor` class, providing a warning for extractors that do not implement their own.
- Implemented specific `clean_html` methods in `Inscriptis`, `Justext`, `ReadabiliPy`, `Readability`, and `Trafilatura` extractors to handle HTML cleaning.
- Updated the `Inscriptis` extractor to accept a preprocessor during initialization.
- Modified the `extract` methods in `ReadabiliPy` and `Readability` to utilize the new `clean_html` method.
- Adjusted the `Justext` extractor to remove the default English language parameter from `get_stoplist`.
- Updated tests to reflect changes in extractor initialization and functionality.
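A rough sketch of how a default `clean_html` with a warning might look, with the warning emitted once at construction rather than on every call (in line with the later commit that moves the warning log to the constructor). This is an assumption-laden illustration, not the PR's code: `BaseExtractorSketch`, `NoCleaner`, `WithCleaner`, the logger name, the message text, and the pass-through default are all hypothetical.

```python
import logging

logger = logging.getLogger("extractor_sketch")  # hypothetical logger name


class BaseExtractorSketch:
    """Hypothetical base class with a default, pass-through clean_html."""

    def __init__(self):
        # Warn once per instance at construction time. Warning inside
        # clean_html would log once per document and bloat the log files.
        if type(self).clean_html is BaseExtractorSketch.clean_html:
            logger.warning(
                "%s does not implement clean_html; HTML is passed through unchanged",
                type(self).__name__,
            )

    def clean_html(self, html: str) -> str:
        # Assumed default behaviour: return the input untouched.
        return html


class NoCleaner(BaseExtractorSketch):
    """Relies on the default clean_html, so constructing it logs a warning."""


class WithCleaner(BaseExtractorSketch):
    """Overrides clean_html, so constructing it logs nothing."""

    def clean_html(self, html: str) -> str:
        return html.strip()
```

Checking `type(self).clean_html` against the base attribute is one simple way to detect a missing override; the actual implementation may detect or report this differently.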


garrethlee and others added 27 commits September 24, 2024 17:06

feat: add justext

d49463b

fix: remove justext cli comment

9fbc6b2

feat: add resiliparse

a6cce5d

feat: add inscriptis

add6807

feat: add readabilipy

e3a7285

feat: add readability

2a6ef15

feat: add require_readability to utils

84f1ed4

feat: add tests

c085736

feat: changed configs & pyproject

ea3a915

fix: move postprocessor to init

891850e

feat: add justext

b5cd839

fix: remove justext cli comment

8f5fc26

feat: add resiliparse

b3e7942

feat: add inscriptis

24d1594

feat: add readabilipy

be87f9a

feat: add readability

9f83073

feat: add require_readability to utils

c40db90

feat: add tests

b3912b2

feat: changed configs & pyproject

0eb2f2a

fix: move postprocessor to init

462ffc1

Merge branch 'main' into feat/text-extraction

e273e51

Merge remote-tracking branch 'refs/remotes/origin/feat/text-extraction' into feat/text-extraction

d0f6ead

refactor: move warning log to constructor to avoid ballooning log file sizes

26bf413

style: fixed lint errors

c9f1c2b

style: fixed ruff format errors

2ac81d5

garrethlee marked this pull request as ready for review December 21, 2024 23:50

remove additional brotlipy in pyproject

751ec13

garrethlee changed the title ~~Add several open-source text extraction libraries~~ Add open-source text extraction libraries Dec 21, 2024

garrethlee and others added 4 commits December 22, 2024 00:01

undo changes made to tokenizer (for number tokenization experiment)

0be58ae

delete modular extractor due to redundancy

56a71ed

improved test case robustness

cd18c59

nit

aae7e33
