Improve performance x100 for bigger pages #53

Merged: 3 commits, Dec 13, 2023


@Valian (Collaborator) commented Dec 6, 2023

This PR optimizes how individual candidates are found and scored.

I had problems parsing big pages, like this one

Previously, the process involved many calls to Floki.text(), which is quite slow. Parsing that page took more than 5 minutes on an M1 Pro. After my changes, it completes in 2s.

How it works:

  • I precompute text_length and the number of commas for each HTML node and store them in the node's attributes (a small hack, since these are not real node attributes, but they are removed again later, so it's fine)
  • I've added Helpers.text_length, Helpers.count_character and Helpers.find_tag, which either use these precomputed values or are faster than their Floki equivalents
  • I've replaced most calls to Floki with calls to Helpers
  • In build_article, I first precompute the attributes for the whole tree (this is fast because a parent's values are derived from its children's already cached values), then do the work as usual, and finally remove these attributes from the final result.
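
The steps above can be sketched roughly as follows. This is not the code from the PR, only a minimal illustration of the caching idea on Floki-style `{tag, attrs, children}` tuples, assuming attributes are a list of `{name, value}` pairs. The module name `PrecomputeSketch`, the `data-sketch-*` attribute names and the `annotate`/`strip` functions are hypothetical, and the text-length rule here (plain `String.length/1` over all text nodes) may differ from what the library's Helpers actually count.

```elixir
defmodule PrecomputeSketch do
  # Hypothetical attribute names used to cache per-node statistics.
  @len_attr "data-sketch-text-length"
  @comma_attr "data-sketch-commas"

  # Bottom-up pass: annotate the children first, then build the parent's
  # totals from the children's cached values instead of re-walking subtrees.
  def annotate({tag, attrs, children}) when is_list(attrs) and is_list(children) do
    children = Enum.map(children, &annotate/1)

    {len, commas} =
      Enum.reduce(children, {0, 0}, fn child, {l, c} ->
        {l + text_length(child), c + count_commas(child)}
      end)

    cached = [{@len_attr, Integer.to_string(len)}, {@comma_attr, Integer.to_string(commas)}]
    {tag, cached ++ attrs, children}
  end

  # Text nodes, comments and other node kinds are left untouched.
  def annotate(other), do: other

  # Uses the cached value when present, otherwise falls back to recursion.
  def text_length(text) when is_binary(text), do: String.length(text)

  def text_length({_tag, attrs, children}) when is_list(attrs) and is_list(children) do
    case List.keyfind(attrs, @len_attr, 0) do
      {_, value} -> String.to_integer(value)
      nil -> children |> Enum.map(&text_length/1) |> Enum.sum()
    end
  end

  def text_length(_other), do: 0

  def count_commas(text) when is_binary(text), do: length(String.split(text, ",")) - 1

  def count_commas({_tag, attrs, children}) when is_list(attrs) and is_list(children) do
    case List.keyfind(attrs, @comma_attr, 0) do
      {_, value} -> String.to_integer(value)
      nil -> children |> Enum.map(&count_commas/1) |> Enum.sum()
    end
  end

  def count_commas(_other), do: 0

  # Final pass: remove the helper attributes so they do not leak into the output.
  def strip({tag, attrs, children}) when is_list(attrs) and is_list(children) do
    attrs = Enum.reject(attrs, fn {name, _} -> name in [@len_attr, @comma_attr] end)
    {tag, attrs, Enum.map(children, &strip/1)}
  end

  def strip(other), do: other
end

tree = {"div", [{"class", "post"}], ["Hello, ", {"b", [], ["big, wide"]}, " world"]}
annotated = PrecomputeSketch.annotate(tree)

IO.inspect(PrecomputeSketch.text_length(annotated))   # 22
IO.inspect(PrecomputeSketch.count_commas(annotated))  # 2
IO.inspect(PrecomputeSketch.strip(annotated) == tree) # true
```

With the values cached, every later lookup on an annotated node is a single attribute read rather than a walk over the whole subtree, which is where repeated full-text extraction loses time on large pages.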

All tests pass, and it works fine in production. Oh, and I bumped the dependencies a little so I could use Floki's traverse functions.

I see that the library is not actively maintained, but I still decided to give this PR a go. Maybe someone else will find my fork useful ;) @keepcosmos thanks for creating this!

@Valian Valian merged commit b1a6a0e into keepcosmos:master Dec 13, 2023
4 checks passed