Sites such as the New Yorker wrap the first character of an article in a span element to make it more prominent. These spans are supposed to appear inline with the rest of the paragraph, but because Crux replaces spans with paragraph tags in the post-process step, the single character ends up as its own paragraph in the output.
There are two possible fixes: retain span tags in the output without applying a minimum length (spans in these cases are usually very short), or remove the span tag and keep only its content.
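To make the difference concrete, here is a minimal sketch of the current behavior next to the second option (unwrapping). It is not Crux's code: the helper names, the regex-based approach, and the sample markup are all assumptions made for illustration, and a real fix would work on the parsed DOM rather than on strings.

```python
import re

# Hypothetical helpers for illustration only; these are not part of Crux.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)


def spans_to_paragraphs(html: str) -> str:
    """Mimic the reported behavior: every <span> becomes a <p>."""
    html = re.sub(r"<span\b[^>]*>", "<p>", html, flags=re.IGNORECASE)
    return re.sub(r"</span\s*>", "</p>", html, flags=re.IGNORECASE)


def unwrap_spans(html: str) -> str:
    """Proposed alternative: drop the span tags, keep the wrapped text."""
    return SPAN_TAG.sub("", html)


# Made-up sample resembling a drop-cap paragraph.
sample = '<p><span class="dropcap">T</span>he parade began at noon.</p>'

print(spans_to_paragraphs(sample))  # the "T" is split into its own paragraph
print(unwrap_spans(sample))         # the "T" stays inline with the sentence
```

Unwrapping keeps the text flow intact at the cost of losing the styling hook, which is probably acceptable for extracted article text.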
I can create a PR if you want.
Sure, that sounds good! PRs are always welcome. All we ask is that the tests continue to pass, either because the existing tests are unaffected or because they have been updated to match the expected extracted output.
Example article: https://www.newyorker.com/news/our-columnists/putin-and-trumps-ominous-nostalgia-for-the-second-world-war