node-htmlparser mangles text containing angle brackets #1

Open
mshmoustafa opened this issue May 29, 2019 · 0 comments
Labels
bug Something isn't working

What is the problem: node-htmlparser treats every angle bracket (< and >) as an HTML tag delimiter, so any text that follows a literal < is parsed as HTML and comes out mangled.

Proposed fix: run only the entries in the Links section through Utility.convertLinksToTextLinks, and leave every other section as plain text.
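
A minimal sketch of that idea, under stated assumptions: the `Entry` shape, the `HTML_SECTIONS` list, and the `renderEntry` helper are hypothetical and only illustrate the routing. The `convert` parameter stands in for Utility.convertLinksToTextLinks, whose real signature is not shown in this issue.

```typescript
// Hypothetical entry shape: section name -> raw lines of that section.
type Entry = { [section: string]: string[] };

// Only these sections are allowed to go through the HTML-parsing converter.
const HTML_SECTIONS = ["links"];

// `convert` is a stand-in for Utility.convertLinksToTextLinks (assumed: one line in, one result out).
function renderEntry<T>(
  entry: Entry,
  convert: (line: string) => T
): { [section: string]: (string | T)[] } {
  const out: { [section: string]: (string | T)[] } = {};
  for (const section of Object.keys(entry)) {
    const useHtml = HTML_SECTIONS.indexOf(section.toLowerCase()) !== -1;
    // Lines outside the Links section are passed through untouched,
    // so literal "<" and ">" never reach the HTML parser.
    out[section] = entry[section].map((line) => (useHtml ? convert(line) : line));
  }
  return out;
}
```

With this routing, a comment line such as "0<s(i)<5" is never handed to the parser, while lines in the Links section are still converted.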

Further details: HTML parsing is used to turn the hyperlink tags that appear in entries into components (see Utility.convertLinksToTextLinks). node-htmlparser causes problems with strings such as this one from https://oeis.org/A000045:

"The number of sequences (s(0),s(1),...,s(n)) such that 0<s(i)<5, |s(i)-s(i-1)|=1 and s(0)=1 is F(n+1); e.g., F(5+1) = 8 corresponds to 121212, 121232, 121234, 123212, 123232, 123234, 123432, 123434. - Clark Kimberling, Jun 22 2004 [corrected by Neven Juric, Jan 09 2009]"

This problem didn't show up in the Cordova implementation because that used the built-in DOM API in Safari, which apparently is more forgiving than node-htmlparser. Luckily (or very likely by design), it seems that hyperlinks are present only in the Links section.

For now, the plan is to process only the entries in the Links section with Utility.convertLinksToTextLinks. If testing shows that this fixes the problem, no further work is needed. If it doesn't fix most of the mangling issues, the next step would be to either:

  1. Use a Web View to gain access to Safari's/Chrome's HTML parser and make use of it with a bridge.
  2. Parse the hyperlinks ourselves. Since we now have a LinkText component, we could use a regex to find all hyperlink tags and pull out the href and text. (I know, parsing HTML with regex is taboo, but this is a really small subset of HTML with reasonably well-defined parameters.)
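
Option 2 could look roughly like the sketch below. It assumes links always appear as a simple anchor tag with a double-quoted href and no nested tags, which has not been verified against real entries. The `Segment` type and the `splitLinks` name are hypothetical, and mapping the result onto the LinkText component is left out.

```typescript
// Hypothetical result shape: plain text, or a link when `href` is set.
interface Segment {
  text: string;
  href?: string;
}

function splitLinks(input: string): Segment[] {
  // Assumption: <a ... href="..." ...>label</a>, double-quoted href, no nested tags.
  const anchor = /<a\s[^>]*?href="([^"]*)"[^>]*>([^<]*)<\/a>/gi;
  const segments: Segment[] = [];
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = anchor.exec(input)) !== null) {
    // Keep any text before the link exactly as written, angle brackets included.
    if (m.index > last) {
      segments.push({ text: input.slice(last, m.index) });
    }
    segments.push({ text: m[2], href: m[1] });
    last = m.index + m[0].length;
  }
  // Trailing text after the last link (or the whole input if there were no links).
  if (last < input.length) {
    segments.push({ text: input.slice(last) });
  }
  return segments;
}
```

Because only complete anchor tags match, a stray < or > in mathematical text such as 0<s(i)<5 stays in a plain text segment instead of being treated as markup.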
@mshmoustafa mshmoustafa added the bug Something isn't working label May 29, 2019
mshmoustafa added a commit that referenced this issue May 29, 2019