Implicitly terminated <li> tags and atDepth #77

Open
jforberg opened this issue Mar 2, 2019 · 5 comments

jforberg commented Mar 2, 2019

I'm having problems parsing a document that relies on HTML's implicit termination of <li> tags. I can't get it to work correctly in combination with scalpel's atDepth facility.

I've written a pair of test cases that I think shows the problem:

jforberg@f83413e

    -- This case passes
    ,   scrapeTest
            "atDepth correctly handles explicitly terminated <li> tags"
            "<body><ul><li>Li</li></ul><ul><li>La</li></ul></body>"
            (Just ["Li", "La"])
            (texts $ "body" // "li" `atDepth` 2)

    -- This case fails
    ,   scrapeTest
            "atDepth correctly handles implicitly terminated <li> tags"
            "<body><ul><li>Li</ul><ul><li>La</ul></body>"
            (Just ["Li", "La"])
            (texts $ "body" // "li" `atDepth` 2)

Based on my understanding of HTML, the two strings are equivalent representations of the same document. But the latter seems to be parsed incorrectly by scalpel (or maybe an underlying HTML library?).

My real use case is very similar to the test strings, except that there are nested <ul>'s which I need to avoid recursing into, hence the use of atDepth.
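
Roughly this shape, to give a made-up simplification of the real document:

    <body>
      <ul>
        <li>Outer item
          <ul><li>Nested item</ul>
      </ul>
    </body>

I use atDepth 2 so that only the outer <li> elements are matched and the nested ones are skipped.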

fimad (Owner) commented Mar 2, 2019

Interesting, I was not aware that unclosed tags were implicitly allowed like this...

First up, I think we can solve your current parsing problem with the new SerialScraper API:

import Control.Applicative (many, (<|>))
import Text.HTML.Scalpel

let html = "<body><ul><li>Li<li>Li2</ul><ul><li>La<li><li>La2</ul></body>"

scrapeStringLike html $ do
    -- Focus in on each <ul> tag at the required depth in turn.
    chroots ("ul" `atDepth` 1) $ do
        -- Within each <ul>, repeatedly step through the direct children in sequence.
        inSerial $ many $ do
            -- Move the cursor forward until it is focusing on the next <li> tag that
            -- is a direct child of the current <ul> tag.
            seekNext (matches $ "li" `atDepth` 0)
            -- Move the cursor forward one step and extract the text from a bare
            -- text node. If there is no bare text node, return "" and allow the many
            -- to continue moving through the children of <ul>.
            stepNext (text $ textSelector `atDepth` 0) <|> return ""

-- Returns:
--   Just [["Li","Li2"],["La","","La2"]]

Currently, scalpel's parser does not attempt to follow the HTML spec exactly. Instead the goal of the parsing algorithm is to quickly and consistently do the least surprising thing in the face of both valid and invalid HTML.

Right now the parsing algorithm does not have any special casing by tag name and treats any tag without a corresponding closing tag as being implicitly closed immediately. This was chosen so that parsing HTML like <div>foo<br><img></div> would not result in the <br> tag being a parent of <img>.
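
For example, mirroring the selectors used earlier in this thread (my sketch; the point is only the depth at which <img> ends up):

    -- <br> is treated as immediately closed, so <img> is a direct child
    -- of <div>. If <br> were instead the parent of <img>, then <img>
    -- would sit at depth 2 and this would return Nothing.
    scrapeStringLike "<div>foo<br><img></div>"
        (matches $ "div" // "img" `atDepth` 1)
    -- Just ()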

Realistically, I don't think it is possible for scalpel to 100% conform to the official HTML parsing algorithm. It seems like supporting only some of the spec's tag-specific special cases would result in more surprising behavior than not supporting any of them at all.

What do you think?

jforberg commented Mar 4, 2019

Thanks for your example. Another possible work-around would be to add an explicit pre-processing step to normalise the HTML (close all tags in the appropriate way).
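
Sketching that idea with tagsoup (the name closeImplicitLis is invented, and it only implements the <li> rule, but it shows the general shape of such a normaliser):

    import Text.HTML.TagSoup

    -- Walk the tag stream keeping a stack with one Bool per enclosing
    -- list element, recording whether an <li> is currently open at that
    -- level, and emit the implicit </li> closes the author omitted.
    closeImplicitLis :: String -> String
    closeImplicitLis = renderTags . go [] . parseTags
      where
        listParents = ["ul", "ol", "menu"]

        go _ [] = []
        go stack (t : ts) = case t of
          TagOpen name _
            | name `elem` listParents -> t : go (False : stack) ts
          TagOpen "li" _ -> case stack of
            True : rest -> TagClose "li" : t : go (True : rest) ts
            _ : rest    -> t : go (True : rest) ts
            []          -> t : go stack ts
          TagClose "li" -> case stack of
            _ : rest -> t : go (False : rest) ts
            []       -> t : go stack ts
          TagClose name
            | name `elem` listParents -> case stack of
                True : rest -> TagClose "li" : t : go rest ts
                _ : rest    -> t : go rest ts
                []          -> t : go stack ts
          _ -> t : go stack ts

On the failing test input this gives back "<body><ul><li>Li</li></ul><ul><li>La</li></ul></body>", which scalpel already handles correctly.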

I understand the desire to keep the code simple, but it seems to me that this goal may be in conflict with "do the least surprising thing" in this case. There are different ways to not support an HTML feature: returning a parse error (not surprising, a minor annoyance) or silently returning an incorrect result (surprising, at least to me).

Would you be interested in a patch to make scalpel handle this case correctly? If so, I could try my hand at it. However, it would involve adding special cases for different tag types.

If not, then I would suggest that this limitation is documented, so that users know that they must explicitly close all tags before scraping.

fimad (Owner) commented Mar 6, 2019

I just read through the HTML 5 spec's section on optional tags and I find its behavior surprising despite it being the official definition of how HTML is supposed to work :)

For example, the following two HTML snippets are equivalent according to the spec:

    <!DOCTYPE HTML><title>Hello</title><p>Welcome to this example.

and

    <!DOCTYPE HTML>
    <html><head><title>Hello</title>
    </head><body><p>Welcome to this example.</p></body></html>

I think this would be confusing from a scalpel user's perspective, since a change to support optional tags would change the depth and sibling relationships of nodes in ways that are hard to debug.

How would you feel about supporting this via an explicit pre-processing step? That way users have to knowingly opt-into a more correct but harder to debug representation of the tag tree.

Internally a Scraper is just a function from a parsed tag tree to some result, so we could have a function like preprocessForOptionalTags :: Scraper a -> Scraper a (with a better name) that mutates the parsed tree to add in the implicit nodes.
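
At the string level (rather than rewriting the tag tree), the opt-in wiring could be as simple as this sketch (scrapeNormalised is an invented name):

    import Text.HTML.Scalpel

    -- Run a scraper against HTML that has first been passed through a
    -- user-supplied normaliser, e.g. one that closes implicit <li> tags.
    scrapeNormalised :: (String -> String) -> String -> Scraper String a -> Maybe a
    scrapeNormalised normalise html scraper = scrapeStringLike (normalise html) scraper

Users who want the spec's optional-tag behavior would then have to explicitly pass a normaliser, keeping the default parse unchanged.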

jforberg commented Mar 6, 2019

I can perhaps agree with you regarding your example, but that's a wholly different case from the one I hoped to address here. In your example, tags that were not present in the source text would have to be implicitly added. In my example, tags only need to be implicitly closed, and scalpel is closing them, just unfortunately in the wrong order.

I can certainly solve my own problem with a workaround, but my goal in posting this was to try and help others who may run into the same issue and wonder why their code doesn't work.

If there is no desire to implement HTML according to spec in this regard, I think it would be preferable if scalpel returned a parse error when it encountered HTML it doesn't understand. To me, it's fairly clear that "return parse error" is superior from a debugging perspective to "silently return empty string". It would be very unfortunate if correct parsing was made opt-in rather than the default.

Like I said, I'm happy to give it a try if you are willing to accept a fix for this.

fimad (Owner) commented Mar 10, 2019

I think the two cases are related in that both are special cases dictated by the HTML spec, and I personally would find it hard to explain why a subset of these special cases is implemented while others are not.

I also strongly disagree that scalpel should ever return a parse error. Scalpel is not meant to be a 100% compliant HTML parser/validator; it is meant to be a practical tool for extracting content out of text that looks like HTML. As such, it is intended to be able to handle malformed and incomplete HTML in a reasonably consistent way.

Given this, the reason I would prefer an opt-in approach to the more delicate parts of the HTML spec is because I think it is hard to precisely define how they apply to malformed HTML. For example, the spec says <li> can only be implicitly closed if it is followed by another <li> or the end of its parent (which by the spec may only be <ul>, <ol>, or <menu>). What should the behavior be if this is not the case?
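
For example, given an invented malformed case like:

    <div><li>one<li>two</div>

the parent here is a <div>, which the rule does not cover, so the spec gives no guidance on whether the first <li> should be closed by the second one or left open until the </div>.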

Rather than deal with the ambiguities that arise when trying to apply the spec to invalid HTML, I would prefer the default to be an unambiguous parsing model that is applied consistently to both invalid and valid HTML, even if it differs from the HTML spec in some cases.

I agree that the parsing model of scalpel should be better documented. It's been on my mental TODO list for a while now and I've created #79 to explicitly track adding it.

I'm happy to accept a patch to add an explicit preprocessor function to handle closing implicit tags, but I'm not convinced that changing the default behavior of the parser is the correct approach.
