Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape method won't match correctly #20

Open
TechnikEmpire opened this issue Nov 21, 2015 · 5 comments
Open

Escape method won't match correctly #20

TechnikEmpire opened this issue Nov 21, 2015 · 5 comments

Comments

@TechnikEmpire
Copy link

So, going over the code for parsing selectors, there's the method parseEscape(...) which appears to be intended to convert attribute values that contain escaped characters and code points to literal values. However, doing some quick tests embedding such values in HTML, parsing then matching, it appears that gumbo_parser directly embeds these values untouched. So if I'm right, unescaping these sequences in supplied selectors will actually cause the selectors which should match, to not match.

I'm not familiar with the GO html package, but from what searching I've done, it appears that this might have been a requirement for cascadia (since it worked against the GO html package), but is invalid for use with gumbo_parser. On that note, gumbo_parser does appear to convert numeric character references and named character references, but gumbo_query doesn't of course, so matching a selector with an id containing something like " " will fail.

Will try to post some tests to confirm, I'm just reading over all of this figuring it out.

References:
http://www.w3.org/International/questions/qa-escapes
https://mathiasbynens.be/notes/css-escapes
http://www.w3.org/TR/html5/syntax.html#named-character-references

@TechnikEmpire
Copy link
Author

I've confirmed this. Gumbo Parser directly embeds escaped character sequences without modification, yet gumbo-query converts them before building a selector. Gumbo Parser has a massive portion of source code dedicated to converting character references (char_ref.c), but gumbo-parser does not. So you have a twofold breakdown here. Selectors built to match escaped sequences will never match, and HTML elements that use character references will never match.

The character references issue can be sorted by leveraging the character reference conversion built into gumbo parser, and the parseEscaped method needs to be dropped entirely.

void ConvertCharRefStr(std::string& stringWithCharRef)
{
    std::string htmlStr = u8"<html><body><div gqconverted=\"" + stringWithCharRef + u8"\"/></body></html>";

    GumboOutput* output = gumbo_parse(htmlStr.c_str());
    GumboNode* nodeWithConvertedValue = static_cast<GumboNode*>(static_cast<GumboNode*>(output->root->v.element.children.data[1])->v.element.children.data[0]);
    GumboAttribute* a = gumbo_get_attribute(&nodeWithConvertedValue->v.element.attributes, u8"gqconverted");
    stringWithCharRef.clear();
    stringWithCharRef.assign(a->value);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
}

@lazytiger
Copy link
Owner

Hi, @TechnikEmpire
You're right about the escape. I've just port the code from go to cpp, and don't do much research about gumbo_parser. I will look into this problem and make a fix. Thanks for the report.

@TechnikEmpire
Copy link
Author

@lazytiger I understand. I'll see if I can help. I'm doing an unofficial fork fixing these things, but also rewriting almost everything, but that's for my own purposes. I'm injecting a bunch of third party dependencies and don't expect it to be pulled from back to here. However, I'll create a separate actual fork, throw in whatever bug fixes I come up with and you can decide if you want to pull them.

@TechnikEmpire
Copy link
Author

@lazytiger I lost track of most of the fixes I made. I tried to do a clone and push fixes but, like I said just lost track. I've finished my rewrite of gumbo-query and renamed it to GQ so it doesn't shadow your project. I've created a ton of tests that can be run in the browser and test programs which verify correct functionality for all types of selectors (over 40 tests). I've also implemented a tree map that makes matching extremely fast. Can find it here https://github.com/TechnikEmpire/GQ if you feel like backporting any of it. Unfortunately due to the full scale rewrite and implementation of boost deps, it wasn't possible to do an official fork and submit PR's.

@lazytiger
Copy link
Owner

@TechnikEmpire Thanks for the fixes, I'll see to it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants