Parser for HTML.
The goal is not to reinvent the wheel, but to have at least something available for the Proof of Concept.
However, having our own parser, even though we would need to maintain it ourselves, lets us add the features we want, stay close to our models, and optimize processing.
- `README.md`: this file
- `.gitignore`: Git-related file
Parser:
- `index.js`: entry point of the parser
- `grammar.js`: generated from `grammar.pegjs`, it is the actual parser
- `grammar.pegjs`: grammar of the parser
- `options.json`: options used for the generation of the parser
Test:
- `test.js`: file launching a standard test for PEG.js generated parsers, using the test data inside this module
- `test.html`: aimed at being a comprehensive input to test the parser
To ignore:
- `grammar.js`: file generated from the `grammar.pegjs` file
Optional: any convenient script to automate the commands described in the contribute section below.
To version: everything else.
To be able to use the parser, build the grammar by running the following command from this folder: `..\..\..\..\..\node_modules\.bin\pegjs --extra-options-file options.json grammar.pegjs`.
This command uses a binary installed by npm in the `node_modules` folder dedicated to third-party libraries (in the root folder of the backend project). If it doesn't work, the PEG.js module may not have been installed properly; try reinstalling it.
Optionally, you can add the installed PEG.js binary to your system's `PATH` environment variable and then simply use the command `pegjs --extra-options-file options.json grammar.pegjs`, which avoids the long relative path.
Alternatively, you can install PEG.js globally, at the right version, using the following command: `npm install -g git+https://github.com/dmajda/pegjs`.
To try it, launch the test set with the following command, after making sure the grammar has been built: `node test`.
- Be able to parse empty source code: probably returning a single root node, without any element inside
- Complete TODOs and FIXMEs from the code
- handle inline nodes without `/` before `>` (see note below)
- improve test data, by adding every possible syntactic construct
- fix the whitespace mess: sometimes whitespace is added to the children index as an empty list, sometimes it is not added at all when it belongs to a conditional group
  - all whitespace must be added
  - it must be added as children nodes
  - these children must be reachable through the two collections of the parent node: index and list
- The model for attributes can be confusing: the key of an attribute is set as a property of the attribute node, while the value is set as a node. This is because the value is no longer a string in this case but an object, and is thus wrapped in a node.
- Add references for HTML, to know how to parse it
- Document the model of the language
- Document the grammar, explaining each rule specifically
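Until the model is documented, the expectations listed above (an empty source giving a single root node, whitespace kept as children reachable through both `list` and `index`, attribute keys as properties and attribute values as nodes) can be illustrated with a small sketch. Every name below (`createNode`, `addChild`, `createAttribute`, the `type` strings, and the idea that `index` groups children by type) is an assumption made for illustration, not the actual model of this parser.

```javascript
// Hypothetical node factory: names and shapes are assumptions, not the real model.
function createNode(type, properties = {}) {
  return {
    type,
    ...properties,
    // Two collections for the same children: ordered list and lookup index.
    children: { list: [], index: Object.create(null) },
  };
}

// Adds a child to BOTH collections of its parent:
// - list keeps every child in source order
// - index groups the same child objects by type (assumed indexing key)
function addChild(parent, child) {
  parent.children.list.push(child);
  if (!parent.children.index[child.type]) {
    parent.children.index[child.type] = [];
  }
  parent.children.index[child.type].push(child);
  return child;
}

// The key is a plain string property; the value is wrapped in its own node.
function createAttribute(key, text) {
  const attribute = createNode('attribute', { key });
  attribute.value = createNode('attributeValue', { text });
  return attribute;
}

// What parsing an empty source could return: a root node with no children.
const emptyRoot = createNode('root');

// A root containing one element, itself containing a whitespace child.
const root = createNode('root');
const div = addChild(root, createNode('element', { name: 'div' }));
addChild(div, createNode('whitespace', { text: '\n  ' }));
const attribute = createAttribute('class', 'box');
```

With this shape, a whitespace node is never lost: it appears once in `div.children.list` and the same object is reachable through `div.children.index.whitespace`.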
The current solution relies on a static list of tag ids for inline statements.
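As a rough illustration of the static list approach, the check could look like the sketch below. The names `VOID_TAGS` and `isInlineTag` are hypothetical, and the list shown is the set of void elements from the HTML standard; the list actually used by the grammar may differ.

```javascript
// Assumed list: void elements of the HTML standard, which never take a closing tag.
const VOID_TAGS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper: true when a tag must be treated as inline (self-contained)
// even if it is written without "/" before ">".
function isInlineTag(tagName) {
  return VOID_TAGS.has(tagName.toLowerCase());
}
```

The lookup is case-insensitive because HTML tag names are, so `<BR>` and `<br>` are handled the same way.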
Another solution would be to drastically change the model: we would only parse a flat list of items, with tags themselves being items of that list, and then post-process the result by traversing the list to build the hierarchy.
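The post-processing step of this alternative could be sketched with a simple stack, as below. This is only an illustration under assumptions: the item shape (`kind`, `name`, `text`), the function name `buildHierarchy` and the `isVoid` predicate are all hypothetical, and error handling is reduced to ignoring stray closing tags.

```javascript
// Builds a tree from a flat list of items (opening tags, closing tags, text).
// isVoid tells which tag names never have a closing tag (assumed predicate).
function buildHierarchy(items, isVoid = () => false) {
  const root = { type: 'root', children: [] };
  const stack = [root]; // currently open elements, innermost last

  for (const item of items) {
    const parent = stack[stack.length - 1];
    if (item.kind === 'open') {
      const element = { type: 'element', name: item.name, children: [] };
      parent.children.push(element);
      // Void tags are complete as soon as they are opened.
      if (!isVoid(item.name)) stack.push(element);
    } else if (item.kind === 'close') {
      // Look for the nearest open element with the same name.
      let depth = stack.length - 1;
      while (depth > 0 && stack[depth].name !== item.name) depth--;
      // Found: close it along with any unclosed descendants.
      // Not found (depth === 0): stray closing tag, ignored.
      if (depth > 0) stack.length = depth;
    } else {
      parent.children.push({ type: 'text', text: item.text });
    }
  }
  return root;
}

// Flat list for: <div><br>text</div>
const tree = buildHierarchy(
  [
    { kind: 'open', name: 'div' },
    { kind: 'open', name: 'br' },
    { kind: 'text', text: 'text' },
    { kind: 'close', name: 'div' },
  ],
  (name) => name === 'br'
);
```

With this design the grammar itself stays flat and simple, and the decision about which tags are inline moves out of the grammar into the post-processing step. An empty list naturally yields a root node without children.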