README.md

Parser for HTML.

The goal is not to reinvent the wheel, but to have at least something available for the Proof of Concept.

However, having our own parser, even though we would have to maintain it ourselves, lets us add the features we want, stay close to our models, optimize processing, etc.

File system layout

README.md: this file
.gitignore: Git ignore rules

Parser:

index.js: entry point of the parser
grammar.js: generated from grammar.pegjs, it is the actual parser
grammar.pegjs: grammar of the parser
options.json: options used for the generation of the parser

Test:

test.js: launches a standard test for PEG.js generated parsers, using the test data inside this module
test.html: test input, intended to cover the syntax as comprehensively as possible

Versioning

To ignore:

grammar.js: file generated from the grammar.pegjs file

Optional: any convenience script automating the commands described in the Contribute section below.

To version: everything else.

Contribute

Setup

To use the parser, first build the grammar by running the following command from this folder:

..\..\..\..\..\node_modules\.bin\pegjs --extra-options-file options.json grammar.pegjs

This command uses a binary installed by npm in the node_modules folder, which holds third-party libraries (in the root folder of the backend project). If it doesn't work, the PEG.js module may not have been installed properly: try reinstalling it.

Optionally, you can add the installed PEG.js binary to your PATH system environment variable and then simply run: pegjs --extra-options-file options.json grammar.pegjs. This avoids the long relative path.

Alternatively, you can install the right version of PEG.js globally with the following command: npm install -g git+https://github.com/dmajda/pegjs
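The optional convenience script mentioned in the Versioning section could look like the sketch below. It only wraps the build command given above; the file name (for example build.js) and the function names pegjsBinary, buildCommand and build are hypothetical and not part of this module.

```javascript
const path = require('path');
const { execSync } = require('child_process');

// Location of the pegjs binary: five folders up from this one, as in the
// command from the Setup section. Adjust if the project layout differs.
function pegjsBinary(fromDir) {
  return path.join(fromDir, '..', '..', '..', '..', '..', 'node_modules', '.bin', 'pegjs');
}

// Full command line, with the same options as the documented command.
function buildCommand(fromDir) {
  return `"${pegjsBinary(fromDir)}" --extra-options-file options.json grammar.pegjs`;
}

// Runs the build from the given folder; execSync throws if pegjs exits with an error.
function build(fromDir) {
  execSync(buildCommand(fromDir), { cwd: fromDir, stdio: 'inherit' });
}
```

Calling build(__dirname) from a script placed in this folder would then generate grammar.js next to grammar.pegjs.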

Try

To try it, build the grammar first, then launch the test set with the following command: node test

Backlog

Be able to parse empty source code: probably returning a single root node, without any element inside
Complete the TODOs and FIXMEs in the code
Handle inline nodes without / before > (see note below)
Improve the test data by adding every possible syntactic construct

Consistency

Fix the inconsistent handling of whitespace: sometimes it is added to the children index as an empty list, sometimes it is not added at all when it belongs to a conditional group
- all whitespace must be added
- it must be added as children nodes
- these children must be reachable through the two collections of the parent node: index and list

The model for attributes can look confusing: the key of an attribute is stored as a property of the attribute node, while the value is stored as a node. The reason is that the value is no longer a plain string in this case, but an object, and is therefore wrapped in a node.
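To make the target model easier to picture, here is a minimal sketch of nodes with the two collections, list and index, and of an attribute whose key is a property and whose value is a child node. Everything except the names list and index is an assumption made for illustration: the createNode and addChild helpers, the node types, the index keys and the shape of the index (a map from key to a list of children) are not taken from the actual grammar.

```javascript
// Hypothetical node factory: each node carries both collections.
function createNode(type, properties = {}) {
  return { type, ...properties, list: [], index: {} };
}

// Adds a child so that it is reachable through both collections of the parent.
function addChild(parent, key, child) {
  parent.list.push(child); // ordered collection
  (parent.index[key] = parent.index[key] || []).push(child); // keyed collection
  return child;
}

// Model of <a href="x">: the whitespace before the attribute is a child node too.
const element = createNode('element', { name: 'a' });
addChild(element, 'whitespaces', createNode('whitespace', { text: ' ' }));

// The key is a plain property of the attribute node...
const attribute = addChild(element, 'attributes', createNode('attribute', { key: 'href' }));
// ...while the value is an object, hence wrapped in its own node.
addChild(attribute, 'value', createNode('value', { text: 'x' }));
```

With this shape, whitespace is never dropped and never stored as an empty list: it always appears once in list and once under its key in index.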

Documentation

Add references for HTML, to document how it should be parsed
Document the model of the language
Document the grammar, explaining each rule specifically

Discussions

Should the HTML parser recognize CDATA? For XHTML, probably yes.

Notes

Inline nodes

The current solution relies on a static list of tag ids to recognize inline nodes.

Another solution would be to change the model drastically: parse only a flat list of elements, in which tags are elements themselves, and then post-process the result by traversing this list to build the hierarchy.
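The second approach can be sketched as follows, assuming the parser emits a flat list of tokens of type open, close or text. The token shape, the buildTree function and the content of INLINE_TAGS are hypothetical; the static list is kept here only to decide which opening tags never get children.

```javascript
// Illustrative subset of tags written without a closing tag or a trailing "/".
const INLINE_TAGS = new Set(['br', 'hr', 'img', 'input', 'link', 'meta']);

// Post-processing pass: turns the flat token list into a hierarchy.
function buildTree(tokens) {
  const root = { type: 'root', children: [] };
  const stack = [root]; // currently open nodes, innermost last
  for (const token of tokens) {
    const parent = stack[stack.length - 1];
    if (token.type === 'open') {
      const element = { type: 'element', name: token.name, children: [] };
      parent.children.push(element);
      // Inline and self-closing tags never become the current parent.
      if (!token.selfClosing && !INLINE_TAGS.has(token.name)) stack.push(element);
    } else if (token.type === 'close') {
      // Close the nearest matching open element; stray closing tags are ignored.
      const at = stack.map((node) => node.name).lastIndexOf(token.name);
      if (at > 0) stack.length = at;
    } else {
      parent.children.push({ type: 'text', value: token.value });
    }
  }
  return root;
}

// Tokens for <p>a<br>b</p>
const tree = buildTree([
  { type: 'open', name: 'p' },
  { type: 'text', value: 'a' },
  { type: 'open', name: 'br' },
  { type: 'text', value: 'b' },
  { type: 'close', name: 'p' },
]);
```

Here tree has a single p element whose children are the text a, the br element and the text b. An empty token list yields a root node without children, which is also the behaviour suggested in the Backlog for empty source code.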
