Parser for HTML.

The goal is not to reinvent the wheel, but to have at least something available for the Proof of Concept.

However, having our own parser, even if we have to maintain it ourselves, lets us add the features we want, stay closer to our models, and optimize processing.

File system layout

  • README.md: this file
  • .gitignore: Git ignore rules

Parser:

  • index.js: entry point of the parser
  • grammar.js: generated from grammar.pegjs, it is the actual parser
  • grammar.pegjs: grammar of the parser
  • options.json: options used for the generation of the parser

Test:

  • test.js: launches a standard test for PEG.js generated parsers, using the test data inside this module
  • test.html: intended as a comprehensive input to test the parser

Versioning

To ignore:

  • grammar.js: file generated from the grammar.pegjs file

Optional: any convenience script that automates the commands described in the Contribute section below.

To version: everything else.

Contribute

Setup

To use the parser, build the grammar by running the following command from this folder: ..\..\..\..\..\node_modules\.bin\pegjs --extra-options-file options.json grammar.pegjs.

This command uses a binary installed by npm in the node_modules folder, which holds third-party libraries (in the root folder of the backend project). If it doesn't work, the PEG.js module may not have been installed properly; try reinstalling it.

Optionally, you can add the installed PEG.js binary to your system PATH environment variable and then simply run: pegjs --extra-options-file options.json grammar.pegjs, which avoids the long relative path.

Alternatively, you can install the right version of PEG.js globally by hand, using the following command: npm install -g git+https://github.com/dmajda/pegjs.

Try

To try it, first make sure the grammar has been built, then run the test set with the following command: node test.

Backlog

  1. Be able to parse empty source code: probably by returning a single root node with no elements inside
  2. Complete TODOs and FIXMEs from the code
  3. Handle inline nodes without / before > (see note below)
  4. Improve the test data by adding every possible syntactic construct

Consistency

  • Fix inconsistent whitespace handling: sometimes whitespace is added to the children index as an empty list, and sometimes it is not added at all when it belongs to a conditional group
    • all whitespace must be added
    • it must be added as child nodes
    • these children must be reachable through the two collections of the parent node: index and list
  • The model for attributes can be confusing: the key of an attribute is set as a property of the attribute node, while the value is set as a node. This is because the value is no longer a plain string but an object, and is therefore wrapped in a node.
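
The two points above can be illustrated with a minimal sketch. It is not the parser's actual model: the function names (createNode, addChild, createWhitespace, createAttribute), the node type names, and the assumption that index groups children by type while list keeps them in source order are all hypothetical.

```javascript
// Hypothetical node model (an assumption, not the real one):
// `list` keeps children in source order, `index` groups the same children by type.
function createNode(type, properties = {}) {
  return { type, properties, children: { list: [], index: {} } };
}

// Every child, whitespace included, is registered in BOTH collections.
function addChild(parent, child) {
  parent.children.list.push(child);
  const index = parent.children.index;
  if (!index[child.type]) {
    index[child.type] = [];
  }
  index[child.type].push(child);
  return child;
}

// Whitespace is a regular child node, never an empty list or a dropped token.
function createWhitespace(text) {
  return createNode('whitespace', { text });
}

// The key is a plain property of the attribute node; the value is a child node.
function createAttribute(key, valueNode) {
  const attribute = createNode('attribute', { key });
  addChild(attribute, valueNode);
  return attribute;
}

const element = createNode('element', { name: 'div' });
addChild(element, createWhitespace(' '));
addChild(element, createAttribute('class', createNode('attributeValue', { text: 'box' })));
addChild(element, createWhitespace('\n'));
```

With this shape, element.children.list holds the three children in order, and element.children.index.whitespace holds the two whitespace nodes.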

Documentation

  1. Add references for HTML, to document how it should be parsed
  2. Document the model of the language
  3. Document the grammar, explaining each rule specifically

Discussions

Notes

Inline nodes

The current solution relies on a static list of tag ids for inline nodes.
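
A static list of this kind could look like the sketch below. The list actually used in the grammar is not shown here, so this example assumes the void elements defined by the HTML standard; VOID_TAGS and isInlineTag are hypothetical names.

```javascript
// Assumption: the static list matches the void elements of the HTML standard
// ("param" is obsolete but still void). The grammar's real list may differ.
const VOID_TAGS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'param', 'source', 'track', 'wbr',
]);

// Hypothetical helper: true when a tag cannot have children, either because
// it was written with a closing slash or because its name is in the static list.
function isInlineTag(name, hasClosingSlash = false) {
  return hasClosingSlash || VOID_TAGS.has(name.toLowerCase());
}
```

For example, isInlineTag('br') is true without any slash, while isInlineTag('div') is true only when the tag was written as <div/>.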

Another solution would be to change the model drastically: the parser would only produce a flat list of elements, in which opening and closing tags are themselves elements, and a post-processing step would then traverse this list to build the hierarchy.
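
The post-processing step of this alternative can be sketched with a stack. This is only an illustration under assumptions: the item shapes ({ type: 'open' | 'close' | 'text' }), the buildTree name, and the default set of inline tags are hypothetical and not taken from the existing code.

```javascript
// Builds a hierarchy from a flat list of items (hypothetical shapes):
//   { type: 'open', name, selfClosing? }, { type: 'close', name }, { type: 'text', value }
function buildTree(flatList, inlineTags = new Set(['br', 'hr', 'img'])) {
  const root = { type: 'root', children: [] };
  const stack = [root]; // the last entry is the currently open element

  for (const item of flatList) {
    const parent = stack[stack.length - 1];
    if (item.type === 'open') {
      const element = { type: 'element', name: item.name, children: [] };
      parent.children.push(element);
      // Inline tags never receive children, so they are not pushed on the stack.
      if (!item.selfClosing && !inlineTags.has(item.name)) {
        stack.push(element);
      }
    } else if (item.type === 'close') {
      // Look for the nearest open element with the same name.
      let depth = stack.length - 1;
      while (depth > 0 && stack[depth].name !== item.name) {
        depth--;
      }
      // Close it along with anything left open inside it; ignore stray closing tags.
      if (depth > 0) {
        stack.length = depth;
      }
    } else {
      parent.children.push({ type: 'text', value: item.value });
    }
  }
  return root;
}

const tree = buildTree([
  { type: 'open', name: 'div' },
  { type: 'text', value: 'a' },
  { type: 'open', name: 'br' },
  { type: 'text', value: 'b' },
  { type: 'close', name: 'div' },
]);
```

In this sketch an empty flat list yields a single root node with no children, and the inline tag <br> becomes a childless sibling of the surrounding text without needing a / before >.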