Layout versus data tables
This wiki tracks the logic QUAIL uses to determine whether a table is being used for layout or for data. Most of this is informed by this academic paper. Note that the tests are ordered by ease of computation to speed things up: if a test says something is a layout table, we stop there. This is all encompassed in the `quail.isDataTable` method.
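The early-exit ordering can be sketched roughly as follows. This is an illustration only, not QUAIL's source: `classifyTable`, `singleRowCheck`, and `costlyCheck` are hypothetical names, tables are plain objects rather than DOM elements, and the default verdict for a table that no check flags is an assumption.

```javascript
// Hypothetical sketch, not QUAIL's source: run checks from cheapest to most
// expensive and stop at the first one that reaches a verdict.
// Each check returns 'layout', 'data', or null (no opinion).
function classifyTable(table, checks) {
  for (const check of checks) {
    const verdict = check(table);
    if (verdict !== null) {
      return verdict; // remaining, costlier checks are skipped
    }
  }
  return 'data'; // assumption: a table that survives every check is treated as data
}

// Cheapest rule from this page: a single-row table is layout.
function singleRowCheck(table) {
  return table.rows.length === 1 ? 'layout' : null;
}

// Stand-in for a costly test; it only records that it ran.
function costlyCheck(table) {
  table.costlyCheckRan = true;
  return null;
}
```

Calling `classifyTable({ rows: [['logo', 'menu']] }, [singleRowCheck, costlyCheck])` returns `'layout'` without ever invoking `costlyCheck`, which is the point of ordering the tests by cost.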
If the table has only one row, it is being used for layout.
If a table has another table element within it, it is probably a layout table. We do check, however, that child tables are not used consistently throughout the parent table: if, for example, every third cell in the table has a table element within it, then this is probably a data table. Essentially, data tables are almost always leaf elements.
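One possible reading of "consistent use of child tables" is that the cells containing a nested table are evenly spaced. The sketch below works under that assumption; `nestedTablesLookConsistent` and `nestedTableVerdict` are hypothetical names, and the input is a flat array of booleans (one per cell, in document order) rather than a DOM tree. Returning `null` for the consistent case, so that later tests still run, is also an assumption.

```javascript
// flags[i] is true when the i-th cell contains a nested <table>.
// "Consistent" is interpreted here as: nested tables occur at a fixed interval.
function nestedTablesLookConsistent(flags) {
  const positions = [];
  flags.forEach((hasTable, i) => {
    if (hasTable) positions.push(i);
  });
  if (positions.length < 2) return false; // a lone nested table is not a pattern
  const step = positions[1] - positions[0];
  return positions.every((p, i) => i === 0 || p - positions[i - 1] === step);
}

// Returns 'layout' for irregular nesting, null (no opinion) otherwise.
function nestedTableVerdict(flags) {
  if (!flags.some(Boolean)) return null; // leaf table: nothing to decide here
  return nestedTablesLookConsistent(flags) ? null : 'layout';
}
```

With this reading, a table whose every third cell holds a child table is passed on to the later tests, while a table with a single stray child table is flagged as layout.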
If a table has made it this far, we look to see if there is an appropriate usage of `<th>` elements. If there is a group of `<th>` elements in a `<thead>` element, and especially if there are `<th>` elements with a `scope` attribute, then we are assuming the document author was thinking about a data table, and flag this as a data table.
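A minimal sketch of the header test, assuming the table is a DOM element (or anything else exposing `querySelectorAll`). The function name `headerEvidence`, the `minGroup` threshold for what counts as "a group", and the three-level result are assumptions made for illustration; they are not taken from QUAIL.

```javascript
// Returns 'none', 'some', or 'strong' depending on how <th> elements are used.
// minGroup is an assumed threshold for "a group of <th> elements".
function headerEvidence(table, minGroup = 2) {
  const theadCells = table.querySelectorAll('thead th').length;
  const scopedCells = table.querySelectorAll('th[scope]').length;
  if (theadCells < minGroup) return 'none';
  // Header cells in a <thead> suggest a data table; scope attributes reinforce it.
  return scopedCells > 0 ? 'strong' : 'some';
}
```

Under this sketch, any result other than `'none'` would flag the table as a data table.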
Layout tables almost always have cells spanning different widths, so we convert the table into a matrix representing cell spanning. Column spans are counted up in a given column, and if there are colspans that are not used consistently from row to row, this is a layout table.
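The span matrix could be built along these lines. The sketch assumes each row is an array of plain objects with an optional `colspan` property, ignores `rowspan`, and reads "used consistently" as "every row uses the same span in a given column"; the function names are hypothetical.

```javascript
// Expand each row into one entry per grid column; each entry is the colspan
// of the cell covering that column. Rowspans are ignored in this sketch.
function buildSpanMatrix(rows) {
  return rows.map((cells) =>
    cells.flatMap((cell) => Array(cell.colspan || 1).fill(cell.colspan || 1))
  );
}

// Assumed reading of "consistent": all rows agree on the span in each column.
function hasInconsistentColspans(rows) {
  const matrix = buildSpanMatrix(rows);
  const width = Math.max(...matrix.map((row) => row.length));
  for (let col = 0; col < width; col++) {
    const spans = new Set(matrix.map((row) => row[col]));
    if (spans.size > 1) return true; // spans differ, or a row is shorter than others
  }
  return false;
}
```

For example, a first row with one cell spanning three columns and a second row with spans of 1 and 2 expands to `[[3, 3, 3], [1, 2, 2]]`, which disagrees in every column and would be treated as layout.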
Data tables tend to have consistent types of information within each column. We again build a matrix representing the table, then compare each cell's content length with the others in its column, and compare the type of data (HTML, text, number) down the column. Number types are inferred after first removing words repeated throughout a column, so if a unit is used throughout the column (e.g. "4 feet"), it is ignored. Types are assigned numeric values of 1 for HTML, 2 for plain text, and 3 for numbers so that we can use them with standard deviation.
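The type inference for a single column might look like the sketch below. It assumes each cell is given as a string of its inner HTML, treats a word as "repeated" when it is non-numeric and appears in every cell of the column, and detects HTML with a simple tag regex. Those choices and the names `inferColumnTypes` and `isNumeric` are illustrative assumptions.

```javascript
const TYPE_HTML = 1, TYPE_TEXT = 2, TYPE_NUMBER = 3; // codes from the text above

// True for non-empty strings that parse as a finite number.
function isNumeric(text) {
  const trimmed = text.trim();
  return trimmed !== '' && Number.isFinite(Number(trimmed));
}

function inferColumnTypes(cells) {
  if (cells.length === 0) return [];
  const words = cells.map((cell) => cell.trim().split(/\s+/));
  // Non-numeric words present in every cell (for example a unit) are dropped.
  const repeated = new Set(
    words[0].filter((w) => !isNumeric(w) && words.every((ws) => ws.includes(w)))
  );
  return cells.map((cell, i) => {
    if (/<[a-z][^>]*>/i.test(cell)) return TYPE_HTML;
    const rest = words[i].filter((w) => !repeated.has(w)).join(' ');
    return isNumeric(rest) ? TYPE_NUMBER : TYPE_TEXT;
  });
}
```

A column of `'4 feet'`, `'12 feet'`, `'7 feet'` comes out as all numbers, because the shared word "feet" is removed before the numeric check.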
We take the standard deviation of both content length and content type for each column, then count the cells that fall outside that deviation. If more than 10% of all cells lie outside the standard deviation of type and/or content length for a given column, the table is flagged as a layout table.
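The deviation test for one column can be sketched as follows. The sketch assumes that "outside the standard deviation" means more than one population standard deviation away from the column mean, and that the 10% threshold applies to the cells of that column; both are interpretations, and the function names are hypothetical.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation.
function stdDev(values) {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// Fraction of values lying more than one standard deviation from the mean.
function outlierFraction(values) {
  if (values.length === 0) return 0;
  const m = mean(values);
  const sd = stdDev(values);
  return values.filter((v) => Math.abs(v - m) > sd).length / values.length;
}

// lengths and types are per-cell numbers for a single column.
function columnLooksLikeLayout(lengths, types, threshold = 0.1) {
  return outlierFraction(lengths) > threshold || outlierFraction(types) > threshold;
}
```

A column where every cell has the same type and length has a deviation of zero and no outliers, so it never trips the threshold in this sketch.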