Skip to content

Exported tables and their columns

Adam Hooper edited this page Aug 2, 2021 · 6 revisions

Workbench deals in tables. When you download a table from Workbench, it can be in one of these formats:

Refer to Column Types for Workbench's internal specifications.

Parquet

Parquet is how Workbench stores tables. It stores data by-column, not by-row: this makes math remarkably efficient.

The tables Workbench exports are identical to the tables Workbench uses itself.

Here are the only logical types you should expect -- and the only metadata Workbench defines for them:

  • Signed Integers (up to INT64); FLOAT; DOUBLE: Python treats all these as "number".
    • Don't expect Workbench to always return "number" in the same format: it often changes a "number" column's storage format (as Excel, JavaScript or Python do).
    • The only metadata is format, a UTF-8 string like ${,.2f}, compatible with both Python PEP3101 and d3-format.
    • Workbench does not produce the special floating-point values NaN, Infinity or -Infinity.
  • DATE
    • The only metadata is unit, with ASCII value day, week, month, quarter, year. The metadata determines which values are valid: for instance, week means all values are Mondays; quarter means they're all Jan. 1, Apr. 1, July 1 or Oct. 1.
  • TIMESTAMP(isAdjustedToUTC=true, unit=NANOS), always UTC nanoseconds, with no metadata.
  • STRING, with no metadata.
    • Each column may be dictionary-encoded. Don't expect Workbench to always or never dictionary-encode a column: its decision-making process is complex and subject to change.

Beware: all columns allow nulls. Each column can be stored as two arrays: an array of values, and an array of "is-null" markers. This can be awkward in some tools, such as Numpy.

JSON

JSON is an inefficient but convenient means of transmitting arbitrary data. Workbench returns an Array of row Objects, keyed by column name.

Here are the only logical types you should expect:

  • Numbers: int64 or double-precision.
    • Workbench outputs exact 64-bit integers in JSON; but JavaScript's JSON.parse() will round any number above 9,007,199,254,740,991. You can work around this with json-bigint.
    • Unlike Parquet, there is no format metadata.
  • Dates: YYYY-MM-DD Strings.
    • Unlike Parquet, there is no unit metadata.
  • Timestamps: YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z (RFC3339) Strings.
    • The timezone is always Z (UTC).
  • Strings.

Any value may be null.

CSV

Comma-Separated Values (CSV, RFC4180) is a ubiquitous -- if lossy -- means of transmitting rows of text values.

CSV only transmits text; here's how Workbench maps its values to text:

  • Numbers: int64 or double-precision.
    • Large floating-point numbers can use decimal exponents: for instance, 2.3e52. This notation is compatible with JavaScript's Number() or C's atof().
    • Unlike Parquet, there is no format metadata.
  • Dates: YYYY-MM-DD Strings.
    • Unlike Parquet, there is no unit metadata.
  • Timestamps: YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z (RFC3339) Strings.
    • The timezone is always Z (UTC).
  • Strings.

CSV does not transmit null. Any value may be an empty string. In Number/Date/Timestamp columns you can infer that empty-text = null; but in String columns there is no way to distinguish between empty string and null.