-
Notifications
You must be signed in to change notification settings - Fork 45
Exported tables and their columns
Workbench deals in tables. When you download a table from Workbench, it can be in one of these formats:
Refer to Column Types for Workbench's internal specifications.
Parquet is how Workbench stores tables. It stores data by-column, not by-row: this makes math remarkably efficient.
The tables Workbench exports are identical to the tables Workbench uses itself.
Here are the only logical types you should expect -- and the only metadata Workbench defines for them:
-
Signed Integers (up to
INT64
);FLOAT
;DOUBLE
: Python treats all these as "number".- Don't expect Workbench to always return "number" in the same format: it often changes a "number" column's storage format (as Excel, JavaScript or Python do).
- The only metadata is
format
, a UTF-8 string like${,.2f}
, compatible with both Python PEP3101 and d3-format. - Workbench does not produce the special floating-point values
NaN
,Infinity
or-Infinity
.
-
DATE
- The only metadata is
unit
, with ASCII valueday
,week
,month
,quarter
,year
. The metadata determines which values are valid: for instance,week
means all values are Mondays;quarter
means they're all Jan. 1, Apr. 1, July 1 or Oct. 1.
- The only metadata is
- TIMESTAMP(isAdjustedToUTC=true, unit=NANOS), always UTC nanoseconds, with no metadata.
-
STRING, with no metadata.
- Each column may be dictionary-encoded. Don't expect Workbench to always or never dictionary-encode a column: its decision-making process is complex and subject to change.
Beware: all columns allow nulls. Each column can be stored as two arrays: an array of values, and an array of "is-null" markers. This can be awkward in some tools, such as Numpy.
JSON is a convenient means of sharing structured data. Workbench returns an Array of row Objects, keyed by column name.
Here are the only logical types you should expect:
- Numbers: int64 or double-precision.
- Workbench outputs exact 64-bit integers in JSON; but JavaScript's
JSON.parse()
will round any number above9,007,199,254,740,991
. You can work around this with json-bigint. - Unlike Parquet, there is no
format
metadata.
- Workbench outputs exact 64-bit integers in JSON; but JavaScript's
- Dates:
YYYY-MM-DD
Strings.- Unlike Parquet, there is no
unit
metadata.
- Unlike Parquet, there is no
- Timestamps:
YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z
(RFC3339) Strings.- The timezone is always
Z
(UTC).
- The timezone is always
- Strings.
Any value may be null
.
Comma-Separated Values (CSV, RFC4180) is a ubiquitous -- if lossy -- means of transmitting rows of text values.
CSV only transmits text; here's how Workbench maps its values to text:
- Numbers: int64 or double-precision.
- Large floating-point numbers can use decimal exponents: for instance,
2.3e52
. This notation is compatible with JavaScript'sNumber()
or C'satof()
. - Unlike Parquet, there is no
format
metadata.
- Large floating-point numbers can use decimal exponents: for instance,
- Dates:
YYYY-MM-DD
Strings.- Unlike Parquet, there is no
unit
metadata.
- Unlike Parquet, there is no
- Timestamps:
YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z
(RFC3339) Strings.- The timezone is always
Z
(UTC).
- The timezone is always
- Strings.
CSV does not transmit null. Any value may be an empty string. In Number/Date/Timestamp columns you can infer that empty-text = null; but in String columns there is no way to distinguish between empty string and null.