-
Notifications
You must be signed in to change notification settings - Fork 45
Exported tables and their columns
Workbench deals in tables. When you download a table from Workbench, it can be in one of these formats:
Refer to Column Types for Workbench's internal specifications.
Parquet is how Workbench stores tables. It stores data by-column, not by-row: this makes math remarkably efficient.
The tables Workbench exports are identical to the tables Workbench uses itself.
Here are the only logical types you should expect -- and the only metadata Workbench defines for them:
-
Signed Integers (up to
INT64
);FLOAT
;DOUBLE
: Python treats all these as "number".- Don't expect Workbench to always return "number" in the same format: it often changes a "number" column's storage format (as Excel, JavaScript or Python do).
- The only metadata is
format
, a UTF-8 string like${,.2f}
, compatible with both Python PEP3101 and d3-format. - Workbench does not produce the special floating-point values
NaN
,Infinity
or-Infinity
.
-
DATE
- The only metadata is
unit
, with ASCII valueday
,week
,month
,quarter
,year
. The metadata determines which values are valid: for instance,week
means all values are Mondays;quarter
means they're all Jan. 1, Apr. 1, July 1 or Oct. 1.
- The only metadata is
- TIMESTAMP(isAdjustedToUTC=true, unit=NANOS), always UTC nanoseconds, with no metadata.
-
STRING, with no metadata.
- Each column may be dictionary-encoded. Don't expect Workbench to always or never dictionary-encode a column: its decision-making process is complex and subject to change.
Beware: all columns allow nulls. Each column can be stored as two arrays: an array of values, and an array of "is-null" markers. This can be awkward in some tools, such as Numpy.
JSON is an inefficient but convenient means of transmitting arbitrary data. Workbench returns an Array of row Objects, keyed by column name.
Here are the only logical types you should expect:
- Numbers: int64 or double-precision.
- Workbench outputs exact 64-bit integers in JSON; but JavaScript's
JSON.parse()
will round any number above9,007,199,254,740,991
. You can work around this with json-bigint. - Unlike Parquet, there is no
format
metadata.
- Workbench outputs exact 64-bit integers in JSON; but JavaScript's
- Dates:
YYYY-MM-DD
Strings.- Unlike Parquet, there is no
unit
metadata.
- Unlike Parquet, there is no
- Timestamps:
YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z
(RFC3339) Strings.- The timezone is always
Z
(UTC).
- The timezone is always
- Strings.
Any value may be null
.
Comma-Separated Values (CSV, RFC4180) is a ubiquitous -- if lossy -- means of transmitting rows of text values.
CSV only transmits text; here's how Workbench maps its values to text:
- Numbers: int64 or double-precision.
- Large floating-point numbers can use decimal exponents: for instance,
2.3e52
. This notation is compatible with JavaScript'sNumber()
or C'satof()
. - Unlike Parquet, there is no
format
metadata.
- Large floating-point numbers can use decimal exponents: for instance,
- Dates:
YYYY-MM-DD
Strings.- Unlike Parquet, there is no
unit
metadata.
- Unlike Parquet, there is no
- Timestamps:
YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z
(RFC3339) Strings.- The timezone is always
Z
(UTC).
- The timezone is always
- Strings.
CSV does not transmit null. Any value may be an empty string. In Number/Date/Timestamp columns you can infer that empty-text = null; but in String columns there is no way to distinguish between empty string and null.