Table Data Structures

Adam Hooper edited this page Dec 18, 2019 · 3 revisions

Workbench stores user data in various formats, on disk and in memory. Here are the most important formats.

1. Parameters ("params" and "secrets")

Who stores it: a user, through HTTP form controls

What it looks like: "params" and "secrets" are dataclasses in Python, JSON Objects in JavaScript, and HTML forms from the user's point of view. Think of them as a JSON Object.

When it's stored: when a user creates a Step or changes its parameters.

When it's read: when we fetch or render the Step's module, we call the module's migrate_params() on the Step's params, then we pass params and secrets to the module's fetch() and/or render().

Where we store it: we store "params" and "secrets" as two columns in the step table.

Why store it: so the user can control a Step.

The most obvious example of turning a parameter into a table is pastecsv. The user pastes CSV, creating params={"csv":"A,B\na,b","has_header":True}. pastecsv.render() turns this input into a one-row table: column "A" contains "a" and column "B" contains "b".
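The pastecsv example above can be sketched as a tiny render() function. This is a hedged illustration, not Workbench's actual module API: the render(table, params) signature and the behavior shown here are assumptions for this example only.

```python
# Hypothetical sketch of a pastecsv-style render(): parse the "csv" param
# into a table. The render(table, params) signature is an assumption.
import io

import pandas as pd


def render(table, params):
    # The input table is ignored: pastecsv produces its table from params.
    return pd.read_csv(
        io.StringIO(params["csv"]),
        header=0 if params["has_header"] else None,
    )


result = render(None, {"csv": "A,B\na,b", "has_header": True})
# result is a one-row table with columns "A" and "B"
```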

2. Fetch-result files ("raw files")

Who stores it: a module's fetch() function.

What it looks like: whatever the module wants. For instance, loadurl and gdrive store HTTP-response headers and body in a custom format we call httpfile. Design your module's fetch-result file to look as close to "raw data" as is practical.

When it's stored: when a module's fetch() function runs.

When it's read: in a module's render() function.

Where we store it: the stored-objects object-storage bucket, with a path like {workflowId}/{stepId}/{uuid}. The UUID is stored in the StoredObjects database table.

Why store it: so a module author may edit render(). The render() function parses the raw file; if there's a bug, parsing might fail. The user shouldn't lose data when a module has a bug.

Many Workbench modules store files in Apache Parquet format. As a special case, Workbench automatically reads Parquet files. We recommend you do not pre-process fetched data to store it in Parquet format, since that defeats the purpose of this data layer. (See Why above.)

We encourage you to use data formats that can be reused across modules. For instance, googlesheets and loadurl both store similar data: an HTTP response. They share a custom format we call "httpfile".

Fetch results are stored forever. When you deploy a fetch() function, its output will be fed to every future version of render(). Don't deploy your module until you choose a data format you can support forever. A cautionary tale: googlesheets.fetch() and loadurl.fetch() output Parquet files from 2017 to 2019. Now, their render() functions must still support Parquet fetch-result files, to handle fetch results from that era.

3. Arrow files ("render input/output")

Who stores it: Workbench (as input to render()) and each module (as output from render()).

What it looks like: see cjwkernel.types. The bulk of data is in Apache Arrow format -- specifically, Arrow IPC File Format. Alongside that, some Thrift metadata describes a cjwkernel.types.TableMetadata dataclass with n_rows and columns, a list of column names and their expected types. The Arrow file and Thrift metadata must always agree: anything else is a bug.

When it's stored: Workbench creates an Arrow table to kick off a sequence of render()s. Then, each Step's render() writes an Arrow file that will be fed as input to the next Step in the sequence.

When it's read: Workbench reads every Step's output Arrow file and validates it against its Thrift metadata. It caches each output in Parquet format (see 4. Parquet files).

Where we store it: in memory, and in temporary files. After a Workflow has been rendered, all Arrow files disappear.

Why store it: safe and fast communication between modules and Workbench.

Think of this format as "in-memory table data."

Arrow files are a way of sharing table data between processes. Workbench and modules can read Arrow files using mmap(), and the Linux kernel will swap tabular data in and out of RAM as needed. The upshot: in most cases, the process that writes an Arrow file and the process that reads it access the exact same RAM -- it's zero-copy.

The downside of sharing RAM: Arrow files can't be modified. Each Arrow table is immutable. Each Step must write a new Arrow file. (In fringe cases, a Step may copy its input Arrow file to its output Arrow file path. For instance, formatnumbers changes the Thrift metadata but doesn't change any Arrow data.)

See Column types for supported types and their storage formats.

See cjwkernel.settings for a list of other restrictions we impose on modules' output.

See cjwkernel/thrift.thrift for a description of module input/output metadata.

4. Parquet files ("render cache")

Who stores it: Workbench

What it looks like: an Apache Parquet file with no metadata, plus cjwkernel.types.TableMetadata. In the rendercache module, we also store a list of cjwkernel.types.RenderError and module-output JSON data.

When it's stored: upon receiving output from a module's render().

When it's read: when the user views table data. Also, when rendering a tab: if there's a previously-cached output from Step N, we skip Steps 1 to N and use Step N's cached output to create an Arrow-file input to Step N+1. Also, when considering whether to email a notification that a Step's output has changed.
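The "skip Steps 1 to N" shortcut can be sketched as a small planning function. All names here are hypothetical; the real logic lives in the renderer.

```python
# Toy sketch of the render-cache shortcut: walk the tab's Steps backward,
# find the last one with a cached output, and resume after it.
# plan_render and its cache dict are hypothetical names for illustration.
def plan_render(steps, cache):
    """Return (start_index, input_table) for rendering a tab."""
    for i in range(len(steps) - 1, -1, -1):
        if steps[i] in cache:
            return i + 1, cache[steps[i]]  # resume after the cached Step
    return 0, None  # nothing cached: render from scratch


cache = {"step2": "table-after-step2"}
start, input_table = plan_render(["step1", "step2", "step3"], cache)
# start == 2: Steps 1-2 are skipped; step3 renders from the cached output
```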

Where we store it: the cached-render-results object-storage bucket, with a path like wf-{workflowId}/wfm-{stepId}/delta-{deltaId}.dat. The renderer reads the most-current delta for the module from the delta database table. The metadata, module-output errors and module-output JSON are in step.cached_render_result_* columns in the Step table.

Why store it: a crucial optimization. If we didn't cache render results, we'd need to re-render all Steps each time the user edits a parameter, each time the user views table data, and each time the user scrolls table data.

See Column types for supported types and their storage formats.

The render cache is derived data. Historically, we have cleared all cached data a few times per year. The only user-visible effect: the first time we render a Step after clearing the render cache, we can't consider whether to email a notification that the Step's output has changed.

5. Pandas tables

Who stores it: modules that opt to use Pandas tables. (Workbench itself does not use Pandas, and modules don't need to use Pandas tables either.)

What it looks like: a Pandas table and accompanying metadata. See cjwkernel.pandas.types.ProcessResult for a pervasive pattern in Pandas module input/output.

When it's stored: Workbench's provided module loader creates a Pandas table by default when calling render().

When it's read: Workbench's provided module loader parses a cjwkernel.pandas.types.ProcessResult from render()'s return value and converts it to Arrow format.

Where we store it: Pandas tables only exist in RAM. (Since the Workbench module loader reads Pandas tables from input Arrow files, not-null integer/float column data may be mmapped from disk.)

Why store it: Many module authors are Pandas experts.

See Column types for supported types and their storage formats.

With text data, Pandas can use an astonishing amount of RAM. Pandas will run out of memory before reaching the limits described in cjwkernel/settings.py. (End users see an "exit code -9" error.) If you need your module to scale, make it read and write Arrow-format files directly.