Table Data Structures

Adam Hooper edited this page Dec 18, 2019 · 3 revisions

Workbench stores user data in various formats, on disk and in memory. Here are the most important formats.

1. Parameters ("params" and "secrets")

Who stores it: a user, through HTTP form controls

What it looks like: "params" and "secrets" are dataclasses in Python, JSON Objects in JavaScript, and HTML forms from the user's point of view. Think of them as a JSON Object.

When it's stored: when a user creates a Step or changes its parameters.

When it's read: in the Step's module's fetch() and/or render() function.

Where we store it: we store "params" and "secrets" as two columns in the step table.

Why store it: so the user can make a Step do something useful.

The most obvious example of turning a parameter into a table is pastecsv. The user pastes CSV, creating params={"csv":"A,B\na,b","has_header":True}. pastecsv.render() turns this input into a one-row table: column "A" holds "a" and column "B" holds "b".
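To make the params-to-table step concrete, here is a minimal sketch using only the Python standard library. The function name render_pasted_csv and the dict-of-columns table shape are assumptions for illustration; the real pastecsv.render() and Workbench's table type may differ. The sketch assumes every row has the same number of cells.

```python
import csv
import io


def render_pasted_csv(params):
    """Hypothetical stand-in for pastecsv.render().

    Takes a params dict and returns a table as a dict of column lists.
    """
    rows = list(csv.reader(io.StringIO(params["csv"])))
    if not rows:
        return {}
    if params["has_header"]:
        header, data = rows[0], rows[1:]
    else:
        # Assumed naming scheme for header-less input
        header = [f"Column {i + 1}" for i in range(len(rows[0]))]
        data = rows
    return {name: [row[i] for row in data] for i, name in enumerate(header)}


params = {"csv": "A,B\na,b", "has_header": True}
table = render_pasted_csv(params)
print(table)  # {'A': ['a'], 'B': ['b']}
```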

2. Fetch-result files ("raw files")

Who stores it: a module's fetch() function.

What it looks like: whatever the module wants. For instance, loadurl and gdrive store HTTP-response headers and body in a custom format we call httpfile. Design your module's fetch-result file to look as close to "raw data" as is practical.

When it's stored: when a module's fetch() function runs.

When it's read: in a module's render() function.

Where we store it: the stored-objects object-storage bucket, with a path like {workflowId}/{stepId}/{uuid}. The UUID is stored in the StoredObjects database table.
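As an illustration of the path layout, a key could be assembled like this. The function name stored_object_key and the integer ID types are assumptions, not Workbench's actual code.

```python
import uuid


def stored_object_key(workflow_id: int, step_id: int, object_uuid: uuid.UUID) -> str:
    """Hypothetical helper: build a {workflowId}/{stepId}/{uuid} object key."""
    return f"{workflow_id}/{step_id}/{object_uuid}"


key = stored_object_key(12, 34, uuid.UUID("12345678-1234-5678-1234-567812345678"))
print(key)  # 12/34/12345678-1234-5678-1234-567812345678
```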

Why store it: so a module author can edit render() later. The render() function parses the raw file, and if render() has a bug, parsing might fail. Because the raw file is kept, the user doesn't lose data when a module has a bug.

Many Workbench modules store files in Apache Parquet format. As a special case, Workbench automatically reads Parquet files. We recommend you do not pre-process fetched data to store it in Parquet format, since that defeats the purpose of this data layer. (See Why above.)

We encourage you to store using data formats that can be reused between modules. For instance, googlesheets and loadurl both store similar data: an HTTP response. They share a custom format we call "httpfile".

Fetch results are stored forever. When you deploy a fetch() function, its output will be fed to every future version of render(). Don't deploy your module until you choose a data format you can support forever. A cautionary tale: googlesheets.fetch() and loadurl.fetch() output Parquet files from 2017 to 2019. Their render() functions must still support Parquet fetch-result files today, to handle fetch results stored during those years.
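One way a render() function might tell old fetch results from new ones is to check for the Parquet magic bytes: a Parquet file begins and ends with the four bytes PAR1. The sketch below is an assumption about how such a check could look, not the modules' actual code; looks_like_parquet, choose_parser and the parser names are hypothetical.

```python
import os
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these four bytes


def looks_like_parquet(path: Path) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    with path.open("rb") as f:
        head = f.read(4)
        size = f.seek(0, os.SEEK_END)
        if size < 8:
            return False  # too short to hold both magic markers
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def choose_parser(path: Path) -> str:
    """Hypothetical dispatch: name the parser render() should use."""
    return "legacy-parquet" if looks_like_parquet(path) else "httpfile"


with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old-fetch-result"
    old.write_bytes(PARQUET_MAGIC + b"\x00" * 8 + PARQUET_MAGIC)  # fake Parquet shell
    new = Path(tmp) / "new-fetch-result"
    new.write_bytes(b"not parquet at all")
    print(choose_parser(old))  # legacy-parquet
    print(choose_parser(new))  # httpfile
```

The magic-byte check only identifies the container format. A real render() would still need to handle a file that passes the check but fails to parse.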