How we render a step
Each Step in a Workflow corresponds to a "module". Its inputs are:
- `arrow_table` -- the prior Step's output
- `settings` -- e.g., row-number limit, column-name-length limit
- `fetch_result` -- if this is a Fetch module and data was downloaded
- `params` -- form fields entered by the user (and validated and massaged by Workbench at render time)
- `secrets` -- tokens even the user may not read
- `tabs` -- output tables from other Tabs that this Step's params require
Its outputs are:
- `table`: the new output table (may be null, if an error is set)
- `errors`: a list of "errors" (if `table` is null) or "warnings" (if `table` is not null), in an i18n-ready format. (We store translation keys in the database; Workbench translates them when a visitor views the workflow.)
Outputs are cached so that users can view them on the Workbench website.
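To make this contract concrete, here is a minimal sketch of the module-side interface. The parameter names mirror the lists above but are illustrative, and `RenderResult` is an assumed holder type -- Workbench's real module API may spell these differently:

```python
from typing import Any, Dict, List, NamedTuple, Optional

import pyarrow as pa


class RenderResult(NamedTuple):
    table: Optional[pa.Table]     # None when the Step has a hard error
    errors: List[Dict[str, Any]]  # i18n-ready: translation key + arguments


def render(
    arrow_table: pa.Table,        # the prior Step's output
    params: Dict[str, Any],       # validated, massaged form fields
    *,
    settings: Dict[str, Any],     # e.g. row-number limit
    fetch_result: Optional[str],  # path to fetched data, if this is a Fetch module
    secrets: Dict[str, Any],      # tokens even the user may not read
    tabs: Dict[str, pa.Table],    # output tables from other Tabs this Step needs
) -> RenderResult:
    column = params.get("column")
    if column not in arrow_table.column_names:
        # Return a translation key, not a finished sentence: Workbench stores
        # the key and translates it when a visitor views the workflow.
        return RenderResult(
            None, [{"id": "errors.column_missing", "arguments": {"column": column}}]
        )
    return RenderResult(arrow_table.select([column]), [])
```

The key point is the output shape: a (possibly-null) table plus a list of i18n-ready errors.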
Ignoring optimizations, here's what happens during a render:
- Workbench loads `params`, `secrets` and `settings` from the database.
- Workbench loads `fetch_result` from the "Stored Objects" S3 bucket (if it exists). The file may have any format -- the module's own `fetch()` function generated it.
- Workbench loads `table` and `tabs` -- output from the previous Step and other Tabs, if applicable -- from the "Render Cache" S3 bucket. This is Parquet format (see Table Data Structures).
- Workbench converts the Parquet files to Arrow (see the pyarrow sketch after this walkthrough). (Parquet-format files are compressed and the format won't change frequently. Arrow-format files are for fast computation and inter-process communication.)
- Workbench spawns a new process with the module. It links the `fetch_result` file and Arrow files into the process's sandbox.
- Workbench transmits all inputs (other than the `fetch_result` file and Arrow files) in a Thrift format to the module process.
- The module process decodes the Thrift data and opens the `fetch_result` and Arrow files.
- The module process executes `render()`.
- The module process writes a new Arrow table to `output_filename`, outputs table metadata and `errors` to stdout, and exits with status `0`.
- Workbench validates the Step-output Arrow file, metadata and exit status.
- Workbench converts the Arrow file to Parquet and stores it in the "Render Cache". It stores metadata (such as `errors` and `columns`) in the database.
... and then Workbench moves on to the next Step.
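As an illustration of the Parquet-to-Arrow and Arrow-to-Parquet conversions above, here is a minimal sketch using pyarrow. Workbench's actual conversion helpers may differ; treat the function names and paths as placeholders:

```python
import pyarrow as pa
import pyarrow.ipc
import pyarrow.parquet as parquet


def parquet_to_arrow_file(parquet_path: str, arrow_path: str) -> pa.Table:
    """Load a cached Parquet file and write it as an Arrow file for the module."""
    table = parquet.read_table(parquet_path)  # decompress into an in-memory Arrow table
    with pa.OSFile(arrow_path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)
    return table


def arrow_file_to_parquet(arrow_path: str, parquet_path: str) -> pa.Table:
    """Read the module's Arrow output and store it, compressed, in the render cache."""
    with pa.ipc.open_file(arrow_path) as reader:
        table = reader.read_all()
    parquet.write_table(table, parquet_path)  # Parquet is compact and stable on disk
    return table
```

Parquet keeps the render cache compressed and stable on disk; the Arrow files are what the module process reads and writes, because Arrow is cheap to share between processes.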
The optimizations and details we glossed over:

- Workbench renders a whole workflow at a time. When it finishes a Step but still needs its output, it re-uses the Arrow file instead of reading the Parquet file it just wrote to the render cache.
- Workbench skips the module code entirely if it runs into an error while massaging `params` (for instance, if a user chose a "Timestamp" column for timestamp math, but it has become a "Text" column by render time).
- Module-process spawning is cached.
- Each module process is sandboxed during creation. Read our 3-part series on sandboxing for details. Briefly:
- The process can't access our internal network.
- The process runs as non-root, and it can't gain privileges.
- The process can't read Workbench's environment variables.
- The process has a greatly constrained filesystem: it can read only the files we provide (such as approved Python modules and input files), and it can write only output files and temporary files, up to a constrained filesystem size. All files the process writes are destroyed before another process is spawned.
- The process's RAM and CPU are bounded by cgroups.
- The process's child processes all die when the process exits.
- The process is killed if it exceeds a timeout.
- The process gets no file handles but `stdin`, `stdout` and `stderr`. Its `stderr` and `stdout` are truncated to a fixed buffer size, and its `stdout` is validated using Thrift. (A simplified sketch of some of these limits follows this list.)
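Here is a greatly simplified sketch of spawning a module process with a timeout, bounded resources, and truncated `stdout`/`stderr`. It uses Python's `subprocess` and `resource` modules as stand-ins; the real sandbox uses cgroups, namespaces and the other protections listed above, so treat this only as an illustration of the limits, not as Workbench's implementation. The constants and function names are made up for the example:

```python
import resource
import subprocess

MAX_OUTPUT_BYTES = 1024 * 1024  # fixed buffer size for stdout/stderr (illustrative)
TIMEOUT_SECONDS = 300           # illustrative timeout


def _limit_resources():
    # Runs in the child just before exec: cap CPU seconds and address space.
    # (Production uses cgroups for RAM and CPU; rlimits are a simple stand-in.)
    resource.setrlimit(resource.RLIMIT_CPU, (60, 60))
    resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, 2 * 1024**3))


def run_module(command):
    try:
        completed = subprocess.run(
            command,
            capture_output=True,           # the child only gets stdin/stdout/stderr
            timeout=TIMEOUT_SECONDS,       # kill the process if it exceeds the timeout
            preexec_fn=_limit_resources,   # bound CPU and memory in the child
        )
    except subprocess.TimeoutExpired:
        return None, b"", b"module timed out"
    # Truncate stdout/stderr to a fixed buffer size; stdout would then be
    # validated (in Workbench's case, as Thrift-encoded metadata).
    stdout = completed.stdout[:MAX_OUTPUT_BYTES]
    stderr = completed.stderr[:MAX_OUTPUT_BYTES]
    return completed.returncode, stdout, stderr
```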