How we render a step

Each Step in a Workflow corresponds to a "module". Its inputs are:

  • arrow_table -- the prior Step's output
  • settings -- e.g., row-number limit, column-name-length limit
  • fetch_result -- if this is a Fetch module and data was downloaded
  • params -- form fields entered by the user (and validated and massaged by Workbench at render time)
  • secrets -- tokens even the user may not read
  • tabs -- output tables from other Tabs that this step's params require

Its outputs are:

  • table: the new output table (may be null, if an error is set)
  • errors: a list of "errors" (if table is null) or "warnings" (if table is not null), in an i18n-ready format. (We store translation keys in the database; Workbench translates them when a visitor views the workflow.)

Outputs are cached so that users can view them on the Workbench website.
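
To make these inputs and outputs concrete, here's a minimal, hypothetical sketch of a module's render() entry point. The exact signature and return types are defined by Workbench's module API; the names, types and example error key below are illustrative assumptions, not the real interface.

```python
# Hypothetical sketch only -- names, types and defaults are illustrative,
# not Workbench's actual module API.
from typing import Any, Dict, List, NamedTuple, Optional

import pyarrow as pa


class RenderResult(NamedTuple):
    table: Optional[pa.Table]     # None when errors are fatal ("errors", not "warnings")
    errors: List[Dict[str, Any]]  # i18n-ready messages: translation key + arguments


def render(
    arrow_table: pa.Table,                       # the prior Step's output
    params: Dict[str, Any],                      # validated, massaged form fields
    *,
    settings: Dict[str, Any],                    # e.g. row-number limit
    fetch_result: Optional[object] = None,       # downloaded data, if a Fetch module
    secrets: Optional[Dict[str, Any]] = None,    # tokens even the user may not read
    tabs: Optional[Dict[str, pa.Table]] = None,  # other Tabs' outputs this Step needs
) -> RenderResult:
    if params.get("column") not in arrow_table.column_names:
        # Fatal error: no output table, one i18n-ready error message
        return RenderResult(None, [{"id": "errors.noColumn", "arguments": {}}])
    # ... transform arrow_table according to params ...
    return RenderResult(arrow_table, [])
```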

Anatomy of a render

Ignoring optimizations, here's what happens during a render:

  1. Workbench loads params, secrets and settings from the database.
  2. Workbench loads fetch_result from the "Stored Objects" S3 bucket (if it exists). The file may have any format -- the module's own fetch() function generated it.
  3. Workbench loads table and tabs -- output from the previous step and other tabs, if applicable -- from the "Render Cache" S3 bucket. These files are in Parquet format (see Table Data Structures).
  4. Workbench converts the Parquet files to Arrow, as sketched after this list. (Parquet-format files are compressed and the format won't change frequently. Arrow-format files are for fast computation and inter-process communication.)
  5. Workbench spawns a new process to run the module. It links the fetch_result file and Arrow files into the process's sandbox.
  6. Workbench transmits all other inputs (everything but the fetch_result file and Arrow files) to the module process in Thrift format.
  7. The module process decodes the Thrift data and opens fetch_result and Arrow files.
  8. The module process executes render().
  9. The module process writes a new Arrow table to output_filename, outputs table metadata and errors to stdout, and exits with status 0.
  10. Workbench validates the Step-output Arrow file, metadata and exit status.
  11. Workbench converts the Arrow file to Parquet and stores it in the "Render Cache". It stores metadata (such as errors and columns) in the database.

... and then Workbench moves on to the next Step.
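
As a rough illustration of step 4, the Parquet-to-Arrow conversion can be pictured with plain pyarrow calls. This is a simplified sketch with made-up paths, not Workbench's actual conversion code:

```python
# Simplified sketch of step 4; paths are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq


def parquet_to_arrow(parquet_path: str, arrow_path: str) -> None:
    # Parquet: compressed, stable on-disk format in the Render Cache
    table = pq.read_table(parquet_path)
    # Arrow IPC file: fast for the sandboxed module process to open and compute on
    with pa.OSFile(arrow_path, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)
```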

Optimizations

  • Workbench renders one workflow at a time, Step by Step. When it finishes a Step whose output is still needed, it re-uses the Arrow file already on local disk instead of re-reading the Parquet file it just wrote to the render cache (sketched after this list).
  • Workbench skips the module code entirely if it runs into an error while massaging params (for instance, if a user chose a "Timestamp" column for timestamp math, but it's now a "Text" column come render time).
  • Module-process spawning is cached, so Workbench doesn't pay full process-startup cost on every render.
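
The first optimization boils down to a check like the one below. This is an illustrative sketch with made-up function and path names, not Workbench's actual code:

```python
# Illustrative sketch of re-using a just-rendered Arrow file (names made up).
from typing import Optional

import pyarrow as pa
import pyarrow.parquet as pq


def load_previous_step_output(
    arrow_path: Optional[str], cached_parquet_path: str
) -> pa.Table:
    if arrow_path is not None:
        # The previous Step just rendered in this pass: its Arrow file is
        # still on local disk, so skip re-reading the Render Cache.
        with pa.memory_map(arrow_path, "r") as source:
            return pa.ipc.open_file(source).read_all()
    # Otherwise, fall back to the Parquet copy in the Render Cache.
    return pq.read_table(cached_parquet_path)
```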

Security constraints

  • Each module process is sandboxed during creation. Read our 3-part series on sandboxing for details. Briefly:
    • The process can't access our internal network.
    • The process runs as non-root, and it can't gain privileges.
    • The process can't read Workbench's environment variables.
    • The process has a greatly-constrained filesystem: it can read only the files we provide (such as approved Python modules and input files), and it can write only output files and temporary files, up to a bounded size. All files the process writes are destroyed before another process is spawned.
    • The process's RAM and CPU are bounded by cgroups.
    • The process's child processes all die when the process exits.
    • The process is killed if it exceeds a timeout.
    • The process gets no file handles but stdin, stdout and stderr. Its stderr and stdout are truncated to a fixed buffer size, and its stdout is validated using Thrift. (A rough sketch of these last two constraints follows this list.)
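
The timeout and output-truncation constraints can be pictured with a standard-library sketch like the one below. It is purely illustrative: Workbench's real sandbox relies on the mechanisms listed above, and the buffer size and timeout here are made-up values.

```python
# Illustrative only: a crude stand-in for the timeout and output-truncation
# constraints. Workbench's real sandbox uses the mechanisms listed above.
import subprocess

MAX_OUTPUT_BYTES = 1024 * 1024  # assumed buffer size, for illustration
TIMEOUT_SECONDS = 300           # assumed timeout, for illustration


def run_module_process(args: list, thrift_input: bytes):
    """Run a module process; return (exit status, stdout, stderr)."""
    try:
        completed = subprocess.run(
            args,
            input=thrift_input,  # inputs transmitted to the module over stdin
            capture_output=True,
            timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return None, b"", b"module process timed out"
    # Truncate stdout (Thrift-encoded metadata and errors) and stderr (logs)
    # to a fixed buffer size; stdout would then be validated before use.
    stdout = completed.stdout[:MAX_OUTPUT_BYTES]
    stderr = completed.stderr[:MAX_OUTPUT_BYTES]
    return completed.returncode, stdout, stderr
```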