Update docs for lateral subqueries and over operator (#5264)
philrz authored Sep 16, 2024
1 parent 81194aa commit 43053bf
Showing 2 changed files with 63 additions and 44 deletions.
73 changes: 58 additions & 15 deletions docs/language/lateral-subqueries.md
@@ -7,11 +7,19 @@ sidebar_label: Lateral Subqueries

Lateral subqueries provide a powerful means to apply a Zed query
to each subsequence of values generated from an outer sequence of values.
The inner query may be _any Zed query_ and may refer to values from
The inner query may be _any_ dataflow operator sequence (excluding
[`from` operators](operators/from.md)) and may refer to values from
the outer sequence.

:::tip Note
This pattern rhymes with the SQL pattern of a "lateral
join", which runs a subquery for each row of the outer query's results.
:::

Lateral subqueries are created using the scoped form of the
[`over` operator](operators/over.md) and may be nested to arbitrary depth.
[`over` operator](operators/over.md). They may be nested to arbitrary depth,
and accesses to variables in parent lateral query bodies follow lexical
scoping.

For example,
```mdtest-command
@@ -24,7 +32,7 @@ produces
{name:"foo",elem:2}
{name:"bar",elem:3}
```
Here the lateral scope, described below, creates a subquery
Here the [lateral scope](#lateral-scope), described below, creates a subquery
```
yield {name,elem:this}
```
@@ -41,7 +49,7 @@ The first subquery thus operates on the input values `1, 2` with the variable
{name:"foo",elem:1}
{name:"foo",elem:2}
```
and the second subquery operators on the input value `3` with the variable
and the second subquery operates on the input value `3` with the variable
`name` set to "bar", emitting
```
{name:"bar",elem:3}
@@ -81,17 +89,23 @@ between each `<expr>` evaluated in the outer scope and each `<var>`, which
represents a new symbol in the inner scope of the `<query>`.
In the field reference form, a single identifier `<field>` refers to a field
in the parent scope and makes that field's value available in the lateral scope
with the same name.
via the same name.

Note that any such variable definitions override [implied field references](dataflow-model.md#implied-field-references) of
`this`. If both a field named `x` and a variable named `x` need to be
referenced in the lateral scope, the field reference should be qualified as
`this.x` while the variable is referenced simply as `x`.

The `<query>`, which may be any Zed query, is evaluated once per outer value
The `<query>` is evaluated once per outer value
on the sequence generated by the `over` expression. In the lateral scope,
the value `this` refers to the inner sequence generated from the `over` expressions.
This query runs to completion for each inner sequence and emits
each subquery result as each inner sequence traversal completes.
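
The evaluation order described here can be modeled in a few lines of Python. This is only a sketch of the semantics, not how Zed implements them: `lateral`, `over_expr`, `query`, and `with_vars` are hypothetical names, and the input records (including the field names `s` and `a`) are made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional


def lateral(
    outer: Iterable[Any],
    over_expr: Callable[[Any], List[Any]],
    query: Callable[[List[Any], Dict[str, Any]], Iterable[Any]],
    with_vars: Optional[Dict[str, Callable[[Any], Any]]] = None,
) -> Iterator[Any]:
    """Model of a lateral subquery (hypothetical helper, not a Zed API)."""
    for value in outer:
        # `with` expressions are evaluated in the outer scope, once per outer value.
        scope = {name: expr(value) for name, expr in (with_vars or {}).items()}
        # The `over` expression generates the inner sequence for this outer value.
        subsequence = list(over_expr(value))
        # The query runs to completion on this subsequence before the next one starts.
        yield from query(subsequence, scope)


# Made-up input records; the field names `s` and `a` are assumptions.
records = [{"s": "foo", "a": [5, 4]}, {"s": "bar", "a": [3]}]

# Counterpart of `yield {name,elem:this}` with `name` bound through `with`.
pairs = list(
    lateral(
        records,
        over_expr=lambda r: r["a"],
        query=lambda seq, scope: [{"name": scope["name"], "elem": e} for e in seq],
        with_vars={"name": lambda r: r["s"]},
    )
)

# Counterpart of a per-subsequence sort: each subsequence is sorted on its own,
# so the combined output is not globally sorted.
sorted_each = list(lateral(records, lambda r: r["a"], lambda seq, scope: sorted(seq)))

print(pairs)
print(sorted_each)
```

In this model the sort sees `[5, 4]` and `[3]` separately, which is why its combined output is `4, 5, 3` rather than a globally sorted sequence.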

This structure is powerful because _any_ Zed query can appear in the body of
the lateral scope. In contrast to the `yield` example, a sort could be
applied to each subsequence in the subquery, where sort
This structure is powerful because _any_ dataflow operator sequence (excluding
[`from` operators](operators/from.md)) can appear in the body of
the lateral scope. In contrast to the [`yield`](operators/yield.md) example above, a [`sort`](operators/sort.md) could be
applied to each subsequence in the subquery, where `sort`
reads all values of the subsequence, sorts them, emits them, then
repeats the process for the next subsequence. For example,
```mdtest-command
@@ -112,13 +126,12 @@ parenthesized form:
```
( over <expr> [, <expr>...] [with <var>=<expr> [, ... <var>[=<expr>]] | <lateral> )
```
> Note that the parentheses disambiguate a lateral expression from a lateral
> dataflow operator.

This form must always include a lateral scope as indicated by `<lateral>`,
which can be any dataflow operator sequence excluding [`from` operators](operators/from.md).
As with the `over` operator, values from the outer scope can be brought into
the lateral scope using the `with` clause.
:::tip
The parentheses disambiguate a lateral expression from a [lateral dataflow operator](operators/over.md).
:::

This form must always include a [lateral scope](#lateral-scope) as indicated by `<lateral>`.

The lateral expression is evaluated by evaluating each `<expr>` and feeding
the results as inputs to the `<lateral>` dataflow operators. Each time the
@@ -148,3 +161,33 @@ produces
{sorted:[1,4,7],sum:12}
{sorted:[1,2,3],sum:6}
```
Because Zed expressions evaluate to a single result, if multiple values remain
at the conclusion of the lateral dataflow, they are automatically wrapped in
an array, e.g.,
```mdtest-command
echo '{x:1} {x:[2]} {x:[3,4]}' |
zq -z 'yield {s:(over x | yield this+1)}' -
```
produces
```mdtest-output
{s:2}
{s:3}
{s:[4,5]}
```
To handle such dynamic input data, you can ensure your downstream dataflow
always receives consistently packaged values by explicitly wrapping the result
of the lateral scope, e.g.,
```mdtest-command
echo '{x:1} {x:[2]} {x:[3,4]}' |
zq -z 'yield {s:(over x | yield this+1 | collect(this))}' -
```
produces
```mdtest-output
{s:[2]}
{s:[3]}
{s:[4,5]}
```
Similarly, a primitive value may be consistently produced by concluding the
lateral scope with an operator such as [`head`](operators/head.md) or
[`tail`](operators/tail.md), or by applying certain [aggregate functions](aggregates/README.md),
as done with [`sum`](aggregates/sum.md) above.
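
The wrapping behavior shown in the two examples above can be modeled in Python as follows. This is a sketch under stated assumptions, not Zed's implementation: `over` and `lateral_expr` are hypothetical helpers, maps are left out, and the case where no values remain is deliberately not modeled because the text does not describe it.

```python
from typing import Any, List


def over(value: Any) -> List[Any]:
    # Arrays generate their elements; other values generate themselves.
    # (Maps are left out of this sketch.)
    return list(value) if isinstance(value, list) else [value]


def lateral_expr(results: List[Any]) -> Any:
    # One remaining value is returned bare; several are wrapped in an array.
    # The empty case is not described in the text, so it is not modeled.
    if not results:
        raise NotImplementedError("empty lateral result is not modeled here")
    return results[0] if len(results) == 1 else results


inputs = [1, [2], [3, 4]]  # the values of `x` from the examples above

# Counterpart of `(over x | yield this+1)`.
implicit = [lateral_expr([v + 1 for v in over(x)]) for x in inputs]

# Counterpart of `(over x | yield this+1 | collect(this))`: collecting reduces
# each subsequence to one array, so exactly one value always remains.
explicit = [lateral_expr([[v + 1 for v in over(x)]]) for x in inputs]

print(implicit)  # [2, 3, [4, 5]]
print(explicit)  # [[2], [3], [4, 5]]
```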
34 changes: 5 additions & 29 deletions docs/language/operators/over.md
@@ -12,45 +12,21 @@ The `over` operator traverses complex values to create a new sequence
of derived values (e.g., the elements of an array) and either
(in the first form) sends the new values directly to its output or
(in the second form) sends the values to a scoped computation as indicated
by `<lateral>`, which may represent any Zed subquery operating on the
derived sequence of values as `this`.
by `<lateral>`, which may represent any Zed [subquery](../lateral-subqueries.md) operating on the
derived sequence of values as [`this`](../dataflow-model.md#the-special-value-this).

Each expression `<expr>` is evaluated in left-to-right order and derived sequences are
generated from each such result depending on its type:
* an array value generates each of its element,
* an array value generates each of its elements,
* a map value generates a sequence of records of the form `{key:<key>,value:<value>}` for each
entry in the map, and
* all other values generate a single value equal to itself.

Records can be converted to maps with the [_flatten_ function](../functions/flatten.md)
Records can be converted to maps with the [`flatten` function](../functions/flatten.md)
resulting in a map that can be traversed,
e.g., if `this` is a record, it can be traversed with `over flatten(this)`.
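
The three traversal rules listed above can be sketched in Python. This is an illustrative model only: `ZedMap` is a hypothetical stand-in for a Zed map value, a plain `dict` stands in for a record, and `flatten` is not modeled.

```python
from typing import Any, List


class ZedMap(dict):
    """Stand-in for a Zed map value; a plain dict stands in for a record."""


def over(value: Any) -> List[Any]:
    if isinstance(value, list):
        # An array generates each of its elements.
        return list(value)
    if isinstance(value, ZedMap):
        # A map generates one {key, value} record per entry.
        return [{"key": k, "value": v} for k, v in value.items()]
    # All other values, records included, generate a single value equal to itself.
    return [value]


array_out = over([1, 2, 3])
map_out = over(ZedMap({"a": 1, "b": 2}))
record_out = over({"a": 1})
scalar_out = over("hello")

print(array_out, map_out, record_out, scalar_out)
```

Note that the record passes through unchanged in this model, which is why the text suggests converting it to a map first when its fields need to be traversed.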

The nested subquery depicted as `<lateral>` is called a "lateral query" as the
outer query operates on the top-level sequence of values while the lateral
query operates on subsequences of values derived from each input value.
This pattern rhymes with the SQL pattern of a "lateral join", which runs a
SQL subquery for each row of the outer query's table.

In a Zed lateral query, each input value induces a derived subsequence and
for each such input, the lateral query runs to completion and yields its results.
In this way, operators like `sort` and `summarize`, which operate on their
entire input, run to completion for each subsequence and yield to the output the
lateral result set for each outer input as a sequence of values.

Within the lateral query, `this` refers to the values of the subsequence thereby
preventing lateral expressions from accessing the outer `this`.
To accommodate such references, the _over_ operator includes a _with_ clause
that binds arbitrary expressions evaluated in the outer scope
to variables that may be referenced by name in the lateral scope.

> Note that any such variable definitions override implied field references
> of `this`. If a both a field named "x" and a variable named "x" need be
> referenced in the lateral scope, the field reference should be qualified as `this.x`
> while the variable is referenced simply as `x`.
Lateral queries may be nested to arbitrary depth and accesses to variables
in parent lateral query bodies follows lexical scoping.
The nested subquery depicted as `<lateral>` is called a [lateral subquery](../lateral-subqueries.md).

### Examples
