Update aggregations on zero rows for subqueries #175

Open
wants to merge 1 commit into
base: master
@@ -1,7 +1,7 @@
= Understanding aggregations on zero rows
:slug: understanding-aggregations-on-zero-rows
:author: Andrew Bowman
:neo4j-versions: 3.5, 4.0, 4.1, 4.2, 4.3, 4.4
:neo4j-versions: 3.5, 4.0, 4.1, 4.2, 4.3, 4.4, 5.x
:tags: cypher
:category: cypher

@@ -180,3 +180,43 @@ However, it should be clear that setting the grouping key to null can have negat
If we don't return and inspect the output, it's possible for bad data to have been written to the graph, and who knows when that would be detected.

For these reasons, we believe it is more correct to stay at 0 rows in these situations than to suddenly and unexpectedly change variable values and let the query continue in a not-so-sane state.


=== Avoiding the problem by using subqueries

`CALL {}` subqueries were introduced in Neo4j 4.0.
A subquery executes once per input row, which gives us another means of avoiding this zero-rows problem.

All of the data for a row prior to the subquery call will continue to exist until the subquery call finishes, even if rows go to zero midway through subquery execution.

If the subquery finishes and there are no rows to return at that point, only then is the corresponding row and its data wiped out.
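
As a minimal sketch of that failure case, assume a graph in which no `:Person` node has a `title` property. The labels and property mirror the example used in this article; returning `person` directly, with no aggregation, is an illustrative choice.

[source,cypher]
----
MATCH (movie:Movie)
WHERE movie.title IS NOT NULL
WITH count(movie) AS movieCount

CALL {
  // Nothing matches, and nothing recovers rows before the subquery ends
  MATCH (person:Person)
  WHERE person.title IS NOT NULL
  RETURN person
}

// The subquery returned no rows, so the outer row and movieCount are gone
RETURN movieCount, person
----

Under that assumption the query should return no rows at all, even though `movieCount` was computed before the subquery call.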

But if rows are recovered before the subquery ends, such as by performing an aggregation (without a grouping key), then at least one row is returned, and the data for the outer row persists after the subquery returns.

In this way, by separating the segment of Cypher that could fail to match and hit zero rows into its own subquery, and using an aggregation to recover rows in any case, we can circumvent the problem:

[source,cypher]
----
MATCH (movie:Movie)
WHERE movie.title IS NOT NULL
WITH count(movie) AS movieCount

CALL {
  // No grouping key here, so count() returns one row (0) even when nothing matches
  MATCH (person:Person)
  WHERE person.title IS NOT NULL
  WITH count(person) AS personCount
  RETURN personCount
}

RETURN personCount, movieCount
----

Notice that when we aggregate within the subquery, we no longer need to use `movieCount` as the grouping key.
Why? Because `movieCount` exists outside the scope of the subquery, we don't need to retain it or address it at all within the subquery.

Also, because subqueries execute per row, the row itself already predefines the grouping, even if we don't reference it within the subquery.

This allows our `count()` aggregation to recover from zero rows, giving us a count of 0, and since a row is returned from the subquery, we retain the `movieCount` data from prior to the subquery call.
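
The per-row behavior is perhaps easier to see in a correlated subquery. The following is a sketch only: the `ACTED_IN` relationship type and the `actorCount` alias are assumptions that do not appear in the example above, and importing a variable with a leading `WITH` requires Neo4j 4.1 or later.

[source,cypher]
----
MATCH (movie:Movie)

CALL {
  // Import the current row's movie; the subquery runs once per movie
  WITH movie
  MATCH (movie)<-[:ACTED_IN]-(person:Person)
  // No grouping key inside the subquery, so a movie with no actors still yields one row with 0
  RETURN count(person) AS actorCount
}

RETURN movie.title AS title, actorCount
----

Each movie should come back with its own count, including movies with no matching actors, because the aggregation recovers a row for every subquery execution.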

Note that this only applies to the row data that existed prior to the subquery call.
Any new data introduced within the subquery is still wiped out, as expected, if rows go to zero during the subquery execution.
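
One way this can surface is an aggregation inside the subquery that uses a grouping key introduced there. In the sketch below, the `title` alias used as a grouping key is an illustrative assumption, as is a graph in which no `:Person` node has a `title` property.

[source,cypher]
----
MATCH (movie:Movie)
WHERE movie.title IS NOT NULL
WITH count(movie) AS movieCount

CALL {
  MATCH (person:Person)
  WHERE person.title IS NOT NULL
  // title is a grouping key created inside the subquery,
  // so count() cannot recover a row when nothing matched
  WITH person.title AS title, count(person) AS personCount
  RETURN title, personCount
}

RETURN title, personCount, movieCount
----

With no matching people, the grouped aggregation produces zero rows, the subquery returns nothing, and the outer row (including `movieCount`) should be dropped as well.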