We recently put up some sync performance fixes to ensure that block-level syncs are fast, for a reasonably large load, to allow for a safe incremental release. After pushing these changes to production, we discovered that the improvements actually negatively impacted regular FacilityGroup syncing.
This impact wasn't much, but it was enough that we decided to revert the changes.
what actually causes the slowness
The slowness lies in a couple of distinct places: first, the query that fetches records for "other facilities" (facilities other than the current one), and second, the number of times that query is called during a single request-response cycle.
The first one is more nuanced than just writing a faster query. In this diff from the revert, notice the reverted changes (in red); this piece of code was one of the bottlenecks.
This creates a kind of "inversion" point: up to a certain number of syncable_patients, this version is faster for most syncs. For a FacilityGroup, however, that number is typically very large, which makes this version slower than the original query.
After running some benchmarks, we discovered that the inversion point is somewhere around 4-5k syncable_patients.
It stands to reason, then, that if we simply branched the code on the syncable_patients count, we'd get good results in both scenarios – this is mostly true, but there's more to it...
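A minimal sketch of that branch, with an assumed threshold of 5k (the benchmarked inversion point) — the method and scope names here are illustrative, not the app's real API:

```ruby
# Hypothetical branching on the syncable_patients count. The threshold and
# the two query symbols are stand-ins for the real ActiveRecord scopes.
INVERSION_THRESHOLD = 5_000 # benchmarked inversion point: ~4-5k syncable_patients

def other_facility_records(syncable_patient_count)
  if syncable_patient_count < INVERSION_THRESHOLD
    :patient_scoped_query # the reverted version, faster for small patient counts
  else
    :original_query       # the original query wins for large FacilityGroups
  end
end

other_facility_records(2_000)  # => :patient_scoped_query
other_facility_records(50_000) # => :original_query
```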
To understand precisely why the original query is slow, it's worth looking at its bottlenecks in different scenarios. Further benchmarking shows that this query,
is actually blazing fast for a large number of region_records (i.e. a large number of syncable patients) but extremely slow when that number is very small. This is quite unintuitive behavior, but looking at the EXPLAIN alongside the query, it's clear that the primary limiting factor is the LIMIT.
Effectively, the probability of finding the first 1000 BPs in 30k patients is higher than finding the first 1000 BPs in 2k patients. In other words, finding the first 1000 rows matching X becomes slower as the condition X becomes more restrictive.
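A toy simulation of this effect, under the simplifying assumption that matching rows are evenly spaced: to collect the first `limit` matches, a scan has to visit roughly `limit / selectivity` rows, so a more restrictive condition means more rows scanned before the LIMIT is satisfied.

```ruby
# Count how many rows a sequential scan must visit to find the first
# `limit` rows that match, when 1 in every `match_every` rows matches.
def rows_scanned_for(limit:, match_every:)
  scanned = 0
  matches = 0
  while matches < limit
    scanned += 1
    matches += 1 if (scanned % match_every).zero?
  end
  scanned
end

broad  = rows_scanned_for(limit: 1000, match_every: 2)  # 1 in 2 rows match
narrow = rows_scanned_for(limit: 1000, match_every: 15) # 1 in 15 rows match
broad  # => 2000 rows scanned
narrow # => 15000 rows scanned
```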
any
If we look at the slower EXPLAIN, we notice that most of the time is spent in the loops: we loop over each blood pressure and check whether there's a matching patient for it. As explained earlier, this is generally very inefficient.
Rewriting this query with ANY(array()) reaps benefits (EXPLAIN). Notice how the loop count has dropped dramatically: we effectively do a single built-in ANY check against an array rather than iterating over each BP.
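Roughly, the shape of the rewrite looks like this — the table and column names below are assumptions for illustration, not the app's actual schema:

```ruby
# Before: the planner re-checks subquery membership per blood-pressure row,
# which shows up as the large loop counts in the EXPLAIN.
nested_loop_sql = <<~SQL
  SELECT * FROM blood_pressures
  WHERE patient_id IN (SELECT id FROM patients WHERE registration_facility_id IN (...))
  ORDER BY updated_at ASC
  LIMIT 1000
SQL

# After: ARRAY(...) materializes the patient ids once, and = ANY(...) does a
# single built-in membership check per row instead of a per-row loop.
any_array_sql = <<~SQL
  SELECT * FROM blood_pressures
  WHERE patient_id = ANY(ARRAY(SELECT id FROM patients WHERE registration_facility_id IN (...)))
  ORDER BY updated_at ASC
  LIMIT 1000
SQL
```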
This isn't as fast as the fastest regular query, but it gives better, more uniform performance across different cases.[1]
However, if we take a step back, we aren't really optimizing for mass-resyncing a FacilityGroup; all we really need is:

- normal FG syncs finish in reasonable time
- bulk block re-syncs happen fast
The query itself is unnecessarily complicated, since it builds two separate where and where.not clauses. It can be rewritten as a single where, which is both fewer indirections and faster, since we avoid an entire separate where.not clause.
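The equivalence can be sketched in plain Ruby (scope names are illustrative): the two-clause version filters records in and then filters some back out, while the single-clause version selects directly against the precomputed "other facility" set.

```ruby
all_patient_ids      = (1..10).to_a
current_facility_ids = [1, 2, 3]
other_facility_ids   = all_patient_ids - current_facility_ids

# Before: a where(...) plus a separate where.not(...)
two_clauses = all_patient_ids
  .select { |id| all_patient_ids.include?(id) }      # where(patient_id: all)
  .reject { |id| current_facility_ids.include?(id) } # where.not(patient_id: current)

# After: a single where against the "other facilities" set
one_clause = all_patient_ids.select { |id| other_facility_ids.include?(id) }

two_clauses == one_clause # => true
```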
memoization
We end up calling records_to_sync / current_facility_records / other_facility_records multiple times in a single request. Rails helps with query caching, but a fair few of these queries have modifiers like count tacked on, which don't hit the Rails per-request query cache. Memoizing these queries gives us substantial gains. A nice side-effect is that benchmarking different queries happens on an even keel, since the processing noise is eliminated.
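The memoization itself is the standard `||=` pattern; a minimal sketch (class and method names are illustrative) showing that the expensive lookup runs once per request object:

```ruby
# Memoize an expensive per-request query so repeat calls reuse the result.
class SyncRequest
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def other_facility_records
    @other_facility_records ||= expensive_query
  end

  private

  def expensive_query
    @query_count += 1            # count how often the "database" is hit
    [:record_a, :record_b]       # stand-in for the real DB query
  end
end

request = SyncRequest.new
request.other_facility_records
request.other_facility_records  # second call served from the memo
request.query_count             # => 1
```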
what indexes are we using now?
With the final set of queries, we tend to use only:

- an updated_at index on all the models except Patient
- a compound index on (id, updated_at) on Patient
My recommendation would be to keep the existing index infrastructure, monitor index usage periodically, and remove the non-essential indexes.
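One way to do that periodic monitoring (a sketch, assuming direct Postgres access): `pg_stat_user_indexes` tracks how often each index has been scanned, so rarely-used indexes surface as removal candidates.

```ruby
# Least-used indexes first; idx_scan is the cumulative scan count Postgres
# keeps per index in pg_stat_user_indexes.
index_usage_sql = <<~SQL
  SELECT relname AS table_name, indexrelname AS index_name, idx_scan AS scans
  FROM pg_stat_user_indexes
  ORDER BY idx_scan ASC
  LIMIT 20
SQL
```

Run via `ActiveRecord::Base.connection.execute(index_usage_sql)` or psql.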
does this work?
Before and after graphs on perf for both Block and FG:

- master code with 5% of users resyncing FacilityGroup
- sync-perf-fixes code with 5% of users resyncing Block
- master code: DB CPU utilization with 5% of users resyncing FacilityGroup
- sync-perf-fixes code: DB CPU utilization with 5% of users resyncing Block
- master code with 1 instance running (8 vCPUs)
- sync-perf-fixes code with 1 instance running (8 vCPUs)
I'm confident that these changes will hold up for a 5% block-sync rollout alongside the existing regular FG syncs as they happen.
__references
[1] The cases we tested: fetching 1000 BPs and Patients from the beginning, and fetching 1000 BPs and Patients from different time segments between 2018 and 2020.