Support range as an option for async reindex #1549

kapso · 2022-03-07T01:58:16Z

I am curious why this change was made: "Changed async reindex to fetch ids instead of using ranges for numeric primary keys with Active Record"

...and whether it's possible (or makes sense) to support range as an option for async reindex as well.

Why? Enqueuing ids is really slow, especially when writing hundreds of ids per job to Sidekiq/Redis. We use a batch_size of 1000, and creating jobs is really slow when reindexing 40m documents.

UPDATE: it took ~16 hours to enqueue ids for 40m documents with a batch size of 1000.
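For readers unfamiliar with the range pattern being requested, here is a minimal sketch of the idea, not Searchkick's actual code. The helper name `id_ranges` is hypothetical. It assumes a numeric primary key and splits the span between the smallest and largest id into fixed-size ranges, so each job carries only two integers instead of a list of ids.

```ruby
# Hypothetical helper: split an inclusive id span into fixed-size
# [min_id, max_id] pairs. Each pair would become the payload of one job.
def id_ranges(min_id, max_id, batch_size)
  return [] if min_id.nil? || max_id.nil? # empty table: nothing to enqueue

  ranges = []
  start = min_id
  while start <= max_id
    # Clamp the last range so it never runs past max_id.
    ranges << [start, [start + batch_size - 1, max_id].min]
    start += batch_size
  end
  ranges
end

# With Active Record the bounds would come from two aggregate queries, e.g.
#   min_id = Model.minimum(:id)
#   max_id = Model.maximum(:id)
# Plain integers are used here so the sketch runs on its own.
id_ranges(1, 2_500, 1_000).each do |lo, hi|
  puts "enqueue ids #{lo}..#{hi}"
end
```

Computing the ranges needs only the two bounds, which is why this approach can enqueue much faster than reading every id out of the table first.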


kapso · 2022-03-07T02:01:39Z

Or maybe use Sidekiq::Worker.perform_bulk which I think could speed up Sidekiq/Redis writes, in case Sidekiq is being used.
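A rough sketch of what the bulk approach could look like, under assumptions: `ReindexBatchWorker` is a hypothetical Sidekiq worker taking `(class_name, ids)`, and `bulk_job_args` is a hypothetical helper. This is not how Searchkick enqueues its jobs. The runnable part only builds the argument list; the Sidekiq call is left as a comment because it needs Sidekiq and Redis.

```ruby
# Hypothetical helper: build one argument array per job, so that all jobs
# can be handed to Sidekiq in a single bulk call instead of one push each.
def bulk_job_args(class_name, ids, batch_size)
  ids.each_slice(batch_size).map { |slice| [class_name, slice] }
end

args = bulk_job_args("Product", (1..2_500).to_a, 1_000)
puts "#{args.size} jobs prepared"

# With a Sidekiq version that provides perform_bulk and a worker class
# named ReindexBatchWorker (an assumption, not part of Searchkick):
#   ReindexBatchWorker.perform_bulk(args)
```

This reduces the number of Redis round trips, but the ids still have to be fetched from the database, so it addresses only part of the enqueue time.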

kapso · 2022-04-09T20:19:33Z

@ankane any thoughts? thanks.

ankane · 2022-05-18T21:31:17Z

Hey @kapso, I was thinking it'd be good to handle gaps in primary keys at the expense of longer enqueuing times, but it may be good to support the previous pattern as well. How long did it previously take to enqueue, and how does the total reindexing time compare before and after?
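To make the trade-off around gaps concrete, here is a small illustrative sketch. The helper `rows_per_range` and the ids are made up for the example. It counts how many existing rows land in each fixed-size range, showing that with gaps some range-based jobs would do no work while others do the full batch.

```ruby
# Hypothetical helper: for a set of existing ids, return
# [range_start, range_end, row_count] for every fixed-size range
# between the smallest and largest id.
def rows_per_range(ids, batch_size)
  return [] if ids.empty?

  min_id, max_id = ids.minmax
  counts = Hash.new(0)
  # Bucket each id by the index of the range it falls into.
  ids.each { |id| counts[(id - min_id) / batch_size] += 1 }

  batches = (max_id - min_id) / batch_size + 1
  (0...batches).map do |i|
    start = min_id + i * batch_size
    [start, start + batch_size - 1, counts[i]]
  end
end

ids = [1, 2, 3, 250, 251, 990, 1000] # made-up ids with large gaps
rows_per_range(ids, 100).each do |lo, hi, n|
  puts "#{lo}..#{hi}: #{n} rows"
end
```

Fetching ids instead guarantees every job is full, at the cost of reading all ids before enqueuing.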

kapso · 2022-05-31T16:29:15Z

@ankane We have 140m rows in the table, out of which 40m rows (we use search_import scope for filtering) get indexed. With a batch_size of 500, it takes 20+ hours to just enqueue all ids.

The previous pattern used to take ~20 minutes.

We use a Heroku worker to enqueue these ids, and sometimes Heroku workers get recycled, which restarts the enqueue process.

ramaboo · 2022-05-31T19:43:15Z

Since so many datasets have few if any gaps in their primary keys, I think the option should exist to enqueue faster rather than more accurately. Could this be a simple config option in an initializer, e.g. config.queue_method = :fast # or :precise

kapso · 2022-05-31T23:14:51Z

@ankane another interesting thing we have seen a few times is that the enqueuing job (Heroku worker) crashes after a couple of hours with no error reported in our error system (Sentry). Because it crashed and restarted, it also restarted the enqueuing process, creating a fresh index and writing to that new index.

The Enqueuing (Heroku) worker had 2.5GB memory, so memory definitely wasn't an issue.

This is the enqueuing job/process - Model.reindex(mode: :async)

kapso · 2022-06-18T00:26:22Z

@ankane curious if there are any plans to also support the range option?

kapso · 2022-06-28T01:48:53Z

@ankane thinking of submitting a PR for this, is this the place to start? - 88f52da

Or is there any other code we should also be looking at?

BobbyMcWho · 2022-06-29T14:50:18Z

if relation.respond_to?(:find_in_batches)
  relation.find_in_batches(batch_size: batch_size) do |items|
    batch_job(class_name, batch_id, items.map(&:id))
  end
end

I believe (correct me if I'm wrong) that find_in_batches here converts each batch to an array of AR objects, whereas in_batches.each would leave it as an ActiveRecord::Relation. If we used the latter, we could call items.pluck(:id) instead of items.map(&:id) and avoid the overhead of loading all the objects into memory.

irb(main):010:0> Product.find_in_batches do |batch|
irb(main):011:1*   puts batch.is_a?(ActiveRecord::Relation)
irb(main):012:1> end
false

irb(main):007:0> Product.in_batches.each do |batch|
irb(main):008:1*   puts batch.is_a?(ActiveRecord::Relation)
irb(main):009:1> end
true

igorbelo · 2023-10-04T10:56:56Z

Ran into the same problem where 16 million records took nearly 2 hours just to get enqueued.
First reaction was to increase the batch size, but working with ranges sounds like a more reasonable approach.

I can work something out if you're willing to support ranges again @ankane.
I see that the BulkReindexJob still accepts range options here, so it should be a straightforward PR.

kapso added the feature request label Mar 7, 2022

BobbyMcWho mentioned this issue Jun 29, 2022

[WIP] Use pluck instead of map #1571

Closed
