cloud: Load submissions from PostgreSQL into BigQuery #541

Open
spbnick opened this issue Jul 5, 2024 · 7 comments

@spbnick
Collaborator

spbnick commented Jul 5, 2024

To increase throughput of loading submissions into BigQuery, switch to loading them in big chunks from PostgreSQL, but still using load jobs.

The streaming mechanism is somewhat troublesome in our case: its buffers have to be flushed before any DELETE or UPDATE operations can be done on the table, there is no way to force a flush, and a flush can take up to 90 minutes. The TRUNCATE operation, which would otherwise have suited us fine, currently has the same problem. This rules out always loading via streaming, as it would break tests which need to empty the database repeatedly.

Doing this will also move us closer to making the BigQuery dataset public: data in PostgreSQL will be largely de-duplicated, making BigQuery partitioning more viable and reducing the cost of queries.
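
For illustration, here's a minimal sketch of the load-job approach (as opposed to streaming inserts), assuming a hypothetical `checkouts` table, a placeholder PostgreSQL DSN and a placeholder BigQuery table ID; the real schema and client code differ:

```python
import io
import json

import psycopg2
from google.cloud import bigquery

# Placeholder names for illustration only.
PG_DSN = "dbname=kcidb"
BQ_TABLE = "my-project.my_dataset.checkouts"


def load_chunk(query, params=None):
    """Dump a chunk of rows from PostgreSQL and push it to BigQuery
    as a single load job (rather than streaming inserts)."""
    buf = io.BytesIO()
    with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
        cur.execute(query, params)
        columns = [col.name for col in cur.description]
        for row in cur:
            line = json.dumps(dict(zip(columns, row)), default=str)
            buf.write(line.encode() + b"\n")
    buf.seek(0)

    job = bigquery.Client().load_table_from_file(
        buf, BQ_TABLE,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        ),
    )
    job.result()  # Block until the load job completes, raise on failure
```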

@spbnick spbnick self-assigned this Jul 5, 2024
spbnick added a commit that referenced this issue Jul 5, 2024
Remove loading submissions into BigQuery until we come up with a way to
increase throughput (likely pulling chunks from PostgreSQL). This should
help us deal with the backlog in the production submission queue.

We might also need to either switch to direct triggering of Cloud
Functions by messages from the queue, to reduce latency, or simply
switch to a persistent Cloud Run service.

Concerns: #541
@spbnick
Collaborator Author

spbnick commented Oct 30, 2024

OK, here are the requirements:

  • Pick up from the current state and reliably transfer the three-or-so months of data we're missing in BigQuery right now
  • Be able to recover from transfer failures, and not lose data in case something went wrong on a previous attempt
  • Avoid loading duplicate data as much as possible, but prefer duplicates over missing data.
  • Transfer data only after it's outside the "editing window", to reduce duplicates of objects

And here's the plan:

  • Assume a certain "editing window" (say, two weeks), but don't implement it yet (a matter for another issue and time)
  • Support dumping data between two timestamps in database client and drivers
  • Remove rows newer than the maximum timestamp minus, say, an hour from BigQuery, so we can be sure we don't have incomplete data at the cutoff timestamp there.
  • Add a Cloud Function triggered on a daily schedule, say at midday, which:
    • For each object type:
      • Queries the maximum _timestamp from BigQuery, and puts it into start_ts
      • Calculates the timestamp for two weeks ago and puts it into end_ts
      • Caps end_ts at a week from start_ts
      • Dumps data from PostgreSQL with _timestamp > start_ts AND _timestamp <= end_ts (note the equals!), with metadata (see the sketch after this list).
      • Loads the dump into BigQuery, preserving metadata
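
Here's a minimal sketch of what the timestamp-window dump boils down to, assuming a hypothetical `checkouts` table with a `_timestamp` column and a placeholder DSN; the real dump covers all object types and carries metadata:

```python
import datetime

import psycopg2

# Placeholder connection string for illustration only.
PG_DSN = "dbname=kcidb"


def dump_window(start_ts: datetime.datetime, end_ts: datetime.datetime):
    """Yield rows whose _timestamp falls in the half-open
    (start_ts, end_ts] window."""
    with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM checkouts"
            " WHERE _timestamp > %s AND _timestamp <= %s"
            " ORDER BY _timestamp",
            (start_ts, end_ts),
        )
        yield from cur
```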

@spbnick
Collaborator Author

spbnick commented Oct 30, 2024

We need to figure out how we're going to handle inclusive/exclusive boundaries.

@spbnick
Collaborator Author

spbnick commented Oct 30, 2024

Perhaps we could just implement start < timestamp <= end and be done with that for now.

@spbnick
Collaborator Author

spbnick commented Nov 1, 2024

OK, this has just been implemented in the PR above:

  • Support dumping data between two timestamps in database client and drivers

However, it doesn't support dumping object types separately, and takes one timestamp for all types (for either boundary). This seems simpler and easier to manage and maintain. Correspondingly, we need to change our plan:

  • Remove rows newer than the maximum timestamp minus, say, an hour from BigQuery, so we can be sure we don't have incomplete data at the cutoff timestamp there.
  • Add a Cloud Function triggered on a daily schedule, say at midday, which (see the sketch after this list):
    • Queries the maximum _timestamp across all object types from BigQuery, and puts it into after_ts.
    • Calculates the timestamp for two weeks ago and puts it into until_ts
    • Caps until_ts at a week from after_ts
    • Dumps data from PostgreSQL with _timestamp > after_ts AND _timestamp <= until_ts (note the equals!), with metadata.
    • Loads the dump into BigQuery, preserving metadata
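
A rough sketch of that daily flow, with placeholder dataset and table names, and with the actual dump and load steps left as comments since those belong to the client and driver code:

```python
import datetime

from google.cloud import bigquery

# Placeholder names for illustration; the real dataset, table list and
# entry-point signature belong to the actual deployment.
BQ_DATASET = "my-project.my_dataset"
OBJECT_TABLES = ("checkouts", "builds", "tests")
EDITING_WINDOW = datetime.timedelta(weeks=2)  # Assumed, not enforced yet
MAX_CHUNK = datetime.timedelta(weeks=1)       # Cap on one run's transfer


def pick_window(client):
    """Compute the (after_ts, until_ts] window for today's archival run."""
    # The latest _timestamp already present in BigQuery, across all object
    # types (assumes each table already has at least one row)
    after_ts = max(
        next(iter(client.query(
            f"SELECT MAX(_timestamp) AS ts FROM `{BQ_DATASET}.{table}`"
        ).result())).ts
        for table in OBJECT_TABLES
    )
    # Only archive data that has already left the assumed editing window
    until_ts = datetime.datetime.now(datetime.timezone.utc) - EDITING_WINDOW
    # Never transfer more than a week's worth of data in one run
    return after_ts, min(until_ts, after_ts + MAX_CHUNK)


def archive(event, context):
    """Entry point for the daily-scheduled Cloud Function."""
    after_ts, until_ts = pick_window(bigquery.Client())
    if until_ts <= after_ts:
        return  # Nothing is old enough to archive yet
    # Here the real function dumps rows with
    # _timestamp > after_ts AND _timestamp <= until_ts from PostgreSQL
    # (with metadata) and loads the dump into BigQuery via a load job.
```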

@spbnick
Collaborator Author

spbnick commented Nov 1, 2024

OK, everything is written. Now I need to write a test for that archival Cloud Function, and make sure everything works.

@spbnick
Collaborator Author

spbnick commented Nov 6, 2024

Tests are written, implementation is fixed, waiting for CI before merging #598

@spbnick
Collaborator Author

spbnick commented Nov 6, 2024

#598 is merged, and everything after 2024-07-06T06:00:00+00:00 has been removed from production BigQuery (the last modified time was 2024-07-06T06:27:19.430720+00:00). The same has been done to the playground.
