cloud: Load submissions from PostgreSQL into BigQuery #541
Remove loading submissions into BigQuery until we come up with a way to increase throughput (likely by pulling chunks from PostgreSQL). This should help us deal with the backlog in the production submission queue. We might also need to either trigger Cloud Functions directly from queue messages, to reduce latency, or simply switch to a persistent Cloud Run service. Concerns: #541
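For reference, a minimal sketch of what direct triggering could look like, assuming the submission queue is backed by a Pub/Sub topic and a 1st-gen Python Cloud Function is subscribed to it; `process_submission` is a hypothetical handler, not something that exists in this codebase:

```python
import base64
import json


def process_submission(payload):
    """Hypothetical placeholder for the actual submission processing."""
    print("processing submission", payload.get("id"))


def on_submission_message(event, context):
    """Background Cloud Function entry point for a Pub/Sub trigger.

    The message published to the queue topic arrives base64-encoded in
    event["data"]; decoding it yields the submission payload, which is
    processed immediately instead of waiting for a polling cycle.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    process_submission(payload)
```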
OK, here are the requirements:
And here's the plan:
We need to figure out how we're going to handle inclusive/exclusive boundaries.
Perhaps we could just implement
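One common way to handle the boundary question (not necessarily what was settled on here) is a half-open interval: inclusive lower bound, exclusive upper bound, so consecutive chunks neither overlap nor drop rows that land exactly on a boundary. A minimal sketch, assuming a `submissions` table with a `created_at` timestamp column and the psycopg2 driver:

```python
import psycopg2  # assumed PostgreSQL driver


def fetch_chunk(conn, start, end):
    """Fetch one chunk over the half-open interval [start, end):
    rows at exactly `start` are included, rows at exactly `end`
    are left for the next chunk, so chunks never overlap."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT *
            FROM submissions
            WHERE created_at >= %(start)s
              AND created_at <  %(end)s
            ORDER BY created_at
            """,
            {"start": start, "end": end},
        )
        return cur.fetchall()
```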
OK, this has just been implemented in the PR above:
However, it doesn't actually support dumping object types separately, and it takes a single timestamp for all types (for each boundary). This seems simpler and easier to manage and maintain. Correspondingly, we need to change our plan:
OK, everything is written. Now I need to write a test for that archival Cloud Function, and make sure everything works.
Tests are written, implementation is fixed, waiting for CI before merging #598
#598 is merged, and everything after
To increase the throughput of loading submissions into BigQuery, switch to loading them in large chunks from PostgreSQL, while still using load jobs.
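A minimal sketch of the chunked load-job approach, assuming the rows have already been pulled from PostgreSQL as dicts and that `table_id` names the destination BigQuery table; this uses the standard google-cloud-bigquery client and is an illustration under those assumptions, not the project's actual code:

```python
import io
import json

from google.cloud import bigquery


def load_chunk(rows, table_id):
    """Append one chunk of rows to BigQuery with a load job (not the
    streaming API), so the destination table can still be modified or
    truncated right after the job completes."""
    client = bigquery.Client()
    ndjson = "\n".join(json.dumps(row, default=str) for row in rows)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_file(
        io.BytesIO(ndjson.encode("utf-8")), table_id, job_config=job_config
    )
    job.result()  # wait for the load job to finish
```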
The streaming mechanism is somewhat troublesome in our case: its buffers must be flushed before any DELETE or UPDATE operations can be run on the table, there is no way to force a flush, and a full flush can take up to 90 minutes. The TRUNCATE operation, which would have suited us fine, currently has the same problem. This prevents us from always loading via streaming, as it would break tests that need to empty the database repeatedly.
Doing this will also help us move closer to making the BigQuery dataset public, as the data in PostgreSQL will be largely de-duplicated, making partitioning in BigQuery more viable and reducing the cost of queries.