DB Backup Bug #2853

Merged
merged 46 commits into from
Feb 29, 2024

@elipe17 elipe17 commented Feb 16, 2024

Summary of Changes

  • Updated backup task with a lot of logging and pinned the location of the PG client directory
  • Updated the Terraform to pin the version of our PG server
  • Named all crontabs so that it is obvious what is what in DAC and so that they actually run
  • Recreated dev environment DB with TF so that it is on version 12.x

Pull request closes #2852

How to Test

  • Log into the Raft environment DAC and change the cron schedule to once every few minutes: minute=*/5. Everything else should be *'s.
  • Watch the logs to see the task start and complete.
  • Verify in S3 that the backup was created. Use the script below to set the correct environment variables, and be sure to have the AWS CLI installed.
#!/bin/bash

# Run this script with a . in order to set environment variables in your shell
# For example:
# . ./getcreds.sh

SERVICE_INSTANCE_NAME=tdp-staticfiles-dev
KEY_NAME=eric-s3-key1

cf create-service-key "${SERVICE_INSTANCE_NAME}" "${KEY_NAME}"
S3_CREDENTIALS=$(cf service-key "${SERVICE_INSTANCE_NAME}" "${KEY_NAME}" | tail -n +2)

export AWS_ACCESS_KEY_ID=$(echo "${S3_CREDENTIALS}" | jq -r '.credentials.access_key_id')
export AWS_SECRET_ACCESS_KEY=$(echo "${S3_CREDENTIALS}" | jq -r '.credentials.secret_access_key')
export BUCKET_NAME=$(echo "${S3_CREDENTIALS}" | jq -r '.credentials.bucket')
export AWS_DEFAULT_REGION=$(echo "${S3_CREDENTIALS}" | jq -r '.credentials.region')
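
The minute=*/5 schedule from the first test step fires on every minute divisible by five when all other fields are *. A minimal sketch of how that field expands is below. The helper name expand_step is hypothetical and handles only the * and */N forms; it is not Celery's parser.

```python
def expand_step(field: str, upper: int = 60) -> list[int]:
    """Expand a cron field of the form '*' or '*/N' into the matching values.

    Hypothetical helper for illustration only; real cron parsers also
    support ranges and comma-separated lists, which this sketch rejects.
    """
    if field == "*":
        return list(range(upper))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be a positive integer")
        return list(range(0, upper, step))
    raise ValueError(f"unsupported field: {field!r}")


minutes = expand_step("*/5")
print(minutes)  # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
```

With this schedule the backup task should show up in the logs at most five minutes after the crontab is saved.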

Deliverables

More details on how the deliverables herein are assessed are included here.

Deliverable 1: Accepted Features

Checklist of ACs:

  • root cause addressed
  • evidence of backups in logs and DAC
  • lfrohlich and/or adpennington confirmed that ACs are met.

Deliverable 2: Tested Code

  • Are all areas of code introduced in this PR meaningfully tested?
    • If this PR introduces backend code changes, are they meaningfully tested?
    • If this PR introduces frontend code changes, are they meaningfully tested?
  • Are code coverage minimums met?
    • Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
    • Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

  • Are backend code style checks passing on CircleCI?
  • Are frontend code style checks passing on CircleCI?
  • Are code maintainability principles being followed?

Deliverable 4: Accessible

  • Does this PR complete the epic?
  • Are links included to any other gov-approved PRs associated with epic?
  • Does PR include documentation for Raft's a11y review?
  • Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

  • Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

  • Does this PR provide background for why coding decisions were made?
  • If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces dependencies, are their licenses documented?
  • Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

  • Does the OWASP Scan pass on CircleCI?
  • Do manual code review and manual testing detect any new security issues?
  • If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

  • the purpose of the research
  • methods used to conduct the research
  • who participated in the research
  • what was tested and how
  • impact of research on TDP
  • (if applicable) final design mockups produced for TDP development

@elipe17 elipe17 self-assigned this Feb 16, 2024

codecov bot commented Feb 16, 2024

Codecov Report

Attention: Patch coverage is 71.05263%, with 11 lines in your changes missing coverage. Please review.

Project coverage is 93.62%. Comparing base (ae0a5fc) to head (d1fd6e2).
Report is 20 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2853      +/-   ##
===========================================
- Coverage    93.67%   93.62%   -0.05%     
===========================================
  Files          262      265       +3     
  Lines         6053     6073      +20     
  Branches       503      510       +7     
===========================================
+ Hits          5670     5686      +16     
- Misses         287      294       +7     
+ Partials        96       93       -3     
Flag Coverage Δ
dev-backend 93.82% <71.05%> (-0.02%) ⬇️
dev-frontend 92.62% <ø> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
tdrs-backend/tdpservice/settings/common.py 99.23% <ø> (+0.01%) ⬆️
tdrs-backend/tdpservice/email/tasks.py 71.05% <71.05%> (ø)

... and 7 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8d0934...d1fd6e2. Read the comment docs.

@elipe17 elipe17 added the Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI label Feb 16, 2024
@elipe17 elipe17 added Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI and removed Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI labels Feb 16, 2024
@elipe17 elipe17 added Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI and removed Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI labels Feb 17, 2024
@elipe17 elipe17 added Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI and removed Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI labels Feb 20, 2024
@elipe17 elipe17 removed the Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI label Feb 20, 2024
@elipe17 elipe17 added the Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI label Feb 20, 2024
@elipe17 elipe17 added Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI and removed Deploy with CircleCI-raft Deploy to https://tdp-frontend-raft.app.cloud.gov through CircleCI labels Feb 20, 2024
s3_client.upload_file(file_name, bucket, object_name)
print("Uploaded {} to S3:{}{}".format(file_name, bucket, object_name))
response = s3_client.upload_file(file_name, bucket, object_name)
logger.info(f"S3 upload response: {response}")
Collaborator

@elipe17 should the response here be None if it's successful? Asking because that's what I'm observing so far.

   2024-02-23T12:25:01.07-0500 [APP/PROC/WEB/0] ERR 2024-02-23 17:25:01,070 INFO db_backup.py::upload_file:L152 :  S3 upload response: None
   2024-02-23T12:25:01.07-0500 [APP/PROC/WEB/0] ERR [2024-02-23 17:25:01,070: INFO/ForkPoolWorker-1] S3 upload response: None
   2024-02-23T12:25:01.07-0500 [APP/PROC/WEB/0] ERR 2024-02-23 17:25:01,071 INFO db_backup.py::upload_file:L153 :  Uploaded /tmp/backup.pg to s3://...

It should return either True or False: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html. I would like to investigate more.

Author

@elipe17 elipe17 Feb 26, 2024

I looked at the source code for our version of the library and it looks like it doesn't return anything. That is a feature added in a more recent version. I will remove the logging of the response since it does not exist for our version of boto3.
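
For context, boto3's S3 client upload_file is documented as returning None and reporting failure by raising an exception, so a logged response of None is consistent with a successful upload. The True/False result in the linked guide appears to come from a small wrapper function around that call. A minimal sketch of the wrapper pattern follows. The name upload_backup is hypothetical, and the client is passed in by the caller rather than created here.

```python
import logging

logger = logging.getLogger(__name__)


def upload_backup(s3_client, file_name, bucket, object_name):
    """Upload file_name to the bucket and report success as a bool.

    Hypothetical wrapper: s3_client is assumed to expose boto3's
    upload_file(Filename, Bucket, Key), which returns None on success
    and raises on failure.
    """
    try:
        s3_client.upload_file(file_name, bucket, object_name)
    except Exception:
        # Sketch only: real code would narrow this to the upload errors
        # raised by boto3/botocore instead of catching everything.
        logger.exception("Upload of %s to s3://%s/%s failed", file_name, bucket, object_name)
        return False
    logger.info("Uploaded %s to s3://%s/%s", file_name, bucket, object_name)
    return True
```

Logging inside the wrapper keeps the success and failure messages in one place, so the calling task only needs to check the boolean.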

@raftmsohani raftmsohani left a comment

We need to figure out why we are getting None instead of False. Is it because the client is not set?

@ADPennington
Collaborator

@elipe17 @raftmsohani please ping me when this is ready again.

  • if feasible to add the logentry as part of this ticket, please do so.
  • please check into the S3 response.

@ADPennington ADPennington added Blocked Label for Pull Requests that are currently blocked by a dependency and removed Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI labels Feb 23, 2024
@elipe17
Author

elipe17 commented Feb 27, 2024

Sorry @andrew-jameson, I didn't mean to re-request your review. I fat-fingered it haha.

@ADPennington ADPennington added Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI Blocked Label for Pull Requests that are currently blocked by a dependency and removed Blocked Label for Pull Requests that are currently blocked by a dependency Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI labels Feb 28, 2024
@ADPennington
Collaborator

@elipe17 I'm blocked on this one. After changing the minute on the crontab schedule, I'm seeing results like the following in the logs whenever the scheduled task is supposed to start:

 2024-02-28T10:11:15.64-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:11:15,644: INFO/MainProcess] DatabaseScheduler: Schedule changed.
   2024-02-28T10:11:21.98-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:11:21 +0000] [566] [DEBUG] Closing connection.
   2024-02-28T10:11:52.01-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:11:52 +0000] [569] [DEBUG] Closing connection.
   2024-02-28T10:12:22.04-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:12:22 +0000] [566] [DEBUG] Closing connection.
   2024-02-28T10:12:52.07-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:12:52 +0000] [569] [DEBUG] Closing connection.
   2024-02-28T10:13:00.01-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:13:00,011: INFO/MainProcess] Scheduler: Sending due task celery.backend_cleanup (celery.backend_cleanup)
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:13:00,036: INFO/MainProcess] Scheduler: Sending due task Database Backup (tdpservice.scheduling.db_tasks.postgres_backup)
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR [2024-02-28 15:13:00,039: ERROR/MainProcess] Received unregistered task of type 'tdpservice.scheduling.db_tasks.postgres_backup'.
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR The message has been ignored and discarded.
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR Did you remember to import the module containing this task?
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR Or maybe you're using relative imports?
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR Please see
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR http://docs.celeryq.org/en/latest/internals/protocol.html
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR for more information.
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR The full contents of the message body was:
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR b'[["-", "b"], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]' (85b)
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR Thw full contents of the message headers:
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR {'lang': 'py', 'task': 'tdpservice.scheduling.db_tasks.postgres_backup', 'id': '90c2ea20-be68-40ed-868c-1817566d2d52', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '90c2ea20-be68-40ed-868c-1817566d2d52', 'parent_id': None, 'argsrepr': "['-', 'b']", 'kwargsrepr': '{}', 'origin': 'gen565@4ac3259b-a690-4a04-60f4-0955', 'ignore_result': False}
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR The delivery info for this task is:
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR {'exchange': '', 'routing_key': 'celery'}
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR Traceback (most recent call last):
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR File "/home/vcap/deps/1/python/lib/python3.10/site-packages/celery/worker/consumer/consumer.py", line 591, in on_task_received
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR strategy = strategies[type_]
   2024-02-28T10:13:00.03-0500 [APP/PROC/WEB/0] ERR KeyError: 'tdpservice.scheduling.db_tasks.postgres_backup'


- update crontabs
@elipe17
Author

elipe17 commented Feb 28, 2024

@ADPennington I have resolved the issue. To avoid spurious error messages, you will also need to delete the periodic task named "name". That is technically the Account Deactivation Warning task from before it was named Account Deactivation Warning. I moved some tasks around in the code base, so the "name" task is going to fail because check_for_accounts_needing_deactivation_warning now lives in tdpservice.email.tasks and not in tdpservice.scheduling.tasks. If you don't delete the task, nothing will actually break; you'll just see an error message similar to the one above in the logs whenever the "name" task runs.
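
The "unregistered task" error follows from how tasks are looked up: by default Celery names a task after its module path and function name, and the worker finds the handler by that name. The simplified model below shows why a stored schedule entry that still points at the old path fails with a KeyError once the function moves. It uses a plain dict as a stand-in for the worker's registry and is not Celery's code; register and dispatch are hypothetical helpers.

```python
# Simplified stand-in for a worker's task registry, keyed by dotted task name.
registry = {}


def register(module, func):
    """Register func under '<module>.<function name>' (name-by-module-path)."""
    registry[f"{module}.{func.__name__}"] = func
    return func


def check_for_accounts_needing_deactivation_warning():
    return "warning emails queued"  # placeholder body for illustration


# After the move, the task is registered only under its new module path.
register("tdpservice.email.tasks", check_for_accounts_needing_deactivation_warning)


def dispatch(task_name):
    """Look up a task by name; an unknown name raises KeyError."""
    return registry[task_name]()


new_name = "tdpservice.email.tasks.check_for_accounts_needing_deactivation_warning"
old_name = "tdpservice.scheduling.tasks.check_for_accounts_needing_deactivation_warning"

print(dispatch(new_name))
try:
    dispatch(old_name)  # stale schedule entry still pointing at the old path
except KeyError as exc:
    print(f"unregistered task: {exc}")
```

Deleting or re-pointing the stale periodic task removes the only caller of the old name, which is why the cleanup step above silences the log noise.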

@ADPennington ADPennington added the Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI label Feb 28, 2024
Collaborator

@ADPennington ADPennington left a comment

thanks @elipe17 lgtm 🚀 I noted an action item for me to discuss retention policy for backups with data team. cc: @lfrohlich @ttran-hub

Test notes here

@ADPennington ADPennington added Ready to Merge and removed Blocked Label for Pull Requests that are currently blocked by a dependency Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI QASP Review labels Feb 28, 2024
@andrew-jameson andrew-jameson merged commit d559d4f into develop Feb 29, 2024
16 checks passed
@andrew-jameson andrew-jameson deleted the 2852-db-backup branch February 29, 2024 14:47
Labels
backend bug database For issues primarily related to schema changes devops Priority Use this label for issues or PRs that need to be expedited Ready to Merge
Development

Successfully merging this pull request may close these issues.

DB Backup Failing
5 participants