3157 - Reparse Meta through model #3220

Merged
merged 15 commits into from
Oct 17, 2024

@jtimpe jtimpe commented Oct 8, 2024

Summary of Changes

Pull request closes #3157
Adds additional information to ReparseMeta

  • cat 4 error count per file associated with reparse
  • num records created per file associated with reparse
  • finished/success status per associated file
  • started/finished at per associated file
  • properties on ReparseMeta that report the per-file information replace the fields that originally stored it, in favor of the more granular, less concurrency-error-prone ReparseFileMeta
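
As a rough illustration of the last bullet, the sketch below derives reparse-level totals from per-file rows instead of storing them on the parent. It uses plain dataclasses rather than Django models, and every class and field name in it (FileMeta, Reparse, num_records_created, cat_4_errors_generated, and so on) is a hypothetical stand-in, not the actual code in this PR.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class FileMeta:
    """Hypothetical stand-in for one through-model row (one file in one reparse)."""
    finished: bool = False
    success: bool = False
    num_records_created: int = 0
    cat_4_errors_generated: int = 0
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None


@dataclass
class Reparse:
    """Hypothetical stand-in for the parent; totals are computed, not stored."""
    file_metas: List[FileMeta] = field(default_factory=list)

    @property
    def num_records_created(self) -> int:
        # Sum of the per-file record counts.
        return sum(m.num_records_created for m in self.file_metas)

    @property
    def cat_4_errors_generated(self) -> int:
        # Sum of the per-file cat 4 error counts.
        return sum(m.cat_4_errors_generated for m in self.file_metas)

    @property
    def finished_at(self) -> Optional[datetime]:
        # Only defined once every associated file has a finish time.
        times = [m.finished_at for m in self.file_metas]
        if not times or any(t is None for t in times):
            return None
        return max(times)
```

Because each worker would only write to its own per-file row in a layout like this, no shared counter on the parent needs to be updated concurrently.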

How to Test

  1. Check out develop. Upload some datafiles. Reparse those data files.
  2. Check out this branch. Ensure the migration preserves relevant data.
  3. Start another reparse. Ensure the data fields are filled out and consistent.

Deliverables

More details on how the deliverables herein are assessed are included here.

Deliverable 1: Accepted Features

Checklist of ACs:

  • ReparseMeta model has new "finished_at" datetime field and cat4 tracking field
  • Testing Checklist has been run and all tests pass
  • lfrohlich and/or adpennington confirmed that ACs are met.

Deliverable 2: Tested Code

  • Are all areas of code introduced in this PR meaningfully tested?
    • If this PR introduces backend code changes, are they meaningfully tested?
    • If this PR introduces frontend code changes, are they meaningfully tested?
  • Are code coverage minimums met?
    • Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
    • Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

  • Are backend code style checks passing on CircleCI?
  • Are frontend code style checks passing on CircleCI?
  • Are code maintainability principles being followed?

Deliverable 4: Accessible

  • Does this PR complete the epic?
  • Are links included to any other gov-approved PRs associated with epic?
  • Does PR include documentation for Raft's a11y review?
  • Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

  • Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

  • Does this PR provide background for why coding decisions were made?
  • If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces dependencies, are their licenses documented?
  • Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

  • Does the OWASP Scan pass on CircleCI?
  • Do manual code review and manual testing detect any new security issues?
  • If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

  • the purpose of the research
  • methods used to conduct the research
  • who participated in the research
  • what was tested and how
  • impact of research on TDP
  • (if applicable) final design mockups produced for TDP development

@jtimpe jtimpe self-assigned this Oct 8, 2024

codecov bot commented Oct 8, 2024

Codecov Report

Attention: Patch coverage is 85.39326% with 13 lines in your changes missing coverage. Please review.

Project coverage is 90.85%. Comparing base (a382f74) to head (03bc847).
Report is 3 commits behind head on develop.

Files with missing lines Patch % Lines
...d/tdpservice/search_indexes/models/reparse_meta.py 84.21% 6 Missing ⚠️
...ations/0016_remove_datafile_reparse_meta_models.py 78.57% 2 Missing and 1 partial ⚠️
...nd/tdpservice/search_indexes/admin/reparse_meta.py 71.42% 2 Missing ⚠️
...drs-backend/tdpservice/data_files/admin/filters.py 0.00% 1 Missing ⚠️
...ch_indexes/management/commands/tdp_search_index.py 0.00% 0 Missing and 1 partial ⚠️

@@             Coverage Diff             @@
##           develop    #3220      +/-   ##
===========================================
- Coverage    92.66%   90.85%   -1.82%     
===========================================
  Files           47      299     +252     
  Lines         1009     8451    +7442     
  Branches       169      788     +619     
===========================================
+ Hits           935     7678    +6743     
- Misses          42      660     +618     
- Partials        32      113      +81     
Flag Coverage Δ
dev-backend 90.60% <85.39%> (?)
dev-frontend 92.66% <ø> (ø)


Files with missing lines Coverage Δ
tdrs-backend/tdpservice/data_files/admin/admin.py 78.33% <100.00%> (ø)
...vice/data_files/migrations/0014_reparsefilemeta.py 100.00% <100.00%> (ø)
...ce/data_files/migrations/0015_datafile_reparses.py 100.00% <100.00%> (ø)
tdrs-backend/tdpservice/data_files/models.py 81.64% <100.00%> (ø)
tdrs-backend/tdpservice/data_files/tasks.py 77.77% <100.00%> (ø)
tdrs-backend/tdpservice/parsers/parse.py 82.35% <ø> (ø)
tdrs-backend/tdpservice/parsers/test/factories.py 100.00% <ø> (ø)
...h_indexes/management/commands/clean_and_reparse.py 74.45% <100.00%> (ø)
...arch_indexes/migrations/0032_auto_20241008_1745.py 100.00% <100.00%> (ø)
...drs-backend/tdpservice/data_files/admin/filters.py 56.00% <0.00%> (ø)
... and 4 more

... and 238 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2792776...03bc847.

@jtimpe jtimpe added the raft review This issue is ready for raft review label Oct 8, 2024
@jtimpe jtimpe mentioned this pull request Oct 8, 2024
28 tasks
list_display = [
'id',
'created_at',
'timeout_at',
'success',
Author

we do lose filtering on these fields - we can create a custom filter for properties, future work

Collaborator

@jtimpe couple questions --

  • is this class really only for SSP files?
  • is True the default value for is_finished and is_success? asking because i just ran the command on a large number of files and its still in the process of deleting records but these are the metrics im seeing ⬇️

Screenshot 2024-10-11 112200

Collaborator

also, do we still need these?

Author

@ADPennington

> is this class really only for SSP files?

it should work for all file types. are you seeing something that specifies SSP only?

> is True the default value for is_finished and is_success? asking because i just ran the command on a large number of files and its still in the process of deleting records but these are the metrics im seeing

Yeah, this is interesting. So, under the hood running reparse first deletes all associated records, then creates the new associations. So while the new associations are not-yet-existing, the all() in the ReparseMeta property might return True. I could update it to check how many file associations there are and respond accordingly.

> also, do we still need these

These fields are only used when reparse is run via the management command. we have #3205 which would refactor the action. we can remove the management command and these fields at that time.
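
The behaviour described in this reply follows from Python's built-in all(), which returns True for an empty iterable. A minimal demonstration, using made-up flag lists in place of the real per-file associations:

```python
# Hypothetical per-file "finished" flags at two points in a reparse.
flags_during_deletion = []             # old associations deleted, new ones not yet created
flags_after_queueing = [False, False]  # files queued but not yet parsed

# all() over an empty list is vacuously True.
naive_finished_during_deletion = all(flags_during_deletion)
naive_finished_after_queueing = all(flags_after_queueing)

print(naive_finished_during_deletion)  # True
print(naive_finished_after_queueing)   # False
```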

Collaborator

@ADPennington ADPennington Oct 11, 2024

> @ADPennington
>
> > is this class really only for SSP files?
>
> it should work for all file types. are you seeing something that specifies SSP only?

just this 😄

> > is True the default value for is_finished and is_success? asking because i just ran the command on a large number of files and its still in the process of deleting records but these are the metrics im seeing
>
> Yeah, this is interesting. So, under the hood running reparse first deletes all associated records, then creates the new associations. So while the new associations are not-yet-existing, the all() in the ReparseMeta property might return True. I could update it to check how many file associations there are and respond accordingly.

ok. worth noting, that once files were queued for parsing these booleans changed back to False. So a little misleading initially?

> > also, do we still need these
>
> These fields are only used when reparse is run via the management command. we have #3205 which would refactor the action. we can remove the management command and these fields at that time.

ahh! got it. thank you @jtimpe

Author

@ADPennington I pushed a change so that is_finished and is_success will stay False if there are no files queued. I also fixed the misleading docstring 😄
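
One way such a guard could look, as a sketch only: the helper name all_when_nonempty is hypothetical, and the actual change in the PR may be written differently.

```python
def all_when_nonempty(flags):
    """Return True only if there is at least one flag and every flag is True."""
    flags = list(flags)  # materialize so generators can be checked for emptiness
    return bool(flags) and all(flags)


# No files queued yet: stays False instead of vacuously True.
print(all_when_nonempty([]))             # False
# Every queued file finished.
print(all_when_nonempty([True, True]))   # True
# At least one queued file still running.
print(all_when_nonempty([True, False]))  # False
```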

Collaborator

k i'll re-deploy sometime over the weekend. lots of files currently reparsing in qasp

elipe17 commented Oct 10, 2024

@jtimpe I checked out develop, parsed 8 files with 8 STTs, reparsed all files twice, checked out this branch, ran migrations, and then saw the screenshot below in the admin. I also ran another reparse after checking out the branch. Shouldn't the through model show up as a table now, as opposed to the FK?
Screenshot 2024-10-10 at 11 56 44 AM

jtimpe commented Oct 10, 2024

@ADPennington
as @elipe17 accurately pointed out, the migration does not preserve information in the removed ReparseMeta fields. The associated data files are all preserved and the relationships converted to ReparseFileMeta, but each of the files will show a failed state (as will the parent ReparseMeta). This is because we're increasing data granularity and cannot accurately preserve the data with the new model structure. Newly run reparses will populate the fields on the new through-model as expected.

Since reparse has not yet run in production, we should consider deploying this PR (as well as #3106) before running reparse in production. This will ensure no data loss.
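
To make the migration behaviour described above concrete, here is a simplified sketch in plain Python (not a Django migration). It assumes the old relationship can be read as (reparse_id, file_id) pairs; the function name and the dictionary keys are illustrative, not the actual schema.

```python
def convert_links(old_links):
    """Turn old (reparse_id, file_id) pairs into per-file rows with default values."""
    return [
        {
            "reparse_id": reparse_id,
            "file_id": file_id,
            # Nothing per-file was recorded for historical reparses, so the
            # new row falls back to defaults, which read as a failed state.
            "finished": False,
            "success": False,
            "num_records_created": 0,
            "cat_4_errors_generated": 0,
        }
        for reparse_id, file_id in old_links
    ]


rows = convert_links([(1, 10), (1, 11)])
print(len(rows))  # 2
```

The file-to-reparse relationship survives, but the old aggregate values cannot be split back out per file, which is why pre-migration reparses show up as failed.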

@elipe17 elipe17 self-requested a review October 10, 2024 16:44
@elipe17 elipe17 left a comment

I tested with 5 Celery workers and reparsed 8 files 10 times to see if I could break anything. Works like a charm, feels faster because we don't lock DB rows anymore, and we get more usable info. LGTM!

@ADPennington
Collaborator

> @ADPennington
> as @elipe17 accurately pointed out, the migration does not preserve information in the removed ReparseMeta fields. The associated data files are all preserved and the relationships converted to ReparseFileMeta, but each of the files will show a failed state (as will the parent ReparseMeta). This is because we're increasing data granularity and cannot accurately preserve the data with the new model structure. Newly run reparses will populate the fields on the new through-model as expected.
>
> Since reparse has not yet run in production, we should consider deploying this PR (as well as #3106) before running reparse in production. This will ensure no data loss.

thanks for this update @jtimpe . pinging @vlasse86 @lfrohlich @ttran-hub for awareness. what im gathering here is that we need both of these tickets before we can safely reparse in prod. I'm not sure of the timeline just yet to get these through review, but id say both are top priority.

@jtimpe jtimpe added QASP Review and removed raft review This issue is ready for raft review labels Oct 10, 2024
@jtimpe jtimpe requested a review from ADPennington October 10, 2024 19:32
@ADPennington ADPennington added the Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI label Oct 10, 2024
@ADPennington ADPennington removed the Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI label Oct 11, 2024
@ADPennington
Collaborator

@jtimpe i think there's a lint error here. no rush on this, ill re-test over the weekend after the latest reparse completes, if possible.

Collaborator

@ADPennington ADPennington left a comment

@jtimpe this is in pretty good shape! my testing notes are below.


  • linting error on branch
  • I'll need to re-test this to check the status fields is_finished and is_success during the record deletion portion of the reparsing process.
  • I tested this twice so far:
    • test # 1: Screenshot 2024-10-16 084708

      • 1 out of 519 files failed during the re-parse

      • the file that failed still resulted in records being added to the db Screenshot 2024-10-16 092447

      • this file also had some exceptions (see below): Screenshot 2024-10-16 092551

      • The results here suggest that total_num_records_post will not populate if at least one file failed. Is this true? This would be unexpected, since records from the file before and after this file (in the sequence) were added to the db. Screenshot 2024-10-16 092350

    • test # 2:
      Screenshot 2024-10-16 092037

@ADPennington ADPennington added the Blocked Label for Pull Requests that are currently blocked by a dependency label Oct 16, 2024
@jtimpe jtimpe force-pushed the 3157-new-reparse-meta-through branch from 7924d57 to 5b823bc Compare October 16, 2024 16:36
@ADPennington ADPennington added Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI and removed Blocked Label for Pull Requests that are currently blocked by a dependency Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI labels Oct 16, 2024
@ADPennington ADPennington added the Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI label Oct 16, 2024
@ADPennington
Collaborator

> @jtimpe this is in pretty good shape! my testing notes are below.
>
>   • linting error on branch

Resolved ✔️

>   • I'll need to re-test this to check the status fields is_finished and is_success during the record deletion portion of the reparsing process.

Resolved ✔️

statuses while backing up db
while_backingup

statuses while deleting records and objects
while_deleting

@ADPennington
Collaborator

before re-enabling total_num_records_post

  • test # 2:
    Screenshot 2024-10-16 092037

after ✔️

after

Collaborator

@ADPennington ADPennington left a comment

great work @jtimpe 🥇

@ADPennington ADPennington added Ready to Merge and removed QASP Review Deploy with CircleCI-qasp Deploy to https://tdp-frontend-qasp.app.cloud.gov through CircleCI labels Oct 16, 2024
@jtimpe jtimpe merged commit 4f267fa into develop Oct 17, 2024
16 of 17 checks passed
@jtimpe jtimpe deleted the 3157-new-reparse-meta-through branch October 17, 2024 00:42
Development

Successfully merging this pull request may close these issues.

Reparse Model Field Additions
4 participants