
Spike: As tech lead, I need elastic re-indexing to be automated #2870

Closed · 1 of 8 tasks
ADPennington opened this issue Mar 1, 2024 · 8 comments · Fixed by #2881
Assignees: jtimpe
Labels: backend · database (For issues primarily related to schema changes) · dev · Parity (Work associated with TDP Parity) · spike

Comments


ADPennington commented Mar 1, 2024

Description:

As mentioned in #2820, python manage.py search_index --rebuild is needed to facilitate elastic re-indexing.

Currently this is a manual step, one that was needed to yield expected results during QASP review of parsing/validation tickets like #2825. It will also remain a manual step before releasing code to hhs:main and hhs:master.

This step would be better automated.

Acceptance Criteria:
Create a list of functional outcomes that must be achieved to complete this issue

  • Elasticsearch indexes are rebuilt/refreshed when new model changes are available
  • Testing Checklist has been run and all tests pass
  • README is updated, if necessary

Tasks:
Create a list of granular, specific work items that must be completed to deliver the desired outcomes of this issue

  • Run python manage.py search_index --rebuild with the --parallel flag over the weekend to gauge
    • how long the task will actually take in prod with the larger elasticsearch instance
    • if the --parallel flag makes a difference
    • if data can be added while indexing is happening
  • Adjust BulkIndexError exception handling - it should not roll back un-indexed records, but instead write them to Postgres and skip the indexing step (see the sketch after this list)
  • cron/beat task to reindex the entire database periodically
  • Mitigation plan - how can we get as close to zero downtime as possible?
  • Run Testing Checklist and confirm all tests pass

Notes:
Possible approaches

  • similar to apply-remote-migrations - rebuild search indexes after backend deployment (via ssh or cf run-task)
  • have a celery beat/cron job to periodically reindex (see the beat-task sketch after this list)
    • caveat: need to change our handling of BulkIndexError so it does not roll back un-indexed records
    • or - reindex post bulk_create (after parsing completes for each file)
  • write a custom indexing routine using bulk requests (per "Tune for indexing speed", linked below)
    • django-elasticsearch-dsl seems to use the _bulk endpoint when making requests; we may need to investigate further how the library works and/or introduce some customization to tune it to our needs
    • alternatively, use elastic's reindex API to reindex previously indexed data without deleting/recreating (as django-elasticsearch-dsl does), then try to bulk index any un-indexed data (see the second sketch after this list)
  • increase the resources available to the elastic cluster (per "Tune for indexing speed", linked below)
    • currently es-dev in dev/staging and es-medium in prod
  • utilize the --parallel option in python manage.py search_index --rebuild - https://django-elasticsearch-dsl.readthedocs.io/en/latest/management.html
  • utilize the --use-alias option in python manage.py search_index --rebuild

Supporting Documentation:
Please include any relevant log snippets/files/screen shots

Open Questions:
Please include any questions or decisions that must be made before beginning work or to confidently call this issue complete

@ADPennington added the backend, dev, and database labels Mar 1, 2024
@ADPennington mentioned this issue Mar 1, 2024

ADPennington commented Mar 7, 2024

@jtimpe below are the metrics after releasing sprint 93 to staging yesterday:

  • Number of data files: 394
  • Number of db records: ~872K
    • SSP T1: N=362
    • SSP T2: N=432
    • SSP T3: N=785
    • SSP T4: N=2206
    • SSP T5: N=6739
    • SSP T6: N=12
    • SSP T7: N=45
    • TANF T1: N=214226
    • TANF T2: N=238098
    • TANF T3: N=403698
    • TANF T4: N=895
    • TANF T5: N=2423
    • TANF T6: N=57
    • TANF T7: N=48
    • Tribal TANF T1: N=360
    • Tribal TANF T2: N=549
    • Tribal TANF T3: N=810
    • Tribal TANF T4: N=116
    • Tribal TANF T5: N=355
    • Tribal TANF T6: N=18
    • Tribal TANF T7: N=1
  • Time it took to run the rebuild: 35 minutes ⚠️

Update: there are 6.6 million records in prod as of today.

@jtimpe jtimpe self-assigned this Mar 7, 2024
ADPennington commented

@jtimpe do you happen to know if the re-index command impacts the newest filter in any way?

I'm noticing in staging (which is on the sprint 93 release) that the latest submission isn't flagged as "newest" in DAC. I tried submitting the same file (2023.Q3.Aggregate Data.txt) in develop and staging and got different results in DAC:

[screenshots: the submitted file; the develop T6 record flagged as newest; the staging T6 record not flagged as newest]


jtimpe commented Mar 27, 2024

Notes from testing in Raft (work in progress):
https://hackmd.io/@dBEtH2T9SRqyVnE3ZKtYSA/ry7GWLP0a/edit

robgendron commented
Potential spike - still working through it.

robgendron commented
Relabeled to spike.

@robgendron robgendron changed the title As tech lead, I need elastic re-indexing to be automated Spike: As tech lead, I need elastic re-indexing to be automated Apr 17, 2024

jtimpe commented Apr 29, 2024

data lifecycle

  • remove "old" data, esp. when deploying logic updates that make some data obsolete
    • submitted before a certain date
    • data for a prior submission period
    • fiscal period (test FY 22)
  • delete-and-reparse cron job or manual task - re-code everything since the last reporting period (see the sketch after this list)
    • draft a ticket - this has use beyond this specific issue
  • take advantage of this time to reindex elastic

other options

  • host elastic app ourselves


reitermb commented May 6, 2024

Moving into Raft Review this week, but staying in Blocked pending testing in staging.

robgendron commented
The PR for this is in QASP; we will be able to test soon.
