Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reparse Memory Management #3172

Merged
merged 48 commits into from
Sep 30, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
cc536b5
- Updated method to batch delete records in elastic to avoid pulling …
elipe17 Sep 4, 2024
a27086c
- Instantiate document once
elipe17 Sep 4, 2024
6f2105f
- Updated clean and reparse to not load entire queryset into memory
elipe17 Sep 5, 2024
223bc40
Merge branch 'develop' into 3170-reparse-memory-mgmt
elipe17 Sep 5, 2024
a8093e5
- Update names of functions
elipe17 Sep 10, 2024
f34b014
- Update sequential function to return boolean to allow testing
elipe17 Sep 10, 2024
ccd1816
Merge branch 'develop' into 3170-reparse-memory-mgmt
elipe17 Sep 11, 2024
d96f455
- new indices
elipe17 Sep 11, 2024
b1f71dd
Merge branch '3170-reparse-memory-mgmt' of https://github.com/raft-te…
elipe17 Sep 11, 2024
5621f7f
- linting
elipe17 Sep 11, 2024
9a03ff3
- Added tests for most exception paths
elipe17 Sep 11, 2024
778c827
-linting
elipe17 Sep 11, 2024
4d2ee69
- Adding refresh
elipe17 Sep 12, 2024
d8a3b44
- testing not duplicating container
elipe17 Sep 12, 2024
fa77bac
- revert test change
elipe17 Sep 12, 2024
01dc2c3
- remove param
elipe17 Sep 12, 2024
876260c
- Add missing tests
elipe17 Sep 12, 2024
76624fc
- adding test
elipe17 Sep 12, 2024
644703f
- Add remaining tests
elipe17 Sep 12, 2024
09bf83d
- Add more debug logging
elipe17 Sep 13, 2024
9fe3d1c
- running only failing test
elipe17 Sep 13, 2024
9421869
- more debugging
elipe17 Sep 13, 2024
d456c8b
- backend only
elipe17 Sep 13, 2024
f408833
- remove test_reparse only
elipe17 Sep 13, 2024
a3aa2f4
- Reset settings after parsing
elipe17 Sep 13, 2024
9e28a66
- update file
elipe17 Sep 13, 2024
d7b543a
- Update factories to use the correct types for fields
elipe17 Sep 13, 2024
e1bfce3
- paremetrize test
elipe17 Sep 13, 2024
3f89673
- revert debugging changes
elipe17 Sep 13, 2024
dc55e3e
Merge branch 'develop' into 3170-reparse-memory-mgmt
elipe17 Sep 13, 2024
c3002e0
- Add tests for more coverage
elipe17 Sep 13, 2024
4075194
Merge branch '3170-reparse-memory-mgmt' of https://github.com/raft-te…
elipe17 Sep 13, 2024
96372d8
- linting
elipe17 Sep 13, 2024
830b6d2
- UPdate logic for setting reparse to finished
elipe17 Sep 13, 2024
51ec731
- linting
elipe17 Sep 13, 2024
b9cb34e
- add basic tests for db backup
elipe17 Sep 13, 2024
a85b232
- remove mock
elipe17 Sep 13, 2024
76fa734
- add main routine test for backup
elipe17 Sep 13, 2024
0dfb37e
- linting
elipe17 Sep 13, 2024
16c1e11
- Added comment for posterity
elipe17 Sep 19, 2024
0fd06c2
- Update reparse to not fail/exit when elastic throws index update. i…
elipe17 Sep 23, 2024
8187285
Merge branch 'develop' into 3170-reparse-memory-mgmt
ADPennington Sep 24, 2024
e752589
Merge branch 'develop' into 3170-reparse-memory-mgmt
ADPennington Sep 26, 2024
23924fb
- update terraform for testing
elipe17 Sep 26, 2024
93bd544
Merge branch '3170-reparse-memory-mgmt' of https://github.com/raft-te…
elipe17 Sep 26, 2024
0543f29
- Update terraform based on cloud.gov response
elipe17 Sep 30, 2024
f0a6c55
Merge branch 'develop' into 3170-reparse-memory-mgmt
elipe17 Sep 30, 2024
a981311
Merge branch 'develop' into 3170-reparse-memory-mgmt
ADPennington Sep 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion tdrs-backend/tdpservice/parsers/validators/category3.py
Original file line number Diff line number Diff line change
Expand Up @@ -386,7 +386,7 @@ def validate(record, row_schema):
"Caught exception in validator: validate__WORK_ELIGIBLE_INDICATOR__HOH__AGE. " +
f"With field values: {vals}."
)
logger.error(f'Exception: {e}')
logger.debug(f'Exception: {e}')
elipe17 marked this conversation as resolved.
Show resolved Hide resolved
# Per conversation with Alex on 03/26/2024, returning the true case during exception handling to avoid
# confusing the STTs.
return true_case
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

from django.core.management.base import BaseCommand
from django.core.management import call_command
from django.core.paginator import Paginator
from django.db.utils import DatabaseError
from elasticsearch.exceptions import ElasticsearchException
from tdpservice.data_files.models import DataFile
Expand Down Expand Up @@ -61,6 +62,7 @@ def __handle_elastic(self, new_indices, log_context):
"""Create new Elastic indices and delete old ones."""
if new_indices:
try:
logger.info("Creating new elastic indexes.")
call_command('tdp_search_index', '--create', '-f', '--use-alias')
log("Index creation complete.",
logger_context=log_context,
Expand All @@ -82,7 +84,10 @@ def __delete_summaries(self, file_ids, log_context):
"""Raw delete all DataFileSummary objects."""
try:
qset = DataFileSummary.objects.filter(datafile_id__in=file_ids)
count = qset.count()
logger.info(f"Deleting {count} datafile summary objects.")
qset._raw_delete(qset.db)
logger.info("Successfully deleted datafile summary objects.")
except DatabaseError as e:
log('Encountered a DatabaseError while deleting DataFileSummary from Postgres. The database '
'and Elastic are INCONSISTENT! Restore the DB from the backup as soon as possible!',
Expand All @@ -102,11 +107,16 @@ def __delete_records(self, file_ids, new_indices, log_context):
for doc in DOCUMENTS:
try:
model = doc.Django.model
qset = model.objects.filter(datafile_id__in=file_ids)
total_deleted += qset.count()
qset = model.objects.filter(datafile_id__in=file_ids).order_by('id')
count = qset.count()
total_deleted += count
logger.info(f"Deleting {count} records of type: {model}.")
if not new_indices:
# If we aren't creating new indices, then we don't want duplicate data in the existing indices.
doc().update(qset, refresh=True, action='delete')
paginator = Paginator(qset, settings.BULK_CREATE_BATCH_SIZE)
document = doc()
for page in paginator:
document.update(page.object_list, action='delete')
qset._raw_delete(qset.db)
except ElasticsearchException as e:
log(f'Elastic document delete failed for type {model}. The database and Elastic are INCONSISTENT! '
Expand All @@ -132,7 +142,10 @@ def __delete_errors(self, file_ids, log_context):
"""Raw delete all ParserErrors for each file ID."""
try:
qset = ParserError.objects.filter(file_id__in=file_ids)
count = qset.count()
logger.info(f"Deleting {count} parser errors.")
qset._raw_delete(qset.db)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is _raw_delete() fast enough that we don't need a paginator? Does PG have an optimized solution here but not above?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we don't need to use pagination on the deletes is because nothing is pulled into main memory. Every operation up to that point was being lazy evaluated. When we tell Django to _raw_delete it lets the DB engine execute the query and only returns the number of rows that were deleted. As opposed to returning a dict of model items that were deleted.

logger.info("Successfully deleted parser errors.")
except DatabaseError as e:
log('Encountered a DatabaseError while deleting ParserErrors from Postgres. The database '
'and Elastic are INCONSISTENT! Restore the DB from the backup as soon as possible!',
Expand Down
Loading