From e414a60e483a9c1350accce6197db916d7b8499e Mon Sep 17 00:00:00 2001 From: Eric Lipe Date: Mon, 26 Aug 2024 09:32:10 -0400 Subject: [PATCH] - initial commit with templates - halfway completed tm for new multi select filter updates --- .../multi-select-filters.md | 26 ++++++++ .../tech-memos/reparse.md | 48 ++++++++++++++ .../tech-memos/sequential-reparse.md | 62 +++++++++++++++++++ .../tech-memos/tm-template.md | 26 ++++++++ 4 files changed, 162 insertions(+) create mode 100644 docs/Technical-Documentation/tech-memos/multi-select-fiters/multi-select-filters.md create mode 100644 docs/Technical-Documentation/tech-memos/reparse.md create mode 100644 docs/Technical-Documentation/tech-memos/sequential-reparse.md create mode 100644 docs/Technical-Documentation/tech-memos/tm-template.md diff --git a/docs/Technical-Documentation/tech-memos/multi-select-fiters/multi-select-filters.md b/docs/Technical-Documentation/tech-memos/multi-select-fiters/multi-select-filters.md new file mode 100644 index 000000000..fb149f285 --- /dev/null +++ b/docs/Technical-Documentation/tech-memos/multi-select-fiters/multi-select-filters.md @@ -0,0 +1,26 @@ +# Multi-Select Filters + +**Audience**: TDP Software Engineers
+**Subject**: Multi-Select Filter Integration
+**Date**: August 8, 2024
+ +## Summary +This is a template to use to create new technical memorandums. + +## Background +TDP has been expanding its Django Admin Console (DAC) filtering capabilities by introducing custom filters, specifically multi-select filters. This has introduced a myriad of issues because TDP does not use the default DAC. Instead, to assist with accessibility compliance, TDP wraps the default DAC with [Django 508](https://github.com/raft-tech/django-admin-508) (henceforth referred to as 508), which makes various updates to the styling and functionality of the default DAC. A key change that 508 introduces to the DAC is an `Apply Filters` button that intercepts query string parameters from default DAC filters and only applies them after the button is clicked. The default DAC applies the filters as they are selected as opposed to all at once. The issue with 508's approach is that it assumes all filters are built-in Django filters (i.e. single-select filters). This presents a discrepancy because Django allows developers to write custom templates and filters to add further filtering functionality (e.g. multi-select filters). + +## Out of Scope +Call out what is out of scope for this technical memorandum and should be considered in a different technical memorandum. + +## Method/Design +This section should contain sub sections that provide general implementation details surrounding key components required to implement the feature. + +### Sub header (piece of the design, can be many of these) +sub header content describing component. + +## Affected Systems +provide a list of systems this feature will depend on/change. + +## Use and Test cases to consider +provide a list of use cases and test cases to be considered when the feature is being implemented. 
diff --git a/docs/Technical-Documentation/tech-memos/reparse.md b/docs/Technical-Documentation/tech-memos/reparse.md new file mode 100644 index 000000000..5c1f5991e --- /dev/null +++ b/docs/Technical-Documentation/tech-memos/reparse.md @@ -0,0 +1,48 @@ +# Reparsing + +**Audience**: TDP Software Engineers
+**Subject**: Reparsing
+**Date**: August 9, 2024
+ +## Summary +Re-parsing improves the flexibility of TDP's workflow for ingesting data files. These enhancement requests came out of pragmatic needs by the administrator user of the tool from 3041 and theoretical concerns from the development team in addressing current system limitations with this new feature. + +## Background +https://github.com/raft-tech/TANF-app/issues/2870 +https://github.com/raft-tech/TANF-app/issues/2820 +https://github.com/raft-tech/TANF-app/releases/tag/v3.2.0-Sprint-90 +https://github.com/raft-tech/TANF-app/pull/2772 +https://github.com/raft-tech/TANF-app/issues/1858 +https://github.com/raft-tech/TANF-app/issues/1350 + +[Driving force of reparsing](https://github.com/raft-tech/TANF-app/issues/2870) +- Reparsing files that are stuck in pending to some other state because validators have changed, or the parser has better exception handling + +## Out of Scope +- Parsing and/or validator logic changes +- Data Model or search_indices changes +- Systemic/Infrastructure changes to accommodate large data sets +- End-user facing changes to our frontend +- Pipeline or Orchestration changes + +## Method/Design +The reparsing enhancements focus on maturing the clean_and_reparse.py Django command, which required CLI invocation by system administrator(s). To mature and polish this feature to meet our new deliverables, we plan to shift major functionality and visibility into the Administrator Console to leverage our existing tools within. + +#3004 introduced an initial pass at reparsing. From this, key components were identified that would improve both reparsing and its usability for the system administrators. The following items were identified to enhance the reparsing feature: introducing a Django model that tracks meta data surrounding the reparsing event, managing data synchronization and parallel execution of reparsing events, and moving away from the current CLI in favor of a DAC specific way to execute reparsing. 
+ +### Meta Model +This enhancement will seek to improve our visibility into what has happened during execution of a re-parsing command. We believe creating a database model to store relevant fields about the run will improve usability. Fields will include start time, end time, number of files processed, which files were targeted, number of records repopulated, etc. + +### Data Synchronization +... + +### DAC Reparse Action +To mature and polish this feature it should no longer be executed from the CLI. The DAC provides all/most of the necessary filtering required to specify what datafiles to reparse. Adding a new `reparse` action to the `DataFiles` page in the DAC provides a seamless experience for the admins while also providing the reparse event with the appropriate datafiles. + #### Confirmation dialog asking "are you sure you want to reparse?" + +## Affected Systems +- Elastic +- Postgres (records, dfs, datafiles, parser errors) + +## Use and Test cases to consider +provide a list of use cases and test cases to be considered when the feature is being implemented. diff --git a/docs/Technical-Documentation/tech-memos/sequential-reparse.md b/docs/Technical-Documentation/tech-memos/sequential-reparse.md new file mode 100644 index 000000000..077d21f36 --- /dev/null +++ b/docs/Technical-Documentation/tech-memos/sequential-reparse.md @@ -0,0 +1,62 @@ +# Guarantee Sequential Reparse Events + +**Audience**: TDP Software Engineers
+**Subject**: Sequential Reparsing
+**Date**: August 8, 2024
+ + +## Summary +This technical memorandum aims to provide a software engineer with initial research, design patterns, and ideas necessary +to implement sequential reparsing in the TDP application. This document covers distributed/parallel data safety, how +the data synchronization allows sequential execution guarantees, and a last ditch timeout calculation necessary to +guarantee sequential reparse events at the application level. This memorandum does not take into account network partition tolerance or parsing idempotence. + +## Background +When a reparse event is executed by an admin user, a set of N files can be selected, where N is in the range +[0, # of datafiles in DB]. For each reparsing event, a ReparseMeta Django model is created to track meta data about the +event such as: the number of files to be reparsed, the number of records deleted before reparsing, the number of records +created during reparsing, a backup location, etc. The meta model also contains the fields `files_completed` and +`files_failed`. These two fields were added to the model so that it can track when all files in its set of files +have finished the parsing process, regardless of whether they passed or failed parsing. + +## Distributed/Parallel Data Safety +In the [Background](#background) section the meta model and some of its fields were introduced along with the idea that +a reparse event generates N parsing tasks. Because (theoretically) all the tasks can execute in parallel, and there is +only one meta model per event, the meta model inherently becomes a shared object and therefore must be synchronized +across the set of N parsing tasks. There are many ways to synchronize data in a distributed system, both custom and not. +However, because the meta model is a database object, this technical memorandum suggests using the already tested and +vetted concurrency control and synchronization mechanisms inherent to TDP's Postgres database. 
That is, for the fields in +the meta model that need to be updated in parallel (`files_completed`, `files_failed`, `num_records_created`), the +implementing engineer should leverage Django queries that translate to minimally scoped locking database +transactions. This memorandum suggests leveraging the [select_for_update()](https://docs.djangoproject.com/en/5.0/ref/models/querysets/#select-for-update) query, which provides row-level locking for transactions in a Postgres environment. Using this +query ensures that whichever task executes it first will be the only task that can update the fields. All other tasks trying to query the model for updates will be blocked until the original task releases the lock. Thus, each parser task can query the appropriate meta model, update the appropriate fields, and continue on as normal. The one caveat to this approach is that whenever an update needs to be made, the task must explicitly re-query the meta model to avoid race conditions and stale +data. A piece of example code is given below to demonstrate how the implementer might update the `files_completed` field. Note the function was implemented as a static member of the ReparseMeta class. + +```python +@staticmethod +def increment_files_completed(reparse_meta_models): + """ + Increment the count of files that have completed parsing for the datafile's current/latest reparse model. + + Because this function can be called in parallel, we use `select_for_update`: multiple parse tasks can + reference the same ReparseMeta object that is being queried below. `select_for_update` provides a DB lock on + the object and forces other transactions on the object to wait until this one completes. 
+ """ + if reparse_meta_models.exists(): + with transaction.atomic(): + try: + meta_model = reparse_meta_models.select_for_update().latest("pk") + meta_model.files_completed += 1 + if ReparseMeta.assert_all_files_done(meta_model): + ReparseMeta.set_reparse_finished(meta_model) + meta_model.save() + except DatabaseError: + logger.exception("Encountered exception while trying to update the `files_completed` field on the " + "latest ReparseMeta object.") +``` + +## Sequential Execution +... + +## Last Ditch Timeout +... diff --git a/docs/Technical-Documentation/tech-memos/tm-template.md b/docs/Technical-Documentation/tech-memos/tm-template.md new file mode 100644 index 000000000..0921d3888 --- /dev/null +++ b/docs/Technical-Documentation/tech-memos/tm-template.md @@ -0,0 +1,26 @@ +# TITLE + +**Audience**: TDP Software Engineers
+**Subject**: SUBJECT/TITLE
+**Date**: August 8, 2024
+ +## Summary +This is a template to use to create new technical memorandums. + +## Background (Optional) +Background for the feature if necessary. + +## Out of Scope +Call out what is out of scope for this technical memorandum and should be considered in a different technical memorandum. + +## Method/Design +This section should contain sub sections that provide general implementation details surrounding key components required to implement the feature. + +### Sub header (piece of the design, can be many of these) +sub header content describing component. + +## Affected Systems +provide a list of systems this feature will depend on/change. + +## Use and Test cases to consider +provide a list of use cases and test cases to be considered when the feature is being implemented.