[BigQuery] Stop deploying SECONDARY raw data latest views (Recidiviz/recidiviz-data#34309)

## Description of the change

The raw data `_latest` views for SECONDARY are really only useful in the
context of a raw data reimport in SECONDARY (a relatively rare event),
but they represent a huge fraction of the views we generate with every
deploy.

This PR stops generating latest views for SECONDARY. If someone wanted
to load SECONDARY latest views or any downstream views that use
SECONDARY data, they could run:
```
python -m recidiviz.tools.load_end_to_end_data_sandbox \
    --state_code US_XX \
    --sandbox_prefix <prefix> \
    --raw_data_source_instance SECONDARY \
    --load_up_to_datasets <datasets to load>
```
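
As a rough illustration of why dropping SECONDARY shrinks the deployed view count, here is a minimal sketch rather than the repository's actual code. `Instance`, `STATES`, `RAW_FILE_TAGS`, and `collect_latest_view_names` are hypothetical stand-ins for `DirectIngestInstance`, `get_existing_direct_ingest_states()`, the per-state raw file configs, and `DirectIngestRawDataTableLatestViewCollector`; the `_secondary` dataset suffix is an assumption.

```python
from enum import Enum
from itertools import chain


class Instance(Enum):
    """Hypothetical stand-in for DirectIngestInstance."""

    PRIMARY = "PRIMARY"
    SECONDARY = "SECONDARY"


# Hypothetical states and raw file tags, for illustration only.
STATES = ["US_XX", "US_YY"]
RAW_FILE_TAGS = ["table_a", "table_b", "table_c"]


def collect_latest_view_names(state: str, instance: Instance) -> list[str]:
    """Returns one `_latest` view name per raw file tag (naming is assumed)."""
    suffix = "" if instance is Instance.PRIMARY else "_secondary"
    dataset = f"{state.lower()}_raw_data_up_to_date_views{suffix}"
    return [f"{dataset}.{tag}_latest" for tag in RAW_FILE_TAGS]


# Before: one set of latest views per (instance, state) pair.
views_before = list(
    chain.from_iterable(
        collect_latest_view_names(state, instance)
        for instance in Instance
        for state in STATES
    )
)

# After: PRIMARY only, so the SECONDARY half is never generated.
views_after = list(
    chain.from_iterable(
        collect_latest_view_names(state, Instance.PRIMARY) for state in STATES
    )
)

print(len(views_before), len(views_after))
```

In this toy setup the latest-view count halves; in the real deployment the savings depend on how many raw files each state defines.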

This change reduces the number of deployed views from 4713 to 3309. I
verified that the command above properly overrides the raw data dataset
(see
[example](https://console.cloud.google.com/bigquery?project=recidiviz-staging&organizationId=448885369991&ws=!1m5!1m4!4m3!1srecidiviz-staging!2sageiduschek_us_oz_raw_data_up_to_date_views!3sIcecreamshop_latest)).
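
The override itself can be sketched as follows. This is a simplified, self-contained approximation of the idea, not the code in this PR: `Instance` and `raw_dataset` are hypothetical stand-ins for `DirectIngestInstance` and `raw_tables_dataset_for_region`, and the PRIMARY dataset name `us_xx_raw_data` is an assumption (only `us_xx_raw_data_secondary` appears in the script's docstring).

```python
from enum import Enum


class Instance(Enum):
    """Hypothetical stand-in for DirectIngestInstance."""

    PRIMARY = "PRIMARY"
    SECONDARY = "SECONDARY"


def raw_dataset(state_code: str, instance: Instance) -> str:
    """Hypothetical stand-in for raw_tables_dataset_for_region (naming assumed)."""
    suffix = "" if instance is Instance.PRIMARY else "_secondary"
    return f"{state_code.lower()}_raw_data{suffix}"


def raw_data_overrides(state_code: str, source_instance: str) -> dict[str, str]:
    """Maps the PRIMARY raw dataset to the requested instance's raw dataset."""
    instance = Instance(source_instance)
    if instance is Instance.PRIMARY:
        # Views already read from PRIMARY, so there is nothing to redirect.
        return {}
    return {
        raw_dataset(state_code, Instance.PRIMARY): raw_dataset(state_code, instance)
    }


print(raw_data_overrides("US_XX", "SECONDARY"))
print(raw_data_overrides("US_XX", "PRIMARY"))
```

Because the deployed latest views are always built against the PRIMARY dataset, redirecting that one dataset name is enough for the sandbox views to read SECONDARY data instead.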

## Type of change

> All pull requests must have at least one of the following labels
applied (otherwise the PR will fail):

| Label | Description |
|-------|-------------|
| Type: Bug | non-breaking change that fixes an issue |
| Type: Feature | non-breaking change that adds functionality |
| Type: Breaking Change | fix or feature that would cause existing functionality to not work as expected |
| Type: Non-breaking refactor | change addresses some tech debt item or prepares for a later change, but does not change functionality |
| Type: Configuration Change | adjusts configuration to achieve some end related to functionality, development, performance, or security |
| Type: Dependency Upgrade | upgrades a project dependency - these changes are not included in release notes |
## Related issues

Related to Recidiviz/recidiviz-data#26138

## Checklists

### Development

**This box MUST be checked by the submitter prior to merging**:
- [x] **Double- and triple-checked that there is no Personally
Identifiable Information (PII) being mistakenly added in this pull
request**

These boxes should be checked by the submitter prior to merging:
- [x] Tests have been written to cover the code changed/added as part of
this pull request

### Code review

These boxes should be checked by reviewers prior to merging:

- [x] This pull request has a descriptive title and information useful
to a reviewer
- [x] Potential security implications or infrastructural changes have
been considered, if relevant

GitOrigin-RevId: f3349fcef94779eca0122e4fe189013281912b2e
ageiduschek authored and Helper Bot committed Nov 2, 2024
1 parent 7648ed0 commit 5dad113
Showing 3 changed files with 27 additions and 3 deletions.
5 changes: 3 additions & 2 deletions recidiviz/ingest/direct/views/view_config.py
@@ -38,9 +38,10 @@ def get_direct_ingest_view_builders() -> Sequence[BigQueryViewBuilder]:
             # table in all regions
             DirectIngestRawDataTableLatestViewCollector(
                 region_code=state_code.value.lower(),
-                raw_data_source_instance=instance,
+                # We only deploy latest views for PRIMARY - views for SECONDARY can be
+                # loaded via a sandbox.
+                raw_data_source_instance=DirectIngestInstance.PRIMARY,
             ).collect_view_builders()
-            for instance in DirectIngestInstance
             for state_code in get_existing_direct_ingest_states()
         )
     )
2 changes: 2 additions & 0 deletions recidiviz/tests/tools/load_end_to_end_sandbox_test.py
@@ -273,6 +273,7 @@ def test_no_post_ingest_pipelines(self) -> None:

         with local_project_id_override("recidiviz-456"):
             overrides_dict = get_view_update_input_dataset_overrides_dict(
+                state_code=StateCode.US_CA,
                 ingest_pipeline_params=ingest_pipeline_params,
                 post_ingest_pipeline_params=[],
             )
@@ -334,6 +335,7 @@ def test_complex(self) -> None:

         with local_project_id_override("recidiviz-456"):
             overrides_dict = get_view_update_input_dataset_overrides_dict(
+                state_code=StateCode.US_CA,
                 ingest_pipeline_params=ingest_pipeline_params,
                 post_ingest_pipeline_params=post_ingest_pipeline_params,
             )
23 changes: 22 additions & 1 deletion recidiviz/tools/load_end_to_end_data_sandbox.py
@@ -30,6 +30,14 @@
     --state_code US_XX \
     --sandbox_prefix my_prefix \
     --load_up_to_addresses aggregated_metrics.incarceration_state_aggregated_metrics,aggregated_metrics.supervision_state_aggregated_metrics
+
+# Read only from raw data in `us_xx_raw_data_secondary` (e.g. to test a raw data
+# reimport in SECONDARY).
+python -m recidiviz.tools.load_end_to_end_data_sandbox \
+    --state_code US_XX \
+    --sandbox_prefix my_prefix \
+    --raw_data_source_instance SECONDARY \
+    --load_up_to_datasets sessions
 """
 import argparse
 import json
Expand All @@ -40,6 +48,7 @@
from recidiviz.big_query.address_overrides import BigQueryAddressOverrides
from recidiviz.big_query.big_query_address import BigQueryAddress
from recidiviz.common.constants.states import StateCode
from recidiviz.ingest.direct.dataset_config import raw_tables_dataset_for_region
from recidiviz.ingest.direct.types.direct_ingest_instance import DirectIngestInstance
from recidiviz.metrics.export.export_config import ExportViewCollectionConfig
from recidiviz.pipelines.config_paths import PIPELINE_CONFIG_YAML_PATH
@@ -217,6 +226,7 @@ def get_sandbox_post_ingest_pipeline_params(


 def get_view_update_input_dataset_overrides_dict(
+    state_code: StateCode,
     ingest_pipeline_params: IngestPipelineParameters,
     post_ingest_pipeline_params: list[
         MetricsPipelineParameters | SupplementalPipelineParameters
Expand All @@ -226,6 +236,15 @@ def get_view_update_input_dataset_overrides_dict(
when loading sandbox views that read from the given sandbox pipelines.
"""
input_dataset_overrides_dict: dict[str, str] = {}

ingest_input_instance = DirectIngestInstance(
ingest_pipeline_params.raw_data_source_instance
)
if ingest_input_instance != DirectIngestInstance.PRIMARY:
input_dataset_overrides_dict[
raw_tables_dataset_for_region(state_code, DirectIngestInstance.PRIMARY)
] = raw_tables_dataset_for_region(state_code, ingest_input_instance)

for params in [ingest_pipeline_params, *post_ingest_pipeline_params]:
output_dataset_overrides = assert_type(
params.output_dataset_overrides, BigQueryAddressOverrides
@@ -277,6 +296,8 @@ def _get_params_summary(params_list: list[PipelineParameters]) -> str:
         }
         if isinstance(params, MetricsPipelineParameters):
             metadata["metric_types"] = params.metric_types
+        elif isinstance(params, IngestPipelineParameters):
+            metadata["raw_data_source_instance"] = params.raw_data_source_instance

         table_data.append([params.job_name, json.dumps(metadata, indent=2)])

@@ -328,7 +349,7 @@ def load_end_to_end_sandbox(
     print("\nCollecting views to load after pipelines run...\n")
     view_update_input_dataset_overrides_dict = (
         get_view_update_input_dataset_overrides_dict(
-            ingest_pipeline_params, post_ingest_pipeline_params
+            state_code, ingest_pipeline_params, post_ingest_pipeline_params
         )
     )
     view_builders_to_load = collect_changed_views_and_descendants_to_load(
