docs(observe): DataHub Operation freshness assertion guide (datahub-p…
…roject#8749)

Co-authored-by: John Joyce <[email protected]>
zmcnellis and jjoyce0510 authored Aug 30, 2023
1 parent bebee88 commit dee1bc8
Showing 1 changed file with 34 additions and 7 deletions.
41 changes: 34 additions & 7 deletions docs/managed-datahub/observe/freshness-assertions.md
@@ -122,8 +122,12 @@ Change Source types vary by the platform, but generally fall into these categori
is higher than the previously observed value, in order to determine whether the Table has been changed within a given period of time.
Note that this approach is only supported if the Change Window does not use a fixed interval.

Using the final 2 approaches - column value queries - to determine whether a Table has changed useful because it can be customized to determine whether
specific types of important changes have been made to a given Table.
- **DataHub Operation**: A DataHub "Operation" aspect contains timeseries information used to describe changes made to an entity. Using this
option avoids contacting your data platform, and instead uses the DataHub Operation metadata to evaluate Freshness Assertions.
This relies on Operations being reported to DataHub, either via ingestion or via use of the DataHub APIs (see [Report Operation via API](#reporting-operations-via-api)).
Note that if you have not configured an ingestion source through DataHub, this may be the only option available.

Using either of the column value approaches (**Last Modified Column** or **High Watermark Column**) to determine whether a Table has changed can be useful because the check can be customized to detect specific types of important changes to a given Table.
Because these approaches do not involve system warehouse tables, they are also easily portable across Data Warehouse and Data Lake providers.
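
The change-detection logic described above can be pictured with a minimal Python sketch. This is an illustration only, not DataHub's implementation: the `Operation` class and both function names are hypothetical, and the sketch assumes timestamps in epoch milliseconds and a numeric high watermark value.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Operation:
    """Hypothetical stand-in for a reported DataHub Operation."""
    operation_type: str
    timestamp_millis: int


def changed_within_window(
    operations: Iterable[Operation],
    window_start_millis: int,
    window_end_millis: int,
) -> bool:
    """Return True if any operation falls inside the change window (inclusive)."""
    return any(
        window_start_millis <= op.timestamp_millis <= window_end_millis
        for op in operations
    )


def high_watermark_advanced(previous_value: int, current_value: int) -> bool:
    """Return True if the column maximum is higher than the previously observed value."""
    return current_value > previous_value
```

For example, a single `INSERT` operation reported at `1693252366489` counts as a change for any window that contains that timestamp, while a high watermark that stays at the same value between two evaluations does not.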

Freshness Assertions also have an off switch: they can be started or stopped at any time with the click of a button.
@@ -178,7 +182,7 @@ _Check whether the table has changed in a specific window of time_


7. (Optional) Click **Advanced** to customize the evaluation **source**. This is the mechanism that will be used to evaluate
the check. Each Data Platform supports different options including Audit Log, Information Schema, Last Modified Column, and High Watermark Column.
the check. Each Data Platform supports different options including Audit Log, Information Schema, Last Modified Column, High Watermark Column, and DataHub Operation.

<p align="center">
<img width="45%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/freshness/assertion-builder-freshness-source-type.png"/>
@@ -189,11 +193,12 @@ the check. Each Data Platform supports different options including Audit Log, In
- **Last Modified Column**: Check for the presence of rows using a "Last Modified Time" column, which should reflect the time at which a given row was last changed in the table, to
determine whether the table changed within the evaluation period.
- **High Watermark Column**: Monitor changes to a continuously-increasing "high watermark" column value to determine whether a table
has been changed. This option is particularly useful for tables that grow consistently with time, for example fact or event (e.g. click-strea) tables. It is not available
has been changed. This option is particularly useful for tables that grow consistently with time, for example fact or event (e.g. click-stream) tables. It is not available
when using a fixed lookback period.
- **DataHub Operation**: Use DataHub Operations to determine whether the table changed within the evaluation period.

8. Click **Next**
9. Configure actions that should be taken when the Freshness Assertion passes or fails
1. Click **Next**
2. Configure actions that should be taken when the Freshness Assertion passes or fails

<p align="left">
<img width="55%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/freshness/assertion-builder-actions.png"/>
@@ -280,7 +285,7 @@ Note that to create or delete Assertions and Monitors for a specific entity on D
In order to create a Freshness Assertion that is being monitored on a specific **Evaluation Schedule**, you'll need to use 2
GraphQL mutation queries to create a Freshness Assertion entity and create an Assertion Monitor entity responsible for evaluating it.

Start by creating the Freshness Assertion entity using the `createFreshnessAssertion` query and hang on to the 'urn' field of the Assertion entit y
Start by creating the Freshness Assertion entity using the `createFreshnessAssertion` query and hang on to the 'urn' field of the Assertion entity
you get back. Then continue by creating a Monitor entity using the `createAssertionMonitor` mutation.

##### Examples
@@ -337,6 +342,28 @@ After creating the monitor, the new assertion will start to be evaluated every 8

You can delete assertions along with their monitors using GraphQL mutations: `deleteAssertion` and `deleteMonitor`.

### Reporting Operations via API

DataHub Operations can be used to capture changes made to entities. This is useful for cases where the underlying data platform does not provide a mechanism
to capture changes, or where the data platform's mechanism is not reliable. In order to report an operation, you can use the `reportOperation` GraphQL mutation.


##### Examples
```json
mutation reportOperation {
  reportOperation(
    input: {
      urn: "<urn of the dataset being reported>",
      operationType: INSERT,
      sourceType: DATA_PLATFORM,
      timestampMillis: 1693252366489
    }
  )
}
```

Use the `timestampMillis` field to specify the time at which the operation occurred. If no value is provided, the current time will be used.
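
As a rough sketch of how this mutation might be sent from a script, the Python below builds the request using only the standard library. The helper names, the endpoint URL, and the bearer-token header are assumptions for illustration; check them against your own DataHub deployment. Only the mutation fields shown in the example above are used.

```python
import json
import urllib.request
from typing import Optional


def build_report_operation_body(
    urn: str,
    operation_type: str = "INSERT",
    source_type: str = "DATA_PLATFORM",
    timestamp_millis: Optional[int] = None,
) -> bytes:
    """Build the JSON body for a reportOperation GraphQL request (hypothetical helper)."""
    fields = [
        f"urn: {json.dumps(urn)}",  # json.dumps quotes and escapes the string
        f"operationType: {operation_type}",  # enum values are not quoted
        f"sourceType: {source_type}",
    ]
    # Leave timestampMillis out so the server falls back to the current time.
    if timestamp_millis is not None:
        fields.append(f"timestampMillis: {timestamp_millis}")
    query = (
        "mutation reportOperation { reportOperation(input: { "
        + ", ".join(fields)
        + " }) }"
    )
    return json.dumps({"query": query}).encode("utf-8")


def build_request(graphql_url: str, token: str, body: bytes) -> urllib.request.Request:
    """Wrap the body in a POST request; URL and auth scheme are assumptions."""
    return urllib.request.Request(
        graphql_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

To send it, pass the result of `build_request(...)` to `urllib.request.urlopen`, with `graphql_url` set to your instance's GraphQL endpoint and `token` set to a personal access token.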

### Tips

:::info