docs: add dataset schema validation #1304

Merged
merged 31 commits into from
Dec 6, 2024
Changes from 5 commits
9ecd6ef
docs: add dataset schema validation
katacek Nov 27, 2024
b96183f
fix lint
katacek Nov 27, 2024
c4a48ec
fix lint
katacek Nov 27, 2024
d1b2db6
capitalized Actor
katacek Nov 27, 2024
087f164
one two three four
katacek Nov 27, 2024
2372e0b
remove examples of dataset field statistics
katacek Nov 29, 2024
b5b2032
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 3, 2024
f1d989a
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 3, 2024
73e0f6c
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 3, 2024
4bdf857
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 3, 2024
4844192
add link, format fixes
katacek Dec 3, 2024
4aca8a8
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 4, 2024
7356b35
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 4, 2024
d484ad0
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 4, 2024
1adaf5b
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 4, 2024
89a4166
Merge branch 'master' into docs/add-dataset-schema-validation
katacek Dec 4, 2024
dc6329e
merge master plus open api
katacek Dec 4, 2024
4fb5eaf
add info to api docs
katacek Dec 4, 2024
9c6c99c
api docs part final
katacek Dec 5, 2024
0f9872b
no subsequent admonitions
katacek Dec 5, 2024
12142db
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
df06398
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
6d26ea0
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
976dd4f
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
fbb6dcd
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
64b51e0
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
b07d03d
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
310d614
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
c518b6b
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 5, 2024
8e0de86
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 6, 2024
330dfe2
Update sources/platform/actors/development/actor_definition/dataset_s…
katacek Dec 6, 2024
Expand Up @@ -118,7 +118,7 @@ The template above defines the configuration for the default dataset output view

The default behavior of the Output tab UI table is to display all fields from `transformation.fields` in the specified order. You can customize the display properties for specific formats or column labels if needed.

- ![Output tab UI](./images/output-schema-example.png)
+ ![Output tab UI](../images/output-schema-example.png)
Member Author

the image is also used on another page, so I have left it in the original folder and just changed the path, but it can be done either way (change the folder and the path for the other page)


## Structure

@@ -0,0 +1,312 @@
---
title: Dataset validation
description: Specify the dataset schema within the Actors so you can add monitoring and validation down to the field level.
slug: /actors/development/actor-definition/dataset-schema/validation
---

**Specify the dataset schema within the Actors so you can add monitoring and validation down to the field level.**

---

To define a schema for the default dataset of an Actor run, set the `fields` property in the dataset schema. It is currently not possible to set a schema for a named dataset (the same limitation applies to dataset views).

:::info

The schema defines a single item in the dataset. Be careful not to define the schema as an array; it must always be the schema of an object.
Member

@gippy, does user get an error when this happens?

Member

I'm not actually 100% sure. I think you could theoretically set the top level type of the schema to array. Will test it tomorrow. If it's possible then we will try to add some check to build so that the creator cannot do it.

Member

Y, let's throw an error as early as possible so ideally in the build. Later it's too late.


:::
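
The constraint above can be checked locally before a build. The following is a minimal sketch under stated assumptions: `assertItemSchema` is a hypothetical helper, not part of the Apify tooling, and it only rejects a top-level `type` that is present and different from `"object"`. Whether the platform itself reports an error in this case is not stated here.

```javascript
// Hypothetical helper (not an Apify API): verifies that a `fields` schema
// describes a single dataset item, i.e. an object rather than an array.
function assertItemSchema(fields) {
    if (fields === null || typeof fields !== 'object' || Array.isArray(fields)) {
        throw new TypeError('The `fields` schema must be a JSON object.');
    }
    // Assumption: an omitted top-level `type` is tolerated; anything other
    // than "object" is rejected.
    if (fields.type !== undefined && fields.type !== 'object') {
        throw new TypeError(`The top-level schema type must be "object", got ${JSON.stringify(fields.type)}.`);
    }
    return fields;
}

// A schema describing one item passes.
assertItemSchema({ type: 'object', properties: { name: { type: 'string' } } });

// A schema describing the whole dataset as an array is rejected.
try {
    assertItemSchema({ type: 'array', items: { type: 'object' } });
} catch (error) {
    console.log(error.message);
}
```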

You can do that either directly in `actor.json`, like this:

```json title=".actor.json"
{
    "actorSpecification": 1,
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "fields": {
                "$schema": "http://json-schema.org/draft-07/schema#",
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string"
                    }
                },
                "required": ["name"]
            },
            "views": {}
        }
    }
}
```

Or in a separate file, like this:

```json title=".actor.json"
{
    "actorSpecification": 1,
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}
```

```json title="dataset_schema.json"
{
    "actorSpecification": 1,
    "fields": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "name": {
                "type": "string"
            }
        },
        "required": ["name"]
    },
    "views": {}
}
```
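
The two layouts above carry the same information: `storages.dataset` is either the schema object itself or a relative path to a file containing it. A minimal sketch of how a script might handle both forms is shown here. `resolveDatasetSchema` and the injected `loadJson` callback are hypothetical names chosen for illustration, not part of any Apify package.

```javascript
// Hypothetical helper: returns the dataset schema object whether
// `storages.dataset` holds the schema inline or a relative path to it.
// `loadJson` is supplied by the caller (for example, a function that reads
// and parses a file), so the sketch itself touches no file system.
function resolveDatasetSchema(actorConfig, loadJson) {
    const dataset = actorConfig.storages?.dataset;
    if (dataset === undefined) return null; // no dataset schema defined
    if (typeof dataset === 'string') return loadJson(dataset);
    return dataset;
}

// Inline form: the loader is never called.
const inlineSchema = resolveDatasetSchema(
    { actorSpecification: 1, storages: { dataset: { actorSpecification: 1, fields: {}, views: {} } } },
    () => { throw new Error('not called for inline schemas'); },
);

// File form: an in-memory map stands in for the file system.
const fakeFiles = {
    './dataset_schema.json': { actorSpecification: 1, fields: { type: 'object' }, views: {} },
};
const fileSchema = resolveDatasetSchema(
    { actorSpecification: 1, storages: { dataset: './dataset_schema.json' } },
    (path) => fakeFiles[path],
);

console.log(inlineSchema.fields, fileSchema.fields);
```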

:::important

The `$schema` line is important and must be exactly this value or it must be omitted:

`"$schema": "http://json-schema.org/draft-07/schema#"`

:::
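
As a small illustration of the rule above, the check below accepts a schema only when `$schema` is omitted or matches the draft-07 URI character for character. The function name `hasSupportedSchemaDeclaration` is made up for this sketch; it is not an Apify API.

```javascript
const SUPPORTED_SCHEMA = 'http://json-schema.org/draft-07/schema#';

// Hypothetical check: `$schema` is either omitted or exactly the draft-07 URI.
function hasSupportedSchemaDeclaration(fields) {
    return !('$schema' in fields) || fields.$schema === SUPPORTED_SCHEMA;
}

console.log(hasSupportedSchemaDeclaration({ type: 'object' })); // true
console.log(hasSupportedSchemaDeclaration({ $schema: SUPPORTED_SCHEMA, type: 'object' })); // true
// Even `https` instead of `http` is a different value and does not match.
console.log(hasSupportedSchemaDeclaration({ $schema: 'https://json-schema.org/draft-07/schema#' })); // false
```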

## Dataset validation

When you define a schema for your default dataset, every insert into the dataset is validated against that schema (we use [AJV](https://ajv.js.org/)).
Member

One thing to consider - do we want to mention we currently use AJV? Perhaps AJV contains some JSON schema extensions, and if we replace them, we could change the expected behavior. Or is this not the thing?

Member

I think it's better to mention it, because the creators can then look up the validation output and what it means. I do not think AJV contains any additions to JSON schema.


If the validation succeeds, nothing changes from the current behavior: the data is stored and an empty response with status code 201 is returned.

**If the data you attempt to store in the dataset is invalid** (meaning any of the items received by the API fails the validation), **the whole request is discarded** and the API returns a response with status code 400 and the following JSON body:

Member

Please update the API docs with this and link them to this documentation. It's important to have it there.

Member Author

good catch, thanks, api docs part updated, link added

```json
{
    "error": {
        "type": "schema-validation-error",
        "message": "Schema validation failed",
        "data": {
            "invalidItems": [{
                "itemPosition": "<array index in the received array of items>",
                "validationErrors": "<Complete list of AJV validation error objects>"
            }]
        }
    }
}
```
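
To make the response shape above more concrete, here is a minimal sketch that turns such a body into one readable line per problem. It assumes each validation error follows the AJV error object shape (`instancePath`, `keyword`, `message`; older AJV versions used `dataPath` instead of `instancePath`). The helper name `summarizeInvalidItems` and the sample values are invented for illustration.

```javascript
// Turns a schema-validation-error body into one readable line per problem.
// Assumption: validation errors carry AJV's `instancePath` (or the older
// `dataPath`), `keyword` and `message` properties.
function summarizeInvalidItems(errorBody) {
    const invalidItems = errorBody?.error?.data?.invalidItems ?? [];
    return invalidItems.flatMap(({ itemPosition, validationErrors }) =>
        validationErrors.map((err) => {
            const path = err.instancePath ?? err.dataPath ?? '';
            const where = path ? ` at ${path}` : '';
            return `item ${itemPosition}${where}: ${err.message} (${err.keyword})`;
        }));
}

// Sample body with made-up values, for illustration only.
const sampleBody = {
    error: {
        type: 'schema-validation-error',
        message: 'Schema validation failed',
        data: {
            invalidItems: [{
                itemPosition: 2,
                validationErrors: [{
                    instancePath: '/price',
                    schemaPath: '#/properties/price/type',
                    keyword: 'type',
                    params: { type: 'number' },
                    message: 'must be number',
                }],
            }],
        },
    },
};

console.log(summarizeInvalidItems(sampleBody));
```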

The type of the AJV validation error object is defined [here](https://github.com/ajv-validator/ajv/blob/master/lib/types/index.ts#L86).

If you use the Apify JS client or Apify SDK and call the `pushData` function, you can access the validation errors in a `try...catch` block like this:

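```javascript
try {
    const response = await Actor.pushData(items);
} catch (error) {
    // Rethrow anything that is not a schema validation error.
    if (!error.data?.invalidItems) throw error;
    error.data.invalidItems.forEach((item) => {
        // Position of the rejected item in `items` and its AJV errors.
        const { itemPosition, validationErrors } = item;
    });
}
```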
Comment on lines +106 to +117
Contributor

What about Python Client & SDK? If we provide code sample for one, it would make sense to provide it for the other as well, and utilize Tabs component

Member Author

it seems it does not at the moment, I am in contact w @vdusek about adding it
not sure about time scope so maybe we can release the docs without it and add it later?
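
Because the whole request is discarded when any item fails validation, the valid items in the same batch are not stored either. One possible way to handle this, sketched below, is to drop the reported items and push the rest again. `withoutInvalidItems` is a hypothetical helper, and it assumes `itemPosition` is a zero-based index into the pushed array.

```javascript
// Hypothetical helper: returns only the items that were not reported as invalid.
// `invalidItems` is the array from `error.data.invalidItems`; `itemPosition`
// is assumed to be the zero-based index into the pushed array.
function withoutInvalidItems(items, invalidItems) {
    const invalidPositions = new Set(invalidItems.map((item) => Number(item.itemPosition)));
    return items.filter((_, index) => !invalidPositions.has(index));
}

// Made-up data: the second item lacks the required `name` field.
const pushedItems = [{ name: 'Chair' }, { price: 10 }, { name: 'Table' }];
const reportedItems = [{ itemPosition: 1, validationErrors: [] }];

console.log(withoutInvalidItems(pushedItems, reportedItems));
```

Inside a `catch` block like the one above, the filtered array could be passed to a second `pushData` call, while the rejected items are logged or stored elsewhere.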


## Examples

Optional field (price is optional in this case):

```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        },
        "price": {
            "type": "number"
        }
    },
    "required": ["name"]
}
```

Field with multiple types:

```json
{
    "price": {
        "type": ["string", "number"]
    }
}
```

Field with type `any`:

```json
{
    "price": {
        "type": ["string", "number", "object", "array", "boolean"]
    }
}
```

Enabling fields to be `null`:

```json
{
    "name": {
        "type": "string",
        "nullable": true
    }
}
```

Define the type of objects in an array:

```json
{
    "comments": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "author_name": {
                    "type": "string"
                }
            }
        }
    }
}
```

Define specific fields, but allow anything else to be added to the item:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        }
    },
    "additionalProperties": true
}
```
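
To show how the keywords from the examples above combine, here is a deliberately tiny validator written for illustration. It is not AJV and covers only `type`, `nullable`, `required`, `properties` and `items`; unknown properties are always allowed, as with `"additionalProperties": true`, and the `integer` type is not handled. Note that `nullable` is not part of draft-07 itself; as far as we know it is an OpenAPI keyword that AJV supports. All names in the sketch are invented.

```javascript
// Maps a JSON value to its JSON Schema type name.
function typeOf(value) {
    if (value === null) return 'null';
    if (Array.isArray(value)) return 'array';
    return typeof value; // 'string', 'number', 'boolean' or 'object'
}

// Returns a list of error strings; an empty list means the value is valid.
function validate(schema, value, path = '') {
    if (value === null && schema.nullable === true) return [];
    if (schema.type !== undefined) {
        const allowed = Array.isArray(schema.type) ? schema.type : [schema.type];
        if (!allowed.includes(typeOf(value))) {
            return [`${path || '/'} must be ${allowed.join(' or ')}`];
        }
    }
    const errors = [];
    if (typeOf(value) === 'object') {
        for (const key of schema.required ?? []) {
            if (!(key in value)) errors.push(`${path || '/'} is missing required property "${key}"`);
        }
        for (const [key, subschema] of Object.entries(schema.properties ?? {})) {
            if (key in value) errors.push(...validate(subschema, value[key], `${path}/${key}`));
        }
    }
    if (typeOf(value) === 'array' && schema.items) {
        value.forEach((item, i) => errors.push(...validate(schema.items, item, `${path}/${i}`)));
    }
    return errors;
}

const demoSchema = {
    type: 'object',
    properties: {
        name: { type: 'string', nullable: true },
        price: { type: ['string', 'number'] },
        comments: {
            type: 'array',
            items: { type: 'object', properties: { author_name: { type: 'string' } } },
        },
    },
    required: ['name'],
};

console.log(validate(demoSchema, { name: 'Chair', extra: true })); // valid: extra fields are allowed
console.log(validate(demoSchema, { name: null, price: '12.50' })); // valid: nullable name, string price
console.log(validate(demoSchema, { price: true })); // missing name, wrong price type
console.log(validate(demoSchema, { name: 'Chair', comments: [{ author_name: 7 }] }));
```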

See the [JSON Schema reference](https://json-schema.org/understanding-json-schema/reference) for additional options.

You can also use a schema generator, such as [this one](https://www.liquid-technologies.com/online-json-to-schema-converter).

## Dataset field statistics

Once the dataset `fields` schema is set up, we use it to generate a list of fields and measure statistics for them.

The following statistics are measured:

- **Null count:** how many items in the dataset have the field set to `null`
- **Empty count:** how many items in the dataset have the field `undefined`; for example, an empty string is not considered empty
- **Minimum and maximum**
    - For numbers, this is calculated directly
    - For strings, this tracks the string length
    - For arrays, this tracks the number of items in the array
    - For objects, this tracks the number of keys
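
The rules above can be sketched in a few lines. This is only an illustration of the definitions, not the platform's implementation: `measure` and `fieldStatistics` are hypothetical names, only top-level fields are handled, and the counters are always reported even when they are zero.

```javascript
// Size used for min/max, following the rules listed above.
function measure(value) {
    if (typeof value === 'number') return value;
    if (typeof value === 'string') return value.length;
    if (Array.isArray(value)) return value.length;
    if (value !== null && typeof value === 'object') return Object.keys(value).length;
    return undefined; // booleans, null and undefined have no size here
}

// Hypothetical helper: computes the statistics for one top-level field.
function fieldStatistics(items, field) {
    const stats = { nullCount: 0, emptyCount: 0, min: undefined, max: undefined };
    for (const item of items) {
        const value = item[field];
        if (value === undefined) stats.emptyCount += 1;
        else if (value === null) stats.nullCount += 1;
        const size = measure(value);
        if (size !== undefined) {
            stats.min = stats.min === undefined ? size : Math.min(stats.min, size);
            stats.max = stats.max === undefined ? size : Math.max(stats.max, size);
        }
    }
    return stats;
}

// Made-up items for illustration.
const sampleItems = [
    { name: 'Chair', price: 120, tags: ['wood'] },
    { name: '', price: null, tags: [] },
    { name: 'Table' },
];

console.log(fieldStatistics(sampleItems, 'name'));
console.log(fieldStatistics(sampleItems, 'price'));
console.log(fieldStatistics(sampleItems, 'tags'));
```

In this sample, the empty string in `name` counts toward the minimum length but not toward the empty count, which matches the definition of "empty" given above.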

:::note

Currently, you cannot view these statistics; we will add an API endpoint soon. However, you can already use them in monitoring.

:::

## Examples

For this schema:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "name": {
            "type": "string"
        },
        "description": {
            "type": "string"
        },
        "dimensions": {
            "type": "object",
            "nullable": true,
            "properties": {
                "width": {
                    "type": "number"
                },
                "height": {
                    "type": "number"
                }
            },
            "required": ["width", "height"]
        },
        "price": {
            "type": ["string", "number"]
        }
    },
    "required": ["name", "price"]
}
```

The stored statistics and fields in the database look like this:

```json
{
    "_id": "1lVGVBkWIhSYPY1dD",
    "fields": [
        "name",
        "description",
        "dimensions",
        "dimensions/width",
        "dimensions/height",
        "price"
    ],
    "stats": {
        "description": {
            "emptyCount": 105,
            "max": 19,
            "min": 19
        },
        "dimensions": {
            "emptyCount": 144,
            "max": 2,
            "min": 2,
            "nullCount": 86
        },
        "dimensions/height": {
            "emptyCount": 230,
            "max": 992,
            "min": 18
        },
        "dimensions/width": {
            "emptyCount": 230,
            "max": 977,
            "min": 4
        },
        "name": {
            "max": 13,
            "min": 11
        },
        "price": {
            "max": 999,
            "min": 1
        }
    }
}
```
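
The `fields` list in the stored document appears to be the schema's properties flattened into slash-separated paths. The sketch below reproduces that list from the example schema under this assumption. `listSchemaFields` is a hypothetical helper, and how the platform treats other constructs, such as array `items`, is not covered.

```javascript
// Hypothetical helper: flattens nested `properties` into slash-separated paths,
// listing each parent field before its children.
function listSchemaFields(schema, prefix = '') {
    const fields = [];
    for (const [key, subschema] of Object.entries(schema.properties ?? {})) {
        const path = prefix ? `${prefix}/${key}` : key;
        fields.push(path);
        fields.push(...listSchemaFields(subschema, path));
    }
    return fields;
}

const exampleSchema = {
    type: 'object',
    properties: {
        name: { type: 'string' },
        description: { type: 'string' },
        dimensions: {
            type: 'object',
            nullable: true,
            properties: {
                width: { type: 'number' },
                height: { type: 'number' },
            },
            required: ['width', 'height'],
        },
        price: { type: ['string', 'number'] },
    },
    required: ['name', 'price'],
};

console.log(listSchemaFields(exampleSchema));
```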

:::note

If you want to see for yourself, check the `datasetStatistics` collection. The IDs correspond to the IDs of datasets.

:::
12 changes: 11 additions & 1 deletion sources/platform/monitoring/index.md
Expand Up @@ -41,12 +41,22 @@ Currently, the monitoring option offers the following features:

### Alert configuration

- When you set up an alert, you have two choices for how you want the metrics to be evaluated. And depending on your choices, the alerting system will behave differently:
+ When you set up an alert, you have four choices for how you want the metrics to be evaluated. And depending on your choices, the alerting system will behave differently:
Member Author

as a part of the validation, new monitoring possibilities arose, I have added them here


1. **Alert, when the metric is lower than** - This type of alert is checked after the run finishes. If the metric is lower than the value you set, the alert will be triggered and you will receive a notification.

2. **Alert, when the metric is higher than** - This type of alert is checked both during the run and after the run finishes. During the run, we do periodic checks (approximately every 5 minutes) so that we can notify you as soon as possible if the metric is higher than the value you set. After the run finishes, we do a final check to make sure that the metric does not go over the limit in the last few minutes of the run.

3. **Alert, when run status is one of following** - This type of alert is checked only after the run finishes. It makes it possible to track the status of your finished runs and send an alert if the run finishes in a state you do not expect. If your Actor runs very often and suddenly starts failing, you will receive a single alert after the first failed run in 1 minute, and then an aggregated alert every 15 minutes.

4. **Alert for dataset field statistics** - If you have a [dataset schema](../actors/development/actor_definition/dataset_schema/validation.md) set up, you can use the field statistics to set up an alert. For example, you can track whether a field is filled in for all records, whether a numeric value is too low or too high (for example, when tracking the price of a product over multiple sources), or whether the number of items in an array is too low or too high (for example, an alert on an Instagram Actor if a post has a lot of comments), among other similar tasks.

:::important

Available dataset fields are taken from the last successful build of the monitored Actor. If different versions have different fields, currently the solution will always display only those from the default version.

:::
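
To illustrate what a field statistics alert means in practice, here is a sketch that evaluates a rule against a `stats` object like the one shown in the dataset validation docs. The rule shape (`field`, `metric`, `comparison`, `threshold`) and the function `isAlertTriggered` are invented for this example; the real alert is configured in Apify Console and may be evaluated differently. The sketch also assumes that a missing counter means zero.

```javascript
// Hypothetical rule evaluation; not the platform's alerting logic.
function isAlertTriggered(stats, { field, metric, comparison, threshold }) {
    const fieldStats = stats[field] ?? {};
    const isCount = metric === 'nullCount' || metric === 'emptyCount';
    // Assumption: a counter that is absent from the stats is treated as zero.
    const value = fieldStats[metric] ?? (isCount ? 0 : undefined);
    if (value === undefined) return false; // nothing measured, nothing to compare
    return comparison === 'higherThan' ? value > threshold : value < threshold;
}

// Made-up statistics for illustration.
const sampleStats = {
    description: { emptyCount: 105, max: 19, min: 19 },
    name: { max: 13, min: 11 },
    price: { max: 999, min: 1 },
};

// Is `description` missing in any record?
console.log(isAlertTriggered(sampleStats, { field: 'description', metric: 'emptyCount', comparison: 'higherThan', threshold: 0 }));
// Is any price suspiciously low?
console.log(isAlertTriggered(sampleStats, { field: 'price', metric: 'min', comparison: 'lowerThan', threshold: 5 }));
// Is `name` missing in any record?
console.log(isAlertTriggered(sampleStats, { field: 'name', metric: 'emptyCount', comparison: 'higherThan', threshold: 0 }));
```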

![Metric condition configuration](./images/metric-options.png)

You can get notified by email, Slack, or in Apify Console. If you use Slack, we suggest using Slack notifications instead of email because they are more reliable, and you can also get notified quicker.