readme - json-pairs #288

Merged (4 commits) on Aug 13, 2024
README.rst: 37 changes (19 additions and 18 deletions)

@@ -53,35 +53,36 @@ Quick Start
$ pip install datachain


Selecting files using JSON metadata
===================================

The storage consists of images of cats and dogs (``dog.1048.jpg``, ``cat.1009.jpg``),
annotated with ground truth and model inferences in the 'json-pairs' format,
where each image has a matching JSON file such as ``cat.1009.json``:

.. code:: json

   {
     "class": "cat", "id": "1009", "num_annotators": 8,
     "inference": {"class": "dog", "confidence": 0.68}
   }
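
As a quick sanity check, independent of DataChain, the same annotation can be
inspected with plain Python. This sketch assumes a locally downloaded copy of
``cat.1009.json``:

.. code:: py

    import json

    # Hypothetical local copy of one annotation file from the bucket above
    with open("cat.1009.json") as f:
        ann = json.load(f)

    print(ann["class"])                    # ground-truth label: "cat"
    print(ann["inference"]["confidence"])  # model confidence: 0.68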

Example of downloading only high-confidence cat images using JSON metadata:


.. code:: py

    from datachain import Column, DataChain

    # Parse the JSON annotations and list the image files from the demo bucket
    meta = DataChain.from_json("gs://datachain-demo/dogs-and-cats/*json", object_name="meta")
    images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")

    # Derive the json-pair ID from the filename ("cat.1009.jpg" -> "1009")
    # and join the images with their annotations on it
    images_id = images.map(id=lambda file: file.path.split('.')[-2])
    annotated = images_id.merge(meta, on="id", right_on="meta.id")

    # Keep only the images the model classified as "cat" with confidence above 0.93
    likely_cats = annotated.filter((Column("meta.inference.confidence") > 0.93) \
                                   & (Column("meta.inference.class_") == "cat"))
    likely_cats.export_files("high-confidence-cats/", signal="file")
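
The ``id`` signal above is simply the second-to-last dot-separated part of the
file path. Here is a standalone sketch of that extraction, using the sample
filenames from this section (plain Python, nothing DataChain-specific):

.. code:: py

    # "cat.1009.jpg" and "cat.1009.json" share the pair ID "1009",
    # which is what the merge above joins on.
    for path in ["dog.1048.jpg", "cat.1009.jpg", "cat.1009.json"]:
        pair_id = path.split('.')[-2]  # second-to-last dot-separated field
        print(path, "->", pair_id)

Note that the nested JSON keys are addressed as dot-separated column names in the
filter above (``meta.inference.confidence``), with the ``class`` key appearing as
``class_`` in this example.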


Data curation with a local AI model