Add sha256 uniqueness to CollectionVersion #1815

Merged: 2 commits merged into pulp:main from the 1052_uniqueness branch on Nov 11, 2024

Conversation

mdellweg (Member):

No description provided.

@mdellweg mdellweg force-pushed the 1052_uniqueness branch 8 times, most recently from e97a863 to 775fb7c on April 16, 2024 10:22
@mdellweg mdellweg force-pushed the 1052_uniqueness branch 2 times, most recently from 03bef99 to baf54a1 on April 16, 2024 12:33
try:
    await sync_to_async(artifact.save)()
except IntegrityError:
    artifact = Artifact.objects.get(sha256=artifact.sha256)
mdellweg (Member Author):

This line would never have worked in an async context.
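For context: Django's synchronous ORM methods raise SynchronousOnlyOperation when called from a running event loop. A minimal sketch of the two usual ways to make such a lookup async-safe (function and variable names here are illustrative, not taken from this PR):

```python
from asgiref.sync import sync_to_async


async def fetch_artifact_by_sha256(Artifact, sha256):
    # Option 1: wrap the synchronous ORM call so it runs in a worker thread
    # instead of blowing up inside the event loop.
    artifact = await sync_to_async(Artifact.objects.get)(sha256=sha256)

    # Option 2 (Django >= 4.1): use the native async queryset method.
    artifact = await Artifact.objects.aget(sha256=sha256)
    return artifact
```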

@gerrod3 (Contributor) left a comment:

I know this is based on my original work on this, but there was discussion back then about how the sha256 for the Collection should be the digest of the MANIFEST.json file inside the tarball. I never actually got around to finishing all of the changes to my PR, so most of my comments here are me pointing out the areas I didn't change.

#1119 (comment)
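To illustrate the distinction being discussed, here is a rough sketch of hashing the MANIFEST.json member of a collection tarball versus hashing the tarball itself. The MANIFEST.json file name comes from the Ansible collection format; the function names and error handling are illustrative assumptions, not code from either PR:

```python
import hashlib
import tarfile


def manifest_sha256(collection_tarball_path):
    # Digest of the MANIFEST.json member; intended to stay the same across
    # rebuilds of identical content, even if tarball metadata (timestamps,
    # member ordering, compression) differs. Assumes MANIFEST.json sits at
    # the archive root, as the collection format specifies.
    with tarfile.open(collection_tarball_path, mode="r:*") as tar:
        manifest = tar.extractfile("MANIFEST.json").read()
    return hashlib.sha256(manifest).hexdigest()


def tarball_sha256(collection_tarball_path):
    # Digest of the artifact exactly as uploaded; changes whenever the
    # tarball is rebuilt, even from unchanged sources.
    with open(collection_tarball_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```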

     )

-    return col
+    return CollectionVersion.objects.filter(sha256=validated_data["sha256"]).first()
gerrod3 (Contributor):

Nice! This is so much simpler.

Comment on lines +538 to +570
# There was an autogenerated validator rendering sha256 required.
validators = []
gerrod3 (Contributor):

This is actually a really annoying feature of DRF. It tripped me up a lot when I was originally adding domains.
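For background, a hedged sketch of the DRF behaviour being referenced: ModelSerializer auto-generates uniqueness validators from the model's unique constraints, and those validators treat every field they cover as required, so adding sha256 to the uniqueness constraint would have made sha256 mandatory on upload. Setting validators = [] on the serializer's Meta suppresses that. The serializer name and field list below are illustrative; only the CollectionVersion model import is real:

```python
from rest_framework import serializers

from pulp_ansible.app.models import CollectionVersion


class CollectionVersionUploadSerializer(serializers.ModelSerializer):
    # Hypothetical serializer; the real one lives in pulp_ansible/app/serializers.py.
    class Meta:
        model = CollectionVersion
        fields = ["namespace", "name", "version", "sha256"]
        # Without this, DRF would auto-generate a validator covering the
        # model's uniqueness constraint that includes sha256, which would
        # force clients to send sha256 even though the server computes it.
        validators = []
```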

-except IntegrityError:
-    artifact = Artifact.objects.get(sha256=artifact.sha256)
+if existing_artifact := await Artifact.objects.filter(sha256=artifact.sha256).afirst():
+    existing_artifact.touch()
gerrod3 (Contributor):

Does this need to be wrapped in a sync_to_async, or do we have an atouch?
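If touch() is synchronous (it issues a timestamp update so orphan cleanup will not remove the artifact), the safe pattern in an async flow is to wrap it. A sketch under that assumption; whether an atouch() helper exists in pulpcore is exactly the open question here, so it is not used:

```python
from asgiref.sync import sync_to_async

from pulpcore.plugin.models import Artifact


async def deduplicate_artifact(candidate):
    # Async lookup of an existing artifact with the same digest.
    if existing := await Artifact.objects.filter(sha256=candidate.sha256).afirst():
        # touch() runs a synchronous UPDATE, so push it to a worker thread.
        await sync_to_async(existing.touch)()
        return existing
    return candidate
```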

pulp_ansible/app/serializers.py (resolved conversation)
pulp_ansible/app/tasks/collections.py (outdated, resolved conversation)
for key in ["description", "documentation", "homepage", "issues", "repository"]:
    if collection_info[key] is None:
        collection_info.pop(key)
collection, created = Collection.objects.get_or_create(
gerrod3 (Contributor), Apr 19, 2024:

Isn't this a database operation? This breaks the comment about metadata_only=True.
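For reference, get_or_create always talks to the database, which is the point being raised. Conceptually it behaves roughly like the sketch below (simplified, ignoring the race handling the real implementation adds); namespace and name stand in for values already pulled out of collection_info:

```python
# Roughly what Collection.objects.get_or_create(namespace=..., name=...) does:
# a SELECT first, and an INSERT when no matching row exists.
try:
    collection = Collection.objects.get(namespace=namespace, name=name)
    created = False
except Collection.DoesNotExist:
    collection = Collection.objects.create(namespace=namespace, name=name)
    created = True
```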

pulp_ansible/app/tasks/collections.py (resolved conversation)
if artifact_file_name := d_artifact_files.get(d_artifact):
    artifact_file = open(artifact_file_name, mode="rb")
gerrod3 (Contributor), Apr 19, 2024:

I don't understand this logic. Can you help explain it? Edit: Looking at the original PR, I was the one who wrote this... Well, present me can't remember what past me intended with this logic.

pulp_ansible/app/tasks/upload.py (resolved conversation)
@mdellweg (Member Author):

> I know this is based on my original work on this, but there was discussion back then about how the sha256 for the Collection should be the digest of the MANIFEST.json file inside the tarball. I never actually got around to finishing all of the changes to my PR, so most of my comments here are me pointing out the areas I didn't change.
>
> #1119 (comment)

The reason raised for that idea was: a user can rebuild the collection and reupload it, and may then be confused to find two content units with the same (namespace, name, version) but different artifacts. The whole idea of this PR is to allow different artifacts claiming the same (namespace, name, version). I, as a user, would on the other hand be most suspicious if I built a collection, uploaded it, downloaded it again, and then got a different artifact.
I'd like to revisit that discussion, as it probably decides whether we can ever implement on_demand sync in a sane way.

@mdellweg mdellweg force-pushed the 1052_uniqueness branch 4 times, most recently from 80667d2 to b09eed7 on May 7, 2024 10:48
@mdellweg mdellweg linked an issue May 8, 2024 that may be closed by this pull request
@mdellweg mdellweg force-pushed the 1052_uniqueness branch 2 times, most recently from dcd8b05 to 2733da5 on May 13, 2024 14:31
@mdellweg mdellweg force-pushed the 1052_uniqueness branch 2 times, most recently from d857fe1 to b4ded81 on May 27, 2024 13:19
@mdellweg mdellweg force-pushed the 1052_uniqueness branch 4 times, most recently from 52973f6 to 58f25f4 on November 6, 2024 13:44
@mdellweg mdellweg marked this pull request as ready for review November 6, 2024 15:21
@mdellweg mdellweg enabled auto-merge (rebase) November 6, 2024 15:21
@gerrod3 (Contributor) left a comment:

Small suggestions, everything else looks good.

    if count == 0:
        return

    for unit in unit_qs.iterator():
gerrod3 (Contributor):

Should we add a prefetch_related here to reduce the number of queries being made?
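A hedged sketch of that suggestion; the queryset and relation name are illustrative, not the PR's actual code. Prefetching whatever relation the loop touches turns N+1 queries into a constant number of batched queries; note that iterator() only honours prefetch_related on Django 4.1+ and only when a chunk_size is given:

```python
unit_qs = (
    CollectionVersion.objects.filter(sha256="")        # illustrative filter
    .prefetch_related("contentartifact_set__artifact")  # illustrative relation path
)

# Without the prefetch, touching the relation inside the loop would issue one
# extra query per unit; with it, related rows are fetched per chunk.
for unit in unit_qs.iterator(chunk_size=1000):
    for content_artifact in unit.contentartifact_set.all():
        ...
```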

            relative_path=collection_version.relative_path,
        )
        collection_version, tags = await sync_to_async(create_collection_from_importer)(metadata)
        await _save_collection_version(collection_version, artifact, tags)
    except ValidationError as e:
        if "unique" in str(e):
gerrod3 (Contributor):

You need to update the CollectionVersion.get below on line 148 to include the sha256; it could also use aget.
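A sketch of what that suggestion would look like; it is hypothetical (the metadata dict name is invented), and per the reply below it was intentionally not applied in this PR:

```python
# Suggested change (not applied): include the digest in the lookup and use
# the async ORM method instead of the synchronous .get().
collection_version = await CollectionVersion.objects.aget(
    namespace=data["namespace"],  # "data" stands in for whatever metadata dict is in scope
    name=data["name"],
    version=data["version"],
    sha256=data["sha256"],
)
```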

mdellweg (Member Author):

Actually, I think this is the place where we still have the old behaviour, and we should not change that.
Remember, this is not fixing the issue but preparing the database to fix the issue with a proper lookup later.

collection_versions.append(cv)

if build_artifacts:
gerrod3 (Contributor):

build_artifacts is no longer used in this method; can we remove it from the signature?

This should be ZDU compatible. The transition will be completed with the
next y-release. This commit only adds the field and starts to populate it.
It comes with a data repair command to help transition to an actual
uniqueness constraint in the next release.

relates: pulp#1052

Co-authored-by: Matthias Dellweg <[email protected]>
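To illustrate the "data repair command" idea from the commit message, here is an entirely hypothetical sketch: a Django management command that backfills the new sha256 field from each version's stored artifact so the uniqueness constraint can be enforced in a later release. The command name, the empty-string sentinel for unpopulated rows, and the relation traversal are all assumptions, not the PR's actual implementation:

```python
from django.core.management.base import BaseCommand

from pulp_ansible.app.models import CollectionVersion


class Command(BaseCommand):
    help = "Backfill CollectionVersion.sha256 from the stored artifact (sketch)."

    def handle(self, *args, **options):
        # Assumes an empty string marks rows whose sha256 has not been populated yet.
        qs = CollectionVersion.objects.filter(sha256="").prefetch_related(
            "contentartifact_set__artifact"
        )
        for cv in qs.iterator(chunk_size=1000):
            content_artifacts = list(cv.contentartifact_set.all())  # uses the prefetch cache
            artifact = content_artifacts[0].artifact if content_artifacts else None
            if artifact is not None:
                # Copy the digest pulpcore already stores on the Artifact row.
                cv.sha256 = artifact.sha256
                cv.save(update_fields=["sha256"])
```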
@gerrod3 (Contributor) left a comment:

Hurray! One step closer.

@mdellweg mdellweg merged commit 544e468 into pulp:main Nov 11, 2024
12 checks passed
@mdellweg mdellweg deleted the 1052_uniqueness branch November 11, 2024 18:07
Successfully merging this pull request may close these issues.

Content should have a digest in the (system-)uniqueness constraints