Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bulk fix assets rake task #1551

Merged
merged 6 commits into from
Dec 10, 2024
Merged

Conversation

lauraghiorghisor-tw
Copy link
Contributor

@lauraghiorghisor-tw lauraghiorghisor-tw commented Nov 27, 2024

Context

Multiple Zendesk tickets related to whitehall assets being live when they should not be, triggered an analysis of the problematic data - PR here. We've already deleted a batch of assets that whitehall expected to be deleted but they were not, using this rake task. We have now investigated asset deletion states further, and we've produced a set of assets that, despite not being deleted in Whitehall, should not be live. In order to preserve some data integrity, we're choosing to fix the data with this rake task, rather than do a mass deletion. The assets in question are serving a chain of only superseded and deleted editions (nothing live) - see extraction rake task.

We now know that there are a few broken flows in whitehall that are continuously producing incorrect asset states, specifically unintentionally live assets. This PR defines a rake task that should be able to tackle each asset and, if it matches some of the patterns that we know how to fix, should remedy the data, otherwise just delete it.

Breakdown of the rake task

One of the broken flows we've uncovered is: the user replaces an Attachment in whitehall, then they delete the replacement. The intended effect should be that the original assets of the attachment are not live anymore, nor are the replacements, since the Attachment is now deleted. The bug is that the replacement Asset in asset-manager remains in draft as deleted (draft: true and deleted_at not nil), when it should be going live as deleted. This causes the original asset to not redirect to the draft replacement. This is intentional, for example when we do not want a replacement redirect to be effective just yet, for example when working on a new draft.

We want to fix this by checking if the asset has a draft replacement or is itself a draft replacement. We delete the draft replacement/draft asset if it is not already deleted. Nonetheless, for the redirect to work, we also need to switch the draft state to false. The order is important, as the replacements themselves would become public if we don't action the deletion first.

For assets that are not draft replacements nor have draft replacements, we want to allow them to continue redirecting, being replaced etc. Draft assets that are not replaced or redirected, should also be deleted, as they should not be available in the draft stack if they have been superseded. The current correct flows in whitehall suggest that the only possible states for an asset that has been superseded is either deleted or replaced, if it has not been otherwise redirected.

A final catch-all step ensures remaining assets (for which the steps above don't apply) are deleted.

Data extraction

In Whitehall Publisher, we've identified what data we need to extract by extracting all the AttachmentDatas where the deleted? method is false and relevant database fields to help guide our understanding of the data. The core theory was that assets corresponding to superseded or deleted editions, where theAttachmentData's deleted? method is false, should be further replaced or otherwise deleted, unless they have been redirected.

This is the second part of the data extraction - see first part here. Splitting the data extraction based on the deleted? method evaluation was a way to reason about deletion state in whitehall, since there is no obvious way in which that state is represented in the database. Whilst a soft delete is applied to the Attachment model, it is the AttachmentData model that actually manages the Assets. It deduces the deletion state based on the Attachment state and the Edition visibility. This logic is captured in this deleted? method, which is evaluated whenever we try to update asset-manager. It must be true for a delete value to be sent, and false for the other updates to be sent through - replacement, draft, redirect_url.

NB: Whilst we have extracted data throughly, tackled the known bugs, and otherwise deleted all other assets we've reported on, there is a chance we may have missed some scenarios. This data is restricted to superseded and deleted editions and should have exhaustively fixed or deleted assets that pertain only to superseded and deleted editions. Anything outside of that has not been tackled here. Additionally, the current data extracted to be fixed is based on an older integration dump (end of October 2024), meaning there may be wrong Assets that got created between then and the time an upcoming code fix goes live. Nonetheless, this data patch should eliminate a great deal of risk from past data.

Related fixes

We've also identified a broken flow that causes missing replacement links. If the user makes a new edition draft, and proceeds to make an attachment replacement without having saved the draft first, the live assets do not get set with a replacement in asset-manager. These assets will be fixed by running some updater logic from a whitehall rake task. That will be run prior to this rake task, so that we have healthy replacement chains that we can then further fix.

Upcoming work will also tackle unpublished editions, ensuring redirects are set accordingly.

Next steps

We'll attempt to fix the logic that causes the ongoing issues in an upcoming card. The way in which we're proposing to fix the data in this rake task will translate to how we'll remedy the logic in whitehall. Fixing this data removes a big part of the risk, as does fixing the code. Nonetheless, there are batches of assets we've looked at, whose states we cannot account for. We've started ideating around some remodelling work of the attachments/assets workflow, to make it simpler, easier to understand and maintain, and bug-free.

Links

Trello card
Assets investigation document
Data extraction rake task
Assets we deleted

Co-authored-by @minhngocd

@lauraghiorghisor-tw lauraghiorghisor-tw force-pushed the add-rake-task-to-fix-assets branch 2 times, most recently from 65dbd8e to 894eb5a Compare November 29, 2024 09:09
@lauraghiorghisor-tw lauraghiorghisor-tw marked this pull request as ready for review November 29, 2024 09:10
@lauraghiorghisor-tw lauraghiorghisor-tw force-pushed the add-rake-task-to-fix-assets branch 12 times, most recently from 58c4006 to f47657e Compare December 4, 2024 16:49
On the back of an investigation into asset deletion states, we've
produced a set of assets that are not deleted in wh, but should not be live.
In order to preserve some data integrity, we're choosing to fix the data,
rather than do a mass deletion.

We've identified some flows where draft replacements should be deleted.
Nonetheless, the original asset does not redirect to a replacement if
the replacement is in draft, so we're also switching the draft state to false.
The order is important, as the replacements themselves would become public
if we don't action the deletion first. We're checking whether each asset
that we feed through the rake task is itself or has a replacement in this state.

For assets that are not draft replacements nor have draft replacements,
we want to allow them to continue redirecting, being replaced etc.

A final catch-all step ensures remaining assets (for which the steps above
don't fire) are deleted.

We've also identified broken flows that cause missing replacement links,
but those will be fixed by running some updater logic from a wh rake task.
@lauraghiorghisor-tw lauraghiorghisor-tw force-pushed the add-rake-task-to-fix-assets branch 3 times, most recently from 2522db1 to 0b5759d Compare December 5, 2024 13:41
minhngocd and others added 2 commits December 5, 2024 13:42
- Add a clause to remove parent document url when fixing draft states when deleting superseded assets
Some assets are incorrectly staying in the draft state. But when we try and fix the draft state, it cannot be linking to a draft parent_document_url. We are removing the parent document url link since it is difficult to figure out what the parent_document_url should be, and is largely irrelevant for a superseded asset.
- Fix correct state for assets that have gone in a weird invalid state over the years.
"deleted" should no longer be a valid state. patching any data that we come across that has a wrong state.
- Log when we've patched an invalid state or parent document URL
We want to know if we are skipping assets because they have already been fixed,
so that we store a correct account of what's been changed.
@lauraghiorghisor-tw lauraghiorghisor-tw force-pushed the add-rake-task-to-fix-assets branch from 0b5759d to 7171e60 Compare December 5, 2024 13:42
@minhngocd minhngocd force-pushed the add-rake-task-to-fix-assets branch from 7171e60 to 0fbadf9 Compare December 5, 2024 15:47
We want to catch any exceptions not surfaced by the validations on the model.
Also grouped the tests to be more readible.
…edirected

These are relatively harmless to the public, as the links will redirect to SignOn and require a login to preview. However it would still be available to other users with a SignOn account. For assets that are superseded, it's probably best we also remove them so users are certain that as long as assets are not linked to their live or preview editions, it should not be accessible anywhere.
@minhngocd minhngocd force-pushed the add-rake-task-to-fix-assets branch from 560213a to 61002a2 Compare December 6, 2024 15:10
Copy link
Contributor

@minhngocd minhngocd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paired on this

…at are further replaced.

The set of data plucked from integration is of only assets associated with superseded editions to the end of the line, therefore should not be a problem to remove.
@lauraghiorghisor-tw lauraghiorghisor-tw merged commit ee924ff into main Dec 10, 2024
9 checks passed
@lauraghiorghisor-tw lauraghiorghisor-tw deleted the add-rake-task-to-fix-assets branch December 10, 2024 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants