Add task to bulk remove / replace links to domains #9340
base: main
Conversation
Occasionally, entire domains which are regularly linked to on GOV.UK break. For example, there are over 1000 links to storify.com, which was:

> a social network service that let the user create stories or timelines using social media such as Twitter, Facebook and Instagram

(https://en.wikipedia.org/wiki/Storify)

... until it went out of business in 2018.

Sometimes these domains just disappear, but occasionally they're hoovered up by some malicious third party and end up doing spammy things or, worse, phishy things. (Note: Storify isn't an example of that; it's simply down.)

In these situations, we've got a choice with no clearly good options:

- Leave the links on GOV.UK pointing to malicious sites (not good)
- Attempt to contact the "owners" of the affected pages on GOV.UK and have them fix the links (not practical, since many pages older than a couple of years are effectively unowned)
- Mark up the links in some way to show that they're broken and warn users before they click on them (confusing / scary UX)
- Archive the content (only works if the content with the broken links can actually be archived, which isn't always true; it's also never true at the moment, as there's no way to meaningfully "archive" content on GOV.UK yet)
- Replace the links with their bare link text, which is what this PR does (potentially confusing in some situations, where the link text might be "click here" or similar)
- Redirect the links to a "this link is broken" page on GOV.UK (looks bad, potentially scary UX)
- Redirect the links to another site (only works if there's a sensible place to redirect to; this will be addressed in a subsequent commit)
- Remove the links entirely, including their link text (would break sentences and potentially make content unreadable or inaccurate)

This commit adds a rake task which, given a domain, finds all the content in Whitehall which links to that domain and replaces the links with their plain text link text.

I wanted to do this with the govspeak parser instead of a regex replace, but unfortunately it doesn't round trip cleanly (i.e. if you parse a document and then emit it again as govspeak, there will be lots of differences, including but not limited to whitespace).

The regex approach does carry some risk that we could incorrectly replace something that's not a link (e.g. something inside a code block might look like a link without actually being one), but I think the risk is low, considering the subset of markdown that's permitted in govspeak. I've tried to make the patterns as strict as possible to reduce the risk of this task being called with bad arguments and doing widespread damage: the domain has to match the one in the link exactly, and we don't support very short domains.
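To illustrate the kind of substitution described, here is a minimal sketch. The method name and pattern are assumptions for illustration, not the actual code in this PR:

```ruby
# Illustrative sketch only - not the actual LinkRemover implementation.
# Replaces markdown/govspeak links to a given domain with their bare link
# text, e.g. "[my story](https://storify.com/foo)" becomes "my story".
def remove_links_to_domain(govspeak, domain)
  escaped = Regexp.escape(domain)
  # The link text is captured; the URL must start with the exact domain,
  # followed either by a path or by the closing parenthesis.
  govspeak.gsub(%r{\[([^\]]*)\]\(\s*https?://#{escaped}(?:/[^)]*)?\s*\)}, '\1')
end
```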
This is very similar to the previous commit, which introduced a task to replace links to a certain domain with the plain text link text.

For some domains, there's a URL somewhere else on GOV.UK or on the internet where we could re-point all of the links. For example, anything pointing at http://www.cesg.gov.uk/ (which is an NXDOMAIN) could be repointed at https://www.gov.uk/government/organisations/cesg.

There's a fair bit of duplication between LinkRemover and LinkRedirector, but they're different enough that it's hard to reuse the code without making them hard to read, so I've just accepted the duplication.
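Under the same assumptions as the sketch above, the redirect variant might rewrite the URL while keeping the link text intact:

```ruby
# Illustrative sketch only - not the actual LinkRedirector implementation.
# Repoints links from a dead domain at a replacement URL, e.g.
# "[CESG](http://www.cesg.gov.uk/)" becomes
# "[CESG](https://www.gov.uk/government/organisations/cesg)".
def redirect_links_to_domain(govspeak, domain, new_url)
  escaped = Regexp.escape(domain)
  govspeak.gsub(%r{(\[[^\]]*\])\(\s*https?://#{escaped}(?:/[^)]*)?\s*\)}) do
    "#{Regexp.last_match(1)}(#{new_url})"
  end
end
```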
```ruby
PublishingApiDocumentRepublishingWorker.perform_async_in_queue(
  "bulk_republishing",
  edition.document_id,
  true,
)
```
Not sure if this is required, as we might be able to rely on callbacks on Edition when we call save! to republish to publishing-api.
There are no callbacks on editions that save content to Publishing API; editions use an imperative workflow where a service is called from the controller to send content to Publishing API.
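For anyone unfamiliar with the pattern being described, it looks roughly like this. The class and service names here are invented for illustration; they are not Whitehall's real ones:

```ruby
# Hypothetical illustration of the imperative workflow described above.
class EditionsController < ApplicationController
  def update
    edition = Edition.find(params[:id])
    edition.assign_attributes(edition_params)
    edition.save!
    # save! alone does not notify Publishing API; the controller
    # explicitly calls a service object to push the content downstream.
    PublishingApiPushService.new(edition).call
  end
end
```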
@richardTowers is this still needed? It would be quite handy if it wasn't, given the complexity; I wouldn't want to maintain it if we don't absolutely need to.
There's still an outstanding ask to remove / redirect a relatively large number of links (spreadsheet), which I've just dropped the ball on. I guess we could use this as a one-off and then revert it, and hope that an archiving strategy comes along before we need this again. Or we could just fix all the links manually, I suppose.
Can you explain more about the changes you've made?
These two rake tasks are an attempt at automating the removal or redirection of links to bad domains.
I am aware that I am committing a grave sin by parsing HTML with regex, but given that I had to do it for govspeak (because the govspeak parser won't round trip cleanly), I figured it was best to be consistent. The resulting regexes are pretty horrific, so I've used the extended syntax to comment them.
I'm interested in thoughts on alternative approaches here - I had fun writing this code, but I'm not wedded to it if there's a better way. I don't think "find all the bad URLs and update them by hand" is a very good option though.
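As an example of the extended (`/x`) syntax mentioned above, a commented pattern for HTML anchors might look something like this. It is illustrative only; the actual patterns in this PR are stricter about what they match:

```ruby
# Illustrative only - not the actual pattern from this PR.
def html_link_pattern(domain)
  %r{
    <a\b[^>]*?                  # opening anchor tag, any leading attributes
    href="https?://
    #{Regexp.escape(domain)}    # the target domain, matched exactly
    (?:/[^"]*)?"                # optional path, then the closing quote
    [^>]*>                      # any remaining attributes
    (?<text>.*?)                # the link text, captured for reuse
    </a>                        # closing tag
  }xm
end

# e.g. html.gsub(html_link_pattern("storify.com")) { Regexp.last_match(:text) }
```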