
Add task to bulk remove / replace links to domains #9340

Open · wants to merge 3 commits into base: main
Conversation

@richardTowers (Contributor) commented Aug 2, 2024

Occasionally, entire domains which are regularly linked to on GOV.UK break. For example, there are over 1000 links to storify.com which was:

a social network service that let the user create stories or timelines
using social media such as Twitter, Facebook and Instagram

(https://en.wikipedia.org/wiki/Storify)

... until it went out of business in 2018.

Sometimes these domains just disappear, but occasionally they're hoovered up by some malicious third party and end up doing spammy things, or, worse, phishy things. (Note: storify isn't an example of that - it's simply down).

In these situations, we've got a choice with no clearly good options:

  • Leave the links on GOV.UK pointing to malicious sites (not good)
  • Attempt to contact the "owners" of the affected pages on GOV.UK and have them fix the links (not practical, since many pages older than a couple of years are effectively unowned)
  • Mark up the links in some way to show that they're broken and warn users before they click on them (confusing / scary UX)
  • Archive the content (only works if the content with the broken links can actually be archived, which isn't always true. Also it's never true at the moment, as there's no way to meaningfully "archive" content on GOV.UK yet).
  • Redirect the links to a "this link is broken" page on GOV.UK (looks bad, potentially scary UX)
  • Remove the links entirely, including their link text (would break sentences / potentially make content unreadable / inaccurate)
  • Replace the links with their bare link text (what this PR does) (potentially confusing in some situations, where the link text might be "click here" or similar).
  • Redirect the links to another site (only works if there's a sensible place to redirect to) (what this PR also does)

These two rake tasks are an attempt at automating the removal or redirection of links to bad domains.

I am aware that I am committing a grave sin by parsing HTML with regex, but given that I had to do it for govspeak (because the govspeak parser won't round trip cleanly), I figured it was best to be consistent. The resulting regexes are pretty horrific, so I've used the extended syntax to comment them.

I'm interested in thoughts on alternative approaches here - I had fun writing this code, but I'm not wedded to it if there's a better way. I don't think "find all the bad URLs and update them by hand" is a very good option though.
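To make the approach concrete, here is a minimal sketch of the "replace links with bare link text" task, assuming govspeak's markdown-style `[text](url)` links. The method name `strip_links_to_domain` and the short-domain guard are illustrative, not the PR's actual code, but the shape (an extended-syntax regex with the domain escaped and matched exactly) follows the approach described above.

```ruby
# Hypothetical sketch, not the PR's actual implementation.
def strip_links_to_domain(govspeak, domain)
  # Refuse suspiciously short domains, to limit the blast radius of a bad argument
  raise ArgumentError, "domain looks too short to be safe" if domain.length < 8

  pattern = %r{
    \[(?<text>[^\]\[]*)\]                # the markdown link text
    \(
      https?://#{Regexp.escape(domain)}  # scheme plus the exact domain
      (?:/[^)\s]*)?                      # optional path, up to the closing paren
    \)
  }x

  govspeak.gsub(pattern) { Regexp.last_match[:text] }
end
```

Because the closing paren must follow the domain (or a `/`-rooted path), a lookalike domain such as `storify.com.evil.example` would not match.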


This commit adds a rake task which, given a domain, finds all the
content in Whitehall which links to that domain, and replaces the links
with the plain text link text.

I wanted to do this with the govspeak parser, instead of doing a regex
replace, but unfortunately it doesn't cleanly round trip (i.e. if you
parse a doc and then emit it again as govspeak, there will be lots of
differences, including but not limited to whitespace).

The regex approach does have some risk that we could incorrectly
replace something that's not a link (e.g. something inside a code block
might look like a link, but not actually be one). But I think the risk
is low, considering the subset of markdown that's permitted in govspeak.
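As a toy illustration of that risk (this is not code from the PR): a plain regex substitution has no awareness of govspeak block structure, so a link-shaped string inside an indented code block gets rewritten along with real links.

```ruby
# Illustration only: a naive regex replace treats link-shaped text inside an
# indented code block exactly like a real link.
doc = <<~GOVSPEAK
  See [a real link](https://storify.com/a).

      puts "[looks like a link](https://storify.com/b)"
GOVSPEAK

stripped = doc.gsub(%r{\[([^\]\[]*)\]\(https?://storify\.com(?:/[^)\s]*)?\)}) do
  Regexp.last_match(1)
end
# Both occurrences are replaced, including the one inside the code block.
```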

I've tried to make the patterns as strict as possible to reduce the risk
of this being called with bad arguments and doing widespread damage. The
domain has to match the one in the link exactly, and we don't support
very short domains.

This is very similar to the previous commit, which introduced a task to replace links to a certain domain with the plain text link text.

For some domains, there's a URL somewhere else on GOV.UK or on the internet where we could re-point all of the links. For example, anything pointing at http://www.cesg.gov.uk/ (which is an NXDOMAIN) could be repointed at https://www.gov.uk/government/organisations/cesg

There's a fair bit of duplication between LinkRemover and LinkRedirector, but they're different enough that it's hard to reuse the code without making them hard to read. So I've just accepted the duplication.
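A redirect variant could follow the same pattern as the removal task: keep each link's text and swap only the target. Again a hypothetical sketch, assuming the same extended-syntax regex approach; `redirect_links_to_domain` is an illustrative name, not from the PR.

```ruby
# Hypothetical sketch of the redirect task: keep each link's text, but point
# the link at a replacement URL. Method name is illustrative, not from the PR.
def redirect_links_to_domain(govspeak, domain, replacement_url)
  pattern = %r{
    (\[[^\]\[]*\])                                      # the link text, brackets included
    \(https?://#{Regexp.escape(domain)}(?:/[^)\s]*)?\)  # the dead target
  }x

  govspeak.gsub(pattern) { "#{Regexp.last_match(1)}(#{replacement_url})" }
end
```

With the cesg.gov.uk example above, every `http://www.cesg.gov.uk/...` link would be repointed at the CESG organisation page while its link text survives unchanged.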
Comment on lines +28 to +32
PublishingApiDocumentRepublishingWorker.perform_async_in_queue(
"bulk_republishing",
edition.document_id,
true,
)
@richardTowers (author) commented:

Not sure if this is required, as we might be able to rely on callbacks on Edition when we call save! to republish to publishing-api.

A contributor replied:

There are no callbacks on editions that save content to Publishing API; editions use an imperative workflow, where a service is called from the controller to send content to Publishing API.

@ryanb-gds (Contributor) commented:

@richardTowers is this still needed? It would be quite handy if it wasn't, given the complexity; I wouldn't want to maintain it if we don't absolutely need to.

@richardTowers (author) replied:

There's still an outstanding ask to remove / redirect a relatively large number of links (spreadsheet), which I've just dropped the ball on.

I guess we could use this as a one off and then revert it, and hope that an archiving strategy comes along before we need this again. Or we could just fix all the links manually I suppose.

@vignesh1507 left a comment:

Can you explain more about the changes you've made?
