
Add task to bulk remove / replace links to domains #9340

Open · wants to merge 3 commits into base: main
Conversation

@richardTowers (Contributor) commented Aug 2, 2024

Occasionally, entire domains which are regularly linked to on GOV.UK break. For example, there are over 1000 links to storify.com which was:

a social network service that let the user create stories or timelines
using social media such as Twitter, Facebook and Instagram

(https://en.wikipedia.org/wiki/Storify)

... until it went out of business in 2018.

Sometimes these domains just disappear, but occasionally they're hoovered up by some malicious third party and end up doing spammy things, or, worse, phishy things. (Note: storify isn't an example of that - it's simply down).

In these situations, we've got a choice with no clearly good options:

  • Leave the links on GOV.UK pointing to malicious sites (not good)
  • Attempt to contact the "owners" of the affected pages on GOV.UK and have them fix the links (not practical, since many pages older than a couple of years are effectively unowned)
  • Mark up the links in some way to show that they're broken and warn users before they click on them (confusing / scary UX)
  • Archive the content (only works if the content with the broken links can actually be archived, which isn't always true. Also it's never true at the moment, as there's no way to meaningfully "archive" content on GOV.UK yet).
  • Redirect the links to a "this link is broken" page on GOV.UK (looks bad, potentially scary UX)
  • Remove the links entirely, including their link text (would break sentences / potentially make content unreadable / inaccurate)
  • Replace the links with their bare link text (what this PR does) (potentially confusing in some situations, where the link text might be "click here" or similar).
  • Redirect the links to another site (only works if there's a sensible place to redirect to) (what this PR also does)

These two rake tasks are an attempt at automating the removal or redirection of links to bad domains.

I am aware that I am committing a grave sin by parsing HTML with regex, but given that I had to do it for govspeak (because the govspeak parser won't round trip cleanly), I figured it was best to be consistent. The resulting regexes are pretty horrific, so I've used the extended syntax to comment them.

I'm interested in thoughts on alternative approaches here - I had fun writing this code, but I'm not wedded to it if there's a better way. I don't think "find all the bad URLs and update them by hand" is a very good option though.
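To make the approach concrete, here is a minimal sketch of the "replace links with bare link text" task, assuming govspeak's markdown-style `[text](url)` links. The method name `strip_links_to_domain` and the short-domain guard are illustrative, not the PR's actual code, but the shape (an extended-syntax regex with the domain escaped and matched exactly) follows the approach described above.

```ruby
# Hypothetical sketch, not the PR's actual implementation.
def strip_links_to_domain(govspeak, domain)
  # Refuse suspiciously short domains, to limit the blast radius of a bad argument
  raise ArgumentError, "domain looks too short to be safe" if domain.length < 8

  pattern = %r{
    \[(?<text>[^\]\[]*)\]                # the markdown link text
    \(
      https?://#{Regexp.escape(domain)}  # scheme plus the exact domain
      (?:/[^)\s]*)?                      # optional path, up to the closing paren
    \)
  }x

  govspeak.gsub(pattern) { Regexp.last_match[:text] }
end
```

Because the closing paren must follow the domain (or a `/`-rooted path), a lookalike domain such as `storify.com.evil.example` would not match.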


This commit adds a rake task which, given a domain, finds all the
content in Whitehall which links to that domain, and replaces the links
with the plain text link text.

I wanted to do this with the govspeak parser, instead of doing a regex
replace, but unfortunately it doesn't cleanly round trip (i.e. if you
parse a doc and then emit it again as govspeak, there will be lots of
differences, including but not limited to whitespace).

The regex approach does have some risk that we could incorrectly
replace something that's not a link (e.g. something inside a code block
might look like a link, but not actually be one). But I think the risk
is low, considering the subset of markdown that's permitted in govspeak.
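As a toy illustration of that risk (this is not code from the PR): a plain regex substitution has no awareness of govspeak block structure, so a link-shaped string inside an indented code block gets rewritten along with real links.

```ruby
# Illustration only: a naive regex replace treats link-shaped text inside an
# indented code block exactly like a real link.
doc = <<~GOVSPEAK
  See [a real link](https://storify.com/a).

      puts "[looks like a link](https://storify.com/b)"
GOVSPEAK

stripped = doc.gsub(%r{\[([^\]\[]*)\]\(https?://storify\.com(?:/[^)\s]*)?\)}) do
  Regexp.last_match(1)
end
# Both occurrences are replaced, including the one inside the code block.
```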

I've tried to make the patterns as strict as possible to reduce the risk
of this being called with bad arguments and doing widespread damage. The
domain has to match the one in the link exactly, and we don't support
very short domains.

This is very similar to the previous commit, which introduced a task to replace links to a certain domain with the plain text link text.

For some domains, there's a URL somewhere else on GOV.UK or on the internet where we could re-point all of the links. For example, anything pointing at http://www.cesg.gov.uk/ (which is an NXDOMAIN) could be repointed at https://www.gov.uk/government/organisations/cesg

There's a fair bit of duplication between LinkRemover and LinkRedirector, but they're different enough that it's hard to reuse the code without making them hard to read. So I've just accepted the duplication.
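A redirect variant could follow the same pattern as the removal task: keep each link's text and swap only the target. Again a hypothetical sketch, assuming the same extended-syntax regex approach; `redirect_links_to_domain` is an illustrative name, not from the PR.

```ruby
# Hypothetical sketch of the redirect task: keep each link's text, but point
# the link at a replacement URL. Method name is illustrative, not from the PR.
def redirect_links_to_domain(govspeak, domain, replacement_url)
  pattern = %r{
    (\[[^\]\[]*\])                                      # the link text, brackets included
    \(https?://#{Regexp.escape(domain)}(?:/[^)\s]*)?\)  # the dead target
  }x

  govspeak.gsub(pattern) { "#{Regexp.last_match(1)}(#{replacement_url})" }
end
```

With the cesg.gov.uk example above, every `http://www.cesg.gov.uk/...` link would be repointed at the CESG organisation page while its link text survives unchanged.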
Comment on lines +28 to +32
PublishingApiDocumentRepublishingWorker.perform_async_in_queue(
"bulk_republishing",
edition.document_id,
true,
)
@richardTowers (author) commented:

Not sure if this is required, as we might be able to rely on callbacks on Edition when we call save! to republish to publishing-api.

A contributor replied:

There are no callbacks on editions that save content to Publishing API; editions use an imperative workflow, where a service is called from the controller to send content to Publishing API.

@ryanb-gds (Contributor) commented:

@richardTowers is this still needed? It would be quite handy if it wasn't, given the complexity; I wouldn't want to maintain it if we don't absolutely need to.

@richardTowers (author) replied:

There's still an outstanding ask to remove / redirect a relatively large number of links (spreadsheet), which I've just dropped the ball on.

I guess we could use this as a one off and then revert it, and hope that an archiving strategy comes along before we need this again. Or we could just fix all the links manually I suppose.

@vignesh1507 left a comment:

Can you explain more about the changes you've made?
