
Backup not functional #2270

Open · 5 tasks done
JimMadge opened this issue Oct 29, 2024 · 19 comments · May be fixed by #2272

Labels
bug Problem when deploying a Data Safe Haven.

Comments

@JimMadge
Member

JimMadge commented Oct 29, 2024

✅ Checklist

  • I have searched open and closed issues for duplicates.
  • This is a problem observed when managing a Data Safe Haven.
  • I can reproduce this with the latest version.
  • I have read through the documentation.
  • This isn't an open-ended question (open a discussion if it is).

💻 System information

  • Operating System:
  • Data Safe Haven version: 5.0.1

📦 Packages

List of packages
Paste list of packages here

🚫 Describe the problem

Backup is not functional, and the workaround in the docs doesn't work.

A major problem is that the storage account types we have chosen (for good reasons: performance and the ability to upload/download to/from them) are not compatible with Azure backup services.

🚂 Workarounds or solutions

Because of the incompatibility, I think our two options are:

  1. Change the type of storage accounts we use to ones which are compatible
    • This would mean we lose nice features like performance, NFS integration, and upload/download using Azure Storage Explorer
    • Quite a significant infrastructure change to make
  2. Implement our own backup
    • Requires writing/supporting our own solution
    • There are off-the-shelf tools we could use (I think Borg/Borgmatic)
    • Should be fairly simple to implement with a VM or container, as we can mount all of our storage as NFS
    • I have a proposal for how this would look using Borg/Borgmatic in the comments below
@JimMadge JimMadge added bug Problem when deploying a Data Safe Haven. hotfix An issue that should be fixed on a hotfix branch, with a point release labels Oct 29, 2024
@JimMadge JimMadge changed the title <short description of issue> Backup not functional Oct 29, 2024
@JimMadge
Member Author

JimMadge commented Oct 29, 2024

Following the docs, fixing the backup instance fails (after 22 minutes 😱)

Fixing protection error for BlobBackupSensitiveData
UserErrorMissingRequiredPermissions: Appropriate permissions to perform the operation is missing.

Possibly because my user is missing some backup roles, though I am an Owner and Storage Blob Data Owner.

@JimMadge
Member Author

The problem may be that the backup vault doesn't have the correct role assigned in the target storage account.

@JimMadge JimMadge self-assigned this Oct 29, 2024
@JimMadge
Member Author

Assigning the needed role to the backup vault fixed that issue.
Should be easy to do in Pulumi.
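For reference, a minimal Pulumi (Python) sketch of how that role assignment could be expressed, assuming the missing role is the built-in Storage Account Backup Contributor role (the GUID below should be verified against the Azure built-in roles list) and that `backup_vault` and `storage_account` are existing resources in the stack:

```python
import pulumi_azure_native as azure_native

# Assumed GUID of the built-in "Storage Account Backup Contributor" role;
# verify against the Azure built-in roles documentation before relying on it.
STORAGE_ACCOUNT_BACKUP_CONTRIBUTOR = (
    "/providers/Microsoft.Authorization/roleDefinitions/"
    "e5e2a7ff-d759-4cd2-bb51-3152d37e2eb1"
)

# `backup_vault` (a dataprotection.BackupVault with a system-assigned identity)
# and `storage_account` (the target storage.StorageAccount) are assumed to
# already be defined elsewhere in the Pulumi program.
backup_vault_role = azure_native.authorization.RoleAssignment(
    "backup_vault_storage_role",
    principal_id=backup_vault.identity.apply(lambda identity: identity.principal_id),
    principal_type="ServicePrincipal",
    role_definition_id=STORAGE_ACCOUNT_BACKUP_CONTRIBUTOR,
    scope=storage_account.id,
)
```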

The next problem is:

UserErrorUnsupportedStorageAccountType: The storage account type is not supported for backup.

@JimMadge
Member Author

Storage account kind is BlockBlobStorage

operational backup supports block blobs in standard general-purpose v2 storage accounts only

You can back up only block blobs in a standard general-purpose v2 storage account using the vaulted backup solution for blobs.

from here

@JimMadge
Member Author

Is this the correct storage account? The target is sensitivedata, which holds the inputs and outputs. That feels like the least important thing to back up. In fact, I could see a strong case for not making copies of the input data when you are acting as a data processor.

@jemrobinson
Member

I don't think we're ever acting as a data processor - pretty much all useful research is going to involve the researchers making significant-enough decisions that they are a data controller.

@Davsarper : can you remember what we should be backing up for DSPT-compatibility?

@Davsarper

Agreed on us being data controllers.

We refer to backups when asked about our business continuity plan, and currently answer that after a failure we would recover 'as much as possible'. I don't see, at first glance, a hard description of what needs to be recovered for our organisation.

I think the critical things to recover would be those necessary for offering whichever (healthcare) services an organisation provides; since we don't offer any, it is largely up to us to decide what is key to back up.

I will have a read about business continuity plans, which we should develop: https://www.digitalcarehub.co.uk/resource/creating-and-testing-a-business-continuity-plan-for-data-and-cyber-security/

@JimMadge
Member Author

I think we should be careful not to focus too much on ourselves. I think the use case for an ephemeral TRE used strictly for data processing is strong.

I feel we might have tried to back up the inverse of what we really want:

  • Backup
    • working data (/home, /shared)
    • databases
    • state data? (like container data)
  • Don't backup
    • input data
    • staged outputs

@JimMadge JimMadge linked a pull request Oct 30, 2024 that will close this issue
3 tasks
@JimMadge
Member Author

Also,

Azure Backup currently doesn't support NFS shares.

🫠

@JimMadge
Member Author

So, backing up either the block blobs or the NFS shares will require one of:

  • Major infra changes, using different storage account and container/share types so that we can use Azure Backup Vaults and Recovery Services
  • Implementing our own solution

My feeling is that backing up to some redundant storage using Borg is the most flexible option and the easiest to implement.
The biggest downside is that restoring backups would involve running commands rather than clicking buttons in the portal.

@JimMadge
Member Author

Maybe the best way to implement our own is:

  • Container or VM (configured by templated files or Ansible)
    • user data (/home, /shared) mounted as rw
    • backup share mounted as rw
    • Borgmatic + Borg (see the sketch after this list)
      • configured to make incremental, encrypted backups every x hours
      • retention rules for e.g. 6 monthly checkpoints, 4 weekly checkpoints, 7 daily checkpoints, 24 hourly checkpoints
    • Script templates/commands for backup restoration

That way you don't need to worry about multiple workspaces running conflicting backup jobs.
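To make the Borgmatic part concrete, here is roughly what the scheduled job would run (Borgmatic generates the equivalent Borg calls from its configuration). This is a minimal sketch assuming Borg is installed, the repository has already been initialised with encryption at a hypothetical `/backup/borg-repo` on the backup share, and the passphrase is supplied via `BORG_PASSPHRASE`:

```python
import subprocess

# Hypothetical locations: the NFS mounts for user data and the backup share.
SOURCES = ["/home", "/shared"]
REPOSITORY = "/backup/borg-repo"  # assumed to have been `borg init`-ed already


def borg(*args: str) -> None:
    """Run a Borg subcommand, raising if it fails."""
    subprocess.run(["borg", *args], check=True)


# Incremental, encrypted backup; {hostname} and {now} are Borg archive-name
# placeholders, so each run creates a new timestamped archive.
borg("create", "--stats", f"{REPOSITORY}::{{hostname}}-{{now}}", *SOURCES)

# Retention rules from the proposal above: 24 hourly, 7 daily, 4 weekly and
# 6 monthly checkpoints.
borg(
    "prune",
    "--keep-hourly", "24",
    "--keep-daily", "7",
    "--keep-weekly", "4",
    "--keep-monthly", "6",
    REPOSITORY,
)

# Periodic integrity check of the repository and archives.
borg("check", REPOSITORY)
```

Run from cron or a systemd timer every x hours, this gives the incremental backups, retention rules and integrity checks described in the list above.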

@JimMadge
Member Author

@jemrobinson @craddm I think we need to reach a consensus on the change we want to make here.

@jemrobinson
Member

jemrobinson commented Oct 31, 2024

I would like to see all the data/state necessary to recreate the environment backed up.

Imagine, for example, that an SRE has been compromised by a ransomware attack 4 years into a 5-year project. Would our backups be sufficient to deploy an identical (or updated) new environment such that only days/weeks of work are lost? We should also ensure that we have tiered backups (e.g. X days of daily backups + Y weeks of weekly backups + Z months of monthly backups) in case of longer-term problems.

I'm agnostic as to the method used to achieve this: using Azure built-in solutions would be nice, but it's more important to have something that works than something with a point-and-click interface.

@martintoreilly do you agree with this?

@JimMadge
Member Author

Agreed. In terms of Borg + Borgmatic as a solution:

Backups are incremental, encrypted and hashed. That gives good protection against the data becoming corrupted (you can schedule regular integrity checks) while reducing the space needed.
There are flexible and convenient retention rules, so declaring something like "keep three annual snapshots, the last 12 monthly snapshots, ..." is easy.

For restoration/disaster recovery, I think what we need to back up is /mnt/shared, /home and possibly /mnt/input and /mnt/output. As we can mount all of those into a VM, I think that should be easy. I think all of the other data is not needed: either we can recreate it with desired state or it is cache.
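On the restoration side, a sketch of what a restore script template might wrap, assuming the same hypothetical repository path as above; `borg list` shows the available archives and `borg extract` restores the chosen paths into the current working directory, from which they can be copied back to the live mounts:

```python
import subprocess

REPOSITORY = "/backup/borg-repo"  # assumed repository location
ARCHIVE = "workspace-2024-10-29T12:00:00"  # hypothetical archive name

# List the archives available in the repository.
subprocess.run(["borg", "list", REPOSITORY], check=True)

# Extract the user data from the chosen archive into the current directory;
# Borg stores paths without the leading slash.
subprocess.run(
    ["borg", "extract", f"{REPOSITORY}::{ARCHIVE}", "home", "shared"],
    check=True,
)
```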

@JimMadge
Member Author

@craddm @jemrobinson Any objection before I start along the lines I've posted here?

@jemrobinson
Member

Sounds good to me. It would be worth thinking about how we could back up/restore /mnt/input since this is mounted as read-only. Possibly we'd restore it to /shared and then manually move it across with Storage Explorer?

@JimMadge
Member Author

For container instances, it is possible to mount Azure file shares but not blobs.

@jemrobinson
Member

This doesn't work for NFS file shares - it's just the standard kind that you can browse in the portal or with Storage Explorer.

@JimMadge
Member Author

In that case, I think the best way forward is to add a VM and configure it with cloud-init and Ansible.
We already know how to mount all of the storage that way.

The workload should fit a small burstable VM size.
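A minimal Pulumi (Python) sketch of what that VM might look like, assuming `resource_group`, `network_interface` and `admin_ssh_public_key` already exist in the stack, and that the NFS mounts and Borg/Borgmatic setup are handled by the real cloud-init/Ansible configuration; the image reference, names and size are illustrative only:

```python
import base64

import pulumi_azure_native as azure_native

# Placeholder cloud-init config; the real one would add the NFS mounts,
# install Borg/Borgmatic and set up the backup timer.
cloud_init = """#cloud-config
package_update: true
packages:
  - borgbackup
  - borgmatic
"""

backup_vm = azure_native.compute.VirtualMachine(
    "backup_vm",
    resource_group_name=resource_group.name,
    hardware_profile=azure_native.compute.HardwareProfileArgs(
        vm_size="Standard_B2s",  # small burstable size
    ),
    network_profile=azure_native.compute.NetworkProfileArgs(
        network_interfaces=[
            azure_native.compute.NetworkInterfaceReferenceArgs(id=network_interface.id),
        ],
    ),
    os_profile=azure_native.compute.OSProfileArgs(
        computer_name="backup",
        admin_username="backupadmin",
        custom_data=base64.b64encode(cloud_init.encode()).decode(),
        linux_configuration=azure_native.compute.LinuxConfigurationArgs(
            disable_password_authentication=True,
            ssh=azure_native.compute.SshConfigurationArgs(
                public_keys=[
                    azure_native.compute.SshPublicKeyArgs(
                        key_data=admin_ssh_public_key,  # assumed to be defined elsewhere
                        path="/home/backupadmin/.ssh/authorized_keys",
                    )
                ],
            ),
        ),
    ),
    storage_profile=azure_native.compute.StorageProfileArgs(
        image_reference=azure_native.compute.ImageReferenceArgs(
            publisher="Canonical",
            offer="0001-com-ubuntu-server-jammy",
            sku="22_04-lts-gen2",
            version="latest",
        ),
        os_disk=azure_native.compute.OSDiskArgs(create_option="FromImage"),
    ),
)
```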

@JimMadge JimMadge removed the hotfix An issue that should be fixed on a hotfix branch, with a point release label Nov 12, 2024
@JimMadge JimMadge added this to the Release 5.1.0 milestone Nov 12, 2024
@JimMadge JimMadge moved this to Ready to Work in Data Safe Haven Nov 18, 2024
@JimMadge JimMadge mentioned this issue Nov 18, 2024
5 tasks
@JimMadge JimMadge moved this from Ready to Work to In progress in Data Safe Haven Dec 6, 2024