Backup Strategies
If you're deploying Overview -- either on your own machine, in your company's intranet or on the cloud -- you'll want to consider the worst. What happens if all Overview's files become unavailable?
We recommend overview-local for all new deployments. It includes backup and restore programs. Be sure to stop the server before backing up or restoring, or you may get inconsistent data.
First, take a few minutes to decide: what problems are you worried about, and how hard do you want to work to solve them?
Here's what Overview stores:
- Overview stores document-set data, document text and data, tags, and plugin-generated data in its PostgreSQL database.
- Overview stores files you upload, thumbnail images, and specially-generated PDF documents for its user interface in "blob storage" (either S3 or your filesystem).
- Overview stores search indexes in Lucene indexes on your filesystem.
Here's what could happen:
- You probably shouldn't worry about Lucene indexes being corrupted. If your search indexes are lost, you can reindex while Overview is running; most users may not even notice anything amiss.
- You could lose files in blob storage (S3 or a directory full of files). Many Overview features would stop working. For instance, thumbnails wouldn't appear, and you'd only be able to view documents in text mode. You could still export document sets as spreadsheets, though. You could even continue using Overview: document sets you add afterwards would have all features.
- You could lose your PostgreSQL database. Overview would not run, and you wouldn't be able to view or download any data. (Even files in blob-storage are nearly useless if there's no database.)
- Overview itself could have a bug that deletes or invalidates your data.
- You could accidentally delete a document set or modify tags.
Prepare for the possibility that you lose your PostgreSQL database, or that you lose your "blob storage" files, or that you lose both at the same time. How would that affect you? That should guide your approach.
We recommend the following approaches to backing up and restoring Overview's data.
If you like, you can use separate approaches for Lucene, PostgreSQL and Blob Storage data. For instance, if you use Overview mainly to generate spreadsheets of data, you may decide to only back up the PostgreSQL data and not even devise a backup strategy for the Lucene or Blob Storage data.
This strategy takes no technical expertise.
You can export a document set from Overview in spreadsheet format. You can then re-import exported CSV-format spreadsheets into Overview. The re-imported version won't have thumbnails, and you won't be able to browse it like a PDF; but if you care more about document tags and metadata, this is the simplest approach.
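Before relying on an exported spreadsheet as a backup, it can help to confirm that the file parses cleanly. The following is a minimal sketch, not part of Overview: `summarize_export` is a hypothetical helper, and it assumes the export is a UTF-8 CSV file with a header row. It makes no assumptions about which columns Overview writes.

```python
import csv


def summarize_export(csv_path):
    """Return (column_names, row_count) for an exported CSV spreadsheet.

    Assumption: the file is UTF-8 (with or without a byte-order mark)
    and its first row is a header.
    """
    # Document text can be long, so raise the per-field size limit.
    csv.field_size_limit(2**31 - 1)
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # csv.reader handles quoted fields that contain newlines,
        # so this counts rows rather than physical lines.
        row_count = sum(1 for _ in reader)
    return header, row_count
```

If the row count matches the number of documents you expect in the document set, the export is at least structurally complete.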
If you're using overview-local, you can use its backup and restore programs. If you're not using Docker, you can follow the same logic:
- Backup:
  - Stop Overview.
  - Zip all Overview's data into a single file.
  - Start Overview.
- Restore:
  - Stop Overview.
  - Unzip the data back to the original files.
  - Start Overview.
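The stop, zip, start sequence above can be sketched in Python. This is an illustration of the ordering, not Overview's own backup program: the `systemctl` commands and the function names are hypothetical placeholders for however you start and stop Overview. Python's zip handling also does not restore file ownership or permission bits, which PostgreSQL is strict about, so treat this as a sketch rather than a production tool.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical commands: replace with however you stop and start Overview.
STOP_COMMAND = ["systemctl", "stop", "overview"]
START_COMMAND = ["systemctl", "start", "overview"]


def backup(data_dir, archive_base, stop_cmd=STOP_COMMAND, start_cmd=START_COMMAND):
    """Stop Overview, zip data_dir into '<archive_base>.zip', then restart.

    archive_base must be outside data_dir. Returns the archive's path.
    """
    subprocess.run(stop_cmd, check=True)
    try:
        return shutil.make_archive(str(archive_base), "zip", root_dir=str(data_dir))
    finally:
        # Restart even if zipping failed, so an error doesn't leave Overview down.
        subprocess.run(start_cmd, check=True)


def restore(archive_path, data_dir, stop_cmd=STOP_COMMAND, start_cmd=START_COMMAND):
    """Stop Overview, replace data_dir with the archive's contents, then restart."""
    subprocess.run(stop_cmd, check=True)
    try:
        # Remove current data first so stale files don't survive the restore.
        shutil.rmtree(data_dir, ignore_errors=True)
        Path(data_dir).mkdir(parents=True, exist_ok=True)
        shutil.unpack_archive(str(archive_path), str(data_dir), "zip")
    finally:
        subprocess.run(start_cmd, check=True)
```

Both functions stop the server before touching any files, which is the step that keeps the PostgreSQL and Lucene data consistent.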
Gotchas:
- Overview can't run while the backup is happening. PostgreSQL and Lucene databases can be corrupted if Overview writes them while the backup script reads them. If you don't stop Overview, the backups may be useless.
- The larger your database, the longer it will take to back up and restore.
Overview's databases are on filesystems. Many filesystems have a "snapshot" operation that backs up all of the filesystem's files simultaneously. This can't corrupt a PostgreSQL or Lucene database.
We support this approach for Amazon's EBS volumes and VirtualBox Virtual Machines. (The VirtualBox approach is useful if you're using Docker on Windows or Mac.)
We don't support this on Docker. Docker may provide volume snapshots someday.
- Setup:
  - Store Overview's data on its own volume(s). Your backups could be the same size as the volume.
- Backup:
  - Take a snapshot. (On AWS, it's `aws ec2 create-snapshot`; in VirtualBox, there's an icon in the "Snapshots" tab.)
- Restore:
  - Stop Overview.
  - Restore the volume from the snapshot.
  - Start Overview.
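On AWS, the backup step can be scripted by calling the AWS CLI. The sketch below assumes the `aws` command is installed and has credentials configured; the function names and the default description are hypothetical, and you would supply the ID of the EBS volume that holds Overview's data.

```python
import subprocess


def create_snapshot_command(volume_id, description):
    """Build the `aws ec2 create-snapshot` command line for one EBS volume."""
    return [
        "aws", "ec2", "create-snapshot",
        "--volume-id", volume_id,
        "--description", description,
    ]


def take_snapshot(volume_id, description="Overview data backup"):
    """Run the AWS CLI and return whatever it prints on standard output."""
    result = subprocess.run(
        create_snapshot_command(volume_id, description),
        check=True,           # raise if the CLI reports an error
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Keeping the command builder separate from the call that runs it makes it easy to print the command and check it before running it against a real volume.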
Gotchas:
- Snapshots can take a lot of space.
- Sometimes, snapshot files depend upon one another: if you lose one snapshot file, other snapshot files may not have enough information for a restore to work.
Snapshots are excellent in a cloud environment -- AWS, for instance.
PostgreSQL has a Write-Ahead Log feature that lets you recover to any point in time, provided it has been configured correctly. Managed services like Amazon RDS take advantage of this. Unfortunately, the write-ahead log can get large if you use it for point-in-time recovery; also, restoring from it can take a long time.
S3 has versioning, which stores enough information that you can revert your blob-storage to a specific point in time. This kind of restore, too, can take a long time.
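If your blob storage is in S3, versioning has to be switched on for the bucket before it can help. The sketch below builds the AWS CLI commands to enable versioning and to list the stored versions under a prefix, which is the starting point for any revert. The function names are hypothetical, and the bucket name and prefix are whatever your Overview deployment is configured to use.

```python
import subprocess


def enable_versioning_command(bucket):
    """Build the AWS CLI command that turns on versioning for a bucket."""
    return [
        "aws", "s3api", "put-bucket-versioning",
        "--bucket", bucket,
        "--versioning-configuration", "Status=Enabled",
    ]


def list_versions_command(bucket, prefix=""):
    """Build the AWS CLI command that lists object versions in a bucket."""
    command = ["aws", "s3api", "list-object-versions", "--bucket", bucket]
    if prefix:
        command += ["--prefix", prefix]
    return command


def run(command):
    """Run a built command and return its standard output as text."""
    return subprocess.run(command, check=True, capture_output=True, text=True).stdout
```

Versioning only records history from the moment it is enabled, so it cannot recover objects that were overwritten or deleted earlier.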
We don't recommend using any versioning techniques with Overview's Lucene indexes. (Lucene supports versioning, but Overview doesn't take advantage of those features.)