
Backup Strategies

Jonathan Stray edited this page Sep 10, 2017 · 5 revisions

If you're deploying Overview -- either on your own machine, in your company's intranet or on the cloud -- you'll want to consider the worst. What happens if all Overview's files become unavailable?

1. The recommended approach

We recommend overview-local for all new deployments. It includes backup and restore programs. Be sure to stop the server before backing up or restoring, or you may get inconsistent data.

2. Rolling your own

First, take a few minutes to decide: what problems are you worried about, and how hard do you want to work to solve them?

Data In Overview

Here's what Overview stores:

  • Overview stores document-set data, document text and data, tags, and plugin-generated data in its PostgreSQL database.
  • Overview stores files you upload, thumbnail images, and specially-generated PDF documents for its user interface in "blob storage" (either S3 or your filesystem).
  • Overview stores search indexes in Lucene indexes on your filesystem.

Risks

Here's what could happen:

  • You probably shouldn't worry about Lucene indexes being corrupted. If your search indexes are lost, you can reindex while Overview is running; most users may not even notice anything amiss.
  • You could lose files in blob storage (S3 or a directory full of files). Many Overview features would stop working. For instance, thumbnails wouldn't appear, and you'd only be able to view documents in text mode. You could still export document sets as spreadsheets, though. You could even continue using Overview: future document sets would have all features.
  • You could lose your PostgreSQL database. Overview would not run, and you wouldn't be able to view or download any data. (Even files in blob storage are nearly useless if there's no database.)
  • Overview itself could have a bug that deletes or invalidates your data.
  • You could accidentally delete a document set or modify tags.

Plan

Prepare for the possibility that you lose your PostgreSQL database, or that you lose your "blob storage" files, or that you lose both at the same time. How would that affect you? That should guide your approach.

2. Overview-approved Strategies

We recommend the following approaches to backing up and restoring Overview's data.

If you like, you can use separate approaches for Lucene, PostgreSQL and Blob Storage data. For instance, if you use Overview mainly to generate spreadsheets of data, you may decide to only back up the PostgreSQL data and not even devise a backup strategy for the Lucene or Blob Storage data.

2a. Download Your Document Sets as Spreadsheets

This strategy takes no technical expertise.

You can export a document set from Overview in spreadsheet format. You can then re-import exported CSV-format spreadsheets into Overview. The re-imported version won't have thumbnails, and you won't be able to browse it like a PDF; but if you care more about document tags and metadata, this is the simplest approach.

2b. Store All Data in a Single File (as in overview-local)

If you're using overview-local, you can use its backup and restore programs. If you're not using Docker, you can follow the same logic:

  1. Backup:
    1. Stop Overview.
    2. Zip all Overview's data into a single file.
    3. Start Overview.
  2. Restore:
    1. Stop Overview.
    2. Unzip the data back to the original files.
    3. Start Overview.

Gotchas:

  • Overview can't run while the backup is happening. PostgreSQL and Lucene databases can be corrupted if Overview writes to them while the backup script reads them. If you don't stop Overview, the backups may be useless.
  • The larger your database, the longer it will take to back up and restore.
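
The stop, zip, start sequence above can be sketched in Python. This is a minimal sketch, not the overview-local implementation: the `stop` and `start` callables are hypothetical placeholders for however you stop and start Overview, and it assumes all data directories have distinct names and are restored under one parent directory.

```python
import tarfile
from pathlib import Path
from typing import Callable, Iterable


def backup(data_dirs: Iterable[Path], archive: Path,
           stop: Callable[[], None], start: Callable[[], None]) -> None:
    """Stop Overview, pack every data directory into one archive, restart."""
    stop()
    try:
        with tarfile.open(archive, "w:gz") as tar:
            for data_dir in data_dirs:
                # Assumption: directory names are unique, so each one can be
                # stored under its own name inside the archive.
                tar.add(data_dir, arcname=data_dir.name)
    finally:
        # Restart even if archiving failed, so a failed backup
        # does not leave Overview stopped.
        start()


def restore(archive: Path, target_parent: Path,
            stop: Callable[[], None], start: Callable[[], None]) -> None:
    """Stop Overview, unpack the archive under target_parent, restart."""
    stop()
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_parent)
    # Only restart after a successful extraction: starting Overview on
    # half-restored data is worse than leaving it stopped.
    start()
```

The two functions treat failure differently on purpose: a failed backup should not cause downtime, while a failed restore should leave Overview stopped until someone has looked at the data.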

2c. Take Volume Snapshots

Overview's databases are on filesystems. Many filesystems have a "snapshot" operation that backs up all their files simultaneously. A snapshot can't corrupt a PostgreSQL or Lucene database.

We support this approach for Amazon's EBS volumes and VirtualBox Virtual Machines. (The VirtualBox approach is useful if you're using Docker on Windows or Mac.)

We don't support this on Docker. Docker may provide volume snapshots someday.

  1. Setup:
    1. Store Overview's data on its own volume(s). Your backups could be the same size as the volume.
  2. Backup:
    1. Take a snapshot. (On AWS, it's aws ec2 create-snapshot; in VirtualBox, there's an icon in the "Snapshots" tab.)
  3. Restore:
    1. Stop Overview.
    2. Restore the volume from the snapshot.
    3. Start Overview.

Gotchas:

  • Snapshots can take a lot of space.
  • Sometimes, snapshot files depend upon one another: if you lose one snapshot file, other snapshot files may not have enough information for a restore to work.
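
If you want to script the AWS backup step, one option is to wrap the `aws ec2 create-snapshot` command in Python. This is a sketch under assumptions: it expects the AWS CLI to be installed and configured, the function names are illustrative, and the volume ID you pass in must be the volume that holds Overview's data.

```python
import subprocess
from datetime import datetime, timezone
from typing import List, Optional


def snapshot_description(now: datetime) -> str:
    """Label the snapshot with a timestamp so it is easy to find later."""
    return "Overview backup " + now.strftime("%Y-%m-%dT%H:%M:%SZ")


def snapshot_command(volume_id: str, description: str) -> List[str]:
    """Build the argument list for `aws ec2 create-snapshot`."""
    return [
        "aws", "ec2", "create-snapshot",
        "--volume-id", volume_id,
        "--description", description,
    ]


def take_snapshot(volume_id: str,
                  now: Optional[datetime] = None) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError if the AWS CLI fails."""
    now = now or datetime.now(timezone.utc)
    command = snapshot_command(volume_id, snapshot_description(now))
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

Building the argument list separately from running it makes it possible to print or log the exact command before any snapshot is taken.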

Snapshots are excellent in a cloud environment -- AWS, for instance.

2d. Use Point-in-Time Recovery (advanced, unsupported)

PostgreSQL has a Write-Ahead Log feature that, when archiving is configured correctly, lets you recover to any point in time covered by the log. Cloud services like Amazon RDS can take advantage of this. Unfortunately, the write-ahead log can get large if you use it for point-in-time recovery, and restoring from it can take a long time.
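
To make "configured correctly" a little more concrete, here is a small Python helper that renders the `postgresql.conf` settings PostgreSQL's documentation uses for continuous WAL archiving. It is an illustration, not a tested Overview setup: the archive directory is a hypothetical path, and `wal_level = replica` assumes PostgreSQL 9.6 or newer (older versions use `archive` or `hot_standby`).

```python
def wal_archiving_settings(archive_dir: str) -> str:
    """Render postgresql.conf lines that turn on continuous WAL archiving."""
    # PostgreSQL replaces %p with the path of the WAL segment to archive
    # and %f with its file name before running the command.
    # `test ! -f` refuses to overwrite a segment that was already archived.
    archive_command = f"test ! -f {archive_dir}/%f && cp %p {archive_dir}/%f"
    settings = {
        "wal_level": "replica",  # assumption: PostgreSQL 9.6 or newer
        "archive_mode": "on",
        "archive_command": f"'{archive_command}'",
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical archive location; use a disk separate from the database.
    print(wal_archiving_settings("/mnt/wal_archive"), end="")
```

Archiving on its own is not a backup: point-in-time recovery also needs a base backup of the database for the archived log to be replayed onto.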

S3 has versioning, which stores enough information that you can revert your blob storage to a specific point in time. This kind of restore, too, can take a long time.
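
Versioning is switched on per bucket. As a sketch, assuming the AWS CLI is installed and configured and that you know which bucket Overview uses for blob storage (the function names here are illustrative), the `aws s3api put-bucket-versioning` command can be wrapped like this:

```python
import subprocess
from typing import List


def enable_versioning_command(bucket: str) -> List[str]:
    """Build the argument list that turns on versioning for one S3 bucket."""
    return [
        "aws", "s3api", "put-bucket-versioning",
        "--bucket", bucket,
        "--versioning-configuration", "Status=Enabled",
    ]


def enable_versioning(bucket: str) -> None:
    """Run the command; raises CalledProcessError if the AWS CLI fails."""
    subprocess.run(enable_versioning_command(bucket), check=True)
```

Versioning only protects objects written after it is enabled, so it is worth turning on before you need it.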

Overview doesn't recommend using any versioning techniques with its Lucene indexes. (Lucene supports versioning, but Overview doesn't take advantage of those features.)
