[arch-devops] Secondary backup ideas (Task #50)
(Sorry, I can't reply to the original thread as I wasn't a member of the list at the time)

Can I throw restic[0] into the mix? It's very similar to borg, but super fast[1]. I moved away from borg a while ago for various reasons. It's written in golang, so it's a simple ELF executable to deploy. It supports lots of destinations[2] and multiple keys[3].

Specifically, I'm thinking the multiple key feature would allow us to have the client push backups to the backup server over a read-only connection (SSH, S3, rsync, whatever), then be able to run restic server-side to handle the cleanups.

Anyway, 'food for thought' that we can discuss.

Cheers,
~p

[0] https://restic.net/
[1] My home desktop backup: ~121 GB of source data on spinning rust (not SSD); an incremental backup takes ~8 seconds:

    scanned 11482 directories, 72707 files in 0:01
    [0:07] 100.00%  13.298 GiB/s  121.249 GiB / 121.249 GiB  84189 / 84189 items  0 errors  ETA 0:02
    duration: 0:07, 15798.45MiB/s
    snapshot 8c4aea6d saved

[2] https://restic.readthedocs.io/en/stable/030_preparing_a_new_repo.html
[3] https://restic.readthedocs.io/en/stable/070_encryption.html
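To make the multi-key idea a bit more concrete, a rough sketch (repo location, host names and retention values are made up, and the exact access restrictions would still need thought):

    # client: push backups to the backup server over SFTP
    restic -r sftp:backup@backupserver:/srv/restic/host1 backup /etc /var/lib/important

    # backup server: add a second key so the server can open the repo with its
    # own passphrase (run once; needs an existing key to authorise it)
    restic -r /srv/restic/host1 key add

    # backup server: periodic cleanup using the server-side key
    restic -r /srv/restic/host1 forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune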
On 05.03.2018 09:02, Phillip Smith via arch-devops wrote:
(Sorry, I can't reply to the original thread as I wasn't a member of the list at the time)
In that case you could go to the list archive, open the post you want to reply to and then click the email address of the sender which will set the correct subject and In-Reply-To headers.
Specifically, I'm thinking the multiple key feature would allow us to have the client push backups to the backup server over a read-only connection (SSH, S3, rsync, whatever), then be able to run restic server-side to handle the cleanups.
The problem here is that restic doesn't work with glacier according to this[1]. So we'd need to use s3 which is more expensive. How much mostly depends on how long we want to keep the data and how well restic compresses/deduplicates it. An alternative would be wasabi[2] which also supports the s3 protocol, but I have no idea how well their service works.

[1] https://github.com/restic/restic/issues/541
[2] https://wasabi.com

I like the idea of using a different tool with (hopefully) good deduplication/compression though. This is certainly better than sending many gigabytes of tarballs around for each backup.

As for the cleanups, I understand that the server and the client would both have keys to access the backup data, correct? That means that the server can read all of the data which makes it a good target for an attacker. Currently we avoid this by only storing client-side encrypted data on the server. I'd like to keep it this way.

I also like the idea of having a WORM s3/glacier bucket. However, I'm not sure how this can be combined sanely with anything other than tarballs. From looking at the restic documentation it seems that they also use an object store so even old objects might still be used in recent backups. Is there another way to achieve cleanup with restic that doesn't require a server with access to the backup keys?

Also, how badly do outside changes impact the performance? Let's say we have the keys on the admin machines (which we need for restores anyway) and perform the cleanup there. How long would it take to run, how much data would it need to transfer (few megabytes, few hundred megabytes, gigabytes, ...?) and do the clients then need to regenerate their caches or can they run at full performance just like before?

Florian
On March 5, 2018 7:06, Florian Pritz via arch-devops wrote:
The problem here is that restic doesn't work with glacier according to this[1]. So we'd need to use s3 which is more expensive. How much mostly depends on how long we want to keep the data and how well restic compresses/deduplicates it. An alternative would be wasabi[2] which also supports the s3 protocol, but I have no idea how well their service works.
We don't need it to support Glacier directly, since we can set up a bucket lifecycle rule that moves the data to Glacier after some time. We can always have the recent backups readily accessible on S3, and the remaining ones on Glacier, which is cheaper.

Regards,
Giancarlo Razzolini
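For illustration, such a lifecycle rule could look roughly like this (bucket name, prefix and day counts are invented and untested):

    # transition backup objects to Glacier 30 days after upload and
    # expire them after a year (placeholder values)
    aws s3api put-bucket-lifecycle-configuration \
        --bucket archlinux-backups \
        --lifecycle-configuration '{
            "Rules": [{
                "ID": "archive-old-backups",
                "Filter": { "Prefix": "backups/" },
                "Status": "Enabled",
                "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
                "Expiration": { "Days": 365 }
            }]
        }'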
On 5 March 2018 at 21:06, Florian Pritz via arch-devops < arch-devops@lists.archlinux.org> wrote:
In that case you could go to the list archive, open the post you want to reply to and then click the email address of the sender which will set the correct subject and In-Reply-To headers.
TIL. Thank-you :)
The problem here is that restic doesn't work with glacier according to this[1]. So we'd need to use s3 which is more expensive. How much mostly depends on how long we want to keep the data and how well restic compresses/deduplicates it.
restic is yet to implement compression [0]. The deduplication seems quite functional, especially when you can share a restic repository between multiple clients (if desired), so common data across clients is deduped in the repo. Have we committed to the idea of paying for Amazon or similar service for this project?

I like the idea of using a different tool with (hopefully) good deduplication/compression though. This is certainly better than sending many gigabytes of tarballs around for each backup.
Definitely! :)
As for the cleanups, I understand that the server and the client would both have keys to access the backup data, correct? That means that the server can read all of the data which makes it a good target for an attacker. Currently we avoid this by only storing client-side encrypted data on the server. I'd like to keep it this way.
I don't see any way to allow the client to manage cleanups without having write access (and therefore the ability to delete) to the 2nd backup. Perhaps we could consider having the 2nd backup on a snapshotting file-system (ie, ZFS) with something like rsync.net [1]. Then it would just be a dumb rsync from primary backup to secondary backup and the 2nd host retains snapshots to protect against malicious 'cleanups' from the 1st backup host.
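As a rough sketch of that layout (host names, dataset and retention are invented; the snapshots live only on the secondary, so the primary can't touch them):

    # primary backup host: plain rsync push to the secondary
    rsync -a --delete /srv/backup/ backup@secondary:/tank/arch-backups/

    # secondary host, after each sync (e.g. from cron): take a snapshot
    zfs snapshot tank/arch-backups@$(date +%Y-%m-%d)

    # secondary host: one possible pruning approach, keep the newest 30 snapshots;
    # destroying snapshots needs local root, so a compromised primary can't do it
    zfs list -H -r -t snapshot -o name -s creation tank/arch-backups \
        | head -n -30 | xargs -r -n 1 zfs destroy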
I also like the idea of having a WORM s3/glacier bucket. However, I'm not sure how this can be combined sanely with anything other than tarballs. From looking at the restic documentation it seems that they also use an object store so even old objects might still be used in recent backups. Is there another way to achieve cleanup with restic that doesn't require a server with access to the backup keys?
Indexes etc. would have to be updated I'm sure, so I don't think there is any tricky way to do this. I did read somewhere that the repo format is 'read-only' to ensure consistency (i.e., files only ever get added to the repo on disk). I can't find the reference to that right now though, sorry.
Also, how badly do outside changes impact the performance? Let's say we have the keys on the admin machines (which we need for restores anyway) and perform the cleanup there. How long would it take to run, how much data would it need to transfer (few megabytes, few hundred megabytes, gigabytes, ...?) and do the clients then need to regenerate their caches or can they run at full performance just like before?
I'll do some testing to get some idea of the answers for this.

[0] https://github.com/restic/restic/issues/21
[1] http://www.rsync.net/resources/howto/snapshots.html
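The shape of that test will probably be something like this (repo and retention values are placeholders):

    # from an admin machine that holds a key: time a cleanup run and watch the
    # network counters to see how much data it transfers
    time restic -r sftp:backup@backupserver:/srv/restic/host1 forget \
        --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune

    # then time a normal client backup to see whether its local cache
    # (~/.cache/restic by default) has to be rebuilt afterwards
    time restic -r sftp:backup@backupserver:/srv/restic/host1 backup /etc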
On 06.03.2018 00:38, Phillip Smith via arch-devops wrote:
restic is yet to implement compression [0]. The deduplication seems quite functional, especially when you can share a restic repository between multiple clients (if desired), so common data across clients is deduped in the repo.
I don't like the idea of deduplicating across clients because that potentially allows an attacker to gain more than just the information stored on the attacked machine. I know of at least one case where a company had serious problems because their backup server was hacked. They didn't have client-side encrypted backups, so the attacker could access keys that allowed them to connect to other machines and so on.

Compression would be nice, but maybe we can survive without it. The biggest chunk of data is our packages, and those are compressed anyway.
Have we committed to the idea of paying for Amazon or similar service for this project?
Sure, infra costs money, and if we pay for it ourselves we can actually count on getting good support and/or being independent. So paying something is certainly fine, but the final amount will have to be discussed once we have satisfactory solutions.
Perhaps we could consider having the 2nd backup on a snapshotting file-system (ie, ZFS) with something like rsync.net [1]. Then it would just be a dumb rsync from primary backup to secondary backup and the 2nd host retains snapshots to protect against malicious 'cleanups' from the 1st backup host.
That has been suggested elsewhere in the original thread already. I also have a note mentioning a combination of such a snapshotting solution on one of our own servers and less frequently created tarball backups on glacier with deletion restrictions enabled for disaster recovery.
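The deletion restriction part could be a plain bucket policy along these lines (bucket name invented, untested, and the exact policy would need review; the account root can still remove the policy itself if we ever need to clean up):

    # deny object deletion for all principals so a compromised backup host
    # can upload new tarballs but cannot destroy existing ones
    aws s3api put-bucket-policy --bucket archlinux-dr-tarballs --policy '{
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyDelete",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": "arn:aws:s3:::archlinux-dr-tarballs/*"
        }]
    }'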
Also, how badly do outside changes impact the performance? Let's say we have the keys on the admin machines (which we need for restores anyway) and perform the cleanup there. How long would it take to run, how much data would it need to transfer (few megabytes, few hundred megabytes, gigabytes, ...?) and do the clients then need to regenerate their caches or can they run at full performance just like before?
I'll do some testing to get some idea of the answers for this.
Looking forward to the results!

Florian
On March 6, 2018 13:35, Florian Pritz via arch-devops wrote:
That has been suggested elsewhere in the original thread already. I also have a note mentioning a combination of such a snapshotting solution on one of our own servers and less frequently created tarball backups on glacier with deletion restrictions enabled for disaster recovery.
Hi All,

Due to recent discussions on IRC, I'm revisiting this thread. I played around with the AWS Calculator [0] to determine how much we would pay to have 2-3 full monthly backups (around 200GB) on S3 plus 2-3 backups transitioned to Glacier.

It seems we have enough monthly bandwidth on Hetzner to accommodate one full monthly backup upload plus the daily increments to AWS S3, with room to spare. Since Amazon doesn't charge for uploads, we are looking at spending around 20 dollars per month for this. Since we're going to use Glacier and we can delete backups after some period, this would be mostly a fixed cost. If we need to restore from there, the charges can increase, but it's not going to be much more than that.

Regards,
Giancarlo Razzolini

[0] https://calculator.s3.amazonaws.com/index.html
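As a rough cross-check of that figure, assuming roughly 0.023 USD/GB-month for S3 Standard and 0.004 USD/GB-month for Glacier (approximate 2018 us-east-1 list prices; request and retrieval fees ignored, so please re-check against the calculator):

    # ~3 x 200 GB sitting on S3 Standard plus ~3 x 200 GB already moved to Glacier
    awk 'BEGIN {
        s3      = 3 * 200 * 0.023;  # S3 Standard
        glacier = 3 * 200 * 0.004;  # Glacier
        printf "S3: %.2f  Glacier: %.2f  total: ~%.2f USD/month\n", s3, glacier, s3 + glacier
    }'
    # -> 13.80 + 2.40 = ~16 USD/month, so the ~20 USD/month estimate leaves some headroom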
participants (3)
- Florian Pritz
- Giancarlo Razzolini
- Phillip Smith