[arch-devops] Cleaning up archive.archlinux.org
Hi,

While I like the idea of the archive, it currently does not implement any kind of cleanup and grows in size forever. Right now it is somewhere around 50 million files with a total size of 1.7TB. Those are very rough numbers, since calculating them exactly takes considerable time. This creates multiple problems:

- The disks only have ~700GB of free space left before they are full. We will reach that eventually.
- Backup creation takes between 4 and 5 hours depending on machine load. I am thinking about increasing the backup frequency where possible, but if one backup takes 4 hours that won't happen. I am also worried that if we ever need to restore, it could easily take well over 24 hours. The initial backup (which read all data rather than skipping unchanged data) took around 24 hours, and that was with fewer files.

Essentially this means we need to put some kind of automatic deletion of old data in place and reduce the size of the archive. What would be a good time frame here? Have we defined a clear goal for the archive that we can use as a guideline? My gut says 6 months should be enough, but feel free to disagree.

Once we have decided how long data should be kept, we also need something to actually delete it. Deleting files from the ./repos directory should be simple, but the ./packages directory is more complicated because it is not nicely separated into directories per day.

Sebastien, do you have an idea about how we could delete old data or do you have a script for that?

Florian
On July 5, 2017 15:57, Florian via arch-devops wrote:
My gut says 6 months should be enough, but feel free to disagree.
Agree with 6 months. More than enough time for issues to appear and people to need the archive.
Once we have decided how long data should be kept, we also need something to actually delete it. Deleting files from the ./repos directory should be simple, but the ./packages directory is more complicated because it is not nicely separated into directories per day.
Sebastien, do you have an idea about how we could delete old data or do you have a script for that?
Well, find can do that. At least, I have always used atime/mtime with find to delete older backups. I don't know if there is a better tool for this.

Cheers,
Giancarlo Razzolini
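As a concrete illustration of the find approach for the date-separated ./repos tree, a minimal sketch; the /srv/archive root and the 180-day cutoff are assumptions, not paths or policy agreed on in this thread:

    # delete files in ./repos last modified more than ~6 months (180 days) ago
    find /srv/archive/repos -type f -mtime +180 -delete
    # -delete traverses depth-first, so a second pass only needs to remove
    # directories left empty by the first
    find /srv/archive/repos -mindepth 1 -type d -empty -delete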
On July 6, 2017 12:22:11 AM GMT+02:00, Giancarlo Razzolini <grazzolini@archlinux.org> wrote:
On July 5, 2017 15:57, Florian via arch-devops wrote:
My gut says 6 months should be enough, but feel free to disagree.
Agree with 6 months. More than enough time for issues to appear and people to need the archive.
Well, it will be a bit more complicated than that. Once we finally get the reproducible builds patches live, we will need the archive for the reproducer script to build such a package again. If we simply clean by date, we will lose the ability to rebuild certain packages that were never rebuilt themselves but were built against a specific version of a dependency, with that version's specific build-time behavior. I'm not sure how to tackle this without losing reproducibility, but we should certainly think about this scenario before doing time-based cleanups.

Cheers,
Levente
On July 5, 2017 19:26, Levente Polyak wrote:
Well, it will be a bit more complicated than that. Once we finally get the reproducible builds patches live, we will need the archive for the reproducer script to build such a package again. If we simply clean by date, we will lose the ability to rebuild certain packages that were never rebuilt themselves but were built against a specific version of a dependency, with that version's specific build-time behavior.
I'm not sure how to tackle this without losing reproducibility, but we should certainly think about this scenario before doing time-based cleanups.
What about time-based cleanup plus keeping the latest n versions of each package? Do you think that would be enough to satisfy the reproducibility needs?

Cheers,
Giancarlo Razzolini
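As a sketch of the time-based + latest-n idea, assuming the archive keeps one directory per package name and GNU coreutils are available (the path and n=3 are hypothetical):

    # inside one package directory: sort versions oldest-first, then drop the
    # three newest from the list and remove everything older; package file
    # names contain no whitespace, so the ls pipeline is safe here
    ls /srv/archive/packages/v/vim/vim-*.pkg.tar.xz | sort -V | head -n -3 | xargs -r rm -f

In practice this would be combined with the age cutoff, so a file is only removed when it is both older than the retention window and not among the n newest versions.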
On July 6, 2017 12:31:07 AM GMT+02:00, Giancarlo Razzolini <grazzolini@archlinux.org> wrote:
On July 5, 2017 19:26, Levente Polyak wrote:
Well, it will be a bit more complicated than that. Once we finally get the reproducible builds patches live, we will need the archive for the reproducer script to build such a package again. If we simply clean by date, we will lose the ability to rebuild certain packages that were never rebuilt themselves but were built against a specific version of a dependency, with that version's specific build-time behavior.
I'm not sure how to tackle this without losing reproducibility, but we should certainly think about this scenario before doing time-based cleanups.
What about time-based cleanup plus keeping the latest n versions of each package? Do you think that would be enough to satisfy the reproducibility needs?
Not really; there are tons of packages that don't get rebuilt or bumped for a long time and that were built against certain versions of other packages. Those would never be rebuildable again. If we want to safely keep reproducibility of, for example, at least the most current version of every package, we will need to parse the used dependencies from the internal metadata file and keep each of those packages until the last package that used it has vanished from the archive. Right now I don't see any way around this if we want both: cleanup, and at least the ability to verify that all current versions of our packages are in fact reproducible.

Cheers,
Levente
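For the record, a sketch of what parsing the used dependencies from the internal metadata could look like, assuming the reproducible-builds patches embed a .BUILDINFO file with "installed =" entries (the file name, its format, and the example package are assumptions, not something confirmed in this thread):

    # list the exact dependency versions this package was built against;
    # the union of these lists over all current packages is the set of
    # archived files a time-based cleanup must never delete
    bsdtar -xOf linux-4.11.9-1-x86_64.pkg.tar.xz .BUILDINFO | sed -n 's/^installed = //p'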
Can't we use btrfs block-level dedup for this as well, so the disk space issue is not as pressing?
On 2017-07-06 08:45, Sven-Hendrik Haase wrote:
Can't we use btrfs block-level dedup for this as well, so the disk space issue is not as pressing?
Last time I checked, btrfs had no automatic deduplication, so a separate tool would be needed. And I suspect it wouldn't be very beneficial anyway, since we store packages compressed.

Bartłomiej
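For context, offline deduplication on btrfs would mean periodically running a separate tool such as duperemove; a sketch, with the archive path assumed:

    # hash file extents under the tree and submit duplicates to the kernel
    # for block-level dedup; the hashfile caches checksums between runs
    duperemove -dr --hashfile=/var/tmp/archive.hash /srv/archive

Since the compressed .pkg.tar.xz payloads rarely share identical blocks even when their contents overlap, the expected savings are small, which supports the point above.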
On 2017-07-06 00:36, Levente Polyak wrote:
Right now I don't see any way around this if we want both: cleanup, and at least the ability to verify that all current versions of our packages are in fact reproducible.
Given that our packages are not reproducible at the moment, I think it makes sense to store only the last 6-12 months of packages and to exclude the archive from backups. I don't see it as a critical loss if our disk explodes, and it eats too much space on our backup box for no gain.

Bartłomiej
On 06.07.2017 09:10, Bartłomiej Piotrowski wrote:
Given that our packages are not reproducible at the moment, I think it makes sense to store only the last 6-12 months of packages and to exclude the archive from backups. I don't see it as a critical loss if our disk explodes, and it eats too much space on our backup box for no gain.
I've now excluded the archive from the backups. Once we actually use it for something like reproducible builds, or once we have automatic cleanups so that the archive size doesn't blow up backup times, I'll put it back in.

Florian