[arch-dev-public] Archive cleanup (2013, 2014, 2015)
Hi, So far we haven't ever cleaned up the archive (https://archive.archlinux.org/), but it's getting too big and I don't think we have any use for packages from years ago. Next week I will remove the years 2013, 2014 and 2015. Florian
Hi, On 30-05-18, Florian Pritz via arch-dev-public wrote:
So far we haven't ever cleaned up the archive (https://archive.archlinux.org/), but it's getting too big and I don't think we have any use for packages from years ago.
You never know what useful things people could do with this historical data. What about sending previous years to the Internet Archive before deleting them locally? They already host quite a lot of software: https://archive.org/details/software Baptiste
On 30.05.2018 17:49, Baptiste Jonglez wrote:
What about sending previous years to the Internet Archive before deleting them locally?
I myself am not interested enough in keeping the old packages to figure out what rules the Internet Archive has and how I could upload all this data to them. If you are interested in this, please check with them if they are ok with adding this much data and if so, feel free to go to /srv/archive/ on orion and upload the files. If you do that I'll wait with deleting things until you are done. If you need any tools installed on the server to do this, tell me. Florian
On 30-05-18, Florian Pritz wrote:
On 30.05.2018 17:49, Baptiste Jonglez wrote:
What about sending previous years to the Internet Archive before deleting them locally?
I myself am not interested enough in keeping the old packages to figure out what rules the Internet Archive has and how I could upload all this data to them. If you are interested in this, please check with them if they are ok with adding this much data and if so, feel free to go to /srv/archive/ on orion and upload the files. If you do that I'll wait with deleting things until you are done. If you need any tools installed on the server to do this, tell me.
Ok, thanks, I will give it a shot in the next few days then!
Their software [1] is written in Python, so can you please install these packages:
python-pip python-args python-clint python-docopt python-jsonpointer python-jsonpatch python-yaml
It should be enough to allow me to install the client locally.
Baptiste
[1] https://github.com/jjjake/internetarchive
On 30.05.2018 18:25, Baptiste Jonglez wrote:
Their software [1] is written in python, so can you please install these packages:
python-pip python-args python-clint python-docopt python-jsonpointer python-jsonpatch python-yaml
Done. Please tell me once it's done so I can remove them again. Florian
On Wed, 30 May 2018 at 17:18:07, Florian Pritz via arch-dev-public wrote:
So far we haven't ever cleaned up the archive (https://archive.archlinux.org/), but it's getting too big and I don't think we have any use for packages from years ago.
Next week I will remove the years 2013, 2014 and 2015.
Do you plan on keeping packages if they are the ones currently in the repos?
On 30.05.2018 20:20, Lukas Fleischer via arch-dev-public wrote:
Do you plan on keeping packages if they are the ones currently in the repos?
AFAIK they are saved in such a way that for each day there is a directory that contains all the packages. If a package exists in multiple directories it is hardlinked so deleting one directory will not mess with the others. Florian
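The hard-link behaviour described above can be checked with a small, self-contained sketch. It does not touch the real archive: the directory layout and the package filename below are made up for illustration, and everything happens in a temporary directory.

```python
import os
import shutil
import tempfile


def demo_hardlink_survives_delete():
    """Hard-link one file into two 'daily' directories, delete the first
    directory, and return (link count before, link count after, content)."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical per-day directories, loosely modelled on year/month/day.
        day1 = os.path.join(root, "2013", "09", "01")
        day2 = os.path.join(root, "2013", "09", "02")
        os.makedirs(day1)
        os.makedirs(day2)

        # Hypothetical package file name, present on both days.
        pkg1 = os.path.join(day1, "example-1.0-1-any.pkg.tar.xz")
        pkg2 = os.path.join(day2, "example-1.0-1-any.pkg.tar.xz")
        with open(pkg1, "wb") as f:
            f.write(b"package payload")

        # Second directory entry pointing at the same inode.
        os.link(pkg1, pkg2)
        links_before = os.stat(pkg2).st_nlink

        # Removing the first day only drops one directory entry.
        shutil.rmtree(day1)
        links_after = os.stat(pkg2).st_nlink
        with open(pkg2, "rb") as f:
            data = f.read()

        return links_before, links_after, data


result = demo_hardlink_survives_delete()
```

The data stays readable through the remaining link; the file content is only freed once the last link to it is removed.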
Hi,

Here is the progress on the upload of old packages to archive.org. I uploaded a few packages to test if my script works:

https://archive.org/details/archlinux_pkg_babeld
https://archive.org/details/archlinux_pkg_kde-l10n-ca_valencia
https://archive.org/details/archlinux_pkg_lucene__

There is one identifier for each package, and then all versions of the package + their signatures are contained under this identifier (see "Show all" on the right).

Now, to finish this, I have a few questions:

- Does the devops team have a place to store passwords? I would like to create an "Arch Linux" account, so that I'm not the only one to have access. I also need an email address for the account, maybe something like internetarchive at archlinux.org or just the devops mailing list address?
- Is it OK to upload ~2 TB from orion? Is the server on a limited data plan?
- I'm still waiting on a final confirmation from archive.org on whether they are OK with this amount of data.

The upload process itself is quite slow latency-wise: it takes 5-10 seconds to upload a file, whatever its size. For packages from 2013 to 2015 there are 250k files to upload; I estimate it will take a few days if I run 32 upload threads in parallel.

By the way, we could even keep the year/month/day symlink hierarchy on orion for old packages, and redirect downloads to archive.org. There is just a small issue with packages that have "+", "@" or "." in their name, because those characters are not allowed in identifiers on archive.org (see the second and third examples above, where my script replaced the "@" and "+" with "_").

Baptiste
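The identifier scheme visible in the three example URLs can be sketched as follows. This is not the actual upload script, which is not shown in this thread; the function name is hypothetical, and the original package names `kde-l10n-ca@valencia` and `lucene++` are inferred from the sanitized identifiers.

```python
import re

# Prefix seen in the three example URLs above.
PREFIX = "archlinux_pkg_"


def archive_org_identifier(pkgname: str) -> str:
    """Map a package name to an identifier, replacing each of the
    characters '+', '@' and '.' with an underscore (hypothetical helper)."""
    return PREFIX + re.sub(r"[+@.]", "_", pkgname)


examples = {
    name: archive_org_identifier(name)
    for name in ("babeld", "kde-l10n-ca@valencia", "lucene++")
}
```

Note that such a replacement is lossy: two names that differ only in one of the replaced characters would map to the same identifier, so a real script would presumably need to check for collisions.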
On 01.06.2018 19:24, Baptiste Jonglez wrote:
- does the devops team have a place to store passwords? I would like to create an "Arch Linux" account, so that I'm not the only one to have access.
You can send me the password and I'll put it in our ansible repo in the vault.
I also need an email address for the account, maybe something like internetarchive at archlinux.org or just the devops mailing list address?
We don't want to spam the ML with password reset mails or whatever else we get, so let's use a private address. I've set up internetarchive@archlinux.org to forward to root@archlinux.org. Please use that and if you want, I can also add you to the alias.
- is that OK to upload ~2 TB from orion? Is the server on a limited data plan?
That's fine.
The upload process itself is quite slow latency-wise, it takes 5-10 seconds to upload a file whatever its size. For packages from 2013 to 2015 there's 250k files to upload, I estimate it will take a few days if I run 32 upload threads in parallel.
Fine by me. I'll wait for that to complete before deleting anything.
By the way, we could even keep the year/month/day symlink hierarchy on orion for old packages, and redirect downloads to archive.org. There is just a small issue with packages that have "+", "@" or "." in their name, because that's not allowed as identifiers in archive.org (see the second and third examples above, where my script replaced the "@" with "_")
I'd much prefer getting rid of the inodes because having this many inodes around makes file system operations slow. It's also the reason why the archive is not part of the backup. Thanks for working on this! Florian
Hi,

Archival of all packages between September 2013 and December 2016 is finished:

https://archive.org/details/archlinuxarchive

Here is some documentation on this "Historical Archive" hosted on archive.org:

https://wiki.archlinux.org/index.php/Arch_Linux_Archive#Historical_Archive

Happily, we were able to provide download redirection from orion to this Historical Archive. This means that e.g. https://archive.archlinux.org/repos/2013/09/01/ should continue to work fine, even though the actual packages are served from archive.org.

To provide this, we only need to keep an archive of the database files on orion. This represents around 50 files per day of archive, while the packages themselves represent about 25k hardlinks and symlinks per day of archive. So that's a real saving of inodes!

Please test things out and let me know if you encounter any issue! (other than "downloading from archive.org is slow", because it indeed is)

If everything works well, Florian will delete the old packages on orion in a few days.

Thanks,
Baptiste
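As a rough back-of-the-envelope check of the figures above, the per-day numbers can be multiplied out over the archived period. The exact date range is an assumption here: the sketch takes "between September 2013 and December 2016" to mean every day from 2013-09-01 through 2016-12-31, and treats the per-day counts as constant, which the message only gives as approximations.

```python
from datetime import date

# Approximate per-day figures quoted in the message above.
FILES_KEPT_PER_DAY = 50          # database files that stay on orion
LINKS_REMOVED_PER_DAY = 25_000   # package hardlinks and symlinks


def days_inclusive(first: date, last: date) -> int:
    """Number of calendar days from first to last, both included."""
    return (last - first).days + 1


# Assumed boundaries of the archived period (not stated exactly in the thread).
days = days_inclusive(date(2013, 9, 1), date(2016, 12, 31))

kept = days * FILES_KEPT_PER_DAY
removed = days * LINKS_REMOVED_PER_DAY

print(f"{days} days: about {kept:,} entries kept, {removed:,} removed")
```

Under these assumptions, the entries left on orion are about 1/500 of what the package hierarchy needed for the same period.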
On 10.06.2018 01:35, Baptiste Jonglez wrote:
Archival of all packages between September 2013 and December 2016 is finished:
https://archive.org/details/archlinuxarchive
Here is some documentation on this "Historical Archive" hosted on archive.org:
https://wiki.archlinux.org/index.php/Arch_Linux_Archive#Historical_Archive
Thanks for providing that!
If everything works well, Florian will delete the old packages on orion in a few days.
I've now deleted all the old data using Levente's script. Florian
participants (3)
- Baptiste Jonglez
- Florian Pritz
- Lukas Fleischer