[arch-devops] Mirrorbrain
It's been discussed in the past, but I thought it worth having a semi-formal and documented discussion about it. Do we think setting up mirrorbrain[0] would be a worthwhile service?
For those not familiar, the elevator pitch is that it's an open-source download redirector to coordinate a simple CDN. It is used by a number of open-source projects such as VLC, GNOME, KDE, LibreOffice and openSUSE.
Architecturally speaking, it maintains a list of our mirrors and a local copy of our repos, and monitors those mirrors for availability and staleness. HTTP(S) clients (e.g. pacman) are then redirected to an appropriate mirror based on the client's geographic location and mirror health.
Benefits include:
1. Ensuring users always receive up-to-date packages (mirrorbrain won't redirect to a mirror if that mirror's version of that package is outdated compared to the authoritative repository).
2. Automated monitoring of our mirror network to proactively detect stale or broken mirrors. Mirrorlist files (e.g. /etc/pacman.d/mirrorlist) can be automatically generated based on MirrorBrain's data.
3. Reduced load on core mirrors, and load-balancing (for example, a 1Gbit mirror can be weighted to receive 10x the traffic of a 100Mbit mirror).
4. Automatic MetaLink and Torrent file generation, with web-seeds (currently handled by hefur on luna?).
Exceptions to redirection can be applied, for example to ensure security-sensitive files (checksum files perhaps) are always served directly from the authoritative repo.
Requirements are pretty basic: Apache with mod_asn, PostgreSQL, and some Python and Perl modules. It can be run behind a reverse proxy if we wanted to hide Apache behind nginx.
It's not a particularly fast-moving project (last release 2014) but that's a reflection of its stability IMHO.
What do others think? I'm happy to take on the implementation project.
Cheers,
~p
[0] http://mirrorbrain.org/
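The selection behaviour described in the message above (skip offline or stale mirrors, prefer mirrors near the client, weight by capacity) can be sketched roughly as follows. This is an illustration only, not MirrorBrain's actual algorithm: the `Mirror` fields, the `pick_mirror` function, the example URLs, and the use of a plain country-code match in place of a real geographic lookup are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Mirror:
    url: str                # base URL of the mirror (hypothetical)
    country: str            # country code the mirror is located in
    weight: int             # relative capacity, e.g. 10 for 1Gbit vs 1 for 100Mbit
    online: bool            # result of the last availability probe
    has_current_file: bool  # True if the mirror's copy matches the authoritative repo


def pick_mirror(mirrors, client_country, rng=random):
    """Return a healthy, up-to-date mirror, preferring the client's country."""
    # Never hand out a mirror that is down or serves an outdated file.
    usable = [m for m in mirrors if m.online and m.has_current_file]
    if not usable:
        return None  # caller would serve the file from the authoritative repo
    # Prefer mirrors in the client's country, else fall back to any usable one.
    nearby = [m for m in usable if m.country == client_country]
    candidates = nearby or usable
    # Weighted random choice: a weight-10 mirror is picked ~10x as often as weight-1.
    return rng.choices(candidates, weights=[m.weight for m in candidates], k=1)[0]
```

Returning `None` when no mirror qualifies leaves the fallback policy to the caller, which matches the idea that the authoritative repo remains the last resort.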
On Wed, Sep 12, 2018 at 03:17:39PM +1000, Phillip Smith via arch-devops <arch-devops@lists.archlinux.org> wrote:
1. Ensuring users always receive up-to-date packages (mirrorbrain won't redirect to a mirror if that mirror's version of that package is outdated compared to the authoritative repository).
I can see the appeal since we don't have that now.
2. Automated monitoring of our mirror network to proactively detect stale or broken mirrors. Mirrorlist files (e.g. /etc/pacman.d/mirrorlist) can be automatically generated based on MirrorBrain's data.
While nice, we already have mirror list generation from monitored data directly in archweb if you check the "Use mirror status" option here: https://www.archlinux.org/mirrorlist/
3. Reduced load on core mirrors, and load-balancing (for example, a 1Gbit mirror can be weighted to receive 10x the traffic of a 100Mbit mirror).
We already have `rankmirrors` which selects mirrors based on their performance from a client's point of view (I think download speed and response time). Though, on its own, it won't detect new mirrors if we add them and the users don't update their lists. I don't really know if new mirrors are actually getting any use or not right now, so I can't tell if that is an issue or not.
4. Automatic MetaLink and Torrent file generation, with web-seeds (currently handled by hefur on luna?).
I think that's done manually by Pierre when he creates a new ISO. I'm not 100% sure though.
I don't know if we really want or need mirrorbrain. It's a great fit for projects with few big files which are downloaded by many people upon a new release. If we were to use mirrorbrain, the pacman mirror list would only contain the redirector URL, correct? That would then mean that pacman asks the redirector for each individual package which I think generates way too much unnecessary load compared to our current system. Also, right now people can update without having to connect to a central server and I don't like the idea of adding a SPOF for something that already works rather well.
Mirrorbrain sounds fine for our ISOs, but I don't know if those get enough downloads (via HTTP) to warrant setting it up.
Florian
On 12.09.18 - 09:50, Florian Pritz via arch-devops wrote:
I don't know if we really want or need mirrorbrain. It's a great fit for projects with few big files which are downloaded by many people upon a new release. If we were to use mirrorbrain, the pacman mirror list would only contain the redirector URL, correct? That would then mean that pacman asks the redirector for each individual package which I think generates way too much unnecessary load compared to our current system. Also, right now people can update without having to connect to a central server and I don't like the idea of adding a SPOF for something that already works rather well.
Mirrorbrain sounds fine for our ISOs, but I don't know if those get enough downloads (via HTTP) to warrant setting it up.
Florian
I suppose you could have multiple mirrorbrain hosts with synced config/database (pgsql streaming replication works quite well) and then there could be multiple entries in the client's mirrorlist again. Otherwise it could still be the same FQDN but as a DNS RR with multiple records and a somewhat low TTL of 300s-ish?
The project sounds quite nice and well suited for the job and should be able to centralize/unify our current situation, but I think we have some other things with a higher priority than changing/optimizing something that is working reasonably well currently.
Cheers,
Thore
--
Thore Bödecker
GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226 A864 D622 431A F8DB 80F3
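The idea of listing several redirector hosts in the client's mirrorlist relies on the client falling through to the next entry when one is unreachable. A minimal sketch of that fall-through, assuming a hypothetical `first_reachable` helper and a caller-supplied `fetch` function (neither is part of pacman or MirrorBrain):

```python
def first_reachable(servers, fetch):
    """Try each server in list order; return (server, result) for the first success."""
    errors = []
    for server in servers:
        try:
            # fetch is expected to raise OSError (or a subclass) on network failure.
            return server, fetch(server)
        except OSError as exc:
            errors.append((server, exc))
    # Every entry failed, so there is nothing left to fall back to.
    raise ConnectionError(f"all {len(servers)} servers failed: {errors}")
```

With two or more redirector entries, losing a single host only costs the client one failed attempt, which is the point of avoiding a single point of failure.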
On Wed, 12 Sep 2018 at 17:50, Florian Pritz via arch-devops <arch-devops@lists.archlinux.org> wrote:
If we were to use mirrorbrain, the pacman mirror list would only contain the redirector URL, correct?
I wouldn't expect we would replace the content of mirrorlist with the single URL to mirrorbrain; just add it as the top option under "Worldwide", perhaps with a short description of its purpose. At the very most it could be uncommented as the default option, but I think leaving everything commented so the user can select would still be best.
That would then mean that pacman asks the redirector for each individual package which I think generates way too much unnecessary load compared to our current system.
IME, the load isn't particularly bad since its primary work is just handing out 301 redirects based on read-only DB queries. I obviously haven't benchmarked, but if I had to estimate then I would say the BBS produces far more load considering its CRUD DB operations and page generation functions. The DB-writing requirements are non-interactive and can be run as low-priority to minimize impact.
According to [0], "handling all downloads of OpenOffice.org, a busy site, could be called *boring*. A simple 512MB box can do this. Handling all openSUSE.org traffic (20-40 millions hits per day) is still relaxed (commodity hardware, server load: zero point something)."
[0] http://mirrorbrain.org/
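To illustrate why the per-request work is small, here is a rough sketch of a redirect decision driven by a read-only lookup, including the exception for security-sensitive files mentioned at the start of the thread. The function name `build_response`, the suffix list, and the example URLs are hypothetical; the status code is a parameter and defaults to the 301 mentioned in the message above.

```python
# Hypothetical list of file types that are never redirected to a mirror.
NO_REDIRECT_SUFFIXES = (".sig", ".sha256")


def build_response(path, mirror_for_path, status=301):
    """Return (status, headers) for a requested path.

    mirror_for_path is a read-only lookup: path -> mirror base URL, or None.
    A (200, {}) result means "serve the file from the authoritative repo".
    """
    # Security-sensitive files always come from the authoritative repo.
    if path.endswith(NO_REDIRECT_SUFFIXES):
        return 200, {}
    base = mirror_for_path(path)
    if base is None:
        return 200, {}
    # Join the mirror base and the path with exactly one slash between them.
    return status, {"Location": base.rstrip("/") + "/" + path.lstrip("/")}
```

The only per-request cost here is one lookup and a string join; all mirror scanning and database writes would happen out of band.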
On Wed, 12 Sep 2018 at 15:17, Phillip Smith <fukawi2@gmail.com> wrote:
What do others think? I'm happy to take on the implementation project.
Given the silence, I'll assume no interest for now. Unless anyone objects, I'll set up an unofficial instance for my own sick pleasure :P Does anyone mind if I set it up as a Tier 2, with direct rsync from orion? Who manages the list of IPs that have access in archweb?
On 9/23/18 8:51 PM, Phillip Smith via arch-devops wrote:
On Wed, 12 Sep 2018 at 15:17, Phillip Smith <fukawi2@gmail.com> wrote:
What do others think? I'm happy to take on the implementation project.
Given the silence, I'll assume no interest for now. Unless anyone objects, I'll set up an unofficial instance for my own sick pleasure :P Does anyone mind if I set it up as a Tier 2, with direct rsync from orion? Who manages the list of IPs that have access in archweb?
https://wiki.archlinux.org/index.php/DeveloperWiki:NewMirrors
--
brent saner
https://square-r00t.net/
GPG info: https://square-r00t.net/gpg-info
participants (4)
- brent s.
- Florian Pritz
- Phillip Smith
- Thore Bödecker