Re: [arch-mirrors] [sysadmin] CDN based/caching mirror?
On 30.01.2020 17.04, Konstantin Ryabitsev wrote:
On Sun, 26 Jan 2020 at 11:19, Kristian Klausen via arch-mirrors <arch-mirrors@archlinux.org> wrote:
I'm considering setting up an Arch Linux mirror and I'm considering a different design.
So instead of mirroring the whole thing, the idea is to mirror only the database files (core.db etc) and download the packages on demand from a Tier 1 mirror (and let nginx cache them). By doing it that way, I only download requested packages from the Tier 1 mirrors, instead of downloading the whole thing (saving Tier 1 bandwidth).
To provide even better performance a CDN (e.g. Cloudflare) could be used to provide more caching. So we end up with a setup like this: Cloudflare -> Nginx cache -> Tier1 mirrors (nginx with multiple upstreams)
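The lookup order of that setup can be sketched as a toy model. This is only an illustration of the tiering idea, not nginx or Cloudflare behaviour in detail: the class names, the `build_chain` helper, and the package path used in the usage note are all hypothetical, and cache expiry is ignored entirely.

```python
from typing import Callable, Dict, Tuple


class Origin:
    """Stand-in for a Tier 1 mirror that holds every file (hypothetical)."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = files
        self.requests = 0  # how often the origin was actually contacted

    def get(self, path: str) -> bytes:
        self.requests += 1
        return self.files[path]


class CacheTier:
    """One caching layer that asks the next layer only on a miss."""

    def __init__(self, name: str, fetch_from_next: Callable[[str], bytes]) -> None:
        self.name = name
        self.fetch_from_next = fetch_from_next
        self.store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        if path in self.store:
            self.hits += 1
            return self.store[path]
        # Miss: fetch from the next layer and keep a copy for later requests.
        self.misses += 1
        data = self.fetch_from_next(path)
        self.store[path] = data
        return data


def build_chain(files: Dict[str, bytes]) -> Tuple[CacheTier, CacheTier, Origin]:
    """Wire up CDN -> nginx cache -> Tier 1, mirroring the order above."""
    tier1 = Origin(files)
    nginx = CacheTier("nginx", tier1.get)
    cdn = CacheTier("cdn", nginx.get)
    return cdn, nginx, tier1
```

With `cdn, nginx, tier1 = build_chain({"core/example.pkg": b"data"})`, requesting the same path from `cdn` twice contacts `tier1` once; the second request is answered by the outermost tier.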
Am I missing something? Is this a bad idea?
If you are trying to save Tier1 some bandwidth, you'll probably actually end up causing them more problems due to increased random seek waits. Tier1 mirrors may not necessarily have fast storage -- for example, all kernel.org mirror nodes have terabytes of spinning rust and about half-a-TB of ssd used via lvm-cache. It works great for Tier1 setups because most Tier2 mirrors want the same set of recent updates that are served out of ssd cache. If a new mirror comes along and wants to slurp an entire distro, that is fine too, because even if there's higher iowait latency, the Tier2 mirror isn't working against any HTTP timeouts or impatient clients and doesn't care if the data arrives at a slower rate due to higher iowait. Tier1 can also tell Tier2 mirror "I'm overloaded right now, please try again later" and it'll be fine as most Tier2 mirrors can wait an hour or two before receiving updates.
Making Tier1 mirrors a "cold cache" for your setup will likely cause more disk thrash for them, but will also result in poorer service for people using your mirror due to the reasons I listed above.
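The cache-thrash concern can be illustrated with a toy LRU model. It is a sketch under strong assumptions: lvm-cache is not a plain LRU, the capacity is counted in files rather than bytes, and the function names and request streams are made up for illustration, so the ratios say nothing about real mirror traffic.

```python
from collections import OrderedDict
from typing import Iterable, List


def lru_hit_ratio(requests: Iterable[str], capacity: int) -> float:
    """Fraction of requests served from an LRU cache holding `capacity` files."""
    cache: "OrderedDict[str, bool]" = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used file
    return hits / total if total else 0.0


def sync_only_stream(hot_files: List[str], mirrors: int) -> List[str]:
    """Every Tier 2 mirror pulls the same set of recent updates."""
    return hot_files * mirrors


def mixed_stream(hot_files: List[str], mirrors: int) -> List[str]:
    """Same as above, but each hot request is followed by a unique cold one,
    standing in for on-demand cache misses for rarely requested packages."""
    stream: List[str] = []
    cold_id = 0
    for _ in range(mirrors):
        for name in hot_files:
            stream.append(name)
            stream.append(f"cold-{cold_id}")
            cold_id += 1
    return stream
```

In this toy model, four hot files and a four-file cache give a high hit ratio for `sync_only_stream`, while the interleaved cold requests in `mixed_stream` keep evicting the hot files before they are requested again.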
Tier 1 mirrors are also used directly by end-users (correct me if I'm wrong). So worst-case (cache miss) my SSD-backed shared cache won't be noticeably slower than pulling directly from the Tier 1 mirror. Best-case (cache hit) I'm saving the Tier 1 mirror some bandwidth and disk usage. My idea is basically tiered caching (CDN -> Nginx SSD-backed shared cache -> Tier 1 mirror(s)), is that worse than status quo? :)
If someone tries to install a package and watches their download bar sit at 0 for half a minute due to backend proxies fetching data from Tier1 origin, that's going to result in frustrated people.
Nginx streams the data as it is received from the upstream server, so worst-case (cache miss) the data can be delivered as fast as received from the upstream server.
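The pass-through behaviour described here can be modelled with a generator. This is only a sketch of the idea, not how nginx is implemented; `stream_and_cache`, its parameters, and the in-memory `cache` dict are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def stream_and_cache(
    path: str,
    upstream_chunks: Iterable[bytes],
    cache: Dict[str, bytes],
) -> Iterator[bytes]:
    """Forward each chunk to the client as soon as it arrives from upstream,
    and store the complete response in the cache once the transfer finishes."""
    received: List[bytes] = []
    for chunk in upstream_chunks:
        received.append(chunk)
        yield chunk  # the client sees this chunk before the next one is read
    # Only a fully received response becomes a cache entry.
    cache[path] = b"".join(received)
```

The client starts receiving data after the first upstream chunk, so a cache miss adds the proxy hop but does not wait for the whole file; the cache entry appears only after the last chunk has been forwarded.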
TL;DR: If you can afford CDN-fronting your mirror, that should be mostly fine, but I would recommend against using Tier1 as your cache-miss backend. Storage is cheap and most Tier1 mirrors have unlimited bandwidth, so just run a Tier2 mirror (with slow/fast storage caching) and keep local copies of everything.
-K (mirrors.kernel.org administrator)