[arch-mirrors] CDN based/caching mirror?
Hi

I'm considering setting up an Arch Linux mirror, and I'm considering a different design.

Instead of mirroring the whole thing, the idea is to mirror only the database files (core.db etc.) and download the packages on demand from a Tier 1 mirror (and let nginx cache them). That way, I only download requested packages from the Tier 1 mirrors instead of downloading everything (saving Tier 1 bandwidth).

To provide even better performance, a CDN (e.g. Cloudflare) could be used to provide more caching. So we end up with a setup like this:
Cloudflare -> Nginx cache -> Tier 1 mirrors (nginx with multiple upstreams)

Am I missing something? Is this a bad idea?
If I do set up a mirror like that, is there any chance it could be added as an official mirror?

Best regards
Kristian Klausen
Hi,
To provide even better performance a CDN (ex: Cloudflare) could be used to provide more caching. So we end up with a setup like this: Cloudflare -> Nginx cache -> Tier1 mirrors (nginx with multiple upstream)
From a "decentralized internet" point of view, I wouldn't like to have a mirror going through a "big" CDN like Cloudflare. Moreover, the existing mirrors are plentiful enough and fast enough.

Cheers,
Frank
From a "decentralized internet" point of view, I wouldn't like to have a mirror going through a "big" CDN like Cloudflare. Moreover, the existing mirrors are plentiful enough and fast enough.
On top of that, Cloudflare's HTTP firewall may block some apt-get/curl/wget users[1][2] which is not wanted for software repositories. Also, Cloudflare's Terms of Service[3] state they're explicitly meant to be used with HTML content.
2.8 Limitation on Serving Non-HTML Content
The Service is offered primarily as a platform to cache and serve web pages and websites. Unless explicitly included as a part of a Paid Service purchased by you, you agree to use the Service solely for the purpose of serving web pages as viewed through a web browser or other application and the Hypertext Markup Language (HTML) protocol or other equivalent technology. Use of the Service for serving video (unless purchased separately as a Paid Service) or a disproportionate percentage of pictures, audio files, or other non-HTML content, is prohibited.
Even "cheap" CDN providers are more expensive for high-volume traffic than dedicated servers with unmetered bandwidth, and many servers are hosted by universities with large[4] amounts of bandwidth.

[1]: https://github.com/oerdnj/deb.sury.org/issues/1170
[2]: https://github.com/oerdnj/deb.sury.org/issues/1299
[3]: https://www.cloudflare.com/terms/
[4]: https://ftp.acc.umu.se/about/index.html
On Sun, 26 Jan 2020 17:19:10 +0100 Kristian Klausen via arch-mirrors <arch-mirrors@archlinux.org> wrote:
So instead of mirroring the whole thing, the idea is to mirror only the database files (core.db etc) and download the packages on demand from a Tier 1 mirror (and let nginx cache them). By doing it that way, I only download requested packages from the Tier 1 mirrors, instead of downloading the whole thing (saving Tier 1 bandwidth).
I'm not quite sure what problem you're trying to solve - tier 1 servers have plenty of bandwidth, otherwise they shouldn't be running such a mirror, and I'd wager that downstream mirrors syncing occasionally pales in comparison to end user traffic, so I don't think you need to really worry about the upstream. If your concern is *your* bandwidth or disk space, then you probably shouldn't be setting up a public mirror at all - assuming, of course, that it is a public mirror you're talking about here, and not just an internal network cache to point your boxes at so that you only download each package once, not once for every machine.
To provide even better performance a CDN (ex: Cloudflare) could be used to provide more caching.
Others have already addressed that this may break Cloudflare's terms, as they're designed to optimise websites by hosting HTML/JS.
Do I miss something? Is this a bad idea?
Immediate thought is that the first request for each package could seem unacceptably slow, as your mirror would have to fetch it first before it could serve it to the client, and for larger packages, that could begin to make it feel slow (especially if also doing that for ISOs, etc).

It also means that if your upstream is temporarily down, you have an incomplete mirror which appears reachable but fails to serve some files, which is probably not ideal.

To me, it feels rather like you're trying to solve a problem which doesn't really exist.

Cheers

Dave P
On 26.01.2020 21.52, David Precious wrote:
On Sun, 26 Jan 2020 17:19:10 +0100 Kristian Klausen via arch-mirrors <arch-mirrors@archlinux.org> wrote:
So instead of mirroring the whole thing, the idea is to mirror only the database files (core.db etc) and download the packages on demand from a Tier 1 mirror (and let nginx cache them). By doing it that way, I only download requested packages from the Tier 1 mirrors, instead of downloading the whole thing (saving Tier 1 bandwidth).
I'm not quite sure what problem you're trying to solve - tier 1 servers have plenty of bandwidth, otherwise they shouldn't be running such a mirror, and I'd wager that downstream mirrors syncing occasionally pales in comparison to end user traffic, so I don't think you need to really worry about the upstream.
If your concern is *your* bandwidth or disk space, then you probably shouldn't be setting up a public mirror at all - assuming, of course, that it is a public mirror you're talking about here, and not just an internal network cache to point your boxes at so that you only download each package once, not once for every machine.
To provide even better performance a CDN (ex: Cloudflare) could be used to provide more caching. Others have already addressed that this may break Cloudflare's terms, as they're designed to optimise websites by hosting HTML/JS.
Valid point.
Do I miss something? Is this a bad idea? Immediate thought is that the first request for each package could seem unacceptably slow, as your mirror would have to fetch it first before it could serve it to the client, and for larger packages, that could begin to make it feel slow (especially if also doing that for ISOs, etc).
Valid point, that could in theory be fixed by downloading from multiple servers in parallel. It would require a more complex setup, but in theory it could be done.
It also means that if your upstream is temporarily down, you have an incomplete mirror which appears reachable but fails to serve some files, which is probably not ideal.
The idea was to fall back to another mirror on errors/404.
To me, it feels rather like you're trying to solve a problem which doesn't really exist.
Roger that, it was just a "crazy" idea to run a mirror without mirroring everything (requiring less storage) and with a CDN (like deb.debian.org), but as the Arch project seems to have more than enough mirrors, the idea doesn't make sense.

Thanks for your time everyone!
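The "fall back to another mirror on errors/404" idea from this reply can be sketched roughly as follows. This is only an illustration and not the poster's setup (which would live in nginx's upstream handling); the function name, the upstream URLs, and the injected `fetch` callable are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical fetcher type: takes a full URL, returns (status_code, body).
Fetcher = Callable[[str], Tuple[int, bytes]]


def fetch_with_fallback(
    path: str, upstreams: Iterable[str], fetch: Fetcher
) -> Optional[bytes]:
    """Try each upstream in order; return the first 200 body, else None."""
    for base in upstreams:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            status, body = fetch(url)
        except OSError:
            # Connection-level failure: treat it like an error status.
            continue
        if status == 200:
            return body
        # Any other status (404, 5xx, ...) falls through to the next upstream.
    return None
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP library; a caller would supply whatever client it already uses.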
On January 26, 2020 13:19, Kristian Klausen via arch-mirrors wrote:
So instead of mirroring the whole thing, the idea is to mirror only the database files (core.db etc) and download the packages on demand from a Tier 1 mirror (and let nginx cache them). By doing it that way, I only download requested packages from the Tier 1 mirrors, instead of downloading the whole thing (saving Tier 1 bandwidth).
How you actually serve the files is irrelevant to us, but it might not be irrelevant to the users, even if they don't actually *know* how you're serving files. I see some latency problems that can happen with such a scenario. Also, as already pointed out, Tier 1 mirrors don't really need their bandwidth saved.
To provide even better performance a CDN (ex: Cloudflare) could be used to provide more caching. So we end up with a setup like this: Cloudflare -> Nginx cache -> Tier1 mirrors (nginx with multiple upstream)
You'll have even more latency issues for packages you have not cached yet. I wonder how you are going to do invalidations.
Do I miss something? Is this a bad idea? If I do setup a mirror like that, is there any chance it could be added as a official mirror?
As I've said, we won't care how you serve the files, as long as you serve them and serve the right file. But users will.

Regards,
Giancarlo Razzolini
On 27.01.2020 13.15, Giancarlo Razzolini wrote:
On January 26, 2020 13:19, Kristian Klausen via arch-mirrors wrote:
So instead of mirroring the whole thing, the idea is to mirror only the database files (core.db etc) and download the packages on demand from a Tier 1 mirror (and let nginx cache them). By doing it that way, I only download requested packages from the Tier 1 mirrors, instead of downloading the whole thing (saving Tier 1 bandwidth).
How you actually serve the files is irrelevant to us, but it might not be irrelevant to the users, even if they don't actually *know* how you're serving files. I see some latency problems that can happen with such scenario.
I'm not sure latency is a big issue. If the package isn't cached, the mirror won't be faster than the "weakest link" (the upstream Tier 1 server), and it costs a few round trips plus TCP ramp-up, but that's about it. (nginx doesn't need to receive the whole response first; it can "stream" the data to the client as it receives it from the upstream server.)
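The streaming behaviour described here (forwarding data to the client while it is still arriving from upstream) can be illustrated with a small generator. This is a generic sketch of the idea, not nginx's implementation; the function and parameter names are hypothetical.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def stream_and_cache(
    upstream_chunks: Iterable[bytes], cache_file: BinaryIO
) -> Iterator[bytes]:
    """Yield each chunk to the client as soon as it arrives, while also
    writing it to cache_file so later requests can be served locally."""
    for chunk in upstream_chunks:
        cache_file.write(chunk)
        yield chunk


# Usage: the first chunk reaches the "client" before the rest has arrived.
cache = io.BytesIO()
stream = stream_and_cache(iter([b"first", b"second"]), cache)
first_chunk = next(stream)
```

Because the generator is lazy, the added delay for an uncached file is roughly the time to the first upstream byte rather than the time to download the whole file.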
Also, as pointed already, Tier 1 mirrors don't need their bandwidth saved really.
I got that impression from the wiki, but private mirror != official/public mirror. https://wiki.archlinux.org/index.php/DeveloperWiki:NewMirrors
To provide even better performance a CDN (ex: Cloudflare) could be used to provide more caching. So we end up with a setup like this: Cloudflare -> Nginx cache -> Tier1 mirrors (nginx with multiple upstream)
You'll have even more latency issues for packages you have not cached yet.
See my previous comment.
I wonder how you are going to do invalidations.
I'm under the impression that package files never change? The database files do change, though, but I think that can be easily handled by either using a low "max-age" (1-5 min) or simply using "no-cache" ("Caches must check with the origin server for validation before using the cached copy.")
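A minimal sketch of the cache policy described above, assuming database files are recognised by a .db or .files suffix and packages by a .pkg.tar.xz/.zst suffix. The file patterns, the 24-hour value for packages, and the function name are assumptions for illustration; only the short max-age and "no-cache" options come from the message.

```python
import re

# Repository metadata that changes on every repo update (assumed suffixes).
_DB_RE = re.compile(r"\.(db|files)(\.tar\.gz)?(\.sig)?$")
# Package files and their detached signatures (assumed suffixes).
_PKG_RE = re.compile(r"\.pkg\.tar\.(xz|zst)(\.sig)?$")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for a requested mirror path."""
    if _DB_RE.search(path):
        # Low max-age (5 minutes), within the 1-5 min range suggested above.
        return "max-age=300"
    if _PKG_RE.search(path):
        # Package files are assumed never to change once published.
        return "max-age=86400"
    # Everything else: revalidate with the origin before reuse.
    return "no-cache"
```

With this split, stale package lists expire within minutes, while package files, whose names include the version, can stay cached for much longer.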
Do I miss something? Is this a bad idea? If I do setup a mirror like that, is there any chance it could be added as a official mirror?
As I've said, we won't care how you serve the files. As long as you serve them and serve the right file. But users will.
In that case, I will see if I can come up with a solution and open a "feature request" to become an official mirror if it works out.
Hi,

Have you thought about the maximum file size Cloudflare caches or supports?

Via: https://support.cloudflare.com/hc/en-us/articles/200172516-Understanding-Clo...
Cloudflare cache maximum file size is 512MB for Free, Pro, and Business customers.

In the Arch Linux repository there are files that are larger than 512M:

# find . -type f -size +512M
./iso/archboot/2016.08/archlinux-2016.08-1-archboot.iso
./iso/archboot/2016.12/archlinux-2016.12-1-archboot.iso
./iso/archboot/2018.06/archlinux-2018.06-1-archboot.iso
./iso/2019.12.01/archlinux-2019.12.01-x86_64.iso
./iso/2019.12.01/arch/x86_64/airootfs.sfs
./iso/2020.01.01/archlinux-2020.01.01-x86_64.iso
./iso/2020.01.01/arch/x86_64/airootfs.sfs
./iso/2019.11.01/archlinux-2019.11.01-x86_64.iso
./iso/2019.11.01/arch/x86_64/airootfs.sfs
./pool/community/sauerbraten-data-2013_02_03_collect_edition-3-any.pkg.tar.xz
./pool/community/cuda-10.2.89-3-x86_64.pkg.tar.zst
./pool/community/xonotic-data-0.8.2-3-any.pkg.tar.zst
./pool/community/kea-devel-docs-1.5.0-1-any.pkg.tar.xz
./pool/community/0ad-data-a23.1-1-any.pkg.tar.xz
./pool/community/supertuxkart-1.1-1-x86_64.pkg.tar.zst

---
Artis Šteinbergs
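For anyone who wants to repeat the size check above in a script, here is a rough Python equivalent of the find command. It assumes the 512MB limit is treated as 512 MiB, which is how find's "M" unit counts; whether Cloudflare counts the same way is not stated in the thread. The function name is hypothetical.

```python
import os
from typing import Iterator, Tuple

MIB = 1024 * 1024


def files_larger_than(
    root: str, limit_bytes: int = 512 * MIB
) -> Iterator[Tuple[str, int]]:
    """Yield (relative_path, size) for regular files above limit_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Like find's "-type f": skip symlinks and non-regular files.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            size = os.path.getsize(full)
            if size > limit_bytes:
                yield os.path.relpath(full, root), size
```

Run against a local copy of the mirror tree, `sorted(files_larger_than("."))` should list the files that a cache with that size limit would not store.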
On 30.01.2020 11.30, Artis Steinbergs via arch-mirrors wrote:
Hi,
Have you thought about the maximum file size Cloudflare caches or supports ? Via: https://support.cloudflare.com/hc/en-us/articles/200172516-Understanding-Clo... Cloudflare cache maximum file size is 512MB for Free, Pro, and Business customers.
I have thought about it, but there are only 6 packages that are bigger than 512M, and nginx is going to cache them anyway, so I don't see it as a big issue. The iso directory is optional, so I'm not sure whether I'm going to mirror it.
On 26.01.2020 17.19, Kristian Klausen via arch-mirrors wrote:
Hi
I'm considering setting up a Arch Linux mirror and I'm considering a different design.
Hi

I just got time to implement this, and the setup looks like this:
Cloudflare -> Cloudflare Workers -> Backblaze B2 bucket <- Tier 1 mirror

The files are synced from mirror.ams1.nl.leaseweb.net every hour to the Backblaze B2 bucket, and they are fetched from the bucket with the help of a Cloudflare Workers script. Cloudflare is configured to cache everything (size <=2GB*); database files are cached for 5 minutes and everything else is cached for 24 hours.
* CF is sponsoring a plan with a higher limit than the 512MB default

I have done some quick testing, and time to first byte isn't impressive (at least not when downloading from Europe), but the speed is acceptable: 80-100MB/s is achievable if the file is cached, and 8-12MB/s if not (tested from Europe).

To make it easier to implement, I took some shortcuts:
* Directory listing isn't implemented
* "latest" files aren't synced
* Only packages in "pool/" are synced. The package files in the different repos aren't synced, but if you request a package (\.pkg\.tar\.(xz|zst)(|.sig)$) it is automatically retrieved from the pool/ directory. This means that you can download e.g. Firefox from both:
https://archlinux.amirror.xyz/extra/os/x86_64/firefox-73.0.1-1-x86_64.pkg.ta...
https://archlinux.amirror.xyz/community/os/x86_64/firefox-73.0.1-1-x86_64.pk...

I'm not sure if the shortcuts are acceptable, but they can be fixed if that is an issue.

Also please note that archive, other and sources aren't synced.

Feel free to try it out: https://archlinux.amirror.xyz/

Best regards
Kristian Klausen
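The pool/ shortcut described in this message can be sketched roughly as below. This is not the actual Workers script (which is not shown in the thread); it only illustrates why the same package resolves from both the extra and the community URL. The pool directory names are an assumption (only pool/community appears elsewhere in the thread), the function name is hypothetical, and the regex is the one quoted above with the dot before "sig" escaped.

```python
import posixpath
import re
from typing import List

# Package-file pattern quoted in the message (dot before "sig" escaped).
PKG_RE = re.compile(r"\.pkg\.tar\.(xz|zst)(|\.sig)$")

# Assumed pool layout; the real bucket layout is not given in the thread.
POOL_DIRS = ("pool/packages", "pool/community")


def pool_candidates(request_path: str) -> List[str]:
    """Return pool/ keys to try for a package request, or [] otherwise."""
    if not PKG_RE.search(request_path):
        return []
    # The repo directory in the URL is ignored; only the file name matters.
    filename = posixpath.basename(request_path)
    return [f"{pool_dir}/{filename}" for pool_dir in POOL_DIRS]
```

Since only the file name is used for the lookup, a request under /extra/os/x86_64/ and one under /community/os/x86_64/ map to the same candidate keys.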
I'll just leave the one I did when I was at CF here then :)
https://cloudflaremirrors.com/archlinux
Hello there,

May I confirm whether this is official from Cloudflare?

Yours faithfully,
Ivan Ip <m@lifeho.me>
I can't, I'm no longer at CF, but I'll try to get in touch with the person I handed this over to and see what its state is.
participants (8)
- Artis Steinbergs
- David Precious
- Frank Villaro-Dixon
- Giancarlo Razzolini
- Ip, Ivan
- Kristian Klausen
- Lelux Mirrormaster
- Sevki Hasirci