Hello,

Entirely personal opinion here, and not of any company or organization I am/have been associated with.

------- Original Message -------
On Monday, May 30th, 2022 at 2:44 PM, Imre Jonk <imre@imrejonk.nl> wrote:
> On Mon, 30 May 2022 12:11:48 -0400 Tyler Dence tyzoid.d@gmail.com wrote:
>> Perhaps, if it would ease mirror operators' minds (especially our commercial partners), it might be wise to put a "readme.txt" or "sources.txt" file in the root of the mirrored directory explaining how/where one might obtain the sources?

We did something similar to this on the main page of the mirror, which listed:

1. Where we synced the files from (rsync/http/etc. and the source mirror URL).
   Caveat: there are some exceptions where the source mirror is private and we could not mention their org's internal mirror URL.
2. How often/when we sync the files (generally 1-4x a day, staggered over ~6hrs to prevent network saturation).
3. Where to find the root folder for those files on our mirror.
4. Where the ISO files or primary software binaries were located.
5. The source organization's website, e.g. kernel.org, gnu.org, etc.

It would be trivial to link to the page that has the sources for the particular distro (e.g. the Arch Linux source folder, or the wiki). It could even be on an index page that mirror operators are provided with, if it comes to that, somewhat like what Ubuntu does [1, 1a].

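As a rough illustration, the five items above could be rendered into a "sources.txt"-style notice with a few lines of Python. This is only a sketch under assumptions: the `MirrorInfo` fields, the `render_sources_txt` helper, and every example value (hostnames, paths, schedule) are hypothetical and do not describe any real mirror's setup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MirrorInfo:
    """Hypothetical record of the facts a mirror landing page might list."""
    project: str
    upstream_url: Optional[str]  # None when the upstream mirror is private
    sync_schedule: str
    mirror_root: str
    binaries_path: str
    project_site: str
    sources_url: str


def render_sources_txt(info: MirrorInfo) -> str:
    """Render the mirror facts as a short plain-text notice."""
    # Do not leak a private upstream; say that it is withheld instead.
    upstream = info.upstream_url or "(private upstream mirror, URL not listed)"
    lines = [
        f"Mirror information for {info.project}",
        "",
        f"1. Synced from:      {upstream}",
        f"2. Sync schedule:    {info.sync_schedule}",
        f"3. Root on mirror:   {info.mirror_root}",
        f"4. ISOs/binaries:    {info.binaries_path}",
        f"5. Project website:  {info.project_site}",
        "",
        f"Source code for these packages: {info.sources_url}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are made-up placeholders.
    example = MirrorInfo(
        project="Example Linux",
        upstream_url="rsync://upstream.example.org/examplelinux/",
        sync_schedule="4x a day, staggered over ~6 hours",
        mirror_root="/examplelinux/",
        binaries_path="/examplelinux/iso/",
        project_site="https://www.example.org/",
        sources_url="https://mirror.example.org/examplelinux/sources/",
    )
    print(render_sources_txt(example), end="")
```

Generating the file from one record per distro would keep the notice consistent across mirrored projects, and the `None` case covers the private-upstream caveat without exposing an internal URL.
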
> Yes, I think that would be wise. The GPLv3 allows pointing to another server where the sources can be obtained in order to fulfill the article 6 obligation. However, the GPLv2 does not seem to allow that, instead requiring a 'written offer' that would basically be a promise from the operator to anyone who obtains the compiled software from their mirror. They would promise to provide a copy of the source code to them at a later date, up to three years in the future. I can see some theoretical issues with that. In practice though, I have never heard of an open source mirror operator who has faced legal threats because they were mirroring GPL-licensed software.

Not a legal threat per se, but there was one time in 2020 when a major web search provider threatened to de-list our organization's domain name from its search results because we hosted an older Mailman package from the /gnu folder on the mirror. They said that if we did not remove the "malware" from the server, we would no longer be listed in their search results, which would be really bad for our organization since that's (presumably) where we get a lot of site views and promote our organization.

We did a lot of investigation, because there was a real possibility that someone had compromised the infrastructure of a parent mirror. We looped in the parent organization's seasoned CISSP, brought down the network infrastructure immediately, and started from the ground up on what was good and what could be bad, looking for threat actors, etc.

In reality, after a few days of investigation, the "malware" turned out to be the test malware suite that came with Mailman 2.1.4 to provide test cases. At the time it gave us a false positive on VirusTotal [2], but now it doesn't [3]. Note how the SHA256 sums for the contents match exactly.

Our CISSP's original suggestion was to either remove the file from rsync, or to blackhole the file over HTTP and only provide it over rsync. Either option has relatively bad implications for the downstream mirrors that use us as their source mirror, especially if other large companies were to ask us to take things down or threaten us in the same way. After a lot of discussion with the parent organization, we came to a reasonable solution: we configured nginx to return a 403 to all bot user agents on the mirror website, which resolved the issue. If that had not been sufficient, we were prepared to have nginx return a 403 to anyone who requested that file over HTTP, but that's not a great solution.

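The actual fix was an nginx configuration, which is not reproduced here. Purely to illustrate the decision logic described above, here is a minimal Python sketch. The `BOT_PATTERN` regex, the `status_for` helper, and the example file path are assumptions made up for this sketch, and real crawler detection would need a more careful pattern.

```python
import re

# Assumed pattern: substrings that commonly appear in crawler User-Agent
# headers. This is illustrative, not the rule the mirror actually used.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def status_for(user_agent: str, path: str,
               blocked_paths: frozenset = frozenset()) -> int:
    """Return the HTTP status the mirror would answer with.

    Step 1 (what resolved the issue): any bot-like user agent gets a 403.
    Step 2 (the fallback that was not needed): a specific file gets a 403
    for every HTTP client, while rsync would remain unaffected.
    """
    if BOT_PATTERN.search(user_agent or ""):
        return 403
    if path in blocked_paths:
        return 403
    return 200


if __name__ == "__main__":
    # Hypothetical path, used only to exercise the fallback rule.
    flagged = "/gnu/mailman/mailman-2.1.4.tgz"
    browser = "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
    crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    print(status_for(crawler, "/archlinux/"))                  # bot: refused
    print(status_for(browser, flagged))                        # human: served
    print(status_for(browser, flagged, frozenset({flagged})))  # fallback: refused
```

The first rule leaves every file reachable for ordinary HTTP clients and downstream mirrors, which is why it is the gentler of the two options. The second rule blocks the file for everyone over HTTP.
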
> What would be even better of course is if more (preferably all) mirrors would start mirroring the source packages alongside the binary packages. As I said in my previous email, other Linux distributions do this too.

I don't mirror sources on my personal mirror because it's a waste of disk space; I could be running other homelab-y stuff in that space. For Arch Linux, the sources are not too bad (one of the smallest Linux distros on that mirror, at 3rd/4th?), but for other distros like Fedora, it can take dozens of terabytes (over ~25TB) [4] to mirror all of the sources and binaries they have available. Granted, most of that is "archives", but most mirrors probably want to mirror those too if they are to be a "useful" public mirror?

> A quick comment on those commercial partners you mention: it is not at all my intention to fearmonger Arch sponsors (or anyone contributing mirror capacity) into rethinking their involvement in this great community project. Sorry if it might seem this way :(

Those are my two cents. I'm sure there are much more experienced mirror operators out there. But I hope that the above experiences help to raise these points:

1. Mirror operators do sometimes have to take requests with regard to security from their (parent) organizations and from external organizations, and they may not be able to talk about it.
2. If a distro offers software that has questionable legal status, some mirrors may not be able to host it. (I'm sure that if Arch Linux offered compiled AUR packages, it would violate a whole bunch of more restrictive software licenses, and we would not mirror that folder either.)

Have a good one,
Jared D.

[1]: https://mirrors.edge.kernel.org/ubuntu-releases/HEADER.html
[1a]: http://releases.ubuntu.com/
[2]: https://www.virustotal.com/gui/file/4d47ca9bb28b602a8245dbdd0384e8326c3c813a...
[3]: https://www.virustotal.com/gui/url/aa5e9bab3ec7df18b38d8203a0d001a2ffac818bb...
[4]: https://dl.fedoraproject.org/pub/DIRECTORY_SIZES.txt