[arch-general] Locale packages
Hi,
While working on locale-gen support for systemd-firstboot (https://github.com/systemd/systemd/pull/15994), I started wondering if it wouldn't be simpler to delegate the installation of locales to pacman instead. I haven't been following the mailing lists for very long so I don't know if this has ever been discussed. I'd imagine Arch could provide a package for each locale supported by glibc and users would install the ones they need.
The PKGBUILD would use localedef to generate separate folders of compiled locale files for each locale that would be stored in /usr/lib/locale. This approach is already implemented by distros such as Fedora (and co) and Ubuntu.
The main advantage of this approach is that there's no need to set up an entire chroot to run locale-gen when pacstrapping a new Arch system image. This might seem easy but becomes trickier when the image uses a different architecture than the host system since emulation of that architecture has to be set up first. Even if locale-gen had a --root option so using the host's locale-gen would be an option, I'm not sure if there's any guarantee that compiled locale definitions generated by the host system's locale-gen would work with the glibc version used by the image (less of a problem with Arch but the glibc on the host could still potentially be out-of-date compared to the one installed in the image). Being able to install locales with pacman would solve all these problems.
Any interest in something like this from the Arch developers? I'd be willing to try my hand at a PKGBUILD for this but I'm not a TU so I'd need some support to get this implemented (if there is any interest at all).
(This also doesn't imply that locale-gen wouldn't work anymore, locale-gen stores everything in /usr/lib/locale/locale-archive which would be independent from the files installed by the locale packages, so both approaches should work side-by-side)
Cheers,
Daan De Meyer
On 6/22/20 3:11 PM, Daan De Meyer via arch-general wrote:
Hi,
While working on locale-gen support for systemd-firstboot ( https://github.com/systemd/systemd/pull/15994), I started wondering if it wouldn't be simpler to delegate the installation of locales to pacman instead. I haven't been following the mailing lists for very long so I don't know if this has ever been discussed. I'd imagine Arch could provide a package for each locale supported by glibc and users would install the ones they need.
Very firm -1 to any approach that involves creating hundreds of new packages which each provide a tiny file.
The PKGBUILD would use localedef to generate separate folders of compiled locale files for each locale that would be stored in /usr/lib/locale. This approach is already implemented by distros such as Fedora (and co) and Ubuntu.
The main advantage of this approach is that there's no need to set up an entire chroot to run locale-gen when pacstrapping a new Arch system image. This might seem easy but becomes trickier when the image uses a different architecture than the host system since emulation of that architecture has to be set up first. Even if locale-gen had a --root option so using the host's locale-gen would be an option, I'm not sure if there's any guarantee that compiled locale definitions generated by the host system's locale-gen would work with the glibc version used by the image (less of a problem with Arch but the glibc on the host could still potentially be out-of-date compared to the one installed in the image). Being able to install locales with pacman would solve all these problems.
Any interest in something like this from the Arch developers? I'd be willing to try my hand at a PKGBUILD for this but I'm not a TU so I'd need some support to get this implemented (if there is any interest at all).
(This also doesn't imply that locale-gen wouldn't work anymore, locale-gen stores everything in /usr/lib/locale/locale-archive which would be independent from the files installed by the locale packages, so both approaches should work side-by-side)
This is not about locale-gen. locale-gen (and /etc/locale.gen) are Arch-specific custom scripts which IIRC were copied from Debian once upon a time, which just run localedef. I actually use a much simpler locale-gen program which uses flag files e.g. /etc/locales/en_US (file contents can contain a charset but are otherwise assumed to be UTF-8). It's not hard to hack your own.
IIRC Fedora follows the "hundreds of packages which each provide a small file" approach, that being the localedef --no-archive intersection of a locale and a charmap. The combination of all possibilities will result in significant size bloat, so it is not feasible to provide them all in the glibc package itself. (e.g. try uncommenting all 487 locales in /etc/locale.gen and it is a 500MB locale-archive, "only" 100MB if you stick to UTF-8 locales)
-- Eli Schwartz Bug Wrangler and Trusted User
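The flag-file idea described above can be sketched roughly as follows. This is only an illustration, not Eli's actual program: the function name `plan_localedef` is made up, and it assumes each flag file is named after a locale source (e.g. en_US) and contains either nothing or a single charset name. It only builds the `localedef -i LOCALE -f CHARSET NAME` command lines and does not run them.

```python
from pathlib import Path


def plan_localedef(flag_dir, default_charset="UTF-8"):
    """Return one localedef argv list per flag file found in flag_dir.

    Hypothetical helper: the flag-file layout (file name = locale,
    optional content = charset) follows the description in the thread.
    """
    commands = []
    for flag in sorted(Path(flag_dir).iterdir()):
        if not flag.is_file():
            continue
        # An empty flag file means "use the default charset".
        charset = flag.read_text().strip() or default_charset
        locale = flag.name
        commands.append(
            ["localedef", "-i", locale, "-f", charset, f"{locale}.{charset}"]
        )
    return commands
```

Each returned list could be handed to `subprocess.run(argv, check=True)` on a system where localedef is installed; keeping the planning step separate makes it easy to inspect what would be generated first.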
Very firm -1 to any approach that involves creating hundreds of new packages which each provide a tiny file.
You're right, this would be overkill. Even when limiting to only UTF-8 we'd still have 313 packages.
This is not about locale-gen. locale-gen (and /etc/locale.gen) are Arch-specific custom scripts which IIRC were copied from Debian once upon a time, which just run localedef. I actually use a much simpler locale-gen program which uses flag files e.g. /etc/locales/en_US (file contents can contain a charset but are otherwise assumed to be UTF-8). It's not hard to hack your own.
Running localedef directly doesn't really solve any of the issues I mentioned either though.
What if we make do with a single locale package? I just found out there's some progress on upstream C.UTF-8 locale support in glibc (https://sourceware.org/pipermail/libc-alpha/2020-June/115224.html). It doesn't look like it will be built-in though unless they manage to get the size down significantly. If it isn't built-in, maybe we could add a single package just for the C.UTF-8 locale? That should be sufficient for 95% of the "I'm building an Arch container/vm image for development/server/any other development stuff" use cases, which generally will be using an English locale, and it avoids all the problems I mentioned earlier without requiring the addition of 300+ packages.
It'll have to wait until we have C.UTF-8 in glibc though. I guess we could add a package for en_US.UTF-8 as a stopgap but that doesn't seem worth the effort assuming C.UTF-8 gets merged in a reasonable timeframe.
As an example of why one would need a UTF-8 locale specifically in a container/vm image, meson (actually python) does not like running under a non-UTF-8 locale at all.
(I don't use mailing lists very often, I hope I didn't mess up the reply etiquette)
Daan
On 6/23/20 3:02 PM, Daan De Meyer via arch-general wrote:
This is not about locale-gen. locale-gen (and /etc/locale.gen) are Arch-specific custom scripts which IIRC were copied from Debian once upon a time, which just run localedef. I actually use a much simpler locale-gen program which uses flag files e.g. /etc/locales/en_US (file contents can contain a charset but are otherwise assumed to be UTF-8). It's not hard to hack your own.
Running localedef directly doesn't really solve any of the issues I mentioned either though.
It would:
- avoid the *additional* issue "what to do if locale-gen doesn't exist",
- solve the issue "locale-gen does not have a --root option"
It wouldn't:
- solve the issue "host/guest glibc version mismatches"
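As a rough illustration of the "--root" point, a host-side localedef call can be aimed at an image directory through localedef's --prefix option. The helper name `localedef_for_root` and the example path are hypothetical, and the sketch only constructs the command line. It does nothing about the host/guest glibc mismatch noted above.

```python
def localedef_for_root(root, locale, charset="UTF-8"):
    """Build a localedef argv that writes into the tree rooted at `root`.

    Hypothetical helper: relies on localedef's --prefix option so that the
    locale archive is written below `root` rather than on the host.
    """
    return [
        "localedef",
        f"--prefix={root}",
        "-i", locale,       # locale source definition, e.g. en_US
        "-f", charset,      # charmap, e.g. UTF-8
        f"{locale}.{charset}",
    ]
```

For example, `localedef_for_root("/mnt/image", "en_US")` yields a command that could be passed to `subprocess.run(..., check=True)` after pacstrapping into /mnt/image.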
What if we make do with a single locale package? I just found out there's some progress on the C.UTF-8 locale upstream support in glibc ( https://sourceware.org/pipermail/libc-alpha/2020-June/115224.html). It doesn't look like it will be built-in though unless they manage to get the size down significantly. If it isn't built-in, maybe we could add a single package just for the C.UTF-8 locale? That should be sufficient for 95% of the "I'm building an Arch container/vm image for development/server/any other development stuff" use cases which generally will be using an english locale and avoids all the problems I mentioned earlier without requiring the addition of 300+ packages. It'll have to wait until we have C.UTF-8 in glibc though. I guess we could add a package for en_US.UTF-8 as a stopgap but that doesn't seem worth the effort assuming C.UTF-8 gets merged in a reasonable timeframe.
The ultimate goal is to ensure C.UTF-8 always exists no matter what. If it gets merged upstream in glibc as a non-builtin localedef generated locale, then the probable best solution is to make locale-gen always include C.UTF-8 regardless of which other locales are requested by the user's system. Or include its compiled form in the glibc package directly, if it isn't too bloated.
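One way to picture "always include C.UTF-8" is a small parser for locale.gen-style text that adds the entry whenever the user did not request it. This is a sketch under assumptions, not the real locale-gen: the function name is invented, and the entry spelling "C.UTF-8 UTF-8" is a guess at how such a line would look once glibc ships the locale.

```python
# Assumed spelling of the always-present entry (locale name, charmap).
ALWAYS = ("C.UTF-8", "UTF-8")


def locales_to_generate(locale_gen_text):
    """Parse locale.gen-style text and make sure C.UTF-8 is in the result."""
    entries = []
    for raw in locale_gen_text.splitlines():
        # Drop comments; commented-out locales therefore become empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that are not "<locale> <charmap>"
        entry = (parts[0], parts[1])
        if entry not in entries:
            entries.append(entry)
    if ALWAYS not in entries:
        entries.insert(0, ALWAYS)
    return entries
```

With this shape, an empty or fully commented /etc/locale.gen still yields one locale to generate, which is the "no matter what" property discussed above.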
As an example of why one would need a UTF-8 locale specifically in a container/vm image, meson (actually python) does not like running under a non UTF-8 locale at all.
You're preaching to the choir, here. ;) I thoroughly agree there must be a UTF-8 locale. The question is at what stage should this be selected and generated.
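For image-building scripts, a quick name-based check of whether the environment even asks for a UTF-8 locale might look like the sketch below. The helper names are made up; the check only inspects LC_ALL, LC_CTYPE and LANG in the usual precedence order and does not verify that the named locale has actually been generated.

```python
def effective_ctype(env):
    """Return the locale name that governs LC_CTYPE for the given environment."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset and empty variables are both ignored
            return value
    return "C"


def is_utf8_locale(env):
    """True if the effective LC_CTYPE name carries a UTF-8 codeset."""
    name = effective_ctype(env)
    if "." not in name:
        return False  # e.g. "C", "POSIX" or a bare "en_US"
    # Take the codeset between "." and an optional "@modifier".
    codeset = name.split(".", 1)[1].split("@", 1)[0]
    return codeset.lower().replace("-", "") == "utf8"
```

Calling `is_utf8_locale(os.environ)` before launching a build would flag the "C locale only" situation early instead of failing somewhere inside the tool.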
(I don't use mailing lists very often, I hope I didn't mess up the reply etiquette)
Generally people tend to delete the sections they are not replying to, but reply inline, rather than including everything at the bottom as a second copy of the sections you quoted and replied to inline. Still, replying inline is the main thing, and you did that. :)
-- Eli Schwartz Bug Wrangler and Trusted User
Alright, thanks for all the info. I'll leave this be for now until the C.UTF-8 support in upstream glibc is released. If they manage to reduce the size sufficiently to have it built-in, there might not even be anything to change on Arch's side.
Daan
participants (2)
- Daan De Meyer
- Eli Schwartz