[arch-general] Including compiled C.UTF-8 locale by default in glibc package?
Hi,

Now that glibc 2.35 is available, could we enable and ship the compiled form of the new C.UTF-8 locale in glibc by default in Arch Linux?
From the glibc 2.35 release notes (https://sourceware.org/pipermail/libc-alpha/2022-February/136040.html):
* Support for the C.UTF-8 locale has been added to glibc. The locale supports full code-point sorting for all valid Unicode code points. A limitation in the framework for fnmatch, regexec, and regcomp requires a compromise to save space and only ASCII-based range expressions are supported for now (see bug 28255). The full size of the locale is only ~400KiB, with 346KiB coming from LC_CTYPE information for Unicode. This locale harmonizes downstream C.UTF-8 already shipping in various downstream distributions. The locale is not built into glibc, and must be installed.
Being able to rely on the existence of a UTF-8 English locale simplifies many use cases. A good example of the issues caused by the lack of a built-in UTF-8 locale is https://github.com/systemd/systemd/pull/8340, a workaround added in 2018 that still exists today. Having the C.UTF-8 locale available by default in Arch would make it possible to remove such workarounds.

Any thoughts?

Cheers,
Daan De Meyer
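For anyone who wants to check whether a compiled C.UTF-8 locale is actually installed on a given system, the following is a minimal sketch rather than anything proposed in this thread. It assumes a POSIX.1-2008 libc such as glibc, where newlocale() and freelocale() are declared in <locale.h>; the helper name locale_is_available is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>

/* Hypothetical helper: returns 1 if the named locale can be loaded for
 * LC_CTYPE, 0 otherwise. It builds a separate locale object, so the
 * program's global locale is not changed. */
int locale_is_available(const char *name)
{
    locale_t loc = newlocale(LC_CTYPE_MASK, name, (locale_t)0);
    if (loc == (locale_t)0)
        return 0;           /* the locale could not be loaded */
    freelocale(loc);
    return 1;
}
```

Calling locale_is_available("C.UTF-8") should then report whether the compiled locale is present, while locale_is_available("C") is expected to succeed everywhere.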
On Tue, Feb 15, 2022 at 00:08:04 +0100, Daan De Meyer via arch-general wrote:
> Hi,
>
> Now that glibc 2.35 is available, could we enable and ship the compiled form of the new C.UTF-8 locale in glibc by default in Arch Linux?
>
> From the glibc 2.35 release notes (https://sourceware.org/pipermail/libc-alpha/2022-February/136040.html):
>
> * Support for the C.UTF-8 locale has been added to glibc. The locale supports full code-point sorting for all valid Unicode code points. A limitation in the framework for fnmatch, regexec, and regcomp requires a compromise to save space and only ASCII-based range expressions are supported for now (see bug 28255). The full size of the locale is only ~400KiB, with 346KiB coming from LC_CTYPE information for Unicode. This locale harmonizes downstream C.UTF-8 already shipping in various downstream distributions. The locale is not built into glibc, and must be installed.
>
> Being able to rely on the existence of a UTF-8 english locale simplifies many use cases. A good example of issues introduced due to a lack of a built-in UTF-8 locale is https://github.com/systemd/systemd/pull/8340 which is a workaround added in 2018 that still exists today. Having the C.UTF-8 locale available by default in Arch would enable removing such workarounds.
>
> Any thoughts?
That would indeed be neat. But let's at least wait until the next glibc release[*] that will fix this:

Generating locales...
  C.UTF-8...failed to set locale!
[error] LC_MONETARY: value for field `mon_decimal_point' must not be an empty string
 done

[*] or just grab these patches:
https://patchwork.sourceware.org/project/glibc/cover/20220131053442.3995804-...
https://patchwork.sourceware.org/project/glibc/patch/20220131053442.3995804-...
https://patchwork.ozlabs.org/project/glibc/patch/20220131053442.3995804-3-ca...

Geert
TLDR: Sounds like this makes it harder to mess up the locale setup, so I'm all for it.

This mail just reminded me that one of my own projects didn't handle a messed-up locale setup gracefully in the past, causing multiple users (who apparently had such a setup) to report issues, e.g. https://aur.archlinux.org/packages/syncthingtray#comment-816671 and https://github.com/Martchus/syncthingtray/issues/109. And when I searched for the error to investigate the issue, I found many results. If it helps prevent applications from crashing completely with errors like "locale::facet::_S_create_c_locale name not valid", it is likely worth including.
Marius,
> This mail just reminded me that one of my own projects didn't handle a messed locale setup gracefully in the past causing multiple users (that apparently had a messed setup) to report issues, e.g. https://aur.archlinux.org/packages/syncthingtray#comment-816671 and https://github.com/Martchus/syncthingtray/issues/109. And when I searched for the error to investigate the issue I found many results. If it helps preventing applications from crashing completely with errors like "locale::facet::_S_create_c_locale name not valid" it is likely worth including.
fwiw, for people writing code: a friend suggested something like this: first try user's configured locale (environment variables). if that fails, try the "C" locale. if that also fails, print out a warning, and continue to run:

----
/* modified from https://www.cl.cam.ac.uk/~mgk25/unicode.html */
if (!setlocale(LC_CTYPE, "") && !setlocale(LC_CTYPE, "C")) {
    fprintf(stderr, "Warning: Can't set the specified locale! "
            "Check LANG, LC_CTYPE, LC_ALL.\n");
    /* but, keep going */
}
----

(though, of course, "continue to run" may not work in your case, i.e., Boost or other frameworks, etc.)

cheers, Greg
On Wed, 16 Feb 2022 at 14:18, Greg Minshall via arch-general <arch-general@lists.archlinux.org> wrote:
> Marius,
>
> > This mail just reminded me that one of my own projects didn't handle a messed locale setup gracefully in the past causing multiple users (that apparently had a messed setup) to report issues, e.g. https://aur.archlinux.org/packages/syncthingtray#comment-816671 and https://github.com/Martchus/syncthingtray/issues/109. And when I searched for the error to investigate the issue I found many results. If it helps preventing applications from crashing completely with errors like "locale::facet::_S_create_c_locale name not valid" it is likely worth including.
>
> fwiw, for people writing code: a friend suggested something like this: first try user's configured locale (environment variables). if that fails, try the "C" locale. if that also fails, print out a warning, and continue to run:
>
> ----
> /* modified from https://www.cl.cam.ac.uk/~mgk25/unicode.html */
> if (!setlocale(LC_CTYPE, "") && !setlocale(LC_CTYPE, "C")) {
>     fprintf(stderr, "Warning: Can't set the specified locale! "
>             "Check LANG, LC_CTYPE, LC_ALL.\n");
>     /* but, keep going */
> }
> ----
>
> (though, of course, "continue to run" may not work in your case, i.e., Boost or other frameworks, etc.)
>
> cheers, Greg
A C program is in the "C" locale already when it reaches the main function, so I think there's no point to calling it again.

But I definitely agree that to crash or to refuse to run depending on the locale is a serious bug.

Neven
Neven,
> A C program is in the "C" locale already when it reaches the main function, so I think there's no point to calling it again.

ah, thanks. i didn't know that. with the new "C.UTF-8", possibly it makes sense to try falling back to that locale (after "" fails), before defaulting to the "C" locale?

cheers, Greg
Hi Neven,

I disagree with some of what's been said on this thread and think it could mislead others, so I'm piping up...
Greg Minshall wrote:
> fwiw, for people writing code: a friend suggested something like this: first try user's configured locale (environment variables). if that fails, try the "C" locale. if that also fails, print out a warning, and continue to run:
>
> ----
> /* modified from https://www.cl.cam.ac.uk/~mgk25/unicode.html */
> if (!setlocale(LC_CTYPE, "") && !setlocale(LC_CTYPE, "C")) {
>     fprintf(stderr, "Warning: Can't set the specified locale! "
>             "Check LANG, LC_CTYPE, LC_ALL.\n");
That textual warning is incorrect because not just the specified locale was attempted. Looking at the URL, it doesn't suggest an attempt to fall back to the C locale.
>     /* but, keep going */

Neither does the web page suggest the program continue, but instead exit with an error status.

    int main()
    {
        if (!setlocale(LC_CTYPE, "")) {
            fprintf(stderr, "Can't set the specified locale! "
                    "Check LANG, LC_CTYPE, LC_ALL.\n");
            return 1;
        }
> (though, of course, "continue to run" may not work in your case, i.e., Boost or other frameworks, etc.)
Or worse, it may continue to run but cause corruption of data.
> A C program is in the "C" locale already when it reaches the main function
Agreed.
> so I think there's no point to calling it again.
There isn't. setlocale(3p) says on error ‘the global locale shall not be changed’ so without a previous successful call to setlocale() it shall remain at the default ‘C’ locale which the C run-time has set up before entry to main(). This includes the more normal ‘setlocale(LC_ALL, "")’ as the man page says all error checking occurs up front.
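The behaviour described above can be observed with a small sketch like the following. It is only an illustration: the helper name try_ctype_locale is hypothetical, and whether a given made-up locale name is rejected depends on the C library (glibc is assumed to reject names it has no data for).

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: try to switch LC_CTYPE to `name`, then copy the
 * name of the locale that is active afterwards into `active`.
 * Returns 1 if the switch succeeded, 0 if setlocale() reported failure. */
int try_ctype_locale(const char *name, char *active, size_t active_len)
{
    int ok = setlocale(LC_CTYPE, name) != NULL;

    /* A NULL second argument only queries the current setting. */
    const char *now = setlocale(LC_CTYPE, NULL);
    snprintf(active, active_len, "%s", now ? now : "(unknown)");
    return ok;
}
```

If this is the first setlocale() call in the program and it fails, `active` should still read "C", which is why an explicit fall back to "C" adds nothing.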
> But I definitely agree that to crash or to refuse to run depending on the locale is a serious bug.
No, to complain and stop is absolutely the right behaviour. The user has requested, through environment variables or lack thereof, the desired locale. Anything other than that could cause problems the program cannot anticipate and so it must not continue. Instead, it should clearly indicate the issue and exit with a non-zero status.

-- 
Cheers, Ralph.
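A minimal sketch of the complain-and-stop approach, written as a helper so the caller decides how to exit. The function name init_locale_strict and the wording of the message are assumptions for illustration, not code from this thread.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: adopt the locale requested via the environment.
 * Returns EXIT_SUCCESS, or EXIT_FAILURE after printing a diagnostic. */
int init_locale_strict(void)
{
    if (setlocale(LC_ALL, "") == NULL) {
        fprintf(stderr, "Can't set the requested locale; "
                        "check LANG, LC_CTYPE, LC_ALL.\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```

A program following this position would call init_locale_strict() first thing in main() and return its value immediately if it is not EXIT_SUCCESS.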
Ralph,
> Or worse, it may continue to run but cause corruption of data.
thanks for weighing in. i sympathize with (and find appealing) your position. but, i think this may be a trade-off between usability and correctness. (we more often talk about the trade-off between usability and security.) if so, there may be no "absolutely right" answer to how to deal with it.

i think it used to be the case that many unix/linux users didn't have their locale stuff very straight. i don't know the extent to which that is still true.

i think it reasonable to expect programmers to figure out how to do things, set things up, "right". (and, of course, that's what you're trying to suggest!) but, "end users" may be lost, and expecting them to get things configured correctly before providing them service may be unrealistic. (a risk -- but, to the users? to us developers? -- is that they get so frustrated they give up.) and, in fact, the services themselves -- like sed(1) and grep(1) -- may be tools a user might need to use in order to figure things out. [*]

my take would be, printing a stern warning to stderr is a reasonable compromise. but, ymmv, as it were.

it was also very helpful for me in this exchange to learn that if the call to, e.g., `setlocale(LC_CTYPE, "")` fails, *that* is when the error message should appear. and, also the fact that "C" locale will "always" be there, so there's no reason to try doing a `setlocale()` to it as a fall back.

my current code does try to fall back on `C.UTF-8`:

----
if (!setlocale(LC_CTYPE, "")) {
    fprintf(stderr, "Warning: Can't set the specified locale! "
            "Check LANG, LC_CTYPE, LC_ALL.\n");
#if defined(STOP_IF_NO_LOCALE)
    exit(RET_LOCALE);
#else /* defined(STOP_IF_NO_LOCALE) */
    /* make an attempt to activate C.UTF-8, if available, but
     * ignore any errors. */
    setlocale(LC_CTYPE, "C.UTF-8");
#endif /* defined(STOP_IF_NO_LOCALE) */
}
----

cheers, Greg

[*] fwiw, sed and grep both seem to run even if they are unable to deal with an unsupported locale.

ps--
> Looking at the URL, it doesn't suggest an attempt to fall back to the C locale.

> Neither does the web page suggest the program continue, but instead exit with an error status.
sorry, the URL was acknowledging where i got the general structure; as The Author says, "Any remaining errors are, of course, my responsibility." :)
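One possible refinement of the warn-and-fall-back variant discussed in this message is to report whether the program actually ended up with a UTF-8 character set, so the caller can decide what to do next. This is a sketch only: the helper name init_ctype_prefer_utf8 is hypothetical, nl_langinfo(CODESET) is the POSIX interface from <langinfo.h>, and the exact string "UTF-8" is what glibc is assumed to return for UTF-8 locales (other C libraries may spell it differently).

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: try the environment's locale, then C.UTF-8.
 * Returns 1 if the resulting LC_CTYPE codeset is "UTF-8", else 0. */
int init_ctype_prefer_utf8(void)
{
    if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Warning: Can't set the specified locale! "
                        "Check LANG, LC_CTYPE, LC_ALL.\n");
        /* Best effort: if this also fails, the locale stays unchanged. */
        setlocale(LC_CTYPE, "C.UTF-8");
    }
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A caller could use the return value to choose between UTF-8 and ASCII-only output instead of assuming the fall back worked.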
participants (6)
- Daan De Meyer
- Geert Hendrickx
- Greg Minshall
- Marius Kittler
- Neven Sajko
- Ralph Corderoy