[arch-general] Opening a document with unicode in path
Hi everyone,

there's a document on Dropbox that has a Unicode character in its path (a French character). Trying to open this document with LibreOffice (Plasma is running) fails with 'file not found', and the path shown in the error clearly presents the path with that Unicode character replaced by '??'.

What I tried:
* copy the document to a path where there's no Unicode - it opens
* copy the document using shell - it works
* copy the document using Dolphin (from Plasma) - it works
* check $LANG - it's set to `en_CA.UTF8`
* search for 'libreoffice unicode path', 'archlinux unicode path' and a plethora of similar search terms - not much came through

This makes me think the issue is actually with LibreOffice, but the reason I ask here, and not in their forum, is that on another computer running Ubuntu this works without fail, so I'm fairly certain the issue is in some local configuration.

Could anyone shed some light on this, please, or at least point me in some direction where I could look?

Cheers

--
"That gum you like is going to come back in style."
On 8/2/19 8:59 AM, John Z. wrote:
This makes me think the issue is actually with LibreOffice, but the reason I ask here, and not in their forum, is that on another computer running Ubuntu - this works without fail, so I'm fairly certain the issue is in some local configuration.
Good jump on the research.
Could anyone shed some light on this, please, or at least point me in some direction where I could look?
I don't have a direct answer, but check the version(s) of LibreOffice, Dropbox, and possibly some of the other packages you've already mentioned. Perhaps your issue is something that's been fixed by a newer package in Ubuntu and not yet fixed in the corresponding package in Arch (or something that's been broken by a newer package in Arch and not yet merged into Ubuntu). Maybe a release note or a patch will jump out at you.
Good jump on the research.
I try to do what I can, before asking other people to spend their time on me :-)
I don't have a direct answer, but check the version(s) of LibreOffice, Dropbox, and possibly some of the other packages you've already mentioned. Perhaps your issue is something that's been fixed by a newer package in Ubuntu and not yet fixed in the corresponding package in Arch (or something that's been broken by a newer package in Arch and not yet merged into Ubuntu). Maybe a release note or a patch will jump out at you.
That is a solid idea! I'll see if maybe I can find a version mismatch and downgrade accordingly. I already checked Libreoffice's bugtracker for this (which would indicate there's a patch incoming), but haven't found any entries, so I'll file one. Thank you. -- "That gum you like is going to come back in style."
There might also be a difference between libreoffice-fresh and libreoffice-still which is quite a bit behind fresh.
Hi Gene, also a good idea, I wasn't even aware of the `libreoffice-still` package. I tried replacing `libreoffice-fresh` with it, and I still get the same error, although with slightly different looking dialog :-( -- "That gum you like is going to come back in style."
On 8/2/19 8:59 AM, John Z. wrote:
Hi everyone, there's a document on Dropbox, that has unicode character in its path (french character). Trying to open this document with libre office (Plasma is running) fails with 'file not found', and the path shown with error clearly presents the path with that unicode character replaced by '??'
What I tried:
* copy the document to a path where there's no Unicode - it opens
* copy the document using shell - it works
* copy the document using Dolphin (from Plasma) - it works
* check $LANG - it's set to `en_CA.UTF8`
* search for 'libreoffice unicode path', 'archlinux unicode path' and a plethora of similar search terms - not much came through
This makes me think the issue is actually with LibreOffice, but the reason I ask here, and not in their forum, is that on another computer running Ubuntu - this works without fail, so I'm fairly certain the issue is in some local configuration.
Could anyone shed some light on this, please, or at least point me in some direction where I could look?
Can you determine some steps that exactly reproduce the problem? Assuming that the problem should manifest when opening the file using /usr/bin/loffice /path/to/file, I tried creating a test file and opening it, and it worked:

    $ mkdir -p '/tmp/unicode paths are 💩/'
    $ touch '/tmp/unicode paths are 💩/testfile.txt'
    $ loffice '/tmp/unicode paths are 💩/testfile.txt'
    $

I could successfully edit this file in libreoffice, save content, or reopen it. Tested with LANG=en_US.UTF-8 and the libreoffice-fresh package.

--
Eli Schwartz
Bug Wrangler and Trusted User
Could you verify that the encoding of the filepath is, in fact, UTF8? Filepaths in linux are free to be arbitrary bytes despite the locale settings. Most tools don't care, though I would expect the filepath to display incorrectly in the terminal and file browser if it were not UTF8. So it is probably a long shot but perhaps worth checking.

The following Python script, run in the directory containing the file/directory with the French character, should tell you if it is valid UTF8:

    import os

    # List the directory as raw bytes so no decoding happens implicitly.
    for item in os.listdir(b'.'):
        try:
            item.decode('utf8')
        except UnicodeDecodeError:
            print(item, "is not valid UTF8")
            raise

On Fri, Aug 2, 2019 at 12:48 PM Eli Schwartz via arch-general < arch-general@archlinux.org> wrote:
On 8/2/19 8:59 AM, John Z. wrote:
Hi everyone, there's a document on Dropbox, that has unicode character in its path (french character). Trying to open this document with libre office (Plasma is running) fails with 'file not found', and the path shown with error clearly presents the path with that unicode character replaced by '??'
What I tried:
* copy the document to a path where there's no Unicode - it opens
* copy the document using shell - it works
* copy the document using Dolphin (from Plasma) - it works
* check $LANG - it's set to `en_CA.UTF8`
* search for 'libreoffice unicode path', 'archlinux unicode path' and a plethora of similar search terms - not much came through
This makes me think the issue is actually with LibreOffice, but the reason I ask here, and not in their forum, is that on another computer running Ubuntu - this works without fail, so I'm fairly certain the issue is in some local configuration.
Could anyone shed some light on this, please, or at least point me in some direction where I could look?
Can you determine some steps that exactly reproduce the problem? Assuming that the problem should manifest when opening the file using /usr/bin/loffice /path/to/file, I tried creating a test file and opening it, and it worked:
    $ mkdir -p '/tmp/unicode paths are 💩/'
    $ touch '/tmp/unicode paths are 💩/testfile.txt'
    $ loffice '/tmp/unicode paths are 💩/testfile.txt'
    $
I could successfully edit this file in libreoffice, save content, or reopen it. Tested with LANG=en_US.UTF-8 and the libreoffice-fresh package
-- Eli Schwartz Bug Wrangler and Trusted User
Could you verify that the encoding of the filepath is, in fact, UTF8? Filepaths in linux are free to be arbitrary bytes despite the locale settings. Most tools don't care, though I would expect the filepath to display incorrectly in the terminal and file browser if it were not UTF8. So it is probably a long shot but perhaps worth checking.
Hi, thank you for the suggestion. I tried running your script, and all filenames are decoded correctly, no exception was thrown (I also tried without try/except just in case something else gets thrown).

However, you might be onto something here because, interestingly enough: while the BASH prompt and autocompletion feature both decode the character correctly, `ls` does not and outputs a sequence of escape codes:

    Proc'$'\303\251''dures

instead of

    Procedures (where first 'e' is the unicode char, and has french accent)
The following Python script, run in the directory containing the file/directory with the French character, should tell you if it is valid UTF8:
    import os

    for item in os.listdir(b'.'):
        try:
            item.decode('utf8')
        except UnicodeDecodeError:
            print(item, "is not valid UTF8")
            raise
-- "That gum you like is going to come back in style."
On 8/2/19 1:24 PM, John Z. wrote:
Could you verify that the encoding of the filepath is, in fact, UTF8? Filepaths in linux are free to be arbitrary bytes despite the locale settings. Most tools don't care, though I would expect the filepath to display incorrectly in the terminal and file browser if it were not UTF8. So it is probably a long shot but perhaps worth checking.
Hi, thank you for the suggestion. I tried running your script, and all filenames are decoded correctly, no exception was thrown (I also tried without try/except just in case something else gets thrown)
However, you might be onto something here because, interestingly enough: while the BASH prompt and autocompletion feature both decode the character correctly, `ls` does not and outputs a sequence of escape codes:
Proc'$'\303\251''dures
instead of
Procedures (where first 'e' is the unicode char, and has french accent)
The ls command will by default escape the character into its numeric code if it thinks the character is invalid in your locale. I can get ls to print the same thing as you (using shell-escaped $'\303\251') *iff* I first export LC_ALL=C (which is not a UTF-8 locale and therefore cannot print unicode characters). This indicates something is wrong with your locale, because at the very least, your shell cannot parse the character correctly -- maybe neither can libreoffice. -- Eli Schwartz Bug Wrangler and Trusted User
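To make the point about non-UTF-8 locales concrete, here is a small Python sketch. It is only an analogy: it assumes nothing about how `ls` or LibreOffice handle paths internally, and the directory name is simply the one discussed in this thread. It shows that the UTF-8 bytes on disk are fine, but that an ASCII-only view of them (roughly what the C locale offers) has no way to represent the accented character.

```python
# 'Procédures', written with an escape so this snippet itself is pure ASCII.
name = "Proc\u00e9dures"

# What actually sits on disk when the name is stored as UTF-8.
utf8_bytes = name.encode("utf-8")

# Encoding the same name for an ASCII-only locale: the accented
# character cannot be represented and is replaced by '?'.
as_ascii = name.encode("ascii", errors="replace")

# Reading the UTF-8 bytes as if they were ASCII: each of the two bytes
# of the accented character is invalid and becomes U+FFFD.
misread = utf8_bytes.decode("ascii", errors="replace")

print(utf8_bytes)
print(as_ascii)
print(misread)
```

The single-character and two-character replacements resemble the 'Proc?dures' directory and the '??' in the error dialog reported in this thread, though whether LibreOffice takes exactly these code paths is not something the thread establishes.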
The ls command will by default escape the character into its numeric code if it thinks the character is invalid in your locale. I can get ls to print the same thing as you (using shell-escaped $'\303\251') *iff* I first export LC_ALL=C (which is not a UTF-8 locale and therefore cannot print unicode characters).
This indicates something is wrong with your locale, because at the very least, your shell cannot parse the character correctly -- maybe neither can libreoffice.
Man, can't thank you enough. You guided me to the issue.

So, I tried what you said, but I couldn't modify LC_ALL at all - bash was complaining. If I echo it, I'd get back en_CA.UTF-8.

I started wondering if there's been an issue with locales since the install, so I figured I'd check /etc/locale.gen and regenerate them, and lo and behold - all locales were commented out. I uncommented en_CA.UTF-8, ran locale-gen, and now both `ls` and libreoffice work correctly.

Thanks, everyone, for your time, both for reading my questions and writing out answers, and for helping me fix this issue.

--
"That gum you like is going to come back in style."
August 2, 2019 11:10 AM, "Eli Schwartz via arch-general" <arch-general@archlinux.org> wrote:
The ls command will by default escape the character into its numeric code if it thinks the character is invalid in your locale. I can get ls to print the same thing as you (using shell-escaped $'\303\251') *iff* I first export LC_ALL=C (which is not a UTF-8 locale and therefore cannot print unicode characters).
This indicates something is wrong with your locale, because at the very least, your shell cannot parse the character correctly -- maybe neither can libreoffice.
"I forgot to generate the locales" will cause this issue. Try running `localedef --list-archive` and checking that en_CA.UTF-8 actually exists. If not, uncomment it in /etc/locale.gen and run `sudo locale-gen`. ~Celti
"I forgot to generate the locales" will cause this issue. Try running `localedef --list-archive` and checking that en_CA.UTF-8 actually exists. If not, uncomment it in /etc/locale.gen and run `sudo locale-gen`.
Right on the mark, Mr. Celti, I discovered this mere minutes before your mail. Please just allow me to save my honour by adding the fact that I'm troubleshooting a machine that isn't mine. -- "That gum you like is going to come back in style."
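For anyone who wants to script the same check, the following is a minimal Python sketch of the "was the locale actually generated?" test. The helper name `locale_is_available` is made up for this example; it relies only on the standard `locale` module, whose `setlocale` raises `locale.Error` when the C library refuses the requested locale (on glibc, that is what happens when the locale was never generated).

```python
import locale


def locale_is_available(name: str) -> bool:
    """Return True if the C library can activate the named locale.

    Hypothetical helper for illustration, not part of any package.
    """
    # Calling setlocale without a second argument only queries the
    # current setting, so it can be restored afterwards.
    previous = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, previous)


# The locale from this thread; the result depends on the machine.
print(locale_is_available("en_CA.UTF-8"))
```

On a glibc system where the locale is still commented out in /etc/locale.gen, this would be expected to print False, and True once `locale-gen` has been run.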
Can you determine some steps that exactly reproduce the problem? Assuming that the problem should manifest when opening the file using /usr/bin/loffice /path/to/file, I tried creating a test file and opening it, and it worked:
Hi Eli, good idea, I tried following your sequence as well.

I created a directory using `mkdir`, then launched LibreOffice and tried to save a file in it. An interesting thing happens: it actually creates a directory named 'Proc?dures' instead of the original 'Procédures' directory, and saves it in there. I repeated the test twice, because the first time around, I was puzzled enough that I wasn't sure I actually saved the file.

Furthermore, I copied the file using console into the 'Procédures', then opened it using libreoffice, and it opened the one in 'Proc?dures' - I know because I updated the file and saved it, and the latter one was updated.

The only difference between us is that I'm using the `libreoffice` launcher command, and you seem to have `loffice`? The package is also libreoffice-fresh, package version 6.2.5-1, and `libreoffice --version` 6.2.5.2 2@(build: 2)
The --version in ubuntu, that works, is 6.0.7.3

P.S. I am unsure how well Unicode fares in mailing lists, so I apologize if there are weird escape sequences in there. I just composed it with vim.

--
"That gum you like is going to come back in style."
However, you might be onto something here because, interestingly enough: while the BASH prompt and autocompletion feature both decode the character correctly, `ls` does not and outputs a sequence of escape codes:
That's interesting. If I run:

    touch Proc$'\303\251'dures

and then ls, I get it printing correctly with the accented character. Then if I do an os.listdir(b'.') in Python and look at its raw bytes, they are the same as if I type the character on my keyboard (US keyboard but with a compose key) and encode UTF8. So it looks to me to be UTF8 encoded (I do not understand the escape sequences \303\251 - once in Python I see the two bytes \xc3\xa9 for the character, which is the correct UTF8 encoding, but they do not map to the numbers in the bash escape sequences).

What happens if you run the following?

    $ echo $'\303\251'

I get the character printing correctly. This could be terminal-dependent behaviour; it works for me in xterm, tilix, alacritty and gnome-terminal. Perhaps if it doesn't work for you in one of these terminals it indicates there is a locale issue deeper than the check you already did to ensure the locale was set correctly.

On Fri, Aug 2, 2019 at 1:36 PM John Z. <johnz@pleasantnightmare.com> wrote:
Can you determine some steps that exactly reproduce the problem? Assuming that the problem should manifest when opening the file using /usr/bin/loffice /path/to/file, I tried creating a test file and opening it, and it worked:
Hi Eli, good idea, I tried following your sequence as well.
I created a directory using `mkdir`, then launched LibreOffice and tried to save a file in it. An interesting thing happens: it actually creates a directory named 'Proc?dures' instead of the original 'Procédures' directory, and saves it in there. I repeated the test twice, because the first time around, I was puzzled enough that I wasn't sure I actually saved the file.
Furthermore, I copied the file using console into the 'Procédures', then opened it using libreoffice, and it opened the one in 'Proc?dures' - I know because I updated the file and saved it, and the latter one was updated.
The only difference between us is that I'm using `libreoffice` launcher command, and you seem to have `loffice`? The package is also libreoffice-fresh, package version 6.2.5-1, and `libreoffice --version` 6.2.5.2 2@(build: 2) The --version in ubuntu, that works, is 6.0.7.3
P.S. I am unsure how well Unicode fares in mailing lists, so I apologize if there are weird escape sequences in there. I just composed it with vim.
-- "That gum you like is going to come back in style."
What happens if you run the following?
$ echo $'\303\251'
I get the character printing correctly.
Same here, it prints out fine. Terminal is Konsole.

I tried touching a new file with é, and ls again prints the escape sequence. However, when trying to `cat` the file by hitting Tab to get the autocompletion list, it prints it correctly there.

I am not entirely sure how to check for locale issues? I know there's an extensive page in the Arch wiki, which I checked, but I don't think any of the Troubleshooting issues applies.

I tried `localectl status` and, I dunno if this is normal, but it prints a different locale than the one set in $LANG.

    > echo $LANG
    en_CA.UTF-8
    > localectl status
       System Locale: LANG=en_US.UTF-8
           VC Keymap: n/a
          X11 layout: n/a

Changing keyboard layouts doesn't change the output of `localectl status`.

--
"That gum you like is going to come back in style."
Hi John,
> echo $LANG en_CA.UTF-8
A good command to use for these situations is locale(1). Note, the double quotes in the output are significant, as the fine man page explains. -- Cheers, Ralph.
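As background for reading the output of locale(1), here is a rough Python sketch of how the environment variables take precedence for a single category. It follows the usual POSIX order (LC_ALL first, then the category's own variable, then LANG) and assumes a fallback of "C" when nothing is set; the function name `effective_locale` is invented for this example. As far as I understand the man page, the double quotes mark values that are implied rather than set through the category's own variable, which is what the `implied` flag below imitates.

```python
def effective_locale(env, category):
    """Return (value, source_variable) for one LC_* category.

    Hypothetical helper: `env` is a plain dict standing in for the
    process environment, `category` is a name such as "LC_CTYPE".
    Empty strings are treated the same as unset variables.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:
            return value, var
    # Assumed fallback when nothing is set.
    return "C", None


env = {"LANG": "en_CA.UTF-8"}
value, source = effective_locale(env, "LC_CTYPE")

# Implied means the value did not come from LC_CTYPE itself.
implied = source != "LC_CTYPE"
print(value, source, implied)
```

Note that this only models which variable wins. It says nothing about whether the named locale actually exists on the system, which is a separate question.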
A good command to use for these situations is locale(1). Note, the double quotes in the output are significant, as the fine man page explains.
Hi Ralph, thanks for the tip! I wasn't aware of this command previously, and I'm reading through the man right now. -- "That gum you like is going to come back in style."
On 8/2/19 1:48 PM, Chris Billington via arch-general wrote:
... I do not understand the escape sequences \303\251 ...
They're octal:

    303 (octal) = 011 000 011 (binary) = 0 1100 0011 (binary) = c3 (hex)
    251 (octal) = 010 101 001 (binary) = 0 1010 1001 (binary) = a9 (hex)

HTH,
Dan
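The same conversion can be double-checked in Python. This is just a small sketch with made-up variable names: it parses the two octal escapes printed by `ls`, shows them in hex, and decodes them as UTF-8.

```python
# The two escapes from the ls output, \303 and \251, as octal strings.
octal_escapes = ["303", "251"]

# Parse each one in base 8 and collect the resulting byte values.
raw = bytes(int(digits, 8) for digits in octal_escapes)

hex_form = raw.hex()        # the same two bytes written in hexadecimal
char = raw.decode("utf-8")  # the character those bytes encode in UTF-8

print(hex_form, char)
```

So the bash-style octal escapes and the `\xc3\xa9` that Python shows are two spellings of the same pair of bytes.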
On Fri, 2 Aug 2019 at 14:59, John Z. <johnz@pleasantnightmare.com> wrote:
Hi everyone, there's a document on Dropbox, that has unicode character in its path (french character). Trying to open this document with libre office (Plasma is running) fails with 'file not found', and the path shown with error clearly presents the path with that unicode character replaced by '??'
What I tried:
* copy the document to a path where there's no Unicode - it opens
* copy the document using shell - it works
* copy the document using Dolphin (from Plasma) - it works
* check $LANG - it's set to `en_CA.UTF8`
Does `locale -a` show that locale? -- damjan
participants (8)
- Chris Billington
- Damjan Georgievski
- Dan Sommers
- Eli Schwartz
- Genes Lists
- John Z.
- Pat Burroughs
- Ralph Corderoy