[arch-general] journalctl and I/O errors
A couple months ago, I started getting I/O errors (see below) whenever I tried to do journalctl -n X for sufficiently large X (and journalctl would segfault). I assumed my hard drive was going to die, but in the mean time this was annoying so I did:

# cp -rp /var/log/journal{,-clone}
# rm -rf /var/log/journal
# mv /var/log/journal{-clone,}

There were a few errors after step 1 but this made the segfaulting stop. A few weeks later, same problem, same "fix". A few weeks later, ditto. In all this time, the only I/O errors I've ever seen have been related to accessing the journal. The SMART data for the drive says that the current pending sector count is 57, but the reallocated sector count is 0.

So if I'm reading this correctly, there have been a bunch of instances of unreadable sectors, somehow only pertaining to one directory despite efforts to make an independent copy of the data (which should put it in random other sectors), but none of them have been confirmed bad. So at this point I'm guessing it's a systemd bug (and there are no actual bad sectors), but it would be great if someone could confirm that any of this actually makes sense before I file a bug report based on ignorance and speculation.

Thanks!
- J.M.

-------------------------------------------
Example I/O error:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: BMDMA stat 0x5
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:c8:ca:92/00:00:2e:00:00/e0 tag 0 dma 4096 in
         res 51/40:05:cb:ca:92/40:00:2e:00:00/ee Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/100
sd 0:0:0:0: [sda] Unhandled sense code
sd 0:0:0:0: [sda] Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] Sense Key : 0x3 [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        2e 92 ca cb
sd 0:0:0:0: [sda] ASC=0x11 ASCQ=0x4
sd 0:0:0:0: [sda] CDB: cdb[0]=0x28:
        28 00 2e 92 ca c8 00 00 08 00
end_request: I/O error, dev sda, sector 781372107
ata1: EH complete
A couple months ago, I started getting I/O errors (see below) whenever I tried to do journalctl
You are able to read all the journald files, aren't you? You are running cp -r so it looks like so. `cp` would die with a non-zero exit status if there were read errors, I guess. If it's possible to read the file, journalctl should not segfault IMO, so it should be OK to file an issue.

--
Kind regards,
Damian Nowak
StratusHost
www.AtlasHost.eu
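For what it's worth, a quick way to confirm readability without going through journalctl at all is to force a raw read of every journal file and watch the exit status and the kernel log (a sketch; the default journal path is assumed):

# find /var/log/journal -type f -exec cat {} + > /dev/null   # exit status is non-zero if any file hits a read error
# dmesg | tail -n 20                                         # fresh ata/end_request errors would show up here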
On Thursday 30 Jan 2014 11:46:38 Nowaker wrote:
A couple months ago, I started getting I/O errors (see below) whenever I tried to do journalctl
You are able to read all the journald files, aren't you? You are running cp -r so it looks like so. `cp` would die with a non-zero exit status if there were read errors, I guess. If it's possible to read the file, journalctl should not segfault IMO, so it should be OK to file an issue.
IO errors can also cause a process to hang, and enter the dreaded "D" state. These processes usually can't be killed, because they're blocked in the kernel. Paul
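For reference, blocked processes like that show up with a "D" in the STAT column; a minimal way to list them (standard ps/awk, nothing journal-specific):

$ ps axo stat,pid,comm | awk '$1 ~ /^D/'   # "D" = uninterruptible sleep, almost always stuck on I/O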
You are able to read all the journald files, aren't you? You are running cp -r so it looks like so.
IO errors can also cause a process to hang, and enter the dreaded "D" state.
It varies. Sometimes it hangs as a D-state process; sometimes it explicitly says 'segmentation fault' and other times it just says 'bus error (core dumped)'.

Also cp -r does die with exit status 1 and errors:

/bin/cp: error reading 'journal/8ba8e3a22718630e590610ed000007dc/system@81fd64a9006446f2af954d739934edeb-0000000000073a5b-0004f09d7cc446d8.journal': Input/output error
/bin/cp: failed to extend 'journal-new/8ba8e3a22718630e590610ed000007dc/system@81fd64a9006446f2af954d739934edeb-0000000000073a5b-0004f09d7cc446d8.journal': Input/output error

Most of the files are copied though and everything works fine afterward.
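Another data point that might help the bug report: journalctl has a built-in consistency check (assuming a systemd recent enough to have --verify), which flags corrupt files even when they can still be read:

# journalctl --verify    # checks every journal file for internal consistency and reports per-file results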
On Thu, Jan 30, 2014 at 7:28 PM, Janna Martl <janna.martl109@gmail.com> wrote:
You are able to read all the journald files, aren't you? You are running cp -r so it looks like so.
IO errors can also cause a process to hang, and enter the dreaded "D" state.
It varies. Sometimes it hangs as a D-state process; sometimes it explicitly says 'segmentation fault' and other times it just says 'bus error (core dumped)'.
Also cp -r does die with exit status 1 and errors
/bin/cp: error reading 'journal/8ba8e3a22718630e590610ed000007dc/system@81fd64a9006446f2af954d739934edeb-0000000000073a5b-0004f09d7cc446d8.journal': Input/output error
/bin/cp: failed to extend 'journal-new/8ba8e3a22718630e590610ed000007dc/system@81fd64a9006446f2af954d739934edeb-0000000000073a5b-0004f09d7cc446d8.journal': Input/output error
Most of the files are copied though and everything works fine afterward.
Mind sharing the coredump so we could have a look? :)
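(In case no core file is lying around, here is a sketch of one way to grab one, assuming crashes are being routed to systemd-coredump -- on slightly older systemd the tool is called systemd-coredumpctl. Otherwise, raising the core ulimit and loading the dump in gdb works just as well.)

# coredumpctl list journalctl    # recorded journalctl crashes, if any
# coredumpctl gdb journalctl     # open the newest one in gdb; "bt full" there gives the backtrace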
On Mon, Feb 3, 2014 at 5:48 AM, Martti Kühne <mysatyre@gmail.com> wrote:
Mind sharing the coredump so we could have a look? :)
Here's a backtrace, in case that helps:

(gdb) bt full
#0  journal_file_move_to_object (f=f@entry=0x969cbc0, type=type@entry=3, offset=2638264, ret=ret@entry=0xbfc6505c) at src/journal/journal-file.c:434
        r = 1
        t = 0xb2c781b8
        o = 0xb2c781b8
        __PRETTY_FUNCTION__ = "journal_file_move_to_object"
#1  0x0805fb37 in generic_array_get (f=f@entry=0x969cbc0, first=375376, i=<optimized out>, ret=ret@entry=0xbfc651ac, offset=offset@entry=0xbfc651b0) at src/journal/journal-file.c:1443
        o = 0xb2c68b40
        p = <optimized out>
        a = <optimized out>
        t = 3158
        r = <optimized out>
        ci = 0x9687280
        __PRETTY_FUNCTION__ = "generic_array_get"
#2  0x08060269 in journal_file_next_entry (f=f@entry=0x969cbc0, o=o@entry=0xb2c77c38, p=2636856, direction=direction@entry=DIRECTION_DOWN, ret=ret@entry=0xbfc651ac, offset=offset@entry=0xbfc651b0) at src/journal/journal-file.c:1920
        i = 3181
        r = <optimized out>
        __PRETTY_FUNCTION__ = "journal_file_next_entry"
#3  0x0805855c in next_with_matches (j=j@entry=0x9670850, f=f@entry=0x969cbc0, direction=direction@entry=DIRECTION_DOWN, ret=ret@entry=0xbfc651ac, offset=offset@entry=0xbfc651b0) at src/journal/sd-journal.c:814
        c = 0xb2c77c38
        cp = 2636856
        __PRETTY_FUNCTION__ = "next_with_matches"
#4  0x0805a744 in next_beyond_location (offset=<synthetic pointer>, ret=0xbfc651a4, direction=DIRECTION_DOWN, f=0x969cbc0, j=0x9670850) at src/journal/sd-journal.c:836
        c = 0xb2c77c38
        cp = 2636856
        r = <optimized out>
#5  real_journal_next (j=j@entry=0x9670850, direction=direction@entry=DIRECTION_DOWN) at src/journal/sd-journal.c:895
        found = <optimized out>
        f = 0x969cbc0
        new_file = 0x969b4e0
        new_offset = 399336
        o = 0xb324f718
        p = <optimized out>
        i = 0x968df04
        r = <optimized out>
        __func__ = "real_journal_next"
#6  0x0805b08e in sd_journal_next (j=0x9670850) at src/journal/sd-journal.c:931
        No locals.
#7  0x0804daa1 in main (argc=1, argv=0xbfc65574) at src/journal/journalctl.c:1573
        flags = <optimized out>
        r = 0
        j = 0x9670850
        need_seek = true
        previous_boot_id = {bytes = "X\357`\266L\307O̟\274\356\232\177\241\b2", qwords = {14722204839188688728, 3605309071142337695}}
        previous_boot_id_valid = true
        first_line = true
        n_shown = 41695
        ellipsized = false
        __func__ = "main"
On Tue, Feb 4, 2014 at 7:33 AM, Janna Martl <janna.martl109@gmail.com> wrote:
On Mon, Feb 3, 2014 at 5:48 AM, Martti Kühne <mysatyre@gmail.com> wrote:
Mind sharing the coredump so we could have a look? :)
Here's a backtrace, in case that helps:
(gdb) bt full
#0  journal_file_move_to_object (f=f@entry=0x969cbc0, type=type@entry=3, offset=2638264, ret=ret@entry=0xbfc6505c) at src/journal/journal-file.c:434
Uhh... That error is inside the inner workings of the journal. The data received there is obviously corrupted. Be really careful with that drive now; I suggest immediate replacement, unless you prefer relying on imposition of hands and summoning the gods of sheer luck to get the same data back that was written.

cheers!
mar77i
Conclusion (as I understand it):

1. There is definitely a bug in Journalctl: it crashes (segfaults) on I/O errors.

2. You have a drive that is failing, or your BIOS might not be set correctly. This is causing the I/O errors. How large is the drive? You might have to turn off settings such as "SATA legacy compatibility" or the like -- I had a 3TB drive that would cause ATA command errors on a ~2006 computer; I found this option in the BIOS and as soon as I turned it off everything worked perfectly.

Although journalctl should not crash on I/O errors, I think it is not unreasonable to assume that many other apps do not tolerate I/O errors either. So I would say: you should still report this bug upstream.

--
Sébastien Leblanc
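To answer the size question and to watch the two counters mentioned at the start of the thread, smartmontools gives both in one place (a sketch; the device name is an assumption):

# smartctl -i /dev/sda                                   # identify info, including the drive's capacity
# smartctl -A /dev/sda | grep -Ei 'pending|reallocat'    # Current_Pending_Sector vs. Reallocated_Sector_Ct
# smartctl -H /dev/sda                                   # the drive's own overall health verdict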
Hi, I agree on both points.

On Wed, Feb 5, 2014 at 10:03 PM, Sébastien Leblanc <leblancsebas@gmail.com> wrote:
Conclusion (as I understand it):
1. There is definitely a bug in Journalctl: it crashes (segfaults) on I/O errors.
A few months ago I had a problem with btrfs. I set the +C attribute (disable copy-on-write) on existing journal files. Btrfs recommends setting the attribute on empty files, and it seems it was confused that I applied it to non-empty ones. Btrfs started returning an I/O error when I tried to read a file with 'less', and journalctl started crashing with a segfault. It is very similar to what is being discussed here, and I agree that journalctl should play nicer. If anyone still sees this problem, please run 'strace journalctl ...'. If it shows a filesystem operation returning an I/O error right before the SEGFAULT, that would confirm the current thesis.
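Something along these lines should show it (a sketch; the -n value is just a placeholder for whatever is large enough to trigger the crash):

# strace -f -o /tmp/journalctl.trace journalctl -n 100000 > /dev/null
# tail -n 30 /tmp/journalctl.trace    # an EIO return (or a SIGBUS/SIGSEGV delivery) right at the end is the smoking gun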
2. You have a drive that is failing, or your BIOS might not be set correctly. This is causing the I/O errors. How large is the drive? You might have to turn off settings such as "SATA legacy compatibility" or the like -- I had a 3TB drive that would cause ATA command errors on a ~2006 computer; I found this option in the BIOS and as soon as I turned it off everything worked perfectly.
Although journalctl should not crash on I/O errors, I think it is not unreasonable to assume that many other apps do not tolerate I/O errors either. So I would say: you should still report this bug upstream.
-- Sébastien Leblanc
On Thu, Feb 6, 2014 at 1:03 AM, Sébastien Leblanc <leblancsebas@gmail.com> wrote:
Conclusion (as I understand it):
1. There is definitely a bug in Journalctl: it crashes (segfaults) on I/O errors.
2. You have a drive that is failing, or your BIOS might not be set correctly.
Thanks, all, for the analysis. I have submitted a bug report [1] for systemd. Also, it seems you were right about hardware failure -- though I still can't get smartctl to acknowledge anything being wrong (except for nonzero current pending sector count), I ran badblocks, which found a bunch of errors, including on sectors corresponding to files outside /var/log/journal. Guess I should get a new drive.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=74714
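(For reference, a read-only scan along these lines is safe even on a mounted drive -- a sketch, not necessarily the exact invocation used:)

# badblocks -sv /dev/sda > bad-blocks.txt    # default read-only test; progress goes to the terminal, bad block numbers to the file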
On 8 February 2014 12:06, Janna Martl <janna.martl109@gmail.com> wrote:
On Thu, Feb 6, 2014 at 1:03 AM, Sébastien Leblanc <leblancsebas@gmail.com> wrote:
Conclusion (as I understand it):
1. There is definitely a bug in Journalctl: it crashes (segfaults) on I/O errors.
2. You have a drive that is failing, or your BIOS might not be set correctly.
Thanks, all, for the analysis. I have submitted a bug report [1] for systemd. Also, it seems you were right about hardware failure -- though I still can't get smartctl to acknowledge anything being wrong (except for nonzero current pending sector count), I ran badblocks, which found a bunch of errors, including on sectors corresponding to files outside /var/log/journal. Guess I should get a new drive.
Do not rely too much on SMART data; in their hard disk failure study, Google concluded that when SMART reports a drive as unhealthy, it is often right, but for many drives that failed, SMART was still reporting them as healthy.
"Failure Trends in a Large Disk Drive Population", by Google Inc. [...] Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that mod- els based on SMART parameters alone are unlikely to be useful for predicting individual drive failures
http://static.googleusercontent.com/media/research.google.com/en//archive/di...
-- Sébastien Leblanc
Am 30.01.2014 11:46, schrieb Nowaker:
If it's possible to read the file, journalctl should not segfault IMO, so it should be OK to file an issue.
No program should ever segfault. Unexpected input or errors must be handled properly.
On Thu, 30 Jan 2014 01:11:54 -0500 Janna Martl <janna.martl109@gmail.com> wrote:
A couple months ago, I started getting I/O errors (see below) whenever I tried to do journalctl -n X for sufficiently large X (and journalctl would segfault). I assumed my hard drive was going to die, but in the mean time this was annoying so I did:
# cp -rp /var/log/journal{,-clone}
# rm -rf /var/log/journal
# mv /var/log/journal{-clone,}
There were a few errors after step 1 but this made the segfaulting stop. A few weeks later, same problem, same "fix". A few weeks later, ditto. In all this time, the only I/O errors I've ever seen have been related to accessing the journal. The SMART data for the drive says that the current pending sector count is 57, but the reallocated sector count is 0.
So if I'm reading this correctly, there have been a bunch of instances of unreadable sectors, somehow only pertaining to one directory despite efforts to make an independent copy of the data (which should put it in random other sectors), but none of them have been confirmed bad. So at this point I'm guessing it's a systemd bug (and there are no actual bad sectors), but it would be great if someone could confirm that any of this actually makes sense before I file a bug report based on ignorance and speculation.
The errors are pretty low-level. So this is either the hard drive failing in an obscure fashion (I/O errors do not necessarily imply "bad blocks") or a strange filesystem problem.

Try running a SMART long test and a thorough fsck. If those come up clean, try mounting another partition (or bind mounting a directory from another partition) at /var/log/journal and see if the errors persist. Finally, if /var is a separate partition, I'd back it up, fill it with zeroes (cat /dev/zero > /dev/sdXY), and if no errors are encountered, restore. Hopefully, the drive firmware will relocate all bad sectors (if any).

Cheers,
L.
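Roughly, those steps translate into something like this (a sketch only; device and partition names are placeholders, and the last command destroys everything on that partition, so run it only after the backup):

# smartctl -t long /dev/sda                 # kick off the drive's long self-test
# smartctl -l selftest /dev/sda             # read the result once the test has finished
# fsck -f /dev/sdXY                         # thorough check; run it with the filesystem unmounted (e.g. from a live USB)
# mount --bind /somewhere/else/journal /var/log/journal   # try the journal on a different partition (path is hypothetical)
# cat /dev/zero > /dev/sdXY                 # DESTRUCTIVE zero-fill so the firmware can remap any pending sectors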
Thanks!
- J.M.
-------------------------------------------
Example I/O error:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: BMDMA stat 0x5
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:c8:ca:92/00:00:2e:00:00/e0 tag 0 dma 4096 in
         res 51/40:05:cb:ca:92/40:00:2e:00:00/ee Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/100
sd 0:0:0:0: [sda] Unhandled sense code
sd 0:0:0:0: [sda] Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] Sense Key : 0x3 [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        2e 92 ca cb
sd 0:0:0:0: [sda] ASC=0x11 ASCQ=0x4
sd 0:0:0:0: [sda] CDB: cdb[0]=0x28:
        28 00 2e 92 ca c8 00 00 08 00
end_request: I/O error, dev sda, sector 781372107
ata1: EH complete
--
Leonid Isaev
GnuPG key: 0x164B5A6D
Fingerprint: C0DF 20D0 C075 C3F1 E1BE 775A A7AE F6CB 164B 5A6D
participants (8)
- Anatol Pomozov
- Janna Martl
- Leonid Isaev
- Martti Kühne
- Nowaker
- Paul Gideon Dann
- Sébastien Leblanc
- Thomas Bächler