System is randomly freezing, would like troubleshooting help
I have a System76 Thelio machine running Arch (fully up to date as of a few hours ago today). For quite a while now (at least a few months), it's been randomly hanging. I've configured the system to provide a crash dump per the Kdump wiki page, but whenever I reboot it after one of these hangs (or use the sysrq crash key), there is no /proc/vmcore file, and there's nothing in the logs. As you can understand, this makes troubleshooting difficult.
In addition to this, I've not been able to identify many common factors of the hangs. The only thing that seems to be consistent is that most of the hangs occur when the system puts the display to sleep. But even this isn't consistent; today's hang occurred while the display was active (though I turned off the monitor).
I've changed hardware and operating systems a few times. I originally ran PopOS on this machine, and experienced hangs there with an NVIDIA graphics card, but I moved to AMD when installing Manjaro. Again, to be clear, the machine does have a very up to date Arch install presently. The hangs occur with both X11 and Wayland. Sometimes it happens a few hours after I boot the machine, sometimes (like today's) it took a week or two.
About the only unusual thing about the system is that I'm running ZFS. The root filesystem for this installation runs on ZFS, but this wasn't the case when it ran PopOS. I still had a ZFS pool on the PopOS machine. I have another ZFS pool with my home directory and my libvirt VMs on it. That said, neither of the pools have any errors, and I scrub them regularly.
As I said at the start of this message, I've been struggling to figure out the problem for a few months and I'm not sure how to make progress troubleshooting it. I don't see anything in dmesg about overheating, or hardware failures, or anything like that. The machine works great... until it very suddenly doesn't.
So how do I go about figuring out what's going on? I'd love to be able to
-- Cheers, Luna Celeste
On 4/8/23 19:55, Luna Celeste wrote:
Sounds very annoying.
A couple of thoughts:
- Does the system journal reveal anything? If not, it seems less like a kernel crash and more like a graphics-type hang. It's good to avoid any NVIDIA drivers. Look at not just the kernel but 'other processes' too: journalctl --since -1h or whatever.
- Can you keep a remote ssh session open from another computer? When the hang occurs, is the remote ssh session also dead, or just the console/monitor/keyboard?
- Definitely worth running memtest86+ as well as a smartctl test on your disk(s). While bad memory is possible, your symptoms don't seem quite consistent with that. I'd check anyway.
- Is your CPU AMD? Can you turn off sleep and see if it makes a difference? There were some issues with s2idle and AMD, though I thought they were worked around/fixed back in 6.1. Also, is your BIOS fully up to date?
On Sun, Apr 9, 2023 at 8:08 AM Genes Lists <lists@sapience.com> wrote:
On 4/8/23 19:55, Luna Celeste wrote:
Sounds very annoying
Couple of thoughts:
In my case:
- does the system journal reveal anything? If not it seems less like a kernel crash and more like graphics type hangs - good to avoid any nvidia drivers. Not just kernel but 'other processes' - journal --since -1h or whatever.
The journal is clean, and there's the blinking Caps Lock. So it definitely looks like a crash.
- Can you keep a remote ssh from another computer - and when hang occurs is the remote ssh also dead or just the console/monitor keyboard?
I actually tried that, but the SSH shell closes before the crash happens. No additional information is gained even if it's running dmesg --follow.
- Definitely worth checking memtest86+ as well as smarctl test your disk(s). While bad mem is possible your symptoms dont seem quite consistent with that - i'd check anyway.
Didn't recently check for bad mem. Like you already implied, it would be unexpected for it to only cause issues during suspend, but I'll add it to my to-do list anyway. SMART is clear. That said, the firmware in this laptop has some issues: it occasionally (but only very rarely) causes unrelated hangups in both Linux and Windows.
- is your CPU AMD? Can you turn off sleep and see if it makes a difference - there were some issues with s2idle and AMD, though I thought the were worked around/fixed back in 6.1. Also is your bios fully up to date?
Intel. Fully up to date BIOS. Will add disabling s2idle to my to-do list as well, just in case. Thanks, Jonas
On 4/9/23 08:06, Jonas Malaco wrote:
As you say, your case is definitely a hard crash.
- Do you have any modules that taint the kernel?
- The s2idle issue was AMD-only, not Intel, as far as I know. But it's good to prevent any kind of sleep or hibernation, just to rule that out.
- Your occasional hangups might also suggest hardware problems, of course, which can be tricky to diagnose, especially since they happen in both Windows and Linux. What leads you to suspect BIOS bugs rather than hardware?
I assume your filesystems are all clean.
On Sun, Apr 9, 2023 at 9:52 AM Genes Lists <lists@sapience.com> wrote:
On 4/9/23 08:06, Jonas Malaco wrote:
As you say, your case is definitely a hard crash.
- Do you have any modules that taint the kernel?
Hi, On that particular system, no.
- the s2idle was only AMD not intel as far as I know. But good to prevent any kind of sleep or hibernation just to keep that away.
I'm curious to see if disabling s2idle will have any effect on the less common crashes that happen when just the monitor turns off, without suspending to RAM.
- your occasional hangups might also suggest hardware problems of course, which can be tricky to diagnose - esp since happens in both windows and linux. What leads you to bios bugs vs hardware? Assume your filesys are all clean.
Honestly, it's been a while since the last one, and I can't find my notes or old logs. IIRC out of nowhere a bunch of PCI AER errors would pop up (and Windows also showed similar warnings from time to time). Initially this happened a lot, but only sometimes would a hangup follow. At some point I found similar reports, one or more of which pointed to a possible BIOS/firmware issue. As time went on, and new BIOS versions were available, the problem got progressively less common, despite virtually no hardware changes (there was one, but there was no observable decrease or increase in AER errors or hangups). And now I don't have a single AER error logged for the past 2 months (which is as far as the logs on that machine go). But I went too far before when I said that it _was_ a firmware issue. It would be more reasonable to say that those _might_ have been firmware issues. Thanks, Jonas
On Sun, Apr 9, 2023 at 9:52 AM Genes Lists <lists@sapience.com> wrote:
On 4/9/23 08:06, Jonas Malaco wrote:
As you say, your case is definitely a hard crash.
- Do you have any modules that taint the kernel?
- the s2idle was only AMD not intel as far as I know. But good to prevent any kind of sleep or hibernation just to keep that away.
- your occasional hangups might also suggest hardware problems of course, which can be tricky to diagnose - esp since happens in both windows and linux. What leads you to bios bugs vs hardware? Assume your filesys are all clean.
Oops, filesystems are clean too. Since Luna mentioned using ZFS: I'm actually using ext4 (and FAT32 for /efi).
On Sun, Apr 09, 2023 at 09:06:49 -0300, Jonas Malaco wrote:
On Sun, Apr 9, 2023 at 8:08 AM Genes Lists <lists@sapience.com> wrote:
On 4/8/23 19:55, Luna Celeste wrote:
Sounds very annoying
Couple of thoughts:
In my case:
- does the system journal reveal anything? If not it seems less like a kernel crash and more like graphics type hangs - good to avoid any nvidia drivers. Not just kernel but 'other processes' - journal --since -1h or whatever.
The journal is clean, and there's the blinking Caps Lock. So it definitely looks like a crash.
Since you're seeing a hard crash rather than a hard freeze, the next step is to obtain a crash dump to get more information. The Kdump wiki page ( https://wiki.archlinux.org/title/Kdump ) provides information on how to do this. -- Cheers, Luna Celeste
On Sun, Apr 9, 2023 at 4:10 PM Luna Celeste <luna@unixpoet.dev> wrote:
On Sun, Apr 09, 2023 at 09:06:49 -0300, Jonas Malaco wrote:
On Sun, Apr 9, 2023 at 8:08 AM Genes Lists <lists@sapience.com> wrote:
On 4/8/23 19:55, Luna Celeste wrote:
Sounds very annoying
Couple of thoughts:
In my case:
- does the system journal reveal anything? If not it seems less like a kernel crash and more like graphics type hangs - good to avoid any nvidia drivers. Not just kernel but 'other processes' - journal --since -1h or whatever.
The journal is clean, and there's the blinking Caps Lock. So it definitely looks like a crash.
Since you're seeing a hard crash rather than a hard freeze, the next step is to obtain a crash dump to get more information. The Kdump wiki page ( https://wiki.archlinux.org/title/Kdump ) provides information on how to do this.
Now that the issue is back I do need to try that, yes. Thanks, Jonas
-- Cheers, Luna Celeste
On Sun, Apr 09, 2023 at 07:07:39 -0400, Genes Lists wrote:
On 4/8/23 19:55, Luna Celeste wrote:
Sounds very annoying
You have no idea.
Couple of thoughts:
- does the system journal reveal anything? If not it seems less like a kernel crash and more like graphics type hangs - good to avoid any nvidia drivers. Not just kernel but 'other processes' - journal --since -1h or whatever.
There's nothing in any of the logs. Nothing in dmesg, nothing in journalctl. The system works fine, hangs, I reboot it, and the next messages I see are the system booting. This is an AMD machine with no NVIDIA hardware / software whatsoever. Unlike in Jonas's case, my capslock does not flash when the machine hangs--unless this is something that only happens a few times immediately after a crash.
- Can you keep a remote ssh from another computer - and when hang occurs is the remote ssh also dead or just the console/monitor keyboard?
I almost always have a mosh connection from my laptop to the Linux machine, and it too hangs. I've gotten disappointingly familiar with mosh's blue/white "can't reach remote machine" message.
- Definitely worth checking memtest86+ as well as smarctl test your disk(s). While bad mem is possible your symptoms dont seem quite consistent with that - i'd check anyway.
This was one of the first things I tried! I've run both the proprietary and the open source versions and both show no errors. Beyond this, the system is using ECC memory, so I would expect to see checksum errors or the like in the logs, but again, nothing. I've also checked smartctl, and again nothing. Also, as I said, I'm running ZFS (and only ZFS), and I would expect it to tell me that there are checksum errors *long* before smartctl shows anything.
- is your CPU AMD? Can you turn off sleep and see if it makes a difference - there were some issues with s2idle and AMD, though I thought the were worked around/fixed back in 6.1. Also is your bios fully up to date?
As said above, yes, the CPU is AMD, but I wasn't clear in my original message, I'm sorry. *Only* the display is sleeping. I do not have any other power saving measures configured. And, even if that was the case, my research shows that the AMD s2idle bug was fixed in 5.15. This machine is running 6.2.10. I'm not sure about the BIOS, will have to check that. Thank you everyone for helping me dig into this. It has been quite frustrating! I am thoroughly out of ideas, so anything helps. -- Cheers, Luna Celeste
On 4/9/23 11:00, Luna Celeste wrote:
message, I'm sorry. *Only* the display is sleeping. I do not have any other power saving measures configured. And, even if that was the case,
I have in the past seen display-related problems when display(s), especially more than one, power themselves down. This was with sddm/KDE, which I abandoned after years and years of use; I changed to gdm/GNOME, which did not have any such issues.
That said, since your remote ssh also dies, this sounds like it could be more than a display issue in your case, and if I recall correctly you have had similar issues with 2 different graphics cards as well.
In case there is a filesystem problem, are you running 'journalctl -f' in your remote ssh? If there is a filesystem problem, it may show something even if nothing is written to the journal.
I have no experience with ZFS, but plenty of others do and seem not to have problems. Let's not count it out quite yet, though.
On Sun, Apr 09, 2023 at 11:26:03 -0400, Genes Lists wrote:
On 4/9/23 11:00, Luna Celeste wrote:
message, I'm sorry. *Only* the display is sleeping. I do not have any other power saving measures configured. And, even if that was the case,
I have in the past seen display related problems when display(s), esp more than 1 display, power themselves down - this was with sddm/kde - which, after years and years since abandoned, and changed to gdm/gnome which did not have any such issues.
I've disabled the display sleep (though will be powering off the display altogether when not using the machine). This machine does run sddm/KDE, so perhaps there is a connection.
In case there is a filesystem problem, are you running 'journalctl -f' in your remote ssh - if there is a filesystem problem, it may show something even if nothing is written to the journal.
I have no experience with zfs, but plenty of others do and seem not to have problems. Let's not count it out quite yet tho.
Any issues with ZFS will be reported with the ZFS tools, and I can run diagnostics to expose issues. That said, I have set up a tmux session with `journalctl -f` running in order to catch anything else. Additionally, I just realized that the last time I ran the diagnostic for my data pool was in October, so I'm running it again. Arch provides systemd timers for these tasks, and I'll look at enabling them. Following up on a point from an earlier message, I have verified that my BIOS and firmware are up to date. -- Cheers, Luna Celeste
On 4/9/23 12:19, Luna Celeste wrote:
I have set up a tmux session with `journalctl -f` running in order to catch anything else.
hopefully it catches something.
Additionally, I just realized that the last time I ran the diagnostic for my data pool was in October, so I'm running it again.
good to check.
Following up on a point from an earlier message, I have verified that my BIOS and firmware are up to date.
Do you have more than 1 display? To minimize interruption to your setup, you can replace sddm with gdm and use it to start Plasma. It might not hurt (not a fan of sddm, to be frank).
Plasma 5.27 is supposed to have finally (after a year or so) improved/fixed the terrible display issues it suffered from (especially multi-monitor). The most annoying were triggered by display(s) going to sleep and/or waking up, which wreaked havoc on KDE.
But I would not be shocked if bugs still remained, given how bad the bugs were: largely design flaws. But certainly eliminating sddm would be a good thing (imho). It was those display issues, unfixable without a big rewrite, that finally pushed me away from KDE about a year back now.
On Sun, Apr 09, 2023 at 12:51:46 -0400, Genes Lists wrote:
On 4/9/23 12:19, Luna Celeste wrote:
Additionally, I just realized that the last time I ran the diagnostic for my data pool was in October, so I'm running it again.
good to check.
No errors on the pool I'm using for my home directory and libvirt VMs.
Following up on a point from an earlier message, I have verified that my BIOS and firmware are up to date.
do you have more than 1 display? To minimize interruption to your set up you can replace sddm with gdm and use it to start plasma - might not hurt (not a fan of sddm to be frank).
I only have the one display. I haven't encountered any issues with KDE or SDDM. I used both long before these problems started. I can try gdm, and will get that set up later tonight.
plasma 5.27 supposedly finally (after a year or so) is supposed to have improved/fixed the terrible display issues it suffered from (esp multi monitor). The most annoying were triggered by display(s) going to sleep and/or waking up which wreaked havoc on kde.
But I would not be shocked if bugs still remained; given how bad the bugs were - largely design flaws. But certainly eliminating sddm would be a good thing (imho). It was those unfixable without a big rewrite display issues that finally pushed me away from kde about a year back now.
Do you have any links or resources I can read about these issues? My day to day use with KDE has been rather enjoyable, and I'd hate to have to use something else. -- Cheers, Luna Celeste
On 4/9/23 13:22, Luna Celeste wrote:
Do you have any links or resources I can read about these issues? My day to day use with KDE has been rather enjoyable, and I'd hate to have to use something else.
If you're not having troubles then perhaps don't change. I used to be a big fan too. And your symptoms certainly sound different enough, and more like driver/hardware than confused display software.
It's long ago now and, as I said, they are supposed to have made improvements in 5.27. The issues I directly experienced were with dual-monitor setups. You could look at bugs.kde.org and search for display, dual display, or maybe dual display crash. Possibly reddit too, or forum.kde.org.
As I recall, in addition to the core design flaws, there were some nasty interactions between sddm and the screen locker in addition to the display power bugs.
On Sun, Apr 09, 2023 at 17:18:55 -0400, Genes Lists wrote:
On 4/9/23 13:22, Luna Celeste wrote:
Do you have any links or resources I can read about these issues? My day to day use with KDE has been rather enjoyable, and I'd hate to have to use something else.
If you're not having troubles then perhaps don't change - i used to be a big fan too. And your symptoms certainly sound different enough and more driver/hardware than confused display software.
Disabling monitor sleep did not fix it. I've switched to gdm to see if that solves it, though like you I am skeptical. Replying to another message, here is the output from sensors(1):
$ sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A

nvme-pci-0300
Adapter: PCI adapter
Composite: +54.9°C (low = -273.1°C, high = +81.8°C) (crit = +84.8°C)
Sensor 1: +54.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +66.8°C (low = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: +16.8°C (crit = +20.8°C)
temp2: +16.8°C (crit = +20.8°C)
temp3: +16.8°C (crit = +20.8°C)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +57.0°C
Tccd1: +51.5°C
Tccd2: +51.0°C

amdgpu-pci-0b00
Adapter: PCI adapter
vddgfx: 6.00 mV
fan1: 976 RPM (min = 0 RPM, max = 2700 RPM)
edge: +49.0°C (crit = +110.0°C, hyst = -273.1°C) (emerg = +115.0°C)
junction: +55.0°C (crit = +110.0°C, hyst = -273.1°C) (emerg = +115.0°C)
mem: +56.0°C (crit = +105.0°C, hyst = -273.1°C) (emerg = +110.0°C)
PPT: 29.00 W (cap = 186.00 W)
$
The system is always idle when it hangs, and indeed is mostly idle overall. Some of these numbers do look a little high, but they're not at the critical points so I'm not sure what to make of it.
-- Cheers, Luna Celeste
On 4/10/23 15:15, Luna Celeste wrote:
Disabling monitor sleep did not fix it. I've switched to gdm to see if that solves it, though like you I am skeptical. Replying to another message, here is the output from sensors(1):
Laptops tend to run warmish; nothing looks scary at all to me anyway.
Hello, On 4/10/2023 9:15 PM, Luna Celeste wrote: ... [snip] ...
The system is always idle when it hangs, and indeed is mostly idle overall. Some of these numbers do look a little high, but they're not at the critical points so I'm not sure what to make of it.
An AMD CPU that hangs when it's mostly idle rings a bell: Try disabling c6 and see if the problem goes away. Some older AMD CPUs had problems waking up from deep sleep. The script [1] can do it for you:
~~
# modprobe msr
# zenstates.py --c6-disable
~~
-- Best regards, ihad
[1]: https://github.com/r4m0n/ZenStates-Linux
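Before or after toggling C6, it can help to see which idle states the kernel exposes and how often each one is entered. The following is a minimal Python sketch and not part of ZenStates-Linux. It assumes the standard cpuidle sysfs layout (/sys/devices/system/cpu/cpuN/cpuidle/stateM/ with name, usage and disable files); the function name read_idle_states is made up for illustration. How the listed state names map onto the hardware C6 state depends on the idle driver, so treat the output as a hint only.

```python
from pathlib import Path


def read_idle_states(cpu_dir):
    """Return (state_dir, name, usage, disabled) tuples for one CPU.

    cpu_dir is a directory such as /sys/devices/system/cpu/cpu0.
    An empty list is returned when the CPU exposes no cpuidle directory.
    """
    cpuidle = Path(cpu_dir) / "cpuidle"
    if not cpuidle.is_dir():
        return []
    states = []
    # Sort numerically so that state10 comes after state9.
    for state in sorted(cpuidle.glob("state*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())          # times entered
        disabled = (state / "disable").read_text().strip() != "0"
        states.append((state.name, name, usage, disabled))
    return states
```

Calling read_idle_states("/sys/devices/system/cpu/cpu0") once before and once after a change shows whether the usage counter of the deepest state has stopped increasing.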
On Mon, Apr 10, 2023 at 21:59:24 +0200, Shawn Michaels wrote:
Hi,
On 9 April 2023 01:55:34 CEST, Luna Celeste <luna@unixpoet.dev> wrote:
it's been randomly hanging.
Long shot, but I had random freezes on a machine years ago and it turned out that I forgot to install the intel-ucode package [1]. EDIT: I realize now that you're running an AMD cpu.
Also, things that may help you track this down: - monitor /proc/interrupts when it freezes
This is a 16 core processor and there's too much output on my 27" display to view it all at once; suggestions?
- remove some hardware, see if crashes disappear, add back progressively
I tried that a few months ago; as I said in an earlier message, since I don't know what's causing the issue, and since right now it's not predictable, I don't have a timeframe on how much testing is "enough".
- increase kernel verbosity (I forgot how I did this, maybe [2] will help?)
Will look into this once the memtest86+ is finished; see below.
- when and how exactly did it start?
At least a few months ago; I'm not sure of the specifics, but I don't think it was after adding any particular bit of hardware. Of course, that doesn't rule out driver bugs.
Definitely worth checking memtest86+
Keep in mind that it may take days of stress testing memory before catching an error. I caught a faulty RAM slot on a MB once after 3 days of running memtest. I suggest running for an entire week.
I'm currently doing this as it's an easy troubleshooting step. So far I'm at pass 5 and have run for 13.5h with no errors. I'll keep folks updated after the full week has passed.
On Tue, Apr 11, 2023 at 09:52:14 +0200, ihad wrote:
An AMD CPU that hangs when it's mostly idle rings a bell: Try disabling c6 and see if the problem goes away. Some older AMD CPUs had problems waking up from deep sleep. The script [1] can do it for you:
This is a 16 core AMD Ryzen 9 3950X; does that count as older? The machine is only a few years old. I understand that's quite a while in tech, but I want to be sure. -- Cheers, Luna Celeste
Hi, On 4/11/2023 5:49 PM, Luna Celeste wrote:
An AMD CPU that hangs when it's mostly idle rings a bell: Try disabling c6 and see if the problem goes away. Some older AMD CPUs had problems waking up from deep sleep. The script [1] can do it for you:
This is a 16 core AMD Ryzen 9 3950X; does that count as older? The machine is only a few years old. I understand that's quite a while in tech, but I want to be sure.
I had that problem on this CPU:
~~
processor : 8
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 5 1600X Six-Core Processor
stepping : 1
~~
According to Google my model was launched in April, 2017. I'd give it a shot, but YMMV...
-- Regards, ihad.
Den ons 12 apr. 2023 kl 09:51 skrev ihad <ihad@d-tor.org>:
Hi,
On 4/11/2023 5:49 PM, Luna Celeste wrote:
An AMD CPU that hangs when it's mostly idle rings a bell: Try disabling c6 and see if the problem goes away. Some older AMD CPUs had problems waking up from deep sleep. The script [1] can do it for you:
This is a 16 core AMD Ryzen 9 3950X; does that count as older? The machine is only a few years old. I understand that's quite a while in tech, but I want to be sure.
I had that problem on this CPU: ~~ processor : 8 vendor_id : AuthenticAMD cpu family : 23 model : 1 model name : AMD Ryzen 5 1600X Six-Core Processor stepping : 1 ~~ According to Google my model was launched in April, 2017. I'd give it a shot, but YMMV...
At work we've been having a problem with a 12 core 3900X that would randomly lock up in a pattern similar to yours.
13 days ago I added rcu_nocbs=0-23 processor.max_cstate=1 kernel parameters and it's been running fine since then. Don't want to say the issue is finally solved until it's been running for a little longer, but it's something you could try, in case your 3950X suffers the same bug.
More info at https://wiki.archlinux.org/title/Ryzen#Soft_lock_freezing and also some more explanation at https://utcc.utoronto.ca/~cks/space/blog/linux/KernelRcuNocbsMeaning
Elvis
-- Regards, ihad.
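After adding kernel parameters like the ones Elvis mentions and rebooting, it is worth confirming that they actually reached the running kernel. A minimal Python sketch follows; the helper name missing_params is hypothetical, and the only system interface assumed is /proc/cmdline. Note that the 0-23 range matches the 24 threads of a 3900X; for a 3950X with 32 threads the equivalent range would presumably be 0-31, which is an inference and not something stated in the thread.

```python
def missing_params(cmdline, wanted):
    """Return the entries of `wanted` that are absent from `cmdline`.

    `cmdline` is the text of /proc/cmdline; parameters are compared as
    whole whitespace-separated words, so "rcu_nocbs=0-3" does not match
    "rcu_nocbs=0-31".
    """
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]
```

A possible call is missing_params(open("/proc/cmdline").read(), ["rcu_nocbs=0-31", "processor.max_cstate=1"]); an empty list means both parameters are active.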
On Wed, Apr 12, 2023 at 18:01:13 +0200, Elvis Stansvik wrote:
Den ons 12 apr. 2023 kl 09:51 skrev ihad <ihad@d-tor.org>:
Hi,
On 4/11/2023 5:49 PM, Luna Celeste wrote:
An AMD CPU that hangs when it's mostly idle rings a bell: Try disabling c6 and see if the problem goes away. Some older AMD CPUs had problems waking up from deep sleep. The script [1] can do it for you:
This is a 16 core AMD Ryzen 9 3950X; does that count as older? The machine is only a few years old. I understand that's quite a while in tech, but I want to be sure.
I had that problem on this CPU: ~~ processor : 8 vendor_id : AuthenticAMD cpu family : 23 model : 1 model name : AMD Ryzen 5 1600X Six-Core Processor stepping : 1 ~~ According to Google my model was launched in April, 2017. I'd give it a shot, but YMMV...
At work we've been having a problem with a 12 core 3900X that would randomly lock up in a pattern similar to yours.
13 days ago I added rcu_nocbs=0-23 processor.max_cstate=1 kernel parameters and it's been running fine since then. Don't want to say the issue is finally solved until it's been running for a little longer, but it's something you could try, in case your 3950X suffers the same bug.
More info at https://wiki.archlinux.org/title/Ryzen#Soft_lock_freezing and also some more explanation at https://utcc.utoronto.ca/~cks/space/blog/linux/KernelRcuNocbsMeaning
Curious! The machine I'm running isn't going into sleep of any kind (at least not that I know of; I have all the power management disabled, and it's not a laptop), but I'll give this a try regardless. -- Cheers, Luna Celeste
On 11 April 2023 17:49:33 CEST, Luna Celeste <luna@unixpoet.dev> wrote:
it's been randomly hanging.
Also, things that may help you track this down: - monitor /proc/interrupts when it freezes
This is a 16 core processor and there's too much output on my 27" display to view it all at once; suggestions?
I would try to run something like this in the background:
watch -n 1 "cat /proc/interrupts >> ~/watch.log && sync"
(I did not check that the command works as expected but you get the intention).
Once a crash is caught, analyze the produced logs. Perhaps you can monitor other files from sysfs/debugfs as well.
Another thing that comes to mind: perhaps your system is still running, albeit very slow. I see that you're running libvirt. I've had a problem like this on my host: for more than a year, it would randomly and seldomly "freeze" (become astonishingly slow) when starting a VM (Windows guest with multiple passthroughs). I tried to debug this by increasing journald/kernel log levels but the issue appears to have vanished lately. I just assumed that it was fixed upstream, but perhaps it's still there.
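The same idea as the watch command above can be sketched in Python, which makes it easy to timestamp each snapshot so the last entry before a hang can be matched against the wall clock. This is only an illustration under assumptions: the function names append_snapshot and log_forever, the "===" header format, and the default log path are all made up here and do not come from the thread.

```python
import os
import time


def append_snapshot(src, log_path):
    """Append one timestamped copy of `src` to `log_path` and force it to disk."""
    with open(src, "r") as f:
        data = f.read()
    with open(log_path, "a") as log:
        log.write(f"=== {time.strftime('%Y-%m-%d %H:%M:%S')} ===\n")
        log.write(data)
        log.flush()
        # fsync so the snapshot survives a hard hang right afterwards.
        os.fsync(log.fileno())


def log_forever(src="/proc/interrupts", log_path="watch.log", interval=1.0):
    """Take a snapshot every `interval` seconds until interrupted."""
    while True:
        append_snapshot(src, log_path)
        time.sleep(interval)
```

Running log_forever() in a tmux session would keep appending until the machine hangs. The log grows without bound, so a longer interval or occasional rotation may be preferable on a system that stays up for weeks.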
On Wed, Apr 12, 2023 at 11:02:46 +0200, Shawn Michaels wrote:
On 11 April 2023 17:49:33 CEST, Luna Celeste <luna@unixpoet.dev> wrote:
it's been randomly hanging.
Also, things that may help you track this down: - monitor /proc/interrupts when it freezes
This is a 16 core processor and there's too much output on my 27" display to view it all at once; suggestions?
I would try to run something like this in the background: watch -n 1 "cat /proc/interrupts >> ~/watch.log && sync"
(I did not check that the command works as expected but you get the intention).
Once a crash is caught, analyze the produced logs. Perhaps you can monitor other files from sysfs/debugfs as well.
This is a good strategy, thank you! I'm a little worried about disk wear, though, but maybe that's just human bias?
Another thing that comes to mind: perhaps your system is still running, albeit very slow. I see that you're running libvirt. I've had a problem like this on my host: for more than a year, it would randomly and seldomly "freeze" (become astonishingly slow) when starting a VM (Windows guest with multiple passthroughs). I tried to debug this by increasing journald/kernel log levels but the issue appears to have vanished lately. I just assumed that it was fixed upstream, but perhaps it's still there.
Most of the time the VMs aren't actually running when the machine freezes / hangs; also, the last time it froze, the display was still active, and the clock hadn't advanced for something like 6-10 hours, matching the time when the mosh session lost its connection. So I don't think this is the cause. Unrelated, would you please check your mail client? When you reply, I get a copy in my main inbox and in the folder for the mailing list, despite setting both Mail-Followup-To and Reply-To headers. Something seems to be acting strangely. Regardless, thank you for all the advice you've provided! -- Cheers, Luna Celeste
Hi, On 14 April 2023 04:06:56 CEST, Luna Celeste <luna@unixpoet.dev> wrote:
On Wed, Apr 12, 2023 at 11:02:46 +0200, Shawn Michaels wrote:
On 11 April 2023 17:49:33 CEST, Luna Celeste <luna@unixpoet.dev> wrote:
it's been randomly hanging.
Also, things that may help you track this down: - monitor /proc/interrupts when it freezes
This is a 16 core processor and there's too much output on my 27" display to view it all at once; suggestions?
I would try to run something like this in the background: watch -n 1 "cat /proc/interrupts >> ~/watch.log && sync"
(I did not check that the command works as expected but you get the intention).
Once a crash is caught, analyze the produced logs. Perhaps you can monitor other files from sysfs/debugfs as well.
This is a good strategy, thank you! I'm a little worried about disk wear, though, but maybe that's just human bias?
If you're worried about that, you can store the logs on an external USB dongle. Or you could even remotely SSH into the box and use something like "script" in order to log the SSH session into a file on the remote machine. That way, you wouldn't need to sync every second.
Another thing that comes to mind: perhaps your system is still running, albeit very slow. I see that you're running libvirt. I've had a problem like this on my host: for more than a year, it would randomly and seldomly "freeze" (become astonishingly slow) when starting a VM (Windows guest with multiple passthroughs). I tried to debug this by increasing journald/kernel log levels but the issue appears to have vanished lately. I just assumed that it was fixed upstream, but perhaps it's still there.
Most of the time the VMs aren't actually running when the machine freezes / hangs; also, the last time it froze, the display was still active, and the clock hadn't advanced for something like 6-10 hours, matching the time when the mosh session lost its connection. So I don't think this is the cause.
This may still be the case. If you get e.g. a couple of system ticks every minute, you may not see a minute pass for a very long time.
Unrelated, would you please check your mail client? When you reply, I get a copy in my main inbox and in the folder for the mailing list, despite setting both Mail-Followup-To and Reply-To headers. Something seems to be acting strangely.
Sorry about that. I'm not used to mailing lists. I had a quick look through the settings and couldn't find anything related. Maybe this was caused because I replied to an "old" mail from the middle of the thread? I'm using k9 on Android. If somebody has an idea, don't hesitate to chime in.
On Sat, Apr 8, 2023 at 8:56 PM Luna Celeste <luna@unixpoet.dev> wrote:
I have a System76 Thelio machine running Arch (fully up to date as of a few hours ago today). For quite a while now (at least a few months), it's been randomly hanging. I've configured the system to provide a crash dump per the Kdump wiki page, but whenever I reboot it after one of these hangs (or use the sysrq crash key), there is no /proc/vmcore file, and there's nothing in the logs. As you can understand, this makes troubleshooting difficult.
In addition to this, I've not been able to identify many common factors of the hangs. The only thing that seems to be consistent is that most of the hangs occur when the system puts the display to sleep. But even this isn't consistent; today's hang occurred while the display was active (though I turned off the monitor).
Hi Luna,
I've been having what appears to be the same issue, but on a Dell Inspiron 7572. When it happens, I also see a blinking Caps Lock (indicating a kernel panic) just a few seconds after suspending or turning off the screen. Does this detail match as well?
In my case, not using the coretemp driver seemed to help for a while, and there was a patch on the HWMON list that I was hoping would solve the issue for good.[1]
Unfortunately I just saw that the patch has been in the Arch kernel for a few versions already, and yesterday I started having these issues again (after ~5 months).
But it also turns out that I never actually blocklisted coretemp back in November, and in the past few days I've been leaving btop open, which also reads from coretemp (if available). So there's still some correlation... and for the lack of a better idea, I've now blocklisted coretemp, and will see if that makes a difference over the next few days/weeks. Maybe you can try doing the same, if you're also on Intel.
[1]: https://lore.kernel.org/linux-hwmon/20230103114620.15319-1-janusz.krzysztofi...
Cheers, Jonas
I've changed hardware and operating systems a few times. I originally ran PopOS on this machine, and experienced hangs there with an NVIDIA graphics card, but I moved to AMD when installing Manjaro. Again, to be clear, the machine does have a very up to date Arch install presently. The hangs occur with both X11 and Wayland. Sometimes it happens a few hours after I boot the machine, sometimes (like today's) it took a week or two.
About the only unusual thing about the system is that I'm running ZFS. The root filesystem for this installation runs on ZFS, but this wasn't the case when it ran PopOS. I still had a ZFS pool on the PopOS machine. I have another ZFS pool with my home directory and my libvirt VMs on it. That said, neither of the pools have any errors, and I scrub them regularly.
As I said at the start of this message, I've been struggling to figure out the problem for a few months and I'm not sure how to make progress troubleshooting it. I don't see anything in dmesg about overheating, or hardware failures, or anything like that. The machine works great... until it very suddenly doesn't.
So how do I go about figuring out what's going on? I'd love to be able to
-- Cheers, Luna Celeste
Hi, On 9 April 2023 01:55:34 CEST, Luna Celeste <luna@unixpoet.dev> wrote:
it's been randomly hanging.
Long shot, but I had random freezes on a machine years ago and it turned out that I forgot to install the intel-ucode package [1]. EDIT: I realize now that you're running an AMD cpu.
Also, things that may help you track this down:
- monitor /proc/interrupts when it freezes
- remove some hardware, see if crashes disappear, add back progressively
- increase kernel verbosity (I forgot how I did this, maybe [2] will help?)
- when and how exactly did it start?
Definitely worth checking memtest86+
Keep in mind that it may take days of stress testing memory before catching an error. I caught a faulty RAM slot on a MB once after 3 days of running memtest. I suggest running for an entire week.
Good luck
[1] https://wiki.archlinux.org/title/microcode
[2] https://linuxconfig.org/introduction-to-the-linux-kernel-log-levels
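On the kernel verbosity point above: the current console log levels can be read from /proc/sys/kernel/printk, which holds four integers. The Python sketch below only decodes that file; the function name parse_printk is hypothetical, and the field names follow the kernel's sysctl documentation as far as I recall it. Raising the first value (for example with the loglevel= kernel parameter) makes the console show lower-priority messages, which may or may not reveal anything for a hang like this one.

```python
FIELDS = (
    "console_loglevel",          # messages with a lower number reach the console
    "default_message_loglevel",  # level given to messages without an explicit one
    "minimum_console_loglevel",  # lowest value console_loglevel may be set to
    "default_console_loglevel",  # boot-time default for console_loglevel
)


def parse_printk(text):
    """Decode the contents of /proc/sys/kernel/printk into a dict."""
    values = [int(v) for v in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} integers, got {len(values)}")
    return dict(zip(FIELDS, values))
```

A possible call is parse_printk(open("/proc/sys/kernel/printk").read()).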
participants (6): Elvis Stansvik, Genes Lists, ihad, Jonas Malaco, Luna Celeste, Shawn Michaels