Help debugging random lockups
Hi, I have a reasonably new PC which ran with no problems for around six weeks after initial install of Arch, but then about two weeks ago it started locking up randomly. I can't see anything obvious that is causing it to lock up. Sometimes it does it while I am actively working, and other times it locks up overnight (I tend to leave it running continuously). The lockups don't all look the same. For instance, sometimes the mouse cursor is still active and can be moved around the screen, and at other times it disappears completely. I've got kernel.sysrq = 1 set in my sysctl config, but sometimes it doesn't respond to the Magic SysRq key despite that. Obviously I'd like to work out what is causing the crash, so I'm looking for some pointers as to where I should be looking and what tools I could use to investigate. Thanks in advance for any info. Regards, Spencer
Did you ever take a look at your system logs and perhaps coredumps?
I have also had these issues recently. Not sure what's causing it.
On Wed, May 24, 2023, 1:18 PM Abraham S.A.H. <arash.sah.1996@gmail.com> wrote:
Did you ever take a look at your system logs and perhaps coredumps?
Hello,
I have found that the most common cause of random lockups is memory issues. Make sure of the following:
- You are not starving the kernel of memory; if you are, the logs should show kernel panics due to lack of memory.
- Your memory is not defective; memtester is available on the Arch installation media.
I found that XMP can cause the system to freeze randomly: it could boot and run for 30-60 minutes and then freeze completely, without any kernel messages. Disabling XMP fixed this.
If it is not memory then, as Andy has suggested, you will need to troubleshoot the hardware. Be sure to check the kernel logs before doing anything else and, as Andy said in their side note, temperatures would be a good thing to check just in case.
Good luck,
-- 
Polarian
GPG signature: 0770E5312238C760
Website: https://polarian.dev
JID/XMPP: polarian@polarian.dev
On Wed, 2023-05-24 at 19:56 +0100, Polarian wrote:
I have found that the most common cause for random lock ups are memory issues.
Hi,
In my experience such random lockups were usually caused by broken HDDs. On one of my machines I had an issue like this with some GTK software. It looked like a defective SSD, but it was a software issue. Only once did I have broken memory, and that time the issue was signaled by POST: the machine couldn't get beyond POST.
Since the machine didn't cause problems for around six weeks, I would take a look at pacman.log and, for example, downgrade the kernel.
I'm not a fan of memtest, but running memtest is a starting point. On my new machine I installed chaotic-aur/memtest86-efi version 10+; alternatively I would use a version 10+ live medium.
Every now and then nothing is broken at all. Sometimes a cable, a card or a RAM module just needs to be plugged in again.
I would also look for available BIOS and other firmware updates and, if there are any, read their changelogs.
Regards,
Ralf
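To make the pacman.log suggestion concrete, here is a minimal sketch that lists packages upgraded since a given date, so they can be lined up against the date the lockups started. It assumes the log uses the ISO-8601 timestamp format of recent pacman versions; the function name and the example line in the comment are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Illustrative example of the kind of line this matches (versions are made up):
# [2023-05-10T08:15:02+0100] [ALPM] upgraded linux (6.3.1.arch1-1 -> 6.3.2.arch1-1)
UPGRADE_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[ALPM\] upgraded (?P<pkg>\S+) "
    r"\((?P<old>\S+) -> (?P<new>\S+)\)$"
)


def upgrades_since(lines, since, packages=None):
    """Yield (timestamp, package, old_version, new_version) tuples.

    `since` must be a timezone-aware datetime. If `packages` is given,
    only upgrades of those package names are reported.
    """
    for line in lines:
        match = UPGRADE_RE.match(line.strip())
        if not match:
            continue
        try:
            when = datetime.strptime(match["when"], "%Y-%m-%dT%H:%M:%S%z")
        except ValueError:
            # Older logs use a different timestamp format; skip those lines.
            continue
        if when < since:
            continue
        if packages is not None and match["pkg"] not in packages:
            continue
        yield when, match["pkg"], match["old"], match["new"]
```

Typical use would be something like `upgrades_since(open("/var/log/pacman.log"), datetime(2023, 5, 8, tzinfo=timezone.utc), packages={"linux"})`, with the date set to shortly before the first lockup.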
On Wed, 24 May 2023 at 17:17, Spencer Collyer < spencer@spencercollyer.plus.com> wrote:
Hi,
I have a reasonably new PC which ran with no problems for around six weeks after initial install of Arch, but then about two weeks ago it started locking up randomly. I can't see anything obvious that is causing it to lock up. Sometimes it does it while I am actively working, and other times it locks up overnight (I tend to leave it running continuously).
Another thing you can try to do if you have access to another machine is to configure netconsole to send your kernel log output to another machine. That way you can capture everything right up until the moment it stops responding.
http://www.infotinks.com/linux-netconsole-module-send-dmesgconsolelogs-to-re...
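On the receiving machine, anything that listens on the right UDP port will do. As a rough sketch, assuming netconsole on the crashing box is pointed at this machine and at port 6666 (netconsole's default target port), a small Python listener could timestamp each message and append it to a file. The function names and the log file name are made up for illustration.

```python
import socket
from datetime import datetime


def format_record(data, sender_ip, now=None):
    """Turn one received datagram into a timestamped log line."""
    now = now or datetime.now()
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{now.isoformat(timespec='seconds')} {sender_ip} {text}"


def listen(bind_addr="0.0.0.0", port=6666, logfile="netconsole.log"):
    """Receive netconsole datagrams forever and append them to `logfile`.

    Stop it with Ctrl-C. Each line is flushed immediately so that the
    last messages before a lockup are not left sitting in a buffer.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a", encoding="utf-8") as out:
        sock.bind((bind_addr, port))
        while True:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
            out.write(format_record(data, sender_ip) + "\n")
            out.flush()
```

Calling `listen()` on the old desktop and leaving it running should capture everything the kernel manages to send before the freeze; the firewall on that machine has to allow inbound UDP on the chosen port.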
Hi, Thanks for all the ideas folks. I especially like Andy's idea of setting up netconsole - I've still got my old desktop machine available so will give it a go setting that up at the weekend. I'm hoping it's not a hardware fault - I'm not a hardware hacker. Hopefully the fact it ran fine for six weeks means the hardware shouldn't be the problem. Any further ideas gratefully received :) Regards, Spencer
On Thu, May 25, 2023 at 09:15:06 +0100, Spencer Collyer wrote:
Hi,
Thanks for all the ideas folks. I especially like Andy's idea of setting up netconsole - I've still got my old desktop machine available so will give it a go setting that up at the weekend.
I'm hoping it's not a hardware fault - I'm not a hardware hacker. Hopefully the fact it ran fine for six weeks means the hardware shouldn't be the problem.
Any further ideas gratefully received :)
I had similar issues a while ago and posted about them here. Folks were really helpful! Everyone asked me for a lot of system specs, like processor and graphics card and memory, as well as any major software packages you're running. Specifying those would give people some additional information and direction to provide better assistance. I also suggest you look at that thread: https://lists.archlinux.org/archives/list/arch-general@lists.archlinux.org/t... There are many helpful suggestions there; try the ones that are relevant. -- Cheers, Luna Celeste
On Thu, 25 May 2023 07:51:59 -0400, Luna Celeste wrote:
I had similar issues a while ago and posted about them here. Folks were really helpful! Everyone asked me for a lot of system specs, like processor and graphics card and memory, as well as any major software packages you're running. Specifying those would give people some additional information and direction to provide better assistance. I also suggest you look at that thread:
https://lists.archlinux.org/archives/list/arch-general@lists.archlinux.org/t...
There are many helpful suggestions there; try the ones that are relevant.
Thanks Luna. I do recall that thread but couldn't find it when I scanned my mailbox. I've located it now so will be taking a look at the suggestions. Regards, Spencer
On 5/25/23, Spencer Collyer <spencer@spencercollyer.plus.com> wrote:
Hi, (...) I'm hoping it's not a hardware fault - I'm not a hardware hacker. Hopefully the fact it ran fine for six weeks means the hardware shouldn't be the problem.
Any further ideas gratefully received :)
Hi, Spencer, have you looked for any video driver issues? I have some recollection of similar graphical symptoms caused by GPU driver issues... Maybe switching to more basic video support could allow you to diagnose something... Hope it helps, best of luck!
Could you simply be running out of RAM? 🤔 Mouse cursor moving but system being otherwise unresponsive sounds a lot like a system that is running low on memory and is trying to move bits in and out of swap. A memory leak in some app you are using perhaps. Or possibly some Docker image or vm hogging heaps of memory if you do stuff like that.
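One way to test the low-memory theory is to log memory figures periodically and look at the last entries after a lockup. This is only a sketch, assuming the usual /proc/meminfo layout with values in kB; the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer values.

    Most values are in kB; a few counters (e.g. HugePages_Total) are unitless.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_pressure(info):
    """Summarise how much RAM is still available and how much swap is in use."""
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return {
        "available_pct": 100.0 * info["MemAvailable"] / info["MemTotal"],
        "swap_used_kib": swap_total - swap_free,
    }
```

Running `memory_pressure(parse_meminfo(open("/proc/meminfo").read()))` once a minute from a timer and appending the result to a file would show whether available memory was collapsing in the run-up to a freeze.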
I also have this issue. But these issues didn't pop up until recently, after an update. I have systemd-oomd configured, so that should take care of any lockups due to a lack of RAM.
On Thu, May 25, 2023 at 7:43 AM Jesse Jaara <jesse.jaara@gmail.com> wrote:
Could you simply be running out of RAM? 🤔 Mouse cursor moving but system being otherwise unresponsive sounds a lot like a system that is running low on memory and is trying to move bits in and out of swap.
A memory leak in some app you are using perhaps. Or possibly some Docker image or vm hogging heaps of memory if you do stuff like that.
-- Sincerely, Matthew Blankenbehler
Are you running something like barrier, or anything else that could interfere with input like that (x11vnc, etc.)? Does dmesg say anything? Every time I've had issues like this and gone down the kernel route, it ended up being something silly like barrier acting up.
On Thu, May 25, 2023 at 1:53 PM Matthew Blankenbeheler < thecoolkids322@gmail.com> wrote:
I also have this issue. But, these issues didnt pop up till recently after an update. I have systemd-oomd configured so that should take care of any lockups due to a lack of ram.
Also, weird compositors like compton have caused hiccups like this.
On Thu, May 25, 2023 at 2:00 PM Jeronimo Garcia <garciaj.uk@gmail.com> wrote:
Are you running something like barrier , or something that could interfere with inputs like that ? x11vnc etc etc. does dmesg say something ? every time i've had issues with something like this and i went the kernel route ended up being something silly like barrier acting up etc etc
I managed to get netconsole working (had to wait until the long weekend so I could replace my work machine with my old desktop) and I've seen two lockups today, both referring to the nouveau driver. Looking on the ArchWiki I found that there are known problems with this driver and some (unspecified) chips, and the suggestion is to add the kernel parameter `nouveau.noaccel=1', so I'm going to try that. I had a look at the ArchWiki description of the official drivers and for the driver for my GPU, nvidia-open, it says 'This is currently alpha quality, so there will be issues.' So I think I'll keep away from that for the moment. I'll report back if adding the kernel parameter helps. If I can get a week of use out of this box with no lockups that will be a start. Regards, Spencer
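After editing the boot loader configuration it is worth confirming that the parameter actually reached the running kernel. A minimal sketch, assuming the parameter shows up as a plain space-separated token in /proc/cmdline (quoted values containing spaces are not handled); the function name is made up for illustration.

```python
def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'rw' or 'quiet') map to None.
    Quoted values containing spaces are not handled by this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params
```

After a reboot, `kernel_params(open("/proc/cmdline").read()).get("nouveau.noaccel")` should come back as `"1"` if the boot entry was picked up.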
On 5/29/23, Spencer Collyer <spencer@spencercollyer.plus.com> wrote:
I managed to get netconsole working (had to wait until the long weekend so I could replace my work machine with my old desktop) and I've seen two lockups today, both referring to the nouveau driver. (...)
Hi, I had some related issues a while ago, and the devs on the Nouveau mailing list were really helpful. Here is the address, just in case: nouveau@lists.freedesktop.org Good luck and kind regards!
On Mon, 29 May 2023 11:54:07 +0100, Spencer Collyer wrote:
I managed to get netconsole working (had to wait until the long weekend so I could replace my work machine with my old desktop) and I've seen two lockups today, both referring to the nouveau driver. Looking on the ArchWiki I found that there are known problems with this driver and some (unspecified) chips, and the suggestion is to add the kernel parameter `nouveau.noaccel=1', so I'm going to try that.
I had a look at the ArchWiki description of the official drivers and for the driver for my GPU, nvidia-open, it says 'This is currently alpha quality, so there will be issues.' So I think I'll keep away from that for the moment.
I'll report back if adding the kernel parameter helps. If I can get a week of use out of this box with no lockups that will be a start.
Regards,
Spencer
As promised, a quick update. Since adding the kernel parameter I have had no random lockups at all, so it looks like it is working now. As suggested by riveravaldez I will be contacting the nouveau mailing list at some point, but again I'm going to have to wait until I can hook up my old desktop again - work takes priority over that. I do still see one oddity - every now and then, when I return to the machine after being away for a while, the monitor shows no video input. I can bring the machine back to life by tapping the power button off then back on, and after a few seconds the screen comes back and everything is as I left it. No idea what is causing that - I haven't had netconsole running at any point when it's occurred yet. It's more an annoyance than anything else - at least I haven't (yet) lost any work. Regards, Spencer
On Wed, May 24, 2023 at 6:18 PM Spencer Collyer < spencer@spencercollyer.plus.com> wrote:
Hi,
I have a reasonably new PC which ran with no problems for around six weeks after initial install of Arch, but then about two weeks ago it started locking up randomly. I can't see anything obvious that is causing it to lock up. Sometimes it does it while I am actively working, and other times it locks up overnight (I tend to leave it running continuously).
Are you using XFS? XFS in Linux 6.3 has a livelock bug that will be fixed by 6.3.5 or 6.3.4-arch2.
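For anyone who wants to check quickly whether a running kernel falls in the range described above, here is a rough sketch. It simply encodes the statement in this message (6.3 releases before 6.3.5 or 6.3.4-arch2), assumes an Arch-style release string such as the output of `uname -r` on the stock kernel, and returns None for anything it cannot parse. The function name is made up for illustration.

```python
import re

# Arch stock kernel release strings look like "6.3.4-arch2-1".
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?-arch(\d+)")


def in_reported_range(release):
    """Return True if `release` is a 6.3 kernel older than 6.3.4-arch2.

    Returns False for other Arch kernels and None if the string does not
    look like an Arch stock kernel release (e.g. linux-lts or linux-zen).
    """
    match = RELEASE_RE.match(release)
    if not match:
        return None
    major, minor = int(match[1]), int(match[2])
    patch = int(match[3] or 0)
    arch_rev = int(match[4])
    if (major, minor) != (6, 3):
        return False
    return (patch, arch_rev) < (4, 2)
```

On a live system the input would be `platform.release()`; whether the bug matters at all of course depends on XFS actually being in use.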
participants (11)
- Abraham S.A.H.
- Andy Pieters
- Jan Alexander Steffens (heftig)
- Jeronimo Garcia
- Jesse Jaara
- Luna Celeste
- Matthew Blankenbeheler
- Polarian
- Ralf Mardorf
- riveravaldez
- Spencer Collyer