[arch-general] Linux server crash causing router switch to stop working

David Rosenstrauch darose at darose.net
Wed Feb 16 21:28:02 UTC 2022



On 2/11/22 11:44 AM, David Rosenstrauch via arch-general wrote:
> On 2/11/22 9:21 AM, Genes Lists via arch-general wrote:
>> Also it may be worthwhile running memcheck to be sure your memory is 
>> not faulty.
> 
> Yeah that thought occurred to me as well.  I ran a quick memtest86+ when 
> I first built the computer and it didn't show any issues.  But perhaps I 
> should take some downtime and have it run one through the full 3 rounds.


Following up with an update to close the loop on this thread, for anyone 
who's interested.


So bad news is:  machine crashed again a couple of times the other day - 
and right in the middle of a large and important pacman update, no less! 
  (150+ packages, including kernel, glibc, gcc, mariadb, bunch of other 
key libraries.  Machine wound up not being bootable, and took several 
hours to recover.  It was pretty ugly.)  :-(

But good news is (aside from being able to recover from the failed 
update):  I think I may have finally pinned down what's been causing 
these issues.


Long story short, I had an issue when I first built the box where the 
machine wouldn't POST and boot, with the mobo's "DRAM" health LED 
indicator lighting up.  After much digging I was able to pin down that I 
had fastened the CPU cooler down too tight.  That apparently can bend 
the mobo, and prevent some components from working correctly.  In my 
case it was one of the memory slots, which is located very close to the 
CPU/cooler.  (I figured it out for certain when I was able to boot with 
only one of the 2 memory sticks installed.)  Once I pinpointed the 
issue, I loosened the cooler a bit and from then on I've been able to 
repeatedly boot as normal with both sticks installed.

Or so I thought!  I'm pretty sure now that this has actually still been 
an issue, and that connectivity to one of the memory sticks has been 
periodically cutting out in the middle of operations and so causing 
these random crashes.  Evidence:  a) while debugging one of the recent 
crashes I again hit the same issue where it wouldn't POST and showed the 
same DRAM LED, and b) I again took the memory stick out of the slot in 
question and rebooted, and it's been running without issue ever 
since.  (Although on 1/2 the RAM.)  I'm pretty sure the issue isn't that 
the memory stick is bad, as I've run memtests several times with no 
issues.  But I'll also try swapping sticks to confirm.


I'll give it a few more days of uptime to confirm that this was indeed 
the issue, but I'm growing increasingly confident that that's the case. 
  When I get some time I'll reinstall the cooler, thermal paste, and 2nd 
RAM stick, and try to get everything back to full RAM capacity without 
crashing.


Many thanks to everyone for all the helpful suggestions!

DR


