[arch-general] Linux server crash causing router switch to stop working
Been experiencing a weird issue several times recently that's got me stumped.
A couple of weeks ago, my entire home network went dead right in the middle of a zoom call. Same problem happened again late last night. The problem is intermittent/occasional: everything runs without issue for several days, then suddenly crashes.
I figured the issue was a problem with my wifi router, but after much debugging I don't think that's the case. I have a new (Arch) linux server I built a few weeks ago. What seems to be happening is that my server crashes for some reason - and then looks like it's causing the ethernet LAN switch inside my router to stop working. (I verified that when I unplug the server's network cable the router starts working again, and when I plug it back in the router stops.)
My best guess as to what's happening is that when the server crashes it's somehow either flooding the network, or sending out invalid network packets, and so making the router inoperable. I have no idea what could cause this, though, or how to fix it. (Or even if this is definitely the issue.) I also have no way to know if this is an issue with my server hardware vs. say an issue in a recent kernel upgrade, since the server is so new.
Server is running Arch, and is kept very up to date. It's using the on-board LAN chip on the mobo - Realtek RTL8125 2.5GbE (network module r8169, RTL8125B chip/firmware). Router is a Netgear R8000, running dd-wrt (r47097).
Any suggestions welcome!
Thanks,
DR
On 1/21/22 12:54, David Rosenstrauch via arch-general wrote:
Been experiencing a weird issue several times recently that's got me stumped.
Hi David
Sounds a bit tricky to sort out - not sure I can help but maybe you can help us understand your set up a bit more.
When you say server crashes - I assume kernel crashed and is not functioning (as opposed to kernel is up and running but network is not doing what I want)? Does server respond to any keyboard or mouse movement on console?
What do kernel logs show leading up to the crash? If the kernel oopsed it is often logged. Prior log messages may be helpful in tracking down that problem.
Is server running desktop or no graphics?
Your network topology - Is the Netgear routing to internet or is it your new server or a separate firewall? Is your internet ipv4 or ipv6 or both? Same question - is internet firewall routing (ipv6) or NAT'ing (ipv4)?
Is there a periodicity to the problem? Like say dhcp lease expiration length or something?
Can we assume that Netgear has been working fine with same configs before new server deployed?
Assuming the netgear is your firewall / router to internet - When things are "broken" can internal clients see each other (say ping from one client to another) or is all internal traffic hung up as well as internet traffic?
And in that vein, sorry for obvious, but I'll ask anyway - can I assume only 1 server (or kea with hot-standby) is providing dhcp service?
Also I notice that latest stable dd-wrt on website is r44715 and your build seems to be beta from last July - I note there are newer builds of the beta - I have no view on the firmware just making an observation.
gene
Tnx much for the reply! Responses inline below.
On 1/21/22 2:29 PM, Genes Lists via arch-general wrote:
When you say server crashes - I assume kernel crashed and is not functioning (as opposed to kernel is up and running but network is not doing what I want)? Does server respond to any keyboard or mouse movement on console?
It's hard to know for sure exactly what's happening, as the server is headless. It does have a KVM-over-IP hooked up to it, though. But when I try to use the KVM when the server gets in this state, it doesn't show anything on the screen, or respond to any keyboard or mouse input. (At least not that I can see.) Eventually I wind up having to just power off the machine.
What do kernel logs show leading up to the crash? If the kernel oopsed it is often logged. Prior log messages may be helpful in tracking down that problem.
When the crash happened this time I went back and looked at the logs and system journal. Last log message was at 2:59:01am. Timing is a little suspicious, as I do have some cron jobs that kick off at 3am. However, the first time the crash happened it was during the day, so I don't think 3am cron jobs could be solely responsible.
Is server running desktop or no graphics?
It's mostly a server. But I do have a desktop manager running (using the CPU's onboard graphics) and connected to the KVM in case I need to log in. (Which I rarely do. I mostly SSH in.)
Your network topology - Is the Netgear routing to internet or is it your new server or a separate firewall? Is your internet ipv4 or ipv6 or both? Same question - is internet firewall routing (ipv6) or NAT'ing (ipv4)?
Netgear router is home LAN's gateway to the internet. It uses a combination of ipv4 and v6.
Is there a periodicity to the problem? Like say dhcp lease expiration length or something?
Not that I can tell. Again, the issue only happened twice - once at around 10am, and the second time at (it looks like) 3am. And there was probably a good week or so between the 2 events.
Can we assume that Netgear has been working fine with same configs before new server deployed?
Yep. Router had been pretty solid up until this point. Plus the fact that unplugging the server's network cable makes the issue go away leads me to believe the problem isn't the router.
Assuming the netgear is your firewall / router to internet - When things are "broken" can internal clients see each other (say ping from one client to another) or is all internal traffic hung up as well as internet traffic?
Anything using an ethernet port on the router pretty much goes dead. That includes the new server itself, a network printer, the upstream modem that the router is connected to, a POE wifi extender in the other room, etc. IIRC, it might have still been possible to connect to the router using wifi during the outage. (I.e., it would respond and assign an IP address with DHCP.) But with the upstream modem unreachable, having a wifi connection wasn't of much use.
And in that vein, sorry for obvious, but I'll ask anyway - can I assume only 1 server (or kea with hot-standby) is providing dhcp service?
Yep, only 1 machine on the network handing out IP addresses - the router itself.
Also I notice that latest stable dd-wrt on website is r44715 and your build seems to be beta from last July - I note there are newer builds of the beta - I have no view on the firmware just making an observation.
Yes, I am definitely a bit behind on dd-wrt updates. But the version I'm running has been quite stable up till now, so I didn't see any urgency to update. One thing I did notice after doing a bit more digging on this issue: although the r8169 network module does seem to work with the mobo's onboard network chip (RTL8125B), that's not technically the right driver for it. There's an r8125 module that's not part of the kernel, which is available on AUR. (https://aur.archlinux.org/packages/r8125-dkms/) I've switched over to start using that (and blacklisted r8169). I've also upgraded to the most recent kernel (5.16.2). So I'm watching to see if either/both of those changes eliminate the issue. I was mostly posting to the list really to ask if anyone had heard of such a thing as a crashed server somehow either sending screwed up network packets or flooding the network in such a way that it could render a router/switch inoperable. From the limited amount I know of networking I think that might be possible. But I don't know exactly how one would remedy something like that. I guess fixing the underlying issue that is crashing the server would be the way to do that, but I haven't been able to pin down the cause yet. Anyway, thanks again for the response, and I appreciate the debugging tips. If any other ideas come to mind, please LMK Thanks, DR
Hello David,
When you say server crashes - I assume kernel crashed and is not functioning (as opposed to kernel is up and running but network is not doing what I want)? Does server respond to any keyboard or mouse movement on console?
It's hard to know for sure exactly what's happening, as the server is headless. It does have a KVM-over-IP hooked up to it, though. But when I try to use the KVM when the server gets in this state, it doesn't show anything on the screen, or respond to any keyboard or mouse input. (At least not that I can see.) Eventually I wind up having to just power off the machine.
Does your server support IPMI and thus SOL (serial over LAN)? If so you could configure your server to also provide a login on the serial device which it will also use to log kernel messages.
- Configure SOL in EFI
- Add "console=tty console=ttyS0,115200n8" (if SOL uses ttyS0/COM1, else use ttyS1) to the kernel command line (how to depends on your boot mechanism)
- Configure BMC credentials and privileges for SOL
- Use "ipmitool -I lanplus -U $USER -P $PASSWORD -H $SERVERIP sol activate" from another host in your network to connect. You can exit with "<enter> ~ ."
Alternatively use "ipmiconsole -u $USER -p $PASSWORD -h $HOST" from freeipmi package and exit with "ctrl+esc & ." (if I remember correctly).
In order to catch those hangs it then would be best to have SOL connected and logged around the clock, e.g. by using a Raspberry Pi and Conserver (https://aur.archlinux.org/packages/conserver/)
Regards,
Uwe
On 1/24/22 2:47 AM, Uwe Sauter via arch-general wrote:
Does your server support IPMI and thus SOL (serial over LAN)?
Thanks for the suggestion. Unfortunately this is just a home/desktop server, so I don't think its mobo supports IPMI.
Thanks, DR
On 1/23/22 21:12, David Rosenstrauch via arch-general wrote:
Tnx much for the reply! Responses inline below.
...
It's hard to know for sure exactly what's happening, as the server is headless. It does have a KVM-over-IP hooked up to it, though. But when
Perhaps you can log in after the server is back and look at the journal spanning the period before and after the problem (e.g. journalctl --since -2h).
On Mon, 24 Jan 2022 at 15:50, Genes Lists via arch-general < arch-general@lists.archlinux.org> wrote:
It's hard to know for sure exactly what's happening, as the server is headless. It does have a KVM-over-IP hooked up to it, though. But when
Perhaps you can login after server is back and look at journal spanning period before and after the problem (e.g. journalctl --since -2h).
There is a kernel module that enables the kernel log to be sent over the network [1], but the described problem has all the flavour of a hardware issue. I'd try replacing the LAN cable between server and router first, and if at all possible the network card as well.
If that's not possible, try running tcpdump on another machine and have it capture in promiscuous mode. After a crash you can load the saved pcap file into the Wireshark GUI and inspect it for clues (apologies if you're a lemon farmer and I'm teaching you how to juice lemons).
[1] https://wiki.ubuntu.com/Kernel/Netconsole
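The two software suggestions in this message (netconsole and a capture on a second machine) might be set up roughly as follows. This is only a sketch under stated assumptions: the interface name, IP addresses, ports and capture path are placeholders that do not come from the thread, and the script prints the commands instead of running them, since both need root. A UDP listener (netcat, socat or a syslog daemon, for instance) would also have to be running on the target machine to receive the netconsole messages.

```shell
#!/bin/sh
# Sketch only: every address, port, interface name and path below is a
# placeholder (an assumption), not a value from the thread.

# Build the netconsole module parameter, whose documented form is
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# Fields other than the target IP may be left empty.
netconsole_arg() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Build a tcpdump command that writes a ring of capture files, so a
# capture left running for days cannot fill the disk:
#   -C 50  start a new file after roughly 50 million bytes
#   -W 20  keep at most 20 files, then overwrite the oldest
capture_cmd() {
    printf 'tcpdump -i %s -n -w %s -C 50 -W 20\n' "$1" "$2"
}

# On the server (as root): send kernel messages to 192.168.1.20, UDP port 6666.
echo "modprobe netconsole netconsole=$(netconsole_arg 6665 192.168.1.10 eth0 6666 192.168.1.20 '')"

# On the second machine (as root): capture continuously until the next crash.
capture_cmd eth0 /var/tmp/lan.pcap
```

tcpdump puts the interface into promiscuous mode by default. On a switched LAN, a second machine will still generally see only broadcast and multicast frames plus traffic addressed to itself, which should nevertheless be enough to notice a broadcast flood.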
On 1/21/22 18:54, David Rosenstrauch via arch-general wrote:
I figured the issue was a problem with my wifi router, but after much debugging I don't think that's the case. I have a new (Arch) linux server I built a few weeks ago. What seems to be happening is that my server crashes for some reason - and then looks like it's causing the ethernet LAN switch inside my router to stop working. (I verified that when I unplug the server's network cable the router starts working again, and when I plug it back in the router stops.)
When it crashes, instead of reconnecting the router, connect any laptop with Arch booted from USB. If the link comes up then use tcpdump to see what is happening on the wire.
Regards,
Łukasz
On 1/21/22 5:08 PM, Łukasz Michalski via arch-general wrote:
When it crashes, instead of reconnecting router connect any laptop with arch booted from usb. If link comes up then use tcpdump to see what is happening on the wire.
That's a good suggestion - thanks. (No better way to see if the server is sending screwy network traffic ... than inspecting the traffic.) I do sometimes get a bit lost looking through tcpdump data, but I might be able to muddle through it. Thanks again, DR
Following up on this crash issue I keep having with my Arch server. Basically server just completely freezes up - doesn't respond to pings, or keyboard/mouse input, and eventually has to just be rebooted.
Good news is:
a) after upgrading everything (including the router firmware) it no longer seems to hang my entire router/network (yay!)
b) I was able to get a screenshot of the issue (more details about that below)
Bad news is:
It keeps happening! (About once a week or so.)
I was able to capture a screenshot from the VM. (See http://darose.net/ServerCrash20220209.png) Basically this is the only thing that's providing me with any detail as to why it's crashing: "rcu_preempt detected stalls on cpus/tasks". But searching on that phrase didn't really give any clear indication what the problem might be. Plus the rest of the details from that message got obscured in the garbled video output.
Any thoughts/suggestions as to what might be happening here / how to debug welcome!
Thanks,
DR
On 1/21/22 12:54 PM, David Rosenstrauch wrote:
Been experiencing a weird issue several times recently that's got me stumped.
A couple of weeks ago, my entire home network went dead right in the middle of a zoom call. Same problem happened again late last night. The problem is intermittent/occasional: everything runs without issue for several days, then suddenly crashes.
I figured the issue was a problem with my wifi router, but after much debugging I don't think that's the case. I have a new (Arch) linux server I built a few weeks ago. What seems to be happening is that my server crashes for some reason
seems so obvious you already illuminated it but some buffer not clearing old data?
mick in glen innes 2370
On Fri, 11 Feb 2022 at 09:32, David Rosenstrauch via arch-general <arch-general@lists.archlinux.org> wrote:
Following up on this crash issue I keep having with my Arch server. Basically server just completely freezes up - doesn't respond to pings, or keyboard/mouse input, and eventually has to just be rebooted.
Good news is:
a) after upgrading everything (including the router firmware) it no longer seems to hang my entire router/network (yay!)
b) I was able to get a screenshot of the issue (more details about that below)
Bad news is:
It keeps happening! (About once a week or so.)
I was able to capture a screenshot from the VM. (See http://darose.net/ServerCrash20220209.png) Basically this is the only thing that's providing me with any detail as to why it's crashing: "rcu_preempt detected stalls on cpus/tasks". But searching on that phrase didn't really give any clear indication what the problem might be. Plus the rest of the details from that message got obscured in the garbled video output.
Any thoughts/suggestions as to what might be happening here / how to debug welcome!
Thanks,
DR
On 1/21/22 12:54 PM, David Rosenstrauch wrote:
Been experiencing a weird issue several times recently that's got me stumped.
A couple of weeks ago, my entire home network went dead right in the middle of a zoom call. Same problem happened again late last night. The problem is intermittent/occasional: everything runs without issue for several days, then suddenly crashes.
I figured the issue was a problem with my wifi router, but after much debugging I don't think that's the case. I have a new (Arch) linux server I built a few weeks ago. What seems to be happening is that my server crashes for some reason
Thanks much for the response. I'm not really clear what specifically that might mean though: A buffer in the kernel? In an application? And how might I debug that and try to prevent it from happening again?
Thanks,
DR
On 2/10/22 6:44 PM, mick howe via arch-general wrote:
seems so obvious you already illuminated it but some buffer not clearing old data?
On Fri, 11 Feb 2022 at 09:32, David Rosenstrauch via arch-general <arch-general@lists.archlinux.org> wrote:
Following up on this crash issue I keep having with my Arch server. Basically server just completely freezes up - doesn't respond to pings, or keyboard/mouse input, and eventually has to just be rebooted.
On 2/10/22 18:32, David Rosenstrauch via arch-general wrote: ... ...
"rcu_preempt detected stalls on cpus/tasks". But searching on that
Since the CPU is stalled and unable to make further progress something is inhibiting it. One thought - I wonder if the CPU is waiting for memory IO - in which case do you have sufficient memory for the load on the cpu and do you have swap set up? Also it may be worthwhile running memcheck to be sure your memory is not faulty. gene
Another thought - if you can try a different network hardware that might be useful as well. And to be clear, this kernel is running on physical hardware not a VM right? If you're running on VM please share which host and VM is used. thanks.
On 2/11/22 9:56 AM, Genes Lists via arch-general wrote:
Another thought - if you can try a different network hardware that might be useful as well.
And to be clear, this kernel is running on physical hardware not a VM right? If you're running on VM please share which host and VM is used.
Sorry, yes, physical hardware. (I accidentally wrote earlier that I took a screenshot from the "VM" when I meant to write "KVM" - i.e., a KVM-over-IP.)
As far as different network hardware goes, that might be a little tricky to do, as I'm using the mobo's onboard LAN. I'd have to see if I have an old PCI network card somewhere in my junk box.
Thanks,
DR
On 2/11/22 9:21 AM, Genes Lists via arch-general wrote:
On 2/10/22 18:32, David Rosenstrauch via arch-general wrote: ... ...
"rcu_preempt detected stalls on cpus/tasks". But searching on that
Since the CPU is stalled and unable to make further progress something is inhibiting it. One thought - I wonder if the CPU is waiting for memory IO - in which case do you have sufficient memory for the load on the cpu and do you have swap set up?
I'm almost certain I have enough memory - 32GB on a machine that really is not doing anything very memory intensive. (Samba, dovecot, plex, foldingathome, etc.) And I have a 4GB swap file.
Also it may be worthwhile running memcheck to be sure your memory is not faulty.
Yeah that thought occurred to me as well. I ran a quick memtest86+ when I first built the computer and it didn't show any issues. But perhaps I should take some downtime and run it through the full 3 rounds.
Thanks,
DR
I suppose it could also be southbridge being annoying - did you check temps on the mobo are reasonable to be sure of adequate cooling? Are you overclocking at all by chance? sorry i know this kind of thing is frustrating to deal with.
Thanks much for following up! Responses inline.
On 2/11/22 4:15 PM, Genes Lists via arch-general wrote:
I suppose it could also be southbridge being annoying
It's a very new machine (Rocket Lake and PCIE4) so doesn't technically use the traditional northbridge/southbridge model. But point taken that this could be an issue with some component outside of the core CPU/RAM/PCIE assembly. My best guesses so far are that this is either an issue with the memory, or with the network chip. Haven't been able to confirm or refute either theory yet though.
did you check temps on the mobo are reasonable to be sure of adequate cooling? Are you overclocking at all by chance?
Temps don't appear to be an issue. I have a cron job that monitors temps and sends emails to root when it goes above around 80C. (Which I know works because I see those emails from time to time.) But there were no temp warning emails just before any of the times it's crashed. (3 or 4 times so far.)
sorry i know this kind of thing is frustrating to deal with.
Big time. I've gotten very good over the years at diagnosing and fixing issues using log messages. But sudden catastrophic crashes like this that don't leave any trace in the logs/journal are *really* hard to pin down. Thanks, DR
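For readers who want a temperature check along the lines of the cron job described in this message, a minimal sketch follows. The actual script is not shown in the thread, so everything here is an assumption: the 80C threshold is taken from the "around 80C" figure mentioned, and reading /sys/class/thermal is just one common way to get temperatures on Linux.

```shell
#!/bin/sh
# Sketch of a cron-driven temperature check. THRESHOLD_C and the use of
# /sys/class/thermal are assumptions; the real job in the thread is not shown.
THRESHOLD_C=80

# Succeeds (exit 0) when a millidegree reading exceeds the threshold in deg C.
over_threshold() {
    [ "$1" -gt $(( $2 * 1000 )) ]
}

# Print one line per thermal zone that is too hot. cron normally mails
# whatever a job prints, so no explicit mail command is used here.
check_zones() {
    for f in /sys/class/thermal/thermal_zone*/temp; do
        [ -r "$f" ] || continue                 # no zones exposed, or unreadable
        t=$(cat "$f" 2>/dev/null) || continue   # some zones refuse reads
        if over_threshold "$t" "$THRESHOLD_C"; then
            echo "$(date '+%F %T') $f: $(( t / 1000 )) C"
        fi
    done
}

check_zones
```

A job like this only leaves a trace when the threshold is crossed. Appending every reading to a file instead would show what the temperatures were right up to a freeze.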
Hi David,
Looking at your http://darose.net/ServerCrash20220209.png, are you aware of https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt which has detail on what it means? Though it looks to me like at least one line of output has been trampled.
Also, one Google'd suggestion was a real-time program was running amok, and you did start by saying the crash occurred in a Zoom call which sounds real-time-ish. Perhaps keep logging what RT processes there are to a file, followed by a sync(1) on the file, and see if something was loading the machine before the crash. I *think* ‘rtprio’ shows what I'm after.
    ps axww -o pid,rtprio,pcpu,wchan,comm | awk '$2 != "-"'
--
Cheers, Ralph.
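The logging idea above (snapshot the real-time processes, then sync the file) could be wrapped up roughly like this. It is a sketch, not a tested monitoring tool: the function name, log path and 60-second interval are arbitrary choices that do not come from the thread.

```shell
#!/bin/sh
# Sketch: append a snapshot of real-time-priority processes to a log and
# flush it to disk, so the last snapshot has a chance of surviving a freeze.
# log_rt_snapshot, the log path and the interval are hypothetical choices.

log_rt_snapshot() {
    logfile=$1
    {
        echo "=== $(date '+%F %T')"
        # Keeps the header line plus rows whose RTPRIO column is not "-".
        ps axww -o pid,rtprio,pcpu,wchan,comm | awk '$2 != "-"'
    } >> "$logfile"
    # Flush just this file where sync accepts a file argument, else everything.
    sync "$logfile" 2>/dev/null || sync
}

# Example use (run as a background job or a service):
#   while :; do log_rt_snapshot /var/log/rt-procs.log; sleep 60; done
```

After a crash, the tail of the log shows which real-time processes were present and how much CPU they were using at the last snapshot.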
sorry i know this kind of thing is frustrating to deal with.
Big time. I've gotten very good over the years at diagnosing and fixing issues using log messages. But sudden catastrophic crashes like this that don't leave any trace in the logs/journal are *really* hard to pin down.
Did you already consider setting up remote logging? A Raspberry Pi would suffice for that task and you would get any messages that make it out of the system… Uwe
Thanks,
DR
On 2/11/22 11:44 AM, David Rosenstrauch via arch-general wrote:
On 2/11/22 9:21 AM, Genes Lists via arch-general wrote:
Also it may be worthwhile running memcheck to be sure your memory is not faulty.
Yeah that thought occurred to me as well. I ran a quick memtest86+ when I first built the computer and it didn't show any issues. But perhaps I should take some downtime and have it run one through the full 3 rounds.
Following up with an update to close the loop on this thread, for anyone who's interested.
So bad news is: machine crashed again a couple of times the other day - and right in the middle of a large and important pacman update, no less! (150+ packages, including kernel, glibc, gcc, mariadb, bunch of other key libraries. Machine wound up not being bootable, and took several hours to recover. It was pretty ugly.) :-(
But good news is (aside from being able to recover from the failed update): I think I may have finally pinned down what's been causing these issues.
Long story short, I had an issue when I first built the box where the machine wouldn't POST and boot, with the mobo's "DRAM" health LED indicator lighting up. After much digging I was able to pin down that I had fastened the CPU cooler down too tight. That apparently can bend the mobo, and prevent some components from working correctly. In my case it was one of the memory slots, which is located very close to the CPU/cooler. (I figured it out for certain when I was able to boot with only one of the 2 memory sticks installed.) Once I pinpointed the issue, I loosened the cooler a bit and from then on I've been able to repeatedly boot as normal with both sticks installed.
Or so I thought! I'm pretty sure now that this has actually still been an issue, and that connectivity to one of the memory sticks has been periodically cutting out in the middle of operations and so causing these random crashes. Evidence: a) while debugging one of the recent crashes I again hit the same issue where it wouldn't POST and showed the same DRAM LED, and b) after I again took out the memory stick from the slot in question and rebooted, it's been running without issue ever since. (Although on 1/2 the RAM.)
I'm pretty sure the issue isn't that the memory stick is bad, as I've run memtests several times with no issues. But I'll also try swapping sticks to confirm.
I'll give it a few more days of uptime to confirm that this was indeed the issue, but I'm growing increasingly confident that that's the case. Will spend some time reinstalling the cooler, thermal paste, and 2nd RAM stick again when I get some time and try to get everything back to full RAM capacity without crashing.
Many thanks to everyone for all the helpful suggestions!
DR
On 2/16/22 16:28, David Rosenstrauch via arch-general wrote:
Well it's great to hear you've got a decent explanation - and consistent with the cpu hangs you noted. good luck getting everything fully operational.
gene
participants (7): Andy Pieters, David Rosenstrauch, Genes Lists, mick howe, Ralph Corderoy, Uwe Sauter, Łukasz Michalski