Hello David,
When you say server crashes - I assume kernel crashed and is not functioning (as opposed to kernel is up and running but network is not doing what I want)? Does server respond to any keyboard or mouse movement on console?
It's hard to know for sure exactly what's happening, as the server is headless. It does have a KVM-over-IP hooked up to it, though. But when I try to use the KVM when the server gets in this state, it doesn't show anything on the screen, or respond to any keyboard or mouse input. (At least not that I can see.) Eventually I wind up having to just power off the machine.
Does your server support IPMI and thus SOL (serial over LAN)? If so, you could configure your server to also provide a login on the serial device, which it will also use to log kernel messages.
- Configure SOL in EFI
- Add "console=tty console=ttyS0,115200n8" (if SOL uses ttyS0/COM1, else use ttyS1) to the kernel command line (how to depends on your boot mechanism)
- Configure BMC credentials and privileges for SOL
- Use "ipmitool -I lanplus -U $USER -P $PASSWORD -H $SERVERIP sol activate" from another host in your network to connect. You can exit with "<enter> ~ ."
Alternatively use "ipmiconsole -u $USER -p $PASSWORD -h $HOST" from the freeipmi package and exit with "ctrl+esc & ." (if I remember correctly).
In order to catch those hangs it would then be best to have SOL connected and logged around the clock, e.g. by using a Raspberry Pi and Conserver (https://aur.archlinux.org/packages/conserver/)
Regards,
Uwe
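After rebooting with the changed kernel command line, a quick check that the kernel actually picked up the serial console might look like the sketch below. The helper name has_serial_console is made up for illustration, and it only inspects /proc/cmdline; it says nothing about whether SOL itself is set up correctly in the BMC.

```shell
# Succeed if the kernel command line names the given serial console.
# $1 = console device (ttyS0 or ttyS1)
# $2 = file to inspect, defaults to /proc/cmdline
has_serial_console() {
    # Split the command line into one argument per line, then look for
    # "console=<dev>" either alone or followed by ",<options>".
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -Eq "^console=$1(,|\$)"
}
```

Usage would be something like: has_serial_console ttyS0 && echo "serial console is on the command line"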
What do kernel logs show leading up to the crash? If the kernel oopsed it is often logged. Prior log messages may be helpful in tracking down that problem.
When the crash happened this time I went back and looked at the logs and system journal. Last log message was at 2:59:01am. Timing is a little suspicious, as I do have some cron jobs that kick off at 3am. However, the first time the crash happened it was during the day, so I don't think 3am cron jobs could be solely responsible.
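For anyone wanting to repeat that check, a rough sketch of pulling the tail of the previous boot's log with journalctl is below. It assumes journald keeps a persistent journal (otherwise nothing from before the reboot survives), and the function names are made up for illustration.

```shell
# Last N journal lines (all sources) from the boot before the current one.
# $1 = number of lines, defaults to 50
prev_boot_tail() {
    journalctl -b -1 -n "${1:-50}" --no-pager
}

# Same, restricted to kernel messages, which is where an oops or panic
# would appear if it reached the disk before the machine hung.
prev_boot_kernel_tail() {
    journalctl -k -b -1 -n "${1:-50}" --no-pager
}
```

Calling prev_boot_kernel_tail 100 right after the forced power cycle would show the last 100 kernel lines written before the hang.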
Is server running desktop or no graphics?
It's mostly a server. But I do have a desktop manager running (using the CPU's onboard graphics) and connected to the KVM in case I need to log in. (Which I rarely do. I mostly SSH in.)
Your network topology - Is the Netgear routing to internet or is it your new server or a separate firewall? Is your internet ipv4 or ipv6 or both? Same question - is internet firewall routing (ipv6) or NAT'ing (ipv4)?
Netgear router is home LAN's gateway to the internet. It uses a combination of ipv4 and v6.
Is there a periodicity to the problem? Like say dhcp lease expiration length or something?
Not that I can tell. Again, the issue only happened twice - once at around 10am, and the second time at (it looks like) 3am. And there was probably a good week or so between the 2 events.
Can we assume that Netgear has been working fine with same configs before new server deployed?
Yep. Router had been pretty solid up until this point. Plus the fact that unplugging the server's network cable makes the issue go away leads me to believe the problem isn't the router.
Assuming the netgear is your firewall / router to internet - When things are "broken" can internal clients see each other (say ping from one client to another) or is all internal traffic hung up as well as internet traffic?
Anything using an ethernet port on the router pretty much goes dead. That includes the new server itself, a network printer, the upstream modem that the router is connected to, a POE wifi extender in the other room, etc. IIRC, it might have still been possible to connect to the router using wifi during the outage. (I.e., it would respond and assign an IP address with DHCP.) But with the upstream modem unreachable, having a wifi connection wasn't of much use.
And in that vein, sorry for obvious, but I'll ask anyway - can I assume only 1 server (or kea with hot-standby) is providing dhcp service?
Yep, only 1 machine on the network handing out IP addresses - the router itself.
Also I notice that latest stable dd-wrt on website is r44715 and your build seems to be beta from last July - I note there are newer builds of the beta - I have no view on the firmware just making an observation.
Yes, I am definitely a bit behind on dd-wrt updates. But the version I'm running has been quite stable up till now, so I didn't see any urgency to update.
One thing I did notice after doing a bit more digging on this issue: although the r8169 network module does seem to work with the mobo's onboard network chip (RTL8125B), that's not technically the right driver for it. There's an r8125 module that's not part of the kernel, which is available on AUR. (https://aur.archlinux.org/packages/r8125-dkms/) I've switched over to start using that (and blacklisted r8169). I've also upgraded to the most recent kernel (5.16.2). So I'm watching to see if either/both of those changes eliminate the issue.
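A minimal sketch of how the driver switch can be checked, for reference. The blacklist file name and the interface name enp3s0 are placeholders, and nic_driver is a made-up helper that just reads the driver symlink from sysfs.

```shell
# Keep the in-kernel r8169 driver from claiming the NIC (run as root;
# the file name under /etc/modprobe.d/ is arbitrary):
#   echo 'blacklist r8169' > /etc/modprobe.d/blacklist-r8169.conf

# Print the kernel driver currently bound to a network interface.
# $1 = interface name
# $2 = sysfs root, defaults to /sys
nic_driver() {
    link="${2:-/sys}/class/net/$1/device/driver"
    # Virtual interfaces such as lo have no device/driver entry.
    [ -e "$link" ] || return 1
    basename "$(readlink -f "$link")"
}
```

After a reboot, nic_driver enp3s0 would be expected to print r8125 rather than r8169 if the switch took effect.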
I was mostly posting to the list to ask whether anyone had heard of a crashed server either sending malformed network packets or flooding the network in such a way that it could render a router/switch inoperable. From the limited amount I know of networking, I think that might be possible. But I don't know exactly how one would remedy something like that. I guess fixing the underlying issue that is crashing the server would be the way to do that, but I haven't been able to pin down the cause yet.
Anyway, thanks again for the response, and I appreciate the debugging tips. If any other ideas come to mind, please LMK.
Thanks,
DR