On 3/12/18, David Rosenstrauch <darose@darose.net> wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) on the quad core CPU starts continuously using around 34% of CPU, but I'm not able to see (using htop) any process that's responsible for using all that CPU. Even when I tell htop to show me kernel threads too, I still am not able to see the offending process. Every process remains under 1% CPU usage (except for occasional, small, short-lived spikes up) yet the CPU usage on that core remains permanently hovering at around 34%. The problem goes away when I reboot, but then comes back within a day or so.
My gut feeling is that one of the kernel worker threads is hung. That would show up as 100% of the affected core, or 25% overall on a quad core. But you say there's no load to be found in the kernel threads, which is odd.

Or, if the server is accessible from the Internet, is it possible that it's been rooted and someone is running a hidden process? To rule that out, cut off Internet access and let it run for two days. I don't think there are any legitimate processes that are hidden from htop or top, since that would make them look like rootkits. So if the guilty process really is invisible, that is definitely unusual. It's scary to consider a rootkit, but if that's the case, it's best to know as soon as possible. I hope this is not the case for you; I wouldn't wish it on my worst enemy.

Another idea: can you limit the cores to one or maybe two and see if it becomes easier to pinpoint? This might work on the running system (as root):

    echo 0 > /sys/devices/system/cpu/cpu1/online
    echo 0 > /sys/devices/system/cpu/cpu2/online
    echo 0 > /sys/devices/system/cpu/cpu3/online

Alternatively, maxcpus=1 on the kernel command line should work.
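
One more thing that might be worth checking, since htop's process list only shows time charged to processes: time a core spends servicing hardware interrupts or softirqs is not attributed to any process, yet it still counts as CPU usage. Below is a rough sketch, not a definitive tool, that compares two samples of a core's line in /proc/stat and prints how the elapsed ticks split across categories. The function names cpu_delta and sample_core are made up for this example, and the 5-second default interval is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: cpu_delta and sample_core are hypothetical helper names.

# cpu_delta "<before>" "<after>"
# Each argument is one "cpuN ..." line from /proc/stat. The fields after the
# label are: user nice system idle iowait irq softirq steal (guest fields are
# ignored here).
cpu_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i; next }
        {
            split("user nice system idle iowait irq softirq steal", name, " ")
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - before[i]; total += d[i] }
            if (total == 0) { print "no ticks elapsed"; exit }
            # Share of the elapsed ticks spent in each category.
            for (i = 2; i <= 9; i++)
                printf "%s=%.1f%%\n", name[i - 1], 100 * d[i] / total
        }'
}

# sample_core <core number> [seconds]
# Takes two samples of the given core from /proc/stat and reports the split.
sample_core() {
    core=$1
    secs=${2:-5}
    a=$(grep "^cpu$core " /proc/stat)
    sleep "$secs"
    b=$(grep "^cpu$core " /proc/stat)
    cpu_delta "$a" "$b"
}
```

Running `sample_core 0` while the problem is occurring would show the split for core #0. If most of the non-idle share lands in irq or softirq rather than user or system, the per-CPU counters in /proc/interrupts and /proc/softirqs may help narrow down which source is responsible.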