On 03/12/2018 05:13 AM, Jiachen Yang via arch-general wrote:
On 2018-03-12 11:19, David Rosenstrauch wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) on the quad-core CPU starts continuously using around 34% of CPU, but I'm not able to see (using htop) any process that's responsible for using all that CPU.
Can you check whether you have enabled the "Detailed CPU time" option in htop's setup (F2 -> Display options -> "Detailed CPU time")? From my experience and understanding, htop's CPU meter accounts for IO-wait/IRQ time by default, but it does not show it separately unless you enable the "Detailed CPU time" option, and that waiting time is not attributed to any individual process or kernel thread. Enabling that option will reveal more detailed CPU usage info. High IO-wait or IRQ time is itself an indication of misbehaving hardware, but at least you can be sure that it is not caused by more "dangerous" malware or attacks.
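For a quick cross-check outside of htop, the same categories can be read straight from /proc/stat: each per-CPU line carries user/nice/system/idle/iowait/irq/softirq counters in jiffies. Below is a minimal sketch in Python, assuming the field order documented in proc(5); it samples the counters twice and prints each CPU's iowait/irq/softirq share over the interval:

#!/usr/bin/env python3
# Sample /proc/stat twice and print, per CPU, the share of time spent in
# iowait, irq and softirq, i.e. the categories that htop's "Detailed CPU
# time" meters break out.  Field order per proc(5):
# user nice system idle iowait irq softirq steal guest guest_nice
import time

def read_cpu_times():
    times = {}
    with open("/proc/stat") as f:
        for line in f:
            # keep only the per-core lines (cpu0, cpu1, ...), skip the aggregate "cpu" line
            if line.startswith("cpu") and line[3].isdigit():
                fields = line.split()
                times[fields[0]] = [int(x) for x in fields[1:]]
    return times

before = read_cpu_times()
time.sleep(5)
after = read_cpu_times()

for cpu in sorted(before):
    delta = [a - b for a, b in zip(after[cpu], before[cpu])]
    total = sum(delta) or 1
    iowait, irq, softirq = delta[4], delta[5], delta[6]
    print(f"{cpu}: iowait {100 * iowait / total:.1f}%  "
          f"irq {100 * irq / total:.1f}%  softirq {100 * softirq / total:.1f}%")

If core #0 shows its load under irq or softirq here as well, the time really is being spent in interrupt handling rather than in some hidden process.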
Thanks for the suggestion. So this issue happened again tonight, and I switched to "Detailed CPU time" to try to research it further. According to htop, the CPU usage is from "irq" (orange color). I guess this would explain why I'm not seeing any process responsible, too. And it also might be related that I'm seeing these messages in my dmesg:

[ 871.317377] perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering kernel.perf_event_max_sample_rate to 50000

So if this issue is irq-based, I guess that means some piece of hardware is faulty or failing. Any idea how I might go about pinning down which one? Would there be info in the kernel log about this? Or something that I can look at in /proc?

Thanks,

DR
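For pinning down which device is behind the interrupts, /proc/interrupts is the usual place to look: each line lists per-CPU counts for one IRQ together with the driver/device name in the last column, so whichever line is climbing fastest on CPU0 points at the responsible hardware. Here is a minimal sketch in Python, assuming the standard /proc/interrupts layout, that samples the file twice and reports the busiest IRQ lines:

#!/usr/bin/env python3
# Sample /proc/interrupts twice and report which IRQ lines fired most over
# the interval.  The device/driver name in the last column of each line
# usually points at the hardware responsible.
import time

def read_interrupts():
    counts = {}
    with open("/proc/interrupts") as f:
        ncpu = len(f.readline().split())        # header row: CPU0 CPU1 ...
        for line in f:
            parts = line.split()
            if not parts:
                continue
            irq = parts[0].rstrip(":")
            # per-CPU counters follow the IRQ label; some lines (ERR, MIS)
            # have fewer than ncpu of them
            nums = []
            for tok in parts[1:1 + ncpu]:
                if tok.isdigit():
                    nums.append(int(tok))
                else:
                    break
            label = " ".join(parts[1 + len(nums):])
            counts[irq] = (nums, label)
    return counts

before = read_interrupts()
time.sleep(10)
after = read_interrupts()

deltas = []
for irq, (new, label) in after.items():
    old = before.get(irq, ([0] * len(new), label))[0]
    diff = [n - o for n, o in zip(new, old)]
    deltas.append((sum(diff), irq, diff, label))

# show the ten busiest interrupt sources; the first per-CPU number is CPU0
for total, irq, diff, label in sorted(deltas, reverse=True)[:10]:
    print(f"IRQ {irq:>8}  total {total:>8}  per-CPU {diff}  {label}")

Running this a few times while the load is present should show one IRQ (or a timer/NIC source) growing far faster than the rest; watch -n1 cat /proc/interrupts gives a rougher but quicker view of the same counters.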