[arch-general] High CPU on one core, but unable to find process responsible
darose at darose.net
Tue Mar 13 01:23:21 UTC 2018
On 03/12/2018 05:13 AM, Jiachen Yang via arch-general wrote:
> On 2018年03月12日 11:19, David Rosenstrauch wrote:
>> My server's been exhibiting some very strange behavior lately. Every
>> couple of days I run into a situation where one core (core #0) on the
>> quad core CPU starts continuously using around 34% of CPU, but I'm not
>> able to see (using htop) any process that's responsible for using all
>> that CPU.
> Can you check whether you have enabled "Detailed CPU time" option in
> htop's setup (F2 -> Display options -> "Detailed CPU time")?
> From my experience and understanding, htop's CPU meter counts
> IO-wait and IRQ-handling time by default, but does not show them
> separately unless you enable the "Detailed CPU time" option.
> This time is also not attributed to any process or kernel thread.
> Enabling that option will reveal more detailed CPU usage info.
> High IO-wait or IRQ time is itself an indication of some misbehaving
> hardware, but at least you can be sure that it is not caused by more
> "dangerous" malware or attacks.
Thanks for the suggestion. So this issue happened again tonight, and I
switched to "Detailed CPU time" to try to research it further.
According to htop, the CPU usage is from "irq" (orange color). I guess
this would also explain why I'm not seeing any process responsible.
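
The same breakdown can be cross-checked outside htop by sampling
/proc/stat twice and looking at where core 0's ticks went. What follows
is only a minimal sketch, not something from this thread: the function
names are hypothetical, and it assumes the "cpuN" field order documented
in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal,
guest, guest_nice).

```python
import time

# Field order of the "cpuN" lines in /proc/stat, as documented in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_proc_stat(text):
    """Return {cpu_name: {field: ticks}} for every per-core 'cpuN' line."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines (intr, ctxt, ...).
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        cpus[parts[0]] = dict(zip(FIELDS, values))
    return cpus


def cpu_shares(before, after, cpu="cpu0"):
    """Fractions of elapsed ticks on `cpu` spent in irq, softirq and iowait."""
    delta = {k: after[cpu][k] - before[cpu].get(k, 0) for k in after[cpu]}
    # guest and guest_nice are already counted inside user and nice,
    # so they are left out of the total to avoid double counting.
    total = sum(v for k, v in delta.items()
                if k not in ("guest", "guest_nice"))
    if total <= 0:
        return {"irq": 0.0, "softirq": 0.0, "iowait": 0.0}
    return {k: delta.get(k, 0) / total for k in ("irq", "softirq", "iowait")}


def sample_live(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and print per-core shares."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    for name in sorted(second):
        shares = cpu_shares(first, second, name)
        print(name, {k: f"{v:.1%}" for k, v in shares.items()})
```

Calling sample_live() while the problem is occurring should show whether
the time on cpu0 is hard-IRQ ("irq") or soft-IRQ ("softirq") time, which
htop draws as two separate colors in detailed mode.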
It might also be related that I'm seeing these messages in my dmesg:
[ 871.317377] perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering kernel.perf_event_max_sample_rate to 50000
So if this issue is irq-based, I guess that means some piece of hardware
is faulty or failing. Any idea how I might go about pinning down which
one? Would there be info in the kernel log about this? Or something
that I can look at in /proc?
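
One place in /proc that bears on this question is /proc/interrupts,
which keeps a per-CPU counter for every interrupt source. Comparing two
snapshots shows which source is firing fastest on a given core. The
following is a rough sketch under assumptions, not a tested diagnostic:
the function names are hypothetical, and the parser assumes the usual
layout of a CPU header line followed by "label: counts... description"
rows.

```python
import time


def parse_interrupts(text):
    """Return {label: (per_cpu_counts, description)} from /proc/interrupts text."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        counts = []
        # Rows such as ERR and MIS carry a single count, so stop at the
        # first non-numeric token instead of assuming ncpu columns.
        for tok in tokens[:ncpu]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        table[label.strip()] = (counts, " ".join(tokens[len(counts):]))
    return table


def busiest(before, after, cpu=0, top=5):
    """Return the `top` (delta, label, description) rows for one CPU column."""
    deltas = []
    for label, (counts, desc) in after.items():
        if cpu >= len(counts):
            continue
        old = before.get(label, ([], ""))[0]
        prev = old[cpu] if cpu < len(old) else 0
        deltas.append((counts[cpu] - prev, label, desc))
    deltas.sort(reverse=True)
    return deltas[:top]


def sample_live(interval=5.0, cpu=0, path="/proc/interrupts"):
    """Snapshot /proc/interrupts twice and print the fastest-growing sources."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    for delta, label, desc in busiest(first, second, cpu):
        print(f"{delta:>10}  {label:>5}  {desc}")
```

Running sample_live(cpu=0) while core #0 is busy should list the
interrupt lines whose counters grew the most during the interval; the
description column names the driver or device attached to each line. A
high count alone does not prove that the hardware is faulty, only which
source to look at next.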