[arch-general] High CPU on one core, but unable to find process responsible

David Rosenstrauch darose at darose.net
Tue Mar 13 01:23:21 UTC 2018

On 03/12/2018 05:13 AM, Jiachen Yang via arch-general wrote:
> On 2018-03-12 11:19, David Rosenstrauch wrote:
>> My server's been exhibiting some very strange behavior lately.  Every
>> couple of days I run into a situation where one core (core #0) on the
>> quad core CPU starts continuously using around 34% of CPU, but I'm not
>> able to see (using htop) any process that's responsible for using all
>> that CPU.

> Can you check whether you have enabled the "Detailed CPU time" option
> in htop's setup (F2 -> Display options -> "Detailed CPU time")?
> From my experience and understanding, htop's CPU meter counts
> IO-wait and IRQ time by default but does not show them separately
> unless you enable the "Detailed CPU time" option.
> This time is not attributed to any process or kernel thread.
> Enabling that option will reveal more detailed CPU usage info.
> High IO-wait or IRQ time is itself an indication of some misbehaving
> hardware, but at least you can be sure that it is not caused by more
> "dangerous" malware or attacks.

Thanks for the suggestion.  This issue happened again tonight, so I
switched to "Detailed CPU time" to research it further.
According to htop, the CPU usage is from "irq" (orange color).  I guess
that would also explain why I'm not seeing any responsible process.
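
For anyone wanting to confirm the same thing outside htop: the per-core
counters in /proc/stat carry separate irq and softirq columns, so two
snapshots a few seconds apart give the share of time each core spent
servicing interrupts. Below is a minimal sketch of that idea. The
function names and the 5-second interval are my own choices for
illustration, and I'm assuming the standard field order documented in
proc(5).

```python
import time

# Field order of the per-CPU lines in /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Return {cpu_name: {field: ticks}} for the per-core 'cpuN' lines."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate 'cpu' line and non-CPU lines such as 'intr'.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        cores[parts[0]] = dict(zip(FIELDS, values))
    return cores


def irq_share(before, after):
    """Percentage of each core's elapsed ticks spent in irq and softirq."""
    shares = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = {f: new[f] - old.get(f, 0) for f in new}
        total = sum(delta.values())
        if total > 0:
            shares[cpu] = {
                "irq": 100.0 * delta.get("irq", 0) / total,
                "softirq": 100.0 * delta.get("softirq", 0) / total,
            }
    return shares


def sample_irq_share(interval=5.0, path="/proc/stat"):
    """Take two snapshots 'interval' seconds apart (interval is arbitrary)."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return irq_share(first, second)
```

Calling sample_irq_share() on the affected machine and printing the
result should show whether cpu0 stands out from the other cores in the
irq or softirq column.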

It might also be related that I'm seeing these messages in dmesg:

[  871.317377] perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering kernel.perf_event_max_sample_rate to 50000

So if this issue is irq-based, I guess that means some piece of hardware 
is faulty or failing.  Any idea how I might go about pinning down which 
one?  Would there be info in the kernel log about this?  Or something 
that I can look at in /proc?
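
One place in /proc that seems worth a look is /proc/interrupts, which
keeps a running count of interrupts per source and per CPU. Comparing
two snapshots should show which source is firing most often on CPU0.
The following is only a rough sketch of that comparison: the function
names are made up for illustration, and the parser assumes the usual
layout (a header row of CPU names, then "label: count count ...
description").

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into (cpu_names, {label: (counts, desc)})."""
    lines = text.splitlines()
    cpus = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        counts = []
        # Leading numeric tokens are per-CPU counts; rows such as ERR or
        # MIS carry fewer counts than there are CPUs.
        for tok in tokens[:len(cpus)]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        desc = " ".join(tokens[len(counts):])
        table[label.strip()] = (counts, desc)
    return cpus, table


def busiest_sources(before, after, cpu_index=0, limit=5):
    """Return the sources whose count on one CPU grew the most.

    Each entry is (delta, label, description), largest delta first.
    """
    deltas = []
    for label, (new_counts, desc) in after.items():
        if cpu_index >= len(new_counts):
            continue
        old_counts = before.get(label, ([], desc))[0]
        old = old_counts[cpu_index] if cpu_index < len(old_counts) else 0
        deltas.append((new_counts[cpu_index] - old, label, desc))
    deltas.sort(reverse=True)
    return deltas[:limit]
```

To use it, read /proc/interrupts twice a few seconds apart while the
problem is occurring, pass each text through parse_interrupts(), and
feed the two tables to busiest_sources(). The description column of the
top entry names the driver or device attached to that interrupt line.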



