[arch-general] High CPU on one core, but unable to find process responsible
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) on the quad-core CPU starts continuously using around 34% of CPU, but I'm not able to see (using htop) any process that's responsible for using all that CPU. Even when I tell htop to show me kernel threads too, I still am not able to see the offending process. Every process remains under 1% CPU usage (except for occasional, small, short-lived spikes), yet the CPU usage on that core remains permanently hovering at around 34%. The problem goes away when I reboot, but then comes back within a day or so.

I'm rather stumped as to how to fix this. The server is a bit old, running an up-to-date installation of Arch on an Intel Core 2 Quad Q6600 CPU. Any suggestions anyone might have as to either what might be going on here, or how to go about debugging it, would be greatly appreciated.

Thanks!

DR
On 3/12/18, David Rosenstrauch <darose@darose.net> wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) on the quad-core CPU starts continuously using around 34% of CPU, but I'm not able to see (using htop) any process that's responsible for using all that CPU. Even when I tell htop to show me kernel threads too, I still am not able to see the offending process. Every process remains under 1% CPU usage (except for occasional, small, short-lived spikes), yet the CPU usage on that core remains permanently hovering at around 34%. The problem goes away when I reboot, but then comes back within a day or so.
My gut feeling is that one of the kernel worker threads hangs. That would be 25% overall and 100% of the affected core. But you say there's no load to be found in the kernel threads, which is odd. Or, if the server is accessible from the Internet, is it possible it's rooted and someone's running a hidden process? To confirm this isn't the case, cut off Internet access and let it run for two days. I don't know of any legitimate process that deliberately hides from htop or top, since behaving that way would make it look like a rootkit. That means if the guilty process is really invisible, then it's definitely unusual. It's scary to consider a rootkit, but if that's the case, then it's best to be aware as soon as possible. I hope this is not the case for you; I wouldn't wish it on your worst enemy.

Another idea: can you limit the machine to one or maybe two cores and see if it becomes easier to pinpoint? This might work in the booted system:

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

Otherwise, maxcpus=1 on the kernel command line should work.
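For reference, the same sysfs interface can be used to check which cores are online and to bring them back afterwards without a reboot (a minimal sketch, assuming the standard CPU-hotplug files; cpu0 itself usually cannot be taken offline on x86):

    cat /sys/devices/system/cpu/online               # e.g. prints 0-3 when all four cores are up
    echo 1 > /sys/devices/system/cpu/cpu1/online     # re-enable a core taken offline above
    echo 1 > /sys/devices/system/cpu/cpu2/online
    echo 1 > /sys/devices/system/cpu/cpu3/online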
On 2018年03月12日 11:19, David Rosenstrauch wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) on the quad-core CPU starts continuously using around 34% of CPU, but I'm not able to see (using htop) any process that's responsible for using all that CPU. Even when I tell htop to show me kernel threads too, I still am not able to see the offending process. Every process remains under 1% CPU usage (except for occasional, small, short-lived spikes), yet the CPU usage on that core remains permanently hovering at around 34%. The problem goes away when I reboot, but then comes back within a day or so.
I'm rather stumped as to how to fix this. The server is a bit old, running an up-to-date installation of Arch on an Intel Core 2 Quad Q6600 CPU. Any suggestions anyone might have as to either what might be going on here, or how to go about debugging it, would be greatly appreciated.
Thanks!
DR
Can you check whether you have enabled the "Detailed CPU time" option in htop's setup (F2 -> Display options -> "Detailed CPU time")? From my experience and understanding, htop's CPU meter includes IO-wait/IRQ time by default but does not show it separately unless you enable the "Detailed CPU time" option, and that waiting time is not attributed to any individual process or kernel thread. Enabling that option will reveal more detailed CPU usage info. High IO-wait or IRQ time can itself be an indication of misbehaving hardware, but at least you can be sure it is not caused by more "dangerous" malware or attacks.
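For reference, the per-core buckets htop draws from can also be read directly from /proc/stat; a minimal sketch (field positions per the standard /proc/stat layout, values are cumulative ticks):

    # columns after the cpu label: user nice system idle iowait irq softirq steal ...
    awk '/^cpu[0-9]/ {printf "%s iowait=%s irq=%s softirq=%s\n", $1, $6, $7, $8}' /proc/stat

If one core's irq or softirq counter is growing much faster than the others, that matches what the detailed htop meter would show.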
On 03/12/2018 05:13 AM, Jiachen Yang via arch-general wrote:
On 2018年03月12日 11:19, David Rosenstrauch wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) on the quad core CPU starts continuously using around 34% of CPU, but I'm not able to see (using htop) any process that's responsible for using all that CPU.
Can you check whether you have enabled the "Detailed CPU time" option in htop's setup (F2 -> Display options -> "Detailed CPU time")? From my experience and understanding, htop's CPU meter includes IO-wait/IRQ time by default but does not show it separately unless you enable the "Detailed CPU time" option, and that waiting time is not attributed to any individual process or kernel thread. Enabling that option will reveal more detailed CPU usage info. High IO-wait or IRQ time can itself be an indication of misbehaving hardware, but at least you can be sure it is not caused by more "dangerous" malware or attacks.
Thanks for the suggestion. So this issue happened again tonight, and I switched to "Detailed CPU time" to try to research it further. According to htop, the CPU usage is from "irq" (orange color). I guess that would also explain why I'm not seeing any process responsible. It might also be related that I'm seeing these messages in my dmesg:

[  871.317377] perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering kernel.perf_event_max_sample_rate to 50000

So if this issue is irq-based, I guess that means some piece of hardware is faulty or failing. Any idea how I might go about pinning down which one? Would there be info in the kernel log about this? Or something that I can look at in /proc?

Thanks,

DR
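One minimal way to narrow that down is to snapshot the per-IRQ, per-CPU counters in /proc/interrupts twice and compare; whichever line keeps climbing while the core is busy is the likely culprit (the file names below are just examples):

    cat /proc/interrupts > /tmp/irq.before
    sleep 10
    cat /proc/interrupts > /tmp/irq.after
    diff /tmp/irq.before /tmp/irq.after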
On 3/13/18, David Rosenstrauch <darose@darose.net> wrote:
So if this issue is irq-based, I guess that means some piece of hardware is faulty or failing. Any idea how I might go about pinning down which one? Would there be info in the kernel log about this? Or something that I can look at in /proc?
Any BIOS updates or kernel updates recently (4.15.8)? Try with 3.16 or 4.9 or another old LTS kernel from archive.archlinux.org just for testing (not production). It's more likely that the kernel regressed than that an IRQ issue popped up suddenly on its own; the latter is possible but less likely.
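A rough sketch of testing an older kernel from the archive, assuming the usual Arch Linux Archive layout; the version and filename below are placeholders, so check the index for what actually exists, and grab the matching headers package if you build any modules:

    # browse https://archive.archlinux.org/packages/l/linux-lts/ and pick a version
    pacman -U https://archive.archlinux.org/packages/l/linux-lts/linux-lts-4.9.65-1-x86_64.pkg.tar.xz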
On 03/12/2018 09:56 PM, Carsten Mattner via arch-general wrote:
Any BIOS updates or kernel updates recently (4.15.8)?
Try with 3.16 or 4.9 or another old LTS kernel from archive.archlinux.org just for testing (not production).
It's more likely that the kernel regressed than that an IRQ issue popped up suddenly on its own; the latter is possible but less likely.
No BIOS updates. As far as kernel updates, I do those all the time, so I'm not sure that would be the cause.

After doing some digging, though, I did "cat /proc/interrupts", and this line stood out for having an astronomically high number:

            CPU0       CPU1   CPU2   CPU3
 ...
  16: 2424156658          0      0      0   IO-APIC  16-fasteoi   uhci_hcd:usb5, parport1

I have an old PCI card in the machine that powers an old parallel port printer I used to use with it. Perhaps that's failing. I don't need the card anymore, so I might as well try taking it out and see if that makes things better. If not, I'll be back. :-)

Thanks,

DR
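A quicker intermediate test, assuming the card sits behind the standard parport_pc driver (an assumption, since the exact driver isn't shown here): unloading it should detach the parport1 handler from IRQ 16 without opening the case, and then you can watch whether the counter stops climbing (the IRQ line itself stays, since it's shared with uhci_hcd):

    lsmod | grep parport             # see which parport modules are loaded
    modprobe -r lp parport_pc        # remove dependents such as lp/ppdev first if this complains
    grep '^ *16:' /proc/interrupts   # re-check the counter a few times afterwards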
On 3/13/18, David Rosenstrauch <darose@darose.net> wrote:
On 03/12/2018 09:56 PM, Carsten Mattner via arch-general wrote:
Any BIOS updates or kernel updates recently (4.15.8)?
Try with 3.16 or 4.9 or another old lts kernel from archive.archlinux.org just for testing (not production).
It's more likely that the kernel regressed rather than IRQ issues popping up suddenly. It's possible but less likely.
No BIOS updates. As far as kernel updates, I do those all the time, so not sure that would be the cause.
After doing some digging, though, I did "cat /proc/interrupts", and this line stood out for having an astronomically high number:

            CPU0       CPU1   CPU2   CPU3
 ...
  16: 2424156658          0      0      0   IO-APIC  16-fasteoi   uhci_hcd:usb5, parport1

I have an old PCI card in the machine that powers an old parallel port printer I used to use with it. Perhaps that's failing. I don't need the card anymore, so I might as well try taking it out and see if that makes things better. If not, I'll be back. :-)
Let us know in either case. Please tell me it's a huge plotter you use to make construction plans for your rocket ships.
On 03/13/2018 03:34 AM, Carsten Mattner via arch-general wrote:
On 3/13/18, David Rosenstrauch <darose@darose.net> wrote:
I have an old PCI card in the machine that powers an old parallel port printer I used to use with it. Perhaps that's failing. I don't need the card anymore, so I might as well try taking it out and see if that makes things better. If not, I'll be back. :-)
Let us know in either case. Please tell me it's a huge plotter you use to make construction plans for your rocket ships.
Just an old printer that's been good to me and that I've been reluctant to let go of: https://www.cnet.com/products/hp-laserjet-6l/specs/

DR
On 03/11/2018 09:19 PM, David Rosenstrauch wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) on the quad-core CPU starts continuously using around 34% of CPU, but I'm not able to see (using htop) any process that's responsible for using all that CPU. Even when I tell htop to show me kernel threads too, I still am not able to see the offending process. Every process remains under 1% CPU usage (except for occasional, small, short-lived spikes), yet the CPU usage on that core remains permanently hovering at around 34%. The problem goes away when I reboot, but then comes back within a day or so.
I'm rather stumped as to how to fix this. The server is a bit old, running an up-to-date installation of Arch on an Intel Core 2 Quad Q6600 CPU. Any suggestions anyone might have as to either what might be going on here, or how to go about debugging it, would be greatly appreciated.
Thanks!
DR
DR,

You are on to something. In the past 24 hours I've experienced two hardlocks on two separate multi-CPU Arch servers, both AMD-based boxes. One is the dual quad-core Opteron box that was the subject of the 4.15.8 update failure that left the box unbootable. Then today a quad quad-core Opteron box hardlocked doing a simple rsync from another server on the LAN. Other than the rsync process, the box is nearly idle, e.g.

top - 19:49:45 up 8 min,  2 users,  load average: 2.93, 2.23, 1.11
Tasks: 271 total,   1 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu0  : 0.0/0.0  0[          ]
%Cpu1  : 0.0/0.0  0[          ]
%Cpu2  : 0.0/0.0  0[          ]
%Cpu3  : 0.0/0.0  0[          ]
%Cpu4  : 0.0/0.0  0[          ]
%Cpu5  : 0.0/0.0  0[          ]
%Cpu6  : 0.0/0.0  0[          ]
%Cpu7  : 0.0/0.0  0[          ]
%Cpu8  : 0.0/0.0  0[          ]
%Cpu9  : 0.0/0.0  0[          ]
%Cpu10 : 0.0/0.0  0[          ]
%Cpu11 : 0.0/0.7  1[|         ]
%Cpu12 : 0.0/0.0  0[          ]
%Cpu13 : 0.0/0.0  0[          ]
%Cpu14 : 0.0/0.0  0[          ]
%Cpu15 : 0.0/0.0  0[          ]
GiB Mem : 1.7/62.915  [          ]
GiB Swap: 0.0/1.998   [          ]

I could not get a top during the lockup, because when the lockup occurred, while I could connect to the server from another box via ssh, I could not log in. E.g.,

19:34 wizard:~/dev/src-c/tmp/debug> svh
Last login: Mon Mar 12 19:34:44 2018 from 192.168.6.104
^CConnection to valhalla closed.

Both boxes have exhibited strange behavior regarding the linux-raid array (all disks are fine), but I receive spurious errors like (Out of IOMMU space), Huh?

Mar 12 19:45:20 valhalla su[869]: pam_unix(su:session): session opened for user root by david(uid=1000)
Mar 12 19:45:57 valhalla kernel: sata_nv 0000:00:05.0: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:45:57 valhalla kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:45:57 valhalla kernel: ata3: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:45:57 valhalla kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:45:57 valhalla kernel: ata3: tag : dhfis dmafis sdbfis sactive
Mar 12 19:45:57 valhalla kernel: ata3.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:45:57 valhalla kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:45:57 valhalla kernel: ata3.00: cmd 61/00:10:00:d0:e4/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 ou
                                          res 40/00:20:00:ea:e3/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:45:57 valhalla kernel: ata3.00: status: { DRDY }
Mar 12 19:45:57 valhalla kernel: ata3: hard resetting link
Mar 12 19:45:57 valhalla kernel: ata3: nv: skipping hardreset on occupied port
Mar 12 19:45:58 valhalla kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:45:58 valhalla kernel: ata3.00: configured for UDMA/133
Mar 12 19:45:58 valhalla kernel: ata3: EH complete
Mar 12 19:46:09 valhalla kernel: sata_nv 0000:00:05.1: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:46:09 valhalla kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:46:09 valhalla kernel: ata5: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:46:09 valhalla kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:46:09 valhalla kernel: ata5: tag : dhfis dmafis sdbfis sactive
Mar 12 19:46:09 valhalla kernel: ata5.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:46:09 valhalla kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:46:09 valhalla kernel: ata5.00: cmd 61/00:10:00:c0:f8/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 ou
                                          res 40/00:20:00:da:f7/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:46:09 valhalla kernel: ata5.00: status: { DRDY }
Mar 12 19:46:09 valhalla kernel: ata5: hard resetting link
Mar 12 19:46:09 valhalla kernel: ata5: nv: skipping hardreset on occupied port
Mar 12 19:46:10 valhalla kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:46:10 valhalla kernel: ata5.00: configured for UDMA/133
Mar 12 19:46:10 valhalla kernel: ata5: EH complete

Today's lockup occurred after an update (where 29 packages were updated), e.g.

Packages (29) calc-2.12.6.7-1  cryptsetup-2.0.2-1  device-mapper-2.02.177-5  e2fsprogs-1.44.0-1
              efivar-34-1  filesystem-2018.1-2  gawk-4.2.1-1  git-2.16.2-2  graphite-1:1.3.11-1
              hdparm-9.54-1  imagemagick-7.0.7.26-1  libevdev-1.5.9-1  libmagick-7.0.7.26-1
              libmagick6-6.9.9.38-1  libsystemd-238.0-3  libzip-1.4.0-1  linux-lts-4.14.25-1
              linux-lts-headers-4.14.25-1  lvm2-2.02.177-5  nano-2.9.4-1  qt5-base-5.10.1-5
              systemd-238.0-3  systemd-sysvcompat-238.0-3  tcl-8.6.8-2  vulkan-icd-loader-1.1.70.0-3
              wxgtk-common-3.0.4-1  wxgtk2-3.0.4-1  wxgtk3-3.0.4-1  xfsprogs-4.15.1-1

Following the update, I reloaded systemd (systemctl daemon-reload) and then proceeded with the rsync transfer, which locked. I note we have vulkan-icd-loader and qt5-base, both of which also had updates yesterday at the time of the locked-system update that caused so many problems.

I don't know where to begin. I've been monitoring with top since the last lockup, but have not caught anything. Both of these boxes have run Arch since 2016 and have never experienced any problems before now. Both are SuperMicro boards, and both have either 2 or 4 quad-core Opterons. Both are raid1 linux-raid based boxes and have either 32 or 64G of RAM. Something has gone haywire in the past week that has an adverse impact on these boxes. I have to admit: I have no idea where to start tracking this down.

--
David C. Rankin, J.D., P.E.
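One hedged suggestion for getting post-mortem data on the next lockup, in case the journal on those boxes is not already persistent: journald's default Storage=auto only writes to disk if /var/log/journal exists, and a persistent journal lets you read the previous boot's tail after the hard reset:

    mkdir -p /var/log/journal
    systemctl restart systemd-journald
    # after the next lockup and reboot, inspect the end of the previous boot
    journalctl -b -1 -e

A hard lock may still cut off the very last messages before they reach disk, but it usually captures the lead-up.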
participants (4)
- Carsten Mattner
- David C. Rankin
- David Rosenstrauch
- Jiachen Yang