On 03/11/2018 09:19 PM, David Rosenstrauch wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) of the quad-core CPU starts continuously using around 34% CPU, but I'm not able to see (using htop) any process responsible for using all that CPU. Even when I tell htop to show kernel threads too, I still can't find the offending process. Every process remains under 1% CPU usage (except for occasional small, short-lived spikes), yet the usage on that core hovers permanently at around 34%. The problem goes away when I reboot, but comes back within a day or so.
I'm rather stumped as to how to fix this. The server is a bit old, running an up-to-date installation of Arch on an Intel Core 2 Quad Q6600 CPU. Any suggestions as to what might be going on here, or how to go about debugging it, would be greatly appreciated.
Thanks!
DR
DR,

You are on to something. In the past 24 hours I've experienced two hard locks on two separate multi-CPU Arch servers, both AMD-based boxes. One is the dual quad-core Opteron box that was the subject of the 4.15.8 update failure that left it unbootable. The other, today, is a quad quad-core Opteron box that hard-locked doing a simple rsync from another server on the LAN. Other than the rsync process, the box is nearly idle, e.g.:

top - 19:49:45 up 8 min,  2 users,  load average: 2.93, 2.23, 1.11
Tasks: 271 total,   1 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0/0.0   0[          ]
%Cpu1  :  0.0/0.0   0[          ]
%Cpu2  :  0.0/0.0   0[          ]
%Cpu3  :  0.0/0.0   0[          ]
%Cpu4  :  0.0/0.0   0[          ]
%Cpu5  :  0.0/0.0   0[          ]
%Cpu6  :  0.0/0.0   0[          ]
%Cpu7  :  0.0/0.0   0[          ]
%Cpu8  :  0.0/0.0   0[          ]
%Cpu9  :  0.0/0.0   0[          ]
%Cpu10 :  0.0/0.0   0[          ]
%Cpu11 :  0.0/0.7   1[|         ]
%Cpu12 :  0.0/0.0   0[          ]
%Cpu13 :  0.0/0.0   0[          ]
%Cpu14 :  0.0/0.0   0[          ]
%Cpu15 :  0.0/0.0   0[          ]
GiB Mem :  1.7/62.915  [          ]
GiB Swap:  0.0/1.998   [          ]

I could not get a top during the lockup itself: when it occurred I could still connect to the server from another box via ssh, but I could not log in. E.g.:

19:34 wizard:~/dev/src-c/tmp/debug> svh
Last login: Mon Mar 12 19:34:44 2018 from 192.168.6.104
^CConnection to valhalla closed.

Both boxes have also exhibited strange behavior regarding the linux-raid arrays (all disks are fine), but I receive spurious errors like "Out of IOMMU space". Huh?

Mar 12 19:45:20 valhalla su[869]: pam_unix(su:session): session opened for user root by david(uid=1000)
Mar 12 19:45:57 valhalla kernel: sata_nv 0000:00:05.0: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:45:57 valhalla kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:45:57 valhalla kernel: ata3: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:45:57 valhalla kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:45:57 valhalla kernel: ata3: tag : dhfis dmafis sdbfis sactive
Mar 12 19:45:57 valhalla kernel: ata3.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:45:57 valhalla kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:45:57 valhalla kernel: ata3.00: cmd 61/00:10:00:d0:e4/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 out
                                          res 40/00:20:00:ea:e3/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:45:57 valhalla kernel: ata3.00: status: { DRDY }
Mar 12 19:45:57 valhalla kernel: ata3: hard resetting link
Mar 12 19:45:57 valhalla kernel: ata3: nv: skipping hardreset on occupied port
Mar 12 19:45:58 valhalla kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:45:58 valhalla kernel: ata3.00: configured for UDMA/133
Mar 12 19:45:58 valhalla kernel: ata3: EH complete
Mar 12 19:46:09 valhalla kernel: sata_nv 0000:00:05.1: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:46:09 valhalla kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:46:09 valhalla kernel: ata5: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:46:09 valhalla kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:46:09 valhalla kernel: ata5: tag : dhfis dmafis sdbfis sactive
Mar 12 19:46:09 valhalla kernel: ata5.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:46:09 valhalla kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:46:09 valhalla kernel: ata5.00: cmd 61/00:10:00:c0:f8/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 out
                                          res 40/00:20:00:da:f7/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:46:09 valhalla kernel: ata5.00: status: { DRDY }
Mar 12 19:46:09 valhalla kernel: ata5: hard resetting link
Mar 12 19:46:09 valhalla kernel: ata5: nv: skipping hardreset on occupied port
Mar 12 19:46:10 valhalla kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:46:10 valhalla kernel: ata5.00: configured for UDMA/133
Mar 12 19:46:10 valhalla kernel: ata5: EH complete
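Since I could not get anything off the box while it was locked, the one thing I can think to do before the next lockup is make sure the kernel messages leading up to it survive the reset. A minimal sketch, assuming the stock journald setup (if /var/log/journal already exists, the journal is already persistent and the Storage= line is a no-op):

  # /etc/systemd/journald.conf (or a drop-in) -- keep the journal on disk
  [Journal]
  Storage=persistent

  # after the forced reboot, read the kernel messages from the locked-up boot:
  journalctl -k -b -1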
Today's lockup occurred after an update in which 29 packages were upgraded, e.g.:

Packages (29) calc-2.12.6.7-1  cryptsetup-2.0.2-1  device-mapper-2.02.177-5
              e2fsprogs-1.44.0-1  efivar-34-1  filesystem-2018.1-2  gawk-4.2.1-1
              git-2.16.2-2  graphite-1:1.3.11-1  hdparm-9.54-1  imagemagick-7.0.7.26-1
              libevdev-1.5.9-1  libmagick-7.0.7.26-1  libmagick6-6.9.9.38-1
              libsystemd-238.0-3  libzip-1.4.0-1  linux-lts-4.14.25-1
              linux-lts-headers-4.14.25-1  lvm2-2.02.177-5  nano-2.9.4-1
              qt5-base-5.10.1-5  systemd-238.0-3  systemd-sysvcompat-238.0-3
              tcl-8.6.8-2  vulkan-icd-loader-1.1.70.0-3  wxgtk-common-3.0.4-1
              wxgtk2-3.0.4-1  wxgtk3-3.0.4-1  xfsprogs-4.15.1-1

Following the update, I reloaded systemd (systemctl daemon-reload) and then proceeded with the rsync transfer -- which locked the box. I note we have vulkan-icd-loader and qt5-base in the list, both of which also had updates yesterday at the time of the update that locked up the other system and caused so much trouble.

I don't know where to begin. I've been monitoring with top since the last lockup, but have not caught anything. Both of these boxes have run Arch since 2016 and have never experienced any problems before now. Both are SuperMicro boards with either two or four quad-core Opterons, both are raid1 Linux software RAID boxes, and both have either 32 or 64 GB of RAM. Something has gone haywire in the past week that is having an adverse impact on these boxes. I have to admit -- I have no idea where to start tracking this down.

--
David C. Rankin, J.D., P.E.
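PS: The process list in top (and htop) only shows per-process time, so if the missing CPU is going to hardirq or softirq handling it will never show up against any process -- which could also explain DR's core #0 sitting at ~34% with nothing to blame. Next time it happens, something along these lines might show where that time is actually being charged (a rough sketch; the perf step assumes the perf tooling is installed):

  # per-core time split from /proc/stat: user nice system idle iowait irq softirq steal ...
  grep '^cpu0 ' /proc/stat

  # per-CPU interrupt and softirq counts -- a runaway row here never appears as a process
  cat /proc/interrupts
  cat /proc/softirqs

  # attribute kernel time on that core directly, if perf is available
  perf top -C 0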