On 03/11/2018 09:19 PM, David Rosenstrauch wrote:
My server's been exhibiting some very strange behavior lately. Every couple of days I run into a situation where one core (core #0) of the quad-core CPU starts continuously using around 34% CPU, but I'm not able to see (using htop) any process responsible for using all that CPU. Even when I tell htop to show kernel threads too, I still can't find the offending process. Every process remains under 1% CPU usage (except for occasional small, short-lived spikes), yet the usage on that core hovers permanently at around 34%. The problem goes away when I reboot, but comes back within a day or so.
I'm rather stumped as to how to fix this. The server is a bit old, running an up-to-date installation of Arch on an Intel Core 2 Quad Q6600 CPU. Any suggestions as to what might be going on here, or how to go about debugging it, would be greatly appreciated.
Thanks!
DR
DR,

You are on to something. In the past 24 hours I've experienced two hard locks on two separate multi-CPU Arch servers, both AMD-based boxes. One is the dual quad-core Opteron box that was the subject of the 4.15.8 update failure that left it unbootable. The other, today, is a quad quad-core Opteron box that hard-locked doing a simple rsync from another server on the LAN. Other than the rsync process, the box is nearly idle, e.g.:

top - 19:49:45 up 8 min,  2 users,  load average: 2.93, 2.23, 1.11
Tasks: 271 total,   1 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0/0.0   0[          ]
%Cpu1  :  0.0/0.0   0[          ]
%Cpu2  :  0.0/0.0   0[          ]
%Cpu3  :  0.0/0.0   0[          ]
%Cpu4  :  0.0/0.0   0[          ]
%Cpu5  :  0.0/0.0   0[          ]
%Cpu6  :  0.0/0.0   0[          ]
%Cpu7  :  0.0/0.0   0[          ]
%Cpu8  :  0.0/0.0   0[          ]
%Cpu9  :  0.0/0.0   0[          ]
%Cpu10 :  0.0/0.0   0[          ]
%Cpu11 :  0.0/0.7   1[|         ]
%Cpu12 :  0.0/0.0   0[          ]
%Cpu13 :  0.0/0.0   0[          ]
%Cpu14 :  0.0/0.0   0[          ]
%Cpu15 :  0.0/0.0   0[          ]
GiB Mem :  1.7/62.915  [          ]
GiB Swap:  0.0/1.998   [          ]

I could not get a top during the lockup itself: when it occurred I could still connect to the server from another box via ssh, but I could not log in. E.g.:

19:34 wizard:~/dev/src-c/tmp/debug> svh
Last login: Mon Mar 12 19:34:44 2018 from 192.168.6.104
^CConnection to valhalla closed.

Both boxes have also exhibited strange behavior regarding the linux-raid arrays (all disks are fine), but I receive spurious errors like "Out of IOMMU space". Huh?

Mar 12 19:45:20 valhalla su[869]: pam_unix(su:session): session opened for user root by david(uid=1000)
Mar 12 19:45:57 valhalla kernel: sata_nv 0000:00:05.0: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:45:57 valhalla kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:45:57 valhalla kernel: ata3: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:45:57 valhalla kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:45:57 valhalla kernel: ata3: tag : dhfis dmafis sdbfis sactive
Mar 12 19:45:57 valhalla kernel: ata3.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:45:57 valhalla kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:45:57 valhalla kernel: ata3.00: cmd 61/00:10:00:d0:e4/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 out
                                          res 40/00:20:00:ea:e3/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:45:57 valhalla kernel: ata3.00: status: { DRDY }
Mar 12 19:45:57 valhalla kernel: ata3: hard resetting link
Mar 12 19:45:57 valhalla kernel: ata3: nv: skipping hardreset on occupied port
Mar 12 19:45:58 valhalla kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:45:58 valhalla kernel: ata3.00: configured for UDMA/133
Mar 12 19:45:58 valhalla kernel: ata3: EH complete
Mar 12 19:46:09 valhalla kernel: sata_nv 0000:00:05.1: PCI-DMA: Out of IOMMU space for 65536 bytes
Mar 12 19:46:09 valhalla kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x4 sactive 0x4
Mar 12 19:46:09 valhalla kernel: ata5: SWNCQ:qc_active 0x0 defer_bits 0x0 last_issue_tag 0x1 dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:46:09 valhalla kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:46:09 valhalla kernel: ata5: tag : dhfis dmafis sdbfis sactive
Mar 12 19:46:09 valhalla kernel: ata5.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6
Mar 12 19:46:09 valhalla kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:46:09 valhalla kernel: ata5.00: cmd 61/00:10:00:c0:f8/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 out
                                          res 40/00:20:00:da:f7/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:46:09 valhalla kernel: ata5.00: status: { DRDY }
Mar 12 19:46:09 valhalla kernel: ata5: hard resetting link
Mar 12 19:46:09 valhalla kernel: ata5: nv: skipping hardreset on occupied port
Mar 12 19:46:10 valhalla kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:46:10 valhalla kernel: ata5.00: configured for UDMA/133
Mar 12 19:46:10 valhalla kernel: ata5: EH complete
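Since I could not get anything off the box while it was locked, the one thing I can think to do before the next lockup is make sure the kernel messages leading up to it survive the reset. A minimal sketch, assuming the stock journald setup (if /var/log/journal already exists, the journal is already persistent and the Storage= line is a no-op):

  # /etc/systemd/journald.conf (or a drop-in) -- keep the journal on disk
  [Journal]
  Storage=persistent

  # after the forced reboot, read the kernel messages from the locked-up boot:
  journalctl -k -b -1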
Today's lockup occurred after an update in which 29 packages were upgraded, e.g.:

Packages (29) calc-2.12.6.7-1  cryptsetup-2.0.2-1  device-mapper-2.02.177-5
              e2fsprogs-1.44.0-1  efivar-34-1  filesystem-2018.1-2  gawk-4.2.1-1
              git-2.16.2-2  graphite-1:1.3.11-1  hdparm-9.54-1  imagemagick-7.0.7.26-1
              libevdev-1.5.9-1  libmagick-7.0.7.26-1  libmagick6-6.9.9.38-1
              libsystemd-238.0-3  libzip-1.4.0-1  linux-lts-4.14.25-1
              linux-lts-headers-4.14.25-1  lvm2-2.02.177-5  nano-2.9.4-1
              qt5-base-5.10.1-5  systemd-238.0-3  systemd-sysvcompat-238.0-3
              tcl-8.6.8-2  vulkan-icd-loader-1.1.70.0-3  wxgtk-common-3.0.4-1
              wxgtk2-3.0.4-1  wxgtk3-3.0.4-1  xfsprogs-4.15.1-1

Following the update, I reloaded systemd (systemctl daemon-reload) and then proceeded with the rsync transfer -- which locked the box. I note we have vulkan-icd-loader and qt5-base in the list, both of which also had updates yesterday at the time of the update that locked up the other system and caused so much trouble.

I don't know where to begin. I've been monitoring with top since the last lockup, but have not caught anything. Both of these boxes have run Arch since 2016 and have never experienced any problems before now. Both are SuperMicro boards with either two or four quad-core Opterons, both are raid1 Linux software RAID boxes, and both have either 32 or 64 GB of RAM. Something has gone haywire in the past week that is having an adverse impact on these boxes. I have to admit -- I have no idea where to start tracking this down.

--
David C. Rankin, J.D., P.E.
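PS: The process list in top (and htop) only shows per-process time, so if the missing CPU is going to hardirq or softirq handling it will never show up against any process -- which could also explain DR's core #0 sitting at ~34% with nothing to blame. Next time it happens, something along these lines might show where that time is actually being charged (a rough sketch; the perf step assumes the perf tooling is installed):

  # per-core time split from /proc/stat: user nice system idle iowait irq softirq steal ...
  grep '^cpu0 ' /proc/stat

  # per-CPU interrupt and softirq counts -- a runaway row here never appears as a process
  cat /proc/interrupts
  cat /proc/softirqs

  # attribute kernel time on that core directly, if perf is available
  perf top -C 0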