[arch-general] linux 3.1-4 - two i686 lockups after ~ 5 hours of operations. two x86_64 seem OK

Thu Nov 10 14:16:16 EST 2011

On 11/10/2011 12:56 PM, David J. Haines wrote:
> On Thu, Nov 10, 2011 at 1:44 PM, Richard Schütz <r.schtz at t-online.de> wrote:
>> On 10.11.2011 18:47, David C. Rankin wrote:
>>>
>>> tpowa,
>>>
>>> Upgraded 5 i686 boxes and 2 x86_64 boxes to linux 3.1-4 last night.
>>> This morning, one i686 server is dead; another i686 box responded to xterm
>>> (return input) and then locked (the ssh connection was left up after login
>>> to confirm the reboot). Two other i686 boxes (under no load) are still
>>> running. The boxes are remote. I'll pull the logs when I get to the site
>>> and send them. Anybody else seeing this with linux 3.1-4?
>>>
>>
>> I had lockups on my notebook [1] and netbook [2] during normal usage. Both
>> have an Intel processor. The AMD-based desktop machine has had no problems so
>> far. All systems are running linux 3.1-4 x86_64.
>>
>> [1] http://pastebin.com/VAnTLKtP
>> [2] http://pastebin.com/64QKSJTN
>>
>> --
>> Regards,
>> Richard Schütz
>>
>
> I'm getting lockups on an i5 box with Intel graphics running x86_64
> while I'm using the computer. This has been happening since 3.0.7-1.
> 3.0.6-2, however, seemed perfectly fine.
>
> David J. Haines
> dhaines at gmail.com
>

   Hmm.. Absolutely no help from the logs on the box that locked:

Nov 10 03:20:04 phoenix -- MARK --
Nov 10 03:25:34 phoenix dhcpd: DHCPREQUEST for 192.168.7.124 from 00:11:43:22:50:08 via eth0
Nov 10 03:25:34 phoenix dhcpd: DHCPACK on 192.168.7.124 to 00:11:43:22:50:08 via eth0
Nov 10 12:44:33 phoenix kernel: [    0.000000] Initializing cgroup subsys cpuset
Nov 10 12:44:33 phoenix kernel: [    0.000000] Initializing cgroup subsys cpu

   Obviously something occurred after 03:25:34, but there is no indication of
what. The second box, which I thought had locked, had not: by uncanny
coincidence I tried it during one of its spontaneous reboots due to hwclock
drift (I'll create a cron job to update this). The boxes are on the same LAN
subnet. The only SWAG I have is that once the box with the drifting clock got
far enough out of time, any net communication with it may have caused the box
that locked to panic over the time sync issue.

(but that is wrong because once running, the sysclock is the only clock that 
matters - right? But that can't be all wrong, otherwise there is no explanation 
for the spontaneous reboot due to clock drift. A digital paradox so to speak :)

   Richard, David - check your hardware clock with "# hwclock -r" and compare
that to the time returned by "# date". If they are hours apart, make sure your
sysclock is correct, then set the hardware clock from the sysclock with
"# hwclock -w". Worth checking regardless. I know this used to be done on boot
or shutdown, and I don't know why it isn't anymore. I'll do some more digging.
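
   For anyone who wants to script the comparison rather than eyeball it, here
is a rough sketch, not a tested fix. It only compares two readings given as
seconds since the epoch; the helper names drift_seconds and
clock_drift_exceeds are made up for illustration, and the 3600-second default
is just my stand-in for "hours apart".

```shell
#!/bin/sh
# Sketch only: compare two clock readings given as seconds since the epoch.
# drift_seconds and clock_drift_exceeds are hypothetical helper names.

# Print the absolute difference between two epoch timestamps.
drift_seconds() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}

# Succeed (exit status 0) when the two readings differ by more than the
# threshold in seconds. The 3600 default is an assumption standing in for
# "hours apart"; it is not a value taken from hwclock or any other tool.
clock_drift_exceeds() {
    limit=${3:-3600}
    [ "$(drift_seconds "$1" "$2")" -gt "$limit" ]
}
```

   One way to feed it, assuming the kernel exposes the RTC at
/sys/class/rtc/rtc0/since_epoch, would be
clock_drift_exceeds "$(cat /sys/class/rtc/rtc0/since_epoch)" "$(date +%s)".
That file treats the hardware clock as UTC, so a clock kept in localtime
would show a false offset equal to the timezone difference.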

-- 
David C. Rankin, J.D.,P.E.