[arch-general] mce after linux-3.11.5-1 on NP900X3C

Rasmus Liland jensrasmus at gmail.com
Wed Nov 19 17:15:34 UTC 2014


On 2014-11-17 00:19, Rasmus Liland wrote:
> On 2014-11-15 18:28, Mark Lee wrote:
> >On 11/15/2014 12:20 PM, Rasmus Liland wrote:
> >> On 2014-11-15 15:21, LoneVVolf wrote:
> >>> On 15-11-14 06:57, Rasmus Liland wrote:
> >>>> On 2014-11-15 06:10, Mark Lee wrote:
> >>>>> On 11/14/2014 10:29 PM, Rasmus Liland wrote:
> >>>>>> On 2014-11-15 04:01, Mark Lee wrote:
> >>>>>>> Are you booting with the new intel u-code?
> >>>>>> Are you fairly sure this is an Intel microcode issue?
> >>>>> I'm not completely certain; but it would make sense. I'd test
> >>>>> it out.
> >>>> Thank you for your help thus far. I'll examine this further
> >>>> tomorrow, g'night.
> >>> From rasmus first post:
> >>>> I'm experiencing machine check exceptions since every kernel
> >>>> after package linux-3.11.5-1 (Oct 14 2013)
> >>> New intel microcode was only introduced with kernel 3.17 ... It's
> >>> unlikely to have to do with this issue.
> >>> 
> >>> install mcelog, run it as the log tells you and post the result.
> >> [ ... output, see previous messages ... ]
> >> I never did use the mcelog tool before, but to me it looks like not
> >> much of an analysis, perhaps I'm doing it wrong.
> >Looks like a microcode error, please try to add the intel-ucode to
> >your kernel cmdline.
> Bah, just as I was finished enabling syslinux using syslinux-install_update
> and rebooted, the system did not respond, just a blank screen and lighting
> shutting off, then rebooting again. 
> 
> Thus, this system needs an overhaul -- apparently some difficulty with the
> bootcode or the MBR, though I am able to mount the old partitions and chroot
> into them using arch-chroot. 
> 
> I tried installing grub using the standard method grub-install according to
> the wiki, with little success -- some good news at least relevant to previous
> topic in this thread is that grub recognized and added the intel-ucode file I
> had copied to the /boot directory, when running grub-mkconfig.
> 
> The plan forward is to forget about generating new mbr using gpart and
> install Debian at the end of the disk to, hopefully, restore some boot
> related stuff that might have come crashing down after meddling with
> syslinux.

There has been a breakthrough in this thread.

I ended up backing up the whole disk to an external HDD using

> # dd if=/dev/sda of=/mnt/angrist-sda-18nov14.img

Then I booted the FreeBSD 10.1 memstick image, dropped to a shell, and
recreated the partition table from scratch:

> # gpart delete -i 1 ada0
> # gpart delete -i 2 ada0
> # gpart delete -i 3 ada0
> # gpart destroy ada0
> # gpart create -s mbr ada0
> # gpart add -s 20g -t linux-data ada0
> # gpart add -t linux-data ada0

Then I rebooted into the Arch Linux ISO memstick and installed Arch on the
20G partition, using the other partition as /home. Syslinux now works,
although unfortunately I don't know why. I was able to install all current
packages, including linux 3.17.3-1 and intel-ucode 20140913-1, and I load
the microcode from Syslinux as described on the wiki.

I got a new MCE after almost exactly three hours of uptime (10827 s):

> [10827.051523] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402 Increasing limit for this warning to that value arg
> [10827.051632] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81321387> {intel_idle+0xe7/0x180}
> [10827.055440] mce: [Hardware Error]: TSC 2238c73db17
> [10827.059291] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 1 microcode 1b
> [10827.063192] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [10827.067078] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 4: b200000000100402
> [10827.070986] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81321387> {intel_idle+0xe7/0x180}
> [10827.074899] mce: [Hardware Error]: TSC 2238c73db43
> [10827.078769] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 3 microcode 1b
> [10827.082673] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [10827.086569] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 4: b200000000100402
> [10827.090503] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff812ab186> {intel_sqrt+0x36/0x50}
> [10827.094415] mce: [Hardware Error]: TSC 2238c73db28
> [10827.098299] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 2 microcode 1b
> [10827.102242] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [10827.106182] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: b200000000100402
> [10827.110177] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81321387> {intel_idle+0xe7/0x180}
> [10827.114143] mce: [Hardware Error]: TSC 2238c73db06
> [10827.118038] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 0 microcode 1b
> [10827.122028] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [10827.126037] mce: [Hardware Error]: Machine check: Processor context corrupt
> [10827.130076] Kernel panic - not syncing: Fatal Machine check
> [10827.134149] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> [10827.136647] drm_kms_helper: panic occured, switching back to text console
> [10827.163009] Rebooting in 30 seconds..
> [10857.234707] ACPI MEMORY or I/O RESET_REG.
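
To compare the per-CPU records in a log like the one above, a small parser
can help. This is only a rough sketch: the regular expressions are my own
guess at the line format shown here, SAMPLE holds four lines copied from the
log, and the function name summarize is made up for illustration.

```python
import re
from datetime import datetime, timezone

# Four lines copied from the log above; real input would be the full dmesg text.
SAMPLE = """\
[10827.051523] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402
[10827.059291] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 1 microcode 1b
[10827.067078] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 4: b200000000100402
[10827.078769] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 3 microcode 1b
"""

# Assumed line format: "[uptime] mce: [Hardware Error]: CPU n: ... Bank b: status"
MCE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] mce: \[Hardware Error\]: "
    r"CPU (?P<cpu>\d+): Machine Check Exception: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)
TIME_RE = re.compile(r"TIME (\d+)")


def summarize(log):
    """Return (records, times): one dict per MCE line, plus the set of
    wall-clock times taken from the TIME fields (Unix epoch seconds)."""
    records = [
        {
            "uptime": float(m["uptime"]),
            "cpu": int(m["cpu"]),
            "bank": int(m["bank"]),
            "status": int(m["status"], 16),
        }
        for m in MCE_RE.finditer(log)
    ]
    times = {
        datetime.fromtimestamp(int(t), tz=timezone.utc)
        for t in TIME_RE.findall(log)
    }
    return records, times


records, times = summarize(SAMPLE)
for rec in records:
    print(f"CPU {rec['cpu']} bank {rec['bank']} status {rec['status']:#018x}")
print(sorted(t.isoformat() for t in times))
```

On the sample lines this reports CPUs 1 and 3, both on bank 4 with the same
status value, and a single TIME of 2014-11-19 15:38:26 UTC. The search also
matches lines that carry a leading "> " quote prefix.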

I am also attaching this output. This new MCE contains a lot more
information than the previous one I sent.
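
For anyone who wants to read the status word by hand, here is a minimal
sketch that splits b200000000100402 into its fields. It assumes the
architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 to 57,
model-specific code in bits 31:16, MCA error code in bits 15:0). The function
name decode_status is made up, and the "internal unclassified" check is my
reading of the SDM error-code pattern, so treat it as an assumption. It is
not a replacement for mcelog.

```python
# Architectural flag bits of IA32_MCi_STATUS (positions assumed per Intel SDM).
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    mcacod = status & 0xFFFF
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mscod": (status >> 16) & 0xFFFF,  # model-specific error code
        "mcacod": mcacod,                  # architectural MCA error code
        # Assumption: pattern 0000 01xx xxxx xxxx = "internal unclassified".
        "internal_unclassified": (mcacod & 0xFC00) == 0x0400,
    }


decoded = decode_status(0xB200000000100402)
print(decoded["flags"], hex(decoded["mscod"]), hex(decoded["mcacod"]))
# prints: ['VAL', 'UC', 'EN', 'PCC'] 0x10 0x402
```

The PCC flag agrees with the "Processor context corrupt" line in the log.
What the model-specific code 0x10 means for this CPU depends on the
processor documentation and is not decoded here.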

Perhaps some of you have new suggestions.

Meanwhile, I am downgrading back to 3.11.5-1.

-- 
Rasmus Liland, jrl at jrl.dyndns.dk, jens.rasmus.liland at nmbu.no 