[arch-general] mce after linux-3.11.5-1 on NP900X3C
Dear @arch-general readers,
I have been getting machine check exceptions with every kernel after package linux-3.11.5-1 (Oct 14 2013). I hope some nice people will be able to assist me or perhaps point me in a fruitful direction. First, here is some kernel panic output I was able to capture on May 25th (I also made it an attachment):
[19367.116180] Disabling lock debugging due to kernel taint
[19367.116196] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402
[19367.116202] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f8b4934c8b7>
[19367.116205] mce: [Hardware Error]: TSC 2824672b8e7
[19367.116211] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 14010118857 SOCKET 0 APIC 1 microcode 12
[19367.116213] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[19367.116216] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[19367.116218] mce: [Hardware Error]: Machine check: Invalid
[19367.116220] Kernel panic - not syncing: Fatal machine check on current CPU
[19368.211815] Shutting down cpus with NMI
[19368.222834] Kernel Offset: 0x0 from 0xffffffff81000000 0000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[19368.222942] drm_kms_helper: panic occurred, switching back to text console
[19368.245774] Rebooting in 30 seconds
[19398.323579] ACPI RECOVERY or RESET_REG.
Assuming I did a complete system update about 5.4 hours earlier (the uptime shown in the log timestamps), this output comes from Linux kernel version 3.14.4.
This has been the case for every kernel after version 3.11.5; from what I can trace in the `/var/cache/pacman/pkg' directory this also holds for _at least_ version 3.15.8; I've removed all other archived versions as they used valuable space.
In dmesg output, my good version 3.11.5-1 identifies itself as:
Linux version 3.11.5-1-ARCH (tobias@T-POWA-LX) (gcc version 4.8.1 20130725 (prerelease) (GCC)) #1 SMP PREEMPT Mon Oct 14 08:31:43 CEST 2013
The hardware of my system is perhaps relevant. I've got a Samsung NP900X3C-A01SE (Jun.2012) laptop:
Intel Core i5-3317U CPU @ 1.70GHz stepping 9 microcode 0x12
4GB RAM (3743M/3888M available, 2048M available to graphics)
SanDisk SSD U100 128GB, 10.01.04, max UDMA/133
Intel Centrino Advanced-N 6235 AGN, REV=0xB0
Presuming this is a good starting point, if more specific hardware information is needed in this thread in the future I can add the output of e.g. lspci -vvnn.
--Rasmus
-- Rasmus Liland, jrl@jrl.dyndns.dk
On 11/14/2014 09:48 PM, Rasmus Liland wrote:
[ ... original message, see above ... ]
To Rasmus,
Are you booting with the new intel u-code?
Regards, Mark
On 2014-11-15 04:01, Mark Lee wrote:
Are you booting with the new intel u-code?
Regards, Mark
You mean installing the intel-ucode package and enabling it in the bootloader as per instructions at https://wiki.archlinux.org/index.php/microcode ?
No, I haven't gotten around to it yet as I have, since August 2012, been a user of the grub-legacy (0.97) package on this laptop. I know grub-legacy doesn't support the loading of the microcode. I'll switch to using Syslinux instead when I find a proper memstick.
Are you fairly sure this is an Intel microcode issue?
-- Rasmus Liland, jrl@jrl.dyndns.dk
On 11/14/2014 10:29 PM, Rasmus Liland wrote:
[ ... quoted message, see above ... ]
To Rasmus,
I'm not completely certain; but it would make sense. I'd test it out.
Regards, Mark
On 2014-11-15 06:10, Mark Lee wrote:
[ ... quoted messages, see above ... ]
Thank you for your help thus far. I'll examine this further tomorrow, g'night. -- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
On 15-11-14 06:57, Rasmus Liland wrote:
[ ... quoted messages, see above ... ]
From Rasmus's first post:
> I'm experiencing machine check exceptions since every kernel after package linux-3.11.5-1 (Oct 14 2013)
New intel microcode was only introduced with kernel 3.17 ..... It's unlikely to have to do with this issue.
Rasmus, check the log you posted again (bold added by me).
[19367.116196] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402
[19367.116202] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f8b4934c8b7>
[19367.116205] mce: [Hardware Error]: TSC 2824672b8e7
[19367.116211] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 14010118857 SOCKET 0 APIC 1 microcode 12
[19367.116213] mce: [Hardware Error]:*Run the above through 'mcelog --ascii'*
install mcelog, run it as the log tells you and post the result.
LW
On 2014-11-15 15:21, LoneVVolf wrote:
[ ... quoted messages, see above ... ]
> install mcelog, run it as the log tells you and post the result.
Thank you for your new, important thoughts. I put the lines in a file called mce.log.
rasmus@angrist ~ % cat mce.log
[19367.116196] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402
[19367.116202] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f8b4934c8b7>
[19367.116205] mce: [Hardware Error]: TSC 2824672b8e7
[19367.116211] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 14010118857 SOCKET 0 APIC 1 microcode 12
rasmus@angrist ~ % sudo mcelog --ascii < mce.log
mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402
mce: [Hardware Error]: RIP !INEXACT! 33:<00007f8b4934c8b7>
mce: [Hardware Error]: TSC 2824672b8e7
mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 14010118857 SOCKET 0 APIC 1 microcode 12
I never did use the mcelog tool before, but to me it looks like not much of an analysis, perhaps I'm doing it wrong.
-- Rasmus Liland, jrl@jrl.dyndns.dk
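Since mcelog only echoed the input here, the status word can also be picked apart by hand. The following is a minimal sketch, not mcelog's own logic: the bit positions are my reading of the architectural IA32_MCi_STATUS layout in the Intel SDM and should be treated as assumptions, and the function name decode_status is made up for illustration.

```python
# Sketch: split an IA32_MCi_STATUS value into its architectural fields.
# Bit positions are assumed from the Intel SDM layout; verify before relying on them.

STATUS_FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(status):
    """Return the flags that are set and the two 16-bit error code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Status value reported for Bank 4 in the log above.
    decoded = decode_status(0xB200000000100402)
    for flag in decoded["flags"]:
        print(flag)
    print("MCA error code:      0x%04x" % decoded["mca_error_code"])
    print("model-specific code: 0x%04x" % decoded["model_specific_code"])
```

For the value in this thread the sketch reports VAL, UC, EN and PCC as set, with MCA error code 0x0402 and model-specific code 0x0010. If I read the SDM's error code table correctly, codes of the form 0000 01xx xxxx xxxx are internal errors raised by the CPU itself, which on its own does not say whether hardware, firmware or power management triggered it.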
On 11/15/2014 12:20 PM, Rasmus Liland wrote:
[ ... quoted messages and output, see above ... ]
To Rasmus,
Looks like a microcode error, please try to add the intel-ucode to your kernel cmdline.
Regards, Mark
On 2014-11-15 18:28, Mark Lee wrote:
[ ... quoted messages, see above ... ]
> Looks like a microcode error, please try to add the intel-ucode to your kernel cmdline.
Bah, just as I was finished enabling syslinux using syslinux-install_update and rebooted, the system did not respond, just a blank screen and lighting shutting off, then rebooting again.
Thus, this system needs an overhaul -- apparently some difficulty with the bootcode or the MBR, though I am able to mount the old partitions and chroot into them using arch-chroot.
I tried installing grub using the standard method grub-install according to the wiki, with little success -- some good news at least relevant to previous topic in this thread is that grub recognized and added the intel-ucode file I had copied to the /boot directory, when running grub-mkconfig.
The plan forward is to forget about generating new mbr using gpart and install Debian at the end of the disk to, hopefully, restore some boot related stuff that might have come crashing down after meddling with syslinux.
I was able to jot down this at my remote terminal.
-- Rasmus Liland, jrl@jrl.dyndns.dk
On 2014-11-17 00:19, Rasmus Liland wrote:
[ ... quoted messages, see above ... ]
A breakthrough in this thread has happened. I ended up taking a backup of the disk to an external hdd using
# dd if=/dev/sda of=/mnt/angrist-sda-18nov14.img
then I booted FreeBSD 10.1 memstick, entered shell and entered some commands:
# gpart delete -i 1 ada0
# gpart delete -i 2 ada0
# gpart delete -i 3 ada0
# gpart destroy ada0
# gpart create -s mbr ada0
# gpart add -s 20g -t linux-data ada0
# gpart add -t linux-data ada0
Then I rebooted into ArchLinux iso memstick to install Arch on the 20G partition and using the other one as /home. So now Syslinux works, unfortunately I don't know why. And I was able to install all new packages including linux 3.17.3-1 and intel-ucode 20140913-1, loading it in Syslinux according to the wiki. I got a new mce after exactly three hours:
[10827.051523] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402 Increasing limit for this warning to that value arg
[10827.051632] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81321387> {intel_idle+0xe7/0x180}
[10827.055440] mce: [Hardware Error]: TSC 2238c73db17
[10827.059291] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 1 microcode 1b
[10827.063192] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[10827.067078] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 4: b200000000100402
[10827.070986] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81321387> {intel_idle+0xe7/0x180}
[10827.074899] mce: [Hardware Error]: TSC 2238c73db43
[10827.078769] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 3 microcode 1b
[10827.082673] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[10827.086569] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 4: b200000000100402
[10827.090503] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff812ab186> {intel_sqrt+0x36/0x50}
[10827.094415] mce: [Hardware Error]: TSC 2238c73db28
[10827.098299] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 2 microcode 1b
[10827.102242] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[10827.106182] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 4: b200000000100402
[10827.110177] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81321387> {intel_idle+0xe7/0x180}
[10827.114143] mce: [Hardware Error]: TSC 2238c73db06
[10827.118038] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 0 microcode 1b
[10827.122028] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[10827.126037] mce: [Hardware Error]: Machine check: Processor context corrupt
[10827.130076] Kernel panic - not syncing: Fatal Machine check
[10827.134149] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[10827.136647] drm_kms_helper: panic occured, switching back to text console
[10827.163009] Rebooting in 30 seconds..
[10857.234707] ACPI MEMORY or I/O RESET_REG.
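As a cross-check on the "three hours" figure: the bracketed dmesg prefix counts seconds since boot, and the TIME field in the PROCESSOR line is, as far as I know, wall-clock seconds since the Unix epoch. A small sketch under those assumptions (both helper names are made up for illustration):

```python
from datetime import datetime, timedelta, timezone


def uptime_from_dmesg(stamp):
    """Convert a dmesg prefix such as '[10827.051523]' to a timedelta since boot."""
    return timedelta(seconds=float(stamp.strip("[] ")))


def wall_clock_from_mce(time_field):
    """Interpret the mce TIME field as seconds since the Unix epoch (UTC).

    Assumption: the kernel stores wall-clock seconds in this field.
    """
    return datetime.fromtimestamp(time_field, tz=timezone.utc)


if __name__ == "__main__":
    # Values taken from the first mce line and the PROCESSOR line above.
    print("uptime at first mce:", uptime_from_dmesg("[10827.051523]"))
    print("wall clock (UTC):   ", wall_clock_from_mce(1416411506).isoformat())
```

With the values from this log, the first mce comes out at roughly 3 hours and 27 seconds after boot, and the TIME field corresponds to 2014-11-19 15:38:26 UTC.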
I am also making this output an attachment. There is a lot more information in this new mce compared to the other one I sent.
Perhaps some of you have some new suggestions.
Meanwhile, I am downgrading back to 3.11.5-1.
-- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
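To compare the per-CPU records in a longer log like this one, it can help to pull out just the CPU, bank, status and microcode fields. The sketch below is an illustration only: the regular expressions are written against the line format seen in this thread, the function name summarize is hypothetical, and the sample is a two-CPU excerpt copied from the log above.

```python
import re
from collections import Counter

# Patterns written against the mce lines as they appear in this thread.
HEADER = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception: [0-9a-f]+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)
PROCESSOR = re.compile(r"APIC [0-9a-f]+ microcode (?P<ucode>[0-9a-f]+)")

SAMPLE = """\
[10827.051523] mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 4: b200000000100402
[10827.059291] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 1 microcode 1b
[10827.067078] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 4: b200000000100402
[10827.078769] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1416411506 SOCKET 0 APIC 3 microcode 1b
"""


def summarize(log_text):
    """Collect one record per 'CPU n: Machine Check Exception' line."""
    records = []
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            records.append({
                "cpu": int(header["cpu"]),
                "bank": int(header["bank"]),
                "status": int(header["status"], 16),
            })
            continue
        processor = PROCESSOR.search(line)
        if processor and records:
            # The PROCESSOR line belongs to the most recent CPU record.
            records[-1]["microcode"] = int(processor["ucode"], 16)
    return records


if __name__ == "__main__":
    records = summarize(SAMPLE)
    counts = Counter(
        (r["bank"], hex(r["status"]), hex(r.get("microcode", 0))) for r in records
    )
    for (bank, status, ucode), n in counts.items():
        print(f"{n} CPU(s): bank {bank}, status {status}, microcode {ucode}")
```

Run over the full log, every logical CPU reports the same bank 4 status, and the microcode revision is 1b rather than the 12 seen in the May log, so the updated microcode was loaded when this mce occurred.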
On 11/19/2014 12:15 PM, Rasmus Liland wrote:
[ ... quoted message and output, see above ... ]
To Rasmus,
Can you run the parts where it says "run the above through mcelog --ascii" and post the contents?
Regards, Mark
On November 19, 2014 3:40:38 PM EST, Mark Lee <mark@markelee.com> wrote:
[ ... quoted messages, see above ... ]
I may have had a similar error, but I can't remember the details. Have you checked if your hardware clock is synchronized? It fixed my kernel panic issues. -- vixsomnis
On 2014-11-19 21:48, vixsomnis wrote:
[ ... earlier quoted messages, see above ... ]
A breakthrough in this thread has happened.
I ended up taking a backup of the disk to an external hdd using
# dd if=/dev/sda of=/mnt/angrist-sda-18nov14.img
then I booted FreeBSD 10.1 memstick, entered shell and entered some commands:
[ ... cut, see previously archived message for output ... ]
Then I rebooted into ArchLinux iso memstick to install Arch on the 20G partition and using the other one as /home. So now Syslinux works, unfortunately I don't know why. And I was able to install all new packages including linux 3.17.3-1 and intel-ucode 20140913-1, loading it in Syslinux according to the wiki.
I got a new mce after exactly three hours:
[ ... cut, see previously archived message for output ... ]
I am also making this output an attachment. There is a lot of more information in this new mce compared to the other one I sent.
Perhaps some of you got some new suggestions.
Meanwhile, I am downgrading back to 3.11.5-1.
To Rasmus,
Can you run the parts where it says "run the abvoe through mcelog - --ascii" and post the contents?
Regards, Mark
I may have had a similar error, but I can't remember the details. Have you checked if your hardware clock is synchronized? It fixed my kernel panic issues. -- vixsomnis
This is interesting. Do you mean synchronizing via NTP protocol? My experience from this laptop is that the RTC quickly becomes desynchronized at times when I am not able to sync via OpenNTPd, i.e. when I am not connected to the internet. -- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
On 19/11/14 16:54, Rasmus Liland wrote:
This is interesting. Do you mean synchronizing via NTP protocol?
My experience from this laptop is that the RTC quickly becomes desynchronized at times when I am not able to sync via OpenNTPd, i.e. when I am not connected to the internet.
If so, don't use OpenNTPD but rather an ntp server that can keep track of RTC drift between reboots and also *write to the hardware clock*. chrony can do this while being lighter and simpler than ntpd.
On 2014-11-19 23:22, "P. A. López-Valencia" wrote:
If so, don't use OpenNTPD but rather an ntp server that can keep track of RTC drift between reboots and also *write to the hardware clock*. chrony can do this while being lighter and simpler than ntpd.
Thank you for suggesting chrony. It seems quite useful. I will start to use it on this laptop from now on. -- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
On 2014-11-19 21:41, Mark Lee wrote:
On 11/19/2014 12:15 PM, Rasmus Liland wrote:
On 2014-11-17 00:19, Rasmus Liland wrote:
On 2014-11-15 18:28, Mark Lee wrote:
On 11/15/2014 12:20 PM, Rasmus Liland wrote:
On 2014-11-15 15:21, LoneVVolf wrote:
On 15-11-14 06:57, Rasmus Liland wrote:
> On 2014-11-15 06:10, Mark Lee wrote:
>> On 11/14/2014 10:29 PM, Rasmus Liland wrote:
>>> On 2014-11-15 04:01, Mark Lee wrote:
>>>> Are you booting with the new intel u-code?
>>> Are you fairly sure this is a Intel microcode issue?
>> I'm not completely certain; but it would make sense.
>> I'd test it out.
> Thank you for your help thus far. I'll examine this
> further tomorrow, g'night.
From rasmus first post:
> I'm experiencing machine check exceptions since every
> kernel after package linux-3.11.5-1 (Oct 14 2013)
New intel microcode was only introduced with kernel 3.17 ... It's unlikely to have to do with this issue.
install mcelog, run it as the log tells you and post the result.
[ ... output, see previous messages ... ]
I never did use the mcelog tool before, but to me it looks like not much of an analysis, perhaps I'm doing it wrong.
Looks like a microcode error, please try to add the intel-ucode to your kernel cmdline.
Bah, just as I was finished enabling syslinux using syslinux-install_update and rebooted, the system did not respond, just a blank screen and lighting shutting off, then rebooting again.
Thus, this system needs an overhaul -- apparently some difficulty with the bootcode or the MBR, though I am able to mount the old partitions and chroot into them using arch-chroot.
I tried installing grub using the standard method grub-install according to the wiki, with little success -- some good news at least relevant to previous topic in this thread is that grub recognized and added the intel-ucode file I had copied to the /boot directory, when running grub-mkconfig.
The plan forward is to forget about generating new mbr using gpart and install Debian at the end of the disk to, hopefully, restore some boot related stuff that might have come crashing down after meddling with syslinux.
A breakthrough in this thread has happened.
I ended up taking a backup of the disk to an external hdd using
# dd if=/dev/sda of=/mnt/angrist-sda-18nov14.img
then I booted FreeBSD 10.1 memstick, entered shell and entered some commands:
# gpart delete -i 1 ada0
# gpart delete -i 2 ada0
# gpart delete -i 3 ada0
# gpart destroy ada0
# gpart create -s mbr ada0
# gpart add -s 20g -t linux-data ada0
# gpart add -t linux-data ada0
Then I rebooted into ArchLinux iso memstick to install Arch on the 20G partition and using the other one as /home. So now Syslinux works, unfortunately I don't know why. And I was able to install all new packages including linux 3.17.3-1 and intel-ucode 20140913-1, loading it in Syslinux according to the wiki.
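Since the partition table was destroyed right after the dd backup, it may be worth checking an image like that before relying on it. The following is only a sketch of one way to do so: the larger block size, the `sync` and the `cmp` step are additions that were not part of the original command, and `image_and_verify` is a hypothetical helper name. It assumes the source disk is not mounted read-write while it is being copied, since otherwise the comparison can fail for harmless reasons.

```shell
#!/bin/sh
# Copy SRC to DST with dd, then compare the two byte for byte.
# image_and_verify is a hypothetical helper name, not an existing tool.
image_and_verify() {
    src=$1
    dst=$2
    # A 1 MiB block size is usually much faster than dd's 512-byte default.
    dd if="$src" of="$dst" bs=1048576 || return 1
    # Flush cached writes so the data has really reached the external disk.
    sync
    # cmp exits non-zero at the first differing byte or if one side is shorter.
    cmp -s "$src" "$dst"
}

# Example (as root, with the external disk mounted on /mnt):
#   image_and_verify /dev/sda /mnt/angrist-sda-18nov14.img && echo "backup verified"
```

The function returns success only when both the copy and the comparison succeed, so it can guard the destructive steps that follow.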
I got a new mce after exactly three hours:
[ ... cut, see previously archived message for output ... ]
I am also making this output an attachment. There is a lot more information in this new MCE than in the other one I sent.
Perhaps some of you have new suggestions.
Meanwhile, I am downgrading back to 3.11.5-1.
To Rasmus,
Can you run the parts where it says "Run the above through 'mcelog --ascii'" and post the output?
Regards, Mark
I'm attaching the output of mcelog to this message. However, I'm unsure of the usefulness of the output. -- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
On 2014-11-19 22:53, Rasmus Liland wrote:
I'm attaching the output of mcelog to this message. However, I'm unsure of the usefulness of the output.
I checked dmesg now after having uptime of ...
rasmus@angrist ~ % uptime
 02:04:01 up 1 day, 7:35, 1 user, load average: 0.04, 0.15, 0.40
rasmus@angrist ~ % uname -a
Linux angrist 3.11.5-1-ARCH #1 SMP PREEMPT Mon Oct 14 08:31:43 CEST 2013 x86_64 GNU/Linux
... about 26 hours. It seems that after about 19 hours some (possibly) temperature-related events were causing MCE hardware errors over a ten-minute interval:
[70133.209654] mce: [Hardware Error]: Machine check events logged
[70376.833053] CPU2: Core temperature above threshold, cpu clock throttled (total events = 30628)
[70376.833056] CPU3: Core temperature above threshold, cpu clock throttled (total events = 30628)
[70376.833061] CPU3: Package temperature above threshold, cpu clock throttled (total events = 174126)
[70376.833070] CPU2: Package temperature above threshold, cpu clock throttled (total events = 174126)
[70376.833074] CPU1: Package temperature above threshold, cpu clock throttled (total events = 174126)
[70376.833077] CPU0: Package temperature above threshold, cpu clock throttled (total events = 174124)
[70376.835060] CPU3: Core temperature/speed normal
[70376.835064] CPU2: Core temperature/speed normal
[70376.835070] CPU2: Package temperature/speed normal
[70376.835074] CPU3: Package temperature/speed normal
[70376.835087] CPU1: Package temperature/speed normal
[70376.835090] CPU0: Package temperature/speed normal
[70433.353800] mce: [Hardware Error]: Machine check events logged
[70676.969501] CPU2: Core temperature/speed normal
[70676.969505] CPU3: Core temperature/speed normal
[70676.969511] CPU0: Package temperature above threshold, cpu clock throttled (total events = 198545)
[70676.969516] CPU1: Package temperature above threshold, cpu clock throttled (total events = 198547)
[70676.969522] CPU3: Package temperature above threshold, cpu clock throttled (total events = 198547)
[70676.969545] CPU2: Package temperature above threshold, cpu clock throttled (total events = 198547)
[70676.970519] CPU0: Package temperature/speed normal
[70676.970522] CPU2: Package temperature/speed normal
[70676.970524] CPU3: Package temperature/speed normal
[70676.970526] CPU1: Package temperature/speed normal
[70733.497978] mce: [Hardware Error]: Machine check events logged
As the system did not reboot, it was able to self-heal. -- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
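To get a quick overview of a log like the one above without reading every line, something along these lines might help. It is a rough sketch that only recognises the two message formats quoted in this message ("Machine check events logged" and "temperature above threshold ... (total events = N)"); `summarize_thermal` is a made-up name, and other kernels may word these messages differently.

```shell
#!/bin/sh
# Summarise thermal throttling and MCE lines from kernel log text on stdin.
# summarize_thermal is a hypothetical helper; it matches only the message
# formats quoted in this thread.
summarize_thermal() {
    awk '
        /Machine check events logged/ { mce++ }
        /temperature above threshold/ {
            throttle++
            # Extract N from "(total events = N)" at the end of the line.
            n = $0
            sub(/.*total events = /, "", n)
            sub(/\).*/, "", n)
            if (n + 0 > max) max = n + 0
        }
        END {
            printf "mce_logged=%d throttle_lines=%d max_total_events=%d\n", mce, throttle, max
        }
    '
}

# Typical use on a running system:
#   dmesg | summarize_thermal
```

Running it periodically and comparing the counters would show whether the throttle events keep climbing between the logged machine checks.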
On 11/20/2014 08:24 PM, Rasmus Liland wrote:
As the system did not reboot, it was able to self-heal.
To Rasmus, Can you run a logger to find out which programs are causing your CPU temperatures to rise? Regards, Mark
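One simple way to build such a logger is sketched below; it is not a tool anyone in the thread named. The helper name `log_temp_sample` is invented, and the sysfs path `/sys/class/thermal/thermal_zone0/temp` is an assumption, since the relevant thermal zone differs between machines. Note also that the `pcpu` value from ps is an average over each process's lifetime, so it only hints at what is busy right now.

```shell
#!/bin/sh
# Print one log line: time, temperature in degrees C, and the busiest process.
# log_temp_sample is a hypothetical helper. The default sysfs path is an
# assumption; the right thermal zone differs between machines.
log_temp_sample() {
    zone=${1:-/sys/class/thermal/thermal_zone0/temp}
    # The kernel reports millidegrees Celsius, e.g. 54000 for 54 C.
    millideg=$(cat "$zone") || return 1
    # Highest %CPU first; the empty headers ("pcpu=") suppress the title row.
    top=$(ps -e -o pcpu= -o comm= 2>/dev/null | sort -rn | head -n 1)
    printf '%s temp=%dC top=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        $((millideg / 1000)) "${top:-unknown}"
}

# Example: append a sample every 10 seconds until interrupted.
#   while log_temp_sample >> "$HOME/temp.log"; do sleep 10; done
```

Lining up the timestamps in such a log with the throttling messages in dmesg should narrow down which program was running when the temperature rose.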
On 20/11/14 20:24, Rasmus Liland wrote:
[snip] I checked dmesg now after having uptime of ...
[snip] ... about 26 hours. It seems that after about 19 hours some (possibly) temperature-related events were causing MCE hardware errors over a ten-minute interval: [snip] As the system did not reboot, it was able to self-heal.
Rasmus, try using thermald (in AUR). It comes from Intel's 01.org project, so you can interpret it as a recognition of real problems with temperature regulation with Intel CPUs. -- Pedro Alejandro López-Valencia http://about.me/palopezv/ Every nation gets the government it deserves. -Joseph de Maistre
On 2014-11-19 18:16, Rasmus Liland wrote:
I am also making this output an attachment. There is a lot more information in this new MCE than in the other one I sent.
Perhaps some of you have new suggestions.
Meanwhile, I am downgrading back to 3.11.5-1.
It is dead.
Yesterday, as I tried to suspend to RAM using systemd on the old working kernel, suspending did not work completely.
So I tried moving up to the new kernel 3.17.something to see if things worked out better there; I was more optimistic now, since e.g. chrony was syncing the RTC based on statistical methods and not only the NTP protocol.
Suspend to RAM was able to complete with the new kernel, and everything was good for a while, until yesterday, when I suspended on very low battery and, I think, the battery went flat during suspend. This has not been a problem in the past, but when I tried to charge the laptop afterwards, the charge LED did not light up even though the light on the charger said it was active.
So, no power connection there. I guess most parts of the system are still working as before and something related to the delivery of power is broken, probably a capacitor of some sort or some other part that wears out over time. I have little knowledge of this, but if this were a desktop I would probably swap the power supply unit for a fresh one.
Honestly, I was hoping this laptop would last me at least four years of intensive everyday use, as the price tag was quite high.
I am going to email the vendor to try to get a decent refund, as I think Norwegian law permits a three-year warranty on consumer electronics, no matter what the Samsung company says.
-- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
On November 28, 2014 9:08:16 AM EST, Rasmus Liland <jensrasmus@gmail.com> wrote:
It is dead.
Have you run memtests? It could also be a failing drive. You should probably make a bootable memtest86 drive and run a full test. -- vixsomnis
On 2014-11-28 16:51, vixsomnis wrote:
Have you done memtests? It could also be a failing drive.
You should probably make a bootable memtest86 drive and run a full test.
I am not able to boot the laptop at all, as the battery is unable to draw power from the charger. I cannot run memtests because I cannot boot anything; even entering the BIOS is not possible. Nothing -- it has become a brick ... -- Rasmus Liland, jrl@jrl.dyndns.dk, jens.rasmus.liland@nmbu.no
participants (5)
- "P. A. López-Valencia"
- LoneVVolf
- Mark Lee
- Rasmus Liland
- vixsomnis