[arch-general] dmraid boot fail (grub errors 5 & 24) - follow up

David C. Rankin drankinatty at suddenlinkmail.com
Wed Nov 10 06:40:41 CET 2010


On 11/09/2010 12:45 PM, Thomas Bächler wrote:
> On 09.11.2010 19:25, David C. Rankin wrote:
>> Guys,
>>
>>     As a follow up, the post to kernel.org did not elicit any response. The
>> folks at dm-devel suggested it may be a grub bug. So that leaves me with two more
>> avenues to try: (1) the grub list, and (2) a lilo test.
> 
> https://wiki.archlinux.org/index.php/Syslinux
> Always worth a try.
> 

Thanks Thomas, Dwight:

	I have one more piece of input and one more question. The issue may affect more
than just this one box. I have two x86_64 nv dmraid boxes at the house
(primary/backup servers). One is the box I have had the boot problems with (MSI
K9N2 SLI Platinum - Award BIOS); the other is based on a Tyan Tomcat K8e (Model:
S2865 - Phoenix BIOS/Opteron 180), running 2.6.35.8. Both have similar nv dmraid
setups (the MSI box has 2 RAID 1 arrays, the Tyan box has 1 RAID 1 array).

	What I have noticed recently is that the Tyan box boots and experiences what
sounds like disk/drive controller "confusion." What is weird is that it depends
on how the box inits. The problem is either "there" or it "isn't".

	What I mean is that when the problem occurs on the Tyan box, it affects the
box from boot until shutdown. It behaves just like there is an interrupt
conflict or drive/controller fault. I can hear consistent read/write head
excursions (once every 1-2 secs.) and I get 15-60 second delays with
everything (type ls, then wait 30-60 seconds for the listing, or rt-click on
the desktop and wait, and wait... for the context menu). It doesn't matter
whether I have a desktop running or boot to runlevel 3 -- it's a low-level issue.

	Normally that is a "Hey stupid, you have a drive failing... go fix it" issue.
But it's not. smartctl is fine on all drives -- "no errors logged". Nothing in
syslog or dmesg, and the disks are clean.

	A shutdown or reboot will completely "fix" the problem, although today I had to
shut down and restart 3 times before it "fixed" itself. When the box "inits" without
having this problem - it never exhibits *any* problem until the next boot when
whatever it is strikes again.

	Since I rarely boot the box, I don't exactly know when this started, but it has
been within the past month -- which is consistent with the latest round of boot
failures on the MSI box moving from 2.6.35.7 to .8.

	I don't know what to make of it. It seems like something has just gone "flaky"
with how dmraid is working (or grub or kernel or whatever), and it's like some
part of the setup is just confused. On the MSI box, it appears as some attempt
to read beyond the partition boundary or the box thinking there is a corrupt
partition table and booting fails with the latest kernels. On the Tyan box, it
appears as something that causes read/write head excursions and causes the 15-60
second hangs like there is an interrupt conflict or some hardware thing waiting
on a timeout.

	One item that did catch my eye on the kernel list was a dmraid issue concerning
a "CFQ dm-crypt" problem. I have no idea what that is other than gleaning it had
to do with some type of dmraid queue/scheduler that was causing problems. I
don't know if that could point to some area of dmraid that might be the culprit,
but I'll follow up with the dmraid list there.
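
	For reference, CFQ is one of the kernel's block I/O schedulers, and the one in
use for each disk can be read from sysfs. A minimal sketch, assuming the usual
/sys/block/<dev>/queue/scheduler layout where the active scheduler is the name
in square brackets (the function names here are made up for illustration, and
device-mapper nodes may simply report "none"):

```python
import glob


def active_scheduler(text):
    """Return the bracketed scheduler name from a sysfs scheduler line.

    Example input: "noop deadline [cfq]". Returns None if nothing is bracketed.
    """
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def schedulers(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active I/O scheduler (or None)."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        dev = path.split("/")[3]  # /sys/block/<dev>/queue/scheduler
        with open(path) as fh:
            result[dev] = active_scheduler(fh.read())
    return result


print(schedulers())
```

	Comparing the output of a "good" boot against a "bad" boot would at least show
whether the scheduler differs between the two.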

	So that's the latest. I'll try syslinux and see if anything changes. It will
take a couple of days to get the time to do it, but hopefully it will help
narrow this issue down.

	If you have any ideas of any type of test and/or diagnostic I could use the
next time the Tyan box exhibits the problem -- to look at where the hang/timeout
issue is, I would appreciate your ideas. (that's an area where I have no clue...
how or what to look for)
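
	One possible starting point, offered only as a sketch and not tested on either
box: sample /proc/diskstats twice and compare the counters, to see which device
is busy or slow during a hang. The field positions below assume the standard
diskstats layout (name in column 3, then the read/write counters); the function
names and the 5-second interval are arbitrary choices for illustration.

```python
import time


def parse_diskstats(text):
    """Parse /proc/diskstats text into {device: counters}."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short or malformed lines
        stats[f[2]] = {
            "reads": int(f[3]),        # reads completed
            "ms_reading": int(f[6]),   # total ms spent reading
            "writes": int(f[7]),       # writes completed
            "ms_writing": int(f[10]),  # total ms spent writing
            "in_flight": int(f[11]),   # I/Os currently in progress
            "ms_io": int(f[12]),       # total ms the device was busy
        }
    return stats


def busy_report(before, after, interval_ms):
    """Compare two samples: percent busy, average ms per I/O, queue depth."""
    report = {}
    for name, a in after.items():
        b = before.get(name)
        if b is None:
            continue
        ios = (a["reads"] - b["reads"]) + (a["writes"] - b["writes"])
        wait = (a["ms_reading"] - b["ms_reading"]) + (a["ms_writing"] - b["ms_writing"])
        report[name] = {
            "util_pct": 100.0 * (a["ms_io"] - b["ms_io"]) / interval_ms,
            "avg_wait_ms": wait / ios if ios else 0.0,
            "in_flight": a["in_flight"],
        }
    return report


def sample(interval=5.0, path="/proc/diskstats"):
    """Read the stats file twice, `interval` seconds apart, and report."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return busy_report(before, after, interval * 1000.0)
```

	Calling sample() during a hang and printing the result should show whether one
member disk (or the dm device on top) has a high average wait or I/O stuck in
flight while the others sit idle.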

	Thanks for all your continued help and willingness to provide ideas. I know
this is a weird issue, but now that I have two boxes showing some signs of a
similar problem -- hopefully that will help me narrow it down.

-- 
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com

