[arch-general] Bad Magic Number in Superblock - Any trick for Arch or for new kernels?
Guys, I have a disk on an Arch box that has gone south. No prior warning, no
SMART errors, no nothing. Granted, the drive is 5-6 years old, so a complete
crater is not out of the question.

The situation:

- On reboot: GRUB Error 22
- Booting the Arch install CD (2009-08), the drive cannot be mounted
- The partition table shows 1 partition, /dev/hda1 (PATA drive); it should
  show 3 (I can't put my hands on the saved 'fdisk -l' data, it's somewhere)
- gparted 0.46 (0.44 won't work) loads, but only shows a single partition
- fsck gives the bad magic number in superblock error; trying a backup
  superblock gives the same error
- The box booted without issue for over a year

From the looks of it, the drive scattered. But my question is: "Is there
something in all the recent kernel changes that would require troubleshooting
this drive error in a different way than usual?" I know there have been a lot
of changes moving modules into the kernel, but I don't know what got moved
(if anything related to drives), or how it might affect my troubleshooting.

Before I add the drive to the heap of drives in my 'dead drive box', are
there any other silver bullets I should try to resurrect the drive? (Data
isn't an issue, it's all backed up :-)

What say the gurus?

--
David C. Rankin, J.D., P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com
On 06/09/2010 01:16 PM, David C. Rankin wrote:
The situation:
On reboot: GRUB Error 22. Booting the Arch install CD (2009-08), the drive
cannot be mounted. The partition table shows 1 partition, /dev/hda1 (PATA
drive); it should show 3 (I can't put my hands on the saved 'fdisk -l' data,
it's somewhere). gparted 0.46 (0.44 won't work) loads, but only shows a
single partition. fsck gives the bad magic number in superblock error;
trying a backup superblock gives the same error. The box booted without
issue for over a year.
More information: I pulled the drive to diagnose it under a running system
where I would have access to logs, etc., so I have it connected via USB.
Here are the errors on connect:

Jun 9 14:20:16 alchemy kernel: usb 1-8: new high speed USB device using ehci_hcd and address 3
Jun 9 14:20:16 alchemy kernel: klogd 1.4.1, ---------- state change ----------
Jun 9 14:20:16 alchemy kernel: usb 1-8: configuration #1 chosen from 1 choice
Jun 9 14:20:16 alchemy kernel: usb 1-8: New USB device found, idVendor=152d, idProduct=2338
Jun 9 14:20:16 alchemy kernel: usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumber=5
Jun 9 14:20:16 alchemy kernel: usb 1-8: Product: USB to ATA/ATAPI Bridge
Jun 9 14:20:16 alchemy kernel: usb 1-8: Manufacturer: JMicron
Jun 9 14:20:16 alchemy kernel: usb 1-8: SerialNumber: 7D457CA67703
Jun 9 14:20:17 alchemy kernel: Initializing USB Mass Storage driver...
Jun 9 14:20:17 alchemy kernel: scsi6 : SCSI emulation for USB Mass Storage devices
Jun 9 14:20:17 alchemy kernel: usb-storage: device found at 3
Jun 9 14:20:17 alchemy kernel: usb-storage: waiting for device to settle before scanning
Jun 9 14:20:17 alchemy kernel: usbcore: registered new interface driver usb-storage
Jun 9 14:20:17 alchemy kernel: USB Mass Storage support registered.
Jun 9 14:20:18 alchemy kernel: scsi 6:0:0:0: Direct-Access MDT MD25 00JB-00GVA0 2D08 PQ: 0 ANSI: 2 CCS
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] 488397168 512-byte hardware sectors (250059 MB)
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] Write Protect is off
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] Mode Sense: 00 38 00 00
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] 488397168 512-byte hardware sectors (250059 MB)
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] Write Protect is off
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] Mode Sense: 00 38 00 00
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Jun 9 14:20:18 alchemy kernel: sdb: sdb1
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: [sdb] Attached SCSI disk
Jun 9 14:20:18 alchemy kernel: sd 6:0:0:0: Attached scsi generic sg2 type 0
Jun 9 14:20:18 alchemy kernel: usb-storage: device scan complete
Jun 9 14:20:20 alchemy kernel: EXT3-fs error (device sdb1): ext3_check_descriptors: Block bitmap for group 880 not in group (block 0)!
Jun 9 14:20:20 alchemy kernel: EXT3-fs: group descriptors corrupted!
Jun 9 14:20:29 alchemy kernel: EXT3-fs error (device sdb1): ext3_check_descriptors: Block bitmap for group 880 not in group (block 0)!
Jun 9 14:20:29 alchemy kernel: EXT3-fs: group descriptors corrupted!

14:21 alchemy:/home/backup/alchemy # cat /proc/partitions
major minor  #blocks  name
   8     0  625131864 sda
   8     1    1542208 sda1
   8     2   83891430 sda2
   8     3          1 sda3
   8     5    2048256 sda5
   8     6   36861111 sda6
   8     7  500786181 sda7
   8    16  244198584 sdb
   8    17  244196001 sdb1

14:24 alchemy:/home/backup/alchemy # tune2fs -l /dev/sdb1 | grep 'Block size'
Block size:               4096

14:25 alchemy:/home/backup/alchemy # e2fsck -y -b 32768 /dev/sdb
e2fsck 1.40.8 (13-Mar-2008)
e2fsck: Device or resource busy while trying to open /dev/sdb
Filesystem mounted or opened exclusively by another program?
I guess if you can't even open it to check the superblock, things are not
good. What say the experts: is it looking more like a paper-weight at this
point? Any other thoughts?

--
David C. Rankin
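Trying backup superblocks, as the transcript does with 'e2fsck -b 32768', is
the standard move here. Below is a minimal sketch of that workflow, run
against a scratch image file instead of a real disk so it is safe to
experiment with (assumes e2fsprogs is installed; the sizes and block numbers
are illustrative, not taken from the thread's drive):

```shell
# Sketch of the backup-superblock workflow on a throwaway image file.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=16 status=none
mke2fs -q -F -b 1024 "$img"                  # small ext2 fs, 1 KiB blocks
mke2fs -n -F -b 1024 "$img" | grep -i super  # -n only PRINTS the layout,
                                             # including backup superblocks
# with 1 KiB blocks the first backup superblock sits at block 8193
status=0
e2fsck -fy -b 8193 -B 1024 "$img" || status=$?
rm -f "$img"
echo "e2fsck exit status: $status"
```

Two details from the transcript are worth flagging: tune2fs reported a
4096-byte block size, for which the first backup superblock is normally at
block 32768, matching the -b value tried; and the failed run was pointed at
the whole disk (/dev/sdb) rather than the partition (/dev/sdb1), which may
also be why it reported "Device or resource busy" if anything still had the
disk open.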
Hi David,

Excerpts from David C. Rankin's message of Wed, 09 Jun 2010 13:16 -0500:
Before I add the drive to the heap of drives in my 'dead drive box', are there any other silver bullets I should try to resurrect the drive? (Data isn't an issue, it's all backed up :-)
I had a similar situation just a couple of days ago with my friend's USB
flash drive, which suddenly went dead while being transferred from one
computer to another. The symptoms were very similar. After half an hour of
trying to read outputs and resurrect it, I decided to disassemble it. The
quartz resonator's leg had been torn off the board. I soldered it back, and
everything came back: partitions and data.

This long story is a hint that you may have an electronics failure in your
HDD. Just a physicist's opinion ;-)

Cheers, Sergey
On 06/09/2010 03:14 PM, Sergey Manucharian wrote:
Hi David,
Excerpts from David C. Rankin's message of Wed, 09 Jun 2010 13:16 -0500:
Before I add the drive to the heap of drives in my 'dead drive box', are there any other silver bullets I should try to resurrect the drive? (Data isn't an issue, it's all backed up :-)
I had a similar situation just a couple of days ago with my friend's USB flash drive, which suddenly went dead while being transferred from one computer to another. The symptoms were very similar. After half an hour of trying to read outputs and resurrect it, I decided to disassemble it. The quartz resonator's leg had been torn off the board. I soldered it back, and everything came back: partitions and data.
This long story is a hint that you may have an electronics failure in your HDD. Just a physicist's opinion ;-)
Cheers, Sergey
I'll take that physicist's opinion to heart. With the drive out, I powered
it up on my glass coffee table for the USB connection. I have probably
powered 30-50 drives like this, and you get an 'ear' for whether there are
problems with a drive. In this case it sounded perfect. A bit slow to spin
up (~2 sec lag in spinup), but nothing major. After spinup, the read-write
head did just what it was supposed to: uncaged, made a quick swing through
the extent of its range of motion, then promptly settled, as if to say "OK,
I'm here, I'm ready to read/write like I'm supposed to. Where are the
requests?"

The drive's balance was notably perfect. Being an off/old brand (MDT) I was
expecting slop. Nope, it put the new Seagates to shame. If the drive had
just cratered, I would expect more noise from the read-write head searching
for partition boundaries, etc. (But... there's not a damn thing I can do to
check --> my vacuum chamber is on the fritz :-)

I'll dork with it a bit more, then it's in the box with the rest of my
collection...

--
David C. Rankin
On 06/09/2010 07:16 PM, David C. Rankin wrote:
Before I add the drive to the heap of drives in my 'dead drive box', are there any other silver bullets I should try to resurrect the drive? (Data isn't an issue, it's all backed up :-)
What say the gurus?
Try issuing a SMART short test and see if it goes well. If it finishes
without errors, then issue a full SMART self-test and check for any SMART
attribute changes and/or errors.

If that goes well, which would seem to indicate that the drive is still OK,
then use badblocks and let it do the 4 write and read passes, keeping an eye
on the SMART attributes, because any problem not detected before may show up
now. If you get no errors or SMART attribute changes, the drive should be
OK.

--
Mauro Santos
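The sequence above, sketched as commands. The smartctl calls need a physical
drive, so they are shown only as comments; the destructive badblocks write
test is demonstrated on a scratch file instead of a disk (assumes
smartmontools and e2fsprogs are installed, and /dev/sdX is a placeholder.
Never point 'badblocks -w' at a drive whose data you want to keep):

```shell
# SMART self-tests require real hardware, so left as comments:
#   smartctl -t short /dev/sdX     # queue the short self-test
#   smartctl -l selftest /dev/sdX  # read the result when it finishes
#   smartctl -t long /dev/sdX      # full-surface self-test
#   smartctl -A /dev/sdX           # compare attributes before and after
#
# The destructive badblocks test (writes and reads back 0xaa, 0x55,
# 0xff, 0x00: the "4 write and read passes"), run on a scratch file:
scratch=$(mktemp)
dd if=/dev/zero of="$scratch" bs=1M count=4 status=none
result="badblocks reported problems"
badblocks -wsv "$scratch" && result="no bad blocks found"
echo "$result"
rm -f "$scratch"
```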
On 06/09/2010 03:44 PM, Mauro Santos wrote:
On 06/09/2010 07:16 PM, David C. Rankin wrote:
Before I add the drive to the heap of drives in my 'dead drive box', are there any other silver bullets I should try to resurrect the drive? (Data isn't an issue, it's all backed up :-)
What say the gurus?
Try issuing a SMART short test and see if it goes well. If it finishes without errors, then issue a full SMART self-test and check for any SMART attribute changes and/or errors.
If that goes well, which would seem to indicate that the drive is still OK, then use badblocks and let it do the 4 write and read passes, keeping an eye on the SMART attributes, because any problem not detected before may show up now. If you get no errors or SMART attribute changes, the drive should be OK.
Thanks Mauro,

I do like badblocks. It saved my bacon once before. Rather than doing the
badblocks recovery (since I have the data), what I think I'll do is search a
bit more for the 'fdisk -l' info for the drive. If I find it, I'll try
recreating the partitions and see what is left on the drive. If not, then
I'll just add the drive to the pile. Eventually I'll do some type of
chronological art exhibit with drives: everything from 8 'Meg' MFM/RLL
drives to the new Seagate 500-750G drives that drop like flies now for some
reason :p

--
David C. Rankin
On 06/10/2010 02:00 AM, David C. Rankin wrote:
I do like badblocks. It saved my bacon once before. Rather than doing the badblocks recovery (since I have the data), what I think I'll do is search a bit more for the 'fdisk -l' info for the drive. If I find it, I'll try recreating the partitions and see what is left on the drive. If not, then I'll just add the drive to the pile. Eventually I'll do some type of chronological art exhibit with drives: everything from 8 'Meg' MFM/RLL drives to the new Seagate 500-750G drives that drop like flies now for some reason :p
I guess you can't recover much more from the drive as it is just by trying
to read from it (unless you get hold of some advanced tool to make sense of
the whole drive).

That problem may not be caused by a drive failure but by a combination of
factors. You said this particular disk had been running for a few years
without problems, and there is no indication of failure in the SMART
attributes (I have read that SMART catches only about 2/3 of failures). In
my experience, power supplies go bad after 2 or 3 years of continuous use if
they are consumer-grade hardware, so a bad power supply coupled with a worst
case for the hard disk can lead to problems. That is why I suggested
badblocks to look for problems while keeping an eye on the SMART attributes.
You may also have a hardware failure somewhere else; the motherboard or the
hardware connected directly to the disk are good candidates (as much as
anything else, actually, if the system is 5 or 6 years old).
From my experience only, I find it quite hard to know when a disk is about
to fail. Currently I am trying to figure out whether a hard disk in a
machine I manage is about to fail (a 3.5" drive). SMART says it is;
badblocks can't find anything wrong with the drive (even after 2 full write
passes), but one of the SMART attributes, the one flagged FAILING_NOW,
increases by one with each full read cycle, and the SMART attributes do not
report any reallocated sectors. This is a new drive (6 months old, give or
take), and the other drives assembled in the machine have exactly the same
usage and do not show any signs of trouble (the serial numbers of the drives
are all very close, almost sequential, all from the same manufacturer).
I have had some trouble with a drive from the same manufacturer before (a
2.5" drive), but things seemed to go smoothly after I did just one 'dd
if=/dev/zero of=/dev/sd?' and then read it back; no SMART attribute said the
drive was failing that time, so it might be just a bad coincidence.

As far as I can see, you have done the best thing you could have done, which
is keep backups of the important data. Now all you can do is try to decide
whether that drive can still be used, and trust it a bit less (put it in a
RAID array that can tolerate failures). Unless the drive fails terribly,
with no margin for doubt, it is hard to say from the user's point of view
whether it is really failing or not.

--
Mauro Santos
On 06/10/2010 08:46 AM, Mauro Santos wrote:
From my experience only, I find it quite hard to know when a disk is about to fail. Currently I am trying to figure out whether a hard disk in a machine I manage is about to fail (a 3.5" drive). SMART says it is; badblocks can't find anything wrong with the drive (even after 2 full write passes), but one of the SMART attributes, the one flagged FAILING_NOW, increases by one with each full read cycle, and the SMART attributes do not report any reallocated sectors. This is a new drive (6 months old, give or take), and the other drives assembled in the machine have exactly the same usage and do not show any signs of trouble (the serial numbers of the drives are all very close, almost sequential, all from the same manufacturer).
Mauro,

Your experience sounds exactly like mine over the past year. I have had 4
Seagate drives supposedly "go bad" after 13-14 months of use (1-2 months
after the warranty runs out). The problem is always the same: SMART says
there is a badblock problem, and it logs the time/date of the error.
Subsequent passes with 'smartctl -t long' show no additional problem, and
the drives always 'PASS'.

Where this behavior between badblocks/SMART/Seagate drives is killing me is
that most of my drives run in RAID1 sets with either dmraid or mdraid. The
dmraid installs seem to be the most sensitive to this problem. I know that
the hardware ought to provide badblock remapping on a per-drive basis on the
fly, but I still don't have a good feel for how dmraid handles this
internally. Regardless, when I split an array where one drive is showing
badblock issues and then use the drive as a single drive, I don't have any
more problems with it. So, from what I'm seeing, there is a problem in the
way SMART/badblocks/dmraid play together. I don't have a clue what it is,
but I've been through that scenario 4 times in the past 12 months.

This failure is different. Here the drive was stand-alone to begin with,
and contrary to the earlier badblock/dmraid drives, this drive can no longer
be read with any power supply. (When I work on them out of the machine, they
have a dedicated power source provided by the USB connection kit.) I think
the only way I will ever get an answer on this drive is if I find my dump of
the CHS partition info for the drive and then manually re-create the
partitions to tell the drive where to start looking. C'est la vie... I'll
provide a follow-up if I manage to uncover any more on the reason for the
failure.

Thanks for your help.

--
David C. Rankin
On 06/10/2010 05:06 PM, David C. Rankin wrote:
Your experience sounds exactly like mine over the past year. I have had 4 Seagate drives supposedly "go bad" after 13-14 months of use (1-2 months after the warranty runs out). The problem is always the same: SMART says there is a badblock problem, and it logs the time/date of the error. Subsequent passes with 'smartctl -t long' show no additional problem, and the drives always 'PASS'.
I didn't want to name the manufacturer because I think it is not fair.
Failure due to normal wear is acceptable and expected (which seems to be
your case), and it is also normal to see some drives fail early after being
put to work. This follows what we call here the bathtub curve: higher
failure rates at the start of life, then failure rates decrease
significantly, and then rise again at end of life.

As a side note, here in Europe the warranty is 24 months (at least where I
live), so I doubt the manufacturers would make drives that last less than
that. Besides, some manufacturers are offering/advertising 3- or 5-year
warranties on their websites, so I guess they must be quite sure their
drives are reliable enough. You may want to look at that too and see whether
you are eligible for a free replacement from the manufacturer itself.

So far I have only seen consumer-grade drives being used in machines that
are working 24/7. That is clearly a mistake, but it's the cheapest option,
and I guess that most of the time it works fine; hence it is hard to explain
why more money should be spent on server-grade hardware.

Like I said before, the problem may be caused or aggravated by some other
component. Even with my limited experience, I've seen some weird problems
caused by components that would not seem liable at first glance. The latest
trend, from my limited experience, seems to be power supply failure; if not
complete failure, then at least not conforming to spec and causing
instability. The trend seems to be to supply most components from 12V, to
reduce the current flowing from the power supply to the component being
supplied (the supply voltage keeps decreasing with the latest technology
nodes), but hard disks (the 3.5" ones at least) still rely on 12V to make
the platters spin.
If you have to step the voltage down from, let's say, 12V to 1V or 3.3V or
thereabouts, you have lots of working margin. But because the hard disk
still requires 12V ±10% (which is what the ATX spec requires), if the 12V
line goes out of spec, which it can if the power supply is going bad, then
the hard disk may not be able to work properly while everything else still
works happily.

All this to say: you may not have a bad drive on your hands; it may be just
an unfortunate coincidence. If you really have a backup of all the data, try
writing to the drive while it is connected to a "good" power supply, and the
problems may be gone (this happened to me before with a 2.5" hard disk).
However, it is also a good opportunity to try to recover some data from that
drive, just to learn some tricks for the future, before you write to it and
wonder what caused the problem.

--
Mauro Santos
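The margin argument above can be made concrete: ±10% on the 12 V rail works
out to a fairly narrow absolute window. A trivial sketch, using nothing
beyond the ATX tolerance figure already quoted:

```shell
# Compute the ATX 12 V rail tolerance window from the ±10% figure.
window=$(awk 'BEGIN { n = 12.0; t = 0.10
                      printf "%.1f-%.1f V", n*(1-t), n*(1+t) }')
echo "ATX 12 V rail must stay within: $window"
```

So a supply drifting below roughly 10.8 V is out of spec for the motor while
the logic rails, regenerated on-board from 12 V with plenty of headroom, can
look perfectly healthy.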
Am Thu, 10 Jun 2010 11:06:36 -0500 schrieb David C. Rankin:
Your experience sounds exactly like mine over the past year. I have had 4 Seagate drives supposedly "go bad" after 13-14 months of use (1-2 months after the warranty runs out).
Hi. I think it was about 1 1/2 years ago when there was a big fuss about
some Seagate drive models going bad because of a firmware bug. Seagate
provided a firmware update for these. If you haven't already done so, you
might want to check their knowledge base to see if your drives were affected
and if you can somehow resurrect them yourself.
On Thu, Jun 10, 2010 at 4:06 PM, Axel Müller <axel-mueller-74@web.de> wrote:
I think it was about 1 1/2 years ago when there was a big fuss about some Seagate drive models going bad because of a firmware bug. Seagate provided a firmware update for these. If you haven't already done so, you might want to check their knowledge base to see if your drives were affected and if you can somehow resurrect them yourself.
[1] Random blog post on the issue
[2] Link to serial number validation tool to see if it was one of the affected units:

[1] http://storagesecrets.org/2009/01/alert-seagate-barracuda-diamondmax-drives-...
[2] http://support.seagate.com/sncheck.html
On 06/10/2010 05:20 PM, Gary Wright wrote:
On Thu, Jun 10, 2010 at 4:06 PM, Axel Müller <axel-mueller-74@web.de> wrote:
I think it was about 1 1/2 year ago when there was a big fuss about some Seagate drive models going bad because of a firmware bug. Seagate provided a firmware update for these. If you haven't allready done so, you might want to check their knowledge base if your drives were affected and if you somehow can resurrect them yourself.
[1] Random blog post on the issue
[2] Link to serial number validation tool to see if it was one of the affected units:
[1] http://storagesecrets.org/2009/01/alert-seagate-barracuda-diamondmax-drives-... [2] http://support.seagate.com/sncheck.html
Thanks guys,

I have an account with Seagate (seeing as how I have done this so many
times). I just type in my drive model and serial number info, and they have
a screen that tells you (1) whether your drive is still covered, and (2)
whether there are any firmware updates you might try. My luck lately has
been (1) out of warranty by just 30 days to a couple of months, and (2)
already had the latest firmware... :-(

But when you can throw them a Benjamin for a terabyte of storage, the drives
are pretty much throw-away at that point (too bad the quality had to drop to
reflect it ;-)

--
David C. Rankin
participants (5)
- Axel Müller
- David C. Rankin
- Gary Wright
- Mauro Santos
- Sergey Manucharian