[arch-general] How does dmraid handle the development of a bad block on 1 drive in an array??
Baho Utot
baho-utot at columbus.rr.com
Wed Jan 6 16:56:01 EST 2010
On Wednesday 06 January 2010 16:13:21 David C. Rankin wrote:
> Listmates (Tobias)
>
> I have a server with 2 dmraid arrays (4 drives -> 2 arrays) that have
> been bullet-proof for years. A month ago (either coincidentally or due to
> a bug in the SUSE 11.2 kernel for client ssh/sftp sessions) I began
> experiencing sda errors on the array made up of the sda/sdc drives. The
> errors took the form of:
>
> Dec 5 20:48:48 nirvana sshd[30922]: error: ssh_msg_send: write
> Dec 5 20:49:10 nirvana sshd[30965]: Accepted keyboard-interactive/pam for legaleagle from 192.168.6.102 port 36 663 ssh2
> Dec 5 20:50:12 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:12 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:12 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:12 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:12 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:12 nirvana kernel: ata3: EH complete
> Dec 5 20:50:14 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:14 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:14 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:14 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:14 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:14 nirvana kernel: ata3: EH complete
> Dec 5 20:50:16 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:16 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:16 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec 5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec 5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec 5 20:50:23 nirvana kernel: res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec 5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
> Dec 5 20:50:23 nirvana kernel: Descriptor sense data with sense descriptors (in hex):
> Dec 5 20:50:23 nirvana kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> Dec 5 20:50:23 nirvana kernel: 34 8c 0c 39
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
> Dec 5 20:50:23 nirvana kernel: end_request: I/O error, dev sda, sector 881593401
> Dec 5 20:50:23 nirvana kernel: ata3: EH complete
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write Protect is off
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Dec 5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> I booted from the install disk, assembled the arrays and ran fsck -c -y on
> the individual disks. That found a bad block on sda and corrected the
> problem, and things were fine for a month. Then the same sda error appeared.
>
> Currently I have disabled dmraid and have the two disks running
> independently with zero errors. I'm keeping the contents mirrored with a
> cron job that basically does a 'cp -a / /mnt/sda' where sdc is the disk
> that is booted from and sda is mounted at /mnt/sda.
>
> This experience has raised a question that I can't find the answer to:
>
> "How does dmraid handle a bad block?"
>
> Also (hardware issues aside), is there anything from a kernel/drive
> controller standpoint that could be invoking the error?
>
> After running the disks independently for a week without a single error, I
> think I'll reassemble the array and rebuild it. Any other
> thoughts/suggestions? Obviously there was a bad block that developed on
> sda, but adding it to the bad block table fixed it. I have a pair of 750 G
> drives coming to replace this set, but I'm curious about the bad block
> handling with dmraid. If 1 bad block is enough to kill the array, then
> that's not a very robust system.
>
> The disks are both Seagate 500G drives (ST3500641AS). The system hardware
> is:
>
>
> System Information
> Manufacturer: TYAN Computer Corp
> Product Name: S2865
>
> BIOS Information
> Vendor: Phoenix Technologies, LTD
> Version: 6.00 PG
> Release Date: 06/20/2005
> Address: 0xE0000
> Runtime Size: 128 kB
> ROM Size: 512 kB
> Characteristics:
>
> Processor Information
> Socket Designation: Socket 939
> Type: Central Processor
> Family: Athlon 64
> Manufacturer: AMD
> ID: 32 0F 02 00 FF FB 8B 17
> Signature: Family 15, Model 35, Stepping 2
>
> All thoughts welcomed...
>
I would run the following on the drive:
/usr/sbin/smartctl -a /dev/sda
That will show you the drive's SMART statistics.
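
If the full report is too much to read through, the few attributes that usually matter for a bad-block problem can be pulled out with a small script. This is only a rough sketch: it assumes the usual ten-column attribute table that `smartctl -A` prints for ATA drives, and the function names and the watch list are made up for illustration, not anything that ships with smartmontools.

```python
import subprocess

# Attributes commonly associated with bad sectors. This watch list is an
# assumption for illustration; adjust it to what your drive reports.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            # Some raw values are not plain integers; keep them as text.
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs


def bad_sector_summary(text):
    """Keep only the watched attributes that are present in the output."""
    attrs = parse_smart_attributes(text)
    return {name: attrs[name] for name in WATCHED if name in attrs}


def read_attribute_table(device):
    """Run smartctl on the device (needs root) and return its text output."""
    result = subprocess.run(
        ["/usr/sbin/smartctl", "-A", device],
        capture_output=True, text=True,
    )
    return result.stdout
```

Run as root, `bad_sector_summary(read_attribute_table("/dev/sda"))` would give a short dictionary to watch over time. Non-zero or growing raw values for these attributes are generally taken as a sign the drive is remapping or failing to read sectors.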
I can't help with the bad block questions, as I use mdadm with a 3-drive
RAID 5 array.
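
One thing that can be read straight out of the quoted log is which sector is failing. The sketch below decodes the `cmd` and `res` taskfile strings, assuming the libata field order is command-or-status / feature-or-error:nsect:lbal:lbam:lbah / hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device. The function name is made up for illustration.

```python
def taskfile_lba48(taskfile):
    """Decode the 48-bit LBA from a libata 'cmd' or 'res' taskfile string.

    Assumed layout (all hex bytes):
      cmd_or_status/feature_or_error:nsect:lbal:lbam:lbah/
      hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    _, low, high, _ = taskfile.split("/")
    low_bytes = [int(b, 16) for b in low.split(":")]
    high_bytes = [int(b, 16) for b in high.split(":")]
    lbal, lbam, lbah = low_bytes[2:5]
    hob_lbal, hob_lbam, hob_lbah = high_bytes[2:5]
    # The "hob" (high order byte) registers carry bits 24..47 of the address.
    return (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
            | lbah << 16 | lbam << 8 | lbal)


# Strings copied from the log in this thread.
start = taskfile_lba48("25/00:08:33:0c:8c/00:00:34:00:00/e0")   # read request
failed = taskfile_lba48("51/40:00:39:0c:8c/40:00:34:00:00/e0")  # error result
print(start, failed, failed - start)  # prints: 881593395 881593401 6
```

Under that assumption the failing address matches the "sector 881593401" reported by end_request, and the last four sense bytes (34 8c 0c 39) are the same number in hex. Every retry in the log fails at that one sector, which fits a single unreadable block rather than a controller or cabling fault scattering errors across the disk.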