[arch-general] How does dmraid handle the development of a bad block on 1 drive in an array??

Baho Utot baho-utot at columbus.rr.com
Wed Jan 6 16:56:01 EST 2010


On Wednesday 06 January 2010 16:13:21 David C. Rankin wrote:
> Listmates (Tobias)
> 
> 	I have a server that has 2 dmraid arrays (4 drives -> 2 arrays) that has
>  been bullet-proof for years. A month ago (either coincidentally or due to
>  a bug in the suse 11.2 kernel for client ssh/sftp sessions) I began
>  experiencing sda errors on the array comprised of sda/sdc drives. The
>  errors took the form of:
> 
> Dec  5 20:48:48 nirvana sshd[30922]: error: ssh_msg_send: write
> Dec  5 20:49:10 nirvana sshd[30965]: Accepted keyboard-interactive/pam for legaleagle from 192.168.6.102 port 36 663 ssh2
> Dec  5 20:50:12 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec  5 20:50:12 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec  5 20:50:12 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec  5 20:50:12 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec  5 20:50:12 nirvana kernel: ata3.00: configured for UDMA/133
> Dec  5 20:50:12 nirvana kernel: ata3: EH complete
> Dec  5 20:50:14 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec  5 20:50:14 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec  5 20:50:14 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec  5 20:50:14 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec  5 20:50:14 nirvana kernel: ata3.00: configured for UDMA/133
> Dec  5 20:50:14 nirvana kernel: ata3: EH complete
> Dec  5 20:50:16 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec  5 20:50:16 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec  5 20:50:16 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec  5 20:50:23 nirvana kernel: ata3: EH complete
> Dec  5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec  5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec  5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec  5 20:50:23 nirvana kernel: ata3: EH complete
> Dec  5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec  5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec  5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec  5 20:50:23 nirvana kernel: ata3: EH complete
> Dec  5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> Dec  5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
> Dec  5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
> Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
> Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
> Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
> Dec  5 20:50:23 nirvana kernel: Descriptor sense data with sense descriptors (in hex):
> Dec  5 20:50:23 nirvana kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> Dec  5 20:50:23 nirvana kernel:         34 8c 0c 39
> Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
> Dec  5 20:50:23 nirvana kernel: end_request: I/O error, dev sda, sector 881593401
> Dec  5 20:50:23 nirvana kernel: ata3: EH complete
> Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
> Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write Protect is off
> Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> Booting from the install disk, assembling the arrays and fsck'ing -c -y the
>  individual disks found a bad block on sda, corrected the problem and
>  things were fine for a month. Then the same sda error appeared.
> 
> Currently I have disabled dmraid and have the two disks running
>  independently with zero errors. I'm keeping the contents mirrored with a
>  cron job that basically does a 'cp -a / /mnt/sda' where sdc is the disk
>  that is booted from and sda is mounted at /mnt/sda.
> 
> This experience has raised a question that I can't find the answer to:
> 
> 	"How does dmraid handle a bad block?"
> 
> Also (hardware issues aside), is there anything from a kernel/drive
>  controller standpoint that could be invoking the error?
> 
> After running the disks independently for a week without a single error, I
>  think I'll reassemble the array and rebuild it. Any other
>  thoughts/suggestions? Obviously there was a bad block that developed on
>  sda, but adding it to the bad block table fixed it. I have a pair of 750 G
>  drives coming to replace this set, but I'm curious about the bad block
>  handling with dmraid. If 1 bad block is enough to kill the array, then
>  that's not a very robust system.
> 
> The disks are both seagate 500G drives (ST3500641AS). The system hardware
>  is:
> 
> 
> System Information
> 	Manufacturer: TYAN Computer Corp
> 	Product Name: S2865
> 
> BIOS Information
> 	Vendor: Phoenix Technologies, LTD
> 	Version: 6.00 PG
> 	Release Date: 06/20/2005
> 	Address: 0xE0000
> 	Runtime Size: 128 kB
> 	ROM Size: 512 kB
> 	Characteristics:
> 
> Processor Information
> 	Socket Designation: Socket 939
> 	Type: Central Processor
> 	Family: Athlon 64
> 	Manufacturer: AMD
> 	ID: 32 0F 02 00 FF FB 8B 17
> 	Signature: Family 15, Model 35, Stepping 2
> 
> All thoughts welcomed...
> 

I would run the following on the drive:

/usr/sbin/smartctl -a /dev/sda

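That prints the drive's SMART data: its overall health status, the attribute
table (including reallocated and pending sector counts, if the drive reports
them), and the drive's own error and self-test logs.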
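
The full report is long, so here is a rough sketch of a filter that keeps only
the attributes most relevant to bad blocks. The function name
smart_bad_sector_counts is made up for this example, and it assumes the usual
ten-column attribute table that smartctl -A prints, with the raw value in the
last column. Attribute names vary between drive vendors, so treat the three
names below as common ones rather than a guaranteed list.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# attribute name and raw value for a few bad-sector related attributes.
# Assumes the standard table layout:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_bad_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2, $10 }'
}

# Usage (needs root and smartmontools installed):
#   /usr/sbin/smartctl -A /dev/sda | smart_bad_sector_counts
```

Non-zero raw values that keep growing between runs would point at the drive
itself rather than at the kernel or controller.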
I can't help with the bad block question, as I use mdadm with a 3-drive RAID5
array rather than dmraid.
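
One thing that can be read straight from the quoted log: the last four bytes
of the sense data (34 8c 0c 39) are the failing LBA in hex, and they should
agree with the sector number in the end_request line. A small sketch to
cross-check the two; the variable names are only for illustration.

```shell
#!/bin/sh
# LBA bytes from the last line of the sense data in the quoted log: 34 8c 0c 39
lba_hex=348c0c39

# Sector number the kernel reported in the end_request line of the same log.
reported=881593401

# printf accepts a 0x-prefixed hex string and prints it as decimal.
lba_dec=$(printf '%d' "0x$lba_hex")

if [ "$lba_dec" -eq "$reported" ]; then
    echo "sense data LBA matches reported sector: $lba_dec"
else
    echo "mismatch: sense data gives $lba_dec, kernel reported $reported"
fi
```

Every retry in the log fails on the same LBA, which is consistent with a
single bad sector rather than a flaky cable or controller, though it does not
rule those out.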



