[arch-general] How does dmraid handle the development of a bad block on 1 drive in an array??

David C. Rankin drankinatty at suddenlinkmail.com
Wed Jan 6 16:13:21 EST 2010


Listmates (Tobias)

	I have a server with 2 dmraid arrays (4 drives -> 2 arrays) that has been bullet-proof for years. A month ago (either coincidentally or due to a bug in the SUSE 11.2 kernel affecting client ssh/sftp sessions) I began seeing sda errors on the array made up of the sda/sdc drives. The errors took the form of:

Dec  5 20:48:48 nirvana sshd[30922]: error: ssh_msg_send: write
Dec  5 20:49:10 nirvana sshd[30965]: Accepted keyboard-interactive/pam for legaleagle from 192.168.6.102 port 36663 ssh2
Dec  5 20:50:12 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec  5 20:50:12 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec  5 20:50:12 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec  5 20:50:12 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec  5 20:50:12 nirvana kernel: ata3.00: configured for UDMA/133
Dec  5 20:50:12 nirvana kernel: ata3: EH complete
Dec  5 20:50:14 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec  5 20:50:14 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec  5 20:50:14 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec  5 20:50:14 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec  5 20:50:14 nirvana kernel: ata3.00: configured for UDMA/133
Dec  5 20:50:14 nirvana kernel: ata3: EH complete
Dec  5 20:50:16 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec  5 20:50:16 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec  5 20:50:16 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
Dec  5 20:50:23 nirvana kernel: ata3: EH complete
Dec  5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec  5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec  5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
Dec  5 20:50:23 nirvana kernel: ata3: EH complete
Dec  5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec  5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec  5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
Dec  5 20:50:23 nirvana kernel: ata3: EH complete
Dec  5 20:50:23 nirvana kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec  5 20:50:23 nirvana kernel: ata3.00: BMDMA stat 0x25
Dec  5 20:50:23 nirvana kernel: ata3.00: cmd 25/00:08:33:0c:8c/00:00:34:00:00/e0 tag 0 cdb 0x0 data 4096 in
Dec  5 20:50:23 nirvana kernel:          res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)
Dec  5 20:50:23 nirvana kernel: ata3.00: configured for UDMA/133
Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Dec  5 20:50:23 nirvana kernel: Descriptor sense data with sense descriptors (in hex):
Dec  5 20:50:23 nirvana kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec  5 20:50:23 nirvana kernel:         34 8c 0c 39
Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Dec  5 20:50:23 nirvana kernel: end_request: I/O error, dev sda, sector 881593401
Dec  5 20:50:23 nirvana kernel: ata3: EH complete
Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write Protect is off
Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
Dec  5 20:50:23 nirvana kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
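
For anyone reading the log above, here is a minimal Python sketch of how the failing sector can be read out of the libata "res" line and the sense data. It assumes the usual libata layout for that line (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and that the last four sense bytes carry the LBA; the function names decode_res and sense_lba are made up for illustration.

```python
import re

# Assumed libata layout of the "res" line:
#   status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
FIVE = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
RES_RE = re.compile(r"res\s+([0-9a-f]{2})/" + FIVE + "/" + FIVE + r"/([0-9a-f]{2})",
                    re.IGNORECASE)


def decode_res(line):
    """Extract status, error and the 48-bit LBA from a libata 'res' log line."""
    m = RES_RE.search(line)
    if m is None:
        raise ValueError("no libata 'res' taskfile found in line")
    status = int(m.group(1), 16)
    error, _nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    _hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in m.group(3).split(":"))
    # LBA48: the three "hob" bytes are the high half, the other three the low half.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "status": status,
        "error": error,
        "lba": lba,
        "failed": bool(status & 0x01),        # ERR bit in the ATA status register
        "uncorrectable": bool(error & 0x40),  # UNC bit in the ATA error register
    }


def sense_lba(hex_bytes):
    """Interpret space-separated hex bytes (e.g. '34 8c 0c 39') as a big-endian LBA."""
    return int.from_bytes(bytes.fromhex(hex_bytes.replace(" ", "")), "big")


if __name__ == "__main__":
    sample = "res 51/40:00:39:0c:8c/40:00:34:00:00/e0 Emask 0x9 (media error)"
    print(decode_res(sample))
    print(sense_lba("34 8c 0c 39"))
```

Both values come out as 881593401, the same sector that end_request reports for sda, so every retry in the log is hitting the same spot on the disk.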

I booted from the install disk, assembled the arrays and ran 'fsck -c -y' on the individual disks. That found a bad block on sda and corrected the problem, and things were fine for a month. Then the same sda error appeared.
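
As an aside, the sector number in the kernel log is counted from the start of the disk, while fsck and the filesystem's bad block list work in filesystem blocks counted from the start of the partition. A rough Python sketch of the conversion follows. The partition start (sector 63) and the 4096-byte block size are placeholder assumptions, not values taken from this machine; the real ones would have to be read from 'fdisk -lu' and 'tune2fs -l'.

```python
def sector_to_fs_block(disk_sector, part_start_sector,
                       fs_block_size=4096, sector_size=512):
    """Map an absolute disk sector to a filesystem block inside one partition.

    part_start_sector and fs_block_size are inputs the caller must look up;
    the defaults used below are illustrative assumptions only.
    """
    if disk_sector < part_start_sector:
        raise ValueError("sector lies before the start of the partition")
    byte_offset = (disk_sector - part_start_sector) * sector_size
    return byte_offset // fs_block_size


if __name__ == "__main__":
    # 881593401 is the sector from the log; 63 is a hypothetical partition start.
    print(sector_to_fs_block(881593401, part_start_sector=63))
```

With those assumed values the bad sector would fall in filesystem block 110199167. A different partition offset or block size gives a different answer, so the number is only an illustration of the arithmetic.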

Currently I have disabled dmraid and have the two disks running independently with zero errors. I'm keeping the contents mirrored with a cron job that basically does a 'cp -a / /mnt/sda', where sdc is the boot disk and sda is mounted at /mnt/sda.

This experience has raised a question that I can't find the answer to:

	"How does dmraid handle a bad block?"

Also (hardware issues aside), is there anything from a kernel/drive controller standpoint that could be invoking the error?

After running the disks independently for a week without a single error, I think I'll reassemble the array and rebuild it. Any other thoughts/suggestions? Obviously a bad block developed on sda, but adding it to the bad block table fixed it. I have a pair of 750 GB drives coming to replace this set, but I'm curious about how dmraid handles bad blocks. If 1 bad block is enough to kill the array, then that's not a very robust system.

The disks are both Seagate 500 GB drives (ST3500641AS). The system hardware is:


System Information
	Manufacturer: TYAN Computer Corp
	Product Name: S2865

BIOS Information
	Vendor: Phoenix Technologies, LTD
	Version: 6.00 PG
	Release Date: 06/20/2005
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 512 kB
	Characteristics:

Processor Information
	Socket Designation: Socket 939
	Type: Central Processor
	Family: Athlon 64
	Manufacturer: AMD
	ID: 32 0F 02 00 FF FB 8B 17
	Signature: Family 15, Model 35, Stepping 2

All thoughts welcomed...


-- 
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com

