[arch-general] btrfs raid 10 fileserver with ata errors

niya levi niyalevi at gmail.com
Sat Jan 14 09:39:01 UTC 2017


> after some googling it's been suggested that it's either a hard drive,
> the sata controller or the sata cables.
> how do i go about diagnosing and fixing the problem,
> any suggestions or guidance would be appreciated.
> shadrock
> I've had this problem before. IIRC, you can match up the ata17.00 with what drive it's talking about by looking at your kernel boot messages. The first thing I would do is switch out the SATA cable and see if the problem persists. If that doesn't work, run a scan of the drive using the manufacturer's scan program.
>

hi everyone

here are the tests i've tried and their results

journalctl -f | grep ata
Jan 13 12:37:13 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:37:13 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:37:13 maybel kernel: ata17.00: cmd 25/00:00:00:24:7d/00:07:1a:00:00/e0 tag 13 dma 917504 in
Jan 13 12:37:13 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:37:13 maybel kernel: ata17: hard resetting link
Jan 13 12:37:13 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:37:13 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:37:13 maybel kernel: ata17: EH complete
Jan 13 12:37:45 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:37:45 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:37:45 maybel kernel: ata17.00: cmd 25/00:00:00:d9:7d/00:07:1a:00:00/e0 tag 25 dma 917504 in
Jan 13 12:37:45 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:37:45 maybel kernel: ata17: hard resetting link
Jan 13 12:37:46 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:37:46 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:37:46 maybel kernel: ata17: EH complete
Jan 13 12:38:19 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:38:19 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:38:19 maybel kernel: ata17.00: cmd 25/00:00:80:d6:81/00:06:1a:00:00/e0 tag 1 dma 786432 in
Jan 13 12:38:19 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:38:19 maybel kernel: ata17: hard resetting link
Jan 13 12:38:20 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:38:20 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:38:20 maybel kernel: ata17: EH complete
Jan 13 12:38:52 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:38:52 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:38:52 maybel kernel: ata17.00: cmd 25/00:80:80:d1:82/00:05:1a:00:00/e0 tag 28 dma 720896 in
Jan 13 12:38:52 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:38:52 maybel kernel: ata17: hard resetting link
Jan 13 12:38:52 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:38:52 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:38:52 maybel kernel: ata17: EH complete
Jan 13 12:39:24 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:39:24 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:39:24 maybel kernel: ata17.00: cmd 25/00:00:00:9d:84/00:05:1a:00:00/e0 tag 1 dma 655360 in
Jan 13 12:39:24 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:39:24 maybel kernel: ata17: hard resetting link
Jan 13 12:39:25 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:39:25 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:39:25 maybel kernel: ata17: EH complete
Jan 13 12:39:57 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:39:57 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:39:57 maybel kernel: ata17.00: cmd 25/00:00:80:b8:85/00:05:1a:00:00/e0 tag 6 dma 655360 in
Jan 13 12:39:57 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:39:57 maybel kernel: ata17: hard resetting link
Jan 13 12:39:57 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:39:57 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:39:57 maybel kernel: ata17: EH complete
Jan 13 12:40:29 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:40:29 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:40:29 maybel kernel: ata17.00: cmd 25/00:80:80:cd:85/00:05:1a:00:00/e0 tag 16 dma 720896 in
Jan 13 12:40:29 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:40:29 maybel kernel: ata17: hard resetting link
Jan 13 12:40:30 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:40:30 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:40:30 maybel kernel: ata17: EH complete
Jan 13 12:41:02 maybel kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 13 12:41:02 maybel kernel: ata17.00: failed command: READ DMA EXT
Jan 13 12:41:02 maybel kernel: ata17.00: cmd 25/00:80:80:f2:85/00:05:1a:00:00/e0 tag 16 dma 720896 in
Jan 13 12:41:02 maybel kernel: ata17.00: status: { DRDY }
Jan 13 12:41:02 maybel kernel: ata17: hard resetting link
Jan 13 12:41:02 maybel kernel: ata17: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 13 12:41:02 maybel kernel: ata17.00: configured for UDMA/33
Jan 13 12:41:02 maybel kernel: ata17: EH complete
^C
=======================================================================================================
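
to check whether ata17 is the only port logging these exceptions, the kernel log can be tallied per port. this is only a rough sketch: it assumes the kernel lines keep the "ataN.MM: exception" shape seen above, and count_ata_exceptions is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<port> <count>" for every ATA device that logged an exception.
count_ata_exceptions() {
    awk '
        match($0, /ata[0-9]+\.[0-9]+: exception/) {
            port = substr($0, RSTART, RLENGTH)
            sub(/: exception$/, "", port)   # keep only e.g. "ata17.00"
            count[port]++
        }
        END { for (p in count) print p, count[p] }
    ' | sort
}
```

it could be fed with something like `journalctl -k | count_ata_exceptions`. for the excerpt above it would print "ata17.00 8", since each of the eight error cycles starts with one exception line.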

[alarm@maybel ~]$ ls -l /sys/block/ | grep sd.
lrwxrwxrwx 1 root root 0 Jan  9 14:21 sda -> ../devices/pci0000:00/0000:00:05.0/ata1/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Jan  9 14:26 sdb -> ../devices/pci0000:00/0000:00:0c.0/0000:02:00.0/ata13/host12/target12:0:0/12:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Jan  9 14:26 sdc -> ../devices/pci0000:00/0000:00:0c.0/0000:02:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Jan  9 14:26 sdd -> ../devices/pci0000:00/0000:00:0c.0/0000:02:00.0/ata15/host14/target14:0:0/14:0:0:0/block/sdd
lrwxrwxrwx 1 root root 0 Jan  9 14:26 sde -> ../devices/pci0000:00/0000:00:0c.0/0000:02:00.0/ata16/host15/target15:0:0/15:0:0:0/block/sde
lrwxrwxrwx 1 root root 0 Jan  9 14:26 sdf -> ../devices/pci0000:00/0000:00:0d.0/0000:03:00.0/ata17/host16/target16:0:0/16:0:0:0/block/sdf
lrwxrwxrwx 1 root root 0 Jan  9 14:26 sdg -> ../devices/pci0000:00/0000:00:0d.0/0000:03:00.0/ata18/host17/target17:0:0/17:0:0:0/block/sdg
lrwxrwxrwx 1 root root 0 Jan 10 02:43 sdh -> ../devices/pci0000:00/0000:00:02.1/usb1/1-6/1-6:1.0/host20/target20:0:0/20:0:0:0/block/sdh
=======================================================================================================
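
the same ataN to sdX mapping can be pulled out of sysfs without reading the symlinks by eye. a minimal sketch, assuming the sysfs paths contain an "ataN" component as in the listing above; ata_port_of_path and list_ata_ports are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the ataN component of a sysfs device path,
# or nothing if the path has no such component (e.g. a USB disk).
ata_port_of_path() {
    printf '%s\n' "$1" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}

# Walk /sys/block and show which ATA port each sdX device sits on.
list_ata_ports() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue            # no sd* devices present
        port=$(ata_port_of_path "$(readlink -f "$dev")")
        printf '%s %s\n' "${dev##*/}" "${port:-(no ata port)}"
    done
}

list_ata_ports
```

on this box the sdf line should come out as "sdf ata17", which is the port named in the kernel errors, while the usb disk sdh has no ata port at all.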

sudo smartctl -i /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.8.13-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HITACHI HUA722010ALA330
Serial Number:    N136GXML
LU WWN Device Id: 5 000cca 39ced38c2
Firmware Version: JP4ONA00
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 1.5 Gb/s
Local Time is:    Fri Jan 13 12:59:51 2017 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=======================================================================================================

sudo smartctl -t short /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.8.13-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Fri Jan 13 13:31:48 2017

Use smartctl -X to abort test.
=======================================================================================================

sudo smartctl -a /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.8.13-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HITACHI HUA722010ALA330
Serial Number:    N136GXML
LU WWN Device Id: 5 000cca 39ced38c2
Firmware Version: JP4ONA00
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 1.5 Gb/s
Local Time is:    Fri Jan 13 13:34:00 2017 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                ( 9929) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 166) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       91
  3 Spin_Up_Time            0x0007   130   130   024    Pre-fail  Always       -       278 (Average 305)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   138   138   020    Pre-fail  Offline      -       31
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       10782
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       123
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       123
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 12/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      90%     10782         -
# 2  Short offline       Interrupted (host reset)      90%     10776         -
# 3  Short offline       Interrupted (host reset)      90%     10752         -
# 4  Short offline       Interrupted (host reset)      90%     10728         -
# 5  Short offline       Completed without error       00%     10703         -
# 6  Short offline       Interrupted (host reset)      90%     10656         -
# 7  Short offline       Interrupted (host reset)      90%     10632         -
# 8  Extended offline    Interrupted (host reset)      90%     10628         -
# 9  Short offline       Interrupted (host reset)      90%     10608         -
#10  Short offline       Interrupted (host reset)      90%     10584         -
#11  Short offline       Interrupted (host reset)      90%     10560         -
#12  Short offline       Interrupted (host reset)      90%     10537         -
#13  Short offline       Interrupted (host reset)      90%     10513         -
#14  Short offline       Interrupted (host reset)      90%     10489         -
#15  Short offline       Interrupted (host reset)      90%     10465         -
#16  Extended offline    Interrupted (host reset)      90%     10461         -
#17  Short offline       Interrupted (host reset)      90%     10441         -
#18  Short offline       Interrupted (host reset)      90%     10417         -
#19  Short offline       Interrupted (host reset)      90%     10393         -
#20  Short offline       Completed without error       00%     10368         -
#21  Short offline       Completed without error       00%     10344         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

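the self-test log above is easier to judge as a tally than as 21 separate rows. a small sketch, assuming the row layout smartctl prints here ("# N  description  status ..."); summarize_selftests is a hypothetical helper name, not part of smartmontools.

```shell
#!/bin/sh
# Hypothetical helper: reads a smartctl self-test log on stdin and prints
# how many entries were interrupted versus completed without error.
summarize_selftests() {
    awk '
        /^# *[0-9]+ / && /Interrupted/             { interrupted++ }
        /^# *[0-9]+ / && /Completed without error/ { completed++ }
        END { printf "interrupted=%d completed=%d\n", interrupted, completed }
    '
}
```

it could be run as `sudo smartctl -l selftest /dev/sdf | summarize_selftests`. for the 21 entries listed above that works out to 18 interrupted and 3 completed.
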
lspci | grep SATA
00:05.0 IDE interface: NVIDIA Corporation MCP55 SATA Controller (rev a3)
00:05.1 IDE interface: NVIDIA Corporation MCP55 SATA Controller (rev a3)
00:05.2 IDE interface: NVIDIA Corporation MCP55 SATA Controller (rev a3)
02:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11)
03:00.0 SATA controller: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
03:00.1 IDE interface: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
04:00.0 SATA controller: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
04:00.1 IDE interface: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
=======================================================================================================

sudo lshw -c storage
  *-usb                    
       description: Mass storage device
       product: AS2105
       vendor: ASMedia
       physical id: 6
       bus info: usb@1:6
       version: 0.01
       serial: WD-WCC4N5AHPU2D
       capabilities: usb-2.10 scsi
       configuration: driver=usb-storage speed=480Mbit/s
  *-ide:0
       description: IDE interface
       product: MCP55 IDE
       vendor: NVIDIA Corporation
       physical id: 7
       bus info: pci@0000:00:04.0
       version: a1
       width: 32 bits
       clock: 66MHz
       capabilities: ide pm bus_master cap_list
       configuration: driver=pata_amd latency=0 maxlatency=1 mingnt=3
       resources: irq:0 ioport:1f0(size=8) ioport:3f6 ioport:170(size=8) ioport:376 ioport:f000(size=16)
  *-ide:1
       description: IDE interface
       product: MCP55 SATA Controller
       vendor: NVIDIA Corporation
       physical id: 5
       bus info: pci@0000:00:05.0
       version: a3
       width: 32 bits
       clock: 66MHz
       capabilities: ide pm msi ht bus_master cap_list
       configuration: driver=sata_nv latency=0 maxlatency=1 mingnt=3
       resources: irq:21 ioport:9f0(size=8) ioport:bf0(size=4) ioport:970(size=8) ioport:b70(size=4) ioport:dc00(size=16) memory:fe02d000-fe02dfff
  *-ide:2
       description: IDE interface
       product: MCP55 SATA Controller
       vendor: NVIDIA Corporation
       physical id: 5.1
       bus info: pci@0000:00:05.1
       version: a3
       width: 32 bits
       clock: 66MHz
       capabilities: ide pm msi ht bus_master cap_list
       configuration: driver=sata_nv latency=0 maxlatency=1 mingnt=3
       resources: irq:20 ioport:9e0(size=8) ioport:be0(size=4) ioport:960(size=8) ioport:b60(size=4) ioport:c800(size=16) memory:fe02c000-fe02cfff
  *-ide:3
       description: IDE interface
       product: MCP55 SATA Controller
       vendor: NVIDIA Corporation
       physical id: 5.2
       bus info: pci@0000:00:05.2
       version: a3
       width: 32 bits
       clock: 66MHz
       capabilities: ide pm msi ht bus_master cap_list
       configuration: driver=sata_nv latency=0 maxlatency=1 mingnt=3
       resources: irq:23 ioport:c400(size=8) ioport:c000(size=4) ioport:bc00(size=8) ioport:b800(size=4) ioport:b400(size=16) memory:fe02b000-fe02bfff
  *-storage
       description: SATA controller
       product: Marvell Technology Group Ltd.
       vendor: Marvell Technology Group Ltd.
       physical id: 0
       bus info: pci@0000:02:00.0
       version: 11
       width: 32 bits
       clock: 33MHz
       capabilities: storage pm msi pciexpress ahci_1.0 bus_master cap_list rom
       configuration: driver=ahci latency=0
       resources: irq:27 ioport:9c00(size=8) ioport:9800(size=4) ioport:9400(size=8) ioport:9000(size=4) ioport:8c00(size=32) memory:fdeff000-fdeff7ff memory:fdee0000-fdeeffff
  *-storage
       description: SATA controller
       product: JMB363 SATA/IDE Controller
       vendor: JMicron Technology Corp.
       physical id: 0
       bus info: pci@0000:03:00.0
       version: 03
       width: 32 bits
       clock: 33MHz
       capabilities: storage pm pciexpress ahci_1.0 bus_master cap_list rom
       configuration: driver=ahci latency=0
       resources: irq:16 memory:fddfe000-fddfffff memory:fdde0000-fddeffff
  *-ide
       description: IDE interface
       product: JMB363 SATA/IDE Controller
       vendor: JMicron Technology Corp.
       physical id: 0.1
       bus info: pci@0000:03:00.1
       version: 03
       width: 32 bits
       clock: 33MHz
       capabilities: ide pm bus_master cap_list
       configuration: driver=pata_jmicron latency=0
       resources: irq:16 ioport:7c00(size=8) ioport:7800(size=4) ioport:7400(size=8) ioport:7000(size=4) ioport:6c00(size=16)
  *-storage
       description: SATA controller
       product: JMB363 SATA/IDE Controller
       vendor: JMicron Technology Corp.
       physical id: 0
       bus info: pci@0000:04:00.0
       version: 03
       width: 32 bits
       clock: 33MHz
       capabilities: storage pm pciexpress ahci_1.0 bus_master cap_list rom
       configuration: driver=ahci latency=0
       resources: irq:16 memory:fdcfe000-fdcfffff memory:fdce0000-fdceffff
  *-ide
       description: IDE interface
       product: JMB363 SATA/IDE Controller
       vendor: JMicron Technology Corp.
       physical id: 0.1
       bus info: pci@0000:04:00.1
       version: 03
       width: 32 bits
       clock: 33MHz
       capabilities: ide pm bus_master cap_list
       configuration: driver=pata_jmicron latency=0
       resources: irq:16 ioport:5c00(size=8) ioport:5800(size=4) ioport:5400(size=8) ioport:5000(size=4) ioport:4c00(size=16)
  *-scsi
       physical id: 1
       bus info: scsi@20
       logical name: scsi20
       capabilities: scsi-host
       configuration: driver=usb-storage
=======================================================================================================

/dev/sdf, connected to ata17 on the 03:00.0 controller (JMicron JMB363), is the drive with the problems.
/dev/sdg, connected to ata18 on the same controller, is fine.
/dev/sdf exhibits a long delay when getting the report from smartctl -a,
and its smart self-tests are frequently interrupted by host resets.
i will try a new cable later and report back.
thanks
shadrock

