[arch-general] blk_update_request: I/O error, dev sda

Mon May 31 19:37:04 UTC 2021

On 31.05.21 20:27, Morten Bo Johansen via arch-general wrote:
> I include it below. It says "Device Error Count: 12".
> Oddly enough, the command
>    sudo smartctl -t long /dev/sda
> completes without any errors at all. The problem is also
> intermittent. It is rather odd.

This is odd indeed. However, I am not sure what exactly an
extended/long SMART self-test does on an SSHD. Does it include the
cache SSD as well?
It is my understanding, though, that the self-tests do not cover all
components of the drive to the full extent, so a successful long
self-test does not necessarily prove that the drive is healthy.

> --------------- output from sudo smartctl -x /dev/sda -----------------
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   118   099   006    -    192553232
>   3 Spin_Up_Time            PO----   099   098   000    -    0
>   4 Start_Stop_Count        -O--CK   098   098   020    -    2850
>   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
>   7 Seek_Error_Rate         POSR--   082   060   030    -    4492926169
>   9 Power_On_Hours          -O--CK   078   078   000    -    19702
>  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
>  12 Power_Cycle_Count       -O--CK   097   097   020    -    3079
> 184 End-to-End_Error        -O--CK   077   077   099    NOW  23

This indicates a drive failure: the normalized value (077) is below
the threshold (099), hence the "NOW" flag. Depending on what kind of
End-to-End errors are logged, it might (rarely) be just a defective
cable, as was already suggested.
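
If you want to check the whole table for this kind of thing, here is a
rough sketch in Python. It assumes the brief attribute format shown in
your output and the usual rule that an attribute counts as failing when
VALUE <= THRESH (and as having failed in the past when WORST <= THRESH),
with a threshold of 0 meaning "no threshold". The function name and the
state labels are my own, not anything smartctl provides.

```python
import re

# One row of the attribute table in the brief format quoted above:
# ID, name, flags, VALUE, WORST, THRESH, FAIL column, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d{3})\s+(\d{3})\s+(\d{3})\s+(\S+)\s+(.*)$"
)


def classify(line):
    """Return (attribute_name, state) for a table row, or None otherwise.

    state is one of "failing_now", "failed_in_past" or "ok" (labels are
    illustrative). Leading mail quote markers ("> ") are ignored.
    """
    match = ROW.match(line.lstrip("> "))
    if match is None:
        return None  # header line, blank line, or unrelated output
    name = match.group(2)
    value, worst, thresh = (int(match.group(i)) for i in (4, 5, 6))
    if thresh > 0 and value <= thresh:
        return name, "failing_now"
    if thresh > 0 and worst <= thresh:
        return name, "failed_in_past"
    return name, "ok"


if __name__ == "__main__":
    sample = [
        ">   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0",
        "> 184 End-to-End_Error        -O--CK   077   077   099    NOW  23",
        "> 190 Airflow_Temperature_Cel -O---K   051   042   045    Past 49 (Min/Max 28/55 #251)",
    ]
    for row in sample:
        print(classify(row))
```

On the three sample rows this should agree with the FAIL column that
smartctl already prints: attribute 5 is fine, 184 is failing now and
190 has failed in the past.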

> 187 Reported_Uncorrect      -O--CK   099   099   000    -    1
> 188 Command_Timeout         -O--CK   100   099   000    -    12
> 189 High_Fly_Writes         -O-RCK   067   067   000    -    33
> 190 Airflow_Temperature_Cel -O---K   051   042   045    Past 49 (Min/Max 28/55 #251)

This indicates the drive is running rather hot. I suppose it is
enclosed in a laptop with poor ventilation and there is not much you
can do about it? If you can, though, the drive would appreciate a bit
more airflow. And if you are going to replace it with an SSD (instead
of the current SSHD), those tend to produce less sustained heat and
should run cooler.

> 191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    169
> 193 Load_Cycle_Count        -O--CK   099   099   000    -    2892
> 194 Temperature_Celsius     -O---K   049   058   000    -    49 (0 9 0 0 0)
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> 254 Free_Fall_Sensor        -O--CK   100   100   000    -    0
> 
> 
> Error 12 [11] occurred at disk power-on lifetime: 19675 hours (819 days + 19 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 00 03 51 48 00 00  Error: UNC at LBA = 0x00035148 = 217416
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 00 00 90 00 00 00 03 50 e8 40 00     00:00:41.749  READ FPDMA QUEUED
>   61 00 00 00 08 00 00 1c 44 0f 38 40 00     00:00:41.749  WRITE FPDMA QUEUED
>   ea 00 00 00 00 00 00 00 00 00 00 a0 00     00:00:41.748  FLUSH CACHE EXT
>   61 00 00 00 08 00 00 00 03 57 b0 40 00     00:00:41.714  WRITE FPDMA QUEUED
>   61 00 00 00 08 00 00 07 00 44 28 40 00     00:00:41.714  WRITE FPDMA QUEUED
> 
> Error 11 [10] occurred at disk power-on lifetime: 19675 hours (819 days + 19 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 12 eb 60 30 00 00  Error: UNC at LBA = 0x12eb6030 = 317415472
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 00 01 00 00 00 12 eb 5f f0 40 00     00:00:38.123  READ FPDMA QUEUED
>   60 00 00 00 c0 00 00 00 03 78 10 40 00     00:00:38.088  READ FPDMA QUEUED
>   60 00 00 00 08 00 00 12 eb 68 38 40 00     00:00:38.088  READ FPDMA QUEUED
>   60 00 00 00 80 00 00 00 03 6d c0 40 00     00:00:38.086  READ FPDMA QUEUED
>   60 00 00 00 68 00 00 00 03 61 b8 40 00     00:00:38.059  READ FPDMA QUEUED
> 
>  [...]
> 
>  [...]
> 

Since these errors are not UDMA/CRC errors (which might be caused by a
defective cable) and only occurred very recently (compare the error
timestamp of 19675 hours with Power_On_Hours at 19702), a faulty cable
seems very unlikely, unless you recently poked at it (e.g.
reconnected/reseated the drive for whatever reason).
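
To make the arithmetic explicit, a small Python sketch using the
numbers from your log. The helper names are made up for illustration,
and the register layout (two COUNT bytes, then six LBA bytes) is simply
read off the column header in the smartctl output above.

```python
def hours_since_error(power_on_hours, error_at_hours):
    """Power-on hours that have passed since the error was logged."""
    return power_on_hours - error_at_hours


def lba_from_registers(register_line):
    """Decode the 48-bit LBA from a smartctl error-log register line.

    Assumed column layout, taken from the header in the log:
    ER, --, ST, COUNT (2 bytes), LBA_48 (3 bytes), LH, LM, LL, DV, DC.
    The six bytes after COUNT form the LBA, most significant first.
    """
    fields = register_line.split()
    return int("".join(fields[5:11]), 16)


if __name__ == "__main__":
    # Power_On_Hours raw value vs. "occurred at disk power-on lifetime".
    print(hours_since_error(19702, 19675))
    # Register dumps of errors 12 and 11 from the quoted log.
    print(lba_from_registers("40 -- 51 00 00 00 00 00 03 51 48 00 00"))
    print(lba_from_registers("40 -- 51 00 00 00 00 12 eb 60 30 00 00"))
```

That puts both errors 27 power-on hours before the report was taken,
and the decoded LBAs match the ones smartctl prints (217416 and
317415472).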

In my opinion this drive is on the brink of failure and should be
replaced immediately.

Cheers

-- 
Thore "foxxx0" Bödecker

GPG ID: 0xEB763B4E9DB887A6
GPG FP: 051E AD6A 6155 389D 69DA  02E5 EB76 3B4E 9DB8 87A6