• Disk I/O errors

    From George at Clug@21:1/5 to All on Sat Aug 10 10:30:01 2024
    Hi,


    In case there might be a known, fixable fault that could cause this,
    does anyone know what the following errors indicate?


    Aug 10 17:30:51 srv01 kernel: ata6: EH complete
    Aug 10 17:30:54 srv01 kernel: ata6.00: exception Emask 0x0 SAct 0x4000 SErr 0xc0000 action 0x0
    Aug 10 17:30:54 srv01 kernel: ata6.00: irq_stat 0x40000008
    Aug 10 17:30:54 srv01 kernel: ata6: SError: { CommWake 10B8B }
    Aug 10 17:30:54 srv01 kernel: ata6.00: failed command: READ FPDMA QUEUED
    Aug 10 17:30:54 srv01 kernel: ata6.00: cmd 60/08:70:68:c4:00/00:00:00:00:00/40 tag 14 ncq dma 4096 in
    Aug 10 17:30:54 srv01 kernel: ata6.00: status: { DRDY ERR }
    Aug 10 17:30:54 srv01 kernel: ata6.00: error: { UNC }
    Aug 10 17:30:54 srv01 kernel: ata6.00: configured for UDMA/133
    Aug 10 17:30:54 srv01 kernel: sd 5:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
    Aug 10 17:30:54 srv01 kernel: sd 5:0:0:0: [sdb] tag#14 Sense Key : Medium Error [current]
    Aug 10 17:30:54 srv01 kernel: sd 5:0:0:0: [sdb] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
    Aug 10 17:30:54 srv01 kernel: sd 5:0:0:0: [sdb] tag#14 CDB: Read(16) 88 00 00 00 00 00 00 00 c4 68 00 00 00 08 00 00
    Aug 10 17:30:54 srv01 kernel: I/O error, dev sdb, sector 50280 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
    Aug 10 17:30:54 srv01 kernel: Buffer I/O error on dev sdb, logical block 6285, async page read
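
For anyone cross-checking the numbers in the log above, here is a minimal Python sketch that decodes the Read(16) CDB by hand. It assumes the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13) and 512-byte logical sectors; the function name `decode_read16` is made up for this illustration.

```python
# Decode a SCSI READ(16) CDB as printed in the kernel log.
# Assumed layout: byte 0 = opcode 0x88, bytes 2-9 = LBA (big-endian),
# bytes 10-13 = transfer length in logical blocks.

SECTOR_BYTES = 512   # assumption: 512-byte logical sectors
PAGE_BYTES = 4096    # block size used by the "Buffer I/O error" line


def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, transfer_length) from a hex-dumped READ(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between bytes are ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")
    length = int.from_bytes(cdb[10:14], "big")
    return lba, length


lba, length = decode_read16("88 00 00 00 00 00 00 00 c4 68 00 00 00 08 00 00")
read_bytes = length * SECTOR_BYTES               # size of the failed read
page_block = lba * SECTOR_BYTES // PAGE_BYTES    # 4096-byte block index

print(f"LBA {lba}, {length} sectors, {read_bytes} bytes, block {page_block}")
```

Under those assumptions the decoded LBA matches "sector 50280", the length matches "dma 4096 in", and the 4096-byte block index matches "logical block 6285", so all of these lines describe one and the same failed read.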


    I am running badblocks against a Western Digital 3TB WDC
    WD30EFRX-68AX9N0, and it keeps generating the above.

    https://www.storagereview.com/review/western-digital-red-nas-hard-drive-review-wd30efrx


    I have changed the port it is connected to, and the SATA cable, but
    still the errors follow the disk drive.


    I suspect that the error message "Medium Error" means just that: an
    area of the disk has failed, hence "Unrecovered read error".



    Sadly, "Unrecovered read error" also implies "auto reallocate failed",
    so whatever data was on the failed area, it is gone forever. Do not
    worry, backups mean important data is safe, but it does mean a few
    hours' effort to replace the drive, test the replacement, and then
    restore data. Sadly, I was just starting to use the storage for
    testing, and now I will have to copy the test data to the
    replacement drive again.



    Bad blocks start at 21632 and so far continue past 26606.
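
A rough sketch of how the badblocks numbers relate to the kernel's sector numbers, assuming badblocks was run with its default 1024-byte block size (no -b option) and that the kernel log counts 512-byte sectors. The helper names are hypothetical.

```python
KERNEL_SECTOR = 512      # kernel "sector" values are in 512-byte units
BADBLOCKS_BLOCK = 1024   # assumption: badblocks default block size


def sector_to_badblocks_block(sector: int, block_size: int = BADBLOCKS_BLOCK) -> int:
    """Map a kernel sector number to the badblocks block containing it."""
    return sector * KERNEL_SECTOR // block_size


def badblocks_block_to_sectors(block: int, block_size: int = BADBLOCKS_BLOCK) -> range:
    """Return the kernel sectors covered by one badblocks block."""
    first = block * block_size // KERNEL_SECTOR
    return range(first, first + block_size // KERNEL_SECTOR)


block = sector_to_badblocks_block(50280)          # sector from the kernel log
in_reported_range = 21632 <= block <= 26606       # range reported so far
print(block, in_reported_range, list(badblocks_block_to_sectors(block)))
```

If the block-size assumption holds, the sector in the kernel log falls inside the bad range reported by badblocks, so both tools are pointing at the same region of the disk.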


    George.



    Home test lab, Debian Bookworm, KVM host server. AMD Ryzen 9 3900X CPU
    and motherboard. The drive was mounted as spare data storage.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Charles Curley@21:1/5 to George at Clug on Sat Aug 10 14:30:02 2024
    On Sat, 10 Aug 2024 18:20:36 +1000
    George at Clug <[email protected]> wrote:

    > In case there might be a known, fixable fault that could cause this,
    > does anyone know what the following errors indicate?

    I suspect you have a drive getting ready to die on you, which it may do
    at any time.

    > Sadly, "Unrecovered read error" also implies "auto reallocate failed",
    > so whatever data was on the failed area, it is gone forever.

    Auto reallocation failure often means that the drive has run out of
    spare sectors, a clear indication the drive is dying.

    The first thing I would do is order a replacement drive. I would then
    fire up gsmartcontrol and run a short self-test on the drive to confirm
    the failure to reallocate.

    --
    Does anybody read signatures any more?

    https://charlescurley.com
    https://charlescurley.com/blog/

  • From Michael Kjörling@21:1/5 to All on Sat Aug 10 14:40:06 2024
    On 10 Aug 2024 18:20 +1000, from [email protected] (George at Clug):
    > I have changed the port it is connected to, and the SATA cable, but
    > still the errors follow the disk drive.

    So that potentially leaves things like the SATA controller (unlikely),
    the power supply (possible) and the drive itself (highly likely),
    including the drive's onboard controller hardware and firmware.


    > I suspect that the error message "Medium Error" means just that: an
    > area of the disk has failed, hence "Unrecovered read error".

    That would be the typical conclusion, yes.


    > Sadly, "Unrecovered read error" also implies "auto reallocate failed",

    No, it does not necessarily imply that.

    > so whatever data was on the failed area, it is gone forever.

    Likely, yes. Especially if they recur in the same physical location,
    which with LBA mapping can be moderately difficult to tell.

    Specifically, an unrecoverable read error does not imply _that_
    automatic remapping failed _if_ the error developed after the data
    was written.
    In that case, the firmware can't know what _should_ be stored (if it
    could, then the error wouldn't be unrecoverable/uncorrectable), so
    remapping _can't_ be done. If the firmware is doing the right thing,
    then the problematic sectors will be remapped on the next write if
    they fail to hold the newly-written data; but that doesn't help with
    the data that _was_ there.

    Check SMART data for the drive. If the offline uncorrectable or
    pending sector count is climbing as you try a read test, that's a
    strong signal that the drive is somehow physically damaged.
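
One way to watch for that is to save the attribute table from `smartctl -A` before and after a read test and compare the raw values. Below is a minimal Python sketch; the two sample tables are invented for illustration (they are not from the drive in this thread), and it assumes the usual ten-column attribute layout with the raw value in the last column.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(text: str) -> dict[str, int]:
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows look like: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def climbing(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return the increase for every watched attribute that went up."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] > before[name]}


# Hypothetical sample output, made up for this example.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       8
"""
AFTER = BEFORE.replace("-       8\n198", "-       40\n198")

changes = climbing(parse_attributes(BEFORE), parse_attributes(AFTER))
print(changes)
```

A non-empty result after a read pass is the "climbing" signal described above. Newer smartmontools releases can also emit JSON, which would be more robust to parse than the text table.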

    Each drive has a limited pool of spare sectors and once that pool is
    used up for remapping, it cannot handle any further sectors going bad.

    --
    Michael Kjörling 🔗 https://michael.kjorling.se
    “Remember when, on the Internet, nobody cared that you were a dog?”

  • From piorunz@21:1/5 to George at Clug on Wed Aug 14 20:10:02 2024
    Hi George,

    It would be useful if you pasted the full SMART attribute stats here. Command:
    sudo smartctl /dev/sda --all
    (replace sda with the correct device name as needed)

    Run a "long" SMART test on this drive. It should be able to map out bad
    sectors so Linux doesn't see the errors any more, unless the bad sectors
    are growing and changing constantly because the hard drive is dying.
    Sometimes you need to repeat the long test until the number of
    reallocated bad sectors stops growing (for the time being, if the HDD
    is dying).

    How to run a long SMART test:
    sudo smartctl /dev/sda --test=long

    The next step, if you want to continue to use this drive (to some
    extent, as you will never be able to consider it *reliable* for storing
    important data), is to use the excellent (paid) SpinRite program by
    Steve Gibson. It can recover data from unrecoverable sectors and map
    out ALL bad sectors, so that the remaining ones work smoothly and the
    Linux kernel isn't thrown off every few minutes.

    https://www.grc.com/sr/spinrite.htm


    --
    With kindest regards, Piotr.

    ⢀⣴⠾⠻⢶⣦⠀
    ⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
    ⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org/
    ⠈⠳⣄⠀⠀⠀⠀
