• I/O errors during RAID check but no SMART errors

    From Jochen Spieker@21:1/5 to All on Tue Oct 8 17:00:01 2024
    Hey,

    please forgive me for posting a question that is not Debian-specific,
    but maybe somebody here can explain this to me. Ten years ago I would
    have posted to Usenet instead.

    I have two disks in a RAID-1:

    | $ cat /proc/mdstat
    | Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
    | md0 : active raid1 sdb1[2] sdc1[0]
    | 5860390400 blocks super 1.2 [2/2] [UU]
    | bitmap: 5/44 pages [20KB], 65536KB chunk
    |
    | unused devices: <none>
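    For reading the status line above programmatically, here is a minimal Python sketch. It assumes the usual mdstat layout, where `[2/2]` is configured members versus active members and `[UU]` has one flag per member (`_` for a missing one). The function names are made up for illustration. Note that `[UU]` only says both members are present in the array, not that every sector on them is readable.

```python
import re

# The two relevant lines from the /proc/mdstat output above.
MDSTAT = """\
md0 : active raid1 sdb1[2] sdc1[0]
      5860390400 blocks super 1.2 [2/2] [UU]
"""

def array_status(mdstat_text):
    """Map each md array to (configured members, active members, flags)."""
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : ", line)
        if header:
            current = header.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if counts and current:
            status[current] = (
                int(counts.group(1)),
                int(counts.group(2)),
                counts.group(3),
            )
    return status

def is_degraded(entry):
    """True if fewer members are active than configured, or any flag is '_'."""
    configured, active, flags = entry
    return active < configured or "_" in flags

status = array_status(MDSTAT)
print(status, is_degraded(status["md0"]))
```

    On a live system the text would come from reading /proc/mdstat instead of the embedded string.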

    During the latest monthly check I got kernel messages like this:

    | Oct 06 00:57:01 jigsaw kernel: md: data-check of RAID array md0
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: exception Emask 0x0 SAct 0x4000000 SErr 0x0 action 0x0
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: irq_stat 0x40000008
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: failed command: READ FPDMA QUEUED
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: cmd 60/80:d0:80:74:f9/08:00:2d:02:00/40 tag 26 ncq dma 1114112 in
    | res 41/40:00:50:77:f9/00:00:2d:02:00/00 Emask 0x409 (media error) <F>
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: status: { DRDY ERR }
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: error: { UNC }
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: configured for UDMA/133
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Sense Key : Medium Error [current]
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 CDB: Read(16) 88 00 00 00 00 02 2d f9 74 80 00 00 08 80 00 00
    | Oct 06 14:27:11 jigsaw kernel: I/O error, dev sdb, sector 9361257600 op 0x0:(READ) flags 0x0 phys_seg 150 prio class 3
    | Oct 06 14:27:11 jigsaw kernel: ata3: EH complete
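    The CDB line in this block can be decoded by hand to see which request failed. A minimal Python sketch, assuming the standard SCSI READ(16) layout (opcode 0x88, bytes 2-9 the starting LBA, bytes 10-13 the transfer length, both big-endian); the function name is only for illustration:

```python
def decode_read16(cdb_hex):
    """Return (starting LBA, transfer length in sectors) of a READ(16) CDB."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9
    length = int.from_bytes(cdb[10:14], "big")  # bytes 10..13
    return lba, length

# CDB copied from the kernel log above.
lba, length = decode_read16("88 00 00 00 00 02 2d f9 74 80 00 00 08 80 00 00")

# 512-byte logical sectors, as reported by smartctl for this drive.
print(lba, length, length * 512)
```

    The decoded LBA equals the sector in the "I/O error" line, and the length times 512 equals the "dma 1114112" figure. So the logged sector is where the failed request started, which is not necessarily the exact sector that could not be read.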

    The sector number mentioned at the bottom is increasing during the
    check.

    The way I understand these messages is that some sectors cannot be read
    from sdb at all and the disk is unable to reallocate the data somewhere
    else (probably because it doesn't know what the data should be in the
    first place).

    The disk has been running continuously for seven years now and I am
    running out of space anyway, so I already ordered a replacement. But I
    do not fully understand what is happening.

    Two of these message blocks end with this:

    | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744

    What does that mean for the other instances of this error? The data
    is still readable from the other disk in the RAID, right? Why doesn't
    md mention it? Why is the RAID still considered healthy? At some point I
    would expect the disk to be kicked from the RAID.

    I unmounted the filesystem and performed a bad blocks scan (fsck.ext4
    -fcky) that did not find anything of importance (only "Inode x extent
    tree (at level 1) could be shorter/narrower"), and it also did not yield
    any of the above kernel messages. But another RAID check triggers these
    messages again, just with different sector numbers. The RAID is still
    healthy, though.

    Should this tell me that new sectors are dying all the time, or
    should this lead me to believe that a cable / the SATA controller is at
    fault? I don't even see any errors with smartctl:

    | # smartctl -a /dev/sdb
    | smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-25-amd64] (local build)
    | Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
    |
    | === START OF INFORMATION SECTION ===
    | Model Family: Western Digital Red
    | Device Model: WDC WD60EFRX-68L0BN1
    | Serial Number: WD-xxxxxxxxxxxx
    | LU WWN Device Id: 5 0014ee 263faee8c
    | Firmware Version: 82.00A82
    | User Capacity: 6,001,175,126,016 bytes [6.00 TB]
    | Sector Sizes: 512 bytes logical, 4096 bytes physical
    | Rotation Rate: 5700 rpm
    | Device is: In smartctl database 7.3/5319
    | ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
    | SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    | Local Time is: Tue Oct 8 15:15:22 2024 CEST
    | SMART support is: Available - device has SMART capability.
    | SMART support is: Enabled
    |
    | === START OF READ SMART DATA SECTION ===
    | SMART overall-health self-assessment test result: PASSED
    |
    | General SMART Values:
    | Offline data collection status: (0x85) Offline data collection activity
    | was aborted by an interrupting command from host.
    | Auto Offline Data Collection: Enabled.
    | Self-test execution status: ( 245) Self-test routine in progress...
    | 50% of test remaining.
    | Total time to complete Offline
    | data collection: ( 1904) seconds.
    | Offline data collection
    | capabilities: (0x7b) SMART execute Offline immediate.
    | Auto Offline data collection on/off support.
    | Suspend Offline collection upon new
    | command.
    | Offline surface scan supported.
    | Self-test supported.
    | Conveyance Self-test supported.
    | Selective Self-test supported.
    | SMART capabilities: (0x0003) Saves SMART data before entering
    | power-saving mode.
    | Supports SMART auto save timer.
    | Error logging capability: (0x01) Error logging supported.
    | General Purpose Logging supported.
    | Short self-test routine
    | recommended polling time: ( 2) minutes.
    | Extended self-test routine
    | recommended polling time: ( 673) minutes.
    | Conveyance self-test routine
    | recommended polling time: ( 5) minutes.
    | SCT capabilities: (0x303d) SCT Status supported.
    | SCT Error Recovery Control supported.
    | SCT Feature Control supported.
    | SCT Data Table supported.
    |
    | SMART Attributes Data Structure revision number: 16
    | Vendor Specific SMART Attributes with Thresholds:
    | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    | 1 Raw_Read_Error_Rate 0x002f 199 169 051 Pre-fail Always - 81
    | 3 Spin_Up_Time 0x0027 198 197 021 Pre-fail Always - 9100
    | 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 83
    | 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
    | 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
    | 9 Power_On_Hours 0x0032 016 016 000 Old_age Always - 61794
    | 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
    | 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
    | 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 82
    | 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 54
    | 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2219
    | 194 Temperature_Celsius 0x0022 119 116 000 Old_age Always - 33
    | 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
    | 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
    | 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
    | 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
    | 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 43
    |
    | SMART Error Log Version: 1
    | No Errors Logged
    |
    | SMART Self-test log structure revision number 1
    | Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    | # 1 Short offline Completed without error 00% 61789 -
    | # 2 Short offline Completed without error 00% 61758 -
    | # 3 Short offline Completed without error 00% 61752 -
    | # 4 Extended offline Completed without error 00% 61726 -
    | # 5 Short offline Completed without error 00% 61710 -
    | # 6 Short offline Completed without error 00% 61686 -
    | # 7 Short offline Completed without error 00% 61662 -
    | # 8 Short offline Completed without error 00% 61638 -
    | # 9 Short offline Completed without error 00% 61615 -
    | #10 Short offline Completed without error 00% 61591 -
    | #11 Short offline Completed without error 00% 61567 -
    | #12 Extended offline Completed without error 00% 61559 -
    | #13 Short offline Completed without error 00% 61543 -
    | #14 Short offline Completed without error 00% 61519 -
    | #15 Short offline Completed without error 00% 61495 -
    | #16 Short offline Completed without error 00% 61471 -
    | #17 Short offline Completed without error 00% 61447 -
    | #18 Short offline Completed without error 00% 61423 -
    | #19 Short offline Completed without error 00% 61399 -
    | #20 Extended offline Completed without error 00% 61391 -
    | #21 Short offline Completed without error 00% 61375 -
    |
    | SMART Selective self-test log data structure revision number 1
    | SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    | 1 0 0 Not_testing
    | 2 0 0 Not_testing
    | 3 0 0 Not_testing
    | 4 0 0 Not_testing
    | 5 0 0 Not_testing
    | Selective self-test flags (0x0):
    | After scanning selected spans, do NOT read-scan remainder of disk.
    | If Selective self-test is pending on power-up, resume after 0 minute delay.

    I am still waiting for the result of a long self-test.

    Do you think I should remove the drive from the RAID immediately? Or
    should I suspect something else is at fault? I prefer not to run the
    risk of losing the RAID completely when I keep on running on one disk
    while the new one is being shipped. I do have backups, but it would be
    great if I didn't need to restore.

    Regards,
    Jochen
    --
    I see weapons of mass destruction as shameful but necessary.
    [Agree] [Disagree]
    <http://archive.slowlydownward.com/NODATA/data_enter2.html>


    ---
  • From Dan Ritter@21:1/5 to Jochen Spieker on Tue Oct 8 17:50:01 2024
    Jochen Spieker wrote:
    I have two disks in a RAID-1:

    | $ cat /proc/mdstat
    | Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
    | md0 : active raid1 sdb1[2] sdc1[0]
    | 5860390400 blocks super 1.2 [2/2] [UU]
    | bitmap: 5/44 pages [20KB], 65536KB chunk
    |
    | unused devices: <none>

    During the latest monthly check I got kernel messages like this:

    | Oct 06 00:57:01 jigsaw kernel: md: data-check of RAID array md0
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: exception Emask 0x0 SAct 0x4000000 SErr 0x0 action 0x0
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: irq_stat 0x40000008
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: failed command: READ FPDMA QUEUED
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: cmd 60/80:d0:80:74:f9/08:00:2d:02:00/40 tag 26 ncq dma 1114112 in
    | res 41/40:00:50:77:f9/00:00:2d:02:00/00 Emask 0x409 (media error) <F>
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: status: { DRDY ERR }
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: error: { UNC }
    | Oct 06 14:27:11 jigsaw kernel: ata3.00: configured for UDMA/133
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Sense Key : Medium Error [current]
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
    | Oct 06 14:27:11 jigsaw kernel: sd 2:0:0:0: [sdb] tag#26 CDB: Read(16) 88 00 00 00 00 02 2d f9 74 80 00 00 08 80 00 00
    | Oct 06 14:27:11 jigsaw kernel: I/O error, dev sdb, sector 9361257600 op 0x0:(READ) flags 0x0 phys_seg 150 prio class 3
    | Oct 06 14:27:11 jigsaw kernel: ata3: EH complete

    If this happens once, it's just a thing that happened.

    If it happens multiple times, it means that there's a hardware
    error: sometimes a cable, rarely the SATA port, often the drive.

    The sector number mentioned at the bottom is increasing during the
    check.

    So it repeats, and it's contiguous. That suggests a flaw in the
    drive itself.


    The way I understand these messages is that some sectors cannot be read
    from sdb at all and the disk is unable to reallocate the data somewhere
    else (probably because it doesn't know what the data should be in the
    first place).

    Yes.

    The disk has been running continuously for seven years now and I am
    running out of space anyway, so I already ordered a replacement. But I
    do not fully understand what is happening.

    The drive is dying, slowly. In this case it's starting with a
    bad patch on a platter.


    Two of these message blocks end with this:

    | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744

    What does that mean for the other instances of this error? The data
    is still readable from the other disk in the RAID, right? Why doesn't md mention it? Why is the RAID still considered healthy? At some point I
    would expect the disk to be kicked from the RAID.

    md will eventually do that, but not until it gets bad enough.
    That could be quite noticeable.


    I unmounted the filesystem and performed a bad blocks scan (fsck.ext4
    -fcky) that did not find anything of importance (only "Inode x extent
    tree (at level 1) could be shorter/narrower"), and it also did not yield
    any of the above kernel messages. But another RAID check triggers these messages again, just with different sector numbers. The RAID is still healthy, though.

    I don't think it is.

    Should this tell me that new sectors are dying all the time, or
    should this lead me to believe that a cable / the SATA controller is at
    fault? I don't even see any errors with smartctl:

    If the sectors were effectively random, a cable fault would be
    likely. If the sectors are contiguous or nearly-so, that's
    definitely the disk.


    | SMART Attributes Data Structure revision number: 16
    | Vendor Specific SMART Attributes with Thresholds:
    | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    | 1 Raw_Read_Error_Rate 0x002f 199 169 051 Pre-fail Always - 81
    | 3 Spin_Up_Time 0x0027 198 197 021 Pre-fail Always - 9100
    | 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 83
    | 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
    | 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
    | 9 Power_On_Hours 0x0032 016 016 000 Old_age Always - 61794
    | 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
    | 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
    | 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 82
    | 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 54
    | 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2219
    | 194 Temperature_Celsius 0x0022 119 116 000 Old_age Always - 33
    | 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
    | 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
    | 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
    | 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
    | 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 43


    This looks like a drive which is old and starting to wear out
    but is not there yet. The raw read error rate is starting to
    creep up but isn't at a threshold.
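    To make "isn't at a threshold" concrete, here is a small Python sketch that compares the normalized VALUE and WORST columns against THRESH for a few of the attributes quoted above. It assumes the usual smartctl convention that an attribute counts as failed once VALUE drops to THRESH or below. The helper names are invented for this example.

```python
# A few rows copied from the smartctl attribute table above.
ATTRS = """\
  1 Raw_Read_Error_Rate 0x002f 199 169 051 Pre-fail Always - 81
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

def parse_attributes(text):
    """Turn attribute rows into dicts; lines that are not rows are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        })
    return rows

def is_failing(row):
    """Assumed convention: failed once the normalized value reaches the threshold."""
    return row["value"] <= row["thresh"]

rows = parse_attributes(ATTRS)
for row in rows:
    print(row["name"], "value", row["value"], "worst", row["worst"],
          "thresh", row["thresh"], "raw", row["raw"], "failing", is_failing(row))
```

    None of these rows is at its threshold. Raw_Read_Error_Rate is the only one whose worst value (169) has moved away from its current value, and the reallocated and pending counts are still zero.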


    I am still waiting for the result of a long self-test.

    Do you think I should remove the drive from the RAID immediately? Or
    should I suspect something else is at fault? I prefer not to run the
    risk of losing the RAID completely when I keep on running on one disk
    while the new one is being shipped. I do have backups, but it would be
    great if I didn't need to restore.

    If the disk is a few days away from being replaced, I would not
    bother shutting it off, but I would assume that it is not a full
    mirror and somehow having the good disk fail would be bad.

    -dsr-

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Smith@21:1/5 to Jochen Spieker on Tue Oct 8 20:50:02 2024
    Hi,

    On Tue, Oct 08, 2024 at 04:58:46PM +0200, Jochen Spieker wrote:
    The way I understand these messages is that some sectors cannot be read
    from sdb at all and the disk is unable to reallocate the data somewhere
    else (probably because it doesn't know what the data should be in the
    first place).

    When MD receives a read error it does read the mirrored data and write
    it back. If it can't do that it fails the disk, so you are not getting
    there yet.

    Two of these message blocks end with this:

    | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744

    What does that mean for the other instances of this error?

    I expect you probably have either no TLER value set or it's set higher
    than the kernel's own timeout. By default consumer drives try very hard
    to read data, taking a long time doing so when there's issues. The
    kernel SCSI layer will try several times, so the drive's timeout is
    multiplied. Only if this ends up exceeding 30s will you get a read
    error, and the message from MD about rescheduling the sector.

    The data is still readable from the other disk in the RAID, right? Why doesn't md mention it?

    I suspect that the times you saw an error from the SCSI layer but not
    from MD, were times that the SCSI layer retried and got the data out eventually.

    When the SCSI layer times out of all its retries it actually resets the
    drive and then the whole bus, and that often causes MD to drop the disk.
    You haven't mentioned any messages about resetting the bus so I think
    you are not having that many retries.

    The fact that you are having any is bad, though.

    Why is the RAID still considered healthy? At some point I
    would expect the disk to be kicked from the RAID.

    This will happen when/if MD can't compensate by reading data from other
    mirrors and writing it back. If a write fails, or a disk drops
    out entirely, then MD will fail the device.

    Hopefully the results of your SMART long self-test will help clear this
    up. These things can be hard to track down though.

    After you do resolve this you should set TLER to some sensible value
    like 7 seconds. That is not your biggest concern right now though.

    Here is a thing I wrote about it quite some time ago:

    https://strugglers.net/~andy/mothballed-blog/2015/11/09/linux-software-raid-and-drive-timeouts/#how-to-check-set-drive-timeouts

    Do you think I should remove the drive from the RAID immediately? Or
    should I suspect something else is at fault?

    The fact that you have no reallocated sectors and no pending sectors
    and apparently all your writes are working makes me think there probably
    isn't a fault with the drive but in some ways that is worse as it's easy
    to replace a drive, not so easy to diagnose bad cables and marginal power
    supplies etc.

    I probably wouldn't remove it because it's better than nothing. I
    probably would try the easy fix of replacing the drive first, if I could
    afford that.

    I prefer not to run the risk of losing the RAID completely when I keep
    on running on one disk while the new one is being shipped.

    I would make sure the timeouts are set correctly because if you do get
    into the situation where the kernel is resetting the bus, that can
    temporarily take away both drives at once which can cause MD to fail
    both out and mark the array as faulty. It's relatively easy to do the
    manual intervention required to start it up again but it is stressful.

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

  • From Jochen Spieker@21:1/5 to All on Tue Oct 8 22:10:01 2024
    Dan Ritter:
    Jochen Spieker wrote:

    The sector number mentioned at the bottom is increasing during the
    check.

    So it repeats, and it's contiguous. That suggests a flaw in the
    drive itself.

    It definitely looks like that:

    | Oct 06 14:27:11 jigsaw kernel: I/O error, dev sdb, sector 9361257600 op 0x0:(READ) flags 0x0 phys_seg 150 prio class 3
    | Oct 06 14:27:30 jigsaw kernel: I/O error, dev sdb, sector 9361275264 op 0x0:(READ) flags 0x4000 phys_seg 161 prio class 3
    | Oct 06 14:27:37 jigsaw kernel: I/O error, dev sdb, sector 9361277696 op 0x0:(READ) flags 0x0 phys_seg 71 prio class 3
    | Oct 06 14:28:02 jigsaw kernel: I/O error, dev sdb, sector 9361283584 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 3
    | Oct 06 14:28:09 jigsaw kernel: I/O error, dev sdb, sector 9361284864 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:34:03 jigsaw kernel: I/O error, dev sdb, sector 9400838400 op 0x0:(READ) flags 0x0 phys_seg 168 prio class 3
    | Oct 06 14:34:17 jigsaw kernel: I/O error, dev sdb, sector 9400841088 op 0x0:(READ) flags 0x0 phys_seg 153 prio class 3
    | Oct 06 14:34:24 jigsaw kernel: I/O error, dev sdb, sector 9400842496 op 0x0:(READ) flags 0x4000 phys_seg 138 prio class 3
    | Oct 06 14:34:31 jigsaw kernel: I/O error, dev sdb, sector 9400845056 op 0x0:(READ) flags 0x0 phys_seg 44 prio class 3
    | Oct 06 14:34:39 jigsaw kernel: I/O error, dev sdb, sector 9400846464 op 0x0:(READ) flags 0x0 phys_seg 16 prio class 3
    | Oct 06 14:34:46 jigsaw kernel: I/O error, dev sdb, sector 9400846592 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 3
    | Oct 06 14:34:53 jigsaw kernel: I/O error, dev sdb, sector 9400846848 op 0x0:(READ) flags 0x4000 phys_seg 59 prio class 3
    | Oct 06 14:35:00 jigsaw kernel: I/O error, dev sdb, sector 9400849408 op 0x0:(READ) flags 0x0 phys_seg 27 prio class 3
    | Oct 06 14:35:11 jigsaw kernel: I/O error, dev sdb, sector 9400850944 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:35:19 jigsaw kernel: I/O error, dev sdb, sector 9400852224 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:35:26 jigsaw kernel: I/O error, dev sdb, sector 9400853504 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:35:37 jigsaw kernel: I/O error, dev sdb, sector 9400855040 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 3
    | Oct 06 14:35:45 jigsaw kernel: I/O error, dev sdb, sector 9400855296 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 3
    | Oct 06 14:35:52 jigsaw kernel: I/O error, dev sdb, sector 9400856576 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:35:59 jigsaw kernel: I/O error, dev sdb, sector 9400857856 op 0x0:(READ) flags 0x4000 phys_seg 159 prio class 3
    | Oct 06 14:36:14 jigsaw kernel: I/O error, dev sdb, sector 9400859392 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:36:21 jigsaw kernel: I/O error, dev sdb, sector 9400860672 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:36:28 jigsaw kernel: I/O error, dev sdb, sector 9400861952 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:36:41 jigsaw kernel: I/O error, dev sdb, sector 9400863488 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 3
    | Oct 06 14:36:48 jigsaw kernel: I/O error, dev sdb, sector 9400864768 op 0x0:(READ) flags 0x4000 phys_seg 168 prio class 3
    | Oct 06 14:37:00 jigsaw kernel: I/O error, dev sdb, sector 9400867584 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 3
    | Oct 06 14:37:07 jigsaw kernel: I/O error, dev sdb, sector 9400868864 op 0x0:(READ) flags 0x4000 phys_seg 160 prio class 3
    | Oct 06 14:37:20 jigsaw kernel: I/O error, dev sdb, sector 9400871680 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 3

    … and so on. On the second RAID check, the numbers are not the same, but
    in the same range.
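    One way to see how tightly these errors are grouped is to split the sector list wherever two neighbours are far apart. A minimal Python sketch using the first eight sectors from the log above; the 1,000,000-sector cut-off is an arbitrary assumption picked for illustration, not anything the kernel or the drive uses:

```python
def cluster(sectors, max_gap):
    """Group sorted sector numbers; a gap larger than max_gap starts a new group."""
    groups = []
    for sector in sorted(sectors):
        if groups and sector - groups[-1][-1] <= max_gap:
            groups[-1].append(sector)
        else:
            groups.append([sector])
    return groups

# First eight failing request sectors from the log above.
SECTORS = [
    9361257600, 9361275264, 9361277696, 9361283584, 9361284864,
    9400838400, 9400841088, 9400842496,
]
MAX_GAP = 1_000_000  # arbitrary cut-off, in 512-byte sectors

groups = cluster(SECTORS, MAX_GAP)
for group in groups:
    span = group[-1] - group[0]
    print(len(group), "errors starting at", group[0],
          "spanning", span, "sectors =", span * 512 // 1024, "KiB")
```

    With these eight values the errors fall into two groups, each covering a narrow span of the disk.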

    If the disk is a few days away from being replaced, I would not
    bother shutting it off, but I would assume that it is not a full
    mirror and somehow having the good disk fail would be bad.

    Thanks a lot for your input, that is exactly the kind of advice that I
    was looking for.

    J.
    --
    Thy lyrics in pop songs seem to describe my life uncannily accurately.
    [Agree] [Disagree]
    <http://archive.slowlydownward.com/NODATA/data_enter2.html>

  • From Jochen Spieker@21:1/5 to All on Tue Oct 8 22:30:01 2024
    Andy Smith:
    On Tue, Oct 08, 2024 at 04:58:46PM +0200, Jochen Spieker wrote:
    The way I understand these messages is that some sectors cannot be read
    from sdb at all and the disk is unable to reallocate the data somewhere
    else (probably because it doesn't know what the data should be in the
    first place).

    When MD receives a read error it does read the mirrored data and write
    it back. If it can't do that it fails the disk, so you are not getting
    there yet.

    Okay, that's good, I guess.

    Two of these message blocks end with this:

    | Oct 07 10:26:12 jigsaw kernel: md/raid1:md0: sdb1: rescheduling sector 10198068744

    What does that mean for the other instances of this error?

    I expect you probably have either no TLER value set

    Thanks a lot, I had never heard of that before. But by chance my WD REDs actually seem to come with a default of 7 seconds:

    | /dev/sdb:
    | smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-25-amd64] (local build)
    | Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
    |
    | SCT Error Recovery Control:
    | Read: 70 (7.0 seconds)
    | Write: 70 (7.0 seconds)
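    The comparison being discussed here (drive error recovery limit versus kernel command timeout) can be sketched in a few lines of Python. This assumes the output format shown above, where the number is in tenths of a second, and takes the 30 s kernel timeout mentioned earlier in the thread as a plain parameter. On a real system that value would normally be read from /sys/block/sdb/device/timeout, which is not shown in this thread. The function names are made up.

```python
import re

# SCT Error Recovery Control output quoted above.
SCTERC = """\
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
"""

def erc_seconds(text):
    """Return {'read': s, 'write': s}; entries without a number are left out."""
    values = {}
    for kind, deciseconds in re.findall(r"(Read|Write):\s+(\d+)", text):
        values[kind.lower()] = int(deciseconds) / 10
    return values

def drive_gives_up_first(erc, kernel_timeout_s):
    """True only if both limits are set and shorter than the kernel timeout."""
    return (
        set(erc) == {"read", "write"}
        and all(0 < limit < kernel_timeout_s for limit in erc.values())
    )

KERNEL_TIMEOUT_S = 30  # figure taken from the discussion above
erc = erc_seconds(SCTERC)
print(erc, drive_gives_up_first(erc, KERNEL_TIMEOUT_S))
```

    With 7.0 s against 30 s the drive reports a read failure well before the kernel would give up on the command, which is the arrangement recommended earlier in the thread.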


    or it's set higher
    than the kernel's own timeout. By default consumer drives try very hard
    to read data, taking a long time doing so when there's issues. The
    kernel SCSI layer will try several times, so the drive's timeout is
    multiplied. Only if this ends up exceeding 30s will you get a read
    error, and the message from MD about rescheduling the sector.

    That makes sense. And might also explain why the disk does not report
    any reallocated sectors (yet).

    Hopefully the results of your SMART long self-test will help clear this
    up. These things can be hard to track down though.

    10% remaining … "long" is really long.

    After you do resolve this you should set TLER to some sensible value
    like 7 seconds. That is not your biggest concern right now though.

    Here is a thing I wrote about it quite some time ago:

    https://strugglers.net/~andy/mothballed-blog/2015/11/09/linux-software-raid-and-drive-timeouts/#how-to-check-set-drive-timeouts

    Thanks a lot again.

    Do you think I should remove the drive from the RAID immediately? Or
    should I suspect something else is at fault?

    The fact that you have no reallocated sectors and no pending sectors
    and apparently all your writes are working makes me think there probably
    isn't a fault with the drive but in some ways that is worse as it's easy
    to replace a drive, not so easy to diagnose bad cables and marginal power
    supplies etc.

    See my other reply, the sector numbers do not appear to be random, so I
    hope that it is actually the disk.

    I prefer not to run the risk of losing the RAID completely when I keep
    on running on one disk while the new one is being shipped.

    I would make sure the timeouts are set correctly because if you do get
    into the situation where the kernel is resetting the bus, that can
    temporarily take away both drives at once which can cause MD to fail
    both out and mark the array as faulty. It's relatively easy to do the
    manual intervention required to start it up again but it is stressful.

    I guess if that really happens I will strongly consider just restoring
    from backup. I just need to think hard about the things that I have
    excluded from backup deliberately. ^^ But the new disk is expected to be
    delivered tomorrow, so I keep my fingers crossed. I mean, that is why I
    am using RAID1 in the first place.

    J.
    --
    I use a Playstation to block out the existence of my partner.
    [Agree] [Disagree]
    <http://archive.slowlydownward.com/NODATA/data_enter2.html>


    ---
  • From [email protected]@21:1/5 to Jochen Spieker on Tue Oct 8 22:40:01 2024
    On 10/8/24 16:07, Jochen Spieker wrote:
    | Oct 06 14:27:11 jigsaw kernel: I/O error, dev sdb, sector 9361257600 op 0x0:(READ) flags 0x0 phys_seg 150 prio class 3
    | Oct 06 14:27:30 jigsaw kernel: I/O error, dev sdb, sector 9361275264 op 0x0:(READ) flags 0x4000 phys_seg 161 prio class 3
    | Oct 06 14:27:37 jigsaw kernel: I/O error, dev sdb, sector 9361277696 op 0x0:(READ) flags 0x0 phys_seg 71 prio class 3
    | Oct 06 14:28:02 jigsaw kernel: I/O error, dev sdb, sector 9361283584 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 3
    | Oct 06 14:28:09 jigsaw kernel: I/O error, dev sdb, sector 9361284864 op etc.

    Those aren't sequential, or even exhibiting the same interval from one to
    the next. Am I misinterpreting the data? Ten of the errors are 1280
    sectors after the previous error and five more pairs are 1536 sectors
    apart; maybe that's significant?

    --
    I was 21 years when I wrote this song
    I'm 22 now, but I won't be for long.
    Time hurries on / and the leaves that are green turn to brown.
    -- S&G, "Leaves that are Green"

  • From Michael Kjörling@21:1/5 to All on Tue Oct 8 23:20:01 2024
    On 8 Oct 2024 11:29 -0400, from [email protected] (Dan Ritter):
    The disk has been running continuously for seven years now and I am
    running out of space anyway, so I already ordered a replacement. But I
    do not fully understand what is happening.

    The drive is dying, slowly. In this case it's starting with a
    bad patch on a platter.

    That would be my take too. The LBA sectors reported in a different
    post in this thread being as close as they appear to be would also
    corroborate the platter issue theory.


    | SMART Attributes Data Structure revision number: 16
    | Vendor Specific SMART Attributes with Thresholds:
    | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    | 1 Raw_Read_Error_Rate 0x002f 199 169 051 Pre-fail Always - 81
    | 3 Spin_Up_Time 0x0027 198 197 021 Pre-fail Always - 9100
    | 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 83
    | 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
    | 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
    | 9 Power_On_Hours 0x0032 016 016 000 Old_age Always - 61794
    | 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
    | 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
    | 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 82
    | 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 54
    | 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2219
    | 194 Temperature_Celsius 0x0022 119 116 000 Old_age Always - 33
    | 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
    | 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
    | 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
    | 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
    | 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 43

    This looks like a drive which is old and starting to wear out
    but is not there yet. The raw read error rate is starting to
    creep up but isn't at a threshold.

    I agree. The almost 62000 hours is well over 7 years of run time, and
    based on the start/stop count and power cycle count it's been running
    basically continuously for that time (which is generally good for
    longevity, as long as it's not subjected to excessive heat). It's
    entirely possible that the mechanical components are degrading; which
    in turn might also be interfering with the physical properties of data
    storage. Yes, servo tracks and such things are supposed to catch and
    compensate for that; but it might not be quite that bad yet.
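
    The run-time estimate is simple arithmetic on two raw values from the table above. This is a rough sketch, assuming the Power_On_Hours raw value really counts hours (some vendors use other units); the variable names are only for illustration.

```python
# Raw values copied from the SMART table quoted above.
POWER_ON_HOURS = 61794      # attribute 9, assumed to be in hours
POWER_CYCLE_COUNT = 82      # attribute 12

HOURS_PER_YEAR = 24 * 365.25

years = POWER_ON_HOURS / HOURS_PER_YEAR
hours_per_cycle = POWER_ON_HOURS / POWER_CYCLE_COUNT

print(f"powered on for about {years:.2f} years")
print(f"average of {hours_per_cycle:.0f} hours "
      f"(about {hours_per_cycle / 24:.0f} days) per power cycle")
```

    An average of roughly a month of uptime per power cycle is consistent with a disk that has been running almost without interruption.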

    Sometimes HDDs fail with a bang, and sometimes they fail with a
    whimper.

    Also note that some disks actually lie in SMART data. I don't know if
    yours does, but I would definitely question a value of 0 for failed
    (current pending and offline uncorrectable) _and_ reallocated sectors
    for a disk that's reporting I/O errors, for example. _At least_ one of
    those should be >0 for a truthful storage device in that situation.
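
    That check can be automated against the attribute table. The sketch below parses lines in the column layout shown above and reports which of the reallocation and pending-sector counters are non-zero. The function names and the choice of watched IDs (5, 196, 197, 198) are my own for illustration, not anything smartmontools provides.

```python
# A few attribute lines copied from the table quoted above.
SMART_TABLE = """\
  1 Raw_Read_Error_Rate 0x002f 199 169 051 Pre-fail Always - 81
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

# Attribute IDs that would normally move when sectors go bad (an assumption
# made for this sketch; vendors differ in what they actually report).
WATCHED = {5, 196, 197, 198}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for every parsable table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # skip raw values that are not plain integers
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def nonzero_watched(attrs, watched=WATCHED):
    """Return {name: raw} for watched attributes with a raw value above 0."""
    return {
        attrs[i][0]: attrs[i][1]
        for i in sorted(watched)
        if i in attrs and attrs[i][1] > 0
    }


suspicious = nonzero_watched(parse_attributes(SMART_TABLE))
print(suspicious or "all watched counters are 0")
```

    For the values quoted in this thread every watched counter is 0, which is exactly the inconsistency with the kernel's I/O errors.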

    What I would not do at this point is subject it to more physical
    stress than unavoidable. Unless you absolutely must, do not physically
    unplug or remove that disk before the RAID array has resilvered onto
    the new disk. It's currently providing value being a second source of
    truth about what's stored; you don't want to remove it and then find
    during the resilver that the other current disk has a problem.

    --
    Michael Kjörling 🔗 https://michael.kjorling.se “Remember when, on the Internet, nobody cared that you were a dog?”

  • From Jochen Spieker@21:1/5 to All on Wed Oct 9 15:00:01 2024
    [email protected]:
    On 10/8/24 16:07, Jochen Spieker wrote:
    | Oct 06 14:27:11 jigsaw kernel: I/O error, dev sdb, sector 9361257600 op 0x0:(READ) flags 0x0 phys_seg 150 prio class 3
    | Oct 06 14:27:30 jigsaw kernel: I/O error, dev sdb, sector 9361275264 op 0x0:(READ) flags 0x4000 phys_seg 161 prio class 3
    | Oct 06 14:27:37 jigsaw kernel: I/O error, dev sdb, sector 9361277696 op 0x0:(READ) flags 0x0 phys_seg 71 prio class 3
    | Oct 06 14:28:02 jigsaw kernel: I/O error, dev sdb, sector 9361283584 op 0x0:(READ) flags 0x0 phys_seg 160 prio class 3
    | Oct 06 14:28:09 jigsaw kernel: I/O error, dev sdb, sector 9361284864 op etc.

    Those aren't sequential, or even exhibiting the same interval from one to
    the next. Am I misinterpreting the data?

    No, but the numbers are close to each other and the errors did not
    happen sporadically throughout the runtime of the md check, but only
    within a specific timeframe.

    Ten of the errors are 1280 sectors after the previous error and five
    more pairs are 1536 sectors apart; maybe that's significant?

    That may have something to do with the physical layout on the platters.
    If there is a "bad patch" on one of them, I would expect something like
    this. Maybe one rotation is 1280 sectors apart at some point, and 1536 a
    little bit further from the center.


    J.
    --
    I think the environment will be okay.
    [Agree] [Disagree]
    <http://archive.slowlydownward.com/NODATA/data_enter2.html>

  • From Jochen Spieker@21:1/5 to All on Wed Oct 9 15:20:01 2024
    Michael Kjörling:
    On 8 Oct 2024 11:29 -0400, from [email protected] (Dan Ritter):

    This looks like a drive which is old and starting to wear out
    but is not there yet. The raw read error rate is starting to
    creep up but isn't at a threshold.

    I agree. The almost 62000 hours is well over 7 years of run time, and
    based on the start/stop count and power cycle count it's been running
    basically continuously for that time (which is generally good for
    longevity, as long as it's not subjected to excessive heat).

    It is exactly that. It has been running mostly uninterrupted in my
    basement. Max temp from the past 12 months (as far as I can tell by
    looking at aggregated data in munin) is 36°C. That should be fairly
    ideal.


    Also note that some disks actually lie in SMART data. I don't know if
    yours does, but I would definitely question a value of 0 for failed
    (current pending and offline uncorrectable) _and_ reallocated sectors
    for a disk that's reporting I/O errors, for example. _At least_ one of
    those should be >0 for a truthful storage device in that situation.

    That is exactly what was confusing me here.

    What I would not do at this point is subject it to more physical
    stress than unavoidable. Unless you absolutely must, do not physically
    unplug or remove that disk before the RAID array has resilvered onto
    the new disk. It's currently providing value being a second source of
    truth about what's stored; you don't want to remove it and then find
    during the resilver that the other current disk has a problem.

    Helpful advice, thanks. Unfortunately, I cannot hotplug into this
    system. Thinking of this, the errors came shortly after a (long
    overdue) reboot, so it will have to survive at least another shutdown
    to provide some redundancy.

    Just for completeness, the long self-check did not report any issues and
    the SMART values also stayed the same. Nothing to see here, move along.
    ^^

    The new disk is already sitting on my desk.

    J.
    --
    I cannot comprehend the idea of chemical and biological weapons.
    [Agree] [Disagree]
    <http://archive.slowlydownward.com/NODATA/data_enter2.html>

  • From Andy Smith@21:1/5 to Franco Martelli on Wed Oct 9 21:00:01 2024
    Hi,

    On Wed, Oct 09, 2024 at 08:41:38PM +0200, Franco Martelli wrote:
    Do you know whether MD is clever enough to send an email to root when it
    fails the device? Or have I to keep an eye on /proc/mdstat?

    For more than a decade mdadm has shipped with a service that runs in
    monitor mode to do this.

    https://manpages.debian.org/bookworm/mdadm/mdadm.8.en.html#MONITOR_MODE

    There are also plugins for every Linux monitoring system out there to
    read /proc/mdstat.
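
    For illustration, this is roughly what such a plugin has to do. It is a minimal sketch and no replacement for mdadm's monitor mode: the sample text is the /proc/mdstat output from the first post, the function name `degraded_arrays` is made up, and only the common `[n/m] [UU]` status format is handled.

```python
import re

# /proc/mdstat content as posted at the start of this thread.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2] sdc1[0]
      5860390400 blocks super 1.2 [2/2] [UU]
      bitmap: 5/44 pages [20KB], 65536KB chunk

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\S+) : ")
# e.g. "[2/2] [UU]" when healthy, "[2/1] [U_]" with one member missing
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of arrays whose status map contains a '_'."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


print(degraded_arrays(SAMPLE) or "all arrays have every member active")
```

    On a live system the input would come from `open("/proc/mdstat").read()` instead of the sample string.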

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

  • From Jochen Spieker@21:1/5 to All on Wed Oct 9 21:20:02 2024
    Andy Smith:
    Hi,

    On Wed, Oct 09, 2024 at 08:41:38PM +0200, Franco Martelli wrote:
    Do you know whether MD is clever enough to send an email to root when it
    fails the device? Or have I to keep an eye on /proc/mdstat?

    For more than a decade mdadm has shipped with a service that runs in
    monitor mode to do this.

    https://manpages.debian.org/bookworm/mdadm/mdadm.8.en.html#MONITOR_MODE

    And this is configured here:

    # grep -B1 MAILADDR /etc/mdadm/mdadm.conf
    # instruct the monitoring daemon where to send mail alerts
    MAILADDR root

    J.
    --
    Hell will have perfume.
    [Agree] [Disagree]
    <http://archive.slowlydownward.com/NODATA/data_enter2.html>
