Forum: >>> Magnum BBS <<<

Vanishing hard disk device

From Janis Papanagnou@21:1/5 to All on Sun Feb 28 06:55:43 2016

My first posting in this newsgroup; in case this is not the appropriate newsgroup for this question I'd welcome pointers.

The problem: In the past I had problems with the consistency of my hard
disk file systems (ext2/ext3, then reiser, with soft RAID, and now ZFS);
I often ended up with corrupted data on the older file systems. Now, with
ZFS running, I can see the probable source of the fault: ZFS reports one
of the hard disks as 'removed' and the state of the whole pool as
'degraded'. Running the smartctl tool seems to indicate that one of the
three RAID-Z disks is unavailable, as if it were just switched off.
Removing the disk and re-inserting it into its slot activates it again;
the smartctl tool then shows all the expected disk information, and after
a ZFS 'online' of that disk everything is fine (i.e. no data loss).
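
A minimal sketch of how the 'degraded' / 'removed' condition described above could be spotted from a script. It assumes the usual tabular layout of `zpool status` (a NAME/STATE header followed by one row per pool, vdev and disk); the pool name `tank`, the device names in the sample, and the function names are hypothetical.

```python
import subprocess

# States that mark a pool, vdev or disk as not fully healthy.
BAD_STATES = {"DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}


def unhealthy_devices(status_text):
    """Return (name, state) pairs for config rows whose state is not healthy."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True   # rows after this header describe the pool layout
            continue
        if in_config and not fields:
            in_config = False  # a blank line ends the config table
            continue
        if in_config and len(fields) >= 2 and fields[1] in BAD_STATES:
            found.append((fields[0], fields[1]))
    return found


def current_status(pool):
    """Run `zpool status <pool>` and return its text (requires ZFS to be installed)."""
    result = subprocess.run(["zpool", "status", pool],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample, shaped like the situation in the post.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          raidz1-0               DEGRADED     0     0     0
            ata-VENDOR_MODEL_S1  ONLINE       0     0     0
            ata-VENDOR_MODEL_S2  ONLINE       0     0     0
            ata-VENDOR_MODEL_S3  REMOVED      0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    for name, state in unhealthy_devices(SAMPLE):
        print(name, state)
```

Once the disk has been reseated, the 'online' step mentioned in the post corresponds to `zpool online <pool> <device>`.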

I have ruled out a problem with the disks themselves: I have tried many
different disks (from different vendors as well as several of the same
type), and the problem only ever affects the disk connected as /dev/sdd.

My suspicion is that the controller hardware in the motherboard might be faulty.

Any advice about the source of this sort of problem? Or suggestions on
how to keep the disk at /dev/sdd from occasionally becoming unavailable
and (sort of) vanishing?

Thanks.

Janis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Aragorn@21:1/5 to All on Sun Feb 28 09:57:54 2016

On Sunday 28 Feb 2016 06:55, Janis Papanagnou conveyed the following to comp.unix.admin...

> My first posting in this newsgroup; in case this is not the
> appropriate newsgroup for this question I'd welcome pointers.
>
> The problem: In the past I had problems with the consistency of my
> hard disk file system (from ext2/ext3, reiser, with soft RAID, until
> ZFS); I've often corrupted data with the older file systems. Now, with
> ZFS running, I notice the probable source of the faulty effect; ZFS
> reports one of the hard disks as 'removed' and the whole disk's state
> as 'degraded'. Running the smartctl tool seems to indicate that one of
> the three RAID-Z disks is unavailable, as if it's just switched off.
> Removing and re-inserting the disk to its slot activates it again; the
> smartctl tool shows all the expected disk information and after a ZFS
> 'online' of that disk everything is fine (i.e. no data loss).
>
> I ruled out that it's a hard disk problem, since I bought many
> different disks (different vendors, or same disk types), and the
> problem is always only with the disks that are connected to /dev/sdd.
>
> My suspicion is that the controller hardware in the motherboard might
> be faulty.
>
> Any advice about the source of this sort of problem? Or suggestions
> how to avoid that the disk at /dev/sdd will occasionally get
> unavailable and (sort of) vanishes?

If it is indeed the controller on the motherboard, then the only thing I
can think of, given that you're on a software RAID, would be to get a
PCI, PCI-X or PCIe SATA adapter card. And then for good measure, you
should connect _all_ of your hard disks to that one.

On the other hand, there's also a chance ─ given that you're alluding to hot-swap drive bays ─ that it could be the backplane itself which is
faulty, or the cable for that one particular drive bay. And in that
case, the only thing you can do is replace the cable (which is cheapest)
or the backplane (which will cost you more).

So I would advise checking the cable first: see whether it's well seated,
try another cable for a while, and then see whether the problem persists.
With a bit of luck, it's only the cable which is faulty. ;)
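
One way to back up the cable/backplane check is to look at what the kernel logged around the time the disk vanished. A rough sketch, assuming a Linux system whose SATA ports are driven by libata: the message fragments in the pattern are ones libata commonly prints, but the exact wording varies between kernel versions, and the sample lines and function name are made up for illustration.

```python
import re

# Fragments of libata messages that typically accompany a dropped SATA link.
LINK_EVENT = re.compile(
    r"\b(ata\d+)(?:\.\d+)?: .*"
    r"(SATA link down|hard resetting link|link is slow to respond|failed command)"
)


def link_events_by_port(log_text):
    """Count suspicious link events per ATA port (ata1, ata2, ...)."""
    counts = {}
    for line in log_text.splitlines():
        match = LINK_EVENT.search(line)
        if match:
            port = match.group(1)
            counts[port] = counts.get(port, 0) + 1
    return counts


# Hypothetical kernel log excerpt.
SAMPLE_LOG = """\
[12345.678] ata4: hard resetting link
[12346.012] ata4: SATA link down (SStatus 0 SControl 300)
[12350.500] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[99999.001] ata4.00: failed command: READ FPDMA QUEUED
"""

if __name__ == "__main__":
    print(link_events_by_port(SAMPLE_LOG))
```

The text to scan could come from `dmesg` or `journalctl -k`. Events that stay on one port across different disks would point towards that port's cable, bay or controller channel rather than the disks.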

--
= Aragorn =

http://www.linuxcounter.net - registrant #223157


From Janis Papanagnou@21:1/5 to Aragorn on Mon Feb 29 10:28:26 2016


Thanks for your suggestions, Aragorn!

Sadly, while I was still making a plan to follow your suggestions for
localizing the problem, my system seems to have decided to fool me.
Without my changing anything, the ZFS file system became 'degraded'
again; but this time (and for the first time) it is another device,
/dev/sdc, that became inaccessible.

Does that now, in your experience, change the diagnosis of the problem?

Frankly, I'm totally confused. (Previously I had at least some ideas of potential sources of the issue, but now...)

Janis


From Aragorn@21:1/5 to All on Mon Feb 29 11:05:08 2016

On Monday 29 Feb 2016 10:28, Janis Papanagnou conveyed the following to comp.unix.admin...

> On 28.02.2016 09:57, Aragorn wrote:
>
>> On the other hand, there's also a chance ─ given that you're alluding
>> to hot-swap drive bays ─ that it could be the backplane itself which
>> is faulty, or the cable for that one particular drive bay. And in
>> that case, the only thing you can do is replace the cable (which is
>> cheapest) or the backplane (which will cost you more).
>>
>> So I would advise first checking the cable, see whether it's
>> well-seated, try with another cable for a while, and then see whether
>> the problem persists. With a bit of luck, it's only the cable which
>> is faulty. ;)
>
> Thanks for your suggestions, Aragorn!
>
> Sadly, making a plan to follow your suggestions localizing the
> problem, my system seems to have decided to fool me. Without changing
> anything the ZFS file system again became 'degraded'; but this time
> (and for the first time) it is another device, /dev/sdc, that became
> inaccessible.
>
> Does that now, in your experience, change the diagnosis of the
> problem?

Well, there is now something else that pops into my mind, and from
reading your contributions to comp.unix.shell, I'd imagine you to be a professional and thus not to make the mistake I'm about to expound on,
but there _is_ always the chance that the hard disks in your array are
actually not RAID-certified.

It is not uncommon for consumer-grade SATA disks ─ and most notably
those made by Western Digital ─ to be a little slow in handling certain
status polls from the RAID controller ─ whether hardware or software ─
with the result that the controller may falsely detect a degraded
state. For this purpose, Western Digital has released "RAID-certified"
disks, which have a different timing setup and respond faster to status
polls, so that they won't be marked as defective by software or
hardware RAID setups when they are in fact still functioning normally
but busy executing other instructions.
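
The timing difference described above is usually discussed under the names TLER (Western Digital) or SCT Error Recovery Control: a limit on how long a drive keeps retrying a bad sector before it reports an error. As a hedged illustration, not a diagnosis of this particular system, the sketch below compares such a recovery limit with the Linux per-device command timeout found in `/sys/block/<dev>/device/timeout`. The recovery limit is passed in by hand (smartctl's `-l scterc` option reports it in tenths of a second); the function names and the `sysfs_root` parameter are invented for this example.

```python
from pathlib import Path


def kernel_timeout_seconds(device, sysfs_root="/sys/block"):
    """Read the kernel's command timeout (in seconds) for a device such as 'sdd'."""
    path = Path(sysfs_root, device, "device", "timeout")
    return int(path.read_text().strip())


def recovery_fits_timeout(erc_deciseconds, timeout_seconds):
    """Return True if the drive stops retrying before the kernel gives up on it.

    erc_deciseconds is the drive's error recovery limit in units of 100 ms;
    0 means the limit is disabled, so recovery time is unbounded.
    """
    if erc_deciseconds <= 0:
        return False
    return erc_deciseconds / 10 < timeout_seconds


if __name__ == "__main__":
    # Assumed values: a 7.0 second recovery limit against a 30 second timeout.
    print(recovery_fits_timeout(70, 30))
```

If the drive's limit is disabled or longer than the kernel timeout, a disk that is merely busy with internal error recovery can be reset and dropped even though it is still working.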

A second possibility is the following... Since you enumerate the
devices as /dev/sdc and /dev/sdd, that tells me that you're running a
GNU/Linux system. And then there are a few questions that pop up,
because then more information is needed...

1. Do you ever power the machine down, and if so, did you power down
between your previous report on the issue and the report that I'm
now replying to?

2. What distribution are you running on your system?

3. Are the devices mounted by UUID or LABEL, or do you mount them
by way of their Linux-specific /dev/sd? designations?

The thing is that the /dev/sd? designations are not guaranteed to be
persistent across reboots. The udev device manager was supposed to
provide for some consistency in that regard, but it doesn't do that
particular job very well either. Therefore, when using multiple disks
in the same machine, it is best to give the individual partitions a
unique LABEL and mount them while using that, or to mount them by way of
their unique UUID. (This does require booting with an initrd/initramfs,
as the kernel itself does not recognize LABELs and UUIDs at boot time.)

If you have indeed rebooted the machine, then it is possible that the
faulty drive /dev/sdd of the last time has now become /dev/sdc. If you
have not rebooted your machine, then you may safely discard this section
of my reply. ;)
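
To illustrate the point about /dev/sd? names not being persistent, here is a minimal sketch that flags `/etc/fstab` entries which still refer to kernel device names instead of `UUID=` or `LABEL=`. It only concerns ordinary file systems such as ext4; ZFS pools are normally not mounted through fstab. The function name and the sample entries are hypothetical.

```python
import re

# Kernel-assigned names such as /dev/sdc or /dev/sdc1, which can change between boots.
KERNEL_NAME = re.compile(r"^/dev/(sd[a-z]+|hd[a-z]+)\d*$")


def unstable_fstab_entries(fstab_text):
    """Return (device, mount point) pairs that use a kernel device name."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and KERNEL_NAME.match(fields[0]):
            flagged.append((fields[0], fields[1]))
    return flagged


# Hypothetical fstab contents.
SAMPLE_FSTAB = """\
# <file system>  <mount point>  <type>  <options>  <dump>  <pass>
UUID=0a1b2c3d-0000-0000-0000-000000000000  /            ext4  errors=remount-ro  0  1
LABEL=data                                 /data        ext4  defaults           0  2
/dev/sdc1                                  /mnt/backup  ext4  defaults           0  2
"""

if __name__ == "__main__":
    for device, mount_point in unstable_fstab_entries(SAMPLE_FSTAB):
        print(device, "->", mount_point)
```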

If it's neither of the above, then I suspect there to be a problem with
your hot-swap backplane, as I wrote in my previous reply. It could just
be an intermittent problem ─ e.g. a contact issue ─ or it could be permanent, but I have insufficient experience with such backplanes, so I
don't really know which ones are high quality and which ones are prone
to failure.

Hope this helps. ;)

--
= Aragorn =

http://www.linuxcounter.net - registrant #223157


From Janis Papanagnou@21:1/5 to Aragorn on Tue Mar 1 02:59:14 2016

On 29.02.2016 11:05, Aragorn wrote:

[...]
>> Sadly, making a plan to follow your suggestions localizing the
>> problem, my system seems to have decided to fool me. Without changing
>> anything the ZFS file system again became 'degraded'; but this time
>> (and for the first time) it is another device, /dev/sdc, that became
>> inaccessible.
>
> Well, there is now something else that pops into my mind, and from
> reading your contributions to comp.unix.shell, I'd imagine you to be a
> professional and thus not to make the mistake I'm about to expound on,
> but there _is_ always the chance that the hard disks in your array are
> actually not RAID-certified.

Well, you can safely assume that I don't know much WRT hardware or system administration, so every hint is welcome. :-)

WRT "RAID-certified"; I don't really know what that means. My assumption
would have been that such a classification would refer to hardware RAID systems, not software-RAID, but I don't know.

What I can tell for my case is that I originally had two "server disks"
(Seagate) that were said to be designed for continuous operation, which
I preferred at that time because I rarely reboot. But since those disks
failed despite the "24/7 feature", a journalling file system and software
RAID, I later replaced them with "desktop hard disks" (Toshiba and WD);
the failure was the same with every hard-disk configuration, though.

> It is not uncommon for consumer-grade SATA disks ─ and most notably
> those made by Western Digital ─ to be a little slow in handling
> certain status polls from the RAID controller ─ whether hardware or
> software ─ with the result that the controller may falsely detect a
> degraded state. For this purpose, Western Digital has released
> "RAID-certified" disks, which have a different timing setup and
> respond faster to status polls, so that they wouldn't be marked as
> defective by software or hardware RAID setups when they are in fact
> still functioning normally but busy executing other instructions.
>
> A second possibility is the following... Since you enumerate the
> devices as /dev/sdc and /dev/sdd, that tells me that you're running a
> GNU/Linux system.

Right.

> And then there are a few questions that pop up,
> because then more information is needed...
>
> 1. Do you ever power the machine down, and if so, did you power down
>    between your previous report on the issue and the report that I'm
>    now replying to?

I rarely reboot, and I haven't rebooted for 20+ days.

> 2. What distribution are you running on your system?

Xubuntu.

> 3. Are the devices mounted by UUID or LABEL, or do you mount them
>    by way of their Linux-specific /dev/sd? designations?

I'm unsure about that. I have a couple of (other) ext4 disks that I
mounted by UUID. But I was thinking that ZFS has its own way of accessing
the disks; when listing the status of the zpool, the disks are identified
by unique strings like "ata-<vendor>_<hard-disk-model>_<serial-number>".

What I can say is that on failure the ZFS identification of the 'removed'
disk matched the respective /dev/sd? that smartctl showed to be
defective.
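
The "ata-..." strings mentioned above look like the persistent names that udev keeps as symlinks under `/dev/disk/by-id` on Linux. A small sketch, assuming that directory exists on the system (the function name and the `by_id_dir` parameter are invented for this example), that shows which persistent names currently point at which /dev/sd? node:

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each resolved device path (e.g. /dev/sdd) to the persistent names linking to it."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue  # only symlinks carry a name-to-device relation
        target = os.path.realpath(link)
        mapping.setdefault(target, []).append(name)
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        for device, names in sorted(stable_names().items()):
            print(device, ", ".join(names))
```

Recording this mapping at the moment of each failure would show whether the same physical disk (same serial number) or the same port is involved every time.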

> The thing is that the /dev/sd? designations are not guaranteed to be
> persistent across reboots. [...]
>
> If it's neither of the above, then I suspect there to be a problem with
> your hot-swap backplane, as I wrote in my previous reply. It could just
> be an intermittent problem ─ e.g. a contact issue ─ or it could be
> permanent, but I have insufficient experience with such backplanes, so
> I don't really know which ones are high quality and which ones are
> prone to failure.
>
> Hope this helps. ;)

Thanks!

Janis

