• Matching grub data in the MBR with the installed grub-pc package

    From Andy Smith@21:1/5 to All on Mon Sep 9 22:10:01 2024
    Hi,

    I've come into possession of a machine running Debian 10 with two
    drives in it; sda and sdb. These have been labelled with a DOS MBR
    and partitioned. The first partition starts at sector 2048 of both
    drives (512 byte sectors). It appears that GRUB has been installed
    on both sda and sdb:

    $ sudo dd if=/dev/sda bs=1 count=512 2>/dev/null | xxd
    00000000: eb63 9010 8ed0 bc00 b0b8 0000 8ed8 8ec0 .c..............
    00000010: fbbe 007c bf00 06b9 0002 f3a4 ea21 0600 ...|.........!..
    00000020: 00be be07 3804 750b 83c6 1081 fefe 0775 ....8.u........u
    00000030: f3eb 16b4 02b0 01bb 007c b280 8a74 018b .........|...t..
    00000040: 4c02 cd13 ea00 7c00 00eb fe00 0000 0000 L.....|.........
    00000050: 0000 0000 0000 0000 0000 0080 0100 0000 ................
    00000060: 0000 0000 fffa 9090 f6c2 8074 05f6 c270 ...........t...p
    00000070: 7402 b280 ea79 7c00 0031 c08e d88e d0bc t....y|..1......
    00000080: 0020 fba0 647c 3cff 7402 88c2 52bb 1704 . ..d|<.t...R...
    00000090: f607 0374 06be 887d e817 01be 057c b441 ...t...}.....|.A
    000000a0: bbaa 55cd 135a 5272 3d81 fb55 aa75 3783 ..U..ZRr=..U.u7.
    000000b0: e101 7432 31c0 8944 0440 8844 ff89 4402 ..t21..D.@.D..D.
    000000c0: c704 1000 668b 1e5c 7c66 895c 0866 8b1e ....f..\|f.\.f..
    000000d0: 607c 6689 5c0c c744 0600 70b4 42cd 1372 `|f.\..D..p.B..r
    000000e0: 05bb 0070 eb76 b408 cd13 730d 5a84 d20f ...p.v....s.Z...
    000000f0: 83d0 00be 937d e982 0066 0fb6 c688 64ff .....}...f....d.
    00000100: 4066 8944 040f b6d1 c1e2 0288 e888 f440 @f.D...........@
    00000110: 8944 080f b6c2 c0e8 0266 8904 66a1 607c .D.......f..f.`|
    00000120: 6609 c075 4e66 a15c 7c66 31d2 66f7 3488 f..uNf.\|f1.f.4.
    00000130: d131 d266 f774 043b 4408 7d37 fec1 88c5 .1.f.t.;D.}7....
    00000140: 30c0 c1e8 0208 c188 d05a 88c6 bb00 708e 0........Z....p.
    00000150: c331 dbb8 0102 cd13 721e 8cc3 601e b900 .1......r...`...
    00000160: 018e db31 f6bf 0080 8ec6 fcf3 a51f 61ff ...1..........a.
    00000170: 265a 7cbe 8e7d eb03 be9d 7de8 3400 bea2 &Z|..}....}.4...
    00000180: 7de8 2e00 cd18 ebfe 4752 5542 2000 4765 }.......GRUB .Ge
    00000190: 6f6d 0048 6172 6420 4469 736b 0052 6561 om.Hard Disk.Rea
    000001a0: 6400 2045 7272 6f72 0d0a 00bb 0100 b40e d. Error........
    000001b0: cd10 ac3c 0075 f4c3 f2b8 530e 0000 8020 ...<.u....S....
    000001c0: 2100 fd35 373e 0008 0000 0038 0f00 0035 !..57>.....8...5
    000001d0: 383e fd51 6031 0040 0f00 0098 3b00 0051 8>.Q`1.@....;..Q
    000001e0: 6131 fdef 45aa 00d8 4a00 00d0 1d00 0010 a1..E...J.......
    000001f0: 64ab 05fe ffff feaf 6800 0230 27df 55aa d.......h..0'.U.
    $ sudo dd if=/dev/sdb bs=1 count=512 2>/dev/null | xxd
    00000000: eb63 9010 8ed0 bc00 b0b8 0000 8ed8 8ec0 .c..............
    00000010: fbbe 007c bf00 06b9 0002 f3a4 ea21 0600 ...|.........!..
    00000020: 00be be07 3804 750b 83c6 1081 fefe 0775 ....8.u........u
    00000030: f3eb 16b4 02b0 01bb 007c b280 8a74 018b .........|...t..
    00000040: 4c02 cd13 ea00 7c00 00eb fe00 0000 0000 L.....|.........
    00000050: 0000 0000 0000 0000 0000 0080 0100 0000 ................
    00000060: 0000 0000 fffa 9090 f6c2 8074 05f6 c270 ...........t...p
    00000070: 7402 b280 ea79 7c00 0031 c08e d88e d0bc t....y|..1......
    00000080: 0020 fba0 647c 3cff 7402 88c2 52be 807d . ..d|<.t...R..}
    00000090: e817 01be 057c b441 bbaa 55cd 135a 5272 .....|.A..U..ZRr
    000000a0: 3d81 fb55 aa75 3783 e101 7432 31c0 8944 =..U.u7...t21..D
    000000b0: 0440 8844 ff89 4402 c704 1000 668b 1e5c .@.D..D.....f..\
    000000c0: 7c66 895c 0866 8b1e 607c 6689 5c0c c744 |f.\.f..`|f.\..D
    000000d0: 0600 70b4 42cd 1372 05bb 0070 eb76 b408 ..p.B..r...p.v..
    000000e0: cd13 730d 5a84 d20f 83d8 00be 8b7d e982 ..s.Z........}..
    000000f0: 0066 0fb6 c688 64ff 4066 8944 040f b6d1 .f....d.@f.D....
    00000100: c1e2 0288 e888 f440 8944 080f b6c2 c0e8 .......@.D......
    00000110: 0266 8904 66a1 607c 6609 c075 4e66 a15c .f..f.`|f..uNf.\
    00000120: 7c66 31d2 66f7 3488 d131 d266 f774 043b |f1.f.4..1.f.t.;
    00000130: 4408 7d37 fec1 88c5 30c0 c1e8 0208 c188 D.}7....0.......
    00000140: d05a 88c6 bb00 708e c331 dbb8 0102 cd13 .Z....p..1......
    00000150: 721e 8cc3 601e b900 018e db31 f6bf 0080 r...`......1....
    00000160: 8ec6 fcf3 a51f 61ff 265a 7cbe 867d eb03 ......a.&Z|..}..
    00000170: be95 7de8 3400 be9a 7de8 2e00 cd18 ebfe ..}.4...}.......
    00000180: 4752 5542 2000 4765 6f6d 0048 6172 6420 GRUB .Geom.Hard
    00000190: 4469 736b 0052 6561 6400 2045 7272 6f72 Disk.Read. Error
    000001a0: 0d0a 00bb 0100 b40e cd10 ac3c 0075 f4c3 ...........<.u..
    000001b0: 0000 0000 0000 0000 481f 78c6 0000 8020 ........H.x....
    000001c0: 2100 fd35 373e 0008 0000 0038 0f00 0035 !..57>.....8...5
    000001d0: 383e fd51 6031 0040 0f00 0098 3b00 0051 8>.Q`1.@....;..Q
    000001e0: 6131 fdef 45aa 00d8 4a00 00d0 1d00 0010 a1..E...J.......
    000001f0: 64ab 05fe ffff feaf 6800 0230 27df 55aa d.......h..0'.U.

    This machine does not boot properly, going immediately to a grub>
    prompt. If, during the boot process, I force the BIOS to boot from
    sdb then it does boot properly.

    This machine is doing something useful at the moment, so I am under
    pressure to not have it out of service for extended periods of time
    while I tinker with it.

    The drives are partitioned and set up for MD RAID-1. The current
    grub config loads the mdraid1x module and is set to consider its
    root as array UUID aca790f8:3fcc9451:e65b1821:87ee8ab7:

    $ grep root=\' /boot/grub/grub.cfg | sed 's/^\t*//' | uniq
    set root='mduuid/aca790f83fcc9451e65b182187ee8ab7'
    $ for dev in sda1 sdb1; do sudo mdadm -E "/dev/$dev" | grep 'Array UUID'; done
    Array UUID : aca790f8:3fcc9451:e65b1821:87ee8ab7
    Array UUID : aca790f8:3fcc9451:e65b1821:87ee8ab7

    /proc/mdstat shows all the MD arrays are currently running fine with
    paired partitions from sda and sdb.

    So, it looks like this machine is redundant against drive failure
    EXCEPT for during boot, and it's just something odd with the MBR of
    sda.

    What is the simplest way to make it work, and be redundant?

    Normally for things of this era (i.e. not UEFI) I would be taking
    care to grub-install on both drives after install. Clearly someone
    did install grub to both, but sda's copy is no longer working.
    Perhaps grub has been upgraded since then, causing another
    grub-install against sda (only), and then the BIOS's idea of drive
    order flipped around?

    Can I simply copy the first 512 bytes of sdb to the start of sda?

    I do not particularly want to run grub-install, as the MBR of sdb is
    known good at the moment. Perhaps though I could run:

    $ sudo grub-install /dev/sda

    and then compare again the first 512 bytes of each drive?

    Is there any way to sanity check what an MBR will do, grub-wise? My
    searching found grub-emu but I couldn't find any useful
    documentation and didn't want to just run it. Similarly grub-probe.

    I was kind of hoping that there would be something I could run which
    would say "yes, this MBR has grub v<whatever> and is set to find its
    grub.cfg on (hdX)", then I might be able to see some difference in
    what the MBR of sda wants to do. I'm particularly interested in
    seeing if the binary grub data in the MBR actually comes from the
    grub that is installed from the grub-pc package in the OS.

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Smith@21:1/5 to Andy Smith on Mon Sep 9 22:30:01 2024
    On Mon, Sep 09, 2024 at 07:59:58PM +0000, Andy Smith wrote:
    I was kind of hoping that there would be something I could run which
    would say "yes, this MBR has grub v<whatever> and is set to find its
    grub.cfg on (hdX)", then I might be able to see some difference in
    what the MBR of sda wants to do. I'm particularly interested in
    seeing if the binary grub data in the MBR actually comes from the
    grub that is installed from the grub-pc package in the OS.

    $ xxd /usr/lib/grub/i386-pc/boot.img > /tmp/img.hex
    $ sudo dd if=/dev/sda bs=1 count=512 2>/dev/null | xxd > /tmp/sda.hex
    $ sudo dd if=/dev/sdb bs=1 count=512 2>/dev/null | xxd > /tmp/sdb.hex
    $ diff /tmp/sda.hex /tmp/img.hex | wc -l
    66
    $ diff /tmp/sdb.hex /tmp/img.hex | wc -l
    28

    Interesting.
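
    Counting diff lines between xxd dumps can overstate small changes,
    because a few inserted bytes shift every later line. A byte-wise
    comparison limited to the 446 bytes of boot code may be clearer. The
    following is a minimal Python sketch, not a tested tool: the helper
    name differing_offsets is made up for illustration and the two sectors
    are synthetic. Some differences from boot.img are expected in any
    case, since the installer patches a few bytes when writing it.

```python
BOOT_CODE_SIZE = 446  # bytes of boot code before the partition table


def differing_offsets(a: bytes, b: bytes, limit: int = BOOT_CODE_SIZE) -> list:
    """Return the offsets below `limit` at which the two sectors differ."""
    n = min(limit, len(a), len(b))
    return [i for i in range(n) if a[i] != b[i]]


# Two synthetic 512-byte sectors: one difference inside the boot code
# (offset 0x64) and one in the partition table (offset 0x1BE), which lies
# outside the compared range and is therefore ignored.
sector_a = bytearray(512)
sector_b = bytearray(512)
sector_b[0x64] = 0xFF
sector_b[0x1BE] = 0x80

print([hex(i) for i in differing_offsets(bytes(sector_a), bytes(sector_b))])
```

    With real data, the two byte strings would come from the files read
    with dd above rather than from synthetic buffers.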

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

  • From Florent Rougon@21:1/5 to All on Tue Sep 10 16:00:01 2024
    Hi,

    Not an expert on this matter, so take this with a grain of salt.

    On 09/09/2024, Andy Smith wrote:

    Can I simply copy the first 512 bytes of sdb to the start of sda?

    I would not do this, one of the reasons being that AFAICT, the start
    offsets of the (up to 4) primary partitions of each drive are among
    these bytes.

    I do not particularly want to run grub-install, as the MBR of sdb is
    known good at the moment. Perhaps though I could run:

    $ sudo grub-install /dev/sda

    I believe so, although by habit I'd rather run 'dpkg-reconfigure
    grub-pc', where you can IIRC select the drives to act on. It will
    remember your selection, so if you don't select sdb in the debconf
    dialog for fear of breaking it, then the next time GRUB is updated on
    that system your sdb GRUB installation would become out of date. Of
    course, I am assuming the computer boots with BIOS, not UEFI.

    (Keep a rescue disk, installation medium or Debian Live around, in case
    there is a problem booting afterwards.)

    and then compare again the first 512 bytes of each drive?

    Out of curiosity, I skimmed through [1] and computed the offsets of your
    "GRUB " strings as they would be found in memory when the code is run at
    boot (adding 7C00h, AFAIUI). I found 7D88h for your sda and 7D80h for
    your sdb, neither of which matches the values at [1] under the heading
    “Location of the GRUB ID String and Error Messages in Memory”. Thus, my
    understanding is that both of your MBRs were probably written by GRUB 2
    (I wanted to check if maybe one had been written by GRUB 1 and the other
    by GRUB 2).
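
    The offset arithmetic can be sketched in Python. This assumes, as
    stated above, that the BIOS loads the sector at 7C00h. The function
    name is made up for illustration, and the sample sector is synthetic,
    with the string placed at the file offset (0x188) where the sda dump
    shows it.

```python
MBR_LOAD_ADDRESS = 0x7C00  # where the BIOS places the boot sector in memory


def grub_string_address(sector: bytes) -> int:
    """Return the in-memory address of the "GRUB " string in a boot sector."""
    offset = sector.find(b"GRUB ")
    if offset < 0:
        raise ValueError('no "GRUB " string found in this sector')
    return MBR_LOAD_ADDRESS + offset


# Synthetic sector with the string at file offset 0x188, as in the sda dump.
sector = bytearray(512)
sector[0x188:0x188 + 5] = b"GRUB "

print(hex(grub_string_address(bytes(sector))))  # 0x7d88
```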

    Regards

    [1] https://thestarman.pcministry.com/asm/mbr/GRUB.htm

    --
    Florent

  • From Andy Smith@21:1/5 to Florent Rougon on Tue Sep 10 18:50:01 2024
    Hi,

    On Tue, Sep 10, 2024 at 03:58:58PM +0200, Florent Rougon wrote:
    On 09/09/2024, Andy Smith wrote:
    Can I simply copy the first 512 bytes of sdb to the start of sda?

    I would not do this, one of the reasons being that AFAICT, the start
    offsets of the (up to 4) primary partitions of each drive are among
    these bytes.

    Good point. I understand the bootloader is actually the first 446
    bytes so maybe I should only be looking at these.

    https://unix.stackexchange.com/a/254668/36243
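
    For reference, a classic MBR sector is 446 bytes of boot code, then
    four 16-byte partition entries, then the 2-byte signature 55 aa. Below
    is a small Python sketch of decoding one entry, using the 16 bytes at
    offset 0x1BE of the sda dump from the first message. The function and
    field names are made up for illustration.

```python
import struct


def parse_partition_entry(entry: bytes) -> dict:
    """Decode one 16-byte MBR partition entry (the CHS fields are skipped)."""
    # Layout: status (1), CHS start (3), type (1), CHS end (3),
    # start LBA (4, little-endian), sector count (4, little-endian).
    status, ptype, lba_start, sectors = struct.unpack("<B3xB3xII", entry)
    return {
        "bootable": status == 0x80,
        "type": ptype,
        "lba_start": lba_start,
        "sectors": sectors,
    }


# Bytes 0x1BE..0x1CD of the sda dump.
first_entry = bytes.fromhex("80202100fd35373e0008000000380f00")

print(parse_partition_entry(first_entry))
```

    The entry decodes to a bootable partition of type 0xfd (Linux RAID
    autodetect) starting at sector 2048, which agrees with the layout
    described in the first message.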

    I'd rather 'dpkg-reconfigure grub-pc' where you can IIRC select
    the drives to act on (it will remember your selection, so in case
    you don't select sdb in the debconf dialog for fear of breaking
    it, next time GRUB is updated on that system, your sdb GRUB
    installation would become out-of-date). Of course, I am assuming
    the computer boots with BIOS, not UEFI.

    Yes, this machine boots with BIOS and MBR.

    To keep such machines (BIOS boot, multiple boot drives, MD RAID for
    redundancy once booted) in good booting health, are people doing
    anything more sophisticated than remembering to run "dpkg-reconfigure
    grub-pc" and install grub to all boot drives any time grub-pc is
    updated?

    Out of curiosity, I skimmed through [1] and computed the offsets of your
    "GRUB " strings as they would be found in memory when the code is run at
    boot (adding 7C00h, AFAIUI). I found 7D88h for your sda and 7D80h for
    your sdb, neither of which matches the values at [1] under the heading
    “Location of the GRUB ID String and Error Messages in Memory”. Thus, my
    understanding is that both of your MBRs were probably written by GRUB 2
    (I wanted to check if maybe one had been written by GRUB 1 and the other
    by GRUB 2).

    This machine dates from 2016 and whatever was Debian stable at that
    time. It will have been dist-upgraded as far as 10 (buster) after
    that. As far as I'm aware the drives are the same ones it was first
    installed with.

    Thanks for the info!
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

  • From Florent Rougon@21:1/5 to All on Wed Sep 11 00:50:02 2024
    On 10/09/2024, Andy Smith wrote:

    Good point. I understand the bootloader is actually the first 446
    bytes so maybe I should only be looking at these.

    https://unix.stackexchange.com/a/254668/36243

    The partition table indeed starts at offset 446 (decimal), however I'd
    still rather run grub-install or “dpkg-reconfigure grub-pc” than copy
    the first 446 bytes from one drive to another drive. The reason is
    that, AFAIUI, what GRUB writes in this area when installed is likely to
    contain disc-specific info. More specifically, according to [1]:

    There isn't room for much function in the 446 bytes available for
    executable code in the boot sector. The sole function of this stage1
    code is to load the much larger stage2 boot program. When stage1 is
    installed in the MBR, it is configured with the BIOS drive number and
    the absolute LBA of the first sector of the stage2 file in the boot
    partition. It loads that one sector into a fixed location in memory
    and transfers control to it. (...)

    Yes, this machine boots with BIOS and MBR.

    To keep such machines (BIOS boot, multiple boot drives, MD RAID for
    redundancy once booted) in good booting health, are people doing
    anything more sophisticated than remembering to run "dpkg-reconfigure
    grub-pc" and install grub to all boot drives any time grub-pc is
    updated?

    That's what I've been doing for a bit more than 20 years (before
    switching to UEFI), but that was only my home machine.

    This machine dates from 2016 and whatever was Debian stable at that
    time. It will have been dist-upgraded as far as 10 (buster) after
    that. As far as I'm aware the drives are the same ones it was first
    installed with.

    These dates seem consistent with my guess that this is probably GRUB 2
    that was installed to your MBRs.

    Thanks for the info!

    You're welcome. Hope someone with more experience chimes in. Good luck
    in any case. :)

    Regards

    [1] https://www.linuxquestions.org/questions/linux-general-1/help-understand-446-bytes-of-boot-code-in-mbr-4175500398/#post5146305

    --
    Florent

  • From Andy Smith@21:1/5 to Florent Rougon on Wed Sep 11 02:50:01 2024
    Hi,

    On Wed, Sep 11, 2024 at 12:45:46AM +0200, Florent Rougon wrote:
    The partition table indeed starts at offset 446 (decimal), however I'd
    still rather run grub-install or “dpkg-reconfigure grub-pc” than copy
    the first 446 bytes from one drive to another drive. The reason is
    that, AFAIUI, what GRUB writes in this area when installed is likely to
    contain disc-specific info. More specifically, according to [1]:

    There isn't room for much function in the 446 bytes available for
    executable code in the boot sector. The sole function of this stage1
    code is to load the much larger stage2 boot program. When stage1 is
    installed in the MBR, it is configured with the BIOS drive number and
    the absolute LBA of the first sector of the stage2 file in the boot
    partition. It loads that one sector into a fixed location in memory
    and transfers control to it. (...)

    Since booting from sdb wasn't working in any case, I thought I'd
    experiment a bit. I copied the first 446 bytes of sda to sdb. This
    made matters worse! Instead of a "grub> " prompt, I just got a blank
    screen.

    I then rebooted from sda and did:

    $ sudo dpkg-reconfigure grub-pc

    selecting both sda and sdb.

    I was then able to boot from sdb.

    This does leave me wondering however, if the boot code in the MBR of
    sdb is now set to believe that this is "the second drive", I suppose
    (hd1) in grub terms? With the implication that should sda fail or be
    removed, this machine may still not boot because its boot code looks
    for something on a drive that no longer exists (sdb now being (hd0))?

    The grub.cfg itself (and later, the fstab) finds its drives by UUID
    so I'm not worried about that part.

    I just have dim memories about having to do grub-install to sdb but
    trick it somehow that this was (hd0)…

    I do also wonder why my simple dd of the first 446 bytes did not
    work, as the /boot partition is at the same position on both drives
    and is an MDADM RAID1 so should have its stage2 at the same LBA.
    After doing the "dpkg-reconfigure grub-pc" the first 446 bytes of
    both sda and sdb are (still) identical so something else somewhere
    else must have been changed.

    $ sudo dd if=/dev/sda bs=446 count=1 2>/dev/null | sha256sum
    b7ccacdeb89b1fd8c272549c69ff07570b033747c0a84a73febc7851c7cf4f2e  -
    $ sudo dd if=/dev/sdb bs=446 count=1 2>/dev/null | sha256sum
    b7ccacdeb89b1fd8c272549c69ff07570b033747c0a84a73febc7851c7cf4f2e  -

    Not understanding quite what is going on is worrying to me, even if
    things do now work. 🙁

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

  • From Roy J. Tellason, Sr.@21:1/5 to All on Thu Sep 12 01:30:01 2024
    On Tuesday 10 September 2024 08:39:59 pm Andy Smith wrote:
    This does leave me wondering however, if the boot code in the MBR of
    sdb is now set to believe that this is "the second drive", I suppose
    (hd1) in grub terms? With the implication that should sda fail or be
    removed, this machine may still not boot because its boot code looks
    for something on a drive that no longer exists (sdb now being (hd0))?


    Simple enough to test by unplugging a cable...

    --
    Member of the toughest, meanest, deadliest, most unrelenting -- and
    ablest -- form of life in this section of space,  a critter that can
    be killed but can't be tamed.  --Robert A. Heinlein, "The Puppet Masters"
    -
    Information is more dangerous than cannon to a society ruled by lies. --James M Dakin

  • From Andy Smith@21:1/5 to Florent Rougon on Thu Sep 12 01:50:01 2024
    On Thu, Sep 12, 2024 at 01:21:12AM +0200, Florent Rougon wrote:
    Hi,

    On 11/09/2024, Andy Smith wrote:

    Since booting from sdb wasn't working in any case, I thought I'd
    experiment a bit. I copied the first 446 bytes of sda to sdb. This
    made matters worse! Instead of a "grub> " prompt, I just got a blank
    screen.

    I then rebooted from sda and did:

    I believe “sda” and “sdb” are swapped with respect to your first
    message. Of course, it's expected that these are not stable across
    reboots, however it's a bit confusing for me here.

    Yes, sorry. I actually have two machines like this I am looking at,
    where one of them seems to have an older (non-working) grub on sda
    and current grub on sdb, while the other has current grub on sda and
    no grub at all on sdb. I tried to simplify by only talking about one
    of these here, but ended up confusing myself several times over
    which one I was getting info from.

    Anyway. Copying the 446 bytes so as to make sda and sdb identical
    did not work, as described. Then doing "dpkg-reconfigure grub-pc"
    did result in working boot from either drive.

    The special value 0xFF is the one you had on both of your drives and
    means “use the boot drive”

    Okay, that is good to know, thanks!

    In any case, stage 1 can load some “stage 1.5” from “empty sectors (if
    available) between the MBR and the first partition”. These sectors
    wouldn't be synchronized by MD RAID, unless you're using it on the
    whole drives, as opposed to partition by partition. I don't claim that
    “this is it”, but this might explain some difference between your
    drives' booting behavior, even with identical:
    - stage1 code+data in the MBR;
    - boot partitions' start offset and contents.

    Sounds very plausible. The MD arrays are just made of partitions,
    and the first partitions for /boot start at 2048 (512 byte) sectors
    in.

    So, there's more grub data at different places in that first 1MiB of
    each boot disk. As some of it could be copied and some not, it
    sounds like I should not try to fix this again with dd and instead
    stick to reconfiguring grub-pc.
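
    One way to see where two drives actually differ would be to hash the
    boot code and the gap before the first partition separately. A rough
    Python sketch follows, assuming 512-byte sectors and a first partition
    at sector 2048 as on this machine. The function name and its default
    arguments are made up for illustration, and reading a real /dev/sdX
    this way needs root.

```python
import hashlib


def boot_area_digests(path, first_partition_lba=2048, sector_size=512):
    """Hash the MBR boot code and the post-MBR gap of a disk or image file."""
    with open(path, "rb") as disk:
        mbr = disk.read(sector_size)
        # Sectors 1 .. first_partition_lba - 1: the unpartitioned gap where
        # GRUB may place further code, outside any MD RAID member.
        gap = disk.read((first_partition_lba - 1) * sector_size)
    return {
        "boot_code": hashlib.sha256(mbr[:446]).hexdigest(),
        "gap": hashlib.sha256(gap).hexdigest(),
    }
```

    Calling it on /dev/sda and on /dev/sdb and comparing the results would
    show whether identical boot code is followed by different gap
    contents, which would be consistent with the explanation above.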

    Not understanding quite what is going on is worrying to me, even if
    things do now work. 🙁

    I just hope I didn't confuse you more. :-)

    It was very helpful, thanks again!

    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

  • From Florent Rougon@21:1/5 to All on Thu Sep 12 01:30:01 2024
    Hi,

    On 11/09/2024, Andy Smith wrote:

    Since booting from sdb wasn't working in any case, I thought I'd
    experiment a bit. I copied the first 446 bytes of sda to sdb. This
    made matters worse! Instead of a "grub> " prompt, I just got a blank
    screen.

    I then rebooted from sda and did:

    I believe “sda” and “sdb” are swapped with respect to your first
    message. Of course, it's expected that these are not stable across
    reboots, however it's a bit confusing for me here.

    (...)

    This does leave me wondering however, if the boot code in the MBR of
    sdb is now set to believe that this is "the second drive", I suppose
    (hd1) in grub terms? With the implication that should sda fail or be
    removed, this machine may still not boot because its boot code looks
    for something on a drive that no longer exists (sdb now being (hd0))?

    I believe this is not necessarily the case. I've tried to read some of
    the GRUB 2 stage 1 code from the grub2 2.12-5 package. I'm far from
    being able to claim I understand everything, but... let's see.

    My impression is that the “drive number” that is written to the MBR can
    be of two kinds:

    (a) an actual number, typically 0x80, 0x81, etc. for hard disks
    (it is the BIOS drive number for INT 13h, cf. [1]);

    (b) or the special value 0xFF (thus, the 128th hard disk is not
    available for case (a)—too bad if you have that many disks!).

    The special value 0xFF is the one you had on both of your drives and
    means “use the boot drive” (the one the BIOS booted from, whose number
    is in register DL when the BIOS transfers control to the MBR code loaded
    at physical address 0x7C00):

    (From grub2-2.12/grub-core/boot/i386/pc/boot.S)

            .org GRUB_BOOT_MACHINE_BOOT_DRIVE
    boot_drive:
            .byte 0xff      /* the disk to load kernel from */
                            /* 0xff means use the boot drive */

    (...)

            .org GRUB_BOOT_MACHINE_DRIVE_CHECK
    (...) ← fixup of DL in case it was incorrectly set by the BIOS

            /*
             * Check if we have a forced disk reference here
             */
            movb    boot_drive, %al
            cmpb    $0xff, %al
            je      1f
            movb    %al, %dl
    1:
            /* save drive reference first thing! */
            pushw   %dx

    [ One may find it “interesting” that the “jmp 3f” from line 216 of
    boot.S may be overwritten by internal “grub-setup” code (cf.
    “grub-bios-setup” in grub-install.c) from grub2-2.12/util/setup.c:

      boot_drive_check = (grub_uint8_t *) (boot_img
                                           + GRUB_BOOT_MACHINE_DRIVE_CHECK);

      (...)

      /* If DEST_DRIVE is a hard disk, enable the workaround, which is
         for buggy BIOSes which don't pass boot drive correctly. Instead,
         they pass 0x00 or 0x01 even when booted from 0x80. */
      if (!allow_floppy && !grub_util_biosdisk_is_floppy (dest_dev->disk))
        {
          /* Replace the jmp (2 bytes) with double nop's. */
          boot_drive_check[0] = 0x90;
          boot_drive_check[1] = 0x90;
        }
    ]

    In your case, I claim your MBR-stored drive config (from your first
    message for both drives) was “use the boot drive” for both sda and sdb,
    because from grub2-2.12/include/grub/i386/pc/boot.h:

    /* The offset of BOOT_DRIVE. */
    #define GRUB_BOOT_MACHINE_BOOT_DRIVE 0x64

    and both of your MBRs had 0xff at this offset:

    00000060: 0000 0000 fffa 9090 f6c2 8074 05f6 c270 ...........t...p

    You can also see right here the two NOPs (9090) at offset
    GRUB_BOOT_MACHINE_DRIVE_CHECK (i.e. 0x66) which override the
    aforementioned “jmp 3f” from boot.S line 216, because this stage1 code
    was written to hard disks.

    Conclusion: in your case, the option was “load the next stage from the
    drive the BIOS booted from” for both MBRs. Therefore, AFAIUI, assuming
    everything else was good (incl. the offset for finding the next stage),
    it should still have been able to boot with only one of the drives
    present in the machine.
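
    The same check can be sketched in Python for anyone wanting to repeat
    it on another sector. The two offsets are the ones quoted above (0x64
    and 0x66). The function name and the returned keys are made up for
    illustration, and the sample sector is synthetic except for bytes
    0x60..0x6F, which are copied from the dump line shown above.

```python
GRUB_BOOT_MACHINE_BOOT_DRIVE = 0x64   # offset of the boot_drive byte
GRUB_BOOT_MACHINE_DRIVE_CHECK = 0x66  # offset of the jmp, or of the two NOPs


def describe_boot_sector(sector: bytes) -> dict:
    """Report the stored boot drive and whether the DL workaround is enabled."""
    drive = sector[GRUB_BOOT_MACHINE_BOOT_DRIVE]
    check = sector[GRUB_BOOT_MACHINE_DRIVE_CHECK:GRUB_BOOT_MACHINE_DRIVE_CHECK + 2]
    if drive == 0xFF:
        drive_text = "use the drive the BIOS booted from"
    else:
        drive_text = "forced BIOS drive 0x%02x" % drive
    # Two NOPs (90 90) mean the jmp was replaced, i.e. the workaround is on.
    return {"boot_drive": drive_text, "dl_workaround": check == b"\x90\x90"}


# Bytes 0x60..0x6F as shown in the dump line above; the rest is zero filler.
sector = bytearray(512)
sector[0x60:0x70] = bytes.fromhex("00000000fffa9090f6c2807405f6c270")

print(describe_boot_sector(bytes(sector)))
```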

    The grub.cfg itself (and later, the fstab) finds its drives by UUID
    so I'm not worried about that part.

    I just have dim memories about having to do grub-install to sdb but
    trick it somehow that this was (hd0)…

    Yep... AFAIUI, hd0 is for times when GRUB talks to the BIOS (at boot)
    and corresponds to 0x80 (on x86 machines), but when running grub-install
    or the internal grub-bios-setup, GRUB attempts to guess how the BIOS is
    going to number the devices you gave it in Linux-speak (/dev/sda,
    /dev/sdb, etc.), which may be unreliable. At least the GRUB 1.x
    documentation clearly said so according to my recollection, and
    therefore indicated (in the 2000s) as the bullet-proof recipe, to
    perform GRUB installation to hard disk *from GRUB itself* using the
    (hd0), (hd1), etc. notations, e.g. after booting from a GRUB floppy
    disk.

    I do also wonder why my simple dd of the first 446 bytes did not
    work, as the /boot partition is at the same position on both drives
    and is an MDADM RAID1 so should have its stage2 at the same LBA.
    After doing the "dpkg-reconfigure grub-pc" the first 446 bytes of
    both sda and sdb are (still) identical so something else somewhere
    else must have been changed.

    GRUB is a complex beast; available documentation may be a bit confusing
    when it comes to stage 1.5 and stage 2, e.g.[3]:

    Version 0 (GRUB Legacy)
    ~~~~~~~~~~~~~~~~~~~~~~~

    Stage 1 can load stage 2 directly, but it is normally set up to load the
    stage 1.5., located in the first 30 KiB of hard disk immediately
    following the MBR and before the first partition. (...) The stage 1.5
    image contains file system drivers, enabling it to directly load stage 2
    from any known location in the filesystem, for example from /boot/grub.

    Version 2 (GRUB 2)
    ~~~~~~~~~~~~~~~~~~

    [Different description]

    In any case, stage 1 can load some “stage 1.5” from “empty sectors (if
    available) between the MBR and the first partition”. These sectors
    wouldn't be synchronized by MD RAID, unless you're using it on the
    whole drives, as opposed to partition by partition. I don't claim that
    “this is it”, but this might explain some difference between your
    drives' booting behavior, even with identical:
    - stage1 code+data in the MBR;
    - boot partitions' start offset and contents.

    Not understanding quite what is going on is worrying to me, even if
    things do now work. 🙁

    I just hope I didn't confuse you more. :-)

    Regards

    [1] https://en.wikipedia.org/wiki/INT_13H#List_of_INT_13h_services
    [2] https://wiki.osdev.org/MBR_(x86)#MBR_Bootstrap
    [3] https://en.wikipedia.org/wiki/GNU_GRUB

    --
    Florent
