Forum: >>> Magnum BBS <<<

stack smashing detected

From Stan Johnson@21:1/5 to All on Mon Jan 30 05:10:01 2023

Hello,

I am seeing anywhere from zero to four of the following errors while
booting Linux on 68030 systems and using sysvinit startup scripts:

*** stack smashing detected ***: terminated
Aborted

I usually (but not always) see three of the errors while init is running
the rcS.d scripts, and one while running the rc2.d scripts. The stack
smashing messages appear only on the system console (nothing is logged
in an error log or dmesg). Despite the errors, the system continues
booting to multiuser mode without any obvious additional problems. I
haven't tested systemd, which is too slow to be useful on my m68k
systems (though I have a Debian SID with systemd that I can restore for
testing if necessary).

I'm using the current Debian SID and Debian kernel, and I've confirmed
the errors on a Mac IIci and SE/30. I haven't seen the errors on any
68040 system (I only tested on a Centris 650 and PowerBook 550c). I also
notice the errors on 68030 systems using custom kernels that I have
cross-compiled using GCC 12 or GCC 10 on an x86_64 system running Debian
SID; however, I do not see the errors as often if I cross-compile using
GCC 8.3.0 on a 686 system (running Debian 10.7 Buster) -- I saw the
errors a few weeks ago with an earlier kernel, but none today using
Linux 6.1.8 cross-compiled with GCC 8.3.0.

I'll be happy to help debug or troubleshoot, though at this point, since
the "stack smashing detected" errors aren't reporting which processes
are being terminated/aborted, I'm not sure where to start.

thanks for any suggestions

-Stan Johnson [email protected]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Geert Uytterhoeven@21:1/5 to [email protected] on Mon Jan 30 11:20:01 2023

CC linux-m68k

From Michael Schmitz@21:1/5 to All on Tue Jan 31 04:10:01 2023

Hi Stan,

Am 30.01.2023 um 17:00 schrieb Stan Johnson:

Hello,

I am seeing anywhere from zero to four of the following errors while
booting Linux on 68030 systems and using sysvinit startup scripts:

*** stack smashing detected ***: terminated
Aborted

I usually (but not always) see three of the errors while init is running
the rcS.d scripts, and one while running the rc2.d scripts. The stack smashing messages appear only on the system console (nothing is logged
in an error log or dmesg). Despite the errors, the system continues
booting to multiuser mode without any obvious additional problems. I
haven't tested systemd, which is too slow to be useful on my m68k
systems (though I have a Debian SID with systemd that I can restore for testing if necessary).

I'm using the current Debian SID and Debian kernel, and I've confirmed
the errors on a Mac IIci and SE/30. I haven't seen the errors on any
68040 system (I only tested on a Centris 650 and PowerBook 550c). I also notice the errors on 68030 systems using custom kernels that I have cross-compiled using GCC 12 or GCC 10 on a x86_64 system running Debian
SID; however, I do not see the errors as often if I cross-compile using
GCC 8.3.0 on a 686 system (running Debian 10.7 Buster) -- I saw the
errors a few weeks ago with an earlier kernel, but none today using
Linux 6.1.8 cross-compiled with GCC 8.3.0.

I'll be happy to help debug or troubleshoot, though at this point, since
the "stack smashing detected" errors aren't reporting which processes
are being terminated/aborted, I'm not sure where to start.

The man page of init states that init logs process and reason for
termination in /var/run/utmp and /var/log/wtmp each time a child process
terminates. You're looking for processes terminated by SIGABRT as far as
I can see.

There does not appear to be any tool to extract that information from
utmp/wtmp files though - utmpdump only shows login process information
for me, nothing on init processes.

Another way may be logging the start of each of the rcS.d or rc2.d
scripts until you know what scripts to look at in more detail, then
adding 'set -v' at the start of those to log every command in the
offending script.
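
The logging idea above could be sketched roughly like this. It is only a sketch: the function name run_logged and the log path are made up for illustration, and it assumes the shell reports a child killed by a signal as 128 + signal number (dash and bash do), so SIGABRT (signal 6 on Linux) shows up as exit status 134.

```shell
#!/bin/sh
# Hypothetical helper: run one command, log its start and end, and flag an
# abort. LOG is an assumed path; point it at any writable location.
LOG="${LOG:-/run/rc-trace.log}"

run_logged() {
    echo "start: $*" >> "$LOG"
    "$@"
    status=$?
    # dash and bash report "killed by signal N" as 128 + N;
    # SIGABRT is signal 6 on Linux, hence 134.
    if [ "$status" -eq 134 ]; then
        echo "ABORTED (SIGABRT): $*" >> "$LOG"
    fi
    echo "end ($status): $*" >> "$LOG"
    return "$status"
}
```

For example, `run_logged /etc/init.d/mountkernfs.sh start` would leave a start/end pair in the log. Once a script is known to contain the failing command, adding 'set -v' at its top (as suggested above) narrows it down further.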

Once the offending binary is known (and the crash can be reproduced
after system boot), gdb can be used to find the function that overwrote
its local stack guard.

That's a lot of work on a 030 Mac - have you tried to reproduce this on
any kind of emulator?

I suppose one difference between your 030 and 040 Macs might be the
amount of RAM available. I wonder if this bug results from a combination
of 030 MMU and memory pressure, or 030 MMU only.

Cheers,

Michael

thanks for any suggestions

-Stan Johnson [email protected]

From Eero Tamminen@21:1/5 to Michael Schmitz on Tue Jan 31 20:30:01 2023

Hi,

On 31.1.2023 5.05, Michael Schmitz wrote:

That's a lot of work on a 030 Mac - have you tried to reproduce this on
any kind of emulator?

I suppose one difference between your 030 and 040 Macs might be the
amount of RAM available. I wonder if this bug results from a combination
of 030 MMU and memory pressure, or 030 MMU only.

As to emulation... 030 and 040 MMUs differ a lot.

E.g. Aranym emulates only the (simpler) 040 MMU, and does not emulate
CPU cache. Qemu does not have any cache emulation either.

WinUAE (and Hatari & Previous) emulate both 030 MMU and CPU cache.

Using Hatari with Linux is documented here: https://hatari.tuxfamily.org/doc/m68k-linux.txt

AFAIK 030 MMU + cache emulation works well enough in WinUAE (Amiga
emulator) to run Linux kernel + user-space, but with Hatari (Atari
emulator), you need to disable cache emulation to have working
user-space with m68k linux.

- Eero

From Michael Schmitz@21:1/5 to Stan Johnson on Wed Feb 1 20:00:02 2023

Hi Stan,

On 2/02/23 05:38, Stan Johnson wrote:

On 1/30/23 8:05 PM, Michael Schmitz wrote:

...
Am 30.01.2023 um 17:00 schrieb Stan Johnson:

Hello,

I am seeing anywhere from zero to four of the following errors while
booting Linux on 68030 systems and using sysvinit startup scripts:

*** stack smashing detected ***: terminated
Aborted

I usually (but not always) see three of the errors while init is running
the rcS.d scripts, and one while running the rc2.d scripts. The stack
smashing messages appear only on the system console (nothing is logged
in an error log or dmesg). Despite the errors, the system continues
booting to multiuser mode without any obvious additional problems. I
haven't tested systemd, which is too slow to be useful on my m68k
systems (though I have a Debian SID with systemd that I can restore for
testing if necessary).

...

Another way may be logging the start of each of the rcS.d or rc2.d
scripts until you know what scripts to look at in more detail, then
adding 'set -v' at the start of those to log every command in the
offending script.

Hi Michael,

Thanks for your reply.

After logging the start and end of each script, I see that the "stack smashing detected" error often happens while running "/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to isolate it to a particular command.

This may be a coincidence, but the error seems to happen more often (up to
about 4 times) after a warm boot into Mac OS 7.5.5 and a subsequent boot into
Linux than when starting with a cold boot into Mac OS 7.5.5, but it
doesn't seem that that should make any difference for Linux. This
morning, after a cold boot, I saw two of the errors, while after a warm
boot, I saw four.

Hmm - that might well indicate a hardware issue rather than software.
Bits flipping at random in RAM (and getting picked up because the stack
canary changes).

Once the offending binary is known (and the crash can be reproduced
after system boot), gdb can be used to find the function that overwrote
its local stack guard.

Is there a way to configure the kernel to use the stack guard for every function, and then log every resulting abort? I realize that that would
be very slow, but it might also be useful for debugging.

The stack canary mechanism pushes a token on the stack at function
entry, and compares against that token's value at function exit. This is
all code generated by gcc in the user binary.

The kernel is not involved in function calls other than syscalls. For
syscalls, we could try to find the user mode stack, and do a similar
canary trick, but I don't think that would be necessary for all
syscalls. Might be easier to instrument copy_to_user() instead if you're worried about a syscall receiving result data that way to a variable on
the stack.

But since we're touching on copy_to_user() here - the 'remove set_fs'
patch set by Christoph Hellwig refactored the m68k inline helpers around
July 2021. Can you test a kernel prior to those patches (5.15-rc2)?

That's a lot of work on a 030 Mac - have you tried to reproduce this on
any kind of emulator?

I haven't seen the error in QEMU.

I suppose one difference between your 030 and 040 Macs might be the
amount of RAM available. I wonder if this bug results from a combination
of 030 MMU and memory pressure, or 030 MMU only.

For some reason, the error seems to happen only with 68030 systems, regardless of processor speed or memory:

PB 170 : 68030, 25 MHz, 8 MiB, external SCSI2SD
Mac IIci : 68030, 25 MHz, 80 MiB, internal SCSI2SD
SE/30 : 68030, 16 MHz, 128 MiB, external SCSI2SD
PB 550c : 68040, 33 MHz, 36 MiB, external SCSI2SD
Centris 650 : 68040, 25 MHz, 136 MiB, internal SCSI2SD

Exception handling in copy_to_user() and the related bits in 030 page
fault handling might need another look then...

Cheers,

Michael

-Stan

From Geert Uytterhoeven@21:1/5 to [email protected] on Thu Feb 2 09:00:01 2023

Hi Michael et al,

On Wed, Feb 1, 2023 at 7:52 PM Michael Schmitz <[email protected]> wrote:

On 2/02/23 05:38, Stan Johnson wrote:

On 1/30/23 8:05 PM, Michael Schmitz wrote:

...
Am 30.01.2023 um 17:00 schrieb Stan Johnson:

I am seeing anywhere from zero to four of the following errors while
booting Linux on 68030 systems and using sysvinit startup scripts:

*** stack smashing detected ***: terminated
Aborted

I usually (but not always) see three of the errors while init is running
the rcS.d scripts, and one while running the rc2.d scripts. The stack
smashing messages appear only on the system console (nothing is logged
in an error log or dmesg). Despite the errors, the system continues
booting to multiuser mode without any obvious additional problems. I
haven't tested systemd, which is too slow to be useful on my m68k
systems (though I have a Debian SID with systemd that I can restore for
testing if necessary).

...

Another way may be logging the start of each of the rcS.d or rc2.d
scripts until you know what scripts to look at in more detail, then
adding 'set -v' at the start of those to log every command in the
offending script.

Hi Michael,

Thanks for your reply.

After logging the start and end of each script, I see that the "stack smashing detected" error often happens while running "/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to isolate it to a particular command.

This may be a coincidence, but the error seems to happen more often (up to
about 4 times) after a warm boot into Mac OS 7.5.5 and a subsequent boot into
Linux than when starting with a cold boot into Mac OS 7.5.5, but it
doesn't seem that that should make any difference for Linux. This
morning, after a cold boot, I saw two of the errors, while after a warm boot, I saw four.

Hmm - that might well indicate a hardware issue rather than software.
Bits flipping at random in RAM (and getting picked up because the stack canary changes).

You can enable extra debugging options in the kernel, which might help
detect memory corruption, like CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB.
It will slow down your kernel, and make it grow too large, though.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

From Michael Schmitz@21:1/5 to All on Fri Feb 3 01:20:01 2023

Hi Stan,

Am 03.02.2023 um 12:16 schrieb Stan Johnson:

On 2/1/23 11:51 AM, Michael Schmitz wrote:

...

The stack canary mechanism pushes a token on the stack at function
entry, and compares against that token's value at function exit. This is
all code generated by gcc in the user binary.

The kernel is not involved in function calls other than syscalls. For
syscalls, we could try to find the user mode stack, and do a similar
canary trick, but I don't think that would be necessary for all
syscalls. Might be easier to instrument copy_to_user() instead if you're
worried about a syscall receiving result data that way to a variable on
the stack.

But since we're touching on copy_to_user() here - the 'remove set_fs'
patch set by Christoph Hellwig refactored the m68k inline helpers around
July 2021. Can you test a kernel prior to those patches (5.15-rc2)?
...

I updated my m68k rootfs in QEMU to the latest Debian SID and installed
the updated rootfs on my Mac IIci. And it's now restoring to my SE/30,
so I'll be able to test on that system again eventually.

I tested 5.15.0-rc3, 5.15.0-rc2 and 5.15.0-rc1, with inconclusive
results on the IIci (3 tests per kernel; e.g. "0, 1, 5" in
"Stack-Smashing" indicates that test 1 had zero smashing errors, test 2
had 1, and test 3 had 5):

Kernel        Stack-Smashing
5.15.0-rc3    0, 1, 5
5.15.0-rc2    2, 3, 2
5.15.0-rc1    7, 1, 3

I think we can rule out those changes to usermode copy routines now.

Good idea to test on some other hardware. I'd also consider Geert's
suggestion to enable kernel level debugging for the kernel's memory
allocators. That does look a lot easier than debugging usermode copy.

Cheers,

Michael

I saved the serial console logs, but they're probably not too useful at
this point. Next, I'll confirm a similar failure on the SE/30, then I'll check 4.0 and 5.0 (on the IIci) to see whether the issue started in that range, then use git bisect in an attempt to isolate the issue further.
I'll also do a spot check on a different IIci to lessen the chance that
the issue is being caused by a hardware problem. If it's not caused by a kernel bug, the next step will be to isolate the specific offending executable(s) in the sysvinit scripts.
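
Since the number of aborts varies from boot to boot (including zero), a single clean boot does not show that a kernel is good. One possible way to turn saved serial console logs into a bisect verdict is sketched below. The helper names count_smashes and classify_boots are hypothetical, and the rule "any abort in any boot means bad" is an assumption, not something settled in this thread.

```shell
#!/bin/sh
# Count "stack smashing detected" lines in one saved console log.
count_smashes() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so the status is discarded and only the printed count is used.
    grep -c 'stack smashing detected' "$1" || true
}

# Hypothetical helper: classify a kernel from several boot logs.
# Any abort marks the kernel "bad"; "good" requires every boot to be clean.
classify_boots() {
    for log in "$@"; do
        if [ "$(count_smashes "$log")" -gt 0 ]; then
            echo bad
            return 0
        fi
    done
    echo good
}
```

With three logs per kernel, `git bisect "$(classify_boots boot1.log boot2.log boot3.log)"` would then mark the current commit. More boots per kernel reduce the risk of marking a bad kernel as good.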

Or I could test using systemd instead of sysvinit; that would take
longer but it might be worthwhile if it doesn't look like a kernel bug.

-Stan

From Finn Thain@21:1/5 to Stan Johnson on Sat Feb 4 01:00:01 2023

On Wed, 1 Feb 2023, Stan Johnson wrote:

After logging the start and end of each script, I see that the "stack smashing detected" error often happens while running "/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to isolate it to a particular command.

That brings to mind some other unresolved initscript failures, also
involving 68030, which were accompanied by "page allocation failure".

But it's hard to blame "stack smashing detected" on a page allocation
failure since the latter always produces a very noisy splat in the kernel messages.

Have you reproduced the error with Debian's kernel package?

If not, please refer to private correspondence from me dated 10 December
2022 regarding setting up /etc/initramfs-tools so as to produce a suitably small initramfs.

To save time, I recommend using QEMU and an up-to-date Debian/m68k SID
virtual machine to produce the vmlinux and initrd files needed for use
with Penguin on your slower machines.

From Michael Schmitz@21:1/5 to All on Sun Feb 5 23:30:01 2023

Hi Stan,

Am 02.02.2023 um 07:51 schrieb Michael Schmitz:

But since we're touching on copy_to_user() here - the 'remove set_fs'
patch set by Christoph Hellwig refactored the m68k inline helpers around
July 2021. Can you test a kernel prior to those patches (5.15-rc2)?

That's a lot of work on a 030 Mac - have you tried to reproduce this on
any kind of emulator?

I haven't seen the error in QEMU.

I suppose one difference between your 030 and 040 Macs might be the
amount of RAM available. I wonder if this bug results from a combination
of 030 MMU and memory pressure, or 030 MMU only.

For some reason, the error seems to happen only with 68030 systems,
regardless of processor speed or memory:

PB 170 : 68030, 25 MHz, 8 MiB, external SCSI2SD
Mac IIci : 68030, 25 MHz, 80 MiB, internal SCSI2SD
SE/30 : 68030, 16 MHz, 128 MiB, external SCSI2SD
PB 550c : 68040, 33 MHz, 36 MiB, external SCSI2SD
Centris 650 : 68040, 25 MHz, 136 MiB, internal SCSI2SD

Exception handling in copy_to_user() and the related bits in 030 page
fault handling might need another look then...

Seeing Finn's report that Al Viro's VM_FAULT_RETRY fix may have solved
his task corruption troubles on 040, I just noticed that I probably misunderstood how Al's patch works.

Botching up a fault retry and carrying on may well leave the page tables
in a state where some later access could go to the wrong page and
manifest as user space corruption. Could you try Al's patch 4 (m68k: fix livelock in uaccess) to see if this helps?

Cheers,

Michael

From Michael Schmitz@21:1/5 to Stan Johnson on Wed Feb 8 00:00:02 2023

Thanks Stan,

On 8/02/23 08:37, Stan Johnson wrote:

Hi Michael,

On 2/5/23 3:19 PM, Michael Schmitz wrote:

...

Seeing Finn's report that Al Viro's VM_FAULT_RETRY fix may have solved
his task corruption troubles on 040, I just noticed that I probably
misunderstood how Al's patch works.

Botching up a fault retry and carrying on may well leave the page tables
in a state where some later access could go to the wrong page and
manifest as user space corruption. Could you try Al's patch 4 (m68k: fix
livelock in uaccess) to see if this helps?
...

ok, this appears to be the patch:

Signed-off-by: Al Viro <[email protected]>
---
arch/m68k/mm/fault.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 4d2837eb3e2a..228128e45c67 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,8 +138,11 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
fault = handle_mm_fault(vma, address, flags, regs);
pr_debug("handle_mm_fault returns %x\n", fault);

- if (fault_signal_pending(fault, regs))
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ goto no_context;
return 0;
+ }

/* The fault is fully completed (including releasing mmap lock) */
if (fault & VM_FAULT_COMPLETED)

That's correct.

Your results show improvement but the problem does not entirely go away.

Looking at differences between 030 and 040/060 fault handling, it
appears only 030 handles faults corrected by exception tables (such as
used in uaccess macros) specially, i.e. aborting bus error processing
while 040 and 060 carry on in the fault handler.

I wonder if that's the main difference between 030 and 040 behaviour?

I'll try and log such accesses caught by exception tables on 030 to see
if they are rare enough to allow adding a kernel log message...

Cheers,

Michael

From Michael Schmitz@21:1/5 to All on Thu Feb 9 04:50:01 2023

Hi Stan,

Am 08.02.2023 um 11:58 schrieb Michael Schmitz:

Thanks Stan,

On 8/02/23 08:37, Stan Johnson wrote:

Hi Michael,

On 2/5/23 3:19 PM, Michael Schmitz wrote:

...

Seeing Finn's report that Al Viro's VM_FAULT_RETRY fix may have solved
his task corruption troubles on 040, I just noticed that I probably
misunderstood how Al's patch works.

Botching up a fault retry and carrying on may well leave the page tables
in a state where some later access could go to the wrong page and
manifest as user space corruption. Could you try Al's patch 4 (m68k: fix
livelock in uaccess) to see if this helps?
...

ok, this appears to be the patch:

Signed-off-by: Al Viro <[email protected]>
---
arch/m68k/mm/fault.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 4d2837eb3e2a..228128e45c67 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,8 +138,11 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
fault = handle_mm_fault(vma, address, flags, regs);
pr_debug("handle_mm_fault returns %x\n", fault);

- if (fault_signal_pending(fault, regs))
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ goto no_context;
return 0;
+ }

/* The fault is fully completed (including releasing mmap lock) */
if (fault & VM_FAULT_COMPLETED)

That's correct.

Your results show improvement but the problem does not entirely go away.

Looking at differences between 030 and 040/060 fault handling, it
appears only 030 handles faults corrected by exception tables (such as
used in uaccess macros) specially, i.e. aborting bus error processing
while 040 and 060 carry on in the fault handler.

I wonder if that's the main difference between 030 and 040 behaviour?

Following the 040 code a bit further, I suspect that happens in the 040 writeback handler, so this may be a red herring.

I'll try and log such accesses caught by exception tables on 030 to see
if they are rare enough to allow adding a kernel log message...

Looks like this kind of event is rare enough to not trigger in a normal
boot on my 030. Please give the attached patch a try so we can confirm
(or rule out) that user space access faults from kernel mode are to
blame for your stack smashes.

Cheers,

Michael

From a55467a02b66addca6f74fc32b473bc077cb34b2 Mon Sep 17 00:00:00 2001

From: Michael Schmitz <[email protected]>
Date: Thu, 9 Feb 2023 14:39:35 +1300
Subject: [PATCH] m68k: debug exception handling data faults on 030

030 faults handled by exception tables are just silently ignored - see how
many of these do happen in practice, and if they are related to 'stack smashing' faults.

Signed-off-by: Michael Schmitz <[email protected]>
---
arch/m68k/kernel/traps.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c
index 5c8cba0efc63..b3cef760f7e8 100644
--- a/arch/m68k/kernel/traps.c
+++ b/arch/m68k/kernel/traps.c
@@ -554,8 +554,13 @@ static inline void bus_error030 (struct frame *fp)
}
/* Don't try to do anything further if an exception was
handled. */
- if (do_page_fault (&fp->ptregs, addr, errorcode) < 0)
+ if (do_page_fault (&fp->ptregs, addr, errorcode) < 0) {
+ pr_err("Exception handled for data %s fault at %#010lx in %s (pc=%#lx)\n",
+ ssw & RW ? "read" : "write",
+ fp->un.fmtb.daddr,
+ space_names[ssw & DFC], fp->ptregs.pc);
return;
+ }
} else if (!(mmusr & MMU_I)) {
/* probably a 020 cas fault */
if (!(ssw & RM) && send_fault_sig(&fp->ptregs) > 0)
--
2.17.1

From Michael Schmitz@21:1/5 to All on Fri Feb 10 09:00:02 2023

Hi Stan,

Am 10.02.2023 um 13:24 schrieb Stan Johnson:

Hi Michael,

On 2/8/23 8:41 PM, Michael Schmitz wrote:

...

Following the 040 code a bit further, I suspect that happens in the 040
writeback handler, so this may be a red herring.

I'll try and log such accesses caught by exception tables on 030 to see
if they are rare enough to allow adding a kernel log message...

Looks like this kind of event is rare enough to not trigger in a normal
boot on my 030.

Have you seen the error using my modified config file? Perhaps I'm not including something that's causing (or revealing) the problem.

I haven't seen these stack smashing errors ever, but that's with a
really ancient user space and a kernel config for Atari that only
includes the bare minimum.

Please give the attached patch a try so we can confirm
(or rule out) that user space access faults from kernel mode are to
blame for your stack smashes.
...

With "0001-m68k-debug-exception-handling-data-faults-on-030.patch",

Kernel                   Stack-Smashing
6.2.0-rc7 (no patch)     4, 3, 3 (from earlier test)
6.2.0-rc7 (new patch)    6, 2, 0

The earlier patch is not applied. Serial console log is attached.

Without Al's patch, I doubt even in case a uaccess fault happens with
signal pending we'd return -1 from send_fault_sig() (the no_context path
isn't taken and do_page_fault() returns without error). No kernel
messages expected in that case. But none seen otherwise either which
indicates exception handling in uaccess isn't a problem.

Not sure it's worth the hassle to retry with both patches applied...

Thanks,

Michael

thanks

-Stan

From Finn Thain@21:1/5 to Stan Johnson on Sat Feb 11 23:30:01 2023

On Sat, 11 Feb 2023, Stan Johnson wrote:

v5.1 x SCSI2SD crashes, goes offline with activity LED on,
rootfs corrupted, needed to be restored from backups, SCSI2SD SD card
needed to have the Apple driver updated to boot MacOS

v4.20 bad stack smashing on first boot, corrupted rootfs (bad superblock magic number) on second boot, fsck from a different rootfs
found many block counts wrong, rootfs had to be restored from backups

That looks like an unrelated problem, presumably caused by hardware.

It could be related to the problem you described to me that arises when
you subdivide an SD card into multiple SCSI targets.

Anyway, if you want to pursue the stack smashing error you'll probably
need to use a different storage device.

From Finn Thain@21:1/5 to Stan Johnson on Sun Feb 12 00:30:01 2023

On Sat, 11 Feb 2023, Stan Johnson wrote:

I think if there were hardware problems with the SCSI2SD board or the SD card, then I would be seeing lots of errors in MacOS and with v6.x
kernels.

That's not true in general. But I can agree that certain SCSI device
faults will show up in any operating system.

From Finn Thain@21:1/5 to Finn Thain on Sun Feb 12 01:40:01 2023

On Sun, 12 Feb 2023, Finn Thain wrote:

On Sat, 11 Feb 2023, Stan Johnson wrote:

v5.1 x SCSI2SD crashes, goes offline with activity LED on, rootfs corrupted, needed to be restored from backups, SCSI2SD SD card needed to have the Apple driver updated to boot MacOS

v4.20 bad stack smashing on first boot, corrupted rootfs (bad superblock magic number) on second boot, fsck from a different rootfs
found many block counts wrong, rootfs had to be restored from backups

That looks like an unrelated problem, presumably caused by hardware.

It could be related to the problem you described to me that arises when
you subdivide an SD card into multiple SCSI targets.

Anyway, if you want to pursue the stack smashing error you'll probably
need to use a different storage device.

I'll have to retract that -- there's no reason to blame the SCSI2SD. The mac_scsi driver did not really stabilize until v5.3 so that seems like the
most likely cause.

From Finn Thain@21:1/5 to Stan Johnson on Thu Feb 16 05:40:01 2023

On Wed, 15 Feb 2023, Stan Johnson wrote:

1) Default Debian kernel (vmlinux-6.1.0-4-m68k, initrd-6.1.0-4-m68k)
Default Debian sysvinit scripts
Boot time (ABC... to login prompt): 15 min 3 sec
NIC not detected, no stack smashing

Did you check whether the NIC module (mac8390) got loaded? I don't see any
sign of that in the console log. You can find out by running 'lsmod'. You should also check whether that module is present on the initrd. The initrd contents can be listed with 'zstd -dc < initrd-6.1.0-4-m68k | cpio -t'.
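
The initrd check above can be wrapped in a small filter, sketched here. The function name listing_has_module is hypothetical, and the pattern assumes the module file is named mac8390.ko, possibly with a compression suffix such as .xz or .zst.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a cpio listing on stdin names the module.
# Matches <name>.ko as well as compressed variants such as <name>.ko.xz.
listing_has_module() {
    grep -q "/$1\.ko"
}
```

Used with the command from above: `zstd -dc < initrd-6.1.0-4-m68k | cpio -t | listing_has_module mac8390 && echo present`.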

From John Paul Adrian Glaubitz@21:1/5 to Stan Johnson on Thu Feb 16 11:10:01 2023

Hi Stan!

On Wed, 2023-02-15 at 19:44 -0700, Stan Johnson wrote:

Going from 15 min to about 4 min seems worth the effort on old hardware.
As always, YMMV. Developers who use QEMU or other emulators likely don't always realize how long it takes to boot real hardware.

I am not sure what makes you think we aren't aware of the long boot times.
I just booted Debian unstable on my Amiga 4000/060 yesterday:

root@elgar:~> systemd-analyze
Startup finished in 1min 30.785s (kernel) + 5min 31.586s (userspace) = 7min 2.371s
graphical.target reached after 5min 26.053s in userspace
root@elgar:~>

I understand your frustration, but please keep in mind that modern software
is more complex and therefore runs slower on older hardware.

We will certainly work on providing a leaner kernel in the future to help alleviate this issue. And rebootstrapping the whole distribution with 32-bit alignment (-malign-int) should improve code execution as well.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer
`. `' Physicist
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

From Michael Schmitz@21:1/5 to Stan Johnson on Fri Feb 17 00:20:01 2023

Hi Stan,

On 16/02/23 15:44, Stan Johnson wrote:

I could re-install the previous sysvinit over the version in the current Debian SID and see if the stack smashing is still gone. I don't know how
to do that, but if someone has instructions, I'll try. I'm guessing I
need to download the previous .deb binaries and use dpkg to install the
older versions over the newer versions, while the newer init is still running; maybe rename /sbin/init to /sbin/init.tmp and boot with init=/sbin/init.tmp to get it out of the way (?).

No need to move the old init binary out of the way - as long as a file
is still in use, it won't actually be deleted (the directory entry just
points to the new file, so subsequent invocations use the new file contents).

Or I could download the source for sysvinit-core 3.06-2 and
sysvinit-utils 3.06-2 and compile using all of Debian's options plus -fstack-protector-all (and perhaps other options?) to see whether there
might be a bounds issue on an array somewhere. But I also don't know
where to find the source or the options that Debian uses for compilation.

'apt-get source sysvinit=3.06-2' will download and unpack that specific version. That should unpack the source in sysvinit-3.06-2/.

To add compile options, look at the patches in debian/patches/ there -
haven't found the 3.06-2 version, but older versions have a patch to add additional CFLAGS to src/Makefile. Add a new patch in debian/patches/,
add that patch file name to debian/patches/series and use
dpkg-buildpackage to build the package.

Cheers,

Michael

Otherwise, I'm done looking into the stack smashing for now. If anyone
is interested in developing "68030-lean" kernel config options or custom sysvinit scripts, I'll be happy to contribute.

thanks

-Stan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Paul Adrian Glaubitz@21:1/5 to Stan Johnson on Fri Feb 17 19:10:01 2023

On Fri, 2023-02-17 at 10:09 -0700, Stan Johnson wrote:

Would downloading the source from an x86 repository be sufficient, ...?

Yes, there is no architecture-specific source code repository.

Use:

deb-src http://ftp.debian.org/debian/ unstable main

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer
`. `' Physicist
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael Schmitz@21:1/5 to All on Fri Feb 17 21:40:02 2023

Hi Stan,

On 18.02.2023 at 06:09, Stan Johnson wrote:

Hi Michael,

On 2/16/23 4:10 PM, Michael Schmitz wrote:

...
'apt-get source sysvinit=3.06-2' will download and unpack that specific
version. That should unpack the source in sysvinit-3.06-2/.

That doesn't work for me:
# apt-get source sysvinit=3.06-2
Reading package lists... Done
E: You must put some 'deb-src' URIs in your sources.list

Adding this to /etc/apt/sources.list doesn't help (same error):
deb-src http://ftp.ports.debian.org/debian-ports/ sid main

Running "apt-get update", I see:
W: Skipping acquire of configured file 'main/source/Sources' as
repository 'http://ftp.ports.debian.org/debian-ports sid InRelease' does
not seem to provide it (sources.list entry misspelt?)

So I might not be pointing to the right source repository. Would
downloading the source from an x86 repository be sufficient, or do I
need patches that are specific to m68k?

To add compile options, look at the patches in debian/patches/ there -
haven't found the 3.06-2 version, but older versions have a patch to add
additional CFLAGS to src/Makefile. Add a new patch in debian/patches/,
add that patch file name to debian/patches/series and use
dpkg-buildpackage to build the package.

If I can download the source, I'll do that.

Just to confirm that an earlier version (3.01-2) doesn't cause the stack smashing errors with an otherwise up-to-date Debian SID, I can restore

Good work -

from an earlier backup and run "apt-get upgrade", though I would want to "keep back" sysvinit-core 3.01-2 and sysvinit-utils 3.01-2. Please let
me know how to keep back a package from being upgraded.

'apt-mark hold' should mark packages so apt won't consider them for
upgrading. You may want to check what other packages depend on newer
versions of sysvinit before attempting that (search for 'sysvinit-utils
>=' in the Packages files in /var/lib/apt/lists).
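
The dependency check suggested here could be sketched in Python roughly as follows. It assumes the usual Debian control-file layout (stanzas separated by blank lines, with "Package:" and "Depends:" fields, and versioned dependencies written as "name (>= version)"). The function name and the sample stanzas are made up for illustration, and the real files under /var/lib/apt/lists may need decompressing first.

```python
import re

# Matches a versioned dependency such as "sysvinit-utils (>= 3.05-4~)".
VERSIONED_DEP = re.compile(
    r"(?<![\w.+-])sysvinit-utils\s*\(\s*(<<|<=|=|>=|>>)\s*([^)\s]+)\s*\)"
)

def versioned_dependents(packages_text):
    """Return (package, relation, version) for every stanza that has a
    versioned dependency on sysvinit-utils."""
    found = []
    # Stanzas in a Packages file are separated by blank lines.
    for stanza in re.split(r"\n\s*\n", packages_text.strip()):
        name = None
        deps = ""
        for line in stanza.splitlines():
            if line.startswith("Package:"):
                name = line.split(":", 1)[1].strip()
            elif line.startswith(("Depends:", "Pre-Depends:")):
                deps += line.split(":", 1)[1]
        for relation, version in VERSIONED_DEP.findall(deps):
            found.append((name, relation, version))
    return found

# Made-up sample stanzas, for illustration only (not real archive data).
SAMPLE = """\
Package: example-initscripts
Version: 1.0
Depends: lsb-base, sysvinit-utils (>= 3.05-4~)

Package: example-other
Version: 2.0
Depends: libc6 (>= 2.36)
"""

print(versioned_dependents(SAMPLE))
```

Any package listed with a ">=" relation newer than the held version would be a candidate for blocking the upgrade or being held back itself.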

Do you know whether there is some location where all of the older binary packages are saved? If there is, I can download and check versions
between 3.01-2 and 3.06-2 and determine where the issue started.

snapshot.debian.org is what you want. 15 versions to bisect ...

Cheers,

Michael

thanks

-Stan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Finn Thain@21:1/5 to Stan Johnson on Fri Feb 17 23:10:01 2023

On Fri, 17 Feb 2023, Stan Johnson wrote:

I noticed that /sbin/init seems to ignore SIGABRT, so I thought that
might mean that init itself was somehow triggering the stack smashing
but nothing was really aborting, but I could be wrong about that.

Why do you say init ignores SIGABRT? I couldn't find that in the source
code. Did you try 'kill -ABRT 1'?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Finn Thain@21:1/5 to Stan Johnson on Sat Feb 18 00:00:02 2023

On Fri, 17 Feb 2023, Stan Johnson wrote:

That's not to say a SIGABRT is ignored, it just doesn't kill PID 1.

I doubt that /sbin/init is generating the "stack smashing detected" error
but you may need to modify it to find out. If you can't figure out which userland binary is involved, you'll have to focus on your custom kernel
binary, just as I proposed in my message dated 8 Feb 2023.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Finn Thain@21:1/5 to Andreas Schwab on Sat Feb 18 01:10:01 2023

On Sat, 18 Feb 2023, Andreas Schwab wrote:

On Feb 18 2023, Finn Thain wrote:

Why do you say init ignores SIGABRT?

PID 1 is special, it never receives signals it doesn't handle.

I see. I wonder if there is some way to configure the kernel so that PID 1 could be aborted for fstack-protector. I doubt it.

My gut says that a compiler change somehow made the fstack-protector implementation insensitive to kernel configuration.

So I still think that, if Stan adopted Debian's build environment, random processes would cease to be aborted (regardless of kernel .config).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andreas Schwab@21:1/5 to Finn Thain on Sat Feb 18 00:30:01 2023

On Feb 18 2023, Finn Thain wrote:

Why do you say init ignores SIGABRT?

PID 1 is special, it never receives signals it doesn't handle.

--
Andreas Schwab, [email protected]
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Finn Thain@21:1/5 to Stan Johnson on Sat Feb 18 00:40:01 2023

On Fri, 17 Feb 2023, Stan Johnson wrote:

The error could have been exposed in any package where "-fstack-protector-strong" was recently added.

And if you find the last good userland binary, what then? Fix the bad
userland binary? That's wonderful but it doesn't explain why the bad
userland binary went undetected with Debian's kernel build. And it doesn't explain why you can't reproduce the problem in QEMU.

Moreover, the above was always an unlikely scenario because an actual
buffer overrun in a userland binary that only shows up on '030 is
improbable in the first place, because code paths conditional on processor variant are normally found in the kernel.

Hence the advice I gave 10 days ago.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Finn Thain@21:1/5 to Stan Johnson on Sat Feb 18 02:10:01 2023

On Fri, 17 Feb 2023, Stan Johnson wrote:

What's interesting to me is that although there are stack smashing
errors and "aborted" messages during boot, nothing seems to have
actually failed or aborted (note that no processes are named as
"terminated" or "aborted" in the messages).

Initscripts do a lot of "if grep -q" and "this || that" etc. Since a
failure is normal, how would you tell that a process got aborted?
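
One way to tell an aborted child from an ordinary failure is to look at how it terminated rather than whether it failed. A minimal Python sketch, assuming a POSIX system; the helper name is hypothetical, and the second child simply sends itself SIGABRT as a stand-in for a stack-protector abort:

```python
import signal
import subprocess
import sys

def describe_exit(returncode):
    """Classify a subprocess return code; a negative value means the
    child was killed by that signal number."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode

# Child 1: an ordinary failure, like a 'grep -q' that finds nothing.
plain = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])

# Child 2: sends itself SIGABRT, standing in for the stack protector's abort.
aborted = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGABRT)"]
)

print(describe_exit(plain.returncode))
print(describe_exit(aborted.returncode))
```

A POSIX shell folds the same information into the exit status (128 plus the signal number, so 134 for SIGABRT), which a construct like "this || that" treats as just another failure unless the script checks the value.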

Once logged in, I can't duplicate the stack smashing (yet), and I don't
see random executables failing. The stack smashing seems to happen only
in startup scripts that have heavy access to /proc (the mountkernfs
script seems particularly vulnerable, as I recall), though none of the scripts seems to actually fail.

Yes. We've seen processes started from initscripts crash when accessing
/proc before. It was never debugged, though it wasn't a stack smashing
error at the time.

I'm not a programmer, but I'm willing to learn. I don't know what
Debian's build environment is, or how to adopt it or set it up, but I
can look into it with some suggestions.

As I understand it, Debian/m68k SID is built in Debian/m68k SID in QEMU.
Please see,
https://lists.debian.org/debian-68k/2023/02/msg00025.html

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael Schmitz@21:1/5 to All on Sat Feb 18 01:50:02 2023

Hi Finn,

On 18.02.2023 at 12:49, Finn Thain wrote:

On Sat, 18 Feb 2023, Andreas Schwab wrote:

On Feb 18 2023, Finn Thain wrote:

Why do you say init ignores SIGABRT?

PID 1 is special, it never receives signals it doesn't handle.

I see. I wonder if there is some way to configure the kernel so that PID 1 could be aborted for fstack-protector. I doubt it.

You could add SIGABRT to the list of signals handled by init (see init.c:init_main() and init.c:process_signals() in the sysvinit source).

Not sure it's wise to allow init to abort though. You could instead try
to use a similar signal handler to segv_handler(), and dump core when
receiving the signal? Maybe re-exec init instead of continuing?

My gut says that a compiler change somehow made the fstack-protector implementation insensitive to kernel configuration.

So I still think that, if Stan adopted Debian's build environment, random processes would cease to be aborted (regardless of kernel .config).

Changes in compiler version between sysvinit 3.01 and 3.06 might cause a
bisect using snapshot binaries and a bisect using recompiled binaries to differ. But using a build environment equivalent to that used by the
package autobuilders is certainly good advice.

Or did you mean the kernel build environment?

Cheers.

Michael

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Finn Thain@21:1/5 to Michael Schmitz on Sat Feb 18 04:20:01 2023

On Sat, 18 Feb 2023, Michael Schmitz wrote:

On 18.02.2023 at 12:49, Finn Thain wrote:

On Sat, 18 Feb 2023, Andreas Schwab wrote:

On Feb 18 2023, Finn Thain wrote:

Why do you say init ignores SIGABRT?

PID 1 is special, it never receives signals it doesn't handle.

I see. I wonder if there is some way to configure the kernel so that
PID 1 could be aborted for fstack-protector. I doubt it.

You could add SIGABRT to the list of signals handled by init (see init.c:init_main() and init.c:process_signals() in the sysvinit source).

Not sure it's wise to allow init to abort though. You could instead try
to use a similar signal handler to segv_handler(), and dump core when receiving the signal? Maybe re-exec init instead of continuing?

I like the idea of patching the kernel so as to log every SIGABRT sent
(even if never delivered), ideally along with the target pid and cmd and
the sending pid and cmd.

My gut says that a compiler change somehow made the fstack-protector implementation insensitive to kernel configuration.

So I still think that, if Stan adopted Debian's build environment,
random processes would cease to be aborted (regardless of kernel
.config).

Changes in compiler version between sysvinit 3.01 and 3.06 might cause a bisect using snapshot binaries and a bisect using recompiled binaries to differ.

According to the buildd logs:

package | compiler
--------+----------
3.01-1  | 11.2.0-12
3.02-1  | 11.2.0-19
3.03-1  | 11.2.0-19
3.04-1  | 12.1.0-7
3.05-7  | 12.2.0-7
3.06-2  | 12.2.0-10

Those debs are all available on
https://snapshot.debian.org/package/sysvinit/
in case Stan wants to bisect. No need to build anything.
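
The bisection over those snapshot versions can be sketched in Python as below. This is only an outline under stated assumptions: the versions are ordered oldest to newest, the oldest is good, the newest is bad, and everything after the first bad version is also bad. The is_bad() callback is hypothetical; in practice it would mean "install that version, reboot the 68030 and watch the console". The pretend_is_bad() cut-off is an arbitrary placeholder, not a finding.

```python
def first_bad(versions, is_bad):
    """Return the first version for which is_bad() is true.

    Assumes versions[0] is good, versions[-1] is bad, and the list
    switches from good to bad exactly once.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]

# Versions taken from the buildd table above.
VERSIONS = ["3.01-1", "3.02-1", "3.03-1", "3.04-1", "3.05-7", "3.06-2"]

def pretend_is_bad(version):
    # Arbitrary placeholder so the sketch runs; NOT a real test result.
    return VERSIONS.index(version) >= 3

print(first_bad(VERSIONS, pretend_is_bad))
```

Since the error shows up anywhere from zero to four times per boot, a single clean boot may not prove a version good; several boots per step would make the result more trustworthy.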

But using a build environment equivalent to that used by the package autobuilders is certainly good advice.

Or did you mean the kernel build environment?

I meant the kernel build environment, which was a double-think on my part
as it isn't really relevant here.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Finn Thain@21:1/5 to Stan Johnson on Sat Feb 18 06:40:01 2023

On Fri, 17 Feb 2023, Stan Johnson wrote:

On 2/17/23 4:24 PM, Finn Thain wrote:

On Fri, 17 Feb 2023, Stan Johnson wrote:

The error could have been exposed in any package where
"-fstack-protector-strong" was recently added.

And if you find the last good userland binary, what then? Fix the bad userland binary? That's wonderful but it doesn't explain why the bad userland binary went undetected with Debian's kernel build. And it
doesn't explain why you can't reproduce the problem in QEMU.

Maybe the 68030 handles memory differently than the 68040? As I
mentioned previously, if there is a 68030 Mac emulator that can boot
Linux, then I could try that. I could also try running QEMU on a slow
x86 host to see if a slow client combined with the rapid accesses to
/proc from the scripts might be causing an issue. I suspect that if the 68040
were vulnerable to this problem, we'd be seeing it in QEMU. I'll check
this weekend to see whether QEMU is still not seeing any problem with my latest kernel builds together with the latest Debian SID.

And no, I likely won't be fixing anything, but perhaps I can report the problem to the maintainers of the relevant package. And I will likely
choose (for my own systems) to keep back the last good userland binary
to prevent it from being overwritten until a fix is found.

Good luck with that.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andreas Schwab@21:1/5 to Michael Schmitz on Sat Feb 18 09:20:01 2023

On Feb 18 2023, Michael Schmitz wrote:

Not sure it's wise to allow init to abort though.

When PID 1 exits, the kernel panics.

--
Andreas Schwab, [email protected]
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael Schmitz@21:1/5 to All on Sun Feb 19 20:00:01 2023

Hi Andreas,

On 18.02.2023 at 20:59, Andreas Schwab wrote:

On Feb 18 2023, Michael Schmitz wrote:

Not sure it's wise to allow init to abort though.

When PID 1 exits, the kernel panics.

I might have been a mite subtle there ...

Anyway - Finn's intention was to log signals caused by the stack
protector mechanism. Reading the description of the abort() function, it
appears that if init takes the signal through a handler that returns,
abort() restores the default handler and retries. That might just manage
to abort init after all (I haven't found where PID 1 is treated specially
in the kernel signal code to rule that out).
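
The abort() behaviour described here can be observed in an ordinary process with a short Python sketch. It assumes a POSIX system and uses SIG_IGN as the simplest stand-in for a handler that returns; os.abort() wraps the C library's abort(). It says nothing about how the kernel treats PID 1, which remains the open question.

```python
import signal
import subprocess
import sys

# The child ignores SIGABRT, then calls os.abort() (C abort()).
# If abort() honoured the ignore, the final print would run.
CHILD = (
    "import os, signal\n"
    "signal.signal(signal.SIGABRT, signal.SIG_IGN)\n"
    "os.abort()\n"
    "print('still alive')\n"
)

result = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True
)

# A negative return code means the child was killed by that signal.
print("killed by SIGABRT:", result.returncode == -signal.SIGABRT)
print("child output:", repr(result.stdout))
```

If the child is reported as killed by SIGABRT with no output, abort() has overridden the ignored disposition, which is consistent with the concern that a returning handler in init would not be enough to keep it alive.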

Logging ABRT in the kill() syscall path seems much the safer option.
I'll see what I can come up with.

Cheers,

Michael

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
