Hello,

I am seeing anywhere from zero to four of the following errors while
booting Linux on 68030 systems and using sysvinit startup scripts:

*** stack smashing detected ***: terminated
Aborted

I usually (but not always) see three of the errors while init is running
the rcS.d scripts, and one while running the rc2.d scripts. The stack
smashing messages appear only on the system console (nothing is logged
in an error log or dmesg). Despite the errors, the system continues
booting to multiuser mode without any obvious additional problems. I
haven't tested systemd, which is too slow to be useful on my m68k
systems (though I have a Debian SID with systemd that I can restore for
testing if necessary).

I'm using the current Debian SID and Debian kernel, and I've confirmed
the errors on a Mac IIci and SE/30. I haven't seen the errors on any
68040 system (I only tested on a Centris 650 and PowerBook 550c). I also
notice the errors on 68030 systems using custom kernels that I have
cross-compiled using GCC 12 or GCC 10 on an x86_64 system running Debian
SID; however, I do not see the errors as often if I cross-compile using
GCC 8.3.0 on a 686 system (running Debian 10.7 Buster) -- I saw the
errors a few weeks ago with an earlier kernel, but none today using
Linux 6.1.8 cross-compiled with GCC 8.3.0.

I'll be happy to help debug or troubleshoot, though at this point, since
the "stack smashing detected" errors aren't reporting which processes
are being terminated/aborted, I'm not sure where to start.

thanks for any suggestions

-Stan Johnson [email protected]

Hi Michael,

On 1/30/23 8:05 PM, Michael Schmitz wrote:
> ...
> On 30.01.2023 at 17:00, Stan Johnson wrote:
>> Hello,
>> I am seeing anywhere from zero to four of the following errors while
>> booting Linux on 68030 systems and using sysvinit startup scripts:
>> *** stack smashing detected ***: terminated
>> Aborted
>> I usually (but not always) see three of the errors while init is running
>> the rcS.d scripts, and one while running the rc2.d scripts. The stack
>> smashing messages appear only on the system console (nothing is logged
>> in an error log or dmesg). Despite the errors, the system continues
>> booting to multiuser mode without any obvious additional problems. I
>> haven't tested systemd, which is too slow to be useful on my m68k
>> systems (though I have a Debian SID with systemd that I can restore for
>> testing if necessary).
>> ...
>
> Another way may be logging the start of each of the rcS.d or rc2.d
> scripts until you know what scripts to look at in more detail, then
> adding 'set -v' at the start of those to log every command in the
> offending script.

Thanks for your reply.

After logging the start and end of each script, I see that the "stack
smashing detected" error often happens while running
"/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to
isolate it to a particular command.

This may be a coincidence, but the error seems to happen more often (up
to about 4 times) after a warm boot into Mac OS 7.5.5 and a subsequent
boot into Linux than when starting with a cold boot into Mac OS 7.5.5,
but it doesn't seem that that should make any difference for Linux. This
morning, after a cold boot, I saw two of the errors, while after a warm
boot, I saw four.

> Once the offending binary is known (and the crash can be reproduced
> after system boot), gdb can be used to find the function that overwrote
> its local stack guard.

Is there a way to configure the kernel to use the stack guard for every
function, and then log every resulting abort? I realize that that would
be very slow, but it might also be useful for debugging.

> That's a lot of work on a 030 Mac - have you tried to reproduce this on
> any kind of emulator?

I haven't seen the error in QEMU.

> I suppose one difference between your 030 and 040 Macs might be the
> amount of RAM available. I wonder if this bug results from a combination
> of 030 MMU and memory pressure, or 030 MMU only.

For some reason, the error seems to happen only with 68030 systems,
regardless of processor speed or memory:

PB 170      : 68030, 25 MHz,   8 MiB, external SCSI2SD
Mac IIci    : 68030, 25 MHz,  80 MiB, internal SCSI2SD
SE/30       : 68030, 16 MHz, 128 MiB, external SCSI2SD
PB 550c     : 68040, 33 MHz,  36 MiB, external SCSI2SD
Centris 650 : 68040, 25 MHz, 136 MiB, internal SCSI2SD

-Stan
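
The logging approach discussed above (mark the start and end of each rc script, then echo every command of a suspect script) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name run_logged is hypothetical and not part of sysvinit, the scripts are assumed to be plain /bin/sh scripts that accept a "start" argument, and the log destination must be writable at that point in the boot. Early in rcS the root filesystem is usually still read-only, so the log argument may need to be /dev/console.

```shell
# Hypothetical helper, not part of sysvinit: run every executable S* script
# in directory $1 with the argument "start" and append a trace to $2.
run_logged() {
    dir=$1
    log=$2
    for script in "$dir"/S*; do
        [ -x "$script" ] || continue
        echo "=== start $script" >>"$log"
        # sh -v echoes each input line as it is read, so an error message
        # written to stderr should land next to the command that caused it.
        sh -v "$script" start >>"$log" 2>&1
        # $? still holds the exit status of the sh -v command above.
        echo "=== end $script (exit status $?)" >>"$log"
    done
}
```

Running a script through "sh -v" has the same effect as adding 'set -v' at its top, without editing the script. Something like "run_logged /etc/rcS.d /dev/console" would belong in a test copy of the rc runner rather than being run by hand on a system that has already booted, since the scripts would be started a second time.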

On 2/02/23 05:38, Stan Johnson wrote:
> Hi Michael,
>
> On 1/30/23 8:05 PM, Michael Schmitz wrote:
>> ...
>> On 30.01.2023 at 17:00, Stan Johnson wrote:
>>> I am seeing anywhere from zero to four of the following errors while
>>> booting Linux on 68030 systems and using sysvinit startup scripts:
>>> *** stack smashing detected ***: terminated
>>> Aborted
>>> I usually (but not always) see three of the errors while init is running
>>> the rcS.d scripts, and one while running the rc2.d scripts. The stack
>>> smashing messages appear only on the system console (nothing is logged
>>> in an error log or dmesg). Despite the errors, the system continues
>>> booting to multiuser mode without any obvious additional problems. I
>>> haven't tested systemd, which is too slow to be useful on my m68k
>>> systems (though I have a Debian SID with systemd that I can restore for
>>> testing if necessary).
>>> ...
>>
>> Another way may be logging the start of each of the rcS.d or rc2.d
>> scripts until you know what scripts to look at in more detail, then
>> adding 'set -v' at the start of those to log every command in the
>> offending script.
>
> Thanks for your reply.
>
> After logging the start and end of each script, I see that the "stack
> smashing detected" error often happens while running
> "/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to
> isolate it to a particular command.
>
> This may be a coincidence, but the error seems to happen more often (up
> to about 4 times) after a warm boot into Mac OS 7.5.5 and a subsequent
> boot into Linux than when starting with a cold boot into Mac OS 7.5.5,
> but it doesn't seem that that should make any difference for Linux. This
> morning, after a cold boot, I saw two of the errors, while after a warm
> boot, I saw four.

Hmm - that might well indicate a hardware issue rather than software.
Bits flipping at random in RAM (and getting picked up because the stack
canary changes).

On 2/1/23 11:51 AM, Michael Schmitz wrote:
> ...
> The stack canary mechanism pushes a token on the stack at function
> entry, and compares against that token's value at function exit. This is
> all code generated by gcc in the user binary.
>
> The kernel is not involved in function calls other than syscalls. For
> syscalls, we could try to find the user mode stack, and do a similar
> canary trick, but I don't think that would be necessary for all
> syscalls. Might be easier to instrument copy_to_user() instead if you're
> worried about a syscall receiving result data that way to a variable on
> the stack.
>
> But since we're touching on copy_to_user() here - the 'remove set_fs'
> patch set by Christoph Hellwig refactored the m68k inline helpers around
> July 2021. Can you test a kernel prior to those patches (5.15-rc2)?
> ...

I updated my m68k rootfs in QEMU to the latest Debian SID and installed
the updated rootfs on my Mac IIci. And it's now restoring to my SE/30,
so I'll be able to test on that system again eventually.

I tested 5.15.0-rc3, 5.15.0-rc2 and 5.15.0-rc1, with inconclusive
results on the IIci (3 tests per kernel; e.g. "0, 1, 5" in
"Stack-Smashing" indicates that test 1 had zero smashing errors, test 2
had 1, and test 3 had 5):

Kernel        Stack-Smashing
5.15.0-rc3    0, 1, 5
5.15.0-rc2    2, 3, 2
5.15.0-rc1    7, 1, 3

I saved the serial console logs, but they're probably not too useful at
this point. Next, I'll confirm a similar failure on the SE/30, then I'll
check 4.0 and 5.0 (on the IIci) to see whether the issue started in that
range, then use git bisect in an attempt to isolate the issue further.

I'll also do a spot check on a different IIci to lessen the chance that
the issue is being caused by a hardware problem. If it's not caused by a
kernel bug, the next step will be to isolate the specific offending
executable(s) in the sysvinit scripts.

Or I could test using systemd instead of sysvinit; that would take
longer but it might be worthwhile if it doesn't look like a kernel bug.

-Stan
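
If the per-boot counts are taken from saved serial console logs, the tally could be automated along these lines. A minimal sketch, assuming one log file per test boot; the function name count_smashes and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper: print "<file>: <count>" for each console log given,
# where <count> is the number of lines reporting a smashed stack.
count_smashes() {
    for log in "$@"; do
        # grep -c prints the number of matching lines; it exits non-zero
        # when there are none, hence the "|| true".
        n=$(grep -c 'stack smashing detected' "$log" || true)
        echo "$log: $n"
    done
}
```

For three test boots of one kernel this might be called as "count_smashes rc3-test1.log rc3-test2.log rc3-test3.log".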

> After logging the start and end of each script, I see that the "stack
> smashing detected" error often happens while running
> "/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to
> isolate it to a particular command.

>> But since we're touching on copy_to_user() here - the 'remove set_fs'
>> patch set by Christoph Hellwig refactored the m68k inline helpers around
>> July 2021. Can you test a kernel prior to those patches (5.15-rc2)?

>> That's a lot of work on a 030 Mac - have you tried to reproduce this on
>> any kind of emulator?
>
> I haven't seen the error in QEMU.
>
>> I suppose one difference between your 030 and 040 Macs might be the
>> amount of RAM available. I wonder if this bug results from a combination
>> of 030 MMU and memory pressure, or 030 MMU only.
>
> For some reason, the error seems to happen only with 68030 systems,
> regardless of processor speed or memory:
>
> PB 170      : 68030, 25 MHz,   8 MiB, external SCSI2SD
> Mac IIci    : 68030, 25 MHz,  80 MiB, internal SCSI2SD
> SE/30       : 68030, 16 MHz, 128 MiB, external SCSI2SD
> PB 550c     : 68040, 33 MHz,  36 MiB, external SCSI2SD
> Centris 650 : 68040, 25 MHz, 136 MiB, internal SCSI2SD

Exception handling in copy_to_user() and the related bits in 030 page
fault handling might need another look then...

Hi Michael,

On 2/5/23 3:19 PM, Michael Schmitz wrote:
> ...
> Seeing Finn's report that Al Viro's VM_FAULT_RETRY fix may have solved
> his task corruption troubles on 040, I just noticed that I probably
> misunderstood how Al's patch works.
>
> Botching up a fault retry and carrying on may well leave the page tables
> in a state where some later access could go to the wrong page and
> manifest as user space corruption. Could you try Al's patch 4 (m68k: fix
> livelock in uaccess) to see if this helps?

ok, this appears to be the patch:

...
Signed-off-by: Al Viro <[email protected]>
---
 arch/m68k/mm/fault.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 4d2837eb3e2a..228128e45c67 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,8 +138,11 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	fault = handle_mm_fault(vma, address, flags, regs);
 	pr_debug("handle_mm_fault returns %x\n", fault);
 
-	if (fault_signal_pending(fault, regs))
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			goto no_context;
 		return 0;
+	}
 
 	/* The fault is fully completed (including releasing mmap lock) */
 	if (fault & VM_FAULT_COMPLETED)

Thanks Stan,

On 8/02/23 08:37, Stan Johnson wrote:
> Hi Michael,
>
> On 2/5/23 3:19 PM, Michael Schmitz wrote:
>> ...
>> Seeing Finn's report that Al Viro's VM_FAULT_RETRY fix may have solved
>> his task corruption troubles on 040, I just noticed that I probably
>> misunderstood how Al's patch works.
>>
>> Botching up a fault retry and carrying on may well leave the page tables
>> in a state where some later access could go to the wrong page and
>> manifest as user space corruption. Could you try Al's patch 4 (m68k: fix
>> livelock in uaccess) to see if this helps?
>
> ok, this appears to be the patch:
>
> ...

That's correct.

Your results show improvement but the problem does not entirely go away.

Looking at differences between 030 and 040/060 fault handling, it
appears only 030 handles faults corrected by exception tables (such as
used in uaccess macros) specially, i.e. aborting bus error processing,
while 040 and 060 carry on in the fault handler.

I wonder if that's the main difference between 030 and 040 behaviour?

I'll try and log such accesses caught by exception tables on 030 to see
if they are rare enough to allow adding a kernel log message...

Cheers,

Michael

From a55467a02b66addca6f74fc32b473bc077cb34b2 Mon Sep 17 00:00:00 2001
From: Michael Schmitz <[email protected]>

Hi Michael,

On 2/8/23 8:41 PM, Michael Schmitz wrote:
> ...
> Following the 040 code a bit further, I suspect that happens in the 040
> writeback handler, so this may be a red herring.
>
>> I'll try and log such accesses caught by exception tables on 030 to see
>> if they are rare enough to allow adding a kernel log message...
>
> Looks like this kind of event is rare enough to not trigger in a normal
> boot on my 030.

Have you seen the error using my modified config file? Perhaps I'm not
including something that's causing (or revealing) the problem.

> Please give the attached patch a try so we can confirm
> (or rule out) that user space access faults from kernel mode are to
> blame for your stack smashes.
> ...

With "0001-m68k-debug-exception-handling-data-faults-on-030.patch",

Kernel                   Stack-Smashing
6.2.0-rc7 (no patch)     4, 3, 3 (from earlier test)
6.2.0-rc7 (new patch)    6, 2, 0

The earlier patch is not applied. Serial console log is attached.

thanks

-Stan

v5.1   x   SCSI2SD crashes, goes offline with activity LED on,
           rootfs corrupted, needed to be restored from backups,
           SCSI2SD SD card needed to have the Apple driver updated
           to boot MacOS
v4.20      bad stack smashing on first boot, corrupted rootfs (bad
           superblock magic number) on second boot, fsck from a
           different rootfs found many block counts wrong, rootfs had
           to be restored from backups

I think if there were hardware problems with the SCSI2SD board or the SD
card, then I would be seeing lots of errors in MacOS and with v6.x
kernels.

On Sat, 11 Feb 2023, Stan Johnson wrote:
> v5.1   x   SCSI2SD crashes, goes offline with activity LED on,
>            rootfs corrupted, needed to be restored from backups,
>            SCSI2SD SD card needed to have the Apple driver updated
>            to boot MacOS
> v4.20      bad stack smashing on first boot, corrupted rootfs (bad
>            superblock magic number) on second boot, fsck from a
>            different rootfs found many block counts wrong, rootfs had
>            to be restored from backups

That looks like an unrelated problem, presumably caused by hardware.
It could be related to the problem you described to me that arises when
you subdivide an SD card into multiple SCSI targets.

Anyway, if you want to pursue the stack smashing error you'll probably
need to use a different storage device.

1) Default Debian kernel (vmlinux-6.1.0-4-m68k, initrd-6.1.0-4-m68k)
   Default Debian sysvinit scripts
   Boot time (ABC... to login prompt): 15 min 3 sec
   NIC not detected, no stack smashing

Going from 15 min to about 4 min seems worth the effort on old hardware.
As always, YMMV. Developers who use QEMU or other emulators likely don't
always realize how long it takes to boot real hardware.

I could re-install the previous sysvinit over the version in the current
Debian SID and see if the stack smashing is still gone. I don't know how
to do that, but if someone has instructions, I'll try. I'm guessing I
need to download the previous .deb binaries and use dpkg to install the
older versions over the newer versions, while the newer init is still
running; maybe rename /sbin/init to /sbin/init.tmp and boot with
init=/sbin/init.tmp to get it out of the way (?).

Or I could download the source for sysvinit-core 3.06-2 and
sysvinit-utils 3.06-2 and compile using all of Debian's options plus
-fstack-protector-all (and perhaps other options?) to see whether there
might be a bounds issue on an array somewhere. But I also don't know
where to find the source or the options that Debian uses for compilation.

Otherwise, I'm done looking into the stack smashing for now. If anyone
is interested in developing "68030-lean" kernel config options or custom
sysvinit scripts, I'll be happy to contribute.

thanks

-Stan

No need to move the old init binary out of the way - as long as a file

Hi Michael,

On 2/16/23 4:10 PM, Michael Schmitz wrote:
> ...
> 'apt-get source sysvinit=3.06-2' will download and unpack that specific
> version. That should unpack the source in sysvinit-3.06-2/.

That doesn't work for me:

# apt-get source sysvinit=3.06-2
Reading package lists... Done
E: You must put some 'deb-src' URIs in your sources.list

Adding this to /etc/apt/sources.list doesn't help (same error):

deb-src http://ftp.ports.debian.org/debian-ports/ sid main

Running "apt-get update", I see:

W: Skipping acquire of configured file 'main/source/Sources' as
repository 'http://ftp.ports.debian.org/debian-ports sid InRelease' does
not seem to provide it (sources.list entry misspelt?)

So I might not be pointing to the right source repository. Would
downloading the source from an x86 repository be sufficient, or do I
need patches that are specific to m68k?

> To add compile options, look at the patches in debian/patches/ there -
> haven't found the 3.06-2 version, but older versions have a patch to add
> additional CFLAGS to src/Makefile. Add a new patch in debian/patches/,
> add that patch file name to debian/patches/series and use
> dpkg-buildpackage to build the package.

If I can download the source, I'll do that.

Just to confirm that an earlier version (3.01-2) doesn't cause the stack
smashing errors with an otherwise up-to-date Debian SID, I can restore
from an earlier backup and run "apt-get upgrade", though I would want to
"keep back" sysvinit-core 3.01-2 and sysvinit-utils 3.01-2. Please let
me know how to keep back a package from being upgraded.

=' in the Packages files in /var/lib/apt/lists) before attempting that.

Do you know whether there is some location where all of the older binary
packages are saved? If there is, I can download and check versions
between 3.01-2 and 3.06-2 and determine where the issue started.

thanks

-Stan
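
On the question of keeping a package back: apt's usual mechanism for this is a hold set with apt-mark. A minimal sketch, assuming the two package names mentioned above and that it is run as root once the 3.01-2 versions are installed. The wrapper functions and the APT_MARK variable are hypothetical conveniences, added only so that the commands can be previewed with APT_MARK=echo.

```shell
# Command used to set holds; set APT_MARK=echo beforehand for a dry run.
APT_MARK=${APT_MARK:-apt-mark}

# Hold both sysvinit packages at their currently installed version so that
# "apt-get upgrade" leaves them alone.
hold_sysvinit() {
    "$APT_MARK" hold sysvinit-core sysvinit-utils
}

# Release the hold again once a fixed version is available.
unhold_sysvinit() {
    "$APT_MARK" unhold sysvinit-core sysvinit-utils
}
```

After hold_sysvinit, "apt-mark showhold" should list both packages.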
I noticed that /sbin/init seems to ignore SIGABRT, so I thought that
might mean that init itself was somehow triggering the stack smashing
but nothing was really aborting, but I could be wrong about that.
That's not to say a SIGABRT is ignored, it just doesn't kill PID 1.

On Feb 18 2023, Finn Thain wrote:
> Why do you say init ignores SIGABRT?

PID 1 is special, it never receives signals it doesn't handle.
The error could have been exposed in any package where "-fstack-protector-strong" was recently added.
What's interesting to me is that although there are stack smashing
errors and "aborted" messages during boot, nothing seems to have
actually failed or aborted (note that no processes are named as
"terminated" or "aborted" in the messages).
Once logged in, I can't duplicate the stack smashing (yet), and I don't
see random executables failing. The stack smashing seems to happen only
in startup scripts that have heavy access to /proc (the mountkernfs
script seems particularly vulnerable, as I recall), though none of the scripts seems to actually fail.
I'm not a programmer, but I'm willing to learn. I don't know what
Debian's build environment is, or how to adopt it or set it up, but I
can look into it with some suggestions.

On Sat, 18 Feb 2023, Andreas Schwab wrote:
> On Feb 18 2023, Finn Thain wrote:
>> Why do you say init ignores SIGABRT?
>
> PID 1 is special, it never receives signals it doesn't handle.

I see. I wonder if there is some way to configure the kernel so that
PID 1 could be aborted for fstack-protector. I doubt it.

My gut says that a compiler change somehow made the fstack-protector
implementation insensitive to kernel configuration.

So I still think that, if Stan adopted Debian's build environment,
random processes would cease to be aborted (regardless of kernel
.config).

On 18.02.2023 at 12:49, Finn Thain wrote:
> On Sat, 18 Feb 2023, Andreas Schwab wrote:
>> On Feb 18 2023, Finn Thain wrote:
>>> Why do you say init ignores SIGABRT?
>>
>> PID 1 is special, it never receives signals it doesn't handle.
>
> I see. I wonder if there is some way to configure the kernel so that
> PID 1 could be aborted for fstack-protector. I doubt it.

You could add SIGABRT to the list of signals handled by init (see
init.c:init_main() and init.c:process_signals() in the sysvinit source).

Not sure it's wise to allow init to abort though. You could instead try
to use a signal handler similar to segv_handler(), and dump core when
receiving the signal? Maybe re-exec init instead of continuing?
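
Whether a running init already installs a handler for SIGABRT can be checked without reading the source, because Linux exposes each process's caught-signal mask as the SigCgt field of /proc/<pid>/status (a hexadecimal bitmask in which signal n is bit n-1). A rough sketch, assuming SIGABRT is signal 6 as it is on Linux; the helper names are hypothetical.

```shell
# Hypothetical helper: print 1 if signal number $2 is set in hex mask $1,
# otherwise print 0. Signal n corresponds to bit n-1 of the mask.
sig_in_mask() {
    echo $(( (0x$1 >> ($2 - 1)) & 1 ))
}

# Hypothetical helper: print the caught-signal mask (SigCgt) of process $1,
# defaulting to PID 1.
sigcgt_of() {
    awk '/^SigCgt:/ { print $2 }' "/proc/${1:-1}/status"
}

# Example (prints 1 if PID 1 has a handler installed for SIGABRT):
#   sig_in_mask "$(sigcgt_of 1)" 6
```

A result of 0 for PID 1 would be consistent with the remark above that init never receives signals it doesn't handle.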

> My gut says that a compiler change somehow made the fstack-protector
> implementation insensitive to kernel configuration.
>
> So I still think that, if Stan adopted Debian's build environment,
> random processes would cease to be aborted (regardless of kernel
> .config).

Changes in compiler version between sysvinit 3.01 and 3.06 might cause a
bisect using snapshot binaries and a bisect using recompiled binaries to
differ.

But using a build environment equivalent to that used by the package
autobuilders is certainly good advice.

Or did you mean the kernel build environment?

On 2/17/23 4:24 PM, Finn Thain wrote:
> On Fri, 17 Feb 2023, Stan Johnson wrote:
>> The error could have been exposed in any package where
>> "-fstack-protector-strong" was recently added.
>
> And if you find the last good userland binary, what then? Fix the bad
> userland binary? That's wonderful but it doesn't explain why the bad
> userland binary went undetected with Debian's kernel build. And it
> doesn't explain why you can't reproduce the problem in QEMU.

Maybe the 68030 handles memory differently than the 68040? As I
mentioned previously, if there is a 68030 Mac emulator that can boot
Linux, then I could try that. I could also try running QEMU on a slow
x86 host to see if a slow client combined with the rapid accesses to
/proc from the scripts might be causing an issue. I suspect if the 68040
were vulnerable to this problem, we'd be seeing it in QEMU. I'll check
this weekend to see whether QEMU is still not seeing any problem with my
latest kernel builds together with the latest Debian SID.

And no, I likely won't be fixing anything, but perhaps I can report the
problem to the maintainers of the relevant package. And I will likely
choose (for my own systems) to keep back the last good userland binary
to prevent it from being overwritten until a fix is found.

On Feb 18 2023, Michael Schmitz wrote:
> Not sure it's wise to allow init to abort though.

When PID 1 exits, the kernel panics.