On Fri, 17 Feb 2023, Stan Johnson wrote:
That's not to say a SIGABRT is ignored, it just doesn't kill PID 1.
I doubt that /sbin/init is generating the "stack smashing detected"
error but you may need to modify it to find out. If you can't figure out which userland binary is involved, you'll have to focus on your custom
kernel binary, just as I proposed in my message dated 8 Feb 2023.
Looking at sysdeps/unix/sysv/linux/wait3.c, I guess the only possible
place for a buffer overrun would be struct __rusage64 usage64. https://sources.debian.org/src/glibc/2.36-8/sysdeps/unix/sysv/linux/wait3.c/?hl=41#L41
0xc012a3c2 <+34>: lea %sp@(12),%sp
0xc012a3c6 <+38>: movel %d3,%sp@-
...
So %a3 was a pointer into stack frame 6??
(gdb) x/z $a3
0xefee1068: 0xc00e0172
Clearly 0xd000c38e != 0xc00e0172 (that is, %fp@(-4) != %a3@) but did the canary value change? It rather looks like the canary pointer is wrong...
Another way to find the value of %a3 during __wait3() execution is to look
at its initialization: moveal %a5@(108),%a3. And we can see from 'info
frame' above that __stack_chk_fail() saved %a5 at 0xefee1064.
(gdb) x/4z 0xefee1060
0xefee1060: 0xc0182c5e 0xc0198000 0xc00e0172 0xd001e718
(gdb) x/z *0xefee1064+108
0xc019806c: Cannot access memory at address 0xc019806c
So, in summary, the canary validation failed in this case not because the canary got clobbered but because %a3 got clobbered, somewhere between __wait3+24 and __wait3+70 (below).
The call to __GI___wait4_time64 causes %a3 to be saved to and restored
from the stack, so stack corruption seems to be a strong possibility to explain the change in %a3.
But if that's what happened, I'd expect __GI___wait4_time64 to report
stack smashing, not __wait3... And it just begs the question, what then caused the corruption? Was it the wait4 syscall? Was it another thread?
And why is it so rare?
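To make the failure mode concrete: the epilogue compares the copy at %fp@(-4) with whatever %a3 currently points at, so a clobbered pointer fails the check even though the guard word itself is intact. Below is a minimal C sketch of that logic. The names guard_word, canary_ok() and replay() are hypothetical, and the two constants are simply the values read from the core dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the guard word that %a3 is supposed to point at. */
static uint32_t guard_word = 0xd000c38eu;

/* Epilogue check (cmpl %a0,%d1 in the listing): compare the frame's copy
 * with whatever guard_ptr points at. Returns 1 on match, 0 on mismatch. */
static int canary_ok(const uint32_t *guard_ptr, uint32_t frame_copy)
{
    assert(guard_ptr != NULL);
    return *guard_ptr == frame_copy;
}

/* Replays prologue, optional pointer clobber, then the epilogue check. */
static int replay(int clobber_pointer)
{
    uint32_t stray_word = 0xc00e0172u;  /* what %a3@ held in the core dump */
    const uint32_t *a3 = &guard_word;   /* moveal %a5@(108),%a3 */
    uint32_t frame_copy = *a3;          /* movel %a3@,%fp@(-4) */

    if (clobber_pointer)
        a3 = &stray_word;               /* saved %a3 overwritten during the call */

    return canary_ok(a3, frame_copy);
}
```

Here replay(0) passes and replay(1) fails, although guard_word never changes in either case.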
(gdb) disass __wait3
Dump of assembler code for function __wait3:
0xc00e0070 <+0>: linkw %fp,#-96
0xc00e0074 <+4>: moveml %a2-%a3/%a5,%sp@-
0xc00e0078 <+8>: lea %pc@(0xc0198000),%a5
0xc00e0080 <+16>: movel %fp@(8),%d0
0xc00e0084 <+20>: moveal %fp@(16),%a2
0xc00e0088 <+24>: moveal %a5@(108),%a3
0xc00e008c <+28>: movel %a3@,%fp@(-4)
0xc00e0090 <+32>: tstl %a2
0xc00e0092 <+34>: beqw 0xc00e0152 <__wait3+226>
0xc00e0096 <+38>: pea %fp@(-92)
0xc00e009a <+42>: movel %fp@(12),%sp@-
0xc00e009e <+46>: movel %d0,%sp@-
0xc00e00a0 <+48>: pea 0xffffffff
0xc00e00a4 <+52>: bsrl 0xc00e0174 <__GI___wait4_time64>
0xc00e00aa <+58>: lea %sp@(16),%sp
0xc00e00ae <+62>: tstl %d0
0xc00e00b0 <+64>: bgts 0xc00e00c8 <__wait3+88>
0xc00e00b2 <+66>: moveal %fp@(-4),%a0
0xc00e00b6 <+70>: movel %a3@,%d1
0xc00e00b8 <+72>: cmpl %a0,%d1
0xc00e00ba <+74>: bnew 0xc00e016c <__wait3+252>
0xc00e00be <+78>: moveml %fp@(-108),%a2-%a3/%a5
0xc00e00c4 <+84>: unlk %fp
0xc00e00c6 <+86>: rts
0xc00e00c8 <+88>: pea 0x44
0xc00e00cc <+92>: clrl %sp@-
0xc00e00ce <+94>: pea %a2@(4)
0xc00e00d2 <+98>: movel %d0,%fp@(-96)
0xc00e00d6 <+102>: bsrl 0xc00b8850 <__GI_memset>
0xc00e00dc <+108>: movel %fp@(-88),%a2@
0xc00e00e0 <+112>: movel %fp@(-80),%a2@(4)
0xc00e00e6 <+118>: movel %fp@(-72),%a2@(8)
0xc00e00ec <+124>: movel %fp@(-64),%a2@(12)
0xc00e00f2 <+130>: movel %fp@(-60),%a2@(16)
0xc00e00f8 <+136>: movel %fp@(-56),%a2@(20)
0xc00e00fe <+142>: movel %fp@(-52),%a2@(24)
0xc00e0104 <+148>: movel %fp@(-48),%a2@(28)
0xc00e010a <+154>: movel %fp@(-44),%a2@(32)
0xc00e0110 <+160>: movel %fp@(-40),%a2@(36)
0xc00e0116 <+166>: movel %fp@(-36),%a2@(40)
0xc00e011c <+172>: movel %fp@(-32),%a2@(44)
0xc00e0122 <+178>: movel %fp@(-28),%a2@(48)
0xc00e0128 <+184>: movel %fp@(-24),%a2@(52)
0xc00e012e <+190>: movel %fp@(-20),%a2@(56)
0xc00e0134 <+196>: movel %fp@(-16),%a2@(60)
0xc00e013a <+202>: movel %fp@(-12),%a2@(64)
0xc00e0140 <+208>: movel %fp@(-8),%a2@(68)
0xc00e0146 <+214>: lea %sp@(12),%sp
0xc00e014a <+218>: movel %fp@(-96),%d0
0xc00e014e <+222>: braw 0xc00e00b2 <__wait3+66>
0xc00e0152 <+226>: clrl %sp@-
0xc00e0154 <+228>: movel %fp@(12),%sp@-
0xc00e0158 <+232>: movel %d0,%sp@-
0xc00e015a <+234>: pea 0xffffffff
0xc00e015e <+238>: bsrl 0xc00e0174 <__GI___wait4_time64>
0xc00e0164 <+244>: lea %sp@(16),%sp
0xc00e0168 <+248>: braw 0xc00e00b2 <__wait3+66>
0xc00e016c <+252>: bsrl 0xc012a38c <__stack_chk_fail>
End of assembler dump.
(gdb) disass __GI___wait4_time64
Dump of assembler code for function __GI___wait4_time64:
0xc00e0174 <+0>: lea %sp@(-80),%sp
0xc00e0178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
0xc00e017c <+8>: lea %pc@(0xc0198000),%a5
0xc00e0184 <+16>: movel %sp@(116),%d2
0xc00e0188 <+20>: moveal %sp@(124),%a2
0xc00e018c <+24>: moveal %a5@(108),%a3
0xc00e0190 <+28>: movel %a3@,%sp@(104)
0xc00e0194 <+32>: bsrl 0xc0052e2c <__m68k_read_tp@plt>
0xc00e019a <+38>: movel %a0@(-29920),%d4
0xc00e019e <+42>: bnew 0xc00e026c <__GI___wait4_time64+248>
0xc00e01a2 <+46>: tstl %a2
0xc00e01a4 <+48>: beqs 0xc00e01aa <__GI___wait4_time64+54>
0xc00e01a6 <+50>: moveq #32,%d4
0xc00e01a8 <+52>: addl %sp,%d4
0xc00e01aa <+54>: movel %sp@(120),%d3
0xc00e01ae <+58>: movel %sp@(112),%d1
0xc00e01b2 <+62>: moveq #114,%d0
0xc00e01b4 <+64>: trap #0
0xc00e01b6 <+66>: cmpil #-4096,%d0
0xc00e01bc <+72>: bhiw 0xc00e02a6 <__GI___wait4_time64+306>
0xc00e01c0 <+76>: tstl %d0
0xc00e01c2 <+78>: blew 0xc00e0256 <__GI___wait4_time64+226>
0xc00e01c6 <+82>: tstl %a2
0xc00e01c8 <+84>: beqw 0xc00e0256 <__GI___wait4_time64+226>
0xc00e01cc <+88>: movel %sp@(32),%a2@(4)
0xc00e01d2 <+94>: smi %d1
0xc00e01d4 <+96>: extbl %d1
0xc00e01d6 <+98>: movel %d1,%a2@
0xc00e01d8 <+100>: movel %sp@(36),%a2@(12)
0xc00e01de <+106>: smi %d1
0xc00e01e0 <+108>: extbl %d1
0xc00e01e2 <+110>: movel %d1,%a2@(8)
0xc00e01e6 <+114>: movel %sp@(40),%a2@(20)
0xc00e01ec <+120>: smi %d1
0xc00e01ee <+122>: extbl %d1
0xc00e01f0 <+124>: movel %d1,%a2@(16)
0xc00e01f4 <+128>: movel %sp@(44),%a2@(28)
0xc00e01fa <+134>: smi %d1
0xc00e01fc <+136>: extbl %d1
0xc00e01fe <+138>: movel %d1,%a2@(24)
0xc00e0202 <+142>: movel %sp@(48),%a2@(32)
0xc00e0208 <+148>: movel %sp@(52),%a2@(36)
0xc00e020e <+154>: movel %sp@(56),%a2@(40)
0xc00e0214 <+160>: movel %sp@(60),%a2@(44)
0xc00e021a <+166>: movel %sp@(64),%a2@(48)
0xc00e0220 <+172>: movel %sp@(68),%a2@(52)
0xc00e0226 <+178>: movel %sp@(72),%a2@(56)
0xc00e022c <+184>: movel %sp@(76),%a2@(60)
0xc00e0232 <+190>: movel %sp@(80),%a2@(64)
0xc00e0238 <+196>: movel %sp@(84),%a2@(68)
0xc00e023e <+202>: movel %sp@(88),%a2@(72)
0xc00e0244 <+208>: movel %sp@(92),%a2@(76)
0xc00e024a <+214>: movel %sp@(96),%a2@(80)
0xc00e0250 <+220>: movel %sp@(100),%a2@(84)
0xc00e0256 <+226>: moveal %sp@(104),%a0
0xc00e025a <+230>: movel %a3@,%d1
0xc00e025c <+232>: cmpl %a0,%d1
0xc00e025e <+234>: bnew 0xc00e02f2 <__GI___wait4_time64+382>
0xc00e0262 <+238>: moveml %sp@+,%d2-%d5/%a2-%a3/%a5
0xc00e0266 <+242>: lea %sp@(80),%sp
0xc00e026a <+246>: rts
0xc00e026c <+248>: bsrl 0xc00a1b88 <__GI___pthread_enable_asynccancel>
0xc00e0272 <+254>: movel %d0,%d5
0xc00e0274 <+256>: tstl %a2
0xc00e0276 <+258>: beqs 0xc00e02c4 <__GI___wait4_time64+336>
0xc00e0278 <+260>: moveq #32,%d4
0xc00e027a <+262>: addl %sp,%d4
0xc00e027c <+264>: movel %sp@(120),%d3
0xc00e0280 <+268>: movel %sp@(112),%d1
0xc00e0284 <+272>: moveq #114,%d0
0xc00e0286 <+274>: trap #0
0xc00e0288 <+276>: cmpil #-4096,%d0
0xc00e028e <+282>: bhis 0xc00e02c8 <__GI___wait4_time64+340>
0xc00e0290 <+284>: movel %d5,%sp@-
0xc00e0292 <+286>: movel %d0,%sp@(32)
0xc00e0296 <+290>: bsrl 0xc00a1bea <__GI___pthread_disable_asynccancel>
0xc00e029c <+296>: addql #4,%sp
0xc00e029e <+298>: movel %sp@(28),%d0
0xc00e02a2 <+302>: braw 0xc00e01c0 <__GI___wait4_time64+76>
0xc00e02a6 <+306>: movel %d0,%sp@(28)
0xc00e02aa <+310>: bsrl 0xc0052e2c <__m68k_read_tp@plt>
0xc00e02b0 <+316>: addal %a5@(2cf8),%a0
0xc00e02b8 <+324>: movel %sp@(28),%d0
0xc00e02bc <+328>: negl %d0
0xc00e02be <+330>: movel %d0,%a0@
0xc00e02c0 <+332>: moveq #-1,%d0
0xc00e02c2 <+334>: bras 0xc00e0256 <__GI___wait4_time64+226>
0xc00e02c4 <+336>: clrl %d4
0xc00e02c6 <+338>: bras 0xc00e027c <__GI___wait4_time64+264>
0xc00e02c8 <+340>: movel %d0,%sp@(28)
0xc00e02cc <+344>: bsrl 0xc0052e2c <__m68k_read_tp@plt>
0xc00e02d2 <+350>: addal %a5@(2cf8),%a0
0xc00e02da <+358>: movel %sp@(28),%d0
0xc00e02de <+362>: negl %d0
0xc00e02e0 <+364>: movel %d0,%a0@
0xc00e02e2 <+366>: movel %d5,%sp@-
0xc00e02e4 <+368>: bsrl 0xc00a1bea <__GI___pthread_disable_asynccancel>
0xc00e02ea <+374>: addql #4,%sp
0xc00e02ec <+376>: moveq #-1,%d0
0xc00e02ee <+378>: braw 0xc00e0256 <__GI___wait4_time64+226>
0xc00e02f2 <+382>: bsrl 0xc012a38c <__stack_chk_fail>
End of assembler dump.
Saved registers are restored from the stack before return from __GI___wait4_time64 but we don't know which of the two wait4 call sites
was used, do we?
What registers does __m68k_read_tp@plt clobber?
Maybe an interaction between (multiple?) signals and syscall return...
depends on how long we sleep in wait4, and whether a signal happens just during that time.
%a3 is the first register saved to the switch stack BTW.
That kernel does contain Al Viro's patch that corrected our switch stack handling in the signal return path? I wonder whether there's a potential
race lurking in there?
And I just notice that we had had trouble with a copy_to_user in setup_frame() earlier (reason for my buserr handler patch). I wonder
whether something's gone wrong there. Do you get a segfault instead of
the abort signal if you drop my patch?
On Apr 01 2023, Finn Thain wrote:
So, in summary, the canary validation failed in this case not because
the canary got clobbered but because %a3 got clobbered, somewhere
between __wait3+24 and __wait3+70 (below).
The call to __GI___wait4_time64 causes %a3 to be saved to and restored
from the stack, so stack corruption seems to be a strong possibility
to explain the change in %a3.
But if that's what happened, I'd expect __GI___wait4_time64 to report
stack smashing, not __wait3...
The stack smashing check probably didn't fire in __wait4_time64, because
the overwrite hit the saved register area, not the canary (the two reside
at opposite ends of the stack frame).
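A rough model of the __GI___wait4_time64 frame, read off the prologue in the listing (lea %sp@(-80),%sp, then moveml of seven registers, with the canary copy stored at %sp@(104)), may help here. The struct and helper names are hypothetical, and the register order assumes the usual moveml predecrement layout with the lowest-numbered register at the lowest address. The sketch shows that a stray 4-byte store at %sp@(20) changes the saved %a3 without touching the canary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout relative to %sp after the prologue (offsets in bytes). */
struct wait4_frame {
    uint32_t saved_d2_d5[4];  /*   0..15  */
    uint32_t saved_a2;        /*  16      */
    uint32_t saved_a3;        /*  20      */
    uint32_t saved_a5;        /*  24      */
    uint32_t locals[19];      /*  28..103 scratch word plus rusage buffer */
    uint32_t canary;          /* 104      movel %a3@,%sp@(104) */
    uint32_t return_address;  /* 108      */
};

enum { HIT_NONE = 0, HIT_A3 = 1, HIT_CANARY = 2 };

/* Simulate one stray 4-byte store at byte offset `off` from %sp and
 * report which of the two interesting fields changed. */
static int stray_store_hits(size_t off)
{
    struct wait4_frame before, after;
    uint32_t junk = 0xefee1068u;  /* the value found in the clobbered %a3 */

    memset(&before, 0xAA, sizeof before);
    after = before;

    assert(off + sizeof junk <= sizeof after);  /* stay inside the model */
    memcpy((unsigned char *)&after + off, &junk, sizeof junk);

    return (after.saved_a3 != before.saved_a3 ? HIT_A3 : HIT_NONE)
         | (after.canary != before.canary ? HIT_CANARY : HIT_NONE);
}
```

In this model the saved registers sit 84 bytes below the canary, so a small overwrite near %sp would be invisible to the callee's own check.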
On Sat, 1 Apr 2023, Andreas Schwab wrote:
The stack smashing check probably didn't fire in __wait4_time64, because
the overwrite hit the saved register area, not the canary (the two reside
at opposite ends of the stack frame).
OK.
This is odd:
https://sources.debian.org/src/dash/0.5.12-2/src/jobs.c/?hl=1165#L1165
1176 do {
1177 gotsigchld = 0;
1178 do
1179 err = wait3(status, flags, NULL);
1180 while (err < 0 && errno == EINTR);
1181
1182 if (err || (err = -!block))
1183 break;
1184
1185 sigblockall(&oldmask);
1186
1187 while (!gotsigchld && !pending_sig)
1188 sigsuspend(&oldmask);
1189
1190 sigclearmask();
1191 } while (gotsigchld);
1192
1193 return err;
Execution of dash under gdb doesn't seem to agree with the source code
above.
If wait3() returns the child pid then the break should execute. And it
does return the pid (4107) but the while loop was not terminated. Hence
wait3() was called again and the same breakpoint was hit again. Also, the
while loop should have ended after the first iteration because gotsigchld
should have been set by the signal handler which executed before wait3()
even returned...
...
(gdb) c
Continuing.
#
#
# x=$(:)
[Detaching after fork from child process 4107]
Program received signal SIGCHLD, Child status changed.
0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff87a, options=2,
usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
35 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
(gdb) c
Continuing.
Breakpoint 3, waitproc (status=0xeffff86a, block=1) at jobs.c:1180
1180 jobs.c: No such file or directory.
(gdb) info locals
oldmask = {__val = {1101825, 3844132865, 2072969216, 192511, 4190371840,
4509697, 3836788738, 1049415681, 3837317121, 3094671359, 4184080384,
536870943, 717475840, 3485913089, 3836792833, 2072969216, 184321,
3844141055, 4190425089, 4127248385, 3094659084, 597610497, 4137734145,
3844079616, 131072, 269156352, 184320, 3878473729, 3844132865, 3094663168,
3549089793, 3844132865}}
flags = 2
err = 4107
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) print errno
$6 = 2
(gdb) c
Continuing.
Breakpoint 3, waitproc (status=0xeffff86a, block=0) at jobs.c:1180
1180 in jobs.c
(gdb) info locals
oldmask = {__val = {1101825, 3844132865, 2072969216, 192511, 4190371840,
4509697, 3836788738, 1049415681, 3837317121, 3094671359, 4184080384,
536870943, 717475840, 3485913089, 3836792833, 2072969216, 184321,
3844141055, 4190425089, 4127248385, 3094659084, 597610497, 4137734145,
3844079616, 131072, 269156352, 184320, 3878473729, 3844132865, 3094663168,
3549089793, 3844132865}}
flags = 3
err = -1
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) print errno
$7 = 10
(gdb)
On Sun, 2 Apr 2023, Michael Schmitz wrote:
Saved registers are restored from the stack before return from
__GI___wait4_time64 but we don't know which of the two wait4 call sites
was used, do we?
What registers does __m68k_read_tp@plt clobber?
But that won't matter to the caller, __wait3, right?
I did check that %a3 was saved on entry, before any wait4 syscall or __m68k_read_tp call etc. I also looked at the rts and %a3 did get restored there. Is it worth the effort to trace every branch, in case there's some
way to reach an rts without having first restored the saved registers?
Maybe an interaction between (multiple?) signals and syscall return...
When running dash from gdb in QEMU, there's only one signal (SIGCHLD) and
it gets handled before __wait3() returns. (Of course, the "stack smashing detected" failure never shows up in QEMU.)
depends on how long we sleep in wait4, and whether a signal happens just
during that time.
I agree, there seems to be a race condition there. (And dash's waitproc() seems to take pains to reap the child and handle the signal in any order.)
I wouldn't be surprised if this race somehow makes the failure rare.
I don't want to recompile any userland binaries at this stage, so it would
be nice if we could modify the kernel to keep track of exactly how that
race gets won and lost. Or perhaps there's an easy way to rig the outcome
one way or the other.
%a3 is the first register saved to the switch stack BTW.
That kernel does contain Al Viro's patch that corrected our switch stack
handling in the signal return path? I wonder whether there's a potential
race lurking in there?
I'm not sure which patch you're referring to, but I think Al's signal handling work appeared in v5.15-rc4. I have reproduced the "stack smashing
detected" failure with v5.14.0 and with recent mainline (62bad54b26db from March 30th).
And I just notice that we had had trouble with a copy_to_user in
setup_frame() earlier (reason for my buserr handler patch). I wonder
whether something's gone wrong there. Do you get a segfault instead of
the abort signal if you drop my patch?
Are you referring to e36a82bebbf7? I doubt that it's related. I believe
that copy_to_user is not involved here for the reason already given i.e. wait3(status, flags, NULL) means wait4 gets a NULL pointer for the struct rusage * parameter. Also, Stan first reported this failure in December
with v6.0.9.
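For reference, the call pattern in question can be exercised in isolation. The sketch below assumes a POSIX system that provides wait3(); reap_child() is a hypothetical helper, not dash code. With a NULL rusage pointer there is no struct rusage for the kernel to copy back, so only the status word is written to userspace.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, then reap it with the same call
 * shape dash uses: wait3() with a NULL rusage pointer, retried on EINTR.
 * Returns the child's exit code, or -1 on any failure. */
static int reap_child(int code)
{
    int status = 0;
    pid_t pid, got;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: terminate immediately */

    do
        got = wait3(&status, 0, NULL);
    while (got < 0 && errno == EINTR);

    if (got != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Note that wait3() reaps any child, so this helper is only meaningful in a process with no other outstanding children.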
On 2/04/23 22:46, Finn Thain wrote:
Execution of dash under gdb doesn't seem to agree with the source code above.
If wait3() returns the child pid then the break should execute. And it
does return the pid (4107) but the while loop was not terminated. Hence wait3() was called again and the same breakpoint was hit again. Also, the
I wonder whether line 1182 got miscompiled by gcc. As err == 4107, it is
greater than 0 and the break clearly ought to have been taken, and the
second condition (which changes err) does not need to be examined. Do the
same ordering constraints apply to '||' as to '&&'?
What does the disassembly of this section look like?
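On the '||' question: C guarantees left-to-right evaluation for both '||' and '&&', and the right operand is not evaluated once the left one decides the result. A small sketch of line 1182 in isolation follows; the struct and function names are hypothetical and only the condition itself is taken from jobs.c.

```c
/* Result of evaluating line 1182 once. */
struct step {
    int err;   /* value of err after the condition has been evaluated */
    int brk;   /* 1 if the break would be taken */
};

/* Mirrors: if (err || (err = -!block)) break; */
static struct step line_1182(int err, int block)
{
    struct step s = { err, 0 };

    /* The assignment on the right runs only when s.err is zero. */
    if (s.err || (s.err = -!block))
        s.brk = 1;

    return s;
}
```

With err == 4107 the break is taken and err is left untouched. With err == 0 the outcome depends on block: a blocking call falls through to sigsuspend(), while a non-blocking call breaks with err == -1.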
while loop should have ended after the first iteration because gotsigchld
should have been set by the signal handler which executed before wait3()
even returned...
Setting gotsigchld > 0 would cause the while loop to continue, no?
On 02.04.2023 at 21:31, Finn Thain wrote:
Maybe an interaction between (multiple?) signals and syscall
return...
When running dash from gdb in QEMU, there's only one signal (SIGCHLD)
and it gets handled before __wait3() returns. (Of course, the "stack smashing detected" failure never shows up in QEMU.)
Might be a clue that we need multiple signals to force the stack
smashing error. And we might not get that in QEMU, due to the faster execution in emulating on a modern processor.
Thinking a bit more about interactions between signal delivery and
syscall return, it turns out that we don't check for pending signals
when returning from a syscall. That's OK on non-SMP systems, because we
don't have another process running while we execute the syscall (and we
_do_ run signal handling when scheduling, i.e. when wait4 sleeps or is
woken up)?
Seems we can forget about that interaction then.
depends on how long we sleep in wait4, and whether a signal happens
just during that time.
I agree, there seems to be a race condition there. (And dash's
waitproc() seems to take pains to reap the child and handle the signal
in any order.)
Yes, it makes sure the SIGCHLD is seen no matter in what order the
signals are delivered ...
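The ordering-independent wait that waitproc() appears to implement can be sketched with plain POSIX calls. This is an illustration rather than dash's code: wait_for_sigchld() and on_sigchld() are hypothetical names. SIGCHLD is blocked before the fork, so the handler can only run inside sigsuspend(), which closes the window between testing the flag and going to sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int signo)
{
    (void)signo;
    got_sigchld = 1;
}

/* Fork a child that exits with `code`, sleep until SIGCHLD has been
 * handled, then reap the child. Returns its exit code, or -1 on failure. */
static int wait_for_sigchld(int code)
{
    struct sigaction sa, old_sa;
    sigset_t block, oldmask, waitmask;
    int status = 0;
    pid_t pid, got = -1;

    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
        return -1;

    /* Block SIGCHLD so it cannot arrive between the test and the sleep. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &oldmask);
    waitmask = oldmask;
    sigdelset(&waitmask, SIGCHLD);  /* mask used while sleeping */

    got_sigchld = 0;
    pid = fork();
    if (pid == 0)
        _exit(code);

    if (pid > 0) {
        while (!got_sigchld)
            sigsuspend(&waitmask);  /* atomically unblock and sleep */
        do
            got = waitpid(pid, &status, 0);
        while (got < 0 && errno == EINTR);
    }

    sigprocmask(SIG_SETMASK, &oldmask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);

    if (pid < 0 || got != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Whether the child exits before or after the parent reaches sigsuspend(), the pending signal is delivered there and the loop terminates.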
I wouldn't be surprised if this race somehow makes the failure rare.
I don't want to recompile any userland binaries at this stage, so it
would be nice if we could modify the kernel to keep track of exactly
how that race gets won and lost. Or perhaps there's an easy way to rig
the outcome one way or the other.
A race between syscall return due to child exit and signal delivery
seems unlikely, but maybe there is a race between syscall return due to
a timer firing and signal delivery. Are there any timers set to
periodically interrupt wait3?
Still no nearer to a solution - something smashes the stack near %sp,
causes the %a3 register restore after __GI___wait4_time64 to return a
wrong pointer to the stack canary, and triggers a stack smashing warning
in this indirect way. But what??
The actual corruption might offer a clue here. I believe the saved %a3
was clobbered with the value 0xefee1068 which seems to be a pointer into
some stack frame that would have come into existence shortly after __GI___wait4_time64 was called.
It looks like I messed up. waitproc() appears to have been invoked
twice, which is why wait3 was invoked twice...
GNU gdb (Debian 13.1-2) 13.1
...
(gdb) set osabi GNU/Linux
(gdb) file /bin/dash
Reading symbols from /bin/dash...
Reading symbols from /usr/lib/debug/.build-id/aa/4160f84f3eeee809c554cb9f3e1ef0686b8dcc.debug...
(gdb) b waitproc
Breakpoint 1 at 0xc346: file jobs.c, line 1168.
(gdb) b jobs.c:1180
Breakpoint 2 at 0xc390: file jobs.c, line 1180.
(gdb) run
Starting program: /usr/bin/dash
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/m68k-linux-gnu/libthread_db.so.1".
# x=$(:)
[Detaching after fork from child process 570]
Breakpoint 1, waitproc (status=0xeffff86a, block=1) at jobs.c:1168
1168 jobs.c: No such file or directory.
(gdb) c
Continuing.
Breakpoint 2, waitproc (status=0xeffff86a, block=1) at jobs.c:1180
1180 in jobs.c
(gdb) info locals
oldmask = {__val = {1997799424, 49154, 396623872, 184321, 3223896090, 53249,
3836788738, 1049411610, 867225601, 3094609920, 0, 1048580, 2857693183,
4184129547, 3435708442, 863764480, 184321, 3844141055, 4190425089,
4127248385, 3094659084, 597610497, 4135112705, 3844079616, 131072,
37355520, 184320, 3878473729, 3844132865, 3094663168, 3549089793,
3844132865}}
flags = 2
err = 570
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) c
Continuing.
Breakpoint 1, waitproc (status=0xeffff86a, block=0) at jobs.c:1168
1168 in jobs.c
(gdb) c
Continuing.
Breakpoint 2, waitproc (status=0xeffff86a, block=0) at jobs.c:1180
1180 in jobs.c
(gdb) info locals
oldmask = {__val = {1997799424, 49154, 396623872, 184321, 3223896090, 53249,
3836788738, 1049411610, 867225601, 3094609920, 0, 1048580, 2857693183,
4184129547, 3435708442, 863764480, 184321, 3844141055, 4190425089,
4127248385, 3094659084, 597610497, 4135112705, 3844079616, 131072,
37355520, 184320, 3878473729, 3844132865, 3094663168, 3549089793,
3844132865}}
flags = 3
err = -1
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) c
Continuing.
#
On 4/04/23 12:13, Finn Thain wrote:
It looks like I messed up. waitproc() appears to have been invoked
twice, which is why wait3 was invoked twice...
That means we may well see both signals delivered at the same time if the parent shell wasn't scheduled to run until the second subshell terminated (answering the question I was about to ask on your other mail, the one about the crashy script with multiple subshells).
Now does waitproc() handle that case correctly? The first signal
delivered results in err == child PID so the break is taken, causing
exit from waitproc().
Does waitproc() get called repeatedly until an error is returned?
On Wed, 5 Apr 2023, Michael Schmitz wrote:
That means we may well see both signals delivered at the same time if the
parent shell wasn't scheduled to run until the second subshell terminated
(answering the question I was about to ask on your other mail, the one
about the crashy script with multiple subshells).
How is that possible? If the parent does not get scheduled, the second
fork will not take place.
Now does waitproc() handle that case correctly? The first signal
delivered results in err == child PID so the break is taken, causing
exit from waitproc().
I don't follow. Can you rephrase that perhaps?
For a single subshell, the SIGCHLD signal can be delivered before wait4 is called or after it returns. For example, $(sleep 5) seems to produce the latter whereas $(:) tends to produce the former.
Does waitproc() get called repeatedly until an error is returned?
It's complicated...
https://sources.debian.org/src/dash/0.5.12-2/src/jobs.c/?hl=1122#L1122
I don't care that much what dash does as long as it isn't corrupting its
own stack, which is a real possibility, and one which gdb's data watch
point would normally resolve. And yet I have no way to tackle that.
I've been running gdb under QEMU, where the failure is not reproducible. Running dash under gdb on real hardware is doable (RAM permitting). But
the failure is intermittent even then -- it only happens during execution
of certain init scripts, and I can't reproduce it by manually running
those scripts.
(Even if I could reproduce the failure under gdb, instrumenting execution
in gdb can alter timing in undesirable ways...)
So, again, the best avenue I can think of for such experiments is to modify
the kernel to either keep track of the times of the wait4 syscalls and
signal delivery and/or push the timing one way or the other e.g. by
delaying signal delivery, altering scheduler behaviour, etc. But I don't
have code for that. I did try adding random delays around kernel_wait4()
but it didn't have any effect...
On 05.04.2023 at 14:00, Finn Thain wrote:
On Wed, 5 Apr 2023, Michael Schmitz wrote:
That means we may well see both signals delivered at the same time if
the parent shell wasn't scheduled to run until the second subshell
terminated (answering the question I was about to ask on your other
mail, the one about the crashy script with multiple subshells).
How is that possible? If the parent does not get scheduled, the second
fork will not take place.
I assumed subshells could run asynchronously, and the parent shell
continue until it hits a statement that needs the result of one of the subshells.
What is the point of subshells, if not to allow this?
Running dash under gdb on real hardware is doable (RAM permitting).
But the failure is intermittent even then -- it only happens during execution of certain init scripts, and I can't reproduce it by
manually running those scripts.
(Even if I could reproduce the failure under gdb, instrumenting
execution in gdb can alter timing in undesirable ways...)
So, again, the best avenue I can think of for such experiments is to
modify the kernel to either keep track of the times of the wait4
syscalls and
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages go
to the serial console though. Is that what you have in mind?
signal delivery and/or push the timing one way or the other e.g. by delaying signal delivery, altering scheduler behaviour, etc. But I
don't have code for that. I did try adding random delays around kernel_wait4() but it didn't have any effect...
I wonder whether it's possible to delay process exit (and parent process signaling) by placing the exit syscall on a timer workqueue. But the
same effect could be had by inserting a sleep before subshell exit ...
And causing a half-dead task to schedule in order to delay signaling
doesn't seem safe to me ...
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages go
to the serial console though. Is that what you have in mind?
When dash is feeling crashy, you can get results like this:
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~#
But when it's not feeling crashy, you can't:
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
The only way I have found to alter dash's inclination to crash is to
reboot. (I said previously I was unable to reproduce this in a single user mode shell but it turned out to be more subtle.)
That sounds like memory corruption somewhere else, e.g. in the buffer cache...
Hi Michael,
On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz wrote:
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages go
to the serial console though. Is that what you have in mind?
Store to RAM, retrieve through a new /proc file?
Gr{oetje,eeting}s,
Geert
So, again, the best avenue I can think of for such experiments to
modify the kernel to either keep track of the times of the wait4
syscalls and
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages go
to the serial console though. Is that what you have in mind?
What I had in mind was collecting measurements in such a way that would not
impact timing, perhaps by storing them somewhere they could be retrieved
from the process core dump.
But that's probably not realistic and it's probably pointless anyway -- I don't expect to find an old bug in common code like kernel/exit.c, or in a hot path like those in arch/m68k/kernel/entry.S.
More likely is that some kind of bug in dash causes it to corrupt its own stack when conditions are just right. I just need to figure out how to recreate those conditions. :-/
On Sun, 9 Apr 2023, Michael Schmitz wrote:
The only way I have found to alter dash's inclination to crash is to
reboot. (I said previously I was unable to reproduce this in a single
user mode shell but it turned out to be more subtle.)
I wonder what could change from one boot to another - can you have dash
(and its subshells) dump /proc/self/maps and see whether there's any
variation in that? But what we really need is the physical mappings. How
can we find those?
With the kernel RNG disabled, I would expect neither of these mappings
to change between boots?
It looks like the stack area still changes across invocations:
# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
eff9f000-effc0000 rw-p 00000000 00:00 0 [stack]
# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effd8000-efff9000 rw-p 00000000 00:00 0 [stack]
# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effdf000-f0000000 rw-p 00000000 00:00 0 [stack]
#
That can be disabled easily though (see below). I'll have to modify some
init scripts to find out what effect it has.
# setarch -R sh
# sh
# cat < /proc/self/maps
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effdf000-f0000000 rw-p 00000000 00:00 0 [stack]
# md5sum < /proc/self/maps
baacbaf944fb01d3200d924da7f7a815 -
# sh
# md5sum < /proc/self/maps
baacbaf944fb01d3200d924da7f7a815 -
# sh
# md5sum < /proc/self/maps
baacbaf944fb01d3200d924da7f7a815 -
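To compare only the stack mapping, rather than hashing the whole maps file, something like the following could be used. This is only a sketch: the helper name stack_range is made up for illustration, and it assumes the usual maps line layout (address range, permissions, offset, device, inode, pathname).

```python
def stack_range(maps_text):
    """Return (start, end) of the [stack] mapping, or None if there is none."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[stack]":
            start, end = (int(x, 16) for x in fields[0].split("-"))
            return start, end
    return None


# [stack] lines copied from the three listings taken with ASLR enabled.
samples = [
    "eff9f000-effc0000 rw-p 00000000 00:00 0 [stack]",
    "effd8000-efff9000 rw-p 00000000 00:00 0 [stack]",
    "effdf000-f0000000 rw-p 00000000 00:00 0 [stack]",
]
for text in samples:
    start, end = stack_range(text)
    print("%08x-%08x size=%#x" % (start, end, end - start))
```

On a live system the input would be the contents of /proc/self/maps. For the three samples, every mapping is 0x21000 bytes long and only the base address moves.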
On Tue, 4 Apr 2023, I wrote:
The actual corruption might offer a clue here. I believe the saved %a3
was clobbered with the value 0xefee1068 which seems to be a pointer into some stack frame that would have come into existence shortly after __GI___wait4_time64 was called.
Wrong... it is a pointer to the location below the __wait3 stack frame.
(gdb) info frame
Stack level 8, frame at 0xefee10e0:
pc = 0xc00e0172 in __wait3 (../sysdeps/unix/sysv/linux/wait3.c:41);
saved pc = 0xd000c38e
called by frame at 0xefee11dc, caller of frame at 0xefee106c
source language c.
Arglist at 0xefee10d8, args: stat_loc=<optimized out>,
options=<optimized out>, usage=<optimized out>
Locals at 0xefee10d8, Previous frame's sp is 0xefee10e0
Saved registers:
a2 at 0xefee106c, a3 at 0xefee1070, a5 at 0xefee1074, fp at 0xefee10d8,
pc at 0xefee10dc
That shows %a2 was saved at 0xefee106c, and the address of interest is the stack location immediately below that. But it has no particular
significance: it holds a NULL pointer when the struct __rusage64 *usage argument to __wait4_time64() gets pushed there:
0xc00e8152 <__wait3+226>: clrl %sp@-
0xc00e8154 <__wait3+228>: movel %fp@(12),%sp@-
0xc00e8158 <__wait3+232>: movel %d0,%sp@-
0xc00e815a <__wait3+234>: pea 0xffffffff
0xc00e815e <__wait3+238>: bsrl 0xc00e8174 <__GI___wait4_time64>
But it's no longer a NULL pointer at the time of the crash, though it
should be, since that stack frame is still active.
(gdb) x/16z 0xefee1068
0xefee1068: 0xc00e0172 0xd001e718 0xd001e498 0xd001b874
0xefee1078: 0x00170700 0x00170700 0x00170700 0x00005360
0xefee1088: 0x0000e920 0x00000006 0x00002000 0x00000002
0xefee1098: 0x00171f20 0x00171f20 0x00171f20 0x000000e0
Beats me.
On 09.04.2023 at 16:42, Finn Thain wrote:
On Sun, 9 Apr 2023, Michael Schmitz wrote:
The only way I have found to alter dash's inclination to crash is to
reboot. (I said previously I was unable to reproduce this in a
single user mode shell but it turned out to be more subtle.)
I wonder what could change from one boot to another - can you have
dash (and its subshells) dump /proc/self/maps and see whether there's
any variation in that? But what we really need is the physical
mappings. How can we find those?
With the kernel RNG disabled, I would expect neither of these
mappings to change between boots?
It looks like the stack area still changes across invocations:
Yep, but running the same commands in the same order across different
boots, does it still change?
(I'm making a huge assumption here - that the timing of the boot process, and
hence the evolution of the kernel RNG, is sufficiently deterministic. And
this might apply only to the shells run from sysvinit, since that requires
no keyboard input ...)
Looks like cat < /proc/self/maps | grep stack would give us enough information without overwhelming the serial console?
OTOH - if you can show the error is gone without stack address
randomization, that would be a hint maybe?
On Fri, 7 Apr 2023, Geert Uytterhoeven wrote:
The only way I have found to alter dash's inclination to crash is to reboot. (I said previously I was unable to reproduce this in a single user mode shell but it turned out to be more subtle.)
That sounds like memory corruption somewhere else, e.g. in the buffer cache...
If so, once the corruption showed up, you would expect the same crash next time...
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# echo 3 > /proc/sys/vm/drop_caches
[ 937.250000] bash (717): drop_caches: 3
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
*** stack smashing detected ***: terminated
Aborted (core dumped)
I'd say it's probably not buffer cache corruption causing this because we
can see two subshells fail, then just one.
For that build I enabled SLUB_DEBUG but forgot to enable SLUB_DEBUG_ON --
Looks like cat < /proc/self/maps | grep stack would give us enough
information without overwhelming the serial console?
OTOH - if you can show the error is gone without stack address
randomization, that would be a hint maybe?
The results below were produced with 'norandmaps' added to the kernel parameters to avoid ASLR.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# echo 3 > /proc/sys/vm/drop_caches
[ 913.560000] bash (1024): drop_caches: 3
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
root@debian:~# sh /etc/init.d/mountdevsubfs.sh start
*** stack smashing detected ***: terminated
Aborted (core dumped)
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "md5sum < /proc/self/maps"
baacbaf944fb01d3200d924da7f7a815 -
root@debian:~# sh -c "cat < /proc/self/maps"
c0000000-c0021000 r-xp 00000000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0021000-c0023000 rw-p 00000000 00:00 0
c0023000-c0024000 r--p 00021000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c0024000-c0026000 rw-p 00022000 08:06 38780 /usr/lib/m68k-linux-gnu/ld.so.1
c002a000-c0199000 r-xp 00000000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c0199000-c019a000 ---p 0016f000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019a000-c019c000 r--p 00170000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c019c000-c01a0000 rw-p 00172000 08:06 38786 /usr/lib/m68k-linux-gnu/libc.so.6
c01a0000-c01aa000 rw-p 00000000 00:00 0
d0000000-d0019000 r-xp 00000000 08:06 32713 /usr/bin/dash
d001b000-d001c000 r--p 00019000 08:06 32713 /usr/bin/dash
d001c000-d001d000 rw-p 0001a000 08:06 32713 /usr/bin/dash
d001d000-d001f000 rwxp 00000000 00:00 0 [heap]
d001f000-d0040000 rwxp 00000000 00:00 0 [heap]
effdf000-f0000000 rw-p 00000000 00:00 0 [stack]
So I guess this bug has more to do with timing and little to do with
state, contrary to my guesswork above. And no doubt I will have to
contradict myself again if/when it turns out that uninitialized memory is
a factor :-/
On 08.04.2023 at 00:06, Geert Uytterhoeven wrote:
On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz wrote:
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages
go to the serial console though. Is that what you have in mind?
Store to RAM, retrieve through a new /proc file?
Yes, that could be done, though I'd rather avoid duplicating a lot of
the generic message formatting code (printk and friends).
I'll have a look around ...
On Sun, 9 Apr 2023, Michael Schmitz wrote:
On 08.04.2023 at 00:06, Geert Uytterhoeven wrote:
On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz wrote:
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages
go to the serial console though. Is that what you have in mind?
Store to RAM, retrieve through a new /proc file?
Yes, that could be done, though I'd rather avoid duplicating a lot of
the generic message formatting code (printk and friends).
I'll have a look around ...
A better solution might be to port the existing instrumentation like
ftrace, kprobes, uprobes etc. Might be a lot of work though. I wonder
how portable that stuff is.
If you use printk, you could probably avoid most of the delays by enabling the dummy console. Then the kernel messages would be collected with dmesg, given a sufficiently large CONFIG_LOG_BUF_SHIFT. But it would be
inconvenient to have no serial console available for the usual purposes.
On 11.04.2023 at 12:20, Finn Thain wrote:
On Sun, 9 Apr 2023, Michael Schmitz wrote:
On 08.04.2023 at 00:06, Geert Uytterhoeven wrote:
On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz wrote:
The easiest way to do that is to log all wait and signal syscalls,
as well as process exit. That might alter timing if these log
messages go to the serial console though. Is that what you have in
mind?
Store to RAM, retrieve through a new /proc file?
Yes, that could be done, though I'd rather avoid duplicating a lot of
the generic message formatting code (printk and friends).
I'll have a look around ...
A better solution might be to port the existing instrumentation
like ftrace, kprobes, uprobes etc. Might be a lot of work though. I
wonder how portable that stuff is.
If you use printk, you could probably avoid most of the delays by
enabling the dummy console. Then the kernel messages would be
collected with dmesg, given a sufficiently large CONFIG_LOG_BUF_SHIFT.
But it would be inconvenient to have no serial console available for
the usual purposes.
Can we disable the serial console after boot, by registering the dummy console? Or will that just log messages to both?
Hi Michael,
On Tue, Apr 11, 2023 at 6:56 AM Michael Schmitz wrote:
On 11.04.2023 at 12:20, Finn Thain wrote:
On Sun, 9 Apr 2023, Michael Schmitz wrote:
On 08.04.2023 at 00:06, Geert Uytterhoeven wrote:
On Fri, Apr 7, 2023 at 3:58 AM Michael Schmitz wrote:
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages
go to the serial console though. Is that what you have in mind?
Store to RAM, retrieve through a new /proc file?
Yes, that could be done, though I'd rather avoid duplicating a lot of
the generic message formatting code (printk and friends).
I'll have a look around ...
A better solution might be to port the existing instrumentation like
ftrace, kprobes, uprobes etc. Might be a lot of work though. I wonder
how portable that stuff is.
If you use printk, you could probably avoid most of the delays by enabling
the dummy console. Then the kernel messages would be collected with dmesg,
given a sufficiently large CONFIG_LOG_BUF_SHIFT. But it would be
inconvenient to have no serial console available for the usual purposes.
Can we disable the serial console after boot, by registering the dummy
console? Or will that just log messages to both?
You can increase loglevel.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On 11.04.2023 at 19:19, Geert Uytterhoeven wrote:
On Tue, Apr 11, 2023 at 6:56 AM Michael Schmitz wrote:
On 11.04.2023 at 12:20, Finn Thain wrote:
A better solution might be to port the existing instrumentation like
ftrace, kprobes, uprobes etc. Might be a lot of work though. I wonder
how portable that stuff is.
Looking at a few arch implementations, I'm utterly confused. Wouldn't
know where to start.
I don't care that much what dash does as long as it isn't corrupting
its own stack, which is a real possibility, and one which gdb's data
watch point would normally resolve. And yet I have no way to tackle
that.
I've been running gdb under QEMU, where the failure is not reproducible. Running dash under gdb on real hardware is doable (RAM permitting). But
the failure is intermittent even then -- it only happens during
execution of certain init scripts, and I can't reproduce it by manually running those scripts.
(Even if I could reproduce the failure under gdb, instrumenting
execution in gdb can alter timing in undesirable ways...)
On 11.4.2023 11.24, Michael Schmitz wrote:
On 11.04.2023 at 19:19, Geert Uytterhoeven wrote:
On Tue, Apr 11, 2023 at 6:56 AM Michael Schmitz wrote:
On 11.04.2023 at 12:20, Finn Thain wrote:
A better solution might be to port the existing instrumentation like
ftrace, kprobes, uprobes etc. Might be a lot of work though. I wonder
how portable that stuff is.
Another possibility is just using manually added kernel tracepoints: https://www.kernel.org/doc/html/latest/trace/tracepoints.html
"git grep TRACE_EVENT" did not give any hits for arch/m68k/ though.
(I don't think those need any particular architecture support. One just
adds them to locations where it makes sense to trace activity that is
important, but which does not happen so often that the very small
overhead of a disabled static tracepoint becomes significant.)
Looking at a few arch implementations, I'm utterly confused. Wouldn't
know where to start.
Good article: https://lwn.net/Articles/132196/
(Ignore jprobes stuff as that's deprecated.)
Because a probe replaces the instruction at the probe point with a breakpoint
(https://www.kernel.org/doc/html/latest/trace/kprobes.html#how-does-a-kprobe-work),
it needs architecture-specific code to decode and simulate the replaced
instruction (its side effects) when it is later executed "out-of-line".
This also causes architecture-specific limitations on where the probe
can be set. An architecture may not support simulating all possible
instructions, just the most common ones (e.g. ones typically used at
function entry and exit points).
Besides that, AFAIK plain kprobes support should not have much
architecture specific stuff.
(I have no experience with kernel code, I just remember some LWN
articles about kprobes during past 2 decades, and I was involved with a
tool that did something similar in user-space, before Linux got uprobes support.)
As to what architecture to use as an example, I think it would be best
to ask tracing maintainers for advice, as there seems to be a lot of
code which has been copied from each other at different points in time,
that might be nowadays better as shared rather than under each arch...
These architectures support kprobes:
------------------------------------
$ grep " ok " Documentation/features/debug/kprobes/arch-support.txt
| arc: | ok |
| arm: | ok |
| arm64: | ok |
| csky: | ok |
| ia64: | ok |
| mips: | ok |
| parisc: | ok |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
| x86: | ok |
------------------------------------
But only these seem to have probing functionality isolated nicely to its
own directory:
------------------------------------
$ ls arch/*/kernel/probes/
arch/arm64/kernel/probes/:
decode-insn.c decode-insn.h kprobes.c kprobes_trampoline.S Makefile simulate-insn.c simulate-insn.h uprobes.c
arch/csky/kernel/probes/:
decode-insn.c decode-insn.h ftrace.c kprobes.c kprobes_trampoline.S Makefile simulate-insn.c simulate-insn.h uprobes.c
arch/riscv/kernel/probes/:
decode-insn.c decode-insn.h ftrace.c kprobes.c Makefile rethook.c rethook.h rethook_trampoline.S simulate-insn.c simulate-insn.h uprobes.c
------------------------------------
As to testing whether kprobes work, that could be easier with ftrace
support also present: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/ftrace/README
- Eero
PS. Looking at the kernel features list, m68k arch seems to be missing a
lot of other features too: https://www.kernel.org/doc/html/latest/m68k/features.html
I'm a bit surprised how many functions (symbol addresses) start with link.w instruction, instead of e.g. move* instruction.
Would signal delivery erase any of the memory immediately below the USP?
If so, it would erase those old stack frames, which would give some indication of the timing of signal delivery.
If I run dash under gdb under QEMU, I can break on entry to onsig() and
find the signal frame on the stack. But when I examine stack memory from
the core dump, I can't find 0x70774e40 (i.e. moveq __NR_sigreturn,%d0 ;
trap #0) which the kernel puts on the stack in my QEMU experiments.
That suggests that no signal was delivered... and yet gotsigchld == 1 at
the time of the coredump, after having been initialized by waitproc()
prior to calling __wait3(). So the signal handler onsig() must have
executed during __wait3() or __wait4_time64(). I can't explain this.
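As a cross-check on what is being searched for, the word 0x70774e40 can be decoded and scanned for mechanically. The sketch below makes several assumptions: the stack memory has already been extracted from the core dump into a bytes object by some other means (for instance with gdb's "dump binary memory" command), the data is big-endian as on m68k, and the helper names decode_trampoline and find_trampolines are made up for illustration.

```python
import struct

# The two opcodes of the sigreturn trampoline, stored big-endian as on m68k.
SIGRETURN_TRAMPOLINE = struct.pack(">I", 0x70774E40)


def decode_trampoline(word):
    """Decode a 32-bit word as 'moveq #imm,%dN' followed by 'trap #vector'."""
    hi, lo = word >> 16, word & 0xFFFF
    # moveq #imm8,%dN is encoded as 0111 nnn0 iiiiiiii
    if hi & 0xF100 != 0x7000:
        raise ValueError("first opcode is not moveq")
    # trap #vector is encoded as 0100 1110 0100 vvvv
    if lo & 0xFFF0 != 0x4E40:
        raise ValueError("second opcode is not trap")
    reg = (hi >> 9) & 7
    imm = hi & 0xFF  # treated as unsigned; fine for values below 128
    vector = lo & 0xF
    return reg, imm, vector


def find_trampolines(memory, base=0):
    """Yield every even address at which the trampoline bytes occur."""
    pos = memory.find(SIGRETURN_TRAMPOLINE)
    while pos != -1:
        if (base + pos) % 2 == 0:
            yield base + pos
        pos = memory.find(SIGRETURN_TRAMPOLINE, pos + 1)


reg, imm, vector = decode_trampoline(0x70774E40)
print("moveq #%d,%%d%d ; trap #%d" % (imm, reg, vector))
```

The immediate decodes to 119, the value written symbolically as __NR_sigreturn in the text. An empty result from find_trampolines() only shows that the pattern is absent from the bytes it was given.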
On 14.04.2023 at 21:30, Finn Thain wrote:
Would signal delivery erase any of the memory immediately below the
USP? If so, it would erase those old stack frames, which would give
some indication of the timing of signal delivery.
The signal stack is set up immediately below USP, from my reading of signal.c:setup_frame(). Old stack frames will be overwritten.
If I run dash under gdb under QEMU, I can break on entry to onsig()
and find the signal frame on the stack. But when I examine stack
memory from the core dump, I can't find 0x70774e40 (i.e. moveq __NR_sigreturn,%d0 ; trap #0) which the kernel puts on the stack in my
QEMU experiments.
As I understand this, the call to sys_sigreturn() removes both this code (signal trampoline IIRC) and the signal stack...
Again as far as I understand, the core dump happens on process exit.
Stack smashing is detected and process exit is forced only at exit from __wait3() or __wait4_time64(),
As I understand this, the call to sys_sigreturn() removes both this code
(signal trampoline IIRC) and the signal stack...
I don't see that stuff getting removed when I run dash under gdb under
QEMU. With breakpoints at the head of onsig() and the tail of __wait3(),
the memory under USP is the same when examined at either juncture.
The backtrace confirms that this signal was delivered during execution of __wait3(). (Delivery can happen during execution of __libc_fork() but I
just repeat the test until I get these ducks in a row.)
(gdb) c
Continuing.
# x=$(:)
[Detaching after fork from child process 1055]
Breakpoint 6.1, onsig (signo=17) at trap.c:286
286 trap.c: No such file or directory.
(gdb) bt
#0 onsig (signo=17) at trap.c:286
#1 <signal handler called>
#2 0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff86a, options=2,
usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
#3 0xc00e8164 in __GI___wait3_time64 (usage=0x0, options=<optimized out>,
stat_loc=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:26
#4 __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:35
#5 0xd000c38e in waitproc (status=0xeffff85a, block=1) at jobs.c:1179
#6 waitone (block=1, job=0xd001f618) at jobs.c:1055
#7 0xd000c5b8 in dowait (block=1, jp=0xd001f618) at jobs.c:1137
#8 0xd000ddb0 in waitforjob (jp=0xd001f618) at jobs.c:1014
#9 0xd000aade in expbackq (flag=68, cmd=0xd001e4c8 <stackbase+36>)
at expand.c:520
#10 argstr (p=<optimized out>, flag=68) at expand.c:335
#11 0xd000b5ce in expandarg (arg=0xd001e4e8 <stackbase+68>,
arglist=0xeffffb08, flag=4) at expand.c:192
#12 0xd0007e2a in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:855
#13 0xd0006ffc in evaltree (n=0xd001e4f8 <stackbase+84>, flags=0) at eval.c:300
#14 0xd000e3c0 in cmdloop (top=1) at main.c:246
#15 0xd0005018 in main (argc=<optimized out>, argv=<optimized out>)
at main.c:181
0xeffff750: 0xc01a0000 saved $a5 == libc .got
0xeffff74c: 0xc0023e8c saved $a3 == &__stack_chk_guard
0xeffff748: 0x00000000 saved $a2
0xeffff744: 0x00000001 saved $d5
0xeffff740: 0xeffff86e saved $d4
0xeffff73c: 0xeffff86a saved $d3
0xeffff738: 0x00000002 saved $d2
0xeffff734: 0x00000000
0xeffff730: 0x00000000
0xeffff72c: 0x00000000
0xeffff728: 0x00000000
0xeffff724: 0x00000000
0xeffff720: 0x00000000
0xeffff71c: 0x00000000
0xeffff718: 0x00000000
0xeffff714: 0x00000000
0xeffff710: 0x00000000
0xeffff70c: 0x00000000
0xeffff708: 0x00000000
0xeffff704: 0x00000000
0xeffff700: 0x00000000
0xeffff6fc: 0x00000000
0xeffff6f8: 0x00000000
0xeffff6f4: 0x00000000
0xeffff6f0: 0x00000000
0xeffff6ec: 0x00000000
0xeffff6e8: 0x00000000
0xeffff6e4: 0x00000000
0xeffff6e0: 0x00000000
0xeffff6dc: 0x00000000
0xeffff6d8: 0x00000000
0xeffff6d4: 0x00000000
0xeffff6d0: 0x00000000
0xeffff6cc: 0x00000000
0xeffff6c8: 0x00000000
0xeffff6c4: 0x00000000
0xeffff6c0: 0x00000000
0xeffff6bc: 0x00000000
0xeffff6b8: 0x00000000
0xeffff6b4: 0x00000000
0xeffff6b0: 0x00000000
0xeffff6ac: 0x00000000
0xeffff6a8: 0x00000000
0xeffff6a4: 0x00000000
0xeffff6a0: 0x00000000
0xeffff69c: 0x00000000
0xeffff698: 0x00000000
0xeffff694: 0x00000000
0xeffff690: 0x00000000
0xeffff68c: 0x00000000
0xeffff688: 0x00000000
0xeffff684: 0x00000000
0xeffff680: 0x00000000
0xeffff67c: 0x00000000
0xeffff678: 0x00000000
0xeffff674: 0x00000000
0xeffff670: 0x00000000
0xeffff66c: 0x00000000
0xeffff668: 0x00000000
0xeffff664: 0x00000000
0xeffff660: 0x41000000
0xeffff65c: 0x00000000
0xeffff658: 0x00000000
0xeffff654: 0x00000000
0xeffff650: 0x00000000
0xeffff64c: 0x80000000
0xeffff648: 0x3fff0000
0xeffff644: 0x00000000
0xeffff640: 0xd0000000
0xeffff63c: 0x40020000 <= (sc.formatvec & 0xffff) << 16; fpregs from here on
0xeffff638: 0x81b60080 <= (sc.pc & 0xffff) << 16 | sc.formatvec >> 16
0xeffff634: 0x0000c00e <= sc.sr << 16 | sc.pc >> 16
0xeffff630: 0xd001e4e3 saved a1? <= sc.a1
0xeffff62c: 0xc0028780 saved a0? <= sc.a0
0xeffff628: 0xffffffff saved d1? <= sc.d1
0xeffff624: 0x0000041f saved d0? <= sc.d0
0xeffff620: 0xeffff738 saved sp? <= sc.usp
0xeffff61c: 0x00000000 <= sc.mask
0xeffff618: 0x00000000 <= extramask
0xeffff614: 0x00000000 <= frame.retcode[1]
0xeffff610: 0x70774e40 moveq #119,%d0 ; trap #0
0xeffff60c: 0xeffff61c <= frame->sc
0xeffff608: 0x00000080 <= tregs->vector
0xeffff604: 0x00000011 <= signal no.
0xeffff600: 0xeffff610 return address
The above comes from dash running under gdb under qemu, which does not exhibit the failure but is convenient for that kind of experiment.
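The sigcontext words in that dump (0xeffff61c up to 0xeffff638) can be unpacked mechanically, which makes the fields that straddle word boundaries easier to read. This is a sketch under assumptions: the field order (sc_mask, sc_usp, sc_d0, sc_d1, sc_a0, sc_a1, a 16-bit sc_sr, a 32-bit sc_pc, a 16-bit sc_formatvec) is taken to be that of the m68k struct sigcontext with no padding, and decode_sigcontext is a made-up helper name.

```python
import struct

# Words copied from the dump, lowest address (0xeffff61c) first.
WORDS = [
    0x00000000,  # 0xeffff61c
    0xEFFFF738,  # 0xeffff620
    0x0000041F,  # 0xeffff624
    0xFFFFFFFF,  # 0xeffff628
    0xC0028780,  # 0xeffff62c
    0xD001E4E3,  # 0xeffff630
    0x0000C00E,  # 0xeffff634
    0x81B60080,  # 0xeffff638
]

FIELDS = ("sc_mask", "sc_usp", "sc_d0", "sc_d1", "sc_a0", "sc_a1",
          "sc_sr", "sc_pc", "sc_formatvec")


def decode_sigcontext(words):
    """Reassemble the words into bytes and unpack the assumed field layout."""
    raw = struct.pack(">8I", *words)
    # ">" means big-endian without padding, so the 32-bit sc_pc may start
    # on a 2-byte boundary right after the 16-bit sc_sr.
    values = struct.unpack(">6IHIH", raw)
    return dict(zip(FIELDS, values))


for name, value in decode_sigcontext(WORDS).items():
    print("%-12s 0x%x" % (name, value))
```

With this layout sc_pc comes out as 0xc00e81b6, which matches the address shown for frame #2 (__GI___wait4_time64) in the backtrace, and sc_usp as 0xeffff738.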
Again as far as I understand, the core dump happens on process exit.
Stack smashing is detected and process exit is forced only at exit from
__wait3() or __wait4_time64(),
I placed an illegal instruction in __wait3. This executes instead of the
call to __stack_chk_fail because that obliterates stack memory of
interest.
Consequently the latest core dump still contains dead stack frames (see below) of subroutines that returned before __wait3() dumped core. You can
see the return address for the branch to __wait4_time64() and below that
you can see the return address for the branch to __m68k_read_tp().
(gdb) disas __wait4_time64
Dump of assembler code for function __GI___wait4_time64:
0xc00e4174 <+0>: lea %sp@(-80),%sp
0xc00e4178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
0xc00e417c <+8>: lea %pc@(0xc019c000),%a5
0xc00e4184 <+16>: movel %sp@(116),%d2
0xc00e4188 <+20>: moveal %sp@(124),%a2
0xc00e418c <+24>: moveal %a5@(108),%a3
0xc00e4190 <+28>: movel %a3@,%sp@(104)
0xc00e4194 <+32>: bsrl 0xc0056e2c <__m68k_read_tp@plt>
I gather the signal was delivered before __wait4_time64+38, otherwise the return address 0xc00e419a (which appears below) would have been
overwritten by the signal frame. The signal must have been delivered after waitproc() initialized gotsigchld = 0 since gotsigchld is 1 at the time of the coredump.
I assume the %a3 corruption happened after __wait4_time64+8 because that's when %a3 first appears on the stack. And the corruption must have happened before __wait4_time64+238, which is when %a3 was restored.
If it was the signal which somehow corrupted the saved %a3, there's only a small window for that. The only syscall in that window is get_thread_area.
Here's some stack memory from the core dump.
0xeffff0dc: 0xd000c38e return address waitproc+124
0xeffff0d8: 0xd001c1ec frame 0 $fp == &suppressint
0xeffff0d4: 0x00add14b canary
0xeffff0d0: 0x00000000
0xeffff0cc: 0x0000000a
0xeffff0c8: 0x00000202
0xeffff0c4: 0x00000008
0xeffff0c0: 0x00000000
0xeffff0bc: 0x00000000
0xeffff0b8: 0x00000174
0xeffff0b4: 0x00000004
0xeffff0b0: 0x00000004
0xeffff0ac: 0x00000006
0xeffff0a8: 0x000000e0
0xeffff0a4: 0x000000e0
0xeffff0a0: 0x00171f20
0xeffff09c: 0x00171f20
0xeffff098: 0x00171f20
0xeffff094: 0x00000002
0xeffff090: 0x00002000
0xeffff08c: 0x00000006
0xeffff088: 0x0000e920
0xeffff084: 0x00005360
0xeffff080: 0x00170700
0xeffff07c: 0x00170700
0xeffff078: 0x00170700 frame 0 $fp - 96
0xeffff074: 0xd001b874 saved $a5 == dash .got
0xeffff070: 0xd001e498 saved $a3 == &dash_errno
0xeffff06c: 0xd001e718 frame 0 $sp saved $a2 == &gotsigchld
0xeffff068: 0x00000000
0xeffff064: 0x00000000
0xeffff060: 0xeffff11e
0xeffff05c: 0xffffffff
0xeffff058: 0xc00e4164 return address __wait3+244
0xeffff054: 0x00add14b canary
0xeffff050: 0x00000001
0xeffff04c: 0x00000004
0xeffff048: 0x0000000d
0xeffff044: 0x0000000d
0xeffff040: 0x0015ef82
0xeffff03c: 0x0015ef82
0xeffff038: 0x0015ef82
0xeffff034: 0x00000003
0xeffff030: 0x00000004
0xeffff02c: 0x00000004
0xeffff028: 0x00000140
0xeffff024: 0x00000140
0xeffff020: 0x00000034
0xeffff01c: 0x00000034
0xeffff018: 0x00000034
0xeffff014: 0x00000006
0xeffff010: 0x003b003a
0xeffff00c: 0x000a0028
0xeffff008: 0x00340020
0xeffff004: 0xc019c000 saved $a5 == libc .got
0xeffff000: 0xeffff068 saved $a3 (corrupted)
0xefffeffc: 0x00000000 saved $a2
0xefffeff8: 0x00000001 saved $d5
0xefffeff4: 0xeffff122 saved $d4
0xefffeff0: 0xeffff11e saved $d3
0xefffefec: 0x00000000 saved $d2
0xefffefe8: 0xc00e419a return address __GI___wait4_time64+38
0xefffefe4: 0xc0028780
0xefffefe0: 0x3c344bfb
0xefffefdc: 0x000af353
0xefffefd8: 0x3c340170
0xefffefd4: 0x00000000
0xefffefd0: 0xc00e417c
0xefffefcc: 0xc00e417e
0xefffefc8: 0xc00e4180
0xefffefc4: 0x48e73c34
0xefffefc0: 0x00000000
0xefffefbc: 0xefffeff8
0xefffefb8: 0xefffeffc
0xefffefb4: 0x4bfb0170
0xefffefb0: 0x0eee0709
0xefffefac: 0x00000000
0xefffefa8: 0x00000000
0xefffefa4: 0x00000000
0xefffefa0: 0x00000000
0xefffef9c: 0x00000000
0xefffef98: 0x00000000
0xefffef94: 0x00000000
0xefffef90: 0x00000000
0xefffef8c: 0x00000000
0xefffef88: 0x00000000
0xefffef84: 0x00000000
0xefffef80: 0x00000000
0xefffef7c: 0x00000000
0xefffef78: 0x00000000
0xefffef74: 0x00000000
0xefffef70: 0x00000000
0xefffef6c: 0x00000000
0xefffef68: 0x00000000
0xefffef64: 0x00000000
0xefffef60: 0x00000000
0xefffef5c: 0x00000000
0xefffef58: 0x00000000
0xefffef54: 0x00000000
0xefffef50: 0x00000000
0xefffef4c: 0x00000000
0xefffef48: 0x00000000
0xefffef44: 0x00000000
0xefffef40: 0x00000000
0xefffef3c: 0x00000000
0xefffef38: 0x00000000
0xefffef34: 0x00000000
0xefffef30: 0x00000000
0xefffef2c: 0x00000000
0xefffef28: 0x00000000
0xefffef24: 0x00000000
0xefffef20: 0x00000000
0xefffef1c: 0x00000000
0xefffef18: 0x00000000
0xefffef14: 0x00000000
0xefffef10: 0x7c0effff
0xefffef0c: 0xffffffff
0xefffef08: 0xaaaaaaaa
0xefffef04: 0xaf54eaaa
0xefffef00: 0x40040000
0xefffeefc: 0x40040000
0xefffeef8: 0x2b000000
0xefffeef4: 0x00000000
0xefffeef0: 0x00000000
0xefffeeec: 0x408ece9a
0xefffeee8: 0x00000000
0xefffeee4: 0xf0ff0000
0xefffeee0: 0x0f800000
0xefffeedc: 0xf0fff0ff
0xefffeed8: 0x1f380000
0xefffeed4: 0x00000000
0xefffeed0: 0x00000000
0xefffeecc: 0x00000000
0xefffeec8: 0xffffffff
0xefffeec4: 0xffffffff
0xefffeec0: 0x7fff0000
0xefffeebc: 0xffffffff
0xefffeeb8: 0xffffffff
0xefffeeb4: 0x7fff0000
The signal frame is not readily apparent (to me).
Also, stack memory in the core dump and stack memory as observed upon
signal handler breakpoint do not agree very well... I'd expect the values immediately below the stack pointer to be zeros once the signal frame was
put there.
I can't explain all of that unless it's discontiguous stack corruption.
On 16.04.2023 at 18:44, Finn Thain wrote:
The backtrace confirms that this signal was delivered during execution
of __wait3(). (Delivery can happen during execution of __libc_fork()
but I just repeat the test until I get these ducks in a row.)
(gdb) c
Continuing.
# x=$(:)
[Detaching after fork from child process 1055]
Breakpoint 6.1, onsig (signo=17) at trap.c:286
286 trap.c: No such file or directory.
(gdb) bt
#0 onsig (signo=17) at trap.c:286
#1 <signal handler called>
#2 0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff86a, options=2,
usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
#3 0xc00e8164 in __GI___wait3_time64 (usage=0x0, options=<optimized out>,
stat_loc=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:26
Where did that one come from? I don't think we saw __GI___wait3_time64
called in your disassembly of __wait3 ...
#4 __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:35
#5 0xd000c38e in waitproc (status=0xeffff85a, block=1) at jobs.c:1179
#6 waitone (block=1, job=0xd001f618) at jobs.c:1055
#7 0xd000c5b8 in dowait (block=1, jp=0xd001f618) at jobs.c:1137
#8 0xd000ddb0 in waitforjob (jp=0xd001f618) at jobs.c:1014
#9 0xd000aade in expbackq (flag=68, cmd=0xd001e4c8 <stackbase+36>)
at expand.c:520
#10 argstr (p=<optimized out>, flag=68) at expand.c:335
#11 0xd000b5ce in expandarg (arg=0xd001e4e8 <stackbase+68>,
arglist=0xeffffb08, flag=4) at expand.c:192
#12 0xd0007e2a in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:855
#13 0xd0006ffc in evaltree (n=0xd001e4f8 <stackbase+84>, flags=0) at eval.c:300
#14 0xd000e3c0 in cmdloop (top=1) at main.c:246
#15 0xd0005018 in main (argc=<optimized out>, argv=<optimized out>)
at main.c:181
0xeffff750: 0xc01a0000 saved $a5 == libc .got
0xeffff74c: 0xc0023e8c saved $a3 == &__stack_chk_guard
0xeffff748: 0x00000000 saved $a2
0xeffff744: 0x00000001 saved $d5
0xeffff740: 0xeffff86e saved $d4
0xeffff73c: 0xeffff86a saved $d3
0xeffff738: 0x00000002 saved $d2
0xeffff734: 0x00000000
0xeffff730: 0x00000000
0xeffff72c: 0x00000000
0xeffff728: 0x00000000
0xeffff724: 0x00000000
0xeffff720: 0x00000000
0xeffff71c: 0x00000000
0xeffff718: 0x00000000
0xeffff714: 0x00000000
0xeffff710: 0x00000000
0xeffff70c: 0x00000000
0xeffff708: 0x00000000
0xeffff704: 0x00000000
0xeffff700: 0x00000000
0xeffff6fc: 0x00000000
0xeffff6f8: 0x00000000
0xeffff6f4: 0x00000000
0xeffff6f0: 0x00000000
0xeffff6ec: 0x00000000
0xeffff6e8: 0x00000000
0xeffff6e4: 0x00000000
0xeffff6e0: 0x00000000
0xeffff6dc: 0x00000000
0xeffff6d8: 0x00000000
0xeffff6d4: 0x00000000
0xeffff6d0: 0x00000000
0xeffff6cc: 0x00000000
0xeffff6c8: 0x00000000
0xeffff6c4: 0x00000000
0xeffff6c0: 0x00000000
0xeffff6bc: 0x00000000
0xeffff6b8: 0x00000000
0xeffff6b4: 0x00000000
0xeffff6b0: 0x00000000
0xeffff6ac: 0x00000000
0xeffff6a8: 0x00000000
0xeffff6a4: 0x00000000
0xeffff6a0: 0x00000000
0xeffff69c: 0x00000000
0xeffff698: 0x00000000
0xeffff694: 0x00000000
0xeffff690: 0x00000000
0xeffff68c: 0x00000000
0xeffff688: 0x00000000
0xeffff684: 0x00000000
0xeffff680: 0x00000000
0xeffff67c: 0x00000000
0xeffff678: 0x00000000
0xeffff674: 0x00000000
0xeffff670: 0x00000000
0xeffff66c: 0x00000000
0xeffff668: 0x00000000
0xeffff664: 0x00000000
0xeffff660: 0x41000000
0xeffff65c: 0x00000000
0xeffff658: 0x00000000
0xeffff654: 0x00000000
0xeffff650: 0x00000000
0xeffff64c: 0x80000000
0xeffff648: 0x3fff0000
0xeffff644: 0x00000000
0xeffff640: 0xd0000000
0xeffff63c: 0x40020000 <= (sc.formatvec & 0xffff) << 16; fpregs from here on
0xeffff638: 0x81b60080 <= (sc.pc & 0xffff) << 16 | sc.formatvec >> 16
0xeffff634: 0x0000c00e <= sc.sr << 16 | sc.pc >> 16
0xeffff630: 0xd001e4e3 <= sc.a1
0xeffff62c: 0xc0028780 <= sc.a0
0xeffff628: 0xffffffff <= sc.d1
0xeffff624: 0x0000041f <= sc.d0
0xeffff620: 0xeffff738 <= sc.usp
0xeffff61c: 0x00000000 <= sc.mask
0xeffff618: 0x00000000 <= extramask
0xeffff614: 0x00000000 <= frame.retcode[1]
0xeffff610: 0x70774e40 moveq #119,%d0 ; trap #0
0xeffff60c: 0xeffff61c <= frame->sc
0xeffff608: 0x00000080 <= tregs->vector
0xeffff604: 0x00000011 <= signal no.
0xeffff600: 0xeffff610 return address
The above comes from dash running under gdb under qemu, which does not exhibit the failure but is convenient for that kind of experiment.
I would have expected to see a different signal trampoline (for sys_rt_sigreturn) ...
But anyway:
The saved pc is 0xc00e81b6 which does match the backtrace above. Vector
offset 0x80 matches trap 0 which suggests 0xc00e81b6 should be the
instruction after a trap 0 instruction. d0 is 1055 which is not a signal
number I recognize.
Again as far as I understand, the core dump happens on process exit.
Stack smashing is detected and process exit is forced only at exit
from __wait3() or __wait4_time64(),
I placed an illegal instruction in __wait3. This executes instead of
the call to __stack_chk_fail because that obliterates stack memory of interest.
OK.
Consequently the latest core dump still contains dead stack frames
(see below) of subroutines that returned before __wait3() dumped core.
You can see the return address for the branch to __wait4_time64() and
below that you can see the return address for the branch to __m68k_read_tp().
(gdb) disas __wait4_time64
Dump of assembler code for function __GI___wait4_time64:
0xc00e4174 <+0>: lea %sp@(-80),%sp
0xc00e4178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
0xc00e417c <+8>: lea %pc@(0xc019c000),%a5
0xc00e4184 <+16>: movel %sp@(116),%d2
0xc00e4188 <+20>: moveal %sp@(124),%a2
0xc00e418c <+24>: moveal %a5@(108),%a3
0xc00e4190 <+28>: movel %a3@,%sp@(104)
0xc00e4194 <+32>: bsrl 0xc0056e2c <__m68k_read_tp@plt>
I gather the signal was delivered before __wait4_time64+38, otherwise
the return address 0xc00e419a (which appears below) would have been overwritten by the signal frame. The signal must have been delivered
after waitproc() initialized gotsigchld = 0 since gotsigchld is 1 at
the time of the coredump.
I assume the %a3 corruption happened after __wait4_time64+8 because
that's when %a3 first appears on the stack. And the corruption must
have happened before __wait4_time64+238, which is when %a3 was
restored.
If it was the signal which somehow corrupted the saved %a3, there's
only a small window for that. The only syscall in that window is get_thread_area.
I see sys_wait4 called in two places (0xc00e01b4, and then 0xc00e0286 depending on the return code of the first). The second one again would
have called __m68k_read_tp so would have left a return address on the
stack (0xc00e02d2). Leaves the first.
Here's some stack memory from the core dump.
0xeffff0dc: 0xd000c38e return address waitproc+124
0xeffff0d8: 0xd001c1ec frame 0 $fp == &suppressint
0xeffff0d4: 0x00add14b canary
0xeffff0d0: 0x00000000
0xeffff0cc: 0x0000000a
0xeffff0c8: 0x00000202
0xeffff0c4: 0x00000008
0xeffff0c0: 0x00000000
0xeffff0bc: 0x00000000
0xeffff0b8: 0x00000174
0xeffff0b4: 0x00000004
0xeffff0b0: 0x00000004
0xeffff0ac: 0x00000006
0xeffff0a8: 0x000000e0
0xeffff0a4: 0x000000e0
0xeffff0a0: 0x00171f20
0xeffff09c: 0x00171f20
0xeffff098: 0x00171f20
0xeffff094: 0x00000002
0xeffff090: 0x00002000
0xeffff08c: 0x00000006
0xeffff088: 0x0000e920
0xeffff084: 0x00005360
0xeffff080: 0x00170700
0xeffff07c: 0x00170700
0xeffff078: 0x00170700 frame 0 $fp - 96
0xeffff074: 0xd001b874 saved $a5 == dash .got
0xeffff070: 0xd001e498 saved $a3 == &dash_errno
0xeffff06c: 0xd001e718 frame 0 $sp saved $a2 == &gotsigchld
0xeffff068: 0x00000000
0xeffff064: 0x00000000
0xeffff060: 0xeffff11e
0xeffff05c: 0xffffffff
0xeffff058: 0xc00e4164 return address __wait3+244
0xeffff054: 0x00add14b canary
0xeffff050: 0x00000001
0xeffff04c: 0x00000004
0xeffff048: 0x0000000d
0xeffff044: 0x0000000d
0xeffff040: 0x0015ef82
0xeffff03c: 0x0015ef82
0xeffff038: 0x0015ef82
0xeffff034: 0x00000003
0xeffff030: 0x00000004
0xeffff02c: 0x00000004
0xeffff028: 0x00000140
0xeffff024: 0x00000140
0xeffff020: 0x00000034
0xeffff01c: 0x00000034
0xeffff018: 0x00000034
0xeffff014: 0x00000006
0xeffff010: 0x003b003a
0xeffff00c: 0x000a0028
0xeffff008: 0x00340020
0xeffff004: 0xc019c000 saved $a5 == libc .got
0xeffff000: 0xeffff068 saved $a3 (corrupted)
0xefffeffc: 0x00000000 saved $a2
0xefffeff8: 0x00000001 saved $d5
0xefffeff4: 0xeffff122 saved $d4
0xefffeff0: 0xeffff11e saved $d3
0xefffefec: 0x00000000 saved $d2
0xefffefe8: 0xc00e419a return address __GI___wait4_time64+38
0xefffefe4: 0xc0028780
0xefffefe0: 0x3c344bfb
0xefffefdc: 0x000af353
0xefffefd8: 0x3c340170
0xefffefd4: 0x00000000
0xefffefd0: 0xc00e417c
0xefffefcc: 0xc00e417e
0xefffefc8: 0xc00e4180
0xefffefc4: 0x48e73c34
0xefffefc0: 0x00000000
0xefffefbc: 0xefffeff8
0xefffefb8: 0xefffeffc
0xefffefb4: 0x4bfb0170
0xefffefb0: 0x0eee0709
0xefffefac: 0x00000000
0xefffefa8: 0x00000000
0xefffefa4: 0x00000000
0xefffefa0: 0x00000000
0xefffef9c: 0x00000000
0xefffef98: 0x00000000
0xefffef94: 0x00000000
0xefffef90: 0x00000000
0xefffef8c: 0x00000000
0xefffef88: 0x00000000
0xefffef84: 0x00000000
0xefffef80: 0x00000000
0xefffef7c: 0x00000000
0xefffef78: 0x00000000
0xefffef74: 0x00000000
0xefffef70: 0x00000000
0xefffef6c: 0x00000000
0xefffef68: 0x00000000
0xefffef64: 0x00000000
0xefffef60: 0x00000000
0xefffef5c: 0x00000000
0xefffef58: 0x00000000
0xefffef54: 0x00000000
0xefffef50: 0x00000000
0xefffef4c: 0x00000000
0xefffef48: 0x00000000
0xefffef44: 0x00000000
0xefffef40: 0x00000000
0xefffef3c: 0x00000000
0xefffef38: 0x00000000
0xefffef34: 0x00000000
0xefffef30: 0x00000000
0xefffef2c: 0x00000000
0xefffef28: 0x00000000
0xefffef24: 0x00000000
0xefffef20: 0x00000000
0xefffef1c: 0x00000000
0xefffef18: 0x00000000
0xefffef14: 0x00000000
0xefffef10: 0x7c0effff
0xefffef0c: 0xffffffff
0xefffef08: 0xaaaaaaaa
0xefffef04: 0xaf54eaaa
0xefffef00: 0x40040000
0xefffeefc: 0x40040000
0xefffeef8: 0x2b000000
0xefffeef4: 0x00000000
0xefffeef0: 0x00000000
0xefffeeec: 0x408ece9a
0xefffeee8: 0x00000000
0xefffeee4: 0xf0ff0000
0xefffeee0: 0x0f800000
0xefffeedc: 0xf0fff0ff
0xefffeed8: 0x1f380000
0xefffeed4: 0x00000000
0xefffeed0: 0x00000000
0xefffeecc: 0x00000000
0xefffeec8: 0xffffffff
0xefffeec4: 0xffffffff
0xefffeec0: 0x7fff0000
0xefffeebc: 0xffffffff
0xefffeeb8: 0xffffffff
0xefffeeb4: 0x7fff0000 sc_formatvec
The signal frame is not readily apparent (to me).
From looking at the above stack dump, sc ought to start at 0xefffee90,
and the trampoline would be three words below that.
The last address you show corresponds to 0xeffff640 in first dump above, which is at the start of the saved fpregs. I'd say we just miss the
beginning of the signal frame?
(My reasoning is that copy_siginfo_to_user clears the end of the signal stack, which is what we can see in both cases.)
Can't explain the 14 words below the saved return address though.
On Tue, 18 Apr 2023, Michael Schmitz wrote:
On 16.04.2023 at 18:44, Finn Thain wrote:
The backtrace confirms that this signal was delivered during execution
of __wait3(). (Delivery can happen during execution of __libc_fork()
but I just repeat the test until I get these ducks in a row.)
(gdb) c
Continuing.
# x=$(:)
[Detaching after fork from child process 1055]
Breakpoint 6.1, onsig (signo=17) at trap.c:286
286 trap.c: No such file or directory.
(gdb) bt
#0 onsig (signo=17) at trap.c:286
#1 <signal handler called>
#2 0xc00e81b6 in __GI___wait4_time64 (pid=-1, stat_loc=0xeffff86a,
options=2,
usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:35
#3 0xc00e8164 in __GI___wait3_time64 (usage=0x0, options=<optimized out>,
stat_loc=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:26
Where did that one come from? I don't think we saw __GI___wait3_time64
called in your disassembly of __wait3 ...
#4 __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:35
Well spotted. However, it turns out there is a good explanation for that:
(gdb) print __GI___wait3_time64
$3 = {pid_t (int *, int, struct __rusage64 *)} 0xc00e4054 <__GI___wait3_time64>
(gdb) disass __GI___wait3_time64
Dump of assembler code for function __GI___wait3_time64:
0xc00e4054 <+0>: movel %sp@(12),%sp@-
0xc00e4058 <+4>: movel %sp@(12),%sp@-
0xc00e405c <+8>: movel %sp@(12),%sp@-
0xc00e4060 <+12>: pea 0xffffffff
0xc00e4064 <+16>: bsrl 0xc00e4174 <__GI___wait4_time64>
0xc00e406a <+22>: lea %sp@(16),%sp
0xc00e406e <+26>: rts
End of assembler dump.
(gdb) print __wait3
$2 = {pid_t (int *, int, struct rusage *)} 0xc00e4070 <__wait3>
(gdb) disass __wait3
Dump of assembler code for function __wait3:
0xc00e4070 <+0>: linkw %fp,#-96
0xc00e4074 <+4>: moveml %a2-%a3/%a5,%sp@-
0xc00e4078 <+8>: lea %pc@(0xc019c000),%a5
0xc00e4080 <+16>: movel %fp@(8),%d0
0xc00e4084 <+20>: moveal %fp@(16),%a2
0xc00e4088 <+24>: moveal %a5@(108),%a3
0xc00e408c <+28>: movel %a3@,%fp@(-4)
0xc00e4090 <+32>: tstl %a2
0xc00e4092 <+34>: beqw 0xc00e4152 <__wait3+226>
0xc00e4096 <+38>: pea %fp@(-92)
0xc00e409a <+42>: movel %fp@(12),%sp@-
0xc00e409e <+46>: movel %d0,%sp@-
0xc00e40a0 <+48>: pea 0xffffffff
0xc00e40a4 <+52>: bsrl 0xc00e4174 <__GI___wait4_time64>
0xc00e40aa <+58>: lea %sp@(16),%sp
0xc00e40ae <+62>: tstl %d0
0xc00e40b0 <+64>: bgts 0xc00e40c8 <__wait3+88>
0xc00e40b2 <+66>: moveal %fp@(-4),%a0
0xc00e40b6 <+70>: movel %a3@,%d1
0xc00e40b8 <+72>: cmpl %a0,%d1
0xc00e40ba <+74>: bnew 0xc00e416c <__wait3+252>
0xc00e40be <+78>: moveml %fp@(-108),%a2-%a3/%a5
0xc00e40c4 <+84>: unlk %fp
0xc00e40c6 <+86>: rts
0xc00e40c8 <+88>: pea 0x44
0xc00e40cc <+92>: clrl %sp@-
0xc00e40ce <+94>: pea %a2@(4)
0xc00e40d2 <+98>: movel %d0,%fp@(-96)
0xc00e40d6 <+102>: bsrl 0xc00bc850 <__GI_memset>
0xc00e40dc <+108>: movel %fp@(-88),%a2@
0xc00e40e0 <+112>: movel %fp@(-80),%a2@(4)
0xc00e40e6 <+118>: movel %fp@(-72),%a2@(8)
0xc00e40ec <+124>: movel %fp@(-64),%a2@(12)
0xc00e40f2 <+130>: movel %fp@(-60),%a2@(16)
0xc00e40f8 <+136>: movel %fp@(-56),%a2@(20)
0xc00e40fe <+142>: movel %fp@(-52),%a2@(24)
0xc00e4104 <+148>: movel %fp@(-48),%a2@(28)
0xc00e410a <+154>: movel %fp@(-44),%a2@(32)
0xc00e4110 <+160>: movel %fp@(-40),%a2@(36)
0xc00e4116 <+166>: movel %fp@(-36),%a2@(40)
0xc00e411c <+172>: movel %fp@(-32),%a2@(44)
0xc00e4122 <+178>: movel %fp@(-28),%a2@(48)
0xc00e4128 <+184>: movel %fp@(-24),%a2@(52)
0xc00e412e <+190>: movel %fp@(-20),%a2@(56)
0xc00e4134 <+196>: movel %fp@(-16),%a2@(60)
0xc00e413a <+202>: movel %fp@(-12),%a2@(64)
0xc00e4140 <+208>: movel %fp@(-8),%a2@(68)
0xc00e4146 <+214>: lea %sp@(12),%sp
0xc00e414a <+218>: movel %fp@(-96),%d0
0xc00e414e <+222>: braw 0xc00e40b2 <__wait3+66>
0xc00e4152 <+226>: clrl %sp@-
0xc00e4154 <+228>: movel %fp@(12),%sp@-
0xc00e4158 <+232>: movel %d0,%sp@-
0xc00e415a <+234>: pea 0xffffffff
0xc00e415e <+238>: bsrl 0xc00e4174 <__GI___wait4_time64>
0xc00e4164 <+244>: lea %sp@(16),%sp
0xc00e4168 <+248>: braw 0xc00e40b2 <__wait3+66>
0xc00e416c <+252>: illegal
End of assembler dump.
(gdb) info frame
Stack level 3, frame at 0xeffff82c:
pc = 0xc00e8164 in __GI___wait3_time64
(../sysdeps/unix/sysv/linux/wait3.c:26); saved pc = 0xd000c38e
inlined into frame 4, caller of frame at 0xeffff7a8
source language c.
Arglist at unknown address.
Locals at unknown address, Previous frame's sp is 0xeffff7a8
Saved registers:
d2 at 0xeffff738, d3 at 0xeffff73c, d4 at 0xeffff740, d5 at 0xeffff744,
a2 at 0xeffff748, a3 at 0xeffff74c, a5 at 0xeffff750, pc at 0xeffff7a4
(gdb) up
#4 __wait3 (stat_loc=<optimized out>, options=<optimized out>,
usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait3.c:35
35 in ../sysdeps/unix/sysv/linux/wait3.c
(gdb) info frame
Stack level 4, frame at 0xeffff82c:
pc = 0xc00e8164 in __wait3 (../sysdeps/unix/sysv/linux/wait3.c:35);
saved pc = 0xd000c38e
called by frame at 0xeffff928, caller of frame at 0xeffff82c
source language c.
Arglist at 0xeffff824, args: stat_loc=<optimized out>,
options=<optimized out>, usage=<optimized out>
Locals at 0xeffff824, Previous frame's sp is 0xeffff82c
Saved registers:
a2 at 0xeffff7b8, a3 at 0xeffff7bc, a5 at 0xeffff7c0, fp at 0xeffff824,
pc at 0xeffff828
Note that frame 3 was "inlined into frame 4". The inlined code can be seen
above at 0xc00e4154. So the backtrace is misleading inasmuch as it
represents the source code rather than the disassembly.
#5 0xd000c38e in waitproc (status=0xeffff85a, block=1) at jobs.c:1179
#6 waitone (block=1, job=0xd001f618) at jobs.c:1055
#7 0xd000c5b8 in dowait (block=1, jp=0xd001f618) at jobs.c:1137
#8 0xd000ddb0 in waitforjob (jp=0xd001f618) at jobs.c:1014
#9 0xd000aade in expbackq (flag=68, cmd=0xd001e4c8 <stackbase+36>)
at expand.c:520
#10 argstr (p=<optimized out>, flag=68) at expand.c:335
#11 0xd000b5ce in expandarg (arg=0xd001e4e8 <stackbase+68>,
arglist=0xeffffb08, flag=4) at expand.c:192
#12 0xd0007e2a in evalcommand (cmd=<optimized out>, flags=<optimized out>)
at eval.c:855
#13 0xd0006ffc in evaltree (n=0xd001e4f8 <stackbase+84>, flags=0) at
eval.c:300
#14 0xd000e3c0 in cmdloop (top=1) at main.c:246
#15 0xd0005018 in main (argc=<optimized out>, argv=<optimized out>)
at main.c:181
0xeffff750: 0xc01a0000 saved $a5 == libc .got
0xeffff74c: 0xc0023e8c saved $a3 == &__stack_chk_guard
0xeffff748: 0x00000000 saved $a2
0xeffff744: 0x00000001 saved $d5
0xeffff740: 0xeffff86e saved $d4
0xeffff73c: 0xeffff86a saved $d3
0xeffff738: 0x00000002 saved $d2
0xeffff734: 0x00000000
0xeffff730: 0x00000000
0xeffff72c: 0x00000000
0xeffff728: 0x00000000
0xeffff724: 0x00000000
0xeffff720: 0x00000000
0xeffff71c: 0x00000000
0xeffff718: 0x00000000
0xeffff714: 0x00000000
0xeffff710: 0x00000000
0xeffff70c: 0x00000000
0xeffff708: 0x00000000
0xeffff704: 0x00000000
0xeffff700: 0x00000000
0xeffff6fc: 0x00000000
0xeffff6f8: 0x00000000
0xeffff6f4: 0x00000000
0xeffff6f0: 0x00000000
0xeffff6ec: 0x00000000
0xeffff6e8: 0x00000000
0xeffff6e4: 0x00000000
0xeffff6e0: 0x00000000
0xeffff6dc: 0x00000000
0xeffff6d8: 0x00000000
0xeffff6d4: 0x00000000
0xeffff6d0: 0x00000000
0xeffff6cc: 0x00000000
0xeffff6c8: 0x00000000
0xeffff6c4: 0x00000000
0xeffff6c0: 0x00000000
0xeffff6bc: 0x00000000
0xeffff6b8: 0x00000000
0xeffff6b4: 0x00000000
0xeffff6b0: 0x00000000
0xeffff6ac: 0x00000000
0xeffff6a8: 0x00000000
0xeffff6a4: 0x00000000
0xeffff6a0: 0x00000000
0xeffff69c: 0x00000000
0xeffff698: 0x00000000
0xeffff694: 0x00000000
0xeffff690: 0x00000000
0xeffff68c: 0x00000000
0xeffff688: 0x00000000
0xeffff684: 0x00000000
0xeffff680: 0x00000000
0xeffff67c: 0x00000000
0xeffff678: 0x00000000
0xeffff674: 0x00000000
0xeffff670: 0x00000000
0xeffff66c: 0x00000000
0xeffff668: 0x00000000
0xeffff664: 0x00000000
0xeffff660: 0x41000000
0xeffff65c: 0x00000000
0xeffff658: 0x00000000
0xeffff654: 0x00000000
0xeffff650: 0x00000000
0xeffff64c: 0x80000000
0xeffff648: 0x3fff0000
0xeffff644: 0x00000000
0xeffff640: 0xd0000000
0xeffff63c: 0x40020000 <= (sc.formatvec & 0xffff) << 16; fpregs from here on
0xeffff638: 0x81b60080 <= (sc.pc & 0xffff) << 16 | sc.formatvec >> 16
0xeffff634: 0x0000c00e <= sc.sr << 16 | sc.pc >> 16
0xeffff630: 0xd001e4e3 <= sc.a1
0xeffff62c: 0xc0028780 <= sc.a0
0xeffff628: 0xffffffff <= sc.d1
0xeffff624: 0x0000041f <= sc.d0
0xeffff620: 0xeffff738 <= sc.usp
0xeffff61c: 0x00000000 <= sc.mask
0xeffff618: 0x00000000 <= extramask
0xeffff614: 0x00000000 <= frame.retcode[1]
0xeffff610: 0x70774e40 moveq #119,%d0 ; trap #0
0xeffff60c: 0xeffff61c <= frame->sc
0xeffff608: 0x00000080 <= tregs->vector
0xeffff604: 0x00000011 <= signal no.
0xeffff600: 0xeffff610 return address
The above comes from dash running under gdb under qemu, which does not
exhibit the failure but is convenient for that kind of experiment.
I would have expected to see a different signal trampoline (for
sys_rt_sigreturn) ...
Well, this seems to be the trampoline from setup_frame() and not setup_rt_frame().
But anyway:
The saved pc is 0xc00e81b6 which does match the backtrace above. Vector
offset 0x80 matches trap 0 which suggests 0xc00e81b6 should be the
instruction after a trap 0 instruction. d0 is 1055 which is not a signal
number I recognize.
I don't know what d0 represents here. But frame->sig == 0x11 is correct (SIGCHLD).
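For anyone who wants to re-check the annotations on the QEMU dump, here is a minimal sketch of how the two packed words at 0xeffff634 and 0xeffff638 split into sc_sr, sc_pc and sc_formatvec. It assumes sc_sr and sc_formatvec are 16-bit fields and sc_pc is 32 bits with no padding between them; the helper name is made up for illustration. It also prints d0 in decimal: 1055 is the same number as the child PID in the "Detaching after fork" line, so it may simply be a syscall return value, but that reading is an inference and not something established in this thread.

```python
def unpack_sr_pc_formatvec(word_lo, word_hi):
    """Split two consecutive big-endian 32-bit stack words into
    (sc_sr, sc_pc, sc_formatvec). Hypothetical helper; assumes the
    fields are packed as 16 + 32 + 16 bits without padding."""
    sc_sr = word_lo >> 16
    sc_pc = ((word_lo & 0xFFFF) << 16) | (word_hi >> 16)
    sc_formatvec = word_hi & 0xFFFF
    return sc_sr, sc_pc, sc_formatvec


# Words taken from the QEMU dump at 0xeffff634 and 0xeffff638.
sc_sr, sc_pc, sc_formatvec = unpack_sr_pc_formatvec(0x0000C00E, 0x81B60080)

# The low 12 bits of the format/vector word are the vector offset,
# which is four times the vector number. TRAP #0..#15 use vectors 32..47.
vector_offset = sc_formatvec & 0x0FFF
vector_number = vector_offset // 4
trap_number = vector_number - 32

sc_d0 = 0x0000041F  # word at 0xeffff624

print(f"sc_sr={sc_sr:#06x} sc_pc={sc_pc:#010x} vector offset={vector_offset:#x}")
print(f"vector {vector_number} -> trap #{trap_number}, d0={sc_d0}")
```

The result agrees with the saved pc shown in frame #2 of the backtrace and with the "trap 0" reading of the vector offset.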
Again as far as I understand, the core dump happens on process exit.
Stack smashing is detected and process exit is forced only at exit
from __wait3() or __wait4_time64(),
I placed an illegal instruction in __wait3. This executes instead of
the call to __stack_chk_fail because that obliterates stack memory of
interest.
OK.
Consequently the latest core dump still contains dead stack frames
(see below) of subroutines that returned before __wait3() dumped core.
You can see the return address for the branch to __wait4_time64() and
below that you can see the return address for the branch to
__m68k_read_tp().
(gdb) disas __wait4_time64
Dump of assembler code for function __GI___wait4_time64:
0xc00e4174 <+0>: lea %sp@(-80),%sp
0xc00e4178 <+4>: moveml %d2-%d5/%a2-%a3/%a5,%sp@-
0xc00e417c <+8>: lea %pc@(0xc019c000),%a5
0xc00e4184 <+16>: movel %sp@(116),%d2
0xc00e4188 <+20>: moveal %sp@(124),%a2
0xc00e418c <+24>: moveal %a5@(108),%a3
0xc00e4190 <+28>: movel %a3@,%sp@(104)
0xc00e4194 <+32>: bsrl 0xc0056e2c <__m68k_read_tp@plt>
I gather the signal was delivered before __wait4_time64+38, otherwise
the return address 0xc00e419a (which appears below) would have been
overwritten by the signal frame. The signal must have been delivered
after waitproc() initialized gotsigchld = 0 since gotsigchld is 1 at
the time of the coredump.
I assume the %a3 corruption happened after __wait4_time64+8 because
that's when %a3 first appears on the stack. And the corruption must
have happened before __wait4_time64+238, which is when %a3 was
restored.
If it was the signal which somehow corrupted the saved %a3, there's
only a small window for that. The only syscall in that window is
get_thread_area.
I see sys_wait4 called in two places (0xc00e01b4, and then 0xc00e0286
depending on the return code of the first). The second one again would
have called __m68k_read_tp so would have left a return address on the
stack (0xc00e02d2). Leaves the first.
That's why my analysis stopped at __wait4_time64+38: the rest of __wait4_time64 is not relevant to the dead stack contents. (It would have left a different return address in that memory location.)
Here's some stack memory from the core dump.
0xeffff0dc: 0xd000c38e return address waitproc+124
0xeffff0d8: 0xd001c1ec frame 0 $fp == &suppressint
0xeffff0d4: 0x00add14b canary
0xeffff0d0: 0x00000000
0xeffff0cc: 0x0000000a
0xeffff0c8: 0x00000202
0xeffff0c4: 0x00000008
0xeffff0c0: 0x00000000
0xeffff0bc: 0x00000000
0xeffff0b8: 0x00000174
0xeffff0b4: 0x00000004
0xeffff0b0: 0x00000004
0xeffff0ac: 0x00000006
0xeffff0a8: 0x000000e0
0xeffff0a4: 0x000000e0
0xeffff0a0: 0x00171f20
0xeffff09c: 0x00171f20
0xeffff098: 0x00171f20
0xeffff094: 0x00000002
0xeffff090: 0x00002000
0xeffff08c: 0x00000006
0xeffff088: 0x0000e920
0xeffff084: 0x00005360
0xeffff080: 0x00170700
0xeffff07c: 0x00170700
0xeffff078: 0x00170700 frame 0 $fp - 96
0xeffff074: 0xd001b874 saved $a5 == dash .got
0xeffff070: 0xd001e498 saved $a3 == &dash_errno
0xeffff06c: 0xd001e718 frame 0 $sp saved $a2 == &gotsigchld
0xeffff068: 0x00000000
0xeffff064: 0x00000000
0xeffff060: 0xeffff11e
0xeffff05c: 0xffffffff
0xeffff058: 0xc00e4164 return address __wait3+244
0xeffff054: 0x00add14b canary
0xeffff050: 0x00000001
0xeffff04c: 0x00000004
0xeffff048: 0x0000000d
0xeffff044: 0x0000000d
0xeffff040: 0x0015ef82
0xeffff03c: 0x0015ef82
0xeffff038: 0x0015ef82
0xeffff034: 0x00000003
0xeffff030: 0x00000004
0xeffff02c: 0x00000004
0xeffff028: 0x00000140
0xeffff024: 0x00000140
0xeffff020: 0x00000034
0xeffff01c: 0x00000034
0xeffff018: 0x00000034
0xeffff014: 0x00000006
0xeffff010: 0x003b003a
0xeffff00c: 0x000a0028
0xeffff008: 0x00340020
0xeffff004: 0xc019c000 saved $a5 == libc .got
0xeffff000: 0xeffff068 saved $a3 (corrupted)
0xefffeffc: 0x00000000 saved $a2
0xefffeff8: 0x00000001 saved $d5
0xefffeff4: 0xeffff122 saved $d4
0xefffeff0: 0xeffff11e saved $d3
0xefffefec: 0x00000000 saved $d2
0xefffefe8: 0xc00e419a return address __GI___wait4_time64+38
0xefffefe4: 0xc0028780
0xefffefe0: 0x3c344bfb
0xefffefdc: 0x000af353
0xefffefd8: 0x3c340170
0xefffefd4: 0x00000000
0xefffefd0: 0xc00e417c
0xefffefcc: 0xc00e417e
0xefffefc8: 0xc00e4180
0xefffefc4: 0x48e73c34
0xefffefc0: 0x00000000
0xefffefbc: 0xefffeff8
0xefffefb8: 0xefffeffc
0xefffefb4: 0x4bfb0170
0xefffefb0: 0x0eee0709
0xefffefac: 0x00000000
0xefffefa8: 0x00000000
0xefffefa4: 0x00000000
0xefffefa0: 0x00000000
0xefffef9c: 0x00000000
0xefffef98: 0x00000000
0xefffef94: 0x00000000
0xefffef90: 0x00000000
0xefffef8c: 0x00000000
0xefffef88: 0x00000000
0xefffef84: 0x00000000
0xefffef80: 0x00000000
0xefffef7c: 0x00000000
0xefffef78: 0x00000000
0xefffef74: 0x00000000
0xefffef70: 0x00000000
0xefffef6c: 0x00000000
0xefffef68: 0x00000000
0xefffef64: 0x00000000
0xefffef60: 0x00000000
0xefffef5c: 0x00000000
0xefffef58: 0x00000000
0xefffef54: 0x00000000
0xefffef50: 0x00000000
0xefffef4c: 0x00000000
0xefffef48: 0x00000000
0xefffef44: 0x00000000
0xefffef40: 0x00000000
0xefffef3c: 0x00000000
0xefffef38: 0x00000000
0xefffef34: 0x00000000
0xefffef30: 0x00000000
0xefffef2c: 0x00000000
0xefffef28: 0x00000000
0xefffef24: 0x00000000
0xefffef20: 0x00000000
0xefffef1c: 0x00000000
0xefffef18: 0x00000000
0xefffef14: 0x00000000
0xefffef10: 0x7c0effff
0xefffef0c: 0xffffffff
0xefffef08: 0xaaaaaaaa
0xefffef04: 0xaf54eaaa
0xefffef00: 0x40040000
0xefffeefc: 0x40040000
0xefffeef8: 0x2b000000
0xefffeef4: 0x00000000
0xefffeef0: 0x00000000
0xefffeeec: 0x408ece9a
0xefffeee8: 0x00000000
0xefffeee4: 0xf0ff0000
0xefffeee0: 0x0f800000
0xefffeedc: 0xf0fff0ff
0xefffeed8: 0x1f380000
0xefffeed4: 0x00000000
0xefffeed0: 0x00000000
0xefffeecc: 0x00000000
0xefffeec8: 0xffffffff
0xefffeec4: 0xffffffff
0xefffeec0: 0x7fff0000
0xefffeebc: 0xffffffff
0xefffeeb8: 0xffffffff
0xefffeeb4: 0x7fff0000 sc_formatvec
The signal frame is not readily apparent (to me).
From looking at the above stack dump, sc ought to start at 0xefffee90,
and the trampoline would be three words below that.
0xefffeeb0: 0x4178b008 sc_pc, sc_formatvec
0xefffeeac: 0x0008c00e sc_sr, sc_pc
0xefffeea8: 0xd00223bb sc_a1
0xefffeea4: 0xd001e32c sc_a0
0xefffeea0: 0x00000003 sc_d1
0xefffee9c: 0xeffff11e sc_d0
0xefffee98: 0xeffff004 sc_usp
0xefffee94: 0x00000000 sc_mask
0xefffee90: 0x00000000 extramask
0xefffee8c: 0xc0024a90 retcode[1]
0xefffee88: 0x70774e40 retcode[0]
0xefffee84: 0xefffee94 psc
0xefffee80: 0x00000008 code
0xefffee7c: 0x00000011 sig
0xefffee78: 0xefffee88 pretcode
0xefffee74: 0xc019c000
0xefffee70: 0x00000000
0xefffee6c: 0xc0025878
0xefffee68: 0xc0007ed4
0xefffee64: 0xc0024000
0xefffee60: 0xefffef50
0xefffee5c: 0xc0024000
0xefffee58: 0xc002a034
0xefffee54: 0xc0024a90
0xefffee50: 0xc0025878
0xefffee4c: 0x00000001
0xefffee48: 0x0017f020
0xefffee44: 0x0000002c
0xefffee40: 0x0000000f
0xefffee3c: 0x00000000
0xefffee38: 0xfffff7fa
0xefffee34: 0xffffffff
0xefffee30: 0x00009782
0xefffee2c: 0x00000000
0xefffee28: 0x0000001e
0xefffee24: 0xc0025858
0xefffee20: 0xc0025af8
0xefffee1c: 0xc000b376
0xefffee18: 0xc0024000
0xefffee14: 0xc0025878
0xefffee10: 0x0000001d
0xefffee0c: 0xd0001b60
0xefffee08: 0x0000002f
0xefffee04: 0xc002563e
0xefffee00: 0xc0025490
The last address you show corresponds to 0xeffff640 in first dump above,
which is at the start of the saved fpregs. I'd say we just miss the
beginning of the signal frame?
It looks like you're right. I'm not sure how I missed that.
So when the signal was delivered, PC == 0xc00e4178 and USP == 0xc00e4178.
Those addresses can be found in the disassembly and the stack contents I
sent previously (quoted above) and it all seems to line up.
(My reasoning is that copy_siginfo_to_user clears the end of the signal
stack, which is what we can see in both cases.)
Can't explain the 14 words below the saved return address though.
Right. Is it sc_fpstate? Perhaps we should expect QEMU to differ here.
0xefffefe4: 0xc0028780 <= internal registers (6x)
0xefffefe0: 0x3c344bfb <=
0xefffefdc: 0x000af353 <=
0xefffefd8: 0x3c340170 <= internal reg; version no.
0xefffefd4: 0x00000000 <= data input buffer
0xefffefd0: 0xc00e417c <= internal registers (2x)
0xefffefcc: 0xc00e417e <= stage b address
0xefffefc8: 0xc00e4180 <= internal registers (4x)
0xefffefc4: 0x48e73c34 <=
0xefffefc0: 0x00000000 <= data output buffer
0xefffefbc: 0xefffeff8 <= internal registers (2x)
0xefffefb8: 0xefffeffc <= data fault address
0xefffefb4: 0x4bfb0170 <= ins stage c, stage b
0xefffefb0: 0x0eee0709 <= internal register; ssw
Bottom line is, the corrupted %a3 register would have been saved by the
MOVEM instruction at 0xc00e4178, which turns out to be the PC in the
signal frame. So it certainly looks like the kernel was the culprit here.
On 18.04.2023 at 14:04, Finn Thain wrote:
On Tue, 18 Apr 2023, Michael Schmitz wrote:
On 16.04.2023 at 18:44, Finn Thain wrote:
0xeffff750: 0xc01a0000 saved $a5 == libc .got
0xeffff74c: 0xc0023e8c saved $a3 == &__stack_chk_guard
0xeffff748: 0x00000000 saved $a2
0xeffff744: 0x00000001 saved $d5
0xeffff740: 0xeffff86e saved $d4
0xeffff73c: 0xeffff86a saved $d3
0xeffff738: 0x00000002 saved $d2
0xeffff734: 0x00000000
0xeffff730: 0x00000000
0xeffff72c: 0x00000000
0xeffff728: 0x00000000
0xeffff724: 0x00000000
0xeffff720: 0x00000000
0xeffff71c: 0x00000000
0xeffff718: 0x00000000
0xeffff714: 0x00000000
0xeffff710: 0x00000000
0xeffff70c: 0x00000000
0xeffff708: 0x00000000
0xeffff704: 0x00000000
0xeffff700: 0x00000000
0xeffff6fc: 0x00000000
0xeffff6f8: 0x00000000
0xeffff6f4: 0x00000000
0xeffff6f0: 0x00000000
0xeffff6ec: 0x00000000
0xeffff6e8: 0x00000000
0xeffff6e4: 0x00000000
0xeffff6e0: 0x00000000
0xeffff6dc: 0x00000000
0xeffff6d8: 0x00000000
0xeffff6d4: 0x00000000
0xeffff6d0: 0x00000000
0xeffff6cc: 0x00000000
0xeffff6c8: 0x00000000
0xeffff6c4: 0x00000000
0xeffff6c0: 0x00000000
0xeffff6bc: 0x00000000
0xeffff6b8: 0x00000000
0xeffff6b4: 0x00000000
0xeffff6b0: 0x00000000
0xeffff6ac: 0x00000000
0xeffff6a8: 0x00000000
0xeffff6a4: 0x00000000
0xeffff6a0: 0x00000000
0xeffff69c: 0x00000000
0xeffff698: 0x00000000
0xeffff694: 0x00000000
0xeffff690: 0x00000000
0xeffff68c: 0x00000000
0xeffff688: 0x00000000
0xeffff684: 0x00000000
0xeffff680: 0x00000000
0xeffff67c: 0x00000000
0xeffff678: 0x00000000
0xeffff674: 0x00000000
0xeffff670: 0x00000000
0xeffff66c: 0x00000000
0xeffff668: 0x00000000
0xeffff664: 0x00000000
0xeffff660: 0x41000000
0xeffff65c: 0x00000000
0xeffff658: 0x00000000
0xeffff654: 0x00000000
0xeffff650: 0x00000000
0xeffff64c: 0x80000000
0xeffff648: 0x3fff0000
0xeffff644: 0x00000000
0xeffff640: 0xd0000000
0xeffff63c: 0x40020000 <= (sc.formatvec & 0xffff) << 16; fpregs from here on
0xeffff638: 0x81b60080 <= (sc.pc & 0xffff) << 16 | sc.formatvec >> 16
0xeffff634: 0x0000c00e <= sc.sr << 16 sc.pc >> 16
0xeffff630: 0xd001e4e3 <= sc.a1
0xeffff62c: 0xc0028780 <= sc.a0
0xeffff628: 0xffffffff <= sc.d1
0xeffff624: 0x0000041f <= sc.d0
0xeffff620: 0xeffff738 <= sc.usp
0xeffff61c: 0x00000000 <= sc.mask
0xeffff618: 0x00000000 <= extramask
0xeffff614: 0x00000000 <= frame.retcode[1]
0xeffff610: 0x70774e40 moveq #119,%d0 ; trap #0
0xeffff60c: 0xeffff61c <= frame->sc
0xeffff608: 0x00000080 <= tregs->vector
0xeffff604: 0x00000011 <= signal no.
0xeffff600: 0xeffff610 return address
The above comes from dash running under gdb under qemu, which does
not exhibit the failure but is convenient for that kind of
experiment.
I would have expected to see a different signal trampoline (for
sys_rt_sigreturn) ...
Well, this seems to be the trampoline from setup_frame() and not setup_rt_frame().
According to the manpages I've seen, glibc ought to pick rt signals if
the kernel supports those (which I suppose it does).
But anyway:
The saved pc is 0xc00e81b6 which does match the backtrace above.
Vector offset 80 matches trap 0 which suggests 0xc00e81b6 should be
the instruction after a trap 0 instruction. d0 is 1055 which is not a
signal number I recognize.
I don't know what d0 represents here. But &frame->sig == 0x11 is
correct (SIGCHLD).
Correct - that all works out. But d0 holds the syscall number when we
enter the kernel via trap 0, and that one is odd.
...
Here's some stack memory from the core dump.
0xeffff0dc: 0xd000c38e return address waitproc+124
0xeffff0d8: 0xd001c1ec frame 0 $fp == &suppressint
0xeffff0d4: 0x00add14b canary
0xeffff0d0: 0x00000000
0xeffff0cc: 0x0000000a
0xeffff0c8: 0x00000202
0xeffff0c4: 0x00000008
0xeffff0c0: 0x00000000
0xeffff0bc: 0x00000000
0xeffff0b8: 0x00000174
0xeffff0b4: 0x00000004
0xeffff0b0: 0x00000004
0xeffff0ac: 0x00000006
0xeffff0a8: 0x000000e0
0xeffff0a4: 0x000000e0
0xeffff0a0: 0x00171f20
0xeffff09c: 0x00171f20
0xeffff098: 0x00171f20
0xeffff094: 0x00000002
0xeffff090: 0x00002000
0xeffff08c: 0x00000006
0xeffff088: 0x0000e920
0xeffff084: 0x00005360
0xeffff080: 0x00170700
0xeffff07c: 0x00170700
0xeffff078: 0x00170700 frame 0 $fp - 96
0xeffff074: 0xd001b874 saved $a5 == dash .got
0xeffff070: 0xd001e498 saved $a3 == &dash_errno
0xeffff06c: 0xd001e718 frame 0 $sp saved $a2 == &gotsigchld
0xeffff068: 0x00000000
0xeffff064: 0x00000000
0xeffff060: 0xeffff11e
0xeffff05c: 0xffffffff
0xeffff058: 0xc00e4164 return address __wait3+244
0xeffff054: 0x00add14b canary
0xeffff050: 0x00000001
0xeffff04c: 0x00000004
0xeffff048: 0x0000000d
0xeffff044: 0x0000000d
0xeffff040: 0x0015ef82
0xeffff03c: 0x0015ef82
0xeffff038: 0x0015ef82
0xeffff034: 0x00000003
0xeffff030: 0x00000004
0xeffff02c: 0x00000004
0xeffff028: 0x00000140
0xeffff024: 0x00000140
0xeffff020: 0x00000034
0xeffff01c: 0x00000034
0xeffff018: 0x00000034
0xeffff014: 0x00000006
0xeffff010: 0x003b003a
0xeffff00c: 0x000a0028
0xeffff008: 0x00340020
0xeffff004: 0xc019c000 saved $a5 == libc .got
0xeffff000: 0xeffff068 saved $a3 (corrupted)
0xefffeffc: 0x00000000 saved $a2
0xefffeff8: 0x00000001 saved $d5
0xefffeff4: 0xeffff122 saved $d4
0xefffeff0: 0xeffff11e saved $d3
0xefffefec: 0x00000000 saved $d2
0xefffefe8: 0xc00e419a return address __GI___wait4_time64+38
0xefffefe4: 0xc0028780
0xefffefe0: 0x3c344bfb
0xefffefdc: 0x000af353
0xefffefd8: 0x3c340170
0xefffefd4: 0x00000000
0xefffefd0: 0xc00e417c
0xefffefcc: 0xc00e417e
0xefffefc8: 0xc00e4180
0xefffefc4: 0x48e73c34
0xefffefc0: 0x00000000
0xefffefbc: 0xefffeff8
0xefffefb8: 0xefffeffc
0xefffefb4: 0x4bfb0170
0xefffefb0: 0x0eee0709
0xefffefac: 0x00000000
0xefffefa8: 0x00000000
0xefffefa4: 0x00000000
0xefffefa0: 0x00000000
0xefffef9c: 0x00000000
0xefffef98: 0x00000000
0xefffef94: 0x00000000
0xefffef90: 0x00000000
0xefffef8c: 0x00000000
0xefffef88: 0x00000000
0xefffef84: 0x00000000
0xefffef80: 0x00000000
0xefffef7c: 0x00000000
0xefffef78: 0x00000000
0xefffef74: 0x00000000
0xefffef70: 0x00000000
0xefffef6c: 0x00000000
0xefffef68: 0x00000000
0xefffef64: 0x00000000
0xefffef60: 0x00000000
0xefffef5c: 0x00000000
0xefffef58: 0x00000000
0xefffef54: 0x00000000
0xefffef50: 0x00000000
0xefffef4c: 0x00000000
0xefffef48: 0x00000000
0xefffef44: 0x00000000
0xefffef40: 0x00000000
0xefffef3c: 0x00000000
0xefffef38: 0x00000000
0xefffef34: 0x00000000
0xefffef30: 0x00000000
0xefffef2c: 0x00000000
0xefffef28: 0x00000000
0xefffef24: 0x00000000
0xefffef20: 0x00000000
0xefffef1c: 0x00000000
0xefffef18: 0x00000000
0xefffef14: 0x00000000
0xefffef10: 0x7c0effff
0xefffef0c: 0xffffffff
0xefffef08: 0xaaaaaaaa
0xefffef04: 0xaf54eaaa
0xefffef00: 0x40040000
0xefffeefc: 0x40040000
0xefffeef8: 0x2b000000
0xefffeef4: 0x00000000
0xefffeef0: 0x00000000
0xefffeeec: 0x408ece9a
0xefffeee8: 0x00000000
0xefffeee4: 0xf0ff0000
0xefffeee0: 0x0f800000
0xefffeedc: 0xf0fff0ff
0xefffeed8: 0x1f380000
0xefffeed4: 0x00000000
0xefffeed0: 0x00000000
0xefffeecc: 0x00000000
0xefffeec8: 0xffffffff
0xefffeec4: 0xffffffff
0xefffeec0: 0x7fff0000
0xefffeebc: 0xffffffff
0xefffeeb8: 0xffffffff
0xefffeeb4: 0x7fff0000 sc_formatvec
The signal frame is not readily apparent (to me).
From looking at the above stack dump, sc ought to start at 0xefffee90,
and the trampoline would be three words below that.
0xefffeeb0: 0x4178b008 sc_pc, sc_formatvec
0xefffeeac: 0x0008c00e sc_sr, sc_pc
0xefffeea8: 0xd00223bb sc_a1
0xefffeea4: 0xd001e32c sc_a0
0xefffeea0: 0x00000003 sc_d1
0xefffee9c: 0xeffff11e sc_d0
0xefffee98: 0xeffff004 sc_usp
0xefffee94: 0x00000000 sc_mask
0xefffee90: 0x00000000 extramask
0xefffee8c: 0xc0024a90 retcode[1]
0xefffee88: 0x70774e40 retcode[0]
0xefffee84: 0xefffee94 psc
0xefffee80: 0x00000008 code
0xefffee7c: 0x00000011 sig
0xefffee78: 0xefffee88 pretcode
OK, that's our SIGCHLD. But the signal frame format is odd ...
Frame format b, vector offset 008. That's a bus error?
How does that get on the user mode stack?
0xefffee74: 0xc019c000
0xefffee70: 0x00000000
0xefffee6c: 0xc0025878
0xefffee68: 0xc0007ed4
0xefffee64: 0xc0024000
0xefffee60: 0xefffef50
0xefffee5c: 0xc0024000
0xefffee58: 0xc002a034
0xefffee54: 0xc0024a90
0xefffee50: 0xc0025878
0xefffee4c: 0x00000001
0xefffee48: 0x0017f020
0xefffee44: 0x0000002c
0xefffee40: 0x0000000f
0xefffee3c: 0x00000000
0xefffee38: 0xfffff7fa
0xefffee34: 0xffffffff
0xefffee30: 0x00009782
0xefffee2c: 0x00000000
0xefffee28: 0x0000001e
0xefffee24: 0xc0025858
0xefffee20: 0xc0025af8
0xefffee1c: 0xc000b376
0xefffee18: 0xc0024000
0xefffee14: 0xc0025878
0xefffee10: 0x0000001d
0xefffee0c: 0xd0001b60
0xefffee08: 0x0000002f
0xefffee04: 0xc002563e
0xefffee00: 0xc0025490
The last address you show corresponds to 0xeffff640 in first dump
above, which is at the start of the saved fpregs. I'd say we just
miss the beginning of the signal frame?
It looks like you're right. I'm not sure how I missed that.
So when the signal was delivered, PC == 0xc00e4178 and USP ==
0xc00e4178.
USP is 0xeffff004 AFAICS. That's the location a5 was saved to above
(holding libc .got according to your interpretation).
The saved PC is that from the exception frame, in this case a long bus
error sequence fault frame. The PC is that of the instruction executing
when the fault occurred. As you say, that's the moveml saving registers
to the stack.
I don't believe the whole fault frame is on the signal stack in one
contiguous piece, just the first four words, then we have struct
sigcontext. But after that, the extra content follows, and that nicely
explains the extra bits right below the return address from the
__m68k_read_tp call.
Those addresses can be found in the disassembly and the stack contents
I sent previously (quoted above) and it all seems to line up.
(My reasoning is that copy_siginfo_to_user clears the end of the
signal stack, which is what we can see in both cases.)
Can't explain the 14 words below the saved return address though.
Right. Is it sc_fpstate? Perhaps we should expect QEMU to differ here.
See above - I think what's stored there is the extra frame content for a
format b bus error frame. But that extra frame is incomplete at best
(should be 22 longwords, only 14 are seen). Probably overwritten by the
stack frame from __GI___wait4_time64.
Let's parse what's left:
<=
0xefffefe4: 0xc0028780 <= internal registers (6x)
0xefffefe0: 0x3c344bfb <=
0xefffefdc: 0x000af353 <=
0xefffefd8: 0x3c340170 <= internal reg; version no.
0xefffefd4: 0x00000000 <= data input buffer
0xefffefd0: 0xc00e417c <= internal registers (2x)
0xefffefcc: 0xc00e417e <= stage b address
0xefffefc8: 0xc00e4180 <= internal registers (4x)
0xefffefc4: 0x48e73c34 <=
0xefffefc0: 0x00000000 <= data output buffer
0xefffefbc: 0xefffeff8 <= internal registers (2x)
0xefffefb8: 0xefffeffc <= data fault address
0xefffefb4: 0x4bfb0170 <= ins stage c, stage b
0xefffefb0: 0x0eee0709 <= internal register; ssw
The fault address is the location on the stack where a2 is saved. That
does match the data output buffer contents BTW. fc, fb, rc, rb bits
clear means the fault didn't occur in stage b or c instructions. ssw bit
8 set indicates a data fault - the data cycle should be rerun on rte. rm
and rw bits clear tell us it's a write fault. If the moveml instruction
copies registers to the stack in descending order, the fault address
makes sense - the stack pointer just crossed a page boundary.
Bottom line is, the corrupted %a3 register would have been saved by
the MOVEM instruction at 0xc00e4178, which turns out to be the PC in
the signal frame. So it certainly looks like the kernel was the
culprit here.
I think the moveml instruction did cause a bus error, and on return from
that exception the signal got delivered.
On entering the buserror handler, only a1 and a2 are saved, but the
comment in entry.h states that a3-a6 and d6, d7 are preserved by C code.
After buserr_c returns, a3 should be restored to what it was when taking
the bus error. All registers restored before rte, the moveml instruction
ought to be able to resume normally.
Unless that register use constraint has changed, I don't see how a3
could have changed midway during return from the bus error exception.
But maybe a disassembly of buserr_c from your kernel could confirm that?
I would have expected to see a different signal trampoline (for
sys_rt_sigreturn) ...
Well, this seems to be the trampoline from setup_frame() and not
setup_rt_frame().
According to the manpages I've seen, glibc ought to pick rt signals if
the kernel supports those (which I suppose it does).
It's got to be the trampoline from setup_frame() because dash did this:
act.sa_flags = 0;
sigfillset(&act.sa_mask);
sigaction(signo, &act, 0);
and the kernel did this:
/* set up the stack frame */
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
err = setup_rt_frame(ksig, oldset, regs);
else
err = setup_frame(ksig, oldset, regs);
But anyway:
The saved pc is 0xc00e81b6 which does match the backtrace above.
Vector offset 80 matches trap 0 which suggests 0xc00e81b6 should be
the instruction after a trap 0 instruction. d0 is 1055 which is not a
signal number I recognize.
I don't know what d0 represents here. But &frame->sig == 0x11 is
correct (SIGCHLD).
Correct - that all works out. But d0 holds the syscall number when we
enter the kernel via trap 0, and that one is odd.
Well, you showed subsequently that the kernel was probably entered via a
page fault and not the get_thread_area trap. Would that explain the d0
value?
See above - I think what's stored there is the extra frame content for a
format b bus error frame. But that extra frame is incomplete at best
(should be 22 longwords, only 14 are seen). Probably overwritten by the
stack frame from __GI___wait4_time64.
Maybe the exception frame leaked onto the user stack via setup_frame()?
Let's parse what's left:
<=
0xefffefe4: 0xc0028780 <= internal registers (6x)
0xefffefe0: 0x3c344bfb <=
0xefffefdc: 0x000af353 <=
0xefffefd8: 0x3c340170 <= internal reg; version no.
0xefffefd4: 0x00000000 <= data input buffer
0xefffefd0: 0xc00e417c <= internal registers (2x)
0xefffefcc: 0xc00e417e <= stage b address
0xefffefc8: 0xc00e4180 <= internal registers (4x)
0xefffefc4: 0x48e73c34 <=
0xefffefc0: 0x00000000 <= data output buffer
0xefffefbc: 0xefffeff8 <= internal registers (2x)
0xefffefb8: 0xefffeffc <= data fault address
0xefffefb4: 0x4bfb0170 <= ins stage c, stage b
0xefffefb0: 0x0eee0709 <= internal register; ssw
The fault address is the location on the stack where a2 is saved. That
does match the data output buffer contents BTW. fc, fb, rc, rb bits
clear means the fault didn't occur in stage b or c instructions. ssw bit
8 set indicates a data fault - the data cycle should be rerun on rte. rm
and rw bits clear tell us it's a write fault. If the moveml instruction
copies registers to the stack in descending order, the fault address
makes sense - the stack pointer just crossed a page boundary.
Well spotted!
Bottom line is, the corrupted %a3 register would have been saved by
the MOVEM instruction at 0xc00e4178, which turns out to be the PC in
the signal frame. So it certainly looks like the kernel was the
culprit here.
I think the moveml instruction did cause a bus error, and on return from
that exception the signal got delivered.
Maybe the signal frame was partially overwritten by the resumed MOVEM?
I wonder what we'd see if we patched the kernel to log every user data
write fault caused by a MOVEM instruction. I'll try to code that up.
On entering the buserror handler, only a1 and a2 are saved, but the
comment in entry.h states that a3-a6 and d6, d7 are preserved by C code.
After buserr_c returns, a3 should be restored to what it was when taking
the bus error. All registers restored before rte, the moveml instruction
ought to be able to resume normally.
Unless that register use constraint has changed, I don't see how a3
could have changed midway during return from the bus error exception.
But maybe a disassembly of buserr_c from your kernel could confirm that?
I disassembled the relevant build. AFAICT, buserr_c() saves and restores those registers in the right places.
BTW, I've reproduced the failures with kernels built with both GCC 12 and
GCC 6.
... I think what's stored there is the extra frame content for a format
b bus error frame. But that extra frame is incomplete at best (should be
22 longwords, only 14 are seen). Probably overwritten by the stack frame
from __GI___wait4_time64.
Let's parse what's left:
<=
0xefffefe4: 0xc0028780 <= internal registers (6x)
0xefffefe0: 0x3c344bfb <=
0xefffefdc: 0x000af353 <=
0xefffefd8: 0x3c340170 <= internal reg; version no.
0xefffefd4: 0x00000000 <= data input buffer
0xefffefd0: 0xc00e417c <= internal registers (2x)
0xefffefcc: 0xc00e417e <= stage b address
0xefffefc8: 0xc00e4180 <= internal registers (4x)
0xefffefc4: 0x48e73c34 <=
0xefffefc0: 0x00000000 <= data output buffer
0xefffefbc: 0xefffeff8 <= internal registers (2x)
0xefffefb8: 0xefffeffc <= data fault address
0xefffefb4: 0x4bfb0170 <= ins stage c, stage b
0xefffefb0: 0x0eee0709 <= internal register; ssw
The fault address is the location on the stack where a2 is saved. That
does match the data output buffer contents BTW. fc, fb, rc, rb bits
clear means the fault didn't occur in stage b or c instructions. ssw bit
8 set indicates a data fault - the data cycle should be rerun on rte. rm
and rw bits clear tell us it's a write fault. If the moveml instruction
copies registers to the stack in descending order, the fault address
makes sense - the stack pointer just crossed a page boundary.
Inspired by your observation about the page fault and stack growth, I
wrote a small test program (given below) that just pushes registers onto
the stack recursively while forking processes and collecting the SIGCHLD signals.
On a Motorola '030 the stack grows to about 7 MiB before it gets
corrupted. The program detects the stack corruption and terminates immediately with an illegal instruction. Oddly, the program never detects
any stack corruption when run on the QEMU '040.
root@debian:~# ./movem
Illegal instruction
root@debian:~# ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 242
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 242
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
root@debian:~# ulimit -s 7200
root@debian:~# ./movem
Illegal instruction
root@debian:~# ulimit -s 7000
root@debian:~# ./movem
Segmentation fault
root@debian:~# ulimit -s 16384
root@debian:~# ./movem
Illegal instruction
root@debian:~#
Looking at the core dump in gdb, the backtrace has 189869 frames. The dead stack frames confirm the recursion depth reached the limit I set at 200000 before the stack began to reduce again. This was also confirmed by the
lowest page fault address that was logged by the custom kernel.
That means validation succeeded 200000 - 189869 == 10131 times before it encountered corruption (I should try to figure out whether this varies).
The registers %a2, %a3 and %a4 below should contain 0x91929394, 0xa1a2a3a4 and 0xb1b2b3b4 respectively. But they don't. Their values were restored
from a corrupted stack by the returning rec() function call.
(gdb) info reg
d0 0x91929394 -1852664940
d1 0xf3 243
d2 0xd1d2d3d4 -774712364
d3 0xe1e2e3e4 -505224220
d4 0xf1f2f3f4 -235736076
d5 0x80003f0c -2147467508
d6 0xd014c528 -803945176
d7 0x0 0
a0 0xc0021708 0xc0021708
a1 0xc0023e8c 0xc0023e8c <__stack_chk_guard>
a2 0xf3 0xf3
a3 0x1464000 0x1464000
a4 0xef97bf44 0xef97bf44
a5 0xc1c2c3c4 0xc1c2c3c4
fp 0xef97b034 0xef97b034
sp 0xef97b018 0xef97b018
ps 0x8 [ N ]
pc 0x800005f6 0x800005f6 <rec+262>
fpcontrol 0x0 0
fpstatus 0x0 0
fpiaddr 0x0 0x0
(gdb) x/z $sp - 36
0xef97aff4: 0xd1d2d3d4
(gdb)
0xef97aff8: 0xe1e2e3e4
(gdb)
0xef97affc: 0xf1f2f3f4
(gdb)
0xef97b000: 0x000000f3
(gdb)
0xef97b004: 0x01464000
(gdb)
0xef97b008: 0xef97bf44
(gdb)
0xef97b00c: 0xc1c2c3c4
(gdb)
0xef97b010: 0xef97b034
(gdb)
0xef97b014: 0x8000055c
As with dash, the corruption lies at the page boundary.
Any signal frames or exception frames have been completely overwritten because the recursion continued after the corruption took place. So
there's not much to see in the core dump.
(gdb) disass rec
Dump of assembler code for function rec:
0x800004f0 <+0>: linkw %fp,#0
0x800004f4 <+4>: moveml %d2-%d4/%a2-%a5,%sp@-
0x800004f8 <+8>: moveal 0x80000672 <i0>,%a2
0x800004fe <+14>: moveal 0x80000676 <i1>,%a3
0x80000504 <+20>: moveal 0x8000067a <i2>,%a4
0x8000050a <+26>: moveal 0x8000067e <i3>,%a5
0x80000510 <+32>: movel 0x80000682 <i4>,%d2
0x80000516 <+38>: movel 0x80000686 <i5>,%d3
0x8000051c <+44>: movel 0x8000068a <i6>,%d4
0x80000522 <+50>: movel 0x80004034 <depth>,%d0
0x80000528 <+56>: andil #2047,%d0
0x8000052e <+62>: bnes 0x80000542 <rec+82>
0x80000530 <+64>: jsr 0x8000042c <fork@plt>
0x80000536 <+70>: tstl %d0
0x80000538 <+72>: bnes 0x80000542 <rec+82>
0x8000053a <+74>: clrl %sp@-
0x8000053c <+76>: jsr 0x80000404 <exit@plt>
0x80000542 <+82>: movel 0x80004034 <depth>,%d0
0x80000548 <+88>: subql #1,%d0
0x8000054a <+90>: movel %d0,0x80004034 <depth>
0x80000550 <+96>: movel 0x80004034 <depth>,%d0
0x80000556 <+102>: beqs 0x8000055c <rec+108>
0x80000558 <+104>: jsr %pc@(0x800004f0 <rec>)
0x8000055c <+108>: movel %a2,0x8000403c <o0>
0x80000562 <+114>: movel %a3,0x80004040 <o1>
0x80000568 <+120>: movel %a4,0x80004044 <o2>
0x8000056e <+126>: movel %a5,0x80004048 <o3>
0x80000574 <+132>: movel %d2,0x8000404c <o4>
0x8000057a <+138>: movel %d3,0x80004050 <o5>
0x80000580 <+144>: movel %d4,0x80004054 <o6>
0x80000586 <+150>: movel 0x8000403c <o0>,%d1
0x8000058c <+156>: movel #-1852664940,%d0
0x80000592 <+162>: cmpl %d1,%d0
0x80000594 <+164>: bnes 0x800005f6 <rec+262>
0x80000596 <+166>: movel 0x80004040 <o1>,%d1
0x8000059c <+172>: movel #-1583176796,%d0
0x800005a2 <+178>: cmpl %d1,%d0
0x800005a4 <+180>: bnes 0x800005f6 <rec+262>
0x800005a6 <+182>: movel 0x80004044 <o2>,%d1
0x800005ac <+188>: movel #-1313688652,%d0
0x800005b2 <+194>: cmpl %d1,%d0
0x800005b4 <+196>: bnes 0x800005f6 <rec+262>
0x800005b6 <+198>: movel 0x80004048 <o3>,%d1
0x800005bc <+204>: movel #-1044200508,%d0
0x800005c2 <+210>: cmpl %d1,%d0
0x800005c4 <+212>: bnes 0x800005f6 <rec+262>
0x800005c6 <+214>: movel 0x8000404c <o4>,%d1
0x800005cc <+220>: movel #-774712364,%d0
0x800005d2 <+226>: cmpl %d1,%d0
0x800005d4 <+228>: bnes 0x800005f6 <rec+262>
0x800005d6 <+230>: movel 0x80004050 <o5>,%d1
0x800005dc <+236>: movel #-505224220,%d0
0x800005e2 <+242>: cmpl %d1,%d0
0x800005e4 <+244>: bnes 0x800005f6 <rec+262>
0x800005e6 <+246>: movel 0x80004054 <o6>,%d1
0x800005ec <+252>: movel #-235736076,%d0
0x800005f2 <+258>: cmpl %d1,%d0
0x800005f4 <+260>: beqs 0x800005f8 <rec+264>
0x800005f6 <+262>: illegal
0x800005f8 <+264>: nop
0x800005fa <+266>: moveml %fp@(-28),%d2-%d4/%a2-%a5
0x80000600 <+272>: unlk %fp
0x80000602 <+274>: rts
End of assembler dump.
---
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <string.h>

int depth = 200000;

// known patterns loaded into the callee-saved registers
const unsigned long i0 = 0x91929394;
const unsigned long i1 = 0xa1a2a3a4;
const unsigned long i2 = 0xb1b2b3b4;
const unsigned long i3 = 0xc1c2c3c4;
const unsigned long i4 = 0xd1d2d3d4;
const unsigned long i5 = 0xe1e2e3e4;
const unsigned long i6 = 0xf1f2f3f4;

// register contents read back after the recursive call returns
unsigned long o0;
unsigned long o1;
unsigned long o2;
unsigned long o3;
unsigned long o4;
unsigned long o5;
unsigned long o6;

static void rec(void)
{
	// initialize registers
	asm(	" move.l %0, %%a2\n"
		" move.l %1, %%a3\n"
		" move.l %2, %%a4\n"
		" move.l %3, %%a5\n"
		" move.l %4, %%d2\n"
		" move.l %5, %%d3\n"
		" move.l %6, %%d4\n"
		:
		: "m" (i0), "m" (i1), "m" (i2),
		  "m" (i3), "m" (i4), "m" (i5), "m" (i6)
		: "a2", "a3", "a4", "a5", "d2", "d3", "d4"
	);

	// maybe fork a short-lived process
	if ((depth & 0x7ff) == 0)
		if (fork() == 0)
			exit(0);

	if (--depth)
		rec();	// callee to save & restore registers

	// compare register contents
	asm(	" move.l %%a2, %0\n"
		" move.l %%a3, %1\n"
		" move.l %%a4, %2\n"
		" move.l %%a5, %3\n"
		" move.l %%d2, %4\n"
		" move.l %%d3, %5\n"
		" move.l %%d4, %6\n"
		: "=m" (o0), "=m" (o1), "=m" (o2),
		  "=m" (o3), "=m" (o4), "=m" (o5), "=m" (o6)
		:
		:
	);

	if (o0 != i0 || o1 != i1 || o2 != i2 ||
	    o3 != i3 || o4 != i4 || o5 != i5 || o6 != i6)
		asm("illegal");
}

// empty handler: it only has to exist so that SIGCHLD is delivered
static void handler(int sig)
{
	(void)sig;
}

int main(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_handler = handler;
	sigaction(SIGCHLD, &act, NULL);
	rec();
	return 0;
}
Does it also fail on a very old kernel image you still have lying
around? Just to rule out a recent kernel bug.
Can you try and fault in as many of these stack pages as possible, ahead
of filling the stack? (Depending on how much RAM you have ...). Maybe we would need to lock those pages into memory? Just to show that with no
page faults (but still signals) there is no corruption?
Any signal frames or exception frames have been completely overwritten because the recursion continued after the corruption took place. So
there's not much to see in the core dump.
We'd need a way to stop recursion once the first corruption has taken
place. If the 'safe' recursion depth of 10131 is constant, the dump
taken at that point should look similar to what you saw in dash
(assuming it is the page fault and subsequent signal return that causes
the corruption).
On Thu, 20 Apr 2023, Michael Schmitz wrote:
Can you try and fault in as many of these stack pages as possible, ahead
of filling the stack? (Depending on how much RAM you have ...). Maybe we
would need to lock those pages into memory? Just to show that with no
page faults (but still signals) there is no corruption?
OK.
Any signal frames or exception frames have been completely overwritten
because the recursion continued after the corruption took place. So
there's not much to see in the core dump.
We'd need a way to stop recursion once the first corruption has taken
place. If the 'safe' recursion depth of 10131 is constant, the dump
taken at that point should look similar to what you saw in dash
(assuming it is the page fault and subsequent signal return that causes
the corruption).
It turns out that the recursion depth can be set a lot lower than the
200000 that I chose in that test program. (I used that value as it kept
the stack size just below the default 8192 kB limit.)
At depth = 2500, a failure is around 95% certain. At depth = 2048 I can
still get an intermittent failure. This only required 21 stack page faults
and one fork.
I suspect that the location of the corruption is probably somewhat random, and the larger the stack happens to be when the signal comes in, the
better the odds of detection.
On Thu, 20 Apr 2023, Michael Schmitz wrote:
As with dash, the corruption lies at the page boundary.
Which implies that a page fault was handled at the page boundary.
Can you try and fault in as many of these stack pages as possible, ahead
of filling the stack? (Depending on how much RAM you have ...). Maybe we
would need to lock those pages into memory? Just to show that with no
page faults (but still signals) there is no corruption?
I modified the test program to execute rec() to full depth with no
forking, then do it again with forking.
root@(none):/root# while ./stack-test 5000 ; do : ; done
starting recursion
done.
starting recursion with fork
done.
starting recursion
done.
starting recursion with fork
Illegal instruction
root@(none):/root#
I can't get this to crash during the first descent. The second descent
always crashes, given sufficient depth:
root@(none):/root# while ./stack-test 50000 ; do : ; done
starting recursion
done.
starting recursion with fork
Illegal instruction
So all the stack pages would have been faulted in well before the failure shows up. It appears to be the signal that's the problem and not the page fault. That's not surprising considering the PC in the signal frame in the dash crash was a MOVEM saving registers onto the stack.
It's worth noting that the test program never crashes with a corrupted
return address. Random corruption would have clobbered that address about
10% of the time, since the entire rec() stack frame is 9 long words. So it must be that a MOVEM went awry when a signal got delivered.
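The arithmetic behind that estimate can be written out as a small C sketch. The 9-long-word frame comes from the message; the 4-byte long word is the m68k size, and the helper names are made up for illustration. One slot in nine works out to roughly 11%, which is the "about 10%" quoted, and the same frame size gives the 36 bytes per frame used for the 1.8 MB figure elsewhere in the thread.

```c
#include <stddef.h>

enum {
	LONGWORD_BYTES = 4,		/* size of an m68k long word */
	REC_FRAME_LONGWORDS = 9,	/* rec() frame size quoted in the message */
};

/* Bytes of stack consumed by one rec() frame. */
static size_t rec_frame_bytes(void)
{
	return (size_t)REC_FRAME_LONGWORDS * LONGWORD_BYTES;
}

/* Total stack consumed by `depth` nested rec() calls. */
static size_t stack_bytes_for_depth(size_t depth)
{
	return depth * rec_frame_bytes();
}

/* Chance, in percent, that one randomly placed corrupt long word within
 * a frame lands on the single return-address slot. */
static double return_address_hit_percent(void)
{
	return 100.0 / REC_FRAME_LONGWORDS;
}
```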
On Wed, 19 Apr 2023, I wrote:
Oddly, the program never detects any stack corruption when run on the
QEMU '040.
I tested a Motorola '040 and got the same result.
Am 20.04.2023 um 17:17 schrieb Finn Thain:
On Thu, 20 Apr 2023, Michael Schmitz wrote:
As with dash, the corruption lies at the page boundary.
Hence implies a page fault handled at the page boundary.
Can you try and fault in as many of these stack pages as possible, ahead
of filling the stack? (Depending on how much RAM you have ...). Maybe we
would need to lock those pages into memory? Just to show that with no
page faults (but still signals) there is no corruption?
I modified the test program to execute rec() to full depth with no
forking, then do it again with forking.
root@(none):/root# while ./stack-test 5000 ; do : ; done
starting recursion
done.
starting recursion with fork
done.
starting recursion
done.
starting recursion with fork
Illegal instruction
root@(none):/root#
I can't get this to crash during the first descent. The second descent always crashes, given sufficient depth:
root@(none):/root# while ./stack-test 50000 ; do : ; done
starting recursion
done.
starting recursion with fork
Illegal instruction
So all the stack pages would have been faulted in well before the
failure shows up. It appears to be the signal that's the problem and
not the page fault. That's not surprising considering the PC in the
signal frame in the dash crash was a MOVEM saving registers onto the
stack.
Well, without locking the faulted in pages in memory we can't be sure
they were not swapped back out. Unless I misunderstand what's involved
in that ...
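One way to rule swapping out, sketched here in C under the assumption of a POSIX system with mlockall(): lock the process's pages first, then touch the stack ahead of the recursion. The helper names and the use of a variable-length array are choices made for this sketch, not something taken from the test program.

```c
#include <stddef.h>
#include <sys/mman.h>

/* Write one byte per page across `bytes` of fresh stack and return the
 * number of pages touched.  Uses a C99 variable-length array, so `bytes`
 * must be non-zero and fit within the stack limit. */
static size_t prefault_stack(size_t bytes, size_t page_size)
{
	volatile unsigned char buf[bytes];
	size_t off, touched = 0;

	for (off = 0; off < bytes; off += page_size) {
		buf[off] = 0;
		touched++;
	}
	return touched;
}

/* Pin current and future mappings in RAM so they cannot be swapped out.
 * Returns 0 on success, -1 with errno set (for example EPERM or ENOMEM
 * when the RLIMIT_MEMLOCK limit is too low). */
static int lock_all_pages(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

Calling lock_all_pages() before prefault_stack() means the pages faulted in by the second call stay resident, so any later corruption could not be blamed on a page being swapped back out.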
Am 20.04.2023 um 18:04 schrieb Finn Thain:
On Wed, 19 Apr 2023, I wrote:
Oddly, the program never detects any stack corruption when run on the
QEMU '040.
I tested a Motorola '040 and got the same result.
OK, that would mean the bus error was just the most reliable way to get do_signal_return run after child process termination, and the signal
delivery itself may be responsible for stack corruption.
Am 20.04.2023 um 19:47 schrieb Finn Thain:
So all the stack pages would have been faulted in well before the
failure shows up. It appears to be the signal that's the problem and
not the page fault. That's not surprising considering the PC in the
signal frame in the dash crash was a MOVEM saving registers onto the
stack.
Well, without locking the faulted in pages in memory we can't be sure
they were not swapped back out. Unless I misunderstand what's
involved in that ...
There was no swap enabled.
50000 frames * 36 bytes per frame == 1.8 MB
OK - swap is enabled in my case. That may explain the different fault
rates.
But in any case, it looks like we can eliminate the bus error code. Same fault on both 030 and 040 with very different bus error handlers is
highly unlikely.
So all the stack pages would have been faulted in well before the
failure shows up. It appears to be the signal that's the problem and
not the page fault. That's not surprising considering the PC in the
signal frame in the dash crash was a MOVEM saving registers onto the
stack.
Well, without locking the faulted in pages in memory we can't be sure
they were not swapped back out. Unless I misunderstand what's involved
in that ...
There was no swap enabled.
50000 frames * 36 bytes per frame == 1.8 MB
But in any case, it looks like we can eliminate the bus error code. Same
fault on both 030 and 040 with very different bus error handlers is
highly unlikely.
There's no failure on '040. QEMU and Motorola '040 gave the same result.
Fri, 21 Apr 2023 11:15:22 +1000 (AEST)
On Thu, 20 Apr 2023, Michael Schmitz wrote:
In my tests, increasing the depth does not cause a monotonic increase
in fault probability. 16k depth only has four crashes, 8k had nine. I'll
stick with 200000 for now.
My tests used 'norandmaps' in the kernel parameters. With the attached
.config (which supports Mac and Atari) I saw 12 failures out of 16 tests
at depth 5000.
Hi Finn,
On 20/04/23 20:55, Finn Thain wrote:
But in any case, it looks like we can eliminate the bus error code. Same
fault on both 030 and 040 with very different bus error handlers is
highly unlikely.
There's no failure on '040. QEMU and Motorola '040 gave the same result.
Sorry, my fault - I interpreted your mail as saying 030 and 040 gave
the same result.
Back to the drawing board ... I've got kernel images back to 2.4.30
and 2.6.37 to try and test. I'm also trying with rt signals and
alternate signal stack (rt signals show the same behaviour).
Cheers,
Michael
I modified the test program to execute rec() to full depth with no
forking, then do it again with forking.
root@(none):/root# while ./stack-test 5000 ; do : ; done
starting recursion
done.
starting recursion with fork
done.
starting recursion
done.
starting recursion with fork
Illegal instruction
root@(none):/root#
I can't get this to crash during the first descent. The second descent
always crashes, given sufficient depth:
root@(none):/root# while ./stack-test 50000 ; do : ; done
starting recursion
done.
starting recursion with fork
Illegal instruction
So all the stack pages would have been faulted in well before the
failure shows up. It appears to be the signal that's the problem and not
the page fault.
Hi Finn,
On 20/04/23 20:55, Finn Thain wrote:
But in any case, it looks like we can eliminate the bus error code. Same
fault on both 030 and 040 with very different bus error handlers is
highly unlikely.
There's no failure on '040. QEMU and Motorola '040 gave the same result.
Sorry, my fault - I interpreted your mail as saying 030 and 040 gave the
same result.
Back to the drawing board ... I've got kernel images back to 2.4.30 and 2.6.37 to try and test. I'm also trying with rt signals and alternate
signal stack (rt signals show the same behaviour).
How often did a page fault happen when executing moveml, in other
programs?
On Fri, 21 Apr 2023, Michael Schmitz wrote:
How often did a page fault happen when executing moveml, in other
programs?
The printk() I placed in bus_error030() was conditional on the short word
at the instruction pointer. It didn't consider all forms of movem, just 0x48e7 which is the start of "moveml X,%sp@-". This matched page faults in many of the programs that executed while booting to single user mode. I suppose most of them don't use signal handlers in the same way dash does otherwise they would probably be unreliable too.
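The opcode test described here can be illustrated in user-space C. This is only a sketch of the comparison, not the printk() patch itself; the function names are hypothetical. The field decoding shows why 0x48e7 means "moveml X,%sp@-": effective-address mode 4 is predecrement and register 7 is the stack pointer.

```c
#include <stdint.h>

#define MOVEML_PUSH_OPCODE 0x48e7u	/* first word of "moveml X,%sp@-" */

/* m68k is big-endian: assemble the 16-bit opcode word from two bytes. */
static uint16_t fetch_opcode(const uint8_t *pc)
{
	return (uint16_t)((pc[0] << 8) | pc[1]);
}

/* Exact match only, as in the message: other MOVEM forms are ignored. */
static int is_moveml_push(uint16_t opcode)
{
	return opcode == MOVEML_PUSH_OPCODE;
}

/* Effective-address fields held in the low six bits of the opcode word. */
static unsigned int ea_mode(uint16_t opcode)
{
	return (opcode >> 3) & 7u;
}

static unsigned int ea_reg(uint16_t opcode)
{
	return opcode & 7u;
}
```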
Hi Finn,
Am 21.04.2023 um 20:30 schrieb Finn Thain:
On Fri, 21 Apr 2023, Michael Schmitz wrote:
How often did a page fault happen when executing moveml, in other
programs?
The printk() I placed in bus_error030() was conditional on the short word
at the instruction pointer. It didn't consider all forms of movem, just
0x48e7 which is the start of "moveml X,%sp@-". This matched page
faults in
many of the programs that executed while booting to single user mode. I
suppose most of them don't use signal handlers in the same way dash does
otherwise they would probably be unreliable too.
OK; so too much noise unless filtered on the command name...
I'll try first to get register state at the time of signal delivery from
the sa_sigaction handler's ucontext parameter to see where the signal
stack falls in relation to the call frames from your rec() function on
the stack (and what the register contents were). Hope that won't be too
noisy ... Then see how that changes with bus error handling forcing
signal delivery.
Shame we don't have similar code to find the MMU descriptors on 040.
Cheers,
Michael
Took a little while to figure out that the ucontext format changed in the decade or two since my userland's libc headers were generated.
On Apr 22 2023, Michael Schmitz wrote:
Took a little while to figure out that the ucontext format changed in the
decade or two since my userland's libc headers were generated.
In which way did it change?
Hi Finn,
Am 21.04.2023 um 21:18 schrieb Michael Schmitz:
Hi Finn,
Am 21.04.2023 um 20:30 schrieb Finn Thain:
On Fri, 21 Apr 2023, Michael Schmitz wrote:
How often did a page fault happen when executing moveml, in other
programs?
The printk() I placed in bus_error030() was conditional on the short
word
at the instruction pointer. It didn't consider all forms of movem, just
0x48e7 which is the start of "moveml X,%sp@-". This matched page
faults in
many of the programs that executed while booting to single user mode. I
suppose most of them don't use signal handlers in the same way dash does
otherwise they would probably be unreliable too.
OK; so too much noise unless filtered on the command name...
I'll try first to get register state at the time of signal delivery from
the sa_sigaction handler's ucontext parameter to see where the signal
stack falls in relation to the call frames from your rec() function on
the stack (and what the register contents were). Hope that won't be too
Took a little while to figure out that the ucontext format changed in
the decade or two since my userland's libc headers were generated. With
the correct format, the information stored on the signal frame made a lot
more sense.
Log of your test program (attached), instrumented to keep track of user
stack pointer in the parent process, user stack pointer in the signal
handler, and stack pointer, pc and exception frame format from the
signal stack (only the last few signals shown):
parent usp : 0xef97beb8
handler tos : 0xef97bdc4
handler usp : 0xef97bbe0
signal usp : 0xef97bea8
signal pc : 0xc009f37a
signal fmtv : 0x800006ca
parent usp : 0xef969eb8
handler tos : 0xef969dc4
handler usp : 0xef969be0
signal usp : 0xef969ea8
signal pc : 0xc009f37a
signal fmtv : 0x800006ca
parent usp : 0xef9530d0
handler tos : 0xef952fec
handler usp : 0xef952e08
signal usp : 0xef9530d0
signal pc : 0x800006dc
signal fmtv : 0x91929394
parent usp : 0xef945eb8
handler tos : 0xef945dc4
handler usp : 0xef945be0
signal usp : 0xef945ea8
signal pc : 0xc009f37a
signal fmtv : 0x800006ca
parent usp : 0xef933eb8
handler tos : 0xef933dc4
handler usp : 0xef933be0
signal usp : 0xef933ea8
signal pc : 0xc009f37a
signal fmtv : 0x800006ca
parent usp : 0xef921edc
handler tos : 0xef997984
handler stack overwrote usp!
handler usp : 0xef9977a0
signal usp : 0xef997a64
signal pc : 0x80000768
signal fmtv : 0xa1a2a3a4
Illegal instruction (core dumped)
usp from the signal stack is below that of the parent process (before
calling fork()).
usp from the signal handler is below both of those. So far, so good.
The top of the signal frame, however, is getting quite close to these
stack pointers. In the last log, it has grown above the user stack pointer.
Two things to note:
- pc in the signal frame (from struct uc_mcontext) is either the return
pc from the stack stuffing function, or something else I cannot work
out. That part of ucontext appears valid.
- what ought to be the frame format and vector offset does in fact hold varying longwords from the user stack. This information is not from
struct uc_mcontext, but from extra information copied after struct
ucontext ends. That wouldn't be there if at time of signal delivery,
nothing had yet written to the area where the signal frame is stored.
This is the definition from the kernel's
include/uapi/asm-generic/ucontext.h:
And this is /usr/include/sys/ucontext.h:
/* Userlevel context. */
typedef struct ucontext
{
unsigned long int uc_flags;
struct ucontext *uc_link;
__sigset_t uc_sigmask;
stack_t uc_stack;
mcontext_t uc_mcontext;
long int uc_filler[174];
} ucontext_t;
uc_sigmask appears before uc_stack and uc_mcontext.
I'm assuming libc just passes on what the kernel set, without reordering?
On Apr 22 2023, Michael Schmitz wrote:
This is the definition from the kernel's
include/uapi/asm-generic/ucontext.h:
That's not actually used by m68k, it uses
arch/m68k/include/asm/ucontext.h, which confusingly isn't an uapi
header.
And this is /usr/include/sys/ucontext.h:
/* Userlevel context. */
typedef struct ucontext
{
unsigned long int uc_flags;
struct ucontext *uc_link;
__sigset_t uc_sigmask;
stack_t uc_stack;
mcontext_t uc_mcontext;
long int uc_filler[174];
} ucontext_t;
uc_sigmask appears before uc_stack and uc_mcontext.
Yes, that got fixed as part of commit 9c986f878a back in 2006.
I'm assuming libc just passes on what the kernel set, without reordering?
Trying to rewrite the signal context would be prohibitive, yes.
Now I wonder who adds sigmask ... and whether that's also ending up on the user stack.
On Apr 23 2023, Michael Schmitz wrote:
Now I wonder who adds sigmask ... and whether that's also ending up on the
user stack.
The kernel only writes the first 64 bits of the signal mask, as it does
for all signal mask related syscalls. The kernel version of the context
ends after that; since the user-space version is larger it actually
extends into the next stack frame.
I'll see whether the signal context is available on the stack even if the handler isn't passed that parameter.
On Apr 23 2023, Michael Schmitz wrote:
I'll see whether the signal context is available on the stack even if the
handler isn't passed that parameter.
The signal context is always on the stack, and used by the
(rt_)sigreturn syscall.
Hi Andreas,
Am 23.04.2023 um 08:46 schrieb Andreas Schwab:
On Apr 23 2023, Michael Schmitz wrote:
I'll see whether the signal context is available on the stack even if the
handler isn't passed that parameter.
The signal context is always on the stack, and used by the
(rt_)sigreturn syscall.
True, but at the time the signal handler (sa_handler type) runs, all I
have is the user stack pointer upon entry to the handler. I'll have to calculate back from that address, if that is possible.
I wonder what we'd see if we patched the kernel to log every user data write fault caused by a MOVEM instruction. I'll try to code that up.
If these instructions did always cause stack corruption on 030, I think
we would have noticed long ago?
Wasn't too hard actually. The signo parameter passed to the handler turns
out to be passed by reference, and signo is located 4 bytes into the
kernel sigframe.
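The address calculation can be sketched in C as below. The struct is a hypothetical stand-in for the start of the non-rt signal frame with 32-bit fields as on m68k; only the 4-byte offset of the signal number is taken from the message, and the other field names are illustrative rather than copied from the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the head of the signal frame (illustrative layout). */
struct sigframe_head {
	uint32_t pretcode;	/* return address seen by the handler */
	int32_t  sig;		/* the handler's first argument */
	int32_t  vector;	/* second argument of the non-rt handler */
	uint32_t psc;		/* pointer to the saved context */
};

/* Given the address of the handler's signo argument, step back to the
 * start of the frame that contains it. */
static const struct sigframe_head *frame_from_signo(const int32_t *signo)
{
	return (const struct sigframe_head *)
		((const char *)signo - offsetof(struct sigframe_head, sig));
}
```

Inside a handler on m68k this would be called with the address of the signo parameter itself, since the argument lives in the frame the kernel built on the user stack.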
Hi Andreas,
Am 23.04.2023 um 08:46 schrieb Andreas Schwab:
On Apr 23 2023, Michael Schmitz wrote:
I'll see whether the signal context is available on the stack even if
the
handler isn't passed that parameter.
The signal context is always on the stack, and used by the
(rt_)sigreturn syscall.
True, but at the time the signal handler (sa_handler type) runs, all I
have is the user stack pointer upon entry to the handler. I'll have to calculate back from that address, if that is possible.
Am 23.04.2023 um 13:41 schrieb Michael Schmitz:
Though the question remains - is this expected behaviour for programs
that do deep recursion on the stack while taking signals (and the reason
for the option to run signal handlers on an alternate stack)?
And why does this almost always appear to happen after bus error exceptions (frame format b)? The extra exception stack information isn't even accounted for in the above frame end address!
Result with sa_sigaction handler:
parent usp : 0xef969e28
handler tos : 0xef969e6c
handler stack overwrote usp!
frame end : 0xef969e7c
frame start : 0xef969b58
handler usp : 0xef969b40
signal usp : 0xef969e04
signal pc : 0x80000696
signal fmtv : 0x114
parent usp : 0xef955008
handler tos : 0xef955064
handler stack overwrote usp!
frame end : 0xef955074
frame start : 0xef954d50
handler usp : 0xef954d38
signal usp : 0xef954ffc
signal pc : 0x80000680
signal fmtv : 0xb008
parent usp : 0xef945eb8
handler tos : 0xef945f0c
handler stack overwrote usp!
frame end : 0xef945f1c
frame start : 0xef945bf8
handler usp : 0xef945be0
signal usp : 0xef945ea8
signal pc : 0xc009f37a
signal fmtv : 0x80
parent usp : 0xef933eb8
handler tos : 0xef933f0c
handler stack overwrote usp!
frame end : 0xef933f1c
frame start : 0xef933bf8
handler usp : 0xef933be0
signal usp : 0xef933ea8
signal pc : 0xc009f37a
signal fmtv : 0x80
parent usp : 0xef921edc
handler tos : 0xef9aaca4
handler stack overwrote usp!
frame end : 0xef9aacb4
frame start : 0xef9aa990
handler usp : 0xef9aa978
signal usp : 0xef9aac40
signal pc : 0x80000782
signal fmtv : 0x114
Illegal instruction (core dumped)
Exception right before crash was an interrupt in this case (only seen
that once in this context, though I've seen lots of those in the course
of the test runs). Frame start calculated from siginfo pointer value in
this case.
On Sun, 23 Apr 2023, Michael Schmitz wrote:
Am 23.04.2023 um 13:41 schrieb Michael Schmitz:
Though the question remains - is this expected behaviour for programs
that do deep recursion on the stack while taking signals (and the reason
for the option to run signal handlers on an alternate stack)?
I don't understand how "deep recursion" can be used to explain this. We've
seen crashes with only 1.8 MB of stack usage.
OK, it's not really deep (though I've managed to get the test case
The best reason I can think of for having a signal stack would be that it
may be better for signal delivery to fail than for the target process to
fail. But I've no idea whether the kernel makes that kind of defensive
programming possible (?)
And why does this almost always appear to happen after bus error exceptions
(frame format b)? The extra exception stack information isn't even accounted
for in the above frame end address!
Result with sa_sigaction handler:
parent usp : 0xef969e28
handler tos : 0xef969e6c
handler stack overwrote usp!
frame end : 0xef969e7c
frame start : 0xef969b58
handler usp : 0xef969b40
signal usp : 0xef969e04
signal pc : 0x80000696
signal fmtv : 0x114
parent usp : 0xef955008
handler tos : 0xef955064
handler stack overwrote usp!
frame end : 0xef955074
frame start : 0xef954d50
handler usp : 0xef954d38
signal usp : 0xef954ffc
signal pc : 0x80000680
signal fmtv : 0xb008
parent usp : 0xef945eb8
handler tos : 0xef945f0c
handler stack overwrote usp!
frame end : 0xef945f1c
frame start : 0xef945bf8
handler usp : 0xef945be0
signal usp : 0xef945ea8
signal pc : 0xc009f37a
signal fmtv : 0x80
parent usp : 0xef933eb8
handler tos : 0xef933f0c
handler stack overwrote usp!
frame end : 0xef933f1c
frame start : 0xef933bf8
handler usp : 0xef933be0
signal usp : 0xef933ea8
signal pc : 0xc009f37a
signal fmtv : 0x80
parent usp : 0xef921edc
handler tos : 0xef9aaca4
handler stack overwrote usp!
frame end : 0xef9aacb4
frame start : 0xef9aa990
handler usp : 0xef9aa978
signal usp : 0xef9aac40
signal pc : 0x80000782
signal fmtv : 0x114
Illegal instruction (core dumped)
I don't understand these results. If usp was really overwritten, the
program would have crashed early, no?
I think we're still at the point where rec() is called recursively,
before any returns.
Exception right before crash was an interrupt in this case (only seen
that once in this context, though I've seen lots of those in the course
of the test runs). Frame start calculated from siginfo pointer value in
this case.
I didn't realize that you could get a crash from a signal delivered
following an interrupt. I'll try to modify the kernel such that signals
are not delivered after page faults.
On Apr 23 2023, Michael Schmitz wrote:
Wasn't too hard actually. The signo parameter passed to the handler turns
out to be passed by reference, and signo is located 4 bytes into the
kernel sigframe.
That's not "passed by reference". Function arguments are always passed
on the stack.
On Wed, 19 Apr 2023, Michael Schmitz wrote:
I wonder what we'd see if we patched the kernel to log every user data
write fault caused by a MOVEM instruction. I'll try to code that up.
If these instructions did always cause stack corruption on 030, I think
we would have noticed long ago?
I think it probably was noticed long ago, in the form of rare userland
crashes on 68030. But it was probably never reported because the actual
culprit is too distant from the symptoms.
But I take your point -- signal delivery seems to be crucial. Would it be
difficult to skip signal delivery following a bus error? Perhaps there's
no need to try that experiment, as we know what would happen.
Shouldn't be too hard, see my other mail.
I will take a look at your modified test program and try to use the output
to figure out the stack gymnastics.
IIUC, there are two RTEs following the page fault. The first one runs the signal handler, the second one resumes the MOVEM that faulted. Maybe we'll have to intercept the latter (at do_sigreturn() perhaps?) and examine that exception frame.
Not sure what third argument you referred to in another mail.
On Apr 24 2023, Michael Schmitz wrote:
Not sure what third argument you referred to in another mail.
See struct sigframe and struct rt_sigframe. The non-rt signal handler
gets signal number, vector number and sigcontext*. The rt signal
handler gets signal number, siginfo* and ucontext*.
Hi Andreas,
On 24/04/23 09:48, Andreas Schwab wrote:
On Apr 24 2023, Michael Schmitz wrote:
Not sure what third argument you referred to in another mail.
See struct sigframe and struct rt_sigframe. The non-rt signal handler
gets signal number, vector number and sigcontext*. The rt signal
handler gets signal number, siginfo* and ucontext*.
Thanks, I see now. Got confused by the sigaction man page (despite
working out that it's all there on the stack before...). Might need a
comment in the code (or an update to the man pages).
I've rewritten my test program to make the non-rt handler take three arguments, just to simplify things. Also fixed the end of signal frame calculation for the non-rt case where the exception places additional
data on the stack.
Running with the non-rt handler, the crash appears to happen right at
the end of the recursion (or at least, I take no further SIGCHLD on the
way back up the stack). With the rt handler, I see the stack depth
decreased on the last signal taken before the crash.
When I enable dumping the extra exception frame contents (which will
show prior stack contents when the exception only used a four-word
frame) in the rt handler case, I only see saved register data placed
there at the very end. That's different from previous tests where I saw
the saved register patterns all the time. (but might have had the
offsets wrong).
I'll see what peeking at the registers shows (now that I can be
confident I have got the offsets correct).
I don't understand these results. If usp was really overwritten, the program would have crashed early, no?
I think we're still at the point where rec() is called recursively,
before any returns.
Exception right before crash was an interrupt in this case (only seen
that once in this context, though I've seen lots of those in the
course of the test runs). Frame start calculated from siginfo pointer
value in this case.
I didn't realize that you could get a crash from a signal delivered following an interrupt. I'll try to modify the kernel such that
signals are not delivered after page faults.
Yes, that was news to me, too.
On Apr 22 2023, Michael Schmitz wrote:
This is the definition from the kernel's
include/uapi/asm-generic/ucontext.h:
That's not actually used by m68k, it uses
arch/m68k/include/asm/ucontext.h, which confusingly isn't an uapi
header.
Debian/sid has this struct in asm-generic/ucontext.h --
struct ucontext {
unsigned long uc_flags;
struct ucontext *uc_link;
stack_t uc_stack;
struct sigcontext uc_mcontext;
sigset_t uc_sigmask; /* mask last for extensibility */
};
but that lacks uc_filler[] so it's not ideal. I adapted the definitions
from the kernel source.
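The effect of the field reordering can be checked with offsetof(). The sketch below copies only the field order of the two definitions quoted in this thread; the member types are stand-ins with made-up sizes (the fake_* names are hypothetical) and uc_filler[] is omitted, so the absolute offsets are not the real ones. What it does show is that code built against the old header looks for uc_stack and uc_mcontext at the wrong place.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in member types with made-up sizes; only field order matters. */
typedef struct { uint32_t words[3]; }  fake_stack_t;
typedef struct { uint32_t words[20]; } fake_mcontext_t;
typedef struct { uint32_t words[32]; } fake_sigset_t;

/* Field order of the old /usr/include/sys/ucontext.h quoted above. */
struct old_libc_ucontext {
	uint32_t        uc_flags;
	uint32_t        uc_link;
	fake_sigset_t   uc_sigmask;
	fake_stack_t    uc_stack;
	fake_mcontext_t uc_mcontext;
};

/* Field order of the asm-generic definition quoted above. */
struct generic_ucontext {
	uint32_t        uc_flags;
	uint32_t        uc_link;
	fake_stack_t    uc_stack;
	fake_mcontext_t uc_mcontext;
	fake_sigset_t   uc_sigmask;	/* mask last for extensibility */
};

/* Offset of uc_stack as seen through each definition. */
static size_t old_stack_offset(void)
{
	return offsetof(struct old_libc_ucontext, uc_stack);
}

static size_t new_stack_offset(void)
{
	return offsetof(struct generic_ucontext, uc_stack);
}
```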