• Re: IAR ARM Cortex-M compiler does not align stack on 8-byte boundary

    From Richard Damon@21:1/5 to StateMachineCOM on Sun Sep 18 16:46:56 2022
    On 9/18/22 4:26 PM, StateMachineCOM wrote:
    ARM ABI says that the stack should be 8-byte aligned, but I see cases where the stack is aligned only to 4-byte boundary.

    For example, I have the following simple busy-delay function:

    <pre>
    void delay(int iter) {
        int volatile counter = 0;
        while (counter < iter) { // delay loop
            ++counter;
        }
    }
    </pre>

    This compiles with IAR EWARM 9.10.2 on ARM Cortex-M to the following disassembly:

    <pre>
    SUB SP, SP, #0x4
    ...
    ADD SP, SP, #0x4
    BX LR
    </pre>

    The problem is that after SUB SP,SP,4 the stack is misaligned (is aligned only to 4-byte boundary).

    Why is this happening? Is this compliant with the ARM ABI? Are there any compiler options to control that?

    I think that as long as the function doesn't call another function, it
    doesn't need to respect that ABI, since it knows it isn't going to do
    the operations that need the 8-byte alignment.

    If it isn't *I*nterfacing with anything, the ABI doesn't apply.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From StateMachineCOM@21:1/5 to All on Sun Sep 18 13:26:12 2022
    ARM ABI says that the stack should be 8-byte aligned, but I see cases where the stack is aligned only to 4-byte boundary.

    For example, I have the following simple busy-delay function:

    <pre>
    void delay(int iter) {
        int volatile counter = 0;
        while (counter < iter) { // delay loop
            ++counter;
        }
    }
    </pre>

    This compiles with IAR EWARM 9.10.2 on ARM Cortex-M to the following disassembly:

    <pre>
    SUB SP, SP, #0x4
    ...
    ADD SP, SP, #0x4
    BX LR
    </pre>

    The problem is that after SUB SP,SP,4 the stack is misaligned (is aligned only to 4-byte boundary).

    Why is this happening? Is this compliant with the ARM ABI? Are there any compiler options to control that?

  • From StateMachineCOM@21:1/5 to All on Sun Sep 18 17:10:48 2022
    Yes, the simple delay() function does not call anything. But still, interrupts can preempt it, which is quite likely because a function like this runs for a long time by design (and consumes a significant percentage of the CPU time).

    In fact, I've checked it, and an interrupt preempting delay() must re-align the stack by using the "stack aligner". So the simple (no FPU) Cortex-M exception stack frame of 8 registers (32 bytes) becomes the bigger stack frame of 9 registers (36 bytes).
    Please note that the Cortex-M CPU deals with it just fine and the program runs. But an RTOS or other assembly code that deals with interrupts could break the system if it makes assumptions about the stack alignment. I thought that
    compatibility with interrupts was the primary reason why the ARM ABI stipulates 8-byte stack alignment.
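
    As a rough illustration of the "stack aligner" behaviour described above, here is a minimal host-side C model. It is only a sketch: it assumes the basic 8-word frame (no FPU) and that stack alignment on exception entry is enabled (CCR.STKALIGN = 1). The type and function names are made up for this example and do not come from any vendor header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Basic (no-FPU) exception frame: R0-R3, R12, LR, PC, xPSR = 8 words. */
enum { BASIC_FRAME_BYTES = 32 };

/* Hypothetical result type, used only for this illustration. */
typedef struct {
    uint32_t frame_ptr; /* address of stacked R0, i.e. SP seen by the handler */
    bool     pad_flag;  /* value recorded in bit 9 of the stacked xPSR */
} exc_entry_t;

/* Model of exception entry, assuming CCR.STKALIGN = 1. */
exc_entry_t exception_entry(uint32_t sp)
{
    exc_entry_t e;

    assert((sp & 3u) == 0u);        /* SP is always at least word aligned */
    e.pad_flag  = (sp & 4u) != 0u;  /* true when SP was only 4-byte aligned */
    e.frame_ptr = (sp - BASIC_FRAME_BYTES) & ~(uint32_t)4u; /* force 8-byte alignment */
    return e;
}

/* Model of exception return: drop the frame and the optional padding word. */
uint32_t exception_return(exc_entry_t e)
{
    return (e.frame_ptr + BASIC_FRAME_BYTES) | (e.pad_flag ? 4u : 0u);
}
```

    With an 8-byte-aligned SP the handler sees SP lowered by 32 bytes. With an SP that is only 4-byte aligned (as inside delay() above) it is lowered by 36 bytes, and the flag tells the hardware to restore the original SP on return.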

    Also, I've just checked ARM/KEIL Compiler 6 (based on LLVM), and that compiler generated 8-byte aligned code for delay():

    <pre>
    SUB SP, SP, #0x8
    ...
    ADD SP, SP, #0x8
    BX LR
    </pre>
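
    The difference between the two listings can be pictured as rounding the frame size up to the chosen alignment. The helper below is a hypothetical sketch of that arithmetic (the name align_up is made up for this example). It is not a claim about how either compiler computes its frames internally.

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next multiple of align (align must be a power of two). */
uint32_t align_up(uint32_t size, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (size + align - 1u) & ~(align - 1u);
}
```

    For the single 4-byte local `counter`, align_up(4, 4) gives the 4-byte adjustment seen with IAR, while align_up(4, 8) gives the 8-byte adjustment seen with ARM/KEIL Compiler 6.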

    Now, I don't have the time to investigate all compilers and various optimization levels. I thought that standards, like the ARM ABI, are supposed to settle things like that. I'm just a bit perplexed and couldn't find much information about that.

  • From David Brown@21:1/5 to StateMachineCOM on Mon Sep 19 10:07:55 2022
    (Please get a real newsreader and a real newsserver, rather than using
    the google groups crapware. Google groups is fine for searching old
    posts, but makes a mess of posts - it ruins line endings, code
    formatting, attributions, and generally breaks every Usenet posting
    convention it can. If you /must/ use google groups, please make the
    effort to get attributions right and to quote appropriate parts of the
    earlier posts. And if you are including code snippets, fix the line
    endings of your post. news.eternal-september.org is a free newsserver,
    and Thunderbird is one of many free newsreaders.)

    On 19/09/2022 02:10, StateMachineCOM wrote:
    Yes, the simple delay() function does not call anything. But still, interrupts can preempt it, which is quite likely because a function like this runs for a long time by design (and consumes a significant percentage of the CPU time).

    In fact, I've checked it, and an interrupt preempting delay() must re-align the stack by using the "stack aligner". So the simple (no FPU) Cortex-M exception stack frame of 8 registers (32 bytes) becomes the bigger stack frame of 9 registers (36 bytes).
    Please note that the Cortex-M CPU deals with it just fine and the program runs. But an RTOS or other assembly code that deals with interrupts could break the system if it makes assumptions about the stack alignment. I thought that
    compatibility with interrupts was the primary reason why the ARM ABI stipulates 8-byte stack alignment.


    The hardware has to be able to cope with interrupts occurring while
    stacks are not 8-byte aligned. It's possible that it is marginally
    slower or results in a bigger stack frame, but it has to work.

    The key reason for stack alignment is efficiency. It makes a bigger
    difference when you have caches and big internal buses, and an even
    bigger difference when this is combined with multiple cores. It's also
    possible that some vector and SIMD units require higher alignments. For
    embedded Cortex-M devices, it would not have made much difference (I
    believe the old EABI required 4 byte alignment), but requiring 8 byte
    alignment is a very minor cost that makes future compatibility much
    simpler. Getting it right early on avoids the kind of dog's dinner you
    see in the x86 world where the 64-bit Windows stack alignment is too
    small for the needs of SIMD instructions.

    Also, I've just checked ARM/KEIL Compiler 6 (based on LLVM), and that compiler generated 8-byte aligned code for delay():

    <pre>
    SUB SP, SP, #0x8
    ...
    ADD SP, SP, #0x8
    BX LR
    </pre>

    Now, I don't have the time to investigate all compilers and various optimization levels. I thought that standards, like the ARM ABI, are supposed to settle things like that. I'm just a bit perplexed and couldn't find much information about that.

    A leaf function can be fine with 4 byte stack alignment. A quick test
    shows gcc aligns on 8 bytes, while clang aligns at 4 bytes for a leaf
    function.

    An extremely useful tool for investigating this kind of thing is the
    online compiler at <https://godbolt.org>. It does not include many
    commercial compilers (though it has MSVC), but supports C, C++, and lots
    of languages on a very wide range of compilers and targets. Here you
    can see your code compiled by gcc and clang for Cortex-M4:

    <https://godbolt.org/z/cc6bf6oGe>

  • From StateMachineCOM@21:1/5 to All on Mon Sep 19 09:09:39 2022
    Hi David,
    Thanks for your help.

    Please get a real newsreader and a real newsserver...

    I'd like to do this, but I use this newsgroup so infrequently that I don't want to buy and install anything special. Is there some online tool you'd recommend?

    An extremely useful tool for investigating this kind of thing is the online compiler

    Yes, thank you. It does indeed seem like a useful tool for a quick look at the generated assembly.

    But regarding the stack alignment requirements, the "ARM Procedure Call Standard for the ARM Architecture" (ARM IHI 0042E) says in Section 5.2.1.1 "Universal stack constraints" that "SP mod 4 = 0, The stack must at all times be aligned at word boundary".
    Then, in Section 5.2.1.2 "Stack constraints at a public interface", it strengthens the requirement to: "SP mod 8 = 0. The stack must be double-word aligned".
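
    The two constraints can be written as simple predicates. This is only a sketch to make the distinction concrete; the function names are hypothetical and the SP values are plain integers on a host machine, not a real stack pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "Universal stack constraints": must hold at all times. */
bool sp_ok_always(uint32_t sp)
{
    return sp % 4u == 0u;
}

/* "Stack constraints at a public interface": must hold at such an interface. */
bool sp_ok_at_public_interface(uint32_t sp)
{
    return sp % 8u == 0u;
}
```

    If delay() is entered with SP = 0x20001000, both predicates hold on entry. After SUB SP, SP, #0x4 the value 0x20000FFC still satisfies sp_ok_always() but no longer satisfies sp_ok_at_public_interface().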

    So the question now is: what do they mean by "public interface"?

  • From David Brown@21:1/5 to StateMachineCOM on Mon Sep 19 20:16:06 2022
    On 19/09/2022 18:09, StateMachineCOM wrote:
    Hi David, Thanks for your help.

    Please get a real newsreader and a real newsserver...

    I'd like to do this, but I use this newsgroup so infrequently that I
    don't want to buy and install anything special. Is there some online
    tool you'd recommend?


    Thunderbird is free - as are any of a dozen different newsreaders,
    depending on preferences and OS. Many other email programs also support
    Usenet. There are several free Usenet servers, at least for non-binary
    groups like those in comp.*; news.eternal-september.org is a popular
    one. Your ISP might also provide the service, as it used to be a
    standard part of any internet access package.

    I don't know of any free online interfaces other than google groups,
    which is barely worth the price (although as always with google, it's
    good for searching). There are several paid-for services, mostly
    targeting binary groups (which used to be a popular way to spread
    pirated software and media, before bittorrent).

    Technical groups are all text posts, and most have relatively few posts.
    Even if you start your newsreader once a month, it will take no more
    than a few seconds to download all posts in comp.arch.embedded to bring
    it up to date.

    An extremely useful tool for investigating this kind of thing is
    the online compiler

    Yes, thank you. It does indeed seem like a useful tool for a quick look at
    the generated assembly.


    I use it all the time, for looking at code on different targets,
    comparing different options, checking complicated syntax (such as
    testing C++ features in the latest standards, newer than the compilers I
    have online), comparing the output of different compilers, sharing code
    with others via links, checking if the code I write gives exactly the
    assembly I want, amongst other things.

    But regarding the stack alignment requirements, the "ARM Procedure
    Call Standard for the ARM Architecture" (ARM IHI 0042E) says in
    Section 5.2.1.1 "Universal stack constraints" that "SP mod 4 = 0, The
    stack must at all times be aligned at word boundary". Later in the
    next Section 5.2.1.2 "Stack constraints at a public interface" it
    strengthens the requirements to: "SP mod 8 = 0. The stack must be
    double-word aligned".

    So the question now is: what do they mean by "public interface"?

    I guess that means when calling code, or being called from code, that is independently compiled. When it is within the same compiled code, you
    don't have to follow the standard ABI at all - you (meaning "the
    compiler") can make your own rules regarding parameter passing, volatile
    / non-volatile registers, etc.

  • From Richard Damon@21:1/5 to David Brown on Mon Sep 19 21:48:48 2022
    On 9/19/22 2:16 PM, David Brown wrote:
    On 19/09/2022 18:09, StateMachineCOM wrote:
    But regarding the stack alignment requirements, the "ARM Procedure
    Call Standard for the ARM Architecture" (ARM IHI 0042E) says in
    Section 5.2.1.1 "Universal stack constraints" that "SP mod 4 = 0, The
    stack must at all times be aligned at word boundary". Later in the
    next Section 5.2.1.2 "Stack constraints at a public interface" it
    strengthens the requirements to: "SP mod 8 = 0. The stack must be
    double-word aligned".

    So the question now is: what do they mean by "public interface"?

    I guess that means when calling code, or being called from code, that is independently compiled.  When it is within the same compiled code, you
    don't have to follow the standard ABI at all - you (meaning "the
    compiler") can make your own rules regarding parameter passing, volatile
    / non-volatile registers, etc.

    Yes, the standard ABI defines what functions are allowed to presume when
    they are called by "unknown" code. That is what is allowed at a "Public
    API": being public, anyone can call it.

    Since routines are allowed to assume they are entered with a stack
    pointer aligned to a multiple of 8, the caller needs to ensure that (at
    least if their own entry at a public API also had the stack pointer
    properly aligned).

    The purpose of this is that some common instructions require their
    source/destination to be so aligned, and it is a bit awkward to write a
    subroutine that might be called with a misaligned stack pointer and has
    to realign it itself (it typically costs a register to hold the old SP),
    so the ABI requires the stack to be so aligned.
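
    To show why self-realignment costs a register, here is a small host-side C model. It is a sketch only, with hypothetical names; real prologue code would be a few assembly instructions, and `saved_sp` stands in for the register that has to hold the old SP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a frame that realigns SP itself. */
typedef struct {
    uint32_t sp;       /* SP used inside the routine, forced to 8-byte alignment */
    uint32_t saved_sp; /* the caller's SP, which must be kept somewhere */
} realigned_t;

/* Prologue: remember the incoming SP, then round it down to a multiple of 8. */
realigned_t realign_prologue(uint32_t sp)
{
    realigned_t f;

    assert((sp & 3u) == 0u);    /* SP is always at least word aligned */
    f.saved_sp = sp;
    f.sp       = sp & ~(uint32_t)7u;
    return f;
}

/* Epilogue: the amount rounded off is not known statically, so restore the copy. */
uint32_t realign_epilogue(realigned_t f)
{
    return f.saved_sp;
}
```

    Because the adjustment depends on the incoming SP, the epilogue cannot simply add a constant back. The saved copy is what ties up a register for the duration of the routine.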

    If a piece of code doesn't call any outside routines, then this isn't a
    problem, so the ABI doesn't restrict the stack pointer at those times.
    This is important, as it isn't uncommon to want to temporarily push a
    single word onto the stack for a bit, and if the stack pointer needed to
    be kept at an alignment of 8, that operation would need to use up extra
    stack memory.
