• Redundant prefixes break fsrm in Ice Lake

    From Tavis Ormandy@21:1/5 to All on Wed Nov 15 14:59:06 2023
    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
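
For readers who want to see the bytes involved, here is a minimal sketch of the two encodings under discussion, assuming 64-bit mode. The helper name `rex` is made up for illustration; the byte values are the standard encodings (F3 for REP, A4 for MOVSB, and a REX prefix of the form 0100WRXB).

```python
# Byte-level view of "rep movsb" with and without a redundant REX.R prefix.
REP = 0xF3    # repeat prefix
MOVSB = 0xA4  # movsb opcode (no ModRM byte, so REX.R has nothing to extend)


def rex(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte with the layout 0100 W R X B (hypothetical helper)."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


plain = bytes([REP, MOVSB])                # rep movsb
redundant = bytes([REP, rex(r=1), MOVSB])  # rep rex.r movsb

print(plain.hex(" "))      # f3 a4
print(redundant.hex(" "))  # f3 44 a4
```

The REX byte has to sit immediately before the opcode, which is why it comes after the REP prefix here.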

    Tavis.

    --
    _o) $ lynx lock.cmpxchg8b.com
    /\\ _o) _o) $ finger [email protected]
    _\_V _( ) _( ) @taviso

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Tavis Ormandy on Wed Nov 15 19:10:13 2023
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor, and the
    decoder, instead of latching only the last (or first) prefix data, ORs all
    the prefix data of a "kind" of prefix into a prefix container, then execution
    is delivered a different pattern of bits than the programmer expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??
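
The guess above can be sketched as a toy model. This is purely illustrative and says nothing about how any real decoder is built: the bit assignments, the `PREFIX_BITS` table and both function names are invented for the example. It only shows how a "last prefix wins" latch and an "OR everything" latch diverge once two different prefixes of the same group appear.

```python
# Hypothetical model of a per-group "prefix container" in a decoder.
# The bit values below are invented for illustration only.
LOCK, REPNE, REP = 0b001, 0b010, 0b100

PREFIX_BITS = {0xF0: LOCK, 0xF2: REPNE, 0xF3: REP}


def latch_last(prefix_bytes):
    """Keep only the most recent prefix seen for the group."""
    state = 0
    for b in prefix_bytes:
        state = PREFIX_BITS[b]
    return state


def latch_or(prefix_bytes):
    """OR every prefix seen for the group into the container."""
    state = 0
    for b in prefix_bytes:
        state |= PREFIX_BITS[b]
    return state


seq = [0xF2, 0xF3]           # REPNE followed by REP
print(bin(latch_last(seq)))  # 0b100
print(bin(latch_or(seq)))    # 0b110
```

Note that in this model repeating the same prefix (REP REP REP) gives the same result under both policies; the two only differ when distinct prefixes of one group are mixed.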

    Tavis.

  • From Scott Lurndal@21:1/5 to MitchAlsup on Wed Nov 15 20:17:30 2023
    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

  • From MitchAlsup@21:1/5 to Scott Lurndal on Wed Nov 15 20:57:58 2023
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

  • From BGB@21:1/5 to MitchAlsup on Wed Nov 15 16:36:51 2023
    On 11/15/2023 2:57 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb
    seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the
    programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I think it is semi-common to align function entry points and some labels
    and similar, but IME this was usually done with NOP or "INT 3"
    instructions or similar...

    I think the idea here is that aligning function entry points can
    potentially make the function calls slightly faster due to "cache magic"
    or similar. Also INT3 crashes the program if it tries to branch into
    this padding space.

    But, at least, much beyond this, it is unclear how alignment would be
    needed or beneficial on x86 or x86-64.

    And, to this end (if one needs inline padding), using one of the
    multi-byte NOP sequences seems less likely to invoke weird/undefined
    behavior than trying to do something weird with opcode prefixes...

  • From Scott Lurndal@21:1/5 to MitchAlsup on Wed Nov 15 22:54:26 2023
    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I refer you to the Intel Architecture Software Optimization Guide.

    Specifically:

    "Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch
    targets should be 16-byte aligned."
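
As a rough illustration of what that rule asks of an assembler or compiler, the number of filler bytes needed in front of a branch target is simple modular arithmetic. The function name `padding_to_align` is made up for this sketch; the example address is the one from the disassembly quoted later in the thread.

```python
def padding_to_align(addr, alignment=16):
    """Bytes of filler needed so that addr + padding is a multiple of alignment."""
    return (-addr) % alignment


# 0x58eae6 needs 10 bytes of filler to reach the next 16-byte boundary, 0x58eaf0.
print(padding_to_align(0x58EAE6))      # 10
print(padding_to_align(0x58EAF0))      # 0
print(padding_to_align(0x58EAE6, 64))  # 26
```

How those filler bytes are produced (NOPs, INT3, or extra prefixes on a neighbouring instruction) is the point the thread goes on to argue about.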

  • From MitchAlsup@21:1/5 to BGB on Wed Nov 15 23:34:46 2023
    BGB wrote:

    On 11/15/2023 2:57 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb
    seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the
    programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I think it is semi-common to align function entry points and some labels
    and similar, but IME this was usually done with NOP or "INT 3"
    instructions or similar...
    <
    Yes, this is common (and useful)
    <
    How many functions start off with REP REP REP MOVS ??
    <
    I think the idea here is that aligning a function entry points can potentially make the function calls slightly faster due to "cache magic"
    or similar. Also INT3 crashes the program if it tries to branch into
    this padding space.
    <
    But REP REP REP MOVS never occurs at the entry point of a function !!
    <
    But, at least, much beyond this, it is unclear how alignment would be
    needed or beneficial on x86 or x86-64.

    And, to this end (if one needs inline padding), using one of the
    multi-byte NOP sequences seems less likely to invoke weird/undefined
    behavior than trying to do something weird with opcode prefixes...

  • From MitchAlsup@21:1/5 to Scott Lurndal on Wed Nov 15 23:36:15 2023
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I refer you to the Intel Architecture Software Optimization Guide.

    Specifically:

    "Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch
    targets should be 16-byte aligned."
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding instructions following label boundaries.

  • From BGB@21:1/5 to MitchAlsup on Wed Nov 15 17:53:25 2023
    On 11/15/2023 5:34 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/15/2023 2:57 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I think it is semi-common to align function entry points and some
    labels and similar, but IME this was usually done with NOP or "INT 3"
    instructions or similar...
    <
    Yes, this is common (and useful)
    <
    How many functions start off with REP REP REP MOVS ??
    <
    I think the idea here is that aligning a function entry points can
    potentially make the function calls slightly faster due to "cache
    magic" or similar. Also INT3 crashes the program if it tries to branch
    into this padding space.
    <
    But REP REP REP MOVS never occurs at the entry point of a function !!
    <

    Granted, yes, I have not seen this one.

    IME, it is usually something like:
    ...; INT3; INT3; INT3; PUSH RBP; MOV RBP, RSP; ...
    Or similar...

    And, at the end of a function:
    RET; NOP; NOP; ...; INT3; INT3; ...

    With any label-alignment via one of the multi-byte NOP encodings.


    But, at least, much beyond this, it is unclear how alignment would be
    needed or beneficial on x86 or x86-64.

    And, to this end (if one needs inline padding), using one of the
    multi-byte NOP sequences seems less likely to invoke weird/undefined
    behavior than trying to do something weird with opcode prefixes...

  • From Scott Lurndal@21:1/5 to MitchAlsup on Thu Nov 16 00:36:31 2023
    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I refer you to the Intel Architecture Software Optimization Guide.

    Specifically:

    "Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch
    targets should be 16-byte aligned."
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding instructions
    following label boundaries.


    In looking at a fairly recent ELF binary, mostly I see various length
    nops, and a bunch of 'repz retq' sequences.

    58eae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

    58eaf0: f3 c3 repz retq

  • From John Levine@21:1/5 to All on Thu Nov 16 00:56:36 2023
    According to Scott Lurndal <[email protected]>:
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    What? Why wouldn't you use a NOP? The Intel manual has a list
    of NOPs with sizes from one byte to nine.
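
For reference, the recommended multi-byte NOP encodings listed in the Intel manual can be tabulated and combined to fill a gap of any size. The table below reproduces those byte sequences; the `nop_padding` helper and its greedy "largest NOP first" strategy are assumptions made for this sketch, not what any particular assembler does.

```python
# Recommended multi-byte NOP encodings, keyed by length in bytes.
NOPS = {
    1: "90",
    2: "66 90",
    3: "0f 1f 00",
    4: "0f 1f 40 00",
    5: "0f 1f 44 00 00",
    6: "66 0f 1f 44 00 00",
    7: "0f 1f 80 00 00 00 00",
    8: "0f 1f 84 00 00 00 00 00",
    9: "66 0f 1f 84 00 00 00 00 00",
}


def nop_padding(n):
    """Fill n bytes with NOPs, greedily using the longest encoding (hypothetical helper)."""
    out = bytearray()
    while n > 0:
        k = min(n, 9)
        out += bytes.fromhex(NOPS[k])
        n -= k
    return bytes(out)


print(nop_padding(10).hex(" "))  # 66 0f 1f 84 00 00 00 00 00 90
```

A ten-byte gap therefore costs two instructions with this table, which is one reason toolchains sometimes stretch a single NOP with extra prefixes instead.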

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup@21:1/5 to BGB on Thu Nov 16 03:00:41 2023
    BGB wrote:

    On 11/15/2023 5:34 PM, MitchAlsup wrote:
    BGB wrote:

    On 11/15/2023 2:57 PM, MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I think it is semi-common to align function entry points and some
    labels and similar, but IME this was usually done with NOP or "INT 3"
    instructions or similar...
    <
    Yes, this is common (and useful)
    <
    How many functions start off with REP REP REP MOVS ??
    <
    I think the idea here is that aligning a function entry points can
    potentially make the function calls slightly faster due to "cache
    magic" or similar. Also INT3 crashes the program if it tries to branch
    into this padding space.
    <
    But REP REP REP MOVS never occurs at the entry point of a function !!
    <

    Granted, yes, I have not seen this one.

    IME, it is usually something like:
    ...; INT3; INT3; INT3; PUSH RBP; MOV RBP, RSP; ...
    Or similar...

    And, at the end of a function:
    RET; NOP; NOP; ...; INT3; INT3; ...

    With any label-alignment via one of the multi-byte NOP encodings.

    Yes, but control leaves the previous function at RET and control arrives at
    the next function at INT3 so the NOPs are never actually executed. And if
    you looked at the ASCII assembly, you will see::

    RET
    NOP
    NOP
    label:
    INT3

    But, at least, much beyond this, it is unclear how alignment would be
    needed or beneficial on x86 or x86-64.

    And, to this end (if one needs inline padding), using one of the
    multi-byte NOP sequences seems less likely to invoke weird/undefined
    behavior than trying to do something weird with opcode prefixes...

  • From MitchAlsup@21:1/5 to Scott Lurndal on Thu Nov 16 03:06:11 2023
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    I refer you to the Intel Architecture Software Optimization Guide.

    Specifically:

    "Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch
    targets should be 16-byte aligned."
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding instructions
    following label boundaries.


    In looking at a fairly recent ELF binary, mostly I see various length
    nops, and a bunch of 'repz retq' sequences.

    58eae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

    58eaf0: f3 c3 repz retq

    face it:: x86 is so broken it is amazing that it works at all.

    And never postulate that this is the BEST way of padding to some useful boundary--just like 68K used to

    CMP D1,#7
    BNE ELSE
    // then clause
    ...
    ...
    MOV dummy,#DW // consume the inst in the Else clause
    ELSE:
    INST // The immediate of the MOV consumes this instruction
    // join point

    And if you EVER get the chance to do your own ISA, make sure there is no
    way to and no need to do these kinds of things.

  • From MitchAlsup@21:1/5 to All on Thu Nov 16 03:07:17 2023
    Is there something wrong with

    ...
    RET
    .align 64B
    Function:
    ...

    ??

  • From Terje Mathisen@21:1/5 to MitchAlsup on Thu Nov 16 08:26:55 2023
    MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Tavis Ormandy wrote:

    I thought this might interest some posters here, I wrote up a bug we
    discovered in the fast short repeat move feature added in Ice Lake.

    The quick summary is that adding a redundant rex.r prefix to movsb
    seems
    to cause ROB entries to be associated with incorrect addresses. I have
    no special insight into what the microcode is doing, maybe some reader
    here can read between the lines and explain what is going on :)

    https://lock.cmpxchg8b.com/reptar.html
    <
    My GUESS has to do with how instruction-boundaries are determined.
    When the decoder encounters a prefix, it latches prefix data and goes
    on decoding. So, if you have multiple prefixes of the same flavor,
    instead of latching only the last (or first) prefix data, but instead
    ORs all the prefix data of a "kind" of prefix into a prefix container
    then execution is delivered a different pattern of bits than the
    programmer
    expected.
    <
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    The code is already byte aligned, what more is necessary ??

    Some loops, on some machines, run faster if the loop top is cache line
    aligned (or maybe 16-byte/32-byte aligned), since that allows the entire
    loop to fit within a single cache line, or whatever the loop buffer is?

    I'm not arguing with you that it shouldn't be needed, just that there
    have been and are several machines which do benefit from it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to John Levine on Thu Nov 16 11:37:25 2023
    John Levine wrote:
    According to Scott Lurndal <[email protected]>:
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    What? Why wouldn't you use a NOP? The Intel manual has a list
    of NOPs with sizes from one byte to nine.

    Maybe because a few added/redundant prefix bytes on an instruction you
    are going to do anyway could be even cheaper/faster than a NOP?

    I.e. 0 vs 1 cycle (or a fraction thereof on average)?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to Scott Lurndal on Thu Nov 16 11:23:05 2023
    Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding instructions
    following label boundaries.


    In looking at a fairly recent ELF binary, mostly I see various length
    nops, and a bunch of 'repz retq' sequences.

    58eae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

    58eaf0: f3 c3 repz retq

    Intel Instruction manual vol2 section 2.1:

    "Use of repeat prefixes and/or undefined opcodes with other Intel 64 or
    IA-32 instructions is reserved; such use may cause unpredictable behavior"

    These prefix rules were added relatively recently (maybe last 10 years?).
    While they only allow one prefix from each of Group 1..4,
    they still allow prefix bytes to be in any order thereby wasting
    much opcode space on redundant permutations and combinations.
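
The "one prefix from each of Group 1..4" rule can be checked mechanically. The sketch below uses the legacy prefix groups from the Intel manual (Group 1: LOCK/REPNE/REP, Group 2: segment overrides, Group 3: operand-size, Group 4: address-size); the function names are hypothetical, and the scan simply stops at the first byte that is not a legacy prefix, so it does not decode REX or the opcode itself.

```python
from collections import Counter

# Legacy prefix byte -> prefix group number.
PREFIX_GROUP = {
    0xF0: 1, 0xF2: 1, 0xF3: 1,                             # LOCK, REPNE, REP
    0x2E: 2, 0x36: 2, 0x3E: 2, 0x26: 2, 0x64: 2, 0x65: 2,  # segment overrides
    0x66: 3,                                               # operand-size override
    0x67: 4,                                               # address-size override
}


def legacy_prefix_groups(code):
    """Count leading legacy prefixes per group (hypothetical helper)."""
    counts = Counter()
    for byte in code:
        group = PREFIX_GROUP.get(byte)
        if group is None:
            break  # first non-prefix byte: REX or the opcode
        counts[group] += 1
    return counts


def over_used_groups(code):
    """Groups that contribute more than one prefix to this instruction."""
    return sorted(g for g, n in legacy_prefix_groups(code).items() if n > 1)


print(over_used_groups(bytes.fromhex("f3c3")))        # []  (repz retq)
print(over_used_groups(bytes.fromhex("f3f3f3a4")))    # [1] (rep rep rep movsb)
print(over_used_groups(bytes.fromhex("662e0f1f84")))  # []  (start of the nopw above)
```

This only counts prefixes; whether a given prefix is meaningful for the opcode that follows (the "reserved" wording quoted above) is a separate question the sketch does not attempt to answer.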

  • From MitchAlsup@21:1/5 to EricP on Fri Nov 17 18:42:18 2023
    EricP wrote:

    Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding instructions
    following label boundaries.


    In looking at a fairly recent ELF binary, mostly I see various length
    nops, and a bunch of 'repz retq' sequences.

    58eae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

    58eaf0: f3 c3 repz retq

    Intel Instruction manual vol2 section 2.1:

    "Use of repeat prefixes and/or undefined opcodes with other Intel 64 or
    IA-32 instructions is reserved; such use may cause unpredictable behavior"

    These prefix rules were added relatively recently (maybe last 10 years?).
    While they only allow one prefix from each of Group 1..4,
    they still allow prefix bytes to be in any order thereby wasting
    much opcode space on redundant permutations and combinations.


    This is what I was talking about; the decoder is just routing data to
    a set of storage containers and only after identifying the OpCode, do
    these containers modify the behavior of the instruction during execution.
    The decoder would not "count" the prefixes, just route data, and if
    data came from multiple locations, what gets latched in the container
    becomes mask specific.

    And since they are reserved, your random code generator should not be generating them.

  • From Paul A. Clayton@21:1/5 to John Levine on Fri Nov 17 22:38:26 2023
    On 11/15/23 7:56 PM, John Levine wrote:
    According to Scott Lurndal <[email protected]>:
    But who ever decided multiple prefixes of the same kind are LEGAL ??

    The compiler people use multiple prefixes to align code.

    What? Why wouldn't you use a NOP? The Intel manual has a list
    of NOPs with sizes from one byte to nine.

    It is possible that a NOP is more expensive than bloating one or
    more instructions with alternative encodings. Even if a NOP is
    never "executed" (and, of course, early microprocessors did just
    execute NOPs), it might consume a ROB entry (to facilitate precise
    trapping when an instruction address is fetched, e.g. — obviously
    one could have coarser-grained ROB entries and replay from an
    earlier point even just "fusing" a NOP with the following
    instruction).

    Even if every compiler did "the right thing" to provide target
    alignment, clever programmers could include assembly to do "the
    clever thing". A clever programmer might reason that a NOP
    increases instruction count and therefore is harmful to
    performance. Also there may be a greater fear that some other
    programmer would remove a NOP as useless when a useless prefix
    might not be recognized as useless.

  • From EricP@21:1/5 to EricP on Sat Nov 18 10:32:15 2023
    EricP wrote:
    Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding
    instructions
    following label boundaries.


    In looking at a fairly recent ELF binary, mostly I see various length
    nops, and a bunch of 'repz retq' sequences.

    58eae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

    58eaf0: f3 c3 repz retq

    Intel Instruction manual vol2 section 2.1:

    "Use of repeat prefixes and/or undefined opcodes with other Intel 64 or
    IA-32 instructions is reserved; such use may cause unpredictable behavior"

    These prefix rules were added relatively recently (maybe last 10 years?).
    While they only allow one prefix from each of Group 1..4,
    they still allow prefix bytes to be in any order thereby wasting
    much opcode space on redundant permutations and combinations.

    Actually the prefix rules go back farther - they are present in
    an Intel x86 instruction manual from 2001 I had on a backup.
    Older backups are not readily accessible.

    So the 'REP REP MOVS' and 'repz retq' have been clearly documented
    as unpredictable for a long time.
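
    As an illustration of the "one prefix from each of Group 1..4" rule, a
    minimal Python sketch follows. The group table reflects the legacy
    prefix groups in the Intel manual (vol. 2, section 2.1.1); the function
    names are hypothetical, and the scan ignores REX/VEX and operating mode,
    so it is a checker for byte strings, not a decoder.

```python
# Legacy prefix groups as described in the Intel SDM, vol. 2, section 2.1.1.
PREFIX_GROUP = {
    0xF0: 1, 0xF2: 1, 0xF3: 1,            # LOCK, REPNE/REPNZ, REP/REPE/REPZ
    0x2E: 2, 0x36: 2, 0x3E: 2,            # CS, SS, DS segment overrides
    0x26: 2, 0x64: 2, 0x65: 2,            # ES, FS, GS segment overrides
    0x66: 3,                              # operand-size override
    0x67: 4,                              # address-size override
}


def split_prefixes(code):
    """Split the leading legacy prefix bytes from the rest of the instruction."""
    i = 0
    while i < len(code) and code[i] in PREFIX_GROUP:
        i += 1
    return code[:i], code[i:]


def groups_used_more_than_once(code):
    """Return the sorted prefix groups that appear more than once."""
    prefixes, _ = split_prefixes(code)
    counts = {}
    for b in prefixes:
        g = PREFIX_GROUP[b]
        counts[g] = counts.get(g, 0) + 1
    return sorted(g for g, n in counts.items() if n > 1)
```

    Under this check 'repz retq' (f3 c3) uses each group at most once; what
    the manual text quoted above objects to there is a repeat prefix on a
    non-string instruction, which is a separate rule this sketch does not
    test.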

  • From John Levine@21:1/5 to All on Sat Nov 18 16:36:25 2023
    According to EricP <[email protected]>:
    These prefix rules were added relatively recently (maybe last 10 years?).
    While they only allow one prefix from each of Group 1..4,
    they still allow prefix bytes to be in any order thereby wasting
    much opcode space on redundant permutations and combinations.

    Actually the prefix rules go back farther - they are present in
    an Intel x86 instruction manual from 2001 I had on a backup.

    I have the October 1979 8086 Family User's Manual here. (The actual
    paper one, not a scan.)

    In the discussion of repeat prefixes, it says they're interruptible,
    and if a second or third segment or lock prefix is present it won't
    work because it only remembers one prefix for the interrupt. You can
    turn off interrupts, but an NMI might still break stuff.

    The only plausible two prefix instruction I can think of is an exchange
    with a segment override:

    LOCK XCHG ES:FOO,AX

    The assembler will generate the lock prefix first. I doubt they gave
    much thought to what would happen if the prefixes were in the other
    order.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Anton Ertl@21:1/5 to EricP on Sat Nov 18 16:21:46 2023
    EricP <[email protected]> writes:
    Actually the prefix rules go back farther - they are present in
    an Intel x86 instruction manual from 2001 I had on a backup.
    Older backups are not readily accessible.

    So the 'REP REP MOVS' and 'repz retq' have been clearly documented
    as unpredictable for a long time.

    Fortunately, unlike "undefined behaviour" advocates and others who
    point to documentation, Intel is aware of Hyrum's law and does not do
    "unpredictable behaviour" on any instruction sequence on purpose.
    Consequently, they treated the REX MOVSB issue as a bug that they
    should fix. However, in this case probably not because they expected
    to see such code in the wild (AFAIK it was only found by fuzzing), but
    because it allows privilege escalation.

    In particular, if they now implemented a CPU where "repz retq" did
    something different than "retq", that would mean that a lot of
    binaries would no longer work, and no amount of pointing to
    documentation from 2001 or from 1978 to 2023 would stop the reputation
    damage that would ensue. That's because compilers (and probably also
    assembly language programmers) actually followed other documentation
    that recommended using repz retq (see
    <https://repzret.org/p/repzret/>).

    In this case, one interesting aspect is that the K8 is the first AMD64
    CPU, years before any Intel CPU could be bought that would be
    compatible with this instruction set. So by the time Intel brought
    out their AMD64-compatible CPU (although they had their own names for
    the architecture: IA32e, then EM64T, finally Intel64), there were a
    lot of binaries with repz retq around, and if Intel wanted to sell
    those CPUs into the 64-bit market, they had better support these
    binaries, and the 2001 documentation for a different architecture did
    not matter in any case.

    If you or Intel want to reserve some encoding space, the way to do it
    is to either trap on the encoding, or treat it as noop. The noops are
    for encodings that you later want to define as hints, because hints
    architecturally are noops.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to EricP on Sat Nov 18 16:21:24 2023
    EricP <[email protected]> writes:
    EricP wrote:
    Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding
    instructions following label boundaries.


    In looking at a fairly recent ELF binary, mostly I see various length
    nops, and a bunch of 'repz retq' sequences.

    58eae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

    58eaf0: f3 c3 repz retq

    Intel Instruction manual vol2 section 2.1:

    "Use of repeat prefixes and/or undefined opcodes with other Intel 64 or
    IA-32 instructions is reserved; such use may cause unpredictable behavior"

    These prefix rules were added relatively recently (maybe last 10 years?).
    While they only allow one prefix from each of Group 1..4,
    they still allow prefix bytes to be in any order thereby wasting
    much opcode space on redundant permutations and combinations.

    Actually the prefix rules go back farther - they are present in
    an Intel x86 instruction manual from 2001 I had on a backup.
    Older backups are not readily accessible.


    The iAPX 86,88 manual from 1981 states when discussing REP/REPE/REPNE
    in the context of interrupts:

    "The processor 'remembers' only one prefix in effect
    at the time of the interrupt, the prefix that immediately precedes
    the string instruction."

    Which implies that segment overrides in conjunction with a repeat
    prefix won't be preserved if the MOVS is interrupted (they suggest
    CLI/STI during string operations with segment override(s), noting
    that won't help if an NMI occurs).

    I could find no text describing any other restrictions on prefix
    bytes. For that matter, while there were references to segment
    override prefixes, they weren't actually enumerated in the data sheet.

    The instruction set reference data is interesting with respect
    to the clock counts for each instruction. A 16-bit integer
    multiply, for example, took between 128 and 154 clocks when
    using register operands.

    Conditional branches were 16 or 4 clocks (presumably taken vs.
    not-taken).



    So the 'REP REP MOVS' and 'repz retq' have been clearly documented
    as unpredictable for a long time.




  • From Anton Ertl@21:1/5 to John Levine on Sat Nov 18 17:48:10 2023
    John Levine <[email protected]> writes:
    The only plausible two prefix instruction I can think of is an exchange
    with a segment override:

    LOCK XCHG ES:FOO,AX

    When looking for REPZ, I found <https://www.felixcloutier.com/x86/rep:repe:repz:repne:repnz>, and it
    lists, e.g.,

    F3 REX.W A4

    F3 is the REP prefix. This instruction is a REP MOVSB, and the REX
    prefix seems redundant to me in this case. I don't know if that's the one
    the OP was about, though.

    It also lists

    F3 REX.W A5

    That's REP MOVSQ, and the REX prefix is not redundant here. But
    that's AMD64.

    However, the page also lists

    F3 A5

    which is either REP MOVSW or REP MOVSD (REP MOVSL for AT&T syntax),
    depending on mode. But there is also the 66/67 prefix for switching
    to the other mode. E.g., if the mode is 32-bit addresses and 32-bit
    data, and you want a REP MOVSW that uses 16-bit address registers,
    maybe you would do something like

    66 67 F3 A5

    But that's IA-32; for 8086 I indeed cannot think of other prefixes.
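
    The REX cases listed above can be sketched in a few lines of Python. In
    64-bit mode the bytes 0x40-0x4F are REX prefixes with the layout
    0100WRXB, so REX.W is 0x48 and the rex.r from the original post is 0x44.
    The function names are hypothetical, and the sketch handles only
    [F3] [REX] A4/A5 in 64-bit mode, ignoring the 66/67 prefixes.

```python
def decode_rex(byte):
    """Return the W, R, X, B bits of a REX prefix, or None if `byte` is not one.

    Only meaningful in 64-bit mode, where 0x40-0x4F are REX prefixes.
    """
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,
        "R": (byte >> 2) & 1,
        "X": (byte >> 1) & 1,
        "B": byte & 1,
    }


def describe_movs(code):
    """Name the string move encoded by [F3] [REX] A4/A5 in 64-bit mode."""
    code = bytes(code)
    rep = code[:1] == b"\xf3"
    rest = code[1:] if rep else code
    rex = decode_rex(rest[0])
    if rex is not None:
        rest = rest[1:]
    opcode = rest[0]
    if opcode == 0xA4:
        name = "movsb"  # byte move: REX.W does not change the operand size
    elif opcode == 0xA5:
        name = "movsq" if rex and rex["W"] else "movsd"
    else:
        raise ValueError("not a MOVS opcode")
    return ("rep " if rep else "") + name
```

    With this, F3 48 A4 and F3 A4 both come out as "rep movsb", which is the
    sense in which the REX prefix is redundant for the byte form.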

    For MOVS, a segment override prefix overrides the implicit DS: of the
    source operand; the segment of the destination (implicitly ES:) cannot
    be overridden. The page on REP above says nothing about segment
    override limitations, so I expect that this limitation was dropped in
    the 386 and later processors (probably already in the 286, where the
    idea was to use segment registers (and overrides) a lot).

    The assembler will generate the lock prefix first. I doubt they gave
    much thought to what would happen if the prefixes were in the other
    order.

    For the original decoder, each prefix probably set a bit, and the
    order did not matter. Therefore later implementations had to accept
    all orders.
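
    A toy Python model of that conjecture, purely as an illustration: each
    prefix byte sets one flag bit, so neither the order nor the repetition
    of prefixes changes the latched state. The Prefix flag names and
    latch_prefixes are invented for this sketch and cover only three prefix
    bytes; nothing here is taken from actual 8086 decoder documentation.

```python
from enum import IntFlag


class Prefix(IntFlag):
    # Hypothetical flag bits, one per modelled prefix.
    LOCK = 1
    REP = 2
    ES = 4


# Prefix byte -> flag bit it sets in this toy model.
BITS = {0xF0: Prefix.LOCK, 0xF3: Prefix.REP, 0x26: Prefix.ES}


def latch_prefixes(prefix_bytes):
    """OR the flag for every prefix byte into one state word."""
    state = Prefix(0)
    for b in prefix_bytes:
        state |= BITS[b]
    return state
```

    In this model LOCK ES: and ES: LOCK latch the same state, and
    REP REP REP is indistinguishable from a single REP.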

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Levine@21:1/5 to All on Sat Nov 18 22:07:23 2023
    According to Anton Ertl <[email protected]>:
    For MOVS, a segment override prefix overrides the implicit DS: of the
    source operand; the segment of the destination (implicitly ES:) cannot
    be overridden. The page on REP above says nothing about segment
    override limitations, so I expect that this limitation was dropped in
    the 386 and later processors (probably already in the 286, where the
    idea was to use segment registers (and overrides) a lot).

    I looked at my 1985 i286 manual. The LOCK prefix was fairly useless
    since XCHG now always locks, so it only affected MOVS, INS, and OUTS,
    I guess for unaligned word transfers. They describe REP MOVS and say
    that segment overrides work for the source address, no warning about
    interrupts.

    Appendix D on compatibility with the 86/88 has cryptic advice not to
    use duplicate prefixes because the 286 has a maximum instruction
    length of 10 bytes, while the 86/88 had no limit.

    So I guess you're right about the 86/88 prefixes setting a flag bit
    and otherwise being forgotten, but the 286 remembered the whole
    instruction with the prefixes so long as it wasn't excessively long.


    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup@21:1/5 to Anton Ertl on Sat Nov 18 23:41:00 2023
    Anton Ertl wrote:

    If you or Intel want to reserve some encoding space, the way to do it
    is to either trap on the encoding, or treat it as noop. The noops are
    for encodings that you later want to define as hints, because hints
    architecturally are noops.

    No, for future compatibility, you can only raise exceptions on unrecognized
    bit patterns--otherwise you add future undefined behavior to your
    architecture. Taking unrecognized things as NoOps is a sure way to shoot
    yourself in the foot with a very slow and very painful bullet.

    See your own trailer:
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

    - anton

  • From Chris M. Thomasson@21:1/5 to John Levine on Sat Nov 18 16:20:30 2023
    On 11/18/2023 8:36 AM, John Levine wrote:
    According to EricP <[email protected]>:
    These prefix rules were added relatively recently (maybe last 10 years?).
    While they only allow one prefix from each of Group 1..4,
    they still allow prefix bytes to be in any order thereby wasting
    much opcode space on redundant permutations and combinations.

    Actually the prefix rules go back farther - they are present in
    an Intel x86 instruction manual from 2001 I had on a backup.

    I have the October 1979 8086 Family User's Manual here. (The actual
    paper one, not a scan.)

    In the discussion of repeat prefixes, it says they're interruptible,
    and if a second or third segment or lock prefix is present it won't
    work because it only remembers one prefix for the interrupt. You can
    turn off interrupts, but an NMI might still break stuff.

    The only plausible two prefix instruction I can think of is an exchange
    with a segment override:

    LOCK XCHG ES:FOO,AX

    Funny aspect, XCHG has an implied LOCK prefix... :^)




    The assembler will generate the lock prefix first. I doubt they gave
    much thought to what would happen if the prefixes were in the other
    order.


  • From Anton Ertl@21:1/5 to MitchAlsup on Sun Nov 19 13:35:08 2023
    [email protected] (MitchAlsup) writes:
    Anton Ertl wrote:

    If you or Intel want to reserve some encoding space, the way to do it
    is to either trap on the encoding, or treat it as noop. The noops are
    for encodings that you later want to define as hints, because hints
    architecturally are noops.

    No, for future compatibility, you can only raise exceptions on unrecognized
    bit patterns--otherwise you add future undefined behavior to your architecture.

    What behaviour is undefined by a noop (which is what a hint is architecturally)?

    Taking unrecognized things as NoOps is a sure way to shoot yourself in the foot
    with a very slow and very painful bullet.

    They are recognized as noops, and microarchitecturally have no
    specific performance impact. In the future they will continue to be
    noops, but they may influence the performance by providing
    microarchitectural hints. True, if somebody uses these noops instead
    of the recommended ones, in the future their application may suffer a
    slowdown, but the application will work correctly.

    The alternative is to add a previously trapping bit pattern as a hint.
    The result will be that the hint will not be used for at least a
    decade, because nobody wants their application to die if it is run
    on hardware of the previous generation.

    Anyway, the question is if hint instructions are still relevant. For
    the most part, they seem to have been replaced by history-based
    mechanisms.

    * Branch direction hints? We have branch predictors.

    * Branch target hints? We have BTBs and indirect branch predictors.

    * Prefetch instructions? Hardware prefetchers tend to work better, so
    they fell into disuse.

    Is there anything I forgot?

    Searching for "hint" in <https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
    they use the register numbers of the JALR (indirect call) instruction
    for giving a hint on whether and how to use the return-address stack
    (x1 and x5 are used for calls, returns or coroutine calls).

    Those are the only hints that the instruction set specification defines.

    It also defines that a number of compressed encodings that do not
    change architectural state are noops that may become hints in the
    future, e.g. C.ADDI with an immediate value of 0.
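
    For the C.ADDI example, a minimal Python sketch of how such an encoding
    could be recognised. It assumes the standard RVC layout for C.ADDI
    (op = 01 in bits 1:0, funct3 = 000 in bits 15:13, rd in bits 11:7, the
    immediate split across bit 12 and bits 6:2). The function name and the
    returned labels are made up for illustration, and the rd = x0 case with
    a nonzero immediate is deliberately left unclassified.

```python
def classify_c_addi(insn):
    """Classify a 16-bit encoding that falls in the C.ADDI slot."""
    if (insn & 0x3) != 0b01 or ((insn >> 13) & 0x7) != 0b000:
        return "not C.ADDI"
    rd = (insn >> 7) & 0x1F
    # Raw 6-bit immediate field: bit 12 is imm[5], bits 6:2 are imm[4:0].
    imm = (((insn >> 12) & 1) << 5) | ((insn >> 2) & 0x1F)
    if rd == 0 and imm == 0:
        return "C.NOP"
    if rd == 0:
        return "rd=x0 (not classified here)"
    if imm == 0:
        return "HINT"  # changes no architectural state
    return "C.ADDI"
```

    For example, 0x0505 is c.addi a0, 1, while 0x0501 has the same rd with a
    zero immediate and so lands in the reserved-for-hints space.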

    Interestingly, they did not define such noops as possible future hints
    for the uncompressed instruction set. I guess they expected any
    implementation that implements hint instructions to also implement the
    compressed extension, but given that big implementations tend to not
    need hints (see above), while smaller ones may benefit from them, I
    wonder whether this is really such a good idea.

    BTW, thanks for producing a much more readable posting than what you
    used to produce with G2 (Google Groups). Rocksolid Light (used by
    NovaBBS) seems to be good for your readability.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Terje Mathisen@21:1/5 to Scott Lurndal on Sun Nov 19 22:56:50 2023
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    EricP wrote:
    Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:
    <
    How many branch targets have REP REP REP MOVS at the label??
    <
    You see, these REP REP REP MOVS's almost invariably have preceding
    instructions following label boundaries.


    In looking at a fairly recent ELF binary, mostly I see various length
    nops, and a bunch of 'repz retq' sequences.

    58eae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

    58eaf0: f3 c3 repz retq

    Intel Instruction manual vol2 section 2.1:

    "Use of repeat prefixes and/or undefined opcodes with other Intel 64 or
    IA-32 instructions is reserved; such use may cause unpredictable behavior"

    These prefix rules were added relatively recently (maybe last 10 years?).
    While they only allow one prefix from each of Group 1..4,
    they still allow prefix bytes to be in any order thereby wasting
    much opcode space on redundant permutations and combinations.

    Actually the prefix rules go back farther - they are present in
    an Intel x86 instruction manual from 2001 I had on a backup.
    Older backups are not readily accessible.


    The iAPX 86,88 manual from 1981 states when discussing REP/REPE/REPNE
    in the context of interrupts:

    "The processor 'remembers' only one prefix in effect
    at the time of the interrupt, the prefix that immediately precedes
    the string instruction."

    Which implies that segment overrides in conjunction with a repeat
    prefix won't be preserved if the MOVS is interrupted (they suggest
    CLI/STI during string operations with segment override(s), noting
    that won't help if an NMI occurs).

    I have written code to detect/test for this particular issue:

    I started a big REP SEGES MOVSB, with the prefix bytes in that order,
    and the repeat count in CX large enough that it would take more than 55
    ms to execute. This was long enough that a timer tick interrupt was
    guaranteed, so I could test the remaining CX value (JCXNZ) to check if I
    was running on a CPU which disallowed multiple prefix bytes,
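
    The logic of that test can be modelled in Python, as a sketch only. It
    assumes the behaviour quoted from the 1981 manual: after an interrupt
    the CPU restarts at the prefix immediately preceding the string
    instruction, so with the order REP, SEGES, MOVSB the REP is lost, one
    more un-repeated MOVSB runs (which does not touch CX), and execution
    falls through with CX still nonzero. The function name and parameters
    are hypothetical.

```python
def remaining_cx(count, interrupt_after, remembers_all_prefixes):
    """Return CX after 'REP SEGES MOVSB' that is interrupted once.

    count: initial CX.
    interrupt_after: iterations completed when the interrupt arrives.
    remembers_all_prefixes: True if the CPU resumes with REP intact.
    """
    if interrupt_after >= count:
        return 0  # the copy finished before the interrupt arrived
    cx = count - interrupt_after
    if remembers_all_prefixes:
        return 0  # REP resumes and runs CX down to zero
    # Only 'SEGES MOVSB' is restarted: one byte moves, CX is left as-is.
    return cx
```

    A nonzero result therefore identifies a CPU that forgets all but the
    last prefix, provided the repeat count is large enough for the timer
    tick to land inside the copy.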

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
