• Byte Addressability And Beyond

    From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 00:09:28 2024
    Byte addressing was invented by IBM for the System/360, introduced in
    1964. At least as I understand it. Up to that time, and indeed for a long
    time after, machines had a “word length” which was the smallest
    addressable unit of memory. This could have a range of sizes, e.g.

    12 -- DEC PDP-5/8
    18 -- DEC PDP-1/4/7/9
    36 -- DEC PDP-6/10
    60 -- CDC 6000-series
    64 -- Cray

    I’m sure there were also 24- and 48-bit machines. Note the popularity of numbers with a range of different integer divisors, including powers of
    both 2 and 3. The byte-addressable machines chucked away everything other
    than powers of 2, which was a step backwards in this respect. ;)

    (Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
    clean slate, I guess.)

    Why was byte addressing invented? I think it was for easy handling of
    strings and other binary data. But why stop there? I guess the idea of
    going all the way down to bit-level addressing was considered a bit
    extreme? Certainly if you only had 32 (or, on those early IBMs, 24)
    address bits, then using 3 of them to address within a byte would have substantially cut down the available size of your address space.

    I think the move to 64-bit architectures missed a trick, though: it could
    have introduced bit-level addressing at the same time, given that we still
    have plenty of address bits to spare. That would simplify bit-field manipulations, too.

    One side-effect of byte addressing has been the “endian wars”: the inconsistency, between different machine architectures, of how to order
    the bytes making up multibyte objects, particularly numbers. Big-endian supposedly had the advantage of making memory dumps easier to read, but little-endian always made more logical sense.

    Nowadays, all the common CPU architectures are at least available in little-endian form, if not exclusively so. But we still have legacy
    oddities, like the TCP/IP network stack where integer fields are laid out
    in big-endian ordering.
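
    The two byte orders can be made concrete with a minimal sketch. It uses Python's standard `struct` module; the sample value 0x0A0B0C0D and the variable names are illustrative assumptions, not anything from a particular network stack.

```python
import struct

# Illustrative 32-bit value; any integer would do.
value = 0x0A0B0C0D

# ">" selects big-endian layout (the order used for TCP/IP header fields),
# "<" selects little-endian layout; "I" is an unsigned 32-bit integer.
big_endian = struct.pack(">I", value)
little_endian = struct.pack("<I", value)

print(big_endian.hex())     # 0a0b0c0d
print(little_endian.hex())  # 0d0c0b0a

# Reading the bytes back with the wrong order yields a different number.
misread = struct.unpack("<I", big_endian)[0]
print(hex(misread))         # 0xd0c0b0a
```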

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Wed May 1 01:49:56 2024
    According to Lawrence D'Oliveiro:
    Byte addressing was invented by IBM for the System/360, introduced in
    1964. At least as I understand it. Up to that time, and indeed for a long
    time after, machines had a “word length” which was the smallest
    addressable unit of memory. This could have a range of sizes, e.g.

    12 -- DEC PDP-5/8
    18 -- DEC PDP-1/4/7/9
    36 -- DEC PDP-6/10
    60 -- CDC 6000-series
    64 -- Cray

    Commercial machines were character or digit addressed, as was at least
    one scientific computer, the IBM 1620.

    The IBM 650 had 10 digit words, with characters stored as digit pairs.
    The 702 and 705 were decimal character addressable. Instructions were
    5 characters but data could be arbitrary length and location. The very
    popular 1401 was also character addressed with variable length data.

    Why was byte addressing invented? I think it was for easy handling of
    strings and other binary data. But why stop there?

    It was to be reasonably efficient both for character business data and
    word scientific data. Since the words had to be aligned, it was easy
    to handle them as a single unit in parallel on machines with internal
    data paths wider than 8 bits, all the models bigger than 360/30.

    I guess the idea of
    going all the way down to bit-level addressing was considered a bit
    extreme?

    STRETCH had bit addressing. It added a great deal of complication for
    very little benefit. In the relatively rare situations where you want
    to handle bit fields, shifting and masking is good enough without
    slowing everything else down.

    One side-effect of byte addressing has been the “endian wars”: the
    inconsistency, between different machine architectures, ...

    Until the PDP-11, all byte addressed machines were bigendian. Despite
    a lot of looking, I have never found an explanation of why DEC made
    the PDP-11 littlendian. I'm reasonably sure they were aware that it
    was reversed from the 360, but they never said why.

    Please do me a favor and DO NOT guess why they did it -- we have
    already had lots and lots of guesses and we have no way to tell
    whether any of the guesses are right.

    --
    Regards,
    John Levine, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed May 1 03:02:07 2024
    Lawrence D'Oliveiro wrote:

    Byte addressing was invented by IBM for the System/360, introduced in
    1964. At least as I understand it. Up to that time, and indeed for a long time after, machines had a “word length” which was the smallest addressable unit of memory. This could have a range of sizes, e.g.

    12 -- DEC PDP-5/8
    18 -- DEC PDP-1/4/7/9
    36 -- DEC PDP-6/10
    60 -- CDC 6000-series
    64 -- Cray

    CDC had a number of machines with word lengths of 12×k bits, for k in {1, 2, 3, 5}.

    I’m sure there were also 24- and 48-bit machines. Note the popularity of numbers with a range of different integer divisors, including powers of
    both 2 and 3. The byte-addressable machines chucked away everything other than powers of 2, which was a step backwards in this respect. ;)

    I would make the argument that 2^k was a step forward not backwards.
    Perhaps another day...

    (Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
    clean slate, I guess.)

    4004 anyone ?!?

    Why was byte addressing invented? I think it was for easy handling of
    strings and other binary data. But why stop there? I guess the idea of
    going all the way down to bit-level addressing was considered a bit
    extreme?

    It was certainly a reason Intel's 432 died. {but there were lots}

    Certainly if you only had 32 (or, on those early IBMs, 24)
    address bits, then using 3 of them to address within a byte would have substantially cut down the available size of your address space.

    I think the move to 64-bit architectures missed a trick, though: it could have introduced bit-level addressing at the same time, given that we still have plenty of address bits to spare. That would simplify bit-field manipulations, too.

    I don't see what is wrong with loading a container with the field and
    then extracting or inserting into the container. You lose atomicity
    but avoid doubling the number of LD/ST instructions.
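
    A minimal sketch of the load-a-container-then-extract/insert approach, with Python integers standing in for a register. The function names and the convention that `offset` counts from the least significant bit are assumptions made for illustration; they do not describe any specific instruction set.

```python
def extract_field(container: int, offset: int, width: int) -> int:
    """Return the `width`-bit field that starts `offset` bits above bit 0."""
    mask = (1 << width) - 1
    return (container >> offset) & mask


def insert_field(container: int, offset: int, width: int, value: int) -> int:
    """Return `container` with the field replaced by the low bits of `value`."""
    mask = ((1 << width) - 1) << offset
    return (container & ~mask) | ((value << offset) & mask)


word = 0b1011_0100
print(extract_field(word, 2, 3))          # 5
print(hex(insert_field(0xFF, 2, 3, 0)))   # 0xe3
```

    The read-modify-write in `insert_field` is where atomicity is lost: another writer could change the container between the load and the store.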

    One side-effect of byte addressing has been the “endian wars”: the inconsistency, between different machine architectures, of how to order
    the bytes making up multibyte objects, particularly numbers. Big-endian supposedly had the advantage of making memory dumps easier to read, but little-endian always made more logical sense.

    BE means you can read the strings in a core dump
    LE means the bytes arrive in the order for on-line arithmetic
    LE allows one to make 8-bit wide data paths and still implement a full
    width architecture {but then so did 360/30}

    Nowadays, all the common CPU architectures are at least available in little-endian form, if not exclusively so. But we still have legacy
    oddities, like the TCP/IP network stack where integer fields are laid out
    in big-endian ordering.

    I have a BITR instruction that rearranges BE<->LE for these reasons.

  • From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 06:43:30 2024
    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the field and
    then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address of the
    field. Why not put it together with the rest of the address?

    BE means you can read the strings in a core dump
    LE means the bytes arrive in the order for on-line arithmetic
    LE allows one to make 8-bit wide data paths and still implement a full
    width architecture {but then so did 360/30}

    The way I think of it is: consider how you specify these 3 conventions:
    * numbering of bits within a byte
    * numbering of bytes within a multibyte quantity
    * the place values of bits in an integer

    The only way to get all 3 consistent is with a little-endian architecture. Every big-endian architecture has inconsistencies between these somewhere
    or another.
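
    The consistency argument can be illustrated with a small sketch. It assumes bit 0 of a byte is its least significant bit and that bit n of an integer has place value 2**n; the function names are hypothetical.

```python
def bit_of_integer_le(memory: bytes, n: int) -> int:
    """Bit n of a little-endian integer: byte n // 8, bit n % 8."""
    return (memory[n // 8] >> (n % 8)) & 1


def bit_of_integer_be(memory: bytes, n: int) -> int:
    """Bit n of a big-endian integer: the byte index must be reversed,
    so the mapping depends on the size of the object."""
    size = len(memory)
    return (memory[size - 1 - n // 8] >> (n % 8)) & 1


value = 0x12345678
le = value.to_bytes(4, "little")
be = value.to_bytes(4, "big")

# Both layouts hold the same bits; only the address arithmetic differs.
print(all(bit_of_integer_le(le, n) == (value >> n) & 1 for n in range(32)))  # True
print(all(bit_of_integer_be(be, n) == (value >> n) & 1 for n in range(32)))  # True
```

    In the little-endian case the bit number alone determines the byte address and the bit within it; in the big-endian case the object size enters the formula.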

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 06:32:17 2024
    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian. Despite a
    lot of looking, I have never found an explanation of why DEC made the
    PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    Unfortunately, when their Fortran compiler implemented 32-bit integers (in software), they got the words the wrong way round.

    The VAX was like a 32-bit extension of the PDP-11, and it was consistently little-endian everywhere.

  • From Anton Ertl@21:1/5 to John Levine on Wed May 1 07:36:13 2024
    John Levine writes:
    Until the PDP-11, all byte addressed machines were bigendian. Despite
    a lot of looking, I have never found an explanation of why DEC made
    the PDP-11 littlendian. I'm reasonably sure they were aware that it
    was reversed from the 360, but they never said why.

    Please do me a favor and DO NOT guess why they did it -- we have
    already had lots and lots of guesses and we have no way to tell
    whether any of the guesses are right.

    Another case was the 6800 (big-endian) and its offspring, the 6502 (little-endian). In this case we know: little-endian is cheaper to
    implement on an 8-bit processor.

    Concerning the speculations about the PDP-11, here's one: Was it
    designed for also supporting an implementation with a 4-bit or 8-bit
    basis? The competing Nova was at first implemented with a 4-bit basis
    (but it is word-addressed, so this is not visible in the byte order).
    The PDP-X (the DEC-internal project that was canceled in favor of the
    PDP-11 and eventually became the Nova) might have influenced the
    PDP-11 in that way.

    The other interesting question in this context is why the Datapoint
    2200 (which is the basis of the Intel 8008 architecture) went for little-endian. <https://en.wikipedia.org/wiki/Datapoint_2200> says:

    |Because the original Datapoint 2200 had a serial processor, it needed
    |to start with the lowest bit of the lowest byte in order to handle
    |carries.

    So it's the same reason as for the 6502.
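
    A sketch of why low-byte-first order suits a narrow or serial data path: the addition can start with the first byte fetched and pass the carry upward as the address increases. The function name is hypothetical, and this models the idea only, not any particular processor.

```python
def add_bytewise_le(a: bytes, b: bytes) -> bytes:
    """Add two equal-length little-endian integers one byte at a time."""
    result = bytearray()
    carry = 0
    for x, y in zip(a, b):           # ascending address order
        total = x + y + carry
        result.append(total & 0xFF)  # low 8 bits are the result byte
        carry = total >> 8           # carry feeds the next (higher) byte
    return bytes(result)             # a final carry out is discarded


a = (0x01FF).to_bytes(2, "little")
b = (0x0001).to_bytes(2, "little")
print(add_bytewise_le(a, b).hex())   # 0002, i.e. 0x0200 stored low byte first
```

    With big-endian storage the same loop would have to begin at the highest address and work downward.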

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup

  • From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Wed May 1 07:43:52 2024
    Lawrence D'Oliveiro wrote:

    (Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
    clean slate, I guess.)

    A major market for microprocessors was pocket calculators,
    cash registers and the like, which is why having 8 bits and BCD
    arithmetic was an advantage - see the DAA instruction of the 8080
    or the decimal flag on the 6502.
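
    For readers unfamiliar with decimal adjust, here is an illustrative model of packed-BCD addition followed by a DAA-style correction. It is a sketch of the general technique, not a cycle-accurate description of the 8080 or 6502; the name `bcd_add` is hypothetical.

```python
def bcd_add(a: int, b: int):
    """Add two packed-BCD bytes (two decimal digits each).

    Returns (result_byte, carry_out).
    """
    total = a + b                              # plain binary add
    half_carry = (a & 0x0F) + (b & 0x0F) > 0x0F
    if (total & 0x0F) > 9 or half_carry:
        total += 0x06                          # fix up the low digit
    if total > 0x9F:
        total += 0x60                          # fix up the high digit
    return total & 0xFF, total > 0xFF


print(bcd_add(0x19, 0x28))   # (71, False): 71 == 0x47, i.e. 19 + 28 = 47
print(bcd_add(0x99, 0x01))   # (0, True): 99 + 1 = 100
```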

    The basis of the 8008, the first serious microprocessor,
    was the Datapoint 2200. A nice history can be found at http://www.righto.com/2023/08/datapoint-to-8086.html .
    And as the Datapoint 2200 was originally a "smart terminal",
    it had to be able to connect to mainframes, which meant that
    8-bit bytes were a natural choice. (And I still think that
    having BCD influenced the decision to go to the 8-bit byte
    on the /360).

    So, anything but a clean slate.

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed May 1 07:51:06 2024
    On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

    And as the Datapoint 2200 was originally a "smart terminal",
    it had to be able to connect to mainframes, which meant that 8-bit bytes
    were a natural choice.

    You mean IBM mainframes? I don’t think any other mainframes were byte-addressable.

  • From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Wed May 1 09:02:22 2024
    Lawrence D'Oliveiro wrote:
    On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

    And as the Datapoint 2200 was originally a "smart terminal",
    it had to be able to connect to mainframes, which meant that 8-bit bytes
    were a natural choice.

    You mean IBM mainframes?

    And compatibles. Together, they accounted for almost all mainframes.

    I don’t think any other mainframes were byte-addressable.

    IBM set the minimum standard for character capabilities; a
    terminal had to support eight bits or be laughed out of the market.
    Addressability has little to do with it.

    Hmm... what sort of terminals and character sets did people use on
    a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
    the ASCII standard came out. (And /360 was supposed to support
    ASCII originally, but that bit in the PSW got dropped for the /370,
    I believe).

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed May 1 15:31:37 2024
    On Wed, 1 May 2024 00:09:28 -0000 (UTC)
    Lawrence D'Oliveiro wrote:


    (Interesting that the microprocessor world made byte addressing--and
    ASCII character encoding--universal right from the beginning.
    Starting from a clean slate, I guess.)


    It depends on what you call a "microprocessor".
    The majority of early digital signal processors were word-addressable.
    Some of them are still produced in significant quantities.
    Two of those (the TI TMS320C30 and the ADI ADSP 21xx series) played a
    major role in my professional programming education.

    A few word-addressable digital signal processors had non-power-of-two
    words. The Motorola 24-bit 56K series was probably the most popular of
    those, but there were others as well.

    Microchip's PIC microcontrollers are word-addressable with quite
    varying word widths. According to Wikipedia, they are descendants of
    the General Instrument CP1600 CPU. I suppose that their ancestor was
    word-addressable as well.

    In the world of general-purpose microprocessors, the DEC Alpha (until
    EV6) was more word-addressable than byte-addressable, although it is a
    matter of point of view.

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Wed May 1 14:08:25 2024
    Lawrence D'Oliveiro writes:
    Byte addressing was invented by IBM for the System/360, introduced in
    1964. At least as I understand it. Up to that time, and indeed for a long
    time after, machines had a “word length” which was the smallest
    addressable unit of memory. This could have a range of sizes, e.g.

    12 -- DEC PDP-5/8
    18 -- DEC PDP-1/4/7/9
    36 -- DEC PDP-6/10
    60 -- CDC 6000-series
    64 -- Cray

    What about the IBM 1401, Electrodata 220 or Burroughs B5000?

  • From Stefan Monnier@21:1/5 to All on Wed May 1 12:08:32 2024
    I guess the idea of going all the way down to bit-level addressing
    was considered a bit extreme?

    STRETCH had bit addressing. It added a great deal of complication for
    very little benefit. In the relatively rare situations where you want
    to handle bit fields, shifting and masking is good enough without
    slowing everything else down.

    Bit addressing doesn't have to be expensive: the DEC Alpha could have
    decided to use bit-addressing simply by ignoring/trapping more of the
    lowest bits than it did.
    Bit-addressing doesn't necessarily mean you can LD/ST at bit-granularity.


    Stefan

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed May 1 16:38:09 2024
    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the field and
    then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address of the
    field. Why not put it together with the rest of the address?

    Given the 20-40 year life of an architecture and the desire not to be limited
    by addressability, I wanted and demanded of myself a full 63-bit virtual
    address space per thread. Therefore, no bits in the pointer are available
    for bit level addressing.

    BE means you can read the strings in a core dump
    LE means the bytes arrive in the order for on-line arithmetic
    LE allows one to make 8-bit wide data paths and still implement a full
    width architecture {but then so did 360/30}

    The way I think of it is: consider how you specify these 3 conventions:
    * numbering of bits within a byte
    * numbering of bytes within a multibyte quantity
    * the place values of bits in an integer

    The only way to get all 3 consistent is with a little-endian architecture. Every big-endian architecture has inconsistencies between these somewhere
    or another.

    Very many LE machines got one or more of those wrong, too.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed May 1 16:43:09 2024
    Thomas Koenig wrote:

    Lawrence D'Oliveiro wrote:

    (Interesting that the microprocessor world made byte addressing--and ASCII
    character encoding--universal right from the beginning. Starting from a
    clean slate, I guess.)

    A major market for microprocessors was pocket calculators,
    cash registers and the like, which is why having 8 bits and BCD
    arithmetic was an advantage - see the DAA instruction of the 8080
    or the decimal flag on the 6502.

    From 1978-1980 I worked at NCR corporation on cash registers.
    We made a BASIC interpreter as the programmable backbone of
    the cash register lineup. Not a single decimal arithmetic
    instruction was used in the cash register application. The
    BASIC interpreter was written by a 5-man team in 8085 assembler.

    That model was sold from 1979 through 1998. So the lack of
    decimal arithmetic was not a significant disadvantage.

    The basis of the 8008, the first serious microprocessor,
    was the Datapoint 2200. A nice history can be found at http://www.righto.com/2023/08/datapoint-to-8086.html .
    And as the Datapoint 2200 was originally a "smart terminal",
    it had to be able to connect to mainframes, which meant that
    8-bit bytes were a natural choice. (And I still think that
    having BCD influenced the decision to go to the 8-bit byte
    on the /360).

    So, anything but a clean slate.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed May 1 16:46:04 2024
    Thomas Koenig wrote:

    Lawrence D'Oliveiro wrote:
    On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

    And as the Datapoint 2200 was originally a "smart terminal",
    it had to be able to connect to mainframes, which meant that 8-bit bytes
    were a natural choice.

    You mean IBM mainframes?

    And compatibles. Together, they accounted for almost all mainframes.

    I don’t think any other mainframes were byte-addressable.

    IBM set the minimum standard for character capabilities; a
    terminal had to support eight bits or be laughed out of the market.
    Addressability has little to do with it.

    Hmm... what sort of terminals and character sets did people use on
    a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
    the ASCII standard came out. (And /360 was supposed to support
    ASCII originally, but that bit in the PSW got dropped for the /370,
    I believe).

    PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
    ASCII character set. Programming languages and editors tended to use
    the 6-bit character set.

  • From Scott Lurndal@21:1/5 to MitchAlsup1 on Wed May 1 16:57:39 2024
    MitchAlsup1 writes:
    Thomas Koenig wrote:

    Lawrence D'Oliveiro wrote:
    On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

    And as the Datapoint 2200 was originally a "smart terminal",
    it had to be able to connect to mainframes, which meant that 8-bit bytes
    were a natural choice.

    You mean IBM mainframes?

    And compatibles. Together, they accounted for almost all mainframes.

    I don’t think any other mainframes were byte-addressable.

    IBM set the minimum standard for character capabilities; a
    terminal had to support eight bits or be laughed out of the market.
    Addressability has little to do with it.

    Hmm... what sort of terminals and character sets did people use on
    a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
    the ASCII standard came out. (And /360 was supposed to support
    ASCII originally, but that bit in the PSW got dropped for the /370,
    I believe).

    PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
    ASCII character set. Programming languages and editors tended to use
    the 6-bit character set.

    Early Burroughs systems used 6-bit binary "characters". Two fit
    in one column of a 12-row Hollerith card.

  • From John Levine@21:1/5 to All on Wed May 1 17:32:54 2024
    Please do me a favor and DO NOT guess why they did it --

    Concerning the speculations about the PDP-11, here's one: Was it
    designed for also supporting an implementation with a 4-bit or 8-bit
    basis?

    There are a bunch of design notes at bitsavers and none of them say
    anything about it. There was one place that might have hinted that
    little endian would save a few flip flops but since every PDP-11 was
    16 bits internally, it wouldn't have saved much.

    The PDP-X (the DEC-internal project that was canceled in favor of the
    PDP-11 and eventually became the Nova) might have influenced the
    PDP-11 in that way.

    I gather the PDP-X and PDP-11 were warring camps. There's a bunch
    of PDP-X notes at bitsavers and I don't see anything related to
    the -11. In the Bell et al book there's a lot about the -11 which
    only says it's different from the -8 and -9 series.

  • From John Levine@21:1/5 to All on Wed May 1 17:41:46 2024
    According to Lawrence D'Oliveiro:
    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian. Despite a
    lot of looking, I have never found an explanation of why DEC made the
    PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    Ahem. You're guessing.

    I can assure you it didn't make more sense to all the people who read
    360 core dumps. BTDT.


  • From John Levine@21:1/5 to All on Wed May 1 17:53:05 2024
    According to Stefan Monnier:
    I guess the idea of going all the way down to bit-level addressing
    was considered a bit extreme?

    STRETCH had bit addressing. It added a great deal of complication for
    very little benefit. In the relatively rare situations where you want
    to handle bit fields, shifting and masking is good enough without
    slowing everything else down.

    Bit addressing doesn't have to be expensive: the DEC Alpha could have
    decided to use bit-addressing simply by ignoring/trapping more of the
    lowest bits than it did.

    That would waste three bits in every address, which would have been phenomenally expensive in the 1960s when every byte cost real money.

    The 360 had 12 bit displacements, so you could address a 4K range
    without having to load another base register. This would shrink
    it to 512 bytes, so as a first approximation you'd need eight times as
    many base register loads. Nope.

    I agree that with 64 bit addresses and memory that is pennies per
    megabyte the tradeoffs are different but that horse left the barn 50
    years ago. And I still don't think that bit operations are common
    enough to be worth using bits in every non-bit address.

  • From John Levine@21:1/5 to All on Wed May 1 18:13:57 2024
    According to Thomas Koenig:
    8-bit bytes were a natural choice. (And I still think that
    having BCD influenced the decision to go to the 8-bit byte
    on the /360).

    You don't have to guess. They explained in the IBM SJ paper
    why they chose 8 bits rather than 6. BCD was part of it, as
    was a belief that 6 bits wasn't going to be enough for
    text, and it allowed 16 bit instructions and 32/64 bit
    floating point.

    Read it here: https://www.ece.ucdavis.edu/~vojin/CLASSES/EEC272/S2005/Papers/IBM360-Amdahl_april64.pdf


  • From Scott Lurndal@21:1/5 to John Levine on Wed May 1 18:20:43 2024
    John Levine writes:
    According to Lawrence D'Oliveiro:
    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian. Despite a
    lot of looking, I have never found an explanation of why DEC made the
    PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    Ahem. You're guessing.

    I can assure you it didn't make more sense to all the people who read
    360 core dumps. BTDT.

    To be fair, the tool that formatted the core dump could easily have
    arranged the human visible values appropriately, much like xxd(1)
    on linux does for little-endian values (i.e. when grouped as
    four bytes (32 bits) per word, the byte 3 value is printed first).
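
    A minimal sketch of that kind of dump formatting, assuming fixed four-byte little-endian words. The function name is hypothetical, and this is not how xxd itself is implemented.

```python
def dump_words_le(data: bytes, word_size: int = 4) -> str:
    """Format memory as hex words, printing each word's highest-addressed
    byte first so little-endian values read like ordinary numbers."""
    words = []
    for start in range(0, len(data), word_size):
        chunk = data[start:start + word_size]
        words.append(chunk[::-1].hex())   # reverse the bytes within the word
    return " ".join(words)


memory = (0x12345678).to_bytes(4, "little") + (1).to_bytes(4, "little")
print(memory.hex())            # 7856341201000000 (raw byte order)
print(dump_words_le(memory))   # 12345678 00000001
```

    The sketch has to be told the word size in advance, which is the catch: a generic dump tool does not know which bytes belong to the same multibyte value.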

  • From Thomas Koenig@21:1/5 to John Levine on Wed May 1 18:17:33 2024
    John Levine wrote:

    I gather the PDP-X and PDP-11 were warring camps. There's a bunch
    of PDP-X notes at bitsavers and I don't see anything related to
    the -11. In the Bell et al book there's a lot about the -11 which
    only says it's different from the -8 and -9 series.

    Edson deCastro designed the PDP-X. When that project was cancelled
    because of perceived potential competition with the 12-bit and
    18-bit lines, he went off to found Data General and there built
    the Nova, which used "byte pointers" where the uppermost bit
    selected the low or high 8 bits of the 16-bit word.

    Apparently, the PDP-11 was originally an 8-bit "desk calculator"
    project which was then developed into the 16-bit architecture.
    I have also read somewhere that competition from the Nova played
    a major role.

    DeCastro leaving was a major sore point for a lot of people at DEC,
    so they probably did not tend to mention this influence.

    There were allegations that the Nova was a copy of the proposed
    PDP-X, but that was debunked now that some PDP-X development
    documents have surfaced.

  • From Stefan Monnier@21:1/5 to All on Wed May 1 14:33:16 2024
    I agree that with 64 bit addresses and memory that is pennies per
    megabyte the tradeoffs are different but that horse left the barn 50
    years ago. And I still don't think that bit operations are common
    enough to be worth using bits in every non-bit address.

    Historically, the advantages vs disadvantages have indeed been rather
    against bit-addressing. AFAICT when the DEC Alpha came out was the most favorable time: the first time that the cost was low enough (they
    already had byte-addressing without byte-granularity of accesses,
    they had plenty of address bits to waste, and there wasn't too much
    existing 64bit code to break) to make the idea palatable.

    Practical benefits are fairly limited, but it would just be The Right
    thing to do, making it "easy" to eliminate some arbitrary restrictions
    in languages like C such as the inability to take the address of
    a struct's bitsized field. It would also have given an extra 3 bits to
    play with for tagging purposes :-)


    Stefan

  • From John Levine@21:1/5 to All on Wed May 1 18:37:07 2024
    Thomas Koenig wrote:
    Hmm... what sort of terminals and character sets did people use on
    a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
    the ASCII standard came out.

    On the PDP-6 and PDP-10s I used they were all Teletypes and tty
    compatible ASCII video terminals.

    The normal way to store text was five 7-bit ASCII characters in a 36
    bit word, since the byte handling instructions made that easy to
    handle. It was common to start each line on a word boundary, so you had
    to skip zero padding bytes. Text editors often included line numbers
    that were five digit characters aligned on a word boundary, followed
    by a tab. The low bit in the word with the digits was set to say it's
    a line number, and compilers knew to look for the bit and skip the
    line number and tab.
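
    The five-characters-per-word layout can be sketched as follows. The sketch assumes the characters are left-justified so that the single spare bit is the low bit of the 36-bit word, which matches the line-number flag described above; the function names are hypothetical.

```python
def pack_ascii5(text: str, line_number_flag: bool = False) -> int:
    """Pack up to five 7-bit ASCII characters into one 36-bit word."""
    if len(text) > 5:
        raise ValueError("at most five characters fit in a word")
    word = 0
    for ch in text.ljust(5, "\0"):       # pad short strings with NUL
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit ASCII character")
        word = (word << 7) | code        # 5 * 7 = 35 bits of text
    return (word << 1) | int(line_number_flag)   # spare low bit = flag


def unpack_ascii5(word: int):
    """Return (text, flag) from a word built by pack_ascii5."""
    flag = bool(word & 1)
    chars = [chr((word >> shift) & 0x7F) for shift in (29, 22, 15, 8, 1)]
    return "".join(chars), flag


w = pack_ascii5("00010", line_number_flag=True)
print(w < 2 ** 36)          # True
print(unpack_ascii5(w))     # ('00010', True)
```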

    Disk and DECtape used a six bit upper case ASCII subset for file names
    so they could fit a six character name into one word. Compiler and
    object file symbol tables used RADIX50 aka SQUOZE that fit a six
    character symbol from a 40 character (octal 50) set into 32 bits with
    four flag bits left.
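
    The arithmetic behind that packing is that 40**6 is just under 2**32,
    so six base-40 digits fit in 32 bits. A small Python sketch follows; the
    character ordering (blank, digits, letters, then . $ %) and the
    right-padding with blanks are assumptions for illustration and may not
    match any particular assembler or linker exactly.

```python
# Assumed ordering: blank, digits, letters, then . $ % (40 symbols in total).
RADIX50_CHARS = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.$%"


def radix50_encode(symbol: str) -> int:
    """Encode up to six characters as a base-40 number that fits in 32 bits."""
    if len(symbol) > 6:
        raise ValueError("at most six characters")
    value = 0
    for ch in symbol.upper().ljust(6):
        value = value * 40 + RADIX50_CHARS.index(ch)  # ValueError if not in set
    return value


def radix50_decode(value: int) -> str:
    """Inverse of radix50_encode; trailing blanks are dropped."""
    chars = []
    for _ in range(6):
        value, digit = divmod(value, 40)
        chars.append(RADIX50_CHARS[digit])
    return "".join(reversed(chars)).rstrip()


print(40 ** 6, 2 ** 32)                  # largest code + 1 versus 32-bit range
print(radix50_decode(radix50_encode("START")))
```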

    (And /360 was supposed to support
    ASCII originally, but that bit in the PSW got dropped for the /370,
    I believe).

    They used a mutant ASCII that expanded from 7 to 8 bits by copying the
    high bit into the middle of the byte, which nobody ever used. It was
    one of the few inexplicably stupid choices in the 360.

    According to MitchAlsup1 <[email protected]>:
    PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
    ASCII character set.

    Dunno what computer that was, but it wasn't a PDP-10. Univac or GE600
    maybe?
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Wed May 1 18:52:43 2024
    According to Scott Lurndal <[email protected]>:
    Ahem. You're guessing.

    I can assure you it didn't make more sense to all the people who read
    360 core dumps. BTDT.

    To be fair, the tool that formatted the core dump could easily have
    arranged the human visible values appropriately, much like xxd(1)
    on linux does for little-endian values (i.e. when grouped with
    four bytes per (32-bits), the byte 3 value is printed first).

    It could if it knew the structure of the data it was dumping, but it
    didn't, which was OK because it didn't have to. Like I said, BTDT.

    The first time I saw a PDP-11 in about 1970, I saw that the byte order
    was backward and thought, well, that is strange, and then dealt with
    it.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Robert Swindells@21:1/5 to Stefan Monnier on Wed May 1 18:49:36 2024
    On Wed, 01 May 2024 14:33:16 -0400, Stefan Monnier wrote:

    I agree that with 64 bit addresses and memory that is pennies per
    megabyte the tradeoffs are different but that horse left the barn 50
    years ago. And I still don't think that bit operations are common
    enough to be worth using bits in every non-bit address.

    Historically, the advantages vs disadvantages have indeed been rather
    against bit-addressing. AFAICT when the DEC Alpha came out was the most
    favorable time: the first time that the cost was low enough (they
    already had byte-addressing without byte-granularity of accesses,
    they had plenty of address bits to waste, and there wasn't too much
    existing 64bit code to break) to make the idea palatable.

    Practical benefits are fairly limited, but it would just be The Right
    thing to do, making it "easy" to eliminate some arbitrary restrictions
    in languages like C such as the inability to take the address of a
    struct's bitsized field. It would also have given an extra 3 bits to
    play with for tagging purposes :-)

    The TMS340[12]0 were bit-addressed 32 bit processors.

    <https://en.wikipedia.org/wiki/TMS34010>

    I never programmed one in C but the addressing worked well for doing
    graphics operations.

  • From John Levine@21:1/5 to All on Wed May 1 19:07:00 2024
    According to Thomas Koenig <[email protected]>:
    Apparently, the PDP-11 was originally an 8-bit "desk calculator"
    project which was then developed into the 16-bit architecture.
    I have also read somewhere that competition from the Nova played
    a major role.

    "Desk calculator" was a misleading code name so the large computer
    group would leave them alone. The 11 design was largely by Harold
    McFarland who'd done most of the work for Gordon Bell at CMU.
    See https://hampage.hu/pdp-11/birth.html

    Again, you don't have to guess. This is all documented.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed May 1 18:55:06 2024
    Stefan Monnier wrote:

    I agree that with 64 bit addresses and memory that is pennies per
    megabyte the tradeoffs are different but that horse left the barn 50
    years ago. And I still don't think that bit operations are common
    enough to be worth using bits in every non-bit address.

    Historically, the advantages vs disadvantages have indeed been rather
    against bit-addressing. AFAICT when the DEC Alpha came out was the most
    favorable time: the first time that the cost was low enough (they
    already had byte-addressing without byte-granularity of accesses,
    they had plenty of address bits to waste, and there wasn't too much
    existing 64bit code to break) to make the idea palatable.

    Probably, but looking at code one rarely sees a field in a struct
    that is a bit-field. So, even if the cost was low, the benefits
    are similarly low.

    Practical benefits are fairly limited, but it would just be The Right
    thing to do, making it "easy" to eliminate some arbitrary restrictions
    in languages like C such as the inability to take the address of
    a struct's bitsized field. It would also have given an extra 3 bits to
    play with for tagging purposes :-)


    Stefan

  • From Thomas Koenig@21:1/5 to [email protected] on Wed May 1 18:53:09 2024
    MitchAlsup1 <[email protected]> schrieb:
    Thomas Koenig wrote:

    Lawrence D'Oliveiro <[email protected]d> schrieb:

    (Interesting that the microprocessor world made byte addressing--and ASCII
    character encoding--universal right from the beginning. Starting from a
    clean slate, I guess.)

    A major market for microprocessors were pocket calculators,
    cash registers and the like, which is why having 8 bits and BCD
    arithmetic was an advantage - see the DAA instruction of the 8080
    or the decimal flag on the 6502.
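
    As a rough illustration of what a decimal-adjust step buys you, the
    sketch below adds two packed-BCD bytes in Python and then corrects the
    binary sum nibble by nibble. It is a simplified model in the spirit of
    the 8080's DAA, not a cycle-accurate emulation; the function name
    `bcd_add` and the flag handling are assumptions made for this example.

```python
def bcd_add(a: int, b: int) -> tuple:
    """Add two packed-BCD bytes; return (packed-BCD result, carry out)."""
    raw = a + b                                    # plain binary add
    half_carry = (a & 0x0F) + (b & 0x0F) > 0x0F    # carry out of the low nibble
    carry = raw > 0xFF
    result = raw & 0xFF
    # Decimal adjust, low digit: skip over the six unused codes A..F.
    if (result & 0x0F) > 9 or half_carry:
        result += 0x06
    # Decimal adjust, high digit.
    if result > 0x9F or carry:
        result += 0x60
        carry = True
    return result & 0xFF, carry


total, carry = bcd_add(0x19, 0x28)       # decimal 19 + 28
print(f"{total:02x}", carry)             # 47 False
```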

    From 1978-1980 I worked at NCR corporation on cash registers.
    We made a BASIC interpreter as the programmable backbone of
    the cash register lineup. Not a single decimal arithmetic
    instruction was used in the cash register application. The
    BASIC interpreter was written by a 5-man team in 8085 assembler.

    Quite interesting, thanks!

    That model was sold from 1979 through 1998. So the lack of
    decimal arithmetic was not a significant disadvantage.

    The 8085 has DAA, as well :-)

    However, at least the designers of the 8080 and the 6502 thought
    that it was important, or they would not have invested silicon
    in it. The 6502 people even had a patent on their direct
    decimal arithmetic.

  • From Stephen Fuld@21:1/5 to All on Wed May 1 19:21:32 2024
    MitchAlsup1 wrote:

    Stefan Monnier wrote:

    I agree that with 64 bit addresses and memory that is pennies per
    megabyte the tradeoffs are different but that horse left the barn
    50 years ago. And I still don't think that bit operations are
    common enough to be worth using bits in every non-bit address.

    Historically, the advantages vs disadvantages have indeed been
    rather against bit-addressing. AFAICT when the DEC Alpha came out
    was the most favorable time: the first time that the cost was low
    enough (they already had byte-addressing without byte-granularity
    of accesses, they had plenty of address bits to waste, and there
    wasn't too much existing 64bit code to break) to make the idea
    palatable.

    Probably, but looking at code one rarely sees a field in a struct
    that is a bit-field. So, even if the cost was low, the benefits
    are similarly low.


    Sure. But it isn't clear if that was the cause or the result of the
    hardware.




    Practical benefits are fairly limited, but it would just be The
    Right thing to do, making it "easy" to eliminate some arbitrary
    restrictions in languages like C such as the inability to take the
    address of a struct's bitsized field. It would also have given an
    extra 3 bits to play with for tagging purposes :-)


    Stefan



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed May 1 22:33:23 2024
    On Wed, 1 May 2024 06:32:17 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian.
    Despite a lot of looking, I have never found an explanation of why
    DEC made the PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    Unfortunately, when their Fortran compiler implemented 32-bit
    integers (in software), they got the words the wrong way round.

    The VAX was like a 32-bit extension of the PDP-11, and it was
    consistently little-endian everywhere.

    No, it was not.
    The integer part was consistent, but the FP formats were mixed-endian.

  • From Stephen Fuld@21:1/5 to John Levine on Wed May 1 19:33:54 2024
    John Levine wrote:


    snip


    According to MitchAlsup1 <[email protected]>:
    PDP 10 had a 6-bit "field data" character set and a 9-bit bigger
    than ASCII character set.

    Dunno what computer that was, but it wasn't a PDP-10. Univac or
    GE600 maybe?


    I don't know about the PDP 10, but you are right that Univac 1108 had
    both six bit (technically a sixth of a word) and nine bit (quarter
    word) operations. The 6 bit was Fieldata and used for most older
    software. The quarter words held an 8 bit ASCII character with one
    "wasted" bit per byte. This became the dominant usage for
    applications, but the Exec itself still uses a lot of Fieldata.





    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to [email protected] on Wed May 1 22:56:52 2024
    On Wed, 1 May 2024 16:38:09 +0000
    [email protected] (MitchAlsup1) wrote:

    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the field
    and then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address of
    the field. Why not put it together with the rest of the address?

    Given a 20-40 year life of an architecture and the desire not to be
    limited by addressability; I wanted and demanded of myself a full
    63-bit virtual address space per thread. Therefore, no bits in the
    pointer are available for bit level addressing.


    At current rate of DRAM Moore's Law it does not look like anybody would
    need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
    for that long or longer.
    The prospects of other byte-addressable types of memory look even
    bleaker than DRAM's.
    The only memory tech that is doing better is NAND flash, but it is
    inherently block-addressable.

  • From Michael S@21:1/5 to John Levine on Wed May 1 22:40:12 2024
    On Wed, 1 May 2024 17:53:05 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to Stefan Monnier <[email protected]>:
    I guess the idea of going all the way down to bit-level
    addressing
    was considered a bit extreme?

    STRETCH had bit addressing. It added a great deal of complication
    for very little benefit. In the relatively rare situations where
    you want to handle bit fields, shifting and masking is good enough
    without slowing everything else down.
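
    For reference, the shift-and-mask approach amounts to something like the
    following Python sketch: load the containing word, then extract or insert
    the field with a shift and a mask. The helper names are hypothetical,
    and offsets are counted from the least significant bit.

```python
def extract_field(container: int, offset: int, width: int) -> int:
    """Return the `width`-bit field that starts `offset` bits above bit 0."""
    mask = (1 << width) - 1
    return (container >> offset) & mask


def insert_field(container: int, offset: int, width: int, value: int) -> int:
    """Return `container` with the field replaced by `value` (truncated to fit)."""
    mask = ((1 << width) - 1) << offset
    return (container & ~mask) | ((value << offset) & mask)


word = 0xABCD
print(hex(extract_field(word, 4, 8)))        # 0xbc
print(hex(insert_field(word, 4, 8, 0x12)))   # 0xa12d
```

    The bit offset and width live in the instruction stream (or in compiler
    generated constants) rather than in the pointer, which is why ordinary
    addresses do not pay for them.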

    Bit addressing doesn't have to be expensive: the DEC Alpha could have
    decided to use bit-addressing simply by ignoring/trapping more of the
    lowest bits than it did.

    That would waste three bits in every address, which would have been
    phenomenally expensive in the 1960s when every byte cost real money.

    The 360 had 12 bit displacements, so you could address a 4K range
    without having to load another base register. This would shrink
    it to 1K, so as a first approximation you'd need four times as
    many base register loads. Nope.

    I agree that with 64 bit addresses and memory that is pennies per
    megabyte the tradeoffs are different but that horse left the barn 50
    years ago. And I still don't think that bit operations are common
    enough to be worth using bits in every non-bit address.

    Bit-addressable TMS34010 was released 38 years ago and even was
    moderately successful. So, it seems, 50 years ago nothing was set in
    stone yet.

  • From MitchAlsup1@21:1/5 to Michael S on Wed May 1 20:30:16 2024
    Michael S wrote:

    On Wed, 1 May 2024 16:38:09 +0000
    [email protected] (MitchAlsup1) wrote:

    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the field
    and then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address of
    the field. Why not put it together with the rest of the address?

    Given a 20-40 year life of an architecture and the desire not to be
    limited by addressability; I wanted and demanded of myself a full
    63-bit virtual address space per thread. Therefore, no bits in the
    pointer are available for bit level addressing.


    At current rate of DRAM Moore's Law it does not look like anybody would
    need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
    for that long or longer.

    The largest single system memory I can find quickly is 160TB or about
    47-bits of address space (I rounded down).

    Given one can use CXL to coherently link multiples of such a system,
    and not be limited by the number of pins dedicated to DRAM access;
    40 years of growth at ½ a bit per year, already exceeds the 63-bit
    address space (47+40/2 = 67 bits).
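
    Spelled out as a small Python calculation (assuming "TB" means 2**40
    bytes and that growth stays at a steady half bit per year, both of which
    are assumptions of this sketch; the function names are made up):

```python
import math

TIB = 2 ** 40


def address_bits(size_bytes: int) -> float:
    """Address bits needed to cover `size_bytes` of byte-addressed memory."""
    return math.log2(size_bytes)


def projected_bits(start_bits: float, years: float, bits_per_year: float) -> float:
    """Linear growth in address bits, i.e. exponential growth in capacity."""
    return start_bits + years * bits_per_year


today = math.floor(address_bits(160 * TIB))   # rounded down, as above
print(today)                                                  # 47
print(projected_bits(today, years=40, bits_per_year=0.5))     # 67.0
```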

    The prospects of other byte-addressable types of memory look even
    bleaker than DRAM's.

    Agreed (barring some kind of miracle).

    The only memory tech that is doing better is NAND flash, but it is
    inherently block-addressable.

    And becomes the backing store.

  • From John Levine@21:1/5 to All on Wed May 1 20:18:56 2024
    According to Stephen Fuld <[email protected]d>:
    Probably, but looking at code one rarely sees a field in a struct
    that is a bit-field. So, even if the cost was low, the benefits
    are similarly low.

    Sure. But it isn't clear if that was the cause or the result of the
    hardware.

    The people who designed the 360 had just done STRETCH, which had bit
    addressing. If it was useful, they would have known.

    The PDP-6/10 had load and store byte instructions that could address
    bit strings of arbitrary size and alignment in a single instruction.
    But in practice, the only thing we used them for was packing and
    unpacking 7-bit ASCII into words.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Stefan Monnier@21:1/5 to All on Wed May 1 16:28:55 2024
    At current rate of DRAM Moore's Law it does not look like anybody would
    need 63 bits 40 years from now.

    Depends where. On "personal" computers, I fully agree, and indeed
    there's been work instead on compressing 64bit pointers to fit into
    32bit "boxes" (IIUC it's used in some Chrome versions) since many
    applications never (or rarely) need to manipulate a heap larger
    than 4GB.
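
    The general shape of that trick, as a hedged Python sketch: keep every
    object within 4 GiB of a heap base and store only the 32-bit offset.
    The class `CompressedHeap` and the base address are hypothetical; this
    is not a description of how Chrome or any specific runtime does it.

```python
HEAP_SIZE = 1 << 32          # 4 GiB window covered by a 32-bit offset


class CompressedHeap:
    """Hypothetical heap whose objects all live within 4 GiB of `base`."""

    def __init__(self, base: int):
        self.base = base

    def compress(self, pointer: int) -> int:
        """Turn a full 64-bit pointer into a 32-bit offset from the base."""
        offset = pointer - self.base
        if not 0 <= offset < HEAP_SIZE:
            raise ValueError("pointer lies outside the compressed heap")
        return offset

    def decompress(self, offset: int) -> int:
        """Rebuild the full pointer from the stored 32-bit offset."""
        return self.base + (offset & 0xFFFF_FFFF)


heap = CompressedHeap(0x7F12_0000_0000)   # made-up base address
pointer = heap.base + 0x1234
print(hex(heap.compress(pointer)), hex(heap.decompress(heap.compress(pointer))))
```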

    But for some HPC systems, it's not quite as obvious.


    Stefan

  • From John Levine@21:1/5 to All on Wed May 1 20:37:11 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian. Despite a
    lot of looking, I have never found an explanation of why DEC made the
    PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    I happened to be looking at Blaauw and Brooks "Computer Architecture"
    published in 1997, which has several pages on bit and byte numbering.
    After noting that the Big- and Little- names come from Gulliver's
    Travels, they say on page 100:

    "Unlike Swift's, the computer Endian controversy is not pointless. The
    Little Endian design has many complications in use; we much prefer the
    Big Endian. Having two active conventions is very painful. Several
    recent Big Endian RISC computers, including the MIPS, the Motorola
    88000, and the Intel i860 provide a data-movement operation that can
    perform the Big Endian-Little Endian permutation. We predict that
    Little Endian addressing will die out, just as decimal addressing
    did."

    Really, people like what they are used to. They were just wrong about
    the i860 which was little endian, but had a mode bit to make data
    addressing big endian.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 20:38:58 2024
    On Wed, 1 May 2024 16:38:09 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    You still need a place to put a bit offset for the base address of the
    field. Why not put it together with the rest of the address?

    Given a 20-40 year life of an architecture and the desire not to be
    limited by addressability; I wanted and demanded of myself a full 63-bit
    virtual address space per thread. Therefore, no bits in the pointer are
    available for bit level addressing.

    You will just have to make the move to 128-bit addressing, then. Some
    designers (e.g. RISC-V) are already putting in place plans for that.

    The way I think of it is: consider how you specify these 3 conventions:
    * numbering of bits within a byte
    * numbering of bytes within a multibyte quantity
    * the place values of bits in an integer

    The only way to get all 3 consistent is with a little-endian
    architecture. Every big-endian architecture has inconsistencies between
    these somewhere or another.
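
    One way to read that claim, sketched in Python: if bits within a byte
    are numbered from the least significant end, then on a little-endian
    layout bit k of an integer (place value 2**k) sits at byte k // 8, bit
    k % 8, so a single running bit number works everywhere. With big-endian
    byte order and the same bit numbering, the byte index has to be
    mirrored. The helper names are made up, and the LSB-is-bit-0 numbering
    is itself an assumption of the sketch.

```python
def bit_location(k: int, nbytes: int, byteorder: str) -> tuple:
    """Where bit k (place value 2**k) of an nbytes-wide integer is stored.

    Returns (byte index in memory, bit index within that byte), with bits in
    a byte numbered from the least significant bit as 0.
    """
    byte_index = k // 8 if byteorder == "little" else nbytes - 1 - k // 8
    return byte_index, k % 8


def check(value: int, nbytes: int, byteorder: str) -> bool:
    """Verify bit_location against Python's own int.to_bytes layout."""
    raw = value.to_bytes(nbytes, byteorder)
    for k in range(nbytes * 8):
        byte_index, bit_index = bit_location(k, nbytes, byteorder)
        if (raw[byte_index] >> bit_index) & 1 != (value >> k) & 1:
            return False
    return True


print(bit_location(9, 4, "little"), bit_location(9, 4, "big"))
print(check(0x0A0B0C0D, 4, "little"), check(0x0A0B0C0D, 4, "big"))
```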

    Very many LE machines got one or more of those wrong, too.

    For example?

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed May 1 20:43:48 2024
    On Wed, 1 May 2024 09:02:22 -0000 (UTC), Thomas Koenig wrote:

    Hmm... what sort of terminals and character sets did people use on a
    PDP-10? 7-bit ASCII? It (and the PDP-6) were released before the ASCII
    standard came out.

    A bit before my time, but I recall terms like “SIXBIT” encoding from
    looking at docs. Also this weird thing called “Radix-50” (the “50”
    actually being octal for 40 decimal) did persist into PDP-11 days, when I
    came along. It was a way of packing 3 characters (from a limited set, of
    course) into 2 bytes.

    (And /360 was supposed to support ASCII originally,
    but that bit in the PSW got dropped for the /370, I believe).

    Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing
    its own EBCDIC encoding was that ASCII wasn’t ready in time. And so they
    saddled their entire mainframe world with this awkward, incompatible
    encoding when the entire rest of the computing world very quickly embraced
    ASCII (and national variants based off it).

  • From John Levine@21:1/5 to All on Wed May 1 20:50:23 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    The way I think of it is: consider how you specify these 3 conventions:
    * numbering of bits within a byte
    * numbering of bytes within a multibyte quantity
    * the place values of bits in an integer

    The only way to get all 3 consistent is with a little-endian
    architecture. Every big-endian architecture has inconsistencies between
    these somewhere or another.

    As far as I can tell the 360/370 was consistently big-endian. The
    convention for bit numbering in bytes and words was high to low but
    since there weren't any instructions with bit numbers it didn't
    matter.

    Very many LE machines got one or more of those wrong, too.

    For example?

    The PDP-11 had mixed endian 32 bit integers and floats. VAX floating
    point was pretty muddled, too.
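
    For anyone who has not met it, the mixed-endian 32-bit integer layout
    looks like this next to the two consistent ones. This is a Python
    sketch; `pdp_endian` is a name invented for the example, and it models
    only the integer layout (high 16-bit word first, each word stored
    little-endian), not the floating-point formats.

```python
import struct


def pdp_endian(value: int) -> bytes:
    """32-bit layout with the high 16-bit word first, each word little-endian."""
    high, low = value >> 16, value & 0xFFFF
    return struct.pack("<HH", high, low)


value = 0x0A0B0C0D
print(value.to_bytes(4, "big").hex(" "))      # 0a 0b 0c 0d
print(value.to_bytes(4, "little").hex(" "))   # 0d 0c 0b 0a
print(pdp_endian(value).hex(" "))             # 0b 0a 0d 0c
```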

    Intel has been consistently little endian as far as I can remember.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Wed May 1 20:53:06 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    (And /360 was supposed to support ASCII originally,
    but that bit in the PSW got dropped for the /370, I believe).

    Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing
    its own EBCDIC encoding was that ASCII wasn’t ready in time.

    If you'd read the paper on the Architecture of System/360, you'd know
    that is just plain wrong. See the link I posted earlier today.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Scott Lurndal@21:1/5 to Michael S on Wed May 1 20:54:42 2024
    Michael S <[email protected]> writes:
    On Wed, 1 May 2024 16:38:09 +0000
    [email protected] (MitchAlsup1) wrote:

    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the field
    and then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address of
    the field. Why not put it together with the rest of the address?

    Given a 20-40 year life of an architecture and the desire not to be
    limited by addressability; I wanted and demanded of myself a full
    63-bit virtual address space per thread. Therefore, no bits in the
    pointer are available for bit level addressing.


    At current rate of DRAM Moore's Law it does not look like anybody would
    need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
    for that long or longer.

    DRAM isn't the only thing that consumes physical address space bits.

    The prospects of other byte-addressable types of memory look even
    bleaker than DRAM's.

    Consider CXL-Memory, for instance, where you have cache coherent
    memory distributed via PCIe to a switched fabric with thousands
    of multicore hosts - that quickly eats up the full 64 bits of PA;
    52 bits per host leaves just 12 bits for host selector.

    A single PCI Express device could easily require 64GB of memory
    BAR space in the PA space, or even a TB.

  • From Michael S@21:1/5 to [email protected] on Wed May 1 23:54:56 2024
    On Wed, 1 May 2024 20:30:16 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Wed, 1 May 2024 16:38:09 +0000
    [email protected] (MitchAlsup1) wrote:

    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the
    field and then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address
    of the field. Why not put it together with the rest of the
    address?

    Given a 20-40 year life of an architecture and the desire not to be
    limited by addressability; I wanted and demanded of myself a full
    63-bit virtual address space per thread. Therefore, no bits in the
    pointer are available for bit level addressing.


    At current rate of DRAM Moore's Law it does not look like anybody
    would need 63 bits 40 years from now. Arm's 55 or 56 bits will
    likely suffice for that long or longer.

    The largest single system memory I can find quickly is 160TB or about
    47-bits of address space (I rounded down).


    I am not aware of anything that big.
    My impression was that the biggest cache-coherent system right now is
    IBM's z15 Max190 (40 TB).

    Given one can use CXL to coherently link multiples of such a system,
    and not be limited by the number of pins dedicated to DRAM access;

    But it would be very slow, so slow that it defeats the point of direct
    addressability.

    40 years of growth at ½ a bit per year, already exceeds the 63-bit
    address space (47+40/2 = 67 bits).


    Half a bit per year sounds very quick. It seems that right now the rate is
    much slower, something like doubling every 5-6 years. And it is likely
    to become even slower in 20 years.
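
    To put the two growth rates side by side, here is a small Python sketch
    that asks how long it takes to get from about 47 bits to 63 bits. The
    5.5-year doubling period is simply the midpoint of the 5-6 year estimate
    above, and `years_to_reach` is a name made up for the example.

```python
def years_to_reach(start_bits: float, target_bits: float,
                   years_per_doubling: float) -> float:
    """Years until capacity needs `target_bits`, gaining one bit per doubling."""
    return (target_bits - start_bits) * years_per_doubling


for label, period in [("half a bit per year", 2.0),
                      ("doubling every 5.5 years", 5.5)]:
    print(label, years_to_reach(47, 63, period))
```

    Under the faster assumption 63 bits are needed within the 40-year
    window; under the slower one they are not.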

    The prospects of other byte-addressable types of memory look even
    bleaker than DRAM's.

    Agreed (barring some kind of miracle).

    The only memory tech that is doing better is NAND flash, but it is
    inherently block-addressable.

    And becomes the backing store.

  • From John Levine@21:1/5 to All on Wed May 1 21:13:31 2024
    According to Michael S <[email protected]>:
    years ago. And I still don't think that bit operations are common
    enough to be worth using bits in every non-bit address.

    Bit-addressable TMS34010 was released 38 years ago and even was
    moderately successful. So, it seems, 50 years ago nothing was set in
    stone yet.

    True, but that chip is designed to be good for video rendering which
    is an unusual application that uses a lot of bit aligned data.

    Chips for specialized applications have all sorts of strange
    architectures. Look at the Moto 56K DSP with 24 bit words and separate
    instruction and data memories. I wouldn't want to try and run linux
    on it but it's great for signal processing.




    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 20:36:47 2024
    On Wed, 1 May 2024 17:53:05 -0000 (UTC), John Levine wrote:

    That would waste three bits in every address, which would have been
    phenomenally expensive in the 1960s when every byte cost real money.

    But not today, with 64-bit addressing, was my point.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 20:36:02 2024
    On Wed, 1 May 2024 17:41:46 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    As I previously mentioned, little-endian just makes more sense.

    Ahem. You're guessing.

    No I’m not. I’ve used both over many years.

  • From Scott Lurndal@21:1/5 to Michael S on Wed May 1 21:40:17 2024
    Michael S <[email protected]> writes:
    On Wed, 1 May 2024 20:30:16 +0000
    [email protected] (MitchAlsup1) wrote:

    Given one can use CXL to coherently link multiples of such a system,
    and not be limited by the number of pins dedicated to DRAM access;

    But it would be very slow, so slow that it defeats the point of direct
    addressability.

    On what basis do you make that statement? CXL-memory is real,
    and can be implemented on chiplets in an MCM with better
    than multisocket latencies. Add Gen6 PCIe cut-through switching
    and you get reasonable and useful latencies across a switched fabric.

    Even a decade and a half ago, when we built a similar system using
    QDR InfiniBand and a custom ASIC connected to HT or QPI,
    we had internode latencies of less than 400ns r/t, which
    was about double the Intel inter-socket latencies at the time.

  • From Thomas Koenig@21:1/5 to [email protected] on Wed May 1 22:11:46 2024
    MitchAlsup1 <[email protected]> schrieb:
    Michael S wrote:

    On Wed, 1 May 2024 16:38:09 +0000
    [email protected] (MitchAlsup1) wrote:

    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the field
    and then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address of
    the field. Why not put it together with the rest of the address?

    Given a 20-40 year life of an architecture and the desire not to be
    limited by addressability; I wanted and demanded of myself a full
    63-bit virtual address space per thread. Therefore, no bits in the
    pointer are available for bit level addressing.


    At current rate of DRAM Moore's Law it does not look like anybody would
    need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
    for that long or longer.

    The largest single system memory I can find quickly is 160TB or about
    47-bits of address space (I rounded down).

    A single Power10 CPU can address 2 Petabytes (51 bits), but of course
    it need not be all RAM.

  • From Michael S@21:1/5 to Scott Lurndal on Thu May 2 02:04:37 2024
    On Wed, 01 May 2024 21:40:17 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Wed, 1 May 2024 20:30:16 +0000
    [email protected] (MitchAlsup1) wrote:

    Given one can use CXL to coherently link multiples of such a
    system, and not be limited by the number of pins dedicated to DRAM
    access;

    But it would be very slow, so slow that it defeats the point of
    direct addressability.

    On what basis do you make that statement? CXL-memory is real,
    and can be implemented on chiplets in an MCM with better
    than multisocket latencies. Add Gen6 PCIe cut-through switching
    and you get reasonable and useful latencies across a switched fabric.

    Even a decade and a half ago, when we built a similar system using
    QDR InfiniBand and a custom ASIC connected to HT or QPI,
    we had internode latencies of less than 400ns r/t, which
    was about double the Intel inter-socket latencies at the time.

    You didn't find many buyers, did you?

  • From Michael S@21:1/5 to Thomas Koenig on Thu May 2 02:13:09 2024
    On Wed, 1 May 2024 22:11:46 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    MitchAlsup1 <[email protected]> wrote:
    Michael S wrote:

    On Wed, 1 May 2024 16:38:09 +0000
    [email protected] (MitchAlsup1) wrote:

    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

    I don't see what is wrong with loading a container with the
    field and then extracting or inserting into the container.

    You still need a place to put a bit offset for the base address
    of the field. Why not put it together with the rest of the
    address?

    Given a 20-40 year life of an architecture and the desire not to
    be limited by addressability; I wanted and demanded of myself a
    full 63-bit virtual address space per thread. Therefore, no bits
    in the pointer are available for bit level addressing.


    At the current rate of DRAM scaling (Moore's Law as it applies to DRAM),
    it does not look like anybody will need 63 bits 40 years from now. Arm's
    55 or 56 bits will likely suffice for that long or longer.

    The largest single-system memory I can find quickly is 160 TB, or
    about 47 bits of address space (I rounded down).

    A single Power10 CPU can address 2 Petabytes (51 bits), but of course
    it need not be all RAM.

    How much memory is connected to the biggest cache-coherent Power10
    computer that is actually for sale?
    My guess is 32 TB.
    IBM claims 64 TB, but that claim is likely based on memory technology
    that is not available yet.
    Anyway, even if 64 TB is true, it's only 46 bits.
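
    The bit counts quoted in this exchange are easy to check. A minimal sketch, assuming binary units (1 TB taken as 2**40 bytes) and a made-up helper name `address_bits`:

```python
def address_bits(size_bytes):
    """Smallest number of address bits that can cover size_bytes bytes."""
    return (size_bytes - 1).bit_length()


TIB = 2**40  # assumption: "TB" in the thread is read as a binary terabyte
PIB = 2**50

for label, size in [("32 TB", 32 * TIB), ("64 TB", 64 * TIB),
                    ("160 TB", 160 * TIB), ("2 PB", 2 * PIB)]:
    print(f"{label}: {address_bits(size)} bits")
```

    Note that 160 TB needs 48 bits when rounded up, which is consistent with "about 47 bits (I rounded down)".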

  • From Michael S@21:1/5 to John Levine on Thu May 2 02:02:49 2024
    On Wed, 1 May 2024 21:13:31 -0000 (UTC)
    John Levine <[email protected]> wrote:

    I wouldn't want to try and run linux
    on it but it's great for signal processing.


    I agree with the first part but disagree with the second.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 00:05:00 2024
    On Wed, 1 May 2024 21:13:31 -0000 (UTC), John Levine wrote:

    According to Michael S <[email protected]>:

    Bit-addressable TMS34010 was released 38 years ago and even was
    moderately successful. So, it seems, 50 years ago nothing was set in
    stone yet.

    True, but that chip is designed to be good for video rendering which is
    an unusual application that uses a lot of bit aligned data.

    And yet, all our machines nowadays are doing heavy amounts of “video rendering”, aren’t they? Look at the machine generating the screen display you’re looking at right now.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 23:17:06 2024
    On Wed, 1 May 2024 20:37:11 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian. Despite
    a lot of looking, I have never found an explanation of why DEC made
    the PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
    After noting that the Big- and Little- names come from Gulliver's
    Travels, they say on page 100:

    "Unlike Swift's, the computer Endian controversy is not pointless. The
    Little Endian design has many complications in use; we much prefer the
    Big Endian."

    It’s easy to illustrate why they’re wrong. First of all, a note that, even on big-endian architectures, registers are still actually little-endian.
    Which is yet another reason why big-endian can never be entirely
    consistent.

    Consider this pseudo-assembly-language sequence:

    move.l a, b
    move.b b, c

    where “move” denotes either “load” or “store” as appropriate, the “.b”
    suffix indicates a byte operation, and “.l” denotes a multibyte operation (2, 4, 8 bytes or whatever, doesn’t matter as long as it’s more than 1).

    As for the labels “a”, “b” and “c”, they can be reasonably interpreted (to
    accommodate both RISC and non-RISC architectures) in two ways:
    1) “a” and “c” are registers, “b” is a memory address; or
    2) “b” is a register, while “a” and “c” are memory addresses.

    Now the question is: which byte from “a” ends up at location “c”?

    On a little-endian architecture, it is always the lowest-significance
    byte.

    But on a big-endian architecture, for a register-memory-register move, it
    will be the highest-significance byte. But for the memory-register-memory
    case, it will be the lowest-significance byte.

    In other words, even on big-endian architectures, registers are still interpreted as little-endian!

    Isn’t that fun?
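
    The two cases can be modelled in a few lines of Python. This is only a sketch under stated assumptions: memory is a byte string, a register is an integer, and a byte move to or from a register uses the register's least significant 8 bits (true of many architectures, not necessarily all). The function names are invented here. It reproduces the behaviour described; whether that justifies calling registers "little-endian" is a separate question.

```python
def reg_mem_reg(value, order):
    """Case 1: store a 32-bit register to memory, then load one byte
    from the same address back into a register."""
    memory = value.to_bytes(4, order)  # move.l a, b   (b is a memory address)
    return memory[0]                   # move.b b, c   (byte at address b)


def mem_reg_mem(memory, order):
    """Case 2: load 32 bits from memory into a register, then store one
    byte from that register to memory.

    Assumption: the byte store takes the register's low 8 bits.
    """
    register = int.from_bytes(memory, order)  # move.l a, b   (b is a register)
    return register & 0xFF                    # move.b b, c


value = 0x11223344
for order in ("little", "big"):
    print(order,
          hex(reg_mem_reg(value, order)),
          hex(mem_reg_mem(value.to_bytes(4, order), order)))
```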

  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Thu May 2 00:24:50 2024
    On Wed, 1 May 2024 19:21:32 -0000 (UTC), Stephen Fuld wrote:

    MitchAlsup1 wrote:

    ... looking at code one rarely sees a field in a struct that
    is a bit-field. So, even if the cost was low, the benefits are
    similarly low.

    Sure. But it isn't clear if that was the cause or the result of the hardware.

    Absolutely, I would say that is very much a chicken-and-egg effect. Also,
    if you thought endian issues were complicated, look at how different architectures implement their bit-field instructions.

    Interesting fact: in spite of all the arguments over big-endian versus little-endian, everybody seems to be in agreement over what “shift left” and “shift right” mean: “left” is always to the most significant end, while “right” is always to the least significant end. If you want to do
    bit packing/unpacking in endian-independent C code, you do it with shifts
    and masks.
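
    The same shift-and-mask idiom carries over to any language with integer shifts. A minimal sketch in Python rather than C, with invented helper names and an RGB565 pixel (5, 6 and 5 bits) chosen purely as an example layout:

```python
def pack_fields(fields):
    """Pack (value, width) pairs into one integer, first field most significant."""
    word = 0
    for value, width in fields:
        assert 0 <= value < (1 << width)
        word = (word << width) | value
    return word


def unpack_fields(word, widths):
    """Inverse of pack_fields: split word into fields of the given widths."""
    values = []
    for width in reversed(widths):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return list(reversed(values))


pixel = pack_fields([(31, 5), (0, 6), (31, 5)])  # red=31, green=0, blue=31
# Serialise explicitly, high byte first, regardless of the host's byte order.
wire = bytes([(pixel >> 8) & 0xFF, pixel & 0xFF])
```

    Because the byte order on the wire is spelled out with shifts, the result does not depend on the endianness of the machine running the code.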

  • From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Thu May 2 01:20:57 2024
    On Wed, 01 May 2024 16:28:55 -0400, Stefan Monnier wrote:

    On "personal" computers ... there's been work instead on compressing
    64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome versions) ...

    Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64 instruction set, but with only 32-bit pointers. This way, you got the
    benefit of the extra registers, without the overhead of the longer
    addresses.
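
    For comparison, the pointer-compression idea mentioned in the quoted text can be sketched as storing a 32-bit offset from a fixed base instead of a full 64-bit pointer. This is an illustration only: the base address, the 4 GiB heap size and the function names are hypothetical, and no particular browser or runtime is being described.

```python
HEAP_BASE = 0x0000_7F00_0000_0000  # hypothetical, 4 GiB-aligned heap base
HEAP_SIZE = 1 << 32                # offsets must fit in 32 bits


def compress(pointer):
    """Turn a 64-bit pointer inside the heap into a 32-bit offset."""
    offset = pointer - HEAP_BASE
    if not 0 <= offset < HEAP_SIZE:
        raise ValueError("pointer lies outside the compressible heap")
    return offset


def decompress(offset):
    """Rebuild the full 64-bit pointer from a 32-bit offset."""
    return HEAP_BASE + offset
```

    Unlike x32, this keeps the process in a full 64-bit address space and only shrinks the pointers stored inside one heap.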

    I don’t think it was very popular. Its removal from the Linux kernel was
    proposed some years ago, though as far as I know the support is still
    there.

  • From John Levine@21:1/5 to [email protected] on Thu May 2 01:18:59 2024
    It appears that Lawrence D'Oliveiro <[email protected]d> said:
    "Unlike Swift's, the computer Endian controversy is not pointless. The
    Little Endian design has many complications in use; we much prefer the
    Big Endian."

    It’s easy to illustrate why they’re wrong. First of all, a note that, even
    on big-endian architectures, registers are still actually little-endian.

    I would be most interested in a concrete illustration of this
    implausible argument. How about starting with the IBM 360 principles
    of operation and pointing out the little endian registers.

    If you don't have a copy handy, you can find one here

    https://bitsavers.org/pdf/ibm/360/princOps/A22-6821-7_360PrincOpsDec67.pdf

    You might also look at its instruction set which is quite unlike the ones
    you seem to be familiar with.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Thu May 2 01:28:32 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    Bit-addressable TMS34010 was released 38 years ago and even was
    moderately successful. So, it seems, 50 years ago nothing was set in
    stone yet.

    True, but that chip is designed to be good for video rendering which is
    an unusual application that uses a lot of bit aligned data.

    And yet, all our machines nowadays are doing heavy amounts of “video
    rendering”, aren’t they? Look at the machine generating the screen display
    you’re looking at right now.

    It's an Apple M2 chip with an eight-core dedicated GPU to do the video
    processing. Could you explain what point you're making here?

    Every computer these days does graphics rendering, so they have
    specialized GPUs to make it fast or, on low-end machines, instruction
    set extensions to make it sort of fast. In both cases that is because
    graphics rendering is an unusual application that benefits from
    specialized hardware. I hope that doesn't come as a big surprise.


    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 01:29:42 2024
    On Wed, 1 May 2024 20:50:23 -0000 (UTC), John Levine wrote:

    The PDP-11 had mixed endian 32 bit integers and floats.

    The PDP-11 had no 32-bit integer instructions. It was the Fortran compiler
    (specifically “Fortran IV Plus”) that had mixed-endian 32-bit integers.

    VAX floating point was pretty muddled, too.

    Just rechecking one of their “architecture handbooks”, and the parts containing the mantissae are ordered big-endian by word, but little-endian between the bytes of a word.

    Intel has been consistently little endian as far as I can remember.

    That shows that it is possible. It is not possible for any big-endian architecture.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Thu May 2 01:39:35 2024
    On Wed, 01 May 2024 14:08:25 GMT, Scott Lurndal wrote:

    What about the IBM 1401, Electrodata 220 or Burroughs B5000?

    Not really familiar with those--feel free to mention more details if you
    have them.

    Though I do recall, the 1401 didn’t have a “word length” as such: it was a
    “character”-based machine. For example, it could do arbitrary-precision
    arithmetic--it just kept processing digits until it hit a special
    end-of-data marker--but obviously this only worked for (fixed-point)
    addition and subtraction. The machine had no hardware support for
    multiplication or division. Or floating-point, for that matter.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 01:36:56 2024
    On Wed, 1 May 2024 20:53:06 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    Both ASCII and the System/360 came out in 1964. IBM’s excuse for
    inventing its own EBCDIC encoding was that ASCII wasn’t ready in time.

    If you'd read the paper on the Architecture of System/360, you'd know
    that is just plain wrong. See the link I posted earlier today.

    See also these links:

    <https://en.wikipedia.org/wiki/IBM_System/360_architecture> note 4:

    Because the design of the S/360 occurred simultaneously with the
    development of ASCII, IBM's ASCII support did not match the
    standard that was ultimately adopted.

    <https://news.ycombinator.com/item?id=12360749>:

    This was roughly the same time the ANSI committee was trying to
    standardize ASCII. IBM was a proponent of ASCII, but they had
    shipping deadlines, and kept with their own character set rather
    than delay while they created ASCII peripherals.

    This item <https://retrocomputing.stackexchange.com/questions/15516/when-did-ibm-start-to-use-ascii>
    claims IBM was “a major proponent for ASCII”, but only it seems for communicating with other systems, not internally within its own
    products.

    Odd, don’t you think? But consistent with the time-factor excuse.

  • From John Levine@21:1/5 to [email protected] on Thu May 2 01:51:51 2024
    It appears that Lawrence D'Oliveiro <[email protected]d> said:
    On Wed, 1 May 2024 20:53:06 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    Both ASCII and the System/360 came out in 1964. IBM’s excuse for
    inventing its own EBCDIC encoding was that ASCII wasn’t ready in time.

    If you'd read the paper on the Architecture of System/360, you'd know
    that is just plain wrong. See the link I posted earlier today.

    See also these links:

    I'm familiar with those secondary sources. So just to be clear, you're
    saying that when the S/360 architects published that 1964 paper saying
    why they did what they did, they were lying?

    Be sure and look at figure 2b, "8-bit representation of the 7-bit
    American Standard Code for Information Interchange (ASCII)."

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Thu May 2 01:46:25 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 1 May 2024 20:50:23 -0000 (UTC), John Levine wrote:

    The PDP-11 had mixed endian 32 bit integers and floats.

    The PDP-11 had no 32-bit integer instructions.

    I'm holding in my hand a DEC pdp-11 processor handbook published in 1979.

    On page 359 it describes LDCLF which converts a 32 bit mixed endian
    integer to float or double, and on page 368-9 STCFL which went the other
    way.

    It was the Fortran compiler
    (specifically “Fortran IV Plus”) that had mixed-endian 32-bit integers.

    Unsurprisingly it matched what the hardware did.
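
    For readers who have not met it, the mixed-endian layout usually described for 32-bit values on the PDP-11 puts the most significant 16-bit word at the lower address, with each word stored little-endian. A minimal Python sketch of that layout follows; the function names are invented, and the layout is stated as commonly described rather than checked against the handbook cited above.

```python
def pdp_endian_bytes(value):
    """Lay out a 32-bit integer high word first, each 16-bit word little-endian."""
    high, low = (value >> 16) & 0xFFFF, value & 0xFFFF
    return high.to_bytes(2, "little") + low.to_bytes(2, "little")


def from_pdp_endian(data):
    """Inverse of pdp_endian_bytes for a 4-byte sequence."""
    high = int.from_bytes(data[0:2], "little")
    low = int.from_bytes(data[2:4], "little")
    return (high << 16) | low


print(pdp_endian_bytes(0x0A0B0C0D).hex(" "))  # 0b 0a 0d 0c
```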

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Thu May 2 01:57:53 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 01 May 2024 14:08:25 GMT, Scott Lurndal wrote:

    What about the IBM 1401, Electrodata 220 or Burroughs B5000?

    Not really familiar with those--feel free to mention more details if you
    have them.

    There's plenty of documentation at bitsavers.

    Though I do recall, the 1401 didn’t have a “word length” as such: it was a
    “character”-based machine. For example, it could do arbitrary-precision
    arithmetic--it just kept processing digits until it hit a special
    end-of-data marker--but obviously this only worked for (fixed-point)
    addition and subtraction. The machine had no hardware support for
    multiplication or division. Or floating-point, for that matter.

    You may be confusing it with the 1620. The 1401 had optional multiply
    and divide instructions but I don't think they were very popular. The
    1620 had all four operations, famously and slowly implemented by table
    lookup, and optional hardware floating point.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Thu May 2 01:40:33 2024
    On Wed, 1 May 2024 15:31:37 +0300, Michael S wrote:

    In the world of general-purpose microprocessors, DEC Alpha (until EV56,
    which added the byte/word extension) was more word-addressable than
    byte-addressable, although it is a matter of point of view.

    As I recall, the original design left out byte load and store
    instructions, but this was found to hurt Windows NT performance. So they
    were added later.

  • From Stephen Fuld@21:1/5 to John Levine on Thu May 2 05:05:25 2024
    John Levine wrote:

    snip


    Every computer these days does graphics rendering

    Is that true? What about all those computers that make up Google's
    server farm? Or how about AWS systems? I am not saying they don't,
    just asking.





    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 05:42:04 2024
    On Thu, 2 May 2024 01:18:59 -0000 (UTC), John Levine wrote:

    It appears that Lawrence D'Oliveiro <[email protected]d> said:

    "Unlike Swift's, the computer Endian controversy is not pointless.
    The Little Endian design has many complications in use; we much
    prefer the Big Endian."

    It’s easy to illustrate why they’re wrong. First of all, a note that,
    even on big-endian architectures, registers are still actually
    little-endian.

    I would be most interested in a concrete illustration of this
    implausible argument.

    Sure. Consider this pseudo-assembly-language sequence:

    move.l a, b
    move.b b, c

    where “move” denotes either “load” or “store” as appropriate, the “.b”
    suffix indicates a byte operation, and “.l” denotes a multibyte operation (2, 4, 8 bytes or whatever, doesn’t matter as long as it’s more than 1).

    As for the labels “a”, “b” and “c”, they can be reasonably interpreted (to
    accommodate both RISC and non-RISC architectures) in two ways:
    1) “a” and “c” are registers, “b” is a memory address; or
    2) “b” is a register, while “a” and “c” are memory addresses.

    Now the question is: which byte from “a” ends up at location “c”?

    On a little-endian architecture, it is always the lowest-significance
    byte.

    But on a big-endian architecture, for a register-memory-register move, it
    will be the highest-significance byte. But for the memory-register-memory
    case, it will be the lowest-significance byte.

    In other words, even on big-endian architectures, registers are still interpreted as little-endian!

    Isn’t that fun?

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 06:59:49 2024
    On Thu, 2 May 2024 01:57:53 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    Though I do recall, the 1401 didn’t have a “word length” as such:
    it was a “character”-based machine. For example, it could do
    arbitrary-precision arithmetic--it just kept processing digits
    until it hit a special end-of-data marker--but obviously this only
    worked for (fixed-point) addition and subtraction. The machine had
    no hardware support for multiplication or division. Or
    floating-point, for that matter.

    You may be confusing it with the 1620.

    The 1401 was the one with the “word-mark” bit that I was thinking of,
    which was set to 1 in the final (highest-order) digit of a number.
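
    The idea of "keep processing digits until you hit the marker" can be sketched in Python. This is a loose illustration of the principle, not a model of the real 1401 Add instruction: fields are lists of (digit, word_mark) pairs stored low-order digit first, both operands must have the same length, a final carry is simply dropped, and all names are invented.

```python
def make_field(number, length):
    """Build a decimal field of `length` digits, low-order digit first.
    Only the high-order digit carries the word mark."""
    digits = [int(ch) for ch in reversed(str(number).zfill(length))]
    assert len(digits) == length, "number does not fit in the field"
    return [(d, i == length - 1) for i, d in enumerate(digits)]


def add_fields(a, b):
    """Add field a into field b digit by digit, stopping at b's word mark."""
    result, carry, i = [], 0, 0
    while True:
        (digit_a, _), (digit_b, mark_b) = a[i], b[i]
        total = digit_a + digit_b + carry
        result.append((total % 10, mark_b))
        carry = total // 10
        if mark_b:  # word mark: the high-order digit has been processed
            return result
        i += 1


def field_value(field):
    """Convert a field back to an ordinary integer."""
    return sum(d * 10**i for i, (d, _) in enumerate(field))
```

    The loop has no notion of a word size: the length of the operation is set entirely by where the mark sits in the data.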

    The 1620 looks like it did it in a different way, with a separate
    end-of-number character.

    The “Guide to 1401 Programming” I’m looking at (from 1961) makes no mention of multiplication or division.

  • From Terje Mathisen@21:1/5 to All on Thu May 2 09:54:49 2024
    MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    Byte addressing was invented by IBM for the System/360, introduced in
    1964. At least as I understand it. Up to that time, and indeed for a
    long time after, machines had a “word length” which was the
    smallest addressable unit of memory. This could have a range of sizes,
    e.g.

        12 -- DEC PDP-5/8
        18 -- DEC PDP-1/4/7/9
        36 -- DEC PDP-6/10
        60 -- CDC 6000-series
        64 -- Cray

    CDC had a number of machines with word lengths of 12×k bits, k ∈ {1, 2, 3, 5}.

    I’m sure there were also 24- and 48-bit machines. Note the
    popularity of numbers with a range of different integer divisors,
    including powers of both 2 and 3. The byte-addressable machines
    chucked away everything other than powers of 2, which was a step
    backwards in this respect. ;)

    I would make the argument that 2^k was a step forward not backwards.
    Perhaps another day...

    I've seen the argument that e is the best base from an energy
    standpoint, with 2 and 3 being the two closest integer values.

    Working with trits, encoded as -/0/+, would have been feasible, but
    binary provided much easier implementation. Base conversions are a bit
    messier when you use base3 as the machine representation, but you could
    have used 5 trits (243) to handle the US ASCII character set.
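
    One common form of the "e is the best base" argument is radix economy. It assumes the cost of representing a number is proportional to the base times the number of digits, which works out to b / ln(b) per unit of information. That cost model is an assumption, and it may not be the exact argument referred to above. A minimal sketch with invented function names:

```python
import math


def radix_cost(base):
    """Asymptotic radix-economy cost per unit of information: base / ln(base)."""
    return base / math.log(base)


def trits_needed(count):
    """Smallest number of ternary digits that can distinguish `count` values."""
    trits = 0
    while 3**trits < count:
        trits += 1
    return trits


for base in (2, math.e, 3, 4):
    print(f"base {base:.3f}: cost {radix_cost(base):.3f}")
print("trits for 128 ASCII codes:", trits_needed(128))
```

    Under this model base 3 comes out slightly ahead of base 2, and base 4 ties with base 2, which is why the difference is rarely decisive in practice.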

    In retrospect I'm glad they decided on binary!

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Thu May 2 08:14:38 2024
    Terje Mathisen <[email protected]> wrote:

    Working with trits, encoded as -/0/+, would have been feasible,

    There was a Russian computer that implemented that.

    but
    binary provided much easier implementation. Base conversions are a bit messier when you use base3 as the machine representation, but you could
    have used 5 trits (243) to handle the US ASCII character set.

    In retrospect I'm glad they decided on binary!

    I like balanced ternary for its symmetry. There appears to have been a
    Soviet computer implementing it, the Setun
    (https://en.wikipedia.org/wiki/Setun). I also like the idea of
    encoding a comparison with three values in a single trit.
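
    Both properties are easy to see in a short Python sketch. Digits are written "-", "0" and "+", most significant first; the function names are invented for illustration.

```python
TRIT_VALUE = {"-": -1, "0": 0, "+": 1}


def to_balanced_ternary(n):
    """Convert an integer to balanced ternary, most significant trit first."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        r = n % 3
        if r == 2:  # a remainder of 2 becomes digit -1 with a carry upward
            r = -1
        digits.append("-0+"[r + 1])
        n = (n - r) // 3
    return "".join(reversed(digits))


def from_balanced_ternary(text):
    """Convert a balanced-ternary string back to an integer."""
    value = 0
    for ch in text:
        value = value * 3 + TRIT_VALUE[ch]
    return value


def negate(text):
    """Symmetry: negation just swaps '+' and '-'; no separate sign is needed."""
    return text.translate(str.maketrans("+-", "-+"))


def compare_trit(a, b):
    """Three-way comparison result encoded as a single trit."""
    return "-0+"[(a > b) - (a < b) + 1]
```

    Five trits cover -121..121, which is where the 243 values mentioned above come from.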

    But for today's technology, binary is much easier to implement,
    so it is the logical choice.

  • From Lawrence D'Oliveiro@21:1/5 to All on Thu May 2 07:24:26 2024
    I wrote:

    The “Guide to 1401 Programming” I’m looking at (from 1961) makes no mention of multiplication or division.

    No hardware instructions, just a mention of a multiplication subroutine.

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Thu May 2 08:59:10 2024
    On Thu, 2 May 2024 09:54:49 +0200, Terje Mathisen wrote:

    I've seen the argument that e is the best base from an energy
    standpoint, with 2 and 3 being the two closest integer values.

    To implement a non-integer base, you would need something like a
    probabilistic distribution of combinations of digits, rather than allowing every possible combination to be equally representable. Then you can
    average out the information content to a suitable value.

    So it would be an average-base-e representation.

  • From Michael S@21:1/5 to John Levine on Thu May 2 13:54:32 2024
    On Wed, 1 May 2024 20:37:11 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian.
    Despite a lot of looking, I have never found an explanation of why
    DEC made the PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
    After noting that the Big- and Little- names come from Gulliver's
    Travels, they say on page 100:

    "Unlike Swift's, the computer Endian controversy is not pointless.
    The Little Endian design has many complications in use; we much
    prefer the Big Endian. Having two active conventions is very painful.
    Several recent Big Endian RISC computers, including the MIPS, the
    Motorola 88000, and the Intel i860 provide a data-movement operation
    that can perform the Big Endian-Little Endian permutation. We predict
    that Little Endian addressing will die out, just as decimal addressing
    did."


    IMHO, statements like that are forgivable for Blaauw (born 1924). Less
    so for Brooks, who was seven years younger.

    Really, people like what they are used to. They were just wrong about
    the i860 which was little endian, but had a mode bit to make data
    addressing big endian.

    Expressions of personal prejudice are fine for informal Usenet
    articles. From a book that claims to be more than a memoir I expect more
    rigorous reasoning.
    But I haven't read the book and don't know its genre. Possibly it is in
    fact a memoir hidden behind an uncharacteristic title. The full title is
    "Computer Architecture: Concepts and Evolution." The last word hints
    that this may be the case.

  • From John Levine@21:1/5 to All on Thu May 2 12:00:40 2024
    According to Michael S <[email protected]>:
    that can perform the Big Endian-Little Endian permutation. We predict
    that Little Endian addressing will die out, just as decimal addressing
    did."


    IMHO, statements like that are forgivable for Blaauw (born 1924). Less
    so for 7 years younger Brooks.

    Really, people like what they are used to. They were just wrong about
    the i860 which was little endian, but had a mode bit to make data
    addressing big endian.

    Expressions of personal prejudices are fine for informal Usenet
    articles. For a book that pretends to be more than a memoir I expect more
    rigorous reasoning.

    It's a pretty good textbook and that prediction is one of the few
    places where they blow it, perhaps because from inside IBM they didn't
    realize how much the rest of the world had moved beyond IBM
    compatibility.

    But my point is that the arguments about big- and little-endian are
    far more about what you are used to than any inherent advantage of one
    or the other. As we have seen in recent bickering here, it is easy to
    construct examples that appear to make your less favored option look
    wrong, particularly if you don't know how actual implementations work.



    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Thu May 2 11:52:56 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    Sure. Consider this pseudo-assembly-language sequence:

    move.l a, b
    move.b b, c
    ...
    Now the question is: which byte from “a” ends up at location “c”?

    You really should stop guessing about computer architectures and read
    up on them instead.

    On S/360, which is the ur-big-endian machine, memory to memory moves
    are different from register loads and stores. There are ICM and STCM
    instructions (added with S/370) that take a four bit mask to say which
    bytes in the register to load or store. There are also IC and STC for
    the common
    case that you only want to load or store the low byte.
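
    As a rough illustration of the masked store, here is a Python sketch of the selection step as I understand it: mask bits are read left to right, each corresponding to one register byte from most to least significant, and the selected bytes are stored contiguously. The function name is invented and details such as condition codes are ignored.

```python
def store_characters_under_mask(register, mask):
    """Return the bytes of a 32-bit register selected by a 4-bit mask.

    Bit 0b1000 of the mask selects the most significant register byte,
    bit 0b0001 the least significant; selected bytes are packed together.
    """
    reg_bytes = register.to_bytes(4, "big")
    return bytes(b for i, b in enumerate(reg_bytes) if mask & (0b1000 >> i))


print(store_characters_under_mask(0x11223344, 0b1010).hex())  # 1133
print(store_characters_under_mask(0x11223344, 0b0001).hex())  # 44
```

    With a mask of 0b0001 the result is just the low byte, matching the common case described above.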

    In other words, even on big-endian architectures, registers are still
    interpreted as little-endian!

    Isn’t that fun?

    I suppose it would be if it were true.


    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Thu May 2 12:20:37 2024
    According to Stephen Fuld <[email protected]d>:
    John Levine wrote:

    snip


    Every computer these days does graphics rendering

    Is that true? What about all those computers that make up Google's
    server farm? Or how about AWS systems? I am not saying they don't,
    just asking.

    AWS has several varieties of their custom Graviton chips:

    https://aws.amazon.com/ec2/graviton/

    Some of them are just ARM cores for stuff like databases but some are
    intended for video processing and game streaming:

    https://aws.amazon.com/ec2/instance-types/g5g/

    So you're right, it's not every computer, but it's more than you might think.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Stephen Fuld@21:1/5 to John Levine on Thu May 2 07:28:56 2024
    On 5/2/2024 5:20 AM, John Levine wrote:
    According to Stephen Fuld <[email protected]d>:
    John Levine wrote:

    snip


    Every computer these days does graphics rendering

    Is that true? What about all those computers that make up Google's
    server farm? Or how about AWS systems? I am not saying they don't,
    just asking.

    AWS has several varieties of their custom Graviton chips:

    https://aws.amazon.com/ec2/graviton/

    Some of them are just ARM cores for stuff like databases but some are intended for video processing and game streaming:

    https://aws.amazon.com/ec2/instance-types/g5g/

    So you're right, it's not every computer, but it's more than you might think.

    Fair enough. Thanks.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Michael S on Thu May 2 14:32:50 2024
    Michael S <[email protected]> writes:
    On Wed, 01 May 2024 21:40:17 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Wed, 1 May 2024 20:30:16 +0000
    [email protected] (MitchAlsup1) wrote:

    Given one can use CXL to coherently link multiples of such a
    system, and not be limited by the number of pins dedicated to DRAM
    access;

    But it would be very slow, so slow that it defeats the point of
    direct addressability.

    On what basis do you make that statement? CXL-memory is real,
    and can be implemented on chiplets in an MCM with better
    than multisocket latencies. Add Gen6 PCIe cut-through switching
    and you get reasonable and useful latencies across a switched fabric.

    Even a decade and a half ago, when we built a similar system using
    QDR InfiniBand and a custom ASIC connected to HT or QPI,
    we had internode latencies of less than 400ns r/t, which
    was about double the Intel inter-socket latencies at the time.

    You didn't find many buyers, did you?

    We were in one of the national labs before the recession eliminated
    further funding.

  • From John Savard@21:1/5 to [email protected] on Thu May 2 08:58:23 2024
    On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On a little-endian architecture, it is always the lowest-significance
    byte.

    But on a big-endian architecture, for a register-memory-register move, it
    will be the highest-significance byte. But for the memory-register-memory
    case, it will be the lowest-significance byte.

    In other words, even on big-endian architectures, registers are still
    interpreted as little-endian!

    Isn’t that fun?

    It had never occurred to me to think about it in this way.

    To me, it just made sense that, since registers contain quantities, if
    you load the value "8" into a register, it will contain the number 8.

    So in a byte operation, the least significant bits of the register are
    used.

    While if you store something in a memory location, you're only using
    the length corresponding to the size of the operand. So, yes, storing
    a value into a byte in memory... puts it at the location of the most
    significant 8 bits of a 32-bit quantity having the same address.

    But so what? Usually, a memory location is used for only one size of
    data. If EQUIVALENCE magic is going on, it makes more sense to have
    numbers in memory look the way we write them, so it's easy to
    understand.

    Plus, if you load a single precision float into a floating-point
    register, you are loading on the left side, not the right side, so the inconsistency to which you're referring now impacts the little-endian
    machines. (Of course, though, that's no longer quite true with IEEE
    754, since the exponent isn't the same size for all precisions, the
    way it was with old-fashioned machines.)

    John Savard

  • From Stefan Monnier@21:1/5 to All on Thu May 2 13:45:50 2024
    On "personal" computers ... there's been work instead on compressing
    64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome
    versions) ...
    Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64

    Indeed, but I got the impression that there is a bit of a revival of
    interest for pointer compression as the evidence seems to point to RAM
    sizes not increasing very much any more on "end user devices".

    See for instance https://v8.dev/blog/pointer-compression

    Note also that this is targeted at JavaScript: dynamically typed
    languages tend to suffer more from the 64bit bloat because of their use
    of "boxing", meaning that pretty much everything (except usually for
    strings and arrays of floats, which are special-cased) doubles in size
    when the "box" size is changed from 32bit to 64bit.
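
    To make the size argument concrete, here is a minimal sketch in C of one way pointer compression can work. It assumes every object lives inside a single region of at most 4 GiB, so a reference can be stored as a 32-bit offset from the region base. The names `heap`, `cptr_t`, `compress`, `decompress` and the two pair structs are made up for illustration and are not taken from V8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heap: all objects live in one region of at most 4 GiB. */
static unsigned char heap[1 << 16];

/* A "compressed pointer" is a 32-bit offset from the heap base. */
typedef uint32_t cptr_t;

static cptr_t compress(const void *p)
{
    return (cptr_t)((const unsigned char *)p - heap);
}

static void *decompress(cptr_t c)
{
    return heap + c;
}

/* A boxed two-slot object with full-width and with compressed slots. */
struct pair_wide   { void  *car, *cdr; };  /* 16 bytes on a typical 64-bit target */
struct pair_narrow { cptr_t car,  cdr; };  /* 8 bytes */
```

    The cost is an add on every dereference; the gain is that a heap made mostly of such boxes shrinks by close to half.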


    Stefan

  • From Anton Ertl@21:1/5 to John Levine on Thu May 2 17:21:30 2024
    John Levine <[email protected]> writes:
    As far as I can tell the 360/370 was consistently big-endian. The
    convention for bit numbering in bytes and words was high to low but
    since there weren't any instructions with bit numbers it didn't
    matter.

    I remember reading the PowerPC documentation where the most
    significant bit was bit 0, so it was consistently big-endian. But the
    problem with this is that the least significant bit of a byte is bit
    7, of a halfword bit 15, of a word bit 31, etc. I don't remember if
    PowerPC has instructions where bit numbers play a role, though.

    With OpenPower being little-endian, did they rewrite all the docs to
    renumber the bits?

    The 68000 and 88000 architectures (which have instructions with bit
    numbers) make the least significant bit have number 0, so they are
    bitwise little-endian. The 68000 is bytewise big-endian, and I
    remember things getting pretty messy when I tried to use bit-numbering instructions for data larger than 32 bits. The 88000 supports
    little-endian mode, but IIRC the DG Aviion used big-endian mode.
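
    The awkwardness of MSB-0 numbering can be sketched in a few lines of C: converting to the LSB-0 number (which is also the shift count) needs the operand width, so the same physical bit gets a different number for each operand size. The function names are illustrative only.

```c
#include <stdint.h>

/* Convert an MSB-0 bit number (IBM/PowerPC documentation style) to the
   LSB-0 number (68000/88000 style) for an operand `width` bits wide. */
static unsigned msb0_to_lsb0(unsigned bit, unsigned width)
{
    return width - 1 - bit;
}

/* Test a bit identified by its MSB-0 number; requires bit < width <= 64. */
static int test_bit_msb0(uint64_t value, unsigned bit, unsigned width)
{
    return (int)((value >> msb0_to_lsb0(bit, width)) & 1u);
}
```

    With these, the least significant bit is bit 7 of a byte, bit 15 of a halfword and bit 31 of a word, exactly as described above.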

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to John Levine on Thu May 2 17:37:47 2024
    John Levine <[email protected]> writes:
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian. Despite a
    lot of looking, I have never found an explanation of why DEC made the
    PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    I happened to be looking at Blaauw and Brooks "Computer Architecture"
    published in 1997, which has several pages on bit and byte numbering.
    After noting that the Big- and Little- names come from Gulliver's
    Travels, they say on page 100:

    "Unlike Swift's, the computer Endian controversy is not pointless. The
    Little Endian design has many complications in use; we much prefer the
    Big Endian. Having two active conventions is very painful. Several
    recent Big Endian RISC computers, including the MIPS, the Motorola
    88000, and the Intel i860

    MIPS and 88000 support both big- and little-endian operation; and at
    least for MIPS, there were a lot of little-endian machines around: the DECstations. Even today, <https://popcon.debian.org/> reports:

    mips : 7
    mips64el : 10
    mipsel : 4

    So twice as many little-endian (el) systems as big-endian ones.

    provide a data-movement operation that can
    perform the Big Endian-Little Endian permutation. We predict that
    Little Endian addressing will die out, just as decimal addressing
    did."

    I did not expect any of them to die out, but actually big-endian is
    dying out. HPPA and SPARC have been cancelled, Power has switched to little-endian, and S390x is a niche, and MIPS has left the
    general-purpose computing field.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Thu May 2 18:23:59 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    Byte addressing was invented by IBM for the System/360, introduced in
    1964. At least as I understand it. Up to that time, and indeed for a
    long time after, machines had a “word length” which was the
    smallest addressable unit of memory. This could have a range of sizes, e.g.

        12 -- DEC PDP-5/8
        18 -- DEC PDP-1/4/7/9
        36 -- DEC PDP-6/10
        60 -- CDC 6000-series
        64 -- Cray

    CDC had a number of machines with word lengths of 12×k bits, with
    k in {1,2,3,5}.

    I’m sure there were also 24- and 48-bit machines. Note the
    popularity of numbers with a range of different integer divisors,
    including powers of both 2 and 3. The byte-addressable machines
    chucked away everything other than powers of 2, which was a step
    backwards in this respect. ;)

    I would make the argument that 2^k was a step forward not backwards.
    Perhaps another day...

    I've seen the argument that e is the best base from an energy
    standpoint, with 2 and 3 being the two closest integer values.

    If one wants to take a low-fan-out signal and drive a lot of loads
    (high fan-out) then the least-energy way of doing this is an
    exponentiating rate of e, but often 3 (sometimes 4) were close enough.
    (Mead-Conway)

    Working with trits, encoded as -/0/+, would have been feasible, but
    binary provided much easier implementation. Base conversions are a bit messier when you use base3 as the machine representation, but you could
    have used 5 trits (243) to handle the US ASCII character set.

    One gets 1 bit of storage with 2 tubes (or transistors) and the storage
    for a stable trit requires 4 tubes (lower storage per tube).

    In retrospect I'm glad they decided on binary!

    Binary chose itself, due to the medium (tubes and transistors).

    Terje

  • From MitchAlsup1@21:1/5 to John Savard on Thu May 2 18:28:18 2024
    John Savard wrote:

    On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro


    Plus, if you load a single precision float into a floating-point
    register, you are loading on the left side, not the right side, so the

    In My 66000, floats are stored on the right side of the register
    {mostly because I do not have FP LD/STs.}

    inconsistency to which you're referring now impacts the little-endian machines. (Of course, though, that's no longer quite true with IEEE
    754, since the exponent isn't the same size for all precisions, the
    way it was with old-fashioned machines.)

    John Savard

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu May 2 18:33:48 2024
    Lawrence D'Oliveiro wrote:

    On Wed, 1 May 2024 20:37:11 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

    Until the PDP-11, all byte addressed machines were bigendian. Despite a
    lot of looking, I have never found an explanation of why DEC made the
    PDP-11 littlendian.

    As I previously mentioned, little-endian just makes more sense.

    I happened to be looking at Blaauw and Brooks "Computer Architecture"
    published in 1997, which has several pages on bit and byte numbering.
    After noting that the Big- and Little- names come from Gulliver's
    Travels, they say on page 100:

    "Unlike Swift's, the computer Endian controversy is not pointless. The
    Little Endian design has many complications in use; we much prefer the
    Big Endian."

    It’s easy to illustrate why they’re wrong. First of all, a note that, even
    on big-endian architectures, registers are still actually little-endian.

    Which is yet another reason why big-endian can never be entirely
    consistent.

    IBM 360 had its most significant bit labeled as bit<0>.

    We don't do that any more because we want the lowest bit number of
    a bit-field to equal the shift count needed to right align the
    bit with the register.
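
    A small C sketch of that property, assuming LSB-0 numbering: the field's lowest bit number is used directly as the right-shift count, with no reference to the register width. `extract_field` is an illustrative name, not an instruction of any particular ISA.

```c
#include <stdint.h>

/* Extract `len` bits whose lowest bit is number `lo` (LSB-0 numbering).
   Requires lo < 64 and 1 <= len <= 64. */
static uint64_t extract_field(uint64_t reg, unsigned lo, unsigned len)
{
    uint64_t mask = (len >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);
    return (reg >> lo) & mask;   /* the bit number is the shift count */
}
```

    With MSB-0 numbering the shift count would instead be the register width minus the field's end position, so the same code would need to know the register size.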

    Consider this pseudo-assembly-language sequence:

    move.l a, b
    move.b b, c

    May I suggest that the above ILLUSTRATES why someone wants to use
    LD and ST instructions rather than directionless MOV instructions.
    The interpretation of the instruction is determined by the operands
    not by the OpCode.

    where “move” denotes either “load” or “store” as appropriate, the “.b”
    suffix indicates a byte operation, and “.l” denotes a multibyte operation
    (2, 4, 8 bytes or whatever, doesn’t matter as long as it’s more than 1).

    As for the labels “a”, “b” and “c”, they can be reasonably interpreted (to
    accommodate both RISC and non-RISC architectures) in two ways:
    1) “a” and “c” are registers, “b” is a memory address; or
    2) “b” is a register, while “a” and “c” are memory addresses.

    All of the above goes away when LD/STs are used instead of MOV.

    Now the question is: which byte from “a” ends up at location “c”?

    On a little-endian architecture, it is always the lowest-significance
    byte.

    But on a big-endian architecture, for a register-memory-register move, it
    will be the highest-significance byte. But for the memory-register-memory
    case, it will be the lowest-significance byte.

    In other words, even on big-endian architectures, registers are still interpreted as little-endian!

    Isn’t that fun?

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 02:59:42 2024
    On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

    ... MIPS has left the general-purpose computing field.

    Not so sure that it has. I think the Chinese “LoongArch” machines are a MIPS derivative.

    Also, if you want to think of “MIPS” as a corporate entity, that would be the company currently known as “Imagination Technologies”. It is true they have given up on the MIPS architecture, and are now quite heavily into
    RISC-V.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 03:02:20 2024
    On Thu, 02 May 2024 17:21:30 GMT, Anton Ertl wrote:

    The 68000 and 88000 architectures (which have instructions with bit
    numbers) make the least significant bit have number 0, so they are
    bitwise little-endian.

    The 68000 family is an example of the knots you can tie yourself into,
    trying to come up with bit numberings for a big-endian architecture.

    The 16-bit members of the family (pre-68020) had single-bit extraction/ insertion instructions, which numbered the bits one way. The 32-bit
    machines added bit-field instructions, which used an entirely different
    bit numbering.

  • From Lawrence D'Oliveiro@21:1/5 to All on Fri May 3 05:55:46 2024
    On Thu, 2 May 2024 18:33:48 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    move.l a, b
    move.b b, c

    May I suggest that the above ILLUSTRATES why someone wants to use LD and
    ST instructions rather than directionless MOV instructions.

    OK, use explicit load/store instead of generic move:

    register-memory-register:

    store.l a, b
    load.b b, c

    memory-register-memory:

    load.l a, b
    store.b b, c

    Do you see why this makes absolutely no difference to what happens, as per
    my description earlier?

    By the way, in case it wasn’t clear: in my examples, the destination
    operand is always the last one.
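
    The two sequences can be modelled in C without depending on the host's byte order, by treating memory as an explicit byte array and a register as a plain 32-bit value. This is only a sketch of the behaviour described in the posts; the `.l` operand is assumed to be 4 bytes and all names are illustrative.

```c
#include <stdint.h>

enum order { LITTLE, BIG };

/* Store a 32-bit register value into 4 bytes of memory in the given order. */
static void store_l(uint8_t *mem, uint32_t reg, enum order o)
{
    for (int i = 0; i < 4; i++) {
        int shift = (o == LITTLE) ? 8 * i : 8 * (3 - i);
        mem[i] = (uint8_t)(reg >> shift);
    }
}

/* Load 4 bytes of memory into a 32-bit register value. */
static uint32_t load_l(const uint8_t *mem, enum order o)
{
    uint32_t reg = 0;
    for (int i = 0; i < 4; i++) {
        int shift = (o == LITTLE) ? 8 * i : 8 * (3 - i);
        reg |= (uint32_t)mem[i] << shift;
    }
    return reg;
}

/* register-memory-register: store.l a, b ; load.b b, c */
static uint8_t reg_mem_reg(uint32_t a, enum order o)
{
    uint8_t b[4];
    store_l(b, a, o);
    return b[0];            /* byte load from the same address as the store */
}

/* memory-register-memory: load.l a, b ; store.b b, c */
static uint8_t mem_reg_mem(const uint8_t *a, enum order o)
{
    uint32_t b = load_l(a, o);
    return (uint8_t)b;      /* byte store takes the low 8 bits of the register */
}
```

    For the value 0x11223344, `reg_mem_reg` yields 0x44 in the little-endian model and 0x11 in the big-endian one, while `mem_reg_mem` yields the low byte 0x44 in both.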

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Fri May 3 09:48:50 2024
    Lawrence D'Oliveiro wrote:
    On Wed, 1 May 2024 09:02:22 -0000 (UTC), Thomas Koenig wrote:

    Hmm... what sort of terminals and character sets did people use on a
    PDP-10? 7-bit ASCII? It (and the PDP-6) were released before the ASCII
    standard came out.

    A bit before my time, but I recall terms like “SIXBIT” encoding from looking at docs. Also this weird thing called “Radix-50” (the “50” actually being octal for 40 decimal) did persist into PDP-11 days, when I came along. It was a way of packing 3 characters (from a limited set, of course) into 2 bytes.

    Radix 40 needs 64000 values to hold 3 characters from a set like
    [' ',0-9,A-Z,_,-,=] (pick any three characters you want for those last
    slots), it matches perfectly the classic 6.3 filename convention where
    names are limited to 6 characters, an (implied period) and a 3-character extension/file type.

    The 3-char to 2-byte packing was of course easy(*), while unpacking is a
    bit harder if you don't want to use div/mod operations. I strongly
    suspect that the file system designers would do searches for a
    particular extension by first packing the extension and then searching for
    the resulting packed 16-bit value, instead of unpacking each stored
    extension into the 3-char result.

    (*)
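    uint16_t pack3(const unsigned char *ext) {
      /* table[] maps each allowed character to its code 0..39 */
      unsigned a = table[ext[0]], b = table[ext[1]], c = table[ext[2]];
      return (uint16_t)(a + b*40 + c*1600); /* at most 63999, so 16 bits */
    }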

    b*40 = (b*5)<<3
    or
    b*40 = (b<<5)+(b<<3)
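
    For completeness, here is a self-contained sketch of both directions in C. It uses the character set suggested above (space, digits, letters and three extra symbols), which is an assumption for illustration and not DEC's actual RADIX-50 table, and it keeps the ordering of the snippet above (first character in the lowest position). The unpacking side uses the div/mod operations mentioned earlier.

```c
#include <stdint.h>
#include <string.h>

/* Assumed 40-character set; a character's index is its code. */
static const char charset[] = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_-=";

static unsigned code_of(char ch)
{
    const char *p = (ch != '\0') ? strchr(charset, ch) : NULL;
    return p ? (unsigned)(p - charset) : 0;   /* unknown characters map to ' ' */
}

/* Pack 3 characters into 16 bits; `s` must hold at least 2 characters
   plus a terminator (a missing third character packs as a space). */
static uint16_t r40_pack(const char *s)
{
    unsigned a = code_of(s[0]), b = code_of(s[1]), c = code_of(s[2]);
    return (uint16_t)(a + b * 40 + c * 1600);
}

/* Unpack 16 bits into a 3-character, NUL-terminated string. */
static void r40_unpack(uint16_t w, char out[4])
{
    out[0] = charset[w % 40];
    out[1] = charset[(w / 40) % 40];
    out[2] = charset[(w / 1600) % 40];
    out[3] = '\0';
}
```

    Searching a directory for one extension then needs a single `r40_pack` call followed by 16-bit compares, with no unpacking at all.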

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri May 3 08:51:30 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

    ... MIPS has left the general-purpose computing field.

    Not so sure that it has. I think the Chinese “LoongArch” machines are a
    MIPS derivative.

    They may have started with MIPS, like several others, but now they are LoongArch. Looking in <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
    I don't find anything about byte order, but it says:

    |LoongArch bit designations are always little-endian.

    Also, if you want to think of “MIPS” as a corporate entity, that would be
    the company currently known as “Imagination Technologies”. It is true they
    have given up on the MIPS architecture

    That's even worse for MIPS than what I know of, which was that it was
    used for embedded uses.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Bernd Linsel@21:1/5 to All on Fri May 3 15:29:04 2024
    On 03.05.24 10:51, Anton Ertl wrote:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

    ... MIPS has left the general-purpose computing field.

    Not so sure that it has. I think the Chinese “LoongArch” machines are a
    MIPS derivative.

    They may have started with MIPS, like several others, but now they are
    LoongArch. Looking in
    <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
    I don't find anything about byte order, but it says:

    |LoongArch bit designations are always little-endian.

    Also, if you want to think of “MIPS” as a corporate entity, that would be
    the company currently known as “Imagination Technologies”. It is true they
    have given up on the MIPS architecture

    That's even worse for MIPS than what I know of, which was that it was
    used for embedded uses.

    - anton

    MIPS32 is still used in Microchip's PIC32 microcontroller series.

    --
    Bernd Linsel

  • From Michael S@21:1/5 to Anton Ertl on Fri May 3 17:40:20 2024
    On Fri, 03 May 2024 08:51:30 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

    ... MIPS has left the general-purpose computing field.

    Not so sure that it has. I think the Chinese “LoongArch” machines are a
    MIPS derivative.

    They may have started with MIPS, like several others, but now they are LoongArch. Looking in <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
    I don't find anything about byte order, but it says:

    |LoongArch bit designations are always little-endian.

    Also, if you want to think of “MIPS” as a corporate entity, that would be
    the company currently known as “Imagination Technologies”. It is true they
    have given up on the MIPS architecture

    That's even worse for MIPS than what I know of, which was that it was
    used for embedded uses.

    - anton

    My impression was that embedded MIPS had two main players behind it:
    - Microchip on the low end. Measured on Arm scale from about Cortex-M3
    class to Cortex-M7 class.
    - Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.

    Microchip will continue to sell it for a decade at least. Microchip does
    not tend to talk openly about directions, however their behavior shows
    that their direction right now is away from MIPS and currently toward
    Arm.

    Cavium was absorbed by Marvell several years ago. Marvell, like
    Microchip, does not tend to talk openly about directions. But when
    Cavium was still independent, they did say that all new development
    would be Arm. Since Cavium's market (high-end network equipment) is less
    conservative and more fashion-driven, it probably means that they have
    had no new MIPS customers for almost a decade and that old customers
    likely buy much less as well.

    As far as I am concerned, it's a pity, because I find MIPS's latest ISA
    (nanoMIPS) very interesting and probably quite practical.

  • From Anton Ertl@21:1/5 to Michael S on Fri May 3 15:02:16 2024
    Michael S <[email protected]> writes:
    In the world of general-purpose microprocessor, DEC Alpha (until EV6)
    was more like word-addressable than byte-addressable, although it is a
    matter of point of view.

    No, Alpha has had byte addresses from the start, and that made it easy
    to add the BWX instructions in EV56.

    What its EV4 and EV5 implementations do not have are instructions for
    *accessing* bytes and (PDP-11) words in memory, but that's completely
    different from a word-addressed machine. When you add 1 to an address
    on the Alpha, you get the address of the next byte. When you do the
    same on a word-addressed machine, you get the address of the next
    word.
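
    The distinction can be sketched in C: the address is a byte address, but the only memory access used is an aligned 64-bit load, and the low three address bits select the byte afterwards. This is roughly what pre-BWX Alpha code did with LDQ_U followed by EXTBL; the function below is an illustrative model over an array of quadwords, not Alpha code, and it assumes little-endian byte numbering within the quadword.

```c
#include <stdint.h>

/* Fetch the byte at byte address `addr` using only an aligned 64-bit load.
   `mem` models memory as an array of quadwords. */
static uint8_t load_byte_via_quad(const uint64_t *mem, uint64_t addr)
{
    uint64_t quad = mem[addr >> 3];              /* quadword containing the byte */
    unsigned shift = (unsigned)(addr & 7) * 8;   /* low 3 address bits pick the byte */
    return (uint8_t)(quad >> shift);
}
```

    Adding 1 to `addr` moves to the next byte, not the next quadword, which is the point being made about byte addressing.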

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri May 3 15:13:30 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    Why was byte addressing invented? I think it was for easy handling of
    strings and other binary data.

    Yes, the S/360 was intended to succeed both IBM's word-addressed
    scientific line (such as the IBM 7094) and its character/digit-serial commercial lines such as the 7080 and the 1401. Combining byte
    addressing with a fixed word size provided both.

    The "360" refers to the full circle (an idea that IBM marketing
    promptly put aside when they introduced the S/370 line).

    But why stop there?

    Others have provided good answers for that. Here's another one: Given
    the requirements (based on the predecessors), there was no reason to
    go beyond byte addressing. And looking at history, this seems to have
    been the right choice.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Michael S on Fri May 3 17:51:17 2024
    On 03/05/2024 16:40, Michael S wrote:
    On Fri, 03 May 2024 08:51:30 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

    ... MIPS has left the general-purpose computing field.

    Not so sure that it has. I think the Chinese “LoongArch”
    machines are a MIPS derivative.

    They may have started with MIPS, like several others, but now they are
    LoongArch. Looking in
    <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
    I don't find anything about byte order, but it says:

    |LoongArch bit designations are always little-endian.

    Also, if you want to think of “MIPS” as a corporate entity, that
    would be the company currently known as “Imagination
    Technologies”. It is true they have given up on the MIPS
    architecture

    That's even worse for MIPS than what I know of, which was that it was
    used for embedded uses.

    - anton

    My impression was that embedded MIPS had two main players behind it:
    - Microchip on the low end. Measured on Arm scale from about Cortex-M3
    class to Cortex-M7 class.
    - Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.

    Microchip will continue to sell it for a decade at least. Microchip does
    not tend to talk openly about directions, however their behavior shows
    that their direction right now is away from MIPS and currently toward
    Arm.

    Microchip are good at continuing to produce old devices. But as you
    say, they have moved to ARM for 32-bit.

    Basically, Microchip managed to ruin embedded MIPS as a choice of
    processor core. They used a four-pronged attack here :

    1. They picked an older MIPS core for their first PIC32 line, rather
    than the newer ones that more directly competed with microcontroller ARM
    cores of the time, thus ensuring that their microcontroller would not be
    power or performance competitive.

    2. They made serious hardware errors in the first chips. A big
    marketing feature of the PIC32 was that it supported 480 Mbps USB - but
    it did not, and it took a very long time to make a fixed version. In
    the meantime, the chip was still advertised as being the only available
    microcontroller with 480 Mbps USB on chip, with an erratum saying "reduce
    USB to 12 Mbps" as a "workaround" for the problem. This helped the
    PIC32 gain a reputation as a broken and poor-quality device, which
    reflected (unfairly) on the core.

    3. They called it "PIC32". If you are familiar with the PIC series, you
    know they have their good points - they are very robust and reliable microcontrollers (the PIC32 was the exception here), available for
    decades in hobby-friendly packages. And they also have the most
    brain-dead processor core known to man, making the 8051 pleasant in
    comparison, combined with some of the worst quality and buggiest
    compilers ever written and sold at ridiculously high prices. Thus
    anyone familiar with Microchip PIC devices (most small-systems embedded developers) and unfamiliar with MIPS (most small-systems embedded
    developers) would assume that the PIC32 core would be horrible to work
    with and almost impossible to program in reasonable standard C, with the
    "32" referring to some random part of the architecture rather than the processor width.

    4. They set themselves against the open source development tools
    community by packaging a modified GCC as though it were /their/
    compiler. Every indication that it was not made by Microchip themselves
    was hidden in the tiniest of small print. The library that they
    provided was licensed as strictly as their lawyers could manage - you
    were not allowed to use it with development tools other than the
    binaries provided by Microchip. (You /could/, in theory, get their
    modified GCC source for the compiler - but only at extreme effort. It
    was not quite at the point of delivering the source on open reel tape,
    but not far off it.) The modifications that Microchip made to GCC were
    to disable any kind of optimisation unless you had bought the amazingly expensive version of the development tool license from them. So most
    people had to use the devices with no optimisation at all.

    At the same time, people were using better ARM cores with full
    optimisations. In practice, ARM cores were 10-20 times faster than MIPS appeared to be, thanks to Microchip. It is no wonder they never caught on.


    I'm sure there are other reasons why MIPS failed, despite having cores
    that were comparable or better than ARM for small-systems embedded
    devices. But Microchip has to take a large chunk of the blame, IMHO.

  • From John Levine@21:1/5 to All on Fri May 3 18:42:29 2024
    Lawrence D'Oliveiro wrote:

    move.l a, b
    move.b b, c

    This is the same mistake that Brooks and Blaauw made, so invested in
    your familiar byte order that you imagine that normal differences of
    the other are somehow wrong.

    Here's a concrete example on S/360.

    L R,100
    STH R,200

    That does a four byte load of location 100 into a register, and then
    a two byte halfword store into 200. The load gets bytes 100 through 103
    with 100 going into the high byte of the register. The store puts its
    values into bytes 200 and 201. Since it's the low half of the register,
    the new contents of 200 and 201 are the old contents of 102 and 103.

    Before anyone says aha, that's surprising or wrong: no, it's not. It's
    the way big-endian addressing works, and it would be surprising and
    wrong if it did anything else. If we wanted to put the contents of 100
    and 101 into 200 and 201, we'd have done something else, maybe this on
    S/370 and later to explicitly store the high two bytes of the word:

    L R,100
    STCM R,12,200

    or just move the two bytes directly

    MVC 200(2),100

    I have written assembler code for S/360, PDP-11, Vax, ROMP, 8086/286
    and more machines using both byte orders than I can remember, so I'm
    speaking from experience here, not guessing.
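
    The S/360 sequences above can be checked against a toy model in C. The functions below are named after the instructions they imitate but are only a sketch: a 256-byte array stands in for storage, registers are plain 32-bit values, and base/index addressing is ignored.

```c
#include <stdint.h>

static uint8_t mem[256];   /* toy storage, byte addressed */

/* L: load the 4 bytes at addr; the byte at addr becomes the high byte. */
static uint32_t L(unsigned addr)
{
    return ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
           ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
}

/* STH: store the low halfword of the register, high byte first. */
static void STH(uint32_t r, unsigned addr)
{
    mem[addr]     = (uint8_t)(r >> 8);
    mem[addr + 1] = (uint8_t)r;
}

/* STCM: store the register bytes selected by a 4-bit mask (8 selects the
   high byte) into consecutive locations starting at addr. */
static void STCM(uint32_t r, unsigned mask, unsigned addr)
{
    for (int i = 0; i < 4; i++)
        if (mask & (8u >> i))
            mem[addr++] = (uint8_t)(r >> (8 * (3 - i)));
}
```

    With bytes AA BB CC DD at 100 through 103, `STH(L(100), 200)` leaves CC DD at 200 and 201, and `STCM(L(100), 12, 200)` leaves AA BB there instead, matching the description above.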

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Fri May 3 19:04:19 2024
    Lawrence D'Oliveiro wrote:

    On Thu, 2 May 2024 18:33:48 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    move.l a, b
    move.b b, c

    May I suggest that the above ILLUSTRATES why someone wants to use LD and
    ST instructions rather than directionless MOV instructions.

    OK, use explicit load/store instead of generic move:

    register-memory-register:

    store.l a, b
    load.b b, c

    memory-register-memory:

    load.l a, b
    store.b b, c

    Do you see why this makes absolutely no difference to what happens, as per
    my description earlier?

    Yes, because you explicitly left out the syntactic sugar.

    Try::

    STD R7,[IP,#192]
    LDSB R8,[SP,#32]

    See, by having the syntactic sugar to identify which is the register
    and which is the address and what direction the data is traveling,
    all the confusion goes away.

    The OpCode tells the direction: LD is inbound, ST is outbound.
    The operand with the 'R' is the register
    The operand with the '[' and ']' is the address.

    By the way, in case it wasn’t clear: in my examples, the destination operand is always the last one.

    My preference is that the address operands are always in the same spot in
    the instruction, and that the destination register is the receiver of a
    LD and the sender of the ST.

    And secondly, the destination is written like one writes assignments::

    R9 = memory( pointer, index, offset );
    or
    R8 = R8 + #32

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Fri May 3 22:26:04 2024
    On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:

    To me, it just made sense that, since registers contain quantities, if
    you load the value "8" into a register, it will contain the number 8.

    So in a byte operation, the least significant bits of the register are
    used.

    Of course that makes sense.

    Now, think of main memory as just a holding place for stuff that won’t fit
    in registers: why shouldn’t it make sense there as well?

  • From Lawrence D'Oliveiro@21:1/5 to All on Fri May 3 22:24:46 2024
    On Fri, 3 May 2024 19:04:19 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    Do you see why this makes absolutely no difference to what happens, as
    per my description earlier?

    Yes, because you explicitly left out the syntactic sugar.

    None of which makes any difference to the point: even on a big-endian architecture, registers are still effectively little-endian!

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 22:28:45 2024
    On Fri, 03 May 2024 08:51:30 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    Also, if you want to think of “MIPS” as a corporate entity, that would be the company currently known as “Imagination Technologies”. It is true they have given up on the MIPS architecture

    That's even worse for MIPS than what I know of, which was that it was
    used for embedded uses.

    I think it still is, it just isn’t bringing in money for “MIPS IP” any more.

    Last I heard, unit shipments for the top 3 architectures were:

    ARM -- around 10 billion per year
    RISC-V -- now in the billions, too
    MIPS -- something like 840 million per year

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 22:32:22 2024
    On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    But why stop there?

    Others have provided good answers for that. Here's another one: Given
    the requirements (based on the predecessors), there was no reason to go
    beyond byte addressing. And looking at history, this seems to have been
    the right choice.

    That applied back in history, when we had fewer addressing bits to play
    with, what about now?

  • From John Levine@21:1/5 to All on Sat May 4 02:00:33 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    Others have provided good answers for that. Here's another one: Given
    the requirements (based on the predecessors), there was no reason to go
    beyond byte addressing. And looking at history, this seems to have been
    the right choice.

    That applied back in history, when we had fewer addressing bits to play
    with, what about now?

    What applications do you think would work better with bit addressing?

    I can think of some kinds of data compression that use variable sized
    bit fields, and I suppose graphics rendering although these days it's
    rare to find a display without at least 8 bits per pixel and in any
    event, most displays have GPUs nearby to do the rendering.

    Compare that to all the other stuff for which bit addressing would just
    be extra baggage. Where's the benefit?

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat May 4 06:44:11 2024
    On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

    Not a huge use-case in graphics, as noted, in most cases this is done
    with 16 or 32 bit pixels; and bit-plane graphics are long since dead.

    What happens if we go beyond 32 bits? For example, hardware might support
    10 bits per pixel component.

  • From [email protected]@21:1/5 to Stefan Monnier on Sat May 4 09:40:45 2024
    Stefan Monnier <[email protected]> wrote:
    On "personal" computers ... there's been work instead on compressing
    64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome
    versions) ...
    Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64

    Indeed, but I got the impression that there is a bit of a revival of
    interest for pointer compression as the evidence seems to point to RAM
    sizes not increasing very much any more on "end user devices".

    See for instance https://v8.dev/blog/pointer-compression

    Note also that this is targeted at JavaScript: dynamically typed
    languages tend to suffer more from the 64bit bloat because of their
    use of "boxing", meaning that pretty much everything (except usually
    for strings and arrays of floats, which are special-cased) doubles
    in size when the "box" size is changed from 32bit to 64bit.

    We've used compressed 32-bit pointers in Java for more than a decade
    now. Every object in the Java VM is 8-aligned, so a 32-bit-wide
    aligned pointer gets you access to 32G of addressable application
    memory.
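
    The arithmetic behind that 32G figure can be sketched in a few lines of
    Python. This is only an illustration of the shift-by-alignment idea, not
    the JVM's actual code; `compress`, `decompress` and `heap_base` are
    made-up names, and a heap base of zero is an assumption.

```python
ALIGN_SHIFT = 3  # objects are 8-byte aligned, so the low 3 address bits are zero


def compress(addr: int, heap_base: int = 0) -> int:
    """Pack a heap address into 32 bits by dropping the always-zero low bits."""
    offset = addr - heap_base
    if offset < 0 or offset & ((1 << ALIGN_SHIFT) - 1):
        raise ValueError("address is outside the heap or not 8-byte aligned")
    ref = offset >> ALIGN_SHIFT
    if ref >= 1 << 32:
        raise ValueError("address is beyond the range of a 32-bit reference")
    return ref


def decompress(ref: int, heap_base: int = 0) -> int:
    """Recover the full address from a 32-bit compressed reference."""
    return heap_base + (ref << ALIGN_SHIFT)


# 2**32 possible references, each naming an 8-byte-aligned slot.
REACH_BYTES = (1 << 32) << ALIGN_SHIFT
```

    With these definitions `REACH_BYTES` works out to 32 * 2**30, i.e. the
    32G mentioned above, and the highest usable address is 8 bytes below it.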

    This is a win, not just for saving storage but improving performance.
    Java applications are often memory-bandwidth limited, so memory
    efficiency is a pretty good proxy for performance. The less memory you
    use, the more customers you can serve.

    Andrew.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sat May 4 09:11:27 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    But why stop there?

    Others have provided good answers for that. Here's another one: Given
    the requirements (based on the predecessors), there was no reason to go
    beyond byte addressing. And looking at history, this seems to have been
    the right choice.

    That applied back in history, when we had fewer addressing bits to play
    with, what about now?

    Byte addressing still seems to be the right choice, for the same
    reasons: We have lots of string data, and data that needs larger
    units, but for data that fits in smaller units

    a) either there is so little that spending a full byte on it is good
    enough, or

    b) the data is handled by so little code that the burden from the lack
    of bit addressing is relatively low in the overall scheme of things, or

    c) programs deal with arrays of these things in a SIMD way, and bit
    addressing provides little to no benefit.

    For case b), we deal with bits or bit fields in much the same way that
    the word-addressed machines of the old days dealt with characters. I
    guess that there were people who considered byte addressing just as
    unnecessary as most of us consider bit addressing, so what is the
    difference?

    Apparently it lies in the number of use cases. Byte addressing eventually
    won: IBM switched to it with the S/360, DEC with the PDP-11, and the
    successful 16-bit (and later 32-bit) microprocessors supported it, while
    the word-addressed machines were less successful and eventually vanished
    into niches.

    David Ungar's PhD thesis was on SOAR (aka RISC-III), which was either word-addressed or (like Alpha) word-accessed; in one of the last
    chapters of his thesis he wrote that the most beneficial feature for performance that SOAR did not have was byte accesses, which would have
    reduced the number of cycles by IIRC 10% (to be balanced against
    potential negative effects on the cycle-time); I found that quite
    surprising for a thesis that mainly focussed on architectural features
    for Smalltalk execution.

    By contrast, there were two well-known cases of bit-addressed
    machines: The IBM Stretch and the Intel iAPX 432, both of which failed
    to achieve their performance goals and which did not succeed in the
    market. I guess that this is not due to bit-addressing only, but that bit-addressing is a symptom of the feature creep that doomed these
    projects. More focussed projects usually did not add bit addressing.

    I expect that various architects of from-scratch projects have looked
    at the question, and most concluded that bit-addressing provided not
    enough benefits to justify the cost. And those bit-addressed
    architectures that were introduced did not become great hits, unlike
    the S/360 and the PDP-11.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Sat May 4 10:18:29 2024
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64 instruction set, but with only 32-bit pointers. This way, you got the
    benefit of the extra registers, without the overhead of the longer
    addresses.

    That was Donald Knuth's idea.

  • From Scott Lurndal@21:1/5 to Michael S on Sat May 4 15:18:37 2024
    Michael S <[email protected]> writes:
    On Fri, 03 May 2024 08:51:30 GMT
    [email protected] (Anton Ertl) wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:
    On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

    ... MIPS has left the general-purpose computing field.

    Not so sure that it has. I think the Chinese “LoongArch”
    machines are a MIPS derivative.

    They may have started with MIPS, like several others, but now they are
    LoongArch. Looking in
    <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
    I don't find anything about byte order, but it says:

    |LoongArch bit designations are always little-endian.

    Also, if you want to think of “MIPS” as a corporate entity, that
    would be the company currently known as “Imagination
    Technologies”. It is true they have given up on the MIPS
    architecture

    That's even worse for MIPS than what I know of, which was that it was
    used for embedded uses.

    - anton

    My impression was that embedded MIPS had two main players behind it:
    - Microchip on the low end. Measured on Arm scale from about Cortex-M3
    class to Cortex-M7 class.
    - Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.

    The last Cavium MIPS core (Octeon 7800) taped out well over a decade
    ago.


    Cavium was absorbed by Marvell several years ago. Marvell, like
    Microchip, does not tend to talk openly about directions. But when
    Cavium was still independent, they did say that all new development
    would be Arm.

    There are three generations of ARM cores produced by Cavium/Marvell;
    ThunderX, Octeon9 and Octeon10.

    https://www.servethehome.com/marvell-octeon-10-arm-neoverse-n2-dpu-in-the-wild-rivaling-2017-era-intel-xeon/


    As far as I am concerned, it's a pity, because I find MIPS's latest ISA
    (nanoMIPS) very interesting and probably quite practical.

    Personally I prefer ARM64 architecture over MIPS64 by a considerable margin,
    in almost all respects (and I worked at SGI for a number of years in the R10k days).

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Sat May 4 15:19:36 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    But why stop there?

    Others have provided good answers for that. Here's another one: Given
    the requirements (based on the predecessors), there was no reason to go
    beyond byte addressing. And looking at history, this seems to have been
    the right choice.

    That applied back in history, when we had fewer addressing bits to play
    with, what about now?

    There is still no reason to leverage those bits for sub-byte addressing.

  • From Scott Lurndal@21:1/5 to Anton Ertl on Sat May 4 15:21:04 2024
    [email protected] (Anton Ertl) writes:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    But why stop there?

    Others have provided good answers for that. Here's another one: Given
    the requirements (based on the predecessors), there was no reason to go
    beyond byte addressing. And looking at history, this seems to have been
    the right choice.

    That applied back in history, when we had fewer addressing bits to play
    with, what about now?

    Byte addressing still seems to be the right choice, for the same
    reasons: We have lots of string data, and data that needs larger
    units, but for data that fits in smaller units

    a) either there is so little that spending a full byte on it is good
    enough, or

    b) the data is handled by so little code that the burden from the lack
    of bit addressing is relatively low in the overall scheme of things, or

    c) programs deal with arrays of these things in a SIMD way, and bit
    addressing provides little to no benefit.


    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level addressing.
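
    For anyone who has not used them, what such insert/extract instructions
    compute is just a shift and a mask. A rough Python model follows; the
    function names and the (lsb, width) parameter order are my own choice for
    illustration and do not correspond to any particular ISA's operand
    encoding.

```python
def bit_extract(word: int, lsb: int, width: int) -> int:
    """Return the unsigned field of `width` bits starting at bit `lsb`."""
    return (word >> lsb) & ((1 << width) - 1)


def bit_insert(word: int, field: int, lsb: int, width: int) -> int:
    """Return `word` with the `width`-bit field at `lsb` replaced by `field`."""
    mask = ((1 << width) - 1) << lsb
    # Clear the old field, then drop in the new one (truncated to fit).
    return (word & ~mask) | ((field << lsb) & mask)
```

    Each of these is one instruction on the machines that have them, which
    is why a field at an arbitrary bit offset inside a byte-addressed word
    is cheap to reach.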

  • From Michael S@21:1/5 to Scott Lurndal on Sat May 4 21:56:00 2024
    On Sat, 04 May 2024 15:18:37 GMT
    [email protected] (Scott Lurndal) wrote:

    Personally I prefer ARM64 architecture over MIPS64 by a considerable
    margin, in almost all respects (and I worked at SGI for a number of
    years in the R10k days).

    I also prefer ARM64 over MIPS64.
    But nanoMIPS is not MIPS64, it's a new architecture that, at least
    according to my measurements, is head and shoulders above any
    competitors in terms of code density.
    Even MIPSr6 is enough of a divergence from previous releases of MIPS64 to
    be considered a new architecture, but nanoMIPS is an order of magnitude
    bigger change than that.

  • From John Levine@21:1/5 to All on Sat May 4 19:31:54 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

    Not a huge use-case in graphics, as noted, in most cases this is done
    with 16 or 32 bit pixels; and bit-plane graphics are long since dead.

    What happens if we go beyond 32 bits? For example, hardware might support
    10 bits per pixel component.

    I dunno about you but I would align the elements on two-byte
    boundaries and only store the high 10 of the 16 bits. It's not like
    we're short of address space, and it's a lot quicker to multiply and
    divide by 2 or 16 than by 10.
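
    To make the trade-off concrete, here is a small Python sketch comparing
    the two layouts: 10-bit values packed back to back versus one value in
    the high 10 bits of each 16-bit slot. The helper names and the choice of
    little-endian byte order are assumptions made for the example.

```python
def pack10(samples):
    """Pack 10-bit samples back to back into a little-endian bit stream."""
    acc = 0
    for k, s in enumerate(samples):
        acc |= (s & 0x3FF) << (10 * k)
    return acc.to_bytes((10 * len(samples) + 7) // 8, "little")


def read_packed10(buf: bytes, i: int) -> int:
    """Read sample i: needs a bit offset, a split into byte + shift, a mask."""
    byte_off, shift = divmod(i * 10, 8)
    window = int.from_bytes(buf[byte_off:byte_off + 2], "little")
    return (window >> shift) & 0x3FF


def pack16(samples):
    """Store each sample in the high 10 bits of its own 16-bit slot."""
    return b"".join(((s & 0x3FF) << 6).to_bytes(2, "little") for s in samples)


def read_aligned16(buf: bytes, i: int) -> int:
    """Read sample i: the byte offset is just 2*i, then one shift."""
    return int.from_bytes(buf[2 * i:2 * i + 2], "little") >> 6
```

    Four samples take 5 bytes packed and 8 bytes aligned, so the aligned
    form costs space but every element sits at an ordinary byte address.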



    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to John Levine on Sat May 4 22:56:19 2024
    On Sat, 4 May 2024 19:31:54 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

    Not a huge use-case in graphics, as noted, in most cases this is
    done with 16 or 32 bit pixels; and bit-plane graphics are long
    since dead.

    What happens if we go beyond 32 bits? For example, hardware might
    support 10 bits per pixel component.

    I dunno about you but I would align the elements on two-byte
    boundaries and only store the high 10 of the 16 bits. It's not like
    we're short of address space, and it's a lot quicker to multiply and
    divide by 2 or 16 than by 10.




    I agree about the preferable solution and its simplicity, but not about
    the last part.
    Multiplication by 10 is only very slightly slower than multiplication
    by 2 or 16, and the difference shouldn't be noticeable by comparison with
    the other things that we want to do with a pixel.
    On x86/AMD64, multiplication by 2 is, depending on the situation, zero or
    1 instruction, multiplication by 16 is 1 instruction (lsh), and
    multiplication by 10 is either 1 instruction (IMUL) or two simpler
    instructions (LEA+ADD).
    On Arm and aarch64 it's approximately the same, except that there are
    situations in which multiplication by 16 is zero instructions.

  • From MitchAlsup1@21:1/5 to Michael S on Sat May 4 21:08:19 2024
    Michael S wrote:

    On Sat, 4 May 2024 19:31:54 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

    Not a huge use-case in graphics, as noted, in most cases this is
    done with 16 or 32 bit pixels; and bit-plane graphics are long
    since dead.

    What happens if we go beyond 32 bits? For example, hardware might
    support 10 bits per pixel component.

    I dunno about you but I would align the elements on two-byte
    boundaries and only store the high 10 of the 16 bits. It's not like
    we're short of address space, and it's a lot quicker to multiply and
    divide by 2 or 16 than by 10.




    I agree about preferable solution and simplicity, but not about last
    part.

    Multiplication by 10 is only very slightly slower than multiplication
    by 2 or 16 and the difference shouldn't be noticeable by comparison with
    other things that we want to do with pixel.

    Multiplication by 10 used to index an array is not slower than a
    multiplication by 16 (when the ISA is not brain dead)::

    LEA Ri,[Ri,Ri<<3]
    LD Rd,[Rp,Ri]

    Compared to::

    SL Ri,Ri,#4
    LD Rd,[Rp,Ri]

    {{Brain dead ISAs need not apply}}

    On x86/AMD64 - multiplication by 2 is, depending on situation, zero or
    1 instruction, multiplication by 16 is 1 instruction (lsh) and
    multiplication by 10 is either 1 instruction (IMUL) or two simpler
    instructions (LEA+ADD).

    Many times the ADD can be folded into a memory reference as illustrated
    above.

    On Arm and aarch64 it's approximately the same except that there are situations in which multiplication by 16 is zero instructions.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun May 5 00:12:52 2024
    Chris M. Thomasson wrote:

    On 5/4/2024 3:18 AM, Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel (and
    possibly some other places) some years ago. This was using the AMD64
    instruction set, but with only 32-bit pointers. This way, you got the
    benefit of the extra registers, without the overhead of the longer
    addresses.

    That was Donald Knuth's idea.

    Storing meta data in actual pointers, aka aligned on a larger boundary,
    is critical to many advanced lock/wait free algorithms as well. I
    remember storing an actual reference count in pointers before for a
    special type of counting.

    Even if one has multi-location ATOMICs ?? (as a single event ??)

  • From MitchAlsup1@21:1/5 to BGB on Sun May 5 00:11:24 2024
    BGB wrote:

    On 5/4/2024 1:44 AM, Lawrence D'Oliveiro wrote:
    On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

    Not a huge use-case in graphics, as noted, in most cases this is done
    with 16 or 32 bit pixels; and bit-plane graphics are long since dead.

    What happens if we go beyond 32 bits? For example, hardware might support
    10 bits per pixel component.

    A few typical formats:
    RGB555: 0rrrrrgg-gggbbbbb
    RGBA32: aaaaaaaa-rrrrrrrr-gggggggg-bbbbbbbb
    RGB30 : 00rrrrrr-rrrrgggg-ggggggbb-bbbbbbbb (10-bit component RGB)

    Though, for RGB30, there are variants with 10-bit linear RGB, and E5.F5 floating-point (sometimes used for HDR in OpenGL, as opposed to 4x
    Binary16).

    None of these would really benefit from bit addressable memory though.
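
    As an illustration of that last point, packing and unpacking the RGB30
    layout shown above takes only shifts and masks on an ordinary 32-bit
    word. A sketch in Python, with function names invented for the example
    and the two top bits left as zero as in the diagram:

```python
COMPONENT_MAX = 0x3FF  # 10 bits per component


def pack_rgb30(r: int, g: int, b: int) -> int:
    """Pack three 10-bit components as 00rrrrrrrrrrggggggggggbbbbbbbbbb."""
    for c in (r, g, b):
        if not 0 <= c <= COMPONENT_MAX:
            raise ValueError("component does not fit in 10 bits")
    return (r << 20) | (g << 10) | b


def unpack_rgb30(pixel: int):
    """Split a packed RGB30 word back into its (r, g, b) components."""
    return (
        (pixel >> 20) & COMPONENT_MAX,
        (pixel >> 10) & COMPONENT_MAX,
        pixel & COMPONENT_MAX,
    )
```

    The pixel itself stays 4-byte aligned, so only the components inside it
    need sub-byte positioning.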

    Nor are they serviced by any SIMD ISA.

    Though, for LDR, going beyond 8-bit color depth doesn't gain much even
    if the monitor supports it natively. And had noted before when using a
    cheap LCD TV as a monitor, that it only seemed to be working at a
    roughly 6-bit color depth (like, it was seemingly slightly better than RGB555, but not by much).

    Most people's eyes cannot even see the difference unless it is pointed
    out to them.

    Now I am using a 4K OLED, which does support 10b/component, but it
    doesn't make much difference in practice (and even if it did, most
    software won't make much use of it).

    But, say, going from 5 to 8 bits per component is at least noticeable
    (better colors and fewer banding artifacts); going from 8 to 10 bits, not
    so much. The main exception is HDR (but then, over the 0.5 to 1.0 range,
    E5.F5 is only about as accurate as a 6-bit component).

    Posterization is still a problem at 8-bits.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun May 5 00:19:34 2024
    On Thu, 2 May 2024 11:52:56 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    Consider this pseudo-assembly-language sequence:

    move.l a, b
    move.b b, c
    ...
    Now the question is: which byte from “a” ends up at location “c”?

    On S/360, which is the ur-big-endian machine, memory to memory moves are different from register loads and stores.

    Hint: in the register-memory-register case, you would do an MVC followed
    by LOAD. In the memory-register-memory case, it would be LOAD followed by
    MVC.

    Does that put it in System/360 terms you can understand?

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun May 5 00:21:42 2024
    On Fri, 3 May 2024 18:42:29 -0000 (UTC), John Levine wrote:

    Lawrence D'Oliveiro wrote:

    move.l a, b
    move.b b, c

    Here's a concrete example on S/360.

    L R,100
    STH R,200

    That does a four byte load of location 100 into a register, and then a
    two byte halfword store into 200. The load gets bytes 100 through 103
    with 100 going into the high byte of the register. The store puts its
    values into bytes 200 and 201. Since it's the low half of the register,
    the new contents of 200 and 201 are the old contents of 102 and 103.

    So using the same register name to address a halfword gives you the low
    half of the register, not the high half?

    Whereas using the same memory address to address a halfword gives you the
    high half of the word at that location, not the low half?

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun May 5 00:25:17 2024
    On Thu, 2 May 2024 12:00:40 -0000 (UTC), John Levine wrote:

    ... it is easy to construct examples that appear to make your less
    favored option look wrong ...

    Here is the issue: we have three different quantities needing numbering.

    * Bit places within an integer
    * Bit numbers within a bit field
    * Byte numbers within a multibyte integer (offsets from the base address)

    In little-endian, it is easy to relate all these three as follows:

    bit place within integer = bit number within bit field =
    byte number * 8 + bit within byte

    There is no correspondingly simple formula for any big-endian
    architecture.
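
    The relationship can be checked mechanically. The Python sketch below
    assumes bits are numbered from the least significant end (bit 0 = LSB);
    the function names are made up for the illustration. Under that
    numbering, the big-endian mapping also has to know the operand size,
    which the little-endian one does not.

```python
def bit_of_integer(value: int, place: int) -> int:
    """Bit `place` of the integer, counting from the least significant bit."""
    return (value >> place) & 1


def bit_in_memory(value, nbytes, byte_num, bit_in_byte, byteorder):
    """The same question asked of the integer's byte image in memory."""
    mem = value.to_bytes(nbytes, byteorder)
    return (mem[byte_num] >> bit_in_byte) & 1


def little_endian_place(byte_num: int, bit_in_byte: int) -> int:
    # Independent of how wide the integer is.
    return byte_num * 8 + bit_in_byte


def big_endian_place(nbytes: int, byte_num: int, bit_in_byte: int) -> int:
    # Needs the operand size to find where the byte sits in the integer.
    return (nbytes - 1 - byte_num) * 8 + bit_in_byte
```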

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Sun May 5 00:26:49 2024
    On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

    Personally I prefer ARM64 architecture over MIPS64 by a considerable
    margin, in almost all respects ...

    I know MIPS (like SPARC) originated in that brief window when it was
    thought that delayed branches were a good idea, and so it remained saddled
    with that (mis)feature for the rest of its life.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun May 5 04:12:49 2024
    On Sun, 5 May 2024 00:26:49 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

    Personally I prefer ARM64 architecture over MIPS64 by a considerable margin, in almost all respects ...

    I know MIPS (like SPARC) originated in that brief window when it was
    thought that delayed branches were a good idea, and so it remained
    saddled with that (mis)feature for the rest of its life.

    Delay slot was deprecated back in MIPSr6, almost a decade ago.

  • From John Levine@21:1/5 to All on Sun May 5 01:33:39 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    So using the same register name to address a halfword gives you the low
    half of the register, not the high half?

    Whereas using the same memory address to address a halfword gives you the
    high half of the word at that location, not the low half?

    For anyone familiar with big-endian addressing, those would both be
    obviously correct.

    Perhaps this would be a good time to stop digging.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun May 5 01:49:52 2024
    Lawrence D'Oliveiro wrote:

    On Fri, 3 May 2024 18:42:29 -0000 (UTC), John Levine wrote:

    Lawrence D'Oliveiro wrote:

    move.l a, b
    move.b b, c

    Here's a concrete example on S/360.

    L R,100
    STH R,200

    That does a four byte load of location 100 into a register, and then a
    two byte halfword store into 200. The load gets bytes 100 through 103
    with 100 going into the high byte of the register. The store puts its
    values into bytes 200 and 201. Since it's the low half of the register,
    the new contents of 200 and 201 are the old contents of 102 and 103.

    So using the same register name to address a halfword gives you the low
    half of the register, not the high half?

    Whereas using the same memory address to address a halfword gives you the
    high half of the word at that location, not the low half?


    Concrete example::

    Say locations 100:103 contain 0xDEADBEAF

    LD R,100

    R contains 0xDEADBEAF

    STH R,200

    Locations 200:201 contain 0xBEAF

    Whereas::

    LH R,100

    R contains 0xDEAD

    And nobody who understands BE would even question this functionality.
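
    The same sequence can be replayed in Python with the `struct` module,
    where ">" selects big-endian byte order. This is only a model of the
    memory image; the `memory` array and the variable names are invented for
    the example.

```python
import struct

memory = bytearray(256)
memory[100:104] = struct.pack(">I", 0xDEADBEAF)     # big-endian word at 100..103

reg = struct.unpack_from(">I", memory, 100)[0]       # word load from 100
struct.pack_into(">H", memory, 200, reg & 0xFFFF)    # halfword store to 200
half = struct.unpack_from(">H", memory, 100)[0]      # halfword load from 100

# For contrast, the same word laid out little-endian:
le_memory = bytearray(256)
le_memory[100:104] = struct.pack("<I", 0xDEADBEAF)
le_half = struct.unpack_from("<H", le_memory, 100)[0]
```

    After this runs, `reg` is 0xDEADBEAF, bytes 200 and 201 hold BE AF, and
    `half` is 0xDEAD, matching the walk-through above. In the little-endian
    image `le_half` is 0xBEAF instead, which is the asymmetry being argued
    about.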

  • From Lawrence D'Oliveiro@21:1/5 to All on Sun May 5 04:35:51 2024
    On Sun, 5 May 2024 01:49:52 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    So using the same register name to address a halfword gives you the low
    half of the register, not the high half?

    Whereas using the same memory address to address a halfword gives you the
    high half of the word at that location, not the low half?

    Concrete example::

    That’s a “yes”.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun May 5 04:36:34 2024
    On Sun, 5 May 2024 04:12:49 +0300, Michael S wrote:

    On Sun, 5 May 2024 00:26:49 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    thought that delayed branches were a good idea, and so it remained
    saddled with that (mis)feature for the rest of its life.

    Delay slot was deprecated back in MIPSr6, almost a decade ago.

    But that would be a backward-incompatible change, would it not?

  • From Thomas Koenig@21:1/5 to Scott Lurndal on Sun May 5 07:43:27 2024
    Scott Lurndal <[email protected]> schrieb:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level addressing.

    RISC-V: Seems like it's an extension, for which only a draft is
    available, so it is debatable if it has it or not.

    POWER: Certainly, the rlwinm instruction.

    AMD64: Sure, pdep and friends.

    ARM: You certainly know by heart, I don't need to look.

    LoongArch: Looking at the docs, it also has it (BSTRINS etc).

    So, with the possible exception of RISC-V, I cannot see anything
    to contradict you :-)

  • From Michael S@21:1/5 to [email protected] on Sun May 5 11:03:39 2024
    On Sat, 4 May 2024 21:08:19 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Sat, 4 May 2024 19:31:54 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

    Not a huge use-case in graphics, as noted, in most cases this is
    done with 16 or 32 bit pixels; and bit-plane graphics are long
    since dead.

    What happens if we go beyond 32 bits? For example, hardware might
    support 10 bits per pixel component.

    I dunno about you but I would align the elements on two-byte
    boundaries and only store the high 10 of the 16 bits. It's not like
    we're short of address space, and it's a lot quicker to multiply
    and divide by 2 or 16 than by 10.




    I agree about preferable solution and simplicity, but not about last
    part.

    Multiplication by 10 is only very slightly slower than
    multiplication by 2 or 16 and the difference shouldn't be noticeable
    by comparison with other things that we want to do with pixel.

    Multiplication by 10 used to index an array is not slower than a multiplication
    by 16 (when the ISA is not brain dead)::

    LEA Ri,[Ri,Ri<<3]
    LD Rd,[Rp,Ri]


    Are you sure?
    To me, it looks like 9 rather than 10.
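
    A quick check of the arithmetic, modelling an LEA-style operand as
    base + (index << shift). The helper names are made up, and this models
    only the arithmetic, not any particular ISA's encoding or scale limits.

```python
def lea(base: int, index: int, shift: int) -> int:
    """Model of an LEA-style address computation: base + (index << shift)."""
    return base + (index << shift)


def times9(i: int) -> int:
    # The form quoted above: Ri + (Ri << 3) = 9 * Ri.
    return lea(i, i, 3)


def times10(i: int) -> int:
    # Two steps: (Ri + (Ri << 2)) = 5 * Ri, then doubled.
    return lea(i, i, 2) << 1
```

    So a single base-plus-shifted-index form gives 9, and reaching 10 takes
    one more step (the doubling here), unless that step can be folded into
    the addressing mode of the following load.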

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun May 5 11:10:55 2024
    On Sun, 5 May 2024 04:36:34 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sun, 5 May 2024 04:12:49 +0300, Michael S wrote:

    On Sun, 5 May 2024 00:26:49 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    thought that delayed branches were a good idea, and so it remained
    saddled with that (mis)feature for the rest of its life.

    Delay slot was deprecated back in MIPSr6, almost a decade ago.

    But that would be a backward-incompatible change, would it not?

    It would not.
    They added a new set of branches, but preserved the old set.
    If I understand their intentions correctly, the old stuff was supposed
    to be removed in the next release of the ISA. But then two things
    happened simultaneously:
    1) they invented nanoMIPS, which made an incompatible release of "classic"
    MIPS redundant
    2) their financial troubles escalated

  • From Michael S@21:1/5 to David Brown on Sun May 5 11:32:44 2024
    On Fri, 3 May 2024 17:51:17 +0200
    David Brown <[email protected]> wrote:



    I'm sure there are other reasons why MIPS failed, despite having
    cores that were comparable or better than ARM for small-systems
    embedded devices. But Microchip has to take a large chunk of the
    blame, IMHO.


    I am not sure that I agree.
    It seems strange to me to blame Microchip, which did embrace MIPS, and to
    say nothing about their main competitors that never embraced it, i.e.
    STMicro, Philips (== NXP) and TI.

    Also, what about IDT (now owned by Renesas)? In the 1990s they were
    the biggest partners of MIPS in the general-purpose embedded space. I
    would think that they played a bigger role (than Microchip) in MIPS' demise.

  • From Thomas Koenig@21:1/5 to Michael S on Sun May 5 08:21:47 2024
    Michael S <[email protected]> schrieb:
    On Sat, 04 May 2024 15:18:37 GMT
    [email protected] (Scott Lurndal) wrote:

    Personally I prefer ARM64 architecture over MIPS64 by a considerable
    margin, in almost all respects (and I worked at SGI for a number of
    years in the R10k days).

    I also prefer ARM64 over MIPS64.
    But nanoMIPS is not MIPS64, it's a new architecture that, at least
    according to my measurements, is head and shoulders above any
    competitors in terms of code density.

    Hadn't come across it before...

    https://www.anandtech.com/show/12699/mips-announces-i7200-32bit-cpu-with-new-nanomips-isa
    says it has 16, 32 and 48 bit instructions, the latter for encoding
    32-bit immediates. Sounds like a good strategy if you want to
    increase density for a 32-bit ISA, which is also expected to remain
    firmly 32-bit.

  • From Michael S@21:1/5 to Thomas Koenig on Sun May 5 12:13:27 2024
    On Sun, 5 May 2024 07:43:27 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Scott Lurndal <[email protected]> schrieb:

    d) all modern major architectures have instructions for bitfield manipulation (insert, extract) obviating any need for general
    bit-level addressing.

    RISC-V: Seems like it's an extension, for which only a draft is
    available, so it is debatable if it has it or not.

    POWER: Certainly, the rlwinm instruction.

    AMD64: Sure, pdep and friends.


    PEXT/PDEP has no immediate form, which makes it inconvenient for
    'C'-style fixed bit fields. Unless you access the same bitfield
    repeatedly, it takes two instructions instead of one (the first is move
    reg,imm). Also, on many AMD processors PDEP/PEXT is slow.
    BEXTR has the same problem of lacking an immediate form, but at least it
    is fast across the board. Unfortunately, BEXTR does not help with bit
    field insertion.
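
    For readers who don't have the semantics in their head, here is a rough
    software model in Python of what the parallel extract/deposit pair does.
    It is only a sketch: the function names mirror the instructions, but
    this is a bit-at-a-time loop, not how the hardware works. The mask is
    an ordinary operand, which is why a fixed field costs an extra
    instruction to load it.

```python
def pext(src: int, mask: int) -> int:
    """Gather the bits of src selected by mask into the low bits of the result."""
    result = 0
    out_pos = 0
    while mask:
        low = mask & -mask          # lowest set bit still in the mask
        if src & low:
            result |= 1 << out_pos
        out_pos += 1
        mask ^= low                 # clear that bit and continue
    return result


def pdep(src: int, mask: int) -> int:
    """Scatter the low bits of src to the bit positions selected by mask."""
    result = 0
    in_pos = 0
    while mask:
        low = mask & -mask
        if (src >> in_pos) & 1:
            result |= low
        in_pos += 1
        mask ^= low
    return result
```

    With a contiguous mask such as 0x0FF0, pext is a plain field extract,
    and a field insert is (word & ~mask) | pdep(value, mask).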

    ARM: You certainly know by heart, I don't need to look.

    Loongarch: Looking at the docs, it also has it (BSTRINS etc).

    So, with the possible exception of RISC-V, I cannot see anything
    to contradict you :-)

  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun May 5 09:20:08 2024
    [email protected] (Scott Lurndal) writes:
    [email protected] (Anton Ertl) writes:
    Byte addressing still seems to be the right choice, for the same
    reasons: We have lots of string data, and data that needs larger
    units, but for data that fits in smaller units

    a) either there is so little that spending a full byte on it is good
    enough, or

    b) the data is handled by so little code that the burden from the lack
    of bit addressing is relatively low in the overall scheme of things, or

    c) programs deal with arrays of these things in a SIMD way, and bit
    addressing provides little to no benefit.


    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level addressing.

    Many of the word-addressed machines of yesteryear had instructions for character manipulation (insert, extract), but that did not obviate any
    need for byte addressing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Michael S on Sun May 5 09:02:03 2024
    Michael S <[email protected]> writes:
    On Sun, 5 May 2024 00:26:49 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

    Personally I prefer ARM64 architecture over MIPS64 by a considerable
    margin, in almost all respects ...

    I know MIPS (like SPARC) originated in that brief window when it was
    thought that delayed branches were a good idea, and so it remained
    saddled with that (mis)feature for the rest of its life.

    Delay slot was deprecated back in MIPSr6, almost a decade ago.

    MIPS has a number of other misfeatures that made us disable dynamic superinstructions in Gforth and are a problem for other code-copying
    code generators:

    First and foremost, the architectural load delay slot (and, I think,
    similar constraints wrt multiply and divide instructions and/or
    MFHI/MFLO) mean that, unlike for every other architecture we have
    looked at (including IA-64), you cannot just concatenate two pieces of
    code which do work when they are connected with an indirect jump.

    Another nasty property of MIPS is the way direct jumps and calls are
    encoded: The target address is assembled from IIRC the top 4 bits of
    the current PC and the rest of the address as absolute number in the
    instruction. This means that the call/jump would not show up as
    non-relocatable in Gforth's sanity tests, but if a piece of code were
    copied to a target area in a different 256MB-segment, it would fail.
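
    A minimal sketch of that failure mode in Python, assuming the MIPS32
    J/JAL encoding as I understand it: a 26-bit instr_index field shifted
    left by two, combined with the upper four bits of the address of the
    delay-slot instruction. The function name and the example addresses
    are made up for illustration.

```python
def mips_j_target(pc: int, instr_index: int) -> int:
    """Target of a MIPS32 J/JAL located at address pc (assumed encoding)."""
    delay_slot_pc = (pc + 4) & 0xFFFFFFFF
    # The upper 4 bits come from the PC; the low 28 bits come from the instruction.
    return (delay_slot_pc & 0xF0000000) | ((instr_index & 0x03FFFFFF) << 2)


# The same instruction bits, placed at three different addresses.
INDEX = 0x00400100 >> 2
original = mips_j_target(0x00400000, INDEX)      # where the code was built
same_segment = mips_j_target(0x00500000, INDEX)  # copy within the 256MB segment
other_segment = mips_j_target(0x10000000, INDEX) # copy into another segment
```

    A copy inside the same segment still reaches the intended target, which
    is why a relocation sanity test can pass while a copy across a segment
    boundary silently jumps somewhere else.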

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Anton Ertl on Sun May 5 13:00:00 2024
    On Sun, 05 May 2024 09:02:03 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    On Sun, 5 May 2024 00:26:49 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

    Personally I prefer ARM64 architecture over MIPS64 by a
    considerable margin, in almost all respects ...

    I know MIPS (like SPARC) originated in that brief window when it
    was thought that delayed branches were a good idea, and so it
    remained saddled with that (mis)feature for the rest of its life.

    Delay slot was deprecated back in MIPSr6, almost a decade ago.

    MIPS has a number of other misfeatures that made us disable dynamic superinstructions in Gforth and are a problem for other code-copying
    code generators:

    First and foremost, the architectural load delay slot (and, I think,
    similar constraints wrt multiply and divide instructions and/or
    MFHI/MFLO) mean that, unlike for every other architecture we have
    looked at (including IA-64), you cannot just concatenate two pieces of
    code which do work when they are connected with an indirect jump.


    Were not all delay slots except the branch delay slot eliminated back in
    the revision of the ISA that corresponded to the R4K?

    Another nasty property of MIPS is the way direct jumps and calls are
    encoded: The target address is assembled from IIRC the top 4 bits of
    the current PC and the rest of the address as absolute number in the
    instruction. This means that the call/jump would not show up as
    non-relocatable in Gforth's sanity tests, but if a piece of code were
    copied to a target area in a different 256MB-segment, it would fail.

    - anton


    Compact branches (Release 6) have conventional signed PC-relative
    offsets - +-128 MB for unconditional jump/J&L, +-4MB for
    equal/non-equal to zero and +-128 KB for the rest of conditional
    branches.

  • From Robert Swindells@21:1/5 to Anton Ertl on Sun May 5 13:35:01 2024
    On Sat, 04 May 2024 09:11:27 GMT, Anton Ertl wrote:

    David Ungar's PhD thesis was on SOAR (aka RISC-IV), which was either word-addressed or (like Alpha) word-accessed; in one of the last
    chapters of his thesis he wrote that the most beneficial feature for performance that SOAR did not have was byte accesses, which would have reduced the number of cycles by IIRC 10% (to be balanced against
    potential negative effects on the cycle-time); I found that quite
    surprising for a thesis that mainly focussed on architectural features
    for Smalltalk execution.

    I think SOAR was RISC-III and SPUR (their Lisp CPU) RISC-IV.

    My guess is that it was word-addressed.

    The type tags are in the high bits of a word, as they were in all the Lisp Machines of the time which were word-addressed, not the low bits as in
    SPARC.

    On a byte-addressed machine you can use some lower bits "for free" if
    the objects being addressed are always word-sized or larger. SPARC has
    specific instructions to make use of this.
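
    A minimal sketch of the low-bits trick in Python, assuming 4-byte
    aligned objects and therefore two spare bits (the same two bits that
    SPARC's tagged add and subtract instructions treat as the tag). The
    names and the tag width are illustrative assumptions.

```python
TAG_BITS = 2                      # free bits when objects are 4-byte aligned
TAG_MASK = (1 << TAG_BITS) - 1


def make_tagged(addr: int, tag: int) -> int:
    """Combine an aligned address and a small tag into one word."""
    if addr & TAG_MASK:
        raise ValueError("address is not aligned; low bits are not free")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the spare bits")
    return addr | tag


def tag_of(word: int) -> int:
    return word & TAG_MASK


def address_of(word: int) -> int:
    return word & ~TAG_MASK
```

    With tags in the high bits, as on the word-addressed Lisp machines, the
    split is the same idea applied at the other end of the word.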

    There is also a paragraph on page 38 on this topic, it states that
    Smalltalk didn't store byte scalar values in the image.

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Sun May 5 15:31:04 2024
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level addressing.

    RISC-V: Seems like it's an extension, for which only a draft is
    available, so it is debatable if it has it or not.

    POWER: Certainly, the rlwinm instruction.

    AMD64: Sure, pdep and friends.

    ARM: You certainly know by heart, I don't need to look.

    Loongarch: Looking at the docs, it also has it (BSTRINS etc).

    So, with the possible exception of RISC-V, I cannot see anything
    to contradict you :-)

    I would, personally, categorize RISC-V as a niche architecture
    at this time. Give it time to reach "major" status, where
    the extensions become less optional.

  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun May 5 15:32:26 2024
    [email protected] (Anton Ertl) writes:
    [email protected] (Scott Lurndal) writes:
    [email protected] (Anton Ertl) writes:
    Byte addressing still seems to be the right choice, for the same
    reasons: We have lots of string data, and data that needs larger
    units, but for data that fits in smaller units

    a) either there is so little that spending a full byte on it is good
    enough, or

    b) the data is handled by so little code that the burden from the lack
    of bit addressing is relatively low in the overall scheme of things, or

    c) programs deal with arrays of these things in a SIMD way, and bit
    addressing provides little to no benefit.


    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level
    addressing.

    Many of the word-addressed machines of yesteryear had instructions for
    character manipulation (insert, extract), but that did not obviate any
    need for byte addressing.

    And in further news, Apples are not equal to Oranges.

  • From Anton Ertl@21:1/5 to Michael S on Sun May 5 16:09:41 2024
    Michael S <[email protected]> writes:
    On Sun, 05 May 2024 09:02:03 GMT
    [email protected] (Anton Ertl) wrote:
    First and foremost, the architectural load delay slot (and, I think,
    similar constraints wrt multiply and divide instructions and/or
    MFHI/MFLO) mean that, unlike for every other architecture we have
    looked at (including IA-64), you cannot just concatenate two pieces of
    code which do work when they are connected with an indirect jump.


    Were not all delay slots except the branch delay slot eliminated back in
    the revision of the ISA that corresponded to the R4K?

    Certainly Raymond Chen who writes explicitly about the R4000 in <https://devblogs.microsoft.com/oldnewthing/20180404-00/?p=98435>
    still mentions the restrictions on HI/LO register stuff in 2018.

    And even if it was, for 32-bit MIPS the typical build environments and
    build targets are just mips and mipsel, with no MIPS III-specific
    environment.

    And no, looking at the build machine is not good enough: I built some
    version of gcc on an EV56, and then wanted to run it on an EV45, and
    that produced illegal instruction errors, because during bootstrapping
    gcc had decided that it uses BWX instructions, because the build
    machine provides them.

    For building for MIPS64 one can rely on it being at least an R4000,
    but there is still the jump/call problem with that. If the platform
    was very relevant, we would be looking for some workaround, but it
    isn't.

    Another nasty property of MIPS is the way direct jumps and calls are
    encoded: The target address is assembled from IIRC the top 4 bits of
    the current PC and the rest of the address as absolute number in the
    instruction. This means that the call/jump would not show up as
    non-relocatable in Gforth's sanity tests, but if a piece of code were
    copied to a target area in a different 256MB-segment, it would fail.

    - anton


    Compact branches (Release 6) have conventional signed PC-relative
    offsets - +-128 MB for unconditional jump/J&L, +-4MB for
    equal/non-equal to zero and +-128 KB for the rest of conditional
    branches.

    Sounds good, but again you cannot rely on these branches being present.

    There is a lot to be said for providing a plain ISA and doing
    optimizations in the microarchitecture. Among the MIPS descendants,
    RISC-V does much better, Alpha is somewhere in-between.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Savard@21:1/5 to [email protected] on Sun May 5 11:28:12 2024
    On Wed, 1 May 2024 00:09:28 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    Byte addressing was invented by IBM for the System/360, introduced in
    1964. At least as I understand it. Up to that time, and indeed for a long
    time after, machines had a “word length” which was the smallest
    addressable unit of memory. This could have a range of sizes, e.g.

    12 -- DEC PDP-5/8
    18 -- DEC PDP-1/4/7/9
    36 -- DEC PDP-6/10
    60 -- CDC 6000-series
    64 -- Cray

    I’m sure there were also 24- and 48-bit machines.

    Oh, indeed.

    24 bits:
    CDC 924
    SDS 910, 920, 930, 940
    SDS 9300
    DDP-24, -124, -224
    GE 425, 435, 455, 465
    ASI 6020, 6030
    SEL 840
    Honeywell 300
    SCC 660
    Datacraft DC 6024, Harris Slash/4
    Four-Phase Systems System IV/70
    Telefunken TR440
    Philco 2000
    DJS-6

    48 bits:
    CDC 1604
    BESM 6
    Datamatic 1000, Honeywell 400, 800, 1400, 1800
    IBM AN/FSQ-31, -32

    among others.

  • From John Savard@21:1/5 to [email protected] on Sun May 5 11:20:02 2024
    On Wed, 1 May 2024 00:09:28 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    Big-endian supposedly had the advantage of making memory dumps easier to
    read, but little-endian always made more logical sense.

    There is one practical argument for big-endian encoding.

    Let us suppose that a computer has the ability to do *both* decimal
    arithmetic and binary arithmetic.

    So a word in the computer might contain just bits, for binary
    arithmetic. Or it might contain BCD digits, for decimal arithmetic.

    Since it's possible to design an adder where carrying early between
    nibbles can be turned on or off, on for decimal arithmetic, and off
    for binary arithmetic, clearly the order of digits - big-endian or little-endian - should be the same between binary and decimal.

    Also, though, for ease of conversion, the order of BCD digits _should
    be the same as the order of the characters of which these digits are
    the last four bits_ in the representation of a decimal number as a
    character string.

    And that means big-endian.

    If you have decimal arithmetic, there's a direct connection between
    how numbers are represented for reading and writing, and how they are represented for internal arithmetic.
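
    A minimal sketch of that conversion in Python, assuming ASCII digits
    and packed BCD with two digits per byte (the function name is made up
    for illustration). Each digit is just the low four bits of its
    character, and keeping the characters in string order yields the
    most-significant digit first.

```python
def ascii_to_packed_bcd(digits: str) -> bytes:
    """Pack a string of ASCII decimal digits into BCD, two digits per byte."""
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty string of ASCII digits")
    if len(digits) % 2:
        digits = "0" + digits            # pad to a whole number of bytes
    out = bytearray()
    for hi, lo in zip(digits[0::2], digits[1::2]):
        # The BCD digit is the low nibble of the character code.
        out.append(((ord(hi) & 0x0F) << 4) | (ord(lo) & 0x0F))
    return bytes(out)
```

    Read as a big-endian integer, the packed bytes for "1234" show the
    digits in the same order as the string, with no reversal step.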

    John Savard

  • From MitchAlsup1@21:1/5 to All on Sun May 5 18:01:43 2024
    MitchAlsup1 wrote:

    Michael S wrote:

    On Sat, 4 May 2024 21:08:19 +0000
    [email protected] (MitchAlsup1) wrote:


    Multiplication by 10 used to index an array is not slower than a
    multiplication by 16 (when the ISA is not brain dead)::

    LEA Ri,[Ri,Ri<<3]
    LD Rd,[Rp,Ri]


    Are you sure?
    To me, it looks like 9 rather than 10.

    LD Rd,[Rp,Ri<<2]

    sorry.........

    LEA Ri,[Ri,Ri<<2]
    LD Rd,[Rp,Ri<<2]

    sorry again.......

  • From MitchAlsup1@21:1/5 to Michael S on Sun May 5 17:59:37 2024
    Michael S wrote:

    On Sat, 4 May 2024 21:08:19 +0000
    [email protected] (MitchAlsup1) wrote:


    Multiplication by 10 used to index an array is not slower than a
    multiplication by 16 (when the ISA is not brain dead)::

    LEA Ri,[Ri,Ri<<3]
    LD Rd,[Rp,Ri]


    Are you sure?
    To me, it looks like 9 rather than 10.


    LD Rd,[Rp,Ri<<2]

    sorry.........

  • From John Levine@21:1/5 to All on Sun May 5 18:57:14 2024
    According to Robert Swindells <[email protected]>:
    On Sat, 04 May 2024 09:11:27 GMT, Anton Ertl wrote:
    On a byte-addressed machine you can use some lower bits "for free" if
    the objects being addressed are always word-sized or larger. SPARC has
    specific instructions to make use of this.

    Only if you can count on them being aligned. On S/360 they required
    everything to be aligned, and one of the changes on S/370 was to allow
    arbitrary data alignment for data addresses. They quickly found that
    Fortran programs used COMMON and EQUIVALENCE to put 8-byte reals on 4-byte
    boundaries in strictly standard-conforming programs. Oops. The
    Fortran library caught the traps and fixed them up but with dreadful
    performance.

    If your storage management is disciplined enough that you know that
    everything is aligned on natural boundaries, this trick still works, but
    if you're going to have to mask out flag bits anyway, the argument for
    putting the flags in the low bits isn't as strong.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to BGB on Sun May 5 22:21:20 2024
    BGB wrote:

    On 5/5/2024 10:31 AM, Scott Lurndal wrote:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:


    Not as of yet in my case, but bitfield extract might happen eventually.
    Issue is finding a way to pull it off that is useful and cheaper than shift+mask (and probably adding some mechanism to pattern-match it from
    the AST or similar).

    But, but but but:: it IS shift and Mask !!

    Annoyingly, a good general case instruction could not be encoded in a
    32-bit instruction form at this point (could either add a few special
    cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
    rather than 3RI but this is lame...).

    Then again, say:
    BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
    Could potentially still be useful.

    SL Rd,Rc,<width:offset>

    Is a bit field extract instruction, it is also a smash instruction
    (smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
    purpose is needed)

    SR Rd,Rc,<width:offset>

    Positions the value in a register (Rc) such that it fits the alignment of
    a field.

    INS Rd,Rc,Rf,<width:offset>

    Inserts the field from Rf into its position <w:o> in Rc and delivers
    the new container to Rd.
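
    In shift-and-mask terms, a <width:offset> extract and insert come out
    roughly as below. This is a Python sketch of the general idea only:
    the function names and the 64-bit container are assumptions, not the
    My 66000 definitions.

```python
WORD_MASK = (1 << 64) - 1         # assumed 64-bit container


def extract_field(container: int, width: int, offset: int) -> int:
    """Return the width-bit field that starts at bit offset."""
    return (container >> offset) & ((1 << width) - 1)


def insert_field(container: int, field: int, width: int, offset: int) -> int:
    """Return container with the field at <width:offset> replaced."""
    mask = ((1 << width) - 1) << offset
    # Clear the old field, then drop in the new one (excess bits are cut off).
    return ((container & ~mask) | ((field << offset) & mask)) & WORD_MASK
```

    The extract is one shift and one mask; the insert needs the mask twice,
    which is the part that benefits most from being a single instruction.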

    Also, some things don't seem well balanced in terms of cost, so while it would be fairly cheap for a microcontroller, by the time one implements enough extensions to make it more useful for general purpose computing,
    it will no longer be cheap (while at the same time shooting itself in
    the foot in terms of performance for imposing some design constraints
    that *only* make sense for small microcontrollers).

    We can put 64 GBOoO CPUs on a single die and you worry about the shifter
    having a masker ?!?

    One big offender here, as I see it, is a few features in the Privileged
    ISA spec, such as:
    Separate register sets for each protection level/mode;

    While My 66000 has separate register files for every thread; each file
    is memory resident when not running. {At least conceptually}

    The comparably large number of CSRs;

    I have a 64-bit control register space and all CSRs are mapped into this
    space (along with all device control registers,...). {This space is
    entirely separate from the space that DRAM occupies.}

    Allowing operations on CSRs beyond just moving them to/from a GPR or
    similar;
    ....

    Things like the 'V' extension are also cause for concern.

    The 'M' extension isn't ideal, but I made it work in a way that "isn't
    too horribly expensive" (namely using a Shift-and-Add unit).



    Also the cost-scaling of the Shift-Add unit is such that it could
    potentially be extended to allow 128-bit integer multiply and divide,
    but debatable (there are only a few edge cases where this would likely
    be faster than "just do it in software").

    You are being misled as to what architecture is compared to what you can
    implement in your FPGA, and this is coloring your view of it.

    Well, and my ALUX extension can make for faster 128-bit ALU operations,
    but is debatable as the cost-delta mostly disappears in the noise
    (mostly because 128-bit ALU ops are rare).

    In My 66000's case, the CARRY instruction modifier provides access to
    multiprecision arithmetic--including exact FP arithmetic, which even
    gets the inexact bit set (clear, actually) correctly.

    Conversely, the code when built for RV64G omits 128-bit types entirely,

    What, exactly, did you expect from an Academic quality ISA ?????

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun May 5 22:25:46 2024
    Chris M. Thomasson wrote:

    On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 3:18 AM, Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel
    (and possibly some other places) some years ago. This was using the AMD64
    instruction set, but with only 32-bit pointers. This way, you got the
    benefit of the extra registers, without the overhead of the longer
    addresses.

    That was Donald Knuth's idea.

    Storing meta data in actual pointers, aka aligned on a larger
    boundary, is critical to many advanced lock/wait free algorithms as
    well. I remember storing an actual reference count in pointers before
    for a special type of counting.

    Even if one has multi-location ATOMICs ?? (as a single event ??)

    This was a technique for storing data in a pointer. For instance, for
    strong atomic reference counting we need to update a pointer _and_ a
    reference count together atomically. This can easily be done with DWCAS,
    or double width compare and swap. So, on a 32 bit system we need 64 bit
    cas, for a 64 bit system we need 128 bit cas. However, sometimes we can
    pack the reference count in the pointer value itself if it's aligned on
    a big enough boundary. Then we can update the pointer and the reference
    count using normal word based atomic RMW's.
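
    A minimal sketch of the packing in Python, assuming objects aligned on
    128-byte boundaries so that the low seven bits can hold the count. The
    fetch_add helper is only a single-threaded stand-in for an atomic
    fetch-and-add on one word; all names and the alignment are assumptions
    made for illustration.

```python
ALIGNMENT = 128                   # assumed object alignment
COUNT_MASK = ALIGNMENT - 1        # low 7 bits are free for the count


def pack(ptr: int, count: int) -> int:
    """Store a small count in the unused low bits of an aligned pointer."""
    if ptr & COUNT_MASK:
        raise ValueError("pointer is not aligned on the assumed boundary")
    if not 0 <= count <= COUNT_MASK:
        raise ValueError("count does not fit in the spare bits")
    return ptr | count


def unpack(word: int) -> tuple[int, int]:
    """Split a packed word back into (pointer, count)."""
    return word & ~COUNT_MASK, word & COUNT_MASK


def fetch_add(cell: list[int], delta: int) -> int:
    """Stand-in for an atomic fetch-and-add: returns the old word."""
    old = cell[0]
    cell[0] = old + delta
    return old
```

    One fetch_add(cell, 1) bumps the count and hands back the pointer in
    the same step. The catch is that the count must never overflow the
    spare bits, or the carry corrupts the pointer.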

    I understand why you had to pack the pointer and a chunk of data into a
    single container.

    What I don't understand is this: if you had easy access to multi-container
    ATOMICs, the packing would be unnecessary--would it not ?? That is, in one
    ATOMIC event you could update the pointer and the chunk of data
    independently and not NEED to store them in a single container.

  • From John Levine@21:1/5 to All on Mon May 6 01:21:35 2024
    According to Anton Ertl <[email protected]>:
    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level
    addressing.

    Many of the word-addressed machines of yesteryear had instructions for
    character manipulation (insert, extract), but that did not obviate any
    need for byte addressing.

    I believe that byte addressing which simultaneously allows larger
    words on power of two boundaries is one of those ideas that seems
    totally obvious now but was not at all at the time.

    Many of IBM's earlier machines like the 705 and 1620 and 1401 were
    character or digit addressable, and even had multi-character
    instructions that had to be aligned on a 5 digit boundary, but until
    the 360 nobody made the jump to see that you could address larger data
    in parallel that way.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Mon May 6 02:29:18 2024
    On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level addressing.

    Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just
    for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon May 6 00:50:35 2024
    Chris M. Thomasson wrote:

    On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 3:18 AM, Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel
    (and possibly some other places) some years ago. This was using the AMD64
    instruction set, but with only 32-bit pointers. This way, you got the
    benefit of the extra registers, without the overhead of the longer
    addresses.

    That was Donald Knuth's idea.

    Storing meta data in actual pointers, aka aligned on a larger
    boundary, is critical to many advanced lock/wait free algorithms as
    well. I remember storing an actual reference count in pointers
    before for a special type of counting.

    Even if one has multi-location ATOMICs ?? (as a single event ??)

    This was a technique for storing data in a pointer. For instance,
    for strong atomic reference counting we need to update a pointer _and_ a
    reference count together atomically. This can easily be done with DWCAS,
    or double width compare and swap. So, on a 32 bit system we need 64 bit
    cas, for a 64 bit system we need 128 bit cas. However, sometimes we
    can pack the reference count in the pointer value itself if it's
    aligned on a big enough boundary. Then we can update the pointer and
    the reference count using normal word based atomic RMW's.

    I understand why you had to pack the pointer and a chunk of data into a
    single container.

    What I don't understand is this: if you had easy access to multi-container
    ATOMICs, the packing would be unnecessary--would it not ?? That is, in one
    ATOMIC event you could update the pointer and the chunk of data
    independently and not NEED to store them in a single container.

    Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
    enough to increment the counter and load a pointer atomically all in one shot, loopless. Why would I need to use multi atomics with a possible
    loop to do that?

    Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits in
    total. Further postulate that you need to update both in a single
    non-blocking ATOMIC event. ...

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Mon May 6 02:52:19 2024
    On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

    Say, RISC-V:
    Says yes to DIV and MOD;
    Says yes to 4-register floating-point multiply-accumulate;
    Says no to register-indexed Load/Store.
    Me: This is not a good balance...

    Multiply-accumulate is at least as much about reducing rounding error as
    about speed.
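
    A small illustration of the rounding point, in Python. The fused
    version here is emulated with exact rational arithmetic followed by a
    single rounding; it is a sketch of the semantics, not a call to a
    hardware FMA, and the function names are my own.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """a*b + c computed exactly, then rounded once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """a*b rounded to a double, then + c rounded again."""
    return a * b + c


# (1 + 2**-30) * (1 - 2**-30) is exactly 1 - 2**-60, which rounds to 1.0.
A = 1.0 + 2.0 ** -30
B = 1.0 - 2.0 ** -30
C = -1.0
```

    With these inputs the separate multiply rounds the product to 1.0 and
    the sum comes out as zero, while the fused form keeps the -2**-60 that
    the intermediate rounding threw away.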

  • From John Levine@21:1/5 to All on Mon May 6 02:54:11 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level
    addressing.

    Even if those bottom three bits of the address must be zero in every other
    instruction but these, I thought it would be convenient to have them, just
    for these bitfield instructions. It would save passing around a separate
    bit-offset field in arbitrary-bit-aligned pointers.

    The only significant application for bit addressing that anyone has
    mentioned is data compression. It's not something that computers spend
    a great deal of time doing, and I see no reason to believe that bit
    addressing would make it much faster than the way it's done now with
    shifting and masking.

    If you do want to make compression faster, it'd make more sense to add
    instructions to do the compressing you care about, like DFLTCC in
    S/360 and zSeries that speed up gzip, rather than adding three bits to
    the other 99% of instructions that don't use bit fields.

    If you think otherwise, what are the applications that will make all
    those address bits useful, and why do you think bit addressing will be
    faster than shifting and masking? There's still going to be memory
    underneath that's byte or word addressed so the shifting and masking
    is going to happen anyway.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Mon May 6 02:30:42 2024
    On Sun, 05 May 2024 15:31:04 GMT, Scott Lurndal wrote:

    I would, personally, categorize RISC-V as a niche architecture at this
    time.

    I think it’s already shipping in the billions of units per year--enough to make it the world’s second-most-popular CPU architecture, after ARM.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Mon May 6 02:34:48 2024
    On Sun, 05 May 2024 11:20:02 -0600, John Savard wrote:

    If you have decimal arithmetic, there's a direct connection between how numbers are represented for reading and writing, and how they are
    represented for internal arithmetic.

    It is easier to do addition/subtraction if you start from the least
    significant end and propagate the carry/borrow along.

    I believe those early IBM character machines worked exactly this way.
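
    As a rough illustration of the carry-propagation point, here is a minimal
    C sketch that adds two multi-byte integers stored least-significant byte
    first. The function name add_le is made up for this example, and it says
    nothing about how any particular IBM machine did it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-byte unsigned integers stored least-significant byte first.
 * Returns the carry out of the most significant byte (0 or 1). */
static unsigned add_le(uint8_t *sum, const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned carry = 0;
    assert(sum != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {   /* ascending address = ascending significance */
        unsigned t = (unsigned)a[i] + (unsigned)b[i] + carry;
        sum[i] = (uint8_t)(t & 0xFF);
        carry = t >> 8;                /* carry feeds the next, more significant byte */
    }
    return carry;
}
```

    With this layout the loop walks memory in increasing address order while
    the carry moves in the same direction.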

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon May 6 02:30:02 2024
    On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:

    PEXT/PDEP has no immediate form, which makes it inconvenient for
    'C'-style fixed bit fields.

    Fixed bit fields are a limitation of the C language. Why should it
    constrain the design of machine architectures?

  • From Terje Mathisen@21:1/5 to All on Mon May 6 08:13:16 2024
    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 3:18 AM, Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel (and
    possibly some other places) some years ago. This was using the AMD64
    instruction set, but with only 32-bit pointers. This way, you got the
    benefit of the extra registers, without the overhead of the longer
    addresses.

    That was Donald Knuth's idea.

    Storing meta data in actual pointers, aka aligned on a larger
    boundary, is critical to many advanced lock/wait free algorithms
    as well. I remember storing an actual reference count in pointers
    before for a special type of counting.

    Even if one has multi-location ATOMICs ?? (as a single event ??)

    This was a technique for storing data in a pointer. For instance, for
    strong atomic reference counting we need to update a pointer _and_ a
    reference count together atomically. This can easily be done with DWCAS,
    or double width compare and swap. So, on a 32 bit system we need 64
    bit cas, for a 64 bit system we need 128 bit cas. However, sometimes
    we can pack the reference count in the pointer value itself if it's
    aligned on a big enough boundary. Then we can update the pointer and
    the reference count using normal word based atomic RMW's.
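
    A minimal C11 sketch of the packing idea, assuming objects are aligned
    to 64 bytes so the low 6 bits of the pointer are free for a count. The
    names COUNT_BITS, packed_store and packed_acquire are invented for the
    example, and a real scheme also needs a release path and overflow
    handling that are not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define COUNT_BITS 6u                                   /* assumes 64-byte alignment */
#define COUNT_MASK (((uintptr_t)1 << COUNT_BITS) - 1)

/* Publish an aligned pointer with a zero count in its low bits. */
static void packed_store(atomic_uintptr_t *slot, void *p)
{
    assert(((uintptr_t)p & COUNT_MASK) == 0);
    atomic_store(slot, (uintptr_t)p);
}

/* One fetch-and-add bumps the count and returns the pointer that was
 * current at that instant; no compare-and-swap loop is involved. */
static void *packed_acquire(atomic_uintptr_t *slot, unsigned *count_before)
{
    uintptr_t old = atomic_fetch_add(slot, 1);
    assert((old & COUNT_MASK) != COUNT_MASK);   /* count must not spill into the pointer */
    *count_before = (unsigned)(old & COUNT_MASK);
    return (void *)(old & ~COUNT_MASK);
}
```

    On x86-64 a compiler would typically turn the fetch-and-add into a
    LOCK XADD, which is the loopless operation mentioned in this thread.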

    I understand why you had to pack the pointer and a chunk of data into a
    single container.

    What I don't understand is if you had easy access to multi-container
    ATOMICs the packing would be unnecessary--would it not ?? That is in one
    ATOMIC event you could update the pointer and the chunk of data
    independently and not NEED to store them in a single container.

    Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
    enough to increment the counter and load a pointer atomically all in
    one shot, loopless. Why would I need to use multi atomics with a
    possible loop to do that?

    Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits in
    total. Further postulate that you need to update both in a single
    non-blocking ATOMIC event. ...

    "Any programming problem can be solved with an additional layer of indirection", so in this case you create a handle to that 72-bit item,
    and require all access to go via the handle?
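
    One way to read the handle suggestion, sketched in C11 under the
    assumption that readers only ever reach the 72-bit item through an
    atomic pointer to an immutable record. The names item72 and item_update
    are hypothetical, and safe reclamation of the old record (hazard
    pointers, epochs, and so on) is deliberately left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct item72 {        /* 64-bit pointer + 8-bit chunk on a 64-bit target */
    void   *ptr;
    uint8_t chunk;
};

/* Replace both fields at once by swinging the handle to a fresh record.
 * Returns the previous record; the caller must reclaim it safely. */
static struct item72 *item_update(_Atomic(struct item72 *) *handle,
                                  void *ptr, uint8_t chunk)
{
    struct item72 *fresh = malloc(sizeof *fresh);
    assert(fresh != NULL);
    fresh->ptr = ptr;
    fresh->chunk = chunk;
    return atomic_exchange(handle, fresh);   /* single word-sized atomic */
}
```

    A reader does one atomic load of the handle and then sees a consistent
    pointer/chunk pair, at the price of the extra indirection.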

    The addendum to the rule above is of course ", except the problem of too
    many layers of indirections". :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to John Levine on Mon May 6 14:07:48 2024
    John Levine <[email protected]> writes:
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level
    addressing.

    Even if those bottom three bits of the address must be zero in every other
    instruction but these, I thought it would be convenient to have them, just
    for these bitfield instructions. It would save passing around a separate
    bit-offset field in arbitrary-bit-aligned pointers.

    The only significant application for bit addressing that anyone has
    mentioned is data compression. It's not something that computers spend
    a great deal of time doing, and I see no reason to believe that bit
    addressing would make it much faster than the way it's done now with
    shifting and masking.

    We've one application that uses bit insertion
    and extraction extensively (an SoC simulator) when dealing
    both with emulation of the ARMv7 and ARMv8 instruction sets
    as well as hardware accelerator block CSRs.

    But as you note below, hardware support for crypto and
    compression operations is generally superior.

  • From John Savard@21:1/5 to [email protected] on Mon May 6 09:56:03 2024
    On Mon, 6 May 2024 02:34:48 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    It is easier to do addition/subtraction if you start from the least
    significant end and propagate the carry/borrow along.

    Of course, but so what? That just determines in which direction your
    ALU is wired. It is true that this is the reason why many machines
    were little-endian when their word size was smaller than the size of
    the integers on which they would do arithmetic.

    But we no longer have this problem.

    John Savard

  • From EricP@21:1/5 to All on Mon May 6 11:26:46 2024
    MitchAlsup1 wrote:
    BGB wrote:

    On 5/5/2024 10:31 AM, Scott Lurndal wrote:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:


    Not as of yet in my case, but bitfield extract might happen eventually.
    Issue is finding a way to pull it off that is useful and cheaper than
    shift+mask (and probably adding some mechanism to pattern-match it
    from the AST or similar).

    But, but but but:: it IS shift and Mask !!

    Annoyingly, a good general case instruction could not be encoded in a
    32-bit instruction form at this point (could either add a few special
    cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
    rather than 3RI but this is lame...).

    Then again, say:
    BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
    Could potentially still be useful.
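
    For reference, the BITEXTR formula above transcribes directly into C.
    This is only a sketch of the stated semantics; the function name bitextr
    is made up, and the Imm10 split (low 6 bits shift, next 4 bits width) is
    read straight off the formula.

```c
#include <assert.h>
#include <stdint.h>

/* Imm10 layout per the formula: bits 5..0 = shift amount, bits 9..6 = width. */
static uint64_t bitextr(uint64_t rn, unsigned imm10)
{
    assert(imm10 < 1024);
    unsigned shift = imm10 & 63;
    unsigned width = (imm10 >> 6) & 15;   /* only 0..15 bits are expressible */
    return (rn >> shift) & ((UINT64_C(1) << width) - 1);
}
```

    Note that a 4-bit width field caps the extracted field at 15 bits, which
    is the price of fitting both numbers into a 10-bit immediate.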

    SL Rd,Rc,<width:offset>

    Is a bit field extract instruction, it is also a smash instruction
    (smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever purpose is needed)

    SR Rd,Rc,<width:offset>

    Positions the value in a register (Rc) such that it fits the alignment of
    a field.

    INS Rd,Rc,Rf,<width:offset>

    Inserts the field from Rf into its position <w:o> in Rc, inserts the
    field and delivers the new container to Rd.

    I think my instruction set could accomplish pretty much the same
    efficiency for bit field operations as bit addresses but without
    requiring direct bit addressing.

    An issue that comes up is when the in-memory bit field is > 56 bits wide
    as it might straddle two 64-bit words. If width is <= 56 bits then
    a load from a byte address handles most of the shifting and the
    rest can be handled within a single register.

    But if the in-memory bit field is > 56 bits wide it may or may not straddle
    a single 64-bit memory location, and require a pair of registers to be loaded.

    I added an optional second dest register field to my ISA to allow operations like wide bit field extract and insert across a pair of registers.
    Also for wide arithmetic.

    I was thinking of variable length LDV and STV load & store instructions
    to work with variable length byte fields from 1 to 16 bytes.

    LDV has two dst registers, a normal byte address specifier,
    and a byte count from 1 to 16 to load. All high order bytes
    not written by the LDV are zero filled.
    The byte count can be an immediate or in a register.

    STV does the same for stores with a pair of source value registers.

    LDV and STV only touch the memory bytes they actually load or store.
    So if the actual address + byte count does not touch a second 64-bit
    memory word then they don't touch the next cache line or next page
    in the case of potential page straddles.

    This allows code to LDV up to 16 bytes into a register pair
    extract and insert up to 64-bit fields in that register pair,
    then STV only the bytes operated on,
    with HW taking care of the special cases of straddle/not-straddle.
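
    To make the proposed LDV/STV behaviour concrete, here is a small C
    emulation of the description above. It is a sketch under assumptions:
    the struct regpair and the little-endian byte order are choices made for
    the example, not part of the proposal as stated.

```c
#include <assert.h>
#include <stdint.h>

struct regpair { uint64_t lo, hi; };   /* stand-in for the two registers */

/* Emulate LDV: load `count` (1..16) bytes, zero-fill everything else.
 * Only bytes 0..count-1 at `addr` are read. */
static struct regpair ldv(const uint8_t *addr, unsigned count)
{
    struct regpair r = {0, 0};
    assert(count >= 1 && count <= 16);
    for (unsigned i = 0; i < count; i++) {
        uint64_t byte = addr[i];
        if (i < 8) r.lo |= byte << (8 * i);
        else       r.hi |= byte << (8 * (i - 8));
    }
    return r;
}

/* Emulate STV: write back only the low `count` bytes of the pair. */
static void stv(uint8_t *addr, struct regpair r, unsigned count)
{
    assert(count >= 1 && count <= 16);
    for (unsigned i = 0; i < count; i++) {
        uint64_t w = (i < 8) ? r.lo : r.hi;
        addr[i] = (uint8_t)(w >> (8 * (i & 7)));
    }
}
```

    Code would call ldv, do its extracts and inserts on the pair, then call
    stv with the same count so no byte outside the field's span is touched.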

  • From John Savard@21:1/5 to All on Mon May 6 11:15:39 2024
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
    wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? There's still going to be memory
    underneath that's byte or word addressed so the shifting and masking
    is going to happen anyway.

    Shifting, in a sense, yes. But not necessarily masking.

    So just because a processor has a 64-bit bus to memory doesn't mean it
    has to implement fetching a single byte from memory by doing a shift
    and mask operation in a 64-bit register. Instead, each byte of the bus
    could have a direct wired path to the low 8-bits of the internal data
    bus feeding the registers.

    With bit addressing, of course, an implementation involving shifting
    and masking is more likely, but even then, one omits fetching and
    decoding the instructions to shift and mask, which is a speed gain
    right there.

    John Savard

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Mon May 6 14:08:39 2024
    Lawrence D'Oliveiro wrote:
    On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level
    addressing.

    Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.

    It's not just the bit address that you have to carry about
    but also field width and type (zero/sign extend) on extract.
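
    A sketch of what such a descriptor might carry, in C. The struct
    bitfield_ref and the function bitfield_get are hypothetical names; the
    point is only that offset, width and signedness all travel together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor: everything a generic extract needs to know. */
struct bitfield_ref {
    unsigned offset;      /* bit position of the field's LSB, 0..63 */
    unsigned width;       /* field width in bits, 1..64 */
    bool     is_signed;   /* sign-extend (true) or zero-extend (false) */
};

static int64_t bitfield_get(uint64_t word, struct bitfield_ref f)
{
    assert(f.width >= 1 && f.width <= 64 && f.offset + f.width <= 64);
    uint64_t v = word >> f.offset;
    if (f.width < 64) {
        v &= (UINT64_C(1) << f.width) - 1;            /* zero-extend */
        if (f.is_signed) {
            uint64_t sign = UINT64_C(1) << (f.width - 1);
            v = (v ^ sign) - sign;                    /* then sign-extend */
        }
    }
    return (int64_t)v;
}
```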

    To my eye the cost of bit fields is primarily in dealing at run time
    with the potential for straddles across memory locations and registers.
    It makes for a lot of fiddly little IF code blocks which then have to be
    put into general subroutines.

    A second issue, when there are multiple bit fields, is optimizing
    this so it only loads from and stores to memory when it has to.
    If r1 contains a low straddle part and r2 the high straddle part,
    and we have already updated one bit field in those parts,
    if we want to update a second bit field,
    then we need to check if it is wholly contained within those
    two registers, or one or both need to be spilled and reloaded.

    A lot of this fiddly code looks like it would be best
    implemented with predication.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon May 6 19:10:43 2024
    Lawrence D'Oliveiro wrote:

    On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:

    PEXT/PDEP has no immediate form, which makes it inconvenient for
    'C'-style fixed bit fields.

    Fixed bit fields are a limitation of the C language. Why should it
    constrain the design of machine architectures?

    The only bearing C bit-fields have on extract and insert is the need
    for constants that specify the field.

  • From MitchAlsup1@21:1/5 to John Levine on Mon May 6 19:13:51 2024
    John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:
    On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level
    addressing.

    Even if those bottom three bits of the address must be zero in every other
    instruction but these, I thought it would be convenient to have them, just
    for these bitfield instructions. It would save passing around a separate
    bit-offset field in arbitrary-bit-aligned pointers.

    The only significant application for bit addressing that anyone has
    mentioned is data compression. It's not something that computers spend
    a great deal of time doing, and I see no reason to believe that bit addressing would make it much faster than the way it's done now with
    shifting and masking.

    If you do want to make compression faster, it'd make more sense to add
    instructions to do the compressing you care about, like DFLTCC in
    S/360 and zSeries that speed up gzip, rather than adding three bits to
    the other 99% of instructions that don't use bit fields.

    If you think otherwise, what are the applications that will make all
    those address bits useful, and why do you think bit addressing will be
    faster than shifting and masking? There's still going to be memory
    underneath that's byte or word addressed so the shifting and masking
    is going to happen anyway.

    Placing bit-field access INSIDE LDs and STs requires adding 2 stages
    of multiplexing in the LD/ST aligners (memory shifters). This has the
    potential to slow the overall pipeline frequency--at which point you
    have lost more than you can gain.

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon May 6 19:15:35 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 3:18 AM, Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel (and
    possibly some other places) some years ago. This was using the AMD64
    instruction set, but with only 32-bit pointers. This way, you got the
    benefit of the extra registers, without the overhead of the longer
    addresses.

    That was Donald Knuth's idea.

    Storing meta data in actual pointers, aka aligned on a larger
    boundary, is critical to many advanced lock/wait free algorithms
    as well. I remember storing an actual reference count in pointers
    before for a special type of counting.

    Even if one has multi-location ATOMICs ?? (as a single event ??)

    This was a technique for storing data in a pointer. For instance, for
    strong atomic reference counting we need to update a pointer _and_ a
    reference count together atomically. This can easily be done with DWCAS,
    or double width compare and swap. So, on a 32 bit system we need 64
    bit cas, for a 64 bit system we need 128 bit cas. However, sometimes
    we can pack the reference count in the pointer value itself if it's
    aligned on a big enough boundary. Then we can update the pointer and
    the reference count using normal word based atomic RMW's.

    I understand why you had to pack the pointer and a chunk of data into a
    single container.

    What I don't understand is if you had easy access to multi-container
    ATOMICs the packing would be unnecessary--would it not ?? That is in one
    ATOMIC event you could update the pointer and the chunk of data
    independently and not NEED to store them in a single container.

    Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
    enough to increment the counter and load a pointer atomically all in
    one shot, loopless. Why would I need to use multi atomics with a
    possible loop to do that?

    Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits in total.
    Further postulate that you need to update both in a single non-blocking
    ATOMIC event. ...

    "Any programming problem can be solved with an additional layer of indirection", so in this case you create a handle to that 72-bit item,
    and require all access to go via the handle?

    I am not trying to add an additional layer of indirection, I am trying (unsuccessfully it appears) to get Chris to think outside of the one
    container ATOMIC box.

    The addendum to the rule above is of course ", except the problem of too
    many layers of indirections". :-)

    Terje

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon May 6 19:11:22 2024
    Lawrence D'Oliveiro wrote:

    On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

    Say, RISC-V:
    Says yes to DIV and MOD;
    Says yes to 4-register floating-point multiply-accumulate;
    Says no to register-indexed Load/Store.
    Me: This is not a good balance...

    Multiply-accumulate is at least as much about reducing rounding error as about speed.

    It is also an IEEE 754-2008+ requirement.

  • From MitchAlsup1@21:1/5 to BGB on Mon May 6 19:26:16 2024
    BGB wrote:

    On 5/5/2024 9:30 PM, Lawrence D'Oliveiro wrote:
    On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:

    PEXT/PDEP has no immediate form, which makes it inconvenient for
    'C'-style fixed bit fields.

    Fixed bit fields are a limitation of the C language. Why should it
    constrain the design of machine architectures?

    If it lacks an immediate form, one is harder pressed to beat out
    shift+and or shift+shift on the performance front...

    Though, to be useful, it needs an immediate large enough to express both
    the shift amount and the width of the bitfield, and also a 3RI encoding.

    My 66000 has 12-bits of immediate for shifts, and a slot in the 3-operand instruction group.

    Bitfield insert would be a little easier to get a performance advantage
    from (vs bitfield extract), since insertion is a more complex operation,
    but it is also likely to require a more complex implementation and is
    also less common than bitfield extract.

    Without SR <w:o>; one needs two shifts and a container sized mask

    SR Rt,Rc,#64-11 // get rid of excess significance
    SL Rt,Rt,#64-11-12 // position field to container
    AND Rk,Rk,#0x0007FF0000 // EMPTY field in Kontainer
    OR Rk,Rk,Rt // insert field

    With:

    SR Rt,Rc,<11:12>
    AND Rk,Rk,#0x0007FF0000 // EMPTY field in Kontainer
    OR Rk,Rk,Rt // insert field

    With insert::

    INS Rk,Rk,Rc,<11:12>
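
    For comparison, the same insert written as generic C, as a sketch only.
    The function name bits_insert is made up; width and offset correspond to
    the <w:o> pair, so the <11:12> case is width 11 at bit offset 12.

```c
#include <assert.h>
#include <stdint.h>

/* Insert the low `width` bits of `field` into `container` at bit `offset`,
 * returning the new container. */
static uint64_t bits_insert(uint64_t container, uint64_t field,
                            unsigned width, unsigned offset)
{
    assert(width >= 1 && width <= 64 && offset < 64 && width + offset <= 64);
    uint64_t ones = (width == 64) ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
    uint64_t mask = ones << offset;                 /* 1s over the field */
    return (container & ~mask) | ((field << offset) & mask);
}
```

    The C version spells out the mask, the clear, the shift and the merge
    that a single INS would fold together.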

    ....

  • From MitchAlsup1@21:1/5 to EricP on Mon May 6 19:31:07 2024
    EricP wrote:

    MitchAlsup1 wrote:
    BGB wrote:

    On 5/5/2024 10:31 AM, Scott Lurndal wrote:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:


    Not as of yet in my case, but bitfield extract might happen eventually.
    Issue is finding a way to pull it off that is useful and cheaper than
    shift+mask (and probably adding some mechanism to pattern-match it
    from the AST or similar).

    But, but but but:: it IS shift and Mask !!

    Annoyingly, a good general case instruction could not be encoded in a
    32-bit instruction form at this point (could either add a few special
    cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
    rather than 3RI but this is lame...).

    Then again, say:
    BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
    Could potentially still be useful.

    SL Rd,Rc,<width:offset>

    Is a bit field extract instruction, it is also a smash instruction
    (smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
    purpose is needed)

    SR Rd,Rc,<width:offset>

    Positions the value in a register (Rc) such that it fits the alignment of
    a field.

    INS Rd,Rc,Rf,<width:offset>

    Inserts the field from Rf into its position <w:o> in Rc, inserts the
    field and delivers the new container to Rd.

    I think my instruction set could accomplish pretty much the same
    efficiency for bit field operations as bit addresses but without
    requiring direct bit addressing.

    An issue that comes up is when the in-memory bit field is > 56 bits wide
    as it might straddle two 64-bit words. If width is <= 56 bits then
    a load from a byte address handles most of the shifting and the
    rest can be handled within a single register.

    This is what CARRY is for--access to 128-bit in 2×64-bit out shifts.
    CARRY can be used for extracts and for inserts.

    But if the in-memory bit field is > 56 bits wide it may or may not straddle
    a single 64-bit memory location, and require a pair of registers to be loaded.

    I don't understand 56--56 takes just as many bits to encode as 63 ?!?

    I added an optional second dest register field to my ISA to allow operations like wide bit field extract and insert across a pair of registers.
    Also for wide arithmetic.

    I was thinking of variable length LDV and STV load & store instructions
    to work with variable length byte fields from 1 to 16 bytes.

    32 gives you access to an arithmetic space where you can calculate
    world GDP in the least valuable currency world-wide not lose a cent
    on the bottom end and not overflow on the top by 20-odd bits.

    LDV has two dst registers, a normal byte address specifier,
    and a byte count from 1 to 16 to load. All high order bytes
    not written by the LDV are zero filled.
    The byte count can be an immediate or in a register.

    STV does the same for stores with a pair of source value registers.

    LDV and STV only touch the memory bytes they actually load or store.
    So if the actual address + byte count does not touch a second 64-bit
    memory word then they don't touch the next cache line or next page
    in the case of potential page straddles.

    This allows code to LDV up to 16 bytes into a register pair
    extract and insert up to 64-bit fields in that register pair,
    then STV only the bytes operated on,
    with HW taking care of the special cases of straddle/not-straddle.

  • From MitchAlsup1@21:1/5 to John Savard on Mon May 6 19:34:47 2024
    John Savard wrote:

    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
    wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? There's still going to be memory
    underneath that's byte or word addressed so the shifting and masking
    is going to happen anyway.

    Shifting, in a sense, yes. But not necessarily masking.

    So just because a processor has a 64-bit bus to memory doesn't mean it

    Why so narrow ??

    has to implement fetching a single byte from memory by doing a shift
    and mask operation in a 64-bit register.

    Not on a 64-bit register, but a 64-bit (or 128-bit) flip-flop.

    Instead, each byte of the bus
    could have a direct wired path to the low 8-bits of the internal data
    bus feeding the registers.

    How is that NOT a shifter ???

    Remember people, accessing smaller than cache port width REQUIRES
    shifting. We often call them Aligners, but the logic is that of
    a shifter.
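
    In software terms the aligner looks like this; a C sketch, assuming a
    64-bit datum and little-endian lane numbering. The function names
    load_byte_lane and load_bits are invented for the illustration and do
    not model any particular cache port.

```c
#include <assert.h>
#include <stdint.h>

/* Select one byte lane from a 64-bit datum fetched from an 8-byte-aligned
 * address: the low 3 address bits choose the shift amount. */
static uint8_t load_byte_lane(uint64_t line, uint64_t byte_addr)
{
    unsigned lane = (unsigned)(byte_addr & 7);
    return (uint8_t)(line >> (8 * lane));      /* the "aligner" is this shift */
}

/* With bit addressing, three more address bits feed the same shifter. */
static uint64_t load_bits(uint64_t line, uint64_t bit_addr, unsigned width)
{
    unsigned shift = (unsigned)(bit_addr & 63);
    assert(width >= 1 && width <= 64 && shift + width <= 64);
    uint64_t v = line >> shift;
    return (width == 64) ? v : (v & ((UINT64_C(1) << width) - 1));
}
```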

    With bit addressing, of course, an implementation involving shifting
    and masking is more likely, but even then, one omits fetching and
    decoding the instructions to shift and mask, which is a speed gain
    right there.

    Bit addressing only makes the shifter deeper, not wider.

    John Savard

  • From MitchAlsup1@21:1/5 to EricP on Mon May 6 19:39:55 2024
    EricP wrote:

    Lawrence D'Oliveiro wrote:
    On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

    d) all modern major architectures have instructions for bitfield
    manipulation (insert, extract) obviating any need for general bit-level
    addressing.

    Even if those bottom three bits of the address must be zero in every other
    instruction but these, I thought it would be convenient to have them, just
    for these bitfield instructions. It would save passing around a separate
    bit-offset field in arbitrary-bit-aligned pointers.

    It's not just the bit address that you have to carry about
    but also field width and type (zero/sign extend) on extract.

    No different from signed/unsigned bytes, halfwords, and words.

    To my eye the cost of bit fields is primarily in dealing at run time
    with the potential for straddles across memory locations and registers.
    It makes for a lot of fiddly little IF code blocks which then have to be
    put into general subroutines.

    In My 66000 ISA, one can use CARRY to concatenate 2 registers into
    1 container and then extract or insert into the double wide container
    EVEN when there is no straddling of boundaries! This gets rid of a
    lot of the fiddling.

    A second issue, when there are multiple bit fields, is optimizing
    this so it only loads from and stores to memory when it has to.
    If r1 contains a low straddle part and r2 the high straddle part,
    and we have already updated one bit field in those parts,
    if we want to update a second bit field,
    then we need to check if it is wholly contained within those
    two registers, or one or both need to be spilled and reloaded.

    Obviously.

    A lot of this fiddly code looks like it would be best
    implemented with predication.


    ...

  • From Terje Mathisen@21:1/5 to EricP on Mon May 6 21:53:42 2024
    EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    On 5/5/2024 10:31 AM, Scott Lurndal wrote:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:


    Not as of yet in my case, but bitfield extract might happen eventually.
    Issue is finding a way to pull it off that is useful and cheaper than
    shift+mask (and probably adding some mechanism to pattern-match it
    from the AST or similar).

    But, but but but:: it IS shift and Mask !!

    Annoyingly, a good general case instruction could not be encoded in a
    32-bit instruction form at this point (could either add a few special
    cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
    rather than 3RI but this is lame...).

    Then again, say:
       BITEXTR  Imm10, Rn  //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
    Could potentially still be useful.

        SL    Rd,Rc,<width:offset>

    Is a bit field extract instruction, it is also a smash instruction
    (smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
    purpose is needed)

        SR    Rd,Rc,<width:offset>

    Positions the value in a register (Rc) such that it fits the alignment of
    a field.

        INS   Rd,Rc,Rf,<width:offset>

    Inserts the field from Rf into its position <w:o> in Rc, inserts the
    field and delivers the new container to Rd.

    I think my instruction set could accomplish pretty much the same
    efficiency for bit field operations as bit addresses but without
    requiring direct bit addressing.

    An issue that comes up is when the in-memory bit field is > 56 bits wide
    as it might straddle two 64-bit words. If width is <= 56 bits then
    a load from a byte address handles most of the shifting and the
    rest can be handled within a single register.

    But if the in-memory bit field is > 56 bits wide it may or may not straddle
    a single 64-bit memory location, and require a pair of registers to be loaded.

    x86 does not have bitfield insert/extract, but it does have SHRD/SHLD so
    it is fairly easy to handle arbitrary length (<= 64 bits) and alignment:

    ; RSI -> target, CL = bit offset of the field, RBX = 64-field size (0..63)
    mov rax,[rsi]
    mov rdx,[rsi+8]

    shrd rax,rdx,cl ; bit offset

    and rax,bitmask[rbx*8] ; 64 mask entries.

    The last instruction can also be replaced with

    shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
    shrx rax,rax,rbx

    or the entire thing can be replaced with this one which calculates the
    mask on the fly:

    mov rax,[rsi]
    mov rdx,[rsi+8]
    or rdi,-1 ; Generate mask

    shrd rax,rdx,cl ; bit offset
    shrx rdi,rdi,rbx ; excess bits to mask away

    and rax,rdi

    All seems like about 3 clock cycles when hitting the cache.
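
    The same double-word extract can be sketched in portable C. This is an
    illustration, not a translation a compiler is guaranteed to turn back
    into SHRD; the function name extract_straddle is made up, and the two
    words are passed as an array rather than read through a pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits (1..64) starting `bit_offset` bits (0..63) into the
 * 128-bit window formed by two consecutive 64-bit words, low word first. */
static uint64_t extract_straddle(const uint64_t words[2],
                                 unsigned bit_offset, unsigned width)
{
    assert(bit_offset < 64 && width >= 1 && width <= 64);
    uint64_t v = words[0] >> bit_offset;
    if (bit_offset != 0)                      /* a 64-bit shift by 64 is undefined in C */
        v |= words[1] << (64 - bit_offset);
    unsigned excess = 64 - width;             /* plays the role of RBX above */
    return (v << excess) >> excess;           /* drop the bits above the field */
}
```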

    Terje


    I added an optional second dest register field to my ISA to allow
    operations
    like wide bit field extract and insert across a pair of registers.
    Also for wide arithmetic.

    I was thinking of variable length LDV and STV load & store instructions
    to work with variable length byte fields from 1 to 16 bytes.

    LDV has two dst registers, a normal byte address specifier,
    and a byte count from 1 to 16 to load. All high order bytes
    not written by the LDV are zero filled.
    The byte count can be an immediate or in a register.

    STV does the same for stores with a pair of source value registers.

    LDV and STV only touch the memory bytes they actually load or store.
    So if the actual address + byte count does not touch a second 64-bit
    memory word then they don't touch the next cache line or next page
    in the case of potential page straddles.

    This allows code to LDV up to 16 bytes into a register pair
    extract and insert up to 64-bit fields in that register pair,
    then STV only the bytes operated on,
    with HW taking care of the special cases of straddle/not-straddle.
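
    To make the LDV/STV semantics above concrete, here is a rough software
    model in C. It is only a sketch: the names reg_pair, ldv and stv are
    made up for illustration, and little-endian byte order within the
    register pair is an assumption, not something stated in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register pair: lo holds bytes 0..7, hi holds bytes 8..15. */
typedef struct { uint64_t lo, hi; } reg_pair;

/* Model of LDV: read exactly `count` bytes (1..16) starting at p.
 * Bytes not loaded are zero filled; no byte beyond p[count-1] is touched.
 * Little-endian placement is an assumption of this sketch. */
static inline reg_pair ldv(const unsigned char *p, size_t count)
{
    assert(count >= 1 && count <= 16);
    reg_pair r = {0, 0};
    for (size_t i = 0; i < count; i++) {
        uint64_t b = p[i];
        if (i < 8)
            r.lo |= b << (8 * i);
        else
            r.hi |= b << (8 * (i - 8));
    }
    return r;
}

/* Model of STV: write back only the `count` low-order bytes of the pair. */
static inline void stv(unsigned char *p, reg_pair r, size_t count)
{
    assert(count >= 1 && count <= 16);
    for (size_t i = 0; i < count; i++) {
        uint64_t w = (i < 8) ? r.lo : r.hi;
        p[i] = (unsigned char)(w >> (8 * (i % 8)));
    }
}
```

    The point of the byte count is visible in the loops: a field that ends
    inside the first word never causes an access to the second one, which
    is what avoids gratuitous cache line or page touches.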





    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon May 6 21:08:14 2024
    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
    wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't mean it
    has to implement fetching a single byte from memory by doing a shift
    and mask operation in a 64-bit register. Instead, each byte of the bus
    could have a direct wired path to the low 8-bits of the internal data
    bus feeding the registers.

    I was more thinking about storing bit fields, where you probably have
    to fetch the whole word or cache line or whatever, shift the new field
    into it, and then store it back. You already have to do something like
    that for byte stores but bit addressing makes it 8 times as hairy.
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Scott Lurndal@21:1/5 to BGB on Mon May 6 22:34:23 2024
    BGB <[email protected]> writes:
    On 5/5/2024 12:20 PM, John Savard wrote:
    On Wed, 1 May 2024 00:09:28 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:


    Also, though, for ease of conversion, the order of BCD digits _should
    be the same as the order of the characters of which these digits are
    the last four bits_ in the representation of a decimal number as a
    character string.

    And that means big-endian.

    If you have decimal arithmetic, there's a direct connection between
    how numbers are represented for reading and writing, and how they are
    represented for internal arithmetic.



    Why would one burn 8 bits per BCD digit?...

    When processing numeric character data. The B3500 did that
    natively - the address controller on each operand selected
    the format of the operand (4-bit signed, 4-bit unsigned, 8-bit unsigned);
    in 8-bit forms, the processor ignored the most significant digit
    of the byte (ascii 0x3, ebcdic 0xf).
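
    As a small illustration of why ignoring the zone nibble is convenient,
    here is a C sketch that packs 8-bit digit characters into 4-bit BCD,
    most significant digit first. It is not a model of the B3500 itself;
    pack_bcd is a made-up name, and the 16-digit limit just comes from
    using a 64-bit result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack up to 16 digit characters into 4-bit BCD digits, first character
 * in the most significant position (big-endian digit order).
 * Only the low nibble of each byte is used, so ASCII digits (zone 0x3)
 * and EBCDIC digits (zone 0xF) produce the same packed value. */
static inline uint64_t pack_bcd(const unsigned char *digits, size_t n)
{
    assert(n <= 16);
    uint64_t bcd = 0;
    for (size_t i = 0; i < n; i++)
        bcd = (bcd << 4) | (uint64_t)(digits[i] & 0x0F);
    return bcd;
}
```

    With the digits in the same order as the characters, the conversion is
    a plain left-to-right walk over the string.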

    The B2D and D2B instructions converted between decimal and
    binary representations (maximum magnitude 10**100-1).

  • From Michael S@21:1/5 to BGB on Tue May 7 02:49:31 2024
    On Mon, 6 May 2024 01:47:15 -0500
    BGB <[email protected]> wrote:


    RISC-V is quickly gaining ground in the microcontroller space,
    displacing ARM (Cortex-M / Thumb2).


    I don't see it.
    RISC-V right now is mostly in small cores doing auxiliary functions in
    bigger SoCs. General-purpose 32-bit MCUs are very strongly dominated by
    Cortex-M. I don't believe that in that space RISC-V is in the top 3 by
    volume. I would expect that second-tier players like Xtensa cores, TI
    C2000, and some of the Renesas cores sell more than RISC-V.

  • From MitchAlsup1@21:1/5 to BGB on Tue May 7 00:53:29 2024
    BGB wrote:

    On 5/6/2024 2:11 PM, MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

    Say, RISC-V:
       Says yes to DIV and MOD;
       Says yes to 4-register floating-point multiply-accumulate;
       Says no to register-indexed Load/Store.
    Me: This is not a good balance...

    Multiply-accumulate is at least as much about reducing rounding error
    as about speed.

    It is also an IEEE 754-2008+ requirement.

    And... I have a version that just sort of works well enough to make
    RV64G work, but is sort of a fail on the other fronts:
    Using it is slower than separate ops;
    It produces a double-rounded result.
    Also, well, the FMUL isn't super accurate either.

    So, it fails IEEE 754-accuracy requirements.

    FMUL is implemented in a way where it only generates the high-half of
    the multiply, which makes the FPU cheaper, but:
    Does not give strict 0.5ULP rounding.

    Also failing IEEE 754-accuracy requirements.

    Some combination of factors leads to the inability of Newton-Raphson to
    fully converge, possibly either due to omitting the low-order multiplier
    results, or the carry-propagation limitation for rounding (if the
    rounding would result in more than 8 bits of carry, it is skipped).

    Newton-Raphson is dependent on getting the bits right so that its
    interpolation (between iterations) converges properly.

    Not likely to do proper FMA, as this would make a Binary64 FPU too
    expensive (and, doing Binary64 poorly is still preferable for most uses
    to not doing it at all).

    And yet, every other non FPGA implementation achieves those requirements.

    It really seems that your medium is influencing your architecture,
    rather than the other way around.

    Granted, not entirely sure how the 8087 managed to do all the stuff that
    it did. Since, it seems like an 80s ASIC would be more cramped than a
    modern Artix-7.

    Mostly it was simply slow.

    ....

  • From MitchAlsup1@21:1/5 to John Levine on Tue May 7 00:57:00 2024
    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
    wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't mean it
    has to implement fetching a single byte from memory by doing a shift
    and mask operation in a 64-bit register. Instead, each byte of the bus
    could have a direct wired path to the low 8-bits of the internal data
    bus feeding the registers.

    I was more thinking about storing bit fields, where you probably have
    to fetch the whole word or cache line or whatever, shift the new field
    into it, and then store it back. You already have to do something like
    that for byte stores but bit addressing makes it 8 times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a byte
    accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the
    LD/ST pipeline, 2) most programs use as little bit-fielding as
    possible (not as much as practical) !!!

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue May 7 02:04:37 2024
    Chris M. Thomasson wrote:

    On 5/6/2024 12:15 PM, MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 3:18 AM, Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel
    (and possibly some other places) some years ago. This was using the
    AMD64 instruction set, but with only 32-bit pointers. This way, you
    got the benefit of the extra registers, without the overhead of the
    longer addresses.

    That was Donald Knuth's idea.

    Storing meta data in actual pointers, aka aligned on a larger
    boundary, is critical to many advanced lock/wait free algorithms
    as well. I remember storing an actual reference count in
    pointers before for a special type of counting.

    Even if one has multi-location ATOMICs ?? (as a single event ??)

    This was a technique for storing data in a pointer. For instance,
    for strong atomic reference counting we need to update a pointer _and_
    a reference count together atomically. This can easily be done with
    DWCAS, or double width compare and swap. So, on a 32 bit system we
    need 64 bit cas, for a 64 bit system we need 128 bit cas. However,
    sometimes we can pack the reference count in the pointer value
    itself if it's aligned on a big enough boundary. Then we can update
    the pointer and the reference count using normal word based atomic
    RMW's.

    I understand why you had to pack the pointer and a chunk of data
    into a single container.

    What I don't understand is if you had easy access to multi-container
    ATOMICs the packing would be unnecessary--would it not ?? That is in
    one ATOMIC event you could update the pointer and the chunk of data
    independently and not NEED to store them in a single container.

    Well, actually, a pessimistic word based fetch-and-add (LOCK XADD)
    is enough to increment the counter and load a pointer atomically all
    in one shot, loopless. Why would I need to use multi atomics with a
    possible loop to do that?
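
    For readers following along, the packed pointer-plus-count trick being
    discussed can be sketched with C11 atomics roughly as follows. This is
    an illustration under stated assumptions, not anyone's actual code from
    the thread: the 64-byte alignment (6 spare low bits), the names
    packed_ref, packed_init, packed_acquire and packed_count are all made
    up, and nothing here handles count overflow beyond an assert.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: objects are 64-byte aligned, so the low 6 pointer bits
 * are always zero and can carry a small counter. */
#define COUNT_BITS 6u
#define COUNT_MASK (((uintptr_t)1 << COUNT_BITS) - 1)

typedef atomic_uintptr_t packed_ref;

/* Publish a pointer with a count of zero. */
static inline void packed_init(packed_ref *slot, void *ptr)
{
    assert(((uintptr_t)ptr & COUNT_MASK) == 0);   /* alignment check */
    atomic_init(slot, (uintptr_t)ptr);
}

/* One fetch-and-add both bumps the count and returns the old word,
 * from which the pointer is recovered. No loop is needed. */
static inline void *packed_acquire(packed_ref *slot)
{
    uintptr_t old = atomic_fetch_add(slot, 1);
    /* If the count field were already full, the carry would corrupt
     * the pointer bits; a real design has to rule that out. */
    assert((old & COUNT_MASK) != COUNT_MASK);
    return (void *)(old & ~COUNT_MASK);
}

/* Read the current count without modifying it. */
static inline unsigned packed_count(packed_ref *slot)
{
    return (unsigned)(atomic_load(slot) & COUNT_MASK);
}
```

    On x86-64 a compiler would normally turn the atomic_fetch_add into a
    LOCK XADD. The obvious cost of the packing is the small counter range,
    which is the tradeoff against a double-width CAS.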

    Postulate that you have a 64-bit pointer and a 8-bit chunk 72-total
    bits. Further postulate that you need to update both in a single
    non-blocking ATOMIC event. ...

    "Any programming problem can be solved with an additional layer of
    indirection", so in this case you create a handle to that 72-bit item,
    and require all access to go via the handle?

    I am not trying to add an additional layer of indirection, I am trying
    (unsuccessfully it appears) to get Chris to think outside of the one
    container ATOMIC box.

    LOCK XADD vs a CAS loop? I prefer the former.

    Those are not the only options.


    The addendum to the rule above is of course ", except the problem of
    too many layers of indirections". :-)

    Terje

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Tue May 7 04:55:23 2024
    On Mon, 6 May 2024 01:47:15 -0500, BGB wrote:

    On 5/5/2024 9:30 PM, Lawrence D'Oliveiro wrote:

    I think [RISC-V]’s already shipping in the billions of units per
    year--enough to make it the world’s second-most-popular CPU
    architecture, after ARM.

    Yeah, seemingly right now, x86, ARM, and RISC-V are the top 3 ...

    Last I heard, x86 is in fourth place, after MIPS.

  • From Terje Mathisen@21:1/5 to Terje Mathisen on Tue May 7 07:39:18 2024
    Terje Mathisen wrote:
    EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    On 5/5/2024 10:31 AM, Scott Lurndal wrote:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:


    Not as of yet in my case, but bitfield extract might happen eventually.
    Issue is finding a way to pull it off that is useful and cheaper
    than shift+mask (and probably adding some mechanism to pattern-match
    it from the AST or similar).

    But, but but but:: it IS shift and Mask !!

    Annoyingly, a good general case instruction could not be encoded in
    a 32-bit instruction form at this point (could either add a few
    special cases as 32-bit ops, or use a 64-bit encoding; or do it as a
    2RI op rather than 3RI but this is lame...).

    Then again, say:
       BITEXTR  Imm10, Rn  //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
    Could potentially still be useful.
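
    Read literally, the BITEXTR formula above puts the shift amount in the
    low 6 bits of the immediate and a 4-bit width above it. A C rendering
    of that reading, as a sketch only (the function name bitextr and the
    treatment of the register as a plain uint64_t are assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Rn = (Rn >> (Imm & 63)) & ((1 << ((Imm >> 6) & 15)) - 1)
 * imm10 bits 0..5 = shift amount (0..63), bits 6..9 = field width (0..15). */
static inline uint64_t bitextr(uint64_t rn, unsigned imm10)
{
    assert(imm10 < 1024);                 /* 10-bit immediate */
    unsigned shift = imm10 & 63;
    unsigned width = (imm10 >> 6) & 15;
    return (rn >> shift) & ((UINT64_C(1) << width) - 1);
}
```

    With only 4 bits of width the extracted field tops out at 15 bits,
    which is presumably why it is described as a special case rather than
    the general instruction.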

        SL    Rd,Rc,<width:offset>

    Is a bit field extract instruction, it is also a smash instruction
    (smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
    purpose is needed)

        SR    Rd,Rc,<width:offset>

    Positions the value in a register (Rc) such that it fits the
    alignment of a field.

        INS   Rd,Rc,Rf,<width:offset>

    Inserts the field from Rf into its position <w:o> in Rc, inserts the
    field and delivers the new container to Rd.

    I think my instruction set could accomplish pretty much the same
    efficiency for bit field operations as bit addresses but without
    requiring direct bit addressing.

    An issue that comes up is when the in-memory bit field is > 56 bits wide
    as it might straddle two 64-bit words. If width is <= 56 bits then
    a load from a byte address handles most of the shifting and the
    rest can be handled within a single register.

    But if the in-memory bit field is > 56 bits wide it may or may not
    straddle a 64-bit word boundary, and so may require a pair of
    registers to be loaded.

    x86 does not have bitfield insert/extract, but it does have SHRD/SHLD so
    it is fairly easy to handle arbitrary length (<= 64 bits) and alignment:

    ; RSI -> target, RCX = bit offset of the field, RBX = 64-field size (0..63)
     mov rax,[rsi]
     mov rdx,[rsi+8]

     shrd rax,rdx,cl    ; bit offset

     and rax,bitmask[rbx*8] ; 64 mask entries.

    The last instruction can also be replaced with

      shlx rax,rax,rbx    ; Nr of excess bits (64-field to extract)
      shrx rax,rax,rbx

    or the entire thing can be replaced with this one which calculates the
    mask on the fly:

     mov rax,[rsi]
     mov rdx,[rsi+8]
     or rdi,-1        ; Generate mask

     shrd rax,rdx,cl    ; bit offset
     shrx rdi,rdi,rbx    ; excess bits to mask away

     and rax,rdi

    All seems like about 3 clock cycles when hitting the cache.

    I realized this morning that with arbitrary alignment and both signed
    and unsigned extract, it is better to always shift up first to get rid
    of the excess and then shift down to align. The main problem here is
    that you now need different code for straddling and non-straddling items
    since shifts (including double-wide shifts) have to be less than 64
    bits. :-(

    This is not a problem for constant length and alignment since the
    compiler can choose the correct pattern, but for codecs and compression
    it does not work. (Or at least not for those 57..64 field lengths).

    mov rax,[rsi]
    shl rax,cl ; Excess bits above the field we need
    shrx rax,rax,rbx ; rbx=64-field length

    The last instruction would be

    sarx rax,rax,rbx

    if you wanted a signed bitfield.
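
    The same shift-up-then-shift-down idea in C, for the non-straddling
    case, as a hedged sketch. The function names are made up, `offset`
    is assumed to count from the least significant bit, and the signed
    version relies on the compiler doing an arithmetic right shift on
    negative values (implementation-defined in C, but what mainstream
    compilers do).

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned field: shift left to discard the bits above the field,
 * then shift right logically to align it (cf. shl + shrx above). */
static inline uint64_t extract_unsigned(uint64_t word, unsigned offset,
                                        unsigned width)
{
    assert(width >= 1 && width <= 64 && offset + width <= 64);
    unsigned excess = 64 - (offset + width);   /* bits above the field */
    return (word << excess) >> (64 - width);
}

/* Signed field: same left shift, then an arithmetic right shift so the
 * field's top bit is replicated (cf. sarx above).
 * Assumes two's complement conversion and arithmetic >> on int64_t. */
static inline int64_t extract_signed(uint64_t word, unsigned offset,
                                     unsigned width)
{
    assert(width >= 1 && width <= 64 && offset + width <= 64);
    unsigned excess = 64 - (offset + width);
    return (int64_t)(word << excess) >> (64 - width);
}
```

    Both shift counts stay in 0..63 here, which is exactly the property
    that breaks once the field straddles two words.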

    No matter how you do it, it will become a bottleneck in any Huffman
    token extractor or similar codes. In my own decoders I've tended to
    grab a 32 (in the old days) or 64-bit chunk into a register and
    immediately align it. Then I'll use a lookup table over the first N
    (typically 6-12) bits of this buffer value and let the table decide how
    many bits to keep for the token, or in the case of longer tokens, select
    a second-level table to lookup the remaining bits.

    After decrementing the buffer bits remaining counter I'll branch out to
    refill it, but only if I have at least 32 or 48 free bits. This keeps
    the number of refills fairly low.
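
    A stripped-down C sketch of that kind of decoder loop, to show the
    moving parts. This is not Terje's code: all names are hypothetical,
    the table is single-level only (no second-level tables for long
    tokens), the index width is a toy 2 bits instead of 6-12, and the
    refill goes byte by byte for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const unsigned char *src;   /* next unread input byte */
    const unsigned char *end;   /* one past the last input byte */
    uint64_t bits;              /* unread bits, left-aligned, MSB first */
    unsigned count;             /* number of valid bits in `bits` */
} bit_reader;

typedef struct { uint8_t symbol; uint8_t length; } code_entry;

#define PEEK_BITS 2u            /* toy first-level index width */

/* Append whole bytes below the valid bits while there is room. */
static inline void refill(bit_reader *br)
{
    while (br->count <= 56 && br->src < br->end) {
        br->bits |= (uint64_t)*br->src++ << (56 - br->count);
        br->count += 8;
    }
}

static inline void reader_init(bit_reader *br, const unsigned char *data,
                               size_t len)
{
    br->src = data;
    br->end = data + len;
    br->bits = 0;
    br->count = 0;
    refill(br);
}

/* Look up the top PEEK_BITS bits; the table entry says which symbol it
 * is and how many bits the token really used. Returns -1 on bad data
 * or when the input is exhausted. */
static inline int decode_symbol(bit_reader *br, const code_entry *table)
{
    code_entry e = table[br->bits >> (64 - PEEK_BITS)];
    if (e.length == 0 || e.length > br->count)
        return -1;
    br->bits <<= e.length;          /* keep the buffer left-aligned */
    br->count -= e.length;
    if (64 - br->count >= 32)       /* refill only once 32+ bits are free */
        refill(br);
    return e.symbol;
}
```

    For the toy code 0 -> 'A', 10 -> 'B', 11 -> 'C' the table has four
    entries, with both 00 and 01 mapping to 'A' with length 1, so short
    tokens simply occupy several slots.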

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to BGB on Tue May 7 07:42:06 2024
    BGB wrote:
    On 5/6/2024 2:11 PM, MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

    Say, RISC-V:
       Says yes to DIV and MOD;
       Says yes to 4-register floating-point multiply-accumulate;
       Says no to register-indexed Load/Store.
    Me: This is not a good balance...

    Multiply-accumulate is at least as much about reducing rounding error
    as about speed.

    It is also an IEEE 754-2008+ requirement.

    And... I have a version that just sort of works well enough to make
    RV64G work, but is sort of a fail on the other fronts:
      Using it is slower than separate ops;
      It produces a double-rounded result.
      Also, well, the FMUL isn't super accurate either.


    FMUL is implemented in a way where it only generates the high-half of
    the multiply, which makes the FPU cheaper, but:
      Does not give strict 0.5ULP rounding.

    Some combination of factors leads to the inability of Newton-Raphson to
    fully converge, possibly either due to omitting the low-order multiplier
    results, or the carry-propagation limitation for rounding (if the
    rounding would result in more than 8 bits of carry, it is skipped).


    Not likely to do proper FMA, as this would make a Binary64 FPU too
    expensive (and, doing Binary64 poorly is still preferable for most uses
    to not doing it at all).

    Granted, not entirely sure how the 8087 managed to do all the stuff that
    it did. Since, it seems like an 80s ASIC would be more cramped than a
    modern Artix-7.

    Relatively easy to explain: It was _very_ slow, but still much faster
    than emulating it with an 8088 that needed 4 clock cycles for every
    single code or data byte touched.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Chris M. Thomasson on Tue May 7 07:45:59 2024
    Chris M. Thomasson wrote:
    On 5/5/2024 11:13 PM, Terje Mathisen wrote:
    MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
    Chris M. Thomasson wrote:

    On 5/4/2024 3:18 AM, Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:

    Intel pushed this thing called the “x32” ABI into the Linux kernel
    (and possibly some other places) some years ago. This was using the
    AMD64 instruction set, but with only 32-bit pointers. This way, you
    got the benefit of the extra registers, without the overhead of the
    longer addresses.

    That was Donald Knuth's idea.

    Storing meta data in actual pointers, aka aligned on a larger
    boundary, is critical to many advanced lock/wait free algorithms
    as well. I remember storing an actual reference count in
    pointers before for a special type of counting.

    Even if one has multi-location ATOMICs ?? (as a single event ??)

    This was a technique for storing data in a pointer. For instance,
    for strong atomic reference counting we need to update a pointer _and_
    a reference count together atomically. This can easily be done with
    DWCAS, or double width compare and swap. So, on a 32 bit system we
    need 64 bit cas, for a 64 bit system we need 128 bit cas. However,
    sometimes we can pack the reference count in the pointer value
    itself if it's aligned on a big enough boundary. Then we can update
    the pointer and the reference count using normal word based atomic
    RMW's.

    I understand why you had to pack the pointer and a chunk of data
    into a single container.

    What I don't understand is if you had easy access to multi-container
    ATOMICs the packing would be unnecessary--would it not ?? That is in
    one ATOMIC event you could update the pointer and the chunk of data
    independently and not NEED to store them in a single container.

    Well, actually, a pessimistic word based fetch-and-add (LOCK XADD)
    is enough to increment the counter and load a pointer atomically all
    in one shot, loopless. Why would I need to use multi atomics with a
    possible loop to do that?

    Postulate that you have a 64-bit pointer and a 8-bit chunk 72-total
    bits. Further postulate that you need to update both in a single
    non-blocking ATOMIC event. ...

    "Any programming problem can be solved with an additional layer of
    indirection", so in this case you create a handle to that 72-bit item,
    and require all access to go via the handle?

    The addendum to the rule above is of course ", except the problem of
    too many layers of indirections". :-)

    I remember looking at one of your atomic queues that only used LOCK XADD
    on x86. Why would you use CAS for that? I don't know. I see no need for
    multi-atomics for any of it....


    Why should I have to use emojis when I think I'm being clearly
    sarcastic? :-(

    Please note my addendum above!


    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stephen Fuld@21:1/5 to All on Tue May 7 06:35:53 2024
    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register. Instead,
    each byte of the bus could have a direct wired path to the low
    8-bits of the internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift the
    new field into it, and then store it back. You already have to do
    something like that for byte stores but bit addressing makes it 8
    times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a byte
    accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
    pipeline, 2) most programs use as little bit-fielding as possible
    (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load variant
    that allowed you to address bit fields. Would it be slower than a
    "normal" byte oriented load? Almost certainly. But would it be faster
    than doing all the shifts, masks, word crossing calculations, etc. via
    extra instructions? Again, almost certainly. So you keep the benefits
    of byte oriented loads most of the time, but have "reasonable" access
    to bit fields when you need them, faster than without the
    extra instructions. Hopefully the best of both worlds.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 7 06:40:23 2024
    On Sun, 5 May 2024 01:33:39 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    So using the same register name to address a halfword gives you the low
    half of the register, not the high half?

    Whereas using the same memory address to address a halfword gives you
    the high half of the word at that location, not the low half?

    ... correct.

    So you are backing up what I’m claiming, that in accessing parts of
    registers, big-endian architectures behave just like little-endian ones?
    How exactly is that supposed to prove me wrong?

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue May 7 06:45:02 2024
    On Tue, 7 May 2024 00:53:29 +0000, MitchAlsup1 wrote:

    BGB wrote:

    Granted, not entirely sure how the 8087 managed to do all the stuff
    that it did. Since, it seems like an 80s ASIC would be more cramped
    than a modern Artix-7.

    Mostly it was simply slow.

    Also it used a stack-based programming paradigm. This was not efficient,
    and frequently awkward.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Tue May 7 06:45:43 2024
    On Mon, 6 May 2024 21:32:53 -0500, BGB wrote:

    Yes, but then again, I make no claim that it is IEEE-754 conformant,
    merely that it uses the same formats, and is "good enough" for most
    stuff one needs an FPU for.

    That’s what all the hardware engineers thought, back in the 1990s.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue May 7 06:47:09 2024
    On Mon, 6 May 2024 19:13:51 +0000, MitchAlsup1 wrote:

    Placing bit-field access INSIDE LDs and STs requires adding 2 stages of
    multiplexing in the LD/ST aligners (memory shifters). This has the
    potential to slow the overall pipeline frequency--at which point you
    have lost more than you can gain.

    Of course bit field extraction/insertion should require special
    instructions, not be a part of every load/store.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Tue May 7 06:49:48 2024
    On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:

    But we no longer have this problem.

    But the other reasons for going little-endian still exist.

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Tue May 7 06:42:26 2024
    On Tue, 7 May 2024 00:33:30 -0500, BGB wrote:

    I was thinking more in terms of popularity/mindshare ...

    You mean, PR terms? In this newsgroup??

  • From Michael S@21:1/5 to Stephen Fuld on Tue May 7 11:47:42 2024
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register. Instead,
    each byte of the bus could have a direct wired path to the low
    8-bits of the internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift the
    new field into it, and then store it back. You already have to do
    something like that for byte stores but bit addressing makes it 8
    times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a byte
    accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
    pipeline, 2) most programs use as little bit-fielding as possible
    (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load variant
    that allowed you to address bit fields. Would it be slower than a
    "normal" byte oriented load? Almost certainly. But would it be
    faster than doing all the shifts, masks, word crossing calculations,
    etc. via extra instructions? Again, almost certainly. So you keep
    the benefits of byte oriented loads most of the time, but have
    "reasonable" access to bit fields when you need them, faster than
    without the extra instructions. Hopefully the best of both worlds.





    When you load a bit field from memory, there is a very high chance that
    you will want the adjacent bit field soon thereafter.
    Think about it.

  • From EricP@21:1/5 to All on Tue May 7 09:25:00 2024
    MitchAlsup1 wrote:
    EricP wrote:

    I think my instruction set could accomplish pretty much the same
    efficiency for bit field operations as bit addresses but without
    requiring direct bit addressing.

    An issue that comes up is when the in-memory bit field is > 56 bits wide
    as it might straddle two 64-bit words. If width is <= 56 bits then
    a load from a byte address handles most of the shifting and the
    rest can be handled within a single register.

    This is what CARRY is for--access to 128-bit in 2×64-bit out shifts.
    CARRY can be used for extracts and for inserts.

    But if the in-memory bit field is > 56 bits wide it may or may not
    straddle a 64-bit word boundary, and so may require a pair of
    registers to be loaded.

    I don't understand 56--56 takes just as many bits to encode as 63 ?!?

    Here I'm referring to the two different ways one can load memory
    for bit fields: I can load 64-bit aligned words or byte aligned words
    (here "word" means 64 bits).

    One constraint I put on the following is that it must only touch the
    next cache line or page if it must read bits from it - it must not
    cause gratuitous cache line misses or page faults due to loading
    unnecessary bytes. For 64-bit word aligned loads this is inherently true,
    but for byte aligned loads care must be taken.

    If I load 64-bit aligned then I ignore (mask out) the low 3 bits from
    the address, but those 3 bits have to be inserted back into the field
    start offset as its high order bits, giving the 6-bit field start offset.
    If the field length+offset > 64 then the end of bit field straddles a word
    so I have a conditional load of the next sequential word into a second
    register to hold the high part of the field.
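
    In C terms, the aligned-load variant just described might look roughly
    like this. It is a sketch, not my ISA: load_field_aligned is a made-up
    name, memory is modelled as an array of 64-bit words with bit 0 being
    the LSB of word 0 (a little-endian numbering assumption), and the
    field address is given as a byte offset plus a bit-in-byte offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extract `width` bits (1..64) starting at bit `bit_in_byte` (0..7) of
 * byte `byte_addr`, using only 64-bit aligned word loads. */
static inline uint64_t load_field_aligned(const uint64_t *mem,
                                          size_t byte_addr,
                                          unsigned bit_in_byte,
                                          unsigned width)
{
    assert(bit_in_byte < 8 && width >= 1 && width <= 64);
    size_t word = byte_addr >> 3;               /* mask out low 3 address bits */
    /* ...and merge them back in as the high bits of a 6-bit start offset. */
    unsigned offset = ((unsigned)(byte_addr & 7) << 3) | bit_in_byte;
    uint64_t v = mem[word] >> offset;
    if (offset + width > 64)                    /* straddle: conditional 2nd load */
        v |= mem[word + 1] << (64 - offset);    /* offset >= 1 here, so shift <= 63 */
    uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                  : (UINT64_C(1) << width) - 1;
    return v & mask;
}
```

    The second word is only read when the field really extends into it,
    which is what satisfies the no-gratuitous-miss constraint above.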

    Alternatively I can load a 64-bit word from a byte aligned address.
    In this case I don't need to merge the byte-offset bits with the
    bit-offset bits because the byte align shifter took care of that.
    This allows a bit field up to 56-bits wide to be loaded without
    having to check for a straddle and possibly load the high part.
    Since the high part can only be maximum of 8 bits (because the prior
    load took care of the lower 56 bits and the largest field is 64 bits)
    the second is a byte load so that it doesn't touch any bytes beyond
    the one it needs.
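
    And a rough C sketch of the byte-aligned variant just described. Again
    the name load_field_bytewise is made up, little-endian byte order is
    assumed, and the byte loop merely stands in for a single unaligned
    64-bit load that is careful not to touch bytes it does not need.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits (1..64) starting at bit `bit_in_byte` (0..7) of
 * the byte at p. Reads only the bytes that contain part of the field. */
static inline uint64_t load_field_bytewise(const unsigned char *p,
                                           unsigned bit_in_byte,
                                           unsigned width)
{
    assert(bit_in_byte < 8 && width >= 1 && width <= 64);
    unsigned nbytes = (bit_in_byte + width + 7) / 8;   /* 1..9 bytes needed */
    uint64_t lo = 0;
    for (unsigned i = 0; i < nbytes && i < 8; i++)     /* the unaligned load */
        lo |= (uint64_t)p[i] << (8 * i);
    uint64_t v = lo >> bit_in_byte;    /* byte aligner did the rest already */
    if (nbytes == 9)                   /* high part: a single extra byte load */
        v |= (uint64_t)p[8] << (64 - bit_in_byte);
    uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                  : (UINT64_C(1) << width) - 1;
    return v & mask;
}
```

    The straddle check only fires for the widest fields, and the extra
    access is never more than one byte.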

    As I see it, the main difference between these is how they handle
    multiple bit field accesses, possibly adjacent to the first bit field
    and therefore possibly loaded into one of the two above registers.

    The first version above looks easier to optimize for multiple bit fields
    than the second, but I haven't actually worked it through.

  • From Stephen Fuld@21:1/5 to Michael S on Tue May 7 17:24:02 2024
    Michael S wrote:

    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register.
    Instead, each byte of the bus could have a direct wired path
    to the low 8-bits of the internal data bus feeding the
    registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift
    the new field into it, and then store it back. You already have
    to do something like that for byte stores but bit addressing
    makes it 8 times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a
    byte accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
    pipeline, 2) most programs use as little bit-fielding as possible
    (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load variant
    that allowed you to address bit fields. Would it be slower than a
    "normal" byte oriented load? Almost certainly. But would it be
    faster than doing all the shifts, masks, word crossing calculations,
    etc. via extra instructions? Again, almost certainly. So you keep
    the benefits of byte oriented loads most of the time, but have
    "reasonable" access to bit fields when you need them, faster than
    without the extra instructions. Hopefully the best of both worlds.





    When you load bit field from memory, there is very high chance that
    you would want adjacent bit field soon thereafter.


    Yes. There are two aspects of this, setting the displacement of the
    next field, and the time it takes to access that field. For the first,
    my proposal took advantage of the MY 66000's capability of instruction
    modifiers to (optionally) add the length of the loaded bit field to the
    register that contains the bit displacement. So the addressing is
    already set up for a subsequent load bit field instruction to load the
    adjacent bit field. For the time to access that field, it depends.
    For a low end implementation, the target data for the subsequent load
    would already be in the L1 cache, so not too bad. Higher end
    implementations could take advantage of the MY 66000's streaming
    buffers such that the data would already be "close" to the ALU. As I
    have often said, IANAHG, so I may have the details wrong.
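
    The displacement update described above can be modelled in C roughly
    as follows. This is only a behavioural sketch: load_bf_advance is a
    made-up name, the bit-by-bit loop is for clarity rather than speed,
    and nothing here is actual MY 66000 syntax or semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "load bit field, then add the field length to the
 * register holding the bit displacement". Bits are numbered LSB-first. */
static uint64_t load_bf_advance(const uint8_t *base, uint64_t *bit_disp, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t value = 0;
    /* Gather one bit at a time; hardware would shift and mask instead. */
    for (unsigned i = 0; i < width; i++) {
        uint64_t pos = *bit_disp + i;
        uint64_t bit = (base[pos >> 3] >> (pos & 7)) & 1u;
        value |= bit << i;
    }
    *bit_disp += width;   /* addressing is now set up for the adjacent field */
    return value;
}
```

    Calling it repeatedly with the same displacement variable walks through
    adjacent fields without any separate address arithmetic.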




    Think about it.

    Thanks, I have. :-)



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From EricP@21:1/5 to Terje Mathisen on Tue May 7 15:06:12 2024
    Terje Mathisen wrote:
    Terje Mathisen wrote:
    EricP wrote:
    MitchAlsup1 wrote:
    BGB wrote:

    On 5/5/2024 10:31 AM, Scott Lurndal wrote:
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:


    Not as of yet in my case, but bitfield extract might happen
    eventually.
    Issue is finding a way to pull it off that is useful and cheaper
    than shift+mask (and probably adding some mechanism to
    pattern-match it from the AST or similar).

    But, but but but:: it IS shift and Mask !!

    Annoyingly, a good general case instruction could not be encoded in
    a 32-bit instruction form at this point (could either add a few
    special cases as 32-bit ops, or use a 64-bit encoding; or do it as
    a 2RI op rather than 3RI but this is lame...).

    Then again, say:
      BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
    Could potentially still be useful.
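
    Read literally, the BITEXTR formula above takes the low 6 bits of the
    immediate as the shift count and the next 4 bits as the field width.
    A C rendering of just that formula (the function name is illustrative,
    and the 0..15-bit width limit is simply what the formula implies):

```c
#include <assert.h>
#include <stdint.h>

/* Rn = (Rn >> (Imm & 63)) & ((1 << ((Imm >> 6) & 15)) - 1) */
static uint64_t bitextr(uint64_t rn, unsigned imm10)
{
    assert(imm10 < 1024);                    /* 10-bit immediate */
    unsigned shift = imm10 & 63;             /* starting bit, 0..63 */
    unsigned width = (imm10 >> 6) & 15;      /* field width, 0..15 bits */
    return (rn >> shift) & ((UINT64_C(1) << width) - 1);
}
```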

       SL   Rd,Rc,<width:offset>

    Is a bit field extract instruction, it is also a smash instruction
    (smashing a 64-bit value into an 8-bit or 12-bit or 47-bit for whatever
    purpose is needed)

       SR   Rd,Rc,<width:offset>

    Positions the value in a register (Rc) such that it fits the alignment
    of a field.

       INS  Rd,Rc,Rf,<width:offset>

    Inserts the field from Rf into its position <w:o> in Rc and delivers
    the new container to Rd.

    I think my instruction set could accomplish pretty much the same
    efficiency for bit field operations as bit addresses but without
    requiring direct bit addressing.

    An issue that comes up is when the in-memory bit field is > 56 bits wide
    as it might straddle two 64-bit words. If width is <= 56 bits then
    a load from a byte address handles most of the shifting and the
    rest can be handled within a single register.

    But if the in-memory bit field is > 56 bits wide it may or may not
    straddle a single 64-bit memory location, and require a pair of
    registers to be loaded.

    x86 does not have bitfield insert/extract, but it does have SHRD/SHLD
    so it is fairly easy to handle arbitrary length (<= 64 bits) and
    alignment:

    ; RSI -> target, RCX = # bits to extract, RBX = 64-field size (0..63)
    mov rax,[rsi]
    mov rdx,[rsi+8]

    This is what I wanted to avoid: blindly loading the next word
    as that could unnecessarily read a cache line or worse,
    trap on an access violation.

    It's not that it is difficult to avoid, it just adds to the fiddliness
    (like conditional branches around one or two instructions).

    shrd rax,rdx,cl ; bit offset

    and rax,bitmask[rbx*8] ; 64 mask entries.

    The last instruction can also be replaced with

    shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
    shrx rax,rax,rbx

    or the entire thing can be replaced with this one which calculates the
    mask on the fly:

    mov rax,[rsi]
    mov rdx,[rsi+8]
    or rdi,-1 ; Generate mask

    shrd rax,rdx,cl ; bit offset
    shrx rdi,rdi,rbx ; excess bits to mask away

    and rax,rdi

    All seems like about 3 clock cycles when hitting the cache.
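
    For readers who prefer C, the load/SHRD/mask sequence above corresponds
    roughly to the sketch below. The function and parameter names are
    illustrative; lo and hi stand for the two consecutive 64-bit words that
    the assembly loads into RAX and RDX.

```c
#include <assert.h>
#include <stdint.h>

/* offset: bit offset of the field within lo (0..63);
 * width:  field width in bits (1..64). */
static uint64_t extract_straddle(uint64_t lo, uint64_t hi, unsigned offset, unsigned width)
{
    assert(offset < 64 && width >= 1 && width <= 64);
    /* SHRD equivalent. A C shift by 64 is undefined, so offset 0 is
     * handled separately. */
    uint64_t v = offset ? (lo >> offset) | (hi << (64 - offset)) : lo;
    /* Mask generated on the fly: all ones shifted right by the excess. */
    uint64_t mask = UINT64_MAX >> (64 - width);
    return v & mask;
}
```

    Like the assembly, this assumes both words are already available; it
    does nothing to avoid touching the second word when the field does not
    straddle.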

    I realized this morning that with arbitrary alignment and both signed
    and unsigned extract, it is better to always shift up first to get rid
    of the excess and then shift down to align. The main problem here is
    that you now need different code for straddling and non-straddling items
    since shifts (including double-wide shifts) have to be less than 64
    bits. :-(

    This is not a problem for constant length and alignment since the
    compiler can choose the correct pattern, but for codecs and compression
    it does not work. (Or at least not for those 57..64 field lengths).

    mov rax,[rsi]
    shl rax,cl ; Excess bits above the field we need
    shrx rax,rax,rbx ; rbx=64-field length

    The last instruction would be

    sarx rax,rax,rbx

    if you wanted a signed bitfield.
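
    The shift-up-then-shift-down pattern, for the non-straddling case, can
    be sketched in C as below. Function names are illustrative. The signed
    version assumes that >> on a negative int64_t is an arithmetic shift
    (as SARX is); ISO C leaves that implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Field of `width` bits starting at bit `offset` of one 64-bit word,
 * with offset + width <= 64 (no straddle). */
static uint64_t extract_unsigned(uint64_t word, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 64);
    uint64_t up = word << (64 - offset - width);  /* discard bits above the field */
    return up >> (64 - width);                    /* logical shift: zero-extend */
}

static int64_t extract_signed(uint64_t word, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 64);
    uint64_t up = word << (64 - offset - width);  /* field's top bit lands in bit 63 */
    return (int64_t)up >> (64 - width);           /* assumed arithmetic: sign-extend */
}
```

    Both shift counts stay in the range 0..63, which is why this form only
    covers fields that fit inside a single word.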

    No matter how you do it, it will become a bottleneck in any Huffman
    token extractor or similar codes. In my own decoders I've tended to
    grab a 32 (in the old days) or 64-bit chunk into a register and
    immediately align it. Then I'll use a lookup table over the first N
    (typically 6-12) bits of this buffer value and let the table decide how
    many bits to keep for the token, or in the case of longer tokens, select
    a second-level table to look up the remaining bits.

    After decrementing the buffer bits remaining counter I'll branch out to
    refill it, but only if I have at least 32 or 48 free bits. This keeps
    the number of refills fairly low.
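
    A much reduced C sketch of that decoder structure follows. Everything
    in it is an assumption made for illustration: the three-symbol prefix
    code, the 2-bit peek (the post suggests 6-12), the byte-at-a-time
    refill, and all names. The second-level table for long tokens is left
    out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PEEK_BITS 2   /* tiny table for illustration only */

typedef struct { uint8_t symbol, length; } Entry;

/* Hypothetical prefix code: 0 -> 'a', 10 -> 'b', 11 -> 'c'. */
static const Entry table[1 << PEEK_BITS] = {
    {'a', 1}, {'a', 1}, {'b', 2}, {'c', 2}
};

typedef struct {
    const uint8_t *src; size_t pos, len;  /* input bytes */
    uint64_t buf;                         /* pending bits, aligned at the top */
    unsigned count;                       /* number of valid bits in buf */
} BitReader;

static void refill(BitReader *r)
{
    /* Top up one byte at a time while at least 8 bits are free. */
    while (r->count <= 56 && r->pos < r->len) {
        r->buf |= (uint64_t)r->src[r->pos++] << (56 - r->count);
        r->count += 8;
    }
}

/* Returns the decoded symbol, or -1 when the input is exhausted. */
static int decode_symbol(BitReader *r)
{
    assert(r->count <= 64);
    if (64 - r->count >= 32)              /* refill only with 32+ free bits */
        refill(r);
    Entry e = table[r->buf >> (64 - PEEK_BITS)];
    if (e.length > r->count)
        return -1;                        /* not enough bits left for a token */
    r->buf <<= e.length;                  /* drop the token, keep the rest aligned */
    r->count -= e.length;
    return e.symbol;
}
```

    The table lookup replaces any per-token shift-and-compare loop, and the
    refill branch is taken only occasionally, which is the point made above.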

    Terje

    There seem to be two use cases, one for bit-wise load and store to
    individual bit fields in compiled structures, the other is dynamic
    bit fields in bit streams.

    The first is bit sized elements in packed arrays, or packed structs,
    or packed arrays of packed structs, or packed structs containing packed
    array of bit fields, etc. These are supported by some languages
    (Ada85 had optional packed arrays and record structs).
    For these the field start bit-offset is dynamic but the field size and
    type are compile constants and so offer some potential for optimization
    (but that could require inlining some of the access subroutines).

    Such fields would tend to be both read and written in semi-random order
    but with a high probability that nearby fields will also be accessed.

    The other is bit fields in bit streams being processed sequentially from
    lsb to msb order, e.g. for a codec. For these the field size and type are
    dynamic but the token start offset can be arranged to be in bit[0].
    If you know the bit-wise token always starts in bit[0] you don't need to
    deal with field straddles, but must dynamically track where the last
    valid in-register bit is and detect when to load the next word and
    append to the register bit stream.

    Bit stream processing would likely be either write-only encode or
    read-only decode, proceeding once serially either low to high or high
    to low order.
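
    The bit-stream case can be sketched in C as a reader that keeps the
    next token at bit 0 of a window register and tracks how many bits are
    valid. The names and the 32-bit token limit are assumptions, and the
    stream is appended a byte at a time rather than a word at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *src; size_t pos, len;  /* input bytes */
    uint64_t window;                      /* next token always starts at bit 0 */
    unsigned valid;                       /* number of valid bits in window */
} LsbReader;

/* Read `width` bits (1..32), LSB first.
 * Returns 0 on success, -1 if the stream ran out. */
static int read_bits(LsbReader *r, unsigned width, uint32_t *out)
{
    assert(width >= 1 && width <= 32);
    /* Append bytes above the last valid bit until the token is covered. */
    while (r->valid < width && r->pos < r->len) {
        r->window |= (uint64_t)r->src[r->pos++] << r->valid;
        r->valid += 8;
    }
    if (r->valid < width)
        return -1;
    *out = (uint32_t)(r->window & ((UINT64_C(1) << width) - 1));
    r->window >>= width;                  /* next token is back at bit 0 */
    r->valid -= width;
    return 0;
}
```

    Because the token always starts at bit 0, extraction is a single mask;
    the straddle handling moves into the refill step.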

    Both would simplify greatly with double-wide shifts of register pairs,
    as well as double-wide bit field extract and insert.

  • From MitchAlsup1@21:1/5 to EricP on Tue May 7 20:56:09 2024
    EricP wrote:

    Both would simplify greatly with double-wide shifts of register pairs,
    as well as double-wide bit field extract and insert.

    If you have the latter {double-wide bit field extract and insert} why do
    you need the former {double-wide shifts of register pairs}?

    And by double-wide bit field extracts--you mean the container is 2
    registers wide and the extracted result is 64 bits (or less) wide; and
    that for insert the value being inserted is 64 bits wide and the
    container it is being inserted into is 2 registers wide.

  • From John Savard@21:1/5 to [email protected] on Tue May 7 19:18:39 2024
    On Mon, 6 May 2024 02:34:48 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Sun, 05 May 2024 11:20:02 -0600, John Savard wrote:

    If you have decimal arithmetic, there's a direct connection between how
    numbers are represented for reading and writing, and how they are
    represented for internal arithmetic.

    It is easier to do addition/subtraction if you start from the least
    significant end and propagate the carry/borrow along.

    I believe those early IBM character machines worked exactly this way.

    Yes, I think you're right. While the IBM 1401 did store character
    strings in the conventional big-endian order, they were addressed by
    the location of their least significant digit so that arithmetic could
    still start there, even if it then went backwards to lower addresses.
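
    As a rough illustration of that addressing style, here is a C sketch
    that adds two decimal fields stored most significant digit first, where
    each operand is named by the address of its last (least significant)
    digit and the loop walks toward lower addresses. Fixed-length fields
    and one digit value per byte are simplifications; the real machine
    delimited fields with word marks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a_last and b_last point at the least significant digit of each field.
 * The sum replaces field a; the return value is the carry out. */
static unsigned add_fields(uint8_t *a_last, const uint8_t *b_last, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Step backwards from the low-order digit toward lower addresses. */
        unsigned sum = *(a_last - i) + *(b_last - i) + carry;
        assert(sum <= 19);                 /* both inputs must be digits 0..9 */
        *(a_last - i) = (uint8_t)(sum % 10);
        carry = sum / 10;
    }
    return carry;
}
```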

    John Savard

  • From John Savard@21:1/5 to [email protected] on Tue May 7 19:16:40 2024
    On Fri, 3 May 2024 22:26:04 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:

    To me, it just made sense that, since registers contain quantities, if
    you load the value "8" into a register, it will contain the number 8.

    So in a byte operation, the least significant bits of the register are
    used.

    Of course that makes sense.

    Now, think of main memory as just a holding place for stuff that won't fit
    in registers: why shouldn't it make sense there as well?

    Because that isn't what main memory is. Even if one could think of
    cache memory that way, main memory also interacts with input-output
    devices.

    Although that isn't really the problem.

    After all, computational variables can be stored in memory in any
    format. The only things in memory that are constrained in format are
    character strings, because they get printed on paper for people to
    see.

    And, as I noted, that is the root of the problem.

    Character strings are in big-endian order.

    Packed decimal strings should be in the same order as character
    strings, so that the relationship between the two is simple and
    conversion between the two is quick.

    Packed decimal strings of numbers should be in the same order as
    binary numbers, because they can potentially share the same arithmetic
    unit in some implementations.

    John Savard

  • From John Savard@21:1/5 to [email protected] on Tue May 7 19:31:10 2024
    On Wed, 1 May 2024 19:33:54 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

    I don't know about the PDP 10, but you are right that Univac 1108 had
    both six bit (technically a sixth of a word) and nine bit (quarter
    word) operations. The 6 bit was Fieldata and used for most older
    software. The quarter words held an 8 bit ASCII character with one
    "wasted" bit per byte. This became the dominant usage for
    applications, but the Exec itself still uses a lot of Fieldata.

    The PDP-10 used ASCII, and not other codes.

    The six-bit code of the Univac was derived from FIELDATA, but the
    actual FIELDATA code, developed by the military, was a 7-bit code
    which included lower-case.

    In my cryptography pages, on the page

    http://www.quadibloc.com/crypto/mi060103.htm

    there's a diagram comparing Univac's Fieldata code with the actual
    FIELDATA code.

    John Savard

  • From John Savard@21:1/5 to [email protected] on Tue May 7 19:23:59 2024
    On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:

    But we no longer have this problem.

    But the other reasons for going little-endian still exist.

    And what other reasons might those be?

    Yes, going little-endian made things simpler in computers with short
    word lengths, since the most common operations started from the least
    significant end.

    But to do things in a big-endian way in such computers didn't require
    trying to do addition backwards; you just had to jump ahead by the
    length of the number, and then move backwards from the least
    significant part. Often, though, even a trifling expense to do so
    didn't make sense.

    But when decimal and binary are both used in the same machine, then
    big-endian is almost unavoidable - especially when the same
    architecture is to be used in a wide range of implementations, some
    big, and some small. Then, compatibility forces the use of a small
    number of extra gates here and there.

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 02:11:09 2024
    On Tue, 07 May 2024 19:16:40 -0600, John Savard wrote:

    Character strings are in big-endian order.

    Better thought of as “character strings are stored so ascending addresses
    correspond to logical reading order”. Note I didn’t say “display order”,
    since that can be quite different.

    Packed decimal strings should be in the same order as character strings,
    so that the relationship between the two is simple and conversion
    between the two is quick.

    Now here you are getting into cultural issues. For example, while both
    Arabic and Hebrew use decimal numbers, they write the digits in opposite
    order.

    Computer-internal formats should be optimized for computer-internal
    operations. Conversion from/to human-comprehensible layout/ordering/
    formatting should happen when accepting human input and displaying output
    for humans. The two should be kept separate, so the former remains
    independent of the latter, and the latter can be easily reconfigured.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 02:14:38 2024
    On Tue, 07 May 2024 19:23:59 -0600, John Savard wrote:

    On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    But the other reasons for going little-endian still exist.

    And what other reasons might those be?

    Consider how you specify these 3 conventions:
    * numbering of bits within a byte
    * numbering of bytes within a multibyte quantity
    * the place values (powers of 2) of bits in an integer

    The only way to get all 3 consistent is with a little-endian architecture.
    Every big-endian architecture has inconsistencies between these somewhere
    or another.

  • From MitchAlsup1@21:1/5 to John Savard on Wed May 8 02:15:44 2024
    John Savard wrote:

    On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:

    But we no longer have this problem.

    But the other reasons for going little-endian still exist.

    And what other reasons might those be?

    Yes, going little-endian made things simpler in computers with short
    word lengths, since the most common operations started from the least
    significant end.

    But to do things in a big-endian way in such computers didn't require
    trying to do addition backwards; you just had to jump ahead by the
    length of the number, and then move backwards from the least
    significant part. Often, though, even a trifling expense to do so
    didn't make sense.

    But when decimal and binary are both used in the same machine, then big-endian is almost unavoidable

    Carry from digit to digit is the same direction in binary and decimal.
    This argues sameness not Big-Endian.

    - especially when the same
    architecture is to be used in a wide range of implementations, some
    big, and some small. Then, compatibility forces the use of a small
    number of extra gates here and there.

    John Savard

  • From John Levine@21:1/5 to All on Wed May 8 02:47:46 2024
    According to MitchAlsup1 <[email protected]>:
    Character strings are in big-endian order.

    Not in Hebrew or Chinese !!

    It doesn't make sense to say that character strings are big- or
    little-endian.

    They're stored in the order you would read them, and there's typically
    metadata about how to display them. In Unicode, Hebrew and Arabic code
    points display right to left, Chinese displays however they want,
    typically left to right in rows these days.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to John Savard on Wed May 8 02:14:07 2024
    John Savard wrote:

    On Fri, 3 May 2024 22:26:04 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:

    To me, it just made sense that, since registers contain quantities, if
    you load the value "8" into a register, it will contain the number 8.

    So in a byte operation, the least significant bits of the register are
    used.

    Of course that makes sense.

    Now, think of main memory as just a holding place for stuff that won’t fit
    in registers: why shouldn’t it make sense there as well?

    Because that isn't what main memory is. Even if one could think of
    cache memory that way, main memory also interacts with input-output
    devices.

    Although that isn't really the problem.

    After all, computational variables can be stored in memory in any
    format. The only things in memory that are constrained in format are
    character strings, because they get printed on paper for people to
    see.

    And, as I noted, that is the root of the problem.

    Character strings are in big-endian order.

    Not in Hebrew or Chinese !!

    Packed decimal strings should be in the same order as character
    strings, so that the relationship between the two is simple and
    conversion between the two is quick.

    Packed decimal strings of numbers should be in the same order as
    binary numbers, because they can potentially share the same arithmetic
    unit in some implementations.

    John Savard

  • From John Levine@21:1/5 to All on Wed May 8 03:08:17 2024
    According to John Savard <[email protected]d>:
    But the other reasons for going little-endian still exist.

    And what other reasons might those be?

    These days the only reason is that everything else is little-endian.

    Danny Cohen went through all of the arguments in his Holy Wars paper
    in 1980. In the ensuing 44 years, nobody has added anything
    interesting.



    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed May 8 03:10:37 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 07 May 2024 19:23:59 -0600, John Savard wrote:

    On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    But the other reasons for going little-endian still exist.

    And what other reasons might those be?

    Consider how you specify these 3 conventions:
    * numbering of bits within a byte

    Most significant is bit[0] least significant is bit[2^k-1]

    * numbering of bytes within a multibyte quantity

    Most significant byte[0] least significant byte[2^k-1]

    * the place values (powers of 2) of bits in an integer

    POWN Rp,#2,Ri

    The only way to get all 3 consistent is with a little-endian architecture.

    Not so; as illustrated above.

    Every big-endian architecture has inconsistencies between these somewhere
    or another.

    Most significant priority is [0] least significant priority is [2^k-1]

    Apparently even LE machines get this one wrong, too.

  • From Lawrence D'Oliveiro@21:1/5 to All on Wed May 8 03:38:37 2024
    On Wed, 8 May 2024 03:10:37 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    Consider how you specify these 3 conventions:
    * numbering of bits within a byte

    Most significant is bit[0] least significant is bit[2^k-1]

    * numbering of bytes within a multibyte quantity

    Most significant byte[0] least significant byte[2^k-1]

    * the place values (powers of 2) of bits in an integer

    Now you have to have place number = 2^k - 1 - i, where i is your bit
    number. So not only must the numbers be different, the relationship has
    to change depending on the size of the field!

    In little-endian, both numbers can be the same, in big-endian, they can’t.
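
    The same point can be shown at byte granularity with a short C sketch
    (function names are illustrative): in the little-endian reading the
    weight of byte i never depends on the field size n, while in the
    big-endian reading it does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian: byte i always has weight 256^i, whatever n is. */
static uint64_t value_le(const uint8_t *p, size_t n)
{
    assert(n <= 8);
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Big-endian: byte i has weight 256^(n-1-i), which changes with n. */
static uint64_t value_be(const uint8_t *p, size_t n)
{
    assert(n <= 8);
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * (n - 1 - i));
    return v;
}
```

    Reading the same four bytes as a 2-byte and then a 4-byte field leaves
    the low half unchanged under value_le, but not under value_be.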

  • From John Savard@21:1/5 to [email protected] on Tue May 7 21:56:35 2024
    On Wed, 8 May 2024 02:14:38 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    Consider how you specify these 3 conventions:
    * numbering of bits within a byte
    * numbering of bytes within a multibyte quantity
    * the place values (powers of 2) of bits in an integer

    The only way to get all 3 consistent is with a little-endian architecture.
    Every big-endian architecture has inconsistencies between these somewhere
    or another.

    That's true.

    But I fail to see why the last one needs to be consistent, except as
    an aesthetic preference.

    And so I find the IBM System/360, which gets the first two consistent,
    to be a sterling example of perfect consistency.

    The IBM System/360 gets to convert from character strings which
    represent integers to their packed decimal form in a simple way -
    assemble the last four bits of each byte, in the same order as the
    bytes in that string.

    And then packed decimal values are in the same ordering as binary
    values - with the most significant part in the same spot.

    This has practical consequences. Pack and Unpack are faster. Decimal
    and binary arithmetic can share circuitry on lower-end designs.
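
    The "assemble the last four bits of each byte" step can be sketched in
    C as follows. This is a simplification, not the actual PACK
    instruction: sign handling is omitted, the digit count must be even,
    and pack_digits is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n digit characters into nibbles, two per byte, keeping the same
 * order as the string. Works for any character code whose digits carry
 * their value in the low four bits (ASCII and EBCDIC both do). */
static void pack_digits(const char *digits, size_t n, uint8_t *out)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        uint8_t hi = (uint8_t)digits[i]     & 0x0F;  /* low four bits of each byte */
        uint8_t lo = (uint8_t)digits[i + 1] & 0x0F;
        out[i / 2] = (uint8_t)(hi << 4 | lo);
    }
}
```

    No reordering is needed at any step, which is the simplicity being
    claimed for keeping strings and packed decimal in the same order.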

    John Savard

  • From John Savard@21:1/5 to All on Tue May 7 22:01:36 2024
    On Wed, 8 May 2024 02:15:44 +0000, [email protected] (MitchAlsup1)
    wrote:

    Carry from digit to digit is the same direction in binary and decimal.
    This argues sameness not Big-Endian.

    Yes, that's right.

    But that's only half of the argument.

    The reason for both being the same as big-endian instead of both being
    the same as little-endian is because of a _third_ item.

    This argues that packed decimal should have the same endianness as
    binary.

    But the third item is character strings, used in input and output to
    represent numbers. They should be the same as packed decimal to make
    conversion between the two simpler.

    Then I argue for "sameness" as well, because a machine could be
    little-endian, with binary integers and floating-point all
    little-endian, but with decimal, as something minor and unimportant,
    being big-endian. So in addition to arguing that packed decimal should
    be big-endian because strings, I also have to argue that packed
    decimal and binary should have the same endianness.

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 8 03:35:47 2024
    On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

    It doesn't make sense to say that character strings are big- or
    little-endian.

    Yes it does, for just about any encoding other than UTF-8. Thus, you have
    UTF-16BE and UTF-16LE.
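
    A small C sketch of what the two byte orders mean for a single UTF-16
    code unit (function names are illustrative). The order of code units
    within the string is the same in both; only the bytes inside each unit
    are swapped.

```c
#include <assert.h>
#include <stdint.h>

/* Serialize one 16-bit code unit, most significant byte first. */
static void put_utf16be(uint16_t unit, uint8_t out[2])
{
    out[0] = (uint8_t)(unit >> 8);
    out[1] = (uint8_t)(unit & 0xFF);
}

/* Serialize one 16-bit code unit, least significant byte first. */
static void put_utf16le(uint16_t unit, uint8_t out[2])
{
    out[0] = (uint8_t)(unit & 0xFF);
    out[1] = (uint8_t)(unit >> 8);
}
```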

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 05:50:08 2024
    On Tue, 07 May 2024 21:56:35 -0600, John Savard wrote:

    But I fail to see why the last one needs to be consistent, except as an aesthetic preference.

    Not just inconsistency, but the fact that the numbering has to be
    different depending on the size of the multibyte quantity.

    Only little-endian allows this numbering to be both consistent and
    constant.
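
    One way to read this claim, as a sketch with hypothetical helper names: in little-endian order the byte at offset i always has weight 256**i, whatever the operand width, while in big-endian order the weight of a given offset depends on the width.

```python
def byte_weight_little(offset: int) -> int:
    """Weight of the byte at `offset` in a little-endian integer of any width."""
    return 256 ** offset


def byte_weight_big(offset: int, width: int) -> int:
    """Weight of the byte at `offset` in a big-endian integer of `width` bytes."""
    return 256 ** (width - 1 - offset)


value = 0x1234
for width in (2, 4, 8):
    little = value.to_bytes(width, "little")
    big = value.to_bytes(width, "big")
    # Both layouts reconstruct the same value, but only the little-endian
    # weights are independent of `width`.
    assert sum(b * byte_weight_little(i) for i, b in enumerate(little)) == value
    assert sum(b * byte_weight_big(i, width) for i, b in enumerate(big)) == value
    print(width, little.hex(), big.hex())
```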

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 05:54:50 2024
    On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

    But the third item is character strings, used in input and output to represent numbers. They should be the same as packed decimal to make conversion between the two simpler.

    No, because character string conversion is subject to localization issues.

  • From Michael S@21:1/5 to John Levine on Wed May 8 13:14:53 2024
    On Wed, 8 May 2024 02:47:46 -0000 (UTC)
    John Levine <[email protected]> wrote:

    According to MitchAlsup1 <[email protected]>:
    Character strings are in big-endian order.

    Not in Hebrew or Chinese !!

    It doesn't make sense to say that character strings are big- or
    little- endian.

    They're stored in the order you would read them, and there's typically metadata about how to display them. In Unicode, Hebrew and Arabic code
    points display right to left, Chinese displays however they want,
    typically left to right in rows these days.


    Unfortunately, in Hebrew it is not that simple. Numbers [of Arabic
    variety] are written with most significant digit on the left, i.e. if
    we consider most significant digit as "first" then it can be said
    that [Arabic] numbers appear in opposite direction to the rest of the
    text. Numbers of Hebrew variety are written right to left, but nowadays
    they are used much less often.
    Arabic, on the other hand, uses the same right to left direction both
    for text and for numbers.

  • From Michael S@21:1/5 to [email protected] on Wed May 8 13:45:58 2024
    On Wed, 8 May 2024 03:10:37 +0000
    [email protected] (MitchAlsup1) wrote:

    Most significant priority is [0] least significant priority is [2^k-1]

    Apparently even LE machines get this one wrong, too.

    What sort of 'priority' are you talking about? I can't think of
    any meaning of this word for which the numbering is independent of
    culture or context or both.
    Even if we limit ourselves to "Western" cultures, although it is true
    that more often than not (not always!) higher priority is associated
    with smaller number, I would think that the highest priority is more
    often associated with one than with zero.

  • From Michael S@21:1/5 to Terje Mathisen on Wed May 8 15:36:48 2024
    On Wed, 8 May 2024 14:25:15 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register. Instead,
    each byte of the bus could have a direct wired path to the low
    8-bits of the internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift the
    new field into it, and then store it back. You already have to do
    something like that for byte stores but bit addressing makes it 8
    times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a
    byte accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
    pipeline, 2) most programs use as little bit-fielding as possible
    (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load variant
    that allowed you to address bit fields. Would it be slower than a
    "normal" byte oriented load? Almost certainly. But would it be
    faster than doing all the shifts, masks, word crossing
    calculations, etc. via extra instructions? Again, almost
    certainly. So you keep the benefits of byte oriented loads most
    of the time, but have "reasonable" access to bit fields when you
    need them, faster than without the extra instructions. Hopefully
    the best of both worlds.





    When you load bit field from memory, there is very high chance that
    you would want adjacent bit field soon thereafter.
    Think about it.

    Which means that you would like to have a dedicated streaming buffer
    cache for the EXTR operation?

    Terje



    That's not what I wanted to hint to Stephen.
    I wanted to hint that in the typical situation, i.e. when one 32-bit or
    64-bit load serves several bit field extractions, his additional
    instruction would be slower rather than faster than existing practice.
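
    A sketch of the "existing practice" referred to here: a single 64-bit load followed by several shift-and-mask extractions from the register copy. The field layout and values are made up for illustration.

```python
def extract(word: int, pos: int, width: int) -> int:
    """Return `width` bits of `word` starting at bit `pos` (bit 0 = least significant)."""
    return (word >> pos) & ((1 << width) - 1)


# Hypothetical layout: (bit position, width) of four adjacent fields.
layout = [(0, 3), (3, 5), (8, 10), (18, 14)]
values = [5, 17, 600, 9000]

# Build the memory image holding the packed fields.
word_in_memory = 0
for (pos, width), value in zip(layout, values):
    word_in_memory |= value << pos
memory = word_in_memory.to_bytes(8, "little")

word = int.from_bytes(memory[:8], "little")  # the one and only load
fields = [extract(word, pos, width) for pos, width in layout]
print(fields)
```

    Every extraction after the first works on the register copy, so no further memory access is needed for the adjacent fields.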

  • From Terje Mathisen@21:1/5 to Michael S on Wed May 8 14:25:15 2024
    Michael S wrote:
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register. Instead,
    each byte of the bus could have a direct wired path to the low
    8-bits of the internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift the
    new field into it, and then store it back. You already have to do
    something like that for byte stores but bit addressing makes it 8
    times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a byte
    accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
    pipeline, 2) most programs use as little bit-fielding as possible
    (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load variant
    that allowed you to address bit fields. Would it be slower than a
    "normal" byte oriented load? Almost certainly. But would it be
    faster than doing all the shifts, masks, word crossing calculations,
    etc. via extra instructions? Again, almost certainly. So you keep
    the benefits of byte oriented loads most of the time, but have
    "reasonable" access to bit fields when you need them, faster than
    without the extra instructions. Hopefully the best of both worlds.





    When you load bit field from memory, there is very high chance that you
    would want adjacent bit field soon thereafter.
    Think about it.

    Which means that you would like to have a dedicated streaming buffer
    cache for the EXTR operation?

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stefan Monnier@21:1/5 to All on Wed May 8 11:30:36 2024
    I wanted to hint that in typical situation, i.e. when one 32-bit or
    64-bit load serves several bit field extractions, his additional
    instruction would be slower rather than faster than existing practice.

    The way I imagine bit-addressability, it would basically work as
    follows:

    - Use pointers almost as we do now, except shifted by 3 bits.
    Most likely normal loads and stores would signal an error if the low
    3 bits aren't 0. Immediate offsets in instructions would presumably
    still be in the same units as before (bytes, words, ...).

    This is fundamentally the only thing needed.
    But once you have that, you'd probably want to add some instructions to
    ease bit-granular processing, which I'd imagine would look like:

    - Load/store operations that ignore the lowest 3 bits (or
    more than that, maybe the lowest 6 bits).
    - bit-insertion/extraction instructions which use those lowest 3-6 bits
    and ignore the rest.

    This would not require any special shifter in the memory path and the combination of those operations should be just as efficient as
    a dedicated instruction.

    To handle bitfields that straddle word boundaries, you might want
    your bit-insert/extract to come with a "double-wide" option (I guess My
    66000's CARRY could do the trick), tho maybe you'd just use something
    like a 64bit load/store which only ignores the lowest 5bits (should be sufficient for any bit-field up to 32bits).
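
    The scheme above can be modelled roughly as follows. This is only a sketch under stated assumptions: memory is a Python `bytes` object, bits are numbered little-endian within it, and all function names are hypothetical. It uses the last variant mentioned, a 64-bit load that ignores the lowest 5 pointer bits, paired with an extract that uses only those 5 bits.

```python
def make_bit_pointer(byte_addr: int, bit: int = 0) -> int:
    """A pointer as we use it now, shifted left by 3 to leave room for a bit index."""
    return (byte_addr << 3) | bit


def load_byte(mem: bytes, bitptr: int) -> int:
    """Normal byte load: signals an error if the low 3 bits are not 0."""
    if bitptr & 7:
        raise ValueError("misaligned byte access")
    return mem[bitptr >> 3]


def load64_ignoring_low5(mem: bytes, bitptr: int) -> int:
    """64-bit load from a 32-bit-aligned address; the low 5 pointer bits are ignored."""
    byte_addr = (bitptr >> 5) << 2
    return int.from_bytes(mem[byte_addr:byte_addr + 8], "little")


def extract(word: int, bitptr: int, width: int) -> int:
    """Bit-extract using only the low 5 pointer bits as the shift count."""
    return (word >> (bitptr & 31)) & ((1 << width) - 1)


def load_bitfield(mem: bytes, bitptr: int, width: int) -> int:
    """Load + extract; shift (at most 31) plus width (at most 32) fits in 64 bits."""
    if not 1 <= width <= 32:
        raise ValueError("width must be 1..32")
    return extract(load64_ignoring_low5(mem, bitptr), bitptr, width)


# A 20-bit field at bit 27 (straddling bit 32) and a 10-bit field at bit 60.
image = (0xABCDE << 27) | (0x2AB << 60)
mem = image.to_bytes(16, "little")
print(hex(load_bitfield(mem, 27, 20)), hex(load_bitfield(mem, 60, 10)))
```

    Neither field needs a straddle test, because the load window always starts at most 31 bits before the field.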


    Stefan

  • From Stephen Fuld@21:1/5 to Michael S on Wed May 8 16:09:32 2024
    Michael S wrote:

    On Wed, 8 May 2024 14:25:15 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register.
    Instead, each byte of the bus could have a direct wired path to the
    low 8-bits of the internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift the
    new field into it, and then store it back. You already have to do
    something like that for byte stores but bit addressing makes it 8
    times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a
    byte accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the
    LD/ST pipeline, 2) most programs use as little bit-fielding as
    possible (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load
    variant that allowed you to address bit fields. Would it be
    slower than a "normal" byte oriented load? Almost certainly.
    But would it be faster than doing all the shifts, masks, word
    crossing calculations, etc. via extra instructions? Again,
    almost certainly. So you keep the benefits of byte oriented
    loads most of the time, but have "reasonable" access to bit
    fields when you need them, faster than without the
    extra instructions. Hopefully the best of both worlds.





    When you load bit field from memory, there is very high chance
    that you would want adjacent bit field soon thereafter.
    Think about it.

    Which means that you would like to have a dedicated streaming
    buffer cache for the EXTR operation?

    Terje



    That's not what I wanted to hint to Stephen.
    I wanted to hint that in the typical situation, i.e. when one 32-bit or
    64-bit load serves several bit field extractions, his additional
    instruction would be slower rather than faster than existing practice.


    Perhaps. But if you aren't absolutely sure that the next field doesn't
    cross a 64 bit boundary, then you have to test for that, and if it does,
    add more instructions to handle it. If that happens, your advantage is
    lost. Even the test and conditional jump/predication when you don't
    cross the boundary makes it pretty close.

    And, as I mentioned in a previous post, I would expect higher end implementations to make use of some sort of stream buffer, as Terje
    suggests.
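
    The boundary test described above can be sketched like this, assuming the data is held as a list of 64-bit words with little-endian bit numbering (`extract_field` is a hypothetical name):

```python
MASK64 = (1 << 64) - 1


def extract_field(words: list[int], bitpos: int, width: int) -> int:
    """Extract `width` bits starting at absolute bit position `bitpos`."""
    index, shift = divmod(bitpos, 64)
    value = words[index] >> shift
    if shift + width > 64:
        # The field crosses a 64-bit boundary: a second word and extra
        # shift/OR work are needed. This is the test being discussed.
        value |= words[index + 1] << (64 - shift)
    return value & ((1 << width) - 1)


# A 5-bit field at bit 3 and a 12-bit field at bit 60 (crossing into word 1).
image = (0x1F << 3) | (0xABC << 60)
words = [image & MASK64, image >> 64]
print(hex(extract_field(words, 3, 5)), hex(extract_field(words, 60, 12)))
```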






    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Terje Mathisen@21:1/5 to Michael S on Wed May 8 19:04:23 2024
    Michael S wrote:
    On Wed, 8 May 2024 14:25:15 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register. Instead,
    each byte of the bus could have a direct wired path to the low
    8-bits of the internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift the
    new field into it, and then store it back. You already have to do
    something like that for byte stores but bit addressing makes it 8
    times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a
    byte accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
    pipeline, 2) most programs use as little bit-fielding as possible
    (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load variant
    that allowed you to address bit fields. Would it be slower than a
    "normal" byte oriented load? Almost certainly. But would it be
    faster than doing all the shifts, masks, word crossing
    calculations, etc. via extra instructions? Again, almost
    certainly. So you keep the benefits of byte oriented loads most
    of the time, but have "reasonable" access to bit fields when you
    need them, faster than without the extra instructions. Hopefully
    the best of both worlds.





    When you load bit field from memory, there is very high chance that
    you would want adjacent bit field soon thereafter.
    Think about it.

    Which means that you would like to have a dedicated streaming buffer
    cache for the EXTR operation?

    Terje



    That's not what I wanted to hint to Stephen.
    I wanted to hint that in the typical situation, i.e. when one 32-bit or
    64-bit load serves several bit field extractions, his additional
    instruction would be slower rather than faster than existing practice.


    Yeah, as I wrote earlier, in my own code I tend to use a register as my
    buffer and keep it bottom-aligned at all times, i.e. end each extraction
    by a SHR buffer, token_len

    This means that most of the time, the buffer reg already contains all
    the bits of the next token.
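
    A minimal sketch of that extraction step, with a Python integer standing in for the buffer register (`take_token` and the sample tokens are illustrative assumptions):

```python
def take_token(buffer: int, token_len: int) -> tuple[int, int]:
    """Return (token, new_buffer); the buffer stays bottom-aligned."""
    token = buffer & ((1 << token_len) - 1)  # next token sits in the low bits
    return token, buffer >> token_len        # the "SHR buffer, token_len" step


# Three tokens packed low-first: 0b101 (3 bits), 0b0110 (4 bits), 0b11 (2 bits).
buffer = 0b101 | (0b0110 << 3) | (0b11 << 7)

first, buffer = take_token(buffer, 3)
second, buffer = take_token(buffer, 4)
third, buffer = take_token(buffer, 2)
print(first, second, third)
```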

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stephen Fuld@21:1/5 to Terje Mathisen on Wed May 8 17:27:35 2024
    Terje Mathisen wrote:

    Michael S wrote:
    On Wed, 8 May 2024 14:25:15 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory
    doesn't mean it has to implement fetching a single byte
    from memory by doing a shift and mask operation in a
    64-bit register. Instead, each byte of the bus could
    have a direct wired path to the low 8-bits of the
    internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you
    probably have to fetch the whole word or cache line or
    whatever, shift the new field into it, and then store it
    back. You already have to do something like that for byte
    stores but bit addressing makes it 8 times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient
    as a byte accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte
    ISA on typical codes:: doubtful. Two reasons:: 1) more
    delay in the LD/ST pipeline, 2) most programs use as little bit-fielding as possible (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load
    variant that allowed you to address bit fields. Would it be
    slower than a "normal" byte oriented load? Almost certainly.
    But would it be faster than doing all the shifts, masks, word crossing calculations, etc. via extra instructions? Again,
    almost certainly. So you keep the benefits of byte oriented
    loads most of the time, but have "reasonable" access to bit
    fields when you need them, faster than without the
    extra instructions. Hopefully the best of both worlds.





    When you load bit field from memory, there is very high chance
    that you would want adjacent bit field soon thereafter.
    Think about it.

    Which means that you would like to have a dedicated streaming
    buffer cache for the EXTR operation?

    Terje



    That's not what I wanted to hint to Stephen.
    I wanted to hint that in the typical situation, i.e. when one 32-bit or
    64-bit load serves several bit field extractions, his additional instruction would be slower rather than faster than existing
    practice.


    Yeah, as I wrote earlier, in my own code I tend to use a register as
    my buffer and keep it bottom-aligned at all times, i.e. end each
    extraction by a SHR buffer, token_len

    This means that most of the time, the buffer reg already contains all
    the bits of the next token.


    The key word being "most". If it isn't "always", you have to test for
    the condition. That test and the conditional branch reduce, and
    perhaps eliminate, the advantage.





    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Terje Mathisen@21:1/5 to Stephen Fuld on Wed May 8 19:16:09 2024
    Stephen Fuld wrote:
    Michael S wrote:

    On Wed, 8 May 2024 14:25:15 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory doesn't
    mean it has to implement fetching a single byte from memory by
    doing a shift and mask operation in a 64-bit register. Instead,
    each byte of the bus could have a direct wired path to the low
    8-bits of the internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you probably
    have to fetch the whole word or cache line or whatever, shift the
    new field into it, and then store it back. You already have to do
    something like that for byte stores but bit addressing makes it 8
    times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient as a
    byte accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte ISA on
    typical codes:: doubtful. Two reasons:: 1) more delay in the
    LD/ST pipeline, 2) most programs use as little bit-fielding as
    possible (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load
    variant that allowed you to address bit fields. Would it be
    slower than a "normal" byte oriented load? Almost certainly.
    But would it be faster than doing all the shifts, masks, word
    crossing calculations, etc. via extra instructions? Again,
    almost certainly. So you keep the benefits of byte oriented
    loads most of the time, but have "reasonable" access to bit
    fields when you need them, faster than without the
    extra instructions. Hopefully the best of both worlds.





    When you load bit field from memory, there is very high chance
    that you would want adjacent bit field soon thereafter.
    Think about it.

    Which means that you would like to have a dedicated streaming
    buffer cache for the EXTR operation?

    Terje



    That's not what I wanted to hint to Stephen.
    I wanted to hint that in the typical situation, i.e. when one 32-bit or
    64-bit load serves several bit field extractions, his additional
    instruction would be slower rather than faster than existing practice.


    Perhaps. But if you aren't absolutely sure that the next field doesn't
    cross a 64 bit boundary, then you have to test for that, and if it does,
    add more instructions to handle it. If that happens, your advantage is
    lost. Even the test and conditional jump/predication when you don't
    cross the boundary makes it pretty close.

    And, as I mentioned in a previous post, I would expect higher end implementations to make use of some sort of stream buffer, as Terje
    suggests.

    In typical codecs, tokens are mostly 2-3 to 8-10 bits long, so by having
    a 64-bit buffer which always contains at least 32 bits, you don't need
    to worry about any straddles, and for strings of shorter tokens, you
    don't even need to check if a reload/buffer fill-up is needed.
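
    A sketch of such a buffer policy, assuming a little-endian bit stream and hypothetical names (`BitReader`, `pack_tokens`). For simplicity it checks the fill level after every read, whereas the point above is that a run of short tokens can skip that check.

```python
class BitReader:
    """Bottom-aligned bit buffer that is topped up whenever it drops below 32 bits."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0      # next unread byte in `data`
        self.buffer = 0   # pending bits, next token in the low bits
        self.count = 0    # number of valid bits in `buffer`
        self._refill()

    def _refill(self) -> None:
        # Add up to 32 bits at a time, so count stays below 64 and the
        # buffer would fit a 64-bit register.
        if self.count < 32:
            chunk = self.data[self.pos:self.pos + 4]
            self.buffer |= int.from_bytes(chunk, "little") << self.count
            self.count += 8 * len(chunk)
            self.pos += len(chunk)

    def read(self, token_len: int) -> int:
        if not 1 <= token_len <= 32:
            raise ValueError("token_len must be 1..32")
        if token_len > self.count:
            raise EOFError("bit stream exhausted")
        token = self.buffer & ((1 << token_len) - 1)
        self.buffer >>= token_len
        self.count -= token_len
        self._refill()
        return token


def pack_tokens(tokens: list[tuple[int, int]]) -> bytes:
    """Pack (value, width) pairs low-first into a little-endian byte string."""
    stream, pos = 0, 0
    for value, width in tokens:
        stream |= value << pos
        pos += width
    return stream.to_bytes((pos + 7) // 8, "little")


tokens = [(5, 3), (6, 4), (3, 2), (1000, 10), (0xDEADBEEF, 32), (1, 1)]
reader = BitReader(pack_tokens(tokens))
decoded = [reader.read(width) for _, width in tokens]
print(decoded)
```

    Because at least 32 bits are present after each refill (until the stream runs out), no token of up to 32 bits ever straddles a reload.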

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Stephen Fuld on Wed May 8 19:47:34 2024
    Stephen Fuld wrote:
    Terje Mathisen wrote:

    Michael S wrote:
    On Wed, 8 May 2024 14:25:15 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 7 May 2024 06:35:53 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    MitchAlsup1 wrote:

    John Levine wrote:

    According to John Savard <[email protected]d>:
    On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
    <[email protected]> wrote:

    Why do you think bit addressing will be
    faster than shifting and masking? ...

    So just because a processor has a 64-bit bus to memory
    doesn't mean it has to implement fetching a single byte
    from memory by doing a shift and mask operation in a
    64-bit register. Instead, each byte of the bus could
    have a direct wired path to the low 8-bits of the
    internal data bus feeding the registers.

    I was more thinking about storing bit fields, where you
    probably have to fetch the whole word or cache line or
    whatever, shift the new field into it, and then store it
    back. You already have to do something like that for byte
    stores but bit addressing makes it 8 times as hairy.

    Which is no different than ECC, BTW...

    Could someone invent a bit field ISA that was as efficient
    as a byte accessible architecture:: probably.

    Could this bit accessible architecture outperform a byte
    ISA on typical codes:: doubtful. Two reasons:: 1) more
    delay in the LD/ST pipeline, 2) most programs use as little
    bit-fielding as possible (not as much as practical) !!!


    Some time ago, I proposed an additional instruction, a load
    variant that allowed you to address bit fields. Would it be
    slower than a "normal" byte oriented load? Almost certainly.
    But would it be faster than doing all the shifts, masks, word
    crossing calculations, etc. via extra instructions? Again,
    almost certainly. So you keep the benefits of byte oriented
    loads most of the time, but have "reasonable" access to bit
    fields when you need them, faster than without the
    extra instructions. Hopefully the best of both worlds.





    When you load bit field from memory, there is very high chance
    that you would want adjacent bit field soon thereafter.
    Think about it.

    Which means that you would like to have a dedicated streaming
    buffer cache for the EXTR operation?

    Terje



    That's not what I wanted to hint to Stephen.
    I wanted to hint that in the typical situation, i.e. when one 32-bit or
    64-bit load serves several bit field extractions, his additional
    instruction would be slower rather than faster than existing
    practice.


    Yeah, as I wrote earlier, in my own code I tend to use a register as
    my buffer and keep it bottom-aligned at all times, i.e. end each
    extraction by a SHR buffer, token_len

    This means that most of the time, the buffer reg already contains all
    the bits of the next token.


    The key word being "most". If it isn't "always", you have to test for
    the condition. That test and the conditional branch reduce, and
    perhaps eliminate, the advantage.

    It was exactly these kinds of optimizations I made in order to double
    the speed of Intel's reference BluRay decoder. However, instead of
    asking me to write a complete version they decided to licence a piece of
    VLSI to do it in hardware, and that was almost certainly the correct
    decision since my code needed 4 cores working nearly 100% in order to
    handle the highest possible size/speed quality (1080p, 60 Hz, CABAC
    encoding and 40 Mbit/s bitrate).

    With a hw decoder a laptop can show film for hours on battery power.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to BGB on Wed May 8 21:38:01 2024
    BGB wrote:

    Though, had noticed recently that a lot of typos seem to escape my
    notice on my end. This is possibly a downside of using a 9pt font on a
    4K monitor (22 inch) with 100% UI zoom (*). Can fit more stuff on
    screen, but potentially not the most easily readable experience.

    Why so small ?? My monitor is 32" and if I were to replace it with a 4K
    monitor it would be 40-42".

  • From John Levine@21:1/5 to All on Thu May 9 01:24:54 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

    It doesn't make sense to say that character strings are big- or little-
    endian.

    Yes it does, for just about any encoding other than UTF-8. Thus, you have UTF16BE, and UTF16LE.

    Not really, those are byte orders within a character, not order of characters.
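
    To illustrate the difference between byte order inside a code unit and the order of the units themselves, here is a small sketch using Python's built-in UTF-16 codecs and a character outside the Basic Multilingual Plane (U+1F600, chosen arbitrarily), which UTF-16 encodes as two 16-bit units:

```python
ch = "\U0001F600"

big = ch.encode("utf-16-be")
little = ch.encode("utf-16-le")

# Reassemble the 16-bit code units from each byte stream.
units_big = [int.from_bytes(big[i:i + 2], "big") for i in range(0, len(big), 2)]
units_little = [int.from_bytes(little[i:i + 2], "little") for i in range(0, len(little), 2)]

print(big.hex(), [hex(u) for u in units_big])
print(little.hex(), [hex(u) for u in units_little])
```

    The bytes within each unit are swapped between the two encodings, but the sequence of units is identical in both.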

    If you look at surrogates, you can see UTF16 is big-endian. First there's the high surrogate, then the low one.

    There's a reason that every encoding other than UTF-8 is dead. Who needs the grief?
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Savard@21:1/5 to [email protected] on Wed May 8 20:50:53 2024
    On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

    But the third item is character strings, used in input and output to
    represent numbers. They should be the same as packed decimal to make
    conversion between the two simpler.

    No, because character string conversion is subject to localization issues.

    I agree that little-endian computers make sense for people whose
    native language is Hebrew or Arabic.

    John Savard

  • From John Savard@21:1/5 to [email protected] on Thu May 9 15:01:55 2024
    On Wed, 08 May 2024 20:50:53 -0600, John Savard
    <[email protected]d> wrote:

    On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

    But the third item is character strings, used in input and output to
    represent numbers. They should be the same as packed decimal to make
    conversion between the two simpler.

    No, because character string conversion is subject to localization issues.

    I agree that little-endian computers make sense for people whose
    native language is Hebrew or Arabic.

    Still, I get your point. My thinking is stuck in the days of card
    readers and line printers. Yes, one called a subroutine to print
    numbers, but what it did was convert them to the format used in North
    America and the United Kingdom, in accordance with any parameters in
    the call that were hard-coded into the program.

    The idea of programs as applications, to be distributed far and wide,
    to people with computers of their own, where the operating system
    could impose localization options on the display of numbers that
    programs would usually allow themselves to accept... the situation
    with newfangled operating systems like Microsoft Windows... is still
    one that is only gradually beginning to dawn on me.

    I do suspect, though, that programs like, say, dBase II, which store
    numbers in files internally as character strings, don't vary that
    format according to localization. Some binary to string conversions go
    through the localization mechanisms, but not all of them, and so
    string forms are _not_ wholly irrelevant.

    An embedded processor in, say, a digital voltmeter... is not going to
    have a localization layer to contend with. The makers of digital
    voltmeters will find other ways of addressing international markets.

    John Savard

  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri May 10 09:18:43 2024
    On 08/05/2024 04:11, Lawrence D'Oliveiro wrote:
    On Tue, 07 May 2024 19:16:40 -0600, John Savard wrote:

    Character strings are in big-endian order.

    Better thought of as “character strings are stored so ascending addresses
    correspond to logical reading order”. Note I didn’t say “display order”,
    since that can be quite different.

    Packed decimal strings should be in the same order as character strings,
    so that the relationship between the two is simple and conversion
    between the two is quick.

    Now here you are getting into cultural issues. For example, while both
    Arabic and Hebrew use decimal numbers, they write the digits in opposite
    order.


    Do you mean that when they write "123" with "1" on the left, they mean
    the number "three hundred and twenty one" rather than "one hundred and
    twenty three"? Or do you mean that where we write the digit "1" first
    when writing left to right, they write the digit "3" first going right
    to left?

    My understanding was that for both languages, and indeed any other
    language that uses Arabic numerals, digits are written big-endian read
    from the left. Thus "123", with the digit "1" on the left, means the
    same in Arabic, English, Hebrew, Chinese, or any other language using
    them. Anything else would be massively confusing.

    Many cultures and languages have additional numeric systems they use as
    well as the common Arabic numerals. Some use their own systems as
    standard, some just for specific purposes (just as English speakers use
    Roman numerals for some purposes). And some of these are read
    right-to-left rather than left-to-right (not necessarily matching the
    order of their text), others use different symbols for the weighting.

    As far as I know, in Hebrew numbers are usually written with
    Western-style Arabic numerals, in the same order as everywhere else.
    But they also use a more traditional letter-based system for dates,
    religious works, and so on. Those are additive rather than strictly
    positional (at least up to a limit).

    And in written Arabic, Eastern-style Arabic numerals are used,
    corresponding directly to Western-style Arabic numerals but with
    somewhat different forms - the order is still most significant digit on
    the left.


    (I have a book on the history of number systems throughout the world,
    but it is a /long/ time since I read it.)

  • From David Brown@21:1/5 to John Levine on Fri May 10 09:31:00 2024
    On 09/05/2024 03:24, John Levine wrote:
    According to Lawrence D'Oliveiro <[email protected]d>:
    On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

    It doesn't make sense to say that character strings are big- or
    little-endian.

    Yes it does, for just about any encoding other than UTF-8. Thus, you have
    UTF16BE, and UTF16LE.

    Not really, those are byte orders within a character, not order of characters.


    Or rather, they are byte orders used by different encodings of code
    points. ("Characters" in Unicode are more complicated - nothing is ever
    simple in Unicode!) There are no endian issues between code points, and
    a "string" as far as Unicode is concerned would be a sequence of code
    points. You only have endian issues if you want to store these 21-bit
    integers in a format that is encoded in smaller lumps (like
    byte-addressed memory).

    If you look at surrogates, you can see UTF16 is big-endian. First there's
    the high surrogate, then the low one.

    There's a reason that every encoding other than UTF-8 is dead. Who needs
    the grief?

    Indeed.

    UTF-32 is fine for internal use, however - using whatever endianness
    your processor prefers. The trick is never to let it leave the one
    computer in any encoding other than UTF-8.
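
    A minimal Python sketch of the byte-order point, assuming U+1F600 as an
    arbitrary example code point outside the BMP. The variable names are
    illustrative only. It shows that the BE/LE variants differ only in the
    byte order inside each 16-bit code unit; the high surrogate comes first
    in both, and UTF-8 has no byte-order variants at all.

```python
# U+1F600 is an arbitrary example; any code point above U+FFFF needs surrogates.
cp = "\U0001F600"

utf16_be = cp.encode("utf-16-be")
utf16_le = cp.encode("utf-16-le")
utf8 = cp.encode("utf-8")

# High surrogate 0xD83D, low surrogate 0xDE00.
print(utf16_be.hex(" "))  # d8 3d de 00
print(utf16_le.hex(" "))  # 3d d8 00 de  (bytes swapped within each unit only)
print(utf8.hex(" "))      # f0 9f 98 80  (one defined byte sequence)
```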

  • From Scott Lurndal@21:1/5 to John Savard on Fri May 10 13:09:53 2024
    John Savard <[email protected]d> writes:
    On Wed, 08 May 2024 20:50:53 -0600, John Savard
    <[email protected]d> wrote:

    On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

    But the third item is character strings, used in input and output to
    represent numbers. They should be the same as packed decimal to make
    conversion between the two simpler.

    No, because character string conversion is subject to localization issues.

    I agree that little-endian computers make sense for people whose
    native language is Hebrew or Arabic.

    Still, I get your point. My thinking is stuck in the days of card
    readers and line printers. Yes, one called a subroutine to print
    numbers, but what it did was convert them to the format used in North
    America and the United Kingdom, in accordance with any parameters in
    the call that were hard-coded into the program.

    The idea of programs as applications, to be distributed far and wide,
    to people with computers of their own, where the operating system
    could impose localization options on the display of numbers that
    programs would usually allow themselves to accept

    I actually was responsible for the I18N and L10N support in
    the Burroughs MCP (for Medium systems) in the 80's, so it's
    not something that Microsoft "invented". At the time, it
    was mainly for Europe, and Japan (Katakana).

  • From Anton Ertl@21:1/5 to David Brown on Fri May 10 16:20:47 2024
    David Brown <[email protected]> writes:
    UTF-32 is fine for internal use, however - using whatever endianness
    your processor prefers. The trick is never to let it leave the one
    computer in any encoding other than UTF-8.

    An unnecessary complication.

    1) I only came up with the following use cases where you need to deal
    with individual non-ASCII characters: Palindrome checkers and anagram
    programs; I remember somebody mentioning a third use (which I promptly
    forgot), but anyway, there are few cases.

    2) But even for those few cases, UTF-32 is not good enough, because a
    code point is not a character.
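
    A small Python illustration of why a code point is not a character,
    assuming "e" plus U+0301 COMBINING ACUTE ACCENT as the example. Python
    strings index by code point, much as a UTF-32 array would, so the naive
    reversal a palindrome checker might use detaches the accent from its
    base letter.

```python
import unicodedata

s = "e\u0301"                      # one perceived character, two code points
utf32 = s.encode("utf-32-le")

n_code_points = len(s)             # 2
n_utf32_units = len(utf32) // 4    # 2: UTF-32 does not merge them either

# Reversing by code point puts the combining mark before the base letter.
reversed_naive = s[::-1]

print(n_code_points, n_utf32_units)
print([unicodedata.name(c) for c in reversed_naive])
```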

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Savard@21:1/5 to All on Fri May 10 18:49:31 2024
    On Thu, 2 May 2024 18:28:18 +0000, [email protected] (MitchAlsup1)
    wrote:

    John Savard wrote:

    On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro


    Plus, if you load a single precision float into a floating-point
    register, you are loading on the left side, not the right side, so the

    In My 66000, floats are stored on the right side of the register
    {mostly because I do not have FP LD/STs.}

    And _not only_ do I have FP loads and stores, but one of the things
    they *do* is convert floats (if needed) to an internal form so that
    the exponent is of the exact same form, in the same position, for all
    the floats of that type.

    The Compatible Floating Point loads and stores - those are the ones
    for hexadecimal S/360 floats - just do left-aligned raw loads and
    stores in the FP registers, since their exponents are all in the same
    form.

    But the regular ones, for IEEE 754 floats, convert everything to look
    like the old 8087 temporary real format. Possibly with an extra
    exponent bit to accommodate the new 128-bit format defined in IEEE 754.

    Of course, you may rightfully say that is crazy - if I did a
    computation saving everything in memory, or using short vectors (where
    this conversion doesn't take place) then the computation strictly
    observes the exponent range, but if I do one in registers, a
    calculation could continue normally where an intermediate result ought
    to have underflowed by a little bit.

    But here I'm following Seymour Cray - sacrifice everything else for
    speed. Although 'within reason'; _except for division_ I keep IEEE 754
    exact results.
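
    The underflow difference described above can be imitated in Python, as a
    rough sketch only. The assumption here is that a Python float (an IEEE
    754 double) stands in for a register format with a wider exponent range,
    and that the hypothetical helper to_f32 models a store to
    single-precision memory. It says nothing about the actual register
    format of the design being discussed.

```python
import struct

def to_f32(x: float) -> float:
    """Round a double to the nearest single-precision value and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = to_f32(1e-30)
b = to_f32(1e30)

# Every intermediate is rounded to single precision, as if stored to memory.
stored = to_f32(to_f32(a * a) * b)

# The intermediate keeps the wider exponent range; only the result is rounded.
in_register = to_f32((a * a) * b)

print(stored)       # 0.0, because a*a (about 1e-60) underflows in single precision
print(in_register)  # close to 1e-30
```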

    John Savard

  • From David Brown@21:1/5 to Anton Ertl on Sat May 11 15:33:55 2024
    On 10/05/2024 18:20, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    UTF-32 is fine for internal use, however - using whatever endianness
    your processor prefers. The trick is never to let it leave the one
    computer in any encoding other than UTF-8.

    An unnecessary complication.

    1) I only came up with the following use cases where you need to deal
    with individual non-ASCII characters: Palindrome checkers and anagram
    programs; I remember somebody mentioning a third use (which I promptly
    forgot), but anyway, there are few cases.

    2) But even for those few cases, UTF-32 is not good enough, because a
    code point is not a character.


    I agree that it is usually unnecessary to convert to UTF-32 - I am
    merely saying that /if/ you feel you want to expand the code points,
    then UTF-32 is fine for the purpose and you should not have to worry
    about endianness because you should not be moving it off your computer,
    thus native endianness is all you need.

    People sometimes say they want to expand to code points to be able to
    see the length of the string in characters, or to index characters, or
    for easier splicing or joining strings. I don't think these are
    particularly useful in practice, but UTF-32 is fine for those that want it.

  • From Anton Ertl@21:1/5 to David Brown on Sat May 11 15:31:49 2024
    David Brown <[email protected]> writes:
    On 10/05/2024 18:20, Anton Ertl wrote:
    1) I only came up with the following use cases where you need to deal
    with individual non-ASCII characters: Palindrome checkers and anagram
    programs; I remember somebody mentioning a third use (which I promptly
    forgot), but anyway, there are few cases.

    2) But even for those few cases, UTF-32 is not good enough, because a
    code point is not a character.


    I agree that it is usually unnecessary to convert to UTF-32 - I am
    merely saying that /if/ you feel you want to expand the code points,
    then UTF-32 is fine for the purpose and you should not have to worry
    about endianness because you should not be moving it off your computer,
    thus native endianness is all you need.

    Yes. The point I wanted to make is that there is the frequent
    misconception that dealing with individual arbitrary characters is
    something that is relatively common, and that one can do that by using
    UTF-32 (or UTF-16); it isn't, and one cannot. If you stick with UTF-8
    and use byte lengths and byte indexes, you can do almost everything as
    well or better (with less complication and more efficiently) as by
    converting to UTF-32 and back.

    People sometimes say they want to expand to code points to be able to
    see the length of the string in characters, or to index characters, or
    for easier splicing or joining strings. I don't think these are
    particularly useful in practice, but UTF-32 is fine for those that want it.

    Looking up "splicing strings", I find that this is a term used in
    connection with Python for specifying substrings. Python3 is a
    language that lives the codepoint mistake to the extreme (and from
    what I read, this was one of the major pain points in the
    Python2->Python3 transition), but anyway, with UTF-8 one way to
    represent a substring is to use the start index and length in bytes
    (aka code units) rather than code points.

    Looking up "joining strings" brings up the Python join() method, which
    is a variant of string concatenation. There is certainly no need to
    convert UTF-8 to UTF-32 and back for concatenating strings; just
    concatenate the UTF-8 strings.
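
    A minimal Python sketch of both operations on raw UTF-8 bytes, with a
    made-up sample string. The substring is represented by a start index and
    a length in bytes, and concatenation is plain byte concatenation; no
    conversion to code points happens along the way.

```python
# Sample text is an assumption for illustration ("naive cafe" with accents).
text = "na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

start = text.find(needle)   # byte offset of the match
length = len(needle)        # length in bytes (code units)
sub = text[start:start + length]

# Joining two valid UTF-8 strings yields valid UTF-8.
joined = text + b" " + needle

print(start, length)
print(sub.decode("utf-8"))
print(joined.decode("utf-8"))
```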

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From David Brown@21:1/5 to Anton Ertl on Sat May 11 18:49:12 2024
    On 11/05/2024 17:31, Anton Ertl wrote:
    David Brown <[email protected]> writes:
    On 10/05/2024 18:20, Anton Ertl wrote:
    1) I only came up with the following use cases where you need to deal
    with individual non-ASCII characters: Palindrome checkers and anagram
    programs; I remember somebody mentioning a third use (which I promptly
    forgot), but anyway, there are few cases.

    2) But even for those few cases, UTF-32 is not good enough, because a
    code point is not a character.


    I agree that it is usually unnecessary to convert to UTF-32 - I am
    merely saying that /if/ you feel you want to expand the code points,
    then UTF-32 is fine for the purpose and you should not have to worry
    about endianness because you should not be moving it off your computer,
    thus native endianness is all you need.

    Yes. The point I wanted to make is that there is the frequent
    misconception that dealing with individual arbitrary characters is
    something that is relatively common, and that one can do that by using
    UTF-32 (or UTF-16); it isn't, and one cannot. If you stick with UTF-8
    and use byte lengths and byte indexes, you can do almost everything as
    well or better (with less complication and more efficiently) as by
    converting to UTF-32 and back.


    Agreed.

    People sometimes say they want to expand to code points to be able to
    see the length of the string in characters, or to index characters, or
    for easier splicing or joining strings. I don't think these are
    particularly useful in practice, but UTF-32 is fine for those that want it.

    Looking up "splicing strings", I find that this is a term used in
    connection with Python for specifying substrings. Python3 is a
    language that lives the codepoint mistake to the extreme (and from
    what I read, this was one of the major pain points in the
    Python2->Python3 transition), but anyway, with UTF-8 one way to
    represent a substring is to use the start index and length in bytes
    (aka code units) rather than code points.


    I was not thinking of Python in particular, and I don't think the term
    "splicing" is Python specific. But Python is generally a good and
    popular language when you need to do lots of text manipulation, so maybe
    that's where the association comes from (at least for search engines).

    People often think it is easier to do string manipulation - joining,
    splitting, replacing, etc., - when you have fixed size units per
    character. I agree with you that this is not actually true, especially
    if you want to support arbitrary Unicode characters (such as combining
    characters) that don't fit in a single code point. But it is not
    uncommon to think it is, and if you can make some simplifications to the
    text you support (specifically, limiting your code to single code point
    characters) then UTF-32 can be helpful. (I think everyone will at least
    agree that it's better than UTF-16!)

    Looking up "joining strings" brings up the Python join() method, which
    is a variant of string concatenation. There is certainly no need to
    convert UTF-8 to UTF-32 and back for concatenating strings; just
    concatenate the UTF-8 strings.


    Sure.

  • From Anton Ertl@21:1/5 to David Brown on Sat May 11 17:39:30 2024
    David Brown <[email protected]> writes:
    People often think it is easier to do string manipulation - joining,
    splitting, replacing, etc., - when you have fixed size units per
    character.

    But they are wrong. Fixed-size units per character are unnecessary
    and not helpful for joining, splitting, and replacing. And for nearly
    all of "etc.".

    But it is not
    uncommon to think it is, and if you can make some simplifications to the
    text you support (specifically, limiting your code to single code point
    characters) then UTF-32 can be helpful.

    Yes, many people think so, but they are mistaken.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun May 12 07:34:35 2024
    Anton Ertl <[email protected]> schrieb:
    The point I wanted to make is that there is the frequent
    misconception that dealing with individual arbitrary characters is
    something that is relatively common, and that one can do that by using
    UTF-32 (or UTF-16); it isn't, and one cannot.

    Do you really mean one cannot change an individual character
    using UTF-32? I assume you mean "there is no need to do it".

    If you stick with UTF-8
    and use byte lengths and byte indexes, you can do almost everything as
    well or better (with less complication and more efficiently) as by
    converting to UTF-32 and back.

    Assume you're implementing a language which has a function of
    setting an individual character in a string. How would you
    implement it? Run through the string? Would you then also
    store additional information somewhere so that the next character
    that the user sets does not need to do it again?

    Just curious...
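
    One possible answer to the "run through the string" variant, as a hedged
    Python sketch and not a claim about how anyone in this thread would do
    it. The helper names byte_offset_of and set_code_point are made up, the
    input is assumed to be valid UTF-8, and "character" here means a single
    code point. No offset cache is kept, so every call rescans from the
    start.

```python
def byte_offset_of(buf: bytes, index: int) -> int:
    """Byte offset of the code point numbered `index` (0-based), by linear scan."""
    seen = -1
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:      # not a continuation byte: a code point starts here
            seen += 1
            if seen == index:
                return pos
    raise IndexError(index)

def set_code_point(buf: bytes, index: int, new_char: str) -> bytes:
    """Return a copy of `buf` with code point `index` replaced by `new_char`."""
    start = byte_offset_of(buf, index)
    end = start + 1
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1                     # skip the continuation bytes of the old code point
    # The replacement may be shorter or longer than what it replaces.
    return buf[:start] + new_char.encode("utf-8") + buf[end:]

print(set_code_point("caf\u00e9!".encode("utf-8"), 3, "e"))  # b'cafe!'
```

    A language runtime that sees repeated indexed writes could cache the
    last (index, byte offset) pair to avoid rescanning, which is the kind of
    additional information the question alludes to.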

  • From Michael S@21:1/5 to Thomas Koenig on Sun May 12 11:39:14 2024
    On Sun, 12 May 2024 07:34:35 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Anton Ertl <[email protected]> schrieb:
    The point I wanted to make is that there is the frequent
    misconception that dealing with individual arbitrary characters is
    something that is relatively common, and that one can do that by
    using UTF-32 (or UTF-16); it isn't, and one cannot.

    Do you really mean one cannot change an individual character
    using UTF-32? I assume you mean "there is no need to do it"..


    I would think that Anton meant to say that UCS-4/UTF-32 code point is
    not the same as individual character.

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Tue May 14 22:36:54 2024
    Paul A. Clayton wrote:

    On 5/6/24 3:13 PM, MitchAlsup1 wrote:


    Placing bit-field access INSIDE LDs and STs requires adding 2 stages
    of multiplexing in the LD/ST aligners (memory shifters). This has the
    potential to slow the overall pipeline frequency--at which point you
    have lost more than you can gain.

    The extra shifting could be applied only for bit-granular
    accesses, so byte-granular accesses could have normal latency.
    (Bit-field loads would have higher latency.)

    If you only "apply" the bit level multiplexing when needed, instead
    of having 2 added gate delays you now have 3 !!

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Mon May 27 01:05:36 2024
    On Thu, 9 May 2024 01:24:54 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

    It doesn't make sense to say that character strings are big- or
    little-endian.

    Yes it does, for just about any encoding other than UTF-8. Thus, you
    have UTF16BE, and UTF16LE.

    Not really, those are byte orders within a character ...

    Within an integer character code. Which is exactly what endianness is all about.

  • From Lawrence D'Oliveiro@21:1/5 to David Brown on Mon May 27 01:15:10 2024
    On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:

    People often think it is easier to do string manipulation - joining, splitting, replacing, etc., - when you have fixed size units per
    character.

    It is easy enough to come up with a fixed-size representation for
    characters in Python (or other suitably powerful language), where
    “character” = “non-combining code point plus all immediately-following
    combining code points”. Do all your text manipulation in this internal
    representation, then write it back to regular text in UTF-8 or whatever
    other format you need.
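
    A rough Python sketch of that definition, using the standard
    unicodedata.combining() lookup. The function name clusters is made up.
    This follows only the rule stated in the post; it is not the Unicode
    grapheme cluster algorithm and does not handle ZWJ sequences and similar
    cases.

```python
import unicodedata

def clusters(text: str) -> list[str]:
    """Split text into 'base code point plus following combining marks' units."""
    out: list[str] = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch        # attach the combining mark to the previous unit
        else:
            out.append(ch)       # start a new unit
    return out

units = clusters("e\u0301a")
print(len(units))                # 2
print("".join(units).encode("utf-8"))
```

    Each list element is a reference to a Python string of whatever length
    is needed, so the list is randomly indexable while the units themselves
    remain variable-sized.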

  • From John Levine@21:1/5 to [email protected] on Mon May 27 02:54:49 2024
    It appears that Lawrence D'Oliveiro <[email protected]d> said:
    On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:

    People often think it is easier to do string manipulation - joining,
    splitting, replacing, etc., - when you have fixed size units per
    character.

    It is easy enough to come up with a fixed-size representation for
    characters in Python (or other suitably powerful language), where
    “character” = “non-combining code point plus all immediately-following
    combining code points”.

    I have to ask, how much storage do each of these fixed-size character
    things take?

    How do you know?

    I've been poking at Unicode for a while and I don't have the faintest
    idea, particularly if you include groups of emoji with ZWJ that are
    rendered as one image, as in this ever increasing list. Groups
    can have 9 code points, maybe more:

    https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html
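
    To put numbers on one such group, here is a small Python check using the
    family emoji (man, ZWJ, woman, ZWJ, girl, ZWJ, boy) as an assumed
    example. It is typically rendered as one image, yet takes seven code
    points.

```python
# Four emoji code points joined by three U+200D ZERO WIDTH JOINER code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

n_code_points = len(family)                   # 7
n_utf8_bytes = len(family.encode("utf-8"))    # 4*4 + 3*3 = 25
n_utf32_bytes = len(family.encode("utf-32-le"))  # 7*4 = 28

print(n_code_points, n_utf8_bytes, n_utf32_bytes)
```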

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Mon May 27 07:18:51 2024
    On Mon, 27 May 2024 02:54:49 -0000 (UTC), John Levine wrote:

    It appears that Lawrence D'Oliveiro <[email protected]d> said:

    On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:

    People often think it is easier to do string manipulation - joining,
    splitting, replacing, etc., - when you have fixed size units per
    character.

    It is easy enough to come up with a fixed-size representation for
    characters in Python (or other suitably powerful language), where
    “character” = “non-combining code point plus all immediately-following
    combining code points”.

    I have to ask, how much storage do each of these fixed-size character
    things take?

    That’s not important; what’s important is that you can put characters as elements in an array, randomly accessible just by array index.

  • From John Levine@21:1/5 to All on Mon May 27 15:09:23 2024
    According to Lawrence D'Oliveiro <[email protected]d>:
    It is easy enough to come up with a fixed-size representation for
    characters in Python (or other suitably powerful language), where
    “character” = “non-combining code point plus all immediately-following
    combining code points”.

    I have to ask, how much storage do each of these fixed-size character
    things take?

    That’s not important; what’s important is that you can put characters as
    elements in an array, randomly accessible just by array index.

    How am I supposed to write my code with an array of fixed size things if
    I don't know how big the things are?

    If you mean an array of pointers to sequences of code points, well
    sure, but now we're back to variable length encodings. I know that I
    have no idea how big these fixed size things would have to be and I
    suspect nobody else does either.

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From EricP@21:1/5 to John Levine on Mon May 27 12:45:09 2024
    John Levine wrote:
    According to Lawrence D'Oliveiro <[email protected]d>:
    It is easy enough to come up with a fixed-size representation for
    characters in Python (or other suitably powerful language), where
    “character” = “non-combining code point plus all immediately-following
    combining code points”.
    I have to ask, how much storage do each of these fixed-size character
    things take?
    That’s not important; what’s important is that you can put characters as
    elements in an array, randomly accessible just by array index.

    How am I supposed to write my code with an array of fixed size things if
    I don't know how big the things are?

    If you mean an array of pointers to sequences of code points, well
    sure, but now we're back to variable length encodings. I know that I
    have no idea how big these fixed size things would have to be and I
    suspect nobody else does either.

    One could have instructions that make it easier to parse the
    variable length UTF-8 sequences into codepoints.
    The first byte high order bits tells you the byte run length and also
    how to extract and shift the bit fields to assemble a 4-byte codepoint
    after those 1..4 bytes have been loaded into a register.

    Variable 1 to 8 byte count register load and store instructions could be
    helpful here too. Or lengths of 1..64 bytes if SIMD registers are used,
    because then we could apply Mitch's log_2 parallel parse method to
    multiple codepoints in the wide SIMD register and parse a bunch of
    codepoints in one clock and right justify them.

    It would still have to look up whether a codepoint was combining or
    stand alone. I don't see a firm definition of whether combining codepoints
    come before or after; "after" would require a lookahead parse and so extra
    checks to ensure it doesn't look past the buffer end.
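
    In software, the lead-byte dispatch described above looks roughly like
    the following Python sketch. The function name decode_one is made up,
    and this is a simplified decoder: it checks for truncation and bad
    continuation bytes but does not reject overlong forms, surrogates, or
    values above U+10FFFF.

```python
def decode_one(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one UTF-8 sequence at `pos`; return (code point, byte length)."""
    lead = buf[pos]
    if lead < 0x80:                  # 0xxxxxxx: 1 byte, 7 payload bits
        return lead, 1
    if lead >> 5 == 0b110:           # 110xxxxx: 2 bytes, 5 payload bits in lead
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:        # 1110xxxx: 3 bytes, 4 payload bits in lead
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:       # 11110xxx: 4 bytes, 3 payload bits in lead
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(buf):
        raise ValueError("truncated sequence")
    for b in buf[pos + 1:pos + length]:
        if b & 0xC0 != 0x80:         # continuation bytes must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits
    return cp, length

print(decode_one("\u20ac".encode("utf-8")))  # (8364, 3)
```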

  • From John Levine@21:1/5 to All on Mon May 27 19:09:51 2024
    According to EricP <[email protected]>:
    John Levine wrote:
    If you mean an array of pointers to sequences of code points, well
    sure, but now we're back to variable length encodings. I know that I
    have no idea how big these fixed size things would have to be and I
    suspect nobody else does either.

    One could have instructions that make it easier to parse the
    variable length UTF-8 sequences into codepoints.

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    It would still have to look up whether a codepoint was combining or
    stand alone. I don't see a firm definition of whether combining codepoints
    come before or after; "after" would require a lookahead parse and so extra
    checks to ensure it doesn't look past the buffer end.

    I think they come after but I haven't looked in enough detail. And
    then you have all of the issues with precomposed characters: do you
    normalize as you go or denormalize, or what?
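
    For reference, combining marks do follow their base code point in
    Unicode. The precomposed-character issue can be seen with Python's
    standard unicodedata.normalize(), assuming "é" as the example: the two
    spellings compare unequal until both are brought to the same
    normalization form.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

same_raw = precomposed == decomposed                                # False
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed  # True
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed  # True

print(same_raw, same_nfc, same_nfd)
```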

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to John Levine on Mon May 27 20:41:38 2024
    John Levine wrote:

    According to EricP <[email protected]>:
    John Levine wrote:
    If you mean an array of pointers to sequences of code points, well
    sure, but now we're back to variable length encodings. I know that I
    have no idea how big these fixed size things would have to be and I
    suspect nobody else does either.

    One could have instructions that make it easier to parse the
    variable length UTF-8 sequences into codepoints.

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    It would still have to look up whether a codepoint was combining or
    stand alone. I don't see a firm definition of whether combining codepoints
    come before or after; "after" would require a lookahead parse and so extra
    checks to ensure it doesn't look past the buffer end.

    I think they come after but I haven't looked in enough detail. And
    then you have all of the issues with precomposed characters: do you
    normalize as you go or denormalize, or what?

    Character search (or compare) becomes 'grep'.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Tue May 28 01:10:02 2024
    On Wed, 08 May 2024 20:50:53 -0600, John Savard wrote:

    On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

    But the third item is character strings, used in input and output to
    represent numbers. They should be the same as packed decimal to make
    conversion between the two simpler.

    No, because character string conversion is subject to localization
    issues.

    I agree that little-endian computers make sense for people whose native language is Hebrew or Arabic.

    That doesn’t actually make any sense.

  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue May 28 01:12:56 2024
    On Wed, 8 May 2024 19:47:34 +0200, Terje Mathisen wrote:

    It was exactly these kinds of optimizations I made in order to double
    the speed of Intel's reference BluRay decoder. However, instead of
    asking me to write a complete version they decided to licence a piece of
    VLSI to do it in hardware, and that was almost certainly the correct
    decision since my code needed 4 cores working nearly 100% in order to
    handle the highest possible size/speed quality (1080p, 60 Hz, CABAC
    encoding and 40 Mbit/s bitrate).

    Still, that sounds like something that could be useful in a transcoder
    like FFmpeg.

    4 cores sounds like a modest requirement these days; nproc reports 24 on
    the machine I’m using now. And 16 on my laptop.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 28 01:24:41 2024
    On Mon, 27 May 2024 15:09:23 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    It is easy enough to come up with a fixed-size representation for
    characters in Python (or other suitably powerful language), where
    “character” = “non-combining code point plus all immediately-following
    combining code points”.

    I have to ask, how much storage do each of these fixed-size character
    things take?

    That’s not important; what’s important is that you can put characters as
    elements in an array, randomly accessible just by array index.

    How am I supposed to write my code with an array of fixed size things if
    I don't know how big the things are?

    The fixed-size things are references to objects. Or in a lower-level
    language like C, they could indeed be pointers/indexes into an array of
    code points.

    If you mean an array of pointers to sequences of code points, well sure,
    but now we're back to variable length encodings.

    We’re not, because we still have easy random access, and the length of the array is the number of characters.

  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 28 01:25:37 2024
    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP <[email protected]>:

    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    What is the point, in this day and age, of having special machine
    instructions to convert character encodings?

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue May 28 08:01:36 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 27 May 2024 15:09:23 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <[email protected]d>:

    It is easy enough to come up with a fixed-size representation for
    characters in Python (or other suitably powerful language), where
    “character” = “non-combining code point plus all immediately
    -following combining code points”.

    I have to ask, how much storage do each of these fixed-size character
    things take?

    That’s not important; what’s important is that you can put characters as
    elements in an array, randomly accessible just by array index.

    How am I supposed to write my code with an array of fixed size things if
    I don't know how big the things are?

    The fixed-size things are references to objects. Or in a lower-level
    language like C, they could indeed be pointers/indexes into an array of
    code points.

    If you mean an array of pointers to sequences of code points, well sure,
    but now we're back to variable length encodings.

    We’re not, because we still have easy random access, and the length of the array is the number of characters.

    If you need efficient random read access to particular unicode
    characters, possibly consisting of multiple codepoints, then I would
    guess a skip list to be very efficient:

    Just a helper array containing the starting offset of roughly every
    32nd UTF-8 character. This would add 12.5% overhead for a file containing
    only US ASCII if using 32-bit offsets (4 index bytes per 32 data bytes),
    while the more long characters you have, the lower the overhead.

    When accessing a particular character you could of course use linear
    scanning past the nearest preceding index entry.

    If you also need to edit the utf8 character array, then you could
    augment the primary index with one or more higher layers, i.e. a classic
    skip list.
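
    The single-level helper array could be sketched in Python roughly as
    follows. This is an illustration under stated assumptions, not Terje's
    code: it indexes code points rather than full combining sequences, it
    assumes well-formed UTF-8 and an in-range index n, and the names
    build_index, code_point_at and STRIDE are invented for the example.

```python
STRIDE = 32  # one index entry per 32 code points (the "~32" above)

def build_index(data: bytes, stride: int = STRIDE):
    """Byte offsets of code points 0, stride, 2*stride, ... in UTF-8 data."""
    index = []
    count = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:        # not a continuation byte: a code point starts here
            if count % stride == 0:
                index.append(offset)
            count += 1
    return index

def code_point_at(data: bytes, index, n: int, stride: int = STRIDE) -> str:
    """Return code point number n, scanning forward from the nearest index entry."""
    offset = index[n // stride]
    for _ in range(n % stride):          # at most stride-1 steps of linear scan
        offset += 1
        while (data[offset] & 0xC0) == 0x80:
            offset += 1                  # skip continuation bytes
    end = offset + 1
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1
    return data[offset:end].decode("utf-8")

text = "a\u00e9\u20ac\U0001F600" * 20    # 1-, 2-, 3- and 4-byte code points
data = text.encode("utf-8")
index = build_index(data)
print(index, code_point_at(data, index, 42))
```

    With 4-byte offsets the index costs 4 bytes per 32 code points, which is
    where the 12.5% figure for pure ASCII comes from.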

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to John Levine on Tue May 28 11:04:38 2024
    John Levine wrote:
    According to EricP <[email protected]>:
    John Levine wrote:
    If you mean an array of pointers to sequences of code points, well
    sure, but now we're back to variable length encodings. I know that I
    have no idea how big these fixed size things would have to be and I
    suspect nobody else does either.
    One could have instructions that make it easier to parse the
    variable length UTF-8 sequences into codepoints.

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    It would still have to look up whether a codepoint was combining or
    stand alone. I don't see a firm definition whether combining codepoints
    come before or after, after requiring a lookahead parse and so extra
    checks to ensure it doesn't look past the buffer end.

    I think they come after but I haven't looked in enough detail.

    It appears they defined it as you described, with the base character
    first and optional combiners following.
    https://www.unicode.org/glossary/#combining_character_sequence

    I was thinking that as UTF-8 can be parsed in either direction,
    the order should be defined such that the usual case, low to high scan,
    is most efficient.

    That order should be to put the combiner(s) first and the base codepoint
    last, so the base codepoint acts like a parse stop-code and makes a
    lookahead toward higher addresses unnecessary.

    A backwards scan still works but it has to look ahead backwards to check
    if there is a combiner, which there usually isn't, and unget it if not.
    As that is extra work, checking for buffer overflow etc., and touches
    extra bytes that are usually unused, this should be the second choice.

    But it appears they chose the least efficient way to do it.
    Sigh... oh well.

    And then you have all of the issues with precomposed characters: do you
    normalize as you go or denormalize, or what?

    And fields in forms have fixed screen size, while record struct
    and database fields have fixed byte size.

    Fortunately I don't have to deal with any of this.

  • From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Tue May 28 16:02:10 2024
    Lawrence D'Oliveiro <[email protected]d> schrieb:
    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP <[email protected]>:

    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    What is the point, in this day and age, of having special machine instructions to convert character encodings?

    Have you looked at decoding algorithms for UTF-8?

  • From EricP@21:1/5 to Thomas Koenig on Tue May 28 12:23:12 2024
    Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:
    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP <[email protected]>:
    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.
    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.
    What is the point, in this day and age, of having special machine
    instructions to convert character encodings?

    Have you looked at decoding algorithms for UTF-8?

    It's almost like the perfect application of risc instruction design:
    a long sequence of individual instructions of conditional branches,
    bit field extracts, inserts, and shifts, is replaced in HW by
    a small number of muxes that can do the same in one clock.

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed May 29 04:46:34 2024
    On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

    Lawrence D'Oliveiro <[email protected]d> schrieb:

    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP <[email protected]>:

    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    What is the point, in this day and age, of having special machine
    instructions to convert character encodings?

    Have you looked at decoding algorithms for UTF-8?

    Of course. Isn’t the point of RISC that these complex operations are more efficiently performed by a sequence of simpler instructions?

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed May 29 07:04:35 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

    Lawrence D'Oliveiro <[email protected]d> schrieb:

    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP <[email protected]>:

    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.

    What for? Dealing with code points is rarely necessary, so adding
    instructions for that is a waste (and it's not surprising to me that
    neither AMD64 nor ARM A64 have such instructions; IBM z seems to
    add special instructions that are rarely useful as a marketing
    argument).

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    What is the point, in this day and age, of having special machine
    instructions to convert character encodings?

    Have you looked at decoding algorithms for UTF-8?

    Of course. Isn’t the point of RISC that these complex operations are more
    efficiently performed by a sequence of simpler instructions?

    The IBM z series are not RISCs.

    Anyway, such instructions can be done in a RISCy way (pure
    register-to-register instructions) or in a CISCy way
    (memory-to-memory).

    A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
    of the remaining string in a register and produce a UTF-32 code
    point in another register and a length in a third register (or in the
    high part of the destination register to reduce write port
    requirements). Similarly for UTF-32->UTF-8, with the length
    specifying the length of the result; that would need to be combined
    with a length-masked store to make it easy to store the result.
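
    As a software model of that register-to-register interface, a sketch in
    Python might look like the following. It is only an illustration: the
    function names are invented, input is assumed to be well formed (no
    checks for bad continuation bytes, overlong forms or surrogates), and it
    says nothing about how real hardware would be built.

```python
def utf8_decode_one(window: bytes):
    """Take up to 4 bytes, return (code_point, length_in_bytes)."""
    b0 = window[0]
    if b0 < 0x80:                         # 0xxxxxxx
        return b0, 1
    if (b0 & 0xE0) == 0xC0:               # 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (window[1] & 0x3F), 2
    if (b0 & 0xF0) == 0xE0:               # 1110xxxx + 2 continuation bytes
        return (((b0 & 0x0F) << 12) | ((window[1] & 0x3F) << 6)
                | (window[2] & 0x3F)), 3
    return (((b0 & 0x07) << 18) | ((window[1] & 0x3F) << 12)   # 11110xxx + 3
            | ((window[2] & 0x3F) << 6) | (window[3] & 0x3F)), 4

def utf8_encode_one(cp: int):
    """Take a code point, return (utf8_bytes, length_in_bytes)."""
    if cp < 0x80:
        out = [cp]
    elif cp < 0x800:
        out = [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    elif cp < 0x10000:
        out = [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    else:
        out = [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return bytes(out), len(out)

def utf8_decode_all(data: bytes):
    """Each step needs the length from the previous step before it can load."""
    pos, result = 0, []
    while pos < len(data):
        cp, length = utf8_decode_one(data[pos:pos + 4])
        result.append(cp)
        pos += length
    return result

print([hex(cp) for cp in utf8_decode_all("h\u00e9 \u20ac \U0001F600".encode("utf-8"))])
```

    The loop in utf8_decode_all shows the serial chain: the next 4-byte load
    cannot start until the previous length is known.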

    This approach can also be SIMDified, converting regbits/32 code points
    in one representation to the same number of code points in the other representation plus a length of the UTF-8 representation.

    The disadvantage of this approach exists particularly for
    UTF-8->UTF-32: this is a very sequential approach full of dependences:
    each use of the conversion instruction is followed by a dependent load
    of the next input fragment, and the next use of the conversion
    instruction depends on that load.

    We have been discussing shift buffers; those would be useful for such instructions.

    A CISCy approach is similar to a block copy: have a source operand in
    memory (represented by an address and maybe a length) and a
    destination operand (represented by an address and a length), and run
    the instruction in a loop until it is finished (the loop is there to
    allow interrupting the instruction in the middle, e.g., for page faults).

    Looking at CU14 on page 7-136 of <https://www.ibm.com/docs/en/SSQ2R2_15.0.0/com.ibm.tpf.toolkit.hlasm.doc/dz9zr006.pdf>,
    CU14 takes the CISCy approach outlined above.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed May 29 07:59:21 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    The fixed-size things are references to objects. Or in a lower-level
    language like C, they could indeed be pointers/indexes into an array of
    code points.

    There is no need for UTF-32 for such an approach. Just let the
    pointers/indexes point to the start of the character in UTF-8
    representation.

    [...] we still have easy random access, and the length of the
    array is the number of characters.

    Both of which are rarely necessary.

    But sure, if you need that, the approach of having an array of
    pointers to characters in UTF-8 representation works, while converting
    to UTF-32 does not help at all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From EricP@21:1/5 to Anton Ertl on Wed May 29 10:10:30 2024
    Anton Ertl wrote:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

    Lawrence D'Oliveiro <[email protected]d> schrieb:
    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP <[email protected]>:
    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.

    What for? Dealing with code points is rarely necessary, so adding
    instructions for that is a waste (and it's not surprising to me that
    neither AMD64 nor ARM A64 have such instructions; IBM z seems to
    add special instructions that are rarely useful as a marketing
    argument).

    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.

    But even something as simple as sanitizing a character string to feed
    into SQL will have to.

    And while I've not dealt with it myself, I can see just by looking at
    UTF-8 and its variable sized characters of variable sized code points
    that it likely makes string processing 10 times more complicated.

    As string processing is 99% of what business software manipulates,
    and international string processing is a large part of IBM's services
    market, services that they have to compete against others to sell,
    it doesn't surprise me that they would add instructions which facilitate it.

    Many processors have instructions for particular operations:
    Find First/Last One/Zero, bit field reverse for FFT,
    POPCOUNT for them-who-shall-not-be-named.

    A Sign Extend instruction is just a way to decompress a redundant-high-order-bit-compressed integer.

    Why not instructions to decompress the most high frequency usage
    compressed character set?

  • From Stefan Monnier@21:1/5 to All on Wed May 29 11:09:26 2024
    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.
    But even something as simple as sanitizing a character string to feed
    into SQL will have to.

    AFAIK you can do that by treating the UTF-8 byte sequence as if it were
    an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
    bytes >127 which aren't used by SQL itself anyway.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Wed May 29 11:55:18 2024
    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.
    But even something as simple as sanitizing a character string to feed
    into SQL will have to.
    AFAIK you can do that by treating the UTF-8 byte sequence as if it were
    an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
    bytes >127 which aren't used by SQL itself anyway.
    Stefan

    Of course with apologies to Herr Koenig's umlauts. :-)

    And what of all those new Asian customers your company was hoping
    to get by dealing with them in their native written language???
    You could always explain to the company president that
    you only work in ASCII so they should just get used to it.

    I think you misunderstand: the code written to sanitize an ASCII string to
    feed into SQL will *just work* to sanitize a UTF-8 string to feed
    into SQL, no matter how many funny characters and joiners and combiners
    and emojis you have in there.

    That's part of the reason why UTF-8 is so popular: you can surprisingly
    often treat it as "good old ASCII".
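
    A small Python sketch of why the byte-level view works. Quote doubling is
    used purely as a stand-in for "some ASCII-oriented sanitizer"; it is not
    put forward as an adequate defence, and the function name and sample
    string are made up for the example.

```python
def escape_quotes_bytes(raw: bytes) -> bytes:
    """Double every ASCII single quote, looking only at individual bytes."""
    return raw.replace(b"'", b"''")

name = "Zo\u00eb O'Brien \u6771\u4eac \U0001F642"
raw = name.encode("utf-8")

# Working on bytes gives the same answer as working on decoded characters.
same = escape_quotes_bytes(raw).decode("utf-8") == name.replace("'", "''")

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so none of them
# can be mistaken for an ASCII quote, semicolon, backslash and so on.
high_only = all(b >= 0x80
                for ch in name if ord(ch) > 127
                for b in ch.encode("utf-8"))
print(same, high_only)
```

    The property being relied on is that ASCII byte values never occur inside
    a multi-byte sequence, so a scanner looking for ASCII delimiters cannot
    get a false hit in the middle of a character.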


    Stefan

  • From EricP@21:1/5 to Stefan Monnier on Wed May 29 11:46:40 2024
    Stefan Monnier wrote:
    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.
    But even something as simple as sanitizing a character string to feed
    into SQL will have to.

    AFAIK you can do that by treating the UTF-8 byte sequence as if it were
    an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
    bytes >127 which aren't used by SQL itself anyway.


    Stefan

    Of course with apologies to Herr Koenig's umlauts. :-)

    And what of all those new Asian customers your company was hoping
    to get by dealing with them in their native written language???
    You could always explain to the company president that
    you only work in ASCII so they should just get used to it.

  • From EricP@21:1/5 to Stefan Monnier on Wed May 29 13:20:14 2024
    Stefan Monnier wrote:
    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.
    But even something as simple as sanitizing a character string to feed
    into SQL will have to.
    AFAIK you can do that by treating the UTF-8 byte sequence as if it were
    an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
    bytes >127 which aren't used by SQL itself anyway.
    Stefan
    Of course with apologies to Herr Koenig's umlauts. :-)

    And what of all those new Asian customers your company was hoping
    to get by dealing with them in their native written language???
    You could always explain to the company president that
    you only work in ASCII so they should just get used to it.

    I think you misunderstand: the code written to sanitize an ASCII string to feed into SQL will *just work* to sanitize a UTF-8 string to feed
    into SQL, no matter how many funny characters and joiners and combiners
    and emojis you have in there.

    That's part of the reason why UTF-8 is so popular: you can surprisingly
    often treat it as "good old ASCII".


    Stefan

    Ok, you accept international character data, you just don't have to
    check >127 characters for "drop table" etc commands.

    I don't think you are being paranoid enough.
    I still think you have to validate or sanitize the >127 string to
    ensure the code sequences only contain well formed characters.

    Random hack thought #1: suppose the string I send starts with an umlaut as
    the first code point, which doesn't display because it is invalid.
    Then someone edits the first char to a/o/u and magically it changes
    to a different character, and deposits now go to a different account.

    Random hack thought #2: If a character has multiple combiner code points,
    does changing the order create a different character or do they map to
    the same display character? Or worse, maybe combiner code point order sensitivity is character dependent, some are, some are not.
    If they do display the same, then I might create two accounts that
    look identical but index differently, and redirect deposits.

  • From John Levine@21:1/5 to All on Wed May 29 18:42:32 2024
    According to EricP <[email protected]>:
    Ok, you accept international character data, you just don't have to
    check >127 characters for "drop table" etc commands.

    I don't think you are being paranoid enough.
    I still think you have to validate or sanitize the >127 string to
    ensure the code sequences only contain well formed characters.

    If you're sending the strings to a database, the database will
    invariably do detailed string validation so I wouldn't bother, but be
    prepared for the error code if it rejects the string.

    Random hack thought #1: if the string I send starts with an umlaut as
    the first code point, ...

    A bare umlaut displays just fine. But see below.

    Random hack thought #2: If a character has multiple combiner code points,
    does changing the order create a different character or do they map to
    the same display character? Or worse, maybe combiner code point order
    sensitivity is character dependent, some are, some are not.

    Unicode has normalization forms that deal with this. The most common
    are NFC which uses precomposed combined characters, and NFD where
    they're all separate (Composed and Decomposed.) NFD puts the combiners
    in a well defined order. Sensible people put all their strings into
    NFC or NFD before doing anything else with them.

    https://www.unicode.org/reports/tr15/
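
    For illustration, here is how that looks with Python's standard
    unicodedata module. The helper name same_text is invented for the
    example, and the sample strings were chosen so that the two combining
    marks have different combining classes (acute above, dot below).

```python
import unicodedata

def same_text(x: str, y: str, form: str = "NFC") -> bool:
    """Compare two strings after putting both into one normalization form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

a = "a\u0301\u0323"    # a + combining acute + combining dot below
b = "a\u0323\u0301"    # same marks, opposite order

print(a == b)                      # raw code point sequences differ
print(same_text(a, b, "NFD"))      # NFD puts the combiners in a defined order
print(same_text(a, b, "NFC"))
print(same_text("e\u0301", "\u00e9"))   # decomposed vs precomposed e-acute
```

    Normalizing before comparing or indexing is what stops two
    different-looking code point sequences for the same text from being
    treated as different keys.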
    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From George Neuner@21:1/5 to [email protected] on Wed May 29 22:26:00 2024
    On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
    <[email protected]> wrote:

    According to EricP <[email protected]>:
    Ok, you accept international character data, you just don't have to
    check >127 characters for "drop table" etc commands.

    I don't think you are being paranoid enough.
    I still think you have to validate or sanitize the >127 string to
    ensure the code sequences only contain well formed characters.

    If you're sending the strings to a database, the database will
    invariably do detailed string validation so I wouldn't bother, but be
    prepared for the error code if it rejects the string,

    Far too much SQL is constructed by simply splicing user input into a
    query "template" string.

    When queries are done right with all user input provided via SQL
    parameters, then there is far less need to "sanitize" input.

    There is one major caveat: in SQL, table names can't be specified by
    parameter. If the user must provide a table name, then you DO have to
    splice the query string and you DO have to be careful.
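
    A short sketch of both points using Python's built-in sqlite3 module. The
    table names, the ALLOWED_TABLES set and the find_by_name helper are
    hypothetical, chosen only for the example; checking a spliced table name
    against a fixed list is one way of "being careful", not the only one.

```python
import sqlite3

ALLOWED_TABLES = {"accounts", "audit_log"}   # hypothetical fixed list

def find_by_name(conn, table, name):
    # The table name cannot be a parameter, so it is spliced in,
    # but only after being checked against the fixed list.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    # The user-supplied value goes in as a parameter, never into the SQL text.
    query = f"SELECT id, name FROM {table} WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO accounts (name) VALUES (?)", ("Zo\u00eb O'Brien",))

hostile = "x'; DROP TABLE accounts; --"
print(find_by_name(conn, "accounts", hostile))             # no match, nothing dropped
print(find_by_name(conn, "accounts", "Zo\u00eb O'Brien"))  # quote and non-ASCII are just data
```

    With the value passed as a parameter, neither the quote nor the
    non-ASCII characters need any escaping by the caller.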

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Thu May 30 02:42:29 2024
    On Wed, 29 May 2024 11:46:40 -0400, EricP wrote:

    You could always explain to the company president that you only work in
    ASCII so they should just get used to it.

    That stopped being acceptable back in about the 1980s.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu May 30 02:37:51 2024
    On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    Isn’t the point of RISC that these complex operations are
    more efficiently performed by a sequence of simpler instructions?

    The IBM z series are not RISCs.

    Doesn’t matter. The principles of designing high-performance architectures still apply: simpler instructions are better than more complex ones.

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Thu May 30 02:41:20 2024
    On Wed, 29 May 2024 10:10:30 -0400, EricP wrote:

    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.

    We are all “non 1-byte character markets” now.

    Just to rub it in: «€£¢©®±»

  • From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Thu May 30 03:26:05 2024
    Lawrence D'Oliveiro wrote:

    On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    Isn’t the point of RISC that these complex operations are
    more efficiently performed by a sequence of simpler instructions?

    The IBM z series are not RISCs.

    Doesn’t matter. The principles of designing high-performance
    architectures still apply: simpler instructions are better than more
    complex ones.



    IBM has, for a long time, combined commonly occurring sequences of
    instructions into single instructions. I don't know the tradeoffs
    here. Perhaps John Levine does?



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Terje Mathisen@21:1/5 to EricP on Thu May 30 10:10:55 2024
    EricP wrote:
    Thomas Koenig wrote:
    Lawrence D'Oliveiro <[email protected]d> schrieb:
    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP  <[email protected]>:
    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.
    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.
    What is the point, in this day and age, of having special machine
    instructions to convert character encodings?

    Have you looked at decoding algorithms for UTF-8?

    It's almost like the perfect application of risc instruction design:
    a long sequence of individual instructions of conditional branches,
    bit field extracts, inserts, and shifts, is replaced in HW by
    a small number of muxes that can do the same in one clock.


    If that CU14 can also return the number of bytes consumed, along with
    the resulting 32-bit character, then it would be perfect. Is that what
    it is doing?

    You still have the horrible combining codepoints problem of course,
    where you have to apply CU14 once more just in order to find out if it
    was in fact a combining code, and do that without any buffer overruns etc.

    Personally I tend to punt on these kinds of algorithms and simply demand
    that the decoding source buffer have at least enough extra buffer space
    at the end to avoid the problem.

    I.e. my LZ4 decoder is significantly faster than what Google is using,
    but it will happily grab up to 11 or 27 bytes past the actual end of input.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Anton Ertl on Thu May 30 10:36:03 2024
    Anton Ertl wrote:
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

    Lawrence D'Oliveiro <[email protected]d> schrieb:

    On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

    According to EricP <[email protected]>:

    One could have instructions that make it easier to parse the variable
    length UTF-8 sequences into codepoints.

    What for? Dealing with code points is rarely necessary, so adding
    instructions for that is a waste (and it's not surprising to me that
    neither AMD64 nor ARM A64 have such instructions; IBM z seems to
    add special instructions that are rarely useful as a marketing
    argument).

    That would be the CU14 instruction on zSeries, to turn UTF-8 into
    UTF-32. CU41 goes the other way.

    What is the point, in this day and age, of having special machine
    instructions to convert character encodings?

    Have you looked at decoding algorithms for UTF-8?

    Of course. Isn’t the point of RISC that these complex operations are more
    efficiently performed by a sequence of simpler instructions?

    The IBM z series are not RISCs.

    Anyway, such instructions can be done in a RISCy way (pure register-to-register instructions) or in a CISCy way
    (memory-to-memory).

    A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
    of the remaining string in a register and producing an UTF-32 code
    point in another register and a length in a third register (or in the
    high part of the destination register to reduce write port
    requirements). Similarly for UTF-32->UTF-8, with the length
    specifying the length of the result; that would need to be combined
    with a length masked store to make it easy to store the result.

    This approach can also be SIMDified, converting regbits/32 code points
    in one representation to the same number of code points in the other representation plus a length of the UTF-8 representation.

    The disadvantage of this approach exists particularly for
    UTF-8->UTF-32: this is a very sequential approach full of dependences:
    each use of the conversion instruction is followed by a dependent load
    of the next input fragment, and the next use of the conversion
    instruction depends on that load.

    Rather the opposite:

    UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
    variable length (x86?) instruction decoder, but with the big
    simplification that the first byte directly tells you how long the
    sequence is.

    Doing a SIMD version corresponds to a superscalar x86 in that the
    decoder needs to grab a variable number of bytes for each instruction, starting the next immediately after.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Terje Mathisen on Thu May 30 12:45:27 2024
    Terje Mathisen wrote:
    Anton Ertl wrote:
    This approach can also be SIMDified, converting regbits/32 code points
    in one representation to the same number of code points in the other
    representation plus a length of the UTF-8 representation.

    The disadvantage of this approach exists particularly for
    UTF-8->UTF-32: this is a very sequential approach full of dependences:
    each use of the conversion instruction is followed by a dependent load
    of the next input fragment, and the next use of the conversion
    instruction depends on that load.

    Rather the opposite:

    UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
    variable length (x86?) instruction decoder, but with the big
    simplification that the first byte directly tells you how long the
    sequence is.

    Doing a SIMD version corresponds to a superscalar x86 in that the
    decoder needs to grab a variable number of bytes for each instruction, starting the next immediately after.

    Even better (compared to a superscalar x86 instruction decoder), _every_
    byte uses the top two bits to tell you if this is 7-bit ascii, the start
    of a UTF-8 encoded code point, or a follow-on byte inside a UTF-8 code
    point.

    This means that each decoder can work alone, without having to wait for
    the length decoding of the previous code point ("instruction") before
    deciding to discard or pass on the results it got from starting where it
    did.

    It seems like it would be very feasible to have (say) 8 parallel
    decoders starting at every corresponding byte offset, and return a SIMD register with 2-8 32-bit decoded code points, right?
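
    A rough software model of that idea, written in Python for illustration
    only. The loop stands in for decoders that would run side by side in
    hardware; each iteration looks only at its own starting byte and the
    bytes after it, never at another decoder's result. Input is assumed to be
    well formed, and the names classify and decode_window are invented here.

```python
def classify(byte: int) -> str:
    """Classify one byte by its top two bits alone."""
    top = byte >> 6
    if top == 0b10:
        return "continuation"    # 10xxxxxx: inside a multi-byte sequence
    if top == 0b11:
        return "lead"            # 11xxxxxx: starts a multi-byte sequence
    return "ascii"               # 0xxxxxxx: 7-bit ASCII

def decode_window(window: bytes):
    """Return (code_points, bytes_consumed) for one fixed-size window."""
    code_points, consumed = [], 0
    for start in range(len(window)):      # one "decoder" per byte offset
        b0 = window[start]
        kind = classify(b0)
        if kind == "continuation":
            continue                      # this decoder discards its result
        if kind == "ascii":
            length, cp = 1, b0
        else:
            length = 2 if b0 < 0xE0 else 3 if b0 < 0xF0 else 4
            if start + length > len(window):
                continue                  # sequence straddles the window end
            cp = b0 & (0xFF >> (length + 1))
            for b in window[start + 1:start + length]:
                cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        consumed = start + length
    return code_points, consumed

data = "a\u00e9\u20ac\U0001F600".encode("utf-8")   # 1 + 2 + 3 + 4 = 10 bytes
print(decode_window(data[:8]))   # the 4-byte sequence does not fit in 8 bytes
```

    Returning the consumed byte count tells the caller where the next window
    has to start when a sequence is cut off at the end.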

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Thu May 30 11:59:53 2024
    Terje Mathisen <[email protected]> schrieb:
    Terje Mathisen wrote:
    Anton Ertl wrote:
    This approach can also be SIMDified, converting regbits/32 code points
    in one representation to the same number of code points in the other
    representation plus a length of the UTF-8 representation.

    The disadvantage of this approach exists particularly for
    UTF-8->UTF-32: this is a very sequential approach full of dependences:
    each use of the conversion instruction is followed by a dependent load
    of the next input fragment, and the next use of the conversion
    instruction depends on that load.

    Rather the opposite:

    UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
    variable length (x86?) instruction decoder, but with the big
    simplification that the first byte directly tells you how long the
    sequence is.

    Doing a SIMD version corresponds to a superscalar x86 in that the
    decoder needs to grab a variable number of bytes for each instruction,
    starting the next immediately after.

    Even better (compared to a superscalar x86 instruction decoder), _every_
    byte uses the top two bits to tell you if this is 7-bit ascii, the start
    of a UTF-8 encoded code point, or a follow-on byte inside a UTF-8 code
    point.

    This means that each decoder can work alone, without having to wait for
    the length decoding of the previous code point ("instruction") before
    deciding to discard or pass on the results it got from starting where it
    did.

    It seems like it would be very feasible to have (say) 8 parallel
    decoders starting at every corresponding byte offset, and return a SIMD
    register with 2-8 32-bit decoded code points, right?

    Sounds quite reasonable (and would be like what Mitch describes for his
    My 66000 decoders). Apart from filling the buffers, it would also need
    to return the number of bytes consumed and the number of UTF-32
    characters generated, plus a possible error indication.

    Looking at what IBM did, the CU14 instruction is memory-to-memory
    and they use both the length and the address of both the source
    and destination data in register pairs. The number of characters
    to process is then decremented according to what has been processed
    (and there might be a CPU-defined limit). They also appear to have
    optional error checking only.

    Complicated...

  • From Anton Ertl@21:1/5 to Terje Mathisen on Thu May 30 11:54:09 2024
    Terje Mathisen <[email protected]> writes:
    Anton Ertl wrote:
    Anyway, such instructions can be done in a RISCy way (pure
    register-to-register instructions) or in a CISCy way
    (memory-to-memory).

    A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
    of the remaining string in a register and produce a UTF-32 code
    point in another register and a length in a third register (or in the
    high part of the destination register to reduce write port
    requirements). Similarly for UTF-32->UTF-8, with the length
    specifying the length of the result; that would need to be combined
    with a length masked store to make it easy to store the result.

    This approach can also be SIMDified, converting regbits/32 code points
    in one representation to the same number of code points in the other
    representation plus a length of the UTF-8 representation.

    The disadvantage of this approach exists particularly for
    UTF-8->UTF-32: this is a very sequential approach full of dependences:
    each use of the conversion instruction is followed by a dependent load
    of the next input fragment, and the next use of the conversion
    instruction depends on that load.

    Rather the opposite:

    UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
    variable length (x86?) instruction decoder, but with the big
    simplification that the first byte directly tells you how long the
    sequence is.

    The SIMD version of the RISCy instruction is no problem. So you can
    process regbits/32 code points in one go. But what I wrote above
    still applies: You use this instruction in a loop like

    # s* are SIMD registers, g* are GPRs
    l: s0= load(g0)
    s1,g1= cu14(s0)
    store (g2)<-s1
    g0 = g0+g1
    g2 = g2+SIMD_width
    if g0>=input_end goto end
    if g2<output_limit goto l
    end:

    (probably some fine tuning of the last iteration and the termination
    is necessary).

    And here you have a dependence chain from load to cu14 to the g0+g1 to
    the load of the next iteration. With cu14 and the addition as
    single-cycle operations and the load taking 5 cycles as for D-cache
    hits on recent Intel CPUs, that's 7 cycles per iteration, limiting the
    throughput of your conversion routine to 1/7th of what your cu14 and
    your load/store unit would be capable of in throughput-limited code.
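
    The same chain is visible in a plain scalar C rendering of the loop.
    This is only a sketch under stated assumptions: utf8_decode_one and
    utf8_to_utf32 are hypothetical names, and the decoder checks only the
    lead byte and the follow-on byte markers (it does not reject overlong
    forms or surrogates). The address of each decode depends on the length
    produced by the previous one:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point starting at p, with avail >= 1 bytes readable.
   Returns the number of bytes consumed, or 0 if the sequence is
   malformed or truncated. */
static size_t utf8_decode_one(const uint8_t *p, size_t avail, uint32_t *cp)
{
    uint8_t b = p[0];
    size_t len;
    uint32_t c;

    if (b < 0x80)                { len = 1; c = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; }
    else return 0;               /* stray follow-on byte or invalid lead */

    if (len > avail) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Convert until the input is exhausted, the output is full, or a bad
   sequence is met. Returns the number of code points written and stores
   the number of input bytes consumed in *consumed. */
static size_t utf8_to_utf32(const uint8_t *in, size_t in_len,
                            uint32_t *out, size_t out_cap, size_t *consumed)
{
    size_t i = 0, n = 0;
    while (i < in_len && n < out_cap) {
        size_t len = utf8_decode_one(in + i, in_len - i, &out[n]);
        if (len == 0) break;
        i += len;   /* the next load address depends on this length */
        n++;
    }
    *consumed = i;
    return n;
}
```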

    With a byte-stream buffer as architectural feature, and a CU14 that
    takes its utf-8 input from that and automatically advances the stream,
    this could be quite a bit more efficient. Something like:

    ... set up stream buffer ...
    l: s1 = cu14(stream-buffer)
    store (g2)<-s1
    g2 = g2+SIMD_width
    if streambuffer empty goto end
    if g2<output_limit goto l
    end:

    (again with some fine-tuning for the last iteration and termination).

    For a technically unnecessary marketing gimmick like CU14 one probably
    won't add a stream buffer, but, e.g., compression and decompression
    are probably more relevant and may also benefit from such a feature.

    Doing a SIMD version corresponds to a superscalar x86 in that the
    decoder needs to grab a variable number of bytes for each instruction,
    starting the next immediately after.

    The instructions are fetched into a stream buffer rather than waiting
    for the decoder to produce a length result before starting the next
    instruction fetch (and of course the instruction fetcher also has to
    deal with branches).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu May 30 12:50:38 2024
    Lawrence D'Oliveiro <[email protected]d> writes:
    On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <[email protected]d> writes:

    Isn’t the point of RISC that these complex operations are
    more efficiently performed by a sequence of simpler instructions?

    The IBM z series are not RISCs.

    Doesn’t matter. The principles of designing high-performance
    architectures still apply: simpler instructions are better than more
    complex ones.

    Is IBM z a high-performance architecture?

    In the present case, the principles of designing high-performance
    architectures will tell you that you don't need these instructions.

    But if we forget about that for a minute, the block-copy-style
    approach of IBM's CU14 instruction means that it could use a stream
    buffer internally to avoid the performance snag that I mentioned in
    another posting.

    However, there is a big difference between what performance features
    one can imagine and what is actually implemented. I think that's the
    marketing attraction of providing some feature as an instruction: it
    lets the sales victim's imagination do the marketing/selling.

    Concerning reality: When I looked at block copying a while ago
    (Skylake/Zen1 days), I found that my code using a loop of AVX moves
    outperformed REP MOVSB (where Intel and AMD's microcode should have
    done at least as well) in many cases, and that despite Intel adding
    "fast string moves" in IIRC Sandy Bridge.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Levine@21:1/5 to All on Thu May 30 14:42:14 2024
    According to Terje Mathisen <[email protected]>:
    It's almost like the perfect application of risc instruction design:
    a long sequence of individual instructions of conditional branches,
    bit field extracts, inserts, and shifts, is replaced in HW by
    a small number of muxes that can do the same in one clock.

    If that CU14 can also return the number of bytes consumed, along with
    the resulting 32-bit character, then it would be perfect. Is that what
    it is doing?

    You give it registers with two addresses and two lengths, and it
    converts the source UTF-8 code points to destination UTF-32 until it
    runs out of input, fills the output, gets an invalid character, or an
    interrupt. It updates the addresses and lengths. Other than optionally
    checking for invalid UTF-8 it does not interpret the code points.

    The condition code tells you which it was. If it was an interrupt, you
    just branch back and keep going.

    There's an extra cost flag whether to test for invalid UTF-8.

    Read all about it: https://www.vm.ibm.com/library/other/22783213.pdf

    It's on page 7-251.
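
    For readers without the manual at hand, here is a rough software model
    of that interface in C. It is a sketch under assumptions, not IBM's
    definition: the struct cu_operands, the function cu14_model and the
    enum values are hypothetical (they are not the real condition-code
    values), validation is always on and only checks lead and follow-on
    byte markers, and the interrupt case is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical completion codes; NOT IBM's condition-code values. */
enum cu_status { CU_SRC_EXHAUSTED, CU_DST_FULL, CU_INVALID };

/* Stand-in for the two register pairs: an address plus the remaining
   length in bytes, for source and destination. */
struct cu_operands {
    const uint8_t *src; size_t src_len;
    uint32_t      *dst; size_t dst_len;
};

/* Converts UTF-8 to UTF-32 and updates both addresses and both lengths,
   so the caller can fix the cause of the stop and call again. */
static enum cu_status cu14_model(struct cu_operands *op)
{
    while (op->src_len > 0) {
        uint8_t b = op->src[0];
        size_t len;
        uint32_t c;

        if (b < 0x80)                { len = 1; c = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; }
        else return CU_INVALID;

        /* A truncated tail is left unconsumed for the caller. */
        if (len > op->src_len) return CU_SRC_EXHAUSTED;
        for (size_t i = 1; i < len; i++) {
            if ((op->src[i] & 0xC0) != 0x80) return CU_INVALID;
            c = (c << 6) | (op->src[i] & 0x3F);
        }

        if (op->dst_len < sizeof *op->dst) return CU_DST_FULL;
        *op->dst++ = c;
        op->dst_len -= sizeof *op->dst;
        op->src += len;
        op->src_len -= len;
    }
    return CU_SRC_EXHAUSTED;
}
```

    Because the operands are updated in place, resuming after a full
    output buffer only requires pointing dst at fresh space and calling
    the function again.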

    --
    Regards,
    John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Anton Ertl@21:1/5 to EricP on Thu May 30 15:35:37 2024
    EricP <[email protected]> writes:
    Stefan Monnier wrote:
    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.
    But even something as simple as sanitizing a character string to feed
    into SQL will have to.
    AFAIK you can do that by treating the UTF-8 byte sequence as if it were
    an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
    bytes >127 which aren't used by SQL itself anyway.
    Stefan
    Of course with apologies to Herr Koenig's umlauts. :-)

    And what of all those new Asian customers your company was hoping
    to get by dealing with them in their native written language???
    You could always explain to the company president that
    you only work in ASCII so they should just get used to it.

    I think you misunderstand: the code written to sanitize an ASCII string to
    feed into SQL will *just work* to sanitize a UTF-8 string to feed
    into SQL, no matter how many funny characters and joiners and combiners
    and emojis you have in there.

    That's part of the reason why UTF-8 is so popular: you can surprisingly
    often treat it as "good old ASCII".


    Stefan

    Ok, you accept international character data, you just don't have to
    check >127 characters for "drop table" etc commands.

    Actually what you check for is meta-characters like ; " '. They are
    all ASCII, so as long as your code is 8-bit-clean, your SQL string
    sanitizer needs to know nothing about UTF-8.
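
    A small C sketch of what "8-bit-clean" means here, assuming a
    hypothetical helper sql_escape_quotes that doubles single quotes. It
    works byte by byte and never inspects bytes >= 0x80. That is safe for
    UTF-8 because every byte of a multi-byte sequence has its top bit set,
    so none of them can be mistaken for the ASCII quote 0x27. It is an
    illustration of the property only, not a complete sanitizer:

```c
#include <stddef.h>

/* Copy n bytes from src to dst, doubling every ASCII single quote.
   Returns the number of bytes written, or (size_t)-1 if dst (capacity
   cap) is too small. Bytes with the top bit set pass through unchanged. */
static size_t sql_escape_quotes(const char *src, size_t n,
                                char *dst, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        size_t need = (src[i] == '\'') ? 2 : 1;
        if (cap - o < need) return (size_t)-1;
        if (src[i] == '\'') dst[o++] = '\'';
        dst[o++] = src[i];
    }
    return o;
}
```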

    I don't think you are being paranoid enough.
    I still think you have to validate or sanitize the >127 string to
    ensure the code sequences only contain well formed characters.

    Then run your string through a checker/normalizer before or
    afterwards. No need to complicate your SQL sanitizer by trying to do
    both at the same time. But if you want the last bit of performance by
    doing both at the same time, then you certainly don't want to convert
    to UTF-32 and back.

    Random hack thought #1: if the string I send starts with an umlaut as
    the first code point, which doesn't display because it is invalid.

    I found that hard to understand. Do you mean that the string starts
    with a composing diaeresis code point and is invalid because it has no
    preceding basis with which to compose? The string may fail at the
    Unicode checking/normalization stage (depending on what it checks).

    Then someone edits the first char to a/o/u and magically it changes
    to a different character, and deposits now go to a different account.

    If someone can edit the string, and that changes where deposits go to,
    someone can do that even with no Unicode involved. E.g., if someone
    can change "EricP" to "Ertl". However, my impression is that banks
    use account numbers (pure ASCII) for deposits, names are used only for
    validation; so if you provide the wrong name, a money transfer may
    fail to go through (not sure what happens if a deposit does not go
    through), but won't be to the wrong account.

    Random hack thought #2: If a character has multiple combiner code points,
    does changing the order create a different character or do they map to
    the same display character? Or worse, maybe combiner code point order
    sensitivity is character dependent, some are, some are not.
    If they do display the same, then I might create two accounts that
    look identical but index differently, and redirect deposits.

    That's solved by normalization.

    Here's a story from work I had to do a while ago: users provided data
    through some tools written in Python, that data was somehow aggregated
    into one csv file (maybe with cat), and there was a Python3 script I
    had to run for processing the data. Now some users provided the data
    as Latin-1 and some as UTF-8, so the csv file contained a mixture of
    that. The Python3 script dutifully reported an error on reading the
    csv file, as guidelines recommend. This was the wrong thing to do in
    this application, as continuing to have this mixture was harmless.

    I then wrote a small program (in Gforth) that converted such mixed
    files to UTF-8, and that was one of the few uses of the Gforth words
    for dealing with UTF-8 that I needed (in most other cases strings are
    treated just as opaque data). The principle was to see if the next
    bytes were an UTF-8 code point or ASCII; if so, just output them. If
    they were neither, the next byte is a Latin-1 character, and is
    converted to UTF-8. Fortunately, there is no overlap between the
    Latin-1 characters that occurred in these data and the bytes that start
    a non-ASCII UTF-8 code point.
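
    The principle can be sketched in C as follows. This is not the Gforth
    program described above, only an illustration of the same idea; the
    names utf8_seq_len and mixed_to_utf8 are hypothetical, and the UTF-8
    test checks only the lead byte and follow-on byte markers. As noted
    above, it relies on the Latin-1 bytes in the data not happening to
    form a plausible UTF-8 sequence with their neighbours:

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the ASCII byte or UTF-8 sequence at p (avail >= 1 bytes
   readable), or 0 if the bytes there do not look like UTF-8. */
static size_t utf8_seq_len(const uint8_t *p, size_t avail)
{
    size_t len;
    if (p[0] < 0x80)                len = 1;
    else if ((p[0] & 0xE0) == 0xC0) len = 2;
    else if ((p[0] & 0xF0) == 0xE0) len = 3;
    else if ((p[0] & 0xF8) == 0xF0) len = 4;
    else return 0;
    if (len > avail) return 0;
    for (size_t i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;
    return len;
}

/* Convert a mixture of UTF-8 and Latin-1 to UTF-8. out must have room
   for 2*n bytes. Returns the number of bytes written. */
static size_t mixed_to_utf8(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);
        if (len > 0) {
            /* ASCII or UTF-8 already: copy through unchanged. */
            while (len--) out[o++] = in[i++];
        } else {
            /* Treat as Latin-1: U+0080..U+00FF becomes two UTF-8 bytes. */
            out[o++] = (uint8_t)(0xC0 | (in[i] >> 6));
            out[o++] = (uint8_t)(0x80 | (in[i] & 0x3F));
            i++;
        }
    }
    return o;
}
```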

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Stefan Monnier@21:1/5 to All on Thu May 30 13:45:41 2024
    IBM has, for a long time, combined commonly occurring sequences of
    instructions into single instructions. I don't know the tradeoffs here.

    I don't know either, but it's hard to believe that it's *just* marketing
    because there is an actual design and implementation cost involved and
    even marketing needs some "hard" data to make a good sell.

    My guess is that they have gotten their implementation to a point where
    adding instructions is fairly painless (plenty of space in the
    instruction encoding, pre-existing micro/milli-code setup where the
    size of the micro/milli-code has a negligible impact on cycle time,
    chip size, and yield, ...).

    Then they use that flexibility to go after specific benchmarks they got
    from some important customers. Even if it speeds up the code of
    a single customer, it might be worth the effort if it's a large enough
    customer and it increases the chances of keeping them on
    that architecture.

    Maybe each of those cases could be solved about as efficiently by
    rewriting part of the code, but we're talking about a market where many
    of the customers are here specifically because they don't want to
    rewrite their code.

    For the case in point, I haven't seen problems where a UTF-32 encoding
    is the overall best solution, but I can easily believe that there are
    cases where some poorly thought out (but entrenched) API ends up
    imposing (directly or not) the use of UTF-32 and makes UTF-8 <-> UTF-32
    conversions very frequent.


    Stefan

  • From Stephen Fuld@21:1/5 to Stefan Monnier on Thu May 30 18:23:35 2024
    Stefan Monnier wrote:

    IBM has, for a long time, combined commonly occurring sequences of
    instructions into single instructions. I don't know the tradeoffs
    here.

    I don't know either, but it's hard to believe that it's just marketing
    because there is an actual design and implementation cost involved and
    even marketing needs some "hard" data to make a good sell.

    Yes.


    My guess is that they have gotten their implementation to a point
    where adding instructions is fairly painless (plenty of space in the
    instruction encoding, pre-existing micro/milli-code setup where the
    size of the micro/milli-code has a negligible impact on cycle time,
    chip size, and yield, ...).

    Good point. And note that there is some benefit in presumably better
    I-cache hit rate, etc. And if they have a hardware streaming buffer,
    it is probably easier to make use of it in a single instruction versus
    a sequence of instructions.



    Then they use that flexibility to go after specific benchmarks they
    got from some important customers. Even if it speeds up the code of
    a single customer, it might be worth the effort if it's a large enough
    customer and it increases the chances of keeping them on
    that architecture.

    Agreed. Furthermore, since IBM has major presence in certain
    industries, e.g. banking, if it helps one customer in that industry, it
    likely helps others.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu May 30 18:31:46 2024
    Stefan Monnier wrote:

    IBM has, for a long time, combined commonly occurring sequences of
    instructions into single instructions. I don't know the tradeoffs
    here.

    I don't know either, but it's hard to believe that it's *just* marketing
    because there is an actual design and implementation cost involved and
    even marketing needs some "hard" data to make a good sell.

    My guess is that they have gotten their implementation to a point where
    adding instructions is fairly painless (plenty of space in the
    instruction encoding, pre-existing micro/milli-code setup where the
    size of the micro/milli-code has a negligible impact on cycle time,
    chip size, and yield, ...).

    Yes, as long as the new instruction is "like" other already existing
    instructions.

    Then they use that flexibility to go after specific benchmarks they got
    from some important customers. Even if it speeds up the code of
    a single customer, it might be worth the effort if it's a large enough
    customer and it increases the chances of keeping them on
    that architecture.

    Maybe each of those cases could be solved about as efficiently by
    rewriting part of the code, but we're talking about a market where many
    of the customers are here specifically because they don't want to
    rewrite their code.

    A lot of the added instructions support OS-like features--I infer that
    many of these require some kind of atomic activities not easily
    achieved with the existing ISA itself.

    For the case in point, I haven't seen problems where a UTF-32 encoding
    is the overall best solution, but I can easily believe that there are
    cases where some poorly thought out (but entrenched) API ends up
    imposing (directly or not) the use of UTF-32 and makes UTF-8 <-> UTF-32
    conversions very frequent.

    30 years ago you could say the same thing about encryption.

    Stefan

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Thu May 30 20:38:08 2024
    Lawrence D'Oliveiro wrote:
    On Wed, 29 May 2024 10:10:30 -0400, EricP wrote:

    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.

    We are all “non 1-byte character markets” now.

    Just to rub it in: «€£¢©®±»

    Unnecessary in my case.
    My company's products were a real-time bond pricing and trading system,
    and customers were financial companies whose internal systems in this
    case only operated within North America in English, in ascii and ebcdic.

    They had other systems that did interface with the larger world
    and presumably dealt with international character sets.

  • From Terje Mathisen@21:1/5 to John Levine on Fri May 31 14:36:47 2024
    John Levine wrote:
    According to Terje Mathisen <[email protected]>:
    It's almost like the perfect application of risc instruction design:
    a long sequence of individual instructions of conditional branches,
    bit field extracts, inserts, and shifts, is replaced in HW by
    a small number of muxes that can do the same in one clock.

    If that CU14 can also return the number of bytes consumed, along with
    the resulting 32-bit character, then it would be perfect. Is that what
    it is doing?

    You give it registers with two addresses and two lengths, and it
    converts the source UTF-8 code points to destination UTF-32 until it
    runs out of input, fills the output, gets an invalid character, or an
    interrupt. It updates the addresses and lengths. Other than optionally
    checking for invalid UTF-8 it does not interpret the code points.

    The condition code tells you which it was. If it was an interrupt, you
    just branch back and keep going.

    There's an extra cost flag whether to test for invalid UTF-8.

    Read all about it: https://www.vm.ibm.com/library/other/22783213.pdf

    It's on page 7-251.

    Thanks!

    I did read all of it, and it was pretty close to how I would have
    designed a sw function to do the same, except for the very funky ABI:

    Both source and destination _must_ be an even register number, with the
    following odd register providing the count/length.

    Just from this little snippet I'm pretty sure this instruction has a
    sizeable startup overhead; compiler support is probably in the form of
    an intrinsic that knows about the need to allocate two pairs of
    registers, each pair starting at an even-numbered register.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to George Neuner on Sat Jun 1 12:49:46 2024
    George Neuner wrote:
    On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
    <[email protected]> wrote:

    According to EricP <[email protected]>:
    Ok, you accept international character data, you just don't have to
    check >127 characters for "drop table" etc commands.

    I don't think you are being paranoid enough.
    I still think you have to validate or sanitize the >127 string to
    ensure the code sequences only contain well formed characters.
    If you're sending the strings to a database, the database will
    invariably do detailed string validation so I wouldn't bother, but be
    prepared for the error code if it rejects the string.

    Far too much SQL is constructed by simply splicing user input into a
    query "template" string.

    When queries are done right with all user input provided via SQL
    parameters, then there is far less need to "sanitize" input.

    There is one major caveat: in SQL, table names can't be specified by
    parameter. If the user must provide a table name, then you DO have to
    splice the query string and you DO have to be careful.

    Yes, I didn't mean not parameterizing the string args.

    I was trying to think of ways that I might get your software to combine
    malformed strings creating something different. This would occur after
    the strings have been passed using parameterization, like if an index
    is built from two concatenated string fields.

  • From EricP@21:1/5 to Anton Ertl on Sat Jun 1 12:40:53 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    Stefan Monnier wrote:
    I've not dealt with UTF-8 or code points but that's because I've not
    written software that interacts with the non 1-byte character markets.
    But even something as simple as sanitizing a character string to feed
    into SQL will have to.
    AFAIK you can do that by treating the UTF-8 byte sequence as if it were
    an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
    bytes >127 which aren't used by SQL itself anyway.
    Stefan
    Of course with apologies to Herr Koenig's umlauts. :-)

    And what of all those new Asian customers your company was hoping
    to get by dealing with them in their native written language???
    You could always explain to the company president that
    you only work in ASCII so they should just get used to it.
    I think you misunderstand: the code written to sanitize an ASCII string to
    feed into SQL will *just work* to sanitize a UTF-8 string to feed
    into SQL, no matter how many funny characters and joiners and combiners
    and emojis you have in there.

    That's part of the reason why UTF-8 is so popular: you can surprisingly
    often treat it as "good old ASCII".


    Stefan
    Ok, you accept international character data, you just don't have to
    check >127 characters for "drop table" etc commands.

    Actually what you check for is meta-characters like ; " '. They are
    all ASCII, so as long as your code is 8-bit-clean, your SQL string
    sanitizer needs to know nothing about UTF-8.

    Yes, I just skipped to the result.

    I don't think you are being paranoid enough.
    I still think you have to validate or sanitize the >127 string to
    ensure the code sequences only contain well formed characters.

    Then run your string through a checker/normalizer before or
    afterwards. No need to complicate your SQL sanitizer by trying to do
    both at the same time. But if you want the last bit of performance by
    doing both at the same time, then you certainly don't want to convert
    to UTF-32 and back.

    If I want to validate combiner codes or normalize characters I need
    UTF-32 because I have to work with the whole character as a unit.

    Random hack thought #1: if the string I send starts with an umlaut as
    the first code point, which doesn't display because it is invalid.

    I found that hard to understand. Do you mean that the string starts
    with a composing diaeresis code point and is invalid because it has no
    preceding basis with which to compose? The string may fail at the
    Unicode checking/normalization stage (depending on what it checks).

    I was looking for a reason to justify having to perform
    full character validation, not just UTF-8 code validation.

    I was trying to come up with an example where I give your system
    two strings, one contains a valid base character, another containing
    a continue code, and your system concatenates the two strings to
    create a different string.

    Like a first name of 'O' and a last name of umlaut, and your software
    concatenates them in a database index creating a full name of O-umlaut.

    Though admittedly it's difficult to see how that hacks your system
    but maybe others can see a way.

    Then someone edits the first char to a/o/u and magically it changes
    to a different character, and deposits now go to a different account.

    If someone can edit the string, and that changes where deposits go to,
    someone can do that even with no Unicode involved. E.g., if someone
    can change "EricP" to "Ertl". However, my impression is that banks
    use account numbers (pure ASCII) for deposits, names are used only for
    validation; so if you provide the wrong name, a money transfer may
    fail to go through (not sure what happens if a deposit does not go
    through), but won't be to the wrong account.

    I was just trying to get people thinking of ways that malformed
    characters might be used to bypass other validation checks in
    their software.

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Mon Jun 3 08:04:52 2024
    On Thu, 30 May 2024 20:38:08 -0400, EricP wrote:

    My company's products were a real-time bond pricing and trading system,
    and customers were financial companies whose internal systems in this
    case only operated within North America in English, in ascii and ebcdic.

    No need even for “¢” characters?

  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Jun 3 08:03:53 2024
    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead, we
    see newer encryption algorithms (or ways of using them) that work better
    on current CPUs. For example, when I was first learning about computer
    encryption, I was told that CBC (“Cipher-Block Chaining”) mode was teh
    hawtness, but nowadays it’s all about GFC (“Galois-Field Counter”) mode.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 13:22:27 2024
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that work
    better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high either,
    is a requirement for (symmetric-key) cipher. Esp. when the key is
    128-bit or shorter.

    For example, when I was first learning about
    computer encryption, I was told that CBC (“Cipher-Block Chaining”)
    mode was teh hawtness,

    CBC decrypt is easily parallelized. Encrypt - not so
    much.

    but nowadays it’s all about GFC (“Galois-Field
    Counter”) mode.

    GCM is far more common spelling.

  • From Scott Lurndal@21:1/5 to Michael S on Mon Jun 3 14:07:12 2024
    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that work
    better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high either,
    is a requirement for (symmetric-key) cipher. Esp. when the key is
    128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric ciphers such
    as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et al).

    High throughput encryption has been done by hardware accelerators for
    decades now (e.g. BBN or nCipher HSM boxes sitting on a SCSI bus;
    now such HSMs are an integral part of many SoCs).

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 10:31:51 2024
    Lawrence D'Oliveiro wrote:
    On Thu, 30 May 2024 20:38:08 -0400, EricP wrote:

    My company's products were a real-time bond pricing and trading system,
    and customers were financial companies whose internal systems in this
    case only operated within North America in English, in ascii and ebcdic.

    No need even for “¢” characters?

    Nope, and no pound or euro signs either, because the currency is dollars
    with . as the decimal point. Otherwise you get into foreign exchange,
    which is a whole different bucket of fish, not the least of which are
    legal and tax issues. That's not to say such issues do not come up;
    it's just that if you want to buy $100 million worth of T-bills then
    you have to figure out how to convert your euros and deal with the
    paperwork.

    Actually the only problem with external text I encountered was when one
    day the price feed suddenly switched from decimal quantities to fractions
    like "12 1/8" or "15 5/32". Someone must have connected old software to
    the Reuters trade price network and started broadcasting ancient values.
    This was in direct violation of the network specs but there it was anyway.

  • From Michael S@21:1/5 to Scott Lurndal on Mon Jun 3 17:42:17 2024
    On Mon, 03 Jun 2024 14:07:12 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that work
    better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high either,
    is a requirement for (symmetric-key) cipher. Esp. when the key is
    128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric ciphers
    such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
    al).


    It is still not *too* fast.
    'Too fast' in my book is when, with 1B to 10B USD worth of OTP servers,
    you can break the cipher by brute force in less than 1 hour.

    High throughput encryption has been done by hardware accelerators for
    decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
    now such HSM are an integral part of many SoC).

    BTDT, though not in a high-volume app, and with programmable logic
    rather than an ASIC. It's still sufficiently slow not to become
    dangerous for the order of the world.
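
    To put rough numbers on the 'too fast' criterion in this post, here is
    a back-of-the-envelope sketch. The server count and per-server key rate
    are invented placeholders, not measurements; the point is only how the
    search time scales with key length.

```python
def brute_force_hours(key_bits: int, servers: int, keys_per_sec_per_server: float) -> float:
    """Hours to try every key of the given length (worst case, full key space)."""
    total_keys = 2 ** key_bits
    keys_per_sec = servers * keys_per_sec_per_server
    return total_keys / keys_per_sec / 3600.0


if __name__ == "__main__":
    SERVERS = 1_000_000   # assumed fleet size (hypothetical)
    RATE = 1e10           # assumed key trials per second per server (hypothetical)
    for bits in (56, 80, 128):
        hours = brute_force_hours(bits, SERVERS, RATE)
        print(f"{bits}-bit key: about {hours:.3g} hours")
```

    With these made-up figures a 56-bit key falls in well under an hour,
    while a 128-bit key does not come remotely close; each extra key bit
    doubles the time.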

  • From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jun 3 14:55:53 2024
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that work
    better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high
    either, is a requirement for (symmetric-key) cipher. Esp. when the
    key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric ciphers
    such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
    al).

    High throughput encryption has been done by hardware accelerators for
    decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
    now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the rest
    of the CPU logic to do the encryption? Furthermore, an "inbuilt"
    accelerator could interface directly with the I/O hardware of the CPU
    (e.g. PCI), saving the "intermediate" step of writing the encrypted
    data to memory.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jun 3 15:33:48 2024
    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that work
    better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high
    either, is a requirement for (symmetric-key) cipher. Esp. when the
    key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric ciphers
    such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
    al).

    High throughput encryption has been done by hardware accelerators for
    decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
    now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the rest
    of the CPU logic to do the encryption? Furthermore, an "inbuilt"
    accelerator could interface directly with the I/O hardware of the CPU
    (e.g. PCI), saving the "intermediate" step of writing the encrypted
    data to memory.

    There are always tradeoffs. The issues surrounding the
    control/sequencing logic outside of the instruction flow
    require some level of asynchronicity, so to avoid bottlenecks
    one might need to replicate the "inbuilt accelerator" if
    more than one core will be using encryption (e.g. for RSS
    with IPSEC flows).

    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).

    For network traffic, there are often other operations
    being performed on the flow (routing, encapsulation,
    fragmentation/reassembly, etc) which require the packet to be in a
    memory buffer (which could be high-speed SRAM or lower-speed DRAM),
    even when just routing from an ingress port to an egress port.

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Mon Jun 3 16:41:34 2024
    Stephen Fuld wrote:

    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)


    High throughput encryption has been done by hardware accelerators for
    decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
    now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the rest
    of the CPU logic to do the encryption? Furthermore, an "inbuilt"
    accelerator could interface directly with the I/O hardware of the CPU
    (e.g. PCI), saving the "intermediate" step of writing the encrypted
    data to memory.


    It is more of a systems issue than an ISA issue: consider a chip with
    100 cores. Do you want all 100 cores to be doing encryption at the same
    time, or do you only need a certain BW of encryption, roughly equal to
    the internet BW at hand? For the first, instructions are a reasonable
    starting point; for the second, an I/O (or attached) processor is in
    order.
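
    As a rough illustration of the second case, the number of crypto
    engines can be sized from the I/O bandwidth rather than from the core
    count. The bandwidth figures below are hypothetical, chosen only to
    show the arithmetic.

```python
import math


def engines_needed(io_gbps: float, engine_gbps: float) -> int:
    """Smallest number of crypto engines whose combined rate covers the I/O rate."""
    return math.ceil(io_gbps / engine_gbps)


if __name__ == "__main__":
    CORES = 100          # from the example in the post
    IO_GBPS = 400.0      # assumed aggregate I/O bandwidth (hypothetical)
    ENGINE_GBPS = 50.0   # assumed throughput of one crypto engine (hypothetical)
    n = engines_needed(IO_GBPS, ENGINE_GBPS)
    print(f"{n} engines cover {IO_GBPS} Gb/s; "
          f"per-core instructions would replicate the logic {CORES} times")
```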

  • From Stephen Fuld@21:1/5 to All on Mon Jun 3 17:05:11 2024
    MitchAlsup1 wrote:

    Stephen Fuld wrote:

    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)


    High throughput encryption has been done by hardware accelerators
    for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
    bus; now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the
    rest of the CPU logic to do the encryption? Furthermore, an
    "inbuilt" accelerator could interface directly with the I/O
    hardware of the CPU (e.g. PCI), saving the "intermediate" step of
    writing the encrypted data to memory.


    It is more of a systems issue than an ISA issue: consider a chip with
    100 cores. Do you want all 100 cores to be doing encryption at the same
    time, or do you only need a certain BW of encryption, roughly equal to
    the internet BW at hand? For the first, instructions are a reasonable
    starting point; for the second, an I/O (or attached) processor is in
    order.

    I agree completely. If all of the data to be en/decrypted is coming
    from/going to an external device (network, storage device), then there
    is no benefit to being able to encrypt at a faster rate than the total
    I/O bandwidth. I don't know what percentage of the data is destined
    for external use, but my gut feel is that it is a lot, probably most,
    possibly almost all.

    If that is the case, then I think a good case can be made for putting
    encryption somewhere within the I/O hardware, in order to avoid the
    extra memory bandwidth and latency requirements of either instructions
    or a "typical" attached processor.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jun 3 17:28:10 2024
    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:



    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the
    rest of the CPU logic to do the encryption? Furthermore, an
    "inbuilt" accelerator could interface directly with the I/O
    hardware of the CPU (e.g. PCI), saving the "intermediate" step of
    writing the encrypted data to memory.

    There are always tradeoffs. The issues surrounding the
    control/sequencing logic outside of the instruction flow
    require some level of asynchronicity, so to avoid bottlenecks
    one might need to replicate the "inbuilt accelerator" if
    more than one core will be using encryption (e.g. for RSS
    with IPSEC flows).


    Yes, but putting the instructions into the core means you are
    replicating the logic for every core.

    In the scale of a modern CPU, it's a small fraction of the logic.

    The ARM Neoverse cores, for example, require very little area.



    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).


    I look at it differently (and perhaps incorrectly). I view encryption
    as one of several "transformations" that data goes through in its path
    to/from some external device.

    That's certainly a valid view, if perhaps not complete. There are
    use cases for in-place encryption.

    Adding encryption (which of the dozen standard symmetric and asymmetric
    cipher algorithms?) to a hardware device does increase complexity, and
    thus cost at the expense of extensibility (new algorithms come along
    periodically). The cost of verifying crypto is a bit higher as it is
    very important to get correct when baking into gates.


    For example, if the external device is a
    disk, the data from memory may be gathered from multiple locations, is
    serialized, perhaps encoded (e.g. 8b10b), has (perhaps several levels)
    of ECC added, etc. Viewing it like that makes encryption one of many
    steps along the I/O pipeline. Under that view, encryption is an
    option, probably controlled by some bits in the I/O mechanism, not as
    a separate device requiring interrupt support etc.

    In the Cavium crypto-enabled DPUs, the crypto block is inserted
    into the data-path where necessary, when necessary; and to the extent
    that a streaming protocol/alg is used, will encrypt/decrypt as the data
    is passing from the ingress point to the egress point (which could
    be another external port, or an on-board CPU). It can also be used
    as a stand-alone crypto accelerator by the on-board CPUs.

    Note that crypto is used for more than just data encryption/decryption;
    there's also digesting and digital signatures which rely on asymmetric
    algorithms such as RSA or EC and don't necessarily fit into the
    "path to the I/O device" model you've espoused.

  • From Thomas Koenig@21:1/5 to Scott Lurndal on Mon Jun 3 18:01:00 2024
    Scott Lurndal <[email protected]> schrieb:

    Adding encryption (which of the dozen standard symmetric and asymmetric
    cipher algorithms?)

    At the moment, AES.

    to a hardware device does increase complexity, and
    thus cost at the expense of extensibility (new algorithms come along
    periodically). The cost of verifying crypto is a bit higher as it is
    very important to get correct when baking into gates.

    Seems to be fairly common these days, looking at
    https://en.wikipedia.org/wiki/AES_instruction_set .

    It appears that one round of AES fits fairly well into one cycle.

  • From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jun 3 17:15:34 2024
    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that work
    better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high
    either, is a requirement for (symmetric-key) cipher. Esp. when the
    key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric ciphers
    such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
    al).

    High throughput encryption has been done by hardware accelerators for
    decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
    now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the
    rest of the CPU logic to do the encryption? Furthermore, an
    "inbuilt" accelerator could interface directly with the I/O
    hardware of the CPU (e.g. PCI), saving the "intermediate" step of
    writing the encrypted data to memory.

    There are always tradeoffs. The issues surrounding the
    control/sequencing logic outside of the instruction flow
    require some level of asynchronicity, so to avoid bottlenecks
    one might need to replicate the "inbuilt accelerator" if
    more than one core will be using encryption (e.g. for RSS
    with IPSEC flows).


    Yes, but putting the instructions into the core means you are
    replicating the logic for every core. If you don't tie the amount of
    encryption hardware you need to the number of cores, you can adjust it
    to meet the needs independently of the amount of computation you need
    (i.e. number of cores).




    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).


    I look at it differently (and perhaps incorrectly). I view encryption
    as one of several "transformations" that data goes through in its path
    to/from some external device. For example, if the external device is a
    disk, the data from memory may be gathered from multiple locations, is
    serialized, perhaps encoded (e.g. 8b10b), has (perhaps several levels)
    of ECC added, etc. Viewing it like that makes encryption one of many
    steps along the I/O pipeline. Under that view, encryption is an
    option, probably controlled by some bits in the I/O mechanism, not as
    a separate device requiring interrupt support etc.
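
    The "one of many steps along the I/O pipeline" view can be sketched in
    software as a chain of stages selected by control bits. Everything
    here is hypothetical: the flag names, the one-byte XOR "cipher" (not
    secure) and the additive checksum standing in for ECC exist only to
    show encryption as an optional stage rather than a separate device.

```python
from enum import Flag, auto


class IoFlags(Flag):
    """Hypothetical per-transfer control bits."""
    ENCRYPT = auto()
    CHECKSUM = auto()


def gather(chunks: list[bytes]) -> bytes:
    """Collect data scattered across several memory locations."""
    return b"".join(chunks)


def toy_cipher(data: bytes, key: int) -> bytes:
    """Toy stand-in for a real cipher: XOR each byte with a one-byte key (NOT secure)."""
    return bytes(b ^ key for b in data)


def append_checksum(data: bytes) -> bytes:
    """Stand-in for ECC: append a one-byte additive checksum."""
    return data + bytes([sum(data) % 256])


def io_write(chunks: list[bytes], flags: IoFlags, key: int = 0x5A) -> bytes:
    """Run the outbound stages in order; encryption is just one optional stage."""
    data = gather(chunks)
    if IoFlags.ENCRYPT in flags:
        data = toy_cipher(data, key)
    if IoFlags.CHECKSUM in flags:
        data = append_checksum(data)
    return data


if __name__ == "__main__":
    chunks = [b"ab", b"cd"]
    print(io_write(chunks, IoFlags(0)))
    print(io_write(chunks, IoFlags.ENCRYPT | IoFlags.CHECKSUM))
```

    The caller only sets bits on the transfer; no separate device,
    completion interrupt, or extra pass through memory appears in the
    model, which is the property being argued for.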



    For network traffic, there are often other operations
    being performed on the flow (routing, encapsulation,
    fragmentation/reassembly, etc) which require the packet to be in a
    memory buffer (which could be high-speed SRAM or lower-speed DRAM),
    even when just routing from an ingress port to an egress port.


    Yes. In my view, encryption is just another of these operations.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Jun 3 18:11:56 2024
    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:

    Adding encryption (which of the dozen standard symmetric and asymmetric
    cipher algorithms?)

    At the moment, AES.

    to a hardware device does increase complexity, and
    thus cost at the expense of extensibility (new algorithms come along
    periodically). The cost of verifying crypto is a bit higher as it is
    very important to get correct when baking into gates.

    Seems to be fairly common these days, looking at
    https://en.wikipedia.org/wiki/AES_instruction_set .

    As I mentioned earlier in the thread, all modern CPUs have
    support for the standard algorithms in their instruction
    set (optionally fused out for export).


    It appears that one round of AES fits fairly well into one cycle.

    Yes.

  • From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jun 3 18:57:24 2024
    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:



    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the
    rest of the CPU logic to do the encryption? Furthermore, an
    "inbuilt" accelerator could interface directly with the I/O
    hardware of the CPU (e.g. PCI), saving the "intermediate" step of
    writing the encrypted data to memory.

    There are always tradeoffs. The issues surrounding the
    control/sequencing logic outside of the instruction flow
    require some level of asynchronicity, so to avoid bottlenecks
    one might need to replicate the "inbuilt accelerator" if
    more than one core will be using encryption (e.g. for RSS
    with IPSEC flows).


    Yes, but putting the instructions into the core means you are
    replicating the logic for every core.

    In the scale of a modern CPU, it's a small fraction of the logic.

    The ARM neoverse cores, for example, require very little area.

    Agreed. I was assuming that the cost of the logic was about the same
    whether it was done as CPU instructions or a chunk of accelerator logic
    in the I/O stream. If that is true, then the cost of having multiples
    of them in the I/O stream is small.




    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).


    I look at it differently (and perhaps incorrectly). I view
    encryption as one of several "transformations" that data goes
    through in its path to/from some external device.

    That's certainly a valid view, if perhaps not complete. There are
    use cases for in-place encryption.

    Good. Can you give some examples, and perhaps an estimate of what
    percentage of the total encryption operations are in place? Note that
    it may be possible to add a feature to the "in-stream" hardware to
    allow in-place encryption - i.e. both sides go to/come from memory.



    Adding encryption (which of the dozen standard symmetric and
    asymmetric cipher algorithms?) to a hardware device does increase
    complexity, and thus cost at the expense of extensibility (new
    algorithms come along periodically).

    Agreed. But this is also true for new CPU instructions.


    The cost of verifying crypto is
    a bit higher as it is very important to get correct when baking into
    gates.


    Sure. And I expect it is also higher because of the extra security
    precautions against side-channel attacks, etc.



    For example, if the external device is a
    disk, the data from memory may be gathered from multiple locations,
    is serialized, perhaps encoded (e.g. 8b10b), has (perhaps several
    levels) of ECC added, etc. Viewing it like that makes encryption
    one of many steps along the I/O pipeline. Under that view,
    encryption is an option, probably controlled by some bits in the
    I/O mechanism, not as a separate device requiring interrupt support
    etc.

    In the Cavium crypto-enabled DPUs, the crypto block is inserted
    into the data-path where necessary, when necessary; and to the extent
    that a streaming protocol/alg is used, will encrypt/decrypt as the
    data is passing from the ingress point to the egress point (which
    could be another external port, or an on-board CPU). It can also be
    used as a stand-alone crypto accelerator by the on-board CPUs.


    Good to know. Proof of concept for my suggestion. :-) Can you talk
    about advantages/disadvantages of that mechanism versus other
    implementations?




    Note that crypto is used for more than just data
    encryption/decryption; there's also digesting and digital signatures
    which rely on asymmetric algorithms such as RSA or EC and don't
    necessarily fit into the "path to the I/O device" model you've
    espoused.

    Yes, of course. But I think digital signature creation/verification
    could be fit into the streaming model. Is that wrong? With regard to
    RSA/EC, etc. I absolutely agree.


    I do want to thank you for indulging my fantasies. :-)



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to Thomas Koenig on Mon Jun 3 23:15:11 2024
    On Mon, 3 Jun 2024 18:01:00 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Scott Lurndal <[email protected]> schrieb:

    Adding encryption (which of the dozen standard symmetric and
    asymmetric cipher algorithms?)

    At the moment, AES.

    to a hardware device does increase complexity, and
    thus cost at the expense of extensibility (new algorithms come along
    periodically). The cost of verifying crypto is a bit higher as it
    is very important to get correct when baking into gates.

    Seems to be fairly common these days, looking at
    https://en.wikipedia.org/wiki/AES_instruction_set .

    It appears that one round of AES fits fairly well into one cycle.

    One/cycle throughput fits well. Even two/cycle throughput fits.
    One-cycle latency does not fit unless you target a very low frequency.
    Latency on POWER9: 6 clocks. On the majority of modern Intel and AMD
    cores: 3-4 clocks. On Apple M1: 3 clocks.
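
    The difference between throughput and latency shows up when you count
    cycles. Below is a small scheduling sketch, assuming a single pipelined
    AES-round unit that accepts one round per cycle with a 4-cycle latency
    (within the range quoted in this post) and the 10 rounds of AES-128.
    The function names and the round-robin issue order are illustrative
    assumptions, not a model of any particular core.

```python
def cycles_interleaved(blocks: int, rounds: int = 10, latency: int = 4) -> int:
    """Cycles to run `rounds` dependent rounds on each of `blocks` independent blocks.

    The unit issues at most one round per cycle; a block's next round
    cannot issue until its previous round has completed.
    """
    ready = [0] * blocks   # earliest cycle each block's next round may issue
    next_slot = 0          # next free issue slot of the unit
    for _ in range(rounds):
        for b in range(blocks):
            issue = max(next_slot, ready[b])
            ready[b] = issue + latency
            next_slot = issue + 1
    return max(ready)


def cycles_chained(blocks: int, rounds: int = 10, latency: int = 4) -> int:
    """Cycles when each block depends on the previous one (e.g. CBC encrypt)."""
    return blocks * cycles_interleaved(1, rounds, latency)


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, "blocks:", cycles_interleaved(n), "interleaved vs",
              cycles_chained(n), "chained")
```

    Under these assumptions a lone block, or a chain where each block waits
    for the previous one, pays the full latency on every round, while a
    handful of independent blocks keeps the unit issuing every cycle.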

  • From Michael S@21:1/5 to Scott Lurndal on Mon Jun 3 23:15:48 2024
    On Mon, 03 Jun 2024 18:11:56 GMT
    [email protected] (Scott Lurndal) wrote:

    Thomas Koenig <[email protected]> writes:
    Scott Lurndal <[email protected]> schrieb:

    Adding encryption (which of the dozen standard symmetric and
    asymmetric cipher algorithms?)

    At the moment, AES.

    to a hardware device does increase complexity, and
    thus cost at the expense of extensibility (new algorithms come
    along periodically). The cost of verifying crypto is a bit higher
    as it is very important to get correct when baking into gates.

    Seems to be fairly common these days, looking at
    https://en.wikipedia.org/wiki/AES_instruction_set .

    As I mentioned earlier in the thread, all modern CPUs have
    support for the standard algorithms in their instruction
    set (optionally fused out for export).


    It appears that one round of AES fits fairly well into one cycle.

    Yes.

    No.

  • From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jun 3 20:31:11 2024
    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:



    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the
    rest of the CPU logic to do the encryption? Furthermore, an
    "inbuilt" accelerator could interface directly with the I/O
    hardware of the CPU (e.g. PCI), saving the "intermediate" step of
    writing the encrypted data to memory.

    There are always tradeoffs. The issues surrounding the
    control/sequencing logic outside of the instruction flow
    require some level of asynchronicity, so to avoid bottlenecks
    one might need to replicate the "inbuilt accelerator" if
    more than one core will be using encryption (e.g. for RSS
    with IPSEC flows).


    Yes, but putting the instructions into the core means you are
    replicating the logic for every core.

    In the scale of a modern CPU, it's a small fraction of the logic.

    The ARM neoverse cores, for example, require very little area.

    Agreed. I was assuming that the cost of the logic was about the same
    whether it was done as CPU instructions or a chunk of accelerator logic
    in the I/O stream. If that is true, then the cost of having multiples
    of them in the I/O stream is small.

    Although the accelerator requires additional logic to interface
    to the CPUs (either by presenting as a memory mapped device,
    integrated into the processor register scheme, or some other
    proprietary mechanism). Which means non-standard software is
    required to manage and use the accelerator.





    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).


    I look at it differently (and perhaps incorrectly). I view
    encryption as one of several "transformations" that data goes
    through in its path to/from some external device.

    That's certainly a valid view, if perhaps not complete. There are
    use cases for in-place encryption.

    Good. Can you give some examples, and perhaps an estimate of what
    percentage of the total encryption operations are in place? Note that
    it may be possible to add a feature to the "in-stream" hardware to
    allow in-place encryption - i.e. both sides go to/come from memory.

    Consider file access. From the perspective of the disk, all blocks
    are identical - it doesn't know which partition, filesystem, or file
    that any individual block is part of, if any.

    Whole-disk encryption can happen at the drive. Per-file (or
    per-filesystem) happens in the filesystem driver(s), perhaps
    with a hardware assist, but it wouldn't be on the path from
    the disk to memory.

    There are cases where only a portion of a file is encrypted, and
    cases where the encryption is combined with compression (pkzip,
    rar, etc).




    Adding encryption (which of the dozen standard symmetric and
    asymmetric cipher algorithms?) to a hardware device does increase
    complexity, and thus cost at the expense of extensibility (new
    algorithms come along periodically).

    Agreed. But this is also true for new CPU instructions.

    A hardware accelerator could, for example, be microcoded
    rather than using hard logic to future-proof it.



    The cost of verifying crypto is
    a bit higher as it is very important to get correct when baking into
    gates.


    Sure. And I expect it is also higher because of the extra security
    precautions against side-channel attacks, etc.

    Timing attacks, in particular.

    <snip>


    In the Cavium crypto-enabled DPUs, the crypto block is inserted
    into the data-path where necessary, when necessary; and to the extent
    that a streaming protocol/alg is used, will encrypt/decrypt as the
    data is passing from the ingress point to the egress point (which
    could be another external port, or an on-board CPU). It can also be
    used as a stand-alone crypto accelerator by the on-board CPUs.


    Good to know. Proof of concept for my suggestion. :-) Can you talk
    about advantages/disadvantages of that mechanism versus other
    implementations?

    Freeing the CPUs to do useful work instead of crypto is the first
    reason for that type of architecture. There's plenty to do.





    Note that crypto is used for more than just data
    encryption/decryption; there's also digesting and digital signatures
    which rely on asymmetric algorithms such as RSA or EC and don't
    necessarily fit into the "path to the I/O device" model you've
    espoused.

    Yes, of course. But I think digital signature creation/verification
    could be fit into the streaming model. Is that wrong? With regard to
    RSA/EC, etc. I absolutely agree.

    Digital signatures require X.509 support, and they're often embedded
    in non-encrypted data streams. The hardware processing
    the stream won't know anything about the data, including which
    parts would need to be digested (and the data may need decrypting
    first). Even if the hardware had the keys necessary to decrypt
    IPSEC packets and look inside for signatures, it would be very
    complicated to design hardware flexible enough to locate the
    data that needs to be digested in a sequence of packets (which
    may be arriving out of order).

  • From MitchAlsup1@21:1/5 to Michael S on Mon Jun 3 22:34:46 2024
    Michael S wrote:

    On Mon, 3 Jun 2024 18:01:00 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Scott Lurndal <[email protected]> schrieb:

    Adding encryption (which of the dozen standard symmetric and
    asymmetric cipher algorithms?)

    At the moment, AES.

    to a hardware device does increase complexity, and
    thus cost at the expense of extensibility (new algorithms come along
    periodically). The cost of verifying crypto is a bit higher as it
    is very important to get correct when baking into gates.

    Seems to be fairly common these days, looking at
    https://en.wikipedia.org/wiki/AES_instruction_set .

    It appears that one round of AES fits fairly well into one cycle.

    One/cycle throughput fits well. Even two/cycle throughput fits.
    One cycle latency does not fit unless you target very low frequency.
    Latency on POWER9 - 6 clocks. On majority of modern Intel and AMD cores
    3-4 clocks. On Apple M1 - 3 clocks.


    I agree here; you should consider encryption as smaller than an FMUL
    unit, with about the characteristics of an FMUL: 1-cycle throughput,
    3-5 cycle latency.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Jun 3 22:47:05 2024
    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:



    The ARM neoverse cores, for example, require very little area.

    Agreed. I was assuming that the cost of the logic was about the same
    whether it was done as CPU instructions or a chunk of accelerator logic
    in the I/O stream. If that is true, then the cost of having multiples
    of them in the I/O stream is small.

    Although the accelerator requires additional logic to interface
    to the CPUs (either by presenting as a memory mapped device,
    integrated into the processor register scheme, or some other
    proprietary mechanism). Which means non-standard software is
    required to manage and use the accelerator.

    First consider that it is possible for an I/O device to DMA directly
    to another I/O device in the PCIe routing tree/DAG.

    Then, consider that with this infrastructure, you could DMA from
    memory through the Cryptor and back to memory (or anywhere you
    wanted it).





    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).


    I look at it differently (and perhaps incorrectly). I view
    encryption as one of several "transformations" that data goes
    through in its path to/from some external device.

    That's certainly a valid view, if perhaps not complete. There are
    use cases for in-place encryption.

    Good. Can you give some examples, and perhaps an estimate of what
    percentage of the total encryption operations are in place? Note that
    it may be possible to add a feature to the "in-stream" hardware to
    allow in-place encryption - i.e. both sides go to/come from memory.

    Different users want their files encrypted using different keys than
    any other user--where file could be memory resident (or not).

    Consider file access. From the perspective of the disk, all blocks
    are identical - it doesn't know which partition, filesystem, or file
    that any individual block is part of, if any.

    Whole-disk encryption can happen at the drive. Per-file (or per-filesystem) happens in the filesystem driver(s), perhaps
    with a hardware assist, but it wouldn't be on the path from
    the disk to memory.

    You may be correct in how it is now--but if the device has encryption
    services why can they not be applied sector by sector ??

    There are cases where only a portion of a file is encrypted, and
    cases where the encryption is combined with compression (pkzip,
    rar, etc).




    Adding encryption (which of the dozen standard symmetric and
    asymmetric cipher algorithms?) to a hardware device does increase
    complexity, and thus cost at the expense of extensibility (new
    algorithms come along periodically).

    Agreed. But this is also true for new CPU instructions.

    A hardware accelerator could, for example, be microcoded
    rather than using hard logic to future-proof it.



    The cost of verifying crypto is
    a bit higher as it is very important to get correct when baking into
    gates.

    Verifying encryption is not harder than verifying IEEE 754
    instructions.



    Sure. And I expect it is also higher because of the extra security
    precautions against side-channel attacks, etc.

    Timing attacks, in particular.

    All the more reason to run encryption through a device where you cannot
    measure time accurately. I/O fits this bill very well. It seems to me
    that as long as the system can maintain the encryption/decryption
    throughput, all should be well.

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Jun 4 01:17:33 2024
    On Mon, 3 Jun 2024 13:22:27 +0300, Michael S wrote:

    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    but nowadays it’s all about GFC (“Galois-Field Counter”) mode.

    GCM is far more common spelling.

    Yeah. It’s just that Évariste Galois is known mainly for just one thing: Galois field theory, which is what’s relevant here. Which he wrote up on
    his last night alive.

    Imagine if he’d said “stuff this, I’ll write it up tomorrow night, I’m going to bed” ...

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Jun 4 01:55:24 2024
    On Mon, 3 Jun 2024 22:47:05 +0000, MitchAlsup1 wrote:

    ... if the device has encryption
    services why can they not be applied sector by sector ??

    They can indeed. This is what “counter mode” is for: it lets you
    encrypt/decrypt any part of some large data blob with random access,
    without having to start from the beginning each time.
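
    The random-access property can be sketched in a few lines. This is only
    an illustration under stated assumptions: the keystream comes from a
    SHA-256 stand-in rather than a real block cipher such as AES (to keep
    the sketch standard-library only), and the names `keystream_block`,
    `ctr_xor`, `SECTOR`, the key and the nonce are all made up for the
    example. It is not suitable for protecting real data.

```python
import hashlib

BLOCK = 32    # keystream bytes produced per counter value (SHA-256 digest size)
SECTOR = 512  # hypothetical sector size for the demonstration


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for E_key(nonce || counter). A real design would use a block
    # cipher such as AES here; a hash is used only so the sketch runs as-is.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()


def ctr_xor(key: bytes, nonce: bytes, data: bytes, offset: int = 0) -> bytes:
    """XOR `data` with the keystream that starts at byte `offset` of the blob.

    Encryption and decryption are the same operation, and only the counter
    values covering [offset, offset + len(data)) are ever computed.
    """
    out = bytearray()
    pos, end, i = offset, offset + len(data), 0
    while pos < end:
        block = keystream_block(key, nonce, pos // BLOCK)
        start = pos % BLOCK
        take = min(BLOCK - start, end - pos)
        out += bytes(d ^ k for d, k in zip(data[i:i + take], block[start:start + take]))
        pos += take
        i += take
    return bytes(out)


key, nonce = b"k" * 16, b"n" * 8
plain = bytes(range(256)) * 8                 # 2048 bytes, i.e. four sectors
cipher = ctr_xor(key, nonce, plain)           # encrypt the whole blob once

# Decrypt only the third sector, without touching the first two.
sector2 = ctr_xor(key, nonce, cipher[2 * SECTOR:3 * SECTOR], offset=2 * SECTOR)
assert sector2 == plain[2 * SECTOR:3 * SECTOR]
```

    The price of this convenience is that a (key, nonce, counter)
    combination must never be reused for different data.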

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Jun 4 01:45:58 2024
    On Thu, 30 May 2024 15:35:37 GMT, Anton Ertl wrote:

    Actually what you check for is meta-characters like ; " '. They are all ASCII, so as long as your code is 8-bit-clean, your SQL string sanitizer needs to know nothing about UTF-8.

    According to the official spec, an SQL string literal is delimited by
    “'” characters, and an embedded single quote is escaped by writing it
    twice: “''”. (Double quotes delimit identifiers, not strings.)

    That’s it. Nothing else is special, so any other character/byte value in
    the string can be simply passed through as is.

    Of course, things like LIKE and REGEXP clauses are an entirely separate
    matter ...
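
    A minimal sketch of that quoting rule, checked against SQLite from the
    Python standard library. The helper name `sql_string_literal` is made
    up for the example, and the rule shown is the standard one (single
    quotes, doubled when embedded); some engines additionally treat
    backslash as an escape character in their default configuration, so
    this should not be assumed sufficient everywhere.

```python
import sqlite3


def sql_string_literal(text: str) -> str:
    """Quote `text` as a standard SQL string literal.

    Wrap in single quotes and double every embedded single quote;
    every other character is passed through unchanged.
    """
    return "'" + text.replace("'", "''") + "'"


conn = sqlite3.connect(":memory:")

# Semicolons, comment markers, double quotes and non-ASCII text all pass
# through untouched; only the single quote needs doubling.
tricky = "O'Reilly; DROP TABLE users; -- \"caf\u00e9\""
row = conn.execute("SELECT " + sql_string_literal(tricky)).fetchone()
assert row[0] == tricky
```

    Passing values as query parameters avoids the need for this kind of
    quoting altogether, where the interface allows it.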

  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Jun 4 02:00:55 2024
    On Mon, 3 Jun 2024 17:42:17 +0300, Michael S wrote:

    On Mon, 03 Jun 2024 14:07:12 GMT [email protected] (Scott Lurndal)
    wrote:

    Most modern CPUs have instruction set support for symmetric ciphers
    such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
    al).

    It is still not *too* fast.
    'Too fast' in my book is when with 1B to 10B USD worth of OTP servers
    you can break cipher by brute force in less than 1 hour.

    The good algorithms are designed to be fast for encryption/decryption use, while still being uselessly slow for cracking purposes.

    Hash algorithms come in two flavours: cryptographic hashes (as mentioned
    above) and password hashes. Cryptographic hashes have to be fast to
    compute, but password hashes should take some appreciable fraction of a
    second. This is fast enough to authenticate a user logging in, while significantly slowing down password-guessing attacks.

    For example, the WordPress password-hashing algorithm takes a
    cryptographic hash like MD5 (considered crap nowadays), and iterates it
    8000 times. And suddenly crap becomes good.
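
    The general idea of deliberately slowing a fast hash by iteration can
    be sketched with PBKDF2 from the Python standard library. To be clear,
    this is a swapped-in technique and not the WordPress scheme described
    above; the iteration count, salt size and function names below are
    illustrative assumptions only.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation


def hash_password(password: str, salt: bytes, iterations: int = ITERATIONS) -> bytes:
    # PBKDF2 applies HMAC-SHA256 `iterations` times, so each guess costs an
    # attacker the same multiple of work that it costs the login check.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = ITERATIONS) -> bool:
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(hash_password(password, salt, iterations), expected)


salt = os.urandom(16)                      # per-user random salt
stored = hash_password("correct horse", salt)

assert verify_password("correct horse", salt, stored)
assert not verify_password("wrong guess", salt, stored)
```

    Raising the iteration count scales the cost of a legitimate login and
    of each guess by the same factor.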

  • From Stephen Fuld@21:1/5 to All on Tue Jun 4 06:09:25 2024
    MitchAlsup1 wrote:

    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:



    The ARM neoverse cores, for example, require very little area.

    Agreed. I was assuming that the cost of the logic was about the
    same whether it was done as CPU instructions or a chunk of
    accelerator logic in the I/O stream. If that is true, then the
    cost of having multiples of them in the I/O stream is small.

    Although the accelerator requires additional logic to interface
    to the CPUs (either by presenting as a memory mapped device,
    integrated into the processor register scheme, or some other
    proprietary mechanism). Which means non-standard software is
    required to manage and use the accelerator.

    First consider that it is possible for an I/O device to DMA directly
    to another I/O device in the PCIe routing tree/DAG.

    Then, consider that with this infrastructure, you could DMA from
    memory through the Cryptor and back to memory (or anywhere you wanted
    it).





    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).
    I look at it differently (and perhaps incorrectly). I view
    encryption as one of several "transformations" that data goes
    through in its path to/from some external device.

    That's certainly a valid view, if perhaps not complete. There
    are use cases for in-place encryption.

    Good. Can you give some examples, and perhaps an estimate of what percentage of the total encryption operations are in place? Note
    that it may be possible to add a feature to the "in-stream"
    hardware to allow in-place encryption - i.e. both sides go
    to/come from memory.

    Different users want their files encrypted using different keys than
    any other user--where file could be memory resident (or not).


    Memory resident files I agree with you about. But in my conception of
    how this would all work, there would be a key specified for each I/O
    operation, thus, I/O to different files could trivially have different
    keys.




    Consider file access. From the perspective of the disk, all blocks
    are identical - it doesn't know which partition, filesystem, or file
    that any individual block is part of, if any.

    Whole-disk encryption can happen at the drive. Per-file (or per-filesystem) happens in the filesystem driver(s), perhaps
    with a hardware assist, but it wouldn't be on the path from
    the disk to memory.

    You may be correct in how it is now--but if the device has encryption services why can they not be applied sector by sector ??

    There are cases where only a portion of a file is encrypted, and
    cases where the encryption is combined with compression (pkzip,
    rar, etc).

    If the "boundary" of where the encrypted portion starts or ends
    corresponds to where an I/O boundary is, then no problem. If not, then
    the interface requires the ability to start/stop encryption at
    an arbitrary spot within the I/O. I envision this to work sort of like
    a scatter gather, but instead of different memory addresses, each
    "chunk" is encrypted or not. This is probably needed anyway for things
    like network I/O where you want to encrypt the data but not the header.
    As for combining it with compression, clearly the encryption must come
    after the compression, and decryption must come before decompression.
    If you are doing the compression in the hardware interface that
    shouldn't be a problem, and if you are doing it in the software, then
    it definitely isn't a problem.
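
    The scatter/gather-like interface described above might be modelled
    roughly as follows. This is only a sketch of the idea: the `Chunk`
    descriptor, `apply_chunks`, and the XOR `toy_cipher` are hypothetical
    names invented for illustration, and the toy transform stands in for
    whatever cipher the hardware would actually apply.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    length: int    # number of bytes this descriptor covers
    encrypt: bool  # True: run the bytes through the cipher; False: pass through


def apply_chunks(data: bytes, chunks: List[Chunk],
                 transform: Callable[[bytes], bytes]) -> bytes:
    """Walk one I/O buffer, transforming only the chunks marked for encryption."""
    if sum(c.length for c in chunks) != len(data):
        raise ValueError("descriptors must cover the buffer exactly")
    out = bytearray()
    pos = 0
    for c in chunks:
        piece = data[pos:pos + c.length]
        out += transform(piece) if c.encrypt else piece
        pos += c.length
    return bytes(out)


def toy_cipher(piece: bytes) -> bytes:
    # Placeholder transform (XOR with a constant); NOT a real cipher.
    return bytes(b ^ 0x5A for b in piece)


# A network-style buffer: a 4-byte header sent in the clear, then the payload.
packet = b"HDR:" + b"secret payload"
descriptors = [Chunk(4, False), Chunk(14, True)]
wire = apply_chunks(packet, descriptors, toy_cipher)

assert wire[:4] == b"HDR:"                                   # header untouched
assert apply_chunks(wire, descriptors, toy_cipher) == packet  # XOR undoes itself
```

    With a real block cipher, chunk boundaries that do not fall on cipher
    block boundaries would need a mode that tolerates arbitrary lengths.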




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Jun 4 11:09:16 2024
    Stephen Fuld wrote:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption.
    Instead, we see newer encryption algorithms (or ways of using them)
    that work better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high
    either, is a requirement for (symmetric-key) cipher. Esp. when the
    key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric ciphers
    such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
    al).

    High throughput encryption has been done by hardware accelerators for
    decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
    now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the rest
    of the CPU logic to do the encryption? Furthermore, an "inbuilt"
    accelerator could interface directly with the I/O hardware of the CPU
    (e.g. PCI), saving the "intermediate" step of writing the encrypted
    data to memory.

    That logic already exists, in the form of a single thread/core dedicated
    to the job.

    With 30-100 cores on a single die, it becomes very cheap to dedicate one
    of them to babysit such a process, compared to the cost of making a
    custom chunk of VLSI to do the same. This is particularly true because
    the logic needed in the babysitting process is mostly straight line,
    with a very limited number of hard-to-predict branches.

    I.e. h.264 CABAC decoding has three branches per bit decoded, at least
    one of them impossible to predict or work around with clever coding.
    Here it makes perfect sense to have a chunk of hw to handle the heavy
    lifting. Monitoring block encryption/decryption not so much.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Michael S on Tue Jun 4 10:54:27 2024
    Michael S wrote:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that work
    better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high either,
    is a requirement for (symmetric-key) cipher. Esp. when the key is
    128-bit or shorter.

    That's correct:

    CPU efficiency, primarily on the reference 32-bit platform (PentiumPro
    200 MHz) but also on an 8-bit "smart card" implementation was one of the
    key requirements for the AES competition.

    When a group of four programmers (including me) spent a week on CERN's candidate, we were able to triple the speed, bringing it into parity
    with the eventual winner. All the finalists were more or less the same
    speed at this point, i.e. able to do full duplex 100 Mbit/s Ethernet
    traffic (so around 20 MB/s) on a single thread/core.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Tue Jun 4 12:11:33 2024
    On Tue, 4 Jun 2024 10:54:27 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption.
    Instead, we see newer encryption algorithms (or ways of using
    them) that work better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high
    either, is a requirement for (symmetric-key) cipher. Esp. when the
    key is 128-bit or shorter.

    That's correct:

    CPU efficiency, primarily on the reference 32-bit platform
    (PentiumPro 200 MHz) but also on an 8-bit "smart card" implementation
    was one of the key requirements for the AES competition.

    When a group of four programmers (including me) spent a week on
    CERN's candidate, we were able to triple the speed, bringing it into
    parity with the eventual winner. All the finalists were more or less
    the same speed at this point, i.e. able to do full duplex 100 Mbit/s
    Ethernet traffic (so around 20 MB/s) on a single thread/core.

    Terje


    My point was that for symmetric cipher intended for use with "short"
    keys, at least during a phase of standardization, exceptionally high
    efficiency on existing CPUs would be considered a defect rather than
    advantage.
    Not necessarily so for "long" keys, where unbreakability by brute force
    is taken for granted.

  • From Scott Lurndal@21:1/5 to [email protected] on Tue Jun 4 13:06:11 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    Scott Lurndal wrote:



    The ARM neoverse cores, for example, require very little area.

    Agreed. I was assuming that the cost of the logic was about the same
    whether it was done as CPU instructions or a chunk of accelerator logic
    in the I/O stream. If that is true, then the cost of having multiples
    of them in the I/O stream is small.

    Although the accelerator requires additional logic to interface
    to the CPUs (either by presenting as a memory mapped device,
    integrated into the processor register scheme, or some other
    proprietary mechanism). Which means non-standard software is
    required to manage and use the accelerator.

    First consider that it is possible for an I/O device to DMA directly
    to another I/O device in the PCIe routing tree/DAG.

    If, and only if, the host bridge supports peer-to-peer transactions,
    which is not a given.


    Then, consider that with this infrastructure, you could DMA from
    memory through the Cryptor and back to memory (or anywhere you
    wanted it).

    Yes, this can be done, if the PCI endpoint(s) support it. Such
    routing is an optional feature of PCI Express.

    There are more efficient ways to link various hardware elements
    together in such a way as to include not only encryption, but
    also compression/decompression, regex (or other pattern) matching, ingress and egress.






    From the operating software standpoint, it becomes most
    convenient, then, to model the offload as a device which
    requires OS support (and intervention for e.g. interrupt
    handling).


    I look at it differently (and perhaps incorrectly). I view
    encryption as one of several "transformations" that data goes
    through in its path to/from some external device.

    That's certainly a valid view, if perhaps not complete. There are
    use cases for in-place encryption.

    Good. Can you give some examples, and perhaps an estimate of what
    percentage of the total encryption operations are in place? Note that
    it may be possible to add a feature to the "in-stream" hardware to
    allow in-place encryption - i.e. both sides go to/come from memory.

    Different users want their files encrypted using different keys than
    any other user--where file could be memory resident (or not).

    Consider file access. From the perspective of the disk, all blocks
    are identical - it doesn't know which partition, filesystem, or file
    that any individual block is part of, if any.

    Whole-disk encryption can happen at the drive. Per-file (or
    per-filesystem) happens in the filesystem driver(s), perhaps
    with a hardware assist, but it wouldn't be on the path from
    the disk to memory.

    You may be correct in how it is now--but if the device has encryption
    services why can they not be applied sector by sector ??

    Still not sufficient, as a filesystem could easily pack fragments
    from multiple files into a single sector or allocation unit
    (and with modern sector sizes of 4096 bytes....)


    Sure. And I expect it is also higher because of the extra security
    precautions against side-channel attacks, etc.

    Timing attacks, in particular.

    All the more reason to run encryption through a device where you cannot
    measure time accurately.

    Indeed, we've been doing that for a couple of decades now.

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jun 4 17:00:33 2024
    Terje Mathisen wrote:



    That logic already exists, in the form of a single thread/core
    dedicated to the job.

    With 30-100 cores on a single die, it becomes very cheap to dedicate
    one of them to babysit such a process, compared to the cost of making a
    custom chunk of VLSI to do the same. This is particularly true because
    the logic needed in the babysitting process is mostly straight line,
    with a very limited number of hard-to-predict branches.

    I.e. h.264 CABAC decoding has three branches per bit decoded, at least
    one of them impossible to predict or work around with clever coding.

    How many instructions in the then-clause and in the else-clause ??
    If these are fewer than 8, My 66000 can process them without
    "branching" using predication.

    Here it makes perfect sense to have a chunk of hw to handle the heavy lifting. Monitoring block encryption/decryption not so much.

    Terje

  • From Stephen Fuld@21:1/5 to Terje Mathisen on Tue Jun 4 16:26:27 2024
    Terje Mathisen wrote:

    Stephen Fuld wrote:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption.
    Instead, we see newer encryption algorithms (or ways of using them)
    that work better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high
    either, is a requirement for (symmetric-key) cipher. Esp. when
    the key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric
    ciphers such as AES, SM2/SM3 as well as message digest/hash
    (SHA1, SHA256 et al).

    High throughput encryption has been done by hardware accelerators
    for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
    bus; now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the
    rest of the CPU logic to do the encryption? Furthermore, an
    "inbuilt" accelerator could interface directly with the I/O
    hardware of the CPU (e.g. PCI), saving the "intermediate" step of
    writing the encrypted data to memory.

    That logic already exists, in the form of a single thread/core
    dedicated to the job.

    With 30-100 cores on a single die, it becomes very cheap to dedicate
    one of them to babysit such a process, compared to the cost of making
    a custom chunk of VLSI to do the same. This is particularly true
    because the logic needed in the babysitting process is mostly
    straight line, with a very limited number of hard-to-predict branches.

    I.e. h.264 CABAC decoding has three branches per bit decoded, at
    least one of them impossible to predict or work around with clever
    coding. Here it makes perfect sense to have a chunk of hw to handle
    the heavy lifting. Monitoring block encryption/decryption not so much.


    I may be missing something, but while your proposal addresses the first
    part of my proposal, I think it doesn't address the second. That is,
    for data coming from/going to some external source, you are still doing "unnecessary" memory traffic, which takes memory bandwidth and
    increases latency.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Stefan Monnier@21:1/5 to All on Tue Jun 4 16:28:00 2024
    If I want to validate combiner codes or normalize characters I need
    UTF-32 because I have to work with the whole character as a unit.

    You can read the code points directly from the UTF-8 sequence almost
    as easily as you can from a UTF-32 sequence.
    Most of the cost will be in the memory accesses and then in looking up the various tables to decide how to normalize or whether it's valid, so the difference between reading the info from UTF-32 or UTF-8 should be lost in
    the noise.
    UTF-32 might be marginally faster at this specific operation in some
    cases (definitely not if your text is mostly ASCII), but I'd be very
    surprised if the difference is ever large enough to pay for a conversion
    from UTF-8 to UTF-32.
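
    To make "read the code points directly from the UTF-8 sequence"
    concrete, here is a small hand-written decoder. It is a sketch for
    illustration (the function name `utf8_code_points` is made up), and in
    practice the runtime's own decoder would normally be used. It also
    rejects the malformed forms that are most often used to slip past
    validation: overlong encodings, surrogates, and values above U+10FFFF.

```python
def utf8_code_points(data: bytes):
    """Yield Unicode code points from UTF-8 bytes; raise ValueError if malformed."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # The lead byte gives the sequence length and the smallest code
        # point that legitimately needs that length.
        if b0 < 0x80:
            cp, need, minimum = b0, 0, 0x0
        elif 0xC2 <= b0 <= 0xDF:
            cp, need, minimum = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, minimum = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, need, minimum = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + need >= n and need > 0:
            raise ValueError(f"truncated sequence at offset {i}")
        for j in range(1, need + 1):
            b = data[i + j]
            if (b & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        yield cp
        i += need + 1


sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
assert list(utf8_code_points(sample)) == [0x61, 0xE9, 0x20AC, 0x1F600]
```

    The per-character work is a handful of shifts and masks, which supports
    the point that table lookups and memory accesses dominate the cost.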

    I was just trying to get people thinking of ways that malformed
    characters might be used to bypass other validation checks in
    their software.

    Another issue with Unicode is the so-called "confusables": things that
    may look identical (or close enough) on screen yet are different (and
    not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄. Unicode comes with a 700kB `confusables.txt` listing such issues.
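
    A short illustration of why normalization alone does not catch
    confusables, using only `unicodedata` from the Python standard library.
    The helper name `same_after_nfkc` is made up for the example; a real
    check would consult the Unicode confusables data, which is not done
    here.

```python
import unicodedata


def same_after_nfkc(a: str, b: str) -> bool:
    """True if the two strings are equal once both are NFKC-normalized."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# Normalization does unify different encodings of the *same* character ...
assert same_after_nfkc("\u00e9", "e\u0301")   # precomposed é vs e + combining acute

# ... but look-alikes from other scripts remain distinct characters.
for latin, lookalike in (("B", "\u0392"), ("A", "\u0410")):
    print(latin, unicodedata.name(latin), "|", lookalike, unicodedata.name(lookalike))
    assert not same_after_nfkc(latin, lookalike)
```

    So two identifiers can compare unequal after normalization and still
    be indistinguishable to a person reading them on screen.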


    Stefan

  • From George Neuner@21:1/5 to [email protected] on Tue Jun 4 16:56:18 2024
    On Sat, 01 Jun 2024 12:49:46 -0400, EricP
    <[email protected]> wrote:

    George Neuner wrote:
    On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
    <[email protected]> wrote:

    According to EricP <[email protected]>:
    Ok, you accept international character data, you just don't have to
    check >127 characters for "drop table" etc commands.

    I don't think you are being paranoid enough.
    I still think you have to validate or sanitize the >127 string to
    ensure the code sequences only contain well formed characters.
    If you're sending the strings to a database, the database will
    invariably do detailed string validation so I wouldn't bother, but be
    prepared for the error code if it rejects the string,

    Far too much SQL is constructed by simply splicing user input into a
    query "template" string.

    When queries are done right with all user input provided via SQL
    parameters, then there is far less need to "sanitize" input.

    There is a one major caveat: in SQL, table names can't be specified by
    parameter. If the user must provide a table name, then you DO have to
    splice the query string and you DO have to be careful.

    Yes, I didn't mean not parameterizing the string args.

    I was trying to think of ways that I might get your software to combine
    malformed strings creating something different. This would occur after
    the strings have been passed using parameterization, like if an index
    is built from two concatenated string fields.

    Sorry ... was away for a few days.


    Even using parameters you still can have a "bad" outcome (for some
    definition). E.g., if the database contains "John" but the query
    string is "Jon", it might fail to find or delete existing tuples,
    update wrong tuples, create superfluous tuples, etc. ... which can
    affect the integrity[*] of the stored data. However, parameters
    provide no way to /rewrite/ the SQL to perform a different operation
    than that which was originally intended.

    [*] "ACID" provides some guarantees of "consistency" but does not make
    any guarantees of "integrity". The 'I' stands for "isolation".
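
    Both points (values as parameters, and the table-name caveat) can be
    sketched with SQLite from the Python standard library. The table names,
    the `ALLOWED_TABLES` allow-list and the `find_by_name` helper are
    hypothetical, chosen only for illustration; checking a spliced table
    name against a fixed set is one possible way of "being careful", not
    the only one.

```python
import sqlite3

ALLOWED_TABLES = {"users", "orders"}  # hypothetical allow-list of table names


def find_by_name(conn: sqlite3.Connection, table: str, name: str):
    # Table names cannot be bound as parameters, so the name is spliced in,
    # but only after checking it against a fixed set of known tables.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    sql = f'SELECT id, name FROM "{table}" WHERE name = ?'
    # The value travels separately from the SQL text and is never parsed as SQL.
    return conn.execute(sql, (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("John",))

hostile = "John'; DROP TABLE users; --"
assert find_by_name(conn, "users", "John") == [(1, "John")]
assert find_by_name(conn, "users", hostile) == []   # just a non-matching string
assert find_by_name(conn, "users", "Jon") == []     # the "wrong data" case remains
```

    The hostile string cannot change the statement, but a merely wrong
    value such as "Jon" still silently matches nothing, which is the
    integrity issue described above.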


    However, many SQL RDBMS now support operations on JSON and XML data,
    and it is possible to affect searches within these types of fields by
    using only (SQL) parameter strings. I don't know of any way to defend
    against this without checking code having some fairly sophisticated understanding of the stored data ... not just its structure, but also
    what it represents.

  • From George Neuner@21:1/5 to [email protected] on Tue Jun 4 17:42:43 2024
    On Tue, 4 Jun 2024 02:00:55 -0000 (UTC), Lawrence D'Oliveiro
    <[email protected]d> wrote:

    On Mon, 3 Jun 2024 17:42:17 +0300, Michael S wrote:

    On Mon, 03 Jun 2024 14:07:12 GMT [email protected] (Scott Lurndal)
    wrote:

    Most modern CPUs have instruction set support for symmetric ciphers
    such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
    al).

    It is still not *too* fast.
    'Too fast' in my book is when with 1B to 10B USD worth of OTP servers
    you can break cipher by brute force in less than 1 hour.

    The good algorithms are designed to be fast for encryption/decryption use,
    while still being uselessly slow for cracking purposes.

    Hash algorithms come in two flavours: cryptographic hashes (as mentioned
    above) and password hashes. Cryptographic hashes have to be fast to
    compute, but password hashes should take some appreciable fraction of a
    second. This is fast enough to authenticate a user logging in, while
    significantly slowing down password-guessing attacks.

    For example, the WordPress password-hashing algorithm takes a
    cryptographic hash like MD5 (considered crap nowadays), and iterates it
    8000 times. And suddenly crap becomes good.

    It's debatable whether repeated application of a given function really represents a /different/ function.

    In any event there is no such thing as a "password" hash - really
    there only are cryptographic hashes. A use of a particular hash for
    passwords may deliberately slow its execution - e.g., by iterating or
    by deliberate delays - but the hash algorithm remains the same.

  • From Thomas Koenig@21:1/5 to George Neuner on Wed Jun 5 05:32:17 2024
    George Neuner <[email protected]> schrieb:

    It's debatable whether repeated application of a given function really represents a /different/ function.

    Try encrypting something with only one round of DES or AES :-)

  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed Jun 5 05:35:28 2024
    On Wed, 5 Jun 2024 05:32:17 -0000 (UTC), Thomas Koenig wrote:

    Try encrypting something with only one round of DES or AES :-)

    AES is fine, DES is not.

  • From Terje Mathisen@21:1/5 to Stephen Fuld on Wed Jun 5 11:16:38 2024
    Stephen Fuld wrote:
    Terje Mathisen wrote:

    Stephen Fuld wrote:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption.
    Instead, we see newer encryption algorithms (or ways of using them)
    that work better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not high
    either, is a requirement for (symmetric-key) cipher. Esp. when
    the key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric
    ciphers such as AES, SM2/SM3 as well as message digest/hash
    (SHA1, SHA256 et al).

    High throughput encryption has been done by hardware accelerators
    for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
    bus; now such HSM are an integral part of many SoC).


    Question. For a modern general purpose CPU, if you are including all
    the logic to implement encryption instructions, is it much more to
    include the control/sequencing logic to do it and not tie up the
    rest of the CPU logic to do the encryption? Furthermore, an
    "inbuilt" accelerator could interface directly with the I/O
    hardware of the CPU (e.g. PCI), saving the "intermediate" step of
    writing the encrypted data to memory.

    That logic already exists, in the form of a single thread/core
    dedicated to the job.

    With 30-100 cores on a single die, it becomes very cheap to dedicate
    one of them to babysit such a process, compared to the cost of making
    a custom chunk of VLSI to do the same. This is particularly true
    because the logic needed in the babysitting process is mostly
    straight line, with a very limited number of hard-to-predict branches.

    I.e. h.264 CABAC decoding has three branches per bit decoded, at
    least one of them impossible to predict or work around with clever
    coding. Here it makes perfect sense to have a chunk of hw to handle
    the heavy lifting. Monitoring block encryption/decryption not so much.


    I may be missing something, but while your proposal addresses the first
    part of my proposal, I think it doesn't address the second. That is,
    for data coming from/going to some external source, you are still doing
    "unnecessary" memory traffic, which takes memory bandwidth and
    increases latency.

    Usually, when a CPU needs to work on something, it will need to get the
    data into $L1 anyway? It is only when the work is simply to be a
    pipeline that having a way to bypass the CPU completely really makes a difference, right?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to All on Wed Jun 5 12:21:01 2024
    MitchAlsup1 wrote:
    Terje Mathisen wrote:



    That logic already exists, in the form of a single thread/core
    dedicated
    to the job.

    With 30-100 cores on a single die, it becomes very cheap to dedicate
    one
    of them to babysit such a process, compared to the cost of making a
    custom chunk of VLSI to do the same. This is particularly true because
    the logic needed in the babysitting process is mostly straight line,
    with a very limited number of hard-to-predict branches.

    I.e. h.264 CABAC decoding has three branches per bit decoded, at least
    one of them impossible to predict or work around with clever coding.

    How many instructions in the then-clause and in the else-clause ??
    If these are smaller than 8, My 66000 can process them without
    "branching" using predication.

    No, the real problem is the context branching: After doing the 50%
    branch you pick up one of two alternative contexts and follow totally
    different paths, i.e. you cannot simply use the branch bit as an index.

    I found ways to bypass the issues with the other two branches but this
    one is fundamental.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Stephen Fuld@21:1/5 to Terje Mathisen on Wed Jun 5 13:34:25 2024
    Terje Mathisen wrote:

    Stephen Fuld wrote:
    Terje Mathisen wrote:

    Stephen Fuld wrote:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that
    work better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not
    high either, is a requirement for (symmetric-key) cipher.
    Esp. when the key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric
    ciphers such as AES, SM2/SM3 as well as message digest/hash
    (SHA1, SHA256 et al).

    High throughput encryption has been done by hardware
    accelerators for decades now (e.g. bbn or ncypher HSM boxes
    sitting on a SCSI bus; now such HSM are an integral part of
    many SoC).


    Question. For a modern general purpose CPU, if you are
    including all the logic to implement encryption instructions,
    is it much more to include the control/sequencing logic to do
    it and not tie up the rest of the CPU logic to do the
    encryption? Furthermore, an "inbuilt" accelerator could
    interface directly with the I/O hardware of the CPU (e.g. PCI),
    saving the "intermediate" step of writing the encrypted data to
    memory.

    That logic already exists, in the form of a single thread/core
    dedicated to the job.

    With 30-100 cores on a single die, it becomes very cheap to
    dedicate one of them to babysit such a process, compared to the
    cost of making a custom chunk of VLSI to do the same. This is particularly true because the logic needed in the babysitting
    process is mostly straight line, with a very limited number of hard-to-predict branches.

    I.e. h.264 CABAC decoding has three branches per bit decoded, at
    least one of them impossible to predict or work around with clever coding. Here it makes perfect sense to have a chunk of hw to
    handle the heavy lifting. Monitoring block encryption/decryption
    not so much.


    I may be missing something, but while your proposal addresses the
    first part of my proposal, I think it doesn't address the second.
    That is, for data coming from/going to some external source, you
    are still doing "unnecessary" memory traffic, which takes memory
    bandwidth and increases latency.

    Usually, when a CPU needs to work on something, it will need to get
    the data into $L1 anyway? It is only when the work is simply to be a
    pipeline that having a way to bypass the CPU completely really makes
    a difference, right?

    Right. But my point is that the CPU never really needs to "work" on the
    encrypted data. It is frequently only sent to, or received from, the
    network or a storage device, hence the pipelined approach has
    advantages.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to Stephen Fuld on Wed Jun 5 16:49:05 2024
    On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    Terje Mathisen wrote:

    Stephen Fuld wrote:
    Terje Mathisen wrote:

    Stephen Fuld wrote:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:
    On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

    30 years ago you could say the same thing about encryption.

    I don’t think newer CPUs have been optimized for encryption. Instead,
    we see newer encryption algorithms (or ways of using them) that
    work better on current CPUs.

    I think moderate efficiency on CPU, not too low, but not
    high either, is a requirement for (symmetric-key) cipher.
    Esp. when the key is 128-bit or shorter.

    Most modern CPUs have instruction set support for symmetric
    ciphers such as AES, SM2/SM3 as well as message digest/hash
    (SHA1, SHA256 et al).

    High throughput encryption has been done by hardware
    accelerators for decades now (e.g. bbn or ncypher HSM boxes
    sitting on a SCSI bus; now such HSM are an integral part of
    many SoC).


    Question. For a modern general purpose CPU, if you are
    including all the logic to implement encryption instructions,
    is it much more to include the control/sequencing logic to do
    it and not tie up the rest of the CPU logic to do the
    encryption? Furthermore, an "inbuilt" accelerator could
    interface directly with the I/O hardware of the CPU (e.g.
    PCI), saving the "intermediate" step of writing the encrypted
    data to memory.

    That logic already exists, in the form of a single thread/core dedicated to the job.

    With 30-100 cores on a single die, it becomes very cheap to
    dedicate one of them to babysit such a process, compared to the
    cost of making a custom chunk of VLSI to do the same. This is particularly true because the logic needed in the babysitting
    process is mostly straight line, with a very limited number of hard-to-predict branches.

    I.e. h.264 CABAC decoding has three branches per bit decoded, at
    least one of them impossible to predict or work around with
    clever coding. Here it makes perfect sense to have a chunk of
    hw to handle the heavy lifting. Monitoring block
    encryption/decryption not so much.


    I may be missing something, but while your proposal addresses the
    first part of my proposal, I think it doesn't address the second.
    That is, for data coming from/going to some external source, you
    are still doing "unnecessary" memory traffic, which takes memory bandwidth and increases latency.

    Usually, when a CPU needs to work on something, it will need to get
    the data into $L1 anyway? It is only when the work is simply to be a pipeline that having a way to bypass the CPU completely really makes
    a difference, right?

    Right. But my point is that the CPU never really needs to "work" on
    the encrypted data. It is frequently only sent to, or received from,
    the network or a storage device, hence the pipelined approach has
    advantages.




    The best, the most secure encryption is an end-to-end encryption.
    Which means application-to-application.
    It's not that other, "piece-wise" encryption types can't be used, but
    if you are serious about privacy you should consider them insufficient.

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Jun 5 16:53:53 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    I.e. h.264 CABAC decoding has three branches per bit decoded, at least
    one of them impossible to predict or work around with clever coding.

    How many instructions in the then-clause and in the else-clause ??
    If these are smaller than 8, My 66000 can process them without
    "branching" using predication.

    No, the real problem is the context branching: After doing the 50%
    branch you pick up one of two alternative contexts and follow totally different paths, i.e. you cannot simply use the branch bit as an index.

    If the number of instructions in the combined then and else clauses is
    lower than a certain number, it is equally efficient to deal with the
    branch as if it were later nullification rather than a redirection of
    the fetch end of the pipeline. Here, NO prediction is required and
    there is no chance of misprediction without regard to the predictability
    of the control flow point. The whole point is that if the fetch end
    of the pipeline will reach the convergence point before the branch
    is fully resolved, then "don't branch", nullify. It saves cycles and
    keeps unpredictable branches out of the branch predictor--even if
    the apparent takenness of the branch is completely random--improving
    the prediction accuracy of "real branches".

    So, for example, let us postulate a 1-wide machine fetching 4 words per
    clock and a then clause of 3 instructions and an else clause of 4 inst.
    By the time the pseudo branch instruction enters execution, both the
    then and the else have already been fetched, parsed, and are flowing
    through decode. The execution of the branch merely decides which inst
    survive the pipeline and there are no misprediction stalls. {{On a
    wider machine, the fetch is even wider and the parse/decode BW is
    still higher, so the mispredicted control flow point does not suffer misprediction repair costs.}}

    Oddly enough, this is how predication works on My 66000.
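    A loose software analogy of the idea above, not a model of My 66000
    itself: both short clauses are always evaluated and the condition only
    decides which result survives, so there is nothing to predict. The
    function names and the two toy clauses are hypothetical.

```python
def select_branchy(cond: bool, a: int, b: int) -> int:
    """Conventional form: control flow is redirected based on `cond`."""
    if cond:
        return a + 3   # then-clause
    return b * 2       # else-clause


def select_predicated(cond: bool, a: int, b: int) -> int:
    """If-converted form: evaluate both clauses, then nullify one result."""
    then_val = a + 3
    else_val = b * 2
    mask = -int(bool(cond))          # -1 (all ones) if cond, else 0
    return (then_val & mask) | (else_val & ~mask)


# Both forms agree for every input tried here.
results = [
    (select_branchy(c, a, b), select_predicated(c, a, b))
    for c in (False, True)
    for a in range(-4, 5)
    for b in range(-4, 5)
]
```

    This only pays off when both clauses are short, which is the condition
    stated in the post; long divergent paths still need a real branch.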

    I found ways to bypass the issues with the other two branches but this
    one is fundamental.

    It is fundamental only on ISAs that perform predication improperly
    or do not have predication, or use the predictor when predicating.
    My 66000 is not one of them.

    I return to the question posed earlier::
    How many instructions in the then-clause and in the else-clause ??

    Terje

  • From Stephen Fuld@21:1/5 to Michael S on Wed Jun 5 16:16:32 2024
    Michael S wrote:


    snip lots of stuff about encryption alternatives



    The best, the most secure encryption is an end-to-end encryption.
    Which means application-to-application.
    It's not that other, "piece-wise" encryption types can't be used, but
    if you are serious about privacy you should consider them
    insufficient.


    That's fair. But there are counterarguments, such as that not doing the
    encryption on a processor that is also executing arbitrary user code
    makes it more immune to side-channel attacks.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jun 5 17:03:42 2024
    Stephen Fuld wrote:

    Terje Mathisen wrote:


    Usually, when a CPU needs to work on something, it will need to get
    the data into $L1 anyway? It is only when the work is simply to be a
    pipeline that having a way to bypass the CPU completely really makes
    a difference, right?

    Right. But my point is that the CPU never really needs to "work" on the
    encrypted data. It is frequently only sent to, or received from, the
    network or a storage device, hence the pipelined approach has
    advantages.


    If the keys are visible in application memory, Spectré like attacks can
    read out those keys. If the keys are visible in supervisor memory, similar
    attack strategies can read them out. Thus, it makes sense that the CPUs
    not be doing the cryption.

    {{Or they could fix the µArchitecture so Spectré like attacks are prevented
    but apparently they have no cause for that.}}

  • From MitchAlsup1@21:1/5 to Michael S on Wed Jun 5 17:04:49 2024
    Michael S wrote:

    On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)


    The best, the most secure encryption is an end-to-end encryption.
    Which means application-to-application.

    Except for the Spectré like attacks that steal the keys if they are in
    memory.

    It's not that other, "piece-wise" encryption types can't be used, but
    if you are serious about privacy you should consider them insufficient.

  • From Michael S@21:1/5 to Stephen Fuld on Wed Jun 5 20:06:43 2024
    On Wed, 5 Jun 2024 16:16:32 -0000 (UTC)
    "Stephen Fuld" <[email protected]d> wrote:

    Michael S wrote:


    snip lots of stuff about encryption alternatives



    The best, the most secure encryption is an end-to-end encryption.
    Which means application-to-application.
    It's not that other, "piece-wise" encryption types can't be used,
    but if you are serious about privacy you should consider them
    insufficient.


    That's fair. But there are counterarguments, such as that not doing the
    encryption on a processor that is also executing arbitrary user code
    makes it more immune to side-channel attacks.




    Side-channel attacks on AES were 99%-fantasy of bored (or
    attention-seeking) security researchers even before Rijndael core was
    put in CPU hardware. Much more so now.
    Weak point tends to be key management rather than encryption itself.
    And, BTW, running arbitrary hostile code on your computer is bad, bad,
    bad idea for 1e9 other reasons.

  • From Stefan Monnier@21:1/5 to All on Wed Jun 5 13:37:12 2024
    And, BTW, running arbitrary hostile code on your computer is bad, bad,
    bad idea for 1e9 other reasons.

    Can't disagree, yet every day that comes by, another activity is made
    virtually impossible without allowing such arbitrary code on
    your device. 🙁


    Stefan

  • From Scott Lurndal@21:1/5 to [email protected] on Wed Jun 5 17:56:10 2024
    [email protected] (MitchAlsup1) writes:
    Stephen Fuld wrote:

    Terje Mathisen wrote:


    Usually, when a CPU needs to work on something, it will need to get
    the data into $L1 anyway? It is only when the work is simply to be a
    pipeline that having a way to bypass the CPU completely really makes
    a difference, right?

    Right. But my point is that the CPU never really needs to "work" on the
    encrypted data. It is frequently only sent to, or received from, the
    network or a storage device, hence the pipelined approach has
    advantages.


    If the keys are visible in application memory, Spectré like attacks can
    read out those keys. If the keys are visible in supervisor memory, similar
    attack strategies can read them out. Thus, it makes sense that the CPUs
    not be doing the cryption.

    That's why most modern platforms have TPM devices on board or
    integrated on the SoC.

    https://en.wikipedia.org/wiki/Trusted_Platform_Module

  • From Scott Lurndal@21:1/5 to Michael S on Wed Jun 5 17:58:01 2024
    Michael S <[email protected]> writes:
    On Wed, 5 Jun 2024 17:04:49 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)


    The best, the most secure encryption is an end-to-end encryption.
    Which means application-to-application.

    Except for the Spectré like attacks that steal the keys if they are in
    memory.


    Spectre, not Spectré
    https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)

    It's not that other, "piece-wise" encryption types can't be used,
    but if you are serious about privacy you should consider them
    insufficient.

    And who exactly places the key into registers of your beloved shared
    encryption device?

    It is pretty trivial to bake private keys into hardware at the fab,
    either through e-fuses or various other mechanisms.

  • From Michael S@21:1/5 to [email protected] on Wed Jun 5 20:15:53 2024
    On Wed, 5 Jun 2024 17:04:49 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)


    The best, the most secure encryption is an end-to-end encryption.
    Which means application-to-application.

    Except for the Spectré like attacks that steal the keys if they are in memory.


    Spectre, not Spectré
    https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)

    It's not that other, "piece-wise" encryption types can't be used,
    but if you are serious about privacy you should consider them
    insufficient.

    And who exactly places the key into the registers of your beloved shared
    encryption device? And, since the device is shared, who exchanges keys
    hundreds or thousands of times per second? Not software? Not via memory?
    It all makes the situation much, much worse rather than better.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jun 5 20:09:06 2024
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:

    It's not that other, "piece-wise" encryption types can't be used,
    but if you are serious about privacy you should consider them
    insufficient.

    And who exactly places the key into registers of your beloved shared
    encryption device?

    It is pretty trivial to bake private keys into hardware at the fab,
    either through e-fuses or various other mechanisms.

    Is that something the CIA or NSA would allow on their computers ??

  • From MitchAlsup1@21:1/5 to Michael S on Wed Jun 5 20:13:19 2024
    Michael S wrote:


    Side-channel attacks on AES were 99%-fantasy of bored (or
    attention-seeking) security researchers even before Rijndael core was
    put in CPU hardware. Much more so now.
    Weak point tends to be key management rather than encryption itself.
    And, BTW, running arbitrary hostile code on your computer is bad, bad,
    bad idea for 1e9 other reasons.

    Running arbitrary hostile code where the user address space is not
    completely disjoint from the supervisor access space is ALSO a bad
    Idea.

  • From MitchAlsup1@21:1/5 to Michael S on Wed Jun 5 20:11:03 2024
    Michael S wrote:

    Except for the Spectré like attacks that steal the keys if they are in
    memory.


    Spectre, not Spectré


    My spelling has the advantage I can GOOGLE the *net and find anything I
    have said about Spectré.

  • From Scott Lurndal@21:1/5 to [email protected] on Wed Jun 5 20:41:48 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Michael S <[email protected]> writes:

    It's not that other, "piece-wise" encryption types can't be used,
    but if you are serious about privacy you should consider them
    insufficient.

    And who exactly places the key into registers of your beloved shared
    encryption device?

    It is pretty trivial to bake private keys into hardware at the fab,
    either through e-fuses or various other mechanisms.

    Is that something the CIA or NSA would allow on their computers ??

    If they use windows, yes. Windows requires a TPM for boot integrity.

  • From Terje Mathisen@21:1/5 to All on Thu Jun 6 09:10:38 2024
    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    I.e. h.264 CABAC decoding has three branches per bit decoded, at
    least one of them impossible to predict or work around with clever
    coding.

    How many instructions in the then-clause and in the else-clause ??
    If these are smaller than 8, My 66000 can process them without
    "branching" using predication.

    No, the real problem is the context branching: After doing the 50%
    branch you pick up one of two alternative contexts and follow totally
    different paths, i.e. you cannot simply use the branch bit as an index.

    If the number of instructions in the combined then and else clauses is
    lower than a certain number, it is equally efficient to deal with the
    branch as if it were later nullification rather than a redirection of
    the fetch end of the pipeline. Here, NO prediction is required and there
    is no chance of misprediction without regard to the predictability
    of the control flow point. The whole point is that if the fetch end
    of the pipeline will reach the convergence point before the branch
    is fully resolved, then "don't branch", nullify. It saves cycles and
    keeps unpredictable branches out of the branch predictor--even if the
    apparent takenness of the branch is completely random--improving
    the prediction accuracy of "real branches".

    So, for example, let us postulate a 1-wide machine fetching 4 words per
    clock and a then clause of 3 instructions and an else clause of 4 inst.
    By the time the pseudo branch instruction enters execution, both the
    then and the else have already been fetched, parsed, and are flowing
    through decode. The execution of the branch merely decides which inst
    survive the pipeline and there are no misprediction stalls. {{On a
    wider machine, the fetch is even wider and the parse/decode BW is
    still higher, so the mispredicted control flow point does not suffer misprediction repair costs.}}

    Oddly enough, this is how predication works on My 66000.

    I found ways to bypass the issues with the other two branches but this
    one is fundamental.

    It is fundamental only on ISAs that perform predication improperly
    or do not have predication, or use the predictor when predicating.
    My 66000 is not one of them.

    I return to the question posed earlier::
    How many instructions in the then-clause and in the else-clause ??

    From 100++ to 10K+? Effectively no path merge within any kind of
    visible window.

    I.e. decoding CABAC is running a state machine with tens to hundreds
    (afair) different states, with close to zero commonality between the
    code for individual paths. There is almost zero if/then/else/endif
    local branching at this level.

    I could see absolutely no way to avoid biting the bullet and actually
    branching to the relevant code path.
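    A toy sketch of why this kind of decoder resists predication. This is
    not CABAC: the three states, their handlers, and the transition rules
    are hypothetical. It only shows the shape of the problem, where each
    decoded bit selects an entirely different piece of code to run next.

```python
from typing import Callable, Dict, List, Tuple


def state_a(bit: int, trace: List[str]) -> str:
    trace.append("a")
    return "b" if bit else "c"


def state_b(bit: int, trace: List[str]) -> str:
    trace.append("b")
    return "c" if bit else "a"


def state_c(bit: int, trace: List[str]) -> str:
    trace.append("c")
    return "a" if bit else "b"


# In a real decoder each handler would be a long, unrelated code path.
HANDLERS: Dict[str, Callable[[int, List[str]], str]] = {
    "a": state_a,
    "b": state_b,
    "c": state_c,
}


def run(bits: List[int], start: str = "a") -> Tuple[List[str], str]:
    """Dispatch one handler per bit; the next handler depends on the data."""
    state = start
    trace: List[str] = []
    for bit in bits:
        # Data-dependent indirect dispatch: the target is unknown until
        # the previous bit has been decoded.
        state = HANDLERS[state](bit, trace)
    return trace, state


trace, final_state = run([1, 0, 1])
```

    With effectively random bits, the dispatch target is as hard to predict
    as the bits themselves, and the paths never reconverge within a short
    window.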

    Like I've written before, it is almost as if CABAC was designed to be as
    hard as possible for a sw decoder.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to [email protected] on Thu Jun 6 11:21:39 2024
    On Wed, 5 Jun 2024 20:13:19 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:


    And, BTW, running arbitrary hostile code on your computer is bad,
    bad, bad idea for 1e9 other reasons.

    Running arbitrary hostile code where the user address space is not
    completely disjoint from the supervisor access space is ALSO a bad
    Idea.

    It sounds like you came to the verge of selling your soul to
    microkerneliac heresy.

  • From MitchAlsup1@21:1/5 to Michael S on Thu Jun 6 13:45:10 2024
    Michael S wrote:

    On Wed, 5 Jun 2024 20:13:19 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:


    And, BTW, running arbitrary hostile code on your computer is bad,
    bad, bad idea for 1e9 other reasons.

    Running arbitrary hostile code where the user address space is not
    completely disjoint from the supervisor access space is ALSO a bad
    Idea.

    It sounds like you came to the verge of selling your soul to
    microkerneliac heresy.

    While My 66000 has the rapid context switching needed for efficient
    microKernels, the MMU has the functionality that application AGEN
    cannot access supervisor space, while supervisor AGEN can access
    application space. It is just setting up the model properly.

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Fri Jun 7 03:35:41 2024
    On Wed, 05 Jun 2024 20:41:48 GMT, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:

    Is that something the CIA or NSA would allow on their computers ??

    If they use windows, yes.

    There is an interview somewhere in which somebody high up in the US
    military says that their Government’s reliance on Microsoft is their
    single biggest security vulnerability.

  • From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Fri Jun 7 03:36:57 2024
    On Wed, 05 Jun 2024 13:37:12 -0400, Stefan Monnier wrote:

    ... every day that comes by, another activity is made
    virtually impossible without allowing such arbitrary code on your
    device. 🙁

    If you’re talking about WASM or JavaScript from websites, that runs in a
    carefully-designed sandbox.

    If you’re talking about proprietary closed-source apps downloaded from
    random sites ... just don’t.

  • From Stefan Monnier@21:1/5 to All on Fri Jun 7 10:48:56 2024
    ... every day that comes by, another activity is made virtually
    impossible without allowing such arbitrary code on your device. 🙁
    If you’re talking about WASM or JavaScript from websites, that runs in a carefully-designed sandbox.

    The sandbox gives you only a very crude amount of control.
    In practice it's still basically code over which you have no control
    (beside "do I run it or not").

    And your sandbox wants to provide access to a large part of your
    machine's hardware anyway, in order to be able to run the many "web
    applications". So, it comes with many "carefully-designed" holes.
    And that's without counting hardware and software bugs.


    Stefan

  • From EricP@21:1/5 to Stefan Monnier on Fri Jun 7 10:23:27 2024
    Stefan Monnier wrote:

    Another issue with Unicode is the so-called "confusables": things that
    may look identical (or close enough) on screen yet are different (and
    not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
    Unicode comes with a 700kB `confusables.txt` listing such issues.

    Eeewww... I didn't even think of that.
    What does one do about them? You can't treat them as equivalent in a
    string compare... the user might want the first B and not second B.

    I suppose one would want two compare equal functions,
    an exactly equal, and a visually approximately equal.
    Like using a soundex for words to catch misspellings.

    But then programmers need to decide when to use each compare.

    These character and code attribute lookup tables are looking awkward.
    With up to 2M codes, and some base character codes having multiple
    possible combiners, but very sparse. And links between entries
    for upper and lower case, and now links between confusables.
    And we don't want to roll over the L1 cache just to do a string compare.
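    The two compare functions suggested above could be sketched as follows.
    This is loosely modelled on the "skeleton" idea from the Unicode security
    mechanisms report, with a hypothetical four-entry table standing in for
    the real `confusables.txt`. The names `exactly_equal`, `visually_equal`,
    and `skeleton` are illustrative.

```python
import unicodedata

# Tiny stand-in for confusables.txt: maps a look-alike to its prototype.
CONFUSABLES = {
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u2215": "/",  # DIVISION SLASH
    "\u2044": "/",  # FRACTION SLASH
}


def skeleton(text: str) -> str:
    """Normalize, then replace each character by its look-alike prototype."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)


def exactly_equal(a: str, b: str) -> bool:
    """Code-point-for-code-point comparison."""
    return a == b


def visually_equal(a: str, b: str) -> bool:
    """Approximate comparison: equal once confusables are collapsed."""
    return skeleton(a) == skeleton(b)


greek_beta_vs_latin = (exactly_equal("\u0392", "B"), visually_equal("\u0392", "B"))
```

    Exact comparison stays a plain memory compare. Only the "looks the same"
    check pays for the table lookup, so the large sparse table does not have
    to be touched on the common path.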

  • From Terje Mathisen@21:1/5 to EricP on Fri Jun 7 17:05:42 2024
    EricP wrote:
    Stefan Monnier wrote:

    Another issue with Unicode is the so-called "confusables": things that
    may look identical (or close enough) on screen yet are different (and
    not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs
    / vs ⁄.
    Unicode comes with a 700kB `confusables.txt` listing such issues.

    Eeewww... I didn't even think of that.
    What does one do about them? You can't treat them as equivalent in a
    string compare... the user might want the first B and not second B.

    I suppose one would want two compare equal functions,
    an exactly equal, and a visually approximately equal.
    Like using a soundex for words to catch misspellings.

    But then programmers need to decide when to use each compare.

    These character and code attribute lookup tables are looking awkward.
    With up to 2M codes, and some base character codes having multiple
    possible combiners, but very sparse. And links between entries
    for upper and lower case, and now links between confusables.
    And we don't want to roll over the L1 cache just to do a string compare.

    Years ago I considered case-insensitive Boyer-Moore text search with a
    wide alphabet and found that the only approach that made sense was to
    maintain two copies of the string to be searched for, one lower and one
    upper case, where each "character" was a length-encoded string. This was
    required to handle things like the German double s which can uppercase
    into a single letter.

    The lookup table for skip lengths was still far shorter than the
    alphabet size, effectively a very short and fast hash of the current
    character/codepoint/combined letter.
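
    A simplified sketch of the hashed skip table, in Python, and not the code
    described above. It uses the Horspool variant of Boyer-Moore, keeps a
    lower-case and an upper-case copy of the pattern, and hashes each
    character into a 256-slot skip table by taking its low 8 bits. As a
    stated simplification, it rejects patterns whose case mapping is longer
    than one code point (such as ß), so the length-encoded "characters" are
    left out. The names `build`, `find`, `slot` and `TABLE_SIZE` are
    hypothetical.

```python
TABLE_SIZE = 256  # far smaller than the code point range


def slot(ch: str) -> int:
    """Very short hash: low 8 bits of the code point."""
    return ord(ch) & (TABLE_SIZE - 1)


def build(pattern: str):
    """Return lower/upper copies of the pattern and the hashed skip table."""
    lower = [c.lower() for c in pattern]
    upper = [c.upper() for c in pattern]
    if any(len(c) != 1 for c in lower + upper):
        raise ValueError("multi-character case mappings not handled here")
    m = len(pattern)
    skip = [m] * TABLE_SIZE
    # Horspool rule: distance from the last occurrence (excluding the final
    # position) to the pattern end. Hash collisions keep the smaller skip,
    # which is always safe, just slower.
    for i in range(m - 1):
        for ch in (lower[i], upper[i]):
            h = slot(ch)
            skip[h] = min(skip[h], m - 1 - i)
    return lower, upper, skip


def find(text: str, pattern: str) -> int:
    """Case-insensitive search; returns the first match index or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    lower, upper, skip = build(pattern)
    pos = 0
    while pos + m <= len(text):
        i = m - 1
        while i >= 0 and text[pos + i] in (lower[i], upper[i]):
            i -= 1
        if i < 0:
            return pos
        pos += skip[slot(text[pos + m - 1])]
    return -1
```

    Because colliding characters share a slot, the table stays at 256 entries
    no matter how wide the alphabet is; the cost is an occasional shorter
    skip than strictly necessary.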

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to Terje Mathisen on Fri Jun 7 11:35:50 2024
    Terje Mathisen wrote:
    EricP wrote:
    Stefan Monnier wrote:

    Another issue with Unicode is the so-called "confusables": things that
    may look identical (or close enough) on screen yet are different (and
    not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs
    / vs ⁄.
    Unicode comes with a 700kB `confusables.txt` listing such issues.

    Eeewww... I didn't even think of that.
    What does one do about them? You can't treat them as equivalent in a
    string compare... the user might want the first B and not second B.

    I suppose one would want two compare equal functions,
    an exactly equal, and a visually approximately equal.
    Like using a soundex for words to catch misspellings.

    But then programmers need to decide when to use each compare.

    These character and code attribute lookup tables are looking awkward.
    With up to 2M codes, and some base character codes having multiple
    possible combiners, but very sparse. And links between entries
    for upper and lower case, and now links between confusables.
    And we don't want to roll over the L1 cache just to do a string compare.

    Years ago I considered case-insensitive Boyer-Moore text search with a
    wide alphabet and found that the only approach that made sense was to
    maintain two copies of the string to be searched for, one lower and one
    upper case, where each "character" was a length-encoded string. This was
    required to handle things like the German double s which can uppercase
    into a single letter.

    The lookup table for skip lengths was still far shorter than the
    alphabet size, effectively a very short and fast hash of the current
    character/codepoint/combined letter.

    Terje

    Or perhaps rather than mapping upper into lower or lower into upper,
    and special cases like German double s, and confusables, into each other,
    instead we map all into a third hyper-character (because like hyperspace
    it intersects with all points in real space).

    Each real character (RCH) maps to a single hyper character (HCH)
    and a single HCH maps back to one or more RCH.
    And you might not even need a reverse map if all you do is compare.

    That's probably too simple.
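
    A rough Python sketch of the RCH/HCH idea, under invented assumptions:
    the equivalence classes in `CLASSES` are a hand-made example, HCH ids are
    simply integers placed above the Unicode range, and all names are
    hypothetical. One place where "too simple" would bite: a strict
    one-RCH-to-one-HCH map cannot express ß against "SS", since that pairs
    one character with two.

```python
HCH_BASE = 0x110000  # first value above the Unicode code point range

# Hypothetical equivalence classes: every real character (RCH) in a set
# collapses to the same hyper-character (HCH).
CLASSES = [
    {"A", "a", "\u0410", "\u0430"},  # Latin A/a, Cyrillic A/a
    {"B", "b", "\u0392"},            # Latin B/b, Greek capital Beta
    {"/", "\u2215", "\u2044"},       # solidus, division slash, fraction slash
]


def build_maps(classes):
    """Return (to_hch, to_rch): RCH -> HCH id, and HCH id -> list of RCH."""
    to_hch = {}
    to_rch = {}
    for n, members in enumerate(classes):
        hch = HCH_BASE + n
        to_rch[hch] = sorted(members)
        for rch in members:
            if rch in to_hch:
                raise ValueError("an RCH may belong to only one class")
            to_hch[rch] = hch
    return to_hch, to_rch


def hyper(s, to_hch):
    """Map a string to HCH ids; unmapped characters keep their code point."""
    return [to_hch.get(ch, ord(ch)) for ch in s]


def hyper_equal(a, b, to_hch):
    """Compare in hyper-space; the reverse map is not needed for this."""
    return hyper(a, to_hch) == hyper(b, to_hch)
```

    The reverse map `to_rch` only matters when you need to enumerate the real
    characters behind a match, for instance to build search alternatives.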
