Forum: >>> Magnum BBS <<<

Byte Addressability And Beyond

From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 00:09:28 2024

Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long
time after, machines had a “word length” which was the smallest
addressable unit of memory. This could have a range of sizes, e.g.

12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray

I’m sure there were also 24- and 48-bit machines. Note the popularity of numbers with a range of different integer divisors, including powers of
both 2 and 3. The byte-addressable machines chucked away everything other
than powers of 2, which was a step backwards in this respect. ;)

(Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)

Why was byte addressing invented? I think it was for easy handling of
strings and other binary data. But why stop there? I guess the idea of
going all the way down to bit-level addressing was considered a bit
extreme? Certainly if you only had 32 (or, on those early IBMs, 24)
address bits, then using 3 of them to address within a byte would have substantially cut down the available size of your address space.

I think the move to 64-bit architectures missed a trick, though: it could
have introduced bit-level addressing at the same time, given that we still
have plenty of address bits to spare. That would simplify bit-field manipulations, too.

One side-effect of byte addressing has been the “endian wars”: the inconsistency, between different machine architectures, of how to order
the bytes making up multibyte objects, particularly numbers. Big-endian supposedly had the advantage of making memory dumps easier to read, but little-endian always made more logical sense.

Nowadays, all the common CPU architectures are at least available in little-endian form, if not exclusively so. But we still have legacy
oddities, like the TCP/IP network stack where integer fields are laid out
in big-endian ordering.
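
To make the two byte orders concrete, here is a minimal Python sketch. The value 0x0A0B0C0D is an arbitrary illustration, and the variable names are mine rather than anything from a particular network stack.

```python
import struct

value = 0x0A0B0C0D

# Little-endian: least significant byte at the lowest address.
little = value.to_bytes(4, "little")
# Big-endian: most significant byte at the lowest address.
big = value.to_bytes(4, "big")

# In the struct module, the "!" prefix selects network byte order,
# which is big-endian; "I" is an unsigned 32-bit integer.
network = struct.pack("!I", value)

print(little.hex(" "))   # 0d 0c 0b 0a
print(big.hex(" "))      # 0a 0b 0c 0d
print(network == big)    # True
```

On a little-endian host, every integer field read from such a header has to be byte-swapped before use, which is the "legacy oddity" in practice.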

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed May 1 01:49:56 2024

According to Lawrence D'Oliveiro <[email protected]d>:

Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long
time after, machines had a “word length” which was the smallest
addressable unit of memory. This could have a range of sizes, e.g.

12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray

Commercial machines were character or digit addressed, as was at least
one scientific computer, the IBM 1620.

The IBM 650 had 10 digit words, with characters stored as digit pairs.
The 702 and 705 were decimal character addressable. Instructions were
5 characters but data could be arbitrary length and location. The very
popular 1401 was also character addressed with variable length data.

Why was byte addressing invented? I think it was for easy handling of
strings and other binary data. But why stop there?

It was to be reasonably efficient both for character business data and
word scientific data. Since the words had to be aligned, it was easy
to handle them as a single unit in parallel on machines with internal
data paths wider than 8 bits, all the models bigger than 360/30.

I guess the idea of
going all the way down to bit-level addressing was considered a bit
extreme?

STRETCH had bit addressing. It added a great deal of complication for
very little benefit. In the relatively rare situations where you want
to handle bit fields, shifting and masking is good enough without
slowing everything else down.

One side-effect of byte addressing has been the “endian wars”: the
inconsistency, between different machine architectures, ...

Until the PDP-11, all byte addressed machines were bigendian. Despite
a lot of looking, I have never found an explanation of why DEC made
the PDP-11 littlendian. I'm reasonably sure they were aware that it
was reversed from the 360, but they never said why.

Please do me a favor and DO NOT guess why they did it -- we have
already had lots and lots of guesses and we have no way to tell
whether any of the guesses are right.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed May 1 03:02:07 2024

Lawrence D'Oliveiro wrote:

Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long time after, machines had a “word length” which was the smallest addressable unit of memory. This could have a range of sizes, e.g.

12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray

CDC had a number of machines with word lengths of 12*k bits, for k in {1,2,3,5}.

I’m sure there were also 24- and 48-bit machines. Note the popularity of numbers with a range of different integer divisors, including powers of
both 2 and 3. The byte-addressable machines chucked away everything other than powers of 2, which was a step backwards in this respect. ;)

I would make the argument that 2^k was a step forward not backwards.
Perhaps another day...

(Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)

4004 anyone ?!?

Why was byte addressing invented? I think it was for easy handling of
strings and other binary data. But why stop there? I guess the idea of
going all the way down to bit-level addressing was considered a bit
extreme?

It was certainly a reason Intel's 432 died. {but there were lots}

Certainly if you only had 32 (or, on those early IBMs, 24)
address bits, then using 3 of them to address within a byte would have substantially cut down the available size of your address space.

I think the move to 64-bit architectures missed a trick, though: it could have introduced bit-level addressing at the same time, given that we still have plenty of address bits to spare. That would simplify bit-field manipulations, too.

I don't see what is wrong with loading a container with the field and
then extracting or inserting into the container. You lose atomicity
but avoid doubling the number of LD/ST instructions.
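
A rough sketch of that container approach in Python, using shift and mask. The function names and the 32-bit example value are illustrative assumptions, not any real instruction set.

```python
def extract_field(container: int, offset: int, width: int) -> int:
    """Return the width-bit field whose lowest bit sits at bit `offset`."""
    mask = (1 << width) - 1
    return (container >> offset) & mask


def insert_field(container: int, offset: int, width: int, value: int) -> int:
    """Return a copy of `container` with the field replaced by `value`."""
    mask = (1 << width) - 1
    cleared = container & ~(mask << offset)       # zero out the old field
    return cleared | ((value & mask) << offset)   # drop in the new one


word = 0xDEADBEEF                          # the loaded container
field = extract_field(word, 8, 8)          # bits 8..15
updated = insert_field(word, 8, 8, 0x12)   # would then be stored back
print(hex(field), hex(updated))            # 0xbe 0xdead12ef
```

The insert is a load, a modify and a store, so another writer can slip in between the steps. That is where the atomicity goes.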

One side-effect of byte addressing has been the “endian wars”: the inconsistency, between different machine architectures, of how to order
the bytes making up multibyte objects, particularly numbers. Big-endian supposedly had the advantage of making memory dumps easier to read, but little-endian always made more logical sense.

BE means you can read the strings in a core dump
LE means the bytes arrive in the order for on-line arithmetic
LE allows one to make 8-bit wide data paths and still implement a full
width architecture {but then so did 360/30}

Nowadays, all the common CPU architectures are at least available in little-endian form, if not exclusively so. But we still have legacy
oddities, like the TCP/IP network stack where integer fields are laid out
in big-endian ordering.

I have a BITR instruction that rearranges BE<->LE for these reasons.


From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 06:43:30 2024

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the field and
then extracting or inserting into the container.

You still need a place to put a bit offset for the base address of the
field. Why not put it together with the rest of the address?

BE means you can read the strings in a core dump
LE means the bytes arrive in the order for on-line arithmetic
LE allows one to make 8-bit wide data paths and still implement a full
width architecture {but then so did 360/30}

The way I think of it is: consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values of bits in an integer

The only way to get all 3 consistent is with a little-endian architecture. Every big-endian architecture has inconsistencies between these somewhere
or another.
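
A small Python sketch of the consistency claim. It assumes bits within a byte are numbered from the least significant end (bit 0 has place value 1); the helper names are made up for illustration.

```python
def bit_location_le(n: int) -> tuple[int, int]:
    """Little-endian: (byte index, bit within byte) of the bit worth 2**n."""
    return n // 8, n % 8


def bit_location_be(n: int, size: int) -> tuple[int, int]:
    """Big-endian: the byte index also depends on the size of the object."""
    return size - 1 - n // 8, n % 8


n, size = 21, 4
value = 1 << n
le = value.to_bytes(size, "little")
be = value.to_bytes(size, "big")

byte_le, bit_le = bit_location_le(n)
byte_be, bit_be = bit_location_be(n, size)
print(byte_le, bit_le, (le[byte_le] >> bit_le) & 1)   # 2 5 1
print(byte_be, bit_be, (be[byte_be] >> bit_be) & 1)   # 1 5 1
```

In the little-endian case, byte number and bit number are just the quotient and remainder of the overall bit number. In the big-endian case the byte index runs the other way and needs the object size.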


From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 06:32:17 2024

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

Unfortunately, when their Fortran compiler implemented 32-bit integers (in software), they got the words the wrong way round.

The VAX was like a 32-bit extension of the PDP-11, and it was consistently little-endian everywhere.
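
For anyone who has not seen the word-swapped layout, here is a sketch in Python. It assumes the layout usually called "PDP-endian" or "middle-endian": two little-endian 16-bit words with the high word stored first. The function name is hypothetical.

```python
def pdp_endian_bytes(value: int) -> bytes:
    """Lay out a 32-bit value as two little-endian 16-bit words, high word first."""
    high = (value >> 16) & 0xFFFF
    low = value & 0xFFFF
    return high.to_bytes(2, "little") + low.to_bytes(2, "little")


value = 0x0A0B0C0D
print(pdp_endian_bytes(value).hex(" "))        # 0b 0a 0d 0c
print(value.to_bytes(4, "little").hex(" "))    # 0d 0c 0b 0a
```

The second line is the consistently little-endian layout for comparison.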


From Anton Ertl@21:1/5 to John Levine on Wed May 1 07:36:13 2024

John Levine <[email protected]> writes:

Until the PDP-11, all byte addressed machines were bigendian. Despite
a lot of looking, I have never found an explanation of why DEC made
the PDP-11 littlendian. I'm reasonably sure they were aware that it
was reversed from the 360, but they never said why.

Please do me a favor and DO NOT guess why they did it -- we have
already had lots and lots of guesses and we have no way to tell
whether any of the guesses are right.

Another case was the 6800 (big-endian) and its offspring, the 6502 (little-endian). In this case we know: little-endian is cheaper to
implement on an 8-bit processor.

Concerning the speculations about the PDP-11, here's one: Was it
designed for also supporting an implementation with a 4-bit or 8-bit
basis? The competing Nova was at first implemented with a 4-bit basis
(but it is word-addressed, so this is not visible in the byte order).
The PDP-X (the DEC-internal project that was canceled in favor of the
PDP-11 and eventually became the Nova) might have influenced the
PDP-11 in that way.

The other interesting question in this context is why the Datapoint
2200 (which is the basis of the Intel 8008 architecture) went for little-endian. <https://en.wikipedia.org/wiki/Datapoint_2200> says:

|Because the original Datapoint 2200 had a serial processor, it needed
|to start with the lowest bit of the lowest byte in order to handle
|carries.

So it's the same reason as for the 6502.
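
The carry argument can be sketched in Python. This models a narrow datapath adding a wide number one byte at a time. It is an illustration only, not a description of how the 6502 or the Datapoint 2200 were actually wired.

```python
def add_bytewise_le(a: bytes, b: bytes) -> bytes:
    """Add two equal-length little-endian integers one byte at a time."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = bytearray()
    carry = 0
    for x, y in zip(a, b):       # index 0 is the least significant byte
        total = x + y + carry
        result.append(total & 0xFF)
        carry = total >> 8       # carry feeds the next, more significant byte
    return bytes(result)         # a carry out of the top byte is dropped


x = (0x12FF).to_bytes(2, "little")
y = (0x0001).to_bytes(2, "little")
total = add_bytewise_le(x, y)
print(hex(int.from_bytes(total, "little")))   # 0x1300
```

With little-endian storage the loop walks memory in increasing address order, so the first byte fetched is the one the adder needs first.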

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Wed May 1 07:43:52 2024

Lawrence D'Oliveiro <[email protected]d> schrieb:

(Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)

A major market for microprocessors was pocket calculators,
cash registers and the like, which is why having 8 bits and BCD
arithmetic was an advantage - see the DAA instruction of the 8080
or the decimal flag on the 6502.
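
For readers who have not met decimal adjust, here is a simplified Python model of the idea: add two packed-BCD bytes in binary, then correct each digit that overflowed past 9. It is a sketch of the principle only. It does not reproduce the exact flag behaviour of the 8080's DAA or the 6502's decimal mode, and the function name is made up.

```python
def bcd_add(a: int, b: int, carry_in: int = 0) -> tuple[int, int]:
    """Add two packed-BCD bytes (two decimal digits each).

    Returns (result byte, carry out).
    """
    total = a + b + carry_in                      # plain binary addition
    if (a & 0x0F) + (b & 0x0F) + carry_in > 9:    # low digit went past 9
        total += 0x06
    if total > 0x99:                              # high digit went past 9
        total += 0x60
    return total & 0xFF, 1 if total > 0xFF else 0


result, carry = bcd_add(0x38, 0x45)
print(hex(result), carry)   # 0x83 0
result, carry = bcd_add(0x99, 0x01)
print(hex(result), carry)   # 0x0 1
```

The first call is 38 + 45 = 83. The second is 99 + 1 = 100, which leaves 00 in the byte and a carry for the next digit pair.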

The basis of the 8008, the first serious microprocessor,
was the Datapoint 2200. A nice history can be found at http://www.righto.com/2023/08/datapoint-to-8086.html .
And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that
8-bit bytes were a natural choice. (And I still think that
having BCD influenced the decision to go to the 8-bit byte
on the /360).

So, anything but a clean slate.


From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed May 1 07:51:06 2024

On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that 8-bit bytes
were a natural choice.

You mean IBM mainframes? I don’t think any other mainframes were byte-addressable.


From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Wed May 1 09:02:22 2024

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that 8-bit bytes
were a natural choice.

You mean IBM mainframes?

And compatibles. Together, they accounted for almost all mainframes.

I don’t think any other mainframes were byte-addressable.

IBM set the minimum standard for character capabilities; a
terminal had to support eight bits or be laughed out of the market.
Addressability has little to do with it.

Hmm... what sort of terminals and character sets did people use on
a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
the ASCII standard came out. (And /360 was supposed to support
ASCII originally, but that bit in the PSW got dropped for the /370,
I believe).


From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed May 1 15:31:37 2024

On Wed, 1 May 2024 00:09:28 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

(Interesting that the microprocessor world made byte addressing--and
ASCII character encoding--universal right from the beginning.
Starting from a clean slate, I guess.)

It depends on what you call "microprocessor".
The majority of early Digital Signal Processors were word-addressable. Some
of them are still produced in significant quantities.
Two of those (TI TMS320C30 and ADI ADSP 21xx series) played a major role
in my professional programming education.

A few word-addressable Digital Signal Processors had non-power-of-two
word sizes. The Motorola 24-bit 56K series was probably the most popular of
those, but there were others as well.

Microchip's PIC micro-controllers are word-addressable with quite
varying word width. According to Wikipedia, they are descendants of the
General Instrument CP1600 CPU. I suppose that their ancestor was
word-addressable as well.

In the world of general-purpose microprocessors, the DEC Alpha (until EV6)
was more like word-addressable than byte-addressable, although it is a
matter of point of view.


From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Wed May 1 14:08:25 2024

Lawrence D'Oliveiro <[email protected]d> writes:

Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long
time after, machines had a “word length” which was the smallest
addressable unit of memory. This could have a range of sizes, e.g.

12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray

What about the IBM 1401, Electrodata 220 or Burroughs B5000?


From Stefan Monnier@21:1/5 to All on Wed May 1 12:08:32 2024

I guess the idea of going all the way down to bit-level addressing
was considered a bit extreme?

STRETCH had bit addressing. It added a great deal of complication for
very little benefit. In the relatively rare situations where you want
to handle bit fields, shifting and masking is good enough without
slowing everything else down.

Bit addressing doesn't have to be expensive: the DEC Alpha could have
decided to use bit-addressing simply by ignoring/trapping more of the
lowest bits than it did.
Bit-addressing doesn't necessarily mean you can LD/ST at bit-granularity.

Stefan


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed May 1 16:38:09 2024

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the field and
then extracting or inserting into the container.

You still need a place to put a bit offset for the base address of the
field. Why not put it together with the rest of the address?

Given a 20-40 year life of an architecture and the desire not to be limited
by addressability, I wanted and demanded of myself a full 63-bit virtual
address space per thread. Therefore, no bits in the pointer are available
for bit level addressing.

BE means you can read the strings in a core dump
LE means the bytes arrive in the order for on-line arithmetic
LE allows one to make 8-bit wide data paths and still implement a full
width architecture {but then so did 360/30}

The way I think of it is: consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values of bits in an integer

The only way to get all 3 consistent is with a little-endian architecture. Every big-endian architecture has inconsistencies between these somewhere
or another.

Very many LE machines got one or more of those wrong, too.


From MitchAlsup1@21:1/5 to Thomas Koenig on Wed May 1 16:43:09 2024

Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

(Interesting that the microprocessor world made byte addressing--and ASCII
character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)

A major market for microprocessors was pocket calculators,
cash registers and the like, which is why having 8 bits and BCD
arithmetic was an advantage - see the DAA instruction of the 8080
or the decimal flag on the 6502.

From 1978-1980 I worked at NCR corporation on cash registers.
We made a BASIC interpreter as the programmable backbone of
the cash register lineup. Not a single decimal arithmetic
instruction was used in the cash register application. The
BASIC interpreter was written by a 5-man team in 8085 assembler.

That model was sold from 1979 through 1998. So the lack of
decimal arithmetic was not a significant disadvantage.

The basis of the 8008, the first serious microprocessor,
was the Datapoint 2200. A nice history can be found at http://www.righto.com/2023/08/datapoint-to-8086.html .
And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that
8-bit bytes were a natural choice. (And I still think that
having BCD influenced the decision to go to the 8-bit byte
on the /360).

So, anything but a clean slate.


From MitchAlsup1@21:1/5 to Thomas Koenig on Wed May 1 16:46:04 2024

Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that 8-bit bytes
were a natural choice.

You mean IBM mainframes?

And compatibles. Together, they accounted for almost all mainframes.

I don’t think any other mainframes were byte-addressable.

IBM set the minimum standard for character capabilities; a
terminal had to support eight bits or be laughed out of the market.
Addressability has little to do with it.

Hmm... what sort of terminals and character sets did people use on
a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
the ASCII standard came out. (And /360 was supposed to support
ASCII originally, but that bit in the PSW got dropped for the /370,
I believe).

PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
ASCII character set. Programming languages and editors tended to use
the 6-bit character set.


From Scott Lurndal@21:1/5 to [email protected] on Wed May 1 16:57:39 2024

[email protected] (MitchAlsup1) writes:

Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:

And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that 8-bit bytes
were a natural choice.

You mean IBM mainframes?

And compatibles. Together, they accounted for almost all mainframes.

I don’t think any other mainframes were byte-addressable.

IBM set the minimum standard for character capabilities; a
terminal had to support eight bits or be laughed out of the market.
Addressability has little to do with it.

Hmm... what sort of terminals and character sets did people use on
a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
the ASCII standard came out. (And /360 was supposed to support
ASCII originally, but that bit in the PSW got dropped for the /370,
I believe).

PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
ASCII character set. Programming languages and editors tended to use
the 6-bit character set.

Early Burroughs systems used 6-bit binary "characters". Two fit
in one column of a 12-row Hollerith card.


From John Levine@21:1/5 to All on Wed May 1 17:32:54 2024

Please do me a favor and DO NOT guess why they did it --

Concerning the speculations about the PDP-11, here's one: Was it
designed for also supporting an implementation with a 4-bit or 8-bit
basis?

There are a bunch of design notes at bitsavers and none of them say
anything about it. There was one place that might have hinted that
little endian would save a few flip flops but since every PDP-11 was
16 bits internally, it wouldn't have saved much.

The PDP-X (the DEC-internal project that was canceled in favor of the
PDP-11 and eventually became the Nova) might have influenced the
PDP-11 in that way.

I gather the PDP-X and PDP-11 were warring camps. There's a bunch
of PDP-X notes at bitsavers and I don't see anything related to
the -11. In the Bell et al book there's a lot about the -11 which
only says it's different from the -8 and -9 series.


From John Levine@21:1/5 to All on Wed May 1 17:41:46 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

Ahem. You're guessing.

I can assure you it didn't make more sense to all the people who read
360 core dumps. BTDT.



From John Levine@21:1/5 to All on Wed May 1 17:53:05 2024

According to Stefan Monnier <[email protected]>:

I guess the idea of going all the way down to bit-level addressing
was considered a bit extreme?

STRETCH had bit addressing. It added a great deal of complication for
very little benefit. In the relatively rare situations where you want
to handle bit fields, shifting and masking is good enough without
slowing everything else down.

Bit addressing doesn't have to be expensive: the DEC Alpha could have
decided to use bit-addressing simply by ignoring/trapping more of the
lowest bits than it did.

That would waste three bits in every address, which would have been phenomenally expensive in the 1960s when every byte cost real money.

The 360 had 12 bit displacements, so you could address a 4K range
without having to load another base register. Spending three of those
bits on a bit offset would shrink it to 512 bytes, so as a first
approximation you'd need eight times as many base register loads. Nope.

I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.


From John Levine@21:1/5 to All on Wed May 1 18:13:57 2024

According to Thomas Koenig <[email protected]>:

8-bit bytes were a natural choice. (And I still think that
having BCD influenced the decision to go to the 8-bit byte
on the /360).

You don't have to guess. They explained in the IBM SJ paper
why they chose 8 bits rather than 6. BCD was part of it, as
was a belief that 6 bits wasn't going to be enough for
text, and it allowed 16 bit instructions and 32/64 bit
floating point.

Read it here: https://www.ece.ucdavis.edu/~vojin/CLASSES/EEC272/S2005/Papers/IBM360-Amdahl_april64.pdf



From Scott Lurndal@21:1/5 to John Levine on Wed May 1 18:20:43 2024

John Levine <[email protected]> writes:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

Ahem. You're guessing.

I can assure you it didn't make more sense to all the people who read
360 core dumps. BTDT.

To be fair, the tool that formatted the core dump could easily have
arranged the human visible values appropriately, much like xxd(1)
on Linux does for little-endian values (i.e. when grouping four
bytes per 32-bit word, the byte 3 value is printed first).
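
A minimal Python sketch of that kind of formatting. It is not a reimplementation of xxd, and the function name and sample data are illustrative. Each group is printed with its highest-addressed byte first, so a little-endian value reads the way one would write it.

```python
def dump_words_le(data: bytes, word_size: int = 4) -> str:
    """Format data as hex words, highest-addressed byte of each word first."""
    words = []
    for start in range(0, len(data), word_size):
        chunk = data[start:start + word_size]
        words.append(chunk[::-1].hex())   # reverse the bytes within the word
    return " ".join(words)


memory = (0x12345678).to_bytes(4, "little") + (0x0000BEEF).to_bytes(4, "little")
print(memory.hex(" "))         # 78 56 34 12 ef be 00 00
print(dump_words_le(memory))   # 12345678 0000beef
```

The catch is the `word_size` parameter: the formatter has to be told, or has to guess, where the word boundaries are.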


From Thomas Koenig@21:1/5 to John Levine on Wed May 1 18:17:33 2024

John Levine <[email protected]> schrieb:

I gather the PDP-X and PDP-11 were warring camps. There's a bunch
of PDP-X notes at bitsavers and I don't see anything related to
the -11. In the Bell et al book there's a lot about the -11 which
only says it's different from the -8 and -9 series.

Edson deCastro designed the PDP-X. When that project was cancelled
because of perceived potential competition with the 12-bit and
18-bit lines, he went off to found Data General and there built
the Nova, which used "byte pointers" where the uppermost bit
selected the low or high 8 bits of the 16-bit word.

Apparently, the PDP-11 was originally an 8-bit "desk calculator"
project which was then developed into the 16-bit architecture.
I have also read somewhere that competition from the Nova played
a major role.

DeCastro leaving was a major sore point for a lot of people at DEC,
so they probably did not tend to mention this influence.

There were allegations that the Nova was a copy of the proposed
PDP-X, but that was debunked now that some PDP-X development
documents have surfaced.


From Stefan Monnier@21:1/5 to All on Wed May 1 14:33:16 2024

I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.

Historically, the advantages vs disadvantages have indeed been rather
against bit-addressing. AFAICT when the DEC Alpha came out was the most favorable time: the first time that the cost was low enough (they
already had byte-addressing without byte-granularity of accesses,
they had plenty of address bits to waste, and there wasn't too much
existing 64bit code to break) to make the idea palatable.

Practical benefits are fairly limited, but it would just be The Right
thing to do, making it "easy" to eliminate some arbitrary restrictions
in languages like C such as the inability to take the address of
a struct's bitsized field. It would also have given an extra 3 bits to
play with for tagging purposes :-)

Stefan


From John Levine@21:1/5 to All on Wed May 1 18:37:07 2024

Thomas Koenig wrote:

Hmm... what sort of terminals and character sets did people use on
a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
the ASCII standard came out.

On the PDP-6 and PDP-10s I used they were all Teletypes and tty
compatible ASCII video terminals.

The normal way to store text was five 7-bit ASCII characters in a 36
bit word, since the byte handling instructions made that easy to
handle. It was common to start each line on a word boundary, so you had
to skip zero padding bytes. Text editors often included line numbers
that were five digit characters aligned on a word boundary, followed
by a tab. The low bit in the word with the digits was set to say it's
a line number, and compilers knew to look for the bit and skip the
line number and tab.
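
The packing arithmetic, sketched in Python: five 7-bit characters take 35 bits, left-justified in the 36-bit word, which leaves the low bit spare for the line-number flag. The function names are hypothetical and this is a model of the layout, not PDP-10 code.

```python
def pack_ascii5(text: str, flag: int = 0) -> int:
    """Pack up to five 7-bit ASCII characters into one 36-bit word.

    Characters are left-justified; short text is padded with zero bytes.
    The low bit of the word carries `flag`.
    """
    codes = text.encode("ascii").ljust(5, b"\0")
    if len(codes) != 5:
        raise ValueError("at most five characters fit in one word")
    word = 0
    for code in codes:
        word = (word << 7) | code
    return (word << 1) | (flag & 1)


def unpack_ascii5(word: int) -> tuple[str, int]:
    """Return (text with zero padding stripped, low-bit flag)."""
    chars = bytes((word >> (29 - 7 * i)) & 0x7F for i in range(5))
    return chars.rstrip(b"\0").decode("ascii"), word & 1


line_number = pack_ascii5("00010", flag=1)   # digits plus the marker bit
print(oct(line_number))
print(unpack_ascii5(line_number))            # ('00010', 1)
```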

Disk and DECtape used a six bit upper case ASCII subset for file names
so they could fit a six character name into one word. Compiler and
object file symbol tables used RADIX50 aka SQUOZE that fit a six
character symbol from a 40 character (octal 50) set into 32 bits with
four flag bits left.
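
The reason six characters fit is that 40**6 = 4,096,000,000, just under 2**32 = 4,294,967,296. A Python sketch of the base-40 idea follows. The ordering of the 40-character alphabet is my assumption, and how real tools justified short names is not modelled.

```python
# Assumed ordering of the 40-symbol set: blank, digits, letters, then . $ %
ALPHABET = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.$%"


def radix50_encode(symbol: str) -> int:
    """Encode up to six characters as one base-40 number."""
    if len(symbol) > 6:
        raise ValueError("at most six characters")
    value = 0
    for ch in symbol.upper():
        value = value * 40 + ALPHABET.index(ch)
    return value


def radix50_decode(value: int) -> str:
    """Inverse of radix50_encode; leading blanks are stripped."""
    chars = []
    for _ in range(6):
        value, index = divmod(value, 40)
        chars.append(ALPHABET[index])
    return "".join(reversed(chars)).strip()


code = radix50_encode("SYMBOL")
print(code, code < 2 ** 32)    # the second value is True
print(radix50_decode(code))    # SYMBOL
```

Even the largest code, six copies of the last symbol, is 40**6 - 1, so the top four bits of a 36-bit word stay free for flags.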

(And /360 was supposed to support
ASCII originally, but that bit in the PSW got dropped for the /370,
I believe).

They used a mutant ASCII that expanded from 7 to 8 bits by copying the
high bit into the middle of the byte, which nobody ever used. It was
one of the few inexplicably stupid choices in the 360.

According to MitchAlsup1 <[email protected]>:

PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
ASCII character set.

Dunno what computer that was, but it wasn't a PDP-10. Univac or GE600 maybe?


From John Levine@21:1/5 to All on Wed May 1 18:52:43 2024

According to Scott Lurndal <[email protected]>:

Ahem. You're guessing.

I can assure you it didn't make more sense to all the people who read
360 core dumps. BTDT.

To be fair, the tool that formatted the core dump could easily have
arranged the human visible values appropriately, much like xxd(1)
on Linux does for little-endian values (i.e. when grouping four
bytes per 32-bit word, the byte 3 value is printed first).

It could if it knew the structure of the data it was dumping, but it
didn't, which was OK because it didn't have to. Like I said, BTDT.

The first time I saw a PDP-11 in about 1970, I saw that the byte order
was backward and thought, well, that is strange, and then dealt with
it.


From Robert Swindells@21:1/5 to Stefan Monnier on Wed May 1 18:49:36 2024

On Wed, 01 May 2024 14:33:16 -0400, Stefan Monnier wrote:

I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.

Historically, the advantages vs disadvantages have indeed been rather
against bit-addressing. AFAICT when the DEC Alpha came out was the most favorable time: the first time that the cost was low enough (they
already had byte-addressing without byte-granularity of accesses,
they had plenty of address bits to waste, and there wasn't too much
existing 64bit code to break) to make the idea palatable.

Practical benefits are fairly limited, but it would just be The Right
thing to do, making it "easy" to eliminate some arbitrary restrictions
in languages like C such as the inability to take the address of a
struct's bitsized field. It would also have given an extra 3 bits to
play with for tagging purposes :-)

The TMS340[12]0 were bit-addressed 32 bit processors.

<https://en.wikipedia.org/wiki/TMS34010>

I never programmed one in C but the addressing worked well for doing
graphics operations.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed May 1 19:07:00 2024

According to Thomas Koenig <[email protected]>:

Apparently, the PDP-11 was originally an 8-bit "desk calculator"
project which was then developed into the 16-bit architecture.
I have also read somewhere that competition from the Nova played
a major role.

"Desk calculator" was a misleading code name so the large computer
group would leave them alone. The 11 design was largely by Harold
McFarland who'd done most of the work for Gordon Bell at CMU.
See https://hampage.hu/pdp-11/birth.html

Again, you don't have to guess. This is all documented.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Stefan Monnier on Wed May 1 18:55:06 2024

Stefan Monnier wrote:

I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.

Historically, the advantages vs disadvantages have indeed been rather
against bit-addressing. AFAICT when the DEC Alpha came out was the most favorable time: the first time that the cost was low enough (they
already had byte-addressing without byte-granularity of accesses,
they had plenty of address bits to waste, and there wasn't too much
existing 64bit code to break) to make the idea palatable.

Probably, but looking at code one rarely sees a field in a struct
that is a bit-field. So, even if the cost was low, the benefits
are similarly low.

Practical benefits are fairly limited, but it would just be The Right
thing to do, making it "easy" to eliminate some arbitrary restrictions
in languages like C such as the inability to take the address of
a struct's bitsized field. It would also have given an extra 3 bits to
play with for tagging purposes :-)

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to [email protected] on Wed May 1 18:53:09 2024

MitchAlsup1 <[email protected]> schrieb:

Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

(Interesting that the microprocessor world made byte addressing--and ASCII
character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)

A major market for microprocessors were pocket calculators,
cash registers and the like, which is why having 8 bits and BCD
arithmetic was an advantage - see the DAA instruction of the 8080
or the decimal flag on the 6502.

From 1978-1980 I worked at NCR corporation on cash registers.
We made a BASIC interpreter as the programmable backbone of
the cash register lineup. Not a single decimal arithmetic
instruction was used in the cash register application. The
BASIC interpreter was written by a 5-man team in 8085 assembler.

Quite interesting, thanks!

That model was sold from 1979 through 1998. So the lack of
decimal arithmetic was not a significant disadvantage.

The 8085 has DAA, as well :-)

However, at least the designers of the 8080 and the 6502 thought
that it was important, or they would not have invested silicon
in it. The 6502 people even had a patent on their direct
decimal arithmetic.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to All on Wed May 1 19:21:32 2024

MitchAlsup1 wrote:

Stefan Monnier wrote:

I agree that with 64 bit addresses and memory that is pennies per megabyte the tradeoffs are different but that horse left the barn
50 years ago. And I still don't think that bit operations are
common enough to be worth using bits in every non-bit address.

Historically, the advantages vs disadvantages have indeed been
rather against bit-addressing. AFAICT when the DEC Alpha came out
was the most favorable time: the first time that the cost was low
enough (they already had byte-addressing without byte-granularity
of accesses, they had plenty of address bits to waste, and there
wasn't too much existing 64bit code to break) to make the idea
palatable.

Probably, but looking at code one rarely sees a field in a struct
that is a bit-field. So, even if the cost was low, the benefits
are similarly low.

Sure. But it isn't clear if that was the cause or the result of the
hardware.

Practical benefits are fairly limited, but it would just be The
Right thing to do, making it "easy" to eliminate some arbitrary restrictions in languages like C such as the inability to take the
address of a struct's bitsized field. It would also have given an
extra 3 bits to play with for tagging purposes :-)

Stefan

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed May 1 22:33:23 2024

On Wed, 1 May 2024 06:32:17 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian.
Despite a lot of looking, I have never found an explanation of why
DEC made the PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

Unfortunately, when their Fortran compiler implemented 32-bit
integers (in software), they got the words the wrong way round.

The VAX was like a 32-bit extension of the PDP-11, and it was
consistently little-endian everywhere.

No, it was not.
The integer part was consistent, but the FP formats were mixed-endian.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to John Levine on Wed May 1 19:33:54 2024

John Levine wrote:

snip

According to MitchAlsup1 <[email protected]>:

PDP 10 had a 6-bit "field data" character set and a 9-bit bigger
than ASCII character set.

Dunno what computer that was, but it wasn't a PDP-10. Univac or
GE600 maybe?

I don't know about the PDP 10, but you are right that the Univac 1108 had
both six bit (technically a sixth of a word) and nine bit (quarter
word) operations. The 6 bit was Fieldata and used for most older
software. The quarter words held an 8 bit ASCII character with one
"wasted" bit per byte. This became the dominant usage for
applications, but the Exec itself still uses a lot of Fieldata.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to [email protected] on Wed May 1 22:56:52 2024

On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.

You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?

Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.

At the current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
for that long or longer.
The prospects of other byte-addressable types of memory look even
bleaker than DRAM's.
The only memory tech that is doing better is NAND flash, but it is
inherently block-addressable.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to John Levine on Wed May 1 22:40:12 2024

On Wed, 1 May 2024 17:53:05 -0000 (UTC)
John Levine <[email protected]> wrote:

According to Stefan Monnier <[email protected]>:

I guess the idea of going all the way down to bit-level
addressing
was considered a bit extreme?

STRETCH had bit addressing. It added a great deal of complication
for very little benefit. In the relatively rare situations where
you want to handle bit fields, shifting and masking is good enough
without slowing everything else down.

Bit addressing doesn't have to be expensive: the DEC Alpha could have
decided to use bit-addressing simply by ignoring/trapping more of the
lowest bits than it did.

That would waste three bits in every address, which would have been phenomenally expensive in the 1960s when every byte cost real money.

The 360 had 12 bit displacements, so you could address a 4K range
without having to load another base register. This would shrink
it to 1K, so as a first approximation you'd need four times as
many base register loads. Nope.

I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.

The bit-addressable TMS34010 was released 38 years ago and was even
moderately successful. So, it seems, 50 years ago nothing was set in
stone yet.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Wed May 1 20:30:16 2024

Michael S wrote:

On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.

You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?

Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.

At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
for that long or longer.

The largest single system memory I can find quickly is 160TB or about
47-bits of address space (I rounded down).

Given one can use CXL to coherently link multiples of such a system,
and not be limited by the number of pins dedicated to DRAM access;
40 years of growth at ½ a bit per year, already exceeds the 63-bit
address space (47+40/2 = 67 bits).

The prospects of other byte-addressable types of memory look even
bleaker than DRAM's.

Agreed (barring some kind of miracle).

The only memory tech that is doing better is NAND flash, but it is
inherently block-addressable.

And becomes the backing store.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed May 1 20:18:56 2024

According to Stephen Fuld <[email protected]d>:

Probably, but looking at code one rarely sees a field in a struct
that is a bit-field. So, even if the cost was low, the benefits
are similarly low.

Sure. But it isn't clear if that was the cause or the result of the
hardware.

The people who designed the 360 had just done STRETCH, which had bit addressing. If it was useful, they would have known.

The PDP-6/10 had load and store byte instructions that could address
bit strings of arbitrary size and alignment in a single instruction.
But in practice, the only thing we used them for was packing and
unpacking 7-bit ASCII into words.
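
As a rough illustration of that packing, here is a Python sketch that puts five 7-bit ASCII characters into one 36-bit word, left-justified with the lowest bit unused, which is the usual PDP-10 text layout as far as I know. It uses plain shifts and masks rather than modelling the byte instructions themselves, and the function names are invented for the example.

```python
WORD_BITS = 36
CHAR_BITS = 7
CHARS_PER_WORD = 5  # 5 * 7 = 35 bits, leaving 1 spare bit per word


def pack_ascii_word(text: str) -> int:
    """Pack up to five 7-bit characters into one 36-bit word, left-justified."""
    if len(text) > CHARS_PER_WORD:
        raise ValueError("at most five characters fit in one word")
    word = 0
    for i in range(CHARS_PER_WORD):
        code = ord(text[i]) if i < len(text) else 0  # pad with NUL
        if code > 0x7F:
            raise ValueError("not a 7-bit character")
        word = (word << CHAR_BITS) | code
    return word << 1  # shift past the unused low-order bit


def unpack_ascii_word(word: int) -> str:
    """Recover the characters, dropping NUL padding."""
    word >>= 1  # discard the unused low-order bit
    chars = []
    for i in range(CHARS_PER_WORD):
        shift = CHAR_BITS * (CHARS_PER_WORD - 1 - i)
        code = (word >> shift) & 0x7F
        if code:
            chars.append(chr(code))
    return "".join(chars)


w = pack_ascii_word("HELLO")
print(oct(w), unpack_ascii_word(w))
```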

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Wed May 1 16:28:55 2024

At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now.

Depends where. On "personal" computers, I fully agree, and indeed
there's been work instead on compressing 64bit pointers to fit into
32bit "boxes" (IIUC it's used in some Chrome versions) since many
applications never (or rarely) need to manipulate a heap larger
than 4GB.
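
A rough Python sketch of the idea, as I understand it: keep the whole heap inside one 4 GB region and store pointers as 32-bit offsets from the region base. The base address and function names are made up for illustration, and real implementations differ in details such as alignment and tagging.

```python
HEAP_BASE = 0x0000_7F12_0000_0000  # hypothetical base, aligned to 4 GB
HEAP_SIZE = 1 << 32                # everything compressible lives in this window


def compress(ptr: int) -> int:
    """Turn a full 64-bit pointer into a 32-bit offset from HEAP_BASE."""
    offset = ptr - HEAP_BASE
    if not 0 <= offset < HEAP_SIZE:
        raise ValueError("pointer is outside the compressed heap")
    return offset


def decompress(offset: int) -> int:
    """Rebuild the full pointer from a 32-bit offset."""
    return HEAP_BASE + (offset & 0xFFFF_FFFF)


p = HEAP_BASE + 0x1234_5678
small = compress(p)
print(hex(small), hex(decompress(small)))
```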

But for some HPC systems, it's not quite as obvious.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed May 1 20:37:11 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

I happened to be looking at Blaauw and Brooks "Computer Architecture"
published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:

"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian. Having two active conventions is very painful. Several
recent Big Endian RISC computers, including the MIPS, the Motorola
88000, and the Intel i860 provide a data-movement operation that can
perform the Big Endian-Little Endian permutation. We predict that
Little Endian addressing will die out, just as decimal addressing
did."

Really, people like what they are used to. They were just wrong about
the i860 which was little endian, but had a mode bit to make data
addressing big endian.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 20:38:58 2024

On Wed, 1 May 2024 16:38:09 +0000, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

You still need a place to put a bit offset for the base address of the
field. Why not put it together with the rest of the address?

Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full 63-bit virtual address space per thread. Therefore, no bits in the pointer are available for bit level addressing.

You will just have to make the move to 128-bit addressing, then. Some
designers (e.g. RISC-V) are already putting in place plans for that.

The way I think of it is: consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values of bits in an integer

The only way to get all 3 consistent is with a little-endian
architecture. Every big-endian architecture has inconsistencies between
these somewhere or another.
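
One way to make the consistency claim concrete is the following Python sketch. It assumes bits are numbered from the least significant end, so bit n of an integer has place value 2**n, and asks which memory byte holds bit n under each byte order. The helper name is hypothetical.

```python
def byte_holding_bit(n: int, width: int, byteorder: str) -> int:
    """Return the byte offset in memory that holds bit n (place value 2**n)
    of a `width`-byte integer stored with the given byte order."""
    value = 1 << n
    memory = value.to_bytes(width, byteorder)
    # Exactly one byte is nonzero; find its offset.
    return next(i for i, b in enumerate(memory) if b)


WIDTH = 4
for n in (0, 7, 8, 31):
    le = byte_holding_bit(n, WIDTH, "little")
    be = byte_holding_bit(n, WIDTH, "big")
    # Little-endian: the offset is simply n // 8, whatever the width.
    # Big-endian: the offset is (WIDTH - 1) - n // 8, so it depends on the width.
    print(n, le, be)
```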

Very many LE machines got one or more of those wrong, too.

For example?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed May 1 20:43:48 2024

On Wed, 1 May 2024 09:02:22 -0000 (UTC), Thomas Koenig wrote:

Hmm... what sort of terminals and character sets did people use on a
PDP-10? 7-bit ASCII? It (and the PDP-6) were released before the ASCII standard came out.

A bit before my time, but I recall terms like “SIXBIT” encoding from looking at docs. Also this weird thing called “Radix-50” (the “50” actually being octal for 40 decimal) did persist into PDP-11 days, when I
came along. It was a way of packing 3 characters (from a limited set, of course) into 2 bytes.
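
The arithmetic behind Radix-50 is that 40**3 = 64000, which fits in 16 bits (65536 values). A small Python sketch of the packing follows. The character table is the PDP-11 ordering as I remember it (space, A-Z, $, ., one code that varied between systems, then 0-9), so treat the exact table as an assumption. The function names are invented.

```python
# Assumed PDP-11 style table: the index in this string is the character's code (0..39).
# Code 29 ("%" here) was unused or assigned differently on some systems.
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"
assert len(RAD50_CHARS) == 40  # 40 decimal is 50 octal


def rad50_encode(text: str) -> int:
    """Pack exactly three characters into one 16-bit value."""
    if len(text) != 3:
        raise ValueError("need exactly three characters")
    value = 0
    for ch in text:
        value = value * 40 + RAD50_CHARS.index(ch)  # ValueError if not in the set
    return value


def rad50_decode(value: int) -> str:
    """Unpack a 16-bit value back into three characters."""
    if not 0 <= value < 40 ** 3:
        raise ValueError("not a valid Radix-50 word")
    c1 = value // 1600
    c2 = (value // 40) % 40
    c3 = value % 40
    return RAD50_CHARS[c1] + RAD50_CHARS[c2] + RAD50_CHARS[c3]


word = rad50_encode("ABC")
print(word, rad50_decode(word))  # 1683 ABC
```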

(And /360 was supposed to support ASCII originally,
but that bit in the PSW got dropped for the /370, I believe).

Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing its own EBCDIC encoding was that ASCII wasn’t ready in time. And so they saddled their entire mainframe world with this awkward, incompatible
encoding when the entire rest of the computing world very quickly embraced ASCII (and national variants based off it).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed May 1 20:50:23 2024

According to Lawrence D'Oliveiro <[email protected]d>:

The way I think of it is: consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values of bits in an integer

The only way to get all 3 consistent is with a little-endian
architecture. Every big-endian architecture has inconsistencies between
these somewhere or another.

As far as I can tell the 360/370 was consistently big-endian. The
convention for bit numbering in bytes and words was high to low but
since there weren't any instructions with bit numbers it didn't
matter.

Very many LE machines got one or more of those wrong, too.

For example?

The PDP-11 had mixed endian 32 bit integers and floats. VAX floating
point was pretty muddled, too.
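
For reference, the PDP-11 "mixed" layout for 32-bit integers is usually described as two little-endian 16-bit words stored with the more significant word first. A Python sketch, with invented function names, comparing it to the two consistent orders:

```python
def to_pdp_endian(value: int) -> bytes:
    """Store a 32-bit value as two 16-bit little-endian words, high word first."""
    high = (value >> 16) & 0xFFFF
    low = value & 0xFFFF
    return high.to_bytes(2, "little") + low.to_bytes(2, "little")


def from_pdp_endian(data: bytes) -> int:
    """Inverse of to_pdp_endian."""
    high = int.from_bytes(data[0:2], "little")
    low = int.from_bytes(data[2:4], "little")
    return (high << 16) | low


value = 0x0A0B0C0D
print(value.to_bytes(4, "big").hex())     # 0a0b0c0d
print(value.to_bytes(4, "little").hex())  # 0d0c0b0a
print(to_pdp_endian(value).hex())         # 0b0a0d0c
```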

Intel has been consistently little endian as far as I can remember.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed May 1 20:53:06 2024

According to Lawrence D'Oliveiro <[email protected]d>:

(And /360 was supposed to support ASCII originally,
but that bit in the PSW got dropped for the /370, I believe).

Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing
its own EBCDIC encoding was that ASCII wasn’t ready in time.

If you'd read the paper on the Architecture of System/360, you'd know
that is just plain wrong. See the link I posted earlier today.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Michael S on Wed May 1 20:54:42 2024

Michael S <[email protected]> writes:

On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.

You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?

Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.

At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
for that long or longer.

DRAM isn't the only thing that consumes physical address space bits.

The prospects of other byte-addressable types of memory look even
bleaker than DRAM's.

Consider CXL-Memory, for instance, where you have cache coherent
memory distributed via PCIe to a switched fabric with thousands
of multicore hosts - that quickly eats up the full 64 bits of PA;
52 bits per host leaves just 12 bits for host selector.
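
The 52 + 12 split can be sketched as follows in Python. The layout (host selector in the top 12 bits, per-host offset in the low 52) is only an assumed illustration of the arithmetic, not any real fabric's address map, and the function names are invented.

```python
HOST_OFFSET_BITS = 52
HOST_SELECT_BITS = 64 - HOST_OFFSET_BITS  # 12 bits, so at most 4096 hosts
OFFSET_MASK = (1 << HOST_OFFSET_BITS) - 1


def make_pa(host: int, offset: int) -> int:
    """Combine a host number and a per-host offset into one 64-bit address."""
    if not 0 <= host < (1 << HOST_SELECT_BITS):
        raise ValueError("host number does not fit in the selector field")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in the per-host field")
    return (host << HOST_OFFSET_BITS) | offset


def split_pa(pa: int) -> tuple:
    """Recover (host, offset) from a 64-bit address."""
    return pa >> HOST_OFFSET_BITS, pa & OFFSET_MASK


pa = make_pa(4095, 0x1000)
print(hex(pa), split_pa(pa))
```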

A single PCI Express device could easily require 64GB of memory
BAR space in the PA space, or even a TB.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to [email protected] on Wed May 1 23:54:56 2024

On Wed, 1 May 2024 20:30:16 +0000
[email protected] (MitchAlsup1) wrote:

Michael S wrote:

On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the
field and then extracting or inserting into the container.

You still need a place to put a bit offset for the base address
of the field. Why not put it together with the rest of the
address?

Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.

At current rate of DRAM Moore's Law it does not look like anybody
would need 63 bits 40 years from now. Arm's 55 or 56 bits will
likely suffice for that long or longer.

The largest single system memory I can find quickly is 160TB or about 47-bits of address space (I rounded down).

I am not aware of anything that big.
My impression was that the biggest cache-coherent system right now is
IBM's z15 Max190 (40 TB).

Given one can use CXL to coherently link multiples of such a system,
and not be limited by the number of pins dedicated to DRAM access;

But it would be very slow, so slow that it defeats the point of direct addressability.

40 years of growth at ½ a bit per year, already exceeds the 63-bit
address space (47+40/2 = 67 bits).

Half a bit per year sounds very quick. It seems that right now the rate is
much slower, something like doubling every 5-6 years. And it is likely
to become even slower in 20 years.
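
The two growth estimates are easy to compare numerically. A quick Python sketch, using only the figures quoted in this exchange (160 TB today, 40 years, half a bit per year versus one doubling every 5 to 6 years):

```python
import math

TB = 2 ** 40
bits_today = math.floor(math.log2(160 * TB))  # 160 TB needs a bit over 47 bits
YEARS = 40


def bits_after(years: float, bits_per_year: float) -> float:
    """Address bits needed after `years` of growth at the given rate."""
    return bits_today + years * bits_per_year


fast = bits_after(YEARS, 0.5)      # half a bit per year
slow_5 = bits_after(YEARS, 1 / 5)  # one doubling every 5 years
slow_6 = bits_after(YEARS, 1 / 6)  # one doubling every 6 years
print(bits_today, fast, round(slow_5, 1), round(slow_6, 1))
```

On these assumptions the fast rate overshoots 63 bits while the slow rates land in the mid 50s, so the disagreement is entirely about the growth rate.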

The prospects of other byte-addressable types of memory look even
bleaker than DRAM's.

Agreed (barring some kind of miracle).

The only memory tech that is doing better is NAND flash, but it is inherently block-addressable.

And becomes the backing store.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Wed May 1 21:13:31 2024

According to Michael S <[email protected]>:

years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.

The bit-addressable TMS34010 was released 38 years ago and was even
moderately successful. So, it seems, 50 years ago nothing was set in
stone yet.

True, but that chip is designed to be good for video rendering which
is an unusual application that uses a lot of bit aligned data.

Chips for specialized applications have all sorts of strange
architectures. Look at the Moto 56K DSP with 24 bit words and separate instruction and data memories. I wouldn't want to try and run linux
on it but it's great for signal processing.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 20:36:47 2024

On Wed, 1 May 2024 17:53:05 -0000 (UTC), John Levine wrote:

That would waste three bits in every address, which would have been phenomenally expensive in the 1960s when every byte cost real money.

But not today, with 64-bit addressing, was my point.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 20:36:02 2024

On Wed, 1 May 2024 17:41:46 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

As I previously mentioned, little-endian just makes more sense.

Ahem. You're guessing.

No I’m not. I’ve used both over many years.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Michael S on Wed May 1 21:40:17 2024

Michael S <[email protected]> writes:

On Wed, 1 May 2024 20:30:16 +0000
[email protected] (MitchAlsup1) wrote:

Given one can use CXL to coherently link multiples of such a system,
and not be limited by the number of pins dedicated to DRAM access;

But it would be very slow, so slow that it defeats the point of direct
addressability.

On what basis do you make that statement? CXL-memory is real,
and can be implemented on chiplets in an MCM with better
than multisocket latencies. Add Gen6 PCIe cut-through switching
and you get reasonable and useful latencies across a switched fabric.

Even a decade and a half ago, when we built a similar system using
QDR InfiniBand and a custom ASIC connected to HT or QPI,
we had internode latencies of less than 400ns r/t, which
was about double the Intel inter-socket latencies at the time.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to [email protected] on Wed May 1 22:11:46 2024

MitchAlsup1 <[email protected]> schrieb:

Michael S wrote:

On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.

You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?

Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.

At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
for that long or longer.

The largest single system memory I can find quickly is 160TB or about
47-bits of address space (I rounded down).

A single Power10 CPU can address 2 Petabytes (51 bits), but of course
it need not be all RAM.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Scott Lurndal on Thu May 2 02:04:37 2024

On Wed, 01 May 2024 21:40:17 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Wed, 1 May 2024 20:30:16 +0000
[email protected] (MitchAlsup1) wrote:

Given one can use CXL to coherently link multiples of such a
system, and not be limited by the number of pins dedicated to DRAM
access;

But it would be very slow, so slow that it defeats the point of
direct addressability.

On what basis do you make that statement? CXL-memory is real,
and can be implemented on chiplets in an MCM with better
than multisocket latencies. Add Gen6 PCIe cut-through switching
and you get reasonable and useful latencies across a switched fabric.

Even a decade and a half ago, when we built a similar system using
QDR InfiniBand and a custom ASIC connected to HT or QPI,
we had internode latencies of less than 400ns r/t, which
was about double the Intel inter-socket latencies at the time.

You didn't find many buyers, did you?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Thomas Koenig on Thu May 2 02:13:09 2024

On Wed, 1 May 2024 22:11:46 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

MitchAlsup1 <[email protected]> schrieb:

Michael S wrote:

On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:

I don't see what is wrong with loading a container with the
field and then extracting or inserting into the container.

You still need a place to put a bit offset for the base address
of the field. Why not put it together with the rest of the
address?

Given a 20-40 year life of an architecture and the desire not to
be limited by addressability; I wanted and demanded of myself a
full 63-bit virtual address space per thread. Therefore, no bits
in the pointer are available for bit level addressing.

At current rate of DRAM Moore's Law it does not look like anybody
would need 63 bits 40 years from now. Arm's 55 or 56 bits will
likely suffice for that long or longer.

The largest single system memory I can find quickly is 160TB or
about 47-bits of address space (I rounded down).

A single Power10 CPU can address 2 Petabytes (51 bits), but of course
it need not be all RAM.

How much memory is connected to the biggest cache-coherent Power10
computer that is actually for sale?
My guess is 32 TB.
IBM claims 64 TB, but that claim is likely based on memory technology
that is not available yet.
Anyway, even if 64 TB is true, it's only 46 bits.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to John Levine on Thu May 2 02:02:49 2024

On Wed, 1 May 2024 21:13:31 -0000 (UTC)
John Levine <[email protected]> wrote:

I wouldn't want to try and run linux
on it but it's great for signal processing.

I agree about first part, disagree about second.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 00:05:00 2024

On Wed, 1 May 2024 21:13:31 -0000 (UTC), John Levine wrote:

According to Michael S <[email protected]>:

The bit-addressable TMS34010 was released 38 years ago and was even
moderately successful. So, it seems, 50 years ago nothing was set in
stone yet.

True, but that chip is designed to be good for video rendering which is
an unusual application that uses a lot of bit aligned data.

And yet, all our machines nowadays are doing heavy amounts of “video rendering”, aren’t they? Look at the machine generating the screen display you’re looking at right now.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 1 23:17:06 2024

On Wed, 1 May 2024 20:37:11 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:

"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian."

It’s easy to illustrate why they’re wrong. First of all, a note that, even on big-endian architectures, registers are still actually little-endian.
Which is yet another reason why big-endian can never be entirely
consistent.

Consider this pseudo-assembly-language sequence:

move.l a, b
move.b b, c

where “move” denotes either “load” or “store” as appropriate, the “.b”
suffix indicates a byte operation, and “.l” denotes a multibyte operation (2, 4, 8 bytes or whatever, doesn’t matter as long as it’s more than 1).

As for the labels “a”, “b” and “c”, they can be reasonably interpreted (to
accommodate both RISC and non-RISC architectures) in two ways:
1) “a” and “c” are registers, “b” is a memory address; or
2) “b” is a register, while “a” and “c” are memory addresses.

Now the question is: which byte from “a” ends up at location “c”?

On a little-endian architecture, it is always the lowest-significance
byte.

But on a big-endian architecture, for a register-memory-register move, it
will be the highest-significance byte. But for the memory-register-memory
case, it will be the lowest-significance byte.

In other words, even on big-endian architectures, registers are still interpreted as little-endian!

Isn’t that fun?
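
The two cases above can be traced with a small C model. This is only a sketch: memory is modelled as an explicit byte array so that both byte orders can be shown on any host, the function names are made up for illustration, and it assumes that a byte store writes the low 8 bits of a register (true of many architectures, but an assumption here).

```c
#include <assert.h>
#include <stdint.h>

/* Bit position of the byte stored at address offset i (0..3). */
static int byte_shift(int i, int big_endian)
{
    assert(i >= 0 && i < 4);
    return big_endian ? 8 * (3 - i) : 8 * i;
}

/* Multibyte store: 32-bit "register" value into 4 bytes of "memory". */
void store32(uint8_t *mem, uint32_t reg, int big_endian)
{
    for (int i = 0; i < 4; i++)
        mem[i] = (uint8_t)(reg >> byte_shift(i, big_endian));
}

/* Multibyte load: 4 bytes of "memory" into a 32-bit "register" value. */
uint32_t load32(const uint8_t *mem, int big_endian)
{
    uint32_t reg = 0;
    for (int i = 0; i < 4; i++)
        reg |= (uint32_t)mem[i] << byte_shift(i, big_endian);
    return reg;
}

/* Case 1: register a -> memory b, then a byte load from b's address. */
uint8_t reg_mem_reg(uint32_t a, int big_endian)
{
    uint8_t b[4];
    store32(b, a, big_endian);
    return b[0];                /* the byte at the lowest address */
}

/* Case 2: memory a -> register b, then a byte store of b. */
uint8_t mem_reg_mem(const uint8_t *a, int big_endian)
{
    uint32_t b = load32(a, big_endian);
    return (uint8_t)b;          /* assumed: byte store takes the low 8 bits */
}
```

With the value 0x11223344, case 1 yields 0x44 in the little-endian model and 0x11 in the big-endian one, while case 2 yields 0x44 in both. Whether that asymmetry says anything about registers "being little-endian" is the point under dispute; the sketch only shows which byte moves.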

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Thu May 2 00:24:50 2024

On Wed, 1 May 2024 19:21:32 -0000 (UTC), Stephen Fuld wrote:

MitchAlsup1 wrote:

... looking at code one rarely sees a field in a struct that
is a bit-field. So, even if the cost was low, the benefits are
similarly low.

Sure. But it isn't clear if that was the cause or the result of the hardware.

Absolutely, I would say that is very much a chicken-and-egg effect. Also,
if you thought endian issues were complicated, look at how different architectures implement their bit-field instructions.

Interesting fact: in spite of all the arguments over big-endian versus little-endian, everybody seems to be in agreement over what “shift left” and “shift right” mean: “left” is always to the most significant end, while “right” is always to the least significant end. If you want to do
bit packing/unpacking in endian-independent C code, you do it with shifts
and masks.
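
A minimal sketch of what shifts and masks look like in practice. The helper names (put_be32, get_be32, get_field, set_field) are invented for this example; the byte order written by put_be32 is fixed by the code, not by the host CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Write v into buf most-significant byte first, whatever the host order. */
void put_be32(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
}

/* Read back a value written by put_be32. */
uint32_t get_be32(const uint8_t *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
         | ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}

/* Extract `width` bits starting at bit `lsb` (bit 0 = least significant). */
uint32_t get_field(uint32_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 32 && lsb <= 32 - width);
    uint32_t mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
    return (word >> lsb) & mask;
}

/* Return `word` with the same field replaced by `value`. */
uint32_t set_field(uint32_t word, unsigned lsb, unsigned width, uint32_t value)
{
    assert(width >= 1 && width <= 32 && lsb <= 32 - width);
    uint32_t mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
    return (word & ~(mask << lsb)) | ((value & mask) << lsb);
}
```

Because every step is expressed in terms of significance ("shift left" toward the most significant end), the same source gives the same byte stream on big- and little-endian hosts.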

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Thu May 2 01:20:57 2024

On Wed, 01 May 2024 16:28:55 -0400, Stefan Monnier wrote:

On "personal" computers ... there's been work instead on compressing
64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome versions) ...

Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64 instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

I don’t think it was very popular, and I also think it’s been dropped from current Linux kernels.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to [email protected] on Thu May 2 01:18:59 2024

It appears that Lawrence D'Oliveiro <[email protected]d> said:

"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian."

It’s easy to illustrate why they’re wrong. First of all, a note that, even on big-endian architectures, registers are still actually little-endian.

I would be most interested in a concrete illustration of this
implausible argument. How about starting with the IBM 360 principles
of operation and pointing out the little endian registers.

If you don't have a copy handy, you can find one here

https://bitsavers.org/pdf/ibm/360/princOps/A22-6821-7_360PrincOpsDec67.pdf

You might also look at its instruction set which is quite unlike the ones
you seem to be familiar with.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu May 2 01:28:32 2024

According to Lawrence D'Oliveiro <[email protected]d>:

Bit-addressable TMS34010 was released 38 years ago and even was moderately successful. So, it seems, 50 years ago nothing was set in stone yet.

True, but that chip is designed to be good for video rendering which is
an unusual application that uses a lot of bit aligned data.

And yet, all our machines nowadays are doing heavy amounts of “video rendering”, aren’t they? Look at the machine generating the screen display you’re looking at right now.

It's an Apple M2 chip with an eight-core dedicated GPU to do the video processing. Could you explain what point you're making here?

Every computer these days does graphics rendering so they have
specialized GPUs to make it fast, or on low end machines instruction
set extensions to make it sort of fast. In both cases that is because
graphics rendering is an unusual application that benefits from
specialized hardware. I hope that doesn't come as a big surprise.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 01:29:42 2024

On Wed, 1 May 2024 20:50:23 -0000 (UTC), John Levine wrote:

The PDP-11 had mixed endian 32 bit integers and floats.

The PDP-11 had no 32-bit integer instructions. It was the Fortran compiler (specifically “Fortran IV Plus”) that had mixed-endian 32-bit integers.

VAX floating point was pretty muddled, too.

Just rechecking one of their “architecture handbooks”, and the parts containing the mantissae are ordered big-endian by word, but little-endian between the bytes of a word.

Intel has been consistently little endian as far as I can remember.

That shows that it is possible. It is not possible for any big-endian architecture.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Thu May 2 01:39:35 2024

On Wed, 01 May 2024 14:08:25 GMT, Scott Lurndal wrote:

What about the IBM 1401, Electrodata 220 or Burroughs B5000?

Not really familiar with those--feel free to mention more details if you
have them.

Though I do recall, the 1401 didn’t have a “word length” as such: it was a
“character”-based machine. For example, it could do arbitrary-precision arithmetic--it just kept processing digits until it hit a special end-of-data
marker--but obviously this only worked for (fixed-point) addition and subtraction. The machine had no hardware support for multiplication or division. Or floating-point, for that matter.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 01:36:56 2024

On Wed, 1 May 2024 20:53:06 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing its own EBCDIC encoding was that ASCII wasn’t ready in time.

If you'd read the paper on the Architecture of System/360, you'd know
that is just plain wrong. See the link I posted earlier today.

See also these links:

<https://en.wikipedia.org/wiki/IBM_System/360_architecture> note 4:

Because the design of the S/360 occurred simultaneously with the
development of ASCII, IBM's ASCII support did not match the
standard that was ultimately adopted.

<https://news.ycombinator.com/item?id=12360749>:

This was roughly the same time the ANSI committee was trying to
standardize ASCII. IBM was a proponent of ASCII, but they had
shipping deadlines, and kept with their own character set rather
than delay while they created ASCII peripherals.

This item <https://retrocomputing.stackexchange.com/questions/15516/when-did-ibm-start-to-use-ascii>
claims IBM was “a major proponent for ASCII”, but only it seems for communicating with other systems, not internally within its own
products.

Odd, don’t you think. But consistent with the time-factor excuse.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to [email protected] on Thu May 2 01:51:51 2024

It appears that Lawrence D'Oliveiro <[email protected]d> said:

On Wed, 1 May 2024 20:53:06 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing its own EBCDIC encoding was that ASCII wasn’t ready in time.

If you'd read the paper on the Architecture of System/360, you'd know
that is just plain wrong. See the link I posted earlier today.

See also these links:

I'm familiar with those secondary sources. So just to be clear, you're
saying that when the S/360 architects published that 1964 paper saying
why they did what they did, they were lying?

Be sure and look at figure 2b, "8-bit representation of the 7-bit
American Standard Code for Information Interchange (ASCII)."

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu May 2 01:46:25 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 20:50:23 -0000 (UTC), John Levine wrote:

The PDP-11 had mixed endian 32 bit integers and floats.

The PDP-11 had no 32-bit integer instructions.

I'm holding in my hand a DEC pdp-11 processor handbook published in 1979.

On page 359 it describes LDCLF which converts a 32 bit mixed endian
integer to float or double, and on page 368-9 STCFL which went the other
way.

It was the Fortran compiler (specifically “Fortran IV Plus”) that had mixed-endian 32-bit integers.

Unsurprisingly it matched what the hardware did.
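
For readers who have not met the mixed-endian layout, here is a hedged C sketch. It assumes the commonly described arrangement (more significant 16-bit word at the lower address, bytes within each word stored low byte first); the handbook cited above is the authority, and the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v in "PDP-endian" order: high 16-bit word first,
   each word stored low byte first (assumed layout, see above). */
void put_pdp32(uint8_t *buf, uint32_t v)
{
    assert(buf != NULL);
    uint16_t hi = (uint16_t)(v >> 16);
    uint16_t lo = (uint16_t)v;
    buf[0] = (uint8_t)hi;
    buf[1] = (uint8_t)(hi >> 8);
    buf[2] = (uint8_t)lo;
    buf[3] = (uint8_t)(lo >> 8);
}

/* Read a value stored by put_pdp32. */
uint32_t get_pdp32(const uint8_t *buf)
{
    assert(buf != NULL);
    uint32_t hi = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8);
    uint32_t lo = (uint32_t)buf[2] | ((uint32_t)buf[3] << 8);
    return (hi << 16) | lo;
}
```

Under that assumption the value 0x0A0B0C0D lands in memory as the bytes 0B 0A 0D 0C, which is neither the big-endian 0A 0B 0C 0D nor the little-endian 0D 0C 0B 0A.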

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu May 2 01:57:53 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 01 May 2024 14:08:25 GMT, Scott Lurndal wrote:

What about the IBM 1401, Electrodata 220 or Burroughs B5000?

Not really familiar with those--feel free to mention more details if you
have them.

There's plenty of documentation at bitsavers.

Though I do recall, the 1401 didn’t have a “word length” as such: it was a
“character”-based machine. For example, it could do arbitrary-precision arithmetic--it just kept processing digits until it hit a special end-of-data marker--but obviously this only worked for (fixed-point) addition and subtraction. The machine had no hardware support for multiplication or division. Or floating-point, for that matter.

You may be confusing it with the 1620. The 1401 had optional multiply
and divide instructions but I don't think they were very popular. The
1620 had all four operations, famously and slowly implemented by table
lookup, and optional hardware floating point.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Michael S on Thu May 2 01:40:33 2024

On Wed, 1 May 2024 15:31:37 +0300, Michael S wrote:

In the world of general-purpose microprocessor, DEC Alpha (until EV6)
was more like word-addressable than byte-addressable, although it is a
matter of point of view.

As I recall, the original design left out byte-addressability, but this
was found to hurt Windows NT performance. So it was added later.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to John Levine on Thu May 2 05:05:25 2024

John Levine wrote:

snip

Every computer these days does graphics rendering

Is that true? What about all those computers that make up Google's
server farm? Or how about AWS systems? I am not saying they don't,
just asking.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 05:42:04 2024

On Thu, 2 May 2024 01:18:59 -0000 (UTC), John Levine wrote:

It appears that Lawrence D'Oliveiro <[email protected]d> said:

"Unlike Swift's, the computer Endian controversy is not pointless.
The Little Endian design has many complications in use; we much
prefer the Big Endian."

It’s easy to illustrate why they’re wrong. First of all, a note that, even on big-endian architectures, registers are still actually little-endian.

I would be most interested in a concrete illustration of this
implausible argument.

Sure. Consider this pseudo-assembly-language sequence:

move.l a, b
move.b b, c

where “move” denotes either “load” or “store” as appropriate, the “.b”
suffix indicates a byte operation, and “.l” denotes a multibyte operation (2, 4, 8 bytes or whatever, doesn’t matter as long as it’s more than 1).

As for the labels “a”, “b” and “c”, they can be reasonably interpreted (to
accommodate both RISC and non-RISC architectures) in two ways:
1) “a” and “c” are registers, “b” is a memory address; or
2) “b” is a register, while “a” and “c” are memory addresses.

Now the question is: which byte from “a” ends up at location “c”?

On a little-endian architecture, it is always the lowest-significance
byte.

But on a big-endian architecture, for a register-memory-register move, it
will be the highest-significance byte. But for the memory-register-memory
case, it will be the lowest-significance byte.

In other words, even on big-endian architectures, registers are still interpreted as little-endian!

Isn’t that fun?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Thu May 2 06:59:49 2024

On Thu, 2 May 2024 01:57:53 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

Though I do recall, the 1401 didn’t have a “word length” as such:
it was a “character”-based machine. For example, it could do
arbitrary-precision arithmetic--it just kept processing digits
until it hit a special end-of-data marker--but obviously this only
worked for (fixed-point) addition and subtraction. The machine had
no hardware support for multiplication or division. Or
floating-point, for that matter.

You may be confusing it with the 1620.

The 1401 was the one with the “word-mark” bit that I was thinking of,
which was set to 1 in the final (highest-order) digit of a number.

The 1620 looks like it did it in a different way, with a separate end-of-number character.

The “Guide to 1401 Programming” I’m looking at (from 1961) makes no mention of multiplication or division.
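
The word-mark idea can be illustrated with a toy C model. It is not a 1401 emulator: it assumes both fields have the same length, ignores signs and zone bits, and the `cell` type and `add_fields` name are invented here. Each field is addressed by its lowest-order digit and processed toward lower addresses until a word mark is found.

```c
#include <assert.h>
#include <stddef.h>

/* One storage position: a decimal digit plus a word-mark flag. */
typedef struct {
    unsigned char digit;  /* 0..9 */
    unsigned char wm;     /* 1 on the highest-order digit of a field */
} cell;

/* Add field A into field B, digit by digit, until B's word mark.
   a_low and b_low index the lowest-order digit of each field.
   Returns the carry out of the highest-order digit. */
int add_fields(cell *mem, size_t a_low, size_t b_low)
{
    int carry = 0;
    for (;;) {
        int sum = mem[a_low].digit + mem[b_low].digit + carry;
        mem[b_low].digit = (unsigned char)(sum % 10);
        carry = sum / 10;
        if (mem[b_low].wm)          /* word mark: that was the last digit */
            break;
        /* toy restriction: fields must be the same length */
        assert(!mem[a_low].wm && a_low > 0 && b_low > 0);
        a_low--;
        b_low--;
    }
    return carry;
}
```

Nothing in the loop depends on a fixed word size, which is the sense in which the arithmetic is "arbitrary-precision": the length of the operand is a property of the data, not of the instruction.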

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to All on Thu May 2 09:54:49 2024

MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a
long time after, machines had a “word length” which was the
smallest addressable unit of memory. This could have a range of sizes,
e.g.

    12 -- DEC PDP-5/8
    18 -- DEC PDP-1/4/7/9
    36 -- DEC PDP-6/10
    60 -- CDC 6000-series
    64 -- Cray

CDC had a number of machines with word lengths of 12×k bits, k ∈ {1,2,3,5}.

I’m sure there were also 24- and 48-bit machines. Note the
popularity of numbers with a range of different integer divisors,
including powers of both 2 and 3. The byte-addressable machines
chucked away everything other than powers of 2, which was a step
backwards in this respect. ;)

I would make the argument that 2^k was a step forward not backwards.
Perhaps another day...

I've seen the argument that e is the best base from an energy
standpoint, with 2 and 3 being the two closest integer values.
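
The argument usually meant here is radix economy (that this is the one being recalled is an assumption). If representing numbers up to N in base b costs b states per digit times log_b N digits, the cost and its minimum work out as:

```latex
E(b) = b \log_b N = \frac{b}{\ln b}\,\ln N,
\qquad
\frac{dE}{db} = \frac{\ln b - 1}{(\ln b)^2}\,\ln N = 0
\;\Longrightarrow\; b = e .
```

Under that cost model, b/ln b is about 2.885 for base 2, about 2.731 for base 3, and about 2.718 at b = e, so 3 comes out slightly ahead of 2.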

Working with trits, encoded as -/0/+, would have been feasible, but
binary provided much easier implementation. Base conversions are a bit
messier when you use base3 as the machine representation, but you could
have used 5 trits (243) to handle the US ASCII character set.

In retrospect I'm glad they decided on binary!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Terje Mathisen on Thu May 2 08:14:38 2024

Terje Mathisen <[email protected]> schrieb:

Working with trits, encoded as -/0/+, would have been feasible,

There was a Russian computer that implemented that.

but
binary provided much easier implementation. Base conversions are a bit messier when you use base3 as the machine representation, but you could
have used 5 trits (243) to handle the US ASCII character set.

In retrospect I'm glad they decided on binary!

I like balanced ternary for its symmetry. There
appears to have been a Soviet computer implementing it, https://en.wikipedia.org/wiki/Setun . I also like the idea of
encoding a comparison with three values in a single trit.

But for today's technology, binary is much easier to implement,
so it is the logical choice.
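
A short C sketch of balanced ternary, to make the symmetry and the three-way comparison concrete. Five trits give 3^5 = 243 states, covering -121..121; mapping the 128 ASCII codes onto that range would need an offset, which is left out here. The function names are invented for the example.

```c
#include <assert.h>

/* Encode n (-121..121) as 5 balanced-ternary trits, least significant
   first. Each trit is -1, 0 or +1. */
void to_balanced_ternary(int n, int *trits)
{
    assert(n >= -121 && n <= 121);
    for (int i = 0; i < 5; i++) {
        int r = ((n % 3) + 3) % 3;  /* remainder in 0..2, also for n < 0 */
        if (r == 2)
            r = -1;                 /* digit 2 becomes -1 plus a carry */
        trits[i] = r;
        n = (n - r) / 3;
    }
}

/* Decode 5 trits (least significant first) back to an integer. */
int from_balanced_ternary(const int *trits)
{
    int n = 0;
    for (int i = 4; i >= 0; i--)
        n = 3 * n + trits[i];
    return n;
}

/* Three-way comparison whose result fits in a single trit. */
int compare_trit(int a, int b)
{
    return (a > b) - (a < b);
}
```

Negating a number just flips the sign of every trit, so no separate sign bit or complement rule is needed; 8 encodes as (-1, 0, +1, 0, 0) and -8 as (+1, 0, -1, 0, 0), least significant trit first.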

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Thu May 2 07:24:26 2024

I wrote:

The “Guide to 1401 Programming” I’m looking at (from 1961) makes no mention of multiplication or division.

No hardware instructions, just a mention of a multiplication subroutine.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Thu May 2 08:59:10 2024

On Thu, 2 May 2024 09:54:49 +0200, Terje Mathisen wrote:

I've seen the argument that e is the best base from an energy
standpoint, with 2 and 3 being the two closest integer values.

To implement a non-integer base, you would need something like a
probabilistic distribution of combinations of digits, rather than allowing every possible combination to be equally representable. Then you can
average out the information content to a suitable value.

So it would be an average-base-e representation.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to John Levine on Thu May 2 13:54:32 2024

On Wed, 1 May 2024 20:37:11 -0000 (UTC)
John Levine <[email protected]> wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian.
Despite a lot of looking, I have never found an explanation of why
DEC made the PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:

"Unlike Swift's, the computer Endian controversy is not pointless.
The Little Endian design has many complications in use; we much
prefer the Big Endian. Having two active conventions is very painful.
Several recent Big Endian RISC computers, including the MIPS, the
Motorola 88000, and the Intel i860 provide a data-movement operation
that can perform the Big Endian-Little Endian permutation. We predict
that Little Endian addressing will die out, just as decimal addressing
did."

IMHO, statements like that are forgivable for Blaauw (born 1924). Less
so for Brooks, who was seven years younger.

Really, people like what they are used to. They were just wrong about
the i860 which was little endian, but had a mode bit to make data
addressing big endian.

Expressions of personal prejudices are fine for informal Usenet
articles. For a book that pretends to be more than a memoir I expect more
rigorous reasoning.
But I didn't read the book and don't know its genre. Possibly it is in
fact a memoir hidden behind an uncharacteristic title. The full title is
"Computer architecture: concepts and evolution." The last word gives a
hint that this may be the case.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu May 2 12:00:40 2024

According to Michael S <[email protected]>:

that can perform the Big Endian-Little Endian permutation. We predict
that Little Endian addressing will die out, just as decimal addressing
did."

IMHO, statements like that are forgivable for Blaauw (born 1924). Less
so for Brooks, who was seven years younger.

Really, people like what they are used to. They were just wrong about
the i860 which was little endian, but had a mode bit to make data
addressing big endian.

Expressions of personal prejudices are fine for informal Usenet
articles. For a book that pretends to be more than a memoir I expect more rigorous reasoning.

It's a pretty good textbook and that prediction is one of the few
places where they blow it, perhaps because from inside IBM they didn't
realize how much the rest of the world had moved beyond IBM
compatibility.

But my point is that the arguments about big- and little-endian are
far more about what you are used to than any inherent advantage of one
or the other. As we have seen in recent bickering here, it is easy to
construct examples that appear to make your less favored option look
wrong, particularly if you don't know how actual implementations work.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu May 2 11:52:56 2024

According to Lawrence D'Oliveiro <[email protected]d>:

Sure. Consider this pseudo-assembly-language sequence:

move.l a, b
move.b b, c
...
Now the question is: which byte from “a” ends up at location “c”?

You really should stop guessing about computer architectures rather
than reading up on them.

On S/360, which is the ur-big-endian machine, memory to memory moves
are different from register loads and stores. There are ICM and STCM instructions that take a four bit mask to say which bytes in the
register to load or store. There are also IC and STC for the common
case that you only want to load or store the low byte.
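
A rough C model of a store-under-mask operation, for readers who have not seen one. The mask semantics assumed here (leftmost mask bit selects the leftmost register byte, selected bytes go to consecutive addresses) follow the usual description of STCM but should be checked against the Principles of Operation; the function name is invented.

```c
#include <assert.h>
#include <stdint.h>

/* Store the register bytes selected by a 4-bit mask to consecutive
   addresses. Mask bit 8 selects the most significant byte, bit 1 the
   least significant. Returns the number of bytes stored. */
int store_under_mask(uint32_t reg, unsigned mask, uint8_t *dst)
{
    assert(mask <= 0xF);
    int n = 0;
    for (int i = 0; i < 4; i++) {           /* i = 0 is the leftmost byte */
        if (mask & (8u >> i))
            dst[n++] = (uint8_t)(reg >> (8 * (3 - i)));
    }
    return n;
}
```

In this model a mask of 0001 behaves like a plain low-byte store, 1000 stores the high byte, and 0101 stores the second and fourth bytes side by side. The program chooses which register byte it means, rather than the choice following from the byte order.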

In other words, even on big-endian architectures, registers are still interpreted as little-endian!

Isn’t that fun?

I suppose it would be if it were true.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Thu May 2 12:20:37 2024

According to Stephen Fuld <[email protected]d>:

John Levine wrote:

snip

Every computer these days does graphics rendering

Is that true? What about all those computers that make up Google's
server farm? Or how about AWS systems? I am not saying they don't,
just asking.

AWS has several varieties of their custom Graviton chips:

https://aws.amazon.com/ec2/graviton/

Some of them are just ARM cores for stuff like databases but some are
intended for video processing and game streaming:

https://aws.amazon.com/ec2/instance-types/g5g/

So you're right, it's not every computer, but it's more than you might think.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to John Levine on Thu May 2 07:28:56 2024

On 5/2/2024 5:20 AM, John Levine wrote:

According to Stephen Fuld <[email protected]d>:

John Levine wrote:

snip

Every computer these days does graphics rendering

Is that true? What about all those computers that make up Google's
server farm? Or how about AWS systems? I am not saying they don't,
just asking.

AWS has several varieties of their custom Graviton chips:

https://aws.amazon.com/ec2/graviton/

Some of them are just ARM cores for stuff like databases but some are intended for video processing and game streaming:

https://aws.amazon.com/ec2/instance-types/g5g/

So you're right, it's not every computer, but it's more than you might think.

Fair enough. Thanks.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Michael S on Thu May 2 14:32:50 2024

Michael S <[email protected]> writes:

On Wed, 01 May 2024 21:40:17 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Wed, 1 May 2024 20:30:16 +0000
[email protected] (MitchAlsup1) wrote:

Given one can use CXL to coherently link multiples of such a
system, and not be limited by the number of pins dedicated to DRAM
access;

But it would be very slow, so slow that it defeats the point of
direct addressability.

On what basis do you make that statement? CXL-memory is real,
and can be implemented on chiplets in an MCM with better
than multisocket latencies. Add Gen6 PCIe cut-through switching
and you get reasonable and useful latencies across a switched fabric.

Even a decade and a half ago, when we built a similar system using
QDR InfiniBand and a custom ASIC connected to HT or QPI,
we had internode latencies of less than 400ns r/t, which
was about double the Intel inter-socket latencies at the time.

You didn't find many buyers, did you?

We were in one of the national labs before the recession eliminated
further funding.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to [email protected] on Thu May 2 08:58:23 2024

On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On a little-endian architecture, it is always the lowest-significance
byte.

But on a big-endian architecture, for a register-memory-register move, it will be the highest-significance byte. But for the memory-register-memory case, it will be the lowest-significance byte.

In other words, even on big-endian architectures, registers are still interpreted as little-endian!

Isn’t that fun?

It had never occurred to me to think about it in this way.

To me, it just made sense that, since registers contain quantities, if
you load the value "8" into a register, it will contain the number 8.

So in a byte operation, the least significant bits of the register are
used.

While if you store something in a memory location, you're only using
the length corresponding to the size of the operand. So, yes, storing
a value into a byte in memory... puts it at the location of the most significant 8 bits of a 32-bit quantity having the same address.

But so what? Usually, a memory location is used for only one size of
data. If EQUIVALENCE magic is going on, it makes more sense to have
numbers in memory look the way we write them, so it's easy to
understand.

Plus, if you load a single precision float into a floating-point
register, you are loading on the left side, not the right side, so the inconsistency to which you're referring now impacts the little-endian
machines. (Of course, though, that's no longer quite true with IEEE
754, since the exponent isn't the same size for all precisions, the
way it was with old-fashioned machines.)

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Thu May 2 13:45:50 2024

On "personal" computers ... there's been work instead on compressing
64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome
versions) ...

Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64

Indeed, but I got the impression that there is a bit of a revival of
interest for pointer compression as the evidence seems to point to RAM
sizes not increasing very much any more on "end user devices".

See for instance https://v8.dev/blog/pointer-compression

Note also that this is targeted at JavaScript: dynamically typed
languages tend to suffer more from the 64bit bloat because of their use
of "boxing", meaning that pretty much everything (except usually for
strings and arrays of floats, which are special-cased) doubles in size
when the "box" size is changed from 32bit to 64bit.
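
The general idea behind pointer compression can be sketched in C as a base address plus a 32-bit offset. This is only an illustration in the spirit of the scheme described in the linked article, not V8's actual code; the `heap_cage` type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All compressed pointers are offsets into one contiguous heap region. */
typedef struct {
    unsigned char *base;
    size_t size;          /* must not exceed 4 GiB so offsets fit in 32 bits */
} heap_cage;

typedef uint32_t cptr;    /* compressed pointer: byte offset from base */

/* Turn a full pointer into the heap into a 32-bit offset. */
cptr compress_ptr(const heap_cage *h, const void *p)
{
    const unsigned char *q = p;
    assert(q >= h->base && (size_t)(q - h->base) < h->size);
    return (cptr)(q - h->base);
}

/* Rebuild the full pointer from the base and the offset. */
void *decompress_ptr(const heap_cage *h, cptr c)
{
    assert(c < h->size);
    return h->base + c;
}
```

Every boxed field then occupies 4 bytes instead of 8, at the cost of one add on each dereference and a cap on the size of the heap region.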

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Levine on Thu May 2 17:21:30 2024

John Levine <[email protected]> writes:

As far as I can tell the 360/370 was consistently big-endian. The
convention for bit numbering in bytes and words was high to low but
since there weren't any instructions with bit numbers it didn't
matter.

I remember reading the PowerPC documentation where the most
significant bit was bit 0, so it was consistently big-endian. But the
problem with this is that the least significant bit of a byte is bit
7, of a halfword bit 15, of a word bit 31, etc. I don't remember if
PowerPC has instructions where bit numbers play a role, though.

With OpenPower being little-endian, did they rewrite all the docs to
renumber the bits?

The 68000 and 88000 architectures (which have instructions with bit
numbers) make the least significant bit have number 0, so they are
bitwise little-endian. The 68000 is bytewise big-endian, and I
remember things getting pretty messy when I tried to use bit-numbering instructions for data larger than 32 bits. The 88000 supports
little-endian mode, but IIRC the DG Aviion used big-endian mode.
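
The two bit-numbering conventions differ only in which end counts as bit 0, as this small C sketch shows. The names mask_lsb0 and mask_msb0 are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for bit n when bit 0 is the least significant bit (LSB-0). */
uint64_t mask_lsb0(unsigned n, unsigned width)
{
    assert(width >= 1 && width <= 64 && n < width);
    return UINT64_C(1) << n;
}

/* Mask for bit n when bit 0 is the most significant bit (MSB-0). */
uint64_t mask_msb0(unsigned n, unsigned width)
{
    assert(width >= 1 && width <= 64 && n < width);
    return UINT64_C(1) << (width - 1 - n);
}
```

With LSB-0 numbering the mask for a given bit number does not depend on the operand width. With MSB-0 numbering it does, which is why the least significant bit is bit 7 of a byte, bit 15 of a halfword and bit 31 of a word.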

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Levine on Thu May 2 17:37:47 2024

John Levine <[email protected]> writes:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:

"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian. Having two active conventions is very painful. Several
recent Big Endian RISC computers, including the MIPS, the Motorola
88000, and the Intel i860

MIPS and 88000 support both big- and little-endian operation; and at
least for MIPS, there were a lot of little-endian machines around: the DECstations. Even today, <https://popcon.debian.org/> reports:

mips : 7
mips64el : 10
mipsel : 4

So twice as many little-endian (el) systems as big-endian ones.

provide a data-movement operation that can
perform the Big Endian-Little Endian permutation. We predict that
Little Endian addressing will die out, just as decimal addressing
did."

I did not expect any of them to die out, but actually big-endian is
dying out. HPPA and SPARC have been cancelled, Power has switched to little-endian, and S390x is a niche, and MIPS has left the
general-purpose computing field.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Terje Mathisen on Thu May 2 18:23:59 2024

Terje Mathisen wrote:

MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a
long time after, machines had a “word length” which was the
smallest addressable unit of memory. This could have a range of sizes,
e.g.

    12 -- DEC PDP-5/8
    18 -- DEC PDP-1/4/7/9
    36 -- DEC PDP-6/10
    60 -- CDC 6000-series
    64 -- Cray

CDC had a number of machines with word lengths of 12 bits times k,
for k in {1,2,3,5}.

I’m sure there were also 24- and 48-bit machines. Note the
popularity of numbers with a range of different integer divisors,
including powers of both 2 and 3. The byte-addressable machines
chucked away everything other than powers of 2, which was a step
backwards in this respect. ;)

I would make the argument that 2^k was a step forward not backwards.
Perhaps another day...

I've seen the argument that e is the best base from an energy
standpoint, with 2 and 3 being the two closest integer values.

If one wants to take a low-fan-out signal and drive a lot of loads
(high fan-out), then the least-energy way of doing this is an
exponentiating rate of e, but often 3 (sometimes 4) were close enough.
(Mead-Conway)

Working with trits, encoded as -/0/+, would have been feasible, but
binary provided much easier implementation. Base conversions are a bit messier when you use base3 as the machine representation, but you could
have used 5 trits (243) to handle the US ASCII character set.

One gets 1 bit of storage with 2 tubes (or transistors) and the storage
for a stable trit requires 4 tubes (lower storage per tube).

In retrospect I'm glad they decided on binary!

Binary self chose due to the medium (tubes and transistors).

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Thu May 2 18:28:18 2024

John Savard wrote:

On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro

Plus, if you load a single precision float into a floating-point
register, you are loading on the left side, not the right side, so the

In My 66000, floats are stored on the right side of the register
{mostly because I do not have FP LD/STs.}

inconsistency to which you're referring now impacts the little-endian machines. (Of course, though, that's no longer quite true with IEEE
754, since the exponent isn't the same size for all precisions, the
way it was with old-fashioned machines.)

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu May 2 18:33:48 2024

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 20:37:11 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:

Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.

As I previously mentioned, little-endian just makes more sense.

I happened to be looking at Blaauw and Brooks "Computer Architecture"
published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:

"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian."

It’s easy to illustrate why they’re wrong. First of all, a note that, even
on big-endian architectures, registers are still actually little-endian.

Which is yet another reason why big-endian can never be entirely
consistent.

IBM 360 had its most significant bit labeled as bit<0>.

We don't do that any more because we want the lowest bit number of
a bit-field to equal the shift count needed to right align the
field in the register.
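
The shift-count point can be sketched in C. This is only an illustration, not any particular ISA: the helper names field_lsb0 and field_msb0 and the 32-bit register width are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* LSB-0 numbering: bit 0 is the least significant bit, so the lowest
   bit number of a field is also the right-shift count. */
uint32_t field_lsb0(uint32_t reg, unsigned low, unsigned width)
{
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (reg >> low) & mask;
}

/* MSB-0 numbering (bit 0 is the most significant bit): the shift count
   has to be derived from the register width first. */
uint32_t field_msb0(uint32_t reg, unsigned first, unsigned width)
{
    unsigned low = 32u - first - width;   /* convert to LSB-0 */
    return field_lsb0(reg, low, width);
}

/* The same 8-bit field described under both conventions. */
void bit_numbering_demo(void)
{
    uint32_t reg = 0x12345678u;
    assert(field_lsb0(reg, 8, 8) == 0x56u);   /* bits 15..8 in LSB-0 terms */
    assert(field_msb0(reg, 16, 8) == 0x56u);  /* bits 16..23 in MSB-0 terms */
}
```

With LSB-0 numbering the field position is used directly as the shift; with MSB-0 numbering the same field needs a subtraction that depends on the register width.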

Consider this pseudo-assembly-language sequence:

move.l a, b
move.b b, c

May I suggest that the above ILLUSTRATES why someone wants to use
LD and ST instructions rather than directionless MOV instructions.
The interpretation of the instruction is determined by the operands
not by the OpCode.

where “move” denotes either “load” or “store” as appropriate, the “.b”
suffix indicates a byte operation, and “.l” denotes a multibyte operation
(2, 4, 8 bytes or whatever, doesn’t matter as long as it’s more than 1).

As for the labels “a”, “b” and “c”, they can be reasonably interpreted (to
accommodate both RISC and non-RISC architectures) in two ways:
1) “a” and “c” are registers, “b” is a memory address; or
2) “b” is a register, while “a” and “c” are memory addresses.

All of the above goes away when LD/STs are used instead of MOV.

Now the question is: which byte from “a” ends up at location “c”?

On a little-endian architecture, it is always the lowest-significance
byte.

But on a big-endian architecture, for a register-memory-register move, it
will be the highest-significance byte. But for the memory-register-memory
case, it will be the lowest-significance byte.

In other words, even on big-endian architectures, registers are still interpreted as little-endian!

Isn’t that fun?
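
The two cases described above can be simulated in C. This is a sketch under stated assumptions: a 4-byte "multibyte" operand, and hypothetical helper names (store32, load32, reg_mem_reg, mem_reg_mem) that do not come from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit register value at mem[0..3] in the chosen byte order. */
void store32(uint8_t *mem, uint32_t value, int big_endian)
{
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        mem[i] = (uint8_t)(value >> shift);
    }
}

/* Load a 32-bit register value from mem[0..3] in the chosen byte order. */
uint32_t load32(const uint8_t *mem, int big_endian)
{
    uint32_t value = 0;
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        value |= (uint32_t)mem[i] << shift;
    }
    return value;
}

/* Case 1: register a -> memory b (wide store), then a byte load from
   the same address b into register c. */
uint8_t reg_mem_reg(uint32_t a, int big_endian)
{
    uint8_t b[4];
    store32(b, a, big_endian);
    return b[0];                 /* the byte at the lowest address */
}

/* Case 2: memory a -> register b (wide load), then a byte store of
   register b, which takes the low 8 bits of the register. */
uint8_t mem_reg_mem(const uint8_t *a, int big_endian)
{
    uint32_t b = load32(a, big_endian);
    return (uint8_t)b;
}

void endian_demo(void)
{
    uint8_t m[4];
    assert(reg_mem_reg(0x11223344u, 0) == 0x44);  /* little-endian: low byte */
    assert(reg_mem_reg(0x11223344u, 1) == 0x11);  /* big-endian: high byte */
    store32(m, 0x11223344u, 1);
    assert(mem_reg_mem(m, 1) == 0x44);            /* big-endian: low byte */
}
```

For the value 0x11223344, the register-memory-register path yields 0x44 under little-endian and 0x11 under big-endian, while the memory-register-memory path yields 0x44 under both.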

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 02:59:42 2024

On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

... MIPS has left the general-purpose computing field.

Not so sure that it has. I think the Chinese “LoongArch” machines are a MIPS derivative.

Also, if you want to think of “MIPS” as a corporate entity, that would be the company currently known as “Imagination Technologies”. It is true they have given up on the MIPS architecture, and are now quite heavily into
RISC-V.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 03:02:20 2024

On Thu, 02 May 2024 17:21:30 GMT, Anton Ertl wrote:

The 68000 and 88000 architectures (which have instructions with bit
numbers) make the least significant bit have number 0, so they are
bitwise little-endian.

The 68000 family is an example of the knots you can tie yourself into,
trying to come up with bit numberings for a big-endian architecture.

The 16-bit members of the family (pre-68020) had single-bit
extraction/insertion instructions, which numbered the bits one way. The 32-bit
machines added bit-field instructions, which used an entirely different
bit numbering.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Fri May 3 05:55:46 2024

On Thu, 2 May 2024 18:33:48 +0000, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

move.l a, b
move.b b, c

May I suggest that the above ILLUSTRATES why someone wants to use LD and
ST instructions rather than directionless MOV instructions.

OK, use explicit load/store instead of generic move:

register-memory-register:

store.l a, b
load.b b, c

memory-register-memory:

load.l a, b
store.b b, c

Do you see why this makes absolutely no difference to what happens, as per
my description earlier?

By the way, in case it wasn’t clear: in my examples, the destination
operand is always the last one.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Fri May 3 09:48:50 2024

Lawrence D'Oliveiro wrote:

On Wed, 1 May 2024 09:02:22 -0000 (UTC), Thomas Koenig wrote:

Hmm... what sort of terminals and character sets did people use on a
PDP-10? 7-bit ASCII? It (and the PDP-6) were released before the ASCII
standard came out.

A bit before my time, but I recall terms like “SIXBIT” encoding from looking at docs. Also this weird thing called “Radix-50” (the “50” actually being octal for 40 decimal) did persist into PDP-11 days, when I came along. It was a way of packing 3 characters (from a limited set, of course) into 2 bytes.

Radix 40 needs 64000 values (40^3) to hold 3 characters from a set like
[' ',0-9,A-Z,_,-,=] (pick any three characters you want for those last
slots), which fits in 16 bits. It matches perfectly the classic 6.3
filename convention where names are limited to 6 characters, an (implied)
period and a 3-character extension/file type.

The 3-char to 2-byte packing was of course easy(*), while unpacking is a
bit harder if you don't want to use div/mod operations. I strongly
suspect that the file system designers would do searches for a
particular extension by first packing the extension and then searching for
the resulting packed 16-bit value, instead of unpacking each stored
extension into the 3-char result.

(*)
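/* table[] maps a character to its code 0..39; the result needs 16 bits */
unsigned short pack3(const unsigned char *ext) {
    unsigned a = table[ext[0]], b = table[ext[1]], c = table[ext[2]];
    return (unsigned short)(a + b*40 + c*1600);
}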

b*40 = (b*5)<<3
or
b*40 = (b<<5)+(b<<3)
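
A self-contained sketch of the pack/unpack pair follows. Assumptions: the 40-symbol alphabet string and its ordering are made up for illustration (this is not DEC's actual RADIX-50 table), the digit weighting follows the formula above (first character least significant), and the names code_of and unpack3 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 40-symbol alphabet; a character's index is its code. */
static const char ALPHABET[] = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_-=";

/* Map a character to its code 0..39, or -1 if it is not in the set. */
int code_of(char ch)
{
    const char *p = (ch != '\0') ? strchr(ALPHABET, ch) : NULL;
    return p ? (int)(p - ALPHABET) : -1;
}

/* Pack three characters into a 16-bit value. s must point to at least
   three characters. Returns -1 if any character is outside the set. */
int32_t pack3(const char *s)
{
    int a = code_of(s[0]), b = code_of(s[1]), c = code_of(s[2]);
    if (a < 0 || b < 0 || c < 0)
        return -1;
    return a + b * 40 + c * 1600;    /* at most 39 + 1560 + 62400 = 63999 */
}

/* Unpack with div/mod; out must have room for 4 bytes. */
void unpack3(uint16_t packed, char *out)
{
    out[0] = ALPHABET[packed % 40];
    out[1] = ALPHABET[(packed / 40) % 40];
    out[2] = ALPHABET[(packed / 1600) % 40];
    out[3] = '\0';
}

void radix40_demo(void)
{
    char buf[4];
    assert(pack3("TXT") == 49390);   /* 30 + 34*40 + 30*1600 */
    unpack3(49390, buf);
    assert(strcmp(buf, "TXT") == 0);
}
```

Searching a directory for one extension then needs a single pack3 call on the search key followed by 16-bit compares, with no unpacking of the stored entries.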

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri May 3 08:51:30 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

... MIPS has left the general-purpose computing field.

Not so sure that it has. I think the Chinese “LoongArch” machines are a
MIPS derivative.

They may have started with MIPS, like several others, but now they are LoongArch. Looking in <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:

|LoongArch bit designations are always little-endian.

Also, if you want to think of “MIPS” as a corporate entity, that would be
the company currently known as “Imagination Technologies”. It is true they
have given up on the MIPS architecture

That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bernd Linsel@21:1/5 to All on Fri May 3 15:29:04 2024

On 03.05.24 10:51, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

... MIPS has left the general-purpose computing field.

Not so sure that it has. I think the Chinese “LoongArch” machines are a
MIPS derivative.

They may have started with MIPS, like several others, but now they are
LoongArch. Looking in
<https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:

|LoongArch bit designations are always little-endian.

Also, if you want to think of “MIPS” as a corporate entity, that would be
the company currently known as “Imagination Technologies”. It is true they
have given up on the MIPS architecture

That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.

- anton

MIPS32 is still used in Microchip's PIC32 microcontroller series.

--
Bernd Linsel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bernd Linsel@21:1/5 to All on Fri May 3 17:07:25 2024

On 03.05.24 10:51, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

Also, if you want to think of “MIPS” as a corporate entity, that would be
the company currently known as “Imagination Technologies”. It is true they
have given up on the MIPS architecture

That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.

- anton

MIPS32 is still used in Microchip's PIC32 microcontroller series.

--
Bernd Linsel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Fri May 3 17:40:20 2024

On Fri, 03 May 2024 08:51:30 GMT
[email protected] (Anton Ertl) wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

... MIPS has left the general-purpose computing field.

Not so sure that it has. I think the Chinese “LoongArch”
machines are a MIPS derivative.

They may have started with MIPS, like several others, but now they are LoongArch. Looking in <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:

|LoongArch bit designations are always little-endian.

Also, if you want to think of “MIPS” as a corporate entity, that
would be the company currently known as “Imagination
Technologies”. It is true they have given up on the MIPS
architecture

That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.

- anton

My impression was that embedded MIPS had two main players behind it:
- Microchip on the low end. Measured on Arm scale from about Cortex-M3
class to Cortex-M7 class.
- Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.

Microchip will continue to sell it for a decade at least. Microchip does
not tend to talk openly about directions, however their behavior shows
that their direction right now is away from MIPS and currently toward
Arm.

Cavium was absorbed by Marvell several years ago. Marvell, like
Microchip, does not tend to talk openly about directions. But when
Cavium was still independent, they did say that all new development
would be Arm. Since Cavium's market (high-end network equipment) is less
conservative and more fashion-driven, it probably means that they have
had no new MIPS customers for almost a decade and that old customers
likely buy much less as well.

As far as I am concerned, it's a pity, because I find MIPS's latest ISA
(nanoMIPS) very interesting and probably quite practical.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Fri May 3 15:02:16 2024

Michael S <[email protected]> writes:

In the world of general-purpose microprocessor, DEC Alpha (until EV6)
was more like word-addressable than byte-addressable, although it is a
matter of point of view.

No, Alpha has had byte addresses from the start, and that made it easy
to add the BWX instructions in EV56.

What its EV4 and EV5 implementations do not have is instructions for
*accessing* bytes and (PDP-11) words in memory, but that's completely
different from a word-addressed machine. When you add 1 to an address
on the Alpha, you get the address of the next byte. When you do the
same on a word-addressed machine, you get the address of the next
word.
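
The distinction can be sketched in C: byte addresses, but memory that is only read in aligned 64-bit units. The sketch is loosely modelled on the pre-BWX Alpha idiom of a quadword load followed by a byte extract (LDQ_U plus EXTBL); the function names are hypothetical and little-endian byte numbering within the word is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Memory that can only be read in aligned 64-bit units. The address is
   still a byte address; the low 3 bits are simply ignored by the load. */
uint64_t load_aligned_word(const uint64_t *mem, uint64_t byte_addr)
{
    return mem[byte_addr >> 3];
}

/* Extract one byte from a loaded word; the low 3 address bits select
   which byte (little-endian numbering within the word). */
uint8_t extract_byte(uint64_t word, uint64_t byte_addr)
{
    return (uint8_t)(word >> (8 * (byte_addr & 7)));
}

/* A byte load synthesised from a word load plus an extract. */
uint8_t load_byte(const uint64_t *mem, uint64_t byte_addr)
{
    return extract_byte(load_aligned_word(mem, byte_addr), byte_addr);
}

void byte_access_demo(void)
{
    const uint64_t mem[2] = { 0x0807060504030201ull, 0x100F0E0D0C0B0A09ull };
    assert(load_byte(mem, 7) == 0x08);
    assert(load_byte(mem, 8) == 0x09);   /* address + 1 is the next byte */
}
```

Adding 1 to the address moves to the next byte even across a word boundary; on a word-addressed machine the same increment would skip a whole word.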

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri May 3 15:13:30 2024

Lawrence D'Oliveiro <[email protected]d> writes:

Why was byte addressing invented? I think it was for easy handling of
strings and other binary data.

Yes, the S/360 was intended to succeed both IBM's word-addressed
scientific line (such as the IBM 7094) and its character/digit-serial commercial lines such as the 7080 and the 1401. Combining byte
addressing with a fixed word size provided both.

The "360" refers to the full circle (an idea that IBM marketing
promptly put aside when they introduced the S/370 line).

But why stop there?

Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to
go beyond byte addressing. And looking at history, this seems to have
been the right choice.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Fri May 3 17:51:17 2024

On 03/05/2024 16:40, Michael S wrote:

On Fri, 03 May 2024 08:51:30 GMT
[email protected] (Anton Ertl) wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:

... MIPS has left the general-purpose computing field.

Not so sure that it has. I think the Chinese “LoongArch”
machines are a MIPS derivative.

They may have started with MIPS, like several others, but now they are
LoongArch. Looking in
<https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:

|LoongArch bit designations are always little-endian.

Also, if you want to think of “MIPS” as a corporate entity, that
would be the company currently known as “Imagination
Technologies”. It is true they have given up on the MIPS
architecture

That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.

- anton

My impression was that embedded MIPS had two main players behind it:
- Microchip on the low end. Measured on Arm scale from about Cortex-M3
class to Cortex-M7 class.
- Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.

Microchip will continue to sell it for a decade at least. Microchip does
not tend to talk openly about directions, however their behavior shows
that their direction right now is away from MIPS and currently toward
Arm.

Microchip are good at continuing to produce old devices. But as you
say, they have moved to ARM for 32-bit.

Basically, Microchip managed to ruin embedded MIPS as a choice of
processor core. They used a four-pronged attack here :

1. They picked an older MIPS core for their first PIC32 line, rather
than the newer ones that more directly competed with microcontroller ARM
cores of the time, thus ensuring that their microcontroller would not be
power or performance competitive.

2. They made serious hardware errors in the first chips. A big
marketing feature of the PIC32 was that it supported 480 Mbps USB - but
it did not, and it took a very long time to make a fixed version. In
the meantime, the chip was still advertised as being the only available
microcontroller with 480 Mbps USB on chip, with an erratum saying "reduce
USB to 12 Mbps" as a "workaround" for the problem. This helped the
PIC32 gain a reputation as a broken and poor-quality device, which
reflected (unfairly) on the core.

3. They called it "PIC32". If you are familiar with the PIC series, you
know they have their good points - they are very robust and reliable microcontrollers (the PIC32 was the exception here), available for
decades in hobby-friendly packages. And they also have the most
brain-dead processor core known to man, making the 8051 pleasant in
comparison, combined with some of the worst quality and buggiest
compilers ever written and sold at ridiculously high prices. Thus
anyone familiar with Microchip PIC devices (most small-systems embedded developers) and unfamiliar with MIPS (most small-systems embedded
developers) would assume that the PIC32 core would be horrible to work
with and almost impossible to program in reasonable standard C, with the
"32" referring to some random part of the architecture rather than the processor width.

4. They set themselves against the open source development tools
community by packaging a modified GCC as though it were /their/
compiler. Every indication that it was not made by Microchip themselves
was hidden in the tiniest of small print. The library that they
provided was licensed as strictly as their lawyers could manage - you
were not allowed to use it with development tools other than the
binaries provided by Microchip. (You /could/, in theory, get their
modified GCC source for the compiler - but only at extreme effort. It
was not quite at the point of delivering the source on open reel tape,
but not far off it.) The modifications that Microchip made to GCC were
to disable any kind of optimisation unless you had bought the amazingly expensive version of the development tool license from them. So most
people had to use the devices with no optimisation at all.

At the same time, people were using better ARM cores with full
optimisations. In practice, ARM cores were 10-20 times faster than MIPS appeared to be, thanks to Microchip. It is no wonder they never caught on.

I'm sure there are other reasons why MIPS failed, despite having cores
that were comparable to or better than ARM for small-systems embedded
devices. But Microchip has to take a large chunk of the blame, IMHO.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Fri May 3 18:42:29 2024

Lawrence D'Oliveiro wrote:

move.l a, b
move.b b, c

This is the same mistake that Brooks and Blaauw made, so invested in
your familiar byte order that you imagine that normal differences of
the other are somehow wrong.

Here's a concrete example on S/360.

L R,100
STH R,200

That does a four byte load of location 100 into a register, and then
a two byte halfword store into 200. The load gets bytes 100 through 103
with 100 going into the high byte of the register. The store puts its
values into bytes 200 and 201. Since it's the low half of the register,
the new contents of 200 and 201 are the old contents of 102 and 103.
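
The L/STH sequence can be modelled in a few lines of C. This is a sketch, not S/360 semantics in full: memory is a plain byte array, the helper names load_word_be and store_half_be are made up, and the byte values are arbitrary test data.

```c
#include <assert.h>
#include <stdint.h>

/* Big-endian 4-byte load: the byte at the lowest address becomes the
   high byte of the register (the effect of L R,addr). */
uint32_t load_word_be(const uint8_t *mem, unsigned addr)
{
    return ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
           ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
}

/* Big-endian 2-byte store of the low half of the register
   (the effect of STH R,addr). */
void store_half_be(uint8_t *mem, unsigned addr, uint32_t reg)
{
    mem[addr]     = (uint8_t)(reg >> 8);
    mem[addr + 1] = (uint8_t)reg;
}

void s360_demo(void)
{
    uint8_t mem[256] = { 0 };
    mem[100] = 0xAA; mem[101] = 0xBB; mem[102] = 0xCC; mem[103] = 0xDD;

    uint32_t r = load_word_be(mem, 100);   /* L   R,100 */
    store_half_be(mem, 200, r);            /* STH R,200 */

    assert(r == 0xAABBCCDDu);
    assert(mem[200] == 0xCC && mem[201] == 0xDD);  /* old 102 and 103 */
}
```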

Before anyone says aha, that's surprising or wrong: no, it's not. It's
the way big-endian addressing works, and it would be surprising and
wrong if it did anything else. If we wanted to put the contents of 100
and 101 into 200 and 201, we'd have done something else, maybe this on
S/370 and later to explicitly store the high two bytes of the word:

L R,100
STCM R,12,200

or just move the two bytes directly

MVC 200(2),100

I have written assembler code for S/360, PDP-11, Vax, ROMP, 8086/286
and more machines using both byte orders than I can remember, so I'm
speaking from experience here, not guessing.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Fri May 3 19:04:19 2024

Lawrence D'Oliveiro wrote:

On Thu, 2 May 2024 18:33:48 +0000, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

move.l a, b
move.b b, c

May I suggest that the above ILLUSTRATES why someone wants to use LD and
ST instructions rather than directionless MOV instructions.

OK, use explicit load/store instead of generic move:

register-memory-register:

store.l a, b
load.b b, c

memory-register-memory:

load.l a, b
store.b b, c

Do you see why this makes absolutely no difference to what happens, as per
my description earlier?

Yes, because you explicitly left out the syntactic sugar.

Try::

STD R7,[IP,#192]
LDSB R8,[SP,#32]

See, by having the syntactic sugar to identify which is the register
and which is the address and what direction the data is traveling,
all the confusion goes away.

The OpCode tells the direction: LD is inbound, ST is outbound.
The operand with the 'R' is the register
The operand with the '[' and ']' is the address.

By the way, in case it wasn’t clear: in my examples, the destination operand is always the last one.

My preference is that the address operands are always in the same spot in
the instruction, and that the destination register is the receiver of a
LD and the sender of the ST.

And secondly, the destination is written like one writes assignments::

R9 = memory( pointer, index, offset );
or
R8 = R8 + #32

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Savard on Fri May 3 22:26:04 2024

On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:

To me, it just made sense that, since registers contain quantities, if
you load the value "8" into a register, it will contain the number 8.

So in a byte operation, the least significant bits of the register are
used.

Of course that makes sense.

Now, think of main memory as just a holding place for stuff that won’t fit
in registers: why shouldn’t it make sense there as well?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Fri May 3 22:24:46 2024

On Fri, 3 May 2024 19:04:19 +0000, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

Do you see why this makes absolutely no difference to what happens, as
per my description earlier?

Yes, because you explicitly left out the syntactic sugar.

None of which makes any difference to the point: even on a big-endian architecture, registers are still effectively little-endian!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 22:28:45 2024

On Fri, 03 May 2024 08:51:30 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Also, if you want to think of “MIPS” as a corporate entity, that would
be the company currently known as “Imagination Technologies”. It is true
they have given up on the MIPS architecture

That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.

I think it still is, it just isn’t bringing in money for “MIPS IP” any more.

Last I heard, unit shipments for the top 3 architectures were:

ARM -- around 10 billion per year
RISC-V -- now in the billions, too
MIPS -- something like 840 million per year

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri May 3 22:32:22 2024

On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

But why stop there?

Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.

That applied back in history, when we had fewer addressing bits to play
with, what about now?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Levine@21:1/5 to All on Sat May 4 02:00:33 2024

According to Lawrence D'Oliveiro <[email protected]d>:

Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.

That applied back in history, when we had fewer addressing bits to play
with, what about now?

What applications do you think would work better with bit addressing?

I can think of some kinds of data compression that use variable sized
bit fields, and I suppose graphics rendering although these days it's
rare to find a display without at least 8 bits per pixel and in any
event, most displays have GPUs nearby to do the rendering.

Compare that to all the other stuff for which bit addressing would just
be extra baggage. Where's the benefit?

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to BGB on Sat May 4 06:44:11 2024

On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

Not a huge use-case in graphics, as noted, in most cases this is done
with 16 or 32 bit pixels; and bit-plane graphics are long since dead.

What happens if we go beyond 32 bits? For example, hardware might support
10 bits per pixel component.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From [email protected]@21:1/5 to Stefan Monnier on Sat May 4 09:40:45 2024

Stefan Monnier <[email protected]> wrote:

On "personal" computers ... there's been work instead on compressing
64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome
versions) ...

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64

Indeed, but I got the impression that there is a bit of a revival of
interest for pointer compression as the evidence seems to point to RAM
sizes not increasing very much any more on "end user devices".

See for instance https://v8.dev/blog/pointer-compression

Note also that this is targeted at JavaScript: dynamically typed
languages tend to suffer more from the 64bit bloat because of their
use of "boxing", meaning that pretty much everything (except usually
for strings and arrays of floats, which are special-cased) doubles
in size when the "box" size is changed from 32bit to 64bit.

We've used compressed 32-bit pointers in Java for more than a decade
now. Every object in the Java VM is 8-aligned, so a 32-bit-wide
aligned pointer gets you access to 32G of addressable application
memory.
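
The arithmetic behind the 32G figure can be sketched in C. This is not the JVM's implementation; it only shows the shift, assuming references are stored as byte offsets from a heap base (the base itself is left out) and using the hypothetical names compress and decompress.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_SHIFT 3u   /* objects are 8-byte aligned: low 3 bits are zero */

/* Compress an 8-aligned heap offset (bytes from the heap base) into 32 bits. */
uint32_t compress(uint64_t offset)
{
    return (uint32_t)(offset >> ALIGN_SHIFT);
}

/* Expand a 32-bit compressed reference back into a byte offset. */
uint64_t decompress(uint32_t ref)
{
    return (uint64_t)ref << ALIGN_SHIFT;
}

void compressed_pointer_demo(void)
{
    /* The largest reference reaches the last 8-byte slot below 2^35 bytes,
       i.e. 32 GiB of addressable heap. */
    assert(decompress(0xFFFFFFFFu) + 8 == (1ull << 35));
    assert(decompress(compress(0x7FFFFFFF8ull)) == 0x7FFFFFFF8ull);
}
```

The three address bits that alignment guarantees to be zero are exactly what buys the factor of 8 over a plain 32-bit pointer.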

This is a win, not just for saving storage but improving performance.
Java applications are often memory-bandwidth limited, so memory
efficiency is a pretty good proxy for performance. The less memory you
use, the more customers you can serve.

Andrew.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sat May 4 09:11:27 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

But why stop there?

Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.

That applied back in history, when we had fewer addressing bits to play
with, what about now?

Byte addressing still seems to be the right choice, for the same
reasons: We have lots of string data, and data that needs larger
units, but for data that fits in smaller units

a) either there is so little that spending a full byte on it is good
enough, or

b) the data is handled by so little code that the burden from the lack
of bit addressing is relatively low in the overall scheme of things, or

c) programs deal with arrays of these things in a SIMD way, and bit
addressing provides little to no benefit.

For case b), we deal with bits or bit fields in much the same way that the
word-addressed machines of the old days dealt with characters. I
guess that there were people who considered byte addressing just as
unnecessary as most of us consider bit addressing, so what is the
difference?
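
Case b) typically looks something like the following on a byte-addressed machine: a bit address is split into a byte address and a bit offset with a shift and a mask. This is a sketch only; the function name read_bits and the LSB-first bit order within each byte are assumptions, and real compression formats differ in bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read `width` bits (1..32) starting at absolute bit offset `bitpos`,
   least significant bit first within each byte. The caller must ensure
   the field lies entirely inside the buffer. */
uint32_t read_bits(const uint8_t *buf, size_t bitpos, unsigned width)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < width; i++) {
        size_t   byte = (bitpos + i) >> 3;               /* byte address */
        unsigned bit  = (unsigned)((bitpos + i) & 7);    /* bit in byte */
        result |= (uint32_t)((buf[byte] >> bit) & 1u) << i;
    }
    return result;
}

void read_bits_demo(void)
{
    const uint8_t buf[2] = { 0xB4, 0x01 };
    assert(read_bits(buf, 2, 3) == 5u);   /* bits 2..4 of the first byte */
    assert(read_bits(buf, 6, 3) == 6u);   /* field straddles the byte boundary */
}
```

The loop is deliberately bit-at-a-time for clarity; production code would usually fetch a wider unit and shift once, which is the same trick word-addressed machines used for characters.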

Apparently in the number of use cases: Byte addressing eventually won:
IBM switched to it with the S/360, DEC with the PDP-11, the successful
16-bit (and later 32-bit) microprocessors supported it, while the word-addressed machines were less successful and eventually vanished
in niches.

David Ungar's PhD thesis was on SOAR (aka RISC-IV), which was either word-addressed or (like Alpha) word-accessed; in one of the last
chapters of his thesis he wrote that the most beneficial feature for performance that SOAR did not have was byte accesses, which would have
reduced the number of cycles by IIRC 10% (to be balanced against
potential negative effects on the cycle-time); I found that quite
surprising for a thesis that mainly focussed on architectural features
for Smalltalk execution.

By contrast, there were two well-known cases of bit-addressed
machines: The IBM Stretch and the Intel iAPX 432, both of which failed
to achieve their performance goals and which did not succeed in the
market. I guess that this is not due to bit-addressing only, but that bit-addressing is a symptom of the feature creep that doomed these
projects. More focussed projects usually did not add bit addressing.

I expect that various architects of from-scratch projects have looked
at the question, and most concluded that bit-addressing provided not
enough benefits to justify the cost. And those bit-addressed
architectures that were introduced did not become great hits, unlike
the S/360 and the PDP-11.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Sat May 4 10:18:29 2024

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64 instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.


From Scott Lurndal@21:1/5 to Michael S on Sat May 4 15:18:37 2024

Michael S <[email protected]> writes:

On Fri, 03 May 2024 08:51:30 GMT
[email protected] (Anton Ertl) wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:
... MIPS has left the general-purpose computing field.

Not so sure that it has. I think the Chinese “LoongArch”
machines are a MIPS derivative.

They may have started with MIPS, like several others, but now they are
LoongArch. Looking in
<https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:

|LoongArch bit designations are always little-endian.

Also, if you want to think of “MIPS” as a corporate entity, that
would be the company currently known as “Imagination
Technologies”. It is true they have given up on the MIPS
architecture

That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.

- anton

My impression was that embedded MIPS had two main players behind it:
- Microchip on the low end. Measured on Arm scale from about Cortex-M3
class to Cortex-M7 class.
- Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.

The last Cavium MIPS core (Octeon 7800) taped out well over a decade
ago.

Cavium was absorbed by Marvell several years ago. Marvell, like
Microchip, does not tend to talk openly about directions. But when
Cavium was still independent, they did say that all new development
would be Arm.

There are three generations of ARM cores produced by Cavium/Marvell:
ThunderX, Octeon9 and Octeon10.

https://www.servethehome.com/marvell-octeon-10-arm-neoverse-n2-dpu-in-the-wild-rivaling-2017-era-intel-xeon/

As far as I am concerned, it's a pity, because I find MIPS' latest ISA
(nanoMIPS) very interesting and probably quite practical.

Personally I prefer ARM64 architecture over MIPS64 by a considerable margin,
in almost all respects (and I worked at SGI for a number of years in the R10k days).


From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Sat May 4 15:19:36 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

But why stop there?

Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.

That applied back in history, when we had fewer addressing bits to play
with, what about now?

There is still no reason to leverage those bits for sub-byte addressing.


From Scott Lurndal@21:1/5 to Anton Ertl on Sat May 4 15:21:04 2024

[email protected] (Anton Ertl) writes:

Lawrence D'Oliveiro <[email protected]d> writes:

On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

But why stop there?

Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.

That applied back in history, when we had fewer addressing bits to play
with, what about now?

Byte addressing still seems to be the right choice, for the same
reasons: We have lots of string data, and data that needs larger
units, but for data that fits in smaller units

a) either there is so little that spending a full byte on it is good
enough, or

b) the data is handled by so little code that the burden from the lack
of bit addressing is relatively low in the overall scheme of things, or

c) programs deal with arrays of these things in a SIMD way, and bit
addressing provides little to no benefit.

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.
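
For illustration, here is roughly what such insert/extract instructions compute, sketched in Python with plain shift-and-mask arithmetic. The function names are made up for this example and are not taken from any ISA manual; the sketch assumes unsigned fields and bit 0 as the least significant bit.

```python
def bf_extract(word: int, lsb: int, width: int) -> int:
    """Return the unsigned `width`-bit field of `word` that starts at bit `lsb`."""
    mask = (1 << width) - 1
    return (word >> lsb) & mask


def bf_insert(word: int, value: int, lsb: int, width: int) -> int:
    """Return `word` with the field at `lsb` replaced by the low bits of `value`."""
    mask = (1 << width) - 1
    return (word & ~(mask << lsb)) | ((value & mask) << lsb)


w = 0xDEADBEEF
field = bf_extract(w, 8, 12)        # bits 8..19 of w
w2 = bf_insert(w, 0x123, 8, 12)     # same word with that field overwritten
```

No address ever has to name a bit here: the containing word is byte-addressed, and the field position is an operand of the operation.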


From Michael S@21:1/5 to Scott Lurndal on Sat May 4 21:56:00 2024

On Sat, 04 May 2024 15:18:37 GMT
[email protected] (Scott Lurndal) wrote:

Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects (and I worked at SGI for a number of
years in the R10k days).

I also prefer ARM64 over MIPS64.
But nanoMIPS is not MIPS64, it's a new architecture that, at least
according to my measurements, is head and shoulders above any
competitors in terms of code density.
Even MIPSr6 is enough of a divergence from previous releases of MIPS64 to
be considered a new architecture, but nanoMIPS is an order of magnitude
bigger change than that.


From John Levine@21:1/5 to All on Sat May 4 19:31:54 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

Not a huge use-case in graphics, as noted, in most cases this is done
with 16 or 32 bit pixels; and bit-plane graphics are long since dead.

What happens if we go beyond 32 bits? For example, hardware might support
10 bits per pixel component.

I dunno about you but I would align the elements on two-byte
boundaries and only store the high 10 of the 16 bits. It's not like
we're short of address space, and it's a lot quicker to multiply and
divide by 2 or 16 than by 10.
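
A small sketch of the two layouts being compared, assuming little-endian storage and unsigned 10-bit samples (the helper names are hypothetical). The first pair keeps each sample in the high 10 bits of its own 16-bit unit; the second pair packs samples densely, so locating sample i needs the bit offset 10*i.

```python
import struct


def pack_high10(samples):
    """Store each 10-bit sample in the high 10 bits of a 16-bit unit."""
    return b"".join(struct.pack("<H", (s & 0x3FF) << 6) for s in samples)


def load_high10(buf, i):
    """Sample i lives at byte offset 2*i: one shift to index, one to extract."""
    (raw,) = struct.unpack_from("<H", buf, 2 * i)
    return raw >> 6


def pack_dense10(samples):
    """Pack 10-bit samples back to back with no padding bits."""
    acc = 0
    for i, s in enumerate(samples):
        acc |= (s & 0x3FF) << (10 * i)
    return acc.to_bytes((10 * len(samples) + 7) // 8, "little")


def load_dense10(buf, i):
    """Sample i starts at bit 10*i; split that into a byte offset and a shift."""
    byte_no, shift = divmod(10 * i, 8)
    window = int.from_bytes(buf[byte_no:byte_no + 2], "little")
    return (window >> shift) & 0x3FF


samples = [0, 1, 1023, 512, 77]
aligned = pack_high10(samples)   # 10 bytes
dense = pack_dense10(samples)    # 7 bytes
```

The aligned form spends 6 bits per sample to make indexing trivial; the dense form saves space at the cost of the divmod and the masking.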

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Michael S@21:1/5 to John Levine on Sat May 4 22:56:19 2024

On Sat, 4 May 2024 19:31:54 -0000 (UTC)
John Levine <[email protected]> wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

Not a huge use-case in graphics, as noted, in most cases this is
done with 16 or 32 bit pixels; and bit-plane graphics are long
since dead.

What happens if we go beyond 32 bits? For example, hardware might
support 10 bits per pixel component.

I dunno about you but I would align the elements on two-byte
boundaries and only store the high 10 of the 16 bits. It's not like
we're short of address space, and it's a lot quicker to multiply and
divide by 2 or 16 than by 10.

I agree about the preferable solution and simplicity, but not about the
last part.
Multiplication by 10 is only very slightly slower than multiplication
by 2 or 16, and the difference shouldn't be noticeable by comparison with
other things that we want to do with a pixel.
On x86/AMD64, multiplication by 2 is, depending on the situation, zero or
1 instruction, multiplication by 16 is 1 instruction (SHL) and
multiplication by 10 is either 1 instruction (IMUL) or two simpler
instructions (LEA+ADD).
On Arm and aarch64 it's approximately the same, except that there are
situations in which multiplication by 16 is zero instructions.


From MitchAlsup1@21:1/5 to Michael S on Sat May 4 21:08:19 2024

Michael S wrote:

On Sat, 4 May 2024 19:31:54 -0000 (UTC)
John Levine <[email protected]> wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

Not a huge use-case in graphics, as noted, in most cases this is
done with 16 or 32 bit pixels; and bit-plane graphics are long
since dead.

What happens if we go beyond 32 bits? For example, hardware might
support 10 bits per pixel component.

I dunno about you but I would align the elements on two-byte
boundaries and only store the high 10 of the 16 bits. It's not like
we're short of address space, and it's a lot quicker to multiply and
divide by 2 or 16 than by 10.

I agree about preferable solution and simplicity, but not about last
part.

Multiplication by 10 is only very slightly slower than multiplication
by 2 or 16 and the difference shouldn't be noticeable by comparison with
other things that we want to do with pixel.

Multiplication by 10 used to index an array is not slower than a
multiplication by 16 (when the ISA is not brain dead)::

LEA Ri,[Ri,Ri<<3]
LD Rd,[Rp,Ri]

Compared to::

SL Ri,Ri,#4
LD Rd,[Rp,Ri]

{{Brain dead ISAs need not apply}}

On x86/AMD64 - multiplication by 2 is, depending on situation, zero or
1 instruction, multiplication by 16 is 1 instruction (SHL) and
multiplication by 10 is either 1 instruction (IMUL) or two simpler
instructions (LEA+ADD).

Many times the ADD can be folded into a memory reference as illustrated
above.

On Arm and aarch64 it's approximately the same except that there are situations in which multiplication by 16 is zero instructions.


From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun May 5 00:12:52 2024

Chris M. Thomasson wrote:

On 5/4/2024 3:18 AM, Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.

Storing meta data in actual pointers, aka aligned on a larger boundary,
is critical to many advanced lock/wait free algorithms as well. I
remember storing an actual reference count in pointers before for a
special type of counting.

Even if one has multi-location ATOMICs ?? (as a single event ??)


From MitchAlsup1@21:1/5 to BGB on Sun May 5 00:11:24 2024

BGB wrote:

On 5/4/2024 1:44 AM, Lawrence D'Oliveiro wrote:

On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

Not a huge use-case in graphics, as noted, in most cases this is done
with 16 or 32 bit pixels; and bit-plane graphics are long since dead.

What happens if we go beyond 32 bits? For example, hardware might support
10 bits per pixel component.

A few typical formats:
RGB555: 0rrrrrgg-gggbbbbb
RGBA32: aaaaaaaa-rrrrrrrr-gggggggg-bbbbbbbb
RGB30 : 00rrrrrr-rrrrgggg-ggggggbb-bbbbbbbb (10-bit component RGB)

Though, for RGB30, there are variants with 10-bit linear RGB, and E5.F5 floating-point (sometimes used for HDR in OpenGL, as opposed to 4x
Binary16).

None of these would really benefit from bit addressable memory though.

Nor are they serviced by any SIMD ISA.

Though, for LDR, going beyond 8-bit color depth doesn't gain much even
if the monitor supports it natively. And had noted before when using a
cheap LCD TV as a monitor, that it only seemed to be working at a
roughly 6-bit color depth (like, it was seemingly slightly better than RGB555, but not by much).

Most people's eyes cannot even see the difference unless it is pointed
out to them.

Now I am using a 4K OLED, which does support 10b/component, but it
doesn't make much difference in practice (and even if it did, most
software won't make much use of it).

But, say, 5 to 8 bits per component is at least noticeable (better
colors and less banding artifacts), 8 to 10 bits, not so much. Though,
with the main exception being HDR (but then, over the 0.5 to 1.0 range,
E5.F5 is only about as accurate as a 6-bit component).

Posterization is still a problem at 8-bits.


From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun May 5 00:19:34 2024

On Thu, 2 May 2024 11:52:56 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

Consider this pseudo-assembly-language sequence:

move.l a, b
move.b b, c
...
Now the question is: which byte from “a” ends up at location “c”?

On S/360, which is the ur-big-endian machine, memory to memory moves are different from register loads and stores.

Hint: in the register-memory-register case, you would do an MVC followed
by LOAD. In the memory-register-memory case, it would be LOAD followed by
MVC.

Does that put it in System/360 terms you can understand?


From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun May 5 00:21:42 2024

On Fri, 3 May 2024 18:42:29 -0000 (UTC), John Levine wrote:

Lawrence D'Oliveiro wrote:

move.l a, b
move.b b, c

Here's a concrete example on S/360.

L R,100
STH R,200

That does a four byte load of location 100 into a register, and then a
two byte halfword store into 200. The load gets bytes 100 through 103
with 100 going into the high byte of the register. The store puts its
values into bytes 200 and 201. Since it's the low half of the register,
the new contents of 200 and 201 are the old contents of 102 and 103.

So using the same register name to address a halfword gives you the low
half of the register, not the high half?

Whereas using the same memory address to address a halfword gives you the
high half of the word at that location, not the low half?


From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun May 5 00:25:17 2024

On Thu, 2 May 2024 12:00:40 -0000 (UTC), John Levine wrote:

... it is easy to construct examples that appear to make your less
favored option look wrong ...

Here is the issue: we have three different quantities needing numbering.

* Bit places within an integer
* Bit numbers within a bit field
* Byte numbers within a multibyte integer (offsets from the base address)

In little-endian, it is easy to relate all these three as follows:

bit place within integer = bit number within bit field =
byte number * 8 + bit within byte

There is no correspondingly simple formula for any big-endian
architecture.
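
To make the formula concrete, here is a sketch in Python that assumes bits are numbered from the least significant end (bit 0 = LSB). Under that numbering the little-endian lookup needs only the bit number, while the big-endian lookup also needs the size of the integer. A machine that numbers bits from the most significant end would see this differently; the helper names are invented for this example.

```python
def bit_of_integer(value: int, bitno: int) -> int:
    """Bit `bitno` of the integer itself (bit 0 = least significant)."""
    return (value >> bitno) & 1


def bit_in_le_memory(value: int, nbytes: int, bitno: int) -> int:
    """Same bit, found via the little-endian byte image: byte = bitno // 8."""
    mem = value.to_bytes(nbytes, "little")
    byte_no, bit_in_byte = divmod(bitno, 8)
    return (mem[byte_no] >> bit_in_byte) & 1


def bit_in_be_memory(value: int, nbytes: int, bitno: int) -> int:
    """Same bit via the big-endian byte image: the byte number depends on nbytes."""
    mem = value.to_bytes(nbytes, "big")
    byte_no = nbytes - 1 - bitno // 8
    return (mem[byte_no] >> (bitno % 8)) & 1
```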


From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Sun May 5 00:26:49 2024

On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects ...

I know MIPS (like SPARC) originated in that brief window when it was
thought that delayed branches were a good idea, and so it remained saddled
with that (mis)feature for the rest of its life.


From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun May 5 04:12:49 2024

On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

Personally I prefer ARM64 architecture over MIPS64 by a considerable margin, in almost all respects ...

I know MIPS (like SPARC) originated in that brief window when it was
thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.

Delay slot was deprecated back in MIPSr6, almost a decade ago.


From John Levine@21:1/5 to All on Sun May 5 01:33:39 2024

According to Lawrence D'Oliveiro <[email protected]d>:

So using the same register name to address a halfword gives you the low
half of the register, not the high half?

Whereas using the same memory address to address a halfword gives you the
high half of the word at that location, not the low half?

For anyone familiar with big-endian addressing, those would both be
obviously correct.

Perhaps this would be a good time to stop digging.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun May 5 01:49:52 2024

Lawrence D'Oliveiro wrote:

On Fri, 3 May 2024 18:42:29 -0000 (UTC), John Levine wrote:

Lawrence D'Oliveiro wrote:

move.l a, b
move.b b, c

Here's a concrete example on S/360.

L R,100
STH R,200

That does a four byte load of location 100 into a register, and then a
two byte halfword store into 200. The load gets bytes 100 through 103
with 100 going into the high byte of the register. The store puts its
values into bytes 200 and 201. Since it's the low half of the register,
the new contents of 200 and 201 are the old contents of 102 and 103.

So using the same register name to address a halfword gives you the low
half of the register, not the high half?

Whereas using the same memory address to address a halfword gives you the
high half of the word at that location, not the low half?

Concrete example::

say location 100:103 contain 0xDEADBEAF

LD R,100

R contains 0xDEADBEAF

STH R,200

Location 200:201 contain 0xBEAF

Whereas::

LH R,100

R contains 0xDEAD

And nobody who understands BE would even question this functionality.
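
The same example as a small simulation, for anyone who wants to poke at it. This is only a sketch: a Python bytearray stands in for memory, and the three helpers are hypothetical stand-ins for a big-endian word load, halfword load and halfword store.

```python
mem = bytearray(256)
mem[100:104] = bytes.fromhex("DEADBEAF")   # bytes 100..103


def load_word(addr: int) -> int:
    """Big-endian 4-byte load: the byte at `addr` becomes the high byte."""
    return int.from_bytes(mem[addr:addr + 4], "big")


def load_half(addr: int) -> int:
    """Big-endian 2-byte load of the halfword that starts at `addr`."""
    return int.from_bytes(mem[addr:addr + 2], "big")


def store_half(reg: int, addr: int) -> None:
    """Store the low 16 bits of the register as a big-endian halfword."""
    mem[addr:addr + 2] = (reg & 0xFFFF).to_bytes(2, "big")


r = load_word(100)       # register holds 0xDEADBEAF
store_half(r, 200)       # bytes 200..201 now hold BE AF
h = load_half(100)       # halfword at the same address is 0xDEAD
```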


From Lawrence D'Oliveiro@21:1/5 to All on Sun May 5 04:35:51 2024

On Sun, 5 May 2024 01:49:52 +0000, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

So using the same register name to address a halfword gives you the low
half of the register, not the high half?

Whereas using the same memory address to address a halfword gives you the
high half of the word at that location, not the low half?

Concrete example::

That’s a “yes”.


From Lawrence D'Oliveiro@21:1/5 to Michael S on Sun May 5 04:36:34 2024

On Sun, 5 May 2024 04:12:49 +0300, Michael S wrote:

On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.

Delay slot was deprecated back in MIPSr6, almost a decade ago.

But that would be a backward-incompatible change, would it not?


From Thomas Koenig@21:1/5 to Scott Lurndal on Sun May 5 07:43:27 2024

Scott Lurndal <[email protected]> schrieb:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.

RISC-V: Seems like it's an extension, for which only a draft is
available, so it is debatable if it has it or not.

POWER: Certainly, the rlwinm instruction.

AMD64: Sure, pdep and friends.

ARM: You certainly know by heart, I don't need to look.

Loongarch: Looking at the docs, it also has it (BSTRINS etc).

So, with the possible exception of RISC-V, I cannot see anything
to contradict you :-)


From Michael S@21:1/5 to [email protected] on Sun May 5 11:03:39 2024

On Sat, 4 May 2024 21:08:19 +0000
[email protected] (MitchAlsup1) wrote:

Michael S wrote:

On Sat, 4 May 2024 19:31:54 -0000 (UTC)
John Levine <[email protected]> wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:

Not a huge use-case in graphics, as noted, in most cases this is
done with 16 or 32 bit pixels; and bit-plane graphics are long
since dead.

What happens if we go beyond 32 bits? For example, hardware might
support 10 bits per pixel component.

I dunno about you but I would align the elements on two-byte
boundaries and only store the high 10 of the 16 bits. It's not like
we're short of address space, and it's a lot quicker to multiply
and divide by 2 or 16 than by 10.

I agree about preferable solution and simplicity, but not about last
part.

Multiplication by 10 is only very slightly slower than
multiplication by 2 or 16 and the difference shouldn't be noticeable
by comparison with other things that we want to do with pixel.

Multiplication by 10 used to index an array is not slower than a
multiplication by 16 (when the ISA is not brain dead)::

LEA Ri,[Ri,Ri<<3]
LD Rd,[Rp,Ri]

Are you sure?
To me, it looks like 9 rather than 10.
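
A quick check of the arithmetic, as a sketch. Here lea_scaled is a made-up stand-in for a "base + (index << shift)" address computation, not a real instruction on any particular ISA.

```python
def lea_scaled(base: int, index: int, shift: int) -> int:
    """Model of an address computation: base + (index << shift)."""
    return base + (index << shift)


def times9(i: int) -> int:
    """Ri + (Ri << 3), the form quoted above: i + 8*i."""
    return lea_scaled(i, i, 3)


def times10(i: int) -> int:
    """(Ri + (Ri << 2)) << 1: 5*i, then doubled by a further shift."""
    return lea_scaled(i, i, 2) << 1
```

So getting to 10 needs either a second operation or an addressing mode that scales the whole sum by 2, for example folding the doubling into the index scaling of the following load.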


From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun May 5 11:10:55 2024

On Sun, 5 May 2024 04:36:34 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sun, 5 May 2024 04:12:49 +0300, Michael S wrote:

On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.

Delay slot was deprecated back in MIPSr6, almost a decade ago.

But that would be a backward-incompatible change, would it not?

It would not.
They added a new set of branches, but preserved an old set.
If I understand their intentions correctly, the old stuff was supposed
to be removed in the next release of the ISA. But then two things
happened simultaneously:
1) they invented nanoMIPS, which made an incompatible release of "classic"
MIPS redundant
2) their financial troubles escalated


From Michael S@21:1/5 to David Brown on Sun May 5 11:32:44 2024

On Fri, 3 May 2024 17:51:17 +0200
David Brown <[email protected]> wrote:

I'm sure there are other reasons why MIPS failed, despite having
cores that were comparable or better than ARM for small-systems
embedded devices. But Microchip has to take a large chunk of the
blame, IMHO.

I am not sure that I agree.
It seems strange to me to blame Microchip, which did embrace MIPS, and to
say nothing about their main competitors that never embraced it, i.e.
STMicro, Philips (== NXP) and TI.

Also, what about IDT (now owned by Renesas)? In the 1990s they were
the biggest partner of MIPS in the general-purpose embedded space. I would
think that they played a bigger role (than Microchip) in MIPS' demise.


From Thomas Koenig@21:1/5 to Michael S on Sun May 5 08:21:47 2024

Michael S <[email protected]> schrieb:

On Sat, 04 May 2024 15:18:37 GMT
[email protected] (Scott Lurndal) wrote:

Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects (and I worked at SGI for a number of
years in the R10k days).

I also prefer ARM64 over MIPS64.
But nanoMIPS is not MIPS64, it's a new architecture that, at least
according to my measurements, is head and shoulders above any
competitors in terms of code density.

Hadn't come across it before...

https://www.anandtech.com/show/12699/mips-announces-i7200-32bit-cpu-with-new-nanomips-isa
says it has 16, 32 and 48 bit instructions, the latter for encoding
32-bit immediates. Sounds like a good strategy if you want to
increase density for a 32-bit ISA, which is also expected to remain
firmly 32-bit.


From Michael S@21:1/5 to Thomas Koenig on Sun May 5 12:13:27 2024

On Sun, 5 May 2024 07:43:27 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Scott Lurndal <[email protected]> schrieb:

d) all modern major architectures have instructions for bitfield manipulation (insert, extract) obviating any need for general
bit-level addressing.

RISC-V: Seems like it's an extension, for which only a draft is
available, so it is debatable if it has it or not.

POWER: Certainly, the rlwinm instruction.

AMD64: Sure, pdep and friends.

PEXT/PDEP have no immediate form, which makes them inconvenient for
C-style fixed bit fields. Unless you access the same bit field
repeatedly, it takes two instructions instead of one (the first is a move
reg,imm). Also, on many AMD processors PDEP/PEXT are slow.
BEXTR has the same problem of lacking an immediate form, but at least it
is fast across the board. Unfortunately, BEXTR does not help with bit
field insertion.

ARM: You certainly know by heart, I don't need to look.

Loongarch: Looking at the docs, it also has it (BSTRINS etc).

So, with the possible exception of RISC-V, I cannot see anything
to contradict you :-)


From Anton Ertl@21:1/5 to Scott Lurndal on Sun May 5 09:20:08 2024

[email protected] (Scott Lurndal) writes:

[email protected] (Anton Ertl) writes:

Byte addressing still seems to be the right choice, for the same
reasons: We have lots of string data, and data that needs larger
units, but for data that fits in smaller units

a) either there is so little that spending a full byte on it is good
enough, or

b) the data is handled by so little code that the burden from the lack
of bit addressing is relatively low in the overall scheme of things, or

c) programs deal with arrays of these things in a SIMD way, and bit
addressing provides little to no benefit.

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.

Many of the word-addressed machines of yesteryear had instructions for character manipulation (insert, extract), but that did not obviate any
need for byte addressing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Michael S on Sun May 5 09:02:03 2024

Michael S <[email protected]> writes:

On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects ...

I know MIPS (like SPARC) originated in that brief window when it was
thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.

Delay slot was deprecated back in MIPSr6, almost a decade ago.

MIPS has a number of other misfeatures that made us disable dynamic superinstructions in Gforth and are a problem for other code-copying
code generators:

First and foremost, the architectural load delay slot (and, I think,
similar constraints wrt multiply and divide instructions and/or
MFHI/MFLO) mean that, unlike for every other architecture we have
looked at (including IA-64), you cannot just concatenate two pieces of
code which do work when they are connected with an indirect jump.

Another nasty property of MIPS is the way direct jumps and calls are
encoded: The target address is assembled from IIRC the top 4 bits of
the current PC and the rest of the address as an absolute number in the
instruction. This means that the call/jump would not show up as
non-relocatable in Gforth's sanity tests, but if we copied a piece of
code to a target area in a different 256MB segment, it would fail.
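
A sketch of why the copy fails, assuming the classic 32-bit J/JAL encoding: a 26-bit word index in the instruction, combined with the upper 4 bits of the address of the instruction after the jump. The function names and addresses are invented for this illustration.

```python
SEGMENT_MASK = 0xF000_0000   # upper 4 bits select one of sixteen 256MB segments


def encode_j(target: int) -> int:
    """26-bit instruction index: the target's word address within its segment."""
    return (target >> 2) & 0x03FF_FFFF


def jump_target(pc: int, instr_index: int) -> int:
    """Where a J at address `pc` lands: segment of pc+4, offset from the field."""
    return ((pc + 4) & SEGMENT_MASK) | (instr_index << 2)


index = encode_j(0x1000_2000)

original = jump_target(0x1000_0100, index)      # where the code was generated
same_segment = jump_target(0x1800_0100, index)  # copy inside the same segment
other_segment = jump_target(0x2000_0100, index) # copy into another segment
```

A copy within the same segment still reaches 0x10002000, so a test that only relocates the code nearby sees nothing wrong; a copy into another segment silently lands at 0x20002000.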

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Michael S@21:1/5 to Anton Ertl on Sun May 5 13:00:00 2024

On Sun, 05 May 2024 09:02:03 GMT
[email protected] (Anton Ertl) wrote:

Michael S <[email protected]> writes:

On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:

Personally I prefer ARM64 architecture over MIPS64 by a
considerable margin, in almost all respects ...

I know MIPS (like SPARC) originated in that brief window when it
was thought that delayed branches were a good idea, and so it
remained saddled with that (mis)feature for the rest of its life.

Delay slot was deprecated back in MIPSr6, almost a decade ago.

MIPS has a number of other misfeatures that made us disable dynamic superinstructions in Gforth and are a problem for other code-copying
code generators:

First and foremost, the architectural load delay slot (and, I think,
similar constraints wrt multiply and divide instructions and/or
MFHI/MFLO) mean that, unlike for every other architecture we have
looked at (including IA-64), you cannot just concatenate two pieces of
code which do work when they are connected with an indirect jump.

Were not all delay slots except the branch delay slot eliminated back in
the revision of the ISA that corresponded to the R4K?

Another nasty property of MIPS is the way direct jumps and calls are
encoded: The target address is assembled from IIRC the top 4 bits of
the current PC and the rest of the address as an absolute number in the
instruction. This means that the call/jump would not show up as
non-relocatable in Gforth's sanity tests, but if we copied a piece of
code to a target area in a different 256MB segment, it would fail.

- anton

Compact branches (Release 6) have conventional signed PC-relative
offsets - +-128 MB for unconditional jump/J&L, +-4MB for
equal/non-equal to zero and +-128 KB for the rest of conditional
branches.
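
Those ranges follow directly from the width of the offset field, since the offset is signed and counts 4-byte instructions. A sketch, where the field widths (26, 21 and 16 bits) are my reading of the Release 6 encodings and should be treated as an assumption:

```python
def reach_bytes(offset_bits: int, shift: int = 2) -> int:
    """Approximate +/- reach of a signed PC-relative offset scaled by 1 << shift."""
    return 1 << (offset_bits - 1 + shift)


# Offset field widths: assumed, see the note above.
FIELDS = {"BC/BALC": 26, "BEQZC/BNEZC": 21, "other compact branches": 16}
REACH = {name: reach_bytes(bits) for name, bits in FIELDS.items()}
```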


From Robert Swindells@21:1/5 to Anton Ertl on Sun May 5 13:35:01 2024

On Sat, 04 May 2024 09:11:27 GMT, Anton Ertl wrote:

David Ungar's PhD thesis was on SOAR (aka RISC-IV), which was either word-addressed or (like Alpha) word-accessed; in one of the last
chapters of his thesis he wrote that the most beneficial feature for performance that SOAR did not have was byte accesses, which would have reduced the number of cycles by IIRC 10% (to be balanced against
potential negative effects on the cycle-time); I found that quite
surprising for a thesis that mainly focussed on architectural features
for Smalltalk execution.

I think SOAR was RISC-III and SPUR (their Lisp CPU) RISC-IV.

My guess is that it was word-addressed.

The type tags are in the high bits of a word, as they were in all the Lisp Machines of the time which were word-addressed, not the low bits as in
SPARC.

On a byte-addressed machine you can use some lower bits "for free" if
the objects being addressed are always word-sized or larger. SPARC has
specific instructions to make use of this.
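
A minimal sketch of the "free" low bits, assuming 4-byte-aligned objects so that the bottom 2 bits of every object address are zero. The helper names and the tag value are hypothetical and do not describe SPARC's or anyone else's actual tag scheme.

```python
TAG_BITS = 2
TAG_MASK = (1 << TAG_BITS) - 1


def tag_pointer(addr: int, tag: int) -> int:
    """Hide a small tag in the low bits of a word-aligned byte address."""
    if addr & TAG_MASK:
        raise ValueError("address is not word-aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the spare bits")
    return addr | tag


def untag_pointer(tagged: int):
    """Split a tagged pointer back into (address, tag)."""
    return tagged & ~TAG_MASK, tagged & TAG_MASK


p = tag_pointer(0x1000_0040, 3)
addr, tag = untag_pointer(p)
```

On a word-addressed machine there are no such spare low bits, which is one reason the tags end up in the high bits instead.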

There is also a paragraph on page 38 on this topic, it states that
Smalltalk didn't store byte scalar values in the image.


From Scott Lurndal@21:1/5 to Thomas Koenig on Sun May 5 15:31:04 2024

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.

RISC-V: Seems like it's an extension, for which only a draft is
available, so it is debatable if it has it or not.

POWER: Certainly, the rlwinm instruction.

AMD64: Sure, pdep and friends.

ARM: You certainly know by heart, I don't need to look.

Loongarch: Looking at the docs, it also has it (BSTRINS etc).

So, with the possible exception of RISC-V, I cannot see anything
to contradict you :-)

I would, personally, categorize RISC-V as a niche architecture
at this time. Give it time to reach "major" status, where
the extensions become less optional.

From Scott Lurndal@21:1/5 to Anton Ertl on Sun May 5 15:32:26 2024

[email protected] (Anton Ertl) writes:

[email protected] (Scott Lurndal) writes:

[email protected] (Anton Ertl) writes:

Byte addressing still seems to be the right choice, for the same
reasons: We have lots of string data, and data that needs larger
units, but for data that fits in smaller units

a) either there is so little that spending a full byte on it is good
enough, or

b) the data is handled by so little code that the burden from the lack
of bit addressing is relatively low in the overall scheme of things, or

c) programs deal with arrays of these things in a SIMD way, and bit
addressing provides little to no benefit.

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.

Many of the word-addressed machines of yesteryear had instructions for
character manipulation (insert, extract), but that did not obviate any
need for byte addressing.

And in further news, Apples are not equal to Oranges.

From Anton Ertl@21:1/5 to Michael S on Sun May 5 16:09:41 2024

Michael S <[email protected]> writes:

On Sun, 05 May 2024 09:02:03 GMT
[email protected] (Anton Ertl) wrote:

First and foremost, the architectural load delay slot (and, I think,
similar constraints wrt multiply and divide instructions and/or
MFHI/MFLO) mean that, unlike for every other architecture we have
looked at (including IA-64), you cannot just concatenate two pieces of
code which do work when they are connected with an indirect jump.

Were not all delay slots except branch delay eliminated back in
revision of the ISA that corresponded to R4K ?

Certainly Raymond Chen who writes explicitly about the R4000 in <https://devblogs.microsoft.com/oldnewthing/20180404-00/?p=98435>
still mentions the restrictions on HI/LO register stuff in 2018.

And even if it was, for 32-bit MIPS the typical build environments and
build targets are just mips and mipsel, with no MIPS III-specific
environment.

And no, looking at the build machine is not good enough: I built some
version of gcc on an EV56, and then wanted to run it on an EV45, and
that produced illegal instruction errors, because during bootstrapping
gcc had decided that it uses BWX instructions, because the build
machine provides them.

For building for MIPS64 one can rely on it being at least an R4000,
but there is still the jump/call problem with that. If the platform
was very relevant, we would be looking for some workaround, but it
isn't.

Another nasty property of MIPS is the way direct jumps and calls are
encoded: The target address is assembled from IIRC the top 4 bits of
the current PC and the rest of the address as an absolute number in the
instruction. This means that the call/jump would not show up as
non-relocatable in Gforth's sanity tests, but if we copied a piece of
code to a target area in a different 256MB-segment, it would fail.
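
A small sketch of why such a jump breaks when code is copied across a 256MB boundary, assuming the classic MIPS32 J/JAL encoding: a 26-bit `instr_index` field shifted left by 2, combined with the upper 4 bits of the address of the delay-slot instruction (PC + 4). The function name `mips_jump_target` is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a classic MIPS32 J/JAL instruction located at address pc.
   The low 28 bits come from the instruction, the top 4 bits from PC + 4. */
static uint32_t mips_jump_target(uint32_t pc, uint32_t insn)
{
    uint32_t index = insn & 0x03FFFFFFu;          /* 26-bit instr_index field */
    assert((pc & 3u) == 0);                       /* instructions are word-aligned */
    return ((pc + 4u) & 0xF0000000u) | (index << 2);
}
```

The same instruction word evaluated at a PC in a different 256MB segment yields a different target, which is exactly the relocation hazard described above.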

- anton

Compact branches (Release 6) have conventional signed PC-relative
offsets - +-128 MB for unconditional jump/J&L, +-4MB for
equal/non-equal to zero and +-128 KB for the rest of conditional
branches.

Sounds good, but again you cannot rely on these branches being present.

There is a lot to be said for providing a plain ISA and doing
optimizations in the microarchitecture. Among the MIPS descendants,
RISC-V does much better, Alpha is somewhere in-between.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From John Savard@21:1/5 to [email protected] on Sun May 5 11:28:12 2024

On Wed, 1 May 2024 00:09:28 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long
time after, machines had a “word length” which was the smallest
addressable unit of memory. This could have a range of sizes, e.g.

12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray

I’m sure there were also 24- and 48-bit machines.

Oh, indeed.

24 bits:
CDC 924
SDS 910, 920, 930, 940
SDS 9300
DDP-24, -124, -224
GE 425, 435, 455, 465
ASI 6020, 6030
SEL 840
Honeywell 300
SCC 660
Datacraft DC 6024, Harris Slash/4
Four-Phase Systems System IV/70
Telefunken TR440
Philco 2000
DJS-6

48 bits:
CDC 1604
BESM 6
Datamatic 1000, Honeywell 400, 800, 1400, 1800
IBM AN/FSQ-31, -32

among others.

From John Savard@21:1/5 to [email protected] on Sun May 5 11:20:02 2024

On Wed, 1 May 2024 00:09:28 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

Big-endian supposedly had the advantage of making memory dumps easier to
read, but little-endian always made more logical sense.

There is one practical argument for big-endian encoding.

Let us suppose that a computer has the ability to do *both* decimal
arithmetic and binary arithmetic.

So a word in the computer might contain just bits, for binary
arithmetic. Or it might contain BCD digits, for decimal arithmetic.

Since it's possible to design an adder where carrying early between
nibbles can be turned on or off, on for decimal arithmetic, and off
for binary arithmetic, clearly the order of digits - big-endian or little-endian - should be the same between binary and decimal.

Also, though, for ease of conversion, the order of BCD digits _should
be the same as the order of the characters of which these digits are
the last four bits_ in the representation of a decimal number as a
character string.

And that means big-endian.

If you have decimal arithmetic, there's a direct connection between
how numbers are represented for reading and writing, and how they are represented for internal arithmetic.
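
A rough sketch of the conversion argument, assuming ASCII digits (whose low four bits are the digit value) and a packed-BCD layout with two digits per byte. The helper name `ascii_to_packed_bcd` is made up. When the BCD digits are stored in the same order as the characters, conversion is just nibble stripping, with no reordering step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack an even-length ASCII decimal string into BCD, two digits per byte,
   keeping the digits in the same order as the characters of the text. */
static void ascii_to_packed_bcd(const char *digits, size_t n, uint8_t *out)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        assert(digits[i] >= '0' && digits[i] <= '9');
        assert(digits[i + 1] >= '0' && digits[i + 1] <= '9');
        /* The low nibble of each ASCII digit is its BCD value. */
        out[i / 2] = (uint8_t)(((digits[i] & 0x0F) << 4) | (digits[i + 1] & 0x0F));
    }
}
```

For the string "1964" the bytes in memory come out as 0x19 followed by 0x64, which reads as the number 0x1964 only under a big-endian interpretation.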

John Savard

From MitchAlsup1@21:1/5 to All on Sun May 5 18:01:43 2024

MitchAlsup1 wrote:

Michael S wrote:

On Sat, 4 May 2024 21:08:19 +0000
[email protected] (MitchAlsup1) wrote:

Multiplication by 10 used to index an array is not slower than a
multiplication by 16 (when the ISA is not brain dead)::

LEA Ri,[Ri,Ri<<3]
LD Rd,[Rp,Ri]

Are you sure?
To me, it looks like 9 rather than 10.

LD Rd,[Rp,Ri<<2]

sorry.........

LEA Ri,[Ri,Ri<<2]
LD Rd,[Rp,Ri<<2]

sorry again.......
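
The arithmetic behind the LEA-plus-scaled-index pairing can be checked in C. This is only a sketch, assuming the goal is an offset of exactly 10*i: the shift-and-add step produces 5*i, and a 1-bit scale in the load's addressing mode supplies the remaining factor of 2. With a 2-bit scale the same pair would produce 20*i instead. The function name `times10` is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* 10*i computed as one shift-and-add (5*i) followed by a 1-bit scale (x2). */
static uint64_t times10(uint64_t i)
{
    assert(i <= UINT64_MAX / 10);       /* keep the result from wrapping */
    uint64_t five_i = i + (i << 2);     /* the LEA step: i + 4*i */
    return five_i << 1;                 /* the scaled-index step: x2 */
}
```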

From MitchAlsup1@21:1/5 to Michael S on Sun May 5 17:59:37 2024

Michael S wrote:

On Sat, 4 May 2024 21:08:19 +0000
[email protected] (MitchAlsup1) wrote:

Multiplication by 10 used to index an array is not slower than a
multiplication by 16 (when the ISA is not brain dead)::

LEA Ri,[Ri,Ri<<3]
LD Rd,[Rp,Ri]

Are you sure?
To me, it looks like 9 rather than 10.

LD Rd,[Rp,Ri<<2]

sorry.........

From John Levine@21:1/5 to All on Sun May 5 18:57:14 2024

According to Robert Swindells <[email protected]>:

On Sat, 04 May 2024 09:11:27 GMT, Anton Ertl wrote:
On a byte-addressed machine you can use some lower bits "for free" if
the objects being addressed are always word-sized or larger. SPARC has
specific instructions to make use of this.

Only if you can count on them being aligned. On S/360 they required
everything to be aligned, and one of the changes on S/370 was to allow arbitrary data alignment for data addresses. They quickly found that
Fortran programs used COMMON and EQUIVALENCE to put 8 byte reals on 4
byte boundaries in strictly standard conforming programs. Oops. The
Fortran library caught the traps and fixed them up but with dreadful performance.

If your storage management is disciplined enough that you know that everything is aligned on natural boundaries, this trick still works, but if you're going to have to mask out flag bits anyway, the argument for putting the flags in
the low bits isn't as strong.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From MitchAlsup1@21:1/5 to BGB on Sun May 5 22:21:20 2024

BGB wrote:

On 5/5/2024 10:31 AM, Scott Lurndal wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than shift+mask (and probably adding some mechanism to pattern-match it from
the AST or similar).

But, but but but:: it IS shift and Mask !!

Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).

Then again, say:
BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.

SL Rd,Rc,<width:offset>

Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)

SR Rd,Rc,<width:offset>

Positions the value in a register (Rc) such that it fits the alignment of
a field.

INS Rd,Rc,Rf,<width:offset>

Inserts the field from Rf into its position <w:o> in Rc and delivers
the new container to Rd.
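
For reference, a plain C sketch of what field extract and insert with a <width:offset> pair compute, plus the BITEXTR Imm10 expression quoted above written out as a function. The names `low_mask`, `bf_extract`, `bf_insert` and `bitextr_imm10` are hypothetical; this models the arithmetic only, not the encoding or exact semantics of any of the ISAs discussed.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set, for 1 <= width <= 64. */
static uint64_t low_mask(unsigned width)
{
    assert(width >= 1 && width <= 64);
    return (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1);
}

/* Extract: shift the field down to bit 0, then mask off everything above it. */
static uint64_t bf_extract(uint64_t container, unsigned width, unsigned offset)
{
    assert(offset < 64 && width <= 64 - offset);
    return (container >> offset) & low_mask(width);
}

/* Insert: clear the field in the container, then OR in the positioned value. */
static uint64_t bf_insert(uint64_t container, uint64_t field,
                          unsigned width, unsigned offset)
{
    assert(offset < 64 && width <= 64 - offset);
    uint64_t mask = low_mask(width) << offset;
    return (container & ~mask) | ((field << offset) & mask);
}

/* The BITEXTR Imm10 form: offset in Imm[5:0], width (0..15) in Imm[9:6]. */
static uint64_t bitextr_imm10(uint64_t rn, unsigned imm)
{
    return (rn >> (imm & 63)) & ((UINT64_C(1) << ((imm >> 6) & 15)) - 1);
}
```

Either way the work is a shift plus a mask; the instruction forms differ mainly in how the width and offset constants are supplied.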

Also, some things don't seem well balanced in terms of cost, so while it would be fairly cheap for a microcontroller, by the time one implements enough extensions to make it more useful for general purpose computing,
it will no longer be cheap (while at the same time shooting itself in
the foot in terms of performance for imposing some design constraints
that *only* make sense for small microcontrollers).

We can put 64 GBOoO CPUs on a single die and you worry about the shifter
having a masker ?!?

One big offender here, as I see it, is a few features in the Privileged
ISA spec, such as:
Separate register sets for each protection level/mode;

While My 66000 has separate register files for every thread; each file
is memory resident when not running. {At least conceptually}

The comparably large number of CSRs;

I have a 64-bit control register space and all CSRs are mapped into this
space (along with all device control registers, ...). {This space is
entirely separate from the space that DRAM occupies.}

Allowing operations on CSRs beyond just moving them to/from a GPR or
similar;
....

Things like the 'V' extension are also cause for concern.

The 'M' extension isn't ideal, but I made it work in a way that "isn't
too horribly expensive" (namely using a Shift-and-Add unit).

Also the cost-scaling of the Shift-Add unit is such that it could
potentially be extended to allow 128-bit integer multiply and divide,
but debatable (there are only a few edge cases where this would likely
be faster than "just do it in software").

You are being misled as to what architecture is compared to what you can
implement in your FPGA and this is coloring your view of it.

Well, and my ALUX extension can make for faster 128-bit ALU operations,
but is debatable as the cost-delta mostly disappears in the noise
(mostly because 128-bit ALU ops are rare).

In My 66000's case, the CARRY instruction modifier provides access to multiprecision arithmetic--including exact FP arithmetics which even
gets the inexact bit set (clear actually) correctly.

Conversely, the code when built for RV64G omits 128-bit types entirely,

What, exactly, did you expect from an Academic quality ISA ?????

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Sun May 5 22:25:46 2024

Chris M. Thomasson wrote:

On 5/4/2024 5:12 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 3:18 AM, Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.

Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms as
well. I remember storing an actual reference count in pointers before
for a special type of counting.

Even if one has multi-location ATOMICs ?? (as a single event ??)

This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_ a
reference count together atomically. This can easily be done with DWCAS,
or double width compare and swap. So, on a 32 bit system we need 64 bit
cas, for a 64 bit system we need 128 bit cas. However, sometimes we can
pack the reference count in the pointer value itself if it's aligned on a
big enough boundary. Then we can update the pointer and the reference
count using normal word based atomic RMW's.

I understand why you had to pack the pointer and a chunk of data into a
single container.

What I don't understand is if you had easy access to multi-container ATOMICs the packing would be unnecessary--would it not ?? That is in one ATOMIC event you could update the pointer and the chunk of data independently and not NEED to store them in a single container.

From John Levine@21:1/5 to All on Mon May 6 01:21:35 2024

According to Anton Ertl <[email protected]>:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.

Many of the word-addressed machines of yesteryear had instructions for
character manipulation (insert, extract), but that did not obviate any
need for byte addressing.

I believe that byte addressing which simultaneously allows larger
words on power of two boundaries is one of those ideas that seems
totally obvious now but was not at all at the time.

Many of IBM's earlier machines like the 705 and 1620 and 1401 were
character or digit addressable, and even had multi-character
instructions that had to be aligned on a 5 digit boundary, but until
the 360 nobody made the jump to see that you could address larger data
in parallel that way.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Mon May 6 02:29:18 2024

On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.

Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just
for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon May 6 00:50:35 2024

Chris M. Thomasson wrote:

On 5/5/2024 3:25 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 5:12 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 3:18 AM, Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.

Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms as
well. I remember storing an actual reference count in pointers
before for a special type of counting.

Even if one has multi-location ATOMICs ?? (as a single event ??)

This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_ a
reference count together atomically. This can easily be done with DWCAS,
or double width compare and swap. So, on a 32 bit system we need 64 bit
cas, for a 64 bit system we need 128 bit cas. However, sometimes we
can pack the reference count in the pointer value itself if it's
aligned on a big enough boundary. Then we can update the pointer and
the reference count using normal word based atomic RMW's.

I understand why you had to pack the pointer and a chunk of data into a
single container.

What I don't understand is if you had easy access to multi-container
ATOMICs the packing would be unnecessary--would it not ?? That is in one
ATOMIC event you could update the pointer and the chunk of data
independently and not NEED to store them in a single container.

Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
enough to increment the counter and load a pointer atomically all in one shot, loopless. Why would I need to use multi atomics with a possible
loop to do that?
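
A minimal C11 sketch of the packed-counter technique described here, under stated assumptions: objects are 64-byte aligned so the low 6 bits of the pointer word are free, and the counter is never allowed to overflow those bits. The type `counted_ptr` and its functions are hypothetical names, and `atomic_fetch_add` stands in for LOCK XADD (on x86-64 compilers typically emit that instruction for it, but this is not guaranteed).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: 64-byte-aligned objects leave the low 6 bits of the
   pointer word free to hold a small counter (0..63). */
#define COUNT_BITS 6u
#define COUNT_MASK ((uintptr_t)((1u << COUNT_BITS) - 1u))

typedef struct { atomic_uintptr_t word; } counted_ptr;

/* Store the pointer with a counter of zero. */
static void counted_ptr_init(counted_ptr *cp, void *obj)
{
    assert(((uintptr_t)obj & COUNT_MASK) == 0);   /* alignment frees the low bits */
    atomic_init(&cp->word, (uintptr_t)obj);
}

/* One fetch-and-add bumps the counter and returns the pointer, with no
   compare-and-swap loop. The assert only detects overflow after the fact;
   a real design must bound the counter by construction. */
static void *counted_ptr_acquire(counted_ptr *cp, unsigned *count_after)
{
    uintptr_t old = atomic_fetch_add(&cp->word, 1);
    assert((old & COUNT_MASK) != COUNT_MASK);
    *count_after = (unsigned)(old & COUNT_MASK) + 1u;
    return (void *)(old & ~COUNT_MASK);
}
```

Because pointer and counter share one machine word, a single-word atomic RMW covers both, which is the point being made against needing a wider or multi-location primitive for this particular case.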

Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits in
total. Further postulate that you need to update both in a single
non-blocking ATOMIC event. ...

From Lawrence D'Oliveiro@21:1/5 to BGB on Mon May 6 02:52:19 2024

On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

Say, RISC-V:
Says yes to DIV and MOD;
Says yes to 4-register floating-point multiply-accumulate;
Says no to register-indexed Load/Store.
Me: This is not a good balance...

Multiply-accumulate is at least as much about reducing rounding error as
about speed.

From John Levine@21:1/5 to All on Mon May 6 02:54:11 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.

Even if those bottom three bits of the address must be zero in every other
instruction but these, I thought it would be convenient to have them, just
for these bitfield instructions. It would save passing around a separate
bit-offset field in arbitrary-bit-aligned pointers.

The only significant application for bit addressing that anyone has
mentioned is data compression. It's not something that computers spend
a great deal of time doing, and I see no reason to believe that bit
addressing would make it much faster than the way it's done now with
shifting and masking.

If you do want to make compression faster, it'd make more sense to add
instructions to do the compressing you care about, like DFLTCC in
S/360 and zSeries that speed up gzip, rather than adding three bits to
the other 99% of instructions that don't use bit fields.

If you think otherwise, what are the applications that will make all
those address bits useful, and why do you think bit addressing will be
faster than shifting and masking? There's still going to be memory
underneath that's byte or word addressed so the shifting and masking
is going to happen anyway.
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Mon May 6 02:30:42 2024

On Sun, 05 May 2024 15:31:04 GMT, Scott Lurndal wrote:

I would, personally, categorize RISC-V as a niche architecture at this
time.

I think it’s already shipping in the billions of units per year--enough to make it the world’s second-most-popular CPU architecture, after ARM.

From Lawrence D'Oliveiro@21:1/5 to John Savard on Mon May 6 02:34:48 2024

On Sun, 05 May 2024 11:20:02 -0600, John Savard wrote:

If you have decimal arithmetic, there's a direct connection between how numbers are represented for reading and writing, and how they are
represented for internal arithmetic.

It is easier to do addition/subtraction if you start from the least
significant end and propagate the carry/borrow along.

I believe those early IBM character machines worked exactly this way.
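
The carry-propagation argument, sketched in C for a number split into 32-bit limbs and stored least-significant limb first. The function name `add_limbs` and the limb size are illustrative assumptions; the point is only that the loop walks upward in address order while the carry ripples in the same direction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-limb numbers stored least-significant limb first.
   Returns the carry out of the most significant limb. */
static unsigned add_limbs(uint32_t *sum, const uint32_t *a,
                          const uint32_t *b, size_t n)
{
    unsigned carry = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* 64-bit intermediate holds limb + limb + carry without overflow. */
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;            /* low 32 bits are the result limb */
        carry = (unsigned)(t >> 32);     /* bit 32 is the carry into the next limb */
    }
    return carry;
}
```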

From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon May 6 02:30:02 2024

On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:

PEXT/PDEP has no immediate form, which makes it inconvenient for
'C'-style fixed bit fields.

Fixed bit fields are a limitation of the C language. Why should it
constrain the design of machine architectures?

From Terje Mathisen@21:1/5 to All on Mon May 6 08:13:16 2024

MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/5/2024 3:25 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 5:12 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 3:18 AM, Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.

Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in pointers
before for a special type of counting.

Even if one has multi-location ATOMICs ?? (as a single event ??)

This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_ a
reference count together atomically. This can easily be done with DWCAS,
or double width compare and swap. So, on a 32 bit system we need 64
bit cas, for a 64 bit system we need 128 bit cas. However, sometimes
we can pack the reference count in the pointer value itself if it's
aligned on a big enough boundary. Then we can update the pointer and
the reference count using normal word based atomic RMW's.

I understand why you had to pack the pointer and a chunk of data into a
single container.

What I don't understand is if you had easy access to multi-container
ATOMICs the packing would be unnecessary--would it not ?? That is in one
ATOMIC event you could update the pointer and the chunk of data
independently and not NEED to store them in a single container.

Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
enough to increment the counter and load a pointer atomically all in
one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?

Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits in
total. Further postulate that you need to update both in a single
non-blocking ATOMIC event. ...

"Any programming problem can be solved with an additional layer of indirection", so in this case you create a handle to that 72-bit item,
and require all access to go via the handle?

The addendum to the rule above is of course ", except the problem of too
many layers of indirections". :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Scott Lurndal@21:1/5 to John Levine on Mon May 6 14:07:48 2024

John Levine <[email protected]> writes:

According to Lawrence D'Oliveiro <[email protected]d>:

On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.

Even if those bottom three bits of the address must be zero in every other
instruction but these, I thought it would be convenient to have them, just
for these bitfield instructions. It would save passing around a separate
bit-offset field in arbitrary-bit-aligned pointers.

The only significant application for bit addressing that anyone has
mentioned is data compression. It's not something that computers spend
a great deal of time doing, and I see no reason to believe that bit
addressing would make it much faster than the way it's done now with
shifting and masking.

We've one application that uses bit insertion
and extraction extensively (an SoC simulator) when dealing
both with emulation of the ARMv7 and ARMv8 instruction sets
as well as hardware accelerator block CSRs.

But as you note below, hardware support for crypto and
compression operations is generally superior.

From John Savard@21:1/5 to [email protected] on Mon May 6 09:56:03 2024

On Mon, 6 May 2024 02:34:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

It is easier to do addition/subtraction if you start from the least
significant end and propagate the carry/borrow along.

Of course, but so what? That just determines in which direction your
ALU is wired. It is true that this is the reason why many machines
were little-endian when their word size was smaller than the size of
the integers on which they would do arithmetic.

But we no longer have this problem.

John Savard

From EricP@21:1/5 to All on Mon May 6 11:26:46 2024

MitchAlsup1 wrote:

BGB wrote:

On 5/5/2024 10:31 AM, Scott Lurndal wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than
shift+mask (and probably adding some mechanism to pattern-match it
from the AST or similar).

But, but but but:: it IS shift and Mask !!

Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).

Then again, say:
BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.

SL Rd,Rc,<width:offset>

Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever purpose is needed)

SR Rd,Rc,<width:offset>

Positions the value in a register (Rc) such that it fits the alignment of
a field.

INS Rd,Rc,Rf,<width:offset>

Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.

I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.

An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.

But if the in-memory bit field is > 56 bits wide it may or may not straddle
a 64-bit memory location boundary, and so may require a pair of registers
to be loaded.
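
The 56-bit limit can be seen in a short C sketch. Assumptions: a little-endian bit stream, a hypothetical helper name `extract_bits56`, and a caller that guarantees 8 readable bytes from the computed byte address. The byte address absorbs all but the low 3 bits of the bit position, so at most 7 bits of shift remain, and 7 + 56 = 63 still fits in one 64-bit register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extract `width` bits (1..56) starting at absolute bit position `bitpos`
   of a little-endian bit stream. The caller guarantees that 8 bytes are
   readable starting at byte bitpos / 8. */
static uint64_t extract_bits56(const uint8_t *buf, uint64_t bitpos, unsigned width)
{
    assert(width >= 1 && width <= 56);
    const uint8_t *p = buf + (bitpos >> 3);    /* byte address does most of the shifting */
    uint64_t word = 0;
    for (unsigned i = 0; i < 8; i++)           /* portable little-endian 8-byte load */
        word |= (uint64_t)p[i] << (8 * i);
    unsigned shift = (unsigned)(bitpos & 7);   /* at most 7 bits left to shift */
    return (word >> shift) & ((UINT64_C(1) << width) - 1);
}
```

Fields wider than 56 bits can spill past the loaded 8 bytes once the residual shift is applied, which is where a second register (or a wider load) becomes necessary.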

I added an optional second dest register field to my ISA to allow operations like wide bit field extract and insert across a pair of registers.
Also for wide arithmetic.

I was thinking of variable length LDV and STV load & store instructions
to work with variable length byte fields from 1 to 16 bytes.

LDV has two dst registers, a normal byte address specifier,
and a byte count from 1 to 16 to load. All high order bytes
not written by the LDV are zero filled.
The byte count can be an immediate or in a register.

STV does the same for stores with a pair of source value registers.

LDV and STV only touch the memory bytes they actually load or store.
So if the actual address + byte count does not touch a second 64-bit
memory word then they don't touch the next cache line or next page
in the case of potential page straddles.

This allows code to LDV up to 16 bytes into a register pair
extract and insert up to 64-bit fields in that register pair,
then STV only the bytes operated on,
with HW taking care of the special cases of straddle/not-straddle.

From John Savard@21:1/5 to All on Mon May 6 11:15:39 2024

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
wrote:

Why do you think bit addressing will be
faster than shifting and masking? There's still going to be memory
underneath that's byte or word addressed so the shifting and masking
is going to happen anyway.

Shifting, in a sense, yes. But not necessarily masking.

So just because a processor has a 64-bit bus to memory doesn't mean it
has to implement fetching a single byte from memory by doing a shift
and mask operation in a 64-bit register. Instead, each byte of the bus
could have a direct wired path to the low 8-bits of the internal data
bus feeding the registers.

With bit addressing, of course, an implementation involving shifting
and masking is more likely, but even then, one omits fetching and
decoding the instructions to shift and mask, which is a speed gain
right there.

John Savard

From EricP@21:1/5 to Lawrence D'Oliveiro on Mon May 6 14:08:39 2024

Lawrence D'Oliveiro wrote:

On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.

Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.

It's not just the bit address that you have to carry about
but also field width and type (zero/sign extend) on extract.

To my eye the cost of bit fields is primarily in dealing at run time
with the potential for straddles across memory locations and registers.
It makes for a lot of fiddly little IF code blocks which then have to be
put into general subroutines.

A second issue, when there are multiple bit fields, is
optimizing this so it only loads and stores with memory when it has to.
If r1 contains a low straddle part and r2 the high straddle part,
and we have already updated one bit field in those parts,
if we want to update a second bit field,
then we need to check if it is wholly contained within those
two registers, or one or both need to be spilled and reloaded.

A lot of this fiddly code looks like it would be best
implemented with predication.
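To make the straddle case concrete, here is a minimal C sketch of the kind of helper routine described above. The function name, the array-of-words memory model, and the little-endian bit numbering are all assumptions made for illustration; nothing here comes from a particular ISA.

```c
#include <stdint.h>

/* Hypothetical helper: read `width` bits (1..64) starting at absolute bit
   position `bitpos` from an array of 64-bit words. Bit 0 is taken to be the
   least significant bit of word 0 (an assumed numbering). */
static uint64_t bf_extract(const uint64_t *mem, uint64_t bitpos, unsigned width)
{
    uint64_t word = bitpos / 64;
    unsigned off  = (unsigned)(bitpos % 64);
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1);
    uint64_t v    = mem[word] >> off;

    if (off + width > 64) {
        /* The field straddles two words: fetch the high part from the next
           word. off is nonzero here, so the shift count is in 1..63. */
        v |= mem[word + 1] << (64 - off);
    }
    return v & mask;
}
```

The single `if` is the "fiddly" part; it is also the natural candidate for predication, since both arms are short and the second load is the only side effect.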

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon May 6 19:10:43 2024

Lawrence D'Oliveiro wrote:

On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:

PEXTR/PDEP has no immediate form, which makes it inconvenient for
'C'-style fixed bit fields.

Fixed bit fields are a limitation of the C language. Why should it
constrain the design of machine architectures?

The only bearing C bit-fields have on extract and insert is the need
for constants that specify the field.

From MitchAlsup1@21:1/5 to John Levine on Mon May 6 19:13:51 2024

John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.

Even if those bottom three bits of the address must be zero in every other
instruction but these, I thought it would be convenient to have them, just
for these bitfield instructions. It would save passing around a separate
bit-offset field in arbitrary-bit-aligned pointers.

The only significant application for bit addressing that anyone has
mentioned is data compression. It's not something that computers spend
a great deal of time doing, and I see no reason to believe that bit addressing would make it much faster than the way it's done now with
shifting and masking.

If you do want to make compression faster, it'd make more sense to add
instructions to do the compressing you care about, like DFLTCC in
S/360 and zSeries that speed up gzip, rather than adding three bits to
the other 99% of instructions that don't use bit fields.

If you think otherwise, what are the applications that will make all
those address bits useful, and why do you think bit addressing will be
faster than shifting and masking? There's still going to be memory
underneath that's byte or word addressed so the shifting and masking
is going to happen anyway.

Placing bit-field access INSIDE LDs and STs requires adding 2 stages
of multiplexing in the LD/ST aligners (memory shifters). This has the
potential to slow the overall pipeline frequency--at which point you
have lost more than you can gain.

From MitchAlsup1@21:1/5 to Terje Mathisen on Mon May 6 19:15:35 2024

Terje Mathisen wrote:

MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/5/2024 3:25 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 5:12 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 3:18 AM, Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.

Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in pointers
before for a special type of counting.

Even if one has multi-location ATOMICs ?? (as a single event ??)

This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_ a
reference count together atomically. This can easily be done with DWCAS,
or double-width compare-and-swap. So, on a 32-bit system we need a 64-bit
CAS; on a 64-bit system we need a 128-bit CAS. However, sometimes we can
pack the reference count in the pointer value itself if it's aligned on a
big enough boundary. Then we can update the pointer and the reference
count using normal word-based atomic RMWs.

I understand why you had to pack the pointer and a chunk of data into a
single container.

What I don't understand is if you had easy access to multi-container
ATOMICs the packing would be unnecessary--would it not ?? That is in one
ATOMIC event you could update the pointer and the chunk of data
independently and not NEED to store them in a single container.

Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
enough to increment the counter and load a pointer atomically all in
one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?
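A rough C11 sketch of the packed pointer-plus-count idea, for readers who have not seen it. The type and function names are hypothetical, the 64-byte alignment (6 spare low bits) is an assumption, and overflow of the small counter is deliberately not handled.

```c
#include <stdatomic.h>
#include <stdint.h>

#define COUNT_BITS 6u   /* assumes objects are 64-byte aligned */
#define COUNT_MASK ((UINT64_C(1) << COUNT_BITS) - 1)

/* One word holding an aligned pointer in the high bits and a small
   count in the low COUNT_BITS bits. */
typedef struct { _Atomic uint64_t word; } tagged_ptr;

/* Increment the packed count and fetch the pointer in a single atomic
   fetch-and-add, with no retry loop. The count must stay below
   2^COUNT_BITS or it would carry into the pointer bits. */
static void *tp_acquire(tagged_ptr *tp)
{
    uint64_t old = atomic_fetch_add(&tp->word, 1);
    return (void *)(uintptr_t)(old & ~COUNT_MASK);
}

/* Read the current count without modifying the word. */
static unsigned tp_count(tagged_ptr *tp)
{
    return (unsigned)(atomic_load(&tp->word) & COUNT_MASK);
}
```

On x86-64 a compiler would typically turn the `atomic_fetch_add` into LOCK XADD, which is the loopless behaviour being argued for here.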

Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits in
total. Further postulate that you need to update both in a single
non-blocking ATOMIC event. ...

"Any programming problem can be solved with an additional layer of indirection", so in this case you create a handle to that 72-bit item,
and require all access to go via the handle?

I am not trying to add an additional layer of indirection, I am trying (unsuccessfully it appears) to get Chris to think outside of the one
container ATOMIC box.

The addendum to the rule above is of course ", except the problem of too
many layers of indirections". :-)

Terje

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon May 6 19:11:22 2024

Lawrence D'Oliveiro wrote:

On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

Say, RISC-V:
Says yes to DIV and MOD;
Says yes to 4-register floating-point multiply-accumulate;
Says no to register-indexed Load/Store.
Me: This is not a good balance...

Multiply-accumulate is at least as much about reducing rounding error as about speed.

It is also an IEEE 754-2008+ requirement.

From MitchAlsup1@21:1/5 to BGB on Mon May 6 19:26:16 2024

BGB wrote:

On 5/5/2024 9:30 PM, Lawrence D'Oliveiro wrote:

On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:

PEXTR/PDEP has no immediate form, which makes it inconvenient for
'C'-style fixed bit fields.

Fixed bit fields are a limitation of the C language. Why should it
constrain the design of machine architectures?

If it lacks an immediate form, one is harder pressed to beat out
shift+and or shift+shift on the performance front...

Though, to be useful, it needs an immediate large enough to express both
the shift amount and the width of the bitfield, and also a 3RI encoding.

My 66000 has 12-bits of immediate for shifts, and a slot in the 3-operand instruction group.

Bitfield insert would be a little easier to get a performance advantage
from (vs bitfield extract), since insertion is a more complex operation,
but it is also likely to require a more complex implementation and is
also less common than bitfield extract.

Without SR <w:o>; one needs two shifts and a container sized mask

SR Rt,Rc,#64-11 // get rid of excess significance
SL Rt,Rt,#64-11-12 // position field to container
AND Rk,Rk,#0x0007FF0000 // EMPTY field in Kontainer
OR Rk,Rk,Rt // insert field

With:

SR Rt,Rc,<11:12>
AND Rk,Rk,#0x0007FF0000 // EMPTY field in Kontainer
OR Rk,Rk,Rt // insert field

With insert::

INS Rk,Rk,Rc,<11:12>
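For comparison, a portable C rendering of what a single insert instruction has to compute. This is only a sketch: the function name is made up, and it reads `<11:12>` as width 11 at offset 12, which is an assumption about the notation. Under that reading the bits that get cleared are 0x7FF000.

```c
#include <stdint.h>

/* Insert the low `width` bits of `field` into `container` at bit `offset`.
   Assumes 1 <= width and offset + width <= 64. */
static uint64_t bf_insert(uint64_t container, uint64_t field,
                          unsigned width, unsigned offset)
{
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1);
    container &= ~(mask << offset);         /* empty the field in the container */
    container |= (field & mask) << offset;  /* drop excess bits, position, merge */
    return container;
}
```

Written out this way it is four dependent operations (mask, clear, position, merge), which is roughly the cost the shift/AND/OR sequences above are paying.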

....

From MitchAlsup1@21:1/5 to EricP on Mon May 6 19:31:07 2024

EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

On 5/5/2024 10:31 AM, Scott Lurndal wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than
shift+mask (and probably adding some mechanism to pattern-match it
from the AST or similar).

But, but but but:: it IS shift and Mask !!

Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).

Then again, say:
BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
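The expression in that comment can be checked directly in C. The sketch below is just a software model of the proposed encoding (the function name is illustrative): the low 6 bits of the immediate give the shift and the next 4 bits give the width.

```c
#include <stdint.h>

/* Model of the proposed "BITEXTR Imm10, Rn":
   Rn = (Rn >> (Imm & 63)) & ((1 << ((Imm >> 6) & 15)) - 1) */
static uint64_t bitextr(uint64_t rn, unsigned imm10)
{
    unsigned shift = imm10 & 63;         /* bits 5..0: shift amount */
    unsigned width = (imm10 >> 6) & 15;  /* bits 9..6: field width, 0..15 */
    return (rn >> shift) & ((UINT64_C(1) << width) - 1);
}
```

Note the consequence of spending only 4 bits on the width: fields wider than 15 bits cannot be expressed in this form.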

SL Rd,Rc,<width:offset>

Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)

SR Rd,Rc,<width:offset>

Positions the value in a register (Rc) such that it fits the alignment of
a field.

INS Rd,Rc,Rf,<width:offset>

Inserts the field from Rf into its position <w:o> in Rc and delivers
the new container to Rd.

I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.

An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.

This is what CARRY is for--access to 128-bit in 2×64-bit out shifts.
CARRY can be used for extracts and for inserts.

But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.

I don't understand 56--56 takes just as many bits to encode as 63 ?!?

I added an optional second dest register field to my ISA to allow operations like wide bit field extract and insert across a pair of registers.
Also for wide arithmetic.

I was thinking of variable length LDV and STV load & store instructions
to work with variable length byte fields from 1 to 16 bytes.

32 gives you access to an arithmetic space where you can calculate
world GDP in the least valuable currency worldwide, not lose a cent
on the bottom end, and not overflow on the top by 20-odd bits.

LDV has two dst registers, a normal byte address specifier,
and a byte count from 1 to 16 to load. All high order bytes
not written by the LDV are zero filled.
The byte count can be an immediate or in a register.

STV does the same for stores with a pair of source value registers.

LDV and STV only touch the memory bytes they actually load or store.
So if the actual address + byte count does not touch a second 64-bit
memory word then they don't touch the next cache line or next page
in the case of potential page straddles.

This allows code to LDV up to 16 bytes into a register pair
extract and insert up to 64-bit fields in that register pair,
then STV only the bytes operated on,
with HW taking care of the special cases of straddle/not-straddle.
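A software model may help pin down the LDV/STV semantics described above. This is only a sketch under assumptions: the names `reg_pair`, `ldv` and `stv` are invented here, and the byte order within the register pair is taken to be that of a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Two 64-bit registers treated as one 16-byte container. */
typedef struct { uint64_t lo, hi; } reg_pair;

/* Model of LDV: load `count` bytes (1..16) from addr; all higher-order
   bytes of the pair are zero filled. */
static reg_pair ldv(const void *addr, unsigned count)
{
    unsigned char buf[16] = {0};
    reg_pair r;
    memcpy(buf, addr, count);   /* touches only `count` bytes of memory */
    memcpy(&r.lo, buf, 8);
    memcpy(&r.hi, buf + 8, 8);
    return r;
}

/* Model of STV: store only the low `count` bytes (1..16) of the pair. */
static void stv(void *addr, reg_pair r, unsigned count)
{
    unsigned char buf[16];
    memcpy(buf, &r.lo, 8);
    memcpy(buf + 8, &r.hi, 8);
    memcpy(addr, buf, count);   /* bytes past `count` are left untouched */
}
```

The property that matters for page and cache-line straddles is visible in the two `memcpy` calls on `addr`: neither reads nor writes beyond `addr + count`.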

From MitchAlsup1@21:1/5 to John Savard on Mon May 6 19:34:47 2024

John Savard wrote:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
wrote:

Why do you think bit addressing will be
faster than shifting and masking? There's still going to be memory
underneath that's byte or word addressed so the shifting and masking
is going to happen anyway.

Shifting, in a sense, yes. But not necessarily masking.

So just because a processor has a 64-bit bus to memory doesn't mean it

Why so narrow ??

has to implement fetching a single byte from memory by doing a shift
and mask operation in a 64-bit register.

Not on a 64-bit register, but a 64-bit (or 128-bit) flip-flop.

Instead, each byte of the bus
could have a direct wired path to the low 8-bits of the internal data
bus feeding the registers.

How is that NOT a shifter ???

Remember people, accessing smaller than cache port width REQUIRES
shifting. We often call them Aligners, but the logic is that of
a shifter.

With bit addressing, of course, an implementation involving shifting
and masking is more likely, but even then, one omits fetching and
decoding the instructions to shift and mask, which is a speed gain
right there.

Bit addressing only makes the shifter deeper, not wider.

John Savard

From MitchAlsup1@21:1/5 to EricP on Mon May 6 19:39:55 2024

EricP wrote:

Lawrence D'Oliveiro wrote:

On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:

d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.

Even if those bottom three bits of the address must be zero in every other
instruction but these, I thought it would be convenient to have them, just
for these bitfield instructions. It would save passing around a separate
bit-offset field in arbitrary-bit-aligned pointers.

It's not just the bit address that you have to carry about
but also field width and type (zero/sign extend) on extract.

No different from signed/unsigned bytes, halfwords, and words.

To my eye the cost of bit fields is primarily in dealing at run time
with the potential for straddles across memory locations and registers.
It makes for a lot of fiddly little IF code blocks which then have to be
put into general subroutines.

In My 66000 ISA, one can use CARRY to concatenate 2 registers into
1 container and then extract or insert into the double wide container
EVEN when there is no straddling of boundaries! This gets rid of a
lot of the fiddling.

A second issue, when there are multiple bit fields, is
optimizing this so it only loads and stores with memory when it has to.
If r1 contains a low straddle part and r2 the high straddle part,
and we have already updated one bit field in those parts,
if we want to update a second bit field,
then we need to check if it is wholly contained within those
two registers, or one or both need to be spilled and reloaded.

Obviously.

A lot of this fiddly code looks like it would be best
implemented with predication.

...

From Terje Mathisen@21:1/5 to EricP on Mon May 6 21:53:42 2024

EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

On 5/5/2024 10:31 AM, Scott Lurndal wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than
shift+mask (and probably adding some mechanism to pattern-match it
from the AST or similar).

But, but but but:: it IS shift and Mask !!

Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).

Then again, say:
   BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.

    SL    Rd,Rc,<width:offset>

Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)

    SR    Rd,Rc,<width:offset>

Positions the value in a register (Rc) such that it fits the alignment of
a field.

    INS   Rd,Rc,Rf,<width:offset>

Inserts the field from Rf into its position <w:o> in Rc and delivers
the new container to Rd.

I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.

An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.

But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.

x86 does not have bitfield insert/extract, but it does have SHRD/SHLD so
it is fairly easy to handle arbitrary length (<= 64 bits) and alignment:

; RSI -> target, CL = bit offset of field, RBX = 64 - field size (0..63)
mov rax,[rsi]
mov rdx,[rsi+8]

shrd rax,rdx,cl ; bit offset

and rax,bitmask[rbx*8] ; 64 mask entries.

The last instruction can also be replaced with

shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
shrx rax,rax,rbx

or the entire thing can be replaced with this one which calculates the
mask on the fly:

mov rax,[rsi]
mov rdx,[rsi+8]
or rdi,-1 ; Generate mask

shrd rax,rdx,cl ; bit offset
shrx rdi,rdi,rbx ; excess bits to mask away

and rax,rdi

All seems like about 3 clock cycles when hitting the cache.
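For readers who do not speak x86, here is a C sketch of the same idea: a double-word right shift followed by a mask computed on the fly. The function names are illustrative only. One difference from the assembly is that C makes a shift by 64 undefined, so the zero-offset case needs its own path.

```c
#include <stdint.h>

/* Equivalent of "shrd rax,rdx,cl" for a shift count of 0..63. */
static uint64_t funnel_shr(uint64_t lo, uint64_t hi, unsigned sh)
{
    /* hi << 64 would be undefined in C, so sh == 0 is handled separately. */
    return sh ? (lo >> sh) | (hi << (64 - sh)) : lo;
}

/* Extract `width` bits (1..64) at bit offset `off` (0..63) from the
   128-bit value hi:lo. */
static uint64_t extract128(uint64_t lo, uint64_t hi, unsigned off, unsigned width)
{
    /* Same trick as "or rdi,-1 ; shrx rdi,rdi,rbx": all ones shifted
       right by the number of excess bits (64 - width). */
    uint64_t mask = ~UINT64_C(0) >> (64 - width);
    return funnel_shr(lo, hi, off) & mask;
}
```

As in the assembly, both words are always read, so the caller has to guarantee that the second word is accessible even when the field does not straddle.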

Terje

I added an optional second dest register field to my ISA to allow
operations like wide bit field extract and insert across a pair of
registers.
Also for wide arithmetic.

I was thinking of variable length LDV and STV load & store instructions
to work with variable length byte fields from 1 to 16 bytes.

LDV has two dst registers, a normal byte address specifier,
and a byte count from 1 to 16 to load. All high order bytes
not written by the LDV are zero filled.
The byte count can be an immediate or in a register.

STV does the same for stores with a pair of source value registers.

LDV and STV only touch the memory bytes they actually load or store.
So if the actual address + byte count does not touch a second 64-bit
memory word then they don't touch the next cache line or next page
in the case of potential page straddles.

This allows code to LDV up to 16 bytes into a register pair
extract and insert up to 64-bit fields in that register pair,
then STV only the bytes operated on,
with HW taking care of the special cases of straddle/not-straddle.

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From John Levine@21:1/5 to All on Mon May 6 21:08:14 2024

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't mean it
has to implement fetching a single byte from memory by doing a shift
and mask operation in a 64-bit register. Instead, each byte of the bus
could have a direct wired path to the low 8-bits of the internal data
bus feeding the registers.

I was more thinking about storing bit fields, where you probably have
to fetch the whole word or cache line or whatever, shift the new field
into it, and then store it back. You already have to do something like
that for byte stores but bit addressing makes it 8 times as hairy.
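The fetch, merge and store-back sequence for a bit-field store looks roughly like this in C. It is a sketch with assumed names and an assumed memory model (an array of 64-bit words, bit 0 being the least significant bit of word 0), not a description of any real hardware path.

```c
#include <stdint.h>

/* Store the low `width` bits (1..64) of `value` at absolute bit position
   `bitpos`. Every word the field touches is read, merged and written back. */
static void bf_store(uint64_t *mem, uint64_t bitpos, unsigned width, uint64_t value)
{
    uint64_t word = bitpos / 64;
    unsigned off  = (unsigned)(bitpos % 64);
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << width) - 1);

    value &= mask;
    mem[word] = (mem[word] & ~(mask << off)) | (value << off);

    if (off + width > 64) {
        /* Straddle: the remaining high bits go into the next word.
           off is nonzero here, so `done` is in 1..63. */
        unsigned done = 64 - off;   /* bits already placed in the low word */
        mem[word + 1] = (mem[word + 1] & ~(mask >> done)) | (value >> done);
    }
}
```

A byte store needs the same read-merge-write at the ECC or cache-line level, but there the merge granularity is 8 positions per word instead of 64.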
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Scott Lurndal@21:1/5 to BGB on Mon May 6 22:34:23 2024

BGB <[email protected]> writes:

On 5/5/2024 12:20 PM, John Savard wrote:

On Wed, 1 May 2024 00:09:28 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

Also, though, for ease of conversion, the order of BCD digits _should
be the same as the order of the characters of which these digits are
the last four bits_ in the representation of a decimal number as a
character string.

And that means big-endian.

If you have decimal arithmetic, there's a direct connection between
how numbers are represented for reading and writing, and how they are
represented for internal arithmetic.

Why would one burn 8 bits per BCD digit?...

When processing numeric character data. The B3500 did that
natively - the address controller on each operand selected
the format of the operand (4-bit signed, 4-bit unsigned, 8-bit unsigned);
in 8-bit forms, the processor ignored the most significant digit
of the byte (ascii 0x3, ebcdic 0xf).

The B2D and D2B instructions converted between decimal and
binary representations (maximum magnitude 10**100-1).
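The link between character digits and 4-bit digits can be shown with a short C sketch. This is not the B3500's actual operand handling; the function name and the two-digits-per-byte, most-significant-first packing are assumptions chosen to illustrate why keeping digits in character order makes the conversion trivial.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack numeric characters into 4-bit BCD digits, two per byte, most
   significant digit first. Only the low nibble of each character is kept,
   so ASCII ('0' = 0x30) and EBCDIC ('0' = 0xF0) digits pack identically.
   `out` must hold (n + 1) / 2 bytes; an odd digit count gets a leading
   zero digit. */
static void pack_bcd(const unsigned char *digits, size_t n, uint8_t *out)
{
    size_t pad = n & 1;   /* 1 if a leading zero digit is needed */

    for (size_t i = 0; i < (n + 1) / 2; i++)
        out[i] = 0;

    for (size_t i = 0; i < n; i++) {
        size_t  pos = i + pad;                      /* position in padded number */
        uint8_t d   = (uint8_t)(digits[i] & 0x0F);  /* drop the zone nibble */
        out[pos / 2] |= (pos & 1) ? d : (uint8_t)(d << 4);
    }
}
```

Because the digit order is the same in both forms, the loop never has to reverse anything; it only discards the zone nibble.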

From Michael S@21:1/5 to BGB on Tue May 7 02:49:31 2024

On Mon, 6 May 2024 01:47:15 -0500
BGB <[email protected]> wrote:

RISC-V is quickly gaining ground in the microcontroller space,
displacing ARM (Cortex-M / Thumb2).

I don't see it.
RISC-V right now is mostly in small cores doing auxiliary functions in
bigger SoCs. General-purpose 32-bit MCUs are very strongly dominated by
Cortex-M. I don't believe that RISC-V is in the top 3 by volume in that
space. I would expect that second-tier players like Xtensa cores, TI C2000,
and some of the Renesas cores sell more than RISC-V.

From MitchAlsup1@21:1/5 to BGB on Tue May 7 00:53:29 2024

BGB wrote:

On 5/6/2024 2:11 PM, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

Say, RISC-V:
Says yes to DIV and MOD;
Says yes to 4-register floating-point multiply-accumulate;
Says no to register-indexed Load/Store.
Me: This is not a good balance...

Multiply-accumulate is at least as much about reducing rounding error
as about speed.

It is also an IEEE 754-2008+ requirement.

And... I have a version that just sort of works well enough to make
RV64G work, but is sort of a fail on the other fronts:
Using it is slower than separate ops;
It produces a double-rounded result.
Also, well, the FMUL isn't super accurate either.

So, it fails IEEE 754-accuracy requirements.

FMUL is implemented in a way where it only generates the high-half of
the multiply, which makes the FPU cheaper, but:
Does not give strict 0.5ULP rounding.

Also failing IEEE 754-accuracy requirements.

Some combination of factors leads to the inability of Newton-Raphson to
fully converge, possibly either due to omitting the low-order multiplier results, or the carry-propagation limitation for rounding (if the
rounding would result in more than 8 bits of carry, it is skipped).

Newton-Raphson is dependent on getting the bits right so that its
interpolation (between iterations) converges properly.

Not likely to do proper FMA, as this would make a Binary64 FPU too
expensive (and, doing Binary64 poorly is still preferable for most uses
to not doing it at all).

And yet, every other non FPGA implementation achieves those requirements.

It really seems that your medium is influencing your architecture,
rather than the other way around.

Granted, not entirely sure how the 8087 managed to do all the stuff that
it did. Since, it seems like an 80s ASIC would be more cramped than a
modern Artix-7.

Mostly it was simply slow.

....

From MitchAlsup1@21:1/5 to John Levine on Tue May 7 00:57:00 2024

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't mean it
has to implement fetching a single byte from memory by doing a shift
and mask operation in a 64-bit register. Instead, each byte of the bus
could have a direct wired path to the low 8-bits of the internal data
bus feeding the registers.

I was more thinking about storing bit fields, where you probably have
to fetch the whole word or cache line or whatever, shift the new field
into it, and then store it back. You already have to do something like
that for byte stores but bit addressing makes it 8 times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a byte
accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the
LD/ST pipeline, 2) most programs use as little bit-fielding as
possible (not as much as practical) !!!

From MitchAlsup1@21:1/5 to Chris M. Thomasson on Tue May 7 02:04:37 2024

Chris M. Thomasson wrote:

On 5/6/2024 12:15 PM, MitchAlsup1 wrote:

Terje Mathisen wrote:

MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/5/2024 3:25 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 5:12 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 3:18 AM, Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.

Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in pointers
before for a special type of counting.

Even if one has multi-location ATOMICs ?? (as a single event ??)

This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_ a
reference count together atomically. This can easily be done with DWCAS,
or double-width compare-and-swap. So, on a 32-bit system we need a 64-bit
CAS; on a 64-bit system we need a 128-bit CAS. However, sometimes we can
pack the reference count in the pointer value itself if it's aligned on a
big enough boundary. Then we can update the pointer and the reference
count using normal word-based atomic RMWs.

I understand why you had to pack the pointer and a chunk of data into a
single container.

What I don't understand is if you had easy access to multi-container
ATOMICs the packing would be unnecessary--would it not ?? That is in one
ATOMIC event you could update the pointer and the chunk of data
independently and not NEED to store them in a single container.

Well, actually, a pessimistic word based fetch-and-add (LOCK XADD)
is enough to increment the counter and load a pointer atomically all
in one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?

Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits in
total. Further postulate that you need to update both in a single
non-blocking ATOMIC event. ...

"Any programming problem can be solved with an additional layer of
indirection", so in this case you create a handle to that 72-bit item,
and require all access to go via the handle?

I am not trying to add an additional layer of indirection, I am trying
(unsuccessfully it appears) to get Chris to think outside of the one
container ATOMIC box.

LOCK XADD vs a CAS loop? I prefer the former.

Those are not the only options.

The addendum to the rule above is of course ", except the problem of
too many layers of indirections". :-)

Terje

From Lawrence D'Oliveiro@21:1/5 to BGB on Tue May 7 04:55:23 2024

On Mon, 6 May 2024 01:47:15 -0500, BGB wrote:

On 5/5/2024 9:30 PM, Lawrence D'Oliveiro wrote:

I think [RISC-V]’s already shipping in the billions of units per
year--enough to make it the world’s second-most-popular CPU
architecture, after ARM.

Yeah, seemingly right now, x86, ARM, and RISC-V are the top 3 ...

Last I heard, x86 is in fourth place, after MIPS.

From Terje Mathisen@21:1/5 to Terje Mathisen on Tue May 7 07:39:18 2024

Terje Mathisen wrote:

EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

On 5/5/2024 10:31 AM, Scott Lurndal wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper
than shift+mask (and probably adding some mechanism to pattern-match
it from the AST or similar).

But, but but but:: it IS shift and Mask !!

Annoyingly, a good general case instruction could not be encoded in
a 32-bit instruction form at this point (could either add a few
special cases as 32-bit ops, or use a 64-bit encoding; or do it as a
2RI op rather than 3RI but this is lame...).

Then again, say:
   BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.

    SL    Rd,Rc,<width:offset>

Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)

    SR    Rd,Rc,<width:offset>

Positions the value in a register (Rc) such that it fits the alignment of
a field.

    INS   Rd,Rc,Rf,<width:offset>

Inserts the field from Rf into its position <w:o> in Rc and delivers
the new container to Rd.

I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.

An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.

But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.

x86 does not have bitfield insert/extract, but it does have SHRD/SHLD so
it is fairly easy to handle arbitrary length (<= 64 bits) and alignment:

; RSI -> target, CL = bit offset of field, RBX = 64 - field size (0..63)
mov rax,[rsi]
mov rdx,[rsi+8]

shrd rax,rdx,cl    ; bit offset

and rax,bitmask[rbx*8] ; 64 mask entries.

The last instruction can also be replaced with

shlx rax,rax,rbx    ; Nr of excess bits (64-field to extract)
shrx rax,rax,rbx

or the entire thing can be replaced with this one which calculates the
mask on the fly:

mov rax,[rsi]
mov rdx,[rsi+8]
or rdi,-1        ; Generate mask

shrd rax,rdx,cl    ; bit offset
shrx rdi,rdi,rbx    ; excess bits to mask away

and rax,rdi

All seems like about 3 clock cycles when hitting the cache.

I realized this morning that with arbitrary alignment and both signed
and unsigned extract, it is better to always shift up first to get rid
of the excess and then shift down to align. The main problem here is
that you now need different code for straddling and non-straddling items
since shifts (including double-wide shifts) have to be less than 64
bits. :-(

This is not a problem for constant length and alignment since the
compiler can choose the correct pattern, but for codecs and compression
it does not work. (Or at least not for those 57..64 field lengths).

mov rax,[rsi]
shl rax,cl ; Excess bits above the field we need
shrx rax,rax,rbx ; rbx=64-field length

The last instruction would be

sarx rax,rax,rbx

if you wanted a signed bitfield.
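In C the shift-up-then-shift-down pattern might look like the sketch below, for the non-straddling case only. The function names are illustrative. Note that the signed variant leans on two implementation-defined behaviours (conversion of a large unsigned value to `int64_t`, and right shift of a negative value), both of which mainstream compilers resolve as two's-complement sign extension.

```c
#include <stdint.h>

/* Unsigned extract of `width` bits (1..64) at bit offset `off` from one
   64-bit word, assuming off + width <= 64. */
static uint64_t extract_u(uint64_t w, unsigned off, unsigned width)
{
    w <<= 64 - off - width;     /* shift up: discard bits above the field */
    return w >> (64 - width);   /* logical shift down (shrx): zero extend */
}

/* Signed extract under the same assumptions. */
static int64_t extract_s(uint64_t w, unsigned off, unsigned width)
{
    w <<= 64 - off - width;     /* the field's top bit is now bit 63 */
    /* Arithmetic shift down (sarx): implementation-defined in C, but
       sign-extending on mainstream compilers. */
    return (int64_t)w >> (64 - width);
}
```

The only difference between the two is the final shift, which matches the shrx versus sarx choice in the assembly above.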

No matter how you do it, it will become a bottleneck in any Huffman
token extractor or similar code. In my own decoders I've tended to
grab a 32-bit (in the old days) or 64-bit chunk into a register and
immediately align it. Then I'll use a lookup table over the first N
(typically 6-12) bits of this buffer value and let the table decide how
many bits to keep for the token, or in the case of longer tokens, select
a second-level table to look up the remaining bits.

After decrementing the buffer bits remaining counter I'll branch out to
refill it, but only if I have at least 32 or 48 free bits. This keeps
the number of refills fairly low.
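A stripped-down C sketch of that decoder structure follows. It is not the code from any real decoder: the type and function names are invented, the 8-bit first-level index is an assumed value, the refill works a byte at a time for simplicity, and second-level tables and end-of-stream checks are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define PEEK_BITS 8u   /* first-level table index width; an assumed value */

typedef struct {
    const uint8_t *src;   /* next unread input byte */
    const uint8_t *end;   /* one past the last input byte */
    uint64_t buf;         /* unread bits, left-aligned (next bit is bit 63) */
    unsigned bits;        /* number of valid bits in buf */
} bit_reader;

/* len must be 1..PEEK_BITS in this simplified single-level scheme. */
typedef struct { uint8_t len; uint8_t symbol; } table_entry;

/* Refill only when at least 32 bits are free, so refills stay infrequent. */
static void br_refill(bit_reader *br)
{
    if (64 - br->bits < 32)
        return;
    while (br->bits <= 56 && br->src < br->end) {
        br->buf |= (uint64_t)*br->src++ << (56 - br->bits);
        br->bits += 8;
    }
}

/* Decode one token: index the table with the next PEEK_BITS bits, then
   drop only as many bits as the table says the token used. */
static unsigned br_decode(bit_reader *br, const table_entry *table)
{
    br_refill(br);
    table_entry e = table[br->buf >> (64 - PEEK_BITS)];
    br->buf <<= e.len;
    br->bits -= e.len;
    return e.symbol;
}
```

Keeping the buffer left-aligned means the peek is a single shift and consuming a token is another single shift, so the per-token bit-field work never has to deal with straddles; those are confined to the refill.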

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to BGB on Tue May 7 07:42:06 2024

BGB wrote:

On 5/6/2024 2:11 PM, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:

Say, RISC-V:
  Says yes to DIV and MOD;
  Says yes to 4-register floating-point multiply-accumulate;
  Says no to register-indexed Load/Store.
Me: This is not a good balance...

Multiply-accumulate is at least as much about reducing rounding error
as about speed.
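
The rounding-error point can be made concrete with a small Python sketch. It does not use a hardware FMA; it emulates a single-rounding multiply-add with exact rationals from the standard `fractions` module, and the function names are illustrative only.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Round after the multiply and again after the add."""
    return a * b + c


a = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)
# The exact product a*a is 1 + 2**-26 + 2**-54; the last term is lost
# when the product is rounded to a double before the add.
fused = fused_multiply_add(a, a, c)
separate = separate_multiply_add(a, a, c)
```

Here the fused form keeps the 2**-54 term while the separate form returns zero.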

It is also an IEEE 754-2008+ requirement.

And... I have a version that just sort of works well enough to make
RV64G work, but is sort of a fail on the other fronts:
Using it is slower than separate ops;
It produces a double-rounded result.
Also, well, the FMUL isn't super accurate either.

FMUL is implemented in a way where it only generates the high-half of
the multiply, which makes the FPU cheaper, but:
Does not give strict 0.5ULP rounding.

Some combination of factors leads to the inability of Newton-Raphson to fully converge, possibly either due to omitting the low-order multiplier results, or the carry-propagation limitation for rounding (if the
rounding would result in more than 8 bits of carry, it is skipped).

Not likely to do proper FMA, as this would make a Binary64 FPU too
expensive (and, doing Binary64 poorly is still preferable for most uses
to not doing it at all).

Granted, not entirely sure how the 8087 managed to do all the stuff that
it did. Since, it seems like an 80s ASIC would be more cramped than a
modern Artix-7.

Relatively easy to explain: It was _very_ slow, but still much faster
than emulating it with an 8088 that needed 4 clock cycles for every
single code or data byte touched.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to Chris M. Thomasson on Tue May 7 07:45:59 2024

Chris M. Thomasson wrote:

On 5/5/2024 11:13 PM, Terje Mathisen wrote:

MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/5/2024 3:25 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 5:12 PM, MitchAlsup1 wrote:

Chris M. Thomasson wrote:

On 5/4/2024 3:18 AM, Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.

That was Donald Knuth's idea.

Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in
pointers before for a special type of counting.

Even if one has multi-location ATOMICs ?? (as a single event ??)

This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_
a reference count together atomically. This can easily be done with
DWCAS, or double-width compare-and-swap. So, on a 32-bit system we
need a 64-bit CAS; on a 64-bit system we need a 128-bit CAS. However,
sometimes we can pack the reference count into the pointer value
itself if it's aligned on a big enough boundary. Then we can update
the pointer and the reference count using normal word-based atomic
RMWs.
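
The packing arithmetic can be sketched in Python as below. This only shows the bit layout: Python integers are not atomic hardware words, the 64-byte alignment is an assumed value, and `pack`, `unpack` and `acquire` are hypothetical names.

```python
ALIGN = 64               # assumed alignment; frees the low 6 bits of the address
COUNT_MASK = ALIGN - 1


def pack(addr, count):
    """Store a small count in the unused low bits of an aligned address."""
    assert addr % ALIGN == 0 and 0 <= count <= COUNT_MASK
    return addr | count


def unpack(word):
    """Split a packed word back into (address, count)."""
    return word & ~COUNT_MASK, word & COUNT_MASK


def acquire(word):
    """The arithmetic a single fetch-and-add of 1 would perform.

    Returns (old word, new word); the old word yields the pointer and the
    previous count in one step.
    """
    assert (word & COUNT_MASK) < COUNT_MASK, "count would spill into the pointer"
    return word, word + 1
```

The count is limited by the alignment, which is why the boundary has to be "big enough" for the number of references expected.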

I understand why you had to pack the pointer and a chunk of data into a
single container.

What I don't understand is if you had easy access to multi-container
ATOMICs the packing would be unnecessary--would it not ?? That is in one
ATOMIC event you could update the pointer and the chunk of data
independently and not NEED to store them in a single container.

Well, actually, a pessimistic word based fetch-and-add (LOCK XADD)
is enough to increment the counter and load a pointer atomically all
in one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?

Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits
total. Further postulate that you need to update both in a single
non-blocking ATOMIC event. ...

"Any programming problem can be solved with an additional layer of
indirection", so in this case you create a handle to that 72-bit item,
and require all access to go via the handle?

The addendum to the rule above is of course ", except the problem of
too many layers of indirections". :-)

I remember looking at one of your atomic queues that only used LOCK XADD
on x86. Why would you use CAS for that? I don't know. I see no need for
multi-atomics for any of it....

Why should I have to use emojis when I think I'm being clearly
sarcastic? :-(

Please note my addendum above!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Stephen Fuld@21:1/5 to All on Tue May 7 06:35:53 2024

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.

I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do
something like that for byte stores but bit addressing makes it 8
times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be faster
than doing all the shifts, masks, word crossing calculations, etc. via
extra instructions? Again, almost certainly. So you keep the benefits
of byte oriented loads most of the time, but have "reasonable" access
to bit fields when you need them, faster than without the
extra instructions. Hopefully the best of both worlds.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 7 06:40:23 2024

On Sun, 5 May 2024 01:33:39 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

So using the same register name to address a halfword gives you the low
half of the register, not the high half?

Whereas using the same memory address to address a halfword gives you
the high half of the word at that location, not the low half?

... correct.

So you are backing up what I’m claiming, that in accessing parts of registers, big-endian architectures behave just like little-endian ones?
How exactly is that supposed to prove me wrong?

From Lawrence D'Oliveiro@21:1/5 to All on Tue May 7 06:45:02 2024

On Tue, 7 May 2024 00:53:29 +0000, MitchAlsup1 wrote:

BGB wrote:

Granted, not entirely sure how the 8087 managed to do all the stuff
that it did. Since, it seems like an 80s ASIC would be more cramped
than a modern Artix-7.

Mostly it was simply slow.

Also it used a stack-based programming paradigm. This was not efficient,
and frequently awkward.

From Lawrence D'Oliveiro@21:1/5 to BGB on Tue May 7 06:45:43 2024

On Mon, 6 May 2024 21:32:53 -0500, BGB wrote:

Yes, but then again, I make no claim that it is IEEE-754 conformant,
merely that it uses the same formats, and is "good enough" for most
stuff one needs an FPU for.

That’s what all the hardware engineers thought, back in the 1990s.

From Lawrence D'Oliveiro@21:1/5 to All on Tue May 7 06:47:09 2024

On Mon, 6 May 2024 19:13:51 +0000, MitchAlsup1 wrote:

Placing bit-field access INSIDE LDs and STs requires adding 2 stages of multiplexing in the LD/ST aligners (memory shifters). This has the
potential to slow the overall pipeline frequency--at which point you
have lost more than you can gain.

Of course bit field extraction/insertion should require special
instructions, not be a part of every load/store.

From Lawrence D'Oliveiro@21:1/5 to John Savard on Tue May 7 06:49:48 2024

On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:

But we no longer have this problem.

But the other reasons for going little-endian still exist.

From Lawrence D'Oliveiro@21:1/5 to BGB on Tue May 7 06:42:26 2024

On Tue, 7 May 2024 00:33:30 -0500, BGB wrote:

I was thinking more in terms of popularity/mindshare ...

You mean, PR terms? In this newsgroup??

From Michael S@21:1/5 to Stephen Fuld on Tue May 7 11:47:42 2024

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.

I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do something like that for byte stores but bit addressing makes it 8
times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing calculations,
etc. via extra instructions? Again, almost certainly. So you keep
the benefits of byte oriented loads most of the time, but have
"reasonable" access to bit fields when you need them, faster than
without the extra instructions. Hopefully the best of both worlds.

When you load bit field from memory, there is very high chance that you
would want adjacent bit field soon thereafter.
Think about it.

From EricP@21:1/5 to All on Tue May 7 09:25:00 2024

MitchAlsup1 wrote:

EricP wrote:

I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.

An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.

This is what CARRY is for--access to 128-bit in 2×64-bit out shifts.
CARRY can be used for extracts and for inserts.

But if the in-memory bit field is > 56 bits wide it may or may not
straddle a single 64-bit memory location, and require a pair of
registers to be loaded.

I don't understand 56--56 takes just as many bits to encode as 63 ?!?

Here I'm referring to the two different ways one can load memory
for bit fields: I can load 64-bit aligned words or byte aligned words
(here "word" means 64 bits).

One constraint I put on the following is that it must only touch the
next cache line or page if it must read bits from it - it must not
cause gratuitous cache line misses or page faults due to loading
unnecessary bytes. For 64-bit word aligned loads this is inherently true,
but for byte aligned loads care must be taken.

If I load 64-bit aligned then I ignore (mask out) the low 3 bits from
the address, but those 3 bits have to be inserted back into the field
start offset as its high order bits, giving the 6-bit field start offset.
If the field length+offset > 64 then the end of bit field straddles a word
so I have a conditional load of the next sequential word into a second
register to hold the high part of the field.

Alternatively I can load a 64-bit word from a byte aligned address.
In this case I don't need to merge the byte-offset bits with the
bit-offset bits because the byte align shifter took care of that.
This allows a bit field up to 56-bits wide to be loaded without
having to check for a straddle and possibly load the high part.
Since the high part can only be maximum of 8 bits (because the prior
load took care of the lower 56 bits and the largest field is 64 bits)
the second is a byte load so that it doesn't touch any bytes beyond
the one it needs.
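
The byte-aligned variant can be sketched in Python as follows. This is a model under stated assumptions, not a description of any real load unit: memory is a little-endian `bytes` object, `bit_addr` counts bits from its start, the function name is invented, and the sketch reads only the bytes the field actually covers so that it never touches a byte it does not need.

```python
def load_field_byte_aligned(mem, bit_addr, width):
    """Load an unsigned bit field of 1..64 bits from a byte-aligned start."""
    assert 1 <= width <= 64
    byte_addr, bit_off = divmod(bit_addr, 8)    # bit_off is 0..7
    n_bytes = (bit_off + width + 7) // 8        # bytes the field covers, 1..9
    first = min(n_bytes, 8)
    # Main load: up to 8 bytes starting at the byte that holds the first bit.
    value = int.from_bytes(mem[byte_addr:byte_addr + first], "little") >> bit_off
    if n_bytes == 9:
        # Straddle: the few remaining high bits fit in one more byte.
        value |= mem[byte_addr + 8] << (64 - bit_off)
    return value & ((1 << width) - 1)
```

In this model the ninth byte is needed only when the bit offset plus the width exceeds 64, which cannot happen for fields of 56 bits or fewer.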

As I see it, the main difference between these is how they handle
multiple bit field accesses, possibly adjacent to the first bit field
and therefore possibly loaded into one of the two above registers.

The first version above looks easier to optimize for multiple bit fields
than the second, but I haven't actually worked it through.

From Stephen Fuld@21:1/5 to Michael S on Tue May 7 17:24:02 2024

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register.
Instead, each byte of the bus could have a direct wired path
to the low 8-bits of the internal data bus feeding the
registers.

I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift
the new field into it, and then store it back. You already have
to do something like that for byte stores but bit addressing
makes it 8 times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a
byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing calculations,
etc. via extra instructions? Again, almost certainly. So you keep
the benefits of byte oriented loads most of the time, but have
"reasonable" access to bit fields when you need them, faster than
without the extra instructions. Hopefully the best of both worlds.

When you load bit field from memory, there is very high chance that
you would want adjacent bit field soon thereafter.

Yes. There are two aspects of this, setting the displacement of the
next field, and the time it takes to access that field. For the first,
my proposal took advantage of the MY 66000's capability of instruction
modifiers to (optionally) add the length of the loaded bit field to the
register that contains the bit displacement. So the addressing is
already set up for a subsequent load bit field instruction to load the
adjacent bit field. For the time to access that field, it depends.
For a low end implementation, the target data for the subsequent load
would already be in the L1 cache, so not too bad. Higher end
implementations could take advantage of the MY 66000's streaming
buffers such that the data would already be "close" to the ALU. As I
have often said, IANAHG, so I may have the details wrong.

Think about it.

Thanks, I have. :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From EricP@21:1/5 to Terje Mathisen on Tue May 7 15:06:12 2024

Terje Mathisen wrote:

Terje Mathisen wrote:

EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

On 5/5/2024 10:31 AM, Scott Lurndal wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Not as of yet in my case, but bitfield extract might happen
eventually.
Issue is finding a way to pull it off that is useful and cheaper
than shift+mask (and probably adding some mechanism to
pattern-match it from the AST or similar).

But, but but but:: it IS shift and Mask !!

Annoyingly, a good general case instruction could not be encoded in
a 32-bit instruction form at this point (could either add a few
special cases as 32-bit ops, or use a 64-bit encoding; or do it as
a 2RI op rather than 3RI but this is lame...).

Then again, say:
  BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.

   SL   Rd,Rc,<width:offset>

Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into an 8-bit or 12-bit or 47 bit for whatever
purpose is needed)

   SR   Rd,Rc,<width:offset>

Positions the value in a register (Rc) such that it fits the alignment of
a field.

   INS  Rd,Rc,Rf,<width:offset>

Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.

I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.

An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.

But if the in-memory bit field is > 56 bits wide it may or may not
straddle a single 64-bit memory location, and require a pair of
registers to be loaded.

x86 does not have bitfield insert/extract, but it does have SHRD/SHLD
so it is fairly easy to handle arbitrary length (<= 64 bits) and
alignment:

; RSI -> target, RCX = # bits to extract, RBX = 64-field size (0..63)
mov rax,[rsi]
mov rdx,[rsi+8]

This is what I wanted to avoid: blindly loading the next word
as that could unnecessarily read a cache line or worse,
trap on an access violation.

It's not that it is difficult to avoid, it just adds to the fiddliness
(like conditional branches around one or two instructions).

shrd rax,rdx,cl ; bit offset

and rax,bitmask[rbx*8] ; 64 mask entries.

The last instruction can also be replaced with

shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
shrx rax,rax,rbx

or the entire thing can be replaced with this one which calculates the
mask on the fly:

mov rax,[rsi]
mov rdx,[rsi+8]
or rdi,-1 ; Generate mask

shrd rax,rdx,cl ; bit offset
shrx rdi,rdi,rbx ; excess bits to mask away

and rax,rdi

All seems like about 3 clock cycles when hitting the cache.

I realized this morning that with arbitrary alignment and both signed
and unsigned extract, it is better to always shift up first to get rid
of the excess and then shift down to align. The main problem here is
that you now need different code for straddling and non-straddling items since shifts (including double-wide shifts) have to be less than 64
bits. :-(

This is not a problem for constant length and alignment since the
compiler can choose the correct pattern, but for codecs and compression
it does not work. (Or at least not for those 57..64 field lengths).

mov rax,[rsi]
shl rax,cl ; Excess bits above the field we need
shrx rax,rax,rbx ; rbx=64-field length

The last instruction would be

sarx rax,rax,rbx

if you wanted a signed bitfield.

No matter how you do it, it will become a bottleneck in any Huffman
token extractor or similar codes. In my own decoders I've tended to
grab a 32 (in the old days) or 64-bit chunk into a register and
immediately align it. Then I'll use a lookup table over the first N (typically 6-12) bits of this buffer value and let the table decide how
many bits to keep for the token, or in the case of longer tokens, select
a second-level table to lookup the remaining bits.

After decrementing the buffer bits remaining counter I'll branch out to refill it, but only if I have at least 32 or 48 free bits. This keeps
the number of refills fairly low.

Terje

There seem to be two use cases, one for bit-wise load and store to
individual bit fields in compiled structures, the other is dynamic
bit fields in bit streams.

The first is bit sized elements in packed arrays, or packed structs,
or packed arrays of packed structs, or packed structs containing packed
array of bit fields, etc. These are supported by some languages
(Ada85 had optional packed arrays and record structs).
For these the field start bit-offset is dynamic but the field size and
type are compile constants and so offer some potential for optimization
(but that could require inlining some of the access subroutines).

Such fields would tend to be both read and written in semi-random order
but with a high probability that nearby fields will also be accessed.

The other is bit fields in bit streams being processed sequentially from
lsb to msb order, e.g. for a codec. For these the field size and type are
dynamic but the token start offset can be arranged to be in bit[0].
If you know the bit-wise token always starts in bit[0] you don't need to
deal with field straddles, but must dynamically track where the last valid in-register bit is and detect when to load the next word and append to the register bit stream.
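
The sequential case can be sketched as a small Python bit reader that always keeps the next token at bit 0 and tracks how many bits in the buffer are valid. Assumptions: input bytes are appended least-significant-first, tokens are at most 56 bits so that a byte-wise refill never overflows a 64-bit register, and the class and method names are invented for illustration.

```python
class BitReader:
    """LSB-first bit stream reader; the next token always starts at bit 0."""

    def __init__(self, data):
        self.data = data
        self.pos = 0      # next byte to load
        self.buf = 0      # unread bits
        self.valid = 0    # number of valid bits in buf

    def read(self, width):
        assert 1 <= width <= 56
        # Append whole bytes above the last valid bit until the token fits.
        while self.valid < width:
            if self.pos >= len(self.data):
                raise EOFError("bit stream exhausted")
            self.buf |= self.data[self.pos] << self.valid
            self.pos += 1
            self.valid += 8
        token = self.buf & ((1 << width) - 1)
        self.buf >>= width       # realign so the next token is at bit 0
        self.valid -= width
        return token
```

No straddle handling is needed in `read`, at the cost of tracking `valid` on every call.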

Bit stream processing would likely be either write-only encode or read-only decode, proceeding once serially either low to high or high to low order.

Both would simplify greatly with double-wide shifts of register pairs,
as well as double-wide bit field extract and insert.

From MitchAlsup1@21:1/5 to EricP on Tue May 7 20:56:09 2024

EricP wrote:

Terje Mathisen wrote:

Terje Mathisen wrote:

EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

On 5/5/2024 10:31 AM, Scott Lurndal wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Not as of yet in my case, but bitfield extract might happen
eventually.
Issue is finding a way to pull it off that is useful and cheaper
than shift+mask (and probably adding some mechanism to
pattern-match it from the AST or similar).

But, but but but:: it IS shift and Mask !!

Annoyingly, a good general case instruction could not be encoded in
a 32-bit instruction form at this point (could either add a few
special cases as 32-bit ops, or use a 64-bit encoding; or do it as
a 2RI op rather than 3RI but this is lame...).

Then again, say:
  BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
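
Read literally, the BITEXTR formula packs a 6-bit shift count and a 4-bit width into the 10-bit immediate. A direct Python transcription, with `bitextr` as an illustrative name and the register modelled as a non-negative integer:

```python
def bitextr(rn, imm10):
    """Rn = (Rn >> (Imm & 63)) & ((1 << ((Imm >> 6) & 15)) - 1)"""
    shift = imm10 & 63            # low 6 bits: bit offset 0..63
    width = (imm10 >> 6) & 15     # high 4 bits: field width 0..15
    return (rn >> shift) & ((1 << width) - 1)
```

As written, the width field tops out at 15 bits, which is the "special case" nature of the encoding.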

   SL   Rd,Rc,<width:offset>

Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into an 8-bit or 12-bit or 47 bit for whatever
purpose is needed)

   SR   Rd,Rc,<width:offset>

Positions the value in a register (Rc) such that it fits the alignment of
a field.

   INS  Rd,Rc,Rf,<width:offset>

Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.

I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.

An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.

But if the in-memory bit field is > 56 bits wide it may or may not
straddle a single 64-bit memory location, and require a pair of
registers to be loaded.

x86 does not have bitfield insert/extract, but it does have SHRD/SHLD
so it is fairly easy to handle arbitrary length (<= 64 bits) and
alignment:

; RSI -> target, RCX = # bits to extract, RBX = 64-field size (0..63)
mov rax,[rsi]
mov rdx,[rsi+8]

This is what I wanted to avoid: blindly loading the next word
as that could unnecessarily read a cache line or worse,
trap on an access violation.

It's not that it is difficult to avoid, it just adds to the fiddliness
(like conditional branches around one or two instructions).

shrd rax,rdx,cl ; bit offset

and rax,bitmask[rbx*8] ; 64 mask entries.

The last instruction can also be replaced with

shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
shrx rax,rax,rbx

or the entire thing can be replaced with this one which calculates the
mask on the fly:

mov rax,[rsi]
mov rdx,[rsi+8]
or rdi,-1 ; Generate mask

shrd rax,rdx,cl ; bit offset
shrx rdi,rdi,rbx ; excess bits to mask away

and rax,rdi

All seems like about 3 clock cycles when hitting the cache.

I realized this morning that with arbitrary alignment and both signed
and unsigned extract, it is better to always shift up first to get rid
of the excess and then shift down to align. The main problem here is
that you now need different code for straddling and non-straddling items
since shifts (including double-wide shifts) have to be less than 64
bits. :-(

This is not a problem for constant length and alignment since the
compiler can choose the correct pattern, but for codecs and compression
it does not work. (Or at least not for those 57..64 field lengths).

mov rax,[rsi]
shl rax,cl ; Excess bits above the field we need
shrx rax,rax,rbx ; rbx=64-field length

The last instruction would be

sarx rax,rax,rbx

if you wanted a signed bitfield.

No matter how you do it, it will become a bottleneck in any Huffman
token extractor or similar codes. In my own decoders I've tended to
grab a 32 (in the old days) or 64-bit chunk into a register and
immediately align it. Then I'll use a lookup table over the first N
(typically 6-12) bits of this buffer value and let the table decide how
many bits to keep for the token, or in the case of longer tokens, select
a second-level table to lookup the remaining bits.

After decrementing the buffer bits remaining counter I'll branch out to
refill it, but only if I have at least 32 or 48 free bits. This keeps
the number of refills fairly low.

Terje

There seem to be two use cases, one for bit-wise load and store to
individual bit fields in compiled structures, the other is dynamic
bit fields in bit streams.

The first is bit sized elements in packed arrays, or packed structs,
or packed arrays of packed structs, or packed structs containing packed
array of bit fields, etc. These are supported by some languages
(Ada85 had optional packed arrays and record structs).
For these the field start bit-offset is dynamic but the field size and
type are compile constants and so offer some potential for optimization
(but that could require inlining some of the access subroutines).

Such fields would tend to be both read and written in semi-random order
but with a high probability that nearby fields will also be accessed.

The other is bit fields in bit streams being processed sequentially from
lsb to msb order, e.g. for a codec. For these the field size and type are
dynamic but the token start offset can be arranged to be in bit[0].
If you know the bit-wise token always starts in bit[0] you don't need to
deal with field straddles, but must dynamically track where the last valid in-register bit is and detect when to load the next word and append to the register bit stream.

Bit stream processing would likely be either write-only encode or read-only decode, proceeding once serially either low to high or high to low order.

Both would simplify greatly with double-wide shifts of register pairs,
as well as double-wide bit field extract and insert.

If you have the latter {double-wide bit field extract and insert} why do
you need the former {double-wide shifts of register pairs}?

And by double-wide bit field extracts--you mean the container is 2 registers wide and the extracted result is 64-bits (or less) wide; and that for insert the value being inserted is 64-bits wide and the container it is being inserted into is 2 registers wide.
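
Under that reading, double-wide extract and insert could be modelled in Python as below. This is a sketch of the semantics only, with invented names; the container is the register pair hi:lo and the field is at most 64 bits wide.

```python
MASK64 = (1 << 64) - 1


def extract_dw(lo, hi, offset, width):
    """Extract a field of up to 64 bits from a 128-bit container hi:lo."""
    assert 1 <= width <= 64 and offset >= 0 and offset + width <= 128
    container = (hi << 64) | lo
    return (container >> offset) & ((1 << width) - 1)


def insert_dw(lo, hi, offset, width, value):
    """Insert a field of up to 64 bits; returns the new (lo, hi) pair."""
    assert 1 <= width <= 64 and offset >= 0 and offset + width <= 128
    field_mask = ((1 << width) - 1) << offset
    container = (hi << 64) | lo
    # Clear the old field, then merge in the new value at its position.
    container = (container & ~field_mask) | ((value << offset) & field_mask)
    return container & MASK64, container >> 64
```

An 8-bit field at offset 60 lands half in each register, which is the straddling case the single-register forms cannot express.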

From John Savard@21:1/5 to [email protected] on Tue May 7 19:18:39 2024

On Mon, 6 May 2024 02:34:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Sun, 05 May 2024 11:20:02 -0600, John Savard wrote:

If you have decimal arithmetic, there's a direct connection between how
numbers are represented for reading and writing, and how they are
represented for internal arithmetic.

It is easier to do addition/subtraction if you start from the least
significant end and propagate the carry/borrow along.

I believe those early IBM character machines worked exactly this way.

Yes, I think you're right. While the IBM 1401 did store character
strings in the conventional big-endian order, they were addressed by
the location of their least significant digit so that arithmetic could
still start there, even if it then went backwards to lower addresses.
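
The addressing scheme can be illustrated with a short Python sketch: digits are stored most-significant-first, but each field is named by the index of its last (least significant) digit, and the addition walks towards lower indices. This is not 1401 code; in particular the explicit `length` argument is an assumption standing in for however the machine marks the end of a field, and the function name is invented.

```python
def add_decimal_fields(mem, a_low, b_low, length):
    """Add field A into field B in place; returns the final carry.

    a_low and b_low are the indices of the least significant digits;
    more significant digits sit at lower indices.
    """
    carry = 0
    for i in range(length):
        total = mem[a_low - i] + mem[b_low - i] + carry
        mem[b_low - i] = total % 10
        carry = total // 10
    return carry


# 0456 + 0789, stored as digits in big-endian order
mem = [0, 4, 5, 6, 0, 7, 8, 9]
carry = add_decimal_fields(mem, 3, 7, 4)
```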

John Savard

From John Savard@21:1/5 to [email protected] on Tue May 7 19:16:40 2024

On Fri, 3 May 2024 22:26:04 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:

To me, it just made sense that, since registers contain quantities, if
you load the value "8" into a register, it will contain the number 8.

So in a byte operation, the least significant bits of the register are
used.

Of course that makes sense.

Now, think of main memory as just a holding place for stuff that won’t fit
in registers: why shouldn’t it make sense there as well?

Because that isn't what main memory is. Even if one could think of
cache memory that way, main memory also interacts with input-output
devices.

Although that isn't really the problem.

After all, computational variables can be stored in memory in any
format. The only things in memory that are constrained in format are
character strings, because they get printed on paper for people to
see.

And, as I noted, that is the root of the problem.

Character strings are in big-endian order.

Packed decimal strings should be in the same order as character
strings, so that the relationship between the two is simple and
conversion between the two is quick.

Packed decimal strings of numbers should be in the same order as
binary numbers, because they can potentially share the same arithmetic
unit in some implementations.

John Savard


From John Savard@21:1/5 to [email protected] on Tue May 7 19:31:10 2024

On Wed, 1 May 2024 19:33:54 -0000 (UTC), "Stephen Fuld" <[email protected]d> wrote:

I don't know about the PDP 10, but you are right that Univac 1108 had
both a six bit (technically a sixth of a word), and nine bit (quarter
word) operations. The 6 bit was Fieldata and used for most older
software. The quarter words held an 8 bit ASCII character with one
"wasted" bit per byte. This became the dominant usage for
applications, but the Exec itself still uses a lot of Fieldata.

The PDP-10 used ASCII, and not other codes.

The six-bit code of the Univac was derived from FIELDATA, but the
actual FIELDATA code, developed by the military, was a 7-bit code
which included lower-case.

In my cryptography pages, on the page

http://www.quadibloc.com/crypto/mi060103.htm

there's a diagram comparing Univac's Fieldata code with the actual
FIELDATA code.

John Savard


From John Savard@21:1/5 to [email protected] on Tue May 7 19:23:59 2024

On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:

But we no longer have this problem.

But the other reasons for going little-endian still exist.

And what other reasons might those be?

Yes, going little-endian made things simpler in computers with short
word lengths, since the most common operations started from the least significant end.

But to do things in a big-endian way in such computers didn't require
trying to do addition backwards; you just had to jump ahead by the
length of the number, and then move backwards from the least
significant part. Often, though, even a trifling expense to do so
didn't make sense.
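To make the "jump ahead, then move backwards" point concrete, here is a small sketch in Python of multi-byte binary addition on a machine with a one-byte adder. The function names and the use of byte strings for the operands are assumptions for illustration only; both loops do the same work and differ only in where they start and which way they step.

```python
def add_big_endian(a, b):
    """Add two equal-length integers stored most significant byte first."""
    result = bytearray(len(a))
    carry = 0
    # Jump ahead to the last byte, then walk back toward index 0.
    for i in range(len(a) - 1, -1, -1):
        total = a[i] + b[i] + carry
        result[i] = total & 0xFF
        carry = total >> 8
    return bytes(result), carry


def add_little_endian(a, b):
    """Add two equal-length integers stored least significant byte first."""
    result = bytearray(len(a))
    carry = 0
    # Start at the address of the operand and walk upward.
    for i in range(len(a)):
        total = a[i] + b[i] + carry
        result[i] = total & 0xFF
        carry = total >> 8
    return bytes(result), carry


x, y = 0x01FF, 0x0001
print(add_big_endian(x.to_bytes(2, "big"), y.to_bytes(2, "big")))
print(add_little_endian(x.to_bytes(2, "little"), y.to_bytes(2, "little")))
```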

But when decimal and binary are both used in the same machine, then
big-endian is almost unavoidable - especially when the same
architecture is to be used in a wide range of implementations, some
big, and some small. Then, compatibility forces the use of a small
number of extra gates here and there.

John Savard


From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 02:11:09 2024

On Tue, 07 May 2024 19:16:40 -0600, John Savard wrote:

Character strings are in big-endian order.

Better thought of as “character strings are stored so ascending addresses correspond to logical reading order”. Note I didn’t say “display order”,
since that can be quite different.

Packed decimal strings should be in the same order as character strings,
so that the relationship between the two is simple and conversion
between the two is quick.

Now here you are getting into cultural issues. For example, while both
Arabic and Hebrew use decimal numbers, they write the digits in opposite
order.

Computer-internal formats should be optimized for computer-internal
operations. Conversion from/to human-comprehensible layout/ordering/
formatting should happen when accepting human input and displaying output
for humans. The two should be kept separate, so the former remains
independent of the latter, and the latter can be easily reconfigured.


From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 02:14:38 2024

On Tue, 07 May 2024 19:23:59 -0600, John Savard wrote:

On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

But the other reasons for going little-endian still exist.

And what other reasons might those be?

Consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values (powers of 2) of bits in an integer

The only way to get all 3 consistent is with a little-endian architecture. Every big-endian architecture has inconsistencies between these somewhere
or another.


From MitchAlsup1@21:1/5 to John Savard on Wed May 8 02:15:44 2024

John Savard wrote:

On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:

But we no longer have this problem.

But the other reasons for going little-endian still exist.

And what other reasons might those be?

Yes, going little-endian made things simpler in computers with short
word lengths, since the most common operations started from the least significant end.

But to do things in a big-endian way in such computers didn't require
trying to do addition backwards; you just had to jump ahead by the
length of the number, and then move backwards from the least
significant part. Often, though, even a trifling expense to do so
didn't make sense.

But when decimal and binary are both used in the same machine, then big-endian is almost unavoidable

Carry from digit to digit is the same direction in binary and decimal.
This argues sameness not Big-Endian.

- especially when the same
architecture is to be used in a wide range of implementations, some
big, and some small. Then, compatibility forces the use of a small
number of extra gates here and there.

John Savard


From John Levine@21:1/5 to All on Wed May 8 02:47:46 2024

According to MitchAlsup1 <[email protected]>:

Character strings are in big-endian order.

Not in Hebrew or Chinese !!

It doesn't make sense to say that character strings are big- or little-endian.

They're stored in the order you would read them, and there's typically
metadata about how to display them. In Unicode, Hebrew and Arabic code
points display right to left, Chinese displays however they want,
typically left to right in rows these days.
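As a rough illustration of that metadata, Python's standard `unicodedata` module exposes the bidirectional class recorded for each code point. The helper name `bidi_classes` and the sample string are assumptions of this sketch; the string is held in reading order and only the per-character class hints at display direction.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character, in storage order."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin A, Hebrew alef, Arabic alef, a CJK ideograph, and an ASCII digit.
sample = "A\u05d0\u0627\u4e2d7"
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")
```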

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From MitchAlsup1@21:1/5 to John Savard on Wed May 8 02:14:07 2024

John Savard wrote:

On Fri, 3 May 2024 22:26:04 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:

To me, it just made sense that, since registers contain quantities, if
you load the value "8" into a register, it will contain the number 8.

So in a byte operation, the least significant bits of the register are
used.

Of course that makes sense.

Now, think of main memory as just a holding place for stuff that won’t fit
in registers: why shouldn’t it make sense there as well?

Because that isn't what main memory is. Even if one could think of
cache memory that way, main memory also interacts with input-output
devices.

Although that isn't really the problem.

After all, computational variables can be stored in memory in any
format. The only things in memory that are constrained in format are character strings, because they get printed on paper for people to
see.

And, as I noted, that is the root of the problem.

Character strings are in big-endian order.

Not in Hebrew or Chinese !!

Packed decimal strings should be in the same order as character
strings, so that the relationship between the two is simple and
conversion between the two is quick.

Packed decimal strings of numbers should be in the same order as
binary numbers, because they can potentially share the same arithmetic
unit in some implementations.

John Savard


From John Levine@21:1/5 to All on Wed May 8 03:08:17 2024

According to John Savard <[email protected]d>:

But the other reasons for going little-endian still exist.

And what other reasons might those be?

These days the only reason is that everything else is little-endian.

Danny Cohen went through all of the arguments in his Holy Wars paper
in 1980. In the ensuing 44 years, nobody has added anything
interesting.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed May 8 03:10:37 2024

Lawrence D'Oliveiro wrote:

On Tue, 07 May 2024 19:23:59 -0600, John Savard wrote:

On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

But the other reasons for going little-endian still exist.

And what other reasons might those be?

Consider how you specify these 3 conventions:
* numbering of bits within a byte

Most significant is bit[0] least significant is bit[2^k-1]

* numbering of bytes within a multibyte quantity

Most significant byte[0] least significant byte[2^k-1]

* the place values (powers of 2) of bits in an integer

POWN Rp,#2,Ri

The only way to get all 3 consistent is with a little-endian architecture.

Not so; as illustrated above.

Every big-endian architecture has inconsistencies between these somewhere
or another.

Most significant priority is [0] least significant priority is [2^k-1]

Apparently even LE machines get this one wrong, too.


From Lawrence D'Oliveiro@21:1/5 to All on Wed May 8 03:38:37 2024

On Wed, 8 May 2024 03:10:37 +0000, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

Consider how you specify these 3 conventions:
* numbering of bits within a byte

Most significant is bit[0] least significant is bit[2^k-1]

* numbering of bytes within a multibyte quantity

Most significant byte[0] least significant byte[2^k-1]

* the place values (powers of 2) of bits in an integer

Now you have to have place number = 2^k - 1 - i, where i is your bit
number. So not only must the numbers be different, the relationship has to
change depending on the size of the field!

In little-endian, both numbers can be the same, in big-endian, they can’t.
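A small sketch of the point, in Python. The two function names are made up for illustration, and "big-endian" here means only the bit-numbering convention in which bit 0 is the most significant bit.

```python
def place_exponent_little_endian(bit_number, field_bits):
    """Bit 0 is least significant: the bit number IS the power of 2."""
    return bit_number


def place_exponent_big_endian(bit_number, field_bits):
    """Bit 0 is most significant: the power of 2 depends on the field width."""
    return field_bits - 1 - bit_number


# The same bit number, looked at in fields of different widths.
for width in (8, 16, 32):
    print(width,
          place_exponent_little_endian(3, width),
          place_exponent_big_endian(3, width))
```

With bit 0 as the least significant bit, bit number 3 always has place value 2^3; with bit 0 as the most significant bit, it is 2^4, 2^12 or 2^28 depending on the field width.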


From John Savard@21:1/5 to [email protected] on Tue May 7 21:56:35 2024

On Wed, 8 May 2024 02:14:38 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

Consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values (powers of 2) of bits in an integer

The only way to get all 3 consistent is with a little-endian architecture.
Every big-endian architecture has inconsistencies between these somewhere
or another.

That's true.

But I fail to see why the last one needs to be consistent, except as
an aesthetic preference.

And so I find the IBM System/360, which gets the first two consistent,
to be a sterling example of perfect consistency.

The IBM System/360 gets to convert from character strings which
represent integers to their packed decimal form in a simple way -
assemble the last four bits of each byte, in the same order as the
bytes in that string.

And then packed decimal values are in the same ordering as binary
values - with the most significant part in the same spot.

This has practical consequences. Pack and Unpack are faster. Decimal
and binary arithmetic can share circuitry on lower-end designs.
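Here is a simplified sketch of that packing step in Python. It is not the actual Pack instruction: real packed decimal also carries a sign, which this sketch leaves out, and the function name `pack_digits` is an assumption made for illustration.

```python
def pack_digits(zoned):
    """Pack one-digit-per-byte character codes into two digits per byte.

    Keeps only the low four bits of each input byte, in the same order as
    the input, so the most significant digit stays at the lowest address.
    Simplification: no sign is produced.
    """
    nibbles = [byte & 0x0F for byte in zoned]
    if len(nibbles) % 2:
        nibbles.insert(0, 0)  # pad on the left to fill whole bytes
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


print(pack_digits("1234".encode("cp037")).hex())  # EBCDIC digit characters
print(pack_digits(b"1234").hex())                 # ASCII digit characters
```

Because the digit value sits in the low four bits of the character code in both EBCDIC and ASCII, the same masking works for either, and no reordering of bytes is needed.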

John Savard


From John Savard@21:1/5 to All on Tue May 7 22:01:36 2024

On Wed, 8 May 2024 02:15:44 +0000, [email protected] (MitchAlsup1)
wrote:

Carry from digit to digit is the same direction in binary and decimal.
This argues sameness not Big-Endian.

Yes, that's right.

But that's only half of the argument.

The reason for both being the same as big-endian instead of both being
the same as little-endian is because of a _third_ item.

This argues that packed decimal should have the same endianness as
binary.

But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.

Then I argue for "sameness" as well, because a machine could be
little-endian, with binary integers and floating-point all
little-endian, but with decimal, as something minor and unimportant,
being big-endian. So in addition to arguing that packed decimal should
be big-endian because strings, I also have to argue that packed
decimal and binary should have the same endianness.

John Savard


From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed May 8 03:35:47 2024

On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

It doesn't make sense to say that character strings are big- or little-endian.

Yes it does, for just about any encoding other than UTF-8. Thus, you have UTF16BE, and UTF16LE.
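A quick way to see the difference, using Python's built-in codecs (the two-character sample string is just an assumption for illustration):

```python
# Latin capital A (U+0041) followed by Hebrew alef (U+05D0).
text = "A\u05d0"

utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

print(utf16_be.hex())  # each 16-bit code unit stored high byte first
print(utf16_le.hex())  # same code units, low byte first
print(utf8.hex())      # byte-oriented, so no byte-order variants exist
```

Note that the order of the characters is the same in all three; only the byte order inside each 16-bit code unit changes.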


From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 05:50:08 2024

On Tue, 07 May 2024 21:56:35 -0600, John Savard wrote:

But I fail to see why the last one needs to be consistent, except as an aesthetic preference.

Not just inconsistency, but the fact that the numbering has to be
different depending on the size of the multibyte quantity.

Only little-endian allows this numbering to be both consistent and
constant.


From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed May 8 05:54:50 2024

On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.

No, because character string conversion is subject to localization issues.


From Michael S@21:1/5 to John Levine on Wed May 8 13:14:53 2024

On Wed, 8 May 2024 02:47:46 -0000 (UTC)
John Levine <[email protected]> wrote:

According to MitchAlsup1 <[email protected]>:

Character strings are in big-endian order.

Not in Hebrew or Chinese !!

It doesn't make sense to say that character strings are big- or
little-endian.

They're stored in the order you would read them, and there's typically metadata about how to display them. In Unicode, Hebrew and Arabic code
points display right to left, Chinese displays however they want,
typically left to right in rows these days.

Unfortunately, in Hebrew it is not that simple. Numbers [of Arabic
variety] are written with most significant digit on the left, i.e. if
we consider most significant digit as "first" then it can be said
that [Arabic] numbers appear in opposite direction to the rest of the
text. Numbers of Hebrew variety are written right to left, but nowadays
they are used much less often.
Arabic, on the other hand, uses the same right to left direction both
for text and for numbers.


From Michael S@21:1/5 to [email protected] on Wed May 8 13:45:58 2024

On Wed, 8 May 2024 03:10:37 +0000
[email protected] (MitchAlsup1) wrote:

Most significant priority is [0] least significant priority is [2^k-1]

Apparently even LE machines get this one wrong, too.

What sort of 'priority' are you talking about? I can't think about
any meaning of this word for which the numbering is independent of
culture or context or both.
Even if we limit ourselves to "Western" cultures, although it is true
that more often than not (not always!) higher priority is associated
with smaller number, I would think that the highest priority is more
often associated with one than with zero.


From Michael S@21:1/5 to Terje Mathisen on Wed May 8 15:36:48 2024

On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.

I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do
something like that for byte stores but bit addressing makes it 8
times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a
byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing
calculations, etc. via extra instructions? Again, almost
certainly. So you keep the benefits of byte oriented loads most
of the time, but have "reasonable" access to bit fields when you
need them, faster than without the extra instructions. Hopefully
the best of both worlds.

When you load bit field from memory, there is very high chance that
you would want adjacent bit field soon thereafter.
Think about it.

Which means that you would like to have a dedicated streaming buffer
cache for the EXTR operation?

Terje

That's not what I wanted to hint to Stephen.
I wanted to hint that in typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing practice.


From Terje Mathisen@21:1/5 to Michael S on Wed May 8 14:25:15 2024

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.

I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do
something like that for byte stores but bit addressing makes it 8
times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a byte
accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing calculations,
etc. via extra instructions? Again, almost certainly. So you keep
the benefits of byte oriented loads most of the time, but have
"reasonable" access to bit fields when you need them, faster than
without the extra instructions. Hopefully the best of both worlds.

When you load bit field from memory, there is very high chance that you
would want adjacent bit field soon thereafter.
Think about it.

Which means that you would like to have a dedicated streaming buffer
cache for the EXTR operation?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Stefan Monnier@21:1/5 to All on Wed May 8 11:30:36 2024

I wanted to hint that in typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing practice.

The way I imagine bit-addressability, it would basically work as
follows:

- Use pointers almost as we do now, except shifted by 3 bits.
Most likely normal loads and stores would signal an error if the low
3 bits aren't 0. Immediate offsets in instructions would presumably
still be in the same units as before (bytes, words, ...).

This is fundamentally the only thing needed.
But once you have that, you'd probably want to add some instructions to
ease bit-granular processing, which I'd imagine would look like:

- Load/store operations that ignore the lowest 3 bits (or
more than that, maybe the lowest 6 bits).
- bit-insertion/extraction instructions which use those lowest 3-6bits
and ignore the rest.

This would not require any special shifter in the memory path and the combination of those operations should be just as efficient as
a dedicated instruction.

To handle bitfields that straddle word boundaries, you might want
your bit-insert/extract to come with a "double-wide" option (I guess My
66000's CARRY could do the trick), tho maybe you'd just use something
like a 64bit load/store which only ignores the lowest 5bits (should be sufficient for any bit-field up to 32bits).
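A rough software model of the scheme, in Python. Everything here is hypothetical: the function names, the 64-bit word size, the little-endian byte and bit numbering, and the use of a `bytes` object as memory. It only covers fields that fit inside one aligned word.

```python
WORD_BITS = 64  # assumed word size


def to_bit_pointer(byte_address):
    """Turn an ordinary byte address into a bit-granular pointer."""
    return byte_address << 3


def aligned_load_byte(memory, bit_pointer):
    """Ordinary byte load: signals an error if the low 3 pointer bits are not 0."""
    if bit_pointer & 0b111:
        raise ValueError("misaligned bit pointer for a byte load")
    return memory[bit_pointer >> 3]


def load_word_ignoring_low_bits(memory, bit_pointer):
    """Load the aligned 64-bit word containing the addressed bit.

    The lowest 6 pointer bits are ignored, as in the first bullet above.
    """
    start = (bit_pointer >> 6) * 8
    return int.from_bytes(memory[start:start + 8], "little")


def extract_field(word, bit_pointer, width):
    """Bit extraction that uses only the lowest 6 pointer bits as the offset."""
    offset = bit_pointer & (WORD_BITS - 1)
    return (word >> offset) & ((1 << width) - 1)


# A 16-bit field placed 12 bits into the second 64-bit word of memory.
memory = bytes(8) + (0xABCD << 12).to_bytes(8, "little")
pointer = to_bit_pointer(8) + 12
word = load_word_ignoring_low_bits(memory, pointer)
print(hex(extract_field(word, pointer, 16)))
```

The load and the extract each look at a disjoint part of the pointer, which is why no special shifter is needed in the memory path.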

Stefan


From Stephen Fuld@21:1/5 to Michael S on Wed May 8 16:09:32 2024

Michael S wrote:

On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.

I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do
something like that for byte stores but bit addressing makes it 8
times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a
byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing
calculations, etc. via extra instructions? Again, almost
certainly. So you keep the benefits of byte oriented loads most
of the time, but have "reasonable" access to bit fields when you
need them, faster than without the extra instructions. Hopefully
the best of both worlds.

When you load bit field from memory, there is very high chance
that you would want adjacent bit field soon thereafter.
Think about it.

Which means that you would like to have a dedicated streaming
buffer cache for the EXTR operation?

Terje

That's not what I wanted to hint to Stephen.
I wanted to hint that in typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing practice.

Perhaps. But if you aren't absolutely sure that the next field doesn't
cross a 64 bit boundary, then you have to test for that, and if it does,
add more instructions to handle it. If that happens, your advantage is
lost. Even the test and conditional jump/predication when you don't
cross the boundary makes it pretty close.

And, as I mentioned in a previous post, I would expect higher end implementations to make use of some sort of stream buffer, as Terje
suggests.
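To show where that test lands in software, here is a minimal sketch in Python. The function name, the list of 64-bit words standing in for memory, and the least-significant-bit-first numbering are assumptions for illustration, not anyone's actual code.

```python
def extract_bits(words, bit_pos, width):
    """Extract `width` bits (at most 64) starting at absolute bit position bit_pos.

    `words` is a list of 64-bit values; bit 0 of words[0] is position 0.
    """
    index, offset = divmod(bit_pos, 64)
    value = words[index] >> offset
    if offset + width > 64:
        # The field crosses a 64-bit boundary: fetch the next word as well
        # and splice its low bits in above what we already have.
        value |= words[index + 1] << (64 - offset)
    return value & ((1 << width) - 1)


words = [0xF << 60, 0xA]
print(hex(extract_bits(words, 60, 8)))      # straddles words[0] and words[1]
print(hex(extract_bits([0x1234], 4, 8)))    # fits inside one word
```

The `if` is the extra test and branch; a single bit-field load instruction would hide it from the instruction stream.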

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Terje Mathisen@21:1/5 to Michael S on Wed May 8 19:04:23 2024

Michael S wrote:

On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.

I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do
something like that for byte stores but bit addressing makes it 8
times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient as a
byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing
calculations, etc. via extra instructions? Again, almost
certainly. So you keep the benefits of byte oriented loads most
of the time, but have "reasonable" access to bit fields when you
need them, faster than without the extra instructions. Hopefully
the best of both worlds.

When you load bit field from memory, there is very high chance that
you would want adjacent bit field soon thereafter.
Think about it.

Which means that you would like to have a dedicated streaming buffer
cache for the EXTR operation?

Terje

That's not what I wanted to hint to Stephen.
I wanted to hint that in typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing practice.

Yeah, as I wrote earlier, in my own code I tend to use a register as my
buffer and keep it bottom-aligned at all times, i.e. end each extraction
by a SHR buffer, token_len

This means that most of the time, the buffer reg already contains all
the bits of the next token.
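Roughly, the pattern looks like the following sketch, written in Python rather than assembly. The class name, the byte-at-a-time refill, and the least-significant-bit-first token order are assumptions for illustration; in real code the buffer would be a fixed-width register refilled a word at a time.

```python
class BitReader:
    """Bit-stream reader that keeps its buffer bottom-aligned."""

    def __init__(self, data):
        self.data = data
        self.pos = 0       # index of the next unread byte
        self.buffer = 0    # unread bits, next token in the low bits
        self.bits = 0      # number of valid bits in the buffer

    def read(self, token_len):
        # Refill only when the buffer does not already hold the whole token.
        while self.bits < token_len:
            if self.pos >= len(self.data):
                raise EOFError("ran out of input bits")
            self.buffer |= self.data[self.pos] << self.bits
            self.pos += 1
            self.bits += 8
        token = self.buffer & ((1 << token_len) - 1)
        self.buffer >>= token_len   # the "SHR buffer, token_len" step
        self.bits -= token_len
        return token


reader = BitReader(bytes([0b10110100, 0b00000001]))
print(reader.read(3), reader.read(5), reader.read(4))
```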

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Stephen Fuld@21:1/5 to Terje Mathisen on Wed May 8 17:27:35 2024

Terje Mathisen wrote:

Michael S wrote:

On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory
doesn't mean it has to implement fetching a single byte
from memory by doing a shift and mask operation in a
64-bit register. Instead, each byte of the bus could
have a direct wired path to the low 8-bits of the
internal data bus feeding the registers.

I was more thinking about storing bit fields, where you
probably have to fetch the whole word or cache line or
whatever, shift the new field into it, and then store it
back. You already have to do something like that for byte
stores but bit addressing makes it 8 times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient
as a byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte
ISA on typical codes:: doubtful. Two reasons:: 1) more
delay in the LD/ST pipeline, 2) most programs use as little bit-fielding as possible (not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load
variant that allowed you to address bit fields. Would it be
slower than a "normal" byte oriented load? Almost certainly.
But would it be faster than doing all the shifts, masks, word crossing calculations, etc. via extra instructions? Again,
almost certainly. So you keep the benefits of byte oriented
loads most of the time, but have "reasonable" access to bit
fields when you need them, faster than without the
extra instructions. Hopefully the best of both worlds.

When you load bit field from memory, there is very high chance
that you would want adjacent bit field soon thereafter.
Think about it.

Which means that you would like to have a dedicated streaming
buffer cache for the EXTR operation?

Terje

That's not what I wanted to hint to Stephen.
I wanted to hint that in typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional instruction would be slower rather than faster than existing
practice.

Yeah, as I wrote earlier, in my own code I tend to use a register as
my buffer and keep it bottom-aligned at all times, i.e. end each
extraction by a SHR buffer, token_len

This means that most of the time, the buffer reg already contains all
the bits of the next token.

The key word being "most". If it isn't "always", you have to test for
the condition. That test and the conditional branch reduce, and
perhaps eliminate, the advantage.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Terje Mathisen@21:1/5 to Stephen Fuld on Wed May 8 19:16:09 2024

Stephen Fuld wrote:

Michael S wrote:

On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory
doesn't mean it has to implement fetching a single byte
from memory by doing a shift and mask operation in a
64-bit register. Instead, each byte of the bus could
have a direct wired path to the low 8-bits of the
internal data bus feeding the registers.

I was more thinking about storing bit fields, where you
probably have to fetch the whole word or cache line or
whatever, shift the new field into it, and then store it
back. You already have to do something like that for byte
stores but bit addressing makes it 8 times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient
as a byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte
ISA on typical codes:: doubtful. Two reasons:: 1) more
delay in the LD/ST pipeline, 2) most programs use as little
bit-fielding as possible (not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load
variant that allowed you to address bit fields. Would it be
slower than a "normal" byte oriented load? Almost certainly.
But would it be faster than doing all the shifts, masks, word
crossing calculations, etc. via extra instructions? Again,
almost certainly. So you keep the benefits of byte oriented
loads most of the time, but have "reasonable" access to bit
fields when you need them, faster than without the
extra instructions. Hopefully the best of both worlds.

When you load a bit field from memory, there is a very high chance
that you would want an adjacent bit field soon thereafter.
Think about it.

Which means that you would like to have a dedicated streaming
buffer cache for the EXTR operation?

Terje

That's not what I wanted to hint to Stephen.
I wanted to hint that in a typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing practice.

Perhaps. But if you aren't absolutely sure that the next field doesn't
cross a 64 bit boundary, then you have to test for that, and if it does,
add more instructions to handle it. If that happens, your advantage is
lost. Even the test and conditional jump/predication when you don't
cross the boundary makes it pretty close.
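
To make the software path concrete, here is a rough Python sketch of extracting a field at an arbitrary bit address from an array of 64-bit words, including the boundary test mentioned above. The name extract_field, the LSB-first bit numbering, and the list-of-words memory model are assumptions made for illustration; they do not come from any real ISA.

```python
WORD_BITS = 64  # assumed word size for this sketch


def extract_field(words, bit_addr, width):
    """Return `width` bits starting at absolute bit address `bit_addr`.

    `words` is a list of 64-bit unsigned integers; bit 0 of words[0] is
    bit address 0 (LSB-first numbering, an assumption of this sketch).
    """
    index, offset = divmod(bit_addr, WORD_BITS)
    value = words[index] >> offset
    # This is the test under discussion: does the field straddle a word?
    if offset + width > WORD_BITS:
        value |= words[index + 1] << (WORD_BITS - offset)
    return value & ((1 << width) - 1)


memory = [0xF000000000000000, 0x0000000000000003]
inside = extract_field([0xABCD], 4, 8)   # field fully inside one word
straddle = extract_field(memory, 60, 6)  # 4 bits from word 0, 2 from word 1
```

The divide, shift, compare, second load, and mask are the "extra instructions" that a dedicated bit-field load would presumably fold into one operation.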

And, as I mentioned in a previous post, I would expect higher end
implementations to make use of some sort of stream buffer, as Terje
suggests.

In typical codecs, tokens are mostly 2-3 to 8-10 bits long, so by having
a 64-bit buffer which always contains at least 32 bits, you don't need
to worry about any straddles, and for strings of shorter tokens, you
don't even need to check if a reload/buffer fill-up is needed.
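
A minimal Python sketch of that buffering scheme, assuming LSB-first bit packing and tokens no longer than 32 bits. The names decode_tokens, MIN_BITS and BUF_BITS are illustrative only, and a real codec would keep the buffer in a machine register rather than a Python integer.

```python
BUF_BITS = 64   # pretend register width
MIN_BITS = 32   # refill whenever fewer than this many bits remain


def decode_tokens(data: bytes, lengths):
    """Extract consecutive tokens of the given bit lengths from `data`."""
    buf = 0      # bottom-aligned bit buffer
    nbits = 0    # number of valid bits currently in buf
    pos = 0      # next input byte to load
    tokens = []
    for n in lengths:
        assert n <= MIN_BITS
        if nbits < MIN_BITS:
            # Top up a byte at a time while a whole byte still fits.
            while nbits <= BUF_BITS - 8 and pos < len(data):
                buf |= data[pos] << nbits
                nbits += 8
                pos += 1
        tokens.append(buf & ((1 << n) - 1))  # token sits in the low bits
        buf >>= n                            # the "SHR buffer, token_len" step
        nbits -= n
    return tokens


tokens = decode_tokens(bytes([0xB5, 0x01]), [3, 5, 8])
```

The sketch checks the fill level before every token only to stay short. With at least 32 valid bits guaranteed after a check, three 10-bit tokens could be extracted back to back without checking again. It also does not handle running out of input.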

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to Stephen Fuld on Wed May 8 19:47:34 2024

Stephen Fuld wrote:

Terje Mathisen wrote:

Michael S wrote:

On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

MitchAlsup1 wrote:

John Levine wrote:

According to John Savard <[email protected]d>:

On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:

Why do you think bit addressing will be
faster than shifting and masking? ...

So just because a processor has a 64-bit bus to memory
doesn't mean it has to implement fetching a single byte
from memory by doing a shift and mask operation in a
64-bit register. Instead, each byte of the bus could
have a direct wired path to the low 8-bits of the
internal data bus feeding the registers.

I was more thinking about storing bit fields, where you
probably have to fetch the whole word or cache line or
whatever, shift the new field into it, and then store it
back. You already have to do something like that for byte
stores but bit addressing makes it 8 times as hairy.

Which is no different than ECC, BTW...

Could someone invent a bit field ISA that was as efficient
as a byte accessible architecture:: probably.

Could this bit accessible architecture outperform a byte
ISA on typical codes:: doubtful. Two reasons:: 1) more
delay in the LD/ST pipeline, 2) most programs use as little
bit-fielding as possible (not as much as practical) !!!

Some time ago, I proposed an additional instruction, a load
variant that allowed you to address bit fields. Would it be
slower than a "normal" byte oriented load? Almost certainly.
But would it be faster than doing all the shifts, masks, word
crossing calculations, etc. via extra instructions? Again,
almost certainly. So you keep the benefits of byte oriented
loads most of the time, but have "reasonable" access to bit
fields when you need them, faster than without the
extra instructions. Hopefully the best of both worlds.

When you load a bit field from memory, there is a very high chance
that you would want an adjacent bit field soon thereafter.
Think about it.

Which means that you would like to have a dedicated streaming
buffer cache for the EXTR operation?

Terje

That's not what I wanted to hint to Stephen.
I wanted to hint that in a typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing
practice.

Yeah, as I wrote earlier, in my own code I tend to use a register as
my buffer and keep it bottom-aligned at all times, i.e. end each
extraction by a SHR buffer, token_len

This means that most of the time, the buffer reg already contains all
the bits of the next token.

The key word being "most". If it isn't "always", you have to test for
the condition. That test and the conditional branch reduce, and
perhaps eliminate, the advantage.

It was exactly these kinds of optimizations I made in order to double
the speed of Intel's reference BluRay decoder. However, instead of
asking me to write a complete version they decided to licence a piece of
VLSI to do it in hardware, and that was almost certainly the correct
decision since my code needed 4 cores working nearly 100% in order to
handle the highest possible size/speed quality (1080p, 60 Hz, CABAC
encoding and 40 Mbit/s bitrate).

With a hw decoder a laptop can show film for hours on battery power.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From MitchAlsup1@21:1/5 to BGB on Wed May 8 21:38:01 2024

BGB wrote:

Though, had noticed recently that a lot of typos seem to escape my
notice on my end. This is possibly a downside of using a 9pt font on a
4K monitor (22 inch) with 100% UI zoom (*). Can fit more stuff on
screen, but potentially not the most easily readable experience.

Why so small ?? My monitor is 32" and if I were to replace it with a 4K
monitor it would be 40-42".

From John Levine@21:1/5 to All on Thu May 9 01:24:54 2024

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

It doesn't make sense to say that character strings are big- or little-
endian.

Yes it does, for just about any encoding other than UTF-8. Thus, you have
UTF16BE, and UTF16LE.

Not really, those are byte orders within a character, not order of characters.

If you look at surrogates, you can see UTF16 is big-endian. First there's the high
surrogate, then the low one.

There's a reason that every encoding other than UTF-8 is dead. Who needs the grief?
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From John Savard@21:1/5 to [email protected] on Wed May 8 20:50:53 2024

On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.

No, because character string conversion is subject to localization issues.

I agree that little-endian computers make sense for people whose
native language is Hebrew or Arabic.

John Savard

From John Savard@21:1/5 to [email protected] on Thu May 9 15:01:55 2024

On Wed, 08 May 2024 20:50:53 -0600, John Savard
<[email protected]d> wrote:

On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.

No, because character string conversion is subject to localization issues.

I agree that little-endian computers make sense for people whose
native language is Hebrew or Arabic.

Still, I get your point. My thinking is stuck in the days of card
readers and line printers. Yes, one called a subroutine to print
numbers, but what it did was convert them to the format used in North
America and the United Kingdom, in accordance with any parameters in
the call that were hard-coded into the program.

The idea of programs as applications, to be distributed far and wide,
to people with computers of their own, where the operating system
could impose localization options on the display of numbers that
programs would usually allow themselves to accept... the situation
with newfangled operating systems like Microsoft Windows... is still
one that is only gradually beginning to dawn on me.

I do suspect, though, that programs like, say, dBase II, which store
numbers in files internally as character strings, don't vary that
format according to localization. Some binary to string conversions go
through the localization mechanisms, but not all of them, and so
string forms are _not_ wholly irrelevant.

An embedded processor in, say, a digital voltmeter... is not going to
have a localization layer to contend with. The makers of digital
voltmeters will find other ways of addressing international markets.

John Savard

From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri May 10 09:18:43 2024

On 08/05/2024 04:11, Lawrence D'Oliveiro wrote:

On Tue, 07 May 2024 19:16:40 -0600, John Savard wrote:

Character strings are in big-endian order.

Better thought of as “character strings are stored so ascending addresses correspond to logical reading order”. Note I didn’t say “display order”,
since that can be quite different.

Packed decimal strings should be in the same order as character strings,
so that the relationship between the two is simple and conversion
between the two is quick.

Now here you are getting into cultural issues. For example, while both
Arabic and Hebrew use decimal numbers, they write the digits in opposite
order.

Do you mean that when they write "123" with "1" on the left, they mean
the number "three hundred and twenty one" rather than "one hundred and
twenty three"? Or do you mean that where we write the digit "1" first
when writing left to right, they write the digit "3" first going right
to left?

My understanding was that for both languages, and indeed any other
language that uses Arabic numerals, digits are written big-endian read
from the left. Thus "123", with the digit "1" on the left, means the
same in Arabic, English, Hebrew, Chinese, or any other language using
them. Anything else would be massively confusing.

Many cultures and languages have additional numeric systems they use as
well as the common Arabic numerals. Some use their own systems as
standard, some just for specific purposes (just as English speakers use
Roman numerals for some purposes). And some of these are read
right-to-left rather than left-to-right (not necessarily matching the
order of their text), others use different symbols for the weighting.

As far as I know, in Hebrew numbers are usually written with
Western-style Arabic numerals, in the same order as everywhere else.
But they also use a more traditional letter-based system for dates,
religious works, and so on. Those are additive rather than strictly
positional (at least up to a limit).

And in written Arabic, Eastern-style Arabic numerals are used,
corresponding directly to Western-style Arabic numerals but with
somewhat different forms - the order is still most significant digit on
the left.

(I have a book on the history of number systems throughout the world,
but it is a /long/ time since I read it.)

From David Brown@21:1/5 to John Levine on Fri May 10 09:31:00 2024

On 09/05/2024 03:24, John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

It doesn't make sense to say that character strings are big- or little-
endian.

Yes it does, for just about any encoding other than UTF-8. Thus, you have
UTF16BE, and UTF16LE.

Not really, those are byte orders within a character, not order of characters.

Or rather, they are byte orders used by different encodings of code
points. ("Characters" in Unicode are more complicated - nothing is ever
simple in Unicode!) There are no endian issues between code points, and
a "string" as far as Unicode is concerned would be a sequence of code
points. You only have endian issues if you want to store these 21-bit
integers in a format that is encoded in smaller lumps (like
byte-addressed memory).
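
As a small illustration of that distinction, the Python snippet below encodes one code point outside the Basic Multilingual Plane in several encodings. The choice of U+1F600 is arbitrary; the codec names are the ones Python's standard library provides.

```python
ch = "\U0001F600"  # a single code point outside the BMP

be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")
u8 = ch.encode("utf-8")

print(be.hex(" "))  # d8 3d de 00
print(le.hex(" "))  # 3d d8 00 de
print(u8.hex(" "))  # f0 9f 98 80
```

In both UTF-16 variants the high surrogate (D83D) comes before the low surrogate (DE00); only the two bytes inside each 16-bit code unit swap places. UTF-8 has a single byte order, so the question never arises.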

If you look at surrogates, you can see UTF16 is big-endian. First there's the high
surrogate, then the low one.

There's a reason that every encoding other than UTF-8 is dead. Who needs the grief?

Indeed.

UTF-32 is fine for internal use, however - using whatever endianness
your processor prefers. The trick is never to let it leave the one
computer in any encoding other than UTF-8.

From Scott Lurndal@21:1/5 to John Savard on Fri May 10 13:09:53 2024

John Savard <[email protected]d> writes:

On Wed, 08 May 2024 20:50:53 -0600, John Savard
<[email protected]d> wrote:

On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.

No, because character string conversion is subject to localization issues.

I agree that little-endian computers make sense for people whose
native language is Hebrew or Arabic.

Still, I get your point. My thinking is stuck in the days of card
readers and line printers. Yes, one called a subroutine to print
numbers, but what it did was convert them to the format used in North
America and the United Kingdom, in accordance with any parameters in
the call that were hard-coded into the program.

The idea of programs as applications, to be distributed far and wide,
to people with computers of their own, where the operating system
could impose localization options on the display of numbers that
programs would usually allow themselves to accept

I actually was responsible for the I18N and L10N support in
the Burroughs MCP (for Medium systems) in the 80's, so it's
not something that Microsoft "invented". At the time, it
was mainly for Europe, and Japan (Katakana).

From Anton Ertl@21:1/5 to David Brown on Fri May 10 16:20:47 2024

David Brown <[email protected]> writes:

UTF-32 is fine for internal use, however - using whatever endianness
your processor prefers. The trick is never to let it leave the one
computer in any encoding other than UTF-8.

An unnecessary complication.

1) I only came up with the following use cases where you need to deal
with individual non-ASCII characters: Palindrome checkers and anagram
programs; I remember somebody mentioning a third use (which I promptly
forgot), but anyway, there are few cases.

2) But even for those few cases, UTF-32 is not good enough, because a
code point is not a character.
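
A short Python illustration of the second point, using only the standard unicodedata module. The example string is an arbitrary choice: an "e" followed by a combining acute accent, which a reader sees as one character.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, displayed as one glyph
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 2 code points
print(len(composed))    # 1 code point (U+00E9)

# Reversing by code point, as a naive palindrome checker might,
# moves the accent in front of the letter it belonged to.
reversed_cp = decomposed[::-1]
```

Normalization happens to rescue this particular case, but not every combining sequence has a precomposed form, so fixed-width code points alone do not give fixed-width characters.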

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From John Savard@21:1/5 to All on Fri May 10 18:49:31 2024

On Thu, 2 May 2024 18:28:18 +0000, [email protected] (MitchAlsup1)
wrote:

John Savard wrote:

On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro

Plus, if you load a single precision float into a floating-point
register, you are loading on the left side, not the right side, so the

In My 66000, floats are stored on the right side of the register
{mostly because I do not have FP LD/STs.}

And _not only_ do I have FP loads and stores, but one of the things
they *do* is convert floats (if needed) to an internal form so that
the exponent is of the exact same form, in the same position, for all
the floats of that type.

The Compatible Floating Point loads and stores - those are the ones
for hexadecimal S/360 floats - just do left-aligned raw loads and
stores in the FP registers, since their exponents are all in the same
form.

But the regular ones, for IEEE 754 floats, convert everything to look
like the old 8087 temporary real format. Possibly with an extra
exponent bit to accommodate the new 128-bit format defined in IEEE 754.

Of course, you may rightfully say that is crazy - if I did a
computation saving everything in memory, or using short vectors (where
this conversion doesn't take place) then the computation strictly
observes the exponent range, but if I do one in registers, a
calculation could continue normally where an intermediate result ought
to have underflowed by a little bit.

But here I'm following Seymour Cray - sacrifice everything else for
speed. Although 'within reason'; _except for division_ I keep IEEE 754
exact results.

John Savard

From David Brown@21:1/5 to Anton Ertl on Sat May 11 15:33:55 2024

On 10/05/2024 18:20, Anton Ertl wrote:

David Brown <[email protected]> writes:

UTF-32 is fine for internal use, however - using whatever endianness
your processor prefers. The trick is never to let it leave the one
computer in any encoding other than UTF-8.

An unnecessary complication.

1) I only came up with the following use cases where you need to deal
with individual non-ASCII characters: Palindrome checkers and anagram
programs; I remember somebody mentioning a third use (which I promptly
forgot), but anyway, there are few cases.

2) But even for those few cases, UTF-32 is not good enough, because a
code point is not a character.

I agree that it is usually unnecessary to convert to UTF-32 - I am
merely saying that /if/ you feel you want to expand the code points,
then UTF-32 is fine for the purpose and you should not have to worry
about endianness because you should not be moving it off your computer,
thus native endianness is all you need.

People sometimes say they want to expand to code points to be able to
see the length of the string in characters, or to index characters, or
for easier splicing or joining strings. I don't think these are
particularly useful in practice, but UTF-32 is fine for those that want it.

From Anton Ertl@21:1/5 to David Brown on Sat May 11 15:31:49 2024

David Brown <[email protected]> writes:

On 10/05/2024 18:20, Anton Ertl wrote:

1) I only came up with the following use cases where you need to deal
with individual non-ASCII characters: Palindrome checkers and anagram
programs; I remember somebody mentioning a third use (which I promptly
forgot), but anyway, there are few cases.

2) But even for those few cases, UTF-32 is not good enough, because a
code point is not a character.

I agree that it is usually unnecessary to convert to UTF-32 - I am
merely saying that /if/ you feel you want to expand the code points,
then UTF-32 is fine for the purpose and you should not have to worry
about endianness because you should not be moving it off your computer,
thus native endianness is all you need.

Yes. The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by using
UTF-32 (or UTF-16); it isn't, and one cannot. If you stick with UTF-8
and use byte lengths and byte indexes, you can do almost everything as
well or better (with less complication and more efficiently) as by
converting to UTF-32 and back.

People sometimes say they want to expand to code points to be able to
see the length of the string in characters, or to index characters, or
for easier splicing or joining strings. I don't think these are
particularly useful in practice, but UTF-32 is fine for those that want it.

Looking up "splicing strings", I find that this is a term used in
connection with Python for specifying substrings. Python3 is a
language that lives the codepoint mistake to the extreme (and from
what I read, this was one of the major pain points in the
Python2->Python3 transition), but anyway, with UTF-8 one way to
represent a substring is to use the start index and length in bytes
(aka code units) rather than code points.

Looking up "joining strings" brings up the Python join() method, which
is a variant of string concatenation. There is certainly no need to
convert UTF-8 to UTF-32 and back for concatenating strings; just
concatenate the UTF-8 strings.
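
As a rough sketch of working purely in bytes, the Python below searches, slices, and concatenates UTF-8 data using byte offsets only. The sample text is made up for illustration. It relies on the fact that a byte-level search for a valid UTF-8 needle can only match at character boundaries, because lead bytes and continuation bytes use disjoint value ranges.

```python
text = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

# Substring as (start, length) in bytes, never in code points.
start = text.find(needle)
length = len(needle)
sub = text[start:start + length]

# Replacement and concatenation are plain byte operations.
replaced = text[:start] + "tea".encode("utf-8") + text[start + length:]
joined = text + b" " + needle

print(start, length)  # 7 5
```

Nothing here ever decodes to code points, yet every result is still valid UTF-8.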

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From David Brown@21:1/5 to Anton Ertl on Sat May 11 18:49:12 2024

On 11/05/2024 17:31, Anton Ertl wrote:

David Brown <[email protected]> writes:

On 10/05/2024 18:20, Anton Ertl wrote:

1) I only came up with the following use cases where you need to deal
with individual non-ASCII characters: Palindrome checkers and anagram
programs; I remember somebody mentioning a third use (which I promptly
forgot), but anyway, there are few cases.

2) But even for those few cases, UTF-32 is not good enough, because a
code point is not a character.

I agree that it is usually unnecessary to convert to UTF-32 - I am
merely saying that /if/ you feel you want to expand the code points,
then UTF-32 is fine for the purpose and you should not have to worry
about endianness because you should not be moving it off your computer,
thus native endianness is all you need.

Yes. The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by using
UTF-32 (or UTF-16); it isn't, and one cannot. If you stick with UTF-8
and use byte lengths and byte indexes, you can do almost everything as
well or better (with less complication and more efficiently) as by
converting to UTF-32 and back.

Agreed.

People sometimes say they want to expand to code points to be able to
see the length of the string in characters, or to index characters, or
for easier splicing or joining strings. I don't think these are
particularly useful in practice, but UTF-32 is fine for those that want it.

Looking up "splicing strings", I find that this is a term used in
connection with Python for specifying substrings. Python3 is a
language that lives the codepoint mistake to the extreme (and from
what I read, this was one of the major pain points in the
Python2->Python3 transition), but anyway, with UTF-8 one way to
represent a substring is to use the start index and length in bytes
(aka code units) rather than code points.

I was not thinking of Python in particular, and I don't think the term
"splicing" is Python specific. But Python is generally a good and
popular language when you need to do lots of text manipulation, so maybe
that's where the association comes from (at least for search engines).

People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character. I agree with you that this is not actually true, especially
if you want to support arbitrary Unicode characters (such as combining
characters) that don't fit in a single code point. But it is not
uncommon to think it is, and if you can make some simplifications to the
text you support (specifically, limiting your code to single code point
characters) then UTF-32 can be helpful. (I think everyone will at least
agree that it's better than UTF-16!)

Looking up "joining strings" brings up the Python join() method, which
is a variant of string concatenation. There is certainly no need to
convert UTF-8 to UTF-32 and back for concatenating strings; just
concatenate the UTF-8 strings.

Sure.

From Anton Ertl@21:1/5 to David Brown on Sat May 11 17:39:30 2024

David Brown <[email protected]> writes:

People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character.

But they are wrong. Fixed-size units per character are unnecessary
and not helpful for joining, splitting, and replacing. And for nearly
all of "etc.".

But it is not
uncommon to think it is, and if you can make some simplifications to the
text you support (specifically, limiting your code to single code point
characters) then UTF-32 can be helpful.

Yes, many people think so, but they are mistaken.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Thomas Koenig@21:1/5 to Anton Ertl on Sun May 12 07:34:35 2024

Anton Ertl <[email protected]> schrieb:

The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by using
UTF-32 (or UTF-16); it isn't, and one cannot.

Do you really mean one cannot change an individual character
using UTF-32? I assume you mean "there is no need to do it"..

If you stick with UTF-8
and use byte lengths and byte indexes, you can do almost everything as
well or better (with less complication and more efficiently) as by
converting to UTF-32 and back.

Assume you're implementing a language which has a function of
setting an individual character in a string. How would you
implement it? Run through the string? Would you then also
store additional information somewhere so that the next character
that the user sets does not need to do it again?

Just curious...
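
One possible answer to the "run through the string" option, sketched in Python. The function name set_code_point and the immutable-bytes interface are assumptions for illustration; the sketch replaces the code point at a given index by scanning for UTF-8 lead bytes, and keeps no index between calls.

```python
def set_code_point(buf: bytes, index: int, ch: str) -> bytes:
    """Return a copy of UTF-8 `buf` with code point number `index` replaced."""
    count = -1
    start = None
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:          # not a continuation byte: new code point
            count += 1
            if count == index:
                start = i             # first byte of the target code point
            elif count == index + 1:
                # i is the first byte after the target code point
                return buf[:start] + ch.encode("utf-8") + buf[i:]
    if start is None:
        raise IndexError(index)
    return buf[:start] + ch.encode("utf-8")  # target was the last code point


before = "café!".encode("utf-8")
after = set_code_point(before, 3, "e")
```

Each call is linear in the string length and the byte length can change, which is exactly the cost the question is probing. Note also that this indexes code points, not user-perceived characters.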

From Michael S@21:1/5 to Thomas Koenig on Sun May 12 11:39:14 2024

On Sun, 12 May 2024 07:34:35 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Anton Ertl <[email protected]> schrieb:

The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by
using UTF-32 (or UTF-16); it isn't, and one cannot.

Do you really mean one cannot change an individual character
using UTF-32? I assume you mean "there is no need to do it"..

I would think that Anton meant to say that UCS-4/UTF-32 code point is
not the same as individual character.

From MitchAlsup1@21:1/5 to Paul A. Clayton on Tue May 14 22:36:54 2024

Paul A. Clayton wrote:

On 5/6/24 3:13 PM, MitchAlsup1 wrote:

Placing bit-field access INSIDE LDs and STs requires adding 2 stages
of multiplexing in the LD/ST aligners (memory shifters). This has the
potential to slow the overall pipeline frequency--at which point you
have lost more than you can gain.

The extra shifting could be applied only for bit-granular
accesses, so byte-granular accesses could have normal latency.
(Bit-field loads would have higher latency.)

If you only "apply" the bit level multiplexing when needed, instead
of having 2 added gate delays you now have 3 !!

From Lawrence D'Oliveiro@21:1/5 to John Levine on Mon May 27 01:05:36 2024

On Thu, 9 May 2024 01:24:54 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:

It doesn't make sense to say that character strings are big- or
little-endian.

Yes it does, for just about any encoding other than UTF-8. Thus, you
have UTF16BE, and UTF16LE.

Not really, those are byte orders within a character ...

Within an integer character code. Which is exactly what endianness is all about.

From Lawrence D'Oliveiro@21:1/5 to David Brown on Mon May 27 01:15:10 2024

On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:

People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character.

It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”. Do all your text manipulation in this internal
representation, then write it back to regular text in UTF-8 or whatever
other format you need.
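
A minimal Python sketch of such a representation, assuming "combining" means a nonzero canonical combining class as reported by unicodedata.combining. The function name clusters is made up for illustration. Each list element is one "character" in the sense defined above, stored as a short string.

```python
import unicodedata


def clusters(s: str):
    """Split `s` into base code points plus their trailing combining marks."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new "character"
    return out


chars = clusters("e\u0301a")   # ['e' + accent, 'a']
text = "".join(chars)          # back to ordinary text
```

The list is randomly accessible by index, but its elements are strings of varying length, so the array is really one of references. The rule is also only an approximation: sequences joined with ZERO WIDTH JOINER, for example, are not grouped, because that code point has combining class 0.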

From John Levine@21:1/5 to [email protected] on Mon May 27 02:54:49 2024

It appears that Lawrence D'Oliveiro <[email protected]d> said:

On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:

People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character.

It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.

I have to ask, how much storage do each of these fixed-size character
things take?

How do you know?

I've been poking at Unicode for a while and I don't have the faintest
idea, particularly if you include groups of emoji with ZWJ that are
rendered as one image, as in this ever increasing list. Groups
can have 9 code points, maybe more:

https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html
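
For a sense of scale, the Python snippet below measures one well-known ZWJ sequence, the four-person family emoji (man, woman, girl, boy). The choice of this sequence is just an example; it is written with escapes so the source stays ASCII.

```python
# MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY: rendered as one image where supported
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)
utf8_bytes = len(family.encode("utf-8"))
utf32_bytes = len(family.encode("utf-32-le"))

print(code_points, utf8_bytes, utf32_bytes)  # 7 25 28
```

A fixed-size slot able to hold any such "character" would need room for the longest sequence in the list, and the list keeps growing.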

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Lawrence D'Oliveiro@21:1/5 to John Levine on Mon May 27 07:18:51 2024

On Mon, 27 May 2024 02:54:49 -0000 (UTC), John Levine wrote:

It appears that Lawrence D'Oliveiro <[email protected]d> said:

On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:

People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character.

It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.

I have to ask, how much storage do each of these fixed-size character
things take?

That’s not important; what’s important is that you can put characters as elements in an array, randomly accessible just by array index.


From John Levine@21:1/5 to All on Mon May 27 15:09:23 2024

According to Lawrence D'Oliveiro <[email protected]d>:

It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.

I have to ask, how much storage do each of these fixed-size character
things take?

That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.

How am I supposed to write my code with an array of fixed size things if
I don't know how big the things are?

If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and I
suspect nobody else does either.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From EricP@21:1/5 to John Levine on Mon May 27 12:45:09 2024

John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.

I have to ask, how much storage do each of these fixed-size character
things take?

That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.

How am I supposed to write my code with an array of fixed size things if
I don't know how big the things are?

If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and I
suspect nobody else does either.

One could have instructions that make it easier to parse the
variable length UTF-8 sequences into codepoints.
The first byte high order bits tells you the byte run length and also
how to extract and shift the bit fields to assemble a 4-byte codepoint
after those 1..4 bytes have been loaded into a register.
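
In software, the bit-field logic described above looks roughly like the following sketch. The name decode_one is hypothetical, and the checks are deliberately minimal: it validates lead and continuation bytes but does not reject overlong encodings or surrogates, which a production decoder would.

```python
def decode_one(buf, pos=0):
    """Return (code_point, length) for the UTF-8 sequence at buf[pos]."""
    b0 = buf[pos]
    # The high-order bits of the first byte give the run length
    # and tell us how many payload bits that byte contributes.
    if b0 < 0x80:                 # 0xxxxxxx: plain ASCII
        return b0, 1
    elif b0 >> 5 == 0b110:        # 110xxxxx: 2-byte sequence
        length, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:       # 1110xxxx: 3-byte sequence
        length, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:      # 11110xxx: 4-byte sequence
        length, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(buf):
        raise ValueError("truncated sequence")
    # Each continuation byte (10xxxxxx) contributes six more bits.
    for b in buf[pos + 1 : pos + length]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, length

print(decode_one("\u20ac".encode("utf-8")))   # EURO SIGN → (8364, 3)
```

The chain of conditional branches, masks and shifts here is what a dedicated instruction would collapse into a single operation.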

Variable 1 to 8 byte count register load and store instructions could be helpful here too. Or lengths of 1..64 bytes if SIMD registers are used,
because then we could apply Mitch's log_2 parallel parse method to
multiple codepoints in the wide SIMD register and parse a bunch of
codepoints in one clock and right justify them.

It would still have to look up whether a codepoint was combining or
stand alone. I don't see a firm definition of whether combining codepoints
come before or after the base; coming after would require a lookahead parse,
and so extra checks to ensure it doesn't look past the buffer end.


From John Levine@21:1/5 to All on Mon May 27 19:09:51 2024

According to EricP <[email protected]>:

John Levine wrote:

If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and I
suspect nobody else does either.

One could have instructions that make it easier to parse the
variable length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

It would still have to look up whether a codepoint was combining or
stand alone. I don't see a firm definition of whether combining codepoints
come before or after the base; coming after would require a lookahead parse,
and so extra checks to ensure it doesn't look past the buffer end.

I think they come after but I haven't looked in enough detail. And
then you have all of the issues with precomposed characters: do you
normalize as you go or denormalize, or what?

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From MitchAlsup1@21:1/5 to John Levine on Mon May 27 20:41:38 2024

John Levine wrote:

According to EricP <[email protected]>:

John Levine wrote:

If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and I
suspect nobody else does either.

One could have instructions that make it easier to parse the
variable length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

It would still have to look up whether a codepoint was combining or
stand alone. I don't see a firm definition of whether combining codepoints
come before or after the base; coming after would require a lookahead parse,
and so extra checks to ensure it doesn't look past the buffer end.

I think they come after but I haven't looked in enough detail. And
then you have all of the issues with precomposed characters: do you
normalize as you go or denormalize, or what?

Character search (or compare) becomes 'grep'.


From Lawrence D'Oliveiro@21:1/5 to John Savard on Tue May 28 01:10:02 2024

On Wed, 08 May 2024 20:50:53 -0600, John Savard wrote:

On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:

But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.

No, because character string conversion is subject to localization
issues.

I agree that little-endian computers make sense for people whose native language is Hebrew or Arabic.

That doesn’t actually make any sense.


From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue May 28 01:12:56 2024

On Wed, 8 May 2024 19:47:34 +0200, Terje Mathisen wrote:

It was exactly these kinds of optimizations I made in order to double
the speed of Intel's reference BluRay decoder. However, instead of
asking me to write a complete version they decided to licence a piece of
VLSI to do it in hardware, and that was almost certainly the correct
decision since my code needed 4 cores working nearly 100% in order to
handle the highest possible size/speed quality (1080p, 60 Hz, CABAC
encoding and 40 Mbit/s bitrate).

Still, that sounds like something that could be useful in a transcoder
like FFmpeg.

4 cores sounds like a modest requirement these days; nproc reports 24 on
the machine I’m using now. And 16 on my laptop.


From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 28 01:24:41 2024

On Mon, 27 May 2024 15:09:23 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately
-following combining code points”.

I have to ask, how much storage do each of these fixed-size character
things take?

That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.

How am I supposed to write my code with an array of fixed size things if
I don't know how big the things are?

The fixed-size things are references to objects. Or in a lower-level
language like C, they could indeed be pointers/indexes into an array of
code points.

If you mean an array of pointers to sequences of code points, well sure,
but now we're back to variable length encodings.

We’re not, because we still have easy random access, and the length of the array is the number of characters.


From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue May 28 01:25:37 2024

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine
instructions to convert character encodings?


From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue May 28 08:01:36 2024

Lawrence D'Oliveiro wrote:

On Mon, 27 May 2024 15:09:23 -0000 (UTC), John Levine wrote:

According to Lawrence D'Oliveiro <[email protected]d>:

It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately
-following combining code points”.

I have to ask, how much storage do each of these fixed-size character
things take?

That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.

How am I supposed to write my code with an array of fixed size things if
I don't know how big the things are?

The fixed-size things are references to objects. Or in a lower-level
language like C, they could indeed be pointers/indexes into an array of
code points.

If you mean an array of pointers to sequences of code points, well sure,
but now we're back to variable length encodings.

We’re not, because we still have easy random access, and the length of the array is the number of characters.

If you need efficient random read access to particular unicode
characters, possibly consisting of multiple codepoints, then I would
guess a skip list to be very efficient:

Just a helper array containing the starting offsets to every ~32 or so
utf8 characters. This would add 12.5% overhead for a file containing
only US ASCII if using 32-bit offsets (4 bytes of index per 32 bytes of
text), while the more long characters you have, the lower the overhead.

When accessing a particular character you could of course use linear
scanning past the nearest preceding index entry.
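
A minimal sketch of such a helper array, assuming one entry per 32 code points rather than per 32 multi-codepoint characters (a simplification), and assuming the input is valid UTF-8. The names build_index and char_at are hypothetical.

```python
STRIDE = 32  # one index entry per 32 code points

def is_continuation(byte):
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80

def build_index(data, stride=STRIDE):
    """Byte offsets of code points 0, stride, 2*stride, ... in UTF-8 data."""
    index = []
    count = 0
    for offset, byte in enumerate(data):
        if not is_continuation(byte):      # a code point starts here
            if count % stride == 0:
                index.append(offset)
            count += 1
    return index

def char_at(data, index, n, stride=STRIDE):
    """Return code point n: jump to the nearest preceding index entry,
    then scan linearly over at most stride - 1 code points."""
    if n < 0 or n // stride >= len(index):
        raise IndexError(n)
    offset = index[n // stride]
    for _ in range(n % stride):
        offset += 1
        while offset < len(data) and is_continuation(data[offset]):
            offset += 1
    if offset >= len(data):
        raise IndexError(n)
    end = offset + 1
    while end < len(data) and is_continuation(data[end]):
        end += 1
    return data[offset:end].decode("utf-8")

text = "a\u00e9\u20ac\U0001F600" * 20      # mix of 1- to 4-byte code points
data = text.encode("utf-8")
index = build_index(data)
print(char_at(data, index, 42) == text[42])
```

With 4-byte offsets, pure ASCII input costs 4 index bytes per 32 text bytes, which is the 12.5% figure; longer encodings spread the same 4 bytes over more text.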

If you also need to edit the utf8 character array, then you could
augment the primary index with one or more higher layers, i.e. a classic
skip list.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From EricP@21:1/5 to John Levine on Tue May 28 11:04:38 2024

John Levine wrote:

According to EricP <[email protected]>:

John Levine wrote:

If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and I
suspect nobody else does either.

One could have instructions that make it easier to parse the
variable length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

It would still have to look up whether a codepoint was combining or
stand alone. I don't see a firm definition of whether combining codepoints
come before or after the base; coming after would require a lookahead parse,
and so extra checks to ensure it doesn't look past the buffer end.

I think they come after but I haven't looked in enough detail.

It appears they defined it as you described, with the base character
first and optional combiners following.
https://www.unicode.org/glossary/#combining_character_sequence

I was thinking that as UTF-8 can be parsed in either direction,
the order should be defined such that the usual case, low to high scan,
is most efficient.

That order should be to put the combiner(s) first and the base codepoint
last so the base code acts like a parse stop-code and makes a lookahead
toward higher addresses unnecessary.

A backwards scan still works but it has to look ahead backwards to check
if there is a combiner, which there usually isn't, and unget it if not.
As that is extra work, checking for buffer overflow etc., and touches
extra bytes that are usually unused, this should be the second choice.

But it appears they chose the least efficient way to do it.
Sigh... oh well.
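
For comparison, here is a sketch of the backward scan under the order Unicode actually uses (base first, combiners after). In that order it is the backward scan that meets the base last, so the base serves as the stop code in that direction, while the forward scan is the one that needs lookahead. The function names are hypothetical and the input is assumed to be valid UTF-8.

```python
import unicodedata

def prev_code_point_start(data, pos):
    """Offset of the code point that ends just before data[pos]."""
    if pos <= 0:
        raise IndexError("already at start of buffer")
    pos -= 1
    # Skip continuation bytes (10xxxxxx) until a lead byte is found.
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    return pos

def prev_character_start(data, pos):
    """Step back over combining marks until the base code point is reached."""
    while True:
        start = prev_code_point_start(data, pos)
        cp = data[start:pos].decode("utf-8")
        if unicodedata.combining(cp) == 0 or start == 0:
            return start        # base code point (or start of buffer)
        pos = start             # it was a combiner: keep going back

data = "ae\u0301z".encode("utf-8")     # a, e + COMBINING ACUTE, z
print(prev_character_start(data, 4))   # start of the "e + accent" character
```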

And
then you have all of the issues with precomposed characters: do you
normalize as you go or denormalize, or what?

And fields in forms have fixed screen size, while record struct
and database fields have fixed byte size.

Fortunately I don't have to deal with any of this.


From Thomas Koenig@21:1/5 to Lawrence D'Oliveiro on Tue May 28 16:02:10 2024

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine instructions to convert character encodings?

Have you looked at decoding algorithms for UTF-8?


From EricP@21:1/5 to Thomas Koenig on Tue May 28 12:23:12 2024

Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine
instructions to convert character encodings?

Have you looked at decoding algorithms for UTF-8?

It's almost like the perfect application of RISC instruction design:
a long sequence of individual instructions of conditional branches,
bit field extracts, inserts, and shifts, is replaced in HW by
a small number of muxes that can do the same in one clock.


From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed May 29 04:46:34 2024

On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine
instructions to convert character encodings?

Have you looked at decoding algorithms for UTF-8?

Of course. Isn’t the point of RISC that these complex operations are more efficiently performed by a sequence of simpler instructions?


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed May 29 07:04:35 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

What for? Dealing with code points is rarely necessary, so adding
instructions for that is a waste (and it's not surprising to me that
neither AMD64 nor ARM A64 have such instructions; IBM z seems to
add special instructions that are rarely useful as a marketing
argument).

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine
instructions to convert character encodings?

Have you looked at decoding algorithms for UTF-8?

Of course. Isn’t the point of RISC that these complex operations are more
efficiently performed by a sequence of simpler instructions?

The IBM z series are not RISCs.

Anyway, such instructions can be done in a RISCy way (pure
register-to-register instructions) or in a CISCy way
(memory-to-memory).

A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
of the remaining string in a register and to produce a UTF-32 code
point in another register and a length in a third register (or in the
high part of the destination register to reduce write port
requirements). Similarly for UTF-32->UTF-8, with the length
specifying the length of the result; that would need to be combined
with a length masked store to make it easy to store the result.
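
As a software model of the UTF-32 -> UTF-8 direction, the sketch below returns the encoded bytes together with their length, and the usage lines imitate a length-masked store into a buffer. The name encode_one is hypothetical; this models the data flow only and says nothing about how any real instruction is specified.

```python
def encode_one(cp):
    """Return (utf8_bytes, length) for one code point."""
    if cp < 0x80:
        out = bytes([cp])
    elif cp < 0x800:
        out = bytes([0xC0 | (cp >> 6),
                     0x80 | (cp & 0x3F)])
    elif cp < 0x10000:
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogates are not encodable in UTF-8")
        out = bytes([0xE0 | (cp >> 12),
                     0x80 | ((cp >> 6) & 0x3F),
                     0x80 | (cp & 0x3F)])
    elif cp < 0x110000:
        out = bytes([0xF0 | (cp >> 18),
                     0x80 | ((cp >> 12) & 0x3F),
                     0x80 | ((cp >> 6) & 0x3F),
                     0x80 | (cp & 0x3F)])
    else:
        raise ValueError("code point out of range")
    return out, len(out)

# "Length-masked store": only the first n bytes of the result are written.
buf = bytearray(8)
out, n = encode_one(0x20AC)        # EURO SIGN
buf[0:n] = out
print(n, bytes(buf[:n]))
```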

This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other representation plus a length of the UTF-8 representation.

The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.

We have been discussing shift buffers; those would be useful for such instructions.

A CISCy approach is similar to a block copy: have a source operand in
memory (represented by an address and maybe a length) and a
destination operand (represented by an address and a length), and run the
instruction in a loop until it is finished (the loop is there to allow
interrupting the instruction in the middle, e.g., for page faults).

Looking at CU14 on page 7-136 of <https://www.ibm.com/docs/en/SSQ2R2_15.0.0/com.ibm.tpf.toolkit.hlasm.doc/dz9zr006.pdf>,
CU14 takes the CISCy approach outlined above.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed May 29 07:59:21 2024

Lawrence D'Oliveiro <[email protected]d> writes:

The fixed-size things are references to objects. Or in a lower-level
language like C, they could indeed be pointers/indexes into an array of
code points.

There is no need for UTF-32 for such an approach. Just let the
pointers/indexes point to the start of the character in UTF-8
representation.

[...] we still have easy random access, and the length of the
array is the number of characters.

Both of which are rarely necessary.

But sure, if you need that, the approach of having an array of
pointers to characters in UTF-8 representation works, while converting
to UTF-32 does not help at all.
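
A rough sketch of that arrangement: the text stays in UTF-8 and a separate array holds the byte offset at which each character starts. The name character_offsets is hypothetical, "character" is again taken to mean a base code point plus following combining marks, and the temporary decode is used only to classify code points while building the array.

```python
import unicodedata

def character_offsets(data):
    """Byte offsets where each character starts, plus an end sentinel."""
    offsets = []
    pos = 0
    for cp in data.decode("utf-8"):
        # A non-combining code point (or the very first one) starts a character.
        if unicodedata.combining(cp) == 0 or not offsets:
            offsets.append(pos)
        pos += len(cp.encode("utf-8"))
    offsets.append(pos)              # sentinel: one past the last byte
    return offsets

data = "e\u0301a\u20ac".encode("utf-8")
offsets = character_offsets(data)
print(len(offsets) - 1)                          # number of characters
print(data[offsets[0]:offsets[1]])               # bytes of character 0
```

Character i is the slice data[offsets[i]:offsets[i + 1]], so random access and the character count both come from the offset array while the payload remains UTF-8.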

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From EricP@21:1/5 to Anton Ertl on Wed May 29 10:10:30 2024

Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

What for? Dealing with code points is rarely necessary, so adding
instructions for that is a waste (and it's not surprising to me that
neither AMD64 nor ARM A64 have such instructions; IBM z seems to
add special instructions that are rarely useful as a marketing
argument).

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.

But even something as simple as sanitizing a character string to feed
into SQL will have to.

And while I've not dealt with it myself, I can see just by looking at
UTF-8 and its variable sized characters of variable sized code points
that it likely makes string processing 10 times more complicated.

As string processing is 99% of what business software manipulates,
and international string processing is a large part of IBM's services
market, services that they have to compete against others to sell,
it doesn't surprise me that they would add instructions which facilitate it.

Many processors have instructions for particular operations:
Find First/Last One/Zero, bit field reverse for FFT,
POPCOUNT for them-who-shall-not-be-named.

A Sign Extend instruction is just a way to decompress a redundant-high-order-bit-compressed integer.

Why not instructions to decompress the most high frequency usage
compressed character set?


From Stefan Monnier@21:1/5 to All on Wed May 29 11:09:26 2024

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.

AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.

Stefan


From Stefan Monnier@21:1/5 to All on Wed May 29 11:55:18 2024

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.

AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan

Of course with apologies to Herr Koenig's umlauts. :-)

And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.

I think you misunderstand: the code written to sanitize an ASCII string to
feed into SQL will *just work* to sanitize a UTF-8 string to feed
into SQL, no matter how many funny characters and joiners and combiners
and emojis you have in there.

That's part of the reason why UTF-8 is so popular: you can surprisingly
often treat it as "good old ASCII".
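
A small sketch of why this works: every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte-level operation that only looks for ASCII characters can never match inside one. Quote doubling stands in here for whatever ASCII-oriented sanitizer is in use; the function name is hypothetical and this is an illustration of the encoding property, not a recommended way to build SQL.

```python
def escape_quotes_bytes(raw):
    """Double every ASCII single quote, working purely on bytes."""
    return raw.replace(b"'", b"''")

name = "O'Brien \u00fc \u4e2d\u6587 \U0001F600"
escaped = escape_quotes_bytes(name.encode("utf-8"))

# Same result as doing the replacement on decoded text.
assert escaped.decode("utf-8") == name.replace("'", "''")

# No byte of any non-ASCII character falls in the ASCII range.
assert all(b >= 0x80
           for ch in name if ord(ch) > 127
           for b in ch.encode("utf-8"))
```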

Stefan


From EricP@21:1/5 to Stefan Monnier on Wed May 29 11:46:40 2024

Stefan Monnier wrote:

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.

AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.

Stefan

Of course with apologies to Herr Koenig's umlauts. :-)

And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.


From EricP@21:1/5 to Stefan Monnier on Wed May 29 13:20:14 2024

Stefan Monnier wrote:

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.

AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan

Of course with apologies to Herr Koenig's umlauts. :-)

And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.

I think you misunderstand: the code written to sanitize an ASCII string to feed into SQL will *just work* to sanitize a UTF-8 string to feed
into SQL, no matter how many funny characters and joiners and combiners
and emojis you have in there.

That's part of the reason why UTF-8 is so popular: you can surprisingly
often treat it as "good old ASCII".

Stefan

Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.

I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.
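
Checking that a byte string is well-formed UTF-8 is cheap in most languages. A minimal Python sketch, relying on the strict UTF-8 codec (the function name is hypothetical):

```python
def is_well_formed_utf8(raw):
    """True if raw decodes as UTF-8 under Python's strict error handling."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf8("\u00fc".encode("utf-8")))   # valid 2-byte sequence
print(is_well_formed_utf8(b"\xbc"))                    # lone continuation byte
print(is_well_formed_utf8(b"\xc0\xaf"))                # overlong encoding of '/'
```

This only establishes that the bytes form valid code points. It says nothing about whether a string begins with a stray combining mark or whether two strings that render identically are equal.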

Random hack thought #1: suppose the string I send starts with an umlaut as
the first code point, which doesn't display because it is invalid.
Then someone edits the first char to a/o/u and magically it changes
to a different character, and deposits now go to a different account.

Random hack thought #2: If a character has multiple combiner code points,
does changing the order create a different character or do they map to
the same display character? Or worse, maybe combiner code point order sensitivity is character dependent, some are, some are not.
If they do display the same, then I might create two accounts that
look identical but index differently, and redirect deposits.


From John Levine@21:1/5 to All on Wed May 29 18:42:32 2024

According to EricP <[email protected]>:

Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.

I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.

If you're sending the strings to a database, the database will
invariably do detailed string validation so I wouldn't bother, but be
prepared for the error code if it rejects the string.

Random hack thought #1: if the string I send starts with an umlaut as
the first code point, ...

A bare umlaut displays just fine. But see below.

Random hack thought #2: If a character has multiple combiner code points,
does changing the order create a different character or do they map to
the same display character? Or worse, maybe combiner code point order
sensitivity is character dependent, some are, some are not.

Unicode has normalization forms that deal with this. The most common
are NFC which uses precomposed combined characters, and NFD where
they're all separate (Composed and Decomposed.) NFD puts the combiners
in a well defined order. Sensible people put all their strings into
NFC or NFD before doing anything else with them.

https://www.unicode.org/reports/tr15/
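
A short sketch of what normalization buys you, using Python's standard unicodedata.normalize. The helper same_text is hypothetical; the two test strings differ only in precomposition or in the order of their combining marks.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after putting both into one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00fc"        # u-umlaut as one precomposed code point
decomposed = "u\u0308"     # 'u' followed by COMBINING DIAERESIS
print(composed == decomposed)            # raw comparison: False
print(same_text(composed, decomposed))   # after NFC: True

# Two combiners in different orders: DOT BELOW and ACUTE ACCENT.
x = "a\u0323\u0301"
y = "a\u0301\u0323"
print(x == y)                            # raw comparison: False
print(same_text(x, y, form="NFD"))       # canonical ordering makes them equal
```

Normalizing both account names before indexing them would collapse the look-alike pairs from the second hack thought into one key.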
--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From George Neuner@21:1/5 to [email protected] on Wed May 29 22:26:00 2024

On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
<[email protected]> wrote:

According to EricP <[email protected]>:

Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.

I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.

If you're sending the strings to a database, the database will
invariably do detailed string validation so I wouldn't bother, but be
prepared for the error code if it rejects the string.

Far too much SQL is constructed by simply splicing user input into a
query "template" string.

When queries are done right with all user input provided via SQL
parameters, then there is far less need to "sanitize" input.

There is one major caveat: in SQL, table names can't be specified by
parameter. If the user must provide a table name, then you DO have to
splice the query string and you DO have to be careful.
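
A minimal sketch of both halves of that advice, using Python's standard sqlite3 module with an in-memory database. The table names, the allowlist and the function find_by_name are hypothetical; an allowlist is just one way of being careful with a spliced table name.

```python
import sqlite3

ALLOWED_TABLES = {"accounts", "deposits"}   # hypothetical table names

def find_by_name(conn, table, name):
    """Look up rows by name; the value is bound, the table name is checked."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    # The table name has to be spliced in, so it is checked against the
    # allowlist first. The user-supplied value goes in as a parameter.
    query = f"SELECT id, name FROM {table} WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
hostile = "O'Brien; DROP TABLE accounts"
conn.execute("INSERT INTO accounts (name) VALUES (?)", (hostile,))

print(find_by_name(conn, "accounts", hostile))
```

The hostile string is stored and matched as plain data because it never becomes part of the SQL text.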


From Lawrence D'Oliveiro@21:1/5 to EricP on Thu May 30 02:42:29 2024

On Wed, 29 May 2024 11:46:40 -0400, EricP wrote:

You could always explain to the company president that you only work in
ASCII so they should just get used to it.

That stopped being acceptable back in about the 1980s.


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu May 30 02:37:51 2024

On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?

The IBM z series are not RISCs.

Doesn’t matter. The principles of designing high-performance architectures still apply: simpler instructions are better than more complex ones.


From Lawrence D'Oliveiro@21:1/5 to EricP on Thu May 30 02:41:20 2024

On Wed, 29 May 2024 10:10:30 -0400, EricP wrote:

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.

We are all “non 1-byte character markets” now.

Just to rub it in: «€£¢©®±»


From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Thu May 30 03:26:05 2024

Lawrence D'Oliveiro wrote:

On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?

The IBM z series are not RISCs.

Doesn’t matter. The principles of designing high-performance
architectures still apply: simpler instructions are better than more
complex ones.

IBM has, for a long time, combined commonly occurring sequences of
instructions into single instructions. I don't know the tradeoffs
here. Perhaps John Levine does?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Terje Mathisen@21:1/5 to EricP on Thu May 30 10:10:55 2024

EricP wrote:

Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine
instructions to convert character encodings?

Have you looked at decoding algorithms for UTF-8?

It's almost like the perfect application of risc instruction design:
a long sequence of individual instructions of conditional branches,
bit field extracts, inserts, and shifts, is replaced in HW by
a small number of muxes that can do the same in one clock.

If that CU14 can also return the number of bytes consumed, along with
the resulting 32-bit character, then it would be perfect. Is that what
it is doing?

You still have the horrible combining codepoints problem of course,
where you have to apply CU14 once more just in order to find out if it
was in fact a combining code, and do that without any buffer overruns etc.

Personally I tend to punt on these kinds of algorithms and simply demand
that the decoding source buffer have at least enough extra buffer space
at the end to avoid the problem.

I.e. my LZ4 decoder is significantly faster than what Google is using,
but it will happily grab up to 11 or 27 bytes past the actual end of input.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to Anton Ertl on Thu May 30 10:36:03 2024

Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:

Lawrence D'Oliveiro <[email protected]d> schrieb:

On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:

According to EricP <[email protected]>:

One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.

What for? Dealing with code points is rarely necessary, so adding
instructions for that is a waste (and it's not surprising to me that
neither AMD64 nor ARM A64 have such instructions; IBM z seems to
add special instructions that are rarely useful as a marketing
argument).

That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.

What is the point, in this day and age, of having special machine
instructions to convert character encodings?

Have you looked at decoding algorithms for UTF-8?

Of course. Isn’t the point of RISC that these complex operations are more
efficiently performed by a sequence of simpler instructions?

The IBM z series are not RISCs.

Anyway, such instructions can be done in a RISCy way (pure register-to-register instructions) or in a CISCy way
(memory-to-memory).

A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
of the remaining string in a register and producing an UTF-32 code
point in another register and a length in a third register (or in the
high part of the destination register to reduce write port
requirements). Similarly for UTF-32->UTF-8, with the length
specifying the length of the result; that would need to be combined
with a length masked store to make it easy to store the result.

This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other representation plus a length of the UTF-8 representation.

The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.

Rather the opposite:

UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
variable length (x86?) instruction decoder, but with the big
simplification that the first byte directly tells you how long the
sequence is.

Doing a SIMD version corresponds to a superscalar x86 in that the
decoder needs to grab a variable number of bytes for each instruction, starting the next immediately after.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to Terje Mathisen on Thu May 30 12:45:27 2024

Terje Mathisen wrote:

Anton Ertl wrote:

This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other
representation plus a length of the UTF-8 representation.

The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.

Rather the opposite:

UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
variable length (x86?) instruction decoder, but with the big
simplification that the first byte directly tells you how long the
sequence is.

Doing a SIMD version corresponds to a superscalar x86 in that the
decoder needs to grab a variable number of bytes for each instruction, starting the next immediately after.

Even better (compared to a superscalar x86 instruction decoder), _every_
byte uses the top two bits to tell you if this is 7-bit ascii, the start
of a UTF-8 encoded code point, or a follow-on byte inside a UTF-8 code
point.

This means that each decoder can work alone, without having to wait for
the length decoding of the previous code point ("instruction") before
deciding to discard or pass on the results it got from starting where it
did.

It seems like it would be very feasible to have (say) 8 parallel
decoders starting at every corresponding byte offset, and return a SIMD register with 2-8 32-bit decoded code points, right?
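
The idea of one independent decoder per byte offset can be sketched in software. This is only a rough model, not a hardware design: the function names `lane_decode` and `decode_window` are made up for illustration, the input is assumed to be valid UTF-8, and the Python loop runs the lanes one after another even though no lane looks at another lane's result.

```python
def lane_decode(buf, i):
    """One hypothetical decoder lane starting at byte offset i.

    Returns (code_point, length), or None if buf[i] is a follow-on byte
    (top two bits 10), in which case the lane discards its result.
    Assumes well-formed UTF-8; no validation is attempted.
    """
    b = buf[i]
    if b & 0xC0 == 0x80:          # 10xxxxxx: inside some other code point
        return None
    if b < 0x80:                  # 0xxxxxxx: 7-bit ASCII
        return b, 1
    if b & 0xE0 == 0xC0:          # 110xxxxx: 2-byte sequence
        n, cp = 2, b & 0x1F
    elif b & 0xF0 == 0xE0:        # 1110xxxx: 3-byte sequence
        n, cp = 3, b & 0x0F
    else:                         # 11110xxx: 4-byte sequence
        n, cp = 4, b & 0x07
    for c in buf[i + 1:i + n]:    # fold in 6 payload bits per follow-on byte
        cp = (cp << 6) | (c & 0x3F)
    return cp, n


def decode_window(buf, start, lanes=8):
    """Run `lanes` decoders at consecutive offsets from `start`.

    Returns (code_points, bytes_consumed). A sequence that starts in the
    last lanes may read up to 3 bytes past the window.
    """
    out = []
    end = start
    for i in range(start, min(start + lanes, len(buf))):
        result = lane_decode(buf, i)
        if result is None:
            continue              # this lane started mid-sequence
        cp, n = result
        if i + n > len(buf):
            break                 # sequence truncated at end of input
        out.append(cp)
        end = i + n
    return out, end - start


data = "aé€😀z".encode("utf-8")
first, used = decode_window(data, 0)
rest, _ = decode_window(data, used)
print([hex(cp) for cp in first + rest])
```

Each lane decides on its own from the top two bits whether to keep or drop its result, which is the property that removes the wait on the previous code point's length.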

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Thomas Koenig@21:1/5 to Terje Mathisen on Thu May 30 11:59:53 2024

Terje Mathisen <[email protected]> schrieb:

Terje Mathisen wrote:

Anton Ertl wrote:

This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other
representation plus a length of the UTF-8 representation.

The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.

Rather the opposite:

UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
variable length (x86?) instruction decoder, but with the big
simplification that the first byte directly tells you how long the
sequence is.

Doing a SIMD version corresponds to a superscalar x86 in that the
decoder needs to grab a variable number of bytes for each instruction,
starting the next immediately after.

Even better (compared to a superscalar x86 instruction decoder), _every_
byte uses the top two bits to tell you if this is 7-bit ascii, the start
of a UTF-8 encoded code point, or a follow-on byte inside a UTF-8 code
point.

This means that each decoder can work alone, without having to wait for
the length decoding of the previous code point ("instruction") before deciding to discard or pass on the results it got from starting where it
did.

It seems like it would be very feasible to have (say) 8 parallel
decoders starting at every corresponding byte offset, and return a SIMD register with 2-8 32-bit decoded code points, right?

Sounds quite reasonable (and would be like what Mitch describes for his
My 66000 decoders). Apart from filling the buffers, it would also need
to return the number of bytes consumed and the number of UTF-32
characters generated, plus a possible error indication.

Looking at what IBM did, the CU14 instruction is memory-to-memory
and they use both the length and the address of both the source
and destination data in register pairs. The number of characters
to process are then decremented according to what has been processed
(and there might be a CPU-defined limit). They also appear to have
optional error checking only.

Complicated...


From Anton Ertl@21:1/5 to Terje Mathisen on Thu May 30 11:54:09 2024

Terje Mathisen <[email protected]> writes:

Anton Ertl wrote:

Anyway, such instructions can be done in a RISCy way (pure
register-to-register instructions) or in a CISCy way
(memory-to-memory).

A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
of the remaining string in a register and producing an UTF-32 code
point in another register and a length in a third register (or in the
high part of the destination register to reduce write port
requirements). Similarly for UTF-32->UTF-8, with the length
specifying the length of the result; that would need to be combined
with a length masked store to make it easy to store the result.

This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other
representation plus a length of the UTF-8 representation.

The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.

Rather the opposite:

UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
variable length (x86?) instruction decoder, but with the big
simplification that the first byte directly tells you how long the
sequence is.

The SIMD version of the RISCy instruction is no problem. So you can
process regbits/32 code points in one go. But what I wrote above
still applies: You use this instruction in a loop like

# s* are SIMD registers, g* are GPRs
l: s0= load(g0)
s1,g1= cu14(s0)
store (g2)<-s1
g0 = g0+g1
g2 = g2+SIMD_width
if g0>=input_end goto end
if g2<output_limit goto l
end:

(probably some fine tuning of the last iteration and the termination
is necessary).
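
For readers who want to see that loop run, here is a scalar software model of it. It is a sketch under stated assumptions: `cu14` below is a made-up Python stand-in for the hypothetical register-to-register instruction (4 input bytes in, code point and length out), not IBM's CU14, and it assumes valid UTF-8.

```python
def cu14(word):
    """Model of the hypothetical register-to-register conversion.

    `word` holds the next 4 bytes of the string (zero-padded at the end).
    Returns (code_point, length_in_bytes). No validation is done.
    """
    b0 = word[0]
    if b0 < 0x80:
        return b0, 1
    if b0 < 0xE0:
        return ((b0 & 0x1F) << 6) | (word[1] & 0x3F), 2
    if b0 < 0xF0:
        return (((b0 & 0x0F) << 12) | ((word[1] & 0x3F) << 6)
                | (word[2] & 0x3F)), 3
    return (((b0 & 0x07) << 18) | ((word[1] & 0x3F) << 12)
            | ((word[2] & 0x3F) << 6) | (word[3] & 0x3F)), 4


def utf8_to_utf32(data):
    """Convert UTF-8 bytes to a list of code points with the loop above."""
    out = []
    pos = 0                                          # plays the role of g0
    while pos < len(data):
        word = data[pos:pos + 4].ljust(4, b"\0")     # load: needs pos
        cp, n = cu14(word)                           # cu14: needs the load
        out.append(cp)                               # store
        pos += n                                     # next address: needs cu14
    return out


print(utf8_to_utf32("aé€😀".encode("utf-8")))
```

The three commented steps form the chain load -> cu14 -> add -> next load; nothing in one iteration can start before the previous iteration's `pos += n` has finished.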

And here you have a dependence chain from load to cu14 to the g0+g1 to
the load of the next iteration. With cu14 and the addition as
single-cycle operations and the load taking 5 cycles as for D-cache
hits on recent Intel CPUs, that's 7 cycles per iteration, limiting the throughput of your conversion routine to 1/7th of what your cu14 and
your load/store unit would be capable of in throughput-limited code.

With a byte-stream buffer as architectural feature, and a CU14 that
takes its utf-8 input from that and automatically advances the stream,
this could be quite a bit more efficient. Something like:

... set up stream buffer ...
l: s1 = cu14(stream-buffer)
store (g2)<-s1
g2 = g2+SIMD_width
if streambuffer empty goto end
if g2<output_limit goto l
end:

(again with some fine-tuning for the last iteration and termination).

For a technically unnecessary marketing gimmick like CU14 one probably
won't add a stream buffer, but, e.g., compression and decompression
are probably more relevant and may also benefit from such a feature.

Doing a SIMD version corresponds to a superscalar x86 in that the
decoder needs to grab a variable number of bytes for each instruction,
starting the next immediately after.

The instructions are fetched into a stream buffer rather than waiting
for the decoder to produce a length result before starting the next
instruction fetch (and of course the instruction fetcher also has to
deal with branches).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu May 30 12:50:38 2024

Lawrence D'Oliveiro <[email protected]d> writes:

On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <[email protected]d> writes:

Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?

The IBM z series are not RISCs.

Doesn’t matter. The principles of designing high-performance architectures
still apply: simpler instructions are better than more complex ones.

Is IBM z a high-performance architecture?

In the present case, the principles of designing high-performance
architectures will tell you that you don't need these instructions.

But if we forget about that for a minute, the block-copy-style
approach of IBM's CU14 instruction means that it could use a stream
buffer internally to avoid the performance snag that I mentioned in
another posting.

However, there is a big difference between what performance features
one can imagine and what is actually implemented. I think that's the
marketing attraction of providing some feature as an instruction: it
lets the sales victim's imagination do the marketing/selling.

Concerning reality: When I looked at block copying a while ago
(Skylake/Zen1 days), I found that my code using a loop of AVX moves outperformed REP MOVSB (where Intel and AMD's microcode should have
done at least as well) in many cases, and that despite Intel adding
"fast string moves" in IIRC Sandy Bridge.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From John Levine@21:1/5 to All on Thu May 30 14:42:14 2024

According to Terje Mathisen <[email protected]>:

It's almost like the perfect application of risc instruction design:
a long sequence of individual instructions of conditional branches,
bit field extracts, inserts, and shifts, is replaced in HW by
a small number of muxes that can do the same in one clock.

If that CU14 can also return the number of bytes consumed, along with
the resulting 32-bit character, then it would be perfect. Is that what
it is doing?

You give it registers with two addresses and two lengths, and it
converts the source UTF-8 code points to destination UTF-32 until it
runs out of input, fills the output, gets an invalid character, or an interrupt. It updates the addresses and lengths. Other than optionally
checking for invalid UTF-8 it does not interpret the code points.

The condition code tells you which it was. If it was an interrupt, you just branch back and keep going.

There's an extra cost flag whether to test for invalid UTF-8.

Read all about it: https://www.vm.ibm.com/library/other/22783213.pdf

It's on page 7-251.

--
Regards,
John Levine, [email protected], Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly


From Anton Ertl@21:1/5 to EricP on Thu May 30 15:35:37 2024

EricP <[email protected]> writes:

Stefan Monnier wrote:

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.

AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan

Of course with apologies to Herr Koenig's umlauts. :-)

And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.

I think you misunderstand: the code written to sanitize an ASCII string to
feed into SQL will *just work* to sanitize a UTF-8 string to feed
into SQL, no matter how many funny characters and joiners and combiners
and emojis you have in there.

That's part of the reason why UTF-8 is so popular: you can surprisingly
often treat it as "good old ASCII".

Stefan

Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.

Actually what you check for is meta-characters like ; " '. They are
all ASCII, so as long as your code is 8-bit-clean, your SQL string
sanitizer needs to know nothing about UTF-8.

I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.

Then run your string through a checker/normalizer before or
afterwards. No need to complicate your SQL sanitizer by trying to do
both at the same time. But if you want the last bit of performance by
doing both at the same time, then you certainly don't want to convert
to UTF-32 and back.

Random hack thought #1: if the string I send starts with an umlaut as
the first code point, which doesn't display because it is invalid.

I found that hard to understand. Do you mean that the string starts
with a composing diaeresis code point and is invalid because it has no
preceding basis with which to compose? The string may fail at the
Unicode checking/normalization stage (depending on what it checks).

Then someone edits the first char to a/o/u and magically it changes
to a different character, and deposits now go to a different account.

If someone can edit the string, and that changes where deposits go to,
someone can do that even with no Unicode involved. E.g., if someone
can change "EricP" to "Ertl". However, my impression is that banks
use account numbers (pure ASCII) for deposits, names are used only for validation; so if you provide the wrong name, a money transfer may
fail to go through (not sure what happens if a deposit does not go
through), but won't be to the wrong account.

Random hack thought #2: If a character has multiple combiner code points,
does changing the order create a different character or do they map to
the same display character? Or worse, maybe combiner code point order
sensitivity is character dependent, some are, some are not.
If they do display the same, then I might create two accounts that
look identical but index differently, and redirect deposits.

That's solved by normalization.
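
A small illustration of what normalization does with combiner order, using Python's standard `unicodedata` module. The sample strings are chosen only for demonstration: marks with different combining classes are put into one canonical order, while marks with the same class keep their order and remain distinct.

```python
import unicodedata

# Dot below (U+0323, combining class 220) and dot above (U+0307, class 230)
# attach at different positions, so their order carries no meaning.
below_then_above = "q\u0323\u0307"
above_then_below = "q\u0307\u0323"

# Dot above (U+0307) and diaeresis (U+0308) both have class 230; they
# stack, so their order is significant and normalization preserves it.
dot_then_diaeresis = "q\u0307\u0308"
diaeresis_then_dot = "q\u0308\u0307"


def same_after_nfc(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(below_then_above == above_then_below)                # raw comparison
print(same_after_nfc(below_then_above, above_then_below))  # after NFC
print(same_after_nfc(dot_then_diaeresis, diaeresis_then_dot))
```

So two account names that differ only in the order of non-interacting combiners compare equal once both are normalized before indexing.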

Here's a story from work I had to do a while ago: users provided data
through some tools written in Python, that data was somehow aggregated
into one csv file (maybe with cat), and there was a Python3 script I
had to run for processing the data. Now some users provided the data
as Latin-1 and some as UTF-8, so the csv file contained a mixture of
that. The Python3 script dutifully reported an error on reading the
csv file as guidelines recommend. This was the wrong thing to do in
this application, as continuing to have this mixture was harmless.

I then wrote a small program (in Gforth) that converted such mixed
files to UTF-8, and that was one of the few uses of the Gforth words
for dealing with UTF-8 that I needed (in most other cases strings are
treated just as opaque data). The principle was to see if the next
bytes were an UTF-8 code point or ASCII; if so, just output them. If
they were neither, the next byte is a Latin-1 character, and is
converted to UTF-8. Fortunately, there is no overlap between the
Latin-1 characters that occurred in these data and the bytes that start
a non-ASCII UTF-8 code point.
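
The principle can be sketched in a few lines. The original program was written in Gforth; what follows is an independent Python sketch of the same idea, with the function name `mixed_to_utf8` invented for illustration. It relies on the same assumption as above: no Latin-1 text in the input happens to form a valid UTF-8 sequence.

```python
def mixed_to_utf8(data):
    """Convert bytes that mix UTF-8 and Latin-1 into pure UTF-8.

    At each position, try to read a valid UTF-8 sequence of 1 to 4 bytes
    and copy it through unchanged. If none fits, treat the single byte as
    a Latin-1 character and re-encode it as UTF-8.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                chunk.decode("utf-8")      # strict: raises on invalid input
            except UnicodeDecodeError:
                continue
            out += chunk                   # ASCII or a valid UTF-8 sequence
            i += len(chunk)
            break
        else:
            # Neither ASCII nor UTF-8: interpret the byte as Latin-1.
            out += bytes([data[i]]).decode("latin-1").encode("utf-8")
            i += 1
    return bytes(out)


mixed = "Köln".encode("latin-1") + b"," + "Zürich".encode("utf-8")
print(mixed_to_utf8(mixed).decode("utf-8"))
```

The ambiguity is real in general: the Latin-1 bytes for "Ã¼" are also the UTF-8 encoding of "ü", so this approach only works when such pairs do not occur in the data.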

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Stefan Monnier@21:1/5 to All on Thu May 30 13:45:41 2024

IBM has, for a long time, combined commonly occurring sequences of
instructions into single instructions. I don't know the tradeoffs here.

I don't know either, but it's hard to believe that it's *just* marketing because there is an actual design and implementation cost involved and
even marketing needs some "hard" data to make a good sell.

My guess is that they have gotten their implementation to a point where
adding instructions is fairly painless (plenty of space in the
instruction encoding, pre-existing micro/milli-code setup where the
size of the micro/milli-code has a negligible impact on cycle time,
chip size, and yield, ...).

Then they use that flexibility to go after specific benchmarks they got
from some important customers. Even if it speeds up the code of
a single customer, it might be worth the effort if it's a large enough
customer and it increases the chances of keeping them on
that architecture.

Maybe each of those cases could be solved about as efficiently by
rewriting part of the code, but we're talking about a market where many
of the customers are here specifically because they don't want to
rewrite their code.

For the case in point, I haven't seen problems where a UTF-32 encoding
is the overall best solution, but I can easily believe that there are
cases where some poorly thought out (but entrenched) API ends up
imposing (directly or not) the use of UTF-32 and makes UTF-8 <-> UTF-32 conversions very frequent.

Stefan


From Stephen Fuld@21:1/5 to Stefan Monnier on Thu May 30 18:23:35 2024

Stefan Monnier wrote:

IBM has, for a long time, combined commonly occurring sequences of
instructions into single instructions. I don't know the tradeoffs
here.

I don't know either, but it's hard to believe that it's just marketing because there is an actual design and implementation cost involved and
even marketing needs some "hard" data to make a good sell.

Yes.

My guess is that they have gotten their implementation to a point
where adding instructions is fairly painless (plenty of space in the instruction encoding, pre-existing micro/milli-code setup where the
size of the micro/milli-code has a negligible impact on cycle time,
chip size, and yield, ...).

Good point. And note that there is some benefit in presumably better
I-cache hit rate, etc. And if they have a hardware streaming buffer,
it is probably easier to make use of it in a single instruction versus
a sequence of instructions.

Then they use that flexibility to go after specific benchmarks they
got from some important customers. Even if it speeds up the code of
a single customer, it might be worth the effort if it's a large enough customer and it increases the chances of keeping them on
that architecture.

Agreed. Furthermore, since IBM has major presence in certain
industries, e.g. banking, if it helps one customer in that industry, it
likely helps others.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From MitchAlsup1@21:1/5 to Stefan Monnier on Thu May 30 18:31:46 2024

Stefan Monnier wrote:

IBM has, for a long time, combined commonly occurring sequences of
instructions into single instructions. I don't know the tradeoffs
here.

I don't know either, but it's hard to believe that it's *just* marketing
because there is an actual design and implementation cost involved and
even marketing needs some "hard" data to make a good sell.

My guess is that they have gotten their implementation to a point where adding instructions is fairly painless (plenty of space in the
instruction encoding, pre-existing micro/milli-code setup where the
size of the micro/milli-code has a negligible impact on cycle time,
chip size, and yield, ...).

Yes, as long as the new instruction is "like" other already existing instructions.

Then they use that flexibility to go after specific benchmarks they got
from some important customers. Even if it speeds up the code of
a single customer, it might be worth the effort if it's a large enough customer and it increases the chances of keeping them on
that architecture.

Maybe each of those cases could be solved about as efficiently by
rewriting part of the code, but we're talking about a market where many
of the customers are here specifically because they don't want to
rewrite their code.

A lot of the added instructions support OS-like features--I infer that
many of these require some kind of atomic activities not easily achieved
with the existing ISA itself.

For the case in point, I haven't seen problems where a UTF-32 encoding
is the overall best solution, but I can easily believe that there are
cases where some poorly thought out (but entrenched) API ends up
imposing (directly or not) the use of UTF-32 and makes UTF-8 <-> UTF-32 conversions very frequent.

30 years ago you could say the same thing about encryption.

Stefan


From EricP@21:1/5 to Lawrence D'Oliveiro on Thu May 30 20:38:08 2024

Lawrence D'Oliveiro wrote:

On Wed, 29 May 2024 10:10:30 -0400, EricP wrote:

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.

We are all “non 1-byte character markets” now.

Just to rub it in: «€£¢©®±»

Unnecessary in my case.
My company's products were a real-time bond pricing and trading system,
and customers were financial companies whose internal systems in this
case only operated within North America in English, in ascii and ebcdic.

They had other systems that did interface with the larger world
and presumably dealt with international character sets.


From Terje Mathisen@21:1/5 to John Levine on Fri May 31 14:36:47 2024

John Levine wrote:

According to Terje Mathisen <[email protected]>:

It's almost like the perfect application of risc instruction design:
a long sequence of individual instructions of conditional branches,
bit field extracts, inserts, and shifts, is replaced in HW by
a small number of muxes that can do the same in one clock.

If that CU14 can also return the number of bytes consumed, along with
the resulting 32-bit character, then it would be perfect. Is that what
it is doing?

You give it registers with two addresses and two lengths, and it
converts the source UTF-8 code points to destination UTF-32 until it
runs out of input, fills the output, gets an invalid character, or an interrupt. It updates the addresses and lengths. Other than optionally checking for invalid UTF-8 it does not interpret the code points.

The condition code tells you which it was. If it was an interrupt, you just branch back and keep going.

There's an extra cost flag whether to test for invalid UTF-8.

Read all about it: https://www.vm.ibm.com/library/other/22783213.pdf

It's on page 7-251.

Thanks!

I did read all of it, and it was pretty close to how I would have
designed a sw function to do the same, except for the very funky ABI:

Both source and destination _must_ be an even register number, with the following odd register providing the count/length.

Just from this little snippet I'm pretty sure this instruction has a
sizeable startup overhead; compiler support is probably in the form of
an intrinsic that knows about the need to allocate two pairs of
registers, each pair starting at an even-numbered register.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From EricP@21:1/5 to George Neuner on Sat Jun 1 12:49:46 2024

George Neuner wrote:

On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
<[email protected]> wrote:

According to EricP <[email protected]>:

Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.

I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.

If you're sending the strings to a database, the database will
invariably do detailed string validation so I wouldn't bother, but be
prepared for the error code if it rejects the string.

Far too much SQL is constructed by simply splicing user input into a
query "template" string.

When queries are done right with all user input provided via SQL
parameters, then there is far less need to "sanitize" input.

There is one major caveat: in SQL, table names can't be specified by
parameter. If the user must provide a table name, then you DO have to
splice the query string and you DO have to be careful.
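
A minimal sketch of both points, using Python's standard `sqlite3` module. The table names, the `ALLOWED_TABLES` set and the `find_by_name` helper are hypothetical; the pattern is what matters: values go in as parameters, and a user-supplied table name is only spliced in after it has been checked against a fixed list.

```python
import sqlite3

# Hypothetical whitelist: the only table names a caller may select from.
ALLOWED_TABLES = {"accounts", "customers"}


def find_by_name(conn, table, name):
    """Look up rows by name; `table` is checked, `name` is a parameter."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unknown table: %r" % (table,))
    # The table name has to be spliced in, which is safe only because it
    # came from the whitelist. The value never touches the SQL text.
    query = f'SELECT id, name FROM "{table}" WHERE name = ?'
    return conn.execute(query, (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO accounts (name) VALUES (?)",
    [("Köhler",), ("x'; DROP TABLE accounts; --",)],
)

# The hostile-looking name is stored and matched as plain data.
print(find_by_name(conn, "accounts", "x'; DROP TABLE accounts; --"))
print(find_by_name(conn, "accounts", "Köhler"))
```

Note that the non-ASCII name needs no special handling here; the driver passes it through as data.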

Yes, I didn't mean not parameterizing the string args.

I was trying to think of ways that I might get your software to combine malformed strings creating something different. This would occur after
the strings have been passed using parameterization, like if an index
is built from two concatenated string fields.


From EricP@21:1/5 to Anton Ertl on Sat Jun 1 12:40:53 2024

Anton Ertl wrote:

EricP <[email protected]> writes:

Stefan Monnier wrote:

I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.

AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan

Of course with apologies to Herr Koenig's umlauts. :-)

And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.

I think you misunderstand: the code written to sanitize an ASCII string to
feed into SQL will *just work* to sanitize a UTF-8 string to feed
into SQL, no matter how many funny characters and joiners and combiners
and emojis you have in there.

That's part of the reason why UTF-8 is so popular: you can surprisingly
often treat it as "good old ASCII".

Stefan

Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.

Actually what you check for is meta-characters like ; " '. They are
all ASCII, so as long as your code is 8-bit-clean, your SQL string
sanitizer needs to know nothing about UTF-8.

Yes, I just skipped to the result.

I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.

Then run your string through a checker/normalizer before or
afterwards. No need to complicate your SQL sanitizer by trying to do
both at the same time. But if you want the last bit of performance by
doing both at the same time, then you certainly don't want to convert
to UTF-32 and back.

If I want to validate combiner codes or normalize characters I need
UTF-32 because I have to work with the whole character as a unit.

Random hack thought #1: if the string I send starts with an umlaut as
the first code point, which doesn't display because it is invalid.

I found that hard to understand. Do you mean that the string starts
with a composing diaeresis code point and is invalid because it has no
preceding basis with which to compose? The string may fail at the
Unicode checking/normalization stage (depending on what it checks).

I was looking for a reason to justify having to perform
full character validation, not just UTF-8 code validation.

I was trying to come up with an example where I give your system
two strings, one contains a valid base character, another containing
a continue code, and your system concatenates the two strings to
create a different string.

Like a first name of 'O' and a last name of umlaut, and your software concatenates them in a database index creating a full name of O-umlaut.

Though admittedly it's difficult to see how that hacks your system
but maybe others can see a way.

Then someone edits the first char to a/o/u and magically it changes
to a different character, and deposits now go to a different account.

If someone can edit the string, and that changes where deposits go to,
someone can do that even with no Unicode involved. E.g., if someone
can change "EricP" to "Ertl". However, my impression is that banks
use account numbers (pure ASCII) for deposits, names are used only for
validation; so if you provide the wrong name, a money transfer may
fail to go through (not sure what happens if a deposit does not go
through), but won't be to the wrong account.

I was just trying to get people thinking of ways that malformed
characters might be used to bypass other validation checks in
their software.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to EricP on Mon Jun 3 08:04:52 2024

On Thu, 30 May 2024 20:38:08 -0400, EricP wrote:

My company's products were a real-time bond pricing and trading system,
and customers were financial companies whose internal systems in this
case only operated within North America in English, in ascii and ebcdic.

No need even for “¢” characters?

From Lawrence D'Oliveiro@21:1/5 to All on Mon Jun 3 08:03:53 2024

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead, we
see newer encryption algorithms (or ways of using them) that work better
on current CPUs. For example, when I was first learning about computer
encryption, I was told that CBC (“Cipher-Block Chaining”) mode was teh
hawtness, but nowadays it’s all about GFC (“Galois-Field Counter”) mode.

From Michael S@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 13:22:27 2024

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high either,
is a requirement for (symmetric-key) cipher. Esp. when the key is
128-bit or shorter.

For example, when I was first learning about
computer encryption, I was told that CBC (“Cipher-Block Chaining”)
mode was teh hawtness,

CBC decrypt is easily parallelized. Encrypt - not so
much.
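
To illustrate the asymmetry, here is a sketch in Python. The "block cipher" is a deliberately trivial, insecure stand-in (XOR with the key, then a byte rotation) invented for this example, as are the key, IV and block size. The point is only the data dependencies: encrypting block i needs ciphertext block i-1 to have been produced already, while decrypting block i needs only ciphertext blocks i-1 and i, which are all available up front.

```python
BLOCK = 4  # toy block size in bytes

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    # Insecure stand-in for a real block cipher: XOR with key, rotate left one byte.
    mixed = xor(block, key)
    return mixed[1:] + mixed[:1]

def toy_decrypt_block(key: bytes, block: bytes) -> bytes:
    # Inverse of the toy cipher: rotate right one byte, XOR with key.
    unrotated = block[-1:] + block[:-1]
    return xor(unrotated, key)

def cbc_encrypt(key: bytes, iv: bytes, blocks: list[bytes]) -> list[bytes]:
    out, prev = [], iv
    for p in blocks:  # inherently serial: each step consumes the previous output
        prev = toy_encrypt_block(key, xor(p, prev))
        out.append(prev)
    return out

def cbc_decrypt_block(key: bytes, iv: bytes, cipher: list[bytes], i: int) -> bytes:
    # Depends only on ciphertext, so any block can be decrypted independently.
    prev = iv if i == 0 else cipher[i - 1]
    return xor(toy_decrypt_block(key, cipher[i]), prev)

key = b"\x13\x37\xc0\xde"
iv = b"\x00\x01\x02\x03"
plain = [b"byte", b"addr", b"essa", b"ble!"]
cipher = cbc_encrypt(key, iv, plain)

# Decrypt in reverse order to show that no block waits on another's result.
recovered = {i: cbc_decrypt_block(key, iv, cipher, i)
             for i in reversed(range(len(cipher)))}
print([recovered[i] for i in range(len(cipher))])
```

Because each call to cbc_decrypt_block is independent, the calls could be spread over several pipelines or cores; the encrypt loop cannot be.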

but nowadays it’s all about GFC (“Galois-Field
Counter”) mode.

GCM (Galois/Counter Mode) is the far more common spelling.

From Scott Lurndal@21:1/5 to Michael S on Mon Jun 3 14:07:12 2024

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high either,
is a requirement for (symmetric-key) cipher. Esp. when the key is
128-bit or shorter.

Most modern CPUs have instruction set support for symmetric ciphers such
as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et al).

High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).

From EricP@21:1/5 to Lawrence D'Oliveiro on Mon Jun 3 10:31:51 2024

Lawrence D'Oliveiro wrote:

On Thu, 30 May 2024 20:38:08 -0400, EricP wrote:

My company's products were a real-time bond pricing and trading system,
and customers were financial companies whose internal systems in this
case only operated within North America in English, in ascii and ebcdic.

No need even for “¢” characters?

Nope, and no pound or euro signs either because currency is dollars
with . as the decimal point. Because otherwise you get into foreign
exchange which is a whole different bucket of fish, not the least of
which are legal and tax issues. That's not to say such issues do not
come up, it's just that if you want to buy $100 million worth of T-bills
then you have to figure out how to convert your euros and deal with the
paperwork.

Actually the only problem with external text I encountered was when one
day the price feed suddenly switched from decimal quantities to fractions
like "12 1/8" or "15 5/32". Someone must have connected old software to
the Reuters trade price network and started broadcasting ancient values.
This was in direct violation of the network specs but there it was anyway.
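
For what it's worth, a receiver that wanted to tolerate both forms could parse them along these lines. This is only a sketch in Python with a made-up function name; it says nothing about how the actual feed handler worked.

```python
from fractions import Fraction

def parse_price(text: str) -> Fraction:
    """Parse '12.125', '15', '5/32' or '12 1/8' into an exact Fraction (hypothetical)."""
    parts = text.split()
    if not parts:
        raise ValueError("empty price field")
    # Fraction accepts '12', '12.125' and '1/8' directly, so a mixed number
    # is just the sum of its whitespace-separated parts.
    return sum((Fraction(p) for p in parts), Fraction(0))

for quote in ("12 1/8", "15 5/32", "12.125"):
    print(quote, "->", float(parse_price(quote)))
```

Binary fractions such as 1/8 and 5/32 convert to decimal exactly, so nothing is lost in the translation.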

From Michael S@21:1/5 to Scott Lurndal on Mon Jun 3 17:42:17 2024

On Mon, 03 Jun 2024 14:07:12 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high either,
is a requirement for (symmetric-key) cipher. Esp. when the key is
128-bit or shorter.

Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).

It is still not *too* fast.
'Too fast' in my book is when with 1B to 10B USD worth of OTP servers
you can break cipher by brute force in less than 1 hour.

High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).

BTDT, though not in a high-volume app, and with programmable logic rather
than ASIC. It's still sufficiently slow to not become dangerous for
the order of the world.

From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jun 3 14:55:53 2024

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when the
key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).

High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the rest
of the CPU logic to do the encryption? Furthermore, an "inbuilt"
accelerator could interface directly with the I/O hardware of the CPU
(e.g. PCI), saving the "intermediate" step of writing the encrypted
data to memory.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jun 3 15:33:48 2024

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when the
key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).

High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the rest
of the CPU logic to do the encryption? Furthermore, an "inbuilt"
accelerator could interface directly with the I/O hardware of the CPU
(e.g. PCI), saving the "intermediate" step of writing the encrypted
data to memory.

There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

For network traffic, there are often other operations
being performed on the flow (routing, encapsulation,
fragmentation/reassembly, etc) which require the packet to be in a
memory buffer (which could be high-speed SRAM or lower-speed DRAM),
even when just routing from an ingress port to an egress port.

From MitchAlsup1@21:1/5 to Stephen Fuld on Mon Jun 3 16:41:34 2024

Stephen Fuld wrote:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)

High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the rest
of the CPU logic to do the encryption? Furthermore, an "inbuilt"
accelerator could interface directly with the I/O hardware of the CPU
(e.g. PCI), saving the "intermediate" step of writing the encrypted
data to memory.

It is more of a systems issue than an ISA issue: consider a chip with
100 cores. Do you want all 100 cores to be doing encryption at the same
time, or do you only need a certain BW of encryption, roughly equal to
the internet BW at hand? For the first, instructions are a reasonable
starting point; for the second, an I/O (or attached) processor is in
order.

From Stephen Fuld@21:1/5 to All on Mon Jun 3 17:05:11 2024

MitchAlsup1 wrote:

Stephen Fuld wrote:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)

High throughput encryption has been done by hardware accelerators
for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
bus; now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.

It is more of a systems issue than an ISA issue: consider a chip
with 100 cores. Do you want all 100 cores to be doing encryption at
the same time, or do you only need a certain BW of encryption, roughly
equal to the internet BW at hand? For the first, instructions are a
reasonable starting point; for the second, an I/O (or attached)
processor is in order.

I agree completely. If all of the data to be en/decrypted is coming
from/going to an external device (network, storage device), then there
is no benefit to being able to encrypt at a faster rate than the total
I/O bandwidth. I don't know what percentage of the data is destined
for external use, but my gut feel is that it is a lot, probably most,
possibly almost all.

If that is the case, then I think a good case can be made for putting
encryption somewhere within the I/O hardware, in order to avoid the
extra memory bandwidth and latency requirements of either instructions
or a "typical" attached processor.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jun 3 17:28:10 2024

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.

There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).

Yes, but putting the instructions into the core means you are
replicating the logic for every core.

In the scale of a modern CPU, it's a small fraction of the logic.

The ARM neoverse cores, for example, require very little area.

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

I look at it differently (and perhaps incorrectly). I view encryption
as one of several "transformations" that data goes through in its path
to/from some external device.

That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.

Adding encryption (which of the dozen standard symmetric and asymmetric
cipher algorithms?) to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it is
very important to get correct when baking into gates.

For example, if the external device is a
disk, the data from memory may be gathered from multiple locations, is
serialized, perhaps encoded (e.g. 8b10b), has (perhaps several levels)
of ECC added, etc. Viewing it like that makes encryption one of many
steps along the I/O pipeline. Under that view, encryption is an
option, probably controlled by some bits in the I/O mechanism, not as
a separate device requiring interrupt support etc.

In the Cavium crypto-enabled DPUs, the crypto block is inserted
into the data-path where necessary, when necessary; and to the extent
that a streaming protocol/alg is used, will encrypt/decrypt as the data
is passing from the ingress point to the egress point (which could
be another external port, or an on-board CPU). It can also be used
as a stand-alone crypto accelerator by the on-board CPUs.

Note that crypto is used for more than just data encryption/decryption;
there's also digesting and digital signatures which rely on asymmetric
algorithms such as RSA or EC and don't necessarily fit into the
"path to the I/O device" model you've espoused.

From Thomas Koenig@21:1/5 to Scott Lurndal on Mon Jun 3 18:01:00 2024

Scott Lurndal <[email protected]> schrieb:

Adding encryption (which of the dozen standard symmetric and asymmetric
cipher algorithms?)

At the moment, AES.

to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it is
very important to get correct when baking into gates.

Seems to be fairly common these days, looking at https://en.wikipedia.org/wiki/AES_instruction_set .

It appears that one round of AES fits fairly well into one cycle.

From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jun 3 17:15:34 2024

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when the
key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256
et al).

High throughput encryption has been done by hardware accelerators
for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
bus; now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.

There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).

Yes, but putting the instructions into the core means you are
replicating the logic for every core. If you don't tie the amount of
encryption hardware you need to the number of cores, you can adjust it
to meet the needs independently of the amount of computation you need
(i.e. number of cores).

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

I look at it differently (and perhaps incorrectly). I view encryption
as one of several "transformations" that data goes through in its path
to/from some external device. For example, if the external device is a
disk, the data from memory may be gathered from multiple locations, is
serialized, perhaps encoded (e.g. 8b10b), has (perhaps several levels)
of ECC added, etc. Viewing it like that makes encryption one of many
steps along the I/O pipeline. Under that view, encryption is an
option, probably controlled by some bits in the I/O mechanism, not as
a separate device requiring interrupt support etc.

For network traffic, there are often other operations
being performed on the flow (routing, encapsulation,
fragmentation/reassembly, etc) which require the packet to be in a
memory buffer (which could be high-speed SRAM or lower-speed DRAM),
even when just routing from an ingress port to an egress port.

Yes. In my view, encryption is just another of these operations.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Jun 3 18:11:56 2024

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Adding encryption (which of the dozen standard symmetric and asymmetric
cipher algorithms?)

At the moment, AES.

to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it is
very important to get correct when baking into gates.

Seems to be fairly common these days, looking at
https://en.wikipedia.org/wiki/AES_instruction_set .

As I mentioned earlier in the thread, all modern CPUs have
support for the standard algorithms in their instruction
set (optionally fused out for export).

It appears that one round of AES fits fairly well into one cycle.

Yes.

From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Jun 3 18:57:24 2024

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

Question. For a modern general purpose CPU, if you are including
all the logic to implement encryption instructions, is it much
more to include the control/sequencing logic to do it and not
tie up the rest of the CPU logic to do the encryption?
Furthermore, an "inbuilt" accelerator could interface directly
with the I/O hardware of the CPU (e.g. PCI), saving the
"intermediate" step of writing the encrypted data to memory.

There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).

Yes, but putting the instructions into the core means you are
replicating the logic for every core.

In the scale of a modern CPU, it's a small fraction of the logic.

The ARM neoverse cores, for example, require very little area.

Agreed. I was assuming that the cost of the logic was about the same
whether it was done as CPU instructions or a chunk of accelerator logic
in the I/O stream. If that is true, then the cost of having multiples
of them in the I/O stream is small.

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.

That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.

Good. Can you give some examples, and perhaps an estimate of what
percentage of the total encryption operations are in place? Note that
it may be possible to add a feature to the "in-stream" hardware to
allow in-place encryption - i.e. both sides go to/come from memory.

Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?) to a hardware device does increase
complexity, and thus cost at the expense of extensibility (new
algorithms come along periodically).

Agreed. But this is also true for new CPU instructions.

The cost of verifying crypto is
a bit higher as it is very important to get correct when baking into
gates.

Sure. And I expect it is also higher because of the extra security
precautions against side-channel attacks, etc.

For example, if the external device is a
disk, the data from memory may be gathered from multiple locations,
is serialized, perhaps encoded (e.g. 8b10b), has (perhaps several
levels) of ECC added, etc. Viewing it like that makes encryption
one of many steps along the I/O pipeline. Under that view,
encryption is an option, probably controlled by some bits in the
I/O mechanism, not as a separate device requiring interrupt support
etc.

In the Cavium crypto-enabled DPUs, the crypto block is inserted
into the data-path where necessary, when necessary; and to the extent
that a streaming protocol/alg is used, will encrypt/decrypt as the
data is passing from the ingress point to the egress point (which
could be another external port, or an on-board CPU). It can also be
used as a stand-alone crypto accelerator by the on-board CPUs.

Good to know. Proof of concept for my suggestion. :-) Can you talk
about advantages/disadvantages of that mechanism versus other
implementations?

Note that crypto is used for more than just data
encryption/decryption; there's also digesting and digital signatures
which rely on asymmetric algorithms such as RSA or EC and don't
necessarily fit into the "path to the I/O device" model you've
espoused.

Yes, of course. But I think digital signature creation/verification
could be fit into the streaming model. Is that wrong? With regard to
RSA/EC, etc. I absolutely agree.

I do want to thank you for indulging my fantasies. :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Michael S@21:1/5 to Thomas Koenig on Mon Jun 3 23:15:11 2024

On Mon, 3 Jun 2024 18:01:00 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Scott Lurndal <[email protected]> schrieb:

Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?)

At the moment, AES.

to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it
is very important to get correct when baking into gates.

Seems to be fairly common these days, looking at https://en.wikipedia.org/wiki/AES_instruction_set .

It appears that one round of AES fits fairly well into one cycle.

One/cycle throughput fits well. Even two/cycle throughput fits.
One cycle latency does not fit unless you target very low frequency.
Latency on POWER9 - 6 clocks. On majority of modern Intel and AMD cores
3-4 clocks. On Apple M1 - 3 clocks.
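
A back-of-the-envelope sketch of why the distinction matters, in Python. The model is idealized and the function names are invented: AES-128 has 10 rounds, a round instruction is assumed to have a fixed latency, and the core is assumed to start a fixed number of round operations per cycle. The 4-cycle latency below is just one of the figures quoted above, picked for illustration.

```python
from math import ceil

AES128_ROUNDS = 10  # AES-128 uses 10 rounds per 16-byte block

def cycles_dependent_chain(blocks: int, round_latency: int) -> int:
    """Every round waits for the previous one, and every block waits for the
    previous block (the CBC-encrypt pattern): latency dominates."""
    return blocks * AES128_ROUNDS * round_latency

def cycles_independent_blocks(blocks: int, round_latency: int,
                              rounds_per_cycle: int = 1) -> int:
    """Independent blocks (e.g. counter mode) interleaved in a pipeline:
    throughput dominates once there are enough blocks in flight."""
    total_ops = blocks * AES128_ROUNDS
    issue_cycles = ceil(total_ops / rounds_per_cycle)
    # Never faster than one block's own dependent chain of rounds.
    return max(AES128_ROUNDS * round_latency, issue_cycles + round_latency - 1)

chained = cycles_dependent_chain(blocks=64, round_latency=4)
pipelined = cycles_independent_blocks(blocks=64, round_latency=4)
print(chained, pipelined)
```

Under these assumptions the chained case costs about four times as many cycles for the same 64 blocks, which is the gap between "one round per cycle throughput" and "one cycle latency".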

From Michael S@21:1/5 to Scott Lurndal on Mon Jun 3 23:15:48 2024

On Mon, 03 Jun 2024 18:11:56 GMT
[email protected] (Scott Lurndal) wrote:

Thomas Koenig <[email protected]> writes:

Scott Lurndal <[email protected]> schrieb:

Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?)

At the moment, AES.

to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come
along periodically). The cost of verifying crypto is a bit higher
as it is very important to get correct when baking into gates.

Seems to be fairly common these days, looking at
https://en.wikipedia.org/wiki/AES_instruction_set .

As I mentioned earlier in the thread, all modern CPUs have
support for the standard algorithms in their instruction
set (optionally fused out for export).

It appears that one round of AES fits fairly well into one cycle.

Yes.

No.

From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Jun 3 20:31:11 2024

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

Question. For a modern general purpose CPU, if you are including
all the logic to implement encryption instructions, is it much
more to include the control/sequencing logic to do it and not
tie up the rest of the CPU logic to do the encryption?
Furthermore, an "inbuilt" accelerator could interface directly
with the I/O hardware of the CPU (e.g. PCI), saving the
"intermediate" step of writing the encrypted data to memory.

There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).

Yes, but putting the instructions into the core means you are
replicating the logic for every core.

In the scale of a modern CPU, it's a small fraction of the logic.

The ARM neoverse cores, for example, require very little area.

Agreed. I was assuming that the cost of the logic was about the same
whether it was done as CPU instructions or a chunk of accelerator logic
in the I/O stream. If that is true, then the cost of having multiples
of them in the I/O stream is small.

Although the accelerator requires additional logic to interface
to the CPUs (either by presenting as a memory mapped device,
integrated into the processor register scheme, or some other
proprietary mechanism), which means non-standard software is
required to manage and use the accelerator.

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.

That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.

Good. Can you give some examples, and perhaps an estimate of what
percentage of the total encryption operations are in place? Note that
it may be possible to add a feature to the "in-stream" hardware to
allow in-place encryption - i.e. both sides go to/come from memory.

Consider file access. From the perspective of the disk, all blocks
are identical - it doesn't know which partition, filesystem, or file
that any individual block is part of, if any.

Whole-disk encryption can happen at the drive. Per-file (or
per-filesystem) happens in the filesystem driver(s), perhaps
with a hardware assist, but it wouldn't be on the path from
the disk to memory.

There are cases where only a portion of a file is encrypted, and
cases where the encryption is combined with compression (pkzip,
rar, etc).

Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?) to a hardware device does increase
complexity, and thus cost at the expense of extensibility (new
algorithms come along periodically).

Agreed. But this is also true for new CPU instructions.

A hardware accelerator could, for example, be microcoded
rather than using hard logic to future-proof it.

The cost of verifying crypto is
a bit higher as it is very important to get correct when baking into
gates.

Sure. And I expect it is also higher because of the extra security
precautions against side-channel attacks, etc.

Timing attacks, in particular.

<snip>

In the Cavium crypto-enabled DPUs, the crypto block is inserted
into the data-path where necessary, when necessary; and to the extent
that a streaming protocol/alg is used, will encrypt/decrypt as the
data is passing from the ingress point to the egress point (which
could be another external port, or an on-board CPU). It can also be
used as a stand-alone crypto accelerator by the on-board CPUs.

Good to know. Proof of concept for my suggestion. :-) Can you talk
about advantages/disadvantages of that mechanism versus other
implementations?

Freeing the CPUs to do useful work instead of crypto is the first
reason for that type of architecture. There's plenty to do.

Note that crypto is used for more than just data
encryption/decryption; there's also digesting and digital signatures
which rely on asymmetric algorithms such as RSA or EC and don't
necessarily fit into the "path to the I/O device" model you've
espoused.

Yes, of course. But I think digital signature creation/verification
could be fit into the streaming model. Is that wrong? With regard to
RSA/EC, etc., I absolutely agree.

Digital signatures require X.509 support, and they're often embedded
in non-encrypted data streams. The hardware processing
the stream won't know anything about the data, including which
parts would need to be digested (and the data may need decrypting
first). Even if the hardware had the keys necessary to decrypt
IPSEC packets and look inside for signatures, it would be very
complicated to design hardware flexible enough to locate the
data that needs to be digested in a sequence of packets (which
may be arriving out of order).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Mon Jun 3 22:34:46 2024

Michael S wrote:

On Mon, 3 Jun 2024 18:01:00 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Scott Lurndal <[email protected]> schrieb:

Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?)

At the moment, AES.

to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it
is very important to get correct when baking into gates.

Seems to be fairly common these days, looking at
https://en.wikipedia.org/wiki/AES_instruction_set .

It appears that one round of AES fits fairly well into one cycle.

One/cycle throughput fits well. Even two/cycle throughput fits.
One cycle latency does not fit unless you target very low frequency.
Latency on POWER9 - 6 clocks. On majority of modern Intel and AMD cores
3-4 clocks. On Apple M1 - 3 clocks.

I agree here; you should consider encryption as smaller than an FMUL
unit, with about the characteristics of an FMUL: 1-cycle throughput,
3-5 cycle latency.
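A back-of-the-envelope model of the throughput/latency distinction discussed above, as a hedged sketch rather than a description of any real core. It assumes 10 rounds per block (AES-128), a round latency of 4 cycles (picked from the 3-5 range quoted above), and one round issued per cycle. The function names are made up for illustration.

```python
def cycles_chained(blocks: int, rounds: int, latency: int) -> int:
    """Each round waits for the previous result (one dependent chain,
    as in CBC encryption), so the full latency is paid for every round."""
    return blocks * rounds * latency


def cycles_interleaved(blocks: int, rounds: int, latency: int) -> int:
    """Independent blocks issued round-robin, one round per cycle.
    With at least `latency` blocks in flight the pipeline never stalls."""
    if blocks < latency:
        raise ValueError("need at least `latency` independent blocks")
    # blocks * rounds issue slots, plus the drain of the last operation.
    return blocks * rounds + latency - 1


ROUNDS = 10    # AES-128
LATENCY = 4    # assumed cycles per round

chained = cycles_chained(8, ROUNDS, LATENCY)
interleaved = cycles_interleaved(8, ROUNDS, LATENCY)
print(f"8 blocks chained:     {chained} cycles")
print(f"8 blocks interleaved: {interleaved} cycles")
```

Under these assumptions the 1/cycle throughput is only reachable when the mode of operation exposes independent blocks (counter mode, or several separate streams). A single dependent chain is bound by the latency figure instead.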

From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Jun 3 22:47:05 2024

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

The ARM neoverse cores, for example, require very little area.

Agreed. I was assuming that the cost of the logic was about the same
whether it was done as CPU instructions or a chunk of accelerator logic
in the I/O stream. If that is true, then the cost of having multiples
of them in the I/O stream is small.

Although the accelerator requires additional logic to interface
to the CPUs (either by presenting as a memory mapped device,
integrated into the processor register scheme, or some other
proprietary mechanism). Which means non-standard software is
required to manage and use the accelerator.

First consider that it is possible for an I/O device to DMA directly
to another I/O device in the PCIe routing tree/DAG.

Then, consider that with this infrastructure, you could DMA from
memory through the Cryptor and back to memory (or anywhere you
wanted it).

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.

That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.

Good. Can you give some examples, and perhaps an estimate of what
percentage of the total encryption operations are in place? Note that
it may be possible to add a feature to the "in-stream" hardware to
allow in-place encryption - i.e. both sides go to/come from memory.

Different users want their files encrypted with keys different from
every other user's--where a file could be memory resident (or not).

Consider file access. From the perspective of the disk, all blocks
are identical - it doesn't know which partition, filesystem, or file
that any individual block is part of, if any.

Whole-disk encryption can happen at the drive. Per-file (or per-filesystem) happens in the filesystem driver(s), perhaps
with a hardware assist, but it wouldn't be on the path from
the disk to memory.

You may be correct in how it is now--but if the device has encryption
services why can they not be applied sector by sector ??

There are cases where only a portion of a file is encrypted, and
cases where the encryption is combined with compression (pkzip,
rar, etc).

Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?) to a hardware device does increase
complexity, and thus cost at the expense of extensibility (new
algorithms come along periodically).

Agreed. But this is also true for new CPU instructions.

A hardware accelerator could, for example, be microcoded
rather than using hard logic to future-proof it.

The cost of verifying crypto is
a bit higher as it is very important to get correct when baking into
gates.

Verifying encryption is not harder than verifying IEEE 754
instructions.

Sure. And I expect it is also higher because of the extra security
precautions against side-channel attacks, etc.

Timing attacks, in particular.

All the more reason to run encryption through a device where you cannot
measure time accurately. I/O fits this bill very well. It seems to me
that as long as the system can maintain the crypto throughput, all
should be well.

From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Jun 4 01:17:33 2024

On Mon, 3 Jun 2024 13:22:27 +0300, Michael S wrote:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

but nowadays it’s all about GFC (“Galois-Field Counter”) mode.

GCM is far more common spelling.

Yeah. It’s just that Évariste Galois is known mainly for just one thing: Galois field theory, which is what’s relevant here. Which he wrote up on
his last night alive.

Imagine if he’d said “stuff this, I’ll write it up tomorrow night, I’m going to bed” ...

From Lawrence D'Oliveiro@21:1/5 to All on Tue Jun 4 01:55:24 2024

On Mon, 3 Jun 2024 22:47:05 +0000, MitchAlsup1 wrote:

... if the device has encryption
services why can they not be applied sector by sector ??

They can indeed. This is what “counter mode” is for: it lets you
encrypt/decrypt any part of some large data blob with random access,
without having to start from the beginning each time.
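A minimal sketch of the random-access property, using only the Python standard library. Since the standard library has no AES, SHA-256 over key, nonce and counter stands in for the block cipher here; that substitution, and the names `keystream_block` and `ctr_xor`, are assumptions for illustration only, not a vetted construction.

```python
import hashlib

BLOCK = 32  # keystream bytes per counter value (SHA-256 digest size)


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for E(key, nonce || counter); a real design would use AES.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()


def ctr_xor(key: bytes, nonce: bytes, data: bytes, offset: int = 0) -> bytes:
    """Encrypt or decrypt `data`, which begins at byte `offset` of the blob.
    Only the keystream blocks covering that range are computed."""
    out = bytearray()
    pos, end = offset, offset + len(data)
    while pos < end:
        counter, skip = divmod(pos, BLOCK)
        ks = keystream_block(key, nonce, counter)[skip:]
        chunk = data[pos - offset : pos - offset + len(ks)]
        out.extend(a ^ b for a, b in zip(chunk, ks))
        pos += len(chunk)
    return bytes(out)


key, nonce = b"k" * 16, b"n" * 8
blob = bytes(range(256)) * 16          # 4096 bytes of sample data
ct = ctr_xor(key, nonce, blob)         # encrypt the whole blob

# Decrypt only bytes 1000..1099 without touching the rest.
part = ctr_xor(key, nonce, ct[1000:1100], offset=1000)
print(part == blob[1000:1100])
```

One caveat worth keeping in mind: reusing the same key, nonce and counter for different data leaks the XOR of the two plaintexts, which matters when sectors are rewritten in place.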

From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Jun 4 01:45:58 2024

On Thu, 30 May 2024 15:35:37 GMT, Anton Ertl wrote:

Actually what you check for is meta-characters like ; " '. They are all ASCII, so as long as your code is 8-bit-clean, your SQL string sanitizer needs to know nothing about UTF-8.

According to the official spec, an SQL string literal is delimited by “'” characters, and an embedded single quote is escaped by writing it twice: “''”.

That’s it. Nothing else is special, so any other character/byte value in
the string can be simply passed through as is.

Of course, things like LIKE and REGEXP clauses are an entirely separate
matter ...

From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Jun 4 02:00:55 2024

On Mon, 3 Jun 2024 17:42:17 +0300, Michael S wrote:

On Mon, 03 Jun 2024 14:07:12 GMT [email protected] (Scott Lurndal)
wrote:

Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).

It is still not *too* fast.
'Too fast' in my book is when with 1B to 10B USD worth of OTP servers
you can break cipher by brute force in less than 1 hour.

The good algorithms are designed to be fast for encryption/decryption use, while still being uselessly slow for cracking purposes.

Hash algorithms come in two flavours: cryptographic hashes (as mentioned
above) and password hashes. Cryptographic hashes have to be fast to
compute, but password hashes should take some appreciable fraction of a
second. This is fast enough to authenticate a user logging in, while significantly slowing down password-guessing attacks.

For example, the WordPress password-hashing algorithm takes a
cryptographic hash like MD5 (considered crap nowadays), and iterates it
8000 times. And suddenly crap becomes good.
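The idea of stretching a fast hash by iteration can be sketched as follows. This is an illustrative loop only, not the actual WordPress routine; the salt handling, the function names and the exact chaining are assumptions.

```python
import hashlib
import hmac


def stretched_hash(password: bytes, salt: bytes, iterations: int) -> bytes:
    """Feed the digest back into a fast hash `iterations` times so that
    every password guess costs `iterations` + 1 hash computations."""
    digest = hashlib.md5(salt + password).digest()
    for _ in range(iterations):
        digest = hashlib.md5(digest + password).digest()
    return digest


def verify(password: bytes, salt: bytes, iterations: int, expected: bytes) -> bool:
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(stretched_hash(password, salt, iterations), expected)


salt = b"demo-salt"      # hypothetical; should be random and per-user
stored = stretched_hash(b"correct horse", salt, 8000)
ok = verify(b"correct horse", salt, 8000, stored)
bad = verify(b"wrong guess", salt, 8000, stored)
print(ok, bad)
```

For new code, Python's `hashlib.pbkdf2_hmac` and `hashlib.scrypt` provide standardized key-stretching functions that serve the same purpose.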

From Stephen Fuld@21:1/5 to All on Tue Jun 4 06:09:25 2024

MitchAlsup1 wrote:

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

The ARM neoverse cores, for example, require very little area.

Agreed. I was assuming that the cost of the logic was about the
same whether it was done as CPU instructions or a chunk of
accelerator logic in the I/O stream. If that is true, then the
cost of having multiples of them in the I/O stream is small.

Although the accelerator requires addition logic to interface
to the CPUs (either by presenting as a memory mapped device,
integrated into the processor register scheme, or some other
proprietary mechanism). Which means non-standard software is
required to manage and use the accelerator.

First consider that it is possible for an I/O device to DMA directly
to another I/O device in the PCIe routing tree/DAG.

Then, consider that with this infrastructure, you could DMA from
memory through the Cryptor and back to memory (or anywhere you wanted
it).

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

I look at it differently (and perhaps incorrectly). I view

encryption as one of several "transformations" that data goes
through in its path to/from some external device.

That's certainly a valid view, if perhaps not complete. There
are use cases for in-place encryption.

Good. Can you give some examples, and perhaps an estimate of what percentage of the total encryption operations are in place? Note
that it may be possible to add a feature to the "in-stream"
hardware to allow in-place encryption - i.e. both sides go
to/come from memory.

Different users want their files encrypted using different keys than
any other user--where file could be memory resident (or not).

Memory resident files I agree with you about. But in my conception of
how this would all work, there would be a key specified for each I/O
operation, thus, I/O to different files could trivially have different
keys.

Consider file access. From the perspective of the disk, all blocks
are identical - it doesn't know which partition, filesystem, or file
that any individual block is part of, if any.

Whole-disk encryption can happen at the drive. Per-file (or per-filesystem) happens in the filesystem driver(s), perhaps
with a hardware assist, but it wouldn't be on the path from
the disk to memory.

You may be correct in how it is now--but if the device has encryption services why can they not be applied sector by sector ??

There are cases where only a portion of a file is encrypted, and
cases where the encryption is combined with compression (pkzip,
rar, etc).

If the "boundary" of where the encrypted portion starts or ends
corresponds to where an I/O boundary is, then no problem. If not, then
the interface requires the ability to start/stop encryption at
an arbitrary spot within the I/O. I envision this to work sort of like
a scatter gather, but instead of different memory addresses, each
"chunk" is encrypted or not. This is probably needed anyway for things
like network I/O where you want to encrypt the data but not the header.
As for combining it with compression, clearly the encryption must come
after the compression, and decryption must come before decompression.
If you are doing the compression in the hardware interface that
shouldn't be a problem, and if you are doing it in the software, then
it definitely isn't a problem.
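The scatter-gather-like chunk list described above might look roughly like this in software. It is a sketch under stated assumptions: the `Chunk` descriptor, the `transfer` function and the per-I/O key are invented for illustration, and a SHA-256 keystream stands in for whatever cipher the hardware would implement.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    length: int     # number of bytes this descriptor covers
    encrypt: bool   # whether these bytes pass through the cipher


def keystream(key: bytes, n: int, start: int) -> bytes:
    """Return n keystream bytes beginning at stream position `start`."""
    out = bytearray()
    counter = start // 32
    while len(out) < (start % 32) + n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[start % 32 : start % 32 + n])


def transfer(buffer: bytes, chunks: list, key: bytes) -> bytes:
    """Walk the descriptor list, transforming only the flagged chunks."""
    if sum(c.length for c in chunks) != len(buffer):
        raise ValueError("descriptors must cover the buffer exactly")
    out = bytearray()
    pos = 0
    for c in chunks:
        piece = buffer[pos : pos + c.length]
        if c.encrypt:
            ks = keystream(key, len(piece), pos)
            piece = bytes(a ^ b for a, b in zip(piece, ks))
        out += piece
        pos += c.length
    return bytes(out)


packet = b"HDR:0001" + b"secret payload bytes"
layout = [Chunk(8, False), Chunk(len(packet) - 8, True)]  # clear header, encrypted body
wire = transfer(packet, layout, key=b"per-I/O key")
back = transfer(wire, layout, key=b"per-I/O key")
print(wire[:8], back == packet)
```

Passing the key with each `transfer` call mirrors the idea of specifying a key per I/O operation, so different files or connections can use different keys without extra state in the device.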

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Jun 4 11:09:16 2024

Stephen Fuld wrote:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when the
key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).

High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the rest
of the CPU logic to do the encryption? Furthermore, an "inbuilt"
accelerator could interface directly with the I/O hardware of the CPU
(e.g. PCI), saving the "intermediate" step of writing the encrypted
data to memory.

That logic already exists, in the form of a single thread/core dedicated
to the job.

With 30-100 cores on a single die, it becomes very cheap to dedicate one
of them to babysit such a process, compared to the cost of making a
custom chunk of VLSI to do the same. This is particularly true because
the logic needed in the babysitting process is mostly straight line,
with a very limited number of hard-to-predict branches.

I.e. h.264 CABAC decoding has three branches per bit decoded, at least
one of them impossible to predict or work around with clever coding.
Here it makes perfect sense to have a chunk of hw to handle the heavy
lifting. Monitoring block encryption/decryption not so much.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to Michael S on Tue Jun 4 10:54:27 2024

Michael S wrote:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high either,
is a requirement for (symmetric-key) cipher. Esp. when the key is
128-bit or shorter.

That's correct:

CPU efficiency, primarily on the reference 32-bit platform (PentiumPro
200 MHz) but also on an 8-bit "smart card" implementation was one of the
key requirements for the AES competition.

When a group of four programmers (including me) spent a week on CERN's candidate, we were able to triple the speed, bringing it into parity
with the eventual winner. All the finalists were more or less the same
speed at this point, i.e. able to do full duplex 100 Mbit/s Ethernet
traffic (so around 20 MB/s) on a single thread/core.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Michael S@21:1/5 to Terje Mathisen on Tue Jun 4 12:11:33 2024

On Tue, 4 Jun 2024 10:54:27 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption.
Instead, we see newer encryption algorithms (or ways of using
them) that work better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when the
key is 128-bit or shorter.

That's correct:

CPU efficiency, primarily on the reference 32-bit platform
(PentiumPro 200 MHz) but also on an 8-bit "smart card" implementation
was one of the key requirements for the AES competition.

When a group of four programmers (including me) spent a week on
CERN's candidate, we were able to triple the speed, bringing it into
parity with the eventual winner. All the finalists were more or less
the same speed at this point, i.e. able to do full duplex 100 Mbit/s
Ethernet traffic (so around 20 MB/s) on a single thread/core.

Terje

My point was that for a symmetric cipher intended for use with "short"
keys, at least during the standardization phase, exceptionally high
efficiency on existing CPUs would be considered a defect rather than an
advantage.
Not necessarily so for "long" keys, where unbreakability by brute force
is taken for granted.
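Some rough arithmetic behind the short-key versus long-key distinction, as a hedged sketch. The aggregate trial rate of 10^12 keys per second is an arbitrary assumption chosen only to make the scaling visible; it is not an estimate of what any real attacker can do.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_search_seconds(key_bits: int, keys_per_second: float) -> float:
    """Average brute-force time: half the keyspace is tried on average."""
    return (2 ** key_bits / 2) / keys_per_second


RATE = 1e12  # assumed aggregate trial rate in keys per second (illustrative)

for bits in (56, 64, 80, 128):
    secs = expected_search_seconds(bits, RATE)
    print(f"{bits:3d}-bit key: {secs / 3600:.3g} hours "
          f"({secs / SECONDS_PER_YEAR:.3g} years)")
```

A cipher that runs ten times faster shortens each of these figures by a factor of ten, which can matter at the short end of the table. At 128 bits the result stays astronomically large either way, and each additional key bit doubles the work.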

From Scott Lurndal@21:1/5 to [email protected] on Tue Jun 4 13:06:11 2024

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

Scott Lurndal wrote:

The ARM neoverse cores, for example, require very little area.

Agreed. I was assuming that the cost of the logic was about the same
whether it was done as CPU instructions or a chunk of accelerator logic
in the I/O stream. If that is true, then the cost of having multiples
of them in the I/O stream is small.

Although the accelerator requires additional logic to interface
to the CPUs (either by presenting as a memory mapped device,
integrated into the processor register scheme, or some other
proprietary mechanism). Which means non-standard software is
required to manage and use the accelerator.

First consider that it is possible for an I/O device to DMA directly
to another I/O device in the PCIe routing tree/DAG.

If, and only if, the host bridge supports peer-to-peer transactions,
which is not a given.

Then, consider that with this infrastructure, you could DMA from
memory through the Cryptor and back to memory (or anywhere you
wanted it).

Yes, this can be done, if the PCI endpoint(s) support it. Such
routing is an optional feature of PCI Express.

There are more efficient ways to link various hardware elements
together in such a way as to include not only encryption, but
also compression/decompression, regex (or other pattern) matching, ingress and egress.

From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).

I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.

That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.

Good. Can you give some examples, and perhaps an estimate of what
percentage of the total encryption operations are in place? Note that
it may be possible to add a feature to the "in-stream" hardware to
allow in-place encryption - i.e. both sides go to/come from memory.

Different users want their files encrypted using different keys than
any other user--where file could be memory resident (or not).

Consider file access. From the perspective of the disk, all blocks
are identical - it doesn't know which partition, filesystem, or file
that any individual block is part of, if any.

Whole-disk encryption can happen at the drive. Per-file (or
per-filesystem) happens in the filesystem driver(s), perhaps
with a hardware assist, but it wouldn't be on the path from
the disk to memory.

You may be correct in how it is now--but if the device has encryption
services why can they not be applied sector by sector ??

Still not sufficient, as a filesystem could easily pack fragments
from multiple files into a single sector or allocation unit
(and with modern sector sizes of 4096 bytes....)

Sure. And I expect it is also higher because of the extra security
precautions against side-channel attacks, etc.

Timing attacks, in particular.

All the more reason to run encryption through a device where you cannot
measure time accurately.

Indeed, we've been doing that for a couple of decades now.

From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jun 4 17:00:33 2024

Terje Mathisen wrote:

That logic already exists, in the form of a single thread/core
dedicated to the job.

With 30-100 cores on a single die, it becomes very cheap to dedicate
one of them to babysit such a process, compared to the cost of making a
custom chunk of VLSI to do the same. This is particularly true because
the logic needed in the babysitting process is mostly straight line,
with a very limited number of hard-to-predict branches.

I.e. h.264 CABAC decoding has three branches per bit decoded, at least
one of them impossible to predict or work around with clever coding.

How many instructions in the then-clause and in the else-clause ??
If these are smaller than 8, My 66000 can process them without
"branching" using predication.

Here it makes perfect sense to have a chunk of hw to handle the heavy lifting. Monitoring block encryption/decryption not so much.

Terje

From Stephen Fuld@21:1/5 to Terje Mathisen on Tue Jun 4 16:26:27 2024

Terje Mathisen wrote:

Stephen Fuld wrote:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when
the key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric
ciphers such as AES, SM2/SM3 as well as message digest/hash
(SHA1, SHA256 et al).

High throughput encryption has been done by hardware accelerators
for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
bus; now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.

That logic already exists, in the form of a single thread/core
dedicated to the job.

With 30-100 cores on a single die, it becomes very cheap to dedicate
one of them to babysit such a process, compared to the cost of making
a custom chunk of VLSI to do the same. This is particularly true
because the logic needed in the babysitting process is mostly
straight line, with a very limited number of hard-to-predict branches.

I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with clever
coding. Here it makes perfect sense to have a chunk of hw to handle
the heavy lifting. Monitoring block encryption/decryption not so much.

I may be missing something, but while your proposal addresses the first
part of my proposal, I think it doesn't address the second. That is,
for data coming from/going to some external source, you are still doing
"unnecessary" memory traffic, which takes memory bandwidth and
increases latency.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Stefan Monnier@21:1/5 to All on Tue Jun 4 16:28:00 2024

If I want to validate combiner codes or normalize characters I need
UTF-32 because I have to work with the whole character as a unit.

You can read the code points directly from the UTF-8 sequence almost
as easily as you can from a UTF-32 sequence.
Most of the cost will be in the memory accesses and then in looking up
the various tables to decide how to normalize or whether it's valid, so
the difference between reading the info from UTF-32 or UTF-8 should be
lost in the noise.
UTF-32 might be marginally faster at this specific operation in some
cases (definitely not if your text is mostly ASCII), but I'd be very
surprised if the difference is ever large enough to pay for a conversion
from UTF-8 to UTF-32.
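To illustrate reading code points straight out of a UTF-8 byte sequence, here is a small hand-written decoder. It is a sketch for exposition (the name `code_points` is made up); in practice `bytes.decode("utf-8")` already does this. It also rejects overlong forms, surrogates and stray continuation bytes, the kind of malformed input that could otherwise slip past an ASCII-only check.

```python
def code_points(data: bytes):
    """Yield Unicode code points from UTF-8 bytes, rejecting malformed input."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # 1-byte sequence (ASCII)
            cp, extra, minimum = b0, 0, 0
        elif 0xC0 <= b0 < 0xE0:            # 2-byte sequence
            cp, extra, minimum = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 < 0xF0:            # 3-byte sequence
            cp, extra, minimum = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 < 0xF8:            # 4-byte sequence
            cp, extra, minimum = b0 & 0x07, 3, 0x10000
        else:                              # continuation byte or 0xF8..0xFF
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for j in range(1, extra + 1):
            b = data[i + j]
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte at offset {i + j}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"overlong, out-of-range or surrogate at offset {i}")
        yield cp
        i += extra + 1


text = "a\u00e9\u20ac\U0001F600"   # 1-, 2-, 3- and 4-byte characters
decoded = list(code_points(text.encode("utf-8")))
print([hex(cp) for cp in decoded])

try:
    list(code_points(b"\xc0\xaf"))   # overlong encoding of "/"
    overlong_rejected = False
except ValueError:
    overlong_rejected = True
print(overlong_rejected)
```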

I was just trying to get people thinking of ways that malformed
characters might be used to bypass other validation checks in
their software.

Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄. Unicode comes with a 700kB `confusables.txt` listing such issues.
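A quick way to see that normalization does not help with such look-alikes, using Python's standard `unicodedata` module. This is only a demonstration for two of the letters mentioned above; a real check would consult the Unicode confusables data, which this sketch does not do.

```python
import unicodedata

# (look-alike character, the ASCII letter it resembles)
pairs = [("\u0392", "B"),   # Greek capital beta
         ("\u0410", "A")]   # Cyrillic capital A

survivors = []
for lookalike, ascii_char in pairs:
    folded = unicodedata.normalize("NFKC", lookalike)
    if folded != ascii_char:
        survivors.append(lookalike)
    print(f"U+{ord(lookalike):04X} {unicodedata.name(lookalike)}: "
          f"equals {ascii_char!r} after NFKC? {folded == ascii_char}")
```

Both characters remain distinct from their ASCII twins even after compatibility normalization, so a comparison on normalized strings still treats the two spellings as different identifiers.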

Stefan

From George Neuner@21:1/5 to [email protected] on Tue Jun 4 16:56:18 2024

On Sat, 01 Jun 2024 12:49:46 -0400, EricP
<[email protected]> wrote:

George Neuner wrote:

On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
<[email protected]> wrote:

According to EricP <[email protected]>:

Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.

I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.

If you're sending the strings to a database, the database will
invariably do detailed string validation so I wouldn't bother, but be
prepared for the error code if it rejects the string,

Far too much SQL is constructed by simply splicing user input into a
query "template" string.

When queries are done right with all user input provided via SQL
parameters, then there is far less need to "sanitize" input.

There is one major caveat: in SQL, table names can't be specified by
parameter. If the user must provide a table name, then you DO have to
splice the query string and you DO have to be careful.

Yes, I didn't mean not parameterizing the string args.

I was trying to think of ways that I might get your software to combine
malformed strings creating something different. This would occur after
the strings have been passed using parameterization, like if an index
is built from two concatenated string fields.

Sorry ... was away for a few days.

Even using parameters you still can have a "bad" outcome (for some
definition). E.g., if the database contains "John" but the query
string is "Jon", it might fail to find or delete existing tuples,
update wrong tuples, create superfluous tuples, etc. ... which can
affect the integrity[*] of the stored data. However, parameters
provide no way to /rewrite/ the SQL to perform a different operation
than that which was originally intended.

[*] "ACID" provides some guarantees of "consistency" but does not make
any guarantees of "integrity". The 'I' stands for "isolation".
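The behaviour of parameters, and the earlier caveat about table names, can be sketched with Python's built-in `sqlite3` module. The table, the allow-list and the function name are hypothetical; the point is only that a bound parameter is treated as data, while a table name has to be checked before it is spliced into the query text.

```python
import sqlite3

ALLOWED_TABLES = {"users"}   # hypothetical allow-list of table names


def find_by_name(conn, table, name):
    # Table names cannot be bound as parameters, so validate before splicing.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    sql = f'SELECT id, name FROM "{table}" WHERE name = ?'
    # The value is bound as a parameter and is never parsed as SQL.
    return conn.execute(sql, (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("John",))

rows_john = find_by_name(conn, "users", "John")
rows_jon = find_by_name(conn, "users", "Jon")                # wrong data, same query
rows_hostile = find_by_name(conn, "users", "x' OR '1'='1")   # just an odd name
print(rows_john, rows_jon, rows_hostile)
```

The "Jon" lookup shows the remaining failure mode: the query does exactly what it was written to do, but with the wrong value, so it silently finds nothing.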

However, many SQL RDBMS now support operations on JSON and XML data,
and it is possible to affect searches within these types of fields by
using only (SQL) parameter strings. I don't know of any way to defend
against this without the checking code having a fairly sophisticated
understanding of the stored data ... not just its structure, but also
what it represents.

From George Neuner@21:1/5 to [email protected] on Tue Jun 4 17:42:43 2024

On Tue, 4 Jun 2024 02:00:55 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:

On Mon, 3 Jun 2024 17:42:17 +0300, Michael S wrote:

On Mon, 03 Jun 2024 14:07:12 GMT [email protected] (Scott Lurndal)
wrote:

Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).

It is still not *too* fast.
'Too fast' in my book is when with 1B to 10B USD worth of OTP servers
you can break cipher by brute force in less than 1 hour.

The good algorithms are designed to be fast for encryption/decryption use,
while still being uselessly slow for cracking purposes.

Hash algorithms come in two flavours: cryptographic hashes (as mentioned
above) and password hashes. Cryptographic hashes have to be fast to
compute, but password hashes should take some appreciable fraction of a
second. This is fast enough to authenticate a user logging in, while
significantly slowing down password-guessing attacks.

For example, the WordPress password-hashing algorithm takes a
cryptographic hash like MD5 (considered crap nowadays), and iterates it
8000 times. And suddenly crap becomes good.

It's debatable whether repeated application of a given function really represents a /different/ function.

In any event there is no such thing as a "password" hash - really
there only are cryptographic hashes. A use of a particular hash for
passwords may deliberately slow its execution - e.g., by iterating or
by deliberate delays - but the hash algorithm remains the same.

From Thomas Koenig@21:1/5 to George Neuner on Wed Jun 5 05:32:17 2024

George Neuner <[email protected]> schrieb:

It's debatable whether repeated application of a given function really represents a /different/ function.

Try encrypting something with only one round of DES or AES :-)

From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Wed Jun 5 05:35:28 2024

On Wed, 5 Jun 2024 05:32:17 -0000 (UTC), Thomas Koenig wrote:

Try encrypting something with only one round of DES or AES :-)

AES is fine, DES is not.

From Terje Mathisen@21:1/5 to Stephen Fuld on Wed Jun 5 11:16:38 2024

Stephen Fuld wrote:

Terje Mathisen wrote:

Stephen Fuld wrote:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when
the key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric
ciphers such as AES, SM2/SM3 as well as message digest/hash
(SHA1, SHA256 et al).

High throughput encryption has been done by hardware accelerators
for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
bus; now such HSM are an integral part of many SoC).

Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.

That logic already exists, in the form of a single thread/core
dedicated to the job.

With 30-100 cores on a single die, it becomes very cheap to dedicate
one of them to babysit such a process, compared to the cost of making
a custom chunk of VLSI to do the same. This is particularly true
because the logic needed in the babysitting process is mostly
straight line, with a very limited number of hard-to-predict branches.

I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with clever
coding. Here it makes perfect sense to have a chunk of hw to handle
the heavy lifting. Monitoring block encryption/decryption not so much.

I may be missing something, but while your proposal addresses the first
part of my proposal, I think it doesn't address the second. That is,
for data coming from/going to some external source, you are still doing "unnecessary" memory traffic, which takes memory bandwidth and
increases latency.

Usually, when a CPU needs to work on something, it will need to get the
data into $L1 anyway? It is only when the work is simply to be a
pipeline that having a way to bypass the CPU completely really makes a difference, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Terje Mathisen@21:1/5 to All on Wed Jun 5 12:21:01 2024

MitchAlsup1 wrote:

Terje Mathisen wrote:

That logic already exists, in the form of a single thread/core
dedicated to the job.

With 30-100 cores on a single die, it becomes very cheap to dedicate
one of them to babysit such a process, compared to the cost of making a
custom chunk of VLSI to do the same. This is particularly true because
the logic needed in the babysitting process is mostly straight line,
with a very limited number of hard-to-predict branches.

I.e. h.264 CABAC decoding has three branches per bit decoded, at least
one of them impossible to predict or work around with clever coding.

How many instructions in the then-clause and in the else-clause ??
If these are smaller than 8, My 66000 can process them without
"branching" using predication.

No, the real problem is the context branching: After doing the 50%
branch you pick up one of two alternative contexts and follow totally
different paths, i.e. you cannot simply use the branch bit as an index.

I found ways to bypass the issues with the other two branches but this
one is fundamental.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Stephen Fuld@21:1/5 to Terje Mathisen on Wed Jun 5 13:34:25 2024

Terje Mathisen wrote:

Stephen Fuld wrote:

Terje Mathisen wrote:

Stephen Fuld wrote:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not
high either, is a requirement for (symmetric-key) cipher.
Esp. when the key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric
ciphers such as AES, SM2/SM3 as well as message digest/hash
(SHA1, SHA256 et al).

High throughput encryption has been done by hardware
accelerators for decades now (e.g. bbn or ncypher HSM boxes
sitting on a SCSI bus; now such HSM are an integral part of
many SoC).

Question. For a modern general purpose CPU, if you are
including all the logic to implement encryption instructions,
is it much more to include the control/sequencing logic to do
it and not tie up the rest of the CPU logic to do the
encryption? Furthermore, an "inbuilt" accelerator could
interface directly with the I/O hardware of the CPU (e.g. PCI),
saving the "intermediate" step of writing the encrypted data to
memory.

That logic already exists, in the form of a single thread/core
dedicated to the job.

With 30-100 cores on a single die, it becomes very cheap to
dedicate one of them to babysit such a process, compared to the
cost of making a custom chunk of VLSI to do the same. This is particularly true because the logic needed in the babysitting
process is mostly straight line, with a very limited number of hard-to-predict branches.

I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with clever coding. Here it makes perfect sense to have a chunk of hw to
handle the heavy lifting. Monitoring block encryption/decryption
not so much.

I may be missing something, but while your proposal addresses the
first part of my proposal, I think it doesn't address the second.
That is, for data coming from/going to some external source, you
are still doing "unnecessary" memory traffic, which takes memory
bandwidth and increases latency.

Usually, when a CPU needs to work on something, it will need to get
the data into $L1 anyway? It is only when the work is simply to be a
pipeline that having a way to bypass the CPU completely really makes
a difference, right?

Right. But my point is that the CPU never really needs to "work" on the
encrypted data. It is frequently only sent to, or received from, the
network or a storage device, hence the pipelined approach has
advantages.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Michael S@21:1/5 to Stephen Fuld on Wed Jun 5 16:49:05 2024

On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

Terje Mathisen wrote:

Stephen Fuld wrote:

Terje Mathisen wrote:

Stephen Fuld wrote:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:

On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:

30 years ago you could say the same thing about encryption.

I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.

I think moderate efficiency on CPU, not too low, but not
high either, is a requirement for (symmetric-key) cipher.
Esp. when the key is 128-bit or shorter.

Most modern CPUs have instruction set support for symmetric
ciphers such as AES, SM2/SM3 as well as message digest/hash
(SHA1, SHA256 et al).

High throughput encryption has been done by hardware
accelerators for decades now (e.g. bbn or ncypher HSM boxes
sitting on a SCSI bus; now such HSM are an integral part of
many SoC).

Question. For a modern general purpose CPU, if you are
including all the logic to implement encryption instructions,
is it much more to include the control/sequencing logic to do
it and not tie up the rest of the CPU logic to do the
encryption? Furthermore, an "inbuilt" accelerator could
interface directly with the I/O hardware of the CPU (e.g.
PCI), saving the "intermediate" step of writing the encrypted
data to memory.

That logic already exists, in the form of a single thread/core dedicated to the job.

With 30-100 cores on a single die, it becomes very cheap to
dedicate one of them to babysit such a process, compared to the
cost of making a custom chunk of VLSI to do the same. This is particularly true because the logic needed in the babysitting
process is mostly straight line, with a very limited number of hard-to-predict branches.

I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with
clever coding. Here it makes perfect sense to have a chunk of
hw to handle the heavy lifting. Monitoring block
encryption/decryption not so much.

I may be missing something, but while your proposal addresses the
first part of my proposal, I think it doesn't address the second.
That is, for data coming from/going to some external source, you
are still doing "unnecessary" memory traffic, which takes memory bandwidth and increases latency.

Usually, when a CPU needs to work on something, it will need to get
the data into $L1 anyway? It is only when the work is simply to be a pipeline that having a way to bypass the CPU completely really makes
a difference, right?

Right. But my point is that the CPU never really needs to "work" on
the encrypted data. It is frequently only sent to, or received from,
the network or a storage device, hence the pipelined approach has
advantages.

The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.
It's not that other, "piece-wise" encryption types can't be used, but
if you are serious about privacy you should consider them insufficient.


From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Jun 5 16:53:53 2024

Terje Mathisen wrote:

MitchAlsup1 wrote:

I.e. h.264 CABAC decoding has three branches per bit decoded, at least
one of them impossible to predict or work around with clever coding.

How many instructions in the then-clause and in the else-clause ??
If these are smaller than 8, My 66000 can process them without
"branching" using predication.

No, the real problem is the context branching: After doing the 50%
branch you pick up one of two alternative contexts and follow totally different paths, i.e. you cannot simply use the branch bit as an index.

If the number of instructions in the combined then and else clauses is
lower than a certain number, it is equally efficient to deal with the
branch as if it were later nullification rather than a redirection of
the fetch end of the pipeline. Here, NO prediction is required and
there is no chance of misprediction without regard to the predictability
of the control flow point. The whole point is that if the fetch end
of the pipeline will reach the convergence point before the branch
is fully resolved, then "don't branch": nullify. It saves cycles and
keeps unpredictable branches out of the branch predictor--even if
the apparent takenness of the branch is completely random--improving
the prediction accuracy of "real branches".

So, for example, let us postulate a 1-wide machine fetching 4 words per
clock and a then clause of 3 instructions and an else clause of 4 inst.
By the time the pseudo branch instruction enters execution, both the
then and the else have already been fetched, parsed, and are flowing
through decode. The execution of the branch merely decides which inst
survive the pipeline and there are no misprediction stalls. {{On a
wider machine, the fetch is even wider and the parse/decode BW is
still higher, so the mispredicted control flow point does not suffer misprediction repair costs.}}

Oddly enough, this is how predication works on My 66000.
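
As a loose software analogy only (this is not My 66000 code, and Python itself branches freely under the hood), the data-flow idea of "execute both arms, let the condition decide which results survive" can be sketched like this. The function name and the two stand-in clauses are illustrative assumptions.

```python
def select_by_nullification(cond: bool, x: int) -> int:
    """Evaluate both arms; the condition only picks which result survives."""
    then_result = x * 3 + 1   # stand-in for a short then-clause
    else_result = x >> 1      # stand-in for a short else-clause
    mask = -int(cond)         # -1 (all ones) when cond is true, 0 otherwise
    # Keep then_result where the mask is set and else_result elsewhere;
    # there is no conditional jump on `cond` in this expression.
    return (then_result & mask) | (else_result & ~mask)
```

The cost is that both arms are always evaluated, which is why the approach only pays off when the combined then and else clauses are short.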

I found ways to bypass the issues with the other two branches but this
one is fundamental.

It is fundamental only on ISAs that perform predication improperly,
or do not have predication, or use the predictor when predicating.
My 66000 is not one of them.

I return to the question posed earlier::
How many instructions in the then-clause and in the else-clause ??

Terje


From Stephen Fuld@21:1/5 to Michael S on Wed Jun 5 16:16:32 2024

Michael S wrote:

snip lots of stuff about encryption alternatives

The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.
It's not that other, "piece-wise" encryption types can't be used, but
if you are serious about privacy you should consider them
insufficient.

That's fair. But there are counterarguments: for example, not doing the
encryption on a processor that is also executing arbitrary user code
makes it more immune to side-channel attacks.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jun 5 17:03:42 2024

Stephen Fuld wrote:

Terje Mathisen wrote:

Usually, when a CPU needs to work on something, it will need to get
the data into $L1 anyway? It is only when the work is simply to be a
pipeline that having a way to bypass the CPU completely really makes
a difference, right?

Right. But my point is that the CPU never really needs to "work" on the
encrypted data. It is frequently only sent to, or received from, the
network or a storage device, hence the pipelined approach has
advantages.

If the keys are visible in application memory, Spectré like attacks can
read out those keys. If the keys are visible in supervisor memory,
similar attack strategies can read them out. Thus, it makes sense that
the CPUs not be doing the cryption.

{{Or they could fix the µArchitecture so Spectré like attacks are
prevented but apparently they have no cause for that.}}


From MitchAlsup1@21:1/5 to Michael S on Wed Jun 5 17:04:49 2024

Michael S wrote:

On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)

The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.

Except for the Spectré like attacks that steal the keys if they are in
memory.

It's not that other, "piece-wise" encryption types can't be used, but
if you are serious about privacy you should consider them insufficient.


From Michael S@21:1/5 to Stephen Fuld on Wed Jun 5 20:06:43 2024

On Wed, 5 Jun 2024 16:16:32 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:

Michael S wrote:

snip lots of stuff about encryption alternatives

The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.
It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.

That's fair. But there are counterarguments: for example, not doing the
encryption on a processor that is also executing arbitrary user code
makes it more immune to side-channel attacks.

Side-channel attacks on AES were 99%-fantasy of bored (or
attention-seeking) security researchers even before Rijndael core was
put in CPU hardware. Much more so now.
Weak point tends to be key management rather than encryption itself.
And, BTW, running arbitrary hostile code on your computer is bad, bad,
bad idea for 1e9 other reasons.


From Stefan Monnier@21:1/5 to All on Wed Jun 5 13:37:12 2024

And, BTW, running arbitrary hostile code on your computer is bad, bad,
bad idea for 1e9 other reasons.

Can't disagree, yet every day that comes by, another activity is made
virtually impossible without allowing such arbitrary code on
your device. 🙁

Stefan


From Scott Lurndal@21:1/5 to [email protected] on Wed Jun 5 17:56:10 2024

[email protected] (MitchAlsup1) writes:

Stephen Fuld wrote:

Terje Mathisen wrote:

Usually, when a CPU needs to work on something, it will need to get
the data into $L1 anyway? It is only when the work is simply to be a
pipeline that having a way to bypass the CPU completely really makes
a difference, right?

Right. But my point is that the CPU never really needs to "work" on the
encrypted data. It is frequently only sent to, or received from, the
network or a storage device, hence the pipelined approach has
advantages.

If the keys are visible in application memory, Spectré like attacks can
read out those keys. If the keys are visible in supervisor memory,
similar attack strategies can read them out. Thus, it makes sense that
the CPUs not be doing the cryption.

That's why most modern platforms have TPM devices on board or
integrated on the SoC.

https://en.wikipedia.org/wiki/Trusted_Platform_Module


From Scott Lurndal@21:1/5 to Michael S on Wed Jun 5 17:58:01 2024

Michael S <[email protected]> writes:

On Wed, 5 Jun 2024 17:04:49 +0000
[email protected] (MitchAlsup1) wrote:

Michael S wrote:

On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)

The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.

Except for the Spectré like attacks that steal the keys if they are in
memory.

Spectre, not Spectré
https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)

It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.

And who exactly places the key into registers of your beloved shared
encryption device?

It is pretty trivial to bake private keys into hardware at the fab,
either through e-fuses or various other mechanisms.


From Michael S@21:1/5 to [email protected] on Wed Jun 5 20:15:53 2024

On Wed, 5 Jun 2024 17:04:49 +0000
[email protected] (MitchAlsup1) wrote:

Michael S wrote:

On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)

The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.

Except for the Spectré like attacks that steal the keys if they are in memory.

Spectre, not Spectré https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)

It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.

And who exactly places the key into registers of your beloved shared
encryption device? And, since device is shared, who exchanges keys
hundreds or thousands times per second? Not software? Not via memory?
It all makes situation much much worse rather than better.


From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jun 5 20:09:06 2024

Scott Lurndal wrote:

Michael S <[email protected]> writes:

It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.

And who exactly places the key into registers of your beloved shared
encryption device?

It is pretty trivial to bake private keys into hardware at the fab,
either through e-fuses or various other mechanisms.

Is that something the CIA or NSA would allow on their computers ??


From MitchAlsup1@21:1/5 to Michael S on Wed Jun 5 20:13:19 2024

Michael S wrote:

Side-channel attacks on AES were 99%-fantasy of bored (or
attention-seeking) security researchers even before Rijndael core was
put in CPU hardware. Much more so now.
Weak point tends to be key management rather than encryption itself.
And, BTW, running arbitrary hostile code on your computer is bad, bad,
bad idea for 1e9 other reasons.

Running arbitrary hostile code where the user address space is not
completely disjoint from the supervisor access space is ALSO a bad
Idea.


From MitchAlsup1@21:1/5 to Michael S on Wed Jun 5 20:11:03 2024

Michael S wrote:

Except for the Spectré like attacks that steal the keys if they are in
memory.

Spectre, not Spectré

My spelling has the advantage I can GOOGLE the *net and find anything I
have said about Spectré.


From Scott Lurndal@21:1/5 to [email protected] on Wed Jun 5 20:41:48 2024

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

Michael S <[email protected]> writes:

It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.

And who exactly places the key into registers of your beloved shared
encryption device?

It is pretty trivial to bake private keys into hardware at the fab,
either through e-fuses or various other mechanisms.

Is that something the CIA or NSA would allow on their computers ??

If they use Windows, yes. Windows requires a TPM for boot integrity.


From Terje Mathisen@21:1/5 to All on Thu Jun 6 09:10:38 2024

MitchAlsup1 wrote:

Terje Mathisen wrote:

MitchAlsup1 wrote:

I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with clever
coding.

How many instructions in the then-clause and in the else-clause ??
If these are smaller than 8, My 66000 can process them without
"branching" using predication.

No, the real problem is the context branching: After doing the 50%
branch you pick up one of two alternative contexts and follow totally
different paths, i.e. you cannot simply use the branch bit as an index.

If the number of instructions in the combined then and else clauses is
lower than a certain number, it is equally efficient to deal with the
branch as if it were later nullification rather than a redirection of
the fetch end of the pipeline. Here, NO prediction is required and
there is no chance of misprediction without regard to the predictability
of the control flow point. The whole point is that if the fetch end
of the pipeline will reach the convergence point before the branch
is fully resolved, then "don't branch": nullify. It saves cycles and
keeps unpredictable branches out of the branch predictor--even if
the apparent takenness of the branch is completely random--improving
the prediction accuracy of "real branches".

So, for example, let us postulate a 1-wide machine fetching 4 words per
clock and a then clause of 3 instructions and an else clause of 4 inst.
By the time the pseudo branch instruction enters execution, both the
then and the else have already been fetched, parsed, and are flowing
through decode. The execution of the branch merely decides which inst
survive the pipeline and there are no misprediction stalls. {{On a
wider machine, the fetch is even wider and the parse/decode BW is
still higher, so the mispredicted control flow point does not suffer misprediction repair costs.}}

Oddly enough, this is how predication works on My 66000.

I found ways to bypass the issues with the other two branches but this
one is fundamental.

It is fundamental only on ISAs that perform predication improperly,
or do not have predication, or use the predictor when predicating.
My 66000 is not one of them.

I return to the question posed earlier::
How many instructions in the then-clause and in the else-clause ??

From 100++ to 10K+? Effectively no path merge within any kind of a
visible window.

I.e. decoding CABAC is running a state machine with tens to hundreds
(afair) different states, with close to zero commonality between the
code for individual paths. There is almost zero if/then/else/endif local branching at this level.

I could see absolutely no way to avoid biting the bullet and actually
branch to the relevant code path.

Like I've written before, it is almost as if CABAC was designed to be as
hard as possible for a sw decoder.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Michael S@21:1/5 to [email protected] on Thu Jun 6 11:21:39 2024

On Wed, 5 Jun 2024 20:13:19 +0000
[email protected] (MitchAlsup1) wrote:

Michael S wrote:

And, BTW, running arbitrary hostile code on your computer is bad,
bad, bad idea for 1e9 other reasons.

Running arbitrary hostile code where the user address space is not
completely disjoint from the supervisor access space is ALSO a bad
Idea.

It sounds like you came to the verge of selling your soul to
microkerneliac heresy.


From MitchAlsup1@21:1/5 to Michael S on Thu Jun 6 13:45:10 2024

Michael S wrote:

On Wed, 5 Jun 2024 20:13:19 +0000
[email protected] (MitchAlsup1) wrote:

Michael S wrote:

And, BTW, running arbitrary hostile code on your computer is bad,
bad, bad idea for 1e9 other reasons.

Running arbitrary hostile code where the user address space is not
completely disjoint from the supervisor access space is ALSO a bad
Idea.

It sounds like you came to the verge of selling your soul to
microkerneliac heresy.

While My 66000 has the rapid context switching needed for efficient microKernels, the MMU has the functionality that application AGEN
cannot access supervisor space, while supervisor AGEN can access
application space. It is just setting up the model properly.


From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Fri Jun 7 03:35:41 2024

On Wed, 05 Jun 2024 20:41:48 GMT, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Is that something the CIA or NSA would allow on their computers ??

If they use Windows, yes.

There is an interview somewhere in which somebody high up in the US
military says that their Government’s reliance on Microsoft is their
single biggest security vulnerability.


From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Fri Jun 7 03:36:57 2024

On Wed, 05 Jun 2024 13:37:12 -0400, Stefan Monnier wrote:

... every day that comes by, another activity is made
virtually impossible without allowing such arbitrary code on your
device. 🙁

If you’re talking about WASM or JavaScript from websites, that runs in a carefully-designed sandbox.

If you’re talking about proprietary closed-source apps downloaded from
random sites ... just don’t.


From Stefan Monnier@21:1/5 to All on Fri Jun 7 10:48:56 2024

... every day that comes by, another activity is made virtually
impossible without allowing such arbitrary code on your device. 🙁

If you’re talking about WASM or JavaScript from websites, that runs in a carefully-designed sandbox.

The sandbox gives you only a very crude amount of control.
In practice it's still basically code over which you have no control
(beside "do I run it or not").

And your sandbox wants to provide access to a large part of your
machine's hardware anyway, in order to be able to run the many "web
applications". So, it comes with many "carefully-designed" holes.
And that's without counting hardware and software bugs.

Stefan


From EricP@21:1/5 to Stefan Monnier on Fri Jun 7 10:23:27 2024

Stefan Monnier wrote:

Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
Unicode comes with a 700kB `confusables.txt` listing such issues.

Eeewww... I didn't even think of that.
What does one do about them? You can't treat them as equivalent in a
string compare... the user might want the first B and not second B.

I suppose one would want two compare equal functions,
an exactly equal, and a visually approximately equal.
Like using a soundex for words to catch misspellings.

But then programmers need to decide when to use each compare.

These character and code attribute lookup tables are looking awkward.
With up to 2M codes, and some base character codes having multiple
possible combiners, but very sparse. And links between entries
for upper and lower case, and now links between confusables.
And we don't want to roll over the L1 cache just to do a string compare.
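
A rough Python sketch of the two-compare idea, assuming a tiny hand-picked table in place of the real `confusables.txt`. The names `exactly_equal`, `skeleton` and `visually_equal` are illustrative, and the approach is only loosely modelled on the "skeleton" comparison described in Unicode's security mechanisms (the real algorithm uses the full data file and re-normalizes after mapping).

```python
import unicodedata

# Tiny illustrative sample; the real confusables.txt is far larger.
CONFUSABLE_TO_PROTOTYPE = {
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u2215": "/",  # DIVISION SLASH
    "\u2044": "/",  # FRACTION SLASH
}

def exactly_equal(a: str, b: str) -> bool:
    """Code-point-for-code-point equality."""
    return a == b

def skeleton(s: str) -> str:
    """Normalize, then replace each listed confusable with its prototype."""
    s = unicodedata.normalize("NFD", s)
    return "".join(CONFUSABLE_TO_PROTOTYPE.get(ch, ch) for ch in s)

def visually_equal(a: str, b: str) -> bool:
    """Approximate 'looks the same' test: compare skeletons, not strings."""
    return skeleton(a) == skeleton(b)
```

Here a string made of Greek Beta and Cyrillic A compares unequal to "BA" under `exactly_equal` but equal under `visually_equal`; which of the two a program should call remains the programmer's decision, as noted above.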


From Terje Mathisen@21:1/5 to EricP on Fri Jun 7 17:05:42 2024

EricP wrote:

Stefan Monnier wrote:

Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
Unicode comes with a 700kB `confusables.txt` listing such issues.

Eeewww... I didn't even think of that.
What does one do about them? You can't treat them as equivalent in a
string compare... the user might want the first B and not second B.

I suppose one would want two compare equal functions,
an exactly equal, and a visually approximately equal.
Like using a soundex for words to catch misspellings.

But then programmers need to decide when to use each compare.

These character and code attribute lookup tables are looking awkward.
With up to 2M codes, and some base character codes having multiple
possible combiners, but very sparse. And links between entries
for upper and lower case, and now links between confusables.
And we don't want to roll over the L1 cache just to do a string compare.

Years ago I considered case-insensitive Boyer-Moore text search with a
wide alphabet and found that the only approach that made sense was to
maintain two copies of the string to be searched for, one lower and one
upper case, where each "character" was a length-encoded string. This was required to handle things like the German double s which can uppercase
into a single letter.

The lookup table for skip lengths was still far shorter than the
alphabet size, effectively a very short and fast hash of the current character/codepoint/combined letter.
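
A small Python sketch of the "each character is a length-encoded string" idea, under stated assumptions: it uses Python's built-in `str.lower()` and `str.upper()` as the case tables, the helper names are hypothetical, and the matcher is a naive position check, not Boyer-Moore and not the original implementation.

```python
def case_variants(pattern):
    """For each pattern character, its (lower, upper) forms.

    Either form may be longer than one code point, e.g. the German
    sharp s uppercases to the two-letter string "SS" in Python.
    """
    return [(ch.lower(), ch.upper()) for ch in pattern]

def matches_at(text, pos, variants):
    """Return how many code points of `text` match at `pos`, or -1."""
    i = pos
    for lower, upper in variants:
        if text.startswith(lower, i):
            i += len(lower)
        elif text.startswith(upper, i):
            i += len(upper)
        else:
            return -1
    return i - pos

variants = case_variants("Straße")
```

With these variants, `matches_at("straße", 0, variants)` consumes 6 code points while `matches_at("STRASSE", 0, variants)` consumes 7, which is why the matched length has to be tracked per element instead of being assumed equal to the pattern length.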

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From EricP@21:1/5 to Terje Mathisen on Fri Jun 7 11:35:50 2024

Terje Mathisen wrote:

EricP wrote:

Stefan Monnier wrote:

Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
Unicode comes with a 700kB `confusables.txt` listing such issues.

Eeewww... I didn't even think of that.
What does one do about them? You can't treat them as equivalent in a
string compare... the user might want the first B and not second B.

I suppose one would want two compare equal functions,
an exactly equal, and a visually approximately equal.
Like using a soundex for words to catch misspellings.

But then programmers need to decide when to use each compare.

These character and code attribute lookup tables are looking awkward.
With up to 2M codes, and some base character codes having multiple
possible combiners, but very sparse. And links between entries
for upper and lower case, and now links between confusables.
And we don't want to roll over the L1 cache just to do a string compare.

Years ago I considered case-insensitive Boyer-Moore text search with a
wide alphabet and found that the only approach that made sense was to maintain two copies of the string to be searched for, one lower and one
upper case, where each "character" was a length-encoded string. This was required to handle things like the German double s which can uppercase
into a single letter.

The lookup table for skip lengths was still far shorter than the
alphabet size, effectively a very short and fast hash of the current character/codepoint/combined letter.

Terje

Or perhaps rather than mapping upper into lower or lower into upper,
and special cases like German double s, and confusables, into each other, instead we map all into a third hyper-character (because like hyperspace
it intersects with all points in real space).

Each real character (RCH) maps to a single hyper character (HCH)
and a single HCH maps back to one or more RCH.
And you might not even need a reverse map if all you do is compare.

That's probably too simple.
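
For what it is worth, a minimal Python sketch of the RCH-to-HCH idea, assuming `str.casefold()` for the case part and a two-entry sample table for the confusables. The names `hch`, `hyper_equal` and `build_reverse_map` are hypothetical, and the hyper character is represented as a short string because case folding can yield more than one code point.

```python
from collections import defaultdict

# Tiny illustrative sample of confusables mapped to a Latin prototype.
CONFUSABLES = {"\u0392": "B", "\u0410": "A"}  # Greek Beta, Cyrillic A

def hch(rch: str) -> str:
    """Forward map: one real character to its hyper character."""
    return CONFUSABLES.get(rch, rch).casefold()

def hyper_equal(a: str, b: str) -> bool:
    """Compare two strings in hyper-character space only."""
    return "".join(hch(c) for c in a) == "".join(hch(c) for c in b)

def build_reverse_map(alphabet):
    """Optional reverse map: hyper character to the set of real characters."""
    reverse = defaultdict(set)
    for rch in alphabet:
        reverse[hch(rch)].add(rch)
    return reverse

reverse = build_reverse_map("Bb\u0392Aa\u0410")
```

Comparison needs only the forward map; the reverse map is built separately and can be skipped entirely when all that is wanted is an equality test.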

