Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long time after, machines had a “word length” which was the smallest addressable unit of memory. This could have a range of sizes, e.g.
12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray
I’m sure there were also 24- and 48-bit machines. Note the popularity of numbers with a range of different integer divisors, including powers of
both 2 and 3. The byte-addressable machines chucked away everything other than powers of 2, which was a step backwards in this respect. ;)
(Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)
Why was byte addressing invented? I think it was for easy handling of
strings and other binary data. But why stop there? I guess the idea of
going all the way down to bit-level addressing was considered a bit
extreme?
Certainly if you only had 32 (or, on those early IBMs, 24)
address bits, then using 3 of them to address within a byte would have substantially cut down the available size of your address space.
I think the move to 64-bit architectures missed a trick, though: it could have introduced bit-level addressing at the same time, given that we still have plenty of address bits to spare. That would simplify bit-field manipulations, too.
One side-effect of byte addressing has been the “endian wars”: the inconsistency, between different machine architectures, of how to order
the bytes making up multibyte objects, particularly numbers. Big-endian supposedly had the advantage of making memory dumps easier to read, but little-endian always made more logical sense.
Nowadays, all the common CPU architectures are at least available in little-endian form, if not exclusively so. But we still have legacy
oddities, like the TCP/IP network stack where integer fields are laid out
in big-endian ordering.
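To make the byte-ordering difference concrete, here is a minimal Python sketch using the standard `struct` module. The sample value `0x0A0B0C0D` is an arbitrary choice for illustration; `"!"` is the `struct` format prefix for network order, which is big-endian.

```python
import struct

# Arbitrary 32-bit sample value, chosen so each byte is recognisable.
value = 0x0A0B0C0D

little = struct.pack("<I", value)   # least significant byte stored first
big = struct.pack(">I", value)      # most significant byte stored first
network = struct.pack("!I", value)  # network order, as used for TCP/IP header fields

print(little.hex())    # 0d0c0b0a
print(big.hex())       # 0a0b0c0d
print(network == big)  # True
```

The same four bytes appear in both layouts; only their order in memory differs.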
I don't see what is wrong with loading a container with the field and
then extracting or inserting into the container.
BE means you can read the strings in a core dump
LE means the bytes arrive in the order for on-line arithmetic
LE allows one to make 8-bit wide data paths and still implement a full
width architecture (but then so did the 360/30)
Until the PDP-11, all byte addressed machines were bigendian. Despite
a lot of looking, I have never found an explanation of why DEC made
the PDP-11 littlendian. I'm reasonably sure they were aware that it
was reversed from the 360, but they never said why.
Please do me a favor and DO NOT guess why they did it -- we have
already had lots and lots of guesses and we have no way to tell
whether any of the guesses are right.
(Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)
On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:
And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that 8-bit bytes
were a natural choice.
You mean IBM mainframes?
I don’t think any other mainframes were byte-addressable.
(Interesting that the microprocessor world made byte addressing--and
ASCII character encoding--universal right from the beginning.
Starting from a clean slate, I guess.)
Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long time after, machines had a “word length” which was the smallest addressable unit of memory. This could have a range of sizes, e.g.
12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray
I guess the idea of going all the way down to bit-level addressing
was considered a bit extreme?
STRETCH had bit addressing. It added a great deal of complication for
very little benefit. In the relatively rare situations where you want
to handle bit fields, shifting and masking is good enough without
slowing everything else down.
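For readers who have not written it out, the "shifting and masking" approach looks roughly like the Python sketch below. The function names `extract_field` and `insert_field` are made up for this illustration, and bit offsets are counted from the least significant bit, which is an assumption rather than anything the post specifies.

```python
def extract_field(word, offset, width):
    """Return the width-bit field that starts at bit `offset` of `word`."""
    mask = (1 << width) - 1
    return (word >> offset) & mask


def insert_field(word, offset, width, value):
    """Return `word` with the width-bit field at `offset` replaced by `value`."""
    mask = ((1 << width) - 1) << offset
    # Clear the old field, then OR in the new value (truncated to the field width).
    return (word & ~mask) | ((value << offset) & mask)


word = 0xABCD
print(hex(extract_field(word, 4, 8)))       # 0xbc
print(hex(insert_field(word, 4, 8, 0x12)))  # 0xa12d
```

Each access costs a shift and an AND (plus an OR for inserts), which is the overhead being weighed against dedicated bit addressing.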
On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:
I don't see what is wrong with loading a container with the field and
then extracting or inserting into the container.
You still need a place to put a bit offset for the base address of the
field. Why not put it together with the rest of the address?
BE means you can read the strings in a core dump
LE means the bytes arrive in the order for on-line arithmetic
LE allows one to make 8-bit wide data paths and still implement a full
width architecture (but then so did the 360/30)
The way I think of it is: consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values of bits in an integer
The only way to get all 3 consistent is with a little-endian architecture. Every big-endian architecture has inconsistencies between these somewhere
or another.
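One way to read the consistency claim, sketched in Python: if bits within a byte are numbered from the least significant end, and bytes are numbered from the lowest address, then on a little-endian layout bit n of memory always has place value 2**n. The helper `bit_from_memory` and the sample value are assumptions for illustration only; this shows the one numbering scheme described, and is not a proof that no other scheme can be made consistent.

```python
def bit_from_memory(mem, n):
    """Read bit n of a byte string, counting from the LSB of the byte at address 0."""
    return (mem[n // 8] >> (n % 8)) & 1


value = 0x12345678
le = value.to_bytes(4, "little")
be = value.to_bytes(4, "big")

# Does memory bit n equal the integer's bit with place value 2**n, for every n?
le_ok = all(bit_from_memory(le, n) == (value >> n) & 1 for n in range(32))
be_ok = all(bit_from_memory(be, n) == (value >> n) & 1 for n in range(32))

print(le_ok)  # True
print(be_ok)  # False
```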
Lawrence D'Oliveiro <[email protected]d> wrote:
(Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)
A major market for microprocessors were pocket calculators,
cash registers and the like, which is why having 8 bits and BCD
arithmetic was an advantage - see the DAA instruction of the 8080
or the decimal flag on the 6502.
The basis of the 8008, the first serious microprocessor,
was the Datapoint 2200. A nice history can be found at http://www.righto.com/2023/08/datapoint-to-8086.html .
And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that
8-bit bytes were a natural choice. (And I still think that
having BCD influenced the decision to go to the 8-bit byte
on the /360).
So, anything but a clean slate.
Lawrence D'Oliveiro <[email protected]d> wrote:
On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:
And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that 8-bit bytes were a natural choice.
You mean IBM mainframes?
And compatibles. Together, they accounted for almost all mainframes.
I don’t think any other mainframes were byte-addressable.
IBM set the minimum standard for character capabilities; a
terminal had to support eight bits or be laughed out of the market. Addressability has little to do with it.
Hmm... what sort of terminals and character sets did people use on
a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
the ASCII standard came out. (And /360 was supposed to support
ASCII originally, but that bit in the PSW got dropped for the /370,
I believe).
Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> wrote:
On Wed, 1 May 2024 07:43:52 -0000 (UTC), Thomas Koenig wrote:
And as the Datapoint 2200 was originally a "smart terminal",
it had to be able to connect to mainframes, which meant that 8-bit bytes were a natural choice.
You mean IBM mainframes?
And compatibles. Together, they accounted for almost all mainframes.
I don’t think any other mainframes were byte-addressable.
IBM set the minimum standard for character capabilities; a
terminal had to support eight bits or be laughed out of the market.
Addressability has little to do with it.
Hmm... what sort of terminals and character sets did people use on
a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
the ASCII standard came out. (And /360 was supposed to support
ASCII originally, but that bit in the PSW got dropped for the /370,
I believe).
PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
ASCII character set. Programming languages and editors tended to use
the 6-bit character set.
Please do me a favor and DO NOT guess why they did it --
Concerning the speculations about the PDP-11, here's one: Was it
designed for also supporting an implementation with a 4-bit or 8-bit
basis?
The PDP-X (the DEC-internal project that was canceled in favor of the
PDP-11 and eventually became the Nova) might have influenced the
PDP-11 in that way.
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
I guess the idea of going all the way down to bit-level addressing
was considered a bit extreme?
STRETCH had bit addressing. It added a great deal of complication for
very little benefit. In the relatively rare situations where you want
to handle bit fields, shifting and masking is good enough without
slowing everything else down.
Bit addressing doesn't have to be expensive: the DEC Alpha could have
decided to use bit-addressing simply by ignoring/trapping more of the
lowest bits than it did.
8-bit bytes were a natural choice. (And I still think that
having BCD influenced the decision to go to the 8-bit byte
on the /360).
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
Ahem. You're guessing.
I can assure you it didn't make more sense to all the people who read
360 core dumps. BTDT.
I gather the PDP-X and PDP-11 were warring camps. There's a bunch
of PDP-X notes at bitsavers and I don't see anything related to
the -11. In the Bell et al book there's a lot about the -11 which
only says it's different from the -8 and -9 series.
I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.
Thomas Koenig wrote:
Hmm... what sort of terminals and character sets did people use on
a PDP-10? 7-bit ASCII? It (and the PDP-6) were released before
the ASCII standard came out.
(And /360 was supposed to support
ASCII originally, but that bit in the PSW got dropped for the /370,
I believe).
PDP 10 had a 6-bit "field data" character set and a 9-bit bigger than
ASCII character set.
Ahem. You're guessing.
I can assure you it didn't make more sense to all the people who read
360 core dumps. BTDT.
To be fair, the tool that formatted the core dump could easily have
arranged the human visible values appropriately, much like xxd(1)
on Linux does for little-endian values (i.e. when grouped with
four bytes per group (32 bits), the byte 3 value is printed first).
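A rough Python sketch of that kind of dump formatting follows. The function name `dump_le32` is invented for this example, and it assumes the data consists of whole 32-bit little-endian values; it is not meant to reproduce the exact output format of any particular tool.

```python
def dump_le32(data):
    """Format bytes as 32-bit groups, printing each group's highest-address
    byte first so that little-endian values read naturally."""
    groups = []
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        groups.append(chunk[::-1].hex())  # reverse the byte order within the group
    return " ".join(groups)


# Two sample values stored little-endian, as they would sit in memory.
raw = (0x12345678).to_bytes(4, "little") + (0xDEADBEEF).to_bytes(4, "little")

print(raw.hex())       # 78563412efbeadde  (raw memory order)
print(dump_le32(raw))  # 12345678 deadbeef  (grouped for human reading)
```

The trade-off is that character strings in the same dump come out reversed within each group, so a formatter has to know which regions hold numbers and which hold text.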
I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.
Historically, the advantages vs disadvantages have indeed been rather
against bit-addressing. AFAICT when the DEC Alpha came out was the most favorable time: the first time that the cost was low enough (they
already had byte-addressing without byte-granularity of accesses,
they had plenty of address bits to waste, and there wasn't too much
existing 64bit code to break) to make the idea palatable.
Practical benefits are fairly limited, but it would just be The Right
thing to do, making it "easy" to eliminate some arbitrary restrictions
in languages like C such as the inability to take the address of a
struct's bitsized field. It would also have given an extra 3 bits to
play with for tagging purposes :-)
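As a sketch of what a bit-granular pointer might look like, the Python below packs a byte address and a 3-bit offset into one integer. All names here are hypothetical, and the layout (bit offset in the three lowest bits, bits counted from the LSB of each byte) is an assumption for illustration, not a description of the Alpha or any real architecture.

```python
BIT_OFFSET_BITS = 3  # 2**3 = 8 bits per byte


def make_bit_address(byte_address, bit_offset):
    """Combine a byte address and a bit offset (0..7) into one bit address."""
    if not 0 <= bit_offset < 8:
        raise ValueError("bit_offset must be in 0..7")
    return (byte_address << BIT_OFFSET_BITS) | bit_offset


def split_bit_address(bit_address):
    """Recover (byte_address, bit_offset) from a bit address."""
    return bit_address >> BIT_OFFSET_BITS, bit_address & 0b111


def load_bit(memory, bit_address):
    """Fetch a single bit from a bytes-like object."""
    byte_address, bit_offset = split_bit_address(bit_address)
    return (memory[byte_address] >> bit_offset) & 1


memory = bytes([0b00000100, 0xFF])
print(make_bit_address(1, 5))                      # 13
print(load_bit(memory, make_bit_address(0, 2)))    # 1

# Cost of the scheme: a 64-bit pointer keeps 61 bits for the byte address.
print(64 - BIT_OFFSET_BITS)                        # 61
```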
Apparently, the PDP-11 was originally an 8-bit "desk calculator"
project which was then developed into the 16-bit architecture.
I have also read somewhere that competition from the Nova played
a major role.
I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.
Historically, the advantages vs disadvantages have indeed been rather
against bit-addressing. AFAICT when the DEC Alpha came out was the most favorable time: the first time that the cost was low enough (they
already had byte-addressing without byte-granularity of accesses,
they had plenty of address bits to waste, and there wasn't too much
existing 64bit code to break) to make the idea palatable.
Practical benefits are fairly limited, but it would just be The Right
thing to do, making it "easy" to eliminate some arbitrary restrictions
in languages like C such as the inability to take the address of
a struct's bitsized field. It would also have given an extra 3 bits to
play with for tagging purposes :-)
Stefan
Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> wrote:
(Interesting that the microprocessor world made byte addressing--and ASCII character encoding--universal right from the beginning. Starting from a
clean slate, I guess.)
A major market for microprocessors were pocket calculators,
cash registers and the like, which is why having 8 bits and BCD
arithmetic was an advantage - see the DAA instruction of the 8080
or the decimal flag on the 6502.
From 1978-1980 I worked at NCR corporation on cash registers.
We made a BASIC interpreter as the programmable backbone of
the cash register lineup. Not a single decimal arithmetic
instruction was used in the cash register application. The
BASIC interpreter was written by a 5-man team in 8085 assembler.
That model was sold from 1979 through 1998. So the lack of
decimal arithmetic was not a significant disadvantage.
Stefan Monnier wrote:
I agree that with 64 bit addresses and memory that is pennies per megabyte the tradeoffs are different but that horse left the barn
50 years ago. And I still don't think that bit operations are
common enough to be worth using bits in every non-bit address.
Historically, the advantages vs disadvantages have indeed been
rather against bit-addressing. AFAICT when the DEC Alpha came out
was the most favorable time: the first time that the cost was low
enough (they already had byte-addressing without byte-granularity
of accesses, they had plenty of address bits to waste, and there
wasn't too much existing 64bit code to break) to make the idea
palatable.
Probably, but looking at code one rarely sees a field in a struct
that is a bit-field. So, even if the cost was low, the benefits
are similarly low.
Practical benefits are fairly limited, but it would just be The
Right thing to do, making it "easy" to eliminate some arbitrary restrictions in languages like C such as the inability to take the
address of a struct's bitsized field. It would also have given an
extra 3 bits to play with for tagging purposes :-)
Stefan
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian.
Despite a lot of looking, I have never found an explanation of why
DEC made the PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
Unfortunately, when their Fortran compiler implemented 32-bit
integers (in software), they got the words the wrong way round.
The VAX was like a 32-bit extension of the PDP-11, and it was
consistently little-endian everywhere.
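The "words the wrong way round" layout is what is commonly called PDP-endian or middle-endian: the two 16-bit words are stored high word first, while the bytes inside each word stay little-endian. A minimal Python sketch, with the function name `pdp_endian_bytes` and the sample value chosen purely for illustration:

```python
def pdp_endian_bytes(value):
    """Lay out a 32-bit integer as two little-endian 16-bit words,
    most significant word first (the mixed-endian layout)."""
    high_word = (value >> 16) & 0xFFFF
    low_word = value & 0xFFFF
    return high_word.to_bytes(2, "little") + low_word.to_bytes(2, "little")


value = 0x0A0B0C0D
print(value.to_bytes(4, "little").hex())  # 0d0c0b0a  (consistent little-endian)
print(value.to_bytes(4, "big").hex())     # 0a0b0c0d  (big-endian)
print(pdp_endian_bytes(value).hex())      # 0b0a0d0c  (mixed)
```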
According to MitchAlsup1 <[email protected]>:
PDP 10 had a 6-bit "field data" character set and a 9-bit bigger
than ASCII character set.
Dunno what computer that was, but it wasn't a PDP-10. Univac or
GE600 maybe?
Lawrence D'Oliveiro wrote:
On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:
I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.
You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?
Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.
According to Stefan Monnier <[email protected]>:
I guess the idea of going all the way down to bit-level
addressing
was considered a bit extreme?
STRETCH had bit addressing. It added a great deal of complication
for very little benefit. In the relatively rare situations where
you want to handle bit fields, shifting and masking is good enough
without slowing everything else down.
Bit addressing doesn't have to be expensive: the DEC Alpha could have decided to use bit-addressing simply by ignoring/trapping more of the lowest bits than it did.
That would waste three bits in every address, which would have been phenomenally expensive in the 1960s when every byte cost real money.
The 360 had 12 bit displacements, so you could address a 4K range
without having to load another base register. This would shrink
it to 512 bytes, so as a first approximation you'd need eight times as
many base register loads. Nope.
I agree that with 64 bit addresses and memory that is pennies per
megabyte the tradeoffs are different but that horse left the barn 50
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.
On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:
Lawrence D'Oliveiro wrote:
On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:
I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.
You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?
Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.
At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
for that long or longer.
The prospects of other byte-addressable types of memory look even
bleaker than DRAM's.
The only memory tech that is doing better is NAND flash, but it is
inherently block-addressable.
Probably, but looking at code one rarely sees a field in a struct
that is a bit-field. So, even if the cost was low, the benefits
are similarly low.
Sure. But it isn't clear if that was the cause or the result of the hardware.
At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now.
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
Lawrence D'Oliveiro wrote:
You still need a place to put a bit offset for the base address of the
field. Why not put it together with the rest of the address?
Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full 63-bit virtual address space per thread. Therefore, no bits in the pointer are available for bit level addressing.
The way I think of it is: consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values of bits in an integer
The only way to get all 3 consistent is with a little-endian
architecture. Every big-endian architecture has inconsistencies between
these somewhere or another.
Very many LE machines got one or more of those wrong, too.
Hmm... what sort of terminals and character sets did people use on a
PDP-10? 7-bit ASCII? It (and the PDP-6) were released before the ASCII standard came out.
(And /360 was supposed to support ASCII originally,
but that bit in the PSW got dropped for the /370, I believe).
The way I think of it is: consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values of bits in an integer
The only way to get all 3 consistent is with a little-endian
architecture. Every big-endian architecture has inconsistencies between
these somewhere or another.
Very many LE machines got one or more of those wrong, too.
For example?
(And /360 was supposed to support ASCII originally,
but that bit in the PSW got dropped for the /370, I believe).
Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing its own EBCDIC encoding was that ASCII wasn’t ready in time.
On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:
Lawrence D'Oliveiro wrote:
On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:
I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.
You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?
Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.
At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
for that long or longer.
The prospects of other byte-addressable types of memory look even
bleaker than DRAM's.
Michael S wrote:
On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:
Lawrence D'Oliveiro wrote:
On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:
I don't see what is wrong with loading a container with the
field and then extracting or inserting into the container.
You still need a place to put a bit offset for the base address
of the field. Why not put it together with the rest of the
address?
Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.
At current rate of DRAM Moore's Law it does not look like anybody
would need 63 bits 40 years from now. Arm's 55 or 56 bits will
likely suffice for that long or longer.
The largest single system memory I can find quickly is 160TB or about 47-bits of address space (I rounded down).
Given one can use CXL to coherently link multiples of such a system,
and not be limited by the number of pins dedicated to DRAM access;
40 years of growth at ½ a bit per year, already exceeds the 63-bit
address space (47+40/2 = 67 bits).
The prospects of other byte-addressable types of memory look even
bleaker than DRAM's.
Agreed (barring some kind of miracle).
The only memory tech that is doing better is NAND flash, but it is inherently block-addressable.
And becomes the backing store.
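The address-bit projection in this post (160TB today, 47+40/2 = 67 bits in 40 years) can be checked with a few lines of Python. Taking 1TB as 2**40 bytes is an assumption made here; the rounded-down result is the same with decimal terabytes.

```python
import math

# 160TB, assuming 1TB = 2**40 bytes.
current_bytes = 160 * 2**40
current_bits = math.floor(math.log2(current_bytes))  # rounded down, as in the post

growth_bits_per_year = 0.5   # half an address bit per year
years = 40
projected_bits = current_bits + growth_bits_per_year * years

print(current_bits)          # 47
print(projected_bits)        # 67.0
print(projected_bits > 63)   # True
```

The conclusion depends entirely on the assumed growth rate, which is exactly what the thread goes on to dispute.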
years ago. And I still don't think that bit operations are common
enough to be worth using bits in every non-bit address.
Bit-addressable TMS34010 was released 38 years ago and even was
moderately successful. So, it seems, 50 years ago nothing was set in
stone yet.
That would waste three bits in every address, which would have been phenomenally expensive in the 1960s when every byte cost real money.
According to Lawrence D'Oliveiro <[email protected]d>:
As I previously mentioned, little-endian just makes more sense.
Ahem. You're guessing.
On Wed, 1 May 2024 20:30:16 +0000
[email protected] (MitchAlsup1) wrote:
Given one can use CXL to coherently link multiples of such a system,
and not be limited by the number of pins dedicated to DRAM access;
But it would be very slow, so slow that it defeats the point of direct addressability.
Michael S wrote:
On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:
Lawrence D'Oliveiro wrote:
On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:
I don't see what is wrong with loading a container with the field
and then extracting or inserting into the container.
You still need a place to put a bit offset for the base address of
the field. Why not put it together with the rest of the address?
Given a 20-40 year life of an architecture and the desire not to be
limited by addressability; I wanted and demanded of myself a full
63-bit virtual address space per thread. Therefore, no bits in the
pointer are available for bit level addressing.
At current rate of DRAM Moore's Law it does not look like anybody would
need 63 bits 40 years from now. Arm's 55 or 56 bits will likely suffice
for that long or longer.
The largest single system memory I can find quickly is 160TB or about
47-bits of address space (I rounded down).
Michael S <[email protected]> writes:
On Wed, 1 May 2024 20:30:16 +0000
[email protected] (MitchAlsup1) wrote:
Given one can use CXL to coherently link multiples of such a
system, and not be limited by the number of pins dedicated to DRAM
access;
But it would be very slow, so slow that it defeats the point of
direct addressability.
On what basis do you make that statement? CXL-memory is real,
and can be implemented on chiplets in an MCM with better
than multisocket latencies. Add Gen6 PCIe cut-through switching
and you get reasonable and useful latencies across a switched fabric.
Even a decade and a half ago, when we built a similar system using
QDR InfiniBand and a custom ASIC connected to HT or QPI,
we had internode latencies of less than 400ns r/t, which
was about double the Intel inter-socket latencies at the time.
MitchAlsup1 <[email protected]> wrote:
Michael S wrote:
On Wed, 1 May 2024 16:38:09 +0000
[email protected] (MitchAlsup1) wrote:
Lawrence D'Oliveiro wrote:
On Wed, 1 May 2024 03:02:07 +0000, MitchAlsup1 wrote:
I don't see what is wrong with loading a container with the
field and then extracting or inserting into the container.
You still need a place to put a bit offset for the base address
of the field. Why not put it together with the rest of the
address?
Given a 20-40 year life of an architecture and the desire not to
be limited by addressability; I wanted and demanded of myself a
full 63-bit virtual address space per thread. Therefore, no bits
in the pointer are available for bit level addressing.
At current rate of DRAM Moore's Law it does not look like anybody
would need 63 bits 40 years from now. Arm's 55 or 56 bits will
likely suffice for that long or longer.
The largest single system memory I can find quickly is 160TB or
about 47-bits of address space (I rounded down).
A single Power10 CPU can address 2 Petabytes (51 bits), but of course
it need not be all RAM.
I wouldn't want to try and run linux
on it but it's great for signal processing.
According to Michael S <[email protected]>:
Bit-addressable TMS34010 was released 38 years ago and even was
moderately successful. So, it seems, 50 years ago nothing was set in
stone yet.
True, but that chip is designed to be good for video rendering which is
an unusual application that uses a lot of bit aligned data.
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:
"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian."
MitchAlsup1 wrote:
... looking at code one rarely sees a field in a struct that
is a bit-field. So, even if the cost was low, the benefits are
similarly low.
Sure. But it isn't clear if that was the cause or the result of the hardware.
On "personal" computers ... there's been work instead on compressing
64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome versions) ...
"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian."
It’s easy to illustrate why they’re wrong. First of all, a note that, even on big-endian architectures, registers are still actually little-endian.
Bit-addressable TMS34010 was released 38 years ago and even was moderately successful. So, it seems, 50 years ago nothing was set in stone yet.
True, but that chip is designed to be good for video rendering which is
an unusual application that uses a lot of bit aligned data.
And yet, all our machines nowadays are doing heavy amounts of “video rendering”, aren’t they? Look at the machine generating the screen display you’re looking at right now.
The PDP-11 had mixed endian 32 bit integers and floats.
VAX floating point was pretty muddled, too.
Intel has been consistently little endian as far as I can remember.
What about the IBM 1401, Electrodata 220 or Burroughs B5000?
According to Lawrence D'Oliveiro <[email protected]d>:
Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing its own EBCDIC encoding was that ASCII wasn’t ready in time.
If you'd read the paper on the Architecture of System/360, you'd know
that is just plain wrong. See the link I posted earlier today.
On Wed, 1 May 2024 20:53:06 -0000 (UTC), John Levine wrote:
According to Lawrence D'Oliveiro <[email protected]d>:
Both ASCII and the System/360 came out in 1964. IBM’s excuse for inventing its own EBCDIC encoding was that ASCII wasn’t ready in time.
If you'd read the paper on the Architecture of System/360, you'd know
that is just plain wrong. See the link I posted earlier today.
See also these links:
On Wed, 1 May 2024 20:50:23 -0000 (UTC), John Levine wrote:
The PDP-11 had mixed endian 32 bit integers and floats.
The PDP-11 had no 32-bit integer instructions.
It was the Fortran compiler (specifically “Fortran IV Plus”) that had mixed-endian 32-bit integers.
On Wed, 01 May 2024 14:08:25 GMT, Scott Lurndal wrote:
What about the IBM 1401, Electrodata 220 or Burroughs B5000?
Not really familiar with those--feel free to mention more details if you
have them.
Though I do recall, the 1401 didn’t have a “word length” as such: it was a
“character”-based machine. For example, it could do arbitrary-precision arithmetic--it just kept processing digits until it hit a special end-of-data marker--but obviously this only worked for (fixed-point) addition and subtraction. The machine had no hardware support for multiplication or division. Or floating-point, for that matter.
In the world of general-purpose microprocessors, DEC Alpha (until EV6)
was more like word-addressable than byte-addressable, although it is a
matter of point of view.
Every computer these days does graphics rendering
It appears that Lawrence D'Oliveiro <[email protected]d> said:
"Unlike Swift's, the computer Endian controversy is not pointless.
The Little Endian design has many complications in use; we much
prefer the Big Endian."
It’s easy to illustrate why they’re wrong. First of all, a note that, even on big-endian architectures, registers are still actually little-endian.
I would be most interested in a concrete illustration of this
implausible argument.
According to Lawrence D'Oliveiro <[email protected]d>:
Though I do recall, the 1401 didn’t have a “word length” as such:
it was a “character”-based machine. For example, it could do
arbitrary-precision arithmetic--it just kept processing digits
until it hit a special end-of-data marker--but obviously this only
worked for (fixed-point) addition and subtraction. The machine had
no hardware support for multiplication or division. Or
floating-point, for that matter.
You may be confusing it with the 1620.
Lawrence D'Oliveiro wrote:
Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a
long time after, machines had a “word length” which was the
smallest addressable unit of memory. This could have a range of sizes,
e.g.
12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray
CDC had a number of machines with 12-bit times k words. k element {1,2,3,5}
I’m sure there were also 24- and 48-bit machines. Note the
popularity of numbers with a range of different integer divisors,
including powers of both 2 and 3. The byte-addressable machines
chucked away everything other than powers of 2, which was a step
backwards in this respect. ;)
I would make the argument that 2^k was a step forward not backwards.
Perhaps another day...
Working with trits, encoded as -/0/+, would have been feasible, but
binary provided much easier implementation. Base conversions are a bit
messier when you use base3 as the machine representation, but you could
have used 5 trits (243) to handle the US ASCII character set.
In retrospect I'm glad they decided on binary!
The “Guide to 1401 Programming” I’m looking at (from 1961) makes no mention of multiplication or division.
I've seen the argument that e is the best base from an energy
standpoint, with 2 and 3 being the two closest integer values.
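The "e is the best base" argument is usually made with a simple cost model: representing a number takes about 1/ln(b) digits in base b, and each digit needs b distinguishable states, so the relative cost is b / ln(b). The sketch below evaluates that expression. The cost model and the name radix_cost are assumptions made for illustration; whether this maps onto real energy use is the poster's claim, not something the code demonstrates.

```python
import math

def radix_cost(base):
    """Relative cost b / ln(b): digit count scales as 1/ln(b), states per digit as b."""
    return base / math.log(base)

# Compare the two nearest integer bases with e itself, plus base 4 for reference.
for b in (2, math.e, 3, 4):
    print(f"base {b:.3f}: relative cost {radix_cost(b):.4f}")
```

Under this model the minimum is at e, base 3 comes out slightly ahead of base 2, and base 4 ties with base 2.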
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian.
Despite a lot of looking, I have never found an explanation of why
DEC made the PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:
"Unlike Swift's, the computer Endian controversy is not pointless.
The Little Endian design has many complications in use; we much
prefer the Big Endian. Having two active conventions is very painful.
Several recent Big Endian RISC computers, including the MIPS, the
Motorola 88000, and the Intel i860 provide a data-movement operation
that can perform the Big Endian-Little Endian permutation. We predict
that Little Endian addressing will die out, just as decimal addressing
did."
Really, people like what they are used to. They were just wrong about
the i860 which was little endian, but had a mode bit to make data
addressing big endian.
that can perform the Big Endian-Little Endian permutation. We predict
that Little Endian addressing will die out, just as decimal addressing
did."
IMHO, statements like that are forgivable for Blaauw (born 1924). Less
so for 7 years younger Brooks.
Really, people like what they are used to. They were just wrong about
the i860 which was little endian, but had a mode bit to make data
addressing big endian.
Expressions of personal prejudices are fine for informal Usenet
articles. For a book that pretends to be more than a memoir I expect more rigorous reasoning.
Sure. Consider this pseudo-assembly-language sequence:
move.l a, b
move.b b, c
...
Now the question is: which byte from “a” ends up at location “c”?
In other words, even on big-endian architectures, registers are still interpreted as little-endian!
Isn’t that fun?
John Levine wrote:
snip
Every computer these days does graphics rendering
Is that true? What about all those computers that make up Google's
server farm? Or how about AWS systems? I am not saying they don't,
just asking.
According to Stephen Fuld <[email protected]d>:
John Levine wrote:
snip
Every computer these days does graphics rendering
Is that true? What about all those computers that make up Google's
server farm? Or how about AWS systems? I am not saying they don't,
just asking.
AWS has several varieties of their custom Graviton chips:
https://aws.amazon.com/ec2/graviton/
Some of them are just ARM cores for stuff like databases but some are intended for video processing and game streaming:
https://aws.amazon.com/ec2/instance-types/g5g/
So you're right, it's not every computer, but it's more than you might think.
On Wed, 01 May 2024 21:40:17 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Wed, 1 May 2024 20:30:16 +0000
[email protected] (MitchAlsup1) wrote:
Given one can use CXL to coherently link multiples of such a
system, and not be limited by the number of pins dedicated to DRAM
access;
But it would be very slow, so slow that it defeats the point of
direct addressability.
On what basis do you make that statement? CXL-memory is real,
and can be implemented on chiplets in an MCM with better
than multisocket latencies. Add Gen6 PCIe cut-through switching
and you get reasonable and useful latencies across a switched fabric.
Even a decade and a half ago, when we built a similar system using
QDR InfiniBand and a custom ASIC connected to HT or QPI,
we had internode latencies of less than 400ns r/t, which
was about double the Intel inter-socket latencies at the time.
You didn't find many buyers, did you?
On a little-endian architecture, it is always the lowest-significance
byte.
But on a big-endian architecture, for a register-memory-register move, it will be the highest-significance byte. But for the memory-register-memory case, it will be the lowest-significance byte.
In other words, even on big-endian architectures, registers are still interpreted as little-endian!
Isn’t that fun?
Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64 ...
On "personal" computers ... there's been work instead on compressing
64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome
versions) ...
As far as I can tell the 360/370 was consistently big-endian. The
convention for bit numbering in bytes and words was high to low but
since there weren't any instructions with bit numbers it didn't
matter.
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
I happened to be looking at Blaauw and Brooks "Computer Architecture" published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:
"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian. Having two active conventions is very painful. Several
recent Big Endian RISC computers, including the MIPS, the Motorola
88000, and the Intel i860
provide a data-movement operation that can
perform the Big Endian-Little Endian permutation. We predict that
Little Endian addressing will die out, just as decimal addressing
did."
MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a
long time after, machines had a “word length” which was the
smallest addressable unit of memory. This could have a range of sizes,
e.g.
12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray
CDC had a number of machines with 12-bit times k words. k element {1,2,3,5}
I’m sure there were also 24- and 48-bit machines. Note the
popularity of numbers with a range of different integer divisors,
including powers of both 2 and 3. The byte-addressable machines
chucked away everything other than powers of 2, which was a step
backwards in this respect. ;)
I would make the argument that 2^k was a step forward not backwards.
Perhaps another day...
I've seen the argument that e is the best base from an energy
standpoint, with 2 and 3 being the two closest integer values.
Working with trits, encoded as -/0/+, would have been feasible, but
binary provided much easier implementation. Base conversions are a bit messier when you use base3 as the machine representation, but you could
have used 5 trits (243) to handle the US ASCII character set.
In retrospect I'm glad they decided on binary!
Terje
On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro
Plus, if you load a single precision float into a floating-point
register, you are loading on the left side, not the right side, so the
inconsistency to which you're referring now impacts the little-endian machines. (Of course, though, that's no longer quite true with IEEE
754, since the exponent isn't the same size for all precisions, the
way it was with old-fashioned machines.)
John Savard
On Wed, 1 May 2024 20:37:11 -0000 (UTC), John Levine wrote:
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 1 May 2024 01:49:56 -0000 (UTC), John Levine wrote:
Until the PDP-11, all byte addressed machines were bigendian. Despite a
lot of looking, I have never found an explanation of why DEC made the
PDP-11 littlendian.
As I previously mentioned, little-endian just makes more sense.
I happened to be looking at Blaauw and Brooks "Computer Architecture"
published in 1997, which has several pages on bit and byte numbering.
After noting that the Big- and Little- names come from Gulliver's
Travels, they say on page 100:
"Unlike Swift's, the computer Endian controversy is not pointless. The
Little Endian design has many complications in use; we much prefer the
Big Endian."
It’s easy to illustrate why they’re wrong. First of all, a note that, even
on big-endian architectures, registers are still actually little-endian.
Which is yet another reason why big-endian can never be entirely
consistent.
Consider this pseudo-assembly-language sequence:
move.l a, b
move.b b, c
where “move” denotes either “load” or “store” as appropriate, the “.b”
suffix indicates a byte operation, and “.l” denotes a multibyte operation
(2, 4, 8 bytes or whatever, doesn’t matter as long as it’s more than 1).
As for the labels “a”, “b” and “c”, they can be reasonably interpreted
(to accommodate both RISC and non-RISC architectures) in two ways:
1) “a” and “c” are registers, “b” is a memory address; or
2) “b” is a register, while “a” and “c” are memory addresses.
Now the question is: which byte from “a” ends up at location “c”?
On a little-endian architecture, it is always the lowest-significance
byte.
But on a big-endian architecture, for a register-memory-register move, it
will be the highest-significance byte. But for the memory-register-memory
case, it will be the lowest-significance byte.
In other words, even on big-endian architectures, registers are still interpreted as little-endian!
Isn’t that fun?
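The two cases above can be traced with a small simulation. This is only a sketch under stated assumptions: registers are modelled as Python integers, memory as a bytearray, a byte store is assumed to write the low 8 bits of the register (as on typical machines), and the helper names store, load, reg_mem_reg and mem_reg_mem are invented for illustration.

```python
def store(mem, addr, value, size, byteorder):
    """Write the low `size` bytes of a register value to memory."""
    mask = (1 << (8 * size)) - 1
    mem[addr:addr + size] = (value & mask).to_bytes(size, byteorder)

def load(mem, addr, size, byteorder):
    """Read `size` bytes from memory, zero-extended into a register."""
    return int.from_bytes(mem[addr:addr + size], byteorder)

def reg_mem_reg(a, byteorder):
    """Case 1: a and c are registers, b is memory (store.l a,b ; load.b b,c)."""
    mem = bytearray(8)
    store(mem, 0, a, 4, byteorder)
    return load(mem, 0, 1, byteorder)

def mem_reg_mem(a, byteorder):
    """Case 2: b is a register, a and c are memory (load.l a,b ; store.b b,c)."""
    mem = bytearray(8)
    store(mem, 0, a, 4, byteorder)   # set up memory operand "a" holding the value
    b = load(mem, 0, 4, byteorder)
    store(mem, 4, b, 1, byteorder)   # memory operand "c" lives at offset 4
    return mem[4]

A = 0x11223344
for order in ("little", "big"):
    print(order, hex(reg_mem_reg(A, order)), hex(mem_reg_mem(A, order)))
```

With the value 0x11223344, the little-endian run yields the low byte 0x44 in both cases, while the big-endian run yields 0x11 for register-memory-register and 0x44 for memory-register-memory.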
... MIPS has left the general-purpose computing field.
The 68000 and 88000 architectures (which have instructions with bit
numbers) make the least significant bit have number 0, so they are
bitwise little-endian.
Lawrence D'Oliveiro wrote:
move.l a, b
move.b b, c
May I suggest that the above ILLUSTRATES why someone wants to use LD and
ST instructions rather than directionless MOV instructions.
On Wed, 1 May 2024 09:02:22 -0000 (UTC), Thomas Koenig wrote:
Hmm... what sort of terminals and character sets did people use on a
PDP-10? 7-bit ASCII? It (and the PDP-6) were released before the ASCII
standard came out.
A bit before my time, but I recall terms like “SIXBIT” encoding from looking at docs. Also this weird thing called “Radix-50” (the “50” actually being octal for 40 decimal) did persist into PDP-11 days, when I came along. It was a way of packing 3 characters (from a limited set, of course) into 2 bytes.
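As a rough sketch of how that packing works: with 40 symbols, three characters form a three-digit base-40 number, and 40^3 = 64000 fits in a 16-bit word. The character table below is the commonly documented PDP-11 ordering and should be treated as an assumption (other DEC machines used different orderings); the function names are made up for this example.

```python
# Assumed PDP-11 flavour of the Radix-50 character set (40 symbols).
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"

def rad50_pack(text):
    """Pack exactly three characters into one 16-bit word (base-40 digits)."""
    if len(text) != 3:
        raise ValueError("need exactly 3 characters")
    word = 0
    for ch in text:
        # index() raises ValueError for characters outside the set
        word = word * 40 + RAD50_CHARS.index(ch)
    return word

def rad50_unpack(word):
    """Recover the three characters from a packed 16-bit word."""
    chars = []
    for _ in range(3):
        word, idx = divmod(word, 40)
        chars.append(RAD50_CHARS[idx])
    return "".join(reversed(chars))

print(rad50_pack("DEC"), rad50_unpack(rad50_pack("DEC")))
```

The largest packed value is 39*1600 + 39*40 + 39 = 63999, which is why three characters fit in two bytes.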
On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:
... MIPS has left the general-purpose computing field.
Not so sure that it has. I think the Chinese “LoongArch” machines are a MIPS derivative.
Also, if you want to think of “MIPS” as a corporate entity, that would be the company currently known as “Imagination Technologies”. It is true they have given up on the MIPS architecture
Lawrence D'Oliveiro <[email protected]d> writes:
On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:
... MIPS has left the general-purpose computing field.
Not so sure that it has. I think the Chinese “LoongArch” machines are a MIPS derivative.
They may have started with MIPS, like several others, but now they are LoongArch. Looking in <https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:
|LoongArch bit designations are always little-endian.
Also, if you want to think of “MIPS” as a corporate entity, that would be the company currently known as “Imagination Technologies”. It is true they have given up on the MIPS
architecture
That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.
- anton
In the world of general-purpose microprocessors, DEC Alpha (until EV6)
was more like word-addressable than byte-addressable, although it is a
matter of point of view.
Why was byte addressing invented? I think it was for easy handling of
strings and other binary data.
But why stop there?
On Fri, 03 May 2024 08:51:30 GMT
[email protected] (Anton Ertl) wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:
... MIPS has left the general-purpose computing field.
Not so sure that it has. I think the Chinese “LoongArch”
machines are a MIPS derivative.
They may have started with MIPS, like several others, but now they are
LoongArch. Looking in
<https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:
|LoongArch bit designations are always little-endian.
Also, if you want to think of “MIPS” as a corporate entity, that
would be the company currently known as “Imagination
Technologies”. It is true they have given up on the MIPS
architecture
That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.
- anton
My impression was that embedded MIPS had two main players behind it:
- Microchip on the low end. Measured on Arm scale from about Cortex-M3
class to Cortex-M7 class.
- Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.
Microchip will continue to sell it for decade at least. Microchip does
not tend to talk openly about directions, however their behavior shows
that their direction right now is away from MIPS and currently toward
Arm.
Lawrence D'Oliveiro wrote:
move.l a, b
move.b b, c
On Thu, 2 May 2024 18:33:48 +0000, MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
move.l a, b
move.b b, c
May I suggest that the above ILLUSTRATES why someone wants to use LD and
ST instructions rather than directionless MOV instructions.
OK, use explicit load/store instead of generic move:
register-memory-register:
store.l a, b
load.b b, c
memory-register-memory:
load.l a, b
store.b b, c
Do you see why this makes absolutely no difference to what happens, as
per my description earlier?
By the way, in case it wasn’t clear: in my examples, the destination operand is always the last one.
To me, it just made sense that, since registers contain quantities, if
you load the value "8" into a register, it will contain the number 8.
So in a byte operation, the least significant bits of the register are
used.
Lawrence D'Oliveiro wrote:
Do you see why this makes absolutely no difference to what happens, as
per my description earlier?
Yes, because you explicitly left out the syntactic sugar.
Lawrence D'Oliveiro <[email protected]d> writes:
Also, if you want to think of “MIPS” as a corporate entity, that would be the company currently known as “Imagination Technologies”. It is true they have given up on the MIPS architecture
That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.
Lawrence D'Oliveiro <[email protected]d> writes:
But why stop there?
Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go beyond byte addressing. And looking at history, this seems to have been
the right choice.
Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.
That applied back in history, when we had fewer addressing bits to play
with, what about now?
Not a huge use-case in graphics, as noted, in most cases this is done
with 16 or 32 bit pixels; and bit-plane graphics are long since dead.
Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64 ...
On "personal" computers ... there's been work instead on compressing
64bit pointers to fit into 32bit "boxes" (IIUC it's used in some Chrome
versions) ...
Indeed, but I got the impression that there is a bit of a revival of
interest for pointer compression as the evidence seems to point to RAM
sizes not increasing very much any more on "end user devices".
See for instance https://v8.dev/blog/pointer-compression
Note also that this is targeted at JavaScript: dynamically typed
languages tend to suffer more from the 64bit bloat because of their
use of "boxing", meaning that pretty much everything (except usually
for strings and arrays of floats, which are special-cased) doubles
in size when the "box" size is changed from 32bit to 64bit.
On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
But why stop there?
Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.
That applied back in history, when we had fewer addressing bits to play
with, what about now?
Intel pushed this thing called the “x32” ABI into the Linux kernel (and possibly some other places) some years ago. This was using the AMD64 instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.
On Fri, 03 May 2024 08:51:30 GMT
[email protected] (Anton Ertl) wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
On Thu, 02 May 2024 17:37:47 GMT, Anton Ertl wrote:
... MIPS has left the general-purpose computing field.
Not so sure that it has. I think the Chinese “LoongArch”
machines are a MIPS derivative.
They may have started with MIPS, like several others, but now they are
LoongArch. Looking in
<https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#common-memory-access-instructions>,
I don't find anything about byte order, but it says:
|LoongArch bit designations are always little-endian.
Also, if you want to think of “MIPS” as a corporate entity, that
would be the company currently known as “Imagination
Technologies”. It is true they have given up on the MIPS
architecture
That's even worse for MIPS than what I know of, which was that it was
used for embedded uses.
- anton
My impression was that embedded MIPS had two main players behind it:
- Microchip on the low end. Measured on Arm scale from about Cortex-M3
class to Cortex-M7 class.
- Cavium on the high end. From Cortex-A55 to not quite Cortex-A73.
Cavium was absorbed by Marvell several years ago. Marvell, like
Microchip, does not tend to talk openly about directions. But when
Cavium was still independent, they did say that all new development
would be Arm.
As far as I am concerned, it's a pity, because I find MIPS's latest ISA (nanoMIPS) very interesting and probably quite practical.
On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
But why stop there?
Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.
That applied back in history, when we had fewer addressing bits to play
with, what about now?
Lawrence D'Oliveiro <[email protected]d> writes:
On Fri, 03 May 2024 15:13:30 GMT, Anton Ertl wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
But why stop there?
Others have provided good answers for that. Here's another one: Given
the requirements (based on the predecessors), there was no reason to go
beyond byte addressing. And looking at history, this seems to have been
the right choice.
That applied back in history, when we had fewer addressing bits to play
with, what about now?
Byte addressing still seems to be the right choice, for the same
reasons: We have lots of string data, and data that needs larger
units, but for data that fits in smaller units
a) either there is so little that spending a full byte on it is good
enough, or
b) the data is handled by so little code that the burden from the lack
of bit addressing is relatively low in the overall scheme of things, or
c) programs deal with arrays of these things in a SIMD way, and bit addressing provides little to no benefit.
Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects (and I worked at SGI for a number of
years in the R10k days).
On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:
Not a huge use-case in graphics, as noted, in most cases this is done
with 16 or 32 bit pixels; and bit-plane graphics are long since dead.
What happens if we go beyond 32 bits? For example, hardware might support
10 bits per pixel component.
According to Lawrence D'Oliveiro <[email protected]d>:
On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:
Not a huge use-case in graphics, as noted, in most cases this is
done with 16 or 32 bit pixels; and bit-plane graphics are long
since dead.
What happens if we go beyond 32 bits? For example, hardware might
support 10 bits per pixel component.
I dunno about you but I would align the elements on two-byte
boundaries and only store the high 10 of the 16 bits. It's not like
we're short of address space, and it's a lot quicker to multiply and
divide by 2 or 16 than by 10.
On Sat, 4 May 2024 19:31:54 -0000 (UTC)
John Levine <[email protected]> wrote:
According to Lawrence D'Oliveiro <[email protected]d>:
On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:
Not a huge use-case in graphics, as noted, in most cases this is
done with 16 or 32 bit pixels; and bit-plane graphics are long
since dead.
What happens if we go beyond 32 bits? For example, hardware might
support 10 bits per pixel component.
I dunno about you but I would align the elements on two-byte
boundaries and only store the high 10 of the 16 bits. It's not like
we're short of address space, and it's a lot quicker to multiply and
divide by 2 or 16 than by 10.
I agree about preferable solution and simplicity, but not about last
part.
Multiplication by 10 is only very slightly slower than multiplication
by 2 or 16 and the difference shouldn't be noticeable by comparison with
other things that we want to do with a pixel.
On x86/AMD64 - multiplication by 2 is, depending on situation, zero or
1 instruction, multiplication by 16 is 1 instruction (SHL) and multiplication by 10 is either 1 instruction (IMUL) or two simpler instructions (LEA+ADD).
On Arm and aarch64 it's approximately the same except that there are situations in which multiplication by 16 is zero instructions.
On 5/4/2024 3:18 AM, Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.
That was Donald Knuth's idea.
Storing meta data in actual pointers, aka aligned on a larger boundary,
is critical to many advanced lock/wait free algorithms as well. I
remember storing an actual reference count in pointers before for a
special type of counting.
On 5/4/2024 1:44 AM, Lawrence D'Oliveiro wrote:
On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:
Not a huge use-case in graphics, as noted, in most cases this is done
with 16 or 32 bit pixels; and bit-plane graphics are long since dead.
What happens if we go beyond 32 bits? For example, hardware might support
10 bits per pixel component.
A few typical formats:
RGB555: 0rrrrrgg-gggbbbbb
RGBA32: aaaaaaaa-rrrrrrrr-gggggggg-bbbbbbbb
RGB30 : 00rrrrrr-rrrrgggg-ggggggbb-bbbbbbbb (10-bit component RGB)
Though, for RGB30, there are variants with 10-bit linear RGB, and E5.F5 floating-point (sometimes used for HDR in OpenGL, as opposed to 4x
Binary16).
None of these would really benefit from bit addressable memory though.
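To illustrate that point, here is a minimal sketch of packing and unpacking the RGB30 layout listed above using only shifts and masks on a 32-bit word. The field positions follow the 00rrrrrr-rrrrgggg-ggggggbb-bbbbbbbb diagram; the function names are hypothetical.

```python
MASK10 = 0x3FF  # ten one-bits: the width of each RGB30 component

def pack_rgb30(r, g, b):
    """Pack three 10-bit components into one 32-bit word (top 2 bits zero)."""
    for c in (r, g, b):
        if not 0 <= c <= MASK10:
            raise ValueError("component out of 10-bit range")
    return (r << 20) | (g << 10) | b

def unpack_rgb30(pixel):
    """Split a 32-bit RGB30 word back into (r, g, b)."""
    return (pixel >> 20) & MASK10, (pixel >> 10) & MASK10, pixel & MASK10

pixel = pack_rgb30(1023, 512, 1)
print(hex(pixel), unpack_rgb30(pixel))
```

Each pixel is still fetched as one aligned 32-bit word, so the component boundaries never need their own addresses.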
Though, for LDR, going beyond 8-bit color depth doesn't gain much even
if the monitor supports it natively. And had noted before when using a
cheap LCD TV as a monitor, that it only seemed to be working at a
roughly 6-bit color depth (like, it was seemingly slightly better than RGB555, but not by much).
Now I am using a 4K OLED, which does support 10b/component, but it
doesn't make much difference in practice (and even if it did, most
software won't make much use of it).
But, say, 5 to 8 bits per component is at least noticeable (better
colors and less banding artifacts), 8 to 10 bits, not so much. Though,
with the main exception being HDR (but then, over the 0.5 to 1.0 range,
E5.F5 is only about as accurate as a 6-bit component).
According to Lawrence D'Oliveiro <[email protected]d>:
Consider this pseudo-assembly-language sequence:
move.l a, b
move.b b, c
...
Now the question is: which byte from “a” ends up at location “c”?
On S/360, which is the ur-big-endian machine, memory to memory moves are different from register loads and stores.
Lawrence D'Oliveiro wrote:
move.l a, b
move.b b, c
Here's a concrete example on S/360.
L R,100
STH R,200
That does a four byte load of location 100 into a register, and then a
two byte halfword store into 200. The load gets bytes 100 through 103
with 100 going into the high byte of the register. The store puts its
values into bytes 200 and 201. Since it's the low half of the register,
the new contents of 200 and 201 are the old contents of 102 and 103.
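The same L/STH sequence can be traced byte by byte. The sketch below models memory as a bytearray and the register as an integer; the sample byte values 0xAA..0xDD and the function name l_then_sth are arbitrary choices for illustration.

```python
def l_then_sth(mem):
    """Model  L R,100  followed by  STH R,200  on a big-endian machine."""
    reg = int.from_bytes(mem[100:104], "big")         # byte 100 lands in the high byte
    mem[200:202] = (reg & 0xFFFF).to_bytes(2, "big")  # store the low halfword
    return reg

mem = bytearray(256)
mem[100:104] = bytes([0xAA, 0xBB, 0xCC, 0xDD])
reg = l_then_sth(mem)
print(hex(reg), mem[200:202].hex())
```

After the two steps, locations 200 and 201 hold the bytes that were at 102 and 103, matching the description above.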
... it is easy to construct examples that appear to make your less
favored option look wrong ...
Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects ...
On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:
Personally I prefer ARM64 architecture over MIPS64 by a considerable margin, in almost all respects ...
I know MIPS (like SPARC) originated in that brief window when it was
thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.
So using the same register name to address a halfword gives you the low
half of the register, not the high half?
Whereas using the same memory address to address a halfword gives you the high half of the word at that location, not the low half?
On Fri, 3 May 2024 18:42:29 -0000 (UTC), John Levine wrote:
Lawrence D'Oliveiro wrote:
move.l a, b
move.b b, c
Here's a concrete example on S/360.
L R,100
STH R,200
That does a four byte load of location 100 into a register, and then a
two byte halfword store into 200. The load gets bytes 100 through 103
with 100 going into the high byte of the register. The store puts its
values into bytes 200 and 201. Since it's the low half of the register,
the new contents of 200 and 201 are the old contents of 102 and 103.
So using the same register name to address a halfword gives you the low
half of the register, not the high half?
Whereas using the same memory address to address a halfword gives you the
high half of the word at that location, not the low half?
Lawrence D'Oliveiro wrote:
So using the same register name to address a halfword gives you the low
half of the register, not the high half?
Whereas using the same memory address to address a halfword gives you the
high half of the word at that location, not the low half?
Concrete example::
On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.
Delay slot was deprecated back in MIPSr6, almost a decade ago.
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.
Michael S wrote:
On Sat, 4 May 2024 19:31:54 -0000 (UTC)
John Levine <[email protected]> wrote:
According to Lawrence D'Oliveiro <[email protected]d>:
On Fri, 3 May 2024 22:11:44 -0500, BGB wrote:
Not a huge use-case in graphics, as noted, in most cases this is
done with 16 or 32 bit pixels; and bit-plane graphics are long
since dead.
What happens if we go beyond 32 bits? For example, hardware might
support 10 bits per pixel component.
I dunno about you but I would align the elements on two-byte
boundaries and only store the high 10 of the 16 bits. It's not like
we're short of address space, and it's a lot quicker to multiply
and divide by 2 or 16 than by 10.
I agree about preferable solution and simplicity, but not about last
part.
Multiplication by 10 is only very slightly slower than
multiplication by 2 or 16 and the difference shouldn't be noticeable
by comparison with other things that we want to do with pixel.
Multiplication by 10 used to index an array is not slower than a multiplication
by 16 (when the ISA is not brain dead)::
LEA Ri,[Ri,Ri<<3]
LD Rd,[Rp,Ri]
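For reference, the shift-and-add decomposition behind this kind of indexing can be sketched as follows. It assumes the LEA+ADD style mentioned earlier in the thread: one base-plus-shifted-index step to form 5*i, then a doubling that a scaled index in the load could absorb. The function name times10 is hypothetical.

```python
def times10(i):
    """Compute 10*i using only shifts and adds."""
    t = i + (i << 2)   # 5*i: one base + (index << 2) address calculation
    return t << 1      # 10*i: a further shift by 1, e.g. a scaled index

print([times10(i) for i in range(5)])
```

A single base-plus-shifted-index step with a shift of 3 gives 9*i on its own, so the factor of 10 needs the shift-by-2 form followed by a doubling.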
On Sun, 5 May 2024 04:12:49 +0300, Michael S wrote:
On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.
Delay slot was deprecated back in MIPSr6, almost a decade ago.
But that would be a backward-incompatible change, would it not?
I'm sure there are other reasons why MIPS failed, despite having
cores that were comparable or better than ARM for small-systems
embedded devices. But Microchip has to take a large chunk of the
blame, IMHO.
On Sat, 04 May 2024 15:18:37 GMT
[email protected] (Scott Lurndal) wrote:
Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects (and I worked at SGI for a number of
years in the R10k days).
I also prefer ARM64 over MIPS64.
But nanoMIPS is not MIPS64, it's a new architecture that, at least
according to my measurements, is head and shoulders above any
competitors in terms of code density.
Scott Lurndal <[email protected]> schrieb:
d) all modern major architectures have instructions for bitfield manipulation (insert, extract) obviating any need for general
bit-level addressing.
RISC-V: Seems like it's an extension, for which only a draft is
available, so it is debatable if it has it or not.
POWER: Certainly, the rlwinm instruction.
AMD64: Sure, pdep and friends.
ARM: You certainly know by heart, I don't need to look.
Loongarch: Looking at the docs, it also has it (BSTRINS etc).
So, with the possible exception of RISC-V, I cannot see anything
to contradict you :-)
[email protected] (Anton Ertl) writes:
Byte addressing still seems to be the right choice, for the same
reasons: We have lots of string data, and data that needs larger
units, but for data that fits in smaller units
a) either there is so little that spending a full byte on it is good enough, or
b) the data is handled by so little code that the burden from the lack
of bit addressing is relatively low in the overall scheme of things, or
c) programs deal with arrays of these things in a SIMD way, and bit addressing provides little to no benefit.
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.
On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:
Personally I prefer ARM64 architecture over MIPS64 by a considerable
margin, in almost all respects ...
I know MIPS (like SPARC) originated in that brief window when it was
thought that delayed branches were a good idea, and so it remained
saddled with that (mis)feature for the rest of its life.
Delay slot was deprecated back in MIPSr6, almost a decade ago.
Michael S <[email protected]> writes:
On Sun, 5 May 2024 00:26:49 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Sat, 04 May 2024 15:18:37 GMT, Scott Lurndal wrote:
Personally I prefer ARM64 architecture over MIPS64 by a
considerable margin, in almost all respects ...
I know MIPS (like SPARC) originated in that brief window when it
was thought that delayed branches were a good idea, and so it
remained saddled with that (mis)feature for the rest of its life.
Delay slot was deprecated back in MIPSr6, almost a decade ago.
MIPS has a number of other misfeatures that made us disable dynamic superinstructions in Gforth and are a problem for other code-copying
code generators:
First and foremost, the architectural load delay slot (and, I think,
similar constraints wrt multiply and divide instructions and/or
MFHI/MFLO) mean that, unlike for every other architecture we have
looked at (including IA-64), you cannot just concatenate two pieces of
code which do work when they are connected with an indirect jump.
Another nasty property of MIPS is the way direct jumps and calls are
encoded: The target address is assembled from IIRC the top 6 bits of
the current PC and the rest of the address as absolute number in the instruction. This means that the call/jump would not show up as non-relocatable in Gforth's sanity tests, but if copied a piece of
code to a target area in a different 256MB-segment, it would fail.
- anton
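A small C sketch of the jump-target computation shows why such code is not position-independent. The field widths below follow the classic MIPS32 J-type format as I understand it (26-bit index, upper 4 bits taken from PC+4), which is consistent with the 256MB segment size mentioned above; the function name is made up for illustration.

```c
#include <stdint.h>

/* Classic MIPS32 J/JAL: the 26-bit instr_index field is shifted left by 2
   and combined with the upper 4 bits of the address of the instruction in
   the delay slot (PC + 4). */
static inline uint32_t mips_j_target(uint32_t pc, uint32_t insn)
{
    uint32_t index26 = insn & 0x03FFFFFFu;
    return ((pc + 4) & 0xF0000000u) | (index26 << 2);
}
```

The same instruction word lands on a different target once it is copied into another 256MB region, so a test that only moves code within one region would not notice the problem.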
David Ungar's PhD thesis was on SOAR (aka RISC-IV), which was either word-addressed or (like Alpha) word-accessed; in one of the last
chapters of his thesis he wrote that the most beneficial feature for performance that SOAR did not have was byte accesses, which would have reduced the number of cycles by IIRC 10% (to be balanced against
potential negative effects on the cycle-time); I found that quite
surprising for a thesis that mainly focussed on architectural features
for Smalltalk execution.
[email protected] (Scott Lurndal) writes:
[email protected] (Anton Ertl) writes:
Byte addressing still seems to be the right choice, for the same
reasons: We have lots of string data, and data that needs larger
units, but for data that fits in smaller units
a) either there is so little that spending a full byte on it is good enough, or
b) the data is handled by so little code that the burden from the lack
of bit addressing is relatively low in the overall scheme of things, or
c) programs deal with arrays of these things in a SIMD way, and bit addressing provides little to no benefit.
d) all modern major architectures have instructions for bitfield manipulation (insert, extract) obviating any need for general bit-level addressing.
Many of the word-addressed machines of yesteryear had instructions for character manipulation (insert, extract), but that did not obviate any
need for byte addressing.
On Sun, 05 May 2024 09:02:03 GMT
[email protected] (Anton Ertl) wrote:
First and foremost, the architectural load delay slot (and, I think,
similar constraints wrt multiply and divide instructions and/or
MFHI/MFLO) mean that, unlike for every other architecture we have
looked at (including IA-64), you cannot just concatenate two pieces of
code which do work when they are connected with an indirect jump.
Were not all delay slots except branch delay eliminated back in
revision of the ISA that corresponded to R4K ?
Another nasty property of MIPS is the way direct jumps and calls are
encoded: The target address is assembled from IIRC the top 6 bits of
the current PC and the rest of the address as absolute number in the
instruction. This means that the call/jump would not show up as
non-relocatable in Gforth's sanity tests, but if copied a piece of
code to a target area in a different 256MB-segment, it would fail.
- anton
Compact branches (Release 6) have conventional signed PC-relative
offsets - +-128 MB for unconditional jump/J&L, +-4MB for
equal/non-equal to zero and +-128 KB for the rest of conditional
branches.
Byte addressing was invented by IBM for the System/360, introduced in
1964. At least as I understand it. Up to that time, and indeed for a long time after, machines had a “word length” which was the smallest
addressable unit of memory. This could have a range of sizes, e.g.
12 -- DEC PDP-5/8
18 -- DEC PDP-1/4/7/9
36 -- DEC PDP-6/10
60 -- CDC 6000-series
64 -- Cray
I’m sure there were also 24- and 48-bit machines.
Big-endian
supposedly had the advantage of making memory dumps easier to read, but little-endian always made more logical sense.
Michael S wrote:
On Sat, 4 May 2024 21:08:19 +0000
[email protected] (MitchAlsup1) wrote:
Multiplication by 10 used to index an array is not slower than a
multiplication by 16 (when the ISA is not brain dead)::
LEA Ri,[Ri,Ri<<3]
LD Rd,[Rp,Ri]
Are you sure?
To me, it looks like 9 rather than 10.
LD Rd,[Rp,Ri<<2]
sorry.........
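For reference, one shift-and-add decomposition that does yield a factor of 10 can be written out in C. This is only a sketch of the arithmetic; the pairing with particular instructions in the comments is my assumption, not necessarily what the corrected sequence was meant to be.

```c
#include <stdint.h>

/* 10*i from two steps of the kind an LEA-style instruction and a
   scaled-index load can each perform. */
static inline uint64_t times10(uint64_t i)
{
    uint64_t five = i + (i << 2);   /* like LEA Rt,[Ri,Ri<<2]    -> 5*i  */
    return five << 1;               /* like an index scaled by 2 -> 10*i */
}

/* For comparison, the sequence as first posted: i + (i << 3) is 9*i. */
static inline uint64_t times9(uint64_t i)
{
    return i + (i << 3);
}
```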
On Sat, 4 May 2024 21:08:19 +0000
[email protected] (MitchAlsup1) wrote:
Multiplication by 10 used to index an array is not slower than a
multiplication by 16 (when the ISA is not brain dead)::
LEA Ri,[Ri,Ri<<3]
LD Rd,[Rp,Ri]
Are you sure?
To me, it looks like 9 rather than 10.
On Sat, 04 May 2024 09:11:27 GMT, Anton Ertl wrote:
On a byte-addressed machine you can use some lower bits "for free" if
the objects being addressed are always word-sized or larger. SPARC has specific instructions to make use of this.
On 5/5/2024 10:31 AM, Scott Lurndal wrote:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than shift+mask (and probably adding some mechanism to pattern-match it from
the AST or similar).
Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).
Then again, say:
BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
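The semantics given in the comment on that BITEXTR line can be transcribed directly into C. This is just a software model of the expression as written; the function name is made up.

```c
#include <stdint.h>

/* Rn = (Rn >> (Imm & 63)) & ((1 << ((Imm >> 6) & 15)) - 1)
   Low 6 bits of the immediate: shift amount (0..63).
   Next 4 bits of the immediate: field width (0..15). */
static inline uint64_t bitextr(uint64_t rn, unsigned imm10)
{
    unsigned shift = imm10 & 63;
    unsigned width = (imm10 >> 6) & 15;
    return (rn >> shift) & ((UINT64_C(1) << width) - 1);
}
```

With only 4 bits for the width, fields wider than 15 bits cannot be expressed, which is the price of fitting both parameters into a 10-bit immediate.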
Also, some things don't seem well balanced in terms of cost, so while it would be fairly cheap for a microcontroller, by the time one implements enough extensions to make it more useful for general purpose computing,
it will no longer be cheap (while at the same time shooting itself in
the foot in terms of performance for imposing some design constraints
that *only* make sense for small microcontrollers).
One big offender here, as I see it, is a few features in the Privileged
ISA spec, such as:
Separate register sets for each protection level/mode;
The comparably large number of CSRs;
Allowing operations on CSRs beyond just moving them to/from a GPR or
similar;
....
Things like the 'V' extension are also cause for concern.
The 'M' extension isn't ideal, but I made it work in a way that "isn't
too horribly expensive" (namely using a Shift-and-Add unit).
Also the cost-scaling of the Shift-Add unit is such that it could
potentially be extended to allow 128-bit integer multiply and divide,
but debatable (there are only a few edge cases where this would likely
be faster than "just do it in software").
Well, and my ALUX extension can make for faster 128-bit ALU operations,
but is debatable as the cost-delta mostly disappears in the noise
(mostly because 128-bit ALU ops are rare).
Conversely, the code when built for RV64G omits 128-bit types entirely,
On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 3:18 AM, Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.
That was Donald Knuth's idea.
Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms as
well. I remember storing an actual reference count in pointers before
for a special type of counting.
Even if one has multi-location ATOMICs ?? (as a single event ??)
This was a technique for storing data in a pointer. For instance, for strong atomic reference counting we need to update a pointer _and_ a reference count together atomically. This can easily be done with DWCAS, or double width compare and swap. So, on a 32 bit system we need 64 bit cas, for a 64
bit system we need 128 bit cas. However, sometimes we can pack the
reference count in the pointer value itself if it's aligned on a big
enough boundary. Then we can update the pointer and the reference count
using normal word based atomic RMW's.
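A minimal C sketch of the packing itself, assuming objects aligned on 64-byte boundaries so that the low 6 bits of every pointer are zero. The names (TAG_BITS, pack, unpack_ptr, unpack_count) are hypothetical, and no atomics are involved at this stage.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_BITS 6u                         /* assumes 64-byte alignment */
#define TAG_MASK ((uintptr_t)((1u << TAG_BITS) - 1u))

/* Combine an aligned pointer and a small count into one machine word. */
static inline uintptr_t pack(void *p, unsigned count)
{
    uintptr_t v = (uintptr_t)p;
    assert((v & TAG_MASK) == 0);            /* pointer must be aligned */
    assert(count <= TAG_MASK);              /* count must fit in spare bits */
    return v | count;
}

static inline void *unpack_ptr(uintptr_t v)
{
    return (void *)(v & ~TAG_MASK);
}

static inline unsigned unpack_count(uintptr_t v)
{
    return (unsigned)(v & TAG_MASK);
}
```

Because the result is a single word, ordinary word-sized atomic operations can then update the pointer and the count together.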
d) all modern major architectures have instructions for bitfield manipulation (insert, extract) obviating any need for general bit-level addressing.
Many of the word-addressed machines of yesteryear had instructions for character manipulation (insert, extract), but that did not obviate any
need for byte addressing.
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level addressing.
On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 3:18 AM, Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64 instruction set, but with only 32-bit pointers. This way, you got the benefit of the extra registers, without the overhead of the longer addresses.
That was Donald Knuth's idea.
Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms as
well. I remember storing an actual reference count in pointers
before for a special type of counting.
Even if one has multi-location ATOMICs ?? (as a single event ??)
This was a technique for storing data in a pointer. For instance,
strong atomic reference counting we need to update a pointer _and_ a
reference together atomically. This can easily be done with DWCAS, or
double width compare and swap. So, on a 32 bit system we need 64 bit
cas, for a 64 bit system we need 128 bit cas. However, sometimes we
can pack the reference count in the pointer value itself if its
aligned on a big enough boundary. Then we can update the pointer and
the reference count using normal word based atomic RMW's.
I understand why you had to pack the pointer and a chunk of data into a
single container.
What I don't understand is if you had easy access to multi-container
ATOMICs
the packing would be unnecessary--would it not ?? That is in one ATOMIC
event
you could update the pointer and the chunk of data independently and not
NEED
to store them in a single container.
Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
enough to increment the counter and load a pointer atomically all in one shot, loopless. Why would I need to use multi atomics with a possible
loop to do that?
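In C11 terms, the single fetch-and-add described here might look like the sketch below. It assumes the count lives in the low 6 bits of a 64-byte-aligned pointer; COUNT_MASK and acquire are made-up names, and overflow of the count into the pointer bits is deliberately not handled.

```c
#include <stdatomic.h>
#include <stdint.h>

#define COUNT_MASK ((uintptr_t)63)  /* low 6 bits hold the count (assumed) */

/* One atomic RMW, no loop: bump the embedded count and return the pointer
   part of the word as it was at that instant. On x86-64, compilers
   typically implement atomic_fetch_add with LOCK XADD. */
static inline void *acquire(atomic_uintptr_t *slot)
{
    uintptr_t old = atomic_fetch_add(slot, 1);
    return (void *)(old & ~COUNT_MASK);
}
```

A DWCAS-based version would need a retry loop around compare-and-swap, which is the contrast being drawn.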
Say, RISC-V:
Says yes to DIV and MOD;
Says yes to 4-register floating-point multiply-accumulate; Says no to
register-indexed Load/Store.
Me: This is not a good balance...
On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.
Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.
I would, personally, categorize RISC-V as a niche architecture at this
time.
If you have decimal arithmetic, there's a direct connection between how numbers are represented for reading and writing, and how they are
represented for internal arithmetic.
PEXT/PDEP has no immediate form, which makes it inconvenient for
'C'-style fixed bit fields.
Chris M. Thomasson wrote:
On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 3:18 AM, Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
Intel pushed this thing called the “x32” ABI into the Linux
kernel (and
possibly some other places) some years ago. This was using the
AMD64 instruction set, but with only 32-bit pointers. This way, you
got the benefit of the extra registers, without the overhead of the
longer addresses.
That was Donald Knuth's idea.
Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in pointers
before for a special type of counting.
Even if one has multi-location ATOMICs ?? (as a single event ??)
This was a technique for storing data in a pointer. For instance,
strong atomic reference counting we need to update a pointer _and_ a
reference together atomically. This can easily be done with DWCAS,
or double width compare and swap. So, on a 32 bit system we need 64
bit cas, for a 64 bit system we need 128 bit cas. However, sometimes
we can pack the reference count in the pointer value itself if its
aligned on a big enough boundary. Then we can update the pointer and
the reference count using normal word based atomic RMW's.
I understand why you had to pack the pointer and a chunk of data into a
single container.
What I don't understand is if you had easy access to multi-container
ATOMICs
the packing would be unnecessary--would it not ?? That is in one
ATOMIC event
you could update the pointer and the chunk of data independently and
not NEED
to store them in a single container.
Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
enough to increment the counter and load a pointer atomically all in
one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?
Postulate that you have a 64-bit pointer and a 8-bit chunk 72-total bits. Further postulate that you need to update both in a single non-blocking ATOMIC event. ...
According to Lawrence D'Oliveiro <[email protected]d>:
On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.
Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.
The only significant application for bit addressing that anyone has
mentioned is data compression. It's not something that computers spend
a great deal of time doing, and I see no reason to believe that bit addressing would make it much faster than the way it's done now with
shifting and masking.
It is easier to do addition/subtraction if you start from the least significant end and propagate the carry/borrow along.
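As a small illustration of that point, here is multi-byte addition over numbers stored least significant byte first, in C. The function name and byte-wise granularity are chosen for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* dst = a + b over n bytes stored least significant byte first.
   Returns the carry out of the most significant byte. */
static unsigned add_le(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                       size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {    /* ascending address == ascending weight */
        unsigned sum = (unsigned)a[i] + b[i] + carry;
        dst[i] = (uint8_t)sum;
        carry = sum >> 8;
    }
    return carry;
}
```

With little-endian storage the loop walks memory in increasing address order while the carry travels in the same direction; a big-endian layout would have to start at the highest address instead.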
BGB wrote:
On 5/5/2024 10:31 AM, Scott Lurndal wrote:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than
shift+mask (and probably adding some mechanism to pattern-match it
from the AST or similar).
But, but but but:: it IS shift and Mask !!
Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).
Then again, say:
BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
SL Rd,Rc,<width:offset>
Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever purpose is needed)
SR Rd,Rc,<width:offset>
Positions the value in a register (Rc) such that it fits the alignment of
a field.
INS Rd,Rc,Rf,<width:offset>
Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.
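For readers who want the generic operations spelled out, here is a plain C sketch of field extract and insert parameterised by width and offset. It is not claimed to match the exact semantics of the SL, SR and INS instructions above; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set; width 64 is handled separately
   because shifting a 64-bit value by 64 is undefined in C. */
static inline uint64_t field_mask(unsigned width)
{
    assert(width >= 1 && width <= 64);
    return width == 64 ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
}

/* Extract: pull <width:offset> out of container rc, right-justified. */
static inline uint64_t bf_extract(uint64_t rc, unsigned width, unsigned offset)
{
    assert(offset < 64 && width + offset <= 64);
    return (rc >> offset) & field_mask(width);
}

/* Insert: place the low `width` bits of rf at `offset` in rc and return
   the new container, leaving all other bits of rc unchanged. */
static inline uint64_t bf_insert(uint64_t rc, uint64_t rf,
                                 unsigned width, unsigned offset)
{
    assert(offset < 64 && width + offset <= 64);
    uint64_t m = field_mask(width) << offset;
    return (rc & ~m) | ((rf << offset) & m);
}
```

Insert takes noticeably more operations than extract in software, which is why a dedicated insert instruction tends to pay off more per use.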
Why do you think bit addressing will be
faster than shifting and masking? There's still going to be memory
underneath that's byte or word addressed so the shifting and masking
is going to happen anyway.
On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.
Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.
On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:
PEXT/PDEP has no immediate form, which makes it inconvenient for
'C'-style fixed bit fields.
Fixed bit fields are a limitation of the C language. Why should it
constrain the design of machine architectures?
According to Lawrence D'Oliveiro <[email protected]d>:
On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.
Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just for these bitfield instructions. It would save passing around a separate bit-offset field in arbitrary-bit-aligned pointers.
The only significant application for bit addressing that anyone has
mentioned is data compression. It's not something that computers spend
a great deal of time doing, and I see no reason to believe that bit addressing would make it much faster than the way it's done now with
shifting and masking.
If you do want to make compression faster, it'd make more sense to add instructions to do the compressing you care about, like DFLTCC in
S/360 and zSeries that speed up gzip, rather than adding three bits to
the other 99% of instructions that don't use bit fields.
If you think otherwise, what are the applications that will make all
those address bits useful, and why do you think bit addressing will be
faster than shifting and masking? There's still going to be memory
underneath that's byte or word addressed so the shifting and masking
is going to happen anyway.
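For context, this is roughly what "the way it's done now" looks like in a decompressor's inner loop: a software bit position plus shift and mask. The sketch below is in C; the structure, the LSB-first bit order and the padding requirement are assumptions for illustration, not taken from any particular codec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bitreader {
    const uint8_t *data;
    size_t bitpos;      /* a software "bit address": byte index * 8 + bit */
};

/* Read n bits (1..24), LSB-first. Always reads 4 bytes starting at the
   containing byte, so the caller must keep 3 readable padding bytes
   after the last real byte. */
static uint32_t read_bits(struct bitreader *br, unsigned n)
{
    assert(n >= 1 && n <= 24);
    const uint8_t *p = br->data + (br->bitpos >> 3);
    uint32_t word = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    uint32_t v = (word >> (br->bitpos & 7)) & ((UINT32_C(1) << n) - 1);
    br->bitpos += n;
    return v;
}
```

The bit offset is just the low three bits of an ordinary integer here, so the cost under debate is one shift and one AND per field.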
MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 3:18 AM, Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
Intel pushed this thing called the “x32” ABI into the Linux
kernel (and
possibly some other places) some years ago. This was using the
AMD64 instruction set, but with only 32-bit pointers. This way, you
got the benefit of the extra registers, without the overhead of the
longer addresses.
That was Donald Knuth's idea.
Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in pointers
before for a special type of counting.
Even if one has multi-location ATOMICs ?? (as a single event ??)
This was a technique for storing data in a pointer. For instance,
strong atomic reference counting we need to update a pointer _and_ a
reference together atomically. This can easily be done with DWCAS,
or double width compare and swap. So, on a 32 bit system we need 64
bit cas, for a 64 bit system we need 128 bit cas. However, sometimes
we can pack the reference count in the pointer value itself if its
aligned on a big enough boundary. Then we can update the pointer and
the reference count using normal word based atomic RMW's.
I understand why you had to pack the pointer and a chunk of data into a
single container.
What I don't understand is if you had easy access to multi-container
ATOMICs
the packing would be unnecessary--would it not ?? That is in one
ATOMIC event
you could update the pointer and the chunk of data independently and
not NEED
to store them in a single container.
Well, actually, a pessimistic word based fetch-and-add (LOCK XADD) is
enough to increment the counter and load a pointer atomically all in
one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?
Postulate that you have a 64-bit pointer and a 8-bit chunk 72-total bits.
Further postulate that you need to update both in a single non-blocking
ATOMIC event. ...
"Any programming problem can be solved with an additional layer of indirection", so in this case you create a handle to that 72-bit item,
and require all access to go via the handle?
The addendum to the rule above is of course ", except the problem of too
many layers of indirections". :-)
Terje
On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:
Say, RISC-V:
Says yes to DIV and MOD;
Says yes to 4-register floating-point multiply-accumulate; Says no to
register-indexed Load/Store.
Me: This is not a good balance...
Multiply-accumulate is at least as much about reducing rounding error as about speed.
On 5/5/2024 9:30 PM, Lawrence D'Oliveiro wrote:
On Sun, 5 May 2024 12:13:27 +0300, Michael S wrote:
PEXT/PDEP has no immediate form, which makes it inconvenient for
'C'-style fixed bit fields.
Fixed bit fields are a limitation of the C language. Why should it
constrain the design of machine architectures?
If it lacks an immediate form, one is harder pressed to beat out
shift+and or shift+shift on the performance front...
Though, to be useful, it needs an immediate large enough to express both
the shift amount and the width of the bitfield, and also a 3RI encoding.
Bitfield insert would be a little easier to get a performance advantage from (vs bitfield extract), since insertion is a more complex operation, but it is
also likely to require a more complex implementation and is also less
common than bitfield extract.
....
MitchAlsup1 wrote:
BGB wrote:
On 5/5/2024 10:31 AM, Scott Lurndal wrote:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than
shift+mask (and probably adding some mechanism to pattern-match it
from the AST or similar).
But, but but but:: it IS shift and Mask !!
Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).
Then again, say:
BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
SL Rd,Rc,<width:offset>
Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)
SR Rd,Rc,<width:offset>
Positions the value in a register (Rc) such that it fits the alignment of
a field.
INS Rd,Rc,Rf,<width:offset>
Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.
I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.
An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.
But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.
I added an optional second dest register field to my ISA to allow operations like wide bit field extract and insert across a pair of registers.
Also for wide arithmetic.
I was thinking of variable length LDV and STV load & store instructions
to work with variable length byte fields from 1 to 16 bytes.
LDV has two dst registers, a normal byte address specifier,
and a byte count from 1 to 16 to load. All high order bytes
not written by the LDV are zero filled.
The byte count can be an immediate or in a register.
STV does the same for stores with a pair of source value registers.
LDV and STV only touch the memory bytes they actually load or store.
So if the actual address + byte count does not touch a second 64-bit
memory word then they don't touch the next cache line or next page
in the case of potential page straddles.
This allows code to LDV up to 16 bytes into a register pair
extract and insert up to 64-bit fields in that register pair,
then STV only the bytes operated on,
with HW taking care of the special cases of straddle/not-straddle.
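A software model may help pin down what the LDV described above would deliver. The C sketch below is only an illustration: the struct, the function name, and the little-endian placement of bytes within the register pair are all assumptions, since the post does not specify byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct regpair { uint64_t lo, hi; };

/* Model of LDV: read exactly `count` bytes (1..16) starting at addr and
   zero-fill the remaining high-order bytes of the pair. */
static struct regpair ldv(const void *addr, unsigned count)
{
    assert(count >= 1 && count <= 16);
    uint8_t tmp[16] = {0};
    memcpy(tmp, addr, count);        /* touches only the requested bytes */
    struct regpair r = {0, 0};
    for (unsigned i = 0; i < 8; i++) {
        r.lo |= (uint64_t)tmp[i] << (8 * i);
        r.hi |= (uint64_t)tmp[i + 8] << (8 * i);
    }
    return r;
}
```

An STV model would be the mirror image, writing back only `count` bytes, so a field that does not straddle never touches the following word, cache line or page.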
On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
wrote:
Why do you think bit addressing will be
faster than shifting and masking? There's still going to be memory underneath that's byte or word addressed so the shifting and masking
is going to happen anyway.
Shifting, in a sense, yes. But not necessarily masking.
So just because a processor has a 64-bit bus to memory doesn't mean it
has to implement fetching a single byte from memory by doing a shift
and mask operation in a 64-bit register.
Instead, each byte of the bus
could have a direct wired path to the low 8-bits of the internal data
bus feeding the registers.
With bit addressing, of course, an implementation involving shifting
and masking is more likely, but even then, one omits fetching and
decoding the instructions to shift and mask, which is a speed gain
right there.
John Savard
Lawrence D'Oliveiro wrote:
On Sat, 04 May 2024 15:21:04 GMT, Scott Lurndal wrote:
d) all modern major architectures have instructions for bitfield
manipulation (insert, extract) obviating any need for general bit-level
addressing.
Even if those bottom three bits of the address must be zero in every other instruction but these, I thought it would be convenient to have them, just for these bitfield instructions. It would save passing around a separate
bit-offset field in arbitrary-bit-aligned pointers.
It's not just the bit address that you have to carry about
but also field width and type (zero/sign extend) on extract.
To my eye the cost of bit fields is primarily in dealing at run time
with the potential for straddles across memory locations and registers.
It makes for a lot of fiddly little IF code blocks which then have to be
put into general subroutines.
A second issue, when there are multiple bit fields, is
optimizing this so it only loads and stores with memory when it has to.
If r1 contains a low straddle part and r2 the high straddle part,
and we have already updated one bit field in those parts,
if we want to update a second bit field,
then we need to check if it is wholly contained within those
two registers, or one or both need to be spilled and reloaded.
A lot of this fiddly code looks like it would be best
implemented with predication.
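The run-time checks in question are small but unavoidable. A C sketch of two of them follows; the function names are hypothetical, and `bitpos` is an absolute bit position counted from the start of the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a field of `width` bits starting at absolute bit position
   `bitpos` crosses a 64-bit word boundary. */
static inline bool straddles64(uint64_t bitpos, unsigned width)
{
    assert(width >= 1 && width <= 64);
    return (bitpos & 63) + width > 64;
}

/* True if the field can be fetched with one 8-byte load from the byte
   address bitpos / 8: the residual shift (0..7) plus the width must fit
   in 64 bits. */
static inline bool fits_one_byte_addressed_load(uint64_t bitpos, unsigned width)
{
    return (bitpos & 7) + width <= 64;
}
```

The second predicate is why widths up to 56 bits never need a register pair: with a residual shift of at most 7, the total is at most 63 bits.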
MitchAlsup1 wrote:
BGB wrote:
On 5/5/2024 10:31 AM, Scott Lurndal wrote:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper than
shift+mask (and probably adding some mechanism to pattern-match it
from the AST or similar).
But, but but but:: it IS shift and Mask !!
Annoyingly, a good general case instruction could not be encoded in a
32-bit instruction form at this point (could either add a few special
cases as 32-bit ops, or use a 64-bit encoding; or do it as a 2RI op
rather than 3RI but this is lame...).
Then again, say:
BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
SL Rd,Rc,<width:offset>
Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)
SR Rd,Rc,<width:offset>
Positions the value in a register (Rc) such that it fits the alignment of
a field.
INS Rd,Rc,Rf,<width:offset>
Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.
I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.
An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.
But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.
I added an optional second dest register field to my ISA to allow operations
like wide bit field extract and insert across a pair of registers.
Also for wide arithmetic.
I was thinking of variable length LDV and STV load & store instructions
to work with variable length byte fields from 1 to 16 bytes.
LDV has two dst registers, a normal byte address specifier,
and a byte count from 1 to 16 to load. All high order bytes
not written by the LDV are zero filled.
The byte count can be an immediate or in a register.
STV does the same for stores with a pair of source value registers.
LDV and STV only touch the memory bytes they actually load or store.
So if the actual address + byte count does not touch a second 64-bit
memory word then they don't touch the next cache line or next page
in the case of potential page straddles.
This allows code to LDV up to 16 bytes into a register pair
extract and insert up to 64-bit fields in that register pair,
then STV only the bytes operated on,
with HW taking care of the special cases of straddle/not-straddle.
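A software model of the LDV/STV idea might look like the sketch below. The reg_pair type, the function names and the little-endian byte order are assumptions made for this example; the properties taken from the description are the 1 to 16 byte count, zero fill of the bytes not loaded, and touching only the bytes actually requested.

```c
#include <assert.h>
#include <stdint.h>

/* Models the two destination (or source) registers. */
typedef struct { uint64_t lo, hi; } reg_pair;

/* LDV model: load `count` bytes (1..16) starting at addr, zero-filling the
   rest of the pair. Only bytes addr[0..count-1] are ever read.
   Little-endian byte order is an assumption of this sketch. */
reg_pair ldv(const unsigned char *addr, unsigned count)
{
    reg_pair r = {0, 0};
    assert(count >= 1 && count <= 16);
    for (unsigned i = 0; i < count; i++) {
        if (i < 8)
            r.lo |= (uint64_t)addr[i] << (8 * i);
        else
            r.hi |= (uint64_t)addr[i] << (8 * (i - 8));
    }
    return r;
}

/* STV model: store the low `count` bytes of the pair; no other byte of
   memory is written. */
void stv(unsigned char *addr, reg_pair r, unsigned count)
{
    assert(count >= 1 && count <= 16);
    for (unsigned i = 0; i < count; i++) {
        uint64_t src = i < 8 ? r.lo : r.hi;
        addr[i] = (unsigned char)(src >> (8 * (i & 7)));
    }
}
```

The byte loops stand in for what the hardware would do in a single access; the point is only that the byte count, not the register width, decides which memory is touched.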
On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
wrote:
Why do you think bit addressing will be
faster than shifting and masking? ...
So just because a processor has a 64-bit bus to memory doesn't mean it
has to implement fetching a single byte from memory by doing a shift
and mask operation in a 64-bit register. Instead, each byte of the bus
could have a direct wired path to the low 8-bits of the internal data
bus feeding the registers.
On 5/5/2024 12:20 PM, John Savard wrote:
On Wed, 1 May 2024 00:09:28 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
Also, though, for ease of conversion, the order of BCD digits _should
be the same as the order of the characters of which these digits are
the last four bits_ in the representation of a decimal number as a
character string.
And that means big-endian.
If you have decimal arithmetic, there's a direct connection between
how numbers are represented for reading and writing, and how they are
represented for internal arithmetic.
Why would one burn 8 bits per BCD digit?...
RISC-V is quickly gaining ground in the microcontroller space,
displacing ARM (Cortex-M / Thumb2).
On 5/6/2024 2:11 PM, MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:
Say, RISC-V:
Says yes to DIV and MOD;
Says yes to 4-register floating-point multiply-accumulate;
Says no to register-indexed Load/Store.
Me: This is not a good balance...
Multiply-accumulate is at least as much about reducing rounding error
as about speed.
It is also an IEEE 754-2008+ requirement.
And... I have a version that just sort of works well enough to make
RV64G work, but is sort of a fail on the other fronts:
Using it is slower than separate ops;
It produces a double-rounded result.
Also, well, the FMUL isn't super accurate either.
FMUL is implemented in a way where it only generates the high-half of
the multiply, which makes the FPU cheaper, but:
Does not give strict 0.5ULP rounding.
Some combination of factors leads to the inability of Newton-Raphson to
fully converge, possibly either due to omitting the low-order multiplier results, or the carry-propagation limitation for rounding (if the
rounding would result in more than 8 bits of carry, it is skipped).
Not likely to do proper FMA, as this would make a Binary64 FPU too
expensive (and, doing Binary64 poorly is still preferable for most uses
to not doing it at all).
Granted, not entirely sure how the 8087 managed to do all the stuff that
it did. Since, it seems like an 80s ASIC would be more cramped than a
modern Artix-7.
....
According to John Savard <[email protected]d>:
On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]>
wrote:
Why do you think bit addressing will be
faster than shifting and masking? ...
So just because a processor has a 64-bit bus to memory doesn't mean it
has to implement fetching a single byte from memory by doing a shift
and mask operation in a 64-bit register. Instead, each byte of the bus
could have a direct wired path to the low 8-bits of the internal data
bus feeding the registers.
I was more thinking about storing bit fields, where you probably have
to fetch the whole word or cache line or whatever, shift the new field
into it, and then store it back. You already have to do something like
that for byte stores but bit addressing makes it 8 times as hairy.
On 5/6/2024 12:15 PM, MitchAlsup1 wrote:
Terje Mathisen wrote:
MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 3:18 AM, Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.
That was Donald Knuth's idea.
Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in
pointers before for a special type of counting.
Even if one has multi-location ATOMICs ?? (as a single event ??)
This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_
a reference together atomically. This can easily be done with
DWCAS, or double width compare and swap. So, on a 32 bit system we
need 64 bit cas, for a 64 bit system we need 128 bit cas. However,
sometimes we can pack the reference count in the pointer value
itself if it's aligned on a big enough boundary. Then we can update
the pointer and the reference count using normal word based atomic
RMW's.
I understand why you had to pack the pointer and a chunk of data into a
single container.
What I don't understand is if you had easy access to multi-container ATOMICs
the packing would be unnecessary--would it not ?? That is in one ATOMIC event
you could update the pointer and the chunk of data independently and not NEED
to store them in a single container.
Well, actually, a pessimistic word based fetch-and-add (LOCK XADD)
is enough to increment the counter and load a pointer atomically all
in one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?
Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits
in total.
Further postulate that you need to update both in a single
non-blocking ATOMIC event. ...
"Any programming problem can be solved with an additional layer of
indirection", so in this case you create a handle to that 72-bit item,
and require all access to go via the handle?
I am not trying to add an additional layer of indirection, I am trying
(unsuccessfully it appears) to get Chris to think outside of the one
container ATOMIC box.
LOCK XADD vs a CAS loop? I prefer the former.
The addendum to the rule above is of course ", except the problem of
too many layers of indirections". :-)
Terje
On 5/5/2024 9:30 PM, Lawrence D'Oliveiro wrote:
I think [RISC-V]’s already shipping in the billions of units per
year--enough to make it the world’s second-most-popular CPU
architecture, after ARM.
Yeah, seemingly right now, x86, ARM, and RISC-V are the top 3 ...
EricP wrote:
MitchAlsup1 wrote:
BGB wrote:
On 5/5/2024 10:31 AM, Scott Lurndal wrote:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Not as of yet in my case, but bitfield extract might happen eventually.
Issue is finding a way to pull it off that is useful and cheaper
than shift+mask (and probably adding some mechanism to pattern-match
it from the AST or similar).
But, but but but:: it IS shift and Mask !!
Annoyingly, a good general case instruction could not be encoded in
a 32-bit instruction form at this point (could either add a few
special cases as 32-bit ops, or use a 64-bit encoding; or do it as a
2RI op rather than 3RI but this is lame...).
Then again, say:
  BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
   SL   Rd,Rc,<width:offset>
Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)
   SR   Rd,Rc,<width:offset>
Positions the value in a register (Rc) such that it fits the alignment of
a field.
   INS  Rd,Rc,Rf,<width:offset>
Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.
I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.
An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.
But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.
x86 does not have bitfield insert/extract, but it does have SHRD/SHLD so
it is fairly easy to handle arbitrary length (<= 64 bits) and alignment:
; RSI -> target, RCX = # bits to extract, RBX = 64-field size (0..63)
mov rax,[rsi]
mov rdx,[rsi+8]
shrd rax,rdx,cl ; bit offset
and rax,bitmask[rbx*8] ; 64 mask entries.
The last instruction can also be replaced with
shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
shrx rax,rax,rbx
or the entire thing can be replaced with this one which calculates the
mask on the fly:
mov rax,[rsi]
mov rdx,[rsi+8]
or rdi,-1 ; Generate mask
shrd rax,rdx,cl ; bit offset
shrx rdi,rdi,rbx ; excess bits to mask away
and rax,rdi
All seems like about 3 clock cycles when hitting the cache.
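For readers who do not think in x86, roughly the same extraction can be written in portable C. This is only a sketch of the technique in the asm above (combine two adjacent 64-bit words, shift by the bit offset, mask to the field width); the function name and parameter names are invented here. Like the asm, it reads the second word unconditionally.

```c
#include <assert.h>
#include <stdint.h>

/* Extract a `width`-bit field (1..64) that starts `bit_offset` bits (0..63)
   into w[0] and may continue into w[1]. Both words are always read, exactly
   as the two MOVs in the asm version do. */
uint64_t extract_bits(const uint64_t w[2], unsigned bit_offset, unsigned width)
{
    assert(bit_offset < 64 && width >= 1 && width <= 64);
    uint64_t lo = w[0] >> bit_offset;
    /* A C shift by 64 is undefined, so offset 0 is handled separately;
       SHRD does this combine in one instruction. */
    uint64_t hi = bit_offset ? w[1] << (64 - bit_offset) : 0;
    uint64_t v  = lo | hi;
    return width == 64 ? v : v & ((UINT64_C(1) << width) - 1);
}
```

With w = {0xF000000000000000, 0xA}, extract_bits(w, 60, 8) returns 0xAF: four bits from the top of the first word and four from the bottom of the second.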
On 5/6/2024 2:11 PM, MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
On Sun, 5 May 2024 20:50:51 -0500, BGB wrote:
Say, RISC-V:
  Says yes to DIV and MOD;
  Says yes to 4-register floating-point multiply-accumulate;
  Says no to register-indexed Load/Store.
Me: This is not a good balance...
Multiply-accumulate is at least as much about reducing rounding error
as about speed.
It is also an IEEE 754-2008+ requirement.
And... I have a version that just sort of works well enough to make
RV64G work, but is sort of a fail on the other fronts:
Using it is slower than separate ops;
It produces a double-rounded result.
Also, well, the FMUL isn't super accurate either.
FMUL is implemented in a way where it only generates the high-half of
the multiply, which makes the FPU cheaper, but:
Does not give strict 0.5ULP rounding.
Some combination of factors leads to the inability of Newton-Raphson to fully converge, possibly either due to omitting the low-order multiplier results, or the carry-propagation limitation for rounding (if the
rounding would result in more than 8 bits of carry, it is skipped).
Not likely to do proper FMA, as this would make a Binary64 FPU too
expensive (and, doing Binary64 poorly is still preferable for most uses
to not doing it at all).
Granted, not entirely sure how the 8087 managed to do all the stuff that
it did. Since, it seems like an 80s ASIC would be more cramped than a
modern Artix-7.
On 5/5/2024 11:13 PM, Terje Mathisen wrote:
MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/5/2024 3:25 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 5:12 PM, MitchAlsup1 wrote:
Chris M. Thomasson wrote:
On 5/4/2024 3:18 AM, Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
Intel pushed this thing called the “x32” ABI into the Linux kernel (and
possibly some other places) some years ago. This was using the AMD64
instruction set, but with only 32-bit pointers. This way, you got the
benefit of the extra registers, without the overhead of the longer
addresses.
That was Donald Knuth's idea.
Storing meta data in actual pointers, aka aligned on a larger
boundary, is critical to many advanced lock/wait free algorithms
as well. I remember storing an actual reference count in
pointers before for a special type of counting.
Even if one has multi-location ATOMICs ?? (as a single event ??)
This was a technique for storing data in a pointer. For instance, for
strong atomic reference counting we need to update a pointer _and_
a reference together atomically. This can easily be done with
DWCAS, or double width compare and swap. So, on a 32 bit system we
need 64 bit cas, for a 64 bit system we need 128 bit cas. However,
sometimes we can pack the reference count in the pointer value
itself if it's aligned on a big enough boundary. Then we can update
the pointer and the reference count using normal word based atomic
RMW's.
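The packing trick can be sketched with C11 atomics as below. Everything named here (COUNT_BITS, packed_ref, pack, acquire and so on) is hypothetical and chosen for the example; it assumes objects aligned to 64 bytes so that the low 6 bits of the pointer are free to hold a small count. It shows only the "bump the count and fetch the pointer in one word-sized fetch-and-add" step, not a complete reference counting scheme.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define COUNT_BITS 6u                                  /* assumes 64-byte alignment */
#define COUNT_MASK (((uintptr_t)1 << COUNT_BITS) - 1)

typedef _Atomic uintptr_t packed_ref;

/* Combine an aligned pointer and a small count into one word. */
uintptr_t pack(void *p, unsigned count)
{
    uintptr_t bits = (uintptr_t)p;
    assert((bits & COUNT_MASK) == 0 && count <= COUNT_MASK);
    return bits | count;
}

void *packed_ptr(uintptr_t v)      { return (void *)(v & ~COUNT_MASK); }
unsigned packed_count(uintptr_t v) { return (unsigned)(v & COUNT_MASK); }

/* One loopless fetch-and-add both increments the count and returns the
   pointer that was current at that instant. */
void *acquire(packed_ref *ref)
{
    uintptr_t old = atomic_fetch_add(ref, 1);
    /* If the count overflowed it would carry into the pointer bits. */
    assert(packed_count(old) < COUNT_MASK);
    return packed_ptr(old);
}
```

On x86 a compiler can implement the atomic_fetch_add here with a single LOCK XADD, which is the loopless property being argued for; a CAS-based version would need a retry loop.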
I understand why you had to pack the pointer and a chunk of data into a
single container.
What I don't understand is if you had easy access to multi-container ATOMICs
the packing would be unnecessary--would it not ?? That is in one ATOMIC event
you could update the pointer and the chunk of data independently and not NEED
to store them in a single container.
Well, actually, a pessimistic word based fetch-and-add (LOCK XADD)
is enough to increment the counter and load a pointer atomically all
in one shot, loopless. Why would I need to use multi atomics with a
possible loop to do that?
Postulate that you have a 64-bit pointer and an 8-bit chunk, 72 bits
in total.
Further postulate that you need to update both in a single
non-blocking ATOMIC event. ...
"Any programming problem can be solved with an additional layer of
indirection", so in this case you create a handle to that 72-bit item,
and require all access to go via the handle?
The addendum to the rule above is of course ", except the problem of
too many layers of indirections". :-)
I remember looking at one of your atomic queues that only used LOCK XADD on x86. Why would you use CAS for that? I don't know. I see no need for multi-atomics for any of it....
John Levine wrote:
According to John Savard <[email protected]d>:
On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:
Why do you think bit addressing will be
faster than shifting and masking? ...
So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.
I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do
something like that for byte stores but bit addressing makes it 8
times as hairy.
Which is no different than ECC, BTW...
Could someone invent a bit field ISA that was as efficient as a byte accessible architecture:: probably.
Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!
According to Lawrence D'Oliveiro <[email protected]d>:
So using the same register name to address a halfword gives you the low
half of the register, not the high half?
Whereas using the same memory address to address a halfword gives you
the high half of the word at that location, not the low half?
... correct.
BGB wrote:
Granted, not entirely sure how the 8087 managed to do all the stuff
that it did. Since, it seems like an 80s ASIC would be more cramped
than a modern Artix-7.
Mostly it was simply slow.
Yes, but then again, I make no claim that it is IEEE-754 conformant,
merely that it uses the same formats, and is "good enough" for most
stuff one needs an FPU for.
Placing bit-field access INSIDE LDs and STs requires adding 2 stages of multiplexing in the LD/ST aligners (memory shifters). This has the
potential to slow the overall pipeline frequency--at which point you
have lost more than you can gain.
But we no longer have this problem.
I was thinking more in terms of popularity/mindshare ...
MitchAlsup1 wrote:
John Levine wrote:
According to John Savard <[email protected]d>:
On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:
Why do you think bit addressing will be
faster than shifting and masking? ...
So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.
I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do something like that for byte stores but bit addressing makes it 8
times as hairy.
Which is no different than ECC, BTW...
Could someone invent a bit field ISA that was as efficient as a byte accessible architecture:: probably.
Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!
Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing calculations,
etc. via extra instructions? Again, almost certainly. So you keep
the benefits of byte oriented loads most of the time, but have
"reasonable" access to bit fields when you need them, faster than
without the extra instructions. Hopefully the best of both worlds.
EricP wrote:
I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.
An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.
This is what CARRY is for--access to 128-bit in 2×64-bit out shifts.
CARRY can be used for extracts and for inserts.
But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.
I don't understand 56--56 takes just as many bits to encode as 63 ?!?
On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:
MitchAlsup1 wrote:
John Levine wrote:
According to John Savard <[email protected]d>:
On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine <[email protected]> wrote:
Why do you think bit addressing will be
faster than shifting and masking? ...
So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register.
Instead, each byte of the bus could have a direct wired path
to the low 8-bits of the internal data bus feeding the
registers.
I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift
the new field into it, and then store it back. You already have
to do something like that for byte stores but bit addressing
makes it 8 times as hairy.
Which is no different than ECC, BTW...
Could someone invent a bit field ISA that was as efficient as a
byte accessible architecture:: probably.
Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!
Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing calculations,
etc. via extra instructions? Again, almost certainly. So you keep
the benefits of byte oriented loads most of the time, but have
"reasonable" access to bit fields when you need them, faster than
without the extra instructions. Hopefully the best of both worlds.
When you load a bit field from memory, there is a very high chance that
you would want an adjacent bit field soon thereafter.
Think about it.
Terje Mathisen wrote:
EricP wrote:
MitchAlsup1 wrote:
BGB wrote:
On 5/5/2024 10:31 AM, Scott Lurndal wrote:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Not as of yet in my case, but bitfield extract might happen
eventually.
Issue is finding a way to pull it off that is useful and cheaper
than shift+mask (and probably adding some mechanism to
pattern-match it from the AST or similar).
But, but but but:: it IS shift and Mask !!
Annoyingly, a good general case instruction could not be encoded in
a 32-bit instruction form at this point (could either add a few
special cases as 32-bit ops, or use a 64-bit encoding; or do it as
a 2RI op rather than 3RI but this is lame...).
Then again, say:
  BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
   SL   Rd,Rc,<width:offset>
Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)
   SR   Rd,Rc,<width:offset>
Positions the value in a register (Rc) such that it fits the alignment of
a field.
   INS  Rd,Rc,Rf,<width:offset>
Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.
I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.
An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.
But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.
x86 does not have bitfield insert/extract, but it does have SHRD/SHLD
so it is fairly easy to handle arbitrary length (<= 64 bits) and
alignment:
; RSI -> target, RCX = # bits to extract, RBX = 64-field size (0..63)
mov rax,[rsi]
mov rdx,[rsi+8]
shrd rax,rdx,cl ; bit offset
and rax,bitmask[rbx*8] ; 64 mask entries.
The last instruction can also be replaced with
shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
shrx rax,rax,rbx
or the entire thing can be replaced with this one which calculates the
mask on the fly:
mov rax,[rsi]
mov rdx,[rsi+8]
or rdi,-1 ; Generate mask
shrd rax,rdx,cl ; bit offset
shrx rdi,rdi,rbx ; excess bits to mask away
and rax,rdi
All seems like about 3 clock cycles when hitting the cache.
I realized this morning that with arbitrary alignment and both signed
and unsigned extract, it is better to always shift up first to get rid
of the excess and then shift down to align. The main problem here is
that you now need different code for straddling and non-straddling items since shifts (including double-wide shifts) have to be less than 64
bits. :-(
This is not a problem for constant length and alignment since the
compiler can choose the correct pattern, but for codecs and compression
it does not work. (Or at least not for those 57..64 field lengths).
mov rax,[rsi]
shl rax,cl ; Excess bits above the field we need
shrx rax,rax,rbx ; rbx=64-field length
The last instruction would be
sarx rax,rax,rbx
if you wanted a signed bitfield.
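The shift-up-then-shift-down pattern for a field that lies entirely within one 64-bit word can be sketched in C as follows. The function names are made up for the example. The signed variant relies on two implementation-defined behaviours that mainstream compilers provide: conversion of an out-of-range uint64_t to int64_t wraps, and >> on a negative int64_t is an arithmetic shift (the C counterpart of SARX).

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned field: shift left to discard the excess bits above the field,
   then shift right by 64 - width to align it at bit 0. */
uint64_t extract_unsigned(uint64_t word, unsigned offset, unsigned width)
{
    assert(width >= 1 && width <= 64 && offset <= 64 - width);
    unsigned excess = 64 - offset - width;   /* bits above the field */
    return (word << excess) >> (64 - width);
}

/* Signed field: same shifts, but the right shift is arithmetic so the
   field's top bit is replicated (assumed compiler behaviour, see above). */
int64_t extract_signed(uint64_t word, unsigned offset, unsigned width)
{
    assert(width >= 1 && width <= 64 && offset <= 64 - width);
    unsigned excess = 64 - offset - width;
    return (int64_t)(word << excess) >> (64 - width);
}
```

For the 4-bit field at bits 4..7 of 0xF0, the unsigned version yields 15 and the signed version yields -1. Neither shift count can reach 64 here, which is exactly why the straddling case needs a different code path.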
No matter how you do it, it will become a bottleneck in any Huffman
token extractor or similar codes. In my own decoders I've tended to
grab a 32 (in the old days) or 64-bit chunk into a register and
immediately align it. Then I'll use a lookup table over the first N (typically 6-12) bits of this buffer value and let the table decide how
many bits to keep for the token, or in the case of longer tokens, select
a second-level table to lookup the remaining bits.
After decrementing the buffer bits remaining counter I'll branch out to refill it, but only if I have at least 32 or 48 free bits. This keeps
the number of refills fairly low.
Terje
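A stripped-down sketch of the decoder structure described in the post above (a 64-bit buffer kept aligned at the top, a first-level table indexed by the leading bits, and a refill that only runs once at least 32 bits are free) might look like this in C. It is not Terje's code: the type names, PEEK_BITS, the MSB-first bit order and the byte-at-a-time refill are all assumptions, and the second-level table for long tokens is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define PEEK_BITS 2u   /* first-level index width; the post suggests 6-12 */

typedef struct {
    const unsigned char *src;   /* input bytes */
    size_t len, pos;            /* input length and read position */
    uint64_t buf;               /* unread bits, next bit at bit 63 */
    unsigned bits;              /* number of valid bits in buf */
} bit_reader;

/* One table entry per PEEK_BITS pattern; length must be 1..PEEK_BITS. */
typedef struct { unsigned char symbol, length; } table_entry;

/* Top up the buffer a byte at a time, keeping it aligned at the top. */
static void refill(bit_reader *br)
{
    while (br->bits <= 56 && br->pos < br->len) {
        br->buf |= (uint64_t)br->src[br->pos++] << (56 - br->bits);
        br->bits += 8;
    }
}

/* Returns the next symbol, or -1 when the input is exhausted. */
int decode_symbol(bit_reader *br, const table_entry *table)
{
    if (64 - br->bits >= 32)        /* refill only with >= 32 free bits */
        refill(br);
    table_entry e = table[br->buf >> (64 - PEEK_BITS)];
    if (e.length == 0 || e.length > br->bits)
        return -1;
    br->buf <<= e.length;           /* consume the token, stay aligned */
    br->bits -= e.length;
    return e.symbol;
}
```

With the prefix code A=0, B=10, C=11 the table is {A,1}, {A,1}, {B,2}, {C,2}, and the single input byte 0x58 (binary 01011000) decodes as A, B, C, A, A, A before decode_symbol returns -1.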
Terje Mathisen wrote:
Terje Mathisen wrote:
EricP wrote:
MitchAlsup1 wrote:
BGB wrote:
On 5/5/2024 10:31 AM, Scott Lurndal wrote:
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Not as of yet in my case, but bitfield extract might happen
eventually.
Issue is finding a way to pull it off that is useful and cheaper
than shift+mask (and probably adding some mechanism to
pattern-match it from the AST or similar).
But, but but but:: it IS shift and Mask !!
Annoyingly, a good general case instruction could not be encoded in
a 32-bit instruction form at this point (could either add a few
special cases as 32-bit ops, or use a 64-bit encoding; or do it as
a 2RI op rather than 3RI but this is lame...).
Then again, say:
  BITEXTR Imm10, Rn //Rn=(Rn>>(Imm&63))&((1<<((Imm>>6)&15))-1)
Could potentially still be useful.
   SL   Rd,Rc,<width:offset>
Is a bit field extract instruction, it is also a smash instruction
(smashing a 64-bit value into a 8-bit or 12-bit or 47 bit for whatever
purpose is needed)
   SR   Rd,Rc,<width:offset>
Positions the value in a register (Rc) such that it fits the alignment of
a field.
   INS  Rd,Rc,Rf,<width:offset>
Inserts the field from Rf into its position <w:o> in Rc, inserts the
field and delivers the new container to Rd.
I think my instruction set could accomplish pretty much the same
efficiency for bit field operations as bit addresses but without
requiring direct bit addressing.
An issue that comes up is when the in-memory bit field is > 56 bits wide
as it might straddle two 64-bit words. If width is <= 56 bits then
a load from a byte address handles most of the shifting and the
rest can be handled within a single register.
But if the in-memory bit field is > 56 bits wide it may or may not straddle
a single 64-bit memory location, and require a pair of registers to be loaded.
x86 does not have bitfield insert/extract, but it does have SHRD/SHLD
so it is fairly easy to handle arbitrary length (<= 64 bits) and
alignment:
; RSI -> target, RCX = # bits to extract, RBX = 64-field size (0..63)
mov rax,[rsi]
mov rdx,[rsi+8]
This is what I wanted to avoid: blindly loading the next word,
as that could unnecessarily read a cache line or, worse,
trap on an access violation.
It's not that it is difficult to avoid, it just adds to the fiddliness
(like conditional branches around one or two instructions).
shrd rax,rdx,cl ; bit offset
and rax,bitmask[rbx*8] ; 64 mask entries.
The last instruction can also be replaced with
shlx rax,rax,rbx ; Nr of excess bits (64-field to extract)
shrx rax,rax,rbx
or the entire thing can be replaced with this one which calculates the
mask on the fly:
mov rax,[rsi]
mov rdx,[rsi+8]
or rdi,-1 ; Generate mask
shrd rax,rdx,cl ; bit offset
shrx rdi,rdi,rbx ; excess bits to mask away
and rax,rdi
All seems like about 3 clock cycles when hitting the cache.
I realized this morning that with arbitrary alignment and both signed
and unsigned extract, it is better to always shift up first to get rid
of the excess and then shift down to align. The main problem here is
that you now need different code for straddling and non-straddling items
since shifts (including double-wide shifts) have to be less than 64
bits. :-(
This is not a problem for constant length and alignment since the
compiler can choose the correct pattern, but for codecs and compression
it does not work. (Or at least not for those 57..64 field lengths).
mov rax,[rsi]
shl rax,cl ; Excess bits above the field we need
shrx rax,rax,rbx ; rbx=64-field length
The last instruction would be
sarx rax,rax,rbx
if you wanted a signed bitfield.
No matter how you do it, it will become a bottleneck in any Huffman
token extractor or similar codes. In my own decoders I've tended to
grab a 32 (in the old days) or 64-bit chunk into a register and
immediately align it. Then I'll use a lookup table over the first N
(typically 6-12) bits of this buffer value and let the table decide how
many bits to keep for the token, or in the case of longer tokens, select
a second-level table to lookup the remaining bits.
After decrementing the buffer bits remaining counter I'll branch out to
refill it, but only if I have at least 32 or 48 free bits. This keeps
the number of refills fairly low.
Terje
There seem to be two use cases, one for bit-wise load and store to
individual bit fields in compiled structures, the other is dynamic
bit fields in bit streams.
The first is bit sized elements in packed arrays, or packed structs,
or packed arrays of packed structs, or packed structs containing packed
array of bit fields, etc. These are supported by some languages
(Ada 83 had optional packed arrays and record structs).
For these the field start bit-offset is dynamic but the field size and
type are compile constants and so offer some potential for optimization
(but that could require inlining some of the access subroutines).
Such fields would tend to be both read and written in semi-random order
but with a high probability that nearby fields will also be accessed.
The other is bit fields in bit streams being processed sequentially from
lsb to msb order, e.g. for a codec. For these the field size and type are
dynamic but the token start offset can be arranged to be in bit[0].
If you know the bit-wise token always starts in bit[0] you don't need to
deal with field straddles, but must dynamically track where the last valid in-register bit is and detect when to load the next word and append to the register bit stream.
Bit stream processing would likely be either write-only encode or read-only decode, proceeding once serially either low to high or high to low order.
Both would simplify greatly with double-wide shifts of register pairs,
as well as double-wide bit field extract and insert.
On Sun, 05 May 2024 11:20:02 -0600, John Savard wrote:
If you have decimal arithmetic, there's a direct connection between how
numbers are represented for reading and writing, and how they are
represented for internal arithmetic.
It is easier to do addition/subtraction if you start from the least
significant end and propagate the carry/borrow along.
I believe those early IBM character machines worked exactly this way.
On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:
To me, it just made sense that, since registers contain quantities, if
you load the value "8" into a register, it will contain the number 8.
So in a byte operation, the least significant bits of the register are
used.
Of course that makes sense.
Now, think of main memory as just a holding place for stuff that won’t fit
in registers: why shouldn’t it make sense there as well?
I don't know about the PDP 10, but you are right that Univac 1108 had
both six bit (technically a sixth of a word) and nine bit (quarter
word) operations. The 6 bit was Fieldata and used for most older
software. The quarter words held an 8 bit ASCII character with one
"wasted" bit per byte. This became the dominant usage for
applications, but the Exec itself still uses a lot of Fieldata.
On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:
But we no longer have this problem.
But the other reasons for going little-endian still exist.
Character strings are in big-endian order.
Packed decimal strings should be in the same order as character strings,
so that the relationship between the two is simple and conversion
between the two is quick.
On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
On Mon, 06 May 2024 09:56:03 -0600, John Savard wrote:
But we no longer have this problem.
But the other reasons for going little-endian still exist.
And what other reasons might those be?
Yes, going little-endian made things simpler in computers with short
word lengths, since the most common operations started from the least
significant end.
But to do things in a big-endian way in such computers didn't require
trying to do addition backwards; you just had to jump ahead by the
length of the number, and then move backwards from the least
significant part. Often, though, even a trifling expense to do so
didn't make sense.
But when decimal and binary are both used in the same machine, then
big-endian is almost unavoidable - especially when the same
architecture is to be used in a wide range of implementations, some
big, and some small. Then, compatibility forces the use of a small
number of extra gates here and there.
John Savard
Character strings are in big-endian order.
Not in Hebrew or Chinese !!
On Fri, 3 May 2024 22:26:04 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
On Thu, 02 May 2024 08:58:23 -0600, John Savard wrote:
To me, it just made sense that, since registers contain quantities, if
you load the value "8" into a register, it will contain the number 8.
So in a byte operation, the least significant bits of the register are
used.
Of course that makes sense.
Now, think of main memory as just a holding place for stuff that won’t fit
in registers: why shouldn’t it make sense there as well?
Because that isn't what main memory is. Even if one could think of
cache memory that way, main memory also interacts with input-output
devices.
Although that isn't really the problem.
After all, computational variables can be stored in memory in any
format. The only things in memory that are constrained in format are character strings, because they get printed on paper for people to
see.
And, as I noted, that is the root of the problem.
Character strings are in big-endian order.
Packed decimal strings should be in the same order as character
strings, so that the relationship between the two is simple and
conversion between the two is quick.
Packed decimal strings of numbers should be in the same order as
binary numbers, because they can potentially share the same arithmetic
unit in some implementations.
John Savard
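A small Python sketch of why the character/packed-decimal relationship is
simple when both use the same digit order: each packed byte is just two
adjacent digit characters with their zone bits stripped. The helper names
pack_digits and unpack_digits are hypothetical, the characters are assumed
to be ASCII digits, and the sign nibble of real packed-decimal formats is
left out.

```python
def pack_digits(text):
    """Pack a string of ASCII digits two per byte, first digit in the high nibble."""
    if len(text) % 2:
        text = "0" + text               # pad to an even number of digits
    return bytes(
        (int(text[k]) << 4) | int(text[k + 1])
        for k in range(0, len(text), 2)
    )


def unpack_digits(packed):
    """Expand packed digits back to ASCII by OR-ing 0x30 onto each nibble."""
    return "".join(
        chr(0x30 | (byte >> 4)) + chr(0x30 | (byte & 0x0F))
        for byte in packed
    )


packed = pack_digits("1234")
text = unpack_digits(packed)
```

Both directions walk the two strings in the same address order, with no
reversal step.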
On Tue, 07 May 2024 19:23:59 -0600, John Savard wrote:
On Tue, 7 May 2024 06:49:48 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
But the other reasons for going little-endian still exist.
And what other reasons might those be?
Consider how you specify these 3 conventions:
* numbering of bits within a byte
* numbering of bytes within a multibyte quantity
* the place values (powers of 2) of bits in an integer
The only way to get all 3 consistent is with a little-endian architecture.
Every big-endian architecture has inconsistencies between these somewhere
or another.
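The consistency claim can be checked with a few lines of Python. Number
the bits within a byte from the least significant end, so bit n of an
integer has place value 2**n. With little-endian byte order that bit lives
in byte n // 8 at bit position n % 8; with big-endian byte order the byte
index has to be reversed. The function names and the 4-byte size are
assumptions for this sketch only.

```python
def bit_via_little_endian(value, n, size=4):
    """Bit n of `value`, fetched from its little-endian byte image."""
    data = value.to_bytes(size, "little")
    return (data[n // 8] >> (n % 8)) & 1


def bit_via_big_endian(value, n, size=4):
    """Same bit from the big-endian image: the byte index must be flipped."""
    data = value.to_bytes(size, "big")
    return (data[size - 1 - n // 8] >> (n % 8)) & 1


value = 0x01020304
little_image = value.to_bytes(4, "little")
big_image = value.to_bytes(4, "big")
```

The big-endian version can instead be made uniform by numbering bits from
the most significant end, but then bit numbers stop matching powers of 2,
which is the inconsistency being described.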
Lawrence D'Oliveiro wrote:
Consider how you specify these 3 conventions:
* numbering of bits within a byte
Most significant is bit[0] least significant is bit[2^k-1]
* numbering of bytes within a multibyte quantity
Most significant byte[0] least significant byte[2^k-1]
* the place values (powers of 2) of bits in an integer
Carry from digit to digit is the same direction in binary and decimal.
This argues sameness not Big-Endian.
It doesn't make sense to say that character strings are big- or
little-endian.
But I fail to see why the last one needs to be consistent, except as an
aesthetic preference.
But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.
According to MitchAlsup1 <[email protected]>:
Character strings are in big-endian order.
Not in Hebrew or Chinese !!
It doesn't make sense to say that character strings are big- or
little-endian.
They're stored in the order you would read them, and there's typically
metadata about how to display them. In Unicode, Hebrew and Arabic code
points display right to left, Chinese displays however they want,
typically left to right in rows these days.
Most significant priority is [0] least significant priority is [2^k-1]
Apparently even LE machines get this one wrong, too.
Michael S wrote:
On Tue, 7 May 2024 06:35:53 -0000 (UTC)
"Stephen Fuld" <[email protected]d> wrote:
MitchAlsup1 wrote:
John Levine wrote:
According to John Savard <[email protected]d>:
On Mon, 6 May 2024 02:54:11 -0000 (UTC), John Levine
<[email protected]> wrote:
Why do you think bit addressing will be
faster than shifting and masking? ...
So just because a processor has a 64-bit bus to memory doesn't
mean it has to implement fetching a single byte from memory by
doing a shift and mask operation in a 64-bit register. Instead,
each byte of the bus could have a direct wired path to the low
8-bits of the internal data bus feeding the registers.
I was more thinking about storing bit fields, where you probably
have to fetch the whole word or cache line or whatever, shift the
new field into it, and then store it back. You already have to do
something like that for byte stores but bit addressing makes it 8
times as hairy.
Which is no different than ECC, BTW...
Could someone invent a bit field ISA that was as efficient as a
byte accessible architecture:: probably.
Could this bit accessible architecture outperform a byte ISA on
typical codes:: doubtful. Two reasons:: 1) more delay in the LD/ST
pipeline, 2) most programs use as little bit-fielding as possible
(not as much as practical) !!!
Some time ago, I proposed an additional instruction, a load variant
that allowed you to address bit fields. Would it be slower than a
"normal" byte oriented load? Almost certainly. But would it be
faster than doing all the shifts, masks, word crossing
calculations, etc. via extra instructions? Again, almost
certainly. So you keep the benefits of byte oriented loads most
of the time, but have "reasonable" access to bit fields when you
need them, faster than without the extra instructions. Hopefully
the best of both worlds.
When you load bit field from memory, there is very high chance that
you would want adjacent bit field soon thereafter.
Think about it.
Which means that you would like to have a dedicated streaming buffer
cache for the EXTR operation?
Terje
On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
When you load bit field from memory, there is very high chance that
you would want adjacent bit field soon thereafter.
Think about it.
Which means that you would like to have a dedicated streaming buffer
cache for the EXTR operation?
Terje
That's not what I wanted to hint to Stephen.
I wanted to hint that in typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing practice.
Michael S wrote:
On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
When you load bit field from memory, there is very high chance
that you would want adjacent bit field soon thereafter.
Think about it.
Which means that you would like to have a dedicated streaming
buffer cache for the EXTR operation?
Terje
That's not what I wanted to hint to Stephen.
I wanted to hint that in the typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing
practice.
Yeah, as I wrote earlier, in my own code I tend to use a register as
my buffer and keep it bottom-aligned at all times, i.e. end each
extraction by a SHR buffer, token_len
This means that most of the time, the buffer reg already contains all
the bits of the next token.
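For readers who have not seen the bottom-aligned buffer technique, here is
a rough Python sketch of it. The class name BitReader and its fields are
invented for this illustration. It refills a byte at a time and relies on
Python's unbounded integers, whereas real code would hold the buffer in a
64-bit register and refill in larger units.

```python
class BitReader:
    """LSB-first bit stream reader with a bottom-aligned buffer."""

    def __init__(self, data):
        self.data = data
        self.pos = 0        # index of the next byte to load
        self.buffer = 0     # next token always starts at bit 0
        self.valid = 0      # number of valid bits currently in buffer

    def read(self, token_len):
        # Most of the time the buffer already holds the whole token,
        # so this loop body is skipped.
        while self.valid < token_len and self.pos < len(self.data):
            self.buffer |= self.data[self.pos] << self.valid
            self.pos += 1
            self.valid += 8
        if self.valid < token_len:
            raise EOFError("bit stream exhausted")
        token = self.buffer & ((1 << token_len) - 1)
        self.buffer >>= token_len   # the "SHR buffer, token_len" step
        self.valid -= token_len
        return token


reader = BitReader(bytes([0b10110100, 0b00000011]))
tokens = [reader.read(3), reader.read(5), reader.read(4)]
```

The refill test at the top of read() is the cost being argued about in
this thread: it is cheap when the buffer is already full enough, but it
cannot be dropped.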
Michael S wrote:
On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
When you load bit field from memory, there is very high chance
that you would want adjacent bit field soon thereafter.
Think about it.
Which means that you would like to have a dedicated streaming
buffer cache for the EXTR operation?
Terje
That's not what I wanted to hint to Stephen.
I wanted to hint that in the typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing practice.
Perhaps. But if you aren't absolutely sure that the next field doesn't
cross a 64-bit boundary, then you have to test for that, and if it does,
add more instructions to handle it. If that happens, your advantage is
lost. Even the test and conditional jump/predication when you don't
cross the boundary makes it pretty close.
And, as I mentioned in a previous post, I would expect higher end
implementations to make use of some sort of stream buffer, as Terje
suggests.
Terje Mathisen wrote:
Michael S wrote:
On Wed, 8 May 2024 14:25:15 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
When you load bit field from memory, there is very high chance
that you would want adjacent bit field soon thereafter.
Think about it.
Which means that you would like to have a dedicated streaming
buffer cache for the EXTR operation?
Terje
That's not what I wanted to hint to Stephen.
I wanted to hint that in the typical situation, i.e. when one 32-bit or
64-bit load serves several bit field extractions, his additional
instruction would be slower rather than faster than existing
practice.
Yeah, as I wrote earlier, in my own code I tend to use a register as
my buffer and keep it bottom-aligned at all times, i.e. end each
extraction by a SHR buffer, token_len
This means that most of the time, the buffer reg already contains all
the bits of the next token.
The key word being "most". If it isn't "always", you have to test for
the condition. That test and the conditional branch reduce, and
perhaps eliminate, the advantage.
Though, I had noticed recently that a lot of typos seem to escape my
notice on my end. This is possibly a downside of using a 9pt font on a
4K monitor (22 inch) with 100% UI zoom (*). Can fit more stuff on
screen, but potentially not the most easily readable experience.
On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:
It doesn't make sense to say that character strings are big- or
little-endian.
Yes it does, for just about any encoding other than UTF-8. Thus, you have
UTF16BE, and UTF16LE.
On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:
But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.
No, because character string conversion is subject to localization issues.
On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:
But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.
No, because character string conversion is subject to localization issues.
I agree that little-endian computers make sense for people whose
native language is Hebrew or Arabic.
On Tue, 07 May 2024 19:16:40 -0600, John Savard wrote:
Character strings are in big-endian order.
Better thought of as “character strings are stored so ascending addresses correspond to logical reading order”. Note I didn’t say “display order”,
since that can be quite different.
Packed decimal strings should be in the same order as character strings,
so that the relationship between the two is simple and conversion
between the two is quick.
Now here you are getting into cultural issues. For example, while both
Arabic and Hebrew use decimal numbers, they write the digits in opposite
order.
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:
It doesn't make sense to say that character strings are big- or little-
endian.
Yes it does, for just about any encoding other than UTF-8. Thus, you have
UTF16BE, and UTF16LE.
Not really, those are byte orders within a character, not order of
characters. If you look at surrogates, you can see that UTF-16 is
big-endian. First there's the high surrogate, then the low one.
There's a reason that every encoding other than UTF-8 is dead. Who needs
the grief?
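The distinction between byte order within a code unit and the order of
code units can be seen with Python's standard codecs. In this sketch the
sample character U+1F600 and the helper name code_units are arbitrary
choices for illustration.

```python
def code_units(data, byteorder):
    """Split UTF-16 bytes into 16-bit code units using the given byte order."""
    return [
        int.from_bytes(data[k:k + 2], byteorder)
        for k in range(0, len(data), 2)
    ]


text = "\U0001F600"                     # one code point outside the BMP
be_bytes = text.encode("utf-16-be")
le_bytes = text.encode("utf-16-le")

be_units = code_units(be_bytes, "big")
le_units = code_units(le_bytes, "little")
```

The bytes inside each 16-bit unit swap between the two encodings, but in
both the high surrogate (0xD83D here) comes before the low surrogate
(0xDE00).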
On Wed, 08 May 2024 20:50:53 -0600, John Savard
<[email protected]d> wrote:
On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:
But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.
No, because character string conversion is subject to localization issues.
I agree that little-endian computers make sense for people whose
native language is Hebrew or Arabic.
Still, I get your point. My thinking is stuck in the days of card
readers and line printers. Yes, one called a subroutine to print
numbers, but what it did was convert them to the format used in North
America and the United Kingdom, in accordance with any parameters in
the call that were hard-coded into the program.
The idea of programs as applications, to be distributed far and wide,
to people with computers of their own, where the operating system
could impose localization options on the display of numbers that
programs would usually allow themselves to accept
UTF-32 is fine for internal use, however - using whatever endianness
your processor prefers. The trick is never to let it leave the one
computer in any encoding other than UTF-8.
John Savard wrote:
On Wed, 1 May 2024 23:17:06 -0000 (UTC), Lawrence D'Oliveiro
Plus, if you load a single precision float into a floating-point
register, you are loading on the left side, not the right side, so the
In My 66000, floats are stored on the right side of the register
{mostly because I do not have FP LD/STs.}
David Brown <[email protected]> writes:
UTF-32 is fine for internal use, however - using whatever endianness
your processor prefers. The trick is never to let it leave the one
computer in any encoding other than UTF-8.
An unnecessary complication.
1) I only came up with the following use cases where you need to deal
with individual non-ASCII characters: Palindrome checkers and anagram
programs; I remember somebody mentioning a third use (which I promptly
forgot), but anyway, there are few cases.
2) But even for those few cases, UTF-32 is not good enough, because a
code point is not a character.
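A short Python illustration of "a code point is not a character", using
only the standard unicodedata module. The example string is an arbitrary
choice: an "é" written as a base letter plus a combining accent.

```python
import unicodedata

composed = "\u00e9"                                  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' followed by U+0301

code_points = len(decomposed)                        # what a UTF-32 length counts
utf32_bytes = len(decomposed.encode("utf-32-le"))

# Reversing by code point, as a naive palindrome checker on UTF-32 would,
# moves the accent in front of the letter it belonged to.
naive_reverse = decomposed[::-1]
```

So even with fixed-width code units, one user-visible character can take
two (or more) array elements, and indexing or reversing by element can
split it.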
On 10/05/2024 18:20, Anton Ertl wrote:
1) I only came up with the following use cases where you need to deal
with individual non-ASCII characters: Palindrome checkers and anagram
programs; I remember somebody mentioning a third use (which I promptly
forgot), but anyway, there are few cases.
2) But even for those few cases, UTF-32 is not good enough, because a
code point is not a character.
I agree that it is usually unnecessary to convert to UTF-32 - I am
merely saying that /if/ you feel you want to expand the code points,
then UTF-32 is fine for the purpose and you should not have to worry
about endianness because you should not be moving it off your computer,
thus native endianness is all you need.
People sometimes say they want to expand to code points to be able to
see the length of the string in characters, or to index characters, or
for easier splicing or joining strings. I don't think these are
particularly useful in practice, but UTF-32 is fine for those that want it.
David Brown <[email protected]> writes:
On 10/05/2024 18:20, Anton Ertl wrote:
1) I only came up with the following use cases where you need to deal
with individual non-ASCII characters: Palindrome checkers and anagram
programs; I remember somebody mentioning a third use (which I promptly
forgot), but anyway, there are few cases.
2) But even for those few cases, UTF-32 is not good enough, because a
code point is not a character.
I agree that it is usually unnecessary to convert to UTF-32 - I am
merely saying that /if/ you feel you want to expand the code points,
then UTF-32 is fine for the purpose and you should not have to worry
about endianness because you should not be moving it off your computer,
thus native endianness is all you need.
Yes. The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by using
UTF-32 (or UTF-16); it isn't, and one cannot. If you stick with UTF-8
and use byte lengths and byte indexes, you can do almost everything as
well or better (with less complication and more efficiently) as by
converting to UTF-32 and back.
People sometimes say they want to expand to code points to be able to
see the length of the string in characters, or to index characters, or
for easier splicing or joining strings. I don't think these are
particularly useful in practice, but UTF-32 is fine for those that want it.
Looking up "splicing strings", I find that this is a term used in
connection with Python for specifying substrings. Python3 is a
language that lives the codepoint mistake to the extreme (and from
what I read, this was one of the major pain points in the
Python2->Python3 transition), but anyway, with UTF-8 one way to
represent a substring is to use the start index and length in bytes
(aka code units) rather than code points.
Looking up "joining strings" brings up the Python join() method, which
is a variant of string concatenation. There is certainly no need to
convert UTF-8 to UTF-32 and back for concatenating strings; just
concatenate the UTF-8 strings.
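A minimal Python sketch of working with byte indexes and byte lengths on
UTF-8 data, without decoding to code points. The sample strings are
arbitrary. Searching for a valid UTF-8 needle in valid UTF-8 text can only
match at character boundaries, so the resulting slice is itself valid
UTF-8.

```python
text = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

start = text.find(needle)           # byte index of the match
length = len(needle)                # byte length, not code point count
substring = text[start:start + length]

# Concatenation is plain byte concatenation.
joined = substring + " au lait".encode("utf-8")
```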
People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character.
But it is not
uncommon to think it is, and if you can make some simplifications to the
text you support (specifically, limiting your code to single code point
characters) then UTF-32 can be helpful.
The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by using
UTF-32 (or UTF-16); it isn't, and one cannot.
If you stick with UTF-8
and use byte lengths and byte indexes, you can do almost everything as
well or better (with less complication and more efficiently) as by
converting to UTF-32 and back.
Anton Ertl <[email protected]> schrieb:
The point I wanted to make is that there is the frequent
misconception that dealing with individual arbitrary characters is
something that is relatively common, and that one can do that by
using UTF-32 (or UTF-16); it isn't, and one cannot.
Do you really mean one cannot change an individual character
using UTF-32? I assume you mean "there is no need to do it"..
On 5/6/24 3:13 PM, MitchAlsup1 wrote:
Placing bit-field access INSIDE LDs and STs requires adding 2 stages
of multiplexing in the LD/ST aligners (memory shifters). This has the
potential to slow the overall pipeline frequency--at which point you
have lost more than you can gain.
The extra shifting could be applied only for bit-granular
accesses, so byte-granular accesses could have normal latency.
(Bit-field loads would have higher latency.)
According to Lawrence D'Oliveiro <[email protected]d>:
On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:
It doesn't make sense to say that character strings are big- or
little-endian.
Yes it does, for just about any encoding other than UTF-8. Thus, you
have UTF16BE, and UTF16LE.
Not really, those are byte orders within a character ...
People often think it is easier to do string manipulation - joining, splitting, replacing, etc., - when you have fixed size units per
character.
On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:
People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character.
It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.
It appears that Lawrence D'Oliveiro <[email protected]d> said:
On Sat, 11 May 2024 18:49:12 +0200, David Brown wrote:
People often think it is easier to do string manipulation - joining,
splitting, replacing, etc., - when you have fixed size units per
character.
It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.
I have to ask, how much storage do each of these fixed-size character
things take?
It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.
I have to ask, how much storage do each of these fixed-size character
things take?
That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.
According to Lawrence D'Oliveiro <[email protected]d>:
It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately-following
combining code points”.
I have to ask, how much storage do each of these fixed-size character
things take?
That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.
How am I supposed to write my code with an array of fixed size things if
I don't know how big the things are?
If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and i
suspect nobody else does either.
John Levine wrote:
If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and i
suspect nobody else does either.
One could have instructions that make it easier to parse the
variable length UTF-8 sequences into codepoints.
It would still have to look up whether a codepoint was combining or
stand alone. I don't see a firm definition whether combining codepoints
come before or after, after requiring a lookahead parse and so extra
checks to ensure it doesn't look past the buffer end.
According to EricP <[email protected]>:
John Levine wrote:
If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and i
suspect nobody else does either.
One could have instructions that make it easier to parse the
variable length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
It would still have to look up whether a codepoint was combining or
stand alone. I don't see a firm definition whether combining codepoints
come before or after, after requiring a lookahead parse and so extra
checks to ensure it doesn't look past the buffer end.
I think they come after but I haven't looked in enough detail. And
then you have all of the issues with precomposed characters: do you
normalize as you go or denormalize, or what?
On Wed, 8 May 2024 05:54:50 -0000 (UTC), Lawrence D'Oliveiro
<[email protected]d> wrote:
On Tue, 07 May 2024 22:01:36 -0600, John Savard wrote:
But the third item is character strings, used in input and output to
represent numbers. They should be the same as packed decimal to make
conversion between the two simpler.
No, because character string conversion is subject to localization
issues.
I agree that little-endian computers make sense for people whose native language is Hebrew or Arabic.
It was exactly these kinds of optimizations I made in order to double
the speed of Intel's reference BluRay decoder. However, instead of
asking me to write a complete version they decided to licence a piece of
VLSI to do it in hardware, and that was almost certainly the correct
decision since my code needed 4 cores working nearly 100% in order to
handle the highest possible size/speed quality (1080p, 60 Hz, CABAC
encoding and 40 Mbit/s bitrate).
According to Lawrence D'Oliveiro <[email protected]d>:
It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately
-following combining code points”.
I have to ask, how much storage do each of these fixed-size character
things take?
That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.
How am I supposed to write my code with an array of fixed size things if
I don't know how big the things are?
If you mean an array of pointers to sequences of code points, well sure,
but now we're back to variable length encodings.
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
On Mon, 27 May 2024 15:09:23 -0000 (UTC), John Levine wrote:
According to Lawrence D'Oliveiro <[email protected]d>:
It is easy enough to come up with a fixed-size representation for
characters in Python (or other suitably powerful language), where
“character” = “non-combining code point plus all immediately
-following combining code points”.
I have to ask, how much storage do each of these fixed-size character
things take?
That’s not important; what’s important is that you can put characters as
elements in an array, randomly accessible just by array index.
How am I supposed to write my code with an array of fixed size things if
I don't know how big the things are?
The fixed-size things are references to objects. Or in a lower-level
language like C, they could indeed be pointers/indexes into an array of
code points.
If you mean an array of pointers to sequences of code points, well sure,
but now we're back to variable length encodings.
We’re not, because we still have easy random access, and the length of the array is the number of characters.
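One way to picture the array being discussed, as a rough Python sketch rather than anyone's actual implementation. `clusters` is a hypothetical helper, and treating "combining" as "nonzero canonical combining class" via `unicodedata.combining` is an assumption that is simpler than the full grapheme cluster rules in UAX #29 (for instance, it does not keep ZWJ emoji sequences together):

```python
import unicodedata

def clusters(s: str) -> list[str]:
    """Group each non-combining code point with the combining code points
    that immediately follow it (simplified; not full UAX #29 segmentation)."""
    out: list[str] = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch          # attach combining mark to the current element
        else:
            out.append(ch)         # start a new "character"
    return out

# "e" + acute, "a" + diaeresis + dot below, then plain "z"
chars = clusters("e\u0301a\u0308\u0323z")
print(len(chars))                  # number of "characters" in this sense
print([len(c) for c in chars])     # code points per element: they vary
```

The array elements are fixed-size references, so indexing is random access and `len(chars)` counts "characters", while the objects they refer to still vary in length, which is the distinction the two sides of the exchange are arguing over.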
According to EricP <[email protected]>:
John Levine wrote:
If you mean an array of pointers to sequences of code points, well
sure, but now we're back to variable length encodings. I know that I
have no idea how big these fixed size things would have to be and i
suspect nobody else does either.
One could have instructions that make it easier to parse the
variable length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
It would still have to look up whether a codepoint was combining or
stand alone. I don't see a firm definition whether combining codepoints
come before or after, after requiring a lookahead parse and so extra
checks to ensure it doesn't look past the buffer end.
I think they come after but I haven't looked in enough detail. And
then you have all of the issues with precomposed characters: do you
normalize as you go or denormalize, or what?
On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
What is the point, in this day and age, of having special machine instructions to convert character encodings?
Lawrence D'Oliveiro <[email protected]d> schrieb:
On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
What is the point, in this day and age, of having special machine
instructions to convert character encodings?
Have you looked at decoding algorithms for UTF-8?
Lawrence D'Oliveiro <[email protected]d> schrieb:
On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
What is the point, in this day and age, of having special machine
instructions to convert character encodings?
Have you looked at decoding algorithms for UTF-8?
On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
What is the point, in this day and age, of having special machine
instructions to convert character encodings?
Have you looked at decoding algorithms for UTF-8?
Of course. Isn’t the point of RISC that these complex operations are more
efficiently performed by a sequence of simpler instructions?
The fixed-size things are references to objects. Or in a lower-level
language like C, they could indeed be pointers/indexes into an array of
code points.
[...] we still have easy random access, and the length of the
array is the number of characters.
Lawrence D'Oliveiro <[email protected]d> writes:
On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
What for? Dealing with code points is rarely necessary, so adding
instructions for that is a waste (and it's not surprising to me that
neither AMD64 nor ARM A64 have such instructions; IBM z seems to
add special instructions that are rarely useful as a marketing
argument).
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.
AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan
Of course with apologies to Herr Koenig's umlauts. :-)
And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.
AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.
AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan
Of course with apologies to Herr Koenig's umlauts. :-)
And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.
I think you misunderstand: the code written to sanitize an ASCII string to feed into SQL will *just work* to sanitize a UTF-8 string to feed
into SQL, no matter how many funny characters and joiners and combiners
and emojis you have in there.
That's part of the reason why UTF-8 is so popular: you can surprisingly
often treat it as "good old ASCII".
Stefan
Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.
I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.
Random hack thought #1: if the string I send starts with an umlaut as
the first code point, ...
Random hack thought #2: If a character has multiple combiner code points,
does changing the order create a different character or do they map to
the same display character? Or worse, maybe combiner code point order
sensitivity is character dependent, some are, some are not.
According to EricP <[email protected]>:
Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.
I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.
If you're sending the strings to a database, the database will
invariably do detailed string validation so I wouldn't bother, but be
prepared for the error code if it rejects the string,
You could always explain to the company president that you only work in
ASCII so they should just get used to it.
Lawrence D'Oliveiro <[email protected]d> writes:
Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?
The IBM z series are not RISCs.
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?
The IBM z series are not RISCs.
Doesn’t matter. The principles of designing high-performance
architectures still apply: simpler instructions are better than more
complex ones.
Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
What is the point, in this day and age, of having special machine
instructions to convert character encodings?
Have you looked at decoding algorithms for UTF-8?
It's almost like the perfect application of risc instruction design:
a long sequence of individual instructions of conditional branches,
bit field extracts, inserts, and shifts, is replaced in HW by
a small number of muxes that can do the same in one clock.
Lawrence D'Oliveiro <[email protected]d> writes:
On Tue, 28 May 2024 16:02:10 -0000 (UTC), Thomas Koenig wrote:
Lawrence D'Oliveiro <[email protected]d> schrieb:
On Mon, 27 May 2024 19:09:51 -0000 (UTC), John Levine wrote:
According to EricP <[email protected]>:
One could have instructions that make it easier to parse the variable
length UTF-8 sequences into codepoints.
What for? Dealing with code points is rarely necessary, so adding
instructions for that is a waste (and it's not surprising to me that
neither AMD64 nor ARM A64 have such instructions; IBM z seems to
add special instructions that are rarely useful as a marketing
argument).
That would be the CU14 instruction on zSeries, to turn UTF-8 into
UTF-32. CU41 goes the other way.
What is the point, in this day and age, of having special machine
instructions to convert character encodings?
Have you looked at decoding algorithms for UTF-8?
Of course. Isn’t the point of RISC that these complex operations are more
efficiently performed by a sequence of simpler instructions?
The IBM z series are not RISCs.
Anyway, such instructions can be done in a RISCy way (pure register-to-register instructions) or in a CISCy way
(memory-to-memory).
A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
of the remaining string in a register and producing an UTF-32 code
point in another register and a length in a third register (or in the
high part of the destination register to reduce write port
requirements). Similarly for UTF-32->UTF-8, with the length
specifying the length of the result; that would need to be combined
with a length masked store to make it easy to store the result.
This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other representation plus a length of the UTF-8 representation.
The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.
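To make the register-to-register idea concrete, here is a software sketch in Python of what such an instruction would compute; it is only a model of the proposal, not of any existing instruction. `decode_one` and `decode_all` are hypothetical names, and the sketch deliberately skips checks for overlong encodings and surrogates:

```python
def decode_one(window: bytes) -> tuple[int, int]:
    """Model of the proposed operation: take up to 4 bytes of UTF-8 and
    return (code_point, length_in_bytes) for the first sequence."""
    b0 = window[0]
    if b0 < 0x80:                       # 0xxxxxxx: ASCII
        return b0, 1
    if b0 >> 5 == 0b110:                # 110xxxxx: 2-byte sequence
        length, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:             # 1110xxxx: 3-byte sequence
        length, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:            # 11110xxx: 4-byte sequence
        length, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(window) < length:
        raise ValueError("truncated sequence")
    for b in window[1:length]:
        if b >> 6 != 0b10:              # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, length

def decode_all(data: bytes) -> list[int]:
    out, i = [], 0
    while i < len(data):
        cp, n = decode_one(data[i:i + 4])   # "load" of the next fragment
        out.append(cp)
        i += n        # the next load address depends on this length
    return out

print([hex(cp) for cp in decode_all("aé€😀".encode("utf-8"))])
```

The loop in `decode_all` shows the dependence chain described above: each fragment can only be fetched once the previous length is known.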
Anton Ertl wrote:
This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other
representation plus a length of the UTF-8 representation.
The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.
Rather the opposite:
UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
variable length (x86?) instruction decoder, but with the big
simplification that the first byte directly tells you how long the
sequence is.
Doing a SIMD version corresponds to a superscalar x86 in that the
decoder needs to grab a variable number of bytes for each instruction, starting the next immediately after.
Terje Mathisen wrote:
Anton Ertl wrote:
This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other
representation plus a length of the UTF-8 representation.
The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.
Rather the opposite:
UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
variable length (x86?) instruction decoder, but with the big
simplification that the first byte directly tells you how long the
sequence is.
Doing a SIMD version corresponds to a superscalar x86 in that the
decoder needs to grab a variable number of bytes for each instruction,
starting the next immediately after.
Even better (compared to a superscalar x86 instruction decoder), _every_
byte uses the top two bits to tell you if this is 7-bit ascii, the start
of a UTF-8 encoded code point, or a follow-on byte inside a UTF-8 code
point.
This means that each decoder can work alone, without having to wait for
the length decoding of the previous code point ("instruction") before deciding to discard or pass on the results it got from starting where it
did.
It seems like it would be very feasible to have (say) 8 parallel
decoders starting at every corresponding byte offset, and return a SIMD register with 2-8 32-bit decoded code points, right?
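The property being relied on can be sketched in a few lines of Python; `byte_kind` is a hypothetical helper, and this only models the classification step, not the parallel hardware. Each byte is classified from its own top two bits, with no reference to its neighbours, so every byte offset can decide independently whether a code point starts there:

```python
def byte_kind(b: int) -> str:
    """Classify one UTF-8 byte from its top two bits alone."""
    top = b >> 6
    if top in (0b00, 0b01):
        return "ascii"          # 0xxxxxxx
    if top == 0b10:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx: start of a multi-byte sequence

data = "aé€".encode("utf-8")    # bytes: 61 | C3 A9 | E2 82 AC

# Each list element is computed from a single byte, so all positions
# could be examined at the same time.
kinds = [byte_kind(b) for b in data]
starts = [i for i, b in enumerate(data) if b >> 6 != 0b10]
print(kinds)
print(starts)
```

A decoder started at a continuation byte knows immediately that its result should be discarded, without waiting for the length of the preceding code point.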
Anton Ertl wrote:
Anyway, such instructions can be done in a RISCy way (pure
register-to-register instructions) or in a CISCy way
(memory-to-memory).
A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
of the remaining string in a register and producing an UTF-32 code
point in another register and a length in a third register (or in the
high part of the destination register to reduce write port
requirements). Similarly for UTF-32->UTF-8, with the length
specifying the length of the result; that would need to be combined
with a length masked store to make it easy to store the result.
This approach can also be SIMDified, converting regbits/32 code points
in one representation to the same number of code points in the other
representation plus a length of the UTF-8 representation.
The disadvantage of this approach exists particularly for
UTF-8->UTF-32: this is a very sequential approach full of dependences:
each use of the conversion instruction is followed by a dependent load
of the next input fragment, and the next use of the conversion
instruction depends on that load.
Rather the opposite:
UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
variable length (x86?) instruction decoder, but with the big
simplification that the first byte directly tells you how long the
sequence is.
Doing a SIMD version corresponds to a superscalar x86 in that the
decoder needs to grab a variable number of bytes for each instruction,
starting the next immediately after.
On Wed, 29 May 2024 07:04:35 GMT, Anton Ertl wrote:
Lawrence D'Oliveiro <[email protected]d> writes:
Isn’t the point of RISC that these complex operations are
more efficiently performed by a sequence of simpler instructions?
The IBM z series are not RISCs.
Doesn’t matter. The principles of designing high-performance architectures
still apply: simpler instructions are better than more complex ones.
It's almost like the perfect application of risc instruction design:
a long sequence of individual instructions of conditional branches,
bit field extracts, inserts, and shifts, is replaced in HW by
a small number of muxes that can do the same in one clock.
If that CU14 can also return the number of bytes consumed, along with
the resulting 32-bit character, then it would be perfect. Is that what
it is doing?
Stefan Monnier wrote:
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.
AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan
Of course with apologies to Herr Koenig's umlauts. :-)
And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.
I think you misunderstand: the code written to sanitize an ASCII string to
feed into SQL will *just work* to sanitize a UTF-8 string to feed
into SQL, no matter how many funny characters and joiners and combiners
and emojis you have in there.
That's part of the reason why UTF-8 is so popular: you can surprisingly
often treat it as "good old ASCII".
Stefan
Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.
I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.
Random hack thought #1: if the string I send starts with an umlaut as
the first code point, which doesn't display because it is invalid.
Then someone edits the first char to a/o/u and magically it changes
to a different character, and deposits now go to a different account.
Random hack thought #2: If a character has multiple combiner code points,
does changing the order create a different character or do they map to
the same display character? Or worse, maybe combiner code point order
sensitivity is character dependent, some are, some are not.
If they do display the same, then I might create two accounts that
look identical but index differently, and redirect deposits.
IBM has, for a long time, combined commonly occurring sequences of
instructions into single instructions. I don't know the tradeoffs here.
IBM has, for a long time, combined commonly occurring sequences of
instructions into single instructions. I don't know the tradeoffs
here.
I don't know either, but it's hard to believe that it's just marketing because there is an actual design and implementation cost involved and
even marketing needs some "hard" data to make a good sell.
My guess is that they have gotten their implementation to a point
where adding instructions is fairly painless (plenty of space in the instruction encoding, pre-existing micro/milli-code setup where the
size of the micro/milli-code has a negligible impact on cycle time,
chip size, and yield, ...).
Then they use that flexibility to go after specific benchmarks they
got from some important customers. Even if it speeds up the code of
a single customer, it might be worth the effort if it's a large enough customer and it increases the chances of keeping them on
that architecture.
IBM has, for a long time, combined commonly occurring sequences of
instructions into single instructions. I don't know the tradeoffs
here.
I don't know either, but it's hard to believe that it's *just*
marketing
because there is an actual design and implementation cost involved and
even marketing needs some "hard" data to make a good sell.
My guess is that they have gotten their implementation to a point where adding instructions is fairly painless (plenty of space in the
instruction encoding, pre-existing micro/milli-code setup where the
size of the micro/milli-code has a negligible impact on cycle time,
chip size, and yield, ...).
Then they use that flexibility to go after specific benchmarks they got
from some important customers. Even if it speeds up the code of
a single customer, it might be worth the effort if it's a large enough customer and it increases the chances of keeping them on
that architecture.
Maybe each of those cases could be solved about as efficiently by
rewriting part of the code, but we're talking about a market where many
of the customers are here specifically because they don't want to
rewrite their code.
For the case in point, I haven't seen problems where a UTF-32 encoding
is the overall best solution, but I can easily believe that there are
cases where some poorly thought out (but entrenched) API ends up
imposing (directly or not) the use of UTF-32 and makes UTF-8 <-> UTF-32 conversions very frequent.
Stefan
On Wed, 29 May 2024 10:10:30 -0400, EricP wrote:
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
We are all “non 1-byte character markets” now.
Just to rub it in: «€£¢©®±»
According to Terje Mathisen <[email protected]>:
It's almost like the perfect application of risc instruction design:
a long sequence of individual instructions of conditional branches,
bit field extracts, inserts, and shifts, is replaced in HW by
a small number of muxes that can do the same in one clock.
If that CU14 can also return the number of bytes consumed, along with
the resulting 32-bit character, then it would be perfect. Is that what
it is doing?
You give it registers with two addresses and two lengths, and it
converts the source UTF-8 code points to destination UTF-32 until it
runs out of input, fills the output, gets an invalid character, or an interrupt. It updates the addresses and lengths. Other than optionally checking for invalid UTF-8 it does not interpret the code points.
The condition code tells you which it was. If it was an interrupt, you just branch back and keep going.
There's an extra cost flag whether to test for invalid UTF-8.
Read all about it: https://www.vm.ibm.com/library/other/22783213.pdf
It's on page 7-251.
On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
<[email protected]> wrote:
According to EricP <[email protected]>:
Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.
I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.
If you're sending the strings to a database, the database will
invariably do detailed string validation so I wouldn't bother, but be
prepared for the error code if it rejects the string,
Far too much SQL is constructed by simply splicing user input into a
query "template" string.
When queries are done right with all user input provided via SQL
parameters, then there is far less need to "sanitize" input.
There is one major caveat: in SQL, table names can't be specified by
parameter. If the user must provide a table name, then you DO have to
splice the query string and you DO have to be careful.
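A small sketch of both halves of that advice, using Python's standard `sqlite3` module. The table names, the `ALLOWED_TABLES` allow-list, and the `find_owner` helper are made up for illustration; checking the table name against a fixed set is one common way to be careful, not the only one:

```python
import sqlite3

# Hypothetical allow-list: the only table names we are willing to splice in.
ALLOWED_TABLES = {"accounts", "audit_log"}

def find_owner(conn: sqlite3.Connection, table: str, name: str) -> list:
    # Table names cannot be bound as parameters, so check before splicing.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    # User data goes in via a parameter ("?"), never into the SQL text.
    cur = conn.execute(f"SELECT id FROM {table} WHERE owner = ?", (name,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, owner TEXT)")

hostile = "Zoë'; DROP TABLE accounts; --"
conn.execute("INSERT INTO accounts VALUES (?, ?)", (1, hostile))

rows = find_owner(conn, "accounts", hostile)
print(rows)
```

The hostile-looking string, non-ASCII character included, is stored and matched as plain data because it never becomes part of the query text.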
EricP <[email protected]> writes:
Stefan Monnier wrote:
I've not dealt with UTF-8 or code points but that's because I've not
written software that interacts with the non 1-byte character markets.
But even something as simple as sanitizing a character string to feed
into SQL will have to.
AFAIK you can do that by treating the UTF-8 byte sequence as if it were
an ASCII byte-sequence: all the Unicode weirdness is neatly stashed in
bytes >127 which aren't used by SQL itself anyway.
Stefan
Of course with apologies to Herr Koenig's umlauts. :-)
And what of all those new Asian customers your company was hoping
to get by dealing with them in their native written language???
You could always explain to the company president that
you only work in ASCII so they should just get used to it.
I think you misunderstand: the code written to sanitize an ASCII string to
feed into SQL will *just work* to sanitize a UTF-8 string to feed
into SQL, no matter how many funny characters and joiners and combiners
and emojis you have in there.
That's part of the reason why UTF-8 is so popular: you can surprisingly
often treat it as "good old ASCII".
Stefan
Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.
Actually what you check for is meta-characters like ; " '. They are
all ASCII, so as long as your code is 8-bit-clean, your SQL string
sanitizer needs to know nothing about UTF-8.
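The property being relied on here can be checked directly. The sketch below (Python; the helper ascii_metachars_in and the sample text are made up for illustration) scans raw UTF-8 bytes for the three metacharacters and confirms that every byte of a multi-byte sequence is 0x80 or above, so a byte-wise scan cannot mistake part of a non-ASCII character for a quote or semicolon. This holds for UTF-8; it does not hold for some legacy multi-byte encodings such as Shift_JIS, where a trailing byte can fall in the ASCII range.

```python
def ascii_metachars_in(data: bytes, metachars: bytes = b";\"'") -> list[int]:
    """Return byte offsets of SQL metacharacters in a UTF-8 byte string."""
    return [i for i, b in enumerate(data) if b in metachars]

text = "König says 'héllo'; 日本語 😀"
encoded = text.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so none of them can collide with an ASCII metacharacter.
for ch in text:
    raw = ch.encode("utf-8")
    if len(raw) > 1:
        assert all(b >= 0x80 for b in raw)

offsets = ascii_metachars_in(encoded)
found = [chr(encoded[i]) for i in offsets]
```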
I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.
Then run your string through a checker/normalizer before or
afterwards. No need to complicate your SQL sanitizer by trying to do
both at the same time. But if you want the last bit of performance by
doing both at the same time, then you certainly don't want to convert
to UTF-32 and back.
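As a sketch of the "separate checker/normalizer" step, assuming Python and its standard unicodedata module (the function name check_and_normalize is hypothetical). Decoding here does convert to Python's internal string form, so this illustrates keeping the two concerns apart for clarity, not the fast path mentioned above.

```python
import unicodedata

def check_and_normalize(raw: bytes) -> str:
    """Reject malformed UTF-8, then return NFC-normalized text."""
    # errors="strict" raises UnicodeDecodeError on ill-formed sequences.
    text = raw.decode("utf-8", errors="strict")
    return unicodedata.normalize("NFC", text)

# 'o' followed by U+0308 COMBINING DIAERESIS composes to a single code point.
ok = check_and_normalize("Ko\u0308nig".encode("utf-8"))

try:
    # 0xC3 is a lead byte that must be followed by a continuation byte;
    # here it is followed by ASCII 'n', so the sequence is malformed.
    check_and_normalize(b"K\xc3nig")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The SQL metacharacter check can then run before or after this step without knowing anything about Unicode.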
Random hack thought #1: if the string I send starts with an umlaut as
the first code point, which doesn't display because it is invalid.
I found that hard to understand. Do you mean that the string starts
with a combining diaeresis code point and is invalid because it has no
preceding base character with which to compose? The string may fail at the
Unicode checking/normalization stage (depending on what it checks).
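One reading of that scenario, sketched in Python with the standard unicodedata module (the helper name and sample strings are hypothetical). A string that begins with U+0308 is well-formed UTF-8 but has no base character for the mark; prepending a letter and normalizing then yields a different first character.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"

def starts_with_combining_mark(text: str) -> bool:
    # unicodedata.combining() returns a non-zero class for combining marks.
    return bool(text) and unicodedata.combining(text[0]) != 0

orphan = COMBINING_DIAERESIS + "bc"   # mark with no preceding base character
edited = "a" + orphan                 # someone inserts a base letter in front
composed = unicodedata.normalize("NFC", edited)
```

A checker that rejects strings for which starts_with_combining_mark is true would stop the input before any later edit could change its meaning.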
Then someone edits the first char to a/o/u and magically it changes
to a different character, and deposits now go to a different account.
If someone can edit the string, and that changes where deposits go to, someone can do that even with no Unicode involved. E.g., if someone
can change "EricP" to "Ertl". However, my impression is that banks
use account numbers (pure ASCII) for deposits, names are used only for validation; so if you provide the wrong name, a money transfer may
fail to go through (not sure what happens if a deposit does not go
through), but won't be to the wrong account.
My company's products were a real-time bond pricing and trading system,
and customers were financial companies whose internal systems in this
case only operated within North America in English, in ascii and ebcdic.
30 years ago you could say the same thing about encryption.
On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
30 years ago you could say the same thing about encryption.
I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.
For example, when I was first learning about computer encryption, I was
told that CBC (“Cipher-Block Chaining”) mode was teh hawtness, but
nowadays it’s all about GFC (“Galois-Field Counter”) mode.
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
30 years ago you could say the same thing about encryption.
I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.
I think moderate efficiency on CPU, not too low, but not high either,
is a requirement for (symmetric-key) cipher. Esp. when the key is
128-bit or shorter.
On Thu, 30 May 2024 20:38:08 -0400, EricP wrote:
My company's products were a real-time bond pricing and trading system,
and customers were financial companies whose internal systems in this
case only operated within North America in English, in ascii and ebcdic.
No need even for “¢” characters?
Michael S <[email protected]> writes:
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
30 years ago you could say the same thing about encryption.
I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.
I think moderate efficiency on CPU, not too low, but not high either,
is a requirement for (symmetric-key) cipher. Esp. when the key is
128-bit or shorter.
Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).
High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).
Scott Lurndal wrote:
Michael S <[email protected]> writes:
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
30 years ago you could say the same thing about encryption.
I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that work
better on current CPUs.
I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when the
key is 128-bit or shorter.
Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).
High throughput encryption has been done by hardware accelerators for
decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI bus;
now such HSM are an integral part of many SoC).
Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the rest
of the CPU logic to do the encryption? Furthermore, an "inbuilt"
accelerator could interface directly with the I/O hardware of the CPU
(e.g. PCI), saving the "intermediate" step of writing the encrypted
data to memory.
Stephen Fuld wrote:
Scott Lurndal wrote:
Michael S <[email protected]> writes:
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
High throughput encryption has been done by hardware accelerators
for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
bus; now such HSM are an integral part of many SoC).
Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.
It is more of a systems issue than an ISA issue: consider a chip
with 100 cores. Do you want all 100 cores to be doing encryption at
the same time, or do you only need a certain BW of encryption roughly
equal to the internet BW at hand? For the first, instructions are a
reasonable starting point; for the second, an I/O (or attached)
processor is in order.
Scott Lurndal wrote:
Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.
There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).
Yes, but putting the instructions into the core means you are
replicating the logic for every core.
From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).
I look at it differently (and perhaps incorrectly). I view encryption
as one of several "transformations" that data goes through in its path
to/from some external device.
For example, if the external device is a
disk, the data from memory may be gathered from multiple locations, is
serialized, perhaps encoded (e.g. 8b10b), has (perhaps several levels)
of ECC added, etc. Viewing it like that makes encryption one of many
steps along the I/O pipeline. Under that view, encryption is an
option, probably controlled by some bits in the I/O mechanism, not as
a separate device requiring interrupt support etc.
Adding encryption (which of the dozen standard symmetric and asymmetric
cipher algorithms?)
to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it is
very important to get correct when baking into gates.
"Stephen Fuld" <[email protected]d> writes:
Scott Lurndal wrote:
Michael S <[email protected]> writes:
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
30 years ago you could say the same thing about encryption.
I don’t think newer CPUs have been optimized for encryption. Instead,
we see newer encryption algorithms (or ways of using them) that
work better on current CPUs.
I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when
the key is 128-bit or shorter.
Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256
et al).
High throughput encryption has been done by hardware accelerators
for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
bus; now such HSM are an integral part of many SoC).
Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.
There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).
From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).
For network traffic, there are often other operations
being performed on the flow (routing, encapsulation, fragmentation/reassembly, etc) which require the packet to be in a
memory buffer (which could be high-speed SRAM or lower-speed DRAM),
even when just routing from an ingress port to an egress port.
Scott Lurndal <[email protected]> schrieb:
Adding encryption (which of the dozen standard symmetric and asymmetric
cipher algorithms?)
At the moment, AES.
to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it is
very important to get correct when baking into gates.
Seems to be fairly common these days, looking at
https://en.wikipedia.org/wiki/AES_instruction_set .
It appears that one round of AES fits fairly well into one cycle.
"Stephen Fuld" <[email protected]d> writes:
Scott Lurndal wrote:
Question. For a modern general purpose CPU, if you are including
all the logic to implement encryption instructions, is it much
more to include the control/sequencing logic to do it and not
tie up the rest of the CPU logic to do the encryption?
Furthermore, an "inbuilt" accelerator could interface directly
with the I/O hardware of the CPU (e.g. PCI), saving the
"intermediate" step of writing the encrypted data to memory.
There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).
Yes, but putting the instructions into the core means you are
replicating the logic for every core.
In the scale of a modern CPU, it's a small fraction of the logic.
The ARM neoverse cores, for example, require very little area.
From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).
I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.
That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.
Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?) to a hardware device does increase
complexity, and thus cost at the expense of extensibility (new
algorithms come along periodically).
The cost of verifying crypto is
a bit higher as it is very important to get correct when baking into
gates.
For example, if the external device is a
disk, the data from memory may be gathered from multiple locations,
is serialized, perhaps encoded (e.g. 8b10b), has (perhaps several
levels) of ECC added, etc. Viewing it like that makes encryption
one of many steps along the I/O pipeline. Under that view,
encryption is an option, probably controlled by some bits in the
I/O mechanism, not as a separate device requiring interrupt support
etc.
In the Cavium crypto-enabled DPUs, the crypto block is inserted
into the data-path where necessary, when necessary; and to the extent
that a streaming protocol/alg is used, will encrypt/decrypt as the
data is passing from the ingress point to the egress point (which
could be another external port, or an on-board CPU). It can also be
used as a stand-alone crypto accelerator by the on-board CPUs.
Note that crypto is used for more than just data
encryption/decryption; there's also digesting and digital signatures
which rely on asymmetric algorithms such as RSA or EC and don't
necessarily fit into the "path to the I/O device" model you've
espoused.
Thomas Koenig <[email protected]> writes:
Scott Lurndal <[email protected]> schrieb:
Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?)
At the moment, AES.
to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come
along periodically). The cost of verifying crypto is a bit higher
as it is very important to get correct when baking into gates.
Seems to be fairly common these days, looking at
https://en.wikipedia.org/wiki/AES_instruction_set .
As I mentioned earlier in the thread, all modern CPUs have
support for the standard algorithms in their instruction
set (optionally fused out for export).
It appears that one round of AES fits fairly well into one cycle.
Yes.
Scott Lurndal wrote:
"Stephen Fuld" <[email protected]d> writes:
Scott Lurndal wrote:
Question. For a modern general purpose CPU, if you are including
all the logic to implement encryption instructions, is it much
more to include the control/sequencing logic to do it and not
tie up the rest of the CPU logic to do the encryption?
Furthermore, an "inbuilt" accelerator could interface directly
with the I/O hardware of the CPU (e.g. PCI), saving the
"intermediate" step of writing the encrypted data to memory.
There are always tradeoffs. The issues surrounding the
control/sequencing logic outside of the instruction flow
require some level of asynchronicity, so to avoid bottlenecks
one might need to replicate the "inbuilt accelerator" if
more than one core will be using encryption (e.g. for RSS
with IPSEC flows).
Yes, but putting the instructions into the core means you are
replicating the logic for every core.
In the scale of a modern CPU, it's a small fraction of the logic.
The ARM neoverse cores, for example, require very little area.
Agreed. I was assuming that the cost of the logic was about the same
whether it was done as CPU instructions or a chunk of accelerator logic
in the I/O stream. If that is true, then the cost of having multiples
of them in the I/O stream is small.
From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).
I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.
That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.
Good. Can you give some examples, and perhaps an estimate of what
percentage of the total encryption operations are in place? Note that
it may be possible to add a feature to the "in-stream" hardware to
allow in-place encryption - i.e. both sides go to/come from memory.
Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?) to a hardware device does increase
complexity, and thus cost at the expense of extensibility (new
algorithms come along periodically).
Agreed. But this is also true for new CPU instructions.
The cost of verifying crypto is
a bit higher as it is very important to get correct when baking into
gates.
Sure. And I expect it is also higher because of the extra security
precautions against side-channel attacks, etc.
In the Cavium crypto-enabled DPUs, the crypto block is inserted
into the data-path where necessary, when necessary; and to the extent
that a streaming protocol/alg is used, will encrypt/decrypt as the
data is passing from the ingress point to the egress point (which
could be another external port, or an on-board CPU). It can also be
used as a stand-alone crypto accelerator by the on-board CPUs.
Good to know. Proof of concept for my suggestion. :-) Can you talk
about advantages/disadvantages of that mechanism versus other
implementations?
Note that crypto is used for more than just data
encryption/decryption; there's also digesting and digital signatures
which rely on asymmetric algorithms such as RSA or EC and don't
necessarily fit into the "path to the I/O device" model you've
espoused.
Yes, of course. But I think digital signature creation/verification
could be fit into the streaming model. Is that wrong? With regard to
RSA/EC, etc. I absolutely agree.
On Mon, 3 Jun 2024 18:01:00 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
Scott Lurndal <[email protected]> schrieb:
Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?)
At the moment, AES.
to a hardware device does increase complexity, and
thus cost at the expense of extensibility (new algorithms come along
periodically). The cost of verifying crypto is a bit higher as it
is very important to get correct when baking into gates.
Seems to be fairly common these days, looking at
https://en.wikipedia.org/wiki/AES_instruction_set .
It appears that one round of AES fits fairly well into one cycle.
One/cycle throughput fits well. Even two/cycle throughput fits.
One cycle latency does not fit unless you target very low frequency.
Latency on POWER9 - 6 clocks. On majority of modern Intel and AMD cores
3-4 clocks. On Apple M1 - 3 clocks.
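The gap between latency and throughput matters most when blocks can be processed independently. The following is an idealised back-of-envelope model in Python, not a measurement: the function names and the formula are assumptions for illustration. It assumes one AES round per instruction, a fixed round latency in cycles, and a fixed issue rate. A chained mode (such as CBC encryption) must wait out the full latency on every round, while a mode with independent blocks (such as the counter-based modes) can keep the pipeline busy.

```python
def serial_cycles(blocks: int, rounds: int, latency: int) -> int:
    # Chained blocks: each round of each block waits for the previous result.
    return blocks * rounds * latency

def interleaved_cycles(blocks: int, rounds: int, latency: int,
                       per_cycle: int = 1) -> int:
    # Independent blocks issued round-robin, one "wave" per round number.
    issue = -(-blocks // per_cycle)   # cycles to issue one wave (ceiling)
    wave = max(latency, issue)        # a wave cannot start before its inputs exist
    return (rounds - 1) * wave + issue + latency - 1

# AES-128 uses 10 rounds; latency of 4 cycles is one of the figures quoted above.
chained = serial_cycles(blocks=8, rounds=10, latency=4)
overlapped = interleaved_cycles(blocks=8, rounds=10, latency=4)
```

With a single block the two functions agree, since there is nothing to overlap.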
"Stephen Fuld" <[email protected]d> writes:
Scott Lurndal wrote:
The ARM neoverse cores, for example, require very little area.
Agreed. I was assuming that the cost of the logic was about the same
whether it was done as CPU instructions or a chunk of accelerator logic
in the I/O stream. If that is true, then the cost of having multiples
of them in the I/O stream is small.
Although the accelerator requires additional logic to interface
to the CPUs (either by presenting as a memory mapped device,
integrated into the processor register scheme, or some other
proprietary mechanism). Which means non-standard software is
required to manage and use the accelerator.
From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).
I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.
That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.
Good. Can you give some examples, and perhaps an estimate of what
percentage of the total encryption operations are in place? Note that
it may be possible to add a feature to the "in-stream" hardware to
allow in-place encryption - i.e. both sides go to/come from memory.
Consider file access. From the perspective of the disk, all blocks
are identical - it doesn't know which partition, filesystem, or file
that any individual block is part of, if any.
Whole-disk encryption can happen at the drive. Per-file (or per-filesystem) happens in the filesystem driver(s), perhaps
with a hardware assist, but it wouldn't be on the path from
the disk to memory.
There are cases where only a portion of a file is encrypted, and
cases where the encryption is combined with compression (pkzip,
rar, etc).
Adding encryption (which of the dozen standard symmetric and
asymmetric cipher algorithms?) to a hardware device does increase
complexity, and thus cost at the expense of extensibility (new
algorithms come along periodically).
Agreed. But this is also true for new CPU instructions.
A hardware accelerator could, for example, be microcoded
rather than using hard logic to future-proof it.
The cost of verifying crypto is
a bit higher as it is very important to get correct when baking into
gates.
Sure. And I expect it is also higher because of the extra security
precautions against side-channel attacks, etc.
Timing attacks, in particular.
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
but nowadays it’s all about GFC (“Galois-Field Counter”) mode.
GCM is far more common spelling.
On Mon, 03 Jun 2024 14:07:12 GMT [email protected] (Scott Lurndal)
wrote:
Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).
It is still not *too* fast.
'Too fast' in my book is when with 1B to 10B USD worth of OTP servers
you can break cipher by brute force in less than 1 hour.
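To put rough numbers on that criterion, here is a small Python estimate. Every figure in it is a made-up assumption (the number of machines and the per-machine trial rate are not from the thread); the only point is how the expected search time scales with key length.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def expected_search_seconds(key_bits: int, trials_per_second: float) -> float:
    # On average, half of the key space is searched before the key is found.
    return 2 ** (key_bits - 1) / trials_per_second

# Hypothetical attacker: 10 million machines, each testing 10 billion keys/s.
rate = 1e7 * 1e10

years_128 = expected_search_seconds(128, rate) / SECONDS_PER_YEAR
seconds_56 = expected_search_seconds(56, rate)
```

Under these assumptions a 56-bit key falls in under a second, while a 128-bit key remains far out of reach of exhaustive search, which is why per-trial speed matters much more for short keys.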
Scott Lurndal wrote:
"Stephen Fuld" <[email protected]d> writes:
Scott Lurndal wrote:
The ARM neoverse cores, for example, require very little area.
Agreed. I was assuming that the cost of the logic was about the
same whether it was done as CPU instructions or a chunk of
accelerator logic in the I/O stream. If that is true, then the
cost of having multiples of them in the I/O stream is small.
Although the accelerator requires additional logic to interface
to the CPUs (either by presenting as a memory mapped device,
integrated into the processor register scheme, or some other
proprietary mechanism). Which means non-standard software is
required to manage and use the accelerator.
First consider that it is possible for an I/O device to DMA directly
to another I/O device in the PCIe routing tree/DAG.
Then, consider that with this infrastructure, you could DMA from
memory through the Cryptor and back to memory (or anywhere you wanted
it).
From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).
I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.
That's certainly a valid view, if perhaps not complete. There
are use cases for in-place encryption.
Good. Can you give some examples, and perhaps an estimate of what percentage of the total encryption operations are in place? Note
that it may be possible to add a feature to the "in-stream"
hardware to allow in-place encryption - i.e. both sides go
to/come from memory.
Different users want their files encrypted using different keys than
any other user--where file could be memory resident (or not).
Consider file access. From the perspective of the disk, all blocks
are identical - it doesn't know which partition, filesystem, or file
that any individual block is part of, if any.
Whole-disk encryption can happen at the drive. Per-file (or per-filesystem) happens in the filesystem driver(s), perhaps
with a hardware assist, but it wouldn't be on the path from
the disk to memory.
You may be correct in how it is now--but if the device has encryption services why can they not be applied sector by sector ??
There are cases where only a portion of a file is encrypted, and
cases where the encryption is combined with compression (pkzip,
rar, etc).
Michael S wrote:
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
30 years ago you could say the same thing about encryption.
I don’t think newer CPUs have been optimized for encryption.
Instead, we see newer encryption algorithms (or ways of using
them) that work better on current CPUs.
I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when the
key is 128-bit or shorter.
That's correct:
CPU efficiency, primarily on the reference 32-bit platform
(PentiumPro 200 MHz) but also on an 8-bit "smart card" implementation
was one of the key requirements for the AES competition.
When a group of four programmers (including me) spent a week on
CERN's candidate, we were able to triple the speed, bringing it into
parity with the eventual winner. All the finalists were more or less
the same speed at this point, i.e. able to do full duplex 100 Mbit/s
Ethernet traffic (so around 20 MB/s) on a single thread/core.
Terje
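For a sense of the cycle budget this implies, a quick Python calculation. The function name is hypothetical and the model is simplified: it uses the raw line rate with no allowance for Ethernet framing overhead, which is presumably part of why the usable payload figure is lower than the raw one.

```python
def cycles_per_byte(clock_hz: float, link_bits_per_s: float,
                    full_duplex: bool = True) -> float:
    # Full duplex means both directions must be encrypted or decrypted.
    directions = 2 if full_duplex else 1
    bytes_per_s = directions * link_bits_per_s / 8
    return clock_hz / bytes_per_s

# 200 MHz reference CPU against full-duplex 100 Mbit/s Ethernet.
budget = cycles_per_byte(200e6, 100e6)
```

At the raw line rate this leaves 8 cycles per byte on a single core; at the quoted figure of around 20 MB/s the budget is about 10 cycles per byte.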
Scott Lurndal wrote:
"Stephen Fuld" <[email protected]d> writes:
Scott Lurndal wrote:
The ARM neoverse cores, for example, require very little area.
Agreed. I was assuming that the cost of the logic was about the same
whether it was done as CPU instructions or a chunk of accelerator logic
in the I/O stream. If that is true, then the cost of having multiples
of them in the I/O stream is small.
Although the accelerator requires additional logic to interface
to the CPUs (either by presenting as a memory mapped device,
integrated into the processor register scheme, or some other
proprietary mechanism). Which means non-standard software is
required to manage and use the accelerator.
First consider that it is possible for an I/O device to DMA directly
to another I/O device in the PCIe routing tree/DAG.
Then, consider that with this infrastructure, you could DMA from
memory through the Cryptor and back to memory (or anywhere you
wanted it).
From the operating software standpoint, it becomes most
convenient, then, to model the offload as a device which
requires OS support (and intervention for e.g. interrupt
handling).
I look at it differently (and perhaps incorrectly). I view
encryption as one of several "transformations" that data goes
through in its path to/from some external device.
That's certainly a valid view, if perhaps not complete. There are
use cases for in-place encryption.
Good. Can you give some examples, and perhaps an estimate of what
percentage of the total encryption operations are in place? Note that
it may be possible to add a feature to the "in-stream" hardware to
allow in-place encryption - i.e. both sides go to/come from memory.
Different users want their files encrypted using different keys than
any other user--where file could be memory resident (or not).
Consider file access. From the perspective of the disk, all blocks
are identical - it doesn't know which partition, filesystem, or file
that any individual block is part of, if any.
Whole-disk encryption can happen at the drive. Per-file (or
per-filesystem) happens in the filesystem driver(s), perhaps
with a hardware assist, but it wouldn't be on the path from
the disk to memory.
You may be correct in how it is now--but if the device has encryption
services why can they not be applied sector by sector ??
Sure. And I expect it is also higher because of the extra security
precautions against side attacks, etc.
Timing attacks, in particular.
All the more reason to run encryption through a device where you cannot
measure time accurately.
Stephen Fuld wrote:
Scott Lurndal wrote:
Michael S <[email protected]> writes:
On Mon, 3 Jun 2024 08:03:53 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 30 May 2024 18:31:46 +0000, MitchAlsup1 wrote:
30 years ago you could say the same thing about encryption.
I don’t think newer CPUs have been optimized for encryption.
Instead, we see newer encryption algorithms (or ways of using them)
that work better on current CPUs.
I think moderate efficiency on CPU, not too low, but not high
either, is a requirement for (symmetric-key) cipher. Esp. when
the key is 128-bit or shorter.
Most modern CPUs have instruction set support for symmetric
ciphers such as AES, SM2/SM3 as well as message digest/hash
(SHA1, SHA256 et al).
High throughput encryption has been done by hardware accelerators
for decades now (e.g. bbn or ncypher HSM boxes sitting on a SCSI
bus; now such HSM are an integral part of many SoC).
Question. For a modern general purpose CPU, if you are including all
the logic to implement encryption instructions, is it much more to
include the control/sequencing logic to do it and not tie up the
rest of the CPU logic to do the encryption? Furthermore, an
"inbuilt" accelerator could interface directly with the I/O
hardware of the CPU (e.g. PCI), saving the "intermediate" step of
writing the encrypted data to memory.
That logic already exists, in the form of a single thread/core
dedicated to the job.
With 30-100 cores on a single die, it becomes very cheap to dedicate
one of them to babysit such a process, compared to the cost of making
a custom chunk of VLSI to do the same. This is particularly true
because the logic needed in the babysitting process is mostly
straight line, with a very limited number of hard-to-predict branches.
I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with clever
coding. Here it makes perfect sense to have a chunk of hw to handle
the heavy lifting. Monitoring block encryption/decryption not so much.
If I want to validate combiner codes or normalize characters I need
UTF-32 because I have to work with the whole character as a unit.
I was just trying to get people thinking of ways that malformed
characters might be used to bypass other validation checks in
their software.
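As a rough illustration of rejecting malformed characters before any other check runs, here is a minimal Python sketch. It relies only on the standard strict UTF-8 decoder; the helper name `is_well_formed_utf8` and the sample byte strings are made up for this example.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True only if raw decodes as strictly valid UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


samples = [
    b"drop table",               # plain ASCII
    "na\u00efve".encode("utf-8"),  # well-formed two-byte sequence
    b"\xc0\xaf",                 # overlong (malformed) encoding of "/"
    b"\xed\xa0\x80",             # encoded surrogate, not allowed in UTF-8
]
results = [is_well_formed_utf8(s) for s in samples]
```

Only after this check passes does it make sense to convert to whole code points (UTF-32 style) for normalization or combiner validation.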
George Neuner wrote:
On Wed, 29 May 2024 18:42:32 -0000 (UTC), John Levine
<[email protected]> wrote:
According to EricP <[email protected]>:
Ok, you accept international character data, you just don't have to
check >127 characters for "drop table" etc commands.
I don't think you are being paranoid enough.
I still think you have to validate or sanitize the >127 string to
ensure the code sequences only contain well formed characters.
If you're sending the strings to a database, the database will
invariably do detailed string validation so I wouldn't bother, but be
prepared for the error code if it rejects the string.
Far too much SQL is constructed by simply splicing user input into a
query "template" string.
When queries are done right with all user input provided via SQL
parameters, then there is far less need to "sanitize" input.
There is a one major caveat: in SQL, table names can't be specified by
parameter. If the user must provide a table name, then you DO have to
splice the query string and you DO have to be careful.
Yes, I didn't mean not parameterizing the string args.
I was trying to think of ways that I might get your software to combine
malformed strings creating something different. This would occur after
the strings have been passed using parameterization, like if an index
is built from two concatenated string fields.
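The parameterization point quoted above (values go in as SQL parameters, table names cannot) might look like this in Python with the standard `sqlite3` module. This is only a sketch: the `ALLOWED_TABLES` allow-list, the `find_by_name` helper and the `users` table are hypothetical names invented for the example.

```python
import sqlite3

# Hypothetical allow-list: the only table names we will ever splice in.
ALLOWED_TABLES = {"users", "orders"}


def find_by_name(conn, table, name):
    # Table names cannot be bound as parameters, so check before splicing.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    # The user-supplied value is bound with "?", never spliced into the text.
    query = f'SELECT id, name FROM "{table}" WHERE name = ?'
    return conn.execute(query, (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

hostile = "alice'; DROP TABLE users; --"
rows_ok = find_by_name(conn, "users", "alice")
rows_hostile = find_by_name(conn, "users", hostile)  # treated as a plain string
```

The hostile string simply fails to match any row, and a table name outside the allow-list is rejected before any SQL is built.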
On Mon, 3 Jun 2024 17:42:17 +0300, Michael S wrote:
On Mon, 03 Jun 2024 14:07:12 GMT [email protected] (Scott Lurndal)
wrote:
Most modern CPUs have instruction set support for symmetric ciphers
such as AES, SM2/SM3 as well as message digest/hash (SHA1, SHA256 et
al).
It is still not *too* fast.
'Too fast' in my book is when with 1B to 10B USD worth of OTP servers
you can break cipher by brute force in less than 1 hour.
The good algorithms are designed to be fast for encryption/decryption use,
while still being uselessly slow for cracking purposes.
Hash algorithms come in two flavours: cryptographic hashes (as mentioned
above) and password hashes. Cryptographic hashes have to be fast to
compute, but password hashes should take some appreciable fraction of a
second. This is fast enough to authenticate a user logging in, while
significantly slowing down password-guessing attacks.
For example, the WordPress password-hashing algorithm takes a
cryptographic hash like MD5 (considered crap nowadays), and iterates it
8000 times. And suddenly crap becomes good.
It's debatable whether repeated application of a given function really represents a /different/ function.
Try encrypting something with only one round of DES or AES :-)
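The "iterate a fast hash many times" idea discussed above can be sketched in Python. This is not the WordPress algorithm: it swaps in SHA-256 for MD5, and the function names, the salt handling and the round count of 8192 are assumptions chosen for illustration.

```python
import hashlib
import hmac
import os


def slow_hash(password: str, salt: bytes, rounds: int = 8192) -> bytes:
    """Stretch a fast hash by feeding its output back in `rounds` times."""
    pw = password.encode("utf-8")
    digest = hashlib.sha256(salt + pw).digest()
    for _ in range(rounds):
        digest = hashlib.sha256(digest + pw).digest()
    return digest


def verify(password: str, salt: bytes, expected: bytes, rounds: int = 8192) -> bool:
    # compare_digest avoids leaking where the first mismatching byte is.
    return hmac.compare_digest(slow_hash(password, salt, rounds), expected)


salt = os.urandom(16)
stored = slow_hash("correct horse", salt)
```

Each guess now costs the attacker thousands of hash evaluations instead of one. For real use, a vetted construction such as the standard library's `hashlib.pbkdf2_hmac` is probably the better choice than a hand-rolled loop.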
Terje Mathisen wrote:
That logic already exists, in the form of a single thread/core
dedicated to the job.
With 30-100 cores on a single die, it becomes very cheap to dedicate
one of them to babysit such a process, compared to the cost of making
a custom chunk of VLSI to do the same. This is particularly true
because the logic needed in the babysitting process is mostly
straight line, with a very limited number of hard-to-predict branches.
I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with clever
coding. Here it makes perfect sense to have a chunk of hw to handle
the heavy lifting. Monitoring block encryption/decryption not so much.
I may be missing something, but while your proposal addresses the first
part of my proposal, I think it doesn't address the second. That is,
for data coming from/going to some external source, you are still doing
"unnecessary" memory traffic, which takes memory bandwidth and
increases latency.
Terje Mathisen wrote:
I.e. h.264 CABAC decoding has three branches per bit decoded, at least
one of them impossible to predict or work around with clever coding.
How many instructions in the then-clause and in the else-clause ??
If these are smaller than 8, My 66000 can process them without
"branching" using predication.
Stephen Fuld wrote:
I may be missing something, but while your proposal addresses the
first part of my proposal, I think it doesn't address the second.
That is, for data coming from/going to some external source, you
are still doing "unnecessary" memory traffic, which takes memory
bandwidth and increases latency.
Usually, when a CPU needs to work on something, it will need to get
the data into $L1 anyway? It is only when the work is simply to be a
pipeline that having a way to bypass the CPU completely really makes
a difference, right?
Terje Mathisen wrote:
Usually, when a CPU needs to work on something, it will need to get
the data into $L1 anyway? It is only when the work is simply to be a pipeline that having a way to bypass the CPU completely really makes
a difference, right?
Right. But my point is that the CPU never really needs to "work" on
the encrypted data. It is frequently only sent to, or received from
the network or a storage device, hence the pipelined approach has
advantages.
MitchAlsup1 wrote:
I.e. h.264 CABAC decoding has three branches per bit decoded, at least
one of them impossible to predict or work around with clever coding.
How many instructions in the then-clause and in the else-clause ??
If these are smaller than 8, My 66000 can process them without
"branching" using predication.
No, the real problem is the context branching: After doing the 50%
branch you pick up one of two alternative contexts and follow totally different paths, i.e. you cannot simply use the branch bit as an index.
I found ways to bypass the issues with the other two branches but this
one is fundamental.
Terje
The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.
It's not that other, "piece-wise" encryption types can't be used, but
if you are serious about privacy you should consider them
insufficient.
Michael S wrote:
snip lots of stuff about encryption alternatives
The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.
It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.
That's fair. But there are counterarguments, such as that not doing the
encryption on a processor that is also executing arbitrary user code
makes it more immune to side attacks.
And, BTW, running arbitrary hostile code on your computer is bad, bad,
bad idea for 1e9 other reasons.
Stephen Fuld wrote:
Terje Mathisen wrote:
Usually, when a CPU needs to work on something, it will need to get
the data into $L1 anyway? It is only when the work is simply to be a
pipeline that having a way to bypass the CPU completely really makes
a difference, right?
Right. But my point is that the CPU never really needs to "work" on the
encrypted data. It is frequently only sent to, or received from the
network or a storage device, hence the pipelined approach has
advantages.
If the keys are visible in application memory, Spectré like attacks can
read out those keys. If the keys are visible in supervisor memory,
similar attack strategies can read them out. Thus, it makes sense that
the CPUs not be doing the cryption.
On Wed, 5 Jun 2024 17:04:49 +0000
[email protected] (MitchAlsup1) wrote:
Michael S wrote:
On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)
The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.
Except for the Spectré like attacks that steal the keys if they are in
memory.
Spectre, not Spectré
https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)
It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.
And who exactly places the key into registers of your beloved shared
encryption device?
Michael S wrote:
On Wed, 5 Jun 2024 13:34:25 -0000 (UTC)
The best, the most secure encryption is an end-to-end encryption.
Which means application-to-application.
Except for the Spectré like attacks that steal the keys if they are in memory.
It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.
Michael S <[email protected]> writes:
It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.
And who exactly places the key into registers of your beloved shared
encryption device?
It is pretty trivial to bake private keys into hardware at the fab,
either through e-fuses or various other mechanisms.
Side-channel attacks on AES were 99%-fantasy of bored (or
attention-seeking) security researchers even before Rijndael core was
put in CPU hardware. Much more so now.
Weak point tends to be key management rather than encryption itself.
And, BTW, running arbitrary hostile code on your computer is bad, bad,
bad idea for 1e9 other reasons.
Except for the Spectré like attacks that steal the keys if they are in
memory.
Spectre, not Spectré
Scott Lurndal wrote:
Michael S <[email protected]> writes:
It's not that other, "piece-wise" encryption types can't be used,
but if you are serious about privacy you should consider them
insufficient.
And who exactly places the key into registers of your beloved shared
encryption device?
It is pretty trivial to bake private keys into hardware at the fab,
either through e-fuses or various other mechanisms.
Is that something the CIA or NSA would allow on their computers ??
Terje Mathisen wrote:
MitchAlsup1 wrote:
I.e. h.264 CABAC decoding has three branches per bit decoded, at
least one of them impossible to predict or work around with clever
coding.
How many instructions in the then-clause and in the else-clause ??
If these are smaller than 8, My 66000 can process them without
"branching" using predication.
No, the real problem is the context branching: After doing the 50%
branch you pick up one of two alternative contexts and follow totally
different paths, i.e. you cannot simply use the branch bit as an index.
If the number of instructions in the combined then and else clauses is
lower than a certain number, it is equally efficient to deal with the
branch as if it were later nullification rather than a redirection of
the fetch end of the pipeline. Here, NO prediction is required and there
is no chance of misprediction, regardless of the predictability
of the control flow point. The whole point is that if the fetch end
of the pipeline will reach the convergence point before the branch
is fully resolved, then don't branch: nullify. It saves cycles and
keeps unpredictable branches out of the branch predictor--even if the
apparent takenness of the branch is completely random--improving
the prediction accuracy of "real branches".
So, for example, let us postulate a 1-wide machine fetching 4 words per
clock and a then clause of 3 instructions and an else clause of 4 inst.
By the time the pseudo branch instruction enters execution, both the
then and the else have already been fetched, parsed, and are flowing
through decode. The execution of the branch merely decides which inst
survive the pipeline and there are no misprediction stalls. {{On a
wider machine, the fetch is even wider and the parse/decode BW is
still higher, so the mispredicted control flow point does not suffer misprediction repair costs.}}
Oddly enough, this is how predication works on My 66000.
I found ways to bypass the issues with the other two branches but this
one is fundamental.
It is fundamental only on ISAs that perform predication improperly,
or do not have predication, or use the predictor when predicating.
My 66000 is not one of them.
I return to the question posed earlier::
How many instructions in the then-clause and in the else-clause ??
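As a loose software analogy to the nullification described above (not a model of how My 66000 implements it), here is a Python sketch in which both short clauses are always evaluated and a mask decides which result survives. The function names and the two toy clauses are invented for the example.

```python
def select_branchy(cond: bool, a: int, b: int) -> int:
    """Conventional form: control flow picks one clause to execute."""
    if cond:
        return a + 3   # "then" clause
    else:
        return b * 2   # "else" clause


def select_predicated(cond: bool, a: int, b: int) -> int:
    """If-converted form: both clauses run, the condition only selects."""
    then_val = a + 3
    else_val = b * 2
    mask = -int(cond)  # all ones (-1) when cond is true, 0 otherwise
    return (then_val & mask) | (else_val & ~mask)
```

The second form does more arithmetic but contains no data-dependent branch, so there is nothing to predict. That trade only pays off while both clauses stay short, which is why the instruction count of the then- and else-clauses matters.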
Michael S wrote:
And, BTW, running arbitrary hostile code on your computer is bad,
bad, bad idea for 1e9 other reasons.
Running arbitrary hostile code where the user address space is not
completely disjoint from the supervisor access space is ALSO a bad
Idea.
On Wed, 5 Jun 2024 20:13:19 +0000
[email protected] (MitchAlsup1) wrote:
Michael S wrote:
And, BTW, running arbitrary hostile code on your computer is bad,
bad, bad idea for 1e9 other reasons.
Running arbitrary hostile code where the user address space is not
completely disjoint from the supervisor access space is ALSO a bad
Idea.
It sounds like you came to the verge of selling your soul to
microkerneliac heresy.
[email protected] (MitchAlsup1) writes:
Is that something the CIA or NSA would allow on their computers ??
If they use windows, yes.
... every day that comes by, another activity is made
virtually impossible without allowing such arbitrary code on your
device. 🙁
... every day that comes by, another activity is made virtually
impossible without allowing such arbitrary code on your device. 🙁
If you’re talking about WASM or JavaScript from websites, that runs in a
carefully-designed sandbox.
Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
Unicode comes with a 700kB `confusables.txt` listing such issues.
Stefan Monnier wrote:
Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization). E.g. Β vs B, А vs A, or ∕ vs
/ vs ⁄.
Unicode comes with a 700kB `confusables.txt` listing such issues.
Eeewww... I didn't even think of that.
What does one do about them? You can't treat them as equivalent in a
string compare... the user might want the first B and not second B.
I suppose one would want two compare equal functions,
an exactly equal, and a visually approximately equal.
Like using a soundex for words to catch misspellings.
But then programmers need to decide when to use each compare.
These character and code attribute lookup tables are looking awkward.
With up to 2M codes, and some base character codes having multiple
possible combiners, but very sparse. And links between entries
for upper and lower case, and now links between confusables.
And we don't want to roll over the L1 cache just to do a string compare.
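The "two compare functions" idea can be sketched as follows. The tiny `SKELETON` map is a hand-picked, hypothetical subset standing in for the real `confusables.txt` data, and the approach is only loosely modeled on the "skeleton" comparison described in Unicode's security mechanisms (UTS #39).

```python
import unicodedata

# Hypothetical subset of confusable mappings; real data is far larger.
SKELETON = {
    "\u0392": "B",  # GREEK CAPITAL LETTER BETA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u2215": "/",  # DIVISION SLASH
    "\u2044": "/",  # FRACTION SLASH
}


def skeleton(s: str) -> str:
    """Decompose, then map each confusable code point to its prototype."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(SKELETON.get(ch, ch) for ch in decomposed)


def exactly_equal(a: str, b: str) -> bool:
    return a == b


def visually_equal(a: str, b: str) -> bool:
    return skeleton(a) == skeleton(b)


latin = "BA/"
lookalike = "\u0392\u0410\u2215"  # Greek Beta, Cyrillic A, division slash
```

Here `exactly_equal(latin, lookalike)` is false while `visually_equal(latin, lookalike)` is true, so the caller still has to decide which comparison fits, e.g. exact for lookups and visual for spoofing checks on new user names.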
EricP wrote:
These character and code attribute lookup tables are looking awkward.
With up to 2M codes, and some base character codes having multiple
possible combiners, but very sparse. And links between entries
for upper and lower case, and now links between confusables.
And we don't want to roll over the L1 cache just to do a string compare.
Years ago I considered case-insensitive Boyer-Moore text search with a
wide alphabet and found that the only approach that made sense was to maintain two copies of the string to be searched for, one lower and one
upper case, where each "character" was a length-encoded string. This was required to handle things like the German double s which can uppercase
into a single letter.
The lookup table for skip lengths was still far shorter than the
alphabet size, effectively a very short and fast hash of the current character/codepoint/combined letter.
Terje
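A simplified sketch of the search described above, using the Horspool variant of Boyer-Moore: two copies of the pattern (lower and upper case) and a skip table much smaller than the alphabet, indexed by a short hash of the code point. The table size, the hash in `slot` and the function names are assumptions, and unlike the length-encoded scheme described in the post, this sketch gives up on patterns whose case mappings change length (such as ß).

```python
TABLE_BITS = 6
TABLE_SIZE = 1 << TABLE_BITS  # 64 entries, far smaller than the alphabet


def slot(ch: str) -> int:
    """Tiny hash of a code point into the skip table (hypothetical choice)."""
    cp = ord(ch)
    return (cp ^ (cp >> TABLE_BITS)) & (TABLE_SIZE - 1)


def build_skip(lower: str, upper: str) -> list[int]:
    m = len(lower)
    skip = [m] * TABLE_SIZE
    for i in range(m - 1):
        for ch in (lower[i], upper[i]):
            s = slot(ch)
            # On hash collisions keep the smaller (safe) skip distance.
            skip[s] = min(skip[s], m - 1 - i)
    return skip


def find_caseless(text: str, pattern: str) -> int:
    """Return the index of the first caseless match, or -1 if none."""
    lower, upper = pattern.lower(), pattern.upper()
    if not pattern:
        raise ValueError("empty pattern")
    if len(lower) != len(upper):
        raise ValueError("case mapping changes length; not handled here")
    m, n = len(lower), len(text)
    skip = build_skip(lower, upper)
    pos = 0
    while pos + m <= n:
        if all(text[pos + i] in (lower[i], upper[i]) for i in range(m)):
            return pos
        # Shift by the skip recorded for the last character of the window.
        pos += skip[slot(text[pos + m - 1])]
    return -1
```

Because colliding characters share the smallest skip, a collision can only make the search slower, never cause it to miss a match.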