Forum: >>> Magnum BBS <<<

Arguments for a sane ISA 6-years later

From MitchAlsup1@21:1/5 to All on Wed Jul 24 20:37:06 2024

Just before Google Groups got spammed to death; I wrote::
--------------------------------------------------------
MitchAlsup
Nov 1, 2022, 5:53:02 PM

In a thread called "Arguments for a Sane Instruction Set Architecture"
Aug 7, 2017, 6:53:09 PM I wrote::
-----------------------------------------------------------------------
Looking back over my 40-odd year career in computer architecture,
I thought I would list out the typical errors I and others have
made with respect to architecting computers. This is going to be
a bit long, so bear with me:

When the Instruction Set architecture is Sane, there is support
for:
A) negating operands prior to an arithmetic calculation.
B) providing constants from the instruction stream;
..where constant can be an immediate, a displacement, or both.
C) exact floating point arithmetics that get the Inexact flag
..correctly unmolested.
D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read. Control transfer back to the suspended thread should
..be no longer than the control transfer to the exception handler.
E) Exception control transfer can transfer control directly to a
..user privilege thread without taking an excursion through the
..Operating System.
F) upon arrival at an exception handler, no state needs to be saved,
..and the "cause" of the exception is immediately available to the
..Exception handler.
G) Atomicity over a multiplicity of instructions and over a
..multiplicity of memory locations--without losing the
..illusion of real atomicity.
H) Elementary Transcendental functions are first class citizens of
..the instruction set, and at least faithfully accurate and perform
..at the same speeds as SQRT and DIV.
I) The "system programming model" is inherently:
..1) Virtual Machine
..2) Hypervisor + Supervisor
..3) multiprocessor, multithreaded
J) Simple applications can run with as little as 1 page of Memory
..Mapping overhead. An application like 'cat' can be run with
..a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
..and Register Files}
--------------------------------------------------------------------
<
I thought it might be fun to have a review of what came out of this::
<
At the time of that writing My 66000 ISA was still gestating in my
head--I was pretty much following the Mc 88000 Architecture in scope
and in format.
<
So; point by point::
<
A) negating operands prior to an arithmetic calculation.
1-operand instructions have sign control over result and of operand
2-operand instructions have sign control over both operands
3-operand instructions have sign control over two operands
So: check

B) providing constants from the instruction stream;
1-operand instructions have one <optional> immediate
2-operand instructions have one register and one <optional> immediate
3-operand instructions have two registers and one <optional> immediate
Loads have base register, index register and <optional> displacement
Stores have the same addressing, but the value being stored can be
....either from a register or from an immediate.
Many immediates have auto-expanding characteristics::
one can FADD Rd,Rs1,#3 to add 3.0D0 using a single 1-word
instruction. 32-bit immediates for (double) FP calculations are auto-
expanded to 64-bits in operand delivery.
Similarly, integer instructions have ±5-bit immediates, signed 16-bit immediates, 32-bit immediates and 64-bit immediates.
Memory references have 16-bit, 32-bit, and 64-bit displacements.
When Rbase = R0 IP is inserted for easy access to data relative to the
code stream.
So, big Check
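
A minimal sketch of what "auto-expanding" an immediate could look like in operand delivery. This is only a model: it assumes the 32-bit immediate is an IEEE 754 single that gets widened to a double, and that a small integer immediate such as #3 is simply converted to 3.0. The function names are hypothetical and nothing here is taken from the My 66000 encoding itself.

```python
import struct

def expand_f32_immediate(imm32):
    """Reinterpret a 32-bit pattern as an IEEE 754 single and widen it.

    Returns the value and the 64-bit pattern a double-precision unit
    would see (assumed behaviour, not the documented encoding).
    """
    (value,) = struct.unpack("<f", struct.pack("<I", imm32))
    (bits64,) = struct.unpack("<Q", struct.pack("<d", value))
    return value, bits64

def expand_small_int_immediate(imm):
    """Small signed integer immediate delivered as a double (#3 -> 3.0)."""
    return float(imm)

value, bits64 = expand_f32_immediate(0x3FC00000)   # 1.5 as a single
print(value, hex(bits64))
print(expand_small_int_immediate(3))
```

Widening a single to a double is exact, so the short encoding loses nothing for constants that fit in 32 bits.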
<
C) exact floating point arithmetics that get the Inexact flag
..correctly unmolested.
While CARRY provides access to these features (and the inexact bit
....gets set correctly), it is my current assessment that DBLE will be of
....greater use and utility than the exact FP arithmetics.
So, little check

D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read.
While the above is TRUE it is different than expected. Yes, a context
switch still takes 5-cache line reads, and context switch can transpire
from any thread under any GuestOS to any other thread under any other
GuestOS, all of this is "perpetrated" by a "fixed function unit" far
from the cores of the chip.
This fixed function unit combines the thread being scheduled, the
customer thread asking for service and the <appropriate> HyperVisor
data "assembled" into a single message that effects a context switch.
<
E) Exception control transfer can transfer control directly to a
..user privilege thread without taking an excursion through the
..Operating System.
This remains elusive--while it is technically possible to set up
"state" such that the above happens; it requires each such thread
run under its unique GuestOS. However, one can configure a rather
normal GuestOS so that the exception dispatcher transfers control
to a user level exception handler in 15-ish instructions.
So, medium check
-------------------------------------------------------------------
Update: My 66000 new interrupt architecture can now allow interrupts
or exceptions to be directed at Application privilege level. And this
is how Linux would deliver signal() to Applications.

In addition, VMexit()s need no diddling with interrupt or PCI control registers.

So, memory is virtualized, devices are virtualized, device DMA is
virtualized, device interrupts are virtualized, and the relation-
ship between cores and interrupt is virtualized; not needing any
diddling when one traverses up and down the privilege levels. The
only overhead is <rather static> mapping tables.
-------------------------------------------------------------------
<
F) upon arrival at an exception handler, no state needs to be saved,
..and the "cause" of the exception is immediately available to the
..Exception handler.
The above is TRUE and also comes with the property that multiple
exceptions can be logged onto a handler without Interrupt or
Exception disablement.
No state needs to be saved: Check
No state needs to be loaded: Check
Pertinence arrives with control: Check
Control arrives on affinitized core: Check
--------
unCheck
--------
Control arrives at proper priority: Check
Control arrives with proper "privilege": Check
Hard Real Time supported: Maybe
---------------------------
Closer to check than maybe.
---------------------------
Moderate Real Time Supported: Check
No extraneous excursions through OS: Check.
Overall: Big check
<
G) Atomicity over a multiplicity of instructions and over a
..multiplicity of memory locations--without losing the
..illusion of real atomicity.
Up to 8 cache lines participate in an ATOMIC event.
Multiple locations in each line may have state altered.
There is direct access to whether interference has transpired.
Software can use interference to drive down future interference.
Hardware can transfer control if ATOMICITY has been violated.
Essentially ANY atomic-primitive studied in academia or provided
by industry can be synthesized.
So, medium check
<
H) Elementary Transcendental functions are first class citizens of
..the instruction set, and at least faithfully accurate and perform
..at the same speeds as SQRT and DIV.
Transcendental functions operate at about the latency of FDIV
ln2, ln2P1, exp2, exp2M1 14 cycles
ln, ln10, exp, exp10 <and cousins> 18 cycles
sin, sinpi, cos, cospi 19 cycles {including Payne and Hanek argument
reduction}
tan, atan 19 or 38 cycles
power 35 cycles
23 Transcendental instructions are available in (float) and (double)
forms.
(float will be around 9 cycles)
So, reasonable check.
<
I) The "system programming model" is inherently:
..1) Virtual Machine
..2) Hypervisor + Supervisor
..3) multiprocessor, multithreaded
It is not only the above, but even moderately hard real time is built
in.
Interrupts are directed at threads not cores
------------------------------------------------------------------------
Turns out that Linux thinks interrupts are directed at cores and there
is essentially nothing anyone can do about that. My 66000 new system
model is much more Linux friendly at the cost of hard real time.
------------------------------------------------------------------------
Deferred Procedure Calls are single instruction events
--------
Check.
-------
Most handler->handler control transfers do not need an excursion through
the OS scheduler.
-----------------------------------------------------------------------
ISR schedules a softIRQ and then when it SVRs the softIRQ gains control
before what originally got interrupted, transitively, without having SW
traverse schedule queues.
-----------------------------------------------------------------------
Basically, if you have less than 1024 processes in a Linux system, the
lower level scheduler consumes no cycles on a second by second basis.
Context switch between threads under different hypervisors is the same
10-cycles as context switch between threads under the same GuestOS (10).
Conventional machines might take 1,000 cycles for a within GuestOS
context switch and 10,000 cycles on a between Guest OS context switch;
given 1,000 context switches per second, this accounts for a fraction
of a percent speed up.
So, moderate-big check
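
The "fraction of a percent" figure can be checked with a few lines of arithmetic. The cycle counts and the 1,000 switches per second come from the text above; the 2 GHz clock is purely an assumption for illustration, and the function name is hypothetical.

```python
def switch_overhead_fraction(switches_per_sec, cycles_per_switch, clock_hz):
    """Fraction of all cycles spent in context switching."""
    return switches_per_sec * cycles_per_switch / clock_hz

CLOCK_HZ = 2_000_000_000   # assumed clock rate, not from the post
SWITCHES = 1_000           # context switches per second (from the post)

for label, cycles in [("10-cycle switch", 10),
                      ("conventional, within GuestOS", 1_000),
                      ("conventional, between GuestOSs", 10_000)]:
    frac = switch_overhead_fraction(SWITCHES, cycles, CLOCK_HZ)
    print(f"{label}: {frac * 100:.5f}% of cycles")
```

Under that assumed clock even the 10,000-cycle case costs about half a percent, so the gain from a 10-cycle switch is real but small at this switch rate.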
<
J) Simple applications can run with as little as 1 page of Memory
..Mapping overhead.
Achievable even when different areas {.text, .data, .bss, .stack, ...}
are separated by GB or even TB.
So, check
-------------------------------------------------------------------
That is all.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Thu Jul 25 22:07:46 2024

On Thu, 25 Jul 2024 20:09:06 +0000, BGB wrote:

On 7/24/2024 3:37 PM, MitchAlsup1 wrote:

Just before Google Groups got spammed to death; I wrote::
--------------------------------------------------------
MitchAlsup
Nov 1, 2022, 5:53:02 PM

In a thread called "Arguments for a Sane Instruction Set Architecture"
Aug 7, 2017, 6:53:09 PM I wrote::
-----------------------------------------------------------------------
Looking back over my 40-odd year career in computer architecture,
I thought I would list out the typical errors I and others have
made with respect to architecting computers. This is going to be
a bit long, so bear with me:

When the Instruction Set architecture is Sane, there is support
for:
A) negating operands prior to an arithmetic calculation.

Not seen many doing this, and might not be workable in the general case.
Might make sense for FPU ops like FADD/FMUL.

Maybe 'ADD'. Though, "-(A+B)" is the only case that can't be expressed
with traditional ADD/SUB/RSUB.

a) one does not need a SUB or NEG instruction as one has:
ADD Rd,R1,R2
ADD Rd,R1,-R2
ADD Rd,-R1,R2
ADD Rd,-R1,-R2
Which basically gets rid of the unary NEG instruction.
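
A small model of the four forms above, to show that per-operand sign control covers SUB, reverse SUB, and -(A+B) with a single opcode. This is a sketch only: the function names and the 64-bit wraparound are assumptions for illustration, not the actual instruction semantics.

```python
def add(r1, r2, neg1=False, neg2=False, bits=64):
    """ADD with optional negation of each operand, wrapped to `bits` bits."""
    mask = (1 << bits) - 1
    a = -r1 if neg1 else r1
    b = -r2 if neg2 else r2
    return (a + b) & mask

def to_signed(x, bits=64):
    """Interpret an unsigned `bits`-wide pattern as two's complement."""
    return x - (1 << bits) if x >> (bits - 1) else x

print(to_signed(add(7, 5)))                          # ADD Rd,R1,R2
print(to_signed(add(7, 5, neg2=True)))               # ADD Rd,R1,-R2   (SUB)
print(to_signed(add(7, 5, neg1=True)))               # ADD Rd,-R1,R2   (RSUB)
print(to_signed(add(7, 5, neg1=True, neg2=True)))    # ADD Rd,-R1,-R2  (-(A+B))
```

In this model a unary NEG is just the third form with a zero second operand.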

B) providing constants from the instruction stream;
..where constant can be an immediate, a displacement, or both.

Probably true.

My ISA allows for Immediate or Displacement to be extended, but doesn't currently allow (in the base ISA) any instructions that can encode both
an immediate and displacement.

ST #3.14159265358927,[IP,R3<<3,#0x123456789abcd]

Here we have 5 instruction words storing 2 words anywhere in memory in
one instruction and one decode cycle; we waste no registers with the
constants. Looks to be 7 instructions in RISC-V including 2 LDDs...

At present:
Baseline allows Imm33s/Disp33s via a 64-bit encoding;
There is optional support for Imm57s, which in XG2 is now extended to
Imm64.

There are special cases that allow immediate encodings for many
instructions that would otherwise lack an immediate encoding.

C) exact floating point arithmetics that get the Inexact flag
..correctly unmolested.

Dunno. I suspect the whole global FPU status/control register thing
should probably be rethought somehow.

But, off-hand, don't know of a clearly better alternative.

D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read. Control transfer back to the suspended thread should
..be no longer than the control transfer to the exception handler.

Likely expensive...

Treat "thread state" and its register file as a write back cache.

Granted, "glorified branch with some twiddling" is probably a little too
far in the other direction. Interrupt and syscall overhead is fairly
high when the handler needs to manually save and restore all the
registers each time.

A fast, but more expensive, option would be to have multiple copies of
the register file which is then bank-switched on an interrupt.

Under My 66000 a low end implementation can choose the write back cache version, while the GBOoO implementation can choose the bank switcher.
In both cases, the same model is presented to executing SW.

One possibility here could be, rather than being hard-wired to specific modes, there are 4 assignable register banks controlled by 2 status
register bits.

Then, say:
0: User Task 1
1: User Task 2
2: Reserved for Kernel / Syscall Task;
3: Reserved for interrupts.

Possibly along with instructions to move between the banked registers
and the currently active register file.

Just memory map everything into MMI/O space where you have access to memorymove(to, from, count) capabilities and can move an entire
thread state in 1 instruction.

Though, likely cost would be that it would require putting the GPR
register file in Block-RAM and possibly needing to increase pipeline
length.

Just MMI/O

In an OS, the syscall and interrupt bank would likely be assigned
statically, and the others could be assigned dynamically by the
scheduler (though, as-is, would likely increase task-switch overhead vs
the current mechanism).

For SYSCALL in particular, you want at least 6 of the caller's registers
to pass arguments to the service provider, and at least 1 register to
return the result.

This situation could potentially be "better" if there were 8 dynamic
banks, with the scheduler potentially able to be clever and reuse banks
if they haven't been evicted and the same process is run again (but
could otherwise reassign them round-robin or similar).

The Write Back Cache model works easier.
<snip>

Though, can note that as-is, in my case, in some programs, system call overhead is high enough that all this could be worth looking into (Say:
Quake 3 manages to spend nearly 3% of the clock-cycle budget in the
SYSCALL ISR; mostly saving/restoring registers).

My SVC overhead is about 10 cycles.
VM exit overhead is also about 10 cycles.

E) Exception control transfer can transfer control directly to a
..user privilege thread without taking an excursion through the
..Operating System.

? Putting the scheduler in hardware?...

Policy remains in SW, the ability to manifest a SW choice fast is in HW.

Could make sense for a microcontroller, but less so for a conventional
OS as pretty much the only things handling interrupts are likely to be supervisor-mode drivers.

Signal handlers.

F) upon arrival at an exception handler, no state needs to be saved,
..and the "cause" of the exception is immediately available to the
..Exception handler.
G) Atomicity over a multiplicity of instructions and over a
..multiplicity of memory locations--without losing the
..illusion of real atomicity.

Memory consistency is hard...

It is simply a fully pipelined version of LL/SC

H) Elementary Transcendental functions are first class citizens of
..the instruction set, and at least faithfully accurate and perform
..at the same speeds as SQRT and DIV.

.... Yeah...

In my case, they don't exist, and FDIV and FSQRT are basically boat
anchors.

Well, I guess it could be possible to support them in the ISA if they
were all boat anchors.

Say:
FSIN Rm, Rn
Raises a TRAPFPU exception, whereupon the exception handler decodes the
instruction and performs the FSIN operation.

The trap is likely more cycles than FSIN().

I) The "system programming model" is inherently:
..1) Virtual Machine
..2) Hypervisor + Supervisor
..3) multiprocessor, multithreaded

If the system-mode architecture is low-level enough, the difference
between normal OS functionality and emulation starts to break down.

Like, in both cases one has:
Software page table walking;

How does one walk a nested page table when HV does not want OS to see
its mapping tables, and vice versa ??

Needing to keep track of a virtual model of the TLB;

TLB is an association of host.PTE with guest.virtual-address.

You can't have host or guest perform the TLB update !!

J) Simple applications can run with as little as 1 page of Memory
..Mapping overhead. An application like 'cat' can be run with
..a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
..and Register Files}

Hmm.

I guess one could make a case for a position-independent version of an "a.out" like format, focused on low-footprint binaries.

For the record, My 66000 code is PIC, including GOT, method calls, and
switch tables.


From Anton Ertl@21:1/5 to Chris M. Thomasson on Fri Jul 26 17:00:07 2024

"Chris M. Thomasson" <[email protected]> writes:

On 7/25/2024 1:09 PM, BGB wrote:

At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.

There is no guarantee of staleness, only a lack of stronger ordering guarantees.

The weak model is ideal for me. I know how to program for it

And the fact that this model is so hard to use that few others know
how to program for it makes it ideal for you.

and it's more efficient

That depends on the hardware.

Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.

Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware. That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.

By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.

and sometimes use cases do not care if they encounter "stale" data.

Great. Unless these "sometimes" cases are more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual communication between cores, the weak model
is still slower than a well-implemented strong model.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 26 20:59:06 2024

On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 7/25/2024 1:09 PM, BGB wrote:

At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.

There is no guarantee of staleness, only a lack of stronger ordering guarantees.

The weak model is ideal for me. I know how to program for it

And the fact that this model is so hard to use that few others know
how to program for it makes it ideal for you.

and it's more efficient

That depends on the hardware.

Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.

Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware.

According to Lamport; only the ATOMIC stuff needs sequential
consistency.
So, it is completely possible to have a causally consistent processor
that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when
doing atomic stuff.

That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.

The operations themselves are not slow. What is slow is delaying the
pipeline until it catches up to the stronger memory model before
proceeding.

By contrast, one can design hardware for strong ordering such that the slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.

How would you do this for a 256-way banked memory system of the
NEC SX ?? I.E., the processor is not in charge of memory order--
the memory system is.

and sometimes use cases do not care if they encounter "stale" data.

Great. Unless these "sometimes" cases are more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual communication between cores, the weak model
is still slower than a well-implemented strong model.

- anton


From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Jul 28 01:27:49 2024

On Sun, 28 Jul 2024 1:01:59 +0000, Paul A. Clayton wrote:

On 7/25/24 6:07 PM, MitchAlsup1 wrote:

On Thu, 25 Jul 2024 20:09:06 +0000, BGB wrote:

On 7/24/2024 3:37 PM, MitchAlsup1 wrote:

[snip]

D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read. Control transfer back to the suspended thread should
..be no longer than the control transfer to the exception handler.

[snip]

A fast, but more expensive, option would be to have multiple
copies of
the register file which is then bank-switched on an interrupt.

Under My 66000 a low end implementation can choose the write back
cache
version, while the GBOoO implementation can choose the bank switcher.
In both cases, the same model is presented to executing SW.

I do not know at what port count a "3D register file" (temporal
banking where extra storage "hides" under the wires) makes sense.
I suspect the 3-read, 1-write register file of a low end My 66000 implementation would have the overhead be too great unless lower
overhead context switching was extremely important.

The low end implementation has a single 4-ported register file.
When running code it is accessed as 3R-1W, but when context
switching it is accessed as 4R or 4W depending on the cycle.

The sequencer operates it like a write back cache, so if the
code has not used R16-R23 since receiving control <again>,
those registers are consistent with the already saved in memory
registers, and no writes are necessary.
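
A rough model of the write-back behaviour just described. It is a sketch under assumptions: the dirty tracking is per group of 8 registers (inferred from the R16-R23 example and one cache line holding 8×64 bits), "used" is taken to mean "written", and the class and method names are hypothetical.

```python
GROUP = 8  # registers per cache line: 8 x 64 bits (assumed granularity)

class RegFile:
    def __init__(self, saved_image):
        # Registers start out consistent with the copy saved in memory.
        self.regs = list(saved_image)
        self.dirty = [False] * (len(self.regs) // GROUP)

    def write(self, r, value):
        self.regs[r] = value
        self.dirty[r // GROUP] = True   # this group now differs from memory

    def switch_out(self, memory_image):
        """Write back only the dirty groups; return how many lines were written."""
        writes = 0
        for g, is_dirty in enumerate(self.dirty):
            if is_dirty:
                lo = g * GROUP
                memory_image[lo:lo + GROUP] = self.regs[lo:lo + GROUP]
                writes += 1
        self.dirty = [False] * len(self.dirty)
        return writes

image = [0] * 32
rf = RegFile(image)
rf.write(3, 42)     # touches R0-R7
rf.write(30, 7)     # touches R24-R31
print(rf.switch_out(image))   # R8-R23 untouched, so only 2 line writes
```

A thread that gains control, touches a few registers and leaves again pays only for the groups it actually modified.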

As to the higher end machine, there would be an SRAM organized
as 4 contexts of 32 registers each where each port can read
or write 8×64 bits per cycle, so to bank switch, one does
4 writes and then 4 reads.

In both cases, all the fancy stuff is hidden from SW.

In neither case are there more than 32 actual registers in the file
nor are there more ports than decoders.


From Anton Ertl@21:1/5 to BGB on Mon Jul 29 12:59:33 2024

BGB <[email protected]> writes:

On 7/26/2024 12:00 PM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

and it's more efficient

That depends on the hardware.

Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.

Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware. That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.

TSO requires more significant hardware complexity though.

An efficient implementation of TSO or sequential consistency requires
more hardware, yes.

Floating point requires more hardware than fixed point. Precise
exceptions require more hardware than imprecise exceptions. Caches
require more hardware than the local memory of Cell's SPEs. OoO
requires more hardware than in-order; in this case the IA-64
implementations demonstrated that you could then spend the area budget
on more in-order resources (and big caches) and still fail to keep up
on SPECint with the smaller OoO competition. In all these cases we
decided that the benefit is worth the additional hardware. I think
that's the case for strong memory ordering, too.

Seems like it would be harder to debug the hardware since:
There is more that has to go on in the hardware for TSO to work;
Software will have higher expectations that it actually work.

Possible. Delivering working hardware is the job of hardware
engineers. Intel and AMD apparently have no problems getting the TSO
parts of their architectures right. However, it seems that they don't
go for "really efficient" TSO, or they would just upgrade the parts of
their architecture with weaker consistency to have TSO.

- anton


From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon Jul 29 17:38:23 2024

On Mon, 29 Jul 2024 3:32:52 +0000, Chris M. Thomasson wrote:

On 7/26/2024 10:00 AM, Anton Ertl wrote:

"Chris M. Thomasson" <[email protected]> writes:

On 7/25/2024 1:09 PM, BGB wrote:

At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.

There is no guarantee of staleness, only a lack of stronger ordering
guarantees.

The weak model is ideal for me. I know how to program for it

And the fact that this model is so hard to use that few others know
how to program for it makes it ideal for you.

and it's more efficient

That depends on the hardware.

Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.

Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware. That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.

By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.

and sometimes use cases do not care if they encounter "stale" data.

Great. Unless these "sometimes" cases are more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual communication between cores, the weak model
is still slower than a well-implemented strong model.

A strong model? You mean I don't have to use any memory barriers at all?
Tell that to SPARC in RMO mode... How strong? Even the x86 requires a
membar when a store followed by a load to another location shall be
respected wrt order. Store-Load. #StoreLoad over on SPARC. ;^)

DRAM does not need this property, MMI/O does.

If you can force everything to be #StoreLoad (*) and make it faster than
a handcrafted algo on a very weak memory system, well, hats off! I
thought it was easier for a HW guy to implement weak consistency? At the
cost of the increased complexity wrt programming the sucker! ;^)

Or HW can have different order strengths based on where the PTE
sends the request. DRAM gets causal order, ATOMICs to DRAM get
sequential consistency, MMI/O gets sequential consistency,
Configuration gets strong ordering.

Programmer has to do nothing.
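
The idea of picking the order strength from where the PTE sends the request can be written down as a simple lookup. This is only an illustration of the mapping listed above; the table keys and the function name are hypothetical, not part of any real MMU interface.

```python
# Ordering strength per request target, as listed in the post above.
ORDER_BY_TARGET = {
    "DRAM": "causal",
    "DRAM_ATOMIC": "sequential",
    "MMIO": "sequential",
    "CONFIG": "strong",
}

def required_order(pte_target, is_atomic=False):
    """Ordering the hardware would enforce for one memory request."""
    if pte_target == "DRAM" and is_atomic:
        return ORDER_BY_TARGET["DRAM_ATOMIC"]
    return ORDER_BY_TARGET[pte_target]

print(required_order("DRAM"))                  # plain load/store
print(required_order("DRAM", is_atomic=True))  # ATOMIC event
print(required_order("MMIO"))
print(required_order("CONFIG"))
```

The decision is made per request from information the translation already carries, so no barrier instruction appears in the program.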

(*) Not just #StoreLoad for full consistency, you would need :

MEMBAR #StoreLoad | #LoadStore | #StoreStore | #LoadLoad

right?


From Anton Ertl@21:1/5 to BGB on Tue Jul 30 09:44:24 2024

BGB <[email protected]> writes:

Otherwise, stuff isn't going to fit into the FPGAs.

Something like TSO is a lot of complexity for not much gain.

Given that you are so constrained, the easiest corner to cut is to
have only one core. And then even sequential consistency is trivial
to implement.

Contrast, floating point and precise exceptions are a lot more relevant
to software.

John von Neumann (IIRC) argued against floating point, with similar
arguments that are now used to defend weak ordering.

The other examples I gave are all examples where people have argued
that simplifying hardware at the cost of more complex software was the
way to go, and history proved them wrong.

- anton


From jseigh@21:1/5 to Chris M. Thomasson on Tue Jul 30 15:59:23 2024

On 7/29/2024 2:55 PM, Chris M. Thomasson wrote:

However... There is "special" mutex logic that actually requires a #StoreLoad! Peterson's algorithm for example. Iirc, it needs a #StoreLoad because it depends on a store followed by a load to another location to hold true. This is a bit different than other locking algorithms...
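
To make the store-then-load dependence concrete, here is a small exhaustive search over a two-thread litmus test: each thread stores 1 to its own flag, then loads the other thread's flag. It is a simplified sketch, not Peterson's full algorithm, and it assumes a TSO-like machine where each thread has a FIFO store buffer that drains to memory at arbitrary times. With `fence=True` a thread may not pass the fence until its buffer is empty, which is what a #StoreLoad barrier buys.

```python
def outcomes(fence):
    """All (r0, r1) results reachable for the store-buffer litmus test."""
    prog = [[("store", "x"), ("load", "y")],    # thread 0
            [("store", "y"), ("load", "x")]]    # thread 1
    if fence:
        prog = [[p[0], ("fence", None), p[1]] for p in prog]

    # State: program counters, store buffers, set of addresses holding 1, registers.
    start = ((0, 0), ((), ()), frozenset(), (None, None))
    stack, seen, results = [start], {start}, set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        if all(pcs[t] == len(prog[t]) for t in (0, 1)):
            results.add(regs)
            continue
        for t in (0, 1):
            nexts = []
            if bufs[t]:                       # drain the oldest buffered store
                b = list(bufs)
                b[t] = bufs[t][1:]
                nexts.append((pcs, tuple(b), mem | {bufs[t][0]}, regs))
            if pcs[t] < len(prog[t]):         # execute the next instruction
                op, addr = prog[t][pcs[t]]
                p, b, r = list(pcs), list(bufs), list(regs)
                p[t] += 1
                ready = True
                if op == "store":
                    b[t] = bufs[t] + (addr,)  # store goes into the buffer first
                elif op == "load":
                    r[t] = int(addr in bufs[t] or addr in mem)
                else:                         # fence: wait for an empty buffer
                    ready = not bufs[t]
                if ready:
                    nexts.append((tuple(p), tuple(b), mem, tuple(r)))
            for s in nexts:
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
    return results

print(sorted(outcomes(fence=False)))
print(sorted(outcomes(fence=True)))
```

The result (0, 0) means both threads saw the other's flag as clear, which in a Peterson-style lock would let both into the critical section. In this model it is reachable only without the fence.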

Then there are more "exotic" methods such as so-called asymmetric mutexes. They can have fast paths and slow paths, so to speak. It's almost getting into the realm of RCU here... A fast path can be memory barrier free. The slow path can make things consistent with the use of so called "remote" memory barriers. It's funny that Windows seems to have one:

https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

;^)

The slow path is meant to not be frequently used, hence the term asymmetric. On par with read/write logic... :^)

The folly library hazard pointers use that on Windows, membarrier() system call on Linux (something else on older Linuxes), to get rid of the expensive store/load memory barrier in hazard pointer loads. Something like 0.7 nsecs w/o membar vs 7.7 w/ membar. The term I've seen being used now is asymmetric memory barrier.

Joe Seigh


From jseigh@21:1/5 to Chris M. Thomasson on Tue Jul 30 21:27:13 2024

On 7/30/2024 4:26 PM, Chris M. Thomasson wrote:

On 7/30/2024 12:59 PM, jseigh wrote:

The folly library hazard pointers use that on windows, membarrier() system call on linux (something else on older linuxes), to get rid of the expensive store/load memory barrier in hazard pointer loads.

I need to check that out; thanks for the heads up. Fwiw, remember that old thread on comp.programming.threads where you first'ish published your ideas on RCU+SMR? I need to see if the folly library references your work. Also, remember when some paper from SUN or something was trying to claim your atomic_ptr logic? Iirc, we talked about it back on comp.arch, a long time ago...

I remember you issued a "challenge like" post over on comp.programming.threads wrt detecting quiescent periods. Iirc, I was the first one to comment wrt a possible hackish solution using timing wrt kernel time. ;^)

Something like 0.7 nsecs w/o membar vs 7.7 w/ membar. The term I've seen being used now is asymmetric memory barrier.

Big time! This is bringing back a lot of memories Joe. :^) Thanks.

I finally got around to writing a proxy version, smrproxy, here https://github.com/jseigh . Also there's an atomic reference counted proxy, arcproxy. Timings are here https://threadnought.wordpress.com/2023/06/09/smrproxy-timing-comparisons/ .

smr is what I use to refer to hazard pointers. In the original smrrcu, the rcu referred to the rcu polling of context switches which had the property of performing a memory barrier action. I had to go through the linux proc filesystem for that. Talk about pain. I was really glad that somebody implemented membarrier().

I also did a variation where you used local counters like when we were first messing with userspace rcu. About the same performance but with an extra polling cycle (events vs conditions). I tried to put in the same code as smrproxy but pseudo OO in C gets messy when you try to implement chimerical types, so I yanked it out.

atomic-ptr-plus is there but it's been copied from sourceforge to google to github. I don't know if it's still intact.

There's no attributions to anything in folly. It will probably end up like what's now called split reference counting which is now folklore according to a cppcon talk.

Joe Seigh


From Anton Ertl@21:1/5 to BGB on Thu Aug 1 17:10:28 2024

BGB <[email protected]> writes:

Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).

The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 1 17:39:24 2024

[email protected] (Anton Ertl) writes:

BGB <[email protected]> writes:

Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).

The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).


From Michael S@21:1/5 to Anton Ertl on Thu Aug 1 23:08:25 2024

On Thu, 01 Aug 2024 17:10:28 GMT
[email protected] (Anton Ertl) wrote:

BGB <[email protected]> writes:

Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the
rasterizer module. It has its own internal caches that need to be
flushed, and not flushing caches (between this module and CPU) when
trying to "transfer" control over things like the framebuffer or
Z-buffer, can result in obvious graphical issues (and,
texture-corruption doesn't necessarily look good either).

The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.

- anton

In theory WT regions can be used for frame buffers, but I would think
that in the real world the overwhelming majority of [the few remaining
Direct IO] frame buffer applications use write-combining (WC) regions.

As a reminder for those of us who have not recently re-read the relevant
sections of the manual: architecturally, WC regions are weakly ordered
and uncached. WT regions, on the other hand, adhere to the same x86-TSO
memory ordering model as WB regions.
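
For reference, the distinction can be written down as a small lookup table. This is only a sketch: the numeric values are the x86 memory type encodings used by the MTRRs, while the struct, the field names and `needs_store_fence` are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* x86 memory type encodings (as programmed into MTRRs). */
enum x86_mem_type { MT_UC = 0, MT_WC = 1, MT_WT = 4, MT_WB = 6 };

struct mem_type_info {
    enum x86_mem_type type;
    const char *name;
    bool cacheable;       /* may lines of this region live in the cache? */
    bool weakly_ordered;  /* outside the normal x86-TSO store ordering?  */
};

static const struct mem_type_info mem_types[] = {
    { MT_UC, "uncacheable",     false, false },
    { MT_WC, "write-combining", false, true  },
    { MT_WT, "write-through",   true,  false },
    { MT_WB, "write-back",      true,  false },
};

const struct mem_type_info *mem_type_lookup(enum x86_mem_type t)
{
    for (size_t i = 0; i < sizeof mem_types / sizeof mem_types[0]; i++)
        if (mem_types[i].type == t)
            return &mem_types[i];
    return NULL;
}

/* Stores to a weakly ordered region need an explicit fence (e.g. SFENCE)
   before another agent is told that the data is ready. */
bool needs_store_fence(enum x86_mem_type t)
{
    const struct mem_type_info *info = mem_type_lookup(t);
    return info != NULL && info->weakly_ordered;
}
```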

I don't believe that designers of iAMD64 CPUs pay much attention to
performance of WT regions, because of the absence of a killer app.


From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Aug 1 20:06:10 2024

On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

[email protected] (Anton Ertl) writes:

BGB <[email protected]> writes:

Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).

The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

In My 66000 I use the term allocate wrt cache lines as a line that
resides in the cache that may NOT have a defined DRAM address.
These lines can float around the cache hierarchy as desired.

BOOT uses the caches as main memory while it rummages through PCIe
looking for devices (and now memory), and while DRAM is discovered,
initialized, tuned, and made ready for general use. These lines
have PTEs with the 'Allocate' cache specifier.

The Call Stack uses Allocate cache lines which have the property
that if they are freed before they are written to DRAM, they can
be discarded instead of being written back. These cache lines
have a PTE with both the 'allocate' specifier, and RWE = 000 so
the application cannot LD or ST to these lines--only prologue
and epilogue instructions are allowed access, but these lines
do have an actual DRAM address.

..

So, what definition does ARM apply to 'allocate' ??


From Scott Lurndal@21:1/5 to [email protected] on Thu Aug 1 20:34:28 2024

[email protected] (MitchAlsup1) writes:

On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

[email protected] (Anton Ertl) writes:

BGB <[email protected]> writes:

Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).

The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

"allocate a cache line".

Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.

Used when software expects the DMA data to be immediately.

"no allocate" for CPU initiated stores/loads would be equivalent
to write-through-but-do-not-allocate-a-line-for-it (e.g
non-temporal stores/loads). There are instructions for
individual N/T accesses, but with the region attributes it
can be applied to normal loads/stores for a whole page
or set of pages.
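
As an illustration of the per-instruction flavour of non-temporal access, here is a small sketch using x86 SSE2 intrinsics. It does not show the ARM region attributes described above; `fill_streaming` is a hypothetical helper name. It falls back to `memset()` when SSE2 is unavailable or the buffer is not 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill n bytes at dst with value, using non-temporal stores where
   possible so the written lines are not allocated in the cache. */
void fill_streaming(void *dst, uint8_t value, size_t n)
{
#if defined(__SSE2__)
    if (((uintptr_t)dst % 16) == 0 && (n % 16) == 0) {
        __m128i v = _mm_set1_epi8((char)value);
        __m128i *p = (__m128i *)dst;
        for (size_t i = 0; i < n / 16; i++)
            _mm_stream_si128(p + i, v); /* non-temporal store (MOVNTDQ) */
        _mm_sfence();                   /* order the streamed stores */
        return;
    }
#endif
    memset(dst, value, n); /* ordinary, cache-allocating stores */
}
```

With a page-level "no allocate" attribute the same effect applies to ordinary stores, so code like this would not need the special instructions.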


From Michael S@21:1/5 to [email protected] on Thu Aug 1 23:31:39 2024

On Thu, 1 Aug 2024 20:06:10 +0000
[email protected] (MitchAlsup1) wrote:

So, what definition does ARM apply to 'allocate' ??

I suppose, the same as most of the CS world - an event that causes
association of a line in the cache with a particular address.
https://developer.arm.com/documentation/den0013/d/Caches/Cache-policies/Allocation-policy
For comparison:
https://www.intel.com/content/www/us/en/docs/programmable/814346/24-2/cache-allocation-policy.html

Sounds like Arm and Intel agree on the meaning of the word.
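
A toy model may help pin the word down. The sketch below is a direct-mapped cache that tracks only tags; the sizes, the struct and the function names are invented for illustration. "Allocate" is the step where a line becomes associated with an address on a miss, and the read/write allocate policy decides whether that step happens.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u
#define NUM_LINES  16u

struct toy_cache {
    bool     valid[NUM_LINES];
    uint64_t tag[NUM_LINES];
    bool     read_allocate;   /* allocate a line on a read miss?  */
    bool     write_allocate;  /* allocate a line on a write miss? */
};

/* Returns true on a hit. On a miss, associates the line with addr
   (i.e. allocates it) only if the policy for this access says so. */
static bool toy_access(struct toy_cache *c, uint64_t addr, bool allocate_on_miss)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned idx = (unsigned)(line % NUM_LINES);

    if (c->valid[idx] && c->tag[idx] == line)
        return true;
    if (allocate_on_miss) {
        c->valid[idx] = true;
        c->tag[idx] = line;
    }
    return false;
}

bool toy_read(struct toy_cache *c, uint64_t addr)
{
    return toy_access(c, addr, c->read_allocate);
}

bool toy_write(struct toy_cache *c, uint64_t addr)
{
    return toy_access(c, addr, c->write_allocate);
}
```

With `write_allocate` set, a write miss followed by a read of the same line hits; with it cleared, the read misses and performs the allocation itself.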


From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Aug 1 23:40:33 2024

On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

[email protected] (Anton Ertl) writes:

BGB <[email protected]> writes:

Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).

The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

"allocate a cache line".

Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.

Used when software expects the DMA data to be immediately.

Thanks for the explanation.

In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.
Meanwhile, cores (or other devices) can access it directly
from LLC as if it were from DRAM except at lower latency.

"no allocate" for CPU initiated stores/loads would be equivalent
to write-through-but-do-not-allocate-a-line-for-it (e.g
non-temporal stores/loads). There are instructions for
individual N/T accesses, but with the region attributes it
can be applied to normal loads/stores for a whole page
or set of pages.

When using VVM, vector LDs and STs are considered non-temporal
if the loop-count is greater than the cache size, alleviating
the programmer from needing to know.

When using memmove() or memset() data is moved on page sized
boundaries over the "bus".


From Michael S@21:1/5 to [email protected] on Fri Aug 2 12:31:26 2024

On Thu, 1 Aug 2024 23:40:33 +0000
[email protected] (MitchAlsup1) wrote:

On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

[email protected] (Anton Ertl) writes:

BGB <[email protected]> writes:

Some amount of the cases where consistency issues have come up
in my case have to do with RAM-backed hardware devices, like the
rasterizer module. It has its own internal caches that need to
be flushed, and not flushing caches (between this module and
CPU) when trying to "transfer" control over things like the
framebuffer or Z-buffer, can result in obvious graphical issues
(and, texture-corruption doesn't necessarily look good either).

The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the
way to go.

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read
allocate', 'write allocate' as well as having optionally
multiple coherency domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

"allocate a cache line".

Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.

Used when software expects the DMA data to be immediately.

Thanks for the explanation.

In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.
Meanwhile, cores (or other devices) can access it directly
from LLC as if it were from DRAM except at lower latency.

Sounds like a memory-side cache. Intel Broadwell and a few Skylake
models with Iris Pro 580 GPU had 128MB of eDRAM cache operating in
that manner.

It is useful in many applications, esp. bandwidth constrained, but not
in OLTP or similar enterprise apps on multi-socket SMP hardware.

BTW, what do you plan to do when a single die has multiple independent
memory channels/controllers? Is your LLC statically split between
channels or shared?


From Scott Lurndal@21:1/5 to [email protected] on Fri Aug 2 14:05:25 2024

[email protected] (MitchAlsup1) writes:

On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

"allocate a cache line".

Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.

Used when software expects the DMA data to be immediately.

Thanks for the explanation.

In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.

I'm not sure that's a good idea. Large DMAs are common
(e.g. reading pages of data in a single I/O) and the data
from the DMA is not always used by the CPU. Evicting LLC lines to
accommodate a file copy, for example, seems less than optimal.

When using memmove() or memset() data is moved on page sized
boundaries over the "bus".

IME the majority of memset calls are for relatively small
(less than a page) regions.


From MitchAlsup1@21:1/5 to Scott Lurndal on Sun Aug 4 19:38:44 2024

On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

"allocate a cache line".

Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.

Used when software expects the DMA data to be immediately.

Thanks for the explanation.

In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.

I'm not sure that's a good idea. Large DMAs are common
(e.g. reading pages of data in a single I/O) and the data
from the DMA is not always used by the CPU. Evicting LLC lines to
accommodate a file copy, for example, seems less than optimal.

Fair enough. But after thinking about this for a while, does the
process performing the file copy even know it is doing a file
copy ?? for example::

cat ../mydir/myfile > ../yourdir/yourfile

Which kind of applications know they are doing Input that will
not be used rather presently ??

It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.

When using memmove() or memset() data is moved on page sized
boundaries over the "bus".

IME the majority of memset calls are for relatively small
(less than a page) regions.

Yes, but the interconnect is designed to move large chunks
atomically. And the size of that chunk is "within a page
boundary"


From Scott Lurndal@21:1/5 to [email protected] on Sun Aug 4 23:28:55 2024

[email protected] (MitchAlsup1) writes:

On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

"allocate a cache line".

Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.

Used when software expects the DMA data to be immediately.

Thanks for the explanation.

In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.

I'm not sure that's a good idea. Large DMAs are common
(e.g. reading pages of data in a single I/O) and the data
from the DMA is not always used by the CPU. Evicting LLC lines to
accommodate a file copy, for example, seems less than optimal.

Fair enough. But after thinking about this for a while, does the
process performing the file copy even know it is doing a file
copy ?? for example::

cat ../mydir/myfile > ../yourdir/yourfile

It's not the application that matters. It's how the kernel
handles file data accesses. Most kernels (unix, linux and NT)
will read page-sized chunks (or multiples thereof) into kernel memory
buffers. In the case of cat, it's using stdio, so there is
another level of buffering in the C library.

So, cat does a 'getc', getc looks at the C library buffer, and if the
buffer is empty or fully consumed, it will tell the kernel to
read another 1k (legacy unix) or 4k (linux) of data from the
file into the C library buffer.

The kernel will see the request from usermode and will load
the corresponding page-sized chunk of data from the underlying
disk file sectors, if the page cache doesn't already hold the
page(s) containing the requested data.
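
The user-space half of that path looks roughly like the sketch below: a cat-style loop that only ever calls getc/putc, leaving it to the C library to decide when to ask the kernel for more data. `copy_stream` is a hypothetical name, and the buffer sizes mentioned above are whatever the C library chooses unless the caller overrides them with `setvbuf()`.

```c
#include <stdio.h>

/* Copy in to out one character at a time, as a simple cat would.
   getc() only triggers a read from the kernel when the stdio buffer
   is empty; putc() only triggers a write when its buffer fills or is
   flushed. Returns the number of bytes copied, or -1 on error. */
long copy_stream(FILE *in, FILE *out)
{
    long n = 0;
    int ch;

    while ((ch = getc(in)) != EOF) {
        if (putc(ch, out) == EOF)
            return -1;
        n++;
    }
    if (ferror(in) || fflush(out) == EOF)
        return -1;
    return n;
}
```

Calling `setvbuf(in, NULL, _IOFBF, 4096)` before the first read would request a 4k fully buffered stream explicitly.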

Which kind of applications know they are doing Input that will
not be used rather presently ??

Applications that care about I/O performance use various
mechanisms (O_DIRECT, mmap, etc) to eliminate one or both
levels of intermediate buffering. The madvise system call
can be used to inform the kernel of the expected access
pattern to allow the kernel to optimize its caching policies.
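
A small sketch of the mmap plus madvise combination, assuming a POSIX system. `count_newlines_mmap` is a made-up example function; mapping the file removes the C library buffer from the path, and `MADV_SEQUENTIAL` tells the kernel to expect a front-to-back scan. The hint is advisory, so its return value is ignored here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count '\n' bytes in a file by mapping it instead of using stdio.
   Returns the count, or -1 on error. */
long count_newlines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap of length 0 is an error */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid */
    if (p == MAP_FAILED)
        return -1;

    /* Hint: pages will be touched once, in order. */
    (void)madvise(p, len, MADV_SEQUENTIAL);

    long n = 0;
    for (size_t i = 0; i < len; i++)
        n += (p[i] == '\n');

    munmap(p, len);
    return n;
}
```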


From Stephen Fuld@21:1/5 to All on Mon Aug 5 05:13:47 2024

MitchAlsup1 wrote:

On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read
allocate', 'write allocate' as well as having optionally
multiple coherency domains (inner and outer sharable).

Sorry, I don't understand the word 'allocate' ?!?

"allocate a cache line".

Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.

Used when software expects the DMA data to be immediately.

Thanks for the explanation.

In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.

I'm not sure that's a good idea. Large DMAs are common
(e.g. reading pages of data in a single I/O) and the data
from the DMA is not always used by the CPU. Evicting LLC lines to
accommodate a file copy, for example, seems less than optimal.

Fair enough. But after thinking about this for a while, does the
process performing the file copy even know it is doing a file
copy ?? for example::

cat ../mydir/myfile > ../yourdir/yourfile

Which kind of applications know they are doing Input that will
not be used rather presently ??

It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.

I suppose you could create a mechanism that fed the data from the
"read" DMA directly to the "write" DMA, thus bypassing not only the
cache, but saving DRAM bandwidth as well. This would help on
copies, and perhaps things like defrag and backup. But I suspect that
the savings are not worth the effort.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Aug 5 14:26:00 2024

"Stephen Fuld" <[email protected]d> writes:

MitchAlsup1 wrote:

It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.

I suppose you could create a mechanism that fed the data from the
"read" DMA directly to the "write" DMA, thus bypassing not only the
cache, but saving DRAM bandwidth as well. This would help on
copies, and perhaps things like defrag and backup. But I suspect that
the savings are not worth the effort.

It would be more logical, I think, to simply build the functionality
into the controller (when the source and destination are devices
attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
that sort of functionality was available on some SCSI controllers.

For the case where devices are on multiple controllers, PCI express
peer-to-peer would be the appropriate solution. There's no need
for the CPU and cache complex to be involved at all.

Shades of channel programs...


From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Aug 5 17:41:34 2024

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

MitchAlsup1 wrote:

It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.

I suppose you could create a mechanism that fed the data from the
"read" DMA directly to the "write" DMA, thus bypassing not only the
cache, but saving DRAM bandwidth as well. This would help on
copies, and perhaps things like defrag and backup. But I suspect
that the savings are not worth the effort.

It would be more logical, I think, to simply build the functionality
into the controller (when the source and destination are devices
attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
that sort of functionality was available on some SCSI controllers.

For the case where devices are on multiple controllers, PCI express
peer-to-peer would be the appropriate solution. There's no need
for the CPU and cache complex to be involved at all.

Yes, thank you. The PCI Express option was the kind of thing I was
thinking of. Since it is more general than the "in controller option",
if you implement it at the PCI level, then you don't need the
controller option.

But even though the savings are real, given the limited use case for
the feature, I question if it is worth the trouble.

Shades of channel programs...

Not nearly as flexible as channel programs, nor with their overhead.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From MitchAlsup1@21:1/5 to Stephen Fuld on Mon Aug 5 20:05:53 2024

On Mon, 5 Aug 2024 17:41:34 +0000, Stephen Fuld wrote:

Scott Lurndal wrote:

"Stephen Fuld" <[email protected]d> writes:

MitchAlsup1 wrote:

It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.

I suppose you could create a mechanism that fed the data from the
"read" DMA directly to the "write" DMA, thus bypassing not only the
cache, but saving DRAM bandwidth as well. This would help on
copies, and perhaps things like defrag and backup. But I suspect
that the savings are not worth the effort.

It would be more logical, I think, to simply build the functionality
into the controller (when the source and destination are devices
attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
that sort of functionality was available on some SCSI controllers.

For the case where devices are on multiple controllers, PCI express
peer-to-peer would be the appropriate solution. There's no need
for the CPU and cache complex to be involved at all.

Yes, thank you. The PCI Express option was the kind of thing I was
thinking of. Since it is more general than the "in controller option",
if you implement it at the PCI level, then you don't need the
controller option.

Done right, it is just a part of I/O MMU address translation.

But even though the savings are real, given the limited use case for
the feature, I question if it is worth the trouble.

With an I/O MMU it pretty much drops out for free.

Shades of channel programs...

Not nearly as flexible as channel programs, nor with their overhead.
