• Arguments for a sane ISA 6-years later

    From MitchAlsup1@21:1/5 to All on Wed Jul 24 20:37:06 2024
    Just before Google Groups got spammed to death; I wrote::
    --------------------------------------------------------
    MitchAlsup
    Nov 1, 2022, 5:53:02 PM

    In a thread called "Arguments for a Sane Instruction Set Architecture"
    Aug 7, 2017, 6:53:09 PM I wrote::
    -----------------------------------------------------------------------
    Looking back over my 40-odd year career in computer architecture,
    I thought I would list out the typical errors I and others have
    made with respect to architecting computers. This is going to be
    a bit long, so bear with me:

    When the Instruction Set architecture is Sane, there is support
    for:
    A) negating operands prior to an arithmetic calculation.
    B) providing constants from the instruction stream;
    ..where constant can be an immediate, a displacement, or both.
    C) exact floating point arithmetics that get the Inexact flag
    ..correctly unmolested.
    D) exception and interrupt control transfer should take no more
    ..than 1 cache line read followed by 4 cache line reads to the
    ..same page in DRAM/L3/L2 that are dependent on the first cache
    ..line read. Control transfer back to the suspended thread should
    ..be no longer than the control transfer to the exception handler.
    E) Exception control transfer can transfer control directly to a
    ..user privilege thread without taking an excursion through the
    ..Operating System.
    F) upon arrival at an exception handler, no state needs to be saved,
    ..and the "cause" of the exception is immediately available to the
    ..Exception handler.
    G) Atomicity over a multiplicity of instructions and over a
    ..multiplicity of memory locations--without losing the
    ..illusion of real atomicity.
    H) Elementary Transcendental functions are first class citizens of
    ..the instruction set, and at least faithfully accurate and perform
    ..at the same speeds as SQRT and DIV.
    I) The "system programming model" is inherently:
    ..1) Virtual Machine
    ..2) Hypervisor + Supervisor
    ..3) multiprocessor, multithreaded
    J) Simple applications can run with as little as 1 page of Memory
    ..Mapping overhead. An application like 'cat' can be run with
    ..a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
    ..and Register Files}
    --------------------------------------------------------------------
    <
    I thought it might be fun to have a review of what came out of this::
    <
    At the time of that writing My 66000 ISA was still gestating in my
    head--I was pretty much following the Mc 88000 Architecture in scope
    and in format.
    <
    So; point by point::
    <
    A) negating operands prior to an arithmetic calculation.
    1-operand instructions have sign control over result and of operand
    2-operand instructions have sign control over both operands
    3-operand instructions have sign control over two operands
    So: check

    B) providing constants from the instruction stream;
    1-operand instructions have one <optional> immediate
    2-operand instructions have one register and one <optional> immediate
    3-operand instructions have two registers and one <optional> immediate
    Loads have base register, index register and <optional> displacement
    Stores have the same addressing, but the value being stored can be
    ....either from a register or from an immediate.
    Many immediates have auto-expanding characteristics::
    one can FADD Rd,Rs1,#3 to add 3.0D0 using a single 1-word
    instruction. 32-bit immediates for (double) FP calculations are auto-
    expanded to 64-bits in operand delivery.
    Similarly, integer instructions have ±5-bit immediates, signed 16-bit immediates, 32-bit immediates and 64-bit immediates.
    Memory references have 16-bit, 32-bit, and 64-bit displacements.
    When Rbase = R0 IP is inserted for easy access to data relative to the
    code stream.
    So, big Check
    <
    C) exact floating point arithmetics that get the Inexact flag
    ..correctly unmolested.
    While CARRY provides access to these features (and the inexact bit
    ....gets set correctly), it is my current assessment that DBLE will be
    ....of greater use and utility than the exact FP arithmetics.
    So, little check

    D) exception and interrupt control transfer should take no more
    ..than 1 cache line read followed by 4 cache line reads to the
    ..same page in DRAM/L3/L2 that are dependent on the first cache
    ..line read.
    While the above is TRUE it is different than expected. Yes, a context
    switch still takes 5-cache line reads, and context switch can transpire
    from any thread under any GuestOS to any other thread under any other
    GuestOS, all of this is "perpetrated" by a "fixed function unit" far
    from the cores of the chip.
    This fixed function unit combines the thread being scheduled, the
    customer thread asking for service and the <appropriate> HyperVisor
    data "assembled" into a single message that effects a context switch.
    <
    E) Exception control transfer can transfer control directly to a
    ..user privilege thread without taking an excursion through the
    ..Operating System.
    This remains elusive--while it is technically possible to set up
    "state" such that the above happens; it requires each such thread
    run under its unique GuestOS. However, one can configure a rather
    normal GuestOS so that the exception dispatcher transfers control
    to a user level exception handler in 15-ish instructions.
    So, medium check
    -------------------------------------------------------------------
    Update: My 66000 new interrupt architecture can now allow interrupts
    or exceptions to be directed at Application privilege level. And this
    is how Linux would deliver signal() to Applications.

    In addition, VMexit()s need no diddling with interrupt or PCI control registers.

    So, memory is virtualized, devices are virtualized, device DMA is
    virtualized, device interrupts are virtualized, and the relation-
    ship between cores and interrupt is virtualized; not needing any
    diddling when one traverses up and down the privilege levels. The
    only overhead is <rather static> mapping tables.
    -------------------------------------------------------------------
    <
    F) upon arrival at an exception handler, no state needs to be saved,
    ..and the "cause" of the exception is immediately available to the
    ..Exception handler.
    The above is TRUE and also comes with the property that multiple
    exceptions can be logged onto a handler without Interrupt or
    Exception disablement.
    No state needs to be saved: Check
    No state needs to be loaded: Check
    Pertinence arrives with control: Check
    Control arrives on affinitized core: Check
    --------
    unCheck
    --------
    Control arrives at proper priority: Check
    Control arrives with proper "privilege": Check
    Hard Real Time supported: Maybe
    ---------------------------
    Closer to check than maybe.
    ---------------------------
    Moderate Real Time Supported: Check
    No extraneous excursions through OS: Check.
    Overall: Big check
    <
    G) Atomicity over a multiplicity of instructions and over a
    ..multiplicity of memory locations--without losing the
    ..illusion of real atomicity.
    Up to 8 cache lines participate in an ATOMIC event.
    Multiple locations in each line may have state altered.
    There is direct access to whether interference has transpired.
    Software can use interference to drive down future interference.
    Hardware can transfer control if ATOMICITY has been violated.
    Essentially ANY atomic-primitive studied in academia or provided
    by industry can be synthesized.
    So, medium check
    <
    H) Elementary Transcendental functions are first class citizens of
    ..the instruction set, and at least faithfully accurate and perform
    ..at the same speeds as SQRT and DIV.
    Transcendental functions operate at about the latency of FDIV
    ln2, ln2P1, exp2, exp2M1 14 cycles
    ln, ln10, exp, exp10 <and cousins> 18 cycles
    sin, sinpi, cos, cospi 19 cycles {including Payne and Hanek argument
    reduction}
    tan, atan 19 or 38 cycles
    power 35 cycles
    23 Transcendental instructions are available in (float) and (double)
    forms.
    (float will be around 9 cycles)
    So, reasonable check.
    <
    I) The "system programming model" is inherently:
    ..1) Virtual Machine
    ..2) Hypervisor + Supervisor
    ..3) multiprocessor, multithreaded
    It is not only the above, but even moderately hard real time is built
    in.
    Interrupts are directed at threads not cores
    ------------------------------------------------------------------------
    Turns out that Linux thinks interrupts are directed at cores and there
    is essentially nothing anyone can do about that. My 66000 new system
    model is much more Linus friendly at the cost of hard real time.
    ------------------------------------------------------------------------
    Deferred Procedure Calls are single instruction events
    --------
    Check.
    -------
    Most handler->handler control transfers do not need an excursion through
    the OS scheduler.
    -----------------------------------------------------------------------
    ISR schedules a softIRQ and then when it SVRs the softIRQ gains control
    before what originally got interrupted, transitively, without having SW
    traverse schedule queues.
    -----------------------------------------------------------------------
    Basically, if you have less than 1024 processes in a Linux system, the
    lower level scheduler consumes no cycles on a second by second basis.
    Context switch between threads under different hypervisors is the same
    10-cycles as context switch between threads under the same GuestOS (10).
    Conventional machines might take 1,000 cycles for a within GuestOS
    context switch and 10,000 cycles on a between Guest OS context switch;
    given 1,000 context switches per second, this accounts for a fraction
    of a percent speed up.
    So, moderate-big check
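
    The "fraction of a percent" figure above can be checked with a short
    calculation. This is only a sketch: the cycle counts and the 1,000
    switches per second come from the post, while the 1 GHz clock and the
    helper name overhead_fraction are assumptions made for illustration.

```python
def overhead_fraction(switches_per_sec, cycles_per_switch, clock_hz):
    """Fraction of all available cycles spent purely on context switching."""
    return switches_per_sec * cycles_per_switch / clock_hz

CLOCK_HZ = 1_000_000_000   # assumption: 1 GHz core (not stated in the post)
SWITCHES = 1_000           # context switches per second (from the post)

# Cycle counts per switch, as quoted in the post.
within_guest = overhead_fraction(SWITCHES, 1_000, CLOCK_HZ)
between_guest = overhead_fraction(SWITCHES, 10_000, CLOCK_HZ)
fast_switch = overhead_fraction(SWITCHES, 10, CLOCK_HZ)

for name, frac in [("conventional, within GuestOS", within_guest),
                   ("conventional, between GuestOS", between_guest),
                   ("10-cycle switch", fast_switch)]:
    print(f"{name:>30}: {frac * 100:.4f}% of cycles")
```

    Under the assumed 1 GHz clock, the conventional cases cost 0.1% and 1%
    of all cycles, against 0.001% for a 10-cycle switch, so the saving is
    somewhere between a tenth of a percent and about one percent.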
    <
    J) Simple applications can run with as little as 1 page of Memory
    ..Mapping overhead.
    Achievable even when different areas {.text, .data, .bss, .stack, ...}
    are separated by GB or even TB.
    So, check
    -------------------------------------------------------------------
    That is all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Jul 25 22:07:46 2024
    On Thu, 25 Jul 2024 20:09:06 +0000, BGB wrote:

    On 7/24/2024 3:37 PM, MitchAlsup1 wrote:
    Just before Google Groups got spammed to death; I wrote::
    --------------------------------------------------------
    MitchAlsup
    Nov 1, 2022, 5:53:02 PM

    In a thread called "Arguments for a Sane Instruction Set Architecture"
    Aug 7, 2017, 6:53:09 PM I wrote::
    -----------------------------------------------------------------------
    Looking back over my 40-odd year career in computer architecture,
    I thought I would list out the typical errors I and others have
    made with respect to architecting computers. This is going to be
    a bit long, so bear with me:

    When the Instruction Set architecture is Sane, there is support
    for:
    A) negating operands prior to an arithmetic calculation.

    Not seen many doing this, and might not be workable in the general case.
    Might make sense for FPU ops like FADD/FMUL.

    Maybe 'ADD'. Though, "-(A+B)" is the only case that can't be expressed
    with traditional ADD/SUB/RSUB.

    a) one does not need a SUB or NEG instruction as one has:
    ADD Rd,R1,R2
    ADD Rd,R1,-R2
    ADD Rd,-R1,R2
    ADD Rd,-R1,-R2
    Which basically gets rid of the unary NEG instruction.
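
    As a rough illustration of the four ADD forms above, here is a small
    Python model. It is a sketch, not My 66000 semantics: the helper add()
    and its neg1/neg2 flags are hypothetical names, the 64-bit wraparound is
    assumed, and modelling NEG as "negated operand plus zero" is also an
    assumption.

```python
MASK64 = (1 << 64) - 1

def add(r1, r2, neg1=False, neg2=False):
    """64-bit wrapping ADD with an optional sign flip on each source operand."""
    a = -r1 if neg1 else r1
    b = -r2 if neg2 else r2
    return (a + b) & MASK64

def to_signed(x):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    return x - (1 << 64) if x & (1 << 63) else x

def sub(r1, r2):
    return add(r1, r2, neg2=True)               # ADD Rd,R1,-R2

def rsub(r1, r2):
    return add(r1, r2, neg1=True)               # ADD Rd,-R1,R2

def neg_sum(r1, r2):
    return add(r1, r2, neg1=True, neg2=True)    # ADD Rd,-R1,-R2, i.e. -(A+B)

def neg(r1):
    return add(r1, 0, neg1=True)                # assumed: negated operand plus zero

print(to_signed(sub(7, 10)), to_signed(rsub(7, 10)),
      to_signed(neg_sum(2, 3)), to_signed(neg(5)))
```

    The one case BGB notes as missing from ADD/SUB/RSUB, -(A+B), falls out
    of the same mechanism as the others.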


    B) providing constants from the instruction stream;
    ..where constant can be an immediate, a displacement, or both.

    Probably true.

    My ISA allows for Immediate or Displacement to be extended, but doesn't currently allow (in the base ISA) any instructions that can encode both
    an immediate and displacement.


    ST #3.14159265358927,[IP,R3<<3,#0x123456789abcd]

    Here we have 5 instruction words storing 2 words anywhere in memory in
    one instruction and one decode cycle; we waste no registers with the
    constants. Looks to be 7 instructions in RISC-V including 2 LDDs...


    At present:
    Baseline allows Imm33s/Disp33s via a 64-bit encoding;
    There is optional support for Imm57s, which in XG2 is now extended to
    Imm64.

    There are special cases that allow immediate encodings for many
    instructions that would otherwise lack an immediate encoding.


    C) exact floating point arithmetics that get the Inexact flag
    ..correctly unmolested.

    Dunno. I suspect the whole global FPU status/control register thing
    should probably be rethought somehow.

    But, off-hand, don't know of a clearly better alternative.


    D) exception and interrupt control transfer should take no more
    ..than 1 cache line read followed by 4 cache line reads to the
    ..same page in DRAM/L3/L2 that are dependent on the first cache
    ..line read. Control transfer back to the suspended thread should
    ..be no longer than the control transfer to the exception handler.

    Likely expensive...

    Treat "thread state" and its register file as a write back cache.


    Granted, "glorified branch with some twiddling" is probably a little too
    far in the other direction. Interrupt and syscall overhead is fairly
    high when the handler needs to manually save and restore all the
    registers each time.


    A fast, but more expensive, option would be to have multiple copies of
    the register file which is then bank-switched on an interrupt.

    Under My 66000 a low end implementation can choose the write back cache version, while the GBOoO implementation can choose the bank switcher.
    In both cases, the same model is presented to executing SW.

    One possibility here could be, rather than being hard-wired to specific modes, there are 4 assignable register banks controlled by 2 status
    register bits.

    Then, say:
    0: User Task 1
    1: User Task 2
    2: Reserved for Kernel / Syscall Task;
    3: Reserved for interrupts.

    Possibly along with instructions to move between the banked registers
    and the currently active register file.

    Just memory map everything into MMI/O space where you have access to memorymove(to, from, count) capabilities and can move an entire
    thread state in 1 instruction.


    Though, likely cost would be that it would require putting the GPR
    register file in Block-RAM and possibly needing to increase pipeline
    length.

    Just MMI/O

    In an OS, the syscall and interrupt bank would likely be assigned
    statically, and the others could be assigned dynamically by the
    scheduler (though, as-is, would likely increase task-switch overhead vs
    the current mechanism).

    For SYSCALL in particular, you want at least 6 of the caller's registers
    to pass arguments to the service provider, and at least 1 register to
    return the result.

    This situation could potentially be "better" if there were 8 dynamic
    banks, with the scheduler potentially able to be clever and reuse banks
    if they haven't been evicted and the same process is run again (but
    could otherwise reassign them round-robin or similar).

    The Write Back Cache model works easier.
    <snip>

    Though, can note that as-is, in my case, in some programs, system call overhead is high enough that all this could be worth looking into (Say:
    Quake 3 manages to spend nearly 3% of the clock-cycle budget in the
    SYSCALL ISR; mostly saving/restoring registers).

    My SVC overhead is about 10 cycles.
    VM exit overhead is also about 10 cycles.

    E) Exception control transfer can transfer control directly to a
    ..user privilege thread without taking an excursion through the
    ..Operating System.

    ? Putting the scheduler in hardware?...

    Policy remains in SW, the ability to manifest a SW choice fast is in HW.

    Could make sense for a microcontroller, but less so for a conventional
    OS as pretty much the only things handling interrupts are likely to be supervisor-mode drivers.

    Signal handlers.


    F) upon arrival at an exception handler, no state needs to be saved,
    ..and the "cause" of the exception is immediately available to the
    ..Exception handler.
    G) Atomicity over a multiplicity of instructions and over a
    ..multiplicity of memory locations--without losing the
    ..illusion of real atomicity.

    Memory consistency is hard...

    It is simply a fully pipelined version of LL/SC


    H) Elementary Transcendental functions are first class citizens of
    ..the instruction set, and at least faithfully accurate and perform
    ..at the same speeds as SQRT and DIV.

    .... Yeah...

    In my case, they don't exist, and FDIV and FSQRT are basically boat
    anchors.


    Well, I guess it could be possible to support them in the ISA if they
    were all boat anchors.

    Say:
    FSIN Rm, Rn
    Raises a TRAPFPU exception, whereupon the exception handler decodes the
    instruction and performs the FSIN operation.

    The trap is likely more cycles than FSIN().


    I) The "system programming model" is inherently:
    ..1) Virtual Machine
    ..2) Hypervisor + Supervisor
    ..3) multiprocessor, multithreaded

    If the system-mode architecture is low-level enough, the difference
    between normal OS functionality and emulation starts to break down.

    Like, in both cases one has:
    Software page table walking;

    How does one walk a nested page table when HV does not want OS to see
    its mapping tables, and vice versa ??

    Needing to keep track of a virtual model of the TLB;

    TLB is an association of host.PTE with guest.virtual-address.

    You can't have host or guest perform the TLB update !!



    J) Simple applications can run with as little as 1 page of Memory
    ..Mapping overhead. An application like 'cat' can be run with
    ..a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
    ..and Register Files}

    Hmm.


    I guess one could make a case for a position-independent version of an "a.out" like format, focused on low-footprint binaries.

    For the record, My 66000 code is PIC, including GOT, method calls, and
    switch tables.

  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Fri Jul 26 17:00:07 2024
    "Chris M. Thomasson" <[email protected]> writes:
    On 7/25/2024 1:09 PM, BGB wrote:
    At least with a weak model, software knows that if it doesn't go through
    the rituals, the memory will be stale.

    There is no guarantee of staleness, only a lack of stronger ordering guarantees.

    The weak model is ideal for me. I know how to program for it

    And the fact that this model is so hard to use that few others know
    how to program for it makes it ideal for you.

    and it's more efficient

    That depends on the hardware.

    Yes, the Alpha 21164 with its imprecise exceptions was "more
    efficient" than other hardware for a while, then the Pentium Pro came
    along and gave us precise exceptions and more efficiency. And
    eventually the Alpha people learned the trick, too, and 21264 provided
    precise exceptions (although they did not admit this) and more
    efficiency.

    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware. That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.

    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    and sometimes use cases do not care if they encounter "stale" data.

    Great. Unless these "sometimes" cases are more often than the cases
    where you perform some atomic operation or barrier because of
    potential, but not actual communication between cores, the weak model
    is still slower than a well-implemented strong model.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 26 20:59:06 2024
    On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:

    "Chris M. Thomasson" <[email protected]> writes:
    On 7/25/2024 1:09 PM, BGB wrote:
    At least with a weak model, software knows that if it doesn't go through
    the rituals, the memory will be stale.

    There is no guarantee of staleness, only a lack of stronger ordering guarantees.

    The weak model is ideal for me. I know how to program for it

    And the fact that this model is so hard to use that few others know
    how to program for it makes it ideal for you.

    and it's more efficient

    That depends on the hardware.

    Yes, the Alpha 21164 with its imprecise exceptions was "more
    efficient" than other hardware for a while, then the Pentium Pro came
    along and gave us precise exceptions and more efficiency. And
    eventually the Alpha people learned the trick, too, and 21264 provided precise exceptions (although they did not admit this) and more
    efficiency.

    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware.

    According to Lamport; only the ATOMIC stuff needs sequential
    consistency.
    So, it is completely possible to have a causally consistent processor
    that switches to sequential consistency when doing ATOMIC stuff and gain performance when not doing ATOMIC stuff, and gain programmability when
    doing atomic stuff.

    That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.

    The operations themselves are not slow. What is slow is delaying the
    pipeline until it catches up to the stronger memory model before
    proceeding.

    By contrast, one can design hardware for strong ordering such that the slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    How would you do this for a 256-way banked memory system of the
    NEC SX ?? I.E., the processor is not in charge of memory order--
    the memory system is.


    and sometimes use cases do not care if they encounter "stale" data.

    Great. Unless these "sometimes" cases are more often than the cases
    where you perform some atomic operation or barrier because of
    potential, but not actual communication between cores, the weak model
    is still slower than a well-implemented strong model.

    - anton

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Sun Jul 28 01:27:49 2024
    On Sun, 28 Jul 2024 1:01:59 +0000, Paul A. Clayton wrote:

    On 7/25/24 6:07 PM, MitchAlsup1 wrote:
    On Thu, 25 Jul 2024 20:09:06 +0000, BGB wrote:

    On 7/24/2024 3:37 PM, MitchAlsup1 wrote:
    [snip]
    D) exception and interrupt control transfer should take no more
    ..than 1 cache line read followed by 4 cache line reads to the
    ..same page in DRAM/L3/L2 that are dependent on the first cache
    ..line read. Control transfer back to the suspended thread should
    ..be no longer than the control transfer to the exception handler.
    [snip]
    A fast, but more expensive, option would be to have multiple
    copies of
    the register file which is then bank-switched on an interrupt.

    Under My 66000 a low end implementation can choose the write back
    cache
    version, while the GBOoO implementation can choose the bank switcher.
    In both cases, the same model is presented to executing SW.

    I do not know at what port count a "3D register file" (temporal
    banking where extra storage "hides" under the wires) makes sense.
    I suspect the 3-read, 1-write register file of a low end My 66000 implementation would have the overhead be too great unless lower
    overhead context switching was extremely important.

    The low end implementation has a single 4-ported register file.
    When running code it is accessed as 3R-1W, but when context
    switching it is accessed as 4R or 4W depending on the cycle.

    The sequencer operates it like a write back cache, so if the
    code has not used R16-R23 since receiving control <again>,
    those registers are consistent with the already saved in memory
    registers, and no writes are necessary.
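
    The write back behaviour described here can be sketched as a toy model.
    Everything below is an assumption for illustration: the class name, the
    one-dirty-bit-per-8-registers granularity (chosen because 8 x 64 bits is
    one cache line), and the simplification that only writes make a group
    dirty. It is not the actual My 66000 sequencer.

```python
NUM_REGS = 32
REGS_PER_LINE = 8   # assumption: 8 x 64-bit registers per cache line

class RegisterFile:
    def __init__(self, save_area):
        # On receiving control the registers match the in-memory save area.
        self.regs = list(save_area)
        self.dirty = [False] * (NUM_REGS // REGS_PER_LINE)

    def write(self, r, value):
        """Write a register and mark its 8-register group as modified."""
        self.regs[r] = value
        self.dirty[r // REGS_PER_LINE] = True

    def switch_out(self, save_area):
        """Write back only the dirty groups; return how many were written."""
        written = 0
        for line, is_dirty in enumerate(self.dirty):
            if is_dirty:
                lo = line * REGS_PER_LINE
                save_area[lo:lo + REGS_PER_LINE] = self.regs[lo:lo + REGS_PER_LINE]
                self.dirty[line] = False
                written += 1
        return written

save_area = [0] * NUM_REGS
rf = RegisterFile(save_area)
rf.write(1, 42)     # touches group R0-R7
rf.write(30, 7)     # touches group R24-R31
lines_written = rf.switch_out(save_area)
print("groups written back:", lines_written)
```

    In this model R8-R23 were never modified, so their groups are already
    consistent with memory and cost nothing at the context switch.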

    As to the higher end machine, there would be an SRAM organized
    as 4-contexts of 32-registers each where each port can read
    or write 8×64 bits per cycle, so to bank switch, one does
    4 writes and then 4 reads.

    In both cases, all the fancy stuff is hidden from SW.

    In neither case are there more than 32 actual registers in the file
    nor are there more ports than decoders.

  • From Anton Ertl@21:1/5 to BGB on Mon Jul 29 12:59:33 2024
    BGB <[email protected]> writes:
    On 7/26/2024 12:00 PM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    and it's more efficient

    That depends on the hardware.

    Yes, the Alpha 21164 with its imprecise exceptions was "more
    efficient" than other hardware for a while, then the Pentium Pro came
    along and gave us precise exceptions and more efficiency. And
    eventually the Alpha people learned the trick, too, and 21264 provided
    precise exceptions (although they did not admit this) and more
    efficiency.

    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware. That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.


    TSO requires more significant hardware complexity though.

    An efficient implementation of TSO or sequential consistency requires
    more hardware, yes.

    Floating point requires more hardware than fixed point. Precise
    exceptions require more hardware than imprecise exceptions. Caches
    require more hardware than the local memory of Cell's SPEs. OoO
    requires more hardware than in-order; in this case the IA-64
    implementations demonstrated that you could then spend the area budget
    on more in-order resources (and big caches) and still fail to keep up
    on SPECint with the smaller OoO competition. In all these cases we
    decided that the benefit is worth the additional hardware. I think
    that's the case for strong memory ordering, too.

    Seems like it would be harder to debug the hardware since:
    There is more that has to go on in the hardware for TSO to work;
    Software will have higher expectations that it actually work.

    Possible. Delivering working hardware is the job of hardware
    engineers. Intel and AMD apparently have no problems getting the TSO
    parts of their architectures right. However, it seems that they don't
    go for "really efficient" TSO, or they would just upgrade the parts of
    their architecture with weaker consistency to have TSO.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Chris M. Thomasson on Mon Jul 29 17:38:23 2024
    On Mon, 29 Jul 2024 3:32:52 +0000, Chris M. Thomasson wrote:

    On 7/26/2024 10:00 AM, Anton Ertl wrote:
    "Chris M. Thomasson" <[email protected]> writes:
    On 7/25/2024 1:09 PM, BGB wrote:
    At least with a weak model, software knows that if it doesn't go through
    the rituals, the memory will be stale.

    There is no guarantee of staleness, only a lack of stronger ordering
    guarantees.

    The weak model is ideal for me. I know how to program for it

    And the fact that this model is so hard to use that few others know
    how to program for it makes it ideal for you.

    and it's more efficient

    That depends on the hardware.

    Yes, the Alpha 21164 with its imprecise exceptions was "more
    efficient" than other hardware for a while, then the Pentium Pro came
    along and gave us precise exceptions and more efficiency. And
    eventually the Alpha people learned the trick, too, and 21264 provided
    precise exceptions (although they did not admit this) and more
    efficiency.

    Similarly, I expect that hardware that is designed for good TSO or
    sequential consistency performance will run faster on code written for
    this model than code written for weakly consistent hardware will run
    on that hardware. That's because software written for weakly
    consistent hardware often has to insert barriers or atomic operations
    just in case, and these operations are slow on hardware optimized for
    weak consistency.

    By contrast, one can design hardware for strong ordering such that the
    slowness occurs only in those cases when actual (not potential)
    communication between the cores happens, i.e., much less frequently.

    and sometimes use cases do not care if they encounter "stale" data.

    Great. Unless these "sometimes" cases are more often than the cases
    where you perform some atomic operation or barrier because of
    potential, but not actual communication between cores, the weak model
    is still slower than a well-implemented strong model.

    A strong model? You mean I don't have to use any memory barriers at all?
    Tell that to SPARC in RMO mode... How strong? Even the x86 requires a
    membar when a store followed by a load to another location shall be
    respected wrt order. Store-Load. #StoreLoad over on SPARC. ;^)

    DRAM does not need this property, MMI/O does.

    If you can force everything to be #StoreLoad (*) and make it faster than
    a handcrafted algo on a very weak memory system, well, hats off! I
    thought it was easier for a HW guy to implement weak consistency? At the
    cost of the increased complexity wrt programming the sucker! ;^)

    Or HW can have different order strengths based on where the PTE
    sends the request. DRAM gets causal order, ATOMICs to DRAM get
    sequential consistency, MMI/O gets sequential consistency,
    Configuration gets strong ordering.

    Programmer has to do nothing.


    (*) Not just #StoreLoad for full consistency, you would need :

    MEMBAR #StoreLoad | #LoadStore | #StoreStore | #LoadLoad

    right?
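
    The Store-Load case argued about in this post can be made concrete with
    a toy model. The sketch below enumerates every interleaving of the
    classic two-thread "store x, load y / store y, load x" test on a machine
    where each core has a FIFO store buffer. It is a simplified illustration
    only, not the formal x86 or SPARC model; the function name outcomes and
    the "st"/"ld"/"fence" tuples are invented for this example, and "fence"
    here simply means "wait until this core's store buffer has drained".

```python
def outcomes(programs):
    """Return all final (r0, r1) load results under a store-buffer model."""
    results = set()

    def step(mem, bufs, pcs, regs):
        if all(pc == len(p) for pc, p in zip(pcs, programs)):
            results.add(regs)
            return
        for t, prog in enumerate(programs):
            # Option 1: drain thread t's oldest buffered store to memory.
            if bufs[t]:
                (addr, val), rest = bufs[t][0], bufs[t][1:]
                new_mem = dict(mem)
                new_mem[addr] = val
                step(new_mem, bufs[:t] + (rest,) + bufs[t + 1:], pcs, regs)
            # Option 2: execute thread t's next instruction.
            if pcs[t] < len(prog):
                op = prog[pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if op[0] == "st":
                    new_buf = bufs[t] + ((op[1], op[2]),)
                    step(mem, bufs[:t] + (new_buf,) + bufs[t + 1:], new_pcs, regs)
                elif op[0] == "ld":
                    # Forward from the thread's own buffer, else read memory.
                    val = mem[op[1]]
                    for addr, v in bufs[t]:
                        if addr == op[1]:
                            val = v
                    step(mem, bufs, new_pcs, regs[:t] + (val,) + regs[t + 1:])
                elif op[0] == "fence" and not bufs[t]:
                    # The fence may only retire once the buffer is empty.
                    step(mem, bufs, new_pcs, regs)

    step({"x": 0, "y": 0}, ((), ()), (0, 0), (None, None))
    return results

no_fence = ((("st", "x", 1), ("ld", "y")),
            (("st", "y", 1), ("ld", "x")))
fenced = ((("st", "x", 1), ("fence",), ("ld", "y")),
          (("st", "y", 1), ("fence",), ("ld", "x")))

print("without fence:", sorted(outcomes(no_fence)))
print("with fence:   ", sorted(outcomes(fenced)))
```

    Without the fence both loads can return 0, which is the outcome that
    breaks algorithms relying on a store being visible before a later load
    to another location. With the fence between store and load, that
    outcome disappears in this model.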

  • From Anton Ertl@21:1/5 to BGB on Tue Jul 30 09:44:24 2024
    BGB <[email protected]> writes:
    Otherwise, stuff isn't going to fit into the FPGAs.

    Something like TSO is a lot of complexity for not much gain.

    Given that you are so constrained, the easiest corner to cut is to
    have only one core. And then even sequential consistency is trivial
    to implement.

    Contrast, floating point and precise exceptions are a lot more relevant
    to software.

    John von Neumann (IIRC) argued against floating point, with similar
    arguments that are now used to defend weak ordering.

    The other examples I gave are all examples where people have argued
    that simplifying hardware at the cost of more complex software was the
    way to go, and history proved them wrong.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Jul 30 15:59:23 2024
    On 7/29/2024 2:55 PM, Chris M. Thomasson wrote:

    However... There is "special" mutex logic that actually requires a #StoreLoad! Peterson's algorithm for example. Iirc, it needs a #StoreLoad because it depends on a store followed by a load to another location to hold true. This is a bit different than
    other locking algorithms...
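
    As a rough illustration of the store-followed-by-load dependency
    described above, here is a minimal C11 sketch of Peterson's algorithm
    for two threads. It is not code from this thread. It uses the default
    sequentially consistent atomics, which is what supplies the #StoreLoad
    ordering between the stores to flag/turn and the following load of the
    other thread's flag. The names peterson_lock, peterson_unlock and
    peterson_demo are made up for the example.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int flag[2];   /* flag[i] != 0: thread i wants the lock */
static atomic_int turn;      /* which thread waits when both want it */
static long counter;         /* plain variable protected by the lock */

static void peterson_lock(int self)
{
    int other = 1 - self;
    /* Two stores followed by a load of a *different* location.
       seq_cst keeps the load from passing the stores (StoreLoad). */
    atomic_store(&flag[self], 1);
    atomic_store(&turn, other);
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        sched_yield();       /* be polite while spinning */
}

static void peterson_unlock(int self)
{
    atomic_store(&flag[self], 0);
}

struct worker_arg { int id; int iters; };

static void *worker(void *p)
{
    const struct worker_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        peterson_lock(a->id);
        counter++;           /* critical section */
        peterson_unlock(a->id);
    }
    return NULL;
}

/* Runs two threads; returns the final counter, or -1 if a thread
   could not be created. */
long peterson_demo(int iters)
{
    pthread_t t[2];
    struct worker_arg args[2] = { { 0, iters }, { 1, iters } };
    int started = 0;

    counter = 0;
    atomic_store(&flag[0], 0);
    atomic_store(&flag[1], 0);
    for (int i = 0; i < 2; i++) {
        if (pthread_create(&t[i], NULL, worker, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    return started == 2 ? counter : -1;
}
```

    Weakening the stores and loads to release/acquire would allow the
    load of flag[other] to be satisfied before the store to flag[self]
    is visible, and both threads could then enter the critical section.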

    Then there are more "exotic" methods such as so-called asymmetric mutexes. They can have fast paths and slow paths, so to speak. It's almost getting into the realm of RCU here... A fast path can be memory barrier free. The slow path can make things
    consistent with the use of so called "remote" memory barriers. It's funny that Windows seems to have one:

    https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

    ;^)

    The slow path is meant to not be frequently used, hence the term asymmetric. On par with read/write logic... :^)


    The folly library hazard pointers use that on windows, membarrier() system call on linux (something else on older linuxes), to get rid of the expensive store/load memory barrier in hazard pointer loads. Something like 0.7 nsecs w/o membar vs 7.7 w/
    membar. The term I've seen being used now is asymmetric memory barrier.
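
    A minimal sketch of that asymmetric pattern on Linux, assuming one
    reader thread and a single hazard-pointer slot. This is not folly's
    code; every name below (asym_init, publish, hazard_acquire,
    hazard_release, safe_to_free) is hypothetical. membarrier() is
    invoked through syscall(2). If the system call is unavailable, the
    sketch falls back to an ordinary full fence on both sides.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic(void *) published;  /* pointer that readers dereference */
static _Atomic(void *) hazard;     /* the single reader's hazard pointer */
static int have_membarrier;        /* set by asym_init() before threads run */

static int membarrier_call(int cmd)
{
    return (int)syscall(__NR_membarrier, cmd, 0, 0);
}

/* Register for expedited private barriers. Returns 1 if the asymmetric
   scheme is usable, 0 if the fallback (symmetric fences) is used. */
int asym_init(void)
{
    have_membarrier =
        membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) == 0;
    return have_membarrier;
}

void publish(void *p)
{
    atomic_store(&published, p);
}

/* Fast path: store the hazard pointer, then re-load the shared pointer.
   With membarrier available only a compiler barrier separates the two. */
void *hazard_acquire(void)
{
    void *p;
    do {
        p = atomic_load_explicit(&published, memory_order_acquire);
        atomic_store_explicit(&hazard, p, memory_order_relaxed);
        if (have_membarrier)
            atomic_signal_fence(memory_order_seq_cst);  /* compiler only */
        else
            atomic_thread_fence(memory_order_seq_cst);  /* real StoreLoad */
    } while (p != atomic_load_explicit(&published, memory_order_acquire));
    return p;
}

void hazard_release(void)
{
    atomic_store_explicit(&hazard, NULL, memory_order_release);
}

/* Slow path: call after `old` has been unpublished. Forces a barrier on
   all threads of the process, then checks the hazard slot.
   Returns 1 if no reader still holds `old`. */
int safe_to_free(void *old)
{
    if (have_membarrier)
        membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED);
    else
        atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&hazard, memory_order_acquire) != old;
}
```

    The cost moves from every hazard_acquire() call to the comparatively
    rare safe_to_free() call, which is the asymmetry being described.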

    Joe Seigh

  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Jul 30 21:27:13 2024
    On 7/30/2024 4:26 PM, Chris M. Thomasson wrote:
    On 7/30/2024 12:59 PM, jseigh wrote:

    The folly library hazard pointers use that on windows,  membarrier() system call on linux (something else on older linuxes), to get rid of the expensive store/load memory barrier in hazard pointer loads.

    I need to check that out; thanks for the heads up. Fwiw, remember that old thread on comp.programming.threads where you first'ish published your ideas on RCU+SMR? I need to see if the folly library references your work. Also, remember when some paper
    from SUN or something was trying to claim your atomic_ptr logic? Iirc, we talked about it back on comp.arch, a long time ago...

    I remember you issued a "challenge like" post over on comp.programming.threads wrt detecting quiescent periods. Iirc, I was the first one to comment wrt a possible hackish solution using timing wrt kernel time. ;^)


    Something like 0.7 nsecs w/o membar vs 7.7 w/ membar.  The term I've seen being used now is asymmetric memory barrier.

    Big time! This is bringing back a lot of memories Joe. :^) Thanks.


    I finally got around to writing a proxy version, smrproxy, here https://github.com/jseigh . Also there's an atomic reference counted proxy, arcproxy. Timings are here https://threadnought.wordpress.com/2023/06/09/smrproxy-timing-comparisons/ .

    smr is what I use to refer to hazard pointers. In the original smrrcu, the rcu referred to the rcu polling of context switches which had the property of performing a memory barrier action. I had to go through the linux proc filesystem for that. Talk
    about pain. I was really glad that somebody implemented membarrier().

    I also did a variation where you used local counters like when we were first messing with userspace rcu. About the same performance but with an extra polling cycle (events vs conditions). I tried to put in the same code as smrproxy but pseudo OO in C
    gets messy when you try to implement chimerical types, so I yanked it out.

    atomic-ptr-plus is there but it's been copied from sourceforge to google to github. I don't know if it's still intact.

    There's no attributions to anything in folly. It will probably end up like what's now called split reference counting which is now folklore according to a cppcon talk.

    Joe Seigh

  • From Anton Ertl@21:1/5 to BGB on Thu Aug 1 17:10:28 2024
    BGB <[email protected]> writes:
    Some amount of the cases where consistency issues have come up in my
    case have to do with RAM-backed hardware devices, like the rasterizer
    module. It has its own internal caches that need to be flushed, and not
    flushing caches (between this module and CPU) when trying to "transfer"
    control over things like the framebuffer or Z-buffer, can result in
    obvious graphical issues (and, texture-corruption doesn't necessarily
    look good either).

    The approach taken on AMD64 CPUs is to have different memory types
    (and associated memory type range registers). Plain DRAM is
    write-back cached, but there is also write-through and uncacheable
    memory. For a frame buffer that is read by some hardware that can
    access the memory independently, write-through seems to be the way to
    go.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 1 17:39:24 2024
    [email protected] (Anton Ertl) writes:
    BGB <[email protected]> writes:
    Some amount of the cases where consistency issues have come up in my
    case have to do with RAM-backed hardware devices, like the rasterizer
    module. It has its own internal caches that need to be flushed, and not
    flushing caches (between this module and CPU) when trying to "transfer"
    control over things like the framebuffer or Z-buffer, can result in
    obvious graphical issues (and, texture-corruption doesn't necessarily
    look good either).

    The approach taken on AMD64 CPUs is to have different memory types
    (and associated memory type range registers). Plain DRAM is
    write-back cached, but there is also write-through and uncacheable
    memory. For a frame buffer that is read by some hardware that can
    access the memory independently, write-through seems to be the way to
    go.

    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read allocate',
    'write allocate' as well as having optionally multiple coherency
    domains (inner and outer shareable).

  • From Michael S@21:1/5 to Anton Ertl on Thu Aug 1 23:08:25 2024
    On Thu, 01 Aug 2024 17:10:28 GMT
    [email protected] (Anton Ertl) wrote:

    BGB <[email protected]> writes:
    Some amount of the cases where consistency issues have come up in my
    case have to do with RAM-backed hardware devices, like the
    rasterizer module. It has its own internal caches that need to be
    flushed, and not flushing caches (between this module and CPU) when
    trying to "transfer" control over things like the framebuffer or
    Z-buffer, can result in obvious graphical issues (and,
    texture-corruption doesn't necessarily look good either).

    The approach taken on AMD64 CPUs is to have different memory types
    (and associated memory type range registers). Plain DRAM is
    write-back cached, but there is also write-through and uncacheable
    memory. For a frame buffer that is read by some hardware that can
    access the memory independently, write-through seems to be the way to
    go.

    - anton

    In theory WT regions can be used for frame buffers, but I would think
    that in real world overwhelming majority of [few remaining Direct IO]
    frame buffer applications use write-combining (WC) regions.

    As a reminder for those of us who have not recently re-read the
    relevant sections of the manual: architecturally, WC regions are weakly
    ordered and uncached.
    On the other hand, WT regions adhere to the same x86-TSO memory
    ordering model as WB regions.

    I don't believe that designers of iAMD64 CPUs pay much attention to
    the performance of WT regions, because there is no killer app for them.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Aug 1 20:06:10 2024
    On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

    [email protected] (Anton Ertl) writes:
    BGB <[email protected]> writes:
    Some amount of the cases where consistency issues have come up in my
    case have to do with RAM-backed hardware devices, like the rasterizer
    module. It has its own internal caches that need to be flushed, and not
    flushing caches (between this module and CPU) when trying to "transfer"
    control over things like the framebuffer or Z-buffer, can result in
    obvious graphical issues (and, texture-corruption doesn't necessarily
    look good either).

    The approach taken on AMD64 CPUs is to have different memory types
    (and associated memory type range registers). Plain DRAM is
    write-back cached, but there is also write-through and uncacheable
    memory. For a frame buffer that is read by some hardware that can
    access the memory independently, write-through seems to be the way to
    go.

    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read allocate',
    'write allocate' as well as having optionally multiple coherency
    domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    In My 66000 I use the term allocate wrt cache lines as a line that
    resides in the cache that may NOT have a defined DRAM address.
    These lines can float around the cache hierarchy as desired.

    BOOT uses the caches as main memory while it rummages through PCIe
    looking for devices (and now memory), and while DRAM is discovered,
    initialized, tuned, and made ready for general use. These lines
    have PTEs with the 'Allocate' cache specifier.

    The Call Stack uses Allocate cache lines which have the property
    that if they are freed before they are written to DRAM, they can
    be discarded instead of being written back. These cache lines
    have a PTE with both the 'allocate' specifier, and RWE = 000 so
    the application cannot LD or ST to these lines--only prologue
    and epilogue instructions are allowed access, but these lines
    do have an actual DRAM address.

    ..

    So, what definition does ARM apply to 'allocate' ??

  • From Scott Lurndal@21:1/5 to [email protected] on Thu Aug 1 20:34:28 2024
    [email protected] (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

    [email protected] (Anton Ertl) writes:
    BGB <[email protected]> writes:
    Some amount of the cases where consistency issues have come up in my
    case have to do with RAM-backed hardware devices, like the rasterizer
    module. It has its own internal caches that need to be flushed, and not
    flushing caches (between this module and CPU) when trying to "transfer"
    control over things like the framebuffer or Z-buffer, can result in
    obvious graphical issues (and, texture-corruption doesn't necessarily
    look good either).

    The approach taken on AMD64 CPUs is to have different memory types
    (and associated memory type range registers). Plain DRAM is
    write-back cached, but there is also write-through and uncacheable
    memory. For a frame buffer that is read by some hardware that can
    access the memory independently, write-through seems to be the way to
    go.

    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read allocate',
    'write allocate' as well as having optionally multiple coherency
    domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    "allocate a cache line".

    Example would be a DMA request with the 'read allocate' hint
    is allowed to be allocated in LLC instead of being stored in
    DRAM.

    Used when software expects the DMA data to be used immediately.

    "no allocate" for CPU initiated stores/loads would be equivalent
    to write-through-but-do-not-allocate-a-line-for-it (e.g.,
    non-temporal stores/loads). There are instructions for
    individual N/T accesses, but with the region attributes it
    can be applied to normal loads/stores for a whole page
    or set of pages.
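
    For the per-instruction flavor mentioned above, here is a small
    sketch using x86 SSE2 streaming stores. The function name
    copy_streaming is made up for the example. On x86 targets with SSE2
    it uses _mm_stream_si128, the non-temporal store intrinsic, which
    hints that the written data should not displace lines in the cache;
    on other targets the sketch simply falls back to memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n bytes from src to dst, using non-temporal stores for the
   16-byte-aligned middle part when SSE2 is available. */
void copy_streaming(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: plain byte stores until dst is 16-byte aligned,
       which _mm_stream_si128 requires. */
    while (n > 0 && ((uintptr_t)d & 15u) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Body: unaligned load, non-temporal (streaming) store. */
    for (; n >= 16; n -= 16, d += 16, s += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }
    /* Streaming stores are weakly ordered; fence before later stores. */
    _mm_sfence();
    /* Tail: remaining bytes with ordinary stores. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
#else
    memcpy(dst, src, n);
#endif
}
```

    With a page or region attribute, as described above, the same effect
    applies to ordinary loads and stores without changing the code.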

  • From Michael S@21:1/5 to [email protected] on Thu Aug 1 23:31:39 2024
    On Thu, 1 Aug 2024 20:06:10 +0000
    [email protected] (MitchAlsup1) wrote:


    So, what definition does ARM apply to 'allocate' ??


    I suppose, the same as most of the CS world - an event that causes
    the association of a line in cache with a particular address.
    https://developer.arm.com/documentation/den0013/d/Caches/Cache-policies/Allocation-policy
    For comparison:
    https://www.intel.com/content/www/us/en/docs/programmable/814346/24-2/cache-allocation-policy.html

    Sounds like Arm and Intel agree on the meaning of the word.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Aug 1 23:40:33 2024
    On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

    [email protected] (Anton Ertl) writes:
    BGB <[email protected]> writes:
    Some amount of the cases where consistency issues have come up in my
    case have to do with RAM-backed hardware devices, like the rasterizer
    module. It has its own internal caches that need to be flushed, and not
    flushing caches (between this module and CPU) when trying to "transfer"
    control over things like the framebuffer or Z-buffer, can result in
    obvious graphical issues (and, texture-corruption doesn't necessarily
    look good either).

    The approach taken on AMD64 CPUs is to have different memory types
    (and associated memory type range registers). Plain DRAM is
    write-back cached, but there is also write-through and uncacheable
    memory. For a frame buffer that is read by some hardware that can
    access the memory independently, write-through seems to be the way to
    go.

    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read allocate',
    'write allocate' as well as having optionally multiple coherency
    domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    "allocate a cache line".

    Example would be a DMA request with the 'read allocate' hint
    is allowed to be allocated in LLC instead of being stored in
    DRAM.

    Used when software expects the DMA data to be used immediately.

    Thanks for the explanation.

    In my case LLC is simply the front end for DRAM so a device
    write will spew data into LLC where it will wait to be written.
    Meanwhile, cores (or other devices) can access it directly
    from LLC as if it were from DRAM except at lower latency.

    "no allocate" for CPU initiated stores/loads would be equivalent
    to write-through-but-do-not-allocate-a-line-for-it (e.g.,
    non-temporal stores/loads). There are instructions for
    individual N/T accesses, but with the region attributes it
    can be applied to normal loads/stores for a whole page
    or set of pages.

    When using VVM, vector LDs and STs are considered non-temporal
    if the loop-count is greater than the cache size, alleviating
    the programmer from needing to know.

    When using memmove() or memset() data is moved on page sized
    boundaries over the "bus".

  • From Michael S@21:1/5 to [email protected] on Fri Aug 2 12:31:26 2024
    On Thu, 1 Aug 2024 23:40:33 +0000
    [email protected] (MitchAlsup1) wrote:

    On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:

    [email protected] (Anton Ertl) writes:
    BGB <[email protected]> writes:
    Some amount of the cases where consistency issues have come up
    in my case have to do with RAM-backed hardware devices, like the
    rasterizer module. It has its own internal caches that need to
    be flushed, and not flushing caches (between this module and
    CPU) when trying to "transfer" control over things like the
    framebuffer or Z-buffer, can result in obvious graphical issues
    (and, texture-corruption doesn't necessarily look good either).

    The approach taken on AMD64 CPUs is to have different memory types
    (and associated memory type range registers). Plain DRAM is
    write-back cached, but there is also write-through and uncacheable
    memory. For a frame buffer that is read by some hardware that can
    access the memory independently, write-through seems to be the
    way to go.

    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read
    allocate', 'write allocate' as well as having optionally
    multiple coherency domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    "allocate a cache line".

    Example would be a DMA request with the 'read allocate' hint
    is allowed to be allocated in LLC instead of being stored in
    DRAM.

    Used when software expects the DMA data to be used immediately.

    Thanks for the explanation.

    In my case LLC is simply the front end for DRAM so a device
    write will spew data into LLC where it will wait to be written.
    Meanwhile, cores (or other devices) can access it directly
    from LLC as if it were from DRAM except at lower latency.


    Sounds like memory-side cache. Intel Broadwell and few Skylake
    models with Iris Pro 580 GPU had 128MB of eDRAM cache operating in
    that manner.

    It is useful in many applications, esp. bandwidth constrained, but not
    in OLTP or similar enterprise apps on multi-socket SMP hardware.

    BTW, what do you plan to do when a single die has multiple independent
    memory channels/controllers? Is your LLC statically split between
    channels or shared?

  • From Scott Lurndal@21:1/5 to [email protected] on Fri Aug 2 14:05:25 2024
    [email protected] (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:


    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read allocate',
    'write allocate' as well as having optionally multiple coherency
    domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    "allocate a cache line".

    Example would be a DMA request with the 'read allocate' hint
    is allowed to be allocated in LLC instead of being stored in
    DRAM.

    Used when software expects the DMA data to be used immediately.

    Thanks for the explanation.

    In my case LLC is simply the front end for DRAM so a device
    write will spew data into LLC where it will wait to be written.

    I'm not sure that's a good idea. Large DMAs are common
    (e.g. reading pages of data in a single I/O) and the data
    from the DMA is not always used by the CPU. Evicting LLC lines to
    accommodate a file copy, for example, seems less than optimal.


    When using memmove() or memset() data is moved on page sized
    boundaries over the "bus".

    IME the majority of memset calls are for relatively small
    (less than a page) regions.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sun Aug 4 19:38:44 2024
    On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:


    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read allocate',
    'write allocate' as well as having optionally multiple coherency
    domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    "allocate a cache line".

    Example would be a DMA request with the 'read allocate' hint
    is allowed to be allocated in LLC instead of being stored in
    DRAM.

    Used when software expects the DMA data to be used immediately.

    Thanks for the explanation.

    In my case LLC is simply the front end for DRAM so a device
    write will spew data into LLC where it will wait to be written.

    I'm not sure that's a good idea. Large DMAs are common
    (e.g. reading pages of data in a single I/O) and the data
    from the DMA is not always used by the CPU. Evicting LLC lines to
    accommodate a file copy, for example, seems less than optimal.

    Fair enough. But after thinking about this for a while, does the
    process performing the file copy even know it is doing a file
    copy ?? for example::

    cat ../mydir/myfile > ../yourdir/yourfile

    Which kind of applications know they are doing Input that will
    not be used rather presently ??

    It seems to me that a file copy application would understand
    that writing of DRAM is irrelevant when the true destination
    is another sector on another disk, and any means to connect
    those two is more than sufficient.
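
    On Linux, one way an application can say "this is a file copy" is
    copy_file_range(2), which asks the kernel to move the bytes between
    two file descriptors without passing them through a user-space
    buffer. The sketch below is only an illustration under that
    assumption; copy_file_in_kernel is a made-up name, and it falls back
    to a plain read()/write() loop if the call is not supported for the
    given files.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy src to dst. Returns 0 on success, -1 on error. */
int copy_file_in_kernel(const char *src, const char *dst)
{
    char buf[65536];
    int use_cfr = 1;   /* try copy_file_range() first */
    int rc = 0;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    for (;;) {
        ssize_t n;
        if (use_cfr) {
            /* NULL offsets: use and advance the file positions. */
            n = copy_file_range(in, NULL, out, NULL, 1u << 20, 0);
            if (n < 0) {          /* e.g. ENOSYS or EXDEV: fall back */
                use_cfr = 0;
                continue;
            }
        } else {
            n = read(in, buf, sizeof buf);
            if (n < 0) {
                rc = -1;
                break;
            }
            for (ssize_t off = 0; off < n; ) {
                ssize_t w = write(out, buf + off, (size_t)(n - off));
                if (w < 0) {
                    rc = -1;
                    break;
                }
                off += w;
            }
            if (rc != 0)
                break;
        }
        if (n == 0)               /* end of input */
            break;
    }

    close(in);
    if (close(out) != 0)
        rc = -1;
    return rc;
}
```

    Whether the data then avoids DRAM or the caches is up to the kernel,
    the filesystem and the hardware; the call only removes the user-space
    round trip and tells the kernel what the intent is.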


    When using memmove() or memset() data is moved on page sized
    boundaries over the "bus".

    IME the majority of memset calls are for relatively small
    (less than a page) regions.

    Yes, but the interconnect is designed to move large chunks
    atomically. And the size of that chunk is "within a page
    boundary"

  • From Scott Lurndal@21:1/5 to [email protected] on Sun Aug 4 23:28:55 2024
    [email protected] (MitchAlsup1) writes:
    On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:


    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read allocate',
    'write allocate' as well as having optionally multiple coherency
    domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    "allocate a cache line".

    Example would be a DMA request with the 'read allocate' hint
    is allowed to be allocated in LLC instead of being stored in
    DRAM.

    Used when software expects the DMA data to be used immediately.

    Thanks for the explanation.

    In my case LLC is simply the front end for DRAM so a device
    write will spew data into LLC where it will wait to be written.

    I'm not sure that's a good idea. Large DMAs are common
    (e.g. reading pages of data in a single I/O) and the data
    from the DMA is not always used by the CPU. Evicting LLC lines to
    accommodate a file copy, for example, seems less than optimal.

    Fair enough. But after thinking about this for a while, does the
    process performing the file copy even know it is doing a file
    copy ?? for example::

    cat ../mydir/myfile > ../yourdir/yourfile

    It's not the application that matters. It's how the kernel
    handles file data accesses. Most kernels (unix, linux and NT)
    will read page-sized chunks (or multiples thereof) into kernel memory
    buffers. In the case of cat, it's using stdio, so there is
    another level of buffering in the C library.

    So, cat does a 'getc', getc looks at the C library buffer, and if the
    buffer is empty or fully consumed, it will tell the kernel to
    read another 1k (legacy unix) or 4k (linux) of data from the
    file into the C library buffer.

    The kernel will see the request from usermode and will load
    the corresponding page-sized chunk of data from the underlying
    disk file sectors, if the page cache doesn't already hold the
    page(s) containing the requested data.


    Which kind of applications know they are doing Input that will
    not be used rather presently ??

    Applications that care about I/O performance use various
    mechanisms (O_DIRECT, mmap, etc) to eliminate one or both
    levels of intermediate buffering. The madvise system call
    can be used to inform the kernel of the expected access
    pattern to allow the kernel to optimize its caching policies.
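
    A small sketch of the mmap-plus-madvise combination, assuming a
    read-only sequential scan. The function name sum_file_mapped is
    made up for the example; mmap() removes the C library buffer, and
    madvise(MADV_SEQUENTIAL) is only a hint that the kernel is free to
    ignore.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum all bytes of a file through a read-only mapping.
   Returns the sum, or -1 on error. */
long long sum_file_mapped(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {        /* mmap() rejects zero-length maps */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping stays valid */
    if (p == MAP_FAILED)
        return -1;

    /* Tell the kernel we will read front to back, so it may read
       ahead aggressively and drop pages behind us. Hint only. */
    (void)madvise(p, len, MADV_SEQUENTIAL);

    long long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    munmap(p, len);
    return sum;
}
```

    O_DIRECT goes further and bypasses the page cache as well, at the
    cost of alignment requirements on buffers and offsets.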

  • From Stephen Fuld@21:1/5 to All on Mon Aug 5 05:13:47 2024
    MitchAlsup1 wrote:

    On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:


    In addition, ARM64 CPUs include allocation hints in
    the memory type such as 'read allocate', 'transient read
    allocate', 'write allocate' as well as having optionally
    multiple coherency domains (inner and outer shareable).

    Sorry, I don't understand the word 'allocate' ?!?

    "allocate a cache line".

    Example would be a DMA request with the 'read allocate' hint
    is allowed to be allocated in LLC instead of being stored in
    DRAM.

    Used when software expects the DMA data to be used immediately.

    Thanks for the explanation.

    In my case LLC is simply the front end for DRAM so a device
    write will spew data into LLC where it will wait to be written.

    I'm not sure that's a good idea. Large DMAs are common
    (e.g. reading pages of data in a single I/O) and the data
    from the DMA is not always used by the CPU. Evicting LLC lines to
    accommodate a file copy, for example, seems less than optimal.

    Fair enough. But after thinking about this for a while, does the
    process performing the file copy even know it is doing a file
    copy ?? for example::

    cat ../mydir/myfile > ../yourdir/yourfile

    Which kind of applications know they are doing Input that will
    not be used rather presently ??

    It seems to me that a file copy application would understand
    that writing of DRAM is irrelevant when the true destination
    is another sector on another disk, and any means to connect
    those two is more than sufficient.


    I suppose you could create a mechanism that fed the data from the
    "read" DMA directly to the "write" DMA, thus bypassing not only the
    cache, but saving DRAM bandwidth as well. This would help on
    copies, and perhaps things like defrag and backup. But I suspect that
    the savings are not worth the effort.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Scott Lurndal@21:1/5 to Stephen Fuld on Mon Aug 5 14:26:00 2024
    "Stephen Fuld" <[email protected]d> writes:
    MitchAlsup1 wrote:



    It seems to me that a file copy application would understand
    that writing of DRAM is irrelevant when the true destination
    is another sector on another disk, and any means to connect
    those two is more than sufficient.


    I suppose you could create a mechanism that fed the data from the
    "read" DMA directly to the "write" DMA, thus bypassing not only the
    cache, but saving DRAM bandwidth as well. This would help on
    copies, and perhaps things like defrag and backup. But I suspect that
    the savings are not worth the effort.

    It would be more logical, I think, to simply build the functionality
    into the controller (when the source and destination are devices
    attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
    that sort of functionality was available on some SCSI controllers.

    For the case where devices are on multiple controllers, PCI express
    peer-to-peer would be the appropriate solution. There's no need
    for the CPU and cache complex to be involved at all.

    Shades of channel programs...

  • From Stephen Fuld@21:1/5 to Scott Lurndal on Mon Aug 5 17:41:34 2024
    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    MitchAlsup1 wrote:



    It seems to me that a file copy application would understand
    that writing of DRAM is irrelevant when the true destination
    is another sector on another disk, and any means to connect
    those two is more than sufficient.


    I suppose you could create a mechanism that fed the data from the
    "read" DMA directly to the "write" DMA, thus bypassing not only the
    cache, but saving DRAM bandwidth as well. This would help on
    copies, and perhaps things like defrag and backup. But I suspect
    that the savings are not worth the effort.

    It would be more logical, I think, to simply build the functionality
    into the controller (when the source and destination are devices
    attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
    that sort of functionality was available on some SCSI controllers.

    For the case where devices are on multiple controllers, PCI express
    peer-to-peer would be the appropriate solution. There's no need
    for the CPU and cache complex to be involved at all.


    Yes, thank you. The PCI Express option was the kind of thing I was
    thinking of. Since it is more general than the "in controller option",
    if you implement it at the PCI level, then you don't need the
    controller option.

    But even though the savings are real, given the limited use case for
    the feature, I question if it is worth the trouble.


    Shades of channel programs...



    Not nearly as flexible as channel programs, nor with their overhead.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Mon Aug 5 20:05:53 2024
    On Mon, 5 Aug 2024 17:41:34 +0000, Stephen Fuld wrote:

    Scott Lurndal wrote:

    "Stephen Fuld" <[email protected]d> writes:
    MitchAlsup1 wrote:



    It seems to me that a file copy application would understand
    that writing of DRAM is irrelevant when the true destination
    is another sector on another disk, and any means to connect
    those two is more than sufficient.


    I suppose you could create a mechanism that fed the data from the
    "read" DMA directly to the "write" DMA, thus bypassing not only the
    cache, but saving DRAM bandwidth as well. This would help on
    copies, and perhaps things like defrag and backup. But I suspect
    that the savings are not worth the effort.

    It would be more logical, I think, to simply build the functionality
    into the controller (when the source and destination are devices
    attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
    that sort of functionality was available on some SCSI controllers.

    For the case where devices are on multiple controllers, PCI express
    peer-to-peer would be the appropriate solution. There's no need
    for the CPU and cache complex to be involved at all.


    Yes, thank you. The PCI Express option was the kind of thing I was
    thinking of. Since it is more general than the "in controller option",
    if you implement it at the PCI level, then you don't need the
    controller option.

    Done right, it is just a part of I/O MMU address translation.

    But even though the savings are real, given the limited use case for
    the feature, I question if it is worth the trouble.

    With an I/O MMU it pretty much drops out for free.


    Shades of channel programs...



    Not nearly as flexible as channel programs, nor with their overhead.


