• Re: hmac hardware acceleration...

    From MitchAlsup@21:1/5 to Chris M. Thomasson on Thu Dec 14 20:08:30 2023
    Chris M. Thomasson wrote:

    Humm... I am wondering if hardware based HMAC could possibly help out
    one of my encryption experiments, for fun... A hyper crude little write
    up, has some crude Python 3 code in there. It's not all that fast, yikes!

    http://funwithfractals.atspace.cc/ct_cipher

    Online version of it:

    Online experiment:

    http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1

    First of all never use this cipher simply because it has not been
    properly peer reviewed yet! If interested, experiment with it, never use
    it until it has been deemed worthy to protect a pet's life, your Mom's
    life, your own life, etc.


    When I looked into this a while back, I came to the conclusion that incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in
    ISA directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any
    CPU register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to MitchAlsup on Fri Dec 15 10:19:12 2023
    MitchAlsup wrote:
    Chris M. Thomasson wrote:

    Humm... I am wondering if hardware based HMAC could possibly help out
    one of my encryption experiments, for fun... A hyper crude little
    write up, has some crude Python 3 code in there. It's not all that
    fast, yikes!

    http://funwithfractals.atspace.cc/ct_cipher

    Online version of it:

    Online experiment:

    http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1

    First of all never use this cipher simply because it has not been
    properly peer reviewed yet! If interested, experiment with it, never
    use it until it has been deemed worthy to protect a pet's life, your
    Mom's life, your own life, etc.


    When I looked into this a while back, I came to the conclusion that incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in ISA directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any CPU register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.

    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an excellent
    idea, but when you just want to verify the hash of a very short message,
    or encrypt a single packet, you do want this to be very close to the cpu.
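
    To make the short-input case concrete, here is a minimal sketch of
    tagging and verifying one small packet with HMAC-SHA256, using only
    Python's standard hmac and hashlib modules. The key and packet
    contents are made-up placeholders. For a message this small, the fixed
    per-call work (keyed inner and outer hash passes) is a large share of
    the total, which is the situation described above.

```python
import hashlib
import hmac


def tag(key: bytes, message: bytes) -> bytes:
    """Return the HMAC-SHA256 tag of a message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Compare tags in constant time so a mismatch position is not leaked."""
    return hmac.compare_digest(tag(key, message), received_tag)


# Placeholder values, for illustration only.
key = b"hypothetical-shared-key"
packet = b"short packet payload"

t = tag(key, packet)
ok = verify(key, packet, t)
tampered_ok = verify(key, packet + b"!", t)
print(len(t), ok, tampered_ok)
```

    Whether such a call is better served by an instruction or by a
    coprocessor is exactly the latency question raised here; the sketch
    only shows how little data is involved per call.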

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to Terje Mathisen on Fri Dec 15 10:22:38 2023
    Terje Mathisen wrote:
    MitchAlsup wrote:
    Chris M. Thomasson wrote:

    Humm... I am wondering if hardware based HMAC could possibly help out
    one of my encryption experiments, for fun... A hyper crude little
    write up, has some crude Python 3 code in there. It's not all that
    fast, yikes!

    http://funwithfractals.atspace.cc/ct_cipher

    Online version of it:

    Online experiment:

    http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1

    First of all never use this cipher simply because it has not been
    properly peer reviewed yet! If interested, experiment with it, never
    use it until it has been deemed worthy to protect a pet's life, your
    Mom's life, your own life, etc.


    When I looked into this a while back, I came to the conclusion that
    incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in ISA
    directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any CPU
    register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.

    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an excellent idea, but when you just want to verify the hash of a very short message,
    or encrypt a single packet, you do want this to be very close to the cpu.

    Terje

    An issue I see is in thread switching. You don't want a user process
    to be able to block the OS thread switching for an arbitrary time
    while it syncs with this coprocessor.

    It needs a coprocessor which is both fully asynchronous for
    bulk jobs from multiple processes and threads in the background,
    high priority communication packets from drivers,
    and like the x87 available as a semi-asynchronous resource to the
    current thread on zero notice but for limited size jobs.

    Or make the coprocessor jobs interruptible.

    Or maybe like a barrel processor where the OS can allocate
    as many tasks as it wants, and assign one to each user thread plus
    some to itself for high priority comms and low priority background.

  • From Scott Lurndal@21:1/5 to EricP on Fri Dec 15 20:21:05 2023
    EricP <[email protected]> writes:
    Terje Mathisen wrote:
    MitchAlsup wrote:
    Chris M. Thomasson wrote:

    Humm... I am wondering if hardware based HMAC could possibly help out
    one of my encryption experiments, for fun... A hyper crude little
    write up, has some crude Python 3 code in there. It's not all that
    fast, yikes!

    http://funwithfractals.atspace.cc/ct_cipher

    Online version of it:

    Online experiment:

    http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1

    First of all never use this cipher simply because it has not been
    properly peer reviewed yet! If interested, experiment with it, never
    use it until it has been deemed worthy to protect a pet's life, your
    Mom's life, your own life, etc.


    When I looked into this a while back, I came to the conclusion that
    incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in ISA
    directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any CPU
    register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.

    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an excellent
    idea, but when you just want to verify the hash of a very short message,
    or encrypt a single packet, you do want this to be very close to the cpu.

    Terje

    An issue I see is in thread switching. You don't want a user process
    to be able to block the OS thread switching for an arbitrary time
    while it syncs with this coprocessor.

    It needs a coprocessor which is both fully asynchronous for
    bulk jobs from multiple processes and threads in the background,
    high priority communication packets from drivers,
    and like the x87 available as a semi-asynchronous resource to the
    current thread on zero notice but for limited size jobs.

    Our coprocessors are 'virtualized', such that they provide
    a physical function and a number of virtual functions; a
    virtual function can be assigned (mapped into the
    address space directly) to a process and
    it can directly access the coprocessor from user mode.

    There are no worries about the host scheduling threads
    in the process - the process owns the virtual function.

    (see PCI Express single-root I/O virtualization (SR-IOV), which
    is the model used for standard OS compatibility).

    This model is used in DPDK and ODP, for example.

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Sat Dec 16 12:18:36 2023
    Terje Mathisen <[email protected]> schrieb:
    MitchAlsup wrote:

    When I looked into this a while back, I came to the conclusion that
    incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in ISA
    directly.

    That is the solution that IBM Z is using.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any CPU
    register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.

    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an excellent idea, but when you just want to verify the hash of a very short message,
    or encrypt a single packet, you do want this to be very close to the cpu.

    And that is what Power does with its vcipher and vcipherlast
    instructions, which do a single round of AES.

    POWER9 has six cycles of latency and at most one operation per cycle,
    Power10 between four and seven cycles, but four in parallel (I
    guess they invested some of their silicon there).

    AES operates on blocks of 128 bits, so 128-bit registers are quite
    natural there. For My 66000, this would require either register
    pairs or a variant of Carry, so this is probably not an easy fit.
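
    As a rough illustration of the register-pair point, the sketch below
    models one 128-bit AES block as two 64-bit halves. The helper names
    are invented for this example and say nothing about how My 66000 or
    Power actually encode it. AddRoundKey (a plain XOR with the round key)
    splits cleanly across the two halves; steps such as ShiftRows move
    bytes between the halves in the usual column-major layout, which is
    where a pair of 64-bit registers becomes awkward.

```python
MASK64 = (1 << 64) - 1


def split_block(block: bytes) -> tuple[int, int]:
    """Split a 16-byte block into (high, low) 64-bit integers."""
    if len(block) != 16:
        raise ValueError("AES blocks are exactly 16 bytes")
    value = int.from_bytes(block, "big")
    return value >> 64, value & MASK64


def join_block(hi: int, lo: int) -> bytes:
    """Rebuild the 16-byte block from its two 64-bit halves."""
    return ((hi << 64) | lo).to_bytes(16, "big")


def add_round_key(state: tuple[int, int], round_key: tuple[int, int]) -> tuple[int, int]:
    """AddRoundKey is a 128-bit XOR, so each half can be handled on its own."""
    return state[0] ^ round_key[0], state[1] ^ round_key[1]


block = bytes(range(16))
state = split_block(block)
rk = split_block(bytes([0xAA] * 16))  # placeholder round key, not a real key schedule
mixed = add_round_key(state, rk)
restored = join_block(*add_round_key(mixed, rk))
print(restored == block)
```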

  • From EricP@21:1/5 to Scott Lurndal on Sat Dec 16 11:28:37 2023
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Terje Mathisen wrote:
    MitchAlsup wrote:

    When I looked into this a while back, I came to the conclusion that
    incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in ISA
    directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any CPU
    register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.
    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an excellent
    idea, but when you just want to verify the hash of a very short message,
    or encrypt a single packet, you do want this to be very close to the cpu.

    Terje
    An issue I see is in thread switching. You don't want a user process
    to be able to block the OS thread switching for an arbitrary time
    while it syncs with this coprocessor.

    It needs a coprocessor which is both fully asynchronous for
    bulk jobs from multiple processes and threads in the background,
    high priority communication packets from drivers,
    and like the x87 available as a semi-asynchronous resource to the
    current thread on zero notice but for limited size jobs.

    Our coprocessors are 'virtualized', such that they provide
    a physical function and a number of virtual functions; a
    virtual function can be assigned (mapped into the
    address space directly) to a process and
    it can directly access the coprocessor from user mode.

    There are no worries about the host scheduling threads
    in the process - the process owns the virtual function.

    (see PCI Express single-root I/O virtualization (SR-IOV), which
    is the model used for standard OS compatibility).

    This model is used in DPDK and ODP, for example.

    Unfortunately the PCIe specs are all paywalled so I can't get the
    real poop on it. Linux doesn't seem to have any documentation on it.
    Microsoft only has the Windows driver development guides which
    I've had a look at.

    Presenting the coprocessor as a Virtual Function (VF) could work but,
    from the limited info I have seen, using a VF does seem to be limited
    because the SR-IOV device only exports a fixed number of VF's,
    (eg 16, 32, 64) as it is the device that maps from the
    Physical Function (PF) to the VF. As SR-IOV was mostly intended for
    optimizing paravirtualized network cards, in that context it is a
    reasonable limitation. However this would not be suitable for a
    coprocessor to say "access denied, all are in use".

    I also could not find out how Windows delivers virtual interrupts
    signaling IO completion for SR-IOV devices. Assuming it would use
    something called an APC, similar to a *nix signal, that would be
    an expensive way to be notified of coprocessor completion.
    Again, as SR-IOV was intended for IO virtualization, in that
    context that overhead is reasonable.

    Otherwise one would have to use SR-IOV polling in a spin loop to
    detect completion, whereas a coprocessor like the x87 has the FWAIT
    instruction to halt the processor until completion.
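
    The difference between the two completion styles can be sketched in
    software. Everything below is a simulation: FakeCoprocessor is a
    hypothetical stand-in that runs a job on a worker thread, and
    threading.Event plays the role of the completion flag. It is not a
    model of SR-IOV or of the x87, only of "spin on a flag" versus "block
    until signalled".

```python
import threading


class FakeCoprocessor:
    """Hypothetical device: runs one submitted job on a worker thread."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None

    def submit(self, job, *args):
        self.done.clear()

        def run():
            self.result = job(*args)
            self.done.set()  # completion flag the caller waits on

        threading.Thread(target=run, daemon=True).start()


def wait_by_polling(dev):
    """Spin on the completion flag; burns CPU and returns the poll count."""
    polls = 0
    while not dev.done.is_set():
        polls += 1
    return polls


def wait_by_blocking(dev, timeout=5.0):
    """Sleep until completion is signalled, similar in spirit to FWAIT."""
    return dev.done.wait(timeout)


def toy_checksum(data):
    # Placeholder workload, not a real hash.
    return sum(data) & 0xFF


dev = FakeCoprocessor()

dev.submit(toy_checksum, b"abc")
blocked_ok = wait_by_blocking(dev)
blocking_result = dev.result

dev.submit(toy_checksum, b"abc")
polls = wait_by_polling(dev)
polling_result = dev.result

print(blocked_ok, blocking_result, polling_result, polls >= 0)
```

    The polling variant keeps the calling thread busy for the whole wait,
    which is the cost being objected to; the blocking variant needs some
    mechanism to wake the waiter, which is where the interrupt or signal
    overhead comes in.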

  • From MitchAlsup@21:1/5 to EricP on Sat Dec 16 21:46:01 2023
    EricP wrote:

    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Terje Mathisen wrote:
    MitchAlsup wrote:

    When I looked into this a while back, I came to the conclusion that
    incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in ISA
    directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any CPU
    register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.
    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an excellent
    idea, but when you just want to verify the hash of a very short message,
    or encrypt a single packet, you do want this to be very close to the cpu.

    Terje
    An issue I see is in thread switching. You don't want a user process
    to be able to block the OS thread switching for an arbitrary time
    while it syncs with this coprocessor.

    It needs a coprocessor which is both fully asynchronous for
    bulk jobs from multiple processes and threads in the background,
    high priority communication packets from drivers,
    and like the x87 available as a semi-asynchronous resource to the
    current thread on zero notice but for limited size jobs.

    Our coprocessors are 'virtualized', such that they provide
    a physical function and a number of virtual functions; a
    virtual function can be assigned (mapped into the
    address space directly) to a process and
    it can directly access the coprocessor from user mode.

    There are no worries about the host scheduling threads
    in the process - the process owns the virtual function.

    (see PCI Express single-root I/O virtualization (SR-IOV), which
    is the model used for standard OS compatibility).

    This model is used in DPDK and ODP, for example.

    Unfortunately the PCIe specs are all paywalled so I can't get the
    real poop on it. Linux doesn't seem to have any documentation on it.
    Microsoft only has the Windows driver development guides which
    I've had a look at.

    Presenting the coprocessor as a Virtual Function (VF) could work but,
    from the limited info I have seen, using a VF does seem to be limited
    because the SR-IOV device only exports a fixed number of VF's,
    (eg 16, 32, 64) as it is the device that maps from the
    Physical Function (PF) to the VF. As SR-IOV was mostly intended for
    optimizing paravirtualized network cards, in that context it is a
    reasonable limitation. However this would not be suitable for a
    coprocessor to say "access denied, all are in use".

    Each Guest OS gets one VF.

    I also could not find out how Windows delivers virtual interrupts
    signaling IO completion for SR-IOV devices. Assuming it would use
    something called an APC, similar to a *nix signal, that would be
    an expensive way to be notified of coprocessor completion.

    A sufficiently privileged interrupt dispatching thread receives control.
    It examines the pending interrupts and dispatches the interrupt handler.
    The interrupt handler then services the interrupt and queues DPCs/softIRQs
    for cleanup activities.
    A stack of DPCs/softIRQs wanders through the cleanup work and finally
    schedules the user thread (synch) or sends the user thread a signal (asynch).
    The scheduler receives control and sooner or later delivers control back
    to user.

    Again as SR-IOV was intended for IO virtualization so in that
    context that overhead is reasonable.

    Otherwise one would have to use SR-IOV polling in a spin loop to
    detect completion, whereas a coprocessor like the x87 has the FWAIT
    instruction to halt the processor until completion.

  • From MitchAlsup@21:1/5 to EricP on Sun Dec 17 17:57:39 2023
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    Presenting the coprocessor as a Virtual Function (VF) could work but,
    from the limited info I have seen, using a VF does seem to be limited
    because the SR-IOV device only exports a fixed number of VF's,
    (eg 16, 32, 64) as it is the device that maps from the
    Physical Function (PF) to the VF. As SR-IOV was mostly intended for
    optimizing paravirtualized network cards, in that context it is a
    reasonable limitation. However this would not be suitable for a
    coprocessor to say "access denied, all are in use".

    Each Guest OS gets one VF.

    Right, that is the intended purpose for network cards in virtual machines
    although the SR-IOV specs are generalized.

    Then there is also this movement that wants to do "zero-copy" network
    IO directly from user space IO buffers, which is a VF per process opening
    the device.

    Why is this not an I/O MMU mapping? Kernel still does setup and teardown
    but device does DMA directly into user (requestor) memory.

    Then the two combine and it becomes a VF per guest processes opening the
    device per guest OS and that fixed device quota of 16,32,64 VF's starts
    looking a little sparse.

    Which is why direct user access to devices will never win.

    In either of these cases the IO device has to be opened so it would be ok
    to return a status "denied, device not available" as that is already a possible IO open status.

    A coprocessor is intended to be implicitly immediately available,
    under OS control, to the current processor context, be it threads, OS
    or drivers. That implies huge quota of VF's for all threads plus
    sundry other uses on all guest OS just in case they want one.
    And, just guessing at the device internals, implies huge management
    tables, CAMs instead of SRAMs, caches, blah, blah, etc.

  • From EricP@21:1/5 to MitchAlsup on Sun Dec 17 12:21:57 2023
    MitchAlsup wrote:
    EricP wrote:

    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Terje Mathisen wrote:
    MitchAlsup wrote:

    When I looked into this a while back, I came to the conclusion that
    incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in
    ISA directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any
    CPU register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.
    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an
    excellent idea, but when you just want to verify the hash of a very
    short message, or encrypt a single packet, you do want this to be
    very close to the cpu.

    Terje
    An issue I see is in thread switching. You don't want a user process
    to be able to block the OS thread switching for an arbitrary time
    while it syncs with this coprocessor.

    It needs a coprocessor which is both fully asynchronous for
    bulk jobs from multiple processes and threads in the background,
    high priority communication packets from drivers,
    and like the x87 available as a semi-asynchronous resource to the
    current thread on zero notice but for limited size jobs.

    Our coprocessors are 'virtualized', such that they provide
    a physical function and a number of virtual functions; a
    virtual function can be assigned (mapped into the
    address space directly) to a process and
    it can directly access the coprocessor from user mode.

    There are no worries about the host scheduling threads
    in the process - the process owns the virtual function.

    (see PCI Express single-root I/O virtualization (SR-IOV), which
    is the model used for standard OS compatibility).

    This model is used in DPDK and ODP, for example.

    Unfortunately the PCIe specs are all paywalled so I can't get the
    real poop on it. Linux doesn't seem to have any documentation on it.
    Microsoft only has the Windows driver development guides which
    I've had a look at.

    Presenting the coprocessor as a Virtual Function (VF) could work but,
    from the limited info I have seen, using a VF does seem to be limited
    because the SR-IOV device only exports a fixed number of VF's,
    (eg 16, 32, 64) as it is the device that maps from the
    Physical Function (PF) to the VF. As SR-IOV was mostly intended for
    optimizing paravirtualized network cards, in that context it is a
    reasonable limitation. However this would not be suitable for a
    coprocessor to say "access denied, all are in use".

    Each Guest OS gets one VF.

    Right, that is the intended purpose for network cards in virtual machines
    although the SR-IOV specs are generalized.

    Then there is also this movement that wants to do "zero-copy" network
    IO directly from user space IO buffers, which is a VF per process opening
    the device.

    Then the two combine and it becomes a VF per guest processes opening the
    device per guest OS and that fixed device quota of 16,32,64 VF's starts
    looking a little sparse.

    In either of these cases the IO device has to be opened so it would be ok
    to return a status "denied, device not available" as that is already a
    possible IO open status.

    A coprocessor is intended to be implicitly immediately available,
    under OS control, to the current processor context, be it threads, OS
    or drivers. That implies huge quota of VF's for all threads plus
    sundry other uses on all guest OS just in case they want one.
    And, just guessing at the device internals, implies huge management
    tables, CAMs instead of SRAMs, caches, blah, blah, etc.

    I also could not find out how Windows delivers virtual interrupts
    signaling IO completion for SR-IOV devices. Assuming it would use
    something called an APC, similar to a *nix signal, that would be
    an expensive way to be notified of coprocessor completion.

    A sufficiently privileged interrupt dispatching thread receives control.
    It examines the pending interrupts and dispatches the interrupt handler.
    The interrupt handler then services the interrupt and queues DPCs/softIRQs
    for cleanup activities.
    A stack of DPCs/softIRQs wanders through the cleanup work and finally
    schedules the user thread (synch) or sends the user thread a signal (asynch).
    The scheduler receives control and sooner or later delivers control back
    to user.

    I'm familiar with the OS mechanisms, it's the overhead I'm pointing out.
    To do this the hypervisor has to dispatch a virtual interrupt to the
    guest OS, which converts it to its local delivery mechanism,
    on Windows DPC->UAPC, on *nix to softIrq->signal,
    and delivers it to the guest thread on the guest OS.

    The overhead of the async completion signal would likely be much greater
    than the cost of the original coprocessor hash/encrypt.

    Again as SR-IOV was intended for IO virtualization so in that
    context that overhead is reasonable.

    Otherwise one would have to use SR-IOV polling in a spin loop to
    detect completion, whereas a coprocessor like the x87 has the FWAIT
    instruction to halt the processor until completion.

  • From Scott Lurndal@21:1/5 to EricP on Sun Dec 17 19:51:11 2023
    EricP <[email protected]> writes:
    MitchAlsup wrote:
    EricP wrote:

    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    Terje Mathisen wrote:
    MitchAlsup wrote:

    When I looked into this a while back, I came to the conclusion that
    incorporating something like SHA256, SHA512, DES, AES, ... encryption
    stuff suits an attached processor a lot better than putting it in
    ISA directly.

    Why: It is fundamentally difficult to chop up the units of work
    to fit in GPRs, and if you run the data through the GPRs (or any
    CPU register) you open up holes in your security blanket that
    are never open in the attached processor implementation. Perf
    will be better in the attached processor version unless the
    width of the en/decryption is small.
    I disagree, specifically because these algorithms are used a lot on
    short inputs: For a bulk process an attached coprocessor is an
    excellent idea, but when you just want to verify the hash of a very
    short message, or encrypt a single packet, you do want this to be
    very close to the cpu.

    Terje
    An issue I see is in thread switching. You don't want a user process
    to be able to block the OS thread switching for an arbitrary time
    while it syncs with this coprocessor.

    It needs a coprocessor which is both fully asynchronous for
    bulk jobs from multiple processes and threads in the background,
    high priority communication packets from drivers,
    and like the x87 available as a semi-asynchronous resource to the
    current thread on zero notice but for limited size jobs.

    Our coprocessors are 'virtualized', such that they provide
    a physical function and a number of virtual functions; a
    virtual function can be assigned (mapped into the
    address space directly) to a process and
    it can directly access the coprocessor from user mode.

    There are no worries about the host scheduling threads
    in the process - the process owns the virtual function.

    (see PCI Express single-root I/O virtualization (SR-IOV), which
    is the model used for standard OS compatibility).

    This model is used in DPDK and ODP, for example.

    Unfortunately the PCIe specs are all paywalled so I can't get the
    real poop on it. Linux doesn't seem to have any documentation on it.
    Microsoft only has the Windows driver development guides which
    I've had a look at.

    Presenting the coprocessor as a Virtual Function (VF) could work but,
    from the limited info I have seen, using a VF does seem to be limited
    because the SR-IOV device only exports a fixed number of VF's,
    (eg 16, 32, 64) as it is the device that maps from the
    Physical Function (PF) to the VF. As SR-IOV was mostly intended for
    optimizing paravirtualized network cards, in that context it is a
    reasonable limitation. However this would not be suitable for a
    coprocessor to say "access denied, all are in use".

    Each Guest OS gets one VF.

    Right, that is the intended purpose for network cards in virtual machines
    although the SR-IOV specs are generalized.

    Indeed, and the number of VF's is limited by the PCIe specification to
    65535 with one PF.

    The device is dividing its resources amongst the VF's, so the maximum number
    of VF's is controlled by the amount of resources available on the device
    and the implementation of the logic on the device.

    The number of VF's exposed to the host is controlled by the host driver
    (up to the max supported by the device) via stores to the device
    configuration space SR-IOV capability.
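
    On Linux, the same two numbers are visible per physical function
    through the sysfs attributes sriov_totalvfs (device maximum) and
    sriov_numvfs (currently enabled). A minimal sketch of reading them
    follows. The helper names are invented for this example, and the demo
    runs against a temporary fake directory so it works without SR-IOV
    hardware; on a real system the directory would be the PF's entry under
    /sys/bus/pci/devices/.

```python
import tempfile
from pathlib import Path


def read_vf_counts(device_dir):
    """Return (total_supported, currently_enabled) VF counts for a PF."""
    dev = Path(device_dir)
    total = int((dev / "sriov_totalvfs").read_text().strip())
    enabled = int((dev / "sriov_numvfs").read_text().strip())
    return total, enabled


def clamp_vf_request(requested, total):
    """Limit a requested VF count to what the device says it supports."""
    return max(0, min(requested, total))


# Fake sysfs-style directory with made-up values, for illustration only.
with tempfile.TemporaryDirectory() as fake_dev:
    Path(fake_dev, "sriov_totalvfs").write_text("64\n")
    Path(fake_dev, "sriov_numvfs").write_text("0\n")
    total, enabled = read_vf_counts(fake_dev)
    wanted = clamp_vf_request(128, total)

print(total, enabled, wanted)
```

    Actually enabling VFs means writing a count to sriov_numvfs as root,
    which is left out here on purpose.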


    A coprocessor is intended to be implicitly immediately available,
    under OS control, to the current processor context, be it threads, OS
    or drivers. That implies huge quota of VF's for all threads plus
    sundry other uses on all guest OS just in case they want one.

    That assumes that the coprocessor will be used by all processes,
    which, aside from legacy coprocessors like FPUs, is rarely the case (even
    then, most applications didn't actually use floating point, and there are
    hooks in most major operating systems to detect whether an application
    uses floating point so they don't need to save the FPRs over context
    switches).

    And, just guessing at the device internals, implies huge management tables,
    CAMs instead of SRAMs, caches, blah, blah, etc.

    Certainly in many cases, CAMs are quite useful. Particularly on
    networking hardware that performs hardware packet classification
    based on header fields.


    The overhead of the async completion signal would likely be much greater
    than the cost of the original coprocessor hash/encrypt.

    That again, depends on the coprocessor. If the amount of work
    that is offloaded isn't large enough to subsume the slight extra
    cost for the virtio interrupt (particularly on cpus where the
    interrupt overhead is low - e.g. ARMv8), you probably should
    couple the coprocessor closer to the CPU, much like ARM Neoverse
    cores where the RND instruction interacts with an off-cpu random
    number generator (via MMIO).
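
    The trade-off can be written as a simple break-even model: offloading
    pays off only when the fixed cost (submission plus completion
    notification) is smaller than the time saved on the data itself. The
    sketch below is just that arithmetic. The function names are invented,
    and the numbers in the demo are arbitrary placeholders, not
    measurements of any CPU or device.

```python
def offload_pays_off(cpu_ns_per_byte, dev_ns_per_byte, fixed_overhead_ns, size_bytes):
    """True if device time plus fixed overhead beats doing the work on the CPU."""
    cpu_time = cpu_ns_per_byte * size_bytes
    dev_time = fixed_overhead_ns + dev_ns_per_byte * size_bytes
    return dev_time < cpu_time


def break_even_bytes(cpu_ns_per_byte, dev_ns_per_byte, fixed_overhead_ns):
    """Job size above which offloading wins, or None if it never does."""
    gain_per_byte = cpu_ns_per_byte - dev_ns_per_byte
    if gain_per_byte <= 0:
        return None
    return fixed_overhead_ns / gain_per_byte


# Placeholder figures chosen only to exercise the formulas.
CPU_NS_PER_BYTE = 2.0
DEV_NS_PER_BYTE = 0.5
FIXED_OVERHEAD_NS = 3000.0

threshold = break_even_bytes(CPU_NS_PER_BYTE, DEV_NS_PER_BYTE, FIXED_OVERHEAD_NS)
small_job = offload_pays_off(CPU_NS_PER_BYTE, DEV_NS_PER_BYTE, FIXED_OVERHEAD_NS, 64)
large_job = offload_pays_off(CPU_NS_PER_BYTE, DEV_NS_PER_BYTE, FIXED_OVERHEAD_NS, 4096)
print(threshold, small_job, large_job)
```

    With a cheaper completion path the threshold drops, which is the
    argument for coupling the unit closer to the CPU when jobs are small.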

    Here's what our chips look like to the kernel/software:

    https://doc.dpdk.org/guides-20.05/platform/octeontx2.html

    Packet comes in, hardware allocates packet storage from the
    NPA (network pool allocator) hardware block. Passes to
    NCPC for classification (big CAMS), queues to scheduler,
    scheduler may or may not interact with a processor or
    one of the many blocks that can be added to the processing
    flow for a packet (crypto for IPsec, compression, etc) before
    queuing the packet for egress (where shaping occurs) on
    a network port.

  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Dec 17 19:52:30 2023
    [email protected] (MitchAlsup) writes:
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    Presenting the coprocessor as a Virtual Function (VF) could work but,
    from the limited info I have seen, using a VF does seem to be limited
    because the SR-IOV device only exports a fixed number of VF's,
    (eg 16, 32, 64) as it is the device that maps from the
    Physical Function (PF) to the VF. As SR-IOV was mostly intended for
    optimizing paravirtualized network cards, in that context it is a
    reasonable limitation. However this would not be suitable for a
    coprocessor to say "access denied, all are in use".

    Each Guest OS gets one VF.

    Right, that is the intended purpose for network cards in virtual machines
    although the SR-IOV specs are generalized.

    Then there is also this movement that wants to do "zero-copy" network
    IO directly from user space IO buffers, which is a VF per process opening
    the device.

    Why is this not an I/O MMU mapping? Kernel still does setup and teardown
    but device does DMA directly into user (requestor) memory.

    It is an IOMMU mapping.


    Then the two combine and it becomes a VF per guest processes opening the
    device per guest OS and that fixed device quota of 16,32,64 VF's starts
    looking a little sparse.
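
    The quota concern can be illustrated with simple arithmetic. A
    minimal sketch, assuming one VF per guest process that opens the
    device; the function names and the example numbers are hypothetical.

```python
def vfs_needed(opens_per_guest):
    """Total VFs wanted when every opening process in every guest gets one."""
    return sum(opens_per_guest)


def opens_refused(opens_per_guest, vf_limit):
    """How many opens would fail once the device's fixed VF count runs out."""
    return max(0, vfs_needed(opens_per_guest) - vf_limit)


# Example: 8 guests, 12 processes in each opening the device, 64 VFs.
demand = [12] * 8
shortfall = opens_refused(demand, vf_limit=64)
```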

    Which is why direct user access to devices will never win.

    Sorry, they already have for the use cases where it makes sense.

  • From EricP@21:1/5 to MitchAlsup on Sun Dec 17 15:43:50 2023
    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    Presenting the coprocessor as a Virtual Function (VF) could work but,
    from the limited info I have seen, using a VF does seem to be limited
    because the SR-IOV device only exports a fixed number of VF's,
    (eg 16, 32, 64) as it is the device that maps from the
    Physical Function (PF) to the VF. As SR-IOV was mostly intended for
    optimizing paravirtualized network cards, in that context it is a
    reasonable limitation. However this would not be suitable for a
    coprocessor to say "access denied, all are in use".

    Each Guest OS gets one VF.

    Right, that is the intended purpose for network cards in virtual machines
    although the SR-IOV specs are generalized.

    Then there is also this movement that wants to do "zero-copy" network
    IO directly from user space IO buffers, which is a VF per process opening
    the device.

    Why is this not an I/O MMU mapping? Kernel still does setup and teardown
    but device does DMA directly into user (requestor) memory.

    It is. My understanding from looking at the Windows Driver documents
    was it would have to be allocated when the VF pseudo-device is opened
    instead of for each individual IO. At FileOpen the OS would need to be
    told one or more virtual buffers the pseudo-device will work within.

    Leaving HV's out for the moment, the SR-IOV needs to pre-allocate and
    prepare any buffer physical memory at the time of pseudo-device open.
    Then when you write the pseudo-device control register referencing
    a byte range in the virtual buffer, it can validate it and initiate
    the IO without a trip through the OS.

    At pseudo-device open it would check any pinning quotas, fault in the
    buffer pages and pin them, and create a virtual buffer to physical
    fragment map, and set up the IOMMU DMA registers (PTE's) which the
    device HW uses later.
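
    The open-time steps above can be sketched as a small model. It
    assumes 4 KiB pages; alloc_frame is a hypothetical stand-in for the
    OS faulting in and pinning a page, and the fragment list stands in
    for what would be programmed into the IOMMU. No real driver or IOMMU
    API is modeled.

```python
PAGE = 4096  # assumed page size


class PinQuotaExceeded(Exception):
    pass


def open_pseudo_device(vbase, length, pin_quota_pages, alloc_frame):
    """Pin every page of the buffer; return {virtual page: physical frame}."""
    if length <= 0:
        raise ValueError("buffer length must be positive")
    first = vbase // PAGE
    last = (vbase + length - 1) // PAGE
    npages = last - first + 1
    if npages > pin_quota_pages:          # check the pinning quota first
        raise PinQuotaExceeded(f"{npages} pages > quota {pin_quota_pages}")
    # "Fault in and pin" each page; alloc_frame returns its frame number.
    return {vpn: alloc_frame() for vpn in range(first, last + 1)}


def fragments(page_map):
    """Merge contiguous pages into (vaddr, paddr, nbytes) fragments."""
    frags = []
    for vpn in sorted(page_map):
        pfn = page_map[vpn]
        if frags:
            v, p, n = frags[-1]
            # Extend the last fragment only if both address spaces line up.
            if v + n == vpn * PAGE and p + n == pfn * PAGE:
                frags[-1] = (v, p, n + PAGE)
                continue
        frags.append((vpn * PAGE, pfn * PAGE, PAGE))
    return frags
```

    All of this work happens once per open, which is what lets the later
    register write skip the OS.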

    For networks this is slightly complicated because network cards want to
    do lots of scatter-gather IO from many byte-sized and byte-aligned
    buffers, to assemble the TCP/IP packet headers, merge that with the
    app's payload, and possibly add a packet trailer for the checksum
    (which the card usually adds automatically).

    The IO operation should consist of just writing to the VF control
    register a pointer to an operation, which points to a user space
    scatter-gather list of byte buffers inside the pre-allocated and
    prepared memory areas.
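
    A minimal sketch of the device-side gather, assuming a single
    registered area modeled as a bytes object and a scatter-gather list
    of (virtual address, length) pairs. The names are hypothetical; a
    real card would do this with DMA descriptors, and its address
    arithmetic would also have to guard against wraparound.

```python
def gather(memory, base, sg_list):
    """Assemble one packet from byte ranges inside the registered area.

    memory  -- bytes modeling the pre-registered area
    base    -- virtual address at which that area starts
    sg_list -- iterable of (virtual address, length in bytes) entries
    """
    out = bytearray()
    for addr, nbytes in sg_list:
        off = addr - base
        # Reject any entry that leaves the area set up at open time.
        if nbytes <= 0 or off < 0 or off + nbytes > len(memory):
            raise ValueError(f"entry ({addr:#x}, {nbytes}) outside area")
        out += memory[off:off + nbytes]
    return bytes(out)


# Example: a 3-byte header and a 7-byte payload at unaligned offsets.
area = b"HDRxxPAYLOADyy"
packet = gather(area, 0x1000, [(0x1000, 3), (0x1005, 7)])
```

    Because every entry is checked against the area prepared at open,
    the validation needs no trip through the OS.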

    The HV adds one more indirection layer to this because all those
    pinned physical addresses and fragments above are actually guest OS
    addresses which the HV converts to real physical fragments,
    pins real physical frames and sets up real IOMMU maps for them.
    Then when you write the VF register the HW card can assemble the
    packet direct from the guest user virtual buffers as all the
    guest OS and HV management work was done at FileOpen.
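
    The extra indirection can be sketched as two table lookups. This
    assumes 4 KiB pages and flat page-number dictionaries; the table
    names are hypothetical, and real hardware walks multi-level tables
    instead.

```python
PAGE = 4096  # assumed page size


def translate(addr, table):
    """Map an address through a {page number: frame number} table."""
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise KeyError(f"page {page} is not mapped")
    return table[page] * PAGE + offset


def guest_virtual_to_host_physical(gva, guest_table, hv_table):
    """Stage 1: guest OS mapping. Stage 2: hypervisor mapping."""
    gpa = translate(gva, guest_table)   # guest virtual -> guest "physical"
    return translate(gpa, hv_table)     # guest "physical" -> real physical
```

    Both tables are filled in when the device is opened, so the device
    only ever sees the composed mapping.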
