Humm... I am wondering if hardware based HMAC could possibly help out
one of my encryption experiments, for fun... A hyper crude little write
up, has some crude Python 3 code in there. It's not all that fast, yikes!
http://funwithfractals.atspace.cc/ct_cipher
Online version of it:
http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1
First of all, never use this cipher, simply because it has not been
properly peer reviewed yet! If interested, experiment with it, but never use
it until it has been deemed worthy of protecting a pet's life, your Mom's
life, your own life, etc.
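For anyone who wants to poke at the HMAC primitive itself, here is a minimal sketch using Python 3's standard `hmac` and `hashlib` modules. It is not the cipher from the links above; the key, the message, and the helper names `make_tag` and `check_tag` are made up for illustration.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Return the HMAC-SHA256 tag of message under key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def check_tag(key: bytes, message: bytes, tag: bytes) -> bool:
    """Verify a tag using a constant-time comparison."""
    return hmac.compare_digest(make_tag(key, message), tag)


if __name__ == "__main__":
    key = b"hypothetical demo key"      # assumption: any byte string works here
    msg = b"a very short message"
    tag = make_tag(key, msg)
    # A matching message verifies; a modified one does not.
    print(len(tag), check_tag(key, msg, tag), check_tag(key, msg + b"!", tag))
```

A pure software loop like this is the baseline that any hardware HMAC, in the ISA or in an attached unit, would have to beat.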
Chris M. Thomasson wrote:
Humm... I am wondering if hardware based HMAC could possibly help out
one of my encryption experiments, for fun... A hyper crude little
write up, has some crude Python 3 code in there. It's not all that
fast, yikes!
http://funwithfractals.atspace.cc/ct_cipher
Online version of it:
Online experiment:
http://fractallife247.com/test/hmac_cipher/ver_0_0_0_1
First of all, never use this cipher, simply because it has not been
properly peer reviewed yet! If interested, experiment with it, but never
use it until it has been deemed worthy of protecting a pet's life, your
Mom's life, your own life, etc.
When I looked into this a while back, I came to the conclusion that incorporating something like SHA256, SHA512, DES, AES, ... encryption
stuff suits an attached processor a lot better than putting it in ISA directly.
Why: It is fundamentally difficult to chop up the units of work
to fit in GPRs, and if you run the data through the GPRs (or any CPU register) you open up holes in your security blanket that
are never open in the attached processor implementation. Perf
will be better in the attached processor version unless the
width of the en/decryption is small.
MitchAlsup wrote:
When I looked into this a while back, I came to the conclusion that
incorporating something like SHA256, SHA512, DES, AES, ... encryption
stuff suits an attached processor a lot better than putting it in ISA
directly.
Why: It is fundamentally difficult to chop up the units of work
to fit in GPRs, and if you run the data through the GPRs (or any CPU
register) you open up holes in your security blanket that
are never open in the attached processor implementation. Perf
will be better in the attached processor version unless the
width of the en/decryption is small.
I disagree, specifically because these algorithms are used a lot on
short inputs: For a bulk process an attached coprocessor is an excellent idea, but when you just want to verify the hash of a very short message,
or encrypt a single packet, you do want this to be very close to the cpu.
Terje
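The short-input argument can be put as a rough cost model, sketched here in Python. Offloading pays a fixed round-trip cost before any byte is processed, so there is a message size below which staying in the CPU wins. The function names and all the numbers are invented placeholders, not measurements of any real part.

```python
def inline_ns(n_bytes: int, cpu_ns_per_byte: float) -> float:
    """Time to process the message with in-core instructions."""
    return n_bytes * cpu_ns_per_byte


def offload_ns(n_bytes: int, setup_ns: float, copro_ns_per_byte: float) -> float:
    """Time on an attached unit: fixed round trip plus streaming cost."""
    return setup_ns + n_bytes * copro_ns_per_byte


def break_even_bytes(setup_ns: float, cpu_ns_per_byte: float,
                     copro_ns_per_byte: float):
    """Message size where both paths cost the same; None if offload never wins."""
    gain = cpu_ns_per_byte - copro_ns_per_byte
    if gain <= 0:
        return None
    return setup_ns / gain


if __name__ == "__main__":
    # Hypothetical figures: 2000 ns round trip, 1.0 vs 0.25 ns per byte.
    print(break_even_bytes(2000.0, 1.0, 0.25))
    for size in (64, 1500, 65536):
        print(size, inline_ns(size, 1.0), offload_ns(size, 2000.0, 0.25))
```

With these made-up figures a single packet stays cheaper in the core, while a bulk transfer is cheaper on the attached unit.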
Terje Mathisen wrote:
MitchAlsup wrote:
When I looked into this a while back, I came to the conclusion that
incorporating something like SHA256, SHA512, DES, AES, ... encryption
stuff suits an attached processor a lot better than putting it in ISA
directly.
Why: It is fundamentally difficult to chop up the units of work
to fit in GPRs, and if you run the data through the GPRs (or any CPU
register) you open up holes in your security blanket that
are never open in the attached processor implementation. Perf
will be better in the attached processor version unless the
width of the en/decryption is small.
I disagree, specifically because these algorithms are used a lot on
short inputs: For a bulk process an attached coprocessor is an excellent
idea, but when you just want to verify the hash of a very short message,
or encrypt a single packet, you do want this to be very close to the cpu.
Terje
An issue I see is in thread switching. You don't want a user process
to be able to block the OS thread switching for an arbitrary time
while it syncs with this coprocessor.
It needs a coprocessor which is both fully asynchronous for
bulk jobs from multiple processes and threads in the background,
high priority communication packets from drivers,
and like the x87 available as a semi-asynchronous resource to the
current thread on zero notice but for limited size jobs.
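The three kinds of service listed above can be sketched as a toy scheduler in Python. This is only a model of the idea, not of any real device: the class name `CoprocessorModel`, the 256-byte inline limit, and the two priority levels are all assumptions made for the example.

```python
import heapq
import itertools

INLINE_LIMIT = 256               # hypothetical cap, in bytes, for the synchronous path
PRIO_DRIVER, PRIO_BULK = 0, 1    # lower number is served first


class CoprocessorModel:
    """Toy scheduler: small jobs run at once, the rest wait in a priority queue."""

    def __init__(self, work):
        self._work = work                  # function applied to each payload
        self._queue = []                   # heap of (priority, seq, payload)
        self._seq = itertools.count()      # tie-breaker keeps FIFO order per priority

    def submit(self, payload: bytes, from_driver: bool = False):
        """Return a result for small jobs, or None if the job was queued."""
        if len(payload) <= INLINE_LIMIT:
            return self._work(payload)     # limited-size job, served immediately
        prio = PRIO_DRIVER if from_driver else PRIO_BULK
        heapq.heappush(self._queue, (prio, next(self._seq), payload))
        return None

    def drain(self):
        """Run queued jobs in priority order and return their results."""
        results = []
        while self._queue:
            _, _, payload = heapq.heappop(self._queue)
            results.append(self._work(payload))
        return results
```

Capping the immediate path by size is what keeps a user thread from holding the unit for an arbitrary time; everything larger goes through the queue, where driver traffic jumps ahead of background bulk jobs.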
EricP <[email protected]> writes:
Terje Mathisen wrote:
MitchAlsup wrote:
When I looked into this a while back, I came to the conclusion that
incorporating something like SHA256, SHA512, DES, AES, ... encryption
stuff suits an attached processor a lot better than putting it in ISA
directly.
Why: It is fundamentally difficult to chop up the units of work
to fit in GPRs, and if you run the data through the GPRs (or any CPU
register) you open up holes in your security blanket that
are never open in the attached processor implementation. Perf
will be better in the attached processor version unless the
width of the en/decryption is small.
I disagree, specifically because these algorithms are used a lot on
short inputs: For a bulk process an attached coprocessor is an excellent
idea, but when you just want to verify the hash of a very short message,
or encrypt a single packet, you do want this to be very close to the cpu.
Terje
An issue I see is in thread switching. You don't want a user process
to be able to block the OS thread switching for an arbitrary time
while it syncs with this coprocessor.
It needs a coprocessor which is both fully asynchronous for
bulk jobs from multiple processes and threads in the background,
high priority communication packets from drivers,
and like the x87 available as a semi-asynchronous resource to the
current thread on zero notice but for limited size jobs.
Our coprocessors are 'virtualized', such that they provide
a physical function and a number of virtual functions; a
virtual function can be assigned (mapped into the
address space directly) to a process and
it can directly access the coprocessor from user mode.
There are no worries about the host scheduling threads
in the process - the process owns the virtual function.
(see PCI Express single-root I/O virtualization (SR-IOV), which
is the model used for standard OS compatibility).
This model is used in DPDK and ODP, for example.
Scott Lurndal wrote:
Our coprocessors are 'virtualized', such that they provide
a physical function and a number of virtual functions; a
virtual function can be assigned (mapped into the
address space directly) to a process and
it can directly access the coprocessor from user mode.
There are no worries about the host scheduling threads
in the process - the process owns the virtual function.
(see PCI Express single-root I/O virtualization (SR-IOV), which
is the model used for standard OS compatibility).
This model is used in DPDK and ODP, for example.
Unfortunately the PCIe specs are all paywalled so I can't get the
real poop on it. Linux doesn't seem to have any documentation on it. Microsoft only has the Windows driver development guides which
I've had a look at.
Presenting the coprocessor as a Virtual Function (VF) could work but,
from the limited info I have seen, using a VF does seem to be limited
because the SR-IOV device only exports a fixed number of VFs
(e.g. 16, 32, 64), as it is the device that maps from the
Physical Function (PF) to the VF. As SR-IOV was mostly intended for
optimizing paravirtualized network cards, in that context it is a
reasonable limitation. However, it would not be suitable for a
coprocessor to say "access denied, all are in use".
I also could not find out how Windows delivers virtual interrupts
signaling IO completion for SR-IOV devices. Assuming it would use
something called an APC, similar to a *nix signal, that would be
an expensive way to be notified of coprocessor completion.
Again, as SR-IOV was intended for IO virtualization, in that
context that overhead is reasonable.
Otherwise one would have to poll the SR-IOV device in a spin loop to
detect completion, whereas a coprocessor like the x87 has the FWAIT
instruction to halt the processor until completion.
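The fixed-quota concern from earlier in this post can be sketched as a small pool in Python. This models only the bookkeeping, not SR-IOV itself; the class name `VFPool` and the process ids are hypothetical.

```python
class VFPool:
    """Toy model of a device that exports a fixed number of VFs."""

    def __init__(self, num_vfs: int):
        self._free = list(range(num_vfs))   # VF indices not yet handed out
        self._owner = {}                    # VF index -> owning process id

    def acquire(self, pid: int):
        """Give pid a VF, or return None when every VF is already owned."""
        if not self._free:
            return None                     # "access denied, all are in use"
        vf = self._free.pop(0)
        self._owner[vf] = pid
        return vf

    def release(self, vf: int) -> None:
        """Return a VF to the pool so another process can take it."""
        if vf in self._owner:
            del self._owner[vf]
            self._free.append(vf)


if __name__ == "__main__":
    pool = VFPool(num_vfs=2)                # stand-in for a quota of 16, 32 or 64
    print([pool.acquire(pid) for pid in (100, 101, 102)])
```

The third caller gets nothing back, which is acceptable for a device that has to be opened but not for something meant to behave like a coprocessor.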
MitchAlsup wrote:
EricP wrote:
Presenting the coprocessor as a Virtual Function (VF) could work but,
from the limited info I have seen, using a VF does seem to be limited
because the SR-IOV device only exports a fixed number of VFs
(e.g. 16, 32, 64), as it is the device that maps from the
Physical Function (PF) to the VF. As SR-IOV was mostly intended for
optimizing paravirtualized network cards, in that context it is a
reasonable limitation. However this would not be suitable for a
coprocessor to say "access denied, all are in use".
Each Guest OS gets one VF.
Right, that is the intended purpose for network cards in virtual machines,
although the SR-IOV specs are generalized.
Then there was also this movement that wants to do "zero-copy" network
IO directly from user space IO buffers, which is a VF per process opening
the device.
Then the two combine and it becomes a VF per guest process opening the
device, per guest OS, and that fixed device quota of 16, 32, 64 VFs starts
looking a little sparse.
In either of these cases the IO device has to be opened, so it would be ok
to return a status "denied, device not available", as that is already a
possible IO open status.
A coprocessor is intended to be implicitly immediately available,
under OS control, to the current processor context, be it threads, OS
or drivers. That implies a huge quota of VFs for all threads plus
sundry other uses on all guest OSes, just in case they want one.
And, just guessing at the device internals, that implies huge management
tables, CAMs instead of SRAMs, caches, blah, blah, etc.
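The arithmetic behind "a little sparse" is simple enough to write down. A minimal Python sketch, assuming one VF per process that opens the device in every guest; the guest and process counts are hypothetical.

```python
def vfs_needed(guests: int, procs_per_guest: int) -> int:
    """One VF per process that opens the device, in every guest."""
    return guests * procs_per_guest


def shortfall(guests: int, procs_per_guest: int, device_vfs: int) -> int:
    """How many openers would be refused; 0 if the quota suffices."""
    return max(0, vfs_needed(guests, procs_per_guest) - device_vfs)


if __name__ == "__main__":
    # Hypothetical load: 8 guests with 20 opening processes each, 64 VFs on the device.
    print(vfs_needed(8, 20), shortfall(8, 20, 64))
```

Counting threads instead of processes, as an always-available coprocessor would require, only makes the product larger.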
EricP wrote:
Scott Lurndal wrote:
Our coprocessors are 'virtualized', such that they provide
a physical function and a number of virtual functions; a
virtual function can be assigned (mapped into the
address space directly) to a process and
it can directly access the coprocessor from user mode.
There are no worries about the host scheduling threads
in the process - the process owns the virtual function.
(see PCI Express single-root I/O virtualization (SR-IOV), which
is the model used for standard OS compatibility).
This model is used in DPDK and ODP, for example.
Unfortunately the PCIe specs are all paywalled so I can't get the
real poop on it. Linux doesn't seem to have any documentation on it.
Microsoft only has the Windows driver development guides which
I've had a look at.
Presenting the coprocessor as a Virtual Function (VF) could work but,
from the limited info I have seen, using a VF does seem to be limited
because the SR-IOV device only exports a fixed number of VFs
(e.g. 16, 32, 64), as it is the device that maps from the
Physical Function (PF) to the VF. As SR-IOV was mostly intended for
optimizing paravirtualized network cards, in that context it is a
reasonable limitation. However this would not be suitable for a
coprocessor to say "access denied, all are in use".
Each Guest OS gets one VF.
I also could not find out how Windows delivers virtual interrupts
signaling IO completion for SR-IOV devices. Assuming it would use
something called an APC, similar to a *nix signal, that would be
an expensive way to be notified of coprocessor completion.
A sufficiently privileged interrupt dispatching thread receives control.
It examines the pending interrupts and dispatches the interrupt handler.
The interrupt handler then services the interrupt and defers the cleanup
activities to DPCs/softIRQs.
A stack of DPCs/softIRQs wander through the cleanup work and finally
schedule the user thread (synch) or send the user thread a signal (asynch).
The scheduler receives control and sooner or later delivers control back
to the user.
Again, as SR-IOV was intended for IO virtualization, in that
context that overhead is reasonable.
Otherwise one would have to use SR-IOV polling in a spin loop to
detect completion, whereas a coprocessor like the x87 has the FWAIT
instruction to halt the processor until completion.
MitchAlsup wrote:
EricP wrote:
Scott Lurndal wrote:
Our coprocessors are 'virtualized', such that they provide
a physical function and a number of virtual functions; a
virtual function can be assigned (mapped into the
address space directly) to a process and
it can directly access the coprocessor from user mode.
There are no worries about the host scheduling threads
in the process - the process owns the virtual function.
(see PCI Express single-root I/O virtualization (SR-IOV), which
is the model used for standard OS compatibility).
This model is used in DPDK and ODP, for example.
Unfortunately the PCIe specs are all paywalled so I can't get the
real poop on it. Linux doesn't seem to have any documentation on it.
Microsoft only has the Windows driver development guides which
I've had a look at.
Presenting the coprocessor as a Virtual Function (VF) could work but,
from the limited info I have seen, using a VF does seem to be limited
because the SR-IOV device only exports a fixed number of VFs
(e.g. 16, 32, 64), as it is the device that maps from the
Physical Function (PF) to the VF. As SR-IOV was mostly intended for
optimizing paravirtualized network cards, in that context it is a
reasonable limitation. However this would not be suitable for a
coprocessor to say "access denied, all are in use".
Each Guest OS gets one VF.
Right, that is the intended purpose for network cards in virtual machines,
although the SR-IOV specs are generalized.
A coprocessor is intended to be implicitly immediately available,
under OS control, to the current processor context, be it threads, OS
or drivers. That implies a huge quota of VFs for all threads plus
sundry other uses on all guest OSes, just in case they want one.
And, just guessing at the device internals, that implies huge management
tables, CAMs instead of SRAMs, caches, blah, blah, etc.
The overhead of the async completion signal would likely be much greater
than the cost of the original coprocessor hash/encrypt.
EricP wrote:
MitchAlsup wrote:
EricP wrote:
Presenting the coprocessor as a Virtual Function (VF) could work but,
from the limited info I have seen, using a VF does seem to be limited
because the SR-IOV device only exports a fixed number of VFs
(e.g. 16, 32, 64), as it is the device that maps from the
Physical Function (PF) to the VF. As SR-IOV was mostly intended for
optimizing paravirtualized network cards, in that context it is a
reasonable limitation. However this would not be suitable for a
coprocessor to say "access denied, all are in use".
Each Guest OS gets one VF.
Right, that is the intended purpose for network cards in virtual machines,
although the SR-IOV specs are generalized.
Then there was also this movement that wants to do "zero-copy" network
IO directly from user space IO buffers, which is a VF per process opening
the device.
Why is this not an I/O MMU mapping? The kernel still does setup and teardown,
but the device does DMA directly into user (requestor) memory.
Then the two combine and it becomes a VF per guest process opening the
device, per guest OS, and that fixed device quota of 16, 32, 64 VFs starts
looking a little sparse.
Which is why direct user access to devices will never win.