Quadibloc wrote:
Given that I do not know a whole lot about how cache
coherency is done, and Mitch asked me what approach
I was planning to take...
I went on a web search to find more information on
the subject.
I learned that MSI went to MESI... and then there were
a bunch of "ownership" schemes, such as Berkeley,
Illinois, Firefly, and Dragon.
By 1999, AMD seems to have done something in that area
with MOESI, and later on Intel came up with MESIF instead,
where "F", for Forwarding, is _like_ owned data, but it
is also saved to RAM. Engineers at Intel recently also
wrote papers on "MOESI Prime", which has primed versions
of two of the MOESI states to avoid the cache coherency
mechanism causing RowHammer-like behavior.
The OWNED state represents the concept that this copy is the
only valid copy, so you better not lose it. A request can
arrive back with OWNED data (in some protocols) and now the
recipient is in charge of not losing it.
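
To make the "don't lose it" point concrete, here is a minimal sketch of a textbook-style MOESI line, assuming the usual simplification that only Modified and Owned lines hold data newer than RAM. The names `State`, `needs_writeback` and `snoop_read` are made up for illustration, and real protocols differ in detail (for example in whether an Exclusive line supplies data on a snoop).

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

# Assumption: only M and O hold data that RAM does not have yet.
DIRTY = {State.M, State.O}

def needs_writeback(state):
    """True if evicting a line in this state would lose data unless it is written back."""
    return state in DIRTY

def snoop_read(state):
    """Another cache reads the line: return (new_state, this_cache_supplies_data).

    Simplified textbook transitions, not any specific vendor's protocol.
    """
    table = {
        State.M: (State.O, True),   # keep responsibility, hand out a copy
        State.O: (State.O, True),   # still responsible for the dirty data
        State.E: (State.S, False),  # clean, RAM can answer
        State.S: (State.S, False),
        State.I: (State.I, False),
    }
    return table[state]
```

In this sketch a Modified line answers a remote read by moving to Owned rather than writing back, which is exactly why the Owned copy must not be dropped silently.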
Anyways... there was something else I found while looking
this stuff up.
I had noted that one of the reasons for offering the
programmer a choice of writing programs with 32-bit
long instructions and nothing but 32-bit long instructions,
or using block headers for blocks of 256 bits in code,
was to allow instructions to be decoded in parallel.
Mitch pointed out that one could just start decoding
in parallel at every possible instruction start location,
Consider reading 4 words at a time out of ICache. Even
before one compares the tag and selects the data to be
decoded, one can apply a block of logic 40-gates in
size and 4-gates of delay and have unary pointers to
the {Next instruction, any displacement, any constant}
by the time the tags have been compared and the 4-words
are then gated out with these extra pointers (8-bits)
on top of the 128-bits of instructions.
Each Next instruction pointer selects its successor, and
a tree of these resolves 2->4->8->16 at 1 more gate of
delay each. {Higher exponents seem accessible if desired}
while also, in parallel, quickly resolving instruction
lengths so as to find which decodes result in executions.
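
The successor-pointer tree described above can be modeled in software as pointer doubling. The following is only a sketch under stated assumptions: every word position has already been speculatively decoded to a length in words, position 0 is a known instruction start, and the function name `mark_instruction_starts` is invented here. It is not Mitch's actual logic, just the 2->4->8 idea in a few lines.

```python
def mark_instruction_starts(lengths, start=0):
    """Given a speculative length (in words) at every word position,
    return a list of booleans marking the real instruction starts.

    Each round doubles the distance covered, so n positions need
    about log2(n) rounds instead of n sequential steps.
    """
    n = len(lengths)
    # Index n is a sentinel meaning "past the end of the fetch window".
    jump = [min(i + lengths[i], n) for i in range(n)] + [n]
    is_start = [False] * (n + 1)
    is_start[start] = True

    hops = 1
    while hops < n:
        # Every known start marks the position its current pointer reaches.
        reached = list(is_start)
        for i in range(n):
            if is_start[i]:
                reached[jump[i]] = True
        is_start = reached
        # Compose the pointers with themselves: 1 hop -> 2 -> 4 -> 8 ...
        jump = [jump[j] for j in jump]
        hops *= 2
    return is_start[:n]
```

For example, with `lengths = [1, 2, 1, 1]` the length at position 2 is garbage (it is the second word of the instruction at position 1), and the result marks positions 0, 1 and 3 as starts while discarding the decode at position 2.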
Generally one associates DECODE with when logical registers
are applied to either the physical register file or to the
register renamer. These are ports one must use efficiently,
and if possible the stage before DECODE (I call it PARSE)
routes instructions to suitable DECODERs {especially
important in ISAs with multiple register files {GPR, FP,
SIMD}}.
I acknowledged that one could certainly do that, but
since it was somewhat wasteful of heat and electricity,
Separating PARSE from DECODE minimizes the waste heat
as all we are doing is looking at enough bits to route
the instruction to somewhere it can be efficiently DECODEd.
DECODE accesses the register ports and does all sorts of big
gate-count decoding; PARSE uses tiny pattern decoders
only to route instructions.
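
As a rough illustration of the PARSE/DECODE split, the sketch below routes 32-bit words to per-register-file decoder queues by looking at only two bits. The encoding (top two bits select GPR, FP or SIMD) and the names `ROUTES`, `parse_route` and `parse_stage` are entirely hypothetical; they are not taken from My 66000 or from any real ISA.

```python
from collections import defaultdict

# Hypothetical encoding: the top two bits of a 32-bit word pick the
# register file the instruction mainly touches.
ROUTES = {0b00: "GPR", 0b01: "FP", 0b10: "SIMD", 0b11: "GPR"}

def parse_route(word):
    """PARSE: inspect just enough bits to choose a decoder."""
    return ROUTES[(word >> 30) & 0b11]

def parse_stage(words):
    """Sort a fetch group into per-decoder queues without fully decoding anything."""
    queues = defaultdict(list)
    for w in words:
        queues[parse_route(w)].append(w)
    return dict(queues)
```

The expensive work (register port access, full opcode decode) would then happen only in the decoder each word was routed to.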
I didn't think of this as describing a _typical_
implementation of my ISA (and hence parallel decoding
was still an excuse for having a block structure rather
than conventional CISC-like variable-length instructions).
Well, one of my search results showed that this was how
they did it on the first 64-bit Opterons, from AMD, so
that explains why this technique came so readily to
Mitch's mind!
Burned in solid. Opteron used a trailing marker bit so we
knew whether we were looking at the last byte of an instruction
(or not). My 66000 uses 4 Major OpCode patterns from 001xxx
and then uses 4 bit positions {15,14,13,11} to decode all
VLE size information.
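
The marker-bit idea can be sketched as follows, assuming only what is said above: one extra bit per byte that is set on the last byte of each instruction. The function names and the example data are invented, and this does not model the actual Opteron predecode format.

```python
def starts_from_end_bits(end_bits, first_is_start=True):
    """Byte i starts an instruction iff byte i-1 carried an end marker.

    Each position depends only on its neighbour's bit, so hardware can
    evaluate all positions at once with no serial length decoding.
    """
    starts = []
    for i in range(len(end_bits)):
        prev_ended = first_is_start if i == 0 else end_bits[i - 1]
        starts.append(bool(prev_ended))
    return starts

def split_instructions(code, end_bits):
    """Cut a byte string into instructions at the marked end bytes."""
    out, begin = [], 0
    for i, end in enumerate(end_bits):
        if end:
            out.append(code[begin:i + 1])
            begin = i + 1
    return out
```

With five bytes and `end_bits = [1, 0, 0, 1, 1]`, the starts are bytes 0, 1 and 4, giving instructions of 1, 3 and 1 bytes.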
John Savard