Forum: >>> Magnum BBS <<<

Decoding Instructions in Parallel

From Quadibloc@21:1/5 to All on Sat Jan 6 16:36:31 2024

Given that I do not know a whole lot about how cache
coherency is done, and Mitch asked me what approach
I was planning to take...

I went on a web search to find more information on
the subject.

I learned that MSI went to MESI... and then there were
a bunch of "ownership" schemes, such as Berkeley,
Illinois, Firefly, and Dragon.

By 1999, AMD seems to have done something in that area
with MOESI, and later on Intel came up with MESIF instead,
where "F", for Forwarding, is _like_ owned data, but it
is also saved to RAM. Engineers at Intel recently also
wrote papers on "MOESI Prime", which has primed versions
of two of the MOESI states to avoid the cache coherency
mechanism causing RowHammer-like behavior.

Anyways... there was something else I found while looking
this stuff up.

I had noted that one of the reasons for offering the
programmer a choice of writing programs with 32-bit
long instructions and nothing but 32-bit long instructions,
or using block headers for blocks of 256 bits in code,
was to allow instructions to be decoded in parallel.

Mitch pointed out that one could just start decoding
in parallel at every possible instruction start location,
while also, in parallel, quickly resolving instruction
lengths so as to find which decodes result in executions.

I acknowledged that one could certainly do that, but
since it was somewhat wasteful of heat and electricity,
I didn't think of this as describing a _typical_
implementation of my ISA (and hence parallel decoding
was still an excuse for having a block structure rather
than conventional CISC-like variable-length instructions).

Well, one of my search results showed that this was how
they did it on the first 64-bit Opterons, from AMD, so
that explains why this technique came so readily to
Mitch's mind!

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup@21:1/5 to Quadibloc on Sat Jan 6 19:16:30 2024

Quadibloc wrote:

Given that I do not know a whole lot about how cache
coherency is done, and Mitch asked me what approach
I was planning to take...

I went on a web search to find more information on
the subject.

I learned that MSI went to MESI... and then there were
a bunch of "ownership" schemes, such as Berkeley,
Illinois, Firefly, and Dragon.

By 1999, AMD seems to have done something in that area
with MOESI, and later on Intel came up with MESIF instead,
where "F", for Forwarding, is _like_ owned data, but it
is also saved to RAM. Engineers at Intel recently also
wrote papers on "MOESI Prime", which has primed versions
of two of the MOESI states to avoid the cache coherency
mechanism causing RowHammer-like behavior.

The OWNED state represents the concept that this copy is the
only valid copy, so you better not lose it. A request can
arrive back with OWNED data (in some protocols) and now the
recipient is in charge of not losing it.

Anyways... there was something else I found while looking
this stuff up.

I had noted that one of the reasons for offering the
programmer a choice of writing programs with 32-bit
long instructions and nothing but 32-bit long instructions,
or using block headers for blocks of 256 bits in code,
was to allow instructions to be decoded in parallel.

Mitch pointed out that one could just start decoding
in parallel at every possible instruction start location,

Consider reading 4 words at a time out of ICache. Even
before one compares the tag and selects the data to be
decoded, one can apply a block of logic 40-gates in
size and 4-gates of delay and have unary pointers to
the {Next instruction, any displacement, any constant}
by the time the tags have been compared and the 4-words
are then gated out with these extra pointers (8-bits)
on top of the 128-bits of instructions.

Each Next instruction pointer selects its successor, and
a tree of these resolves 2->4->8->16 at 1 more gate of
delay each. {Higher exponents seem accessible if desired}
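
For anyone who wants to play with the successor-pointer tree, here is a rough software sketch. It is an illustration under assumptions, not anybody's actual hardware: the per-word lengths are made up, `mark_starts` is a hypothetical name, and real logic would do each round as one layer of muxes rather than a Python loop.

```python
def mark_starts(lengths, first=0):
    """Return is_start[i] for each word i of a fetch group.

    lengths[i] is the length in words an instruction WOULD have if it
    started at word i. Every position is sized speculatively, and the
    tree then decides which positions are real instruction starts.
    """
    n = len(lengths)
    # One-hop successor of every position; index n means "past the group".
    jump = [min(i + lengths[i], n) for i in range(n)] + [n]
    reached = [False] * (n + 1)
    reached[first] = True          # the known entry point
    span = 1
    while span < n:                # about log2(n) rounds: 2 -> 4 -> 8 -> 16
        # Everything reachable so far also reaches its current-span successor.
        new_reached = reached[:]
        for i in range(n):
            if reached[i]:
                new_reached[jump[i]] = True
        reached = new_reached
        # Double the span of every pointer (pointer jumping).
        jump = [jump[jump[i]] for i in range(n + 1)]
        span *= 2
    return reached[:n]


if __name__ == "__main__":
    # Made-up lengths: instructions start at words 0, 1, 3, 6 and 7.
    print(mark_starts([1, 2, 1, 3, 1, 1, 1, 1]))
```

Each round doubles how far a pointer reaches, so an 8-word group resolves in 3 rounds and a 16-word group in 4.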

while also, in parallel, quickly resolving instruction
lengths so as to find which decodes result in executions.

Generally one associates DECODE with the point where logical
registers are applied to either the physical register file or
to the register renamer. These are ports one must use
efficiently, and if possible the stage before DECODE (I call it
PARSE) routes instructions to suitable DECODERs {especially
important in ISAs with multiple register files {GPR, FP,
SIMD}}.

I acknowledged that one could certainly do that, but
since it was somewhat wasteful of heat and electricity,

Separating PARSE from DECODE minimizes the waste heat
as all we are doing is looking at enough bits to route
the instruction to somewhere it can be efficiently DECODEd.
DECODE accesses the register ports and does all sorts of big
gate count decoding; PARSE uses tiny pattern decoders only to
route instructions.
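
A toy sketch of the PARSE idea, for illustration only. The encoding is invented (the top 2 bits of a 32-bit word pick the register file) and has nothing to do with any real ISA; `ROUTE` and `parse` are hypothetical names. The point is only that routing looks at a couple of bits and leaves the expensive work to the per-file decoders.

```python
from collections import defaultdict

# Invented encoding: top 2 bits of each 32-bit word select a register file.
ROUTE = {0b00: "GPR", 0b01: "GPR", 0b10: "FP", 0b11: "SIMD"}


def parse(words):
    """Route each instruction word to a decoder queue using 2 bits only."""
    queues = defaultdict(list)
    for w in words:
        queues[ROUTE[(w >> 30) & 0b11]].append(w)
    return dict(queues)


if __name__ == "__main__":
    print(parse([0x00000001, 0x80000002, 0xC0000003, 0x40000004]))
```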

I didn't think of this as describing a _typical_
implementation of my ISA (and hence parallel decoding
was still an excuse for having a block structure rather
than conventional CISC-like variable-length instructions).

Well, one of my search results showed that this was how
they did it on the first 64-bit Opterons, from AMD, so
that explains why this technique came so readily to
Mitch's mind!

Burned in solid. Opteron used a trailing marker bit so we
know if we were looking at the last byte in an instruction
(or not). My 66000 uses 4 Major OpCode patterns from 001xxx
and then uses 4 bit positions {15,14,13,11} to decode all
VLE size information.
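
A minimal sketch of why a per-byte end marker makes boundaries cheap, assuming one predecode bit stored alongside each byte and a window that begins on an instruction boundary. The function names are hypothetical and this does not model the actual Opteron logic.

```python
def start_bits(end_bits):
    """Byte i starts an instruction iff byte i-1 ended one.

    Each result depends on a single neighbouring bit, so all positions
    can be evaluated independently (in parallel, in hardware terms).
    """
    return [i == 0 or bool(end_bits[i - 1]) for i in range(len(end_bits))]


def split_instructions(code, end_bits):
    """Cut a byte window into complete instructions using the marker bits."""
    starts = [i for i, s in enumerate(start_bits(end_bits)) if s]
    ends = [i for i, e in enumerate(end_bits) if e]
    # A trailing instruction with no end marker in the window is dropped.
    return [code[s:e + 1] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    # x86-64 bytes for "mov rax, rbx" (48 89 D8) followed by "ret" (C3).
    window = bytes([0x48, 0x89, 0xD8, 0xC3])
    print(split_instructions(window, [0, 0, 1, 1]))
```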

John Savard


From EricP@21:1/5 to MitchAlsup on Mon Jan 8 12:20:01 2024

MitchAlsup wrote:

Quadibloc wrote:

Given that I do not know a whole lot about how cache
coherency is done, and Mitch asked me what approach
I was planning to take...

I went on a web search to find more information on
the subject.

I learned that MSI went to MESI... and then there were
a bunch of "ownership" schemes, such as Berkeley,
Illinois, Firefly, and Dragon.

By 1999, AMD seems to have done something in that area
with MOESI, and later on Intel came up with MESIF instead,
where "F", for Forwarding, is _like_ owned data, but it
is also saved to RAM. Engineers at Intel recently also
wrote papers on "MOESI Prime", which has primed versions
of two of the MOESI states to avoid the cache coherency
mechanism causing RowHammer-like behavior.

The Forward state is to address the issue of who should respond to a
request for a shared copy of a line when there are multiple sharers.
If multiple sharers respond it could flood a requester with redundant
messages.

The Directory Controller (DC) records which lines are held in each core
and in what state. It remembers the most recent core to read-share a line
as the Forward state holder, on the assumption that its copy is most likely
still resident, while the prior readers are tracked in a Shared state.

The cache with the line in a Forward state is told to send a shared copy to
a read-shared requester, who becomes the line's new Forward state holder.
If no Forward copy is available the DC reads from DRAM.
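
The bookkeeping above can be sketched roughly as follows. This is a toy model under assumptions: the `Directory` class and its methods are hypothetical names, only read-share requests and evictions are modelled, and no real protocol messages are shown.

```python
class Directory:
    """Toy directory tracking one Forward holder and the Shared set per line."""

    def __init__(self):
        self.forwarder = {}   # line address -> core currently in Forward state
        self.sharers = {}     # line address -> set of cores in Shared state

    def read_shared(self, line, core):
        """Handle a read-share request; return who supplies the data."""
        prev = self.forwarder.get(line)
        if prev == core:
            return "hit"                      # requester already holds the line
        shared = self.sharers.setdefault(line, set())
        if prev is not None:
            shared.add(prev)                  # old Forward holder drops to Shared
        shared.discard(core)
        self.forwarder[line] = core           # most recent reader becomes Forward
        return "DRAM" if prev is None else f"core {prev}"

    def evict(self, line, core):
        """A core silently drops its clean copy of the line."""
        if self.forwarder.get(line) == core:
            del self.forwarder[line]          # no Forward copy left
        else:
            self.sharers.get(line, set()).discard(core)


if __name__ == "__main__":
    d = Directory()
    print(d.read_shared(0x40, 0))   # first reader is served from DRAM
    print(d.read_shared(0x40, 1))   # served by core 0, core 1 becomes Forward
    d.evict(0x40, 1)
    print(d.read_shared(0x40, 2))   # Forward copy gone, back to DRAM
```

Exactly one cache answers each request, which is the point of the Forward state: the requester is never flooded by every sharer replying at once.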

The OWNED state represents the concept that this copy is the
only valid copy, so you better not lose it. A request can
arrive back with OWNED data (in some protocols) and now the recipient is
in charge of not losing it.

Also OWNED is the modified-shared state, where the owner modifies a line
and then shares read-only copies of it. The ownership can be passed to a
new cache without writing it back to DRAM or invalidating the shared copies.
To modify the line again the owner has to invalidate all the shared copies
first to return it to the Exclusive state.
When the owner eventually evicts the line, it is responsible for writing
it back to DRAM.
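
A rough state-only sketch of the Owned behaviour described above, under assumptions: it tracks one line across a few caches, uses the conventional MOESI letters (so the exclusive-and-dirty state is written M here), leaves out handing ownership to another cache, and the names `St` and `Line` are hypothetical.

```python
from enum import Enum


class St(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Line:
    """State of a single cache line in each of n caches (toy MOESI model)."""

    def __init__(self, n_caches):
        self.state = [St.I] * n_caches
        self.dram_writes = 0

    def read(self, c):
        if self.state[c] is not St.I:
            return                                  # read hit, nothing changes
        for i, s in enumerate(self.state):
            if s in (St.M, St.O):
                self.state[i] = St.O                # dirty holder supplies data, stays responsible
            elif s is St.E:
                self.state[i] = St.S
        alone = all(s is St.I for s in self.state)
        self.state[c] = St.E if alone else St.S

    def write(self, c):
        # The writer must invalidate every other copy before modifying.
        self.state = [St.I] * len(self.state)
        self.state[c] = St.M

    def evict(self, c):
        if self.state[c] in (St.M, St.O):
            self.dram_writes += 1                   # only the dirty holder writes back
        self.state[c] = St.I


if __name__ == "__main__":
    line = Line(3)
    line.write(0)
    line.read(1)
    print([s.name for s in line.state], line.dram_writes)
    line.evict(0)
    print([s.name for s in line.state], line.dram_writes)
```

Note that sharing the modified line costs no DRAM write; the write-back happens only when the owner finally evicts.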


From MitchAlsup@21:1/5 to EricP on Mon Jan 8 22:13:37 2024

EricP wrote:

MitchAlsup wrote:

Quadibloc wrote:

Given that I do not know a whole lot about how cache
coherency is done, and Mitch asked me what approach
I was planning to take...

I went on a web search to find more information on
the subject.

I learned that MSI went to MESI... and then there were
a bunch of "ownership" schemes, such as Berkeley,
Illinois, Firefly, and Dragon.

By 1999, AMD seems to have done something in that area
with MOESI, and later on Intel came up with MESIF instead,
where "F", for Forwarding, is _like_ owned data, but it
is also saved to RAM. Engineers at Intel recently also
wrote papers on "MOESI Prime", which has primed versions
of two of the MOESI states to avoid the cache coherency
mechanism causing RowHammer-like behavior.

The Forward state is to address the issue of who should respond to a
request for a shared copy of a line when there are multiple sharers.
If multiple sharers respond it could flood a requester with redundant
messages.

The Directory Controller (DC) records which lines are held in each core
and in what state. It remembers the most recent core to read-share a line
as the Forward state holder, on the assumption that its copy is most likely
still resident, while the prior readers are tracked in a Shared state.

The cache with the line in a Forward state is told to send a shared copy to
a read-shared requester, who becomes the line's new Forward state holder.
If no Forward copy is available the DC reads from DRAM.

The OWNED state represents the concept that this copy is the
only valid copy, so you better not lose it. A request can
arrive back with OWNED data (in some protocols) and now the recipient is
in charge of not losing it.

Also OWNED is the modified-shared state, where the owner modifies a line
and then shares read-only copies of it. The ownership can be passed to a
new cache without writing it back to DRAM or invalidating the shared copies.
To modify the line again the owner has to invalidate all the shared copies
first to return it to the Exclusive state.

Granted

When the owner eventually evicts the line, it is responsible for writing
it back to DRAM.

Or sending it to another cache that can take OWNERship over it.
