quadibloc <[email protected]> writes:
> On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
>> The project started in earnest in 1994, when IBM had sold OoO
>> mainframes for several years and the Pentium Pro and HP PA 8000 were
>> pretty far along the way (both were released in November 1995), so
>> it's not as if the two involved companies did not have in-house
>> knowledge of the capabilities of OoO. They apparently chose to ignore
>> it.
> Surely that is an unfair characterization.
I dispute that.
> After all, as Ivan Godard has reminded us on several occasions, out of
> order execution has a very large cost in transistors.
Maybe. So what? It's not as if IA-64 has been designed for small
embedded systems. I have not found numbers for Merced, but McKinley
(Itanium II), released in 2002, "contains 221 million transistors (of
which 25 million are for logic and 181 million for L3 cache), measured
19.5 mm by 21.6 mm (421 mm^2) and was fabricated in a 180 nm"
<https://en.wikipedia.org/wiki/Itanium#Itanium_2_(McKinley_and_Madison):_2002%E2%80%932006>.
For comparison, the contemporary Willamette Pentium 4 also
manufactured in 180nm by Intel has 42M transistors and 217mm^2 die
size
<https://en.wikichip.org/wiki/intel/microarchitectures/netburst#Willamette>.
And despite all the shortcomings of the Netburst microarchitecture,
and despite the disadvantage of not having such a big on-chip L3
cache, the Willamette outperformed McKinley.
Another contemporary CPU manufactured in 180nm was the Thunderbird,
with 37M transistors in 120mm^2
<https://www.anandtech.com/show/557/3>, with 22M transistors due to
the core, i.e., fewer transistors than the logic (core?) of the
McKinley. And again, despite the cache disadvantage of the
Thunderbird compared to the McKinley, it performs better. Here are some
performance numbers (lower is better):
Some Gforth results:
 sieve bubble matrix   fib   machine and configuration
 1.144  1.329  0.762 1.333   Itanium 2 (HP rx2600) 900MHz; gcc-3.3
 0.37   0.52   0.24  0.61    Athlon 1200 (Thunderbird); gcc-2.95.1
 0.23   0.28   0.19  0.34    Pentium 4 2.26 (Northwood); gcc-2.95.1
The Northwood is a 130nm variant of the Pentium 4, but the 2.26GHz
variant was released on May 6, 2002 (before the release of the
McKinley in July 2002). But even if you scale the result down to the
2GHz that Willamette reached (in August 2001), it still outperforms
the McKinley by a big margin.
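A rough sketch of that scaling argument, assuming that run time is inversely proportional to clock frequency and ignoring any cache or memory-system differences between Northwood and Willamette (both are simplifying assumptions, and the variable and function names are only illustrative):

```python
# Gforth benchmark times in seconds (lower is better), copied from the table above.
BENCHMARKS = ("sieve", "bubble", "matrix", "fib")
itanium2_900mhz = (1.144, 1.329, 0.762, 1.333)
northwood_2260mhz = (0.23, 0.28, 0.19, 0.34)


def scale_to_clock(times, from_mhz, to_mhz):
    """Estimate times at another clock rate.

    Assumption: run time scales inversely with clock frequency.
    """
    return tuple(t * from_mhz / to_mhz for t in times)


# Estimated times for a hypothetical 2GHz part with Northwood's per-clock speed.
p4_at_2ghz = scale_to_clock(northwood_2260mhz, 2260, 2000)

# How many times longer the Itanium 2 takes than the estimated 2GHz Pentium 4.
slowdowns = tuple(i / p for i, p in zip(itanium2_900mhz, p4_at_2ghz))

for name, p4, s in zip(BENCHMARKS, p4_at_2ghz, slowdowns):
    print(f"{name:7s} P4 @ 2GHz (estimated): {p4:.3f}s, "
          f"Itanium 2 takes {s:.1f}x as long")
```

Under that assumption the Itanium 2 still takes more than three times as long as the estimated 2GHz Pentium 4 on each of the four benchmarks.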
LaTeX Benchmark results:
- HP workstation 900MHz Itanium II, Debian Linux 3.528
- Athlon (Thunderbird) 1200C, VIA KT133A, PC133 SDRAM, RedHat7.1 1.68
- Pentium 4 2.26GHz, 512KB L2, 1GB PC2100 RAM, RedHat 7.3 1.44
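The same comparison for the LaTeX benchmark can be sketched as follows, assuming the trailing number on each line is the run time in seconds (the dictionary keys are shortened labels, not the full configurations):

```python
# LaTeX benchmark results from the list above.
# Assumption: the trailing number on each line is run time in seconds.
latex_seconds = {
    "Itanium 2 900MHz": 3.528,
    "Athlon (Thunderbird) 1200": 1.68,
    "Pentium 4 2.26GHz": 1.44,
}

baseline = latex_seconds["Itanium 2 900MHz"]

# Ratio of the Itanium 2 time to each machine's time (higher means faster
# relative to the Itanium 2).
ratios = {name: baseline / t for name, t in latex_seconds.items()}

for name, ratio in ratios.items():
    print(f"{name:27s} {ratio:.2f}x the speed of the Itanium 2")
```

On these numbers the Thunderbird is about 2.1 times and the Pentium 4 about 2.45 times as fast as the Itanium 2.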
> So, while it is a
> way of achieving high performance, it comes at a cost both in die size
> and in power consumption.
> If the same benefits could be obtained through VLIW techniques without
> those costs
They cannot, at least not for most of the code that runs on CPUs.
Intel invested a lot in that idea, and the results were disappointing,
not just in performance, but also in die size and power
consumption. Here are power consumption numbers, all for 180nm CPUs:
TDP    CPU                         release date
75.3W  Willamette Pentium 4 2GHz   August 2001
72W    Thunderbird Athlon 1.4GHz   June 2001
100W   McKinley Itanium 2 1GHz     July 2002
And in those days TDP still meant something.
> So their
> problem wasn't that they forgot what they knew about OoO, but rather
> perhaps that their knowledge of the limitations of VLIW was
> insufficient.
It certainly was. But the actual problem was that they did not
understand where the performance benefits of OoO come from. If they
had understood that, they would have understood that the magic
compiler that makes EPIC go fast won't appear, ever.
Of course it's easy to write that in hindsight. At the time EPIC
looked like a good idea to me, too. But then I did not have knowledge
of P6 and Onyx in 1994, and did not have access to their design teams,
and I was not paid to make such decisions.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>