• IA-64 again (was: Why I've Dropped In)

    From Anton Ertl@21:1/5 to quadibloc on Sun Jun 15 08:09:57 2025
    quadibloc <[email protected]> writes:
    > On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
    >
    >> The project started in earnest in 1994, when IBM had sold OoO
    >> mainframes for several years and the Pentium Pro and HP PA 8000 were
    >> pretty far along the way (both were released in November 1995), so
    >> it's not as if the two involved companies did not have in-house
    >> knowledge of the capabilities of OoO. They apparently chose to ignore
    >> it.
    >
    > Surely that is an unfair characterization.

    I dispute that.

    > After all, as Ivan Godard has reminded us on several occasions, out of
    > order execution has a very large cost in transistors.

    Maybe. So what? It's not as if IA-64 was designed for small
    embedded systems. I have not found numbers for Merced, but McKinley
    (Itanium 2), released in 2002, "contains 221 million transistors (of
    which 25 million are for logic and 181 million for L3 cache), measured
    19.5 mm by 21.6 mm (421 mm^2) and was fabricated in a 180 nm"
    <https://en.wikipedia.org/wiki/Itanium#Itanium_2_(McKinley_and_Madison):_2002%E2%80%932006>.

    For comparison, the contemporary Willamette Pentium 4, also
    manufactured in 180nm by Intel, has 42M transistors and 217mm^2 die
    size
    <https://en.wikichip.org/wiki/intel/microarchitectures/netburst#Willamette>.
    And despite all the shortcomings of the Netburst microarchitecture,
    and despite the disadvantage of not having such a big on-chip L3
    cache, the Willamette outperformed McKinley.

    Another contemporary CPU manufactured in 180nm was the Thunderbird,
    with 37M transistors in 120mm^2
    <https://www.anandtech.com/show/557/3>, of which 22M transistors are
    in the core, i.e., fewer transistors than the logic (core?) of the
    McKinley. And again, despite the cache disadvantage of the
    Thunderbird compared to the McKinley, it performs better. Here are
    some performance numbers (lower is better):

    Some Gforth results:

     siev  bubble  matrix    fib  machine and configuration
    1.144   1.329   0.762  1.333  Itanium 2 (HP rx2600) 900MHz; gcc-3.3
    0.37    0.52    0.24   0.61   Athlon 1200 (Thunderbird); gcc-2.95.1
    0.23    0.28    0.19   0.34   Pentium 4 2.26 (Northwood); gcc-2.95.1

    The Northwood is a 130nm variant of the Pentium 4, but the 2.26GHz
    variant was released on May 6, 2002 (before the release of the
    McKinley in July 2002). But even if you scale the result down to the
    2GHz that Willamette reached (in August 2001), it still outperforms
    the McKinley by a big margin.
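
    As a rough check of that scaling argument, here is a minimal sketch
    that rescales the Northwood times from the Gforth table to 2GHz. It
    assumes run time is inversely proportional to clock frequency, which
    ignores the cache and memory differences between Northwood and
    Willamette; the names in the code are made up for illustration.

```python
# Gforth run times from the table above (lower is better).
BENCHMARKS = ("siev", "bubble", "matrix", "fib")
ITANIUM2_900MHZ = (1.144, 1.329, 0.762, 1.333)
NORTHWOOD_2260MHZ = (0.23, 0.28, 0.19, 0.34)


def scale_to_clock(times, from_mhz, to_mhz):
    """Rescale run times to another clock rate.

    Assumption: run time is inversely proportional to clock frequency,
    so a slower clock makes every time proportionally longer.
    """
    return tuple(t * from_mhz / to_mhz for t in times)


def speedups(slow, fast):
    """Return how many times faster `fast` is than `slow`, per benchmark."""
    return tuple(s / f for s, f in zip(slow, fast))


# Hypothetical 2GHz Pentium 4, derived from the 2.26GHz Northwood numbers.
P4_AT_2GHZ = scale_to_clock(NORTHWOOD_2260MHZ, 2260, 2000)
RATIOS = speedups(ITANIUM2_900MHZ, P4_AT_2GHZ)

if __name__ == "__main__":
    for name, scaled, ratio in zip(BENCHMARKS, P4_AT_2GHZ, RATIOS):
        print(f"{name:7s} scaled P4 time {scaled:.3f}, "
              f"Itanium 2 takes {ratio:.2f}x as long")
```

    Under that assumption the scaled Pentium 4 is still more than three
    times as fast as the 900MHz Itanium 2 on each of the four benchmarks.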

    LaTeX Benchmark results:

    machine and configuration                                          time
    - HP workstation 900MHz Itanium 2, Debian Linux                    3.528
    - Athlon (Thunderbird) 1200C, VIA KT133A, PC133 SDRAM, RedHat7.1   1.68
    - Pentium 4 2.26GHz, 512KB L2, 1GB PC2100 RAM, RedHat 7.3          1.44


    > So, while it is a way of achieving high performance, it comes at a
    > cost both in die size and in power consumption.
    >
    > If the same benefits could be obtained through VLIW techniques without
    > those costs

    They cannot, at least not for most of the code that runs on CPUs.
    Intel invested a lot in that idea, and the results were disappointing,
    not just in performance, but also in die size and power
    consumption. Here are power consumption numbers, all for 180nm CPUs:

    TDP    CPU                        release date
    75.3W  Willamette Pentium 4 2GHz  August 2001
    72W    Thunderbird Athlon 1.4GHz  June 2001
    100W   McKinley Itanium 2 1GHz    July 2002

    And in those days TDP still meant something.
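
    To put the cost figures quoted in this post side by side, here is a
    small sketch that expresses McKinley's transistor count, die area and
    TDP as multiples of the other two 180nm CPUs. The numbers are only
    those given above; the TDPs are for the clock rates in the TDP table,
    not for the exact machines benchmarked, and the names in the code are
    made up for illustration.

```python
# Figures quoted in this post; all three CPUs were manufactured in 180nm.
# name: (transistors in millions, die area in mm^2, TDP in watts)
CHIPS = {
    "McKinley Itanium 2": (221, 421, 100.0),
    "Willamette Pentium 4": (42, 217, 75.3),
    "Thunderbird Athlon": (37, 120, 72.0),
}

REFERENCE = "McKinley Itanium 2"


def multiples_of(reference, chips):
    """For every other chip, divide the reference chip's figures by its own.

    A value of 2.0 means the reference chip has twice as much of that
    resource (transistors, area or power) as the chip it is compared to.
    """
    ref = chips[reference]
    return {
        name: tuple(r / v for r, v in zip(ref, values))
        for name, values in chips.items()
        if name != reference
    }


MULTIPLES = multiples_of(REFERENCE, CHIPS)

if __name__ == "__main__":
    for name, (transistors, area, tdp) in MULTIPLES.items():
        print(f"McKinley vs {name}: {transistors:.2f}x transistors, "
              f"{area:.2f}x die area, {tdp:.2f}x TDP")
```

    With these inputs McKinley comes out at roughly twice the die area
    of Willamette and about three and a half times that of Thunderbird,
    at about a third more TDP than either.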

    > So their
    > problem wasn't that they forgot what they knew about OoO, but rather
    > perhaps that their knowledge of the limitations of VLIW was
    > insufficient.

    It certainly was. But the actual problem was that they did not
    understand where the performance benefits of OoO come from. If they
    had understood that, they would have understood that the magic
    compiler that makes EPIC go fast won't appear, ever.

    Of course it's easy to write that in hindsight. At the time EPIC
    looked like a good idea to me, too. But then I did not have knowledge
    of P6 and Onyx in 1994, and did not have access to their design teams,
    and I was not paid to make such decisions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>
