Alasir Enterprises

Alpha: The History in Facts and Comments
Dig my grave both long and narrow
Make my coffin neat and strong 

(from an old American song)

Paul V. Bolotoff
Release date: 14th of April 2005
Last modified: 21st of April 2007

Alpha 21264 (EV6, EV67, EV68A, EV68C)

Although the 21264 (EV6) processor was developed by DEC and first mentioned at the Microprocessor Forum in October of 1996, the final silicon implementation was completed in February of 1998, when DEC was in the process of liquidation. The processor was a significant step forward compared to EV5, not a tuned-up old design at all. One of the most important innovations was out-of-order execution, which implied a fundamental core redesign and made the functional units less dependent on the bandwidth of cache and operating memory. EV6 could reorder up to 80 instructions on the fly, more than competing products could. For instance, the P6 architecture by Intel was able to execute up to 40 micro-operations out of order, HP PA-8x00 up to 56 instructions, MIPS R12000 up to 48, IBM POWER3 up to 32 and Motorola PowerPC G4 up to 5, while Sun UltraSPARC II didn't support instruction reordering at all. Register renaming was implemented as well, so EV6 accommodated 80 integer and 72 floating-point physical registers, but the number of architectural (logical) registers remained unchanged, i.e. 32 integer and 32 floating-point.
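The relation between architectural and physical registers can be illustrated with a minimal sketch. Only the two register counts come from the text; the class, its method names and the simple free-list policy are hypothetical and do not describe the actual EV6 rename logic.

```python
from collections import deque

ARCH_REGS = 32   # architectural integer registers (from the text)
PHYS_REGS = 80   # physical integer registers in EV6 (from the text)


class RenameMap:
    """Hypothetical rename table: architectural index -> physical index."""

    def __init__(self):
        # Assumption: architectural register i starts out in physical register i.
        self.table = list(range(ARCH_REGS))
        self.free = deque(range(ARCH_REGS, PHYS_REGS))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (new_dst, phys_srcs, old_dst)."""
        # Sources are read before the destination mapping is changed.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register, dispatch would stall")
        old_dst = self.table[dst]
        new_dst = self.free.popleft()
        self.table[dst] = new_dst
        return new_dst, phys_srcs, old_dst

    def retire(self, old_dst):
        # The previous mapping can be reused once the instruction retires.
        self.free.append(old_dst)
```

Two consecutive writes to the same architectural register receive different physical registers in this sketch, which is what lets them proceed out of order without a write-after-write conflict.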
There were 4 integer pipelines available, i.e. twice as many as in EV5. They were organised in 2 clusters, with 2 pipelines and an 80-entry integer register file per cluster. The 2 register files held identical (synchronised) contents though. The pipelines differed functionally: the 2nd pipeline of the 1st cluster was capable of shifting (1-cycle latency) and multiplying (7-cycle latency), whereas the 2nd pipeline of the 2nd cluster was capable of shifting (1-cycle latency) and executing MVIs (3-cycle latency). The 1st pipeline of each cluster helped A-box by calculating virtual addresses for load/store operations. Apart from that, all 4 integer pipelines were capable of basic arithmetical and logical operations (1-cycle latency). A-box itself worked with I-TLB and D-TLB (128 entries each), load and store queues (32 instructions each), and 8 64-byte buffers (the miss address file) for transactions involving B-cache and operating memory. The floating-point pipelines differed functionally as well. The 1st pipeline was capable of adding (4-cycle latency), dividing (12-cycle latency for single-precision operands and 15-cycle for double-precision) and square root calculation (15-cycle and 30-cycle respectively), but the 2nd one was only capable of multiplying (4-cycle latency). As in EV5, I-box was able to decode up to 4 instructions per cycle and dispatch them into 2 queues, one for E-box called E-queue (20 instructions) and one for F-box called F-queue (15 instructions).
C-box was redesigned significantly and supported only 2 cache levels. The integrated L1 cache memory consisted of a 64KB I-cache and a 64KB D-cache, both 2-way set associative with 64-byte lines. D-cache was write-back, as was B-cache, hence there was no S-cache at all. B-cache was inclusive of D-cache though. Because of its large size, D-cache read/write latencies increased from 2 cycles to 3 cycles (to/from an integer register) and 4 cycles (to/from a floating-point register). D-cache remained dual-ported, but it was built not of 2 identical write-synchronised parts as in EV5, but of a single part clocked at double the core frequency. The external B-cache of 1MB to 16MB, direct-mapped and write-back, was accessed through an independent bidirectional 128-bit data bus with a 16-bit channel for ECC protection and a unidirectional 20-bit address bus. B-cache was built of LW SSRAM chips (late write), later of DDR SSRAM ones (double data rate). The speed of B-cache was programmable, ranging from 2/3 to 1/8 of the EV6 core frequency. Unlike in the previous generations of Alpha processors, B-cache wasn't optional. The system data bus was only 64 bits wide with an additional 8 bits of ECC protection, but it was able to transfer data on both rising and falling edges of the clock signal, i.e. it was DDR capable. The system address bus was 44 bits wide, implemented physically through two 15-bit unidirectional paths, and the system control bus was 15 bits wide. The basic functional principle of the system bus was changed: the bus became dedicated instead of shared, so every processor had its own path to the system logic set.
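The cache figures above imply a few derived numbers that are easy to check. The following sketch works them out for the L1 caches and for the B-cache clock; the function names are hypothetical, and the 600MHz core frequency is just one of the EV6 speed grades used as an example.

```python
from fractions import Fraction


def cache_geometry(size_bytes, ways, line_bytes):
    """Derive line/set counts and address bit widths (powers of two only)."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2 of the line size
        "index_bits": sets.bit_length() - 1,         # log2 of the set count
    }


def bcache_clock_mhz(core_mhz, ratio):
    """B-cache clock for a programmable ratio of the core frequency."""
    return core_mhz * ratio


# 64KB, 2-way set associative, 64-byte lines (I-cache and D-cache alike)
l1 = cache_geometry(64 * 1024, 2, 64)

# The two extremes of the programmable B-cache speed for a 600MHz core
fastest = bcache_clock_mhz(600, Fraction(2, 3))
slowest = bcache_clock_mhz(600, Fraction(1, 8))
```

Under these figures each L1 cache holds 1024 lines in 512 sets, so an address supplies 6 offset bits and 9 index bits, and the B-cache of a 600MHz part would run between 75MHz and 400MHz.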
The branch prediction logic was redesigned completely. It followed a 2-level scheme with a local history table of 1024 entries of 10 bits each and a local predictor of 1024 entries of 3 bits each, coupled with a global predictor of 4096 entries of 3 bits each and a history path of 12 bits. The local and global algorithms worked independently: the local one tracked every branch individually, while the global one tracked sequences of branches. The chooser analysed the results of both algorithms and recorded its conclusions in a separate choice predictor of 4096 entries of 2 bits each, which supplied the preferred decision when the two predictions differed. Such a cooperative approach achieved better results than either algorithm used stand-alone.
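The scheme can be sketched as a small simulation. Table sizes and counter widths are taken from the paragraph above; everything else is an assumption made for illustration: saturating counters, indexing the local history table by low program counter bits, indexing the global and choice tables by the 12-bit history, and the training rule for the chooser. It is not a description of the actual EV6 circuitry.

```python
def sat_update(value, up, bits):
    """Saturating counter step: up or down within [0, 2**bits - 1]."""
    top = (1 << bits) - 1
    return min(value + 1, top) if up else max(value - 1, 0)


def is_high(value, bits):
    """A counter in its upper half votes 'taken' (or 'use global')."""
    return value >= (1 << (bits - 1))


class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024   # 10-bit branch histories
        self.local_ctr = [0] * 1024    # 3-bit counters, indexed by local history
        self.global_ctr = [0] * 4096   # 3-bit counters, indexed by global history
        self.choice_ctr = [0] * 4096   # 2-bit counters, upper half = trust global
        self.ghist = 0                 # 12-bit global history

    def _lookup(self, pc):
        li = (pc >> 2) & 1023          # assumption: 4-byte instructions, low PC bits
        lh = self.local_hist[li]
        local = is_high(self.local_ctr[lh], 3)
        glob = is_high(self.global_ctr[self.ghist], 3)
        use_global = is_high(self.choice_ctr[self.ghist], 2)
        return li, lh, local, glob, use_global

    def predict(self, pc):
        _, _, local, glob, use_global = self._lookup(pc)
        return glob if use_global else local

    def update(self, pc, taken):
        li, lh, local, glob, _ = self._lookup(pc)
        g = self.ghist
        if local != glob:
            # Train the chooser only when the two predictors disagree.
            self.choice_ctr[g] = sat_update(self.choice_ctr[g], glob == taken, 2)
        self.local_ctr[lh] = sat_update(self.local_ctr[lh], taken, 3)
        self.global_ctr[g] = sat_update(self.global_ctr[g], taken, 3)
        self.local_hist[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((g << 1) | int(taken)) & 0xFFF
```

Calling `predict(pc)` before a branch resolves and `update(pc, taken)` afterwards is enough to drive the model; on a strictly alternating taken/not-taken branch it settles on the correct pattern after a short warm-up.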
Considering the large number of functional units and other difficulties, the engineers who developed EV6 decided to redesign the clock subsystem entirely. A more efficient signal flow allowed the core to reach the frequencies of the much simpler EV56 core while using almost the same technological process. Overall, the clock subsystem of EV6 consumed about 32% of the total core power. For comparison, the figure was about 25% for EV56, about 37% for EV5 and about 40% for EV4.
Clock driver placements for Alpha 21064, 21164 and 21264

EV6 was manufactured using the same technological process as EV56, but with 2 additional metallisation layers. It consisted of 15.2 million transistors (including about 9 million spent on I-cache, D-cache and branch predictors), had a die size of 314mm² and required a 2.1V to 2.3V power supply. 21264 (EV6) core frequencies ranged from 466MHz to 600MHz (TDP approx. from 80W to 110W). Form-factor: PGA-587 (Pin Grid Array).
Micrograph of Alpha 21264 (EV6); floor-plan of Alpha 21264 (EV6)
Samsung Alpha 21264 (EV6): front view; back view

21264A (EV67) entered the market at the end of 1999. It was produced by Samsung using a 0.25µ CMOS process, had a die size of 210mm² and required a lower supply voltage of 2.0V. There were no significant architectural differences compared to EV6. 21264A (EV67) core frequencies ranged from 600MHz to 833MHz (TDP approx. from 70W to 100W), which allowed the Alpha architecture to regain the leadership on integer tasks, lost not long before to Intel Pentium III (Coppermine) and AMD Athlon (K7).
The first samples of 21264B (EV68C) were delivered at the beginning of 2000. This processor was produced by IBM using a 0.18µ CMOS process of its own involving copper conductors. Although there were still no architectural differences, the promising technology allowed core frequencies to be raised right up to 1250MHz. In 2001, Samsung became able to manufacture 21264B (EV68A) in quantity using a 0.18µ CMOS process of its own, but involving aluminium conductors. Compared to EV67, the die size was reduced by more than one third (to 125mm²), and the voltage decreased as well (to 1.7V). 21264B (EV68A) core frequencies ranged between 750MHz and 940MHz (TDP approx. from 60W to 75W). It had been announced in September of 1998 that EV68 by Samsung would be implemented in an innovative 0.18µ FD-SOI (Fully Depleted Silicon-On-Insulator) process involving copper conductors, so that it should be able to reach 1.5GHz and even more. Unfortunately, that didn't happen.
Samsung Alpha 21264B (EV68A, prototype): front view; back view

Various sources mention 21264C and 21264D, code-named EV68CB and EV68DC respectively, manufactured by IBM using the same technology as EV68C and running within the same frequency range, so they can be considered minor modifications. The only noticeable difference was a new form-factor, pinless LGA-675 (Land Grid Array) instead of PGA-587. Apparently, these processors were installed in Compaq servers only.
Besides BWX and MVI inherited from the previous generation of Alpha processors, EV6 implemented a new set of 9 instructions called FIX or FX (Floating-point eXtension), aimed at square root calculations (SQRTF, SQRTG, SQRTS, SQRTT), data transfers from integer to floating-point registers (ITOFF, ITOFS, ITOFT) and from floating-point to integer registers (FTOIS, FTOIT). Another set of 3 instructions called CIX or CX (Count eXtension) was introduced in EV67 to facilitate bit counting tasks (CTLZ, CTTZ, CTPOP). Finally, EV6 and its derivatives featured two prefetching instructions (ECB, WH64) in addition to FETCH and FETCH_M, which had existed from the beginning of the architecture.
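What the three CIX instructions compute can be shown with a short software model on 64-bit values. This is only a sketch of their results, not of the hardware; the function names simply mirror the mnemonics, and returning 64 for a zero operand in the two counting-zeros functions is an assumption of this model.

```python
MASK64 = (1 << 64) - 1  # Alpha integer registers are 64 bits wide


def ctpop(x):
    """Count population: number of set bits."""
    return bin(x & MASK64).count("1")


def ctlz(x):
    """Count leading zeros, starting from bit 63."""
    return 64 - (x & MASK64).bit_length()


def cttz(x):
    """Count trailing zeros, starting from bit 0."""
    x &= MASK64
    if x == 0:
        return 64  # assumption: every bit of a zero operand counts as a zero
    return (x & -x).bit_length() - 1  # position of the lowest set bit
```

Without such instructions the same results need a loop or a sequence of shifts and masks, which is why dedicated opcodes help bit counting tasks.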
Two system logic sets were designed initially for the 21264 processors: DEC Tsunami (21272; also known as Typhoon) and AMD Irongate (AMD-751), though there could have been many more, given that both 21264 and Athlon used almost the same system bus, licensed by DEC to AMD.
DEC Tsunami was a highly scalable system logic set. It could be used to design 1-processor as well as 2-processor and 4-processor systems with a memory data path ranging from 128 to 512 bits (83MHz SDRAM ECC registered) and one or two 33MHz 64-bit PCI buses. Such flexibility was possible because the system logic set consisted of 3 kinds of components: system bus controllers (C-chips, one per processor), memory bus controllers (D-chips, one per every 64 bits of the bus width) and PCI bus controllers (P-chips, one per bus needed). So it is no wonder that some systems (for example, AlphaPC 264DP) carried system logic sets consisting of 12 chips.
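The chip-count rule stated above can be written down directly. The sketch below follows that rule exactly as the paragraph gives it; the function name and the validation limits are hypothetical, and the sample configuration is merely one combination that adds up to 12 chips, since the text does not spell out how AlphaPC 264DP was actually populated.

```python
def tsunami_chip_count(processors, memory_bits, pci_buses):
    """Count Tsunami chips using the per-component rule from the text."""
    if processors not in (1, 2, 4):
        raise ValueError("the text mentions 1, 2 or 4 processors")
    if memory_bits % 64 or not 128 <= memory_bits <= 512:
        raise ValueError("memory data path: 128 to 512 bits in 64-bit steps")
    if pci_buses not in (1, 2):
        raise ValueError("the text mentions one or two PCI buses")
    c_chips = processors          # one C-chip per processor
    d_chips = memory_bits // 64   # one D-chip per 64 bits of bus width
    p_chips = pci_buses           # one P-chip per PCI bus
    return {"C": c_chips, "D": d_chips, "P": p_chips,
            "total": c_chips + d_chips + p_chips}


# Hypothetical example: 2 processors, 512-bit memory path, 2 PCI buses
example = tsunami_chip_count(2, 512, 2)
```

Under this rule the smallest configuration (1 processor, 128 bits, 1 PCI bus) needs 4 chips, while the hypothetical example above needs 12.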
Although AMD Irongate (AMD-751) was developed to serve as a north bridge on Athlon-based mainboards, accompanied by the AMD Viper (AMD-756) south bridge or a compatible one, it was also used in some Alpha mainboards (to be precise, in UP1000 and UP1100). Being a single-chip solution, it cost much less than DEC Tsunami and consumed much less energy. However, it wasn't the best solution for 21264 because it lacked support for multiprocessing and had a narrow memory data bus (64-bit, up to 768MB of SDRAM ECC unbuffered at 100MHz in 3 DIMMs with 2 RAS lines each). Nevertheless, Irongate was the first system logic set for Alpha to feature AGP bus support.
In 2001, Samsung introduced the UP1500 mainboard, a single-processor solution designed around the AMD Irongate-2 (AMD-761) north bridge. This mainboard was superior to UP1000 and UP1100 in terms of performance due to support for much faster operating memory: either up to 4GB of DDR SDRAM ECC registered at 133MHz in 4 DIMMs with 2 RAS lines each, or up to 2GB of DDR SDRAM ECC unbuffered at the same 133MHz in 2 DIMMs with 2 RAS lines each. The memory data bus remained of the same width though.

Copyright (c) Paul V. Bolotoff, 2005-07. All rights reserved.
A full or partial reprint without a permission received from the author is prohibited.
Designed and maintained by Alasir Enterprises, 1999-2007