Introduction
The primary hardware implementation of the upcoming NVIDIA Fermi
architecture is expected to be the GT300 graphics processor, which should replace
GT200, the current top-performance design. It should be noted that GT300
contains (or will contain) many conceptual advances, so it is going to be the
company's key product in the near future. Comparable designs of the past
were NV20 (2001, GeForce 3 family), NV40 (2004, GeForce 6800 family) and G80
(2006, GeForce 8800 family). So what does the Fermi architecture offer in
general, and the GT300 graphics processor in particular?
The Fermi architecture implies that computer graphics is no longer the only
serious task for graphics processors, though it remains a top-priority direction.
NVIDIA targets the new architecture at the market of supercomputers and
other high-performance computing solutions. That is, there must be significant
improvements in floating point performance as well as in programming
convenience. This market has a few must-have requirements, including
support for IEEE-754 double precision floating point and reliable error
checking and correction (ECC) for both memory and cache subsystems. Regular
graphics processors do not need these features; they are just fine with single
precision floating point calculations.
To be precise, they were all right even with integer calculations until about
the end of 2002, when DirectX 9 was released. That meant Shader Model 2.0 with
HLSL as well as FP16 and FP32 textures: a very significant step forward in both
software and hardware terms. On the hardware side, ATI R300 was released the
same year. To make a long story short, its architecture was optimised for
floating point calculations and long shaders, yet it offered excellent
performance when executing legacy integer code, including T&L through HLSL. It
was followed by the unlucky NVIDIA NV30 in 2003. The primary disadvantage of
its hybrid CineFX architecture was a poor implementation of floating point
calculations: FP16 and FP32 were about 2 and 4 times slower than INT16
respectively. There was also a known problem with long shaders which could not
be executed at once and therefore needed temporary registers to store
intermediate data. It turned out later that 32 bytes of such temporary space
(8 FP32 or 16 FP16 registers) were not enough for many tasks. NVIDIA tried to
fix these issues and soon offered NV35, but decent floating point performance
did not arrive until NV40.
But we digress. It should be noted that GT200, unlike its G80-class
predecessors, can execute double precision floating point code, though its
performance is not outstanding overall. The upcoming GT300 is expected to do
double precision up to 8 times faster than GT200 at the same clock speed,
which is an excellent improvement. Anyway, even though GT200 is not an ideal
fit for scientific calculations, it powers the second generation of the NVIDIA
Tesla family of cards. Poor double precision performance is not the only
disadvantage of GT200, but let's look at G80 first.
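To make the double precision point more concrete, here is a minimal CUDA sketch (not from the original text; the kernel name and sizes are illustrative). On GT200-class hardware such code has to be compiled for compute capability 1.3, e.g. nvcc -arch=sm_13, because older G80/G92 parts have no double precision hardware and the compiler demotes double to float with a warning.

// Illustrative double precision kernel (daxpy-style); a sketch, not a benchmark.
#include <cuda_runtime.h>

__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // one double precision multiply-add per element
}

int main()
{
    const int n = 1 << 20;
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    // ... fill x and y with data here (omitted in this sketch) ...
    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaThreadSynchronize();      // CUDA 2.x-era synchronisation call
    cudaFree(x);
    cudaFree(y);
    return 0;
}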
NVIDIA G80 and G92
G80 is the first NVIDIA graphics processor based upon a unified shader
architecture. That is, all calculations are performed by scalar unified
shader pipelines, also known as streaming processors. NVIDIA also
calls them CUDA cores, where CUDA stands for Compute Unified Device
Architecture. To keep things simple, they will be called shader pipelines in
this article. As a matter of fact, every such shader pipeline consists of a
floating point unit, an integer unit and some auxiliary logic, but no
registers or caches of its own. Both of these units are pipelined, hence the
name. Just to mention, the previous generations of NVIDIA graphics processors
starting with NV20 used to run separate vectorised vertex and pixel pipelines.
So, G80 consists of 128 shader pipelines gathered into 8 thread
processing clusters, or simply clusters. Every such cluster is subdivided into
2 streaming multiprocessors, or subclusters. In other words, there are 16
shader pipelines per cluster and 8 shader pipelines per subcluster. Every
subcluster contains 16KB of shared memory accessible to its 8 shader
pipelines. There is also some read-only 1st level cache memory for constants
(64KB in total, i.e. 8KB per cluster) and textures (128KB in total, i.e. 16KB
per cluster). The 2nd level texture cache of 192KB is physically segmented,
with as many segments as memory channels, so that every memory controller
manages a 32KB segment of its own. There are also 6 raster partitions, again
as many as memory channels, and each partition comes with 4 ROPs (24 ROPs in
total). In addition, every subcluster has a local warp scheduler, an
instruction dispatch unit, a register file of 32KB (8192 32-bit registers) and
2 special function units, while every cluster provides 8 texture filtering
units and 4 texture load/store units. Those mysterious special function units
handle transcendental and other special operations (SIN, COS, EXP, RCP, etc.).
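Many of these per-subcluster figures are directly visible to a CUDA programmer. As a rough illustration (not part of the original text), the following sketch queries them through the standard cudaGetDeviceProperties call; on a GeForce 8800 GTX it should report 16 multiprocessors, 16KB of shared memory per block, 8192 registers per block and a warp size of 32.

// Query the per-multiprocessor resources described above via the CUDA runtime.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("multiprocessors (subclusters): %d\n", prop.multiProcessorCount);
    printf("shared memory per block:       %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:           %d\n", prop.regsPerBlock);
    printf("warp size:                     %d threads\n", prop.warpSize);
    return 0;
}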
By the way, a warp is a bundle of 32 parallel
threads in NVIDIA terminology, and a thread is not what it used to be in
general purpose programming but rather a very basic data processing job. 32
threads make a warp, up to 512 threads make a block, blocks make up a grid,
and a grid is spawned by a kernel launch: the host (a general purpose
processor) issues the launch, while the device (the graphics processor)
executes the kernel as a coprocessor. Running very many threads in parallel
makes it possible to hide cache and memory latencies well. General purpose x86
compatible processors prefer not to do so because their threads are much more
expensive to switch between. A single subcluster can execute one instruction
at any moment of time, using data from the threads of a particular warp, and
can keep up to 24 warps (32 on GT200) resident concurrently.
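To make the thread/block/grid terminology concrete, here is a minimal hypothetical kernel launch (again, not from the original text; the kernel name and launch dimensions are arbitrary): the host spawns a grid of blocks, each block holds up to 512 threads on G80-class hardware, and the hardware executes them in warps of 32.

// Sketch of the thread hierarchy: one grid of blocks, 256 threads (8 warps) per block.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    // Each thread handles one element; its global index is derived from
    // the block index within the grid and the thread index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMalloc(&data, n * sizeof(float));

    dim3 block(256);                          // 8 warps of 32 threads
    dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover n elements
    scale<<<grid, block>>>(data, 2.0f, n);

    cudaThreadSynchronize();                  // CUDA 2.x-era synchronisation call
    cudaFree(data);
    return 0;
}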
G80 consisted of 681 million transistors with a die size of 484mm²
on a 90nm TSMC process. It entered the market in November of 2006 and was the
largest graphics processor ever produced by that time. Of course, it would have
been expensive to manufacture even with high yields, and the actual yields were
low to moderate. In addition, it required a companion chip called NVIO to
provide the output interfaces (two RAMDACs, two DVI/HDMI transmitters, one
legacy TV-out). NVIDIA needed a less expensive yet competitive design. In
brief, they integrated NVIO, added 4 more texture load/store units per cluster,
but implemented a simplified memory interface with only 4 memory channels of
64 bits each (256 bits in total) as opposed to G80's 6 memory channels of 64
bits each (384 bits in total). In turn, 4 memory channels implied 4 raster
partitions with 4 ROPs each (16 ROPs in total). The host interface was upgraded
to PCI Express v2.0 (G80 supported v1.0a). G92 was released in October of 2007:
754 million transistors, a 384mm² die, and a 65nm TSMC or UMC process. Later it
was manufactured with a 55nm process, reducing the die size to 230mm². This die
shrink was known as G92b.