Alasir Enterprises

A Quick Analysis of the NVIDIA Fermi Architecture

Paul V. Bolotoff
Release date: 16th of February 2010
Last modified date: 16th of February 2010


1. Introduction. NVIDIA G80 and G92.
2. NVIDIA GT200 and GT300. Conclusions.


The primary hardware implementation of the upcoming NVIDIA Fermi architecture is supposed to be the GT300 graphics processor, which should replace GT200, the current top performance design. It should be mentioned that GT300 contains (or will contain) many conceptual advances, so it's going to be the company's key product in the near future. Comparable designs in the past were NV20 (2001, GeForce 3 family), NV40 (2004, GeForce 6800 family) and G80 (2006, GeForce 8800 family). So what does the Fermi architecture in general, and the GT300 graphics processor in particular, have to offer?
The Fermi architecture implies that computer graphics is no longer the only serious task for graphics processors, though it is still a top priority. NVIDIA is aiming the new architecture at the market of supercomputers and other high-performance computing solutions. That is, there must be significant improvements in floating point performance as well as in programming convenience. This market has a few must-have requirements, including support for double precision floating point per the IEEE-754 standard and reliable error checking and correcting (ECC) for both memory and cache subsystems. Regular graphics processors don't need these features; they are just fine with single precision floating point calculations. In fact, they were fine even with integer calculations until about the end of 2002, when DirectX 9 was released: Shader Model 2.0 with HLSL, plus FP16 and FP32 textures. That was a very significant step forward in both software and hardware terms. On the hardware side, ATI R300 was released the same year. To make a long story short, its architecture was optimised for FP32 calculations and long shaders, yet it offered excellent performance while executing legacy integer code, including T&L through HLSL. It was followed by the unlucky NVIDIA NV30 in 2003. The primary disadvantage of its hybrid CineFX architecture was a poor implementation of floating point calculations: FP16 and FP32 were about 2 and 4 times slower than INT16 respectively. There was also a known problem with long shaders, which couldn't be executed at once and thus needed temporary registers to store intermediate data. It turned out later that 32 bytes of such temporary space (8 FP32 or 16 FP16 registers) were not enough for many tasks. NVIDIA tried to fix these issues and soon offered NV35, but decent floating point performance didn't arrive until NV40. But we digress.
It should be mentioned that GT200, like all G80 derivatives, can execute double precision floating point code, but its performance there is not outstanding. The upcoming GT300 is expected to do double precision up to 8 times faster than GT200 at the same clock speed, which is an excellent improvement to say the least. Anyway, though GT200 is hardly an ideal choice for scientific calculations, it still powers the second generation of NVIDIA Tesla family cards. Poor double precision performance isn't the only disadvantage of GT200, but let's look at G80 first.

NVIDIA G80 and G92

G80 is the first NVIDIA graphics processor based upon a unified shader architecture. That is, all calculations are performed by scalar unified shader pipelines, which are also known as streaming processors. NVIDIA also calls them CUDA cores, where CUDA stands for Compute Unified Device Architecture. To keep things simple, they will be called shader pipelines in this article. As a matter of fact, every such shader pipeline consists of a floating point unit, an integer unit and some auxiliary logic, but no registers or caches of its own. Both of these units are pipelined, hence the name. Just to mention, the previous generations of NVIDIA graphics processors, starting with NV20, used separate vectorised vertex and pixel pipelines. So, G80 consists of 128 shader pipelines gathered into 8 thread processing clusters, or simply clusters. Every such cluster is subdivided into 2 streaming multiprocessors, or subclusters. Thus, there are 16 shader pipelines per cluster and 8 shader pipelines per subcluster. Every cluster contains 16KB of shared memory which is accessible to all 16 of its shader pipelines. There is also some read-only 1st level cache memory for constants (64KB in total, i.e. 8KB per cluster) and textures (128KB in total, i.e. 16KB per cluster). The 2nd level texture cache of 192KB is segmented physically, with as many segments as there are memory channels, so that every memory controller manages a 32KB segment of its own. There are also 6 raster partitions, again as many as memory channels, and each partition comes with 4 ROPs (24 ROPs in total). In addition, every cluster has a local warp scheduler, a dispatch unit, a register unit with a 32KB register file, 2 special function units, 8 texture filtering units and 4 texture load/store units. Those mysterious special function units handle transcendental and other special operations (SIN, COS, EXP, RCP, etc.) as well as all double precision floating point calculations.
By the way, a warp is a bundle of 32 parallel threads in NVIDIA terminology, and a thread isn't what it is in general purpose programming, but rather a very basic data processing job. 32 threads make a warp, up to 512 threads make a block, blocks make grids, and grids are spawned by a kernel, which is actually launched by a host (a general purpose processor) with a device (a graphics processor) acting as a coprocessor. Running very many threads in parallel helps to hide cache and memory latencies well. General purpose x86 compatible processors prefer not to do so because their threads are much more expensive to switch between. A single subcluster can execute one instruction (shader) at any given moment, using data from various threads of a particular warp, and may have up to 32 (or even more) warps in flight concurrently.
G80 consisted of 681 million transistors on a 484mm² die manufactured with a 90nm TSMC process. It entered the market in November of 2006 as the largest graphics processor ever produced by that time. Of course, it was quite expensive to manufacture even assuming high yields, and the actual yields were low to moderate. In addition, it required a companion chip called NVIO to provide output interfaces (two RAMDACs, two DVI/HDMI transmitters, one legacy TV-out). NVIDIA needed a less expensive yet competitive design. In brief, they integrated NVIO and added 4 more texture load/store units per cluster, but implemented a simplified memory interface with only 4 memory channels of 64 bits each (256 bits in total) as opposed to G80's 6 memory channels of 64 bits each (384 bits in total). In turn, 4 memory channels implied 4 raster partitions with 4 ROPs each (16 ROPs in total). The host interface was upgraded to PCI Express v2.0 (G80 supported v1.0a). G92 was released in October of 2007: 754 million transistors, a 384mm² die, a 65nm TSMC or UMC process. Later it was manufactured with a 55nm process, reducing the die size to 230mm²; this die shrink was known as G92b.
NVIDIA G80 block diagram

Copyright (c) Paul V. Bolotoff, 2010. All rights reserved.
A full or partial reprint without a permission received from the author is prohibited.