
NVIDIA GeForce 7800 GTX: New Architecture Exposed

Category: Video
by Alexey Stepin
[ 06/22/2005 | 12:52 PM ]
NVIDIA has unveiled its brand-new graphics processor and proclaimed it to be the fastest in the industry. Today we
take a look at what the company has to offer in terms of architecture, efficiency and power consumption.

Table of contents:

Introduction
GeForce 7: New Graphics Architecture?
More than Just a Pixel Pipeline
Vertex Processors
HDR: More Speed
New FSAA Modes
PureVideo
GeForce 7800 GTX in Detail
Power Consumption and Heat Dissipation
Noise, Overclocking, 2D Quality
Testbed and Methods
Anisotropic Filtering Quality
Transparency Antialiasing Quality
Performance Hit with Transparency Antialiasing
GeForce 7800 GTX in Theoretical Tests


Fill Rate
Pixel Shader Performance
Vertex Shader Performance
Fixed T&L Emulation
Relief Rendering and Other Tests

Video Playback Performance


Conclusion


Introduction
The 14th of April, 2004, was a remarkable day in the 3D graphics realm. Having previously lost the lead to ATI
Technologies, NVIDIA Corporation announced a new graphics processor codenamed NV40. This chip made NVIDIA
a technological leader, since it was the first consumer graphics solution with such revolutionary technologies as
next-generation pixel and vertex shaders (Shader Model 3.0), floating-point color representation and others.

As a sign of departure from the past, NVIDIA abandoned the letters FX in the names of the graphics cards on the
new GPU, and GeForce 6800 cards were really brilliant in all the benchmarks, wresting the crown from the
RADEON 9800 XT. This was not an easy victory, though. The chip came out very complex, consisting of 222 million
transistors, and an acceptable chip yield was only achieved at frequencies of 350-400MHz. Besides that, the higher
heat dissipation made a clumsy and noisy dual-slot cooling system necessary. But even with all these drawbacks,
the release of the GeForce 6800 Ultra was a big step forward for NVIDIA as well as for the industry at large.
Soon after that, on May 4, ATI Technologies replied with the release of the R420 processor and R420-based
graphics cards. Unlike NVIDIA's, ATI's approach was evolutionary rather than revolutionary: the RADEON X800 was
in fact a greatly improved RADEON 9800 rather than something completely new. That approach was quite
justifiable then: the R420 was a rather simple chip (160 million transistors against the NV40's 222 million), and
coupled with new dielectric materials this simplicity allowed ATI to raise the frequency of the new solution to
520MHz, achieving a very high level of performance.
The NV40 and R420 were in fact equals in their basic technical characteristics. Each chip had 16 pixel pipelines
and 6 vertex processors, but the RADEON X800 XT was generally faster than the GeForce 6800 thanks to higher
operational frequencies. NVIDIA's card couldn't use its support of Shader Model 3.0 to its advantage since there
were no games capable of using this feature. Even the Far Cry patch that added SM 3.0 to this game didn't
change anything, as the same patch also added Shader Model 2.0b, which was implemented in the competing
processors from ATI.
So, NVIDIA held the crown of the king of 3D graphics but for a very short while. Moreover, the difficulties with
production of such a complex chip as the NV40 almost immediately resulted in a shortage of GeForce 6800 Ultra
cards (well, ATI's RADEON X800 XT and PRO were not abundant, either). Later on, ATI split the RADEON X800
family into two lines by releasing the high-performance R480 (RADEON X850) and the mass-user-oriented
0.11-micron R430. The maximum frequency of the R480 chip reached 540MHz, whereas the max clock rate of
NVIDIA's NV40- and NV45-based solutions was only 425MHz (on special-edition graphics cards from certain
manufacturers). The top models of NVIDIA's graphics cards were still inferior in performance to their counterparts
from the ATI camp.
The announcement of the multi-chip SLI technology helped NVIDIA to offer more performance than the ATI
RADEON X850 XT Platinum Edition could deliver. Yet, the solution consisting of two GeForce 6800 Ultra/GT
graphics cards turned out to be too expensive, awkward and power-hungry, and also required a special mainboard
based on the nForce4 SLI chipset. On the other hand, people who wanted the best performance money could buy
didn't care about these things at all, and NVIDIA's multi-GPU technology became quite popular.
So, ATI's trump cards by the middle of 2005 were:

- RADEON X850 XT Platinum Edition, the world's fastest graphics card
- The widest range of high-performance graphics processors (five models for each of the AGP and PCI Express buses)

NVIDIA had a few aces, too:

- Multi-GPU SLI technology, the planet's fastest graphics solution
- The GeForce 6600 series, which was enjoying success in the middle sector of the market
- Formal technological superiority, since the GeForce 6 supported Shader Model 3.0, High Dynamic Range, etc.

In other words, both GPU developers offered products that were the best in some way or another, but neither
could offer a chip that was both the fastest and the richest in features. Today NVIDIA and ATI both need a new
graphics processor that would return the crown of the absolute leader to one of them. NVIDIA was the first to
announce a new-generation GPU, making ATI Technologies hurry up with an answer.


GeForce 7: New Graphics Architecture?


The new graphics architecture was quite expectedly named GeForce 7, and the graphics processor which is its
embodiment is called G70. NVIDIA seems to imply that this is a perfectly new architecture rather than an
improved GeForce 6. Is it really so? Let's delve deeper into the matter.

First, the new graphics processor from NVIDIA is the most complex GPU for personal computers today, consisting
of as many as 302 million transistors. For reference: the NV40 consists of 222 million transistors, and the ATI
R480 of only 160 million. This complexity is quite natural since the new GPU contains 24 pixel pipelines and
8 vertex processors against 16 and 6, respectively, in GPUs of the previous generation. Moreover, the G70 is not
just an overgrown NV40: NVIDIA has considerably revised the architecture of the pixel and vertex processors to
improve their performance. More about that shortly, but now let's have a look at the technical specification of the
new GPU in comparison with previous-generation models:

 | NVIDIA GeForce 7800 GTX | NVIDIA GeForce 6800 Ultra | ATI RADEON X850 XT Platinum Edition
Manufacturing technology | 0.11 micron | 0.13 micron | 0.13 micron low-k
Number of transistors | 302 mln. | 222 mln. | 160 mln.
Clock frequency | 430MHz | 400MHz | 520MHz
Graphics memory controller | 256-bit GDDR3 SDRAM | 256-bit GDDR3 SDRAM | 256-bit GDDR3 SDRAM
Graphics memory clock frequency | 1200 MHz | 1100 MHz | 1180 MHz
Memory bus peak bandwidth | 38.4GB/s | 35.2GB/s | 37.8GB/s
Maximum graphics memory size | 512MB | 512MB | 512MB
Interface | PCI Express x16 | PCI Express x16 | PCI Express x16

Pixel processors, pixel shaders
Shader model | 3.0 | 3.0 | 2.x
Static loops and branching | yes | yes | yes
Dynamic loops and branching | yes | yes | no
Multiple Render Targets | yes | yes | yes
Floating-Point Render Target | yes | yes | yes
Maximum number of pixels per clock cycle | 24 | 16 | 16
Maximum number of Z values per clock cycle | 32 | 32 | 16
Number of texturing samples | 16 | 16 | 16
Texture filtering algorithms | bi-linear, tri-linear, anisotropic, tri-linear + anisotropic | bi-linear, tri-linear, anisotropic, tri-linear + anisotropic | bi-linear, tri-linear, anisotropic, tri-linear + anisotropic
Maximum level of anisotropy | 16x | 16x | 16x

Vertex processors, vertex shaders
Shader model | 3.0 | 3.0 | 2.x
Number of vertex processors | 8 | 6 | 6
Static loops and branching | yes | yes | yes
Dynamic loops and branching | yes | yes | no
Reading textures from the vertex shader | yes | yes | no
Tessellation | no | no | no

Full-Screen Anti-Aliasing
FSAA algorithms | ordered-grid supersampling, rotated-grid multisampling, supersampling + multisampling, transparency supersampling/multisampling | ordered-grid supersampling, rotated-grid multisampling, supersampling + multisampling | rotated-grid multisampling, temporal anti-aliasing
Number of samples | 2..8 | 2..8 | 2, 4, 6

Technologies increasing the efficiency of the memory bus bandwidth
Hidden Surface Removal (HSR) | yes | yes | yes
Texture, Z-buffer, frame buffer compression | yes | yes | yes
Fast Z-buffer clear | yes | yes | yes

Additional technologies
OpenEXR (HDR) | yes | yes | no
Videoprocessor | yes | yes | no
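The peak-bandwidth figures in the table follow directly from the bus width and the effective memory clock. A quick sketch of the arithmetic (GB here means 10^9 bytes, the convention GPU vendors use):

```python
# Peak memory bandwidth = bus width in bytes x effective (DDR) memory clock.

def peak_bandwidth_gb_s(bus_bits: int, effective_mhz: int) -> float:
    """Bus width in bits and effective memory clock in MHz -> GB/s."""
    return bus_bits / 8 * effective_mhz * 1e6 / 1e9

cards = {
    "GeForce 7800 GTX": 1200,
    "GeForce 6800 Ultra": 1100,
    "RADEON X850 XT PE": 1180,
}
for name, mhz in cards.items():
    # All three cards use a 256-bit bus.
    print(f"{name}: {peak_bandwidth_gb_s(256, mhz):.1f} GB/s")
# GeForce 7800 GTX: 38.4 GB/s
# GeForce 6800 Ultra: 35.2 GB/s
# RADEON X850 XT PE: 37.8 GB/s
```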

So, the G70-based graphics card is called GeForce 7800 GTX. This new device looks quite imposing, being
technologically head and shoulders above the GeForce 6800 Ultra, not to mention the RADEON X850 XT Platinum
Edition. There's only one parameter the ATI card is superior in: the clock rate of the G70 is lower, having
increased by only 30MHz over the previous-generation NV40 chip. This is of course a consequence of the high
complexity of the chip, but there seem to be no problems with production. NVIDIA says graphics cards on the new
GPU will be available on the day of the announcement. That's good, recalling that last year's announcement of the
NV40 was in fact just a marketing event, the actual silicon being unavailable in shops.
The table above does not reveal the distinguishing features of the G70 which make it a truly new-gen solution. We
will dwell on these features individually. At first sight, the G70 is similar to the NV40, save for the number of pixel
and vertex processors:

The number of Raster Operation (ROP) units has remained the same. There are 16 of them in the chip, and the
number of texture samples per pass is still 16, or 32 when only Z values are sampled. In other words, the number
24 refers to the number of pixel processors only. So, the GeForce 7800 architecture generally resembles the
GeForce 6800, but there are some considerable differences, too.

More than Just a Pixel Pipeline


As we said above, NVIDIA has seriously redesigned the architecture of the pixel pipelines to improve their
performance. The developers modeled 1,300 different shaders to expose the bottlenecks of the previous
architecture, and the resulting pixel pipeline of the G70 looks as follows:

Each of the two shader units now has an additional mini-ALU (these mini-ALUs first appeared back in the NV35,
but the NV40 didn't have them). It improves the mathematical performance of the processor and, accordingly, the
speed of pixel shaders. Each pixel processor can execute 8 instructions of the MADD (multiply/add) type in a
single cycle, and the total performance of 24 such processors with instructions of that type is a whopping
165Gflops, which is three times the performance of the GeForce 6800 Ultra (54Gflops). Loops and branching
available in version 3.0 pixel shaders are fully supported.
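The 165 Gflops figure can be reproduced from the numbers above: 24 pixel processors, each issuing 8 MADDs per clock, with every MADD counting as two floating-point operations, at 430 MHz. A quick check:

```python
# Checking the claimed peak shader rate of the GeForce 7800 GTX.
pixel_processors = 24
madds_per_clock = 8    # per pixel processor (two 4-wide shader units)
flops_per_madd = 2     # a MADD is one multiply plus one add
core_clock_ghz = 0.43  # 430 MHz

gflops = pixel_processors * madds_per_clock * flops_per_madd * core_clock_ghz
print(f"{gflops:.0f} Gflops")  # 165 Gflops
```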
Of course, real-life shaders do not consist of MADD instructions only, but NVIDIA claims the pixel shader
performance of the G70 is two times higher than that of the NV40. We will check this claim in our theoretical
tests, but the improved pixel pipelines look highly promising. We can expect a considerable performance gain in
modern pixel-shader-heavy games.

Vertex Processors
The flowchart of the G70's vertex processor doesn't differ from that of the NV40:

A higher speed of processing geometry is achieved by means of more vertex processors (8 against the NV40s 6)
and, probably, through improvements in the vector and scalar units. According to the official data, the
performance of the scalar unit has increased by 20-30% in comparison with the NV40, and a MADD instruction is
executed in a single cycle in the vector unit. Besides that, the efficiency of cull and setup operations in the fixed
section of the geometry pipeline has increased by 30%. We are going to cover these things in more detail below.
On the whole, we can't call the new architecture from NVIDIA a revolution. It is rather a greatly improved and
perfected GeForce 6, which has been the most advanced architecture in the 3D consumer graphics market until
today. The GeForce 7 carries the leadership on, once again confirming NVIDIA's technological superiority.

HDR: More Speed


The support of the OpenEXR format, which allows outputting an image with an extended dynamic range on the
screen, first appeared in the GeForce 6800 Ultra. This format is employed by Industrial Light & Magic, a division of
Lucasfilm, for creating special effects for modern blockbuster movies.
Alas, this rendering mode requires huge resources, even though it ensures a much better image quality. The first
game to support HDR was the popular 3D shooter Far Cry, beginning with version 1.3. But in fact, this support of
HDR remained more of a marketing trick, since you could not play in this mode even in 1024x768 resolution. For
example, with the performance normally ranging from 55 to 90fps on the Training map in different resolutions, the
HDR mode yielded no more than 15-30fps. Of course, comfortable play was out of the question. NVIDIA's SLI
technology increased the speed in the HDR mode to more acceptable numbers, but the cost of a system with two
GeForce 6800 Ultra/GT cards was very high.
The situation changes with the arrival of the G70, and HDR is going to be more useful for owners of G70-based
graphics cards. According to NVIDIA, the GeForce 7800 GTX is 60% faster in this mode than the GeForce 6800
Ultra thanks to the improved texture-mapping units. So it looks like you can enjoy a beautiful high-dynamic-range
image in resolutions up to 1280x1024 with one such graphics card, while SLI configurations will make 1600x1200
resolution playable in HDR.

GeForce 7800 GTX in Detail


Getting closer to practice, it's time to take our sample of the GeForce 7800 GTX into our hands. At first sight, this
graphics card resembles the GeForce 6800 GT:

Both cards use a compact single-slot cooling system, and the component layout of the GeForce 7800 GTX
resembles NVIDIA's earlier products, too. Changes are most visible in the rear part of the PCB, where the voltage
regulators and other power elements reside. It's not that the power circuit has become simpler, but there are fewer
electrolytic capacitors, and the power elements are placed in three rows rather than in a single line as before.
They now form a small rectangle, covered with a thin-ribbed aluminum heatsink. These changes in the power
circuit layout have made the PCB of the GeForce 7800 GTX longer, so the new device is obviously the longest
graphics card today.
There is an aluminum plate on the reverse side of the PCB, besides the usual bracket for fastening the cooler. The
plate is not just a decoration, as one might have thought. The thing is, the PCB of the GeForce 7800 GTX is
intended for 512 megabytes of GDDR3 memory, and there are 16 places for memory chips, 8 on each side of the
PCB. But the standard amount of memory on a GeForce 7800 GTX card is 256 megabytes. They could have put
eight 256Mbit chips on the face side of the PCB, but NVIDIA preferred to install four chips on either side. So, the
above-mentioned plate is a heat-spreader for the four reverse-side GDDR3 chips. Its efficiency may not be very
high, but GDDR3 memory features a low level of heat dissipation, and there are only four such chips to be cooled
there.
The graphics card uses 1.6ns memory from Samsung rated for 600 (1200) MHz frequency. This is exactly the
frequency the memory chips are clocked at on this card. Note that starting from the GeForce FX 5900 NVIDIA
places memory chips in such a way as to make the pathways that connect them to the GPU as short as possible.
This helps to ensure stable operation at high frequencies.
We dismantled the cooling system to access the graphics processor:

As you see, the die area of the G70 is much larger than that of the NV40, notwithstanding the thinner tech
process (0.11 micron against 0.13 micron). No wonder, as they have added 80 million more transistors to the new
chip. The surface of the G70 is not mirror-like like the surface of the NV40, but kind of matte, probably due to the
difference in the tech processes: the NV40/45 is manufactured at IBM's East Fishkill facilities, while the G70 at
TSMC's fabs. The shape of the dies is different, too: the G70 is a square whereas the NV40 is a rectangle. There is
no separate HSI die here: the G70 natively supports PCI Express. Our sample was manufactured during the 16th
week of the current year, i.e. somewhere at the end of April, and this indicates that NVIDIA doesn't have problems
with manufacturing the new chip. The symbols A2 denote a second revision of the chip, and this too makes us
hope that the supply will be sufficient.

The cooling system deserves to be discussed separately. It is a variation of the GeForce 6800 GT cooler, but a
seriously improved one. The blower drives air through two aluminum heatsinks joined with a U-shaped heat pipe
and also cools the heatsink on the power regulators of the card. The heat pipe doesn't only transfer heat from one
heatsink to the other, but also takes heat off the needle section that touches the memory chips. The whole
arrangement is covered with a plastic casing, although NVIDIA used to employ a metal casing earlier.
Curiously enough, it is perfectly visible in reference snapshots from NVIDIA that the first, or main, heatsink that
takes heat off the graphics core is made of copper, while on our sample it was made of aluminum. Why? We
suppose that copper coolers will be mounted on advanced versions of G70-based graphics cards, with 512MB of
memory and clocked at higher frequencies, while the current version of the GeForce 7800 GTX is quite satisfied
with the aluminum cooler.
Note also that the fan is connected to the card via four wires rather than two as usual. It looks like the fan is
equipped with a tachometer, and the fan speed control system is now more refined and flexible than before. We'll
tell you about its noise characteristics below.

Testbed and Methods


The testbed was configured as follows:

AMD Athlon 64 4000+ CPU (2.40GHz, 1MB L2 cache)

Microsoft Windows XP Pro SP2, DirectX 9.0c

ASUS A8N-SLI Deluxe mainboard


OCZ PC-3200 Platinum EB DDR SDRAM (2x512MB, CL2.5-3-2-8)
Samsung SpinPoint SP1213C hard disk drive (Serial ATA-150, 8MB buffer)
Creative SoundBlaster Audigy 2 audio card

Graphics cards:

NVIDIA GeForce 7800 GTX 256MB (430/1200MHz)


NVIDIA GeForce 6800 Ultra 256MB (400/1100MHz)
NVIDIA GeForce 6800 GT 256MB (350/1000MHz)
ATI RADEON X850 XT Platinum Edition 256MB (540/1180MHz)
ATI RADEON X800 XL 256MB (400/980MHz)

We also tested SLI configurations based on GeForce 7800 GTX and GeForce 6800 Ultra.
Drivers:

NVIDIA ForceWare 77.62


ATI Catalyst 5.6

We installed ForceWare 77.62, NVIDIA's new-generation driver of the so-called ForceWare Release 75, to test the
GeForce 7800 GTX. This is the first version of ForceWare to support the new graphics processor from NVIDIA. It
differs from the older official driver (version 71.89) in improved HDTV support, added game profiles, full OpenGL
2.0 support, and the option for selecting a SLI mode. The interface of the control panel of the new driver has
changed and become more convenient. We chose the following settings for our tests:

The Gamma correct antialiasing and Transparency antialiasing options are available only if a GeForce 7800 GTX is
installed in the system. These options are missing for a GeForce 6800 Ultra. The rest of the settings were
selected in the same way. We chose the Catalyst A.I. Standard mode in ATI Catalyst 5.6 and set the Mipmap Detail
Level to Quality. The VSync option was disabled in both drivers.

Anisotropic Filtering Quality


Before running the tests we decided to check out the quality of anisotropic filtering as it was done by the GeForce
7800 GTX using the appropriate function from 3DMark05.

NVIDIA GeForce 7800 GTX | NVIDIA GeForce 6800 Ultra | ATI RADEON X850 XT Platinum Edition

We could see no difference between the GeForce 7800 GTX and GeForce 6800 Ultra when they were doing
trilinear filtering only. With the RADEON X850 XT Platinum Edition, the mip levels begin farther away and the
transitions between them are smoother than with NVIDIA's cards.

NVIDIA GeForce 7800 GTX | NVIDIA GeForce 6800 Ultra | ATI RADEON X850 XT Platinum Edition

It's hard to distinguish between the GeForce 6800 Ultra and GeForce 7800 GTX as concerns anisotropic filtering.
But on closer inspection you can see that the sharpness of textures is higher with the latter card. Here, the
RADEON X850 XT Platinum Edition produces another picture, with less distinct mip levels. That is, the ATI card is
better at anisotropic filtering than the GeForces. So, the GeForce 7800 GTX doesn't bring us anything new in
terms of anisotropic filtering, but its quality is higher than with the GeForce 6800 Ultra. The difference is
negligible, though, especially in real gaming situations.

Transparency Antialiasing Quality


We took screenshots in two popular shooters, Far Cry and Half-Life 2, to see the effect of the new full-screen
antialiasing methods.

NVIDIA GeForce 7800 GTX: no FSAA | FSAA 4x + TMS | FSAA 4x + TSS | FSAA 8xS + TMS | FSAA 8xS + TSS
ATI RADEON X850 XT Platinum Edition: no FSAA | FSAA 4x | FSAA 6x


Alas, we couldn't find any great differences between ordinary FSAA and FSAA with enabled Transparency
Antialiasing in Far Cry. Well, yes, there is a difference, but it is so negligible that you can notice it only after a
careful examination of each screenshot.

NVIDIA GeForce 7800 GTX: no FSAA | FSAA 4x + TMS | FSAA 4x + TSS | FSAA 8xS + TMS | FSAA 8xS + TSS
ATI RADEON X850 XT Platinum Edition: no FSAA | FSAA 4x | FSAA 6x


Half-Life 2 is quite another story: Transparency Antialiasing makes the image much better in this game. We mean
the Supersampling mode, while the Multisampling mode brings a less impressive picture, even though of an
improved quality, especially in combination with FSAA 8xS. The most beautiful picture can be observed in the
FSAA 8xS + Transparency Supersampling mode: even the smallest details are drawn almost ideally. So, the new
feature of the GeForce 7800 GTX, Transparency Antialiasing, can really improve the quality of certain details of the
image, especially if combined with FSAA 8xS. But how does it affect the speed of the card?

Performance Hit with Transparency Antialiasing


It turned out that TSAA doesn't affect the card's performance as negatively as ordinary FSAA modes do. Take a
look at the diagram:

As you can see, enabled Transparency Multisampling has practically no effect on the performance, even in the
basic 8xS mode. This TSAA variety doesn't improve the image quality as much as Transparency Supersampling
does, but even in the latter case the performance hit is less than 5%. So, if you prefer to play games with enabled
full-screen antialiasing, you can turn Transparency Antialiasing on without worrying about losing anything in
speed.

GeForce 7800 GTX in Theoretical Tests


We performed a full cycle of theoretical tests to reveal the potential of the new graphics processor from NVIDIA.
We made use of the following software:

Marko Dolenc's Fillrate Tester


Xbitmark v0.60

3DMark 2001SE build 330


3DMark03 build 360
3DMark05 build 120

Fill Rate
We traditionally begin our tests with Marko Dolenc's Fillrate Tester.

The diagram shows that the GeForce 7800 GTX isn't much faster than the GeForce 6800 Ultra at single-texturing.
At multi-texturing, however, their graphs diverge more and the advantage of the new solution becomes evident.
The GeForce 7800 GTX is half again as fast as the GeForce 6800 Ultra when mapping three or four textures. Well,
it means it really does have 24 pixel pipelines!
Note also that the results of the GeForce 7800 GTX are rather far from the theoretical maximum, which is probably
due to the relatively low memory clock rate.

It's generally the same picture with disabled Z writes. The RADEON X850 XT Platinum Edition is much slower than
both the GeForce 7800 GTX and the GeForce 6800 Ultra in both cases.
So, the GeForce 7800 GTX behaves similarly to its predecessor, but is evidently limited by the memory
performance.

The GeForce 7800 GTX has 16 ROP units rather than 24, so its fill rate when working only with the Z buffer is
almost the same as the fill rate of the GeForce 6800 Ultra. The newer solution is ahead because of a minor
advantage in the core frequency (430MHz against 400MHz).
By the way, the GeForce 6800 Ultra and GeForce 7800 GTX both exceed the theoretical maximum in this test. We
don't yet know how to explain this fact.
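For reference, the theoretical maxima that these fill-rate diagrams are compared against follow from the unit counts and the core clock. A quick sketch for the GeForce 7800 GTX:

```python
# Theoretical per-clock limits of the GeForce 7800 GTX at a 430 MHz core.
core_mhz = 430

texel_rate = 24 * core_mhz / 1000    # 24 pixel processors, one texture fetch each
pixel_rate = 16 * core_mhz / 1000    # 16 ROPs write at most 16 pixels per clock
z_only_rate = 32 * core_mhz / 1000   # up to 32 Z values per clock in Z-only mode

print(f"texel fill:  {texel_rate:.2f} Gtexel/s")   # 10.32
print(f"pixel fill:  {pixel_rate:.2f} Gpixel/s")   # 6.88
print(f"Z-only fill: {z_only_rate:.2f} Gvalue/s")  # 13.76
```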

NVIDIA GeForce 6800 Ultra (NV40). Part One: Architecture Features and Synthetic Tests in D3D RightMark

We've been waiting for it. And finally, here is the new architecture: a correction of past mistakes and a solid
foundation for the future. But is it really so? We are going to probe into both aspects.

CONTENTS

1. Official specs
2. Architecture
3. 2D and the GPU
4. Videocard features
5. Synthetic tests in D3D RightMark
6. Quality of trilinear filtering and anisotropy
7. Conclusions

The article is mainly devoted to architecture issues and synthetic limit tests. In a while, an article on the
performance and quality of game applications will appear, and then, after the new ATI architecture has been
announced, we'll conduct and publish a detailed comparative study of the quality and speed of AA and
anisotropic filtering in the new-generation accelerators. Before reading this article, make sure you have
thoroughly studied DX Current and DX Next, materials on various aspects of today's hardware graphics
accelerators, and on the architectural features of NVIDIA and ATI products in particular.

GeForce 6800 official specs

Chip codenamed NV40


130nm FSG (IBM) technology
222 million transistors
FC case (flip chip with no metallic cover)
256-bit memory interface
Up to 1 GB of DDR / GDDR -2/ GDDR -3 memory
Bus interface AGP 3.0 8x
A special AGP 16x mode (both directions) for the PCI Express HSI bridge

16 pixel processors, each having a texture unit with optional filtering of integer and
floating-point textures (anisotropy up to 16x)

6 vertex processors, each having one texture unit with no filtering of the fetched values
(nearest-point sampling)

Calculates, blends, and writes up to 16 full pixels (colour, depth, stencil buffer) per clock

Calculates and writes up to 32 values of depth and stencil buffer per clock (if no
operation with colour is executed)

Supports a two-way stencil buffer

Supports special optimisations of geometry rendering for acceleration of shadow


algorithms based on the stencil buffer (the so-called Ultra Shadow II technology)

Supports pixel and vertex shaders 3.0, including dynamic branchings in pixel and vertex
processors, texture value selection from vertex processors, etc.

Texture filtering in the floating-point format

Supports framebuffer in the floating-point format (including blending operations)


MRT (Multiple Render Targets - rendering into several buffers)
2x RAMDAC 400 MHz
2x DVI interfaces (require external chips)
TV-Out and TV-In interface (requires separate chips)

Programmable streaming GPU (for video compression, decompression and postprocessing)

2D accelerator supporting all GDI+ functions

GeForce 6800 Ultra reference card specs

400 MHz core frequency


1.1 GHz (2*550 MHz) effective memory frequency
GDDR-3 memory type
256-MB memory size
35.2 GBps memory bandwidth
Theoretical fill rate: 6.4 Gpixel/s
Theoretical texture fetch rate: 6.4 Gtexel/s
2 DVI-I connectors
TV-Out

Up to 120 W power consumption (the card has two additional power connectors; power
supplies of no less than 480 W are recommended)
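The reference-card throughput figures above are self-consistent; they follow directly from the clock rates and unit counts. A quick sanity check:

```python
# NV40 reference card: fill rate, texture fetch rate and memory bandwidth
# derived from the listed specs.
core_mhz = 400            # core frequency
mem_mhz_effective = 1100  # 2 x 550 MHz GDDR-3
rops = 16                 # full pixels written per clock
tmus = 16                 # texture fetches per clock
bus_bits = 256            # memory interface width

fill_gpps = rops * core_mhz / 1000                        # Gpixel/s
fetch_gtps = tmus * core_mhz / 1000                       # Gtexel/s
bandwidth_gbps = bus_bits / 8 * mem_mhz_effective / 1000  # GB/s

print(fill_gpps, fetch_gtps, bandwidth_gbps)  # 6.4 6.4 35.2
```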

General scheme of the chip

At the current level of detail, no significant architectural differences from the previous generation can be seen.
And it is no surprise, as this scheme has survived several generations and is optimal in many aspects. We would
like to note that there are six vertex processors and four separate pixel processors, each working with one quad
(a 2x2 pixel fragment). Also noteworthy are the two levels of texture caching (a general cache and a personal
cache for each group of 4 TMUs in a pixel processor) and, as a result, the new ratio of 16 TMUs per 16 pixels.
And now we'll zoom in on the most interesting places:

Vertex processors and data selection


An interesting innovation has been introduced: support for different frequency dividers on the streams of vertex
source data. Let us remind you how data are generally fetched for vertices in modern accelerators:

The structure consists of several predefined parameters: scalars and vectors up to 4D, in floating-point or
integer formats, including such special data types as vertex coordinates or the normal vector, colour value,
texture coordinates, etc. Interestingly, they can only be called "special" from the point of view of the API, as the
hardware itself allows arbitrary parameter mapping in the microcode of the vertex shader. But the programmer
needs to specify the source registers of the vertex processor where these data will appear after fetching, in
order not to make redundant data moves in the shader.
Vertex data stored in memory need not be a single contiguous block; they can be divided into a number of
streams (up to 16 in NV40), each carrying one or several parameters. Some of the streams may be in the AGP
address range (that is, fetched from system memory), others may be placed in the local memory of the
accelerator. This approach allows the same data sets to be reused for different objects. For instance, we can
separate geometrical and texture information into different streams and, with one geometrical model, use
different sets of texture coordinates and other surface parameters, thus ensuring an outward difference. Besides,
we can use a separate stream only for the parameters that have really changed; the others can be loaded just
once into the local memory of the accelerator. A current index, common to all streams, is used to access the
parameters of a given vertex. This index either changes in a chaotic way (source data are represented as an
index buffer) or gradually increases (separate triangles, strips and fans).
What is new about vertex data fetching in NV40 is that it is no longer necessary for all the streams to have the
same number of data sets. Each stream can have its own index divider (a so-called Frequency Stream Divider).
Thus we avoid data duplication in some cases and save some capacity and bandwidth of the local memory and
of the system memory addressed through AGP:

Apart from that, a stream can now be represented by a buffer smaller in size than the maximal index value
(even after the divider is applied), and the index will simply wrap around at the stream buffer's boundary. This
novelty can be applied to many tasks, for instance, to compress geometry using hierarchical representations or
to stamp features onto an array of objects (information common to every tree in a forest is only stored once,
etc.). And now take a look at the schematic of the NV40 vertex processor:
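The addressing scheme just described can be sketched in a few lines of Python; the stream and divider names below are purely illustrative, not any real API:

```python
# Sketch of per-stream vertex fetch with a frequency divider and
# buffer wrap-around.

def fetch_vertex(streams, dividers, index):
    """Gather one vertex's attributes from several streams.

    Each stream's element index is the global vertex index divided by
    that stream's frequency divider; the result wraps around the
    stream's length, so a short buffer can serve a long index range.
    """
    return {
        name: data[(index // dividers[name]) % len(data)]
        for name, data in streams.items()
    }

streams = {
    "position": [(0, 0), (1, 0), (0, 1), (1, 1)],  # per-vertex data
    "tree_color": ["green", "brown"],              # per-object data
}
dividers = {"position": 1, "tree_color": 4}  # one colour per 4 vertices

# Vertices 0..3 form the first tree, 4..7 the second:
print(fetch_vertex(streams, dividers, 0)["tree_color"])  # green
print(fetch_vertex(streams, dividers, 5)["tree_color"])  # brown
```

The two-element colour stream serves eight vertices: the divider stretches each entry over four indices, and the wrap-around lets the position buffer repeat, which is exactly the "forest of identical trees" case described above.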

The processor itself is represented as a yellow bar, and the blocks surrounding it are only shown to make the
picture more complete. NV40 is announced to have six independent processors (multiply the yellow bar by six),
each executing its own instructions and having its own control logic. That is, separate processors can
simultaneously execute different condition branches on different vertices. Per clock, an NV40 vertex processor is
able to execute one vector operation (up to four FP32 components), one scalar FP32 operation, and make one
access to a texture. It supports integer and floating-point texture formats and mipmapping. Up to four different
textures can be used in one vertex shader, but there is no filtering, as only the simplest (nearest-point) access to
the value at the specified coordinates is possible. This enabled a considerable simplification of the TMU and,
consequently, of the whole vertex processor (the simpler the TMU, the shorter the pipeline, the fewer transistors).
If you really need it, you can perform filtering in the shader yourself. But of course, it will require several texture
fetches and further calculations, and as a result it will take many more clocks. There are no rigid restrictions on
the length of the shader's microcode: it is fetched from the local memory of the accelerator during execution. But
some specific APIs (namely, DX) may impose such restrictions. Given below is a summary table of the NV40
vertex processor's parameters concerning DX9 vertex shaders, compared to the R3XX and NV3X families:

| Vertex shader version | 2.0 (R3XX) | 2.a (NV3X) | 3.0 (NV40) |
|---|---|---|---|
| Number of instructions in the shader code | 256 | 256 | 512 and more |
| Number of executed instructions | 65535 | 65535 | 65535 and more |
| Predicates | No | Yes | Yes |
| Temporary registers | 12 | 13 | 32 |
| Constant registers | 256 and more | 256 and more | 256 and more |
| Static jumps | Yes | Yes | Yes |
| Dynamic jumps | No | Yes | Yes |
| Nesting depth of dynamic jumps | No | 24 | 24 |
| Texture value selections | No | No | Yes (4) |
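As noted above, a vertex shader can emulate filtering with several nearest fetches plus arithmetic. A Python sketch of bilinear filtering built on such unfiltered fetches (the 2x2 texture and function names are invented for this example):

```python
import math

def fetch_nearest(texture, u, v):
    """The only access an NV40 vertex TMU offers: the nearest texel."""
    h, w = len(texture), len(texture[0])
    return texture[int(v) % h][int(u) % w]

def bilinear_in_shader(texture, u, v):
    """Emulate one bilinear fetch with four nearest fetches and blending,
    the way a vertex shader would have to do it (several selections and
    extra maths, hence many more clocks than a filtered fetch)."""
    u0, v0 = math.floor(u - 0.5), math.floor(v - 0.5)
    fu, fv = (u - 0.5) - u0, (v - 0.5) - v0
    t00 = fetch_nearest(texture, u0,     v0)
    t10 = fetch_nearest(texture, u0 + 1, v0)
    t01 = fetch_nearest(texture, u0,     v0 + 1)
    t11 = fetch_nearest(texture, u0 + 1, v0 + 1)
    top = t00 * (1 - fu) + t10 * fu
    bot = t01 * (1 - fu) + t11 * fu
    return top * (1 - fv) + bot * fv

tex = [[0.0, 1.0],
       [1.0, 0.0]]
print(bilinear_in_shader(tex, 1.0, 1.0))  # centre of the 2x2 block -> 0.5
```

One filtered sample costs four fetches and seven multiply-adds here, which is why the hardware shortcut matters for pixel TMUs even though the vertex TMU omits it.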

In fact, looking back at the NV3X architecture, it becomes clear that NVIDIA's developers only had to
increase the number of temporary registers and add a TMU module. Later we will examine synthetic
test results and find out how close the NV40 and NV3X architectures are in terms of performance.
Another interesting aspect we will dwell on is the performance of FFP (fixed T&L) emulation: we would
like to know whether NV40 still has the special hardware units that gave NV3X such a visible boost on
FFP geometry.

Pixel processors and filling organisation


Let's examine the NV40 pixel architecture in the order the data flows through it. This is what happens
after the triangle parameters are set:

Now for the most interesting facts. First, in contrast to NV3X, which only had one quad processor
taking a block of four pixels (2x2) per clock, we now have four such processors. They are absolutely
independent of one another, and each of them can be excluded from work (for instance, to create a
lighter chip version with three processors in case one of them is defective). Each processor still has its
own round quad queue (see the DX Current article). Consequently, pixel shaders are executed much
as in NV3X: more than a hundred quads are run through one setting (operation), followed by a setting
change according to the shader code. But there are major differences too. First of all, the number of
TMUs: there is now only one TMU per pixel of a quad, and as we have four quad processors with four
TMUs each, that makes a total of 16 TMUs.
The new TMUs support anisotropic filtering with a maximal ratio of 16:1 (so-called 16x, against 8x in
NV3X). And they are the first able to perform all kinds of filtering on floating-point texture formats,
provided the components have 16-bit precision (FP16); FP32 filtering still remains impossible. But
reaching the FP16 level is already visible progress: from now on, floating-point textures are a viable
alternative to integer ones in any application, especially as FP16 textures are filtered with no speed
degradation. (However, the increased data flow may, and probably will, impact the performance of real
applications.)
Also noteworthy is the two-level texture caching: each quad processor has its own first-level texture
cache. This is necessary for two reasons: the number of quads processed simultaneously has increased
fourfold (the quad queues haven't become longer, but the number of processors has risen to four),
and there is now an additional access to the texture cache from the vertex processors.
Each pixel has two ALUs, each capable of executing two different operations on different numbers of
arbitrarily selected vector components (up to four). Thus, the following splits are possible: 4, 1+1,
2+1, 3+1 (as in R3XX), and also the new 2+2 configuration, not possible before (see the DX Current
article for details). Optional masking and post-operation component rearrangement are supported too.
Besides, the ALUs can normalise a vector in a single operation, which can considerably influence the
performance of some algorithms. Hardware calculation of SIN and COS values was removed from the
new NVIDIA architecture: it turned out that the transistors spent on these operations were wasted.
Better speed can be achieved with a simple table lookup (a 1D texture), especially considering that
ATI doesn't support these operations either.
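As a rough sketch of such a 1D-texture lookup (the table size and linear interpolation scheme are chosen arbitrarily here, purely for illustration):

```python
import math

TABLE_SIZE = 256
# Precompute one period of sine into a "1D texture".
SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def sin_lookup(x):
    """Approximate sin(x) with a table fetch plus linear interpolation,
    the way a shader would sample a 1D texture instead of relying on a
    hardware SIN instruction."""
    t = (x / (2 * math.pi)) % 1.0 * TABLE_SIZE
    i = int(t)
    f = t - i
    a = SIN_TABLE[i % TABLE_SIZE]
    b = SIN_TABLE[(i + 1) % TABLE_SIZE]
    return a * (1 - f) + b * f

print(abs(sin_lookup(1.0) - math.sin(1.0)))  # error well under 1e-3
```

With 256 entries and linear interpolation the error stays below about 1e-4, which is ample for most shading purposes.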
Thus, depending on the code, from one to four different FP32 operations on scalars and vectors can be
executed per clock. As you can see in the schematic, the first ALU is used for service operations during
texture fetches. So, within one clock we can either fetch one texture value and use the second ALU for
one or two operations, or use both ALUs if no texture is being fetched. Performance depends directly
on the compiler and the code, but we definitely have the following variants:
Minimum: one texture selection per clock
Minimum: two operations per clock without texture selection
Maximum: one texture selection and two operations per clock
Maximum: four operations per clock without texture selection

According to certain information, the number of temporary registers per quad has been doubled, so we
now have four temporary FP32 registers per pixel, or eight temporary FP16 registers. This must
dramatically increase the performance of complex shaders. Moreover, all hardware restrictions on pixel
shader size and the number of texture fetches have been removed; now everything depends on the
API only. The most important modification is that execution can now be controlled dynamically. Later,
when the new SDK and the next DirectX 9 version (9.0c) appear, we'll conduct a thorough study of the
implementation and performance of pixel shaders 3.0 and dynamic branching. And now take a look at
a summary table of capabilities:

| Pixel shader version | 2.0 (R3XX) | 2.a (NV3X) | 2.b (R420?) | 3.0 (NV40) |
|---|---|---|---|---|
| Nesting of texture selections, maximum | 4 | No restrictions | 4 | No restrictions |
| Texture value selections, maximum | 32 | No restrictions | No restrictions | No restrictions |
| Shader code length | 32 + 64 | 512 | 512 | 512 and more |
| Number of shader instructions executed | 32 + 64 | 512 | 512 | 65535 and more |
| Interpolators | 2 + 8 | 2 + 8 | 2 + 8 | 10 |
| Predicates | No | Yes | No | Yes |
| Temporary registers | 12 | 22 | 32 | 32 |
| Constant registers | 32 | 32 | 32 | 224 |
| Optional component rearrangement | No | Yes | No | Yes |
| Gradient instructions (DDX/DDY) | No | Yes | No | Yes |
| Nesting depth of dynamic jumps | No | No | No | 24 |

Evidently, the soon-to-be-announced ATI architecture (R420) will support the 2.b profile already
present in the shader compiler. Without making hasty conclusions, we will say, however, that NV40's
flexibility and programming capabilities are beyond comparison.
Now let's go back to our schematic and look at its lower part. It contains a unit responsible for
comparing and modifying colour, transparency, depth and stencil buffer values. All in all, there are 16
such units. Since the comparison-and-modification task is executed quite similarly in every case, each
unit can work in one of the following two modes.
Standard mode (executed in one clock):

Comparison and modification of depth value


Comparison and modification of stencil buffer value
Comparison and modification of transparency and colour component values (blending)

Turbo mode (executed in one clock):

Comparison and modification of two depth values


Comparison and modification of two stencil buffer values

Certainly, the latter mode is only possible when no colour value is being calculated and written. That is
why the specs say that with no colour output, the chip can fill 32 pixels per clock, evaluating depth and
stencil buffer values. This turbo mode is mainly useful for faster stencil-based shadow volume
construction (the algorithm from Doom III) and for a preliminary rendering pass that only fills the
Z buffer. (This technique often saves time on long shaders, since the overdraw factor is reduced to
one.)
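The benefit of such a Z-only preliminary pass can be shown with a toy Python model (a deliberately simplified single-pixel simulation, not actual hardware behaviour):

```python
def shade_count(fragments, prepass):
    """Count expensive pixel-shader executions for one screen pixel.

    `fragments` lists the depths arriving at this pixel in draw order
    (smaller = closer). Without a pre-pass, every fragment that passes
    the depth test at its moment of arrival gets shaded; after a Z-only
    pre-pass the depth buffer already holds the nearest value, so the
    long shader runs only for the visible fragment.
    """
    if prepass:
        zbuf = min(fragments)          # cheap Z-only pass settles depth first
        return sum(1 for z in fragments if z <= zbuf)
    shaded, zbuf = 0, float("inf")
    for z in fragments:
        if z < zbuf:                   # passes the depth test -> shade it
            zbuf = z
            shaded += 1
    return shaded

back_to_front = [9.0, 7.0, 4.0, 2.0]   # worst-case draw order, overdraw factor 4
print(shade_count(back_to_front, prepass=False))  # 4 shader runs
print(shade_count(back_to_front, prepass=True))   # 1 shader run
```

In the worst case the pre-pass cuts shader work by the full overdraw factor, which is exactly where the double-speed Z-only filling pays off.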
Luckily, the NV4X family now supports MRT (Multiple Render Targets, i.e. rendering into several
buffers): up to four different colour values can be calculated in one pixel shader and written into
different buffers (of the same size). The absence of this feature in NV3X used to play into the hands
of R3XX, but now NV40 has turned the tables. It also differs from previous generations in its extensive
support of floating-point arithmetic: all comparison, blending and colour-writing operations can now be
performed in the FP16 format. So we finally have full (orthogonal) support of 16-bit floating-point
operations, from texture selection and filtering to frame buffer operations. FP32 is next, but that will
be an issue for a future generation.
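As a conceptual sketch only (the buffer names and the shader are invented for illustration; this is not an NV40 API), MRT lets one shader invocation fill several same-size buffers at once, for example a deferred-shading G-buffer:

```python
def shade_mrt(pixel_normal, pixel_albedo, depth):
    """A toy pixel 'shader' emitting up to four colour values at once
    (MRT): here three planes of a hypothetical deferred-shading G-buffer."""
    return {
        "albedo": pixel_albedo,
        "normal": pixel_normal,
        "depth":  (depth,),
    }

W, H = 2, 2
# One render target per output; MRT requires all targets be the same size.
gbuffer = {name: [[None] * W for _ in range(H)]
           for name in ("albedo", "normal", "depth")}

for y in range(H):
    for x in range(W):
        outputs = shade_mrt((0.0, 0.0, 1.0), (x / W, y / H, 0.5), 1.0)
        for name, value in outputs.items():
            gbuffer[name][y][x] = value   # one pass fills every buffer

print(gbuffer["albedo"][0][1])  # (0.5, 0.0, 0.5)
```

Without MRT, filling these three buffers would take three full rendering passes instead of one.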
Another interesting fact is the MSAA support. Like its NV2X and NV3X predecessors, NV40 can execute
2x MSAA with no speed degradation (two depth values per pixel are generated and compared), and it
takes one penalty clock to execute 4x MSAA. (In practice, however, there is no need to calculate all
four values within one clock, as the limited memory bandwidth makes it difficult to write so much
information per clock into the depth and frame buffers.) Modes above 4x MSAA are not supported; as
in the previous family, all more complex modes are hybrids of 4x MSAA and SSAA of one size or
another. But at least it supports RGMS (rotated-grid multisampling):

This visibly improves the antialiasing quality of slanted edges. At this point we finish our description
of the NV40 pixel processor and proceed to the next chapter.

2D and the GPU


This is a separate programmable NV40 unit charged with processing video streams:

The processor contains four functional units (an integer ALU, a 16-component vector integer ALU, a
data load/store unit, and a unit controlling jumps and conditions) and can thus execute up to four
different operations per clock. The data format is integer with 16-bit or 32-bit precision (which exactly
is not known, but 8 bits would not be enough for some algorithms). For convenience, the processor
includes special facilities for selecting, switching and writing data streams. Classical tasks such as
video decoding and encoding (IDCT, deinterlacing, colour model transformations, etc.) can be executed
without the CPU. Still, a certain amount of CPU control is required: it is the CPU that has to prepare
data and transform parameters, especially in complex compression algorithms that include unpacking
as one of the interim steps.
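For instance, one of the mentioned colour model transformations, YCbCr to RGB, boils down to a few multiply-adds per pixel. A minimal Python sketch using the full-range BT.601 coefficients (chosen here purely for illustration):

```python
def ycbcr_to_rgb(y, cb, cr):
    """One of the 'colour model transformations' the video processor can
    offload: full-range ITU-R BT.601 YCbCr -> RGB, per pixel."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, round(v)))
    return clamp(r), clamp(g), clamp(b)

print(ycbcr_to_rgb(128, 128, 128))  # mid grey -> (128, 128, 128)
print(ycbcr_to_rgb(255, 128, 128))  # white    -> (255, 255, 255)
```

A few multiply-adds per pixel is trivial, but at HDTV resolutions and frame rates it adds up to a load worth moving off the CPU.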
Such a processor can relieve the CPU of many operations, especially with hi-res video such as the
increasingly popular HDTV formats. Unfortunately, it is not known whether the processor's capabilities
are used for 2D graphics acceleration, especially for some really complex GDI+ functions. But in any
case, NV40 meets the requirements for hardware 2D acceleration: all the necessary compute-intensive
GDI and GDI+ functions are executed in hardware.

OpenGL extensions and D3D features


Here's the list of extensions supported by OpenGL (Drivers 60.72):

GL_ARB_depth_texture

GL_ARB_fragment_program
GL_ARB_fragment_program_shadow
GL_ARB_fragment_shader
GL_ARB_imaging
GL_ARB_multisample
GL_ARB_multitexture
GL_ARB_occlusion_query
GL_ARB_point_parameters
GL_ARB_point_sprite
GL_ARB_shadow
GL_ARB_shader_objects
GL_ARB_shading_language_100
GL_ARB_texture_border_clamp
GL_ARB_texture_compression
GL_ARB_texture_cube_map
GL_ARB_texture_env_add
GL_ARB_texture_env_combine
GL_ARB_texture_env_dot3
GL_ARB_texture_mirrored_repeat
GL_ARB_texture_non_power_of_two
GL_ARB_transpose_matrix
GL_ARB_vertex_buffer_object
GL_ARB_vertex_program
GL_ARB_vertex_shader
GL_ARB_window_pos
GL_ATI_draw_buffers
GL_ATI_pixel_format_float
GL_ATI_texture_float
GL_ATI_texture_mirror_once
GL_S3_s3tc
GL_EXT_texture_env_add
GL_EXT_abgr
GL_EXT_bgra
GL_EXT_blend_color
GL_EXT_blend_equation_separate
GL_EXT_blend_func_separate
GL_EXT_blend_minmax
GL_EXT_blend_subtract
GL_EXT_compiled_vertex_array
GL_EXT_Cg_shader
GL_EXT_depth_bounds_test
GL_EXT_draw_range_elements
GL_EXT_fog_coord
GL_EXT_multi_draw_arrays
GL_EXT_packed_pixels
GL_EXT_pixel_buffer_object
GL_EXT_point_parameters
GL_EXT_rescale_normal
GL_EXT_secondary_color
GL_EXT_separate_specular_color

GL_EXT_shadow_funcs
GL_EXT_stencil_two_side
GL_EXT_stencil_wrap
GL_EXT_texture3D
GL_EXT_texture_compression_s3tc
GL_EXT_texture_cube_map
GL_EXT_texture_edge_clamp
GL_EXT_texture_env_combine
GL_EXT_texture_env_dot3
GL_EXT_texture_filter_anisotropic
GL_EXT_texture_lod
GL_EXT_texture_lod_bias
GL_EXT_texture_mirror_clamp
GL_EXT_texture_object
GL_EXT_vertex_array
GL_HP_occlusion_test
GL_IBM_rasterpos_clip
GL_IBM_texture_mirrored_repeat
GL_KTX_buffer_region
GL_NV_blend_square
GL_NV_centroid_sample
GL_NV_copy_depth_to_color
GL_NV_depth_clamp
GL_NV_fence
GL_NV_float_buffer
GL_NV_fog_distance
GL_NV_fragment_program
GL_NV_fragment_program_option
GL_NV_fragment_program2
GL_NV_half_float
GL_NV_light_max_exponent
GL_NV_multisample_filter_hint
GL_NV_occlusion_query
GL_NV_packed_depth_stencil
GL_NV_pixel_data_range
GL_NV_point_sprite
GL_NV_primitive_restart
GL_NV_register_combiners
GL_NV_register_combiners2
GL_NV_texgen_reflection
GL_NV_texture_compression_vtc
GL_NV_texture_env_combine4
GL_NV_texture_expand_normal
GL_NV_texture_rectangle
GL_NV_texture_shader
GL_NV_texture_shader2
GL_NV_texture_shader3
GL_NV_vertex_array_range
GL_NV_vertex_array_range2

GL_NV_vertex_program
GL_NV_vertex_program1_1
GL_NV_vertex_program2
GL_NV_vertex_program2_option
GL_NV_vertex_program3
GL_NVX_conditional_render
GL_SGIS_generate_mipmap
GL_SGIS_texture_lod
GL_SGIX_depth_texture
GL_SGIX_shadow
GL_SUN_slice_accum
GL_WIN_swap_hint
WGL_EXT_swap_control

D3D parameters can be seen here:

D3D RightMark: NV40, NV38, R360
DX CapsViewer: NV40, NV38, R360

Attention! Be advised that the current DirectX version with the current NVIDIA drivers (60.72) does
not yet support the capabilities of pixel and vertex shaders 3.0. Perhaps the release of DirectX 9.0c
will solve the problem, or perhaps the current DirectX will do, but only after programs are recompiled
against the new SDK libraries. Such recompiled versions should appear soon.
