
PCI Express: An Overview

PCI Express has generated a lot of excitement in the PC enthusiast scene.
by Jon Stokes

Introduction
With the launch of Intel's 900-series chipsets and the recent return of SLI to the video card
scene, PCI Express has finally arrived on the PC enthusiast scene in a big way. PCI
Express-enabled motherboards are going to start becoming more and more common, and
with the new bus's increasing ubiquity will come the inevitable confusion that accompanies
the rise of any new technology, especially one as complex and feature-rich as PCI Express.
In this article, we'll take a detailed look at the features of PCI Express: what it is, what it
isn't, and how it improves on the venerable interconnect scheme that we've all come to
know and curse: PCI.
Basic PC system architecture
No doubt most Ars readers are familiar with the basic layout of a PC system, but it's
worthwhile to do a brief recap in order to set the stage for the discussion that follows.
Logically, an average PCI system is laid out in something like the following manner:

Figure 1: PCI system layout
The core logic chipset acts as a switch or router, and routes I/O traffic among the different
devices that make up the system.
In reality, the core logic chipset is split into two parts: the northbridge and the southbridge
(or I/O bridge). This split is there for a couple of reasons, the most important of which is the
fact that there are three types of devices that naturally work very closely together, and so
they need to have faster access to each other: the CPU, the main memory, and the video
card. In a modern system, the video card's GPU is functionally a second (or third) CPU, so
it needs to share privileged access to main memory with the CPU(s). As a result, these
three devices are all clustered together off of the northbridge.
The northbridge is tied to a secondary bridge, the southbridge, which routes traffic from the
different I/O devices on the system: the hard drives, USB ports, Ethernet ports, etc. The
traffic from these devices is routed through the southbridge to the northbridge and then on
to the CPU and/or memory.

Figure 2: northbridge and southbridge
As is evident from the diagram above, the PCI bus is attached to the southbridge. This bus
is usually the oldest, slowest bus in a modern system, and is the one most in need of an
upgrade.
For now, the main thing that you should take away from the previous diagram is that the
modern PC is a motley collection of specialized buses of different protocols and bandwidth
capabilities. This mix of specialized buses designed to attach different types of hardware
directly to the southbridge is something of a continuously evolving hack that has been
gradually and collectively engineered by the PC industry as it tries to get around the
limitations of the aging PCI bus. Because the PCI bus can't really cut it for things like Serial
ATA, Firewire, etc., the trend has been to attach interfaces for both internal and external I/O
directly to the southbridge. So today's southbridge is sort of the Swiss Army Knife of I/O
switches, and thanks to Moore's Curves it has been able to keep adding functionality in the
form of new interfaces that keep bandwidth-hungry devices from starving on the PCI bus.
In an ideal world, there would be one primary type of bus and one bus protocol that
connects all of these different I/O devices (including the video card/GPU) to the CPU and
main memory. Of course, this "one bus to rule them all" ideal is never, ever going to happen
in the real world. It won't happen with PCI Express, and it won't happen with Infiniband
(although it technically could happen with Infiniband if we threw away all of today's PC
hardware and started over from scratch with a round of natively Infiniband-compliant
devices).
Still, even though the utopian ideal of one bus and one bus protocol for every device will
never be achieved, there has to be a way to bring some order to the chaos. Luckily for us, that
way has finally arrived in the form of PCI Express (a.k.a. PCIe).
With Intel's recent launch of its 900-series chipsets and NVIDIA and ATI's announcements
of PCI Express-compatible cards, PCIe will shortly begin cropping up in consumer systems.
This article will give you the lowdown on what you can expect from the bus technology that
will dominate the personal computer for the coming decade.
Note: A few of the more server-specific features of PCI Express are not covered in this
article. These include hot plugging and hot swapping, as well as reliability-oriented features
like packet retries and such.
A primer on PCI
Before I go into detail on PCIe, it helps to understand how PCI works and what its
limitations are.
The PCI bus debuted over a decade ago at 33MHz, with a 32-bit bus and a peak
theoretical bandwidth of 132MB/s. This was pretty good for the time, but as the rest of
the system got more bandwidth hungry, both the bus speed and the bus width were
cranked up in an effort to keep pace. Later flavors of PCI included a 64-bit, 33MHz
combination with a peak bandwidth of 264MB/s, and a more recent 64-bit, 66MHz
combination with a peak bandwidth of 528MB/s.
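These peak figures are just bus width times clock rate (times any double-pumping, for the later double-pumped variants). A quick sketch of the arithmetic; note that 64 bits at a nominal 66MHz works out to 528MB/s:

```python
def peak_bandwidth_mb(bus_width_bits, clock_mhz, transfers_per_clock=1):
    """Peak theoretical bandwidth in MB/s: width in bytes x clock x pumping."""
    return (bus_width_bits // 8) * clock_mhz * transfers_per_clock

# The PCI flavors described above:
print(peak_bandwidth_mb(32, 33))  # original PCI: 132 MB/s
print(peak_bandwidth_mb(64, 33))  # 64-bit, 33MHz: 264 MB/s
print(peak_bandwidth_mb(64, 66))  # 64-bit, 66MHz: 528 MB/s
```

Remember that these are peak theoretical numbers; sustained real-world bandwidth is always lower.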
PCI uses a shared bus topology to allow for communication among the different
devices on the bus; the different PCI devices (i.e., a network card, a sound card, a RAID
card, etc.) are all attached to the same bus, which they use to communicate with the
CPU. Take a look at the following diagram to get a feel for what a shared bus looks like.

Figure 3: the shared bus
Because all of the devices attached to the bus must share it among themselves, there
has to be some kind of bus arbitration scheme in place for deciding who gets access
to the bus and when, especially in situations where multiple devices need to use the bus
at the same time. Once a device has control of the bus, it becomes the bus master,
which means that it can use the PCI bus to talk to the CPU or memory via the chipset's
southbridge.
Speaking of the southbridge, the large system diagram that I presented on the first page
(the one with the PCI devices attached to the southbridge) represents how things are
actually configured in the real world, as opposed to the idealized representation given
immediately above. The southbridge, the northbridge, and the CPU all combine to fill
the host or root role, which we'll discuss in a bit more detail momentarily. For now, it
will suffice to note that the root runs the show: it detects and initializes the PCI devices,
and it controls the PCI bus by default. Or another way to put it would be to say that the
purpose of the PCI bus is to connect I/O devices to the root, so that the root can read
from them and write to them, and just generally use them to talk either to storage
devices or to the outside world.
The shared bus topology's main advantages are that it's simple, cheap, and easy to
implement (or at least, that's the case as long as you're not trying to do anything too
fancy with it). Once you start demanding more performance and functionality from a
shared bus, then you run into its limitations. Let's take a look at some of those
limitations, in order to motivate our discussion of PCI Express's improvements.
From the CPU's perspective, PCI devices are accessible via a fairly straightforward
load-store mechanism. There's a flat, unified chunk of address space dedicated to PCI
use, which looks to the CPU much like a flat chunk of main memory address space, the
primary difference being that at each range of addresses there sits a PCI device instead
of a group of memory cells containing code or data.

Figure 4: memory space
So in the same way that the CPU accesses memory by performing loads and stores to
specific addresses, it accesses PCI devices by performing reads and writes to specific
addresses.
When a PCI-enabled computer boots up, it must initialize the PCI subsystem by
assigning chunks of the PCI address space to the different devices so that they'll be
accessible to the CPU. Once the devices are initialized and know which parts of the
address space that they "own," they start listening to the bus for any commands and
data that might be directed their way. Once an individual PCI device "hears" an address
that it owns being placed on the bus, then it reads any data following behind that
address.
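The assign-then-listen scheme can be pictured as a lookup from assigned address ranges to devices. This is only an illustrative sketch (the device names, addresses, and methods are made up, and real PCI configuration involves config-space registers not shown here):

```python
class PciDevice:
    """Toy stand-in for a PCI device's memory-mapped registers."""
    def __init__(self, name):
        self.name = name
        self.registers = {}

    def write(self, offset, value):
        self.registers[offset] = value

    def read(self, offset):
        return self.registers.get(offset, 0)

# At boot, the root assigns each device a chunk of the PCI address space.
address_map = [
    (0xE000_0000, 0xE000_FFFF, PciDevice("network card")),
    (0xE001_0000, 0xE001_FFFF, PciDevice("sound card")),
]

def bus_write(addr, value):
    # Every device "listens" to the bus; only the owner of the range responds.
    for start, end, dev in address_map:
        if start <= addr <= end:
            dev.write(addr - start, value)
            return dev.name
    raise ValueError("no device claims this address")

print(bus_write(0xE001_0004, 0xAB))  # -> sound card
```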
This scheme works fine when there are only a few devices attached to the bus, listening
to it for addresses and data. But the nature of a bus is that any device that's attached to
it and is "listening" to it injects a certain amount of noise onto the bus. Thus the more
devices that listen to the bus (and thereby place an electrical load on it), the
more noise there is on the bus and the harder it becomes to get a clean signal through.
Sharing the bus
In this respect, the shared bus is kind of like the following slightly loopy scenario:
Imagine an office building in which there is only one phone line that everyone shares.
People work all day in their cubicles with their phones off the hook and their hands-free
speakerphones turned on, listening for the front-office secretary to call out their name,
"Mr. Smith, Ms. Jones is here at my desk and wants to talk to you, so I'm going to put
her on. Now pay attention, because here she is..." With only a few employees this lame
scheme would be a pain but it would at least be feasible. But in an office of hundreds,
the amount of ambient background noise pouring into each speakerphone would
combine to make the entire line a noisy mess, and it would be very hard to hear your
name called out above the racket.
This load-related noise phenomenon, along with clock skew issues, is the reason that
PCI buses are limited to five card-based devices at most. (If you solder PCI devices
directly onto the motherboard, the signal is cleaner so you can put a few more than five
on a single bus.)
What this means in real life is that if you want to put more than five PCI devices on a
system, then you must use PCI-to-PCI bridge chips configured in the following manner:

Figure 5: PCI-to-PCI bridge chips
This hierarchical tree structure, outlined above, is one of the features that distinguishes
PCI from peer-to-peer and point-to-point next-generation interconnects like
HyperTransport and Infiniband. The root at the top of the diagram is the master
controller which is responsible for initializing and configuring all of the PCI devices in the
system at boot-up. This makes every PCI device a slave device, with one master
controlling them. And because the master must enumerate all of the devices and
configure the entire system at boot time, there can be no hot-plugging or hot-swapping.
Excursus: organizing bus traffic
Generally speaking, there are two pairs of categories into which all bus traffic can be
placed. The first pair of categories is address traffic and data traffic. The data is the
information that you're using the bus to send or receive from a device that's attached to
it, and the address is the location of the particular device (or the region within a particular
device) where the information is being sent. So any bus which supports multiple devices
will need a way of handling both address traffic and data traffic, and of distinguishing
between the two.
The second pair of categories, which overlaps the first pair, is command
traffic and read/write traffic. A command consists of a chunk of data containing some
type of configuration or control information (= a specific type of data) which is sent to a
particular device (= a particular address) on the bus. So command traffic includes both
address and data traffic. Examples of command traffic are initialization instructions for a
device, a device reset signal, a configuration command that causes the device to switch
operating modes, etc. Command traffic allows the CPU to control how the PCI device
handles the data that flows in and out of it.
Read/write traffic is the most important type of traffic, because it consists of the actual
information that is being sent to the device. For instance, a PCI RAID controller uses
read and write traffic to send and receive the actual files which it reads from and writes
to its attached hard disks, a PCI sound card uses read/write traffic to get the sound data
that it puts out through its speaker jack, and so on. Like command traffic, read/write
traffic consists of addresses coupled with data, and so accounts for part of both of these
types of traffic.
Different buses and bus protocols have different ways of handling these four
overlapping types of traffic. For instance, many common bus types actually consist of
two separate buses: an address bus and a data bus. Addresses are placed on the
address bus and data is placed on the data bus, with the result that data is able to flow
quickly between devices because each type of traffic has its own dedicated bus.
The alternative to this would be to "multiplex" address and data onto the same bus. This
involves first placing the address on the bus, and then following it with the data that is to
be sent to that address. PCI takes this approach, with a single 32-bit bus on which
addresses and data are multiplexed. In fact, remember the office phone line analogy?
"Mr. Smith, a Ms. Jones is here at my desk and wants to talk to you, so I'm going to put
her on. Now pay attention, because here she is..." The "Mr. Smith" in this sentence
would be the address, and Ms. Jones' speech to Mr. Smith would be the data.
Obviously multiplexing is a little less bandwidth-efficient than having two dedicated
buses, because address traffic takes up precious bandwidth that could be put to better
use carrying data. But a multiplexed bus is a lot cheaper than a dedicated address/data
bus pair, because half the number of bus lines are needed, and the devices on the bus
need half the number of pins.
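One way to see the cost of multiplexing is to count bus cycles: the address phase occupies the same wires as the data phases that follow it. A toy sketch (purely illustrative, not modeled on the actual PCI protocol's turnaround and wait states):

```python
def multiplexed_transfer(address, data_words):
    """On a multiplexed bus, the address rides on the same wires as the data,
    so each transaction spends at least one extra cycle on the address phase."""
    bus_cycles = [("ADDR", address)]          # address phase first
    for word in data_words:
        bus_cycles.append(("DATA", word))     # then the data phases
    return bus_cycles

cycles = multiplexed_transfer(0xE0010000, [0x11, 0x22, 0x33])
print(len(cycles))  # 4 cycles total: 1 address + 3 data
```

With separate address and data buses, those three data words could start flowing without burning a shared-wire cycle on the address, which is exactly the efficiency/cost tradeoff described above.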
The other popular way of handling bus traffic is to split it into control traffic and
read/write traffic and give each its own bus. To return to our office analogy, this would
be like installing a separate line for management to use to talk to employees.
PCI and MSI
Later versions of the PCI specification opt in part for the last method of organizing bus
traffic outlined above, and have what is called a "side-band bus" for transmitting some
types of command traffic. The side-band bus is a smaller bus consisting of a few lines
dedicated to the transmission of control and configuration information. Of course, this
side-band bus increases pin count, power draw, cost, etc., so it's not an ideal
solution.
Even more recent versions of the PCI spec dictate a method for using standard read
and write operations to pass one type of command and control traffic to PCI devices.
This method, called Message Signaled Interrupts (MSI), sets aside a special message
space in the PCI flat memory space for passing a certain type of control message called
an interrupt. This message space is kind of like a bulletin board, onto which a device
posts interrupt messages with ordinary memory writes, to be picked up on the host side.
As we'll see below, PCI Express expands the MSI approach to cover not just interrupts
but all side-band control signals. But we're getting ahead of ourselves...
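The bulletin-board idea can be sketched as a shared mailbox; in the actual MSI scheme it is the device that raises an interrupt by performing an ordinary memory write to a reserved address. Everything here is illustrative, not real driver code:

```python
# A reserved "message space" where interrupt messages land, replacing a
# dedicated physical interrupt wire.
msi_mailbox = []

def msi_raise(device_name, vector):
    """A device signals an interrupt with nothing but a plain memory write."""
    msi_mailbox.append((device_name, vector))

msi_raise("network card", 0x41)   # packet arrived, please service me
print(msi_mailbox)                # host-side code drains this mailbox
```

The point is that once interrupts are just writes to an address, they can travel over the same bus as everything else, with no extra pins.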
Summary of PCI's shortcomings
To summarize, PCI as it exists today has some serious shortcomings that prevent it
from providing the bandwidth and features needed by current and future generations of
I/O and storage devices. Specifically, its highly parallel shared-bus architecture holds it
back by limiting its bus speed and scalability, and its simple, load-store, flat memory-
based communications model is less robust and extensible than a routed, packet-based
model.
PCI-X: wider and faster, but still outdated
The PCI-X spec was an attempt to update PCI as painlessly as possible and allow it to
hobble along for a few more years. This being the case, the spec doesn't really fix any
of the inherent problems outlined above. In fact, it actually makes some of the problems
worse.
The PCI-X spec essentially doubled the bus width from 32 bits to 64 bits, thereby
increasing PCI's parallel data transmission abilities and enlarging its address space.
The spec also ups PCI's basic clock rate to 66MHz with a 133MHz variety on the high
end, providing yet another boost to PCI's bandwidth and bringing it up to 1GB/s (at
133MHz).
The latest version of the PCI-X spec (PCI-X 266) also double-pumps the bus, so that
data is transmitted on the rising and falling edges of the clock. While this improves PCI-
X's peak theoretical bandwidth, its real-world sustained bandwidth gains are more
modest. (See this article for more on the relationship between peak theoretical
bandwidth and real-world bandwidth.)
While both of these moves significantly increased PCI's bandwidth and its usefulness,
they also made it more expensive to implement. The faster a bus runs, the more
sensitive it becomes to noise; manufacturing standards for high-speed buses are
exceptionally strict for this very reason, since shoddy materials and/or wide margins of
error translate
directly into noise at higher clock speeds. This means that the higher-speed PCI-X bus
is more expensive to make.
The higher clock speed isn't the only thing that increases PCI-X's noise problems and
manufacturing costs. The other factor is the increased bus width. Because the bus is
wider and consists of more wires, there's more noise in the form of crosstalk.
Furthermore, all of those new wires are connected at their endpoints to multiple PCI
devices, which means an even larger load on the bus and thus more noise injected into
the bus by attached devices. And then there's the fact that the PCI devices themselves
need 32 extra pins, which increases the manufacturing cost of each individual device
and of the connectors on the motherboard.
All of these factors, when taken together with the increased clock rate, combine to make
PCI-X a more expensive proposition than PCI, which keeps it out of mainstream
PCs. And it should also be noted that most of the problems with increasing bus
parallelism and double-pumping the bus also plague recent forms of DDR, and
especially the DDR-II spec.
And after all of that pain, you still have to deal with PCI's shared-bus topology and all of
its attendant ills. Fortunately, there's a better way.
PCI Express: the next generation
PCI Express (PCIe) is the newest name for the technology formerly known as 3GIO.
Though the PCIe specification was finalized in 2002, PCIe-based devices have just
now started to debut on the market.
PCIe's most drastic and obvious improvement over PCI is its point-to-point bus
topology. Take a look at the following diagram, and compare it to the layout of the PCI
bus.

Figure 6: shared switch
vs.

Figure 3: the shared bus
In a point-to-point bus topology, a shared switch replaces the shared bus as the single
shared resource by means of which all of the devices communicate. Unlike in a shared
bus topology, where the devices must collectively arbitrate among themselves for use of
the bus, each device in the system has direct and exclusive access to the switch. In
other words, each device sits on its own dedicated bus, which in PCIe lingo is called
a link.
Like a router in a network or a telephone switchbox, the switch routes bus traffic and
establishes point-to-point connections between any two communicating devices on a
system. To return to our office analogy from the previous section, each employee has
his or her own private line to the front desk; so instead of shouting over a shared line to
get a particular employee's attention, the front desk secretary uses a switchboard to
connect employees directly to incoming callers and to each other.
In the point-to-point diagram above, the CPU at the top can talk to any of the PCIe
devices by "dialing" that device's address and opening up a direct and private
communications link, via the switch, with it. Of course, as with a modern telephone call,
or even better, an internet connection between a browser and a website, the two
communicating parties only think they're talking to each other via a private, direct,
continuous link; in reality, though, the communications stream is broken up into discrete
packets of data, which the switch routes back and forth between the two parties, like a
postal worker delivering addressed envelopes.
Enabling Quality of Service
The overall effect of the switched fabric topology is that it allows the "smarts" needed to
manage and route traffic to be centralized in one single chip: the switch. With a shared
bus, the devices on the bus must use an arbitration scheme to decide among
themselves how to distribute a shared resource (i.e., the bus). With a switched fabric,
the switch makes all the resource-sharing decisions.
By centralizing the traffic-routing and resource-management functions in a single unit,
PCIe also enables another important and long overdue next-generation function: quality
of service (QoS). PCIe's switch can prioritize packets, so that real-time streaming
packets (i.e., a video stream or an audio stream) can take priority over packets that
aren't as time critical. This should mean fewer dropped frames in your first-person
shooter and lower audio latency in your digital recording software.
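A switch that prioritizes time-critical packets is, at heart, a priority queue. A minimal sketch of the idea (the priority numbers and packet labels are made up; real PCIe QoS works through traffic classes and virtual channels, which aren't modeled here):

```python
import heapq

class ToySwitch:
    """Illustrative switch: lower priority number = more time-critical."""
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker that preserves arrival order

    def enqueue(self, priority, packet):
        heapq.heappush(self._queue, (priority, self._seq, packet))
        self._seq += 1

    def forward_next(self):
        return heapq.heappop(self._queue)[2]

switch = ToySwitch()
switch.enqueue(5, "bulk disk write")
switch.enqueue(1, "audio stream sample")
print(switch.forward_next())  # the audio sample goes first, despite arriving later
```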
Backwards compatibility
Now, you've probably heard that PCIe is backwards-compatible with PCI, and that
operating systems can boot on and use a PCIe-based system without modification. So
you're no doubt wondering how PCI's load-store model, described previously, can be
compatible with the switched packet-based model outlined here. The answer is more
straightforward than you might think.
PCI and PCI Express, like many computer systems designed to transmit data,
implement part of the OSI network stack. This article is not the place for a detailed
breakdown of a network stack, but the basic idea behind it is easy enough to grasp.
PCI implements the first four layers of the OSI stack, which specify the physical aspects
of transmission (i.e., the wire-level signals) up through the higher-level load-store
interface that software uses to send and receive via PCI. PCI Express's designers have
left this load-store-based, flat memory model unchanged. So a legacy application that
wants to communicate via PCIe still executes a read from or a write to a specific
address. The next two stack levels down, however, take this read or write request and
convert it into a packet by appending routing and flow control information, as well as
CRC information, placing it in a frame, and then sending it to its destination.
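The wrap-a-load-store-in-a-packet step might be sketched like this. It's a loose illustration only: real PCIe transaction-layer packets are binary structures with defined header fields, not the text encoding used here.

```python
import zlib
from ast import literal_eval

def packetize(request_type, address, data):
    """Wrap a plain load/store request with routing and integrity info,
    roughly mirroring the layering described above (fields illustrative)."""
    body = repr({"type": request_type, "addr": address, "data": data}).encode()
    return {"body": body, "crc": zlib.crc32(body)}  # CRC added by a lower layer

def unpack(packet):
    # The receiving end verifies integrity, then recovers the original request.
    assert zlib.crc32(packet["body"]) == packet["crc"], "corrupted packet"
    return literal_eval(packet["body"].decode())

pkt = packetize("write", 0xE0010000, 0xFF)
print(unpack(pkt)["type"])  # -> write
```

The legacy software above this layer never sees any of it; it just issued a write to an address.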
So the application still thinks that it's reading from or writing to a memory address when
it talks to a PCI device, but behind the scenes there's a totally different network of
protocols and signals at work, shuffling that read or write request along to its
destination.
This brings us back to the topic of command and control signals. As I hinted at
earlier, PCIe takes all PCI side-band signals and converts them to MSI signals (which
are load-store) so that they can be encapsulated into packets and routed just like any
other read/write traffic. Of course, this means that all types of PCIe traffic, whether
command or read/write, address or data, are transmitted over a single bus.
It's important to note at this point that the two pairs of bus traffic types are logically
divided under PCIe, even if they're not physically separated onto different buses. The
first two types of traffic, address and data, are combined in the form of the packet. The
core of a packet consists of an address combined with a chunk of data; so the packet
structure fuses these two types.
The packets themselves, though, generally fall into the two other categories: command
and read/write. In fact, literature on a packet-based bus system like PCIe or RAMBUS
will often talk of command packets and data packets, the latter being the more
common name for what I'm calling read/write packets.
Traffic runs in lanes
When PCIe's designers started thinking about a true next-generation upgrade for PCI,
one of the issues that they needed to tackle was pin count. In the section on PCI above,
I covered some of the problems with the kind of large-scale data parallelism that PCI
exhibits (i.e. noise, cost, poor frequency scaling, etc.). PCIe solves this problem by
taking a serial approach.
As I noted previously, a connection between a PCIe device and a PCIe switch is
called a link. Each link is composed of one or more lanes, and each lane is capable of
transmitting one byte at a time in both directions at once. This full-duplex
communication is possible because each lane is itself composed of one pair of signals:
send and receive.

Figure 7: Links and lanes
In order to transmit PCIe packets, which are composed of multiple bytes, a one-lane link
must break down each packet into a series of bytes, and then transmit the bytes in rapid
succession. The device on the receiving end must collect all of the bytes and then
reassemble them into a complete packet. This disassembly and reassembly
must happen rapidly enough that it's transparent to the next layer up in the stack.
This means that it requires some processing power on each end of the link. The upside,
though, is that because each lane is only one byte wide, very few pins are needed to
transmit the data. You might say that this serial transmission scheme is a way of turning
processing power into bandwidth; this is in contrast to the old PCI parallel approach,
which turns bus width (and hence pin counts) into bandwidth. It so happens that thanks
to Moore's Curves, processing power is cheaper than bus width, hence PCIe's tradeoff
makes a lot of sense.
I stated earlier that a link can be composed of "one or more lanes", so let me clarify that
now. One of PCIe's nicest features is the ability to aggregate multiple individual lanes
together to form a single link. In other words, two lanes could be coupled together to
form a single link capable of transmitting two bytes at a time, thus doubling the link
bandwidth. Likewise, you could combine four lanes, or eight lanes, and so on.
A link that's composed of a single lane is called an x1 link; a link composed of two lanes
is called an x2 link; a link composed of four lanes is called an x4 link, etc. PCIe supports
x1, x2, x4, x8, x12, x16, and x32 link widths.
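Each first-generation PCIe lane signals at 2.5Gbps per direction, and after the physical layer's 8b/10b encoding (ten line bits carry each data byte; a detail not covered in this article) the usable bandwidth scales linearly with link width. A quick sketch of the arithmetic:

```python
LANE_RAW_GBPS = 2.5        # first-gen PCIe signaling rate, per lane, per direction
ENCODING_EFFICIENCY = 0.8  # 8b/10b line coding: 8 data bits ride on 10 line bits

def link_bandwidth_mb(lanes):
    """Usable per-direction bandwidth in MB/s for an xN link."""
    gbps = LANE_RAW_GBPS * lanes * ENCODING_EFFICIENCY
    return gbps * 1000 / 8   # gigabits -> megabytes

for width in (1, 2, 4, 8, 16, 32):
    print(f"x{width}: {link_bandwidth_mb(width):.0f} MB/s per direction")
```

So even an x1 link (250MB/s per direction, simultaneously) comfortably outruns the original shared 132MB/s PCI bus, and an x16 link leaves it far behind.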
PCIe's bandwidth gains over PCI are considerable. A single lane is capable of
transmitting 2.5Gbps in each direction, simultaneously. Add two lanes together to form
an x2 link and you've got 5 Gbps, and so on with each link width. These high transfer
speeds are very good news, and will enable a new class of applications, like...
PCIe, the GPU, and you
...SLI video card rendering.
When announcements of Alienware's new PCIe-based SLI technology hit the wires, I
saw a few folks claiming that the company had somehow rebranded some basic PCIe
functionality. If you've made it this far in the article, though, then you probably noticed
that no single one of the PCIe capabilities that I've outlined thus far seems specifically
enabling of this kind of vid card cooperation. That's because it's PCIe's whole, high-
bandwidth, next-generation package that allows this functionality, and not any one
feature.
3D rendering involves moving a lot of data around, very quickly, between the video
card, the CPU, and main memory. In current systems the AGP bus is a bottleneck. You
can tell just how much of a bottleneck it is by observing how much RAM vendors are
cramming into high-end video cards. All of that RAM is needed so that the GPU doesn't
have to go out to main memory to get rendering data.
This picture changes when you add PCIe into the mix. Two video cards placed in a pair
of x16 slots will have high-bandwidth pipes connecting them to each other, to main
memory, and to the CPU. They can use all of that bandwidth to cooperate on rendering
chores at a level that wouldn't have been feasible with previous bus technologies.
For more on PCIe and graphics, check out the following links.
Alienware announces dual PCI-Express graphics subsystem
PCI Express for graphics: Analyzing ATI and NVIDIA's PCI-E strategies
NVIDIA's SLI resurrects GPU teaming: Kickin' it old school, with 32 pipes
And be sure to stick around Ars, because this PCIe article is just the groundwork for our
future coverage of all things PCIe, including graphics.
At this point, I want to use the last two articles in the list above to bring up two other
features of PCIe that are worth taking a look at, especially because they factor in to the
emerging SLI GPU scene.
Lane negotiation at startup
In the last article linked in the above list (the one on NVIDIA's SLI) TR notes that no
currently available motherboard has two x16 links. Now, some boards have two x16
slots, but those slots are connected to the bridge by x8 links. What gives? This can be
kind of confusing, so a diagram will help.

Figure 8: lane negotiation
At startup, PCIe devices negotiate with the switch to determine the maximum number of
lanes that the link can consist of. This link width negotiation depends on the maximum
width of the link itself (i.e., the actual number of physical signal pairs that the link
consists of), on the width of the connector into which the device is plugged, and the
width of the device itself. (It also depends on the width of the switch's interface, but we'll
leave that out and assume that the switch's interface width equals the physical link
width.)
Now, a PCIe-compliant device has a certain number of lanes built into it. So NVIDIA's
first SLI cards are all x16 cards, which means that they have enough copper connectors
at their bottom contact edges to support 16 lanes. This also means that they need to be
plugged into a connector slot that supports at least 16 lanes. If the connector had fewer
than 16 lanes, then it wouldn't have enough contacts to understand all of the signals
coming out of the card. If it supports more, then those extra lanes can be ignored.
However, just because the card and connector are x16 doesn't mean the link itself is
x16. The physical link itself could have enough copper traces for exactly sixteen lanes,
or some number less than sixteen, or some number greater than sixteen. If the link has
only enough signal pairs to support less than sixteen lanes, then the switch and the
device will negotiate to figure this out, and they'll use only the lanes that the link has. If
the link supports more than sixteen lanes, then the extra lanes will be ignored.
If you take a close look at the diagram above, then you'll see how this works. Extra
lanes are ignored, while too few lanes means that the devices on each end just throttle
back their bandwidth accordingly.
There is one situation depicted above that just won't work, and that's the last one with
the text above it in red. Plugging an x16 card into an x8 connector doesn't work,
because there aren't enough contacts in the connector to pick up all of the lanes coming
out of the card.
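The negotiation rules above boil down to "fail if the card is wider than its connector; otherwise everyone trains down to the narrowest participant." A sketch, with the switch's interface width as an assumed parameter (the function and its defaults are illustrative, not from any spec):

```python
def negotiate_link_width(card_lanes, connector_lanes, link_lanes, switch_lanes=16):
    """Simplified model of link-width training as described above.

    A card with more lanes than its connector can't work at all: the
    connector lacks the contacts to pick up all of the card's signals."""
    if card_lanes > connector_lanes:
        raise ValueError("card is wider than the connector: incompatible")
    # Otherwise the link trains down to the narrowest participant;
    # any extra lanes on the wider parties are simply ignored.
    return min(card_lanes, connector_lanes, link_lanes, switch_lanes)

# An x16 card in an x16 slot wired with only an x8 link trains down to x8:
print(negotiate_link_width(16, 16, 8))  # -> 8
```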
This link width negotiation allows for some flexibility in designing systems and
integrating devices with different lane widths, but it will make for some headache in the
consumer space. People will have to figure out how to match link widths with device
widths, and they'll be initially confused by situations in which the link is one width and
the connector another, as is the case with an NVIDIA card plugged into an x16 slot
attached to an x8 link.
The NVIDIA card plugged into the x8 link will talk to the switch and figure out that the
link is only x8. It will then train down accordingly and transmit data at the appropriate x8
rate.
(If you're confused, just go back and meditate on the previous diagram some more. It
took a while of staring at it before it sank in for me, too, and I'm the one who made
the diagram!)
PCIe to PCI bridging
One thing that you're going to hear a lot about in the coming months is PCIe-to-PCI
bridging. Fortunately, it's a lot easier to grasp than the whole training and lane width
thing.
Basically, a PCIe-to-PCI bridge translates PCIe packets back into regular old PCI
signals, allowing a legacy PCI device to be plugged into a PCIe system. This bridging
can happen anywhere, from on the motherboard to on the card. NVIDIA is taking such
an approach with their first-generation PCIe cards. There's a PCIe-to-PCI bridge
embedded on the card, which means that the card itself is still a "PCI" card even though
it fits into a PCIe slot.
ATI, in contrast, has cards that support PCIe natively and therefore don't need the
bridge chip.
I don't expect these bridges to make a whole lot of difference in anyone's life in the
near-term, and in the long-term they'll disappear entirely as companies like NVIDIA
rework their product line for native PCIe support. The translation chip will add some cost
to the device, but it's impact on performance (if any) will be very hard to quantify and
absolutely impossible to isolate. Still, expect this talk about bridging to play a role in the
graphics wars in the next few months. My advice, though, is to ignore it and focus on
the benchmarks, which are all that matter anyway.
Conclusion: PCI Express in the real world
A good example of PCIe-to-PCI bridging on the motherboard is in Intel's new 900-series
chipsets. These chipsets employ PCIe-to-PCI bridge logic integrated directly into the
southbridge. This allows legacy PCI devices to coexist with new PCIe devices in the
same system.
I won't go into detail about these chipsets, because that's been done in the reviews
accessible under the link above. What I will do, though, is give you one last diagram,
showing you how PCIe is used in newly announced chipsets.

Figure 9: PCIe usage in new chipsets
As you can see, PCIe links hang off of both the northbridge and the southbridge. Just as
the northbridge and southbridge combined with the CPU to fill the role of PCI host (or
root), the northbridge and southbridge join with each other to fulfill the role of the PCIe
switch. In Intel's design, the north and south bridges are PCIe switches combined with a
single, high-bandwidth PCIe link.
I began this article with a discussion of how PCI has caused different buses to be
absorbed into the chipset. Thus the chipset in a pre-PCIe system functions as a switch,
with the various attached devices connected in something resembling a hacked-up
switched fabric. PCIe brings some order to this chaos by making the core logic chipset
into a bona fide switch ? a PCIe switch. It also it turns some of the attached buses into
PCIe buses, and it makes the PC as a system more cleanly extensible and future-proof
by eliminating the need for one specialized bus after another.


PCI Express Primer
PCI Express is a serial point to point link that operates at 2.5 Gbits/sec in each direction and
which is meant to replace the legacy parallel PCI bus. PCI Express (PCIe) is designed to
provide software compatibility with older PCI systems, however the hardware is completely
different. Since PCIe is point to point, there is no arbitration for resources on the link. Each
pair of links (both directions) is referred to as a "lane", and multiple lanes can be
aggregated to form a single higher bandwidth connection. The following sections describe
some of the details of the PCIe interface.

PCI Express Topology


As can be seen in the figure below, a PCI Express fabric consists of three types of devices:
the root complex, switches, and endpoints. The root complex is generally associated with
the processor and is responsible for configuring the fabric at power-up. Since PCIe
connections are point to point, switches are used to expand the fabric. PCIe endpoints are
the I/O devices in the fabric - the sources of, and destinations for the data.




PCI Express Layers


PCIe is implemented in three of the OSI model layers: the transaction layer, the data link
layer, and the physical layer. The following figure displays the layers as connected between
two PCIe devices.






As can be seen in the figure, the user logic interfaces to the transaction layer. The user
forms Transaction Layer Packets, or TLPs which contain a header, data payload, and
optionally an end-to-end CRC, ECRC. The ECRC, if used, is generated by the user logic at
the transmitter and checked by the user logic at the receiver. The data link layer is
responsible for link management including error detection. In this layer, a CRC (called the
Link CRC or LCRC) is appended and a sequence number is prepended to the Transaction
Layer Packet. When a packet is transmitted from the data link layer, the receiver sends
back an ACK (success) or NACK (failure) to the transmitter which will retransmit in the case
of an error. These ACKs and NACKs are sent via special packets which originate from the
data link layer called Data Link Layer Packets, or DLLPs. The physical layer consists of two
differential pairs with 8B/10B encoded data allowing for a DC balance on the transmission
media and for clock recovery at the destination. Framing information is added to the data
link layer packet, and this is encoded and driven onto the link. The following diagram
displays the encapsulation of packets in PCIe:





The transaction layer supports the notion of Virtual Channels and Traffic Classes which can
be used for real-time isochronous and prioritized data transport. The maximum data
payload (MDP) in a PCIe system is a system-wide user defined parameter. The desired MDP
is requested in a PCIe configuration register which is read by the root complex. After polling
all of the MDP values in the system, the lowest value is written to a separate configuration
register on each side of the link. Legal values of the MDP are 128 bytes through 4096 bytes
in powers of 2. A transmitter must not send a packet which exceeds the programmed MDP.


PCI Express Transactions


PCIe provides four types of transactions that originate at the transaction layer: memory,
I/O, configuration, and message. In general, memory transactions are the basic method of
transferring data. I/O transactions are provided for backward compatibility with PCI (which
provided them for backward compatibility with ISA) and are not recommended for future
use. Configuration transactions are similar to those of the same name in the PCI bus and
are used by the root complex to configure the system upon power-up. Message transactions
are new and are used to send interrupts and error conditions, as well as other information
through the fabric. Transactions can be further classified as posted, non-posted, and
completion. A memory write operation is an example of a posted transaction since it does
not require a response from the destination. A memory read request is a non-posted
transaction that will later cause a completion transaction with the read data. The completion
transaction is initiated by the destination when the read data is available. Both I/O read and
I/O write are non-posted transactions, as are configuration read and write. Message
transactions are of type posted.


Flow Control


PCIe implements a point to point (not end to end) credit policy for managing buffers. The
data link layer sends Data Link Layer Packets which indicate the amount of receiver buffer
space available in units of credits. The transmitter must ensure that the buffer space is not
exceeded prior to commencing a transmission.