
Mid-Term Training Report 2015

Introduction
Field Programmable Gate Arrays (FPGAs) are pre-fabricated silicon devices that can be
electrically programmed in the field to become almost any kind of digital circuit or system.

A Field programmable gate array (FPGA) is a logic device that contains a two-dimensional
array of generic logic cells and programmable switches. A logic cell can be configured (i.e.,
programmed) to perform a simple function, and a programmable switch can be customized to
provide interconnections among the logic cells. A custom design can be implemented by
specifying the function of each logic cell and selectively setting the connection of each
programmable switch. Once the design and synthesis are completed, we can use a simple
adaptor cable to download the desired logic cell and switch configuration to the FPGA device
and obtain the custom circuit. Since this process can be done "in the field" rather than "in a
fabrication facility (fab)," the device is known as field programmable.

Example

While some FPGAs (such as antifuse-based devices) are One-Time Programmable (OTP), most are
reprogrammable. FPGAs therefore allow designers to change their designs very late in the design
cycle, even after the end product has been manufactured and deployed in the field.
Gates
1982: 8192 gates, Burroughs Advanced Systems Group, integrated into the S-Type

24-bit processor for reprogrammable I/O.
1987: 9,000 gates, Xilinx
1992: 600,000, Naval Surface Warfare Department
Early 2000s: Millions
Market size
1985: First commercial FPGA: Xilinx XC2064
1987: $14 million
1993: >$385 million
2005: $1.9 billion
2010 estimates: $2.75 billion
Design starts
2005: 80,000
2008: 90,000

FPGA Vendors
Actel
Altera
Atmel
Lattice
QuickLogic
Xilinx

Disadvantage
Internal memory is limited.
Analog interfacing is challenging.
Power consumption is higher than that of an equivalent ASIC.
Learning to use or design complex FPGAs is challenging.
Not suitable for high-volume products.

Advantage
Long time availability
FPGAs (Field Programmable Gate Arrays) make you independent of component
manufacturers and distributors, since the functionality is defined not by the device
itself but by its configuration. The configuration can be written so that it is
portable between different FPGAs without any adaptation.
Can be updated and upgraded at your customer's site
In contrast to traditional computer chips, FPGAs are completely configurable. Updates
and feature enhancements can be carried out even after delivery, at your customer's
site.
Extremely short time to market
Through the use of FPGAs, the development of hardware prototypes is significantly
accelerated, since a large part of the hardware development process is shifted into IP
core design, which can take place in parallel. Additionally, because of the early
availability of hardware prototypes, time-consuming activities such as the bring-up and
debugging of the hardware are carried out concurrently with the overall
development.


Fast and efficient systems


Available standard components address a broad user group and consequently often
constitute a compromise between performance and compatibility. With FPGAs,
systems can be developed that are exactly customized for the designated task and for
this reason work highly efficiently.
Performance gain for software applications
Complex tasks are often handled through software implementations in combination
with high-performance processors. Here FPGAs provide a competitive
alternative which, by means of parallelization and customization for the specific task,
can even deliver an additional performance gain.
Real time applications
FPGAs are perfectly suited to applications in time-critical systems. In contrast to
software-based solutions running on real-time operating systems, FPGAs provide truly
deterministic behavior. Thanks to this flexibility, even complex
computations can be executed in extremely short periods.
Massively parallel data processing
The amount of data in contemporary systems is ever increasing, which leads to the
problem that sequentially operating systems are no longer able to process the data in
time. Through parallelization, FPGAs provide a solution to this
problem, and one that in addition scales excellently.

FPGA Applications

Aerospace and Defense


Avionics/DO-254
Communications
Missiles & Munitions
Secure Solutions
Space
Audio
Connectivity Solutions
Portable Electronics
Radio
Digital Signal Processing (DSP)
Automotive
High Resolution Video
Image Processing
Vehicle Networking and Connectivity
Automotive Infotainment
Broadcast
Real-Time Video Engine
EdgeQAM
Encoders
Displays
Switches and Routers
Consumer Electronics
Digital Displays
Digital Cameras
Multi-function Printers
Portable Electronics
Set-top Boxes
Data Center
Servers
Security
Routers
Switches
Gateways
Load Balancing
High Performance Computing
Servers
Super Computers
SIGINT Systems
High-end RADARS
High-end Beam Forming Systems
Data Mining Systems
Industrial
Industrial Imaging
Industrial Networking
Motor Control
Medical
Ultrasound
CT Scan
MRI
X-ray
PET
Surgical Systems
Scientific Instruments
Lock-in amplifiers
Boxcar averagers
Phase-locked loops
Security
Industrial Imaging
Secure Solutions
Image Processing
Video & Image Processing
High Resolution Video
Video Over IP Gateway
Digital Displays
Industrial Imaging
Wired Communications
Optical Transport Networks
Network Processing
Connectivity Interfaces


Wireless Communications
Baseband
Connectivity Interfaces
Mobile Backhaul
Radio

FPGA configuration

Basic process technology types


SRAM - based on static memory technology. In-system programmable and
reprogrammable. Requires external boot devices. CMOS. Currently in use.
Fuse - One-time programmable. Bipolar. Obsolete.
Antifuse - One-time programmable. CMOS.
PROM - Programmable Read-Only Memory technology. One-time programmable
because of plastic packaging. Obsolete.
EPROM - Erasable Programmable Read-Only Memory technology. One-time
programmable but with window, can be erased with ultraviolet (UV) light. CMOS.
Obsolete.


EEPROM - Electrically Erasable Programmable Read-Only Memory technology.


Can be erased, even in plastic packages. Some but not all EEPROM devices can be
in-system programmed. CMOS.
Flash - Flash-erase EPROM technology. Can be erased, even in plastic packages.
Some but not all flash devices can be in-system programmed. Usually, a flash cell is
smaller than an equivalent EEPROM cell and is therefore less expensive to
manufacture. CMOS.

FPGA architecture

CLB Overview
The Configurable Logic Blocks (CLBs) are the main logic resources for implementing
sequential as well as combinatorial circuits. Each CLB element is connected to a switch
matrix for access to the general routing matrix (shown in Figure 1). A CLB element contains
a pair of slices. These two slices do not have direct connections to each other, and each slice
is organized as a column. For each CLB, the slice in the bottom of the CLB is labeled as
SLICE(0), and the slice in the top of the CLB is labeled as SLICE(1).


Arrangement of Slices within the CLB


The Xilinx tools designate slices with the following definitions. An X followed by a
number identifies the position of each slice in a pair as well as the column position of the
slice. The X number counts slices starting from the bottom in sequence 0, 1 (the first CLB
column); 2, 3 (the second CLB column); etc. A Y followed by a number identifies a row of
slices. The number remains the same within a CLB, but counts up in sequence from one
CLB row to the next CLB row, starting from the bottom. Figure 2 shows four CLBs located
in the bottom-left corner of the die.

Row and Column Relationship between CLBs and Slices

Slice Description
Every slice contains four logic-function generators (or look-up tables, LUTs) and eight
storage elements. These elements are used by all slices to provide logic and ROM functions
(Table 1). SLICEX is the basic slice. Some slices, called SLICELs, also contain an arithmetic
carry structure that can be concatenated vertically up through the slice column, and wide
function multiplexers. The SLICEMs contain the carry structure and multiplexers, and add

the ability to use the LUTs as 64-bit distributed RAM and as variable-length shift registers
(maximum 32-bit).

Slice Features

Each column of CLBs contains two slice columns. One is a SLICEX column; the
other alternates between SLICELs and SLICEMs. Thus, approximately 50% of the
available slices are of type SLICEX, while 25% each are SLICELs and SLICEMs. The
XC6SLX4 does not have SLICELs (Table 3).

SLICEM (shown in Figure 3) represents a superset of elements and connections found in all
slices. SLICEL is shown in Figure 4. SLICEX is shown in Figure 5. All eight SR, CE, and
CLK inputs are driven by common control inputs.


Diagram of SLICEM


Diagram of SLICEL


Diagram of SLICEX

CLB/Slice Configurations
Table 2 summarizes the logic resources in one CLB. Each CLB or slice can be implemented
in one of the configurations listed.

Logic Resources in One CLB


Notes:
SLICEM only, SLICEL and SLICEX do not have distributed RAM or shift registers.
SLICEM and SLICEL only

Table 3 shows the available CLB resources for the Spartan-6 FPGAs. The ratio between the
number of 6-input LUTs and logic cells is 1.6. This reflects the increased capability of the
new 6-input LUT architecture compared to traditional 4-input LUTs.

Spartan-6 FPGA Logic Resources

Look-Up Table (LUT)


The function generators in Spartan-6 FPGAs are implemented as six-input look-up tables (LUTs). There
are six independent inputs (A inputs - A1 to A6) and two independent outputs (O5 and O6)
for each of the four function generators in a slice (A, B, C, and D). The function generators
can implement any arbitrarily defined six-input Boolean function. Each function generator
can also implement two arbitrarily defined five-input Boolean functions, as long as these two
functions share common inputs. Only the O6 output of the function generator is used when a
six-input function is implemented. Both O5 and O6 are used for each of the five-input
function generators implemented. In this case, A6 is driven High by the software. The
propagation delay through a LUT is independent of the function implemented, or whether one
six-input or two five-input generators are implemented. Signals from the function generators
can exit the slice (through the A, B, C, D outputs for O6 or the AMUX, BMUX, CMUX, DMUX
outputs for O5), enter the dedicated XOR gate from an O6 output, enter the carry-logic chain
from an O5 output, enter the select line of the carry-logic multiplexer from an O6 output,
feed the D input of the storage element, or go to F7AMUX/F7BMUX from an O6 output.
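As an illustration of the single-LUT6 case described above, the following Verilog sketch defines an arbitrary six-input Boolean function; a synthesis tool would typically map it onto one LUT6 and use only the O6 output (the function itself is made up purely for illustration):

module lut6_example (
  input  wire a, b, c, d, e, f,
  output wire o
);
  // An arbitrary six-input function; any such function fits in one LUT6.
  assign o = (a & b) ^ (c | ~d) ^ (e & f);
endmodule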


Figure 6 shows a simplified view of the connectivity for a single LUT6.

LUT6
In addition to the basic LUTs, SLICEL and SLICEM contain three multiplexers (F7AMUX,
F7BMUX, and F8MUX). These multiplexers are used to combine up to four function
generators to provide any function of seven or eight inputs in a slice. F7AMUX and
F7BMUX are used to generate seven-input functions from LUTs A and B, or C and D, while
F8MUX combines all four LUTs in the slice to generate eight-input functions. Functions with more
than eight inputs can be implemented using multiple slices. There are no direct connections
between slices to form function generators of more than eight inputs within a CLB, but CLB
outputs can be routed through the switch matrix and directly back into the CLB inputs.

Storage Elements
Each slice has eight storage elements. Four of these storage elements can be
configured as either edge-triggered D-type flip-flops or level-sensitive latches. The D input
can be driven directly by a LUT output via AFFMUX, BFFMUX, CFFMUX or DFFMUX,
or by the BYPASS slice inputs bypassing the function generators via AX, BX, CX, or DX
input. When configured as a latch, the latch is transparent when the CLK is Low.

In Spartan-6 devices, there are four additional storage elements that can only be configured as
edge-triggered D-type flip-flops. The D input can be driven by the O5 output of the LUT.
When the original 4 storage elements are configured as latches, these 4 additional storage
elements cannot be used.
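A minimal Verilog sketch of one such storage element used as an edge-triggered D flip-flop with clock enable is shown below (module and signal names are illustrative); describing it as level-sensitive instead would infer the latch configuration, which is transparent when CLK is Low:

module slice_ff (
  input  wire clk, ce, d,
  output reg  q
);
  // Rising-edge D flip-flop with clock enable; D may come from a LUT
  // output or from the bypass (AX..DX) slice input.
  always @(posedge clk)
    if (ce) q <= d;
endmodule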


Figure 7 shows the register-only and the register/latch configurations in a slice; both are
available.

Configuration in a Slice: 4 Registers Only and 4 Register/Latch


The control signals clock (CLK), clock enable (CE), and set/reset (SR) are common to all
storage elements in one slice. When one flip-flop in a slice has SR or CE enabled, the other
flip-flops used in the slice will also have SR or CE enabled by the common signal. Only the
CLK signal has selectable polarity, which applies to all eight storage elements. Any inverter
placed on the clock signal is automatically absorbed. The CE and SR signals are active High.
All flip-flop and latch primitives have CE and non-CE versions. The SR signal always has
priority over CE.

Initialization
The SR signal forces the storage element into the initial state specified by SRINIT1 or
SRINIT0. SRINIT1 forces a logic High at the storage element output when SR is asserted,
while SRINIT0 forces a logic Low at the storage element output (see Table 4).


Truth Table When Using SRINIT

SRINIT0 and SRINIT1 can be set individually for each storage element in a slice. The choice
of synchronous (SYNC) or asynchronous (ASYNC) set/reset (SRTYPE) is common to all
eight storage elements and cannot be set individually for each storage element in a slice.

The initial state after configuration or global initial state is also defined by the same SRINIT
option. The initial state is set whenever the Global Set/Reset (GSR) signal is asserted. The
GSR signal is always asserted during configuration, and can be controlled after configuration
by using the STARTUP_SPARTAN6 primitive. To maximize design flexibility
and utilization, use the GSR and avoid local initialization signals.

The initial state of any storage element (SRINIT) is defined in the design either by the INIT
attribute or by the use of a set or reset. If both methods are used, they must both be 0 or both
be 1. INIT = 0 or a reset selects SRINIT0, and INIT = 1 or a set selects SRINIT1.

The storage element must be initialized to the same value both by the global power-up or
GSR signal, and by the local SR input to the slice. A storage element cannot have both set
and reset, unless one is defined as a synchronous function so that it can be placed in the LUT.
Avoid instantiating primitives with the control input while specifying the INIT attribute in an
opposite state, for example, an FDRE with a reset input and the INIT attribute set to 1. Care
should be taken when re-targeting designs from another technology to the Spartan-6
architecture. If converting an existing FPGA design, avoid primitives that use both set and
reset, such as the FDCPE primitive.

Each of the eight flip-flops in a slice must use the same SR input, although they can be
initialized to different values. A second initialization control will require implementation in a
separate slice, so minimize the number of initialization signals. The SR could be turned off
for all flip-flops in a slice and implemented independently for each flip-flop by implementing
it synchronously in the LUT.

The SR signal is available to the flip-flop, independent of whether the LUT is used as a
distributed RAM or shift register, which supports a registered read from distributed RAM or
an additional pipeline stage in a shift register while still allowing initialization.

The configuration options for the set and reset functionality of a register or the four storage
elements capable of functioning as a latch are as follows:
No set or reset
Synchronous set
Synchronous reset
Asynchronous set (preset)
Asynchronous reset (clear)
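The sketch below (with illustrative names) shows a register whose initial/GSR value and local reset agree, as recommended above: the INIT value is 1, so the local control is a synchronous set, which maps naturally onto an FDSE-style slice flip-flop:

module init_reg (
  input  wire clk, rst, ce, d,
  output reg  q = 1'b1            // INIT = 1 (power-up/GSR value)
);
  always @(posedge clk)
    if (rst)     q <= 1'b1;       // synchronous set, consistent with INIT = 1
    else if (ce) q <= d;
endmodule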


Distributed RAM and Memory (SLICEM only)


The function generators in SLICEMs add a data input and write enable that allows the
function generator to be implemented as distributed RAM. RAM resources are configurable
within a SLICEM to implement the distributed RAM shown in Table 5. Multiple LUTs in a
SLICEM can be combined in various ways to store more data. Distributed RAM is fast,
localized, and ideal for small data buffers, FIFOs, or register files. For larger memory
requirements, consider using the 18K block RAM resources.

Distributed RAM is a synchronous-write, asynchronous-read resource. However, a
synchronous read can be implemented with a storage element or a flip-flop in the
same slice. Placing this flip-flop improves distributed RAM performance by
decreasing the delay to the clock-to-out value of the flip-flop. However, an additional clock
latency is added. The distributed resources share the same clock input. For a write operation,
the Write Enable (WE) input, driven by either the CE or WE pin of a SLICEM, must be set
High.
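A behavioral sketch of a small single-port memory with synchronous write and asynchronous read is shown below; synthesis tools generally map such a description onto SLICEM LUTs as distributed RAM (the 32 x 8 size is only an example):

module dist_ram_32x8 (
  input  wire       clk, we,
  input  wire [4:0] addr,
  input  wire [7:0] din,
  output wire [7:0] dout
);
  reg [7:0] mem [0:31];
  always @(posedge clk)
    if (we) mem[addr] <= din;   // synchronous write
  assign dout = mem[addr];      // asynchronous read
endmodule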

Table 5 shows the number of LUTs (four per slice) occupied by each distributed RAM
configuration.

Notes:
S = single-port configuration; D = dual-port configuration; Q = quad-port
configuration; SDP = simple dual-port configuration.
RAM32M is the associated primitive for this configuration.
RAM64M is the associated primitive for this configuration.

For single-port configurations, distributed RAM has a common address port for synchronous
writes and asynchronous reads. For dual-port configurations, distributed RAM has one port
for synchronous writes and asynchronous reads, and another port for asynchronous reads. In
simple dual-port configuration, there is no data out (read port) from the write port. For quad-
port configurations, distributed RAM has one port for synchronous writes and asynchronous
reads, and three additional ports for asynchronous reads.

In single-port mode, read and write addresses share the same address bus. In dual-port mode,
one function generator is connected with the shared read and write port address. The second
function generator has the A inputs connected to a second read-only port address and the WA
inputs shared with the first read/write port address.

Figure 8 through Figure 16 illustrate various example distributed RAM configurations


occupying one SLICEM. When using x2 configuration (RAM32X2Q), A6 and WA6 are
driven High by the software to keep O5 and O6 independent.

Distributed RAM (RAM32X2Q)


Distributed RAM (RAM32X6SDP)


Distributed RAM (RAM64X1S)


If four single-port 64 x 1-bit modules are built, the four RAM64X1S primitives can occupy a
SLICEM, as long as they share the same clock, write enable, and shared read and write port
address inputs. This configuration equates to 64 x 4-bit single-port distributed RAM.
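A hedged sketch of this arrangement, directly instantiating four RAM64X1S primitives that share WCLK, WE, and the address bus to form a 64 x 4-bit single-port RAM (the INIT values are arbitrary placeholders):

module dram_64x4 (
  input  wire       wclk, we,
  input  wire [5:0] addr,
  input  wire [3:0] din,
  output wire [3:0] dout
);
  genvar i;
  generate
    for (i = 0; i < 4; i = i + 1) begin : g_bit
      // Four 64 x 1 distributed RAMs sharing clock, write enable, and address.
      RAM64X1S #(.INIT(64'h0)) ram_i (
        .O(dout[i]), .D(din[i]), .WCLK(wclk), .WE(we),
        .A0(addr[0]), .A1(addr[1]), .A2(addr[2]),
        .A3(addr[3]), .A4(addr[4]), .A5(addr[5]));
    end
  endgenerate
endmodule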

Distributed RAM (RAM64X1D)


If two dual-port 64 x 1-bit modules are built, the two RAM64X1D primitives can occupy a
SLICEM, as long as they share the same clock, write enable, and shared read and write port
address inputs. This configuration equates to 64 x 2-bit dual-port distributed RAM.


Distributed RAM (RAM64X1Q)


Distributed RAM (RAM64X3SDP)


Implementation of distributed RAM configurations with depth greater than 64 requires the
usage of wide-function multiplexers (F7AMUX, F7BMUX, and F8MUX).


Distributed RAM (RAM128X1S)


If two single-port 128 x 1-bit modules are built, the two RAM128X1S primitives can occupy
a SLICEM, as long as they share the same clock, write enable, and shared read and write port
address inputs. This configuration equates to 128 x 2-bit single-port distributed RAM.


Distributed RAM (RAM128X1D)


Distributed RAM (RAM256X1S)


Distributed RAM configurations larger than the examples provided in Figure 8 through
Figure 16 require more than one SLICEM. There are no direct connections to form larger
distributed RAM configurations within a CLB or between slices.

Using distributed RAM for memory depths of 64 bits or less is generally more efficient than
block RAM in terms of resources, performance, and power. For depths greater than 64 bits
but less than or equal to 128 bits, use the following guidelines:
To conserve LUT resources, use any extra block RAM
For asynchronous read capability, use distributed RAM
For widths greater than 16 bits, use block RAM
For shorter clock-to-out timing and fewer placement restrictions, use registered
distributed RAM


Distributed RAM Data Flow

Synchronous Write Operation


The synchronous write operation is a single clock-edge operation with an active-High write
enable (WE) feature. When WE is High, the input (D) is loaded into the memory location at
address A.

Asynchronous Read Operation


The output is determined by the address A (for single-port mode output/SPO output of dual-
port mode), or address DPRA (DPO output of dual-port mode). Each time a new address is
applied to the address pins, the data value in the memory location of that address is available
on the output after the time delay to access the LUT. This operation is asynchronous and
independent of the clock signal.

Distributed RAM Summary

Single-port and dual-port modes are available in SLICEMs.


A write operation requires one clock edge.
Read operations are asynchronous (Q output).
The data input has a setup-to-clock timing specification.

Read Only Memory (ROM)


Each function generator can implement a 64 x 1-bit ROM. Three configurations are available:
ROM64X1, ROM128X1, and ROM256X1. ROM contents are loaded at each device
configuration. Table 6 shows the number of LUTs occupied by each ROM configuration.

ROM Configuration

Shift Registers (SLICEM only)


A SLICEM function generator can also be configured as a 32-bit shift register without using
the flip-flops available in a slice. Used in this way, each LUT can delay serial data anywhere
from one to 32 clock cycles. The shiftin D (DI1 LUT pin) and shiftout Q31 (MC31 LUT pin)
lines cascade LUTs to form larger shift registers. The four LUTs in a SLICEM are thus
cascaded to produce delays up to 128 clock cycles. It is also possible to combine shift
registers across more than one SLICEM. Note that there are no direct connections between
slices to form longer shift registers, nor is the MC31 output at LUT B/C/D available. The
resulting programmable delays can be used to balance the timing of data pipelines.

Applications requiring delay or latency compensation use these shift registers to develop
efficient designs. Shift registers are also useful in synchronous FIFO and content addressable
memory (CAM) designs.

The write operation is synchronous with a clock input (CLK) and an optional clock enable
(CE). A dynamic read access is performed through the 5-bit address bus, A[4:0]. The LSB of
the LUT is unused and the software automatically ties it to a logic High. The configurable
shift registers cannot be set or reset. The read is asynchronous; however, a storage element or
flip-flop is available to implement a synchronous read. In this case, the clock-to-out of the
flip-flop determines the overall delay and improves performance. However, one additional
cycle of clock latency is added. Any of the 32 bits can be read out asynchronously (at the O6
LUT outputs) by varying the 5-bit address. This capability is useful in creating smaller shift
registers (less than 32 bits). For example, when building a 13-bit shift register, simply set the
address to the 13th bit. Figure 17 is a logic block diagram of a 32-bit shift register.
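The behavioral sketch below describes a 32-bit shift register with a dynamic read address and no reset; under those conditions synthesis tools typically map it onto a single SLICEM LUT (an SRLC32E-style shift register) rather than 32 flip-flops (names are illustrative):

module srl32_dyn (
  input  wire       clk, ce, din,
  input  wire [4:0] addr,          // tap select: delay = addr + 1 cycles
  output wire       dout
);
  reg [31:0] sr = 32'b0;
  always @(posedge clk)
    if (ce) sr <= {sr[30:0], din}; // shift in at bit 0
  assign dout = sr[addr];          // asynchronous read of the selected bit
endmodule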

32-bit Shift Register Configuration


Figure 18 illustrates an example shift register configuration occupying one function
generator.

Representation of a Shift Register

Figure 19 shows two 16-bit shift registers. The example shown can be implemented in a
single LUT.

Dual 16-bit Shift Register Configuration


As mentioned earlier, an additional output (MC31) and a dedicated connection between shift
registers allow connecting the last bit of one shift register to the first bit of the next, without
using the LUT O6 output. Longer shift registers can be built with dynamic access to any bit
in the chain. The shift register chaining and the F7AMUX, F7BMUX, and F8MUX
multiplexers allow up to a 128-bit shift register with addressable access to be implemented in
one SLICEM. Figure 20 through Figure 22 illustrate various example shift register
configurations that can occupy one SLICEM.

64-bit Shift Register Configuration


96-bit Shift Register Configuration


128-bit Shift Register Configuration


It is possible to create shift registers longer than 128 bits across more than one SLICEM.
However, there are no direct connections between slices to form these shift registers.
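As a sketch of the cascading described above, two SRLC32E primitives can be chained through the Q31 (MC31) output to give a fixed 64-cycle delay line; this assumes the Xilinx Unisim primitive with ports Q, Q31, A, CE, CLK, and D, and the INIT values are arbitrary:

module delay64 (
  input  wire clk, ce, din,
  output wire dout
);
  wire q31_a;
  // First 32-bit shift register; Q31 is the cascade (MC31) output.
  SRLC32E #(.INIT(32'h00000000)) srl_a (
    .Q(), .Q31(q31_a), .A(5'd31), .CE(ce), .CLK(clk), .D(din));
  // Second stage; its Q tap at A = 31 yields a total delay of 64 cycles.
  SRLC32E #(.INIT(32'h00000000)) srl_b (
    .Q(dout), .Q31(), .A(5'd31), .CE(ce), .CLK(clk), .D(q31_a));
endmodule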

Shift Register Data Flow

Shift Operation
The shift operation is a single clock-edge operation, with an active-High clock enable feature.
When enable is High, the input (D) is loaded into the first bit of the shift register. Each bit is
also shifted to the next highest bit position. In a cascadable shift register configuration, the
last bit is shifted out on the MC31 output.

The bit selected by the 5-bit address port (A[4:0]) appears on the Q output.

Dynamic Read Operation


The Q output is determined by the 5-bit address. Each time a new address is applied to the 5-
input address pins, the new bit position value is available on the Q output after the time delay

to access the LUT. This operation is asynchronous and independent of the clock and clock-
enable signals.

Static Read Operation


If the 5-bit address is fixed, the Q output always uses the same bit position. This mode
implements any shift-register length from 1 to 32 bits in one LUT. The shift register length is
(N+1), where N is the input address (0 to 31).

The Q output changes synchronously with each shift operation. The previous bit is shifted to
the next position and appears on the Q output.

Shift Register Summary


A shift operation requires one clock edge.
Dynamic-length read operations are asynchronous (Q output).
Static-length read operations are synchronous (Q output).
The data input has a setup-to-clock timing specification.
In a cascadable configuration, the Q31 output always contains the last bit value.
The Q31 output changes synchronously after each shift operation.

Multiplexers
Function generators and associated multiplexers in SLICEL or SLICEM can implement the
following:
4:1 multiplexers using one LUT
8:1 multiplexers using two LUTs
16:1 multiplexers using four LUTs

These wide-input multiplexers are implemented in one level of logic (or LUT) using the
dedicated F7AMUX, F7BMUX, and F8MUX multiplexers. These multiplexers allow LUT
combinations of up to four LUTs in a slice. Dedicated multiplexers can be automatically
inferred from the design, or the specific primitives can be instantiated. See the WP309: Targeting
and Retargeting Guide for Spartan-6 FPGAs white paper.

Designing Large Multiplexers

4:1 Multiplexer
Each LUT can be configured into a 4:1 MUX. The 4:1 MUX can be implemented with a
flip-flop in the same slice. Up to four 4:1 MUXes can be implemented in a slice, as shown in
Figure 23.


Four 4:1 Multiplexers in a Slice

8:1 Multiplexer
Each SLICEL or SLICEM has an F7AMUX and an F7BMUX. These two multiplexers
combine the outputs of two LUTs to form a combinatorial function of up to 13 inputs (or an 8:1
MUX). Up to two 8:1 MUXes can be implemented in a slice, as shown in Figure 24.


Two 8:1 Multiplexers in a Slice

16:1 Multiplexer
Each SLICEL or SLICEM has an F8MUX. F8MUX combines the outputs of F7AMUX and
F7BMUX to form a combinatorial function of up to 27 inputs (or a 16:1 MUX). Only one 16:1
MUX can be implemented in a slice, as shown in Figure 25.


16:1 Multiplexer in a Slice


It is possible to create multiplexers wider than 16:1 across more than one SLICEM.
However, there are no direct connections between slices to form these wide multiplexers.
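A behavioral sketch of the 16:1 multiplexer described above is shown below; synthesis tools can typically map it onto the four LUT6s of one slice plus the F7AMUX, F7BMUX, and F8MUX dedicated multiplexers (names are illustrative):

module mux16 (
  input  wire [15:0] d,
  input  wire [3:0]  sel,
  output wire        y
);
  // 16:1 multiplexer: four LUT6s (4:1 each) combined by the F7/F8 muxes.
  assign y = d[sel];
endmodule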

Fast Lookahead Carry Logic


In addition to function generators, SLICEM and SLICEL (but not SLICEX) contain
dedicated carry logic to perform fast arithmetic addition and subtraction in a slice. A CLB has
one carry chain, as shown in Figure 1. The carry chains are cascadable to form wider
add/subtract logic, as shown in Figure 2.

The carry chain in the Spartan-6 device runs upward and has a height of four bits per
slice. For each bit, there is a carry multiplexer (MUXCY) and a dedicated XOR gate for
adding/subtracting the operands with a selected carry bit. Typically, the carry logic allows
four bits of a counter or other arithmetic function to fit in each slice, independent of the
function's total size. The dedicated carry path and carry multiplexer (MUXCY) can also be
used to cascade function generators for implementing wide logic functions.

Figure 26 illustrates the carry chain with associated logic elements in a slice.

Fast Carry Logic Path and Associated Elements


The carry chain implements carry lookahead logic together with the function generators. There are ten
independent inputs (S inputs S0 to S3, DI inputs DI1 to DI4, CYINIT, and CIN) and eight
independent outputs (O outputs O0 to O3, and CO outputs CO0 to CO3).

The S inputs are used for the propagate signals of the carry lookahead logic. The
propagate signals are sourced from the O6 output of a function generator. The DI inputs are
used for the generate signals of the carry lookahead logic. The generate signals are
sourced from either the O5 output of a function generator or the BYPASS input (AX, BX,
CX, or DX) of a slice. The former input is used to create a multiplier, while the latter is used

to create an adder/accumulator. CYINIT is the CIN of the first bit in a carry chain. The
CYINIT value can be 0 (for add), 1 (for subtract), or AX input (for the dynamic first carry
bit). The CIN input is used to cascade slices to form a longer carry chain. The O outputs
contain the sum of the addition/subtraction. The CO outputs compute the carry out for each
bit. CO3 is connected to COUT output of a slice to form a longer carry chain by cascading
multiple slices. The propagation delay for an adder increases linearly with the number of bits
in the operand, as more carry chains are cascaded. The carry chain can be implemented with a
storage element or a flip-flop in the same slice.

Consider using the DSP48A1 slice adders (see the Spartan-6 FPGA DSP48A1 Slice User
Guide) for designs consuming too many carry logic resources.

To conserve carry logic resources when designing with adder trees, the 6-input LUT
architecture can efficiently create ternary addition (A + B + C = D) using the same amount of
resources as simple 2-input addition.
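A minimal sketch of such a ternary addition is shown below; with the 6-input LUT and carry architecture, synthesis tools can generally implement it in roughly the same number of slices as a 2-input adder of the same width (the width is illustrative):

module ternary_add #(parameter W = 16) (
  input  wire [W-1:0] a, b, c,
  output wire [W+1:0] sum
);
  // A + B + C computed in one pass through the LUT/carry structure.
  assign sum = a + b + c;
endmodule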

Using the Latch Function as Logic


Since the latch function is level-sensitive, it can be used as the equivalent of a logic gate. The
primitives to specify this function are AND2B1L (a 2-input AND gate with one input
inverted) and OR2L (a 2-input OR gate), as shown in Figure 27.

AND2B1L and OR2L Components

As shown in Figure 28, the data and SR inputs and Q output of the latch are used when the
AND2B1L and OR2L primitives are instantiated, and the CK and CE gate enables are
held active High. The AND2B1L combines the latch data input (the inverted input on the
gate, DI) with the asynchronous clear input (SRI). The OR2L combines the latch data input
with an asynchronous preset. Generally, the latch data input comes from the output of a LUT
within the same slice, extending the logic capability to another external input. Since there is
only one SR input per slice, using more than one AND2B1L or OR2L per slice requires a
shared common external input.

Implementation of OR2L (Q = D or SRI)


The device model shows these functions as AND2L and OR2L configurations of the storage
element (Figure 3 through Figure 5). The ISE software reports these as AND/OR Logics
within the slice utilization report. As shown in Table 7, the two inputs of the OR2L gate are
not architecturally equivalent; DI is the D input to the latch, and SRI is the SR input.

OR2L Logic Table

The AND2B1L and OR2L two-input gates save LUT resources and are initialized to a
known state on power-up and on GSR assertion. Using these primitives can reduce logic
levels and increase logic density of the device by trading register/latch resources for logic.
However, due to the static inputs required on the clock and clock enable inputs, specifying
one or more AND2B1L or OR2L primitives can cause register packing and density issues in
a slice, preventing the use of the remaining registers and latches.
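A hedged instantiation sketch of OR2L is shown below, following the Q = D or SRI behavior given in Table 7; it assumes the Unisim OR2L primitive with ports O, DI, and SRI, and the signal names are illustrative:

module or2l_example (
  input  wire lut_result,    // typically the output of a LUT in the same slice
  input  wire async_preset,  // drives the shared SR input of the slice
  output wire q
);
  // Q = lut_result OR async_preset, implemented in a slice latch.
  OR2L or2l_i (.O(q), .DI(lut_result), .SRI(async_preset));
endmodule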

Interconnect Resources
Interconnect is the programmable network of signal pathways between the inputs and outputs
of functional elements within the FPGA, such as IOBs, CLBs, DSP slices, and block RAM.
Interconnect, also called routing, is segmented for optimal connectivity. The Xilinx Place and
Route (PAR) tool within the ISE Design Suite software exploits the rich interconnect array to
deliver optimal system performance and the fastest compile times.

Most of the interconnect features are transparent to FPGA designers. Knowledge of the
interconnect details can be used to guide design techniques but is not necessary for efficient
FPGA design. Only selected types of interconnect are under user control. These include the
clock routing resources, which are selected by using clock buffers, and discussed in more
detail in the Spartan-6 FPGA Clocking Resources User Guide. Two global control signals,
GTS and GSR, are selected by using the STARTUP_SPARTAN6 primitive, which is
described in Global Controls. Knowledge of the general-purpose routing resources is helpful
when considering floorplanning the layout of a design.

Interconnect Types
The CLBs are arranged in a regular array inside the FPGA. Each connects to a switch matrix
for access to the general-routing resources, which run vertically and horizontally between the
CLB rows and columns (Figure 29). A similar switch matrix connects other resources, such
as the DSP slices and block RAM resources.


CLB Array and Interconnect Channels


The various types of routing in the Spartan-6 architecture are primarily defined by their
length (Figure 30). Longer routing elements are faster for longer distances.

Fast Interconnects
Fast connects route block outputs back to block inputs. Along with the larger size of the CLB,
fast connects provide higher performance for simpler functions.

Single Interconnects

Singles route signals to neighboring tiles, both vertically and horizontally.

Double Interconnects
Doubles connect to every other tile, both horizontally and vertically, in all four directions,
and to the diagonally adjacent tiles.

Quad Interconnects
Quads connect to one out of every four tiles, horizontally and vertically, and diagonally to
tiles two rows and two columns distant. Quad lines provide more flexibility than the single-
channel long lines of earlier generations.


Examples of Interconnect Types

Interconnect Delay and Optimization


Interconnect delays vary according to the specific implementation and loading in a design.
The type of interconnect, distance required to travel in the device, and number of switch
matrices to traverse factor into the total delay. A good estimate of interconnect delay is to use
the same value as the block delays in a path.

Most timing issues are addressed by examining the block delays and determining the
impact of using fewer levels or faster paths. If interconnect delays seem too long, increase
PAR effort levels or iterations to improve performance, and make sure that the
required timing is specified in the constraints file.

Nets with critical timing or that are heavily loaded can often be improved by replicating the
source of the net. The dual 5-input LUT configuration of the slice simplifies the replication of
logic in the same slice, which minimizes any additional loads on the inputs to the source
function. Replicating logic in multiple slices gives the software more flexibility to place the
sources independently.

Interconnect delays are typically improved not by changing the interconnect but by changing
the placement; this is the floorplanning process.


XC6SLX45T Floorplan View in PlanAhead

FPGA design flow

Development flow
The simplified development flow of an FPGA-based system is shown in Figure 2.4. To
facilitate further reading, we follow the terms used in the Xilinx documentation. The left
portion of the flow is the refinement and programming process, in which a system is
transformed from an abstract textual HDL description to a device cell-level configuration


Development flow
and then downloaded to the FPGA device. The right portion is the validation process, which
checks whether the system meets the functional specification and performance goals. The
major steps in the flow are:
Design the system and derive the HDL file(s). We may need to add a separate
constraint file to specify certain implementation constraints.
Develop the testbench in HDL and perform RTL simulation. The RTL term reflects
the fact that the HDL code is done at the register transfer level.
Perform synthesis and implementation. The synthesis process is generally known as
logic synthesis, in which the software transforms the HDL constructs to generic gate-
level components, such as simple logic gates and FFs. The implementation process
consists of three smaller processes: translate, map, and place and route. The translate
process merges multiple design files to a single netlist. The map process, which
is generally known as technology mapping, maps the generic gates in the netlist to
the FPGA's logic cells and IOBs. The place and route process, which is generally known
as placement and routing, derives the physical layout inside the FPGA chip. It places
the cells in physical locations and determines the routes to connect various signals. In
the Xilinx flow, static timing analysis, which determines various timing parameters,
such as maximal propagation delay and maximal clock frequency, is performed at
the end of the implementation process.
Generate and download the programming file. In this process, a configuration file is

generated according to the final netlist. This file is downloaded to an FPGA device
serially to configure the logic cells and switches. The physical circuit can be verified
accordingly.

The optional functional simulation can be performed after synthesis, and the optional timing
simulation can be performed after implementation. Functional simulation uses a synthesized
netlist to replace the RTL description and checks the correctness of the synthesis process.
Timing simulation uses the final netlist, along with detailed timing data, to perform
simulation. Because of the complexity of the netlist, functional and timing simulation may
require a significant amount of time. If we follow good design and coding practices, the HDL
code will be synthesized and implemented correctly. We only need to use RTL simulation to
check the correctness of the HDL code and use static timing analysis to examine the relevant
timing information. Both functional and timing simulations can be omitted from the
development flow.
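To make the flow concrete, a trivial design and a self-checking RTL testbench might look as follows (both entirely illustrative); the testbench drives the RTL simulation step, and the same design file then goes through synthesis and implementation:

// Design under test
module and2_gate (input wire a, b, output wire y);
  assign y = a & b;
endmodule

// RTL testbench: applies all input combinations and checks the output.
module tb_and2_gate;
  reg a, b;
  wire y;
  integer i;
  and2_gate dut (.a(a), .b(b), .y(y));
  initial begin
    for (i = 0; i < 4; i = i + 1) begin
      {a, b} = i[1:0];
      #10;
      if (y !== (a & b)) $display("Mismatch for a=%b b=%b", a, b);
    end
    $finish;
  end
endmodule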

Spartan-6
The Spartan-6 family provides leading system integration capabilities with the lowest total
cost for high-volume applications. The thirteen-member family delivers expanded densities
ranging from 3,840 to 147,443 logic cells, with half the power consumption of previous
Spartan families, and faster, more comprehensive connectivity. Built on a mature 45 nm low
power copper process technology that delivers the optimal balance of cost, power, and
performance, the Spartan-6 family offers a new, more efficient, dual-register 6-input lookup
table (LUT) logic and a rich selection of built-in system-level blocks. These include 18 Kb (2
x 9 Kb) block RAMs, second generation DSP48A1 slices, SDRAM memory controllers,
enhanced mixed-mode clock management blocks, SelectIO technology, power optimized
high-speed serial transceiver blocks, PCI Express compatible Endpoint blocks, advanced
system-level power management modes, auto-detect configuration options, and enhanced IP
security with AES and Device DNA protection. These features provide a low cost
programmable alternative to custom ASIC products with unprecedented ease of use. Spartan-6
FPGAs offer the best solution for high-volume logic designs, consumer-oriented DSP
designs, and cost-sensitive embedded applications. Spartan-6 FPGAs are the programmable
silicon foundation for Targeted Design Platforms that deliver integrated software and
hardware components that enable designers to focus on innovation as soon as their
development cycle begins.


Spartan-6 FPGA Feature Summary

Table 1: Spartan-6 FPGA Feature Summary by Device

Notes:
Spartan-6 FPGA logic cell ratings reflect the increased logic cell capability offered by
the new 6-input LUT architecture.
Each Spartan-6 FPGA slice contains four LUTs and eight flip-flops.
Each DSP48A1 slice contains an 18 x 18 multiplier, an adder, and an accumulator.
Block RAMs are fundamentally 18 Kb in size. Each block can also be used as two
independent 9 Kb blocks.
Each CMT contains two DCMs and one PLL.
Memory Controller Blocks are not supported in the -3N speed grade.

Configuration
Spartan-6 FPGAs store the customized configuration data in SRAM-type internal latches.
The number of configuration bits is between 3 Mb and 33 Mb depending on device size and
user-design implementation options. The configuration storage is volatile and must be
reloaded whenever the FPGA is powered up. This storage can also be reloaded at any time by
pulling the PROGRAM_B pin Low. Several methods and data formats for loading
configuration are available.

Bit-serial configurations can be either master serial mode, where the FPGA generates the
configuration clock (CCLK) signal, or slave serial mode, where the external configuration
data source also clocks the FPGA. For byte-wide configurations, master SelectMAP mode
generates the CCLK signal while slave SelectMAP mode receives the CCLK signal for the 8-
and 16-bit-wide transfer. In master serial mode, the beginning of the bitstream can optionally
switch the clocking source to an external clock, which can be faster or more precise than the
internal clock. The available JTAG pins use boundary-scan protocols to load bit-serial
configuration data.

The bitstream configuration information is generated by the ISE software using a program
called BitGen. The configuration process typically executes the following sequence:

Detects power-up (power-on reset) or PROGRAM_B when Low.
Clears the whole configuration memory.
Samples the mode pins to determine the configuration mode: master or slave, bit-
serial or parallel.
Loads the configuration data starting with the bus-width detection pattern followed by
a synchronization word, checks for the proper device code, and ends with a cyclic
redundancy check (CRC) of the complete bitstream.
Starts a user-defined sequence of events: releasing the internal reset (or preset) of flip-
flops, optionally waiting for the DCMs and/or PLLs to lock, activating the output
drivers, and transitioning the DONE pin to High.

The Master Serial Peripheral Interface (SPI) and the Master Byte-wide Peripheral Interface
(BPI) are two common methods used for configuring the FPGA. The Spartan-6 FPGA
configures itself from a directly attached industry-standard SPI serial flash PROM. The
Spartan-6 FPGA can configure itself via BPI when connected to an industry-standard parallel
NOR flash.

Note that BPI configuration is not supported in the XC6SLX4, XC6SLX25, and
XC6SLX25T nor is BPI available when using Spartan-6 FPGAs in TQG144 and CPG196
packages.

Spartan-6 FPGAs support MultiBoot configuration, where two or more FPGA configuration
bitstreams can be stored in a single configuration source. The FPGA application controls
which configuration to load next and when to load it.

Spartan-6 FPGAs also include a unique, factory-programmed Device DNA identifier that is
useful for tracking purposes, anticloning designs, or IP protection. In the largest devices,
bitstreams can be copy protected using AES encryption.

Readback
Most configuration data can be read back without affecting the system's operation.

CLBs, Slices, and LUTs


Each configurable logic block (CLB) in Spartan-6 FPGAs consists of two slices, arranged
side-by-side as part of two vertical columns. There are three types of CLB slices in the
Spartan-6 architecture: SLICEM, SLICEL, and SLICEX. Each slice contains four LUTs,
eight flip-flops, and miscellaneous logic. The LUTs are for general-purpose combinatorial
and sequential logic support. Synthesis tools take advantage of these highly efficient logic,
arithmetic, and memory features. Expert designers can also instantiate them.

SLICEM
One quarter (25%) of Spartan-6 FPGA slices are SLICEMs. Each of the four SLICEM LUTs
can be configured as either a 6-input LUT with one output, or as dual 5-input LUTs with
identical 5-bit addresses and two independent outputs. These LUTs can also be used as
distributed 64-bit RAM with 64 bits or two times 32 bits per LUT, as a single 32-bit shift
register (SRL32), or as two 16-bit shift registers (SRL16s) with addressable length. Each

LUT output can be registered in a flip-flop within the CLB. For arithmetic operations, a high
speed carry chain propagates carry signals upwards in a column of slices.

SLICEL
One quarter (25%) of Spartan-6 FPGA slices are SLICELs, which contain all the features of
the SLICEM except the memory/shift register function.

SLICEX
One half (50%) of Spartan-6 FPGA slices are SLICEXs. The SLICEXs have the same
structure as SLICELs except the arithmetic carry option and the wide multiplexers.

Clock Management
Each Spartan-6 FPGA has up to six CMTs, each consisting of two DCMs and one PLL,
which can be used individually or cascaded.

DCM
The DCM provides four phases of the input frequency (CLKIN): shifted by 0°, 90°, 180°, and
270° (CLK0, CLK90, CLK180, and CLK270). It also provides a doubled frequency CLK2X
and its complement CLK2X180. The CLKDV output provides a fractional clock frequency
that can be phase-aligned to CLK0. The fraction is programmable as every integer from 2 to
16, as well as 1.5, 2.5, 3.5, ..., 7.5. CLKIN can optionally be divided by 2. The DCM can be a
zero-delay clock buffer when a clock signal drives CLKIN, while the CLK0 output is fed
back to the CLKFB input.

Frequency Synthesis
Independent of the basic DCM functionality, the frequency synthesis outputs CLKFX and
CLKFX180 can be programmed to generate any output frequency that is the DCM input
frequency (FIN) multiplied by M and simultaneously divided by D, where M can be any
integer from 2 to 32 and D can be any integer from 1 to 32.
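A hedged sketch of this frequency synthesis is shown below, assuming the Xilinx DCM_SP and BUFG Unisim primitives: with an assumed 50 MHz input, M = 5 and D = 2 give CLKFX = 50 x 5 / 2 = 125 MHz, while CLK0 fed back through a BUFG makes the DCM act as a zero-delay buffer:

module clkgen_dcm (
  input  wire clk_in,   // assumed 50 MHz input clock
  input  wire rst,
  output wire clkfx,    // 50 MHz x 5 / 2 = 125 MHz
  output wire locked
);
  wire clk0, clk0_buf;
  DCM_SP #(
    .CLKIN_PERIOD   (20.0),   // input period in ns (assumed)
    .CLKFX_MULTIPLY (5),      // M, any integer from 2 to 32
    .CLKFX_DIVIDE   (2),      // D, any integer from 1 to 32
    .CLK_FEEDBACK   ("1X")
  ) dcm_i (
    .CLKIN(clk_in), .CLKFB(clk0_buf), .RST(rst),
    .CLK0(clk0), .CLKFX(clkfx), .LOCKED(locked),
    .PSEN(1'b0), .PSCLK(1'b0), .PSINCDEC(1'b0));
  BUFG bufg_fb (.I(clk0), .O(clk0_buf));
endmodule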

Phase Shifting
With CLK0 connected to CLKFB, all nine CLK outputs (CLK0, CLK90, CLK180, CLK270,
CLK2X, CLK2X180, CLKDV, CLKFX, and CLKFX180) can be shifted by a common
amount, defined as any integer multiple of a fixed delay. A fixed DCM delay value (fraction
of the input period) can be established by configuration and can also be incremented or
decremented dynamically.

Spread-Spectrum Clocking
The DCM can accept and track typical spread-spectrum clock inputs, provided they abide by
the input clock specifications listed in the Spartan-6 FPGA Data Sheet: DC and Switching

Characteristics. Spartan-6 FPGAs can generate a spread spectrum clock source from a
standard fixed-frequency oscillator.

PLL
The PLL can serve as a frequency synthesizer for a wider range of frequencies and as a jitter
filter for incoming clocks in conjunction with the DCMs. The heart of the PLL is a voltage
controlled oscillator (VCO) with a frequency range of 400 MHz to 1,080 MHz, thus spanning
more than one octave. Three sets of programmable frequency dividers (D, M, and O) adapt
the VCO to the required application.

The pre-divider D (programmable by configuration) reduces the input frequency and feeds
one input of the traditional PLL phase comparator. The feedback divider (programmable by
configuration) acts as a multiplier because it divides the VCO output frequency before
feeding the other input of the phase comparator. D and M must be chosen appropriately to
keep the VCO within its controllable frequency range.

The VCO has eight equally spaced outputs (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°).
Each can be selected to drive one of the six output dividers, O0 to O5 (each programmable by
configuration to divide by any integer from 1 to 128).
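A hedged PLL sketch, assuming the PLL_BASE Unisim primitive: with an assumed 100 MHz input, D = 1 and M = 8 place the VCO at 800 MHz (inside the 400 MHz to 1,080 MHz range), and output divider O0 = 4 produces a 200 MHz clock:

module clkgen_pll (
  input  wire clk_in,    // assumed 100 MHz reference clock
  input  wire rst,
  output wire clk_out,   // 100 MHz x 8 / 1 / 4 = 200 MHz
  output wire locked
);
  wire fb;
  PLL_BASE #(
    .CLKIN_PERIOD   (10.0),  // input period in ns (assumed)
    .DIVCLK_DIVIDE  (1),     // D: pre-divider
    .CLKFBOUT_MULT  (8),     // M: VCO = 100 MHz x 8 = 800 MHz
    .CLKOUT0_DIVIDE (4)      // O0: 800 MHz / 4 = 200 MHz
  ) pll_i (
    .CLKIN(clk_in), .CLKFBIN(fb), .RST(rst),
    .CLKFBOUT(fb), .CLKOUT0(clk_out),
    .CLKOUT1(), .CLKOUT2(), .CLKOUT3(), .CLKOUT4(), .CLKOUT5(),
    .LOCKED(locked));
endmodule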

Clock Distribution
Each Spartan-6 FPGA provides abundant clock lines to address the different clocking
requirements of high fanout, short propagation delay, and extremely low skew.

Global Clock Lines


In each Spartan-6 FPGA, 16 global-clock lines have the highest fanout and can reach every
flip-flop clock. Global clock lines must be driven by global clock buffers, which can also
perform glitchless clock multiplexing and the clock enable function.

Global clocks are often driven from the CMTs, which can completely eliminate the basic
clock distribution delay.

I/O Clocks
I/O clocks are especially fast and serve only the localized input and output delay circuits and
the I/O serializer/deserializer (SERDES) circuits, as described in the I/O Logic section.

Block RAM
Every Spartan-6 FPGA has between 12 and 268 dual-port block RAMs, each storing 18 Kb.
Each block RAM has two completely independent ports that share only the stored data.


Synchronous Operation
Each memory access, whether read or write, is controlled by the clock. All inputs, data,
address, clock enables, and write enables are registered. The data output is always latched,
retaining data until the next operation. An optional output data pipeline register allows higher
clock rates at the cost of an extra cycle of latency.

During a write operation in dual-port mode, the data output can reflect either the previously
stored data, the newly written data, or remain unchanged.
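A behavioral sketch of a simple dual-port memory with registered read and an extra output pipeline register is shown below; synthesis tools commonly map such a template (here 1K x 18, an illustrative size) onto one 18 Kb block RAM:

module bram_sdp #(
  parameter AW = 10,        // 1K deep
  parameter DW = 18         // 18 bits wide
)(
  input  wire          clk,
  input  wire          we,
  input  wire [AW-1:0] waddr, raddr,
  input  wire [DW-1:0] din,
  output reg  [DW-1:0] dout
);
  reg [DW-1:0] mem [0:(1<<AW)-1];
  reg [DW-1:0] read_r;
  always @(posedge clk) begin
    if (we) mem[waddr] <= din;  // synchronous write
    read_r <= mem[raddr];       // synchronous (latched) read
    dout   <= read_r;           // optional output pipeline register
  end
endmodule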

Programmable Data Width


Each port can be configured as 16K × 1, 8K × 2, 4K × 4, 2K × 9 (or 8), 1K × 18 (or
16), or 512 × 36 (or 32).
The ×9, ×18, and ×36 configurations include parity bits. The two ports can have
different aspect ratios.
Each block RAM can be divided into two completely independent 9 Kb block RAMs
that can each be configured to any aspect ratio from 8K × 1 to 512 × 18, with 256 × 36
supported in simple dual-port mode.

Memory Controller Block


Most Spartan-6 devices include dedicated memory controller blocks (MCBs), each targeting
a single-chip DRAM (either DDR, DDR2, DDR3, or LPDDR), and supporting access rates of
up to 800 Mb/s.

The MCB has dedicated routing to predefined FPGA I/Os. If the MCB is not used, these I/Os
are available as general purpose FPGA I/Os. The memory controller offers a complete multi-port
arbitrated interface to the logic inside the Spartan-6 FPGA. Commands can be pushed,
and data can be pushed to and pulled from independent built-in FIFOs, using conventional
FIFO control signals. The multi-port memory controller can be configured in many ways. An
internal 32-, 64-, or 128-bit data interface provides a simple and reliable interface to the
MCB.

The MCB can be connected to 4-, 8-, or 16-bit external DRAM. The MCB, in many
applications, provides a faster DRAM interface compared to traditional internal data buses,
which are wider and are clocked at a lower frequency. The FPGA logic interface can be
flexibly configured irrespective of the physical memory device. The MCB functionality is not
supported in the -3N speed grade.

Digital Signal Processing: DSP48A1 Slice


DSP applications use many binary multipliers and accumulators, best implemented in
dedicated DSP slices. All Spartan-6 FPGAs have many dedicated, full-custom, low-power
DSP slices, combining high speed with small size, while retaining system design flexibility.
Each DSP48A1 slice consists of a dedicated 18 × 18 bit two's complement multiplier and a
48-bit accumulator, both capable of operating at up to 390 MHz. The DSP48A1 slice
provides extensive pipelining and extension capabilities that enhance speed and efficiency of
many applications, even beyond digital signal processing, such as wide dynamic bus shifters,
memory address generators, wide bus multiplexers, and memory-mapped I/O register files.
The accumulator can also be used as a synchronous up/down counter. The multiplier can
perform barrel shifting.
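A behavioral multiply-accumulate sketch that synthesis tools would typically map onto one DSP48A1 slice (with the input, multiplier, and accumulator registers absorbed into the slice) is shown below; all names and the pipelining choice are illustrative:

module mac18 (
  input  wire               clk, ce,
  input  wire signed [17:0] a, b,
  output reg  signed [47:0] acc = 48'd0
);
  reg signed [17:0] a_r, b_r;   // input registers
  reg signed [35:0] p_r;        // pipelined 18 x 18 product
  always @(posedge clk)
    if (ce) begin
      a_r <= a;
      b_r <= b;
      p_r <= a_r * b_r;
      acc <= acc + p_r;         // 48-bit accumulation
    end
endmodule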

Input/Output
The number of I/O pins varies from 102 to 576, depending on device and package size. Each
I/O pin is configurable and can comply with a large number of standards, using up to 3.3V.
The Spartan-6 FPGA SelectIO Resources User Guide describes the I/O compatibilities of the
various I/O options. With the exception of supply pins and a few dedicated configuration
pins, all other package pins have the same I/O capabilities, constrained only by certain
banking rules. All user I/O is bidirectional; there are no input-only pins.

All I/O pins are organized in banks, with four banks on the smaller devices and six banks on
the larger devices. Each bank has several common VCCO output supply-voltage pins, which
also power certain input buffers. Some single-ended input buffers require an externally
applied reference voltage (VREF). There are several dual-purpose VREF-I/O pins in each
bank.

In a given bank, when an I/O standard calls for a VREF voltage, each VREF pin in that bank
must be connected to the same voltage rail and cannot be used as an I/O pin.

I/O Electrical Characteristics


Single-ended outputs use a conventional CMOS push/pull output structure, driving High
towards VCCO or Low towards ground, and can be put into high-Z state. Many I/O features
are available to the system designer to optionally invoke in each I/O in their design, such as
weak internal pull-up and pull-down resistors, strong internal split-termination input resistors,
adjustable output drive-strengths and slew-rates, and differential termination resistors. See the
Spartan-6 FPGA SelectIO Resources User Guide for more details on available options for
each I/O standard.

I/O Logic

Input and Output Delay


This section describes the available logic resources connected to the I/O interfaces. All inputs
and outputs can be configured as either combinatorial or registered. Double data rate (DDR)
is supported by all inputs and outputs. Any input or output can be individually delayed by up
to 256 increments (except in the -1L speed grade). This is implemented as IODELAY2. The
identical delay value is available either for data input or output. For a bidirectional data line,
the transfer from input to output delay is automatic. The number of delay steps can be set by
configuration and can also be incremented or decremented while in use.

Because these tap delays vary with supply voltage, process, and temperature, an optional
calibration mechanism is built into each IODELAY2:
For source synchronous designs where more accuracy is required, the calibration
mechanism can (optionally) determine dynamically how many taps are needed to
delay data by one full I/O clock cycle, and then programs the IODELAY2 with 50%
of that value, thus centering the I/O clock in the middle of the data eye.

A special mode is available only for differential inputs, which uses a phase-detector
mechanism to determine whether the incoming data signal is being accurately
sampled in the middle of the eye. The results from the phase-detector logic can be
used to either increment or decrement the input delay, one tap at a time, to ensure
error-free operation at very high bit rates.
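As a rough illustration of the calibration arithmetic, the sketch below picks a tap count for one assumed I/O clock period and then programs half of it; the tap delay and clock period are made-up assumptions, not device specifications.

#include <stdio.h>

/* Illustrative only: estimate how many delay taps span one I/O clock period,
 * then use half of that value to center the clock in the data eye. */
int main(void)
{
    const double tap_delay_ps    = 40.0;    /* assumed average tap delay, ps */
    const double clock_period_ps = 2500.0;  /* assumed 400 MHz I/O clock, ps */

    int taps_per_cycle = (int)(clock_period_ps / tap_delay_ps + 0.5); /* ~63 */
    int centering_taps = taps_per_cycle / 2;                          /* ~31 */

    printf("taps per clock cycle: %d, programmed delay: %d taps\n",
           taps_per_cycle, centering_taps);
    return 0;
}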

ISERDES and OSERDES


Many applications combine high-speed bit-serial I/O with slower parallel operation inside the
device. This requires a serializer and deserializer (SerDes) inside the I/O structure. Each input
has access to its own deserializer (serial-to-parallel converter) with programmable parallel
width of 2, 3, or 4 bits. Where differential inputs are used, the two deserializers can be
cascaded to provide parallel widths of 5, 6, 7, or 8 bits. Each output has access to its own serializer
(parallel-to-serial converter) with programmable parallel width of 2, 3, or 4 bits. Two
serializers can be cascaded when a differential driver is used to give access to bus widths of
5, 6, 7, or 8 bits.
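A behavioral sketch of the serial-to-parallel idea in plain C (not the hardware primitive): one bit is shifted in per bit clock and a 4-bit word is emitted every fourth bit; cascading two such converters is what yields the wider 5- to 8-bit widths mentioned above.

#include <stdio.h>

/* Behavioral model of a 1:4 deserializer. The bit ordering (MSB first) is an
 * assumption made for illustration. */
int main(void)
{
    const int serial_bits[] = {1, 0, 1, 1, 0, 0, 1, 0};  /* example input stream */
    unsigned word = 0;

    for (int i = 0; i < 8; i++) {
        word = (word << 1) | (unsigned)serial_bits[i];
        if ((i + 1) % 4 == 0) {
            printf("parallel word: 0x%X\n", word & 0xFu);  /* prints 0xB, then 0x2 */
            word = 0;
        }
    }
    return 0;
}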

To eliminate the possibility of bit errors due to duty-cycle distortion, all SerDes data is
clocked in and out at a single data rate. This faster single-data-rate clock is either derived
through frequency multiplication in a PLL, or generated locally in each IOB by using both
edges of the incoming clock when that clock is distributed at a double data rate.

Low-Power Gigabit Transceiver


Ultra-fast data transmission between ICs, over the backplane, or over longer distances is
becoming increasingly popular and important. It requires specialized, dedicated on-chip
circuitry and differential I/O capable of coping with the signal integrity issues at these high
data rates.

All Spartan-6 LXT devices have 2 to 8 gigabit transceiver circuits. Each GTP transceiver is a
combined transmitter and receiver capable of operating at data rates up to 3.2 Gb/s. The
transmitter and receiver are independent circuits that use separate PLLs to multiply the
reference frequency input by certain programmable numbers between 2 and 25, to become
the bit-serial data clock. Each GTP transceiver has a large number of user-definable features
and parameters. All of these can be defined during device configuration, and many can also
be modified during operation.

Transmitter
The transmitter is fundamentally a parallel-to-serial converter with a conversion ratio of 8,
10, 16, or 20. The transmitter output drives the PC board with a single-channel differential
current-mode logic (CML) output signal.

TXOUTCLK is the appropriately divided serial data clock and can be used directly to register
the parallel data coming from the internal logic. The incoming parallel data is fed through a
small FIFO and can optionally be modified with the 8B/10B algorithm to guarantee a
sufficient number of transitions. The bit-serial output signal drives two package pins with
complementary CML signals. This output signal pair has programmable signal swing as well
as programmable preemphasis to compensate for PC board losses and other interconnect
characteristics.

Receiver
The receiver is fundamentally a serial-to-parallel converter, changing the incoming bit serial
differential signal into a parallel stream of words, each 8, 10, 16, or 20 bits wide. The receiver
takes the incoming differential data stream, feeds it through a programmable equalizer (to
compensate for the PC board and other interconnect characteristics), and uses the FREF input
to initiate clock recognition. There is no need for a separate clock line. The data pattern uses
non-return-to-zero (NRZ) encoding and optionally guarantees sufficient data transitions by
using the 8B/10B encoding scheme. Parallel data is then transferred into the FPGA logic
using the RXUSRCLK clock. The serial-to-parallel conversion ratio can be 8, 10, 16, or 20.

Integrated Endpoint Block for PCI Express Designs


The PCI Express standard is a packet-based, point-to-point serial interface standard. The
differential signal transmission uses an embedded clock, which eliminates the clock-to-data
skew problems of traditional wide parallel buses.

The PCI Express Base Specification 1.1 defines a bit rate of 2.5 Gb/s per lane, per direction
(transmit and receive). When using 8B/10B encoding, this supports a data rate of 2.0 Gb/s per
lane.
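The relationship between the line rate and the usable data rate follows directly from the 8B/10B overhead, as the small calculation below shows.

#include <stdio.h>

/* 8B/10B maps every 8 data bits onto 10 line bits, so only 80% of the line
 * rate carries payload: 2.5 Gb/s x 8/10 = 2.0 Gb/s per lane, per direction. */
int main(void)
{
    double line_rate_gbps = 2.5;
    double data_rate_gbps = line_rate_gbps * 8.0 / 10.0;

    printf("effective data rate: %.1f Gb/s per lane\n", data_rate_gbps);
    return 0;
}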

The Spartan-6 LXT devices include one integrated Endpoint block for PCI Express
technology that is compliant with the PCI Express Base Specification Revision 1.1. This
block is highly configurable to system design requirements and operates as a compliant single
lane Endpoint. The integrated Endpoint block interfaces to the GTP transceivers for
serialization/deserialization, and to block RAMs for data buffering. Combined, these
elements implement the physical layer, data link layer, and transaction layer of the protocol.

Xilinx provides a light-weight (<200 LUT), configurable, easy-to-use LogiCORE IP that
ties the various building blocks (the integrated Endpoint block for PCI Express technology,
the GTP transceivers, block RAM, and clocking resources) into a compliant Endpoint
solution. The system designer has control over many configurable parameters: maximum
payload size, reference clock frequency, and base address register decoding and filtering.


AXI

What is AXI?
AXI is part of ARM AMBA, a family of microcontroller buses first introduced in 1996. The
first version of AXI was included in AMBA 3.0, released in 2003. AMBA 4.0, released
in 2010, includes the second version of AXI, AXI4.

There are three types of AXI4 interfaces:


AXI4: for high-performance memory-mapped requirements.
AXI4-Lite: for simple, low-throughput memory-mapped communication (for
example, to and from control and status registers).
AXI4-Stream: for high-speed streaming data.

Xilinx introduced these interfaces in the ISE Design Suite, release 12.3.

AXI4 Benefits
AXI4 provides improvements and enhancements to the Xilinx product offering across the
board, providing benefits to Productivity, Flexibility, and Availability:
Productivity: By standardizing on the AXI interface, developers need to learn only a
single protocol for IP.
Flexibility: Providing the right protocol for the application:
AXI4 is for memory-mapped interfaces and allows bursts of up to 256 data transfer
cycles with just a single address phase.
AXI4-Lite is a light-weight, single-transaction memory-mapped interface. It has a
small logic footprint and is a simple interface to work with, both in design and
usage.
AXI4-Stream removes the requirement for an address phase altogether and allows
unlimited data burst size. AXI4-Stream interfaces and transfers do not have
address phases and are therefore not considered to be memory-mapped.
Availability: By moving to an industry standard, you have access not only to the
Xilinx IP catalog, but also to a worldwide community of ARM Partners.

Many IP providers support the AXI protocol.
A robust collection of third-party AXI tool vendors provides a variety of
verification, system development, and performance characterization tools. As you
begin developing higher-performance AXI-based systems, the availability of these
tools is essential.

How AXI Works


This section provides a brief overview of how the AXI interface works. Consult the ARM
AMBA AXI specifications for the complete details on AXI operation.

The AXI specifications describe an interface between a single AXI master and a single AXI
slave, representing IP cores that exchange information with each other. Memory-mapped AXI
masters and slaves can be connected together using a structure called an Interconnect block.
The Xilinx AXI Interconnect IP contains AXI-compliant master and slave interfaces, and can
be used to route transactions between one or more AXI masters and slaves. The AXI
Interconnect IP is described in the Xilinx AXI Interconnect Core IP documentation.

Both AXI4 and AXI4-Lite interfaces consist of five different channels:


Read Address Channel
Write Address Channel
Read Data Channel
Write Data Channel
Write Response Channel

Data can move in both directions between the master and slave simultaneously, and data
transfer sizes can vary. The limit in AXI4 is a burst transaction of up to 256 data transfers.
AXI4-Lite allows only 1 data transfer per transaction.
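In AXI4 the burst length travels with the address: the 8-bit AWLEN/ARLEN field carries the number of data beats minus one, which is why an INCR burst tops out at 256 beats while AXI4-Lite is always a single beat. A small sketch of that encoding (the helper function is hypothetical, not part of any Xilinx IP):

#include <assert.h>
#include <stdint.h>

/* Encode an AXI4 INCR burst length into the 8-bit AxLEN field, which holds
 * (number of beats - 1). AXI4 INCR bursts allow 1 to 256 beats; an AXI4-Lite
 * access is always a single beat. */
static uint8_t encode_axlen(unsigned beats)
{
    assert(beats >= 1 && beats <= 256);
    return (uint8_t)(beats - 1);
}

int main(void)
{
    assert(encode_axlen(1)   == 0x00);  /* single beat, as in AXI4-Lite   */
    assert(encode_axlen(256) == 0xFF);  /* maximum AXI4 INCR burst length */
    return 0;
}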

The figure below shows how an AXI4 Read transaction uses the Read address and Read
data channels:


Channel Architecture of Reads

The next figure shows how a Write transaction uses the Write address, Write data, and Write
response channels.

Channel Architecture of Writes


As shown in the preceding figures, AXI4 provides separate data and address connections for
reads and writes, which allows simultaneous, bidirectional data transfer. AXI4 requires a
single address and then bursts up to 256 words of data. The AXI4 protocol describes a variety
of options that allow AXI4-compliant systems to achieve very high data throughput. Some of
these features, in addition to bursting, are: data upsizing and downsizing, multiple
outstanding addresses, and out-of-order transaction processing.

At a hardware level, AXI4 allows a different clock for each AXI master-slave pair. In
addition, the AXI protocol allows the insertion of register slices (often called pipeline stages)
to aid in timing closure.

AXI4-Lite is similar to AXI4, with some exceptions; the most notable is that bursting is not
supported. The AXI4-Lite chapter of the ARM AMBA AXI Protocol v2.0 Specification
describes the AXI4-Lite protocol in more detail.

The AXI4-Stream protocol defines a single channel for transmission of streaming data. The
AXI4-Stream channel is modeled after the write data channel of the AXI4. Unlike AXI4,
AXI4-Stream interfaces can burst an unlimited amount of data. There are additional, optional
capabilities described in the AXI4-Stream Protocol Specification. The specification describes
how AXI4-Stream-compliant interfaces can be split, merged, interleaved, upsized, and
downsized. Unlike AXI4, AXI4-Stream transfers cannot be reordered.

Note: With regard to AXI4-Stream, even if two pieces of IP are designed in accordance with
the AXI4-Stream specification and are compatible at a signaling level, this does not guarantee
that the two components will function correctly together, due to higher-level system
considerations.

Basic AXI4 Signaling: Five Channels, Point to Point

AXI4 Interface


AXI Master Burst


The AXI Master Burst is designed to provide the user with a quick way to implement a light-
weight mastering interface between user logic and AXI4. Figure 1 shows a block diagram of
the AXI Master Burst; the port references and groupings are detailed in Table 1. The design
is parameterizable to transfer data in 32-, 64-, and 128-bit widths for AXI4 read and write
transactions. The transaction request protocol between AXI4 and the user logic is provided
by the IPIC Command and Status Adapter block. The primary data transport function is
provided by the Read and Write Controllers.


Typical System Interconnect


The AXI Master Burst is designed to be instantiated within a User IP design as a helper core.
A typical use case is shown in Figure 2: the AXI Master Burst allows the User IP to access
AXI4 slaves via the AXI4 Interconnect.


Typical System Configuration Using AXI Master Burst

Timing diagrams
Single Data Beat Read Operation
A single beat read cycle is shown in Figure 4. The diagram shows the AXI Slave accepting
the read address and qualifiers in one clock cycle and presenting the read data in the next
clock cycle.

Example Single Beat Read Transaction Timing


Single Data Beat Write Operation


A typical single beat write cycle is shown in Figure 5.

Example Single Beat Write Transaction Timing


Single Data Beat Read Operation with AXI Read Data Channel Reported Error
A single data beat Read transaction with a Slave-reported error is shown in Figure 6. The AXI
Read Data Channel response error is captured, reported on the IPIC Status Channel, and the
Master's md_error output is asserted and held. The assertion of md_error is cleared by the
ip2bus_mst_reset input from the IPIC Command interface. Assertion of m_axi_aresetn would
also clear md_error.

Example Single Beat Read Transaction Timing With Error


Single Data Beat Write Operation with AXI Response Channel Reported Error
A single beat Write transaction with a Slave-reported error is shown in Figure 7. The AXI
Write Response Channel response error is captured, reported on the IPIC Status Channel, and
the Master's md_error output is asserted and held. The assertion of md_error is cleared by the
ip2bus_mst_reset input from the IPIC Command interface. Assertion of m_axi_aresetn would
also clear md_error.

Example Single Beat Write Transaction Timing Error


Burst Read Transaction


A burst Read transaction of 80 bytes is shown in the figure below. This example is for an AXI
Master Burst configured for a 32-bit native data width and a maximum allowed
burst length of 16 data beats per AXI4 transaction. The command length of 80 bytes requires
the Master to break the transaction up into two AXI4 transactions, one of 16 data beats and
one of four data beats.

Example Burst Read Transaction Timing
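The split described above follows from simple arithmetic: 80 bytes on a 32-bit bus is 20 data beats, and with at most 16 beats per AXI4 transaction this becomes one 16-beat burst plus one 4-beat burst. A small C sketch of that calculation (illustrative, not core logic):

#include <stdio.h>

/* Split a command length into AXI4 bursts given the bus width in bytes and a
 * maximum burst length in beats. Assumes the length is a whole number of
 * beats, as in the 80-byte example above. */
int main(void)
{
    unsigned bytes = 80, bus_bytes = 4, max_beats = 16;
    unsigned beats_left = bytes / bus_bytes;   /* 20 beats in total */

    while (beats_left > 0) {
        unsigned this_burst = (beats_left > max_beats) ? max_beats : beats_left;
        printf("AXI4 transaction of %u data beats\n", this_burst);  /* 16, then 4 */
        beats_left -= this_burst;
    }
    return 0;
}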


Burst Read Discontinue


The AXI Master Burst issues a discontinue on the Read LocalLink if an internal error is
encountered during the read transaction. This is normally caused by the User logic setting the
ip2bus_mst_length qualifier to a value of zero on the IPIC Command interface during a read
command assertion. LocalLink requires all transactions to complete with an EOF assertion,
even during a discontinue. Figure 9 shows an example of the AXI Master Burst issuing a
discontinue on a Read burst transaction as the result of an internal error. The Read LocalLink
is terminated early with the EOF assertion by the source.

If the LocalLink is not terminated correctly by the destination, the AXI Master Burst does not
assert the bus2ip_mst_cmplt status signal.

Example Burst Read Discontinue Timing


Burst Write Transaction

A burst Write transaction of 80 bytes is shown in Figure 10. This example is for an AXI
Master Burst configured for a 32-bit native data width and a maximum allowed burst length
of 16 data beats per AXI4 transaction. The command length of 80 bytes requires the Master
to break the transaction up into two AXI4 transactions, one of 16 data beats and one of four
data beats.

Example Burst Write Transaction Timing


Burst Write Discontinue

The AXI Master Burst issues a discontinue on the Write LocalLink if an internal error is
encountered during the write transaction. This is normally caused by the User logic setting
the ip2bus_mst_length qualifier to a value of zero on the IPIC Command interface
during a write command assertion. LocalLink requires all transactions to complete with an
EOF assertion, even during a discontinue. Figure 11 shows an example of the AXI Master
Burst issuing a discontinue on a Write burst transaction as the result of an internal error. The
Write LocalLink is terminated early with the EOF assertion by the source after the
bus2ip_mstwr_dst_dsc_n assertion is detected.

If the LocalLink is not terminated correctly by the source, the AXI Master Burst does not
assert the bus2ip_mst_cmplt status signal.

Example Burst Write Discontinue Timing

AXI Slave Burst

The AXI Slave Burst core is designed to provide a quick way to implement a light-weight
interface between the AXI4 interface and a user slave IP core capable of supporting bursts.
This slave interface allows multiple user IPs to be interfaced to the AXI4 interface,
providing address decoding over various user-configurable address ranges.

This core allows easy migration of a user slave IP from the earlier Processor Local Bus (PLB)
v4.6 and v3.4 and the On-chip Peripheral Bus (OPB), which used the respective IP interface
cores. Adapting the user IP natively to the AXI4 protocol is preferable when unsupported
features are needed, or when the lowest latency and highest throughput are required. Figure 1
shows a block diagram of the AXI Slave Burst core; the port references and groupings are
shown in Table 2. The internal modules provide the basic functionality for connected slave IP
operation based on the AXI4 transaction, implementing the protocol and timing translation
between the AXI4 interface and the IP interconnect (IPIC) interface.

AXI Slave Burst IP Core Block Diagram


Timing diagrams
Figure 2 shows the typical response of the AXI Slave Burst core for INCR (incrementing)
read-write transactions from the AXI4 interface.

Core Response for AXI4 INCR Mode Transactions

Figure 3 shows the core response for FIXED read-write transactions from the AXI4 interface.

Core Response for AXI4 FIXED Read-Write Transactions


Figure 4 shows the core response for WRAP read-write transactions from the AXI4 interface.

Core Response for AXI4 WRAP Read-Write Transactions
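The three burst types differ only in how the per-beat address advances: FIXED repeats the same address, INCR increments it by the transfer size, and WRAP increments it but wraps at an aligned boundary equal to the total burst size. A small sketch contrasting INCR and WRAP addressing (the start address and sizes are made-up values):

#include <stdio.h>
#include <stdint.h>

/* Beat-address generation for AXI INCR vs. WRAP bursts (illustrative).
 * WRAP bursts must be 2, 4, 8, or 16 beats and start aligned to the beat size;
 * the address wraps at a boundary equal to the total burst size. */
int main(void)
{
    const uint32_t start = 0x00000018u;   /* assumed start address            */
    const unsigned beat_bytes = 4, beats = 4;
    const uint32_t total = beat_bytes * beats;      /* 16-byte wrap window     */
    const uint32_t lower = start & ~(total - 1);    /* 0x10: wrap boundary     */

    for (unsigned i = 0; i < beats; i++) {
        uint32_t incr_addr = start + i * beat_bytes;
        uint32_t wrap_addr = lower + ((start - lower + i * beat_bytes) % total);
        printf("beat %u: INCR 0x%08X  WRAP 0x%08X\n", i, incr_addr, wrap_addr);
    }
    return 0;   /* WRAP sequence: 0x18, 0x1C, 0x10, 0x14 */
}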


AXI4-Lite

The AXI Interface: AXI4-Lite

AXI4-Lite 5 Channels

The AXI4-Lite IPIF is designed to provide you with a quick way to implement a light-weight
interface between the ARM AXI and a user IP core. This slave attachment allows multiple
user IPs to be interfaced to the AXI, providing address decoding over various address ranges.
Figure 1 shows a block diagram of the AXI4-Lite IPIF. The port references and groupings are
detailed in Table 1.
The base element of the design is the Slave Attachment. This block provides the basic
functionality for slave operation. It implements the protocol and timing translation between
the AXI and the IPIC.

The Address Decoder module generates the necessary chip select and read/write chip enable
signals based upon the user requirement. The timeout counter is added to the design if the
C_DPHASE_TIMEOUT parameter is non-zero. If C_DPHASE_TIMEOUT = 0, you must
make sure that the user IP core generates the acknowledge signals for all transactions.
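From the processor side, a peripheral built on the AXI4-Lite IPIF simply appears as a set of memory-mapped registers, and each access becomes one single-beat AXI4-Lite transaction. A hedged C sketch (the base address and register offsets are made-up for illustration, not values from this design):

#include <stdint.h>

/* Hypothetical AXI4-Lite peripheral: base address and register offsets are
 * assumptions for illustration only. Each access below is one single-beat
 * AXI4-Lite read or write on the bus. */
#define PERIPH_BASE   0x40000000u   /* assumed base address from address decoding */
#define REG_CONTROL   0x00u
#define REG_STATUS    0x04u

static inline void reg_write(uint32_t offset, uint32_t value)
{
    *(volatile uint32_t *)(PERIPH_BASE + offset) = value;
}

static inline uint32_t reg_read(uint32_t offset)
{
    return *(volatile uint32_t *)(PERIPH_BASE + offset);
}

int main(void)
{
    reg_write(REG_CONTROL, 0x1);   /* e.g. a hypothetical start bit */
    (void)reg_read(REG_STATUS);    /* poll a status register        */
    return 0;
}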

Block Diagram of the AXI4-Lite IPIF

Timing diagram

Single Read Operation

AXI4-Lite IPIF Single Read Operation


Single Write Operation

AXI4-Lite IPIF Single Write Operation


Single Read Error Operation

Single Read Error Operation

AXI4-Stream

AXI4-Stream Interface
The AXI4-Stream protocol is used for applications that typically focus on a data-centric and
data-flow paradigm where the concept of an address is not present or not required. Each
AXI4-Stream acts as a single unidirectional channel for a handshake data flow. At this lower
level of operation (compared to the memory mapped AXI protocol types), the mechanism to
move data between IP is defined and efficient, but there is no unifying address context
between IP. The AXI4-Stream IP can be better optimized for performance in data flow
applications, but also tends to be more specialized around a given application space.
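The essential AXI4-Stream mechanism is the TVALID/TREADY handshake: a word on TDATA moves from master to slave only in a cycle where both signals are high, and TLAST can mark the end of a packet. A small behavioral C model of that rule (the signal names follow the AMBA specification; everything else is illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One clock cycle of an AXI4-Stream link, modeled behaviorally: the beat on
 * TDATA transfers only when TVALID and TREADY are both high in that cycle. */
typedef struct {
    bool     tvalid, tready, tlast;
    uint32_t tdata;
} stream_cycle_t;

static bool beat_transfers(const stream_cycle_t *c)
{
    return c->tvalid && c->tready;
}

int main(void)
{
    stream_cycle_t cycles[] = {
        { true,  false, false, 0xAAAA0001u },  /* master waits: slave not ready */
        { true,  true,  false, 0xAAAA0001u },  /* beat transfers                */
        { true,  true,  true,  0xAAAA0002u },  /* last beat of the packet       */
    };

    for (unsigned i = 0; i < 3; i++)
        if (beat_transfers(&cycles[i]))
            printf("cycle %u: 0x%08X%s\n", i, cycles[i].tdata,
                   cycles[i].tlast ? " (TLAST)" : "");
    return 0;
}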


AXI4-Stream I/O Signals

Waveform

Stream Example


Appendices
