
IEEE International Symposium on Field-Programmable Custom Computing Machines

An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs

Jainik Kathiara, Analog Devices, Inc., 3 Technology Way, Norwood, MA, USA

Miriam Leeser, Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA

Abstract: We present a Floating Point Vector Coprocessor (FPVC) that works with the Xilinx embedded processors. The FPVC is completely autonomous from the embedded processor, exploiting parallelism and exhibiting greater speedup than alternative vector processors. The FPVC supports scalar computation so that loops can be executed independently of the main embedded processor. Floating point addition, multiplication, division and square root are implemented with the Northeastern University VFLOAT library. The FPVC is parameterized so that the number of vector lanes and the maximum vector length can be easily modified. We have implemented the FPVC on a Xilinx Virtex 5 connected via the Processor Local Bus (PLB) to the embedded PowerPC. Our results show more than five times improved performance over the PowerPC augmented with the Xilinx Floating Point Unit on applications from linear algebra: QR and Cholesky decomposition.

Keywords: floating point; vector processing; FPGA
Figure 1. The Floating Point Vector Co-Processor

I. INTRODUCTION

There is increased interest in using embedded processing on FPGAs, including for applications that make use of floating point operations. The current design practice for both Xilinx and Altera FPGAs is to generate an auxiliary floating point processing unit. These FPUs rely on the embedded processor for fetching instructions, which inherently limits the parallelism in the implementation and hurts performance. We have implemented a floating point co-processor, the floating point vector/scalar co-processor (FPVC), that runs independently of the main embedded processor. The FPVC has its own local instruction memory (IRAM) and data memory (DRAM) under DMA control. The main processor initiates the DMA of instructions to IRAM and then starts the FPVC. The FPVC achieves performance by fetching and decoding instructions in parallel with other operations on the FPGA. In addition, scalar instructions are supported, which allows all loop control to be handled locally without requiring intervention of the main processor. Vector processing has several advantages for an FPGA implementation. A much smaller program is required to implement an application, reducing the static instruction count. Fewer instructions need to be decoded dynamically, which simplifies instruction execution. Hazards only need to be checked at the start of an instruction. The FPVC design takes advantage of the reconfigurable nature of FPGAs by
978-0-7695-4301-7/11 $26.00 © 2011 IEEE. DOI 10.1109/FCCM.2011.14

including design-time parameters: the maximum vector length (MVL) supported in hardware, the number of vector lanes implemented, and the sizes of the local instruction and data memories. Details of the FPVC and its implementation can be found in [1].

II. THE FLOATING POINT VECTOR CO-PROCESSOR

The FPVC (Fig. 1) has the following features:
- Complete autonomy from the main processor
- Support for single precision floating point and 32-bit integer arithmetic operations
- Four stage RISC pipeline for integer arithmetic and memory access
- Variable length RISC pipeline for floating point arithmetic
- Unified vector/scalar general purpose register file
- A modified Harvard style memory architecture with separate level 1 instruction and data RAM and unified level 2 memory

The FPVC uses the Processor Local Bus (PLB) as the system bus interface. It has one slave (SLV) port for communicating with the main processor and one master (MST) port for main memory accesses. The memory architecture is divided into two levels: main memory and local memory. Both types of memory are implemented in on-

Figure 2. The Vector Scalar Register File

chip BlockRAM. Main memory is connected to the FPVC through the master port of the system bus, while local memory sits between the bus and the processing core. All memory transfers are under program control; no caching is implemented. Instruction memory is loaded under DMA control before program execution begins. Data memory is loaded from main memory using DMA under FPVC program control. The local memories are part of the system address space and can be accessed by any master on the system bus.

A. Vector Scalar Instruction Set Architecture

We have designed a new Instruction Set Architecture (ISA) inspired by the VIRAM vector ISA [2] as well as the MicroBlaze and Power ISAs. The Vector-Scalar ISA is a 32-bit instruction set. The instruction encoding allows for 32 vector-scalar registers with variable vector length. As shown in Fig. 2, the top short vector of each vector register can be used as a scalar register. This allows us to freely mix vector and scalar registers without requiring communication with the host processor. The vector register file supports a configurable lane width. The vector-scalar ISA supports a maximum vector length (MVL) of C_NUM_OF_LANE * C_NUM_OF_VECTOR; these are design time parameters. Instructions are classified into three major classes: arithmetic instructions, memory instructions and inter-lane communication instructions (expand and compress).

Arithmetic Instructions: Fig. 3 shows vector and scalar arithmetic operation. We implement several floating point instructions: add, multiply, divide and square root; as well as basic integer arithmetic, compare and shift instructions, which operate on the full data word (32 bits). The integer multiply instruction operates on the lower 16 bits of the operands and produces a 32-bit result. All scalar operations are performed on the first element of the first short vector of each register; the result is replicated to all lanes and stored in the first short vector of the destination register.
Vector instructions which require scalar data can reference the top of each register.

Memory Instructions: The memory instructions support strided memory access (both unit and non-unit stride), permuted access,

Figure 3. Vector and Scalar Arithmetic Operation

Figure 4. Memory Access Patterns

indexed access and rake access. Fig. 4 shows the rake access pattern, which can be described by a single vector instruction. It requires a vector base address, a vector index address and an immediate offset value (the distance between two neighbor elements). All neighbor elements within each rake are stored in a single short vector, while each rake of elements is stored in a different short vector. The same instruction can be used for unit stride and non-unit stride access by setting the immediate offset value to the distance between two vector elements in memory. Permutation and look-up table access are realized by setting the immediate offset to zero and providing an index register. Scalar accesses are supported with the same memory instructions.

Interlane Communication Instructions: We implement vector compression and expansion instructions. Compress instructions select the subset of an input vector marked by a flag vector and pack these together into contiguous elements at the start of a destination vector. Expand instructions perform the reverse operation to unpack a vector, placing source elements into a destination vector at locations marked by bits in a flag register. Previous vector processors [2], [3] implement a crossbar switch to perform inter-lane communication. The FPVC compress and expand instructions implement the same functionality with lower hardware cost but slower access to the vector elements.
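The semantics of the compress, expand and rake-access instructions above can be sketched in plain C. This is a minimal behavioral model, not the FPVC implementation: the function names, the byte-per-element flag encoding and the four-element short vector are our assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* compress: pack the elements of src whose flag is set into the front of
   dst; returns the number of packed elements. */
static size_t compress(const float *src, const uint8_t *flag,
                       float *dst, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (flag[i]) dst[k++] = src[i];
    return k;
}

/* expand: the reverse operation; scatter consecutive src elements into the
   dst positions whose flag is set. */
static void expand(const float *src, const uint8_t *flag,
                   float *dst, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (flag[i]) dst[i] = src[k++];
}

/* rake_load: element j of rake i lives at base index[i] plus j*offset.
   Neighbor elements of one rake fill one short vector (here 4 wide);
   each rake goes to a different short vector. */
static void rake_load(const float *mem, const size_t *index, size_t offset,
                      float vreg[][4], size_t rakes, size_t rake_len) {
    for (size_t i = 0; i < rakes; i++)
        for (size_t j = 0; j < rake_len; j++)
            vreg[i][j] = mem[index[i] + j * offset];
}
```

Setting `offset` to the element stride reproduces unit and non-unit strided access with the same routine, matching the paper's description of one instruction covering both cases.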


Figure 5. Vector Scalar Pipeline

Figure 6. Experimental Setup

B. The Vector Scalar Pipeline

The FPVC pipeline (Fig. 5) is based on the classic in-order issue, out-of-order completion RISC pipeline. The four stages are Instruction Fetch, Decode, Execute and Write Back. The pipeline is intentionally kept short so that integer vector instructions complete in a small number of cycles and floating point instructions spend most of their time in the floating point unit, optimizing the execution latency. As both scalar and vector instructions are executed from the same instruction pipeline, both types of instructions are freely mixed in the program and stored in the same local instruction memory. Floating point operations are implemented using the VFloat library [4]. Each functional unit implements IEEE 754 single precision floating point. The arithmetic operations have different latencies; each unit is fully pipelined so that a new operation can be started every clock cycle. Normalization and rounding are implemented along with the multiplexers shown in the datapath. Due to the different latencies of different operations, instructions are issued in order but can complete out of order. Hence, a structural hazard may occur if more than one instruction completes in the same clock cycle. To eliminate this hazard, we have implemented an arbiter between the end of the execution stage and the write back stage of the pipeline, which can commit one result each cycle. When multiple results are available at the same time, one is written to the register file and the rest are stalled.

Design Time Parameters: We take advantage of the flexibility of the reconfigurable fabric to provide design time reconfigurable parameters as well as runtime parameters. At design time, the implementer can choose the number of vector lanes (C_NUM_OF_LANE), the number of short vectors supported (C_VECTOR_LENGTH), and the number of bytes of local BRAM memory (C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE), as well as the bitwidth of the floating point components.
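The interaction of in-order issue, out-of-order completion and the write-back arbiter can be illustrated with a toy cycle model in C. This is our sketch, not the RTL: the paper states only that one result commits per cycle, so the oldest-result-first arbitration policy below is an assumption.

```c
/* Instruction i issues at cycle i into a fully pipelined unit with the given
   latency, so its result is first ready at cycle i + latency[i]. The arbiter
   commits at most one result per cycle; when several results are ready in
   the same cycle, the others are stalled to later cycles. */
void commit_schedule(const int *latency, int n, int *commit) {
    int ready[32], done[32];
    for (int i = 0; i < n; i++) {
        ready[i] = i + latency[i];  /* in-order issue, one per cycle */
        done[i] = 0;
    }
    for (int t = 0, remaining = n; remaining > 0; t++) {
        int pick = -1;
        /* assumed policy: the result that became ready earliest wins */
        for (int i = 0; i < n; i++)
            if (!done[i] && ready[i] <= t && (pick < 0 || ready[i] < ready[pick]))
                pick = i;
        if (pick >= 0) {
            commit[pick] = t;       /* one write-back per cycle */
            done[pick] = 1;
            remaining--;
        }
    }
}
```

With latencies {4, 1, 1}, instruction 0 commits after instructions 1 and 2, showing out-of-order completion; with latencies {2, 1}, both results become ready in cycle 2 and the arbiter stalls one of them by a cycle.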
For the experiments described here, we always implement single precision. The maximum vector length supported is the number of lanes times the short vector length.

III. EXPERIMENTS AND RESULTS

The FPVC is implemented in VHDL and synthesized using Xilinx ISE 10.1 CAD tools targeting Virtex-5 FPGAs. We compare the FPVC against the PowerPC 440 with the Xilinx FPU using linear algebra kernels as examples. All code runs on the Xilinx ML510 board. We connect the FPVC via a PLB to the PowerPC and to the main on-chip memory (BRAM, Fig. 6). We also connect the PowerPC to the Xilinx FPU via the Fabric Co-processor Bus (FCB). For the experiments, either the PowerPC plus FPU or the PowerPC plus FPVC is used. The performance metric is the number of clock cycles between the start and the end of a kernel. Clock cycles are counted using the PowerPC's internal timer, and results are compared to the runtime of the PowerPC plus FPU. Local IRAM and DRAM can be configured for various sizes using the parameters C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE and C_MPLB_DWIDTH at design compile time. For all of the results presented, we have set the instruction and data memory sizes to 64KB each and the PLB width to 32 bits. We vary the number of vector lanes and the length of the short vectors. For running on the PowerPC, the linear algebra kernels were written in C and compiled using gcc with -O2 optimization. The FPVC kernels are written in machine code. Program and data are stored in the 64KB main memory shown in Fig. 6. The FPVC system bus interface is used to load instructions into the local IRAM of the FPVC. We test the FPVC's performance on floating point numerical linear systems solvers. Here we present results for QR and Cholesky decomposition. While our runtimes are not as fast as a custom datapath, our designs consume less area and are more flexible than previously published vector processors while running faster than the Xilinx PowerPC embedded processor with auxiliary FPU.
The FPGA resources used for a range of parameter values are shown in


Figure 7. Resources Used

Figure 9. Results for Cholesky Decomposition

Figure 8. Results for QR Decomposition

Fig. 7. The factor with the largest influence on the number of resources used is the number of vector lanes.

We implemented QR and Cholesky decomposition on the FPVC and compared the results to a PowerPC connected via the APU interface to the Xilinx FPU; this baseline is normalized to a performance of one. Fig. 8 shows the results for QR on the FPVC. For the FPVC implementations, the short vector length was kept constant at 32 and the number of vector lanes was varied. The FPVC outperforms the Xilinx FPU even with only one lane implemented. This is because there are significantly fewer vector instructions to decode, so most of the time is spent executing floating point operations. Since QR has plenty of parallelism, increasing the number of lanes improves the speedup. The best speedup is achieved with 8 lanes on a 24 by 24 matrix: over five times the performance of the Xilinx solution.

Results comparing Cholesky on the FPVC to the PowerPC plus FPU, varying the number of lanes, are shown in Fig. 9. For the implementations shown, Cholesky runs more than three times faster on the FPVC. Cholesky does not exhibit as much parallelism as QR, so increasing the number of lanes does not give as much improvement; increasing the short vector size to 32 did, however, result in significant improvement. The maximum performance improvement achieved is over 5x for 8 lanes. However, the four lane solution, which uses significantly fewer resources, achieves nearly as good performance and is the best choice for Cholesky.

IV. CONCLUSIONS AND FUTURE WORK

The completely autonomous floating point vector/scalar co-processor exhibits speedup on important linear algebra

kernels when compared to the implementation used by most practitioners: the embedded processor FPU provided by Xilinx. The FPVC is easier to implement than a custom datapath, at the cost of a decrease in performance; hence the FPVC occupies a middle ground in the range of designs that make use of floating point. The FPVC is completely autonomous, so the PowerPC can be doing independent work while the FPVC is computing floating point solutions. We have not yet exploited this concurrency. The FPVC is configurable at design time: the number of lanes, the size of vectors and the local memory sizes can be configured to fit the application. Results above show that for QR decomposition the designer may choose 8 lanes, while for Cholesky, 4 lanes with 32 element short vectors is more efficient. The bitwidth of the FPVC datapath can also easily be modified. We plan to implement double precision in the near future. We also plan to add instruction caching so that larger programs can be run on the FPVC, and to provide tools such as assemblers and compilers to make the FPVC easier to use.

REFERENCES
[1] J. Kathiara, "The Unified Floating Point Vector Coprocessor for Reconfigurable Hardware," Master's thesis, Northeastern University Dept. of ECE, Boston, MA, 2011. [Online]. Available: rcl/publications.php#theses

[2] C. Kozyrakis and D. Patterson, "Overcoming the limitations of conventional vector processors," in Intl. Symposium on Computer Architecture, June 2003, pp. 399-409.

[3] J. Yu, C. Eagleston et al., "Vector processing as a soft processor accelerator," ACM Trans. Reconfigurable Technol. Syst., vol. 2, pp. 12:1-12:34, June 2009.

[4] X. Wang and M. Leeser, "VFloat: a variable precision fixed- and floating-point library for reconfigurable hardware," ACM Trans. Reconfigurable Technol. Syst., vol. 3, pp. 16:1-16:34, September 2010.