Abstract—Network function virtualization (NFV) aims to run software network functions (NFs) on commodity servers. Because the CPU is general-purpose hardware, many CPU cores are needed to handle complex packet processing at line rate. Owing to its performance and programmability, FPGA has emerged as a promising platform for NFV. However, the programmable logic blocks on an FPGA board are limited and expensive, so implementing entire NFs on FPGA is resource-demanding. Further, the FPGA must be reprogrammed whenever the NF logic changes, which can take hours of synthesis. Using FPGA to implement an entire NFV service chain is therefore inflexible.
We present dynamic hardware library (DHL), a novel CPU-FPGA co-design framework for NFV that offers both high performance and flexibility. DHL employs FPGA as an accelerator only for complex packet processing. It abstracts accelerator modules in FPGA as a hardware function library and provides a set of transparent APIs for developers. DHL supports running multiple concurrent software NFs with distinct accelerator functions on the same FPGA and provides data isolation among them. We implement a prototype of DHL with Intel DPDK. Experimental results demonstrate that DHL greatly reduces the programming effort to access FPGA, brings significantly higher throughput and lower latency than CPU-only implementations, and minimizes CPU resource usage.
I. INTRODUCTION

Network function virtualization (NFV) aims to shift packet processing from dedicated hardware middleboxes to commodity servers [1]–[4]. Implementing network functions (NFs) in software entails many benefits, including lower equipment cost, better flexibility and scalability, and reduced deployment time [5], [6]. However, compared to dedicated network appliances, software NFs on commodity servers suffer from severe performance degradation due to the overheads of the kernel TCP/IP stack [6]–[10].

To overcome this limitation, several high-performance packet I/O engines have been proposed. Intel DPDK [11] and netmap [7] bypass the kernel network stack and obtain packets directly from the NIC to improve the I/O of software NFs. For example, DPDK needs only two CPU cores to achieve 40 Gbps throughput for L3 forwarding [12]. Although these packet I/O engines provide high I/O rates, it is still resource-demanding to deploy NFs with computation-intensive processing logic, such as an IPsec gateway or a Network Intrusion Detection System (NIDS). For example, it takes as many as 20 CPU cores to reach 40 Gbps when performing packet encryption with IPsec [13].

To further improve the performance of software NFs, there is an emerging trend to implement NFs on FPGA in both academia [14]–[17] and industry [18]. As a form of programmable silicon, FPGA combines the programmability of software with the performance of hardware. Yet, to the best of our knowledge, most current FPGA solutions implement the entire NF, including the complete packet processing logic, on the FPGA board. This approach suffers from several drawbacks.

First, the FPGA-only approach is resource-wasting. NFs generally contain complex protocol processing and control logic, which consumes many resources when realized in a hardware circuit. Moreover, since there are many types of NFs, this incurs a very high deployment cost. Second, developing and programming FPGA is time-consuming. It usually takes hours to synthesize and implement HDL (Hardware Description Language) source code into the final bitstream [14]. Since debugging and modifying an NF design is common in practice, the FPGA-only approach presents a significant barrier to fast development.

We present dynamic hardware library (DHL)¹, a CPU-FPGA co-design packet processing framework that enables flexible FPGA acceleration for software NFs. In DHL, FPGA serves as a hardware accelerator, not as a complete network appliance. Specifically, we extract the computation-intensive parts, such as encryption and decryption in IPsec or pattern matching in intrusion detection, and offload them to FPGA as accelerator modules. We provide a DHL Runtime that hides the low-level specifics of FPGA, and we abstract the accelerator modules as hardware functions that NF developers can invoke just like calling a software function. DHL decouples the software development of NFs from the hardware development of FPGA accelerator modules, so that software developers and hardware developers can each concentrate on what they are good at. In addition, it provides data isolation so that multiple software NFs can use accelerator modules simultaneously without interfering with each other.

∗ The corresponding author is Fangming Liu (fmliu@hust.edu.cn). This work was supported in part by the National Key Research & Development (R&D) Plan under grant 2017YFB1001703, in part by NSFC under Grants 61722206, 61761136014, 392046569 (NSFC-DFG), and 61520106005, in part by the National 973 Basic Research Program under Grant 2014CB347800, in part by the Fundamental Research Funds for the Central Universities under Grant 2017KFKJXX009, and in part by the National Program for Support of Top-notch Young Professionals in National Program for Special Support of Eminent Professionals.
¹ DHL is open-sourced on GitHub: https://github.com/OpenCloudNeXt/DHL.
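To make the hardware-function abstraction concrete, the sketch below shows roughly how a DPDK-based NF might call into such a library. All dhl_* names are hypothetical placeholders for illustration; they are not the actual DHL API, which is defined later in the paper.

```c
/* A minimal usage sketch of a DHL-style hardware function call from a
 * DPDK-based NF. All dhl_* names are hypothetical placeholders. */
#include <stdint.h>
#include <rte_mbuf.h>

/* Hypothetical DHL-style prototypes (illustration only). */
int dhl_register_nf(const char *nf_name);
int dhl_acquire_hw_function(int nf_id, const char *hw_fn_name);
int dhl_send_packets(int hw_fn, struct rte_mbuf **pkts, uint16_t nb_pkts);
int dhl_recv_packets(int hw_fn, struct rte_mbuf **pkts, uint16_t nb_pkts);

int ipsec_encrypt_burst(struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    /* Resolve an accelerator module by name, much like looking up a
     * function in a shared software library. */
    int nf_id = dhl_register_nf("ipsec_gateway");
    int hw_fn = dhl_acquire_hw_function(nf_id, "ipsec_crypto");
    if (hw_fn < 0)
        return -1;                     /* fall back to a CPU code path */

    /* Offload the compute-intensive step to the FPGA, then collect results. */
    dhl_send_packets(hw_fn, pkts, nb_pkts);
    return dhl_recv_packets(hw_fn, pkts, nb_pkts);
}
```

The point of the abstraction is that the NF treats the accelerator like an ordinary library call and can always fall back to a CPU path when no accelerator is available.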
This paper describes the DHL architecture and makes the following contributions:
1) We propose DHL, a CPU-FPGA co-design framework which not only enables sharing FPGA accelerators among multiple NFs to maximize utilization, but also eases the development effort by decoupling software development from hardware development.
2) We abstract accelerator modules in FPGA as hardware functions and provide a set of programming APIs, which enable software applications to easily leverage FPGA accelerators with minimal code changes.
3) We design a high-throughput DMA engine between CPU and FPGA with ultra-low latency for network applications. We achieve this by adopting several optimization techniques: batching, user-space I/O, and polling.
4) We build a prototype of DHL and implement it on a server equipped with a Xilinx FPGA board. Our evaluation on the prototype demonstrates that DHL can achieve throughput as high as 40 Gbps, similar to previous FPGA-only solutions, while incurring a reasonable latency of less than 10 µs.
The rest of the paper is organized as follows. Section II introduces the motivation and challenges of our work. Section III gives an overview of DHL. The implementation details are presented in Section IV. We then evaluate DHL in Section V and discuss improvements and applicable scenarios in Section VI. Next, we survey related work in Section VII. Finally, the paper is concluded in Section VIII.
II. MOTIVATION AND CHALLENGES

A. FPGA is Not Perfect

There is much work on using FPGA for packet processing, whether for a specific NF (e.g., a router [19], redundancy elimination [17], or a load balancer [20]) or as a framework for developing high-performance NFs [14], [21], [22]. All of it demonstrates that a hardware-based design outperforms a software design on a general-purpose CPU.

However, there are many types of NFs, and FPGA has some vital limitations when it comes to implementing them. Firstly, the programmable logic units (look-up tables, registers, and block RAMs) of an FPGA are limited. It is wasteful to put everything into the FPGA, since every module with a specified functionality consumes dedicated logic units. We could use a more advanced FPGA with more resources, but high-end FPGAs are very expensive: as of July 2017, a Xilinx UltraScale chip costs up to $45,711², which is 20× the price of a CPU ($2,280³ for an Intel Xeon E5-2696 v4 with 22 cores). Secondly, it is not easy to change existing logic. Although FPGA is programmable and much prior work [14], [21], [22] aims at improving its usability, it still suffers from long compilation times; when new requirements or bugs arise, it usually takes hours to re-synthesize and re-implement the code.

² The price is from AVNET, https://www.avnet.com
³ The price is from Amazon, https://www.amazon.com

B. CPU is Not That Bad

Emerging high-performance packet I/O engines such as Intel DPDK [11] and netmap [7] bring new life to software NFs. As reported by Intel, just two CPU cores can achieve the 40 Gbps line rate when performing L3 forwarding [12]. Further, we survey a wide variety of common network functions and find two types of processing in the data plane:
• Shallow Packet Processing: operations on the packet header only, i.e., the L2–L4 headers, as in NAT, L2/L3 forwarding/routing, firewalls, etc.
• Deep Packet Processing: operations on the whole packet data, as in IDS/IPS, IPsec gateways, flow compression, etc.

For shallow packet processing, since packet headers follow unified protocol specifications, little computation is usually needed; it mostly amounts to checking and modifying specific header fields. For the most common operation, table lookup, we find that searching an LPM (longest prefix match) table takes 60 CPU cycles on average (26 ns at 2.4 GHz), as shown in Table I. The CPU is thus fairly fast at shallow packet processing.
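To illustrate how lightweight such a lookup is, the sketch below uses DPDK's librte_lpm (the mechanism behind the L3fwd-lpm entry in Table I); the table dimensions and the sample route are arbitrary illustration values, not our experiment configuration.

```c
/* Minimal LPM lookup sketch with DPDK's librte_lpm. Table sizes and the
 * sample route are arbitrary illustration values. */
#include <stdint.h>
#include <rte_lpm.h>

static struct rte_lpm *lpm;

static void lpm_setup(int socket_id)
{
    struct rte_lpm_config cfg = {
        .max_rules    = 1024,   /* example capacity */
        .number_tbl8s = 256,
        .flags        = 0,
    };
    lpm = rte_lpm_create("l3fwd_lpm", socket_id, &cfg);

    /* Example route: 10.0.0.0/8 -> next hop (output port) 1.
     * rte_lpm works on addresses in host byte order. */
    rte_lpm_add(lpm, 10u << 24, 8, 1);
}

/* The per-packet work of shallow processing: a single table probe. */
static inline uint16_t lookup_port(uint32_t dst_ip_host_order)
{
    uint32_t next_hop;

    if (rte_lpm_lookup(lpm, dst_ip_host_order, &next_hop) == 0)
        return (uint16_t)next_hop;
    return 0;                    /* default port on miss */
}
```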
However, deep packet processing usually requires much more compute resource, since the data in higher layers has no regular pattern. Regular expression matching in deep packet inspection (DPI) [23] and the cryptographic algorithms [24] in IPsec gateways take many more CPU cycles to complete (again see Table I).

TABLE I
PERFORMANCE OF DPDK WITH ONE CPU CORE

Network Function   Latency (CPU cycles)   Throughput
L2fwd              36                     9.95 Gbps
L3fwd-lpm          60                     9.72 Gbps
IPsec-gateway      796                    1.47 Gbps

¹ L3fwd-lpm uses a longest prefix match (LPM) table to forward packets based on the 5-tuple; IPsec-gateway uses AES-CTR as the cipher and SHA1-HMAC for authentication.
² Tested with 64B packets, Intel 10G X520-DA2 NICs, and DPDK 17.05 on CentOS 7.2 (kernel 3.10.0), with an Intel Xeon E5-2650 v3 @ 2.30 GHz CPU.

C. CPU-FPGA Co-Design & Challenges

Considering the pros and cons of FPGA and CPU, an intuitive solution is to combine them: keep the control logic and shallow packet processing on the CPU, and offload the deep packet processing to FPGA. Compared with an FPGA-only NF solution, CPU-FPGA co-design brings four key benefits.
• First, it fully utilizes FPGA resources. Only the computation-intensive processing is offloaded to FPGA as accelerator modules, which generally have a modest requirement of LUTs and BRAM. Thus an FPGA board can host more modules.
• Second, software NFs can flexibly use these FPGA accelerator modules on demand, without having to re-implement them from scratch. This also saves a lot of FPGA development time and effort.

[Fig. 2 diagram labels: DHL Framework; software NFs (NF2 ... NFn) whose deep packet processing is offloaded as hardware functions; the userspace DHL Runtime with Manager, Dispatcher, Distributor, RX, and DMA; FPGA-side accelerator modules (Acc2 ... Accn) with a Controller and Config PR; an Accelerator Module Database; and a Hardware Function Table whose rows map hf.name to (s.id, a.id, f.id), e.g., Acc.1 → 0/0/0, Acc.n → 0/1/1, Acc.n+1 → 1/2/2. Legend: Output Data Flow, Control Flow, FPGA Data Flow, FPGA Control Flow.]
Fig. 2. The detailed architecture of DHL. In the hardware function table, hf.name: hardware function name; s.id: socket id; a.id: accelerator id; f.id: FPGA id.
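As a rough illustration of what the hardware function table in Fig. 2 records, the sketch below mirrors its columns (hf.name, socket id, accelerator id, FPGA id). The struct and the lookup helper use our own illustrative naming; they are not actual DHL code.

```c
/* Illustrative sketch of a hardware-function-table entry, mirroring the
 * columns shown in Fig. 2 (hf.name, s.id, a.id, f.id). Hypothetical naming. */
#include <stdint.h>
#include <string.h>

#define DHL_HF_NAME_LEN 32
#define DHL_MAX_HF      64

struct dhl_hf_entry {
    char    hf_name[DHL_HF_NAME_LEN]; /* hardware function name, e.g. "ipsec_crypto" */
    uint8_t socket_id;                /* NUMA socket the FPGA is attached to */
    uint8_t acc_id;                   /* accelerator module slot on that FPGA */
    uint8_t fpga_id;                  /* which FPGA board hosts the module */
};

static struct dhl_hf_entry hf_table[DHL_MAX_HF];
static unsigned hf_count;

/* Resolve a hardware function by name, as a software symbol table would. */
static const struct dhl_hf_entry *dhl_hf_lookup(const char *name)
{
    for (unsigned i = 0; i < hf_count; i++)
        if (strncmp(hf_table[i].hf_name, name, DHL_HF_NAME_LEN) == 0)
            return &hf_table[i];
    return NULL;
}
```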
[Fig. 3 diagram labels: two NUMA nodes, each with RAM and a CPU (CPU0, CPU1), and FPGA0–FPGA3 attached over PCIe.]
Fig. 3. The block diagram of the NUMA-architecture server.

[Figure residue: DMA transfer performance plotted against transfer size (64 B–64 KB) for three configurations — In-kernel, Different NUMA nodes, Same NUMA node; panel (a) Throughput (Gbps), panel (b) Latency (log scale).]

... Runtime schedules the CPU core in the same NUMA node to execute the polling tasks.

3) Transfer batching: To evaluate our UIO-based poll-mode ...

⁵ A security association is simply the bundle of algorithms and parameters (such as keys) that is being used to encrypt and authenticate a particular flow in one direction.
⁶ Intel(R) Multi-Buffer Crypto for IPsec Library: a highly optimized software implementation of the core cryptographic processing for IPsec, which provides industry-leading performance on a range of Intel(R) processors.
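The NUMA-aware scheduling rule above can be made concrete with DPDK's topology helpers: the sketch below picks a polling lcore on the same socket as the FPGA. How the FPGA's socket id is obtained (e.g., from the PCIe device's numa_node sysfs entry) is assumed; only the lcore iteration and socket query are standard DPDK calls.

```c
/* Sketch: choose a polling lcore on the same NUMA node as the FPGA.
 * fpga_socket_id is assumed to come from the device's PCIe numa_node entry. */
#include <rte_lcore.h>

static int pick_polling_lcore(int fpga_socket_id)
{
    unsigned lcore;

    RTE_LCORE_FOREACH_SLAVE(lcore) {
        if ((int)rte_lcore_to_socket_id(lcore) == fpga_socket_id)
            return (int)lcore;      /* first worker core on the same socket */
    }
    return -1;                      /* no same-socket core available */
}
```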
TABLE IV
EXPERIMENT CONFIGURATION

Single NF – IPsec Gateway:           CPU-only: 2 worker cores + 2 Ethernet I/O cores = 4 cores;   DHL: 2 Runtime cores, 6 KB batching size, 4 cores total
Single NF – NIDS:                    CPU-only: 2 worker cores + 2 Ethernet I/O cores = 4 cores;   DHL: 2 Runtime cores, 6 KB batching size, 4 cores total
Multi-NFs with accelerator shared¹:   IPsec Gateway (1 core) + IPsec Gateway (1 core);   CPU-only: N/A;   DHL: 2 Runtime cores, 6 KB batching size, 4 cores total
Multi-NFs with different accelerators²: IPsec Gateway (1 core) + NIDS (1 core);   CPU-only: N/A;   DHL: 2 Runtime cores, 6 KB batching size, 4 cores total

¹ We run two IPsec gateway instances that use the same accelerator module, IPsec crypto.
² We run an IPsec gateway instance using the accelerator module IPsec crypto and a NIDS instance using the accelerator module pattern matching, simultaneously.
[Fig. 6 panels, throughput (Gbps) and latency (µs) versus packet size (64B–1500B), comparing CPU-only, DHL, ClickNP, and raw I/O: (a) Throughput of IPsec gateway; (b) Latency of IPsec gateway; (c) Throughput of NIDS; (d) Latency of NIDS.]
Fig. 6. Throughput and processing latency of the single IPsec gateway/NIDS running with the 40G NIC.

[Fig. 7 panels, throughput (Gbps) versus packet size with 10G NICs: (a) Throughput of two IPsec gateways (IPSec1, IPSec2); (b) Throughput of IPsec gateway and NIDS.]
Fig. 7. The throughput of multiple NFs with 10G NICs.
... to process no more than 8 characters per clock cycle, which gives a theoretical throughput of 32 Gbps.

We measure latency by attaching a timestamp to each packet when the NF receives it from the NIC; right before the packet is sent back out, we subtract the saved timestamp from the current one. Figures 6(b) and 6(d) show the processing latency under different load factors. As packet size increases, the processing latency of the CPU-only version increases, up to 72 µs for the IPsec gateway and 138 µs for the NIDS. With DHL, both the IPsec gateway and the NIDS incur less than 10 µs at every packet size. Moreover, the latency for 64B packets is higher than for 1500B packets: with 64B packets, more packets must be accumulated to reach the 6 KB batching size, so it takes longer to dequeue enough packets from the shared IBQs.
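A minimal sketch of this measurement method, assuming a DPDK pipeline: the receive-side timestamp is stashed in the mbuf's udata64 field (available in DPDK 17.05) and subtracted from the TSC right before transmission. The helper names are ours, not DHL code.

```c
/* Sketch of the per-packet latency measurement described above (DPDK).
 * The RX cycle count is stored in the mbuf's udata64 field (DPDK 17.05). */
#include <rte_cycles.h>
#include <rte_mbuf.h>

static inline void stamp_on_rx(struct rte_mbuf *m)
{
    m->udata64 = rte_rdtsc();              /* cycle counter at RX */
}

static inline double latency_us_on_tx(const struct rte_mbuf *m)
{
    uint64_t cycles = rte_rdtsc() - m->udata64;
    return (double)cycles * 1e6 / (double)rte_get_tsc_hz();
}
```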
Besides, since ClickNP [14] has implemented an FPGA-only IPsec gateway, we compare our DHL framework against it for the IPsec gateway. ClickNP is not open-sourced, so we cannot reproduce their experiments on our platform; we use the results reported in their paper instead. As shown in Figure 6(a), the throughput of DHL is close to that of ClickNP except for small packet sizes. From Figure 6(b), we find that the ClickNP IPsec gateway suffers higher latency than DHL, which is surprising: judging from the per-element delays (in cycles) reported in their paper, it should not be that high.

From these single-NF tests, we can see that DHL, the CPU-FPGA co-design framework, using four CPU cores, can provide up to 7.7× the throughput with up to 19× lower latency for the IPsec gateway, and up to 8.3× the throughput with up to 36× lower latency for the NIDS.

D. Multiple NF Applications

There is only one FPGA board in our testbed, and it offers up to 42 Gbps of data transfer throughput between CPU and FPGA. To evaluate how DHL performs with multiple NFs running simultaneously, this total transfer throughput has to be divided among the NF instances. We therefore use two Intel X520-DA2 dual-port NICs (four 10G ports in total) to provide four separate 10 Gbps input streams, 40 Gbps in total, and assign them to different NF instances. Unlike the 40G port of the Intel XL710-QDA2 NIC, only one CPU core is enough to ...
TABLE V
RECONFIGURATION TIME OF ACCELERATOR MODULES

TABLE VI
SUMMARY OF THE TWO ACCELERATOR MODULES AND STATIC REGION