
Kernel Networking Walkthrough

LinuxCon 2015, Seattle


Thomas Graf

Kernel & Open vSwitch Team


Noiro Networks (Cisco)

Agenda

Getting packets from/to the NIC

NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO

Packet processing

RX Handler, IP Processing, TCP Processing, TCP Fast Open

Queuing from/to userspace

Socket Buffers, Flow Control, TCP Small Queues

Q&A

Touring the Network Stack


Expectation

Reality

How does a packet get in and out of the Network Stack?

Receive & Transmit Process


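NIC <-> Network Stack (Kernel Space) <-> Process (User Space): Task / Container

Receive: NIC -> DMA -> Ring Buffer -> Parse L2 & IP -> Local? -> Parse TCP/UDP -> Socket Buffer -> read() -> Process

Forward: Local? (no) -> Forward -> Route? -> Ring Buffer -> DMA -> NIC

Transmit: Process -> write() -> Socket Buffer -> Construct TCP/UDP -> Construct IP -> Route? -> Ring Buffer -> DMA -> NIC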

The 3 ways into the Network Stack


Interrupt Driven: Ring Buffer -> Network Stack

NAPI based Polling: poll() -> Ring Buffer -> Network Stack

Busy Polling: Task -> busy_poll() -> Ring Buffer -> Network Stack
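The NAPI polling path can be sketched as a small model. This is an illustration only, not kernel code: `napi_poll`, `ring`, and `deliver` are hypothetical names, and the rule "fewer packets than the budget means polling ends and the interrupt is re-enabled" is the usual NAPI contract reduced to one line.

```python
from collections import deque


def napi_poll(ring, budget, deliver):
    """Process up to `budget` packets from `ring`.

    Returns (work_done, reenable_irq). In this simplified model the
    RX interrupt is re-enabled only when the ring was drained before
    the budget was used up; otherwise polling continues later.
    """
    work_done = 0
    while ring and work_done < budget:
        deliver(ring.popleft())  # hand the packet to the "network stack"
        work_done += 1
    return work_done, work_done < budget


if __name__ == "__main__":
    ring = deque(range(100))  # 100 pending packets (hypothetical)
    stack = []
    print(napi_poll(ring, 64, stack.append))
    print(napi_poll(ring, 64, stack.append))
```

Under load the device stays in polling mode and no interrupts fire, which is the point of NAPI.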

RSS Receive Side Scaling

NIC distributes packets across multiple RX queues, allowing for parallel processing.

Each RX queue has its own IRQ, which selects the CPU that runs the hardware interrupt handler.

filter -> RX-queue-1 -> CPU 1
filter -> RX-queue-2 -> CPU 2
filter -> RX-queue-3 -> CPU 1
filter -> RX-queue-4 -> CPU 2
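A rough sketch of the queue selection idea: hash the flow tuple and use the result to pick an RX queue, so all packets of one flow land on the same queue. Real NICs typically use a Toeplitz hash plus an indirection table; CRC32 is swapped in here purely as a stand-in, and `rx_queue` is a hypothetical helper.

```python
import ipaddress
import struct
import zlib


def rx_queue(src_ip, dst_ip, src_port, dst_port, n_queues):
    """Map a flow 4-tuple to an RX queue index in [0, n_queues).

    Assumption: CRC32 stands in for the NIC's real hash function.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    return zlib.crc32(key) % n_queues


if __name__ == "__main__":
    # Same flow, same queue; different flows spread across queues.
    print(rx_queue("10.0.0.1", "10.0.0.2", 40000, 80, 4))
    print(rx_queue("10.0.0.1", "10.0.0.2", 40001, 80, 4))
```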

RPS Receive Packet Steering

Software filter to select the CPU number for processing.

Use it to ...

... redo the queue-to-CPU mapping

... distribute a single queue to multiple CPUs

(Diagram: RX-queue-1 through RX-queue-4, steered in software across CPU 1, CPU 2 and CPU 3.)
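RPS is configured per RX queue through a hexadecimal CPU bitmask written to `/sys/class/net/<dev>/queues/rx-<n>/rps_cpus`. The helper below is a minimal sketch for building that mask; `rps_cpu_mask` is a hypothetical name, CPU ids are zero-based here (the diagram counts from 1), and it assumes a system small enough that the mask fits in a single hex word.

```python
def rps_cpu_mask(cpus):
    """Return the hex bitmask string for an iterable of zero-based CPU ids."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU ids must be non-negative")
        mask |= 1 << cpu  # one bit per CPU
    return format(mask, "x")


if __name__ == "__main__":
    # Steer one queue to CPUs 0, 1 and 2.
    print(rps_cpu_mask([0, 1, 2]))
```

The resulting string would then be written to the queue's `rps_cpus` file, for example with `echo 7 > /sys/class/net/eth0/queues/rx-0/rps_cpus` as root.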

Hardware Offload

RX/TX Checksumming

Perform CPU intensive checksumming in hardware.

Virtual LAN filtering and tag stripping

Strip 802.1Q header and store VLAN ID in network packet meta data.
Filter out unsubscribed VLANs.
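To give a feel for the work that checksum offload takes off the CPU, here is a plain software version of the 16-bit ones' complement Internet checksum used by IP, TCP and UDP. It is a sketch for illustration; `internet_checksum` is a hypothetical name and the function ignores details such as the TCP/UDP pseudo-header.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit ones' complement checksum over `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # sum big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))
```

Every byte of every packet passes through this loop, which is why doing it in NIC hardware pays off.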

Segmentation Offload

Generic Receive Offload (ethtool -K eth0 gro on)

NAPI based GRO

poll() -> Ring Buffer (MTU-sized packets) -> GRO -> Network Stack (up to 64K)

It is more efficient to process one 64K-byte packet than about 40 packets of 1500 bytes each.
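The merging step can be modelled in a few lines. This is a simplified sketch, not the kernel's GRO: it only merges a segment with the most recent packet, it compares nothing but the flow key and sequence numbers, and `Segment`, `gro_merge` and the 65535-byte cap are assumptions made for illustration.

```python
from dataclasses import dataclass

GRO_MAX = 65535  # assumed upper bound for a merged packet, in bytes


@dataclass
class Segment:
    flow: tuple   # e.g. (src_ip, src_port, dst_ip, dst_port)
    seq: int      # sequence number of the first payload byte
    length: int   # payload length in bytes


def gro_merge(segments):
    """Coalesce in-order, same-flow segments into larger packets."""
    merged = []
    for seg in segments:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.flow == seg.flow
            and last.seq + last.length == seg.seq      # contiguous data
            and last.length + seg.length <= GRO_MAX    # stay under the cap
        ):
            last.length += seg.length
        else:
            merged.append(Segment(seg.flow, seg.seq, seg.length))
    return merged


if __name__ == "__main__":
    flow = ("10.0.0.1", 40000, "10.0.0.2", 80)
    packets = [Segment(flow, i * 1448, 1448) for i in range(40)]
    print(len(gro_merge(packets)))
```

The stack above GRO then pays its per-packet costs once per merged packet instead of once per wire packet.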

Segmentation Offload

Generic Segmentation Offload (GSO): ethtool -K eth0 gso on
Network Stack (up to 64K) -> GSO -> MTU-sized packets -> Ring Buffer

TCP Segmentation Offload (TSO): ethtool -K eth0 tso on
Network Stack (up to 64K) -> Ring Buffer -> TSO -> MTU-sized packets
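Segmentation is the reverse of receive offload: one large buffer is cut into wire-sized pieces as late as possible. The sketch below shows only the splitting itself; `segment` is a hypothetical helper, and the MSS of 1460 bytes assumes a 1500-byte MTU minus 20 bytes each of IPv4 and TCP header without options.

```python
def segment(payload: bytes, mss: int):
    """Split `payload` into chunks of at most `mss` bytes, in order."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


if __name__ == "__main__":
    chunks = segment(bytes(64 * 1024), 1460)
    print(len(chunks), len(chunks[-1]))
```

With GSO this split happens in software just before the driver; with TSO the NIC does it, so the stack handles a single large packet all the way down.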

How does a packet get through the Network Stack?

(c) Karen Sagovac

Packet Processing
Link Layer -> Packet Socket (ETH_P_ALL) -> Ingress QoS -> RX Handler -> Proto Handler

The Feast!

Packet Socket (ETH_P_ALL): tcpdump

RX Handler: Bridge, Open vSwitch, Team, Bonding, macvlan, macvtap

Proto Handler: IPv4, IPv6, ARP, IPX, ... (otherwise Drop)
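The protocol handler step amounts to a lookup on the EtherType field of the Ethernet header. The following is a minimal sketch of that dispatch; `proto_handler` and the `HANDLERS` table are hypothetical, VLAN tags are not handled, and only the four protocols named above are registered.

```python
import struct

# EtherType values for the protocols listed above.
HANDLERS = {
    0x0800: "IPv4",
    0x86DD: "IPv6",
    0x0806: "ARP",
    0x8137: "IPX",
}

ETH_HLEN = 14  # destination MAC (6) + source MAC (6) + EtherType (2)


def proto_handler(frame: bytes) -> str:
    """Return the name of the handler for `frame`, or "Drop"."""
    if len(frame) < ETH_HLEN:
        return "Drop"  # too short to carry an Ethernet header
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return HANDLERS.get(ethertype, "Drop")


if __name__ == "__main__":
    frame = bytes(12) + b"\x08\x00" + bytes(20)
    print(proto_handler(frame))
```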

IP Processing
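Receive: Link Layer -> IP Handler -> PREROUTING -> Route Lookup

Local Delivery: Route Lookup -> INPUT -> L4 (TCP, ...) -> User Space

Forwarding: Route Lookup -> FORWARD -> POSTROUTING -> Link Layer

Local Output: User Space -> L4 (TCP, ...) -> IPv4 Construction -> Route Lookup -> OUTPUT -> POSTROUTING -> Link Layer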
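Which hooks a packet crosses depends only on where it comes from and whether the route lookup finds a local destination. The function below is a small model of that decision; `hooks_traversed` and its parameters are hypothetical, and the route lookup is reduced to a membership test against a list of local addresses.

```python
import ipaddress


def hooks_traversed(dst_ip, local_addrs, locally_generated=False):
    """Return the hooks a packet passes, in order.

    Assumption: "route lookup" here just checks whether dst_ip is one
    of the host's own addresses.
    """
    if locally_generated:
        return ["OUTPUT", "POSTROUTING"]
    local = {ipaddress.ip_address(a) for a in local_addrs}
    if ipaddress.ip_address(dst_ip) in local:
        return ["PREROUTING", "INPUT"]
    return ["PREROUTING", "FORWARD", "POSTROUTING"]


if __name__ == "__main__":
    mine = ["192.0.2.1"]
    print(hooks_traversed("192.0.2.1", mine))
    print(hooks_traversed("198.51.100.7", mine))
    print(hooks_traversed("198.51.100.7", mine, locally_generated=True))
```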

TCP Processing
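IP -> Parse TCP -> Lookup Socket -> Socket Filter

socket locked: Backlog

task exists: Prequeue

otherwise: Receive TCP -> Receive Socket Buffer

Receive Socket Buffer -> read() / poll() -> Task

Backlog and Prequeue are drained in process context; the rest runs in softirq.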

TCP Fast Open


(net.ipv4.tcp_fastopen)
Regular (Client / Server)

1st Req: SYN, SYN+ACK, ACK+HTTP GET, Data (2x RTT)
2nd Req: SYN, SYN+ACK, ACK+HTTP GET, Data (2x RTT)

Fast Open (Client / Server)

1st Req: SYN, SYN+ACK+Cookie, ACK+HTTP GET, Data (2x RTT)
2nd Req: SYN+Cookie+HTTP GET, SYN+ACK+Data (1x RTT)
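The `net.ipv4.tcp_fastopen` sysctl is a bitmask; bit 0x1 enables Fast Open on the client side and bit 0x2 on the server side. The sketch below decodes just those two bits and ignores any others; `decode_tcp_fastopen` and `read_tcp_fastopen` are hypothetical helpers, and the reader function assumes a Linux system with `/proc` mounted.

```python
TFO_CLIENT = 0x1  # send data in the opening SYN
TFO_SERVER = 0x2  # accept data in a received SYN


def decode_tcp_fastopen(value: int) -> dict:
    """Decode the client and server bits of net.ipv4.tcp_fastopen."""
    return {
        "client": bool(value & TFO_CLIENT),
        "server": bool(value & TFO_SERVER),
    }


def read_tcp_fastopen(path="/proc/sys/net/ipv4/tcp_fastopen"):
    """Read and decode the current setting (Linux only)."""
    with open(path) as f:
        return decode_tcp_fastopen(int(f.read().strip()))


if __name__ == "__main__":
    print(decode_tcp_fastopen(3))
```

Setting the value to 3 (`sysctl -w net.ipv4.tcp_fastopen=3`) enables both directions; applications still have to opt in per socket.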

Memory Accounting & Flow Control

Socket Buffers & Flow Control (net.ipv4.tcp_{r|w}mem)

Sender (ssh): write() -> wmem overlimit? -> yes: Block or EWOULDBLOCK
no: Socket Buffer (wmem += packet-size) -> TCP/IP -> TX Ring Buffer (wmem -= packet-size)

Receiver (ssh): RX Ring Buffer -> TCP/IP -> rmem overlimit? -> yes: Reduce TCP Window
Socket Buffer (rmem += packet-size) -> ssh (rmem -= packet-size)
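The send-side accounting can be modelled as a counter with a limit. This is a toy model under stated assumptions, not the kernel's logic: `SendBuffer` is a hypothetical class, writes are all-or-nothing, and a non-blocking socket is assumed, so an over-limit write raises `BlockingIOError` (EWOULDBLOCK) instead of sleeping.

```python
import errno


class SendBuffer:
    """Toy model of write-memory (wmem) accounting for one socket."""

    def __init__(self, limit):
        self.limit = limit  # stands in for the wmem limit, in bytes
        self.wmem = 0       # bytes currently charged to the socket

    def write(self, size):
        """Charge `size` bytes, or fail if that would exceed the limit."""
        if self.wmem + size > self.limit:
            raise BlockingIOError(errno.EWOULDBLOCK, "send buffer full")
        self.wmem += size

    def tx_complete(self, size):
        """Release `size` bytes once the packet has left the TX ring."""
        self.wmem -= size


if __name__ == "__main__":
    buf = SendBuffer(limit=4096)
    buf.write(3000)
    try:
        buf.write(2000)
    except BlockingIOError as exc:
        print("would block:", exc)
```

The receive side mirrors this with rmem, except that the pressure is signalled to the peer by shrinking the advertised TCP window.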

TCP Small Queues (net.ipv4.tcp_limit_output_bytes)

ssh: write() -> Socket Buffer
torrent: write() -> Socket Buffer

TSQ: max 128 KB in flight per socket

Socket Buffer -> TCP/IP -> Queuing Discipline -> Driver -> TX Ring Buffer
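The effect of TSQ can be sketched with two sockets sharing one device. This is a toy model, not the kernel algorithm: `TsqSocket`, the 64 KB chunk size, and the exact stop condition are assumptions, and the 128 KB limit is the figure quoted above.

```python
TSQ_LIMIT = 128 * 1024  # per-socket limit below the TCP layer, in bytes


class TsqSocket:
    """Toy model of one TCP socket throttled by TCP Small Queues."""

    def __init__(self, name, pending):
        self.name = name
        self.pending = pending  # bytes waiting in the socket buffer
        self.queued = 0         # bytes sitting in qdisc/driver/TX ring

    def push(self, chunk=64 * 1024):
        """Hand data to the lower layers while under the TSQ limit."""
        moved = 0
        while self.pending and self.queued < TSQ_LIMIT:
            n = min(chunk, self.pending)
            self.pending -= n
            self.queued += n
            moved += n
        return moved

    def tx_complete(self, n):
        """Called when `n` bytes have actually left the device."""
        self.queued -= n


if __name__ == "__main__":
    torrent = TsqSocket("torrent", pending=1 << 20)
    ssh = TsqSocket("ssh", pending=100)
    print(torrent.push(), ssh.push())
```

Because the bulk sender cannot fill the queuing discipline on its own, the small ssh write does not have to wait behind a megabyte of torrent data.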

Q&A

Contact:
E-Mail: tgraf@suug.ch
Twitter: @tgraf__
