COMS W6998
Spring 2010
Erich Nahum
Outline
Introto socket buffers
The sk_buff data structure
Packet data
sk_buff
...
...
head
head
data
data tailroom
tailroom size
tail
tail
end
end
...
...
Packet data
sk_buff
headroom
headroom len
...
...
head
head
data
data size
tail
tail
end
end
...
... tailroom
tailroom
data += len
tail += len
sk_buff after skb_put(len)
Packet data
sk_buff
headroom
headroom
...
...
head
head
data
data
tail size
tail data
data len
end
end
...
...
tailroom
tailroom
tail += len
len += len
sk_buff after skb_push(len)
Packet data
headroom
headroom
sk_buff
...
... new len
new data
data
head
head
data
data size
tail
tail old
old data
data
end
end
...
...
tailroom
tailroom
data -= len
len += len
Changes in sk_buff as a Packet
Traverses Across the Stack
IP-Header
UDP-Header UDP-Header
UDP-Data UDP-Data UDP-Data
dataref:
dataref: 11 dataref:
dataref: 11 dataref:
dataref: 11
Parameters of sk_buff Structure
sk: points to the socket that created the packet (if
available).
tstamp: specifies the time when the packet
arrived in the Linux (using ktime)
dev: states the current network device on which
the socket buffer operates. If a routing decision is
made, dev points to the network adapter on
which the packet leaves.
_skb_dst: a reference to the adapter on which
the packet leaves the computer
cloned: indicates if a packet was cloned.
Parameters of sk_buff Structure
pkt_type: specifies the type of a packet
PACKET_HOST: a packet sent to the local host
PACKET_BROADCAST: a broadcast packet
PACKET_MULTICAST: a multicast packet
PACKET_OTHERHOST: a packet not destined for the
local host, but received in the promiscuous mode.
PACKET_OUTGOING: a packet leaving the host
PACKET_LOOKBACK: a packet sent by the local host
to itself.
sk_buff fields
next: Next buffer in list ip_summed: Driver fed us an IP checksum
prev: Previous buffer in list priority: Packet queuing priority
sk: Socket we are owned by users: User count - see {datagram,tcp}.c
tstamp: Time we arrived protocol: Packet protocol from driver
dev: Device we arrived on/are leaving by truesize: Buffer size
transport_header head: Head of buffer
network_header data: Data head pointer
mac_header tail: Tail pointer
_skb_dst: Destination route cache entry end: End pointer
sp: Security path, used for xfrm destructor: Destruct function
cb: Control buffer. Private data. mark: generic packet mark
len: Length of actual data nfct: Associated connection, if any
data_len: Data length ipvs_property: skbuff is owned by ipvs
mac_len: Length of link layer header peeked: packet was looked at
hdr_len: writeable length of cloned skb nf_trace: netfilter packet trace flag
csum: Checksum nfctinfo: Connection tracking info
csum_start: offset from head where to start nfct_reasm: Netfilter conntrack reass ptr
csum_offset: offset from head where to store nf_bridge: Saved data for bridged frame
local_df: Allow local fragmentation flag tc_index: Traffic control index
cloned: Head may be cloned (see refcnt) tc_verd: Traffic control verdict
nohdr: Payload reference only flag dma_cookie: DMA operation cookie
pkt_type: Packet class secmark: Security marking for LSM
fclone: Clone status And many more!
Outline
Introto socket buffers
The sk_buff data structure
dataref:
dataref: 11 dataref:
dataref: 11 dataref:
dataref: 11
Cloning Socket Buffers
skb_clone(): creates a new socket buffer
sk_buff, but not the packet data. Pointers in
both sk_buffs point to the same packet data
space.
Used all over the place, e.g., tcp_transmit_skb().
Cloning Socket Buffers
skb_clone()
sk_buff sk_buff sk_buff
next
next next
next next
next
prev
prev prev
prev prev
prev
...
... ...
... ...
...
head
head head
head head
head
data
data data
data data
data
tail
tail tail
tail tail
tail
end
end end
end end
end
Packet data Packet data
IP-Header IP-Header
UDP-Header UDP-Header
UDP-Data UDP-Data
dataref:
dataref: 11 dataref:
dataref: 22
Releasing Socket Buffers
kfree_skb(): decrements reference count for skb. If
null, free the memory.
Used by the kernel, not meant to be used by drivers
dev_free_skb():
For use by drivers in non-interrupt context
dev_free_skb_irq():
For use by drivers in interrupt context
dev_free_skb_any():
For use by drivers in any context
Outline
Introto socket buffers
The sk_buff data structure
Packetdata Packetdata
Packetdata
Managing Socket Buffer Queues
skb_queue_head_init(list): initializes an
skb_queue_head structure
prev = next = self; qlen = 0;
skb_queue_empty(list): checks whether the queue
list is empty; checks if list == list->next
skb_queue_len(list): returns length of the queue.
skb_queue_head(list, skb): inserts the socket
buffer skb at the head of the queue and increment
listqlen by one.
skb_queue_tail(list, skb): appends the socket buffer
skb to the end of the queue and increment
listqlen by one.
Managing Socket Buffer Queues
skb_dequeue(list): removes the top skb from the
queue and returns the pointer to the skb.
skb_dequeue_tail(list): removes the last packet
from the queue and returns the pointer to the
packet.
skb_queue_purge(): empties the queue list; all
packets are removed via kfree_skb().
skb_insert(oldskb, newskb, list): inserts newskb in
front of oldskb in the queue of list.
skb_append(oldskb, newskb, list): inserts newskb
behind oldskb in the queue of list.
Managing Socket Buffer Queues
skb_unlink(skb, list): removes the socket buffer
skb from queue list and decrement the queue
length.
skb_peek(list): returns a pointer to the first
element of a list, if this list is not empty;
otherwise, returns NULL.
Leaves buffer on the list
skb_peek_tail(list): returns a pointer to the last
element of a queue; if the list is empty, returns
NULL.
Leaves buffer on the list
Backup
sk_buff Alignment
CPUs often take a performance hit when accessing
unaligned memory locations.
Since an Ethernet header is 14 bytes, network
drivers often end up with the IP header at an
unaligned offset.
The IP header can be aligned by shifting the start of
the packet by 2 bytes. Drivers should do this with:
skb_reserve(NET_IP_ALIGN);
The downside is that the DMA is now unaligned. On
some architectures the cost of an unaligned DMA
outweighs the gains so NET_IP_ALIGN is set on a
per arch basis.