
An Introduction to Advanced Routing & Traffic Control

Filip Sneppe (filip.sneppe@cronos.be)

1. What’s on the Menu ?
• Linux Networking Stack Crash Course
– What happens to a packet entering/leaving a Linux System ?
– What are the subsystems encountered ?
– Which subsystems are the topic of this talk ?
• Advanced Routing
– How does traditional routing work ?
– How Linux extends this scheme beyond recognition
• Traffic Control
– How can we manage our bandwidth ?
– How Linux implements Traffic Control

2. Linux Networking Stack Crash Course
2.1. Overview

2.2. netfilter/iptables

iptables supports the following operations:


• Packet Filtering ("Firewalling")
• Network Address Translation
• Packet Mangling

2.2.1. netfilter/iptables: Packet Filtering ("Firewalling")

2.2.2. netfilter/iptables: Network Address Translation

2.2.3. netfilter/iptables: Packet Mangling

2.3. (Advanced) Routing

2.4. Traffic Control

3. Advanced Routing
3.1. What’s Needed ?
3.1.1. Userspace
• ⇒ iproute package, written by Alexey N. Kuznetsov
• Contains a utility, ip, to configure:
– IPv4 & IPv6 kernel policy routing (unicast & multicast)
– IPv4 & IPv6 addresses
– MAC address resolution
– Device/Link states
– Tunneling
– Route monitoring
• ⇒ Blows away ifconfig, route & arp

3.1.2. Kernel

3.2. Traditional IP Routing
• One routing table
• Sequential lookup (mostly: there is also caching and route aggregation)
• Destination-based
• "Longest match" principle, determined via subnet mask
• "Metric" = hop count (artificial cost associated with a route)
• Example:
zeus:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.223.0      0.0.0.0         255.255.255.0   U     0      0        0 eth2
10.0.0.0        0.0.0.0         255.255.0.0     U     0      0        0 eth1
0.0.0.0         10.0.10.2       0.0.0.0         UG    0      0        0 eth1

• Or with the ip utility:


zeus:~# ip route list
10.0.223.0/24 dev eth2 proto kernel scope link src 10.0.223.1
10.0.0.0/16 dev eth1 proto kernel scope link src 10.0.222.1
default via 10.0.10.2 dev eth1
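
The "longest match" principle can be sketched in a few lines of shell. This is only an illustration of the selection logic, not how the kernel stores or searches its tables. The function names ip_to_int and lookup are made up for this sketch, and the table format ("PREFIX/LEN DEVICE", one route per line) is an assumption chosen to mirror the listing above.

```shell
#!/bin/sh
# Sketch of destination-based longest-prefix matching.
# Helper names and the input format are illustrative assumptions.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print the device of the most specific route covering address $1.
# The routing table is read from standard input.
lookup() {
    dst=$(ip_to_int "$1")
    best_len=-1
    best_dev=none
    while read -r prefix dev; do
        net=${prefix%/*}
        len=${prefix#*/}
        if [ "$len" -eq 0 ]; then
            mask=0                      # default route matches everything
        else
            # 4294967295 is 0xFFFFFFFF; keep the top $len bits
            mask=$(( (4294967295 << (32 - len)) & 4294967295 ))
        fi
        net_int=$(ip_to_int "$net")
        if [ $(( dst & mask )) -eq $(( net_int & mask )) ] &&
           [ "$len" -gt "$best_len" ]; then
            best_len=$len
            best_dev=$dev
        fi
    done
    echo "$best_dev"
}

ROUTES='10.0.223.0/24 eth2
10.0.0.0/16 eth1
0.0.0.0/0 eth1'

echo "$ROUTES" | lookup 10.0.223.7      # covered by /24, /16 and default
echo "$ROUTES" | lookup 10.0.5.9        # covered by /16 and default
echo "$ROUTES" | lookup 193.121.44.108  # covered by the default route only
```

The first lookup picks eth2 because the /24 is more specific than the /16, even though both match.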

3.3. What Traditional Routing Looks Like...

3.4. What Traditional Routing Really Looks Like...

3.5. Advanced Routing: 255 routing tables, 2^32 rules

3.6. Feature Summary
• Up to 255 routing tables:
– Work like ’traditional’, destination-based routing tables
– Actually, they’re a lot more powerful than routing tables on other OSes ...
• Up to 2^32 rules:
– That’s one rule per IPv4 address out there !
– Sit in front of the routing tables
– Determine which routing table to consult
– It’s possible to consult multiple routing tables, ...
• Traditional routing, and the utilities ifconfig & route only work on a subset
of the Routing Policy Database

3.7. Command Syntax
3.7.1. General Syntax
ip [ options ] object [ command ] [ arguments ]

Options:
• -V, -version
• -s, -stats, -statistics
• -f, -family (inet, inet6, link), -4, -6, -0
• -o, -oneline: formats output as a single line, replacing line feeds with a backslash (useful for pipes)
• -r, -resolve

Commands:
• add
• delete
• show, list
• ...

Objects:
• link
• address
• neigh
• route
• rule
• maddr
• mroute
• tunnel

Arguments:
• flags
• parameters

3.7.2. ip rule ...
Syntax
ip rule [ list | add | del ] SELECTOR ACTION
SELECTOR := [ from PREFIX ] [ to PREFIX ] [ tos TOS ] [ fwmark FWMARK ]
            [ dev STRING ] [ pref NUMBER ]
ACTION := [ table TABLE_ID ] [ nat ADDRESS ]
          [ prohibit | blackhole | unreachable ]
          [ realms [ SRCREALM/]DSTREALM ]
TABLE_ID := [ local | main | default | NUMBER ]

Arguments
• type TYPE:

– unicast: Default
– blackhole: Drop silently
– unreachable: Return Network is Unreachable ICMP error
– prohibit: Return Communication Administratively Prohibited ICMP error
– nat: NAT of source address - see ip route for DNAT
• from PREFIX: Match on source prefix
• to PREFIX: Match on destination prefix
• iif NAME: Match on incoming interface
If you want to redirect locally generated packets, use iif lo, not from 127.0.0.1
• tos TOS or dsfield TOS: Match on TOS
• fwmark MARK: Match on FWMARK
• priority PREFERENCE: Determines matching order;
You should add a priority to every rule (command doesn’t require it)
• table TABLE_ID: Routing table identifier to look up if rule selector matches
• realms FROM/TO: Realms to select if the rule matches and routing table lookup succeeds. Realm TO
is only used if the route returned did not select any realm
• nat ADDRESS: Base IP address block to translate

Examples
# ip rule add from 10.0.0.0/24 table 200 prio 3000
# ip rule add to 20.0.0.0/16 table 200 prio 4000
# ip rule add iif lo table 100 prio 5000
# ip rule add tos 0x8 table 100 prio 6000
# ip rule add to 172.16.0.0/12 prohibit pref 1000
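
A rule only selects a table; the table still needs routes of its own. The sketch below pairs the two steps for source-based routing. It runs in dry-run mode by default and only prints the commands. The run and policy_route helpers and the DRY_RUN variable are hypothetical names for this sketch; table 200, the 10.0.0.0/24 source network and gateway 192.168.0.1 are illustrative values borrowed from the examples on these slides.

```shell
#!/bin/sh
# Dry-run sketch: prints commands unless DRY_RUN=0 is set (needs root then).
# run, policy_route and DRY_RUN are hypothetical names for this example.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Route traffic from a source prefix through its own routing table.
# Usage: policy_route SRC_PREFIX TABLE GATEWAY PRIORITY
policy_route() {
    # 1. Give the table a default route of its own.
    run ip route add default via "$3" table "$2"
    # 2. Add the rule that sends matching packets to that table.
    run ip rule add from "$1" table "$2" priority "$4"
    # 3. Flush cached routing decisions so the change takes effect.
    run ip route flush cache
}

policy_route 10.0.0.0/24 200 192.168.0.1 3000
```

Afterwards, ip rule list and ip route list table 200 can be used to check the result.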

3.7.3. ip route ...
Syntax
ip route ( list | flush ) SELECTOR
ip route get ADDRESS [ from ADDRESS iif STRING ]
                     [ oif STRING ] [ tos TOS ]
ip route ( add | del | change | append | replace | monitor ) ROUTE
SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]
            [ table TABLE_ID ] [ proto RTPROTO ]
            [ type TYPE ] [ scope SCOPE ]
ROUTE := NODE_SPEC [ INFO_SPEC ]
NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]
             [ table TABLE_ID ] [ proto RTPROTO ]
             [ scope SCOPE ] [ metric METRIC ]
INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ] ...
NH := [ via ADDRESS ] [ dev STRING ] [ weight NUMBER ] NHFLAGS
OPTIONS := FLAGS [ mtu [lock] NUMBER ] [ advmss NUMBER ]
           [ rtt NUMBER ] [ rttvar NUMBER ]
           [ window NUMBER] [ cwnd NUMBER ] [ ssthresh REALM ]
           [ realms REALM ]
TYPE := [ unicast | local | broadcast | multicast | throw |
          unreachable | prohibit | blackhole | nat ]
TABLE_ID := [ local | main | default | all | NUMBER ]
SCOPE := [ host | link | global | NUMBER ]
FLAGS := [ equalize ]
NHFLAGS := [ onlink | pervasive ]
RTPROTO := [ kernel | boot | static | NUMBER ]

Arguments
• to PREFIX or to TYPE PREFIX: Type is one of:
– unicast: Real paths to destination prefix
– unreachable: Return Host Unreachable ICMP error
– blackhole: Silently discard packets
– prohibit: Return Communication Administratively Prohibited ICMP error
– local: Packets are looped back and delivered locally
– broadcast: For link broadcasts
– throw: Terminates lookup in table (possibly returning to rules)
– nat: NAT of destination address (- see ip rule for SNAT)
– anycast: Not implemented yet
– multicast: For multicast routing (not present in normal routing tables)
• tos TOS or dsfield TOS: (see also /etc/iproute2/rt_dsfield)
"Longest match" means to first compare TOS;
if not equal, packet may still match a route with zero TOS
• metric NUMBER or preference NUMBER: Preference of the route
• table TABLE_ID: (see also /etc/iproute2/rt_tables)
• dev NAME: Outgoing device name
• via ADDRESS: Address of nexthop router (different meaning for nat and for a direct route in BSD
compatibility mode)
• src ADDRESS: Source address to preferentially use
• realm REALMID: (see also /etc/iproute2/rt_realms)
• mtu MTU or mtu lock MTU: Sets the Maximum Transmission Unit along the path to the destination;
If lock is not set, value may be updated by kernel due to Path MTU Discovery
• window NUMBER: Sets maximum advertised TCP window - can limit TCP bursts
• rtt NUMBER: Initial round trip time estimate
• nexthop NEXTHOP:
– via ADDRESS: Nexthop router
– dev NAME: Output device
– weight NUMBER: Relative bandwidth/weight of this nexthop
• scope SCOPE_VAL: (see also /etc/iproute2/rt_scopes)
– global: For gatewayed unicast routes
– link: For direct unicast and broadcast routes
– host: For local routes
• protocol PROTO: (see also /etc/iproute2/rt_protos)
– redirect: Route was installed due to ICMP redirect
– kernel: As a result of autoconfiguration
– boot: Installed during bootup sequence. A routing daemon will purge all of these upon startup
– static: Overrides dynamic routing, installed by admin
– ra: "Router discovery protocol"-installed route
• onlink: Pretend that nexthop is directly attached (tunnels)
• equalize: Allow packet-by-packet randomization on multipath routes
- see http://www.linuxvirtualserver.org/~julian/
Examples
# ip route add 10.0.0.0/8 via 192.168.0.1 table 200
# ip route add 192.168.0.20 dev eth2 src 192.168.0.50

Multipath route:
# ip route add default scope global nexthop dev ppp0 nexthop dev ppp1

Listing:
# ip route list 193.121.44.108 table cache
193.121.44.108 from 192.168.0.10 via 192.168.0.1 dev eth0
cache mtu 1500 rtt 45ms rttvar 45ms cwnd 1 advmss 1460
# ip route list proto gated/bgp

Flushing the routing cache or specific routes:
# ip route flush cache
# ip -6 -s -s route flush cache
...

Looking up policy routing decisions:
# ip route get 193.121.44.108 from 10.0.0.1 iif eth0
193.121.44.108 from 10.0.0.1 via 192.168.0.1 dev eth0 src 192.168.0.10
cache mtu 1500 advmss 1460 iif eth0

NAT:
# ip rule add from $MAILSERVER_I nat $MAILSERVER_E table main
# ip route add nat $MAILSERVER_E via $MAILSERVER_I table local

3.7.4. ip address ...
Syntax
ip address ( add | delete ) IFADDR dev STRING
ip address ( show | flush ) [ dev STRING ] [ scope SCOPE-ID ]
[ to PREFIX ] [ FLAG-LIST ] [ label PATTERN ]
IFADDR := PREFIX | ADDR peer PREFIX
[ broadcast ADDR ] [ anycast ADDR ]
[ label STRING ] [ scope SCOPE-ID ]
SCOPE-ID := [ host | link | global | NUMBER ]
FLAG-LIST := [ FLAG-LIST ] FLAG
FLAG := [ permanent | dynamic | secondary | primary |
          tentative | deprecated ]

Arguments
• dev NAME: Device name
• local ADDRESS: Local address. If no CIDR netmask is given, /32 is assumed for ipv4
• peer ADDRESS: Address of remote endpoint for POINTOPOINT interfaces
• broadcast ADDRESS: Broadcast address; Special symbols + and - can be used
• label NAME: eth0:x preserves Linux 2.0 compatibility
• scope SCOPE-ID: (see /etc/iproute2/rt_scopes):
– global: Address is globally valid
– site: ipv6 only
– link: Address is link local, only valid on this link
– host: Address is only valid inside this host
• permanent, dynamic, tentative, deprecated: IPv6 only
• primary/secondary: See examples

Examples
Adding an IP address:
# ip addr add 10.0.0.1/24 brd + dev eth0

Flushing addresses:
# ip addr flush label "eth*"
# ip -s -s addr flush to 10/8
3: eth1 inet 10.1.1.10/24 brd 10.1.1.255 scope global eth1
4: eth2 inet 10.2.2.10/24 brd 10.2.2.255 scope global eth2
*** Round 1, deleting 2 addresses ***
*** Flush is complete after 1 round ***

Adding addresses without subnet masks:
# ip address add 10.0.0.1 dev eth1
# ip address add 10.0.0.2 dev eth1
# ip a list dev eth1
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:10:b5:bb:f9:2b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/32 scope global eth1
    inet 10.0.0.2/32 scope global eth1
# ip address del 10.0.0.2 dev eth1
# ip a list dev eth1
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:10:b5:bb:f9:2b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/32 scope global eth1
# ip address del 10.0.0.1 dev eth1

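Adding addresses with subnet masks (the second address in the same subnet becomes secondary; deleting the primary address also removes its secondaries):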
# ip address add 10.0.0.1/24 dev eth1
# ip address add 10.0.0.2/24 dev eth1
# ip a list dev eth1
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:10:b5:bb:f9:2b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 scope global eth1
    inet 10.0.0.2/24 scope global secondary eth1
# ip address del 10.0.0.1/24 dev eth1
# ip a list dev eth1
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:10:b5:bb:f9:2b brd ff:ff:ff:ff:ff:ff

3.7.5. ip link ...
Syntax
ip link set DEVICE ( up | down | arp ( on | off ) |
dynamic ( on | off ) |
multicast ( on | off ) | txqueuelen PACKETS |
name NEWNAME | address LLADDR |
broadcast LLADDR | mtu MTU )
ip link show [ DEVICE ]

Arguments
• arp on/off: don’t change this when device is up
• ip link show flags:
– UP, LOOPBACK, BROADCAST, POINTOPOINT, MULTICAST
– PROMISC: Shows *real* link state
– ALLMULTI: Device receives all multicast packets (multicast router)
– NOARP: Device does not need ARP
– DYNAMIC: Interface is dynamically created and destroyed
– SLAVE: Interface is bound to other interfaces in order to share link capabilities

Examples
# ip -s -s link list eth0
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:10:b5:bb:f9:36 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    13226295   13406    0       0       0       0
    RX errors: length   crc     frame   fifo    missed
               0        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    1878081    13406    0       0       0       6
    TX errors: aborted  fifo    window  heartbeat
               0        5       0       0

3.7.6. ip neigh ...
Syntax
ip neigh ( add | del | change | replace ) ( ADDR [ lladdr LLADDR ]
[ nud ( permanent | noarp | stale | reachable ) ] | proxy ADDR )
[ dev DEV ]
ip neigh ( show | flush ) [ to PREFIX ] [ dev DEV ] [ nud STATE ]

Arguments
• to ADDRESS: Protocol address of neighbour (ipv4 or ipv6)
• dev NAME: Device
• lladdr LLADDRESS: MAC address
• nud STATE: "Neighbour Unreachability Detection"
– permanent: Valid forever, until removed administratively
– noarp: Neighbour entry is valid; no attempt is made to validate it, but it can be removed when its lifetime expires
– reachable: Neighbour entry is valid until reachability timeout expires
– stale: Valid but suspicious
• proxy ADDR: Configure proxy ARP

Examples
# ip neigh add 192.168.0.221 lladdr 01:01:01:01:10:01 dev eth0
# ip neigh list
192.168.0.221 dev eth0 lladdr 01:01:01:01:10:01 nud permanent
192.168.0.1 dev eth0 lladdr 00:20:18:b8:1d:d0 nud stale
192.168.0.20 dev eth2 lladdr 00:08:02:66:6c:bd nud reachable
# ip neigh change 192.168.0.1 dev eth0 lladdr 00:20:18:b8:1d:d0 nud reachable
# ip neigh list
192.168.0.221 dev eth0 lladdr 01:01:01:01:10:01 nud permanent
192.168.0.1 dev eth0 lladdr 00:20:18:b8:1d:d0 nud reachable
192.168.0.20 dev eth2 lladdr 00:08:02:66:6c:bd nud reachable

3.7.7. Other
NAT:

Quite different from iptables NAT...


• No connection tracking required
• No protocol/nat helpers
• No automatic translation of return packets
• Combine FastNAT & iptables NAT ? ⇒ No!
• Combine FastNAT & iptables filtering ? ⇒ Yes, but don’t lose your head
ip monitor & rtmon: state monitoring
ip monitor [ file FILENAME ] [ all | link | address | route ]
rtmon [ file FILENAME ]

ip tunnel ...

Example: GRE tunnel

On 10.0.0.1/10.1.0.1:
# ip tunnel add netb mode gre remote 10.1.0.2 local 10.1.0.1 ttl 225
# ip link set netb up
# ip address add 10.0.0.1 dev netb
# ip route add 10.2.0.0/24 dev netb

On 10.2.0.2/10.1.0.2:
# ip tunnel add neta mode gre remote 10.1.0.1 local 10.1.0.2 ttl 225
# ip link set neta up
# ip address add 10.2.0.2 dev neta
# ip route add 10.0.0.0/24 dev neta

3.7.8. The "Firewall Mark (FWMARK)"
• Packet journey through the Linux network stack ⇒ lots of packet classification:
– iptables nat, mangle, filter ⇒ depend on iptables matches
– Routing ⇒ again classification, especially with policy routing
– Traffic control ⇒ control traffic according to protocol ports, ...
• ⇒ This has disadvantages:
– performance impact
– what if, for a given type of traffic, I want to:
∗ Route it differently (routing)
∗ Set TOS bits to different value (iptables mangle operation)
∗ Shape traffic
– Some types of classification only available with netfilter, routing, or traffic
shaping
• Solution: the firewall mark: a numeric value that can be given to a packet as it
travels through the stack and can be queried by every network subsystem
# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 10
# ip rule add fwmark 10 table 200
# iptables -t mangle -A FORWARD -m mark --mark 10 -j TOS --set-tos 16
# tc filter add dev eth0 parent 10:0 protocol ip handle 10 fw flowid 10:1

• Note: connection marking is also possible (netfilter patch-o-matic patch by
Henrik Nordstrom)

4. Traffic Shaping
4.1. What’s Needed ?
4.1.1. Userspace
• ⇒ iproute package, written by Alexey N. Kuznetsov
• Contains a utility, tc, to configure kernel traffic control
• Again, Linux is at the forefront of networking technology and even blows away
commercial competition in terms of power/functionality (and price :-) )

4.1.2. Kernel

4.1.3. Basic principles
• How does one control network traffic ?
– Delay the sending of packets
– Reorder packets so as to prioritize certain traffic flows (interactive traffic vs.
long downloads) ⇒ This requires the presence of a packet queue. If there
is no queue, one cannot reorder:
∗ Linux must be the "slowest" element of the chain, so that it has a
packet queue to work with
∗ Packets are (possibly) queued before they leave the machine ⇒ there
is queueing before packets are sent to the NIC driver
– Drop packets: end-nodes will retransmit/recover and lower their sending
rates (don’t laugh - it’s supposed to work like that)
• ingress or egress ? ⇒ egress ! control what *YOU* send, and the rest of the
network will adapt

4.1.4. Quiz...

4.2. Terminology & Concepts
• Queueing Discipline: Algorithm that manages the queue of a device, either
incoming (ingress) or outgoing (egress). Example: FIFO
• Classless qdisc: Has no configurable internal subdivisions
• Classful qdisc: Contains multiple classes. Each of these classes contains a further
qdisc, which may again be classful.
• Classes: A classful qdisc may have many classes, which each are internal to the
qdisc. Each of these classes may contain a real qdisc.
• Classifier: Sits inside a classful qdisc and determines to which class it needs to
send a packet
• Filter: Classification can be performed using filters. A filter contains a number
of conditions that can be matched, and an action
• Scheduling: The process of reordering packets before they go out, performed by
a qdisc
• Shaping: The process of delaying packets before they go out to make traffic
conform to a configured maximum rate. Shaping is performed on egress. (Dropping
packets to slow traffic down is also often called shaping.)
• Policing: Delaying or dropping packets in order to make traffic stay below a
configured bandwidth. In Linux, policing can only drop a packet and not delay
it - there is no ’ingress queue’.
• Work-Conserving: A work-conserving qdisc always delivers a packet if one is
available. In other words, it never delays a packet if the network adaptor is
ready to send one (in the case of an egress qdisc).
• Non-Work-Conserving: Some queues, like for example the Token Bucket Filter,
may need to hold on to a packet for a certain time in order to limit the
bandwidth. This means that they sometimes refuse to give up a packet, even though
they have one available.

4.3. Putting it all together

4.4. Queueing Disciplines
4.4.1. Some Points of Attention
• Determine the way in which data is SENT
• Classful qdiscs: very useful if you have different kinds of traffic needing
different treatment
– Traffic enters qdisc, filters are consulted and the traffic is classified
– filters are called from within a qdisc and return with a decision, filters don’t
call qdiscs!
– A qdisc is identified by a handle: X:0, or X: for short
– A class is identified by a handle: X:Y - a class belonging to qdisc X:0 (same
major number as parent)
– A filter points traffic flow to a class, not to another qdisc (no errors but
won’t work)
– Most classful qdiscs also perform shaping (besides containing other qdiscs)
– Each interface has one root qdisc
– Here’s a typical hierarchy:
            root 1:
               |
             _1:1_
            /  |  \
           /   |   \
          /    |    \
        10:   11:   12:
       /   \       /   \
    10:1  10:2   12:1  12:2

– Packets are only enqueued downwards! When dequeued, they go up again,
where the interface lives. They do NOT fall off the end of the tree to the
network adaptor!
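
tc reads the major and minor parts of a handle as hexadecimal numbers, so class 1:10 has minor number 0x10, not decimal 10. The small sketch below shows how a handle splits into its two 16-bit halves. The function name classid_to_hex is made up for this illustration; it is not part of tc.

```shell
#!/bin/sh
# Sketch: show the 32-bit value behind a tc handle (MAJOR:MINOR, both hex).
# classid_to_hex is a hypothetical helper name for this example.
classid_to_hex() {
    major=${1%%:*}
    minor=${1#*:}
    # "10:" is shorthand for "10:0", i.e. the qdisc itself
    [ -n "$minor" ] || minor=0
    printf '0x%08x\n' $(( (0x$major << 16) | 0x$minor ))
}

classid_to_hex 1:      # root qdisc from the hierarchy above
classid_to_hex 1:1     # its single child class
classid_to_hex 10:2    # class 2 of qdisc 10:
```

A class shares its upper 16 bits with the qdisc it belongs to, which is the "same major number as parent" rule.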

4.4.2. General syntax
tc qdisc [ add | del | replace | change | get ] dev STRING
         [ handle QHANDLE ] [ root | ingress | parent CLASSID ]
         [ estimator INTERVAL TIME_CONSTANT ]
         [ [ QDISC_KIND ] [ help | OPTIONS ] ]

Or:

tc qdisc show [ dev STRING ] [ ingress ]

4.4.3. pfifo_fast
• Default queueing discipline for a kernel that has been compiled with
CONFIG_IP_ADVANCED_ROUTER; (a normal host has a FIFO queue)
• Check out ip address list
• Change the queue length:

ifconfig eth0 txqueuelen 200

or:

ip link set dev eth0 txqueuelen 200

• Queue, divided in three bands; every band has a FIFO queue
• As long as there are packets in band 0, band 1 is not processed; likewise for
bands 1 and 2.
• Kernel puts packets in one of three bands depending on TOS value
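
The band is chosen through a priority map: the kernel derives a packet priority (0..15) from the TOS value, and the priomap turns that priority into a band. The sketch below applies the default pfifo_fast priomap as documented for tc. The function name prio_to_band is invented for this illustration, and the TOS-to-priority step is left out.

```shell
#!/bin/sh
# Sketch: map a Linux packet priority (0..15) to a pfifo_fast band.
# PRIOMAP holds the default map; prio_to_band is a hypothetical helper.
PRIOMAP="1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1"

prio_to_band() {
    # Positional parameters become: PRIORITY followed by the 16 map entries.
    set -- "$1" $PRIOMAP
    # Drop the priority itself plus the entries before the wanted one.
    shift "$(( $1 + 1 ))"
    echo "$1"
}

prio_to_band 0   # best effort
prio_to_band 2   # bulk
prio_to_band 6   # interactive
```

Interactive traffic lands in band 0 and is always dequeued first, bulk traffic waits in band 2.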

4.4.4. pfifo & bfifo queues
• Packet- or byte-limited FIFO queue
• Syntax:
# tc qdisc ... add pfifo [ limit packets ]
# tc qdisc ... add bfifo [ limit bytes ]

• Advantage: it’s possible to get statistics, which is not possible with the default pfifo_fast

4.4.5. Token Bucket Filter (TBF)
General Principle
• Classless
• Non-work-conserving
• Doesn’t reorder packets ⇒ throttles traffic
• Analogy (~ netfilter "limit" match):

Syntax
tc qdisc ... tbf rate RATE burst BYTES/CELL ( latency MS | limit BYTES )
[ mpu BYTES [peakrate RATE mtu BYTES/CELL ] ]

Parameters
parameter   meaning
rate        Link speed. Actual speed will be a little higher
burst       a.k.a. buffer or maxburst; size of the bucket, in bytes.
            Larger shaping rates require a larger buffer
latency     Maximum amount of time a packet can sit in the TBF queue (~ limit)
limit       Number of bytes that can be sitting in the queue waiting for tokens to
            become available (~ latency)
mpu         Minimum packet unit
peakrate    Maximum depletion rate of the bucket
mtu         a.k.a. minburst; size of the peakrate bucket
Example

# tc qdisc add dev eth0 root tbf rate 0.5mbit burst 5kb latency 70ms peakrate 1mbit \
    minburst 1540
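
How rate and burst interact is easiest to see in a toy simulation. The sketch below models only the token bucket itself: tokens drip in at rate, the bucket holds at most burst bytes, and a packet leaves once enough tokens are available. It ignores latency/limit (so nothing is ever dropped) and peakrate, and it is not the kernel's implementation. The name tbf_sim and the input format ("ARRIVAL_MS SIZE_BYTES" per line) are assumptions for this example.

```shell
#!/bin/sh
# Toy token bucket: tbf_sim RATE_BYTES_PER_SEC BURST_BYTES
# Reads "ARRIVAL_MS SIZE_BYTES" lines, prints when each packet may leave.
# Simplified sketch: no queue limit, no peakrate, integer milliseconds.
tbf_sim() {
    rate=$1
    burst=$2
    tokens=$burst   # the bucket starts full
    clock=0         # time (ms) up to which tokens have been accounted for
    while read -r t size; do
        # FIFO: a packet cannot leave before the one queued ahead of it.
        [ "$t" -gt "$clock" ] || t=$clock
        # Refill for the elapsed time, capped at the bucket size.
        tokens=$(( tokens + rate * (t - clock) / 1000 ))
        [ "$tokens" -le "$burst" ] || tokens=$burst
        clock=$t
        if [ "$tokens" -lt "$size" ]; then
            # Not enough tokens: wait (rounded up) until they have dripped in.
            wait_ms=$(( ((size - tokens) * 1000 + rate - 1) / rate ))
            clock=$(( clock + wait_ms ))
            tokens=$(( tokens + rate * wait_ms / 1000 ))
        fi
        tokens=$(( tokens - size ))
        echo "$size bytes leave at ${clock}ms"
    done
}

# 0.5mbit is 62500 bytes/s; 5kb is taken as 5120 bytes here.
printf '0 1500\n0 1500\n0 1500\n0 1500\n' | tbf_sim 62500 5120
```

With these numbers the first three packets fit in the initial burst and leave immediately; the fourth has to wait about 15 ms for tokens.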

4.4.6. Stochastic Fairness Queueing (SFQ)
General Principle
• Classless
• Work-conserving
• Doesn’t shape ⇒ reorders packet transmission based on ”flows”
• Goal is to obtain fairness: every flow is able to send data in turn
• Only useful when it ”owns the queue” ⇒ embed it in a classful qdisc
• Can create high latency for interactive traffic ⇒ Change
#define SFQ_DEPTH 128 in net/sched/sch_sfq.c
• Note: ESFQ (http://www.ssi.bg/~alex/esfq/index.html): uses only
destination and/or source addresses to put the packets in internal classes
Syntax

tc qdisc ... sfq [ perturb SECONDS ] [ quantum BYTES ]

Parameters

parameter   meaning
perturb     Interval (seconds) for queue algorithm perturbation
quantum     Amount of bytes a stream is allowed to dequeue before
            the next queue gets a turn (don’t set below MTU)
Example

# tc qdisc add dev ppp0 root sfq perturb 10

4.4.7. Random Early Detect/Drop (RED)
General Principle
• Think HTB, but smarter
• A regular queue drops packets at the end of its queue ⇒ result is "burstiness"
as end-hosts try to adjust sending rates
• Solution ⇒ manage queue intelligently:
– Don’t wait until the queue is full before dropping packets
– Start dropping packets gradually as queue gets fuller
• Good for backbones (no session state), where SFQ burns too much CPU
Syntax

tc qdisc ... red limit BYTES min BYTES max BYTES avpkt BYTES
burst PACKETS [ecn] [bandwidth RATE] probability CHANCE

Parameters

parameter    meaning
min          Average queue size in bytes at which marking becomes a possibility.
             Set to highest acceptable base latency * bandwidth
max          At this average queue size, the marking probability
             is maximal. Set to at least twice the min value
burst        Used for determining how fast the average queue size is
             influenced by the real queue size. A large value allows
             longer bursts of traffic before marking starts.
             Try (min+min+max)/(3*avpkt)
limit        Hard limit on the real (not average) queue size in
             bytes. When limit is reached, ’tail-drop’ is used (even with ECN).
             Set to at least max+burst
probability  Maximum probability for marking. Use 0.01 or 0.02
avpkt        Average packet size. Use 1000 for a 1500 bytes MTU link
bandwidth    Rate used for calculating the average queue
             size after some idle time
ecn          Allows RED to notify ECN-capable remote hosts that their rate
             exceeds the available bandwidth by marking packets instead of
             dropping them; non-ECN hosts can only be notified by a drop.
             Requires patched tc binary

Example
# tc qdisc add dev eth0 root red limit 60KB min 15KB max 45KB burst 20 avpkt 1000 \
    bandwidth 10Mbit probability 0.4
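
The rules of thumb from the parameter table can be turned into a small calculator. The sketch below only applies those rules: min = latency * bandwidth, max = 2 * min, burst = (min+min+max)/(3*avpkt). For limit it converts burst from packets to bytes before adding it to max; that conversion is an assumption of this sketch, chosen so that the result is at least max+burst. The function name red_params is made up for this example.

```shell
#!/bin/sh
# Sketch: derive RED parameters from the slide's rules of thumb.
# Usage: red_params BANDWIDTH_BITS_PER_SEC LATENCY_MS AVPKT_BYTES
# red_params is a hypothetical helper; integer arithmetic rounds down.
red_params() {
    bw=$1
    latency_ms=$2
    avpkt=$3
    min=$(( bw / 8 * latency_ms / 1000 ))        # bytes queued at that latency
    max=$(( 2 * min ))                           # "at least twice min"
    burst=$(( (2 * min + max) / (3 * avpkt) ))   # in packets
    limit=$(( max + burst * avpkt ))             # assumption: burst in bytes
    echo "limit $limit min $min max $max avpkt $avpkt burst $burst"
}

# 10 Mbit/s link, 12 ms acceptable base latency, 1000-byte average packet
red_params 10000000 12 1000
```

For the 10 Mbit/s case this prints min 15000 and burst 20, the same values used in the example above. The printed fragment can be pasted after tc qdisc ... red, together with a bandwidth and probability of your choosing.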

4.4.8. Generic Random Early Detection (GRED)
Basic Principle:
• Not a lot known about it
• Based on RED, with several internal queues
• DiffServ tcindex field is used to choose internal queue
• Each virtual queue can have its own Drop Parameters specified
• Can be configured to act like Cisco’s WRED
Syntax

tc qdisc ... gred DP DROP-PROBABILITY limit BYTES min BYTES max BYTES
avpkt BYTES burst PACKETS probability PROBABILITY bandwidth KBPS
or:

tc qdisc ... gred setup DPs NUM-OF-DPS default DEFAULT-DP [prio]

4.4.9. True/Trivial Link Equalizer (TLE/TEQL)
General Principle:
• Used for load sharing over multiple interfaces
Syntax/Example:

# tc qdisc add dev eth1 root teql0
# tc qdisc add dev eth2 root teql0
# ip link set dev teql0 up

4.4.10. N-Band Priority Queue Scheduler (PRIO)
General Principle
• Simple classful queueing discipline
• Contains an arbitrary number of classes of differing priority
• Dequeueing: first band 0, then band 1, ...
• Classes are automatically created (since number of bands = number of classes)
and numbered X:1, X:2, ...
• Work-conserving qdisc, but children may be Non-work-conserving
• Very useful for lowering latency while not slowing down traffic
• Determine band:
– From userspace: SO_PRIORITY
– tc filter
– priomap based on priority (derived from TOS)

Syntax
tc qdisc ... dev DEV ( parent CLASSID | root ) [ handle MAJOR: ] prio
[ bands BANDS ] [ priomap BAND,BAND,BAND... ]
[ estimator INTERVAL TIMECONSTANT ]

Parameters
parameter   meaning
bands       Number of bands
priomap     Map of priorities to bands; needed if bands is changed from its default value of 3
Example
# tc qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 1 1 1 \
    1 1 1 1 1
# tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 2mbit latency 50ms burst 8000
# tc qdisc add dev eth0 parent 1:2 handle 20: bfifo
# tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip sport 8001 0xffff \
    flowid 1:2
# tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip sport 8000 0xffff \
    flowid 1:1

4.4.11. Class-Based Queueing (CBQ)
General
• "sendmail" effect
• One of the first classful qdiscs under Linux, often mistaken for "the One" qdisc
to use.
• Complicated, avoid if possible, use htb (later)
Syntax
tc qdisc ... dev DEV ( parent CLASSID | root) [ handle MAJOR: ] cbq
[ allot bytes ] avpkt BYTES bandwidth RATE [ cell BYTES ]
[ ewma LOG ] [ mpu BYTES ]

4.4.12. Hierarchical Token Bucket (HTB)
General
• http://luxik.cdi.cz/~devik/qos/htb/
• Based on Token Bucket Filter, but with classes
• Finally in standard kernel since 2.4.20
• Author (Martin Devera aka Devik), by writing this qdisc and documenting it,
has allowed many to better grasp the concepts of Linux Traffic Control
• What can I say ? Works very well !
Syntax
tc qdisc ... dev DEV ( parent CLASSID | root) [ handle MAJOR: ] htb
[ default MINOR-ID ]

tc class ... dev DEV parent MAJOR:[MINOR] [ classid MAJOR:MINOR ] htb rate RATE
[ ceil RATE ] burst BYTES [ cburst BYTES ] [ prio PRIORITY ]

Parameters

parameter   meaning
default     Default classid to direct traffic to
rate        Speed knob, rate at which the class can send
ceil        Maximum rate at which the class can send by borrowing idle
            bandwidth from its parent
burst       Amount of bytes that can be burst at ceil speed, in
            excess of the configured rate
cburst      Amount of bytes that can be burst at ’infinite’ speed,
            i.e., as fast as the interface can transmit them
            (~ peakrate in TBF)
prio        Priority when borrowing from parent class
            (low=higher priority)

Example

# tc qdisc add dev eth0 root handle 1: htb default 12


# tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps
(We add one HTB class under the root qdisc so that its children can borrow from
their parent; classes directly under the root cannot borrow from each other)

# tc class add dev eth0 parent 1:1 classid 1:10 htb rate 30kbps ceil 100kbps
# tc class add dev eth0 parent 1:1 classid 1:11 htb rate 10kbps ceil 100kbps
# tc class add dev eth0 parent 1:1 classid 1:12 htb rate 60kbps ceil 100kbps

We can optionally add qdiscs to the leaf classes:


# tc qdisc add dev eth0 parent 1:10 handle 20: pfifo limit 5
# tc qdisc add dev eth0 parent 1:11 handle 30: pfifo limit 5
# tc qdisc add dev eth0 parent 1:12 handle 40: sfq perturb 10
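
In the example above the guaranteed rates of the three leaf classes (30 + 10 + 60) add up to the parent's rate of 100. A common recommendation is to keep it that way, since rate is a guarantee and the parent cannot hand out more than it has. Note that in tc, kbps means kilobytes per second; kilobits per second is written kbit. The sketch below checks the sum for a planned hierarchy. The function name check_htb_rates is made up for this example, and all rates must be given in the same unit.

```shell
#!/bin/sh
# Sketch: check that child rates do not exceed the parent's rate.
# Usage: check_htb_rates PARENT_RATE CHILD_RATE...
# check_htb_rates is a hypothetical helper, not part of tc.
check_htb_rates() {
    parent=$1
    shift
    total=0
    for r in "$@"; do
        total=$(( total + r ))
    done
    if [ "$total" -le "$parent" ]; then
        echo "ok: children guarantee $total of $parent"
    else
        echo "oversubscribed: children guarantee $total of $parent"
        return 1
    fi
}

# Rates of classes 1:10, 1:11 and 1:12 against parent 1:1
check_htb_rates 100 30 10 60
```

On a live system, tc -s class show dev eth0 shows how much each class actually sent and borrowed.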

4.4.13. Clark-Shenker-Zhang Scheduler (CSZ)
• David D. Clark, Scott Shenker & Lixia Zhang, "Supporting Real-Time Applications
in an Integrated Services Packet Network: Architecture and Mechanism"
• If you understand this, please let the rest of the world know :-)
• Provides guaranteed service to WFQ flows and best-effort service to the other
traffic

4.4.14. Diffserv Field Marker (DSMark)
• Qdisc that offers the capabilities needed in Differentiated Services (DiffServ, DS)
• Not discussed here

4.4.15. Ingress Queueing Discipline
• Not discussed here
• Can be used to do ping flood DoS limiting
• # tc qdisc add dev eth0 ingress
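
The ping flood limiting mentioned above could look roughly like this: a policing filter on the ingress qdisc that drops ICMP above a given rate. This is a sketch under assumptions. The 32kbit limit and 10k burst are arbitrary example values, and TC is a hypothetical dry-run switch (the commands are only printed unless TC=tc is set).

```shell
#!/bin/sh
# Dry run by default; assumption: set TC=tc (as root) to apply for real.
TC="${TC:-echo tc}"
DEV="${DEV:-eth0}"
ICMP_RATE="${ICMP_RATE:-32kbit}"   # example value, choose your own limit

limit_icmp_ingress() {
    # Attach the ingress qdisc; its handle is ffff:.
    $TC qdisc add dev "$DEV" handle ffff: ingress
    # Match ICMP (IP protocol 1) and drop whatever exceeds the rate.
    $TC filter add dev "$DEV" parent ffff: protocol ip prio 50 u32 \
        match ip protocol 1 0xff \
        police rate "$ICMP_RATE" burst 10k drop flowid :1
}

limit_icmp_ingress
```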

4.4.16. Other Points of Attention
• Apache: mod_throttle
• Squid: Delay Pools
• Weighted Round Robin (WRR) (not in kernel): http://wipl-wrr.dkik.dk/wrr/
• ESFQ (not in kernel): http://www.ssi.bg/~alex/esfq/index.html
• IMQ (Intermediate Queueing Device) (not in kernel): http://luxik.cdi.cz/~patrick/imq/
• "The Wondershaper" on http://www.lartc.org
• CBQ.init: ftp://ftp.equinox.gu.net/pub/linux/cbq/
• HTB.init: http://sourceforge.net/projects/htbinit/
• tcng

4.5. Filters for Packet Classification
4.5.1. Basic Filtering Examples
• # tc filter add dev eth0 protocol ip parent 10: prio 1 u32 match ip dport 22 0xffff flowid 10:1
• Catch-all for everything else (a classifier must be named, so u32 with a match-all selector is used):
# tc filter add dev eth0 protocol ip parent 10: prio 2 u32 match ip dst 0.0.0.0/0 flowid 10:2
• # tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip dst 4.3.2.0/24 match ip dport 80 0xffff flowid 10:1
• Match ICMP (IP protocol 1):
# tc filter add dev eth0 parent 10:0 protocol ip u32 match ip protocol 1 0xff
• Filter on FWMARK 6:
# tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 6 fw flowid 1:1

4.5.2. Advanced Filters
An incomplete list:
• fw: firewall mark
• u32: based on fields within packet
• route: based on which route the packet will be routed to
• rsvp, rsvp6: Based on resource reservation protocol
• tcindex: used in DSMARK qdisc
General arguments to classifiers:
• protocol: protocol the classifier will accept
• parent: handle this classifier is attached to
• prio: priority (lower=tested first)
• handle: means different things to different filters
General Syntax:

tc filter add dev DEV [ protocol PROTO ] [ ( preference | prio ) PRIO ]
parent CBQ

4.5.3. u32 Classifier
• Most advanced filter available
• Based on hashing, robust & scalable
• Selector & action principle, action is often a direction to a class
• u32 selector: describes which bits are to be matched in the packet header:

# tc filter add dev eth0 protocol ip parent 1:0 pref 10 u32 match u32 00100000 00ff0000 at 0 flowid 1:10
# tc filter add dev eth0 protocol ip parent 1:0 pref 10 u32 match u32 00000016 0000ffff at nexthdr+0 flowid 1:10

• Hashing filters: not discussed here
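
The two u32 selectors above boil down to one comparison per 32-bit word: the word at the given offset is ANDed with the mask and compared with the value. The sketch below replays that arithmetic in the shell. The header words are made-up sample data, and u32_match is a hypothetical helper, not a tc command. Note that tc reads the value and mask as hexadecimal, while the shell needs the 0x prefix.

```shell
#!/bin/sh
# u32_match WORD VALUE MASK: succeed when (WORD & MASK) == VALUE.
# Hypothetical helper that mirrors what one u32 key tests.
u32_match() {
    [ $(( $1 & $3 )) -eq $(( $2 )) ]
}

# Sample first word of an IPv4 header (offset 0):
# version 4, IHL 5, TOS 0x10, total length 0x0054.
ip_word=0x45100054
if u32_match "$ip_word" 0x00100000 0x00ff0000; then
    echo "TOS byte is 0x10: first filter matches"
fi

# Sample first word of a TCP header (nexthdr+0):
# source port 0x1234, destination port 0x0016 (22).
tcp_word=0x12340016
if u32_match "$tcp_word" 0x00000016 0x0000ffff; then
    echo "destination port is 22: second filter matches"
fi
```

So the first filter selects on the TOS byte and the second on destination port 22, which is what the friendlier "match ip tos" and "match ip dport" selectors express.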

4.5.4. General Selector
Syntax:
match [ u32 | u16 | u8 ] PATTERN MASK [ at OFFSET | nexthdr+OFFSET]

Examples:
• Packets with TTL=64:

# tc filter add dev ppp14 parent 1:0 prio 10 u32 match u8 64 0xff at 8 flowid 1:4

• TCP packets < 64 bytes with only the ACK flag set (assumes a 20-byte IP header):

# tc filter add dev ppp14 parent 1:0 protocol ip prio 10 u32 \
    match ip protocol 6 0xff \
    match u8 0x05 0x0f at 0 \
    match u16 0x0000 0xffc0 at 2 \
    match u8 0x10 0xff at 33 flowid 1:3

• Because of the complexity, there are specific selectors: dport, sport, tos, ...
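
To see why the raw matches in the ACK example mean what they do, the same three tests can be replayed as shell arithmetic. This is only an illustration: small_ack is a hypothetical helper, its arguments are sample header fields, and the protocol check (IP protocol 6) is left out.

```shell
#!/bin/sh
# small_ack VERIHL TOTLEN TCPFLAGS
#   VERIHL    byte 0 of the IP header (version and header length)
#   TOTLEN    16-bit total length found at offset 2
#   TCPFLAGS  byte 33 of the packet: the TCP flags when the IP
#             header is 20 bytes long (20 + 13)
small_ack() {
    # "match u8 0x05 0x0f at 0": IHL is 5 words, so no IP options.
    [ $(( $1 & 0x0f )) -eq 5 ] || return 1
    # "match u16 0x0000 0xffc0 at 2": no bit worth 64 or more is set,
    # so the total length is below 64 bytes.
    [ $(( $2 & 0xffc0 )) -eq 0 ] || return 1
    # "match u8 0x10 0xff at 33": the flags byte is exactly ACK.
    [ $(( $3 & 0xff )) -eq 16 ]
}

small_ack 0x45 52 0x10 && echo "52-byte pure ACK: match"
small_ack 0x45 1500 0x10 || echo "1500-byte segment: no match"
small_ack 0x45 52 0x18 || echo "ACK+PSH: no match"
```

Because the last mask is 0xff, a packet carrying ACK together with another flag does not match.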

4.5.5. route Classifier
• Filters based on results of routing tables

4.5.6. rsvp, rsvp6 Classifier


• Filters based on RSVP protocol info

4.5.7. fw Classifier
• For beginners, probably the best classifier
• Mark packets in the iptables mangle table
• Filter based on FWMARK:

# tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 6 fw flowid 1:1
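
The marking side of this could be sketched as follows. The incoming interface eth0 and the choice to mark every packet arriving on it are assumptions for illustration; IPT and TC are hypothetical dry-run switches, so the commands are only printed by default.

```shell
#!/bin/sh
# Dry run by default; assumption: set IPT=iptables and TC=tc (as root)
# to apply the rules for real.
IPT="${IPT:-echo iptables}"
TC="${TC:-echo tc}"

mark_and_classify() {
    # Give packets arriving on eth0 the firewall mark 6 ...
    $IPT -t mangle -A PREROUTING -i eth0 -j MARK --set-mark 6
    # ... and let the fw classifier on eth1 send mark 6 to class 1:1.
    $TC filter add dev eth1 protocol ip parent 1:0 prio 1 handle 6 fw flowid 1:1
}

mark_and_classify
```

PREROUTING only sees packets that enter the box, so locally generated traffic would have to be marked in the OUTPUT chain of the mangle table instead.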

4.5.8. Policing filters
• Most of the time, the action associated with a filter selector will be to direct traffic
to a flowid
• It’s possible to do other stuff !
– Filters that only match up to a certain bandwidth
– Filters that only match traffic exceeding a certain bandwidth
• You can drop traffic, reclassify it, or see if another filter will match it
• If I want to police, how do I estimate bandwidth ?
– kernel estimator (.config option); avrate parameter
– token bucket filter (TBF) inside the filter; rate and burst parameters
• overlimit actions:
– continue: causes filter not to match
– drop
– pass/OK
– reclassify: default action
– Note: no way to delay packets!
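
A filter that only matches up to a certain bandwidth might be sketched like this, using the token bucket policer with the continue action. The classes 1:10 and 1:12, the port and the 1mbit/20k figures are assumptions chosen for illustration, and TC is a hypothetical dry-run switch (commands are printed unless TC=tc is set).

```shell
#!/bin/sh
# Dry run by default; assumption: set TC=tc (as root) to apply for real.
TC="${TC:-echo tc}"
DEV="${DEV:-eth0}"

police_web() {
    # Web traffic matches here only while it stays under 1mbit;
    # beyond that, "continue" makes this filter not match.
    $TC filter add dev "$DEV" parent 1:0 protocol ip prio 1 u32 \
        match ip dport 80 0xffff \
        police rate 1mbit burst 20k continue flowid 1:10
    # The excess is then picked up by this later filter.
    $TC filter add dev "$DEV" parent 1:0 protocol ip prio 2 u32 \
        match ip dport 80 0xffff flowid 1:12
}

police_web
```

Replacing continue with drop would discard the excess instead of handing it to the next filter.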

5. More Info
5.1. URLs
• http://www.lartc.org: Linux Advanced Routing & Traffic Control
• http://www.policyrouting.org: Policy Routing Using Linux
• http://tcng.sourceforge.net: a more user-friendly interface to tc (not discussed here)
• http://luxik.cdi.cz/~devik/qos/: Devik’s homepage, HTB info is very good
• http://www.docum.org/: Stef Coene’s Traffic Control page

5.2. Books
• Policy Routing Using Linux, Matthew G. Marsh

5.3. Mailing Lists
• lartc@mailman.ds9a.nl
• diffserv-general@lists.sourceforge.net
• netfilter@lists.netfilter.org

5.4. IRC Channel


• #lartc on irc.oftc.net

