(320301)
Jürgen Schönwälder
The lecture Networks and Protocols is an introduction to the foundations of packet-switched
data communication networks. The lecture covers widely deployed Internet technologies and the
IEEE 802 standards for local area networks.
The material covered in this lecture deliberately focuses on widely deployed technologies and
protocols. This approach has the disadvantage of ignoring some interesting alternative
technologies which did not become widely deployed for whatever (often non-technical) reason.
However, the advantage of this approach is that more time can be spent discussing the selected
technologies in some detail and enabling and encouraging students to experiment with their own
network infrastructure. This usually increases both the understanding of the material and the
motivation.
Some parts of the lecture notes date back to a lecture called Introduction to Operating Systems
and Networks which I have given at the Technical University Braunschweig. These notes were
later heavily revised and extended for a lecture called Computer Networks which I have given at
the University of Osnabrück. Some parts of these lecture notes are heavily influenced by standard
textbooks such as [1, 2, 3, 4, 5, 6, 7] while other parts are directly derived from the relevant
standards. Students who want to understand the discussed protocols in even more detail are strongly
encouraged to read the relevant parts of the standards which are referenced throughout the text.
My thanks go to the many students who asked critical questions and provided constructive feedback
which improved the presentation and reduced the number of errors and inconsistencies.
Jürgen Schönwälder
Contents
1 Introduction
1.1 Fundamental Concepts
1.1.1 Services
1.1.2 Protocols
1.1.3 Names and Addresses
1.1.4 ISO/OSI Reference Model
1.1.5 Internet Reference Model
1.2 Standardization
1.2.1 ISO Standardization
1.2.2 Internet Standardization
1.2.3 IEEE Standardization
B Sockets
B.1 Socket Addresses
B.2 Communication Kinds
B.3 Socket API Overview
B.4 Name Resolution
B.5 Multiplexing
Chapter 1
Introduction
1.1 Fundamental Concepts
This section discusses some architectural concepts and introduces some basic terms. A very
fundamental approach to dealing with complex systems is to divide them into subsystems with
well-defined interfaces between those subsystems. Accordingly, networks are usually designed as
layered systems where each layer is responsible for providing a certain function. The layering
principle is a very fundamental principle for structuring communication systems.
1.1.1 Services
The abstract set of functions provided by a given network is called the service realized by the
network. The service provided by a network is usually defined in abstract terms to allow for
multiple concrete programming interfaces.
A service is accessed through one or more service primitives. Typical ISO/OSI service primitives
are request, indication, response and confirm.
The interface, which is used to access the service primitives, is called a service access point.
Services are realized by so-called (protocol) instances. The instances at layer N of a layered
system are accordingly called N-instances.
Strict layering requires that N-instances may only use services realized by (N-1)-instances
to realize the layer N service.
1.1.2 Protocols
A protocol is a set of functions which together realize a well-defined communication service (e.g.,
error-free ordered transmission of data from a sender to a receiver).
Protocols define the format and the semantics of the protocol data units (PDUs) exchanged
between communicating parties.
Protocols specifically define the rules which have to be followed when creating or processing
protocol data units.
Specialized protocols have been developed for various application domains. Note that a cer-
tain service can be realized by multiple different protocols.
The specification of protocols can be either informal (plain English text) or formal. Formal
protocol specifications usually use specification languages such as Lotos, Estelle or SDL
which have been developed for this purpose.
1.1.3 Names and Addresses
Protocol instances are usually identified by some sort of address. Addresses for protocol
instances that exist only once on a certain system are often also used to identify whole systems. In
addition, more human-friendly names are often used and mapped to addresses as needed.
Names often have variable length and the name space is usually structured hierarchically.
Addresses on the other hand are identifications of protocol instances which are optimized
for machine processing. A typical example is an Internet Protocol (IP) address such as
212.201.48.1.
Addresses usually have a fixed length and are relatively compact since they are frequently
transmitted.
ISDN Addresses
The Integrated Services Digital Network (ISDN) is the digital telecommunication network which is
widely available in Europe. ISDN addresses, which identify telecommunication equipment such as
phones, are structured according to the E.164 numbering plan defined by the ITU:
An international ISDN phone number consists of a maximum of 15 digits. The first digits
contain the country code, followed by a national region code followed by the phone number
within that region.
An international ISDN phone number can be followed by a target identifier of up to 40
digits.
The common notation for international ISDN phone numbers starts with a + symbol followed
by digits which can be grouped into blocks by using white space or other separator characters.
An example would be +49 241 200 3587.
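The notation rules above can be checked mechanically. The following Python sketch is illustrative only: the function name parse_e164 and the set of accepted separator characters are assumptions, and it checks nothing beyond the leading + symbol and the limit of 15 digits stated above.

```python
import re


def parse_e164(number: str) -> str:
    """Return the bare digits of a number written like '+49 241 200 3587'.

    Hypothetical helper: the accepted separators (white space, '-', '.',
    '/', parentheses) are an assumption made for this sketch.
    """
    if not number.startswith("+"):
        raise ValueError("international notation starts with '+'")
    # Drop the separator characters used to group digits into blocks.
    digits = re.sub(r"[\s\-./()]", "", number[1:])
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError("only digits and separators are allowed")
    if len(digits) > 15:
        raise ValueError("an international number has at most 15 digits")
    return digits


print(parse_e164("+49 241 200 3587"))  # prints 492412003587
```

The sketch does not try to split the digits into country code, region code and subscriber number, since that requires the numbering plan tables maintained by the ITU and the national authorities.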
Internet Addresses
Internet network layer addresses have a fixed size. Depending on the protocol version (IPv4 or
IPv6), these addresses are either 4 bytes or 16 bytes long.
Four byte IPv4 addresses are typically written as four decimal numbers separated by dots
where every decimal number represents one byte (dotted quad notation). A typical example
is the IPv4 address 212.201.48.1.
Sixteen byte IPv6 addresses are typically written as a sequence of hexadecimal numbers
separated by colons (:) where every hexadecimal number represents two bytes. Leading
zeros can be omitted and two consecutive colons can represent a sequence of zero-valued
numbers. For example, the IPv6 address 1080:0:0:0:8:800:200C:417A can be written somewhat
shorter as 1080::8:800:200C:417A. IPv6 addresses which contain IPv4 addresses can
be written by using the dotted quad notation for the IPv4 address portion. For example, the
IPv6 address 0:0:0:0:0:0:0D01:4403 can be written as ::0D01:4403 as well as
::13.1.68.3.
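The notations described above can be explored with the ipaddress module of the Python standard library. This is a small illustrative sketch, not part of the original notes; it uses only the example addresses from the text.

```python
import ipaddress

# IPv4: four bytes, written in dotted quad notation.
v4 = ipaddress.IPv4Address("212.201.48.1")
print(len(v4.packed))   # prints 4

# IPv6: sixteen bytes, written as colon-separated hexadecimal numbers.
v6 = ipaddress.IPv6Address("1080:0:0:0:8:800:200C:417A")
print(len(v6.packed))   # prints 16
print(v6.compressed)    # prints 1080::8:800:200c:417a
print(v6.exploded)      # prints 1080:0000:0000:0000:0008:0800:200c:417a

# The embedded IPv4 notation denotes the same sixteen bytes.
a = ipaddress.IPv6Address("0:0:0:0:0:0:0D01:4403")
b = ipaddress.IPv6Address("::13.1.68.3")
print(a == b)           # prints True
```

Note that the module prints hexadecimal digits in lower case; upper and lower case digits denote the same address.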
Further details about IPv6 addresses can be found in RFC 3513 [8]. A more compact representation
of IPv6 addresses can be found in RFC 1924 [9] (recommended reading). In this context, RFC 1925
is also highly recommended background reading.
IEEE 802 Addresses
IEEE 802 addresses, sometimes also called MAC addresses, are usually 6 bytes (48 bits) long.
(There are also 2 byte (16 bit) IEEE 802 addresses, which however do not play a significant role.)
The common notation for IEEE 802 addresses is a sequence of hexadecimal numbers (one
number for each address byte) where the numbers are separated from each other using colons
or hyphens. Typical examples are 00:D0:59:5C:03:8A or 00-D0-59-5C-03-8A.
The first bit of an IEEE 802 address transmitted on the wire indicates whether it is a normal
unicast address (0) or a multicast address (1). In the notation above, this is the least
significant bit of the first byte. The broadcast address, which represents all stations within a
broadcast domain, consists of 48 one bits.
The second bit transmitted (the second least significant bit of the first byte) defines whether
it is a local address (1) or a global address (0). A local address is assigned administratively
and is only unique within this administrative region while global addresses are globally unique.
Globally unique IEEE 802 addresses are created by vendors who have to apply for a number
space from the IEEE. The vendor then assigns a unique number taken from the address space
delegated to it. It is thus possible to identify the vendor of a network device by looking up
the vendor code (the first three bytes) in a number space delegation list.
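The structure of an IEEE 802 address can be illustrated with a short Python sketch. The function names are hypothetical. The sketch relies on the fact that Ethernet transmits every byte least significant bit first, so that in the usual hexadecimal notation the unicast/multicast flag is the lowest bit (0x01) and the global/local flag the second lowest bit (0x02) of the first byte.

```python
def parse_mac(text: str) -> bytes:
    """Parse '00:D0:59:5C:03:8A' or '00-D0-59-5C-03-8A' into six bytes."""
    parts = text.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError("expected six address bytes")
    return bytes(int(part, 16) for part in parts)


def is_multicast(mac: bytes) -> bool:
    # Lowest bit of the first byte: 0 = unicast, 1 = multicast.
    return bool(mac[0] & 0x01)


def is_local(mac: bytes) -> bool:
    # Second lowest bit of the first byte: 0 = global, 1 = local.
    return bool(mac[0] & 0x02)


def vendor_code(mac: bytes) -> str:
    # The first three bytes identify the vendor of a global address.
    return ":".join(f"{byte:02X}" for byte in mac[:3])


mac = parse_mac("00:D0:59:5C:03:8A")
print(is_multicast(mac), is_local(mac), vendor_code(mac))
# prints False False 00:D0:59
```

Mapping the vendor code to a vendor name requires the delegation list published by the IEEE, which is not included here.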
Internet addresses are optimized for machine processing and storage and not necessarily for human
memory. This led to the introduction of names which are more oriented towards the requirements
of human beings.
[Figure: hierarchical DNS name space below a virtual root]
The Domain Name System (DNS) defines a distributed hierarchical name space which in
particular supports the delegation of name assignments.
In many cases, the structure of the DNS name space reflects the organizational structure of
the organization which maintains the relevant part of the DNS name space.
When using DNS names to refer to a node on the Internet, a process called name resolution
is performed which translates the DNS name to one or more IP addresses.
The traditional and widely deployed DNS does not support internationalized domain names.
A special encoding has therefore been defined recently to support internationalized domain
names without any changes to the DNS infrastructure.
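The text does not name the special encoding. The following Python sketch assumes that it is the IDNA scheme based on Punycode, which the idna codec of the Python standard library implements; the helper names and the example name are illustrative.

```python
def to_ascii(name: str) -> str:
    """Encode an internationalized name into the ASCII form carried in the DNS."""
    return name.encode("idna").decode("ascii")


def to_unicode(name: str) -> str:
    """Decode the ASCII form back into the internationalized name."""
    return name.encode("ascii").decode("idna")


print(to_ascii("bücher.example"))           # prints xn--bcher-kva.example
print(to_unicode("xn--bcher-kva.example"))  # prints bücher.example
```

Since the encoded form uses only characters that were already legal in DNS names, name servers and resolvers can handle it without modification. Plain ASCII names pass through unchanged.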
1.1.4 ISO/OSI Reference Model
The ISO/OSI Reference Model is the classic layered model for communication networks which was
developed during the ISO work on the Open Systems Interconnection (OSI). Real networks usually
do not strictly follow the seven layer OSI model.
Physical Layer
[Figure: ISO/OSI reference model with the Application, Presentation, Session and Network layers of the communicating systems, grouped into an application system and a transport system above the medium]
Network Layer
Error detection and correction between sending and receiving network nodes.
Transport Layer
Session Layer
Presentation Layer
Data compression.
Application Layer
Examples: terminal emulation, management of name spaces, database access, network
management, electronic messaging systems, process and machine control, ...
1.1.5 Internet Reference Model
[Figure: Internet reference model, from the application down to the medium]
The Internet has been designed as a network which can be implemented on top of almost any
other communication network by making very few assumptions about the services provided
by the underlying communication networks. Accordingly, the layer below the Internet layer
(which basically corresponds to the network layer of the OSI reference model) is called a
subnetwork (see RFC 1149 [10] for an interesting example of a subnetwork).
The Internet Protocol (IP) provides a common basis which makes it possible to cross boundaries
imposed by various other network technologies.
The Internet Protocol can of course also be used as a subnetwork technology, which naturally
leads to so-called IP tunnels.
There are currently two protocols on the Internet network layer. The currently widely de-
ployed IP protocol is version 4 (IPv4). The IP protocol version 6 (IPv6) is slowly gaining
deployment and practical importance.
Internet protocols are often designed to simplify implementations (usually in software, even
though high-speed devices implement many protocols in hardware).
The Internet protocols are primarily designed for data communication (asynchronous, best-
effort) and only recent work tries to support voice and multi-media communication (iso-
chronous traffic and quality of service).
Implementations of many Internet protocols are freely available which helps to transfer the
protocols from research/development into actual products. Universities and research labs
traditionally play a big role as a melting pot and experimentation field for new protocols.
1.2 Standardization
The standardization of protocols creates unified network architectures supporting open (that is,
vendor-independent) communication. Vendor-specific protocols and architectures (e.g., SNA or
DECnet) have lost importance.
[Figure: activity plotted over time, showing the research and investment phases and the time for standardization]
Standardization itself is a complicated and in most cases a time consuming and thus expensive
process. However, once an open standard has been established, it can create an open competitive
market which leads to the development of high-quality products which are usually available at very
reasonable prices. However, only a very small fraction of the developed standards are actually
successful in terms of wide-spread deployment:
The success of a standard must be measured in the number of actually deployed interoperable
implementations.
One critical factor for the success of a standards activity is the timing.
There are many organizations which develop standards for communication networks. The most
important organizations and their standards processes are briefly introduced in the following sub-
sections.
1.2.1 ISO Standardization
ISO is a network of the national standards institutes of almost 150 countries, on the basis of
one member per country (ANSI for the USA, DIN for Germany), with a Central Secretariat
in Geneva, Switzerland, that coordinates the system.
The transition between these states requires majorities during voting processes and transitions
can be repeated multiple times.
Standards are identified by numbers. Different revisions of the same standard are published
under the same number. To distinguish the revisions, the year of the publication is usually
appended to the number of a standard.
The Open Systems Interconnection (OSI) maintains the standards which deal with commu-
nication in open (communication) systems.
1.2.2 Internet Standardization
The Internet Engineering Task Force (IETF) is responsible for the standardization of the
Internet protocols (RFC 3233 [11], RFC 2026 [12]).
Internet standards are usually developed by working groups (WGs) which are organized in
different areas (e.g., routing or transport).
Every area is usually led by two area directors (ADs). All the area directors together form
the Internet Engineering Steering Group (IESG), which has to approve all documents on the
standardization track.
1. Proposed Standard
2. Draft Standard
3. Internet Standard
The transitions between these states usually require rough consensus and running code.
Multiple interoperable independent implementations are required to move from Proposed
Standard to Draft Standard, and real-world deployment is required to move from Draft
Standard to Internet Standard.
All standards are published as so-called Requests for Comments (RFCs). Every RFC has
a unique number and RFCs are never changed after publication. Different revisions of a
standard thus have different RFC numbers. There are special documents which help to locate
the current RFC number for a given standard. Note: Not all RFCs are standards! There
are also informational and experimental RFCs as well as RFCs which document best current
practices.
The Internet Architecture Board (IAB) is a panel which looks at longer-term architectural
issues and sometimes gives advice to the IETF.
The Internet Research Task Force (IRTF) is an organization that exists in parallel to the IETF
and which looks at research questions, potentially preparing future standardization work.
Like the IETF, the IRTF is structured into research groups. The chairs of the research
groups together form the Internet Research Steering Group (IRSG).
1.2.3 IEEE Standardization
Standardization within the IEEE is organized and controlled by the IEEE-SA Standards Board. The
documents created by standardization activities fall into the following categories:
Guides discuss alternate approaches and can provide additional background information.
A new document (New) defines a standard which is not a revision of an already existing
standard.
An already existing standard can be updated and replaced by a document which is called a
Revision.
An existing standard can be extended by another document which can also make substantial
corrections. Such a document is called an Amendment.
The IEEE is called a sponsor and is responsible for the creation and process management of a
standardization project. A project starts by submitting a Project Authorization Request (PAR). The
IEEE-SA Standards Board is the board which decides whether a PAR is accepted. PARs are
evaluated by the New Standards Committee (NesCom).
Technical work takes place in so-called working groups and is finalized by a voting procedure
(ballot). It is generally desired to avoid negative votes by achieving consensus before the final ballot.
After a successful ballot, the draft of the new standard is submitted to the IEEE-SA Standards Board
for approval. The IEEE-SA Standards Board itself makes use of a Review Committee (RevCom)
which helps to review the documents and to form an opinion.
Chapter 2
IEEE 802 Local Area Networks
The IEEE 802.x standards have been under development since the middle of the 1980s. They
dominate the technology used in local area networks (LANs) and there is currently a trend to
use the IEEE 802.x specifications in metropolitan area networks (MANs) as well. Some of the IEEE
standards have also been approved as official ISO standards.
802.1 Bridging
802.1 Management
The currently most widely known standards are Ethernet (IEEE 802.3) and WaveLAN (IEEE
802.11). An IEEE standard for Bluetooth was approved in March 2002.
The IEEE 802.x standards cover the two lower layers of the OSI reference model. However, the
IEEE 802.x standards subdivide the OSI data link layer into two sub-layers:
The Logical Link Control (LLC) layer provides a service interface which is the same for all
IEEE 802 protocols. Protocols on the network layer (e.g., the Internet Protocol) use the
services provided by the LLC layer and thus work (in principle) over all IEEE 802.x protocols.
(In reality, there are sometimes differences with regard to the LLC layer service primitives
supported by a given IEEE 802.x technology that can affect the mapping of network layer
protocols.)
The Medium Access Control (MAC) layer defines the method used to access the media being
used.
12 CHAPTER 2. IEEE 802 LOCAL AREA NETWORKS
[Figure: IEEE 802 layers in relation to the OSI reference model. The OSI data link layer corresponds to the IEEE 802 Logical Link Control (LLC) and Media Access Control (MAC) sub-layers; the OSI physical layer corresponds to the IEEE 802 Physical (PHY) layer.]
The Physical (PHY) layer defines the physical properties for the various transmission media
that can be used with a certain IEEE 802.x protocol.
The split of the data link layer into two sub-layers was a very important decision which enabled
the IEEE to standardize very different media access technologies and protocols with a common data
link interface.
The Logical Link Control layer is modeled after the ISO service model and provides services that
are close to those offered by the HDLC protocol discussed in the second year lecture Operating
Systems and Networks. Note that not all services are realized by all existing IEEE 802.x protocols.
The IEEE 802.3 standard is probably better known as Ethernet. (The term Ethernet is usually used
synonymously for the IEEE 802.3 standards and the CSMA/CD technology in general, although this
is not really correct.) The Ethernet technology was developed in the 1970s at XEROX PARC [13] and
was later standardized with few changes by the IEEE [14]. The classic IEEE 802.3 network is a
1-persistent CSMA/CD network with a bandwidth of 1-10 Mbps.
2.2. ETHERNET (IEEE 802.3) 13
Since the IEEE 802.3 technology was very successful, the IEEE started efforts to define extensions
for 1 Gbps, 10 Gbps networks and so on. In June 2002, an IEEE standard for 10 Gbps Ethernets
was approved while the standard for 100 Gbps Ethernet is under development. The evolution of the
Ethernet standards is summarized in Table 2.1.
The physical layer of the IEEE 802.3 standard defines the transmission-related properties. The
following media and topologies are defined:
The different media have different signal propagation delays. The speed of light c is approximately
c ≈ 300000 km/s. The speed of the various media can be expressed relative to the speed of light as
shown in Table 2.3.
Table 2.3: Signal propagation speeds for various IEEE 802.3 physical layer media
The 10Base5 medium, a rather thick copper coax wire, was also known as yellow cable. Stations
were attached to a yellow cable by drilling a hole into the coax cable and sticking a needle into the
heart of the cable. The 10Base2 medium, also sometimes called cheaper net, was easier to deploy
since it was more flexible and stations were attached by means of so-called T-connectors. The downside
of this technology was that segments were more tightly limited in length and in the number of
stations that could be supported. The fiber optic medium on the other hand supported a much larger
distance, but was rather expensive to deploy.
The IEEE 802.3/Ethernet frame format is rather simple and shown in Figure 2.3.
[Figure 2.3: IEEE 802.3 frame format. Preamble: 7 bytes; length/type field: 2 bytes; data (network layer packet): 46-1500 bytes; total frame length: 64-1518 bytes]
The seven byte preamble consists of the bit pattern 10101010 (binary). This pattern together with
the Manchester Coding technique results in a periodic signal which allows the receiver to
synchronize to the speed of the sender.
The start-of-frame delimiter (SFD) has the bit pattern 10101011 (binary). The resulting signal change
at the end of the start-of-frame delimiter after the preamble indicates the start of a frame.
The source and destination address fields contain six byte IEEE MAC addresses.
The two byte type/length field contains either the length of the frame (value less than 0x600)
or the identification of the higher level protocol used by the data carried in the frame (value
greater than or equal to 0x600). Type numbers are maintained by the IEEE and are globally unique.
The data portion contains the actual payload, usually a packet of a network layer protocol. If
necessary, the frame will be filled with padding bytes to achieve a minimal frame length.
The end of the packet contains a four byte CRC frame checksum (CRC-32).
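The interpretation of the header fields described above can be sketched in Python. The sketch assumes that the preamble and the start-of-frame delimiter have already been stripped and that the destination address precedes the source address, which is the field order used on the wire; the function name parse_header is hypothetical.

```python
import struct

TYPE_THRESHOLD = 0x600  # values below are lengths, values at or above are types


def parse_header(frame: bytes):
    """Split the first 14 bytes of a frame into its header fields."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than the 14 byte header")
    # Two six byte addresses followed by a two byte field in network byte order.
    destination, source, value = struct.unpack("!6s6sH", frame[:14])
    kind = "length" if value < TYPE_THRESHOLD else "type"
    return destination, source, kind, value


# Illustrative frame: broadcast destination, type value 0x0800, 46 data bytes.
frame = bytes.fromhex("ffffffffffff00d0595c038a0800") + bytes(46)
destination, source, kind, value = parse_header(frame)
print(kind, hex(value))  # prints type 0x800
```

The type value 0x0800 used in the example frame is the number assigned to IPv4. The CRC at the end of the frame is not checked by this sketch.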
IEEE 802.3 uses the CSMA/CD medium access method. Figures 2.4 and 2.5 show the principal
logic that is used to send and receive frames.
[Figure 2.4: flowchart of the logic used to send a frame: start transmission, then branch on whether a collision has been detected (Y/N)]
The following parameters play a role for a classic 10 Mbps IEEE 802.3 network:
The slot time of 512 bit times equals twice the propagation delay plus some safety margin.
Between two successive frames, a minimum inter-frame gap of 96 bit times is required to
ensure that frame ends are properly recognized.
The minimal length of a frame is 64 bytes; the maximum length is 1518 bytes.
If a collision has been detected, a special jam signal is generated for the duration of 32 bit
times.
On the n-th retransmission, a uniformly distributed number R is chosen from the interval
[0, 2^k) with k = min(n, b) and the back-off limit b = 10. The station then waits R slot
times before it tries to transmit again.
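The choice of the random number R can be sketched in a few lines of Python. The function and constant names are illustrative, and the sketch assumes that R counts slot times of 512 bit times each, as is usual for the truncated binary exponential back-off.

```python
import random

SLOT_TIME_BITS = 512  # slot time in bit times
BACKOFF_LIMIT = 10    # back-off limit b


def backoff_slots(n: int, rng=random) -> int:
    """Choose R uniformly from [0, 2^k) with k = min(n, b)."""
    k = min(n, BACKOFF_LIMIT)
    return rng.randrange(2 ** k)


def slot_time_us(bit_rate_mbps: float) -> float:
    """Duration of one slot time in microseconds."""
    return SLOT_TIME_BITS / bit_rate_mbps


print(slot_time_us(10))  # prints 51.2

# Third retransmission: R is one of 0..7 slot times.
r = backoff_slots(3)
print(r, "slot times, i.e.", r * slot_time_us(10), "microseconds")
```

Because k stops growing at the back-off limit, the interval never contains more than 1024 values, no matter how many collisions a frame has already experienced.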
[Figure 2.5: flowchart of the logic used to receive a frame]
There are a number of special situations which can be recognized by the MAC layer:
Received frames with a non-integral number of bytes which fail the CRC test (alignment
errors).
Received frames with a legal length that fail the CRC test (frame check sequence (FCS)
errors).
Frames that could not be transmitted immediately since the medium was busy (deferred trans-
missions).
Frames which were transmitted successfully after a single collision (single collision frames).
Frames which were transmitted successfully after multiple collisions (multiple collision frames).
Frames which could not be transmitted due to continued collisions (excessive collisions).
Collisions detected after the slot time after the start of the transmission (late collisions).
Collisions that happen after the slot time are typically indications of wires that exceed the
maximum allowed length.
Not further specified MAC internal errors during the transmission of a frame (internal MAC
transmit errors).
Not further specified MAC internal errors during the receipt of a frame (internal MAC receive
errors).
Failure to listen to the carrier signal during a transmission (carrier sense errors).
Frames that exceed the maximum length of allowed frames (frame too long errors).
The classic 10 Mbps IEEE 802.3 standard allows a maximum wire length (including repeaters which
basically amplify the signal) of 2.5 km. This results in a maximum round-trip propagation delay
(including some delay in repeaters) of about 50 µs. At 10 Mbps, 50 µs corresponds to 500 bit times,
which leads, with some safety margin, to a minimum frame length of 512 bits.
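The arithmetic behind the minimum frame length can be reproduced with a short Python sketch. It reads the delay stated above as 50 microseconds and models the safety margin as rounding up to the next power of two; the rounding rule and the function names are assumptions made for illustration.

```python
def bits_in_flight(delay_us: int, rate_mbps: int) -> int:
    """Number of bits sent during delay_us microseconds at rate_mbps Mbps."""
    # 1 Mbps equals one bit per microsecond, so the units cancel.
    return delay_us * rate_mbps


def next_power_of_two(n: int) -> int:
    power = 1
    while power < n:
        power *= 2
    return power


bits = bits_in_flight(50, 10)
print(bits)                          # prints 500
print(next_power_of_two(bits))       # prints 512
print(next_power_of_two(bits) // 8)  # prints 64
```

The result of 64 bytes matches the minimal frame length of the classic 10 Mbps network. A sender that transmits at least this many bits is still sending when a collision at the far end of the network becomes visible to it.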
The objective of the development of the Fast-Ethernet standard was a data rate of 100 Mbps without
changes to the medium access mechanism. To achieve a higher bit rate, the maximum length of the
wire has to be reduced. Accordingly, the Fast-Ethernet wire length is limited to 100 m. This
relatively short length was acceptable since the developers envisioned the transition to star topologies
with twisted pair cables.
Fast-Ethernet can be used with twisted pair and fiber optic cables. The support of UTP Category 3
and 5 cables results in some peculiarities in the physical layer. The general advice, however, is to use
Category 5 cables (or higher).
The 100BaseT4 medium uses four twisted pairs while 100BaseTX uses two twisted pairs.
The Gigabit-Ethernet standard specified in IEEE 802.3z initially supported fiber optic media. Sup-
port for category 5 UTP cables was later added by the IEEE 802.3ab specifications.
Gigabit Ethernet can operate in half-duplex and full-duplex mode. In half-duplex mode, the protocol
still uses the CSMA/CD method. To make the use of CSMA/CD possible, the slot time has been
changed from 64 bytes to 512 bytes which means that packets smaller than 512 bytes are augmented
with a new carrier extension field following the CRC field. When operating in full-duplex mode,
the original IEEE 802.3 slot-time is used and frames are not augmented.
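The effect of the carrier extension on short frames can be illustrated with a small Python sketch. The function name is hypothetical and the sketch only computes the number of extension bytes described above.

```python
GIGABIT_SLOT_BYTES = 512  # half-duplex slot time expressed in bytes


def carrier_extension_bytes(frame_length: int, half_duplex: bool = True) -> int:
    """Number of extension bytes appended after the CRC field of a frame."""
    if not half_duplex:
        # Full-duplex operation does not augment frames.
        return 0
    return max(0, GIGABIT_SLOT_BYTES - frame_length)


print(carrier_extension_bytes(64))                     # prints 448
print(carrier_extension_bytes(1518))                   # prints 0
print(carrier_extension_bytes(64, half_duplex=False))  # prints 0
```

A minimum size frame of 64 bytes thus occupies the medium for 512 byte times in half-duplex mode, which shows how costly short frames become when CSMA/CD is retained at this speed.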
New installations usually use Gigabit Ethernet in full duplex mode where frames can be sent and
received simultaneously and where almost all the theoretically available bandwidth can be used to
transmit data.
The 10 Gigabit Ethernet specification IEEE 802.3ae is a full-duplex and fiber-only technology and
thus does not need the CSMA/CD medium access method anymore. There are two different physi-
cal layers specified: The LAN PHY layer is for local area networks while the WAN PHY layer has
an extended feature set compared to the LAN PHY layer.
The Wireless LAN (WaveLAN) standard specified in IEEE 802.11 is rather different from the IEEE
802.3 standards. It uses the MACA medium access method where small RTS/CTS frames are
exchanged before the data is actually transmitted.
Wireless LANs support two modes of operation. In the ad-hoc mode, stations are brought together
to form a network on the fly. An election algorithm is used to elect one station which serves as
the master while the other stations become slaves. The second mode assumes the presence of some
fixed network access points (also sometimes called base stations) with which mobile stations can
communicate.
The Bluetooth standard specified in IEEE 802.15 provides a wireless network technology for rather
small cells and is typically used to create wireless personal area networks. Typical Bluetooth devices
are PDAs or wireless headsets which can communicate with a PC or laptop system. Due to the
relatively small area covered by IEEE 802.15, it is possible to save quite some energy compared to
the IEEE 802.11 family of standards.
Port-based network access control as defined in IEEE 802.1X makes use of the physical access
characteristics of IEEE 802 LAN infrastructures in order to provide a means of authenticating and
authorizing devices attached to a LAN port that has point-to-point connection characteristics, and
of preventing access to that port in cases in which the authentication and authorization process fails.
A port in this context is a single point of attachment to the LAN infrastructure. Examples of ports in
which the use of authentication can be desirable include the ports of MAC Bridges (as specified in
IEEE 802.1D), the ports used to attach servers or routers to the LAN infrastructure, and associations
between stations and access points in IEEE 802.11 wireless LANs.
2.6 Bridges
Multiple IEEE 802 LAN segments can be interconnected by using so-called bridges. By using
bridges, it does not really matter which IEEE 802 technology is used in the segments that are to be
connected. Examples are big Ethernet LANs that consist of multiple Ethernet segments and also
include Wireless LAN segments.
[Figure: bridges B1, B2 and B3 interconnecting 10Base5, 10Base2, 100BaseT, IEEE 802.11 and IEEE 802.5 segments]
Bridges (sometimes also called layer two switches) have a number of advantages:
1. Different IEEE 802 LAN technologies (e.g., Ethernet, Token Ring, WLAN) can be intercon-
nected.
2. Geographically dispersed LAN segments can be connected by using different media in the
backbone segments (e.g., fiber) and the access segments (e.g., twisted pair).
3. Highly loaded LAN segments can be split into smaller segments which improves their per-
formance.
4. Bridges can improve the robustness of the network since errors are better localized (due to
smaller segments) and since bridges offer the possibility to have multiple redundant paths in
the network.
5. Bridges can improve to some extent the security of the network since traffic can be better
restricted to the shorter local LAN segments.
Bridges operate on the IEEE 802 LLC layer as shown in Figure 2.7 and this is the reason why
different IEEE 802 technologies can be crossed via a bridge.
[Figure: protocol stacks of two end systems (Network, MAC, PHY) connected through a bridge with an IEEE 802.3 MAC/PHY on one port and an IEEE 802.11 MAC/PHY on the other]
Figure 2.7: IEEE 802 bridge connecting an IEEE 802.3 and an IEEE 802.11 segment
Although conceptually simple, there are some issues one has to pay attention to:
Different LAN segments usually operate at different speeds in terms of bits per second. A
bridge connecting such segments should have some buffering capacity to handle traffic bursts
or peaks (but of course, every buffer has a limited size).
Different LAN segments may have different maximum frame sizes. A bridge receiving a
frame which exceeds the maximum frame size of the destination LAN segment can only drop
that frame.
Different LAN segments which operate at different speeds may confuse timers at higher
protocol levels that are not aware of the bridging situation.
Some LAN technologies are real-time capable while others are not.
Some LAN technologies signal the delivery of a frame to the sender while others do not.
There are two basic types of bridges: source routing bridges and transparent bridges. Both of them
are discussed in the next sections.
Source Routing Bridges assume that a sending station can distinguish between stations attached
to the local LAN segment and stations that are attached to remote LAN segments. If a frame has
to be sent to a station connected to a remote LAN segment, the sender first has to determine the
path to the remote LAN segment before sending the frame along this path. The path to follow is
actually encoded and sent along with the frame. A special protocol is used by the stations to locate
destination stations and to find suitable routes.
The advantage of this approach is that one can make efficient use of the available bandwidth by
utilizing redundant paths to the receiving station. The price is, however, increased complexity in
the end systems that participate in a source routing bridged network.
Transparent bridges (sometimes also called spanning tree bridges) do not need special software
on the stations nor do they need manual configuration. Instead, they adapt to their environment
automatically and are thus fully transparent from the view of the network user and (to some extent)
the network operator. The price for this is that not all available bandwidth in a bridged network can
be used to its full potential.
LAN segments are connected to transparent bridges through so-called ports. The simplest of all
transparent bridges has two ports. Today, it is not unusual to have bridges which have hundreds
of ports that are realized on multiple modules interconnected by a high-speed backplane network.
Many of the commercial products can be stacked so that a bridge can grow in the number of ports
and the number of IEEE 802 technologies supported on the ports.
[Figure: architecture of a transparent bridge with two ports (Port 1, Port 2), port management software, a bridge protocol entity, and a forwarding database that maps station addresses to port numbers]
Bridges can receive frames on multiple ports simultaneously. It is therefore necessary to have some
buffer space to hold incoming frames. The ports of a transparent bridge generally work in
promiscuous mode, which allows them to receive all frames on the segment and not only the frames
that are destined to the bridge.
A transparent bridge internally maintains a forwarding database which maps received destination
MAC addresses to outgoing port numbers.
When a frame has been received by a transparent bridge, the forwarding database is checked
to find an entry which matches the destination address contained in the received frame.
If a matching entry has been found and if the port number associated with the MAC address is
not equal to the port number from which the frame was received, then the frame is forwarded
to the port indicated by the forwarding database entry. The frame is discarded if the port
number of the forwarding database entry is identical to the port number from which the frame
was received.
If no matching entry can be found in the forwarding database, then the frame is forwarded to
all ports except the port from which the frame was received (flooding).
Many bridges also support a feature which allows network operators to configure that the
forwarding function is disabled for certain MAC addresses.
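The forwarding rules above can be sketched in a few lines of Python. This is only an illustration: the function name `forwarding_decision`, the dictionary used as forwarding database, and the example MAC addresses are assumptions made for this sketch and are not taken from any standard.

```python
def forwarding_decision(fdb, dst_mac, in_port, all_ports):
    """Return the list of ports on which a received frame is sent out."""
    out_port = fdb.get(dst_mac)
    if out_port is None:
        # No matching entry: flood to all ports except the receiving port.
        return [port for port in all_ports if port != in_port]
    if out_port == in_port:
        # Destination lives on the segment the frame came from: discard.
        return []
    # Matching entry pointing to a different port: forward to that port.
    return [out_port]


# Hypothetical forwarding database: MAC address -> port number.
fdb = {"00:00:5e:00:53:01": 1, "00:00:5e:00:53:02": 2}
ports = [1, 2, 3]
print(forwarding_decision(fdb, "00:00:5e:00:53:02", 1, ports))  # forwarded
print(forwarding_decision(fdb, "00:00:5e:00:53:01", 1, ports))  # discarded
print(forwarding_decision(fdb, "00:00:5e:00:53:99", 1, ports))  # flooded
```

An empty result list stands for a discarded frame; the administrative filtering of certain MAC addresses is not modelled here.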
Backward Learning
The forwarding database must be populated and it must adapt to changes of the network topology
dynamically. One usually solves this problem by learning the current configuration from the frames
received by the bridge. Learned entries in the forwarding database have a timer attached to expire
these entries in case no other matching packets have been received:
When a bridge receives a frame whose source address does not yet exist in the forwarding
database, then it extracts the source address and determines the port number from which the frame
was received. The source address and the port number are then stored in the forwarding database.
The frame is then forwarded to all other ports (which also propagates information to other
bridges).
Every entry in the forwarding database has a timer attached to it. Entries are automatically
discarded if they have not been confirmed by additional received frames within a certain time
interval (soft state).
The aging of unused entries reduces the size of the forwarding table and allows bridges to
react to topology changes dynamically (after a short delay).
The backward learning algorithm only works if the topology is a strict tree and does not
contain cycles. In case of multiple paths between LAN segments, it is possible that entries in
the forwarding database are overwritten periodically. The behavior of the network is not stable
in such a situation.
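Backward learning with aging can be sketched as follows. The class name `LearningBridge`, the aging time of 300 seconds, and the explicit `now` argument (used instead of a real clock so that the behavior is reproducible) are assumptions of this sketch. For simplicity, the sketch refreshes the entry of the source address on every received frame.

```python
class LearningBridge:
    def __init__(self, ports, max_age=300.0):
        self.ports = list(ports)
        self.max_age = max_age  # assumed aging time in seconds
        self.fdb = {}           # MAC address -> (port, time last seen)

    def receive(self, src, dst, in_port, now):
        """Learn from the source address and return the output ports."""
        # Soft state: drop entries that were not confirmed recently.
        self.fdb = {mac: (port, seen) for mac, (port, seen) in self.fdb.items()
                    if now - seen <= self.max_age}
        # Backward learning: src is reachable via the receiving port.
        self.fdb[src] = (in_port, now)
        entry = self.fdb.get(dst)
        if entry is None:
            return [port for port in self.ports if port != in_port]  # flood
        out_port = entry[0]
        return [] if out_port == in_port else [out_port]


bridge = LearningBridge([1, 2, 3])
print(bridge.receive("A", "B", 1, now=0.0))     # B unknown: flooded
print(bridge.receive("B", "A", 2, now=1.0))     # A was learned on port 1
print(bridge.receive("A", "B", 1, now=2.0))     # B was learned on port 2
print(bridge.receive("A", "B", 1, now=1000.0))  # entries expired: flooded
```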
Spanning Trees
Bridged networks which do not have a loop-free tree structure cause problems since frames might
travel endlessly in a loop (ping-pong) when using backward learning alone. Transparent bridges
therefore construct a spanning tree in these cases which is used to restrict how frames are forwarded.
The spanning tree protocol requires a unique identification of the bridges involved. The so called
bridge identifier consists of one of the MAC addresses (six bytes) of a bridge plus a priority value
(two bytes). The priority value can be set administratively to influence the spanning trees computed
with the spanning tree protocol.
The spanning tree protocol executes in the following steps:
2.7. VIRTUAL LANS (IEEE 802.1Q) 23
1. In the first step, the root of the spanning tree is selected (root bridge). The root bridge is the
bridge with the highest priority and the smallest bridge address. The identity of the root bridge
is periodically broadcast and the root is re-elected as needed.
2. In the second step, the costs for all possible paths from the root bridge to the various ports on
the bridges are computed (root path cost). Every bridge determines which local port is used to
reach the root bridge at the lowest costs. The selected port is called the root port.
3. In the third step, the designated bridge is determined for each segment. The designated bridge
of a segment is the bridge which connects the segment to the root bridge with the lowest costs
on its root port. At equal costs, the bridge with the lowest bridge identifier wins. The ports
used to reach designated bridges are called designated ports.
4. Finally, all ports are blocked which are neither root ports nor designated ports. The resulting
active topology is a spanning tree.
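The first step, the selection of the root bridge, amounts to finding the smallest bridge identifier. The sketch below assumes that the two-byte priority value forms the most significant part of the identifier and that a numerically lower priority value means a higher priority; the bridge names and MAC addresses are made up for illustration.

```python
def bridge_id(priority, mac):
    """Build the 8-byte bridge identifier: priority (2 bytes), MAC (6 bytes)."""
    mac_value = int(mac.replace(":", ""), 16)
    return (priority << 48) | mac_value


def elect_root(bridges):
    """Return the name of the bridge with the smallest bridge identifier."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Hypothetical bridges: name -> (priority value, MAC address).
bridges = {
    "B0": (32768, "00:00:5e:00:53:10"),
    "B1": (32768, "00:00:5e:00:53:05"),
    "B2": (4096, "00:00:5e:00:53:ff"),
}
print(elect_root(bridges))  # B2 wins because of its priority value
```

Without B2, the priority values are equal and B1 wins because of its smaller MAC address. This is how an operator can influence the computed spanning tree by setting priorities.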
The spanning tree protocol uses so-called bridge protocol data units (BPDUs) to distribute
information. A BPDU has the structure shown in Figure 2.9.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Protocol Identifier | Version | BPDU Type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Flags | =
+-+-+-+-+-+-+-+-+ Root ID +
= =
+ +-----------------------------------------------+
= | Root Path Costs =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
= | =
+-+-+-+-+-+-+-+-+ Bridge ID +
= =
+ +-----------------------------------------------+
= | Port ID | Message |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Age | Maximum Age | Hello |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Timer | Forward Delay |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
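The layout above can be decoded with Python's `struct` module. The field widths in the format string are read off the figure (35 bytes in total); the function name `parse_bpdu`, the dictionary keys, and the sample values are assumptions made for this sketch.

```python
import struct

# Big-endian, no padding: 2+1+1+1+8+4+8+2+2+2+2+2 = 35 bytes.
BPDU_FORMAT = ">HBBBQIQHHHHH"
BPDU_FIELDS = ("protocol_id", "version", "bpdu_type", "flags", "root_id",
               "root_path_cost", "bridge_id", "port_id", "message_age",
               "max_age", "hello_timer", "forward_delay")


def parse_bpdu(data):
    """Decode the fixed part of a BPDU into a dictionary of integers."""
    size = struct.calcsize(BPDU_FORMAT)
    return dict(zip(BPDU_FIELDS, struct.unpack(BPDU_FORMAT, data[:size])))


# Made-up sample: root and sending bridge differ, root path cost 19.
root = (32768 << 48) | 0x00005E005305
sender = (32768 << 48) | 0x00005E005310
sample = struct.pack(BPDU_FORMAT, 0, 0, 0, 0, root, 19, sender,
                     0x8001, 256, 5120, 512, 3840)
bpdu = parse_bpdu(sample)
print(len(sample), bpdu["root_path_cost"], hex(bpdu["root_id"]))
```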
Virtual LANs (virtual bridged lans, VLANs) emulate a virtual LAN segment on top of a complex
IEEE 802 bridged network.
VLANs allow the traffic on an IEEE 802 network to be separated, which has several advantages:
A station connected to a certain VLAN only sees frames that belong to the VLAN.
VLANs can reduce the network load. In particular, frames that are targeted to all stations
(broadcasts) will only be delivered to the stations connected to the VLAN.
By assigning stations to VLANs, it is possible to create logical LAN topologies that are
independent of the underlying physical LAN topology.
A VLAN is identified by a VLAN identifier (1..4094) and realized by VLAN supporting bridges.
The assignment of bridge ports to VLANs can be done in different ways:
Port based VLANs: The ports of a bridge are assigned administratively to the various VLANs.
A single port can in general participate in multiple VLANs.
MAC address based VLANs: The MAC addresses of the stations are assigned administratively
to the various VLANs. With this scheme, it does not matter on which port a given
station connects to a bridge.
Protocol based VLANs: Frames are assigned to VLANs by inspecting the payload contained
in the frames. This technique makes it possible to create VLANs for, e.g., AppleTalk or IPX frames.
Multi-cast group based VLANs: VLANs are defined for all members of a certain multi-cast
group. This requires a multi-cast group membership protocol to be effective.
On links that carry frames which belong to different VLANs, it is necessary to tag the frames with
the VLAN identifier. In the case of Ethernet frames, a new field called the tag header is introduced
right after the destination and source addresses, as shown in Figure 2.11.
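[Figure 2.11: IEEE 802.3 frame format with tag header. Labels recoverable from the figure: preamble (7 bytes), CFI, data (network layer packet, 46-1500 bytes), frame length 64-1522 bytes]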
Tagged frames can exceed the maximum frame lengths accepted by stations which do not
support VLANs.
The IEEE 802.1Q standard generally requires that frames which exceed the maximum
allowed length are discarded.
In the case of IEEE 802.3 frames, an extension of the original frame of four bytes has been
granted (which changes the maximal length of a frame from 1518 bytes to 1522 bytes).
The IEEE 802.1Q standard also introduces the Generic Attribute Registration Protocol (GARP),
which can among other things propagate information about VLAN membership of individual ports.
This information can be used by VLAN enabled devices to suppress frames for VLANs which
currently have no members.
The 1998 revision of the IEEE 802.1D standard introduces additional support for priorities and
quality of service. (The additions were developed under the name IEEE 802.1p and are still
referred to by this name.) The original IEEE 802.3 and IEEE 802.11 frame formats do not allow
priorities to be communicated. When using IEEE 802.1Q VLAN frames, priorities can be encoded in
the 3-bit priority field of the four byte VLAN tag.
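The four byte VLAN tag can be sketched as two 16-bit values. The sketch assumes the usual IEEE 802.1Q encoding, which is not spelled out in the text above: a tag protocol identifier of 0x8100, followed by the 3-bit priority, the 1-bit CFI and the 12-bit VLAN identifier. The function names are chosen for this example only.

```python
import struct

TPID = 0x8100  # tag protocol identifier (assumed, taken from IEEE 802.1Q)


def build_tag(vid, priority=0, cfi=0):
    """Encode a four byte VLAN tag header."""
    if not 1 <= vid <= 4094:
        raise ValueError("VLAN identifier must be in 1..4094")
    if not 0 <= priority <= 7 or cfi not in (0, 1):
        raise ValueError("priority must be in 0..7 and cfi must be 0 or 1")
    tci = (priority << 13) | (cfi << 12) | vid
    return struct.pack(">HH", TPID, tci)


def parse_tag(tag):
    """Decode a four byte VLAN tag header into its fields."""
    tpid, tci = struct.unpack(">HH", tag)
    return {"tpid": tpid, "priority": tci >> 13,
            "cfi": (tci >> 12) & 1, "vid": tci & 0x0FFF}


tag = build_tag(100, priority=5)
print(tag.hex(), parse_tag(tag))
```

The four extra bytes are the reason why the maximal IEEE 802.3 frame length grows from 1518 to 1522 bytes.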
The core idea behind 802.1D priority extensions is to support bridges that have multiple output
queues for each port. Within a bridge, frames are assigned to certain traffic classes based on the user
priority (usually carried with the frame) and the access priority (which is the priority associated with
the media access mechanism). The traffic class of a frame is then used to select the queue where the
frame is queued for transmission. Note that a bridge must preserve the ordering of unicast frames
with a given combination of source and destination addresses and the order of multicast frames with
a given destination address.
Chapter 3
The Internet Protocol(s) developed and standardized by the IETF currently dominate the network
layer in data communication networks and they reach out into voice and multi-media communication
networks. The widely deployed version of the Internet Protocol (IP) is version 4 (IPv4) while
version 6 (IPv6) is currently gaining more deployment and thus practical relevance. This chapter
is centered around these two protocols. It also discusses the protocols which support the network
layer such as routing protocols or protocols which aim to automate the end system configuration
process.
3.1 Fundamentals
First, we consider some fundamentals which are important to understand the design of the Internet
protocols.
In the mid 1970s, the Defense Advanced Research Projects Agency (DARPA) of the USA started
projects to develop inter-networking technologies. These projects led to the ARPANET, a packet
switched network running on top of leased lines. The ARPANET later became a backbone network
connecting all the major Universities in the USA.
An implementation of the Internet protocols became a part of the BSD Unix operating system in the
early 1980s. The BSD Unix system became very popular in research organizations, which led to
the wide deployment of the Internet protocols in research environments. The integration of the
Internet protocols into BSD Unix also led to the development of the so-called socket application
programming interface (API), which became the de facto standard operating system-level API to
write networked applications.
In 1983, the ARPANET is split into the ARPANET research network and the MILNET for use by
the US military. The ARPANET research network becomes the NSFNET in 1986, which is now
funded by the National Science Foundation of the USA. In 1990, the NSFNET backbone turns into
the ANSNET operated jointly by MERIT, MCI and IBM. In the early 1990s, the World Wide Web
is born at CERN in Switzerland. More details about the evolution of the Internet can be found at
the Web page of the Internet Society (http://www.isoc.org/).
28 CHAPTER 3. INTERNET NETWORK LAYER
There are a number of fundamental principles which were followed during the development of
Internet protocols. Some of the principles described in RFC 1958 [15] are:
The first principle is that connectivity is its own reward. The idea here is that connectivity
across different link and transmission technologies is more valuable than any individual
application such as email or the World-Wide Web. The technique to achieve connectivity is to
provide an inter-networking layer which puts only very basic requirements on the underlying
link and transmission technologies.
All functions which require knowledge of the state of end-to-end communication should be
realized at the endpoints and not inside of the network (end-to-end argument). In other words,
end-to-end protocol design should not rely on the maintenance of state (i.e. information
about the state of the end-to-end communication) inside the network. Such state should be
maintained only in the endpoints, in such a way that the state can only be destroyed when the
endpoint itself breaks (known as fate-sharing).
Of course, to perform its services, the network maintains some state information such as
routes and the like. This state must be self-healing; adaptive procedures or protocols must
exist to derive and maintain that state, and change it when the topology or activity of the
network changes. The volume of this state must be minimized, and the loss of the state must
not result in more than a temporary denial of service given that connectivity exists. Manually
configured state must be kept to an absolute minimum.
There is no central instance which controls the Internet and which is able to turn it off.
Addresses should uniquely identify endpoints. Dynamic changes within the network should
be possible without having to change the identification of the end-systems.
To increase interoperability, implementations should be liberal in what they accept and
stringent in what they generate. Interoperability is more important than strict correctness.
Keep it simple. When in doubt during design, choose the simplest solution.
It is also important to consider that protocols sometimes show effects when used on a larger scale
that cannot be observed on small scales. This often comes from interactions between layers or
features. One approach to address these issues is to keep complexity down to a minimum by following
the simplicity principle discussed further in RFC 3439 [16].
It is necessary to introduce some terminology. These lecture notes use the terminology as defined
in RFC 2460 [17]. Some older books and documents do not necessarily use the same terminology
and it is thus sometimes necessary to mentally map terms when reading other documents.
A link is a communication channel below the IP layer which allows nodes to communicate
with each other (e.g., an Ethernet).
The neighbors of a node are all nodes attached to the same link.
The link MTU is the maximum transmission unit, i.e., maximum packet size in octets, that
can be conveyed over a link.
The path MTU is the minimum link MTU of all the links in a path between a source node
and a destination node.
The global Internet consists of a set of so called autonomous systems which are inter-connected. An
autonomous system (AS) is basically a set of routers and networks under the same administration.
IP packets are forwarded between autonomous systems over paths that are established by an
Exterior Gateway Protocol. The internal structure of an autonomous system is irrelevant for
the protocol establishing paths between autonomous systems.
Within an autonomous system, IP packets are forwarded over paths that are established by an
Interior Gateway Protocol.
The introduction of autonomous systems and the distinction between interior and exterior routing
protocols implies a two-level Internet routing architecture.
Autonomous systems can be classified as follows:
A multihomed AS has multiple connections to other ASes but does not forward transit traffic.
A transit AS has multiple connections to other ASes and carries local as well as transit traffic.
Internet addresses do not all have the same scope of uniqueness. While most IP addresses have
global scope, some addresses are only guaranteed to be unique on a certain interface while others
are only guaranteed to be unique on a certain link.
The scope of an Internet address is a topological span within which the address may be used as
a unique identifier for an interface or a set of interfaces. A scope zone, or simply a zone, is a
concrete connected region of topology of a given scope. Note that a zone is a particular instance of
a topological region, whereas a scope is the size of a topological region.
---------------------------------------------------------------
| a node |
| |
| |
| /--link1--\ /--------link2--------\ /--link3--\ /--link4--\ |
| |
| /--intf1--\ /--intf2--\ /--intf3--\ /--intf4--\ /--intf5--\ |
---------------------------------------------------------------
: | | | |
: | | | |
: | | | |
(imaginary ================= a point- a
loopback an Ethernet to-point tunnel
link) link
Since Internet addresses on devices that connect multiple zones are not necessarily unique, an
additional zone index is needed on these devices to select an interface or a set of interfaces.
The Internet Protocol version 4 (IPv4) was standardized in 1981 and is documented in RFC 791
[18]. The IPv4 protocol is the basis of today's global Internet. The original IPv4 specification has
been adapted to emerging requirements during the last 20 years. The following description of IPv4
describes the current interpretation of IPv4.
The principal structure and the textual representation of IPv4 addresses have already been introduced
in chapter 1.
For forwarding purposes, IPv4 addresses are divided into a part which identifies a network
(netid) and a part which identifies an interface of a node within that network (hostid).
The number of bits of an IPv4 address which identifies the network is called the prefix
length. The prefix length is commonly written as a decimal number, appended to the usual
IPv4 address notation by using a slash (/) as a separator (e.g., 192.0.2.0/24).
3.2. INTERNET PROTOCOL VERSION 4 (IPV4) 31
Older documents use a so-called netmask, which is a bit field of the size of an IPv4 address
that yields the network identifier when combined with an IPv4 address by a bitwise AND
operation (e.g., 192.0.2.0 & 255.255.255.0).
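The relation between the slash notation and the netmask can be made concrete with a short sketch. It uses Python's standard `ipaddress` module only for converting between dotted notation and integers; the two helper function names are chosen for this example.

```python
import ipaddress


def netmask_from_prefix_length(prefix_length):
    """Return the 32-bit netmask with prefix_length leading one bits."""
    return (0xFFFFFFFF << (32 - prefix_length)) & 0xFFFFFFFF


def network_id(address, prefix_length):
    """Apply the netmask to an address with a bitwise AND."""
    value = int(ipaddress.IPv4Address(address))
    masked = value & netmask_from_prefix_length(prefix_length)
    return str(ipaddress.IPv4Address(masked))


mask = str(ipaddress.IPv4Address(netmask_from_prefix_length(24)))
print(mask)                          # netmask for a /24 prefix
print(network_id("192.0.2.77", 24))  # network part of the address
```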
Not all possible IPv4 addresses can be used in the global Internet without restrictions. Some
addresses are reserved or have special semantics attached to them, as described in RFC 3330 [19]:
Addresses for private networks, which are not routed through the public global Internet, can be
taken from the address blocks 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 as specified in RFC 1918 [20].
Test addresses or addresses that are used solely for documentation purposes can be taken
from the address block 192.0.2.0/24.
Addresses from the address block 0.0.0.0/8 identify a sender which is not yet fully configured
(typically 0.0.0.0).
The address block 127.0.0.0/8 identifies the local node, also called the loopback network.
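A membership test against these blocks is a simple prefix comparison. The sketch below lists the blocks mentioned above plus 172.16.0.0/12, which RFC 1918 also reserves for private networks; the labels and the function name `classify` are made up for this example.

```python
import ipaddress

# (label, address block); labels are chosen for this sketch only.
SPECIAL_BLOCKS = [
    ("private", "10.0.0.0/8"),
    ("private", "172.16.0.0/12"),
    ("private", "192.168.0.0/16"),
    ("documentation", "192.0.2.0/24"),
    ("unconfigured sender", "0.0.0.0/8"),
    ("loopback", "127.0.0.0/8"),
]


def classify(address):
    """Return the label of the first special block containing the address."""
    addr = ipaddress.ip_address(address)
    for label, block in SPECIAL_BLOCKS:
        if addr in ipaddress.ip_network(block):
            return label
    return "not in a listed block"


for candidate in ("192.168.1.10", "127.0.0.1", "192.0.2.55", "134.169.34.12"):
    print(candidate, "->", classify(candidate))
```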
IPv4 packets have the following structure as specified in RFC 791 [18]:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| IHL |Type of Service| Total Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identification |Flags| Fragment Offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Time to Live | Protocol | Header Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Destination Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options | Padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Remarks:
The length of the protocol header is stored in the Internet Header Length (IHL)
field. The length is counted in the number of 4 byte words. The minimum header length is 5
(which corresponds to 20 bytes) and the maximum header length is 15 (which corresponds to
60 bytes).
The interpretation of the Type of Service (TOS) field has changed over time. The
current interpretation of this field uses six bits as the Differentiated Services Code Point
(DSCP) and two bits for explicit congestion notifications (ECN), as specified in RFC 2474
[21], RFC 3168 [22] and RFC 3260 [23].
0 1 2 3 4 5 6 7
+-----+-----+-----+-----+-----+-----+-----+-----+
| DS FIELD, DSCP | ECN FIELD |
+-----+-----+-----+-----+-----+-----+-----+-----+
The length of the IPv4 packet (including the protocol header) is stored in the Total Length
field. Since this is a 16-bit field, IPv4 packets can have a maximum length of 65535 bytes.
The Time to Live (TTL) field is used to limit the lifetime of an IPv4 packet. The
lifetime is usually measured in the number of hops passed rather than a period in time. Every
router forwarding an IPv4 packet decrements this field and the packet is discarded once the
value of the field becomes zero.
The Protocol field identifies the protocol contained in the IPv4 packet, in most cases one
of the Internet transport protocols.
The Header Checksum field contains the Internet checksum computed over the header.
The Source Address and Destination Address fields contain the source and
destination address of the packet.
There are a number of options which can be used to control the forwarding of a packet or
which cause routers to append forwarding information to the protocol header. Most of these
options are practically irrelevant since the remaining 40 bytes are usually not enough for
using these options.
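The remarks above translate directly into a small header decoder. The sketch reads only the fixed 20 bytes and ignores options; the function name, the dictionary keys and the sample packet are assumptions of this example. The bit positions of the DF and MF flags and the fragment offset unit of eight bytes follow RFC 791.

```python
import ipaddress
import struct


def parse_ipv4_header(packet):
    """Decode the fixed 20 byte part of an IPv4 header."""
    (version_ihl, tos, total_length, identification, flags_fragment,
     ttl, protocol, checksum, source, destination) = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20])
    return {
        "version": version_ihl >> 4,
        "header_length": (version_ihl & 0x0F) * 4,  # IHL counts 4 byte words
        "dscp": tos >> 2,                            # upper six bits
        "ecn": tos & 0x03,                           # lower two bits
        "total_length": total_length,
        "identification": identification,
        "dont_fragment": bool(flags_fragment & 0x4000),
        "more_fragments": bool(flags_fragment & 0x2000),
        "fragment_offset": (flags_fragment & 0x1FFF) * 8,
        "ttl": ttl,
        "protocol": protocol,
        "checksum": checksum,
        "source": str(ipaddress.IPv4Address(source)),
        "destination": str(ipaddress.IPv4Address(destination)),
    }


# Made-up sample header: IHL 5, DSCP 46, DF set, TTL 64, protocol 1 (ICMP).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0xB8, 84, 1, 0x4000, 64, 1, 0,
                     bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
header = parse_ipv4_header(sample)
print(header["header_length"], header["dscp"], header["source"])
```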
Every node maintains a forwarding table (also sometimes called the forwarding information base)
which is used to direct IPv4 packets closer to their destination [24].
The forwarding table realizes a mapping of the network prefix to the next node (next hop)
and the local interface used to reach the next node.
For every IP packet, the entry in the forwarding table has to be found with longest matching
network address prefix (longest prefix match).
The following example shows on a simple network topology the contents of the various forwarding
tables involved:
Prefix            Next Hop         Interface
0.0.0.0/0         134.169.2.1      eth0
134.169.34.0/24   134.169.246.34   eth0
134.169.0.0/16    134.169.9.10     eth0
[Figure: example topology with the nodes R1 (134.169.2.1) and H1 (134.169.9.10) and the networks 134.169.0.0/16 and 134.169.34.0/24]
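A longest prefix match over the three entries of the example table can be sketched with a linear scan. The function name `lookup` and the tuple layout of the table are assumptions of this sketch; real routers use much faster data structures.

```python
import ipaddress

# (prefix, next hop, interface), taken from the example table.
TABLE = [
    ("0.0.0.0/0", "134.169.2.1", "eth0"),
    ("134.169.34.0/24", "134.169.246.34", "eth0"),
    ("134.169.0.0/16", "134.169.9.10", "eth0"),
]


def lookup(table, destination):
    """Return (next hop, interface) of the longest matching prefix."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop, interface in table:
        network = ipaddress.ip_network(prefix)
        if address in network:
            if best is None or network.prefixlen > best[0]:
                best = (network.prefixlen, next_hop, interface)
    return None if best is None else best[1:]


print(lookup(TABLE, "134.169.34.12"))  # matches /0, /16 and /24; /24 wins
print(lookup(TABLE, "134.169.9.20"))   # matches /0 and /16; /16 wins
print(lookup(TABLE, "192.0.2.1"))      # only the default route matches
```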
A node has multiple forwarding tables. Information contained in fields of the incoming IP
packet (e.g., the DSCP value) is used to select one out of many forwarding tables to forward
the packet.
Another approach to increase performance is to use caches for frequently used destination
addresses.
Forwarding tables can become very large (around 100000 entries on backbone routers have
been reported in January 2001).
One technique used to reduce the size of forwarding tables is called address aggregation. If
a router has multiple forwarding table entries with a common prefix which point to the same
interface, the router can aggregate these entries into a single entry with a shorter prefix length.
Exceptions can still be handled by having some entries with a longer prefix.
Due to the growth in the number of packets per second a router has to handle and the growth of
the forwarding tables, it is crucial to design lookup algorithms that scale well in the number
of addresses stored in a forwarding table. Note that routing updates occur frequently in
backbone routers and thus update operations must be reasonably fast as well.
Large forwarding tables are usually represented as tries so that the complexity of lookup
operations depends on the distribution of the length of network prefixes and not on the total
number of table entries [25]. A trie is a tree-based data structure allowing the organization of
prefixes on a digital basis by using the bits of prefixes to direct the branching.
The usage of optimized tree representations, usually implemented in hardware, provides the
performance that is needed to handle IP on very high speed links. See [26] for a good survey
on fast IP address lookup algorithms.
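A minimal binary trie for IPv4 prefixes illustrates the idea: each bit of the prefix selects one of two children, and a lookup walks at most 32 levels regardless of the number of stored prefixes. This is an unoptimized sketch with made-up class names; production implementations compress paths and use multi-bit strides.

```python
import ipaddress


class TrieNode:
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and bit 1
        self.value = None             # next hop stored at this prefix


class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, value):
        network = ipaddress.ip_network(prefix)
        bits = int(network.network_address)
        node = self.root
        for i in range(network.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.value = value

    def lookup(self, address):
        """Return the value of the longest matching prefix, or None."""
        bits = int(ipaddress.ip_address(address))
        node = self.root
        best = node.value
        for i in range(32):
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
            if node.value is not None:
                best = node.value  # remember the longest match so far
        return best


trie = PrefixTrie()
trie.insert("0.0.0.0/0", "134.169.2.1")
trie.insert("134.169.0.0/16", "134.169.9.10")
trie.insert("134.169.34.0/24", "134.169.246.34")
print(trie.lookup("134.169.34.12"), trie.lookup("192.0.2.1"))
```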
The Internet Control Message Protocol (ICMP) as specified in RFC 792 [27] is used to inform
nodes about problems encountered while forwarding IP packets. It also introduces messages which
can be used to perform simple tests. ICMP messages are transported in the payload of ordinary IP
packets.
In the following, a selection of ICMP message formats will be discussed. ICMP messages in general
contain a checksum which is computed over the ICMP message in order to detect some bit errors
in ICMP messages.
Echo Request/Reply
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier | Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data ...
+-+-+-+-+-
The ICMP echo request message (type = 8, code = 0) asks the destination node to return an
echo reply message (type = 0, code = 0) to the sender of the echo request message.
The Identifier and Sequence Number fields are used by the sender to correlate
incoming replies with previously sent requests.
The data field may contain additional data or just fill bytes in order to bring the IP packets to
a certain size.
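An echo request message can be assembled as follows. The sketch assumes the usual Internet checksum (the 16-bit one's complement of the one's complement sum of all 16-bit words, computed with the checksum field set to zero); the function names are chosen for this example and the message is only built, not sent.

```python
import struct


def internet_checksum(data):
    """Compute the 16-bit Internet checksum over a byte string."""
    if len(data) % 2:
        data += b"\x00"  # pad to a multiple of 16 bits
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:   # fold the carries back into the lower 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier, sequence, payload=b""):
    """Build an ICMP echo request message (type 8, code 0)."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


message = build_echo_request(0x1234, 1, b"abcd")
print(message.hex())
# A receiver verifies the message by recomputing the checksum over the
# complete message; the result is zero if no bit error is detected.
print(internet_checksum(message))
```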
Unreachable Destinations
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unused |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Internet Header + 64 bits of Original Data Datagram |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Type field has the value 3 for all unreachable destination messages.
The data field contains the beginning of the packet which caused the ICMP unreachable
destination message.
Redirect
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Router Internet Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Internet Header + 64 bits of Original Data Datagram |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Type field has the value 5 for all redirect messages.
The data field contains the beginning of the packet which caused the ICMP redirect message.
The receiver must buffer fragments until all fragments have been received. However, it is not
useful to keep fragments in a buffer indefinitely. Hence, the TTL field of all buffered packets
will be decremented once per second and fragments are dropped when the TTL field becomes
zero.
The loss of a fragment causes in most cases the sender to resend the original IP packet which
in most cases gets fragmented as well. Hence, the probability of transmitting a large IP packet
successfully drops quickly as the loss rate of the network increases.
Since the Identification field identifies fragments that belong together and the number
space is limited, one cannot fragment an arbitrarily large number of packets.
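The effect of fragment loss can be quantified under a simplifying assumption that is not made in the text: if every fragment is lost independently with probability p, a packet split into n fragments arrives completely with probability (1 - p)^n. The function names and the header length of 20 bytes are assumptions of this sketch.

```python
import math


def fragments_needed(packet_length, link_mtu, header_length=20):
    """Number of fragments needed to carry a packet over a link."""
    payload = packet_length - header_length
    # Fragment payloads are multiples of 8 bytes (except the last one).
    per_fragment = (link_mtu - header_length) // 8 * 8
    return max(1, math.ceil(payload / per_fragment))


def delivery_probability(loss_rate, fragments):
    """Probability that all fragments arrive, assuming independent losses."""
    return (1.0 - loss_rate) ** fragments


count = fragments_needed(65535, 1500)
for loss_rate in (0.001, 0.01, 0.05):
    print(loss_rate, count, round(delivery_probability(loss_rate, count), 3))
```

With a link MTU of 1500 bytes, a maximum size IPv4 packet needs 45 fragments, so even a loss rate of one percent leaves only roughly a two in three chance that the whole packet arrives.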
An obvious solution for the problem is to cause the sender never to generate packets that are larger
than the path MTU and thus never have to be fragmented [29]. To make this simple solution work,
the sender has to be able to learn the path MTU:
The sender sends IPv4 packets with the DF flag turned on.
A router which has to fragment a packet with the DF flag turned on drops the packet and sends
an ICMP message back to the sender which also includes the local maximum link MTU.
Upon receiving the ICMP message, the sender adapts its estimate of the path MTU and
retries.
Since the path MTU can change dynamically (since the path can change), a once learned path
MTU should be verified and adjusted periodically.
Not all routers necessarily send the local link MTU. In this case, the sender usually tries
typical MTU values, which is usually faster than doing a binary search.
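The discovery procedure can be simulated on a path described by its link MTUs. The sketch assumes the well-behaved case in which every router reports its link MTU in the ICMP message; the function name and the example path are made up.

```python
def discover_path_mtu(link_mtus, initial_estimate):
    """Simulate path MTU discovery; return (path MTU, packets sent)."""
    estimate = initial_estimate
    packets_sent = 0
    while True:
        packets_sent += 1  # send a packet of size estimate with DF set
        # The first link that is too small drops the packet and its
        # router reports the link MTU back to the sender.
        reported = next((mtu for mtu in link_mtus if mtu < estimate), None)
        if reported is None:
            return estimate, packets_sent  # the packet reached the destination
        estimate = reported


path = [1500, 1492, 576, 1500]
print(discover_path_mtu(path, 1500))
print(min(path))  # the path MTU is the minimum link MTU along the path
```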
IPv4 packets are sent in the payload of IEEE 802.3 frames according to the specification in RFC
894 [30].
IPv4 packets are identified by the value 0x800 in the IEEE 802.3 type field.
According to the maximum length of IEEE 802.3 frames, the maximum link MTU is 1500
bytes.
The mapping of IPv4 addresses to IEEE 802.3 addresses is table driven. Entries in so-called
mapping tables (sometimes also called address translation tables) can either be statically
configured or dynamically learned.
The Address Resolution Protocol (ARP) defined in RFC 826 [31] allows an IP node to determine
the link-layer address of a neighboring node on a broadcast network. The fundamental principle
here is to broadcast a message asking for the translation of an IP address to a link-layer address
to all stations attached to a broadcast network. Since the message is broadcast, it will also reach
the node which has the IP address assigned to one of its interfaces. This node can thus respond by
sending a unicast message back to the node which asked the question.
Subsequently, an extension was defined which allows reverse address resolution. The
Reverse Address Resolution Protocol (RARP) defined in RFC 903 [32] resolves a node's hardware
address to an IP address.
In case of IPv4 addresses and IEEE 802.3 addresses, the following message format is used for both
ARP and RARP:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Hardware Type | Protocol Type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| HLEN | PLEN | Operation |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sender Hardware Address (SHA) =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
= Sender Hardware Address (SHA) | Sender IP Address (SIP) =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
= Sender IP Address (SIP) | Target Hardware Address (THA) =
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
= Target Hardware Address (THA) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Target IP Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The ARP message format is not aligned to 32-bit word boundaries in case of IPv4 addresses
and IEEE 802 MAC addresses.
The Hardware Type field identifies the address type used on the link-layer (the value 1 is
used for IEEE 802.3 MAC addresses).
The Protocol Type field identifies the network layer address type (the value 0x800 is
used for IPv4).
ARP/RARP packets use the type value 0x806 in the IEEE 802.3 frame.
The Operation field contains the message type: ARP Request (1), ARP Response (2),
RARP Request (3), RARP Response (4).
The sender fills, depending on the request type, either the Target Hardware Address
(RARP) field or the Target IP Address (ARP) field.
The responding node swaps the Sender/Target fields and fills the empty fields with the
requested information.
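The field layout and the request/response swap can be sketched in a few lines of Python. This is a minimal illustration for the IPv4 over IEEE 802.3 case only; the names `ARP_FORMAT`, `build_arp_request` and `build_arp_response` are hypothetical helpers chosen for this sketch, and the code only constructs the message bytes without sending them.

```python
import socket
import struct

# Fixed layout for IPv4 over IEEE 802.3: HLEN = 6, PLEN = 4, 28 octets in total.
ARP_FORMAT = "!HHBBH6s4s6s4s"


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build an ARP Request; the unknown Target Hardware Address is left as zeros."""
    return struct.pack(
        ARP_FORMAT,
        1,        # Hardware Type: IEEE 802.3
        0x0800,   # Protocol Type: IPv4
        6, 4,     # HLEN, PLEN
        1,        # Operation: ARP Request
        sender_mac, socket.inet_aton(sender_ip),
        bytes(6), socket.inet_aton(target_ip),
    )


def build_arp_response(request: bytes, own_mac: bytes) -> bytes:
    """Swap the Sender/Target fields and fill in the requested hardware address."""
    htype, ptype, hlen, plen, _op, sha, sip, _tha, tip = struct.unpack(ARP_FORMAT, request)
    # Operation 2 is ARP Response; the responder becomes the sender.
    return struct.pack(ARP_FORMAT, htype, ptype, hlen, plen, 2, own_mac, tip, sha, sip)
```

Note that the packed message is 28 octets long, which matches the observation that the fields do not fall on 32-bit word boundaries.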
The Dynamic Host Configuration Protocol (DHCP) defined in RFC 2131 [33] allows nodes (DHCP
clients) to retrieve configuration parameters dynamically from a central configuration server (DHCP
server). A binding is a collection of configuration parameters, including at least an IP address,
associated with or bound to a DHCP client. Bindings are managed by DHCP servers.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| op (1) | htype (1) | hlen (1) | hops (1) |
+---------------+---------------+---------------+---------------+
| xid (4) - transaction id |
+-------------------------------+-------------------------------+
| secs (2) | flags (2) |
+-------------------------------+-------------------------------+
| ciaddr (4) - client IPv4 address |
+---------------------------------------------------------------+
| yiaddr (4) - your (client) IPv4 address |
+---------------------------------------------------------------+
| siaddr (4) - next server IPv4 address |
+---------------------------------------------------------------+
| giaddr (4) - relay agent IPv4 address |
+---------------------------------------------------------------+
| |
| chaddr (16) - client hardware address of |
| type htype and length hlen |
| |
+---------------------------------------------------------------+
| |
The DHCPOFFER message is sent from a DHCP server to offer a client a set of configuration
parameters.
The DHCPREQUEST is sent from the client to a DHCP server as a response to a previous
DHCPOFFER message, to verify a previously allocated binding or to extend the lease of a
binding.
The DHCPACK message is sent by a DHCP server with some additional parameters to the
client as a positive acknowledgement to a DHCPREQUEST.
The DHCPNAK message is sent by a DHCP server to indicate that the client's notion of a
configuration binding is incorrect.
The DHCPDECLINE message is sent by a DHCP client to indicate that parameters are already
in use.
The DHCPRELEASE message is sent by a DHCP client to inform the DHCP server that
configuration parameters are no longer used.
The DHCPINFORM message is sent from the DHCP client to inform the DHCP server that
only local configuration parameters are needed.
A typical exchange between a client and two candidate servers is displayed below.
v v v
| | |
| Begins initialization |
| | |
| _____________/|\____________ |
|/DHCPDISCOVER | DHCPDISCOVER \|
| | |
Determines | Determines
configuration | configuration
| | |
|\ | ____________/ |
| \________ | /DHCPOFFER |
| DHCPOFFER\ |/ |
| \ | |
| Collects replies |
| \| |
| Selects configuration |
| | |
| _____________/|\____________ |
|/ DHCPREQUEST | DHCPREQUEST\ |
| | |
| | Commits configuration
| | |
| | _____________/|
| |/ DHCPACK |
| | |
| Initialization complete |
| | |
. . .
. . .
| | |
| Graceful shutdown |
| | |
| |\ ____________ |
| | DHCPRELEASE \|
| | |
| | Discards lease
| | |
v v v
See RFC 2131 [33] for a complete state diagram and a complete description of the possible transitions.
The options field of DHCP messages can contain various configuration options. Options may have a fixed
length or a variable length. All options begin with a tag octet, usually followed by a length field and the actual
value (tag-length-value, TLV). An initial set of options such as options to configure a list of routers, a list of
name servers and so on is defined in RFC 2132 [34].
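The tag-length-value encoding can be illustrated with a small parser. The sketch below assumes the two options that RFC 2132 defines without a length field, Pad (tag 0) and End (tag 255); the function name `parse_dhcp_options` is hypothetical and the parser does not interpret the option values.

```python
def parse_dhcp_options(data: bytes) -> dict:
    """Parse a DHCP options field into a {tag: value} mapping."""
    options = {}
    i = 0
    while i < len(data):
        tag = data[i]
        if tag == 0:        # Pad option: a single octet without a length field
            i += 1
            continue
        if tag == 255:      # End option: terminates the options field
            break
        if i + 1 >= len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[tag] = value
        i += 2 + length
    return options
```

For example, the octets `53 1 2` would decode to tag 53 (the DHCP message type option) with the one-octet value 2.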
Some security aspects related to the lack of authentication within DHCP are discussed in RFC 3118 [35] and
a proposal is made to provide delayed authentication, which is still subject to denial of service attacks.
Capability to mark packets that belong to particular traffic flows for which the sender requests special
handling.
Authentication and privacy capabilities to support authentication, data integrity and optional data con-
fidentiality of IPv6 packets.
Integrated automatic end-system configuration capabilities.
IPv6 addresses are 128-bit identifiers for interfaces and sets of interfaces. The details are defined in RFC
3513 [8]. There are three types of IPv6 addresses:
A unicast address is an identifier for a single interface. A packet sent to a unicast address is delivered
to the interface identified by that address.
An anycast address is an identifier for a set of interfaces. A packet sent to an anycast address is delivered
to one of the interfaces identified by that address.
A multicast address is an identifier for a set of interfaces. A packet sent to a multicast address is
delivered to all interfaces identified by that address.
The type of an IPv6 address is identified by the high-order bits of the address:
Anycast addresses are taken from the unicast address spaces (of any scope) and are not syntactically distin-
guishable from unicast addresses.
Interface Identifiers
Interface identifiers in IPv6 unicast addresses are used to uniquely identify interfaces on a link. For all
unicast addresses, except those that start with binary 000, interface identifiers are required to be 64 bits long
and to be constructed in modified EUI-64 format.
The modified EUI-64 format can be obtained from IEEE 802 MAC addresses by inserting two octets with the
hexadecimal values 0xFF and 0xFE in the middle of the 48-bit MAC address. A 48-bit IEEE MAC address
with global scope has the following format:
|0              1|1              3|3              4|
|0              5|6              1|2              7|
+----------------+----------------+----------------+
|cccccc0gcccccccc|ccccccccmmmmmmmm|mmmmmmmmmmmmmmmm|
+----------------+----------------+----------------+
The c bits are the assigned company identification, 0 is the universal/local bit to indicate global scope, the g
bit is the individual/group bit, and m bits are the manufacturer selected extension identifier. The corresponding
modified EUI-64 identifier has the following format:
With this transformation, it is possible to compute a link local IPv6 address for each physical IEEE 802 MAC
interface.
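The transformation can be sketched as follows. Besides inserting 0xFF and 0xFE, the modified EUI-64 format of RFC 3513 also inverts the universal/local bit, which is the step that makes the format "modified"; the sketch applies both steps. The function names `mac_to_modified_eui64` and `link_local_address` are hypothetical.

```python
import ipaddress


def mac_to_modified_eui64(mac: str) -> bytes:
    """Derive the 64-bit interface identifier from a 48-bit IEEE 802 MAC address."""
    octets = bytearray(bytes.fromhex(mac.replace(":", "").replace("-", "")))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02   # invert the universal/local bit
    # Insert 0xFF 0xFE between the company identifier and the extension identifier.
    return bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])


def link_local_address(mac: str) -> ipaddress.IPv6Address:
    """Combine the link-local prefix with the modified EUI-64 interface identifier."""
    prefix = bytes.fromhex("fe80") + bytes(6)   # 1111111010 followed by 54 zero bits
    return ipaddress.IPv6Address(prefix + mac_to_modified_eui64(mac))
```

With these definitions, `link_local_address("00:11:22:33:44:55")` yields `fe80::211:22ff:fe33:4455`.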
While the automatic computation of interface identifiers from MAC addresses is a simple way to construct
link-local and global IPv6 addresses, some people have concerns that these IPv6 addresses can be used to
track mobile nodes used in different networks. There are two approaches to address this concern: The first
approach is to use DHCP instead of IPv6 autoconfiguration to assign IPv6 addresses. The other approach,
documented in RFC 2893 [36], is to generate a pseudo-random sequence of interface identifiers via a one-way
hash function which depends on a random component and the globally unique interface identifier (where
available). The pseudo-random interface identifiers are then only used for a certain period of time.
The global routing prefix is a typically hierarchically structured value assigned to a site (a cluster of sub-
nets/links), the subnet ID is an identifier of a link within the site.
There is a special IPv6 address space which contains the complete IPv4 address space. The so-called
mapped IPv4 addresses were invented to make the transition from IPv4 to IPv6 networks easier. There is
an ongoing controversy whether this is actually the case.
|                80 bits               | 16 |       32 bits       |
+--------------------------------------+--------------------------+
|0000..............................0000|0000|    IPv4 address     |
+--------------------------------------+----+---------------------+
Link-local unicast addresses are assigned automatically and guaranteed to be unique on the link attached to
an interface.
|   10     |
|  bits    |         54 bits         |          64 bits           |
+----------+-------------------------+----------------------------+
|1111111010|           0             |       interface ID         |
+----------+-------------------------+----------------------------+
3.3. INTERNET PROTOCOL VERSION 6 (IPV6) 43
IPv6 packets have the following structure, as specified in RFC 2460 [17]:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class | Flow Label |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Payload Length | Next Header | Hop Limit |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Source Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Destination Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Traffic Class field contains in the current interpretation the Differentiated Services Code
Point (DSCP) as well as two bits for explicit congestion notification [21, 22, 23].
   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|          DS FIELD, DSCP           | ECN FIELD |
+-----+-----+-----+-----+-----+-----+-----+-----+
The Flow Label field allows a sender to mark packets transmitted from a source address to a destination
address which belong to a certain traffic flow (e.g., all packets that belong to a certain voice call). The
motivation for this field is that routers can handle packets that belong to a certain flow in a specific
way.
The Payload Length field contains the length of the payload following the IPv6 protocol header.
Note that this field is different from the IPv4 Total Length field.
The Next Header field identifies the type of the payload following the header. This is roughly
equivalent to the IPv4 Protocol field. Note, however, that IPv6 uses a daisy-chain of IPv6 headers
to realize IPv6 options as discussed below.
The Hop Limit field is used to limit the lifetime of IPv6 packets. Every router which forwards an
IPv6 packet decrements this field and the packet is discarded if the value reaches zero.
The Source Address and Destination Address fields contain the 128-bit source and des-
tination addresses.
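The fixed 40-octet header can be decoded with a few shifts and masks. The following sketch is one possible way to do this in Python; the function name `parse_ipv6_header` and the dictionary keys are illustrative choices, and the split of the Traffic Class into DSCP and ECN follows the current interpretation described above.

```python
import ipaddress
import struct


def parse_ipv6_header(packet: bytes) -> dict:
    """Decode the fixed IPv6 header at the start of a packet."""
    if len(packet) < 40:
        raise ValueError("an IPv6 header is 40 octets long")
    first_word, payload_length, next_header, hop_limit = struct.unpack("!IHBB", packet[:8])
    traffic_class = (first_word >> 20) & 0xFF
    return {
        "version": first_word >> 28,           # upper 4 bits
        "dscp": traffic_class >> 2,            # upper 6 bits of the Traffic Class
        "ecn": traffic_class & 0x03,           # lower 2 bits of the Traffic Class
        "flow_label": first_word & 0xFFFFF,    # lower 20 bits
        "payload_length": payload_length,      # excludes the 40-octet header itself
        "next_header": next_header,
        "hop_limit": hop_limit,
        "source": ipaddress.IPv6Address(packet[8:24]),
        "destination": ipaddress.IPv6Address(packet[24:40]),
    }
```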
Compared to the IPv4 packet format, the IPv6 packet format is much simpler. This has been achieved by
moving some functionality into so-called extension headers which can be carried in a daisy chain between
the IPv6 protocol header and the actual payload.
If a node does not understand an extension header, it has to discard the whole packet. Parameters, which can
be ignored by implementations, are called options and they are carried in special extension headers.
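The daisy chain can be followed by repeatedly reading the Next Header octet at the start of each extension header. The sketch below is a simplified illustration that handles only the Hop-by-Hop Options, Routing, Fragment and Destination Options headers; the protocol numbers 0, 43, 44 and 60 are taken from the IANA registry rather than from this text, and the walk simply stops at any other value (including the security headers). The function name `walk_extension_headers` is hypothetical.

```python
HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS = 0, 43, 44, 60


def walk_extension_headers(first_next_header: int, data: bytes):
    """Return ([extension header types], upper-layer protocol, payload offset).

    first_next_header is the Next Header value of the IPv6 header and data
    holds the octets that follow the 40-octet IPv6 header.
    """
    chain = []
    next_header, offset = first_next_header, 0
    while next_header in (HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS):
        if offset + 2 > len(data):
            raise ValueError("truncated extension header")
        following, length_field = data[offset], data[offset + 1]
        if next_header == FRAGMENT:
            size = 8                        # fixed size, second octet is reserved
        else:
            size = (length_field + 1) * 8   # counted in 64-bit words minus 1
        chain.append(next_header)
        next_header, offset = following, offset + size
    return chain, next_header, offset
```

A real implementation would additionally have to discard the packet when it meets a header type it does not understand, as described above.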
The Routing Header (RH) is an extension header that can be used by the sender to specify one or more nodes
that must be visited on the way to the destination.
The RH extension header as defined in RFC 2460 [17] has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next Header | Hdr Ext Len | Routing Type | Segments Left |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
. .
. Type-Specific Data .
. .
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Next Header field identifies the type of the payload following the RH extension header.
The Hdr Ext Len field contains the length of the RH counted in 64-bit words minus 1.
The Routing Type field identifies a certain variant of the RH and the semantics of the field
Type-Specific Data. At the time of this writing, only a single routing type has been defined.
The Segments Left field indicates the number of remaining routing segments.
The contents of the Type-Specific Data field depend on the value of the Routing Type
field. This field contains, under the currently defined Routing Type, 32 unused bits followed by
a sequence of 128-bit fields where each 128-bit field contains an IPv6 address. When an IPv6 packet
reaches the destination and there are remaining segments, the next routing address is copied
into the destination address field, the number of remaining segments is decremented and the packet is
forwarded to the new destination address.
IPv6 assumes that every link has a link MTU of at least 1280 bytes [17]. Links that only support smaller
MTUs must provide fragmentation and reassembly services below the IPv6 layer. Simple IPv6 implementations
which do not perform path MTU discovery must restrict themselves to packets which do not exceed 1280
bytes. Packets which are bigger than the path MTU can be fragmented by using the Fragment Header (FH)
extension. Only IPv6 source nodes are allowed to fragment IPv6 packets. In contrast to IPv4, routers are not
allowed to fragment packets.
The FH extension header as defined in RFC 2460 [17] has the following format:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Next Header  |   Reserved    |      Fragment Offset    |Res|M|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identification |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Next Header field identifies the type of the payload following the FH extension header.
The Fragment Offset field defines the relative position of the fragment (counted in 64-bit words)
in the original IPv6 packet.
The flag M is set if more fragments follow. The bits Res are currently unused and reserved.
The Identification field contains the same value for all fragments of an IPv6 packet.
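How a source node might cut a large payload into fragments can be sketched as follows. The sketch assumes that the unfragmentable part consists only of the 40-octet IPv6 header (no extension headers in front of the FH) and that the path MTU is already known; the function name `plan_fragments` and its parameters are hypothetical.

```python
def plan_fragments(payload_length: int, path_mtu: int = 1280, unfragmentable: int = 40):
    """Return one (Fragment Offset, length in octets, M flag) tuple per fragment."""
    # Room per fragment after the unfragmentable part and the 8-octet Fragment Header,
    # rounded down to a multiple of 8 because offsets are counted in 64-bit words.
    room = (path_mtu - unfragmentable - 8) // 8 * 8
    if room <= 0:
        raise ValueError("path MTU too small")
    fragments = []
    position = 0
    while position < payload_length:
        length = min(room, payload_length - position)
        more = position + length < payload_length   # M flag: more fragments follow
        fragments.append((position // 8, length, more))
        position += length
    return fragments
```

For a payload of 3000 octets and a path MTU of 1280 octets this yields three fragments with the offsets 0, 154 and 308; only the last fragment has the M flag cleared.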
The Authentication Header (AH) extension header is used to provide data origin authentication, data integrity
and replay protection services for IPv6 packets.
The AH extension header as defined in RFC 2402 [37] has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next Header | Payload Len | RESERVED |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Security Parameters Index (SPI) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Authentication Data (variable) |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Next Header field identifies the type of the payload following the AH extension header.
The Payload Len field contains the length of the AH extension header counted in
32-bit words minus 2.
The Security Parameters Index field contains a value which together with the destination
address identifies a so called Security Association (SA). The SA is basically a data structure which
maintains all the necessary cryptographic information.
The Sequence Number field contains a monotonically increasing sequence number. The first
packet which is sent after establishing an SA has the sequence number 1. If the sequence number
reaches 2^32, a new SA has to be established.
The Authentication Data field contains an integrity check value, ICV. The length of this field
depends on the authentication function in use (which is determined by the SA).
The Encapsulating Security Payload (ESP) extension header realizes security services such as confidentiality,
data origin authentication, data integrity, replay protection and limited traffic flow confidentiality.
The ESP as defined in RFC 2406 [38] has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ----
| Security Parameters Index (SPI) | Auth.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Cov-
| Sequence Number | |erage
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ----
| Payload Data* (variable) | |
| |
| | |Conf.
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |Cov-
| | Padding (0-255 bytes) | |erage*
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| | Pad Length | Next Header | v v
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ------
| Authentication Data (variable) |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Security Parameters Index field contains a value which together with the destination
address identifies a so called Security Association (SA). The SA is basically a data structure which
maintains all the necessary cryptographic information.
The Sequence Number field contains a monotonically increasing sequence number. The first
packet which is sent after establishing an SA has the sequence number 1. If the sequence number
reaches 2^32, a new SA has to be established.
The Payload Data field contains the encrypted payload (including any required initialization vec-
tors).
The Padding field can be used to align the payload to a certain desired length or to provide a certain
size required by the encryption function. The Padding field can also be used to hide the original size
of the actual payload.
The Pad Length field contains the number of fill bytes.
The Next Header field identifies the type of the payload.
The Authentication Data field contains an integrity check value, ICV. The length of this field
depends on the authentication function in use (which is determined by the SA).
Fragmentation can only happen after encryption. It is not allowed to apply ESP on a fragment.
The Hop-by-Hop Options (HO) extension header carries optional information that must be examined by
every node along a packet's delivery path.
The HO extension header as defined in RFC 2460 [17] has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next Header | Hdr Ext Len | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| |
. .
. Options .
. .
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Next Header field identifies the type of the payload following the HO extension header.
The Hdr Ext Len field contains the length of the HO counted in 64-bit words minus 1.
The Options field contains the list of options. Each option is encoded as a tag-length-value (TLV)
triple:
 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- - - - - - - - -
|  Option Type  |  Opt Data Len |  Option Data
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- - - - - - - - -
The Option Type field identifies the option kind and the Opt Data Len field contains the
length of the Option Data field counted in bytes. The options in a HO extension
header are processed in the order in which they appear in the header.
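The in-order processing of the option list can be sketched with a short loop. One detail not mentioned above is assumed from RFC 2460: the Pad1 option (Option Type 0) consists of a single octet and has neither a length nor a data field. The function name `parse_ipv6_options` is hypothetical, and the sketch returns the raw option data without interpreting it.

```python
def parse_ipv6_options(options: bytes) -> list:
    """Return the (Option Type, Option Data) pairs in the order they appear."""
    result = []
    i = 0
    while i < len(options):
        option_type = options[i]
        if option_type == 0:          # Pad1: a single zero octet without length and data
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("truncated option")
        data_len = options[i + 1]     # Opt Data Len, counted in bytes
        data = options[i + 2 : i + 2 + data_len]
        if len(data) != data_len:
            raise ValueError("truncated option data")
        result.append((option_type, data))
        i += 2 + data_len
    return result
```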
The Destination Options (DO) extension header carries optional information that must be processed by the
final receiver of the packet.
The DO extension header as defined in RFC 2460 [17] has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next Header | Hdr Ext Len | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| |
. .
. Options .
. .
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The fields Next Header, Hdr Ext Len and Options have the same format and semantics as
in the HO extension header.
IPv6 packets are forwarded using the same longest prefix match algorithm that is used in IPv4 networks.
However, IPv6 addresses have much longer prefixes, which allows better address aggregation in order
to reduce the number of forwarding table entries. On the other hand, due to the length of the prefixes, it is
even more crucial to use an algorithm whose complexity does not depend on the number of entries in the
forwarding table or the average prefix length.
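The semantics of longest prefix matching can be shown with a deliberately naive sketch. It scans the whole table linearly, so its running time grows with the number of entries, which is exactly what a production forwarding engine has to avoid (for example with trie-based structures). The function name `longest_prefix_match` and the table layout (prefix string mapped to a next-hop label) are assumptions made for this illustration.

```python
import ipaddress


def longest_prefix_match(table: dict, destination: str):
    """Return the next hop of the most specific matching prefix, or None."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in table.items():
        network = ipaddress.ip_network(prefix)
        if network.version == address.version and address in network:
            # Prefer the entry with the longest prefix among all matches.
            if best is None or network.prefixlen > best[0]:
                best = (network.prefixlen, next_hop)
    return None if best is None else best[1]
```

With the entries `2001:db8::/32` and `2001:db8:1::/48` in the table, a packet to `2001:db8:1::5` matches both prefixes and is forwarded according to the /48 entry.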
The Internet Control Message Protocol Version 6 (ICMPv6) is an adapted version of the ICMPv4 protocol.
It introduces a set of control messages which are needed to report errors, to run diagnostic tests, to auto-
configure IPv6 nodes and to resolve IPv6 addresses to link-layer addresses.
The ICMPv6 messages defined in RFC 2463 [39] have the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ Message Body +
| |
The Type field identifies the type of an ICMPv6 message. ICMPv6 messages are categorized into
error messages (Type 0-127) and informational messages (Type 128-255).
The Code field contains a value which further discriminates the message type. The exact meaning of
the Code value depends on the contents of the Type field.
The Checksum field contains the Internet checksum computed over the ICMPv6 message and parts
of the IPv6 protocol header.
The contents of the Message Body depend on the ICMPv6 message type.
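The checksum computation can be sketched as follows. The "parts of the IPv6 protocol header" are assumed here to be the pseudo-header of RFC 2460 (source address, destination address, upper-layer length and the Next Header value 58 for ICMPv6); the function names are hypothetical, and the Checksum field of the message must be zero while the checksum is computed.

```python
import ipaddress
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold the carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def icmpv6_checksum(source: str, destination: str, message: bytes) -> int:
    """Checksum over the pseudo-header followed by the ICMPv6 message."""
    pseudo_header = (
        ipaddress.IPv6Address(source).packed
        + ipaddress.IPv6Address(destination).packed
        + struct.pack("!I3xB", len(message), 58)   # length, 3 zero octets, Next Header
    )
    return internet_checksum(pseudo_header + message)
```

A receiver can verify a message by running the same computation over the received octets including the Checksum field; the result is then zero for an undamaged message.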
IPv6 packets can be encapsulated into IEEE 802.3 frames and sent over IEEE 802.3 networks as defined in
RFC 2464 [40]:
Frames containing IPv6 packets are identified by the value 0x86dd in the IEEE 802.3 type field.
The link MTU is 1500 bytes which corresponds to the IEEE 802.3 maximum payload size of 1500 bytes.
The mapping of IPv6 addresses to IEEE 802.3 addresses is table driven. Entries in so-called mapping
tables (sometimes also called address translation tables) can either be statically configured or
dynamically learned using neighbor discovery.
IPv6 supports the automatic configuration of hosts (autoconfiguration) and the discovery of neighbors
attached to the same link. The Neighbor Discovery (ND) protocol, which is part of ICMPv6, simplifies the
configuration of hosts and includes features that are realized by different protocols (ICMPv4, ARP) in IPv4.
ND as documented in RFC 2461 [41] supports the following features:
Discovery of the local routers that are attached to the same link (router discovery).
Discovery of the prefixes used on a link so that it is possible to determine which IPv6 addresses
can be reached directly (prefix discovery).
Discovery of parameters such as the link MTU or the hop limit for outgoing packets (parameter dis-
covery).
Automatic configuration of IPv6 addresses (address autoconfiguration).
Resolution of IPv6 addresses to link-layer addresses (address resolution).
Determination of next-hop addresses for IPv6 destination addresses (next-hop determination).
Detection of unreachable nodes which are attached to the same link (neighbor unreachability detec-
tion).
Detection of conflicts that can arise during address generation (duplicate address detection).
Discovery of better alternatives to forward packets (redirect).
all-nodes: The link-local multicast address FF02::1 is used to reach all nodes connected to a link.
all-routers: The link-local multicast address FF02::2 is used to reach all routers connected to a link.
solicited-node: A link-local multicast address which is derived from the address of a node. It
is formed by taking the low-order 24 bits of the address and appending those bits to the prefix
FF02:0:0:0:0:1:FF00::/104.
link-local: A link-local unicast address which in the case of IEEE 802 links can be derived from the
IEEE 802 MAC address as discussed above.
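The construction of the solicited-node address is a simple bit operation, sketched below with Python's `ipaddress` module. The function name `solicited_node_address` is a hypothetical helper.

```python
import ipaddress


def solicited_node_address(address: str) -> ipaddress.IPv6Address:
    """Append the low-order 24 bits of an address to FF02:0:0:0:0:1:FF00::/104."""
    low24 = int(ipaddress.IPv6Address(address)) & 0xFFFFFF
    prefix = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(prefix | low24)
```

For the link-local address `fe80::211:22ff:fe33:4455` this yields `ff02::1:ff33:4455`. Addresses that share the same low-order 24 bits map to the same solicited-node group.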
The ND protocol is realized as an extension of the ICMPv6 protocol and introduces five new message formats.
To prevent some attacks on the ND protocol, it is required that the Hop Limit field of the IPv6 protocol
header is set to the value 255. Receivers of ND protocol messages must discard messages where the Hop
Limit field does not contain the value 255. A packet can only arrive with a value other than 255 if it
has been forwarded by a router, which indicates a potential attack from somewhere outside the link.
Router Solicitation
Hosts can ask routers attached to a link to generate router advertisements by sending a Router Solicitation
(RS) message to the all-routers link-local multicast group. The format of the RS message is as follows:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
The Type field contains the value 133 and the Code field contains the value 0.
The Checksum field contains the usual ICMPv6 checksum.
The Options field may contain the link-layer address of the sender, if known.
Router Advertisement
Routers send Router Advertisement (RA) messages periodically or as a reaction to an RS message to the
all-nodes multicast group. The format of the RA message is as follows:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Cur Hop Limit |M|O| Reserved | Router Lifetime |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reachable Time |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Retrans Timer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
The Type field contains the value 134 and the Code field contains the value 0.
The Checksum field contains the usual ICMPv6 checksum.
The Cur Hop Limit field contains a proposed value which should be used by hosts in the Hop
Limit field of outgoing IPv6 packets.
The flag M indicates that hosts should use in addition another mechanism such as DHCPv6 for the
autoconfiguration of addresses (managed address configuration).
The flag O indicates that hosts should use in addition another mechanism such as DHCPv6 for the
autoconfiguration of other parameters (other stateful configuration).
The Router Lifetime field defines the time (in seconds) in which the advertised router may be
used as a default router.
The Reachable Time field defines the time (in milliseconds) in which a node assumes a neighbor
is reachable after having received a reachability confirmation.
The Retrans Timer field defines the time (in milliseconds) between retransmitted Neighbor So-
licitation messages.
The Options field may contain additional parameters such as the link-layer address of the sending
router, the link MTU or information about the prefixes that are used on the link.
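The fixed part of an RA message can be decoded as sketched below. The function name `parse_router_advertisement` and the dictionary keys are illustrative; the Options field is returned as raw octets and the checksum is not verified.

```python
import struct


def parse_router_advertisement(message: bytes) -> dict:
    """Decode the fixed 16 octets of a Router Advertisement message."""
    if len(message) < 16:
        raise ValueError("a Router Advertisement has at least 16 octets")
    (msg_type, code, checksum, cur_hop_limit, flags,
     router_lifetime, reachable_time, retrans_timer) = struct.unpack("!BBHBBHII", message[:16])
    if msg_type != 134 or code != 0:
        raise ValueError("not a Router Advertisement")
    return {
        "checksum": checksum,
        "cur_hop_limit": cur_hop_limit,
        "managed": bool(flags & 0x80),        # flag M
        "other": bool(flags & 0x40),          # flag O
        "router_lifetime": router_lifetime,   # seconds
        "reachable_time": reachable_time,     # milliseconds
        "retrans_timer": retrans_timer,       # milliseconds
        "options": message[16:],
    }
```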
Neighbor Solicitation
Nodes can ask other nodes attached to a link to generate neighbor advertisements by sending a Neighbor
Solicitation (NS) message to the solicited-node multicast address of the target (or to the unicast
address of the target when verifying reachability). The format of the NS message is as follows:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Target Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
The Type field contains the value 135 and the Code field contains the value 0.
The Checksum field contains the usual ICMPv6 checksum.
The Target Address field contains the address for which information is requested.
The Options field may contain the link-layer address of the sender, if known.
Neighbor Advertisement
Hosts send a Neighbor Advertisement (NA) message as a reaction to a Neighbor Solicitation. Unsolicited
NA messages can also be sent in order to propagate changes quickly. Solicited NA messages are sent to the
IPv6 address of the requestor while unsolicited NA messages are sent to the all-nodes multicast group.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|R|S|O| Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Target Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
The Type field contains the value 136 and the Code field contains the value 0.
The Checksum field contains the usual ICMPv6 checksum.
The flag R indicates that the sender is a router. The flag S indicates that the message is sent as
a reaction to a Neighbor Solicitation message. The flag O indicates that the contained information
should overwrite any existing cache entries.
The Options field may contain the link-layer address of the sender, if known.
Redirect
Routers can generate Redirect (R) messages to inform hosts about better paths towards a given destination
address.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Target Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Destination Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options ...
+-+-+-+-+-+-+-+-+-+-+-+-
The Type field contains the value 137 and the Code field contains the value 0.
The Checksum field contains the usual ICMPv6 checksum.
3.4. ROUTING PROTOCOLS 53
The Target Address field contains the IPv6 address of a router which provides a better path to
the destination address.
The Destination Address field contains the destination address which is being redirected.
The Options field may contain the link-layer address of the Target Address, if known. In
addition, the Options field should contain the beginning of the IPv6 packet which caused the generation
of the Redirect message.
For routing purposes, the Internet is divided into autonomous systems (ASs). An autonomous system
(AS) is basically a set of routers and networks under the same administration.
The routing protocol(s) within an AS are called Interior Gateway Protocols (IGPs). They are generally
independent of the routing protocols used in other ASs. Widely used IGPs are the Routing
Information Protocol (RIP) and the Open Shortest Path First (OSPF) protocol.
The routing protocol(s) between ASs are called Exterior Gateway Protocols (EGPs). The currently
most widely used EGP is the Border Gateway Protocol version 4 (BGP4).
The Routing Information Protocol version 2 (RIP-2) defined in RFC 2453 [42] is a simple routing protocol
to be used within ASs. It is based on the exchange of distance vectors and thus falls into the class of dis-
tance vector routing protocols. The foundation of this protocol is the Bellman-Ford algorithm for computing
shortest paths in graphs.
Let $G = (V, E)$ be a graph with the vertices $V$ and the edges $E$ with $n = |V|$ and $m = |E|$.
Let $D$ be an $n \times n$ distance matrix in which $D(i, j)$ denotes the distance from node $i \in V$ to the node
$j \in V$.
Let $H$ be an $n \times n$ matrix in which $H(i, j) \in E$ denotes the edge on which node $i \in V$ forwards a
message to node $j \in V$.
Let $M$ be a vector with the link metrics, $S$ a vector with the start node of the links and $D$ a vector with
the end nodes of the links.
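With these definitions, the Bellman-Ford computation of the distance matrix and the forwarding matrix can be sketched as follows. This is a centralized illustration under stated assumptions, not the distributed exchange that RIP performs: nodes are numbered from 0, links are directed, and the Python names `metric`, `start`, `end`, `dist` and `hop` stand for the link vectors and the two matrices defined above (renamed to avoid using the same letter for the distance matrix and the end-node vector).

```python
import math


def bellman_ford(n: int, start: list, end: list, metric: list):
    """Compute the distance matrix and the forwarding matrix for n nodes.

    start[e], end[e] and metric[e] describe the directed link e.
    """
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    hop = [[None] * n for _ in range(n)]
    for _ in range(n - 1):                      # at most n - 1 rounds are needed
        changed = False
        for e in range(len(metric)):
            i, k = start[e], end[e]
            for j in range(n):
                # Node i reaches j via link e if the neighbour k offers a shorter path.
                if metric[e] + dist[k][j] < dist[i][j]:
                    dist[i][j] = metric[e] + dist[k][j]
                    hop[i][j] = e
                    changed = True
        if not changed:                         # distances have converged early
            break
    return dist, hop
```

Each round corresponds to every node receiving the distance vectors of its neighbours once and keeping, for each destination, the cheapest combination of link metric and advertised distance.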
Properties
Simple distance vector protocols like RIP have the property that good news propagates quickly while
bad news propagates relatively slowly.
In particular, the failure of links can lead to situations where the bad news propagates slowly by
counting up the costs (count to infinity).
RIP defines infinity to be 16 hops. Hence, RIP can only be used in networks where the longest path
(the network diameter) is smaller than 16 hops.
RIP uses the number of hops as the only metric.
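The count-to-infinity effect can be illustrated with a small simulation. The sketch below assumes a line topology A - B - C in which the link B - C has just failed, no split horizon, and a strictly alternating exchange of distance vectors between A and B. These are simplifying assumptions, and the function name is made up for this example.

```python
INFINITY = 16  # RIP's representation of "unreachable"

def count_to_infinity(dist_a=2, dist_b=1):
    """Simulate the distances of A and B to C after B - C has failed.

    B has lost its direct link and falls back to the stale route that
    A still advertises; A in turn routes via B. Returns the list of
    (dist_a, dist_b) pairs after every exchange.
    """
    history = []
    while dist_a < INFINITY or dist_b < INFINITY:
        dist_b = min(dist_a + 1, INFINITY)  # B learns "C via A"
        dist_a = min(dist_b + 1, INFINITY)  # A learns "C via B"
        history.append((dist_a, dist_b))
    return history
```

Under these assumptions both routers need eight exchanges before they agree that C is unreachable, which is the slow propagation of bad news described above.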
Protocol
RIP-2 runs over the User Datagram Protocol (UDP) and normally uses port number 520. All RIP-2
messages have the following structure:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    Command    |    Version    |         must be zero          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                          RIP Entries                          ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Command field indicates whether the message is a request or a response. Response messages
can also be sent without a previous request (unsolicited responses).
The Version field contains the protocol version number.
The RIP Entries field contains a list of fixed-size RIP Entries.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Address Family Identifier   |           Route Tag           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          IP Address                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Subnet Mask                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Next Hop                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Metric                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Address Family Identifier field identifies an address family. RIP was originally
developed for networks with different address formats.
The Route Tag field marks entries which contain external routes, which might have been established
by an EGP.
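The message layout above maps directly onto fixed-size binary records. The following Python sketch decodes a RIP-2 message using the standard `struct` and `socket` modules. The function name and the dictionary keys are chosen for this example only, and the sketch assumes that all entries are ordinary IPv4 route entries (no authentication entry).

```python
import socket
import struct

def parse_rip2(message: bytes):
    """Split a RIP-2 message into (command, version, entries)."""
    # 4-byte header: Command (1), Version (1), must be zero (2).
    command, version, _zero = struct.unpack("!BBH", message[:4])
    entries = []
    # Every RIP entry is exactly 20 bytes long.
    for offset in range(4, len(message) - 19, 20):
        afi, tag, addr, mask, next_hop, metric = struct.unpack(
            "!HH4s4s4sI", message[offset:offset + 20])
        entries.append({
            "afi": afi,
            "route_tag": tag,
            "address": socket.inet_ntoa(addr),
            "mask": socket.inet_ntoa(mask),
            "next_hop": socket.inet_ntoa(next_hop),
            "metric": metric,
        })
    return command, version, entries
```

Trailing bytes that do not form a complete 20-byte entry are ignored by this sketch; a real implementation would probably reject such a message.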
3.4. ROUTING PROTOCOLS 55
The first RIP Entry can have a special format to support authentication:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            0xFFFF             |      Authentication Type      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                        Authentication                         +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The constant 0xFFFF is used to distinguish an authentication entry from other entries.
The Authentication Type field identifies an authentication scheme. The RIP-2 specification
only defines a simple cleartext password authentication scheme.
The Authentication field contains data which is checked by the receiver to determine the au-
thenticity of the message.
RFC 2082 [43] defines an authentication scheme based on MD5 which uses an additional trailer at the
end of a RIP-2 message.
5. If there are still nodes with tentative cost labels, a node with the smallest cost is selected as the new
current node. Go to step 3 if a new current node was selected.
6. The shortest paths to a destination node can now be read by following the labels from the destination
node towards the root.
OSPF Areas
An OSPF area is a set of networks grouped together within an autonomous system.
The internal topology of an OSPF area is invisible to other OSPF areas. The routing within an area
(intra-area routing) is constrained to that area.
The OSPF areas are inter-connected via the OSPF backbone area (OSPF area 0). A path from a source
node within one area to a destination node in another area has three segments (inter-area routing):
1. An intra-area path from the source to a so-called area border router.
2. A path in the backbone area from the area border router of the source area to the area border router of
the destination area.
3. An intra-area path from the area border router of the destination area to the destination node.
OSPF routers are classified according to their location in the OSPF topology:
1. Internal Router: A router where all interfaces belong to the same OSPF area.
2. Area Border Router: A router which connects multiple OSPF areas. An area border router has
to be able to run the basic OSPF algorithm for all areas it is connected to.
3. Backbone Router: A router that has an interface to the backbone area. Every area border router
is automatically a backbone router.
4. AS Boundary Router: A router that exchanges routing information with routers belonging to
other autonomous systems.
Stub Areas are OSPF areas with a single area border router. The routing in stub areas can be simplified
by using default forwarding table entries which significantly reduces the overhead.
Protocol
OSPF messages are carried in IP packets. The value of the Protocol field of the IPv4 header or the Next
Header field of the IPv6 header is 89 for the OSPF protocol. All OSPF messages have the same header:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Version # | Type | Packet Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Router ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Area ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum | AuType |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Authentication |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Authentication |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Hello
The Hello protocol is used to test the status of links and the attached neighbors. The hello protocol works
differently on broadcast networks, non-broadcast multi-access networks and point-to-multipoint networks.
On broadcast and non-broadcast multi-access networks, the hello protocol selects a Designated Router and a
Backup Designated Router.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Version # | Type = 1 | Packet length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Router ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Area ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum | AuType |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Authentication |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Authentication |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Network Mask |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| HelloInterval | Options | Rtr Pri |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RouterDeadInterval |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Designated Router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Backup Designated Router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Neighbor |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ... |
The first fields contain the standard OSPF message header, in which the Type field has the value 1.
58 CHAPTER 3. INTERNET NETWORK LAYER
The Network Mask field contains the network mask for the interface.
The HelloInterval field contains the time interval in seconds between successive
Hello messages.
The Rtr Pri field contains the priority of the router, which is used for the election of the Designated
Router and the Backup Designated Router. Routers with priority 0 do not take part in the election.
The RouterDeadInterval field defines the time interval in seconds after which a router is
considered unreachable.
The Designated Router field contains the identity of the Designated Router, or 0 if no
Designated Router is known yet.
The Backup Designated Router field contains the identity of the Backup Designated Router,
or 0 if no Backup Designated Router is known yet.
At the end of the message there is a list of Neighbor fields. Each Neighbor field indicates
the identity of a router from which a Hello message has been received within the last
RouterDeadInterval.
A link is considered available if Hello messages can be exchanged in both directions. On direct
connections (point-to-point links, virtual links), the exchange of the database can begin as soon as
the link has been recognized as available.
On network connections (broadcast links, non-broadcast links), the Designated Router and the
Backup Designated Router are determined first:
1. Initially, a router behaves passively for one RouterDeadInterval: it collects incoming
Hello messages and generates its own Hello messages in which it does not stand for election.
Afterwards, only those neighbors are considered for which the link is available in both
directions.
2. If one or more routers have offered themselves as Backup Designated Router, the router with the
highest priority is selected. If the priority is not unique, the router with the largest
identification number is selected among the candidates.
3. If no router has offered itself as Backup Designated Router, the router with the highest
priority (and the largest identification number) is selected.
4. If one or more routers have offered themselves as Designated Router, the router with the highest
priority is selected. If the priority is not unique, the router with the largest identification
number is selected among the candidates.
5. If no router has offered itself as Designated Router, the router with the highest priority
(and the largest identification number) is selected.
A router cannot be Designated Router and Backup Designated Router at the same time. Therefore,
steps 2 and 3 have to be repeated after step 5.
Exchange
Flooding
...
Autonomous systems usually perform policy-based routing by using the Border Gateway Protocol version
4 (BGP4) as defined in RFC 1771 [45] to exchange reachability information between autonomous systems
(ASs). The reachability information is sufficient to construct a graph of AS connectivity from which routing
loops may be pruned and some policy decisions at the autonomous system level may be enforced.
BGP4 runs over the reliable transport protocol TCP which eliminates explicit fragmentation, retransmission,
acknowledgement, and sequencing. BGP4 uses TCP port 179 for establishing connections between two
BGP4 peers which are typically located in different ASs.
When two ASs agree to exchange routing information, each AS must designate a router that will speak BGP4
on its behalf. These two routers are called BGP4 peers. The peers establish a TCP connection and run the
BGP4 protocol which basically has three phases:
1. The BGP4 peers exchange messages to open and confirm connection parameters.
2. The BGP4 peers initially exchange the entire BGP routing table. Incremental updates are sent as the
routing tables change.
3. The BGP4 peers periodically exchange so-called keep-alive messages to ensure that the connection
and the BGP4 peers are alive.
Each BGP4 message has a fixed-size header which may or may not be followed by a data portion:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ +
| Marker |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Type |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Marker field contains a value (initially all 1s) that the receiver of the BGP4 message can
predict and verify. The Marker field can be used to detect loss of synchronization and to authenticate
incoming BGP messages.
The Length field indicates the total length of the message including the header, counted in bytes.
The maximum length of BGP4 messages is 4096 bytes.
The Type field indicates the type of the message. The following message types are defined:
1. OPEN
2. UPDATE
3. NOTIFICATION
4. KEEPALIVE
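The fixed header is 19 bytes long (16 bytes Marker, 2 bytes Length, 1 byte Type), so a receiver can validate it before looking at the data portion. The sketch below shows one way to build and check such a header in Python. The function names are made up for this example, and the Marker is not verified beyond being unpacked.

```python
import struct

BGP_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}
HEADER_LEN = 19   # 16 bytes Marker + 2 bytes Length + 1 byte Type
MAX_LEN = 4096    # maximum BGP4 message length in bytes

def build_message(msg_type: int, body: bytes = b"") -> bytes:
    """Prepend a header with an all-ones Marker to `body`."""
    return struct.pack("!16sHB", b"\xff" * 16,
                       HEADER_LEN + len(body), msg_type) + body

def parse_bgp_header(data: bytes):
    """Return (type_name, length); raise ValueError if malformed."""
    if len(data) < HEADER_LEN:
        raise ValueError("truncated header")
    _marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if not HEADER_LEN <= length <= MAX_LEN:
        raise ValueError("bad length")
    if msg_type not in BGP_TYPES:
        raise ValueError("unknown message type")
    return BGP_TYPES[msg_type], length
```

Because BGP4 runs over a TCP byte stream, a reader would typically read 19 bytes first, call `parse_bgp_header`, and then read the remaining `length - 19` bytes.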
It is important to realize that BGP peers in general advertise only routes that should be seen from the outside.
There might be additional possible routes which for policy reasons are not announced to other peers.
Furthermore, it is important to realize that BGP only advertises routing information. The final decision about which
paths are selected by putting appropriate entries into the forwarding tables remains a local policy decision.
For some analysis about the usage of BGP, the growth of BGP routing tables and the increase of AS numbers,
see [46].
Once a TCP connection has been established between two BGP4 peers, they both send an OPEN message to
communicate their AS number and to establish other parameters.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+
| Version |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Autonomous System Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Hold Time |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| BGP Identifier |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Opt Parm Len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| Optional Parameters |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The UPDATE messages are used to transfer routing information between BGP peers. The information in the
UPDATE packet can be used to construct a graph describing the relationships of the various Autonomous
Systems.
An UPDATE message may simultaneously advertise a feasible route and withdraw multiple unfeasible routes
from service. Hence, the UPDATE message consists of two parts:
+-----------------------------------------------------+
| Unfeasible Routes Length (2 octets) |
+-----------------------------------------------------+
| Withdrawn Routes (variable) |
+-----------------------------------------------------+
| Total Path Attribute Length (2 octets) |
+-----------------------------------------------------+
| Path Attributes (variable) |
+-----------------------------------------------------+
| Network Layer Reachability Information (variable) |
+-----------------------------------------------------+
The Unfeasible Routes Length field indicates the total length of the Withdrawn Routes
field counted in bytes. The value 0 indicates that no routes are being withdrawn.
The Withdrawn Routes field contains a list of IPv4 address prefixes that are being withdrawn
from service. Each IPv4 address prefix is encoded as a 2-tuple of the form (length, prefix) where
the length indicates the prefix length and prefix contains the IPv4 address prefix bits padded to
the next byte boundary.
The Total Path Attribute Length field indicates the total length of the Path Attributes
field counted in bytes.
The Path Attributes field contains a list of path attributes. Each attribute is encoded using a
tag-length-value (TLV) triple.
Path attributes convey information such as the origin of the path information (ORIGIN), the sequence
of AS path segments (AS PATH), the IPv4 address of the border router that should be used as the next
hop (NEXT HOP), or the local preference assigned by a BGP4 speaker (LOCAL PREF).
The Network Layer Reachability Information field contains a list of IPv4 address
prefixes that are being advertised. Each IPv4 address prefix is encoded as a 2-tuple
of the form (length, prefix) where the length indicates the prefix length and prefix contains
the IPv4 address prefix bits padded to the next byte boundary.
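The (length, prefix) encoding used for withdrawn routes and reachability information can be sketched in a few lines of Python. The sketch assumes that the length is a single byte holding the prefix length in bits, and the function names are chosen for this example.

```python
import ipaddress

def encode_prefix(prefix: str) -> bytes:
    """Encode e.g. "10.1.0.0/16" as one length byte plus prefix bytes."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8   # pad to the next byte boundary
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]

def decode_prefixes(data: bytes):
    """Decode a concatenation of (length, prefix) tuples."""
    prefixes, i = [], 0
    while i < len(data):
        length = data[i]
        nbytes = (length + 7) // 8
        # Restore the omitted trailing bytes as zeros.
        raw = data[i + 1:i + 1 + nbytes].ljust(4, b"\x00")
        prefixes.append(f"{ipaddress.IPv4Address(raw)}/{length}")
        i += 1 + nbytes
    return prefixes
```

A /16 prefix therefore occupies three bytes and a /22 prefix four bytes, which keeps UPDATE messages compact when many short prefixes are announced.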
BGP4 supports a NOTIFICATION message type used for control or when an error occurs. The transport
connection is closed immediately after sending a NOTIFICATION message.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Error code | Error subcode | Data |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Error code field and the Error subcode field contain one of the following error codes:
BGP4 peers periodically exchange KEEPALIVE messages. A KEEPALIVE message consists of the standard
BGP4 header with no additional data. The KEEPALIVE messages are needed to verify that shared state
information is still present. If a BGP4 peer does not receive a message within the Hold Time, then the peer
will assume that there is a communication problem and tear down the connection.
Chapter 4
The transport layer is responsible for providing application protocols with suitable transport services. Some
application protocols require a stream-based connection while others prefer a reliable datagram service and yet
others are happy with a lightweight unreliable datagram service.
IP Layer IP Address
IP addresses are network layer endpoints and identify interfaces on nodes (hosts or routers) (see footnote 1).
Network addresses have node-to-node significance.
Transport layer endpoints identify communicating application processes and are in the Internet rep-
resented by a tuple consisting of an IP address and a 16-bit port number. Transport addresses have
end-to-end significance.
The number space for port numbers is divided into a port number range that can freely be used and a port
number range which is managed by the Internet Assigned Numbers Authority (IANA). Well-known
port numbers for standardized or frequently used protocols can be registered with IANA.
Port numbers basically allow packets to be multiplexed and demultiplexed at the transport layer as shown in
Figure 4.1.
1. The User Datagram Protocol (UDP) provides a simple unreliable best-effort datagram service.
Footnote 1: It is worth noting that IP addresses are typically used to identify (interfaces of) nodes as well as their
location in the network. This dual role of IP addresses becomes interesting in the context of mobile devices.
64 CHAPTER 4. INTERNET TRANSPORT LAYER
2. The Transmission Control Protocol (TCP) provides a bidirectional, connection-oriented and reliable
data stream.
3. The Stream Control Transmission Protocol (SCTP) provides a reliable transport service supporting se-
quenced delivery of messages within multiple streams. SCTP maintains application protocol message
boundaries (application protocol framing) and was designed to support signaling protocols.
4. The Real-Time Transport Protocol (RTP) provides a transport service for real-time multi-media
applications where different data streams have to be synchronized. RTP is often implemented on top of
UDP (and is thus, from a layering perspective, not a pure transport layer protocol).
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Destination Address |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unused (0) | Protocol | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The IPv6 pseudo header consists of the IPv6 source and destination address, the length of the transport layer
message and the next header field value which identifies the transport protocol.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Source Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ Destination Address +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Upper-Layer Packet Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| zero | Next Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
4.2. USER DATAGRAM PROTOCOL (UDP) 65
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Source Port field contains the port number used by the sending application layer process.
The Destination Port field contains the port number used by the receiving application layer
process.
The Length field contains the length of the UDP datagram including the UDP header counted in
bytes.
The Checksum field contains the Internet checksum computed over the pseudo header, the UDP
header and the payload contained in the UDP packet.
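The checksum computation can be sketched as follows for IPv4. The Internet checksum is the 16-bit one's complement of the one's complement sum of all 16-bit words. The function names are chosen for this example, protocol number 17 identifies UDP, and the rule that a computed value of zero is transmitted as all ones comes from the UDP specification rather than from the text above.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of `data`."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a 16-bit boundary
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def udp_checksum_ipv4(src: str, dst: str, datagram: bytes) -> int:
    """Checksum of a UDP datagram whose Checksum field is set to 0."""
    pseudo = struct.pack("!4s4sBBH", socket.inet_aton(src),
                         socket.inet_aton(dst), 0, 17, len(datagram))
    checksum = internet_checksum(pseudo + datagram)
    return checksum or 0xFFFF                # 0 is sent as all ones
```

A receiver can verify a datagram by running `internet_checksum` over the pseudo header and the received datagram including its Checksum field; the result is 0 if no error was detected.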
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Acknowledgment Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Offset| Reserved | Flags | Window |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum | Urgent Pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Options | Padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Source Port field contains the port number used by the sending application layer process.
The Destination Port field contains the port number used by the receiving application layer
process.
After connection establishment, the Sequence Number field contains the sequence number of the
first data byte in the segment. During connection establishment, this field is used to establish the initial
sequence number.
The Acknowledgment Number field contains the next sequence number which the sender of the
acknowledgement expects.
The Offset field contains the length of the TCP header including any options, counted in 32-bit
words.
The Flags field contains a set of binary flags:
URG: Indicates that the Urgent Pointer field is significant.
ACK: Indicates that the Acknowledgment Number field is significant.
PSH: Data should be pushed to the application as quickly as possible.
RST: Reset of the connection.
SYN: Synchronization of sequence numbers.
FIN: No more data from the sender.
The Window field indicates the number of data bytes which the sender of the segment is willing to
receive.
The Checksum field contains the Internet checksum computed over the pseudo header, the TCP
header and the data contained in the TCP segment.
The Urgent Pointer field points, relative to the sequence number of the segment, to important data if the
URG flag is set.
The Options field can contain additional options.
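A short Python sketch shows how the fixed 20-byte part of the header can be decoded. The bit positions of the flags (FIN in the least significant bit, URG in bit 5) follow the TCP specification and are not stated in the text above; the function name and the dictionary keys are chosen for this example.

```python
import struct

# Flag names ordered from the least significant bit upwards.
FLAG_NAMES = ("FIN", "SYN", "RST", "PSH", "ACK", "URG")

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the TCP header at the start of `segment`."""
    (src, dst, seq, ack, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (off_flags >> 12) * 4       # Offset counts 32-bit words
    flags = {name for bit, name in enumerate(FLAG_NAMES)
             if off_flags & (1 << bit)}
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags,
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[20:header_len],   # empty if Offset is 5
        "payload": segment[header_len:],
    }
```

The Offset field is what allows the receiver to find the start of the payload, since the Options field has a variable length.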
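[Figure: TCP connection establishment. The active side sends SYN x, the passive side answers with ACK x+1, SYN y, and the active side answers with ACK y+1.]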
One of the TCP protocol engines first waits passively for incoming connections (passive open).
The other TCP protocol engine actively initiates the connection establishment procedure (active open).
The first TCP packet contains the initial sequence number in a SYN packet. The initial sequence
number is determined by a counter which is incremented roughly every 4 microseconds. This guarantees
that the initial sequence number is not reused as long as old packets may still exist in the network.
The passive TCP engine stores the received sequence number and sends its own randomly created
initial sequence number, at the same time acknowledging the received SYN packet.
The active TCP engine stores the received sequence number and acknowledges the receipt of this
sequence number.
After completing the connection establishment procedure, TCP initially provides a bidirectional data stream.
It is possible to turn the bidirectional connection into a unidirectional connection by closing one half of the
connection. The TCP connection itself is terminated when both unidirectional connections have been closed.
In the normal case (no packet loss), the connection tear-down is performed as shown in Figure 4.3.
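[Figure 4.3: TCP connection tear-down. One side sends FIN x, which is acknowledged with ACK x+1; later the other side sends ACK x+1, FIN y, which is acknowledged with ACK y+1.]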
The connection tear-down procedure is started by a TCP protocol engine by setting the FIN flag.
The receiver usually first acknowledges the receipt of the FIN packet.
The receiving protocol engine then informs the application about the tear-down of the first half of the
connection.
Once the application indicates that it wants to close the other half of the connection, another TCP
packet is transmitted into the other direction with a FIN flag set.
The receiver of the second FIN packet acknowledges the receipt of the second FIN packet and the
connection is closed.
In cases where a connection between two TCP engines is interrupted (e.g., a cable breaks or a node is turned
off), the TCP specification requires a quiet time of 120 seconds (maximum segment lifetime, MSL) before new
TCP connections can be established. The quiet time is motivated by the time needed to ensure that packets
belonging to the broken TCP connection have disappeared from the network.
The various transitions possible during connection establishment and tear-down are best described by a finite
state machine as shown below.
The TCP state machine has the states shown in Table 4.1.
4.3. TRANSMISSION CONTROL PROTOCOL (TCP) 69
1. The sending application delivers urgent data that should be transmitted and delivered to the remote application as
fast as possible.
2. The sending application may send a 1 byte segment in order to make the receiver reannounce the next byte
expected and the current window size. This is useful to protect against deadlocks that can otherwise occur if a
window update is lost.
The following illustrative example shown in Figure 4.4 is taken from [2].
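[Figure 4.4: Flow control example between a sender and a receiver with a 4K receive buffer. The sender performs write(2K) and transmits 2K with SEQ = 0; once the buffer is full, the receiver answers with ACK = 4096, Win = 0; the receiving application then performs read(2K).]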
Suppose the receiver has a 4096 byte buffer and the sender has 2048 bytes ready to send. The sender will immediately
transmit a 2048 byte segment (assuming that the path MTU is large enough). The receiver now fills half of the buffer and
announces a new window size of 2048 bytes in the acknowledgement. Once the application has 2048 more bytes to send,
another 2048 byte segment is transmitted. This now fully fills the receiver's buffer, which leads to an announcement of
a window of size 0 in the following acknowledgement. Once the receiver's application process consumes data, another
acknowledgement will be created which informs the sender of a new window size.
The TCP specification does not require that acknowledgements are created immediately for each received segment. This
allows for optimizations where a receiver might choose to send just a single acknowledgement for several segments that
have been received quickly in sequence. Furthermore, the receiving TCP engine might choose to delay the
acknowledgement so that it can be piggybacked on other data sent from the receiver to the sender.
Nagle's Algorithm
The original TCP behaves rather inefficiently in situations where an application sends a stream of very small (one byte)
payloads. In the extreme case, the sender sends a segment containing a one byte payload. The receiver responds with
an acknowledgement for that single byte. Once the received byte has been copied to the application process, another
acknowledgement is sent to advertise a new window size. The sending application may now have another byte to send
and the process repeats.
Nagle suggested solving this problem by introducing the following rule: When data comes into the sender one byte at
a time, just send the first byte and buffer all the rest until the byte in flight has been acknowledged. This algorithm
provides noticeable improvements especially for interactive traffic where a quickly typing user is connected over a rather
slow network.
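The rule can be sketched as a tiny sender model in Python. This is only an illustration of the rule as stated above: the class name, the `mss` parameter and the `sent` list (which stands in for the network) are assumptions, and real implementations also send immediately once a full-sized segment has accumulated.

```python
class NagleSender:
    """Sender that keeps at most one small segment in flight."""

    def __init__(self, mss: int = 1460):
        self.mss = mss
        self.buffer = b""       # data written but not yet sent
        self.in_flight = False  # True while a segment is unacknowledged
        self.sent = []          # segments handed to the network

    def write(self, data: bytes) -> None:
        self.buffer += data
        self._try_send()

    def ack(self) -> None:
        """Called when the outstanding segment has been acknowledged."""
        self.in_flight = False
        self._try_send()

    def _try_send(self) -> None:
        # Send only if nothing is in flight; otherwise keep buffering.
        if self.buffer and not self.in_flight:
            segment, self.buffer = self.buffer[:self.mss], self.buffer[self.mss:]
            self.sent.append(segment)
            self.in_flight = True
```

Writing the five bytes of `b"hello"` one at a time sends `b"h"` immediately and, after the acknowledgement arrives, the remaining `b"ello"` in a single segment instead of four.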
Clark's Algorithm
Another related problem is known as the silly window syndrome. This problem deals with applications on the receiving
side that read the data one byte at a time from the receiver's buffer. The original TCP implementations immediately
announced a window of one byte when the application removed a byte from the receive buffer. This acknowledgement
then causes the transmission of another TCP segment which again contains just one byte of data.
Clark suggested solving this problem by preventing the receiver from sending a window update of 1 byte. Specifically,
the receiver should not send a window update until it can handle the maximum segment size it advertised when the
connection was established or until its buffer is half empty, whichever is smaller.
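Clark's rule boils down to a single comparison on the receiver side. The following Python sketch expresses it; the function and parameter names are chosen for this example.

```python
def should_send_window_update(free_space: int, buffer_size: int, mss: int) -> bool:
    """Return True if the receiver may advertise a new window.

    free_space  -- bytes currently free in the receive buffer
    buffer_size -- total size of the receive buffer in bytes
    mss         -- maximum segment size advertised at connection setup
    """
    # Whichever is smaller: one full segment or half of the buffer.
    threshold = min(mss, buffer_size // 2)
    return free_space >= threshold
```

With a 4096 byte buffer and an MSS of 1460 bytes, the receiver stays silent until 1460 bytes are free; with a 1024 byte buffer the threshold drops to 512 bytes.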
The key problem to be solved is the dynamic estimation of the congestion window. The solution adopted by TCP
assumes that lost segments are indications of congestion. While this is true in most wired networks, this assumption
does not work that well in wireless networks where the loss rate is much higher. Recent work also introduced explicit
congestion notifications which can be used by a router in the network to indicate congestion without having to drop
packets.
A TCP connection usually has different phases where different congestion control techniques should be used:
After connection establishment, no suitable value for the congestion window is known. The solution is to define
an initial window size (IW) and to start probing the network for the real congestion window using the slow start
algorithm.
Once a certain threshold, the so-called slow start threshold (ssthresh) for the congestion window, has been
crossed, the connection enters the congestion avoidance phase in which the congestion window increases
linearly.
If a timeout occurs or congestion is signalled by other means, the slow start threshold ssthresh is reduced and
the congestion window is set to the so called loss window (LW) which is one full-sized segment. The sender now
switches back to the slow start algorithm until ssthresh is crossed and congestion avoidance takes over.
After a long period of idle time, the congestion window cwnd is usually not accurate anymore. Hence, the value of
the congestion window must be set to the restart window (RW) which is typically the same as the initial window
(IW) and the slow start algorithm is executed.
The initial window (IW) is usually initialized using the following formula:
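A commonly cited form of this upper bound is the one from RFC 3390. That the notes refer to this particular RFC is an assumption, since the text does not name the source of the formula:

```latex
\[
  IW = \min\bigl(4 \cdot SMSS,\ \max(2 \cdot SMSS,\ 4380\ \text{bytes})\bigr)
\]
```

For $SMSS = 1460$ bytes this gives $IW = \min(5840, \max(2920, 4380)) = 4380$ bytes, which corresponds to three full-sized segments.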
In this formula, SMSS is the sender maximum segment size, the size of the largest segment that the sender can transmit.
The size does not include the TCP/IP headers and options.
During slow start, the congestion window cwnd increases by at most SMSS bytes for every acknowledgement received
that acknowledges data. Slow start ends when cwnd exceeds ssthresh or when congestion is observed. The initial value
of ssthresh may be arbitrarily high. Some implementations use the size of the advertised window. Note that this leads
to an exponential increase if there are multiple segments acknowledged in the cwnd.
During congestion avoidance, cwnd is incremented by one full-sized segment per round-trip time (RTT). Congestion
avoidance continues until congestion is detected. One formula commonly used to update cwnd during congestion
avoidance is given by the following equation:
$$cwnd \leftarrow cwnd + \frac{SMSS \cdot SMSS}{cwnd}$$
This adjustment is executed on every incoming non-duplicate ACK. The equation provides an acceptable approximation
to the underlying principle of increasing cwnd by one full-sized segment per RTT.
When congestion is noticed during slow start or congestion avoidance (the retransmission timer expires), the slow
start threshold ssthresh is updated as follows:
$$ssthresh = \max(FlightSize / 2,\ 2 \cdot SMSS)$$
The flight size is the amount of outstanding data in the network. The impact of slow start and congestion avoidance is
summarized in Figure 4.5.
[Figure 4.5: Congestion window (cwin) plotted against the transmission number, with the slow start threshold (ssthresh) and a timeout marked.]
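The interplay of slow start, congestion avoidance and a timeout can be reproduced with a small simulation. The sketch below works in units of full-sized segments and in whole round trips, approximates the flight size by the current congestion window, and resets the window to one segment on a timeout. These simplifications, the function name and the parameter values are assumptions made for this example.

```python
def simulate_cwnd(rounds: int, ssthresh: int = 32, timeout_at=None, cwnd: int = 1):
    """Return the congestion window (in segments) for each round trip."""
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r == timeout_at:
            # Timeout: halve the threshold (at least 2 segments) and
            # fall back to the loss window of one segment.
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: one additional segment per acknowledged
            # segment, i.e. the window doubles every round trip.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one additional segment per RTT.
            cwnd += 1
    return trace
```

Calling `simulate_cwnd(12, ssthresh=32, timeout_at=8)` produces an exponential phase up to 32 segments, a linear phase up to 35 segments, and then a restart from one segment after the timeout.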
To reduce the time for retransmissions, TCP receivers should send an immediate duplicate acknowledgement when
an out-of-order segment arrives. The purpose of this acknowledgement is to inform the sender that a segment was
received out-of-order and which sequence number is expected. In addition, a TCP receiver should send an immediate
acknowledgement when the incoming segment fills in all or part of a gap in the sequence space.
TCP senders should use the fast retransmit algorithm to detect and repair loss. The fast retransmit algorithm uses the ar-
rival of three duplicate acknowledgements (four identical acknowledgements without the arrival of any other intervening
packets) as an indication that a segment has been lost. After receiving three duplicate acknowledgements, TCP performs
a retransmission of what appears to be the missing segment, without waiting for the retransmission timer to expire.
The fast recovery algorithm controls how the congestion window and the slow start threshold are updated when the fast
retransmit algorithm is used. The basic idea is to not exercise the normal congestion reaction with a full slow start since
acknowledgements are still flowing. For details on how fast retransmit and fast recovery are implemented, see Section 3.2
in RFC 2581 [49].
$$D = \beta D + (1 - \beta)\,|RTT - M|$$
The parameter $\beta$ is another smoothing factor which can be different from the smoothing factor used to estimate $RTT$.
With these estimations of the average round-trip time and the standard deviation, the retransmission timeout $RTO$ is
usually set as follows:
$$RTO = RTT + 4 \cdot D$$
The factor 4 is more or less chosen by doing experiments. According to some studies, less than one percent of all packets
come in more than four standard deviations late.
Karn's Algorithm
The dynamic estimation of the RTT has a problem if a timeout occurs and the segment is retransmitted. A subsequent
acknowledgement might acknowledge the receipt of the first packet which contained that segment or any of the
retransmissions. Guessing wrong can seriously impact the RTT estimation.
Karn therefore suggested that the RTT estimation is not updated for any segments which were retransmitted. Furthermore,
Karn suggested that the RTO is doubled on each failure until the segment gets through, which leads to an
exponential back-off for each consecutive attempt. These fixes are now known as Karn's algorithm.
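The estimation of the retransmission timeout together with Karn's two rules can be sketched as follows. The class name RtoEstimator, the smoothing factors alpha = 7/8 and beta = 3/4, the initial timeout of 3 seconds and the initialization of the deviation with half of the first sample are assumptions made for this illustration; they are not prescribed by the text above.

```python
class RtoEstimator:
    """Smoothed RTT and deviation estimator with Karn's rules (sketch)."""

    def __init__(self, alpha: float = 0.875, beta: float = 0.75,
                 initial_rto: float = 3.0) -> None:
        self.alpha = alpha      # assumed smoothing factor for the RTT
        self.beta = beta        # assumed smoothing factor for the deviation
        self.rtt = None         # no measurement yet
        self.dev = 0.0
        self.rto = initial_rto

    def on_measurement(self, m: float, retransmitted: bool = False) -> None:
        """Feed one round-trip time sample m (in seconds)."""
        if retransmitted:
            # Karn: samples of retransmitted segments are ambiguous, skip them.
            return
        if self.rtt is None:
            self.rtt = m
            self.dev = m / 2    # assumed initialization for the first sample
        else:
            self.dev = self.beta * self.dev + (1 - self.beta) * abs(self.rtt - m)
            self.rtt = self.alpha * self.rtt + (1 - self.alpha) * m
        self.rto = self.rtt + 4 * self.dev

    def on_timeout(self) -> None:
        # Karn: double the timeout on each failure (exponential back-off).
        self.rto *= 2
```

A sender would call on_measurement() for every acknowledged segment and on_timeout() whenever the retransmission timer expires.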
4.4. STREAM CONTROL TRANSMISSION PROTOCOL (SCTP) 73
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Common Header |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Chunk #1 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Chunk #n |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Multiple chunks can be bundled into one SCTP packet up to the MTU size. If a user data message does not fit into one
SCTP packet, it can be fragmented into multiple chunks.
The SCTP common header has the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Verification Tag |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Source Port field contains the port number used by the sending application layer process.
The Destination Port field contains the port number used by the receiving application layer process.
The Verification Tag field is used by the receiver to validate the sender of the SCTP packet. This field
protects the SCTP protocol against certain attacks.
The Checksum field contains a 32-bit CRC checksum as specified in RFC 3309 [53].
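The common header maps directly onto a fixed 12-byte structure. The following minimal sketch unpacks it with Python's struct module; the names SctpCommonHeader and parse_sctp_common_header are hypothetical, and the checksum is only extracted, not verified.

```python
import struct
from typing import NamedTuple


class SctpCommonHeader(NamedTuple):
    source_port: int
    destination_port: int
    verification_tag: int
    checksum: int


def parse_sctp_common_header(packet: bytes) -> SctpCommonHeader:
    """Unpack the 12-byte SCTP common header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("an SCTP common header needs 12 bytes")
    # Two 16-bit ports followed by two 32-bit fields.
    return SctpCommonHeader(*struct.unpack("!HHII", packet[:12]))
```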
Payload is transmitted in so called data chunks. Data chunks have the following format:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type = 0 | Flags | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Transmission Sequence Number (TSN) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Stream Identifier (S) | Stream Sequence Number (n) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Payload Protocol Identifier |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
\ \
/ User Data (seq n of Stream S) /
\ \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Type field indicates the chunk type. Data chunks use the type number 0.
The Flags field contains a set of binary flags:
U: The data chunk is unordered and there is no Stream Sequence Number assigned to the data chunk.
B: Indicates the beginning of a fragment of a user message.
E: Indicates the ending of a fragment (last fragment) of a user message.
The Length field indicates the size of the chunk in bytes including the chunk header fields.
The Transmission Sequence Number field contains the transmission sequence number which is also
used by the receiver to reassemble messages.
The Stream Identifier identifies the stream to which the following data belongs.
The Stream Sequence Number identifies the stream sequence number of the following user data within the
stream identified by the Stream Identifier.
The Payload Protocol Identifier identifies the upper layer application protocol and is opaque from
the viewpoint of an SCTP protocol engine.
The User Data is of variable length and contains the actual payload.
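The data chunk layout can be decoded in the same way. The sketch below assumes the flag bit positions of the SCTP specification (U = 0x04, B = 0x02, E = 0x01), which are not spelled out in the text above; the names DataChunk and parse_data_chunk are hypothetical.

```python
import struct
from typing import NamedTuple


class DataChunk(NamedTuple):
    unordered: bool
    beginning: bool
    ending: bool
    tsn: int
    stream_id: int
    stream_seq: int
    payload_protocol_id: int
    user_data: bytes


def parse_data_chunk(chunk: bytes) -> DataChunk:
    """Decode one SCTP data chunk (type 0) including its 16-byte header."""
    if len(chunk) < 16:
        raise ValueError("a data chunk header needs 16 bytes")
    ctype, flags, length, tsn, sid, ssn, ppid = struct.unpack(
        "!BBHIHHI", chunk[:16])
    if ctype != 0:
        raise ValueError("not a data chunk")
    if length < 16 or length > len(chunk):
        raise ValueError("inconsistent chunk length")
    # The Length field covers the header, so the user data ends at 'length';
    # any bytes after that are padding.
    return DataChunk(bool(flags & 0x04), bool(flags & 0x02), bool(flags & 0x01),
                     tsn, sid, ssn, ppid, chunk[16:length])
```

A chunk with both B and E set carries a complete, unfragmented user message.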
The connection teardown can also be started by the client. In this case, the client sends a DCCP-Close to close the
connection. The server responds with a DCCP-Reset packet to clear the connection state.
All DCCP packets use a common header:
4.5. DATAGRAM CONGESTION CONTROL PROTOCOL (DCCP) 75
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Dest Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | CCval | Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data Offset | # NDP | Cslen | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Source Port field contains the port number used by the sending application layer process.
The Destination Port field contains the port number used by the receiving application layer process.
The Type field indicates the type of the DCCP message.
The CCval field contains data which might be used by a congestion control mechanism.
The Sequence Number field contains a sequence number which counts the number of packets.
The Data Offset field contains the offset from the start of the DCCP header to the start of the payload,
counted in 32-bit words.
The NDP field contains the number of non-data packets sent in the sender's sequence, modulo 16.
The Cslen field specifies what parts of the packet are covered by the checksum field. The checksum always
covers at least the DCCP header, DCCP options, and a pseudoheader taken from the network-layer header.
The Checksum field contains the Internet checksum computed over the pseudo header, the DCCP header, any
DCCP options and, depending on the Cslen value, over some of the payload.
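The header fields can be extracted with a few shift and mask operations. The sketch below follows exactly the 12-byte layout drawn above (4-bit Type and CCval, 24-bit Sequence Number, 8-bit Data Offset, 4-bit NDP and Cslen); other versions of the DCCP specification may lay out the header differently, so the field positions should be treated as an assumption taken from this text. The names DccpHeader and parse_dccp_header are hypothetical.

```python
import struct
from typing import NamedTuple


class DccpHeader(NamedTuple):
    source_port: int
    dest_port: int
    type: int
    ccval: int
    sequence_number: int
    data_offset: int
    ndp: int
    cslen: int
    checksum: int


def parse_dccp_header(packet: bytes) -> DccpHeader:
    """Decode the 12-byte common header as drawn in the diagram above."""
    if len(packet) < 12:
        raise ValueError("the common header needs 12 bytes")
    sport, dport, word, offset, ndp_cslen, checksum = struct.unpack(
        "!HHIBBH", packet[:12])
    return DccpHeader(
        sport, dport,
        word >> 28,            # Type: upper 4 bits of the second word
        (word >> 24) & 0xF,    # CCval: next 4 bits
        word & 0xFFFFFF,       # Sequence Number: lower 24 bits
        offset,                # Data Offset in 32-bit words
        ndp_cslen >> 4,        # NDP: upper 4 bits
        ndp_cslen & 0xF,       # Cslen: lower 4 bits
        checksum,
    )
```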
Chapter 5
The application layer is only loosely structured. Many protocols run directly on top of the transport protocols and it is
not unusual that the various protocols solve recurring problems (such as data encoding) in different ways. Although this
approach does not seem to be very efficient from a global perspective, it must be noted that the individual point solutions
have been very successful in practice.
There are many application layer protocols in use in the Internet. This chapter focusses on those protocols which either
realize certain core services (e.g., DNS) or which carry a significant portion of the traffic (e.g., HTTP, FTP, SMTP) or
which are interesting because of some of their special features.
The Domain Name System provides a hierarchical name space with a virtual root. The administration of the name
space can be delegated along the paths starting from the virtual root.
Name resolution is realized by so called DNS servers. A DNS server knows a part (a zone) of the global name
space and its position within the global name space. Note that some parts of the name space might be further
delegated to other name servers.
Name resolution queries can in principle be sent to arbitrary DNS servers. However, it is good practice to use a
local DNS server as the primary DNS server. Non-local servers might be used as backup servers.
Recursive queries cause the queried DNS server to contact other DNS servers as needed in order to obtain a
response to the query. This is convenient for the DNS client but requires more complexity in the DNS server.
The alternative is to use iterative queries, where the client may have to send a series of queries to several DNS servers
to retrieve the desired information.
78 CHAPTER 5. INTERNET APPLICATION LAYER
The original DNS protocol does not provide sufficient security. In particular, there is no guarantee that the
returned response is trustworthy. (Does a request to obtain an IP address for the name www.my-bank.com
really return an IP address of my bank?) Furthermore, if the DNS were secure, then it could be used as
an infrastructure to distribute, for example, certificates used by other security mechanisms. Although there are
standards for secure DNS (RFC 2535 [56]), they are not as widely used as they should be.
Since well known DNS names became a trade item in recent years, there is quite some political debate around the
question of who is actually in charge of defining new top-level domain names. At the moment, this responsibility lies
in the hands of the Internet Corporation for Assigned Names and Numbers (ICANN). However, ICANN itself is
an organization that is not well accepted everywhere.
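[Figure: DNS components. An application calls the resolver through an API; the resolver (with a cache) queries the primary DNS name server, which holds a resource record database (RR DB) and a cache and contacts name servers of the encompassing DNS, each with its own RR DB and cache.]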
1. The hierarchical name space and the so called Resource Records (RRs) that hold typed information for a given
name. There are several standardized resource record types and new resource record types can be defined in order
to store additional information in the domain name system.
2. A set of DNS servers which provide access to the information stored in resource records via the DNS protocol.
DNS servers usually also maintain some information in local caches in order to increase the overall efficiency of
the DNS system. DNS servers usually have authoritative information about one or multiple local zones which they
are responsible for.
3. A set of resolver libraries, usually shipped as part of the operating system, which contain an implementation of
a DNS client and provide a programmatic API to resolve names to addresses and vice versa (see getaddrinfo()
and getnameinfo()). Application programs call the resolver system library functions to resolve DNS and
potentially other names.
The names (labels) on a certain level of the tree must be unique and may not exceed 63 bytes in length. The
character set for the labels is seven-bit ASCII. Comparisons are done in a case-insensitive manner.
The labels must begin with a letter and end with a letter or decimal digit. The characters between the first and last
character must be letters, digits or hyphens.
The labels can be concatenated with dots to form paths within the name space. Absolute paths which end at the
virtual root node end with a trailing dot. All other paths which do not end with a trailing dot are relative paths.
The overall length of a domain name is limited to 255 bytes.
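The naming rules above translate into a short validation routine. The following sketch checks them on the textual form of a name; it simplifies the 255-byte limit by applying it to the text rather than to the wire encoding, and the function names (is_valid_label, is_valid_name, same_name) are hypothetical.

```python
import re

# First character a letter, last character a letter or digit,
# letters, digits or hyphens in between.
LABEL_RE = re.compile(r"[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_label(label: str) -> bool:
    """Check a single label against the length and character rules."""
    return len(label) <= 63 and LABEL_RE.fullmatch(label) is not None


def is_valid_name(name: str) -> bool:
    """Check an absolute (trailing dot) or relative domain name."""
    if len(name) > 255:
        return False
    relative = name[:-1] if name.endswith(".") else name
    return bool(relative) and all(
        is_valid_label(label) for label in relative.split("."))


def same_name(a: str, b: str) -> bool:
    """Compare two names in a case-insensitive manner."""
    return a.lower() == b.lower()
```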
5.1. DOMAIN NAME SYSTEM (DNS) 79
Type    Description
A       IPv4 address
AAAA    IPv6 address
CNAME   Alias for another name (canonical name)
HINFO   Identification of the CPU and the operating system (host info)
MX      List of mail servers (mail exchanger)
NS      Identification of an authoritative server for a domain
PTR     Pointer to another part of the name space
SOA     Start and parameters of a zone (start of zone of authority)
KEY     Public key associated with a name
SIG     Signature over the RRs associated with a name
Recent efforts resulted in proposals for Internationalized Domain Names in Applications (IDNA) (RFC 3490 [57], RFC
3491 [58], RFC 3492 [59]). The basic idea is to support internationalized character sets within applications. However, for
backward compatibility reasons, internationalized character sets are encoded into seven-bit ASCII representations (ASCII
Compatible Encoding, ACE). ACE labels are recognized by a so called ACE prefix. The ACE prefix for IDNA is xn--. A
label which contains an encoded internationalized name might for example be the value xn--de-jg4avhby1noc0d.
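The conversion between internationalized names and their ACE form can be tried out with the "idna" codec of the Python standard library, which implements the IDNA scheme of RFC 3490. The helper names to_ace and from_ace are hypothetical, and the sample name is chosen purely for illustration.

```python
def to_ace(name: str) -> str:
    """Convert a (possibly internationalized) domain name to its ACE form."""
    return name.encode("idna").decode("ascii")


def from_ace(name: str) -> str:
    """Convert an ACE encoded domain name back to Unicode."""
    return name.encode("ascii").decode("idna")


# "b\u00fccher" is the German word for books, written with a u-umlaut.
print(to_ace("b\u00fccher.example"))      # xn--bcher-kva.example
print(from_ace("xn--bcher-kva.example"))
```

Labels that consist only of ASCII characters are passed through unchanged, so the conversion can be applied to every name.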
1. A DNS message starts with a protocol header. It indicates which of the following four parts are present and
whether the message is a query or a response.
2. The header is followed by a list of questions.
3. The list of questions is followed by a list of answers (resource records).
4. The list of answers is followed by a list of pointers to authorities (also in the form of resource records).
5. The list of pointers to authorities is followed by a list of additional information (also in the form of resource
records). This list may contain for example A resource records for names in a response to an MX query.
The DNS protocol normally runs over UDP for simple queries. For larger data transfers (e.g. zone transfers), DNS may
utilize TCP. Both protocols use the well-known port number 53.
Message Header
All DNS messages start with the common header which has the following format:
0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ID |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR| Opcode |AA|TC|RD|RA| Z|AD|CD| RCODE |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| QDCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ANCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| NSCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ARCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The ID field contains a number which allows incoming responses to be correlated with outstanding requests.
The QR bit is 0 in a query message and 1 in a response message.
The field OPCODE indicates the query type. This field is 0 for a standard query (QUERY) and 1 for an inverse
query (IQUERY).
The AA bit is set if the response is authoritative (authoritative answer).
The TC bit indicates that the message is truncated due to restrictions of the transport system (truncated).
The RD bit is set on queries to start a recursive query (recursion desired).
The RA bit is set on responses and denotes whether recursive query support is available (recursion available).
The Z bit is unused.
The AD bit indicates that all data in the response has been cryptographically verified or otherwise meets the DNS
server's local security policy.
The CD bit is set when a resolver is willing to accept non-authenticated data (checking disabled).
The RCODE field contains an error code which is only significant in response messages.
The QDCOUNT field contains the number of queries in the query list.
The ANCOUNT field contains the number of responses in the answer list.
The NSCOUNT field contains the number of authoritative name servers in the list of authoritative name servers.
The ARCOUNT field contains the number of elements in the list of additional information.
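The header consists of six 16-bit words, which makes it easy to encode and decode. The following sketch packs the flag bits in the order shown in the diagram; the names DnsHeader, pack_header and unpack_header are hypothetical, and no validation of field ranges is attempted.

```python
import struct
from typing import NamedTuple


class DnsHeader(NamedTuple):
    id: int
    qr: int = 0
    opcode: int = 0
    aa: int = 0
    tc: int = 0
    rd: int = 0
    ra: int = 0
    ad: int = 0
    cd: int = 0
    rcode: int = 0
    qdcount: int = 0
    ancount: int = 0
    nscount: int = 0
    arcount: int = 0


def pack_header(h: DnsHeader) -> bytes:
    """Encode the header; the Z bit (bit 6 of the flag word) stays zero."""
    flags = ((h.qr << 15) | (h.opcode << 11) | (h.aa << 10) | (h.tc << 9)
             | (h.rd << 8) | (h.ra << 7) | (h.ad << 5) | (h.cd << 4) | h.rcode)
    return struct.pack("!6H", h.id, flags,
                       h.qdcount, h.ancount, h.nscount, h.arcount)


def unpack_header(data: bytes) -> DnsHeader:
    """Decode the first 12 bytes of a DNS message."""
    if len(data) < 12:
        raise ValueError("a DNS header needs 12 bytes")
    ident, flags, qd, an, ns, ar = struct.unpack("!6H", data[:12])
    return DnsHeader(
        id=ident, qr=flags >> 15, opcode=(flags >> 11) & 0xF,
        aa=(flags >> 10) & 1, tc=(flags >> 9) & 1, rd=(flags >> 8) & 1,
        ra=(flags >> 7) & 1, ad=(flags >> 5) & 1, cd=(flags >> 4) & 1,
        rcode=flags & 0xF, qdcount=qd, ancount=an, nscount=ns, arcount=ar)
```

A standard recursive query with one question would be built as DnsHeader(id=0x1234, rd=1, qdcount=1).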
Query Format
0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| |
/ QNAME /
/ /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| QTYPE |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| QCLASS |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The QNAME field contains the domain name which is being queried.
The 16-bit field QTYPE determines the type of the query. Some defined values are 1 (A), 2 (NS), 5 (CNAME), 6
(SOA), 12 (PTR), 13 (HINFO), 15 (MX), 24 (SIG), 25 (KEY), and 28 (AAAA).
The 16-bit field QCLASS determines the class. This field usually contains the value 1 for the Internet (IN).
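A question entry can be assembled as sketched below. The text above does not describe how QNAME is encoded; the sketch assumes the encoding of the DNS specification (RFC 1035), in which every label is preceded by a length byte and the name is terminated by a zero byte. The names encode_qname and encode_question are hypothetical.

```python
import struct

# QTYPE values as listed above.
QTYPE = {"A": 1, "NS": 2, "CNAME": 5, "SOA": 6, "PTR": 12,
         "HINFO": 13, "MX": 15, "SIG": 24, "KEY": 25, "AAAA": 28}
QCLASS_IN = 1


def encode_qname(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending with a zero byte."""
    if name in ("", "."):
        return b"\x00"          # the root itself
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError("labels must be 1 to 63 bytes long")
        out.append(len(raw))
        out += raw
    out.append(0)
    return bytes(out)


def encode_question(name: str, qtype: str = "A") -> bytes:
    """Build one entry of the question list for class IN."""
    return encode_qname(name) + struct.pack("!HH", QTYPE[qtype], QCLASS_IN)
```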
Response Format
The list of answers, the list of authoritative servers and the list with additional information all have the same structure:
0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| |
/ /
/ NAME /
| |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| TYPE |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| CLASS |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| TTL |
| |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| RDLENGTH |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
/ RDATA /
/ /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The NAME field contains the domain name associated with the following resource record.
The 16-bit TYPE field indicates the type of the resource record and determines the format of the RDATA field.
The 16-bit CLASS field indicates the class. This field usually contains the value 1 for the Internet (IN).
The 32-bit TTL field contains the lifetime for the following resource record in seconds.
The 16-bit RDLENGTH field contains the length of the following RDATA field in bytes.
The RDATA field contains the actual data of the resource record. The format of the RDATA field depends
on the resource record type indicated by the TYPE field.
An A resource record contains an IPv4 address encoded in 4 bytes in network byte order.
An AAAA resource record contains an IPv6 address encoded in 16 bytes in network byte order.
A CNAME resource record contains a character string preceded by the length of the string which is encoded in the
first byte (and thus restricts the string to 255 characters).
A HINFO resource record contains two character strings, each prefixed with a length byte. The first character
string describes the CPU and the second string the operating system.
An MX resource record contains a 16-bit preference number (used to prioritize multiple entries) followed by a
character string prefixed with a length byte. The character string contains the DNS name of a mail exchanger.
An NS resource record contains a character string prefixed by a length byte which contains the name of an
authoritative DNS server.
A PTR resource record contains a character string prefixed with a length byte which contains a name in another
part of the name space. PTR records are used to map IP addresses to names (so called reverse lookups). For an IPv4
address of the form d1.d2.d3.d4, a PTR resource record is created for the pseudo domain name
d4.d3.d2.d1.in-addr.arpa. For an IPv6 address of the form h1h2h3h4:...:h29h30h31h32, where each h is one hexadecimal
digit, a PTR resource record is created for the pseudo domain name h32.h31.h30.h29. ... .h4.h3.h2.h1.ip6.arpa.
A SOA resource record contains two character strings, each prefixed by a length byte, and five 32-bit numbers.
The first character string contains the name of the DNS server responsible for a zone. The second character string
contains the mail address of the administrator responsible for the management of the zone. The first unsigned
32-bit number contains a serial number (SERIAL) which must be incremented by the zone administrator whenever
the zone database is changed. The second 32-bit number defines the time which may elapse before cached
zone information must be updated (REFRESH). The third 32-bit number defines the time to wait before a failed
refresh is retried (RETRY). The fourth 32-bit number defines a time interval after which zone
information is considered not current anymore (EXPIRE). The fifth 32-bit number contains the minimum lifetime
for resource records (MINIMUM).
The KEY and SIG resource records have a rather complex format which is described in detail in RFC 2535 [56].
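The pseudo domain names used for PTR lookups can be derived mechanically from an address. A minimal sketch using the ipaddress module of the Python standard library is shown below; the function name reverse_name and the sample addresses (taken from the documentation address ranges) are chosen for illustration only.

```python
import ipaddress


def reverse_name(address: str) -> str:
    """Return the in-addr.arpa or ip6.arpa name used for a PTR lookup."""
    return ipaddress.ip_address(address).reverse_pointer


print(reverse_name("192.0.2.1"))     # 1.2.0.192.in-addr.arpa
print(reverse_name("2001:db8::1"))   # 32 reversed hex digits under ip6.arpa
```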
[Figure 5.3: Data exchange between System A and System B, each of which uses its own de/encoder.]
ASN.1 was primarily developed to formally describe data structures that are exchanged between applications in a dis-
tributed system. The idea was to let developers focus on the definition of data structures and to give developers tools to
generate the necessary encoding / decoding functions. Some more specific requirements for ASN.1 were:
The fundamental principle behind ASN.1 is the separation of the data representation during transmission from the data
representation within applications that might be written in different programming languages (see Figure 5.3).
5.2. ABSTRACT SYNTAX NOTATION ONE (ASN.1) 83
The abstract syntax defines data structures in an application implementation neutral format. The abstract syntax
is mapped to a local syntax which is used in a concrete implementation. The local syntax is specific for the
programming language being used to develop the application and might be different for different implementations.
ASN.1 compilers can be used to generate the local syntax from the abstract syntax.
The transfer syntax defines how data structures are serialized for transmission over a network. A concrete transfer
syntax is commonly defined as a set of encoding rules which define how values of the various ASN.1 types are
encoded. There are multiple encoding rules for ASN.1 and it is therefore necessary that applications agree on the
transfer syntax to use. The implementation of the encoding and decoding functions can be automated by using
ASN.1 compilers.
ASN.1 definitions basically consist of type definitions and value definitions that are organized into ASN.1 modules.
It is often necessary to uniquely identify artefacts such as specific protocol definitions or parameters. The ISO registration
tree provides a hierarchical name system that can be used for this purpose. Note that the hierarchical structure makes
it possible to delegate authority. For example, the US Department of Defense (dod) has delegated authority over the
internet subtree to the Internet Assigned Numbers Authority (IANA).
[Figure: excerpt of the ISO registration tree below dod(6) and internet(1), with nodes such as system(1), interfaces(2), ip(4), icmp(5), tcp(6), udp(7), transmission(10), snmp(11), snmpMIB(1) and snmpFrameworkMIB(10).]
Note that nodes are uniquely identified by the assigned numbers and not necessarily by their associated descriptors.
Table 5.2: Universal tag numbers for the core ASN.1 data types
Table 5.2 summarizes the universal tags for the fundamental ASN.1 data types which are assigned in the ASN.1
standard.
application-wide ApplicationSyntax
}
END
Encoding of Tags
[Figure: bit layout (bits 8 to 1) of a tag encoded in a single byte and of a tag number encoded in multiple bytes.]
If the tag number is less than 31, the tag number can be encoded in a single byte. This usually covers most of
the cases. If the tag number is larger, then multiple bytes must be used to encode the tag number. The highest bit
indicates whether more bytes follow and hence only seven bits are used to actually encode the number.
The primitive / constructed bit can be used by the receiver to determine whether the value itself is a BER
encoding, which is the case for constructed types. This allows the decoder to simply recursively call the BER
decoder whenever a constructed tag has been received.
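The tag encoding can be sketched as follows. The bit positions are taken from the BER standard (X.690) and are an assumption as far as this text is concerned: the class occupies the two highest bits of the first byte, the primitive / constructed bit comes next, and the value 31 in the remaining five bits announces a multi-byte tag number. The function name encode_tag is hypothetical.

```python
def encode_tag(tag_class: int, constructed: bool, number: int) -> bytes:
    """Encode a BER tag (class 0..3, primitive/constructed bit, tag number)."""
    if not 0 <= tag_class <= 3 or number < 0:
        raise ValueError("invalid tag class or tag number")
    first = (tag_class << 6) | (0x20 if constructed else 0x00)
    if number < 31:
        # Small tag numbers fit into the first byte.
        return bytes([first | number])
    # Larger tag numbers: seven bits per byte, highest bit set on all
    # bytes except the last one.
    groups = [number & 0x7F]
    number >>= 7
    while number:
        groups.append(0x80 | (number & 0x7F))
        number >>= 7
    return bytes([first | 0x1F]) + bytes(reversed(groups))
```

For example, encode_tag(0, True, 16) yields the byte 0x30 that starts every SEQUENCE.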
Encoding of Lengths
[Figure: bit layout (bits 8 to 1) of a length encoded in a single byte (highest bit 0) and of a length encoded in multiple bytes (highest bit of the first byte 1).]
A length less than 128 bytes can be encoded in a single length byte. Larger length values must be encoded in
multiple bytes where the first byte indicates how many bytes are used to encode the length value.
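The definite length encoding described above can be sketched in a few lines. The function names encode_length and decode_length are hypothetical; the indefinite length form of BER is deliberately not handled.

```python
from typing import Tuple


def encode_length(length: int) -> bytes:
    """Encode a BER length in the short or the long definite form."""
    if length < 0:
        raise ValueError("length must not be negative")
    if length < 128:
        return bytes([length])
    body = length.to_bytes((length.bit_length() + 7) // 8, "big")
    # The first byte has the highest bit set and counts the length bytes.
    return bytes([0x80 | len(body)]) + body


def decode_length(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return the decoded length and the offset of the first value byte."""
    first = data[offset]
    if first < 0x80:
        return first, offset + 1
    count = first & 0x7F
    if count == 0:
        raise ValueError("indefinite length is not handled in this sketch")
    start = offset + 1
    return int.from_bytes(data[start:start + count], "big"), start + count
```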
[Figure: encoding of a BIT STRING value. The first byte holds the number of unused bits in the last byte; the following bytes hold the bits b0, b1, b2, ... starting at the highest bit position.]
A BIT STRING value is encoded in a sequence of bytes that contain the named bits. The byte sequence is
prefixed with a byte which indicates the number of unused bits in the last byte of the byte sequence.
The following example shows the BER encoding of an SNMP message which is consistent with the ASN.1 definition
shown in Section 5.2.7.
30:1b SEQUENCE { 27
02:01:00 INTEGER 1 0
04:06:70:75:62:6C:69:63 OCTET STRING 6 "public"
a1:0e GetNextRequest-PDU { 14
02:04:36:a2:8f:07 INTEGER 4 916623111
02:01:00 INTEGER 1 0
02:01:00 INTEGER 1 0
30:00 SEQUENCE OF {} 0
}
}
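The structure of this message can be recovered with a small recursive decoder. The sketch below handles only single-byte tags and definite lengths, which is sufficient for the example; the names decode_tlvs and MESSAGE are hypothetical.

```python
from typing import Iterator, Tuple


def decode_tlvs(data: bytes, depth: int = 0) -> Iterator[Tuple[int, int, int, bytes]]:
    """Yield (depth, tag byte, length, value) for every BER element."""
    offset = 0
    while offset < len(data):
        tag = data[offset]
        if tag & 0x1F == 0x1F:
            raise ValueError("multi-byte tags are not handled in this sketch")
        length = data[offset + 1]
        offset += 2
        if length == 0x80:
            raise ValueError("indefinite length is not handled in this sketch")
        if length & 0x80:
            count = length & 0x7F
            length = int.from_bytes(data[offset:offset + count], "big")
            offset += count
        value = data[offset:offset + length]
        offset += length
        yield depth, tag, length, value
        if tag & 0x20:
            # Constructed type: the value is itself a BER encoding.
            yield from decode_tlvs(value, depth + 1)


# The bytes of the SNMP message shown above.
MESSAGE = bytes.fromhex(
    "301b" "020100" "04067075626c6963"
    "a10e" "020436a28f07" "020100" "020100" "3000")

for depth, tag, length, value in decode_tlvs(MESSAGE):
    print("  " * depth + f"tag=0x{tag:02x} length={length} value={value.hex()}")
```

The recursion on the constructed bit mirrors the remark made in the description of the tag encoding.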
Remarks
With definite length encoding, it is required that the length of the BER encoding of a value is known before the
length field can be encoded. The alternative would be to reserve some space for the length and to move encoded
values around if the reserved space was insufficient (or too big). This approach however is rather costly.
An alternative approach is to create the BER encoding from the innermost ASN.1 element to the outermost ASN.1
element and to construct the BER encoding from the end to the beginning. This technique works fine in some
special cases but not in the general case.
During the processing of messages, it is often required to change some fields in a message before it is passed
on. This is difficult to achieve in the general case because a small change can cause massive changes in the BER
encoding, for example when length fields have to be enlarged.
5.3.1 Foundations
It is necessary to first introduce some fundamental concepts and the associated terminology that is being used frequently
in the network management community.
Functional Areas
The services provided by a network management system can be grouped into five categories:
1. Fault Management
Covers fault detection, fault isolation and fault correction.
2. Configuration Management
Covers the creation and maintenance of configuration information, name management, and the starting,
control and termination of services.
3. Account Management
Covers the collection of usage data, the allocation and monitoring of quotas, and the keeping of
usage statistics.
4. Performance Management
Covers the collection of statistical data, the determination of system performance and any changes
made to optimize performance.
5. Security Management
Covers the creation and control of security services, key generation and distribution, and
the reporting and analysis of security-relevant events.
name = expression
The end of a rule is marked by the end of the line or by a comment. Comments start with the comment symbol ;
(semicolon) and continue to the end of the line.
The name of a rule must start with an alphabetic character followed by a combination of alphabetics, digits and
hyphens. The case of a rule name is not significant.
Terminal symbols are non-negative numbers. The basis of these numbers can be binary (b), decimal (d) or
hexadecimal (x). Multiple values can be concatenated by using the dot . as a value concatenation operator. It is
also possible to define ranges of consecutive values by using the hyphen - as a value range operator.
Terminal symbols can also be defined by using literal text strings containing US ASCII characters enclosed in
double quotes. Note that these literal text strings are case-insensitive.
Note that ABNF does not define a module concept and import/export mechanism which could be used to import rules
from other modules into the current module. This reflects the fact that ABNF is mostly used for documentation purposes
rather than code generation purposes.
5.4.2 Operators
The right hand side expressions of ABNF rules use a set of different operators. The ABNF operators are briefly described
in the text below.
Concatenation
The simplest operator is the concatenation operator. The operator symbol is the empty word. (Note that white space
characters are not significant in ABNF.) The rule
and the terminal symbol concatenation feature is thus just a shortcut to make definitions more compact.
Alternatives
Elements separated by the alternatives operator / (forward slash) are alternatives. A rule of the form
results in the elements defined either by the rule foo or the rule bar. Note that the alternatives operator can be used to
emulate the value range operator by spelling out all numbers in the range:
DIGIT = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
Since it is often required to specify long lists of alternatives (and since ABNF is line oriented), there is an incremental
version of the alternatives operator which combines the alternatives operator with the assignment symbol:
Grouping
An expression can be grouped using parentheses. A grouped expression is treated as a single element, regardless of the
internal structure of the grouped expression. It is recommended to use groups instead of relying on operator precedence
whenever there is a good chance to misread a rule.
Repetitions
Repetitions are specified using the parameterized repetition operator * (star). The full format for the operator is n*m
where n and m are optional decimal values. The value n indicates the minimum number of repetitions (defaults to 0 if
not present) and the value m indicates the maximum number of repetitions (defaults to infinity if not present).
There are two notations for special cases that appear frequently enough to justify the introduction of a special notation:
The notation n foo indicates that foo appears exactly n times. This is equivalent to n*n foo.
Optional elements can be written in square brackets as [foo] which is equivalent to *1 foo.
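The semantics of the repetition operator are close to the bounded quantifiers of regular expressions. The following sketch translates n*m element into a Python regular expression fragment in order to experiment with the defaults; the helper name repetition_regex is hypothetical, and the element is assumed to be given as a regular expression already.

```python
import re
from typing import Optional


def repetition_regex(element: str, minimum: int = 0,
                     maximum: Optional[int] = None) -> str:
    """Translate the ABNF repetition n*m element into a regex fragment."""
    upper = "" if maximum is None else str(maximum)   # no m means unbounded
    return f"(?:{element}){{{minimum},{upper}}}"


digit = "[0-9]"
one_to_three = repetition_regex(digit, 1, 3)   # ABNF: 1*3DIGIT
exactly_three = repetition_regex(digit, 3, 3)  # ABNF: 3DIGIT
optional = repetition_regex(digit, 0, 1)       # ABNF: [DIGIT]

print(bool(re.fullmatch(one_to_three, "42")))    # True
print(bool(re.fullmatch(one_to_three, "1234")))  # False
```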
Operator Precedence
The operators have the following precedence, from highest (binding tightest) at the top to lowest (binding loosest) at the bottom:
Repetition
Grouping, Optional
Concatenation
Alternative
It is generally recommended that the grouping operator be used to make explicit groups in order to avoid any potential
confusion.
CHAR = %x01-7F
; any 7-bit US-ASCII character, excluding NUL
CR = %x0D
; carriage return
CRLF = CR LF
; Internet standard newline
DIGIT = %x30-39
; 0-9
DQUOTE = %x22
; " (Double Quote)
HTAB = %x09
; horizontal tab
LF = %x0A
; linefeed
OCTET = %x00-FF
; 8 bits of data
SP = %x20
; space
VCHAR = %x21-7E
; visible (printing) characters
WSP = SP / HTAB
; White space
alternation = concatenation
*(*c-wsp "/" *c-wsp concatenation)
5.5.1 Foundations
Electronic mail is forwarded according to the store-and-forward principle: at any time, one node holds the current
copy of a message and takes responsibility for forwarding it. To forward a message, a new copy is first created
on the next node. Once this step has been completed successfully, the local copy can be destroyed, since the next
node is now responsible for the delivery.
Figure 5.9 shows a typical configuration for sending, forwarding and reading electronic mail in the Internet.
[Figure 5.9: mail relays (MTAs with queues) in organization A and organization B exchange messages via SMTP; the user mailbox is accessed via IMAP.]
The mail user agent (MUA) conducts the dialogue with the user and accepts a new electronic message.
The message is then handed over either to the local MTA or to a relay MTA (see below).
A mail transfer agent (MTA) is responsible for forwarding electronic messages through the Internet.
A relay MTA is an intermediate system that serves solely to forward messages within the Internet.
A gateway MTA is an intermediate system that forwards messages into other networks
(e.g., ISO/OSI networks based on the X.400 protocol).
5.5. SIMPLE MAIL TRANSFER PROTOCOL (SMTP) 97
Once it has arrived at the destination system, electronic mail is stored in a user-specific buffer (mailbox).
This buffer is accessed either through the file system or by means of special protocols such as
IMAP (see the next chapter).
When transferring electronic mail, one distinguishes, by analogy with postal mail, between an
envelope, which is relevant for forwarding and delivering a message, a header,
which describes general parameters of a message, and the actual content (body).
The content (body) is initially not structured any further and is restricted to 7-bit US-ASCII characters. Additional
standards describe how the content can be structured and, in particular, how multipart documents with
different document types can be realized (MIME).
;
; Syntax of the SMTP commands:
;
;
; Syntax of the SMTP command parameters:
;
Reverse-path = Path
Forward-path = Path
Path = "<" [ A-d-l ":" ] Mailbox ">"
A-d-l = At-domain *( "," A-d-l )
; Note that this form, the so-called "source route",
98 CHAPTER 5. INTERNET APPLICATION LAYER
Keyword = Ldh-str
Argument = Atom
Domain = (sub-domain 1*("." sub-domain)) / address-literal
sub-domain = Let-dig [Ldh-str]
Atom = 1*atext
;
; Syntax for IPv4/IPv6 addresses:
;
In response to the commands, the server sends three-digit numeric reply codes accompanied by additional textual
explanations, which, however, are not significant for the protocol. The reply codes follow a fixed pattern
(theory of reply codes).
The first digit of the three-digit reply code indicates the kind of reply:
1yz Preliminary positive reply; further information is required before the action is carried out
(positive preliminary reply).
2yz Final positive reply indicating the successful completion of an action (positive completion reply).
3yz Intermediate positive reply; further information is required to complete an action
(positive intermediate reply).
4yz Transient negative reply; the error is temporary and the command may be repeated
(transient negative completion reply).
5yz Permanent negative reply; automatically repeating the command is not useful
(permanent negative completion reply).
The second digit groups replies into specific categories:
x0z Syntax problems.
x1z Informational replies and status information.
x2z Replies referring to the transmission channel.
x3z Unspecified.
x4z Unspecified.
x5z Status of the server in the context of the initiated actions.
The third digit gives the exact meaning of the reply within the respective category.
Because of this three-level structure of the reply codes, an implementation can react reasonably even to unknown
new reply codes. The reply codes defined in the SMTP protocol are listed below, ordered by functional criteria:
252 Cannot VRFY user, but will accept message and attempt delivery
450 Requested mail action not taken: mailbox unavailable
550 Requested action not taken: mailbox unavailable
451 Requested action aborted: error in processing
551 User not local; please try <forward-path>
452 Requested action not taken: insufficient system storage
552 Requested mail action aborted: exceeded storage allocation
553 Requested action not taken: mailbox name not allowed
354 Start mail input; end with <CRLF>.<CRLF>
554 Transaction failed (Or, in the case of a connection-opening
response, "No SMTP service here")
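The structure of the reply codes can be illustrated with a small C sketch. It shows how a client might react to a
reply code it does not know by looking only at the first and second digit. The names smtp_reply_kind(),
smtp_reply_category() and smtp_should_retry() are hypothetical helpers chosen for this illustration; they are not
part of any SMTP library.

```c
/* Kind of reply, taken from the first digit of a three-digit reply code. */
enum reply_kind {
    REPLY_INVALID = 0,
    REPLY_POSITIVE_PRELIMINARY = 1,   /* 1yz */
    REPLY_POSITIVE_COMPLETION = 2,    /* 2yz */
    REPLY_POSITIVE_INTERMEDIATE = 3,  /* 3yz */
    REPLY_TRANSIENT_NEGATIVE = 4,     /* 4yz */
    REPLY_PERMANENT_NEGATIVE = 5      /* 5yz */
};

/* Maps a reply code to its kind; codes outside 100..599 are invalid. */
enum reply_kind smtp_reply_kind(int code)
{
    if (code < 100 || code > 599)
        return REPLY_INVALID;
    return (enum reply_kind)(code / 100);
}

/* Returns the second digit (the category x0z .. x5z), or -1 if invalid. */
int smtp_reply_category(int code)
{
    if (smtp_reply_kind(code) == REPLY_INVALID)
        return -1;
    return (code / 10) % 10;
}

/* Only transient negative replies (4yz) suggest repeating the command. */
int smtp_should_retry(int code)
{
    return smtp_reply_kind(code) == REPLY_TRANSIENT_NEGATIVE;
}
```

With these helpers, a client that receives an unknown code such as 458 can still conclude that the failure is
temporary and that the command may be repeated later.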
5.5.3 Message Headers
The format of the message headers is defined in RFC 2822 [?]. The essential production of the ABNF looks as
follows:
The resend-field elements are generated by MTAs while messages are transferred. They serve debugging and
the detection of loops.
The remaining fields have meanings that are presumably common knowledge by now.
Additional Message Header Fields
Media Types
1. The media type (text) can be used for arbitrary text. The simplest subtype is plain. For the media type
(text), the character set (charset) can be specified via a parameter.
2. The media type (image) can be used for arbitrary image formats. Typical subtypes are jpeg or png.
3. The media type (audio) denotes a document that requires an audio channel for output.
4. The media type (video) denotes sequences of moving image data. A typical subtype is mpeg.
5. The media type (application) denotes data formats that are understood by particular applications.
Typical representatives are postscript or pdf.
1. The media type (multipart) is used for documents consisting of parts that each have a different media type.
2. The media type (message) can itself contain a message; this message, however, must not itself be of type
(message).
For composite content, the individual parts are separated by markers that begin at the start of a line with two
minus signs (--). In addition, the boundary parameter of the Content-Type field must define a delimiter string,
which must follow the two minus signs (--). The last marker is additionally terminated by two minus signs (--).
--simple boundary
--simple boundary--
Base64 Encoding
With this encoding, three bytes (i.e., 24 bits) are represented by four characters taken from a 6-bit character
set. The resulting character sequence is wrapped so that lines are never longer than 76 characters.
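Value Char   Value Char   Value Char   Value Char
    0 A         17 R         34 i         51 z
    1 B         18 S         35 j         52 0
    2 C         19 T         36 k         53 1
    3 D         20 U         37 l         54 2
    4 E         21 V         38 m         55 3
    5 F         22 W         39 n         56 4
    6 G         23 X         40 o         57 5
    7 H         24 Y         41 p         58 6
    8 I         25 Z         42 q         59 7
    9 J         26 a         43 r         60 8
   10 K         27 b         44 s         61 9
   11 L         28 c         45 t         62 +
   12 M         29 d         46 u         63 /
   13 N         30 e         47 v
   14 O         31 f         48 w      (pad) =
   15 P         32 g         49 x
   16 Q         33 h         50 y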
If fewer than three bytes remain at the end of the encoding, padding bytes are appended and the special
character = is added to the encoded output to indicate the number of padding bytes used.
Obviously, a text is no longer readable without further tools after a Base64 encoding.
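The encoding rule can be sketched in C as follows. This is a minimal sketch and not a complete MIME
implementation: the function name base64_encode() is an illustrative choice, and the wrapping of the output into
lines of at most 76 characters is omitted.

```c
#include <stddef.h>

/* The 64 characters of the Base64 alphabet, indexed by their 6-bit value. */
static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/*
 * Encodes len bytes from in as a NUL-terminated Base64 string in out.
 * out must provide at least 4 * ((len + 2) / 3) + 1 bytes.
 * Returns the number of characters written (without the NUL byte).
 */
size_t base64_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i = 0, o = 0;
    unsigned long v;

    /* Complete groups: three input bytes become four output characters. */
    while (i + 3 <= len) {
        v = ((unsigned long)in[i] << 16) | ((unsigned long)in[i + 1] << 8) | in[i + 2];
        out[o++] = b64_alphabet[(v >> 18) & 0x3f];
        out[o++] = b64_alphabet[(v >> 12) & 0x3f];
        out[o++] = b64_alphabet[(v >> 6) & 0x3f];
        out[o++] = b64_alphabet[v & 0x3f];
        i += 3;
    }

    /* Incomplete last group: pad with zero bits and mark with '='. */
    if (len - i == 1) {
        v = (unsigned long)in[i] << 16;
        out[o++] = b64_alphabet[(v >> 18) & 0x3f];
        out[o++] = b64_alphabet[(v >> 12) & 0x3f];
        out[o++] = '=';
        out[o++] = '=';
    } else if (len - i == 2) {
        v = ((unsigned long)in[i] << 16) | ((unsigned long)in[i + 1] << 8);
        out[o++] = b64_alphabet[(v >> 18) & 0x3f];
        out[o++] = b64_alphabet[(v >> 12) & 0x3f];
        out[o++] = b64_alphabet[(v >> 6) & 0x3f];
        out[o++] = '=';
    }

    out[o] = '\0';
    return o;
}
```

One remaining byte produces two = characters and two remaining bytes produce one, so the output length is always
a multiple of four.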
Quoted-Printable Encoding
The Quoted-Printable encoding tries to preserve as much of the printable text as possible. Only characters that
cannot be represented in US-ASCII are encoded specially. The basic principle is to represent non-representable
characters by the corresponding hexadecimal number, with the character = used as a marker (=0A=0D).
The exact rules are in fact relatively complex. See RFC 2045 [?] for the details.
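The basic principle can be sketched in C as follows. This is a deliberately simplified sketch under stated
assumptions: the function name qp_encode() is hypothetical, every byte outside the printable US-ASCII range
(including CR and LF) as well as the = character itself is written as = followed by two uppercase hexadecimal
digits, and the rules of RFC 2045 for soft line breaks, line length and trailing white space are not implemented.

```c
#include <stddef.h>

/*
 * Simplified Quoted-Printable encoder.
 * out must provide at least 3 * len + 1 bytes.
 * Returns the number of characters written (without the NUL byte).
 */
size_t qp_encode(const unsigned char *in, size_t len, char *out)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, o = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c >= 32 && c <= 126 && c != '=') {
            /* Printable US-ASCII characters are copied unchanged. */
            out[o++] = (char)c;
        } else {
            /* Everything else becomes "=XX" with XX in hexadecimal. */
            out[o++] = '=';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        }
    }
    out[o] = '\0';
    return o;
}
```

In contrast to Base64, text that consists mostly of US-ASCII characters stays readable after this encoding.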
IMAP usually runs over TCP and uses port number 143 by default. Since authentication was originally performed by
transmitting passwords in clear text, IMAP is often used over TLS or SSL. Interactions between a client and a
server are usually line-oriented, with lines terminated by a CRLF.
The message sequence number is the relative position of a message in a mailbox (counting starts at 1). The
sequence number of a message can change as a result of deletions and insertions.
5.6. INTERNET MESSAGE ACCESS PROTOCOL (IMAP) 103
The unique identifier of a message identifies a message in a mailbox independently of the position of the
message in the mailbox. Unique identifiers persist across multiple IMAP sessions and therefore allow
synchronization in off-line operation.
Of course, problems can also arise when unique identifiers are used, in particular when external programs
reorganize a mailbox or when entire mailboxes are deleted and new ones are created under the same name. Therefore,
there is an additional global counter which is incremented on such events; it indicates that the current unique
identifiers need no longer be identical to previously stored unique identifiers.
5.6.2 States
The IMAP protocol distinguishes several states. Depending on the current state, different commands are available
to a client.
[Figure: IMAP state diagram with the states non-authenticated, authenticated and selected]
The initial state connection established and server greeting is entered after the transport connection has been
established. In this state, the server sends a greeting message.
Afterwards, a transition into the state non-authenticated normally takes place. In this state, essentially only
the commands needed for authentication are permitted.
After successful authentication, a transition into the state authenticated takes place. In this state,
essentially a mailbox can be selected for further processing.
In the state selected, the content of the selected mailbox can be accessed and changes can be made. The selected
mailbox can be released again (transition into the authenticated state), or the IMAP session can be terminated.
In the state logout and connection release, the session is terminated and the transport connection is closed in
an orderly fashion.
5.6.3 Commands
The individual IMAP commands are only permitted in certain states. The following list of commands is therefore
sorted by state:
Any state
CAPABILITY Returns a list of the capabilities of the server
NOOP Empty command (can be used for status updates)
LOGOUT Terminates the IMAP session
State non-authenticated
AUTHENTICATE Selects an authentication mechanism
LOGIN Trivial authentication with a clear-text password
State authenticated
SELECT Selects a mailbox (read-write)
EXAMINE Selects a mailbox (read-only)
CREATE Creates a new mailbox
DELETE Deletes a mailbox
RENAME Renames a mailbox
SUBSCRIBE Adds a mailbox to the list of active mailboxes
UNSUBSCRIBE Removes a mailbox from the list of active mailboxes
LIST Lists the names of the mailboxes
LSUB Lists the names of the active mailboxes
STATUS Queries the status of a mailbox
APPEND Appends data to a mailbox
State selected
CHECK Creates a checkpoint (backup copy) of the mailbox
CLOSE Closes the current mailbox
EXPUNGE Deletes all messages that are marked for deletion
SEARCH Searches for messages that satisfy given criteria
FETCH Reads data of a message from the mailbox
STORE Changes data of a message in the mailbox
COPY Copies messages to the end of a mailbox
UID
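The relation between states and commands can be written down as a small lookup function in C. This is only a
sketch: the names imap_state and imap_command_allowed() are hypothetical, command names are compared
case-sensitively, and the sketch assumes (following the IMAP specification rather than the list above) that the
commands of the state authenticated remain available in the state selected.

```c
#include <string.h>

enum imap_state { IMAP_NON_AUTHENTICATED, IMAP_AUTHENTICATED, IMAP_SELECTED };

/* Returns 1 if cmd occurs in the NULL-terminated list, 0 otherwise. */
static int in_list(const char *cmd, const char *const *list)
{
    for (; *list != NULL; list++) {
        if (strcmp(cmd, *list) == 0)
            return 1;
    }
    return 0;
}

/* Returns 1 if the command is permitted in the given state, 0 otherwise. */
int imap_command_allowed(enum imap_state state, const char *cmd)
{
    static const char *const any[] = { "CAPABILITY", "NOOP", "LOGOUT", NULL };
    static const char *const nonauth[] = { "AUTHENTICATE", "LOGIN", NULL };
    static const char *const auth[] = {
        "SELECT", "EXAMINE", "CREATE", "DELETE", "RENAME", "SUBSCRIBE",
        "UNSUBSCRIBE", "LIST", "LSUB", "STATUS", "APPEND", NULL
    };
    static const char *const selected[] = {
        "CHECK", "CLOSE", "EXPUNGE", "SEARCH", "FETCH", "STORE", "COPY", "UID", NULL
    };

    if (in_list(cmd, any))
        return 1;
    switch (state) {
    case IMAP_NON_AUTHENTICATED:
        return in_list(cmd, nonauth);
    case IMAP_AUTHENTICATED:
        return in_list(cmd, auth);
    case IMAP_SELECTED:
        /* Assumption: authenticated-state commands stay valid here. */
        return in_list(cmd, auth) || in_list(cmd, selected);
    }
    return 0;
}
```

A server could use such a check to reject, for example, a FETCH command that arrives before a mailbox has been
selected.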
5.6.4 Tagging
IMAP supports concurrent operations on the server. A client can thus issue several commands, which the server
then executes asynchronously. A client therefore labels its commands with unique tags so that responses can later
be matched to the individual commands. The syntax of a command is thus:
The server answers commands with responses. Three kinds of response lines are distinguished:
1. Response lines that request further information for processing a command are marked by a +.
2. Response lines that carry a status message and do not imply the end of command processing are marked
by a *.
3. All other responses begin with the tag set by the client and indicate the (successful or unsuccessful)
completion of a command.
The successful or unsuccessful completion is reported by keywords (OK, NO, BAD) and not, as in SMTP, by
structured reply codes.
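The three kinds of response lines can be told apart with a few string comparisons. The following C sketch is an
illustration only: the names imap_response and imap_classify() as well as the tag a001 used in the usage note are
hypothetical, and the sketch looks only at the beginning of a line without parsing the rest of the response.

```c
#include <string.h>

enum imap_response {
    IMAP_CONTINUATION,  /* line starts with "+"                  */
    IMAP_UNTAGGED,      /* line starts with "*"                  */
    IMAP_TAGGED_OK,     /* "<tag> OK ..."                        */
    IMAP_TAGGED_NO,     /* "<tag> NO ..."                        */
    IMAP_TAGGED_BAD,    /* "<tag> BAD ..."                       */
    IMAP_OTHER          /* different tag or unknown keyword      */
};

/* Returns 1 if s starts with keyword kw followed by a space, CR or end of string. */
static int has_keyword(const char *s, const char *kw)
{
    size_t k = strlen(kw);
    return strncmp(s, kw, k) == 0 && (s[k] == ' ' || s[k] == '\r' || s[k] == '\0');
}

/* Classifies one response line with respect to the tag of a pending command. */
enum imap_response imap_classify(const char *line, const char *tag)
{
    size_t n = strlen(tag);

    if (line[0] == '+')
        return IMAP_CONTINUATION;
    if (line[0] == '*')
        return IMAP_UNTAGGED;
    if (n > 0 && strncmp(line, tag, n) == 0 && line[n] == ' ') {
        const char *status = line + n + 1;
        if (has_keyword(status, "OK"))
            return IMAP_TAGGED_OK;
        if (has_keyword(status, "NO"))
            return IMAP_TAGGED_NO;
        if (has_keyword(status, "BAD"))
            return IMAP_TAGGED_BAD;
    }
    return IMAP_OTHER;
}
```

A client that has sent a command with the tag a001 would keep reading lines until imap_classify(line, "a001")
returns one of the three tagged results.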
5.6.5 Message Format
The IMAP message format itself is again described in ABNF. An excerpt of the ABNF definitions for the most basic
elements is given here. For the rather extensive complete version, see RFC 2060 [?].
;
; top-level productions for IMAP commands:
;
;
; top-level productions for IMAP responses:
;
Remarks:
The control connection uses a text-based line-oriented protocol which is similar to SMTP. The client sends
commands which are processed by the server. The server sends responses using three-digit response codes.
A separate TCP connection is established for each data transfer. The connection can be initiated either by the
client or by the server. If the data transfer connection is initiated by the server, the client's port number
must first be conveyed to the server, and the server uses the well-known port number 20 as its source port. If the
data transfer connection is initiated by the client, the server must first convey the port number on which it
accepts the connection. The well-known port number for the control connection is 21.
FTP allows a data transmission that did not complete to be resumed using a special restart mechanism.
FTP can be used to initiate a data transfer between two remote systems. However, this feature of FTP can result
in some interesting security problems. For more details, consult RFC 2577 [72].
5.8. HYPERTEXT TRANSFER PROTOCOL (HTTP) 107
The Hypertext Transfer Protocol (HTTP) is defined in RFC 2616 [73] and is one of the core building blocks of the
World Wide Web. HTTP is a simple request/response protocol primarily used to exchange documents between clients
(browsers) and servers. HTTP runs on top of TCP and uses the well-known port number 80. HTTP utilizes MIME
conventions in order to distinguish different media types.
Documents are identified using Uniform Resource Identifiers (URIs) as defined in RFC 2396 [74]. HTTP provides a
fixed set of methods that can be applied to documents identified by a URI. The current version of HTTP supports
the methods shown in Table 5.3.
Method    Description
OPTIONS   Request information about the communication options available
GET       Retrieve whatever information is identified by a URI
HEAD      Retrieve only the meta-information which is identified by a URI
POST      Annotate an existing resource or pass data to a data-handling process
PUT       Store information under the supplied URI
DELETE    Delete the resource identified by the URI
TRACE     Application-layer loopback of request messages for testing purposes
CONNECT   Initiate a tunnel such as a TLS or SSL tunnel
The principal structure of HTTP messages is described in the following simplified ABNF. A Message is either a
Request or a Response. They syntactically only differ in the first line, which is either a Request-Line or a
Status-Line.
Extension-Method = token
HTTP 1.1 supports persistent connections. A client can establish a connection to a server and use it to send
multiple Request messages. Earlier versions of HTTP allowed only a single Request/Response exchange over a single
connection, which is rather expensive because every exchange pays for a new TCP connection setup. To use a
connection for multiple requests, it is important to detect the end of a message body (document). HTTP relies on
the MIME Content-Length header field for this purpose. If for some reason the server does not know the length
before starting to send the response (typically the case for dynamic pages that are constructed on the fly), then
the server may choose to close the connection to indicate the end of the message body.
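How a receiver might look for the Content-Length header field can be sketched in C. The function name
http_content_length() is a hypothetical helper, and the sketch assumes that the header lines are already
available in one NUL-terminated string with lines separated by CRLF. Other ways of delimiting a message body are
not considered.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if s starts with prefix, ignoring the case of letters. */
static int prefix_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Returns the value of the Content-Length header field, or -1 if the
 * field is missing or malformed. In the latter case the end of the body
 * can only be detected by the closing of the connection.
 */
long http_content_length(const char *headers)
{
    const char *line = headers;

    while (line != NULL && *line != '\0') {
        if (prefix_ci(line, "Content-Length:")) {
            const char *value = line + strlen("Content-Length:");
            char *end;
            long n = strtol(value, &end, 10);  /* skips leading white space */
            return (end != value && n >= 0) ? n : -1;
        }
        line = strstr(line, "\r\n");  /* advance to the next header line */
        if (line != NULL)
            line += 2;
    }
    return -1;
}
```

A receiver that obtains a non-negative value reads exactly that many bytes of body and can then reuse the
connection for the next message.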
HTTP 1.1 also allows clients to make multiple requests without waiting for each response (pipelining), which can
significantly reduce latency. Web pages typically consist of an HTML document which has links to many small icons
and other elements. Retrieving all these referenced elements over a single connection in a pipelined mode
significantly reduces the number of TCP round-trip message exchanges.
Probably the most interesting and also most complex part of HTTP is its support for proxies and caching. Proxies
are entities that exist between the client and the server and basically relay requests and responses. The
HTTP 1.1 specification describes how proxies are supposed to handle requests, and it defines message headers that
can be used by a client to learn about the proxies between the client and the server or to control how many
proxies may exist in the path between the client and the server.
Some proxies and clients also maintain caches where copies of documents are stored in local storage space to
speed up future accesses to these cached documents. The HTTP protocol allows a client to interrogate the server to
determine whether the document has changed or not. Caching is a key to the efficient operation of the Web. HTTP
allows servers to control whether and how a page can be cached as well as its lifetime. Furthermore, browsers can
force a request to bypass caches and obtain a fresh copy from a server.
Note that not all problems related to HTTP proxies and caches have been solved. A good list of issues can be found in
RFC 3143 [75].
5.8.3 Negotiation
HTTP supports header fields that allow clients and servers to negotiate capabilities and preferences. There are
two different mechanisms:
1. Server-driven negotiation begins with a request from a client (a browser). The client indicates a list of its
preferences. The server then decides how to best respond to the request.
2. Client-driven negotiation requires two requests. The client first asks the server what is available and then
decides which concrete request to send to the server.
Negotiation can be used to select different document formats, different transfer encodings, different languages
or different character sets. In most cases, server-driven negotiation is used since it needs only a single
request. The following is an example of typical negotiation header lines:
The last line says that the client prefers ISO-8859-1 encoding (with preference 1), UTF8 encoding with preference 0.66
and any other encoding with preference 0.33.
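How a server might evaluate such preferences can be sketched in C. The sketch is an illustration under stated
assumptions: the function name accept_quality() is hypothetical, and the header value used in the usage note,
ISO-8859-1,utf-8;q=0.66,*;q=0.33, is reconstructed from the description of the last line above and not copied
from an actual request. Elements without a q parameter get the preference 1.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>  /* strncasecmp() (POSIX) */

/*
 * Returns the preference of name in a comma-separated list such as
 * "ISO-8859-1,utf-8;q=0.66,*;q=0.33". An exact (case-insensitive) match
 * wins over the wildcard "*"; without any match the result is 0.
 */
double accept_quality(const char *value, const char *name)
{
    double wildcard = 0.0;
    size_t name_len = strlen(name);
    const char *p = value;

    while (*p != '\0') {
        size_t elem_len = strcspn(p, ",");       /* one list element */
        const char *tok = p;
        const char *semi = memchr(p, ';', elem_len);
        size_t tok_len;
        double q = 1.0;                          /* default preference */

        while (*tok == ' ')
            tok++;
        tok_len = strcspn(tok, ";,");
        while (tok_len > 0 && tok[tok_len - 1] == ' ')
            tok_len--;

        if (semi != NULL) {
            const char *param = semi + 1;
            while (*param == ' ')
                param++;
            if (param[0] == 'q' && param[1] == '=')
                q = strtod(param + 2, NULL);
        }

        if (tok_len == name_len && strncasecmp(tok, name, name_len) == 0)
            return q;
        if (tok_len == 1 && tok[0] == '*')
            wildcard = q;

        p += elem_len;
        if (*p == ',')
            p++;
    }
    return wildcard;
}
```

For the reconstructed header value, accept_quality(value, "utf-8") yields 0.66, while a character set that is not
listed explicitly falls back to the preference of the wildcard.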
The server checks whether the document was changed after the date indicated by the header line and only processes
the request if this is the case.
Packet Capturing
Networks sometimes do not function as expected and in some situations it is useful to capture the relevant frames /
packets for analysis. It is also often useful to count certain frames/packets in order to obtain usage statistics.
There are special programs such as tcpdump, ethereal or ngrep which can be used to capture and analyze packets
/ frames. A performance critical aspect is the interface between the network interface (hardware), the operating system
kernel and the user space programs. To achieve good performance, all the components have to play well together:
An approach to address this problem is to discard unimportant frames / packets as early as possible, ideally
before they are copied from the device driver's memory to the operating system kernel memory.
The instruction set of the BPF machine uses a fixed format which can be interpreted efficiently. Below are some example
BPF programs. For more details, see [78].
Example 1 Select all Ethernet frames which contain IPv4 packets (ip):
112 APPENDIX A. PACKET CAPTURING
[Figure: packet capturing architecture; labels: user, kernel, BPF, network]
Example 2 Select all Ethernet frames which contain IPv4 packets which do not originate from the networks
128.3.112/24 and 128.3.254/24 (ip and not src net 128.3.112/24 and not src net 128.3.254/24):
A.2 libpcap
The libpcap C library (see footnote 1) provides the following functionality:
A portable API that hides the differences of packet filter implementations in different operating systems.
A compiler which translates human readable filter expressions into BPF programs.
An interpreter for BPF programs which can be used to filter (previously captured) packets in user space.
Functions for writing captured packets to files and for reading previously captured packets from files.
The usage of the libpcap API can best be illustrated by an example program. The following C source code implements
a program which opens a file containing previously captured packets, optionally installs a filter and then processes the
filtered packets in a callback function.
A more detailed description of the libpcap API can be found in the pcap(3) manual page. The syntax of the
supported filter expression is documented in the tcpdump(1) manual page.
A.3 jpcap
Programmers who prefer the Java language can use the Java package jpcap (see footnote 2), which allows captured
packets to be processed with Java programs. The jpcap package is implemented as a Java wrapper around the libpcap
API.
Footnote 1: http://www.tcpdump.org/
Footnote 2: http://www.sf.net/projects/jpcap
The C program from the previous section can be written in Java as follows:
For a more detailed description of the jpcap API, see the jpcap Java documentation. Before using the jpcap API,
please check whether Java is fast enough for processing data rates from high speed networks.
Appendix B
Sockets
The socket application programming interface (API) was developed at the University of California in Berkeley as part
of the work on the BSD Unix system. The socket interface is a generic interface for interprocess communication using
message passing. Sockets are abstract communication endpoints with a rather small number of associated function calls.
The socket API distinguishes different types of sockets:
A stream socket (SOCK_STREAM) is a bidirectional reliable communication endpoint. Data written to a local
stream socket can be read from a remote stream socket without having to worry about transmission errors,
fragmentation or reordering that might occur in the underlying network. While the order of the byte stream is
not changed, the data block boundaries are not preserved.
A datagram socket (SOCK_DGRAM) is a bidirectional unreliable communication endpoint which allows datagrams to
be exchanged. Datagrams sent over the local datagram socket may not be received by the remote datagram
socket or they may be received multiple times. Furthermore, the ordering of the datagrams can change during the
transmission. Note that datagram boundaries are preserved.
A raw socket (SOCK_RAW) is a communication endpoint which allows network or interface layer datagrams to be
received and sent.
A reliable delivered message socket (SOCK_RDM) is similar to a datagram socket but provides in addition reliable
datagram delivery.
A sequenced packet socket (SOCK_SEQPACKET) is similar to a stream socket but retains data block boundaries.
#include <sys/socket.h>
struct sockaddr {
uint8_t sa_len; /* address length (BSD) */
sa_family_t sa_family; /* address family */
char sa_data[...]; /* data of some size */
};
struct sockaddr_storage {
uint8_t ss_len; /* address length (BSD) */
sa_family_t ss_family; /* address family */
char padding[...]; /* padding of some size */
};
116 APPENDIX B. SOCKETS
Newer BSD systems support the sa_len field in the generic and the specific socket addresses, which was not present
in the older socket API. Other systems usually do not have this sa_len member (although it is generally a good idea
to have this member). The currently most important name spaces are the name spaces for the Internet and a name space
for local communication:
Sockets that represent IPv4 communication endpoints use the address family AF_INET and the protocol family PF_INET.
IPv4 transport addresses are represented by the structure struct sockaddr_in:
#include <sys/socket.h>
#include <netinet/in.h>
struct in_addr {
in_addr_t s_addr; /* IPv4 address (network byte order) */
};
struct sockaddr_in {
uint8_t sin_len; /* address length (BSD) */
sa_family_t sin_family; /* address family */
in_port_t sin_port; /* transport layer port */
struct in_addr sin_addr; /* IPv4 address */
};
Sockets that represent IPv6 communication endpoints use the address family AF_INET6 and the protocol family
PF_INET6. IPv6 transport addresses are represented by the structure struct sockaddr_in6:
#include <sys/socket.h>
#include <netinet/in.h>
struct in6_addr {
uint8_t s6_addr[16]; /* IPv6 address */
};
struct sockaddr_in6 {
uint8_t sin6_len; /* address length (BSD) */
sa_family_t sin6_family; /* address family */
in_port_t sin6_port; /* transport layer port */
uint32_t sin6_flowinfo; /* flow information */
struct in6_addr sin6_addr; /* IPv6 address */
uint32_t sin6_scope_id; /* scope identifier */
};
B.2. COMMUNICATION KINDS 117
#include <sys/socket.h>
#include <sys/un.h>
struct sockaddr_un {
uint8_t sun_len; /* address length (BSD) */
sa_family_t sun_family; /* address family */
char sun_path[108]; /* path name (the size is not fixed by POSIX; 108 on Linux) */
};
[Figure B.1: Connection-less communication with datagram sockets. Both sides call socket() and bind(), exchange
data with sendto() and recvfrom(), and finally call close().]
Figure B.1 shows how a server and a client make use of the socket primitives to provide and realize a connection-less
datagram application protocol. After creating and binding a local socket, the processes use the recvfrom() and the
sendto() primitives to receive and send datagrams.
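The sequence of primitives can be tried out within a single process by letting two datagram sockets talk to each
other over the loopback interface. The following C sketch is an illustration only: the function name
udp_loopback_echo() is hypothetical, the server socket is bound to a port chosen by the kernel, the client socket
is bound implicitly by its first sendto() call, and a receive timeout of two seconds guards against blocking
forever.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sends msg from a client socket to a server socket, lets the server
 * echo it back and stores the echo in reply. Returns the number of bytes
 * received by the client or -1 on error.
 */
ssize_t udp_loopback_echo(const char *msg, char *reply, size_t reply_size)
{
    int server, client;
    ssize_t n, result = -1;
    struct sockaddr_in addr, peer;
    socklen_t addr_len = sizeof(addr), peer_len = sizeof(peer);
    struct timeval tv = { 2, 0 };
    char buf[512];

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);             /* let the kernel pick a port */

    server = socket(AF_INET, SOCK_DGRAM, 0);
    client = socket(AF_INET, SOCK_DGRAM, 0);
    if (server < 0 || client < 0)
        goto done;
    setsockopt(server, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    setsockopt(client, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    if (bind(server, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;
    /* Find out which port the kernel assigned to the server socket. */
    if (getsockname(server, (struct sockaddr *)&addr, &addr_len) < 0)
        goto done;

    /* Client to server. */
    if (sendto(client, msg, strlen(msg), 0, (struct sockaddr *)&addr, addr_len) < 0)
        goto done;
    n = recvfrom(server, buf, sizeof(buf), 0, (struct sockaddr *)&peer, &peer_len);
    if (n < 0)
        goto done;

    /* Server back to the address the datagram came from. */
    if (sendto(server, buf, (size_t)n, 0, (struct sockaddr *)&peer, peer_len) < 0)
        goto done;
    result = recvfrom(client, reply, reply_size, 0, NULL, NULL);

done:
    if (server >= 0)
        close(server);
    if (client >= 0)
        close(client);
    return result;
}
```

Note that the server learns the address of the client only from recvfrom(); there is no connection that would
remember it.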
Figure B.2 shows how a server and a client make use of the socket primitives to provide and realize a
connection-oriented application protocol. The server creates a listening local socket which is used to accept
incoming connections. Once a connection has been accepted, a new local file descriptor is returned which can be
used to read() or write() data. The close() function is called to close the connection. On the client side, the
connect() function is used to connect the local socket to a remote (server) socket. When the connect() function
returns successfully, normal read() or write() functions can be used to exchange data. The close() function is
again called to close the connection.
[Figure B.2: Connection-oriented communication with stream sockets. Labels: socket(), bind(), listen(), socket(),
data exchange with read() and write(), connection release, close().]
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
struct addrinfo {
int ai_flags;
int ai_family;
int ai_socktype;
int ai_protocol;
size_t ai_addrlen;
struct sockaddr *ai_addr;
char *ai_canonname;
struct addrinfo *ai_next;
};
The mapping of names to addresses is realized by the function getaddrinfo(). This function has three input
parameters (node, service, hints) and returns a pointer to a list of struct addrinfo elements. This list must be
released by calling freeaddrinfo() when it is no longer used. In case of an error, getaddrinfo() returns a
non-zero value which can be passed to gai_strerror() in order to get a human readable error description.
One of the arguments node and service can be NULL thus requesting only a name resolution of the other element.
The name resolution process can be further controlled by passing some hints to the function. Hints can be used, for
example, to request addresses of a certain address family or socket type.
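A minimal C sketch of such a call is shown below. The helper name first_address_family() is hypothetical. To keep
the sketch independent of a name server, the hints request purely numeric parsing with the flags AI_NUMERICHOST
and AI_NUMERICSERV; a real client would usually omit these flags so that host and service names are resolved.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Maps node and service to a list of addresses and returns the address
 * family of the first list element, or -1 if the mapping fails.
 */
int first_address_family(const char *node, const char *service,
                         int family, int socktype)
{
    struct addrinfo hints, *list;
    int rc, result;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;       /* AF_UNSPEC, AF_INET or AF_INET6 */
    hints.ai_socktype = socktype;   /* SOCK_STREAM or SOCK_DGRAM */
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;  /* no lookups */

    rc = getaddrinfo(node, service, &hints, &list);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }

    /* A client would normally walk the list via ai_next and try each entry. */
    result = list->ai_family;
    freeaddrinfo(list);
    return result;
}
```

For example, first_address_family("127.0.0.1", "13", AF_UNSPEC, SOCK_STREAM) reports AF_INET, while a node string
that is not a numeric address leads to an error message and the result -1.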
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
The inverse mapping of addresses to symbolic names is supported by the function getnameinfo(). The first two
parameters (sa, salen) are input parameters. The result of the mapping is a host name and a service name which
are written to the memory location host with the length hostlen and serv with the length servlen. Additional
flags can be passed to the mapping function in order to control the details of the mapping process.
Example 3 The following source code implements a client for a simple connection-oriented protocol which retrieves the
date and time from a server.
Example 4 The following source code implements a server for a simple connection-oriented protocol which provides
the date and time to clients.
Example 5 The following source code implements a client for a simple connection-less protocol which retrieves the date
and time from a server.
Example 6 The following source code implements a server for a simple connection-less protocol which provides the
date and time to clients.
B.5 Multiplexing
The examples discussed so far all had the property that the server or the client could block (freeze) in case of
some communication errors. For example, the connection-oriented server blocks incoming requests until a client has
been served. One approach to address this deficiency is to use threads, which of course requires thread-safe
libraries. The other alternative is to avoid calling blocking functions by first checking whether a socket can be
read or written.
#include <sys/select.h>
FD_ZERO(fd_set *set);
FD_SET(int fd, fd_set *set);
FD_CLR(int fd, fd_set *set);
FD_ISSET(int fd, fd_set *set);
The functions select() and pselect() can be used to test whether socket descriptors are ready so that a subse-
quent socket library call does not block. The select() call distinguishes three sets of socket descriptors:
1. The set readfds contains descriptors which will be watched to see if a subsequent read operation will not block.
2. The set writefds contains descriptors which will be watched to see if a subsequent write operation will not
block.
3. The set exceptfds contains the descriptors which will be watched for exceptions.
The macros FD_ZERO(), FD_SET(), FD_CLR() and FD_ISSET() can be used to manipulate the sets. The timeout
is an upper bound on the amount of time elapsed before the select function returns. The parameter n contains the
highest-numbered file descriptor in any of the three sets plus 1.
The function pselect() allows the correct handling of situations where a program wants to wait for socket descriptors
as well as software signals.
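The use of select() with a single descriptor can be sketched in C as follows. The helper name wait_readable() is
hypothetical. The sketch watches only the set readfds and works for any file descriptor, so it can be tried out
with a pipe instead of a socket.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Waits until fd becomes readable or timeout_ms milliseconds have passed.
 * Returns 1 if a read would not block, 0 on timeout and -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval timeout;
    int n;

    FD_ZERO(&readfds);         /* start with an empty set */
    FD_SET(fd, &readfds);      /* watch fd for readability */

    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    /* First argument: highest-numbered descriptor in any set plus 1. */
    n = select(fd + 1, &readfds, NULL, NULL, &timeout);
    if (n < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

A server that handles several sockets would add all of them to readfds before calling select() and afterwards
test each descriptor with FD_ISSET().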
Example 7 The following source code combines the connection-less server with the connection-oriented server. The
main loop uses the select() function to wait for incoming requests.
Bibliography
[1] F. Halsall. Data Communications, Computer Networks and Open Systems. Addison-Wesley, 4th edition, 1996.
[2] A. S. Tanenbaum. Computer Networks. Prentice Hall, 4th edition, 2002.
[3] W. Stallings. Data and Computer Communications. Prentice Hall, 6th edition, 2000.
[4] D. E. Comer. Internetworking with TCP/IP: Principles, Protocols, and Architectures. Prentice Hall, 4th edition,
2000.
[5] C. Huitema. Routing in the Internet. Prentice Hall, 2nd edition, 1999.
[6] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, 1994.
[7] J. F. Kurose and K. W. Ross. Computer Networking: A Top-Down Approach Featuring the Internet.
Addison-Wesley, 3rd edition, 2004.
[8] R. Hinden and S. Deering. Internet Protocol Version 6 (IPv6) Addressing Architecture. RFC 3513, Nokia, Cisco
Systems, April 2003.
[9] R. Elz. A Compact Representation of IPv6 Addresses. RFC 1924, University of Melbourne, April 1996.
[10] D. Waitzman. A Standard for the Transmission of IP Datagrams on Avian Carriers. RFC 1149, BBN STC, April
1990.
[11] P. Hoffman and S. Bradner. Defining the IETF. RFC 3233, Internet Mail Consortium, Harvard University, February
2002.
[12] S. Bradner. The Internet Standards Process Revision 3. RFC 2026, Harvard University, October 1996.
[13] R. M. Metcalfe and D. R. Boggs. Ethernet: Distributed packet switching for local computer networks.
Communications of the ACM, 19(5):395-404, July 1976.
[14] ANSI/IEEE. Local Area Networks: CSMA/CD, Std 802.3, 1988.
[15] B. Carpenter. Architectural Principles of the Internet. RFC 1958, IAB, June 1996.
[16] R. Bush and D. Meyer. Some Internet Architectural Guidelines and Philosophy. RFC 3439, December 2002.
[17] S. Deering and R. Hinden. Internet Protocol, Version 6 (IPv6) Specification. RFC 2460, Cisco, Nokia, December
1998.
[18] J. Postel. Internet Protocol. RFC 791, ISI, September 1981.
[19] IANA. Special-Use IPv4 Addresses. RFC 3330, Internet Assigned Numbers Authority, September 2002.
[20] Y. Rekhter, B. Moskowitz, D. Karrenberg, G. J. deGroot, and E. Lear. Address Allocation for Private Internets.
RFC 1918, Cisco Systems, Chrysler Corp., RIPE NCC, Silicon Graphics, Inc., February 1996.
[21] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4
and IPv6 Headers. RFC 2474, Cisco Systems, Torrent Networking Technologies, EMC Corporation, December
1998.
[22] K. Ramakrishnan, S. Floyd, and D. Black. The Addition of Explicit Congestion Notification (ECN) to IP. RFC
3168, TeraOptic Networks, ACIRI, EMC, September 2001.
[23] D. Grossman. New Terminology and Clarifications for Diffserv. RFC 3260, Motorola, Inc., April 2002.
[24] F. Baker. Requirements for IP Version 4 Routers. RFC 1812, Cisco Systems, June 1995.
[25] G. Trotter. Terminology for Forwarding Information Base (FIB) based Router Performance. RFC 3222, Agilent
Technologies, December 2001.
[26] M. A. Ruiz-Sanchez, E. W. Biersack, and W. Dabbous. Survey and Taxonomy of IP Address Lookup Algorithms.
IEEE Network, pages 8-23, March 2000.
[27] J. Postel. Internet Control Message Protocol. RFC 792, ISI, September 1981.
[28] C. Kent and J. Mogul. Fragmentation Considered Harmful. In Proc. SIGCOMM '87 Workshop on Frontiers in
Computer Communications Technology, August 1987.
[29] J. Mogul and S. Deering. Path MTU Discovery. RFC 1191, DECWRL, Stanford University, November 1990.
[30] C. Hornig. A Standard for the Transmission of IP Datagrams over Ethernet Networks. RFC 894, Symbolics
Cambridge Research Center, April 1984.
[31] D. C. Plummer. An Ethernet Address Resolution Protocol. RFC 826, MIT, November 1982.
[32] R. Finlayson, T. Mann, J. Mogul, and M. Theimer. A Reverse Address Resolution Protocol. RFC 903, Stanford
University, June 1984.
[33] R. Droms. Dynamic Host Configuration Protocol. RFC 2131, Bucknell University, March 1997.
[34] S. Alexander and R. Droms. DHCP Options and BOOTP Vendor Extensions. RFC 2132, Silicon Graphics, Bucknell
University, March 1997.
[35] R. Droms and W. Arbaugh. Authentication for DHCP Messages. RFC 3118, Cisco Systems, University of
Maryland, June 2001.
[36] R. Gilligan and E. Nordmark. Transition Mechanisms for IPv6 Hosts and Routers. RFC 2893, FreeGate Corp., Sun
Microsystems, August 2000.
[37] S. Kent and R. Atkinson. IP Authentication Header. RFC 2402, BBN Corporation, At Home Network, November
1998.
[38] S. Kent and R. Atkinson. IP Encapsulating Security Payload (ESP). RFC 2406, BBN Corporation, At Home
Network, November 1998.
[39] A. Conta and S. Deering. Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6)
Specification. RFC 2463, Lucent, Cisco Systems, December 1998.
[40] M. Crawford. Transmission of IPv6 Packets over Ethernet Networks. RFC 2464, Fermilab, December 1998.
[41] T. Narten, E. Nordmark, and W. Simpson. Neighbor Discovery for IP Version 6 (IPv6). RFC 2461, IBM, Sun
Microsystems, Daydreamer, December 1998.
[42] G. Malkin. RIP Version 2. RFC 2453, Bay Networks, November 1998.
[43] F. Baker and R. Atkinson. RIP-2 MD5 Authentication. RFC 2082, Cisco Systems, January 1997.
[44] J. Moy. OSPF Version 2. RFC 2328, Ascend Communications, April 1998.
[45] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, IBM, Cisco, March 1995.
[46] G. Huston. The BGP Routing Table. The Internet Protocol Journal, 4(1), March 2001.
[47] J. Postel. User Datagram Protocol. RFC 768, ISI, August 1980.
[48] J. Postel. Transmission Control Protocol. RFC 793, ISI, September 1981.
[49] M. Allman, V. Paxson, and W. Stevens. TCP Congestion Control. RFC 2581, NASA Glenn/Sterling Software,
ACIRI/ICSI, April 1999.
[50] M. Allman, S. Floyd, and C. Partridge. Increasing TCP's Initial Window. RFC 3390, BBN/NASA GRC, ICIR,
BBN Technologies, October 2002.
[51] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and
V. Paxson. Stream Control Transmission Protocol. RFC 2960, Motorola, Cisco, Siemens, Nortel Networks, Ericsson,
Telcordia, UCLA, ACIRI, October 2000.
[52] L. Ong and J. Yoakum. An Introduction to the Stream Control Transmission Protocol (SCTP). RFC 3286, Ciena
Corporation, Nortel Networks, May 2002.
[53] J. Stone, R. Stewart, and D. Otis. Stream Control Transmission Protocol (SCTP) Checksum Change. RFC 3309,
Stanford, Cisco Systems, SANlight, September 2002.
[54] P. Mockapetris. Domain Names - Concepts and Facilities. RFC 1034, ISI, November 1987.
[55] P. Mockapetris. Domain Names - Implementation and Specification. RFC 1035, ISI, November 1987.
[56] D. Eastlake. Domain Name System Security Extensions. RFC 2535, IBM, March 1999.
[57] P. Fältström, P. Hoffman, and A. Costello. Internationalizing Domain Names in Applications (IDNA). RFC 3490,
Cisco, IMC & VPNC, UC Berkeley, March 2003.
[58] P. Hoffman and M. Blanchet. Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN). RFC
3491, IMC & VPNC, Viagenie, March 2003.
[59] A. Costello. Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications
(IDNA). RFC 3492, UC Berkeley, March 2003.
[60] S. Thomson and C. Huitema. DNS Extensions to support IP version 6. RFC 1886, Bellcore, INRIA, December
1995.
[61] S. Thomson, C. Huitema, V. Ksinant, and M. Souissi. DNS Extensions to Support IP Version 6. RFC 3596, Cisco,
Microsoft, 6WIND, AFNIC, October 2003.
[62] ITU. Information technology - Abstract Syntax Notation One (ASN.1): Specification of basic notation.
Recommendation ITU-T X.680, International Telecommunication Union, December 1997.
[63] ITU. Information technology - ASN.1 encoding rules: Specification of Basic Encoding Rules (BER), Canonical
Encoding Rules (CER) and Distinguished Encoding Rules (DER). Recommendation ITU-T X.690, International
Telecommunication Union, December 1997.
[64] D. Steedman. Abstract Syntax Notation One (ASN.1): The Tutorial and Reference. Technology Appraisals, 1990.
[65] M. Rose and K. McCloghrie. Structure and Identification of Management Information for TCP/IP-based Internets.
RFC 1155, Performance Systems International, Hughes LAN Systems, May 1990.
[66] J. Case, M. Fedor, M. Schoffstall, and J. Davin. A Simple Network Management Protocol. RFC 1157, SNMP
Research, PSI, MIT, May 1990.
[67] S. Legg. Generic String Encoding Rules (GSER) for ASN.1 Types. RFC 3641, Adacel Technologies, October
2003.
[68] D. Crocker and P. Overell. Augmented BNF for Syntax Specifications: ABNF. RFC 2234, Internet Mail
Consortium, Demon Internet Ltd., November 1997.
[69] J. Postel. Simple Mail Transfer Protocol. RFC 821, ISI, August 1982.
[70] J. Klensin. Simple Mail Transfer Protocol. RFC 2821, AT&T Laboratories, April 2001.
[71] J. Postel and J. Reynolds. File Transfer Protocol (FTP). RFC 959, ISI, October 1985.
[72] M. Allman and S. Ostermann. FTP Security Considerations. RFC 2577, NASA Glenn/Sterling Software, Ohio
University, May 1999.
[73] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol
-- HTTP/1.1. RFC 2616, UC Irvine, Compaq/W3C, Compaq, W3C/MIT, Xerox, Microsoft, W3C/MIT, June 1999.
[74] T. Berners-Lee, R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax. RFC 2396,
MIT/LCS, U.C. Irvine, Xerox Corporation, August 1998.
[75] I. Cooper and J. Dilley. Known HTTP Proxy/Caching Problems. RFC 3143, Equinix, Akamai Technologies, June
2001.
[76] J. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, Y. Goland, A. van Hoff, and D. Hellerstein. Delta encoding
in HTTP. RFC 3229, Compaq WRL, AT&T, Univ. of Saarbruecken, Marimba, ERS/USDA, January 2002.
[77] K. Moore. On the use of HTTP as a Substrate. RFC 3205, University of Tennessee, February 2002.
[78] S. McCanne and V. Jacobson. The BSD Packet Filter: A New Architecture for User-level Packet Capture. In Proc.
Usenix Winter Conference, January 1993.