OpenOSPFD - Paper

Design and Implementation of
OpenOSPFD
by Claudio Jeker <claudio@openbsd.org>
Internet Business Solutions AG
Abstract
OpenOSPFD is a free and secure implementation of the
Open Shortest Path First protocol. It allows ordinary
machines to be used as routers exchanging and calculating
routes within an OSPF cloud.
OpenOSPFD is the next major step after OpenBGPD for
full router capabilities in OpenBSD and other BSDs.
Together with OpenBGPD it is possible to reroute traffic
in case of link loss, resulting in a higher level of
availability.
Overview

1.1 Routing Protocols

The Internet is split into regions called Autonomous Systems (AS). Each AS is under the control of a single administrative entity – for example a university or an ISP. The edge routers of these AS use an Exterior Gateway Protocol (EGP) to exchange routing information between AS. Currently BGP4, the Border Gateway Protocol, is the only EGP in widespread use. Routers within an AS use an Interior Gateway Protocol (IGP) to exchange routing information. There are different IGPs; OSPF, IS-IS, and RIP are the most commonly used. It is possible and common to have multiple IGPs running inside one AS.

The Routing Information Protocol (RIP) is a legacy protocol that is often found on appliances. It is not suitable for larger networks because the distance vector algorithm used by RIP converges slowly, especially in the face of certain network failures (count to infinity). OSPF and IS-IS, on the other hand, are both link-state protocols. The Intermediate System to Intermediate System (IS-IS) protocol was developed for the OSI protocol suite under the lead of the ISO.

Why not use one protocol for everything, EGP and IGP? The requirements for an IGP differ from those of an EGP. For an IGP it is important to recalculate the routing table quickly when the network changes. Another factor is automatic neighbor discovery. On the other hand, the most important feature of an EGP is the ability to express routing policies. The resulting routing table is normally cost optimised.

1.2 Algorithms

There are two main concepts for exchanging routing information, and the corresponding algorithms work in totally different ways.

1.2.1 Distance Vector Algorithms

Distance vector algorithms got their name from the form of the routing updates: a vector of metrics.

In a distance vector algorithm every router exchanges its routing table with all its neighbors. The neighbors then walk through the list and compare each entry with their current route. If the advertised route is better, the current route is replaced and redistributed again.

In the case of RIP, the list of routes and their metrics is exchanged every 30 seconds. This results in slow convergence because an update propagates only one hop every 30 seconds. On the other hand, the protocol is simple and robust because every router cares only about its own neighbors. In other words, the information about the network topology is distributed. This results in one of the biggest weaknesses of RIP – the count to infinity problem – which causes slow convergence and routing loops if a network becomes unavailable. There are some countermeasures against this. The simplest is to pass the full routing path instead of only the metric. This path vector variant of the distance vector algorithm is used by BGP. It is easy to implement routing policies on distance vector algorithms.

1.2.2 Link-State Algorithms

In a link-state protocol every router or node sends out its current link-states. The link-state advertisements are distributed to all nodes in the network. The resulting replicated distributed database represents the entire network topology. Every node uses this connectivity map to calculate the shortest path to every other router. Link-state protocols have good convergence properties. The biggest weakness of link-state protocols is the replicated distributed database: if the database gets out of sync, non-optimal routes are used and in the worst case routing loops are created. Link-state protocols are more complicated than distance vector protocols.

OSPF – the protocol

The OSPF routing protocol was developed within the IETF. The work started in 1987. The current version (OSPFv2) of the specification was published in 1998 as RFC 2328.

Figure 1: Sample OSPF network (labels: area 0.0.0.0 (backbone), areas 0.0.0.1, 0.0.0.2, 0.0.0.4 (stub) and 0.0.0.5, ABRs, ASBRs, virtual links, Internet, RIP cloud)
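The distance vector update described in 1.2.1 can be sketched in C as follows. This is a minimal sketch, assuming a fixed table indexed by destination and a RIP-style "infinity" metric of 16; the names dv_route, dv_merge and DV_INFINITY are hypothetical and the code is not taken from OpenOSPFD or any RIP implementation.

```c
#include <assert.h>
#include <stddef.h>

#define DV_INFINITY 16	/* hypothetical RIP-style "unreachable" metric */

struct dv_route {
	int metric;	/* cost to reach the destination */
	int nexthop;	/* neighbor the route was learned from, -1 if none */
};

/*
 * Merge the vector `adv` advertised by neighbor `nbr`, reached over a
 * link of cost `link_cost`, into `table`. Both arrays have `n` entries
 * indexed by destination. Returns how many entries changed; a real
 * protocol would redistribute exactly those entries.
 */
int
dv_merge(struct dv_route *table, const int *adv, size_t n, int nbr,
    int link_cost)
{
	size_t i;
	int changed = 0;

	assert(n == 0 || (table != NULL && adv != NULL));
	for (i = 0; i < n; i++) {
		int m = adv[i] + link_cost;

		if (m > DV_INFINITY)
			m = DV_INFINITY;
		/*
		 * Take the route if it is better, or if it comes from the
		 * neighbor we already use: its view of the route changed.
		 */
		if (m < table[i].metric ||
		    (table[i].nexthop == nbr && m != table[i].metric)) {
			table[i].metric = m;
			table[i].nexthop = nbr;
			changed++;
		}
	}
	return changed;
}
```

The second condition is what lets bad news travel at all: a router must believe a worse metric from the neighbor it already routes through. Because each router only ever sees its neighbors' vectors, that same rule is also where the count to infinity problem comes from.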
2.1 Architecture

The Open Shortest Path First (OSPF) protocol is a link-state, hierarchical routing protocol. It is probably the most used IGP in the world. It is capable of doing neighbor discovery on different types of networks with minimal need for configuration. OSPF encapsulates its routing messages directly on top of IP as its own protocol type (89). TCP connections are not used because the link-state flooding algorithm already includes its own mechanism for reliable communication, which adds to OSPF's complexity. Most obviously, the heavy use of IP multicast in OSPF makes TCP infeasible.

2.1.1 Networks

An OSPF router discovers neighbors by periodically sending OSPF Hello packets out on all configured interfaces. Depending on the interface type, different methods are used. The flooding algorithm depends on the interface type as well.

The simplest interface type is a point-to-point interface. Neighbor discovery is easy – there is only one neighbor on the other side of the link – and no special link-state flooding enhancement is required.

For Ethernet and other broadcast networks OSPF uses multicast to find all neighbors on the segment. The link-state updates are flooded via multicast as well. To make things even more complicated, a designated router (DR) was introduced. The DR has the duty to enforce reliable flooding for all other routers connected to the same LAN. A backup designated router (BDR) was introduced to take over in case of a DR failure.

Additionally, more flooding procedures were defined for other important network types like NBMA (non-broadcast multiple-access) or point-to-multipoint networks. Examples include X.25, Frame Relay, or ATM using full mesh or switched virtual circuits. OpenOSPFD does not support these exotic networks, mostly because of lack of support by the OS and missing infrastructure.

2.1.2 Database synchronisation and reliable flooding

Database synchronisation in a link-state protocol is crucial. The routing calculation ensures loop-free routing as long as the database remains perfectly synchronised. It is no wonder that this is the most fragile part of the specification, especially with all the additional complexity added by multicasting of updates and the presence of DR and BDR routers. A reliable and robust flooding procedure is very important because a small oversight can result in a major network “melt down” where only a full reset of all routers cures the situation.

Database synchronisation takes two forms. First there is the initial database synchronisation. Following it, the distributed copies of the database need to be kept in sync by reliably flooding updates to all routers in the network.

The initial database exchange is done when two routers build an adjacency. First a request list is built up through a TFTP-like database exchange phase. In the exchange phase one of the two neighbors is elected as master of that session. This router sends a Database Description packet to the slave and waits for an answer. If none is received within some amount of time, the packet is retransmitted. A sequence number identifies duplicates. At any given point in time only one packet can be outstanding. Afterwards, Link-State Requests are sent between the two routers. The other side then sends the requested link-state advertisement (LSA) back to the requesting router. A full adjacency has been set up when the request list is empty.

Now reliable flooding needs to ensure that the databases remain perfectly synchronised. Every time a link changes state, or after a 30 minute timeout, an LSA needs to be reflooded. An LS update received on one interface needs to be sent out on all other interfaces. This simple rule is unfortunately not sufficient because the flooding would never stop. So the router checks its database to see if the update was already received on a different path. In that case the update does not need to be reflooded. It is also necessary to acknowledge the updates because an unreliable transport layer was chosen. Additionally, implicit acknowledgements and timeouts, as well as throttling of the generated LS updates, help to make the flooding more robust and, yet again, the implementation more complex.

2.1.3 Areas

One problem of a link-state protocol is the computation cost borne by every router, particularly in large networks. Many routers have an underpowered CPU, and so OSPF areas were invented to divide a large network into smaller pieces. Every area is connected to a special backbone area. In most cases inter-area routing goes via the backbone. Routers that are connected to multiple areas are area border routers (ABR) and are always connected to the backbone area. If no direct connection to the backbone is possible, a virtual link has to be established to at least one backbone router. Areas where no transit traffic is exchanged can be converted into stub areas, reducing the routing table to a bare minimum. Stub areas are useful to connect routers with minimal memory configurations to large OSPF clouds.

LSAs are flooded only inside an area. The ABR has the duty to reflood the other areas with special summary-LSAs to inform them of available prefixes inside the originating area.
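The flooding rule from 2.1.2 (send a new LSA out on every interface except the one it arrived on, and drop copies the database already holds) can be sketched as below. This is a simplified sketch, assuming a tiny array-based database and comparing only LS sequence numbers; real OSPF also compares checksum and age, sends acknowledgements, and applies the DR/BDR rules on broadcast networks. All names (lsa_key, lsa_copy, lsdb, flood_decide) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An LSA is identified by (LS type, Link-State ID, Advertising Router). */
struct lsa_key {
	uint8_t  type;
	uint32_t ls_id;
	uint32_t adv_rtr;
};

struct lsa_copy {
	struct lsa_key key;
	int32_t        seq;	/* LS sequence number, larger means newer */
};

#define DB_MAX 16	/* hypothetical fixed database size */

struct lsdb {
	struct lsa_copy e[DB_MAX];
	size_t          n;
};

static int
key_eq(const struct lsa_key *a, const struct lsa_key *b)
{
	return a->type == b->type && a->ls_id == b->ls_id &&
	    a->adv_rtr == b->adv_rtr;
}

/*
 * Process an LSA received on interface `in_if` (0 <= in_if < nifs < 32).
 * Returns a bitmask of interfaces to flood it on: every interface except
 * the receiving one if the copy is new, 0 if the database already holds
 * this instance or a newer one (or, in this sketch, is full).
 */
uint32_t
flood_decide(struct lsdb *db, const struct lsa_copy *rcv, int in_if, int nifs)
{
	size_t i;

	assert(nifs > 0 && nifs < 32 && in_if >= 0 && in_if < nifs);
	for (i = 0; i < db->n; i++)
		if (key_eq(&db->e[i].key, &rcv->key))
			break;
	if (i < db->n) {
		if (rcv->seq <= db->e[i].seq)
			return 0;		/* already seen via another path */
		db->e[i].seq = rcv->seq;	/* install the newer instance */
	} else {
		if (db->n == DB_MAX)
			return 0;
		db->e[db->n++] = *rcv;		/* first copy of this LSA */
	}
	return ((1u << nifs) - 1) & ~(1u << in_if);
}
```

The duplicate check is what makes the flooding terminate: once every router has installed an instance, further copies of it arriving over other paths stop there.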
2.1.4 Border routers

Besides ABRs, another kind of border router exists. A router is automatically an AS border router (ASBR) if it imports routes from external sources into the link-state database. External sources are other routing protocols or manually configured static routes. These routers are on the border of the OSPF cloud but are not necessarily on the real AS border. The external routes redistributed by an ASBR are special as they are flooded through the full OSPF cloud instead of per area like all other LSAs. Only stub areas are left out, to avoid overloading those poor little routers in them.

2.2 Packets

There are five different packet types defined. Every packet starts with a common 24 byte OSPF header. This header includes all necessary information for the recipient to determine if the packet should be accepted and processed or ignored and dropped.

  Version # | Type | Packet Length
  Router ID
  Area ID
  Checksum | Authentication Type
  Authentication Data
  Authentication Data
Figure 2: Common OSPF header

The standard IP checksum (a 16-bit one's complement sum, not a CRC) is used to validate packet integrity. Multiple authentication procedures are defined, but only one can be considered useful: only cryptographic authentication is strong enough to protect OSPF traffic, as only it can prevent spoofing and replay attacks. After the verification the payload of the packet is examined.

The following packet types are defined:

Table 1: OSPF packet types
  1  Hello
  2  Database Description
  3  Link-State Request
  4  Link-State Update
  5  Link-State Acknowledgement

2.2.1 Hello

  Version # | 1 | Packet Length
  Router ID
  Area ID
  Checksum | Authentication Type
  Authentication Data
  Authentication Data
  Network Mask
  Hello Interval | Options | Router Priority
  Router Dead Interval
  Designated Router
  Backup Designated Router
  Neighbor
  ...
Figure 3: Hello Header

Hello packets are sent periodically in order to establish and maintain neighbor relationships. Hello packets are sent to a multicast group to enable dynamic discovery of neighboring routers. All routers connected to a common network must agree on certain parameters. The most important part of the Hello packet is the neighbor list at the end. The router ID of each router from which a valid Hello packet has recently been received is added to that list. An adjacency can only be formed after a router has seen its own router ID in a neighbor's Hello packet.

2.2.2 Database Description

  Version # | 2 | Packet Length
  Router ID
  Area ID
  Checksum | Authentication Type
  Authentication Data
  Authentication Data
  Interface MTU | Options | Flags
  DD Sequence Number
  LSA Header
  ...
Figure 4: Database Description Header
These packets are exchanged when an adjacency is initialised. They describe the contents of the link-state database. The initial database exchange works similarly to the TFTP protocol; for that reason a sequence number is included in the header.

Additionally, the MTU of the outgoing interface is included to detect possible forwarding issues with large packets. The rest of the packet consists of a list of LSA headers. An LSA header contains all the information needed to uniquely identify an LSA.
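The acceptance checks described in 2.2 (a fixed 24 byte header, a known packet type, a valid checksum) can be sketched in C as below. The sketch assumes the field layout of Figure 2 in network byte order and, following RFC 2328, that the checksum covers the whole packet except the 64-bit authentication field and is not computed when cryptographic authentication (type 2) is used. The names ospf_hdr_view, ospf_cksum and ospf_hdr_parse are hypothetical; this is not OpenOSPFD's packet.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OSPF_HDR_LEN	24

struct ospf_hdr_view {		/* host byte order copy of Figure 2 */
	uint8_t  version;
	uint8_t  type;
	uint16_t len;
	uint32_t rtr_id;
	uint32_t area_id;
	uint16_t cksum;
	uint16_t auth_type;
};

static uint16_t
get16(const uint8_t *p)
{
	return (uint16_t)(p[0] << 8 | p[1]);
}

static uint32_t
get32(const uint8_t *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	    (uint32_t)p[2] << 8 | p[3];
}

/* One's complement Internet checksum, skipping bytes 16..23 (auth data). */
uint16_t
ospf_cksum(const uint8_t *pkt, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		if (i >= 16 && i < 24)
			continue;
		sum += get16(pkt + i);
	}
	if (len & 1)
		sum += (uint32_t)pkt[len - 1] << 8;	/* pad the odd byte */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	return (uint16_t)~sum;
}

/* Returns 0 and fills *h if the packet is acceptable, -1 otherwise. */
int
ospf_hdr_parse(const uint8_t *pkt, size_t len, struct ospf_hdr_view *h)
{
	assert(pkt != NULL && h != NULL);
	if (len < OSPF_HDR_LEN)
		return -1;
	h->version = pkt[0];
	h->type = pkt[1];
	h->len = get16(pkt + 2);
	h->rtr_id = get32(pkt + 4);
	h->area_id = get32(pkt + 8);
	h->cksum = get16(pkt + 12);
	h->auth_type = get16(pkt + 14);
	if (h->version != 2 || h->type < 1 || h->type > 5)
		return -1;		/* OSPFv2, packet types of Table 1 */
	if (h->len < OSPF_HDR_LEN || h->len > len)
		return -1;
	/* a correct checksum makes the sum over the packet come out as 0 */
	if (h->auth_type != 2 && ospf_cksum(pkt, h->len) != 0)
		return -1;
	return 0;
}
```

To send a packet, the checksum field is zeroed, ospf_cksum() is run over the packet and the result is stored at offset 12; the receiver then gets 0 from the same function.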
2.2.3 Link-State Request

  Version # | 3 | Packet Length
  Router ID
  Area ID
  Checksum | Authentication Type
  Authentication Data
  Authentication Data
  LS Type
  Link-State ID
  Advertising Router
  ...
Figure 5: Link-State Request Header

After exchanging Database Description packets with the neighboring router, Link-State Request packets request the pieces of the neighbor's LS database that are more up-to-date. Each LSA requested is specified by its LS type, Link-State ID, and Advertising Router. This uniquely identifies the LSA, but not its instance. Link-State Request packets are understood to be requests for the most recent instance. It is possible to request multiple LSAs with one LS request packet.

2.2.4 Link-State Update

  Version # | 4 | Packet Length
  Router ID
  Area ID
  Checksum | Authentication Type
  Authentication Data
  Authentication Data
  Number of LSAs
  LSA
  ...
Figure 6: Link-State Update Header

These packets implement the flooding of LSAs. Each Link-State Update packet carries a collection of LSAs one hop further from their origin. Several LSAs may be included in a single packet. The body of the Link-State Update packet consists of a list of LSAs.

2.2.5 Link-State Acknowledgement

  Version # | 5 | Packet Length
  Router ID
  Area ID
  Checksum | Authentication Type
  Authentication Data
  Authentication Data
  LSA Header
  ...
Figure 7: Link-State Acknowledgement Header

In order to make the flooding procedure reliable, flooded LSAs are acknowledged in Link-State Acknowledgement packets. Multiple LSAs can be acknowledged in a single Link-State Acknowledgement packet. The format of this packet is similar to that of the Database Description packet: the body of both packets is simply a list of LSA headers.

2.2.6 Link-State Advertisements Header

Each LSA begins with a common 20 byte header. This header is enough to uniquely identify an LSA, so it is sufficient to use the LSA header in LS acknowledgements and Database Description packets. LSAs are identified by the LS type, Link-State ID, and Advertising Router triple. Additionally, an LS sequence number and LS age are included to determine which instance is more recent. The LS checksum protects the integrity of LSAs. Instead of the one's complement Internet checksum used by many IP protocols, an ISO checksum algorithm – also known as the Fletcher checksum – is employed.

  LS age | Options | LS Type
  Link-State ID
  Advertising Router
  LS sequence number
  LS Checksum | Length
Figure 8: Link-State Advertisements Header

Each LSA type has a separate advertisement format. The LS types defined in the OSPF standard are as follows:

Table 2: LS types
  1  Router-LSA
  2  Network-LSA
  3  Summary-LSA (IP network)
  4  Summary-LSA (ASBR)
  5  AS-external-LSA
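To illustrate the LS checksum from 2.2.6, here is a sketch of a Fletcher checksum over an LSA. It skips the LS age field and chooses the two check bytes at the LS Checksum offset (16, see Figure 8) so that both running sums become zero modulo 255, following the ISO placement formula described in RFC 1008. The function names are hypothetical; this is not the code in iso_cksum.c, and it has not been checked against captured LSAs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LSA_HDR_LEN	20
#define LSA_CKSUM_OFF	16	/* offset of LS Checksum in the LSA header */

/* Running Fletcher sums over buf[0..len), reduced modulo 255. */
static void
fletcher_sums(const uint8_t *buf, size_t len, int *c0, int *c1)
{
	size_t i;

	*c0 = *c1 = 0;
	for (i = 0; i < len; i++) {
		*c0 = (*c0 + buf[i]) % 255;
		*c1 = (*c1 + *c0) % 255;
	}
}

/*
 * Compute and store the checksum of an LSA of `len` bytes. The LS age
 * (first two bytes) is excluded, so the data starts at offset 2 and the
 * first check byte sits at position LSA_CKSUM_OFF - 2 within it.
 */
void
lsa_cksum_set(uint8_t *lsa, size_t len)
{
	size_t dlen = len - 2;
	size_t pos = LSA_CKSUM_OFF - 2;
	int c0, c1, x, y;

	assert(lsa != NULL && len >= LSA_HDR_LEN);
	lsa[LSA_CKSUM_OFF] = lsa[LSA_CKSUM_OFF + 1] = 0;
	fletcher_sums(lsa + 2, dlen, &c0, &c1);

	/* x and y make both sums 0 mod 255; 0 is encoded as 255 */
	x = ((int)(dlen - pos - 1) * c0 - c1) % 255;
	if (x <= 0)
		x += 255;
	y = 510 - c0 - x;
	if (y > 255)
		y -= 255;
	lsa[LSA_CKSUM_OFF] = (uint8_t)x;
	lsa[LSA_CKSUM_OFF + 1] = (uint8_t)y;
}

/* An LSA is intact if both sums over everything but LS age are 0. */
int
lsa_cksum_ok(const uint8_t *lsa, size_t len)
{
	int c0, c1;

	if (lsa == NULL || len < LSA_HDR_LEN)
		return 0;
	fletcher_sums(lsa + 2, len - 2, &c0, &c1);
	return c0 == 0 && c1 == 0;
}
```

Leaving LS age out of the checksum is what allows every router to increment the age of a stored LSA without recomputing its checksum.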
Router-LSAs and Network-LSAs describe the network inside an area. Summary-LSAs are injected by area border routers (ABRs) and describe inter-area destinations. AS-external-LSAs are originated by ASBRs to describe destinations external to the OSPF routing domain.

Design

Figure 9: Design of OpenOSPFD (diagram: the parent process with root privileges fetches and updates the kernel routing table via the routing socket and fork()s two jailed children, the OSPF engine and the RDE, both running as _ospfd:_ospfd and chrooted to /var/empty; socketpairs connect the processes and carry messages such as interface changes, route updates, the redistribute list, and flood, request and response updates; the OSPF engine also holds the UNIX socket /var/run/ospfd.sock used by ospfctl and the raw IP socket for protocol 89)

The design of OpenOSPFD is based on the one in OpenBGPD. The routing daemon is split into three processes. The privileged parent process handles the kernel routing table updates. The OSPF engine handles all incoming packets and the state machines with all the necessary periodic events and timeouts. Finally, the route decision engine (RDE) stores the LS database and calculates the SPF tree and the resulting routing table. This separation into three processes enhances not only the security but also the stability: even a large database recomputation in the RDE will not hold up the keepalive (Hello) packets sent out by the OSPF engine. The Inter-Process Communication (IPC) system is almost the same as in OpenBGPD. The only major difference is the use of libevent for timers and file descriptor polling instead of poll(2). The basic imsg framework is still the same. OpenOSPFD switched to libevent mostly because of the OSPF engine, which is mostly event driven with many concurrent timers running. OpenOSPFD can be controlled and monitored via ospfctl, which works very similarly to bgpctl for OpenBGPD.

3.1 Processes

3.1.1 ospfd parent

The ospfd parent process is the only one running with root privileges. This is necessary to update the kernel routing table. This process listens on a routing socket for changes and updates and distributes that information to the OSPF engine or the RDE. Config-file reloads will later be handled by the parent process too.

3.1.2 OSPF engine

The OSPF engine listens to the network and processes the OSPF packets. Both the interface and the neighbor finite state machines are implemented in the OSPF engine. This includes the DR/BDR election process. Additionally, the reliable flooding of LS updates with retransmission and acknowledgement is done by the engine.

3.1.3 Route Decision Engine

The RDE stores the LS database, calculates the SPF tree, and informs the parent process about changes in the resulting routing table. Premature LSA aging is done by the RDE as well. Additionally, redistribution of networks is handled by this process. The RDE synchronises multiple areas if the router is acting as ABR and refloods summary-LSAs into the different areas if necessary.

3.1.4 ospfctl

ospfctl is the tool to control and monitor OpenOSPFD. It uses a UNIX local socket to communicate with ospfd. Over this socket, imsgs are passed which encapsulate the information. There is no command line interface to OpenOSPFD because it doesn't make sense to write a clumsy CLI on a UNIX system shipping with very powerful shells and many tools to manipulate the status output. ospfctl is mostly an adapted bgpctl.
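The design above passes imsgs between processes over socketpairs. The sketch below shows only the underlying idea: a small fixed header carrying a message type, a peer ID and a length, written to one end of a socketpair(2) and read from the other. struct msg_hdr, msg_send() and msg_recv() are hypothetical names and this is not the imsg API. It assumes an AF_UNIX datagram socketpair so that every read returns exactly one message; with a stream socket the messages would have to be reassembled from a byte stream. Host byte order is fine because both ends run on the same machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

struct msg_hdr {		/* hypothetical, not struct imsg_hdr */
	uint32_t type;		/* what the message is about */
	uint32_t peerid;	/* which neighbor it concerns */
	uint16_t len;		/* header plus payload */
};

/* Send one message; returns 0 on success, -1 on error. */
int
msg_send(int fd, uint32_t type, uint32_t peerid, const void *data,
    uint16_t datalen)
{
	uint8_t buf[512];
	struct msg_hdr hdr;
	ssize_t w;

	assert(fd >= 0);
	if (sizeof(hdr) + datalen > sizeof(buf))
		return -1;
	memset(&hdr, 0, sizeof(hdr));	/* no stray padding bytes */
	hdr.type = type;
	hdr.peerid = peerid;
	hdr.len = (uint16_t)(sizeof(hdr) + datalen);
	memcpy(buf, &hdr, sizeof(hdr));
	if (datalen > 0)
		memcpy(buf + sizeof(hdr), data, datalen);
	w = write(fd, buf, hdr.len);
	return w == (ssize_t)hdr.len ? 0 : -1;
}

/* Receive one message; returns the payload length or -1 on error. */
int
msg_recv(int fd, struct msg_hdr *hdr, void *data, size_t datalen)
{
	uint8_t buf[512];
	ssize_t n = read(fd, buf, sizeof(buf));
	size_t plen;

	if (n < (ssize_t)sizeof(*hdr))
		return -1;
	memcpy(hdr, buf, sizeof(*hdr));
	plen = (size_t)n - sizeof(*hdr);
	if (hdr->len != (size_t)n || plen > datalen)
		return -1;
	if (plen > 0)
		memcpy(data, buf + sizeof(*hdr), plen);
	return (int)plen;
}
```

The peer ID field mirrors the role the paper gives peerid: it tells the receiving process which neighbor a message is about, so all three processes can refer to the same neighbor without sharing memory.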
Implementation

OpenOSPFD currently consists of around 12'000 lines of C code. For comparison, OpenBGPD is currently a bit under 20'000 lines. Zebra/Quagga ospfd has almost 40'000 lines of code, and that is just the ospfd directory, not including the 35'000 lines in lib and the 15'000 lines for the zebra daemon.

Let's start with a short overview of the source files.

Table 3: Overview of source files
  area.c        Area handling, which is actually very simple.
  auth.c        Implementation of all OSPF authentication extensions. Nobody wants to run an OSPF network without using cryptographic authentication.
  buffer.c      Buffer handling, mostly for the imsg framework but also used to generate outgoing packets.
  control.c     ospfctl session management and message verification.
  database.c    Code for the initial database exchange. This is not related to the LS database that is managed by the RDE.
  hello.c       Generating and parsing of Hello packets is done here.
  imsg.c        imsg framework, mostly copied from OpenBGPD.
  in_cksum.c    Implementation of the 16-bit one's complement checksum of the TCP/IP standards.
  interface.c   Interface finite state machine, event handling and interface specific functions.
  iso_cksum.c   ISO checksum, also known as Fletcher checksum, for LSAs.
  kroute.c      Kernel routing socket handling, including the FIB table.
  log.c         Various logging functions, mostly adapted from OpenBGPD.
  lsack.c       Link-State Acknowledgement construction and parsing.
  lsreq.c       Link-State Request construction and parsing, including the request list functions.
  lsupdate.c    Link-State Update construction and parsing, including the flooding function and retransmission lists.
  neighbor.c    Neighbor finite state machine and event handling.
  ospfd.c       Parent process, home of main().
  ospfe.c       OSPF engine main event loop plus functions for self-originated LSAs.
  packet.c      Packet reception and sending.
  parse.y       Configuration parser.
  printconf.c   Configuration dumping used by the -n switch.
  rde.c         RDE main event loop plus other RDE specific functions.
  rde_lsdb.c    LS database code.
  rde_spf.c     SPF algorithm and RIB calculation.

4.1 Important datastructures

There are four main datastructures in OpenOSPFD. It is important to know what each of these structures represents in order to understand the code. Most of the time when the term interface is used, the actual struct iface of that interface is meant. Ditto for neighbor or area.

4.1.1 ospfd_conf

This is the main config of the router. It holds parameters like the router ID, spf_delay or redistribute_flags. The lsa_tree and cand_list are used in the RDE by the LS database and the SPF algorithm. The area_list holds all configured areas. Finally, there is one event handler, used for polling the raw socket or implementing the SPF timer, depending on the process it is used in.

Code snip 1: struct ospfd_conf
struct ospfd_conf {
	struct event		ev;
	struct in_addr		rtr_id;
	struct lsa_tree		lsa_tree;
	LIST_HEAD(, area)	area_list;
	LIST_HEAD(, vertex)	cand_list;
	u_int32_t		opts;
	u_int32_t		spf_delay;
	u_int32_t		spf_hold_time;
	int			spf_state;
	int			ospf_socket;
	int			flags;
	int			redistribute_flags;
	int			options;	/* OSPF options */
	u_int8_t		rfc1583compat;
	u_int8_t		border;
};

4.1.2 area

Area specific configurations are stored in the area descriptor. There are many parameters that are mostly used by the OSPF engine. Exclusively for the RDE are the lsa_tree and the nbr_list. The first stores the per area LS database. The second is a list of all active neighbors from the RDE perspective. The OSPF engine tells the RDE when neighbors are created, deleted, or when their state changes. On the other hand, active is only used by
the OSPF engine. active tracks the number of neighbors which are in state FULL. If the number is zero, the area is considered inactive. This counter is used to determine if a router is an area border router.

Code snip 2: struct area
struct area {
	LIST_ENTRY(area)	 entry;
	struct in_addr		 id;
	struct lsa_tree		 lsa_tree;
	LIST_HEAD(, iface)	 iface_list;
	LIST_HEAD(, rde_nbr)	 nbr_list;
	u_int32_t		 stub_default_cost;
	u_int32_t		 num_spf_calc;
	u_int32_t		 dead_interval;
	int			 active;
	u_int16_t		 transmit_delay;
	u_int16_t		 hello_interval;
	u_int16_t		 rxmt_interval;
	u_int16_t		 metric;
	u_int8_t		 priority;
	u_int8_t		 transit;
	u_int8_t		 stub;
};

4.1.3 interface

Every configured interface is represented by a struct iface. It stores values like the link state (linkstate), baudrate, MTU, and interface type. There are some additional OSPF specific parameters like the auth_type, the list of keys used for cryptographic authentication (auth_md_list), the interface metric and the interface state.

Let's have a look at the neighbor list and the three neighbor pointers dr, bdr, and self. dr and bdr are pointers to the active DR or BDR neighbor, or NULL if there is none. self is used for a dummy neighbor structure that represents the router itself. Using this dummy neighbor simplifies many cases, but additional care needs to be taken not to remove it by accident or do some other stupid action with it. A back pointer to the parent area this interface is part of is also included. An interface can have up to three concurrent timers running, and therefore three different event structures are needed.

Code snip 3: struct iface
struct iface {
	LIST_ENTRY(iface)	 entry;
	struct event		 hello_timer;
	struct event		 wait_timer;
	struct event		 lsack_tx_timer;
	LIST_HEAD(, nbr)	 nbr_list;
	TAILQ_HEAD(, auth_md)	 auth_md_list;
	struct lsa_head		 ls_ack_list;
	char			 name[IF_NAMESIZE];
	struct in_addr		 addr;
	struct in_addr		 dst;
	struct in_addr		 mask;
	struct in_addr		 abr_id;
	char			*auth_key;
	struct nbr		*dr;
	struct nbr		*bdr;
	struct nbr		*self;
	struct area		*area;
	u_int32_t		 baudrate;
	u_int32_t		 dead_interval;
	u_int32_t		 ls_ack_cnt;
	u_int32_t		 crypt_seq_num;
	unsigned int		 ifindex;
	int			 fd;
	int			 state;
	int			 mtu;
	u_int16_t		 flags;
	u_int16_t		 transmit_delay;
	u_int16_t		 hello_interval;
	u_int16_t		 rxmt_interval;
	u_int16_t		 metric;
	enum iface_type		 type;
	enum auth_type		 auth_type;
	u_int8_t		 auth_keyid;
	u_int8_t		 linkstate;
	u_int8_t		 priority;
	u_int8_t		 passive;
};

4.1.4 neighbor

struct nbr represents the neighbor relationship from the local point of view. To maintain a session successfully, an LS retransmission list and a request list are required, plus a list for the database snapshot. A few values – dd_seq_num, dd_pending, last_rx_options, last_rx_bits, and master – are only used in the EXCHANGE phase, when Database Description packets are transmitted. peerid is a unique ID used in all three processes. The peerid is used in imsgs to tell the recipient which neighbor the just received message is about. The interface over which this neighbor is reached is stored in iface. The neighbor structure is per interface, so if two routers are connected via two different networks, two different neighbor structures will be created for the same router, but the structures are added to different interfaces.

Code snip 4: struct nbr
struct nbr {
	LIST_ENTRY(nbr)		 entry, hash;
	struct event		 inactivity_timer;
	struct event		 db_tx_timer;
	struct event		 lsreq_tx_timer;
	struct event		 ls_retrans_timer;
	struct event		 adj_timer;
	struct nbr_stats	 stats;
	struct lsa_head		 ls_retrans_list;
	struct lsa_head		 db_sum_list;
	struct lsa_head		 ls_req_list;
	struct in_addr		 addr;
	struct in_addr		 id;
	struct in_addr		 dr;		/* designated router */
	struct in_addr		 bdr;		/* backup DR */
	struct iface		*iface;
	struct lsa_entry	*ls_req;
	struct lsa_entry	*dd_end;
	u_int32_t		 dd_seq_num;
	u_int32_t		 dd_pending;
	u_int32_t		 peerid;	/* unique ID in DB */
	u_int32_t		 ls_req_cnt;
	u_int32_t		 crypt_seq_num;
	int			 state;
	u_int8_t		 priority;
	u_int8_t		 options;
	u_int8_t		 last_rx_options;
	u_int8_t		 last_rx_bits;
	u_int8_t		 master;
};

4.2 Parent Process

4.2.1 Start-up

On start-up ospfd first initialises the log subsystem and fetches the list of available interfaces. This list is required for the next step, the configuration file parsing. The yacc parser used by ospfd is based on bgpd's parser, which in turn has its origin in the pf parser. Explaining
the parser goes beyond the scope of this paper. It is important to know that the configuration is parsed into a hierarchy of structures. The configuration consists of a list of areas, and every area holds a list of the interfaces that are part of this area. Last but not least, every interface has a list of neighbors that is dynamically created as soon as a valid Hello packet is received from another OSPF router on that interface.

After the file is parsed, ospfd daemonises and starts the child processes. Beforehand, a set of socketpairs (a special sort of pipe) is created. Finally, the event handlers are set up, the rest of the kroute structures is initialised, and the parent reports ready for service.

Meanwhile both children have started. First of all, both chroot(2) to /var/empty and drop privileges by switching to the special user _ospfd. Before doing that, the OSPF engine creates a UNIX local socket for ospfctl and opens the raw IP socket to receive packets from and send packets to the network. After dropping privileges, the OSPF engine initialises the different subsystems, sets the event handlers and starts the actual work by kicking the interface finite state machine. The RDE start-up is even simpler, as it just has to initialise the internal structures and event handlers.

4.2.2 Routing socket and FIB

The main purpose of the parent process is to maintain the Forwarding Information Base (FIB) and keep the information in sync with the kernel routing table. This synchronisation has to be done in both directions. Additionally, link-state changes and the arrival or departure of interfaces are handled via the routing socket as well.

The kroute code maintains two primary data structures: a prefix tree (kroute) and an interface tree (kif). These two trees are kept in sync with the kernel through the routing socket. On start-up fetchtable() loads the kroute tree and fetchifs() does the same for the kif tree. Routing changes are tracked by dispatch_rtmsg(), which handles kroute changes directly but off-loads interface specific messages to if_change() and if_announce(). To modify the kernel routing table, send_rtmsg() is used. send_rtmsg() translates a struct kroute into the rt_msg structure expected by the routing socket. The parent process uses kr_change() to add or modify routes and kr_delete() to remove routes. These changes are propagated to the kernel routing table if needed.

Both the kroute and the kif tree are implemented as red-black trees, a kind of balanced binary search tree. An API to find, insert and remove nodes is specified to simplify the tree manipulation.

Every time a route is added to or removed from the kroute tree, kr_redistribute() is called. This function transmits possible candidates for redistribution to the RDE. In the RDE, kif_validate() verifies that the nexthop is actually reachable. This is a workaround that should be fixed later, as it is currently not possible to track and handle newly arriving network interfaces at runtime. Last but not least, kr_show_route() and kr_ifinfo() pass information about kroutes or interfaces to ospfctl.

4.3 OSPF Engine

The finite state machines implemented in ospfd are simple table driven state machines. Any state transition may result in a specific action being run. The resulting next state can either be a result of the action or is fixed and pre-determined.

4.3.1 Interface state machine

Figure 10: Interface FSM (states DOWN, LOOPBACK, POINT-TO-POINT, WAITING, DROTHER, BACKUP and DR; transitions labelled Up, LoopIndication, UnloopIndication, WaitTimer, BackupSeen, NeighborChange and Election)

DOWN
In this state, the lower-level protocols have indicated that the interface is unusable. No protocol traffic at all will be sent or received on such an interface.

LOOPBACK
In this state, the router's interface to the network is looped back. Loopback interfaces are advertised in router-LSAs as single host routes, whose destination is the interface IP address.

POINT-TO-POINT
Point-to-point networks or virtual links enter this state as soon as the interface is operational.

WAITING
Broadcast or NBMA interfaces enter this state when the interface becomes operational. While in this state no DR/BDR election is allowed. Receiving and sending of Hello packets is allowed and is used to try to determine the identity of the DR/BDR routers.
DROTHER

The router is neither DR nor BDR on the connected network. In this state the router forms adjacencies only with the DR and the BDR. All other neighbors will stay in neighbor state 2-WAY.

BACKUP

The router is the BDR on the connected network segment. If the DR fails it will promote itself to be the new DR. The router forms adjacencies with all neighbors in the network segment.

DR

The router is the DR on the connected network segment. Adjacencies are established with all neighbors in the network segment. Additional duties are the origination of a network-LSA for the network node and the flooding of LS updates on behalf of all other neighbors.

Only a few events are needed. The events UP, DOWN, LOOP and UNLOOP are obvious. The other events WAITTIMER, BACKUPSEEN and NEIGHBORCHANGE are restricted to broadcast and NBMA networks. WAITTIMER and BACKUPSEEN are used to move out of state WAITING by running the election process. The NEIGHBORCHANGE event is issued when there is a change in the set of bidirectional neighbors. This event forces a re-election of the DR and BDR.

The most important actions are if_act_start() and if_act_elect(). if_act_start() sets the correct next state (POINT-TO-POINT or WAITING), initialises the interface and starts the hello timer to begin the neighbor discovery process. if_act_elect() elects a DR and BDR for a network. This function caused major problems because of subtle bugs and a sloppily written RFC.

First a backup designated router has to be elected.

Code snip 5: BDR election
/* elect backup designated router */
LIST_FOREACH(nbr, &iface->nbr_list, entry) {
	if (nbr->priority == 0 ||		/* not electable */
	    nbr->state & NBR_STA_PRELIM ||	/* not available */
	    nbr->dr.s_addr == nbr->addr.s_addr ||
	    nbr == dr)				/* don't elect DR */
		continue;
	if (bdr != NULL) {
		/*
		 * routers announcing themselves as BDR
		 * have higher precedence over those
		 * routers announcing a different BDR.
		 */
		if (nbr->bdr.s_addr == nbr->addr.s_addr) {
			if (bdr->bdr.s_addr ==
			    bdr->addr.s_addr)
				bdr = if_elect(bdr, nbr);
			else
				bdr = nbr;
		} else if (bdr->bdr.s_addr !=
		    bdr->addr.s_addr)
			bdr = if_elect(bdr, nbr);
	} else
		bdr = nbr;
}

Every neighbor is evaluated; neighbors with a priority of 0 are skipped. Additionally all neighbors that are not in state 2-WAY or higher, plus possible DRs, are skipped. From the remaining set a BDR is selected. Routers announcing themselves as BDR have higher precedence, so the code checks whether the current neighbor is announcing itself as BDR. The same is done for the current candidate. If both are announcing themselves as BDR, or neither is, if_elect() elects a new candidate. The helper function if_elect() compares two neighbors and returns the preferred one. In the other two cases no additional comparison is needed, as the next candidate is already known.

Code snip 6: DR election
/* elect designated router */
LIST_FOREACH(nbr, &iface->nbr_list, entry) {
	if (nbr->priority == 0 ||
	    nbr->state & NBR_STA_PRELIM ||
	    (nbr != dr &&
	    nbr->dr.s_addr != nbr->addr.s_addr))
		/* only DR may be elected check priority too */
		continue;
	if (dr == NULL)
		dr = nbr;
	else
		dr = if_elect(dr, nbr);
}
if (dr == NULL) {
	/* no designate router found use backup DR */
	dr = bdr;
	bdr = NULL;
}

Almost the same process is used for electing a DR. Neighbors that are not in state 2-WAY or higher, or that have a priority of 0, are skipped again. Additionally all neighbors that don't announce themselves as DR are skipped as well, with the only exception of the current DR itself. This is done because the election process can be restarted with the current candidates. If no DR was elected, the current BDR is promoted to DR. If the router itself is involved in the election, it has to redo the election.

Code snip 7: final step of election
/*
 * if we are involved in the election (e.g. new DR or no
 * longer BDR) redo the election
 */
if (round == 0 &&
    ((iface->self == dr && iface->self != iface->dr) ||
    (iface->self != dr && iface->self == iface->dr) ||
    (iface->self == bdr && iface->self != iface->bdr) ||
    (iface->self != bdr && iface->self == iface->bdr))) {
	/*
	 * Reset announced DR/BDR to calculated one, so
	 * that we may get elected in the second round.
	 * This is needed to drop from a DR to a BDR.
	 */
	iface->self->dr.s_addr = dr->addr.s_addr;
	if (bdr)
		iface->self->bdr.s_addr = bdr->addr.s_addr;
	round = 1;
	goto start;
}

Before redoing the election, the current candidates are written into the router's own structure so that the second round actually changes the outcome. It is quite possible that some checks are unnecessary or too complex, but the current implementation seems to behave correctly and so we keep it as is.
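The BDR selection rules of Code snip 5 can be exercised outside the daemon with a small model. The sketch below assumes the RFC 2328 tie-break for the comparison helper (higher router priority first, then higher router ID); the paper only states that if_elect() returns the preferred neighbor, so struct cand, elect_pref() and elect_bdr() are hypothetical stand-ins, and addresses are kept in host byte order for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified neighbor record (host byte order). */
struct cand {
	uint32_t id;		/* router ID */
	uint32_t addr;		/* interface address */
	uint32_t dr;		/* DR announced in the neighbor's Hello */
	uint32_t bdr;		/* BDR announced in the neighbor's Hello */
	uint8_t	 priority;	/* 0 means not electable */
	int	 bidir;		/* neighbor state is 2-WAY or higher */
};

/* Return the preferred candidate: higher priority wins, a tie is
 * broken by the higher router ID (assumed RFC 2328 rule). */
static struct cand *
elect_pref(struct cand *a, struct cand *b)
{
	assert(a != NULL && b != NULL);
	if (a->priority != b->priority)
		return (a->priority > b->priority ? a : b);
	return (a->id > b->id ? a : b);
}

/* Select a BDR among n candidates, mirroring the structure of
 * Code snip 5: skip unelectable routers, routers announcing
 * themselves as DR and the given DR; self-declared BDRs win. */
static struct cand *
elect_bdr(struct cand *c, size_t n, const struct cand *dr)
{
	struct cand *bdr = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct cand *nbr = &c[i];
		int nbr_claims, bdr_claims;

		if (nbr->priority == 0 || !nbr->bidir ||
		    nbr->dr == nbr->addr || nbr == dr)
			continue;
		if (bdr == NULL) {
			bdr = nbr;
			continue;
		}
		nbr_claims = (nbr->bdr == nbr->addr);
		bdr_claims = (bdr->bdr == bdr->addr);
		if (nbr_claims == bdr_claims)
			bdr = elect_pref(bdr, nbr);	/* both or neither */
		else if (nbr_claims)
			bdr = nbr;	/* self-declared BDR takes precedence */
		/* else keep the current self-declared candidate */
	}
	return (bdr);
}
```

Comparing the two "claims BDR" flags side by side makes the four cases of the original if/else visible: equal flags go to the tie-break, otherwise the self-declared BDR wins even with a lower priority.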
After the election process a bit of housekeeping has to be performed. If the DR or BDR changed, all neighbors have to be checked to see whether the adjacency is still OK. Additionally it may be necessary to join or leave the AllDRouters multicast group. In case the router was or is now the DR, an updated network-LSA needs to be reflooded.

Getting the DR/BDR election right was one of the most difficult parts of the development. Unexpected behaviours were often found because of small mistakes here and in recv_hello(). It took multiple retries and many debugging sessions to get that code where it is now. The poorly written RFC doesn't help much in clarifying the issues.

4.3.2 Neighbor state machine

[Figure 11: Neighbor FSM]

DOWN

A neighbor is considered down if no hello has been received for more than router-dead-time seconds. This is also the initial state of a neighbor.

ATTEMPT

This state is only valid for neighbors attached to NBMA networks. Therefore it is currently unused.

INIT

In this state, a Hello packet has recently been seen from the neighbor. However, bidirectional communication has not yet been established.

2-WAY

The communication between the neighbor and the router is bidirectional. Neighbors will remain in this state if both the router itself and the neighbor are neither DR nor BDR.

EXSTART

This is the first step in creating an adjacency between the two routers. In this state the initial DD sequence number and the master are selected for the upcoming database exchange phase.

SNAPSHOT

This state is actually an extension of the state machine defined by the RFC. Because the LS database is stored in the RDE, a current snapshot of all LSA headers has to be requested by the OSPF engine. The database exchange will start after the snapshot is done.

EXCHANGE

This is the database exchange phase. Additionally all neighbors in state EXCHANGE or higher (LOADING, FULL) participate in the flooding procedure. Starting from this state all packet types can be received, including flooded LS updates.

LOADING

This state is only entered if the Link-State Request list is not empty. In that case Link-State Request packets are sent out to fetch the more recent LSAs from the neighbor's LS database.

FULL

The two routers are now fully adjacent. The connection will now appear in router-LSAs and network-LSAs. Only in this state is real traffic routed between the two routers.

4.3.3 Packet reception

The OSPF engine uses the recv_packet() libevent handler to receive packets from the raw IP socket. The packet is validated via ip_hdr_sanity_check() and ospf_hdr_sanity_check(). Some additional length checks are done to ensure that nothing outside of the packet is accessed. In OpenBSD 3.8 it is currently not possible to get the incoming interface via recvfrom(2), so we need to find the interface the hard way. find_iface() does this job by walking through all configured interfaces and comparing the source address of the incoming packet with the interface address. This is not optimal and will be changed soon. The next step is looking up the neighbor, and afterwards the OSPF authentication is run. nbr_find_id() takes the unique router ID to get the neighbor structure with all the information needed. This is done before auth_validate() because the cryptographic authentication method uses a per-neighbor sequence number to protect against replay attacks. If necessary auth_validate() does the CRC
checksumming of the packet. Finally the packet is passed on, according to its packet type, to one of the following functions.

recv_hello()

Every hello-interval seconds a Hello packet is sent to all neighbors. On broadcast networks this is done with one multicast packet. The Hello packet is used for neighbor discovery and to maintain neighbor relationships. As a first step all the common options need to be compared. If any of hello-interval, router-dead-time, or the stub area flag differs, the packet is not accepted. So all routers on a common network must have the same configuration for these values.

Code snip 8: neighbor look up
switch (iface->type) {
case IF_TYPE_POINTOPOINT:
case IF_TYPE_VIRTUALLINK:
	/* match router-id */
	LIST_FOREACH(nbr, &iface->nbr_list, entry) {
		if (nbr == iface->self)
			continue;
		if (nbr->id.s_addr == rtr_id)
			break;
	}
	break;
case IF_TYPE_BROADCAST:
case IF_TYPE_NBMA:
case IF_TYPE_POINTOMULTIPOINT:
	/* match src IP */
	LIST_FOREACH(nbr, &iface->nbr_list, entry) {
		if (nbr == iface->self)
			continue;
		if (nbr->addr.s_addr == src.s_addr)
			break;
	}
	break;
default:
	fatalx("recv_hello: unknown interface type");
}

if (!nbr) {
	nbr = nbr_new(rtr_id, iface, 0);
	/* set neighbor parameters */
	nbr->dr.s_addr = hello.d_rtr;
	nbr->bdr.s_addr = hello.bd_rtr;
	nbr->priority = hello.rtr_priority;
	nbr_change = 1;
}

The packet is now accepted and the neighbor is looked up, either by router ID or by interface address depending on the interface type. If no neighbor can be found, a new one is created. A new neighbor is considered a NEIGHBORCHANGE, and the nbr_change flag is set so that an interface neighbor change event can be issued later.

Code snip 9: bidirectional or not
nbr_fsm(nbr, NBR_EVT_HELLO_RCVD);

while (len >= sizeof(nbr_id)) {
	memcpy(&nbr_id, buf, sizeof(nbr_id));
	if (nbr_id == ospfe_router_id()) {
		/* seen myself */
		if (nbr->state & NBR_STA_PRELIM)
			nbr_fsm(nbr, NBR_EVT_2_WAY_RCVD);
		break;
	}
	buf += sizeof(nbr_id);
	len -= sizeof(nbr_id);
}

if (len == 0) {
	nbr_fsm(nbr, NBR_EVT_1_WAY_RCVD);
	/* set neighbor parameters */
	nbr->dr.s_addr = hello.d_rtr;
	nbr->bdr.s_addr = hello.bd_rtr;
	nbr->priority = hello.rtr_priority;
	return;
}

Multiple neighbor events have to be generated. First of all there is the hello received event. Next it is checked whether there is already bidirectional communication between the routers. This is done by walking through the list of neighbors in the Hello packet and comparing each entry with the router's own ID. If no match is found, a 1-WAY received event is issued. If a match is found for the first time (the neighbor is in an embryonic state like INIT), a 2-WAY received event is generated.

Now comes the scariest part of OpenOSPFD: handling fast start-ups and the famous interface event BACKUPSEEN. This part of the Hello protocol was rewritten multiple times, and each time the result was some other obscure problem in the election process. In the end OpenOSPFD had to violate the RFC a bit. The RFC is not very clear about how to handle the event BACKUPSEEN correctly.

From the RFC:

• If the neighbor is both declaring itself to be Designated Router (Hello Packet's Designated Router field = Neighbor IP address) and the Backup Designated Router field in the packet is equal to 0.0.0.0 and the receiving interface is in state Waiting, the receiving interface's state machine is scheduled with the event BACKUPSEEN. …

• If the neighbor is declaring itself to be Backup Designated Router (Hello Packet's Backup Designated Router field = Neighbor IP address) and the receiving interface is in state Waiting, the receiving interface's state machine is scheduled with the event BACKUPSEEN. …

This sounds simple but it isn't. The first case is not problematic but the second one is. Why? Because the order in which Hello packets are received is not known. What happens if we start an election process while the actual DR neighbor is still in state 1-WAY? Major confusion is the result. The election process may select the BDR as DR and the router itself as BDR, or something similar, and the result is a network with too many DR/BDR routers.

Code snip 10: scary fast start-ups
if (iface->state & IF_STA_WAITING &&
    hello.d_rtr == nbr->addr.s_addr && hello.bd_rtr == 0)
	if_fsm(iface, IF_EVT_BACKUP_SEEN);

if (iface->state & IF_STA_WAITING &&
    hello.bd_rtr == nbr->addr.s_addr) {
	/*
	 * In case we see the BDR make sure that the DR is
	 * around with a bidirectional connection
	 */
	LIST_FOREACH(dr, &iface->nbr_list, entry)
		if (hello.d_rtr == dr->addr.s_addr &&
		    dr->state & NBR_STA_BIDIR)
			if_fsm(iface, IF_EVT_BACKUP_SEEN);
}
To clear up the situation OpenOSPFD does an additional check. It verifies that the DR has a bidirectional connection to the router, and only if that is true is a backup seen event issued. As a result it may take a bit longer to establish an adjacency, and some initial Database Description packets are dropped. But the confusion of too many DR/BDRs is avoided. The rest of recv_hello() simply issues the possible neighbor change events that were detected earlier.

recv_db_description()

While the send_db_description() function ended up pretty simple, recv_db_description() turned out to be more problematic. The usual sanity checking is done first. Afterwards additional checks are performed to verify the MTU and to detect possible duplicates caused by retransmissions. The MTU check is required by the RFC. The problem is that some OSPF implementations lie about their MTU, so only larger MTUs are considered a problem.

The code path depends on the neighbor state. Packets received from neighbors in unexpected states are simply ignored. This includes state SNAPSHOT, because while the LSA snapshot is being taken we cannot respond to a received packet. Funnily enough, Database Description packets may be received in state INIT. In that case some kind of super fast start-up needs to be done. It looks like it was simpler to fix the RFC than to fix someone's OSPF implementation. So both the interface and the neighbor FSM are kicked, and afterwards the new neighbor state has to be checked again. If it is now in state EXSTART, a fall-through into the next case can be done.

In state EXSTART there are two possible scenarios. The first is the reception of a Christmas packet, one with all flags turned on. This is the initial packet, and OpenOSPFD has to evaluate whether it is master or slave of the database exchange phase. The slave will issue a negotiation done event and send back a packet with just the M bit set.

Code snip 11: EXSTART scenario 1
/*
 * check bits: either I,M,MS or only M
 */
if (dd_hdr.bits == (OSPF_DBD_I | OSPF_DBD_M |
    OSPF_DBD_MS)) {
	/* if nbr Router ID is larger than own -> slave */
	if ((ntohl(nbr->id.s_addr)) >
	    ntohl(ospfe_router_id())) {
		/* slave */
		nbr->master = 0;
		nbr->dd_seq_num = ntohl(dd_hdr.dd_seq_num);

		/* event negotiation done */
		nbr_fsm(nbr, NBR_EVT_NEG_DONE);
	}

In the second scenario a packet with just the M bit set is received. The M bit stands for "more", as in more data. The master will finally issue the negotiation done event. So the slave is actually sending valid data ahead of the master. This is a bit strange but we are used to it.

Code snip 12: EXSTART scenario 2
} else if (!(dd_hdr.bits & (OSPF_DBD_I | OSPF_DBD_MS))) {
	/* M only case: we are master */
	if (ntohl(dd_hdr.dd_seq_num) != nbr->dd_seq_num) {
		log_warnx("recv_db_description: invalid "
		    "seq num, mine %x his %x",
		    nbr->dd_seq_num,
		    ntohl(dd_hdr.dd_seq_num));
		nbr_fsm(nbr, NBR_EVT_SEQ_NUM_MIS);
		return;
	}
	nbr->dd_seq_num++;

	/* packet may already have data so pass it on */
	if (len > 0) {
		nbr->dd_pending++;
		ospfe_imsg_compose_rde(IMSG_DD,
		    nbr->peerid, 0, buf, len);
	}

	/* event negotiation done */
	nbr_fsm(nbr, NBR_EVT_NEG_DONE);
}

Afterwards the actual transfer starts or continues. First of all, packets with invalid flags and options result in a reset of the session (sequence number mismatch event). If the slave receives a duplicate packet, it has to resend the last packet. The master does not care about duplicate packets. Actually the master should never see a duplicate, because the slave never sends a packet on its own. If the neighbor state is either LOADING or FULL, the only packets received should be duplicates. Anything else is considered an error and the session is reset. A side effect of this is that sending a packet with the Initialise (I) bit set can be used to reset a neighbor relationship. Next the sequence number is checked. Only the master increases the number, so the slave receives packets with the current sequence number plus one. On the master the sequence numbers are equal on receive, and afterwards the sequence number is increased. Our first implementation was a bit buggy, and it took some debugging to find all the small issues, like forgetting to bump the sequence number in a specific case.

Code snip 13: synchronising part 1
/* forward to RDE and let it decide which LSAs to request */
if (len > 0) {
	nbr->dd_pending++;
	ospfe_imsg_compose_rde(IMSG_DD, nbr->peerid, 0,
	    buf, len);
}

The received LSA headers have to be sent to the RDE, where they are compared with the LS database. This resulted in an interesting issue: if the RDE was busy, the OSPF engine could move forward, conclude that no LSAs have to be requested and move the neighbor directly into state FULL. Afterwards the RDE would send some LSAs to request to the OSPF engine, but it was too late. To solve this race condition the dd_pending counter was added. It is increased for each Database Description packet passed on to the RDE.
Code snip 14: synchronising part 2
/* in ospfe_dispatch_rde() */
nbr->dd_pending--;
if (nbr->dd_pending == 0 && nbr->state & NBR_STA_LOAD) {
	if (ls_req_list_empty(nbr))
		nbr_fsm(nbr, NBR_EVT_LOAD_DONE);
	else
		start_ls_req_tx_timer(nbr);
}

When an IMSG_DD_END message arrives from the RDE, the counter is decremented. If the counter drops to zero, no DD packets are pending. If the neighbor state is now LOADING, we actually hit the race condition, so we have to either move to state FULL if the request list is empty or start sending out LS requests. Running a single daemon as three processes sometimes needs some additional work to synchronise the processes, and this is a nice example. Finally the next packet is prepared for sending by send_db_description(). If there is nothing left to send and the received packet has no M bit set, then the exchange phase is mostly done. The slave is finished, but the master has to ensure that at least one packet without the M bit has been sent and acknowledged. As a result the slave always changes state before the master. Why should the end of the exchange be less strange than the beginning?

recv_ls_req()

Link-State Requests are simply passed to the RDE, but only if the neighbor state is EXCHANGE or higher. In all other states Link-State Request packets are ignored.

recv_ls_update()

Link-State Updates are simply dropped if the neighbor is not in state EXCHANGE or higher. Otherwise all LSAs are extracted from the packet and sent to the RDE one after the other. While doing that, additional length checks are done to guard against buffer overflows.

recv_ls_ack()

Link-State Acknowledgements are only accepted in neighbor state EXCHANGE or higher. Otherwise the packet is dropped. Every LSA header included in the packet is roughly validated with lsa_hdr_check() and then possibly deleted from the retransmission list. If the interface is in state DROTHER, ls_retrans_list_del() is called twice. First it deletes LSAs from the global retransmission list of updates sent to the AllDRouters multicast address. Second the per-neighbor queue is purged, in case the interface state changed recently.

4.3.4 Packet delivery

send_hello()

send_hello() is called by the if_hello_timer() function, which runs every hello-interval seconds if an interface is not in state DOWN. Sending hellos is pretty simple, so it is a good example of how the buffer framework is used in OpenOSPFD.

Code snip 15: Allocate dynamic buffer
/* XXX READ_BUF_SIZE */
if ((buf = buf_dynamic(PKG_DEF_SIZE,
    READ_BUF_SIZE)) == NULL)
	fatal("send_hello");

First a dynamic buffer is allocated. Currently an initial size of PKG_DEF_SIZE bytes is used, but the buffer is allowed to grow up to READ_BUF_SIZE. This is not optimal, as OSPF packets should not be fragmented. For Hello packets this is not a big issue because the embedded data is usually very small. Other send functions use a different approach and limit the resulting packet size to the MTU of the corresponding interface.

Code snip 16: Set correct destination
dst.sin_family = AF_INET;
dst.sin_len = sizeof(struct sockaddr_in);

switch (iface->type) {
case IF_TYPE_POINTOPOINT:
case IF_TYPE_BROADCAST:
	inet_aton(AllSPFRouters, &dst.sin_addr);
	break;
case IF_TYPE_NBMA:
case IF_TYPE_POINTOMULTIPOINT:
	/* XXX not supported */
	break;
case IF_TYPE_VIRTUALLINK:
	dst.sin_addr = iface->dst;
	break;
default:
	fatalx("send_hello: unknown interface type");
}

The outgoing address needs to be determined. For broadcast and point-to-point networks this is the multicast address AllSPFRouters. Virtual links are sent as unicast. NBMA and point-to-multipoint are special and currently not supported. For NBMA and point-to-multipoint the packet has to be sent to all neighbors directly, and send_packet() would be called once for every neighbor.

Code snip 17: create Hello packet
/* OSPF header */
if (gen_ospf_hdr(buf, iface, PACKET_TYPE_HELLO))
	goto fail;

/* hello header */
hello.mask = iface->mask.s_addr;
hello.hello_interval = htons(iface->hello_interval);
hello.opts = oeconf->options;
hello.rtr_priority = iface->priority;
hello.rtr_dead_interval = htonl(iface->dead_interval);

if (iface->dr) {
	hello.d_rtr = iface->dr->addr.s_addr;
	iface->self->dr.s_addr = iface->dr->addr.s_addr;
} else
	hello.d_rtr = 0;
if (iface->bdr) {
	hello.bd_rtr = iface->bdr->addr.s_addr;
	iface->self->bdr.s_addr = iface->bdr->addr.s_addr;
} else
	hello.bd_rtr = 0;
if (buf_add(buf, &hello, sizeof(hello)))
	goto fail;

Finally the packet is constructed. First of all the common OSPF header is added. This is done for every packet type, so a helper function gen_ospf_hdr() is used. The Hello-specific contents are filled in afterwards and added with buf_add().

Code snip 18: Add active neighbors
/* active neighbor(s) */
LIST_FOREACH(nbr, &iface->nbr_list, entry) {
	if ((nbr->state >= NBR_STA_INIT) &&
	    (nbr != iface->self))
		if (buf_add(buf, &nbr->id,
		    sizeof(nbr->id)))
			goto fail;
}

The Hello packets include a list of all neighbors from which a Hello has recently been seen (state INIT or higher, as the check on NBR_STA_INIT shows). Again the neighbor IDs are added directly with buf_add(). The neighbor ID is already stored in network byte order, so no htonl() conversion is needed.

Code snip 19: Final step
/* update authentication and calculate checksum */
if (auth_gen(buf, iface))
	goto fail;

ret = send_packet(iface, buf->buf, buf->wpos, &dst);

buf_free(buf);
return (ret);
fail:
	log_warn("send_hello");
	buf_free(buf);
	return (-1);

The last step is updating the authentication and checksum of the outgoing packet. The interface pointer is passed to auth_gen() to get the necessary keys and sequence number for the simple and cryptographic authentication. The packet is sent out via send_packet(). Before sending the packet it is necessary to set the outgoing interface for multicast traffic. This is done by if_set_mcast() inside send_packet(). Finally the buffer, which is no longer needed, is freed.

send_db_description()

send_db_description() implements the sending part of the database exchange. It sends out the initial Database Description packet when the neighbor moves to state EXSTART.

Code snip 20: Allocate fixed buffer
if ((buf = buf_open(nbr->iface->mtu - sizeof(struct ip)))
    == NULL)
	fatal("send_db_description");

/* OSPF header */
if (gen_ospf_hdr(buf, nbr->iface, PACKET_TYPE_DD))
	goto fail;

/* reserve space for database description header */
if (buf_reserve(buf, sizeof(dd_hdr)) == NULL)
	goto fail;

The obvious difference from send_hello() is the use of buf_open() instead of buf_dynamic(). buf_open() allocates a fixed-size buffer of nbr->iface->mtu - sizeof(struct ip) bytes, which is the maximum packet size that does not get fragmented. Later buf_reserve() is used on that buffer to reserve sizeof(dd_hdr) bytes. The rest of the packet can then be added, and buf_seek() can later be used to write into the reserved space like this:

Code snip 21: Usage of buf_seek()
memcpy(buf_seek(buf, sizeof(struct ospf_hdr),
    sizeof(dd_hdr)), &dd_hdr, sizeof(dd_hdr));

The remainder of the function sets up the Database Description header with its bit fields and sequence number. If in state EXCHANGE, as many LSA headers as possible are appended. While appending LSA headers one must keep in mind that the cryptographic authentication will append MD5_DIGEST_LENGTH bytes to the end of the packet.

send_ls_req()

Like send_db_description(), send_ls_req() uses buf_open() to get a buffer that doesn't get fragmented. While filling in the requested LSA headers, some additional space is reserved for the possible MD5 sum.

Code snip 22: Filling packet with requests
/* LSA header(s), keep space for a possible md5 sum */
for (le = TAILQ_FIRST(&nbr->ls_req_list); le != NULL &&
    buf->wpos + sizeof(struct ls_req_hdr) < buf->max -
    MD5_DIGEST_LENGTH; le = nle) {
	nbr->ls_req = nle = TAILQ_NEXT(le, entry);
	ls_req_hdr.type = htonl(le->le_lsa->type);
	ls_req_hdr.ls_id = le->le_lsa->ls_id;
	ls_req_hdr.adv_rtr = le->le_lsa->adv_rtr;
	if (buf_add(buf, &ls_req_hdr, sizeof(ls_req_hdr)))
		goto fail;
}

The rest is straightforward and mostly the same as in send_hello().

send_ls_ack()

Here we actually have to start in ls_ack_tx_timer(), because send_ls_ack() is just the last step of sending out an ack. send_ls_ack() adds the common OSPF header and appends the data passed to the function to the packet. The list of acknowledgements is created by ls_ack_tx_timer() in a not so nice way, and therefore it should not be used as an example for other code, especially as it will be rewritten soon.

send_ls_update()

Sending out LS updates is easy, but the retransmission list and the flooding procedure are a bit tricky. send_ls_update() just adds an LSA to a buffer together with a common OSPF header and sends the
result out. But one thing must be done with the LSA first: it has to be aged by the value of transmit-delay.

Code snip 23: LSA aging
pos = buf->wpos;
if (buf_add(buf, data, len))
	goto fail;

/* age LSA before sending it out */
memcpy(&age, data, sizeof(age));
age = ntohs(age);
if ((age += iface->transmit_delay) >= MAX_AGE)
	age = MAX_AGE;
age = htons(age);
memcpy(buf_seek(buf, pos, sizeof(age)), &age, sizeof(age));

First the current write position is stored and the LSA is added to the buffer. The LS age is stored in the first two bytes of the LSA. The memcpy() extracts the age because a direct memory access could end up on unaligned memory. Then the LSA is aged and written back into the buffer with the help of buf_seek() and the previously stored position.

4.3.5 Control handling

The handling of control sessions is actually a small UNIX local socket server. There is a listener event (control_listen()) that accepts (control_accept()) connections and creates a per-connection structure. control_dispatch_imsg() reads the request from ospfctl. First the per-connection structure is retrieved, and then the imsgs sent are extracted. They are either forwarded to the parent or the RDE, or answered directly. Messages forwarded to the other processes often require a response that needs to be relayed to ospfctl, because neither the RDE nor the parent process has access to the socket. Relaying is done by control_imsg_relay(). It has to be called for those imsgs that need to be forwarded. This is done in the imsg dispatch functions ospfe_dispatch_main() and ospfe_dispatch_rde().

4.4 Route Decision Engine

4.4.1 LS Database

The LS database is implemented as red-black trees; there is one tree per area and a global one for AS-external-LSAs. The key is the (LS type, LS ID, advertising router) triple. The LSA is part of a vertex that forms a node of the network connectivity graph.

Code snip 24: struct vertex
struct vertex {
	RB_ENTRY(vertex)	 entry;
	TAILQ_ENTRY(vertex)	 cand;
	struct event		 ev;
	struct in_addr		 nexthop;
	struct vertex		*prev;
	struct rde_nbr		*nbr;
	struct lsa		*lsa;
	time_t			 changed;
	time_t			 stamp;
	u_int32_t		 cost;
	u_int32_t		 ls_id;
	u_int32_t		 adv_rtr;
	u_int8_t		 type;
	u_int8_t		 flooded;
};

The vertex contains all the information necessary not only for the LS database but for the SPF calculation too. entry and cand are used to put the vertex into the red-black tree or into the candidate list, respectively. The event ev provides a per-LSA timeout used for aging; stamp is used for aging as well. changed is set to the time of the last modification of the LSA. ls_id, adv_rtr and type are shorthands for the actual values stored inside lsa; they are used by the tree search routine. The flooded flag indicates that an LSA was received as part of a flooding. Flooded LSAs are locked for MIN_LS_ARRIVAL seconds whereas requested LSAs are not. nbr represents the neighbor from which the LSA was received; it has nothing to do with the actual originator of the LSA. It is only kept to correctly flood out LSAs and to send an acknowledgement back to the neighbor. prev is the parent vertex in the SPF tree. The actual path through the network can be constructed by following all prev pointers. This is used to calculate the nexthop, the address for forwarding packets to that destination. It is normally the address of the last router-LSA before the root node.

4.4.2 LSA aging

Before an LSA in the database is used it normally needs to be aged. This is done by lsa_age() with the help of the vertex time stamp.

Code snip 25: LSA aging
now = time(NULL);
d = now - v->stamp;
/* set stamp so that at least new calls work */
v->stamp = now;

if (d < 0) {
	log_warnx("lsa_age: time went backwards");
	return;
}

age = ntohs(v->lsa->hdr.age);
if (age + d > MAX_AGE)
	age = MAX_AGE;
else
	age += d;

v->lsa->hdr.age = htons(age);

Normally it is enough to just add the difference between the current time and stamp. Nonetheless some additional care is needed. First of all, time() returns the system time, and this can be modified by the user. I remember a complete network outage at an ISP because the UNIX time was changed on a Zebra/Quagga router. Afterwards Zebra/Quagga no longer worked until the changed machines were rebooted. So by checking whether the difference is positive it is at least possible to fail in a safe way. The other case that needs to be considered is that an LSA may never get older than MAX_AGE (1 hour).
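Because lsa_age() reads the clock itself, its two edge cases are easier to see in a pure function that takes the current time as a parameter. The following is a minimal sketch under that assumption; lsa_age_at() is a hypothetical name, and MAX_AGE is set to 3600 seconds to match the one hour stated above. Unlike Code snip 25 it does not update the stamp; the caller is expected to store now as the new stamp, as lsa_age() does before its checks.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define MAX_AGE	3600	/* seconds (1 hour) */

/*
 * Hypothetical pure variant of lsa_age(): return the LS age (host byte
 * order) that an LSA stored with 'age' at time 'stamp' has at 'now'.
 */
static uint16_t
lsa_age_at(uint16_t age, time_t stamp, time_t now)
{
	time_t d = now - stamp;

	assert(age <= MAX_AGE);		/* precondition of this sketch */
	if (d < 0)			/* clock went backwards: keep age */
		return (age);
	if ((time_t)age + d > MAX_AGE)	/* never older than MAX_AGE */
		return (MAX_AGE);
	return ((uint16_t)(age + d));
}
```

For example, lsa_age_at(3590, 0, 60) yields 3600 rather than 3650, and a negative time difference leaves the age untouched, which is the fail-safe behaviour described above.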
4.4.3 Comparing LSA

There are two functions to compare LSAs. lsa_equal() is similar to memcmp() but compares a bit more. One thing is important to note: LSAs with age MAX_AGE are never considered equal. This is because lsa_equal() is mostly used to determine whether a recalculation of the SPF tree is required, or in similar situations. In that context LSAs with an age of MAX_AGE are always special and it is OK to force an update.

The other compare function is lsa_newer(), which implements the RFC definition of newer, equal and older LSAs. Like other compare functions it returns -1 if the first LSA is older, 1 if it is newer and 0 if it is equal to the second LSA passed. The function compares the sequence number, the LS checksum and the LS age, in that order. Once again a bit of care needs to be taken when comparing ages.

Code snip 26: Comparing ages
    a16 = ntohs(a->age);
    b16 = ntohs(b->age);
    if (a16 >= MAX_AGE && b16 >= MAX_AGE)
        return (0);
    if (b16 >= MAX_AGE)
        return (-1);
    if (a16 >= MAX_AGE)
        return (1);

    i = b16 - a16;
    if (abs(i) > MAX_AGE_DIFF)
        return (i > 0 ? 1 : -1);

    return (0);

If both LSAs have age MAX_AGE they are considered equal. If only one has age MAX_AGE, that one is newer. Last but not least, the LS ages have to differ by more than MAX_AGE_DIFF (15 minutes) for the LSAs not to be considered equal. In that case the younger LSA is the newer one.

4.4.4 LSA refresh

Every LS_REFRESH_TIME seconds an LSA needs to be refreshed by its originator. The age is reset to the initial value and the sequence number is bumped. After modifying the LSA the checksum has to be recalculated. The LSA is flooded and a new timeout event is registered. LSAs that are not self-originated have the same timer running, but with MAX_AGE instead of LS_REFRESH_TIME. If that timer fires, the LSA is deleted from the LS database by flooding it out with age MAX_AGE. How LSAs are deleted will be explained later, as it is fairly complex.

4.4.5 LSA merging

If a self-originated LSA changes, for example because a neighbor relationship is established or lost, an updated LSA needs to be reflooded. lsa_merge() takes care of replacing the LSA in the database with the new one. It also sets the LS sequence number of the new LSA to the currently used number.

Code snip 27: First set sequence number
    if (v == NULL) {
        lsa_add(nbr, lsa);
        rde_imsg_compose_ospfe(IMSG_LS_FLOOD, nbr->peerid,
            0, lsa, ntohs(lsa->hdr.len));
        return;
    }

    /*
     * set the seq_num to the current one.
     * lsa_refresh() will do the ++
     */
    lsa->hdr.seq_num = v->lsa->hdr.seq_num;
    /* recalculate checksum */
    len = ntohs(lsa->hdr.len);
    lsa->hdr.ls_chksum = 0;
    lsa->hdr.ls_chksum = htons(iso_cksum(lsa, len,
        LS_CKSUM_OFFSET));

Of course, if there was no LSA in the database in the first place there is nothing to merge. It is enough to just add and flood the LSA. When the sequence number is changed, the checksum has to be recalculated. The sequence number is only set to the current value because there is no need to increase it yet. In particular, if lsa_merge() is used to remove a self-originated LSA from the database, there is no need to raise the sequence number. It is sufficient to set the age to MAX_AGE.

Code snip 28: Then overwrite and reflood if necessary
    /*
     * compare LSA; most header fields are equal
     * so don't check them
     */
    if (lsa_equal(lsa, v->lsa)) {
        free(lsa);
        return;
    }

    /* overwrite the lsa all other fields are unaffected */
    free(v->lsa);
    v->lsa = lsa;
    start_spf_timer();

    /* set correct timeout for reflooding the LSA */
    now = time(NULL);
    timerclear(&tv);
    if (v->changed + MIN_LS_INTERVAL >= now)
        tv.tv_sec = MIN_LS_INTERVAL;
    evtimer_add(&v->ev, &tv);

Now lsa_equal() is used to determine whether to actually reflood the LSA. If the LSA did not change there is nothing to modify and we're done. Otherwise the LSAs are exchanged and an SPF recalculation is scheduled. Finally the reflooding is prepared. This is done via a timer because updates may not be sent out more often than once every MIN_LS_INTERVAL (5 seconds). If the LSA was changed within the last MIN_LS_INTERVAL seconds, the timer is armed with MIN_LS_INTERVAL; otherwise it fires right away.

4.4.6 lsa_self()

Identifying self-originated LSAs is an important task. If a router leaves the network, the other routers will not remove its LSAs until their LS age hits MAX_AGE. If the router joins the network again (after a reboot, for example) the old LSAs are still floating around. So it is the router's duty to detect those old self-originated LSAs and either renew them or remove them from the database. This task is done by lsa_self().
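4.4.4 and Code snip 27 both require the LS checksum to be recalculated after the header changes, and 4.4.7 verifies it on input. The paper does not show iso_cksum() itself. The following is a sketch of the Fletcher-style ISO checksum that OSPF uses for LSAs, with the LS age field excluded. It is written from the general algorithm rather than taken from OpenOSPFD's source. The names lsa_cksum_sketch() and lsa_cksum_ok() are hypothetical. As in the call in Code snip 27, the offset is the byte position of the checksum field counted from the start of the LSA. The field must be zero while the checksum is computed.

```c
#include <stddef.h>
#include <stdint.h>

#define LS_AGE_LEN	2	/* the LS age field is not checksummed */

/* Fletcher running sums over p[0..len-1], both kept modulo 255. */
static void
fletcher_sums(const uint8_t *p, size_t len, int32_t *c0, int32_t *c1)
{
	size_t	i;

	*c0 = *c1 = 0;
	for (i = 0; i < len; i++) {
		*c0 = (*c0 + p[i]) % 255;
		*c1 = (*c1 + *c0) % 255;
	}
}

/*
 * Compute the checksum of an LSA of len bytes whose 16-bit checksum
 * field starts at byte offset off (off >= 2, off + 2 <= len). The two
 * checksum bytes must be zero on entry. Returns X in the high byte and
 * Y in the low byte, ready to be stored at lsa[off] and lsa[off + 1].
 */
uint16_t
lsa_cksum_sketch(const uint8_t *lsa, size_t len, size_t off)
{
	size_t	 l = len - LS_AGE_LEN;		/* bytes covered */
	size_t	 n = off - LS_AGE_LEN + 1;	/* 1-based position of X */
	int32_t	 c0, c1, x, y;

	fletcher_sums(lsa + LS_AGE_LEN, l, &c0, &c1);

	/* choose X and Y so that both running sums end up at zero */
	x = ((int32_t)(l - n) * c0 - c1) % 255;
	if (x <= 0)
		x += 255;
	y = 510 - c0 - x;
	if (y > 255)
		y -= 255;
	return ((uint16_t)((x << 8) | y));
}

/* With a correct checksum in place both running sums are zero. */
int
lsa_cksum_ok(const uint8_t *lsa, size_t len)
{
	int32_t	c0, c1;

	fletcher_sums(lsa + LS_AGE_LEN, len - LS_AGE_LEN, &c0, &c1);
	return (c0 == 0 && c1 == 0);
}
```

For a standard LSA header the checksum occupies bytes 16 and 17. A caller would therefore zero those bytes, compute the value with an offset of 16, and store the high byte at byte 16 and the low byte at byte 17. Because the age field is skipped, aging an LSA never invalidates its checksum.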
Code snip 29: Detect self originated LSA
    if (nbr->self)
        return (0);

    if (rde_router_id() == new->hdr.adv_rtr)
        goto self;

    if (new->hdr.type == LSA_TYPE_NETWORK)
        LIST_FOREACH(iface, &nbr->area->iface_list, entry)
            if (iface->addr.s_addr == new->hdr.ls_id)
                goto self;

    return (0);

First of all the newly received LSA (new) gets classified. LSAs coming from the special self neighbor are skipped right away. If the advertising router equals our own router ID, or if an interface address matches the LS ID of a network-LSA, the LSA is considered self-originated.

Code snip 30: Remove or update
    self:
    if (v == NULL) {
        /*
         * LSA is no longer announced, remove by premature
         * aging. The problem is that new may not be
         * altered so a copy needs to be added to the LSA
         * DB first.
         */
        if ((dummy = malloc(ntohs(new->hdr.len))) == NULL)
            fatal("lsa_self");
        memcpy(dummy, new, ntohs(new->hdr.len));
        dummy->hdr.age = htons(MAX_AGE);
        /*
         * The clue is that by using the remote nbr as
         * originator the dummy LSA will be reflooded via
         * the default timeout handler.
         */
        lsa_add(rde_nbr_self(nbr->area), dummy);
        return (1);
    }

    /*
     * LSA is still originated, just reflood it. But we need to
     * create a new instance by setting the LSA sequence number
     * equal to the one of new and calling lsa_refresh().
     * Flooding will be done by the caller.
     */
    v->lsa->hdr.seq_num = new->hdr.seq_num;
    lsa_refresh(v);
    return (1);

For a self-originated LSA there are two cases. In the first case the LSA is no longer announced. Here a copy of the LSA gets added to the database with an LS age of MAX_AGE. The database code will then reflood the LSA as soon as possible and thereby remove it from the database. There is no other way of doing this, because removing LSAs is a complex task that only works if the LSA is in the database. The second case is much simpler. There is already a self-originated LSA in the local database, but its sequence number is lower than that of the received one. The sequence number is then set to the received one, as in the lsa_merge() case, and lsa_refresh() is called to bump it and flood the LSA.

4.4.7 LSA check

Before an LS update is even accepted, the embedded LSA has to be verified. Once again lengths are compared and, most importantly, the ISO checksum is verified. Additionally the LS age and sequence number are checked to be in a valid range. Per-LS-type checks follow the generic ones. It is verified that the packet has the right size for this type. It is also verified that values like the metric, a 24-bit value stored in a 32-bit integer, are in the correct range. AS-external-LSAs that are sent to stub areas get silently discarded. At the end the LS age is checked, and if it is MAX_AGE some special care needs to be taken.

Code snip 31: MAX_AGE handling
    if (lsa->hdr.age == htons(MAX_AGE) &&
        !nbr->self && lsa_find(area, lsa->hdr.type,
        lsa->hdr.ls_id, lsa->hdr.adv_rtr) == NULL &&
        !rde_nbr_loading(area)) {
        /*
         * if no neighbor in state Exchange or Loading
         * ack LSA but don't add it. Needs to be a direct
         * ack.
         */
        rde_imsg_compose_ospfe(IMSG_LS_ACK, nbr->peerid, 0,
            &lsa->hdr, sizeof(struct lsa_hdr));
        return (0);
    }

At first sight, if the LS age is MAX_AGE and the LSA is not in the database, there seems to be no need to add the LSA to the database at all. However, this is a fallacy, because some additional checks are required. The RFC states that no short-cuts are allowed if a neighbor is currently establishing an adjacency (state EXCHANGE or LOADING). Additionally, self-originated LSAs generated by the OSPF engine have to be passed through, which is why nbr->self is tested. If all conditions are met the LSA is not added. Instead only a direct acknowledgement is sent back.

4.4.8 Deleting LSA

Deleting something from a replicated distributed database is not a trivial task, especially if there is no LS remove packet type. Removing is done via the LS age: LSAs with LS age MAX_AGE are ready to be removed from the database. For OpenOSPFD removing LSAs is even more complicated. To remove an LSA, it first has to be reflooded, and all neighbors have to acknowledge its reception before it is removed from the database. In OpenOSPFD the database and the retransmission logic live in two different processes, so additional IPC is needed. The RDE may try to delete an LSA either because it exceeds MAX_AGE or because of premature aging, which is used to clean the database of LSAs that are no longer valid. In either case it simply sets the age to MAX_AGE and sends a flood request to the OSPF engine. The OSPF engine will then start the flooding procedure. The LSA is added to the LSA cache, and the different retransmission lists refer to the cached LSA. When the last reference to the cached object is dropped, the following happens:

Code snip 32: lsa_cache_put()
    void
    lsa_cache_put(struct lsa_ref *ref, struct nbr *nbr)
    {
        if (--ref->refcnt > 0)
            return;

        if (ntohs(ref->hdr.age) >= MAX_AGE)
            ospfe_imsg_compose_rde(IMSG_LS_MAXAGE,
                nbr->peerid, 0, ref->data,
                sizeof(struct lsa_hdr));

        free(ref->data);
        LIST_REMOVE(ref, entry);
        free(ref);
    }
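The release path in Code snip 32 follows the usual pattern for a shared, reference-counted object. Only the last user frees it, and only at that point is the database side told about an LSA that has reached MAX_AGE. Below is a reduced sketch of that pattern. It uses a hypothetical struct cache_ref and a notification callback that stands in for ospfe_imsg_compose_rde(). The hash-chain unlinking done by LIST_REMOVE() is left out.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_AGE	3600

/* Hypothetical, simplified cache entry (not OpenOSPFD's struct lsa_ref). */
struct cache_ref {
	uint16_t	 age;		/* host byte order for simplicity */
	int		 refcnt;
	void		*data;
};

/* Hypothetical hook standing in for the IMSG_LS_MAXAGE message. */
typedef void (*maxage_cb)(const struct cache_ref *);

/*
 * Drop one reference. Returns 1 if this was the last reference and the
 * entry was freed, 0 if other users still hold it.
 */
int
cache_put(struct cache_ref *ref, maxage_cb notify)
{
	if (--ref->refcnt > 0)
		return (0);

	/* last user is gone: let the database side delete the LSA */
	if (ref->age >= MAX_AGE && notify != NULL)
		notify(ref);

	free(ref->data);
	free(ref);
	return (1);
}
```

An entry queued on two retransmission lists therefore survives the first cache_put(). Only the second call triggers the MAX_AGE notification and the free.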
The LS age is compared with MAX_AGE, and if it has been reached an IMSG_LS_MAXAGE is sent back to the RDE. In the RDE the message is received and verified. If something is incorrect the RDE bails out with fatalx().

Code snip 33: IMSG_LS_MAXAGE handling
    case IMSG_LS_MAXAGE:
        nbr = rde_nbr_find(imsg.hdr.peerid);
        if (nbr == NULL)
            fatalx("rde_dispatch_imsg: "
                "neighbor does not exist");

        if (imsg.hdr.len != IMSG_HEADER_SIZE +
            sizeof(struct lsa_hdr))
            fatalx("invalid size of OE request");
        memcpy(&lsa_hdr, imsg.data, sizeof(lsa_hdr));

        if (rde_nbr_loading(nbr->area))
            break;

        v = lsa_find(nbr->area, lsa_hdr.type,
            lsa_hdr.ls_id, lsa_hdr.adv_rtr);
        if (v == NULL)
            db_hdr = NULL;
        else
            db_hdr = &v->lsa->hdr;

        /*
         * only delete LSA if the one in the db isn't newer
         */
        if (lsa_newer(db_hdr, &lsa_hdr) <= 0)
            lsa_del(nbr, &lsa_hdr);
        break;

If there is still a neighbor in state EXCHANGE or LOADING the LSA may not be removed, because that neighbor may request the LSA a bit later. Next the LSA is looked up in the database, and the database entry is compared with the LSA that should be removed. If the database entry is newer it is kept. Otherwise it is finally removed from the database and freed.

4.4.9 SPF and RIB calculation

The SPF calculation is still a large construction area. The code should be split up, as some steps are not necessary in all cases. Especially on ABRs this is not optimal and creates a lot of superfluous load. Worth knowing: RIB and FIB are terms from BGP that were inherited by OpenOSPFD. RIB is the Routing Information Base and FIB is the Forwarding Information Base. The FIB is more or less the kernel routing table and is stripped of unneeded ballast, whereas the RIB contains all the additional protocol-specific information.

To calculate the routing table three calculations are performed. First the SPF tree gets built. Then the local LSAs are added to the RIB, and finally the AS-external-LSAs are inserted.

Code snip 34: SPF calculation
    /* calculate SPF tree */
    do {
        /* loop links */
        for (i = 0; i < lsa_num_links(v); i++) {
            switch (v->type) {
            case LSA_TYPE_ROUTER:
                rtr_link = get_rtr_link(v, i);
                switch (rtr_link->type) {
                case LINK_TYPE_STUB_NET:
                    /* skip */
                    continue;
                case LINK_TYPE_POINTTOPOINT:
                case LINK_TYPE_VIRTUAL:
                    /* find router LSA */
                    w = lsa_find(area,
                        LSA_TYPE_ROUTER,
                        rtr_link->id,
                        rtr_link->id);
                    break;
                case LINK_TYPE_TRANSIT_NET:
                    /* find network LSA */
                    w = lsa_find_net(area,
                        rtr_link->id);
                    break;
                default:
                    fatalx("spf_calc: "
                        "invalid link type");
                }
                break;
            case LSA_TYPE_NETWORK:
                net_link = get_net_link(v, i);
                /* find router LSA */
                w = lsa_find(area, LSA_TYPE_ROUTER,
                    net_link->att_rtr,
                    net_link->att_rtr);
                break;
            default:
                fatalx("spf_calc: "
                    "invalid LSA type");
            }

            ...

            cand_list_add(w);
        }
        /* get next vertex */
        v = cand_list_pop();
        w = NULL;
    } while (v != NULL);

The loop starts at the root vertex and moves through the graph one vertex after another. After a vertex is selected, all vertices that are connected to it are extracted and added to the candidate list. Once all of them have been added, the candidate with the lowest cost is popped from the list and the loop starts over with this vertex. Before a vertex is added to the candidate list, it is verified that the connection is still valid.

Code snip 35: the three dots in the previous snippet
    if (w == NULL)
        continue;

    if (ntohs(w->lsa->hdr.age) == MAX_AGE)
        continue;
    if (!linked(w, v))
        continue;
    if (v->type == LSA_TYPE_ROUTER)
        d = v->cost + ntohs(rtr_link->metric);
    else
        d = v->cost;

    if (cand_list_present(w)) {
        if (d > w->cost)
            continue;
        if (d < w->cost) {
            w->cost = d;
            w->prev = v;
            calc_next_hop(w, v);
            /*
             * need to readd to candidate list
             * because the list is sorted
             */
            TAILQ_REMOVE(&cand_list, w, cand);
        }
    } else if (w->cost == LS_INFINITY && d < LS_INFINITY) {
        w->cost = d;
        w->prev = v;
        calc_next_hop(w, v);
    }

If no LSA was found for the next vertex (w is NULL) there is nothing to do. If the next vertex has an age of MAX_AGE it is no longer considered valid and is dropped. The connection between the two vertices has to be bidirectional and this is checked by
linked(). The next steps calculate the cost to the new vertex w. There is one important thing to note: only links into a network have a cost, while links from the network to a router have no cost. As a result, modifying the cost of an interface will often not change the incoming traffic flow. Only outgoing traffic may be rerouted due to the change. Before a vertex is added to the candidate list, it is necessary to check whether it is already on the list. If it is, the calculated cost is compared with the current one. Only if the new path is shorter than the currently selected one are the cost and the prev pointer modified and the nexthop recalculated. The vertex is then also removed from the candidate list and added back later to keep the list sorted. If the vertex is not on the candidate list and not yet part of the tree (its cost is still LS_INFINITY), the cost and prev pointer are initialised and the nexthop is calculated. Finally the new candidate is added to the list of candidates.

Now the RIB needs to be built. To start, the area-specific routes are added. First of all, all LSAs with LS age MAX_AGE, a cost of LS_INFINITY, or a zero nexthop address are skipped, because they are invalid. All valid network-LSAs are added to the RIB, and so are all router-LSAs of ABRs and ASBRs. Summary-LSAs are put into the RIB as well. On ABRs only those of area 0 are used, while on other routers there is no such limitation. A summary-LSA is only valid if its ABR was previously added to the RIB. The last step is adding the AS-external routes to the RIB. This is done only once and not for every area. As with summary-LSAs, AS-external-LSAs require a look-up of the ASBR, and if that router is not found the route is considered invalid. When the RIB is updated with rt_update(), an order of preference is retained. Intra-area routes (router- and network-LSAs) have the highest priority, inter-area routes (summary-LSAs) follow, and type 1 and type 2 AS-external routes have the lowest priority. So if a network is added multiple times, this order favours intra-area routes over inter-area or external routes.

4.5 Workflow

4.5.1 Flooding

The flooding and retransmission of LS updates is done entirely in the OSPF engine. The RDE sends an IMSG_LS_FLOOD imsg with the peer ID of the neighbor from which the update was initially received. The OSPF engine uses that information to flood the LS update out to all affected networks.

Code snip 36: flooding part 1
    ref = lsa_cache_add(imsg.data, l);

    if (lsa_hdr.type == LSA_TYPE_EXTERNAL) {
        /*
         * flood on all areas but stub areas and
         * virtual links
         */
        LIST_FOREACH(area, &oeconf->area_list, entry) {
            if (area->stub)
                continue;
            LIST_FOREACH(iface, &area->iface_list,
                entry) {
                noack += lsa_flood(iface, nbr,
                    &lsa_hdr, imsg.data, l);
            }
        }
    } else {
        /*
         * flood on all area interfaces on
         * area 0.0.0.0 include also virtual links.
         */
        area = nbr->iface->area;
        LIST_FOREACH(iface, &area->iface_list, entry) {
            noack += lsa_flood(iface, nbr,
                &lsa_hdr, imsg.data, l);
        }
    }

Before the flooding decision process starts, the LS update is added to the LSA cache. Later, if the LSA is added to different retransmission queues, only a reference to the LSA cache is kept. Depending on the LS type, the LSA must be flooded either to all areas (AS-external-LSAs) or only to the current area (all other LSAs). lsa_flood() does the interface-specific part of the flooding. More about that a bit later.

Code snip 37: flooding part 2
    /* remove from ls_req_list */
    le = ls_req_list_get(nbr, &lsa_hdr);
    if (!(nbr->state & NBR_STA_FULL) && le != NULL) {
        ls_req_list_free(nbr, le);
        /*
         * XXX no need to ack requested lsa
         * the problem is that the RFC is very
         * unclear about this.
         */
        noack = 1;
    }

    if (!noack && nbr->iface != NULL &&
        nbr->iface->self != nbr) {
        if (!(nbr->iface->state & IF_STA_BACKUP) ||
            nbr->iface->dr == nbr) {
            /* delayed ack */
            lhp = lsa_hdr_new();
            memcpy(lhp, &lsa_hdr, sizeof(*lhp));
            ls_ack_list_add(nbr->iface, lhp);
        }
    }

    lsa_cache_put(ref, nbr);
    break;

After the LSA has been flooded out on all affected interfaces, an acknowledgement has to be sent back to the initial sender of the LS update. In some cases there is no need to send an LS acknowledgement back. One of those cases is requested LSAs, since sending back an LSA ack for an explicitly requested LSA does not make much sense. However, the RFC is not very clear about this, so let's be prepared for some broken implementations out there. The last step adds the LSA to the LS acknowledgement list so that a possibly delayed acknowledgement can be sent back. This is only done if three conditions hold. An ack must be required, and the neighbor the ack is sent to must not be ourselves. In addition, if this router is the BDR, the update must have been received from the DR. Finally the acquired reference to the LSA is passed back. Reference counting makes careful programming a necessity to avoid missing a reference change somewhere.
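The acknowledgement decision at the end of Code snip 37 can be isolated into a pure predicate. The struct below is hypothetical and only collects the values the real code reads from nbr and nbr->iface. The predicate follows the rule visible in that snippet: a BDR acknowledges only updates that were received from the DR.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the interface state. */
enum if_role { ROLE_DROTHER, ROLE_BACKUP, ROLE_DR };

/* Hypothetical summary of the values Code snip 37 inspects. */
struct ack_input {
	bool		noack;		/* no ack needed, e.g. requested LSA */
	bool		has_iface;	/* nbr->iface != NULL */
	bool		from_self;	/* nbr->iface->self == nbr */
	enum if_role	role;		/* our state on that interface */
	bool		sender_is_dr;	/* nbr->iface->dr == nbr */
};

/* Return true if the LSA header should go onto the delayed-ack list. */
bool
want_delayed_ack(const struct ack_input *in)
{
	if (in->noack || !in->has_iface || in->from_self)
		return (false);
	/* a BDR only acknowledges updates that came from the DR */
	if (in->role == ROLE_BACKUP)
		return (in->sender_is_dr);
	return (true);
}
```

Keeping the decision free of side effects makes it easy to test separately from the list handling (lsa_hdr_new(), ls_ack_list_add()) that the real code performs once the answer is yes.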
lsa_flood()

As mentioned earlier, lsa_flood() is used for flooding on a per-interface scope. In particular, it loops over all neighbors and decides for each one whether the update has to be sent to it or not.

Code snip 38: neighbor loop part 1
    LIST_FOREACH(nbr, &iface->nbr_list, entry) {
        if (nbr == iface->self)
            continue;
        if (!(nbr->state & NBR_STA_FLOOD))
            continue;

First of all, self is skipped. Then all neighbors that are not available for flooding, meaning their state is neither FULL nor LOADING nor EXCHANGE, are skipped as well.

Code snip 39: neighbor loop part 2
        if (iface->state & IF_STA_DROTHER && !queued)
            if ((le = ls_retrans_list_get(iface->self,
                lsa_hdr)))
                ls_retrans_list_free(iface->self, le);
        if ((le = ls_retrans_list_get(nbr, lsa_hdr)))
            ls_retrans_list_free(nbr, le);

Afterwards the retransmission lists are searched for an older LS update for the same LSA. If an older LSA is found it is removed and later replaced with the new one. A special queue is used for interfaces in state DROTHER, as explained later on. Because only one queue is used, redoing this check after the LSA has been queued once results in unexpected behaviour, since the freshly queued update itself would be found and removed again. So this case is protected by the !queued check.

Code snip 40: neighbor loop part 3
        if (!(nbr->state & NBR_STA_FULL) &&
            (le = ls_req_list_get(nbr, lsa_hdr)) != NULL) {
            r = lsa_newer(lsa_hdr, le->le_lsa);
            if (r > 0) {
                /* to flood LSA is newer than requested */
                ls_req_list_free(nbr, le);
                /* new needs to be flooded */
            } else if (r < 0) {
                /* to flood LSA is older than requested */
                continue;
            } else {
                /* LSA are equal */
                ls_req_list_free(nbr, le);
                continue;
            }
        }

If the adjacency is not yet full, the LS request list is examined. If the LSA is found there, we know exactly which instance of the LSA the neighbor has in its database. If the LSA in the request list is older than the new one, the requested one is removed and the new one will be flooded. If instead the new LSA is older than the requested one, there is no need to flood it to the neighbor. The request list is then left alone so that the newer LSA of that neighbor is requested later. If both LSAs are equal there is no need to request the LSA anymore, and there is also no need to flood the LSA to that neighbor.

Code snip 41: neighbor loop part 4
        if (nbr == originator) {
            dont_ack++;
            continue;
        }

        /* non DR or BDR router keep all lsa in one retrans list */
        if (iface->state & IF_STA_DROTHER) {
            if (!queued)
                ls_retrans_list_add(iface->self, data);
            queued = 1;
        } else {
            ls_retrans_list_add(nbr, data);
            queued = 1;
        }

If the current neighbor is the initial sender of this LS update, chances are high that no ack has to be sent back. This decision is made later. In any case there is no need to flood the LS update back to this router. Finally the LS update, or rather a reference to it, is added to the retransmission queue. Depending on the interface state, different queues are chosen. If the interface is not in state DROTHER, the update is added to the neighbor's retransmission list. In the DROTHER case only one global queue is used, because all updates go to the AllDRouters address. For this special case iface->self is "abused". Because only one queue is used, it is important to protect it from multiple adds. There is currently a known issue in the queuing behaviour of OpenOSPFD that needs to be solved. If the router is BDR, it queues the update to all neighbors on that interface, including the DR. The DR would therefore have to acknowledge the update to the BDR. Because this does not happen, the BDR retransmits the update once to the DR, and the DR then answers with a direct acknowledgement. This is unnecessary. No updates for the DR should be queued unless they are self-originated or come from a different interface.

Code snip 42: sending LS update
    if (!queued)
        return (0);

    if (iface == originator->iface &&
        iface->self != originator) {
        if (iface->dr == originator ||
            iface->bdr == originator)
            return (0);
        if (iface->state & IF_STA_BACKUP)
            return (0);
        dont_ack++;
    }

    /* flood LSA but first set correct destination */
    switch (iface->type) {
    case IF_TYPE_POINTOPOINT:
        inet_aton(AllSPFRouters, &addr);
        send_ls_update(iface, addr, data, len);
        break;
    case IF_TYPE_BROADCAST:
        if (iface->state & IF_STA_DRORBDR)
            inet_aton(AllSPFRouters, &addr);
        else
            inet_aton(AllDRouters, &addr);
        send_ls_update(iface, addr, data, len);
        ...
        break;
    }

    return (dont_ack == 2);
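The tail of Code snip 42 makes two decisions: whether the update is sent at all, and to which address. Both can be modeled as small functions, as in the sketch below. The enum and function names are hypothetical. The multicast groups are the standard OSPF ones, 224.0.0.5 for AllSPFRouters and 224.0.0.6 for AllDRouters. Interface types other than point-to-point and broadcast fall back to a caller-supplied unicast address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Well-known OSPF multicast groups, host byte order. */
#define ALL_SPF_ROUTERS	0xe0000005U	/* 224.0.0.5 */
#define ALL_DROUTERS	0xe0000006U	/* 224.0.0.6 */

/* Hypothetical, reduced interface classification. */
enum if_kind { IFK_POINTOPOINT, IFK_BROADCAST, IFK_OTHER };

/*
 * Early returns of Code snip 42. same_iface means the update was
 * received on this interface from another router (not from self).
 */
bool
ls_update_should_send(bool queued, bool same_iface,
    bool sender_is_dr_or_bdr, bool we_are_bdr)
{
	if (!queued)
		return (false);
	/* DR/BDR already flooded it, or the DR will do it for us */
	if (same_iface && (sender_is_dr_or_bdr || we_are_bdr))
		return (false);
	return (true);
}

/* Destination selection; unicast_dst is used for all other types. */
uint32_t
ls_update_dst(enum if_kind kind, bool we_are_dr_or_bdr,
    uint32_t unicast_dst)
{
	switch (kind) {
	case IFK_POINTOPOINT:
		return (ALL_SPF_ROUTERS);
	case IFK_BROADCAST:
		return (we_are_dr_or_bdr ? ALL_SPF_ROUTERS : ALL_DROUTERS);
	default:
		return (unicast_dst);
	}
}
```

On a broadcast segment the asymmetry follows the DR model. A DROTHER only talks to the DR and BDR, while the DR and BDR redistribute the update to every OSPF router on the segment.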
After every neighbor has been inspected and the LSA references have been added to the retransmission lists, an initial flooding is sent out. If nothing got queued there is no reason to send the LSA, so the function returns. In all other cases the update is sent to the correct address. For point-to-point links it is always AllSPFRouters. For broadcast networks it is either AllSPFRouters or AllDRouters, so that the update is multicast to the correct group. All other interface types use unicast to send the updates. Before the LS update is sent out, a special check is done, mostly for broadcast and NBMA networks. If the originator of the initial LS update is on the now outgoing interface, more checks have to be done. First of all, if the originator is DR or BDR there is no need to send an update, because the actual flooding was already done by the DR or BDR respectively. Additionally, if the router itself is BDR there is no need to flood the network, because this will be done by the DR. If neither of these two tests was true, it is now clear that no acknowledgement needs to be sent back. Therefore dont_ack is bumped a second time and lsa_flood() will return true.

4.5.2 Retransmission Lists and LSA Cache

Now let's have a look at the retransmission lists. All other lists (acknowledgement, request, and database description list) are implemented in a similar way. The retransmission list is a bit more complex because of the LSA cache. To add an LS update to the retransmission list, ls_retrans_list_add() is used.

Code snip 43: ls_retrans_list_add()
    if ((ref = lsa_cache_get(lsa)) == NULL)
        fatalx("King Bula sez: somebody forgot to "
            "lsa_cache_add");

    if ((le = calloc(1, sizeof(*le))) == NULL)
        fatal("ls_retrans_list_add");
    le->le_ref = ref;
    TAILQ_INSERT_TAIL(&nbr->ls_retrans_list, le, entry);

    if (!evtimer_pending(&nbr->ls_retrans_timer, NULL)) {
        timerclear(&tv);
        tv.tv_sec = nbr->iface->rxmt_interval;
        if (evtimer_add(&nbr->ls_retrans_timer, &tv) == -1)
            log_warn("ls_retrans_list_add: evtimer_add "
                "failed");
    }

First of all an LSA cache reference is acquired via lsa_cache_get(). If this call fails we have an internal program error, and the OSPF engine has no way to recover from that. The reference is stored in a list element, which in turn is added to the retransmission list. If no timer is pending, a new retransmission timer is started.

Removing works in a similar way. First the correct entry is looked up with the help of ls_retrans_list_get(), and afterwards it is freed if the LSA was the same. ls_retrans_list_get() uses the well-known LSA triple (LS type, LS ID and advertising router) to identify an LSA.

Code snip 44: ls_retrans_list_free()
    void
    ls_retrans_list_free(struct nbr *nbr, struct lsa_entry *le)
    {
        TAILQ_REMOVE(&nbr->ls_retrans_list, le, entry);

        lsa_cache_put(le->le_ref, nbr);
        free(le);
    }

ls_retrans_list_free() not only unlinks the LSA from the retransmission list but also hands the LSA cache reference back by calling lsa_cache_put(). Again, it is important to take care of those references.

How does this LSA cache work?
The LSA cache is nothing more than a hash list. A simple hash is computed over the LSA header and used to find the correct hash bucket. In the LSA cache an LSA is identified not only by LS type, LS ID and advertising router; the sequence number and LS checksum are compared as well. To find an LSA in the cache the internal lsa_cache_look() function is used. lsa_cache_get() returns a new reference to an existing LSA.

Code snip 45: lsa_cache_get()
    struct lsa_ref *
    lsa_cache_get(struct lsa_hdr *lsa_hdr)
    {
        struct lsa_ref *ref;

        ref = lsa_cache_look(lsa_hdr);
        if (ref)
            ref->refcnt++;

        return (ref);
    }

This function is very simple, and the only important step is not to forget to bump the reference count. lsa_cache_add() works very similarly to lsa_cache_get(). Again lsa_cache_look() is used to find already added LSAs. In that case bumping the reference count is enough. Otherwise a new reference object is allocated and filled in. It includes a timestamp that is used to age the LSA when it is sent out. The initial reference count is set to one, because a reference is immediately returned to the caller.

Code snip 46: lsa_cache_add()
    struct lsa_ref *
    lsa_cache_add(void *data, u_int16_t len)
    {
        struct lsa_cache_head   *head;
        struct lsa_ref          *ref, *old;

        if ((ref = calloc(1, sizeof(*ref))) == NULL)
            fatal("lsa_cache_add");
        memcpy(&ref->hdr, data, sizeof(ref->hdr));

        if ((old = lsa_cache_look(&ref->hdr))) {
            free(ref);
            old->refcnt++;
            return (old);
        }

        if ((ref->data = malloc(len)) == NULL)
            fatal("lsa_cache_add");
        memcpy(ref->data, data, len);
        ref->stamp = time(NULL);
        ref->len = len;
        ref->refcnt = 1;
        head = lsa_cache_hash(&ref->hdr);
        LIST_INSERT_HEAD(head, ref, entry);
        return (ref);
    }

lsa_cache_put() was only roughly explained in the MAX_AGE handling. First the reference count is decreased, and if it hits zero the cache entry is no longer referenced and can be freed. Then the familiar MAX_AGE dance follows. An IMSG_LS_MAXAGE is sent back if the LSA has an age of MAX_AGE, which makes it possible to remove the LSA from the LS DB. Afterwards the cache object is cleaned up and removed.

Code snip 47: lsa_cache_put()
    void
    lsa_cache_put(struct lsa_ref *ref, struct nbr *nbr)
    {
        if (--ref->refcnt > 0)
            return;

        if (ntohs(ref->hdr.age) >= MAX_AGE)
            ospfe_imsg_compose_rde(IMSG_LS_MAXAGE,
                nbr->peerid, 0, ref->data,
                sizeof(struct lsa_hdr));

        free(ref->data);
        LIST_REMOVE(ref, entry);
        free(ref);
    }

4.5.3 Self originated LSA

There are three kinds of self-originated LSAs. First, router- and network-LSAs are generated in the OSPF engine. Then AS-external-LSAs are generated in the RDE with the help of the parent process. Finally, on ABRs summary-LSAs are generated, which also happens in the RDE. Creating a self-originated LSA in the OSPF engine and committing it to the LS DB in the RDE is a bit tricky. Let's have a look at orig_net_lsa(), because it is a lot simpler than orig_rtr_lsa().

Code snip 48: originate network-LSA
    if ((buf = buf_dynamic(sizeof(lsa_hdr), READ_BUF_SIZE)) ==
        NULL)
        fatal("orig_net_lsa");

    /* reserve space for LSA header and LSA Router header */
    if (buf_reserve(buf, sizeof(lsa_hdr)) == NULL)
        fatal("orig_net_lsa: buf_reserve failed");

    /* LSA net mask and then all fully adjacent routers */
    if (buf_add(buf, &iface->mask, sizeof(iface->mask)))
        fatal("orig_net_lsa: buf_add failed");

    /* fully adjacent neighbors + self */
    LIST_FOREACH(nbr, &iface->nbr_list, entry)
        if (nbr->state & NBR_STA_FULL) {
            if (buf_add(buf, &nbr->id,
                sizeof(nbr->id)))
                fatal("orig_net_lsa: "
                    "buf_add failed");
            num_rtr++;
        }

    if (num_rtr == 1) {
        /*
         * non transit net therefore no need to generate
         * a net lsa
         */
        buf_free(buf);
        return;
    }

    /* LSA header */
    if (iface->state & IF_STA_DR)
        lsa_hdr.age = htons(DEFAULT_AGE);
    else
        lsa_hdr.age = htons(MAX_AGE);

    lsa_hdr.opts = oeconf->options;    /* XXX */
    lsa_hdr.type = LSA_TYPE_NETWORK;
    lsa_hdr.ls_id = iface->addr.s_addr;
    lsa_hdr.adv_rtr = oeconf->rtr_id.s_addr;
    lsa_hdr.seq_num = htonl(INIT_SEQ_NUM);
    lsa_hdr.len = htons(buf->wpos);
    lsa_hdr.ls_chksum = 0;             /* updated later */
    memcpy(buf_seek(buf, 0, sizeof(lsa_hdr)), &lsa_hdr,
        sizeof(lsa_hdr));

    chksum = htons(iso_cksum(buf->buf, buf->wpos,
        LS_CKSUM_OFFSET));
    memcpy(buf_seek(buf, LS_CKSUM_OFFSET, sizeof(chksum)),
        &chksum, sizeof(chksum));

    imsg_compose(ibuf_rde, IMSG_LS_UPD, iface->self->peerid, 0,
        -1, buf->buf, buf->wpos);

    buf_free(buf);

Once again the buf API is used. First space for the header is reserved, then the network mask is added, and finally a list of all fully adjacent routers is added. The router itself needs to be added as well, but this is no problem because of the special self neighbor. If there is no other OSPF router on the network, it is not necessary to create a network-LSA, because a stub network entry in the router-LSA will do the job. In that case the buffer is freed and the function returns. Otherwise the LSA header has to be built. First the correct age is set. To remove a network-LSA the age is set to MAX_AGE, which the code does when the router is no longer DR for the network; otherwise the initial DEFAULT_AGE is used. Other important fields are LS type, LS ID and advertising router. The sequence number has to be set as well, but the correct instance number is only known by the RDE. The RDE later uses lsa_merge() to merge this LSA into the database, and lsa_merge() takes care of the sequence number. Here it is therefore just set to the initial value. Finally the header is copied into the buffer, the checksum is calculated, and this self-originated LSA is sent to the RDE with the peer ID of the special self neighbor.

Originating a router-LSA is done in a similar way. It is just more complex because much more information is added to the router-LSA. One tricky part is setting the correct router flags.

Code snip 49: originate router-LSA
    /* LSA router header */
    lsa_rtr.flags = 0;
    /*
     * Set the E bit as soon as an as-ext lsa may be
     * redistributed, only setting it in case we redistribute
     * something is not worth the fuss.
     */
    if (oeconf->redistribute_flags &&
        (oeconf->options & OSPF_OPTION_E))
        lsa_rtr.flags |= OSPF_RTR_E;

    border = area_border_router(oeconf);
    if (border != oeconf->border) {
        oeconf->border = border;
        orig_rtr_lsa_all(area);
    }

    if (oeconf->border)
        lsa_rtr.flags |= OSPF_RTR_B;
    if (virtual)
        lsa_rtr.flags |= OSPF_RTR_V;
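The flag logic of Code snip 49 reduces to a handful of independent bit tests. In the sketch below, the bit values are the router-LSA flag bits from RFC 2328 (B = 0x01, E = 0x02, V = 0x04). is_area_border() is a hypothetical stand-in for area_border_router() that simply counts active areas. The re-origination in all areas that the real code triggers via orig_rtr_lsa_all() when the border state changes is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Router-LSA flag bits (RFC 2328, router-LSA format). */
#define RTR_B	0x01	/* area border router */
#define RTR_E	0x02	/* AS boundary router */
#define RTR_V	0x04	/* virtual link endpoint */

/* Hypothetical stand-in for area_border_router(): two or more active areas. */
bool
is_area_border(const bool *area_active, int nareas)
{
	int	i, active = 0;

	for (i = 0; i < nareas; i++)
		if (area_active[i])
			active++;
	return (active >= 2);
}

/* Compute the router-LSA flags byte the way Code snip 49 does. */
uint8_t
rtr_lsa_flags(bool may_redistribute, bool option_e, bool border,
    bool vlink_endpoint)
{
	uint8_t	flags = 0;

	/* E bit: set as soon as redistribution is configured at all */
	if (may_redistribute && option_e)
		flags |= RTR_E;
	if (border)
		flags |= RTR_B;
	if (vlink_endpoint)
		flags |= RTR_V;
	return (flags);
}
```

For example, a router with redistribution configured and the E option enabled, but only one active area, advertises just RTR_E.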
There are three bits that have to be set. The E bit indicates that the router is an AS border router and will announce AS-external routes. The E bit is used in the SPF calculation and for summary-LSAs. In the SPF calculation, routers with the E bit set are added to the RIB. Without the E bit, all AS-external routes using this router as advertising router are considered invalid, because the router is not present in the RIB. The same happens for summary-LSAs: on ABRs, router summary-LSAs are generated for every router with the E bit set. OpenOSPFD cheats a bit with the E bit. It sets the bit as soon as an AS-external route may be redistributed, not only when the router actually redistributes a route. Other implementations show the same sloppy behaviour. Even more complex is setting the B bit, which is used to mark ABRs. As soon as a router is part of two active areas, the B bit has to be set on all its router-LSAs. area_border_router() returns true if there are two or more active areas. If the ABR state changes, all self-originated router-LSAs in all areas have to be updated. This is done via orig_rtr_lsa_all(), which in turn calls orig_rtr_lsa() for all areas but the current one. After that, setting the B bit is no longer a problem. The last bit that can be set is the V bit, which marks the endpoints of a virtual link. Areas in which a router has the V bit set are transit areas. Transit areas need some special handling in the SPF calculation. For example, it is not allowed to send aggregated summary routing information into a transit area.

4.5.4 ABR and summary-LSA

The code handling ABRs and summary-LSAs is still in flux. There are too many workarounds and some parts are still missing. Let's have a look at it anyway. It actually starts in the SPF calculation. The code that recalculates the RIB currently looks like this:

Code snip 50: SPF timer
    rt_invalidate();

    LIST_FOREACH(area, &conf->area_list, entry)
        spf_calc(area);

    RB_FOREACH(r, rt_tree, &rt) {
        LIST_FOREACH(area, &conf->area_list, entry)
            rde_summary_update(r, area);

        if (r->d_type != DT_NET)
            continue;

changes in that area. In the next step a walk over the RIB is done. Calling rde_summary_update() for every area and every route generates all the required summary information. Afterwards the kernel routing table is updated by sending change or delete messages to the parent process. This is only done for routes that describe networks. After that, old invalid summary-LSAs are removed from all areas. Finally the hold timer is started. This is specified in the RFC so that the SPF calculation does not overwhelm underpowered routers.

rde_summary_update() decides whether it is necessary to create a summary-LSA.

Code snip 51: Is summary-LSA needed?
    /* first check if we actually need to announce this route */
    if (!(rte->d_type == DT_NET || rte->flags & OSPF_RTR_E))
        return;
    /* never create summaries for as-ext LSA */
    if (rte->p_type == PT_TYPE1_EXT || rte->p_type ==
        PT_TYPE2_EXT)
        return;
    /* no need for summary LSA in the originating area */
    if (rte->area.s_addr == area->id.s_addr)
        return;
    /* TODO nexthop check, nexthop part of area -> no summary */
    if (rte->cost >= LS_INFINITY)
        return;
    /* TODO AS border router specific checks */
    /* TODO inter-area network route stuff */
    /* TODO intra-area stuff -- condense LSA ??? */

First of all, only network routes, or router routes where the E bit is set, are summarised into other areas. The E bit is the same as the one in router-LSAs that specifies that the router is an ASBR. An ASBR has to be announced into other areas so that they can validate the AS-external-LSAs. As AS-external routes are flooded through all areas, there is no need to create summaries for those networks. The originating area and all invalid routes are skipped. Finally, some other minor but very complicated things are left out for now.

Code snip 52: update summary-LSA
    /* update lsa but only if it was changed */
    if (rte->d_type == DT_NET) {
        type = LSA_TYPE_SUM_NETWORK;
        v = lsa_find(area, type, rte->prefix.s_addr,
            rde_router_id());
    } else if (rte->d_type == DT_RTR) {
        type = LSA_TYPE_SUM_ROUTER;
        v = lsa_find(area, type, rte->adv_rtr.s_addr,
            rde_router_id());
    } else
        fatalx("orig_sum_lsa: unknown route type");

    lsa = orig_sum_lsa(rte, type);
    lsa_merge(rde_nbr_self(area), lsa, v);
if (r->invalid) if (v == NULL) {
rde_send_delete_kroute(r); if (rte->d_type == DT_NET)
else v = lsa_find(area, type,
rde_send_change_kroute(r); rte->prefix.s_addr, rde_router_id());
} else
v = lsa_find(area, type,
LIST_FOREACH(area, &conf->area_list, entry) rte->adv_rtr.s_addr, rde_router_id());
lsa_remove_invalid_sums(area); }
v->cost = rte->cost;
start_spf_holdtimer(conf);
To update the LS DB lsa_merge() is used. Before it is
First the RIB is invalidated by flagging routes as invalid.
possible to call lsa_merge() two things have to be done.
While doing that old invalid routes are removed from the
First the current database version of the LSA has to be
tree. Afterwards the SPF calculation is run for every
found. Secondly a new LSA is generated by
area. This is one of the things that should be changed.
orig_sum_lsa(). After merging the LSA it is necessary
There is no need to recalculate an area if there was no
to update the cost of the vertex so that a later call to lsa_remove_invalid_sums() sees that this vertex is still in use. If the LSA was newly added, the previous lsa_find() returned NULL, so the search has to be repeated to get a valid vertex.

lsa_remove_invalid_sums() does nothing more than a tree walk looking for summary-LSAs with a cost of LS_INFINITY. It removes those by setting their age to MAX_AGE and calling lsa_timeout() to flood them out.

4.5.5 Originating AS-external-LSA

To redistribute AS-external-LSAs, the parent process sends a list of candidates to the RDE. The RDE uses rde_asext_get() to convert the kroute into an LSA, and with the help of lsa_find() and lsa_merge() the LSA is added to the database. Similarly, on removal rde_asext_put() is used to get the no longer needed LSA, and again lsa_find() and lsa_merge() do the actual job.

rde_asext_put() has a fairly simple job: find the kroute, remove it from the list and, if the LSA was in use, create an LSA with LS age MAX_AGE.

Code snip 53: rde_asext_put()

LIST_FOREACH(ae, &rde_asext_list, entry)
    if (kr->prefix.s_addr == ae->kr.prefix.s_addr &&
        kr->prefixlen == ae->kr.prefixlen) {
        LIST_REMOVE(ae, entry);
        used = ae->used;
        free(ae);
        if (used)
            return (orig_asext_lsa(kr, MAX_AGE));
        break;
    }
return (NULL);

rde_asext_get(), on the other hand, has a bit more to do. It first checks whether the route was already added before. In that case the entry needs to be updated; otherwise a new one is created.

Code snip 54: rde_asext_get() part 1

LIST_FOREACH(ae, &rde_asext_list, entry)
    if (kr->prefix.s_addr == ae->kr.prefix.s_addr &&
        kr->prefixlen == ae->kr.prefixlen)
        break;

if (ae == NULL) {
    if ((ae = calloc(1, sizeof(*ae))) == NULL)
        fatal("rde_asext_get");
    LIST_INSERT_HEAD(&rde_asext_list, ae, entry);
}
memcpy(&ae->kr, kr, sizeof(ae->kr));

wasused = ae->used;
ae->used = rde_redistribute(kr);

The next task is to find out whether the route should be redistributed. The actual logic is in rde_redistribute(), so let's have a look at that.

Code snip 55: rde_redistribute()

int
rde_redistribute(struct kroute *kr)
{
    struct area  *area;
    struct iface *iface;
    int           rv = 0;

    if (!(kr->flags & F_KERNEL))
        return (0);

    if ((rdeconf->options & OSPF_OPTION_E) == 0)
        return (0);

    if ((rdeconf->redistribute_flags & REDISTRIBUTE_DEFAULT) &&
        (kr->prefix.s_addr == INADDR_ANY && kr->prefixlen == 0))
        return (1);

    /* only allow 0.0.0.0/0 if REDISTRIBUTE_DEFAULT */
    if (kr->prefix.s_addr == INADDR_ANY && kr->prefixlen == 0)
        return (0);

    if ((rdeconf->redistribute_flags & REDISTRIBUTE_STATIC) &&
        (kr->flags & F_STATIC))
        rv = 1;
    if ((rdeconf->redistribute_flags & REDISTRIBUTE_CONNECTED) &&
        (kr->flags & F_CONNECTED))
        rv = 1;

    /*
     * interface is not up and running so don't
     * announce
     */
    if (kif_validate(kr->ifindex) == 0)
        return (0);

    LIST_FOREACH(area, &rdeconf->area_list, entry)
        LIST_FOREACH(iface, &area->iface_list, entry) {
            if ((iface->addr.s_addr & iface->mask.s_addr) ==
                kr->prefix.s_addr && iface->mask.s_addr ==
                prefixlen2mask(kr->prefixlen))
                /* already announced as net LSA */
                rv = 0;
        }

    return (rv);
}

First it is checked whether anything has to be redistributed at all. Afterwards the default route is handled: it is only redistributed if explicitly enabled via “redistribute default”. Depending on the flags, it is then decided whether the route gets redistributed. The interface state is checked, and finally all configured interfaces are inspected to make sure the route is not already part of a network-LSA or announced as a stub network.

After the rde_redistribute() call it is clear what remains to be done.

Code snip 56: rde_asext_get() part 2

if (ae->used)
    /* update of seqnum is done by lsa_merge */
    return (orig_asext_lsa(kr, DEFAULT_AGE));
else if (wasused)
    /*
     * lsa_merge will take care of removing the
     * lsa from the db
     */
    return (orig_asext_lsa(kr, MAX_AGE));
else
    /* not in lsdb, superseded by a net lsa */
    return (NULL);

If the route has to be redistributed, an LSA with the initial LS age is generated and returned. If it is no longer used, an LSA with LS age MAX_AGE is generated and returned. Otherwise nothing is left to do and the function returns NULL.
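
The three outcomes above, together with the withdrawal path of rde_asext_put(), can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not the OpenOSPFD code: a fixed-size array replaces rde_asext_list, the names struct cand, asext_get(), asext_put() and NO_LSA are hypothetical, the redistribution verdict is passed in as a flag instead of being computed by rde_redistribute(), and the functions return an LS age instead of building an LSA with orig_asext_lsa(). DEFAULT_AGE is assumed to be 0 and MAX_AGE is the 3600 seconds from RFC 2328.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_AGE 0     /* assumed initial LS age of a new LSA */
#define MAX_AGE     3600  /* MaxAge from RFC 2328, appendix B */
#define NO_LSA      (-1)  /* hypothetical: nothing to hand to lsa_merge() */
#define MAX_CANDS   16    /* hypothetical fixed table size */

/* Hypothetical stand-in for an entry of rde_asext_list. */
struct cand {
    uint32_t prefix;      /* network prefix */
    uint8_t  prefixlen;
    int      inuse;       /* slot holds a candidate */
    int      used;        /* currently announced as AS-external-LSA */
};

static struct cand cands[MAX_CANDS];

static struct cand *
cand_find(uint32_t prefix, uint8_t prefixlen)
{
    for (size_t i = 0; i < MAX_CANDS; i++)
        if (cands[i].inuse && cands[i].prefix == prefix &&
            cands[i].prefixlen == prefixlen)
            return (&cands[i]);
    return (NULL);
}

/*
 * Modelled on rde_asext_get(): add or update a candidate and return
 * the LS age of the LSA to originate, or NO_LSA. 'redistribute'
 * stands in for the result of rde_redistribute().
 */
int
asext_get(uint32_t prefix, uint8_t prefixlen, int redistribute)
{
    struct cand *c = cand_find(prefix, prefixlen);
    int wasused;

    if (c == NULL) {
        for (size_t i = 0; i < MAX_CANDS && c == NULL; i++)
            if (!cands[i].inuse)
                c = &cands[i];
        assert(c != NULL);  /* sketch: table assumed large enough */
        c->inuse = 1;
        c->prefix = prefix;
        c->prefixlen = prefixlen;
        c->used = 0;
    }

    wasused = c->used;
    c->used = redistribute != 0;

    if (c->used)
        return (DEFAULT_AGE);  /* (re)announce; snip 56 leaves seqnum to lsa_merge() */
    else if (wasused)
        return (MAX_AGE);      /* flush the previously announced LSA */
    return (NO_LSA);           /* never announced, nothing to do */
}

/* Modelled on rde_asext_put(): drop the candidate, flush only if announced. */
int
asext_put(uint32_t prefix, uint8_t prefixlen)
{
    struct cand *c = cand_find(prefix, prefixlen);
    int used;

    if (c == NULL)
        return (NO_LSA);
    used = c->used;
    c->inuse = 0;
    c->used = 0;
    return (used ? MAX_AGE : NO_LSA);
}
```

Calling asext_get() twice with a verdict of 1 returns DEFAULT_AGE both times, because the model, like snip 56, leaves any sequence number change to the merge step. Flipping the verdict to 0 yields MAX_AGE once and NO_LSA afterwards, and asext_put() only flushes entries that were actually announced.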

When an interface changes state, rde_update_redistribute() is called and all routes that depend on this interface are recalculated in a way very similar to the code presented here, again going through rde_redistribute(), orig_asext_lsa(), lsa_find() and lsa_merge().
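
A rough sketch of such an interface-driven recalculation follows. It assumes a plain array of routes, a per-ifindex up/down table and simplified flag values; struct route, if_up, redistribute() and update_redistribute() are hypothetical names, and only the static/connected flag checks and the interface check of rde_redistribute() are kept. Snip 56 re-originates every still-used route and leaves it to lsa_merge() to detect whether anything changed; the sketch simplifies this and only counts state transitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the real kroute and config flags differ. */
#define F_STATIC          0x01
#define F_CONNECTED       0x02
#define REDIST_STATIC     0x01
#define REDIST_CONNECTED  0x02
#define MAX_IF            8

struct route {
    uint32_t prefix;
    uint8_t  prefixlen;
    int      flags;     /* F_STATIC and/or F_CONNECTED */
    int      ifindex;   /* interface the route depends on */
    int      used;      /* currently announced as AS-external-LSA */
};

int if_up[MAX_IF];      /* hypothetical interface state table, 1 = up */
static int redist_flags = REDIST_STATIC | REDIST_CONNECTED;

/* Trimmed-down version of the flag and interface checks of rde_redistribute(). */
static int
redistribute(const struct route *r)
{
    int rv = 0;

    if ((redist_flags & REDIST_STATIC) && (r->flags & F_STATIC))
        rv = 1;
    if ((redist_flags & REDIST_CONNECTED) && (r->flags & F_CONNECTED))
        rv = 1;
    /* interface is not up and running, so don't announce */
    if (r->ifindex < 0 || r->ifindex >= MAX_IF || !if_up[r->ifindex])
        return (0);
    return (rv);
}

/*
 * Re-evaluate every route bound to 'ifindex' after its state changed.
 * Routes that start being announced are counted in *originated (fresh
 * LS age), routes that stop being announced in *flushed (LS age
 * MAX_AGE). Returns the number of state transitions.
 */
size_t
update_redistribute(struct route *rt, size_t n, int ifindex,
    size_t *originated, size_t *flushed)
{
    *originated = 0;
    *flushed = 0;

    for (size_t i = 0; i < n; i++) {
        int wasused;

        if (rt[i].ifindex != ifindex)
            continue;
        wasused = rt[i].used;
        rt[i].used = redistribute(&rt[i]);
        if (rt[i].used && !wasused)
            (*originated)++;
        else if (!rt[i].used && wasused)
            (*flushed)++;
    }
    return (*originated + *flushed);
}
```

Only routes bound to the changed interface are touched, which mirrors the idea that an interface event should not trigger a full re-evaluation of every redistribution candidate.
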
4.6 Issues and other stuff

There are still some problems in OpenOSPFD that have to be solved. Some features are incomplete, so there is still a lot of work to be done. Let's look back at the solved problems. The first problem encountered was probably privilege separation, because a clever split had to be made. This is still sometimes an issue; for example, the current redistribute code is partially done in the wrong place. The result is massive overhead if the router does “redistribute static” with a full view in the routing table: all ~170'000 routes are passed to the RDE and evaluated there. It works, but it is inefficient. Other problems with privsep, like the MAX_AGE and the database exchange problems explained earlier, were solved. A good example of a workaround is the multicast handling. A real fix for this problem is in progress, but some kernel patches are required to make it fly. At least many issues and bugs were identified and fixed in the flooding and database exchange phase, the most important part of the protocol.
Things that remain to be fixed include the redistribute code and the missing support for interface aliases. The ABR code is still not optimal and is not as well tested as the normal case. Virtual links still need a lot of work to get them flying: a lot of code is around, but some important bits are missing. Interface handling should be improved, for example by supporting aliases and dynamic interfaces. Last but not least, there are all those supercool new features planned, but that's a different paper. :)

Bibliography
[1] Moy, J. OSPF version 2. RFC 2328, April 1998.
[2] Moy, J. OSPF: Anatomy of an Internet Routing Protocol. Addison-Wesley, September 1998.
[3] OpenBSD, http://www.openbsd.org/
[4] OpenBGPD, http://www.openbgpd.org/
[5] OpenOSPFD source code, http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/ospfd/
