OpenOSPFD
by Claudio Jeker <claudio@openbsd.org>
Internet Business Solutions AG
Abstract
OpenOSPFD is a free and secure implementation of the Open Shortest Path First protocol. It allows ordinary machines to be used as routers exchanging and calculating routes within an OSPF cloud.
OpenOSPFD is the next major step after OpenBGPD for full router capabilities in OpenBSD and other BSDs. Together with OpenBGPD it is possible to re-route traffic in case of link loss, resulting in a higher level of availability.
OpenOSPFD – design and implementation Claudio Jeker
Figure 1: Sample OSPF network (labels: areas 0.0.0.0 (backbone), 0.0.0.1, 0.0.0.2, 0.0.0.4 (stub) and 0.0.0.5; ABRs; ASBRs; a virtual link; the Internet; a RIP cloud)
1.2 Algorithms
There are two main concepts for exchanging routing information. The two classes of algorithms work in totally different ways.
1.2.1 Distance Vector Algorithms
Distance vector algorithms got their name from the form of the routing updates: a vector of metrics. Each router shares its routing table with all its neighbors. The neighbors then walk through the list and check whether their current route entry is better or not. If not, the route is replaced and redistributed again.
In the case of RIP the list of routes and their metrics is exchanged every 30 seconds. This results in slow convergence because an update propagates only one hop every 30 seconds. On the other hand the protocol is simple and robust because every router cares only about its own neighbors. In other words the information about
2.2 Packets
There are five different packet types defined. Every packet starts with a common 24 byte OSPF header. This header includes all necessary information for the recipient to determine if it should be accepted and processed or ignored and dropped.
[Figure: common OSPF header with the fields Version #, Type, Packet Length, Router ID, Area ID, Checksum, Authentication Type and Authentication Data]
Packet types:
    Type  Packet name
    1     Hello
    2     Database Description
    3     Link-State Request
    4     Link-State Update
    5     Link-State Acknowledgement
2.2.1 Hello
Figure 3: Hello Header
Hello packets are sent periodically in order to establish and maintain neighbor relationships. Hello packets are sent to a multicast group to enable dynamic discovery of neighboring routers. All routers connected to a common network must agree on certain parameters. The most important part of the Hello packet is the neighbor list at the end. The router ID of each router from which a valid Hello packet has recently been received is added to that list. Only after a router has seen its own router ID in a neighbor's Hello packet can an adjacency be formed.
2.2.3 Link-State Request
[Figure: Link-State Request body with LS Type, Link-State ID and Advertising Router for each requested LSA]
After exchanging Database Description packets with the neighboring router, Link-State Request packets request the pieces of the neighbor's LS database that are more up-to-date. Each LSA requested is specified by its LS type, Link-State ID, and Advertising Router. This uniquely identifies the LSA, but not its instance. Link-State Request packets are understood to be requests for the most recent instance. It is possible to request multiple LSAs with one LS request packet.
2.2.4 Link-State Update
[Figure: Link-State Update packet consisting of the common OSPF header with Type 4, the Number of LSAs and the LSAs themselves]
These packets implement the flooding of LSAs. Each Link-State Update packet carries a collection of LSAs one hop further from their origin. Several LSAs may be included in a single packet. The body of the Link-State Update packet consists of a list of LSAs.
2.2.5 Link-State Acknowledgement
In order to make the flooding procedure reliable, flooded LSAs are acknowledged in Link-State Acknowledgement packets. Multiple LSAs can be acknowledged in a single Link-State Acknowledgement packet. The format of this packet is similar to that of the Database Description packet. The body of both packets is simply a list of LSA headers.
2.2.6 Link-State Advertisements Header
Each LSA begins with a common 20 byte header. This header is enough to uniquely identify an LSA. So it is enough to use the LSA header in LS acknowledgements and Database Description packets. LSAs are identified by the LS type, Link-State ID, and Advertising Router triple. Additionally an LS sequence number and LS age are included to determine which instance is more recent. The LS checksum protects the integrity of LSAs. Instead of the well-known CRC algorithm specified in many IP protocols, an ISO checksum algorithm (also known as the Fletcher checksum) is employed.
Figure 8: Link-State Advertisements Header (fields recoverable from the extraction: LS age, Options, LS Type, Link-State ID, Advertising Router, LS sequence number)
Each LSA type has a separate advertisement format. The LS types defined in the OSPF standard are as follows:
Table 2: LS types
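As a minimal sketch of the 20 byte header described in section 2.2.6, the layout below follows the field order of the OSPF standard. The struct name, the field names and the helper lsa_same_identity() are illustrative assumptions and are not taken from the OpenOSPFD sources.

```c
#include <stdint.h>

/*
 * Hypothetical layout of the common LSA header. All multi-byte
 * fields are assumed to stay in network byte order.
 */
struct lsa_hdr_sketch {
	uint16_t age;       /* LS age in seconds */
	uint8_t  opts;      /* Options */
	uint8_t  type;      /* LS type */
	uint32_t ls_id;     /* Link-State ID */
	uint32_t adv_rtr;   /* Advertising Router */
	uint32_t seq_num;   /* LS sequence number */
	uint16_t ls_chksum; /* ISO (Fletcher) checksum */
	uint16_t len;       /* length of the LSA including this header */
};

/* The header must be exactly 20 bytes, without compiler padding. */
_Static_assert(sizeof(struct lsa_hdr_sketch) == 20, "LSA header is 20 bytes");

/*
 * Two headers name the same LSA if the identifying triple matches.
 * Age and sequence number only distinguish instances of that LSA.
 */
static int
lsa_same_identity(const struct lsa_hdr_sketch *a, const struct lsa_hdr_sketch *b)
{
	return (a->type == b->type && a->ls_id == b->ls_id &&
	    a->adv_rtr == b->adv_rtr);
}
```

Because the triple compares equal in either byte order, the helper does not need to convert the fields first.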
The OSPF engine listens to the network and processes the OSPF packets. Both the interface and the neighbor finite state machine are implemented in the OSPF engine. This includes the DR/BDR election process. Additionally the reliable flooding of LS updates with retransmission and acknowledgement is done by [...]
3.1.3 Route Decision Engine
[...] resulting routing table. Premature LSA aging is done by the RDE as well. Additionally redistribution of networks is handled by the process. The RDE synchronises multiple areas if the router is acting as ABR and refloods summary-LSAs into the different areas if necessary.
Figure 9: Design of OpenOSPFD (labels: parent process with root privileges on the routing socket, with fetch table and update table; two processes running as _ospfd:_ospfd and chrooted to /var/empty; socketpairs carrying the redistribute list, route updates, interface changes, flood requests and requests; ospfctl on the UNIX socket /var/run/ospfd.sock; raw IP socket, proto 89)
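The socketpair links between the processes in Figure 9 can be illustrated with a minimal sketch. The message layout (struct msg_sketch) and the function name are hypothetical; the daemon uses its own imsg framing, and the two ends would live in different processes. Here both ends stay in one process to keep the example short.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical fixed-size message, not the real imsg format. */
struct msg_sketch {
	uint32_t type;      /* for example a route update */
	uint32_t prefix;    /* IPv4 prefix */
	uint8_t  prefixlen;
};

/*
 * Create a socketpair, send one message in on one end and read it
 * back on the other. Returns 0 on success and -1 on any failure.
 */
static int
socketpair_roundtrip(const struct msg_sketch *in, struct msg_sketch *out)
{
	int fds[2];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == -1)
		return (-1);
	if (write(fds[0], in, sizeof(*in)) == (ssize_t)sizeof(*in) &&
	    read(fds[1], out, sizeof(*out)) == (ssize_t)sizeof(*out))
		ret = 0;
	close(fds[0]);
	close(fds[1]);
	return (ret);
}
```

A real implementation would have to cope with short reads and writes and would typically hand the descriptors to an event loop instead of blocking.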
the parser goes beyond the scope of this paper. Important to know is that the configuration is parsed into a hierarchy of structures.
The configuration consists of a list of areas and every area holds a list of interfaces that are part of this area. Last but not least every interface has a list of neighbors that is dynamically created as soon as a valid Hello packet is received from another OSPF router on that interface.
After the file has been parsed ospfd daemonises and starts the child processes. Beforehand a set of socketpairs (a special sort of pipes) is created. Finally the event handlers are set up, the rest of the kroute structures is initialised and the parent reports ready for service.
Meanwhile both children have started. First of all both chroot(2) to /var/empty and drop privileges by switching to the special user _ospfd. Before doing that the OSPF engine creates a UNIX local socket for ospfctl and opens the raw IP socket to receive and send packets to the network. After dropping privileges the OSPF engine initialises the different subsystems, sets the event handlers and starts the actual work by kicking the interface finite state machine. The RDE start-up is even simpler as it just has to initialise the internal structures and event handlers.
4.2.2 Routing socket and FIB
The main purpose of the parent process is to maintain the Forwarding Information Base (FIB) and keep the information in sync with the kernel routing table. This synchronisation has to be done in both directions. Additionally link-state changes and arrival or departure of interfaces are handled via the routing socket as well.
The kroute code maintains two primary data structures: a prefix tree (kroute) and an interface tree (kif). These two trees are kept in sync with the kernel through the routing socket. On start-up fetchtable() loads the kroute tree and fetchifs() does the same for the kif tree. Routing changes are tracked by dispatch_rtmsg() which handles kroute changes directly but off-loads interface specific messages to if_change() and if_announce(). To modify the kernel routing table send_rtmsg() is used. send_rtmsg() translates a struct kroute into a rt_msg structure expected by the routing socket. The parent process uses kr_change() to add or modify routes and kr_delete() to remove routes. These changes are propagated to the kernel routing table if needed.
Both the kroute and kif tree are implemented as red-black trees, a kind of balanced binary tree. An API to find, insert and remove nodes is specified to simplify the tree manipulation.
Every time a route is added to or removed from the kroute tree kr_redistribute() is called. This function transmits possible candidates for redistribution to the RDE. In the RDE kif_validate() verifies that the nexthop is actually reachable. This is a workaround that should be fixed later as it is currently not possible to track and handle newly arriving network interfaces at runtime.
Last but not least kr_show_route() and kr_ifinfo() pass information about kroutes or interfaces to ospfctl.
4.3 OSPF Engine
The finite state machines implemented in ospfd are simple table driven state machines. Any state transition may result in a specific action to be run. The resulting next state can either be a result of the action or is fixed and pre-determined.
4.3.1 Interface state machine
Figure 10: Interface FSM (states: DOWN, LOOPBACK, WAITING, POINT-TO-POINT, DROTHER, BACKUP, DR; labels on the transitions: Up, LoopIndication, UnloopIndication, BackupSeen, WaitTimer, NeighborChange, Election)
DOWN
In this state, the lower-level protocols have indicated that the interface is unusable. No protocol traffic at all will be sent or received on such an interface.
LOOPBACK
In this state, the router's interface to the network is looped back. Loopback interfaces are advertised in router-LSAs as single host routes, whose destination is the interface IP address.
POINT-TO-POINT
Point-to-point networks or virtual links enter this state as soon as the interface is operational.
WAITING
Broadcast or NBMA interfaces enter this state when the interface becomes operational. While in this state no DR/BDR election is allowed. Receiving and sending of Hello packets is allowed and is used to try to determine the identity of the DR/BDR routers.
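The table driven approach described in section 4.3 can be sketched as follows. This is a simplified illustration and not the ospfd source: the names, the reduced set of states and events, the Down event and the placeholder election are assumptions. Each table row names the states it applies to, the triggering event, an optional action and a fixed next state; a next state of 0 means that the action decides.

```c
#include <stddef.h>

/* States are bit flags so that one row can match several states. */
enum {
	ST_DOWN = 0x01, ST_LOOPBACK = 0x02, ST_WAITING = 0x04,
	ST_P2P = 0x08, ST_DROTHER = 0x10, ST_ANY = 0xff
};
enum sk_event {
	EV_UP, EV_DOWN, EV_BACKUP_SEEN, EV_WAIT_TIMER, EV_LOOP, EV_UNLOOP
};
enum sk_type { TYPE_P2P, TYPE_BROADCAST };

struct sk_iface {
	enum sk_type type;
	int state;
};

/* Action for event Up: the next state depends on the interface type. */
static int
act_start(struct sk_iface *i)
{
	return (i->type == TYPE_P2P ? ST_P2P : ST_WAITING);
}

/* Placeholder for the DR/BDR election; always yields DROTHER here. */
static int
act_elect(struct sk_iface *i)
{
	(void)i;
	return (ST_DROTHER);
}

static const struct {
	int states;                        /* states the row applies to */
	enum sk_event event;
	int (*action)(struct sk_iface *);  /* may be NULL */
	int next;                          /* 0: result of the action */
} fsm_table[] = {
	{ ST_DOWN,     EV_UP,          act_start, 0 },
	{ ST_WAITING,  EV_BACKUP_SEEN, act_elect, 0 },
	{ ST_WAITING,  EV_WAIT_TIMER,  act_elect, 0 },
	{ ST_LOOPBACK, EV_UNLOOP,      NULL,      ST_DOWN },
	{ ST_ANY,      EV_LOOP,        NULL,      ST_LOOPBACK },
	{ ST_ANY,      EV_DOWN,        NULL,      ST_DOWN },
};

/* Run one event through the table; returns -1 if no row matches. */
static int
sk_fsm(struct sk_iface *i, enum sk_event ev)
{
	size_t n;

	for (n = 0; n < sizeof(fsm_table) / sizeof(fsm_table[0]); n++) {
		int next;

		if (!(fsm_table[n].states & i->state) ||
		    fsm_table[n].event != ev)
			continue;
		next = fsm_table[n].next;
		if (fsm_table[n].action != NULL) {
			int res = fsm_table[n].action(i);

			if (next == 0)
				next = res;
		}
		i->state = next;
		return (0);
	}
	return (-1);
}
```

Keeping the transitions in one table makes it easy to compare them against the state diagram; an unexpected event simply finds no row and can be logged by the caller.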
Figure 11: Neighbor FSM (labels: DOWN, 2-WAY, EXSTART, SNAPSHOT, EXCHANGE, LOADING, FULL; AdjOK? true/false, NegotiationDone, Snapshot Done, ExchangeDone, LoadingDone)
DOWN
[...]
[...] The state is only entered if the Link-State Request list is not empty. In that case Link-State Request packets are sent out to fetch the more recent LSAs from the neighbor's LS database.
FULL
The two routers are now fully adjacent. The connection will now appear in router-LSAs and network-LSAs. Only in this state will real traffic be routed between the two routers.
Every hello-interval seconds a Hello packet is sent to all neighbors. On broadcast networks this is done with one multicast packet. The Hello packet is used for neighbor discovery and to maintain neighbor relationships. As a first step all the common options need to be compared. If one of hello-interval, router-dead-time, or the stub area flag differs the packet is not accepted. So all routers on a common network must have the same configuration for these values.
Code snip 8: neighbor look up
    switch (iface->type) {
    case IF_TYPE_POINTOPOINT:
    case IF_TYPE_VIRTUALLINK:
        /* match router-id */
        LIST_FOREACH(nbr, &iface->nbr_list, entry) {
            if (nbr == iface->self)
                continue;
            if (nbr->id.s_addr == rtr_id)
                break;
        }
        break;
    case IF_TYPE_BROADCAST:
    case IF_TYPE_NBMA:
    case IF_TYPE_POINTOMULTIPOINT:
        /* match src IP */
        LIST_FOREACH(nbr, &iface->nbr_list, entry) {
            if (nbr == iface->self)
                continue;
            if (nbr->addr.s_addr == src.s_addr)
                break;
        }
        break;
    default:
        fatalx("recv_hello: unknown interface type");
    }
    if (!nbr) {
        nbr = nbr_new(rtr_id, iface, 0);
        /* set neighbor parameters */
        nbr->dr.s_addr = hello.d_rtr;
        nbr->bdr.s_addr = hello.bd_rtr;
        nbr->priority = hello.rtr_priority;
        nbr_change = 1;
    }
The packet is now accepted and the neighbor is looked up, depending on the interface type either by router ID or by interface address. If no neighbor could be found a new one is created. A new neighbor is considered a NEIGHBORCHANGE and the nbr_change flag is set so that an interface neighbor change event can be issued later.
Code snip 9: bidirectional or not
    nbr_fsm(nbr, NBR_EVT_HELLO_RCVD);
    while (len >= sizeof(nbr_id)) {
        memcpy(&nbr_id, buf, sizeof(nbr_id));
        if (nbr_id == ospfe_router_id()) {
            /* seen myself */
            if (nbr->state & NBR_STA_PRELIM)
                nbr_fsm(nbr, NBR_EVT_2_WAY_RCVD);
            break;
        }
        buf += sizeof(nbr_id);
        len -= sizeof(nbr_id);
    }
Multiple neighbor events have to be generated. The first one is the hello received event. Next it is checked whether there is already bidirectional communication between the routers. This is done by walking through the list of neighbors in the Hello packet and comparing each entry with the own router ID. If no match is found a 1-WAY received event gets issued. If the match happens for the first time (the neighbor is in an embryonic state like INIT) a 2-WAY received event is generated.
Now the scariest part of OpenOSPFD is coming: handling fast start-ups and the famous interface event BACKUPSEEN. This part of the Hello protocol was rewritten multiple times and the result was always some other obscure problem in the election process. In the end OpenOSPFD had to violate the RFC a bit. The RFC is not very clear about how to handle the event BACKUPSEEN correctly.
From the RFC:
• If the neighbor is both declaring itself to be Designated Router (Hello Packet's Designated Router field = Neighbor IP address) and the Backup Designated Router field in the packet is equal to 0.0.0.0 and the receiving interface is in state Waiting, the receiving interface's state machine is scheduled with the event BACKUPSEEN. …
• If the neighbor is declaring itself to be Backup Designated Router (Hello Packet's Backup Designated Router field = Neighbor IP address) and the receiving interface is in state Waiting, the receiving interface's state machine is scheduled with the event BACKUPSEEN. …
Now this sounds simple but it isn't. The first case is not problematic but the second one is. Why? Because it is not known in which order Hello packets are received. What happens if we start an election process and the actual DR neighbor is still in state 1-WAY? Major confusion is the result. The election process evaluates the BDR as DR and itself as BDR or something like this and the result is a network with too many DR/BDR routers.
Code snip 10: scary fast start-ups
    if (iface->state & IF_STA_WAITING &&
        hello.d_rtr == nbr->addr.s_addr && hello.bd_rtr == 0)
        if_fsm(iface, IF_EVT_BACKUP_SEEN);

    if (iface->state & IF_STA_WAITING &&
        hello.bd_rtr == nbr->addr.s_addr) {
        /*
         * In case we see the BDR make sure that the DR is
         * around with a bidirectional connection
         */
        LIST_FOREACH(dr, &iface->nbr_list, entry)
            if (hello.d_rtr == dr->addr.s_addr &&
                dr->state & NBR_STA_BIDIR)
                if_fsm(iface, IF_EVT_BACKUP_SEEN);
    }
To clear up the situation OpenOSPFD does an additional check. It verifies that the DR has a bidirectional connection to the router and only if that is true a backup seen event is issued. The result is that it may take a bit longer to establish an adjacency and that some initial Database Description packets are dropped. But the confusion of too many DR/BDRs is avoided. The rest of recv_hello() is simply there to issue the possible neighbor change events that were detected earlier.
recv_db_description()
While the send_db_description() function ended up pretty simple recv_db_description() turned out to be [...]
Code snip 12: EXSTART scenario 2
    } else if (!(dd_hdr.bits & (OSPF_DBD_I | OSPF_DBD_MS))) {
        /* M only case: we are master */
        if (ntohl(dd_hdr.dd_seq_num) != nbr->dd_seq_num) {
            log_warnx("recv_db_description: invalid "
                "seq num, mine %x his %x",
                nbr->dd_seq_num,
                ntohl(dd_hdr.dd_seq_num));
            nbr_fsm(nbr, NBR_EVT_SEQ_NUM_MIS);
            return;
        }
        nbr->dd_seq_num++;

        /* packet may already have data so pass it on */
        if (len > 0) {
            nbr->dd_pending++;
            ospfe_imsg_compose_rde(IMSG_DD,
                nbr->peerid, 0, buf, len);
        }
        /* event negotiation done */
        nbr_fsm(nbr, NBR_EVT_NEG_DONE);
    if (iface->bdr) {
        hello.bd_rtr = iface->bdr->addr.s_addr;
        iface->self->bdr.s_addr = iface->bdr->addr.s_addr;
    } else
        hello.bd_rtr = 0;
    if (buf_add(buf, &hello, sizeof(hello)))
        goto fail;
Finally the packet is constructed. First of all the common OSPF header is added. This is done for every packet type and so a helper function gen_ospf_hdr() is used. The Hello specific contents are filled in afterwards and added with buf_add().
Code snip 18: Add active neighbors
    /* active neighbor(s) */
    LIST_FOREACH(nbr, &iface->nbr_list, entry) {
        if ((nbr->state >= NBR_STA_INIT) &&
            (nbr != iface->self))
            if (buf_add(buf, &nbr->id,
                sizeof(nbr->id)))
                goto fail;
    }
The Hello packets include a list of all neighbors from which a valid Hello was recently received. The code tests for state INIT or higher, so the list is not limited to bidirectional neighbors in state 2-WAY or higher. Again the neighbor IDs are added directly with buf_add(). The neighbor ID is already stored in network byte order; for values that are not, htonl() is used to switch the byte order.
Code snip 19: Final step
    /* update authentication and calculate checksum */
    if (auth_gen(buf, iface))
        goto fail;
    ret = send_packet(iface, buf->buf, buf->wpos,
        &dst);
    buf_free(buf);
    return (ret);
    fail:
    log_warn("send_hello");
    buf_free(buf);
    return (-1);
The last step is updating the authentication and checksum of the outgoing packet. The interface pointer is passed to auth_gen() to get the necessary keys and sequence number for the simple and cryptographic authentication. The packet gets sent out via send_packet(). Before sending the packet it is necessary to set the outgoing interface for multicast traffic. This is done by if_set_mcast() inside of send_packet(). Finally the no longer needed buffer is freed.
send_db_description()
send_db_description() implements the sending part of the database exchange. It sends out the initial Database Description packet when moving the neighbor state to EXSTART.
Code snip 20: Allocate fixed buffer
    if ((buf = buf_open(nbr->iface->mtu - sizeof(struct ip)))
        == NULL)
        fatal("send_db_description");
    /* OSPF header */
    if (gen_ospf_hdr(buf, nbr->iface, PACKET_TYPE_DD))
        goto fail;
    /* reserve space for database description header */
    if (buf_reserve(buf, sizeof(dd_hdr)) == NULL)
        goto fail;
The obvious difference to send_hello() is the use of buf_open() instead of buf_dynamic(). buf_open() allocates a fixed size buffer of size nbr->iface->mtu - sizeof(struct ip), which is the maximum packet size that does not get fragmented. Later buf_reserve() is used on that buffer to reserve sizeof(dd_hdr) bytes. The rest of the packet can be added and later buf_seek() can be used to write into the reserved space like this:
Code snip 21: Usage of buf_seek()
    memcpy(buf_seek(buf, sizeof(struct ospf_hdr),
        sizeof(dd_hdr)), &dd_hdr, sizeof(dd_hdr));
The remainder of the function sets up the Database Description header with its bit fields and sequence number. If in state EXCHANGE, as many LSA headers as possible are appended. While appending LSA headers one must keep in mind that the cryptographic authentication will append MD5_DIGEST_LENGTH bytes to the end of the packet.
send_ls_req()
Like send_db_description(), send_ls_req() uses buf_open() to get a buffer that doesn't get fragmented. While filling in the requested LSA headers some additional space gets reserved for the possible MD5 sum.
Code snip 22: Filling packet with requests
    /* LSA header(s), keep space for a possible md5 sum */
    for (le = TAILQ_FIRST(&nbr->ls_req_list); le != NULL &&
        buf->wpos + sizeof(struct ls_req_hdr) < buf->max -
        MD5_DIGEST_LENGTH; le = nle) {
        nbr->ls_req = nle = TAILQ_NEXT(le, entry);
        ls_req_hdr.type = htonl(le->le_lsa->type);
        ls_req_hdr.ls_id = le->le_lsa->ls_id;
        ls_req_hdr.adv_rtr = le->le_lsa->adv_rtr;
        if (buf_add(buf, &ls_req_hdr, sizeof(ls_req_hdr)))
            goto fail;
    }
The rest is straightforward and mostly the same as in send_hello().
send_ls_ack()
Actually we have to start in ls_ack_tx_timer() because send_ls_ack() is just the last step to send out an ack. send_ls_ack() will add the common OSPF header and add the data passed to the function to the packet. The list of acknowledgements is created by ls_ack_tx_timer() in a not so nice way and therefore it should not be used as an example for other code, especially as it will be rewritten soon.
send_ls_update()
Sending out LS updates is easy but the retransmission list and flooding procedure are a bit tricky. send_ls_update() will just add an LSA to a buffer together with a common OSPF header and send the
results out. But there is one thing that must be done with the LSA first. It has to be aged with the value of transmit-delay.
Code snip 23: LSA aging
    pos = buf->wpos;
    if (buf_add(buf, data, len))
        goto fail;
    /* age LSA before sending it out */
    memcpy(&age, data, sizeof(age));
    age = ntohs(age);
    if ((age += iface->transmit_delay) >= MAX_AGE)
        age = MAX_AGE;
    age = htons(age);
    memcpy(buf_seek(buf, pos, sizeof(age)), &age, sizeof(age));
First the current write position is stored and the LSA is added to the buffer. The LS age is stored in the first two bytes of the LSA. The memcpy() extracts the age because a direct memory access could end up on unaligned memory. Then the LSA is aged and written into the buffer with the help of buf_seek() and the previously stored position.
4.3.5 Control handling
The handling of control sessions is actually a small UNIX local socket server. There is a listener event (control_listen()) that accepts (control_accept()) connections and creates a per control connection structure. control_dispatch_imsg() reads the request from ospfctl. First the per connection structure is retrieved and then the imsgs that were sent are extracted. They get either forwarded to the parent or the RDE, or are answered directly. Messages forwarded to the other processes will often require a response that needs to be relayed to ospfctl because neither the RDE nor the parent process has access to the socket. Relaying is done by control_imsg_relay(). It has to be called for those imsgs that need to get forwarded. This is done in the imsg dispatch functions ospfe_dispatch_main() and ospfe_dispatch_rde().
4.4 Route Decision Engine
4.4.1 LS Database
The LS database is implemented as a red-black tree. Actually multiple trees exist: one per area and a global one for AS-external-LSAs. The key is the triple of LS type, LS ID and advertising router. The LSA is part of a vertex that forms a node of the network connectivity graph.
Code snip 24: struct vertex
    struct vertex {
        RB_ENTRY(vertex) entry;
        TAILQ_ENTRY(vertex) cand;
        struct event ev;
        struct in_addr nexthop;
        struct vertex *prev;
        struct rde_nbr *nbr;
        struct lsa *lsa;
        time_t changed;
        time_t stamp;
        u_int32_t cost;
        u_int32_t ls_id;
        u_int32_t adv_rtr;
        u_int8_t type;
        u_int8_t flooded;
    };
The vertex contains all necessary information not only for the LS Database but for the SPF calculation too. entry and cand are used to put the vertex into the red-black tree or into the candidate list respectively. The event ev is a per-LSA timeout used for aging. Additionally stamp is used for aging as well. changed is set to the time the last modification was done to the LSA. ls_id, adv_rtr and type are shorthands for the actual values that are stored inside of lsa. These are used by the tree search routine. The flooded flag indicates that an LSA was received as part of a flooding. Flooded LSAs are locked for MIN_LS_ARRIVAL seconds whereas requested LSAs are not. nbr represents the neighbor from which the LSA was received. nbr has nothing to do with the actual originator of the LSA. It is only kept to correctly flood out LSAs and to send an acknowledgement back to the neighbor. prev is the parent vertex in the SPF tree. It is possible to construct the actual path through the network by following all prev pointers. This is used to calculate the nexthop. The nexthop is the address for forwarding packets to that destination. It is normally the address of the last router-LSA before the root node.
4.4.2 LSA aging
Before using an LSA that is in the DB it normally needs to be aged. This is done by lsa_age() with the help of the vertex time stamp.
Code snip 25: LSA aging
    now = time(NULL);
    d = now - v->stamp;
    /* set stamp so that at least new calls work */
    v->stamp = now;
    if (d < 0) {
        log_warnx("lsa_age: time went backwards");
        return;
    }
    age = ntohs(v->lsa->hdr.age);
    if (age + d > MAX_AGE)
        age = MAX_AGE;
    else
        age += d;

    v->lsa->hdr.age = htons(age);
Normally it is enough to just add the difference between the current time and stamp. Nonetheless some additional care is needed. First of all time() returns the system time and this can be modified by the user. I remember a complete network outage at an ISP because the UNIX time got changed on a Zebra/Quagga router. Afterwards Zebra/Quagga was no longer working until a reboot of the changed machines was performed. So by checking that the difference is not negative it is at least possible to fail in a safe way. The other case that needs to be considered is that an LSA may never get older than MAX_AGE (1 hour).
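The aging rules of code snips 23 and 25 can be condensed into one helper. This is a hedged sketch: lsa_age_add() is a hypothetical name that does not exist in ospfd, MAX_AGE is taken as 3600 seconds (the 1 hour stated above), and the age is read with memcpy() for the same alignment reason given for send_ls_update().

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#define MAX_AGE 3600	/* seconds, 1 hour */

/*
 * Add `delta` seconds to the LS age stored in network byte order in
 * the first two bytes of `lsa`. Returns -1 and leaves the LSA alone
 * if delta is negative (time went backwards), 0 otherwise.
 */
static int
lsa_age_add(void *lsa, long delta)
{
	uint16_t age;
	long sum;

	if (delta < 0)
		return (-1);
	if (delta > MAX_AGE)
		delta = MAX_AGE;	/* avoids overflow in the sum */
	memcpy(&age, lsa, sizeof(age));	/* lsa may be unaligned */
	sum = (long)ntohs(age) + delta;
	if (sum > MAX_AGE)
		sum = MAX_AGE;		/* an LSA never gets older */
	age = htons((uint16_t)sum);
	memcpy(lsa, &age, sizeof(age));
	return (0);
}
```

The same helper covers both uses: delta is the transmit-delay when an LSA is sent out and the elapsed time since the last stamp when an LSA in the database is aged.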
4.4.3 Comparing LSA
There are two functions to compare LSAs. lsa_equal() is similar to a memcmp() but compares a bit more. One thing is important to note: LSAs with age MAX_AGE are never considered equal. This comes from the fact that lsa_equal() is mostly used to determine if a recalculation of the SPF tree is required or for similar situations. In that context LSAs with an age of MAX_AGE are always special and it is OK to force an update.
The other compare function is lsa_newer(), which implements the RFC specification of newer, equal and older LSAs. It works similarly to other compare functions by returning -1 if the first LSA is older, 1 if newer and 0 if equal to the second LSA passed. The function compares the sequence number, the LS checksum, and the LS age. Once again a bit of care needs to be taken when comparing ages.
Code snip 26: Comparing ages
    a16 = ntohs(a->age);
    b16 = ntohs(b->age);
    if (a16 >= MAX_AGE && b16 >= MAX_AGE)
        return (0);
    if (b16 >= MAX_AGE)
        return (-1);
    if (a16 >= MAX_AGE)
        return (1);

    i = b16 - a16;
    if (abs(i) > MAX_AGE_DIFF)
        return (i > 0 ? 1 : -1);

    return (0);
If both LSAs are at age MAX_AGE they are considered equal. If only one has age MAX_AGE that one is newer. Last but not least the LS ages need to be more than MAX_AGE_DIFF (15 minutes) apart to be considered different; in that case the LSA with the smaller age is newer.
4.4.4 LSA refresh
Every LS_REFRESH_TIME seconds an LSA needs to be refreshed by its originator. The age is reset to the initial value and the sequence number is bumped. After modifying the LSA the checksum has to be recalculated. The LSA is flooded and a new timeout event is registered. Non self originated LSAs have the same timer running but with MAX_AGE instead of LS_REFRESH_TIME. If the timer fires the LSA will be deleted from the LS database by flooding it out with age MAX_AGE. How to delete LSAs will be explained later as it is fairly complex.
4.4.5 LSA merging
If a self originated LSA changes, for example because a neighbor relationship is established or lost, an updated LSA needs to be reflooded. lsa_merge() takes care of replacing the LSA in the database with the new one and sets the LS sequence number of the new LSA to the currently used number.
Code snip 27: First set sequence number
    if (v == NULL) {
        lsa_add(nbr, lsa);
        rde_imsg_compose_ospfe(IMSG_LS_FLOOD, nbr->peerid,
            0, lsa, ntohs(lsa->hdr.len));
        return;
    }
    /*
     * set the seq_num to the current one.
     * lsa_refresh() will do the ++
     */
    lsa->hdr.seq_num = v->lsa->hdr.seq_num;
    /* recalculate checksum */
    len = ntohs(lsa->hdr.len);
    lsa->hdr.ls_chksum = 0;
    lsa->hdr.ls_chksum = htons(iso_cksum(lsa, len,
        LS_CKSUM_OFFSET));
Of course, if there was no LSA in the database in the first place there is no need to merge. It is enough to just add and flood the LSA. When changing the sequence number the checksum has to be recalculated. The sequence number is only set to the current value because there is no need to increase it already. Especially if lsa_merge() is used to remove a self originated LSA from the database there is no need to raise the sequence number; it is sufficient to set the age to MAX_AGE.
Code snip 28: Then overwrite and reflood if necessary
    /*
     * compare LSA; most header fields are equal
     * so don't check them
     */
    if (lsa_equal(lsa, v->lsa)) {
        free(lsa);
        return;
    }

    /* overwrite the lsa all other fields are unaffected */
    free(v->lsa);
    v->lsa = lsa;
    start_spf_timer();
    /* set correct timeout for reflooding the LSA */
    now = time(NULL);
    timerclear(&tv);
    if (v->changed + MIN_LS_INTERVAL >= now)
        tv.tv_sec = MIN_LS_INTERVAL;
    evtimer_add(&v->ev, &tv);
Now lsa_equal() is used to determine whether to actually reflood the LSA. If the LSA did not change there is nothing to modify and we're done. Otherwise the LSAs are exchanged and an SPF recalculation is issued. Finally the reflooding is prepared. This is done via a timer because it is not allowed to send out updates faster than every MIN_LS_INTERVAL (5) seconds.
4.4.6 lsa_self()
Identifying self originated LSAs is an important task. This comes from the fact that if a router leaves the network the other routers will not remove the LSAs of this router until the LS age hits MAX_AGE. If the router joins the network again, after a reboot for example, the old LSAs are still floating around. So it is the router's duty to detect those old self originated LSAs and renew them or remove them from the database. This task is done by lsa_self().
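The comparison rules of section 4.4.3 can be put together in one function. The sketch below is not the lsa_newer() of OpenOSPFD; the struct and function names are hypothetical and the fields are assumed to be in host byte order already. It follows the order described there (sequence number, then checksum, then age) with the return convention -1, 0, 1. Treating the sequence number as a signed 32 bit value and preferring the larger checksum are taken from the OSPF standard.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_AGE      3600	/* seconds, 1 hour */
#define MAX_AGE_DIFF  900	/* seconds, 15 minutes */

/* Hypothetical host byte order view of the compared fields. */
struct lsa_cmp_sketch {
	int32_t  seq_num;
	uint16_t ls_chksum;
	uint16_t age;
};

/* Returns 1 if a is newer than b, -1 if a is older, 0 if equal. */
static int
lsa_newer_sketch(const struct lsa_cmp_sketch *a, const struct lsa_cmp_sketch *b)
{
	int i;

	if (a->seq_num != b->seq_num)
		return (a->seq_num > b->seq_num ? 1 : -1);
	if (a->ls_chksum != b->ls_chksum)
		return (a->ls_chksum > b->ls_chksum ? 1 : -1);

	/* same age handling as in code snip 26 */
	if (a->age >= MAX_AGE && b->age >= MAX_AGE)
		return (0);
	if (b->age >= MAX_AGE)
		return (-1);
	if (a->age >= MAX_AGE)
		return (1);

	/* the smaller age wins, but only if the gap is large enough */
	i = (int)b->age - (int)a->age;
	if (abs(i) > MAX_AGE_DIFF)
		return (i > 0 ? 1 : -1);
	return (0);
}
```

The age is only consulted when sequence number and checksum are identical, so two instances that differ slightly in age are still treated as the same instance.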
Code snip 29: Detect self originated LSA
    if (nbr->self)
        return (0);

    if (rde_router_id() == new->hdr.adv_rtr)
        goto self;

    if (new->hdr.type == LSA_TYPE_NETWORK)
        LIST_FOREACH(iface, &nbr->area->iface_list, entry)
            if (iface->addr.s_addr == new->hdr.ls_id)
                goto self;
    return (0);
First of all the newly received LSA (new) gets classified. If the router ID is the same or if an interface address matches the LS ID of a network-LSA the LSA is considered self originated.
Code snip 30: Remove or update
    self:
    if (v == NULL) {
        /*
         * LSA is no longer announced, remove by premature
         * aging. The problem is that new may not be
         * altered so a copy needs to be added to the LSA
         * DB first.
         */
        if ((dummy = malloc(ntohs(new->hdr.len))) == NULL)
            fatal("lsa_self");
        memcpy(dummy, new, ntohs(new->hdr.len));
        dummy->hdr.age = htons(MAX_AGE);
        /*
         * The clue is that by using the remote nbr as
         * originator the dummy LSA will be reflooded via
         * the default timeout handler.
         */
        lsa_add(rde_nbr_self(nbr->area), dummy);
        return (1);
    }
    /*
     * LSA is still originated, just reflood it. But we need to
     * create a new instance by setting the LSA sequence number
     * equal to the one of new and calling lsa_refresh().
     * Flooding will be done by the caller.
     */
    v->lsa->hdr.seq_num = new->hdr.seq_num;
    lsa_refresh(v);
    return (1);
For a self originated LSA there are two cases. The first one is that the LSA is no longer announced. In that case the LSA gets added to the database with an LS age of MAX_AGE. The database code will then reflood the LSA as soon as possible and by doing that remove it from the database. There is no other way of doing this because removing LSAs is a complex task that only works if the LSA is in the database. The other case is much simpler because there is already a self originated LSA in the local database but the sequence number is lower than the new one. In this case the sequence number is bumped like in the lsa_merge() case and lsa_refresh() is called to flood the LSA.
4.4.7 LSA check
Before even accepting an LS update the embedded LSA has to be verified. Once again lengths are compared and especially the ISO checksum is verified. Additionally the LSAs that are sent to stub areas get silently discarded. At the end the LS age is checked and if it is MAX_AGE some special care needs to be taken.
Code snip 31: MAX_AGE handling
    if (lsa->hdr.age == htons(MAX_AGE) &&
        !nbr->self && lsa_find(area, lsa->hdr.type,
        lsa->hdr.ls_id, lsa->hdr.adv_rtr) == NULL &&
        !rde_nbr_loading(area)) {
        /*
         * if no neighbor in state Exchange or Loading
         * ack LSA but don't add it. Needs to be a direct
         * ack.
         */
        rde_imsg_compose_ospfe(IMSG_LS_ACK, nbr->peerid, 0,
            &lsa->hdr, sizeof(struct lsa_hdr));
        return (0);
    }
If the LS age is MAX_AGE and the LSA is not in the database it seems there is no need to add the LSA to the database. However this is a fallacy; some additional checks are required. The RFC mentions that if a neighbor is currently establishing an adjacency (state EXCHANGE or LOADING) no short-cuts are allowed. Additionally self originated LSAs generated by the OSPF engine have to be passed. Therefore nbr->self is tested. If all conditions are met the LSA will not be added. Instead only a direct acknowledgement is sent back.
4.4.8 Deleting LSA
Deleting something from a replicated distributed database is not a trivial task, especially as there is no LS remove packet type. Removing is done via the LS age. LSAs with LS age MAX_AGE are ready to be removed from the database. For OpenOSPFD removing LSAs is even more complicated. To remove an LSA it first has to be reflooded and all neighbors have to acknowledge the reception before it can be removed from the database. In OpenOSPFD the database and the retransmission logic are in two different processes so additional IPC is needed. If the RDE tries to delete the LSA, either because it exceeds the MAX_AGE age or because of premature aging (used to clean the database of no longer valid LSAs), it simply sets the age to MAX_AGE and sends a flood request to the OSPF engine. The OSPF engine will then start the flooding procedure. The LSA is added to the LSA cache and the different retransmission lists refer to the cached LSA. If the last reference to the cached object drops the following happens:
Code snip 32: lsa_cache_put()
    void
    lsa_cache_put(struct lsa_ref *ref, struct nbr *nbr)
    {
        if (--ref->refcnt > 0)
            return;

        if (ntohs(ref->hdr.age) >= MAX_AGE)
            ospfe_imsg_compose_rde(IMSG_LS_MAXAGE,
LS age and sequence number are checked to be in a valid nbr->peerid, 0, ref->data,
sizeof(struct lsa_hdr));
range. Per LS type checks follow the generic ones. It is
verified that the packet has the right size for this type and free(ref->data);
LIST_REMOVE(ref, entry);
that values like the metric – which is a 24bit value stored free(ref);
}
as 32bit integer is in the correct range. AS-external-
OpenOSPFD – design and implementation Claudio Jeker
If there is still a neighbor in state EXCHANGE or LOADING the LSA may
not be removed, because that neighbor may request the LSA just a bit
later. Otherwise the LSA is looked up in the database and the database
entry is compared with the LSA that should be removed. If the database
entry is newer it is kept; otherwise it is finally removed from the
database and freed.

4.4.9 SPF and RIB calculation

The SPF calculation is still a large construction area. The code
should be split up as some steps are not necessary in all cases.
Especially on ABRs this is not optimal and creates a lot of
superfluous load. Worth knowing: RIB and FIB are terms from BGP and
got inherited into OpenOSPFD. RIB is the Routing Information Base and
FIB is the Forwarding Information Base. The FIB is mostly the kernel
routing table and is stripped from

The loop starts at the root vertex and moves through one vertex after
another. After a vertex is selected, all vertices that are connected
to it are extracted and added to the candidate list. After all
vertices are added, the one with the lowest cost is popped from the
list and the loop starts over with this vertex. Before a vertex is
added to the candidate list it is verified that the connection is
still valid.

Code snip 35: the three dots in the previous snippet
	if (w == NULL)
		continue;

	if (w->lsa->hdr.age == MAX_AGE)
		continue;
	if (!linked(w, v))
		continue;
	if (v->type == LSA_TYPE_ROUTER)
		d = v->cost + ntohs(rtr_link->metric);
	else
		d = v->cost;
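The loop and the cost rule of Code snip 35 can be restated as a small stand-alone fragment. This is only a sketch under simplifying assumptions, not the spf_calc() of OpenOSPFD: the graph is a fixed-size adjacency matrix, the sorted candidate list is replaced by a linear scan for the cheapest unfinished vertex, and every name in it (struct graph, graph_init(), graph_link(), spf_run(), NO_LINK, INF_COST) is made up for the illustration. The back-link test follows the requirement of RFC 2328 that a link is only used if the other end points back.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_V		8
#define NO_LINK		UINT32_MAX	/* no edge between two vertices */
#define INF_COST	UINT32_MAX	/* vertex not reached yet */

enum vtype { V_ROUTER, V_NETWORK };

struct graph {
	int		n;			/* vertices in use */
	enum vtype	type[MAX_V];
	uint32_t	metric[MAX_V][MAX_V];	/* metric[v][w]: link v -> w */
};

void
graph_init(struct graph *g, int n)
{
	int v, w;

	assert(n > 0 && n <= MAX_V);
	g->n = n;
	for (v = 0; v < MAX_V; v++) {
		g->type[v] = V_ROUTER;
		for (w = 0; w < MAX_V; w++)
			g->metric[v][w] = NO_LINK;
	}
}

void
graph_link(struct graph *g, int v, int w, uint32_t metric)
{
	assert(v >= 0 && v < g->n && w >= 0 && w < g->n);
	g->metric[v][w] = metric;
}

/* Fill cost[] with the lowest cost from root to every vertex. */
void
spf_run(const struct graph *g, int root, uint32_t cost[MAX_V])
{
	int done[MAX_V] = { 0 };
	int i, v, w;
	uint32_t d;

	assert(root >= 0 && root < g->n);
	for (i = 0; i < g->n; i++)
		cost[i] = INF_COST;
	cost[root] = 0;

	for (;;) {
		/* pop the candidate with the lowest cost */
		v = -1;
		for (i = 0; i < g->n; i++)
			if (!done[i] && cost[i] != INF_COST &&
			    (v == -1 || cost[i] < cost[v]))
				v = i;
		if (v == -1)
			break;
		done[v] = 1;

		for (w = 0; w < g->n; w++) {
			if (done[w] || g->metric[v][w] == NO_LINK)
				continue;
			/* require a link back from w to v */
			if (g->metric[w][v] == NO_LINK)
				continue;
			/* only links leaving a router carry a cost */
			if (g->type[v] == V_ROUTER)
				d = cost[v] + g->metric[v][w];
			else
				d = cost[v];
			/* new or shorter path: update the candidate */
			if (d < cost[w])
				cost[w] = d;
		}
	}
}
```

With two routers attached to one network, where router 0 uses an interface cost of 10 and router 2 a cost of 5, spf_run() started at router 0 reaches router 2 with cost 10, while started at router 2 it reaches router 0 with cost 5. Each router's interface cost only influences its own outgoing direction.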
linked(). The next steps calculate the cost to the new vertex w. There
is one important thing to note: only links into a network have a cost,
links from the network to the router have no cost. As a result,
modifying the cost of an interface will often not change the incoming
traffic flow; only outgoing traffic may be rerouted due to the change.
Before adding a vertex to the candidate list it is necessary to check
whether the vertex is already on the list. If it is, the calculated
cost is compared with the current one: the new path must be shorter
than the currently selected one. In that case the cost and the prev
pointer are modified and the nexthop is recalculated. The vertex is
also removed from the candidate list and later added back to keep the
list sorted. If the vertex is not on the candidate list, cost and prev
pointer are initialised and the nexthop is calculated. Finally the new
candidate is added to the list of candidates.

Now the RIB needs to be built. To start, the area specific routes are
added. First of all, all LSAs with LS age MAX_AGE, a cost of
LS_INFINITY, or a zero nexthop address are skipped; they are invalid.
All valid network-LSAs are added to the RIB, and all router-LSAs for
ABRs and ASBRs are added as well. Summary-LSAs are put into the RIB:
on ABRs only for area 0, on non-ABRs there is no limitation. A
summary-LSA is only valid if the ABR was previously added to the RIB.
The

				noack += lsa_flood(iface, nbr,
				    &lsa_hdr, imsg.data, l);
			}
		}
	} else {
		/*
		 * flood on all area interfaces on
		 * area 0.0.0.0 include also virtual links.
		 */
		area = nbr->iface->area;
		LIST_FOREACH(iface, &area->iface_list, entry) {
			noack += lsa_flood(iface, nbr,
			    &lsa_hdr, imsg.data, l);
		}
	}

Before starting the flooding decision process the LS update is added
to the LSA cache. Later, if the LSA is added to different
retransmission queues, only a reference to the LSA cache is retained.
Depending on the LS type it must be flooded to all areas
(AS-external-LSA) or only to the current area (all other LSAs).
lsa_flood() does the per interface part of the flooding. More about
that a bit later.

Code snip 37: flooding part2
	/* remove from ls_req_list */
	le = ls_req_list_get(nbr, &lsa_hdr);
	if (!(nbr->state & NBR_STA_FULL) && le != NULL) {
		ls_req_list_free(nbr, le);
		/*
		 * XXX no need to ack requested lsa
		 * the problem is that the RFC is very
		 * unclear about this.
		 */
		noack = 1;
	}
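The scope rule stated above (AS-external-LSAs go to all areas, everything else stays in the current area) fits in a few lines. The fragment is a sketch with made-up types and names (struct area with an interface count and a stub flag, flood_scope()); it only counts the interfaces an LSA would be offered to and leaves out the per interface work of lsa_flood(). Skipping stub areas for AS-external-LSAs is taken from RFC 2328 and is an addition to what the text describes.

```c
#include <assert.h>
#include <stddef.h>

#define LSA_TYPE_EXTERNAL	5	/* AS-external-LSA, RFC 2328 */

struct area {
	int	n_ifaces;	/* interfaces configured in this area */
	int	stub;		/* non-zero for a stub area */
};

/*
 * Return the number of interfaces an LSA of type ls_type, received in
 * area number cur, would be offered to.
 */
int
flood_scope(const struct area *areas, size_t n_areas, size_t cur,
    int ls_type)
{
	size_t i;
	int n = 0;

	assert(cur < n_areas);
	if (ls_type == LSA_TYPE_EXTERNAL) {
		/* AS scope: every area except stub areas */
		for (i = 0; i < n_areas; i++)
			if (!areas[i].stub)
				n += areas[i].n_ifaces;
	} else {
		/* area scope: only the area the LSA belongs to */
		n = areas[cur].n_ifaces;
	}
	return (n);
}
```

For three areas with 2, 4 and 1 interfaces, the second one being a stub area, an AS-external-LSA is offered to 3 interfaces, while a router-LSA received in the second area is offered to its 4 interfaces only.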
After inspecting every neighbor and adding LSA references to the
retransmission lists an initial flooding gets sent out. If nothing got
queued there is no reason to send the LSA and the function returns. In
the other cases the update is sent to the correct address. For
point-to-point links it is always AllSPFRouters. For broadcast
networks it is either AllSPFRouters or AllDRouters, so that the update
is multicast to the correct group. All other interface types use
unicast to send the updates. Before sending out the LS update a
special check is done, mostly for broadcast and NBMA networks. In case
the originator of the initial LS update is on the now outgoing
interface more checks have to be done. First of all, if the originator
is DR or BDR there is no need to send an update; the actual flooding
was already done by the DR or BDR respectively. Additionally, if the
router itself is BDR there is no need to flood the network. This will
be done by the DR. If neither of these two tests was true it is now
clear that no acknowledgement needs to be sent back. Therefore
dont_ack is bumped a second time and so lsa_flood() will return true.

4.5.2 Retransmission Lists and LSA Cache

Now let's have a look at the retransmission lists. All other lists
(acknowledge, request, and database descriptor list) are implemented
in a similar way. The retransmission list is a bit more complex
because of the LSA cache. To add an LS update to the retransmission
list ls_retrans_list_add() is used.

Code snip 44: ls_retrans_list_free()
void
ls_retrans_list_free(struct nbr *nbr, struct lsa_entry *le)
{
	TAILQ_REMOVE(&nbr->ls_retrans_list, le, entry);

	lsa_cache_put(le->le_ref, nbr);
	free(le);
}

ls_retrans_list_free() not only unlinks the LSA from the
retransmission list but also hands the LSA cache reference back by
calling lsa_cache_put(). Again it is important to take care of those
references.

How does this LSA cache work?
The LSA cache is nothing more than a hash list. A simple hash is built
over the LSA header and used to find the correct hash bucket. In the
LSA cache an LSA is identified not only by LS type, LS ID, and
advertising router; the sequence number and LS checksum are compared
as well. To find an LSA in the cache the internal lsa_cache_look()
function is used. lsa_cache_get() returns a new reference to an
existing LSA.

Code snip 45: lsa_cache_get()
struct lsa_ref *
lsa_cache_get(struct lsa_hdr *lsa_hdr)
{
	struct lsa_ref *ref;

	ref = lsa_cache_look(lsa_hdr);
	if (ref)
		ref->refcnt++;
	return (ref);
}
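A reference-counted cache of this kind can be sketched in a few lines. The fragment below is an illustration, not the OpenOSPFD code: it uses a single list instead of hash buckets, keeps only the identifying header fields, and the names struct lsa_key, struct cache_ref, cache_look(), cache_add(), cache_get() and cache_put() are invented for the example. It shows the two points made in the text: an entry is matched on sequence number and checksum in addition to type, LS ID and advertising router, and the entry is freed only when the last reference is handed back.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* fields that identify one instance of an LSA */
struct lsa_key {
	uint8_t		type;
	uint32_t	ls_id;
	uint32_t	adv_rtr;
	uint32_t	seq_num;
	uint16_t	checksum;
};

struct cache_ref {
	struct cache_ref	*next;
	struct lsa_key		 key;
	int			 refcnt;
};

static struct cache_ref *cache_head;	/* one bucket, for brevity */

static int
key_equal(const struct lsa_key *a, const struct lsa_key *b)
{
	return (a->type == b->type && a->ls_id == b->ls_id &&
	    a->adv_rtr == b->adv_rtr && a->seq_num == b->seq_num &&
	    a->checksum == b->checksum);
}

/* Find an entry without taking a reference. */
struct cache_ref *
cache_look(const struct lsa_key *k)
{
	struct cache_ref *ref;

	for (ref = cache_head; ref != NULL; ref = ref->next)
		if (key_equal(&ref->key, k))
			return (ref);
	return (NULL);
}

/* Insert a new entry, or take a reference if it already exists. */
struct cache_ref *
cache_add(const struct lsa_key *k)
{
	struct cache_ref *ref;

	if ((ref = cache_look(k)) != NULL) {
		ref->refcnt++;
		return (ref);
	}
	if ((ref = calloc(1, sizeof(*ref))) == NULL)
		return (NULL);
	ref->key = *k;
	ref->refcnt = 1;
	ref->next = cache_head;
	cache_head = ref;
	return (ref);
}

/* Take a new reference to an existing entry, NULL if not cached. */
struct cache_ref *
cache_get(const struct lsa_key *k)
{
	struct cache_ref *ref;

	if ((ref = cache_look(k)) != NULL)
		ref->refcnt++;
	return (ref);
}

/* Hand a reference back; returns 1 if the entry was freed. */
int
cache_put(struct cache_ref *ref)
{
	struct cache_ref **pp;

	assert(ref != NULL && ref->refcnt > 0);
	if (--ref->refcnt > 0)
		return (0);
	for (pp = &cache_head; *pp != NULL; pp = &(*pp)->next)
		if (*pp == ref) {
			*pp = ref->next;
			break;
		}
	free(ref);
	return (1);
}
```

Every list that stores such a pointer has to pair its cache_get() with exactly one cache_put(). A forgotten put leaks the entry, a doubled one frees it while another list still points at it.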
There are three bits that have to be set. The E bit indicates that the
router is an AS border router and will announce AS-external routes.
The E bit is used in the SPF calculation and for summary-LSAs. In the
SPF calculation routers with the E bit set are added to the RIB.
Without the E bit, all AS-external routes using this router as
advertising router are considered invalid because the router is not
present in the RIB. Something similar happens for summary-LSAs: on
ABRs router summary-LSAs are generated for every router with the E bit
set. OpenOSPFD cheats a bit with the E bit by setting it as soon as it
is possible that an AS-external route is redistributed and not when
the router actually redistributes a route. Other implementations have
the same sloppy behaviour.

Even more complex is setting the B bit, which is used to mark ABRs. As
soon as a router is part of two active areas the B bit has to be set
on all router-LSAs. area_border_router() returns true if there are two
or more active areas. If the state of the ABR changes all self
originated router-LSAs in all areas have to be updated. This is done
via orig_rtr_lsa_all() which in turn calls orig_rtr_lsa() for all
areas but the current one. Afterwards setting the B bit is no longer a
problem.

The last bit that can be set is the V bit. It is used to mark
interfaces where a virtual link is terminated. Areas where one router
has the V bit set are transit areas. Transit areas need some special
handling in the SPF calculation; for example, it is not allowed to
send aggregated summary routing information into a transit area.

4.5.4 ABR and summary-LSA

The code handling ABRs and summary-LSAs is still in some flux. There
are too many workarounds and some stuff is still missing. Let's have a
look at it anyway. It actually starts in the SPF calculation. The code
that recalculates the RIB currently looks like this:

Code snip 50: SPF timer
	rt_invalidate();

	LIST_FOREACH(area, &conf->area_list, entry)
		spf_calc(area);

	RB_FOREACH(r, rt_tree, &rt) {
		LIST_FOREACH(area, &conf->area_list, entry)
			rde_summary_update(r, area);
		if (r->d_type != DT_NET)
			continue;
		if (r->invalid)
			rde_send_delete_kroute(r);
		else
			rde_send_change_kroute(r);
	}

	LIST_FOREACH(area, &conf->area_list, entry)
		lsa_remove_invalid_sums(area);

	start_spf_holdtimer(conf);

First the RIB is invalidated by flagging routes as invalid. While
doing that, old invalid routes are removed from the tree. Afterwards
the SPF calculation is run for every area. This is one of the things
that should be changed: there is no need to recalculate an area if
there were no changes in that area. In the next step a walk over the
RIB is done. By calling rde_summary_update() for every area and every
route all required summary information is generated. Afterwards the
kernel routing table is updated by sending change or delete messages
to the parent process. This is only done for routes that describe
networks. After that, old invalid summary-LSAs get removed from all
areas. Finally the hold timer is started. The hold timer is specified
in the RFC so that repeated SPF calculations do not overwhelm
underpowered routers.

rde_summary_update() decides whether it is necessary to create a
summary-LSA.

Code snip 51: Is summary-LSA needed?
	/* first check if we actually need to announce this route */
	if (!(rte->d_type == DT_NET || rte->flags & OSPF_RTR_E))
		return;
	/* never create summaries for as-ext LSA */
	if (rte->p_type == PT_TYPE1_EXT || rte->p_type ==
	    PT_TYPE2_EXT)
		return;
	/* no need for summary LSA in the originating area */
	if (rte->area.s_addr == area->id.s_addr)
		return;
	/* TODO nexthop check, nexthop part of area -> no summary */
	if (rte->cost >= LS_INFINITY)
		return;
	/* TODO AS border router specific checks */
	/* TODO inter-area network route stuff */
	/* TODO intra-area stuff -- condense LSA ??? */

First of all, only network routes or router routes where the E bit is
set are summarised into other areas. The E bit is the same as the one
in router-LSAs specifying that the router is an ASBR. An ASBR has to
be added to other areas so that they can validate the
AS-external-LSAs. As AS-external routes are flooded through all areas
there is no need to create summaries for those networks. The
originating area and all invalid routes are skipped. Finally there are
some other minor but very complicated things left out for now.

Code snip 52: update summary-LSA
	/* update lsa but only if it was changed */
	if (rte->d_type == DT_NET) {
		type = LSA_TYPE_SUM_NETWORK;
		v = lsa_find(area, type, rte->prefix.s_addr,
		    rde_router_id());
	} else if (rte->d_type == DT_RTR) {
		type = LSA_TYPE_SUM_ROUTER;
		v = lsa_find(area, type, rte->adv_rtr.s_addr,
		    rde_router_id());
	} else
		fatalx("orig_sum_lsa: unknown route type");

	lsa = orig_sum_lsa(rte, type);
	lsa_merge(rde_nbr_self(area), lsa, v);

	if (v == NULL) {
		if (rte->d_type == DT_NET)
			v = lsa_find(area, type,
			    rte->prefix.s_addr, rde_router_id());
		else
			v = lsa_find(area, type,
			    rte->adv_rtr.s_addr, rde_router_id());
	}
	v->cost = rte->cost;

To update the LS DB lsa_merge() is used. Before it is possible to call
lsa_merge() two things have to be done. First the current database
version of the LSA has to be found. Secondly a new LSA is generated by
orig_sum_lsa(). After merging the LSA it is necessary
to update the cost of the vertex so that a later call to
lsa_remove_invalid_sums() sees that this vertex is still in use. In
case the LSA was newly added the previous lsa_find() returned NULL, so
the search has to be repeated to get a valid vertex.

lsa_remove_invalid_sums() does nothing more than a tree walk looking
for summary-LSAs with a cost of LS_INFINITY. It removes those by
setting their age to MAX_AGE and calling lsa_timeout() to flood them
out.

4.5.5 Originating AS-external-LSA

To redistribute AS-external-LSAs the parent process sends a list of
candidates to the RDE. The RDE uses rde_asext_get() to convert the
kroute into an LSA, and with the help of lsa_find() and lsa_merge()
the LSA is added to the database. Similarly, on removal
rde_asext_put() is used to get the no longer needed LSA, and again
lsa_find() and lsa_merge() do the actual job.

rde_asext_put() has a more or less simple job: find the kroute, remove
it from the list and create an LSA with LS age MAX_AGE if the LSA was
used.

Code snip 53: rde_asext_put()
	LIST_FOREACH(ae, &rde_asext_list, entry)
		if (kr->prefix.s_addr == ae->kr.prefix.s_addr &&
		    kr->prefixlen == ae->kr.prefixlen) {
			LIST_REMOVE(ae, entry);
			used = ae->used;
			free(ae);
			if (used)
				return (orig_asext_lsa(kr,
				    MAX_AGE));
			break;
		}
	return (NULL);

rde_asext_get(), on the other hand, has a bit more to do. It first
checks whether the route was already added before. In that case the
route needs to be updated, else a new one is created.

Code snip 54: rde_asext_get() part 1
	LIST_FOREACH(ae, &rde_asext_list, entry)
		if (kr->prefix.s_addr == ae->kr.prefix.s_addr &&
		    kr->prefixlen == ae->kr.prefixlen)
			break;

	if (ae == NULL) {
		if ((ae = calloc(1, sizeof(*ae))) == NULL)
			fatal("rde_asext_get");
		LIST_INSERT_HEAD(&rde_asext_list, ae, entry);
	}
	memcpy(&ae->kr, kr, sizeof(ae->kr));

	wasused = ae->used;
	ae->used = rde_redistribute(kr);

The next task is to find out if the route should be redistributed. The
actual logic is in rde_redistribute(), so let's have a look at that.

Code snip 55: rde_redistribute()
int
rde_redistribute(struct kroute *kr)
{
	struct area	*area;
	struct iface	*iface;
	int		 rv = 0;

	if (!(kr->flags & F_KERNEL))
		return (0);

	if ((rdeconf->options & OSPF_OPTION_E) == 0)
		return (0);
	if ((rdeconf->redistribute_flags &
	    REDISTRIBUTE_DEFAULT) &&
	    (kr->prefix.s_addr == INADDR_ANY &&
	    kr->prefixlen == 0))
		return (1);
	/* only allow 0.0.0.0/0 if REDISTRIBUTE_DEFAULT */
	if (kr->prefix.s_addr == INADDR_ANY &&
	    kr->prefixlen == 0)
		return (0);
	if ((rdeconf->redistribute_flags &
	    REDISTRIBUTE_STATIC) &&
	    (kr->flags & F_STATIC))
		rv = 1;
	if ((rdeconf->redistribute_flags &
	    REDISTRIBUTE_CONNECTED) &&
	    (kr->flags & F_CONNECTED))
		rv = 1;

	/*
	 * interface is not up and running so don't
	 * announce
	 */
	if (kif_validate(kr->ifindex) == 0)
		return (0);
	LIST_FOREACH(area, &rdeconf->area_list, entry)
		LIST_FOREACH(iface, &area->iface_list,
		    entry) {
			if ((iface->addr.s_addr &
			    iface->mask.s_addr) ==
			    kr->prefix.s_addr &&
			    iface->mask.s_addr ==
			    prefixlen2mask(kr->prefixlen))
				/* already announced
				 * as net LSA */
				rv = 0;
		}
	return (rv);
}

First it is checked whether anything has to be redistributed at all.
Afterwards the default route gets handled. The default route is only
redistributed if explicitly requested via "redistribute default".
Depending on the flags it is then decided whether the route gets
redistributed. The interface state is checked, and finally all
configured interfaces are inspected to see whether the route is
already part of a network-LSA or is announced as a stub network.

After the rde_redistribute() call it is clear what remains to be done.

Code snip 56: rde_asext_get() part 2
	if (ae->used)
		/* update of seqnum is done by lsa_merge */
		return (orig_asext_lsa(kr, DEFAULT_AGE));
	else if (wasused)
		/*
		 * lsa_merge will take care of removing the
		 * lsa from the db
		 */
		return (orig_asext_lsa(kr, MAX_AGE));
	else
		/* not in lsdb, superseded by a net lsa */
		return (NULL);
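The two small decisions made in this section, the interface match at the end of rde_redistribute() and the three-way outcome of rde_asext_get(), can be tried out in isolation. The fragment below is a sketch and not the OpenOSPFD code: addresses are plain host byte order integers instead of struct in_addr, and len2mask(), covered_by_iface(), asext_decide() and the enum values are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

enum asext_action { ASEXT_NONE, ASEXT_ANNOUNCE, ASEXT_WITHDRAW };

/* Netmask in host byte order for a prefix length of 0..32. */
uint32_t
len2mask(int len)
{
	assert(len >= 0 && len <= 32);
	if (len == 0)
		return (0);
	return ((uint32_t)0xffffffffU << (32 - len));
}

/*
 * True if prefix/len is exactly the network the interface is attached
 * to, in which case the route is already covered by the area LSAs.
 */
int
covered_by_iface(uint32_t if_addr, uint32_t if_mask, uint32_t prefix,
    int len)
{
	return ((if_addr & if_mask) == prefix &&
	    if_mask == len2mask(len));
}

/*
 * wasused: route was announced before this update.
 * used:    route should be announced now.
 */
enum asext_action
asext_decide(int wasused, int used)
{
	if (used)
		return (ASEXT_ANNOUNCE);	/* LSA with a fresh age */
	if (wasused)
		return (ASEXT_WITHDRAW);	/* LSA with age MAX_AGE */
	return (ASEXT_NONE);			/* nothing in the LS DB */
}
```

An interface 192.168.1.5/24 covers the route 192.168.1.0/24 but neither 192.168.1.0/25 nor 192.168.2.0/24, so only the first would be suppressed as already announced.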
Bibliography
[1] Moy, J. OSPF version 2. RFC 2328, April 1998.
[2] Moy, J. OSPF: Anatomy of an Internet Routing Protocol. Addison-Wesley, September 1998.
[3] OpenBSD, http://www.openbsd.org/
[4] OpenBGPD, http://www.openbgpd.org/
[5] OpenOSPFD source code, http://www.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/ospfd/