Lecture 22: routing, and start of TCP

Routing

The network layer (also called the internet or IP layer) is responsible for delivering packets between hosts on different local area networks.

Analogy: the mail room in the department is like a local area network; the postal service routes packets (letters) from the CS department mailroom to mailboxes all over the world. The postal service is analogous to the network layer.

The Internet Protocol (IP) is the network-layer protocol that runs the internet. There are two versions of IP in use: version 4 and version 6. We will describe version 4.

Each host on the internet has a 32-bit IP address (typically written as four decimal numbers separated by periods; IPv6 uses 128-bit addresses). An IP packet contains a destination IP address; the goal of the IP layer is to deliver packets to their destinations, by routing them through many networks. A router is a machine that reads packets from one network interface and forwards them on another.

One way to accomplish this is to use source routing: before sending a packet the sender examines the network and selects a path, encoding the path into the packet. Source routing is impractical for a number of reasons: the sender would need to know the topology of the entire network, the topology may change while the packet is in flight, and carrying the full path adds overhead to every packet.

Instead, IP uses path routing. With path routing, the packet contains only the destination address; routers decide which "next hop" to forward each packet on to get it closer to its destination. Each router makes this decision locally, based only on the destination address and its local configuration.

Routing tables

One way to store this configuration is using routing tables. A routing table contains several entries, each containing a destination network and a next hop. The destination network is specified by an address / netmask pair. An IP address x is in the network y/m if x & m = y (here & is bitwise and). For example, the address 192.168.3.4 is in the network 192.0.0.0/255.0.0.0, and is also in the network 192.168.0.0/255.255.0.0, but is not in the network 192.0.0.0/255.255.255.0.
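The membership test can be sketched in Python (the helper names `to_int` and `in_network` are ours, not part of any standard library):

```python
def to_int(addr):
    """Convert a dotted-decimal IPv4 address to a 32-bit integer."""
    a, b, c, d = (int(part) for part in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def in_network(x, y, m):
    """Return True if address x is in the network y/m, i.e. x & m == y."""
    return to_int(x) & to_int(m) == to_int(y)

# The examples from the text:
print(in_network("192.168.3.4", "192.0.0.0", "255.0.0.0"))      # True
print(in_network("192.168.3.4", "192.168.0.0", "255.255.0.0"))  # True
print(in_network("192.168.3.4", "192.0.0.0", "255.255.255.0"))  # False
```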

To determine the next hop for a given packet, the router compares the destination address to each of the entries in the routing table (by anding it with the netmask and comparing the result to the network address). It forwards the packet to the next hop of the first entry that matches.

For example, suppose a router is connected to four networks, n1, n2, n3, and n4, and that it has the following routing table:

dest. addr   netmask           next-hop
1.2.3.0      255.255.255.0     n1
1.2.0.0      255.255.0.0       n2
1.3.0.0      255.255.0.0       n3
1.4.6.2      255.255.255.255   n4
0.0.0.0      0.0.0.0           n1

While routing a packet destined for 1.2.3.4, it will compare it to the first row, and find that it matches (because 1.2.3.4 & 255.255.255.0 = 1.2.3.0), so the packet will be routed to n1. If the packet is destined for 1.2.5.6, the first row will not match, but the second will, so it will be forwarded to n2.

Similarly, a packet destined for 1.4.6.5 matches only the last (default) row and will be routed to n1, while a packet destined for 1.4.6.2 matches the fourth row exactly and will be routed to n4.
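The first-match lookup can be sketched as follows, using the example table above (the function and variable names are ours):

```python
def to_int(addr):
    """Convert a dotted-decimal IPv4 address to a 32-bit integer."""
    a, b, c, d = (int(p) for p in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# (network, netmask, next hop), in table order; the last row is the default.
TABLE = [
    ("1.2.3.0", "255.255.255.0",   "n1"),
    ("1.2.0.0", "255.255.0.0",     "n2"),
    ("1.3.0.0", "255.255.0.0",     "n3"),
    ("1.4.6.2", "255.255.255.255", "n4"),
    ("0.0.0.0", "0.0.0.0",         "n1"),
]

def next_hop(dest, table=TABLE):
    """Return the next hop of the first entry matching dest."""
    for network, netmask, hop in table:
        if to_int(dest) & to_int(netmask) == to_int(network):
            return hop
    raise ValueError("no matching route")

print(next_hop("1.2.3.4"))  # n1
print(next_hop("1.2.5.6"))  # n2
print(next_hop("1.4.6.5"))  # n1 (default route)
print(next_hop("1.4.6.2"))  # n4
```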

Routing tables are a very primitive method for configuring networks. They work well for a small network, but are error prone, and don't handle routing packets across multiple paths (for example to split a stream of traffic across two different paths). Modern routers have much more sophisticated methods for deciding how to route traffic.

Where do routing tables come from?

Good routing tables require that packets are forwarded "closer" to their destinations. Routers can discover this information by communicating with their neighbors.

One such algorithm proceeds as follows. Each router maintains a local table containing the distance and next hop to each destination network. Periodically, each router r shares its entire table with each of its neighbors. Each neighbor n compares r's table to its own, to see whether, for any destination d, there is a shorter path that passes through r. If so, n updates its entry, recording that the next hop from n to d is r (and that the distance from n to d is one plus the distance from r to d).

By iterating this process, each router will converge on a routing table that will give the shortest path to each destination network. This algorithm is referred to as a distance vector protocol, because each router maintains a vector of distances to endpoints.
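One round of the update described above can be sketched as follows (a simplification: distances are hop counts, tables are plain dictionaries, and the function name `dv_update` is ours):

```python
def dv_update(my_table, neighbor, neighbor_table):
    """
    Merge a neighbor's distance table into ours.
    Tables map destination -> (distance, next hop).
    Returns True if any entry changed.
    """
    changed = False
    for dest, (dist, _hop) in neighbor_table.items():
        candidate = dist + 1  # one extra hop to reach the neighbor
        if dest not in my_table or candidate < my_table[dest][0]:
            my_table[dest] = (candidate, neighbor)
            changed = True
    return changed

# Example: router n learns about net2 through its neighbor r.
n_table = {"net1": (0, "direct")}
r_table = {"net1": (0, "direct"), "net2": (1, "s")}
dv_update(n_table, "r", r_table)
print(n_table["net2"])  # (2, 'r')
```

Iterating this merge until no table changes is exactly the convergence process described above.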

If the network topology changes while the routes are being calculated, it is possible to create a routing loop, where each router in the loop thinks the next is the right place to forward a packet to; packets stuck in such a loop will never be delivered. This can be avoided in various ways; one approach is to store a path vector instead of a distance vector. Each router maintains the path to each other endpoint; when updating their path vectors, routers can detect loops and avoid entering them into their routing tables.
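The loop check in a path vector protocol can be sketched like this (again a toy: paths are lists of router names, and `pv_update` is our own name):

```python
def pv_update(my_paths, me, neighbor, neighbor_paths):
    """
    my_paths maps destination -> list of routers on the path.
    A path advertised by a neighbor is rejected if it already contains
    us, since accepting it would create a routing loop.
    """
    for dest, path in neighbor_paths.items():
        if me in path:
            continue  # loop detected: do not enter this route
        candidate = [neighbor] + path
        if dest not in my_paths or len(candidate) < len(my_paths[dest]):
            my_paths[dest] = candidate

paths_a = {}
pv_update(paths_a, "a", "b", {"net1": ["c"]})       # accepted
pv_update(paths_a, "a", "d", {"net1": ["e", "a"]})  # rejected: contains a
print(paths_a["net1"])  # ['b', 'c']
```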

The Border Gateway Protocol (BGP) is an application-layer path vector protocol that is used to configure routing information on the internet. BGP does not operate at the level of individual routers; instead the nodes in the BGP graph represent entire internet service providers, which are referred to as autonomous systems (AS). Each ISP may use a different protocol for determining routes within their own network (some use "internal BGP", but others use more sophisticated schemes); BGP is used to establish paths that span multiple ISPs.

Transport layer

Transport layer protocols sit atop the network layer and provide additional services. UDP is a very thin wrapper; it only provides multiplexing. TCP uses retransmission to provide reliable in-order delivery of a stream of data.

Key terms

Multiplexing

IP provides a facility for delivering packets to hosts. However, we often want to deliver data to particular applications running on a given host. Moreover, receivers may wish to send responses; those responses must be returned to the specific process that sent the request.

TCP and UDP both enable multiplexing by including a port number in their respective headers. To receive packets, applications can bind to a given port on their local host; the operating system delivers to an application only the packets addressed to its port. For example, a web server may bind to port 80, while a mail server may bind to port 993.

When creating an outgoing connection, TCP and UDP both allocate a new port on the client machine so that the server can send back responses. These ports are referred to as ephemeral (or anonymous) ports.
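The allocation of such a port can be sketched with Python's socket module: binding to port 0 asks the operating system to choose a free local port.

```python
import socket

# Bind a UDP socket to port 0; the OS picks an unused local port.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))   # port 0 = "pick one for me"
port = sock.getsockname()[1]  # the port the OS chose
print(port)
sock.close()
```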

UDP

The User Datagram Protocol (UDP) is a thin wrapper around IP that supports multiplexing. A UDP header contains a source and destination port number, the length of the data, and a checksum.
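The header layout can be illustrated with the struct module: four 16-bit big-endian fields, 8 bytes in all. (The helper names are ours, and the checksum computation is omitted; in IPv4 a checksum of 0 means "no checksum".)

```python
import struct

def make_udp_header(src_port, dst_port, payload, checksum=0):
    """Pack the 8-byte UDP header: source port, destination port,
    length (header + data), checksum; all 16-bit big-endian."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

def parse_udp_header(header):
    """Unpack the four header fields."""
    return struct.unpack("!HHHH", header[:8])

hdr = make_udp_header(5353, 53, b"hello")
print(parse_udp_header(hdr))  # (5353, 53, 13, 0)
```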

Datagrams are self-contained messages, each sent and delivered as a single unit whose size is known when it is sent. Think of them as postcards.

Stream-oriented protocol

Datagrams are convenient when sending small, self-contained messages whose size is known in advance. However many applications require interaction that is more akin to a conversation: each side may send some data, wait for a response, and then send some more data, or may send a very large stream of continuous data. Streams are open-ended communication channels.

The Transmission Control Protocol (TCP) provides a stream-oriented interface. Applications wishing to communicate using TCP first establish a connection. Then, over time, they can append data to the stream that they send to a remote endpoint. The remote endpoint can repeatedly read data from the stream. TCP guarantees that the data read by the remote endpoint is the same as the data written by the source.

TCP streams are bidirectional; once a connection is established from a client to a server, both parties can read and write data; the data written by the client will be read by the server and vice versa.

Acknowledgement and resending

TCP provides reliable delivery by requiring an acknowledgement from the remote endpoint for each packet received. If a TCP packet is not acknowledged within a certain amount of time, the sender resends the packet, and continues to do so until receipt is acknowledged.

TCP communications are divided into segments, each of which is identified by a sequence number; the first packet from host A to B might have sequence number 0, the second one 1, and so on.

Because TCP endpoints must maintain state (which packets have been sent, which have been acknowledged, and so on), TCP requires a connection to be established. To begin a connection, the endpoints perform a 3-way handshake: the initiator sends a synchronize (SYN) packet. The receiver responds with an acknowledgement (ACK) of the SYN and its own SYN packet; these are usually combined into a single SYN/ACK packet. The initiator then acknowledges the receiver's SYN with its own ACK.

sender/receiver sqn contents
A -> B A0 SYN
B -> A B0 SYN/ACK(A0)
A -> B A1 ACK(B0)

At this point the connection is established; both sides know that the other side is ready to receive packets. At that point, either side can send messages:

sender/receiver sqn contents
A -> B A2 Data ("hello!")
B -> A B1 ACK(A2)
B -> A B2 Data ("sup?")
A -> B A3 ACK(B2)

Messages can be piggybacked: if two messages are going in the same direction, they can be combined into a single message. In the above example, B can piggyback its ACK of A2 onto its Data packet:

sender/receiver sqn contents
A -> B A2 Data ("hello!")
B -> A B1 Data ("sup?") / ACK(A2)
A -> B A3 ACK(B1)

If a segment is not acknowledged within a timeout, the segment is resent. This can happen either because the initial transmission was dropped, or because the acknowledgement was dropped. The receiver may therefore receive duplicate segments; it simply discards the duplicates.
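Resending and duplicate discarding can be illustrated with a toy stop-and-wait simulation (a deliberate simplification of TCP; all names are ours). Here the ACKs for the sequence numbers in `drop_acks` are lost once each, forcing a resend that the receiver must recognize as a duplicate:

```python
def send_stream(segments, drop_acks=(0,)):
    """Send each segment until it is acknowledged; return what the
    receiver delivered. ACKs for seq numbers in drop_acks are lost once."""
    received, seen, lost = [], set(), set(drop_acks)
    for seq, data in enumerate(segments):
        while True:
            # Segment arrives; the receiver discards duplicates.
            if seq not in seen:
                seen.add(seq)
                received.append(data)
            # The ACK may be lost on the way back.
            if seq in lost:
                lost.discard(seq)
                continue  # timeout fires: resend the same segment
            break  # ACK received; move to the next segment
    return received

print(send_stream(["hello!", "sup?"]))  # ['hello!', 'sup?']
```

Even though segment 0 is delivered twice, the receiver's stream contains each segment exactly once, in order.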

When either endpoint is done sending data, it should inform the other end by closing the connection. This will cause a FIN packet to be sent (which should be acknowledged by the remote endpoint). When both endpoints have acknowledged their corresponding FIN packets, the connection is closed.

Congestion control (windowing)

If a sender has a large amount of data to send over TCP, and it waits for the acknowledgement of the first segment before transmitting the second, then it wastes a great deal of bandwidth, especially when transmitting over a high-latency connection. (Bandwidth is the amount of data that can be transmitted over a given link in a given time period; latency is the total time it takes for a single unit of data to be transmitted.)

However, if it tries to transmit too much data, it may overwhelm the links in between. This could lead to packet loss (because overloaded routers may simply discard packets, for example), and a high rate of retransmission.

Ideally, the sender will send just enough data at a time to keep the connection saturated but not oversaturated. TCP uses an adaptive algorithm to determine the right amount of data to send.

The number of sent but unacknowledged packets is called the window size. TCP adapts its window size using linear increase with exponential backoff and slow start:

- Slow start: initially the window size is a small value (such as 1). As long as no packets are dropped, the window size increases exponentially.
- Exponential backoff: after any packet is dropped (i.e. an acknowledgement is not received before the timeout expires), the sender decreases the window size exponentially.
- Linear increase: as soon as one packet has been lost, the slow start period is over; subsequent successful deliveries increase the window size only linearly.

This algorithm does a good job of approximating the maximum bandwidth and adapting to change in the network.
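The adaptation can be sketched as a toy simulation (the constants and event model are illustrative, not TCP's real ones): the window doubles per acknowledgement during slow start, is halved on each loss, and grows by one per acknowledgement once slow start has ended.

```python
def adapt_window(events, initial=1):
    """Replay a sequence of 'ack'/'loss' events and return the window size."""
    window, slow_start = initial, True
    for event in events:
        if event == "ack":
            window = window * 2 if slow_start else window + 1
        elif event == "loss":
            window = max(1, window // 2)  # exponential backoff
            slow_start = False            # first loss ends slow start
    return window

# 1 -> 2 -> 4 -> 8 (slow start), loss -> 4, then 5 -> 6 (linear increase)
print(adapt_window(["ack", "ack", "ack", "loss", "ack", "ack"]))  # 6
```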

Estimating timeouts

Another parameter that can change is the amount of time to wait for a timeout to occur (while waiting for acknowledgements). Ideally the timeout duration is slightly longer than the round-trip time from the sender to the receiver and back.

Most TCP implementations compute a weighted historical average of the round-trip time. The initial estimate t0 comes from the TCP handshake. Each time an acknowledgement is received, the time between when the packet was sent and when the ack was received gives a more recent sample of the round-trip time, which can be used to form an updated estimate for future timeouts.
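One common form of such a weighted average is an exponentially weighted moving average: each sample moves the estimate a fraction alpha of the way toward the measured round-trip time, and the timeout is set to a multiple of the estimate. (The constants here are illustrative, not the values any particular implementation uses.)

```python
def update_rtt(estimate, sample, alpha=0.125):
    """Blend a new RTT sample into the running estimate."""
    return (1 - alpha) * estimate + alpha * sample

def timeout(estimate, factor=2.0):
    """Wait somewhat longer than the estimated round trip."""
    return factor * estimate

rtt = 100.0  # initial estimate from the handshake, in milliseconds
for sample in [120.0, 80.0, 110.0]:
    rtt = update_rtt(rtt, sample)
print(round(rtt, 2), round(timeout(rtt), 2))
```

Because old samples decay geometrically, the estimate tracks recent network conditions while smoothing out individual noisy measurements.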