
WHAT'S INSIDE YOUR TCP OPTIMIZATION TOOLBOX?

BY ALEKSANDR KUPRIYANOV AND PADDY GANTI

Note: this post delves fairly deep into the inner workings of TCP, so a quick refresher on TCP beforehand might be helpful.

INTRODUCTION
In Paddy's previous post, he talked about how to load resources intelligently given the constraints of the last-mile options (ISP/cell tower). Before content is delivered from the edge to the end user, it must first get to the edge from the original source by traversing the global IP infrastructure that, in industry jargon, we call the middle mile. Today's post specifically talks about this phase of the content's journey, focusing primarily on TCP optimizations.
Most people think that TCP optimization means "set some values for some parameters and be done with it." For example, there is undue attention focused on the initial congestion window (initcwnd) setting. We try to show here that a holistic approach, examining each detail of data transfer, is what's needed for sustained and consistent TCP performance.

Since we are performance-obsessed at Instart Logic, let's start by taking a look at the impact of bandwidth/latency on a given web page's load time:

What this demonstrates is that content reduction techniques are important in low-bandwidth contexts, while request reduction shines in high-latency regimes. To demonstrate the same thing in a different way, consider the following breakdown of the Google home page:
Requests per domain:

Domain             Requests
www.google.com     9
ssl.gstatic.com    1
www.gstatic.com    1
google.com         1
apis.google.com    1

Bytes per domain:

Domain             Bytes
www.google.com     321105
www.gstatic.com    131172
apis.google.com    49394
ssl.gstatic.com    14290
google.com         576

At high speeds the number of requests becomes the most critical factor for performance, but at low speeds the bytes dictate what the end user experiences. Most synthetic performance measurement services such as Keynote, Gomez, and Catchpoint will be more sensitive to the number of requests due to their high-speed connectivity, whereas Real User Monitoring (RUM) tools like NewRelic and SOASTA, as well as WebPageTest (using throttling), will be more sensitive to the volume of content delivered. Be sure to test on both types of platforms to get a realistic view of the performance experienced by your end users.

WHICH LAYER TO FOCUS ON: HTTP VS. TCP


Now let's say you want to optimize the number of requests, and you have read that the standard best practice for sending fewer HTTP requests is to package multiple small resources into one bundle. For example, rather than sending three individual resources, the same content can be sent as one resource bundle. This way you expect to save two round trips, or at least that is the accepted wisdom.
However, under certain conditions, three separate connections each downloading n bytes will complete much faster than one connection downloading 3n bytes. This is because there is no one-to-one correlation between a given HTTP request and a TCP round trip. HTTP relies on TCP to segment your request/response into packets and send a certain number of packets in a "train." Having one big HTTP request does not necessarily translate into a shorter time for the browser to receive all of the bytes of the combined resource.

The answer to what the optimal bundling strategy is lies in TCP mechanics, which dictate the delivery dynamics of any web resource.
Based on this, we can re-interpret the latency graphic above as the reduction in the number of back-and-forth exchanges between transacting TCP peers. This should help convince you to focus on optimizing TCP round trips rather than on HTTP request reduction.
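To make the round-trip arithmetic concrete, here is a minimal Python sketch that counts how many round trips a connection in slow start needs to deliver a payload. The 10-segment initial window and 1460-byte MSS are illustrative assumptions, not values from this post; with them, one bundled 3n-byte resource of about 150KB costs four round trips, while three parallel n-byte downloads of about 50KB each finish in three.

MSS = 1460          # assumed maximum segment size, in bytes
INITCWND = 10       # assumed initial congestion window, in segments

def slow_start_rounds(payload_bytes, initcwnd=INITCWND, mss=MSS):
    """Rough count of round trips needed to deliver a payload while the
    congestion window doubles every RTT (pure slow start, no losses)."""
    cwnd, sent, rounds = initcwnd, 0, 0
    while sent < payload_bytes:
        sent += cwnd * mss   # one flight of packets per round trip
        cwnd *= 2            # slow start: window doubles each RTT
        rounds += 1
    return rounds

n = 50_000  # one resource of ~50 KB
print(slow_start_rounds(3 * n))  # 4 round trips for the bundled resource
print(slow_start_rounds(n))      # 3 round trips per resource, overlapped across connections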

WHY TCP SUCKS FOR LONG DISTANCE DATA TRANSFER


It turns out that most default TCP stacks aren't set up for use over today's WAN and satellite links, or even gigabit Ethernet: anything with high bandwidth, high delay, or both. To understand why, let's see how much TCP can transfer in one second using basic arithmetic (more complicated models exist, but the following is sufficient for understanding the concept):

Throughput < BufferSize / NetLatency   ==>   NetLatency < BufferSize / Throughput

This tells you that if the network latency is greater than about 5ms, throughput will be limited even with the maximum possible receive buffer (with the classic 64KB maximum window, a latency above 5ms already caps throughput below roughly 100Mbps). For example, a 100ms link with a 32KB receive buffer caps the throughput at 2.56Mbps regardless of the available capacity. This should convince you that something is broken with TCP for long-haul delivery.
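A quick back-of-the-envelope helper, written as a Python sketch, reproduces these ceilings from the formula above:

def max_throughput_mbps(buffer_bytes, rtt_seconds):
    """Upper bound on TCP throughput: at most one buffer's worth of data
    can be in flight per round trip."""
    return (buffer_bytes * 8) / rtt_seconds / 1e6

# A 32KB receive buffer on a 100ms path caps out well below the link rate:
print(max_throughput_mbps(32_000, 0.100))   # 2.56 Mbps, regardless of capacity
# The classic 64KB window (no window scaling) on a 5ms path:
print(max_throughput_mbps(65_535, 0.005))   # ~105 Mbps, so longer paths fall below 100Mbps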
Given the above situation, we employ the following five heuristics to specifically overcome this handicap of TCP for
high bandwidth-delay paths. Please note that due to their interdependencies, you need all of these working in unison, rather
than deploying any single one of these options.

CONGESTION FLOOR
The bandwidth-delay product for a 100Mbps link across the US is 90KB, which means you need to keep that much data in flight to fully utilize the link capacity. Given that the middle-mile nodes have greater-than-1Gbps links between them, and given their geographical dispersion, we want to set a minimum value for the TCP congestion window and never fall below it, so as to ensure maximum network utilization. Even in slow start, and after a timeout, the congestion window has to remain at least at this value. By setting it to 30 or more, we can ensure that most HTML/JSON responses get sent in a single flight of packets even after slow start or packet loss (30 x 1500 bytes = 45KB, which is larger than the HTML response size of more than 90 percent of the Top 1000 sites).
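A small Python sketch of that arithmetic, using the 1500-byte packet size and the figures quoted above:

import math

PACKET_BYTES = 1500

def window_floor_segments(bytes_in_flight, packet_bytes=PACKET_BYTES):
    """Congestion-window floor, in segments, needed to put a given amount
    of data on the wire in a single flight."""
    return math.ceil(bytes_in_flight / packet_bytes)

print(window_floor_segments(90_000))   # 60 segments to cover the 90KB bandwidth-delay product
print(30 * PACKET_BYTES)               # a 30-segment floor still covers 45,000 bytes per flight,
                                       # larger than most top-1000 sites' HTML responses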

DELAYED ACKNOWLEDGEMENTS
Simply adjusting the congestion floor won't do, as we will still be limited by the rate at which acknowledgements come back. When a TCP receiver uses delayed acknowledgments, it slows down the growth of the sender's congestion window and reduces the sender's throughput. Moreover, for HTTP-style request/response traffic, there is no hope of piggy-backing the ACK on data anyway. Disabling delayed acknowledgements on our edge PoPs therefore ensures that we can sustain the data transfer as fast as the sender can send it, without bogging it down.
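In our case this happens at the PoP level; as a rough per-connection approximation on a Linux host, the TCP_QUICKACK socket option asks the kernel not to delay ACKs on a given socket. A minimal sketch follows (the peer and request are placeholders, and note that the flag is not sticky, so it is re-armed around every read):

import socket

# TCP_QUICKACK is Linux-specific; fall back to the Linux constant if the
# Python build does not expose it.
TCP_QUICKACK = getattr(socket, "TCP_QUICKACK", 12)

sock = socket.create_connection(("example.com", 80))   # placeholder peer
sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)
sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")

body = b""
while True:
    chunk = sock.recv(65536)
    if not chunk:
        break
    body += chunk
    # The kernel clears quick-ACK mode as the connection progresses,
    # so re-enable it after each read to keep ACKs flowing promptly.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_QUICKACK, 1)
sock.close()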

RETRANSMISSION TIMER OPTIMIZATION


Once we have a floor on the window and have removed delayed ACKs, the pump is primed to send data at high throughput. However, TCP timeouts are unavoidable (full window loss, a lost re-transmit), so we should try to reduce the time spent waiting for a timeout. While this can visibly improve throughput, it should be applied with caution because it also increases the probability of premature timeouts. Estimating the right re-transmission timeout (RTO) value is therefore important for achieving a timely response to packet losses while avoiding premature timeouts.

A premature timeout has two negative effects:


It leads to a spurious re-transmission.
With every timeout, TCP enters slow start even though no packets were lost. Since there is no congestion, TCP underestimates the link capacity and throughput suffers.

TCP has a conservative minimum RTO (RTOmin) value to guard against spurious re-transmissions. The Linux TCP stack uses an RTOmin value of 200ms. Unfortunately, this value may at times be greater than the round-trip times of end-user connections (which are typically about 20-50ms). To fix this, the following approach may be employed:

reduce RTOmin to 20ms;
estimate the current RTO value as 3x the current smoothed RTT.

Because we have disabled delayed acknowledgements, we don't need the minimum to stay at 200ms. Our tests with mobile clients have shown that this strategy helps achieve a timely response to packet losses, while retaining only a small risk of spurious re-transmissions in case of RTT spikes.
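A minimal Python sketch of this timer policy follows. The 20ms floor and the 3x smoothed-RTT rule come from the description above; the exponential smoothing gain is the standard RFC 6298 value and is an assumption here, as the full RFC algorithm also tracks RTT variance rather than taking a plain multiple of the smoothed RTT.

RTO_MIN = 0.020   # 20 ms floor, per the strategy above
ALPHA = 0.125     # smoothing gain for the RTT estimator (RFC 6298 value, assumed)

class RtoEstimator:
    """Keep a smoothed RTT and derive RTO = max(3 * srtt, RTO_MIN)."""
    def __init__(self):
        self.srtt = None

    def on_rtt_sample(self, rtt_seconds):
        if self.srtt is None:
            self.srtt = rtt_seconds
        else:
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_seconds

    @property
    def rto(self):
        if self.srtt is None:
            return 1.0                       # conservative default before any sample
        return max(3 * self.srtt, RTO_MIN)

est = RtoEstimator()
for sample in (0.030, 0.028, 0.045, 0.032):  # RTT samples in seconds
    est.on_rtt_sample(sample)
print(est.rto)   # ~0.095 s, instead of idling against a 200 ms floor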

REDUNDANT PACKETS
While the above techniques optimize for a train of packets, the last packet in a train is not eligible for fast recovery and hence will time out in the classic sense. The only way to avoid a "classic" RTO re-transmit, and instead kick off slow start or fast re-transmit when the last-sent packet (or the last few packets) is lost, is to resend that packet if its ACK has not arrived within a bit more than a single RTT. Two copies have a higher probability of at least one arriving at the destination, so we resend the last packet in a train. The same tactic can be used for SYN and SYN/ACK packets when establishing connections, to reduce connection setup time.
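Conceptually this is a probe timer armed on the tail of every flight. The Python sketch below illustrates the idea; the send_packet callback and the 1.25 x RTT delay are assumptions for illustration, not the production values.

import threading

class TailResend:
    """If the ACK for the last packet of a flight has not arrived within a
    bit more than one RTT, send that packet again rather than waiting for
    a full retransmission timeout."""
    def __init__(self, send_packet, srtt_seconds):
        self.send_packet = send_packet
        self.probe_delay = 1.25 * srtt_seconds   # "a bit longer than a single RTT"
        self.timer = None

    def on_tail_sent(self, packet):
        # Arm the probe when the last packet of the train goes out.
        self.timer = threading.Timer(self.probe_delay, self.send_packet, args=(packet,))
        self.timer.start()

    def on_tail_acked(self):
        # The ACK arrived in time: no redundant copy is needed.
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None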

REORDERING OPTIMIZATION
As network speeds increase, there is a greater chance that packets won't arrive in the same order we sent them. This occurs when the order of packets is inverted due to multi-path routing or parallelism at routers and communicating hosts.

It can affect performance because:


It causes unnecessary re-transmissions: When the TCP receiver gets packets out of order, it sends duplicate ACKs, which trigger the fast re-transmit algorithm at the sender. These ACKs make the TCP sender infer that a packet has been lost and re-transmit it. If the temporary sequence-number gap is caused by reordering, the duplicate ACKs and the fast re-transmission are unnecessary and a waste of bandwidth.
It limits transmission speed: When fast re-transmission is triggered by duplicate ACKs, the TCP sender assumes this is an indication of network congestion. It reduces its congestion window to limit the transmission speed, which then has to grow from slow start again. If reordering happens frequently, the congestion window stays small and can hardly grow, so TCP transmits packets at a limited speed and cannot efficiently utilize the bandwidth.
Our measurements demonstrate that packet reordering is highly prevalent relative to packet losses across high-speed backbone networks, with a degree of reordering of up to 90 packets. Investigations of real IP flows also show that most reordered packets arrive at the receiver with time lags of less than 10ms. To account for this, the following strategy can be employed to limit the impact of reordering on performance:

When the first dupACK is detected, the stack is blocked from acting on it for a certain time.
If actual packet reordering took place, this hold-off is enough for self-recovery.
If a packet loss took place, the standard fast re-transmit algorithm starts.
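A Python sketch of that hold-off logic is shown below. The 10ms window mirrors the reordering lag noted above and the three-dup-ACK threshold is the standard fast-retransmit trigger; the state machine itself is illustrative rather than the production implementation.

import time

HOLD_OFF = 0.010        # 10 ms: most reordered packets arrive within this lag
DUPACK_THRESHOLD = 3    # standard fast-retransmit trigger

class ReorderingFilter:
    """On the first duplicate ACK, hold off for a short interval.  If the gap
    was caused by reordering, the hole fills itself and the state resets; if
    duplicate ACKs keep accumulating past the hold-off, fall back to the
    standard fast re-transmit."""
    def __init__(self):
        self.dupacks = 0
        self.hold_off_until = None

    def on_ack(self, is_duplicate, now=None):
        now = time.monotonic() if now is None else now
        if not is_duplicate:
            # A fresh ACK filled the gap: reordering, not loss.
            self.dupacks = 0
            self.hold_off_until = None
            return "no_action"
        self.dupacks += 1
        if self.dupacks == 1:
            self.hold_off_until = now + HOLD_OFF
            return "hold_off_started"
        if self.dupacks >= DUPACK_THRESHOLD and now >= self.hold_off_until:
            return "fast_retransmit"
        return "waiting"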

RESULTS
Download times are in seconds; the THROUGHPUT row is in Mbps. "% Diff" is the Direct time divided by the Instart time.

Intranode latency 180ms:

Client bandwidth    1MB Direct   1MB Instart   % Diff   4MB Direct   4MB Instart
0.5Mbps             18           17.2          105%     69           67
1.5Mbps             7            6.1           115%     26           23.3
100Mbps             7            0.9           778%     18           3.2
THROUGHPUT          1.1          8.9                    1.8          10

Intranode latency 340ms:

Client bandwidth    1MB Direct   1MB Instart   % Diff   4MB Direct   4MB Instart   % Diff
0.5Mbps             18.5         17.3          107%     68           67            101%
1.5Mbps             12           6             200%     35           22            159%
100Mbps             12           1.7           706%     34           6.5           523%
THROUGHPUT          0.7          4.7                    0.9          4.9

As you can see, the benefits are material and significant. All of Instart Logic's customers have access to these TCP benefits by
virtue of our Global Network Accelerator.

CONCLUSION
Now, let's circle back to our original question: how should you package individual resources for high-performance, end-to-end application delivery? The answer is to treat each resource like a packet and model it after TCP dynamics.
We have a lot more to say on this topic. Stay tuned to hear how this theory helps you better package and bundle your assets.

REFERENCES
Robert T. Morris gives you some magic numbers, like why TCP won't work if the packet loss climbs to more than 2%, among other things.
For those interested in measuring the Internet, Vern Paxson did this landmark study, which remains unparalleled.

