TCP OPTIMIZATION TOOLBOX
INTRODUCTION
In Paddy's previous post, he talked about how to load resources intelligently given the constraints of the last-mile options
(ISP/cell tower). Before content is delivered from the edge to the end user, it must first get to the edge from the original
source by traversing the global IP infrastructure that industry jargon calls the middle mile. Today's post covers this phase
of the content's journey, focusing primarily on TCP optimizations.
Most people think that TCP optimization means "set some values for some parameters and be done with it." For example,
undue attention is focused on the initial congestion window (initcwnd) setting. We aim to show here that a
holistic approach, examining each detail of the data transfer, is what's needed for sustained and consistent TCP performance.
Since we are performance-obsessed at Instart Logic, let's start by taking a look at the impact of bandwidth and latency on a
given web page's load time:
What this demonstrates is that content-reduction techniques are important in low-bandwidth contexts, while request
reduction shines in high-latency regimes. To demonstrate the same point in a different way, consider the following
breakdown of the Google home page:
Requests per domain:

  Domain            Requests
  www.google.com    9
  ssl.gstatic.com   1
  www.gstatic.com   1
  google.com        1
  apis.google.com   1

Bytes per domain:

  Domain            Bytes
  www.google.com    321,105
  www.gstatic.com   131,172
  apis.google.com   49,394
  ssl.gstatic.com   14,290
  google.com        576
At high speeds the request count becomes the most critical factor for performance, but at low speeds the byte count dictates
what the end user experiences. Most synthetic performance measurement services, such as Keynote, Gomez, and Catchpoint, will
be more sensitive to the number of requests due to their high-speed connectivity, whereas Real User Monitoring (RUM) tools
such as New Relic and SOASTA, or WebPageTest with throttling enabled, will be more sensitive to the volume of content
delivered. Be sure to test on both types of platforms to get a realistic view of the performance experienced by your end users.
The answer to the question of the optimal bundling strategy lies in TCP mechanics, which dictate the delivery dynamics of any
web resource.
Based on this information, we can reinterpret the latency graphic above as the reduction in the number of back-and-forth
exchanges between transacting TCP peers. This should help convince you to focus on optimizing TCP round trips rather than
on reducing HTTP requests.
TCP throughput is bounded by the window size divided by the round-trip time. This tells you that once network latency rises
above roughly 5 ms, throughput will be limited even with the maximum possible receive-buffer value. For example, a 100 ms
link with a 32 KB receive buffer caps throughput at 2.56 Mbps regardless of the available capacity. This should convince you
that something is broken in TCP for long-haul delivery.
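A quick way to see this bound in numbers (the function name is ours, just a sketch of the window/RTT arithmetic):

```python
def tcp_throughput_cap_mbps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6

# A 32 KB receive buffer on a 100 ms path caps the connection at 2.56 Mbps:
print(tcp_throughput_cap_mbps(32_000, 0.100))   # 2.56

# The maximum un-scaled receive window (64 KB) can only saturate a
# 100 Mbps link while the RTT stays near 5 ms:
print(tcp_throughput_cap_mbps(65_535, 0.005))   # ~104.9
```

This is why the 5 ms figure above matters: beyond it, the window, not the wire, is the bottleneck.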
Given this situation, we employ the following five heuristics specifically to overcome this handicap of TCP on
high bandwidth-delay paths. Note that, because of their interdependencies, you need all of them working in unison
rather than deploying any single one on its own.
CONGESTION FLOOR
The bandwidth-delay product of a 100 Mbps link across the US is 90 KB, which means you need that much data in flight to fully
utilize the link capacity. Given that the middle-mile nodes have greater-than-1 Gbps links between them, and given their
geographical dispersion, we want to set a minimum value for the TCP congestion window and never fall below it, so as to
ensure maximum network utilization. Even during slow start, and after a timeout, the congestion window has to remain at
least at this value. Setting it to 30 or more ensures that most HTML/JSON responses get sent in a single flight of
packets even after slow start or packet loss. (30 x 1500 bytes = 45 KB, more than 90 percent of the top 1,000 sites' HTML
response sizes.)
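The arithmetic behind the floor can be sketched as follows (function and constant names are ours):

```python
MSS = 1500  # bytes per segment, assuming a standard Ethernet MTU

def bdp_bytes(bandwidth_mbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: the data that must be in flight to fill the pipe."""
    return bandwidth_mbps * 1e6 / 8 * rtt_ms / 1e3

# e.g. 100 Mbps at a 7.2 ms round trip gives the 90 KB figure:
print(bdp_bytes(100, 7.2))   # 90000.0

# A congestion-window floor of 30 segments keeps 45 KB in each flight,
# enough to cover most HTML/JSON responses in a single round trip:
print(30 * MSS)              # 45000
```

On Linux, the initial window on a route can be raised with `ip route change ... initcwnd 30`, though a true floor that survives timeouts, as described here, requires changes inside the stack.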
DELAYED ACKNOWLEDGEMENTS
Simply adjusting the congestion floor won't do, as we will be limited by the number of acknowledgements we receive. When a
TCP receiver uses delayed acknowledgements, it also slows the growth of the sender's congestion window
and reduces the sender's throughput. Moreover, for HTTP-style request/response traffic, there is no hope of piggy-backing the
ACK on the data anyway. So disabling delayed acknowledgements on our edge PoPs ensures that we can sustain the
data transfer as fast as the sender can send it, without bogging it down.
With delayed acknowledgements disabled, the retransmission timer no longer needs its conventional 200 ms floor. Our tests
with mobile clients have shown that this strategy achieves a timely response to packet losses, while retaining only a small
risk of spurious retransmissions in the case of RTT spikes.
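To see why ACK frequency matters, here is a toy slow-start model (a sketch, not any real stack's code). With delayed ACKs the receiver acknowledges every other segment, so the window grows by half a window per round trip instead of doubling:

```python
def rtts_to_send(total_segments: int, cwnd: int, delayed_acks: bool) -> int:
    """Round trips needed to deliver total_segments under slow start.

    In slow start the sender grows cwnd by one segment per ACK received.
    With delayed ACKs the receiver ACKs every other segment, halving the
    growth rate of the window.
    """
    rtts = 0
    sent = 0
    while sent < total_segments:
        sent += cwnd
        acks = cwnd // 2 if delayed_acks else cwnd
        cwnd += acks   # one segment of growth per ACK
        rtts += 1
    return rtts

# 100 segments starting from a window of 10:
print(rtts_to_send(100, 10, delayed_acks=False))  # 4 round trips
print(rtts_to_send(100, 10, delayed_acks=True))   # 5 round trips
```

Even in this idealized model, delayed ACKs cost a full extra round trip on a modest transfer; on long-haul paths each round trip is hundreds of milliseconds.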
REDUNDANT PACKETS
While the techniques above optimize for a train of packets, the last packet in a train is not eligible for fast recovery, and
hence will time out in the classic sense. The only way to avoid a classic RTO retransmit, and to trigger either slow start or
fast retransmit when the last-sent packet (or the last few packets) is lost, is to resend it if we have not received its ACK
within a bit more than a single RTT. Two copies have a higher probability of at least one arriving at the destination, so we
resend the last packet in a train. The same tactic can be used for SYN and SYN/ACK packets to make connection establishment
faster.
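The probability argument is simple, if we assume losses are independent (a simplification; real losses are often bursty). This tactic is similar in spirit to the Tail Loss Probe mechanism later adopted in Linux:

```python
def tail_loss_probability(loss_rate: float, copies: int = 2) -> float:
    """Chance that every copy of the tail packet is lost, assuming
    independent losses at the given rate."""
    return loss_rate ** copies

# Duplicating the last packet turns a 2% chance of an RTO timeout
# on the tail into a 0.04% chance:
print(round(tail_loss_probability(0.02), 6))   # 0.0004
```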
REORDERING OPTIMIZATION
As network speeds increase, there is a greater chance that packets won't arrive in the order we sent them. This occurs
when packet order is inverted by multi-path routing or by parallelism at routers and communicating hosts.
When the first dupACK is detected, the stack holds off on any recovery action for a certain time. If the packets were merely
reordered, this timeout is enough for the connection to recover on its own; if a packet was actually lost, the standard
fast-retransmit algorithm starts.
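The hold-off logic can be sketched as a tiny state machine (names and the hold-off value are ours, purely illustrative):

```python
REORDER_HOLD_SECONDS = 0.01  # assumed hold-off before reacting to dupACKs

class DupAckHold:
    """Sketch of the reordering hold-off: on the first duplicate ACK,
    wait briefly before triggering fast retransmit, so simple reordering
    can resolve itself without a spurious retransmission."""

    def __init__(self):
        self.first_dupack_at = None

    def on_dupack(self, now: float) -> str:
        if self.first_dupack_at is None:
            self.first_dupack_at = now
            return "hold"          # start the hold-off window
        if now - self.first_dupack_at < REORDER_HOLD_SECONDS:
            return "hold"          # still inside the hold-off window
        return "fast_retransmit"   # dupACKs persisted: loss is likely real

    def on_in_order_ack(self) -> str:
        self.first_dupack_at = None  # reordering resolved itself
        return "recovered"
```

If a cumulative ACK arrives during the hold-off, the event was reordering and nothing is retransmitted; if duplicate ACKs outlast it, the loss is treated as real.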
RESULTS
[Table: page load times (seconds) for 1 MB and 4 MB files delivered Direct vs. via Instart, at 180 ms and 340 ms intranode
latency, over 0.5 Mbps, 1.5 Mbps, and 100 Mbps client bandwidths, along with measured throughput. At 340 ms latency over a
100 Mbps link, for example, the 1 MB file loads in 1.7 s via Instart versus 12 s direct (706% difference), and the 4 MB file
in 6.5 s versus 34 s (523%).]
As you can see, the benefits are material and significant. All of Instart Logic's customers have access to these TCP benefits by
virtue of our Global Network Accelerator.
CONCLUSION
Now, let's circle back to our original question: how should you package individual resources for high-performance, end-to-end
application delivery? The answer is to treat each resource like a packet and model its delivery on TCP dynamics.
We have a lot more to say on this topic. Stay tuned to hear how this theory helps you better package and bundle your assets.
REFERENCES
Robert T. Morris gives you some magic numbers, such as why TCP won't work once packet loss climbs above 2%,
among other things.
For those interested in measuring the Internet, Vern Paxson's landmark study remains unparalleled.