Sunday, September 27, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks; Chen, Griffith, Liu, Katz, & Joseph

This paper examines TCP Incast in a datacenter environment and attempts to come up with an analytical model to understand it. The authors take an experimental approach to better understand the phenomenon: they make a few modifications (such as changing the minimum RTO and randomizing backoff), check whether any of them prevent incast collapse, and move on from there. They found that the biggest improvements came from reducing the minimum RTO.
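
To make those knobs concrete, here is a minimal Python sketch (mine, not the authors' code) of the two retransmission-timer modifications mentioned above: a configurable minimum-RTO floor and a randomized, rather than strictly doubling, backoff after successive timeouts. All constants are illustrative.

```python
import random

def next_rto(rto, min_rto=0.001, max_rto=60.0, randomize=False):
    """Back off the retransmission timeout after a timeout event.

    Standard TCP doubles the RTO on each successive timeout; the
    randomized variant scales it by a random factor instead, so senders
    that timed out together do not all retransmit in lockstep.
    """
    factor = random.uniform(1.0, 2.0) if randomize else 2.0
    return min(max(rto * factor, min_rto), max_rto)

# Example: start at a 1 ms minimum RTO and apply a few backoff steps.
rto = 0.001
for timeout in range(1, 5):
    rto = next_rto(rto, randomize=True)
    print(f"after timeout {timeout}: RTO = {rto * 1000:.2f} ms")
```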

This work builds on the original TCP Incast paper from CMU and uses a workload in which each server sends a fixed amount of data. The CMU group has since introduced a new workload in which a constant total amount of data is split among a varying number of senders, so the more senders there are, the less each sender actually has to send. With large numbers of senders, however, this new workload could hide incast behavior because each sender only ends up sending a few packets. The authors therefore stuck with the original setup of a fixed amount of data per sender.
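
As a rough illustration of why the two workloads diverge at high sender counts, here is a small comparison of per-sender data under each; the byte counts are arbitrary numbers I picked for the example, not values from either paper.

```python
def per_sender_bytes(num_senders, fixed_per_sender=256 * 1024,
                     fixed_total=1024 * 1024):
    """Compare the two workload styles discussed above.

    - "fixed per sender": every sender transmits the same amount of data,
      so total traffic grows with the number of senders (this paper).
    - "fixed total": a constant total is split across senders, so each
      sender's share shrinks as senders are added (the newer CMU workload).
    """
    return fixed_per_sender, fixed_total // num_senders

for n in (4, 16, 64, 256):
    fixed, split = per_sender_bytes(n)
    print(f"{n:4d} senders: fixed-per-sender = {fixed} B, "
          f"fixed-total share = {split} B")
```

With hundreds of senders, the fixed-total share drops to a few packets' worth of data per sender, which is why that workload can mask incast behavior.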

Some of the results in this study differ from those reported by the original CMU group (regarding the importance of high-resolution timers and of turning off delayed ACKs), and the authors presume that this is due to differences in the underlying network. The authors use both a small-scale cluster and an experimental testbed, but my complaint is that they never say how many machines these systems actually contain (it appears to be at least 20, based on later graphs).

A number of experiments were run with varying minimum RTO values, with a few noteworthy results. There appear to be three phases to the incast problem: collapse, goodput recovery, and goodput decrease. Large numbers of timeouts occur during the initial collapse phase, but then there are few timeout events as senders ramp back up. At some point, however, the time it takes the senders to transmit their data grows beyond the RTO, so timeouts again happen frequently, causing interference between senders retransmitting after a timeout and those still waiting for a timeout to expire. Hence, the higher the minimum RTO, the longer after collapse this third phase takes to appear.
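
A back-of-envelope calculation (my own, with made-up but plausible link and block sizes, not numbers from the paper) shows why the third phase appears later for larger minimum RTO values: the aggregate transfer time grows with the number of senders, and once it exceeds the RTO floor, timeouts start firing while other senders are still busy transmitting.

```python
def transfer_time_s(num_senders, bytes_per_sender=256 * 1024,
                    link_gbps=1.0):
    """Ideal time for all senders to drain through one bottleneck link."""
    total_bits = num_senders * bytes_per_sender * 8
    return total_bits / (link_gbps * 1e9)

for rto_min_ms in (1, 10, 200):
    # Find the smallest sender count whose aggregate transfer time
    # exceeds the minimum RTO -- roughly where timeout interference
    # between retransmitting and waiting senders can begin.
    n = 1
    while transfer_time_s(n) * 1000 < rto_min_ms:
        n += 1
    print(f"min RTO {rto_min_ms:3d} ms: exceeded at ~{n} senders")
```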

Smaller minimum RTO values equate to shorter "goodput recovery" time, that is, the time required to return to maximum goodput. The point at which collapse turns into goodput recovery is the same regardless of the minimum RTO. While the CMU paper made much ado about using high-resolution timers, it appears that the optimal setting is actually a low-resolution minimum RTO of 1ms paired with delayed ACKs turned on. An important difference between the two studies, however, is that the average observed RTT in this paper was 2ms, while RTTs in the CMU study were on the order of hundreds of microseconds. With that in mind, a minimum RTO of 1ms is actually shorter than the expected RTT here.
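
The interplay between the RTO floor and the measured RTT can be made concrete with the standard RTO estimator from RFC 6298; the RTT samples below are illustrative, roughly matching the ~2ms average reported in this paper versus the sub-millisecond RTTs in the CMU study.

```python
def rto_estimate(rtt_samples, min_rto, alpha=0.125, beta=0.25):
    """Simplified TCP RTO computation (RFC 6298) with a minimum-RTO floor."""
    srtt = rttvar = rto = None
    for r in rtt_samples:
        if srtt is None:
            srtt, rttvar = r, r / 2
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
        rto = srtt + 4 * rttvar
    return max(min_rto, rto)

samples_this_paper = [0.002, 0.0025, 0.0018, 0.0022]   # ~2 ms RTTs
samples_cmu = [0.0002, 0.00025, 0.00015, 0.0002]       # 100s-of-us RTTs

for label, samples in [("this paper", samples_this_paper),
                       ("CMU", samples_cmu)]:
    for floor_ms in (1, 200):
        rto = rto_estimate(samples, min_rto=floor_ms / 1000)
        print(f"{label:10s} min RTO {floor_ms:3d} ms -> RTO {rto * 1000:.2f} ms")
```

With ~2ms RTTs the estimator already produces an RTO above 1ms, so the 1ms floor rarely binds; with a 200ms floor, the floor dominates in both environments.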

The optimality of delayed ACKs was surprising, since in theory they could slow down goodput, but the authors suggest that turning delayed ACKs off can over-drive the congestion window. The graphs used to justify this theory were somewhat confusing; I wasn't quite sure what each dot on the graph corresponded to, as the delayed-ACK graph has a significantly larger number of dots on it.
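
One way to see how turning delayed ACKs off could over-drive the congestion window is to compare slow-start growth per RTT: an ACK for every segment roughly doubles the window each RTT, while delayed ACKs (one ACK per two segments) grow it by only about half that. This is a textbook-level sketch of slow start, not a model taken from the paper.

```python
def slow_start_growth(initial_cwnd=2, rtts=6, delayed_acks=False):
    """Congestion window (in segments) at the end of each RTT of slow start.

    Each ACK received during slow start grows cwnd by one segment.  With
    delayed ACKs, roughly one ACK arrives per two segments, so cwnd grows
    by about half as many segments per RTT.
    """
    cwnd = initial_cwnd
    history = []
    for _ in range(rtts):
        acks = cwnd // 2 if delayed_acks else cwnd
        cwnd += acks
        history.append(cwnd)
    return history

print("per-segment ACKs:", slow_start_growth(delayed_acks=False))
print("delayed ACKs:    ", slow_start_growth(delayed_acks=True))
```

The faster growth without delayed ACKs makes it easier to overshoot the shallow switch buffers, which is consistent with the authors' suggestion.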

3 comments:

Yanpei Chen said...

Trust you to make a criticism based on the number of dots. Rean came up with that graph. It was the best we had given the time constraints (a few hours at the last RAD Lab retreat).

And you said you were going to rip our paper to shreds ...

Adrienne said...

Didn't both papers randomize backoff?

Randy H. Katz said...

Hmmm ... several students found those graphs to be confusing. We need to work on that.