Sunday, September 27, 2009

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication; Vasudevan et al.

This paper discusses TCP incast collapse, in which throughput in datacenter environments drops sharply because many servers send simultaneously and then back off in lockstep. For the problem to occur, multiple senders must communicate with a single receiver over a high-bandwidth, low-delay network, causing switch buffers to overflow, which in turn causes timeouts.

In a datacenter environment where RTTs tend to be on the order of 10s or 100s of microseconds, 100+ millisecond timeouts are 1000s of times larger than the RTT. Incast is a particular problem for barrier-synchronized workloads because even a single server experiencing one of these disproportionately long timeouts can delay the entire job. However, as the paper notes, this timeout-to-RTT ratio in datacenters is a problem for any latency-sensitive workload, even when incast is not occurring.
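To make that ratio concrete, here is a rough back-of-the-envelope sketch in Python (the RTT, block size, and link speed are numbers I'm assuming for illustration, not values from the paper) of how a single timeout dominates the completion time of a barrier-synchronized request:

```python
# Rough sketch: how a single RTO stall dominates a barrier-synchronized
# request when RTTs are ~100us. All numbers are assumed for illustration.

rtt = 100e-6            # assumed datacenter RTT: 100 microseconds
block = 256 * 1024      # assumed per-server block: 256 KB
link = 1e9 / 8          # assumed 1 Gbps link, in bytes per second

transfer_time = block / link   # time to move the data itself

for rto_min in (200e-3, 200e-6):   # stock RTOmin vs microsecond RTOmin
    # If even one sender times out once, the whole barrier-synchronized
    # request waits out the RTO before it can complete.
    completion = transfer_time + rto_min
    print(f"RTOmin={rto_min*1e6:8.0f}us  "
          f"transfer={transfer_time*1e3:.2f}ms  "
          f"completion={completion*1e3:.2f}ms  "
          f"(timeout is {rto_min/rtt:,.0f}x the RTT)")
```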

The authors study the problem both in simulation and on two differently sized clusters, and they also examine real-world RTT data, though some of the conclusions and configuration choices they make seem questionable or tenuous at best. For one thing, the RTT distribution obtained from Los Alamos has around 20% of RTTs pegged at 100 microseconds or less, which the authors claim shows that "networks... operate in very low-latency environments." In reality, at least 50% of the RTTs fall between 300 and 600 microseconds, meaning that the minimum RTOmin of 5ms obtainable without high-resolution timers would be perfectly acceptable.

In order to scale up to 10Gbps links, the authors set the block size to 80MB so that any single flow can saturate a link; I'm not sure that this is a reasonable choice. In their comparison of throughput for RTOmin values of 200ms and 200us, the authors also ignore all flows smaller than a certain threshold, and I would definitely be curious to see how those flows differed between the two scenarios.
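Some quick arithmetic of my own (using the 80MB and 10Gbps figures from their setup; the interpretation is mine) behind that skepticism about the block size:

```python
# How long does an 80 MB block occupy a 10 Gbps link? Quick check of
# the experimental setup's numbers; the commentary is my own.

block = 80 * 2**20        # 80 MB in bytes
link = 10e9 / 8           # 10 Gbps in bytes per second

print(f"one 80 MB block = {block / link * 1e3:.1f} ms at line rate")
# ~67 ms of line-rate transmission per block, so each flow lasts far
# longer than the microsecond-scale, latency-sensitive requests the
# incast problem is usually framed around.
```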

The authors also believe that delayed ACKs could cause problems for the network because the delayed-ACK timer is set to 40ms, much larger than the proposed new RTOmin value. Another thing to bear in mind is that any change to how the RTO is handled, such as adding a random delay to desynchronize retransmissions, could actually harm performance in non-datacenter environments, where flows will not have homogeneous RTTs.
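As a toy illustration (my own simplification, not the paper's analysis) of why a 40ms delayed-ACK timer clashes with a 200us RTOmin, consider how many times the sender's retransmission timer can fire, even with exponential backoff, before the receiver's delayed ACK ever goes out:

```python
# Toy model: count spurious retransmissions while the receiver holds a
# delayed ACK for 40 ms and the sender's RTO starts at 200 us.
# This ignores most real TCP details and is only meant to show scale.

rto_min = 200e-6      # proposed fine-grained RTOmin
delack = 40e-3        # typical delayed-ACK timer

t, rto, retransmits = 0.0, rto_min, 0
while t + rto < delack:     # retransmit fires before the delayed ACK
    t += rto
    rto *= 2                # exponential backoff doubles the RTO
    retransmits += 1

print(f"~{retransmits} spurious retransmissions before the 40 ms "
      f"delayed ACK would arrive")
```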

1 comment:

Randy H. Katz said...

This is a good summary of the paper. How do you compare the radically different approaches in the two papers?