Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems

  • Elie Krevat ,
  • Vijay Vasudevan ,
  • David G. Andersen ,
  • Gregory R. Ganger ,
  • Garth A. Gibson ,
  • Srinivasan Seshan

USENIX FAST 2008

Cluster-based and iSCSI-based storage systems rely on standard TCP/IP-over-Ethernet for client access to data. Unfortunately, when data is striped over multiple networked storage nodes, a client can experience a TCP throughput collapse that results in much lower read bandwidth than should be provided by the available network links. Conceptually, this problem arises because the client simultaneously reads fragments of a data block from multiple sources that together send enough data to overload the switch buffers on the client’s link. This paper analyzes this Incast problem, explores its sensitivity to various system parameters, and examines the effectiveness of alternative TCP- and Ethernet-level strategies in mitigating the TCP throughput collapse.
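The buffer-overload intuition can be illustrated with a back-of-the-envelope sketch. This is not the paper's model or measurement method. It assumes, as a worst case, that every storage node sends a full window of packets at the same instant and that the switch port drains nothing during the burst. The function names and all parameter values (`num_servers`, `window_packets`, `buffer_packets`) are hypothetical and chosen only for illustration.

```python
def dropped_packets(num_servers: int, window_packets: int, buffer_packets: int) -> int:
    """Packets lost when all servers burst into one switch output buffer at once.

    Assumption (illustrative only): the burst arrives faster than the port can
    drain, so anything beyond the buffer capacity is dropped.
    """
    arriving = num_servers * window_packets
    return max(0, arriving - buffer_packets)


def first_overflowing_server_count(window_packets: int, buffer_packets: int) -> int:
    """Smallest number of synchronized servers whose combined burst exceeds the buffer."""
    return buffer_packets // window_packets + 1


if __name__ == "__main__":
    # Hypothetical values: 8-packet windows, a 64-packet port buffer.
    for n in (4, 8, 16):
        print(n, "servers ->", dropped_packets(n, 8, 64), "packets dropped")
    print("overflow starts at", first_overflowing_server_count(8, 64), "servers")
```

Under these assumptions, losses appear as soon as the number of synchronized senders pushes the combined burst past the buffer size, which is the condition the abstract describes in words. The sketch says nothing about how TCP reacts to those losses.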