A large amount of popular content is transferred repeatedly across network links in the Internet. In recent years, packet-level, protocol-independent redundancy elimination, which removes duplicate strings from within network packets, has emerged as a powerful technique for improving the efficiency of network links in the face of repeated data. Many vendors offer such redundancy elimination middleboxes to improve the effective bandwidth of enterprise, data center, and ISP links alike.

In this paper, we conduct a large-scale trace-driven study of protocol-independent redundancy elimination mechanisms, driven by several terabytes of packet payload traces collected at 12 distinct network locations, including the access links of a large US-based university and of 11 enterprise networks of different sizes. Based on extensive analysis, we present a number of findings on the benefits and fundamental design issues of redundancy elimination systems. Two of our key findings are: (1) a new redundancy elimination algorithm based on Winnowing outperforms the widely used Rabin fingerprint-based algorithm by 5-10% on most traces and by as much as 35% on some traces; and (2) surprisingly, 75-90% of the middlebox’s bandwidth savings in our enterprise traces are due to redundant byte-strings from within each client’s traffic, implying that pushing redundancy elimination capability to the end hosts, i.e., an end-to-end redundancy elimination solution, could obtain most of the middlebox’s bandwidth savings.
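
To make the contrast between the two fingerprint-selection approaches concrete, the following is a minimal sketch of the generic winnowing rule (pick the minimum hash in every sliding window) next to value-based sampling (keep hashes that are 0 mod p). It is an illustration under stated assumptions, not the algorithm evaluated in this paper: CRC32 stands in for Rabin fingerprints, and the function names and the parameters `k`, `window`, and `p` are hypothetical.

```python
import zlib
from typing import List


def kgram_hashes(payload: bytes, k: int) -> List[int]:
    """Hash every k-byte substring of the payload.

    Assumption: CRC32 is a stand-in for a rolling Rabin fingerprint.
    """
    return [zlib.crc32(payload[i:i + k]) for i in range(len(payload) - k + 1)]


def winnow(hashes: List[int], window: int) -> List[int]:
    """Return positions chosen by winnowing.

    In each window of `window` consecutive hashes, select the minimum,
    breaking ties in favor of the rightmost occurrence. Each position is
    recorded once even if it is the minimum of several windows.
    """
    selected: List[int] = []
    last = -1
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        # Offset of the rightmost occurrence of the minimum in this window.
        offset = window - 1 - win[::-1].index(smallest)
        pos = start + offset
        if pos != last:
            selected.append(pos)
            last = pos
    return selected


def sample_mod_p(hashes: List[int], p: int) -> List[int]:
    """Return positions whose hash value is 0 mod p (value-based sampling)."""
    return [i for i, h in enumerate(hashes) if h % p == 0]


if __name__ == "__main__":
    payload = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 4
    hashes = kgram_hashes(payload, k=8)
    print(len(winnow(hashes, window=16)), len(sample_mod_p(hashes, p=16)))
```

In this sketch, winnowing selects at least one fingerprint in every window, so consecutive selected positions are never more than `window` apart, whereas value-based sampling can leave arbitrarily long stretches of a payload without any selected fingerprint.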