WANalytics: Geo-Distributed Analytics for a Data Intensive World

  • Ashish Vulimiri
  • Carlo Curino
  • Philip Brighten Godfrey
  • Thomas Jungblut
  • Konstantinos Karanasos
  • Jitu Padhye
  • George Varghese

SIGMOD

Published by ACM

Many large organizations collect massive volumes of data
each day in a geographically distributed fashion, at data
centers around the globe. Despite their geographically diverse
origin, the data must be processed and analyzed as
a whole to extract insight. We call the problem of supporting
large-scale geo-distributed analytics Wide-Area Big
Data (WABD). To the best of our knowledge, WABD is
currently addressed by copying all the data to a central
data center where the analytics are run. This approach consumes
expensive cross-data center bandwidth and is incompatible
with data sovereignty restrictions that are starting
to take shape. We instead propose WANalytics, a system
that solves the WABD problem by orchestrating distributed
query execution and adjusting data replication across data
centers in order to minimize bandwidth usage, while respecting
sovereignty requirements. WANalytics achieves up to a
360x reduction in data transfer cost when compared
to the centralized approach on both real Microsoft production
workloads and standard synthetic benchmarks, including
TPC-CH and Berkeley Big-Data. In this demonstration,
attendees will interact with a live geo-scale multi-data center
deployment of WANalytics, allowing them to experience the
data transfer reduction our system achieves, and to explore
how it dynamically adapts execution strategy in response to
changes in the workload and environment.
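
To make the bandwidth argument concrete, here is a minimal sketch contrasting the centralized approach with pushing an aggregation out to each data center. It is not the WANalytics implementation: the site names, the log records, the function names, and the use of "records shipped" as a stand-in for cross-data center bandwidth are all assumptions made for illustration. The sketch also assumes the central site is separate from the three data centers, so every raw record counts as a transfer in the centralized case.

```python
from collections import Counter

# Hypothetical per-data-center logs: (url, bytes_served) records.
LOGS = {
    "us-west": [("/home", 120), ("/cart", 80), ("/home", 100)],
    "eu-north": [("/home", 90), ("/help", 40)],
    "asia-east": [("/cart", 70), ("/cart", 60), ("/help", 30), ("/home", 110)],
}


def centralized(logs):
    """Copy every raw record to one site, then aggregate there."""
    shipped = [rec for site in logs.values() for rec in site]
    totals = Counter()
    for url, nbytes in shipped:
        totals[url] += nbytes
    # Every raw record crossed the wide-area network.
    return dict(totals), len(shipped)


def distributed(logs):
    """Aggregate locally, then ship one partial sum per (site, url)."""
    totals = Counter()
    shipped = 0
    for site in logs.values():
        partial = Counter()
        for url, nbytes in site:
            partial[url] += nbytes
        # Only the per-URL partial sums leave the data center.
        shipped += len(partial)
        totals.update(partial)
    return dict(totals), shipped


if __name__ == "__main__":
    print(centralized(LOGS))
    print(distributed(LOGS))
```

Both functions return the same per-URL totals; they differ only in how many records cross data center boundaries. On this toy input the gap is small, and it grows with the number of raw records per distinct key. Sovereignty constraints and replication decisions, which the abstract also mentions, are outside the scope of this sketch.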