PRISM: PRecision-Integrated Scalable Monitoring

  • Navendu Jain
  • Dmitry Kit
  • Prince Mahajan
  • Praveen Yalagandula
  • Mike Dahlin
  • Yin Zhang

TR06-22

This paper describes PRISM, a scalable monitoring service that makes imprecision a first-class abstraction for its DHT-based aggregation service. Exposing imprecision is essential both for correctness in the face of network and node failures and for scalability to large systems. PRISM introduces the notion of conditioned consistency, which quantifies imprecision along a three-dimensional vector: arithmetic imprecision (AI) bounds numeric inaccuracy, temporal imprecision (TI) bounds update delays, and network imprecision (NI) bounds uncertainty due to network and node failures. AI and TI balance precision against monitoring overhead for scalability, while NI addresses the fundamental challenge of providing consistency guarantees despite failures in a large distributed system. Our implementation addresses the challenge of providing these metrics while scaling to a large number of nodes and attributes. By introducing a 10% AI, PRISM’s PlanetLab monitoring service, PrMon, can reduce network overheads by an order of magnitude compared to the currently used CoMon service. By using NI metrics to automatically select the best of four redundant aggregation results, we can also reduce the observed worst-case inaccuracy by nearly a factor of five.
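To make the AI and NI ideas concrete, the following is a minimal illustrative sketch and not PRISM's actual implementation. It assumes a fixed per-node error budget for AI (a node stays silent while its value remains within the budget of the last reported value) and a single scalar NI score per redundant aggregate, where lower means less uncertainty. The names `AIFilter`, `offer`, and `pick_by_ni`, the readings, and the scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AIFilter:
    """Hypothetical arithmetic-imprecision filter for one monitored attribute.

    An update is sent only when the new value differs from the last
    reported value by more than `budget` (an assumed fixed error bound).
    """
    budget: float
    last_reported: Optional[float] = None
    sent: int = 0

    def offer(self, value: float) -> bool:
        """Return True if `value` must be propagated to the parent."""
        if self.last_reported is None or abs(value - self.last_reported) > self.budget:
            self.last_reported = value
            self.sent += 1
            return True
        return False  # suppressed: the parent's cached value is still within budget


def pick_by_ni(results: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Pick the (value, ni_score) pair with the lowest NI score.

    Treating NI as one scalar where lower is better is an assumption
    made for this sketch.
    """
    return min(results, key=lambda r: r[1])


# Made-up readings with an error budget of 10 units.
readings = [100.0, 101.0, 99.5, 104.0, 111.0, 110.5, 112.0, 95.0]
ai = AIFilter(budget=10.0)
sent_values = [v for v in readings if ai.offer(v)]

# Four redundant aggregation results with made-up NI scores.
redundant = [(42.0, 0.30), (40.0, 0.05), (55.0, 0.60), (41.0, 0.10)]
best = pick_by_ni(redundant)

print(sent_values)  # [100.0, 111.0, 95.0]
print(best)         # (40.0, 0.05)
```

In this toy run only three of the eight readings are propagated, which illustrates how tolerating a bounded numeric error can cut message load. The second half shows the selection step: among redundant results, the one with the least reported uncertainty is used.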