Workshop on Algorithms and Data Science


wads_2014The goal of this workshop is to bring together researchers working in the area of algorithms across different application domains, discuss what the most interesting challenges are, and provide an overview of the on-going activities within MSR Cambridge, as part of external collaborations, and the technology transfer success stories.
The topics covered include biological computations, data structures and concurrency, distributed computing and networks, and large scale inference and machine learning.


The Cell Cycle Switch Computes Approximate Majority

Luca Cardelli, Microsoft Research | I will discuss a connection between an efficient and well-studied distributed consensus algorithm (Approximate Majority) expressed as a 4-reaction chemical system, and a Nobel prize winning biological switch that is found in the cell cycle of all eukaryotes, from yeast to us. This connection is an example of ‘network morphisms’ that preserve both structure and functionality across systems of different complexity.

Data Structures of the Future: Concurrent, Optimistic, and Relaxed

Dan Alistarh, Microsoft Research | The need to process larger and larger amounts of data as efficiently as possible has been one of the main trends of the past decade. This new set of requirements significantly changes the way data structures are designed, implemented and employed. In this talk, I will give a couple of examples of such new data structure designs, and describe some of the major challenges in the area.

Theory and Practice in Algorithm and Data Structure Design

Robert E. Tarjan, Princeton University and Microsoft Research | In the study and use of algorithms, theory and practice interact, as do algorithm and data structure design. To illustrate this, I’ll discuss recent theoretical and practical results on the problem of finding dominators in a flow graph and on the disjoint set union problem. The former is an important basic computation in optimizing compilers; the latter is a classical problem in data structures with diverse applications. Fast algorithms to find dominators rely on disjoint set union algorithms and extensions. In practice, simple versions of these algorithms beat theoretically faster ones. A theoretical analysis of a randomized disjoint set union algorithm provides an explanation of these empirical results.

Community Detection: Recent Results and Open Problems

Laurent Massoulie, MSR/INRIA joint research centre | Community detection, similar to graph partitioning, consists in identification of groups of similar items within a population based on observed interactions. In the context of online social networks, it is a useful primitive for recommending either contacts or news items to users. We will consider a particular generative probabilistic model for the observations, namely the so-called stochastic block model, and generalizations thereof. We will describe spectral methods for community detection. Exploiting results on the spectrum of random graphs, we will establish consistency of these approaches under suitable assumptions, namely presence of a sufficiently strong signal in the observed data. We will also discuss open questions on phase transitions for detectability in such models when the signal becomes weak. In particular we will introduce a novel spectral method which provably allows detection of communities down to a critical threshold, thereby settling an open conjecture of Decelle, Krzakala, Moore and Zdeborová. We will conclude with open problems of current interest in the field.

Data-Oblivious Algorithms

Olya Ohrimenko, Microsoft Research | Cloud storage has emerged as the next generation of data storage where users can remotely store their data and leave its management to a third party, e.g., Microsoft Azure. However, the fact that users no longer have physical possession of their data raises new challenges in terms of data privacy. Storing the data in encrypted form is a key component in maintaining privacy against the storage provider. However, encryption alone is not enough since information may be leaked through the pattern in which users access the data. In this talk, I describe algorithms that allow data-oblivious access to remotely stored data. That is, access patterns of such algorithms depend only on the size of the outsourced data and algorithm input, but not their content. Hence, such algorithms reveal nothing about the data they are processing.

Labelling Images: Exploiting Problem Structure for Efficient Inference

Pushmeet Kohli, Microsoft Research | Many problems in computer vision and machine learning require inferring the most probable states of certain hidden or unobserved variables. This inference problem can be formulated in terms of minimizing a function of discrete variables. The scale and form of computer vision problems raise many challenges in this optimization task. For instance, functions encountered in vision may involve millions or sometimes even billions of variables. Furthermore, the functions may contain terms that encode very high-order interaction between variables. These properties ensure that the minimization of such functions using conventional algorithms is extremely computationally expensive. In this talk, I will discuss how many of these challenges can be overcome by exploiting the sparse and heterogeneous nature of discrete optimization problems encountered in real world computer vision problems. Such problem-aware approaches to optimization can lead to substantial improvements in running time and allow us to produce good solutions to many important problems.


Are Lock-Free Concurrent Algorithms Practically Wait-Free?

Dan Alistarh, Microsoft Research | Lock-free concurrent algorithms guarantee that some concurrent operation will always make progress in a finite number of steps. Yet programmers prefer to treat concurrent code as if it were wait-free, guaranteeing that all operations always make progress. Unfortunately, designing wait-free algorithms is generally a very complex task, and the resulting algorithms are not always efficient. While obtaining efficient wait-free algorithms has been a long-time goal for the theory community, most non-blocking commercial code is only lock-free.

Our work suggests a simple solution to this problem. We show that, for a large class of lock-free algorithms, under scheduling conditions which approximate those found in commercial hardware architectures, lock-free algorithms behave as if they are wait-free. In other words, under certain conditions, programmers can keep on designing simple lock-free algorithms, and in practice, they will get wait-free progress.

Streaming Verification of Outsourced Computation

Graham Cormode, University of Warwick | When handling large quantities of data, it is desirable to outsource the computational effort to a third party: this captures current efforts in cloud computing, but also scenarios within trusted computing systems and inter-organizational data sharing. When the third party is not fully trusted, it is desirable to give assurance that the computation has been performed correctly. This poster presents some recent results in designing new protocols for verifying computations which are streaming in nature: the verifier (data owner) needs only a single pass over the input, storing a sublinear amount of information, and follows a simple protocol with a prover (service provider) that takes a small number of rounds. A dishonest prover fools the verifier with vanishingly small probability, while an honest prover’s answer is always accepted.

Interactive proof techniques allow a quite general class of computations to be verified, leading to practical implementations.

Programmable Chemical Controllers made from DNA

Neil Dalchau, Microsoft Research | The ability to reliably re-program cells has great potential for tackling societal challenges, in areas such as health, food and energy. DNA is the code that stores the information required to replicate life, and is read by cellular machinery. In addition to its role in the natural sense, it is also possible to use DNA as a substrate for computing, by taking advantage of its physical properties. This gives rise to the potential use of DNA as an intracellular controller, improving cell-based production of biofuels and medicines, and even acting as “smart drugs”. This poster will show how we have used DNA strand displacement to create complex dynamical behaviours. Network designs benefit from the Visual DSD software, developed within Microsoft Research, to build, analyse and test models of DNA strand displacement networks. We formally implement chemical reaction networks (CRNs), which are capable of a rich set of nonlinear dynamics. By combining DNA implementations of non-catalytic and autocatalytic reactions, we construct a circuit that implements a well-studied approximate majority algorithm, one of the fastest known ways for a population to reach a collective majority decision between two possible outcomes. The experimentally observed dynamics of the full circuit are well predicted by a model whose parameters are inferred from experimental measurements of the constituent reactions using Markov chain Monte Carlo parameter estimation. The predictability of the dynamics of DNA-based computing architectures could one day lead to their use in a clinical context as sophisticated molecular controllers.

Ranking using Spectral Methods

Fajwel Fogel, Ecole Normale Superieure | We study the problem of ranking a set of n items given pairwise comparisons.

We show that the ranking problem is directly connected to another classical ordering problem, namely, seriation. In particular, we show how to construct a similarity matrix based on pairwise comparisons, and, in the noiseless case, recover the true ranking by applying the spectral seriation algorithm to the similarity matrix.

In the noisy case, we show that spectral seriation can exactly recover the true ranking even when some of the pairwise comparisons are either corrupted or missing, provided that the pattern of errors is relatively unstructured.

Finally, we perform numerical experiments which suggest our method also to be competitive in settings when the input data consists of a small fraction of all possible comparisons.

Servicing “mice” – Streaming Queries as a Service

Christos Gkantsidis, Microsoft Research | Traditional systems for stream processing have focused on queries on high-rate data streams, typically by scaling-out the work. However, with the emergence of Personal Assistants such as Cortana, we are faced with a different type of queries, low-rate but computationally more expressive. In addition, the back-end to support those queries should be able to cope with potentially huge numbers of such small queries, many of which may be similar (e.g. many users may require weather updates for the same area). The placement of queries to servers in the back-end should maximize opportunities for sharing work among queries, and, at the same time, balance the load across servers. In this demo, we describe the algorithms we have designed to minimize network traffic and balance load for Reactor (the system that executes the queries created by Cortana).

Delta: Scalable Data Dissemination under Capacity Constraints

Konstantinos Karanasos, Microsoft Research | In content-based publish-subscribe (pub/sub) systems, users express their interests as queries over a stream of publications. Scaling up content-based pub/sub to very large numbers of subscriptions is challenging: users are interested in low latency, that is, getting subscription results fast, while the pub/sub system provider is mostly interested in scaling, i.e., being able to serve large numbers of subscribers, with low computational resources utilization. In this poster I will present a novel approach for scalable content-based pub/sub in the presence of constraints on the available CPU and network resources, implemented within our pub/sub system Delta. We achieve scalability by off-loading some subscriptions from the pub/sub server, and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. The main contribution is a novel algorithm for organizing views in a multi-level dissemination network, exploiting view-based rewriting and powerful linear programming capabilities to scale to many views, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN. This is a joint work with Asterios Katsifodimos and Ioana Manolescu.

Randomized Load Balancing on Networks

Thomas Sauerwald, University of Cambridge | We consider the problem of balancing load items (tokens) on networks.

Starting with an arbitrary load distribution, we allow in each round nodes to exchange tokens with their neighbours. The goal is to obtain a load distribution where all nodes have the same number of tokens. For the continuous case where tokens are arbitrarily divisible, most load balancing schemes correspond to Markov chains whose convergence is fairly well-understood.

However, in many applications load items cannot be divided arbitrarily often and we need to deal with the discrete case where load is composed of indivisible tokens. We investigate a natural randomised protocol and demonstrate that there is almost no difference between the discrete and continuous case. Specifically we show that for any network all nodes have the same number of tokens up to an additive constant in the same number of rounds as in the continuous case.

Balanced Graph Partition

Milan Vojnovic, Microsoft Research | Abstract TBD.