Inria Joint Center

Overview

The Microsoft Research-Inria Joint Center, Inria JC, is a collaborative research partnership between Microsoft Research and Inria, the French public research institute for computer science. Since its creation in 2005, the Inria JC has been home to over 25 projects co-led by researchers from Inria and MSR. There are currently eight active projects, addressing a wide range of cutting-edge research challenges.

Security is an area of particular focus, with projects on cryptography for blockchains, certification of distributed algorithms’ correctness, and design of cryptographic protocols for the Internet with certified security properties.

Artificial Intelligence is another main focus of the Inria JC, with projects on machine learning and computer vision (specifically for Mixed Reality devices such as HoloLens) and on alternative paradigms to reinforcement learning for continuously evolving and interacting agents. In Computer Vision, the Inria JC partners closely with the Mixed Reality & AI Zurich Lab.

“Collaboration between computer science researchers from academia and industry has never made more sense than today, given the exciting challenges and opportunities offered by our field. The MSR-Inria JC empowers such collaborations between fellow researchers at Inria and Microsoft Research on topics from artificial intelligence to security and quantum computing.” 

Laurent Massoulié, Inria JC Managing Director

The Inria JC is governed by a Management Committee consisting of representatives of Inria, Microsoft Research and Microsoft France. Among other things, the Management Committee oversees the progress of existing projects and identifies new project opportunities.

Address

Microsoft Research – Inria JC
2 rue Simone Iff
75012 PARIS

People

Students and Postdocs

Microsoft - Principal Investigators

Inria - Principal Investigators

Management Board

Projects

Current projects

Acquisition and synthesis of high-quality four-dimensional data

Inria PIs: Edmond Boyer, Jean-Sébastien Franco

MS PIs: Federica Bogo, Martin de la Gorce

PhD students: Matthieu Armando, Boyao Zhou

One aim of this project is the synthesis of 3D shape models based on observations from several viewpoints. A key challenge here is removing noise from the acquired shape models; to that end, we develop denoising algorithms based on graph convolutional neural networks. A second aim of this project is the completion of shape and motion information from partial observations, e.g. sparse point clouds. To address it, we develop approaches based on Gaussian processes combined with deep learning.
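As a rough illustration of the graph-convolution idea, here is a minimal sketch in which a mesh patch is represented by its adjacency and a single convolution layer smooths noisy vertex positions; the toy data, identity weights and single layer are illustrative assumptions, not the project's actual denoising architecture.

```python
import numpy as np

def normalized_adjacency(edges, n_vertices):
    """Symmetrically normalised adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A = np.eye(n_vertices)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(X, A_hat, W):
    """One graph-convolution layer: aggregate features over mesh neighbours, then mix channels.
    A trained denoiser would stack several such layers (with nonlinearities and learned W),
    typically predicting a per-vertex correction to the noisy positions."""
    return A_hat @ X @ W

# Toy example on a 4-vertex mesh patch: the vertex features are noisy 3D positions.
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
noisy_positions = rng.normal(size=(4, 3))
A_hat = normalized_adjacency(edges, 4)
W = np.eye(3)   # identity weights: the layer then reduces to neighbourhood smoothing
smoothed = graph_conv(noisy_positions, A_hat, W)
```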

Autonomous Learning

Inria PI: Pierre-Yves Oudeyer

MSR PI: Katja Hofmann

PhD students: Rémy Portelas, Laetitia Teodorescu

This project contributes to the foundations of a novel machine learning framework, Intrinsically Motivated Goal Exploration Processes (IMGEPs), focused on the challenge of autonomously learning diverse repertoires of skills in open and changing environments. This is an alternative to the classical reinforcement learning framework, which is currently reaching both conceptual and practical limits. It is grounded in a decade of work modelling mechanisms of curiosity-driven learning and development in human infants. In particular, it aims to develop algorithms that i) incrementally learn goal-space representations and the associated internal goal-achievement reward functions, so as to foster efficient discovery of a diversity of skills; and ii) allow a social peer to guide curiosity-driven learners using natural language, both making exploration more efficient by leveraging language input and enabling external users to access, through language-based queries, skills the machine has learned autonomously.
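For intuition, the goal-exploration loop at the heart of such frameworks can be sketched as follows; the toy environment, the fixed two-dimensional goal space and the nearest-neighbour exploration policy are placeholder assumptions, whereas real IMGEP implementations learn the goal space and sample goals according to learning progress.

```python
import numpy as np

rng = np.random.default_rng(0)

def environment(action):
    """Hypothetical environment: maps a 2D action to a 2D outcome (stand-in for a robot or simulator)."""
    return np.tanh(action) + 0.01 * rng.normal(size=2)

memory = []  # (action, outcome) pairs collected so far

for episode in range(200):
    goal = rng.uniform(-1, 1, size=2)              # sample a goal in the (here fixed) goal space
    if memory:
        # Reuse the past action whose outcome was closest to the goal, then perturb it.
        best_action, _ = min(memory, key=lambda m: np.linalg.norm(m[1] - goal))
        action = best_action + 0.1 * rng.normal(size=2)
    else:
        action = rng.uniform(-1, 1, size=2)
    outcome = environment(action)
    reward = -np.linalg.norm(outcome - goal)       # internal goal-achievement reward
    memory.append((action, outcome))
```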

Cryptography for the Blockchain

Inria PI: David Pointcheval

MSR PIs: Melissa Chase, Esha Ghosh, Santiago Zanella

PhD students: Balthazar Bauer, Antoine Plouviez

The project aims to advance the state of the art in blockchain technology. Blockchains, as exemplified by Bitcoin, Ethereum and ZCash, show promise to enable guaranteed transactions without the need for a central authority. Among the identified objectives, researchers of the project are looking at making blockchains “greener”, for instance by replacing the notion of proof of work that underlies Bitcoin with “proof of storage”, and at developing new applications of blockchains, in the spirit of recent proposals to perform decentralized verification of, e.g., web certificates and public-key infrastructure systems.

Flexible AI

Inria PI: Emmanuel Dupoux

MSR PI: Paul Smolensky

PhD student: Mathieu Rita

Recent progress in deep learning has led to the successful development of neural networks able to achieve complex tasks in domains such as computer vision or natural language. These successes are attained by networks that are trained with massive amounts of data. However, the interactive and functional aspects of intelligence are almost completely ignored: researchers are only beginning to take an interest in introducing communication tools in multi-agent settings. This research aims to study the conditions under which populations of agents develop a form of communication that helps them succeed in their tasks. From a scientific perspective, understanding the evolution of languages in communities of deep agents and its emergent properties can shed light on human language evolution. From an applied perspective, endowing deep networks with the ability to solve problems interactively by communicating with each other and with us should make them more flexible and useful in everyday life.

Machine Learning for Distributed Environments

Inria PIs: Francis Bach, Laurent Massoulié

MSR PI: Sébastien Bubeck

PhD Student: Hadrien Hendrikx

The project aims to develop efficient distributed ML algorithms suited, for instance, to running in the cloud.

Large-scale machine learning algorithms, whether supervised or unsupervised, are ideally run on a cloud platform with many cores organized in a hierarchical fashion in datacenters. Important efficiency gains can be had by designing new algorithms tailored to exploit such distributed platforms. Specific objectives include the design of new algorithms for supervised learning (e.g., distributed accelerated gradient descent) and unsupervised learning (e.g., distributed spectral analysis and graph clustering).
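As a minimal sketch of the data-parallel pattern these algorithms build on, the snippet below runs synchronous gradient averaging on a least-squares problem split across workers; the shard sizes and step size are arbitrary, and the project's actual contributions (accelerated and decentralized variants) are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, n_per_worker, dim = 4, 250, 10
w_true = rng.normal(size=dim)

# Each "worker" holds its own shard of the data.
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(n_per_worker, dim))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_worker)
    shards.append((X, y))

def local_gradient(w, X, y):
    """Gradient of the local least-squares loss (1/2n) ||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(dim)
step = 0.1
for iteration in range(200):
    grads = [local_gradient(w, X, y) for X, y in shards]  # computed in parallel in a real deployment
    w -= step * np.mean(grads, axis=0)                    # parameter-server style averaging
```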

Project Everest

Inria PI: Karthik Bhargavan

MSR PIs: Antoine Delignat-Lavaud, Cédric Fournet, Jonathan Protzenko, Nikhil Swamy, Santiago Zanella

Postdoc: Florian Groult

PhD Student: Marina Polubelova

The project aims at building formally verified high-performance standard-compliant cryptographic components for securing Internet communications. These components can be used either as drop-in replacements for existing libraries and protocols, or to build verified secure sub-systems and applications. By construction, these components prevent flaws and vulnerabilities that litter existing implementations and require frequent security patches—see, e.g., the 3SHAKE, FREAK and LOGJAM attacks uncovered as part of our research. More details can be found at the following sites:

View project page

https://project-everest.github.io/

https://www.fstar-lang.org/

https://mitls.org/

Software Heritage and Software Productivity

Inria PIs: Roberto di Cosmo, Stefano Zacchiroli

MSR PI: Tom Zimmermann

Software Heritage is today the largest corpus of software source code artifacts, containing over 6 billion unique source code files and 1 billion commits, coming from more than 90 million software projects. So far, empirical studies of software engineering have been conducted on much smaller repositories, and as such cannot be fully conclusive. The aim of this project is to reproduce, on Software Heritage, previous empirical studies relevant for the improvement of developer productivity. In particular, it will analyze: the extent of large-scale code reuse in open-source software; the identification of topics in source code via semantic clustering; the relation between programming language and code quality; and quality and productivity outcomes related to continuous integration.

TLA Proof System

Inria PIs: Damien Doligez, Stephan Merz

MSR PIs: Markus Kuppe, Leslie Lamport

Postdoc: Ioannis Filippidis

PhD student: Antoine Defourné

The project aims to develop a proof assistant enabling the certification of TLA+-expressed specifications.

TLA+ and the associated PlusCal algorithm language were proposed by Leslie Lamport as tools for developing and verifying algorithms for concurrent and distributed systems. They have gradually matured, have recently been used for industrial applications, and are now entering mainstream use for certifying properties of distributed systems. An associated proof assistant is an important complement to model checking for establishing properties of TLA+-expressed algorithm specifications. More details can be found at: https://lamport.azurewebsites.net/tla/tla.html

https://tla.msr-inria.inria.fr

Video Understanding for Personal Assistants

Inria PIs: Ivan Laptev, Josef Sivic

MS PIs: Marc Pollefeys, Johannes Schönberger, Bugra Tekin

PhD students: Yana Hasson, Dimitri Zhukov

The project aims to develop tools for the HoloLens Mixed Reality device, in particular solutions for constructing visual cues to be superimposed on the real scene in order to help users carry out practical tasks (such as changing a car tyre). Ongoing work at Inria shows promise for automatically processing YouTube instruction videos for various tasks to identify their key stages and the visual signatures of these steps. It is a bold challenge to go from there to the actual production of visual cues projected by HoloLens to assist users in carrying out chosen tasks.

Archived projects

4D Cardiac MR Images


This project finished in 2013 – then continued by “Medilearn”

Goals of the project

Given a large database of cardiac images of patients stored with an expert diagnosis, we wish to automatically select the most similar cases to the cardiac images of a new patient.

Application

This is important, for instance, to estimate the optimal time for an operation when a Tetralogy of Fallot condition is diagnosed.

Science

We want to be able to index images based on, e.g., the shape of the heart, the dynamics of the myocardium, or the presence of anomalies. This would require capturing the right level of shape, motion and appearance characteristics in a compact and efficient way, so as to enable fast indexing and retrieval even for hundreds of 4D datasets. Using state-of-the-art, efficient machine learning techniques is also of paramount importance. The design of both the visual features and the classification algorithms has to be informed by medical experts, so as to make sure the final system remains relevant from a clinical point of view.

Such a system would provide a strong innovation in cardiology, a learning tool for residents and young physicians, and a new tool to help physicians assess a diagnosis and plan a therapy. The method could emphasize local motion and/or local shape singularities (e.g. septal flash, a motion anomaly, vs. ventricle overload, a shape anomaly), with the possibility to explicitly weight motion vs. shape features.

The MSRC-Inria collaboration

Key points to succeed in developing such a system are the following:

  • Expertise in cardiac image processing, in order to automatically extract the boundaries of the ventricular cavities (endocardium of the left and right ventricles) and the outer boundaries of the heart (epicardium) from cardiac MR images, as well as cardiac motion (dense non-rigid motion of the myocardium, accounting for quasi-incompressibility constraints of tissues and the isovolumetric phases of cardiac motion).
  • Expertise in indexation of large databases of complex images, in order to select the appropriate shape and motion invariant features, and to select an appropriate metric in feature space reflecting the “clinical distance” between different medical cases, as well as the development of efficient algorithmic solutions to beat the curse of dimensionality.
  • Expertise in pediatric cardiology, with the constitution of, and access to, a very large database of 3D + time cardiac images stored with an expert clinical diagnosis, including the specificity of each case but also the therapy choice along with the corresponding outcome.

Our partnership

The close partnership between Inria Sophia Antipolis, Microsoft Research Cambridge, and King’s College London brings together the required expertise for the three points above, and could lead to a pioneering advance in the field of computer-aided clinical decision support for pediatric cardiology and, more generally, in content-based retrieval of medical images.

The best strategy to implement this research agenda would be to rapidly hire a highly qualified PhD student, who would share his/her time between the three institutions, in order to rapidly build expertise by leveraging the extensive knowledge of the three participating laboratories. The exact division of time would probably depend on the original qualifications of the candidate and on the progress made during the PhD research.

A-Brain


This project finished in 2013 – then continued by Z-CloudFlow and MediLearn.

Goals of the project

Joint acquisition of neuroimaging and genetic data on large cohorts of subjects is a new approach used to assess and understand the variability that exists between individuals, which has remained poorly understood so far. As both neuroimaging- and genetic-domain observations represent a huge number of variables (of the order of 10^6), performing statistically rigorous analyses on such amounts of data represents a computational challenge that cannot be addressed with conventional techniques. On the one hand, sophisticated regression techniques need to be used in order to perform sensitive analyses on these large datasets; on the other hand, parameter optimization and statistical validation procedures (e.g. permutation tests) entail a high computational cost. Fortunately, the computational framework can easily be run in parallel.

In this project, researchers of the Parietal and KerData Inria teams will jointly address this computational problem using cloud computing techniques on the Microsoft Azure cloud computing environment. The two teams bring complementary expertise: KerData (Rennes) in the area of scalable cloud data management and Parietal (Saclay) in the field of neuroimaging and genetics data analysis. The Map-Reduce programming model has recently arisen as a very effective approach to developing high-performance applications over very large distributed systems such as grids and, now, clouds. KerData has recently proposed a set of algorithms for data management, combining versioning with decentralized metadata management to support scalable, efficient, fine-grain access to massive, distributed Binary Large OBjects (BLOBs) under heavy concurrency. The project investigates the benefits of integrating BlobSeer with Microsoft Azure storage services and aims to evaluate the impact of using BlobSeer on Azure with large-scale application experiments such as the genetics-neuroimaging data comparisons addressed by Parietal.

This project will thus bring together researchers from the algorithmic and statistical analysis domain on the one hand, and researchers working on data management for intensive computation on the other, to work on the Microsoft Azure platform in order to unveil the relationships between genes and brain characteristics.
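To illustrate why the permutation tests mentioned above parallelize so naturally, here is a minimal map-reduce-style sketch using a local process pool: each map task runs an independent batch of permutations and the reduce step merges the resulting null distribution. The correlation statistic, the synthetic data and the local pool are simplifying assumptions; the actual A-Brain pipeline ran on Azure with BlobSeer-backed storage.

```python
import numpy as np
from multiprocessing import Pool

rng = np.random.default_rng(0)
x = rng.normal(size=500)                 # e.g. a genetic variable measured over subjects
y = 0.1 * x + rng.normal(size=500)       # e.g. an imaging-derived phenotype
observed = abs(np.corrcoef(x, y)[0, 1])

def map_task(args):
    """One map task: run a batch of permutations and return the null statistics."""
    seed, n_perm = args
    local_rng = np.random.default_rng(seed)
    return [abs(np.corrcoef(local_rng.permutation(x), y)[0, 1]) for _ in range(n_perm)]

if __name__ == "__main__":
    with Pool(4) as pool:
        batches = pool.map(map_task, [(seed, 2500) for seed in range(4)])
    null = np.concatenate(batches)       # the reduce step: merge the partial null distributions
    p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
    print(f"permutation p-value: {p_value:.4f}")
```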

Team members

Current members:

  • Gabriel Antoniu, Inria Rennes – Bretagne Atlantique

Former members:

  • Elena Apostol, Microsoft Research-Inria Joint Centre (Post Doctoral Student)
  • Luc Bougé, Ecole Normale Supérieure de Rennes – IRISA
  • Alexandru Costan, INSA Rennes
  • Benoit Da Mota, Microsoft Research – Inria Joint Center (Expert Engineer)
  • Jean-Baptiste Poline, CEA
  • Bertrand Thirion, Inria Saclay – Île-de-France
  • Radu Tudoran, IRISA / ENS Cachan, Antenne de Bretagne (PhD student)

Adaptive Combinatorial Search for E-science


This project finished in 2013.

Goals of the project

The goal of this research is to improve the applicability of constraint-based or heuristic-based solvers to complex scientific problems. As demonstrated by a large literature, e-Scientists already benefit from the use of search procedures to tackle a large variety of important problems. Unfortunately, these applications suffer from the limits of current solving technologies which appear to be poorly adapted to these new domains. One solution to improve performance is the fine tuning of the solver parameters. This is a tedious and time-consuming task that often requires knowledge about both the domain and the algorithm.

This approach is hardly applicable to e-Science, whose application fields are constantly growing. We claim that the self-adaptation of a solver to the domain of interest is the only viable solution to this problem. Our goal in this project is to develop tools able to automatically choose the optimal parameter configuration of a given search algorithm for a given problem or class of problems.

This adaptation would allow us to deploy combinatorial search on new, complex scientific problems with good expected performance.

Dynamic Dictionary of Mathematical Functions


This project finished in 2013.

Project Summary

Our research project is concerned with the automatic derivation of properties and formulae concerning classical elementary and special functions. We are located in Orsay, within the Inria-Microsoft Research Joint Centre, with participants from the Algorithms project at Inria Rocquencourt. Please contact us for additional information, visits, internships, etc.

Overview

The aim of this project is to automate the computation of numerous mathematical formulae that are needed in analysis. We intend to show that large parts of the code dealing with elementary and special functions in computer algebra systems can actually be generated automatically in a uniform manner.

The starting point is to promote linear differential or difference equations to the status of data structures representing their solutions. New efficient algorithms operating directly on this data structure will be designed as part of this project. As a showcase, we intend to develop an encyclopedia of special functions available on the web. In addition to our existing prototype, this new version will provide dynamic entries depending on user queries; integral transforms (Laplace, Fourier, …); and more information on each function.

In order to achieve the desired efficiency, the project contains an important theoretical investment on the complexity of fundamental algorithms of computer algebra. Computer algebra is devoted to the study of effective mathematics and the complexity of corresponding algorithms. The emphasis is on exact mathematical objects (integers, rational numbers, polynomials, algebraic numbers,…) rather than numerical approximations; the foundations lie in algebra. So far, most of the impact of this area has been in pure mathematics and in education.

The applications of computer algebra systems such as Maple or Mathematica to applied, discrete or experimental mathematics have often been restricted by either the scarceness of exact solutions or the huge memory requirements of practical problems. Our long-term project aims at improving this state of affairs and bringing computer algebra to a stage where it can generate efficient and guaranteed numerical code for approximating the solutions of large classes of problems of real-life size. We pursue three main ideas rooted in computer science.

Equations as data structures

Very few equations admit explicit solutions. This does not mean that questions about these solutions cannot be answered. Indeed, even when explicit solutions exist, some of their properties can be accessed more efficiently from the defining equations. By considering equations themselves as data-structures, one is led to develop algorithms that construct equations satisfied by objects of interest as well as algorithms that extract information from these equations. This idea is classical in the case of roots of polynomials, where the polynomials themselves are a good data-structure for representing the roots. Algorithmic tools operating on this data-structure are the classical Euclidean division, Euclidean algorithm and, in the case of multivariate systems, Gröbner bases.

More recently, it has been realized (in large part by members of this project) that another important class of examples is provided by linear differential equations. The tools there are non-commutative analogues of the algorithms used for polynomials. To give an idea of the impact of this idea, it can be estimated that approximately 60% of Abramowitz and Stegun’s Handbook of Mathematical Functions (one of the most cited references in mathematics, physics, chemistry and engineering sciences), covering the knowledge of the best specialists on about 100 mathematical functions, can be generated automatically by computer algebra programs using systems of linear differential equations as the basic data structure. This has been implemented in a prototype on-line encyclopedia.
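To make the data structure concrete, here is a small illustration; the closure bounds below are standard facts about functions defined by linear differential equations, stated here for intuition rather than taken from the project's publications.

```latex
% Two functions represented by their defining linear differential equations:
\[
  \exp x:\quad y' - y = 0,\ \ y(0)=1,
  \qquad\qquad
  \operatorname{Ai}(x):\quad y'' - x\,y = 0,\ \ y(0)=\operatorname{Ai}(0),\ y'(0)=\operatorname{Ai}'(0).
\]
% Closure properties operate directly on the equations: if $f$ satisfies a linear ODE of
% order $n$ and $g$ one of order $m$, then $f+g$ satisfies a linear ODE of order at most
% $n+m$ and $f\,g$ one of order at most $n\,m$, both computable by linear algebra on the
% derivatives, without any closed form for $f$ or $g$.
```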

Obvious advantages of computer generation are added interactivity, a virtually unbounded set of functions, and automatically generated symbolic or numerical code. The next steps in this area will be to add more interactivity to this prototype and to extend it to deal with integral transforms and more general parameterized integrals. In the long term, another application of this data structure is the generation of good approximations for otherwise difficult-to-evaluate multiple integrals arising in theoretical physics or chemistry.

Mathematical objects stored as programs

Intermediate expression swell is a common disease of computer algebra algorithms: integers or polynomials appearing in intermediate computations are much larger than the actual input and output, and most of the computation time is spent on them. In several important cases, the actual expressions of these objects are not needed for the full computation, only certain pieces of information about them. It has been realized that using, as a data structure, programs that evaluate these polynomials, integers or more general mathematical objects results in huge memory savings.

All the algorithms then need to be revisited, since arithmetic operations become essentially free while equality testing becomes expensive. The results of carrying out this program for polynomial systems are spectacular: the worst-case complexity, which was exponential in the size of the result, has become only polynomial. This is the result of a long research program by the TERA group, to which some of us have contributed.

Another recent application of this idea to differential equations has led us to algorithms whose complexity is almost optimal (up to logarithmic factors) in the size of the result for their solutions. Such results have an immediate impact throughout large parts of computer algebra.

Geometry of Complexity

Another source of computational complexity comes from the nature of the problems that are being attacked. The ill-conditioning of equations or matrices which causes difficulties in numerical analysis has a counterpart in the size of exact representations. Typically, the existence of a close-by singular matrix or polynomial or polynomial system forces the representations of solutions to involve very large integers. Thus the intrinsic complexity of a problem depends on how far (in a mathematically quantifiable sense) it lies from singularities. One is thus led to a theory of geometric computational complexity. There, global upper bounds are refined by a complexity measure depending on the geometry of the input data-structure space.

A lot of work is yet to be done, both experimentally and theoretically, to develop geometry-aware algorithms. For instance, we have recently developed and precisely analyzed an algorithm that converges quadratically to solutions of systems, even in the presence of multiplicities or clusters of roots, by analyzing numerically and on the fly the nature of the close-by singularity. The goal there is to enlarge as much as possible the class of problems for which answers can be computed in time almost linear in the size of appropriate data structures for both input and output. Again, results obtained there are of high relevance not only in our project.

Team members

Former members:

  • Alexandre Benoit, Microsoft Research – Inria Joint Centre (PhD Student)
  • Alin Bostan, Inria Saclay – Île-de-France
  • Frédéric Chyzak, Inria Saclay – Île-de-France
  • Alexis Darrasse, Microsoft Research – Inria Joint Centre (Postdoc)
  • Stefan Gerhold, Microsoft Research – Inria Joint Centre (Postdoc)
  • Christoph Koutschan, Microsoft Research – Inria Joint Centre (Postdoc)
  • Pierre Lairez, Microsoft Research – Inria Joint Centre (PhD Student)
  • Marc Mezzarobba, Microsoft Research – Inria Joint Centre (PhD Student)
  • Bruno Salvy, Inria Saclay – Île-de-France
  • Flavia Stan, Microsoft Research – Inria Joint Centre (Postdoc)

Our software

Information Flows in Online Social Networks

diagram of Information Flows in Online Social Networks

The project focuses on access to information via systems such as Twitter and Facebook. Our goal is to improve the efficiency of information access via OSNs in terms of precision/recall. To that end, we will develop solutions for recommending both content items and potential contacts to users.

Such recommender systems will be constructed from the following primitives:

  • Reward mechanisms allowing users to give feedback on specific content items;
  • Inference mechanisms that exploit the former reward mechanisms for characterizing i) types of specific content and ii) tastes and expertise of individual users.

The resulting recommender systems should be light-weight, accurate, and resilient to users’ strategic behaviors.

The expected outcomes of the project are:

  1. A deeper understanding of the limits of collective processing and filtering of information in terms of precision / recall trade-offs.
  2. Optimized schemes for identifying implicit communities of like-minded users and for contact recommendation, helping users “rewire” the information network for better performance. In particular, robust versions of spectral embedding and message-passing algorithms à la Belief Propagation will be elaborated, and the limitations and relative merits of candidate schemes, including their robustness to noise in the input data, will be investigated. We will also design distributed mechanisms for evaluating users’ expertise on particular topics; such “expertise” evaluation can be multi-dimensional.
  3. Active learning strategies for quick categorization of content types. In particular we will design rules for adaptively selecting users from whom to obtain feedback on content, aiming to achieve accurate content categorization based on the smallest possible number of user solicitations.
  4. Protection against selfishness of participating users: the proposed reward mechanisms should induce desirable content editing and filtering by the users when they selfishly try to maximize their overall rewards. We will also aim to protect the system against attacks whereby groups of users conspire and distribute among themselves rewards similar to Facebook’s “likes”, thus artificially boosting their individual expertise. A particular scheme that we will consider is an instance of the so-called “marginal utility” reward mechanism, whereby a user is rewarded by the utility to others that would have been lost if this particular user had not actively participated in the system (see the sketch after this list).
  5. Validation on available datasets and experimental deployment of some of the proposed mechanisms.
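As a concrete illustration of the “marginal utility” idea in item 4, the following minimal sketch rewards each user by the utility the system would lose without that user's contributions. The utility function (counting distinct useful items surfaced) is a hypothetical placeholder, not the project's actual model.

```python
def system_utility(contributions):
    """Placeholder utility: the number of distinct useful items surfaced to the community."""
    useful = set()
    for items in contributions.values():
        useful.update(items)
    return len(useful)

def marginal_utility_rewards(contributions):
    """Reward each user by the utility that would be lost had they not participated."""
    total = system_utility(contributions)
    rewards = {}
    for user in contributions:
        without_user = {u: items for u, items in contributions.items() if u != user}
        rewards[user] = total - system_utility(without_user)
    return rewards

# Toy example: users recommend content items; duplicated items earn no marginal reward.
contributions = {"alice": {"a", "b"}, "bob": {"b", "c"}, "carol": {"c"}}
print(marginal_utility_rewards(contributions))   # {'alice': 1, 'bob': 0, 'carol': 0}
```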

Team members

Current members:

  • Laurent Massoulié, Inria Paris / Microsoft Research-Inria Joint Centre

Former members:

  • George Giakkoupis, Inria Rennes Atlantique
  • Lennart Gulikers, Microsoft Research-Inria Joint Centre (PhD Student)
  • Ian Kash, Microsoft Research Cambridge
  • Anne-Marie Kermarrec, Inria Rennes Atlantique
  • Peter Key, Microsoft Research Cambridge
  • Arnaud Legout, Inria Sophia Antipolis Méditerranée
  • Marc Lelarge, Inria Paris Rocquencourt
  • Mesrob Ohanessian, Microsoft Research-Inria Joint Centre (Post Doctoral Student)
  • Rémi Varloot, Microsoft Research-Inria Joint Centre (PhD Student)
  • Laurent Viennot, Inria Paris – Rocquencourt
  • Kuang Xu, Microsoft Research-Inria Joint Centre (Post Doctoral Student)

Interactive Network Visualization

diagram of networks

Networks are used to model a wide range of phenomena, from computer networks to similarities between genes, brain activity, and social interactions between individuals (social networks) or organizations. Yet, making sense of these complex networks requires more than modeling and statistics. Network visualization has progressed dramatically in recent years and provides novel and effective ways to make sense of complex networks through effective visual encodings and interactions. With the advent of novel display technologies and interaction techniques, network visualization has the potential to become even more effective.

Many scientific domains are thus now convinced that network visualization is essential to improving their work, since it allows them to see complex structures that statistics and modeling alone cannot reveal. From the perspective of network visualization, there are many commonalities in the wide range of issues raised by the different scientific domains, as well as some domain-specific issues.

In recent years, the Microsoft Research Redmond VIBEVis working group and the Inria AVIZ project team have shown that their techniques are extremely effective for visualizing social networks as well as brain connectivity data.

Social networks have become popular in recent years with platforms such as Facebook and LinkedIn, but they are also used to study collaboration between people and organizations. VIBEVis and AVIZ have developed a wide arsenal of techniques to facilitate the exploration of such data, to support its understanding, and to enable the presentation to a large audience for decision making or communication purposes.

Brain connectivity visualization is an enabling tool for understanding how the brain works in the long term, and is also important in the short term to convey key information to neuroscientists and brain surgeons about interactions in the brain relevant to studies or surgery.

In this project, VIBEVis and AVIZ further explore the generic issues raised by network visualization, as well as the more specific ones raised by social network visualization and brain connectivity visualization, two major application domains for this field of work.

The main scientific challenges are:

  1. Temporal/dynamic networks,
  2. Multivariate networks,
  3. Relating spatial (3D or 2D) information with networks (e.g. brain visualization, migration maps), and navigation and interaction in clustered graphs,
  4. Scalability.

Team members

Former members:

  • Benjamin Bach, Microsoft Research-Inria Joint Centre (Post Doctoral Student)
  • Jean-Daniel Fekete, Inria Saclay – Île-de-France
  • Nathalie Henry-Riche, Microsoft Research Redmond

Mathematical Components


The formal proof of the Odd Order theorem was completed on September 20th, 2012.

Summary

The object of this project is to demonstrate that formalized mathematical theories can, like modern software, be built out of components. By components we mean modules that comprise both the static (objects and facts) and dynamic (proof and computation methods) contents of theories. We propose to develop a general platform for mathematical components, based on the Coq “Ssreflect” extension that was used to carry out the formalization of the Four Colour Theorem. We would validate the platform on two significant developments: a theory of efficient arithmetic, and the proof of the Odd Order theorem. The former could be used in the resolution of several outstanding computer proof challenges, among them the Kepler Conjecture, and the latter could become the first step in a new major challenge: the classification of finite simple groups.

Team members

  • Boris Djalal, Microsoft Research – Inria Joint Centre

Former members:

  • Andrea Asperti, University of Bologna (Visiting Professor)
  • Jeremy Avigad, Carnegie Mellon University (Invited)
  • Bruno Barras, Inria Saclay – Île-de-France
  • Yves Bertot, INRIA Sophia Antipolis – Méditerranée
  • Guillaume Cano, Inria Sophia Antipolis – Méditerranée
  • Cyril Cohen, PhD student
  • Maxime Dénès, Inria Sophia Antipolis – Méditerranée
  • François Garillot, Microsoft Research- Inria (PhD Student)
  • Georges Gonthier, MSR Cambridge
  • Nicolas Julien, PhD Student
  • Stéphane Le Roux, Microsoft Research-Inria (Post Doctoral Student)
  • Assia Mahboubi, Inria Saclay Ile de France
  • Sean McLaughlin, Microsoft Research-Inria (PhD Student)
  • Guillaume Melquiond, Microsoft Research – Inria Joint Centre (Post Doc)
  • Russell O’Connor, McMaster University (Invited researcher)
  • Sidi Ould Biha, Microsoft Research-Inria (PhD Student)
  • Ioana Pasca, Microsoft Research-Inria (PhD Student)
  • Laurence Rideau
  • Alexey Solovyev, Microsoft Research-Inria (Internship)
  • Enrico Tassi, INRIA Saclay – Île-de-France
  • Laurent Théry, Inria Sophia Antipolis – Méditerranée
  • Benjamin Werner (Arithmetic leader), École Polytechnique
  • Noam Zeilberger, Microsoft Research-Inria Joint Centre
  • Roland Zumkeller, Microsoft Research-Inria (Post Doc)

Medilearn


Project Scope

A long-standing problem of medical image analysis is the lack of annotated data (e.g. images of a tumour with its associated, expert-annotated regions). Such data is crucially important for training automatic classification techniques that are able to automatically detect and delineate an anomaly in a new, previously unseen patient. Applications range from diagnosis to more accurate treatment, and from quantification of the effect of a drug to prevention.

Research

Recent trends in machine learning and computer vision demonstrate that, under certain constraints, it is possible to generate annotated training data synthetically. The trained classifiers are then shown to work well on previously unseen test images. For instance, this technique is at the basis of the successful Microsoft Kinect sensor, where a body-part classifier was trained on millions of synthetically generated images of people in various poses.

In this project we wish to do the same for medical images. This is very timely for a number of reasons. For instance, we now have realistic generative models of, e.g., brain scans, brain tumors, beating hearts, as well as full-body CT scans. The question now is how well a system trained on such synthetic images works on real patient scans, and how to modify the algorithm and its features to make it work more accurately.

As part of this project, the generated synthetic training and testing sets and their annotations will also be made available to the whole scientific community. This will enable various research teams to take part in competitions and compare their algorithms on an equal basis.

Neuroimaging

Neuroimaging is accumulating large functional MRI datasets that display, in spite of their low signal-to-noise ratio (SNR), brain activation patterns giving access to a meaningful representation of brain spatial organization. This ongoing accumulation is intensified by new large-scale international initiatives such as the Human Connectome Project or the recently accepted Human Brain Project, but also by existing open repositories of functional neuroimaging datasets. These datasets represent a very significant resource for the community, but require new analytic approaches in order to be fully exploited.

In order to provide a synthetic picture of the brain substrate of human cognition and its pathologies, we proceed by learning, from large-scale datasets, a brain atlas that adequately summarizes these functional activation maps, drawing on a large number of protocols and subjects. Once learned, such an atlas is extremely useful for understanding the large-scale functional organization of the brain: it is a tool for understanding brain segregation, i.e. the different encoding of many cognitive parameters into different brain regions, as well as brain integration, i.e. how remote brain regions co-activate across subjects and experiments.
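A rough sketch of the atlas-learning idea is given below: activation maps are stacked as rows of a matrix and factorized into a small dictionary of spatial components, each map being a sparse combination of them. The synthetic data and the off-the-shelf dictionary-learning model are stand-ins for the real fMRI data and for the project's own methods.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_maps, n_voxels, n_components = 200, 500, 5

# Synthetic "activation maps": sparse mixtures of a few ground-truth spatial patterns plus noise.
true_atlas = rng.normal(size=(n_components, n_voxels))
loadings = rng.exponential(size=(n_maps, n_components)) * (rng.random((n_maps, n_components)) < 0.3)
maps = loadings @ true_atlas + 0.1 * rng.normal(size=(n_maps, n_voxels))

# Learn an "atlas": each component is a spatial map, each activation map a sparse combination of them.
model = DictionaryLearning(n_components=n_components, alpha=1.0, random_state=0)
codes = model.fit_transform(maps)     # per-map loadings on the atlas components
atlas = model.components_             # learned spatial components (the atlas)
```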

Privacy-Friendly Services and Apps

The objective of this project is to re-imagine how modern on-line services and applications may be engineered to provide a higher degree of technical privacy protection. Supporting stronger privacy and user control involves a serious redesign of key protocols and architectures that underlie these services, as well as the development of principled definitions of privacy, tools to realize them, and methods to evaluate the degree of protection afforded.

We consider two categories of services. The first consists of systems that process personal information but generate information that is publicly disclosed. A typical example is aggregating the data of multiple individuals and extracting higher-level information and statistics: for example, the likelihood of customers buying two items together, or the correlation between a certain disease and obesity. For those features, the objective of this project is to devise generic techniques to compute and make available such statistical data while minimizing the exposure of personal information.

The second category of services both processes and generates personal information. For example, the insurance premium to be paid by an individual on the basis of their health may be both the result of processing sensitive information and sensitive itself. For those applications, we aim to devise generic architectures that (a) minimize the exposure of private information and (b) allow for flexible policies to determine under what conditions the resulting sensitive information is shared, and with whom.

A fundamental objective, common to both categories, is to develop principled and robust definitions of privacy, as well as methods for evaluating the quality of protection offered by different proposed mechanisms. We intend to consider variants of differential privacy, cryptography and other established privacy technologies, augmented by rigorous proofs of correctness.

We plan to dedicate particular attention to the domain of location-based services. We aim to design mechanisms that will allow users to get information from such services, such as points of interest or nearby friends, without exposing their precise location. To this aim, we plan to develop obfuscation techniques that provide strong privacy guarantees, such as geo-indistinguishability, a variant of differential privacy suitable for geo-location.
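As one standard way to realize geo-indistinguishability, the planar Laplace mechanism of Andrés et al. perturbs a location with noise whose magnitude is calibrated by the privacy parameter. The sketch below assumes planar coordinates in metres and is illustrative rather than the project's implementation.

```python
import numpy as np
from scipy.special import lambertw

def planar_laplace_noise(location, epsilon, rng):
    """Perturb a 2D location with planar Laplace noise (geo-indistinguishability)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)     # uniformly random direction
    p = rng.uniform(0.0, 1.0)                 # uniform quantile for the radius
    # Inverse CDF of the radius, expressed via the -1 branch of the Lambert W function.
    r = -(1.0 / epsilon) * (lambertw((p - 1.0) / np.e, k=-1).real + 1.0)
    return location + r * np.array([np.cos(theta), np.sin(theta)])

rng = np.random.default_rng(0)
true_location = np.array([648.0, 2310.0])     # planar coordinates in metres (illustrative values)
reported = planar_laplace_noise(true_location, epsilon=0.01, rng=rng)  # epsilon in 1/metres
```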

Finally, we intend to prototype high-quality software tools for developing and evaluating privacy-friendly services. These include tools and libraries that implement high-value computations in a privacy-preserving manner; language-based tools, such as compilers and runtimes, to support building higher-level services; and platforms and APIs that support privacy features out of the box. These should be capable of interoperating to produce larger and richer services that process location, medical, or financial data without exposing them to third parties. While this is not a core objective of the project, we hope to achieve synergies with other Microsoft Research-Inria projects that focus on automatic verification to validate the code that implements protection mechanisms.

ReActivity

This project finished in 2011.

Research goals

  • to explore how to capture and visualize user activity
  • to enable scientists and groups of scientists to reflect upon, interact with and improve their research processes

Our goal is to develop:

  • Sophisticated strategies for logging activity with bilateral links between on-line and off-line data
  • Common data formats that interoperate across platforms (PC, Mac, Unix, other equipment)
  • Innovative techniques to capture events triggered by user activity, both low-level and high-level, while maintaining the relations across levels
  • Powerful methods for visualising across multiple scales and interacting with the results

ReActivity phases

  • Log platform and infrastructure for data collection and aggregation
    • Create a common format & share experiences
    • Apply our own visualisation and interaction tools to the logged data
  • Visualisation and instrumentation of scientific data logs
    • Visualisation of raw data from phase I scaled to month-long or longer logs
    • Explore strategies for meaningful sampling of data with smooth interaction and navigation
  • Mining of desktop data and interactions with visualised activities
    • Design highly interactive tools for scientists to understand and interact with their past activities
    • Create high-level interactive reflexive views that can be manipulated and reused

Team members

Former members:

  • Michel Beaudouin-Lafon, Université Paris-Sud (invited)
  • Olivier Chapuis, CNRS
  • Mary Czerwinski, Microsoft Research
  • Pierre Dragicevic, Inria Saclay – Île-de-France
  • Niklas Elmqvist, Expert
  • Jean-Daniel Fekete, Inria Saclay – Île-de-France
  • Danyel Fisher, Microsoft Research
  • Nathalie Henry-Riche, Microsoft Research Redmond
  • Bongshin Lee, Microsoft Research
  • Catherine Letondal, Expert Engineer
  • Wendy Mackay, Inria Saclay – Île-de-France
  • Nicolas Masson, Microsoft Research – Inria (PhD Student)
  • Brian Meyers, Microsoft Research
  • Tomer Moscovich, Expert Engineer
  • Emmanuel Piétriga, Inria
  • George Robertson, Microsoft Research
  • Greg Smith, Microsoft Research
  • Aurélien Tabard, Microsoft Research – Inria (PhD Student)

Scientific Image and Video Mining

This project finished in 2013 – then continued by “Video Understanding”.

Project Summary

We propose to focus on fundamental computer science research in computer vision and machine learning, and its application to archaeology, cultural heritage preservation, and sociology. We validate our project by collaborations with researchers and practitioners in these fields.

Goals of the project

We address the following problems:

Mining historical collections of photographs and paintings, with applications to archaeology and cultural heritage preservation. This includes, for example, the quantitative analysis of environmental damage to wall paintings or mosaics over time, and the cross-indexing of nineteenth-century paintings of Pompeii with modern photographs.

Mining TV broadcasts, with applications to sociology. This includes automating the analysis and annotation of human actions and interactions in video segments to assist, and provide data for, studies of consumer trends in commercials, political event coverage in newscasts, and class- and gender-related behavior patterns in situation comedies, for example.

For every one of the problems we have in mind, indexing, searching and analyzing photo and video collections is a key issue. Recent advances in image analysis, computer vision, and machine learning offer an opportunity to automate these tasks, partly or completely (e.g., annotation of photos and videos), as well as to access information whose extraction from images is simply beyond human capabilities (e.g., indexing of very large image archives). To fulfil this promise, we propose to conduct fundamental research in object, scene, and activity modeling, learning, and recognition, and to validate it through the development of computerized image and video mining tools in the service of the sciences and humanities.

Secure Distributed Computations and their Proofs


We develop formal tools for programming distributed computation with effective security guarantees. Our goal is to enable programmers to express and prove high-level security properties with a reasonable amount of effort—sometimes automatically, sometimes with mechanical assistance—as part of the development process. These properties should hold in a hostile environment, with realistic (partial) trust assumptions on the principals and machines involved in the computation.

We are located in Orsay, within the Inria-Microsoft Research Joint Lab, with participants from the Programming Principles and Tools group at Microsoft Research Cambridge, the Moscova team at Inria Rocquencourt, and the Everest team at Inria Sophia-Antipolis. Please contact us for additional information, visits, internships, etc.

Programming Abstractions for Security

Over the last decade, as security expectations have evolved, the task of writing secure code has become more challenging and widespread. A large amount of code now has security implications, and programmers often struggle to analyze them. In principle, a security analysis should involve a precise mapping from security goals to enforcement mechanisms, but these high-level goals are often left informal, whereas lower-level enforcement mechanisms are largely hidden in library implementations.

Partly because their design predates security concerns, the abstractions embedded in programming languages are usually ineffective as regards security. In a distributed system, for example, these abstractions are trustworthy only inasmuch as every machine involved in the computation “plays by the rules” and correctly implements them. This assumption is clearly unrealistic when some of these machines may be controlled by an adversary. Hence, for instance, a cautious programmer who relies on remote procedure calls should not reason about them as if they were local calls, and must instead look at the details of their implementation, including the network stack, the underlying protocols, and the system configuration. Also, due to functional demands such as flexibility and ease of deployment, the boundary between applications and protocols gets blurred. Modern application code often needs to dynamically select protocols, or change their configuration. Hence, even the experts who design the protocols may not be able to guarantee their security in advance, without knowledge of the application. In fact, most interesting high-level security properties (privacy, integrity, or secrecy invariants) necessarily depend on the whole application. As a result, those properties must somehow be reflected to the programmer, rather than hidden behind “transparent” abstractions.

Secure Implementations for Typed Session Abstractions

Distributed applications can be structured as parties that exchange messages according to some pre-arranged communication patterns. These sessions (or contracts, or protocols) simplify distributed programming: when coding a role for a given session, each party just has to follow the intended message flow, under the assumption that the other parties are also compliant. In an adversarial setting, remote parties may not be trusted to play their role. Hence, defensive implementations also have to monitor one another, in order to detect any deviation from the assigned roles of a session. This task involves low-level coding below session abstractions, thus giving up most of their benefits.

We explore language-based support for sessions. We extend the ML language with session types that express flows of messages between roles, such that well-typed programs always play their roles. We compile session type declarations to cryptographic communication protocols that can shield programs from any low-level attempt by coalitions of remote peers to deviate from their roles. (…)

Cryptographic Enforcement of Information-Flow Security

We consider cryptographic enforcement mechanisms for imperative programs with untrusted components. In source programs, security depends on an abstract information-flow policy for accessing the shared memory. In their implementations, shared memory is unprotected and security depends instead on encryption and signing. We build compilers from source programs and policies to cryptographic implementations. To establish their correctness, we develop type systems that enforce a correct usage of cryptographic primitives against active adversaries, under standard (probabilistic, polytime) security assumptions. (…)

Secure Audit Logs

Many protocols rely on secure audit logs to store traces of protocol runs. These logs can be used a posteriori to verify the compliance of each principal to its role: principals who attempt non-compliant actions will be blamed using the logged evidence. Secure logging is widely used in practice and much research effort has been devoted to techniques for implementing logs securely and efficiently. Still, which data should be logged? and why? (…)
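One common building block for such logs is a hash chain, in which each entry commits to its predecessor so that modified or reordered evidence is detectable; the minimal sketch below illustrates only this mechanism (detecting truncation of the tail additionally requires signing or publishing the latest hash) and not the project's own constructions.

```python
import hashlib
import json

def append_entry(log, event):
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"event": event, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_log(log):
    """Recompute the chain; modifying, reordering or deleting interior entries breaks verification."""
    prev_hash = "0" * 64
    for entry in log:
        body = {"event": entry["event"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != digest:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"principal": "A", "action": "offer", "item": 42})
append_entry(log, {"principal": "B", "action": "accept", "item": 42})
assert verify_log(log)
log[0]["event"]["action"] = "decline"   # tampering with the logged evidence...
assert not verify_log(log)              # ...is detected when the chain is rechecked
```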

Automated Verifications of Protocol Implementations

There has been much progress in formal methods and tools for cryptography, enabling, in principle, the automated verification of complex security protocols. In practice, however, these methods and tools remain difficult to apply. Indeed, the verification of a system that uses a given protocol involves more than the cryptographic verification of an abstract model. For these reasons, we are interested in the integration of protocol verifiers into the arsenal of software testing and verification tools.

In recent work, Bhargavan et al. advocate the automatic extraction and verification of cryptographic models from executable code. They verify protocol implementations written in F#. Their approach relies on sharing as much code as possible between implementations and models. We explore a similar approach to extend the benefit of computational cryptographic verification to protocol implementations. To this end, we develop a tool for extracting cryptographic models from executable code written in F#, and then rely on CryptoVerif to obtain computational verification results for executable code.

As an extended case study, we verify implementations of TLS, one of the most widely deployed communication protocols.

CertiCrypt: Formal Proofs for Computational Cryptography

Provable security, whose origins can be traced back to the pioneering work of Goldwasser and Micali, advocates a mathematical approach based on complexity theory in which the goals and requirements of cryptosystems are specified precisely, and where the security proof is carried out rigorously and makes all underlying assumptions explicit. In a typical provable security setting, one reasons about effective adversaries, modelled as arbitrary probabilistic polynomial-time Turing machines, and about their probability of thwarting a security objective, e.g., secrecy. In a similar fashion, security assumptions about cryptographic primitives bound the probability of polynomial algorithms to solve difficult problems, e.g., computing discrete logarithms. The security proof is performed by reduction by showing that the existence of an effective adversary with a certain advantage in breaking security implies the existence of an effective algorithm contradicting the security assumptions.

The game-playing technique is a general method to structure and unify cryptographic proofs. Its central idea is to view the interaction between an adversary and the cryptosystem as a game, and to study game transformations that preserve security, thus allowing one to transform an initial game (which explicitly encodes a security property) into a game where it is easy to bound the advantage of the adversary. Code-based techniques (CBT) are an instance of the game-playing technique that has been used successfully to verify state-of-the-art cryptographic schemes, see e.g. Bellare and Rogaway. Their distinguishing feature is to take a code-centric view of games, security hypotheses and computational assumptions, which are expressed using (probabilistic, imperative, polynomial-time) programs. Under this view, game transformations become program transformations, and can be justified rigorously by semantic means. Although CBT proofs are easier to verify and are more easily amenable to machine-checking, they go far beyond established theories of program equivalence and exhibit a rich and broad set of reasoning principles that draws from program verification, algebraic reasoning, and probability and complexity theory. Thus, despite the beneficial effect of their underlying framework, CBT proofs remain inherently difficult to verify. In an inspiring paper, Halevi argues that formal verification techniques are mandatory to improve trust in game-based proofs, going as far as describing an interface to a tool for computer-assisted code-based proofs. As a first step towards the completion of Halevi’s programme, we develop CertiCrypt, a framework to construct machine-checked code-based proofs in the Coq proof assistant. (…)

Related

  • Samoa: Formal tools for securing web services
  • PiDuce: Web programming with native XML datatypes

Team members

  • Karthikeyan Bhargavan, Inria Paris
  • Cédric Fournet, Microsoft Research Cambridge
  • Santiago Zanella Béguelin, Microsoft Research – Cambridge

Former members:

  • Pedro Adão, IST Lisbon (Visiting researcher)
  • Evmorfia-Iro Bartzia, Inria Paris – Rocquencourt (PhD Student)
  • Ricardo Corin, Inria Paris-Rocquencourt
  • Pierre-Malo Deniélou, Microsoft-Inria Joint Centre (PhD Student)
  • Benjamin Grégoire, Inria Sophia Antipolis – Méditerranée
  • Nataliya Guts, Microsoft Research-Inria Joint Centre
  • James J. Leifer, Inria Paris – Rocquencourt
  • Chantal Keller, Microsoft Research-Inria Joint Centre (Post Doctoral Student)
  • Cosimo Laneve, University of Bologna (Visiting researcher)
  • Gurvan Le Guernic, Microsoft Research-Inria Joint Centre
  • Sergio Maffeis, Imperial College London (Visiting Professor)
  • Alfredo Pironti, Inria Paris – Rocquencourt (Post Doc)
  • Jérémy Planul, PhD student
  • Tamara Rezk, Inria Sophia Antipolis – Méditerranée
  • Pierre-Yves Strub, Microsoft Research-Inria (Post Doc)
  • Eugen Zălinescu, Postdoc researcher
  • Francesco Zappa Nardelli, Inria Paris-Rocquencourt

Structured Large Scale Machine Learning

In this project we cover three research areas:

1) Large-scale convex optimization

Large-scale convex optimization for big data: many classification problems that practitioners face have large dimensions, namely a large observation dimension p, a large number of observations n, and a large number of classes k. Supervised learning algorithms based on convex optimization (such as support vector machines or logistic regression) cannot yet robustly achieve the ideal complexity O(nk + np), i.e., linear in the size of the data. To achieve this complexity, we propose several open avenues of research (whose impact on solving large problems with higher accuracy is immediate).

Specifically, we intend to focus on i) Large-scale learning with latent structure; ii) Optimal computational-statistical trade-offs; iii) Large-scale learning with a large number of classes: scaling in log(k); iv) Large-scale active learning; v) Large-scale learning with a large number of classes: dealing with imbalance; and vi) Robust stochastic approximation algorithms.
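For intuition on the O(np) target, here is a minimal sketch of averaged stochastic gradient descent for binary logistic regression, whose cost per pass over the data is proportional to the number of observations times their dimension; the synthetic data and step-size schedule are illustrative, and this textbook baseline is not one of the project's new algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 50
w_true = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)    # binary labels in {0, 1}

w = np.zeros(p)
w_avg = np.zeros(p)
for t, i in enumerate(rng.permutation(n), start=1):
    margin = X[i] @ w
    grad = (1.0 / (1.0 + np.exp(-margin)) - y[i]) * X[i]   # logistic-loss gradient at one sample
    w -= (1.0 / np.sqrt(t)) * grad                         # O(p) work per observation, O(np) per pass
    w_avg += (w - w_avg) / t                               # Polyak-Ruppert averaging
```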

2) Large-scale combinatorial optimization

Many problems in computer vision or natural language processing involve large-scale combinatorial problems, for example involving graph cuts or submodular functions. We propose several open avenues of research at the interface between combinatorial and convex optimization (whose main impact is to enlarge the set of problems that can be solved through machine learning).

In this context we intend to work on i) Large-scale learning with a large number of classes: leveraging the latent geometry; ii) Relationships between convex and combinatorial optimization; and iii) Structured prediction without pain.

3) Sequential decision making for structured data

Multi-Armed Bandit (MAB) problems constitute a generic benchmark model for learning to make sequential decisions under uncertainty. They capture the trade-off between exploring decisions to learn the statistical properties of the corresponding rewards, and exploiting the decisions that have generated the highest rewards so far. In this project, we aim to investigate bandit problems with a large set of available decisions and structured rewards. The project addresses bandit problems with known and unknown structure, and targets specific applications in online advertising, recommendation, and ranking systems.
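
For illustration only, the classical UCB1 index policy below captures the exploration/exploitation trade-off on a plain Bernoulli bandit; the arm means and horizon are made up, and the structured, large-scale settings studied in the project are not represented.

```python
import math, random

def ucb1(arm_means, horizon=5000):
    k = len(arm_means)
    counts = [0] * k              # number of pulls per arm
    sums = [0.0] * k              # cumulative reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1           # pull each arm once to initialize
        else:
            # index = empirical mean + exploration bonus
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward

print(ucb1([0.2, 0.5, 0.8]))      # the policy should concentrate on the 0.8 arm
```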

Team members

Current members:

  • Francis Bach, Inria Paris
  • Alberto Bietti, Microsoft Research – Inria Joint Centre
  • Julien Mairal, Microsoft Research – Inria Joint Centre

Former members:

  • Alekh Agarwal, Microsoft Research – New York
  • Léon Bottou, Microsoft Research – Redmond
  • Alexandre d’Aspremont, CNRS
  • Miro Dudík, Microsoft Research – New York
  • Zaid Harchaoui, Inria Grenoble – Rhône-Alpes
  • Nebojsa Jojic, Microsoft Research – Redmond
  • Sham Kakade, Microsoft Research – New England
  • Pushmeet Kohli, Microsoft Research Cambridge
  • Simon Lacoste-Julien, Inria Paris – Rocquencourt
  • John Langford, Microsoft Research – New York
  • Sebastian Nowozin, Microsoft Research Cambridge
  • Anton Osokin, Microsoft Research – Inria Joint Centre (Postdoc)
  • Anastasia Podosinnikova, Microsoft Research – Inria Joint Centre (PhD student)
  • Alexandre Proutière, Inria Paris – Rocquencourt
  • Bozidar Radunovic, Microsoft Research Cambridge
  • Milan Vojnovic, Microsoft Research Cambridge
  • Se-Young Yun, Microsoft Research – Inria Joint Centre (Postdoc)

Tools for Proofs

Goals of the project

TLA+ is a specification and proof language based on temporal logic, where first-order logic and set theory are used to describe a set of states and the possible transitions between these states. TLA+ also includes a module system for manipulating large-scale specifications.

There are a number of existing tools for working on TLA+ specifications, the most important of which is the TLC model checker. Although the proof side of TLA+ is not yet well developed, with no proof tools and an incomplete definition of the proof language, the language has already proved its worth in significant projects in hardware design (Alpha and Itanium processors), protocols (PCI-X), and software (the Doligez-Leroy-Gonthier garbage collector).
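
To give a flavour of what explicit-state model checking over such state/transition descriptions does (this is a toy sketch, not TLC, and the example system and invariant are invented), the following performs a breadth-first search of the reachable states and reports an invariant violation if one exists.

```python
from collections import deque

def check_invariant(initial_states, next_states, invariant):
    """Breadth-first search over reachable states; returns a violating state
    or None if the invariant holds on every reachable state."""
    seen, queue = set(initial_states), deque(initial_states)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for succ in next_states(state):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return None

# Example system: each step increments x or y, both bounded by 2.
initial = [(0, 0)]
def next_states(s):
    x, y = s
    return [t for t in ((x + 1, y), (x, y + 1)) if t[0] <= 2 and t[1] <= 2]

print(check_invariant(initial, next_states, lambda s: s[0] + s[1] <= 3))
```

On this toy system the checker reports the violating state (2, 2), illustrating how a counterexample to a claimed invariant is produced.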

In this project, we are working on turning TLA+ into a complete solution for writing, debugging, and proving specifications. More precisely, we are concentrating on the proof aspect:

  • refining the proof language
  • developing and adapting automatic tools to help users prove TLA+ theorems (Zenon, veriT)
  • providing an interface between the automatic provers and the TLA+ development environment
  • translating TLA+ proofs into a machine-checkable format for verification by an independent checker (Isabelle)

We will validate and enhance our tools by finding examples of real-world projects where formal specifications bring real improvements over other methodologies. Feedback from these examples will help us improve the proof language and the tools and develop methods and “design patterns” for using TLA+.

Video Understanding

Understanding, searching and organizing dynamic video content

Automatic understanding and interpretation of video content is a key enabling factor for a range of practical applications, such as organizing and searching home videos or content-aware video advertising. For example, interpreting videos of “making a birthday cake” or “planting a tree” could provide effective means for advertising products in local grocery stores or garden centers.

The goal of this work is to automatically generate annotations of complex dynamic events in video. We wish to deal with events involving multiple people interacting with each other, with objects, and with the scene, for example people at a party in a house. The goal is to generate structured annotations going beyond simple tags. Examples include entire text sentences describing the video content, as well as bounding boxes or segmentations spatially and temporally localizing the described objects and people in the video. Such annotations will in turn open up the possibility of organizing and searching video content using well-developed technology from, e.g., text search or natural language processing.

We build on the considerable progress in visual object, scene and human action recognition achieved in the last ten years, as well as the recent advances in large-scale machine learning that enable optimizing complex structured models using massive sets of data. In particular, we develop structured video representations of people, objects and scenes, able to describe their complex interactions as well as their long-term evolution in the dynamic scene.

To this end, we investigate different models of the video stream including: (i) designing explicit representations of scene geometry together with scene entities and their interactions as well as (ii) directly learning mid-level representations of spatio-temporal video content using dictionary learning or convolutional neural networks.

To train the developed models, we design weakly supervised learning methods that make use of videos and their readily available metadata, such as text, speech, or depth (in the case of 3D videos).

To enable access to the massive amounts of available video data, we also develop representations that allow for efficient extraction and indexing. We believe such models and methods are required to make a breakthrough in the automatic understanding of dynamic video scenes.

Team members

Current members:

  • Guilhem Chéron, Microsoft Research – Inria Joint Centre (PhD)
  • Andrew Fitzgibbon, Microsoft Research Cambridge
  • Ivan Laptev, Inria Paris
  • Jean Ponce, Inria Paris
  • Josef Sivic, Inria Paris

Former members:

  • Karteek Alahari, Inria Grenoble – Rhône-Alpes
  • Léon Bottou, Microsoft Research – Redmond
  • Yang Hua, Microsoft Research – Inria Joint Centre (PhD)
  • Maxime Oquab, Microsoft Research – Inria Joint Centre (PhD)
  • Cordelia Schmid, Inria Grenoble – Rhône-Alpes

White-Box Search-Based Software Engineering

There is an ever-increasing tendency to consider software engineering as an optimization problem – from bug detection to fault identification and, more recently, to software design – an approach known today as Search-Based Software Engineering (SBSE). In that perspective, building software amounts to properly assembling existing components, or to cleverly tuning the use of ensembles/portfolios of existing (black-box) software components. However, complex tasks often require different characteristics at different times during execution of the software at hand, and static off-line approaches rapidly fail to fulfill such dynamic requirements: the decision-making problem (which algorithm/heuristic/operator to use from the available ensemble/portfolio) is in fact a sequential decision-making problem, and thus amenable to Reinforcement Learning (RL).

We propose in this new project to address this sequential decision-making problem using Multi-Armed Bandit (MAB) and Monte-Carlo Tree Search (MCTS) techniques, on-line, within existing (white-box) software. Two milestones will allow us to take the first steps on the path to RL-based Software Engineering.

Within existing Constraint Programming solvers, MCTS will be used to improve the efficiency of the successive tree walks, provided that a meaningful reward is designed and that a history-preserving mechanism is implemented to accumulate knowledge between restarts.
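
The following is only a schematic sketch of the select/expand/simulate/backpropagate cycle of MCTS (UCT) on a toy sequence of binary decisions; the reward function, target assignment, and constants are invented, and integrating such a loop into a real constraint solver (reward design, restarts, shared history) is precisely the open question addressed here.

```python
import math, random

TARGET = [1, 0, 1, 1, 0, 1]          # hidden optimum, for illustration only
reward = lambda x: sum(a == b for a, b in zip(x, TARGET)) / len(TARGET)

class Node:
    def __init__(self, prefix):
        self.prefix, self.children, self.visits, self.value = prefix, {}, 0, 0.0

def uct_child(node, c=1.4):
    # UCT rule: mean value plus an exploration bonus.
    return max(node.children.values(), key=lambda ch: ch.value / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(iterations=2000):
    root = Node([])
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend while the current node is fully expanded.
        while len(node.children) == 2 and len(node.prefix) < len(TARGET):
            node = uct_child(node)
            path.append(node)
        # Expansion: add one untried child, if a decision remains.
        if len(node.prefix) < len(TARGET):
            bit = 0 if 0 not in node.children else 1
            node.children[bit] = Node(node.prefix + [bit])
            node = node.children[bit]
            path.append(node)
        # Simulation: complete the assignment with random decisions.
        rollout = node.prefix + [random.randint(0, 1)
                                 for _ in range(len(TARGET) - len(node.prefix))]
        r = reward(rollout)
        # Backpropagation: update statistics along the visited path.
        for n in path:
            n.visits += 1
            n.value += r
    # Read off the most visited sequence of decisions.
    best, node = [], root
    while node.children:
        node = max(node.children.values(), key=lambda ch: ch.visits)
        best.append(node.prefix[-1])
    return best

print(mcts())
```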

Within highly distributed SAT solving, MAB techniques can be used to dynamically tune the structure of the communication graph, allowing each node to gather information only from nodes whose contributions have proved useful in the recent past, leading to a scalable clause-sharing mechanism.

Z-CloudFlow: Data Workflows in the Cloud

Scientific applications are often represented by workflows, which describe sequences of tasks (computations) and data dependencies between these tasks. Several scientific workflow environments have already been proposed. They have been developed primarily to simplify the design and the execution of a set of tasks on parallel and distributed infrastructures. These environments follow a “process-oriented” design approach, in which the information about data dependencies (the data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, such as clusters and grids, and have little or no support for efficiently managing large data sets. Finally, the data analyzed and produced by a scientific workflow are often stored in loosely structured files, and simple, classical mechanisms are used for their management: the files are either stored on a centralized disk or transferred directly between tasks. This approach is not suitable for data-centric applications (because of inherent bottlenecks, costly data transfers, etc.).
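
As a minimal sketch of this task/data-dependency view of a workflow (not the project's framework, with task names and dependencies invented for the example), the following executes a small DAG of tasks in topological order, passing each task the data produced by the tasks it depends on.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, deps):
    """tasks: name -> callable(inputs dict) -> output; deps: name -> set of names."""
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: outputs[d] for d in deps.get(name, ())}
        outputs[name] = tasks[name](inputs)       # run once its inputs exist
    return outputs

# Example: a 3-stage pipeline (extract -> two parallel analyses -> merge).
tasks = {
    "extract": lambda _: list(range(10)),
    "stats":   lambda i: sum(i["extract"]) / len(i["extract"]),
    "filter":  lambda i: [x for x in i["extract"] if x % 2 == 0],
    "merge":   lambda i: {"mean": i["stats"], "even": i["filter"]},
}
deps = {"stats": {"extract"}, "filter": {"extract"}, "merge": {"stats", "filter"}}
print(run_workflow(tasks, deps))
```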

Clouds have recently emerged as an interesting infrastructure option for deploying scientific workflows. Thanks to their elasticity, scientific workflows have in recent years become an archetype for modelling experiments on such infrastructures. In addition, the cloud allows users to simply outsource data storage and application execution. Still, substantial challenges must be addressed before the potential of clouds can be fully exploited in the context of scientific workflows.

One missing link is data management: clouds mainly target web and business applications, and lack specific support for data-intensive scientific workflows. Currently, the management of workflow data in clouds is achieved either using application-specific overlays that map the output of one task to the input of another in a pipeline fashion or, more recently, by leveraging the MapReduce programming model (e.g. Hadoop on Azure – HDInsight). However, most scientific applications do not fit this model and require a more general data and task orchestration model, independent of any programming model.

The goal of this project is to specifically address these challenges by proposing a framework for the efficient processing of scientific workflows in clouds. Our approach will leverage the cloud infrastructure capabilities for handling and processing large data volumes. In order to support data-intensive workflows, our cloud-based solution will:

  1. Adapt the workflows to the cloud environment and exploit its capabilities;
  2. Optimize data transfers to achieve reasonable transfer times;
  3. Manage data and tasks so that they can be efficiently placed and accessed during execution.

The validation of this approach will be performed using synthetic benchmarks and real-life applications from bioinformatics on the Microsoft Azure cloud environment.

Team members

Current members:

  • Gabriel Antoniu, Inria Rennes – Bretagne Atlantique

Former members:

  • Luc Bougé, École Normale Supérieure de Rennes – IRISA
  • Alexandru Costan, INSA Rennes
  • Ji Liu, Microsoft Research – Inria Joint Centre (PhD student)
  • Luis Pineda Morales, Microsoft Research – Inria Joint Centre (PhD student)
  • Radu Tudoran, IRISA / ENS Cachan, Antenne de Bretagne (PhD student)
  • Patrick Valduriez, Inria Sophia Antipolis – Méditerranée (Montpellier)