Overview and Vision
The Distributed Causal Inference (DCI) project explores ways to improve distributed systems technology and data query and storage functionality so that causal inference questions can be answered from very-large-scale longitudinal and observational datasets. The long-term goal is to make data-driven exploration of outcomes as fast and commonplace as “web search”.
Everyone, at some point in their lives, finds themselves in an unfamiliar situation, considering what they should do, and trying to understand what to expect of the future.
We see such expectation-exploration questions show up in web search queries, with people exploring possible consequences of their choices and the outcomes of situations.
These explorations cover both consequential topics, such as life-changing education and career choices (e.g., “Should I join the military?”) or major financial and personal decisions (e.g., “Should I move to California?”), and more quotidian topics, such as the consequences of purchase decisions, athletic training regimens, and dating rituals.
The answers to these questions are not readily available in Wikipedia or other knowledge bases powering modern web search engines.
However, the information necessary to answer these questions is already being recorded on social media platforms such as Twitter, where hundreds of millions of individuals regularly and publicly report their personal experiences, including the situations they are in, the actions they take, and the experiences they have afterwards.
Exploring expectations on the Internet plays an important role in people’s planning, decision-making, and forecasting for both everyday and extraordinary scenarios.
These explorations encompass a broad variety of tasks, including exploring hypothetical, ongoing, or past problems and seeking informational support, emotional satisfaction, or preparation for a future event. In particular, decision-making about future unknowns depends critically on such information gathering, especially in unfamiliar situations, where the web augments more conventional information sources such as advice from professionals and friends, training, etc.
Advice-related searches were measured to make up around 2-5% of web search tasks in 2004, and even during pregnancy (a scenario with dedicated information infrastructure, related health professionals, and care programs), over 80% of women used web search to help make decisions.
The DCI project focuses on providing the runtime substrate that makes such causal inference scenarios work on huge datasets (such as the Twitter corpus).
This requires a carefully curated and constructed combination of technologies spanning distributed systems, databases, machine learning and computational statistics.