May 11, 2023

NorCalDB Day 2023

Location: Mountain View, CA

NorCalDB Day is a single-day, workshop-style event where participants from academia and industry in Northern California meet to present ideas and discuss their research and experiences. In 2023, NorCalDB Day will be held at the Microsoft Silicon Valley Campus in Mountain View on Thursday, May 11, 2023.

Agenda

8:00 – 9:00 AM: Registration and Light Breakfast
9:00 – 9:15 AM: Introduction and Logistics
9:15 – 10:00 AM: Keynote: Benchmarking and Tuning Log-Structured Table Formats (slides), Raghu Ramakrishnan, Microsoft
10:00 – 10:30 AM: Presto: A Decade of SQL Analytics at Meta, James Sun, Meta
10:30 – 11:00 AM: Coffee Break and Posters
11:00 AM – 12:00 PM: Gong Show
1. Cal Poly Database and Data Science Work After COVID (slides), Alexander Dekhtyar, Cal Poly
2. Stanford @ NorCalDB Day (slides), Peter Kraft, Stanford University
3. UC Berkeley @ NorCalDB Day (slides), Aditya Parameswaran, UC Berkeley
4. Resilient Journey in Building Fault-tolerant Systems (slides), Dakai Kang, UC Davis
5. Insights from Sketch-based Relational Query Optimization (slides), Florin Rusu, UC Merced
6. UC Santa Cruz @ NorCalDB Day (slides), Peter Alvaro, UC Santa Cruz
12:00 – 1:00 PM: Lunch and Posters
1:00 – 1:30 PM: Unexpected Lessons from Production Systems Impacting the Foundations of Distributed Computing (slides), Dahlia Malkhi, Chainlink Labs
1:30 – 2:30 PM: Panel Discussion: DB and AI
Moderator: Fatma Özcan, Google
Panelists:
Dipti Borkar, Microsoft
Idan Gazit, GitHub
Jure Leskovec, Stanford University
Edo Liberty, Pinecone
Aditya Parameswaran, UC Berkeley
2:30 – 3:00 PM: Coffee Break and Posters
3:00 – 3:30 PM: Bringing Structure to Unstructured Data with an AI-First System Design (slides), Will Gaviria Rojas, CoactiveAI
3:30 – 4:15 PM: Keynote: Hydro: A Data-Centric Compiler Stack for the Cloud (slides), Joe Hellerstein, UC Berkeley
4:15 – 4:30 PM: Closing Remarks

Keynote talks

Keynote Speaker 1: Raghu Ramakrishnan, Microsoft


Title: Benchmarking and Tuning Log-Structured Table Formats

Abstract:
In recent years, analytic SQL databases have adopted updatable column-oriented table formats based on Parquet. These represent a profound shift from traditional row-oriented page-based data representation that continues to dominate OLTP SQL systems. In this talk, we will present a quick overview of updatable Parquet table implementations such as Delta Lake, Hudi and Iceberg and then consider the new challenges in rigorously comparing their performance. We describe LST-Bench, a new benchmarking framework that adapts a base workload such as TPC-DS, and present the results of a comparison that we carried out. We have open sourced LST-Bench. There are a number of exciting problems in this space that are exposed by our results, such as the opportunity (and need!) for auto-tuning various parameters that heavily influence performance of updatable table implementations.
Bio:
Raghu Ramakrishnan is CTO for Data, and a Technical Fellow at Microsoft. Previously, he was a professor at University of Wisconsin-Madison, where he wrote the widely used text “Database Management Systems” with Johannes Gehrke, and Chief Scientist at Yahoo! He has received the Innovation Award from both ACM SIGMOD and SIGKDD, multiple 10-year paper awards, and the ACM SIGMOD Contributions Award.

Keynote Speaker 2: Joe Hellerstein, UC Berkeley


Title: Hydro: A Data-Centric Compiler Stack for the Cloud

Abstract:
Relational Databases were invented to hide the concerns of how data is laid out, and how queries are executed.
Forty years later, Cloud Computing was invented to hide the concerns of how computing resources are laid out, and how general-purpose computations are executed. Surely lessons from the database community can translate to this new domain!
This is not a facile analogy or empty vision. I am convinced that the opportunities for outbound, translational research from databases to general-purpose modern computing are profound. This has been a longstanding agenda in my group, which is maturing into high-performance software with significant benefits for developers. The ideas that powered the success of databases – declarative languages, dataflow parallelism, data replication and consistency, query optimization, etc. – can be fruitfully applied to a wide variety of systems challenges, particularly related to distributed systems. 
Our current hypothesis is that we can build low-latency, high-performance, elastic cloud infrastructure out of declarative queries. Can we? Should we? I believe we can, and that there are significant engineering benefits to doing so. Our prior work has included declarative networking (e.g. Overlog, P2), declarative IoT (TinyDB), declarative implementations of Big Data distributed infrastructure (BOOM Analytics), general-purpose distributed programming models (Dedalus, Bloom), declarative ML (Apache MADlib), stateful serverless technologies (Anna KVS, Cloudburst), and coordination-free foundations including the CALM Theorem. 
In 2021, in a collaboration between Berkeley and Sutter Hill Ventures, we kicked off an ambitious research effort to cull lessons from this work, and push forward into a new generation of cloud technology. The emerging agenda is embodied in Hydro: a language stack for distributed programming or, as we sometimes call it, “LLVM for the cloud”.
In this talk I’ll give an overview of the goals of the Hydro project and some status reports on the language stack, with early use cases including an autoscaling Key Value Store and an optimizable Multipaxos implementation. Hydro is a young but growing open source project and we welcome collaborators!
Bio:
Since 1995, Joe Hellerstein has had the good fortune to serve on the faculty at UC Berkeley, where he is the Jim Gray Professor of Computer Science. During that time he has done research on a range of topics across computing and data, advised dozens of remarkable graduate students, taught thousands of undergraduates, and helped co-direct a number of research labs. Outside Berkeley, Joe was a co-founder of Trifacta, the AI-assisted visual data wrangling company, where he served for a decade as founding CEO and Chief Strategy Officer. Joe is currently a co-founder at Aqueduct, which provides open source software to make it easy to run AI workloads on standard cloud infrastructure. Joe continues to advise a number of startups in data and AI systems. For the last two years Joe has been on leave as a Faculty Fellow at Sutter Hill Ventures, which has been funding him to focus on his research. Outside of work, Joe plays music, mostly jazz, but he has recently been heard on recordings by James Combs and his Americana band, Great Willow.

Organizing committee