May 11, 2023

NorCalDB Day 2023

Location: Mountain View, CA

Invited Talks

Presto: A Decade of SQL Analytics at Meta by James Sun, Meta

James Sun

Abstract:
Presto is an open-source distributed SQL query engine that supports analytics workloads involving multiple exabyte-scale data sources. Presto is used for low-latency interactive use cases as well as long-running ETL jobs at Meta. It was originally launched at Meta in 2013 and donated to the Linux Foundation in 2019. Over the last ten years, upholding query latency and scalability amid the hypergrowth of data volume at Meta, together with new SQL analytics requirements, has raised significant challenges for Presto. A top priority has been ensuring that query reliability does not regress with the shift towards smaller, more elastic container allocations, in which queries must run with substantially smaller memory headroom and can be preempted at any time. In this talk, we discuss several successful evolutions in recent years that have improved Presto latency as well as scalability by several orders of magnitude in production at Meta. Some of the notable ones are hierarchical caching, native vectorized execution engines, materialized views, and Presto on Spark. With these new capabilities, we have deprecated or are in the process of deprecating various legacy query engines so that Presto becomes the single piece to serve interactive, ad-hoc, ETL, and graph processing workloads for the entire data warehouse.
Bio:
James Sun is a Software Engineer at Meta working on large-scale data systems. His interests are query optimization, low-latency query execution, and system scalability. He led the Presto team developing the open-source distributed SQL query engine at EB scale. He received a Ph.D. in Computer Science from the University of California, Santa Barbara, focusing on data integration and data-centric processes.

Dahlia Malkhi

Abstract:
In this talk, I will share insights from distributed systems I worked on that led to breaking certain myths in distributed computing, including positive answers to the following questions:
• Can you build a permissioned blockchain with linear communication complexity, namely, the same communication complexity as Bitcoin merely spreading updates, but without the energy consumption?
• Can you scale-out distributed databases with a centralized coordinator?
• Can you geo-replicate data consistently without intersecting quorums?
Bio:
Dahlia Malkhi currently serves as a Distinguished Scientist at Chainlink Labs. Dr. Malkhi’s research spans broad aspects of reliability and security of distributed systems, recently focused on blockchains and advances in financial technology. Her work over two decades resulted in over 150 publications as well as a strong impact on computing technology, notably HotStuff (driving the Diem blockchain core engine and the Aptos blockchain core engine), VMware blockchain, Flexible Paxos, CorfuDB, and the FairPlay project. Previously, Dr. Malkhi served as CTO, lead maintainer, and lead researcher of the Diem (Libra) project, founder and Principal Researcher at VMware Research, Partner Principal Researcher at Microsoft Research, tenured Associate Professor at the Hebrew University of Jerusalem, and senior researcher at AT&T Labs.

Bringing Structure to Unstructured Data with an AI-First System Design by Will Gaviria Rojas, CoactiveAI

Will Gaviria Rojas

Abstract:
Today, over 80% of enterprise data is unstructured and this fraction is expected to rapidly increase with the proliferation of generative AI tools. However, doing anything meaningful with this unstructured content remains extremely challenging as traditional data systems have not adapted, and ad hoc machine learning approaches remain expensive to implement and difficult to scale. In this talk, I will present the pressing need to create AI-powered data systems for understanding unstructured data, share our experiences building these systems, and present key design considerations when building these systems for end-to-end applications.
Bio:
A former Data Scientist at eBay, Will has previously held various roles as a visiting researcher. His most recent work focuses on the intersection of AI and data systems, including performance benchmarks for data-centric AI and computer vision (e.g., DataPerf @ ICML 2022, the Dollar Street dataset @ NeurIPS 2022). His previous academic work spans from IoT electronics to design and performance benchmarking of deep learning in neuromorphic systems. Will holds a PhD in Materials Science from Northwestern University and a BS from MIT.

Gong Show

Panel Discussion: DB and AI

Moderator:

Fatma Ozcan

Fatma Özcan, Google
Fatma Özcan is a Principal Engineer at Systems Research@Google. Before that, she was a Distinguished Research Staff Member and a senior manager at IBM Almaden Research Center. Her current research focuses on platforms and infrastructure for large-scale data analysis, machine learning for databases, and democratizing analytics via NLQ and conversational interfaces to data. Dr. Özcan received her PhD in computer science from the University of Maryland, College Park, and her BSc in computer engineering from METU, Ankara. She has over 21 years of experience in industrial research, and has delivered core technologies into various IBM products. She has been a contributor to various SQL standards, including SQL/XML, SQL/JSON and SQL/PTF. She is the co-author of the book “Heterogeneous Agent Systems”, and co-author of several conference papers and patents. She received the VLDB Women in Database Research Award in 2022. She is an ACM Distinguished Member, and the vice chair of ACM SIGMOD. She has served on the board of trustees for the VLDB Endowment (2016-2022), and on the board of directors of CRA (2020-2023).

Panelists:

Dipti Borkar

Dipti Borkar, Microsoft
Dipti is a senior technology executive and entrepreneur with over 18 years of experience in cloud, open source and distributed data/database technologies. She is Vice President & General Manager at Microsoft, where she is responsible for SaaS App Development, Strategic ISVs and Azure Databricks. She founded Ahana (acquired by IBM in 2023), where she created a cloud managed service for SQL on data lakes and was Chief Product Officer and Vice President of Cloud & open-source engineering. She also served as the Chairperson of the Presto Foundation Community team.
Prior to Ahana, Dipti held VP roles at Alluxio, Kinetica, and Couchbase. At Alluxio, she was Vice President of Products, and at Couchbase she held several leadership positions, including VP, Product Management & Head of Global Solution Engineering. Earlier, Dipti managed development teams at IBM DB2 Distributed, where she started her career as a database software engineer. Dipti holds an M.S. in Computer Science from UC San Diego, and an MBA from the Haas School of Business at UC Berkeley.

Idan Gazit

Idan Gazit, GitHub
Idan is a Senior Director of Research at GitHub Next. He is a hybrid designer-developer, and can usually be found geeking out about the Web, data visualization, typography, and color. Prior to GitHub, he led the Data UX team at Heroku, which built the human interfaces to Heroku’s Postgres, Redis, and Kafka datastores. He lives in the East Bay with his family and surrounds himself with a rotating cast of half-finished projects.

Jure Leskovec

Jure Leskovec, Stanford University
Jure Leskovec is Professor of Computer Science at Stanford University, and a co-founder of the Stanford Data Science Initiative. He co-founded several machine learning start-ups and spent 6 years as Chief Scientist at Pinterest building AI systems. Leskovec pioneered the field of Graph Neural Networks and has successfully deployed them across many industrial use cases. Leskovec also co-authored PyG, the most widely-used graph neural network library. Leskovec’s research area is machine learning and data science for complex, richly-labeled relational structures, graphs, and networks for systems at all scales, from interactions of proteins in a cell to interactions between humans in a society. Applications include commonsense reasoning, recommender systems, social network analysis, computational social science, and computational biology with an emphasis on drug discovery. This research has won several awards including a Lagrange Prize, Microsoft Research Faculty Fellowship, the Alfred P. Sloan Fellowship, and numerous best paper and test of time awards. It has also been featured in popular press outlets such as the New York Times and the Wall Street Journal. Leskovec received his bachelor’s degree in computer science from the University of Ljubljana, Slovenia, his PhD in machine learning from Carnegie Mellon University, and postdoctoral training at Cornell University.

Edo Liberty

Edo Liberty, Pinecone
Edo Liberty is the Founder and CEO of Pinecone, the managed database for large-scale vector search.
Until April 2019, Edo was a Director of Research at AWS and Head of Amazon AI Labs. The Lab built cutting-edge machine learning algorithms, systems, and services for AWS customers. The team built parts of SageMaker, Kinesis, QuickSight, Amazon ElasticSearch, Glue, Rekognition, DeepRacer, Personalize, Forecast, and other yet-to-be-released services.
Before AWS, Edo was a Senior Research Director at Yahoo and Head of Yahoo’s Research Lab in New York. He worked on building horizontal machine learning platforms and improving applications such as online advertising, search, security, media recommendation, email abuse prevention, and many more.
Edo received his B.Sc. in Physics and Computer Science from Tel Aviv University and his Ph.D. in Computer Science from Yale University. After that, he was a Postdoctoral fellow at Yale in the Program in Applied Mathematics. He is the author of more than 75 academic papers and patents about machine learning, systems, and optimization.

Aditya Parameswaran

Aditya Parameswaran, UC Berkeley
Aditya Parameswaran is an Associate Professor at UC Berkeley. Aditya co-directs the EPIC Data Lab, a lab targeted at low/no-code data tooling with a special emphasis on social justice applications. Aditya also serves as the President of Ponder, a company he co-founded with his students based on popular data science tools developed at Berkeley. Aditya develops human-centered tools for scalable data science, making it easy for end-users and teams to leverage and make sense of their large and complex datasets. His visualization and data exploration tools have been downloaded millions of times.