Many Excel users can relate to the problem of data wrangling.
You start with an input data set that you wish to transform into a new format so that you can perform further analysis using the transformed data. Not surprisingly, these problems arise for common types of data such as date and time, addresses, phone numbers, URLs.
For example, consider the following task. Given a column of dates in your spreadsheet as shown in column A in Figure 1, suppose you want to extract the day-of-week for each date so that you can subsequently filter or group your data by day-of-week. Note that the input data might appear in different formats, and the day of week may not even exist as part of the input data. Excel today does not include a built-in function that does this transformation. Although suitable functions do exist in open source repositories such as .NET, GitHub and Stack Overflow, or perhaps in a repository within your own organization, it is not easy for you to discover the right function for your task.
Transform Data by Example is an Excel add-in that helps you find the desired transformation function for your data wrangling task easily. All you need to do is to provide a few examples of the desired output (as shown in column B). Once you click on Get Transformations, Transform Data by Example will return a ranked list of programs that are automatically generated using a large collection of transformation functions it indexes to perform your desired transformation tasks. When you click on a result, it applies that program and fills the output column with the transformed data (see Figure 2). In this example, a program automatically generated using a standard .NET function (System.DateTime.Parse) was used in completing the task.
Transform Data by Example is powered by a backend service that runs on Azure. Like a web search engine, this service crawls and indexes many open source data transformation functions from sources such as .NET, GitHub and Stack Overflow and web services such as Bing Maps. By exploiting the powerful capabilities of the Azure platform, and using technology developed in Microsoft Research, the engine can quickly search through this large index and find functions best suited for the given task specified by input-output examples.
“Our technology can unlock the value of sophisticated, domain-specific transformations that people have laboriously developed over the years, and make it available to large number of users who are not necessarily sophisticated programmers” says Surajit Chaudhuri, a Distinguished Scientist and the Deputy Managing Director of Microsoft Research Redmond Lab. “This is significant for information workers in the enterprise who today spend a lot of their time on data wrangling.”
“We were inspired by technologies such as Flash Fill in Excel which uses an example-driven approach to automatically generate a program for syntactic string transformations, and the DataHub project at MIT that enables easy sharing and reuse of data analysis code” says Yeye He, a Researcher in the DMX group at Microsoft Research.
For many tasks, applying a transformation function found in the index on the user’s input data will not produce the desired output exactly. In these cases, Transform Data by Example can automatically generate a program to compose one or more functions to produce the desired result.
“We are excited that the technology we have developed is unique in the industry”, says Kris Ganjam, a Principal Research Software Development Engineer in the same group. “Transform Data by Example can automatically find a relevant data transformation functions and compose them to solve the user’s data wrangling task — all by using only a few input-output examples. Even recent startups in this space such as Trifacta, Paxata, and established players such as Informatica don’t have this capability”.
What makes the technology even more compelling is the ability to easily add your custom data transformations to the backend engine (by choosing Add My Service from the menu). Once added, your custom transformation is indexed and becomes searchable instantly. For now, your custom transformation is only searchable by you, and it will not be available for anyone else to use. “Longer term, we plan to make data transformations sharable and reusable across users within an enterprise seamlessly, and make Transform Data by Example the go-to search engine for data transformation problems” says Vivek Narasayya, a Principal Researcher in the DMX group.
While Transform Data by Example currently targets information workers in Excel who need to transform data residing within their spreadsheets, in the future the project members plan to make it easy to deploy the resulting transformations on large datasets residing in databases or data lakes. So, bring on your examples and let Transform Data by Example build the transformations you need!
Get Transform Data by Example https://microsoft.com/en-us/garage/profiles/transform-data-by-example/