{"id":666933,"date":"2020-07-20T09:39:52","date_gmt":"2020-07-20T16:39:52","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=666933"},"modified":"2022-04-19T15:08:19","modified_gmt":"2022-04-19T22:08:19","slug":"prose-text-transformation","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-text-transformation\/","title":{"rendered":"PROSE &#8211; Text Transformation"},"content":{"rendered":"<p><strong><code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code><\/strong> is a system that performs string transformations using examples allowing for many tasks involving strings to be performed automatically. <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> is based on the same research as the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/automating-string-processing-spreadsheets-using-input-output-examples\/\" target=\"_blank\" rel=\"noopener noreferrer\">FlashFill feature in Excel<\/a>, but with extended capabilities for semantic transformations involving dates and numbers as well as support for interactivity due to being part of PROSE. 
The <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-text-transformation\/#!usage\">Usage<\/a> page and the<br \/>\n<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/Microsoft\/prose\/tree\/master\/Transformation.Text\" target=\"_blank\" rel=\"noopener noreferrer\"><code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> sample project<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> show examples of how to use the Transformation.Text API.<\/p>\n<h2>Example Transformation<\/h2>\n<p>Given an example like:<\/p>\n<table class=\" msr-table-schedule all-borders\" style=\"border-collapse: collapse;width: 80%;border-spacing: inherit\">\n<thead>\n<tr>\n<th style=\"text-align: left\">Input1<\/th>\n<th style=\"text-align: left\">Input2<\/th>\n<th style=\"text-align: left\">Example output<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left\">Greta<\/td>\n<td style=\"text-align: left\">Hermansson<\/td>\n<td style=\"text-align: left\">Hermansson, G.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Transformation.Text will generate a program to perform the same transformation given any other first name, last name pair:<\/p>\n<table class=\" msr-table-schedule all-borders\" style=\"border-collapse: collapse;width: 80%;border-spacing: inherit\">\n<thead>\n<tr>\n<th style=\"text-align: left\">Input1<\/th>\n<th style=\"text-align: left\">Input2<\/th>\n<th style=\"text-align: left\">Program output<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left\">Kettil<\/td>\n<td style=\"text-align: left\">Hansson<\/td>\n<td style=\"text-align: left\">Hansson, K.<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Etelka<\/td>\n<td style=\"text-align: left\">Bala<\/td>\n<td style=\"text-align: left\">Bala, E.<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">\u2026<\/td>\n<td style=\"text-align: left\">\u2026<\/td>\n<td style=\"text-align: 
left\">\u2026<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Transformation.Text is a system that performs string transformations using examples allowing for many tasks involving strings to be performed automatically.<\/p>\n","protected":false},"featured_media":674232,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13554,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-666933","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[777094],"related-downloads":[],"related-videos":[],"related-groups":[663303],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Usage","content":"The <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> API is accessed through the <code class=\"language-plaintext highlighter-rouge\">Transformation.Text.Session<\/code> class. The primary methods are <code class=\"language-plaintext highlighter-rouge\">Constraints.Add()<\/code>, which adds examples (or other constraints) to a session, and <code class=\"language-plaintext highlighter-rouge\">Learn()<\/code>, which learns a <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> program consistent with those examples. 
In order to use <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code>, you need assembly references to <code class=\"language-plaintext highlighter-rouge\">Microsoft.ProgramSynthesis.Transformation.Text.dll<\/code>,\r\n<code class=\"language-plaintext highlighter-rouge\">Microsoft.ProgramSynthesis.Transformation.Text.Language.dll<\/code>, and\r\n<code class=\"language-plaintext highlighter-rouge\">Microsoft.ProgramSynthesis.Transformation.Text.Semantics.dll<\/code>.\r\n<h2>Basic usage<\/h2>\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nSession session = new Session();\r\nIEnumerable&amp;lt;Example&amp;gt; examples = new[]\r\n{\r\n    new Example(new InputRow(\"Greta Hermansson\"), \"Hermansson, G.\")\r\n};\r\nsession.Constraints.Add(examples);\r\nProgram program = session.Learn();\r\nobject output = program.Run(new InputRow(\"Kettil Hansson\")); \/\/ output is \"Hansson, K.\"[\/code]\r\n\r\nThe examples are given as an <code class=\"language-plaintext highlighter-rouge\">IEnumerable&lt;Example&gt;<\/code> with the input and the correct output. The input to\r\n<code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> is a row of a table of data which may include data from multiple columns. The <code class=\"language-plaintext highlighter-rouge\">InputRow<\/code> type lets you give a row as just a list of <code class=\"language-plaintext highlighter-rouge\">string<\/code>s without naming the columns. 
To get more control, implement <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code>'s <code class=\"language-plaintext highlighter-rouge\">IRow<\/code> interface.\r\n<h3>One example with multiple strings<\/h3>\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Greta\", \"Hermansson\"), \"Hermansson, G.\"));\r\nProgram program = session.Learn();\r\nstring output = program.Run(new InputRow(\"Kettil\", \"Hansson\")) as string; \/\/ output is \"Hansson, K.\"[\/code]\r\n\r\n(While the API types the output of running a <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> program as <code class=\"language-plaintext highlighter-rouge\">object<\/code>, the output type will always be <code class=\"language-plaintext highlighter-rouge\">string<\/code> (or <code class=\"language-plaintext highlighter-rouge\">null<\/code>) in the current version. The cast to <code class=\"language-plaintext highlighter-rouge\">string<\/code> is done using <code class=\"language-plaintext highlighter-rouge\">as string<\/code> to acknowledge that future versions of <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> may support other return types.)\r\n<h3>Multiple examples<\/h3>\r\n<code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> can be given multiple examples in order to generate a program that will generalize over differently formatted inputs. 
In this example, we give <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> a phone number to normalize in two different formats, and it is able to normalize a phone number in a third, similar format as well.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nvar examples = new[]\r\n{\r\n    new Example(new InputRow(\"212-555-0183\"), \"212-555-0183\"),\r\n    new Example(new InputRow(\"(212) 555 0183\"), \"212-555-0183\")\r\n};\r\nsession.Constraints.Add(examples);\r\nProgram program = session.Learn();\r\nstring output = program.Run(new InputRow(\"425 311 1234\")) as string; \/\/ output is \"425-311-1234\"[\/code]\r\n\r\nIf your input data is in multiple formats, you will likely have to provide more than one example. A common workflow is\r\nto have the user give a small number of examples and then inspect the output (possibly with inputs to inspect suggested by the significant inputs feature) and have the option of providing additional examples if they discover an undesired result. The code for that workflow might look something like this:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"212-555-0183\"), \"212-555-0183\"));\r\nProgram program = session.Learn();\r\n\/\/ ... check program and find it does not work as desired.\r\nsession.Constraints.Add(new Example(new InputRow(\"(212) 555 0183\"), \"212-555-0183\"));\r\nprogram = session.Learn();\r\nstring output = program.Run(new InputRow(\"425 311 1234\")) as string; \/\/ output is \"425-311-1234\"[\/code]\r\n\r\n<div style=\"height: 30px\"><\/div>\r\n<h2>Inputs without known outputs<\/h2>\r\nMost likely, when learning a program, you will have some idea of other inputs you intend to run the program on in the future. 
<code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> can take those inputs and use them to help decide which program to return.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Inputs.Add(new InputRow(\"04\/02\/1962\"),\r\n                  new InputRow(\"27\/08\/1998\"));\r\nsession.Constraints.Add(new Example(new InputRow(\"02\/04\/1953\"), \"1953-04-02\"));\r\nProgram program = session.Learn();\r\nstring output = program.Run(new InputRow(\"31\/01\/1983\")) as string; \/\/ output is \"1983-01-31\"[\/code]\r\n\r\n<div style=\"height: 30px\"><\/div>\r\n<h2>Learning multiple programs<\/h2>\r\nThere are usually a large number of programs consistent with any given set of examples. <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> has a ranking scheme which it uses to return the most likely program for the examples it has seen, but in some cases this may not be the desired program.\r\n\r\n<code class=\"language-plaintext highlighter-rouge\">LearnTopK()<\/code> has a parameter <code class=\"language-plaintext highlighter-rouge\">k<\/code> which specifies how many programs it should try to learn; it returns the top <code class=\"language-plaintext highlighter-rouge\">k<\/code>\u00a0ranked programs (or programs with the top <code class=\"language-plaintext highlighter-rouge\">k<\/code> ranks if there are ties).\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Greta Hermansson\"), \"Hermansson\"));\r\n\/\/ Learn top 10 programs instead of just the single top program.\r\nIReadOnlyList&amp;lt;Program&amp;gt; programs = session.LearnTopK(k: 10);\r\nforeach (Program program in programs)\r\n{\r\n    Console.WriteLine(program.Run(new InputRow(\"Kettil Hansson Smith\"))); \/\/ note that this input has a middle name\r\n}[\/code]\r\n\r\nThe first several programs output \u201cSmith\u201d, but after that one 
outputs \u201cHansson Smith\u201d. This could be used to ask the user which they meant or to do automated reranking of the top results based on some logic other than\r\n<code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code>'s internal ranking system.\r\n\r\nTo specifically get the top distinct outputs, without needing to directly access the programs, use\r\n<code class=\"language-plaintext highlighter-rouge\">ComputeTopKOutputsAsync()<\/code>:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Greta Hermansson\"), \"Hermansson\"));\r\nIReadOnlyList&amp;lt;object&amp;gt; outputs = await session.ComputeTopKOutputsAsync(new InputRow(\"Kettil Hansson Smith\"), k: 10);\r\nforeach (object output in outputs)\r\n{\r\n    Console.WriteLine(output);\r\n}[\/code]\r\n\r\n<div style=\"height: 30px\"><\/div>\r\n<h2>Serializing programs<\/h2>\r\nSometimes you will want to learn a program in one session and run it on other data in a future session or transfer learned programs between computers. 
In order to do so, PROSE supports serializing programs:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Kettil Hansson\"), \"Hansson, K.\"));\r\nProgram program = session.Learn();\r\n\/\/ Programs can be serialized using .Serialize().\r\nstring serializedProgram = program.Serialize();\r\n\/\/ Serialized programs can be loaded in another program using the Transformation.Text API using .Load():\r\nProgram parsedProgram = Loader.Instance.Load(serializedProgram);\r\n\/\/ The program can then be run on new inputs:\r\nConsole.WriteLine(parsedProgram.Run(new InputRow(\"Etelka Bala\"))); \/\/ outputs \"Bala, E.\"[\/code]\r\n\r\n<div style=\"height: 30px\"><\/div>\r\n<h2>API<\/h2>\r\n<h3>Learning <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> programs<\/h3>\r\nTo start, construct an empty <code class=\"language-plaintext highlighter-rouge\">Session<\/code> which encapsulates learning a program for a single task, often refined over the course of multiple learning calls.\r\n\r\nThe collection of all known inputs should be provided using <code class=\"language-plaintext highlighter-rouge\">.Inputs.Add()<\/code>. <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code> can make good use of around one hundred inputs; providing over a thousand may cause performance issues for some operations, although it will attempt to work on only a randomly selected sample when possible if too many inputs are provided. If selecting a subset of inputs to provide, they should be representative of the inputs the program will be run on. 
The inputs provided can be accessed using <code class=\"language-plaintext highlighter-rouge\">.Inputs<\/code> and removed using <code class=\"language-plaintext highlighter-rouge\">.RemoveInputs()<\/code> or <code class=\"language-plaintext highlighter-rouge\">RemoveAllInputs()<\/code>.\r\n\r\nThe main input to the learning procedure is a set of <strong>constraints<\/strong>, primarily examples, which are provided using <code class=\"language-plaintext highlighter-rouge\">.Constraints.Add()<\/code>. The following are common constraints used with <code class=\"language-plaintext highlighter-rouge\">Transformation.Text<\/code>:\r\n<ul>\r\n \t<li><strong><code class=\"language-plaintext highlighter-rouge\">Example<\/code><\/strong> (or <code class=\"language-plaintext highlighter-rouge\">Example&lt;IRow, object&gt;<\/code>). The most common constraint. Asserts what the output should be for a specific input.<\/li>\r\n \t<li><strong><code class=\"language-plaintext highlighter-rouge\">DoesNotEqual&lt;IRow, object&gt;<\/code><\/strong>. The opposite: for a specific input, gives a specific disallowed output.<\/li>\r\n \t<li><strong><code class=\"language-plaintext highlighter-rouge\">ColumnPriority<\/code><\/strong>. Used to specify which columns of the input to use. Useful if the <code class=\"language-plaintext highlighter-rouge\">IRow<\/code> implementation exposes many columns but only a few columns should be used by the program.<\/li>\r\n \t<li><strong><code class=\"language-plaintext highlighter-rouge\">OutputIs&lt;IRow, object&gt;<\/code><\/strong>. Constrains the output to be of a specific semantic kind. Note that the .NET type of the output will still be <code class=\"language-plaintext highlighter-rouge\">string<\/code>; support for other .NET types in the output is expected in the future. 
The supported types for this constraint are <code class=\"language-plaintext highlighter-rouge\">NumberType<\/code>, <code class=\"language-plaintext highlighter-rouge\">PartialDateTimeType<\/code>, and <code class=\"language-plaintext highlighter-rouge\">FormattedPartialDateTimeType<\/code>.<\/li>\r\n \t<li>See the <code class=\"language-plaintext highlighter-rouge\">Transformation.Text.Constraints<\/code> namespace for other constraints.<\/li>\r\n<\/ul>\r\n<code class=\"language-plaintext highlighter-rouge\">Session<\/code> has three different methods for learning (plus \u201c<code class=\"language-plaintext highlighter-rouge\">Async<\/code>\u201d variants):\r\n<ul>\r\n \t<li><code class=\"language-plaintext highlighter-rouge\">Learn()<\/code>\/<code class=\"language-plaintext highlighter-rouge\">LearnAsync()<\/code> returns the single top-ranked program as a <code class=\"language-plaintext highlighter-rouge\">Program<\/code>.<\/li>\r\n \t<li><code class=\"language-plaintext highlighter-rouge\">LearnTopK()<\/code>\/<code class=\"language-plaintext highlighter-rouge\">LearnTopKAsync()<\/code> takes an integer <code class=\"language-plaintext highlighter-rouge\">k<\/code> and returns the top-<code class=\"language-plaintext highlighter-rouge\">k<\/code> ranked programs as an\r\n<code class=\"language-plaintext highlighter-rouge\">IReadOnlyList&lt;Program&gt;<\/code>.<\/li>\r\n \t<li><code class=\"language-plaintext highlighter-rouge\">LearnAll()<\/code>\/<code class=\"language-plaintext highlighter-rouge\">LearnAllAsync()<\/code> learns all programs consistent with the examples, giving the result compactly as a <code class=\"language-plaintext highlighter-rouge\">ProgramSet<\/code> (wrapped in an <code class=\"language-plaintext highlighter-rouge\">IProgramSetWrapper<\/code>).<\/li>\r\n<\/ul>\r\nTo run a <code class=\"language-plaintext highlighter-rouge\">Program<\/code>, use its <code class=\"language-plaintext highlighter-rouge\">Run()<\/code> method:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\npublic object 
Run(IRow input)\r\n[\/code]\r\n\r\nIf performance of running a single program on many inputs is an issue, then implementing the <code class=\"language-plaintext highlighter-rouge\">IIndexableRow<\/code> interface and using the <code class=\"language-plaintext highlighter-rouge\">Run(IIndexableRow)<\/code> variant may help."}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Vu Le","user_id":39174,"people_section":"Section name 0","alias":"levu"},{"type":"user_nicename","display_name":"Clint Simon","user_id":40801,"people_section":"Section name 0","alias":"clsimon"},{"type":"user_nicename","display_name":"Ashish Tiwari","user_id":39171,"people_section":"Section name 0","alias":"astiwar"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/666933","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":14,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/666933\/revisions"}],"predecessor-version":[{"id":836656,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/666933\/revisions\/836656"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/674232"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=666933"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=666933"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=666933"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"http
s:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=666933"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=666933"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}