{"id":667005,"date":"2020-07-20T09:41:57","date_gmt":"2020-07-20T16:41:57","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=667005"},"modified":"2022-04-19T15:09:14","modified_gmt":"2022-04-19T22:09:14","slug":"prose-pattern-inspector","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-pattern-inspector\/","title":{"rendered":"PROSE &#8211; Pattern Inspector"},"content":{"rendered":"<p>Have you ever written a script to perform a string transformation, only to have it crash or silently produce wrong results because the input data was in unexpected formats? Or do you want to figure out how many different cases your standardization procedure needs to handle? <strong>Matching.Text<\/strong> to the rescue!<\/p>\n<p><strong>Matching.Text<\/strong> automatically identifies different formats and patterns in string data. Given a set of input strings, <strong>Matching.Text<\/strong> produces a small number of disjoint regular expressions that together match all the input strings, except possibly a small fraction of outliers. 
Additional documentation and usage can be found <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-pattern-inspector\/#!usage\">here<\/a>.<\/p>\n<h2>Scenario<\/h2>\n<p>Consider the list of names below, from which you want to extract last names.<\/p>\n<table class=\" msr-table-schedule all-borders\" style=\"border-collapse: collapse;width: 80%;border-spacing: inherit\">\n<thead>\n<tr>\n<th style=\"text-align: left\">Full Name<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left\">Laia Sanchis<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Gwilym Jones<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Cai Huws<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Tomi Elis<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Geraint Llwyd<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">\u2026<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div style=\"height: 30px\"><\/div>\n<p>A simple-looking task, if ever there was one \u2013 the Python function below is a good attempt.<\/p>\n<pre class=\"brush: plain; auto-links: true; title: ; notranslate\" title=\"\">\r\ndef extract_last_name(name):\r\n    return name&#x5B;name.find(' ')+1:]<\/pre>\n<p>However, while the first 10 names look standard, running <strong>Matching.Text<\/strong> provides more insight into the different formats and also identifies outliers that do not fall into any of them.<\/p>\n<table class=\" msr-table-schedule all-borders\" style=\"border-collapse: collapse;width: 80%;border-spacing: inherit\">\n<thead>\n<tr>\n<th style=\"text-align: left\">Pattern Name<\/th>\n<th style=\"text-align: left\">Regex Pattern<\/th>\n<th style=\"text-align: right\">Frequency<\/th>\n<th style=\"text-align: left\">Examples<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left\">Word_Word<\/td>\n<td style=\"text-align: left\"><code class=\"language-plaintext highlighter-rouge\">[A-Z][a-z]+ [A-Z][a-z]+<\/code><\/td>\n<td style=\"text-align: 
right\">0.84<\/td>\n<td style=\"text-align: left\">\u201cLaia Sanchis\u201d, \u201cGwilym Jones\u201d<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Word_Word_Hyphen_Word<\/td>\n<td style=\"text-align: left\"><code class=\"language-plaintext highlighter-rouge\">[A-Z][a-z]+ [A-Z][a-z]+-[A-Z][a-z]+<\/code><\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<td style=\"text-align: left\">\u201cTulga Bat-Erdene\u201d, \u201cDabir Al-Zuhairi\u201d<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Word_Word_Word<\/td>\n<td style=\"text-align: left\"><code class=\"language-plaintext highlighter-rouge\">[A-Z][a-z]+ [A-Z][a-z]+ [A-Z][a-z]+<\/code><\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<td style=\"text-align: left\">\u201cYue Ying Jen\u201d, \u201cRolf Van Eeuwijk\u201d<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Word<\/td>\n<td style=\"text-align: left\"><code class=\"language-plaintext highlighter-rouge\">[A-Z][a-z]+<\/code><\/td>\n<td style=\"text-align: right\">0.04<\/td>\n<td style=\"text-align: left\">\u201cDanlami\u201d, \u201cIsioma\u201d<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left\">Outliers<\/td>\n<td style=\"text-align: left\"><\/td>\n<td style=\"text-align: right\">&lt;0.01<\/td>\n<td style=\"text-align: left\">\u201cUNKNOWN\u201d, \u201cNULL\u201d<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div style=\"height: 30px\"><\/div>\n<p>With this insight, it is clear that <code class=\"language-plaintext highlighter-rouge\">extract_last_name<\/code> may not always return the right answer, and you may want to handle the last-name extraction task quite differently. 
Further, to make writing the procedure easier, <strong>Matching.Text<\/strong> can also generate a switch-case-like template that matches against the different patterns.<\/p>\n<pre class=\"brush: plain; auto-links: true; title: ; notranslate\" title=\"\">\r\nimport re\r\n\r\nregex_word_word = re.compile(r'&#x5B;A-Z]&#x5B;a-z]+ &#x5B;A-Z]&#x5B;a-z]+')\r\nregex_word_word_hyphen_word = re.compile(r'&#x5B;A-Z]&#x5B;a-z]+ &#x5B;A-Z]&#x5B;a-z]+-&#x5B;A-Z]&#x5B;a-z]+')\r\nregex_word_word_word = re.compile(r'&#x5B;A-Z]&#x5B;a-z]+ &#x5B;A-Z]&#x5B;a-z]+ &#x5B;A-Z]&#x5B;a-z]+')\r\nregex_word = re.compile(r'&#x5B;A-Z]&#x5B;a-z]+')\r\n\r\ndef extract_last_name(name):\r\n  # fullmatch requires the whole string to match, so each name hits at most one branch\r\n  if regex_word_word.fullmatch(name):\r\n    return \"TitleWord &amp; TitleWord\" # Modify\r\n  elif regex_word_word_hyphen_word.fullmatch(name):\r\n    return \"TitleWord &amp; TitleWord &amp; Const&#x5B;-] &amp; TitleWord\" # Modify\r\n  elif regex_word_word_word.fullmatch(name):\r\n    return \"TitleWord &amp; TitleWord &amp; TitleWord\" # Modify\r\n  elif regex_word.fullmatch(name):\r\n    return \"TitleWord\" # Modify\r\n  else:\r\n    return \"Others\" # Modify\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Matching.Text automatically identifies different formats and patterns in string 
data.<\/p>\n","protected":false},"featured_media":674232,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13554,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-667005","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[663303],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Usage","content":"The <code class=\"language-plaintext highlighter-rouge\">Matching.Text<\/code> API is accessed through the <code class=\"language-plaintext highlighter-rouge\">Matching.Text.Session<\/code> class. The input strings are added using <code class=\"language-plaintext highlighter-rouge\">Session.Constraints.Add()<\/code>. 
Once the inputs are added, calling <code class=\"language-plaintext highlighter-rouge\">Session.LearnPatterns()<\/code> returns a list of <code class=\"language-plaintext highlighter-rouge\">PatternInfo<\/code> objects that describe each pattern.\r\n\r\nEach <code class=\"language-plaintext highlighter-rouge\">PatternInfo<\/code> object has either:\r\n<ol>\r\n \t<li>The <code class=\"language-plaintext highlighter-rouge\">IsNull<\/code> field set to true, which indicates that the pattern matches only <code class=\"language-plaintext highlighter-rouge\">null<\/code> strings, or<\/li>\r\n \t<li>The <code class=\"language-plaintext highlighter-rouge\">IsNull<\/code> field set to false, in which case the pattern matches exactly those strings that match the regular expression in the <code class=\"language-plaintext highlighter-rouge\">Regex<\/code> field and do not match any of the regular expressions in the <code class=\"language-plaintext highlighter-rouge\">RegexesToExclude<\/code> field.<\/li>\r\n<\/ol>\r\nThe other fields indicate the frequency of the pattern (<code class=\"language-plaintext highlighter-rouge\">MatchingFraction<\/code>), a description in a PROSE-specific format (<code class=\"language-plaintext highlighter-rouge\">Description<\/code>), and a few examples of the input strings matched by the pattern (<code class=\"language-plaintext highlighter-rouge\">Examples<\/code>).\r\n<h2>Basic usage<\/h2>\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nusing System.Collections.Generic;\r\nusing Microsoft.ProgramSynthesis.Matching.Text;\r\n\r\nSession session = new Session();\r\n\r\nIEnumerable&lt;string&gt; inputs = new[] {\r\n \"21-Feb-73\",\r\n \"2 January 1920a \",\r\n \"4 July 1767 \",\r\n \"1892\",\r\n \"11 August 1897 \",\r\n \"11 November 1889 \",\r\n \"9-Jul-01\",\r\n \"17-Sep-08\",\r\n \"10-May-35\",\r\n \"7-Jun-52\",\r\n \"24 July 1802 \",\r\n \"25 April 1873 \",\r\n \"24 August 1850 \",\r\n \"Unknown \",\r\n \"1058\",\r\n \"8 August 1876 \",\r\n \"26 July 1165 \",\r\n \"28 December 1843 \",\r\n 
\"22-Jul-46\",\r\n \"17 January 1871 \",\r\n \"17-Apr-38\",\r\n \"28 February 1812 \",\r\n \"1903\",\r\n \"1915\", \r\n \"1854\",\r\n \"9 May 1828 \",\r\n \"28-Jul-32\",\r\n \"25-Feb-16\",\r\n \"19-Feb-40\",\r\n \"10-Oct-50\",\r\n \"5 November 1880 \",\r\n \"1928\",\r\n \"13-Feb-03\",\r\n \"8-Oct-43\",\r\n \"1445\",\r\n \"8 July 1859 \",\r\n \"25-Apr-27\",\r\n \"25 November 1562 \",\r\n \"2-Apr-10\", };\r\n \r\n session.Inputs.Add(inputs);\r\n \/\/ Five patterns are returned, corresponding to the formats \"dd-MMM-yy\", \"dd MMMM yyyy \", \"yyyy\", \"Unknown\", and \"2 January 1920a \".\r\n IReadOnlyList&lt;PatternInfo&gt; patterns = session.LearnPatterns();[\/code]\r\n"}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Arjun Radhakrishna","user_id":39405,"people_section":"Section name 0","alias":"arradha"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667005","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":11,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667005\/revisions"}],"predecessor-version":[{"id":732766,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667005\/revisions\/732766"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/674232"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=667005"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=667005"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-j
son\/wp\/v2\/msr-locale?post=667005"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=667005"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=667005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}