PROSE – Text Splitting

Overview

Split.Text is a system for splitting plain-text data in which each row contains multiple fields that need to be separated into different columns. The Usage page and the Split.Text sample project show how to use the Split.Text API. Split.Text supports both purely predictive and interactive techniques for learning programs that split textual data.

Predictive Splitting

The predictive learning technique attempts to infer a program given only the input data and no other constraints from the user (such as output examples). It analyses the properties of the input data to infer the most regular pattern of fields and delimiters that have good alignment with one another. For instance, if we are given the following input data without any output examples:

Input
PE5 Leonard Robledo (Australia)
U109 Adam Jay Lucas (New Zealand)
R342 Carrie Dodson (United States)
TS51 Naomi Cole (Canada)
Y722 Owen Murphy (United States)
UP335 Zoe Erin Rees (GB)

 

Split.Text will predictively generate a program to perform the following three-column splitting:

Split Column 1 | Split Column 2 | Split Column 3
PE5 | Leonard Robledo | Australia
U109 | Adam Jay Lucas | New Zealand
R342 | Carrie Dodson | United States
TS51 | Naomi Cole | Canada
Y722 | Owen Murphy | United States
UP335 | Zoe Erin Rees | GB

In this case it identifies the space and the open and close brackets as probable delimiters, given the pattern in the inputs. However, not every occurrence of the space character is a delimiter: the person names (some of which include middle names) and the countries contain varying numbers of spaces. Hence we cannot simply split on every space. The Split.Text DSL and learning algorithm handle such scenarios by analyzing the patterns within the inferred data fields and by supporting contextual delimiters, which look at the data patterns around occurrences of possible delimiting substrings. More information about the DSL and learning techniques can be found in our recent publication on predictive program synthesis.
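
To make the difficulty concrete, the following self-contained C# sketch contrasts splitting on every space with a hand-written pattern that treats a space as a delimiter only in certain contexts. It does not use Split.Text and is not the program that Split.Text learns; the class name ContextualSplitSketch and the regular expression are assumptions chosen for this data set only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContextualSplitSketch
{
    // Hand-written for this data set only (an assumption, not learnt by Split.Text):
    //   group 1: the leading ID, which contains no spaces
    //   group 2: the name, which may contain spaces and ends before " ("
    //   group 3: the country, which may contain spaces, inside the brackets
    private static readonly Regex RowPattern = new Regex(@"^(\S+) (.+) \((.+)\)$");

    // Naive approach: every space is treated as a delimiter.
    public static string[] SplitOnEverySpace(string row)
    {
        return row.Split(' ');
    }

    // Contextual approach: a space only delimits when it follows the ID
    // or precedes the opening bracket.
    public static string[] SplitWithContext(string row)
    {
        Match match = RowPattern.Match(row);
        if (!match.Success)
        {
            return new string[0]; // the row does not follow the expected shape
        }
        return new[] { match.Groups[1].Value, match.Groups[2].Value, match.Groups[3].Value };
    }

    public static void Main()
    {
        string[] rows =
        {
            "PE5 Leonard Robledo (Australia)",
            "U109 Adam Jay Lucas (New Zealand)"
        };

        foreach (string row in rows)
        {
            // The naive split yields a different number of pieces per row.
            Console.WriteLine(string.Join(" | ", SplitOnEverySpace(row)));
            // The contextual split yields three aligned columns.
            Console.WriteLine(string.Join(" | ", SplitWithContext(row)));
        }
    }
}
```

On these two rows the naive split produces four and six pieces respectively, so the columns do not line up, while the contextual pattern produces three cells for both rows. The point of Split.Text is that such a pattern is inferred from the data rather than written by hand.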

Interactive Splitting

The predictive inference of Split.Text can handle many common practical scenarios for text splitting. However, different users often have different preferences for the kind of splitting they want, especially with respect to how a particular field should be split into subfields. For example, in the above scenario, one user may want the first names in a separate column, while another may prefer to have just the last name in its own column. Split.Text supports such scenarios with interactive features that permit the user to provide various constraints on the program that will be learnt.

The most powerful constraint is to provide examples of the desired splitting on some inputs. For instance, if the user wants first names to be split into a separate column, she may provide the following examples on the first two inputs:

Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia
U109 Adam Jay Lucas (New Zealand) | U109 | Adam | Jay Lucas | New Zealand

The system will then learn a program that can perform the same splitting on the rest of the data:

Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia
U109 Adam Jay Lucas (New Zealand) | U109 | Adam | Jay Lucas | New Zealand
R342 Carrie Dodson (United States) | R342 | Carrie | Dodson | United States
TS51 Naomi Cole (Canada) | TS51 | Naomi | Cole | Canada
Y722 Owen Murphy (United States) | Y722 | Owen | Murphy | United States
UP335 Zoe Erin Rees (GB) | UP335 | Zoe | Erin Rees | GB

If another user wants last names to be in a separate column, then he can similarly provide the corresponding examples to achieve that splitting:

Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia
U109 Adam Jay Lucas (New Zealand) | U109 | Adam Jay | Lucas | New Zealand

The system will then learn a program that can perform the same splitting on the rest of the data:

Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia
U109 Adam Jay Lucas (New Zealand) | U109 | Adam Jay | Lucas | New Zealand
R342 Carrie Dodson (United States) | R342 | Carrie | Dodson | United States
TS51 Naomi Cole (Canada) | TS51 | Naomi | Cole | Canada
Y722 Owen Murphy (United States) | Y722 | Owen | Murphy | United States
UP335 Zoe Erin Rees (GB) | UP335 | Zoe Erin | Rees | GB

In addition to examples, Split.Text supports various other constraints, such as whether or not the delimiters should be kept in separate columns.

Usage

The Split.Text APIs are accessed through the SplitSession class. The user creates a new SplitSession object, adds input data and any constraints to the session, and then calls the Learn() method to obtain a SplitProgram, which is the program learnt from the given input data and constraints. The key method of SplitProgram is Run(), which executes the program to split any given text input.

To use Split.Text, one needs to reference the following assemblies: Microsoft.ProgramSynthesis.Split.Text.dll, Microsoft.ProgramSynthesis.Split.Text.Semantics.dll, Microsoft.ProgramSynthesis.Split.Text.Learning.dll, Microsoft.ProgramSynthesis.Extraction.Text.Semantics.dll and Microsoft.ProgramSynthesis.Extraction.Text.Learning.dll.

The complete code for the scenarios described in this walk-through is available in the Sample Project which illustrates our API usage.

Initializing the session

The user can create a new Split session and add the input data as follows:

// create a new ProseSplit session
var splitSession = new SplitSession();

// add the input rows to the session
// each input is a StringRegion object containing the text to be split
var inputs = new List<StringRegion> {
    SplitSession.CreateStringRegion("PE5 Leonard Robledo (Australia)"),
    SplitSession.CreateStringRegion("U109 Adam Jay Lucas (New Zealand)"),
    SplitSession.CreateStringRegion("R342 Carrie Dodson (United States)")
};
splitSession.Inputs.Add(inputs);

Each row of text in the input data is added as a StringRegion object created from the text content of that row. If we want, we can also add constraints to the session to specify basic properties of the desired splitting, such as whether or not the delimiters should be included in the resulting split. If we do not want delimiters in the output, we can say so with a constraint as follows:

splitSession.Constraints.Add(new IncludeDelimitersInOutput(false));

We can clear any constraints provided in the session at any time by calling the splitSession.RemoveAllConstraints() method.

Learning a new split program

Split.Text can learn a program using only the provided input data in a purely predictive fashion, without any examples or other output constraints. This can be done by simply calling the Learn() function after adding the inputs.

// call the learn function to learn a splitting program from the given inputs
SplitProgram program = splitSession.Learn();

// check if the program is null (no program could be learnt from the given inputs)
if (program == null)
{
    Console.WriteLine("No program learned.");
    return;
}

Serializing/Deserializing a program

The SplitProgram.Serialize() method serializes the learned program to a string. The SplitProgramLoader.Instance.Load() method deserializes the program text to a program.

// serialize the learnt program and then deserialize
string progText = program.Serialize();
program = SplitProgramLoader.Instance.Load(progText);

Executing the learnt program

The learnt split program can be executed on any input StringRegion to produce an array of SplitCells. For example, we can execute the learnt program on each of the inputs as follows:

// run the learnt program on every input (Select and ToArray need "using System.Linq;")
SplitCell[][] splitResult =
    inputs.Select(input => program.Run(input)).ToArray();

Each SplitCell object represents information about a single split cell. Its CellValue field is the sub-region of the input that this split cell represents, and its IsDelimiter flag indicates whether this split cell is a field or a delimiter value. The learnt program can be executed independently of the session object on any new input text, not just on the inputs that have been entered into the session.
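
As an illustration of how these two members are typically consumed, the sketch below formats one row of cells while skipping the delimiter cells. To stay runnable without the Split.Text assemblies it uses a hypothetical stand-in class, CellStandIn, with a string CellValue and a bool IsDelimiter. The real SplitCell exposes CellValue as a sub-region of the input rather than a plain string, and the delimiter cells shown here are only illustrative, so treat this as a sketch of the pattern and not of the actual types or output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for SplitCell so that the sketch compiles on its own.
public sealed class CellStandIn
{
    public CellStandIn(string cellValue, bool isDelimiter)
    {
        CellValue = cellValue;
        IsDelimiter = isDelimiter;
    }

    public string CellValue { get; }
    public bool IsDelimiter { get; }
}

public static class CellFormattingSketch
{
    // Joins the field cells of one row with " | " and drops the delimiter cells.
    public static string FormatFields(IEnumerable<CellStandIn> row)
    {
        return string.Join(
            " | ",
            row.Where(cell => !cell.IsDelimiter).Select(cell => cell.CellValue));
    }

    public static void Main()
    {
        // One row as it might look when delimiters are kept in the output.
        var row = new[]
        {
            new CellStandIn("PE5", false),
            new CellStandIn(" ", true),
            new CellStandIn("Leonard Robledo", false),
            new CellStandIn(" (", true),
            new CellStandIn("Australia", false),
            new CellStandIn(")", true)
        };

        Console.WriteLine(FormatFields(row)); // prints "PE5 | Leonard Robledo | Australia"
    }
}
```

Filtering on the flag at display time is one way to keep delimiters available for inspection while still showing only the fields; specifying the delimiter constraint on the session avoids producing the delimiter cells in the first place.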

Executing the predictively learnt program on the three inputs given above, and having specified delimiters to not be included in the output, we get the following splitting:

PE5 | Leonard Robledo | Australia
U109 | Adam Jay Lucas | New Zealand
R342 | Carrie Dodson | United States

Providing example constraints

If the user desires a different split, then they can provide example constraints to specify what kind of split they would like. For instance, if the user wants to separate the first name into a different split cell, then they can provide examples on some of the input rows as follows:

splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 0, "PE5"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 1, "Leonard"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 2, "Robledo"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 3, "Australia"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 0, "U109"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 1, "Adam"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 2, "Jay Lucas"));
splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 3, "New Zealand"));

Each NthExampleConstraint takes three parameters: the input text on which the program will execute (the entire string), the zero-based index of the output split cell for which this example is being given, and the text value desired in that split cell. The example constraints given above describe each of the four split cells desired for the first two inputs in this session. After calling Learn() with these constraints, we obtain a program that produces the following output splitting on the three inputs given in this session:

PE5 | Leonard | Robledo | Australia
U109 | Adam | Jay Lucas | New Zealand
R342 | Carrie | Dodson | United States
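
Because every example given through NthExampleConstraint is a piece of the input text, it can be useful to sanity-check a set of examples before adding them to the session. The following self-contained sketch does not call Split.Text. It assumes, as in every table in this walk-through, that the example cells for one input appear in that input in left-to-right order without overlapping; the names ExampleCheckSketch and AreCellsInOrder are hypothetical.

```csharp
using System;

public static class ExampleCheckSketch
{
    // Returns true when every cell occurs in the input, in the given order,
    // and no cell overlaps the one before it.
    public static bool AreCellsInOrder(string input, string[] cells)
    {
        int searchFrom = 0;
        foreach (string cell in cells)
        {
            // Look for the next cell only after the end of the previous one.
            int position = input.IndexOf(cell, searchFrom, StringComparison.Ordinal);
            if (position < 0)
            {
                return false; // missing, misspelt, or out of order
            }
            searchFrom = position + cell.Length;
        }
        return true;
    }

    public static void Main()
    {
        string input = "U109 Adam Jay Lucas (New Zealand)";

        // The examples from the walk-through are consistent with the input.
        Console.WriteLine(AreCellsInOrder(input, new[] { "U109", "Adam", "Jay Lucas", "New Zealand" })); // prints "True"

        // A typo in one example cell is caught before any learning takes place.
        Console.WriteLine(AreCellsInOrder(input, new[] { "U109", "Adam", "Jay Lukas", "New Zealand" })); // prints "False"
    }
}
```

A check of this kind only confirms that the examples are consistent with the input text; whether a program satisfying them can be learnt is still decided by Learn().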