
Overview
Split.Text is a system for splitting data in plain text format, where there may be multiple fields that need to be separated into different columns. The Usage page and the Split.Text sample project show examples of how to use the Split.Text API. The Split.Text system supports purely predictive as well as interactive techniques to learn programs for splitting textual data.
Predictive Splitting
The predictive learning technique attempts to infer a program given only the input data and no other constraints from the user (such as output examples). It analyses the properties of the input data to infer the most regular pattern of fields and delimiters that have good alignment with one another. For instance, if we are given the following input data without any output examples:
Input |
---|
PE5 Leonard Robledo (Australia) |
U109 Adam Jay Lucas (New Zealand) |
R342 Carrie Dodson (United States) |
TS51 Naomi Cole (Canada) |
Y722 Owen Murphy (United States) |
UP335 Zoe Erin Rees (GB) |
Split.Text will predictively generate a program to perform the following three-column splitting:
Split Column 1 | Split Column 2 | Split Column 3 |
---|---|---|
PE5 | Leonard Robledo | Australia |
U109 | Adam Jay Lucas | New Zealand |
R342 | Carrie Dodson | United States |
TS51 | Naomi Cole | Canada |
Y722 | Owen Murphy | United States |
UP335 | Zoe Erin Rees | GB |
In this case it identifies the space and the opening and closing brackets as probable delimiters, given the pattern in the inputs. However, not every occurrence of the space character is a delimiter: the person names (some of which include middle names) and the countries contain a varying number of spaces. Hence we cannot simply split on every space. The Split.Text DSL and learning algorithm handle such scenarios by analyzing the patterns within the inferred data fields and by supporting contextual delimiters, which look at data patterns around occurrences of possible delimiting substrings. More information about the DSL and learning techniques can be found in our recent publication on predictive program synthesis.
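To make the idea of a contextual delimiter concrete, here is a small hand-written sketch in C#. It is not the program that Split.Text learns, nor its DSL. It is a fixed regular expression, chosen as an assumption for this one dataset, in which a space only counts as a delimiter when it follows the leading ID or precedes an opening bracket. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContextualSplitDemo
{
    // Hand-written pattern for illustration only (not learnt by Split.Text):
    //   ^(\S+)    the ID: a run of non-space characters at the start
    //   \s+       the space after the ID acts as a delimiter
    //   (.+?)     the name, which may itself contain spaces
    //   \s*\(     a space followed by "(" acts as a delimiter
    //   ([^)]*)   the country, which may also contain spaces
    //   \)\s*$    the closing bracket ends the row
    private static readonly Regex Row =
        new Regex(@"^(\S+)\s+(.+?)\s*\(([^)]*)\)\s*$");

    // Returns the three fields, or the whole line as one cell if it does not match.
    public static string[] SplitRow(string line)
    {
        Match m = Row.Match(line);
        if (!m.Success)
        {
            return new[] { line };
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        string line = "U109 Adam Jay Lucas (New Zealand)";

        // Treating every space as a delimiter breaks the name and the country apart.
        Console.WriteLine(string.Join(" | ", line.Split(' ')));
        // prints: U109 | Adam | Jay | Lucas | (New | Zealand)

        // Looking at the context around each space keeps the fields intact.
        Console.WriteLine(string.Join(" | ", SplitRow(line)));
        // prints: U109 | Adam Jay Lucas | New Zealand
    }
}
```

The fixed pattern only works because the row layout is known in advance. Split.Text, by contrast, infers the fields and delimiters from the input data itself.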
Interactive Splitting
The predictive inference of Split.Text can handle many common practical scenarios for text splitting. However, in many cases different users may have different preferences for the kind of splitting they want, especially with respect to how they want to split a particular field into subfields. For example, in the above scenario, one user may want to separate the first names into a separate column while another may prefer to have just the last name in its own column. Split.Text supports such scenarios with interactive features that permit the user to provide various constraints on the program that will be learnt.
The most powerful constraint is to provide examples of the desired splitting on some inputs. For instance, if the user wants first names to be split into a separate column, she may provide the following examples on the first two inputs:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam | Jay Lucas | New Zealand |
The system will then learn a program that can perform the same splitting on the rest of the data:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam | Jay Lucas | New Zealand |
R342 Carrie Dodson (United States) | R342 | Carrie | Dodson | United States |
TS51 Naomi Cole (Canada) | TS51 | Naomi | Cole | Canada |
Y722 Owen Murphy (United States) | Y722 | Owen | Murphy | United States |
UP335 Zoe Erin Rees (GB) | UP335 | Zoe | Erin Rees | GB |
If another user wants last names to be in a separate column, then he can similarly provide the corresponding examples to achieve that splitting:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam Jay | Lucas | New Zealand |
The system will then learn a program that can perform the same splitting on the rest of the data:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam Jay | Lucas | New Zealand |
R342 Carrie Dodson (United States) | R342 | Carrie | Dodson | United States |
TS51 Naomi Cole (Canada) | TS51 | Naomi | Cole | Canada |
Y722 Owen Murphy (United States) | Y722 | Owen | Murphy | United States |
UP335 Zoe Erin Rees (GB) | UP335 | Zoe Erin | Rees | GB |
As well as the ability to provide examples, Split.Text supports various other constraints, such as whether the user wants to keep the delimiters in separate columns or not.
Usage
The Split.Text APIs are accessed through the SplitSession class. The user can create a new SplitSession object, add input data and various constraints to the session, and then call the Learn() method to obtain a SplitProgram. This is the program that is learnt from the given input data and constraints. The SplitProgram’s key method is the Run() method, which executes the program to perform a split on any given text input.
To use Split.Text, one needs to reference Microsoft.ProgramSynthesis.Split.Text.dll, Microsoft.ProgramSynthesis.Split.Text.Semantics.dll, Microsoft.ProgramSynthesis.Split.Text.Learning.dll, Microsoft.ProgramSynthesis.Extraction.Text.Semantics.dll, and Microsoft.ProgramSynthesis.Extraction.Text.Learning.dll.
The complete code for the scenarios described in this walk-through is available in the Sample Project which illustrates our API usage.
Initializing the session
The user can create a new Split session and add the input data as follows:
    // create a new ProseSplit session
    var splitSession = new SplitSession();

    // add the input rows to the session
    // each input is a StringRegion object containing the text to be split
    var inputs = new List<StringRegion>
    {
        SplitSession.CreateStringRegion("PE5 Leonard Robledo (Australia)"),
        SplitSession.CreateStringRegion("U109 Adam Jay Lucas (New Zealand)"),
        SplitSession.CreateStringRegion("R342 Carrie Dodson (United States)")
    };
    splitSession.Inputs.Add(inputs);
Each row of text in the input data is added as a StringRegion object created from the text content in that row. If we want, we can also add constraints to the session to specify basic properties of the desired splitting, such as whether to include the delimiters in the resulting split. If we do not want delimiters in the output, we can specify this with a constraint as follows:

    splitSession.Constraints.Add(new IncludeDelimitersInOutput(false));

We can clear any constraints provided in the session at any time by calling the splitSession.RemoveAllConstraints() method.
Learning a new split program
Split.Text can learn a program using only the provided input data in a purely predictive fashion, without any examples or other output constraints. This can be done by simply calling the Learn() function after adding the inputs.
    // call the learn function to learn a splitting program from the given input examples
    SplitProgram program = splitSession.Learn();

    // check if the program is null (no program could be learnt from the given inputs)
    if (program == null)
    {
        Console.WriteLine("No program learned.");
        return;
    }
Serializing/Deserializing a program
The SplitProgram.Serialize() method serializes the learned program to a string. The SplitProgramLoader.Instance.Load() method deserializes the program text to a program.
    // serialize the learnt program and then deserialize
    string progText = program.Serialize();
    program = SplitProgramLoader.Instance.Load(progText);
Executing the learnt program
The learnt split program can be executed on any input StringRegion to produce an array of SplitCell objects. For example, we can execute the learnt program on each of the inputs as follows:

    SplitCell[][] splitResult = inputs.Select(input => program.Run(input)).ToArray();

Each SplitCell object represents information about a single split cell. Its CellValue field is the sub-region of the input that this split cell represents, and the IsDelimiter flag indicates whether this split cell is a field or a delimiter value. The learnt program can be executed independently of the Session object on any new input text, and not just the inputs that have been entered into the session.
Executing the predictively learnt program on the three inputs given above, and having specified delimiters to not be included in the output, we get the following splitting:
PE5 | Leonard Robledo | Australia |
U109 | Adam Jay Lucas | New Zealand |
R342 | Carrie Dodson | United States |
Providing examples constraints
If the user desires a different split, then they can provide examples constraints to specify what kind of split they would like. For instance, if the user wants to separate the first name into a different split cell, then they can provide examples on some of the input rows as follows:
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 0, "PE5"));
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 1, "Leonard"));
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 2, "Robledo"));
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 3, "Australia"));
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 0, "U109"));
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 1, "Adam"));
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 2, "Jay Lucas"));
    splitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 3, "New Zealand"));
Each NthExampleConstraint takes three parameters: the input text on which the program will execute (the entire string), the index of the output split cell for which this example is being given, and the text value desired in that split cell. The examples constraints given above describe each of the four split cells that are desired for the first two inputs that have been given in this session. After calling Learn() with these constraints, we obtain a program that produces the following output splitting on the three inputs given in this session:
PE5 | Leonard | Robledo | Australia |
U109 | Adam | Jay Lucas | New Zealand |
R342 | Carrie | Dodson | United States |
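The example constraints follow a regular pattern, so they can also be generated from a small table of expected cells. The following is a minimal sketch that continues the walk-through rather than a standalone program: it needs the Split.Text assemblies and reuses the splitSession and inputs variables created earlier. It relies only on calls that appear in this walk-through. The examples table and the lastNameProgram variable are hypothetical names introduced for illustration. The table encodes the alternative preference from the overview, where the last name gets its own column.

```csharp
// Illustrative continuation of the walk-through (not a standalone program).
// Start from a clean slate so earlier example constraints do not conflict.
splitSession.RemoveAllConstraints();
splitSession.Constraints.Add(new IncludeDelimitersInOutput(false));

// examples[i][j] is the desired text of split cell j for inputs[i]
// (a hypothetical helper table, not part of the Split.Text API).
string[][] examples =
{
    new[] { "PE5", "Leonard", "Robledo", "Australia" },
    new[] { "U109", "Adam Jay", "Lucas", "New Zealand" },
};

// Add one NthExampleConstraint per expected cell.
for (int row = 0; row < examples.Length; row++)
{
    for (int cell = 0; cell < examples[row].Length; cell++)
    {
        splitSession.Constraints.Add(
            new NthExampleConstraint(inputs[row].Value, cell, examples[row][cell]));
    }
}

// Learn again with the new constraints in place.
SplitProgram lastNameProgram = splitSession.Learn();
```

As with predictive learning, Learn() may return null if no program satisfies the constraints, so the result should be checked before calling Run().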