PROSE – Json Extraction

Overview

Extraction.Json automatically extracts tabular data from Json files. It supports extracting Newline Delimited Json and truncated Json.

The Usage page and the Sample Project illustrate the API usage.

Extraction.Json supports two main modes of extraction:

  1. No Joining Inner Arrays: arrays are not joined and are kept as a single cell in the output table. Each Json outer object corresponds to one row in the output table.
  2. Joining Inner Arrays: arrays are joined with other fields. Each Json outer object corresponds to multiple rows in the output table.

We use the following Json to illustrate different extraction modes:

[
  {
    "person": {
      "name": {
        "first": "Carrie",
        "last": "Dodson"
      },
      "address": "1 Microsoft Way",
      "phone number": []
    }
  },
  {
    "person": {
      "name": {
        "first": "Leonard",
        "last": "Robledo"
      },
      "phone number": [
        "123-4567-890",
        "456-7890-123",
        "789-0123-456"
      ]
    }
  }
]

No Joining Inner Arrays

In this mode, Extraction.Json produces the following output, in which each outer object corresponds to one row:

person.name.first | person.name.last | person.address  | person.phone number
Carrie            | Dodson           | 1 Microsoft Way |
Leonard           | Robledo          |                 | ["123-4567-890","456-7890-123","789-0123-456"]
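
The idea behind this mode can be sketched without the library. The following is a minimal illustration, not the Extraction.Json implementation: it assumes the input is a Json array of objects and uses System.Text.Json; the NoJoinSketch class and its Flatten method are hypothetical names introduced here. Nested objects become dotted column names, and each array is kept whole in one cell.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of Extraction.Json.
public static class NoJoinSketch
{
    // Flattens each outer object of a Json array into one row: nested objects
    // become dotted column names, and arrays are kept whole as a single cell.
    public static List<Dictionary<string, string>> Flatten(string jsonText)
    {
        using JsonDocument doc = JsonDocument.Parse(jsonText);
        var rows = new List<Dictionary<string, string>>();
        foreach (JsonElement outer in doc.RootElement.EnumerateArray())
        {
            var row = new Dictionary<string, string>();
            Visit(outer, "", row);
            rows.Add(row);
        }
        return rows;
    }

    private static void Visit(JsonElement element, string path, Dictionary<string, string> row)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (JsonProperty property in element.EnumerateObject())
                {
                    string child = path.Length == 0 ? property.Name : path + "." + property.Name;
                    Visit(property.Value, child, row);
                }
                break;
            case JsonValueKind.Array:
                // Not joined: the whole array stays in one cell.
                // Assumption of this sketch: an empty array gives an empty cell.
                row[path] = element.GetArrayLength() == 0 ? "" : element.GetRawText();
                break;
            case JsonValueKind.String:
                row[path] = element.GetString() ?? "";
                break;
            default:
                // Numbers, booleans and null are kept as their raw Json text.
                row[path] = element.GetRawText();
                break;
        }
    }
}
```

Calling NoJoinSketch.Flatten on the sample Json yields two rows, one per outer object. A column that an object lacks (such as person.address for the second object) is simply absent from that row's dictionary.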

Joining Inner Arrays

Each inner array can be viewed as an external table that is joined with the main table using a surrogate key. In this mode, there are two join semantics: inner join and outer join. They behave like inner and outer joins in a relational database.

Inner Join

Under inner join semantics, an outer object whose array is empty does not appear in the output table (because an inner join with an empty table produces an empty table).

Extraction.Json produces the following table for the above Json:

person.name.first | person.name.last | person.address | person.phone number
Leonard           | Robledo          |                | 123-4567-890
Leonard           | Robledo          |                | 456-7890-123
Leonard           | Robledo          |                | 789-0123-456

Note that the values of person.name.first and person.name.last are duplicated (as a result of the join), and the “Carrie Dodson” row does not appear in the output table (because its person.phone number array is empty).

Outer Join

Under outer join semantics, an outer object whose array is empty still appears in the output table, with an empty cell for the array column. This is the default semantics.

Extraction.Json produces the following table for the above Json:

person.name.first | person.name.last | person.address  | person.phone number
Carrie            | Dodson           | 1 Microsoft Way |
Leonard           | Robledo          |                 | 123-4567-890
Leonard           | Robledo          |                 | 456-7890-123
Leonard           | Robledo          |                 | 789-0123-456
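
The difference between the two join semantics can be sketched in a few lines of LINQ. This is only an illustration under stated assumptions, not how Extraction.Json is implemented: the Person record and the JoinSketch class are hypothetical stand-ins for the sample document, and a missing address is represented by an empty string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one outer object of the sample document;
// this type is not part of Extraction.Json.
public record Person(string First, string Last, string Address, IReadOnlyList<string> Phones);

public static class JoinSketch
{
    // Inner join: one row per array element, so an outer object with an
    // empty array produces no rows at all.
    public static List<string[]> InnerJoin(IEnumerable<Person> people) =>
        people.SelectMany(p => p.Phones
                  .Select(phone => new[] { p.First, p.Last, p.Address, phone }))
              .ToList();

    // Outer join: like the inner join, except that an outer object with an
    // empty array still yields one row, with an empty cell for the array column.
    public static List<string[]> OuterJoin(IEnumerable<Person> people) =>
        people.SelectMany(p => p.Phones
                  .DefaultIfEmpty("")
                  .Select(phone => new[] { p.First, p.Last, p.Address, phone }))
              .ToList();

    // The two outer objects of the sample Json, written out by hand.
    public static IReadOnlyList<Person> Sample() => new[]
    {
        new Person("Carrie", "Dodson", "1 Microsoft Way", Array.Empty<string>()),
        new Person("Leonard", "Robledo", "",
            new[] { "123-4567-890", "456-7890-123", "789-0123-456" }),
    };
}
```

On the sample data, InnerJoin returns three rows (all for Leonard Robledo) and OuterJoin returns four, the extra one being the Carrie Dodson row with an empty phone number cell.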

Usage

The main entry point is the Session class’s Learn() method, which returns a Program object. The Program’s key method is Run(), which executes the program on an input Json to obtain the extracted output. Each program also has a Schema property that defines the structure of the extracted data.

Other important methods are Serialize() and Deserialize(), which serialize and deserialize a Program object.

To use Extraction.Json, reference Microsoft.ProgramSynthesis.Extraction.Json.dll, Microsoft.ProgramSynthesis.Extraction.Json.Learner.dll, and Microsoft.ProgramSynthesis.Extraction.Json.Semantics.dll.

The Sample Project illustrates our API usage.

Basic Usage

By default, Extraction.Json learns a join program in which inner arrays are joined with other fields. As a result, an outer object in the input Json can be flattened into several rows in the output table.

The snippet below shows a learning session that generates such a program from the input jsonText:

string jsonText = ... 
var session = new Session(); 
session.Constraints.Add(new FlattenDocument(jsonText)); 
Program program = session.Learn(); 

Clients may add a NoJoinInnerArrays constraint to the session to learn non-join programs, as illustrated in the following snippet:

var noJoinSession = new Session();
noJoinSession.Constraints.Add(new FlattenDocument(jsonText), new NoJoinInnerArrays());
Program noJoinProgram = noJoinSession.Learn();

The Introduction page has more discussion on this topic.

Serializing/Deserializing a Program

The Extraction.Json.Program.Serialize() method serializes the learned program to a string. The Extraction.Json.Loader.Instance.Load() method deserializes the program text to a program.

// program was learned previously
string progText = program.Serialize();
Program loadProg = Loader.Instance.Load(progText);
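
Because a serialized program is a plain string, it can be stored and reused so that learning does not have to run every time. The following is a minimal sketch of that idea using only System.IO. ProgramCache and LoadOrCreate are hypothetical names, and learning is passed in as a delegate so the sketch does not depend on the library.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Extraction.Json.
public static class ProgramCache
{
    // Returns the serialized program stored at `path`. If the file does not
    // exist, calls `learnAndSerialize` once and stores its result first.
    public static string LoadOrCreate(string path, Func<string> learnAndSerialize)
    {
        if (File.Exists(path))
        {
            return File.ReadAllText(path);
        }

        string programText = learnAndSerialize();
        File.WriteAllText(path, programText);
        return programText;
    }
}
```

With the API described above, the delegate could be () => session.Learn().Serialize(), and the returned text could then be passed to Loader.Instance.Load() to obtain a Program.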

Executing a Program

Given an input Json, a program can generate a hierarchical tree or a flattened table. If the program is a join program, the table is flattened either using outer join (default) or inner join semantics.

Generating a Tree

Use the Run() method to obtain a hierarchical tree of the input document:

// program was learned previously
ITreeOutput tree = program.Run(jsonText);

Generating a Table

Supply the desired join semantics to the RunTable() method as follows:

// program was learned previously
IEnumerable outerJoinTable = program.RunTable(jsonText, TreeToTableSemantics.OuterJoin);
IEnumerable innerJoinTable = program.RunTable(jsonText, TreeToTableSemantics.InnerJoin);