PROSE Framework

PROSE Framework

Publications

Overview

Microsoft Program Synthesis using Examples (PROSE)

Microsoft PROSE is a framework of technologies for programming by examples: automatic generation of programs from input-output examples.

Given a domain-specific language (DSL) and input-output examples for the desired program’s behavior, PROSE synthesizes a ranked set of DSL programs that are consistent with the examples.

Download NuGet package

NuGet package for automatically synthesizing programs from examples.

Tutorial

A step-by-step walkthrough of the process of building a DSL in PROSE and enabling program synthesis for it.

API and samples

If you want to apply an existing PROSE DSL, check out its documentation and our samples.


Release notes

Read the release notes for the SDK.

GitHub

For questions, email us or open an issue on GitHub.

FAQ

Get answers to frequently asked questions.

People

We are a team of researchers and engineers working to create state-of-the-art programming-by-example and other technologies for data wrangling and other application domains of program synthesis. This is a revolutionary charter since 99% of computer users lack programming skills, and data scientists spend 80% of their time wrangling with data.

PROSE SDK team

Alumni

Tutorial

Tutorial

The core component of the PROSE SDK is its program synthesis framework for custom domain-specific languages (DSLs). It allows you to define a DSL that describes a typical space of tasks in your application domain, and automatically provides parsing, execution, and synthesis technologies for this DSL. Text transformation and text extraction DSLs are programming-by-example technologies that have been developed on top of the PROSE core synthesis framework.

This is a step-by-step tutorial on using the PROSE framework to create new DSLs. To use an existing DSL in your programming-by-examples application, please check out its individual documentation link on the left.

DSL Structure

A DSL consists of four main components:

  1. Syntax – a context-free grammar describing a space of possible programs in a DSL.
  2. Semantics – an executable function for each user-defined DSL operator.
  3. [optional] Witness functions – small “inverse semantics” functions that enable deductive backpropagation, the main synthesis technology provided in the PROSE framework.
  4. [optional] Features – computed attributes on individual programs in the DSL (for instance, a syntactic score of a program for ranking purposes).

Below, we illustrate the usage of all four components on a small example DSL – a portion of FlashFill.


See sections on: Syntax | Semantics | Witness functions | Features


Syntax

Our example DSL describes a space of programs that extract a substring from a given string. They can do it in two possible ways – either extract a substring based on absolute position indices, or based on matches of regular expressions.

Here’s the first version of the grammar of our DSL:

#reference 'file:SubstringExtraction.dll';
using semantics SubstringExtraction.Semantics;
language SubstringExtraction;

@input string x;

// Extract a substring from 'x' between positions 'posPair'
@start string out := Substring(x, posPair);
Tuple <int?, int?> posPair := PositionPair(pos, pos)
    = Pair(pos, pos);
int? pos := // A position at index 'k' (from the left if k >= 0, or from the right if k < 0)
    AbsolutePosition(x, k)
	// A position where two regexes 'positionBoundaries' match to left and to the right,
	// respectively, and it is the 'k'-th such position
	| RegexPosition(x, positionBoundaries, k);
Tuple <Regex, Regex> positionBoundaries := RegexPair(r, r)
    = Pair(r, r);
	
Regex r;
int k;

Here are some possible extraction programs contained in the SubstringExtraction DSL:

  • First 5 characters:
    Substring(x, PositionPair(AbsolutePosition(x, 0), AbsolutePosition(x, 5)))
    
  • Last character:
    Substring(x, PositionPair(AbsolutePosition(x, -2), AbsolutePosition(x, -1)))
  • A substring from the start until (but not including) the last number1:
    Substring(x, PositionPair(AbsolutePosition(x, 0), RegexPosition(x, RegexPair(//, /\d+/), -1))
  • A substring between the first pair of parentheses:
    Substring(x, PositionPair(RegexPosition(x, RegexPair(/\(/, //), 0), RegexPosition(x, RegexPair(//, /\)/), 0)))

    Note that this program will not extract the desired content if x contains unbalanced parentheses (for instance, x = "a) Bread ($2.00)"). This problem can be addressed by a language extension.

In general, a DSL consists of symbols and operators upon these symbols. In a context-free grammar, a DSL is represented as a set of rules, where each symbol on the left-hand side is bound to a set of possible operators on the right-hand side that represent this symbol. Every operator must be pure – it should not have observable side effects. In other words, PROSE DSLs are functional – they operate upon immutable states.

A program in a DSL transforms input data into output data. One terminal symbol in a DSL should be marked as @input – this is the input variable to all programs in this DSL. One nonterminal symbol in a DSL should be marked as @start – this is the root symbol for all programs in the DSL.

A program is represented as an abstract syntax tree (AST) of the DSL operators. Each node in this tree (similarly, each DSL operator) has some invocation semantics. More formally, each AST node has a method Invoke. It takes as input a state σ and returns some output.

A state is a mapping from DSL variables to their values. Initially, the topmost AST node invoked on a state with a single variable binding for the DSL’s input variable.

Here’s how you can parse and execute a program in our SubstringExtraction DSL:

using Microsoft.ProgramSynthesis;
using Microsoft.ProgramSynthesis.AST;
using Microsoft.ProgramSynthesis.Compiler;

// Parse the DSL grammar above, saved in a .grammar file
var grammar = DSLCompiler.ParseGrammarFromFile("SubstringExtraction.grammar").Value;
// Parse a program in this grammar. PROSE supports two serialization formats:
// "human-readable" expression format, used in this tutorial, and machine-readable XML.
var ast =
  ProgramNode.Parse("Substring(x, PositionPair(AbsolutePosition(x, 0), AbsolutePosition(x, 5)))",
                    grammar, ASTSerializationFormat.HumanReadable);
// Create an input state to the program. It contains one binding: a variable 'x' (DSL input)
// is bound to the string "PROSE Rocks".
var input = State.Create(grammar.InputSymbol, "PROSE Rocks");
// Execute the program on the input state.
var output = (string) ast.Invoke(input);
Assert(output == "PROSE");

A ParseGrammarFromFile invocation returns a Result<Grammar>, which either holds a valid parsed grammar in its Value property, or a collection of errors/warnings in its Diagnostics property. You can quickly examine all diagnostics by calling result.TraceDiagnostics().

At this moment grammar parsing will fail since we haven’t defined any execution semantics for the operators in our DSL, only its syntax. Let’s fix that.


Semantics

An executable semantics for an operator F in PROSE is a .NET function that matches the signature of F. You need to provide it for every operator in your DSL that is not imported from the standard library of PROSE or another language. In our example, such operators are Substring, AbsolutePosition, and RegexPosition.

All semantics functions should be defined as static methods. Static classes where PROSE searches for such functions (called semantics holders) are indicated in the grammar with a using semantics declaration. A DSL may contain multiple using semantics declarations, but each operator should correspond to exactly one semantics function with the same name and signature in one of semantics holders.

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class Semantics
{
    static string Substring(string x, Tuple<int?, int?> posPair)
    {
        if (posPair.Item1 == null || posPair.Item2 == null)
            return null;
        int start = posPair.Item1.Value;
        int end = posPair.Item2.Value;
        if (start < 0 || start >= x.Length ||
            end < 0 || end >= x.Length || end < start)
            return null;
        return x.Substring(start, end - start);
    }

    static int? AbsolutePosition(string x, int k)
    {
        if (k > x.Length || k < -x.Length - 1)
            return null;
        return k >= 0 ? k : (x.Length + k + 1);
    }

    static int? RegexPosition(string x, Tuple<Regex, Regex> regexPair, int occurrence)
    {
        if (regexPair.Item1 == null || regexPair.Item2 == null)
            return null;
        Regex left = regexPair.Item1;
        Regex right = regexPair.Item2;
        var rightMatches = right.Matches(x).Cast<Match>().ToDictionary(m => m.Index);
        var matchPositions = new List<int>();
        foreach (Match m in left.Matches(x))
        {
            if (rightMatches.ContainsKey(m.Index + m.Length))
                matchPositions.Add(m.Index + m.Length);
        }
        if (occurrence >= matchPositions.Count ||
            occurrence < -matchPositions.Count)
            return null;
        return occurrence >= 0
            ? matchPositions[occurrence]
            : matchPositions[matchPositions.Count + occurrence];
    }
}

This code illustrates several important points that you should keep in mind when designing a DSL:

  • DSL operators must be total (return a value for each possible combination of inputs) and pure (deterministic without observable side effects). An invalid input or any other exceptional situation should be handled not by throwing an exception, but by returning null instead. In PROSE, null is a valid value with a meaning “computation failed”.
  • Semantics functions should have the same name and signature as their corresponding DSL operators. They don’t have access to the current input state – if you need to access a DSL variable in scope, include it explicitly as a parameter. In our example, x is a parameter for both AbsolutePosition and RegexPosition.

Note: The dslc grammar compiler uses reflection to find definitions of external components of a grammar, such as semantics functions. It searches over the assemblies specified with reference statements in the grammar. Those assemblies must be built and present at given locations when you execute dslc (in a command-line or API form). If you build your semantics functions and your grammar definition in the same solution, make sure to separate them into different projects and make the grammar project depend on the semantics project, so that the latter one is built first.

If you generate your project template using our Yeoman generator, its projects will be pre-arranged in this fashion automatically.

Syntax and semantics above constitute a minimal DSL definition. They are sufficient for our little parsing/execution sample to work. Let’s proceed now to synthesizing programs in this DSL.


Synthesis

PROSE comes with a default synthesis strategy which we call deductive backpropagation. It also enables researchers in the field of synthesis to develop their own strategies on top of its common API. However, in this tutorial we explain how to use backpropagation for synthesis of programs in our SubstringExtraction DSL.

Program synthesis starts with a specification: what do we want from a desired program? In PROSE, specifications are inductive: they specify an output of a desired program on some input state, or, more generally, some property of this output.

In this tutorial, we start with the simplest form of such a spec – ExampleSpec, an input-output example. Given a spec, we invoke a learning session on it, generating a set of programs in the DSL that are consistent with the input-output examples in the spec.

using Microsoft.ProgramSynthesis.Specifications;
using Microsoft.ProgramSynthesis.Learning;

var input = State.Create(grammar.InputSymbol, "PROSE Rocks");
string desiredOutput = "PROSE";
var spec = new ExampleSpec(new Dictionary<State, object> { [input] = desiredOutput });
var engine = new SynthesisEngine(grammar);
ProgramSet learned = engine.LearnGrammar(spec);
Assert(learned?.Size > 0);

At this moment the learning result will be empty. The reason for this is that PROSE does not have information about your DSL to perform any kind of reasoning over it. For instance, terminal symbols k and r should be replaced with literal int or Regex constants, respectively, in each generated program. However, they are seemingly unbounded: any integer or regular expression in the world could possibly be present in a desired program, thus our search space is effectively infinite.

What about the specification, then?
An input-output example that we provided drastically restricts the search space size. For instance, the input string "PROSE Rocks" is 11 characters long, hence any absolute position extraction logic AbsolutePosition(x, k) with k > 12 or k < -11 cannot be consistent with the spec. What we just did here was backwards reasoning over the DSL structure: we deduced a constraint on k in a desired program from a constraint on the entire program.

To do that, we essentially inverted the semantics of AbsolutePosition, deducing its inputs (or their properties) given the output. In PROSE, such a procedure is called a witness function, and it is a surprisingly simple way to specify immensely powerful hints for the learning process.


Witness Functions

A witness function is defined for a parameter of a DSL operator. In its simplest form a witness function deduces a specification on that parameter given a specification on the entire operator. A witness function does not by itself constitute a learning algorithm (or even a substantial portion of it), it is simply a domain-specific property of some operator in your language – its inverse semantics.

For instance, the first witness function we’ll write in this tutorial is defined for the parameter posPair of the rule Substring(x, posPair) of our SubstringExtraction DSL. It takes as input an ExampleSpec φ on an output of Substring(x, posPair), and deduces a spec φ on an output of posPair subexpression that is necessary (or even better, necessary and sufficient) for the entire expression to satisfy φ.

Consider a program Substring(x, posPair) that outputs "PROSE" on a given input state { x := "PROSE Rocks" }. What could be a possible spec on posPair? Clearly, we know it precisely for the given example: posPair, whatever this program is, must evaluate to (0, 5) because this is the only occurrence of the string "PROSE" in the given input "PROSE Rocks".

In a more complex example, however, there is no single answer. For instance, suppose x = "(555) 279-2261", and the corresponding desired output in a spec is "2". In this case, the substring "2" could have been extracted from 3 different places in the input string. Therefore, instead of witnessing a single output value for posPair on a given input, in this case we witness a disjunction of three possible output values. A disjunction of possible outputs has its own representative spec type in PROSE – DisjunctiveExamplesSpec.

The two cases above lead us to a general definition of a witness function for posPair: find all occurrences of the desired output string in the input, and return a disjunction of them. In PROSE, you express it in the following way:

using Microsoft.ProgramSynthesis;
using Microsoft.ProgramSynthesis.Specifications;
using Microsoft.ProgramSynthesis.Learning;

class WitnessFunctions : DomainLearningLogic
{
    public WitnessFunctions(Grammar grammar) : base(grammar) { }  
  
    [WitnessFunction("Substring", parameterIndex: 1)]
    DisjunctiveExamplesSpec WitnessPositionPair(GrammarRule rule, ExampleSpec spec)
    {
        var result = new Dictionary<State, IEnumerable<object>>();
        foreach (var example in spec.Examples)
        {
            State inputState = example.Key;
            // the first parameter of Substring is the variable symbol 'x'
            // we extract its current bound value from the given input state
            var x = (string) inputState[rule.Body[0]];
            var substring = (string) example.Value;
            var occurrences = new List<Tuple<int?, int?>>();
            // Iterate over all occurrences of 'substring' in 'x',
            // and add their position boundaries to the list of possible outputs for posPair.
            for (int i = x.IndexOf(substring);
                 i >= 0;
                 i = x.IndexOf(substring, i + 1))
            {
                occurrences.Add(Tuple.Create((int?) i, (int?) i + substring.Length));
            }
            if (occurrences.Count == 0) return null;
            result[inputState] = occurrences;
        }
        return new DisjunctiveExamplesSpec(result);
    }
}

We put this witness function in a class called, for instance, SubstringExtraction.WitnessFunctions. Such a class holds domain learning logic – all hints and annotations that a DSL designer wants to provide to help the PROSE synthesis engine.

WitnessFunctions inherits from Microsoft.ProgramSynthesis.Learning.DomainLearningLogic, and must override a constructor to provide its base class with the grammar of the DSL it is defined for. Usually it’s easiest to supply this grammar to the constructor of WitnessFunctions since it is not accessible from within the WitnessFunction – the grammar definition depends on domain learning logic, not the other way around.

A domain learning logic is specified in the grammar file similarly to a semantics holder, with its own statement:

using learners SubstringExtraction.WitnessFunctions;

Similarly to semantics, a grammar may contain multiple using learners statements.

Writing witness functions

Some important points on writing witness functions:

  • A witness function is defined for one parameter of an operator, not the entire operator.
  • A witness function takes as input a spec on the output of the entire operator expression, and outputs a spec on the output of one parameter program in that expression.
  • Ideally, the spec produced by a witness function is necessary and sufficient to satisfy the given outer spec. You can also write an imprecise witness function, whose produced spec is only necessary for the outer spec to hold (in other words, it is an overapproximation). Such a witness function says “I cannot constrain this parameter precisely, but I can narrow down the space of possibilities. All valid parameter programs should satisfy my produced spec, but there may be some invalid ones that also satisfy it.”
    To mark a witness function as imprecise, add a property Verify = true to its [WitnessFunction] attribute.
  • You don’t need to define witness functions for parameters that are grammar variables in a state (such as x). More generally, you don’t need to define witness functions for a parameter p if all DSL programs that may derive from p do not include any constant literals.
  • You don’t need to define witness functions for operators from the standard library (with some exceptions).
  • Returning null from a witness function means “The given spec is inconsistent, no program can possibly satisfy it.”
  • Hint: use C# nameof operator with your Semantics implementations to provide an operator name in the [WitnessFunction] attribute.

Absolute positions

Covering more DSL operators with witness functions is straightforward. The next one witnesses k in the AbsolutePosition operator.

Given an example of position l that AbsolutePosition(x, k) produced, k must have been one of two options: the offset of l from the left or from the right in x. We can apply similar logic if we are given not one position l but a disjunction of them: the witness function just enumerates over each option, collecting all possible values of k.

[WitnessFunction("AbsolutePosition", 1)
DisjunctiveExamplesSpec WitnessK(GrammarRule rule, DisjunctiveExamplesSpec spec)
{
    var result = new Dictionary<State, IEnumerable<object>>();
    foreach (var example in spec.DisjunctiveExamples)
    {
        State inputState = example.Key;
        var ks = new HashSet<int?>();
        var x = (string) inputState[rule.Body[0]];
        foreach (int? pos in example.Value)
        {
            ks.Add(pos);
            ks.Add(pos - x.Length - 1);
        }
        if (ks.Count == 0) return null;
        result[inputState] = ks.Cast<object>();
    }
    return new DisjunctiveExamplesSpec(result);
}

Regex-based positions

Operator RegexPosition needs two witness functions: one for its rr parameter and one for its k parameter. For the first one, we need to learn a list of regular expressions that match to the left and to the right of given position. There are many techniques for doing that; in this tutorial, we assume that we have a predefined list of “common” regexes like /[0-9]+/, and enumerate them exhaustively at a given position.

Note: to make the semantics of RegexPosition(x, rr, k) and its witness functions consistent, we need to agree on what it means for a regex to “match” at a given position. If we take a standard definition of “matching”, and simply test each regex at each position, we may run into problems when determining the corresponding k for each regex.

Consider a string x = "abc def". We would like a program RegexPosition(x, RegexPair(//, /[a-z]+/), 1) to match before a second word in x – in this case, at position №4. However, for that we need to assume non-overlapping semantics of regex matches, since the regex /[a-z]+/ also matches at positions №0, №1, and №2. In fact, there are 6 matches of this regex in x, but only two “words” by a “common sense” definition. Therefore, instead of testing a regex at each position, we need to first run it against the entire string, record a list of non-overlapping matches, and only then test a position for a match in that list.

For computational efficiency, ideally we should cache the run of each predefined regex against each input string in the examples before the learning session starts. That way, we avoid recomputing it in each call of RegexPosition semantics and its witness functions. We avoid such caching in this tutorial for simplicity of presentation.

Here’s a witness function for rr:

Regex[] UsefulRegexes = {
    new Regex(@"\w+"),  // Word
    new Regex(@"\d+"),  // Number
    // ...
};

// For efficiency, this function should be invoked only once for each input string before the learning session starts
static void BuildStringMatches(string x,
                               out List<Tuple<Match, Regex>>[] leftMatches,
                               out List<Tuple<Match, Regex>>[] rightMatches)
{
    leftMatches = new List<Tuple<Match, Regex>>[inp.Length + 1];
    rightMatches = new List<Tuple<Match, Regex>>[inp.Length + 1];
    for (int p = 0; p <= x.Length; ++p)
    {
        leftMatches[p] = new List<Tuple<Match, Regex>>();
        rightMatches[p] = new List<Tuple<Match, Regex>>();
    }
    foreach (Regex r in UsefulRegexes)
    {
        foreach (Match m in r.Matches(x))
        {
            leftMatches[m.Index + m.Length].Add(Tuple.Create(m, r));
            rightMatches[m.Index].Add(Tuple.Create(m, r));
        }
    }
}

[WitnessFunction("RegexPosition", 1)]
DisjunctiveExamplesSpec WitnessRegexPair(GrammarRule rule, DisjunctiveExamplesSpec spec)
{
    var result = new Dictionary<State, IEnumerable<object>>();
    foreach (var example in spec.DisjunctiveExamples)
    {
        State inputState = example.Key;
        var x = (string) inputState[rule.Body[0]];
        List<Tuple<Match, Regex>>[] leftMatches, rightMatches;
        BuildStringMatches(x, out leftMatches, out rightMatches);
        var regexes = new List<Tuple<Regex, Regex>>();
        foreach (int? pos in example.Value)
        {
            regexes.AddRange(from l in leftMatches[pos.Value]
                             from r in rightMatches[pos.Value]
                             select Tuple.Create(l.Item2, r.Item2));
        }
        if (regexes.Count == 0) return null;
        result[inputState] = regexes;
    }
    return new DisjunctiveExamplesSpec(result);
}

Conditional witness functions

The last witness function in this tutorial witnesses a match index k for each regex pair in RegexPosition. To write this witness function, an outer spec on RegexPosition alone is insufficient: we can only write it for each individual regex pair, but not for all possible regex pairs at once. Thus, our witness function for k is conditional on rr: in addition to an outer spec, it takes an additional input – a spec on its prerequisite parameter rr.

In general, the prerequisite spec can be of any kind that provides your witness function any useful information. Typically, an ExampleSpec (i.e., a concrete value of prerequisite – in this case rr) is the most useful and common prerequisite spec. We use ExampleSpec here to deduce possible indices k for each regex pair in a manner similar to deducing absolute positions above.

[WitnessFunction("RegexPosition", 2, DependsOnParameters = new[] { 1 })]
DisjunctiveExamplesSpec WitnessKForRegexPair(GrammarRule rule, DisjunctiveExamplesSpec spec,
                                             ExampleSpec rrSpec)
{
    var result = new Dictionary<State, IEnumerable<object>>();
    foreach (var example in spec.DisjunctiveExamples)
    {
        State inputState = example.Key;
        var x = (string) inputState[rule.Body[0]];
        var regexPair = (Tuple<Regex, Regex>) rrSpec.Examples[inputState];
        Regex left = regexPair.Item1;
        Regex right = regexPair.Item2;
        var rightMatches = right.Matches(x).Cast<Match>().ToDictionary(m => m.Index);
        var matchPositions = new List<int>();
        foreach (Match m in left.Matches(x))
        {
            if (rightMatches.ContainsKey(m.Index + m.Length))
                matchPositions.Add(m.Index + m.Length);
        }
        var ks = new HashSet<int?>();
        foreach (int? pos in example.Value)
        {
            int occurrence = matchPositions.BinarySearch(pos.Value);
            if (occurrence < 0) continue;
            ks.Add(occurrence);
            ks.Add(occurrence - matchPositions.Count);
        }
        if (ks.Count == 0) return null;
        result[inputState] = ks.Cast<object>();
    }
    return new DisjunctiveExamplesSpec(result);
}

Learning with witness functions

After completing these four witness functions, we must make one final change in order for learning to happen – we must create an instance of WitnessFunctions and communicate it to the synthesis engine at learning time. To do that, we override a standard configuration of a SynthesisEngine, providing our WitnessFunctions to the DeductiveSynthesis strategy.

var engine = new SynthesisEngine(grammar, new SynthesisEngine.Config
{
    Strategies = new ISynthesisStrategy[]
    {
        new DeductiveSynthesis(witnessFunctions),
    }
});

After making this change, our LearnGrammar call succeeds, and we get back a set of several dozens possible consistent programs!

Ranking

Example are inherently an ambiguous form of specification. A user-provided spec of several input-output examples usually produces a huge set of DSL programs that are consistent with it (often billions of them!). To build a useful application, a synthesis-based technology has to somehow pick one “most likely” program from such a set. Many disambiguation techniques exist; in this tutorial, we show the most common one – ranking.

Ranking assigns each program a score – an approximation to its “likelihood” of being a desired program. For instance, string extraction based on absolute indices is less common than extraction based on regular expressions, therefore the former should be assigned a smaller score than the latter.


Features

In PROSE, scores are represented using computed features. A feature is a named attribute on a program AST, computed using provided feature calculator functions. A feature can be complete, which means that it must be defined with some value for each possible DSL program, or incomplete if it only exists on some DSL programs.

A feature is defined in a DSL as follows:

@complete feature double Score = SubstringExtraction.ScoreCalculator;

Here Score is its name, double is its type, and ScoreCalculator is a feature class that holds calculator functions. To implement this feature class, we inherit from Microsoft.ProgramSynthesis.Feature<T>:

using Microsoft.ProgramSynthesis;
using Microsoft.ProgramSynthesis.AST;

namespace SubstringExtraction
{
    class ScoreCalculator : Feature<double>
    {
        public ScoreCalculator(Grammar grammar): base(grammar, "Score", isComplete: true) { }
            /// 

            ///     Calculates the value of the feature represented by the current instance
            ///     for a given 
      	    ///	    program .
            ///   
        protected override double GetFeatureValueForVariable(VariableNode variable) => 0;
    }
}

Given a ProgramNode p, you can access the value of Score on this program as follows:

var scoreFeature = new ScoreCalculator(grammar);
Console.WriteLine(p.GetFeatureValue(scoreFeature));

Note: Since we defined Score as a complete feature, we had to override its GetFeatureValueForVariable method to provide an implementation of this feature computation on variable ASTs such as x.

Feature calculators

A feature calculator is defined for a grammar rule. There are four ways to define a calculator: based on recursive feature values, based on program syntax (given a ProgramNode or its children), or based on literal values.

Calculation from recursive values

The most common feature definitions are inductive, recursively defined over the grammar. For instance, a score for RegexPosition(x, rr, k) would be defined as a formula over a score for rr and a score for k. Such feature calculators take as input recursively computed values of the same feature on parameters of a current program AST:

[FeatureCalculator("RegexPosition", Method = CalculationMethod.FromRecursiveFeatureValues)]
double ScoreRegexPosition(double inScore, double rrScore, double kScore) => rrScore * kScore;
Calculation from program syntax

When recursively computed feature values are insufficient, you can take into account the entire syntax of a program AST. There are two ways to define such a calculator: (a) take as input a ProgramNode to score, or (b) take as input several individual ProgramNode instances representing ASTs of its parameter programs.

Hint: You can specify a specific subclass of ProgramNode as a parameter, if you know that your grammar structure only allows some specific AST kinds at this place.

[FeatureCalculator("AbsolutePosition", Method = CalculationMethod.FromChildrenNodes)]
double ScoreAbsolutePosition(VariableNode x, LiteralNode k)
{
    double score = (double) x.GetFeatureValue(this) + (double) k.GetFeatureValue(this);
    int kValue = (int) k.Value;
    if (Math.Abs(k) <= 1)
        score *= 10;
    return score;
}

// Alternatively:
[FeatureCalculator("AbsolutePosition", Method = CalculationMethod.FromProgramNode)]
double ScoreAbsolutePosition(ProgramNode p) { ... }
Calculation from literals

An inductively defined computed feature needs a basic case – its value on literal program ASTs. Feature calculators on terminal rules can take as input simply the value of a literal in a LiteralNode currently being scored. Instead of an operator name, you provide the name of the corresponding terminal symbol in the [FeatureCalculator] attribute:

[FeatureCalculator("k", Method = CalculationMethod.FromLiteral)] double KScore(int k) => 1.0 / (1 + Math.Abs(k));

Learning top programs

Our SubstringExtraction grammar needs a Score calculator for each rule (including standard library rules like PositionPair). After you define all of them, you can extract k top-ranked programs out of the candidates returned by learning:

ProgramSet learned = engine.LearnGrammar(spec);
var scoreFeature = new ScoreCalculator(grammar);
IEnumerable<ProgramNode> best = learned.TopK(scoreFeature, k: 1);

The method TopK assumes that your feature has a numerical type (convertible to double). It returns an descendingly ordered sequence of programs (greater score values are better). If several programs have the same score, they are all returned in the sequence, thus it may hold more programs than the requested value of k.

Instead of ordering a learned set of programs, you can instead learn only k topmost-ranked programs in the first place, if you are not interested in the entire set of candidates. Such a request significantly improves learning performance because PROSE can perform aggressive filtering in the middle of the learning process.

IEnumerable<ProgramNode> bestLearned = engine.LearnGrammarTopK(spec, scoreFeature, k: 1);

Important: in order for both TopK methods to work soundly, your feature must be monotonic over the grammar structure. In other words, greater-scored subexpressions should produce greater-scored expressions.

We can now take the best program and apply it on new user-provided data. Assuming scoring functions similar to FlashFill, this program will be “Extract the first word”:

ProgramNode p = bestLearned.First();
Console.WriteLine(p);
/*  Substring(x, PositionPair(RegexPosition(x, RegexPair(//, /\w+/), 0), RegexPosition(x, RegexPair(/\w+/, //), 0))  */
State input = State.Create(grammar.InputSymbol, "Program Synthesis");
Console.WriteLine(p.Invoke(input));
/*  Program  */

Further reading

This concludes our basic PROSE tutorial. To learn about more PROSE features, please check out these resources:

Footnote

  1. PROSE uses JavaScript/Perl syntax for regular expression literals.

Usage

Usage

Terminology

A domain-specific language (DSL) is a context-free programming language, created for a specific purpose, which can express tasks from a certain domain. A DSL consists of symbols and operators upon these symbols. A DSL is described as a context-free grammar – i.e., as a set of rules, where each symbol on the left-hand side is bound to a set of possible operators on the right-hand side that represent this symbol. Every operator of a DSL is pure – it does not have side effects.

A program in a DSL transforms input data into output data. A program is represented as an abstract syntax tree (AST) of the DSL operators. Each node in this tree (similarly, each DSL operator) has some invocation semantics. More formally, each AST node has a method Invoke. It takes as input a state σ and returns some output. A state is a mapping from DSL variables to their values. Initially, the topmost AST node invoked on a state σ = {v -> i} , where v is the DSL variable that represents program input, and i is the value of the input data on which the program is invoked. AST nodes can modify this state (add new variable bindings in scope) and pass it down to children nodes in recursive invocations of Invoke.

Architecture

The Microsoft.ProgramSynthesis package unites the core pieces of the meta-synthesizer. Here are the included assemblies:

  • Microsoft.ProgramSynthesis. includes the core functionality for manipulating
    ASTs, validating DSLs, maintain a DSL and a set of its operators, and core infrastructure for the synthesis algorithms.
  • dslc.exe is a DSL compiler. This is the executable that takes a language definition, parses it, validates it, potentially emits some diagnostics (warnings and/or errors), and serializes the parsed language representation into a portable file.
  • Microsoft.ProgramSynthesis.Learning is a core library of synthesis algorithms,
    standard operators and strategies.
  • Microsoft.ProgramSynthesis.Utils is a collection of utility methods that are used in different parts of the project.

A typical workflow of a DSL designer consists of the following steps:

  1. Define a DSL in a separate file (usually has a *.grammar extension).
  2. Define the semantics and all the supporting infrastructure for the DSL in a separate .NET assembly (see Section 4 below).
  3. Use one of the two options below to compile the DSL definition into a Grammar object:
    • At compile-time, invoke the DSL compiler dslc.exe manually on a *.grammar file with a DSL definition. Deploy the resulting serialized *.grammar.xml file with your application. At run-time, deserialize it using Grammar.Deserialize method.
    • Deploy the dslc.exe and the *.grammar file with your application. At run-time, compile your DSL definition in memory using DSLCompiler.Compile method.

Language definition

The main class Grammar represents a context-free grammar as a set of DSL rules and a list of references to operators’ semantics and/or custom operator learners, implemented in C# in a separate assembly. The method DSLCompiler.Compile loads a grammar definition from a string. Multiple examples of its usage can be found in our sample repository.

Here’s a typical language definition:

using TestSemantics;
using semantics TestSemantics.FlashFill;
using learners TestSemantics.FlashFill.Learners;

language FlashFill;

@start string f := ConstStr(s) | let x: string = v in SubStr(x, P, P);
int P := CPos(x, k) | Pos(x, r, r, k);
int k;
@values[StringGen] string s;
Regex r;
@input string v;

NOTE: The operators ConstStr, SubStr, CPos and Pos are defined in Black-box operators below.

First, we reference any namespaces for external typename lookup. Then, we specify the static classes for semantics and learners. There may be several using, using semantics, and using learners instructions. In this example, TestSemantics.FlashFill is a static class defined in the assembly “TestSemantics.dll” (which we pass as a reference in the DSLCompiler.Compile invocation) as is TestSemantics.FlashFill.Learners.

Next, we specify the language name and its rules in a form of a context-free grammar (BNF). Each nonterminal rule has a symbol on the left-hand side and a list of operators on the right-hand side, separated by “|”. Each terminal rule has only the left-hand side. Each symbol has its type, specified on the LHS. The supported types are: any of the standard C# types, including classes from System.Collection.Generic, System.Text.RegularExpressions, etc., or your own custom types, defined in the same semantics assembly. In the latter case, the framework searches for the type name in the namespaces referenced in using instructions.

Exactly one nonterminal rule should be annotated as “@start” – this is a start symbol of the grammar (i.e., the topmost node of every AST). Exactly one terminal rule should be annotated as “@input” – this is the input data to the program.

Terminals

Terminal rules specify the leaf symbols that will be replaced with literal constants or variables in the AST. For example:

  • A terminal rule int k; specifies a symbol k that represents a literal integer constant.
  • A terminal rule @input string v; specifies a variable symbol v that will be replaced with program input data at runtime.

A user can specify the list of possible values that a literal symbol can be set to. This is done with a @values[G] annotation, where G is a value generator – a reference to a user-defined static field, property, or method.  G should evaluate to IEnumerable (thus, it can be any standard .NET collection, such as an array or a list, or a user-defined collection type). The framework searches for the definition of G in the learner classes, specified by a using learners instruction. For our example grammar, any of the following definition can serve as a value generator for S:

public static class FlashFill
{
    public static class Learners
    {
        public static string[] StringGen = {"", "123", "Gates, B.", "ABRACADABRA"};
        public static string[] StringGen
        {
            get { return new[] {"", "123", "Gates, B.", "ABRACADABRA"}; }
        }
        public static string[] StringGen()
        {
            return new[] { "", "123", "Gates, B.", "ABRACADABRA" };
        }
    }
}

If no value generator is provided, the framework can either pick a standard generator for some common types (such as byte), or assume that the literal can be set to any value. This can impact the performance of some synthesis strategies or even make them inapplicable.


Black-box operators


black-box operator is an operator that does not refer to any standard concepts, and its invocation semantics do not modify the state σ that has been passed to it. In the example above ConstStrSubStrCPosPos are black-box operators. Note that SubStr is a black-box operator, even though the Let concept surrounding it isn’t. SubStr’s semantics don’t modify σ, they just pass it on to parameter symbols. Let’s semantics change the state σ by extending it with a new variable binding x -> v.

The semantics of every black-box operator should be defined in the semantics assembly with the same name as a public static method:

public static class Semantics
{
    public static int CPos(string s, int k)
    {
        return k < 0 ? s.Length + 1 + k : k;
    }

    public static int? Pos(string s, Regex leftRegex, Regex rightRegex,
                           int occurrence)
    {
        var boundaryRegex = new Regex(string.Format("(?<={0}){1}", leftRegex, rightRegex));
        var matches = boundaryRegex.Matches(s);
        int index = occurrence > 0 ? occurrence - 1 : matches.Count + occurrence;
        if (index < 0 || index >= matches.Count)
        {
            return null;
        }
        return matches[index].Index;
    }

    public static string ConstStr(string s)
    {
        return s;
    }

    public static string SubStr(string s, int left, int right)
    {
        if (left < 0 || right > s.Length || right < left)
        {
            return null;
        }
        return s.Substring(left, right - left);
    }

    public static string Concat(string s, string t)
    {
        return s + t;
    }
}

It is important to note that the learning process requires that semantic functions are total. If the function is inapplicable for the current choice of arguments or if for any other reason it would throw an exception, it must return null instead1. In such a case, if the return type of a function is a .NET value type, it should be made nullable.


This choice was made to enable efficient program synthesis. During learning, PROSE may repeatedly invoke partial programs on various inputs for verification purposes. If an input is invalid and a program handles it by throwing an exception, it slows down the learning by two orders of magnitude.

Nulls are automatically propagated upward like exceptions: if an argument to a semantic function is null, its result is automatically presumed to be null as well. You can override this behavior by annotating your semantics function with a [LazySemantics] attribute. In that case, your function will receive any null arguments as usual, and must handle it on its own.

Every grammar operator should be pure, and its formal signature should be σ -> T (where T is the type of the corresponding LHS symbol). The actual signature of the semantics function that implements a black-box operator is ( T1,T2},. . .,Tk ) -> T, where Tj is the type of  jth parameter symbol, and T is the return type of the LHS symbol. Consequently, if one needs to invoke their operators on anything other than the program input data, they have to introduce additional variables using Let construct and/or lambda functions.

Let construct


Let construct is a standard concept with the following syntax:

let x11 = v1,. . ., xkk = vk in RHS

where RHS is any standard rule RHS (a black-box operator, a grammar symbol, etc.). The RHS and any of the symbols it (indirectly) references can make use of the variables x1,. . .,xk Grammar symbols v1,. . .,vk are parameters of the rule; at runtime, each variable xj is bound to some value of the corresponding symbol vj.

Example:

The running grammar definition FlashFill contains the following rule:

@start string f := let x: string = v in SubStr(x, P, P)

The formal signature of this rule has three free parameters: v, P, P. At runtime, the Let construct, when given a state σ, executes the following operations:

  • ϑ :=  [[v]] σ
    (Execute the parameter symbol v on the state and save the value as ϑ)
  • σ' := σ ∪ { x -> ϑ} (Add a new variable x to the state, bind it to the value ϑ)
  • return [[SubStr(x,P,P) ]] σ'
    (Invoke the RHS – the black-box operator SubStr – on the new state)

The RHS indirectly references the variable x further in the grammar through the symbol P. Namely, the grammar further contains the following rule:

int P := Pos(x, r, r, k)

Pos is a simple black-box operator, just like SubStr. When given a state σ', it passes it on to its parameter symbols x,r,r,k. The latter three symbols are terminals; they will be represented as literal constants of the corresponding types in any final program (AST). The first symbol is a variable x; when executed on a state σ', it will just extract the binding x -> ϑ from it, and return ϑ.

Standard concepts

There exists a range of concepts that are common for many DSLs and implement standard functionality. In particular, many list/set processing concepts (Map, Filter, TakeWhile, etc.) encode various forms of loops that arise in many DSLs for different purposes. The synthesizer treats many such concepts as first-class citizens and is aware of many efficient strategies for synthesizing such concepts from example specifications. Consequently, we encourage framework users to use standard concepts for any loop form that arises in their DSLs, instead of encoding it as a black-box operator. In our experience, most loop forms commonly used in DSLs can be expressed as combinations of standard concepts.

The current list of supported concepts can be found in the Microsoft.ProgramSynthesis.Rules.Concepts namespace.

Most standard loop concepts express some form of list/set processing with lambda functions. We explain their usage here on a simple Filter example.

A Filter(p, s) concept is an operator that takes as input a predicate symbol p and a set symbol s. It evaluates s on a state σ to some set S of objects. It also evaluates p on a state σ to some lambda function L = λ x:e. Finally, it filters the set S using L as a predicate, and returns the result. Essentially, Filter is equivalent to Select in LINQ.

Consider the following grammar for various filters of an array of strings:

using TestSemantics;
using semantics TestSemantics.Flare;
using learners TestSemantics.Flare.Learners;

language Flare;

@input string[] v;
Regex r;
StringFilter f := Match(r) | FTrue();
@start string[] P := Selected(f, v) = Filter(f, v);

The input string array v is filtered with a filter f. A filter f can filter elements either according to a regular expression r, or trivially (by returning true).

The main rule of the grammar is string[] P := Selected(f, v). Here Selected(f, v) is a formal operator signature: this is how it would look like if it was implemented as a black-box operator. However, the actual implementation of Selected(f, v) refers to the standard Filter concept instead of a black-box semantics implementation. The arguments of the Filter concept are in this case the parameters of a Selected operator – the symbols f and v. At runtime, the framework interprets a Selected AST node by executing the standard Filter semantics with the corresponding arguments.

For a valid execution, the runtime types of the symbols f and v should satisfy the following contract:

  1. The type of v should implement IEnumerable. It can be a standard .NET collection (array, List, etc.), or a user-defined custom type that implements IEnumerable.
  2. The type of f should have functional semantics. In other words, it should behave like a lambda function, because at runtime it will be “invoked” with array elements as arguments. Moreover, for the Filter concept specifically, the return type of this “function” should be assignable to bool.

Lambda functions

One can capture functional semantics in an explicit lambda function in the grammar:

using semantics TestSemantics.Flare;
using learners TestSemantics.Flare.Learners;

language Flare;

@input string[] v;
Regex r;
bool f := Match(x, r) | True();
@start string[] P := Selected(f, v) = Filter(\x: string => f, v);

The first argument of the Filter concept is a
lambda function λ x:f. In our syntax, it is represented as \x: string => f2. Here x is a freshly bound variable (lambda function parameter), and f is a grammar symbol that represents function body. The runtime type of x should be specified explicitly after a colon.

Just as with Let constructs, the lambda function body symbol and all its indirect descendants can now reference the variable symbol x. At runtime, it will be successively bound to every element of the input string array. Since this binding introduces a new variable in the state σ, a lambda definition cannot be expressed as a black-box rule. Instead, it is a special rule kind with first-class treatment in the framework (again, just as Let).

The corresponding semantics implementation is now much simpler:

public static class Flare
{
    public static bool Match(string x, Regex r)
    {
        return r.IsMatch(x);
    }

    public static bool True()
    {
        return true;
    }
}

Note that a lambda function body is a free symbol on the RHS of the “=” sign. In other words, the set of free symbols on the RHS includes the direct parameters of the concept (in the example above, v) and the lambda function bodies (in the example above, f). To make the concept rule well-defined, this set should be exactly the same as the set of free symbols on the LHS of the “=” sign (i.e. the set of formal parameters of the rule). However, they should only be equivalent as sets, the order of parameters does not matter. In case of multiple usages of the same symbol among parameters, the correspondence between the LHS and the RHS is resolved in a left-to-right order.

Example:

Consider the following concept rule:

int S := F(v, P, P) = G(\x: string => P, v, P);

Here G is some standard concept, and S, vP are grammar symbols. The correspondence between formal parameters on the LHS and free symbols on the RHS is resolved as follows:

  • The first parameter v on the LHS corresponds to the second argument v on the RHS.
  • The second parameter P on the LHS corresponds to the body of the first argument λ x:P on the RHS.
  • The third parameter P on the LHS corresponds to the third argument P on the RHS.

Two usages of the same symbol P among parameters were resolved in a left-to-right order.

Language usage

Definition and usage of custom DSLs starts with the following steps:

  1. Define a string representation of your DSL grammar in our syntax.
  2. Implement your black-box operator semantics, custom types, and value generators in a separate assembly. Make sure that it is accessible at a path specified in the grammar string3.
  3. Load the grammar into a Grammar object:
    • Option 1: programmatically using DSLCompiler.Compile.

The example below shows how to compile the grammar programmatically (option “a”):

 const string TestFlareGrammar = @"
         using TestSemantics;
         using semantics TestSemantics.Flare;
         language Flare;
         @input string[] v;
         Regex r;
         StringFilter f := Match(r) | FTrue();
         @start string[] P := Selected(f, v) = Filter(f, v);";
         
         
var grammar = DSLCompiler.Compile(new CompilerOptions()
    {
        InputGrammarText = TestFlareGrammar,
        References = CompilerReference.FromAssemblyFiles(typeof(Program).GetTypeInfo().Assembly)
    }).Value;

Partial programs

In many applications there is need to manipulate partial programs – ASTs where some tree nodes are replaced with holes. A hole is a special type of an AST node: it’s an unfilled placeholder for some instantiation of a corresponding grammar symbol. In the AST node class hierarchy it is represented as Microsoft.ProgramSynthesis.AST.Hole.

To get a list of descendant holes in a program p, call p.Holes. Note that in current architecture AST nodes do not maintain any references to their parents. Consequently, to enable practical usages, the Holes property returns not the Hole nodes themselves, but their positions instead. A position is represented as a tuple (P, k, H), where P is a parent AST node of a hole, k is the hole’s index in the list of P’s children, and H is the hole itself.

A string representation of a hole of symbol S is “?S”. The framework supports parsing AST strings that contain holes.

Synthesis

Specifications

Program synthesis in the PROSE SDK is defined as a process of generating a set of programs ˜P that start with a root symbol P, given a specification φ. A specification is a way of defining the desired behavior of every program in ˜P. Different synthesis applications have different kinds of specifications to specify the desired program behavior. Some of the examples from prior work include:

  • In FlashFill, φ is an example specification: for every given input state σ, the desired program output should be equal to the given string φ[σ].
  • In Extraction.Text, φ is a subset specification. It assumes that
    the runtime type of a root symbol P is some sequence IEnumerable. The specification φ expresses the following property: for every given input state φ, the desired program output should be a superset of the given subset φ[σ]. In terms of Extraction.Text, φ[σ] are the substrings highlighted by the user, and the program output includes them and the rest of the substrings that should be highlighted.

The supported specification kinds are located in the namespace Microsoft.ProgramSynthesis.Specifications. All of them inherit the base abstract class Specification. This class has a single abstract method Valid:(State, object) to bool, which returns true if and only if the program output on the given input state satisfies the constraint that the specification expresses.

Spec is the main abstract base class for all
inductive specifications, i.e. those that constraint the program behavior on a set of provided input states φ.ProvidedInputs. Some of the main inductive specification kinds are described below:

  • InductiveConstraint specifies an arbitrary constraint Constraint: (State, object) to bool on the set of provided input states.
  • SubsequenceSpec specifies a subsequence of the desired output sequence for each provided input state in the dictionary Examples: State -> IEnumerable<object>.
    • PrefixSpec is a subclass of SubsequenceSpec where the desired subsequence of a program output is also required to be its prefix.
  • FunctionalOutputSpec describes the behavior of the program whose output type is a lambda function or a functional symbol. For each input state, it specifies a set of input/output pairs. These pairs are the behavior examples for the function that is the output of a desired program on a given state.
  • DisjunctiveExamplesSpec specifies a set of possible desired outputs for each provided input state. On a given state σ, the program output is allowed to equal any of the given possible outputs φ.DisjunctiveExamples[σ].
    • ExampleSpec is a subclass of
      DisjunctiveExamplesSpec where the size of
      φ.DisjunctiveExamples[σ] is constrained to 1. In other words, this is the simplest specification that specifies the single desired program output for each provided input state.

Strategies

The main point of program synthesis in the framework is the SynthesisEngine class. Its function is to execute different synthesis strategies for synthesizing parts of the resulting program. The synthesis process starts with an engine.LearnGrammar(φ) call to synthesize a program that starts with a start symbol of the grammar, or, more generally, engine.LearnSymbol(P, φ) call to synthesize a program that starts with a given grammar symbol P.

Given a learning task  (P,φ) to synthesize a program that starts with a symbol P and satisfies the specification φ, the engine can assign this task to any of the available synthesis strategies. A synthesis strategy represents a specific algorithm that can synthesize a program set ˜P for a particular kind of a learning task (P,φ). In other words, a synthesis strategy is parameterized by its supported specification type, takes a learning task (P,φ) where φ should be an instance of this specification type, and learns a set of programs ˜P that are consistent with φ.

A synthesis strategy is represented as a class inheriting Microsoft.ProgramSynthesis.Learning.SynthesisStrategy, which specifies the following contract:

public abstract class SynthesisStrategy<TSpec> : ISynthesisStrategy
        where TSpec : Spec
{
    void Initialize(SynthesisEngine engine);
    abstract Optional<ProgramSet> Learn(SynthesisEngine engine, LearningTask task, CancellationToken cancel);
    bool CanCall(Spec spec);
}

Here TSpec is a supported specification type, Learn is the main learning method, and CanCall is the function that determines whether this synthesis strategy supports learning for a given specification spec (in the default implementation, the result is true if and only if spec is an instance of TSpec).

Version space algebra

The return type of Learn in SynthesisStrategy is Microsoft.ProgramSynthesis.VersionSpace.ProgramSet. This is an abstract base class for our representation for the sets of programs. A version space, defined by Mitchell and Lau , is a succinct representation of a program set (hypothesis space in machine learning community), consistent with a specification. A version space can be defined explicitly (as a set of programs), or composed from smaller version spaces using standard set operators. The latter property is a key to succinctness of version spaces: representing exponential sets of programs using composition operators requires only polynomial space. Such a structure defines an algebra over primitive version spaces, hence the name.

An abstract ProgramSet class defines the following contract:

public abstract class ProgramSet : ILanguage
{
    protected ProgramSet(Symbol symbol)
    {
        Symbol = symbol;
    }

    public Symbol Symbol { get; private set; }

    public abstract IEnumerable<ProgramNode> RealizedPrograms { get; }

    public abstract ulong Size { get; }

    public virtual bool IsEmpty
    {
        get { return RealizedPrograms.IsEmpty(); }
    }

    public abstract ProgramSet Intersect(ProgramSet other);

    public Dictionary<object, ProgramSet> ClusterOnInput(State input);

    public Dictionary<object[], ProgramSet> ClusterOnInputs(IEnumerable<State> input);

    public abstract XElement ToXML();
}

The property RealizedPrograms calculates the set of programs stored in the version space.

Direct version space

Direct version space is a primitive version space that represents a set of programs explicitly, by storing a reference to IEnumerable. Since IEnumerable in .NET is lazy, storing it in a version space does not by itself enumerate it into an explicit set. This allows you to implement a synthesis strategy as an iterator in C# (with yield return statements), which calculates the required number of consistent programs on the fly, as needed. Moreover, such an iterator can even be theoretically infinite, as long as the end-user requests only a finite number of consistent programs.

Union version space

A union of k version spaces is a version space that contains those and only those programs that belong to at least one of the given k spaces. Such a version space naturally arises when we want to learn a set of programs that start with a certain grammar symbol by learning each of its possible RHS rules automatically. For example:

P := F(A, B) | G(C, D) 

Here P,A,B,C,D are grammar symbols, and F,G are black-box operators. Given a learning task (P,φ), one way to resolve it is to independently learn a set of programs ˜F of form F(?A, ?B), and a set of programs ˜G of form G(?C, ?D). If all the programs in ˜F and ˜G are consistent with φ, then &tildeF ∪ ˜G is a valid answer to the outer learning task (P,φ).

Join version space

A join version space is defined for a single operator P := F( E1,. . .,Ek). It represents a set of programs ˜P formed by a Cartesian product of parameter version spaces ˜{E1},. . .,˜{Ek}. In general, not every combination of the parameter programs from this Cartesian product is a valid sequence of parameters for the operator F, thus join version space depends on the operator logic to filter out the invalid combinations.

The table below outlines the APIs for building version spaces in the Microsoft Program Synthesis using Examples framework:

Version space Given Builder code
Empty Symbol s ProgramSet.Empty(s)
Explicit list of programs Symbol s
ProgramNode p1, …, pk
ProgramSet.List(s, p1, …, pk)
Lazy stream of programs Symbol s
IEnumerable stream
new DirectProgramSet(s, stream)
Union of version spaces for the rules Symbol s
ProgramSet v1, …, vk
new UnionProgramSet(s, v1, …, vk)
Join of version spaces for the parameters Symbol s
GrammarRule r // r.Head == s
ProgramSet v1, …, vk
new JoinProgramSet(r, v1, …, vk)

Keeping documentation up to date is a challenge.

  • If you find something that doesn’t work for you, please contact us.
  • In the meantime, check out our sample code to learn the API.

Footnotes

  1. In other words, null is used as a special value ⊥ that is typically found in a formal definition of a language.
  2. Following Haskell syntax, we start our lambda functions with the “\” character, which is supposed to approximately represent the letter λ.
  3. The common recipe that we use in our development is to reference the semantics DLL from the main project in a Visual Studio solution. This way, the semantics DLL is automatically copied to the target subdirectory on each build next to the main executable, and you can refer to it in the grammar string by simply using its filename. Alternatively, you can specify additional library paths as extra parameters to the Compile method.

Grammar syntax

Grammar syntax

This section describes the syntax of PROSE v1 DSL grammars.

Note: this syntax is volatile and subject to breaking changes in the future. We will notify you about each breaking change in accordance with SemVer.

Hereafter in the general syntax descriptions, angle brackets <> denote a placeholder to be filled with a concrete value, and curly braces {} with an opt subscript denote an optional component (unless specified otherwise).

Basic structure

The basic structure of a *.grammar file is as follows:

<Namespace usings>
<Semantics usings>
<Learner usings>

language <Name>;
<Feature declarations>

<Nonterminal symbols and rules>
<Terminal symbols>

It first specifies some metadata about the DSL, and then describes it as a grammar. A PROSE language is represented as a context-free grammari.e., as a set of rules, where each symbol on the left-hand side is bound to a set of possible expansions of this symbol on the right-hand side.

Usings

Namespace usings

These statements are identical to the corresponding C# forms. They import a namespace into the current scope.

using System.Text.RegularExpressions;

Semantics usings

These statements specify semantics holders – static classes that contain implementations of the grammar’s operators. There may be more than one semantics holder, as long as their members do not conflict with each other.

using semantics TestLanguage.Semantics.Holder;

Learner usings

These statements specify learning logic holders – non-static classes that inherit DomainLearningLogic and contain domain-specific helper learning logic such as witness functions and value generators. There may be more than one learning logic holder.

using learners TestLanguage.Learning.LogicHolder;

Language name

May be any valid C# full type identifier – that is, a dot-separated string where each element is a valid C# identifier.

Features

A feature is a computed property on an AST in the language. Each such property has a name, type, and associated feature calculator functions that compute the value of this property on each given AST. A feature may be declared as complete, which requires it to be defined on every possible AST kind in the language. By default, a feature is not complete. Syntax:

{@complete} feature <Type> <Name> = <Implementation class 1>, …, <Implementation class N>;

Here @complete is an optional completeness annotation, and the comma-separated identifiers on the right specify one or more classes that inherit Feature and provide implementations of the feature calculators. Notice that one feature may be implemented in multiple possible ways (e.g. the program’s RankingScore may be computed differently as LikelihoodScore or PerformanceScore, depending on the requirements), thus it is possible to specify multiple implementation classes for the same feature. A feature class does not have to be specified in the *.grammar file to properly interact with the framework. As long as it inherits Feature and holds the required feature calculator implementations, its instances may be used at runtime to compute the value of the corresponding feature on the ASTs. However, specifying it in the *.grammar file provides additional information to the dslc grammar compiler. The compiler can then verify your feature calculator definitions and provide more detailed error messages. Example:

using TestLanguage.ScoreFunctions;
@complete feature double RankingScore = LikelihoodScore, PerformanceScore, ReadabilityScore;

// alternatively:
@complete feature double RankingScore = TestLanguage.ScoreFunctions.LikelihoodScore,
                                        TestLanguage.ScoreFunctions.PerformanceScore,
                                        TestLanguage.ScoreFunctions.ReadabilityScore;

feature HashSet<int> UsedConstants = TestLanguage.UsedConstantsCalculator;

Terminal rules

Each terminal symbol of the grammar is associated with its own unique terminal rule. Terminal rules specify the leaf symbols that will be replaced with literal constants or variables in the AST. For example:

  • A terminal rule int k; specifies a symbol k that represents a literal integer constant.
  • A terminal rule @input string v; specifies a variable symbol v that contains program input data at runtime.

Syntax:

{@values[<Generator member>]} {@input} <Type> <Symbol name>;

Annotations

@input

Denotes the input variable passed to the DSL programs. A DSL program may depend only on a single input variable, although of an arbitrary type.

@values

A user can specify the list of possible values that a literal symbol can be set to. This is done with a @values[G] annotation, where G is a value generator – a reference to a user-defined static field, property, or method. The compiler will search for G in the provided learning logic holders, and will report an error if it does not find a type-compatible member. Example: given the following declaration of a terminal s:

using learners TestLanguage.Learners;
@values[StringGen] string s;

any of the following members in TestLanguage.Learners can serve as a generator for s:

namespace TestLanguage
{
	public class Learners : DomainLearningLogic
	{
		// Field
		public static string[] StringGen = {"", "42", "foobar"};

		// Property
		public static string[] StringGen => new[] {"", "42", "foobar"};
        public static string[] StringGen {
            get { return new[] {"", "42", "foobar"}; }
        }

		// Method
		public static string[] StringGen() => new[] {"", "42", "foobar"};
        public static string[] StringGen() {
            return new[] {"", "42", "foobar"};
        }
	}
}

Nonterminal rules

A nonterminal rule describes a possible production in a context-free grammar of the DSL. In contrast to conventional programming languages, the productions of PROSE grammars describe not the surface syntax of DSL programs, but their direct semantics as ASTs. In other words, where a conventional context-free grammar would specify something like

expression := atom '+' atom | atom '-' atom ;

the corresponding PROSE grammar specifies

expression := Plus(atom, atom) | Minus(atom, atom) ;

This snippet contains two nonterminal rules expression := Plus(atom, atom) and expression := Minus(atom, atom). The functions Plus and Minus are operators in the grammar – domain-specific functions that may be included as steps of your DSL programs. Thusly, PROSE DSLs do not have a syntax – they directly describe a grammar of possible domain-specific program actions.

Structure

Every nonterminal rule has a head and a body. Its head is a typed nonterminal symbol on the left-hand side of the production. Its body is a sequence of free symbols on the right-hand side, which may be nonterminal or terminal (i.e. variables or constants). There exist multiple different kinds of nonterminal rules, which differ in their semantics as well as in the roles of the symbols in their bodies. Syntax:

{@start} <Type> <Symbol name> := <Rule 1> | ... | <Rule N>;

Annotations

@start

An optional annotation that specifies the start symbol of the grammar – that is, the root nonterminal symbol of all programs in this DSL.

Backpropagation

Backpropagation

Deductive strategy (i.e., backpropagation) is the main synthesis algorithm used by the PROSE SDK. It relies on external annotations, provided by the DSL designer for the language operators – witness functions. Our tutorial and samples show many use cases for specific witness functions.

Witness Functions

A witness function is a domain-specific deductive procedure for a parameter k of an operator F, that, given an outer spec ϕ on F, answer a question: “What should a program used as this parameter satisfy in order for the entire F expression to satisfy ϕ?” In other words, witness functions backpropagate specifications from expressions to their subexpressions.

There are two kinds of witness functions: non-conditional and conditional.

Non-conditional

Non-conditional witness functions have the following signature:

[WitnessFunction("OperatorName", paramIndex)]
Spec Witness(GrammarRule rule, Spec outerSpec);

Since PROSE uses .NET reflection to extract information about witness functions, you should make the actual types in your signature as precise as possible.
In particular, the outerSpec parameter specifies what kind of spec for F your witness function can handle.
Typically a different witness function is written for each spec kind: it takes a different requirement to satisfy a PrefixSpec that an ExampleSpec.

A witness function may produce an overapproximation to the required spec instead of a necessary and sufficient spec for the parameter k.
PROSE can still use such a witness function, but it should be marked with Verify = true in its [WitnessFunction] metadata attribute.

If outerSpec is inconsistent (no program can possibly satisfy it), witness function should return null.
null in PROSE is a placeholder for an “always false” spec.
(An “always true” spec is called TopSpec).

Conditional

Conditional witness functions depend not only on an outer spec on their operator, but also possibly on some other parameters of that operator.
They have the following signature:

[WitnessFunction("OperatorName", paramIndex, DependsOnParameters = new[] { prereqParam1, prereqParam2, ... }]
Spec Witness(GrammarRule rule, Spec outerSpec, Spec prereqSpec1, Spec prereqSpec2, ...);

As with non-conditional witness functions, prerequisite specs in the signature should be as precise as possible.
Typically they will be ExampleSpecs: deductive reasoning is easiest when you know precisely some fixed value of a prerequisite on the same input state.

You can use DependsOnSymbols = new[] { prereqName1, prereqName2, ... } in the attribute, referring to parameter names instead of their indices (if they are unambiguous).

ID Annotations

If a target grammar rule does not have a name (for instance, it is a let rule or a conversion rule A := B), you can use an @id annotation in the grammar file to give it one, and then use this name as a reference in [WitnessFunction] attributes.

string expr := @id['LetSubstring'] let x = ChooseInput(inputs, k) in SubStr(x, posPair);

[WitnessFunction("LetSubstring", 0)]
Spec Witness(LetRule rule, Spec outerSpec);


In case of a let rule, it has two parameters: its “binding” expression (the part on the right-hand side of an equal sign) and its “body” expression (the part after in).
PROSE provides an automatic witness function for the body parameter, so you only to write one for the binding parameter (whose index in the containing let rule is o).

Rule Learners

Rule learners are designed for use cases when you cannot express you deductive logic in terms of witness functions on individual parameters.
They are mini-strategies: search algorithms for one grammar rule.

Note: Usage of rule learners is generally discouraged: if you can describe deductive reasoning as a witness function, PROSE framework can do a more aggressive optimization of its search process.

A rule learner has the following signature:

[RuleLearner("OperatorName")]
Optional<ProgramSet> Learn(SynthesisEngine engine, GrammarRule rule, LearningTask<Spec> task, CancellationToken token);

You can make recursive calls to engine.LearnSymbol in your rule learner to solve deductive subproblems.
The final result should be constructed as a Optional out of such subproblem results.
The learning task may have a more precise Spec type, for example DisjunctiveExamplesSpec or ExampleSpec.

Note: it is a good .NET practice to check on the given CancellationToken regularly and throw a TaskCancelledException when you detect a cancellation request.

Standard concepts

Standard concepts

Standard concepts are built-in PROSE operators. They have predefined semantics and, most of the time, witness functions for backpropagation. Thus, you can use standard concepts arbitrarily in your own DSLs without reimplementing them or designing your own synthesis procedures for them.

Concept rules

The grammar syntax for a simple concept rule looks as follows:

Type P := CustomOperatorName(E1, ..., Ek) = ConceptName(E1, ..., Ek);

Here ConceptName is the name of the concept (e.g. Pair), and CustomOperatorName is a DSL-specific name for this particular usage of the concept (e.g. RegexPair).
The parameter symbols E1, …, Ek are passed directly to the standard concept.

Example: in the SubstringExtraction language in our tutorial, the standard concept Pair is referenced as follows:

Tuple<Regex, Regex> positionBoundaries := RegexPair(r, r) = Pair(r, r);
Regex r;

The two r parameters denote the regexes to the left and to the right of a given position boundary. They are united into a tuple with a standard concept Pair.

Lambda functions

More complex concept rules may include a lambda function on their right-hand side.
For instance, a list-processing operator Filter takes as input a predicate of type Func and a sequence of type IEnumerable, and returns the filtered subsequence of elements that satisfy the predicate.
Here is a complete example of referencing Filter in a DSL:

#reference 'file:TestLanguage.dll';
using semantics TestLanguage.Filtering;
using learners TestLanguage.Filtering.Learners;

language Filtering;

@input string[] v;
Regex r;
bool f := Match(x, r) | True();
@start string[] P := Selected(f, v) = Filter(\x: string => f, v);

Here the custom operator Selected(f, v) is implemented as a concept rule Filter.
The first parameter of Filter is a lambda \x: string => f, and the second one is v.
Notice that you can either use a formal parameter directly in a concept, or pass it down into a lambda body.

List of concepts

Concept Semantics Specs handled by PROSE Witness functions needed?
Pair(x: T, y: T): Tuple Combine x and y in a tuple DisjunctiveExamplesSpec
Map(f: Func, seq: IEnumerable): IEnumerable Apply f to each element of seq, and return a sequence of results PrefixSpec seq
Filter(f: Func, seq: IEnumerable): IEnumerable Return only those elements of seq that satisfy the predicate f PrefixSpec, SubsequenceSpec, ExampleSpec
FilterNot(f: Func, seq: IEnumerable): IEnumerable Return only those elements of seq that do not satisfy f PrefixSpec, SubsequenceSpec,
Kth(seq: IEnumerable, k: int): T Return an element of seq at the specified index, from the left if k >= 0 or from the right if k < 0 DisjunctiveExamplesSpec
TakeWhile(f: Func, seq: IEnumerable): IEnumerable Return the longest prefix of seq where all the elements satisfy f SubsequenceSpec
FilterInt(initIter: Tuple, seq: IEnumerable): IEnumerable Return a subsequence of seq defined by the arithmetic progression starting at the index initIter.Item1 with the step initIter.Item2 PrefixSpec, SubsequenceSpec
First(f: Func, seq: IEnumerable): T Return the first element of seq that satisfies f ExampleSpec

Interactivity

Interactivity

Producing a program using a program synthesis system involves a series of interactions between the user and the system. These interactions take the general form of the user providing information about the task followed by reviewing the synthesized program to determine what, if any, additional information they need to provide to accomplish their intended goal. More concretely, a user might provide an example for one input and manually inspect the output of the synthesized program on other inputs, looking for an input with an incorrect output to give a second example on. By seeking to capture that entire process instead of just the step where a program is learned from examples, PROSE’s Session API is a useful model for real scenarios.

PROSE’s Session provides a stateful API for program synthesis to support interactive workflows. A Session represents a user’s efforts to synthesize a program for a single task. To begin a task, a new Session object is constructed and maintained until the user is satisfied with the final synthesized program (and possibly serialized for future refinements to that program). In addition to keeping track of the inputs and constraints to be fed to the synthesizer, the Session keeps track of programs which have been learned and provides APIs for helping the user select new inputs and outputs.

Each DSL exposes a Session subclass as the entrypoint to its learning API (e.g. Transformation.Text.Session). To implement a Session for your own DSL, extend Wrangling.NonInteractiveSession (or the base class Wrangling.Session if you want more control).

using Microsoft.ProgramSynthesis.Transformation.Text; // construct a session

var session = new Session();

Inputs

The collection of all known inputs the program is expected to be run on should be provided using .Inputs.Add(). If that set is large, then providing all of them may not be worthwhile (as the algorithms will only have time to consider a subset anyway). If selecting a subset of inputs to provide, they should be representative of the inputs the program will be run on. The inputs provided can accessed using .Inputs and removed using .RemoveInputs() or .RemoveAllInputs().

The inputs are used when learning and ranking programs (unless .UseInputsInLearn is set to false), as well as for suggesting inputs that more information is likely needed for.

// provide some inputs
session.Inputs.Add(new InputRow("Greta", "Hermansson"),
    new InputRow("Kettil", "Hansson"),
    new InputRow("Etelka", "bala"));

Constraints

Constraints are any information that describe the program to be synthesized. The most common kind of constraint is an example, but DSLs may support many kinds of constraints including negative examples, types for the output, programs the synthesized program should similar to, or any constraint the author of the synthesizer wishes to define.

The base type for constraints is Constraint where TInput and TOutput are the input and output types of programs in the DSL being synthesized. For example, for Transformation.Text, the type of constraints is Constraint.

In order to provide constraints to a Session, use .Constraints.Add(). The constraints provided can accessed using .Constraints and removed using .RemoveConstraints() or .RemoveAllConstraints().

// give an example session.Constraints.Add(new Example(new InputRow("Greta", "Hermansson"), "Hermansson, G."))

Synthesizing programs

Once a Session has some inputs and constraints, a program can be synthesized. Programs are generated using the various .Learn*() methods, which use the information in .Inputs and .Constraints to learn a program. They, like all Session methods that do any significant amount of computation, have Async variants which do the computation on a separate thread to make it easier to attach a Session to a GUI.

  • Learn()/[LearnAsync() returns the single top-ranked program as a Program.
  • LearnTopK()/[LearnTopKAsync() takes an integer k and returns the top-k ranked programs as an IReadOnlyList.
  • LearnAll()/[LearnAllAsync() learns all programs consistent with the examples, giving the result compactly as a ProgramSet (wrapped in an IProgramSetWrapper).
var program = session.Learn();

Explanations

In order to decide if the synthesized program is satisfactory, the user has to comprehend what has been learned. As we assume that, in general, the user is not a programmer, simply showing the code to the user is a poor way to explain the what the program is doing. Even experienced programmers can have difficulty reading programs, especially ones in DSLs designed to be easy for a computer to synthesize programs in as opposed to being easy for a human to read.

Running the program

The most straightforward way to explain the program is to run it. To run a Program, use its Run() method:

foreach(var input in session.Inputs)
{
   Console.Out.WriteLine(program.Run(input));
}
Input1 Input2 Program output
Greta Hermansson Hermansson, G.
Kettil Hansson Hansson, K.
Etelka bala bala, E.

DSL-specific

Other explanations might be DSL-specific. For instance, Transformation.Text offers a feature called “output provenance” which pairs up substrings of the output with the substrings in the input they were selected from:

foreach(var input in session.Inputs)
{ 
    Console.Out.WriteLine(program.ComputeOutputProvenance(input));
}

Shown with italic and bold substrings corresponding to each other between the input and the output:

Input1 Input2 Program output
Greta Hermansson Hermansson, G.
Kettil Hansson Hansson, K.
Etelka bala bala, E.

An interactive variant of this could allow the user to select where in the input a specific part of the output should come from, although the current implementation of Transformation.Text does not support such a constraint.

Significant Inputs

While explanations help the user understand how the program works on inputs they are looking at, if the input set is large, it is likely some problems occur only on inputs the user is not looking at. .GetSignificantInputClustersAsync() can suggest inputs that the user should take a look at. The default algorithm works for any DSL that supports learning multiple programs: it looks for inputs where the top program disagrees with other highly ranked programs. The return value is a set of clusters instead of single inputs because sets of inputs the algorithm cannot distinguish are returned together, so, for example, the UI could give preference to inputs that are currently on screen.

When presenting a significant input to the user, possible alternative outputs can be suggested using
.ComputeTopKOutputsAsync():

foreach(SignificantInput<IRow> sigInput in await session.GetSignificantInputsAsync())
{
    Console.Out.WriteLine("Input[Confidence=" + sigInput.Confidence + "]: " + sigInput.Input);
    foreach(object output in await session.ComputeTopKOutputsAsync(sigInput.Input, 5))
    {
        Console.Out.WriteLine("Possible output: " + output);
    }
}

If the significant inputs algorithm returns nothing at all, that indicates an assertion that the user has given sufficiently specific constraints to define the program to synthesize (modulo the inputs provided), which should give the user confidence that the synthesized program is correct.

Conclusion

PROSE’s Session API provides a mechanism for supporting an interactive synthesis task. After loading in the data to work with, the user can switch between providing constraints, generating programs interacting with their explanations, and requesting pointers to significant inputs. This rich vocabulary allows a user to interact with the program synthesis in a way such that they can have confidence that the program they generate will generalize as desired.

FAQ

Frequently asked questions

General questions

How do I use the PROSE SDK?

The SDK is used through .NET APIs. See the tutorial for how to get started and the documentation links on the sidebar for more details.

How do I install it?

Install the Microsoft.ProgramSynthesis and (optionally) Microsoft.ProgramSynthesis.Compiler NuGet packages in Visual Studio.

You can also generate a template DSL project by running

npm install -g yo
npm install -g generator-prose
yo prose

Is it cross-platform?

Yes, PROSE SDK is supported on Windows, Linux, and macOS. Currently, the package is guaranteed to work on .NET 4.5+ and .NET Core.

Where can I use it?

The SDK is released under a non-commercial license for use in research, education, and non-commercial applications. See the license for details.  Please note: The samples for using the PROSE SDK are available under the MIT license (as you can see in the github repo for the samples), but the license for the SDK itself is restricted to non-commercial use only.

Where can I find sample code?

Our samples are located in the PROSE GitHub repository.

How can I contact you if I have any questions or feedback?

If you run into any bugs or issues, please open an issue in our GitHub repository. Feel free also to email us.

Visual Studio Code on Linux

How do I restore NuGet packages for a PROSE solution (a sample or a yo-generated template)?

sudo apt install mono-complete nuget
sudo nuget update -self
nuget restore YourSolution.sln

How do I build a solution in VS Code?

Press Ctrl+Shift+P and configure a task runner. Pick the msbuild task. In the generated tasks.json, replace msbuild with xbuild.

When I try to run xbuild on my yo-generated solution, it fails with an error about @(AssemblyPaths -> Replace('/', '/')).

As it turns out, xbuild does not support the entirety of the msbuild language. This means that you won’t be able to recompile your grammar automatically on each build out of the box. We are working on fixing this. In the meantime, please regenerate your solution with yo prose but answer “No” to the last question.

How do I launch a program or debug it in VS Code?

Install the mono-debug extension for VS Code and follow the instructions on its webpage.