From Dyna
Revision as of 04:28, 30 April 2006 by Jason (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

This page is for working out some sample programs that we'd like to run in the new version. Please illustrate how to handle the new features in the concrete syntax, and how to get at them from the C++ side. You may wish to borrow from Examples as well as from your own code in Dyna 0.3.

Comments on problems that these programs present to the compiler, and solutions, are also welcome.

At the bottom of this page are listed some topics that we should try to cover in these examples.


Matrix multiplication

Here's our inference rule:

c(I,K) += a(I,J) * b(J,K).
Should also show how to explicitly request forward chaining, backward chaining, memoization, etc. This is probably done by declarations that apply to the consequent c.

We also need at least one declaration, which would typically be written above the rule. The version below says that the axiom type, which is an automatically declared union over [a/2,b/2], has values that are doubles and default to 0.

:- double axiom := 0.   
Possibly a better way of writing this default is to say
axiom X := 0.    % or more idiomatically,  axiom _ := 0.

which assigns 0 to all items X of type axiom. This non-ground assignment can be partially overridden by assignments such as b(3,4) := 5.

The following declarations are optional, given the declaration of axiom above, but they may improve efficiency.

:- double a(int X, int Y).
:- double b(int X, int Y).
:- double c(int X, int Y).
We should also show optional storage declarations that get various efficient versions of dense and sparse arrays.

Here are some alternative ways that one is allowed to write the double a(int X, int Y) declaration:

:- double a(int X, int Y).
:- _ a(_ X, _ Y).            % Using _ for a type means to inherit it (double) or infer it (int).
:- double a(int _, int _).   % Using _ for a var name means to generate a one-time var.
:- _ a(_ _, _ _).            % We can do both.
:- _ a(int, Y).              % _ can be omitted where syntactically valid.
:- _ a(_,_).                 % This is as much as we can syntactically omit ...
:- _ a/2.                    % ... unless we do this!

Neural networks

We begin with some optional declarations. Let's suppose we want to name nodes by arbitrary terms (not necessarily ints or strings). For readability and type-safety, however, we want to introduce an explicit node type. So let's require the terms to be of the form node(Name) where Name is a term:

:- void node(term Name) storedas Name.

The void return type means that nodes don't themselves have values. The optional storedas declaration ensures that there is no storage overhead for the node(...) wrapper. Roughly speaking, any term may be stored as any other term.

Now let's optionally declare some item types with node arguments. (Remark: A node term itself is not regarded as an item, since it is neither an axiom or a theorem.)

% axiom declarations
:- double input(node) := 0.       % how much external input to this node?
:- double weight(From, To) := 0.  % arguments are inferred to be nodes.
:- double target(node).           % is there an external target for this node?
% theorem declarations
:- double in(node).
:- double out(node).

Note that we could have given storage declarations for these, too. For example,

double in(node X) storedas X.
double weight(From, To) storedas (To,From).
Should probably spin out this example further. Maybe input nodes are a subtype of node, and only they reserve space to store input.

No default value was specified for the target axiom, so it defaults to undef. Asking for the value of an undef item in C++ throws an exception, and such an item cannot be used to help derive other items.

Here are the inference rules. Recurrent networks (involving cycles in the weight graph) are okay without any special handling.

in(To) += input(To).
in(To) += out(From) * weight(From,To).   % the compiler would warn if you accidentally wrote weight(To,From)

out(Node) = sigmoid(in(Node)).   

error += (out(Node)-target(Node))**2.   

While we had to specify default values for axioms (or allow them to be undef, which would mostly work fine in this example), we cannot specify them for theorems. Theorems' values are always defined by rule, including in the default case:

  • error is defined to be a sum over expressions that default to undef (since target(Node) does). Thus, if there are no non-default summands, the sum is 0 (the identity for +).
  • in(To) is defined to be a sum over infinitely many products that default to 0. Thus, if there are no non-default summands, the sum is again 0 (the sum of infinitely many zeroes).
  • out(Node) is defined to be sigmoid(in(Node)), so it is sigmoid(0) (i.e., 0.5) if in(Node) is still 0. (If in(Node) were undef for some reason, then out(Node) would be too, since its aggregator is =.)

A training algorithm would try to minimize error. This declaration says that it is the weight nodes that are trainable, so that the compiler can generate code to help train them.

:- trainable(weight).    

This declaration adds weight/2 to the union type trainable, which is a possibly empty subtype of axiom.

It might be better if we could find a way to do without the trainable declaration, instead allowing any subtype of axiom to be trained at runtime. Perhaps the trainable declaration is simply an optional pragma suggesting that the compiler should generate efficient support for training weight.

We also need to define sigmoid, which appeared above. Let's say that the value of a sigmoid/1 term is a double computed by a foreign function. For example, the value of sigmoid(0.5*0.5) is computed by looking up the value of the term 0.5*0.5 (namely 0.25) and calling the foreign function on the result.

:- foreign double sigmoid(double =X).   

The type double = is the type of expressions X with double value -- in other words, =X is a double.

We still need a way to specify the C++ name of the foreign function, as well as any algebraic properties, such as the fact that it distributes over max or max=. The foreign function can't simply be called sigmoid, because (barring renaming by a cname declaration) the name sigmoid in C++ will already be a constructor function used to construct a C++ term object, of type sigmoid_t, such as sigmoid(times(0.5,0.5)).

It would be consistent for Dyna to pass the expression X to the foreign function. However, as special handling for declared = arguments to foreign functions, we instead pass the value =X. Then the foreign function is only allowed to depend on =X, not X itself, and does not need access to the chart, which foreign functions probably should never have.

This special handling does make a problem for a foreign function like derivative, which really does want to accept a double = expression and not its value:
:- foreign double =derivative(double =Formula).

Here however is a possible workaround using a typedef so that the argument isn't syntactically a double =:

:- double_expr ~ double =.
:- foreign double_expr derivative(double_expr).

Another, harder-to-use workaround uses a wrapper, so that the double = is not a direct argument of derivative but only an argument to an argument:

:- void formula(double =).
:- foreign formula derivative(formula).
Another issue: Probably we want to regard 0.5*0.5 and likewise sigmoid(0.5*0.5) as expressions, not as foreign axioms from an infinite set. However, if they didn't evaluate all of their arguments, would we then regard them as foreign axioms and hence as items?

Permutation Neighborhood

The following program computes the best permutation in an exponential-size neighborhood of an existing permutation using a finite-state acceptor and two other cost functions.

Note that this requires automatic evaluation of subterms such as I+1. Also you're mixing aggregation operators. Anything else new? Jason 04:22, 28 Apr 2006 (EDT)
% Axioms:
% initial (State),
% final (State),
% arc (Source, Target, Input),
% b (Left, Right),
% c (Left, Middle, Right).
% permute (From, To),

goal min=
 path (0, =length, S, T)
 + initial (S)
 + final (T).

length max=
 whenever permute (From, To).

path (I, I + 1, S, T) min=
 arc (S, T, W)
 whenever permute (W, I + 1).

path (I, K, S, T) min=
 path (I, J, S, M)
 + path (J, K, M, T)
 + combine (I, J, K, false).

path (I, K, S, T) min=
 path (J, K, S, M)
 + path (I, J, M, T)
 + combine (I, J, K, true).

combine (I, I, K, Swap) = 0.
combine (I, K, K, Swap) = 0.

combine (I, J, K, Swap) =
 combine (I, J, K - 1, Swap)
 + combine (I + 1, J, K, Swap)
 - combine (I + 1, J, K - 1, Swap)
 + before (I + 1, K, Swap)
 - middle (I + 1, K, Swap)
 + outside (I + 1, K, Swap).

middle (I, K, Swap) +=
 cyclic (I, M, K, Swap)
 if I < M & M < K.

outside (I, K, Swap) +=
 cyclic (I, K, O, Swap)
 if O < I | K < O.

before (I, K, false) =
 before (I, K).

before (I, K, true) =
 before (K, I).

before (I, J) =
 b (L, R)
 if permute (L, I)
 & permute (R, J).

cyclic (I, J, K, false) =
 cyclic (I, J, K).

cyclic (I, J, K, true) =
 cyclic (J, I, K).

cyclic (I, J, K) =
 c (L, M, R)
 if permute (L, I)
 & permute (M, J)
 & permute (R, K).

Ideally, we could also compute the new permutation implied by the best path. I'm not sure how to get that within Dyna.

Well, hyperedges need to show up as items so you can follow the backpointers. Maybe proof trees (in this case paths) should show up as items too. How would you like to write it in Dyna? Jason 04:22, 28 Apr 2006 (EDT)

Also, b and c should be derived from features of the identity permutation, and the weights of those features should be trainable.

This version doesn't properly handle Failed to parse (Missing texvc executable; please see math/README to configure.): \epsilon -transitions between states in the acceptor.

CKY with trainable latent annotations

Note the mixed =, + and * with backward for computing dot products. Two theorems are built for pruning half() constitents: one maximizes over the equivalence class, and another counts members of this class.

I don't see any of this in the example; did something get lost? Jason 04:02, 28 Apr 2006 (EDT)
:- type smallInt = 0..15. % can be stored more efficiently?                                                                                                                                                      
double constit(string label, smallInt annotation, unsigned, unsigned).                                                                                                                                           
boolean allowed(...).                                                                                                                                                                                            
:- sum(rewrites(X, Alpha, ...)) = log(1.0).                                                                                                                                                                      
:- sum(annotateStart(...)) = log(1.0).  % since annotateStart has always only 1 arg it would be equivalent with  :- sum(annotateStart(_)) = log(1.0).                                                             
:- union(nonpersistent, [word, constit, length, allowed]).
:- retractable(nonpersistent).
:- trainable([rewrites, annotateStart]).

constit(X, Alpha, from=I, to=J) log+= rewrites(X, Alpha, Word) whenever allowed(X, Word, I, J).
constit(X, Alpha, I, K) log+= rewrites(X, Alpha, Y, Beta, Z, Gamma) + constit(Y, Beta, I, J) + constit(Z, Gamma, J, K) 
   whenever allowed(X, Y, Z, from=I, via=J, to=K). 
goal log+= annotateStart(Alpha) + constit("START", Alpha, 0, length).


  • No declaration:
    :- item(item,double,-inf)
  • a (memory-efficent) type can be defined (smallInt),
  • dots notation,
  • nicer sum constraints for dynamite,
  • key-value pairs in argument lists,
  • different structure declarations,
  • an item can have boolean (see "allowed") or unsigned (see "length") value,
  • arguments are evaluated by default (see "length" in the last line)
The sum constraints might be nicer and more explicit as
   lhssum(X,Alpha) log+= rewrites(X, Alpha, ...).
   :- constrain(lhssum(...) == log(1.0)).

If you want to use an anonymous sum item instead of introducing lhssum, you could use the notation that I've suggested before, where the aggregator is used as a prefix operator:

   :- constrain((log+= rewrites(X, Alpha, ...)) == log(1.0)).n

Jason 04:02, 28 Apr 2006 (EDT)

We're already anticipating key-value pairs in argument lists (see the section on keywords at Design/StorageSpec), but using the syntax
  constit(X, Alpha, From~I, To~J)

following the convention that ~ and ~~ are about assigning and comparing quoted terms whereas = and == assign and compare their values. How strongly do you want = rather than ~ in the case of keyword args? Jason 04:02, 28 Apr 2006 (EDT)

Your intended meaning for ... seems to be that there can be any number of remaining arguments and they can have any value (similar to _). Questions/thoughts:
  • If some of the omitted arguments were declared with defaults, does ... in the pattern mean that they are constrained to have their default values, or are unconstrained as with _?
  • If the functor is overloaded with multiple arities, presumably we get a pattern that can match terms of different arities.
  • Presumably ... should also be usable with keyword arguments.
  • I note that David used coverage(..., Sextent) below, which seems symmetric enough -- you can bind the first few and/or the last few items positionally. Ditto for mixed positional and keyword args.
  • Your annotateStart(...) seems equivalent to the type-restricted free variable annotateStart _, though perhaps your version is better style.
Jason 04:02, 28 Apr 2006 (EDT)

Syntactic MT decoding

In addition to backward chaining for computing feature/weight dot products, I assume you can assign to the priority of an item in an inference rule, as well as refer to "value" (perhaps a reserved word?).

% a quasi-synchronous grammar translation decoder

:- item(item, double, -inf).
:- pragma(keep_all_antecedents(item)).

:- union(nonpersistent,[ tword,dep,endS,endT,link ]).
:- retractable(nonpersistent).

:- union(lmstate, [ s, nil ] ).
:- union(linkState, [ rangeS, nil ] ).

:- structure(coverage(int twords, int skids, int halfKids, rangeS sextent)).
:- foreign(term merge_coverage(term, term)).
:- foreign(term merge_lm(int, term, term)).

:- structure(s(string first, term rest)).

:- foreign(nodep).
:- foreign(nodom).
:- foreign(neq).
:- foreign(getspan).

:- union(backs, [ score_word, score_hook, extend, complete, cap_tree ]).
:- pragma(backword_chain(backs)).
The above syntax is all old-fashioned. I assume it's ok to update it. Jason 03:16, 28 Apr 2006 (EDT)
% Calculate transitive tree relations
sibling(A, B) max= dep(A, H) + dep(B, H) + neq(A, B).

desc(A, B) max= dep(A, M) + desc(M, B) + getspan(A,B).
desc(A, B) max= dep(A, B) + getspan(A,B).

grandchild(A, B) max= dep(A, M) + dep(M, B).

ccommand(A, B) max= (dep(A, C) + desc(B, C) whenever alignChild("c-commandee")) + nodom(A, B).
% end tree relations
Shouldn't the above be booleans written with |= and &? Jason 03:16, 28 Apr 2006 (EDT)
goal max= root(Head,Lcov,Rcov,Llm,Rlm) + cap_tree(Head,Lcov,Rcov,Llm,Rlm).

root(nonterm(Head,Link,Hword),Lcov,Rcov,Llm,Rlm) max=
  capped(LEFT, nonterm(Head,Link,Hword),Lcov,Llm) + rootchild(Head)
  + capped(RIGHT, nonterm(Head,Link,Hword),Rcov,Rlm).

half(RIGHT,nonterm(Head,rangeS(U,V),W),coverage(1,0,0,rangeS(U,V)),"noadj",nil,s(W,nil)) max=
  whenever trans(LinkWord, W) whenever wchild(Head,W))
  + score_word(Head,W,LinkTag,LinkWord).

% compute the dot product of feature and weight vectors
% note = instead of max=
score_word(Head,W,LinkTag,LinkWord) =
  trans(LinkWord, W) * param("trans")
  + btrans(LinkWord, W) * param("btrans")
  + ctrans(LinkTag, Head) * param("ctrans")
  + bctrans(Head, LinkTag) * param("bctrans")
  + wchild(Head, W) * param("dmv")
  + wtag(W, Head) * param("wtag").
I bet the dot-product rule above can be made prettier somehow, maybe by breaking it into a bunch of += rules that can then be collapsed down into fewer rules. Jason 03:16, 28 Apr 2006 (EDT)
capped(Dir,nonterm(Head,Link,Hword),Hcov,Hlm) max= half(Dir,nonterm(Head,Link,Hword),Hcov,Adj,Alink,Hlm) + stop(Dir,Head,Adj).

half(LEFT,Head,coverage(0,0,0,rangeS(0,0)),"noadj",nil,nil) = 0 whenever rdone(Head). % only instantiate when right is done

rdone(Head) max= capped(RIGHT,Head,Hcov,Hlm).

hook(Dir,Head,Child,Hcov,Hlm) max= (half(Dir,Head,Hcov,Adj,Alink,Hlm)
				    whenever half(RIGHT,Child,coverage(1,0,0,Cloc),"noadj",nil,Clm))
  + score_hook(Dir,Head,Child,Adj,Alink,Hcov,Hlm)
  priority := -inf if max_left_dep > Hcov.twords + 1. % We'd like to put in some dependency length bounds here.
The idiom priority := -inf if might deserve a shortcut name like prune if. Not sure, though. Jason 03:16, 28 Apr 2006 (EDT)
The use of Hcov.twords here is trying to extract a keyword argument from a term of type coverage. I think this is a good accessor syntax, and follows naturally from our use of keyword constructors. We were already planning to allow keyword accessors of this kind on the C++ side; clearly they should be available in Dyna too.
Field names are now supposed to be capitalized, since they are repeatable variables in the type declaration. Hence it should be Hcov.Twords.
(Then I guess other C++ accessors like .functor, .argc and .arg(3) -- though perhaps not .argc(I) -- should also be allowed in Dyna. Could be useful for analyzing subterms. In fact, maybe =X and =foo are just sugar for X.value and foo.value.)
(And I guess if Hcov were a primitive object (a.k.a. foreign datatype), we could use this syntax to call its methods: e.g., Hcov.accessor(2) instead of accessor(Hcov,2,ReturnedValue).)
Presumably Hcov.Twords is only legal if the Hcov variable can be typed at compile time to something like coverage that is guaranteed to have a Twords accessor, thanks to the declaration of half or an explicit type restrictor in the rule. The use of the Twords accessor could be a clue to type inference, allowing us to infer a declaration for half; we could even infer a type restriction on this specific rule but that seems dangerous. Jason 03:16, 28 Apr 2006 (EDT)
score_hook(Dir,nonterm(Htag,Hlink,Hword),nonterm(Ctag,Clink,Cword),Adj,Alink,Hcov,Hlm) =
  nonstop(Dir, Htag, Adj) * param("dmv")
  + dchild(Dir, Htah, Adj, Ctag) * param("dmv")
  + slm(Dir, Cword, Hword) * param("slm")
  + cprod(Dir, =tag(Hlink), =tag(Clink), Htag, Ctag) * param("cprod").

% In the line above, we want the source tag at a given source position.
tag(Loc) = Tag if tword(Tag,Word,Loc).

% see the pruning clauses tacked onto half(), which we should replicate here
trap(RIGHT,Head,Child,merge_coverage(Hcov,Ccov),merge_lm(RIGHT,Hlm,Clm)) max=
  hook(RIGHT,Head,Child,Hcov,Hlm) + capped(LEFT,Child,Ccov,Clm)
  + extend(RIGHT,Head,Child,Hcov,Ccov,Hlm,Clm)
  priority := -inf if foo.
Surely rather than "replicating" the pruning clause foo, we should be able to make foo be a backward-chained item? It would presumably have to take arguments, but we might be able to get away with just a This argument as below.
trap(LEFT,Head,Child,merge_coverage(Hcov,Ccov),merge_lm(LEFT,Hlm,Clm)) max=
  hook(LEFT,Head,Child,Hcov,Hlm) + capped(RIGHT,Child,Ccov,Clm)
  + extend(LEFT,Head,Child,Hcov,Ccov,Hlm,Clm)
  priority := -inf if foo.

extend(Dir,Head,Child,Hcov,Ccov,Hlm,Clm) =

% old pruning rules
% trap(Dir,Head,Child,Cov,Lm) max= ptrap(Dir,Head,Child,Cov,Lm) + prune_trap(Dir,Head,Child,Cov,Lm).
% half(Dir,Head,Cov,Adj,Alink,Lm) max= phalf(Dir,Head,Cov,Adj,Alink,Lm) + prune_half(Dir,Head,Cov,Adj,Alink,Lm).

% we don't want to destructure the whole coverage
best_half(Dir, Sextent) max= half(Dir,Head,coverage(...,Sextent),Chead,Clink,Lm).
count_half(Dir, Sextent) += 1 whenever half(Dir,Head,coverage(...,Sextent),Chead,Clink,Lm).

half(Dir,Head,merge_coverage(Hcov,Ccov),Chead,Clink,merge_lm(Dir,Hlm,Clm)) max= trap(Dir,Head,nonterm(Chead,Clink,Cword),Hcov,Hlm)
  + capped(Dir,nonterm(Chead,Clink,Cword),Ccov,Clm)
  + complete(Dir,Head,nonterm(Chead,Clink,Cword),Hcov,Ccov,Hlm,Clm)
  priority := -inf if (value + hparam("beam_width") < best_half(Dir, merge_coverage(Hcov, Ccov).sextent)
		       or count_half(Dir, merge_coverage(Hcov, Ccov).sextent) > hparam("max_bin")).
The expression merge_coverage(Hcov, Ccov).Sextent is not extracting the Sextent field from the constructed term of type merge_coverage itself (though that could also be useful), but rather from that term's (foreignly computed) value, which is another term. So I think this should actually be written as (=merge_coverage(Hcov, Ccov)).Sextent, where the extra parens are probably optional by precedence rules. Jason 03:16, 28 Apr 2006 (EDT)
complete(Dir,Head,Child,Hcov,Ccov,Hlm,Clm) =

top_trans(S) max= trans(S,T).
tag_trans(STag) max= ctrans(STag,TTag).
In the "subordinate clause"
  priority := -inf if value + hparam("beam_width") < best_half(Dir, etc. etc.

how about something like

  with priority(This) := -inf if This + beam_width < best_half(Dir, etc. etc.

where this is a reserved word referring to the consequent of the main clause. Or perhaps value was supposed to be the value of the antecedent expression, not the accumulated value of the consequent, in which case maybe the game is

  with priority(Consequent) := -inf if Antecedent + beam_width < best_half(Dir, etc. etc.

But maybe you are not trying to change the priority of the consequent, only this particular update to the consequent. Hence perhaps

  with priority(This) = -inf if Antecedent + beam_width < best_half(Dir, etc. etc.

In any case, your second rule is "subordinate" because it implicitly refers to the hyperedge item This created by the first rule. This has keyword arguments, allowing us to get at This.Dir, This.Consequent, This.Antecedent, etc., where the This. part is omittable by convention, just as in C++. Maybe that hyperedge item should be more explicit in Dyna? We may want it to be anyway, for making the parse forest beautiful and efficient, and for definining queries over its hyperedges.

The subordinate clauses let you economically write multiple inference rules (one defining Consequent, the other defining priority(Consequent)) that are triggered by the same set of antecedents. This is not quite a case of a multi-headed rule because the bodies are also different (the second body is -inf if ...). But there is some notion of keeping the local variables from a previous definition "in scope." Jason 03:16, 28 Apr 2006 (EDT)



Here are some topics that we want to illuminate above using both real and artificial examples:

  • declarations
    • stick with :- or something better?
    • stick with storedas or something better?
    • when to write "pragma" explicitly?
    • inheritance of declarations
    • type-specific declarations: chaining and memoization, different kinds of transience, transparency, retractability, incremental vs. replacement updates, determinism, parallelization ...
    • rule-specific declarations, especially declaring transformations and prunability
    • program-specific declarations: multi-charts, declaring new operators ...
  • types
    • default values
    • type inference (new version)
    • meaning of type restrictors
    • const (chart-independent) side conditions to use as guards in type defs
  • axioms
    • axiom accumulation operators like max= as well as := (maybe also max:= to control what assignment does but still allow retractability)
  • inference rules
    • "if" vs. "needed only if" ("especially if"?) (should be property of consequent rather than specific rule?)
    • rule-specific declarations
    • evaluated subterms
    • syntax like (+= foo(X,Y)) for an anonymous sum inside a larger expression
  • inference order
    • prioritization
    • pruning
    • backward chaining; mixing forward and backward chaining
    • funky agenda disciplines, including CKY order or block-processing of related items
    • convergence declarations
    • memoization and flush policy
      • memoizing foreign functions; interaction with disk
    • tape
    • propagating different kinds of updates for general formulas
  • hyperedges
    • efficiently storing them
    • specifying which ones to store (all? none? 1-best?) and reconstructing them by backward chaining
    • finding entire antecedent proof trees; efficiently constructing k-best trees on demand
    • finding consequents as well as antecedents
    • controlling what they look like (in case of program transformation)
    • C++ accessors
  • queries over chart
    • non-ground items in chart (default value is the simplest example)
    • explicit inference-rules with "<-" for accumulating queries? or just have queriable types?
    • backward-chained and memoized queries (result materialized on demand)
    • queries over foreign items
    • C++ interface
  • foreign interface
    • declaration of foreign primitives
      • operators and their properties
      • defaults (under various accumulation operators)
      • naming the foreign class and operator functions that get used (and what if not C++?)
      • query iterator over all instances of the type, e.g., goal += 1/int X if X > 0
    • declaration of foreign items
      • auto-evaluation of arguments
      • explicitly updating/retracting their values
      • queries
Personal tools