From Dyna
Jump to: navigation, search

DynaMITE (the Dyna Module for Iterative Training and Estimation) is a C++ library that is intended to make it easy to train weights using Dyna-generated code. Currently, DynaMITE provides implementations of the Expectation-Maximization (EM) algorithm, gradient-based unsupervised training, and gradient-based maximum likelihood training of log-linear ("max-ent") models. DynaMITE also provides tools for tuning smoothing parameters (deleted interpolation and Laplacian). The neat thing about these algorithms is that for many models, their implementation is essentially the same. Further, all of these algorithms share a core inner loop: a "forward" dynamic program and a "backward" dynamic program.

Here we discuss how to run the EM algorithm using DynaMITE. Our working example is the file io.dyna (which gives the inside-outside algorithm). We assume that the underlying model is a probabilistic context-free grammar (PCFG), and our goal is to train the weights using the inside-outside algorithm (an instance of EM).


What to put in the Dyna program

Here is the core Dyna program we will use for forward-backward:

  constit(X, I, J) log+= word(Y, I, J) + rewrites1(X, Y).
  constit(X, I, J) log+= constit(Y, I, J) + rewrites1(X, Y).
  constit(X, I, K) log+= rewrites(X, Y, Z) + constit(Y, I, J) 
                         + constit(Z, J, K).
  goal log+= constit(S, 0, N) + startsym(S) whenever length(N).

Now, the following need to be inserted for use by DynaMITE:

  •  :- union(trainable, [rewrites, rewrites1]).
This means that the weights of rewrites and rewrites1 items are trainable; they correspond to probabilities or weights in the model. In fact DynaMITE doesn't currently use the trainable union, but a trainable declaration is how the Dyna compiler knows to allow the gradient/backprop operation.
  •  :- union(nonpersistent, [word, length]).
These data-based axioms need to be retracted between examples; they are "sentence specific." Note that if you define new_example it must be in the nonpersistent union type.
  •  :- retractable(nonpersistent).
All nonpersistent axioms are to be removed from the chart after the forward and backward passes on a sentence.
  •  :- retractable(term, term).
All terms are to be retracted after a pass over the whole corpus. This is so that axioms can be asserted with new weights on the next iteration. The fact that term appears twice is the result of a temporary workaround.
  • nt_sum(Nonterm) log+= rewrites2(Nonterm, _, _).
nt_sum(Nonterm) log+= rewrites1(Nonterm, _).
constrain_eq(nt_sum(N), 0) log+= nt_sum(N).
s_sum(Nonterm) log+= startsym(Nonterm).
constrain_eq(s_sum(N), 0) log+= s_sum(N).
:- query(constraints,constrain_eq(_,_)).
These lines specify sum-to-one constraints (in the logarithmic domain) for all rewrite probability distributions. There is one such distribution for each nonterminal in the grammar, to choose how to rewrite that nonterminal. There is also a distribution for choosing the start symbol.
  •  :- union(sum_type, [ s_sum, nt_sum ]).
 :- pragma(keep_all_antecedents(sum_type)).
Allows DynaMITE to step backwards and recover all antecedents of a sum_type term. For obscure reasons the pragma keep_all_antecedents requires a union type, so we create sum_type---but we would have needed that anyway, since there are two kinds of sums in this program.
Currently the above syntax is broken. A less elegant and less efficient hack is to ignore the union and just give item as an argument to keep_all_antecedents.
  • (Optional)  :- structure(new_example).
This is a term that DynaMITE will use to signal that we are starting a forward pass on the next example. You need not use it in your program, though you might find it to be a useful side condition for constructing a theorem that depends on trainable axioms but is of a type that could depend on a nonpersistent type (see below, in the forward-backward example).

These six or so additions to the Dyna program extend the usual chart interface in ways that DynaMITE expects and needs. For some optimization algorithms, some of the above may not be necessary (you won't need the log-sum-to-zero constraints if you're training log-linear weights instead of generative probabilities, for example).

The Inside-Outside algorithm using DynaMITE

A more complete version of very similar code, including include files, using declarations, and a generic main() function, can be found in the DynaMITE tutorial directory as em.cc. A much more useful program with lots of options can be compiled using dynac with --em; see the section below on compilation. This program is meant to help users who might want to write their own DynaMITE programs, but we're trying to release tools that build the most useful kinds of training opaquely.

1  hash_pv<term> theta;            // the parameter vector
2  load_parameters("grammar_rules", theta); 
                                   // load the initial parameters
3  chart c;                        // the chart; does the dynamic programming
4  build_exhaustively engine(c, goal);  
                                   // drives operations on the chart
5  corpus dat("data", c);          // read in sentence data
6  summation_objective obj(engine, dat); 
                                   // sums log-likelihood over all data
7  multinomial<term> model;        // knows about the log-sum-to-zero constraints
8  construct_multinomial_from_chart(c, theta, model);
                                   // extract constraint info from the chart 
9  fixed_iterations ctest(10);     // EM is to run for 10 iterations
10 em_algorithm EM(ctest, &model); // instantiate the EM algorithm object
11 real l = opt.optimize(obj, theta); 
                                   // run EM
12 cout << "final log-likelihood value = " << l << endl; 
                                   // report log-likelihood
13 write_parameters(cout, theta);  // write learned parameters to stdout 

Let's consider this program line by line. Certain choices are made along the way; we will not point out all of the available alternatives yet.

Line 1 declares an object called a parameter vector (pv); this one is a hash_pv<term>, which is an implementation that stores parameters associated with io::terms. The initial values of the parameters are read in from a file called grammar_rules (the format is described below); this happens in line 2.

Line 3 declares the chart; this is at the core of our EM learner's computations. The chart knows how to run the inside and outside passes over a given sentence.

But a chart is a very low-level object. The build_exhaustively object (line 4) drives the chart, telling it to build to completion (inside pass) and then to backpropagate to get outside probabilities (outside pass). The more general type is engine; an engine does a core piece of computation (usually one that is repeated many times).

To run a pass of inside-outside, we need the parameter axioms (rewrites and startsym), but we also need the sentence-specific axioms ( word and length). These come from the data. Here we assume the data comes in a file data. We build an object called a corpus (line 5) (the more general type is dataset), which can construct these example-specific examples for each sentence in turn, and put them in the chart when the time comes.

The real objective function that we want to optimize is a sum of log-likelihoods over all sentences in the corpus. The engine (here a build_exhaustively object) computes likelihood (and expected counts) for one sentence; the summation_objective (line 6) calls the engine once for each sentence.

Those classes give us everything we need for the E step, which computes the value of the likelihood function (the likelihood of the given corpus given the current parameters theta) and gets expected counts for each of the parameter axioms. What about the M step?

In the M step, we pick parameters that maximize the probability of the data. We specify the M step by defining a parametric model that can compute the maximum likelihood estimates of its parameters, given sufficient statistics. The E step gives the sufficient statistics; the parametric model is defined as a set of multinomial distributions over terms, or a multinomial<term> (line 7). This object knows how to renormalize counts into probabilties, once it is set up to know which counts go into the same distribution together. Where does it get that information? It is constructed in line 8, using the chart. Recall that in the Dyna program, we specified which terms are supposed to log-add up to zero. The function construct_multinomial_from_chart uses the chart and the grammar rules (stored in theta) to determine what parameters go into multinomial distributions together.

EM iteratively improves the probability of the data given the parameters. When should it stop? A stopping criterion is encapsulated in a convergence_test object, of which fixed_iterations (line 9) is one kind. The fixed_iterations object in this program is set so that EM will run for ten iterations before stopping.

An object corresponding to the EM algorithm, of type em_algorithm, is instantiated on line 10; note that it knows about the convergence test and the model.

EM is run on line 11; the log-likelihood value is returned. That value is printed on line 12. The final parameters are written out on line 13.

The following pseudo-code for the EM algorithm may help clarify the role of each of these objects:

 load axioms and initial parameter values from file "grammar_rules"
 load data from file "data"                     (corpus)
 call to optimize():                            (em_algorithm)
  while not converged:                          (fixed_iterations)
    E step:                                       
      load current parameter axioms             (hash_pv -> chart)
      for all sentences i:                      (summation_objective)
        load word, length axioms for sentence i (corpus -> chart)
        compute expectations for sentence i:    (build_exhaustively)
          inside pass                           (chart)
          outside pass                          (chart)
    M step:                                     
      renormalize the counts                    (multinomial<term>)
      update parameters                         (hash_pv)

The EM program mentions two input files, grammar_rules and data. The grammar_rules file should contain the axioms corresponding to the grammar, along with their initial weights (which are log probabilities). For example,

 rewrites2("A", "B", "C") := -.693147. 
 rewrites2("A", "C", "B") := -.693147. 
 rewrites1("B", "foo") := -1.386294.
 startsym("A") := 0.0. 
Would be nice if we could write log(0.5) instead of -.693147.
It's now possible in the latest CVS version. -Markus 16:42, 9 Nov 2004 (EST)

The data file should contain one sentence per line, with spaces between the words.

The above program is everything you need to run the inside-outside algorithm, given io.dyna and DynaMITE. Further, it will work with other dynamic programs (e.g., the forward-backward algorithm), provided you recompile properly.

Compiling your DynaMITE-based program

DynaMITE is a C++ library and a number of template classes in include files. Compiling with DynaMITE is easy.

dynac your_dyna_program.dyna your_cplusplus_program.cc --dynamite -o executable_name

This will create a subdirectory called build.executable_name where the intermediate files will go. Your executable will end up in the current working directory.

Note that everything in italics must be properly replaced.

dynac has a built-in option to compile an EM training program, jam-packed with all the latest DynaMITE features, for your Dyna program. (So now that you understand how a DynaMITE program works, you can avoid writing the C++ code altogether!) Making sure to augment your Dyna program appropriately (see above), run:

dynac your_dyna_program.dyna --dynamite --em -o executable_name

Running the resulting program with --help or -h will show how to use it.

Any of the input or output files of this program may be compressed with gzip simply by appending .gz to their names.

Deleted interpolation smoothing in DynaMITE

Deleted interpolation (Jelinek and Mercer, 1980) is a smoothing technique. It is probabilistically motivated, tuning the parameters is straightforward (Bahl, Jelinek, and Mercer, 1983), especially with generic optimizers like DynaMITE.

We'll start with a simple example. Suppose our model simply predicts Failed to parse (Missing texvc executable; please see math/README to configure.): \Pr(x \mid y)

for some set of values Failed to parse (Missing texvc executable; please see math/README to configure.): x \in \mathcal{X}
and Failed to parse (Missing texvc executable; please see math/README to configure.): y \in \mathcal{Y}

. Assume that we are given frequency counts Failed to parse (Missing texvc executable; please see math/README to configure.): f(x, y)

for all observed pairs Failed to parse (Missing texvc executable; please see math/README to configure.): x, y

. The maximum likelihood estimate (MLE) for Failed to parse (Missing texvc executable; please see math/README to configure.): p(x \mid y)

Failed to parse (Missing texvc executable; please see math/README to configure.): p_{\mathit{MLE}}(x \mid y) = \frac{f(x, y)}{\sum_{x^\prime} f(x^\prime, y)} = \frac{f(x, y)}{f_Y(y)}

But the count Failed to parse (Missing texvc executable; please see math/README to configure.): f(x, y)

may be 0 for some pairs.  Two alternative estimates are to back off and not condition on Failed to parse (Missing texvc executable; please see math/README to configure.): y

, or to use the uniform distribution.

Failed to parse (Missing texvc executable; please see math/README to configure.): p_{\mathit{BO}}(x) = \frac{\sum_{y^\prime} f(x, y^\prime)}{\sum_{x^\prime, y^\prime} f(x^\prime, y^\prime)} = \frac{f_X(x)}{f_{\mathit{total}}}

Failed to parse (Missing texvc executable; please see math/README to configure.): p_U(x) = \frac{1}{\left| \mathcal{X} \right|}

Deleted interpolation chooses to linearly interpolate between all three estimates:

Failed to parse (Missing texvc executable; please see math/README to configure.): \hat{p}(x \mid y) = \lambda_1 p_{\mathit{MLE}}(x \mid y) + \lambda_2 p_{\mathit{BO}}(x) + \lambda_3 p_U(x)

The question is how to choose the Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda s, and the answer is to choose them to maximize the likelihood of a development dataset.

To see how this is an unsupervised estimation problem suitable for EM, consider the following probabilistic model. We are given Failed to parse (Missing texvc executable; please see math/README to configure.): y . First, we choose one of three models: the MLE model, the BO model, or the U model. We choose the model randomly according to the distribution given by the Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda s (i.e., we choose MLE with probability Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda_1 , BO with probability Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda_2 , and so on). Once we choose a model, we pick Failed to parse (Missing texvc executable; please see math/README to configure.): x

using the chosen model.  Given the data, what we don't observe is which model was used to choose Failed to parse (Missing texvc executable; please see math/README to configure.): x
in each example.  The model variable is a hidden variable, which we need to estimate.

There are many ways to do this kind of backing off and interpolation. The Dyna language allows a clean way to specify the interpolation formula declaratively. We will see how to solve this problem using Dyna and DynaMITE.

Here is a Dyna program that will compute smoothed probabilities given raw counts and values of the three Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda s. I'll refer to it as simple.dyna. Please read the comments in the program.

 :- item(item, double, neg_inf).
 :- structure(fxy(string x, string y)).
 :- query(constraints, constrain_eq(_, _)).
 :- pragma(keep_all_antecedents(item)).
 :- retractable(term,term).
 :- union(trainable, [lambda1, lambda2, lambda3]).
 % Although this is a declarative program, I will describe it in stages.
 % We start out with fxy terms being asserted.  These are the training set (log) counts.
 % These three lines compute marginal frequencies.
 f log+= fx(_).
 fx(X) log+= fxy(X, _).
 fy(Y) log+= fxy(_, Y).
 % Dyna doesn't yet give us an inverse operation (in this case, log-division, i.e., subtraction).
 % But we need that to get relative frequencies of different kinds.  The inverse term
 % is just a convention; DynaMITE will build in the chart until completion, then look
 % for any inverse terms, assert each one's argument with the inverse of the value
 % (so that, for instance, zy("A") is equal to -fy("A")).  vocxsize is computed below.
 inverse(zy(Y)) log+= fy(Y).
 inverse(z) log+= f.
 inverse(zu) log+= vocxsize.
 % These are the maximum likelihood estimates under each of the three interpolated models.
 % mle1 is what we called MLE above; mle2 is what we called BO above, and MLE is what we
 % called U above.
 mle1(X, Y) log+= fxy(X, Y) + zy(Y).
 mle2(X) log+= fx(X) + z.
 mle log+= zu.  % uniform
 % Here we compute the smoothed probabilities, interpolating the three estimates.
 % The side conditions cover variables that would otherwise be unbound on the lefthand side.
 srf(fxy(X, Y)) log+= lambda1 + mle1(X, Y).
 srf(fxy(X, Y)) log+= lambda2 + mle2(X) whenever invocy(Y).
 srf(fxy(X, Y)) log+= lambda3 + mle whenever invocx(X) whenever invocy(Y).
 % These just keep track of what is in the X and Y vocabularies.  If there is some element
 % (say, "A") that was never observed at all, you could assert invocx("A") to force it to
 % be counted.
 invocx(X) log+= zero whenever fxy(X, _).
 invocy(Y) log+= zero whenever fxy(_, Y).
 vocxsize log+= zero whenever invocx(X).
 % DynaMITE needs to be able to iterate over inverse items and smoothed probabilties.
 :- query(inverses, inverse(_)).
 :- query(smoothed, srf(_)).
 % Force lambdas to sum to one.
 lsum log+= lambda1 whenever sums.
 lsum log+= lambda2 whenever sums.
 lsum log+= lambda3 whenever sums.
 constrain_eq(lsum, 0) log+= lsum.

Here is the DynaMITE-based C++ program that will train the Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda s. Note that you must supply initial values of the Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda s, as well as training and development set counts. (This program assumes that the Failed to parse (Missing texvc executable; please see math/README to configure.): \lambda

probabilities and the counts it reads in are not in the log domain.)
 using namespace simple;
 hash_pv<term> train, dev, L, smoothed;                            // vectors
 load_parameters<dynamic_program>(argv[1], train, LOG);            // read in training counts,
 load_parameters<dynamic_program>(argv[2], dev, LOG);              //         development counts,
 load_parameters<dynamic_program>(argv[3], L, LOG);                //         initial lambdas
 multinomial<term> lmodel;                                         // the interpolation model
 construct_multinomial_from_chart<dynamic_program>(c, L, lmodel);  // extract sum-to-one info from the chart
 rtol_test ctest(1.0e-7, MAXIMIZE);                                // convergence test
 em_algorithm em(ctest, &lmodel);                                  // optimizer used to train lambdas
 del_interp<dynamic_program> model(L, c, dev, em);                 // the model giving smoothed probabilities
 model.estimate_from_statistics(train, smoothed);                  // estimate the model (train lambdas)
 write_parameters<dynamic_program>(cout, params, EXP);             // write smoothed probabilities to stdout
 cerr << "final lambdas:" << endl;                
 L.write(stderr, "\t%s\t%f\n", exp);                               // write lambdas to stderr

Gradient-based optimization

DynaMITE uses an external toolkit, the Toolkit for Advanced Optimization (TAO; http://www-fp.mcs.anl.gov/tao/ ) for gradient-based optimization. TAO in turn requires a toolkit called PETSc (http://www-unix.mcs.anl.gov/petsc/petsc-2/ ). If you want to use TAO for gradient-based optimization in DynaMITE, you must install PETSc version 2.1.5 and TAO version 1.4. If these are installed and the usual environment variables are set up, you should be able to compile the dynamite_tao library and link it.

An alternative is to use the gradient descent method implemented in DynaMITE (gdls class, "gradient descent with line search"). This method works, though it is not as fast to converge as the methods implemented in TAO.

Personal tools