[TOC]
The general goal is to train structured models, where there's a beam or a trellis, normalized globally.
To begin, we implement the CRF objective (also termend "contrastive learning") for parsing. Paraphrasing the notes of who/mjcollins, we have globally normalized path probabilities
where \(A_b(j, k, m)\) is the activation for step \(j\), beam \(b\), slot \(k\), and output class \(m\). We also have \(D_b(j,k)\), the decision at step \(j\) in the final hypothesis \(k\), and \(K_b(j,k)\), the position at step \(j\) of final hypothesis \(k\).
The final log-likelihood is then:
where \(g_b\) is the position of the gold hypothesis in the final beam.
With early updates, the 'final' beam is the one where the gold would have fallen off the beam. With this option, using a large beam and setting the maximum path length to one results in effectively doing the usual non-structured training. This is a useful control.
The beam reader Ops take care of producing \(K_b\), \(D_b\) and \(g_b\), while the TensorFlow graph takes care of constructing \(A_b\).
The only fixed dimension of the tensor-like object \(A_b\) is its final dimension, which is always the number of classes. Everything else, including the size of the beam and the number of steps, can vary. And due to early updates even the effective number of batches can shrink as the gold falls off.
Therefore \(A\) is stored as matrix, with a fixed number of columns but a
variable number of rows. The sparse storage of \(A\), named concat_scores
in
the diagram below and in the code, relies of keeping track of the (variable)
number of beams and beam slots.
Due to early updates we do not know ahead of time the number of steps that need
to be taken before all the gold falls off. Therefore we use TensorFlow Cond
operators to terminate parsing when all the gold paths have fallen off.
-
BeamParseReader opens the corpus for reading and allocates the main BeamState object that will be used by the other Ops. It outputs a handle to the BeamState and the feature vectors associated with the first token. Only one BeamParseReader needs to be created per decoding network.
-
BeamParser takes the scores output by the network and combines it with the BeamState to generate an updated beam and a fresh set of feature vectors. The BeamState object is mutated in the process. Multiple BeamParsers are chained together to construct a fully unrolled network. It is important that consecutive BeamParsers have the correct dependencies to ensure that only one BeamParser is evaluated at any time.
-
BeamParserOutput processes the BeamState to generate the paths over which the cost will be computed. Multiple BeamParserOutputs can be instantiated, although they all access the same underlying BeamState, so in practice only one is needed.
-
BeamEvalOutput processes the BeamState to compute eval metrics for the best path in each beam. The eval metrics is returned as a vector of length two, containing [#tokens, #correct tokens].