Natural language processing tasks, such as caption generation and machine translation, involve generating sequences of words.
Models developed for these problems often operate by generating probability distributions over the vocabulary of output words, and it is then up to a decoding algorithm to search those distributions for the most likely sequence of words.
Two decoding algorithms that can be used for this are:
1. Greedy search
2. Beam search
Greedy search simply selects the most likely word at each step of the output sequence. It is fast, but committing to the locally best word at every step can produce a globally suboptimal sequence; beam search addresses this by keeping several candidate sequences in play.
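As a rough illustration, here is a minimal greedy decoder sketch. It assumes the model's per-step probability distributions over the vocabulary have already been collected into a 2-D array (real sequence models condition each step on the words chosen so far, so this is a simplification):

```python
import numpy as np

def greedy_decode(probs):
    """Select the highest-probability word index at each step.

    probs: array of shape (seq_len, vocab_size), one probability
    distribution over the vocabulary per output position.
    """
    return [int(np.argmax(step)) for step in probs]
```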
Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number, β, of best states at each level (called the beam width). Only those states are expanded next.
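Under the same simplifying assumption of precomputed per-step distributions, a minimal beam search sketch looks like the following. It scores partial sequences by summed log-probabilities (a standard choice for numerical stability, not something specified above) and keeps only the β best candidates at each step:

```python
import numpy as np

def beam_search_decode(probs, beta=3):
    """Return the beta highest-scoring sequences and their scores.

    probs: array of shape (seq_len, vocab_size), one probability
    distribution over the vocabulary per output position.
    """
    beams = [([], 0.0)]  # (word index sequence, cumulative log-probability)
    for step in probs:
        candidates = []
        for seq, score in beams:
            for w, p in enumerate(step):
                # extend every surviving sequence by every vocabulary word
                candidates.append((seq + [w], score + np.log(p + 1e-12)))
        # keep only the beta best partial sequences (the beam width)
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beta]
    return beams
```

With beta=1 this reduces to greedy search; larger beams explore more of the search space at proportionally higher cost.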
If the output contains sentence pairs $(f_1, e_1), \ldots, (f_n, e_n)$, then the output score is the sum of the translation model and language model log-probabilities over all sentences:

$$\text{score} = \sum_{i=1}^{n} \left[ \log p_{\text{TM}}(f_i \mid e_i) + \log p_{\text{LM}}(e_i) \right]$$
The translation model probability sums over all possible ways that the model could have generated the English from the French, including translations that permute the phrases. This sum is exponential (and hence intractable) in general, but the phrase dictionary is fixed and sparse (and small for this homework), so we can compute it in a few minutes. It is still easier to do this than to find the optimal translation, but if you look at the implementation of `score-decoder.py`, you may get some hints about how to do the assignment!
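The exact scorer must also sum over phrase reorderings, but a simplified sketch of the marginalization idea, restricted to monotone (in-order) derivations, may still be helpful. The `tm` dictionary and `lm_logprob` function here are hypothetical stand-ins for the homework's actual data structures:

```python
import math
from functools import lru_cache

def sentence_score(f, e, tm, lm_logprob, max_phrase_len=4):
    """Log-probability of English sentence e given French sentence f,
    summed over all monotone phrase derivations.

    f, e: lists of words; tm: dict mapping a French phrase (tuple of
    words) to a list of (english_phrase, logprob) options (hypothetical);
    lm_logprob: function scoring an English sentence (hypothetical).
    """

    @lru_cache(maxsize=None)
    def mass(fi, ei):
        # Probability mass of covering f[fi:] while producing e[ei:].
        if fi == len(f):
            return 1.0 if ei == len(e) else 0.0
        total = 0.0
        for fj in range(fi + 1, min(fi + max_phrase_len, len(f)) + 1):
            for e_phrase, logp in tm.get(tuple(f[fi:fj]), []):
                # accept this phrase pair only if it matches the output
                if tuple(e[ei:ei + len(e_phrase)]) == tuple(e_phrase):
                    total += math.exp(logp) * mass(fj, ei + len(e_phrase))
        return total

    tm_mass = mass(0, 0)
    if tm_mass == 0.0:
        return float("-inf")
    return math.log(tm_mass) + lm_logprob(e)
```

The corpus-level score is then the sum of `sentence_score` over all sentence pairs, matching the formula above.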