[D] Decoding with the Transformer at inference time for time series data
With the Transformer model from “Attention Is All You Need”, you have to feed the actual target sequence into the decoder during training. Obviously that can't be done at inference time, since the target is what we are trying to predict. Usually greedy decoding or beam search is used at inference to generate the target sequence iteratively. However, from my understanding (could be wrong), both beam search and greedy decoding generally work in conjunction with a softmax over a fixed vocabulary, whereas a forecasting target is a real-valued number. How would we use the Transformer in inference mode for a time series forecasting task? What is the best way to generate the target values that are fed to the decoder? Could beam search still work?
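
To make the question concrete, here is a minimal sketch of the iterative loop I have in mind, with the softmax replaced by a real-valued output. Everything in it is an assumption for illustration: `autoregressive_forecast` and `toy_model` are hypothetical names, the `model(src, tgt)` interface is made up, and using the last observed value as the decoder start token is just one possible convention.

```python
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not a real library API):
# model(src, tgt_so_far) returns one real-valued prediction per position
# in tgt_so_far, i.e. a regression head instead of a softmax.
Model = Callable[[Sequence[float], Sequence[float]], List[float]]


def autoregressive_forecast(model: Model, src: Sequence[float], horizon: int) -> List[float]:
    """Roll the decoder forward one step at a time, feeding back its own outputs."""
    # Assumption: the last observed value serves as the decoder start token.
    decoder_input = [src[-1]]
    forecast: List[float] = []
    for _ in range(horizon):
        outputs = model(src, decoder_input)
        next_value = outputs[-1]          # prediction for the next time step
        forecast.append(next_value)
        decoder_input.append(next_value)  # stands in for the ground truth used in training
    return forecast


def toy_model(src: Sequence[float], tgt: Sequence[float]) -> List[float]:
    # Stand-in for a trained Transformer: predicts "previous value + 1" at each position.
    return [v + 1.0 for v in tgt]


print(autoregressive_forecast(toy_model, [1.0, 2.0, 3.0], horizon=4))
# prints [4.0, 5.0, 6.0, 7.0]
```

With a single point prediction per step there is only one candidate to extend, so this loop is effectively greedy decoding. As far as I can tell, beam search would need the model to output a distribution over the next value (or discretized bins) so that there are several candidates to rank.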