[D] Transformer predicts own input on time-series data
Hi, I’m working on a project which requires predicting the remainder of a time series, with many example series available for training. For example, given a time series from T=0 to T=m, I need to predict T=m to T=n.
I’ve trained a few different autoregressive models for the task: (1) a decoder-only Transformer, where I learn the joint probability over the full sequence and then, given the known part of the sequence, impute the part I’m interested in; and (2) an encoder-decoder Transformer, where I provide the data up to the point I’d like to predict as conditioning information and model the conditional joint probability over just the region of interest.
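To make the two setups concrete, here is a minimal sketch of how the inputs and labels could be sliced in each case. It assumes a 1-D NumPy array per series and teacher forcing during training; the function names `decoder_only_pairs` and `encoder_decoder_split` are hypothetical and only illustrate the indexing, not the actual training code.

```python
import numpy as np

def decoder_only_pairs(series):
    """Setup (1): next-step labels over the full sequence (labels shifted by one)."""
    inputs = series[:-1]   # x[0], ..., x[n-2]
    labels = series[1:]    # x[1], ..., x[n-1]
    return inputs, labels

def encoder_decoder_split(series, m):
    """Setup (2): encoder context plus teacher-forced decoder inputs and labels."""
    context = series[:m]                # conditioning information, T=0 .. m-1
    decoder_inputs = series[m - 1:-1]   # last known value, then the true future shifted by one
    labels = series[m:]                 # region of interest, T=m onward
    return context, decoder_inputs, labels

series = np.arange(8, dtype=float)      # toy series, assumed for illustration
inputs, labels = decoder_only_pairs(series)
context, decoder_inputs, future_labels = encoder_decoder_split(series, m=5)
```

In both cases the label at each position is the value one step after the input at that position, which is the shift described in the next paragraph.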
In both cases, I’m finding a very strong effect where the network learns to simply predict whatever the input value is at each time step (and I’ve confirmed that the labels are being passed in correctly: shifted right relative to the input). This means that during inference it always predicts a flat, straight line. In neural machine translation or language modeling, the next token may be very different from the previous one, but with a high-resolution time series the next value is almost always nearly identical to the input value, because the series is a densely sampled function that changes little between adjacent steps.
I’ve also tried a continuous version of the Transformer, and it simply picks out a few common modes and predicts these each time during inference. I found I can do better in terms of RMSE and MAE by just using a fully connected network that predicts the entire region of interest simultaneously (i.e., treating each predicted point as independent of the others given the input), which seems strange.
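The copying effect can be reproduced without any model. The sketch below is an illustration on a synthetic sine wave (an assumption, not the actual data): it compares a "copy the input" predictor under teacher forcing at two sampling rates, then rolls the same predictor forward on its own output. All function names are hypothetical.

```python
import numpy as np

def persistence_one_step_rmse(series):
    """RMSE of predicting x[t+1] = x[t] with the true x[t] as input (teacher forcing)."""
    inputs, targets = series[:-1], series[1:]
    return float(np.sqrt(np.mean((targets - inputs) ** 2)))

def persistence_rollout(context, horizon):
    """Free-running copy-last-value forecast: repeats the last known value."""
    return np.full(horizon, context[-1])

def make_sine(points_per_period, periods=4):
    """Synthetic stand-in for a smooth signal sampled at a given resolution."""
    t = np.arange(points_per_period * periods) / points_per_period
    return np.sin(2 * np.pi * t)

low_res = make_sine(points_per_period=10)
high_res = make_sine(points_per_period=1000)

# One-step error of pure copying shrinks as the sampling rate grows.
rmse_low = persistence_one_step_rmse(low_res)
rmse_high = persistence_one_step_rmse(high_res)

# Rolled out over a full period, the same predictor is a flat line.
m = 3 * 1000
forecast = persistence_rollout(high_res[:m], horizon=len(high_res) - m)
rollout_rmse = float(np.sqrt(np.mean((high_res[m:] - forecast) ** 2)))

print(f"one-step RMSE, low res:  {rmse_low:.4f}")
print(f"one-step RMSE, high res: {rmse_high:.4f}")
print(f"rollout RMSE, high res:  {rollout_rmse:.4f}")
```

At high resolution, copying is already a near-optimal answer to the one-step training objective, so a low training loss says little about how the model behaves when it has to feed on its own predictions.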
Does anyone have experience with a similar task and suggestions on how to handle this? I imagine using an artificially lower time resolution would make this better but that solution is rather unsatisfying.
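For reference, the lower-resolution workaround could look something like the sketch below, which averages non-overlapping windows so that consecutive values differ more. The helper name `downsample_mean` and the choice of block averaging (rather than, say, simple striding) are assumptions made for illustration.

```python
import numpy as np

def downsample_mean(series, factor):
    """Average non-overlapping windows of `factor` samples; drops any leftover tail."""
    usable = (len(series) // factor) * factor
    return series[:usable].reshape(-1, factor).mean(axis=1)

series = np.arange(10, dtype=float)         # toy series, assumed for illustration
coarse = downsample_mean(series, factor=3)  # windows [0,1,2], [3,4,5], [6,7,8]
```

The step-to-step change in `coarse` is `factor` times larger than in `series`, at the cost of discarding detail inside each window.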
(I’ve seen a few blog posts and a previous r/MachineLearning post about Transformers on time series data but none address this problem.)