[D] Transformer predicts own input on time-series data

Hi, I’m working on a project which requires predicting the remainder of a time series given many examples. For example, given a time series from T=0 to T=m, I need to predict T=m to T=n.

I’ve trained a few different autoregressive models for the task: (1) a decoder-only Transformer, where I learn the joint probability over the full sequence and then, given the known part of the sequence, impute the part I’m interested in; and (2) an encoder-decoder Transformer, where I provide the data up to the point I’d like to predict as conditioning information and then model the conditional joint probability over just the region of interest.

In both cases, I’m finding a very strong effect where the network learns to simply predict whatever the input value is at each time step (and I’ve confirmed that the labels are being passed in correctly, i.e. shifted right relative to the input). This means that during inference it always predicts a straight line. In neural machine translation or language modeling, the next token may be very different from the previous one; with a high-resolution time series, the next token is almost always nearly identical to the input token, because the underlying signal is a continuous function of time. I’ve also tried a continuous version of the Transformer, and it simply picks out a few common modes and predicts these each time during inference. I found I can do better in terms of RMSE and MAE by just using a fully connected network that predicts the entire region of interest simultaneously (making the assumption that each point is independent of the others), which seems strange.
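To make the copy-the-input effect concrete, here is a minimal sketch assuming NumPy and a synthetic sine wave as a stand-in for the real data (the signal, the sampling rate, and the helper name `persistence_forecast` are all hypothetical, not from my project). It compares the one-step error of a model that just repeats its input against one that always predicts the mean:

```python
import numpy as np

def persistence_forecast(inputs):
    """Predict each next value as a copy of the current input value."""
    return inputs.copy()

# Hypothetical smooth signal sampled at high resolution (stand-in data).
t = np.linspace(0.0, 10.0, 10_000)
series = np.sin(t)

inputs = series[:-1]  # value at step k
labels = series[1:]   # value at step k + 1

# One-step RMSE of copying the input vs. always predicting the mean.
copy_rmse = np.sqrt(np.mean((persistence_forecast(inputs) - labels) ** 2))
mean_rmse = np.sqrt(np.mean((np.mean(inputs) - labels) ** 2))

print(f"copy-the-input RMSE:   {copy_rmse:.6f}")
print(f"predict-the-mean RMSE: {mean_rmse:.6f}")
```

On a signal like this, copying the input is already a very strong one-step predictor, so a teacher-forced next-step loss gives the network little incentive to learn anything else. Rolled out autoregressively, that same rule produces a flat line.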

Does anyone have experience with a similar task and suggestions on how to handle this? I imagine using an artificially lower time resolution would make this better but that solution is rather unsatisfying.
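As a rough illustration of why a lower time resolution might help, the sketch below quantizes a synthetic sine wave into tokens and measures how often a token equals the one before it at different strides. The signal, the 256-bin quantization, and the helper name `repeat_rate` are assumptions for illustration only, not my actual setup:

```python
import numpy as np

def repeat_rate(series, stride, n_bins=256):
    """Fraction of steps whose quantized token equals the previous token."""
    coarse = series[::stride]  # keep every `stride`-th sample
    # Interior bin edges spanning the full range of the original series.
    edges = np.linspace(series.min(), series.max(), n_bins + 1)[1:-1]
    tokens = np.digitize(coarse, edges)  # integer tokens in [0, n_bins - 1]
    return float(np.mean(tokens[1:] == tokens[:-1]))

# Hypothetical smooth signal sampled at high resolution (stand-in data).
t = np.linspace(0.0, 10.0, 10_000)
series = np.sin(t)

for stride in (1, 10, 100):
    print(f"stride {stride:>3}: repeat rate = {repeat_rate(series, stride):.3f}")
```

The repeat rate drops as the stride grows, which is the sense in which downsampling makes the next token less trivially predictable from the current one.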

(I’ve seen a few blog posts and a previous r/MachineLearning post about Transformers on time series data but none address this problem.)

submitted by /u/collider_in_blue