Does anyone have experience using transformer architectures for time series forecasting? Did it work well, and if not, why not? In particular, has anyone used Transformer-XL? Intuitively, I'd expect it to work well for handling really long-term dependencies in time series data. However, I haven’t seen any recent research that uses it outside of NLP.

