[D] LSTM with walk-forward validation and data normalization/standardization
I’m currently trying to build a multivariate model to predict stock market movements using an LSTM. The model is seq-to-one rather than seq-to-seq, if that matters.
I’ve read that walk-forward validation is the ‘gold standard’ for validating time-series forecasts, and that ordinary k-fold cross-validation doesn’t work because shuffling ignores the temporal ordering of the data (folds would train on the future to predict the past).
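For anyone unfamiliar, here’s a minimal sketch of what I mean by walk-forward splits, using sklearn’s `TimeSeriesSplit` (expanding-window variant) on a toy series — each fold tests only on data strictly after its training block:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series: 10 time steps, ordered oldest -> newest.
X = np.arange(10).reshape(-1, 1)

# Expanding-window walk-forward: each fold trains on everything up to
# a cutoff and tests on the block that immediately follows it, so the
# test set is always strictly in the "future" of the training set.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
```

(`TimeSeriesSplit` expands the training window by default; a fixed-length moving window would cap its size via `max_train_size`.)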
This creates some weird implications for data normalization…
I’ve firmly held the belief that information leakage can spoil a model by producing unrealistically good in-sample accuracy/loss. Consequently, I’m pretty careful to do the train-test split first and then use custom transformation pipelines to standardize the data (i.e. fit_transform() on the training set vs. transform() on the test set). How do you overcome this issue? Is it really that big a deal to split before standardizing?
Main question: If you’re using a moving-window walk-forward validation, how would you handle train/test data splits and data normalization?
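My current idea is roughly the following (a hypothetical sketch — `walk_forward_windows` is a helper I made up for illustration): slide a fixed-length training window forward, and fit a fresh scaler on each window’s training rows only, so the normalization statistics never see that fold’s test block. Is this the standard approach?

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def walk_forward_windows(n_samples, train_size, test_size, step):
    """Yield (train_idx, test_idx) pairs for a moving-window walk-forward."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size,
                             start + train_size + test_size)
        yield train_idx, test_idx
        start += step  # slide the whole window forward

rng = np.random.default_rng(0)
data = rng.normal(size=(120, 4))  # toy multivariate series

for train_idx, test_idx in walk_forward_windows(len(data), 60, 10, 10):
    # A NEW scaler per fold, fitted only on this fold's training window.
    scaler = StandardScaler().fit(data[train_idx])
    X_train = scaler.transform(data[train_idx])
    X_test = scaler.transform(data[test_idx])  # same stats, never refit on test
    # ...build LSTM input sequences and train/evaluate on this fold...
```

The part I’m unsure about is whether refitting the scaler every window is overkill, or whether people just fit once on the earliest window and reuse it.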