[D] Normalization of highly variable 1-D data
Hi, I’m working with a large dataset of 1-D data (think time series) with wildly varying values (decidedly non-normal). My goal is to train an autoregressive generative model like WaveNet on this data. I’ll need to normalize all series to the range [0, 1] so I can then quantize the data to 256 possible values for the softmax output of the WaveNet. I’ve run into a few problems and haven’t found much help online (mostly searching for time series normalization, standardization, etc.).
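For concreteness, the quantization step I have in mind looks something like this (a minimal sketch assuming the data is already scaled to [0, 1]; `n_levels` is just the 256-way softmax size):

```python
import numpy as np

def quantize(x, n_levels=256):
    # x is assumed to already be normalized to [0, 1];
    # map each value to an integer bin 0..n_levels-1 for the softmax targets
    return np.clip((x * n_levels).astype(np.int64), 0, n_levels - 1)

# e.g. quantize(np.array([0.0, 0.5, 1.0])) gives bins [0, 128, 255]
```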
A quick run through my current process:
- Divide each series by its median value in a small window where the signal is (or should be) ~0
- At this point, the distribution of values in each series is roughly log-normal, so I take the log and then standardize each series individually by subtracting its mean and dividing by its standard deviation
- If I now normalize the entire dataset by its global max/min (i.e. data <- (data - data.min())/(data.max() - data.min())), most series are squeezed into a range like [0.4, 0.6] due to massive outlying max/min values. I’ve tried scaling each series by its own min/max, but again, series with massive outliers end up on a different scale than the rest of the dataset.
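The steps above can be sketched roughly as follows (the baseline window is a placeholder — in my data it’s chosen per series, and the values are assumed positive so the log is defined):

```python
import numpy as np

def normalize_series(x, baseline_window=slice(0, 100)):
    # Step 1: divide by the median in a window where the signal should be ~0
    # (baseline_window is a placeholder; the real window is data-dependent)
    x = x / np.median(x[baseline_window])
    # Step 2: the values are roughly log-normal, so take the log
    # (assumes x is strictly positive after the baseline division)
    x = np.log(x)
    # ...then standardize each series individually
    return (x - x.mean()) / x.std()

def minmax_dataset(series_list):
    # Step 3: min/max scaling over the whole dataset -- this is where
    # outlying extremes squeeze most series into a narrow band like [0.4, 0.6]
    data = np.concatenate(series_list)
    lo, hi = data.min(), data.max()
    return [(s - lo) / (hi - lo) for s in series_list]
```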
Is my best option to just cull the outliers? Or am I missing a step somewhere? I’ve visualized a few of the outliers, and they are valid data. I’ve also tried median-stacking nearest-neighbor series to tame some of the volatility, but I'm not sure where to go from here. Leaving it as is will increase the effective quantization noise in my data, since instead of being spread over 256 values, most series only span ~100 values in the discrete space. Any help would be much appreciated!
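For reference, the outlier culling I’m considering is percentile-based clipping along these lines (the cutoffs are placeholders, not tuned values):

```python
import numpy as np

def clip_outliers(x, lo_pct=0.5, hi_pct=99.5):
    # Clip (rather than drop) values beyond the chosen percentiles,
    # so series length is preserved; cutoffs are illustrative only.
    lo, hi = np.percentile(x, [lo_pct, hi_pct])
    return np.clip(x, lo, hi)
```

My worry is that this throws away tails that I know are valid data.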