
[D] Normalization of highly variable 1-D data

Hi, I’m working with a large dataset of 1-D data (think time series) with wildly varying values (decidedly non-normal). My goal is to train an autoregressive generative model like WaveNet on this data. I need to normalize all series to the range [0, 1] so I can then quantize them to 256 possible values for the WaveNet’s softmax output. I’ve run into a few problems and haven’t found much help online (mostly from searching for time series normalization, standardization, etc.).
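
For concreteness, the quantization step I have in mind looks roughly like this (a minimal numpy sketch, assuming a series already scaled to [0, 1]; the WaveNet paper applies µ-law companding to audio before binning, but the binning itself is the same idea):

    import numpy as np

    def quantize(series, levels=256):
        # Map a series already scaled to [0, 1] onto integer bins 0..levels-1.
        series = np.clip(series, 0.0, 1.0)  # guard against small numeric overshoot
        return np.minimum((series * levels).astype(np.int64), levels - 1)

    def dequantize(bins, levels=256):
        # Map integer bins back to bin centers in [0, 1].
        return (bins + 0.5) / levels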

A quick run through my current process:

  • Divide each series by its median value in a small window where the signal is (or should be) ~0.
  • At this point, the distribution of values in each series is roughly log-normal, so I take the log and then standardize each series individually by subtracting its mean and dividing by its standard deviation.
  • If I now normalize the entire dataset by its global max/min (i.e. data = (data - data.min()) / (data.max() - data.min())), most series get squeezed into a range like [0.4, 0.6] due to massive outlying max/min values. I’ve tried scaling each series by its own min/max instead, but again, some series have massive outliers, which skews their scale relative to the rest of the dataset. (A sketch of the whole pipeline is below.)
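
Putting those steps together, my current pipeline looks roughly like this (numpy sketch; base_slice is a stand-in for however I locate the ~0 baseline window, and I’m assuming values are positive going into the log):

    import numpy as np

    def normalize_dataset(series_list, base_slice=slice(0, 50)):
        # base_slice is a placeholder for the window where the signal is ~0.
        processed = []
        for s in series_list:
            s = s / np.median(s[base_slice])  # divide by baseline median
            s = np.log(s)                     # values are roughly log-normal
                                              # (assumes s is positive here)
            s = (s - s.mean()) / s.std()      # standardize each series
            processed.append(s)
        # Global min/max scaling across the whole dataset.
        flat = np.concatenate(processed)
        lo, hi = flat.min(), flat.max()
        return [(s - lo) / (hi - lo) for s in processed]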

Is my best option just to cull the outliers? Or am I missing a step somewhere? I’ve visualized a few of the outliers, and they are valid data. I’ve also tried median-stacking nearest-neighbor series to tame some of the volatility, but I’m not sure where to go from here. Leaving things as they are increases the effective quantization noise in my data: instead of being spread over 256 values, most series only span ~100 values in the discrete space. Any help would be much appreciated!
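
To be concrete about what culling would look like, I’m imagining percentile-based clipping before the global scaling, something like the sketch below (the 0.1/99.9 cutoffs are just placeholders I’d have to tune):

    import numpy as np

    def clip_to_percentiles(series_list, lo_pct=0.1, hi_pct=99.9):
        # Clip the pooled data to chosen percentiles so a handful of extreme
        # points don't dictate the global min/max. Cutoffs are placeholders.
        flat = np.concatenate(series_list)
        lo, hi = np.percentile(flat, [lo_pct, hi_pct])
        return [np.clip(s, lo, hi) for s in series_list]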

submitted by /u/collider_in_blue