[D] detecting anomalies in neural network data

Hi Reddit,

I am writing my thesis on anomalies (errors) in data and how they affect neural networks. The main goal is to detect them without extra information such as response time. The anomalies can occur in both the x and y variables, so some data is mislabelled. Since there is not a lot of literature on this topic, I have a few questions.

1) What techniques could I use? For mislabelled data I could simply select the observations with the worst prediction results (keeping them out of the training set when the number of observations is small). For anomalies in the x variables I am considering training a neural network to detect them. This network would include random x data as observations with a separate label (for example 3 classes: A, B and C (anomaly)). Again, with a low number of observations, if I train on data that contains errors the network would learn to treat such data as valid, so I would have to exclude from training every observation I am checking for being an outlier. But that would cost a lot of training time. Maybe such a procedure would only be worthwhile for observations I already suspect of being outliers (poor prediction performance, x far away from the other x's). Other methods I am considering are KNN and the isolation tree/forest method.
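To make the KNN idea concrete, here is a minimal NumPy sketch, not a definitive method. It computes two scores per observation: the mean distance to the k nearest neighbours (a candidate score for x anomalies) and the fraction of those neighbours with a different label (a candidate score for mislabelled y). Because a point is never its own neighbour, each observation is effectively scored without "training" on itself. The function names, the toy data and k=5 are all assumptions for illustration, and KNN is a stand-in here for the neural network.

```python
import numpy as np

def pairwise_sq_dists(X):
    # Squared Euclidean distance between every pair of rows.
    sq = np.sum(X ** 2, axis=1)
    d = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    return d

def knn_scores(X, y, k=5):
    d = pairwise_sq_dists(X)
    nn = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbours
    # x score: mean distance to the k nearest neighbours
    x_score = np.sqrt(np.take_along_axis(d, nn, axis=1)).mean(axis=1)
    # y score: fraction of neighbours whose label differs from the point's own
    y_score = (y[nn] != y[:, None]).mean(axis=1)
    return x_score, y_score

# Toy data (hypothetical): two well separated classes with planted errors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.repeat([0, 1], 50)
X[0] = [-20.0, -20.0]  # planted x anomaly
y[60] = 0              # planted label flip

x_score, y_score = knn_scores(X, y, k=5)
suspect_x = np.argsort(-x_score)[:3]  # most isolated observations
suspect_y = np.argsort(-y_score)[:3]  # labels that disagree most with neighbours
```

The same ranking step would apply to scores from an isolation forest or from held-out network losses; only the scoring function changes.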

After having a list of presumed anomalies I could train the model without them and see if performance increases. The number of outliers excluded from training can be chosen with a test set: for what number of excluded outliers is performance best on a separate set without anomalies? Alternatively, it could be chosen with Bayesian techniques.
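The selection step described above could be sketched roughly as follows, assuming a clean validation set is available. Everything here is hypothetical: the nearest-centroid classifier is only a cheap stand-in for the neural network, the suspicion score is simply the distance to the observation's own class centroid, and the candidate counts are arbitrary.

```python
import numpy as np

def centroid_accuracy(X_train, y_train, X_val, y_val):
    # Stand-in model: nearest class centroid. Swap in the real network here.
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_val[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[np.argmin(d, axis=1)]
    return float((pred == y_val).mean())

def choose_num_excluded(X, y, scores, X_val, y_val, candidates):
    order = np.argsort(-scores)  # most suspicious observations first
    results = {}
    for n in candidates:
        keep = order[n:]         # drop the n most suspicious observations
        results[n] = centroid_accuracy(X[keep], y[keep], X_val, y_val)
    best = max(results, key=results.get)  # ties go to the earliest candidate
    return best, results

# Toy data (hypothetical): noisy training labels, clean validation set.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)
y_noisy = y.copy()
flip = rng.choice(200, size=20, replace=False)
y_noisy[flip] = 1 - y[flip]
X_val = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_val = np.repeat([0, 1], 100)

# Suspicion score: distance to the centroid of the (noisy) labelled class.
centroids = np.stack([X[y_noisy == c].mean(axis=0) for c in (0, 1)])
scores = np.linalg.norm(X - centroids[y_noisy], axis=1)

candidates = [0, 10, 20, 40]
best, results = choose_num_excluded(X, y_noisy, scores, X_val, y_val, candidates)
```

Since the validation set is used to pick the count, a third clean set would be needed for an unbiased final performance estimate.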

2) What papers to use? There are not a lot of good papers on this topic, so I have trouble finding good ones to cite. If anybody has some ideas, it would be greatly appreciated.

3) What data to use? Right now I am considering a numerical classification dataset (so no images or audio) which can be modelled with a neural network, but I am not sure where to find good data. After that I can add some noise to the data.
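Adding noise to a clean dataset might look something like the sketch below. It corrupts a chosen fraction of rows in x with large Gaussian noise and flips a chosen fraction of labels in y, and it returns boolean masks so any detector can later be scored against the known ground truth. The function name, the fractions and the noise scale are assumptions, not recommendations.

```python
import numpy as np

def inject_anomalies(X, y, frac_x=0.05, frac_y=0.10, scale=5.0, seed=0):
    rng = np.random.default_rng(seed)
    X_noisy, y_noisy = X.astype(float).copy(), y.copy()
    n, classes = len(y), np.unique(y)

    # x anomalies: add noise several feature standard deviations wide
    x_idx = rng.choice(n, size=int(round(frac_x * n)), replace=False)
    X_noisy[x_idx] += rng.normal(0.0, scale * X.std(axis=0),
                                 size=(len(x_idx), X.shape[1]))

    # y anomalies: replace the label with a different class, chosen uniformly
    y_idx = rng.choice(n, size=int(round(frac_y * n)), replace=False)
    for i in y_idx:
        y_noisy[i] = rng.choice(classes[classes != y[i]])

    # Ground-truth masks recording which observations were corrupted
    x_mask = np.zeros(n, dtype=bool)
    y_mask = np.zeros(n, dtype=bool)
    x_mask[x_idx] = True
    y_mask[y_idx] = True
    return X_noisy, y_noisy, x_mask, y_mask

# Toy clean dataset (hypothetical): 200 observations, 3 features, 3 classes.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 3, size=200)
X_noisy, y_noisy, x_mask, y_mask = inject_anomalies(X, y)
```

Keeping the masks makes it possible to report precision and recall of the detected anomalies, not only the change in model accuracy.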

Thank you very much for your time and have a great day 🙂

submitted by /u/OscarSchyns