[D] An overlooked pitfall in data science is the incorporation of numerical identifiers.

As a professional data scientist and algorithm developer, I often encounter serious errors in the use and design of machine learning software. One of the most overlooked is the incorporation of numerical identifiers in data sets used for machine learning. Many data sets come with identifiers in the feature set; however, they should never be used when validating or estimating models!

Take, for example, the Forensic Science Glass Identification data, which comes with an id column. Using the data as is, I get an error rate of less than 1%, which is too good to be true, and it is. After removing the id column, the error rate is approximately 35%! Details can be found in my code snippet here: http://roasted.space/?page=codeZglass and, in short form, in my blog post: http://roasted.space/?page=blogZglass
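
To illustrate the mechanism, here is a minimal sketch on synthetic data (not the glass data, and not the code from the linked snippet). It assumes the rows are stored grouped by class with a sequential id, which is a common way for an id column to leak the label. The features are pure noise, so an honest model should do no better than chance. The helper names `make_rows` and `loo_error` are made up for this example, and the classifier is a plain 1-nearest-neighbour with leave-one-out validation, chosen only because it fits in a few lines.

```python
import random


def make_rows(n_per_class=30, n_classes=3, n_noise=4, seed=0):
    """Build synthetic rows grouped by class, each with a sequential id.

    Assumption for this sketch: the file is sorted by label, so the id
    encodes the class. The remaining features are uninformative noise.
    """
    rng = random.Random(seed)
    rows = []
    for label in range(n_classes):
        for _ in range(n_per_class):
            row_id = float(len(rows) + 1)
            noise = [0.5 * rng.random() for _ in range(n_noise)]
            rows.append(([row_id] + noise, label))
    return rows


def loo_error(rows, drop_id):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    start = 1 if drop_id else 0  # column 0 is the id
    errors = 0
    for i, (xi, yi) in enumerate(rows):
        best_dist, best_label = None, None
        for j, (xj, yj) in enumerate(rows):
            if i == j:
                continue  # never compare a row with itself
            dist = sum((a - b) ** 2 for a, b in zip(xi[start:], xj[start:]))
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, yj
        if best_label != yi:
            errors += 1
    return errors / len(rows)


if __name__ == "__main__":
    rows = make_rows()
    print("error with id column:   ", loo_error(rows, drop_id=False))
    print("error without id column:", loo_error(rows, drop_id=True))
```

With the id included, the nearest neighbour of almost every row is simply the row stored next to it, which belongs to the same class, so the error looks tiny even though the features carry no signal. Dropping the id before validation brings the error back to roughly chance level, which is the honest answer for this synthetic data.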

submitted by /u/at-roasted-space