[D] An overlooked pitfall in data science: including numerical identifiers as features

As a professional data scientist and algorithm developer, I often encounter serious errors in the use and design of machine learning software. One of the most frequently ignored is the inclusion of numerical identifiers in data sets used for machine learning. Many data sets come with an identifier in the feature set; however, identifiers should never be used when estimating or validating models!

Take, for example, the Forensic Science Glass Identification data, which comes with an id column. Using the data as is, I get an error rate of less than 1%, which looks too good to be true, and it is. After removing the id column, the error rate is approximately 35%! Details can be found in my code snippet here: http://roasted.space/?page=codeZglass and, in short form, in my blog post: http://roasted.space/?page=blogZglass
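To illustrate the mechanism, here is a minimal sketch on synthetic data, not the glass data set and not the code from the linked snippet. It assumes the rows are stored grouped by class and that the id is simply the row number, so the id encodes the label even though it says nothing about the sample itself. The classifier (a hand-written 1-nearest-neighbour with leave-one-out validation), the helper name `loo_error_1nn`, and all data parameters are illustrative choices, so the error rates it prints will differ from the figures above.

```python
import random

def loo_error_1nn(rows, labels):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for i, x in enumerate(rows):
        best_j, best_d = None, float("inf")
        for j, y in enumerate(rows):
            if i == j:
                continue  # leave the query row out of its own neighbour search
            d = sum((a - b) ** 2 for a, b in zip(x, y))  # squared Euclidean distance
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] != labels[i]:
            errors += 1
    return errors / len(rows)

random.seed(0)

# Synthetic data (assumption): 3 classes, rows grouped by class,
# and one weak feature whose class means overlap heavily.
n_classes, per_class = 3, 60
labels, features = [], []
for c in range(n_classes):
    for _ in range(per_class):
        labels.append(c)
        features.append(random.gauss(0.5 * c, 1.0))

# The id is just the row number, so neighbouring ids share a class.
ids = list(range(1, len(labels) + 1))

with_id = [[i, f] for i, f in zip(ids, features)]
without_id = [[f] for f in features]

err_with_id = loo_error_1nn(with_id, labels)
err_without_id = loo_error_1nn(without_id, labels)
print(f"LOO error with id column:    {err_with_id:.1%}")
print(f"LOO error without id column: {err_without_id:.1%}")
```

With the id included, each row's nearest neighbour is almost always an adjacent row of the same class, so the validated error looks excellent even though the model has learned nothing that would transfer to new samples. Dropping the id before fitting and validating gives the honest estimate.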

submitted by /u/at-roasted-space