[Discussion] My boss is convinced you can train an SVM using ASCII integer codes as features
Where do I even begin this rant?
I am a machine learning intern. We have a labelling problem in which we want to classify strings into category “Something” and category “Not Something”. These are not sentences so we can’t use any standard NLP library. My boss is convinced we should turn these strings into ASCII codes, in order to make them “non categorical”, with each feature being the ASCII code for the character in question.
I tried to gently point out that even though they’re numbers, that doesn’t make them quantitative data: is the average of B and D equal to C? (He answered yes to that, btw.)
I told him that if the word ‘apple’ appears at the beginning of the string in one row and at the end of the string in another, the two rows won’t necessarily end up close together. He says the SVM will pick up the pattern: say you have the values 65, 112 and 112 for features 0, 1 and 2 in one row, and the same values for features 10, 11 and 12 in another row; the SVM will “detect the pattern” and put them closer together. “That’s not how support vector machines work.” “Oh really, how many have you done?”
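A minimal sketch of the position problem, using made-up strings rather than our data. It assumes each string is encoded as one ASCII code per character position and zero-padded to a fixed width; the names `ascii_features` and `euclidean` and the width of 16 are hypothetical, chosen only for illustration. Linear and RBF kernels compare rows through dot products or distances between feature vectors, position by position, so distance in this space is a reasonable proxy for what the SVM sees.

```python
import math

def ascii_features(s, width=16):
    """One feature per character position: the ASCII code, zero-padded to `width`."""
    codes = [ord(c) for c in s[:width]]
    return codes + [0] * (width - len(codes))

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

start = "apple_____"   # 'apple' at the start of the string
end = "_____apple"     # same characters, 'apple' at the end
other = "bqqmf_____"   # no 'apple' at all: each letter shifted by one code point

d_same_word = euclidean(ascii_features(start), ascii_features(end))
d_unrelated = euclidean(ascii_features(start), ascii_features(other))

# The unrelated string ends up much closer than the one that contains 'apple'.
print(d_same_word, d_unrelated)
```

In this toy case the string with no ‘apple’ in it sits closer to the first row than the string that contains the same word at a different position, which is the opposite of “detecting the pattern”.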
I ran it anyway. It gets 98% accuracy, but only because in this case “Something” and “Not Something” tend to have radically different lengths. To show him it doesn’t detect patterns, I put a bunch of zeros behind the string, and it obviously did not predict the correct label. He says that doesn’t prove anything, it’s just a “vulnerability”.
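One way to check whether length alone explains the score is a length-only baseline. The sketch below is an assumption about how such a check could look, not something I have run on our data: `best_length_threshold` is a hypothetical helper, and the strings and labels are invented toy examples. If a single cutoff on `len(s)` scores about as well as the SVM, the ASCII features are probably not contributing much beyond length.

```python
def best_length_threshold(strings, labels):
    """Find the length cutoff (and direction) that best separates label 1 from label 0."""
    lengths = [len(s) for s in strings]
    best = (0.0, None, None)  # (accuracy, cutoff, long_is_positive)
    for cutoff in sorted(set(lengths)):
        for long_is_positive in (True, False):
            # Predict 1 for strings on the "positive" side of the cutoff.
            preds = [int((n >= cutoff) == long_is_positive) for n in lengths]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best[0]:
                best = (acc, cutoff, long_is_positive)
    return best

# Made-up toy data, NOT our real strings: 1 = "Something", 0 = "Not Something".
toy_strings = ["ab", "xyz", "q1",
               "a-long-identifier-string",
               "another_long_token_here",
               "yet-another-long-one"]
toy_labels = [0, 0, 0, 1, 1, 1]

accuracy, cutoff, long_is_positive = best_length_threshold(toy_strings, toy_labels)
print(accuracy, cutoff, long_is_positive)
```

On real data the same function would be fitted on the training split and scored on held-out rows, so the comparison with the SVM is fair.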
I am at a loss here. Does anyone have a source I can share with him? Or an alternative way of solving my problem?