[Discussion] Problem: Classify variables in large data set
Hi everyone. I have been presented with a problem and I’m hoping someone here could provide some advice or a direction in which to look.
The problem is this: Given a large data set with, say, 10,000 columns (so 10,000 variables or factors), classify some variables as type A and the others as type B.
More specifically, the data set contains customer data, and some of the variables hold personal information such as customer address, SSN, etc. and need to be classified as private. Since there are so many variables, it is not practical to go through them by hand and mark each one as private or not private. The process needs to be automated, and we have training sets that are already classified, which could be used to train a model to recognize private variables in future, unmarked data sets.
My problem is that I do not know which machine learning techniques are appropriate for this type of classification task. My understanding is that classification methods typically classify each record (a row, in this case a customer) according to the values of its variables. My problem seems to be the inverse: the items to classify are the variables (columns) themselves, based on the values they contain across the records.
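To make the inversion concrete, here is a rough sketch of how I picture it, not a known solution: each column is reduced to a small feature vector, so that columns become the "records" an ordinary classifier can work on. The feature choices, the toy 1-nearest-neighbour rule, and all the names and sample values below are assumptions made up for illustration.

```python
import math

def column_features(values):
    """Summarize one column (a list of strings) as a fixed-length feature vector."""
    values = [v for v in values if v]
    n = len(values) or 1
    chars = "".join(values)
    total_chars = len(chars) or 1
    digit_frac = sum(c.isdigit() for c in chars) / total_chars
    alpha_frac = sum(c.isalpha() for c in chars) / total_chars
    mean_len = sum(len(v) for v in values) / n
    unique_frac = len(set(values)) / n
    # mean_len is divided by 50 (an arbitrary scale) so it is comparable to the fractions.
    return [digit_frac, alpha_frac, mean_len / 50.0, unique_frac]

def nearest_label(features, labeled):
    """1-nearest-neighbour: return the label of the closest training column."""
    return min(labeled, key=lambda item: math.dist(features, item[0]))[1]

# Hypothetical training columns, already marked private / not private.
training_columns = {
    "ssn": (["123-45-6789", "987-65-4321", "555-12-3456"], "private"),
    "address": (["12 Oak St", "99 Elm Ave", "7 Pine Rd"], "private"),
    "plan": (["basic", "basic", "premium"], "not private"),
    "visits": (["3", "3", "7"], "not private"),
}
labeled = [(column_features(vals), label) for vals, label in training_columns.values()]

# An unmarked column from a future data set.
new_column = ["222-33-4444", "111-22-3333", "444-55-6666"]
prediction = nearest_label(column_features(new_column), labeled)
print(prediction)
```

With a real data set, the feature vectors could presumably be fed to any standard classifier in place of the nearest-neighbour rule; the point is only that the columns, not the customers, are the items being classified.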
Furthermore, it would be interesting to identify the specific type of each private variable that appears in a future data set. For example, if one column contains SSNs, can we recognize it as an SSN column and label it accordingly?
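For the SSN case specifically, a simple baseline might be a pattern check rather than a learned model. The sketch below assumes SSNs are stored in the usual ddd-dd-dddd format and flags a column when most of its non-empty values match; the 0.9 threshold and the function names are arbitrary choices for illustration.

```python
import re

# Assumed storage format: three digits, two digits, four digits, separated by dashes.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def match_fraction(values, pattern):
    """Fraction of non-empty values in a column that fully match the pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    return sum(bool(pattern.fullmatch(v)) for v in non_empty) / len(non_empty)

def looks_like_ssn(values, threshold=0.9):
    """Flag a column as SSN-like when enough of its values match the pattern."""
    return match_fraction(values, SSN_PATTERN) >= threshold

print(looks_like_ssn(["123-45-6789", "987-65-4321", ""]))
print(looks_like_ssn(["555-1234", "basic"]))
```

A check like this would miss SSNs stored without dashes, and it does not extend to free-form fields such as addresses, which is part of why I am asking about learned approaches.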
Thank you in advance for any comments or advice.