[D] Selection of randomly generated features
I have some raw data and a list of feature descriptors. Each descriptor defines a function together with its parameters and their domains, which lets me generate a practically unlimited number of random features. Obviously, most of these features are garbage. My goal is to find a subset of features that is sufficient to train a “good enough” model. I suspect that some features will be usable on their own, while others will only be useful in combination with other features.
My current approach is to generate n features, take k of them, and train a tree-based model on them. I then measure the model score and distribute it among the k features in proportion to their feature importances, averaging over a few rounds of cross-validation. Next I take another k of the n features and repeat the process until each of the n features has been tested a number of times. Then I start over with n new features.
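The loop described above can be sketched roughly as follows. This is only an illustration under assumptions: `credit_round`, `evaluate`, and `toy_evaluate` are hypothetical names, and `evaluate` stands in for training the tree-based model with cross-validation and returning its score together with one importance value per feature. The toy version simply pretends that features 0 and 1 are the only useful ones.

```python
import random
from collections import defaultdict


def credit_round(feature_ids, k, repeats, evaluate, rng):
    """Average importance-weighted score per feature over several random subsets.

    `evaluate(subset)` is a hypothetical callback returning
    (score, importances), with one non-negative importance per feature.
    """
    credit = defaultdict(float)
    trials = defaultdict(int)
    for _ in range(repeats):
        pool = list(feature_ids)
        rng.shuffle(pool)
        # Cut the shuffled pool into chunks of k, so every feature is
        # tested exactly once per repeat.
        for start in range(0, len(pool), k):
            subset = pool[start:start + k]
            score, importances = evaluate(subset)
            total = sum(importances) or 1.0  # guard against all-zero importances
            for feature, importance in zip(subset, importances):
                credit[feature] += score * importance / total
                trials[feature] += 1
    return {f: credit[f] / trials[f] for f in feature_ids}


def toy_evaluate(subset):
    """Stand-in for model training: only features 0 and 1 carry signal."""
    useful = {0, 1}
    importances = [1.0 if f in useful else 0.0 for f in subset]
    score = sum(importances) / len(useful)
    return score, importances


rng = random.Random(0)
avg_credit = credit_round(range(20), k=5, repeats=10, evaluate=toy_evaluate, rng=rng)
ranked = sorted(avg_credit, key=avg_credit.get, reverse=True)
```

In a real run, `evaluate` would fit the tree-based model on the columns in `subset` and return, for instance, the mean cross-validated score and the model's importances. Averaging the credit over the number of trials keeps features comparable even if some end up being tested more often than others.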
I am aware that there is a very high chance that I will miss some great feature combinations, and I do not see how this could be avoided. Nevertheless, I would like to improve the process. One idea is to randomly pick some of the best-scoring features from previous rounds and train the model on them together with new features. That way I might at least discover features that complement the ones already known to be good.
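A minimal sketch of that idea, assuming each feature already has an average score from earlier rounds. The names `next_pool`, `new_feature`, and `top_fraction` are hypothetical, and `new_feature` stands in for drawing a fresh random feature from the descriptors.

```python
import random


def next_pool(avg_credit, n_keep, n_new, new_feature, rng, top_fraction=0.25):
    """Mix randomly chosen top-scoring old features with freshly generated ones.

    `avg_credit` maps feature id -> average score from earlier rounds.
    `new_feature(rng)` is a hypothetical callback that creates one new feature.
    """
    ranked = sorted(avg_credit, key=avg_credit.get, reverse=True)
    n_keep = min(n_keep, len(ranked))
    # Sample survivors from the top slice instead of always taking the very
    # best, so that different good features get paired with new candidates.
    top = ranked[:max(n_keep, int(len(ranked) * top_fraction))]
    survivors = rng.sample(top, n_keep)
    fresh = [new_feature(rng) for _ in range(n_new)]
    return survivors + fresh


rng = random.Random(0)
old_credit = {f"old_{i}": i / 10 for i in range(10)}  # old_9 scored best
counter = iter(range(1000))
pool = next_pool(
    old_credit,
    n_keep=3,
    n_new=5,
    new_feature=lambda r: f"new_{next(counter)}",
    rng=rng,
)
```

With ten old features and `top_fraction=0.25`, the top slice here is exactly the three best features, so all three survive. With a larger history the sampling becomes genuinely random among the better features.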
Do you know of similar techniques that I could use for inspiration? Or do you think I should approach the problem completely differently? Any input is welcome.
submitted by /u/kalabele