[D] Instrumenting a differential list of apartment complex features based on real choices (between complex A and B, B was chosen) in order to perform feature selection and figure out the most important apartment complex features related to choice
Good afternoon ML community,
I am approaching this problem from a supervised machine learning perspective, since that is where the majority of my experience is, so I need a sanity check on whether this approach is correct or if I should be using a different approach altogether.
Let's say I have data on approximately 600 apartment complexes, each with about 50-100 features ("amenities"). These include "pool" or "no pool", "pets allowed", "no pets allowed", "small pets allowed", "more expensive", "less expensive", etc.
I also have, for about 15 of these complexes, choice data on rental losses. So, for these 15, every time somebody chose another complex instead, they were surveyed and revealed which alternative they chose. There are about 100 "lost choices" for each of the 15 complexes. My goal is to construct the data in such a way that I can do feature selection on the amenities to figure out which ones play most prominently into choosing another complex, to help understand how to improve the initial 15 complexes.
The approach I was thinking about implementing was constructing a dataset based on differentials and similarities. Each "choice" becomes two datapoints: one with a list of amenities in complex A vs. complex B, and then a counterpoint for the opposite. It would look like this:
For the datapoint where complex B is chosen, which we'll label with an output of "1" for "chosen", the input vector would contain a value of 0-3 for every amenity in the matrix:
B has this amenity but A doesn't: 0
A has this amenity but B doesn't: 1
Both complexes have this amenity: 2
Neither complex has this amenity: 3
Then we would create the complementary datapoint, where the A and B vector differentials are swapped (A has this amenity but B doesn't: 1, etc.) and the output label would be 0 for "not chosen".
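To make the encoding concrete, here's a minimal sketch of how one choice event could be turned into the two mirrored datapoints described above. The amenity vocabulary, function name, and sets are all hypothetical placeholders:

```python
# Hypothetical toy amenity vocabulary (the real one would have 50-100 entries).
AMENITIES = ["pool", "pets allowed", "gym"]

def encode_pair(a, b, amenities=AMENITIES):
    """Encode one (A, B) pair into the 0-3 differential vector,
    from the perspective of the datapoint where B was chosen."""
    vec = []
    for am in amenities:
        in_a, in_b = am in a, am in b
        if in_b and not in_a:
            vec.append(0)   # B has this amenity but A doesn't
        elif in_a and not in_b:
            vec.append(1)   # A has this amenity but B doesn't
        elif in_a and in_b:
            vec.append(2)   # both complexes have it
        else:
            vec.append(3)   # neither complex has it
    return vec

# One choice event (B chosen over A) -> two mirrored datapoints:
x_chosen = encode_pair(a={"pool"}, b={"pool", "gym"})  # label 1: [2, 3, 0]
x_mirror = encode_pair(a={"pool", "gym"}, b={"pool"})  # label 0: [2, 3, 1]
```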
Logically this makes sense to me, but I can't help but think I am overcomplicating it, and I can't think of any other way to instrument the data. Once it's instrumented this way, I could either build a classifier (xgboost) and look at the feature importances, or do feature selection analysis on the data to come up with which features to focus on. Does this seem like a good approach, or are there some glaringly obvious drawbacks and/or better tools for this?
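For what it's worth, the classifier-plus-importances step could look something like this sketch, using scikit-learn's GradientBoostingClassifier as a stand-in for xgboost (both expose the same `feature_importances_` attribute). All of the data here is synthetic, with one amenity deliberately planted as the driver of choice:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_pairs, n_amenities = 1500, 10
amenity_names = [f"amenity_{i}" for i in range(n_amenities)]

# Synthetic differential vectors: each entry is one of the 0-3 codes above.
X = rng.integers(0, 4, size=(n_pairs, n_amenities))
# Plant a signal: B gets chosen whenever B has amenity_0 and A doesn't (code 0).
y = (X[:, 0] == 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=100).fit(X, y)

# Rank amenities by learned importance; amenity_0 should dominate.
ranked = sorted(zip(amenity_names, clf.feature_importances_),
                key=lambda t: -t[1])
top_amenity = ranked[0][0]
```

One design note: the 0-3 codes are categorical, not ordinal, so feeding them to a tree model as raw integers imposes an ordering that may not be meaningful; one-hot encoding each code (or splitting into "diff" and "both/neither" indicators) might be worth trying as well.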