[D] Methods to perform unsupervised similarity scoring
I have a task and I don’t know how to tackle it. I received a set of positives and I have to find similar points in a big dataset (which I call the basket). I have around 1’000 positives and around 1’000’000 points in the basket. All points are represented by 10 to 15 features. As output, I would like a score for each point in the basket, where the score represents how close the point is to the positive set.
I first thought of using a k-nearest neighbours method on the positives, but this approach has two big drawbacks for me. First, I wouldn’t have a score associated with each point of the basket, as I would only get a set of close points for each positive. Second, and this is the biggest drawback in my opinion, I would have to define the distance in the n-dimensional space myself, whereas I would prefer the method to derive a weight for each feature directly from the data (for instance, based on the amount of information (variance) contained in each feature).
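To make the kNN idea concrete, here is a minimal sketch of one variant that does give every basket point a score: instead of looking up neighbours for each positive, look up the k nearest positives for each basket point and use the negated mean distance as the score. The function name `knn_closeness_scores`, the parameters `k` and `chunk_size`, and the choice to standardize each feature with the basket's mean and standard deviation are all assumptions made for illustration. Standardizing only puts the features on a common scale; it does not learn which features are informative.

```python
import numpy as np

def knn_closeness_scores(positives, basket, k=5, chunk_size=10_000):
    """Score each basket point by its mean distance to its k nearest positives.

    Higher score means closer to the positive set (scores are <= 0).
    Requires k <= number of positives.
    """
    # Assumption: put all features on a common scale using basket statistics.
    mean = basket.mean(axis=0)
    std = basket.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid division by zero
    pos = (positives - mean) / std
    pos_sq = (pos ** 2).sum(axis=1)

    scores = np.empty(len(basket))
    # Process the basket in chunks so the distance matrix stays small.
    for start in range(0, len(basket), chunk_size):
        chunk = (basket[start:start + chunk_size] - mean) / std
        # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
        d2 = (chunk ** 2).sum(axis=1)[:, None] + pos_sq[None, :] - 2.0 * chunk @ pos.T
        np.maximum(d2, 0.0, out=d2)  # clip tiny negatives from rounding
        # The first k columns after partitioning hold the k smallest distances.
        nearest = np.partition(d2, k - 1, axis=1)[:, :k]
        scores[start:start + chunk_size] = -np.sqrt(nearest).mean(axis=1)
    return scores

# Toy data: 12 features, positives centred at 2, most of the basket centred at 0,
# with 50 basket points planted near the positives at the end.
rng = np.random.default_rng(0)
positives = rng.normal(loc=2.0, size=(100, 12))
basket = np.vstack([
    rng.normal(loc=0.0, size=(5000, 12)),
    rng.normal(loc=2.0, size=(50, 12)),
])
scores = knn_closeness_scores(positives, basket, k=5)
ranking = np.argsort(-scores)  # basket indices, closest to the positives first
```

Sorting by the score gives a ranking of the whole basket. The chunking keeps memory bounded, since each chunk only builds a `chunk_size` by number-of-positives distance matrix. The distance is still plain Euclidean on standardized features, so the second drawback is only partly addressed.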
Could someone point me to a good approach to tackle this problem?