[D] Threshold for rejecting word embedding similarities
I have a set of target words that I need to match against words found in new CSV files. Are there any good approaches for choosing the similarity threshold below which a match should be rejected? One idea is to take a random sample of 10k words and plot the distribution of their pairwise similarities (10,000 × 9,999 / 2, roughly 50 million pairs), but I am not sure whether this is the right approach. Alternatively, should I compute the similarities between each target word and the whole vocabulary, and choose a percentile of that distribution as the cutoff? Any ideas?
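
To make the two options concrete, here is a minimal sketch of both, assuming cosine similarity and embeddings stored as NumPy arrays with one row per word. The function names, the 99th percentile, and the random vectors standing in for real embeddings are all illustrative assumptions, not a recommendation for particular values.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def background_threshold(vocab_vecs, sample_size=1000, percentile=99.0, seed=0):
    """Option 1: one global cutoff from pairwise similarities of a random sample."""
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(vocab_vecs))
    idx = rng.choice(len(vocab_vecs), size=n, replace=False)
    sims = cosine_similarity_matrix(vocab_vecs[idx], vocab_vecs[idx])
    # Keep each unordered pair once: n * (n - 1) / 2 values above the diagonal.
    pair_sims = sims[np.triu_indices(n, k=1)]
    return float(np.percentile(pair_sims, percentile))

def per_target_thresholds(target_vecs, vocab_vecs, percentile=99.0):
    """Option 2: one cutoff per target word from its similarities to the vocabulary."""
    sims = cosine_similarity_matrix(target_vecs, vocab_vecs)
    return np.percentile(sims, percentile, axis=1)

# Placeholder data (assumption): random vectors instead of real embeddings.
rng = np.random.default_rng(42)
vocab_vecs = rng.normal(size=(5000, 50))
target_vecs = rng.normal(size=(3, 50))

global_cut = background_threshold(vocab_vecs)
per_target = per_target_thresholds(target_vecs, vocab_vecs)
print("global cutoff:", global_cut)
print("per-target cutoffs:", per_target)
```

A candidate word would then be rejected when its similarity to a target falls below the chosen cutoff. The per-target version adapts to words that sit in dense regions of the embedding space, while the global version is cheaper and gives a single number to reason about.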