Datasets of fully semantically equivalent sentences [Discussion]
Hello all,
I am not sure if this is the correct sub for this, so let me know if not. I’ve posted it in r/MLQuestions as well, but since it is both a question and an attempt to collect datasets of a certain type for research, I’m posting it here too:
I am looking for a short-text classification dataset for de-duping semantically equivalent sentences. Most text classification datasets I can find online classify text into a relatively small number of topics and don’t have classes of fully semantically equivalent sentences. For example, I want something with a class containing samples like “where is the cake?”, “where can I find the cake?”, “what is the location of the cake?”, etc. Instead I find datasets where these sentences are labeled “cake”, and the same class also has other sentences like “do you like cake?”, “what is your favorite cake?”, etc. I can’t find a short-text dataset in which the samples in each class are fully semantically equivalent rather than just sharing a general topic. I imagine such a dataset would need at least thousands of classes, if not more, to be reasonable, since there are so many semantically unique English sentences.
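To make the difference concrete, here is a rough sketch of the two labeling schemes. The sample data and every name in it (`topic_labeled`, `equivalence_labeled`, `group_by_label`, `dedupe`, and the label strings) are made up for illustration and are not taken from any real dataset.

```python
from collections import defaultdict

# Hypothetical samples as (sentence, label) pairs; labels are invented.

# Topic-style labeling: everything about cake shares one class.
topic_labeled = [
    ("where is the cake?", "cake"),
    ("where can I find the cake?", "cake"),
    ("what is the location of the cake?", "cake"),
    ("do you like cake?", "cake"),
    ("what is your favorite cake?", "cake"),
]

# Equivalence-style labeling: one class per distinct meaning.
equivalence_labeled = [
    ("where is the cake?", "cake_location"),
    ("where can I find the cake?", "cake_location"),
    ("what is the location of the cake?", "cake_location"),
    ("do you like cake?", "likes_cake"),
    ("what is your favorite cake?", "favorite_cake"),
]


def group_by_label(samples):
    """Collect sentences into a dict keyed by their class label."""
    groups = defaultdict(list)
    for sentence, label in samples:
        groups[label].append(sentence)
    return dict(groups)


def dedupe(samples):
    """Keep only the first sentence seen for each class label."""
    first_seen = {}
    for sentence, label in samples:
        first_seen.setdefault(label, sentence)
    return list(first_seen.values())


if __name__ == "__main__":
    print(len(group_by_label(topic_labeled)), "topic class(es)")
    print(len(group_by_label(equivalence_labeled)), "equivalence class(es)")
    print(dedupe(equivalence_labeled))
```

With topic labels, de-duping by class would collapse all five sentences into one and lose the distinct questions. With equivalence labels, it keeps one representative per meaning, which is the behavior a de-duping model would need to learn.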
Everything I have found so far is summarized in this three-year-old repo:
https://github.com/brmson/dataset-sts
Does anyone know of any other such datasets?
Thank you!
submitted by /u/LartTheLuser