Datasets of fully semantically equivalent sentences [Discussion]

Hello all,

I am not sure if this is the correct sub for this, so let me know if it is not. I’ve also posted this in r/MLQuestions, but since it is both a question and an attempt to collect datasets of a certain type for research, I’m posting it here as well:

I am looking for a short-text classification dataset for de-duplicating semantically equivalent sentences. Most text classification datasets I can find online classify text into a relatively small number of topics and don’t have classes of fully semantically equivalent sentences. For example, I want a class with samples like “where is the cake?”, “where can I find the cake?”, “what is the location of the cake?”, etc. Instead, I find datasets where these sentences are labeled “cake” and share that class with other sentences like “do you like cake?”, “what is your favorite cake?”, etc. I can’t find a short-text dataset in which the samples in each class are fully semantically equivalent rather than just sharing a general topic. I imagine such a dataset would need at least thousands of classes, if not more, to be reasonable, since there are so many semantically distinct English sentences.
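To make the distinction concrete, here is a minimal Python sketch using made-up toy data. The class IDs and the `group_by` and `dedupe` helpers are hypothetical illustrations, not taken from any existing dataset or library. It shows the same five sentences grouped once by topic label and once by equivalence class, and then de-duplicated by keeping one sentence per equivalence class.

```python
from collections import defaultdict

# Hypothetical toy data: (sentence, topic_label, equivalence_class_id).
# The class IDs are assumptions made up for this illustration.
SAMPLES = [
    ("where is the cake?", "cake", 0),
    ("where can I find the cake?", "cake", 0),
    ("what is the location of the cake?", "cake", 0),
    ("do you like cake?", "cake", 1),
    ("what is your favorite cake?", "cake", 2),
]


def group_by(samples, key_index):
    """Group sentences by the label stored at position key_index."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key_index]].append(sample[0])
    return dict(groups)


def dedupe(samples):
    """Keep the first sentence seen for each equivalence class."""
    seen = set()
    kept = []
    for sentence, _topic, class_id in samples:
        if class_id not in seen:
            seen.add(class_id)
            kept.append(sentence)
    return kept


by_topic = group_by(SAMPLES, 1)   # what most datasets give: one broad "cake" class
by_class = group_by(SAMPLES, 2)   # what I am after: one class per meaning
deduped = dedupe(SAMPLES)

print(len(by_topic), "topic class;", len(by_class), "equivalence classes")
print(deduped)
```

With topic labels all five sentences land in a single class, so the labels give no signal for de-duplication. With equivalence-class labels the three "where is the cake" variants collapse into one class while the other two questions stay separate.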

Everything I have found so far is summarized in this 3-year-old repo:

Does anyone know of any other such datasets?

Thank you!

submitted by /u/LartTheLuser