Datasets of fully semantically equivalent sentences [Discussion]

Hello all,

I am not sure if this is the correct sub for this question, so let me know if it isn’t. I’ve also posted it in r/MLQuestions, but since it is both a question and an attempt to collect datasets of a certain type for research, I’m posting it here as well:

I am looking for a short-text classification dataset for de-duping semantically equivalent sentences. It seems that most text classification datasets I can find online classify text into a relatively small number of topics but don’t have classes of fully semantically equivalent sentences. For example, I want something that has a class with samples like “where is the cake?”, “where can I find the cake?”, “what is the location of the cake?”, etc. Instead, I find datasets where these sentences are labeled “cake” and share that class with other sentences like “do you like cake?”, “what is your favorite cake?”, etc. I can’t find a short-text dataset in which the samples in each class are fully semantically equivalent rather than just sharing a general topic. I imagine such a dataset would need at least thousands of classes, if not more, to be reasonable, since there are so many semantically unique English sentences.
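To make the target format concrete, here is a minimal sketch in Python of how such classes could be derived, assuming the data is available as pairs of sentences already judged to be equivalent. The pair format, the function name `group_equivalent`, and the sentence “are you fond of cake?” are hypothetical and only for illustration. The sketch merges pairs transitively with a union-find structure, so each resulting group is one class of equivalent sentences rather than a topic.

```python
from collections import defaultdict


def group_equivalent(pairs):
    """Group sentences into equivalence classes from (a, b) pairs marked equivalent.

    Assumes equivalence is transitive: if a ~ b and b ~ c, then a, b, c share a class.
    """
    parent = {}

    def find(s):
        # Register unseen sentences as their own root, then walk up to the root.
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving keeps the trees shallow
            s = parent[s]
        return s

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    classes = defaultdict(list)
    for s in parent:
        classes[find(s)].append(s)
    return list(classes.values())


# Hypothetical input: pairs of sentences judged to mean the same thing.
pairs = [
    ("where is the cake?", "where can I find the cake?"),
    ("where can I find the cake?", "what is the location of the cake?"),
    ("do you like cake?", "are you fond of cake?"),
]

classes = group_equivalent(pairs)
for group in classes:
    print(group)
```

With this input, the three “where is the cake” variants end up in one class and the two “do you like cake” variants in another, even though a topic-level dataset would label all five as “cake”.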

Everything I have found so far is summarized in this 3-year-old repo:

https://github.com/brmson/dataset-sts

Does anyone know of any other such datasets?

Thank you!

submitted by /u/LartTheLuser