[D] I’m looking for success/failure stories applying unsupervised document embedding techniques
Hey everyone! 🙂
As the title says, I am looking for both success stories and disappointing failures from applying modern unsupervised document embedding techniques to actual problems (as opposed to academic benchmarks, toy datasets, academic evaluation tasks, etc.). The main focus is naturally on industry uses for business/product problems, but I would also love to hear about cases from government bodies, non-profits, research (with empirical measurement, and where document embedding is one of the tools rather than the subject of the research) and any other “real life” use. I would love to hear about your own experience, but connecting me to people you know, or even pointing me towards companies or projects that you know used (or tried to use) these techniques, would also be of tremendous help.
What’s in it for you? Well, I’m preparing a talk for the data science track of the CodeteCON #KRK5 conference, based on my literature-review-style blog post on document embedding techniques. While I feel I have a pretty good overview of the academic papers, benchmarks and SOTA status, up to the most recent work in the field, I can’t say the same for uses in industry. I have a partial view from my experience in one ongoing project that actually uses this, plus experience shared by some of my data scientist friends (all in Israel, naturally) – and most of it, so far, suggests that averaging (good) word embeddings is a very tough “baseline” to beat.
This is why I thought that reaching out to get a better sense of things in industry world-wide, and enriching my talk with the status of actual successes and applications, would give attendees more value and help make my talk a real status report on the topic.
And (coming back to WIIFM) naturally, I intend to share any (shareable) knowledge I accumulate not only in my talk, but also by adding a dedicated section to the aforementioned blog post, and maybe even by writing an extended post around it (if enough interesting trends and issues come up). So, hopefully, if you are (like me) interested in this, we might also end up building, together, a nice overview of where the industry stands at the moment.
Which modern techniques am I talking about (so no variants of bag-of-words or topic modeling)? These are the ones I know of (I’d love to hear about others!):
- n-gram embeddings
- Averaging word embeddings (including all variants, e.g. SIF)
- Paragraph vectors (doc2vec)
- Skip-thought vectors
- Quick-thought vectors
- Word Mover’s Embedding (WME)
- Sentence-BERT (SBERT)
- GPT/GPT-2 (can also be supervised)
- Universal Sentence Encoder (can also be supervised)
- GenSen (can also be supervised)
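For concreteness, here is a minimal sketch of the averaging-word-embeddings baseline mentioned above: embed a document by mean-pooling the vectors of its in-vocabulary tokens, then compare documents with cosine similarity. The toy random vocabulary and the 50-dimensional size are purely illustrative assumptions; in practice you would load pretrained vectors (e.g. GloVe or fastText):

```python
import numpy as np

# Hypothetical toy vocabulary with random word vectors, standing in for
# pretrained embeddings (GloVe, fastText, word2vec, ...).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50)
         for w in "the cat sat on mat dog ran in park".split()}

def embed_doc(tokens, vocab):
    """Average the vectors of in-vocabulary tokens (the plain baseline)."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    if not vecs:  # no known tokens -> zero vector
        return np.zeros_like(next(iter(vocab.values())))
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = embed_doc("the cat sat on the mat".split(), vocab)
d2 = embed_doc("the dog ran in the park".split(), vocab)
print(cosine(d1, d2))
```

Variants like SIF replace the plain mean with frequency-based weights and then remove the first principal component of the document matrix, but the core idea stays this simple – which is part of why it is such a stubborn baseline.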
Thank you and cheers,