[D] Statistical/ML analysis of intention + wordnets, phrasenets
I’m having a mental struggle right now trying to understand how I would go about programming this, and I’m not even sure it’s feasible.
Let’s say we’re analyzing song lyrics. Suppose, hypothetically, that whenever the word “darkness” appears in a lyric, there is a 23% chance that the word “night” also appears and a 14% chance that the word “doubt” does too.
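To make the word-level statistic concrete, here is a minimal sketch of how such a percentage could be estimated. It assumes each lyric is a plain string and uses a naive whitespace tokenizer; the function name `cooccurrence_rate` and the toy corpus are made up for illustration.

```python
def cooccurrence_rate(lyrics, reference, other):
    """Estimate P(`other` appears in a lyric | `reference` appears in it)."""
    # Treat each lyric as a set of lowercase tokens (naive whitespace split).
    token_sets = [set(lyric.lower().split()) for lyric in lyrics]
    with_reference = [tokens for tokens in token_sets if reference in tokens]
    if not with_reference:
        return 0.0
    hits = sum(1 for tokens in with_reference if other in tokens)
    return hits / len(with_reference)


# Hypothetical toy corpus, one string per lyric.
lyrics = [
    "darkness falls at night",
    "darkness and doubt",
    "the night is young",
    "darkness in the night sky",
]
night_rate = cooccurrence_rate(lyrics, "darkness", "night")
doubt_rate = cooccurrence_rate(lyrics, "darkness", "doubt")
```

On a real corpus the tokenizer would need to handle punctuation and contractions, which this sketch ignores.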
A second and more complex relationship would be that of phrases. We could imagine that whenever the word “darkness” is mentioned, there is a 3.2% chance that the phrase “I’m scared” is somewhere in the lyric and 0.9% chance that the phrase “going to die” is also there.
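The phrase version could be sketched the same way, by checking whether the phrase occurs as a contiguous run of tokens. This is only an illustration under simple assumptions: the tokenizer is a rough regex, and the names `tokenize`, `contains_phrase` and `phrase_rate` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; keeps contractions such as "i'm" together."""
    normalized = text.lower().replace("’", "'")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", normalized)


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def phrase_rate(lyrics, reference, phrase):
    """Estimate P(`phrase` appears in a lyric | `reference` appears in it)."""
    phrase_tokens = tokenize(phrase)
    with_reference = [t for t in map(tokenize, lyrics) if reference in t]
    if not with_reference:
        return 0.0
    hits = sum(contains_phrase(t, phrase_tokens) for t in with_reference)
    return hits / len(with_reference)


# Hypothetical toy corpus.
lyrics = [
    "In the darkness I'm scared",
    "Darkness, my old friend",
    "I'm scared of the light",
]
scared_rate = phrase_rate(lyrics, "darkness", "I'm scared")
```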
A third addition to the complexity would be to add sentiment analysis with a machine learning version of a wordnet that analyzes not only the related words but the related moods.
A fourth addition to the complexity would be morphosyntactic analysis. Counting the literal phrase “I’m scared” is too narrow, as a “scared” sentence can have many possible subjects, but the pattern is more likely to be frequent if we define it as “noun + [to be, present tense] + scared”. This would cover “I’m scared”, “he’s scared”, “we’re scared”, “my son is scared”, etc. We could then also allow for adverbs and other insertions (‘our family is, therefore, exceptionally scared’).
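As a very rough sketch of the “noun + [to be, present tense] + scared” idea, a regular expression can approximate the pattern. It is only a stand-in: it cannot verify that the subject really is a noun or pronoun, and a real implementation would more likely rely on a part-of-speech tagger or dependency parser (for example from spaCy or NLTK). The names `PATTERN` and `find_scared_subjects`, and the limit of three intervening words, are assumptions for illustration.

```python
import re

# Present-tense "to be", written out ("is") or contracted ("'s").
# Caveat: "'s" can also be a possessive, which this crude pattern ignores.
BE_PRESENT = r"(?:\s+(?:am|is|are)|['’](?:m|s|re))"

PATTERN = re.compile(
    r"\b(?P<subject>\w+)"        # the word right before the verb
    + BE_PRESENT
    + r"\b(?:[\s,]+\w+){0,3}?"   # up to three intervening words (adverbs etc.)
    + r"[\s,]+scared\b",
    re.IGNORECASE,
)


def find_scared_subjects(text):
    """Return the word preceding 'to be' in each '<subject> is ... scared' match."""
    return [m.group("subject").lower() for m in PATTERN.finditer(text)]


examples = [
    "I'm scared",
    "he's scared",
    "we're scared",
    "my son is scared",
    "our family is, therefore, exceptionally scared",
    "the night is dark",
]
results = {text: find_scared_subjects(text) for text in examples}
```

The pattern will produce false positives (for instance “this is the scared child”), which is exactly where proper morphosyntactic tagging would be needed.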
The bad way
My current thinking comes from traditional programming: for that analysis to occur, we would take a reference word, go through the rest of the corpus words and count the occurrences of each corpus word, store all of those counts in an array belonging to the reference word we were analyzing, and then repeat that for every word in a text. That would be insanely expensive and would get nowhere.
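For reference, the counting itself does not have to re-scan the corpus once per reference word. A minimal sketch, assuming whitespace tokenization and hypothetical function names, collects all word and pair counts in a single pass over the lyrics. The cost still grows roughly with the square of the number of distinct words in each lyric, so this is an illustration, not a claim about how it scales on a large corpus.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(lyrics):
    """One pass over the corpus: per-word and per-pair lyric counts."""
    word_counts = Counter()  # number of lyrics containing each word
    pair_counts = Counter()  # number of lyrics containing both words of a pair
    for lyric in lyrics:
        # Sorted unique words, so each pair is stored in one canonical order.
        words = sorted(set(lyric.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts


def conditional_rate(word_counts, pair_counts, reference, other):
    """Estimate P(`other` in lyric | `reference` in lyric) from the counts."""
    if word_counts[reference] == 0:
        return 0.0
    pair = tuple(sorted((reference, other)))
    return pair_counts[pair] / word_counts[reference]


# Hypothetical toy corpus.
lyrics = [
    "darkness night",
    "darkness doubt",
    "night sky",
    "night darkness fear",
    "night",
]
word_counts, pair_counts = count_cooccurrences(lyrics)
night_given_darkness = conditional_rate(word_counts, pair_counts, "darkness", "night")
darkness_given_night = conditional_rate(word_counts, pair_counts, "night", "darkness")
```

Note that the two directions differ: the rate of “night” given “darkness” is not the same as the rate of “darkness” given “night”.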
The ideal but unknown way
A cheaper way to do this would be with an AI model plus a vector or matrix data type. I’ve been exploring the kinds of AI models that exist, but I’m very new to this and don’t know which one is most appropriate or which analysis algorithm would be best. I’m not even sure whether it can be done with our current technology in this exact way, or whether the results would differ from the ones I described. Perhaps an AI would not report the statistical tendency itself but would instead give a 0-100 rating of the “feel” it gets for how “similar” one word is to another based on their common context. How accurate would this be statistically?
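To illustrate the difference between a co-occurrence statistic and a similarity “feel”, here is a small sketch that compares words by the contexts they appear in, using cosine similarity. Raw context counts stand in for the dense vectors a trained model would produce; that substitution, the function names and the toy corpus are all assumptions for illustration.

```python
import math
from collections import Counter


def context_vector(lyrics, word):
    """Count the other words found in lyrics that contain `word`."""
    vector = Counter()
    for lyric in lyrics:
        tokens = set(lyric.lower().split())
        if word in tokens:
            vector.update(tokens - {word})
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: "darkness" and "shadow" never share a lyric.
lyrics = [
    "darkness fills the night",
    "shadow fills the night",
    "sunshine warms the day",
]
darkness = context_vector(lyrics, "darkness")
shadow = context_vector(lyrics, "shadow")
sunshine = context_vector(lyrics, "sunshine")

similar = cosine_similarity(darkness, shadow)      # same contexts
different = cosine_similarity(darkness, sunshine)  # only "the" in common
```

Here “darkness” and “shadow” get the maximum similarity even though they never co-occur, so a score like this measures shared context and should not be read as a probability.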
I’ve been excited about BERT recently, but I’m not experienced enough to draw my own conclusions on the topic.
- How feasible do you think this would be?
- What are your thoughts about the necessary implications and existing ways to approach them?
- What similar projects do you know of that are being developed right now?
- How would someone interested in this go about learning more about it specifically, without much experience in machine learning in general?