[P] Machine learning application to identify “risky” words
I am working on a project to build a model that extracts the words in a sentence that are related to risk (the extracted words will be stored in an array afterwards). I have a large data set (around 27k lines).
Example words: injury, collision, police, hit, fatal, etc.
I am doing this in Python with the scikit-learn library. Any suggestions on how to approach this?
So far, I have managed to apply TF-IDF to the data and print each word with its TF-IDF score, but I’m not sure whether this is useful at all.
It does output the “risky” words, but it also outputs all the other words that I do not need. The only way I can filter out the risk words is by typing them into a separate file and comparing word by word, but there is no machine learning in that, and I would really like to apply some sort of machine learning (maybe naive Bayes?). I am willing to label some data if it helps, making this supervised instead of unsupervised as it is now.
Any help is appreciated 🙂
submitted by /u/abdane