[Discussion] NLP, on numbers inside word embeddings
Let me preface by saying that I am new to NLP, so it is very likely that a good solution to my problem already exists and in that case, I would really appreciate being pointed in the right direction 🙂
I am working on a machine reading comprehension task where the inputs often contain numbers in addition to words. I initially wanted to use pre-trained word embeddings, but I am not sure how numerical data are represented when numbers are treated as words and multiplied by an embedding matrix. Worse, only numbers that occurred in the training set would have a representation, unless I am missing something. I could extract numbers from sentences before putting the non-numerical words through an embedding layer and treat them separately, but it would be easier if a pre-trained word embedding layer could take care of it all.
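To make the "treat them separately" option concrete, here is a minimal sketch of what I mean: split a tokenized sentence into number tokens and word tokens before the embedding layer. The regex and the function name are just placeholders I made up, not from any library.

```python
import re

# Matches plain integers and decimals like "42" or "-3.5" (a deliberately
# simple pattern; real text would need a more robust number detector).
NUMBER_RE = re.compile(r"^-?\d+(\.\d+)?$")

def split_tokens(tokens):
    """Split tokens into (position, value) number pairs and (position, word) pairs."""
    numbers, words = [], []
    for i, tok in enumerate(tokens):
        if NUMBER_RE.match(tok):
            numbers.append((i, float(tok)))  # handled by a separate numeric path
        else:
            words.append((i, tok))           # goes through the embedding layer
    return numbers, words

# split_tokens(["it", "costs", "42", "dollars"])
# → ([(2, 42.0)], [(0, "it"), (1, "costs"), (3, "dollars")])
```

The positions are kept so the two streams can be merged back into the original sentence order after embedding.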
As far as I can tell, the best way to represent both numbers and words via embedding vectors would be to introduce two extra dimensions: one that specifies the type (1 for numerical vs. 0 for vocabulary), and one that contains a floating-point representation of the original number. At the level of the embedding matrix, this suggests the matrix can be put in block-diagonal form, but even if it is not, it should not be a problem: I figure the rest of the network can learn that if the first component of a word's embedding vector is 1, it should ignore all components but the second one, and vice versa.
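Here is a rough sketch of the scheme I'm describing, assuming a toy vocabulary and a random stand-in for the pre-trained matrix (both are made up for illustration): each token maps to an (N + 2)-dimensional vector where slot 0 is the type flag, slot 1 is the raw numeric value, and the remaining N slots hold the pre-trained word dims.

```python
import numpy as np

# Placeholder vocabulary and "pre-trained" embedding matrix of shape (V, N).
vocab = {"the": 0, "cat": 1, "sat": 2}
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(len(vocab), 4))  # N = 4

def embed_token(token: str) -> np.ndarray:
    """Return an (N + 2)-dim vector: [type flag, numeric value, word dims...]."""
    n = pretrained.shape[1]
    vec = np.zeros(n + 2)
    try:
        value = float(token)
        vec[0] = 1.0   # type flag: numerical
        vec[1] = value # raw value lives in the second slot
    except ValueError:
        vec[2:] = pretrained[vocab[token]]  # vocab word: flag stays 0
    return vec
```

So `embed_token("3.5")` starts with `[1.0, 3.5, 0, 0, ...]`, while `embed_token("cat")` starts with `[0.0, 0.0, ...]` followed by the pre-trained dims, which is the block structure I was gesturing at.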
This solution is similar to treating numerical and non-numerical data separately, but the advantage is that the pre-trained embedding takes care of it, and once you are past it, you've got an N-dimensional representation for every token in your sentence, including digits and numbers written out as words, without losing any information about those numbers.
I can go ahead and implement this, but since I have not seen this solution in existing projects (it could well be that I was not looking in the right place), I wonder if there are better ways of representing numbers + words in deep NNs. Any thoughts?