Why do we need one-hot encoding?
Conversion of categorical features into a numerical format.
In real-world NLP problems, the data needs to be prepared in specific ways before we can apply a model, and this is where encoding comes in. In NLP, the data usually consists of a corpus of words, which is categorical data.
Understanding Categorical Data:
Categorical data consists of variables that contain label values rather than numbers. In NLP, these values are mostly words, and together those words form the vocabulary. The words from this vocabulary need to be turned into vectors before a model can be applied.
Some examples include:
- A “country” variable with the values “USA”, “Canada”, “India”, “Mexico”, and “China”.
- A “city” variable with the values “San Francisco”, “Toronto”, and “Mumbai”.
The categorical data above needs to be converted into vectors using a vectorization technique such as one-hot encoding, as sketched below.
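As a quick sketch, scikit-learn's OneHotEncoder can encode the “country” variable above (this assumes scikit-learn 1.2 or later, where the dense-output flag is named sparse_output; older versions call it sparse):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Each row is one observation of the "country" variable.
countries = np.array([["USA"], ["Canada"], ["India"], ["Mexico"], ["China"]])

# sparse_output=False returns a dense array instead of a sparse matrix.
encoder = OneHotEncoder(sparse_output=False)
one_hot = encoder.fit_transform(countries)

print(encoder.categories_)  # categories are sorted alphabetically
print(one_hot)              # one row per observation, one column per category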
Vectorization:
Vectorization is an important aspect of feature extraction in NLP. These techniques map each word in the vocabulary to a numeric representation. scikit-learn provides DictVectorizer, which converts lists of feature dictionaries into a one-hot encoded form. Another API is CountVectorizer, which converts a collection of text documents to a matrix of token counts; a minimal example is sketched below. We could also use word2vec to convert text data to a vector form.
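Here is a minimal CountVectorizer sketch (the two-document corpus is made up for illustration; get_feature_names_out assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus, purely for illustration.
corpus = [
    "the weather in Mumbai is hot",
    "the weather in Toronto is cold",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix of token counts

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # one row of counts per document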
One-hot Encoding:
Consider that you have a vocabulary of size N. In the one-hot encoding technique, we map each word to a vector of length N in which exactly one digit is set to 1, indicating the presence of that particular word. If you convert words to the one-hot encoding format, you will see vectors such as 0000…100, 0000…010, 0000…001, and so on. Every word in the vocabulary is represented by its own binary vector: the i-th bit is 1 exactly when the vector represents the i-th word of the vocabulary.
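To make the mapping concrete, here is a minimal pure-NumPy sketch using a made-up five-word vocabulary:

import numpy as np

# A toy vocabulary of size N = 5, purely for illustration.
vocabulary = ["cat", "dog", "bird", "fish", "horse"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a length-N vector with a single 1 in the word's position."""
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))   # [1 0 0 0 0]
print(one_hot("bird"))  # [0 0 1 0 0]

scikit-learn's DictVectorizer applies the same idea to dictionaries of features: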
>>> measurements = [
...     {'city': 'San Francisco', 'temperature': 18.},
...     {'city': 'Toronto', 'temperature': 12.},
...     {'city': 'Mumbai', 'temperature': 33.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 0.,  1.,  0., 18.],
       [ 0.,  0.,  1., 12.],
       [ 1.,  0.,  0., 33.]])
>>> vec.get_feature_names_out()
array(['city=Mumbai', 'city=San Francisco', 'city=Toronto', 'temperature'],
      dtype=object)
Note that DictVectorizer sorts its feature names alphabetically, which is why the city=Mumbai column comes first. Using this technique, normal sentences can be represented as vectors whose size is determined by the vocabulary and the encoding scheme, and numerical operations can then be performed on this vector form.
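One way to see this is that indexing into an identity matrix stacks one one-hot row per word; the vocabulary below is made up for illustration:

import numpy as np

# A toy vocabulary, purely for illustration.
vocabulary = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocabulary)}

sentence = "the cat sat on the mat".split()
indices = [word_to_index[w] for w in sentence]

# Row i of the identity matrix is the one-hot vector for word i,
# so indexing stacks one one-hot row per word in the sentence.
one_hot_sentence = np.eye(len(vocabulary), dtype=int)[indices]
print(one_hot_sentence.shape)  # (6, 5): six words, five vocabulary entries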
Applications of One-hot encoding:
The word2vec algorithm conceptually accepts its input in one-hot encoded form: the input layer of the network is a one-hot vector over the vocabulary.
Neural networks can tell us whether an input image shows a cat or a dog. Since a neural network works only with numbers, it cannot output the words “cat” or “dog” directly. Instead, the class labels are one-hot encoded, and the network produces a score for each position; see the sketch below.
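For example, the two class labels can be one-hot encoded as training targets, and the predicted label recovered with argmax (the network output below is made up):

import numpy as np

classes = ["cat", "dog"]

# One-hot targets the network is trained against: [[1, 0], [0, 1]].
targets = np.eye(len(classes), dtype=int)

# A hypothetical network output: one score per class.
prediction = np.array([0.92, 0.08])
print(classes[prediction.argmax()])  # prints "cat"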