
Author: torontoai

Why do we need one-hot encoding?

Conversion of categorical features into a numerical format.

In real-world NLP problems, the data needs to be prepared in specific ways before we can apply a model, and this is where encoding comes in. In NLP, the data usually consists of a corpus of words. This is categorical data.

Understanding Categorical Data:

Categorical data consists of variables that take label values, most often in the form of words. These words form the vocabulary, and they need to be turned into vectors before we can apply a model.

Some examples include:

  • A “country” variable with the values “USA”, “Canada”, “India”, “Mexico”, and “China”.
  • A “city” variable with the values “San Francisco”, “Toronto”, and “Mumbai”.

The categorical data above needs to be converted into vectors using a vectorization technique; one-hot encoding is one such technique.


Vectorization:

Vectorization is an important part of feature extraction in NLP. These techniques map each word to a numeric representation. scikit-learn provides DictVectorizer to convert feature dictionaries into a one-hot encoded form, and CountVectorizer to convert a collection of text documents into a matrix of token counts. We could also use word2vec to convert text data into dense vectors.

One-hot Encoding:

Consider a vocabulary of size N. In the one-hot encoding technique, we map each word to a vector of length N in which exactly one position is set to 1 and all others are 0. If you convert words to the one-hot encoding format, you will see vectors such as 0000…100, 0000…010, 0000…001, and so on. Every word in the vocabulary is represented by one such binary vector, and the nth bit of a vector indicates whether it represents the nth word in the vocabulary.
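The idea can be sketched by hand before reaching for a library (the vocabulary below reuses the “country” example; the helper name is mine):

```python
# One-hot encode words against a fixed vocabulary of size N.
vocab = ["USA", "Canada", "India", "Mexico", "China"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a length-N binary vector with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("India"))  # [0, 0, 1, 0, 0]
```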

For example, using scikit-learn's DictVectorizer:

>>> measurements = [
...     {'city': 'San Francisco', 'temperature': 18.},
...     {'city': 'Toronto', 'temperature': 12.},
...     {'city': 'Mumbai', 'temperature': 33.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[ 0.,  1.,  0., 18.],
       [ 0.,  0.,  1., 12.],
       [ 1.,  0.,  0., 33.]])

>>> vec.get_feature_names_out()
array(['city=Mumbai', 'city=San Francisco', 'city=Toronto', 'temperature'], dtype=object)

Note that DictVectorizer sorts the feature names alphabetically by default, which is why 'city=Mumbai' comes first even though Mumbai is the last entry in the input.

Using this technique, ordinary sentences can be represented as vectors. The vector's form depends on the vocabulary size and the encoding scheme, and numerical operations can then be performed on it.

Applications of One-hot encoding:

The word2vec algorithm accepts input data in the form of vectors that are generated using one-hot encoding.

A neural network can tell us whether an input image shows a cat or a dog. Since the network works only with numbers, it can't output the words “cat” or “dog” directly. Instead, it uses a one-hot-style output in which each position corresponds to one class label.
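Decoding such an output back into a label is a simple argmax over the output vector (the labels and scores below are made up for illustration):

```python
# Map a network's output vector back to a class label via argmax.
labels = ["cat", "dog"]
output = [0.1, 0.9]  # e.g. softmax scores from the final layer

predicted = labels[output.index(max(output))]
print(predicted)  # dog
```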

Important links for reference:

  1. Understanding DictVectorizer: Stackoverflow
  2. All Feature Extraction function signatures: scikit learn
  3. Python NLP Book: Python NLP Processing

Subscribe to our Acing AI newsletter; I promise not to spam, and it's FREE!


Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.


Why do we need one-hot encoding? was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

[D] How to deal with bags of images?

We are creating a classifier that takes bags of images as input and outputs binary labels *per bag*. A bag can contain between 2 and 25 images: photos of the same object from different angles. We must output a fixed-length binary vector for each bag.

What we are using right now:

  1. We filter out the 5% of bags with too many images, leaving a maximum bag size of 13.
  2. Bags with fewer than 13 images are padded with grey images (we could also repeat some of the images).
  3. The classifier is fit to predict the binary label *for each image*. So for the first bag, we would have an input tensor of shape (13 x 224 x 224 x 3) and an output of shape (13 x n), where each image is 224 x 224 and n is the length of the binary vector.
  4. We make predictions for each image of each bag in the test set.
  5. We use a heuristic to aggregate the 13 prediction vectors into a single one: a simple maximum, some sort of mean, etc.
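Step 5 above can be sketched as follows (shapes follow the post; the variable names and the 0.5 threshold are my assumptions):

```python
# Aggregate 13 per-image prediction vectors into one bag-level vector.
import numpy as np

bag_preds = np.random.rand(13, 4)  # 13 images, n = 4 binary labels

max_agg = bag_preds.max(axis=0)    # a label fires if any image fires
mean_agg = bag_preds.mean(axis=0)  # average confidence across the bag

bag_label = (max_agg > 0.5).astype(int)  # threshold into a binary vector
print(bag_label.shape)  # (4,)
```

Max aggregation matches the intuition that a feature visible from only one or two angles should still set the bag-level label.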

This pipeline feels unsatisfactory because the model does not use all the images at once. The signals also seem noisy, since most images, if labeled by a human, would be all-zero vectors.

We also have two ideas we will try:

  1. Make the model operate directly on bags of images. For example, with a batch size of 16, the input tensor in the pipeline described above would be something like 208 x 224 x 224 x 3 and the output 208 x n. Instead, we could make the input 16 x 13 x 224 x 224 x 3 and the output 16 x n, and use 3D convolutions instead of 2D ones. This seems a lot cleaner. However, the images are not “similar”: frames from a video would be, since each frame is only a small angle change, but that is not the case here. Maybe we could start with several consecutive layers of 2D convolutions before moving on to 3D layers? This still feels wrong, but it’s hard for me to explain why.
  2. Using the pipeline above, we get a 13 x n label for each bag. Each 1 x n row is wrong, since most of them should be mostly zeros (the features we are looking for are small and are usually visible from only one or two angles). So we could use some heuristic to find the “true” labels for each separate photo. For this idea, could you recommend some papers/ways to do this?
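A common alternative to idea 1 (not from the post; all names here are mine, and a dummy linear map stands in for a real 2D CNN) is the multiple-instance-learning setup: encode each image with a shared encoder, pool the per-image features with a permutation-invariant operation, and classify the pooled feature. This uses the whole bag at once without 3D convolutions:

```python
# Shared per-image encoder + pooling across the bag + bag-level classifier.
import numpy as np

rng = np.random.default_rng(0)
bag = rng.random((13, 224 * 224 * 3))          # 13 flattened images

W_enc = rng.random((224 * 224 * 3, 32)) * 0.01  # shared encoder (CNN stand-in)
W_cls = rng.random((32, 4))                     # bag-level head, n = 4 labels

features = bag @ W_enc          # (13, 32): one feature vector per image
pooled = features.max(axis=0)   # (32,): order of images does not matter
logits = pooled @ W_cls         # (4,): one score per bag-level label
print(logits.shape)  # (4,)
```

Because pooling is permutation-invariant, this also sidesteps the objection that the images are not ordered like video frames.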

Do you have any tips, tricks, ideas to try, papers to read?

Thank you.

submitted by /u/Icko_

[Project] Need some advice on my Final School Project(Sudoku solving with AI)

Hey I am a 18 year old student from Slovenia, just a few months before graduating.

To graduate and pass this last year, I have to do a project (I wanted to do a game first, but it was already taken), so the only thing I had left was an Artificial Intelligence project: solving Sudoku with the help of AI.

I have a lot of questions and don’t really know where to even start. Does anyone know any good sources I could learn from (about machine learning…)?

One of the main questions I have is whether anyone knows a good C++ library for machine learning, and one for graphical programming (like Java Swing for Java).
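Not part of the original post, but a useful baseline before any ML: plain backtracking solves Sudoku directly, and an ML approach can then be compared against it. A minimal sketch (the grid encoding, a 9x9 list of lists with 0 for empty cells, is my assumption; the asker's project would be in C++, but the algorithm is the same):

```python
# Backtracking Sudoku solver; grid is 9x9, 0 marks an empty cell.
def valid(grid, r, c, v):
    """Check whether value v may be placed at row r, column c."""
    if v in grid[r]:
        return False
    if any(grid[i][c] == v for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 box
    return all(grid[br + i][bc + j] != v for i in range(3) for j in range(3))

def solve(grid):
    """Fill grid in place; return True if a solution was found."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for v in range(1, 10):
                    if valid(grid, r, c, v):
                        grid[r][c] = v
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo and try the next value
                return False  # no value fits this cell: backtrack
    return True  # no empty cells left
```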

submitted by /u/UnlikelyDriver

[D] How are models actually used in practice?

I don’t have a lot of experience in industry, and I would really like to hear from people with practical experience how things are done in practice: not experimentation, but actual usage of models. Anything from classification to regression; I just want to hear from people who use these things day to day.

Also, are there books that discuss case studies of models that made it into production? I’m looking to move beyond the “regressing housing prices” examples into actual real-world examples of models, maybe a book or an article that discusses these. Thanks!

submitted by /u/Minimum_Zucchini

[Discussion] Examples of mis-specifying optimization objectives causing unpleasant outcomes

In Stuart Russell’s book ‘Human Compatible’, he gives the example of a social platform specifying maximization of click-through rate as its objective. This not only promoted an echo-chamber effect but in fact slowly modified people’s preferences so that we become more predictable, driving viewpoints toward the extremes, because it is easier to predict what content will be clicked when someone’s viewpoints sit at either end of the spectrum.

I find this example complex and interesting, and am wondering: what are other real-world examples?

submitted by /u/dbcrib

[R] Research Survey about Security in Machine Learning

Hey everybody,

We at Fraunhofer AISEC are concerned with the awareness of security in machine learning implementations. Therefore, we are currently running a survey with ML developers to capture the current state of practice.

If you are a developer working with ML and have ~15 minutes of free time, we kindly ask you to take part in our anonymous online survey:
https://websites.fraunhofer.de/ML_security/index.php/232539?lang=en

Our research is conducted in cooperation with the Freie Universität Berlin. For more information visit the following link:

https://www.mi.fu-berlin.de/inf/groups/ag-idm/projects/SecureMachineLearning/index.html

Thank you in advance!

submitted by /u/oliver133322

[Research] Reinforcement Learning – Rainbow algorithm. Need some help with code

Hello good people!

Background: I need your help. First of all, I am out of my element here. I am just learning about RL, and luckily I got a job working on it. It’s more code-oriented, but I need some concepts as well. I decided to throw myself in the water to break my stagnation. I hope you can help me here.

Issue: I want to run the code from the Rainbow paper. When I run it with default arguments it just keeps running; I think by default it is set to run 50 million steps (T-max = 50e6). I want one successful run before I start playing with it, so I have an idea of what the result is supposed to look like. Should I just change the T-max variable? There are about 20 more arguments, and I am not sure whether changing it affects the others or not. For example, I think the target-update interval is related to this. Since my concepts are not so clear, I could use some help here.

I hope I was clear, if not please ask me here.

Edit : spelling and stuff

submitted by /u/loser-two-point-o

[D] A Decade in Deep Learning

This decade really belonged to Deep Learning, and in a bid to recap, I have written a post covering the most significant contributions to it over the past decade.

The post is split into the 3 main domains of Deep Learning: Natural Language Processing, Computer Vision, and Reinforcement Learning. Each subtopic covers an important milestone that has shaped Deep Learning as we know it today, with links to the original papers. The aim is to look back at what will be remembered from this decade and to spark discussion about which areas of Deep Learning will play a major role in the 2020s.

https://medium.com/%C3%A9clair%C3%A9/a-decade-in-deep-learning-19b611588aa0?source=friends_link&sk=7567f3da9e88ae105289376a2f9f4485

submitted by /u/MrKotia

[D] Top Down or Bottom Up: Two Paradigms for Artificial General Intelligence

I wrote a blog post a while ago that I thought I’d post here for discussion. It discusses some building blocks for AGI, e.g. intuitive physics, intuitive psychology, intrinsically-motivated RL, etc. I also discuss what I think may be most promising.

Please criticize, discuss and let me know if there’s any more recent work in any of these fields that has changed the landscape since I wrote this piece.

I’m still pretty confident in it other than:

use one giant neural network, or use more than one

I’d generalize this to “use one end-to-end learning system, or more than one”.

submitted by /u/the_roboticist

[D] How to estimate RoI for Descriptive Analytics Usecases?

I am currently working with a company that has a lot of data issues in terms of quality and quantity. By quality, I mean that they don’t have standardized, clean, structured data; by quantity, that they don’t have much historical data.

After doing data cleaning and wrangling, the final dataset is only suitable for descriptive analytics. And the management wants to know the RoI from implementing this use-case.

From my PoV, descriptive analytics will only aid the decision-makers in knowing the status of the different components of their projects and identifying any issues related to them. Any financial value that can be generated is in the hands of these decision-makers. But the management wants to put a price tag on it, and I don’t know how to guesstimate it.

Any ideas?

submitted by /u/themonkwarriorX