Author: torontoai

Multi Class Text Classification with LSTM using TensorFlow 2.0

Recurrent Neural Networks, Long Short Term Memory

Many innovations in NLP have been about how to add context to word vectors. One common way of doing this is with recurrent neural networks (RNNs). The main ideas behind RNNs are:

  • They make use of sequential information.
  • They have a memory that captures what has been calculated so far, i.e. what I said last will affect what I say next.
  • RNNs are ideal for text and speech analysis.
  • The most commonly used RNNs are LSTMs.
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

The above is the architecture of Recurrent Neural Networks.

  • “A” is one layer of a feed-forward neural network.
  • If we look only at the right side, the network recurrently passes through each element of the sequence.
  • If we unroll the left side, it looks exactly like the right.
Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs

Suppose we are solving a document classification problem for a news article data set.

  • We input each word; the words relate to each other in some way.
  • We make a prediction at the end of the article, once we have seen all the words in it.
  • RNNs, by passing the output of the previous step as input to the next, are able to retain information and to use all of it at the end to make a prediction.
https://colah.github.io/posts/2015-08-Understanding-LSTMs
  • This works well for short sentences, but when we deal with a long article there is a long-term dependency problem.

Therefore, we generally do not use vanilla RNNs and use Long Short-Term Memory (LSTM) instead. LSTM is a type of RNN that can solve this long-term dependency problem.

In our news article classification example, we have a many-to-one relationship: the input is a sequence of words, and the output is a single class or label.
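To make the many-to-one idea concrete, here is a minimal sketch of a vanilla recurrent step in plain Python. It is only an illustration: the scalar state and the weight values (w_x, w_h, b) are made up for the example and are not taken from the post.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step with a scalar state: h = tanh(w_x*x + w_h*h_prev + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def final_state(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Many-to-one: consume the whole sequence, return only the last state."""
    h = 0.0  # the "memory" starts empty
    for x in xs:
        # The previous state is fed back in together with the new input.
        h = rnn_step(x, h, w_x, w_h, b)
    return h

state = final_state([1.0, -2.0, 0.5])
print(state)
```

Because each step mixes the new input with the previous state, the final state depends on the order of the inputs, not just on which inputs were seen.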

Now we are going to solve a BBC news document classification problem with LSTM using TensorFlow 2.0 & Keras. The data set can be found here.

  • First, we import the libraries and make sure our TensorFlow is the right version.

https://medium.com/media/c3ce0e6b10a84a676c3d9de30e90b1fb/href

  • Put the hyperparameters at the top like this to make them easier to change and edit.
  • We will explain how each hyperparameter works when we get there.

https://medium.com/media/7f3901aa39de489d143a5b733a09ff9c/href
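The embedded snippet is not reproduced here, so the following is a reconstruction rather than the original code. The values 5,000, 200, post, <OOV> and 80% all appear in the text, and max_length, padding_type and trunc_type appear in the code further down. The names vocab_size, embedding_dim, oov_tok and training_portion, and the value 64, are assumptions (64 is inferred from the 128-wide Bidirectional output mentioned later).

```python
vocab_size = 5000        # keep the 5,000 most common words
embedding_dim = 64       # assumption: inferred from the 128-wide Bidirectional output
max_length = 200         # every article is padded or truncated to this length
trunc_type = 'post'      # truncate at the end of an article
padding_type = 'post'    # pad at the end of an article
oov_tok = '<OOV>'        # token used for out-of-vocabulary words
training_portion = 0.8   # 80% for training, 20% for validation
```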

  • Define two lists containing articles and labels. While reading the data, we also remove stopwords.

https://medium.com/media/78a4697725a3ec5f5861866ffd32cc56/href
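As a rough sketch of this step (not the original snippet), the code below reads category/text rows and drops stopwords. The in-memory CSV sample, the column order and the tiny stopword set are hypothetical stand-ins for the real BBC file and a full stopword list.

```python
import csv
import io

# Hypothetical sample data and stopword set, used only to keep the sketch
# self-contained; the post works on the real BBC file and a full list.
SAMPLE_CSV = """category,text
tech,the new phone is on sale and it is fast
sport,the team won the match in the final minute
"""
STOPWORDS = {"the", "is", "on", "and", "it", "in"}

def load_articles(handle, stopwords):
    """Return (articles, labels) with stopwords removed from each article."""
    articles, labels = [], []
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for category, text in reader:
        kept = [word for word in text.split() if word not in stopwords]
        labels.append(category)
        articles.append(" ".join(kept))
    return articles, labels

articles, labels = load_articles(io.StringIO(SAMPLE_CSV), STOPWORDS)
print(labels)
print(articles)
```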

There are 2,225 news articles in the data. We split them into a training set and a validation set according to the parameter we set earlier: 80% for training and 20% for validation.

https://medium.com/media/87df8873bbd3b423c283bf3855fcb002/href
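A minimal sketch of such a split is shown below. It assumes a simple sequential cut with no shuffling, and it uses placeholder lists in place of the real articles and labels; the name training_portion is also an assumption.

```python
training_portion = 0.8  # fraction used for training, as stated in the text

# Placeholder data standing in for the 2,225 real articles and labels.
articles = [f"article {i}" for i in range(2225)]
labels = [f"label {i % 5}" for i in range(2225)]

train_size = int(len(articles) * training_portion)

# Everything before the cut is training data, everything after is validation.
train_articles = articles[:train_size]
train_labels = labels[:train_size]
validation_articles = articles[train_size:]
validation_labels = labels[train_size:]

print(train_size, len(validation_articles))
```

With 2,225 articles this gives 1,780 for training and 445 for validation.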

Tokenizer does all the heavy lifting for us. From the articles it tokenizes, it keeps the 5,000 most common words. oov_token is a special value that is substituted whenever an unseen word is encountered; here we want <OOV> to be used for words that are not in the word_index. fit_on_texts goes through all the text and creates a dictionary like this:

https://medium.com/media/cef356057de89bd5541b5537c2baeb8b/href

We can see that “<OOV>” takes the first index, which is reserved for it, followed by “said”, the most common word in our corpus, then “mr” and so on.
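The pure-Python sketch below imitates, in simplified form, how such a word index behaves: index 1 is reserved for the OOV token and the remaining words are numbered by frequency. It is an illustration only, not the Keras implementation; unlike the real Tokenizer it does not strip punctuation, and the cut-off rule (indices below num_words are kept) reflects my understanding of the Keras behaviour.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Index 1 is reserved for the OOV token; words follow by frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unseen or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            seq.append(idx if idx is not None and idx < num_words else oov)
        sequences.append(seq)
    return sequences

corpus = ["mr brown said hello", "mr smith said no", "she said yes"]
word_index = build_word_index(corpus)
sequences = texts_to_sequences(["mr jones said maybe"], word_index, num_words=4)
print(word_index)
print(sequences)
```

In this toy corpus “said” is the most frequent word and gets index 2, “mr” gets index 3, and the unseen words “jones” and “maybe” map to index 1.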

After tokenization, the next step is to turn those tokens into lists of sequences. The following is the 11th article in the training data, turned into a sequence.

train_sequences = tokenizer.texts_to_sequences(train_articles)
print(train_sequences[10])
Figure 1

When we train neural networks for NLP, the sequences need to be the same size, which is why we use padding. Our max_length is 200, so we use pad_sequences to make all of our articles the same length, 200. As a result, the 1st article, which was 426 in length, becomes 200; the 2nd article, which was 192 in length, becomes 200; and so on.

train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(len(train_sequences[0]))
print(len(train_padded[0]))

print(len(train_sequences[1]))
print(len(train_padded[1]))

print(len(train_sequences[10]))
print(len(train_padded[10]))

In addition, there are padding_type and trunc_type, both set to post. This means, for example, that the 11th article, which was 186 in length, is padded to 200 at the end, that is, by adding 14 zeros.

print(train_padded[10])
Figure 2

The 1st article was 426 in length, so we truncated it to 200, and we truncated at the end as well.
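The effect of post-padding and post-truncation can be reproduced with a few lines of plain Python. This is a sketch of the behaviour described above, not the pad_sequences implementation; the helper name pad_post and the dummy token lists are made up for the example.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate at the end, then pad at the end, so the result has maxlen items."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))

long_article = list(range(1, 427))   # 426 dummy tokens, like the 1st article
short_article = list(range(1, 187))  # 186 dummy tokens, like the 11th article

padded_long = pad_post(long_article, 200)
padded_short = pad_post(short_article, 200)

print(len(padded_long), len(padded_short))
print(padded_short[-14:])
```

The long article keeps only its first 200 tokens, while the short one keeps all 186 tokens and gains 14 trailing zeros.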

Then we do the same for the validation sequences.

https://medium.com/media/c43e36bb81b9b4fdea972e2ebdb2068c/href

Now we are going to look at the labels. Because our labels are text, we tokenize them. When training, labels are expected to be numpy arrays, so we turn the lists of labels into numpy arrays like so:

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)

training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
print(training_label_seq[0])
print(training_label_seq[1])
print(training_label_seq[2])
print(training_label_seq.shape)

print(validation_label_seq[0])
print(validation_label_seq[1])
print(validation_label_seq[2])
print(validation_label_seq.shape)

Before training the deep neural network, we should explore what our original article and the article after padding look like. Running the following code, we explore the 11th article and can see that some words have become “<OOV>” because they did not make it into the top 5,000.

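# Invert the word index so that integers can be mapped back to words.
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_article(text):
    # Indices without a word (such as the 0 used for padding) are shown as '?'.
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_article(train_padded[10]))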
print('---')
print(train_articles[10])
Figure 3

Now it is time to implement the LSTM.

  • We build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices into sequences of vectors. After training, words with similar meanings often have similar vectors.
  • The Bidirectional wrapper is used with an LSTM layer. It propagates the input forwards and backwards through the LSTM layer and then concatenates the outputs, which helps the LSTM learn long-term dependencies. We then feed its output into a dense layer to do the classification.
  • We use relu in place of the tanh function, since the two are very good alternatives to each other.
  • We add a Dense layer with 6 units and softmax activation. When we have multiple outputs, softmax converts the output layer's values into a probability distribution.
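To illustrate the last point, here is a small stand-alone softmax in plain Python. The six scores are made-up numbers, one per output unit; this only shows the idea and is not how Keras computes the activation internally.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, 2.0, -1.0, 0.5, 0.3, 1.2]  # six made-up outputs, one per unit
probs = softmax(scores)
predicted = probs.index(max(probs))  # the class with the highest probability

print(probs)
print(predicted)
```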

https://medium.com/media/928f6aee4451f31480029ddb8efd7cf6/href
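Since the embedded snippet is not reproduced here, the following is a hedged reconstruction of a model matching the description: an embedding layer, a Bidirectional LSTM, a relu Dense layer and a 6-unit softmax output. The function name build_model, the embedding size of 64 and the width of the hidden Dense layer are assumptions, not values confirmed by the post.

```python
def build_model(vocab_size=5000, embedding_dim=64, num_outputs=6):
    """Sketch of the described architecture; the sizes are assumptions."""
    # Imported lazily so that defining the function does not itself
    # require TensorFlow to be installed.
    import tensorflow as tf

    return tf.keras.Sequential([
        # One trainable vector per word index.
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        # Runs the LSTM forwards and backwards and concatenates both outputs.
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
        # Hidden dense layer with relu activation.
        tf.keras.layers.Dense(embedding_dim, activation='relu'),
        # Six outputs (labels 0 to 5) turned into probabilities by softmax.
        tf.keras.layers.Dense(num_outputs, activation='softmax'),
    ])
```

With TensorFlow installed, the model can be created with build_model() and compiled as shown further below.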

Figure 4

In our model summary, we have our embeddings, then the Bidirectional wrapper containing the LSTM, followed by two dense layers. The output from Bidirectional is 128, because it doubles what we put into the LSTM. We can also stack LSTM layers, but I found the results worse.
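The layer sizes also determine the number of trainable parameters. The arithmetic below is a sketch under assumptions: 64 embedding dimensions and 64 LSTM units are inferred from the 128-wide Bidirectional output, and the hidden Dense width of 64 is a guess, so the totals may not match the summary in Figure 4.

```python
vocab_size = 5000   # from the text
embedding_dim = 64  # assumption: inferred from the 128-wide Bidirectional output
lstm_units = 64     # assumption: 128 / 2
dense_units = 64    # assumption: width of the hidden Dense layer
num_outputs = 6     # from the text

# One vector of embedding_dim numbers per word.
embedding_params = vocab_size * embedding_dim

# An LSTM has four weight blocks (three gates plus the candidate state),
# each with input weights, recurrent weights and a bias.
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)
bidirectional_params = 2 * lstm_params  # one LSTM per direction

# Dense layers: weights plus one bias per unit.
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * num_outputs + num_outputs

total_params = embedding_params + bidirectional_params + dense_params + output_params
print(embedding_params, bidirectional_params, dense_params, output_params, total_params)
```

Under these assumptions, most of the parameters sit in the embedding layer.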

print(set(labels))

We have 5 labels in total. Because we did not one-hot encode the labels, we have to use sparse_categorical_crossentropy as the loss function, and it treats 0 as a possible label as well. The tokenizer object, however, assigns integers starting at 1 instead of 0. As a result, the last Dense layer needs outputs for labels 0, 1, 2, 3, 4, 5, although 0 is never used.

If you want the last Dense layer to have 5 units, you will need to subtract 1 from the training and validation labels. I decided to leave it as it is.
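The shift from 1-based to 0-based labels can be sketched in plain Python as follows. The sample label list and the frequency-based numbering are illustrative assumptions that imitate what a label tokenizer would produce.

```python
from collections import Counter

# Hypothetical sample of labels, used only for illustration.
sample_labels = ["sport", "business", "sport", "politics", "tech", "entertainment"]

# Imitate a tokenizer that numbers distinct labels from 1, most frequent first.
counts = Counter(sample_labels)
label_index = {label: i for i, (label, _) in enumerate(counts.most_common(), start=1)}

one_based = [label_index[label] for label in sample_labels]
zero_based = [i - 1 for i in one_based]  # now suitable for a 5-unit output layer

print(label_index)
print(one_based)
print(zero_based)
```

With the numpy arrays from this post, the same shift would be training_label_seq - 1 and validation_label_seq - 1.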

I decided to train for 10 epochs, which is plenty, as you will see.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10
history = model.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)
Figure 5
import matplotlib.pyplot as plt

def plot_graphs(history, string):
    # Plot the training and validation curves for one metric.
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
Figure 6

We probably only need 3 or 4 epochs. At the end of the training, we can see that there is a little bit of overfitting.

In future posts, we will work on improving the model.

The Jupyter notebook can be found on GitHub. Enjoy the rest of the weekend!

Multi Class Text Classification with LSTM using TensorFlow 2.0 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

[D] Struggled with reading deep learning papers

Actually, as a senior graduate student, I have been doing research in the field of deep learning/NLP for several years. But there is a problem that has troubled me a lot during these years. Specifically, a lot of deep learning papers (especially those introducing a new model for some very specific task, for example reading comprehension, text-to-SQL, etc.) give me the feeling that parts of the model design described in the paper are highly engineered and not that intuitive. In other words, there could be many alternative designs for a given module, but few papers really justify in depth why they adopt their specific design. For instance, in a seq2seq setting, some may directly use BERT as the encoder, while others may use BERT to generate the embedding for the input sequence first and then feed the embedding to an LSTM encoder. In fact, this example does not reveal the problem completely, since there are other scenarios with tons of different possible designs that might work, and different papers always adopt their very own design without much justification.

This really makes me feel extremely bad! First, as a guy who is always eager to know WHY, those papers really can’t answer my question, or maybe it’s just not smart to ask why-questions in the context of deep learning model design. It makes doing research in this field look more like engineering or even art design than science. Secondly, those various designs make it really difficult to compare different models. It’s really hard to control the variables! If one model achieves better performance than another, it’s hard to tell whether that is truly due to what the paper claims or to some other subtle and tricky design choices.

I don’t know whether there are other people who feel the same way as me. How should I adjust my mindset for doing research in this field?

submitted by /u/entslscheia

[D] Best way to cluster text paragraphs?

My boss wants me to do a hack project where I cluster user feedback / complaints (e.g. people saying “wtf I can’t log in” or “this UI is ugly bla bla” etc.) We have >100k unlabeled data points. There may be jargon in there but it’s mostly legible English. Our goal is to cluster these things so that those talking about the same issue get grouped, and we can take care of them in chunks as nobody wants to read a thousand of these per day.

I’m not an NLP guy by any stretch, so I’ve been reading papers all day to try catching up, however I’m kind of in the middle of the ocean right now. There’s a lot of stuff out there and being inexperienced I thought I’d summon you folks for a discussion on what to try.

My idea now is to use some kind of Transformer model to embed each data point (paragraph), but I’m stuck here as I’m learning that the vectors coming out of those encoders don’t cluster well by text meaning. Let me know any ideas.

P.S. simple models like counting keywords failed me because 1) the data points have a lot of shared vocab so irrelevant things get clustered together, and 2) there are many ways of talking about the same thing with different words.

Ciao

submitted by /u/ME_PhD

[D] My experience with Paperspace virtual machines

I was looking for a VM with a GPU to train my model. I was going to use Google Cloud but unfortunately they don’t do business with people from my country so I had to look elsewhere.

That’s when I remembered Paperspace, which looked pretty nice. They even have a separate option for ML which allows you to send calculations to the cloud and launch notebooks.

But the system wouldn’t accept my card. It simply said “Card is declined”. I reached out to support and they said that’s probably because their system cannot determine my IP because of a VPN or firewall, and that I need to turn that off to add card info. It’s a pretty strange thing to require my IP just to add payment info, but that worked.

I quickly understood that I’m not comfortable with this Gradient service and that I’d like to operate from PyCharm, using vm as a remote interpreter via ssh.

So I tried to rent a regular VM, but all the options were locked, saying that I need to send a request describing the reasons and ways in which I want to use it. Strange, but I sent a request, saying that thing about using PyCharm. I waited a day, got no response, and sent one more request.

Later that day I get an email from their security staff saying that my account rated highly on their risk matrix and was flagged as suspicious and that I must send them:

  • a photo of my ID with name matching the card
  • contact information
  • company or personal website
  • link to github or social media accounts
  • detailed description of what I’m going to do with the service

And if I don’t do it in 24 hours they will ban me forever.

tl;dr accused me of being suspicious and potentially fraudulent and asked all kinds of personal info to unblock me

Well, imo they should balance their false positive rate and improve customer service greatly.

What are other good alternatives for VMs for machine learning? What do you use?

submitted by /u/Darell1

What does the global minima of a non-convex loss function look like?

For LeNet trained on MNIST with the lowest possible loss (global minima),

  • What would the test error rate look like? Is there a benchmark for best possible performance?
  • Can we achieve global minima on non-convex loss functions for a classification task with a minimum number of parameters? Or conversely, how does adding more parameters to a NN help with this?

submitted by /u/liqui_date_me

[D] Is there research that focuses primarily on speed of learning, or minimizing required dataset size instead of results?

I see a lot of brilliant techniques that produce incredible results, but are there any techniques that try to bring learning speeds an order of magnitude or two up even if it costs them half the accuracy, or techniques that try to learn from 400 images instead of 40k?

Or in other words, I would love if someone were to link me some research pursuing non-conventional goals.

submitted by /u/derpderp3200

[P] Short films scripted by GPT-2

I started a project where I’m going to shoot short stories generated by GPT-2. I insert the first line and it completes it. Of course, there is a lot of filling the gaps as sometimes all we get is dialogue and other times only a setting.

This is my first video, and I’ve got a few others already shot and plenty of scripts selected.

https://www.youtube.com/watch?v=QfI0Pu0jz3E

I think this is an interesting application of GPT-2 and illustrates visually that, although there is coherence in the text, meaning is often lost in the process.

It would be amazing if, after a few years of posting consistently, we could see the evolution of AI-generated scripts and, who knows, actually have an interesting story that is not pure absurd comedy, as they mostly are now.

Hope you enjoy the project and, if you do, please suggest starting lines.

submitted by /u/brunoplak

[P] Clustering Pollock

Hi all,

I applied k-means clustering to some of Pollock’s paintings. The idea was to track the artist’s usage of colors through the years. Here’s the outcome!

I had really good fun mixing computer science and art. I used Python with the standard data science stack (pandas, numpy, scikit-learn) plus OpenCV, and ECharts for the visualizations at the end of the article.
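For readers unfamiliar with the technique, here is a minimal pure-Python sketch of k-means on RGB values. It is not the code from the article, which uses scikit-learn and OpenCV; the pixel values and starting centroids are made up for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from the point to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up RGB pixels: three reddish and three bluish.
pixels = [(250, 10, 10), (240, 20, 5), (255, 0, 15),
          (10, 10, 250), (5, 20, 240), (15, 0, 255)]
centroids, clusters = kmeans(pixels, centroids=[(200, 0, 0), (0, 0, 200)])
print(centroids)
```

Each resulting centroid can be read as one dominant color of the image.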

Let me know what you think!

https://medium.com/@andrea.ialenti/clustering-pollock-1ec24c9cf447

submitted by /u/travellingsalesman2