Neural Networks: A Story of Common Mistakes

Training recurrent neural networks is a hard task, especially for an inexperienced engineer, who can be surprised by phenomena specific to this type of network. Your colleague has asked you to debug his network.

You investigate, and the project seems well structured. The splits are correct, the distributions are similar, there is no data leakage, the data size is probably sufficient, and the task is not very complex: tag the abusive phrases in some chats.

Phrase | Is abusive
Wagner’s music is better than it sounds. | True
You are my hero buddy! | False
You’re so fake, GANs are jealous. | True

Nothing seems to work in the implemented network. You ask your colleague to prepare the following plots for you:

  1. Train loss
  2. Dev loss (after each epoch)
  3. Histogram of neuron activations
  4. Sum of gradient values

How can you help with this problem?

You have the opportunity to train the first model. How do you start?

It is a very good idea to start by overfitting a single batch. However, your friend asks you how he should initialize the network parameters.
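
Since the answer options are not reproduced here, a minimal sketch of one sensible default, assuming a PyTorch implementation (both the framework and the exact architecture are assumptions, not stated above): Xavier/Glorot initialization for the recurrent and output weights, zeros for the biases.

```python
import torch.nn as nn

# Hypothetical tagger: an LSTM over token embeddings with a per-token linear head.
# The architecture is illustrative; only the initialization scheme matters here.
class AbuseTagger(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)
        self.reset_parameters()

    def reset_parameters(self):
        for name, param in self.named_parameters():
            if param.dim() >= 2 and "embedding" not in name:
                nn.init.xavier_uniform_(param)   # Glorot init keeps activation variance stable
            elif "bias" in name:
                nn.init.zeros_(param)
        # the embedding keeps PyTorch's default normal initialization

    def forward(self, tokens):                   # tokens: (batch, seq_len) integer ids
        out, _ = self.lstm(self.embedding(tokens))
        return self.head(out).squeeze(-1)        # per-token abuse logits
```

Whichever scheme is chosen, the main things to avoid are all-zero initialization (which prevents symmetry breaking) and very large weights (which saturate the recurrent units from the start).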

Definitely, something is wrong, as you can't overfit even a single batch. Look at the hidden units' activation histograms:

the train loss function:

and the gradient norms:
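
The plots themselves are not reproduced here. If you want to produce similar diagnostics, a minimal sketch (PyTorch assumed) of logging the global gradient norm and an activation histogram after each backward pass:

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients; call after loss.backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

def activation_histogram(activations: torch.Tensor, bins: int = 20) -> torch.Tensor:
    """Bin counts of hidden-unit activations, e.g. the LSTM outputs for one batch."""
    return torch.histc(activations.detach().float(), bins=bins)
```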

It helped to zero the gradients between iterations and, along with that, to actually update the network parameters. The network now fits the single batch. What would be the next step?
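
In PyTorch terms (the framework is an assumption), those two fixes correspond to calling optimizer.zero_grad() and optimizer.step() in every iteration; a minimal single-batch overfitting loop might look like this:

```python
import torch

def overfit_single_batch(model, xb, yb, steps=200, lr=1e-3):
    """Sanity check: the loss should go to ~0 on one fixed batch (xb, yb)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()        # otherwise gradients accumulate across iterations
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()             # without this the parameters never change
    return loss.item()
```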

The loss function started to decay. Do you like the plots?

Training loss:

Gradient norm:

This is how your losses look. It seems that the network does not generalize as well as your intuition suggests: the sets are large, the distributions are very similar, and only a small dropout is used, so you would expect the dev score to improve much more. Which of the following mistakes may be the cause of the problem?

The next thing is the training time. With such a high learning rate, learning should proceed faster. What are possible causes of slow learning?

You have checked the implementation of a single hidden layer. It looks like this:

x = W*x + b
x = dropout(x)
x = relu(x)
x = batch_normalization(x)

Is anything wrong with that?
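
For comparison (and without claiming it is the only valid choice), the ordering most commonly seen for such a block applies batch normalization to the pre-activations, the non-linearity after it, and dropout last; a hedged PyTorch sketch:

```python
import torch.nn as nn

# A commonly used ordering: linear -> batch norm -> non-linearity -> dropout.
hidden_block = nn.Sequential(
    nn.Linear(256, 256),
    nn.BatchNorm1d(256),   # normalize the pre-activations
    nn.ReLU(),
    nn.Dropout(p=0.1),     # drop activations, after the non-linearity
)
```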

Your colleague has some ideas about shuffling the data. Which would you approve?
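
If the data pipeline uses something like PyTorch's DataLoader (an assumption; the original does not show how batches are built), the usual convention is to reshuffle the training set every epoch and keep the dev set order fixed:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy tensors standing in for the tokenized chats (purely illustrative).
train_dataset = TensorDataset(torch.randint(0, 10_000, (1000, 50)), torch.zeros(1000, 50))
dev_dataset = TensorDataset(torch.randint(0, 10_000, (200, 50)), torch.zeros(200, 50))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)   # reshuffled every epoch
dev_loader = DataLoader(dev_dataset, batch_size=32, shuffle=False)      # keep evaluation deterministic
```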

What about increasing the regularization?
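
If you do decide to regularize more, two standard knobs are L2 weight decay on the optimizer and a higher dropout probability; a sketch with illustrative values (not taken from the original):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder for the real tagger
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,           # L2 penalty on the weights
)
# The dropout probability can also be raised, e.g. nn.Dropout(p=0.3) instead of p=0.1,
# wherever it is constructed in the model.
```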

In the end, your learning rate is equal to 0.001, which is the same value as at the beginning. What's your opinion?
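
If you conclude that a constant 0.001 for the whole run is part of the problem, a minimal sketch of decaying it with a scheduler (PyTorch assumed; the schedule values are illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder for the real tagger
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Halve the learning rate every 10 epochs (values chosen only for illustration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... run one epoch of training, calling optimizer.step() per batch ...
    scheduler.step()             # decay once per epoch
```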

Your friend implemented batch normalization. What do you need to remember about using it?
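
One thing that is easy to forget with batch normalization is switching the model between training and evaluation modes, so that running statistics are used at inference instead of the current batch's statistics; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(16, 16), nn.BatchNorm1d(16), nn.ReLU())
x = torch.randn(8, 16)

layer.train()                    # training mode: BatchNorm uses the current batch statistics
_ = layer(x)

layer.eval()                     # evaluation mode: BatchNorm uses its running mean/variance
with torch.no_grad():
    _ = layer(x)
```

Very small batch sizes also make the batch statistics noisy, which is worth keeping in mind for this kind of model.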
