Training recurrent neural networks is a hard task, especially for an inexperienced engineer, who can be surprised by phenomena specific to this type of network. Your colleague has asked you to debug his network.
You investigated, and the project seems well structured: the splits are correct, the distributions are similar, there is no data leakage, and the dataset is probably large enough. The task itself is not very complex: tag abusive phrases in chat messages.
| Phrase | Is abusive |
| --- | --- |
| Wagner’s music is better than it sounds. | True |
| You are my hero buddy! | False |
| You’re so fake, GANs are jealous. | True |
Nothing seems to work in the implemented network. You asked your colleague to prepare the following plots for you:
- Train loss
- Dev loss (after each epoch)
- Histogram of neuron activations
- Sum of gradient values
How can you help with this problem?
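To produce those plots, your colleague needs to log the right quantities during training. Below is a minimal sketch of collecting the gradient-norm and activation diagnostics, assuming PyTorch; the `model`, `loader`, and `criterion` arguments and the `rnn` submodule name are hypothetical, not the colleague's actual code.

```python
import torch

def collect_diagnostics(model, loader, criterion):
    """Gather per-iteration gradient norms and hidden activations for plotting."""
    grad_norms, activations = [], []

    def save_activation(module, inputs, output):
        # RNN modules return (output, hidden_state) tuples; keep only the output.
        out = output[0] if isinstance(output, tuple) else output
        activations.append(out.detach().flatten())

    handle = model.rnn.register_forward_hook(save_activation)

    model.train()
    for x, y in loader:
        model.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # Total gradient norm over all parameters that received a gradient.
        norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
        grad_norms.append(torch.norm(torch.stack(norms)).item())

    handle.remove()
    return grad_norms, torch.cat(activations)
```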
You have the opportunity to train the first model.
It is a very good idea to start by overfitting a single batch. However, your friend asked you how he should initialize the network parameters.
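One reasonable scheme, sketched below under the assumption of a PyTorch RNN/LSTM tagger: Xavier (Glorot) initialization for the feed-forward weight matrices, orthogonal initialization for the recurrent (hidden-to-hidden) matrices, and zeros for the biases; in particular, do not initialize all weights to zero or to large random values.

```python
import torch.nn as nn

def init_weights(model: nn.Module) -> None:
    """Initialization sketch: orthogonal recurrent weights, Xavier feed-forward
    weights, zero biases."""
    for name, param in model.named_parameters():
        if "weight_hh" in name:              # hidden-to-hidden (recurrent) matrices
            nn.init.orthogonal_(param)
        elif "weight" in name and param.dim() >= 2:
            nn.init.xavier_uniform_(param)   # input-to-hidden and output projections
        elif "bias" in name:
            nn.init.zeros_(param)
```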
It helped to zero the gradients between iterations and to actually update the network parameters. The network fitted the single batch. What would be the next step?
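For reference, here is a self-contained sketch of the per-iteration pattern that made the single batch fit: clear the old gradients, backpropagate, update. The tiny model and data below are stand-ins, not the colleague's code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    optimizer.zero_grad()            # clear gradients left over from the previous iteration
    loss = criterion(model(x), y)
    loss.backward()                  # compute fresh gradients
    optimizer.step()                 # update the network parameters
```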
The loss started to decrease. Do you like the plots?
(Plots: training loss, gradient norm.)
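If the gradient-norm plot shows occasional large spikes, which is common when training RNNs, gradient clipping is worth trying. A sketch, assuming PyTorch, of the line that would go between `loss.backward()` and `optimizer.step()` in the training loop above; the threshold of 1.0 is illustrative:

```python
import torch

# Clip the global gradient norm before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```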
This is what your losses look like. It seems that the network does not generalize as well as your intuition suggests. The sets are big, the distributions are very similar, and only a small dropout is used, so you would expect the dev score to improve much more. Which of the following mistakes may be the cause of the problem?
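One candidate mistake that produces exactly this symptom is evaluating the dev set with the network still in training mode, so dropout keeps firing while the dev loss is computed. A hedged evaluation sketch, assuming PyTorch; `dev_loader` is a hypothetical dev-set loader:

```python
import torch

def evaluate(model, dev_loader, criterion):
    """Compute dev loss with dropout disabled and batch-norm statistics frozen."""
    model.eval()                      # common bug: forgetting this keeps dropout active on dev
    total, n = 0.0, 0
    with torch.no_grad():             # gradients are not needed for evaluation
        for x, y in dev_loader:
            total += criterion(model(x), y).item() * len(y)
            n += len(y)
    model.train()                     # switch back before resuming training
    return total / n
```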
The next issue is training time. With such a high learning rate, learning should proceed faster. What are possible causes of slow learning?
You have checked the implementation of a single hidden layer. It looks like this:
x = W*x+b
x = dropout(x)
x = relu(x)
x = batch_normalization(x)
Is anything wrong with that?
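For comparison, one commonly used ordering (a sketch assuming PyTorch, not a claim about what your colleague must do) applies batch normalization to the pre-activations and puts dropout last, so dropout does not distort the statistics that batch norm estimates:

```python
import torch.nn as nn

hidden = nn.Sequential(
    nn.Linear(128, 128),      # affine transform: W*x + b
    nn.BatchNorm1d(128),      # normalize the pre-activations
    nn.ReLU(),                # non-linearity
    nn.Dropout(p=0.1),        # dropout applied last
)
```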
Your colleague has some ideas about shuffling the data. Which would you approve?
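Whatever the exact idea, the usual baseline is to reshuffle the training set every epoch and keep the dev and test sets unshuffled. A sketch assuming PyTorch `DataLoader`s; `train_dataset` and `dev_dataset` are hypothetical:

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)   # reshuffled every epoch
dev_loader = DataLoader(dev_dataset, batch_size=32, shuffle=False)      # order does not matter for evaluation
```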
What about increasing the regularization?
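Two typical knobs, sketched below with purely illustrative values and assuming PyTorch, are L2 weight decay on the optimizer and a higher dropout probability; `model` is reused from the earlier training-loop sketch:

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty
dropout = nn.Dropout(p=0.3)   # a larger dropout probability than the current "small" one
```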
In the end, your learning rate equals 0.001, the same value as at the beginning. What is your opinion?
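Keeping the learning rate fixed for the whole run usually leaves accuracy on the table. A sketch of decaying it when the dev loss plateaus, assuming PyTorch; the `evaluate` helper is the one sketched earlier, and the factor/patience values are illustrative:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

for epoch in range(20):
    # ... one epoch of training with `optimizer` ...
    dev_loss = evaluate(model, dev_loader, criterion)
    scheduler.step(dev_loss)     # lower the learning rate when the dev loss stops improving
```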
Your friend implemented batch normalization. What do you need to remember about using it?
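The main thing to remember is the difference between training and evaluation behaviour: during training the statistics come from the current mini-batch and the running estimates are updated; at evaluation time the stored running mean and variance are used, so the model must be switched to eval mode. Batch statistics also become noisy for very small batches, which is worth keeping in mind when choosing the batch size. A small sketch, assuming PyTorch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(64)

bn.train()                       # training mode: per-batch statistics, running estimates updated
out = bn(torch.randn(32, 64))

bn.eval()                        # evaluation mode: stored running mean/variance are used,
out = bn(torch.randn(1, 64))     # so even a single example is normalized consistently
```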