Training recurrent neural networks is a hard task, especially for an inexperienced engineer, who can be surprised by phenomena specific to this type of network. Your colleague has asked you to debug his network.
You investigated, and the project seems well structured: the splits are correct, the distributions are similar, there is no data leakage, and the dataset is probably large enough. The task itself is not very complex: tag abusive phrases in chat messages.
| Phrase | Is abusive |
| --- | --- |
| Wagner’s music is better than it sounds. | True |
| You are my hero buddy! | False |
| You’re so fake, GANs are jealous. | True |
Nothing seems to work in the implemented network. You asked your colleague to prepare the following plots for you:
- Train loss
- Dev loss (after each epoch)
- Histogram of neuron activations
- Sum of gradient values
How can you help with this problem?
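To produce those plots, your colleague needs to log the right quantities during training. Below is a minimal sketch of collecting the gradient-norm and activation diagnostics, assuming PyTorch; the `model`, `loader`, and `criterion` arguments and the `rnn` submodule name are hypothetical, not the colleague's actual code.

```python
import torch

def collect_diagnostics(model, loader, criterion):
    """Gather per-iteration gradient norms and hidden activations for plotting."""
    grad_norms, activations = [], []

    def save_activation(module, inputs, output):
        # RNN modules return (output, hidden_state) tuples; keep only the output.
        out = output[0] if isinstance(output, tuple) else output
        activations.append(out.detach().flatten())

    handle = model.rnn.register_forward_hook(save_activation)

    model.train()
    for x, y in loader:
        model.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # Total gradient norm over all parameters that received a gradient.
        norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
        grad_norms.append(torch.norm(torch.stack(norms)).item())

    handle.remove()
    return grad_norms, torch.cat(activations)
```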
You have the opportunity to train the first model.
It is a very good idea to start by overfitting a single batch. However, your friend asked you how he should initialize the network parameters.
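One reasonable scheme, sketched below under the assumption of a PyTorch RNN/LSTM tagger: Xavier (Glorot) initialization for the feed-forward weight matrices, orthogonal initialization for the recurrent (hidden-to-hidden) matrices, and zeros for the biases; in particular, do not initialize all weights to zero or to large random values.

```python
import torch.nn as nn

def init_weights(model: nn.Module) -> None:
    """Initialization sketch: orthogonal recurrent weights, Xavier feed-forward
    weights, zero biases."""
    for name, param in model.named_parameters():
        if "weight_hh" in name:              # hidden-to-hidden (recurrent) matrices
            nn.init.orthogonal_(param)
        elif "weight" in name and param.dim() >= 2:
            nn.init.xavier_uniform_(param)   # input-to-hidden and output projections
        elif "bias" in name:
            nn.init.zeros_(param)
```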
It helped to zero the gradients between iterations and to actually update the network parameters. The network fitted the single batch. What would be the next step?
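For reference, here is a self-contained sketch of the per-iteration pattern that made the single batch fit: clear the old gradients, backpropagate, update. The tiny model and data below are stand-ins, not the colleague's code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    optimizer.zero_grad()            # clear gradients left over from the previous iteration
    loss = criterion(model(x), y)
    loss.backward()                  # compute fresh gradients
    optimizer.step()                 # update the network parameters
```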
The loss started to decrease. Do you like the plots?
(Plots: training loss, gradient norm.)
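If the gradient-norm plot shows occasional large spikes, which is common when training RNNs, gradient clipping is worth trying. A sketch, assuming PyTorch, of the line that would go between `loss.backward()` and `optimizer.step()` in the training loop above; the threshold of 1.0 is illustrative:

```python
import torch

# Clip the global gradient norm before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```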
This is what your losses look like. It seems that the network does not generalize as well as your intuition suggests. The sets are big, the distributions are very similar, and only a small dropout is used, so you would expect the dev score to improve much more. Which of the following mistakes may be the cause of the problem?
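One candidate mistake that produces exactly this symptom is evaluating the dev set with the network still in training mode, so dropout keeps firing while the dev loss is computed. A hedged evaluation sketch, assuming PyTorch; `dev_loader` is a hypothetical dev-set loader:

```python
import torch

def evaluate(model, dev_loader, criterion):
    """Compute dev loss with dropout disabled and batch-norm statistics frozen."""
    model.eval()                      # common bug: forgetting this keeps dropout active on dev
    total, n = 0.0, 0
    with torch.no_grad():             # gradients are not needed for evaluation
        for x, y in dev_loader:
            total += criterion(model(x), y).item() * len(y)
            n += len(y)
    model.train()                     # switch back before resuming training
    return total / n
```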
The next issue is training time. With such a high learning rate, learning should proceed faster. What are possible causes of slow learning?
You have checked the implementation of a single hidden layer. It looks like this:
x = W*x+b
x = dropout(x)
x = relu(x)
x = batch_normalization(x)
Is anything wrong with that?
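For comparison, one commonly used ordering (a sketch assuming PyTorch, not a claim about what your colleague must do) applies batch normalization to the pre-activations and puts dropout last, so dropout does not distort the statistics that batch norm estimates:

```python
import torch.nn as nn

hidden = nn.Sequential(
    nn.Linear(128, 128),      # affine transform: W*x + b
    nn.BatchNorm1d(128),      # normalize the pre-activations
    nn.ReLU(),                # non-linearity
    nn.Dropout(p=0.1),        # dropout applied last
)
```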
Your colleague has some ideas about shuffling the data. Which would you approve?
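Whatever the exact idea, the usual baseline is to reshuffle the training set every epoch and keep the dev and test sets unshuffled. A sketch assuming PyTorch `DataLoader`s; `train_dataset` and `dev_dataset` are hypothetical:

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)   # reshuffled every epoch
dev_loader = DataLoader(dev_dataset, batch_size=32, shuffle=False)      # order does not matter for evaluation
```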
What about increasing the regularization?
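Two typical knobs, sketched below with purely illustrative values and assuming PyTorch, are L2 weight decay on the optimizer and a higher dropout probability; `model` is reused from the earlier training-loop sketch:

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty
dropout = nn.Dropout(p=0.3)   # a larger dropout probability than the current "small" one
```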
In the end, your learning rate equals 0.001, the same value as at the beginning. What is your opinion?
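Keeping the learning rate fixed for the whole run usually leaves accuracy on the table. A sketch of decaying it when the dev loss plateaus, assuming PyTorch; the `evaluate` helper is the one sketched earlier, and the factor/patience values are illustrative:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

for epoch in range(20):
    # ... one epoch of training with `optimizer` ...
    dev_loss = evaluate(model, dev_loader, criterion)
    scheduler.step(dev_loss)     # lower the learning rate when the dev loss stops improving
```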
Your friend implemented batch normalization. What do you need to remember about using it?
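The main thing to remember is the difference between training and evaluation behaviour: during training the statistics come from the current mini-batch and the running estimates are updated; at evaluation time the stored running mean and variance are used, so the model must be switched to eval mode. Batch statistics also become noisy for very small batches, which is worth keeping in mind when choosing the batch size. A small sketch, assuming PyTorch:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(64)

bn.train()                       # training mode: per-batch statistics, running estimates updated
out = bn(torch.randn(32, 64))

bn.eval()                        # evaluation mode: stored running mean/variance are used,
out = bn(torch.randn(1, 64))     # so even a single example is normalized consistently
```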