Initialization: Uniform vs Gaussian

I have mentioned before that initialization becomes more difficult as the depth of the network increases. We are trying to find a scale (norm) for the weights such that we do not run into either of the following scenarios, which are both forms of the vanishing/exploding gradient problem:

  1. The weights are too small, and repeatedly multiplying by them dampens the signal so much that you essentially lose the input. If the input doesn’t flow through the network, credit assignment becomes impossible (since a small change in the parameters still does not change the output).
  2. The norm of the weights is too big, and repeatedly multiplying amplifies the signal so much that it explodes. As we will see, this puts the network in a very bad position to start. (Both failure modes are illustrated in the short sketch after this list.)

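To make these two failure modes concrete, here is a minimal numpy sketch. It uses a toy stack of purely linear layers rather than the convnet from these posts, and the function name, depth and width are made up for illustration:

```python
import numpy as np

def signal_norm_after(depth, std, width=256, seed=0):
    """Norm of the activations after pushing a random input through
    `depth` purely linear layers with zero-mean Gaussian weights."""
    rng = np.random.RandomState(seed)
    x = rng.randn(width)
    for _ in range(depth):
        W = rng.randn(width, width) * std  # zero-mean Gaussian weights
        x = W @ x
    return np.linalg.norm(x)

# Scenario 1: tiny weights -> the signal is damped to (numerically) nothing.
print(signal_norm_after(depth=20, std=0.001))
# Scenario 2: big weights -> the signal grows by many orders of magnitude.
print(signal_norm_after(depth=20, std=1.0))
```

With a width of 256, each layer scales the norm of the signal by roughly std·√256 = 16·std, so any std far from 1/16 shrinks or blows up the signal geometrically with depth.
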
In this blog post I will quantify what I’ve found experimentally before: it becomes harder to find good weights as the depth of the network increases, and, more surprisingly, uniform random initialization seems to be better than Gaussian initialization. To this end, I initialized the weights of the 6 layer (+2 FC) convnet of the previous posts with zero-mean uniform and Gaussian noise with a standard deviation in [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]. I then looked at a histogram of the output probability of the network for 1000 examples under each of these initializations. For small weights (scenario 1) the output probability is always 1/2 (see left figure below), while for big weights the network assigns probability almost one to one of the classes (see right figure below). The latter is of course a bad state to start from, since we want to assign roughly equal probability to both classes, and it is very difficult for the network to recover from it. Note that in most cases this also results in a very bad initial log likelihood.

[Figures: histograms of the output probability for small weights (left, gaussian2) and big weights (right, gaussian4)]
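
For reference, here is a rough sketch of what the histogram experiment looks like, with a small fully connected binary classifier standing in for the actual 6 layer (+2 FC) convnet; the depth, width, tanh nonlinearity and sigmoid output are assumptions made purely for illustration:

```python
import numpy as np

def output_probs(std, dist, depth=8, width=128, n_examples=1000, seed=0):
    """P(class 1) of a randomly initialized toy binary classifier on
    random inputs; `dist` is either "gauss" or "uniform"."""
    rng = np.random.RandomState(seed)

    def sample_weights(shape):
        if dist == "gauss":
            return rng.randn(*shape) * std
        # Uniform noise with the same standard deviation: std of U(-a, a) is a / sqrt(3).
        a = std * np.sqrt(3.0)
        return rng.uniform(-a, a, size=shape)

    x = rng.randn(n_examples, width)
    for _ in range(depth):
        x = np.tanh(x @ sample_weights((width, width)))
    logits = x @ sample_weights((width, 1))
    return 1.0 / (1.0 + np.exp(-logits.ravel()))  # sigmoid -> P(class 1)

for dist in ("gauss", "uniform"):
    for std in (0.001, 1.0):
        counts, _ = np.histogram(output_probs(std, dist), bins=10, range=(0, 1))
        # Small std: counts pile up in the bins around 0.5; big std: in the bins near 0 and 1.
        print(dist, std, counts)
```

The uniform weights are drawn from U(−a, a) with a = std·√3 so that both distributions have the same standard deviation.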

Now that we know that we don’t want to end up in either of those two scenarios, it is interesting to look at the transition from one phase of the network to the other. The slower this transition, the easier it is to find good initial weights. Below, I plot the standard deviation of the weight initialization against the negative log likelihood. As expected, the transition is slower for shallower networks, but more interestingly, the transition is much slower for uniform noise than for Gaussian noise!

[Figure: uniform-gauss-nlls — negative log likelihood as a function of the initialization standard deviation, for uniform and Gaussian initialization]
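
A similarly hedged sketch of the sweep behind this kind of plot, using the same sort of toy network (again an assumption, not the actual convnet) and random labels, so the “good” initial negative log likelihood is around log 2 ≈ 0.69; whether the uniform/Gaussian gap shows up in this toy depends on the details, so treat it as a template rather than a reproduction:

```python
import numpy as np

def init_nll(std, dist, depth=8, width=128, n=1000, seed=0):
    """Binary negative log likelihood of a randomly initialized toy net
    on random inputs with random labels."""
    rng = np.random.RandomState(seed)
    a = std * np.sqrt(3.0)  # U(-a, a) has the same std as N(0, std^2)
    draw = (lambda shape: rng.randn(*shape) * std) if dist == "gauss" \
        else (lambda shape: rng.uniform(-a, a, size=shape))
    x = rng.randn(n, width)
    for _ in range(depth):
        x = np.tanh(x @ draw((width, width)))
    p = 1.0 / (1.0 + np.exp(-(x @ draw((width, 1))).ravel()))
    y = rng.randint(0, 2, size=n)  # random binary labels
    eps = 1e-12                    # keep log() finite when p saturates at 0 or 1
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

for dist in ("gauss", "uniform"):
    print(dist, [round(init_nll(s, dist), 2)
                 for s in (0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1)])
```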

I still don’t know exactly why uniform initialization is better than Gaussian, but I suspect it has to do with properties of uniform and Gaussian random matrices (when you multiply a bunch of such matrices together, the product probably has more structure than you think it has).

PS: Note that I’m well aware of the initialization of Saxe et al., which uses random orthonormal matrices. It would be interesting to include that initialization in this comparison, but I don’t have the code for it right now.
