95.5% accuracy

I’ve been playing around with the very deep model for a while, but I’ve found it very difficult to train. It seems very sensitive to initialization, which resulted either in very slow optimization progress or in exploding parameter values (NaNs).

So I’ve decided to scale down the model to the following architecture:

| Image size | Type           | Kernel size | Feature maps |
|------------|----------------|-------------|--------------|
| 260×260    | Conv           | 3×3         | 16           |
| 258×258    | MaxPool        | 2×2         |              |
| 129×129    | Conv           | 3×3         | 32           |
| 127×127    | MaxPool        | 2×2         |              |
| 64×64      | Conv           | 3×3         | 64           |
| 62×62      | MaxPool        | 2×2         |              |
| 31×31      | Conv           | 3×3         | 128          |
| 29×29      | MaxPool        | 2×2         |              |
| 15×15      | Conv           | 3×3         | 256          |
| 13×13      | MaxPool        | 2×2         |              |
| 7×7        | Conv           | 3×3         | 256          |
| 5×5        | MaxPool        | 2×2         |              |
|            | FullyConnected |             | 256          |
|            | FullyConnected |             | 128          |
|            | Softmax        |             | 2            |
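For concreteness, here is a minimal sketch of this architecture in PyTorch. It is an illustration rather than the original code, and `ceil_mode=True` on the pooling layers is what reproduces the odd feature-map sizes in the table (e.g. 127×127 → 64×64):

```python
import torch.nn as nn

# Sketch of the table above: 3x3 convs without padding, each followed by
# 2x2 max pooling; ReLU activations everywhere; dropout (p=0.5) on the
# two fully connected layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3),    nn.ReLU(), nn.MaxPool2d(2, ceil_mode=True),
    nn.Conv2d(16, 32, 3),   nn.ReLU(), nn.MaxPool2d(2, ceil_mode=True),
    nn.Conv2d(32, 64, 3),   nn.ReLU(), nn.MaxPool2d(2, ceil_mode=True),
    nn.Conv2d(64, 128, 3),  nn.ReLU(), nn.MaxPool2d(2, ceil_mode=True),
    nn.Conv2d(128, 256, 3), nn.ReLU(), nn.MaxPool2d(2, ceil_mode=True),
    nn.Conv2d(256, 256, 3), nn.ReLU(), nn.MaxPool2d(2, ceil_mode=True),
    nn.Flatten(),                       # 256 feature maps of 3x3 remain
    nn.Linear(256 * 3 * 3, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 128),         nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 2),                  # softmax is folded into the loss
)
```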

Something I forgot to mention in my previous post is that I use rectified linear activation functions everywhere. This time I added some regularization: dropout on the two fully connected layers with a dropout probability of 0.5, and weight decay with a regularization coefficient of 1e-3. I again optimized with SGD with momentum (learning rate 0.01, momentum coefficient 0.9).
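As a sketch (again PyTorch-style, not the original code), this training setup corresponds to:

```python
import torch

# SGD with momentum; weight_decay implements the L2 regularization
# with the coefficient from the post.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate
    momentum=0.9,       # momentum coefficient
    weight_decay=1e-3,  # weight decay / L2 regularization coefficient
)
```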

The resulting error curves are shown below.

[Figure: training/validation error and negative log-likelihood (NLL) curves]

With early stopping on the validation error I obtained the following results:

|            | Error  | NLL     |
|------------|--------|---------|
| Training   | 0.0187 | 0.05289 |
| Validation | 0.0447 | 0.1180  |
| Test       | 0.0451 | 0.1268  |
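Early stopping here just means keeping the parameters from the epoch with the lowest validation error. A minimal sketch, assuming hypothetical `train_epoch` and `evaluate` helpers and an arbitrary patience value:

```python
# Track the best validation error and restore the corresponding
# parameters at the end. train_epoch and evaluate are assumed helpers.
best_err, best_state, patience, bad = float("inf"), None, 10, 0
for epoch in range(200):
    train_epoch(model, optimizer, train_set)   # assumed helper
    val_err = evaluate(model, valid_set)       # assumed helper
    if val_err < best_err:
        best_err, bad = val_err, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad += 1
        if bad >= patience:
            break
model.load_state_dict(best_state)
```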

EDIT: Alexandre achieved the best results in our class project so far, 97.8% accuracy, with a similar architecture. However, he included data augmentation to achieve some form of viewpoint invariance, and the 95.5% reported here is the best result so far without data augmentation. Obviously, including these kinds of techniques boosts performance on rather small datasets, but it’s interesting to quantify the exact gain. Our similar architectures allow for such a comparison (excluding the weight decay): the data augmentation yields a gain of roughly 2 percentage points, which is significant.
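For reference, a typical augmentation pipeline of this kind could look like the following; the specific transforms and parameters are assumptions on my part, not Alexandre’s exact setup:

```python
from torchvision import transforms

# Illustrative viewpoint-style augmentation: random flips, small
# rotations, and random resized crops back to the input size.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(260, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
```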

2 thoughts on “95.5% accuracy”

  1. Hey Harm,
Cool, your new model is very similar to the one I used (https://github.com/adbrebs/dogs_vs_cats/blob/master/models/mod_8.py), with the same input shapes and the same learning rate and momentum values. I had similar validation results with dropout only.

Have you measured the influence of the weight decay? I would be curious to see how important it is for our models on this dataset. Alex K. used some for his AlexNet, and it not only helped generalization but, interestingly, also helped the model learn (the training error with weight decay was lower).

You told me the other day that deeper models were quite sensitive to the weight initialization scheme (Gaussian vs. uniform). Did you do other experiments with interesting observations?


  2. Hey Alexandre,

Yes, I actually kept going with the VGG strategy of 3×3 filters. To reduce the number of layers, I applied pooling after each conv layer and doubled the number of feature maps each time (except for the last conv layer). I just compared with your model, and the only difference is that my last two conv layers have 256 instead of 128 feature maps.

I didn’t try without weight decay, but it might be a good idea to check what its exact influence is. I think I also read in Sander Dieleman’s post that it made training more stable. I will try another run without weight decay!

Yes, the initialization of deep models seems to be quite sensitive. I have done some experiments, but no systematic comparison yet. I will try to do one in my next post 🙂
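    For illustration, the two schemes we compared could look like this in a PyTorch-style sketch; the scales here are placeholders, not the values from my experiments:

    ```python
    import torch.nn as nn

    def init_weights(module, scheme="gaussian"):
        # Contrast the two schemes discussed above; std/range are placeholders.
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            if scheme == "gaussian":
                nn.init.normal_(module.weight, mean=0.0, std=0.01)
            else:  # uniform
                nn.init.uniform_(module.weight, a=-0.05, b=0.05)
            nn.init.zeros_(module.bias)

    model.apply(lambda m: init_weights(m, scheme="gaussian"))
    ```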

