Now that we have finished building and testing a baseline model that performed decently in Part 1, I will start building models using transfer learning.
Transfer learning is a machine learning technique in which a pretrained model is used as a base and new layers are added on top to fit our own dataset. Transfer learning is a powerful tool for image recognition because the pretrained models have been trained on millions of images spanning thousands of classes. Since the first few layers of an image recognition network learn to detect basic, common features such as lines, patterns, and blobs, reusing those learned layers should improve our model substantially.
The models that I will try as the base are the following:
- VGG19
- VGG16
- ResNet50
- EfficientNet
All of these models were pretrained on ImageNet, a dataset of roughly 14 million images, which is far larger than our dataset of 11 thousand images. The equipment and time needed to train on that many images are out of reach for most people, so transfer learning is a great tool that should be considered for almost all deep learning projects.
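The basic recipe described above can be sketched in Keras: load a pretrained base, freeze it, and stack a new classification head on top. This is a minimal sketch, not the exact architecture used in this project; the number of age bins, input size, and head layers below are placeholder assumptions.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # hypothetical number of age bins, not from the post

# Load VGG16 pretrained on ImageNet, dropping its original classifier head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the pretrained layers so only the new head is trained at first.
base.trainable = False

# Stack a small classification head on top of the frozen base.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),  # regularization against overfitting
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Unfreezing some of the top convolutional layers of the base (as with the "VGG19 with 16 trainable layers" variant mentioned below) trades more capacity for more risk of overfitting on a small dataset.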
Below is a chart showing each model's accuracy and loss:
All of the models suffer from overfitting despite heavy regularization: the training accuracy is higher than the test accuracy for every model. In the case of VGG19 with 16 trainable layers, the gap between training and test accuracy is 59 percentage points. The best test accuracy comes from the VGG16 model, which also has the second-lowest loss, making it by far the best of the group.
The VGG16 model reaches 53% accuracy on the training data and 45% on the testing data. Although 45% still seems quite low, the confusion matrix heatmap shows that it performs much better than the baseline model.
We can see that the model is no longer predicting only a few classes in the middle. The diagonal of the heatmap is much stronger, reflecting the improved accuracy. Even where the predictions are wrong, they miss by much less: most incorrect predictions are off by only one or two bins. The model still struggles with ages in the middle and excels at predicting the ages of infants, teens, and elders. One improvement to how we measure the models would be to count predictions that are off by only one class range as correct, so we can see which models are almost always close.
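The "off by at most one bin" metric proposed above is straightforward to compute when labels and predictions are integer bin indices. This is a sketch under that assumption; the example arrays are made up for illustration and are not results from the models above.

```python
import numpy as np

def within_k_accuracy(y_true, y_pred, k=1):
    """Fraction of predictions within k bins of the true age bin."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= k))

# Hypothetical bin indices: three exact hits, one off by one, one off by three.
y_true = [0, 2, 4, 5, 6]
y_pred = [0, 2, 4, 6, 3]

print(within_k_accuracy(y_true, y_pred, k=0))  # exact accuracy: 0.6
print(within_k_accuracy(y_true, y_pred, k=1))  # within one bin: 0.8
```

Reporting both the exact (`k=0`) and relaxed (`k=1`) scores side by side would show which models are almost always close even when they miss the exact bin.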