Today we have the task of predicting ages using image classification. Since we are limited in computational resources and data, the best way to get strong performance out of our model is transfer learning. Before we get to transfer learning, let's first build a Convolutional Neural Network (CNN) baseline model to compare against.
The Dataset
The data that I will be using is from Kaggle and contains over 23,000 images of cropped human faces of various ages, genders, and ethnicities. The images also cover a wide variation in pose, facial expression, illumination, occlusion, resolution, etc.
For the age bins, I chose the 12 age ranges below. I decided on these bins based on which age ranges people tend to look similar, as well as keeping a roughly equal number of original images in each bin.
Age bins = [[1, 3], [4, 10], [11, 17], [18, 22], [23, 26], [27, 32], [33, 37], [38, 45], [46, 53], [54, 62], [63, 73], [74, 120]]
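To make the binning concrete, here is a small sketch of how a raw age can be mapped to one of the 12 bin indices. The `age_to_bin` helper is hypothetical, not code from the project:

```python
# Hypothetical helper: map a raw age to the index of its bin above.
AGE_BINS = [(1, 3), (4, 10), (11, 17), (18, 22), (23, 26), (27, 32),
            (33, 37), (38, 45), (46, 53), (54, 62), (63, 73), (74, 120)]

def age_to_bin(age):
    """Return the index of the bin whose [low, high] range contains age."""
    for i, (low, high) in enumerate(AGE_BINS):
        if low <= age <= high:
            return i
    raise ValueError(f"age {age} is outside the supported range")
```

For example, `age_to_bin(25)` falls in the fifth range, [23, 26], so it returns index 4.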
I used the ImageDataGenerator from Keras to rescale the pixel values by 1/255 to speed up training, and also resized the images to 224 x 224. ImageDataGenerator also one-hot encodes the class labels for us.
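A minimal sketch of that preprocessing step, using a synthetic batch in place of the real face crops (in practice you would point `flow_from_directory` at the dataset folder instead):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] down to [0, 1].
datagen = ImageDataGenerator(rescale=1./255)

# Synthetic 224 x 224 RGB images standing in for the real face crops,
# with one-hot labels over the 12 age bins.
x = np.random.randint(0, 256, size=(8, 224, 224, 3)).astype("float32")
y = np.eye(12)[np.random.randint(0, 12, size=8)]

batch_x, batch_y = next(datagen.flow(x, y, batch_size=8, shuffle=False))
# With a real dataset you would instead call something like:
# datagen.flow_from_directory("path/to/faces", target_size=(224, 224),
#                             class_mode="categorical")
```

With `class_mode="categorical"`, `flow_from_directory` infers the 12 classes from the subfolder names and yields the one-hot labels automatically.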
Below we can see a pie chart of the distribution of images for each bin, and the shares are equal. Although some bins span far fewer years than others, it is important for each class to have an equal number of images to train on. If you do not have enough images, ImageDataGenerator can be used to create augmented images that can be added to the dataset.
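If a bin is short on images, augmentation can be sketched like this; the specific settings below (rotation, shifts, flips) are illustrative choices, not the ones used in this project:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings for face crops.
aug = ImageDataGenerator(
    rescale=1./255,
    rotation_range=15,       # random rotations up to 15 degrees
    width_shift_range=0.1,   # small random horizontal shifts
    height_shift_range=0.1,  # small random vertical shifts
    horizontal_flip=True,    # mirror faces left to right
)

# Synthetic images standing in for an under-represented bin.
x = np.random.randint(0, 256, size=(4, 224, 224, 3)).astype("float32")
batch = next(aug.flow(x, batch_size=4, shuffle=False))
```

Each pass over `aug.flow` yields a differently transformed version of the same source images, which is how the augmented copies are generated.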

Convolutional Neural Network Architecture

CNN models make use of the convolution operation and excel at detecting spatial patterns, which makes them well suited to data containing images and videos.
CNN architecture contains these building blocks:
1. Convolutional Layer
- Takes the input, applies a set of learned filters, and outputs feature maps
- Excels at detecting patterns
2. ReLU (Activation) Layer
- The ReLU layer determines whether a neuron's output will fire or not
3. Pooling Layer
- A down-sampling method that reduces the width and height of the output
4. Fully-Connected Layer
- Selecting the right activation function is important: sigmoid is usually used for binary classification, while softmax can be used for both binary and multi-class classification problems
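The building blocks above can be stacked into a small baseline in Keras. The exact layer sizes here are illustrative, not the article's actual architecture; the key points are the Conv + ReLU + pooling pattern and the 12-way softmax output, one unit per age bin:

```python
from tensorflow.keras import layers, models

# Illustrative baseline CNN: conv/ReLU blocks, pooling, then a
# fully-connected head ending in a 12-way softmax over the age bins.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),   # convolution + ReLU
    layers.MaxPooling2D(),                     # down-sample width/height
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),      # fully-connected layer
    layers.Dense(12, activation="softmax"),    # one probability per age bin
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # matches one-hot labels
              metrics=["accuracy"])
```

Because the labels are one-hot encoded over 12 bins, the softmax output paired with categorical cross-entropy is the natural choice here.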

Our baseline model achieved around 50 percent accuracy on the training data but only 35 percent on the validation data. Overfitting was a problem I ran into with all the models I tried.
In the Confusion Matrix Heat Map below, we can see that the model struggles a lot with the ages in the middle.

A lot of the categories have very few predictions. For example, the 23–26 age group has only 6 predictions, with only 2 of them correct. The 38–45 range has only 3 correct predictions, the 46–53 range only 7, and the 54–62 range only 9. With the baseline model in place, I will now go through the other models that I tried out and see how they perform.
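For reference, a confusion matrix like the one visualized above can be computed with scikit-learn and rendered as a heat map; the toy labels below are stand-ins for real model output:

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted bin indices standing in for real model predictions.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred, labels=range(3))
# Row i, column j counts images of true bin i predicted as bin j,
# so the diagonal holds the correct predictions.
# seaborn.heatmap(cm, annot=True) would render it as a heat map.
```

Reading the diagonal of the full 12 x 12 matrix is how the per-bin correct counts quoted above are obtained.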