Image Classification

In this tutorial, we are going to build a neural network that can classify handwritten digits (0-9)! Does that sound exciting? Probably not in this problem domain. However, the principles we are going to apply to handwritten digit recognition carry over to other visual recognition challenges. Would you like to build a 1000-class image classifier? Are you developing an automated vehicle? Do you want to replicate the human visual system? Read on.

Disclaimer: the latter two objectives are far more complicated than this tutorial makes vision seem. It is almost abusive to treat human vision as something as simple as convolution.

This tutorial was written to complete the Quiz 14 requirement of Data Mining:

Complete the MNIST Classifier shown in class and submit the code+output screenshot.

Change the network to contain 4 convolution layers with 6, 32, 64, 16 filters, and 3 fully connected layers with 256, 64, 10 nodes in each layer respectively.

Use sigmoid activation in all layers except the output layer.

And later extended for the Assignment 1 requirement of the same class:

Your goal is building a CIFAR-10 image classifier.

All comments and code were written from memory. No papers, books, Google, Stack Overflow, or other Internet resources were consulted unless noted.

Copyright © Jacob Valdez 2021. Released under MIT License.

Getting Started

As you start to explore GitHub, you'll observe a few common nicknames that we give our packages. I'm just going to import my default go-tos for now:
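The original import cell is not reproduced here, so the following is only a minimal sketch of the usual aliases a notebook like this relies on; the exact set of imports is an assumption.

# A minimal sketch of the usual aliases (the exact import list is assumed, not original).
import numpy as np               # numerical arrays, conventionally aliased np
import matplotlib.pyplot as plt  # plotting, conventionally aliased plt
import tensorflow as tf          # deep learning framework
from tensorflow import keras     # high-level model building API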

The Data

Let's load the MNIST dataset and observe a few elements.

Notice that each number has an image (stored in X_train/X_test) and a label (stored in Y_train/Y_test). Each image is 28 by 28 pixels, and there are 60,000 training examples and 10,000 test examples. Note that this dataset is supplied as integers, so I'm going to convert it to a floating point representation for our neural network:
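Since the loading cell is not shown, here is a sketch of that step using keras.datasets; the variable names and the rescaling by 255 are assumptions consistent with the description above.

# Sketch of the loading and conversion step (variable names assumed).
(X_train, Y_train), (X_test, Y_test) = keras.datasets.mnist.load_data()

print(X_train.shape, Y_train.shape)  # (60000, 28, 28) (60000,)
print(X_test.shape, Y_test.shape)    # (10000, 28, 28) (10000,)

# The images arrive as uint8 in [0, 255]; cast to float32 (and scale) for the network.
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0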

The Classifier

We're just going to build a plain old convolutional neural network. The idea behind performing convolutions is that not every part of an image carries information pertaining to every other part. As we analyze a scene, we can often decompose the visual information into a spatially segmented hierarchy of relationships. Convolutional neural networks carry this inductive bias by performing a miniature perceptron operation at every receptive field location in an image. Unless commented below, we'll use Keras's default implementations to achieve this:
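The model cell itself is not shown, so here is a sketch of an architecture matching the quiz requirement (4 convolution layers with 6, 32, 64, 16 filters, then dense layers with 256, 64, 10 units, sigmoid everywhere except the output); the kernel size, padding, and softmax output layer are assumptions.

# Sketch of the quiz architecture; kernel size, padding, and the softmax output are assumed.
model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),  # images need a trailing channel axis, e.g. X_train[..., None]
    keras.layers.Conv2D(6, 3, padding='same', activation='sigmoid'),
    keras.layers.Conv2D(32, 3, padding='same', activation='sigmoid'),
    keras.layers.Conv2D(64, 3, padding='same', activation='sigmoid'),
    keras.layers.Conv2D(16, 3, padding='same', activation='sigmoid'),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation='sigmoid'),
    keras.layers.Dense(64, activation='sigmoid'),
    keras.layers.Dense(10, activation='softmax'),  # output layer: probabilities over the 10 digits
])
model.summary()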

Training

Next, we're going to train our classifier. Since the data is supplied with integer labels but our model outputs probabilities over 10 classes, we cannot directly compare the two without either one-hot encoding the labels and using categorical cross entropy, or using a loss that consumes integer labels directly (sparse categorical cross entropy).

I select the latter option for computational and information theoretic reasons. Cross entropy $H(p,q)$ represents the expected amount of information needed to encode samples drawn from $p$ using a code optimized for $q$. Formally, $$H(p,q)=E_{x \sim p(x)}[-\log{q(x)}]$$ This is ideal when our model serves as the posterior $q(x,y)$ and the dataset as the prior $p(x,y)$. Our loss function will then be the sparse categorical cross entropy between our model's estimates and the dataset labels. Keras provides a high-level interface to implement this in the model.compile function:
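A sketch of that compile call; the choice of Adam as the inner optimizer is an assumption, since the text does not pin one down at this point.

# Sketch of the compile step; the optimizer choice here is assumed.
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # integer labels vs. predicted class probabilities
    metrics=['accuracy'],
)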

Now it's time to actually train the model. Let's supply our training and testing data and see how training progresses:
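Roughly, the training call might look like the sketch below; the batch size is an assumption, and the 10-epoch budget follows the runs discussed later.

# Sketch of the training run (batch size assumed); the test split doubles as validation data here.
history = model.fit(
    X_train[..., None], Y_train,
    validation_data=(X_test[..., None], Y_test),
    epochs=10,
    batch_size=64,
)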

What's happening? The loss isn't improving.

Why can't we just plug and chug whatever data we want into our model? Consider two reasons: 1) there is no globally optimal universal approximator, and a specialized model such as this CNN may not have sufficient inductive priors to estimate its data generating distribution; 2) sigmoid-type activation functions saturate easily, meaning that when the input takes large positive or negative values, gradients are effectively zero. During backpropagation, the gradients hardly penetrate the top layer and only slowly reach lower and lower into the model. (See the paper that introduced batch normalization and The Principles of Deep Learning Theory for a longer discussion of these points.)

We can solve this problem by changing our activation function to something that is still nonlinear but lets gradients flow more freely across the epochs. My go-to activation function is the rectified linear unit, relu:
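A sketch of the swap: the same architecture as before, parameterized by the hidden activation so it can be rebuilt with relu (the build_model helper name is my own, not from the original).

# Sketch: same architecture, hidden activation swapped for relu (helper name assumed).
def build_model(activation='relu'):
    return keras.Sequential([
        keras.layers.Input(shape=(28, 28, 1)),
        keras.layers.Conv2D(6, 3, padding='same', activation=activation),
        keras.layers.Conv2D(32, 3, padding='same', activation=activation),
        keras.layers.Conv2D(64, 3, padding='same', activation=activation),
        keras.layers.Conv2D(16, 3, padding='same', activation=activation),
        keras.layers.Flatten(),
        keras.layers.Dense(256, activation=activation),
        keras.layers.Dense(64, activation=activation),
        keras.layers.Dense(10, activation='softmax'),
    ])

model = build_model('relu')
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train[..., None], Y_train, validation_data=(X_test[..., None], Y_test), epochs=10)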

What a significant change. relu definitely performed better in the first 10 epochs than sigmoid did. Feel free to experiment with this model yourself.

How to Overfit Your Dev Set

I hope you've enjoyed learning about machine learning by tweaking the hyperparameters of your model. You probably realize at this point that we could tweak hyperparameters forever. Why not let the machine learn machine learning instead? ray-tune is a powerful tool we can use to find optimal hyperparameters for a model. Per its official docs, ray.tune frames its optimization problem as a run -- report metric -- optimize iteration loop. To give you the idea, here's their quick start code:

from ray import tune


def objective(step, alpha, beta):
    return (0.1 + alpha * step / 100)**(-1) + beta * 0.1


def training_function(config):
    # Hyperparameters
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Iterative training function - can be any arbitrary training procedure.
        intermediate_score = objective(step, alpha, beta)
        # Feed the score back to Tune.
        tune.report(mean_loss=intermediate_score)


analysis = tune.run(
    training_function,
    config={
        "alpha": tune.grid_search([0.001, 0.01, 0.1]),
        "beta": tune.choice([1, 2, 3])
    })

print("Best config: ", analysis.get_best_config(
    metric="mean_loss", mode="min"))

# Get a dataframe for analyzing trial results.
df = analysis.results_df

Let's make an isomorphic case with our MNIST classifier: We'll have a triple optimization loop. On the inside, SGD, Adam, RMSProp, or another first order optimizer will backpropagate gradients into the trainable parameters. After 10 epochs, a hyperparameter optimizer will tune our choice of activation function, hidden convolution and dense layers, hidden depth, loss function, and inner optimizer. Finally, we'll be the slow optimizer and make changes to the primary and secondary optimization loops when needed. Let's start by defining our meta-objective:
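The actual definition lived in a code cell that is not shown here, so the following is a hypothetical sketch of what such a meta-objective could look like; the function name meta_loss, the config keys, and the choice to report validation loss are all assumptions.

# Hypothetical sketch of the meta-objective (function name and config keys assumed).
def meta_loss(config):
    # build a hyperparameterized model from the trial's config
    model = keras.Sequential([keras.layers.Input(shape=(28, 28, 1))])
    for filters in config['conv_filters']:
        model.add(keras.layers.Conv2D(filters, 3, padding='same',
                                      activation=config['activation']))
    model.add(keras.layers.Flatten())
    for units in config['dense_units']:
        model.add(keras.layers.Dense(units, activation=config['activation']))
    model.add(keras.layers.Dense(10, activation='softmax'))

    model.compile(optimizer=config['optimizer'],
                  loss=config['loss'],
                  metrics=['accuracy'])

    # 10 epochs of the inner optimization loop, reporting back to the outer loop each epoch
    # (in practice the dataset also has to be visible to every ray worker)
    for _ in range(10):
        hist = model.fit(X_train[..., None], Y_train,
                         validation_data=(X_test[..., None], Y_test),
                         epochs=1, verbose=0)
        tune.report(mean_loss=hist.history['val_loss'][-1])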

Iteration 1

Now, just looking at the hyperparameter space we've defined, you can see why this is overkill for MNIST. Each run of meta_loss performs a full 10 epochs of the optimization loop beneath it. To meet these computational demands, I'm running this notebook on a deep-learning-optimized Google Cloud VM (n1-highmem-2 with an nvidia-tesla-k80). Learn how you can do this yourself on AWS or GCP from my previous notebook.

Without further hesitation (the assignment due date is approaching), let's start tuning!
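The tuning cell is not shown; the following is a hypothetical sketch of kicking off the search over the large space described above (the specific candidate values and resource settings are assumptions).

# Hypothetical search space; as noted next, it is far larger than MNIST calls for.
analysis = tune.run(
    meta_loss,
    config={
        'activation': tune.choice(['sigmoid', 'tanh', 'relu']),
        'conv_filters': tune.choice([[6, 32, 64, 16], [32, 64], [16, 32, 64, 128]]),
        'dense_units': tune.choice([[256, 64], [128], [512, 128, 32]]),
        'loss': tune.choice(['sparse_categorical_crossentropy']),
        'optimizer': tune.choice(['sgd', 'adam', 'rmsprop']),
    },
    resources_per_trial={'cpu': 2, 'gpu': 1},
)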

Obviously, I got carried away. There are way too many tunable parameters to expect convergence. Let's try a smaller search space with only the hidden unit counts changing:
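Again a sketch under the same assumptions, with everything pinned except the layer widths:

# Reduced sketch: only the hidden layer widths vary.
analysis = tune.run(
    meta_loss,
    config={
        'activation': 'relu',
        'conv_filters': tune.grid_search([[6, 32, 64, 16], [8, 16, 32, 16]]),
        'dense_units': tune.grid_search([[256, 64], [128, 32]]),
        'loss': 'sparse_categorical_crossentropy',
        'optimizer': 'adam',
    },
)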

Iteration 2

There were several outer-loop iterations behind those code blocks above. It seems that the behavior of meta_loss changes during a tuning session, causing bugs to surface. I found a Keras-native example while debugging this issue and read the ray tune documentation. That Keras example did not actually work, so I probed into the training pipeline and found that the model.fit function raises TensorFlow errors while executing on ray workers. To combat this, I wrote my own training loop using tf.GradientTape and further scaled down the hyperparameter space:
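Here is a sketch of that hand-rolled loop; it reuses the build_model helper sketched earlier, and the config keys, batch size, and epoch count are assumptions.

# Sketch of a hand-written training loop replacing model.fit inside the trainable.
def meta_loss(config):
    model = build_model(config['activation'])          # helper sketched earlier
    loss_fn = keras.losses.SparseCategoricalCrossentropy()
    optimizer = keras.optimizers.get(config['optimizer'])
    dataset = (tf.data.Dataset.from_tensor_slices((X_train[..., None], Y_train))
               .shuffle(1024).batch(64))

    @tf.function
    def train_step(x, y):
        with tf.GradientTape() as tape:
            preds = model(x, training=True)
            loss = loss_fn(y, preds)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    for _ in range(10):
        for x_batch, y_batch in dataset:
            loss = train_step(x_batch, y_batch)
        tune.report(mean_loss=float(loss))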

Iteration 3

Iteration 4

At this point, I began to suspect serious issues with the way I am using ray.tune. If you run into the same wall, I hope you find a solution; maybe I have by now as well, since ray tune is such a flexible and useful tool. Along my search, I encountered a narrower library, keras-tuner, that does what I want. Check out the guide for information on getting started. In my case, deploying keras-tuner was as simple as:

The following code is directly copied from https://www.tensorflow.org/tutorials/keras/keras_tuner
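That cell is not reproduced here; below is a sketch along the lines of that tutorial, adapted to the search described in this notebook (the exact layer sizes and the activation/optimizer choices are assumptions).

# Sketch in the style of the keras_tuner tutorial; the exact choices below are assumed.
import keras_tuner as kt

def model_builder(hp):
    activation = hp.Choice('activation', ['sigmoid', 'tanh', 'relu'])
    units = hp.Int('units', min_value=32, max_value=512, step=32)
    optimizer = hp.Choice('optimizer', ['sgd', 'adam', 'rmsprop'])

    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(units, activation=activation),
        keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

tuner = kt.Hyperband(model_builder, objective='val_accuracy',
                     max_epochs=10, factor=3)
tuner.search(X_train, Y_train, epochs=10, validation_split=0.2)
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]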

I was surprised that the tuner chose tanh and the humble SGD optimizer. That's convenient, since SGD is the fastest of the optimizers tested. Let's now test these hyperparameters to see if we can reach the 0.9877 validation accuracy that keras-tuner boasts:
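A sketch of that final check, assuming the tuner and best_hps objects from the sketch above; the epoch budget is an assumption.

# Sketch of the final check (epoch budget assumed): rebuild with the winning
# hyperparameters and watch the validation accuracy.
best_model = tuner.hypermodel.build(best_hps)
history = best_model.fit(X_train, Y_train, epochs=10, validation_split=0.2)
print('best val accuracy:', max(history.history['val_accuracy']))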