Preface
In this article, I will go through some key math background for understanding DNNs, as well as finetuning, i.e. how to build and train a linear model on top of an existing image recognition model for our own task.
This is a study note of Fast.ai Lesson 2 .
So what was the magic happening in Lesson 1?
How could we just borrow the Vgg model, finetune it, and have it magically distinguish cats vs dogs?
Let's start with how Vgg recognizes images.
What is Vgg?
Vgg is basically a DNN trained on ImageNet that can classify an input image into one of 1000 categories (it actually gives a probability for each category). How does it do that? Although a DNN is often a black box, we can still understand parts of how it works by following this paper. Generally, an image recognition DNN consists of multiple layers of patterns, from very simple ones to complicated ones. The first layer might just match a gradient, a line, a diagonal, a curve, any small simple pattern like that. The second layer assembles the patterns found in the first layer, so it might now recognize a corner (two diagonals connected at a 90 degree angle), a circle, or an oval. The same logic follows on up the stack. For example, the model might identify human faces at level 5 or 6.
Here is an image taken from the paper: the first 5 layers are shown along with actual image patches where each one found a match. Vgg itself has 16 such layers.
DNN
So how does Vgg do this? No one tells it about these patterns; that is the work of the black box of the DNN.
A neural network is at its core a sequence of matrices that map an input vector to an output vector through matrix multiplication. The intermediate vectors in between each matrix are the activations, and the matrices themselves are the layers. Through a process we’ll learn about called “fitting”, our goal is to adjust the values of the matrices, which we call “weights”, so that when our input vectors are passed into the neural network we are able to produce an output vector that is as close as possible to the true output vector, and we do this across multiple labeled input vectors. This is what makes up a training set.
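As a minimal sketch of that idea (sizes and values are made up), a two-layer network is just two matrix multiplications, with the intermediate activations in between:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=3)        # input vector
W1 = rng.normal(size=(4, 3))  # first layer's weight matrix
W2 = rng.normal(size=(2, 4))  # second layer's weight matrix

a1 = W1 @ x      # activations: the intermediate vector between the layers
y_hat = W2 @ a1  # output vector, compared against the true output y
print(y_hat.shape)  # (2,)
```

Fitting adjusts the entries of W1 and W2 so that y_hat gets close to the target vector across the whole training set.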
Above, we started with randomly generated weights as our matrix elements. After performing all the operations and observing the outcome, notice how the activation output is significantly different from our target vector y. Our goal is to get as close to the target vector y as possible using some sort of optimization algorithm. Before running the optimization algorithm, it is suggested to initialize your weight values in a manner that makes the activation output at least relatively close to the target vector. This is called weight initialization.
There are many weight initializers to choose from. In the lecture, Jeremy uses Xavier Initialization (also known as Glorot Initialization). However, it’s important to note that most modern deep learning libraries will handle weight initialization for you.
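A minimal sketch of the idea, assuming the common uniform Glorot variant with limit $\sqrt{6 / (n_{in} + n_{out})}$ (the function name here is mine):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # Glorot/Xavier uniform initialization: scale the random weights by
    # fan-in and fan-out so activation variance stays roughly constant
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_uniform(1000, 2, rng)
print(W.shape)  # (2, 1000)
```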
Loss Functions
So how do we measure how well the network is doing? We need to define a loss function, and then minimize it. There are several popular loss functions; here I will introduce SVM (Support Vector Machine) loss and Softmax.
SVM as Loss Function
The Multiclass Support Vector Machine loss for the $i$-th example is:

$$L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + \Delta)$$

- $y_i$ is the true label class
- $j \ne y_i$ ranges over all incorrect classes
- $s_j$ is the score of incorrect class $j$
- $s_{y_i}$ is the score of the true label class
- $\Delta$ is the tolerance (margin) for the difference
- $\max(0, \cdot)$ is the hinge loss; people sometimes use the squared hinge loss $\max(0, \cdot)^2$, which penalizes violated margins more strongly, but usually the linear hinge loss is good enough
In summary, the SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least $\Delta$ (delta). If this is not the case, we accumulate loss.
$i.e.$ The Multiclass Support Vector Machine “wants” the score of the correct class to be higher than all other scores by at least a margin of delta.
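A runnable sketch of this loss for a single example (the scores are made up):

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    # sum over the incorrect classes of max(0, s_j - s_{y_i} + delta)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0  # never count the correct class against itself
    return margins.sum()

scores = np.array([13.0, -7.0, 11.0])  # made-up class scores
print(svm_loss_single(scores, y=0))    # 0.0: class 0 wins every margin
print(svm_loss_single(np.array([10.0, 9.5, 8.0]), y=0))  # 0.5: class 1 is too close
```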
Regularization for SVM
A problem with this SVM loss function is that there could be multiple sets of $W$ that satisfy it (driving $L$ to 0). So we want to encode a preference for certain weights to remove this ambiguity. A standard way is to extend the loss function with a regularization penalty $R(W)$. The most common regularization penalty is the $L2$ norm, which discourages large weights through an elementwise quadratic penalty over all parameters:

$$R(W) = \sum_k \sum_l W_{k,l}^2$$
The new loss function $L$ now contains two parts: the data loss (the average loss $L_i$ over all samples) and the regularization loss. That is the full Multiclass SVM loss:

$$L = \underbrace{\frac{1}{N} \sum_i L_i}_{\text{data loss}} + \underbrace{\lambda R(W)}_{\text{regularization loss}}$$

which can be expanded to its full form:

$$L = \frac{1}{N} \sum_i \sum_{j \ne y_i} \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) + \lambda \sum_k \sum_l W_{k,l}^2$$
This improves generalization performance and in the end leads to less overfitting: the $L2$ penalty prefers smaller and more diffuse weight vectors, so the final classifier is encouraged to take all input dimensions into account in small amounts, rather than a few input dimensions very strongly.
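A small numeric illustration of that preference (the weights and input here are made up): two weight vectors that score identically, where the L2 penalty favors the diffuse one:

```python
import numpy as np

def l2_penalty(W):
    # elementwise quadratic penalty over all parameters
    return np.sum(W * W)

x = np.ones(4)
W_peaky = np.array([1.0, 0.0, 0.0, 0.0])        # uses one input dimension strongly
W_diffuse = np.array([0.25, 0.25, 0.25, 0.25])  # spreads over all dimensions

# Both give the exact same score on this input...
print(W_peaky @ x, W_diffuse @ x)  # 1.0 1.0
# ...but the L2 penalty prefers the diffuse weights
print(l2_penalty(W_peaky), l2_penalty(W_diffuse))  # 1.0 0.25
```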
Softmax as Loss Function
In a softmax classifier, the function mapping $f(x_i; W) = W x_i$ is unchanged, but we interpret these scores as unnormalized log probabilities for each class, and replace the hinge loss with a cross-entropy loss of the following form:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$
- $f_j$ is the $j$-th element of the vector of class scores $f$.
- $\frac{e^{f_j}}{\sum_k e^{f_k}}$ is the softmax function: it takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
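A minimal sketch of the softmax cross-entropy loss (subtracting the max score first is the standard numerical-stability trick; it doesn't change the probabilities):

```python
import numpy as np

def softmax_cross_entropy(f, y):
    # shift scores so the largest is 0: improves numerical stability
    # without changing the resulting probabilities
    f = f - np.max(f)
    p = np.exp(f) / np.sum(np.exp(f))  # squash to probabilities summing to 1
    return -np.log(p[y])               # cross-entropy loss for true class y

f = np.array([3.0, 1.0, 0.2])       # made-up class scores
print(softmax_cross_entropy(f, 0))  # small loss: class 0 already dominates
```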
Optimization with SGD
So with a loss function, we can turn "how good is this set of weights $W$?" into a single number.
Our goal in optimization is then to find the $W$ which minimizes the loss function.
Strategy 1: A very bad solution: Random Search
What should we do?
Core idea: iterative refinement.
Of course, it turns out that we can do much better than this random search. The core idea is that finding the best set of weights W is a very difficult or even impossible problem (especially once W contains weights for entire complex neural networks), but the problem of refining a specific set of weights W to be slightly better is significantly less difficult. In other words, our approach will be to start with a random W and then iteratively refine it, making it slightly better each time.


Strategy 2: Random Local Search (Slightly Better)
So a slightly better solution is:

- do the same random perturbation as in random search,
- but only keep the new weights if they produce a lower loss.
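A toy sketch of random local search, using a made-up quadratic loss in place of a real network's loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(W):
    # made-up stand-in loss: distance from some "ideal" weights
    return np.sum((W - 3.0) ** 2)

W = rng.normal(size=5)
init_loss = loss(W)
best_loss = init_loss
for _ in range(1000):
    W_try = W + rng.normal(size=5) * 0.1  # perturb the weights randomly
    loss_try = loss(W_try)
    if loss_try < best_loss:              # only keep the step if loss drops
        W, best_loss = W_try, loss_try

print(init_loss, '->', best_loss)
```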


Strategy 3: Following the Gradient
It turns out that there is no need to randomly search for a good direction: we can compute the best direction along which to change our weight vector, one that is mathematically guaranteed to be the direction of steepest descent (at least in the limit as the step size goes to zero). This direction is related to the gradient of the loss function.
In our hiking analogy, this approach roughly corresponds to feeling the slope of the hill below our feet and stepping down the direction that feels steepest.
Computing the Gradient
There are two ways of computing gradient:
Numerical Gradient
```python
import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    fx = f(x)  # evaluate function value at the original point
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        # evaluate the function at x + h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h  # increment by h
        fxh = f(x)  # evaluate f(x + h)
        x[ix] = old_value  # restore to previous value (very important!)

        # compute the partial derivative
        grad[ix] = (fxh - fx) / h  # the slope
        it.iternext()  # step to next dimension

    return grad

def CIFAR10_loss_fun(W):
    return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001  # random weight vector
df = eval_numerical_gradient(CIFAR10_loss_fun, W)  # get the gradient

loss_original = CIFAR10_loss_fun(W)  # the original loss
print('original loss: %f' % (loss_original,))

for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
    step_size = 10 ** step_size_log
    W_new = W - step_size * df  # step in the negative gradient direction
    loss_new = CIFAR10_loss_fun(W_new)
    print('for step size %f new loss: %f' % (step_size, loss_new))
```

Analytic Gradient
Numerical gradients are expensive to compute for models with millions of parameters, which is very common for DNNs (each step perturbs one parameter at a time, so the cost is linear in the number of parameters). We normally use the other option, the analytic gradient, in which we use a direct formula for the gradient, which is much faster to compute.
Suppose we have the SVM loss function for a single data point:

$$L_i = \sum_{j \ne y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$
We can differentiate the function with respect to the weights $W$. For example, taking the gradient with respect to $w_{y_i}$ we obtain:

$$\nabla_{w_{y_i}} L_i = -\left( \sum_{j \ne y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i$$
where $\mathbb{1}$ is the indicator function which
 if the condition inside is true, it evals to 1
 if false, it evals to 0
The function looks confusing but at its essence, it is equivalent to:
Count the number of classes that didn't meet the desired margin $\Delta$, and scale the data vector $x_i$ by this count (with a minus sign). The result is the gradient with respect to $w_{y_i}$.
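A runnable sketch of that gradient computation (the function name and the example weights are mine):

```python
import numpy as np

def svm_grad_wyi(W, x, y, delta=1.0):
    # Gradient of the single-example SVM loss with respect to the
    # correct class's weight row w_{y_i}: count the margin-violating
    # classes and scale the data vector x by that (negated) count.
    scores = W @ x
    margins = scores - scores[y] + delta
    margins[y] = 0
    num_violations = np.sum(margins > 0)  # the summed indicator function
    return -num_violations * x

W = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [0.0, 2.0]])
x = np.array([1.0, 1.0])
print(svm_grad_wyi(W, x, y=0))  # both wrong classes violate, so -2 * x
```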
Gradient Descent
Now we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called Gradient Descent. A vanilla version looks like this:
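The vanilla loop is essentially the following, sketched here with a made-up quadratic loss and its analytic gradient so the loop can actually run:

```python
import numpy as np

def loss_fun(w):
    return np.sum((w - 2.0) ** 2)  # toy loss, minimized at w = 2

def evaluate_gradient(w):
    return 2.0 * (w - 2.0)         # its analytic gradient

weights = np.zeros(3)
step_size = 0.1
for _ in range(200):  # the vanilla version is simply "while True"
    weights_grad = evaluate_gradient(weights)
    weights += -step_size * weights_grad  # perform the parameter update

print(loss_fun(weights))  # essentially 0: we reached the minimum
```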


This simple loop is at the core of every neural network library.
There are a few different flavors of gradient descent:

- Batch Gradient Descent
- Mini-Batch Gradient Descent
- Stochastic Gradient Descent

Mini-Batch Gradient Descent is the most commonly used one and is often referred to as SGD. It takes a random batch of samples (an arbitrary size such as 32, 64, 128, or 256), computes the gradient on it, and updates the parameters (weights) every time. A vanilla version looks like this:
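A runnable sketch of the mini-batch loop, using a made-up linear-regression loss standing in for a real training objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up linear-regression problem standing in for a real training set
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w

def batch_gradient(w, X_batch, y_batch):
    # gradient of the mean squared error on this mini-batch only
    err = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ err / len(y_batch)

w = np.zeros(5)
step_size = 0.05
for _ in range(500):
    idx = rng.choice(len(X), size=64, replace=False)     # sample a mini-batch
    w += -step_size * batch_gradient(w, X[idx], y[idx])  # update every batch

print(np.round(w, 2))  # close to true_w
```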


Backpropagation (Gradient Descent using reversemode autodiff)
An ANN (MLP, multilayer perceptron) is composed of an input layer, n (n $\geq$ 1) hidden layers, and one output layer. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an ANN has $\geq$ 2 hidden layers, it is called a DNN.
But for years people struggled to find a way to train a DNN, until backpropagation.
For each training instance, the algorithm feeds it to the network and computes the output of every neuron in each consecutive layer (known as the forward pass). Then it measures the output error of the network and computes how much each neuron in the last hidden layer contributed to each output neuron's error. It then proceeds to measure how much of these error contributions comes from the previous hidden layer, and this logic carries on until the algorithm reaches the input layer.
This reverse pass efficiently measures the error gradient across all the connection weights in the DNN by propagating the error gradient backward through the network.
In short, for each training instance the backpropagation algorithm first makes a prediction (the forward pass), then measures the error of this prediction, then goes through each layer in reverse order to measure the error contribution from each connection (the reverse pass), and finally slightly tweaks the connection weights to reduce the error (the Gradient Descent step).
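A minimal numeric sketch of one backpropagation step on a tiny two-layer network (the ReLU nonlinearity, squared-error loss, and sizes are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: x -> W1 -> ReLU -> W2 -> prediction
x = rng.normal(size=3)
t = np.array([1.0, 0.0])  # target vector
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(2, 4)) * 0.1

# Forward pass: compute every intermediate activation
z1 = W1 @ x
a1 = np.maximum(0, z1)           # ReLU
y_hat = W2 @ a1
loss = np.sum((y_hat - t) ** 2)  # squared error

# Reverse pass: propagate the error gradient backward, layer by layer
d_yhat = 2.0 * (y_hat - t)  # dLoss/dy_hat
dW2 = np.outer(d_yhat, a1)  # each connection's contribution to the error
d_a1 = W2.T @ d_yhat        # pass the error back to the previous layer
d_z1 = d_a1 * (z1 > 0)      # backprop through the ReLU
dW1 = np.outer(d_z1, x)

# Slightly tweak the weights to reduce the error (gradient descent step)
W1 -= 0.1 * dW1
W2 -= 0.1 * dW2
print(loss)
```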
The Math details is skipped here, for details checkout here: Back Propagation.
The code
So now we have a basic understanding of how Vgg works behind the scenes and some fundamentals of DNNs. It's time to dig into what's happening in the finetuning step. Basically, we just need to map the 1000-category output to our 2-category output.
How can we do that?
Just apply a DNN to it:
1. Take the 1000-category result as an input array of shape `[1000, 1]`.
2. Train a DNN to fit the `[1000, 1]` input to the `[2, 1]` output using the training set.
3. Remove the original 1000-category layer and append our new layer.
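A toy sketch of that training step: a single linear + softmax layer mapping 1000-dimensional outputs to 2 classes. The data here is entirely made up (cats and dogs "light up" two arbitrary categories), not actual Vgg outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are the 1000-category outputs for 200 training images,
# where cats light up category 7 and dogs light up category 42 (made up).
n, n_in, n_out = 200, 1000, 2
X = rng.random(size=(n, n_in)) * 0.01
labels = rng.integers(0, 2, size=n)  # 0 = cat, 1 = dog
X[labels == 0, 7] = 1.0
X[labels == 1, 42] = 1.0

W = np.zeros((n_out, n_in))
step_size = 0.5
for _ in range(300):
    scores = X @ W.T  # shape [n, 2]
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)   # softmax probabilities
    p[np.arange(n), labels] -= 1.0      # gradient of the cross-entropy loss
    W -= step_size * (p.T @ X) / n

preds = (X @ W.T).argmax(axis=1)
print((preds == labels).mean())  # training accuracy
```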
Check out the Source Code here for details.