Linear Algebra View: Vector and Matrix Sizes
In this section, we will take a Linear Algebra view of what we have covered before, and look at the specific vectors and matrices and their sizes, as they are involved in the gradient descent algorithm.
So if you remember, a simple linear model is just $f(x, W) = Wx$; we actually have the bias term $b$, which is tucked into $W$.
where the size of $W$ is $C \times (D+1)$ and the size of $x$ is $(D+1) \times 1$, where $C$ is the number of classes and $D$ is the dimensionality of the input. The matrix $W$ has a row for each of the $C$ classes and $D+1$ columns, where the extra one is for the included bias (with a corresponding constant $1$ appended to $x$). The size of the output $f(x, W)$ is just $C \times 1$.
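As a quick sanity check on these sizes, here is a minimal NumPy sketch; the sizes $C = 3$ and $D = 4$ are just illustrative assumptions:

```python
import numpy as np

# Illustrative sizes: C = 3 classes, D = 4 input dimensions (an assumption).
C, D = 3, 4

W = np.random.randn(C, D + 1)   # one row per class, one extra column for the bias
x = np.random.randn(D)          # raw input of dimensionality D
x_aug = np.append(x, 1.0)       # append a constant 1 so the bias is tucked into W

scores = W @ x_aug              # size C x 1: one score per class
print(W.shape, x_aug.shape, scores.shape)  # (3, 5) (5,) (3,)
```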
Let's take a look at some conventions with respect to the sizes of derivatives. The sizes of derivatives for scalars, vectors and matrices are as follows. Assume that we have a scalar $s$ which is one-dimensional (so $s \in \mathbb{R}^1$), we have a vector $v$ which is $m$-dimensional (so $v \in \mathbb{R}^{m \times 1}$, a column vector), and then we have a matrix $M$ with dimensions $k \times \ell$ (so $M \in \mathbb{R}^{k \times \ell}$). Now what is the size of the partial derivative of $v$ with respect to $s$, or a vector with respect to a scalar? This size is actually $m \times 1$ (or $\mathbb{R}^{m}$), which is a column vector of size $m$. Each element of this column vector is the partial derivative $\frac{\partial v_i}{\partial s}$ of a particular element in $v$ with respect to $s$.
What is the size of the partial derivative of a scalar $s$ with respect to a vector $v$? Here, $\frac{\partial s}{\partial v}$ is a row vector of size $1 \times m$, where again, each element is the partial derivative $\frac{\partial s}{\partial v_i}$ of the scalar with respect to each element in the vector.
Now what is the size of the partial derivative of a vector $v^1 \in \mathbb{R}^{m_1}$ with respect to another vector $v^2 \in \mathbb{R}^{m_2}$? In this case, $\frac{\partial v^1}{\partial v^2}$ is actually an $m_1 \times m_2$ matrix called the Jacobian, which contains all the partial derivatives between the elements of $v^1$ and $v^2$.
Specifically, for row $i$ and column $j$, it has the partial derivative $\frac{\partial v^1_i}{\partial v^2_j}$ of $v^1_i$ with respect to $v^2_j$. And so this tells us the interactions between each element in $v^1$ and each element in $v^2$, and that is why it takes this matrix form. So again this tells us: for a small change in an element $v^2_j$, how does that affect all the elements in $v^1$? And so on for all elements of $v^2$.
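One way to see the Jacobian's shape concretely is a finite-difference sketch. The `jacobian` helper below is hypothetical (not part of any library); for a linear map, the Jacobian it recovers is exactly the map's matrix:

```python
import numpy as np

def jacobian(f, v, eps=1e-6):
    """Finite-difference Jacobian of f at v: entry (i, j) approximates
    how output element i changes per unit change in input element j."""
    f0 = f(v)
    J = np.zeros((f0.size, v.size))
    for j in range(v.size):
        v_step = v.copy()
        v_step[j] += eps            # nudge one input element at a time
        J[:, j] = (f(v_step) - f0) / eps
    return J

# A linear map f(v) = A v from R^3 to R^2; its Jacobian is exactly A.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])
J = jacobian(lambda v: A @ v, np.zeros(3))
print(J.shape)                      # (2, 3): rows index outputs, columns index inputs
```

Rows index the output vector and columns index the input vector, matching the $m_1 \times m_2$ convention above.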
So what is the size of the partial derivative of a scalar with respect to a matrix, that is, the size of $\frac{\partial s}{\partial M}$?
Again, it is a matrix, of size $k \times \ell$, the same shape as $M$. Here the scalar is just one number, and so all the elements in the matrix essentially give us the partial derivative $\frac{\partial s}{\partial M_{ij}}$ of the scalar with respect to each element in the matrix.
So in Deep Learning, what we actually care about is the change in loss with respect to the weights $W$, the partial derivative of the loss $L$ with respect to $W$: $\frac{\partial L}{\partial W}$.
So what is the dimensionality of that? Can you figure that out? Remember that the loss $L$ is a scalar, and $W$ is often a matrix.
So the Jacobian is a matrix as well, with the same shape as $W$:

$$\frac{\partial L}{\partial W} =
\begin{bmatrix}
\frac{\partial L}{\partial W_{1,1}} & \cdots & \frac{\partial L}{\partial W_{1,D+1}} \\
\vdots & \ddots & \vdots \\
\frac{\partial L}{\partial W_{C,1}} & \cdots & \frac{\partial L}{\partial W_{C,D+1}}
\end{bmatrix}$$
What we have is a matrix specifying the change in loss with respect to each individual parameter in the matrix $W$. And so this is really what we need in order to perform gradient descent.
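As a small illustration, take a simple squared-error loss $L = \lVert Wx - y \rVert^2$ (a stand-in chosen here for simplicity, not necessarily the loss from earlier). Its gradient $\frac{\partial L}{\partial W} = 2(Wx - y)x^\top$ comes out with exactly the shape of $W$, ready for a descent update:

```python
import numpy as np

C, D = 3, 4                            # illustrative sizes (an assumption)
W = np.random.randn(C, D + 1)
x = np.append(np.random.randn(D), 1.0) # input with the bias-1 appended
y = np.random.randn(C)                 # an arbitrary target, for illustration

# For L = ||W x - y||^2, the gradient is dL/dW = 2 (W x - y) x^T,
# an outer product with one partial derivative per entry of W.
grad = 2.0 * np.outer(W @ x - y, x)

print(grad.shape == W.shape)           # True
W_new = W - 0.01 * grad                # a single gradient descent step
```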
As mentioned earlier, gradient descent works in batches of data. And when this data consists of, for example, matrices (if you have an image) or tensors (if you have a multi-channel image), then these gradients tend to turn into tensors. For example, if each instance is a vector of size $m$, our batch is of size $B \times m$. If each instance is a matrix, for example a grayscale image, an image of one channel of size $W \times H$, then our batch is $B \times W \times H$. If each instance is a multi-channel matrix, for example a color image, then our batch becomes $B \times 3 \times W \times H$, where $3$ is the number of color channels. And so, when we compute gradients with respect to these types of input, it becomes a bit unwieldy. What we do instead of turning Jacobians into tensors is flatten everything out into vectors.
And so all inputs become vectors, and then we get a vector of derivatives. This can also be done for particular derivatives between two vectors, two matrices, or two tensors. That is really the conceptually simple way to think about things.
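For instance, a batch of color images can be flattened so that each instance becomes one long vector; the batch size and image dimensions below are arbitrary assumptions:

```python
import numpy as np

B, C, H, W = 8, 3, 32, 32              # batch of 8 color images, 3 x 32 x 32 each
batch = np.random.randn(B, C, H, W)

flat = batch.reshape(B, -1)            # each row is one flattened instance
print(flat.shape)                      # (8, 3072): 3 * 32 * 32 = 3072 per instance
```

With every instance flattened like this, derivatives stay plain vectors and matrices instead of higher-order tensors.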