Linear Algebra View: Vector and Matrix Sizes

In this section, we take a linear algebra view of what we have covered so far, and look at the specific vectors and matrices involved in the gradient descent algorithm and their sizes.

So if you remember, a simple linear model is just $Wx$ (often written $W^Tx + b$); the $+b$ term is tucked into $W$ as an extra column, with a corresponding $1$ appended to $x$:

$$
Wx = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} & b_1 \\ w_{21} & w_{22} & \cdots & w_{2m} & b_2 \\ w_{31} & w_{32} & \cdots & w_{3m} & b_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \\ 1 \end{bmatrix}
$$

where the size of $W$ is $c \times (d+1)$ and the size of $x$ is $(d+1) \times 1$, where $c$ is the number of classes and $d$ is the dimensionality of the input (denoted $m$ in the matrix above). The matrix $W$ has a row for each class and $d+1$ columns, where the extra column holds the bias. The size of $x$ is just $d+1$.
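To make these sizes concrete, here is a minimal NumPy sketch; the values $c = 3$ and $d = 4$ and the variable names are illustrative, not from the lecture. It folds the bias into $W$ as an extra column and appends a $1$ to $x$.

```python
import numpy as np

c, d = 3, 4                       # number of classes, input dimensionality (illustrative values)

W = np.random.randn(c, d + 1)     # weights with the bias folded in as a last column: shape (c, d+1)
x = np.random.randn(d)            # raw input: shape (d,)
x_aug = np.append(x, 1.0)         # append a constant 1 so the last column of W acts as +b: shape (d+1,)

scores = W @ x_aug                # Wx: one score per class, shape (c,)
print(W.shape, x_aug.shape, scores.shape)   # (3, 5) (5,) (3,)
```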

Let's take a look at some conventions with respect to the sizes of derivatives. The sizes of derivatives for scalars, vectors, and matrices are as follows. Assume that we have a scalar $s$ which is one-dimensional (so $s \in \mathbb{R}$), a vector $v$ which is $m$-dimensional (so $v \in \mathbb{R}^m$), and a matrix $M$ with dimensions $k \times l$ (so $M \in \mathbb{R}^{k \times l}$). Now, what is the size of the partial derivative of $v$ with respect to $s$, i.e. a vector with respect to a scalar? Its size is $m \times 1$ (that is, $\partial v / \partial s \in \mathbb{R}^{m \times 1}$), a column vector of size $m$. Each element of this column vector is the partial derivative of a particular element of $v$ with respect to $s$:

$$
\begin{bmatrix} \frac{\partial v_1}{\partial s} \\ \frac{\partial v_2}{\partial s} \\ \vdots \\ \frac{\partial v_m}{\partial s} \end{bmatrix}
$$

What is the size of the partial derivative of a scalar $s$ with respect to a vector $v$? Here, it is a row vector of size $m$ (so $\partial s / \partial v \in \mathbb{R}^{1 \times m}$), where again each element is the partial derivative of the scalar with respect to the corresponding element of the vector.
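As a sanity check on these shape conventions, here is a small NumPy sketch using finite differences; the particular functions $v(s) = s\,a$ and $s(v) = w \cdot v$ are just illustrative choices.

```python
import numpy as np

m, eps = 4, 1e-6
a = np.random.randn(m)            # defines the vector-valued function v(s) = s * a
w = np.random.randn(m)            # defines the scalar-valued function s(v) = w . v

# dv/ds: a column vector of size m (one entry per element of v)
s0 = 0.7
dv_ds = (((s0 + eps) * a) - (s0 * a)) / eps
dv_ds = dv_ds.reshape(m, 1)
print(dv_ds.shape)                # (4, 1)

# ds/dv: a row vector of size m (one entry per element of v)
v0 = np.random.randn(m)
ds_dv = np.zeros((1, m))
for j in range(m):
    v_pert = v0.copy()
    v_pert[j] += eps              # perturb one element of v at a time
    ds_dv[0, j] = (w @ v_pert - w @ v0) / eps
print(ds_dv.shape)                # (1, 4)
```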

Now, what is the size of the partial derivative of a vector $v^1$ with respect to another vector $v^2$? In this case, it is a matrix called the Jacobian, which contains the partial derivatives of every element of $v^1$ with respect to every element of $v^2$.

$$
\begin{bmatrix} \frac{\partial v_1^1}{\partial v_1^2} & \frac{\partial v_1^1}{\partial v_2^2} & \cdots & \cdots & \frac{\partial v_1^1}{\partial v_n^2} \\ \frac{\partial v_2^1}{\partial v_1^2} & \ddots & & & \vdots \\ \vdots & & \frac{\partial v_i^1}{\partial v_j^2} & & \vdots \\ \vdots & & & \ddots & \vdots \\ \frac{\partial v_m^1}{\partial v_1^2} & \cdots & \cdots & \cdots & \frac{\partial v_m^1}{\partial v_n^2} \end{bmatrix}
$$

Specifically, the entry in row $i$ and column $j$ is the partial derivative of $v_i^1$ with respect to $v_j^2$. This matrix captures the interaction between every element of $v^1$ and every element of $v^2$, which is why it has this two-dimensional form: for a small change in any one element of $v^2$, it tells us how all the elements of $v^1$ change, and it does so for every element of $v^2$.
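The sketch below estimates such a Jacobian numerically for a made-up linear map $v^1 = A v^2$ (so the true Jacobian is just $A$); the sizes and names are only for illustration.

```python
import numpy as np

m, n, eps = 3, 5, 1e-6
A = np.random.randn(m, n)
f = lambda v2: A @ v2             # v1 = f(v2); the analytic Jacobian dv1/dv2 is A

v2 = np.random.randn(n)
J = np.zeros((m, n))              # rows index elements of v1, columns index elements of v2
for j in range(n):
    v_pert = v2.copy()
    v_pert[j] += eps              # nudge one element of v2 ...
    J[:, j] = (f(v_pert) - f(v2)) / eps   # ... and record how every element of v1 changes

print(J.shape)                          # (3, 5)
print(np.allclose(J, A, atol=1e-4))     # True: matches the analytic Jacobian
```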

So what is the size of the partial derivative of a scalar with respect to a matrix, that is, the size of

$$
\frac{\partial s}{\partial M}
$$

Again, it is a matrix, this time with the same size as $M$: since the scalar is just one number, each element of this matrix is simply the partial derivative of the scalar with respect to the corresponding element of $M$.

$$
\begin{bmatrix} \frac{\partial s}{\partial m_{[1,1]}} & \frac{\partial s}{\partial m_{[1,2]}} & \cdots & \cdots & \frac{\partial s}{\partial m_{[1,n]}} \\ \frac{\partial s}{\partial m_{[2,1]}} & \ddots & & & \vdots \\ \vdots & & \frac{\partial s}{\partial m_{[i,j]}} & & \vdots \\ \vdots & & & \ddots & \vdots \\ \frac{\partial s}{\partial m_{[m,1]}} & \cdots & \cdots & \cdots & \frac{\partial s}{\partial m_{[m,n]}} \end{bmatrix}
$$
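Again as a shape check, here is a finite-difference sketch for a scalar function of a matrix; the choice $s(M) = \sum_{i,j} M_{ij} B_{ij}$ (whose derivative is just $B$) is an illustrative assumption.

```python
import numpy as np

k, l, eps = 2, 3, 1e-6
B = np.random.randn(k, l)
s = lambda M: np.sum(M * B)       # a scalar function of a matrix; its derivative w.r.t. M is B

M0 = np.random.randn(k, l)
ds_dM = np.zeros((k, l))          # the derivative has the same shape as M
for i in range(k):
    for j in range(l):
        M_pert = M0.copy()
        M_pert[i, j] += eps
        ds_dM[i, j] = (s(M_pert) - s(M0)) / eps

print(ds_dM.shape)                        # (2, 3): same shape as M
print(np.allclose(ds_dM, B, atol=1e-4))   # True
```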

So in deep learning, what we actually care about is the change in the loss with respect to $W$, the partial derivative of the loss with respect to $W$:

$$
\frac{\partial L}{\partial W}
$$

So what is the dimensionality of that? Can you figure it out? Remember that the loss is a scalar, and $W$ is often a matrix:

$$
W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} & b_1 \\ w_{21} & w_{22} & \cdots & w_{2m} & b_2 \\ w_{31} & w_{32} & \cdots & w_{3m} & b_3 \end{bmatrix}
$$

So this Jacobian is a matrix as well, with the same shape as $W$:

$$
\begin{bmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \cdots & \frac{\partial L}{\partial w_{1m}} & \frac{\partial L}{\partial b_1} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \cdots & \frac{\partial L}{\partial w_{2m}} & \frac{\partial L}{\partial b_2} \\ \frac{\partial L}{\partial w_{31}} & \frac{\partial L}{\partial w_{32}} & \cdots & \frac{\partial L}{\partial w_{3m}} & \frac{\partial L}{\partial b_3} \end{bmatrix}
$$

What we have is a matrix specifying the change in the loss with respect to each individual parameter in the $W$ matrix, and this is really what we need in order to perform gradient descent.
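To tie this together, here is a hypothetical example, not the lecture's loss: a squared-error loss on a single input, whose analytic gradient is the outer product $\partial L/\partial W = (Wx - y)\,x^\top$. It shows that $\partial L/\partial W$ has exactly the shape of $W$ and can be used directly for a gradient descent step.

```python
import numpy as np

c, d, lr = 3, 4, 0.01
W = np.random.randn(c, d + 1)              # weights with the bias folded in, shape (c, d+1)
x = np.append(np.random.randn(d), 1.0)     # input with the trailing 1, shape (d+1,)
y = np.random.randn(c)                     # a made-up regression target, shape (c,)

loss = lambda W: 0.5 * np.sum((W @ x - y) ** 2)   # scalar loss

# Analytic gradient of this particular loss: an outer product, shape (c, d+1) -- same shape as W.
dL_dW = np.outer(W @ x - y, x)

W_new = W - lr * dL_dW                     # one gradient descent step
print(dL_dW.shape)                         # (3, 5)
print(loss(W_new) < loss(W))               # True for a small enough step size
```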

As mentioned earlier, gradient descent works on batches of data. When the data consists of matrices (e.g. images) or tensors (e.g. multi-channel images), the gradients tend to turn into tensors as well. For example, if each instance is a vector of size $m$, our batch has size $B \times m$. If each instance is a matrix, for example a grayscale image with a single channel of size $W \times H$, then our batch is $B \times W \times H$. If each instance is a multi-channel matrix, for example a color image, then our batch becomes $B \times C \times W \times H$, where $C$ is the number of color channels. Computing gradients with respect to these kinds of inputs becomes a bit unwieldy, so instead of turning Jacobians into tensors, we flatten everything out into vectors:

$$
\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nn} \end{bmatrix} \overset{\text{flatten}}{\longrightarrow} \begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{21} \\ x_{22} \\ \vdots \\ x_{n1} \\ \vdots \\ x_{nn} \end{bmatrix}
$$

And so all inputs become vectors, and we get a vector of derivatives. The same trick works for derivatives between any two vectors, matrices, or tensors. That is really the conceptually simple way to think about things.
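Here is a small NumPy sketch of that flattening idea for a batch of hypothetical color images; the sizes are arbitrary.

```python
import numpy as np

B, C, H, W_img = 8, 3, 32, 32             # batch size, channels, height, width (W_img avoids clashing with the weights W)
batch = np.random.randn(B, C, H, W_img)   # batch tensor of shape (B, C, W, H)

flat_batch = batch.reshape(B, -1)         # flatten each instance into a vector
print(flat_batch.shape)                   # (8, 3072): each image is now a length C*H*W vector

X = np.arange(9).reshape(3, 3)            # a single small matrix, as in the diagram above
x_flat = X.reshape(-1, 1)                 # row-major flattening into a column vector
print(x_flat.shape)                       # (9, 1)
```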
