Components of a Parametric Learning Algorithm
In this section, we will look at the components of a parametric Machine Learning algorithm. As mentioned, a Machine Learning algorithm has several different parts. There is the input and how you represent those inputs. There is the functional form of the model, for example a linear classifier, including the parameters that it contained. There is a performance measure that we seek to improve. And often we have a surrogate loss or objective function that we use to determine how to improve. The algorithm for finding the best parameters is called the optimization algorithm.
Above is a visual depiction of this process. For example given an image, we can convert it to features. Those features are fed into a model, for example, a linear classifier. And that classifier outputs class scores, often normalized to be a probability distribution.
We often have a loss function, which determines how incorrect or correct the model's outputs are when compared to some ground truth that a human annotated.
This is often represented as a one hot vector or a probability distribution where there is a probability of one if it is the correct class and zero elsewhere. The output of this loss function is used by the optimizer to drive the model optimization.
Specifically, the optimizer will change the weights or parameters of the model in order to drive the loss function down. Let's take a closer look at the functional form of a model.
The form of a parametric classifier is
where
- is the classifier
- is the input (vector)
- is the set of weights
- is the output
For example, it could be a scalar score for a particular class or it could be a vector representing a probability distribution over classes. So the input is a continuous number or vector, the output is also a continuous number (for classification it is typically a score). For regression, we typically output what we want to regress to such as a house prices, crime rates, etc. And is a set of weights that we want to optimize to fit the target function.
What is the simplest function you can think of to use for such a model? One of the simplest functions that we can have is just a linear function, . So our model can be
Remember from your Linear Algebra class that represents factors that essentially rotate the lines, whereas , the bias, moves the lines up and down. We can use this simple functional form to create a classifier. Specifically, we can compute a score, , so we multiply or dot product the vector with the vector and add a bias offset and from that we get a score. We can then have a binary classification rule where is the score is greater than or equal to zero, then we say that , the label, is one; otherwise we say it is zero. And this is a binary classification model.
We can also have a multi-class classifier by converting from a vector to a matrix.
So we can have be a matrix where each row is a separate classifier. And so taking times will give us a vector which will give us a set of scores that are then added with a bias vector and then we will have as output a set of scores, one for each class. And then we can take the highest score to determine what class is in the data from the model's perspective.
While a linear classifier is pretty simple, the idea is really intuitive. What we want to do is separate classes via a high dimensional linear separator or hyperplane.
If the input representation is linearly separable, then this model will work very well. And that is why feature engineering is often used to make sure that the inputs are more than just pixels. It is one of the simplest parametric models, but is is surprisingly effective. It is used very commonly across many applications. Let's take a look more closely at all the elements that make it up.
As mentioned we have a piece of data, for example an image, we want to feed it through a model which is for example a linear classifier, and that model outputs a set of class scores, one for each category.
Let's take a closer look at the Linear Algebra involved. The input is a matrix. It is composed of a width and height of an image and it contains many pixels, in this case pixels.
Often in Machine Learning we want vectors not matrices as inputs (can you think why?). Thus we can flatten this matrix into a vector
To simplify notation, we will refer to input as through , where . Our actual model itself or the parameters for that model are composed of and . Here is a matrix because we have a three class classifier. Each row is a particular classifier. For example row one contains the classifier for class one, and so on.
And then we multiply this matrix with which is again the vector input. We also add a bias and the bias term is one for each class. Again you can see this as three independent linear classifiers. Note that we can actually move the bias into the weight matrix itself, where we can have the weight matrix be composed of the values that it had originally, but we add a column for the bias term. And similarly, for the input, we add a one at the end on the bottom. This is Mathematically equivalent, but it is easier notationally.
And computationally it is nice because it really allows us to perform inference or prediction from the model by just performing one matrix vector multiplication, which is efficient.
The output of a linear classifier is a score for each class. So in this case, represents parameters for a class classifier. Really, it can be seen as independent classifiers, and each one outputs a score for each class. And so we have a cat score, a coffee cup score, and so on.
There are many interpretations that one can take for a linear classifier. Let's take an example from a data set called CIFAR-10, which has ten different categories and images of size 32 by 32.
From the visual viewpoint, we can actually look at the weight vector as a template. Specifically, we can shape it back into the shape of an image and visualize them. Remember, the weight vectors are multiplied by the inputs, so they have to be the same shape.
As you can see the result here after an optimization algorithm is applied, the images kind of look like templates for each class. For example, the car roughly look like a blurry car average over the entire data set. From the geometric viewpoint, we can again view a linear classifier as a linear separator.
This latter of viewpoints allows us to reason about why linear classifiers are actually limited. We can see that there are many different functions or spaces that are not separable using a linear classifier. For example, below you can see the XOR function where there is no one line that can separate the two colors in this case representing the two classes.
Similarly, if you have a bull's eye type pattern where the inside of a circle is one class as well as the outside of the circle, but the circle itself is a different class. Then we can not really draw one linear separator that can separate the two classes. And this is true for even more complex spaces as well. Later, we will start using non-linearities in order to address this, similar to how support vector machines project the input space into a different warped space, where linear separators can be effective.
So we have several different viewpoints: the algebraic viewpoint, which just looks at it as a linear algebra computation; the visual viewpoint which looks at the weight vectors as templates; and there is a geometric viewpoint which looks at it from a hyperplane.
- Components of a Parametric Learning Algorithm
-
-
- is the classifier
- is the input (vector)
- is the set of weights
- is the output
-
For example, it could be a scalar score for a particular class or it could be a vector representing a probability distribution over classes. So the input is a continuous number or vector, the output is also a continuous number (for classification it is typically a score). For regression, we typically output what we want to regress to such as a house prices, crime rates, etc. And is a set of weights that we want to optimize to fit the target function.
-
What is the simplest function you can think of to use for such a model? One of the simplest functions that we can have is just a linear function, . So our model can be
-
-
Remember from your Linear Algebra class that represents factors that essentially rotate the lines, whereas , the bias, moves the lines up and down. We can use this simple functional form to create a classifier. Specifically, we can compute a score, , so we multiply or dot product the vector with the vector and add a bias offset and from that we get a score. We can then have a binary classification rule where is the score is greater than or equal to zero, then we say that , the label, is one; otherwise we say it is zero. And this is a binary classification model.
-
-
We can also have a multi-class classifier by converting from a vector to a matrix.
-
-
So we can have be a matrix where each row is a separate classifier. And so taking times will give us a vector which will give us a set of scores that are then added with a bias vector and then we will have as output a set of scores, one for each class. And then we can take the highest score to determine what class is in the data from the model's perspective.
-
Let's take a closer look at the Linear Algebra involved. The input is a matrix. It is composed of a width and height of an image and it contains many pixels, in this case pixels.
-
-
-
To simplify notation, we will refer to input as through , where . Our actual model itself or the parameters for that model are composed of and . Here is a matrix because we have a three class classifier. Each row is a particular classifier. For example row one contains the classifier for class one, and so on.
-
And then we multiply this matrix with which is again the vector input. We also add a bias and the bias term is one for each class. Again you can see this as three independent linear classifiers. Note that we can actually move the bias into the weight matrix itself, where we can have the weight matrix be composed of the values that it had originally, but we add a column for the bias term. And similarly, for the input, we add a one at the end on the bottom. This is Mathematically equivalent, but it is easier notationally.
-
-
The output of a linear classifier is a score for each class. So in this case, represents parameters for a class classifier. Really, it can be seen as independent classifiers, and each one outputs a score for each class. And so we have a cat score, a coffee cup score, and so on.
-