Linear Regression with One Variable, also called ‘Univariate Linear Regression’, is used when you want to predict a single output value from a single input value.
For example, we want to predict the house price (y) solely from the house size (x) based on a linear regression model.
If this topic is new to you, I suggest first watching the free online course on linear regression with one variable.
Our hypothesis function has the general form:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
We supply hθ with values for θ0 and θ1 to get our output ‘y’. In other words, we are trying to create a function called hθ that reliably maps our input data (the x’s) to our output data (the y’s).
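To make the idea concrete, here is a minimal Python sketch of a hypothesis function, assuming the usual straight-line form hθ(x) = θ0 + θ1·x. The name `hypothesis` and the parameter values are made up for illustration only.

```python
def hypothesis(theta0, theta1, x):
    """Return the prediction h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters: a base price of 50 plus 0.5 per unit of house size.
predicted_price = hypothesis(50.0, 0.5, 100.0)
print(predicted_price)  # 100.0
```

Different choices of θ0 and θ1 give different lines, and therefore different predictions for the same house size.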
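We can measure the accuracy of our hypothesis function by using a cost function. This takes an average (actually a fancier version of an average) of the squared differences between the predictions of the hypothesis for the inputs x and the actual outputs y:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
where m is the number of training examples. This function is otherwise called the “Squared error function”, or “Mean Squared Error”. The mean is halved (the factor 1/2) as a convenience for gradient descent, because the derivative of the squared term cancels it.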
With the cost function, we can concretely measure the accuracy of our predictor function against the correct results we already have, so that we can better predict new results we don’t have.
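As a rough sketch of how this measurement could be computed, the Python function below takes half the mean of the squared differences between predictions and actual outputs. It assumes the straight-line hypothesis θ0 + θ1·x and the conventional 1/(2m) scaling; the name `compute_cost` and the tiny data set are invented for illustration.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Half the mean squared error of the line theta0 + theta1 * x on (xs, ys)."""
    m = len(xs)  # number of training examples
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # prediction minus actual output
        total += error ** 2
    return total / (2 * m)


# Made-up training data that lies exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect_fit_cost = compute_cost(0.0, 2.0, xs, ys)  # the line y = 2x
poor_fit_cost = compute_cost(0.0, 1.0, xs, ys)     # the line y = x
```

Here the perfect fit has a cost of zero, while the poorer line y = x has errors of -1, -2 and -3 and therefore a cost of (1 + 4 + 9) / 6. A lower cost means a more accurate hypothesis.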
Hypothesis Function vs. Cost Function
The following chart shows the relationship between the hypothesis function and the cost function.
So we have our hypothesis function and we have a way of measuring how accurate it is (cost function). Now what we need is a way to automatically improve our hypothesis function. The goal is to minimize the cost function value by adjusting θ0 and θ1. That’s where gradient descent comes in.
Suppose the cost function is plotted as a mountainous surface over θ0 and θ1. Gradient descent can then be described as a walk downhill from some starting location to the bottom of the nearest valley (a local minimum).
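The gradient descent equation is:
$$\text{repeat until convergence:} \quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{for } j = 0 \text{ and } j = 1$$
Intuitively, this could be thought of as:
$$\text{repeat until convergence:} \quad \theta_j := \theta_j - \alpha \cdot [\text{slope of the tangent, i.e. the derivative in the } j \text{ dimension}]$$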
α is called the learning rate. It determines the step size in each iteration. Convergence means that the cost function J(θ0, θ1) has reached a local minimum, so that further adjustments to θ0 and θ1 no longer reduce it.
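To get a feel for what the learning rate does, here is a small Python sketch on a deliberately simple one-parameter problem. The toy cost J(θ) = (θ - 3)², its derivative 2(θ - 3), and the names `gradient_descent_1d` and `toy_derivative` are assumptions made purely for illustration; they are not the house-price cost function.

```python
def gradient_descent_1d(derivative, theta, alpha, iterations):
    """Repeatedly apply theta := theta - alpha * derivative(theta)."""
    for _ in range(iterations):
        theta = theta - alpha * derivative(theta)
    return theta


def toy_derivative(theta):
    """Derivative of the toy cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


# Same starting point, two different learning rates.
small_step = gradient_descent_1d(toy_derivative, theta=0.0, alpha=0.1, iterations=100)
large_step = gradient_descent_1d(toy_derivative, theta=0.0, alpha=1.1, iterations=100)
```

With α = 0.1, each step shrinks the distance to the minimum at θ = 3 by a factor of 0.8, so `small_step` ends up very close to 3. With α = 1.1, each step overshoots and lands further away than before, so `large_step` moves away from the minimum instead of converging.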
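For Linear Regression, we can substitute the actual cost function and our actual hypothesis function and modify the equation to
$$\begin{aligned} \text{repeat until convergence:} \quad \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\ \theta_1 &:= \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} \end{aligned}$$
where m is the size of the training set, θ0 is a parameter that is updated simultaneously with θ1, and x(i), y(i) are values from the given training set (data).
Note that we have separated out the two cases for θj, and that for θ1 we multiply by x(i) at the end, which comes from taking the derivative with respect to θ1.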
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
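Putting the pieces together, the following Python sketch repeatedly applies the two update rules for θ0 and θ1, assuming the straight-line hypothesis θ0 + θ1·x and the half mean squared error cost. The function name, the starting guess of zero, the default learning rate, the fixed iteration count (used here instead of a real convergence check) and the sample data are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit theta0 + theta1 * x to (xs, ys) with batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess for the hypothesis
    for _ in range(iterations):
        # Prediction error for every training example under the current guess.
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before either
        # parameter is changed.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this data the result approaches θ0 = 1 and θ1 = 2, the line that generated the points. For real data with larger input values, a smaller learning rate or feature scaling would likely be needed for the updates to converge.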