Linear regression with multiple variables, also called "multivariate linear regression", is used when you want to predict a single output value from multiple input values.
For example, we want to predict the house price (y) not only from the house size, but also from the number of bedrooms, the number of floors, and the age of the home.
| x1: Size (feet²) | x2: Number of bedrooms | x3: Number of floors | x4: Age of home (years) | y: Price ($1000) |
If this topic is new to you, I suggest first watching the free online course on linear regression with multiple variables.
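Our hypothesis function has the general form:
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^T x$$
in which x0 = 1, so that θ0 acts as the intercept term.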
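As a minimal sketch of this hypothesis, assuming the usual linear form θ0·x0 + θ1·x1 + … + θn·xn with x0 = 1, the prediction is a single dot product. The function name `hypothesis` and all numbers below are made up for illustration; they are not fitted values from any real data set.

```python
def hypothesis(theta, features):
    """Return theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n, with x_0 = 1."""
    x = [1.0] + list(features)  # prepend the constant feature x_0 = 1
    if len(theta) != len(x):
        raise ValueError("theta needs exactly one more entry than features")
    return sum(t * xi for t, xi in zip(theta, x))


# Made-up parameters (theta_0 .. theta_4) and one made-up house:
# size in square feet, bedrooms, floors, age in years.
theta = [80.0, 0.1, 10.0, 5.0, -2.0]
house = [2000.0, 3.0, 2.0, 40.0]
predicted_price = hypothesis(theta, house)  # in units of $1000
```

Prepending x0 = 1 lets the intercept θ0 be handled like every other parameter, which keeps the later update rules uniform.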
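We can measure the accuracy of our hypothesis function by using a cost function, here the squared error averaged over the training set (and halved for convenience):
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$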
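The sketch below shows one way to compute such a cost, assuming the common squared-error form with a 1/(2m) factor (an assumption here, since other scalings are possible). The helper names `predict` and `cost` and the tiny data set are hypothetical; the targets were generated from y = 1 + 2·x1 + 1·x2 so the minimum cost is known to be zero.

```python
def predict(theta, features):
    """Linear hypothesis; theta[0] is the intercept (x_0 = 1 is implied)."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))


def cost(theta, X, y):
    """Squared-error cost: sum of squared prediction errors divided by 2m."""
    m = len(y)
    squared_errors = [(predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)]
    return sum(squared_errors) / (2 * m)


# Made-up training set with two features per example.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 5.0, 8.0]

cost_at_true_theta = cost([1.0, 2.0, 1.0], X, y)  # perfect fit
cost_at_zero_theta = cost([0.0, 0.0, 0.0], X, y)  # poor fit, larger cost
```

A lower value means the hypothesis fits the training data more closely, which is what gradient descent will try to achieve.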
So we have our hypothesis function and we have a way of measuring how accurate it is (cost function). Now what we need is a way to automatically improve our hypothesis function. The goal is to minimize the cost function value by adjusting θ0, θ1, … θn. That’s where gradient descent comes in.
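The gradient descent equation is (repeat until convergence):
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \quad \text{for } j = 0, 1, \ldots, n$$
For linear regression, we can substitute the actual cost function and our actual hypothesis function and modify the equation to
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
where m is the size of the training set, x(i), y(i) are the values of the i-th example in the given training set (data), and all θj are updated simultaneously.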
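The update rule can be sketched in plain Python as batch gradient descent. This is an illustrative sketch, not a tuned implementation: the function names, the starting point θ = 0, the learning rate 0.1, the iteration count, and the made-up data (generated from y = 1 + 2·x1 + 1·x2) are all assumptions.

```python
def predict(theta, row):
    """Dot product of theta with a row that already starts with x_0 = 1."""
    return sum(t * x for t, x in zip(theta, row))


def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression, starting from theta = 0."""
    rows = [[1.0] + list(r) for r in X]  # add x_0 = 1 to every example
    m, n = len(rows), len(rows[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Prediction error for every training example under the current theta.
        errors = [predict(theta, r) - yi for r, yi in zip(rows, y)]
        # Partial derivative of the cost with respect to each theta_j.
        gradient = [sum(e * r[j] for e, r in zip(errors, rows)) / m
                    for j in range(n)]
        # Simultaneous update: the new theta is built from the old one only.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Made-up training data: two features per example.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 1.0]]
y = [5.0, 5.0, 8.0, 2.0]
theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
```

Computing all errors before changing any parameter is what makes the update simultaneous; updating θ0 first and then reusing it for θ1 within the same iteration would be a different algorithm.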
Learning Rate (choice of α)
The learning rate α decides the step size in each iteration. If gradient descent is working correctly (because the learning rate α is sufficiently small), J(θ) should decrease after every iteration. By plotting J(θ) against the number of iterations, we can then manually decide the convergence point.
(We may also decide the convergence point automatically, by declaring convergence if J(θ) decreases by less than 10⁻³ in one iteration.)
However, if J(θ) sometimes increases in the plot, the learning rate α is too big. If J(θ) decreases very slowly, α is too small. The following Q/A helps you understand the relationship between the plot and the selection of α:
To choose α, try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, …
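The checks described in this section can be sketched as a loop that records J(θ) after every iteration, flags α as too big as soon as J(θ) increases, and declares convergence when the decrease falls below 10⁻³. The function names, the status strings, the iteration cap, and the made-up data are assumptions for illustration; each row already includes x0 = 1.

```python
def cost(theta, rows, y):
    """Squared-error cost with the 1/(2m) scaling."""
    m = len(y)
    return sum((sum(t * x for t, x in zip(theta, r)) - yi) ** 2
               for r, yi in zip(rows, y)) / (2 * m)


def gradient_step(theta, rows, y, alpha):
    """One simultaneous gradient descent update of every theta_j."""
    m = len(y)
    errors = [sum(t * x for t, x in zip(theta, r)) - yi
              for r, yi in zip(rows, y)]
    return [t - alpha * sum(e * r[j] for e, r in zip(errors, rows)) / m
            for j, t in enumerate(theta)]


def try_learning_rate(rows, y, alpha, tolerance=1e-3, max_iterations=10000):
    """Run gradient descent and return (status, history of J values)."""
    theta = [0.0] * len(rows[0])
    history = [cost(theta, rows, y)]
    for _ in range(max_iterations):
        theta = gradient_step(theta, rows, y, alpha)
        history.append(cost(theta, rows, y))
        if history[-1] > history[-2]:
            return "alpha too big", history  # J(theta) went up
        if history[-2] - history[-1] < tolerance:
            return "converged", history      # decrease below the threshold
    return "not converged", history


# Made-up data; the first column is x_0 = 1.
rows = [[1.0, 1.0, 2.0], [1.0, 2.0, 0.0], [1.0, 3.0, 1.0], [1.0, 0.0, 1.0]]
y = [5.0, 5.0, 8.0, 2.0]
results = {alpha: try_learning_rate(rows, y, alpha)
           for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3)}
```

Each `history` list is what you would plot against the iteration number. With a very small α the 10⁻³ test may fire while J(θ) is still far from its minimum, so looking at the plot remains worthwhile.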
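In the previous section (Gradient Descent), we learnt that for the following algorithm to converge,
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
the partial derivative of J(θ) with respect to θj should equal 0. In other words, we can try to solve for θ from the following equation:
$$\frac{\partial}{\partial \theta_j} J(\theta) = 0 \quad \text{for every } j = 0, 1, \ldots, n$$
According to calculus, we get
$$\theta = (X^T X)^{-1} X^T y$$
where X is the matrix whose i-th row is the training example x(i) (including x0 = 1) and y is the vector of the target values y(i).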
For example, suppose you have the training data in the table below.
The normal equation gives us a method to solve for θ analytically: rather than running the iterative gradient descent algorithm, we can solve for the optimal value of θ in one go.
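As a rough sketch of the analytic approach, the code below solves the linear system (XᵀX)θ = Xᵀy with Gaussian elimination instead of forming the inverse explicitly; this gives the same θ whenever XᵀX is invertible. The function name `normal_equation`, the singularity threshold, and the made-up data (generated from y = 1 + 2·x1 + 1·x2) are assumptions for illustration.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination with pivoting."""
    rows = [[1.0] + list(r) for r in X]  # add x_0 = 1 to every example
    n = len(rows[0])
    # Build A = X^T X and b = X^T y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(n)]
    M = [A[i] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features may be redundant")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    theta = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (M[i][n] - known) / M[i][i]
    return theta


# Made-up training data: two features per example.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [0.0, 1.0]]
y = [5.0, 5.0, 8.0, 2.0]
theta = normal_equation(X, y)
```

No learning rate and no iteration count are needed here. In practice a linear algebra library would normally do the solving, for example `numpy.linalg.solve` if NumPy is available.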
Both gradient descent and the normal equation can solve for θ. To decide when to use gradient descent and when to use the normal equation, please read ML101: Gradient Descent vs. Normal Equation.