Linear Regression with Multiple Variables, also called 'Multivariate Linear Regression', is used when you want to predict a single output value from multiple input values.
For example, we may want to predict the house price (y) not only from the house size, but also from the number of bedrooms, the number of floors, and the age of the home.
x1: Size (feet^{2}) | x2: Number of bedrooms | x3: Number of floors | x4: Age of home (years) | y: Price ($1000)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
… | … | … | … | …
Notation:
m: the number of training examples
n: the number of features
x^{(i)}: the input features of the i-th training example
xj^{(i)}: the value of feature j in the i-th training example
y^{(i)}: the output value of the i-th training example
If this topic is new to you, I suggest watching a free online course on linear regression with multiple variables first.
Hypothesis Function
Our hypothesis function has the general form:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + … + θnxn = θ^{T}x

in which x0 = 1.
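As a concrete sketch in Python with NumPy (the θ values below are made up purely for illustration, not fitted to any data), the hypothesis is just a dot product between the parameter vector and the feature vector with x0 = 1 prepended:

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x (x must start with x0 = 1)."""
    return theta @ x

# Illustrative parameter values only -- not fitted to any real data
theta = np.array([80.0, 0.1, 20.0, 5.0, -1.0])  # theta0 .. theta4
x = np.array([1.0, 2104, 5, 1, 45])             # x0=1, size, bedrooms, floors, age
prediction = h(theta, x)                        # a price estimate in $1000s
```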
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function:

J(θ) = (1/2m) · Σ_{i=1..m} (hθ(x^{(i)}) − y^{(i)})^{2}
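A minimal sketch of this cost function in Python (the two data rows reuse the house-price examples from the table above; the helper name `compute_cost` is mine, not from the original post):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1/(2m)) * sum((h(x_i) - y_i)^2).
    X is m x (n+1) with a leading column of ones, so X @ theta
    evaluates the hypothesis on every training example at once."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

# First two examples from the table above, with x0 = 1 prepended
X = np.array([[1.0, 2104, 5, 1, 45],
              [1.0, 1416, 3, 2, 40]])
y = np.array([460.0, 232.0])
cost_at_zero = compute_cost(X, y, np.zeros(5))  # cost with all thetas = 0
```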
Gradient Descent
So we have our hypothesis function and we have a way of measuring how accurate it is (cost function). Now what we need is a way to automatically improve our hypothesis function. The goal is to minimize the cost function value by adjusting θ0, θ1, … θn. That’s where gradient descent comes in.
The gradient descent equation is:

repeat until convergence {
    θj := θj − α · (∂/∂θj) J(θ)    (update all θj simultaneously)
}
For Linear Regression, we can substitute the actual cost function and our actual hypothesis function and modify the equation to

repeat until convergence {
    θj := θj − α · (1/m) · Σ_{i=1..m} (hθ(x^{(i)}) − y^{(i)}) · xj^{(i)}    (simultaneously for j = 0, 1, …, n)
}

where m is the size of the training set, and x^{(i)}, y^{(i)} are the values of the i-th training example.
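The update rule can be vectorized: stacking the training examples into a matrix X (with a leading column of ones), the whole gradient is X^{T}(Xθ − y)/m. A hedged sketch, using a tiny made-up dataset where y = 2x so the fitted θ is easy to check:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent for linear regression.
    Each iteration updates every theta_j simultaneously:
        theta := theta - (alpha/m) * X^T (X @ theta - y)
    X must include the leading column of ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta

# Toy 1-feature dataset generated from y = 2x, so theta should approach [0, 2]
X = np.c_[np.ones(5), np.arange(1.0, 6.0)]
y = 2.0 * np.arange(1.0, 6.0)
theta = gradient_descent(X, y, alpha=0.05, iters=5000)
```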
Learning Rate (choice of α)
The learning rate α decides the step size of each iteration. If gradient descent is working correctly (because the learning rate α is sufficiently small), J(θ) should decrease after every iteration. By plotting J(θ) against the number of iterations, we can then manually decide the convergence point.
(BTW, we may also automatically decide the convergence point by declaring convergence if J(θ) decreases by less than 10^{−3} in one iteration.)
However, if J(θ) sometimes increases in the plot, the learning rate α is too large. If J(θ) decreases very slowly, α is too small.
To choose α, try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, …
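One way to turn that advice into code is to run a short burst of gradient descent for each candidate α and check that the recorded J(θ) curve decreases on every iteration (a hedged sketch on a tiny synthetic dataset; the helper name `cost_curve` is mine):

```python
import numpy as np

def cost_curve(X, y, alpha, iters=100):
    """Run gradient descent for `iters` steps, recording J(theta) after each."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        residuals = X @ theta - y
        costs.append(residuals @ residuals / (2 * m))
    return costs

# Tiny synthetic dataset: y = 2x exactly
X = np.c_[np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])]
y = np.array([2.0, 4.0, 6.0, 8.0])

results = {}
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1]:
    costs = cost_curve(X, y, alpha)
    # A well-chosen alpha keeps J(theta) non-increasing on every iteration
    results[alpha] = all(b <= a for a, b in zip(costs, costs[1:]))
```

On this well-scaled toy problem all five candidates behave, but on real data the larger values would be the first to make J(θ) blow up, which is exactly what the plot reveals.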
Normal Equation
In the previous section (Gradient Descent), we learnt that in order for the following algorithm to converge,

θj := θj − α · (∂/∂θj) J(θ)

each partial derivative

(∂/∂θj) J(θ) = (1/m) · Σ_{i=1..m} (hθ(x^{(i)}) − y^{(i)}) · xj^{(i)}

should equal 0. In other words, we can try to solve for θ from the equations (∂/∂θj) J(θ) = 0 for all j. According to calculus, we get

θ = (X^{T}X)^{−1} X^{T}y

in which X is the m × (n+1) design matrix whose i-th row holds the features of the i-th training example (with x0 = 1 prepended), and y is the m-dimensional vector of output values.
For example, with the house-price training data from the table at the top of this post, the rows of X are (1, 2104, 5, 1, 45), (1, 1416, 3, 2, 40), …, and y = (460, 232, …)^{T}.
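A hedged sketch of the normal equation in NumPy (I use `np.linalg.solve` on the system X^{T}Xθ = X^{T}y rather than forming the inverse explicitly, which is the numerically preferred route; the toy data below is mine):

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y in one step -- no iteration,
    no learning rate. X must include the leading column of ones."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated from y = 1 + 3x, so the exact solution is theta = [1, 3]
X = np.c_[np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])]
y = np.array([1.0, 4.0, 7.0, 10.0])
theta = normal_equation(X, y)
```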
The normal equation gives us a method to solve for θ analytically, so that rather than running the iterative gradient descent algorithm, we can solve for the optimal value of θ in one go.
Both gradient descent and the normal equation can solve for θ. To decide when you should use Gradient Descent and when you should use the Normal Equation, please read ML101: Gradient Descent vs. Normal Equation.
