ML101: Gradient Descent vs. Normal Equation

This post discusses when you should use Gradient Descent and when you should use the Normal Equation, along with some of the advantages and disadvantages of each.

Let’s say that you have m training samples and n features.


Gradient Descent

With the Gradient Descent algorithm, to minimize the cost function J(θ) we run an iterative procedure: each step moves θ a little in the direction that reduces J(θ), and it takes many iterations to converge to the minimum. A minimal code sketch follows the list below.


  1. Disadvantage: Need to choose the learning rate α
    This means running the algorithm a few times with different values of α and choosing one small enough that the cost function J(θ) decreases after every iteration.
  2. Disadvantage: Needs many iterations to reach convergence
  3. Advantage: Works well even when n is very large.

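To make the iterative update concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression, assuming X is an m×n design matrix that already includes a column of ones for the intercept and y is the vector of m targets. The function name, default α, and iteration count are illustrative choices, not values from the post.

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    # Batch gradient descent for linear regression.
    # Update rule: theta := theta - (alpha / m) * X^T (X theta - y)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of J(theta)
        theta -= alpha * grad
    return theta

The choice of α matters exactly as described in disadvantage 1 above: too large and J(θ) can increase or diverge, too small and convergence needs even more iterations.
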

Normal Equation

In contrast, the normal equation gives us a way to solve for θ analytically: rather than running the iterative gradient descent algorithm, we compute the optimal θ = (XᵀX)⁻¹Xᵀy in one go. A sketch follows the list below.


  1. Advantage: No need to choose the learning rate α
  2. Advantage: Don’t need to iterate to reach convergence
  3. Disadvantage: Need to compute (XᵀX)⁻¹
    Inverting this n×n matrix is slow (O(n³)) if n is very large, e.g. n > 10,000

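For comparison, here is a one-step sketch of the normal equation under the same assumptions about X and y as above. It uses np.linalg.solve instead of forming the inverse explicitly, which is the usual numerically safer choice, but solving the n×n system still costs O(n³).

import numpy as np

def normal_equation(X, y):
    # Solve theta = (X^T X)^(-1) X^T y without forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

No learning rate, no iterations; the trade-off is entirely in that O(n³) solve when n is large.
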