
# ML101: Logistic Regression


Logistic regression, despite its name, is a linear model for classification rather than regression.  It models the probability of a categorical outcome, and in the basic case that outcome is binary, encoded as 0 or 1.

Note: Logistic regression is also known as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.  Frequently, “logistic regression” is used to refer specifically to the problem in which the dependent variable is binary—that is, the number of available categories is two—while problems with more than two categories are referred to as multinomial logistic regression, or, if the multiple categories are ordered, as ordinal logistic regression.

If this topic is new to you, I suggest watching the free online course on logistic regression first.

## Training Set

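The training set has m examples, each with n features, and the output y takes one of two values, 0 or 1:

$$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}, \qquad x \in \mathbb{R}^{n+1},\ x_0 = 1, \qquad y \in \{0, 1\}$$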
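As a minimal sketch of how such a training set might be laid out in code: the values below and the names `raw_features`, `X`, and `y` are made up for illustration, and prepending a constant feature x0 = 1 is an assumption that matches the intercept θ0 used later in this post.

```python
# Hypothetical toy training set: m = 4 examples, n = 2 features each.
raw_features = [
    [0.5, 1.0],
    [1.5, 0.5],
    [3.0, 2.5],
    [2.5, 3.0],
]
y = [0, 0, 1, 1]  # one binary label per example

# Prepend the constant feature x0 = 1 so that theta_0 acts as the intercept.
X = [[1.0] + row for row in raw_features]

m = len(X)         # number of training examples
n = len(X[0]) - 1  # number of features, not counting x0

print(m, n)  # 4 2
```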

## Hypothesis Function (Logistic Regression Model)

We want our classifier to output values between 0 and 1, so we choose a hypothesis function that satisfies this requirement:

$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$

Here g(z) is called the ‘logistic function’ or ‘sigmoid function’.  It equals 0.5 at z = 0, approaches 1 as z → +∞, and approaches 0 as z → –∞.

Given this hypothesis function, we need to fit the parameters θ to our training data.
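To make the sigmoid and the hypothesis function concrete, here is a small sketch in plain Python.  The function names `sigmoid` and `hypothesis` are my own, and branching on the sign of z is just one common way to avoid floating-point overflow, not something the model itself requires.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    # Branch on the sign of z so math.exp never gets a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```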

### Interpreting Hypothesis Output

When our hypothesis function (hθ(x)) outputs a value, we treat that value as the estimated probability that y = 1 on input x.  We can write this using the following notation:

• P(y=1|x ; θ) = hθ(x) is the probability that y = 1, given x, parameterized by θ
• P(y=0|x ; θ) = 1 – P(y=1|x ; θ) is the probability that y = 0, given x, parameterized by θ
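A quick numeric illustration of this interpretation.  The parameter values are borrowed from the linear example later in this post, the input x is made up, and the plain `sigmoid` here is fine for small z but would overflow for very large negative z.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = [-3.0, 1.0, 1.0]  # theta0, theta1, theta2 (illustrative values)
x = [1.0, 2.0, 2.0]       # x0 = 1, x1 = 2, x2 = 2 (made-up input)

z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x = -3 + 2 + 2 = 1
p_y1 = sigmoid(z)   # P(y=1 | x; theta) = h_theta(x)
p_y0 = 1.0 - p_y1   # P(y=0 | x; theta)

print(f"P(y=1) = {p_y1:.3f}, P(y=0) = {p_y0:.3f}")  # P(y=1) = 0.731, P(y=0) = 0.269
```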

### Decision Boundary

Suppose we predict ‘y = 1’ if hθ(x) >= 0.5 and ‘y = 0’ otherwise.  Then:

• hθ(x) = g(θTx) and g(z) >= 0.5 when z >= 0; therefore, hθ(x) >= 0.5 when θTx >= 0.  In this case, we predict y = 1.
• Similarly, hθ(x) < 0.5 when θTx < 0.  In this case, we predict y = 0.
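The two bullets above say that thresholding the probability at 0.5 is the same as checking the sign of θTx.  The sketch below checks that on a few made-up inputs; `theta`, `samples`, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_from_probability(theta, x):
    """Predict 1 when h_theta(x) >= 0.5, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= 0.5 else 0

def predict_from_sign(theta, x):
    """Predict 1 when theta^T x >= 0, else 0 (no sigmoid needed)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [0.5, -2.0, 1.0]  # hypothetical parameters
samples = [               # each row is [x0, x1, x2] with x0 = 1
    [1.0, 0.0, 0.0],      # z = 0.5
    [1.0, 1.0, 0.0],      # z = -1.5
    [1.0, 1.0, 1.5],      # z = 0.0, exactly on the boundary
    [1.0, 2.0, 5.0],      # z = 1.5
]

for x in samples:
    assert predict_from_probability(theta, x) == predict_from_sign(theta, x)

print([predict_from_sign(theta, x) for x in samples])  # [1, 0, 1, 1]
```

In floating point, checking the sign of z is the more robust of the two: for a tiny negative z the computed sigmoid can round to exactly 0.5.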

#### 1. Linear decision boundary

For example, suppose we have the hypothesis function hθ(x) = g(θ0 + θ1x1 + θ2x2) and a training data set with two features, x1 and x2.

Assume θ has the following values (later sections discuss how to find appropriate θ values; for now, assume we have already found them): θ0 = –3, θ1 = 1, θ2 = 1

So θTx = –3 + x1 + x2

We can predict ‘y = 1’ if θTx >= 0, i.e. x1 + x2 >= 3.

If we plot the line x1 + x2 = 3, the line itself is the decision boundary.  Any point on this line has hθ(x) = 0.5 exactly.  In the region where x1 + x2 >= 3 (the orange area in the plot), we predict y = 1.
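Here is the same example worked through in code, using θ0 = –3, θ1 = 1, θ2 = 1 from the text.  The test points and the function name `predict` are made up for illustration.

```python
import math

theta0, theta1, theta2 = -3.0, 1.0, 1.0  # values assumed in the text

def predict(x1, x2):
    """Return (h_theta(x), predicted label) for one point."""
    z = theta0 + theta1 * x1 + theta2 * x2
    h = 1.0 / (1.0 + math.exp(-z))
    return h, (1 if z >= 0 else 0)

points = [
    (1.0, 1.0),  # x1 + x2 = 2   -> below the line, predict 0
    (1.0, 2.0),  # x1 + x2 = 3   -> on the line, h = 0.5, predict 1
    (2.0, 2.0),  # x1 + x2 = 4   -> above the line, predict 1
    (0.0, 2.5),  # x1 + x2 = 2.5 -> below the line, predict 0
]

for x1, x2 in points:
    h, label = predict(x1, x2)
    print(f"({x1}, {x2}): h = {h:.3f}, predict y = {label}")
```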

#### 2. Non-linear decision boundaries

To get logistic regression to fit a complex non-linear data set, we can add higher-order terms to the hypothesis function, as in polynomial regression.  For example: hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2²)

Assume θ has the following values (later sections discuss how to find appropriate θ values; for now, assume we have already found them): θ0 = –1, θ1 = 0, θ2 = 0, θ3 = 1, θ4 = 1

So θTx = –1 + x1² + x2²

We can predict ‘y = 1’ if θTx >= 0, i.e. x1² + x2² >= 1.

If we plot the curve x1² + x2² = 1, a circle of radius 1 centered at the origin, the circle itself is the non-linear decision boundary.  Any point on this circle has hθ(x) = 0.5 exactly.  Outside the circle (the orange area in the plot), we predict y = 1.
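The circular boundary can be checked the same way.  In this sketch the helper `map_features` (a name I made up) builds the feature vector [1, x1, x2, x1², x2²], and θ takes the values assumed in the text; the test points are invented.

```python
def map_features(x1, x2):
    """Expand (x1, x2) into [1, x1, x2, x1^2, x2^2]."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

theta = [-1.0, 0.0, 0.0, 1.0, 1.0]  # theta0..theta4 from the text

def predict(x1, x2):
    """Predict 1 when theta^T x >= 0, i.e. when x1^2 + x2^2 >= 1."""
    z = sum(t * f for t, f in zip(theta, map_features(x1, x2)))
    return 1 if z >= 0 else 0

points = [
    (0.0, 0.0),   # center of the circle      -> 0
    (0.5, 0.5),   # inside (0.25 + 0.25 < 1)  -> 0
    (1.0, 0.0),   # exactly on the circle     -> 1
    (1.0, 1.0),   # outside                   -> 1
    (-2.0, 0.0),  # outside                   -> 1
]

print([predict(x1, x2) for x1, x2 in points])  # [0, 0, 1, 1, 1]
```

Note that the model is still linear in θ; only the features are non-linear in x1 and x2.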

By using even higher-order polynomial terms, we can get even more complex decision boundaries.

The hypothesis function depends on the parameters θ.  The following sections discuss how to choose θ so that the hypothesis fits a given training set.

## Cost Function

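For logistic regression, the cost of a single training example is defined as:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

Note: y is always 0 or 1.  Cost(hθ(x), y) is the cost for a single training example.

If we plot this cost, the x axis is what we predict and the y axis is the cost associated with that prediction.  The more wrong the prediction, the higher the cost, and the cost grows without bound as the prediction approaches the opposite label.  When the prediction is exactly correct, the cost is 0.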

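If we take the average of this cost over the entire training data set, we get the complete cost function of logistic regression:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$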

Why do we choose this cost function for logistic regression?  In brief,

• This cost function can be derived from statistics using the principle of maximum likelihood estimation.
• It is a convex function of θ, so gradient descent can reach the global minimum instead of getting stuck in a local minimum.
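The cost described in this section can be sketched in a few lines of plain Python.  The names `cost_single` and `cost`, the toy data, and the small clipping constant `eps` (which guards against log(0)) are my own additions and not part of the definition.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_single(h, y, eps=1e-12):
    """Cost of one example: -log(h) if y == 1, -log(1 - h) if y == 0."""
    h = min(max(h, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost(theta, X, y):
    """J(theta): average single-example cost over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += cost_single(h, yi)
    return total / len(X)

# A confident correct prediction is cheap; a confident wrong one is expensive.
print(cost_single(0.99, 1))
print(cost_single(0.01, 1))

# Hypothetical toy data; each row is [x0, x1, x2] with x0 = 1.
X = [[1.0, 0.5, 1.0], [1.0, 1.5, 0.5], [1.0, 3.0, 2.5], [1.0, 2.5, 3.0]]
y = [0, 0, 1, 1]

# With theta = 0 every prediction is 0.5, so J(theta) = log(2).
print(cost([0.0, 0.0, 0.0], X, y))
```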

So we have our hypothesis function and we have a way of measuring the cost of the prediction (cost function). Now what we need is a way to automatically improve our hypothesis function.  The goal is to minimize the cost function value J(θ) by adjusting θ0, θ1, … θn. That’s where gradient descent comes in again.

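Here is our usual template for gradient descent, where we repeatedly update each parameter by subtracting the learning rate α times the partial derivative of J(θ) with respect to that parameter, updating all θj simultaneously:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$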

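For logistic regression, we can substitute the derivative of the actual cost function and rewrite the update as

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where m is the size of the training set and x(i), y(i) are the values of the i-th example in the training set.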

This gradient descent algorithm looks identical to the one for linear regression!  The only difference is the hypothesis function hθ(x): in logistic regression, hθ(x) = 1 / (1 + e^(–θTx)), while in linear regression, hθ(x) = θTx.
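Below is a minimal batch gradient descent sketch for logistic regression in plain Python.  The toy data, the learning rate `alpha=0.1`, and the iteration count are assumptions chosen for illustration, and the function names are mine; a real project would more likely rely on a library implementation.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y, eps=1e-12):
    """J(theta), with predictions clipped to avoid log(0)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(hypothesis(theta, xi), eps), 1.0 - eps)
        total += -math.log(h) if yi == 1 else -math.log(1.0 - h)
    return total / len(X)

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # h_theta(x^(i)) - y^(i) for every example, using the current theta
        errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
        # Partial derivative of J with respect to each theta_j
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update of all parameters
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical, linearly separable toy data; each row is [x0, x1, x2].
X = [
    [1.0, 0.5, 1.0], [1.0, 1.0, 1.5], [1.0, 1.5, 0.5],
    [1.0, 3.0, 2.5], [1.0, 2.5, 3.0], [1.0, 3.5, 3.5],
]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
predictions = [1 if hypothesis(theta, xi) >= 0.5 else 0 for xi in X]

print("cost before:", cost([0.0, 0.0, 0.0], X, y))
print("cost after: ", cost(theta, X, y))
print("predictions:", predictions)
```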

Feature scaling and the choice of learning rate α that we discussed for linear regression also apply to logistic regression, so I won’t repeat them here.

## Other Algorithms to minimize the Cost Function J(θ)

Besides gradient descent, there are more sophisticated optimization algorithms, such as conjugate gradient, BFGS, and L-BFGS, that take the same inputs (the cost function and its partial derivatives) and minimize the cost function.  They have advantages and disadvantages compared with gradient descent.

Advantages:

• No need to manually pick the learning rate α
• Have a clever inner loop (a line search algorithm) that tries a range of α values and picks a good one
• Often converge faster than gradient descent
• Do more than just pick a good learning rate
• Can be used successfully without understanding all of their internals

Disadvantages:

• More complex
• Can make debugging more difficult
• Best not implemented yourself; use a well-tested library implementation instead
• Different libraries may use different implementations
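To give a feel for what a line search inner loop does, here is a rough sketch of gradient descent with a simple backtracking line search.  This is not one of the advanced algorithms themselves, only an illustration of trying several α values per step; the function names, the toy data, and the constants `alpha0`, `shrink`, and `c` are all assumptions.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y, eps=1e-12):
    total = 0.0
    for xi, yi in zip(X, y):
        h = min(max(hypothesis(theta, xi), eps), 1.0 - eps)
        total += -math.log(h) if yi == 1 else -math.log(1.0 - h)
    return total / len(X)

def gradient(theta, X, y):
    m, n = len(X), len(theta)
    errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
    return [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]

def line_search_step(theta, X, y, alpha0=1.0, shrink=0.5, c=1e-4):
    """One descent step: start with a large alpha, halve it until the cost
    drops by at least c * alpha * ||gradient||^2 (sufficient decrease)."""
    g = gradient(theta, X, y)
    j0 = cost(theta, X, y)
    g_sq = sum(gj * gj for gj in g)
    alpha = alpha0
    for _ in range(50):  # cap the number of alpha values tried
        candidate = [t - alpha * gj for t, gj in zip(theta, g)]
        if cost(candidate, X, y) <= j0 - c * alpha * g_sq:
            return candidate, alpha
        alpha *= shrink
    return theta, 0.0  # no acceptable step found; keep theta unchanged

# Hypothetical toy data; each row is [x0, x1, x2] with x0 = 1.
X = [
    [1.0, 0.5, 1.0], [1.0, 1.0, 1.5], [1.0, 1.5, 0.5],
    [1.0, 3.0, 2.5], [1.0, 2.5, 3.0], [1.0, 3.5, 3.5],
]
y = [0, 0, 0, 1, 1, 1]

theta = [0.0, 0.0, 0.0]
costs = [cost(theta, X, y)]
for _ in range(100):
    theta, alpha = line_search_step(theta, X, y)
    costs.append(cost(theta, X, y))

print("first cost:", costs[0])
print("last cost: ", costs[-1])
```

Because each step is accepted only if it lowers the cost, no learning rate has to be tuned by hand, which is the first advantage listed above.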


http://scottge.net

I’m a Program Manager at Microsoft (MSFT) and a part-time computer science master’s student at the University of Washington (UW).