Logistic regression, despite its name, is a linear model for classification rather than regression: it regresses for the probability of a categorical outcome, and this outcome is captured in binary format, i.e. 0 or 1.
Note: Logistic regression is also known as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. Frequently, “logistic regression” is used to refer specifically to the problem in which the dependent variable is binary (that is, the number of available categories is two), while problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordinal logistic regression.
If this topic is new to you, I suggest watching the free online course on Logistic Regression first.
Training Set
The training set has m examples and n features. The output y takes 2 values, {0, 1}:
Hypothesis Function (Logistic Regression Model)
We want our classifier to output values between 0 and 1, so our hypothesis function is going to satisfy this requirement:
Here g(z) is called the ‘logistic function’ or ‘sigmoid function’. The sigmoid function crosses 0.5 at z = 0, asymptotes to 1 as z → +∞, and asymptotes to 0 as z → –∞.
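As a quick numerical illustration (a minimal NumPy sketch; the function name `sigmoid` is my own choice):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 -- the curve crosses 0.5 at z = 0
print(sigmoid(10.0))   # close to 1 (upper asymptote)
print(sigmoid(-10.0))  # close to 0 (lower asymptote)
```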
We then need to fit the parameters θ to our training data.
Interpreting Hypothesis Output
When our hypothesis function (h_{θ}(x)) outputs a value, we treat that value as the estimated probability that y = 1 on input x. We can write this using the following notation:
 P(y = 1 | x; θ) = h_{θ}(x) is the probability that y = 1, given x, parameterized by θ
 P(y = 0 | x; θ) = 1 – P(y = 1 | x; θ) is the probability that y = 0, given x, parameterized by θ
Decision Boundary
Suppose we predict ‘y = 1’ if h_{θ}(x) >= 0.5; otherwise, ‘y = 0’
 h_{θ}(x) = g(θ^{T}x) and g(z) >= 0.5 when z >= 0, therefore, h_{θ}(x) >= 0.5 when θ^{T}x >= 0. In this condition, we can predict y = 1.
 h_{θ}(x) < 0.5 when θ^{T}x < 0. In this condition, we can predict y = 0.
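Putting the two cases together, prediction reduces to checking the sign of θ^{T}x. A minimal sketch (assuming NumPy; the name `predict` is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """Predict y = 1 when h_theta(x) = g(theta^T x) >= 0.5.
    Since g(z) >= 0.5 exactly when z >= 0, we only need the
    sign of theta^T x."""
    return 1 if np.dot(theta, x) >= 0 else 0

print(predict(np.array([1.0, 2.0]), np.array([1.0, 1.0])))   # theta^T x = 3 >= 0 -> 1
print(predict(np.array([-1.0, 0.0]), np.array([1.0, 5.0])))  # theta^T x = -1 < 0 -> 0
```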
1. Linear Decision Boundary
For example, we have the hypothesis function
and the following training data set:
Assume θ has the following values (later sections will discuss how to find appropriate θ values; here let’s assume we have figured them out): θ_{0} = –3, θ_{1} = 1, θ_{2} = 1
So θ^{T}x = –3 + x_{1} + x_{2}
We can predict ‘y = 1’ if θ^{T}x >= 0, i.e. x_{1} + x_{2} >= 3.
If we plot the line x_{1} + x_{2} = 3, the line itself is the decision boundary. Any point on this line has h_{θ}(x) = 0.5 exactly. In the orange area, we predict y = 1.
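We can verify this boundary numerically (a small sketch assuming NumPy; the helper name `h` is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta_0 = -3, theta_1 = 1, theta_2 = 1, as in the example above
theta = np.array([-3.0, 1.0, 1.0])

def h(x1, x2):
    """h_theta(x) for the feature vector [1, x1, x2]."""
    return sigmoid(theta @ np.array([1.0, x1, x2]))

print(h(2.0, 2.0))  # x1 + x2 = 4 > 3 -> h > 0.5, predict y = 1
print(h(1.0, 1.0))  # x1 + x2 = 2 < 3 -> h < 0.5, predict y = 0
print(h(1.0, 2.0))  # exactly on the line x1 + x2 = 3 -> h = 0.5
```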
2. Nonlinear Decision Boundaries
To get logistic regression to fit a complex nonlinear data set, e.g.
we can add higher-order terms to the hypothesis function, as in polynomial regression.
Assume θ has the following values (later sections will discuss how to find appropriate θ values; here let’s assume we have figured them out): θ_{0} = –1, θ_{1} = 0, θ_{2} = 0, θ_{3} = 1, θ_{4} = 1
So θ^{T}x = –1 + x_{1}^{2} + x_{2}^{2}
We can predict ‘y = 1’ if θ^{T}x >= 0, i.e. –1 + x_{1}^{2} + x_{2}^{2} >= 0
If we plot the curve –1 + x_{1}^{2} + x_{2}^{2} = 0 (a circle of radius 1 centered at the origin), the curve itself is the nonlinear decision boundary. Any point on this curve has h_{θ}(x) = 0.5 exactly. In the orange area, we predict y = 1.
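Again we can check the boundary numerically; here it is the unit circle x_{1}^{2} + x_{2}^{2} = 1. A NumPy sketch (the helper name `h` is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta_0 = -1, theta_1 = 0, theta_2 = 0, theta_3 = 1, theta_4 = 1
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def h(x1, x2):
    """h_theta(x) for the feature vector [1, x1, x2, x1^2, x2^2]."""
    return sigmoid(theta @ np.array([1.0, x1, x2, x1**2, x2**2]))

print(h(0.0, 0.0))  # inside the unit circle  -> h < 0.5, predict y = 0
print(h(2.0, 0.0))  # outside the circle      -> h > 0.5, predict y = 1
print(h(1.0, 0.0))  # on the circle x1^2 + x2^2 = 1 -> h = 0.5
```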
By using higher order polynomial terms, we can get even more complex decision boundaries, e.g.
The hypothesis function is based on the parameters θ. In the following sections we will discuss how, given the training set, to fit / choose the parameters θ in the hypothesis function for logistic regression.
Cost Function
Note: y is always 0 or 1. The Cost(h_{θ}(x), y) function is the cost for a single training example: Cost(h_{θ}(x), y) = –log(h_{θ}(x)) if y = 1, and –log(1 – h_{θ}(x)) if y = 0. It has the following features:
The x axis is the prediction h_{θ}(x); the y axis is the cost associated with that prediction. The more wrong the prediction is, the higher the cost; when the prediction is exactly correct, the cost is 0.
If we take the average of this cost over the entire training set, we get the complete cost function of logistic regression: J(θ) = –(1/m) Σ_{i=1}^{m} [ y^{(i)} log(h_{θ}(x^{(i)})) + (1 – y^{(i)}) log(1 – h_{θ}(x^{(i)})) ]
Why do we choose this cost function for logistic regression? In brief,
 This cost function can be derived from statistics using the principle of maximum likelihood estimation
 This is a convex function, so gradient descent can find the global minimum without getting stuck in local minima.
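The cost function above can be sketched in a few lines of NumPy (the names `sigmoid` and `cost`, and the toy data, are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ],
    where h = g(X @ theta) and X has a leading column of ones."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)) / m

X = np.array([[1.0, 2.0], [1.0, -1.0]])  # tiny toy set, intercept column first
y = np.array([1.0, 0.0])
print(cost(np.zeros(2), X, y))  # log(2) ~ 0.693: theta = 0 predicts 0.5 everywhere
```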
Gradient Descent
So we have our hypothesis function and we have a way of measuring the cost of a prediction (the cost function). Now we need a way to automatically improve our hypothesis function. The goal is to minimize the cost function value J(θ) by adjusting θ_{0}, θ_{1}, …, θ_{n}. That’s where gradient descent comes in again.
Here is our usual template for gradient descent, where we repeatedly update each parameter (simultaneously for all j): θ_{j} := θ_{j} – α · ∂J(θ)/∂θ_{j}, i.e. each parameter is updated to itself minus the learning rate α times the partial derivative of the cost.
For Logistic Regression, we can substitute the actual cost function and work the derivative out to: repeat { θ_{j} := θ_{j} – α · (1/m) Σ_{i=1}^{m} (h_{θ}(x^{(i)}) – y^{(i)}) x_{j}^{(i)} }, where m is the size of the training set and x^{(i)}, y^{(i)} are values of the given training set (data).
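Put together, batch gradient descent for logistic regression can be sketched as follows (NumPy; `gradient_descent`, the toy data, and the default hyperparameters are my choices, not prescribed values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, iterations=2000):
    """Repeat: theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij,
    updating all theta_j simultaneously (vectorized below)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        gradient = (X.T @ (h - y)) / m  # partial derivatives of J(theta)
        theta -= alpha * gradient
    return theta

# toy data: intercept column plus one feature; y = 1 when the feature is positive
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # [0 0 1 1] -- matches y
```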
This Gradient Descent algorithm looks identical to the Linear Regression Gradient Descent algorithm! The only difference is the hypothesis function h_{θ}(x). In Logistic Regression, h_{θ}(x) = 1/(1 + e^{–θ^{T}x}), while in Linear Regression, h_{θ}(x) = θ^{T}x.
Feature Scaling and the choice of Learning Rate α that we discussed in Linear Regression also apply to Logistic Regression. I’m not going to repeat them here.
Other Algorithms to minimize the Cost Function J(θ)
These are more sophisticated and more complex algorithms that take the same input (the cost function and its gradient) and minimize the cost function. They have advantages and disadvantages compared with the Gradient Descent algorithm.
Advantages:
 No need to manually pick alpha (learning rate)
 Have a clever inner loop (line search algorithm) which tries a bunch of alpha values and picks a good one
 Often faster than gradient descent
 Do more than just pick a good learning rate
 Can be used successfully without understanding their complexity
Disadvantages:
 More complex
 Could make debugging more difficult
 Should not be implemented by yourself; use a well-tested library instead
 Different libraries may use different implementations
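For example, SciPy’s `scipy.optimize.minimize` exposes several such optimizers (BFGS among them) behind one interface: you supply only the cost function and let the library handle step sizes. A hedged sketch (the toy data and the clipping constant are my own choices):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    h = np.clip(h, 1e-10, 1.0 - 1e-10)  # guard against log(0)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# toy data: intercept column plus one feature; y = 1 when the feature is positive
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# BFGS finds its own step sizes -- no alpha to pick by hand
result = minimize(cost, x0=np.zeros(2), args=(X, y), method="BFGS")
print(result.x)  # fitted theta; the slope component comes out positive
```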
