Logistic Regression Model
Cost Function
Given:
$$
\begin{align}
\text{training set} &= \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\} \\
x &= \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1} \\
x_0 &= 1 \\
y &\in \{0,1\} \\
h_\theta(x) &= \frac{1}{1 + e^{-\theta^\intercal x}}
\end{align}
$$
How do we choose \(\theta\)? As with linear regression, we have to minimize a cost function.
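Before getting to that, here is the hypothesis itself as a minimal NumPy sketch (the names `sigmoid` and `h` are mine, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) for a single example x, where x[0] == 1."""
    return sigmoid(theta @ x)

# With theta = 0, the hypothesis is maximally uncertain: h(x) = 0.5.
theta = np.array([0.0, 0.0])
x = np.array([1.0, 2.5])  # x_0 = 1 (bias), x_1 = 2.5
print(h(theta, x))        # 0.5
```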
Linear Regression Cost Function
Let's rewrite the linear regression cost function:
$$
\begin{align}
J(\theta) &= \frac{1}{m} \sum_{i=1}^m \frac12 (h_\theta(x^{(i)}) - y^{(i)})^2 \\
&= \frac1m \sum_{i=1}^m Cost(h_\theta(x^{(i)}), y^{(i)}) && \text{``the average cost across the training dataset''} \\
Cost(a,b) &= \frac12 (a - b)^2
\end{align}
$$
Can we use the same \(Cost\) for logistic regression? No, we can't. Because \(h_\theta(x) = \frac1{1+e^{-\theta^\intercal x}}\) is not linear, plugging it in makes \(J(\theta)\) non-convex: a function with many local minima, on which gradient descent is not guaranteed to reach the global minimum. We need a convex cost function for logistic regression.
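To make the non-convexity concrete, here is a small numeric check (my sketch, not from the notes): a convex function has a non-negative second derivative everywhere, and the squared-error cost composed with the sigmoid violates that.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_cost(theta):
    """Squared-error cost for a single training example with x = 1, y = 1."""
    return 0.5 * (sigmoid(theta) - 1.0) ** 2

# Central finite-difference estimate of the second derivative at theta = -2.2.
theta, eps = -2.2, 1e-4
second = (squared_cost(theta + eps) - 2 * squared_cost(theta)
          + squared_cost(theta - eps)) / eps**2
print(second)  # about -0.057: negative, so this cost is not convex in theta
```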
Logistic Regression Cost Function
For logistic regression we use a cost that penalizes a confidently wrong hypothesis without bound (writing \(a = h_\theta(x)\)):
$$
Cost(a, y) =
\begin{cases}
-\log(a) & \text{if } y = 1 \\
-\log(1 - a) & \text{if } y = 0
\end{cases}
$$
Looking at \(y=1\)
TODO Include graph.
Note - because we are dealing with logistic functions, \(0 \leq a \leq 1\).
Does \(\lim_{a \to 0} Cost(a, 1) = \infty\) make sense? If we say, for some value of \(x\), that \(h_\theta(x) = 0\), then we are saying that there is no chance that \(y=1\). If it turns out that \(y=1\), then our hypothesis is completely, absolutely wrong.
Looking at \(y=0\)
TODO Include graph.
Note - because we are dealing with logistic functions, \(0 \leq a \leq 1\).
Does \(\lim_{a \to 1} Cost(a, 0) = \infty\) make sense? If we say, for some value of \(x\), that \(h_\theta(x) = 1\), then we are saying that there is no chance that \(y=0\). If it turns out that \(y=0\), then our hypothesis is completely, absolutely wrong.
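A quick numeric illustration of both limits (my sketch):

```python
import numpy as np

# y = 1: Cost(a, 1) = -log(a) grows without bound as a -> 0.
for a in (0.1, 0.001, 1e-6):
    print(f"-log({a:g}) = {-np.log(a):.2f}")   # 2.30, 6.91, 13.82

# y = 0: Cost(a, 0) = -log(1 - a) grows without bound as a -> 1.
for a in (0.9, 0.999, 0.999999):
    print(f"-log(1 - {a:g}) = {-np.log(1 - a):.2f}")  # 2.30, 6.91, 13.82
```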
Simplified Cost Function and Gradient Descent
The cost function can be simplified if we take advantage of the fact that \(y \in \{0,1\}\):
$$
Cost(a,b) = -b \log(a) - (1-b) \log(1-a)
$$
Since \(y \in \{0,1\}\), one of the log multipliers will be 0 and the other will be 1, recovering the two cases above.
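A quick check that the single formula reproduces both branches (my sketch; the name `cost` is mine):

```python
import numpy as np

def cost(a, b):
    """Simplified cost: -b*log(a) - (1 - b)*log(1 - a)."""
    return -b * np.log(a) - (1 - b) * np.log(1 - a)

a = 0.8  # some hypothesis output h_theta(x)
print(cost(a, 1), -np.log(a))      # y = 1: both are -log(a)     ~ 0.223
print(cost(a, 0), -np.log(1 - a))  # y = 0: both are -log(1 - a) ~ 1.609
```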
Logistic Regression Cost Function
Substituting the simplified \(Cost\) into \(J(\theta)\):
$$
J(\theta) = -\frac1m \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
$$
Vectorized:
$$
\begin{align}
h &= g(X\theta) \\
J(\theta) &= \frac1m \left( -y^\intercal \log(h) - (1-y)^\intercal \log(1-h) \right)
\end{align}
$$
The above can be derived from the principle of maximum likelihood estimation. TODO Derive this.
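Here is the vectorized cost as a NumPy sketch (names are mine; `X` is the \(m \times (n+1)\) design matrix whose first column is all ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    """J(theta) = (1/m) * (-y' log(h) - (1 - y)' log(1 - h)), h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

# Tiny example: one feature plus the bias column x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(compute_cost(np.zeros(2), X, y))  # log(2) ~ 0.693, since h = 0.5 everywhere
```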
So how do we find \(\theta\) that minimizes \(J(\theta)\)? Gradient descent:
$$
\begin{align}
& \text{Repeat} \; \lbrace \\
& \quad \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \\
& \rbrace
\end{align}
$$
Let's show that \(\frac{\partial}{\partial\theta_j}J(\theta) = \frac1m \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\). First, the derivative of the hypothesis:
$$
\begin{align}
\frac{\partial}{\partial \theta_j} h_\theta(x) &= \frac{\partial}{\partial \theta_j} \frac{1}{1+e^{-\theta^\intercal x}} \\
&= \frac{\partial}{\partial \theta_j}(1+e^{-\theta^\intercal x})^{-1} \\
&= -(1+e^{-\theta^\intercal x})^{-2} \frac{\partial}{\partial \theta_j} (1+e^{-\theta^\intercal x}) \\
&= -(1+e^{-\theta^\intercal x})^{-2} \frac{\partial}{\partial \theta_j} e^{-\theta^\intercal x} \\
&= -(1+e^{-\theta^\intercal x})^{-2} e^{-\theta^\intercal x} \frac{\partial}{\partial \theta_j}(-\theta^\intercal x) \\
&= (1+e^{-\theta^\intercal x})^{-2} e^{-\theta^\intercal x} \frac{\partial}{\partial \theta_j}(\theta^\intercal x) \\
&= (1+e^{-\theta^\intercal x})^{-2} e^{-\theta^\intercal x} x_j \\
&= h_\theta(x)^2 \, e^{-\theta^\intercal x} \, x_j \\
&= h_\theta(x)(1 - h_\theta(x)) \, x_j && \text{since } e^{-\theta^\intercal x} = \tfrac{1}{h_\theta(x)} - 1
\end{align}
$$
Applying the chain rule to the cost of a single example:
$$
\begin{align}
\frac{\partial}{\partial \theta_j} Cost(h_\theta(x), y) &= \frac{\partial}{\partial \theta_j} \left[ -y \log(h_\theta(x)) - (1-y) \log(1-h_\theta(x)) \right] \\
&= \left( -\frac{y}{h_\theta(x)} + \frac{1-y}{1-h_\theta(x)} \right) h_\theta(x)(1-h_\theta(x)) \, x_j \\
&= (h_\theta(x) - y) \, x_j
\end{align}
$$
Averaging over the training set gives the formula above.
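A finite-difference check (my sketch) helps confirm the derivation; the analytic gradient should match a numeric one to several decimal places:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / len(y)

def gradient(theta, X, y):
    """Analytic gradient: (1/m) * X' (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Compare against central finite differences at a random theta.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=5)])
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
theta, eps = rng.normal(size=2), 1e-6
numeric = np.array([
    (compute_cost(theta + eps * e, X, y) - compute_cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(gradient(theta, X, y), numeric))  # True
```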
Our updated gradient descent:
$$
\begin{align}
& \text{Repeat} \; \lbrace \\
& \quad \theta_j := \theta_j - \alpha \frac1m \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\
& \rbrace
\end{align}
$$
Vectorized gradient descent: TODO Think about this
$$
\theta := \theta - \frac{\alpha}{m} X^\intercal(g(X\theta) - \vec{y})
$$
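A sketch of the full loop using the vectorized update (the learning rate, iteration count, and toy data are my arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Repeat: theta := theta - (alpha/m) * X' (g(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Toy data: one feature plus the bias column x_0 = 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(sigmoid(X @ theta).round(2))  # predictions move toward [0, 0, 1, 1]
```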
This looks very similar to linear regression gradient descent! What is the difference between the two, then? The hypothesis \(h_\theta(x)\):
- Linear Regression - \(h_\theta(x)=\theta^\intercal x\)
- Logistic Regression - \(h_\theta(x)= \frac{1}{1 + e^{-\theta^\intercal x}}\)
One last note: feature scaling applies to logistic regression as well.
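For example, standardizing each feature to zero mean and unit variance (one common form of feature scaling; the function name is mine, and it should be applied before adding the \(x_0 = 1\) bias column):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X_raw = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0]])
X_scaled, mu, sigma = standardize(X_raw)
print(X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_scaled.std(axis=0).round(6))   # [1, 1]
```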
Advanced Optimization
There are other ways to find \(\theta\) that minimizes \(J(\theta)\):
- Conjugate Gradient
- BFGS
- L-BFGS
These algorithms are more sophisticated than gradient descent, but they usually run faster, and they do not require us to pick a step size (\(\alpha\)).
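As an illustration (not from the notes), SciPy ships implementations of some of these methods; here is a sketch using `scipy.optimize.minimize`, which takes our cost and gradient and picks its own step sizes:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, as the optimizer expects with jac=True."""
    m = len(y)
    h = sigmoid(X @ theta)
    cost = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

# Toy (non-separable) data: bias column x_0 = 1 plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
result = minimize(cost_and_grad, np.zeros(2), args=(X, y),
                  jac=True, method="L-BFGS-B")
print(result.x)  # fitted theta; no alpha had to be chosen
```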
For more details, check the slides/video.