Sunday 14 May 2017

intuition - If logistic is the log odds ratio, what's softmax?



I recently saw a nice explanation of logistic regression: with logistic regression, we want to model the probability of success, however that is defined in the context of the problem. Probabilities are bounded between 0 and 1, so we can't do a linear regression on them directly, but we can if we rewrite the probabilities in an equivalent form whose range spans the entire real line. The odds ratio, $\frac{P}{1-P}$, spans from 0 to infinity, and taking its natural log gets us the rest of the way, spanning from -infinity to infinity. We then do a linear regression on that quantity, $\beta X = \log{\frac{P}{1-P}}$. Solving for the probability, we naturally end up with the logistic function, $P = \frac{e^{\beta X}}{1 + e^{\beta X}}$.
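
As a quick sanity check (a minimal numerical sketch, not part of the original explanation), the logistic function is exactly the inverse of the log-odds transform:

```python
import numpy as np

def logit(p):
    """Log odds: maps probabilities in (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

def logistic(z):
    """Inverse of the logit: maps the real line back into (0, 1)."""
    return np.exp(z) / (1 + np.exp(z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)                          # the quantity the linear regression targets
print(np.allclose(logistic(z), p))    # True: the logistic function inverts the log odds
```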



That explanation felt really intuitive for me, and it nicely explains why the output of the logistic function is interpreted as probabilities. The softmax function, $\frac{e^{x_i}}{\sum_k{e^{x_k}}}$ is supposed to generalize the logistic function to multiple classes instead of just two (success or failure).
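
For concreteness, here is a minimal softmax sketch (subtracting the max is just a standard numerical-stability trick and does not change the output):

```python
import numpy as np

def softmax(x):
    """Softmax over a vector of class scores; the max shift only guards against overflow."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # one score per class
print(softmax(scores))               # entries are positive and sum to 1
```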




Is there a similarly intuitive explanation for why the output of the softmax is a probability and how it generalizes the logistic function? I've seen various derivations, but they don't have the same ring to them that the log odds ratio explanation does.


Answer



I will split my answer according to your two questions:




  1. How does softmax generalize the logistic function?



As you've correctly stated, logistic regression models the probability of success. The problem is that in multiclass classification there is no single notion of success; what you would like instead is to encode the probability of belonging to each class (softmax). To show that it is a generalization, simply notice that the probability of success could just as well be encoded as the probability of being in the class "success" versus the probability of being in the class "failure". Here's a very loose proof of the equivalence of softmax with $K=2$ and logistic regression:




$$
\begin{align}
\Pr(y_i=1) &= \frac{e^{\theta_1^T x_i}}{\sum_{c \in \{0,1\}}{e^{\theta_c^T x_i}}} \\
&= \frac{e^{\theta_1^T x_i}}{e^{\theta_0^T x_i} + e^{\theta_1^T x_i}} \\
&= \frac{1}{e^{(\theta_0-\theta_1)^T x_i} + 1}
\end{align}
$$



Now simply define $\theta = -(\theta_0-\theta_1) = \theta_1 - \theta_0$ and you recover logistic regression, $\Pr(y_i=1) = \frac{1}{1 + e^{-\theta^T x_i}}$ :).
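
To see the equivalence numerically, here is a small sketch with hypothetical weight vectors $\theta_0$, $\theta_1$ and a random input:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, theta1 = rng.normal(size=3), rng.normal(size=3)   # hypothetical class weights
x = rng.normal(size=3)

scores = np.array([theta0 @ x, theta1 @ x])
p_softmax = np.exp(scores[1]) / np.exp(scores).sum()      # two-class softmax, Pr(y_i = 1)

theta = theta1 - theta0                                   # i.e. theta = -(theta0 - theta1)
p_logistic = 1 / (1 + np.exp(-theta @ x))                 # logistic regression probability

print(np.isclose(p_softmax, p_logistic))                  # True
```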





  2. What's the intuition behind softmax?



As with logistic regression, there is a simple intuitive explanation. I will approach it the other way around (from linear regression to softmax), as I find that more intuitive. The output of your linear regression lies in $]-\infty,\infty[$, but we need it to be in $[0,1]$ since we are trying to model the probability of being in a certain class. The first step is to take the exponential of the linear regression: $e^{\theta_{c'}^T x_i}: \ ]-\infty,\infty[ \ \rightarrow \ ]0,\infty[$. This gives an importance weight for each class; to get a probability, we simply normalize by the sum of the weights over all classes: $\frac{e^{\theta_{c'}^T x_i}}{\sum_{c}{e^{\theta_c^T x_i}}}: \ ]-\infty,\infty[ \ \rightarrow \ ]0,1[$. And there you have your probability!
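
The same two steps in a tiny numerical sketch, with made-up class scores:

```python
import numpy as np

scores = np.array([-1.5, 0.3, 2.0])   # outputs of three per-class linear regressions

weights = np.exp(scores)              # step 1: map ]-inf, inf[ to positive importance weights
probs = weights / weights.sum()       # step 2: normalize so the weights sum to 1

print(weights)                        # approx [0.22, 1.35, 7.39]
print(probs, probs.sum())             # a valid probability distribution, sums to 1
```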



Hope that helps :)



EDIT: You are asking specifically about what you are regressing. I didn't answer this at first because there isn't an explanation as clean as the one for logistic regression (I think of softmax as a map from a set of linear regressions to probabilities). If you really want to know what you're regressing, you can simply derive it (I'll write $p_c := \Pr(y_i=c)$ for brevity):




$$
\begin{align}
p_c &= \frac{e^{\theta_c^T x_i}}{e^{\theta_c^T x_i} + \sum_{c' \neq c}{e^{\theta_{c'}^T x_i}}} \\
(1-p_c)\, e^{\theta_c^T x_i} &= p_c \sum_{c' \neq c}{e^{\theta_{c'}^T x_i}} \\
\theta_c^T x_i &= \log\left(\frac{p_c}{1-p_c}\sum_{c' \neq c}{e^{\theta_{c'}^T x_i}}\right)
\end{align}
$$



Using the fact that $\log\left(\sum_i{e^{x_i}}\right) \approx \max_i(x_i)$ (the log-sum-exp approximation):




$$
\begin{align}
\theta_c^T x_i &\approx \log\left(\frac{p_c}{1-p_c}\right) + \max_{c' \neq c}(\theta_{c'}^T x_i) \\
\theta_c^T x_i - \max_{c' \neq c}(\theta_{c'}^T x_i) &\approx \log\left(\frac{p_c}{1-p_c}\right)
\end{align}
$$



We thus see that the log odds of a single class approximately encode the difference between the linear regression for that class and the maximum of the linear regressions over all the other classes.
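
If you want to check the log-sum-exp $\approx$ max approximation used above, here is a quick sketch with made-up numbers (the approximation is tight when one term dominates):

```python
import numpy as np

x = np.array([1.0, 2.5, 6.5, 7.0])
lse = np.log(np.exp(x).sum())   # log-sum-exp
print(lse, x.max())             # about 7.48 vs 7.0: close when the largest term dominates
```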

