LogisticRegressi

# Intro

## Pre-requisites

• The features are roughly linearly related to the log-odds of the outcome.
• The classes can be separated reasonably well by a linear decision boundary.
• Dependent variable (DV) is categorical.

## Advantages

• The output can be interpreted as a probability.
• Fairly robust to noise.
• Overfitting can be controlled with regularization.
• Efficient to train, and the calculations can be distributed.

## Usage

Quora: comparison among classification methods.
L2-regularized LR can serve as a baseline for fancier classification approaches.
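
A minimal sketch of such a baseline, using only the standard library. The function names (`fit_l2_logreg`, `predict_proba`), the penalty strength `lam`, the learning rate, and the toy data are all assumptions made for illustration; in practice a library implementation would normally be used.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_l2_logreg(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Batch gradient descent on mean log-loss + (lam / 2) * ||w||^2.

    The intercept b is not penalized (a common convention, assumed here).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The L2 penalty adds lam * w to the gradient of the weights.
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

# Toy data, made up for illustration: one feature, two overlapping points.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_l2_logreg(X, y)
probs = predict_proba(X, w, b)
accuracy = sum((p >= 0.5) == bool(yi) for p, yi in zip(probs, y)) / len(y)
print("weights:", w, "intercept:", b, "training accuracy:", accuracy)
```

Any fancier classifier can then be judged by how much it improves on this baseline's score on held-out data.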

## Categories

• Multinomial logistic regression: cases with more than two categories.
• Ordinal logistic regression: multiple categories are ordered.

# Logistic functions

## Odds Ratio

### Odds

• An alternate way of expressing probabilities.
• It simplifies the process of updating with new evidence.
Odds:
$$\frac{P(A)}{P(\neg A)} = \frac{a}{b} = \frac{p}{1-p}, \quad p=\frac{a}{a+b}.$$
Odds ratio:
$$R = \frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}}$$
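
The two formulas can be checked with a few lines of Python. This is only a sketch; the helper names `odds`, `prob_from_odds`, and `odds_ratio` are chosen here for illustration.

```python
def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)

def prob_from_odds(a, b):
    """Probability p = a / (a + b) for odds of a to b."""
    return a / (a + b)

def odds_ratio(p1, p2):
    """Ratio R of the odds of p1 to the odds of p2."""
    return odds(p1) / odds(p2)

print(odds(0.75))              # odds of 3 to 1
print(prob_from_odds(3, 1))    # back to p = 0.75
print(odds_ratio(0.75, 0.5))   # odds of 3 compared with even odds
```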

## Logit

For 0 < p < 1:
$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$$

The unit of the logit depends on the base of the logarithm:
• base 2: bit
• base e: nat
• base 10: ban
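
As a small illustration of the units, the logit can be evaluated in each base. The `logit` helper and its `base` parameter are assumptions of this sketch, not an established API.

```python
import math

def logit(p, base=math.e):
    """Log-odds of p in the given logarithm base (requires 0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    return math.log(p / (1.0 - p), base)

p = 0.8  # odds of 4 to 1
print("bits:", logit(p, 2))
print("nats:", logit(p))
print("bans:", logit(p, 10))
```

Changing the base only rescales the value: the result in nats is the result in bits multiplied by ln 2.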


## Logistic function

The logistic function is the inverse of the logit: $$\operatorname{logit}^{-1}(\alpha) = \frac{1}{1+e^{-\alpha}}$$
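
A quick round-trip check of the inverse relationship, as a sketch. The names `logit` and `inv_logit` are chosen here; the branch on the sign of `alpha` is only there to avoid overflow in `math.exp` for large negative inputs.

```python
import math

def logit(p):
    """Natural-log odds of p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(alpha):
    """Logistic function 1 / (1 + e^(-alpha)), computed stably."""
    if alpha >= 0:
        return 1.0 / (1.0 + math.exp(-alpha))
    e = math.exp(alpha)
    return e / (1.0 + e)

for p in (0.1, 0.3, 0.5, 0.9):
    # inv_logit(logit(p)) should recover p up to rounding error.
    print(p, logit(p), inv_logit(logit(p)))
```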

# SVM

## Difference with Logistic Regression

• SVM uses a different loss function (hinge loss) from logistic regression (LR), which uses log loss.
• The results are interpreted differently: SVM gives a maximum-margin decision boundary rather than probabilities.
• If your problem is not linearly separable, use SVM with a non-linear kernel (e.g. RBF).
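
The difference between the two loss functions can be seen by evaluating both at a few scores. This sketch assumes labels coded as -1 / +1 and a raw classifier score; the function names are illustrative.

```python
import math

def hinge_loss(y, score):
    """SVM hinge loss for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    """Logistic (log) loss log(1 + e^(-y * score)), computed stably."""
    m = y * score
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

for score in (-2.0, 0.0, 0.5, 1.0, 2.0):
    print(score, hinge_loss(1, score), logistic_loss(1, score))
```

The hinge loss is exactly zero once the margin `y * score` reaches 1, while the logistic loss stays positive for every finite score, so LR keeps being influenced by points that are already classified correctly.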

SVMs are hard to train, especially with many training examples.

# FAQ

## Sample size requirement

There are (at least) two different kinds of instability:
• The model parameters vary a lot with only slight changes in the training data.
• The predictions (for the same case) of models trained with slight changes in the training data vary a lot.
The best approach is to check both kinds of instability directly. Relying on the 1 to 10 rule alone is insufficient (see below).
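
One possible way to probe both kinds of instability is to refit the model on bootstrap resamples and look at the spread of the coefficient and of the prediction for a fixed case. The sketch below assumes a single feature, a plain gradient-descent fit, and made-up data; `fit_logreg_1d`, `bootstrap_instability`, and `x_query` are hypothetical names.

```python
import math
import random
import statistics

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logreg_1d(xs, ys, lr=0.1, n_iter=500):
    """Gradient descent on mean log-loss for p = sigmoid(w * x + b)."""
    w = b = 0.0
    n = len(xs)
    for _ in range(n_iter):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def bootstrap_instability(xs, ys, x_query, n_boot=50, seed=0):
    """Std. dev. of the slope and of the prediction at x_query over resamples."""
    rng = random.Random(seed)
    n = len(xs)
    slopes, preds = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        w, b = fit_logreg_1d([xs[i] for i in idx], [ys[i] for i in idx])
        slopes.append(w)
        preds.append(sigmoid(w * x_query + b))
    return statistics.stdev(slopes), statistics.stdev(preds)

# Toy data, made up for illustration.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
slope_sd, pred_sd = bootstrap_instability(xs, ys, x_query=0.25)
print("slope std:", slope_sd, "prediction std at x = 0.25:", pred_sd)
```

A large spread in either number relative to what the application can tolerate suggests the sample is too small, whatever the 1 to 10 rule says.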

### 1 to 10 rule

Basically, as the ratio of the number of parameters estimated to the number of data points gets close to 1, your model will become saturated, and will necessarily be overfit (unless there is, in fact, no randomness in the system). The 1 to 10 ratio rule of thumb comes from this perspective.
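
The arithmetic behind the rule is simple enough to write down. This is a sketch of the rule as stated here (parameters per data point); the function names and the default `ratio=10` are illustrative.

```python
def max_params_rule_of_thumb(n_obs, ratio=10):
    """Largest number of parameters allowed by a 1-to-`ratio` rule."""
    return n_obs // ratio

def is_saturated(n_params, n_obs):
    """True when there are at least as many parameters as data points."""
    return n_params >= n_obs

n_obs = 200
print(max_params_rule_of_thumb(n_obs))      # parameters allowed for 200 rows
print(is_saturated(n_params=200, n_obs=n_obs))
print(is_saturated(n_params=20, n_obs=n_obs))
```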

The 1 to 10 rule comes from the linear regression world, however, and it's important to recognize that logistic regression has additional complexities.

• One issue is that logistic regression works best when the split between 1's and 0's is approximately 50% / 50%.
• Another issue to be concerned with is separation. That is, you don't want to have all of your 1's gathered at one extreme of an independent variable (or some combination of them), and all of the 0's at the other extreme.
• One last issue with that rule of thumb is that it assumes your independent variables (IVs) are orthogonal.
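
The separation problem can be made concrete with a small experiment: on perfectly separated data the fitted coefficient has no finite optimum, so it keeps growing the longer the optimizer runs. The sketch below assumes one feature, no intercept, plain gradient descent, and made-up data; `is_separated_1d` and `fit_slope` are hypothetical helper names.

```python
import math

def is_separated_1d(xs, ys):
    """True if some threshold on x splits the 0's from the 1's perfectly."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)

def fit_slope(xs, ys, lr=0.5, n_iter=100):
    """Gradient descent on mean log-loss for p = sigmoid(w * x), no intercept."""
    w = 0.0
    for _ in range(n_iter):
        grad = sum((1.0 / (1.0 + math.exp(-w * x)) - y) * x
                   for x, y in zip(xs, ys))
        w -= lr * grad / len(xs)
    return w

xs = [-2.0, -1.0, 1.0, 2.0]
ys_separated = [0, 0, 1, 1]   # all 0's on the left, all 1's on the right
ys_overlap = [0, 1, 0, 1]     # classes overlap

for n_iter in (100, 1000, 10000):
    print(n_iter,
          fit_slope(xs, ys_separated, n_iter=n_iter),  # keeps increasing
          fit_slope(xs, ys_overlap, n_iter=n_iter))    # settles down
```

With overlapping classes the slope converges to a finite value, while in the separated case it only stops growing because the optimizer was stopped.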