[DL] Hyperplane Classifier

Perceptron Algorithm

  • Converges to a separating hyperplane if the data are linearly separable
  • Does not terminate otherwise

Let $\thetab = (b, w_1, w_2, \cdots, w_d)^\top$ and $\xb = (1, x_1, x_2, \cdots, x_d)^\top$


$$ \thetab^{(t)} = \thetab^{(t-1)} + \eta_t\, (y^{(t)} - \hat{y}^{(t)})\,\xb^{(t)} $$ where $$ \hat{y}^{(t)} = {\rm sign}\big(\thetab^{(t-1)\top} \xb^{(t)}\big) $$
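As a small worked step with made-up numbers: take $\eta_1 = 1$, $\thetab^{(0)} = \mathbf{0}$, and an example $\xb^{(1)} = (1, 2, 1)^\top$ with label $y^{(1)} = -1$. Then $\hat{y}^{(1)} = {\rm sign}(0) = +1$ (taking ${\rm sign}(0) = +1$, as in the code below), so the example is misclassified and

$$ \thetab^{(1)} = \thetab^{(0)} + \eta_1\,(y^{(1)} - \hat{y}^{(1)})\,\xb^{(1)} = \mathbf{0} + 1 \cdot (-2)\,(1, 2, 1)^\top = (-2, -4, -2)^\top, $$

after which $\thetab^{(1)\top}\xb^{(1)} = -12 < 0$, so this example is now classified correctly.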

A short proof of the convergence of the perceptron algorithm is sketched below; a simple implementation follows.
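A standard sketch (Novikoff's mistake bound), under the usual assumptions that some unit-norm $\thetab^*$ separates the data with margin $\gamma > 0$ (i.e. $y_i\,\thetab^{*\top}\xb_i \ge \gamma$ for all $i$) and that $\|\xb_i\| \le R$. For simplicity, write the update on a mistake as $\thetab \leftarrow \thetab + y\,\xb$, since for a constant learning rate the factor $2\eta$ in the update above only rescales $\thetab$. Starting from $\thetab^{(0)} = \mathbf{0}$, after $k$ mistakes,

$$ \thetab^{(k)\top}\thetab^* \;\ge\; \thetab^{(k-1)\top}\thetab^* + \gamma \;\ge\; k\gamma, \qquad \|\thetab^{(k)}\|^2 \;=\; \|\thetab^{(k-1)}\|^2 + 2\,y\,\thetab^{(k-1)\top}\xb + \|\xb\|^2 \;\le\; kR^2, $$

where the second bound uses $y\,\thetab^{(k-1)\top}\xb \le 0$ on a mistake. By Cauchy-Schwarz, $k\gamma \le \thetab^{(k)\top}\thetab^* \le \|\thetab^{(k)}\| \le \sqrt{k}\,R$, so the number of mistakes satisfies $k \le R^2/\gamma^2$ and the algorithm terminates.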

import numpy as np

def predict(w, x):
    # label = sign(w^T x), with sign(0) taken as +1
    return 1 if np.dot(w, x) >= 0 else -1

def perceptron_alg(X, Y):
    # prepend a constant-1 feature so the bias b is folded into w
    X = np.append(np.ones((np.size(X, 0), 1)), X, 1)
    w = np.zeros(np.size(X, 1))
    lr = 1  # learning rate
    converged = False
    while not converged:            # loop until a full pass makes no mistakes
        converged = True
        for x, y in zip(X, Y):
            y_pred = predict(w, x)
            if y != y_pred:         # update only on misclassified examples
                converged = False
                w = w + (y - y_pred) * lr * x
    return w
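
As a quick sanity check, the implementation can be run on a tiny, made-up linearly separable data set (labels must be in $\{-1, +1\}$); the arrays below are illustrative only:

X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])
Y = np.array([1, 1, -1, -1])

w = perceptron_alg(X, Y)                             # w[0] is the learned bias
print([predict(w, np.append(1.0, x)) for x in X])    # matches Y after convergence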

Perceptron Algorithm as Stochastic Gradient Descent

Loss function (rectified linear unit - relu)

$$ L\big(y, f(\thetab, \xb)\big) = \max\big(0,\; -y\,\thetab^\top \xb\big) $$

Sub-gradient

$$ \nabla_\thetab L = \begin{cases} -y\,\xb & \text{if } y\,\thetab^\top \xb \le 0 \\ \mathbf{0} & \text{otherwise} \end{cases} $$
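
A minimal sketch of how a single stochastic sub-gradient step on this loss reproduces the perceptron update, up to the constant factor from $(y - \hat{y}) = \pm 2$ that can be absorbed into the learning rate; the function names are illustrative and inputs are assumed to be NumPy arrays:

import numpy as np

def perceptron_loss(theta, x, y):
    # relu-shaped perceptron loss: max(0, -y * theta^T x)
    return max(0.0, -y * np.dot(theta, x))

def perceptron_subgradient(theta, x, y):
    # a valid sub-gradient of the loss with respect to theta
    return -y * x if y * np.dot(theta, x) <= 0 else np.zeros_like(x)

def sgd_step(theta, x, y, lr=1.0):
    # on a misclassified example this is theta + lr * y * x,
    # i.e. the perceptron update with the factor 2 absorbed into lr
    return theta - lr * perceptron_subgradient(theta, x, y)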

Stochastic gradient descent

A stochastic approximation of the gradient descent algorithm: it replaces the actual gradient (computed from the entire data set) with an estimate computed from a randomly selected subset $S$ of the data.

$$ \nabla J \approx \frac{1}{|S|} \sum_{i \in S} \nabla_\thetab\, L\big(y_i, f(\thetab, \xb_i)\big) $$

Especially in big-data applications, this reduces the computational burden, achieving faster iterations in exchange for a slightly lower convergence rate.
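
A minimal sketch of this mini-batch estimate, assuming a caller-supplied per-example gradient function grad_L(theta, x, y) (hypothetical, e.g. the perceptron sub-gradient above) and a batch size m = |S| with m <= len(X):

import numpy as np

def minibatch_gradient(theta, X, Y, grad_L, m, rng):
    # estimate nabla J by averaging per-example gradients over a random subset S of size m
    S = rng.choice(len(X), size=m, replace=False)
    return sum(grad_L(theta, X[i], Y[i]) for i in S) / m

def sgd(theta, X, Y, grad_L, m=32, lr=0.1, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        theta = theta - lr * minibatch_gradient(theta, X, Y, grad_L, m, rng)
    return theta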

Logistic Regression