Introduction to Machine Learning

Note: This page is actively being extended and modified (01/03/19)

Goal of ML: find the relation $f: X \rightarrow Y$ given some sample data

Supervised Learning - Given an example set/training data $(\mathbf{x}_i, y_i)$

  • Classification, regression, structured prediction

Unsupervised Learning - Given only a set of $\mathbf{x}_i$; the labels/categories $y_i$ must be inferred

  • Clustering, dimension reduction, anomaly detection

Note: There are also other methods: semi-supervised, transfer…

1. Regression

For $\mathbf{x}, \mathbf{w} \in \mathbb{R}^d$ and $y, w_0 \in \mathbb{R}$, we want to find $\mathbf{w}, w_0$ s.t. $y=\mathbf{w}^T\mathbf{x}+w_0$

Regression aims to find the $\mathbf{w}$ that minimises an error/cost/loss function $R$

  • $\mathbf{w}^* = argmin_{w} R(\mathbf{w})$

Note: as the true $R$ is generally not known, we use the approximation $\hat{R}$; this leads us to ERM (Empirical Risk Minimisation)

Representation

Convert to Homogeneous Representation (include $w_0$ in $w$):

  • $\mathbf{x} \mapsto [\mathbf{x}^T, 1]^T$ and $\mathbf{w} \mapsto [\mathbf{w}^T, w_0]^T$ giving $y=\mathbf{w}^T\mathbf{x}$

Convert non-linear to linear by extending $\mathbf{x}$ (which also extends $\mathbf{w}$); see the sketch after this list

  • In $\mathbb{R}$: $x \mapsto [1, x, x^2, \dots, x^n]^T$ for polynomial regression of order $n$
  • In $\mathbb{R}^2$: $[x_1, x_2]^T \mapsto [1, x_1, x_1^2, x_2, x_2^2, x_1x_2, \dots]^T$ using all monomials up to degree $k$
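
A minimal numpy sketch of both transformations (homogeneous form and polynomial features); the helper names homogeneous and poly_features are just illustrative:

import numpy as np

def homogeneous(X):
    """Append a constant 1 to every sample so that w_0 is absorbed into w."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def poly_features(x, n):
    """Map a 1-D input x to [1, x, x^2, ..., x^n] for polynomial regression of order n."""
    return np.vander(x, N=n + 1, increasing=True)

X = np.array([[2.0, 3.0], [4.0, 5.0]])
print(homogeneous(X))                           # [[2. 3. 1.], [4. 5. 1.]]
print(poly_features(np.array([2.0, 3.0]), 3))   # rows are [1, x, x^2, x^3]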

Goodness of Fit

Goodness of fit is measured through the error $R$

We can define the approximate loss function $\hat{R}$ through two choices:

  1. How an individual residual $r_i$ is calculated
  2. How the residuals are aggregated into $\hat{R}$

For least squares:

  1. Use the absolute residual $|y_{true} - y_{pred}|$, giving $r_i=|y_i-\mathbf{w}^T\mathbf{x}_i|$
  2. Sum the squares: $\hat{R}(\mathbf{w}) = \sum_{i=1}^n r_i^2$
    • So the optimal solution is $\mathbf{w}^* = argmin_{w} \hat{R}(\mathbf{w}) = argmin_{w}\sum_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2$
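
A minimal sketch of computing this $\hat{R}$ directly, assuming $X$ is already in homogeneous form (empirical_risk and the toy data are illustrative):

import numpy as np

def empirical_risk(w, X, y):
    """Least squares: R_hat(w) = sum_i (y_i - w^T x_i)^2."""
    residuals = y - X.dot(w)
    return np.sum(residuals ** 2)

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # last column is the constant 1
y = np.array([2.0, 4.0, 6.0])
print(empirical_risk(np.array([2.0, 0.0]), X, y))   # 0.0, a perfect fit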

Other loss functions:

  • $\hat{R}(\mathbf{w}) = \sum_{i=1}^n r_i^p$
    • $p > 2$: outliers add very large errors; for $p \gg 2$, residuals in $[0,1]$ are effectively ignored
    • $0 < p < 1$: outliers have less effect
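
A quick illustration with made-up residuals containing one outlier, showing how its share of $\hat{R}$ grows with $p$:

import numpy as np

r = np.array([0.1, 0.2, 0.3, 5.0])   # small residuals plus one outlier
for p in [0.5, 1, 2, 4]:
    contrib = r ** p
    print(p, contrib.sum(), contrib[-1] / contrib.sum())  # total loss and the outlier's share of it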

Closed Forms

We can minimise some forms of $\hat{R}$ directly in closed form, for example via the normal equations for least squares

Only for least squares: $\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}_{train}$

import numpy as np

# Solve the normal equations X^T X w = X^T y (avoids forming the inverse explicitly)
w = np.linalg.solve(X_train.T.dot(X_train), X_train.T.dot(y_train))
y_test = X_test.dot(w)

# The same model using sklearn (fits the intercept w_0 itself by default)
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_test = regr.predict(X_test)

Gradient Descent

Iteratively finds a minimum; $\hat{R}$ must be convex to guarantee reaching the globally optimal solution

  1. Start at arbitrary $w_0 \in \mathbb{R}^d$
  2. $w_{t+1} = w_t - \eta_t\nabla\hat{R}(w_t)$ with $\eta_t$ as the stepsize/learning rate

Poor step size:

  • $\eta$ too large: overshooting, divergence
  • $\eta$ too small: slow convergence
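
A minimal gradient-descent sketch for the least-squares $\hat{R}$ with a fixed learning rate; gradient_descent, eta and n_iters are illustrative choices that would need tuning for real data:

import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Minimise R_hat(w) = sum_i (y_i - w^T x_i)^2 by gradient descent."""
    w = np.zeros(X.shape[1])                  # arbitrary starting point w_0
    for _ in range(n_iters):
        grad = -2 * X.T.dot(y - X.dot(w))     # gradient of the sum of squared residuals
        w = w - eta * grad                    # w_{t+1} = w_t - eta_t * gradient
    return w

w = gradient_descent(X_train, y_train)        # X_train assumed to be in homogeneous form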

We can choose the learning rate $\eta_t$ via:

  1. Line search: find the step size with the maximal decrease in $\hat{R}$
    1. Do a 1D minimisation along the line of the gradient
  2. Bold Driver heuristic: adapt $\eta$ according to the change in $\hat{R}$
    1. Increase $\eta$ by a constant factor if $\hat{R}(w_t-\eta_t g_t) < \hat{R}(w_t)$, with $g_t$ being the gradient; otherwise decrease it
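
A sketch of the Bold Driver heuristic wrapped around the same update; the factors 1.1 and 0.5 are common but arbitrary choices, and this variant rejects any step that increases the loss:

import numpy as np

def bold_driver(X, y, eta=0.01, n_iters=100, inc=1.1, dec=0.5):
    """Gradient descent on the least-squares loss with Bold Driver step-size adaptation."""
    def risk(w):
        return np.sum((y - X.dot(w)) ** 2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = -2 * X.T.dot(y - X.dot(w))        # gradient g_t at w_t
        w_new = w - eta * g
        if risk(w_new) < risk(w):             # loss decreased: accept the step and grow eta
            w, eta = w_new, eta * inc
        else:                                 # loss increased: reject the step and shrink eta
            eta = eta * dec
    return w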

2. Python Cheatsheet

See Numpy Cheatsheet

import numpy as np

A = np.array([(1, 2), (3, 4)], dtype=float)   # use float/np.double, not the bare name double
x = np.array([1.0, 2.0])
y = A.dot(x)    # y = Ax
A_1 = A[:, 0]   # first column
x_T = x.T       # transpose (a no-op for 1-D arrays)

Load/save matrices with CSV: load everything into one matrix, then select the different parts

data = np.loadtxt("test.csv", delimiter=",", skiprows=1)  # skiprows=1 skips the header
Id = data[:, 0]    # select individual columns
X = data[:, 1:11]

out = np.array([Id, y]).T    # stack Id and the predictions as columns
np.savetxt("output.csv", out, delimiter=",", header="Id,y", fmt='%d,%s', comments='')  # comments='' avoids a '#' before the header