Introduction to Machine Learning

Note: This page is actively being extended and modified (01/03/19)

Goal of ML: find the relation $f: X \rightarrow Y$ given some sample data

Supervised Learning - Given an example set/training data $(\mathbf{x}_i, y_i)$

  • Classification, regression, structured prediction

Unsupervised Learning - Given only a set of $\mathbf{x}_i$; the labels/categories $y_i$ must be inferred

  • Clustering, dimension reduction, anomaly detection

Note: There are also other methods: semi-supervised, transfer…

1. Regression

For $\mathbf{x}, \mathbf{w} \in \mathbb{R}^d$ and $y, w_0 \in \mathbb{R}$, we want to find $\mathbf{w}, w_0$ s.t. $y=\mathbf{w}^T\mathbf{x}+w_0$

Regression aims to find the $\mathbf{w}$ that minimises an error/cost/loss function $R$

  • $\mathbf{w}^* = argmin_{w} R(\mathbf{w})$

Note: as the true $R$ is generally not known, we use the approximation $\hat{R}$; this leads us to ERM (Empirical Risk Minimisation)

Representation

Convert to Homogeneous Representation (include $w_0$ in $w$):

  • $\mathbf{x} \mapsto [\mathbf{x}^T, 1]^T$ and $\mathbf{w} \mapsto [\mathbf{w}^T, w_0]^T$ giving $y=\mathbf{w}^T\mathbf{x}$

Convert non-linear to linear by extending $\mathbf{x}$ (which also extends $\mathbf{w}$); see the sketch after this list

  • In $\mathbb{R}$: $x \mapsto [1, x, x^2, \dots, x^n]^T$ for polynomial regression of order $n$
  • In $\mathbb{R}^2$: $[x_1, x_2]^T \mapsto [1, x_1, x_1^2, x_2, x_2^2, x_1x_2, \dots]^T$ using all monomials up to degree $k$
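
A minimal numpy sketch of both transformations (homogeneous form and polynomial features); the helper names homogeneous and poly_features are just illustrative:

import numpy as np

def homogeneous(X):
    """Append a constant 1 to every sample so that w_0 is absorbed into w."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def poly_features(x, n):
    """Map a 1-D input x to [1, x, x^2, ..., x^n] for polynomial regression of order n."""
    return np.vander(x, N=n + 1, increasing=True)

X = np.array([[2.0, 3.0], [4.0, 5.0]])
print(homogeneous(X))                           # [[2. 3. 1.], [4. 5. 1.]]
print(poly_features(np.array([2.0, 3.0]), 3))   # rows are [1, x, x^2, x^3]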

Goodness of Fit

Goodness of fit is measured through the error $R$

We can define the approximate loss function $\hat{R}$ through two choices:

  1. How an individual residual $r_i$ is calculated
  2. How the residuals are aggregated into $\hat{R}$

For least squares:

  1. Use the absolute residual $|y_{true} - y_{pred}|$, giving $r_i=|y_i-\mathbf{w}^T\mathbf{x}_i|$
  2. Sum the squares: $\hat{R}(\mathbf{w}) = \sum_{i=1}^n r_i^2$
    • So the optimal solution is $\mathbf{w}^* = argmin_{w} \hat{R}(\mathbf{w}) = argmin_{w}\sum_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2$
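
A minimal sketch of computing this $\hat{R}$ directly, assuming $X$ is already in homogeneous form (empirical_risk and the toy data are illustrative):

import numpy as np

def empirical_risk(w, X, y):
    """Least squares: R_hat(w) = sum_i (y_i - w^T x_i)^2."""
    residuals = y - X.dot(w)
    return np.sum(residuals ** 2)

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # last column is the constant 1
y = np.array([2.0, 4.0, 6.0])
print(empirical_risk(np.array([2.0, 0.0]), X, y))   # 0.0, a perfect fit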

Other loss functions:

  • $\hat{R}(\mathbf{w}) = \sum_{i=1}^n r_i^p$
    • $p > 2$: outliers add very large errors; for $p \gg 2$, residuals in $[0,1]$ are effectively ignored
    • $0 < p < 1$: outliers have less effect
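
A quick illustration with made-up residuals containing one outlier, showing how its share of $\hat{R}$ grows with $p$:

import numpy as np

r = np.array([0.1, 0.2, 0.3, 5.0])   # small residuals plus one outlier
for p in [0.5, 1, 2, 4]:
    contrib = r ** p
    print(p, contrib.sum(), contrib[-1] / contrib.sum())  # total loss and the outlier's share of it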

Closed Forms

We can minimise some forms of $\hat{R}$ directly in closed form, for example via the normal equations for least squares

Only for least squares: $\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}_{train}$

import numpy as np

# Solve the normal equations X^T X w = X^T y (avoids forming the inverse explicitly)
w = np.linalg.solve(X_train.T.dot(X_train), X_train.T.dot(y_train))
y_test = X_test.dot(w)

# The same model using sklearn (fits the intercept w_0 itself by default)
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_test = regr.predict(X_test)

Gradient Descent

Iteratively finds a minimum; $\hat{R}$ must be convex to guarantee reaching the globally optimal solution

  1. Start at arbitrary $w_0 \in \mathbb{R}^d$
  2. $w_{t+1} = w_t - \eta_t\nabla\hat{R}(w_t)$ with $\eta_t$ as the stepsize/learning rate

Poor step size:

  • $\eta$ too large: overshooting, divergence
  • $\eta$ too small: slow convergence
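
A minimal gradient-descent sketch for the least-squares $\hat{R}$ with a fixed learning rate; gradient_descent, eta and n_iters are illustrative choices that would need tuning for real data:

import numpy as np

def gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Minimise R_hat(w) = sum_i (y_i - w^T x_i)^2 by gradient descent."""
    w = np.zeros(X.shape[1])                  # arbitrary starting point w_0
    for _ in range(n_iters):
        grad = -2 * X.T.dot(y - X.dot(w))     # gradient of the sum of squared residuals
        w = w - eta * grad                    # w_{t+1} = w_t - eta_t * gradient
    return w

w = gradient_descent(X_train, y_train)        # X_train assumed to be in homogeneous form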

We can choose the learning rate $\eta_t$ via:

  1. Line search: find the step size with the maximal decrease in $\hat{R}$
    1. Do a 1D minimisation along the line of the gradient
  2. Bold Driver heuristic: adapt $\eta$ according to the change in $\hat{R}$
    1. Increase $\eta$ by a constant factor if $\hat{R}(w_t-\eta_t g_t) < \hat{R}(w_t)$, with $g_t$ being the gradient; otherwise decrease it
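
A sketch of the Bold Driver heuristic wrapped around the same update; the factors 1.1 and 0.5 are common but arbitrary choices, and this variant rejects any step that increases the loss:

import numpy as np

def bold_driver(X, y, eta=0.01, n_iters=100, inc=1.1, dec=0.5):
    """Gradient descent on the least-squares loss with Bold Driver step-size adaptation."""
    def risk(w):
        return np.sum((y - X.dot(w)) ** 2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = -2 * X.T.dot(y - X.dot(w))        # gradient g_t at w_t
        w_new = w - eta * g
        if risk(w_new) < risk(w):             # loss decreased: accept the step and grow eta
            w, eta = w_new, eta * inc
        else:                                 # loss increased: reject the step and shrink eta
            eta = eta * dec
    return w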

2. Python Cheatsheet

See Numpy Cheatsheet

import numpy as np

A = np.array([(1, 2), (3, 4)], dtype=float)   # use float/np.double, not the bare name double
x = np.array([1.0, 2.0])
y = A.dot(x)    # y = Ax
A_1 = A[:, 0]   # first column
x_T = x.T       # transpose (a no-op for 1-D arrays)

Load/save matrices with CSV: load everything into one matrix, then select the different parts

data = np.loadtxt("test.csv", delimiter=",", skiprows=1)  # skiprows=1 skips the header
Id = data[:, 0]    # select individual columns
X = data[:, 1:11]

out = np.array([Id, y]).T    # stack Id and the predictions as columns
np.savetxt("output.csv", out, delimiter=",", header="Id,y", fmt='%d,%s', comments='')  # comments='' avoids a '#' before the header