Introduction to Machine Learning
Note: This page is actively being extended and modified (01/03/19)
Goal of ML: find the relation $f: X \rightarrow Y$ given some sample data
Supervised Learning - Given an example set/training data $(\mathbf{x}_i, y_i)$
- Classification, regression, structured prediction
Unsupervised Learning - Given only a set of $\mathbf{x}_i$, need to infer labels/categories $y_i$
- Clustering, dimension reduction, anomaly detection
Note: There are also other methods: semi-supervised learning, transfer learning, …
1. Regression
For $\mathbf{x}, \mathbf{w} \in \mathbb{R}^d$ and $y, w_0 \in \mathbb{R}$, we want to find $\mathbf{w}, w_0$ s.t. $y=\mathbf{w}^T\mathbf{x}+w_0$
Regression aims to find $\mathbf{w}$ with the minimal error/cost/loss function $R$
- $\mathbf{w^*} = argmin_{w} R(\mathbf{w})$
Note: as the true $R$ is generally not known, we use the approximation $\hat{R}$; this leads us to ERM (Empirical Risk Minimisation)
Representation
Convert to Homogeneous Representation (include $w_0$ in $w$):
- $\mathbf{x} \mapsto [\mathbf{x}^T, 1]^T$ and $\mathbf{w} \mapsto [\mathbf{w}^T, w_0]^T$ giving $y=\mathbf{w}^T\mathbf{x}$
Convert non-linear to linear by extending $\mathbf{x}$ (also extends $\mathbf{w}$)
- In $\mathbb{R}$: $x \mapsto [1, x, x^2, \dots, x^n]^T$ for polynomial regression of order $n$
- In $\mathbb{R}^2: [x_1, x_2]^T \mapsto [1, x_1, x_1^2, x_2, x_2^2, x_1x_2…]^T$ using all monomials up to degree $k$
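To make this concrete, a minimal numpy sketch of the two mappings (the array names, toy data and order $n=3$ are my own example choices, not from the notes):
import numpy as np
X = np.array([[1.0], [2.0], [3.0]])  # toy inputs in R, shape (n_samples, 1) -- assumed example data
X_hom = np.hstack([X, np.ones((X.shape[0], 1))])  # homogeneous representation: append a constant-1 feature
n = 3  # polynomial order
X_poly = np.hstack([X**k for k in range(n + 1)])  # x -> [1, x, x^2, x^3]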
Goodness of Fit
Goodness of fit is measured through the error $R$
We can define the approx loss function $\hat{R}$ in two main ways:
- How an individual residual $r_i$ is calculated
- How residuals are aggregated into $\hat{R}$
For least squares:
- Use $|y_{\text{true}} - y_{\text{pred}}|$, giving $r_i=|y_i-\mathbf{w}^T\mathbf{x}_i|$
- Sum squares $\hat{R}(w) = \sum_{i=1}^n r_i^2$
- So the optimal solution is $\mathbf{w^*} = argmin_{w} \hat{R}(\mathbf{w}) = argmin_{w}\sum_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2$
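As a direct translation into numpy, a small sketch of this empirical risk (the function name is my own; X holds the $\mathbf{x}_i$ as rows):
import numpy as np
def empirical_risk(w, X, y):  # hypothetical helper, not from the notes
    r = y - X.dot(w)  # residuals r_i = y_i - w^T x_i
    return np.sum(r**2)  # hat-R(w) = sum_i r_i^2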
Other loss functions:
- $\hat{R}(w) = \sum_{i=1}^n r_i^p$
- $p > 2$: outliers add very large errors; for $p \gg 2$, residuals in $[0,1]$ are effectively ignored
- $0 < p < 1$: outliers have less effect
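A quick numerical illustration of how the choice of $p$ weights the residuals (the residual values are made up):
import numpy as np
r = np.array([0.1, 0.2, 5.0])  # two small residuals and one outlier -- invented values
for p in [0.5, 1, 2, 4]:
    print(p, np.sum(np.abs(r)**p))  # hat-R for this p: large p is dominated by the outlier, small p much less so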
Closed Forms
For some loss functions the minimiser of $\hat{R}$ can be computed directly, e.g. via the normal equations for least squares
Only for least squares: $\mathbf{w}^* = (X^TX)^{-1}X^Ty_{train}$ (requires $X^TX$ to be invertible)
w = np.linalg.solve(X_train.T.dot(X_train), X_train.T.dot(y_train))  # solve the normal equations X^T X w = X^T y
y_test = X_test.dot(w)  # predictions on the test inputs
from sklearn import linear_model # Using sklearn
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_test = regr.predict(X_test)
Gradient Descent
Iterative method for finding a minimum; reaching the optimal solution requires $\hat{R}$ to be convex
- Start at arbitrary $w_0 \in \mathbb{R}^d$
- $w_{t+1} = w_t - \eta_t\nabla\hat{R}(w_t)$ with $\eta_t$ as the stepsize/learning rate
Poor step size:
- $\eta$ too large: overshooting, divergence
- $\eta$ too small: slow convergence
We can choose the learning rate $\eta_t$ via:
- Line search: find the step size with the maximal decrease in $\hat{R}$
- Do a 1D minimisation along the gradient direction
- Bold Driver heuristic: adapt $\eta$ according to the change in $\hat{R}$
- Increase $\eta$ by a constant factor if $\hat{R}(w_t-\eta_t g_t) < \hat{R}(w_t)$ (with $g_t$ the gradient), otherwise decrease it
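A minimal sketch of gradient descent with a constant learning rate for the least-squares $\hat{R}$ above (function and parameter names are my own; the line-search and bold-driver variants are not shown):
import numpy as np
def gradient_descent(X, y, eta=1e-3, n_steps=1000):  # hypothetical helper, constant step size
    w = np.zeros(X.shape[1])  # arbitrary starting point
    for _ in range(n_steps):
        grad = -2 * X.T.dot(y - X.dot(w))  # gradient of sum_i (y_i - w^T x_i)^2
        w = w - eta * grad  # w_{t+1} = w_t - eta_t * gradient
    return w
For a small enough constant $\eta$ this should approach the closed-form solution from the normal equations above.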
2. Python Cheatsheet
See Numpy Cheatsheet
import numpy as np
A = np.array([(1, 2), (3, 4)], dtype=np.double)
x = np.array([1.0, 2.0])  # example vector (not in the original notes)
y = A.dot(x)  # y = Ax
A_1 = A[:, 0]  # first col
x_T = x.T  # transpose (no effect on a 1-D array)
Load/Save matrices with csv
- Load everything into one matrix, then select different parts
data = np.loadtxt(open("test.csv", "rb"), delimiter=",", skiprows=1) #skip header
Id = data[:,0] #select different cols
X = data[:,1:11]
out = np.array([Id, y]).T  # stack Id and the predictions y as two columns
np.savetxt("output.csv", out, delimiter=",", header="Id,y", fmt='%d,%s', comments='')  # one format specifier per column