CS-433 (Machine Learning)

1. Linear Regression

y_n \approx \mathbf{\tilde x}_n^T \mathbf{\tilde w}, where \mathbf{\tilde x}_n \in \R^{D+1} is the feature vector with a 1 prepended (adding a constant offset) and \mathbf{\tilde w} \in \R^{D+1} the weight vector for predicting label y_n

2. Loss functions

Desirable: symmetric around 0, penalise large and very large mistakes similarly

  • \text{MSE}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} e_n^2 = \frac{1}{N} \sum_{n=1}^{N} [y_n - f_\mathbf{w}(\mathbf{x}_n)]^2 → sensitive to outliers
  • \text{MAE}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} |e_n| = \frac{1}{N} \sum_{n=1}^{N} |y_n - f_\mathbf{w}(\mathbf{x}_n)| → robust to outliers

Convexity

A function h(u) is convex if for any u, v \in \R^D and any 0 \leq \lambda \leq 1 we have:

h(\lambda u + (1-\lambda)v) \leq \lambda h(u) + (1-\lambda)h(v)

A strictly convex function has a unique global minimum. For convex functions, every local minimum is a global minimum. Sums of convex functions and compositions of convex functions with convex non-decreasing functions are convex.

3. Optimisation

Find \mathbf{w}^* \in \R^D that achieves \min_\mathbf{w} \mathcal{L}(\mathbf{w})

Grid search: n values per dimension → n^D evaluations, with no guarantee of finding the minimum

Gradient descent: \nabla \mathcal{L}(\mathbf{w}) := \left[ \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_1}, ..., \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_D} \right] \in \R^D

  • \mathbf{w}^{(t+1)}:=\mathbf{w}^{(t)}-\gamma \nabla \mathcal{L}(\mathbf{w}^{(t)})
  • Linear MSE: \mathbf{e} = \mathbf{y}-\mathbf{Xw}
    • \mathcal{L}(\mathbf{w}) = \frac{1}{2N}\mathbf{e}^\top\mathbf{e}
    • \nabla \mathcal{L}(\mathbf{w}) = -\frac{1}{N}\mathbf{X}^\top\mathbf{e}
    • Cost per step \mathcal{O}(ND)

SGD:

  • \mathbf{w}^{(t+1)}:=\mathbf{w}^{(t)}-\gamma \nabla \mathcal{L}_n(\mathbf{w}^{(t)})
  • Cheap but unbiased estimate of the full gradient (one sample n at a time)
  • Cost per step \mathcal{O}(D)

Mini-batch SGD:

  • g:=\frac{1}{|B|} \sum_{n \in B}\nabla \mathcal{L}_n(\mathbf{w}^{(t)}), where B is a subset of the N samples
  • \mathbf{w}^{(t+1)}:=\mathbf{w}^{(t)}-\gamma g (see the sketch below)
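
A minimal NumPy sketch of these update rules for the linear MSE case; the names (`gamma`, `batch_size`, `mse_gradient`) are illustrative, not from the course code:

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of L(w) = 1/(2N) * e^T e with e = y - Xw, i.e. -X^T e / N."""
    e = y - X @ w
    return -X.T @ e / len(y)

def gradient_descent(X, y, gamma=0.1, max_iters=100):
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        w = w - gamma * mse_gradient(X, y, w)      # full gradient: O(ND) per step
    return w

def minibatch_sgd(X, y, gamma=0.1, max_iters=1000, batch_size=1):
    """batch_size=1 gives plain SGD; larger batches give mini-batch SGD."""
    N = len(y)
    w = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for _ in range(max_iters):
        idx = rng.choice(N, size=batch_size, replace=False)
        g = mse_gradient(X[idx], y[idx], w)        # unbiased estimate of the full gradient
        w = w - gamma * g
    return w
```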

Subgradient descent:

  • Convexity for differentiable functions: \mathcal{L}(\mathbf{u}) \geq \mathcal{L}(\mathbf{w})+\nabla \mathcal{L}(\mathbf{w})^\top(\mathbf{u}-\mathbf{w})
  • Subgradient g \in \partial \mathcal{L}(\mathbf{w}): \mathcal{L}(\mathbf{u}) \geq \mathcal{L}(\mathbf{w})+g^\top(\mathbf{u}-\mathbf{w})

Projected gradient descent:

  • Intersections of convex sets are convex
  • Project onto the convex constraint set \mathcal{C} after each gradient descent step
  • Alternative: turn it into an unconstrained problem by adding a penalty for points outside \mathcal{C}

4. Least squares

Gram matrix \mathbf{X}^\top \mathbf{X}\in \R^{D\times D}. The optimal \mathbf{w}^* is given in closed form by the normal equations \mathbf{X}^\top \mathbf{X} \mathbf{w}^*=\mathbf{X}^\top \mathbf{y}. The Gram matrix is invertible iff \mathbf{X} \in \R^{N \times D} has full column rank. Complexity is \mathcal{O}(ND^2+D^3)

  • L_2-regularisation (ridge) closed form, with \lambda' = 2N \lambda: (\mathbf{X}^\top \mathbf{X}+\lambda'\mathbf{I}_D) \mathbf{w}^*=\mathbf{X}^\top \mathbf{y}
  • The eigenvalues of (\mathbf{X}^\top \mathbf{X}+\lambda'\mathbf{I}_D) are at least \lambda', so the system is always solvable
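
A sketch of the two closed-form solutions above, using `np.linalg.solve` instead of an explicit matrix inverse (`lambda_` stands for the regularisation strength \lambda):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y (requires full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_regression(X, y, lambda_):
    """Solve (X^T X + lambda' I) w = X^T y with lambda' = 2 N lambda."""
    N, D = X.shape
    lambda_prime = 2 * N * lambda_
    return np.linalg.solve(X.T @ X + lambda_prime * np.eye(D), X.T @ y)
```
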
5. Maximum Likelihood

Maximising the likelihood instead of minimising the loss. Our data is y_n = \mathbf{x}^\top_n \mathbf{w} + \epsilon_n, where \epsilon_n is a random variable (e.g. Gaussian) that is iid across n.

Log-likelihood: \mathcal{L}_{LL}(\mathbf{w}) := \log p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \sum_{n=1}^N \log p(y_n|\mathbf{x}_n, \mathbf{w}). For Gaussian noise, maximising the log-likelihood is equivalent to minimising the MSE.

6. Regularisation

Penalise complex models: \min_\mathbf{w} \mathcal{L}(\mathbf{w}) + \Omega(\mathbf{w})

  • L_2-regularisation (ridge): \Omega(\mathbf{w})=\lambda||\mathbf{w}||_2^2
    • Gradient is 2\lambda\mathbf{w}
    • Keeps the model small in magnitude
  • L_1-regularisation (lasso): \Omega(\mathbf{w})=\lambda||\mathbf{w}||_1
    • Subgradient is \lambda\,\text{sign}(\mathbf{w})
    • Makes the model sparse
  • L_0-regularisation: \Omega(\mathbf{w})=\#(\mathbf{w} \neq 0) (number of non-zero weights; non-convex)
  • Related techniques: shrinkage, dropout, weight decay, early stopping

7. Generalisation, model selection and validation

Generalisation gap: how far is the test error from the true error?

Given K different models f_k, k=1,\dots,K, an iid test set S_\text{test}\sim \mathcal{D} (drawn from the true data distribution) and a loss \mathcal{L} \in [a, b]:

\mathbb{P} \left[ \max_k|\mathcal{L}_\mathcal{D}(f_k)-\mathcal{L}_{S_\text{test}}(f_k)| \geq \sqrt{\frac{(b-a)^2\ln(\frac{2K}{\delta})}{2|S_\text{test}|}} \right] \leq \delta

To remove the absolute value, put a factor 2 in front of the square root and compare f_{\hat k} (smallest empirical risk) with f_{k^*} (smallest true risk).

Hoeffding inequality, \forall \epsilon \geq 0, where the \Theta_n are the (iid, bounded) loss values:

\mathbb{P} \left[ \left|\frac{1}{N}\sum_{n=1}^N \Theta_n - \mathbb{E}[\Theta]\right| \geq \epsilon \right] \leq 2e^{\frac{-2N\epsilon^2}{(b-a)^2}}

Hoeffding lemma, \forall s \geq 0 and RV X \in [a,b] with \mathbb{E}[X] = 0:

\mathbb{E} \left[ e^{sX} \right] \leq e^{\frac{1}{8}s^2(b-a)^2}

K-fold cross validation returns an unbiased estimate of the generalisation error and its variance (sketch below)
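
A minimal sketch of K-fold cross-validation; `train_fn` and `loss_fn` are placeholder callables for any model/loss pair:

```python
import numpy as np

def k_fold_cv(X, y, train_fn, loss_fn, K=5, seed=1):
    """Estimate the generalisation error by averaging the K held-out losses."""
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train_fn(X[train], y[train])
        losses.append(loss_fn(model, X[test], y[test]))
    return np.mean(losses), np.var(losses)
```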

8. Bias-variance decomposition

Bias and variance of the prediction, considering that the training sample S is a random variable

Noise → Strict lower bound, as independently random and impossible to predict (1st term)

Bias → how far off in general the model’s predictions are from the correct value (2nd term)

Variance → How much the predictions for a given point vary between realisations of the training set (3rd term)

\mathbb{E}_{S\sim\mathcal{D}, \epsilon\sim\mathcal{D}_\epsilon}[(f(x_0)+\epsilon - f_S(x_0))^2] = \text{Var}_{\epsilon\sim\mathcal{D}_\epsilon}[\epsilon] + (f(x_0) - \mathbb{E}_{S'\sim\mathcal{D}}[f_{S'}(x_0)])^2 + \mathbb{E}_{S\sim\mathcal{D}}[(f_S(x_0) - \mathbb{E}_{S'\sim\mathcal{D}}[f_{S'}(x_0)])^2]

9. Classification

The optimal classification performance is achieved by the Bayes classifier g_* = \arg\min_g \mathcal{L}_\mathcal{D}(g)

g_*(x) = \arg\max_{y \in \{ -1, 1\}} \mathbb{P}(Y=y|X=x)

Loss functions

  • 0-1 loss → \mathcal{L}(y, y')=1_{y \neq y'}, which is 1 if y \neq y' and 0 if y = y'
  • True risk for classification, for a predictor g: \mathcal{L}_\mathcal{D}(g)=\mathbb{E}_\mathcal{D}[1_{Y \neq g(X)}] = \mathbb{P}_\mathcal{D}[Y \neq g(X)]
  • Convex and continuous surrogate losses (\eta = yx^\top w):
    • quadratic loss := (1-\eta)^2 → symmetric, but only sensible for \eta \in [-\infty, 2]
    • hinge loss := [1-\eta]_+ → penalty on \eta \in [-\infty, 1] (prediction wrong or not confident)
    • logistic loss := \frac{\ln(1+e^{-\eta})}{\ln(2)} → always penalising

Non-parametric

Approximate the conditional distribution \mathbb{P}(Y=y|X=x) via local averaging (KNN)

Parametric

Approximate true distribution D\mathcal{D} via training data → minimise empirical risk (ERM)

Instead of learning a function g: X\rightarrow \{-1, 1\}, we learn a continuous function h and predict with g(x) = \text{sign}(h(x)). We replace the 0-1 loss by a convex and continuous surrogate \phi:

\min_{h\in \mathcal{H}} \frac{1}{N} \sum_{n=1}^N \phi(y_n h(x_n))

In the over-parameterised regime (n \ll d), the training data is fit well by a regressor → a good regressor can be used as a classifier

10. Logistic regression

Logistic function \sigma(\eta):=\frac{e^\eta}{1+e^\eta}. We have 1-\sigma(\eta) = \frac{1}{1+e^\eta} and \sigma'(\eta)=\sigma(\eta)(1-\sigma(\eta))

Robust against outliers and unbalanced data. If \sigma(x)>\frac{1}{2}, then \sigma(ax)>\frac{1}{2} for any a > 0

Optimality:

  • linear model, y \in \{0, 1\}: w_* = \arg\min_w \mathcal{L} := \frac{1}{N}\sum_{n=1}^N -y_n x_n^\top w + \log (1+e^{x_n^\top w})
  • generic h, y \in \{0, 1\}: \mathcal{L}(y, h(x)) = -y h(x) + \log (1+e^{h(x)})
  • generic h, y \in \{-1, 1\}: \mathcal{L}(y, h(x)) = \log (1+e^{-y h(x)})

Gradient:

  • the linear problem is convex and \nabla \mathcal{L}(w)=\frac{1}{N} \mathbf{X}^\top(\sigma(\mathbf{X}w)-\mathbf{y})

Hessian:

  • \nabla^2 \mathcal{L}(w)=\frac{1}{N} \mathbf{X}^\top S \mathbf{X}, where S=\text{diag}[\sigma(x_n^\top w)(1-\sigma(x_n^\top w))] \geq 0
  • The Hessian is psd → the loss is convex
  • Newton's method: \mathbf{w}^{(t+1)}:=\mathbf{w}^{(t)}-\gamma^{(t)} \nabla^2 \mathcal{L}(\mathbf{w}^{(t)})^{-1}\nabla \mathcal{L}(\mathbf{w}^{(t)})
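
A gradient-descent sketch of the y \in \{0, 1\} formulation above; Newton's method would instead precondition the step with the inverse Hessian. The optional `lambda_` term is the L_2 regulariser mentioned below:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_gd(X, y, gamma=0.1, max_iters=1000, lambda_=0.0):
    """Minimise 1/N * sum(-y_n x_n^T w + log(1 + exp(x_n^T w))) (+ optional L2 term)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(max_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / N + 2 * lambda_ * w
        w = w - gamma * grad
    return w
```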

When the data is linearly separable, the weights diverge (||w|| \rightarrow \infty) even though all points are correctly classified. Solution: add L_2 regularisation

11. Support vector machines (SVM)

We consider y \in \{-1, 1\}. Margin of a hyperplane := \min_{n \leq N} |w^\top x_n|

Hard SVM - 3 equivalent formulations

  1. \max_{w, ||w||=1} \min_{n \leq N} |w^\top x_n| such that \forall n,\ y_n x_n^\top w \geq 0
  2. \max_{M \in \R,\ w,\ ||w||=1} M such that \forall n,\ y_n x_n^\top w \geq M
  3. \min_w \frac{1}{2}||w||^2 such that \forall n,\ y_n x_n^\top w \geq 1 → M = \frac{1}{||w||}

Soft SVM - non linearly separable data

Maximise margin while allowing some constraints to be violated

\min_w \frac{\lambda}{2}||w||^2 + \frac{1}{N}\sum_{n=1}^N[1-y_nx_n^\top w]_+

[z]_+ = \max(0, z)=\max_{\alpha \in [0,1]} \alpha z. Continuous but non-smooth → use subgradients

Dual formulation

Define a function G(w, \alpha) such that \min_w \mathcal{L}=\min_w \max_\alpha G(w, \alpha). G is convex in w and concave in \alpha.

Primal problem: minwmaxαG(w,α)\min_w \max_\alpha G(w, \alpha)

Dual problem: maxαminwG(w,α)\max_\alpha \min_w G(w, \alpha)

\min_w \mathcal{L}(w) = \max_{\alpha \in [0,1]^N} \alpha^\top \mathbf{1} - \frac{1}{2 \lambda N} \alpha^\top \mathbf{Y}\mathbf{X}\mathbf{X}^\top\mathbf{Y}\alpha

where w(\alpha)=\frac{1}{\lambda N} \mathbf{X}^\top\mathbf{Y}\alpha and \mathbf{Y} = \text{diag}(\mathbf{y}). \mathbf{Y}\mathbf{X}\mathbf{X}^\top\mathbf{Y} is psd and depends only on the kernel matrix.

  • \alpha_n = 0 if x_n is on the correct side, outside the margin (1-y_nx_n^\top w < 0)
  • \alpha_n \in [0,1] if x_n is on the correct side, on the margin (1-y_nx_n^\top w = 0)
  • \alpha_n = 1 if x_n is inside the margin or on the wrong side (1-y_nx_n^\top w > 0)
  • Points with \alpha_n > 0 are called support vectors (the model depends only on the support vectors):
w=\frac{1}{\lambda N} \sum_{n=1}^N\alpha_n y_n x_n
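
A stochastic subgradient sketch of the soft-SVM primal above (labels in \{-1, +1\}); the 1/(\lambda t) step size is an illustrative Pegasos-style choice, not prescribed by the course:

```python
import numpy as np

def soft_svm_subgradient(X, y, lambda_=0.1, max_iters=10000, seed=0):
    """Minimise lambda/2 ||w||^2 + 1/N * sum [1 - y_n x_n^T w]_+ one sample at a time."""
    N, D = X.shape
    w = np.zeros(D)
    rng = np.random.default_rng(seed)
    for t in range(1, max_iters + 1):
        n = rng.integers(N)
        margin = y[n] * X[n] @ w
        # subgradient of the hinge term: -y_n x_n if the margin constraint is violated, else 0
        g = lambda_ * w - (y[n] * X[n] if margin < 1 else 0.0)
        w = w - g / (lambda_ * t)                  # decaying step size
    return w
```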

12. Nearest Neighbour Classifiers

KNN can be used for:

  • regression: for a point x, find the K closest training points (the neighbourhood) with labels y_k → the estimate is f_K(x)=\frac{1}{K} \sum_{k=1}^K y_k
  • classification: get the K closest neighbours of x and count how many belong to class a or b → assign x to the majority class in its neighbourhood

Small K → complex decision boundary → low bias, high variance (overfitting)

Large K → when K=N the prediction is constant → high bias, low variance
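
A brute-force sketch of both uses of KNN (Euclidean distances; majority vote for classification, mean for regression):

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=5, classify=True):
    """Find the K nearest training points to x and aggregate their labels."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:K]]
    if classify:
        values, counts = np.unique(nearest, return_counts=True)
        return values[np.argmax(counts)]           # majority vote
    return nearest.mean()                          # neighbourhood average (regression)
```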

Curse of dimensionality: for N i.i.d. points uniform in [0,1]^d, \mathbb{P}(\exist x_i \in \square^d) = 1-(1- r^d)^N, where \square^d is a sub-cube of edge length r

Generalisation bound for 1-NN: \mathbf{X} \times \mathbf{Y} = [0,1]^d \times \{0,1\}

  • The Bayes classifier minimises \mathcal{L} over all classifiers: f_*(x)= 1_{\eta(x)\geq\frac{1}{2}}, where \eta(x)=\mathbb{P}(Y=1|X=x)
  • Bayes risk: \mathcal{L}(f_*)=\mathbb{P}(f_*(X)\neq Y)=\mathbb{E}_{X\sim \mathcal{D}_X}[\min\{\eta(X), 1-\eta(X)\}]
  • Assumption: \exist c \geq 0 such that \forall x, x' \in X: |\eta(x)-\eta(x')| \leq c||x-x'||_2 (Lipschitz with constant c)
    → nearby points are likely to share the same label
  • Claim: \mathbb{E}_{S_{train}}[\mathcal{L}(f_{S_{train}})] \leq 2 \mathcal{L}(f_*) + 4c \sqrt{d}\,N^{-\frac{1}{d+1}}
  • To achieve constant error we need N \propto d^{\frac{d+1}{2}}
  • \mathbb{P}(Y'\neq Y) \leq 2\min\{\eta(x), 1- \eta(x)\}+c||x-x'||
  • Let p_k = \mathbb{P}(X \in \text{Box}_k); then a sample X has probability 1-(1-p_k)^N of having a neighbour in S_{train} inside its box, at distance \leq \sqrt{d}\epsilon (\epsilon is the box edge length), and probability (1-p_k)^N of having no neighbour in the box (closest neighbour then at distance \leq \sqrt{d})
  • \mathbb{E}[||X-\text{nbh}(X)||] \leq \sum_k p_k[(1-p_k)^N \sqrt{d} + (1-(1-p_k)^N) \sqrt{d}\epsilon]
  • Local averaging methods aim to approximate the Bayes predictor directly by approximating the conditional distribution \hat p(y|x). For N \rightarrow \infty, 1-NN is competitive with the Bayes classifier

13. Kernel trick

Covariance matrix \mathbf{X}^\top \mathbf{X}\in \R^{D\times D}, kernel matrix \mathbf{X}\mathbf{X}^\top\in \R^{N\times N}

→ Ridge regression \mathbf{w}^*=\frac{1}{N}\mathbf{X}^\top (\frac{1}{N} \mathbf{X} \mathbf{X}^\top+\lambda\mathbf{I}_N)^{-1} \mathbf{y}, complexity \mathcal{O}(DN^2+N^3)

Representer theorem: for any loss \mathcal{L} and any increasing regularisation term R(w), there exists \alpha_* \in \R^N such that

w_* = \mathbf{X}^\top \alpha_* \in \arg\min_w \frac{1}{N}\sum_{n=1}^N \mathcal{L}(x_n^\top w, y_n) + R(w)

Kernel trick: kernel function \kappa(x, x') = \phi(x)^\top \phi(x') → computation of linear classifiers in a high-dimensional feature space \phi(x_n) \in \R^{\tilde d} without ever computing in that space directly. Prediction with the kernel: y=\phi(x)^\top w_* = \sum_{n=1}^N \kappa(x, x_n)\alpha_{*n} (non-linear in X space but linear in the feature space \phi(X))

  1. Linear kernel: \kappa(x, x') = x^\top x' \rightarrow \phi(x)=x
  2. Quadratic kernel: \kappa(x, x') = (x x')^2 \rightarrow \phi(x)=x^2 (for x, x' \in \R)
  3. Polynomial kernel: \kappa(x, x') = (x_1 x_1'+x_2 x_2'+x_3 x_3')^2

     \rightarrow \phi(x)=[x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3] \in \R^6 (for x, x' \in \R^3)
  4. Radial basis function (RBF) kernel:
    1. x, x' \in \R^d: \kappa(x, x') = e^{-(x-x')^\top(x-x')}
    2. x, x' \in \R: \kappa(x, x') = e^{-(x-x')^2} \rightarrow \phi(x)=e^{-x^2}(..., \frac{2^{\frac{k}{2}}x^k}{\sqrt{k!}},...) for 0\leq k<\infty
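
A kernel ridge regression sketch built on the representer theorem above, using the RBF kernel as defined here; the exact scaling of the regulariser (N\lambda vs 2N\lambda) depends on the loss normalisation, so treat it as illustrative:

```python
import numpy as np

def rbf_kernel(X1, X2):
    """K[i, j] = exp(-||x_i - x_j'||^2)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq)

def kernel_ridge_fit(X, y, lambda_, kernel=rbf_kernel):
    K = kernel(X, X)                               # N x N kernel matrix
    N = len(y)
    alpha = np.linalg.solve(K + N * lambda_ * np.eye(N), y)
    return alpha

def kernel_ridge_predict(X_train, alpha, X_new, kernel=rbf_kernel):
    return kernel(X_new, X_train) @ alpha          # y = sum_n kappa(x, x_n) alpha_n
```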

Building new kernels from existing ones:

  • \kappa(x, x') = \alpha \kappa_1(x, x') + \beta \kappa_2(x, x') is a kernel for \alpha, \beta \geq 0
  • \kappa(x, x') = \kappa_1(x, x') \kappa_2(x, x') is a kernel

Mercer's condition: \exist \phi(x) such that \kappa(x, x') = \phi(x)^\top \phi(x') iff

  1. the kernel function is symmetric: \kappa(x, x') = \kappa(x', x) \ \forall x, x'
  2. the kernel matrix \mathbf{K} with K_{ij}=\kappa(x_i, x_j) is psd for any set of points

14. Neural Networks

Learn non-linear function fNN(x)f_{NN}(x) to transform xx into a good feature representation for predictions.

Multi-layer perceptron (MLP) structure

Network with L hidden layers of K neurons each. The output of hidden layer l is

x^{(l)}=f^{(l)}(x^{(l-1)}):=\phi((\mathbf{W}^{(l)})^\top x^{(l-1)}+b^{(l)}), where the weights \mathbf{W}^{(l)} and biases b^{(l)} are learnable → \mathcal{O}(K^2L) learnable parameters. The global function y=f(x^{(0)}) is the composition f=f^{(L+1)} \circ f^{(L)} \circ ... \circ f^{(1)}

  • Inference: h(x) = f(x)^\top w^{(L+1)}+b^{(L+1)}
    • Regression → h(x)
    • Binary classification with y\in \{-1,1\} → \text{sign}(h(x))
    • Multi-class classification with y\in \{1,...,K\} → \arg\max_{c\in \{1,...,K\}}h(x)_c
  • Training
    • Regression → \mathcal{L}(y,h(x))=(h(x)-y)^2
    • Binary classification → \mathcal{L}(y,h(x))=\ln(1+\exp(-yh(x)))
    • Multi-class classification → \mathcal{L}(y,h(x))=-\ln(\frac{e^{h(x)_y}}{\sum_{i=1}^K e^{h(x)_i}})
  • Activation functions: sigmoid \phi(x) = \sigma(x), ReLU \phi(x) = [x]_+ = \max\{0, x\}, GeLU \phi(x) \approx x\cdot\sigma(1.702x)

Representation power

  • Every sufficiently smooth function can be approximated by a one-hidden-layer NN (in l_2 norm):

    \int_{|x|\leq r} (f(x)-f_n(x))^2dx\leq\frac{(2Cr)^2}{n}, where C is a smoothness constant (the smaller C, the smoother f)
  • A NN with sigmoid activation and at most two hidden layers can approximate a smooth function well in l_1-norm:
    1. Approximate the integral of the function in the Riemann sense by a sum of k rectangles
    2. Represent each rectangle using two nodes in the hidden layer of a neural network:

      h(\phi(w(x-a))-\phi(w(x-b)))
    3. Compute the sum of all hidden nodes (with appropriate weights and signs) to get the final output → a NN with one hidden layer containing 2k nodes for a Riemann sum with k rectangles
  • l_\infty approximation result: f continuous on [c,d]. \forall \epsilon \geq0, there exists a piecewise linear (pwl) continuous q such that \sup_{x\in[c,d]} |f(x)-q(x)| \leq \epsilon
    1. A pwl q can be written as a combination of ReLUs: q(x)=\tilde a_1x+\tilde b_1 + \sum_{i=2}^m \tilde a_i(x-\tilde b_i)_+
    2. q can therefore be implemented as a one-hidden-layer NN with ReLU activation

Training (SGD) - Backpropagation

Searching \min_{w_{i,j}^{(l)}, b_i^{(l)}} \mathcal{L}(f) using SGD → non-convex

Forward pass: \mathcal{O}(K^2L)

  • x^{(0)}=x_n \in \R^d
  • z^{(l)}=(\mathbf{W}^{(l)})^\top x^{(l-1)}+b^{(l)}
  • x^{(l)} = \phi(z^{(l)})

Backward pass: \mathcal{O}(K^2L)

  • \delta^{(L+1)} = z^{(L+1)}-y_n
  • \delta^{(l)}=(\mathbf{W}^{(l+1)}\delta^{(l+1)}) \odot \phi'(z^{(l)})

Derivatives:

  • \frac{\partial \mathcal{L}_n}{\partial w_{i,j}^{(l)}} = \delta_j^{(l)}x_i^{(l-1)}
  • \frac{\partial \mathcal{L}_n}{\partial b_j^{(l)}} = \delta_j^{(l)}
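
A sketch of one backpropagation step for a single hidden layer with a squared-error output, following the (\mathbf{W}^{(l)})^\top x convention above; shapes and the learning rate `gamma` are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def mlp_backprop_step(x, y, W1, b1, W2, b2, gamma=0.01):
    """One SGD step on L = 1/2 (h(x) - y)^2 for a 1-hidden-layer MLP (W1: d x K, W2: K x 1)."""
    # forward pass
    z1 = W1.T @ x + b1
    x1 = relu(z1)
    z2 = W2.T @ x1 + b2                            # output h(x)
    # backward pass
    delta2 = z2 - y                                # delta^(L+1) = z^(L+1) - y
    delta1 = (W2 @ delta2) * relu_prime(z1)        # delta^(l) = (W^(l+1) delta^(l+1)) ⊙ phi'(z^(l))
    # gradients: dL/dw_{i,j}^(l) = delta_j^(l) x_i^(l-1), dL/db_j^(l) = delta_j^(l)
    W2 -= gamma * np.outer(x1, delta2)
    b2 -= gamma * delta2
    W1 -= gamma * np.outer(x, delta1)
    b1 -= gamma * delta1
    return W1, b1, W2, b2
```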

Parameter initialisation: vanishing/exploding gradients → He initialisation. For ReLU networks, initialise the weights as \mathcal{N}(0, \frac{2}{K}\mathbf{I}_K) (i.e. standard deviation \sqrt{2/K})

Normalisation layers: dynamically stabilise training process → faster convergence

  • Batch normalisation: \bar z_n^{(l)}= \frac{z_n^{(l)}-\mu_B^{(l)}}{\sqrt{(\sigma_B^{(l)})^2+\epsilon}} (\epsilon \approx 0 for numerical stability)
    Learnable parameters to reverse the normalisation: \hat z_n^{(l)}=\gamma^{(l)} \odot \bar z_n^{(l)}+ \beta^{(l)}

    The prediction should not depend on the batch → estimate \hat \mu = \mathbb{E}[\mu] and \hat \sigma = \mathbb{E}[\sigma] and use them at inference. Requires sufficiently large batches for good approximations

  • Layer normalisation: same, but computing \mu and \sigma over the K dimensions of each sample
    Batch independent → the same normalisation is used for training and inference

Convolutional nets

Image size W\times H. Convolution x_{n,m}^{(1)}=\sum_{k,l}f_{k,l}\cdot x_{n-k,m-l}^{(0)}, where f is a local filter with learnable weights. x_{n,m}^{(1)} only depends on values of x^{(0)} close to (n,m) → sparsely connected

  • Padding: for the borders, use zero padding or valid padding (reduces dimensionality)
  • Multiple channels: possible to use C channels (filters) on one input
  • Pooling: down-sampling
    • max pooling: returns the max value of the feature portion covered by the kernel
    • average pooling: returns the average value of the feature portion covered by the kernel
  • Hyperparameters: pooling/convolution size/type/stride. Generally W,H \searrow and C \nearrow
  • Weight sharing: back-propagate as if the weights were not shared, then sum the gradients of the edges sharing a weight

Regularisation

Residual networks: skip connections around some layers, \mathbf{Y}=R(\mathbf{X})+\mathbf{X} → lower training loss

Data augmentation: generate new data with the same labels by corrupting existing data → robustness

Weight decay: l_2-regularise the weights without regularising the biases. No direct regularisation effect (scale invariance), but the training dynamics differ

Dropout: at each training step, retain the nodes of layer (l) with probability p^{(l)} and scale by p^{(l)} at inference (where all nodes are used). The variance is not preserved! → works poorly with normalisation

15. Transformers

A transformer is a function f: \text{sequence} \rightarrow \text{sequence}. Applicable across any modality (everything can be a token), good for long-range dependencies in text; self-attention scales quadratically but is highly parallelisable

Input Transformations

Input \rightarrow tokens \in \R^D

  • for each word in a text → token ID → vector in \R^D
  • for each patch in an image → flatten \in \R^{F} → multiply by an embedding matrix \mathbf{W} \in \R^{F\times D}

Transformer block

tokens \in \R^D \rightarrow tokens \in \R^D

  • Attention: A: tokens \rightarrow tokens
    • Query tokens Q \in \R^{T_{out}\times D_K}, key tokens K \in \R^{T_{in}\times D_K}
    • mixes information between tokens ~ a weighted average with weights p_{i,j} based on how similar q_i and k_j are: P=\text{softmax}\left(\frac{QK^\top}{\sqrt{D_K}}\right), where \text{softmax}(\mathbf{x})_i=\frac{e^{x_i}}{\sum_j e^{x_j}}
  • Self-attention: T=T_{in}=T_{out}, X\in \R^{T \times D}
    • Queries Q = XW_Q \in \R^{T\times D_K}, keys K = XW_K\in \R^{T\times D_K}, values V = XW_V\in \R^{T\times D}
    • Output Z=\text{softmax}\left(\frac{X W_Q W_K^\top X^\top}{\sqrt{D_K}}\right) X W_V = \text{softmax}\left(\frac{QK^\top}{\sqrt{D_K}}\right) V
    • Run H self-attention heads in parallel and linearly combine their outputs
  • MLP: mixes information within tokens, \text{MLP}(X)=\varphi(XW_1)W_2 with learned W_1, W_2
  • Other blocks: layer normalisation, skip connections + add positional embeddings!
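
A single-head self-attention sketch matching the formulas above (X is T×D, W_Q and W_K are D×D_K, W_V is D×D; all inputs are assumed given):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)        # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Z = softmax(Q K^T / sqrt(D_K)) V with Q = X W_Q, K = X W_K, V = X W_V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    D_K = Q.shape[-1]
    P = softmax(Q @ K.T / np.sqrt(D_K), axis=-1)   # T x T attention weights (rows sum to 1)
    return P @ V                                   # T x D output tokens
```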

Output Transformations

tokens \in \R^D \rightarrow output

  • Simple: linear layer or small MLP, task dependent (classification/multiple outputs)

16. Adversarial machine learning

Adversarial risk of a classifier f: R_\epsilon(f)=\mathbb{E}_\mathcal{D}\left[ \max_{\hat x, ||\hat x- X|| \leq \epsilon} 1_{f(\hat x)\neq Y} \right] → optimise the input x to achieve maximum error (easy if the point is already misclassified, otherwise we need to optimise). Use a smooth classification loss \ell instead of the 0-1 loss: the output before classification is g, classify with \text{sign}(g(x)). The objective then becomes \max_{\hat x, ||\hat x- X|| \leq \epsilon} \ell(yg(\hat x)) \Leftrightarrow \max_{||\delta|| \leq \epsilon} \nabla_x \ell(yg(x))^\top \delta, because g(x+\delta) \approx g(x) + \delta^\top\nabla_x g(x) (first-order Taylor expansion)

White box attacks: the model g is known

  • compute the gradient \nabla_x \ell during back-propagation to move in the direction opposite to the correct classification
  • one-step attack:
    • \ell_2 norm: \hat x = x-\epsilon y \frac{\nabla_x g(x)}{||\nabla_x g(x)||_2}
    • \ell_\infty norm: \hat x = x-\epsilon y \cdot \text{sign}(\nabla_x g(x))
  • multi-step attack: Projected Gradient Descent (PGD): iteratively update \delta and project back onto the feasible set ||\delta||\leq \epsilon
    • \ell_2 norm: \delta^{t+1} = \Pi_{B_2(\epsilon)} \left[\delta^t+\alpha \frac{\nabla \tilde \ell (x+ \delta^t)}{||\nabla \tilde \ell (x+ \delta^t)||_2} \right] with \Pi_{B_2(\epsilon)}(\delta) = \begin{cases}\epsilon \cdot \delta/||\delta||_2 & \text{if } ||\delta||_2 \geq \epsilon \\ \delta & \text{otherwise}\end{cases}
    • \ell_\infty norm: \delta^{t+1} = \Pi_{B_\infty(\epsilon)} \left[\delta^t+\alpha \cdot \text{sign}( \nabla \tilde \ell (x+ \delta^t)) \right] with \Pi_{B_\infty(\epsilon)}(\delta)_i = \begin{cases}\epsilon \cdot \text{sign}(\delta_i) & \text{if } |\delta_i| \geq \epsilon \\ \delta_i & \text{otherwise}\end{cases}
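
A sketch of the one-step and multi-step \ell_\infty attacks above; `grad_fn` is a placeholder returning the relevant gradient (\nabla_x g for the one-step attack, \nabla \tilde\ell for PGD), and y \in \{-1, +1\}:

```python
import numpy as np

def one_step_linf(x, y, grad_fn, eps):
    """One-step l_inf attack: move against the correct classification direction."""
    return x - eps * y * np.sign(grad_fn(x))

def pgd_linf(x, grad_fn, eps, alpha, steps=10):
    """Iteratively ascend the surrogate loss, projecting delta back onto ||delta||_inf <= eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta = delta + alpha * np.sign(grad_fn(x + delta))
        delta = np.clip(delta, -eps, eps)          # projection onto the l_inf ball
    return x + delta
```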

Black box attacks: model gg unknown

  • score-based: we can query the model scores g(x) → approximate the gradient using finite differences
  • decision-based: we can query only the prediction f(x)
  • transfer attacks: train \hat f \approx f on similar data → transfer a white box attack on \hat f to f. Query with unlabelled data to obtain \{x_n, f(x_n)\} → model stealing

Create robust models:

  • Minimise the adversarial risk → train the model on the best adversarial examples \hat x_n^*
  • Increased computational cost + a robustness/accuracy tradeoff: non-robust features may be more accurate but can have high adversarial risk, while robust features give lower accuracy but better resistance to attacks

17. Ethics and fairness

No fairness through unawareness! Minorities may be underrepresented → high error rates for minorities (because we aim to avoid overfitting).

Consider data X, outcome variable Y and a score function R=r(X) that allows us to make classification decisions D=1_{R>t}. Furthermore, let A be a RV that encodes membership status in a protected class.

  • Y=0: D=0 → true negative (probability 1-\alpha); D=1 → false positive, type I error (probability \alpha)
  • Y=1: D=0 → false negative, type II error (probability \beta); D=1 → true positive (probability 1-\beta)

  • \text{TPR} = \mathbb{P}(D=1|Y=1)
  • \text{FPR} = \mathbb{P}(D=1|Y=0)
  • \text{FNR} = \mathbb{P}(D=0|Y=1)
  • \text{TNR} = \mathbb{P}(D=0|Y=0)

Fairness criteria: equalise statistical quantities between two groups a, b \in A

  • Independence: R \perp A
    • The acceptance rate (decision D) does not depend on the group A
    • \mathbb{P}(D=1|A=a)=\mathbb{P}(D=1|A=b)
    • Counts both true and false positive decisions together, even though they should not be compared!
  • Separation: R \perp A \mid Y
    • Post-hoc criterion, but compares by label
    • \text{FPR}(a) = \mathbb{P}(D=1|Y=0, A=a) = \text{FPR}(b)
    • \text{FNR}(a) = \text{FNR}(b)
  • Sufficiency: Y \perp A \mid R
    • Meaning: to predict Y we do not need to know A if we have R
    • \mathbb{P}(Y=1|R=r, A=a) = \mathbb{P}(Y=1|R=r, A=b)
    • Calibration by group, i.e. \mathbb{P}(Y=1|R=r, A=a) = r, implies sufficiency
  • Any two of these criteria are mutually exclusive!
    • Independence vs. sufficiency: A \perp R and A \perp Y|R \Rightarrow A \perp (Y,R) \Rightarrow A \perp Y
    • Independence vs. separation: A \perp R and A \perp R|Y \Rightarrow A \perp Y or R \perp Y
    • Separation vs. sufficiency: A \perp R|Y and A \perp Y|R \Rightarrow A \perp (R,Y)

18. Unsupervised learning

Representation learning (encoder x \mapsto \phi(x))

Density estimation & generative models (decoder \phi(x) \mapsto x)

19. K-Means Clustering, Gaussian Mixture Models (GMM)

K-Means Clustering

Objective:

  • For all N data vectors x_n \in \R^D: find cluster means \mu_1, ..., \mu_K \in \R^D and cluster assignments z_{nk} = \begin{cases} 1 & \text{if } x_n \in \text{cluster } k \\ 0 & \text{if } x_n \notin \text{cluster } k \end{cases}
  • Assuming K is known, we search \min_{z, \mu} \mathcal{L}(z, \mu) = \sum_{n=1}^N \sum_{k=1}^K z_{nk} ||x_n-\mu_k||^2_2

Algorithm: initialise \mu_k \ \forall k, then

  1. For all n compute z_n given \mu (cost \mathcal{O}(NKD)) → z^{(t+1)}:=\arg\min_z \mathcal{L}(z, \mu^{(t)})

    z_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j ||x_n-\mu_j||^2_2 \\ 0 & \text{otherwise}\end{cases}
  2. For all k compute the group means \mu_k given z (cost \mathcal{O}(NKD))

    \mu^{(t+1)}:=\arg\min_\mu \mathcal{L}(z^{(t)}, \mu) with \mu_k = \frac{\sum_n z_{nk}x_n}{\sum_n z_{nk}}

  3. Repeat until there is no more change in the assignments → no more change of the \mu_k

→ Convergence to a local optimum is assured since each step decreases the cost
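
A sketch of the alternating minimisation above (means initialised from random data points; `K` assumed known):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]      # initialise means from the data
    for _ in range(max_iters):
        # step 1: assign each point to its closest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)
        z = np.argmin(dists, axis=1)
        # step 2: recompute each mean as the average of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                         # assignments stopped changing
            break
        mu = new_mu
    return mu, z
```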



K-means as matrix factorisation:

\min_{z, \mu} \mathcal{L}(z, \mu) = \sum_{n=1}^N \sum_{k=1}^K z_{nk} ||x_n-\mu_k||^2_2 = ||\mathbf{X}^\top - \mathbf{M}\mathbf{Z}^\top||^2_F

The matrix \mathbf{M} \in \R^{D\times K} contains the K mean vectors \mu_k and \mathbf{Z}^\top \in \R^{K\times N} contains the N assignment vectors. The objective is convex in \mathbf{M} and in \mathbf{Z} separately, but not jointly convex.

Frobenius norm: ||A||_F = \sqrt{\sum_m \sum_n |a_{mn}|^2} = \sqrt{\text{tr}(A^*A)}

Gaussian Mixture Models

  • Elliptical clusters and soft assignments
  • Bayes' law: p(a, b)=p(a|b)p(b)
  • Multivariate normal distribution:

    f(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma}) = \frac{1}{(2\pi)^{k/2}|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})\right)
  • Added \Sigma \in \R^{D^2 \times K} (covariance matrices) and \pi \in \R^K (cluster probabilities) such that p(z_n = k) = \pi_k, where \pi_k > 0 for all k and \sum_{k=1}^{K} \pi_k =1

    New parameters: \theta = \{\mu_1, ..., \mu_K, \Sigma_1, ..., \Sigma_K, \pi\}

  • marginal likelihood p(x_n|\theta)=\sum_{k=1}^K\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k) → \mathcal{O}(D^2K) parameters
  • We search \theta^* = \arg\max_\theta \sum_{n=1}^N\log\sum_{k=1}^K \pi_k\mathcal{N}(x_n|\mu_k, \Sigma_k)
    • Not convex, not identifiable (permutations), not bounded (\sigma \rightarrow 0)

Expectation-Maximisation (EM) algorithm

  • In short: \theta^{(t+1)}:= \arg\max_\theta \sum_{n=1}^N \mathbb{E}_{p(z_n|x_n, \theta^{(t)})}[\log p(x_n, z_n|\theta)]
  • Expectation step: find a lower bound \underline{\mathcal{L}} such that \mathcal{L}(\theta) \geq \underline{\mathcal{L}}(\theta, \theta^{(t)}) and \mathcal{L}(\theta^{(t)}) = \underline{\mathcal{L}}(\theta^{(t)}, \theta^{(t)})
    • Concavity of log → Jensen's inequality: \log \left(\sum_{k=1}^K q_k r_k\right) \geq \sum_{k=1}^K q_k\log r_k for r_k >0 and \sum_k q_k = 1
    • Apply Jensen's inequality with q_k = \frac{\pi_k^{(t)} \mathcal{N}(x_n|\mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{k=1}^K\pi_k^{(t)} \mathcal{N}(x_n|\mu_k^{(t)}, \Sigma_k^{(t)})} and r_k = \frac{\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}{q_k}
  • Maximisation step:
    • \theta^{(t+1)}=\arg\max_\theta \underline{\mathcal{L}}(\theta, \theta^{(t)})
    • \mu_k^{(t+1)}:=\frac{\sum_n q_{kn}^{(t)} x_n}{\sum_n q_{kn}^{(t)}}
    • \Sigma_k^{(t+1)}:=\frac{\sum_n q_{kn}^{(t)} (x_n-\mu_k^{(t+1)})(x_n-\mu_k^{(t+1)})^\top}{\sum_n q_{kn}^{(t)}}
    • \pi_k^{(t+1)}:=\frac{1}{N}\sum_n q_{kn}^{(t)}
  • Posterior distribution:
    • p(x_n, z_n|\theta) = p(x_n| z_n,\theta)p(z_n|\theta) = p(z_n | x_n,\theta)p(x_n|\theta)
    • \text{joint} = \text{likelihood} \cdot \text{prior} = \text{posterior} \cdot \text{marginal likelihood}
    • \text{joint} = \mathcal{N}(x_n|\mu_k, \Sigma_k) \cdot \pi_k = q_{kn} \cdot \sum_{k=1}^K\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)
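
A compact EM sketch for a GMM using the update rules above; the initialisation and the small diagonal jitter on \Sigma_k are illustrative choices for numerical stability, not part of the derivation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, size=K, replace=False)]            # initial means from the data
    Sigma = np.stack([np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(max_iters):
        # E-step: responsibilities q_kn = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        q = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
        q /= q.sum(axis=0, keepdims=True)
        # M-step: closed-form updates of mu, Sigma, pi
        Nk = q.sum(axis=1)
        mu = (q @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (q[k][:, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma
```
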
20. Matrix factorisations

Aim to find \mathbf{W} \in \R^{D\times K} (e.g. movies) and \mathbf{Z}^\top \in \R^{K\times N} (e.g. users) such that \mathbf{X}\approx \mathbf{WZ}^\top, where each user and each movie is described by a vector in \R^K and x_{dn} with (d,n)\in\Omega contains the existing rating of user n for movie d (K\ll D,N). We minimise:

\min_{\mathbf{W,Z}} \mathcal{L}(\mathbf{W,Z}) := \frac{1}{2}\sum_{(d,n)\in\Omega}[x_{dn}-(\mathbf{WZ}^\top)_{dn}]^2

Regularisation (not a matrix factorisation anymore): add \frac{\lambda_w}{2}||\mathbf{W}||^2_F or \frac{\lambda_z}{2}||\mathbf{Z}||^2_F, with \lambda_w, \lambda_z>0

SGD:

  • For a fixed element (d,n):

    \frac{\partial \mathcal{L}_{dn}}{\partial w_{d',k}}(\mathbf{W,Z}) = \begin{cases} -[x_{dn}-(\mathbf{WZ}^\top)_{dn}]z_{n,k} & \text{if } d'=d \\ 0 & \text{otherwise} \end{cases} \in \R^K

    \frac{\partial \mathcal{L}_{dn}}{\partial z_{n',k}}(\mathbf{W,Z}) = \begin{cases} -[x_{dn}-(\mathbf{WZ}^\top)_{dn}]w_{d,k} & \text{if } n'=n \\ 0 & \text{otherwise} \end{cases} \in \R^K

Alternating Least Squares (ALS):

  • Assuming no missing entries. First update \mathbf{Z} with \mathbf{W} fixed, then \mathbf{W} with \mathbf{Z} fixed
  • \mathbf{Z}^\top:=(\mathbf{W}^\top\mathbf{W}+ \lambda_z\mathbf{I}_K)^{-1}\mathbf{W}^\top\mathbf{X}
  • \mathbf{W}^\top:=(\mathbf{Z}^\top\mathbf{Z}+ \lambda_w\mathbf{I}_K)^{-1}\mathbf{Z}^\top\mathbf{X}^\top
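
A sketch of the two alternating updates above, assuming a fully observed X (no missing entries); initialisation and hyperparameter values are illustrative:

```python
import numpy as np

def als(X, K, lambda_w=0.1, lambda_z=0.1, max_iters=50, seed=0):
    """Alternating least squares for X ≈ W Z^T with X of shape D x N."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(D, K))
    Z = rng.normal(size=(N, K))
    for _ in range(max_iters):
        # update Z with W fixed:  Z^T = (W^T W + lambda_z I)^-1 W^T X
        Z = np.linalg.solve(W.T @ W + lambda_z * np.eye(K), W.T @ X).T
        # update W with Z fixed:  W^T = (Z^T Z + lambda_w I)^-1 Z^T X^T
        W = np.linalg.solve(Z.T @ Z + lambda_w * np.eye(K), Z.T @ X.T).T
    return W, Z
```
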
21. Text representation learning

Use the log of the co-occurrence counts of words w_d and context words w_n → sparse matrix \in \N^{D \times N}

  • Same objective as matrix factorisation, except for adding a weighting term f_{dn} in the sum. For GloVe, f_{dn}:=\min\{1, (\frac{n_{dn}}{n_{max}})^\alpha\} with \alpha\in [0, 1]
  • Rows of \mathbf{W} \in \R^{D\times K} and \mathbf{Z} \in \R^{N\times K} are word (or context word) representations
  • Skip-gram model (word2vec): use binary classification to separate real word pairs (w_d, w_n) (appearing together in a context window of size 5) from fake word pairs (random words)
  • FastText: matrix factorisation to learn sentence representations s_n (supervised, with pairs (s_n, y_n), y_n \in \{-1, 1\}); f is a linear classifier loss and x_n is the bag-of-words (no ordering!) representation of s_n. Minimise

\min_{\mathbf{W,Z}} \mathcal{L}(\mathbf{W,Z}) := \sum_{s_n}f(y_n\mathbf{WZ}^\top x_n)

22. Self-supervised learning

Use pretext tasks f:\mathbf{x}_{in} \mapsto \mathbf{x}_{out} that learn from unlabelled data, with a function g: \mathbf{x} \mapsto(\mathbf{x}_{in}, \mathbf{x}_{out}) that creates the data (e.g. predict image rotation, relative patch placement) → we only need to corrupt data to create training pairs

Masked Language Modelling (MLM): predict hidden word [MASK]

  • Bidirectional Encoder Representations from Transformers (BERT): encoder only, can look at both previous and following tokens. Pretext tasks: 1. predict the original masked token; 2. predict whether the second sentence immediately follows the first ([CLS]). Fine-tuned for sentiment prediction, noun/verb prediction, finding the start/end of a passage.
  • Next token prediction - Generative Pre-trained Transformers (GPT): decoder only, can look only at prior tokens (masked attention). Auto-regressive → uses the previous output to compute the loss (softmax cross-entropy) for the next output. Fine-tuned for in-context learning or instruction following.
  • Joint embedding methods: learn encoder invariances (e.g. to rotations) by creating views (rotations, distortions) of an original image
  • Contrastive learning: positive pair \mathbf{x}, \mathbf{x}^+ and negative view \mathbf{x}^- → s(f(\mathbf{x}), f(\mathbf{x}^+)) > s(f(\mathbf{x}), f(\mathbf{x}^-)), where s is a similarity function (e.g. cosine similarity)
    • SimCLR:
      1. classification with N negative samples
      2. map the encoder output to the similarity space f(\mathbf{x})=f_2 \circ f_1(\mathbf{x}); use only f_1 for downstream tasks
      3. use cosine similarity with temperature scaling \tau: s(\mathbf{e}_1, \mathbf{e}_2)=\frac{\langle\mathbf{e}_1, \mathbf{e}_2 \rangle}{||\mathbf{e}_1||_2||\mathbf{e}_2||_2}/\tau
      4. generate views with data augmentation
    • CLIP: captioned images used to learn a joint multimodal embedding space
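
A sketch of the temperature-scaled cosine similarity and a contrastive loss for one positive pair against N negatives (classification of the positive among the negatives); the encoder producing the embeddings is assumed given:

```python
import numpy as np

def cosine_sim(e1, e2, tau=0.1):
    """s(e1, e2) = <e1, e2> / (||e1|| ||e2||) / tau."""
    return (e1 @ e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)) / tau

def contrastive_loss(z, z_pos, z_negs, tau=0.1):
    """-log softmax score of the positive pair against the negative views."""
    s_pos = cosine_sim(z, z_pos, tau)
    s_negs = np.array([cosine_sim(z, zn, tau) for zn in z_negs])
    logits = np.concatenate([[s_pos], s_negs])
    return -s_pos + np.log(np.exp(logits).sum())
```
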
23. Generative Models

Generate more samples from the learned distribution p(x). Explicit → learn the distribution. Implicit → only generate samples according to the distribution.

Generative Adversarial Networks: create images from a known noise distribution p_z. 2-player game: generator G(\theta) vs. discriminator D(\varphi) (both deep NNs). Alternate T gradient descent steps for each player

  1. \varphi^*\in\arg\min_{\varphi \in \Phi} \mathcal{L}^\varphi(\theta^*, \varphi) → distinguishes real (x \sim p_d) vs. generated images G(z)
  2. \theta^*\in\arg\min_{\theta \in \Theta} \mathcal{L}^\theta(\theta, \varphi^*) → creates realistic images G(z)
  3. Objective: \min_G \max_D \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log (1-D(G(z)))]
  4. Theoretical solution: the optimum is reached when p_g = p_d, with value -\log 4

Conditional GAN (CGAN): additional information c (e.g. class labels)

Diffusion models: forward process → add noise; backward process → undo the noise. Use ancestral sampling starting from pure Gaussian noise and denoising via a Markov chain. Often use a U-Net architecture (with an exponential moving average of the weights to stabilise training) with ResNet blocks and self-attention layers. To add additional information y (conditional generation), the probabilities of the neural net s_\theta should be conditioned on y.