Asymptotics of Learning with Deep Structured (Random) Features

Dominik Schröder, Hugo Cui, Daniil Dmitriev, Bruno Loureiro

ICML 2024

Summary

We derive an approximate formula for the generalization error of deep neural networks with structured (random) features, confirming a widely believed conjecture. We also show that our results can capture feature maps learned by deep, finite-width neural networks trained under gradient descent.

A widely observed phenomenon in deep learning is that the generalization error of a trained model, viewed as a function of model complexity, often follows the so-called “double descent” curve. This curve is characterized by a peak in the generalization error at a certain model complexity, followed by a decrease in the error as the model complexity increases further. The following plot shows the generalization error computed using our asymptotic formula for a deep neural network with structured features. The double descent curve is visible for sufficiently large additive noise.

Asymptotic formula for the generalization error

For $n$ independent samples $x_i\in\mathbb{R}^p$ of a zero mean random vector with covariance matrix

\Omega := \mathbf E\, x_i x_i^\top

define, with $X:=(x_1,\dots,x_n)\in\mathbb{R}^{p\times n}$ the matrix whose columns are the samples, the sample covariance matrix and the Gram matrix

\frac{XX^\top}{p}\quad\text{and}\quad \frac{X^\top X}{p}

and the corresponding resolvents

G(\lambda):=\Bigl(\frac{XX^\top}{p}+\lambda\Bigr)^{-1}, \qquad \check G(\lambda):=\Bigl(\frac{X^\top X}{p}+\lambda\Bigr)^{-1}.
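As a minimal numerical sketch of these definitions, the resolvents can be formed directly with NumPy. The shape convention (samples stored as the columns of a $p\times n$ array `X`), the helper name `resolvents`, and the Gaussian test data are assumptions made for illustration only. Since $XX^\top$ and $X^\top X$ share their nonzero eigenvalues, the two traces differ exactly by the contribution of the zero eigenvalues, $(p-n)/\lambda$, which gives a simple sanity check.

```python
import numpy as np

def resolvents(X, lam):
    """Return G = (X X^T / p + lam)^{-1} and G_check = (X^T X / p + lam)^{-1}.

    Assumption: X has shape (p, n), with the samples x_i as its columns.
    """
    p, n = X.shape
    G = np.linalg.inv(X @ X.T / p + lam * np.eye(p))        # p x p
    G_check = np.linalg.inv(X.T @ X / p + lam * np.eye(n))  # n x n
    return G, G_check

# Illustrative data only: i.i.d. standard Gaussian entries (Omega = identity).
rng = np.random.default_rng(0)
p, n, lam = 30, 50, 0.5
X = rng.standard_normal((p, n))
G, G_check = resolvents(X, lam)

# The nonzero spectra coincide, so Tr G - Tr G_check = (p - n) / lam.
trace_gap = np.trace(G) - np.trace(G_check)
```

For larger problems one would typically avoid explicit inverses and use linear solves or an eigendecomposition instead; the explicit inverse is kept here only to mirror the formulas.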

The deterministic equivalents of these matrices are

G(\lambda)\approx M(\lambda),\quad \check G(\lambda)\approx m(\lambda)I,

where $m(\lambda)$ is the solution of the self-consistent equation

\frac{1}{m(\lambda)}=\lambda + \lambda \langle\Omega M(\lambda)\rangle = \lambda + \Bigl\langle \Omega\Bigl(1+\frac{n}{p}m(\lambda)\Omega\Bigr)^{-1}\Bigr\rangle,

and

M(\lambda):= \Bigl(\lambda + \frac{n}{p}\lambda m(\lambda)\Omega\Bigr)^{-1}.
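The self-consistent equation can be solved numerically by a simple fixed-point iteration, and the result compared with the empirical resolvents. The sketch below is illustrative rather than the authors' implementation: it assumes that $\langle A\rangle$ denotes the normalized trace $\operatorname{Tr}(A)/p$, that $X$ is stored as a $p\times n$ array with the samples as columns, and it uses a hypothetical diagonal $\Omega$ with Gaussian samples. The helper name `solve_m` is likewise made up for this example.

```python
import numpy as np

def solve_m(omega, lam, ratio, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration for 1/m = lam + <Omega (1 + ratio * m * Omega)^{-1}>.

    omega: eigenvalues of Omega; ratio: n / p.
    Assumption: <A> is the normalized trace Tr(A) / p.
    """
    m = 0.0
    for _ in range(max_iter):
        m_new = 1.0 / (lam + np.mean(omega / (1.0 + ratio * m * omega)))
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

rng = np.random.default_rng(0)
p, n, lam = 200, 300, 0.5
omega = np.linspace(0.5, 1.5, p)  # hypothetical spectrum of a diagonal Omega
m = solve_m(omega, lam, n / p)

# M(lam) = (lam + (n/p) * lam * m * Omega)^{-1}; diagonal because Omega is.
M_diag = 1.0 / (lam + (n / p) * lam * m * omega)

# Residual of the first form of the equation: 1/m = lam + lam * <Omega M>.
residual = 1.0 / m - lam - lam * np.mean(omega * M_diag)

# Empirical resolvents for Gaussian samples with covariance Omega.
X = np.sqrt(omega)[:, None] * rng.standard_normal((p, n))
G = np.linalg.inv(X @ X.T / p + lam * np.eye(p))
G_check = np.linalg.inv(X.T @ X / p + lam * np.eye(n))

m_empirical = np.trace(G_check) / n   # should be close to m
trace_G = np.trace(G) / p             # should be close to trace_M
trace_M = M_diag.mean()
```

Starting the iteration at $m=0$ gives a monotonically increasing sequence bounded by $1/\lambda$, so it converges for $\lambda>0$. Convergence can become slow for very small $\lambda$, where a Newton or damped scheme may be preferable.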

The generalization error of ridge regression is given by
