Dept. of Computer Science, University of Toronto; Vector Institute
Primer: Enforcing Lipschitz constraints for neural networks
We can understand a lot about a neural network by understanding the Jacobian of the function it computes, i.e. the derivatives of its outputs with respect to its inputs. I’ll explain what the Jacobian is, how it’s built up from the Jacobians of individual layers, and what it tells us about neural net optimization. I’ll then motivate why we might like to bound the matrix norm of the Jacobian, or equivalently, enforce a small Lipschitz constant for a neural net, i.e. ensure that a small change to the input produces a correspondingly small change to the output. This is useful for several reasons: (1) it lets us make a network’s predictions provably robust to small adversarial perturbations, (2) it helps us estimate the Wasserstein distance between probability distributions, (3) it lets us bound the generalization error in terms of the Lipschitz constant, and (4) it prevents some optimization difficulties, most notably the problem of exploding gradients. To set the stage for the research talk, I’ll relate the Lipschitz bound of the network to the norms of individual layers’ Jacobians.
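The chain-rule relationship between the network’s Jacobian and the layers’ Jacobians can be checked numerically. The following sketch (the two-layer ReLU network and all variable names are illustrative, not from the talk) builds the Jacobian of f(x) = W2 relu(W1 x) and verifies that its spectral norm is bounded by the product of the layers’ spectral norms, which is exactly the kind of per-layer Lipschitz bound the talk relates to the whole network:

```python
import numpy as np

# Hypothetical two-layer ReLU network: f(x) = W2 @ relu(W1 @ x).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

def f(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

x = rng.standard_normal(3)

# Jacobian by the chain rule: J = W2 @ diag(relu'(W1 x)) @ W1,
# where relu' is 1 on active units and 0 on inactive ones.
mask = (W1 @ x > 0).astype(float)
J = W2 @ (mask[:, None] * W1)

# Since the diagonal relu' factor has spectral norm at most 1,
# ||J||_2 <= ||W2||_2 * ||W1||_2: the product of the layers'
# spectral norms is a Lipschitz bound for the whole network.
lip_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)
assert np.linalg.norm(J, 2) <= lip_bound + 1e-9
```

In general, the bound obtained by multiplying per-layer norms can be very loose, which is one motivation for the norm-preserving architectures described in the research talk below.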
Grosse Group, University of Toronto; Vector Institute
Efficient Lipschitz-constrained neural networks
Training neural networks under a strict Lipschitz constraint is useful for provable adversarial robustness, generalization bounds, interpretable gradients, and Wasserstein distance estimation. By the composition property of Lipschitz functions, it suffices to ensure that each individual affine transformation or nonlinear activation is 1-Lipschitz. The challenge is to do this while maintaining expressive power. We identify a necessary property for such an architecture: each layer must preserve the gradient norm during backpropagation. We propose two architectural components that satisfy strict Lipschitz constraints with norm preservation. First is the GroupSort activation function, which sorts units within a group. Second is the use of orthogonal linear layers; this is straightforward for fully connected layers, but more involved for convolutional layers. We present a flexible and efficient representation of orthogonal convolutions. Our provably Lipschitz-constrained architectures perform competitively at Wasserstein distance estimation and provable adversarial robustness.
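Both proposed components are norm-preserving, and this is easy to verify directly. The sketch below (function and variable names are illustrative, not the authors’ implementation) implements GroupSort as a within-group sort and an orthogonal fully connected layer via a QR decomposition; since sorting merely permutes each input’s coordinates and an orthogonal matrix preserves Euclidean norm, their composition preserves the norm exactly:

```python
import numpy as np

def group_sort(x, group_size=2):
    """GroupSort activation: sort units within contiguous groups.
    For each input, this is a permutation of its coordinates,
    so it is 1-Lipschitz and preserves the Euclidean norm."""
    x = np.asarray(x, dtype=float)
    return np.sort(x.reshape(-1, group_size), axis=1).ravel()

# Orthogonal fully connected layer: take the Q factor of a QR
# decomposition so that W.T @ W = I, hence ||W v|| = ||v||.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((4, 4)))

x = rng.standard_normal(4)
y = group_sort(W @ x)

# The composition of norm-preserving maps is norm-preserving.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```

Stacking such layers keeps the end-to-end map 1-Lipschitz by the composition property, while the gradient norm is preserved during backpropagation rather than shrinking layer by layer.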