# Learning and Optimization in Côte d'Azur 2024

**September 23-25, 2024**

The workshop focuses on optimization methods for problems in imaging and machine learning. It will bring together experts to present recent advances in these areas, and participants will have the opportunity to exchange ideas on current challenges in optimization and machine learning.

## Speakers

- Aurélien Bellet (Inria)
- Jérôme Bolte (Toulouse School of Economics)
- Claire Boyer (Université Paris-Saclay)
- Anna Korba (ENSAE)
- Jérôme Malick (CNRS, Université Grenoble-Alpes)
- Gabriel Peyré (CNRS, École Normale Supérieure)
- Gabriele Steidl (TU Berlin)
- Eloi Tanguy (Université Paris-Descartes)

## Location

The workshop will take place at the I3S institute, room 007, Université Côte d’Azur, Sophia-Antipolis, France.

## Schedule

### September 23: Mini-course

- 13:30: Registration & Welcome
- 14:00-15:30: Mini-course “Optimal Transport in Machine Learning” by Elsa Cazelles
- 15:30-16:00: Coffee break
- 16:00-17:30: Mini-course “Optimal Transport in Machine Learning” by Elsa Cazelles

### September 24: Conference

- 09:00-09:30: Welcome
- 09:30-10:30: Jérôme Bolte

**A bestiary of counterexamples in smooth convex optimization**

Abstract: Counterexamples to some long-standing optimization problems in the smooth convex coercive setting will be provided. For instance, block-coordinate descent, steepest descent with exact line search, and Bregman descent methods do not generally converge. Other failures of various desirable features will be discussed: directional convergence of Cauchy’s gradient curves, convergence of Newton’s flow, finite length of the Tikhonov path, convergence of central paths, and smooth Kurdyka-Łojasiewicz inequalities. All examples are planar. These examples rely on a new convex interpolation result: given a decreasing sequence of positively curved C^k smooth convex compact sets in the plane, we can interpolate these sets through the sublevel sets of a C^k smooth convex function, where k ≥ 2 is arbitrary.

- 10:30-11:00: Coffee break
- 11:00-12:00: Jérôme Malick

**Towards more resilient, robust, responsible decisions**

Abstract: This talk will be a gentle introduction to, and a passionate advocacy for, distributionally robust optimization (DRO). Beyond the classical empirical risk minimization paradigm in machine learning, DRO can effectively address data uncertainty and distribution ambiguity, thus paving the way to more robust and fair models. In this talk, I will highlight the key mathematical ideas, the main algorithmic challenges, and some versatile applications of DRO. I will emphasize the statistical properties of DRO with Wasserstein uncertainty, and I will finally present an easy-to-use toolbox (with scikit-learn and PyTorch interfaces) to make your own models more robust.

- 12:00-14:00: Lunch
- 14:00-15:00: Claire Boyer

**A primer on physics-informed learning**

Abstract: Physics-informed machine learning combines the expressiveness of data-based approaches with the interpretability of physical models. In this context, we consider a general regression problem where the empirical risk is regularized by a partial differential equation that quantifies the physical inconsistency. Practitioners often resort to physics-informed neural networks (PINNs) to solve this kind of problem. After discussing some strengths and limitations of PINNs, we prove that for linear differential priors, the problem can be formulated directly as a kernel regression task, giving a rigorous framework to analyze physics-informed ML. In particular, the physical prior can help in boosting the estimator convergence.

- 15:00-15:30: Coffee break
- 15:30-16:30: Eloi Tanguy

**Optimisation Properties of the Discrete Sliced Wasserstein Distance**

Abstract: For computational reasons, the Sliced Wasserstein distance is commonly used in practice to compare discrete probability measures with uniform weights and the same number of points. We will address the properties of this energy as a function of the support of one of the measures. We study the regularity and optimisation properties of this energy, as well as its Monte Carlo approximation (estimating the expected SW using samples on the projections), including both the asymptotic and non-asymptotic statistical properties of the estimation. Finally, we show that in a certain sense, stochastic gradient descent methods that minimise these energies converge to (generalised) critical points, with an extension to training generative neural networks.
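
As an illustration of the Monte Carlo approximation mentioned in this abstract, here is a minimal NumPy sketch, not the speaker's code. It assumes two point clouds with uniform weights and the same number of points, in which case the one-dimensional optimal transport between projected samples is obtained by sorting. The function name `sliced_wasserstein_sq` and its parameters (`n_projections`, `rng`) are hypothetical choices made for this example.

```python
import numpy as np

def sliced_wasserstein_sq(x, y, n_projections=200, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance
    between the uniform discrete measures supported on the rows of x and y.

    Assumes x and y both have shape (n, d): same number of points,
    uniform weights.
    """
    rng = np.random.default_rng(rng)
    n, d = x.shape
    if y.shape != (n, d):
        raise ValueError("x and y must have the same shape (n, d)")

    # Draw directions uniformly on the unit sphere (normalized Gaussians).
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project both point clouds on every direction, then sort each
    # projected sample: in 1D, sorted points are optimally matched.
    px = np.sort(x @ theta.T, axis=0)  # shape (n, n_projections)
    py = np.sort(y @ theta.T, axis=0)

    # Average the squared 1D transport cost over points and directions.
    return float(np.mean((px - py) ** 2))
```

Passing the same integer as `rng` makes the estimate reproducible; increasing `n_projections` reduces the Monte Carlo variance at a cost that grows linearly in the number of directions.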

### September 25: Conference

- 09:30-10:30: Anna Korba

**Implicit Diffusion: Efficient Optimization through Stochastic Sampling**

Abstract: We present a new algorithm to optimize distributions defined implicitly by parameterized stochastic diffusions. Doing so allows us to modify the outcome distribution of sampling processes by optimizing over their parameters. We introduce a general framework for first-order optimization of these processes, which jointly performs optimization and sampling steps in a single loop. This approach is inspired by recent advances in bilevel optimization and automatic implicit differentiation, leveraging the point of view of sampling as optimization over the space of probability distributions. We provide theoretical guarantees on the performance of our method, as well as experimental results demonstrating its effectiveness. We apply it to training energy-based models and finetuning denoising diffusions.

- 10:30-11:00: Coffee break
- 11:00-12:00: Gabriel Peyré

**Transformers are Universal In-context Learners**

Abstract: Deep transformer networks define “in-context mappings”, which enable them to predict new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for vision transformers). This work studies the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically and uniformly address the expressivity of these architectures, we consider that the mapping is conditioned on a context represented by a probability distribution of tokens (discrete for a finite number of tokens). The related notion of smoothness corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLP layers between multi-head attention layers is also explicitly controlled. This is joint work with Takashi Furuya (Shimane Univ.) and Maarten de Hoop (Rice Univ.).

- 12:00-14:00: Lunch
- 14:00-15:00: Aurélien Bellet

**Differentially Private Optimization with Coordinate Descent and Fixed-Point Iterations**

Abstract: Machine learning models are known to leak information about individual data points used to train them. Differentially private optimization aims to address this problem by training models with strong differential privacy guarantees. This is achieved by adding controlled noise to the optimization process, for instance during the gradient computation steps in the case of the popular DP-SGD algorithm. In this talk, I will discuss how to go beyond DP-SGD by (i) introducing private coordinate descent algorithms that can better exploit the problem structure, and (ii) leveraging the framework of fixed-point iterations to design and analyze new private optimization algorithms for centralized and federated settings.
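
For readers unfamiliar with the DP-SGD baseline mentioned in this abstract, the following NumPy sketch shows one noisy gradient step on a least-squares loss. It illustrates only the standard recipe (per-example gradient clipping followed by Gaussian noise), not the coordinate descent or fixed-point methods of the talk. The function name `dp_sgd_step` and its parameters are hypothetical, and no privacy accounting is performed, so this code alone does not certify any (ε, δ) guarantee.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD-style step for the loss 0.5 * (x @ w - y)**2 per example.

    Illustrative only: the noise scale is noise_multiplier * clip_norm,
    and the privacy budget spent by the step is NOT tracked here.
    """
    rng = np.random.default_rng(rng)

    # Per-example gradients: (x_i @ w - y_i) * x_i, shape (batch, d).
    residuals = X @ w - y
    grads = residuals[:, None] * X

    # Clip each per-example gradient to Euclidean norm at most clip_norm.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale

    # Add Gaussian noise to the summed clipped gradients, then average.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(X)

    return w - lr * noisy_mean
```

Clipping bounds the influence of any single example on the update, which is what allows the added noise to be calibrated to `clip_norm`.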

- 15:00-15:30: Coffee break
- 15:30-16:30: Gabriele Steidl

**Gradient flows, non-smooth kernels and generative models for posterior sampling in inverse problems**

Abstract: This talk is concerned with inverse problems in imaging from a Bayesian point of view, i.e. we want to sample from the posterior given noisy measurements. We tackle the problem by studying gradient flows of particles in high dimensions. More precisely, we analyze Wasserstein gradient flows of maximum mean discrepancies defined with respect to different kernels, including non-smooth ones. In high dimensions, we propose an efficient flow computation via the Radon transform (slicing) and subsequent sorting or Fourier transform at nonequispaced knots. Special attention is paid to non-smooth Riesz kernels. We will see that Wasserstein gradient flows of the corresponding maximum mean discrepancies have a rich structure. In particular, singular measures can become absolutely continuous ones and conversely. Finally, we approximate our particle flows by conditional generative neural networks and apply them to conditional image generation and to inverse image restoration problems such as computerized tomography and superresolution. This is joint work with Johannes Hertrich (UCL) and Paul Hagemann, Fabian Altekrüger, Robert Beinert, Jannis Chemseddine, Manuel Gräf, Christian Wald (TU Berlin).
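
To make the objective in this abstract concrete, here is a small NumPy sketch of the squared maximum mean discrepancy between two point clouds. It assumes the Riesz kernel convention K(x, y) = -‖x - y‖^r with 0 < r < 2, which is a choice made for this example and may differ from the speaker's normalization. It uses a direct quadratic-cost evaluation rather than the slicing and sorting scheme described in the talk, and the function name `riesz_mmd_sq` is hypothetical.

```python
import numpy as np

def riesz_mmd_sq(x, y, r=1.0):
    """Squared MMD between uniform discrete measures on the rows of x and y,
    for the (assumed) Riesz kernel K(a, b) = -||a - b||**r with 0 < r < 2.

    With r = 1 this is the energy distance, up to the normalization
    convention. Cost is quadratic in the number of points.
    """
    def mean_dist(a, b):
        # Mean of ||a_i - b_j||**r over all pairs (i, j).
        diff = a[:, None, :] - b[None, :, :]
        return np.mean(np.linalg.norm(diff, axis=2) ** r)

    # MMD^2 = E K(x, x') - 2 E K(x, y) + E K(y, y') with K = -||.||**r.
    return float(2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y))
```

The kernel is not differentiable where two points coincide, which is the non-smoothness the abstract refers to.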

## Scientific committee

- Laure Blanc-Féraud
- Luca Calatroni
- Yassine Laguel
- Samuel Vaiter