Wasserstein Distances

Intro

Previously I made a post about discrete optimal transport where we walked through ways to efficiently compute approximations to distances between discrete distributions. However, what can we do when we want to calculate distances between continuous distributions? One might recall metrics like, KL, Jensen-Shannon yet those have some critical issues. Specifically, when intersection of the supports of the distributions are empty then the aforementioned metrics are ill defined. Therefore, we want to look into something called Earth Movers (EM) distance, and this is often called Wasserstein distance Distance or Kantorovich–Rubinstein metric.

Wasserstein Distance

This is just a continuous formulations of the EM distance

W(P_1,P_2) = \inf_{\pi\in \Pi(P_1,P_2)}\int_x\int_y||x-y||\pi(x,y)dxdy = \inf_{\pi\in \Pi(P_1,P_2)} \mathbb{E}_{(x,y)\sim\pi}[||x-y||]

Unfortunately, the integrals make this intractable to compute in practical situations. Therefore, we have to look into Kantorovich-Rubinstein Duality which states

\inf_{\pi\in \Pi(P_1,P_2)}\int_x\int_y||x-y||\pi(x,y)dxdy = \sup_{||f||_L\leq 1}\mathbb{E}_{x \sim P_1}[f(x)]-\mathbb{E}_{x \sim P_2}[f(x)]

We can show why this dual formulation is useful.

Kantorovich Rubinstein Duality

Good sources for looking at the proofs are linked in the reference section.

Notes

The key aspects of the proof are when can you set $\min_x \max_v L(x,v)= \max_v \min_x L(x,v)$ , when $L$ is convex with respect to $x$ , and concave with respect to $v$ - from the linked textbook by Ambrosio et al. You can also check when we change $W(P_1,P_2)$ to be

= \inf_{\pi\in \Pi(P_1,P_2)}\int_x\int_yK||x-y||\pi(x,y)dxdy = \inf_{\pi\in \Pi(P_1,P_2)} \mathbb{E}_{(x,y)\sim\pi}[K||x-y||]

Then the resulting dual problem with equality is

\sup_{||f||_L\leq K}\mathbb{E}_{x \sim P_1}[f(x)]-\mathbb{E}_{x \sim P_2}[f(x)]

This is useful as shown below.

How to optimize?

Let’s look at Wasserstein GAN paper. One thing in particular I found interesting is how they optimized over a set of K Lipschitz functions. Specifically, they clip the weights to ensure that set of weights is bounded, thus implying that the network is K Lipschitz for some $K$ . What that $K$ is doesn’t matter since recall that it just scales $W(P_1,P_2)$ by $K$ . Therefore, we can use the Kantorovich Rubinstein Duality as shown in the WGAN algorithm below, where they use Monte Carlo estimator for estimating the two expectations.

References

https://vincentherrmann.github.io/blog/wasserstein/

https://courses.cs.washington.edu/courses/cse599i/20au/resources/L12_duality.pdf

https://abdulfatir.com/blog/2020/Wasserstein-Distance/

https://link.springer.com/chapter/10.1007/978-3-030-72162-6_3