Image Manipulation with Perceptual Discriminators
[Figure: Example results for facial attribute editing.]
Abstract

Systems that perform image manipulation using deep convolutional networks have achieved remarkable realism. Perceptual losses and losses based on adversarial discriminators are the two main classes of learning objectives behind these advances. In this work, we show how these two ideas can be combined in a principled and non-additive manner for unaligned image translation tasks. This is accomplished through a special architecture of the discriminator network within the generative adversarial learning framework. The new architecture, which we call a \textit{perceptual discriminator}, embeds the convolutional parts of a pre-trained deep classification network inside the discriminator network. The resulting architecture can be trained on unaligned image datasets, while benefiting from the robustness and efficiency of perceptual losses. We demonstrate the merits of the new architecture in a series of qualitative and quantitative comparisons with baseline approaches and state-of-the-art frameworks for unaligned image translation.

Main idea

Generative adversarial networks have shown impressive results in photorealistic image synthesis. The model includes a generator network $G$ trained to produce samples $y \sim p_\text{fake}(y)$ that match the target distribution $p_\text{real}(y)$ in the data space $\mathcal Y$, and a discriminator network $D$ trained to distinguish whether its input is real or generated by $G$. The two networks play a minimax game over the value function $V(D, G)$: \begin{equation*} \min_{G} \max_{D}\ V(D, G) \end{equation*} where $V(D, G)$ is usually chosen so that $\max_{D} V(D, G)$ evaluates to a divergence between the distributions $p_\text{real}(y)$ and $p_\text{fake}(y)$.
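For concreteness, one standard choice of $V$ (the original GAN objective of Goodfellow et al., used here purely as an illustrative instance) is \begin{equation*} V(D, G) = \mathbb{E}_{y \sim p_\text{real}(y)}\bigl[\log D(y)\bigr] + \mathbb{E}_{y \sim p_\text{fake}(y)}\bigl[\log\bigl(1 - D(y)\bigr)\bigr], \end{equation*} for which $\max_{D} V(D, G) = 2\,\mathrm{JS}\bigl(p_\text{real} \,\|\, p_\text{fake}\bigr) - \log 4$, i.e. the Jensen--Shannon divergence between the two distributions up to an affine transform.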

Converging to a good equilibrium in any of the proposed GAN games is known to be hard. In general, the performance of the trained generator network crucially depends on the architecture of the discriminator network, which needs to learn statistics that are meaningful for matching the target distribution $p_\text{real}$. The typical failure mode of GAN training occurs when the discriminator does not manage to learn such statistics before being ``overpowered'' by the generator.

Following this line of work, we suggest basing the GAN discriminator $D(y)$ on the perceptual statistics computed by a reference network $F$ on the input image $y$, which can be either real (coming from $p_\text{real}$) or fake (produced by the generator). Our motivation is that a discriminator that uses perceptual features has a better chance of learning good statistics than a discriminator initialized to a random network. For simplicity, we assume that the network $F$ has a chain structure; e.g., $F$ can be a VGG-type network.

Consider the successive blocks of the convolutional part of the reference network $F$, and denote them as $b_0,b_1,\dots,b_{K-1}$. Each block may include one or more convolutional layers interleaved with non-linearities and pooling operations. The perceptual statistics $\{f_1(y), \dots, f_K(y)\}$ are then computed as: \begin{eqnarray*} f_1(y) &=& b_0(y)\,,\\ f_i(y) &=& b_{i-1}(f_{i-1}(y)), \quad i = 2,\dots,K\,, \end{eqnarray*} so that each $f_i(y)$ is a stack of convolutional maps of spatial dimensions $W_i \times W_i$. The dimension $W_i$ is determined by the preceding size $W_{i-1}$ as well as by the strides and pooling operations inside $b_{i-1}$. In our experiments we use features from consecutive blocks, so that $W_i = W_{i-1} / 2$.
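As an illustration, this feature extraction can be implemented in a few lines of PyTorch. The sketch below assumes torchvision's pre-trained VGG19 as the reference network $F$ and splits its convolutional part into blocks at the max-pooling layers; the exact split points and the choice of $K$ are our assumptions for the example, not requirements of the method.
\begin{verbatim}
import torch
import torchvision.models as models

# Load a pre-trained VGG19 and keep its convolutional part only.
# F serves as a fixed reference network, so its parameters are frozen.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

# Split the convolutional part into blocks b_0, ..., b_{K-1}, cutting
# after each max-pooling layer so that every block halves the spatial
# size (giving W_i = W_{i-1} / 2, as in the text).
blocks, current = [], []
for layer in vgg.children():
    current.append(layer)
    if isinstance(layer, torch.nn.MaxPool2d):
        blocks.append(torch.nn.Sequential(*current))
        current = []

def perceptual_statistics(y, K=4):
    """Return [f_1(y), ..., f_K(y)]: f_1 = b_0(y), f_i = b_{i-1}(f_{i-1})."""
    feats, f = [], y
    for b in blocks[:K]:
        f = b(f)
        feats.append(f)
    return feats
\end{verbatim}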

The proposed discriminator architecture combines the perceptual statistics using the following computations: \begin{eqnarray*} h_1(y) &=& f_1(y)\,,\\ h_i(y) &=& \texttt{stack}\left[ c_{i-1}(h_{i-1}(y),\phi_{i-1})\,,\, f_i(y) \right], \quad i = 2,\dots,K\,, \end{eqnarray*} where \texttt{stack} denotes channel-wise stacking (concatenation), and the convolutional blocks $c_j$ with learnable parameters $\phi_j$ (for $j = 1,\dots,K-1$) are composed of convolutions, leaky ReLU non-linearities, and average pooling operations. Each block $c_j$ thus transforms map stacks of spatial size $W_j \times W_j$ into map stacks of spatial size $W_{j+1} \times W_{j+1}$, so that the strides and pooling operations inside $c_j$ match the strides and/or pooling operations inside $b_j$.
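A corresponding sketch of the trainable part of the discriminator, continuing the PyTorch example above, is given below. The hidden channel width, the single $1{\times}1$ convolution used as the final real/fake head, and the other hyper-parameters are illustrative assumptions rather than the exact configuration of the paper.
\begin{verbatim}
import torch
import torch.nn as nn

class PerceptualDiscriminator(nn.Module):
    """Combines the fixed perceptual statistics f_1, ..., f_K with
    learnable blocks c_1, ..., c_{K-1}: h_i = stack[c_{i-1}(h_{i-1}), f_i]."""

    def __init__(self, feature_channels, hidden=64):
        super().__init__()
        self.c = nn.ModuleList()
        in_ch = feature_channels[0]
        for f_ch in feature_channels[1:]:
            # Each c_j: convolution + leaky ReLU + average pooling, so its
            # downsampling matches that of the corresponding block b_j.
            self.c.append(nn.Sequential(
                nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2),
                nn.AvgPool2d(2),
            ))
            in_ch = hidden + f_ch  # channels after stacking with f_i
        # Real/fake head (an assumption; any standard GAN head could be used).
        self.head = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, feats):
        # feats = [f_1(y), ..., f_K(y)] from the frozen reference network F.
        h = feats[0]                             # h_1 = f_1
        for c_j, f_i in zip(self.c, feats[1:]):
            h = torch.cat([c_j(h), f_i], dim=1)  # h_i = stack[c_j(h), f_i]
        return self.head(h)                      # patch-wise real/fake scores
\end{verbatim}
Given an image \texttt{y}, one would score it as \texttt{D(perceptual\_statistics(y))} and plug the result into the chosen GAN loss; gradients still flow through the frozen blocks $b_j$ to the generator, while only the $c_j$ blocks and the head are updated on the discriminator side.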

Experiments

Below are additional experimental results to supplement the paper.

More attribute manipulation examples
Apple to orange
Photos to Monet
Acknowledgements

This work has been supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).