This is a quick blog about an idea that occurred to me yesterday. I am currently working on experiments involving differentiable neural architecture, but the idea got me excited enough to write about it.

We currently represent an image as \((r, g, b)\) pixel values. When trained end-end, each layer transforms image into representations that are useful for the task at hand. Whenever we come up with a new architecture/task, these representations are learned again from scratch. Intuitively, all image classification problems, be it imagenet or flowers dataset should share common intermediate representations. If we could learn an image representation that works for all image problems, it would go a long way in making some progress towards one shot learning.

Constrained latent representation

I really liked the idea of using an encoder-decoder architecture for learning useful image representations. An encoder is a neural network that maps input image to some latent image representation (\(Z\)), i.e., \(encoder: \Bbb R^{rows \times cols \times channels} \rightarrow \Bbb R^d\). Decoder (another neural network) is tasked to reconstruct the original input image from the latent representation \(Z\). The key idea is to set \(d\) to be a lot smaller than \(rows \times cols \times channels\) so that the encoder is forced to discover and efficiently internalize the essence of the data as a compressed representation while being useful enough for reconstruction.

Figure 1. Encoder-Decoder Architecture

Then came supervised conv nets, beating all sorts of records on imagenet. This is because learning a "useful representation" depends on the task. For example, a water classifier might care about learning representations that make it easy to detect blue blobs, while an animal classifier might learn to detect shapes. Of late, most unsupervised learning focus has shifted towards Generative Adversarial Networks[1] as they showed very promising results in modeling probability distributions of complex real world data.

We possess both discriminative and generative abilities. When we look at a new object, say a bottle, we can imagine plausible variations such as rotations and deformations. Like GANs, the hallucinated image is vague and not very detailed.

An encoder-decoder also possess generative abilities provided that perturbations in latent representation decode into useful object variations. Instead of using the standard encoder-decoder pipeline, we could impose additional constraints on \(Z\). One possibility is to constrain \(Z\) so that it is useful for both generative and discriminative tasks.

constrained encoder-decoder
Figure 2. Modified Encoder-Decoder architecture with additional constraints on \(Z\)

Hopefully, newer architectures can start with \(Z\) as input, instead of the raw \((r, g, b)\) image to train faster.



1. An excellent overview of GANs can be found at
comments powered by Disqus