This episode of Fresh from the arXiv is going to be a little different. Normally I skim through all of the AI, computer vision and NLP preprints that came out during the week and pick a few that I consider particularly interesting. Often there is a common theme uniting a few of my choices, but the idea is not really to zoom in on any particular subject. Last week, however, I could not help but fall down a rabbit hole called semi-supervised learning with GANs.
I ended up putting together a little introduction to the topic that is not too technical (meaning, it should be understandable to anyone with a vague idea of how a vanilla unsupervised GAN operates), but also provides a few directions to explore in more detail should you be interested.
These sure are interesting times to be involved in deep learning. By the way, did you know that "May you live in interesting times" is supposed to be an English translation of a Chinese saying that does not actually exist? (Still makes for a nice epigraph and/or tattoo, in my opinion.) The saying is often used ironically: interesting times are also the ones filled with peril and uncertainty. One particular menace that practitioners of deep learning face stems from the fact that neural networks are data-hungry beasts. To make matters worse, many of the algorithms most frequently used in practice are of the supervised learning variety, meaning that not only do you need tons of data, but it also needs to be labelled appropriately for the task you are trying to solve. Labelling data is expensive, but fortunately there are a few ways to get around it - or rather, to get away with using fewer labelled training samples. One approach, especially popular in natural language processing, involves pre-training a model on a different, unsupervised or self-supervised task for which unlabelled data is easy to find (e.g. language modeling) and using that as a starting point for your problem. Alternatively, you can pre-train a model on another labelled set, as is often done in computer vision, where classification models are pre-trained on ImageNet, a dataset currently containing over 14 million labelled images. Both of these approaches are forms of transfer learning, where knowledge gained on one task is carried over to a model for another.
However, transfer learning is not the only strategy available when you are short on labelled data. In the likely event that you also have a decent-sized set of unlabelled training samples, you can make use of them via semi-supervised learning (SSL - not to be confused with its self-supervised cousin!). In last week's preprint Semi-supervised Learning using Adversarial Training with Good and Bad Samples, the authors construct an intricate four-player GAN setup, depicted below:
The idea of using GANs (Generative Adversarial Networks) for SSL is not new, and multiple predecessors (e.g. 2015, 2016, and 2017) are discussed at the beginning of the preprint. There turn out to be a few ways to go about it:
Let's say your supervised task involves classifying images into one of K classes. For starters, you can create an additional (K+1)th class for synthetic (i.e. generated) images, keeping the familiar Generator/Discriminator two-player structure. All that is left is to replace the binary Real vs. Fake classifier (a.k.a. the vanilla Discriminator) with one that classifies "real" images into one of K classes, and lumps the "fake" ones into class K+1.
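To make the (K+1)-class idea a bit more tangible, here is a minimal sketch of the three kinds of loss terms such a Discriminator could be trained with. It is an illustration under assumptions rather than the exact objective of any of the papers mentioned: the function names are made up for this post, the logits are plain Python lists, and the last index is assumed to be the "fake" class.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_loss(logits, label=None, is_fake=False):
    """Loss for one image under a (K+1)-way Discriminator.

    logits has K+1 entries; the last index is assumed to be the "fake" class.
    """
    probs = softmax(logits)
    fake_index = len(logits) - 1
    if is_fake:
        # Generated image: it should land in the extra (K+1)th class.
        return -math.log(probs[fake_index])
    if label is not None:
        # Labelled real image: ordinary cross-entropy on its true class.
        return -math.log(probs[label])
    # Unlabelled real image: we only know that it is NOT fake.
    return -math.log(1.0 - probs[fake_index])
```

The third branch is where the unlabelled data earns its keep: even without a label, a real image still tells the Discriminator to keep probability mass away from the "fake" class.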
Now, you can go one step further and add a third player to the game: a Classifier, which is trained together with the Generator while the Discriminator learns to reject data-label pairs that it deems to be fake. This setup came to be known as Triple-GAN, and achieved state-of-the-art classification results when it came out in 2017.
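As a rough sketch of what the Discriminator gets to look at in a three-player game like this, the toy code below assembles one batch of data-label pairs. Everything in it is a hypothetical stand-in: samples are single numbers rather than images, and toy_generator and toy_classifier are placeholders for real networks, so treat it as an illustration of the idea and not as the Triple-GAN implementation.

```python
import random

def toy_generator(label, rng):
    # Hypothetical stand-in: a "sample" is one number centred on its label.
    return label + rng.gauss(0.0, 0.1)

def toy_classifier(sample, num_classes):
    # Hypothetical stand-in: round to the nearest valid class index.
    return min(max(round(sample), 0), num_classes - 1)

def discriminator_batch(labelled, unlabelled, num_classes, rng):
    """Return (sample, label, target) triples; target 1 = real pair, 0 = fake pair."""
    # True pairs come straight from the labelled set.
    batch = [(x, y, 1) for x, y in labelled]
    # The Classifier contributes real samples paired with predicted labels.
    for x in unlabelled:
        batch.append((x, toy_classifier(x, num_classes), 0))
    # The Generator contributes generated samples paired with the labels it was given.
    for _ in labelled:
        y = rng.randrange(num_classes)
        batch.append((toy_generator(y, rng), y, 0))
    return batch
```

The Discriminator sees pairs, not bare images, so the Classifier can only fool it by predicting labels that look right for the image, and the Generator only by producing images that look right for the label.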
In the most recent work on the subject, the Unified GAN (UGAN) has not one, but two Generators, bringing the total number of players to four: a good Generator gG, a bad Generator bG, a Classifier C, and a Discriminator D. What do they all do? Let's see: mathematically speaking, the purpose of a GAN is to capture the probability distribution p(x) of real data (this way it can generate new, previously unseen samples from this distribution). If we are talking classification, the samples are labelled, so the distribution that we want to represent is actually p(x,y), where y is the class label. Here gG is in charge of capturing the conditional probability p(x|y) (the probability of sample x given label y), and C is dealing with the inverse, in a sense: p(y|x). (If you find this confusing, it may help to consider the fact that the Generator is producing samples x, while the Classifier is outputting class labels y.) No conditional surprises await on the Discriminator front, however, as it is still busy figuring out whether whatever it is fed comes from the true distribution p(x,y) or not. The one remaining player is bG, the bad Generator.
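For those who like to see it written down: the reason both conditionals lead to the same joint distribution is just the product rule of probability. The marginals p(y) and p(x) below are my shorthand for "pick a label" and "pick a real sample", not notation taken from the preprint.

```latex
\begin{align}
p(x, y) &= p(y)\, p(x \mid y) && \text{pick a label, then generate a sample for it (the good Generator's route)} \\
        &= p(x)\, p(y \mid x) && \text{pick a sample, then predict its label (the Classifier's route)}
\end{align}
```

Either route produces a data-label pair, which is exactly the kind of object the Discriminator compares against the true p(x,y).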
The idea of introducing a bad Generator seems to date back to 2017, the year the world learned of Good Semi-supervised Learning that Requires a Bad GAN. The paper is rather on the mathematical side, but let me try to rephrase some of the main ideas in plain words. In the preceding attempts to utilize GANs for SSL purposes, researchers noticed that although they were getting two models for the price of one (an image Classifier for the original supervised task of interest and a Generator of synthetic images), the performance of one seemed to be inversely linked to that of the other. Grossly oversimplifying, a great Generator corresponded to a poor Classifier and vice versa. Why would that be? The 2017 preprint proposes an answer. For a math-free version, consider the situation when your trained Generator is perfect. A perfect Generator basically means that you can now generate any number of images you want... unlabelled. Presumably you already had quite a few unlabelled images lying around, so producing more of them really does not help as far as your supervised task goes. What does help, on the other hand, is training a bad Generator - one whose output does not come from the true data distribution. The worst Generator (which, of course, is the best one for the job) is a complement Generator, one that produces samples in the regions of feature space where real data is scarce. The authors show that it will force the Discriminator to obtain correct decision boundaries in the feature space (to put it simply, the separation lines between the different classes will run through the sparsely populated gaps between them).
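If you want a feel for what "samples from where the real data is not" might look like, here is a deliberately simplified one-dimensional toy. It assumes we know the true density (a made-up mixture of two Gaussians) and simply keeps uniformly drawn points where that density falls below a threshold. The real thing works in a learned feature space and has no access to the true density, so this is an illustration of the intuition only, not the method from the paper.

```python
import math
import random

def true_density(x):
    # Assumed toy "real data": an equal mixture of Gaussians at -2 and +2 (std 0.5).
    def normal(v, mean, std):
        return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))
    return 0.5 * normal(x, -2.0, 0.5) + 0.5 * normal(x, 2.0, 0.5)

def complement_samples(n, threshold, rng, low=-4.0, high=4.0):
    """Rejection-sample n points from [low, high] whose true density is below threshold."""
    samples = []
    while len(samples) < n:
        x = rng.uniform(low, high)
        # Keep only points that fall in the low-density regions.
        if true_density(x) < threshold:
            samples.append(x)
    return samples
```

With two classes sitting around -2 and +2, the surviving points pile up around 0 and at the outer edges. A Discriminator that has to call those points "fake" is nudged into drawing its boundary through the gap between the two classes rather than through either of them.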
Finally, the four-player UGAN from Semi-supervised Learning using Adversarial Training with Good and Bad Samples takes ideas from all of the above, having both the good and the bad Generators in addition to the Classifier and the Discriminator, and sets yet another state-of-the-art result on image classification.