Hello there! Fresh from the arXiv took the first half of November off due to the two long weekends we just had here in France, but the series is now back and in full swing :) Without further ado, let us dive into some of the preprints that caught my eye in this past week.

One of the obstacles facing the practitioners of deep learning today is that most of the state-of-the-art results come from extremely large networks. Not only are large models challenging to use in practical situations, but training them also requires enormous datasets - to make matters worse, the latter often have to be labeled manually. There are several approaches to tackling these two problems, e.g. compressing the original network down to a more manageable size, and making use of unlabeled data in addition to smaller labeled sets. Last week, the authors of Learning from a Teacher using Unlabeled Data did both. Their paper discusses knowledge distillation, a technique where knowledge is transferred from a (large) Teacher network that has been trained to perform a specific machine learning task to a (smaller) Student one. What is unusual is that their Student is trained on an unlabeled dataset that is different from the labeled set that the Teacher was originally trained on. Sounds cryptic? Let's consider an MNIST / Fashion MNIST example from this preprint to see why one might want to go down this road:

Examples of training samples from MNIST (handwritten digits on the left) and Fashion MNIST (right) datasets

The basic idea behind knowledge distillation is the following: let us say the original, Teacher model is trained as an image classifier. The last layer of a typical classifier is a softmax: this is a layer that takes in logits (a name used for the outputs of the layer preceding softmax) and produces K probabilities of the image belonging to one of K classes (these values being probabilities, all K numbers have to sum up to one). The probabilities won't necessarily be equal to 1 for the correct class and zeroes elsewhere: instead, we could get something like 0.17 for class label "8" and 0.83 for "3" for the image of the number three in the bottom right figure of the left panel above. Such results reflect the fact that handwritten threes often look similar to handwritten eights, one person's ones may resemble another person's sevens, etc. These similarities in turn reflect the features that the network learns to look for in order to complete the classification task. Hence, there is additional knowledge, that extends beyond pure classification, to be gained from looking at all K probabilities - or the logits, since probabilities are just a mathematical function of the latter. It is this knowledge that gets passed on from the Teacher to the Student in the process of knowledge distillation: rather than trying to learn the classification task on a labeled set directly, the Student is trained to reproduce the logits that are calculated by the Teacher. Again, grossly oversimplifying, we can say that since part of the feature engineering work has already been done by the Teacher, the Student can get away with being a smaller network - voilà, our model has been compressed!
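To make the softmax-and-logits picture a bit more concrete, here is a minimal Python sketch. The logit values are made up for illustration, and the mean squared error is only my assumption for how one could measure the mismatch between Student and Teacher logits; the preprint may well use a different loss.

```python
import math

def softmax(logits):
    """Turn a list of K logits into K probabilities that sum to one."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_matching_loss(student_logits, teacher_logits):
    """Mean squared error between Student and Teacher logits (an assumed loss)."""
    squared = [(s - t) ** 2 for s, t in zip(student_logits, teacher_logits)]
    return sum(squared) / len(squared)

# Hypothetical Teacher logits for an image of a "3" (classes 0 through 9).
# Class 3 gets the largest logit, class 8 the runner-up.
teacher_logits = [-4.0, -3.0, -2.0, 5.0, -4.0, -1.0, -5.0, -3.0, 3.4, -2.0]
teacher_probs = softmax(teacher_logits)

# A Student that has not learned anything yet outputs all-zero logits.
untrained_student_logits = [0.0] * 10
loss = logit_matching_loss(untrained_student_logits, teacher_logits)
```

Note that most of the probability mass lands on "3", with a smaller but non-negligible share on "8": this is the "threes look like eights" information that a hard label would throw away.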

A neural network getting compressed is not always a pretty sight

Now that was more or less how regular knowledge distillation works. What the authors of Learning from a Teacher using Unlabeled Data refer to as blind distillation is the following: say you have a labeled dataset of handwritten digits (MNIST) and another dataset, which may be unlabeled, of various pieces of clothing (Fashion MNIST). First, we train the Teacher network to classify images of digits using MNIST. Now we take the Fashion MNIST dataset and pass it through the trained Teacher, without modifying the Teacher further, simply keeping the resulting logits for each image of the Fashion set. This may seem counterintuitive since there is no overlap in the classes of these two data distributions, handwritten digits and clothing, but bear with me. Next, we train the Student on Fashion MNIST, trying to match the logits produced by the Teacher. However, the point is to arrive at a Student model that is capable of classifying MNIST digits, so after pre-training on Fashion MNIST (a stage termed blind distillation in the preprint), the Student is fine-tuned on the actual MNIST set. Interestingly, the accuracy of the Student on the labeled set after blind distillation alone may come close to the accuracy of the Teacher, especially for Student networks with higher capacity (i.e., larger models). The reason I call this result interesting is that here we are measuring the network's accuracy on a data distribution it has not seen before, and comparing its performance to that of another network that was specifically trained on that data distribution. To me this looks like a perfect illustration of how universal the features learned by computer vision models are across different domains and tasks. (Of course, this is also a reason why models pre-trained on ImageNet do so well in transfer learning.)
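The recipe above can be sketched on a toy problem. Everything in the snippet below is an assumption made for illustration: the Teacher and Student are tiny linear models rather than the networks from the preprint, the "datasets" are random vectors standing in for Fashion MNIST and MNIST, the loss is a plain squared error on the logits, and the final fine-tuning stage is left out.

```python
import random

random.seed(0)
DIM, K = 4, 3  # toy input dimension and number of classes

def logits(weights, x):
    """Linear model: one logit per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def predict(weights, x):
    """Index of the class with the largest logit."""
    z = logits(weights, x)
    return z.index(max(z))

# Stand-in for a Teacher already trained on the labeled set (made-up weights).
teacher = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]

# Unlabeled data from a different distribution (the role Fashion MNIST plays).
unlabeled = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(200)]

# Step 1: run the frozen Teacher on the unlabeled set and keep its logits.
targets = [logits(teacher, x) for x in unlabeled]

# Step 2: train the Student to match those logits (SGD on squared error).
student = [[0.0] * DIM for _ in range(K)]
learning_rate = 0.1
for epoch in range(50):
    for x, t in zip(unlabeled, targets):
        s = logits(student, x)
        for k in range(K):
            err = s[k] - t[k]
            for d in range(DIM):
                student[k][d] -= learning_rate * err * x[d]

# Step 3: compare Student and Teacher on inputs from a shifted distribution
# that the Student never saw (the role MNIST plays at evaluation time).
target_domain = [[random.gauss(2, 1) for _ in range(DIM)] for _ in range(200)]
matches = sum(predict(student, x) == predict(teacher, x) for x in target_domain)
agreement = matches / len(target_domain)
```

In this toy setting the Student can copy the Teacher exactly, so the agreement is expected to be very high; with real networks of different capacities that is of course not guaranteed, which is where the fine-tuning stage comes in.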

As far as the practical applications of Learning from a Teacher using Unlabeled Data go, I would be very curious to see how their results compare to knowledge distillation without the use of additional unlabeled data. Some of the outcomes listed in the paper indicate fine-tuned Student networks can actually obtain higher accuracies than those of the original Teacher models (when the two have similar capacities, however). Since the Teacher is still assumed to have been trained on a labeled set, this method in itself does not solve the problem of not having enough labeled data, but I am looking forward to exploring whether it can be incorporated into another semi-supervised learning scenario.

Another popular approach to downsizing a neural network is pruning: removing some (in some cases, even most!) connections from the network while maintaining high accuracy on the test set. The results can be impressive: for instance, it was shown that stripping an ImageNet-trained ResNet50 down to only 10% of its original weights can result in a decrease of less than 3% in top-1 test set accuracy. However, there is a catch: according to the authors of Selective Brain Damage: Measuring the Disparate Impact of Model Pruning,

Pruning has a non-uniform impact across classes; a fraction of classes are disproportionately and systematically impacted by the introduction of sparsity.

Thus, even if the overall effect on the test set appears to be small, certain types of data may be completely mishandled by the pruned network. For some domains (self-driving cars, medical applications, hiring decisions, to name just a few), the consequences can be particularly grave. Additionally,

Pruning significantly reduces robustness to image corruptions and adversarial attacks.

Another major concern for real-life use cases! Aside from the cautionary pruning tale, this preprint serves as a good reminder that the most obvious measures of a model's performance are not always enough, and that there are times when we need to look beyond those good-looking averages, especially when human lives may be affected by the model's outcome.
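Looking beyond the average can be as simple as breaking accuracy down by class. Here is a small sketch of what I mean; the labels and the predictions of the "dense" and "pruned" models are entirely made up, and are not results from the preprint.

```python
from collections import defaultdict

def accuracy(labels, predictions):
    """Overall fraction of correct predictions."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def per_class_accuracy(labels, predictions):
    """Fraction of correct predictions, computed separately for each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in sorted(total)}

# Made-up test set: two common classes (0 and 1) and one rare class (2).
labels = [0] * 45 + [1] * 45 + [2] * 10

# Hypothetical dense model: two mistakes, both on the common classes.
dense_preds = list(labels)
dense_preds[0] = 1
dense_preds[45] = 0

# Hypothetical pruned model: every mistake falls on the rare class.
pruned_preds = list(labels)
for i in range(95, 100):
    pruned_preds[i] = 0

dense_overall = accuracy(labels, dense_preds)
pruned_overall = accuracy(labels, pruned_preds)
dense_by_class = per_class_accuracy(labels, dense_preds)
pruned_by_class = per_class_accuracy(labels, pruned_preds)
```

In this invented example the overall accuracy only slips from 0.98 to 0.95, while the accuracy on class 2 falls from 1.0 to 0.5: exactly the kind of damage an average hides.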

And last week's award for best title goes to... Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Neural Networks! In addition to the cool header, this preprint was a pleasure to read (the introduction won me over early on) and discussed a timely and important issue. The capacity of deep neural networks to memorize the training data is well known. It should come as no surprise that for some models, it may be possible to recover enough of the training data from the model's parameters to identify the individuals involved in collecting the data. The preprint discusses what it means for information to be forgotten: removed from the trained network, and how this can be done in a way that concerns a particular set of data only, with minimal effect on the rest of the model.

Never mind, last week's best title is a tie. Meet CamemBERT: a Tasty French Language Model - a French version of the contextualized pre-trained word embeddings called BERT. I have already discussed both the Transformer architecture for NLP and the BERT model extensively elsewhere on the blog, so I will just leave you with this and hope you have a great, cheese-filled week: