This article is the second installment of a two-part post on Building a machine reading comprehension system using the latest advances in deep learning for NLP. Here we are going to look at a new language representation model called BERT (Bidirectional Encoder Representations from Transformers). Click here for part one, an in-depth introduction to the Transformer neural network architecture!

The written word has allowed human communication to transcend both time and distance, revolutionizing the world order in the process. For most of us understanding text seems almost too trivial of a task to reflect upon, yet there is much complexity to it. Decoding symbols to extract meaning requires not only having a wide enough vocabulary, but also the ability to choose the appropriate meaning of a word based on the context, and the ability to understand the organization of a sentence as well as a passage. One of the ways to assess reading comprehension is to pose questions based on a given text. This task is familiar to all of us from childhood, so it comes as no surprise that machine reading comprehension is also often cast as question-answering.

1. The Data: meet the SQuAD

There are multiple ways of answering questions based on a text. For instance, this may or may not involve text summarization and/or inference - tactics that are necessary when the answer is not explicitly stated in the body of the text. In the simpler case where it is, the task narrows down to span extraction.

For this machine learning task, the inputs come in the form of a Context / Question pair, and the outputs are Answers: pairs of integers, indexing the start and the end of the answer's text contained inside the Context. This is precisely how one of the largest question-answering sets, the Stanford Question Answering Dataset abbreviated as SQuAD, is organized. Here is one training sample from SQuAD 1.0:

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Question: The Basilica of the Sacred heart at Notre Dame is beside to which structure?

Answer: start_position: 49, end_position: 51

49 and 51 are the indices that span the Main Building in the fourth sentence of the Context.
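To make the indexing concrete, here is a minimal sketch that recovers the answer from the two positions. It assumes that the positions count whitespace-separated tokens starting from zero, which is consistent with this sample; the helper name extract_span is made up for illustration, and only the first four sentences of the Context are reproduced.

```python
# First four sentences of the Context from the sample above
context = (
    "Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, is a copper statue "
    'of Christ with arms upraised with the legend "Venite Ad Me Omnes". '
    "Next to the Main Building is the Basilica of the Sacred Heart."
)


def extract_span(text, start_position, end_position):
    """Return the answer text for an inclusive span of whitespace-token indices."""
    tokens = text.split()
    return " ".join(tokens[start_position:end_position + 1])


answer = extract_span(context, 49, 51)
print(answer)  # the Main Building
```

Note that the end index is inclusive, hence the + 1 in the slice.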

2. Transfer learning in NLP

Having humans read text passages, then ask and answer questions based on them, is a time-consuming and expensive task. Even the largest datasets, such as the SQuAD, rarely go beyond 100 000 training samples - which is not that much in the deep learning world. A technique called transfer learning is invaluable in situations when training data is scarce; however, its use in NLP has been limited compared to the wide success that transfer learning has enjoyed in computer vision. To a large extent, the latter has been due to the existence of ImageNet - a dataset with over 14 million images, each hand-annotated with the label of the subject that appears in the photo.

The idea behind transfer learning is simple: we develop a model for a task for which we have enough training data, then use that model (or a part of it) as a starting point for a different (downstream, or target) task, presumably one that we don't have as much data for. The tricky part? Choosing the right task for the pre-training stage. We want the resulting model to generalise beyond the original problem sufficiently well: thus, the features of the data that the model learns to identify have to be, in some sense, general. For computer vision the solution came in the form of image classification. It turned out that many downstream tasks could benefit from building upon a network that had initially been trained to recognise objects that appear on images. Conveniently, such networks can be trained on ImageNet! Unfortunately, there is no equivalent of ImageNet, a vast labeled dataset that can be used for supervised learning, for NLP. To make matters worse, it is not clear which task is best to use for pre-training. Language modeling? Natural language inference?

In this article, we will look at BERT: one of the major milestones in transfer learning for NLP. Here is the TL;DR summary for the impatient:

BERT is the Encoder of the Transformer that has been trained on two supervised tasks, which have been created out of the Wikipedia corpus in an unsupervised way: 1) predicting words that have been randomly masked out of sentences and 2) determining whether sentence B could follow after sentence A in a text passage. The result is a pre-trained Encoder that embeds words while taking into account their surrounding context. When supplemented with an additional fully connected layer, BERT was able to achieve state-of-the-art results on 11 downstream tasks at the time that it was released in 2018.

3. BERT: Bidirectional Encoder Representations from Transformers

3a. What is BERT?

Much like the title of the Attention is all you need paper, the meaning of the acronym BERT is the epitome of a spoiler. Being an Encoder of a Transformer (I bet Representation was mainly put in there to make the abbreviation work - too bad, I would have rather had Pre-trained in the name), BERT is Bidirectional by design due to the nature of the Encoder Self-Attention in the Transformer architecture. BERT seeks to provide a pre-trained method for obtaining contextualized word embeddings, which can then be used for a wide variety of downstream NLP tasks. And provide it does - at the time that the BERT paper was published in 2018, BERT-based NLP models surpassed the previous state-of-the-art results on eleven different NLP tasks, including Question-Answering. A pre-trained BERT model serves as a way to embed words in a given sentence while taking into account their context: the final word embeddings are none other than the hidden states produced by the Transformer's Encoder.

What are the tasks that BERT was (pre)trained on? It turns out that there are two. The first one, Masked Language Modeling (MLM), involves randomly masking (hiding) about 15% of the words in the text corpus used for training and having the network predict the missing words. For this, BERT (the Encoder) is supplemented by a softmax layer that assigns to each token in the vocabulary the probability of being the one that was masked. The MLM task focuses on "teaching" the relationships between words in a sentence. However, it is also important to understand how the different sentences making up a text are related; for this, BERT is trained on another NLP task: Next Sentence Prediction (NSP). Here two sentences selected from the corpus are both tokenized, separated from one another by a special Separation token, and fed as a single input sequence into BERT. What comes next is a binary classification problem: half of the sentence pairs are successive and half have been selected randomly, and the objective is to determine which is which.
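The way both pre-training tasks get their labels "for free" can be sketched in a few lines of plain Python. This is a simplified illustration rather than BERT's actual data pipeline: the function names are made up, every selected token is simply replaced by [MASK] (the BERT paper uses a slightly more elaborate replacement scheme), and the NSP helper assumes a document of at least three sentences.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels[i] holds the hidden token or None."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)      # the network sees [MASK] ...
            labels.append(token)     # ... and has to predict the original token
        else:
            masked.append(token)
            labels.append(None)      # no prediction is required here
    return masked, labels


def make_nsp_example(sentences, index, rng=None):
    """Pair sentence `index` with its true successor or with a random sentence."""
    rng = rng or random.Random()
    first = sentences[index]
    if rng.random() < 0.5 and index + 1 < len(sentences):
        second, is_next = sentences[index + 1], True
    else:
        # Any sentence except the first one itself and its true successor
        candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
        second, is_next = rng.choice(candidates), False
    return [CLS] + first + [SEP] + second + [SEP], is_next


document = [["the", "cat", "sat"], ["it", "purred"], ["dogs", "bark"], ["birds", "sing"]]
print(mask_tokens(document[0] + document[1], rng=random.Random(0)))
print(make_nsp_example(document, 1, rng=random.Random(0)))
```

In both cases the labels come from the raw text itself, which is why no human annotation is needed.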

Now this sounds more like the workings of a regular, non-masked language model

The original BERT model was developed and trained by Google using TensorFlow. More precisely, one should say models, as multiple model versions have been released:

  • BERT-Base: 12 layer Encoder, d = 768, 110M parameters
  • BERT-Large: 24 layer Encoder, d = 1024, 340M parameters

where d is the dimensionality of the final hidden vector output by BERT. Both of these have a Cased and an Uncased version (the Uncased version converts all words to lowercase).
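As a sanity check on these parameter counts, we can add up the weights by hand. The sketch below rests on assumptions that are not stated in this article: a vocabulary of 30 522 WordPiece tokens (the Uncased vocabulary), 512 position embeddings, 2 segment embeddings, a feed-forward inner size of 4d, and a pooler layer on top of the Encoder. The function name is made up for illustration.

```python
def bert_param_count(num_layers, d, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate number of trainable parameters of a BERT encoder."""
    ffn = 4 * d  # inner size of the feed-forward block (assumed to be 4d)
    # Token, position and segment embeddings, plus one LayerNorm (scale and bias)
    embeddings = (vocab_size + max_positions + type_vocab) * d + 2 * d
    # Query, key, value and output projections, each with a bias
    attention = 4 * (d * d + d)
    # Two linear layers: d -> 4d and 4d -> d, each with a bias
    feed_forward = (d * ffn + ffn) + (ffn * d + d)
    # Two LayerNorms per layer
    layer_norms = 2 * (2 * d)
    pooler = d * d + d
    return embeddings + num_layers * (attention + feed_forward + layer_norms) + pooler


for name, layers, d in [("BERT-Base", 12, 768), ("BERT-Large", 24, 1024)]:
    print(name, round(bert_param_count(layers, d) / 1e6, 1), "M parameters")
```

Under these assumptions the totals come out at roughly 109M and 335M, close to the rounded figures quoted above.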

3b. Using BERT for Question-Answering

Being a PyTorch fan, I opted to use the BERT re-implementation that was done by Hugging Face and is able to reproduce Google’s results.

In addition to providing the pre-trained BERT models, the Hugging Face pytorch-transformers repository includes various utilities and training scripts for multiple NLP tasks, including Question Answering for the SQuAD. One can download their trained BERT-based model that has already been fine-tuned for the QA task (bert-large-uncased-whole-word-masking-finetuned-squad or bert-large-cased-whole-word-masking-finetuned-squad), or use the script that they provide for the SQuAD training. In the interest of understanding the model better, let's take a closer look at some of the Hugging Face code. We'll start with the BertForQuestionAnswering class contained in pytorch_transformers.modeling_bert:

class BertForQuestionAnswering(BertPreTrainedModel):

    def __init__(self, config):
        super(BertForQuestionAnswering, self).__init__(config)
        self.num_labels = config.num_labels

Remember that each token of the sequence is going to be processed by the network. The self.num_labels is the dimensionality of the final output that we'll get for each token; in our case, self.num_labels = 2, corresponding to a token possibly being the (a) start and/or (b) the end of the answer to the question in question (pardon the pun).

        self.bert = BertModel(config)
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

The first part of the QA model is the pre-trained BERT (self.bert), which is followed by a Linear layer taking BERT's final output, the contextualized word embedding of a token, as input (config.hidden_size = 768 for the BERT-Base model), and outputting two values: the likelihood of that token being the start and the end of the answer.


    def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None,
                end_positions=None, position_ids=None, head_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)

Let's follow the computation as it is done during the forward pass of the network. First, the input sequence goes through self.bert. The output of BertModel, of which self.bert is an instance, is a tuple, whose contents actually depend on what it is that you are trying to do. What we need is the last hidden state of the BERT encoding, which is the first element of that output tuple:

        sequence_output = outputs[0]

Once we have the BERT encodings for all the elements in the input sequence, we put them through the linear layer, separating the output into start and end logits:

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        outputs = (start_logits, end_logits,) + outputs[2:]
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            outputs = (total_loss,) + outputs

        return outputs
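The loss at the end of forward is the average of two cross-entropy terms, one for the start position and one for the end position. Here is a minimal pure-Python sketch of that computation for a single example; it leaves out the clamping and ignore_index handling of the real code, and the function names and toy logits are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy terms."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# Toy sequence of five tokens whose answer spans tokens 1..2
start_logits = [0.1, 3.0, 0.2, -1.0, 0.0]
end_logits = [0.0, 0.5, 2.5, -0.5, 0.1]
print(qa_loss(start_logits, end_logits, 1, 2))  # small: the logits agree with the labels
print(qa_loss(start_logits, end_logits, 3, 0))  # larger: the logits disagree
```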

The architecture of this BertForQuestionAnswering model is captured in the Figure below, which was taken from the original BERT paper:

As the Figure shows, the input sequence for the Question Answering task has two parts: the (tokenized) Question, followed by a special [SEP] token, and then by the (once again, tokenized) Context.
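To see what such an input looks like, here is a small sketch that packs a question and a context into one sequence. It is an illustration rather than the Hugging Face preprocessing code: the function name is made up, the tokens are plain strings instead of vocabulary ids, and I am assuming the usual [CLS] token at the front, a closing [SEP] at the end, and [PAD] tokens to fill up a fixed length.

```python
def pack_inputs(question_tokens, context_tokens, max_seq_length=16):
    """Build the token sequence, segment ids and attention mask for one QA example."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("sequence too long; a real pipeline would split the context")
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    input_mask = [1] * len(tokens)
    # Pad all three lists to a fixed length; the mask marks padding with 0
    padding = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    input_mask += [0] * padding
    return tokens, segment_ids, input_mask


tokens, segment_ids, input_mask = pack_inputs(
    ["who", "sat", "?"], ["the", "cat", "sat", "on", "the", "mat"]
)
for row in zip(tokens, segment_ids, input_mask):
    print(row)
```

The segment ids tell BERT which part of the sequence is the Question and which is the Context, while the mask keeps the attention layers from looking at the padding.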

Now that we have the general idea of how BERT can be used for question answering, let us go ahead and train the model above. To make use of the Hugging Face implementation, start with installing their pytorch_transformers package:

pip install pytorch-transformers

Now copy the SQuAD train and dev datasets, as well as the evaluation script, to a directory $SQUAD_DIR. While we are at it, let's also copy Hugging Face's utils_squad module (imported in the scripts below) and bert-base-uncased-vocab.txt into the current directory.

Now let's train the model for two epochs and save the result:

import time

import os
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler

from apex.optimizers import FP16_Optimizer, FusedAdam

from pytorch_transformers import BertForQuestionAnswering, BertTokenizer

from utils_squad import (read_squad_examples, convert_examples_to_features)

num_train_epochs = 2
train_batch_size = 32

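# Directory holding the SQuAD files (the $SQUAD_DIR mentioned above);
# assumed here to be available as an environment variable
SQUAD_DIR = os.environ['SQUAD_DIR']
OUTPUT_DIR = '/root/BERT/output'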


# 1. Load the training data from JSON
train_file = SQUAD_DIR + '/train-v1.1.json'
train_examples = read_squad_examples(train_file, is_training = True, 
                                     version_2_with_negative = False)

# 2. Tokenize the training data
tokenizer = BertTokenizer(vocab_file="bert-base-uncased-vocab.txt")
train_features = convert_examples_to_features(train_examples, tokenizer, 
                                              max_seq_length=384, doc_stride=128, 
                                              max_query_length=64, is_training=True)

# 3. Get the tokenized data ready for training the model
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
all_start_positions = torch.tensor([f.start_position for f in train_features], dtype=torch.long)
all_end_positions = torch.tensor([f.end_position for f in train_features], dtype=torch.long)
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                                   all_start_positions, all_end_positions)

train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)

# 4. Initialize the BERT-based model for Question Answering
#    Using half-precision (FP16) for the model
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.half()

# 5. Prepare the optimizer (using mixed precision)
param_optimizer = list(model.named_parameters())
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]

# NOTE: the optimizer arguments were cut off in the source; the values below
# are assumptions (a commonly used fine-tuning learning rate), not the original settings
learning_rate = 3e-5
optimizer = FusedAdam(optimizer_grouped_parameters, lr=learning_rate,
                      bias_correction=False, max_grad_norm=1.0)
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

# 6. Train the model
model.train()
start_time = time.time()

for epoch in range(num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        # Move every tensor of the batch to the same device as the model
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, start_positions, end_positions = batch
        outputs = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
        loss = outputs[0]
        # The FP16 optimizer wrapper takes care of loss scaling during backpropagation
        optimizer.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    if epoch==0:
        print("Time it took to complete the first training epoch: ", (time.time()-start_time))
    print("Loss after epoch ", epoch, ": ", loss.item())

# 7. Save the trained model to OUTPUT_DIR
#    (Create the directory if it does not exist; otherwise override the contents)
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

# Unwrap the model first if it has been wrapped (e.g. for multi-GPU training)
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(OUTPUT_DIR)
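A note on the max_seq_length=384 and doc_stride=128 arguments used in step 2: a Context that does not fit into the model input together with the Question is cut into overlapping windows, each of which becomes a separate training feature. The sketch below shows my understanding of that sliding-window logic; it is a simplified illustration rather than the utils_squad code itself, and the function name is made up.

```python
def split_into_windows(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Return (start, length) pairs of overlapping windows covering the document."""
    windows = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return windows


# 384 positions minus a 10-token question and 3 special tokens leave 371 for the context
print(split_into_windows(500, 384 - 10 - 3, 128))
# [(0, 371), (128, 371), (256, 244)]
```

Because consecutive windows overlap, an answer that would be cut in half by one window boundary is still fully contained in a neighbouring window.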

Here the BERT model is being fine-tuned: meaning, the pre-trained BERT layers are not frozen, and their weights are being updated during the SQuAD training, just like the weights of the additional linear layer that we added on top of BERT for our downstream task. Given the size of BERT, the use of a GPU is all but mandatory. A single training epoch takes about 50 minutes on a Scaleway GPU. Now let's test the model that we just trained! For this, we are going to load the trained model from the $OUTPUT_DIR that we previously saved it to, and run inference on the SQuAD dev set:

import os

import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)

from utils_squad import read_squad_examples, convert_examples_to_features, RawResult, write_predictions

from pytorch_transformers import BertForQuestionAnswering, BertTokenizer

# 1. Load a trained model
#    (the directory that the training script above saved the model to)

OUTPUT_DIR = '/root/BERT/output'
model = BertForQuestionAnswering.from_pretrained(OUTPUT_DIR)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# 2. Load and pre-process the test set

dev_file = "/root/BERT/SQuAD1/dev-v1.1.json"
predict_batch_size = 32

eval_examples = read_squad_examples(input_file=dev_file, is_training=False, version_2_with_negative=False)

tokenizer = BertTokenizer(vocab_file="bert-base-uncased-vocab.txt")
# NOTE: this call was cut off in the source; the arguments below are assumed
# to mirror the settings used for training
eval_features = convert_examples_to_features(eval_examples, tokenizer,
                                             max_seq_length=384, doc_stride=128,
                                             max_query_length=64, is_training=False)

all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_example_index)

eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=predict_batch_size)

# 3. Run inference on the test set

all_results = []
for input_ids, input_mask, segment_ids, example_indices in eval_dataloader:
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    with torch.no_grad():
        batch_start_logits, batch_end_logits = model(input_ids, segment_ids, input_mask)
    for i, example_index in enumerate(example_indices):
        start_logits = batch_start_logits[i].detach().cpu().tolist()
        end_logits = batch_end_logits[i].detach().cpu().tolist()
        eval_feature = eval_features[example_index.item()]
        unique_id = int(eval_feature.unique_id)
        # Collect the raw logits of every feature for the post-processing step
        all_results.append(RawResult(unique_id=unique_id,
                                     start_logits=start_logits,
                                     end_logits=end_logits))

# 4. Turn the raw logits into answer text and write the predictions to disk
output_prediction_file = os.path.join(OUTPUT_DIR, "predictions.json")
output_nbest_file = os.path.join(OUTPUT_DIR, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(OUTPUT_DIR, "null_odds.json")

preds = write_predictions(eval_examples, eval_features, all_results, 20,
                      30, True, output_prediction_file,
                      output_nbest_file, output_null_log_odds_file, True,
                      False, 0.0)
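The write_predictions call is where the start and end logits get turned into actual answers. The sketch below shows the core idea as I understand it, not the utils_squad implementation: consider only the highest-scoring start and end positions, discard spans that end before they start or that are too long, and keep the span with the largest sum of logits. The function name and toy logits are made up; the defaults 20 and 30 mirror the two numeric arguments passed above, which I take to be the n-best size and the maximum answer length.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (score, start, end) of the highest-scoring valid span."""
    def top_indices(logits):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return ranked[:n_best_size]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that run backwards or exceed the maximum answer length
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


start_logits = [0.1, 3.0, 0.2, -1.0, 0.0]
end_logits = [2.6, 0.5, 2.5, -0.5, 0.1]
print(best_span(start_logits, end_logits))  # (5.5, 1, 2)
```

Note that the best end position on its own (index 0) is rejected here, because it would come before the best start position.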

Next, we'll run the SQuAD evaluation script on the prediction file that we produced in the previous step:

~ python SQuAD1/ SQuAD1/dev-v1.1.json /root/BERT/output/predictions.json
{"exact_match": 78.40113528855251, "f1": 86.50679014002237}
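For reference, here is a sketch of what the two reported metrics measure for a single prediction, written from my understanding of the SQuAD evaluation rather than copied from the script: both answers are first normalized (lowercased, stripped of punctuation and articles), after which exact_match checks for equality and f1 measures the token overlap. The function names are illustrative. As far as I know, the real script additionally takes the maximum over all reference answers of a question and reports averages in percent.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Number of tokens shared between prediction and ground truth
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Main Building", "Main Building"))    # 1.0
print(f1_score("Main Building dome", "the Main Building"))  # approximately 0.8
```

This explains why the f1 score is always at least as high as exact_match: a partially correct answer still earns credit for the tokens it gets right.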

If you check the current SQuAD 1.0 leaderboard, you'll see that our result puts us close to the 20th place. Not bad for under two hours and 2 euros worth of compute time - when using a Scaleway GPU ;-)

A word of caution: while BERT-based models achieve state-of-the-art (or nearly so) results on many NLP problems, for sentence-pair regression tasks this success comes at a high cost. As you may have noticed, answering a context-based question in the manner described above requires feeding both the question and the context as inputs into a (very large) BERT embedding network. This way, BERT's self-attention layers have access to both sequences, but unfortunately this setup prevents us from, for instance, pre-BERTifying large documents so that we can then ask questions about them more efficiently. Some work on finding alternative network architectures is underway, including a recent preprint that I have also mentioned in this week's inaugural Fresh from the arXiv post.

In a future article, we will learn how to use the Scaleway products and our Question Answering model to run inference on text and questions of our choice in a way that scales!