Overview

This file presents the results of automatic summarization of an online lecture video titled: DeepMind x UCL | Deep Learning Lectures | 7/12 | Deep Learning for Natural Language Processing - YouTube. The abstract is generated using LED, or Longformer-Encoder-Decoder, a state-of-the-art Transformer-based language model. This implementation uses a pre-trained model, fine-tuned on PubMed, a long-range summarization dataset. The top-ranked words/phrases and sentences are extracted from the original transcript of the video to produce a summary using TextRank, an unsupervised graph-based algorithm. The sentences for the summary are returned in the order of original occurrence in the transcript (i.e., not ranked order). Words and sentences in the summary are reduced by 57% and 70%, respectively, compared with the original transcript. Sentences are grouped into paragraphs based on their positional locations. Long paragraphs indicate several sentences in close proximity with minimal pruning between them. Short paragraphs and orphaned sentences suggest that more context may be needed. The final section is the full extracted transcript, line by line. Sentences in the 'Summary' section are hyperlinked to the 'Full Transcript' section. Sentences in the 'Full Transcript' section are hyperlinked to the video at the approximate time of utterance.

Abstract

Deep learning and language understanding is an enormous area of research in machine learning and neural computation. It has been shown that deep learning has been able to improve performance on a lot of language processing applications over the last few years, so it raises the question of why deep learning, and models which have this neural computation at the heart of their processing, have been so effective in language processing. In the first section of this lecture we give an overview of neural computation in general and language in general, and then we give some idea of why neural computation, deep learning and language might be an appropriate fit to come together and produce the sort of improvements and impressive language processing performance that we have seen over the past few years. In particular, we focus in on one particular neural language model, which we think is quite representative of many of the principles that govern all neural language models. That model is the transformer, which was released in 2018. In section three we go a bit deeper into a particular application of the transformer, the well known BERT model. BERT in particular is an impressive demonstration of unsupervised learning and the ability of neural language models to transfer knowledge from one training environment to another. And in the final section, we take a bit more of a look towards the future of language understanding and deep learning. To do that we delve into some work that's been done at DeepMind on grounded language learning, where we study the acquisition of language in deep neural networks that have the ability to interact and move around simulated environments.

Keywords/phrases

language models, word sentences, neural language models, different neural language models, input words, different language understanding models, word meanings, many neural language models, different words, word representations, masked language model prediction, word order, word senses, word, words, representing words, many other words, individual words, transformer models, distributed word representations

Summary

My name's Felix Hill and I'm going to be talking to you about deep learning and language understanding. So in the first section I'll talk a little bit about neural computation in general and language in general and then give some idea of why neural computation, deep learning and language might be an appropriate fit to come together and produce the sort of improvements and impressive language processing performance that we've seen over the last few years. In the second section I'll focus in on one particular neural language model, which I think is quite representative of a lot of the principles that govern all neural language models. And that model is the transformer which was released in 2018 and then in section three I'll go a bit deeper into a particular application of the transformer, that's the well known BERT model, and BERT in particular is an impressive demonstration of unsupervised learning and the ability of neural language models to transfer knowledge from one training environment to another. And then in the final section, we'll take a bit more of a look towards the future of language understanding and deep learning and to do that we'll delve into some work that's been done at DeepMind on grounded language learning, where we study the acquisition of language in deep neural networks that have the ability to interact and move around simulated environments. It's important to add that of course, natural language processing is an enormous field and there are many things that I'm not going to have the time to talk about during this lecture. So some of the most important ones are things like sequence to sequence models and specific applications of deep learning to neural machine translation.

So you may have heard of models like GPT-2 or BERT, or WaveNet, which was developed at DeepMind. And all of these models have done really impressive things with respect to the various aspects of language processing that they focus on. So GPT-2 as a language model is now able to produce long streams of text which look like plausible stories and BERT has led to very large improvements on many language classification tasks. So if you think about the whole panorama of different things you might be able to apply language models or language processing technology to, to a much greater extent than at any point in the past, neural computation and deep learning play a role in those systems. So on the left we have systems which are now almost entirely based on neural networks, from machine translation systems to speech synthesis systems and speech recognition, and then on the right here, it's important to note that there are still many applications which do language processing but don't use deep learning or neural networks for all of their computation or even at all. So things like home assistants, which might have to provide specific pieces of information from the internet, we're still a long way from building systems like that in an end-to-end fashion in deep neural networks. Having said that, the balance of this particular scale has moved a lot over the last few years and there's certainly a trend towards more applications of neural computation and neural networks in language processing applications.

And it's obviously not just the number of publications but the effective quality of systems and models, which seems to be improving over this time.

So, this is, sort of taken together, a bunch of evidence that, you know, deep learning has really been able to improve performance on a bunch of language processing applications and I think looking at that evidence, it raises the question of why deep learning, and models which have this neural computation at the heart of their processing, have been able to be so effective in language processing. What is it about deep learning and what is it about language which has sort of allowed this sort of effect to take place.

So the first thing about language, it's often said that language is a process of manipulating symbols or that language processing involves symbolic data, operations on symbols. But of course, those who think a little bit more about language specifically have many reasons to believe that individual words that we might be passing to these models don't seem to behave like discrete symbols exactly. If we think about the word 'face' we can find it in many different contexts in language. And we call these differences word senses, but the important thing to note about the different senses of the word 'face' is that they're not entirely different.

So this example shows, and you will see these effects if you look at many other words, that rather than discrete word senses which are orthogonal to each other, we might be better off modeling this discrepancy in meaning within individual words as operations that can interact, but are not necessarily the same.

So we've seen then that words are not necessarily best modelled as discrete symbols.

So this tells us that it can be things at one end of the sentence and things at the very other end of the sentence, which must be considered to interact in order for us to form the most satisfactory meaning when we read sequences of words.

Whereas, our knowledge of peppers will tell us that they don't typically sneeze and therefore we don't think that the pepper sneeze is a very likely state of affairs and we look for other ways to make sense of the sentence and the correct way of making sense of that sentence, of the sentence in fact is more salient to us as we process it. So that's just a thought to bear in mind when we're thinking about optimal processes of language in deep learning models. So lots of people who consider and talk about language, particularly in the wider machine learning community, consider language to be compositional in the sense that the meaning can be computed simply by elegant operations on the individual parts.

So even in something as simple as combining a colour adjective with a noun, there's all sorts of factors at play telling us exactly how those meanings combine that don't seem to be equivalent from one pair of words to the next.

This doesn't always happen when we combine words, but it does sometimes. So these are kind of wacky effects of how meanings interact when two words come together and it's not necessarily easy to explain them in a model which treated every pair of words fed into that model with exactly the same function to combine their meanings. It very much seems to me that what's instead happening is that whatever function is combining the meanings is taking into account the individual meanings of the components going into that function and, in addition, that function may well need to take into account a wider knowledge of typical things we might encounter in the world and how their properties might fit together under the constraints of the world as we know it. So just to summarise, we've seen in this section that words have many related senses and that they're not necessarily characterised as sort of perfectly idealised discrete symbols. And finally, when we're thinking about building models of how word meanings might combine, we've seen that functions that combine meanings will probably need to take into account what the inputs to those functions are in order to come up with the best bespoke way of combining for those particular words. And we've even suggested that they may well also need a wider sense of how the world works and how things can naturally fit together in order to eventually arrive at the optimal representation for the combination of meanings in each particular case. So, in the first part we talked about particular aspects of language and particular aspects of neural computation that seem to fit together in a particularly appropriate way, such that they define certain ways in which a computational model might need to behave in order to capture the ways that meaning works in language.
So in this section we're going to talk much more concretely about a specific model, which was published just a couple of years ago and has had an incredible impact on a large number of natural language processing problems from machine translation to sentence classification and essentially any problem that requires a model to process a sentence or a passage of multiple sentences and compute some sort of behavioural prediction based on that. And in this section I'll talk about the details of the transformer and just refer back to those aspects of language processing that we saw in the first section in order to give some intuition about why the transformer might be so effective when it processes language. So the transformer contains a distributed representation of words in its first layer which is something it has in common with almost any neural language models now. And what do I mean by a distributed representation of words?

But in general with language processing applications, because we have texts stored in digital form, we don't need to go through that phase and subject our model to having to learn to process pixels. And in most applications of neural language models these days, the input can either be character level, which is where we pass each unit as an individual letter, or it can be word level, which is where we split the input according to white space in the text and then we pass each of the individual words to the model as discrete different symbols. But of course, as we've talked about in the last section, a model which just takes symbols and treats them as symbols might not be optimal for capturing all of the aspects of meaning that we see in natural language. So instead of doing that, the developers of neural language models have come up with a procedure which allows the model to be more flexible in the ways in which it represents words. So let's say we do take the decision to chop up our input text according to individual words.

What we typically do in a neural language model then is pass each of the words to an input layer and that input layer contains a particular unit corresponding to each of the words in the vocabulary of the model. Now that dimension we can think of as the word representation dimension or the word embedding dimension and when the model sees a given word, we turn on the unit corresponding to that word and we leave all of the other units at zero. So we put an activation of one on the unit corresponding to the word and leave all of the others at zero, and we've marked those weights in this diagram here with yellow, and light blue shows the space occupied by the whole layer of input weights for the model.
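
The input layer described above can be sketched in a few lines of Python. This is a minimal illustration rather than anything shown in the lecture: the toy vocabulary, the dimension of 3, the random initial weights and the function names are all assumptions made for the example. It shows that putting an activation of one on a single unit and multiplying by the layer's weights simply selects that word's row of weights.

```python
import random


def make_embedding_table(vocab_size, embed_dim, seed=0):
    """Return a vocab_size x embed_dim table of small random weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(vocab_size)]


def one_hot(index, vocab_size):
    """Activation of one on the unit for this word, zero on all others."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def embed(one_hot_vector, table):
    """Multiply the one-hot input by the weight table (vector times matrix)."""
    embed_dim = len(table[0])
    return [sum(one_hot_vector[v] * table[v][d] for v in range(len(table)))
            for d in range(embed_dim)]


# Hypothetical toy vocabulary, chosen only for this sketch.
vocab = ["the", "beetle", "drove", "off"]
table = make_embedding_table(len(vocab), embed_dim=3)
vector = embed(one_hot(vocab.index("beetle"), len(vocab)), table)

# The product picks out exactly the row of weights belonging to "beetle".
assert vector == table[vocab.index("beetle")]
```

In practice the multiplication is usually replaced by a direct row lookup, since every other term in the sum is zero.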

So we might find that representing words in a space like that allows words to move together in the space if it's useful for the model to represent them as somewhat similar and to move further away in that space if it's useful for the model to represent them as different. So this gives the model the flexibility to move its representation of individual words around as it sees fit, in the best way to achieve its objective. So just to recap, this is the first layer of many neural language models, including the transformer. So if we have a total of capital V words in our vocabulary and if capital D is the dimension of the floating point vector that we're going to represent each of these words with, then the total number of weights that we have in the first layer is V multiplied by D and we end up with a D dimensional Euclidean space with which to represent these input units in the model. Now this idea of representing words or letters or whatever we take as the input units to a model in some sort of high dimensional, real valued vector space is actually quite an old idea. If we go back to 1991, Miikkulainen and Dyer produced a neural language model with much less computational power than current models have, but it still tried to execute this principle of representing input words in this distributed geometric space. And it was able to exhibit certain types of interesting generalisation when trained on real texts that a model which represented words as individual discrete symbols wouldn't be able to represent or achieve. And in this paper Elman analysed the distributed representations corresponding to lots of different words as he trained the model on sequences of, sort of, subject verb object style snippets of natural language. And the objective of this model was just to represent a sequence of words such that the model was able to predict the next word with as much accuracy as possible.
And what Elman found when he analysed the way that the model was representing these words internally was that the words in his vocabulary started to cluster together in this geometric space such that the words with similar meanings came together. And this tells us that neural language models, as they experience more and more text, start to slowly infer the underlying structures in language which we might be able to perceive as language users, such as subject, object, verb and how things fit together like that, as well as an emergent categorical semantic structure where we see that certain classes of different types of words naturally fit together. Distributed representations of words have been a part of neural language models, as I pointed out, since the early nineties.

So this then gives us a probability distribution, a set of weights between zero and one. So for a given word 'beetle', we get a set of weights, one for each of the words in the input, telling us to what extent there is a strong interaction between the word 'beetle' and that other word.

So what we end up with then for each word like 'beetle' is that we take a small amount of the value of each of the other words plus some of the value of the word 'beetle' through to the next layer of the transformer. So notice that having performed this transformer layer, we haven't reduced the number of embeddings in the model in any way, we still have a representation corresponding to the word 'beetle' that we started with, but that representation has been updated or modulated, conditioned exactly on information about how well it corresponds or how well it should interact with all of the other words in the input.
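
The weighting and mixing step described here can be sketched as follows. This is a simplified, single-head illustration under stated assumptions: the word vectors are made up, the same vectors are used directly as queries, keys and values (a real transformer layer applies learned projections first), and the division by the square root of the dimension follows the original transformer paper rather than anything in this summary.

```python
import math


def softmax(scores):
    """Turn raw scores into weights between zero and one that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """For each word, take a weighted mix of the values of all the words."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for query in queries:
        # One weight per input word: how strongly does it interact with this word?
        weights = softmax([dot(query, key) / scale for key in keys])
        mixed = [sum(w * value[d] for w, value in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional vectors for a three-word input.
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
outputs, weights = self_attention(vectors, vectors, vectors)

# One updated representation comes out for every word that went in.
assert len(outputs) == len(vectors)
```

Each row of `weights` is the probability distribution for one word, and each row of `outputs` is that word's representation after it has been modulated by the others.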

But it does give the model many independent ways with which it can represent interactions between the words and the inputs.

Well, in the examples I gave in the previous section about language, one thing that should have maybe come across is the importance of, or the role of our expectations in forming a consistent representation of what a particular input is.

And so these sorts of top down influences, our expectations influencing how we actually combine the inputs in language, are really common in many different contexts. And if you think about skip connections, it's not a perfect model of this, but it does give the transformer a rudimentary ability to allow its representations of things at a higher level of processing to interact with its representations of things at a lower level of processing. At that point after computing many different interactions, the model might form a consistent sense of the fact that a meaning needs to be understood in a particular way, but of course those top down influences tell us that that expectation of what the meaning might be should actually feed back and allow us to remodulate how we understand the input.
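
A skip connection itself is a small operation, and a rough sketch may help make the idea concrete. The function name and the stand-in layer below are hypothetical; the only thing the sketch is meant to show is that the lower-level representation is added back onto whatever the layer computes, so both levels are present in what the next layer sees.

```python
def with_skip_connection(layer, vectors):
    """Add each input vector back onto the layer's output for that position."""
    outputs = layer(vectors)
    return [[x + y for x, y in zip(inp, out)]
            for inp, out in zip(vectors, outputs)]


def toy_layer(vectors):
    """Hypothetical stand-in for an attention or feed-forward layer."""
    return [[2.0 * x for x in vector] for vector in vectors]


inputs = [[1.0, 2.0], [0.5, -1.0]]
combined = with_skip_connection(toy_layer, inputs)

# Each result is the original input plus the layer's transformation of it.
assert combined == [[3.0, 6.0], [1.5, -3.0]]
```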

Now, if you were paying attention during the explanation, you may well have noticed that none of the operations that I described on the input words took into account the actual relative order of the words in the input. So there was no way that a model like this would have any ability to express the fact that certain words appear close together in the input or certain words appear further apart. And of course we know in language that the word order can tell us some important things about what the overall sentence means. So positional encoding is just a way of determining a set of scalar constants which are added to the word embedding vector, let's say in the lowest level of the transformer, before the first self-attention layer but just after the word embedding layer. Those scalars combine with the word embedding to mean that if a word appears in a particular slot in the input, regardless of the fact that its embedding weights will necessarily be the same, the actual effective representation that the transformer sees will be slightly different depending on where it appears in the input. So to achieve this sort of thing, the model just needs a set of small scalars which are different in each of the possible locations that a word could appear in the input. And they use a nice sinusoidal function which has various properties which may well be more desirable than just allowing the model to discriminate words according to their position. Because in fact that sinusoidal function gives the model a slight prior to pay attention to relationships of a certain wavelength, a certain distance across the input, and each particular unit in the embedding representation can then specialise at recognising interactions or correspondences at a different distance from a given word.
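
The sinusoidal scheme can be sketched as below. The formula, including the constant 10000, is the one given in the original transformer paper; the function names and the example embedding are assumptions made for this illustration. Even-numbered units use a sine and odd-numbered units a cosine, and each pair of units has its own wavelength.

```python
import math


def positional_encoding(position, embed_dim):
    """Sinusoidal scalars for one position; each pair of units has its own wavelength."""
    encoding = []
    for unit in range(embed_dim):
        pair = unit // 2
        angle = position / (10000 ** (2 * pair / embed_dim))
        encoding.append(math.sin(angle) if unit % 2 == 0 else math.cos(angle))
    return encoding


def add_position(word_vector, position):
    """Add the positional scalars to a word embedding, unit by unit."""
    code = positional_encoding(position, len(word_vector))
    return [w + p for w, p in zip(word_vector, code)]


# Hypothetical embedding for one word, seen at two different positions.
beetle = [0.2, -0.1, 0.4, 0.3]
first = add_position(beetle, 0)
fifth = add_position(beetle, 4)

# Same embedding weights, but a different effective representation per slot.
assert first != fifth
```
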
So, if you think about models like a recurrent neural network or an LSTM, those models have the notion of order built in because they process input sequentially, one word after the other, according to a process transitioning the state from its position after reading one word to after reading two words, to after reading three words, to after reading four words. The transformer, in its native functional form, has no awareness of word order and we have to add on these additional positional encodings to give the model a weak awareness of word order. But the transformer actually performs better than RNNs and LSTMs on a lot of language tasks and this maybe tells us that it's easier to learn the notion of word order for the number of cases where it's actually important in language than it is to be given the notion of word order automatically, but to have to learn the very difficult process of paying attention to things a long time in the past. With a transformer, the path that the gradient has to go through is much shorter because there's no prior favouring of things which are close together. Instead, the gradient path that the model needs to go through to connect any two words in the input is equivalent, and in fact it is shorter on average than it is in recurrent neural networks. So just to summarise this section: we saw in the previous section that words shouldn't necessarily be thought of as independent discrete symbols and that disambiguating their meaning can depend a lot on the context, but not only on the immediate context which is closest to those words, but on potentially distant context, on the information encoded in words a long way away.
We've also seen that functions which models use in order to combine the meaning of two words should take into account the meaning of those words and if possible, take into account a wider general knowledge of how things typically combine in order to allow that to modulate the interactions between the words coming in, and we can think about the architectural components I talked about in the transformer in those terms. The multi-head processing is one way of getting at this notion that words are not discrete symbols because it naturally gives the transformer, even in a single feed forward pass, the opportunity to represent each word at each layer with n, let's say four, different possible contextualised representations. And of course, going back longer term, just the general notion of representing words as distributed representations and allowing words with similar meanings to occupy local areas in a large geometric vector space also allows the model to express this non-discrete nature of word meaning in a very eloquent way. So interactions over words that are next to each other are not particularly favoured over minor interactions. Another fact is that the more layers we have, the more chance the model has to learn, as it moderates this representation of different things, how interactions might take place at different levels of abstraction as the model continues to reprocess the input. The multiplication of a matrix by a vector is precisely the operation of a function which combines word meanings according to the meanings of the words themselves, and those operations are common in many neural language models, but are a really important part of the transformer architecture.
So hopefully this section has given you some intuition about how a transformer works, but also some intuition about maybe why it works, why it is that the various components in the transformer improve on a model's ability to process language because of the way that we think meaning works in a very sort of intricate and interactive way when we understand linguistic input. In the last section, we introduced the transformer and we talked about how various components within the transformer combine to make it a very powerful process, a very powerful model for processing sentences and combinations or sequences of words. It's a way of training transformer models in order to allow them to excel at a wide range of different language tasks. But before I do that, I also just want to go back to our points about the nature of language and discuss one more issue which I think is quite motivating when we think about how transformers are applied in the model that I'm going to talk about in this section. But of course when we start to process and make sense of these sentences, it feels very clear to us as native English speakers, that there's quite a difference in the way that the words in those sentences have to relate to each other in order for us to sort of construct the meanings in our head.

And then of course there's again, other than that background knowledge of how the world works, how fruit flies are, there's also this kind of more linguistic knowledge of sentences we may well have already previously understood in which the meaning seems to combine in a similar way to 'fruit flies like a banana'.

And those are a lot of the places where such a model needs to get its understanding of the world and its understanding of language, and those considerations lead us to add a fifth point to these many characteristics of language, which is that when we actually form an understanding, you know, it really does seem to be a process of balancing our existing knowledge, and that could be knowledge of language and also knowledge of the world, with the input, with the particular facts of the thing that's currently coming into the model. But the key insight with BERT is that rather than training a transformer just to understand the inputs to the sentences which the model is currently considering, a process of pre-training takes place in which the weights within the model are endowed with knowledge of a much wider range of text in this case, which can plausibly give the model that background knowledge which is really necessary for forming a coherent understanding of the total of the different types of sentences a language understanding processor needs to be able to understand. So the important thing to remember when considering how BERT works is that a transformer as described in the previous section really is just a mapping from a set of distributed word representations to another set of distributed word representations. And corresponding to each of the words in the input, the exact same set of words that were the input.

So given the fact that a transformer is just a mapping from a set of word representations to a modified set of word representations of the same length, there's quite an easy way in which we could train such a model in order to extract knowledge from an enormous amount of text that we might just have lying around. So in particular, the insight from BERT is precisely how can we get knowledge into the weights of such a model without requiring problems or data which have been labeled by human experts or some other mechanism in order to give the model sort of knowledge of what's the right classification or what's the right answer to make. So how can we get knowledge into a transformer model in an unsupervised way? So the way this works is the following: the authors just considered the problem of mapping a particular sequence of words to the exact same sequence of words. And instead of having to predict all of the words in the input sentence, the model just has to make a prediction, conditioned on the output embedding for the missing word, of what that missing word was. So it just has to answer the question, you know, here's a sentence with a missing word in it, 'sucking up '...' from words', and the model just has to make the prediction that the missing word in that case is knowledge. And when training the model, the authors of BERT do that with 15% of words at random. So they present sentences from any particular place where we might be able to get running text, and the authors mask out words with a probability of 15% and then ask the model to make a prediction and backpropagate the cost, which is essentially the negative log likelihood of the model having predicted that word over all of the other words in its vocabulary.
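
The masking procedure and its cost can be sketched roughly as follows. This is a simplified illustration, not the BERT authors' code: the `[MASK]` symbol is the one used in the BERT paper, but the function names, the seed and the made-up predicted probabilities are assumptions for the example. The published recipe also sometimes substitutes a random word or leaves the chosen word unchanged, a detail that is omitted here.

```python
import math
import random

MASK = "[MASK]"


def mask_words(words, mask_prob=0.15, seed=0):
    """Hide each word with probability mask_prob and remember what was hidden."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, word in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = word  # the word the model must predict here
        else:
            masked.append(word)
    return masked, targets


def negative_log_likelihood(predicted_probs, target_word):
    """Cost for one masked position: minus the log probability of the true word."""
    return -math.log(predicted_probs[target_word])


words = "sucking up knowledge from words".split()
masked, targets = mask_words(words, seed=1)

# Hypothetical model output over a tiny vocabulary for one masked position.
predicted = {"knowledge": 0.7, "water": 0.2, "air": 0.1}
loss = negative_log_likelihood(predicted, "knowledge")

# A confident, correct prediction costs less than an unsure one.
assert loss < negative_log_likelihood({"knowledge": 0.1, "water": 0.9}, "knowledge")
```

During training this cost would be summed over all masked positions and backpropagated through the transformer's weights.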

In this case, the model really does just need to retain which word is in a particular point in the sentence and at the output representation corresponding to that point, conditioned on that, make a prediction of what that word was.

But in order for BERT to be an effective language processor, the authors wanted it to also be aware of the flow of how meaning works on a longer scale than just within a particular sentence. So in order to achieve this, they came up with an additional mechanism for training the weights in the BERT model, which is complementary to the masked language model objective.

Then as input to the model, the model is presented with not one sentence, but two sentences in this case and so there's the additional input token, then there's the first sentence, then a separation token, and then the second sentence, and that's all passed to the transformer and it's processed through in parallel.
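
The two-sentence input described above could be assembled along these lines. The `[CLS]` and `[SEP]` names are the special tokens used in the BERT paper; the function name and the example sentences are assumptions made for this sketch. The layout follows the description given here, whereas the released BERT format also appends a closing separator after the second sentence.

```python
def pack_sentence_pair(first, second, start="[CLS]", separator="[SEP]"):
    """Lay out: additional input token, first sentence, separator, second sentence."""
    return [start] + list(first) + [separator] + list(second)


# Hypothetical sentence pair, already split into words.
sentence_a = ["the", "beetle", "drove", "off"]
sentence_b = ["it", "turned", "left"]
packed = pack_sentence_pair(sentence_a, sentence_b)

# One extra token at the start and one separator between the sentences.
assert len(packed) == len(sentence_a) + len(sentence_b) + 2
```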

So by combining next sentence prediction and masked language modeling, the weights of this large transformer, the BERT transformer, gradually start to acquire knowledge of how words typically interact in sentences, maybe abstract knowledge of the typical ways in which meaning flows through sentences. And of course, in the spaces in which they represent each of the individual words at various levels of the stack, things start to happen like words that have similar meanings start to come close together. The model might need to separate them out into the different parallel heads if words have various different senses. And so, a lot of the general knowledge that we talked about being very necessary for forming a consistent and coherent representation of loads of different language sentences can start to be introduced into the weights of this model as it trains according to these unsupervised objectives.

And in this case, the way that BERT is then evaluated is by taking its knowledge in all of those ways and using that as a starting point to train on many specific language understanding tasks. So, in order to apply BERT to these tasks, the BERT weights, which are trained on all of the unsupervised objectives, are then taken and the data specific to each of these tasks is passed through the BERT model and then the BERT weights are updated according to the supervised learning signal from these actual specific language understanding tasks. And it's also necessary in many cases when fine tuning in this way to add in a little bit of machinery onto the top of BERT because, you know, in the standard BERT architecture, it just outputs a number of distributed representations at the top of this transformer model. So just doing this massively improves the performance of any models which aim to exhibit some sort of general understanding of language. What I mean by that is, for any model which is intended to be trained on a wide range of different tasks, using the BERT style approach, so transferring knowledge from enormous running text corpora via fine tuning to those specific tasks, has led to really strong and significant performance on a large number of these tasks.
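
As a rough picture of what that extra machinery might look like for a classification task, the sketch below puts a small linear layer and a softmax on top of the output representation at the first position. This is an assumption for illustration: the summary does not say which machinery is added, and the weights, names and numbers here are made up. During fine tuning, both this head and the underlying weights would be updated from the supervised signal.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(output_vectors, head_weights, head_bias):
    """Apply a linear layer to the first output vector, then a softmax."""
    summary = output_vectors[0]  # representation at the first input position
    scores = [sum(w * x for w, x in zip(row, summary)) + b
              for row, b in zip(head_weights, head_bias)]
    return softmax(scores)


# Hypothetical transformer outputs: one 2-dimensional vector per input token.
outputs = [[0.3, -0.2], [1.0, 0.5], [0.1, 0.1]]
# Hypothetical head for a two-class task: one row of weights per class.
head_weights = [[1.0, 0.0], [0.0, 1.0]]
head_bias = [0.0, 0.0]
probabilities = classify(outputs, head_weights, head_bias)

# The first class scores higher because 0.3 is greater than -0.2.
assert probabilities[0] > probabilities[1]
```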

Now a few years before BERT, a model called ELMo and a couple of other models started to show that there was some promise in sharing more than just those word embedding weights, but actually sharing a large amount of functions which learned to combine weights when pretrained on some task agnostic objective and transferred to specific tasks. So we've now acquired five interesting principles of how language and meaning seem to interact when we understand the sentence. And we've added this fifth one: understanding is balancing input with knowledge that we've had already or our general knowledge of the world. And we've talked about BERT as a mechanism for endowing models with something like a general knowledge that may be necessary. And we've shown that, in fact, indeed it is very important on a lot of language understanding tasks to have this sort of prior knowledge acquired from a massive range of different experiences and different types of texts. So in the next section we'll look a bit forward to other sources of information which may plausibly be useful for different language understanding models because of course, BERT only has the means to acquire knowledge through text whereas if you think about the fruit fly example or time flying like an arrow, those sorts of examples tell us that there are many other sources of information that we may have used in order to gain the general conceptual or world knowledge required to actually make sense of language. So in the last section we saw how the BERT model is a really exciting example of transferring knowledge from an enormous amount of text, to apply that knowledge to very specific language tasks that maybe have a small amount of data from which to learn.
And this works in part because of the critical importance of general knowledge in understanding language, and we need ways in which models can acquire general principles of how language works and how word meanings fit together in order to make high quality predictions for a range of different language tasks. Now in this section we're going to talk about further ways in which we might be able to endow models with general or conceptual knowledge which they can then apply to language related tasks. And in particular in a way that's not accessible to the BERT model, which is the ability to extract general and conceptual knowledge from our surroundings, which is something that as humans we are doing all the time. So as well as the objectives of masked language modeling and next sentence prediction that we saw with BERT, there are also exciting techniques in the field of computer vision that often involve things like masking parts of an image and making predictions about whether or not that part of the image is the correct part, or about which pixels would most appropriately fit into that part of an image, or maybe contrasting incorrect parts of images with correct parts of images and things like that. And, of course, when it comes to jointly learning language and behaviour, which often involves reinforcement learning, there are techniques for having agents develop a more robust understanding of their surroundings and possibly import what's known as a model of their world.

Whether or not this knowledge would be knowledge which could serve the agent's ability to understand or use language. And the way we did that was, as well as creating loads of random rooms with different objects positioned in different places in this simulation, we also created a bunch of questions such that for any random room that was created the agent could find in the environment questions which could plausibly be answered, so examples of the sorts of questions we asked were things like, 'what is the colour of the table?' And importantly, being able to answer these questions requires a particular type of knowledge, that's propositional knowledge, the ability to tell whether something's true or false in our environment, and that's often contrasted, especially by philosophers, with procedural knowledge, which is just the sort of instinctive knowledge that maybe a reinforcement learning agent would naturally have when it learns to solve control problems in a very fast and precise way.

So the learning algorithm just has to find a way based on that experience to aggregate general knowledge into the agent such that when the QA decoder is cued with the state of the agent at the end of the episode and cued with the question, it's possible to combine those two pieces of knowledge and answer with 'dinosaur'. So to do this effectively, the agent needs a large amount of general knowledge about how things are arranged in the environment around it such that the QA decoder can take that knowledge and make predictions about the answers to questions.

A little bit like the sort of masked language model prediction that BERT's making, but where, given a certain time point in the episode, a predictive engine, an overshoot engine, takes the current memory state of the agent at that time point and rolls forward in time.

So that's just a small insight into work that's going on in DeepMind where we're starting to consider how we can aggregate knowledge from the general environment as well as knowledge from large amounts of text into a single model which can start to combine this sort of conceptual understanding and general knowledge understanding and a really strong understanding of language into a single agent which can come up with a coherent and strong ability to form the meaning of statements and sentences and also to take that knowledge to answer questions, to produce language and to enact policies, enabling it to do things in complex environments. So we've talked about various aspects of language which make neural networks and deep networks particularly appropriate models for capturing the way that meaning works. So in particular, we raised the fact that words are not discrete symbols but actually almost always have some set of different related senses, that disambiguation is a huge part of understanding language, and that that can critically depend very often on context.

And so if we look at these features or these aspects of language, the mechanisms that I've discussed today cover them reasonably well and hopefully they shed some light on why neural networks and interactive processing architectures that obey the sort of disparate principles of neural computation and distributed representations are particularly effective for language processing. But of course, it should be said that there are many aspects of language processing that the work I've talked about just doesn't start to approximate, doesn't start to capture. So our models are not currently able to do things like understanding the intentions of others or reflecting on how language is used to communicate and do things. And, you know, we need to make a lot more progress in these areas if we're actually going to arrive at agents which are truly able to understand language. So yeah, just as a final note, I think it's interesting that before deep learning really exhibited its success on language processing problems, a typical view of language understanding was what I call the pipeline view, which was that each part of processing language, from the letters to the words, to the syntax and to the meaning and then eventually to some prediction, could be thought of relatively independently as a separate process. But now that we've reflected on how language works and in particular taking in all of the evidence from the effectiveness of different neural language models on language processing tasks, I think maybe this is a more effective or more realistic schematic of how language processing should be thought of. Those two things input to our system but critically, it's that input combined with our general background knowledge of the world and knowledge of language, which together allow us to arrive at some sort of plausible meaning for everything that we hear or everything that we might say.
So anyway, I hope you've enjoyed this lecture and it's given you some insight into why language and language understanding is such an interesting problem for computational models to try and tackle.

Full Transcript

Hello and welcome to the UCL and DeepMind lecture series.

My name's Felix Hill and I'm going to be talking to you about deep learning and language understanding.

So here's an overview of the structure of today's talk.

It's going to be divided into four sections.

So in the first section I'll talk a little bit about neural computation in general and language in general and then give some idea of why neural computation, deep learning and language might be an appropriate fit to come together and produce the sort of improvements and impressive language processing performance that we've seen over the last few years.

In the second section I'll focus in on one particular neural language model, which I think is quite representative of a lot of the principles that govern all neural language models.

And that model is the transformer which was released in 2018 and then in section three I'll go a bit deeper into a particular application of the transformer, that's the well known BERT model, and BERT in particular is an impressive demonstration of unsupervised learning and the ability of neural language models to transfer knowledge from one training environment to another.

And then in the final section, we'll take a bit more of a look towards the future of language understanding and deep learning and to do that we'll delve into some work that's been done at DeepMind on grounded language learning, where we study the acquisition of language in deep neural networks that have the ability to interact and move around simulated environments.

So that's the overall structure.

It's important to add that of course, natural language processing is an enormous field and there are many things that I'm not going to have the time to talk about during this lecture.

So some of the most important ones are things like sequence to sequence models and specific applications of deep learning to neural machine translation.

Also speech recognition and speech synthesis are really important applications that I won't have time to talk about.

And then there's many NLP tasks which I also won't get the chance to delve into, from machine comprehension to question answering and dialogue.

And even in grounded language learning in the last section, I won't get the chance to go into things like visual question answering, video captioning.

So in short, the important thing to take away is that I'm not going to have the chance to cover all aspects of natural language processing.

I'm going to just talk about a few focused areas and that's because I think they're quite representative and they hopefully convey the key concepts.

I mean, it's not because I think they're more important or more valid than any other areas.

Cool.

So, let's start off with a bit of background about deep learning and language and how they might fit together.

Of course, where we are now is that, there's been a load of impressive results relating deep learning to natural language processing in the last few years.

So you may have heard of models like GPT-2 or BERT, or WaveNet, which was developed in DeepMind.

And all of these models have done really impressive things with respect to the various aspects of language processing that they focus on.

So GPT-2 as a language model is now able to produce long streams of text which look like plausible stories and BERT has led to very large improvements on many language classification tasks.

And of course WaveNet has led to fantastic performance in speech synthesis where we're now able to synthesize voices for various speech applications with much more fidelity than was previously possible.

So it's really like an exciting era for natural language processing and we're moving at a rate of progress which is possibly unprecedented at least in recent years.

So if you think about the whole panorama of different things you might be able to apply language models or language processing technology to, to a much greater extent than at any point in the past, neural computation and deep learning play a role in those systems.

So on the left we have systems which are almost now entirely based on neural networks from machine translation systems to speech synthesis systems and speech recognition and then on the right here, it's important to note that there are still many applications which do language processing but don't use deep learning or neural networks for all of their computation or even at all.

So things like home assistants, which might have to provide specific pieces of information from the internet, we're still a long way from building systems like that in an end-to-end fashion in deep neural networks.

Having said that, the balance of this particular scale has moved a lot over the last few years and it's certainly a trend towards more applications of neural computation and neural networks in language processing applications.

And it's not just in practical applications.

In the slightly more focused world of research, we see a similar trend.

So this is data from 2010 to 2016 and it covers submissions to two of the main language processing conferences, ACL and EMNLP.

And on the chart, you just see the number of papers published at those conferences for which the word 'deep' or 'neural' is found in the title.

And you can see that back in 2010 there was close to or effectively zero papers with those words in the title.

But by the time we got to 2016 this number has scaled up rapidly.

And of course there's a very good chance that, if we looked at the data up to 2020, we would just expect this trend to continue in that time.

And it's obviously not just the number of publications but the effective quality of systems and models, which seems to be improving over this time.

So here's just a snapshot in time, 2018-19, of how well the best model was able to perform on the GLUE benchmarks.

So the GLUE benchmarks are just sort of intended to be representative of language classification challenges, things like reading a couple of statements and saying whether one of them entails the other, or maybe classifying them as positive sentiment or negative sentiment, things like that.

So our ability to do those sorts of things automatically has rapidly increased according to this benchmark, just between 2018 and 19.

You can see the rate of improvement from under 60% performance to over approximately 85% performance.

And again, on this task, on this benchmark, the performance has just increased and increased up to the present date.

So, this is, sort of taken together, a bunch of evidence that, you know, deep learning has really been able to improve performance on a bunch of language processing applications and I think looking at that evidence, it raises the question of why deep learning, and models which have this neural computation at the heart of their processing, have been able to be so effective in language processing.

What is it about deep learning and what is it about language which has sort of allowed this sort of effect to take place.

And of course, if we can answer that question, if we can understand that, then that can help us to think a little bit about ways to improve things further.

But of course, in order to understand that, we really need to think a bit about language.

So in the other lectures in this series, I think you would have had a very comprehensive introduction to deep learning, neural networks and the principles of neural computation.

In this lecture, I'd like to just spend a bit of time to think about language in itself so we can start to think about why these two paradigms fit well together.

So the first thing about language, it's often said that language is a process of manipulating symbols or that language processing involves symbolic data, operations on symbols.

So, you know, if we had a sentence coming into a network and a sentence coming out of a network, then one characterisation of that problem is of mapping symbols to symbols, these very discrete units.

But of course, those who think a little bit more about language specifically have many reasons to believe that individual words that we might be passing to these models don't seem to behave exactly like discrete symbols.

So let's just consider an example.

The word 'face'.

If we think about the word 'face' we can find it in many different contexts in language.

So in the sentences 'did you see the look on her face?', 'we could see the clock face from below', 'it could be time to face his demons' or 'there are a few new faces in the office today'.

And as we think about those uses of the word 'face', we get some sense that they are different in meaning or different in usage.

And we call these differences word senses, but the important thing to note about the different senses of the word 'face' is that they're not entirely different.

So it's not the case that we should model these as entirely independent symbols, which we would like to pass through a model.

In actual fact, what we find when we delve into the meaning is that these meanings have certain aspects in common, but they're just not identical.

So if we think about the first case of 'face', we might think of that as the most prototypical meaning.

And of course, that's just the face that you can see in front of you.

Now the face, which is the most important side of somebody's head, the site for the eyes and the nose.

So yeah, as well as being the most important side of somebody, it's something that represents them and it's something which is used to inform or communicate with other people.

So if we think of all of those as sort of features or properties of this sense of 'face' and then when we think about the sense 'the clock face', we see that it actually shares some of those properties but not all of them.

So it's also the most important side of the clock, clock face, and it's also the side that's used to communicate or inform others by conveying some information.

And then if we think about this notion of face as a verb, 'to face your demons', again, there's this idea that when you face somebody, you point your literal face directly at them.

So it conveys this same sense of the pointing aspect of face that's also conveyed in the core prototype.

And then finally, if we think about the idea of 'new faces in the office today', then it conveys this sense of identity or self, which is also potentially shared by the core meaning of 'face'.

So this example shows, and you will see these effects if you look at many other words, that rather than discrete word senses which are orthogonal to each other, we might be better off modeling this discrepancy in meaning within individual words as operations that can interact, but are not necessarily the same.

Okay.

Now, when we think about the fact that each word could have many of these different senses, how could a processor possibly make sense of this sentence?

How could it possibly disambiguate the possibilities for the different senses in order to come to one coherent understanding of a phrase?

Well, one of the ways in which we do that is of course by using additional information separate from that particular word.

So we use the wider context.

And to give just a small example of how our language processing really depends on context, consider this example.

It actually goes back to Rumelhart in the mid to late seventies.

So he noticed that if we had some handwriting like this, 'Jack and Jill went up the hill', we can read it very quickly.

And in the bottom case, 'The pole vault was the last event'.

We can read that just as easily.

But if you look at the areas highlighted in red here, you'll see that there's actually a character which is identical in both cases.

And it's arguably midway between an EV and a W. But when we read the sentence seamlessly, we just interpret this character, which could potentially be ambiguous, in the way which fits most naturally with the whole of the wider sentence around it.

So this sort of example intuitively gives us some justification for thinking, well, maybe it's the interactions between the individual tokens that we're looking at and all of the things around them, which actually allow us to solve the mystery of which possible sort of sense of a word we might be looking at at any one time.

And in particular, what we probably do according to this example at least, is think about the whole sentence and think what's the most likely interpretation of the whole sentence and that in itself informs the individual interpretation of the particular characters where ambiguity might be.

Another classic example of this phenomenon can be simply gained by reading the following symbol on your screen, the following image by reading it across or down.

So obviously as we read it down, the character at the very center of the image looks very much like a 13, but as we read it across it looks clearly like a B, and this tells us the extent to which, even in our very early perceptual processes, the context is informing the ways in which we map what we're seeing into things further inside our processor, which might be our memories of existing symbols like 13 or B in this case.

So we've seen then that words are not necessarily best modelled as discrete symbols.

And we've also seen that in order to disambiguate meanings that naturally fit between these word-like things, we'd be better off considering wider context in order to modulate those computations.

Another very important fact about language is that the important interactions which we may well need to model, can very often be non-local.

So it can be things that are not very close together, which we have to capture the interactions between.

Classic example is sentences a bit like this: 'The man who ate the pepper sneezed'.

So even though 'the pepper sneezed', that part of the sentence, is contiguous, we, as we read the sentence, know that it's in fact the man who sneezed, we know that this sort of image characterises what happened when we read this sentence and that there isn't necessarily anything to do with pepper to be seen apart from the fact that that's something that the man just ate.

So this tells us that it can be things at one end of the sentence and things at the very other end of the sentence, which must be considered to interact in order for us to form the most satisfactory meaning when we read sequences of words.

However, there are of course other factors at play.

So consider the sentence, 'the cat who bit the dog barked'.

Now it's actually the case that people are much slower to make sense of this sentence than 'the man who ate the pepper sneezed' even though they have exactly the same overall structure and length.

Eventually upon thinking about it, we do realize that in the sentence 'the cat who bit the dog barked' it's actually unusually the cat which does the barking.

But our difficulty to process this also tells us that many factors are at play.

So in particular it seems to be that the three word phrase, 'the dog barked', seems to capture our attention and we sort of have an urge to consider that it's actually the dog barking, in a way that's stronger than in the other case, where we don't have such a strong urge to consider that it's the pepper that sneezed.

Now where might those urges come from?

And can we capture those in our computational models?

Well, these sorts of examples seem to tell us that those urges can come from our underlying understanding of the world, our understanding of the meaning of dog and barking and the fact that those are very likely to come together and form and describe a particular situation.

Whereas, our knowledge of peppers will tell us that they don't typically sneeze and therefore we don't think that the pepper sneezing is a very likely state of affairs, and we look for other ways to make sense of the sentence, and the correct way of making sense of the sentence in fact is more salient to us as we process it.

So that's just a thought to bear in mind when we're thinking about optimal processes of language in deep learning models.

And then there's another final point.

So lots of people who consider and talk about language, particularly in the wider machine learning community, consider language to be compositional in the sense that the meaning can be computed simply by elegant operations on the individual parts.

But when we actually consider how meanings combine, the picture is a little bit less clear and it seems very likely that whatever we do to combine meanings very well ought to take into account exactly what those meanings are.

And it shouldn't operate arbitrarily on any different set of inputs.

It should be a function which really takes into account the individual meanings in a particular scenario before deciding the best way to combine those meanings.

Just to justify that, consider the following example, here's a characteristic image of something that's red.

But if you look at red wine, none of us would find that unusual, but of course the colour of that wine is much darker.

It could even be black in that particular image.

And here's a red pen.

Our experience of pens tells us that even red pens needn't be at all red when we look at them from afar, it's only the ink that comes out of them that needs to be red.

So even in something as simple as combining a colour adjective with a noun, there's all sorts of factors at play telling us exactly how those meanings combine that don't seem to be equivalent from one pair of words to the next.

Things get even more wacky in certain cases.

So here's a classic example about pets.

If we think about the prototypical pet, it's probably black or white or brown because obviously dogs and cats have those sorts of colours.

If you think about fish, then maybe a classical fish we'd think about would be silver or gray, slippery in that way.

But when we think about pet fish, this sort of magic seems to happen where our typical pet fish has lots of bright colours.

It could be orange, green or purple or yellow.

So something seems to have happened in our mind to allow these strong features to come into the representation of pet fish which didn't play a strong role in our representations of either pet or fish.

This doesn't always happen when we combine words, but it does sometimes.

Another example would be this representation of plant, which might typically look something like that, and our representation of carnivore, which might look a bit like that, but our representation of carnivorous plant has this additional feature about eating insects.

So these are kind of wacky effects of how meanings interact when two words come together and it's not necessarily easy to explain them in a model which treated every pair of words fed into that model with exactly the same function to combine their meanings. It very much seems to me that what's instead happening is that whatever function is combining the meanings is taking into account the individual meanings of the components going into that function and, additionally, that function may well need to take into account a wider knowledge of typical things we might encounter in the world and how their properties might fit together under the constraints of the world as we know it.

So just to summarise, we've seen in this section that words have many related senses and that they're not necessarily characterised as sort of perfectly idealised discrete symbols.

We've also seen that in order to somehow find which of those senses is most relevant in a particular scenario, some of the ways to settle that problem might require us to look at the wider context around that word.

And in many cases we may need to look a long way from a particular word to satisfactorily disambiguate the uncertainty that we have at any particular point.

And finally, when we're thinking about building models of how word meanings might combine, we've seen that functions that combine meanings will probably need to take into account what the inputs to those functions are in order to come up with the best bespoke way of combining for those particular words.

And we've even suggested that they may well also need a widened sense of how the world works and how things can naturally fit together in order to eventually arrive at the optimal representation for the combination of meanings in each particular case.

So, in the first part we talked about particular aspects of language and particular aspects of neural computation that seem to fit together in a particularly appropriate way, such that they define certain ways in which a computational model might need to behave in order to capture the ways that meaning works in language.

So in this section we're going to talk much more concretely about a specific model, which was published just a couple of years ago and has had an incredible impact on a large number of natural language processing problems from machine translation to sentence classification and essentially any problem that requires a model to process a sentence or a passage of multiple sentences and compute some sort of behavioural prediction based on that.

So it's fair to say that for any problem of that form, the transformer is probably the state of the art method, or some variant of a transformer is the best way for the model to learn to extract the signal from those sentences in order to make optimal predictions.

And in this section I'll talk about the details of the transformer and just refer back to those aspects of language processing that we saw in the first section in order to give some intuition about why the transformer might be so effective when it processes language.

So just here's credit to the authors of the transformer from Google Brain and collaborators.

And the paper is obviously available for you to find out the fine details, but I'll give a broad overview and starting in particular with the first layer.

So the transformer contains a distributed representation of words in its first layer which is something it has in common with almost any neural language models now.

And what do I mean by a distributed representation of words?

Well, the first thing that we do when we construct a neural language model is we have to determine what is the vocabulary on which the model is going to operate.

So what I mean by that is we do need to chop up the input which the model sees into some sort of units in order to pass them to the model.

Now, if you think about a large page of text, those units could be individual characters.

In an extreme case, they could be individual pixels if we consider the text an actual image.

But in general with language processing applications, because we have texts stored in digital form, we don't need to go through that phase and subject our model to having to learn to process pixels.

So we have to make a decision about what the units are that we actually pass to the model.

And in most applications of neural language models these days, that can either be character level, which is where we pass each unit as an individual letter or it can be word level, which is where we split the input according to white space in the text and then we pass each of the individual words to the model as discrete different symbols.
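
The two choices of unit described here can be sketched in a couple of lines. This is only an illustration of the idea, assuming plain whitespace splitting for the word-level case; the function names are made up for the example.

```python
def word_tokens(text):
    """Word level: split the input on white space."""
    return text.split()

def char_tokens(text):
    """Character level: every letter (and space) is its own unit."""
    return list(text)

sentence = "the cat who bit the dog barked"
words = word_tokens(sentence)
chars = char_tokens(sentence)
```

The word-level sequence is much shorter, but it needs a far larger set of distinct symbols than the character-level one.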

But of course, as we've talked about in the last section, a model which just takes symbols and treats them as symbols might not be optimal for capturing all of the aspects of meaning that we see in natural language.

So instead of doing that, the developers of neural language models have come up with a procedure which allows the model to be more flexible in the ways in which it represents words.

And that process is something like the following.

So let's say we do take the decision to chop up our input text according to individual words.

So what we first do is consider all of the words that we want our model to be aware of, and we let that define the total vocabulary of the model.

So to get such a list, we might scan hundreds of thousands of pages of text and count all the words that we find there, and then we can take some subset of the words which appear the most frequently or, alternatively, if we have lots of memory and a really big model, we can take all of the words and allow all of those to be in the vocabulary of the model.

What we typically do in a neural language model then is pass each of the words to an input layer, and that input layer contains a particular unit corresponding to each of the words in the vocabulary of the model.

But importantly, those units are then connected to a set of weights: each unit is always connected to the same number of weights, and those weights connect to a set of units of a particular dimension.

Now that dimension we can think of as the word representation dimension or the word embedding dimension and when the model sees a given word, we turn on the unit corresponding to that word and we leave all of the other units at zero.

So we put an activation of one on the unit corresponding to the word and leave all of the other units at zero, and we've marked the weights for that word in this diagram here with yellow, while light blue shows the space occupied by the whole layer of input weights for the model.

So in this case the model sees the word 'the' and we turn on the unit corresponding to the word 'the'. And of course, because we activate the 'the' unit with a strength of one, each of the units at the output of the next phase, which corresponds to the black box around the grey rectangle, is activated according to exactly the weight that goes from the unit corresponding to 'the' to this distributed layer.

So effectively, in that layer with the black box around the rectangle, we get a representation corresponding to the word 'the', but that representation is actually a finite number of floating-point valued weights.

And if we do this for all the words, we get a different representation for all of the words.

So we can unroll the input and do this repeatedly to get a sequence of vectors of floating point values, one for each of the words in our input.

And those vectors live in a space and importantly that space has certain geometric properties.

So we might find that representing words in a space like that allows words to move together in the space if it's useful for the model to represent them as somewhat similar and to move further away in that space if it's useful for the model to represent them as different.

Because remember with backpropagation, all of the weights in this first layer of the model are going to be trained to optimise the model to achieve its objective.

So this gives the model the flexibility to move its representation of individual words around as it sees fit, in the best way to achieve its objective.

So just to recap, this is the first layer of many neural language models, including the transformer.

And it contains quite a lot of weights.

So if we have a total of capital V words in our vocabulary and if capital D is the dimension of the vector that we're going to represent each of these words with in a floating point vector, then the total number of weights that we have in the first layer is V multiplied by D and we end up with a D dimensional Euclidean space with which to represent these input units in the model.
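
As a minimal sketch of the input layer just described, the lookup can be written as a one-hot vector multiplied by a V by D weight matrix. The toy vocabulary, the sizes and the function names below are illustrative assumptions, not taken from the lecture slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real model would hold many thousands of words.
vocab = ["the", "beetle", "drove", "off"]
word_to_index = {word: i for i, word in enumerate(vocab)}

V = len(vocab)  # vocabulary size
D = 5           # embedding dimension (illustrative choice)

# First-layer weights: one row of D trainable floats per vocabulary word.
embedding_weights = rng.normal(size=(V, D))


def one_hot(word):
    """Activation of one on the word's unit, zero on every other unit."""
    units = np.zeros(V)
    units[word_to_index[word]] = 1.0
    return units


def embed(word):
    """Multiplying the one-hot input by the weights selects that word's row."""
    return one_hot(word) @ embedding_weights


sentence = ["the", "beetle", "drove", "off"]
embedded = np.stack([embed(word) for word in sentence])

print(embedding_weights.size)  # total number of first-layer weights, V * D
print(embedded.shape)          # one D-dimensional vector per input word
```

In practice libraries skip the explicit one-hot multiplication and index the row directly, which gives the same vector.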

Now this idea of representing words or letters, or whatever we take as the input units to a model, in some sort of high dimensional real-valued vector space is actually quite an old idea.

If we go back to 1991, Miikkulainen and Dyer produced a neural language model with much less computational power than current models have, but it still tried to execute this principle of representing input words in this distributed geometric space.

And it was able to exhibit certain types of interesting generalisation when trained on real texts that a model which represented words as individual discrete symbols wouldn't be able to represent or achieve.

And of course, perhaps the most famous example of this demonstration came from a very famous paper in which Jeff Elman introduced the recurrent neural network to the wider community.

And in this paper Elman analysed the distributed representations corresponding to lots of different words as he trained the model on subject-verb-object style sequences of natural language style snippets.

And the objective of this model was just to represent a sequence of words such that the model was able to optimally predict the next word with as much accuracy as possible.

And what Elman found when he analysed the way that the model was representing these words internally was that all the words in his vocabulary started to cluster together in this geometric space, such that the words with similar meanings came together.

And importantly also words with similar syntactic roles, so things like verbs or nouns or subjects or objects also started to cluster together in the space.

And this tells us that neural language models, as they experience more and more text, start to slowly infer the underlying structures in language which we might be able to perceive as language users such as subject, object, verb and how things fit together like that as well as an emergent categorical semantic structure where we see that certain classes of different types of words naturally fit together.

So that's the solid foundation on which the transformer builds.

But that's of course not novel to the transformer.

Distributed representations of words have been a part of neural language models as I pointed out since the early nineties.

So what else does the transformer do that makes it so powerful and allows it to fit and correspond and capture some of the aspects of language that I talked about from the first section?

Well, after the first stage of processing, which I've just outlined in the previous slides, we end up with a particular real valued continuous vector for each of the words in an input sentence.

So in the next stage, the transformer computes what's called a self-attention operation.

So how does that work?

Well for any self-attention operation, there are three matrices containing the weights which parameterise the operation.

So the first matrix we could call the query weight matrix WQ.

The second matrix we would call the key weight matrix WK and the third weight matrix we'll call the value weight matrix WV.

And each of these matrices has independent weights in the transformer, and their dimensions are such that they can naturally multiply the distributed word vector that I talked about earlier; in this case I've written it as post-multiplication.

And importantly as the self-attention operation is carried out, these weights are applied equally and in exactly the same way to each of the words in the input.

So for every individual word vector, which I've written here as 'e beetle' for the word 'beetle' in the input, we end up with three further vectors, corresponding to multiplying that vector by the matrix WQ, the matrix WK and the matrix WV.

So those three additional vectors we can write as bold Q, bold K and bold V, and they are typically called the query vector, the key vector and the value vector for this self-attention operation, corresponding to each of the words.
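
The projection step can be sketched as follows; it shows that applying the three matrices to one word vector gives the same result as applying them to every row of the stacked input at once, which is what "applied identically to each word" amounts to. The sizes, the random weights and the position of 'beetle' in the input are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, d_model, d_head = 4, 8, 8  # illustrative sizes

# One embedding per row; row 1 stands in for e_beetle (hypothetical ordering).
E = rng.normal(size=(n_words, d_model))

# The three weight matrices that parameterise one self-attention operation.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

# Query, key and value vectors for the single word 'beetle'.
e_beetle = E[1]
q_beetle = e_beetle @ W_Q
k_beetle = e_beetle @ W_K
v_beetle = e_beetle @ W_V

# The same matrices applied to every word in one matrix product.
Q = E @ W_Q
K = E @ W_K
Vals = E @ W_V

print(Q.shape, K.shape, Vals.shape)
```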

And then with those three vectors, we use them to understand how the different words in the input start to interact.

So in particular, for every word we take the query vector corresponding to that word and we compute the inner product, the dot product, of that query vector with the key vector corresponding to each of the other words.

So that's represented here by the dotted line.

And by taking a dot product in that way we get a scalar, and then we want to understand how big that scalar is relative to an average scalar that we would get if we just took that operation arbitrarily.

So essentially we want to give the model the power to represent how strong should the connection between these two words be.

And in order for that to be a nice normalised distribution over all the possible strengths computed by the model, we first work out the inner product of the query vector with the key vector of a particular word.

And then we need to normalise that number by a quantity corresponding to the dot products of that query vector with each of the key vectors of the other words.

And the way we do that is, we compute all of those dot products and pass them through a softmax layer, which exponentiates and normalises them, such that we get a nice smooth distribution corresponding to how well each of the queries corresponds to each of the keys of the words in the input.

So this then gives us a probability distribution, a set of weights between zero and one: for a given word 'beetle', we get a set of weights, one for each of the words in the input, telling us to what extent there is a strong interaction between the word 'beetle' and that other word.

So in this case, the way I've marked it in the slide is that the strongest interaction when we do this operation is with the word 'drove' and that might be because the word 'drove' tells us in particular that this beetle is not the animal type of beetle, but it should in fact be thought of as the car beetle.

So that's the sort of interaction that we want to naturally capture here.

Once we've got these weights, we then use them to tell us how much of the value representation to take through to the next layer of the transformer.

So in this case, for example, when representing the word 'beetle', we would notice a strong connection with the word 'drove', and that would give us a strong weight in our attention distribution.

And then that would tell us to take a lot of the value of the embedding for 'drove' through to the next layer of the transformer.

So the operation which allows us to take through to the next layer an amount of each value corresponding to the weight computed by the transformer is just this simple weighted sum.

So what we end up with then for each word like 'beetle' is that we take a small amount of the value of each of the other words plus some of the value of the word 'beetle' through to the next layer of the transformer.

And that can then be aggregated to form the next layer's representation of the word 'beetle'.
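
Putting the steps together, a single self-attention operation can be sketched as below. This is a hedged illustration with random, untrained weights, so the printed attention distribution carries no linguistic meaning. The division by the square root of the key dimension follows the original transformer paper and is not mentioned in the lecture; the function and variable names are assumptions.

```python
import numpy as np


def softmax(scores):
    """Exponentiate and normalise each row so that it sums to one."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(E, W_Q, W_K, W_V):
    """Return the updated word representations and the attention weights."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    # Dot product of every query with every key, scaled as in the original paper.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)  # one distribution over the input per word
    return weights @ V, weights  # weighted sum of the value vectors


rng = np.random.default_rng(0)
words = ["the", "beetle", "drove", "off"]
d_model = 8
E = rng.normal(size=(len(words), d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

updated, weights = self_attention(E, W_Q, W_K, W_V)

# Attention distribution for 'beetle' over all four input words.
print(dict(zip(words, weights[1].round(3))))
print(updated.shape)  # same number of representations out as went in
```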

So notice that having performed this transformer layer, we haven't reduced the number of embeddings in the model in any way, we still have a representation corresponding to the word 'beetle' that we started with, but that representation has been updated or modulated, conditioned exactly on information about how well it corresponds or how well it should interact with all of the other words in the input.

And of course that was just for the word 'beetle' but we do that for each of the other words in turn.

And that computation can be computed in parallel, which makes the transformer quite fast to feed forward in today's deep learning libraries.

And so for one application of a self-attention layer we end up with the same number of distributed representations coming out as we had going in.

And within the mechanism, the only weights that we learn are a single matrix giving us the queries, a second matrix giving us the keys and a third matrix giving us the value representations; of course, those matrices are then applied to each of the individual words.

But it's not just this self-attention layer that gives the transformer its expressivity and power; it's actually an operation known as multi-head self-attention, which basically takes the operation I just talked about and reapplies it four different times in parallel.

So if you imagine the operation that I just spoke about being parameterised by three matrices, WQ, WK and WV, well we can repeat that process with three additional independent matrices.

And in fact, typically we might do it say four times.

So we'll end up with four sets of three independent matrices and each of them can do exactly the same self-attention operation as I just talked about in the previous slides.

So we end up with four independent and parallelisable self-attention operations, each computed on the input words of a particular layer in order to get us through to the next layer.

Now of course that is a lot of computation and it might require a lot of memory if we end up with very large representations in our model.

In practice, then, what the developers of the transformer recommend is that each of our self-attention layers actually effectively reduces the dimensionality of the input vectors.

So if the input vector, in the light blue at the bottom of the slide here, has dimension 100, then we can make the matrix WV be a rectangular matrix rather than a square matrix.

What that would do is mean that the output of WV, which is the value vector, which gets passed to the next layer of the transformer, that can be arbitrarily small.

In this case, we might find it to be just 25 units.

And so each self-attention layer independently takes a 100-dimensional vector and returns a 25-dimensional vector for each word in our input.

But if we do that four times, then we end up with four 25-dimensional vectors and those can be aggregated: in the transformer they are concatenated and then passed through an additional linear layer, parameterised by a matrix WO, to return overall a vector of the same dimension, 100 units, as the input.

So in that way we can apply multi-head self-attention.

We can give the model four independent ways to analyse the interactions across the different words in the input.

And we can do so without expanding the dimensionality with which the model needs to represent each of its words.

And that makes it a relatively practical tool, which doesn't lead to an enormous explosion in the memory requirements of the models.

But it does give the model many independent ways with which it can represent interactions between the words in the input.
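
The 100 to four times 25 and back to 100 arrangement can be sketched as follows, with each of the four parallel operations written as one "head". The lecture only describes WV as rectangular; making WQ and WK rectangular as well, the 0.1 weight scale and all names here are assumptions for the sake of a runnable example.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention_head(X, W_Q, W_K, W_V):
    """One self-attention operation with rectangular projection matrices."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V


def multi_head_attention(X, heads, W_O):
    """Run every head independently, concatenate, then apply the output matrix."""
    head_outputs = [attention_head(X, *head) for head in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O


rng = np.random.default_rng(0)
n_words, d_model, n_heads = 6, 100, 4
d_head = d_model // n_heads  # 25 units per head

X = rng.normal(size=(n_words, d_model))
# Four independent sets of three matrices, each mapping 100 units to 25.
heads = [
    tuple(rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(size=(d_model, d_model)) * 0.1

single = attention_head(X, *heads[0])
combined = multi_head_attention(X, heads, W_O)
print(single.shape)    # 25 units per word from one head
print(combined.shape)  # back to 100 units per word
```

Because the heads share no weights, the list comprehension could be replaced by a single batched matrix product, which is how the heads are usually run in parallel.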

Now after that multi-head self-attention layer, the model does what's called a feed forward layer.

So conceptually this is less interesting, but essentially the representations at the output of the multi-head self-attention layer are multiplied again by a linear layer, which actually expands them out in dimension somewhat, there's a rectified linear unit non-linearity, and then they're reduced again in dimension with another linear layer.
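
A minimal sketch of that feed forward layer, applied to each word's vector independently, might look like this. The fourfold expansion (100 to 400 units), the bias terms and the names are assumptions chosen for illustration; the lecture only says the representation is expanded "somewhat".

```python
import numpy as np


def feed_forward(X, W1, b1, W2, b2):
    """Expand each word vector, apply a ReLU, then reduce it back again."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # rectified linear unit
    return hidden @ W2 + b2


rng = np.random.default_rng(0)
n_words, d_model, d_hidden = 6, 100, 400  # illustrative sizes

X = rng.normal(size=(n_words, d_model))
W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)) * 0.1
b2 = np.zeros(d_model)

out = feed_forward(X, W1, b1, W2, b2)
print(out.shape)  # same shape as the input
```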

So when considering a transformer all together, it's actually multiple applications of those multi-head self-attention layers and the linear layers that I described afterwards.

But there's another important detail in the transformer, which is the notion of skip connections.

So whenever we apply a multi-head self-attention layer, or indeed a linear layer, the transformer also gives the model the option to ignore that computation and instead pass the activations that were at the input to that multi-head self-attention layer directly through, bypassing the self-attention layer, to the point in the network at which the output is coming out of that self-attention layer.

And then that is added to the output of the self-attention layer and passed through a layer normalisation layer, and that represents the actual output of the whole module which is doing the multi-head self-attention.
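
The add-then-normalise wiring can be sketched as below. This is a simplified illustration: the layer normalisation here omits the learned gain and bias that real implementations carry, and the stand-in sublayer is a plain matrix product rather than multi-head self-attention. All names are assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise each word vector to zero mean and roughly unit variance.

    Simplification: no learned gain or bias parameters.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def with_skip_connection(sublayer, x):
    """Add the sublayer's input to its output, then apply layer normalisation."""
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(0)
n_words, d_model = 6, 100
X = rng.normal(size=(n_words, d_model))
W = rng.normal(size=(d_model, d_model)) * 0.1

# Stand-in for a multi-head self-attention or feed forward sublayer.
out = with_skip_connection(lambda h: h @ W, X)

# If the sublayer contributes nothing, the input passes straight through.
bypassed = with_skip_connection(lambda h: np.zeros_like(h), X)

print(out.shape)
print(np.allclose(bypassed, layer_norm(X)))
```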

So why might that sort of skip connection be important?

Well, in the examples I gave in the previous section about language, one thing that should have maybe come across is the importance of, or the role of our expectations in forming a consistent representation of what a particular input is.

So as an example, in the case of pet fish, we came to the understanding that pet fishes have many bright colours even though that was not necessarily part of the individual parts of the input.

It's not necessarily something we would associate with pets and it's not something we would necessarily associate with fish, ordinarily.

And where does that additional notion of colours come from?

Well it probably comes from our sort of, our wider understanding of the world and our ability to think about pet fish as a combination and then reconsider how the input works.

And so these sorts of top down influences, our expectations influencing how we actually combine the inputs in language, are really common in many different contexts.

And if you think about skip connections, it's not a perfect model of this, but it does give the transformer a rudimentary ability to allow its representations of things at a higher level of processing to interact with its representations of things at a lower level of processing.

So let's say that the model didn't have skip connections and fed things through to a certain level in the hierarchy.

At that point after computing many different interactions, the model might form a consistent sense of the fact that a meaning needs to be understood in a particular way, but of course those top down influences tell us that that expectation of what the meaning might be should actually feed back and allow us to remodulate how we understand the input.

Well, a skip connection which comes up from the input and interacts with the model at that point allows such an interaction to be computed in subsequent layers, because at that point the model has access both to a high level representation of what it expects the best way of interpreting the whole situation to be and, directly, to the lower level input.

So in some ways in a very deep model, the addition of skip connections allows the model to execute a form of top down influence on processing.

There's one more detail I'll finish off with in our characterisation of the transformer.

Now, if you were paying attention during the explanation, you may well have noticed that none of the operations that I described on the input words took into account the actual relative order of the words in the input.

It was a series of matrix multiplications which were applied identically to each of the words.

And then on top of that, a series of inner products, which are symmetric operations, which don't favour the ordering in which we apply them with respect to the words.

So there was no way that a model like this would have any ability to express the fact that certain words appear close together in the input or certain words appear further apart.
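
This insensitivity to order can be checked directly: shuffling the input words simply shuffles the output rows in the same way, so each word receives exactly the same representation wherever it appears. The sketch below uses random weights and an arbitrary permutation; the names and sizes are illustrative assumptions.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(E, W_Q, W_K, W_V):
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


rng = np.random.default_rng(0)
n_words, d_model = 4, 8
E = rng.normal(size=(n_words, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = np.array([2, 0, 3, 1])  # an arbitrary reordering of the input words

original = self_attention(E, W_Q, W_K, W_V)
shuffled = self_attention(E[perm], W_Q, W_K, W_V)

# Each word's output is unchanged; only the row order follows the shuffle.
print(np.allclose(shuffled, original[perm]))
```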

And of course we know in language that the word order can tell us some important things about what the overall sentence means.

So in order to give the model sensitivity to word order in a way that the computational form, the functional form of the model doesn't allow, the developers of the transformer came up with a rather nice trick known as positional encoding.

So positional encoding is just a way of determining a set of scalar constants which are added to the word embedding vector; in the lowest level of the transformer, say, they can be added just after the word embedding layer but before the first self-attention layer. Those scalars combine with the word embedding to mean that, if a word appears in a particular slot in the input, then regardless of the fact that its embedding weights will necessarily be the same, the actual effective representation that the transformer sees will be slightly different depending on where it appears in the input.

So to achieve this sort of thing, the model just needs a set of small scalars which are different in each of the possible locations that a word could appear in the input.

And they use a nice sinusoidal function, which has various properties that may well be more desirable than just allowing the model to discriminate words according to their position.

Because in fact that sinusoidal function gives the model a slight prior to pay attention to relationships of a certain wavelength, a certain distance across the input and each particular unit in the embedding representation can then specialise at recognising interactions or correspondences at a different distance from a given word.
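
A sketch of the sinusoidal encoding, following the formula in the original transformer paper (sine on even units, cosine on odd units, with wavelengths growing geometrically across the embedding), is given below. The constant 10000 comes from that paper rather than from the lecture, the function assumes an even embedding dimension, and the names are illustrative.

```python
import numpy as np


def sinusoidal_positions(n_positions, d_model):
    """Return an (n_positions, d_model) array of positional constants.

    Assumes d_model is even. Each pair of units shares one wavelength.
    """
    positions = np.arange(n_positions)[:, None]
    pair_index = np.arange(d_model // 2)[None, :]
    angles = positions / (10000.0 ** (2 * pair_index / d_model))
    encoding = np.zeros((n_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even units
    encoding[:, 1::2] = np.cos(angles)  # odd units
    return encoding


rng = np.random.default_rng(0)
n_positions, d_model = 10, 8
pe = sinusoidal_positions(n_positions, d_model)

# The same word embedding placed in two different slots of the input.
e_beetle = rng.normal(size=d_model)
at_slot_1 = e_beetle + pe[1]
at_slot_5 = e_beetle + pe[5]

print(pe.shape)
print(np.allclose(at_slot_1, at_slot_5))  # the effective representations differ
```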

So if you think about models like a recurrent neural network or an LSTM, those models have the notion of order built in, because they process the input sequentially, one word after the other, according to a process which transitions the state from its position after reading one word to after reading two words, to after reading three words, to after reading four words.

What that means is that the model has a very strong awareness of the ordering of the words naturally.

But then it's harder for that model to remember to pay attention to things a long time in the past, even if those things actually end up having a really important influence on what the model is currently looking at.

With a transformer, things are totally different.

The model, in its native functional form, has no awareness of word order, and we have to add on these additional positional encodings to give the model a weak awareness of word order.

But the transformer actually performs better than RNNs and LSTMs on a lot of language tasks, and this maybe tells us that it's easier to learn the notion of word order for the number of cases where it's actually important in language than it is to be given the notion of word order automatically but to have to learn the very difficult process of paying attention to things a long time in the past.

And when I say difficult, I mean the gradients have to pass back through many, many weight matrices in order to allow the model to update and then learn to encode dependencies between things in the past and things in the present.

With a transformer, the path that the gradient has to go through is much shorter: because there's no prior favouring of things which are close together, the gradient path that the model needs to go through to connect any two words in the input is the same, and it is indeed shorter on average than it is in recurrent neural networks.

So that gives a small amount of intuition about another reason why the transformer might be so effective at processing language.

So just to summarise this section: we saw in the previous section that words shouldn't necessarily be thought of as independent discrete symbols and that disambiguating their meaning can depend a lot on the context but not only on the immediate context which is closest to those words, but on potentially distant context of the information encoded in words a long way away.

We've also seen that the functions which models use in order to combine the meanings of two words should take into account the meanings of those words and, if possible, a wider general knowledge of how things typically combine, in order to allow that to modulate the interactions between the words coming in.

And if we think about the architectural components I talked about in the transformer, the multi-head processing is one way of getting at this notion that words are not discrete symbols, because it naturally gives the transformer, even in a single feed forward pass, the opportunity to represent each word at each layer with n, let's say four, different possible contextualised representations. And of course, going back longer term, just the general notion of representing words as distributed representations, and allowing words with similar meanings to occupy local areas in a large geometric vector space, also allows the model to express this non-discrete nature of word meaning in a very elegant way.

Now the fact that disambiguation depends on context is very nicely modeled by self-attention, which precisely makes the meaning of every word critically dependent on the meaning of all the other words in a given input stream.

And the fact that that context could be non-local as I've just talked about is very nicely modeled by the self-attention mechanism because the gradient flow from the particular point I'm in at a sentence to any other point in the sentence, is the same.

So interactions between words that are next to each other are not particularly favoured over more distant interactions.

Another fact is that the more layers we have, the more chance the model has to learn, as it modulates its representation of different things, how interactions might take place at different levels of abstraction as the model continues to reprocess the input.

And finally, on this point about how meanings combine, and the fact that the ways in which the meanings of two words combine seem often to depend on the particular meanings of those words and also on top down effects, we've seen that skip connections are one way in which the transformer can learn to implement the interaction of higher level information with lower level information.

And we've also again seen that parameterised functions on distributed representations, i.e. the multiplication of a vector by a matrix, are precisely the operation of a function which combines word meanings according to the meanings of the words themselves, and those operations are common in many neural language models, but are a really important part of the transformer architecture.

So hopefully this section has given you some intuition about how a transformer works, but also some intuition about maybe why it works, why it is that the various components in the transformer improve on a model's ability to process language because of the way that we think meaning works in a very sort of intricate and interactive way when we understand linguistic input.

In the last section, we introduced the transformer and we talked about how various components within the transformer combine to make it a very powerful model for processing sentences and combinations or sequences of words.

In this section, I'm going to talk a little bit about a very specific use of the transformer.

It's a way of training transformer models in order to allow them to excel at a wide range of different language tasks.

And those tasks might involve reading a sentence and making a prediction or classifying how two sentences relate to each other or even classifying or making predictions about longer texts such as documents.

But before I do that, I also just want to go back to our points about the nature of language and discuss one more issue which I think is quite motivating when we think about how transformers are applied in the model that I'm going to talk about in this section.

So let's consider this sentence: 'time flies like an arrow'.

And then we can compare it to what seems superficially to be a very similar sentence: 'fruit flies like a banana'.

But of course when we start to process and make sense of these sentences, it feels very clear to us as native English speakers, that there's quite a difference in the way that the words in those sentences have to relate to each other in order for us to sort of construct the meanings in our head.

And it certainly seems to me like there are at least two factors that are really important here.

So one thing is that, in the top sentence, 'time flies like an arrow', we know what an arrow is, we know that they regularly fly and in fact we know how they fly.

So we've got our experience of arrows.

Another important piece of experience that we have is our experience of a bunch of phrases or sentences which are quite similar to the phrase 'time flies like an arrow'.

And in particular they're similar in the way that the meanings of the words combine for us to come up with a representation of the sentence.

So those could be things like 'John works like a Trojan' or 'the trains run like clockwork'.

These are all actually kind of metaphorical or simile style sentences, where we compare the way that something works with the way that something else works.

So it feels to me like those two pieces of experience are very important in our ability to read a sentence, like 'time flies like an arrow' and immediately understand it.

In the case of 'fruit flies like a banana' of course we come to quite a different understanding, right?

We know that we're not comparing the way that fruit flies with the way that bananas fly, and how is it that we can somehow know that that's not what we have to do to understand this sentence?

Instead what we do is we have some knowledge of fruit flies, and we know that in fact maybe one of the most salient things about fruit flies (I'm not an expert in fruit flies, but there's one thing I do know) is that they like fruit, and we know that bananas are fruit.

And so this connection helps to tell us, well, maybe it's a different type of liking that I need to be thinking about in this sentence.

And then of course there's again, other than that background knowledge of how the world works, how fruit flies are, there's also this kind of more linguistic knowledge of sentences we may well have already previously understood in which the meaning seems to combine in a similar way to 'fruit flies like a banana'.

So 'Fido likes having his tummy rubbed' or 'grandma likes a good cuppa'.

In those cases it seems like the process of putting together the meanings has something quite similar or in common with the scenario in 'fruit flies like a banana'.

So if we're going to come up with a general language understanding engine that's able to cope with all these different types of processes and constraints which are involved in understanding a sentence, then there's obviously a lot of places where such a model needs to get its experience.

And there are a lot of places where such a model needs to get its understanding of the world and its understanding of language. Those considerations lead us to add a fifth point to these many characteristics of language, which is that when we actually form an understanding, it really does seem to be a process of balancing our existing knowledge, which could be knowledge of language and also knowledge of the world, with the input, the particular facts of the thing that's currently coming into the model.

And that consideration is a key motivating factor behind the approach which is taken in the model I'm going to describe in this section, which is called BERT.

And BERT stands for Bidirectional Encoder Representations from Transformers.

And BERT is essentially an application of the transformer architecture that I described in the last section.

But the key insight with BERT is that rather than training a transformer just to understand the sentences which the model is currently considering, a process of pre-training takes place in which the weights within the model are endowed with knowledge of a much wider range of text, which can plausibly give the model that background knowledge which is really necessary for forming a coherent understanding of all the different types of sentences a language understanding processor needs to be able to understand.

So the important thing to remember when considering how BERT works is that a transformer as described in the previous section really is just a mapping from a set of distributed word representations to another set of distributed word representations.

So as I talked about in the last section, the first layer of a transformer goes from the particular input symbols passed to the model to a space, a geometric space of continuous valued vectors.

And of course what comes out after these many layers of self-attention is precisely another space of continuous valued vectors, with one vector corresponding to each of the words in the input, the exact same set of words that were the input.

So if I pass a sentence to a transformer model, it'll very quickly compute a set of embeddings for each of the words in that sentence.

And then it will pass them through the self-attention layers and output a set of embeddings for each of the words in that sentence.

But of course each of those embeddings will be highly contextualised, highly modulated by all the other words in the sentence.

And hopefully they will have gone through the sorts of processing needed for the model to gradually and incrementally form a reasonable representation of what the sentence means.

So given the fact that a transformer is just a mapping from a set of word representations to a modified set of word representations of the same length, there's quite an easy way in which we could train such a model in order to extract knowledge from an enormous amount of text that we might just have lying around.

So in particular, the insight from BERT is precisely how we can get knowledge into the weights of such a model without requiring problems or data which have been labeled by human experts, or by some other mechanism, in order to give the model knowledge of what's the right classification or what's the right answer to make.

So how can we get knowledge into a model, a transformer model in an unsupervised way?

And the approach that the authors of BERT take is firstly, by means of a masked language model pretraining phase.

So the way this works is the following, the authors just considered the problem of mapping a particular sequence of words to the exact same sequence of words.

So the job of this transformer in theory is just to represent a sentence, for example, and then output a sentence at the very top of its network.

But rather than having the model output the exact same sentence, in the input to the model one of the words is masked out.

So the model is not aware of one of the words in the input sentence.

And instead of having to predict all of the words in the input sentence, the model just has to make a prediction conditioned on the output embedding for the missing word of what that missing word was.

So it just has to answer the question, you know, here's a sentence with a missing word in it, 'sucking up '...' from words', and the model just has to make the prediction that the missing word in that case is knowledge.

And when training the model, the authors of BERT do that with 15% of words at random.

So they present sentences from any particular place where we might be able to get running text, and the authors mask out words with a probability of 15% and then ask the model to make a prediction and backpropagate the cost, which is essentially the negative log likelihood of the model having predicted that word over all of the other words in its vocabulary.

So that's masked language model pretraining.
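As a rough illustration of the masking step and the cost just described, here is a minimal Python sketch. It is not the BERT authors' code: the `[MASK]` symbol, the function names and the toy probability table are assumptions made for this example, and a real model would produce the probabilities from the output embedding at the masked position.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder symbol for a hidden word


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide each token independently with probability mask_prob.

    Returns the corrupted token list and a dict mapping each masked
    position to the original word the model should predict there.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = MASK
    return masked, targets


def negative_log_likelihood(probs, target):
    """Cost for one masked position: -log of the probability given to the true word."""
    return -math.log(probs[target])


tokens = "sucking up knowledge from words".split()
masked, targets = mask_tokens(tokens, rng=random.Random(0))

# Toy distribution over a tiny vocabulary, standing in for the model's prediction.
toy_probs = {"knowledge": 0.5, "water": 0.5}
cost = negative_log_likelihood(toy_probs, "knowledge")
```

The cost shrinks towards zero as the model puts more probability on the word that was actually hidden, which is what backpropagation pushes the weights towards.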

But one thing that the authors noticed is that, if they trained the model in that way, then on the test set when they came to use this model, of course, in the input there wouldn't actually be any tokens masked out.

So there's a risk that just by training the model in this way, it would not behave well on inputs where there wasn't anything masked out.

So for a small amount of the time, instead of masking out a word, they have the model make a prediction of which word is missing even though they didn't actually mask a word out.

So no words missing.

In this case, the model really does just need to retain which word is in a particular point in the sentence and at the output representation corresponding to that point, conditioned on that, make a prediction of what that word was.

This of course, if this was just the only objective, the model would never have to do any sort of inference.

It would never have to make any sort of unexpected judgment about what word could be missing.

It would instead just be able to copy knowledge straight through.

And that wouldn't lead to any interesting formation of any interesting representations.

So this is only done occasionally, but it does make the model perform better on the test set because the model does not find itself completely out of its training experience when it encounters sentences for which no words are masked out.
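The occasional "predict the word even though it wasn't masked" case can be sketched as a small change to the corruption step. This is a minimal sketch under assumptions: the function name, the `[MASK]` symbol and the default `keep_prob` of 0.1 are illustrative choices, not figures taken from the lecture.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder symbol for a hidden word


def corrupt_for_mlm(tokens, select_prob=0.15, keep_prob=0.1, rng=None):
    """Select positions to predict; usually mask them, occasionally leave them visible.

    select_prob: chance that a position becomes a prediction target.
    keep_prob:   chance that a selected position is left unmasked (assumed value).
    """
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position is not a prediction target
        targets[i] = tok  # the model must predict this word either way
        if rng.random() >= keep_prob:
            inputs[i] = MASK  # the usual case: hide the word
        # otherwise the word stays visible and the model only has to retain it
    return inputs, targets


inputs, targets = corrupt_for_mlm("sucking up knowledge from words".split(),
                                  rng=random.Random(1))
```

Keeping `keep_prob` small matters: if every target were left visible, the model could just copy the input through and would never need to infer anything.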

Okay.

So that's the masked language modeling objective.

But in order for BERT to be an effective language processor, the authors also wanted it to be aware of how meaning flows on a longer scale than just within a particular sentence.

So in order to achieve this, they came up with an additional mechanism for training the weights in the BERT model, which is complementary to the masked language model objective.

So this objective can be trained at the same time as the masked language modeling objective.

And as in that case, it doesn't require any data that's been labeled by experts or found to have a right answer in some way.

We can just construct this objective by taking running text from the internet.

And the way this works, this is called the next sentence prediction pretraining objective.

So the way this works is the authors add an additional input token at the start and it's the output embedding corresponding to that input location that's going to be used to make the prediction on this objective.

Then as input to the model, the model is presented with not one sentence, but two sentences in this case and so there's the additional input token, then there's the first sentence, then a separation token, and then the second sentence, and that's all passed to the transformer and it's processed through in parallel.

At the end, the model produces representations corresponding to each of the input tokens, but it's only the initial representation corresponding to this additional token that was added to the inputs that needs to be considered in this objective.

And conditioned on that, the model just makes a binary choice of whether or not this was actually two consecutive sentences from the training corpus.

So in this case it is two consecutive sentences: 'Sid went outside' and 'it began to rain'.

So in this case, the model would predict, yes, those are two sentences which are likely to follow one another in a corpus.

But by shuffling data, the people who train the model can also present it with negative cases.

So cases where one sentence didn't follow the other sentence.

So that might look something like 'Sid went outside', 'unfortunately it wasn't'.

So the objective of the model here is to identify this as two sentences which don't fit well together and to make the prediction no on the next sentence prediction task.
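To make the input format concrete, here is a minimal Python sketch of how positive and negative next sentence prediction examples could be built from running text. The `[CLS]` and `[SEP]` symbols, the function name and the 50/50 split between positive and negative cases are assumptions made for illustration; whitespace splitting stands in for real tokenisation.

```python
import random

CLS, SEP = "[CLS]", "[SEP]"  # hypothetical names for the extra token and the separator


def make_nsp_example(sentences, index, negative_prob=0.5, rng=None):
    """Build one (tokens, label) pair for next sentence prediction.

    label 1: the second sentence really follows sentences[index] in the corpus.
    label 0: the second sentence was drawn from elsewhere in the corpus.
    """
    if len(sentences) < 3 or not 0 <= index < len(sentences) - 1:
        raise ValueError("need at least 3 sentences and a sentence with a successor")
    if rng is None:
        rng = random.Random()
    first = sentences[index]
    if rng.random() < negative_prob:
        # Negative case: any sentence other than the first one or its true successor.
        candidates = [j for j in range(len(sentences)) if j not in (index, index + 1)]
        second = sentences[rng.choice(candidates)]
        label = 0
    else:
        second = sentences[index + 1]
        label = 1
    tokens = [CLS] + first.split() + [SEP] + second.split()
    return tokens, label


corpus = ["Sid went outside", "it began to rain", "unfortunately it wasn't"]
tokens, label = make_nsp_example(corpus, 0, rng=random.Random(0))
```

Only the output embedding at the position of the first token would be used to make the yes/no prediction; the rest of the sequence is there to be attended to.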

So by combining next sentence prediction and masked language modeling, the weights of this large transformer, the BERT transformer, gradually start to acquire knowledge of how words typically interact in sentences, and maybe abstract knowledge of the typical ways in which meaning flows through sentences.

And of course, the spaces in which they represent each of the individual words at various levels of the stack, things start to happen like words that have similar meanings start to come close together.

The model might need to separate them out across the different parallel heads if words have various different senses.

And so, a lot of the general knowledge that we talked about being very necessary for forming a consistent and coherent representation of loads of different language sentences can start to be introduced into the weights of this model as it trains according to these unsupervised objectives.

So that's the theory behind BERT or at least the intuition behind that.

And of course, because neither of those training objectives requires any sort of particular labels, BERT is trainable on all of the text that exists in digital form in English around the world.

So you could take any text from the internet and use it to train more and more knowledge in theory into the weights of BERT.

Of course, that's in principle how BERT works but it wouldn't be a very convincing demonstration unless there was some evaluation.

And in this case, the way that BERT is then evaluated is by taking the knowledge it has acquired in all of those ways and using that as a starting point to train on many specific language understanding tasks.

And these tasks do use labeled data, and they typically have a lot less data.

So, in order to apply BERT to these tasks, the BERT weights, which are trained on all of the unsupervised objectives, are taken, the data specific to each of these tasks is passed through the BERT model, and then the BERT weights are updated according to the supervised learning signal from these specific language understanding tasks.

Typically this process of fine tuning the BERT representations for a specific task takes place separately and independently for each of those additional tasks.

And it's also necessary in many cases when fine tuning in this way to add a little bit of machinery onto the top of BERT because, you know, the standard BERT architecture just outputs a number of distributed representations at the top of this transformer model.

But, of course, given a specific task, it may be necessary to come to some sort of prediction, depending on the output format of the task, it may be necessary to take only some of those representations and condition on them with additional weights in order to make that prediction.

But typically that's only a small amount of additional weights that contains task-specific knowledge and the vast majority of the model contains the general knowledge that was trained into the model.
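The "small amount of additional weights" can be pictured as a single linear layer plus a softmax applied to one of the output representations. The sketch below is a toy in plain Python: the function names, the two-dimensional embedding and the choice of the first token's representation as the input are illustrative assumptions, not details from the lecture.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def task_head(embedding, weights, biases):
    """Task-specific layer: one weight row and one bias per output class."""
    scores = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Toy numbers: a 2-dimensional output embedding and a 2-class task.
embedding = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 0.0]]
biases = [0.0, 0.0]
probs = task_head(embedding, weights, biases)
```

During fine tuning both these few head weights and the pre-trained weights underneath would be updated, but the head is the only part that starts from scratch for each task.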

So just doing this massively improves the performance of any models which aim to exhibit some sort of general understanding of language.

What I mean by that is, for any model which is intended to be trained on a wide range of different tasks, using the BERT-style approach, so transferring knowledge from enormous running text corpora via fine tuning to those specific tasks, has led to really strong and significant performance on a large number of these tasks.

And importantly, this doesn't just allow one model to solve lots of tasks better, in many cases, this is the way to achieve state-of-the-art performance on these additional tasks.

So even a model which was just specialised to those additional supervised learning tasks would not perform better than a model that was initially pre-trained in the BERT style. In fact, for a lot of these tasks, performance is substantially worse unless you apply BERT-style pre-training on an enormous corpus before transferring to these additional tasks.

So this is a really sort of compelling demonstration of transfer learning.

And the key insight with BERT is that transfer needs to take place throughout the weights of a large network.

Previous attempts to do this involved transfer just at the level of the specific word embedding weights, which can encode the information relevant to each individual word in a model's vocabulary.

But they didn't have a mechanism to encode the ways in which those words combined.

Now a few years before BERT, a model called ELMo and a couple of other models started to show that there was some promise in sharing more than just those word embedding weights, and actually sharing a large number of functions which learned to combine words when pretrained on some task-agnostic objective and then transferred to specific tasks.

And then the BERT model really took that to the next level using the machinery of the transformer to exhibit really impressive transfer learning.

So we've now acquired five interesting principles of how language and meaning seem to interact when we understand a sentence.

And we've added this fifth one: understanding is balancing input with knowledge that we've had already, or our general knowledge of the world.

And we've talked about BERT as a mechanism for endowing models with something like a general knowledge that may be necessary.

And we've shown that, in fact, indeed it is very important on a lot of language understanding tasks to have this sort of prior knowledge acquired from a massive range of different experiences and different types of texts.

So in the next section we'll look a bit forward to other sources of information which may plausibly be useful for different language understanding models because of course, BERT only has the means to acquire knowledge through text whereas if you think about the fruit fly example or time flying like an arrow, those sorts of examples tell us that there are many other sources of information that we may have used in order to gain the general conceptual or world knowledge required to actually make sense of language.

So in the last section we saw how the BERT model is a really exciting example of transferring knowledge from an enormous amount of text, to apply that knowledge to very specific language tasks that maybe have a small amount of data from which to learn.

And this works in part because of the critical importance of general knowledge in understanding language and we need ways in which models can acquire general principles of how language works and how word meanings fit together in order to make high quality predictions for a range of different language tasks.

Now in this section we're going to talk about further ways in which we might be able to endow models with general or conceptual knowledge which they can then apply to language related tasks.

And in particular in a way that's not accessible to the BERT model, which is the ability to extract knowledge, general knowledge and conceptual knowledge from our surroundings, which is something as humans that we are doing all the time.

Now this is an opportune time to start thinking about these challenges because the tools available for these sorts of unsupervised knowledge extraction processes are improving all the time.

So as well as the objective of masked language modeling and next sentence prediction that we saw with BERT, there's also exciting techniques in the field of computer vision, that often involve things like missing parts of an image and making predictions about whether or not that part of the image is the correct part or of which pixels would most appropriately fit in to that part of an image or maybe contrasting incorrect parts of images with correct parts of images and things like that.

So those sorts of objectives are also leading to really good ability to transfer from large banks of images to specific image classification tasks.

And, of course, when it comes to jointly learning language and behaviour, which often involves reinforcement learning, there are techniques for having agents develop a more robust understanding of their surroundings and possibly build what's known as a model of their world.

Those techniques are also improving.

So in DeepMind we thought it was the right time given all of these improvements to start to study this question of knowledge acquisition through prediction in an actual agent that can interact in its surroundings.

But the idea of knowledge acquisition through prediction is actually a very old one in neuroscience and psychology.

So it goes back all the way to the time of Helmholtz and there's some very influential papers you can see in this slide that really proposed and made clear the idea that predicting what was about to happen to an agent or an organism was a very powerful way of extracting knowledge and structure about the world that surrounds that agent.

Now, in our case, we unfortunately can't set an enormous neural network free in the world in which we live and just see if it learns.

But the next best thing is to create a simulated world.

And we did that in the Unity game engine.

And the aim with this work was to study precisely whether or not an agent which moves around this world can apply various different algorithms in order to acquire as much knowledge as possible from its environment.

But in particular, in a slight difference to other work on this sort of topic, we were interested in whether or not this knowledge would be language relevant, i.e. whether or not this knowledge would be knowledge which could serve the agent's ability to understand or use language.

And the way we did that was, as well as creating loads of random rooms with different objects positioned in different places in this simulation, we also created a bunch of questions such that for any random room that was created the agent could find in the environment questions which could plausibly be answered, so examples of the sorts of questions we asked were things like 'what is the colour of the table?'

'what is the shape of the red object?'

'how many rubber ducks are there in the room?'

'is there a teddy bear somewhere?'

And even comparison questions like 'is the number of rubber ducks bigger than the number of toy sheep?'

So those are the sorts of questions.

And importantly, being able to answer these questions requires a particular type of knowledge, that's propositional knowledge, the knowledge, the ability to tell whether something's true or false in our environment and that's often contrasted especially by philosophers with procedural knowledge, which is just the sort of instinctive knowledge that maybe a reinforcement learning agent would naturally have when it learns to solve control problems in a very fast and precise way.

So this is a different type of problem from the one most typically faced by agents which are trained with reinforcement learning.

So, in order to think about how we could develop algorithms to aggregate these sorts of knowledge as an agent explores its surroundings, we first just gave the agent a policy which meant that it visited all of the things in the room a little bit.

So that essentially creates a video of experience and then we set our learning model the challenge of taking in that experience and aggregating knowledge as much as possible in the memory state of the agent as it lives that experience.

And then the way that we measure the quality of that knowledge is by bolting on a QA decoder onto the agent.

And that's the part of the model which is going to actually produce the answer to the questions when fed with the current memory state of the agent and the particular question.

So as an example, the agent might explore a room with a yellow teddy bear and a red sheep and a large table and then a small toy dinosaur under the table.

And the environment might present the question: 'what is the toy that's under the table?'

Now the agent would explore and the agent's learning algorithm can take in all of the things it sees as it moves around the room.

But the agent itself can't see the question.

So the learning algorithm just has to find a way, based on that experience, to aggregate general knowledge into the agent such that when the QA decoder is cued with the state of the agent at the end of the episode and cued with the question, it's possible to combine those two pieces of knowledge and answer with 'dinosaur'.

So to do this effectively, the agent needs a large amount of general knowledge about how things are arranged in the environment around it such that the QA decoder can take that knowledge and make predictions about the answers to questions.

It's a very important detail that we do not backpropagate from the answer to the question back into the agent.

So the weights in the agent and the objectives that the agent's applying must be general.

They can't be specifically tailored to getting the knowledge to answer the question.

Instead, it must be a process of aggregating throughout this episode such that at the end of the episode the agent's memory is as knowledgeable as possible.

And then to this we apply various different baselines.

So the obvious baseline is just an LSTM or any sort of recurrent agent.

It's much more complicated to apply transformers in this context because, of course, we can't see the whole episode at once.

As we're moving through the world, we can only see timesteps up to the current timestep.

Now, another approach is to endow the agent with predictive learning objectives.

These are a little bit like the sorts of masked language model predictions that BERT's making, but here, given a certain time point in the episode, a predictive engine, an overshoot engine, takes the current memory state of the agent at that time point and rolls forward in time.

Once the agent has finally experienced the episode, we can then do some learning where we compare the prediction of that predictive loss to what the agent actually encounters.

And importantly, the predictive loss can also take into account the action that the agent chose to take in each of those timesteps.

So these are kind of action conditional overshoot unrolls where we see what the agent actually encountered in the future and then update the weights of the agent such that they're better able to make these sorts of predictions.

And we tried two specific algorithms. In one case, the SimCore algorithm, which was proposed last year, the loss that's used in this predictive mechanism is a generative model loss, which is modeling each of the individual pixels in the observations that the agent sees in the world in future timesteps. In the other predictive objective, we use contrastive predictive coding.

This basically involves presenting the model with two images at a given timestep and asking the model to say which of those two is actually the one that the agent encounters in the future, as opposed to one which is selected randomly from some other episode.
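The "which of these is the real future" choice can be written as a small classification loss. The following is only a toy sketch of a contrastive objective in plain Python, not the agent's actual implementation: it assumes the prediction and the candidate observations have already been turned into vectors, scores each candidate by a dot product, and the function name and example vectors are made up for illustration.

```python
import math


def contrastive_loss(prediction, positive, negatives):
    """-log probability of picking the true future observation among the candidates.

    prediction: vector rolled forward from the agent's memory state.
    positive:   embedding of the observation the agent actually encountered.
    negatives:  embeddings of observations sampled from other episodes.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Index 0 is the true observation; the rest are distractors.
    scores = [dot(prediction, positive)] + [dot(prediction, n) for n in negatives]
    m = max(scores)  # log-sum-exp with max subtraction for stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[0]


# Prediction agrees with the real future: low loss.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Prediction agrees with the distractor instead: higher loss.
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Note the contrast with a generative pixel loss: this objective only asks the model to tell candidates apart, so it never has to reconstruct every detail of the future observation.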

Now we can evaluate these sorts of predictive mechanisms for aggregating knowledge in the agent precisely by their ability to create knowledge in the memory state of the agent such that at the final timestep of every episode, the question answering decoder can take that knowledge and answer the question.

What we found surprisingly is that only one of these predictive algorithms actually led to the agent being able to effectively answer questions.

And that was SimCore, the model which uses a generative model to estimate the probability density of the pixels in future observations, conditioned on the memory state further back in time.

So the contrastive predictive algorithm was much less effective at giving the agent the general knowledge required to be able to answer these questions.

The green line at the top of the plot here shows the performance of the agent if you backpropagate from the question answers back into the actual agent memory.

So by doing that, you allow the agent to specialise in a particular type of question for every episode rather than requiring it to build up knowledge in a general way.

But you can see that that makes the agent much more effective at answering these questions.

To give us some flavour of exactly what the agent does, here's a video: you can see the question is 'what is the colour of the pencil?'

And you can see that as the episode continues, the agent's prediction gets more and more confident that the answer is red.

Such that on the final timestep red is by far the most probable answer.

If we consider the other video, let me just activate the other video.

You can see that a similar thing happens with a different type of question.

So here the question is 'what is the aquamarine object?'

And the answer is it's a grinder, it's a salt and pepper grinder.

And again, the agent's confidence is very strong at the end.

But we observed these sorts of effects only in the agent which was endowed with the SimCore model of predicting the probabilities of pixels of future observations conditioned on the actions that it took at arbitrary overshoots into the future.

So that's just a small insight into work that's going on in DeepMind where we're starting to consider how we can aggregate knowledge from the general environment, as well as knowledge from large amounts of text, into a single model which can combine this sort of conceptual understanding and general knowledge with a really strong understanding of language, giving a single agent a coherent and strong ability to form the meaning of statements and sentences and also to take that knowledge to answer questions, to produce language and to enact policies, enabling it to do things in complex environments.

So we've reached the end of the lecture and I just thought we'd go back and reflect quickly on the various things that we've covered.

So we've talked about various aspects of language which make neural networks and deep networks particularly appropriate models for capturing the way that meaning works.

So in particular, we raised the fact that words are not discrete symbols but actually almost always have a number of different related senses, that disambiguation is a huge part of understanding language, and that it can very often depend critically on context.

We've talked about the fact that that context can be non-local.

And, to do with the work we're currently thinking about, that context can also be very non-linguistic.

It can depend very much on what we're currently seeing or doing.

And the notion of composition, we've reflected on the fact that that in itself seems to vary depending on what the words are that are being combined in any one instance.

And we've talked about the importance of background knowledge and ways to combine, ways to acquire that.

So one way that we talked about was BERT and unsupervised learning from text and another way was through predictive objectives in a situated agent.

And so if we look at these features or these aspects of language, the mechanisms that I've discussed today cover them reasonably well, and hopefully they shed some light on why neural networks and interactive processing architectures that obey the sort of disparate principles of neural computation and distributed representations are particularly effective for language processing.

But of course, it should be said that there are many aspects of language processing that the work I've talked about just doesn't start to approximate, doesn't start to capture.

And that's in particular around a lot of the social aspects of language understanding.

So our models are not currently able to do things like understanding the intentions of others or reflecting on how language is used to communicate and do things.

And, you know, we need to make a lot more progress in these areas if we're actually going to arrive at agents which are truly able to understand language.

So yeah, just as a final note, I think it's interesting that before deep learning really exhibited its success on language processing problems, a typical view of language understanding was what I call the pipeline view, which was that each part of processing language, from the letters to the words, to the syntax and to the meaning and then eventually to some prediction, could be thought of relatively independently as a separate process.

But now that we've reflected on how language works and in particular taking in all of the evidence from the effectiveness of different neural language models on language processing tasks, I think maybe this is a more effective or more realistic schematic of how language processing should be thought of.

So we may have some stimuli, some letters or sounds, and we've always got some sort of context around those letters or sounds.

Those two things are input to our system, but critically, it's that input combined with our general background knowledge of the world and knowledge of language which together allow us to arrive at some sort of plausible meaning for everything that we hear or everything that we might say.

So on that, we'll finish up.

Thanks very much for your time.

There are some selected references here and many other references, which I don't have time to list here, but which have been hugely inspirational for the work that I've talked about today.

At the end, I've popped a few there for recent work at DeepMind, but again, there's not time to list a huge amount of very related work.

So anyway, I hope you've enjoyed this lecture and it's given you some insight into why language and language understanding is such an interesting problem for computational models to try and tackle.

And I hope that you've enjoyed the talk and you'll enjoy the following lectures in the DeepMind lecture series. Thank you very much.