Deep Learning in NLP
| Figure 1: word cloud of the words in the titles of the accepted papers of ACL 2018, from https://acl2018.org/2018/07/31/conference-stats/. Note the prevalence of deep learning-related words such as “embeddings”, “network”, “rnns”. |
This writing is based on many resources and most of the material is a summary of the main points I have taken from them. Credit is given in the form of linking to the original source. I would be happy to get corrections and comments!
Who is this post meant for?
As always, this post will probably be too basic for most NLP researchers, but you’re welcome to distribute it to people who are new to the field! And if you’re like me and you enjoy reading people’s views about things you already know, you’re welcome to read too!
What this post is not
This post is NOT an extensive list of all the most up-to-date recent innovations in deep learning for NLP; I don’t know many of them myself. I tried to stay at a relatively high level, so there are no formulas or formal definitions. If you want to read an extensive, detailed overview of how deep learning methods are used in NLP, I strongly recommend Yoav Goldberg’s “Neural Network Methods for Natural Language Processing” book. My post will also not teach you anything practical. If you’re looking to learn DL for NLP in 3 days and make a fortune, there are tons of useful guides on the web. Not this post!
Let’s talk about deep learning already!
In the context of NLP, machine learning has been used for a long time, so what is new? Before we go through the differences, let’s think of the characteristics of a traditional machine learning algorithm, specifically a supervised classifier. In general, a traditional supervised classification pipeline was as follows.
| Figure 2: the traditional supervised classification pipeline. |
| Figure 3: features for NER, from this paper. The example is taken from Stanford CS224d course: Deep Learning for Natural Language Processing. |
Phase (2) is the learning/training phase, in which the computer tries to approximate a function that takes the feature vectors as input and predicts the correct labels. The function operates on the features: in the simple case of a linear classifier, it assigns a weight to each feature with respect to each label. Features which are highly indicative of a label (e.g. previous word = The for ORGANIZATION) are assigned a high weight for that label, and those which are highly indicative of not belonging to that label are assigned a very low weight. The final prediction is made by summing up all the weighted feature values and choosing the label with the highest score.
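To make this concrete, here is a minimal sketch (in plain numpy, with invented feature names, labels, and weights – not taken from any real model) of how a linear classifier scores the possible labels for one token:

```python
import numpy as np

# Hypothetical setup: 4 binary features and 3 labels, just for illustration.
features = ["is_capitalized", "prev_word_is_The", "contains_digit", "is_first_in_sentence"]
labels = ["PERSON", "ORGANIZATION", "OTHER"]

# One learned weight per (label, feature) pair, plus a bias per label.
# These numbers are invented; in reality they are estimated from training data.
W = np.array([
    [ 1.2,  0.1, -0.5,  0.3],   # PERSON
    [ 0.9,  1.5, -0.2,  0.1],   # ORGANIZATION
    [-0.8, -0.4,  0.6, -0.1],   # OTHER
])
b = np.array([0.0, -0.2, 0.5])

# Feature vector for the word "Bank" in "The Bank announced ...":
# capitalized, previous word is "The", no digit, not sentence-initial.
x = np.array([1, 1, 0, 0])

scores = W @ x + b                           # one score per label
predicted = labels[int(np.argmax(scores))]
print(dict(zip(labels, scores)), "->", predicted)  # ORGANIZATION wins here
```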
What has improved?
First of all, it works, empirically. The state-of-the-art performance in nearly every NLP task today is achieved by a neural model. Literally every task listed on this website “Tracking Progress in Natural Language Processing” has a neural model as the best performing model. Why does it work so much better than previous approaches?
Going Deep
“Deep learning” is a buzzword referring to deep neural networks – powerful learning algorithms inspired by the brain’s computation mechanism. A simple single-layered feed-forward neural network is basically the same as the linear classifier we discussed. A deep neural network is a network that contains one or more hidden layers, which are also learned. This is the visual difference between a linear classifier (left, neural network with no hidden layers) and a neural network with a single hidden layer (right):
| Figure 4: shallow neural network (left) vs. deep (1 hidden layer) neural network (right). |
In general, the deeper the network is, the more complex the functions it can represent, and the better it can (theoretically) approximate them.
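To make figure 4 concrete, here is a minimal sketch in PyTorch (the post doesn’t assume any particular library; PyTorch and the layer sizes below are just illustrative choices). The only difference between the two models is the extra hidden layer and non-linearity:

```python
import torch.nn as nn

num_features, hidden_size, num_labels = 4, 16, 3   # arbitrary sizes for illustration

# "Shallow": a linear classifier, i.e. the weighted feature sum from the previous section.
linear_classifier = nn.Linear(num_features, num_labels)

# "Deep" (one hidden layer): the input is first passed through a learned
# non-linear transformation, and classification is done on top of that.
one_hidden_layer = nn.Sequential(
    nn.Linear(num_features, hidden_size),
    nn.Tanh(),
    nn.Linear(hidden_size, num_labels),
)
```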
Representation Learning
Another key aspect that changed is how the input is represented. Traditional machine learning methods worked because someone designed a well-thought feature vector to represent the inputs, as we can see in figure 3. Deep learning obviates the need to come up with a meaningful representation, and learns a representation from raw input (e.g. words). The new pipeline looks like this:
| Figure 5: the deep learning pipeline, in which the representation is learned from the raw input. |
| Figure 6: a neural network for NER, which uses a window of 3 words to classify a word. |
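As a rough sketch (again in PyTorch, with an invented vocabulary size, embedding dimension, and tag set), the window-based model in figure 6 could look something like this – note that the word representations are parameters of the model and are learned during training:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_size, num_tags, window = 10000, 50, 100, 5, 3

class WindowNER(nn.Module):
    def __init__(self):
        super().__init__()
        # The representation is learned: each word ID is mapped to a vector.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden_size)
        self.out = nn.Linear(hidden_size, num_tags)

    def forward(self, word_ids):                        # word_ids: (batch, window)
        vectors = self.embed(word_ids)                  # (batch, window, emb_dim)
        concat = vectors.view(word_ids.size(0), -1)     # concatenate the window
        return self.out(torch.tanh(self.hidden(concat)))  # one score per NER tag

scores = WindowNER()(torch.randint(0, vocab_size, (1, window)))
print(scores.shape)   # torch.Size([1, 5])
```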
Recurrent Neural Networks
Recurrent neural networks (RNNs) solve both problems. An RNN is a network that takes as input a sequence (e.g. a sentence as a sequence of words, or a sequence of characters, etc.) and outputs a vector for each prefix of the sequence (i.e. the first item, the first two items, …, the entire sequence). At each time step, the RNN considers both the previous memory (of the sequence up to the previous input) and the current input. The last output vector can be considered a representation of the entire sequence, like a sentence embedding. The intermediate vectors can represent a word in its context.
The output vectors can then be used for many things, among them: representation – a fixed-size vector representation for arbitrarily long texts; feature vectors for classification (e.g. representing a word in context and predicting its NER tag); or generating new sequences (e.g. in translation, a sequence-to-sequence or seq2seq model encodes the source sentence and then decodes, i.e. generates, the translation). In the case of the NER example, we can now use the output vector corresponding to each word as the word’s feature vector, and predict the label based on the entire preceding context.
| Figure 7: a NER model that uses an RNN to represent each word in its context. |
Technical notes: an LSTM (long short-term memory) is a specific type of RNN that works particularly well and is commonly used. The differences between the various RNN architectures lie in the implementation of the internal memory. A bidirectional RNN/LSTM, or biLSTM, processes the sequence in both directions – right to left and left to right – such that each output vector contains information about both the previous and the subsequent items in the sequence. For a much more complete overview of RNNs, I refer you to Andrej Karpathy’s blog post “The Unreasonable Effectiveness of Recurrent Neural Networks”.
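Sticking with the NER example, here is a toy sketch of a biLSTM tagger along the lines of figure 7 (not the exact model from any particular paper; the sizes are again invented):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_size, num_tags = 10000, 50, 100, 5

class BiLSTMTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True: each word's output vector encodes both its
        # left context and its right context.
        self.lstm = nn.LSTM(emb_dim, hidden_size, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, word_ids):                       # (batch, seq_len)
        outputs, _ = self.lstm(self.embed(word_ids))   # (batch, seq_len, 2*hidden)
        return self.out(outputs)                       # one tag score vector per word

sentence = torch.randint(0, vocab_size, (1, 7))        # a "sentence" of 7 word IDs
print(BiLSTMTagger()(sentence).shape)                  # torch.Size([1, 7, 5])
```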
So these are the main differences. There are many other new techniques on top of them, such as attention, Generative Adversarial Networks (GAN), deep reinforcement learning, and multi-task learning, but we won’t discuss them.
Interestingly, neural networks are not a new idea. They have been around since the 1950s, but have become increasingly popular in recent years thanks to the advances in computing power and the amount of available text on the web.
What is not yet working perfectly?
Although the popular media likes to paint an optimistic picture of “AI is solved” (or rather, a very pessimistic, false picture of “AI is taking over humanity”), in practice there are still many limitations to current deep methods. Here are several of them, mostly from the point of view of someone who works on semantics (feel free to add more limitations in the comments!).
The Need for an Unreasonable Amount of Data
It’s difficult to convey the challenges in our work to people outside the NLP field (and related fields), when there are already many products out there that perform so well. Indeed, we have come a long way, to the point that we can generally claim that some tasks are “solved”. This is especially true for low-level text analysis tasks such as part of speech tagging, but the performance on more complex semantic tasks like machine translation is surprisingly good as well. So what is the secret sauce of success, and why am I still unsatisfied?
First, let’s look at a few examples of tasks “solved” by DL, not necessarily in NLP:
- Automatic Speech Recognition (ASR) – also known as speech-to-text. Deep learning-based methods were reported to reach human-level performance last year, but this interesting blog post tells us differently. According to this post, while the recent improvements are impressive, the claims about human-level performance are too broad. ASR works very well on American-accented English with high signal-to-noise ratios, because it has been trained on conversations by mostly American native English speakers with little background noise, which are available at large scale. It doesn’t work well – definitely not at human-level performance – for other languages, accents, non-native speakers, etc.
- Facial Recognition – another task claimed to be solved, but this article from the New York Times (based on a study from MIT) says that it’s only solved for white men. An algorithm trained to identify gender from images was 99% accurate on images of white men, but far less accurate – only 65% – for dark-skinned women. Why is that so? A widely used dataset for facial recognition was estimated to be more than 75% male and more than 80% white.
- Machine Translation – not claimed to be solved yet, but the release of Google Translate’s neural models in 2016 reported large performance improvements. The paper reported a “60% reduction in translation errors on several popular language pairs”. The language pairs were English→Spanish, English→French, English→Chinese, Spanish→English, French→English, and Chinese→English. All these languages are considered “high-resource” languages, or in simple words, languages for which there is a massive amount of training data. As we discussed in the blog post about translation models, the training data for machine translation systems is a large collection of the same texts written in both the source language (e.g. English) and the target language (e.g. French). Think of book translations as an example of a source of training data.
In other news, popular media was recently worried that Google Translate spits out religious nonsense that is completely unrelated to the source text. Here is an example: the top output contains religious content, while the bottom one is not religious, just unrelated to the source text. This excellent blog post offers a simple explanation for this phenomenon: “low-resource” language pairs (e.g. Igbo and English), for which there is not a lot of available data, have worse neural translation models. Not surprising so far. The reason the translator generates nonsense is that when it is given unknown inputs (the nonsense input in figure 8), the system tries to produce a fluent translation and ends up “hallucinating” sentences. Why religious texts? Because religious texts like the Bible and the Koran exist in many languages, and they probably make up the major part of the available training data for translation between low-resource languages.
Can you spot a pattern among the successful tasks? Having a tremendous amount of training data increases the chances of training high-quality neural models for the task. Of course, it is a necessary-but-not-sufficient condition. In her RepL4NLP keynote, Yejin Choi talked about “solved” tasks and said that what they all have in common is a lot of training data and enough layers. But with respect to NLP tasks, there is another factor: performing well on machine translation is possible without a model with deep text understanding abilities, by relying instead on the strong alignment between the input and the output. Other tasks which require deeper understanding, such as recognizing fake news, summarizing a document, or holding a conversation, have not been solved yet (and may not be solvable just by adding more training data or more layers).
The models of these “solved” tasks are only applicable to inputs which are similar to the training data. ASR for Scottish accents, facial recognition for black women, and translation from Igbo to English are not solved yet. Translation is not the only NLP example; whenever someone tells you some NLP task is solved, it probably only applies to very specific domains in English.
The Risk of Overfitting
Immediately following the previous point, when the training data is limited, we run the risk of overfitting. By definition, overfitting happens when a model performs extremely well on the training set while performing poorly on the test set. It happens because the model memorizes specific patterns in the training data instead of looking at the big picture. If these patterns are not indicative of the actual task, and are not present in the test data, the model will perform badly on the test data. For example, let’s say you’re working on a very simplistic variant of text classification, in which you need to distinguish between news and sports articles. Your training data contains articles about sports and tweets from news agencies. Your model may learn that an article with fewer than 280 characters is related to news. The performance would be great on the training set, but what your model actually learned is not to distinguish between news and sports articles but rather between tweets and other texts. This definitely won’t be helpful in production, where your news examples can be full-length articles.
Overfitting is not new to DL, but two main aspects have changed from traditional machine learning: (1) it’s more difficult to “debug” DL models and detect overfitting, because we no longer have nice manually-designed features but automatically learned representations; and (2) the models have many more parameters than traditional machine learning models used to have – the more layers, the more parameters. This means that a model can now learn more complex functions, but it’s not guaranteed to learn the best function for the task, only the best function for the given data. Unfortunately, these are not always the same, and this is something to keep in mind!
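One cheap habit that helps is to monitor performance on a held-out development set while training: a large and growing gap between training and development accuracy is the classic symptom of overfitting. Here is a schematic sketch of that idea with early stopping – `train_epoch` and `evaluate` are hypothetical task-specific functions you would already have, not part of any particular library:

```python
def fit_with_early_stopping(model, train_data, dev_data, train_epoch, evaluate,
                            max_epochs=100, patience=3):
    """Stop training once dev accuracy stops improving.

    A widening gap between train_acc and dev_acc is a warning sign that the
    model is memorizing the training data rather than learning the task.
    """
    best_dev, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        train_epoch(model, train_data)
        train_acc = evaluate(model, train_data)
        dev_acc = evaluate(model, dev_data)
        print(f"epoch {epoch}: train={train_acc:.3f} dev={dev_acc:.3f}")
        if dev_acc > best_dev:
            best_dev, epochs_without_improvement = dev_acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # further training is likely to only increase overfitting
    return model
```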
The Artificial Data Problem
Machine learning enthusiasts like to say that given enough training data, DL can learn to estimate any function. I personally take the more pessimistic stance that some language tasks are too complex and nuanced to solve using DL with any reasonable amount of training data. Nevertheless, like everyone else in the field, I’m constantly busy thinking of ways to get more data quickly and without spending too much money.
- Entailment: the hypothesis must also be true.
Hypothesis = “A person performing for children on the street”.
The premise tells us there is a street performer, so we can infer that he is a person and he is on the street. The premise also tells us that he is doing his act – therefore performing, and for kids = for children. The hypothesis only repeats the information conveyed in the premise and information which can be inferred from it.
- Neutral: the hypothesis may or may not be true.
Hypothesis = “A juggler entertaining a group of children on the street”.
In addition to repeating information from the premise, the hypothesis now tells us that it’s a juggler. The premise is more general and can refer to other types of street performers (e.g. a guitar player), so this may or may not be a juggler.
- Contradiction: the hypothesis must be false.
Hypothesis = “A magician performing for an audience in a nightclub”.
The hypothesis describes a completely different event happening at a nightclub.
This task is very difficult for humans and machines, and requires background knowledge, common sense, knowledge about the relationships between words (whether they mean the same thing, whether one is more specific than the other, or whether they contradict each other), recognizing that different mentions refer to the same entity (coreference), dealing with syntactic variations, etc. For an extensive list, I recommend reading this really good summary.
- Good performance on the test set doesn’t necessarily indicate solving the task. Whenever our training and test data are not “natural” but rather generated in a somewhat artificial way, we run the risk that they will both contain the same peculiarities which are not properties of the actual task. A model learning these peculiarities is wasting energy on remembering things that are unhelpful in real-world usage, and creating an illusion of solving an unsolved task. I’m not saying we should never generate data this way – just that we should be aware of the risk. If your model is getting really good performance on a really difficult task, there’s reason to be suspicious.
- DL methods can only be as good as the input they get. When we make inferences, we employ a lot of common sense and world knowledge. This knowledge is simply not available in the training data, and we can never expect the training data to be extensive enough to contain it. Domain knowledge is not redundant, and hopefully in the near future someone will come up with smart ways to incorporate it into a neural model and achieve good performance on newer, less simple datasets.
The Shaky Ground of Distributional Word Embeddings
At the core of many deep learning methods lie pre-trained word embeddings. They are a great tool for capturing semantic similarity between words. They mostly capture topical similarity, or relatedness (e.g. elevator-floor), but they can also capture functional similarity (e.g. elevator-escalator). Relying on pre-trained word embeddings is a great way to make your model generalize to new inputs, and I recommend it for any task that doesn’t have loads of available data.
| Figure 10: a part of Martin Luther King’s “I Have a Dream” speech after replacing nouns with most similar word2vec words. |
Apart from being funny, this is a good illustration of the phenomenon: words have been replaced with other words that stand in various relationships to them. For example, country instead of nation is not quite the same, but synonymous enough. Kids instead of children is perfectly fine. But a daydream is a type of dream, week and day are mutually exclusive (they share the common category of time units), Classical.com is completely unrelated to content (yes, statistical methods have errors…), and protagonist is synonymous with the original word character, but in the sense of a character in a book – not of an individual’s qualities.
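If you want to play with this kind of substitution yourself, a minimal sketch with gensim and pre-trained word2vec-format vectors would look like this (the file name below is just a placeholder for whichever pre-trained embeddings you download):

```python
from gensim.models import KeyedVectors

# Placeholder path: any pre-trained word2vec-format file will do.
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

for word in ["nation", "children", "dream", "day", "character"]:
    # The single nearest neighbour in embedding space -- roughly how the
    # words in figure 10 were replaced.
    neighbour, similarity = vectors.most_similar(word, topn=1)[0]
    print(f"{word} -> {neighbour} ({similarity:.2f})")
```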
The Unsatisfactory Representations Beyond the Word Level
Recurrent neural networks allow us to process texts of arbitrary length: from phrases consisting of several words to sentences, paragraphs, and even full documents. Does this also mean that we can have phrase, sentence, and document embeddings that will capture the meaning of these texts?
To be honest, I completely agree with this opinion when talking about general-purpose sentence embeddings. While we have good methods to represent sentences for specific tasks and objectives, it’s not clear to me what a “generic” sentence embedding needs to capture and how to learn such a thing.
Many researchers think differently, and sentence embeddings have been pretty common in the last few years. To name one example, the Skip-Thought vectors build upon the assumption that a meaningful sentence representation can help predict the next sentence. Given that even I, as a human, can rarely predict the next sentence in a text I’m reading, I think this is a very naive assumption (could you predict that I’d say that?…). But it probably predicts the next sentence more accurately than it would predict a completely unrelated sentence, creating a kind of topical representation rather than a meaning representation. In the example in figure 11, the model considered lexically-similar sentences (i.e. sentences that share the same words or contain similar words) as more similar to the target sentence than a sentence with a similar meaning but a very different phrasing. I’m not surprised.
| Figure 11: The most similar sentences to the target sentence “A man starts his day in India” out of the 3 other sentences that were encoded, using the sent2vec demo. |
- Do they capture things which are not said explicitly, but can be implied? (I didn’t eat anything since the morning implies I’m hungry).
- Do they capture the meaning of the sentence in the context it is said? (No, thanks can mean I don’t want to eat if you’ve just offered me food).
- Do they always assume that the meaning of a sentence is literal and compositional (composed of the meanings of the individual words), or do they have good representations for idioms? (I will clean the house when pigs fly means I will never clean the house).
- Do they capture pragmatics? (Can you tell me about yourself actually means tell me about yourself. You don’t want your sentence vectors to be like the interviewee in this joke).
- Do they capture things which are not said explicitly because the speakers have a common background? (If I tell a local friend that the prime minister must go home, we both know the specific prime minister I’m talking about).
I can go on and on, but I think these are enough examples to show that we have yet to come up with a meaning representation that mimics whatever representation we have in our heads, which is derived by making inferences and by drawing on common sense and world knowledge. If you need more examples, I recommend taking a look at the slides from Emily Bender’s “100 Things You Always Wanted to Know about Semantics & Pragmatics But Were Afraid to Ask” tutorial.
The Non-Existent Robustness
Anyone who’s ever tried to reimplement neural models and reproduce the published results knows we have a problem. More often than not, you won’t get the exact same published results. Sometimes not even close. This often happens because of differences in the values of “hyper-parameters”: technical settings that control training, such as the number of epochs, the regularization values and methods, and others. While they seem like minor details, in practice hyper-parameter values can make large differences in performance.
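There is no full solution here, but two small habits make results easier to reproduce: fix the random seeds, and report every hyper-parameter value you used (ideally with the range you searched over). A minimal sketch of both, assuming a PyTorch setup and a hypothetical `train_and_evaluate` function for your task:

```python
import itertools
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Make runs repeatable (at least on the same hardware/software stack)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# Even a small hyper-parameter grid can change the results considerably,
# which is why papers that omit these values are hard to reproduce.
grid = {"learning_rate": [1e-3, 1e-4], "dropout": [0.2, 0.5], "num_epochs": [5, 20]}

for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    set_seed(42)
    # dev_score = train_and_evaluate(config)   # your task-specific training code
    print("would train with:", config)
```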
The Lack of Interpretability
Last but not least, the interpretability issue is probably the worst caveat of DL. Neural models are largely black boxes: it is hard to explain why a model made a specific prediction, and that matters when such models are used for high-stakes decisions, for example:
- Self-driving cars
- Predicting the probability that a prisoner will commit another crime (who should be released?)
- Predicting the probability of death for patients (who is a better healthcare “financial investment”?)
- Deciding who should be approved for a loan
- more…
In the next post I will elaborate on some of these examples and how ML sometimes discriminates against particular groups. This post is long enough already!