Hi! Welcome to the first post on Deep Learning for NLP. Today we'll be working on Sentiment Classification.

Read the welcome post first if you haven't yet!

From this point on, here's how we'll be doing things. I'll put up the GitHub link for the notebook and the dataset files (which you can upload to Google Colab if you want to run it). All practical implementation details are explained in the notebook.

These posts, however, will focus on more conceptual material – explaining models, architectures, working through equations, etc. It is recommended that you read the post first before you run through the notebook, since we'll also be introducing concepts here.

So anyway, here's the GitHub link. Today we'll be looking into:

  • Natural Language Processing (NLP)
  • Sentiment analysis and classification
  • Recurrent Neural Networks
  • RNN Internals and Limitations
  • Long Short-Term Memory (LSTM) Networks

Natural Language Processing (NLP)

So what's NLP? How is it different from Computational Linguistics? These are some of the most common questions about the subject, and we'll answer them in as few words as possible so we can go straight to the nitty-gritty.

NLP is (quite literally) the processing of natural language. The "natural" in "natural language" means more or less the same as in "all natural ingredients." Natural language is language that evolves through use, shaped by various socio-cultural factors. The language we speak is an example of this. How animals communicate is also arguably a form of natural language. "Processing" refers to the storing, extracting, and transforming of something (in this case, language data).

NLP is then quite simply the processing of natural language towards a certain goal or task. We design and implement systems that learn how to generate text, read passages and answer questions, translate sentences, verify logic, and many more. We're interested in solving tasks that involve language.

Computational Linguistics (CL) differs from NLP in some ways, yet the two share tools and ideas. As the common saying goes, CL is "doing linguistics assisted by computers": you're more concerned with the nature of language and the theories behind languages. NLP is "teaching computers how to solve language-based tasks": you're more concerned with getting computers to solve tasks that involve language. In a lot of ways, however, the line between the two fields is gradually blurring, with both sides sharing ideas and resources. If you've got a correction or a better comparison, do let me know in the Disqus box below!

Sentiment Analysis

Let's say we have a list of tweets about politicians and we are to figure out how people generally[1] feel about those politicians: do they like them? Do they hate them? If the campaign manager of a certain politician could discern this "general feel," then they could make better decisions during campaign season. Or let's say we have a list of reviews about movies. If we can figure out which movies people like or dislike based on reviews, we could then also (theoretically) glean what people generally like in movies, and thus (possibly) make more successful movies in the future.

This task is broadly called Sentiment Analysis, and is one of the many tasks that NLP tries to solve.

Classically, we solve sentiment analysis problems with techniques like Bag-of-Words (BOW). This method pretty much creates a "bag" of all the words that exist in the sentence. In its simplest (binary) form, it doesn't care how many times each word appears, nor which words come after which; it only knows they're there.

"To the left, to the left, everything you own in the box to the left" becomes [box, everything, in, left, own, the, to, you]

We can then aggregate these bags and figure out which words most likely appear in reviews that are positive or negative. However, these models have one glaring limitation: they treat words independently of each other.

Here's an example.

Say I have the sentence "that was a good movie." Our model sees each word[2] independently. It does not care about grammar. It does not care about structure. It does not see syntax. All it cares about is whether it sees words that have appeared in a lot of positive reviews before; if so, it tags the review as positive. In this case, it sees "good" and tags the review as positive.

So far so good. But it starts to break pretty quickly.

How about "that was not a good movie"? Using our current model, it does not know that "not" modifies the word "good" and will still take the occurrence of "good" to mean that the review is positive, even when it's not.

The reason NLP uses neural networks more often these days is that they allow us to model sequences[3]. When they produce representations for input text, they do not discard word order (like BOW does), but instead incorporate it.

Recurrent Neural Networks

NLP generally uses Recurrent Neural Networks[4] to solve various tasks. Here's an illustration (lifted from lecture slides I made) of an RNN.

Let's set some notation here. A sequence $x$ with $T$ tokens is pretty much a sentence with $T$ words, where $x_t$ is the word at timestep $t$. The sequence gets fed into the RNN one token at a time, and at every timestep the RNN produces an output as well as a "hidden state," denoted $h_t$, which can be thought of as the information it passes on to the next timestep. We call this a "recurrence."

We usually look at RNNs in an "unrolled" format, like this.

So say I have the sentence "I want to eat sisig." The network will take five timesteps to read through the entire sentence.

Note that every hidden state $h_t$ carries information from $h_0$ to $h_{t-1}$. This gives RNNs the ability to model sequentiality. In practice, we often receive all the hidden states of the RNN, but we're usually concerned with the last hidden state, since it's the only one that has seen the entire sequence.

Do note that the initial hidden state $h_0$ is usually initialized to a zero tensor, sort of like a "blank slate."
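To make this concrete, here's a minimal PyTorch sketch (the dimensions are arbitrary, and this isn't necessarily how the notebook sets things up):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=50, hidden_size=128, batch_first=True)

# A batch with one 5-token sentence ("I want to eat sisig"),
# where each token is already embedded as a 50-dimensional vector.
x = torch.randn(1, 5, 50)
h0 = torch.zeros(1, 1, 128)  # the "blank slate" initial hidden state

hidden_states, h_last = rnn(x, h0)
print(hidden_states.shape)  # torch.Size([1, 5, 128]) -- one hidden state per timestep
print(h_last.shape)         # torch.Size([1, 1, 128]) -- just the last hidden state
```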

RNN Internals and Limitations

How does an RNN compute what to output and what to keep? Here's an illustration of a basic unrolled RNN (with the current timestep zoomed in).

So what happens here?

Basically, the hidden state of the previous timestep $h_{t-1}$ is concatenated with the current timestep's input $x_t$. This is then multiplied by a weight matrix inside the RNN, and a hyperbolic tangent nonlinearity is applied to the result.
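In equation form (this is one common parameterization; some texts use separate weight matrices for $h_{t-1}$ and $x_t$ instead of concatenating):

$$h_t = \tanh\left(W \left[h_{t-1};\, x_t\right] + b\right)$$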

An intuition for the tanh layer is that it's basically choosing "candidate information" from the previous and the current timestep to pass on. The result is the new hidden state $h_t$, which gets passed on to the next timestep.

After the last timestep, the final hidden state should have information from all the timesteps.

This all sounds good in theory, but it actually poses a lot of problems.

  • RNNs find it hard to model long-term dependencies (far-apart words that depend on each other for context).
  • They're also hard to train and suffer from the vanishing gradient problem: as gradients are propagated back through many timesteps, they get multiplied over and over until they become so minuscule that you find your network suddenly isn't learning (see the equation after this list).
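To see why gradients vanish, note that backpropagating from timestep $T$ to an earlier timestep $t$ multiplies together one Jacobian per timestep in between:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$

When these factors are consistently small (and the tanh saturating at its tails encourages this), the product shrinks exponentially as the gap $T - t$ grows.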

Hochreiter and Schmidhuber (1997) looked into these problems and proposed a solution that is now called the Long Short-Term Memory (LSTM) network.

Long Short-Term Memory (LSTM) Networks

Here's what an LSTM cell (one timestep) looks like.

It pretty much functions the same way as a standard RNN cell, albeit with some differences.

The Cell State is basically a form of "controlled memory storage." We'll see why later on. In addition to the cell state, we also have gates, which control the flow of information inside the LSTM.

As we can recall, the sigmoid activation function squashes its input to a value between 0 and 1. This allows it to function like a "gate." If it outputs 0, it means "do not keep this information." If it outputs 1, it means "keep all of this information." Any value in between means "keep this much information." Gates control how the cell state changes: you use them as mechanisms that control how much, and which, information gets onto the cell state, since the cell state is directly passed on to the next timestep.
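Concretely, a gate is a vector of sigmoid outputs that scales some other vector elementwise:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \text{gated information} = \sigma(z) \odot v$$

where $\odot$ denotes elementwise multiplication, so each component of $v$ is kept in proportion to how "open" its gate value is.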

With this architecture, our LSTM learns how to be picky with the information it takes in and remembers across timesteps.

The LSTM has four gates, namely the forget gate, the input gate, the cell gate, and the output gate. Each gate has a hidden layer (and its own weight matrix). Information flows through an LSTM in four steps.

First, we figure out which information from the previous hidden state to remove. We do this by taking the previous hidden state and the new input, concatenating them, and applying the forget gate (a matrix multiplication with the gate's weights, followed by a sigmoid activation). Intuitively, you can think of this as looking at past information and new information and selecting which old information to throw away.
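In equation form, with $[h_{t-1};\, x_t]$ denoting the concatenation:

$$f_t = \sigma\left(W_f \left[h_{t-1};\, x_t\right] + b_f\right)$$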

Next, we're ready to take in new information.

Given the previous hidden state and the new input, the cell gate figures out "candidate information" to pass on, essentially picking which new information to store. The input gate controls how much of that new information to store.
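These two follow the same pattern as the forget gate, except that the cell gate uses a tanh to produce its candidate values:

$$i_t = \sigma\left(W_i \left[h_{t-1};\, x_t\right] + b_i\right), \qquad g_t = \tanh\left(W_g \left[h_{t-1};\, x_t\right] + b_g\right)$$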

Next, we update the cell state.

Our forget gate modifies our cell state by removing the information it no longer needs; the input gate then controls how much information from the cell gate to pass on, and this gets added to the cell state.
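Putting those pieces together, the new cell state is:

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$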

Next, we figure out what to pass on to the next timestep.

We then apply a hyperbolic tangent nonlinearity to the cell state to produce a "candidate" to pass on to the next timestep, and we control how much of this cell state information we pass on using the output gate. The process then repeats until all timesteps are finished.
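In equation form:

$$o_t = \sigma\left(W_o \left[h_{t-1};\, x_t\right] + b_o\right), \qquad h_t = o_t \odot \tanh(c_t)$$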

The controlled nature of gating mechanisms allows the LSTM to be picky about which information to pass on. This also lets it train in a more stable fashion, as the cell state gives gradients a smoother path to flow along from the last timestep back to the first.

As a recap, here's an unrolled LSTM and the equations that describe its gating mechanisms:
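And if code is easier to read than equations, here's a from-scratch sketch of a single LSTM timestep using the concatenated-input parameterization above (purely for intuition; in practice you'd reach for PyTorch's built-in nn.LSTM):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_g, b_g, W_o, b_o):
    # Each W_* has shape (hidden_dim, hidden_dim + input_dim)
    combined = torch.cat([h_prev, x_t], dim=-1)   # [h_{t-1}; x_t]
    f_t = torch.sigmoid(combined @ W_f.T + b_f)   # forget gate: what to throw away
    i_t = torch.sigmoid(combined @ W_i.T + b_i)   # input gate: how much to store
    g_t = torch.tanh(combined @ W_g.T + b_g)      # cell gate: candidate information
    o_t = torch.sigmoid(combined @ W_o.T + b_o)   # output gate: how much to pass on
    c_t = f_t * c_prev + i_t * g_t                # update the cell state
    h_t = o_t * torch.tanh(c_t)                   # produce the new hidden state
    return h_t, c_t
```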

Last Words

Hopefully this run-through has given you an idea of how RNNs and LSTMs work in general. For the practical notebook, we'll be creating a sentiment classifier using LSTMs and a subset of the IMDb Movie Reviews dataset.
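As a preview, the model in the notebook will look roughly like this (a hedged sketch; the class name, sizes, and exact architecture here are illustrative rather than the notebook's):

```python
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)  # one logit: positive vs. negative

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_last, _) = self.lstm(embedded)   # keep only the last hidden state
        return self.classifier(h_last[-1])     # (batch, 1) sentiment logits
```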

For any comments, corrections, or suggestions, feel free to reach out in the Disqus box below!


[1]I say "generally" but to be statistically correct about it, we're talking about how the "sample of people" feel about the politicians in their tweets, not the general populace.

[2]I should be talking about "tokens" but referring to them as "words" is much simpler in this example.

[3]While it is true that we can use n-grams and bags of n-grams to add a notion of sequentiality to our data, the resulting features become very sparse.

[4]As of press time, Transformers have quickly gained traction and are starting to replace RNNs in a lot of tasks. While these models present state-of-the-art results, RNNs still prove to be bread-and-butter tools to the NLP community as good baseline models and whatnot.