Getting Sentimental

NLP Sentiment Analysis with Deep Learning

Darius Fuller
Sep 30, 2020

This is the big leagues: deep learning with NLP. Something that, before now, I would have imagined to be impossible on a standard laptop. But this project turned out to be a fun and intriguing exploration of how a revolutionary technology can be deployed on a stock computer.

In this post I will explain how I took a data set of tweets from the internet and used deep learning via Keras to classify the tweets according to their sentiment. Since tweets are text data, I will touch on some aspects of NLP (Natural Language Processing), but will primarily focus on the construction and results of the neural network.

Some Context

The task I will be referring to throughout this post is not something that needs to be completed with a neural network. There are multiple well-established options available to anyone looking to perform text classification. For example, Naive Bayes classifiers and Support Vector Machines (SVMs) are commonly used to classify texts. This blog post by Kamran Kowsari goes over text classification and commonly used algorithms in greater detail if you’re curious.

Knowing my options, I decided to go with what I thought was the most novel. Having just done a deep dive into how deep learning and artificial intelligence (AI) work with respect to NLP, I was eager to give it a go. My use of deep learning employs an artificial neural network, or ANN, to look at input text(s) and learn from underlying “hidden” features within to produce a desired output. Basically, the ANN will engineer its own features depending on the input it receives, allowing for (in theory) greater understanding. I would recommend looking over Jason Brownlee’s blog post for a deeper treatment of the topic.

The data set I used for my project is from Crowdflower and can be found on data.world. There was not much information that I could find on exactly where and when the tweets come from, so I am only inferring from what the tweets themselves say. I can say for sure that it consists of over 9,000 tweets taken from those attending one or more tech-related events at the South by Southwest (SXSW) festival in Austin, TX sometime around 2013.

Setting Up

For brevity, I will include all the imports and a brief explanation of their purpose in one convenient GitHub gist:

Imports used in this project
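Since the gist itself isn’t embedded here, below is a sketch of the kinds of imports the rest of the project relies on (exact aliases and module paths are assumptions based on the tools discussed later in the post):

```python
# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# NLP exploration tools
from nltk import FreqDist
from nltk.tokenize import TweetTokenizer
from wordcloud import WordCloud

# Preprocessing and class balancing
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Deep learning with Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GlobalAveragePooling1D, Dropout, Dense
from keras.callbacks import EarlyStopping

# Evaluation
from sklearn.metrics import classification_report, confusion_matrix
```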

Peeking In

This is what the data set looks like after being converted into a DataFrame in a Jupyter Notebook:
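The loading cell isn’t shown in the original post, but it amounts to something like this minimal sketch (the file name and encoding are assumptions):

```python
import pandas as pd

# Load the Crowdflower SXSW tweet data (hypothetical file name)
df = pd.read_csv('sxsw_tweets.csv', encoding='latin-1')
df.head()
```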

Output from code above

Each row includes:

  • A sentiment label**
  • A brand or product the sentiment is directed at**
  • The raw tweet text

**determined by human evaluators

Out of the three features available, the “emotion_in_tweet_is_directed_at” column contains all but one of the missing values in the entire data set (roughly 2/3 missing). In addition, this information would not be useful in training a model to learn from the text, as it is itself an interpretation drawn from the text by a human; for these reasons I removed this column, along with the single row missing a “tweet_text” value, from the DataFrame.

At this point in the project, I used a custom function to clean all of the tweets. This was not in preparation for the neural network; it was an effort to explore the data by creating a couple of visualizations: a frequency distribution and a word cloud.

Frequency distribution

A frequency distribution is a plot displaying how many occurrences of each token there are in a given corpus. It comes from the Natural Language Toolkit (NLTK) and is relatively easy to implement after a text has been tokenized. Here’s how I did it:

Code for creating FreqDist plot
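Since the gist isn’t reproduced here, below is a minimal sketch of the same idea (the tokenizer choice and the number of tokens plotted are assumptions):

```python
from nltk import FreqDist
from nltk.tokenize import TweetTokenizer

# Tokenize every cleaned tweet and flatten into a single list of tokens
tweet_tokenizer = TweetTokenizer(preserve_case=False)
all_tokens = []
for tweet in df['tweet_text'].dropna():
    all_tokens.extend(tweet_tokenizer.tokenize(tweet))

# Build the frequency distribution and plot the 25 most common tokens
freq_dist = FreqDist(all_tokens)
freq_dist.plot(25)
```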
Plot for all tweets in data set

Just by looking at this FreqDist plot, it is easy to see where my inferences about the data set come from. There is a strong showing from tech-related tokens such as “google”, “apple”, and “ipad”, indicating some tie between these tweets and technology. It even seems possible to piece together a sentence describing the event: I suspected there was a “pop-up” to promote the “launch” of a “new” “ipad”.

Word cloud

As a partner to this plot, I coded a word cloud, which represents the same idea stretched over a custom image. The necessary class comes from the wordcloud package. Here’s how I made mine:

Code to generate twitter bird word cloud
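A sketch of how such a word cloud can be generated (the mask image file and parameter values are assumptions; collocations=True is what allows bigrams like “apple store” to appear):

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

# Silhouette of a Twitter bird used as the custom image mask (hypothetical file)
bird_mask = np.array(Image.open('twitter_bird_mask.png'))

# Join the cleaned corpus into one string and generate the cloud
corpus_text = ' '.join(df['tweet_text'].dropna())
cloud = WordCloud(mask=bird_mask, background_color='white',
                  max_words=300, collocations=True).generate(corpus_text)

plt.figure(figsize=(10, 10))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```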
Word cloud for entire corpus

The word cloud appears to confirm the results of the FreqDist plot, with some minor differences. For example, there are bigrams included such as “apple store”, “social network”, or “called circle”. This helped provide more clues as to what was inspiring some of these tweets: the launch of a social network called Circles (confirmed by a Google search).

Tidying Up

Now with a better understanding of the type of tweets that are in the data set, let’s get into how I prepared the data set for the neural network.

Finding the target

First I changed the text labels into a numerical representation that the network can understand easily:

  • “Negative emotion” → 0
  • “No emotion toward brand or product” → 1
  • “I can’t tell” → 1
  • “Positive emotion” → 2

I chose to do it this way because I felt it best represented a negative/neutral/positive structure for the target variable, given the text labels that came with the data. Following this, I needed to convert the labels into a one-hot encoded representation using the to_categorical() function from Keras. The last thing I did before cleaning the actual text was a train-test split, so that I would have data to evaluate my network’s performance with after it had been trained.
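Put together, those steps look roughly like the sketch below (the label column name, test size, and random seed are assumptions):

```python
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Map the original text labels onto a negative/neutral/positive scheme
label_map = {'Negative emotion': 0,
             'No emotion toward brand or product': 1,
             "I can't tell": 1,
             'Positive emotion': 2}
df['target'] = df['sentiment_label'].map(label_map)  # hypothetical column name

# One-hot encode the integer labels for the network's softmax output
y = to_categorical(df['target'], num_classes=3)

# Hold out a test set to evaluate the network's performance after training
X_train, X_test, y_train, y_test = train_test_split(
    df['tweet_text'], y, test_size=0.2, random_state=42, stratify=df['target'])
```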

Regex cleaning

Cleaning the text data is, in my opinion, made a lot easier through the use of regular expressions (regex). Although there is a bit of a learning curve when attempting non-basic tasks, regex lets you modify text documents on a character-by-character basis. I recommend playing around with it on regexr.com first, just to see how it works in a hands-on manner.

In my project I ended up using a custom function that takes in multiple regex patterns, text, and replacement strings, returning a cleaned version of the input text according to the input patterns. Here is the core function doing the cleaning and some of its work:

Function used to perform regex cleaning
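The gist isn’t embedded here, but the heart of such a function is a loop over re.sub() calls. The patterns below are illustrative assumptions, chosen so the sketch reproduces the before/after example that follows:

```python
import re

def regex_cleaner(text, patterns, replacements):
    """Apply each (pattern, replacement) pair to the text in order."""
    cleaned = text
    for pattern, replacement in zip(patterns, replacements):
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned.strip()

# Example: strip "@mention" placeholders, stray quotation marks, and extra spaces
patterns = [r'@mention', '[“”"]', r'\s{2,}']
replacements = ['', '', ' ']

raw_tweet = ('Best thing I\'ve heard in a long while actually! '
             '"I gave iPad 2 money to #Japan relief." '
             '#sxsw @mention @mention @mention')
print(regex_cleaner(raw_tweet, patterns, replacements))
```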

Before:

“Best thing I’ve heard in a long while actually! "I gave iPad 2 money to #Japan relief." #sxsw @mention @mention @mention”

After:

“Best thing I’ve heard in a long while actually! I gave iPad 2 money to #Japan relief. #sxsw”

Tokenizing

Now that some of the nonsense has been removed, the next step I took was to tokenize the text. Keras makes this really easy via the Tokenizer() class found in its “text” module. After this, it is necessary that the tweets, now represented as lists of tokens, be converted into padded sequences. A padded sequence is an ordered numerical representation of a sentence (or text) padded with zeroes to a desired length. Here’s how I did it (trust me, it’s important):
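A minimal sketch of that tokenizing and padding step (the vocabulary size and maximum sequence length are assumptions):

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer's vocabulary on the cleaned training tweets only
keras_tokenizer = Tokenizer(num_words=10000)
keras_tokenizer.fit_on_texts(X_train)

# Convert each tweet into a sequence of integer word indices
train_sequences = keras_tokenizer.texts_to_sequences(X_train)
test_sequences = keras_tokenizer.texts_to_sequences(X_test)

# Pad (or truncate) every sequence to the same length with zeroes
max_len = 50
X_train_pad = pad_sequences(train_sequences, maxlen=max_len, padding='post')
X_test_pad = pad_sequences(test_sequences, maxlen=max_len, padding='post')

print(X_train_pad[0])  # an ordered, zero-padded numerical representation of one tweet
```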

Results of code above

Class imbalances

The target class distribution for this data set was highly imbalanced, which is a problem for any type of machine learning. Essentially, the fewer examples of a given class a model has to learn from, the less likely it is to predict that class. The distribution was:

  • Negative sentiment: 6.26%
  • Neutral sentiment: 60.97%
  • Positive sentiment: 32.75%

In addition, my predicament required a different approach than I was used to when addressing class imbalance. Because the data had already been converted into sequences, using the SMOTE (Synthetic Minority Over-sampling Technique) class from imbalanced-learn (imblearn) was not an option.

The synthetic sequences generated by SMOTE would not be reverse-translatable into coherent English, since they are generated from each minority class’s attributes rather than from real tweets. Thus, I decided to use imblearn’s RandomOverSampler() class to randomly duplicate tweets in the minority classes, ensuring the “readability” of the inputs remained intact for the learning process.

Application of RandomOverSampler class
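A sketch of that resampling step, assuming the one-hot labels are temporarily collapsed back to integers (RandomOverSampler expects a 1D target):

```python
from imblearn.over_sampling import RandomOverSampler
from keras.utils import to_categorical

# Collapse the one-hot training labels into integer classes for the sampler
y_train_int = y_train.argmax(axis=1)

# Randomly duplicate minority-class tweets until the classes are even
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res_int = ros.fit_resample(X_train_pad, y_train_int)

# Restore the one-hot encoding for training the network
y_train_res = to_categorical(y_train_res_int, num_classes=3)

print(X_train_res.shape, y_train_res.shape)
```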

Now that I have a data set with an even distribution among classes, I can begin to put together the neural network that will eventually attempt to learn how to classify tweets by their sentiment!

Layering Up

When building an ANN with Keras, the first step is to instantiate the model. As with any other package, this is done by calling the class and storing it in a variable. From there, one only needs to use the .add() method to stack on as many layers as desired before finalizing the build with the .compile() method. Here is how I did it:

ANN architecture used in project
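The exact gist isn’t shown here, but an architecture along the lines described below might look like this sketch (layer sizes, the dropout rate, and the optimizer are assumptions):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GlobalAveragePooling1D, Dropout, Dense

vocab_size = 10000   # matches the Tokenizer's num_words
embedding_dim = 128  # memory-friendly embedding size

model = Sequential()

model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                    input_length=50))  # matches the padded sequence length

# LSTM returns the full sequence so it can be pooled and regularized afterwards
model.add(LSTM(64, return_sequences=True))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.3))

# Densely connected layer with another round of dropout
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.3))

# Output layer: one neuron per class with a softmax activation
model.add(Dense(3, activation='softmax'))

# Finalize the architecture
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Confirm the architecture
model.summary()
```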

I’ll do my best to explain a bit of what is going on above. In line 8, I begin by adding the embedding layer that will serve as the “space” each input sequence lives in before moving into the next layer. Choosing the values for its two parameters will depend on the task and input data, but in my experience, sticking with memory-friendly numbers (64, 128, etc.) for the embedding size helps.

Lines 10–13 detail the addition of a Long Short-Term Memory (LSTM) layer, which in theory helps the model analyze each sequence as a whole rather than part-by-part, thus increasing its understanding. I needed to apply a GlobalAveragePooling1D() layer to transform the data appropriately (more on this concept) for use in the next layers. The last line applies dropout regularization, which promotes generalization by randomly dropping 30% of the features passed on to the next layer during training.

Lines 16 and 17 show the addition of a densely connected layer (Dense()) and another application of dropout regularization. Line 20 is the final dense layer, which serves as the output layer; its activation function and number of neurons depend on the number of classes one is attempting to predict (in my case ‘softmax’ and 3 neurons).

As mentioned before, in order to finalize the ANN’s architecture, one needs to apply the .compile() method (lines 23–25). There are three main parameters:

  • Optimizer: String name or class of optimization algorithm (default: “rmsprop”)
  • Loss: The objective function the network minimizes, measuring how far its predictions are from the true labels
  • Metrics: The metric(s) the network uses to judge its performance each cycle

Line 28 is just a demonstration of how to use the .summary() method to receive a confirmation of the architecture:

Summary of model built in code above

In my experience, visually confirming the neural network architecture prior to training is a great way to potentially catch any missteps and strategize on what parameters to tweak when tuning. Regardless, I now have a compiled ANN that can start training on data!

Learning Up

Training an ANN with Keras is very similar to how one would do it with a package like scikit-learn (sklearn): using the .fit() method. However, this is where the similarities cease, as Keras’ method has a different set of parameters specific to training ANNs. Here are the ones I made use of:

  • batch_size: The number of samples the network processes before updating its weights
  • epochs: The total number of passes over the entire training data set
  • callbacks: A place to pass in callback objects, which perform specific tasks during training.

Side Note: How callbacks are implemented can vary, because one can create their own or use one from Keras directly; I will be making use of the EarlyStopping() callback specifically.

  • validation_split: The fraction of the training data set aside as validation data. This data is used to evaluate the model’s performance at the end of each epoch; it is not used to update the weights, but it can guide decisions such as when to stop training.
  • verbose: Accepted values are 0, 1, or 2, which determine the level of detail displayed during training. A value of 0 produces nothing, 1 displays a progress bar for each epoch, and 2 displays one line per epoch, including the chosen evaluation metrics and how long the epoch took.

The .fit() method will produce what the Keras documentation calls a history object. This object has the attribute .history, which will be instrumental in evaluating the model’s performance. The documentation describes it as:

“…a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).”

Using the information stored in this attribute, I am able to create a graph that is commonly used to evaluate how well the training of an ANN is going.

Fitting the data:

Training ANN and storing info from “.history”
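A sketch of that fitting call, plus pulling the curves out of .history afterwards (batch size, epoch count, patience, and validation split are assumptions; depending on the Keras version, the accuracy keys may be 'acc'/'val_acc' instead):

```python
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Halt training once the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(X_train_res, y_train_res,
                    batch_size=32,
                    epochs=30,
                    callbacks=[early_stop],
                    validation_split=0.15,
                    verbose=2)

# .history holds the per-epoch loss/metric values used for the evaluation plots
hist = history.history
plt.plot(hist['accuracy'], label='train accuracy')
plt.plot(hist['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
```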

Display during training with verbose=2:

Training display with verbose parameter set to 2

With the training finished, the model is now ready to make some predictions! A key point is that in order to make predictions, the test data, or any new data for that matter, must be in the same format as the data used for training. Luckily this was not a concern for me, since my test data comes from the same data set, but I would keep a copy of the training data around for reference if necessary.

Guesswork

Getting the model to create predictions is mandatory if we are to see how well it can classify tweets based upon their sentiment. Once again, similarly to sklearn, this can be done by putting the desired data into the model’s .predict() method.

Code to generate predictions and coerce actuals/predictions into 1D vector
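A minimal version of that step (variable names are the assumed ones carried over from the earlier sketches):

```python
# Probabilities for each of the three classes, one row per test tweet
y_pred_probs = model.predict(X_test_pad)

# Collapse probabilities and one-hot labels into 1D vectors of class indices
y_pred = y_pred_probs.argmax(axis=1)
y_true = y_test.argmax(axis=1)
```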

Once the predictions have been generated, it is fairly simple to produce a classification report and confusion matrix using functions found in sklearn’s metrics module. In this post, I will not go into great detail about the “why”, but I wrote in fair detail about both topics in a previous post if you’d like more information.

Side Note: I needed to use the .argmax() method on my predictions prior to creating the following plots. This was in order to change my vectors of probabilities over the three classes into a single vector containing the most probable class for each prediction; without this step, I would be unable to generate these plots.

A classification report serves as a quick snapshot of commonly used metrics for classification tasks.

Creating the classification report:

Code to generate classification report
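With the 1D vectors from above, the report is a single call (the class names are assumptions matching the 0/1/2 mapping):

```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred,
                            target_names=['negative', 'neutral', 'positive']))
```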

A confusion matrix is a plot that helps illustrate how well a model predicts each of the classes. Generally, it is used to analyze the relationship between the predicted labels and the true labels. In order to display it properly, I needed to make use of a custom function. I will include code for both the function and its use.

Custom function:
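My actual helper isn’t reproduced here; a bare-bones sketch of a function like it could be:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, class_names):
    """Plot a labeled confusion matrix of true vs. predicted classes."""
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(cm, cmap='Blues')
    ax.set_xticks(np.arange(len(class_names)))
    ax.set_yticks(np.arange(len(class_names)))
    ax.set_xticklabels(class_names)
    ax.set_yticklabels(class_names)
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    # Write the raw count into each cell of the matrix
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j], ha='center', va='center')
    plt.show()
```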

Using the function to generate a confusion matrix:
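Calling it with the vectors generated earlier (again, the names come from the assumed sketches above):

```python
plot_confusion_matrix(y_true, y_pred,
                      class_names=['negative', 'neutral', 'positive'])
```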

Just like that, I now have access to all the information necessary to begin tuning my model for better performance. Tuning generally refers to tweaking parameters and/or the architecture during the compilation stage of an ANN. The “Optimization in Neural Networks” section in Matthew Stewart’s blog post does a great job explaining how this process works.

Alternatively, I can just leave it be and conclude with the results I have, since I have completed the task of creating an ANN that can classify tweets by their sentiment with relative accuracy.

Trying It Out

As a personal preference, I like to functionize processes or otherwise unwieldy blocks of code so that I can consistently execute them with minimal effort. In this next section I will show how I generalized the process above into a callable function, as well as the results of the ANN created in this post (I did not discuss all of the functionality I added during this post).

Through the use of this function, I was able to train and evaluate models efficiently, leaving time and space left over for strategizing on my next tuning adjustment.
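The full helper isn’t shown in this post, but a stripped-down sketch of such a build-train-evaluate wrapper, reusing the imports and the plot_confusion_matrix helper sketched above (every default value here is an assumption), might look like:

```python
def build_train_evaluate(X_train, y_train, X_test, y_test,
                         vocab_size=10000, embedding_dim=128, lstm_units=64,
                         dropout=0.3, batch_size=32, epochs=30, patience=3):
    """Build, compile, train, and evaluate an LSTM sentiment classifier."""
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
    model.add(LSTM(lstm_units, return_sequences=True))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(dropout))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(dropout))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Train with early stopping on the validation loss
    early_stop = EarlyStopping(monitor='val_loss', patience=patience,
                               restore_best_weights=True)
    history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
                        callbacks=[early_stop], validation_split=0.15, verbose=2)

    # Evaluate on the held-out test set and report per-class metrics
    y_pred = model.predict(X_test).argmax(axis=1)
    y_true = y_test.argmax(axis=1)
    class_names = ['negative', 'neutral', 'positive']
    print(classification_report(y_true, y_pred, target_names=class_names))
    plot_confusion_matrix(y_true, y_pred, class_names=class_names)
    return model, history
```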

Results:

Training/Validation accuracy and loss over time
Confusion matrix and classification report for model

Although not the end-all, be-all of classification metrics, I felt that getting 64% accuracy on my first go at classifying text using deep learning isn’t half bad!

Holding It Down

Honestly, I do believe that I was able to squeeze most of the predictive capability out of the input data. This, however, is not to say that the performance could not be further improved. I think that with more time spent preprocessing the data and/or playing with the architecture, my model could detect sentiment with higher accuracy.

The entire project is viewable on GitHub if you would like to see how I actually did everything discussed in this post.
