Creating Visuals with NLTK’s FreqDist

Darius Fuller
5 min read · Dec 21, 2020


Photo by Raymond Pang on Unsplash

In my opinion, finding ways to create visualizations during the EDA phase of an NLP project can become time-consuming. Preprocessing text is quite different from preprocessing numerical data, and while finding numbers to plot on a graph is possible, it usually requires some engineering. Luckily, there are some tools out there to help folks like myself in this stage!

In my last post, I went over a few quick ways to generate some visuals using a bag of words (BoW). In this one I want to briefly demonstrate how one can use NLTK’s FreqDist class to further explore a BoW through the use of simple visualizations.

Getting Started

In order to create these visualizations, I had these packages installed and/or imported into my Jupyter Notebook:
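A minimal set of imports that would cover the steps in this post is sketched below. The exact list is an assumption (only NLTK and its FreqDist class are named in the text); matplotlib is included because FreqDist's plot method draws with it.

```python
# Assumed setup: NLTK for the text tools, matplotlib for the plots.
# Install first if needed:  pip install nltk matplotlib
import matplotlib.pyplot as plt
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
```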

You will also need your own BoW; mine was a set of song lyrics from a famous rapper, scraped from the Internet. I had my BoW stored in a Python dictionary, but the process I will detail applies equally to a list filled with tokens stored as strings. Here is what mine looks like:

Example 1: First 15 items in my Bag of Words
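As a stand-in for the real data, here is a small hypothetical BoW. The dictionary layout (song title mapped to a list of tokens), the titles, and the tokens themselves are all invented for illustration; the point is only to show how a dictionary can be flattened into the single token list that the remaining steps expect.

```python
from itertools import chain

# Hypothetical BoW: song title -> list of lowercase string tokens.
# (Layout and contents are assumptions, not the actual dataset.)
lyrics_by_song = {
    "song_a": ["i", "like", "money", "money", "in", "the", "club"],
    "song_b": ["im", "in", "the", "club", "like", "i", "said"],
}

# Flatten every song's tokens into one list.
tokens = list(chain.from_iterable(lyrics_by_song.values()))

print(tokens[:15])  # first 15 items (or fewer, if the list is shorter)
print(len(tokens))  # total number of tokens
```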

Creating the FreqDist

Without the NLTK package, creating a frequency distribution plot (histogram) for a BoW is possible, but it takes multiple lines of code. With the FreqDist class, we can obtain the frequency of every token in the BoW with a single line of code:

Example 2: Creation/display of FreqDist object
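A minimal sketch of this step, using a short made-up token list in place of the real lyrics. FreqDist behaves like Python's collections.Counter, so lookups and conversion to a plain dictionary work the way you would expect.

```python
from nltk import FreqDist

# Made-up tokens standing in for the real BoW.
tokens = ["i", "like", "money", "money", "in", "the", "club",
          "im", "in", "the", "club", "like", "i", "got", "money"]

# The one-liner: count every token.
fd = FreqDist(tokens)

print(fd.most_common(3))   # the three most frequent tokens with their counts
print(fd["money"])         # raw count for a single token
print(fd.freq("money"))    # share of all tokens that are "money"
print(fd.N())              # total number of tokens counted

# Convert to a plain dictionary for further manipulation.
counts = dict(fd)
```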

Just like that we have an object that can easily be converted into a Python dictionary or Pandas series for further manipulation (hint, hint). Now I can quickly check how often this rapper uses specific words with only a few more lines of code:

Example 3: Creating and plotting FreqDist
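A minimal sketch of the plotting step, again with made-up tokens. It assumes matplotlib is installed, since FreqDist's plot method relies on it; the title string is just an example.

```python
from nltk import FreqDist

# Made-up tokens standing in for the real BoW.
tokens = ["i", "like", "money", "money", "in", "the", "club",
          "im", "in", "the", "club", "like", "i", "got", "money"]

fd = FreqDist(tokens)

# Draw the counts of the most common tokens, highest first.
# The first argument caps how many tokens are drawn; with fewer than
# 20 distinct tokens, all of them appear.
fd.plot(20, title="20 most common tokens")
```

In a Jupyter Notebook the figure should render inline under the cell.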
Figure 1: FreqDist plot of 20 most common tokens

Stop Word Removal

As you can see here, this rapper likes to talk a lot about themselves: “I”, “I’m”, “my”, and “me” all appear in the top 20. In addition, there is a specific curse word that appeared just short of 400 times across the 110 songs that were sampled. This occurrence, in my opinion, points toward this specific artist’s subject matter, which tends to involve partying, money, and women.

The words in this graph do not really tell us much else about the subject matter this artist covers, but luckily NLTK has something to help. Most of the top 20 tokens can be considered stop words. A stop word is one that is commonly used because of a specific language’s grammatical or syntactic requirements. A few examples (not in Figure 1) for the English language are: “this”, “down”, “after”, “you”. The corpus module of NLTK includes a premade list for English that can easily be used to filter stop words out of any BoW. Here’s how to use it:

Example 4: Filtering stop words out of original token list
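A minimal sketch of the filtering step. The helper names load_stop_words and remove_stop_words are hypothetical, as is the small fallback set, which exists only so the sketch still runs when the NLTK stop word corpus cannot be downloaded. The tokens are made up.

```python
import nltk
from nltk.corpus import stopwords

# Tiny hand-written fallback (an assumption, used only if the NLTK corpus
# cannot be loaded or downloaded, e.g. when working offline).
FALLBACK_STOP_WORDS = {"i", "me", "my", "the", "a", "and", "in", "to", "it", "you"}


def load_stop_words():
    """Return NLTK's English stop words as a set for fast membership tests."""
    try:
        return set(stopwords.words("english"))
    except LookupError:
        pass  # corpus is not on disk yet
    if nltk.download("stopwords", quiet=True):  # one-time download
        return set(stopwords.words("english"))
    return set(FALLBACK_STOP_WORDS)


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words (tokens assumed lowercase)."""
    return [tok for tok in tokens if tok not in stop_words]


# Made-up tokens standing in for the real BoW.
tokens = ["i", "like", "money", "money", "in", "the", "club",
          "im", "in", "the", "club", "like", "i", "got", "money"]

stop_words = load_stop_words()
filtered = remove_stop_words(tokens, stop_words)

reduction = 1 - len(filtered) / len(tokens)
print(f"{len(tokens)} tokens -> {len(filtered)} tokens ({reduction:.0%} reduction)")
```

The same arithmetic applied to the counts reported for the real dataset gives (45,196 - 27,141) / 45,196 ≈ 0.399, which is where the roughly 40% figure comes from.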

After the completion of this step, my list went from containing 45,196 tokens down to 27,141 (~40% reduction!). This is the resulting graph:

Figure 2: FreqDist plot of 20 most common tokens (stop words removed)

Disclaimer: The token “im” was not supposed to be carried over in this case.

This looks a lot more informative with respect to the type of topics covered by this rapper in these songs. The fact that “like” is the most frequent word after filtering may point to the number of similes this rapper uses in their verses. Similarly, “dont” and “aint” could point to “brag raps” declaring whatever lifestyle or action this artist does not subscribe to.

N-Grams Too?

Up until this point, everything has been done with respect to a single token, but often there are patterns in text that can be teased out through the use of an “n-gram”. An n-gram is a group of n words that appear next to each other within a given string; the number of words per group is typically indicated by the corresponding Latin prefix.

Groups of two words are called bigrams, groups of three words are trigrams, etc. For example, the bigrams in the sentence, “Natural Language Processing is so fun” are: (Natural, Language), (Language, Processing), (Processing, is), (is, so), (so, fun).
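The example sentence can be checked with NLTK's ngrams helper, which takes a sequence of tokens and the group size n. This is a small sketch; splitting on whitespace is a simplification of real tokenization.

```python
from nltk.util import ngrams

sentence = "Natural Language Processing is so fun"
tokens = sentence.split()  # naive whitespace tokenization

# n=2 gives bigrams, n=3 gives trigrams, and so on.
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)
print(trigrams)
```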

NLTK provides an easy way to collect n-grams through its collocations module, and the resulting counts can be plotted just like single tokens. Each “n” has its own set of objects, which are largely interchangeable depending on your needs. For example, any place with “bigram” can generally be replaced with “trigram” and still function correctly. Here’s how to create them:

Example 5: Creating bigrams, then plotting them according to frequency
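One way this step could look is sketched below. It assumes the bigrams are collected with BigramCollocationFinder and that each pair is joined with an underscore to produce labels such as “make_rain”; the joining step and the made-up tokens are assumptions for illustration.

```python
from nltk import FreqDist
from nltk.collocations import BigramCollocationFinder

# Made-up tokens standing in for the stop-word-filtered BoW.
tokens = ["make", "rain", "ten", "ten", "make", "rain",
          "like", "money", "make", "rain"]

# Collect every pair of adjacent tokens.
finder = BigramCollocationFinder.from_words(tokens)

# finder.ngram_fd is a FreqDist keyed by (word1, word2) tuples.
# Join each pair with "_" so the plot gets one readable label per bigram.
bigram_fd = FreqDist(
    {"_".join(pair): count for pair, count in finder.ngram_fd.items()}
)

print(bigram_fd.most_common(3))

# Plot the most frequent bigrams (requires matplotlib).
bigram_fd.plot(20, title="20 most common bigrams")
```

Swapping in TrigramCollocationFinder should give the same workflow for groups of three words.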

And the results:

Figure 3: FreqDist plot of bigrams, stop words removed

In Figure 3, you can see that the first two pairs may each actually represent one word (something that can be investigated and fixed during preprocessing), but the other bigrams do indicate a bit about the songs included in the sample. Money is a topic (“hunnids_hunnids”, “ten_ten”, “make_rain”), with reference to throwing it in a club, i.e. making it rain. Also, the “b****, im” bigram is the beginning of a phrase commonly used by this rapper and half of the title of a song in the sample, which is probably why it ranks so high here.

Wrapping Up

NLTK’s FreqDist class is a huge timesaver when it comes to analyzing the distribution of tokens within text. Using it appropriately can save, in my estimation, at least 20 lines of code with each implementation. In addition, its flexibility with other forms of carefully preprocessed text (such as bigrams) can help to provide more information about the underlying patterns or sentiments in a given text.

I am certain there are more applications of this class, but these are two quick and easy ways I have found to peer into text data without much fuss. There are a lot more parameters to play around with that I may cover in a future post as I continue to work on my text generation project using this dataset, but I feel this is a good starting point for those who may be at a loss for what to do with a dataset of text.
